RATIONALITY
AND SOCIAL RESPONSIBILITY
Modern Pioneers in Psychological Science: An APS–Psychology Press Series

This series celebrates the careers and contributions of a generation of pioneers in psychological science. Based on the proceedings of day-long Festschrift events at the annual meeting of the Association for Psychological Science, each volume commemorates the research and life of an exceptionally influential scientist. These books document the professional and personal milestones that have shaped the frontiers of progress across a variety of areas, from theoretical discoveries to innovative applications and from experimental psychology to clinical research. The unifying element among the individuals and books in this series is a commitment to science as the key to understanding and improving the human condition.
PUBLISHED TITLES
1. Psychological Clinical Science: Papers in Honor of Richard M. McFall, edited by Teresa A. Treat, Richard R. Bootzin, and Timothy B. Baker (2007).
2. Rationality and Social Responsibility: Essays in Honor of Robyn Mason Dawes, edited by Joachim I. Krueger (2008).
RATIONALITY
AND SOCIAL RESPONSIBILITY
Essays in Honor of Robyn Mason Dawes
Edited by
Joachim I. Krueger
Association for Psychological Science
Psychology Press Taylor & Francis Group 270 Madison Avenue New York, NY 10016
Psychology Press Taylor & Francis Group 27 Church Road Hove, East Sussex BN3 2FA
© 2008 by Taylor & Francis Group, LLC
Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1
International Standard Book Number-13: 978-0-8058-5996-6 (Hardcover)

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data

Rationality and social responsibility : essays in honor of Robyn Mason Dawes / editor, Joachim I. Krueger.
p. cm. -- (Modern pioneers in psychological science) (An APS-LEA series)
Includes bibliographical references.
ISBN 978-0-8058-5996-6 (alk. paper)
1. Reasoning (Psychology) 2. Thought and thinking. 3. Decision making. 4. Responsibility. I. Dawes, Robyn M., 1936- II. Krueger, Joachim I.

BF442.R37 2008
153.4--dc22
2007045940
Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the Psychology Press Web site at http://www.psypress.com
Contents

Acknowledgments  vii

Chapter 1  A Psychologist between Logos and Ethos  1
    Joachim I. Krueger

Chapter 2  The Gambler’s Fallacy and the Coin’s Memory  21
    Eric Gold and Gordon Hester

Chapter 3  Being an Advocate for Linear Models of Judgment Is Not an Easy Life  47
    Hal R. Arkes

Chapter 4  What Makes Improper Linear Models Tick?  71
    Jason Dana

Chapter 5  Why Meta-Science Should Be Irresistible to Decision Researchers  91
    David Faust

Chapter 6  The Robust Beauty of Simple Associations  111
    Joachim I. Krueger

Chapter 7  When It Is Rational for the Majority to Believe That They Are Better Than Average  141
    Don Moore and Deborah Small

Chapter 8  Wishful Thinking in Predicting World Cup Results  175
    Maya Bar-Hillel, David V. Budescu, and Moty Amar

Chapter 9  How Expectations Affect Behavior  187
    Cristina Bicchieri

Chapter 10  Depersonalized Trust and Ingroup Cooperation  215
    Marilynn B. Brewer

Chapter 11  Must Good Guys Finish Last?  233
    David Messick

Chapter 12  Women’s Beliefs about Breast Cancer Risk Factors: A Mental Models Approach  245
    Stephanie J. Byram, Lisa M. Schwartz, Steven Woloshin, and Baruch Fischhoff

Chapter 13  Groups and the Evolution of Good Stories and Good Choices  275
    Linnda R. Caporael

Appendix 1  The Robust Beauty of Improper Linear Models in Decision Making  321
    Robyn M. Dawes

Appendix 2  Behavior, Communication, and Assumptions about Other People’s Behavior in a Commons Dilemma Situation  345
    Robyn M. Dawes, Jeanne McTavish, and Harriet Shaklee

Appendix 3  Robyn M. Dawes’s Festschrift Remarks  363

Appendix 4  Robyn M. Dawes’s Biography and Selected Work  365

Contributors  371

Subject Index  377

Author Index  383
Acknowledgments

Robyn M. Dawes was honored in a day-long symposium on May 29, 2005, as part of the 17th Annual Convention of the Association for Psychological Science (APS) in Los Angeles, California. Most of the chapters collected in this volume are records of contributions to this event. Other chapters were solicited after the symposium. The production of this Festschrift was supported by grants from the Association for Psychological Science, the Society for Judgment and Decision Making, Carnegie Mellon University (College of Arts and Sciences and Department of Social and Decision Sciences), and Lawrence Erlbaum Associates. We gratefully acknowledge their support.
1
A Psychologist Between Logos and Ethos

Joachim I. Krueger
Brown University
At the 17th Annual Convention of the Association for Psychological Science, Los Angeles, May 29, 2005, Robyn Mason Dawes was honored with a Festschrift conference. Following a convivial banquet on the eve of the Festschrift symposium, several of Robyn’s friends and colleagues presented papers on recent research that was, in one way or another, indebted to his intellectual inspiration over the years. This volume is composed of chapter-length reports of most of these research enterprises as well as some chapters authored by individuals who did not speak at the symposium. With characteristic modesty, Robyn requested that the contributors present their finest work and refrain from emphasizing his contributions to these efforts. It falls to this introductory chapter to provide the overarching context for these lines of research and a sense of Robyn’s abiding influence.

It is fitting to begin with a biographical note. Robyn entered the graduate program in clinical psychology at the University of Michigan in 1958. His stint in that program included some experiences that shaped his outlook on psychological science for years to come. Robyn noticed that clinical psychology was caught between the two frames of being seen as an art and a science. In his view, the latter was neglected. He found that scientific advances showing the limits of clinical diagnosis were being ignored. In one poignant instance, Robyn was asked to administer a Rorschach test to a 16-year-old girl who had been admitted because of sexual relations with an older man, which had led to
strained relations with her family. After analyzing the Rorschach protocol, Robyn found that just one of her responses to any of the Rorschach cards was “abnormal” (i.e., “poor form”). At a case conference, he argued that this low rate of nonstandard responding did not warrant a diagnosis of schizophrenia. In essence, his argument was that her performance was no different from what would be expected from base rate responding in the general population. If anything, her responses were more “reality oriented.” Robyn was overruled by experienced clinicians, who rejected his statistical argument. Mr. Dawes, he was told, you may understand mathematics, but you do not understand psychotic people (“Just look at the card. Does it look like a bear to you? She must have been hallucinating.”). With that, Robyn left the program and turned his attention to mathematical psychology. In due course, his collaboration with his mentor Clyde Coombs and his cohort Amos Tversky led to the publication of the classic monograph of this field (Coombs, Dawes, & Tversky, 1970).

Much of his own empirical work was henceforth dedicated to the detection of psychological distortions in human judgment (Dawes, 2001). From the outset, he did not single out clinical experts as victims of irrational thought but tried to understand general principles of mind that yield systematic biases in judgment. With this orientation, he placed himself in the tradition of Jerome Bruner and his colleagues, who had recently laid the groundwork for the cognitive revolution in psychology. Reaching further back still, the influence of Bartlett’s (1932) work on the role of narratives in remembering turned out to be relevant. To Robyn, human reasoning occurs in both a narrative and a scientific mode. Narratives provide meaning, help remembering, and simplify otherwise complex quantitative input, yet these same narratives can systematically interfere with logic or statistical rationale.

In his dissertation, Robyn presented his participants with declarative sentences representing a suite of set relations (Dawes, 1964, 1966). Some relations were nested, such as when all X were Y (inclusion) or when no X were Y (exclusion). Other relations were disjunctive, such as when some X were Y. Immediately after presentation, Robyn probed the participants’ recollections of these relations. As he had suspected, most participants were more likely to remember disjunctive relations as being nested than to remember nested relations as being disjunctive (and were more confident in their recall when it involved this error than when it was correct). In other words, he had not only identified systematic memory distortions, but he had correctly predicted which type of distortion was more likely to occur. From a narrative point of view, these distortions were benign in that they yielded mental representations that were
simpler than the reality they meant to portray. From a scientific or paradigmatic point of view, however, these distortions were worrisome because they could lead to incoherent judgments and ultimately harmful action. As Robyn cautioned, “we may think of these people as one-bit minds living in a two-bit world” (Dawes, 1964, p. 457).

Misremembering disjunctive relations as being nested is an error of overgeneralization, a kind of error Robyn had already seen in action. Giving an “abnormal” response to a particular Rorschach card and being psychotic are instances of disjunctive sets. To say that everyone who gives this response is psychotic is to claim that the former set is nested within the latter. In time, Robyn approached these types of judgment task from a Bayesian point of view. Harkening back to his earlier experience with clinical case conferences, it could be said that the clinicians began with the idea that there was a high probability of finding evidence for an offending response if the person was ill (i.e., p[E|I]). Their diagnostic judgment, however, was the inverse of this conditional probability, namely, the probability that the person was ill given that one abnormal response was in evidence (i.e., p[I|E]). Meehl and Rosen (1955) had shown that base rate neglect pervades human judgment in general and clinical judgment in particular. Inasmuch as the base rate of making the response, p(E), is higher than the base rate of being ill, p(I), the probability of illness given the evidence from testing can be very low indeed.

Along with his friends Amos Tversky and Daniel Kahneman, Robyn stimulated the growth of the psychology of judgment and decision making. Whereas Tversky and Kahneman documented the base rate fallacy in a series of widely noted experiments (Tversky & Kahneman, 1974), Robyn discussed numerous examples of non-Bayesian thinking in his writing. His 1988 book, Rational Choice in an Uncertain World (see also Hastie & Dawes, 2001, for a revised 2nd edition; a 3rd edition is in preparation), is a classic exposition of common impediments to rationality. His critique of clinical psychology and psychotherapy (House of Cards [Dawes, 1994]) is the authoritative application of the principles of judgment and decision making to that field.

Robyn’s empirical work yielded a report in which he and his collaborators presented a rigorous correlational test of the overgeneralization hypothesis (Dawes, Mirels, Gold, & Donahue, 1993). Unlike Tversky and Kahneman, who provided their participants with base rate information, Robyn and his colleagues asked their participants to estimate all four constituent Bayesian probabilities themselves. This within-person method yielded strong evidence for overgeneralization. When estimating p(I|E), participants neglected the base rates of p(E) and p(I) that they themselves had estimated.
Instead, they appeared to derive p(I|E) directly from p(E|I). Bayes’s theorem states that the two inverse conditional probabilities are the same only when the two base rates are the same (which they were not for most participants). In Robyn’s analysis, people assumed a symmetry of association that is rare in the real world. Associations between variables are often represented as correlations. One may want to predict, for example, illness versus health from positive versus negative test results, yet a perfect correlation can only be obtained when the two base rates are the same. Overgeneralization becomes more likely inasmuch as the base rate of the diagnostic sign is larger than the base rate of the underlying illness.

It is worth considering a numerical example. Suppose a certain test response is rather common (p[E] = .8), whereas the underlying illness is rare (p[I] = .3). The ceiling for the correlation between the two is .327. Even if everyone who is ill emits the response (i.e., p[E|I] = 1), the inverse probability, p(I|E), is only .375. The former probability expresses the sensitivity of the test, or how typical the response is of the underlying disposition. The inverse probability, however, also depends on the ratio of the base rates. According to Bayes’s theorem, p(I|E) = p(E|I)p(I)/p(E), where p(E) = p(I)p(E|I) + p(−I)p(E|−I). In other words, p(E) depends greatly on the probability of the evidence given the absence of the illness (which, in turn, is the complement of the test’s specificity, or 1 − p[−E|−I]). This is what people neglect to take into account and what leads them to confuse pseudo-diagnosticity with true diagnosticity.

As the predictive validity of a test can be expressed as the correlation between test results and actual health status, this Bayesian analysis shows that the ceiling of validity coefficients becomes lower as the two base rates become more discrepant. In Figure 1.1 the maximum correlations are plotted against the ratio of the base rates. Data for this illustration were constructed by letting p(I) range from .05 to .95 and letting p(E) range from .05 to p(I). Across the simulated instances, the correlation between the base rate ratio and the maximum statistical association is −.65. When the natural logarithms of the ratios are used to reduce the biasing effect of nonlinearity, this correlation is more strongly negative still (r = −.81). The irony of this result is that a mere increase in positive test responses may be seen as an improvement of the test when in fact its diagnostic value drops even when the test’s sensitivity remains the same.
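The arithmetic of the numerical example above is easy to check. The following sketch is not part of the original chapter; it assumes only Bayes’s theorem and the standard psychometric formula for the maximum phi correlation between two binary variables with unequal base rates:

```python
import math

p_I = 0.3          # base rate of the illness
p_E = 0.8          # base rate of the test response (the evidence)
p_E_given_I = 1.0  # sensitivity: everyone who is ill emits the response

# Bayes's theorem: p(I|E) = p(E|I) p(I) / p(E)
p_I_given_E = p_E_given_I * p_I / p_E
print(round(p_I_given_E, 3))  # 0.375

# Ceiling on the phi correlation between two binary variables with these
# base rates, attained when the rarer category is fully nested within the
# more common one: sqrt[p(I)(1 - p(E)) / (p(E)(1 - p(I)))]
phi_max = math.sqrt(p_I * (1 - p_E) / (p_E * (1 - p_I)))
print(round(phi_max, 3))  # 0.327
```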
[Figure 1.1. The maximum correlation between I and E, phi(I, E), plotted against the ratio of the two base rates, p(I)/p(E).]

This example can be extended to current research on prejudice, with the variables I and E, respectively, denoting implicit and explicit attitudes. Research with the implicit association test (IAT; Greenwald, McGhee, & Schwartz, 1998) routinely shows large effect sizes for implicit bias and
low to moderate correlations between I and E. As high values of p(I) are hailed as important evidence for how many people are really prejudiced, it is overlooked that the same finding reduces the predictive power of the test. In clinical assessment the reliance on pseudo-diagnosticity is particularly fraught with danger because a person’s true status is difficult to ascertain. Ultimately, criterion judgments about underlying conditions are also clinically derived. Such judgments are notoriously elastic, as they allow, for example, speculative assessments of a person’s latent (i.e., not overtly expressed) pathological disposition.

To Robyn, social responsibility demands that people, and presumed experts, be disabused of associationist thinking. Ultimately, his outlook is optimistic. He defines rational thought as the avoidance of contradictory beliefs. A set of beliefs or inferences is rational if it is coherent, and to ensure coherence, it is necessary to make comparisons. In Bayesian language, drawing inferences from the ratio’s numerator alone is associationist and bound to be incoherent. In contrast, drawing inferences from comparisons avoids contradiction. Often, associationist thinking is considered primitive and automatic, whereas comparative thinking is controlled, effortful, and resource-consuming. Robyn generally agrees with this view, but he suggests his own characteristic metaphor. As a young man, he taught children in a summer camp to swim. He noticed that the kids instinctively tried to keep their heads above water, which moved their bodies into a vertical position, thereby making drowning more likely. To help them overcome their fear, he had to teach them to keep their faces beneath the waterline and come up for air only periodically and rhythmically. As all swimmers know, this new attitude
quickly becomes second nature, and so it can be with rational thinking, according to Robyn.

Of course, proper comparisons require the presence of all the necessary information. Sometimes people do not have access to that information but act as if they do. To illustrate, Robyn extended Tversky and Kahneman’s notion of the availability heuristic to incorporate what he termed a “structural availability bias” (Dawes, 2005). Again, a clinical example is instructive. Robyn observed that many clinicians claim that “child abusers never quit on their own without therapy.” All that clinicians know, however, is the probability with which child abusers stop given that they are in therapy. The probability with which abusers stop without being in therapy is out of view. When the proper input for rational judgment cannot be obtained, Robyn advised suspending judgment, an insight that is as indicative of his genius as it is counterintuitive.

Another inspiration Robyn drew from Tversky and Kahneman’s work is the gambler’s fallacy. In Rational Choice in an Uncertain World (Dawes, 1988) Robyn illustrated this bit of irrationality with a letter to Dear Abby, in which the writer insisted that the law of averages [sic] demanded that a baby boy was due after five consecutive baby girls. Whereas the gambler’s fallacy is easily explained statistically and equally easily demonstrated empirically, the field of psychology has long lacked a compelling account of the mechanisms that produce it. Gold and Hester (chapter 2) present a series of experiments, asking just what type of animistic reasoning is involved. The answers lie in the boundary conditions of the fallacy. It disappears, for example, when a coin is exchanged for a new one before the critical trial or if a longer interval between tosses is introduced. The memory of the coin is as weak as its muscle. Suppose Abby’s correspondent had wanted another girl. Would she have boosted her confidence in getting her wish by looking for a new father or by imposing a few months of sexual abstinence?

One important line of Robyn’s work has been dedicated to attitude measurement, culminating in a book (Dawes, 1972) and in a chapter (Dawes & Smith, 1985) in the third edition of the Handbook of Social Psychology. This chapter contains a surprising insight about the century-old tug of war over the predictive value of social attitudes. LaPiere (1934) published a landmark study suggesting that attitudes and behaviors are unrelated. LaPiere found that a Chinese couple touring California was virtually never turned away at hotels or restaurants, yet virtually all clerks at these establishments asserted over the phone that they would refuse service to “members of the Chinese race.” Dawes and Smith noted that a correlation coefficient is not defined when three of the four cells are (virtually) empty. In other words, judgments about
how well attitudes predict behavior had to be suspended in the context of LaPiere’s study (which virtually no attitude researcher had done). In hindsight, LaPiere’s findings can be conceptualized within the Bayesian framework discussed above. Namely, the upper bound of the attitude-behavior correlation is low inasmuch as the ratio of base rates (for a particular attitude and for a behavior corresponding to that attitude) is large.

Robyn’s interest in measurement converged with his long-standing concern about overgeneralization when he turned his attention to the well-known contrast effect in attitudinal judgment. Attitudinal contrasts mean that people overestimate the difference between typical attitudes held by people with known disagreements. Judging existing methods to be unsatisfactory, Dawes, Singer, and Lemons (1972) devised a paradigm in which members of two attitudinal camps (with regard to the Vietnam War) had to write statements they considered typical of members of the opposing camp. As expected, doves wrote statements for hawks that hawks themselves rejected as too extreme and vice versa. The origins of this idea in Robyn’s work on set relations are evident. Participants treated the two attitudinal sets as being exclusive, when in fact they were disjunctive.

When rational thought cannot be coaxed to the surface by able tutoring and dedicated practice, people need not give up. Recognizing how difficult it is for the human mind to avoid random judgmental errors, Robyn turned his attention to comparisons of clinical judgment with actuarial decision methods. In a survey of the literature, Meehl (1954) had found that actuarial decision making beats intuitive judgment when valid correlations between predictors and outcomes are known. Scores on predictor variables can be clerically combined, thereby guaranteeing freedom from random error and the haphazard intrusion of nondiagnostic variables. Robyn showed that unit weights, or even random weights, not only do better than holistic intuitive judgments, but that they are also more robust than optimal regression weights (Dawes, 1979; see Appendix 1, this volume). Optimal weights tend to overfit the data of the validation sample and thus lose predictive power in a new sample. To avoid the pitfalls of both holistic judgment and statistical overfitting, one only needs to “know what variables to look at and then know how to add” (Dawes & Corrigan, 1974, p. 105). In a striking demonstration, Howard and Dawes (1976) predicted couples’ marital happiness from the frequency of sexual intercourse (broadly defined) and the frequency of arguments and fights. The difference between the two was a better predictor than each frequency alone because the two were positively correlated in their particular samples.
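The robustness of unit weights is easy to demonstrate in simulation. The sketch below is not from the chapter; it assumes, for simplicity, a criterion to which five standardized predictors contribute equally, so that any advantage of fitted weights over unit weights in a new sample can only reflect overfitting:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, k=5, noise=2.0):
    """Criterion = sum of k standardized predictors plus noise (equal true weights)."""
    X = rng.standard_normal((n, k))
    y = X.sum(axis=1) + noise * rng.standard_normal(n)
    return X, y

X_train, y_train = simulate(50)        # small validation sample
X_test, y_test = simulate(10_000)      # large cross-validation sample

# "Optimal" least-squares weights, estimated in the small sample
beta = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

def r(pred, crit):
    return round(np.corrcoef(pred, crit)[0, 1], 3)

print("fitted weights, cross-validated:", r(X_test @ beta, y_test))
print("unit weights, cross-validated:  ", r(X_test.sum(axis=1), y_test))
# With small samples and noisy criteria, the unit-weight composite
# typically cross-validates as well as or better than the fitted
# weights, in line with Dawes (1979).
```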
Following up on Robyn’s suggestion that federal granting agencies use linear composites of specific judgments to evaluate the merit of grant proposals, Arkes (chapter 3) found that disaggregated judgments are more reliable than the holistic judgments that are usually made. Arkes’s tale illuminates both how “improper linear models” can be put to good use in practice and the kind of resistance such a rational procedure may face, even among sophisticated scientists. Dana (chapter 4) presents a much needed, in-depth analysis of the conditions under which improper linear models work. He finds that such models work best when predictor-outcome associations are weak, which is to say they work particularly well in the social sciences, where the temptation to rely on intuitive appraisals is the greatest.

Faust (chapter 5) takes this approach further by applying it to judgments about the quality of scientific theories. Drawing on philosophy of science, Faust notes the existence of multiple reasonable criteria by which theories can be appraised (e.g., parsimony, falsifiability, generativity). Despite claims to the contrary, no single criterion trumps all others. Even Popper’s criterion of disconfirmability is no gold standard because Popper’s theory itself might be disconfirmed, and then what? Faust’s theme connects with Robyn’s insistence that comparisons be made (e.g., it is not enough to enumerate instances in which parsimonious theories do well); it also connects with the idea that simple combinations of criteria can go a long way toward reliable and valid appraisals. Nevertheless, Faust goes beyond the linear-combination model by suggesting that, at the present state of our knowledge, nonlinear configurations of predictor scores might yield incremental validity.

Krueger (chapter 6) addresses the issue of incremental validity in a variety of social judgment tasks. Some current models of social judgment (e.g., for self-perception or stereotyping) use predictors that empirically do not contribute independently to outcome judgments. For example, self-other comparisons are well predicted by absolute self-judgments, whereas absolute judgments of others contribute little. This finding reflects two of Robyn’s themes. Namely, to be rational, people should base self-other comparisons on both types of absolute judgment. When they fail to do that, however, scientific modeling of the intuitive judgment process should only include the valid predictors.

The regression-analytic approach sheds light on Bayesian probability judgments. It is evident from Bayes’s theorem that across multiple judgment contexts, p(I|E) increases with p(I) and with p(E|I) and that it decreases with p(E|−I). For a sample of judgment contexts, optimal regression weights can be calculated, but they are of limited use in other contexts. To obtain decontextualized, and thus nonoptimal, weights,
one can begin with the assumption that the distributions of the three predictor variables are flat (i.e., one assumes uniform priors). In a simulation with predictor values ranging from .1 to .9 in steps of .2, p(I|E) is correlated .81, .37, and −.37, respectively, with p(I), p(E|I), and p(E|−I). The criterion base rate is a better predictor than the probability of the evidence conditioned on the criterion. What is more, the correlation between the criterion base rate and the prediction is even larger than the correlation between p(I|E) and the likelihood ratio, p(E|I)/p(E|−I), which is .46. Even the natural logarithm of the ratio does not predict p(I|E) as well (r = .55) as the base rate does.
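This simulation is compact enough to restate in code. The sketch below is not from the chapter but implements the design as described: all 125 combinations of the three predictors over the values .1, .3, .5, .7, and .9, with the posterior computed from Bayes’s theorem:

```python
import numpy as np
from itertools import product

vals = [0.1, 0.3, 0.5, 0.7, 0.9]
grid = np.array(list(product(vals, repeat=3)))  # columns: p(I), p(E|I), p(E|-I)
p_i, p_e_i, p_e_ni = grid.T

p_e = p_i * p_e_i + (1 - p_i) * p_e_ni          # total probability of the evidence
posterior = p_e_i * p_i / p_e                    # Bayes: p(I|E)

def r(x):
    return round(np.corrcoef(x, posterior)[0, 1], 2)

# The chapter reports r = .81, .37, and -.37 for the three predictors,
# and .46 and .55 for the likelihood ratio and its natural logarithm.
print(r(p_i), r(p_e_i), r(p_e_ni))
print(r(p_e_i / p_e_ni), r(np.log(p_e_i / p_e_ni)))
```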
The predictive value of Bayesian probabilities across contexts may explain why Robyn never joined the call for an abandonment of null hypothesis significance testing (NHST). In NHST, p(E|I) is the p value that researchers use to reject or retain a null hypothesis. In a mature area of research, credible estimates of p(I) and p(E|−I) are available and should be used to estimate the probability of the hypothesis given the evidence. In many areas of “soft” psychology, however, such estimates are often lacking, and p(E|I) is all the researchers have (Krueger, 2001). Thus, it is defensible to assume uniform priors for p(I) and p(E|−I), and to base inferences on the predictive value of p(E|I). Applying his “improper” rule, Robyn noted that NHST at least tells researchers that there is “not nothing” and the directionality thereof (Dawes, 1991).

In his advocacy of rational inference, Robyn has held clinicians, academic research scientists, and ordinary people to the same standards. Putative demonstrations of biased judgment must themselves satisfy the demands of coherence. The most famous of Robyn’s debunkings of putative biases was his finding that the vaunted “false consensus effect” is not necessarily false. Using both a Bayesian (Dawes, 1989) and a regression-analytic approach (Dawes, 1990), he showed that people are well advised to use their own responses to predict the responses of others inasmuch as they are uncertain about what others will actually do. This heuristic reasoning guarantees that errors will be made, but it minimizes the errors relative to what would happen if people merely guessed (note the comparative nature of this argument). This pivotal insight led to a revival of research on social projection (e.g., Krueger, 1998, for a review), including an empirical article by Dawes and Mulford (1996), which also questioned whether the so-called overconfidence bias has been demonstrated satisfactorily. With characteristic candor, Dawes and Mulford (p. 201) suggested that the “belief in these particular systematic limitations of judgment arises not from the irrationality of experimental subjects who allegedly demonstrate their
existence, but from the cognitive limitations of the psychologists studying these subjects.”

Moore and Small (chapter 7) open the case against the entrenched view that it is logically impossible for most people to be better than average. They show that most people can be, under conditions that are often met in the real social world. Using Bayesian and regression-analytic tools, Moore and Small present a theoretical model that can reproduce both better-than-average and worse-than-average effects. Their empirical data support their analysis, although they allow for motivated distortions in addition to rational inference. Bar-Hillel, Budescu, and Amar (chapter 8) pursue the better-than-average phenomenon at the group level. Noting that judgmental optimism is expressed by the overestimation of desirable outcomes, they examine whether people are more likely to predict that a sports team will win if that outcome results in their receiving a payment. In two studies they find small effects, leading them to conclude that the wishful thinking effect remains elusive.

Robyn’s interest in the consensus bias originated in his work on social dilemmas. Dawes, McTavish, and Shaklee (1977; see Appendix 2, this volume) found that cooperators in a prisoner’s dilemma expected other players to cooperate, whereas defectors expected others to defect. The prisoner’s dilemma is the kind of situation that maximizes a person’s uncertainty regarding the behavior of others. At the limit, ignorance is complete, meaning that the probabilities of all possible rates of cooperation are uniform. Laplace’s rule of succession says that after a person’s own choice between cooperation and defection is in evidence, the posterior probability of the other’s choice being the same is 2/3 (see Dawes, 1989, for a derivation).
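The 2/3 result follows directly from treating the uniform prior as a Beta(1, 1) distribution over the rate of cooperation; observing one’s own cooperative choice updates it to Beta(2, 1), whose mean is 2/3. A minimal sketch of the general rule (not from the chapter):

```python
from fractions import Fraction

def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Laplace's rule: posterior probability of a success on the next trial,
    starting from a uniform (Beta(1, 1)) prior over the success rate."""
    return Fraction(successes + 1, trials + 2)

# One observation (one's own cooperative choice) yields 2/3.
print(rule_of_succession(1, 1))  # 2/3
```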
Recent evidence suggests that this is why some people cooperate in the first place, namely, because they realize that making the cooperative choice enables them to expect a higher value from the game than does making a defecting choice (Krueger & Acevedo, 2007). Robyn is not altogether comfortable with this conclusion. He argues instead that people draw on preexisting inclinations to either cooperate or to defect and then project their own choice rationally. In a series of studies conducted with his friend John Orbell at the University of Oregon, Robyn proposed and refined a “cognitive miser model,” which reveals the social benefits of projection in games that allow players to withdraw (Orbell & Dawes, 1991). Because both cooperators and defectors expect others to be like them, defectors anticipate the dreary Nash equilibrium of mutual defection. When offered the option to withdraw, they take it. Cooperators also project but anticipate mutual cooperation. They are more likely to stay in the game and, hence, a greater total
amount of resources is earned than would have been if everyone had been forced to play.

In n-person social dilemmas, Robyn distinguished between give-some and take-some games (Dawes, 1980). Give-some games are exemplified by public goods dilemmas, such as donating to public broadcasting; take-some games have decisions to pollute as a prominent real-world analog. Empirically, many people cooperate, and the question is why. As Dawes and Messick (2000, p. 111) put it, the puzzle is that “by doing what seems individually reasonable and rational, people end up doing less well than they would have done if they had acted unreasonably or irrationally.” Here is a genuine dilemma for Dawes himself, as individual rationality comes face to face with social responsibility. A single-minded pursuit of individual rationality spells disaster for the group. What to do? Merely redefining cooperation as being “collectively rational” skirts the issue (and abandons methodological individualism). Robyn and colleagues found that communication among group members is of great value. When talking to one another, participants can construct a shared understanding of the situation they are in, form an inclusive social identity, and exchange promises of cooperation (Orbell, van de Kragt, & Dawes, 1988). These social events work. The finding that promises are honored is surprising from a narrow view of individual rationality because after listening to the promises of others, each individual could conclude that the prospect of getting a free ride is now particularly alluring.

Bicchieri (chapter 9) offers one answer for why promises work. She notes that much collectively desirable behavior depends on the activation of social norms. Making promises, reciprocating the promises of others, and honoring such promises are ethically mandated, yet social interactions remain critical. It is not sufficient to assume, Bicchieri argues, that people act in the collective interest out of social preferences (e.g., benevolence or inequality aversion). However, this is what many current reformulations of economic theories of rationality do (Fehr & Schmidt, 1999). Norms depend on the expectations individuals have about the behavior of others, whereas social preferences are decontextualized and consequentialist. People who honor norms expect others to do the same (i.e., they project) and, thus, care about how final collective outcomes come about.

Brewer (chapter 10) takes up the question of how cooperation can emerge even in the absence of communication. She highlights the role of social identification with a group, which experiments have shown to occur even in minimal laboratory groups (Tajfel, Billig,
Bundy, & Flament, 1971). According to self-categorization theory (Turner, Hogg, Oakes, Reicher, & Wetherell, 1987), the experience of being in a group entails a sense of depersonalization, in which individuals regard themselves as being interchangeable with other group members. It follows that, to the extent that they trust themselves, they can also trust others (and hence cooperate virtually with themselves). Brewer reviews the evidence for several mechanisms that may account for the emergence of depersonalized trust. Critically, she notes that the presence of an outgroup is not necessary. Without an outgroup, meta-contrasts between groups, as postulated by self-categorization theory, are not possible. Likewise, the comparative (here, intergroup) judgments Robyn regards as critical for rational inference do not appear necessary either. Then again, Robyn also asked that only valid predictors be used in judgment.

Messick (chapter 11) focuses on the question of how cooperation can emerge and stabilize over repeated rounds of social dilemma games. When it can be shown that “good guys” can win, it follows that good groups (or firms) can win, too. Arguably, social dilemmas played out at the group or corporate level involve higher stakes for the collective good than do games among individuals, yet the dynamics are much the same from an analytical point of view. Messick’s notion of “corporate social responsibility” is a direct challenge to the individual (i.e., selfish) rationality of Friedman’s stripe. The outcome of Messick’s review is textured. He shows that although good guys are likely to do poorly locally (i.e., are vulnerable to exploitation by neighbors), they do well globally (i.e., attract more wealth than do bad guys who are surrounded by bad neighbors). Although this finding may follow analytically from the definition of social dilemmas, Messick’s data show how mutual cooperation can attain equilibrium status. Hence, his finding has important implications for the evolution of altruism in many biological and social systems.

Many of Robyn’s contributions revolve around the idea that good science should be put to beneficial use. The pursuit of good evidence is not l’art pour l’art, but interwoven with social, clinical, and personal needs. Arguably, the social and behavioral sciences are lagging behind medical science in putting a premium on evidence-based practice. Byram, Schwartz, Woloshin, and Fischhoff (chapter 12) provide an example by applying a mental-models approach to expert and lay reasoning about breast cancer. Their methods enable them to identify shared and idiosyncratic misconceptions about risk and thereby empower ordinary people to make better decisions.
Human rationality is a protean concept, with no universally accepted definition. In his early work, Robyn emphasized differences between belief and evidence, writing that a cognitive distortion occurs when an individual “maintains a belief despite exposure to information contradicting that belief” (Dawes, 1964, p. 444). He also suggested that the difference between normal and distorted reasoning may be fluid rather than categorical, so that “distortion can be viewed as the result of a normal cognitive process running amok” (p. 458). Later, he rendered rationality more restrictively in terms of coherence. In his chapter on behavioral decision making, written for the fourth edition of the Handbook of Social Psychology, he defined rationality “not in terms of selfishness and goals, or even in terms of the likelihood of means to achieve these goals, but in terms of avoiding outright contradictions in the policies or thought processes leading to choice” (Dawes, 1998, p. 497). The linkage between beliefs and evidence is now only indirect, but we know, Robyn argues, that if a set of beliefs is incoherent, not all of them can be true. Likewise, the notion of self-interest, although removed from the definition, remains indirectly relevant. Superficially, one could refuse to worry if one’s own beliefs violate Bayes’s theorem. However, as De Finetti (1931) showed, probabilistic beliefs can be translated into a series of bets. If the beliefs are incoherent, a canny opponent can make a Dutch book that guarantees that gambles will be lost.
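A two-outcome illustration of the Dutch book argument (a constructed example, not the chapter’s own): suppose a bettor incoherently prices bets on each of two mutually exclusive, exhaustive outcomes at .60, so the prices sum to 1.20. A bookmaker who sells a one-dollar ticket on each outcome at those prices locks in a sure profit:

```python
# Bettor's (incoherent) fair prices for two exhaustive, exclusive outcomes
prices = {"A": 0.60, "B": 0.60}

cost_of_both_tickets = sum(prices.values())  # bettor pays 1.20 in total
payout = 1.00                                # exactly one ticket pays out 1.00

# Whichever outcome occurs, the bettor loses 0.20 with certainty.
print(f"guaranteed loss: {cost_of_both_tickets - payout:.2f}")
```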
What is the reach of rationality thus defined? From the outset, Robyn rejected the view that in a hierarchically organized mind, the lower instinctual and emotional faculties interfere with the otherwise finely working rationality of the conscious faculty (Dawes, 1964, 1976, 2001). This view has been entrenched in Western civilization since the days of Plato, Aristotle, and the Church Fathers. Instead, Robyn has argued that irrationality as a lack of coherence can arise in the conscious mind itself. The critical limitation of rational thought is that—aside from its boundedness to self-interest—it does not generate goals. Emotions, on the other hand, can signal to a person what his or her goals and values are. Robyn was much intrigued by the case of Rudolf Höss, the commandant of the Auschwitz extermination camp. In a repugnant way, Höss exemplified rationality by acting on a coherent set of beliefs. Moreover, as Robyn noted, Höss embodied “in many ways the Platonic ideal” (Dawes, 2001, p. 38) of a man whose rational mind overruled his own emotional reactions. Like Eichmann and other Nazi leaders who could not bear the sight of individual humans being tortured, Höss felt the tug of compassion. However, he “rationally” forced himself to overcome this tug.
For the sake of social responsibility, rationality needs to be channeled by emotions rather than be cut off from them. There appear to be two modes of regulation. One mode involves respect for one’s own emotional responses. When cognitively appraised, emotions are eminently useful because they tell us “how we feel about it” (Schwarz, 2001). Robyn often refers to short stories by Bertrand Russell, in which the great philosopher argued that human compassion can overcome repressive and brutal social practices. The other mode involves respect for socially shared narratives. Like Robyn, Caporael (chapter 13) argues that humans are both reasoning and story-telling creatures. The challenge is to find a way to let the good stories win. To Robyn, the pursuit of science is an opportunity to create stories that gradually become better (see also Faust, chapter 5). It is to be hoped that scientific stories displace folk stories and mythologies with coherent stories based on solid evidence. Caporael’s story about stories emerges from her core configurations model of human evolution. Indeed, she argues that stories are primary because they can never be fully displaced by evidence. Even scientific theories retain a core narrative, in the light of which evidence is sought and interpreted.

With regard to Höss, Caporael argues that he, while perhaps being individually rational, was caught up in a bad narrative, the Nazi construal of the world. As a counterweight, she tells the story of the Danish resistance and the rescue of the Danish Jews. This rescue would have been far less likely had the Danes not found strength and purpose in a folk tale that defined who they were and whom to include in their system of values. In Caporael’s words, the Danes had a good story. Their daring success was linked with the construction of their national identity (see Brewer, chapter 10) and with the ability of individual Danes to project their acceptance of the national narrative on other Danes and, hence, trust their support. In a comment on how ethics can, at least in part, be derived from rational thought, Robyn notes that the Golden Rule works because people can infer what is pleasing and what is abhorrent to others from what is pleasing or abhorrent to themselves (Dawes, 2001).

The struggle to attain rationality in a complex, story-loving world and to square rationality with social responsibility is difficult. Robyn’s quest for coherence has met with many roadblocks. His life’s work—and in a small way, the chapters gathered here—is a testament to the contributions that rational thought can make to the social good, yet competing theories of rationality continue to thrive in academia, and competing narratives continue to hold sway over individual imaginations. If the past is a guide to the future, this basic reality will not change. Perhaps it is apt to conclude with a story. The members of a synagogue find that
they are split into two camps. One camp argues that a particular prayer is recited standing up, whereas the other camp argues that it is recited sitting down. To settle the dispute, they ask an elder for guidance. After listening to the case made by the first camp, he shakes his head, saying that it is not the tradition to pray standing up. Well then, they say, it must therefore be the tradition to pray sitting down. Again, the old sage shakes his head. That is not the tradition either. Exasperated, they conclude that they will have to keep arguing about this matter. That, the old man says, is the tradition.

This brief sketch of Robyn Dawes’s views on rationality and social responsibility is necessarily incomplete. This volume ends with a list of selected publications, which the reader may consult for a deeper understanding of the breadth and richness of Robyn’s contributions. Still, further reading will not fully disclose the informal impact Robyn has had on the contributors to this volume and many others. After the Festschrift symposium I asked the present authors to send one paragraph describing Robyn’s personal impact on them and another paragraph describing his most significant intellectual impact. To conclude this introduction, I offer a sample of quotes—without attribution—to highlight the variety of ways in which Robyn’s presence has left footprints in many careers.

Robyn does care, with a passion, not only for ideas and mathematics and music, but also about ethics, morality, and future of this species.

Robyn’s work was so meaningful to me that it caused me to change careers. I was a graduate student in a clinical psychology program geared toward training practitioners. My coursework exposed me to Robyn’s clinical judgment work more so than at most training programs, probably owing to the fact that a couple of years earlier Robyn had graciously agreed to enter the lion’s den and speak to the program defending his House of Cards work. When we read “Representative Thinking in Clinical Judgment” I was amazed both at how telling were the (often unfavorable) descriptions of clinical activity and at how many clinicians either ignored or rejected the work out of hand. Eventually, I applied to be his student so that I could dedicate myself to judgment and decision making research.

Dawes and Corrigan’s (1974) demonstration of the inherent limits to regression approaches for revealing the cognitive processes underlying complex judgments provided an intuitive feeling for
the seductiveness of explaining potentially random data. It was liberating research in the narrow sense of freeing me from trying to do the impossible—and in the broad sense of showing that there were insights available only to those who looked for the fundamental structure of the problem. Nobody does it like Robyn.

It was a bit of a shock to meet Robyn and find myself baffled by conversations with him. He often says things that, I must admit, I have difficulty following. His statements often seem to assume knowledge that I could not have had. When, in embarrassment, I admitted to others that I did not always understand Robyn, I was comforted to learn that others had had the same experience. The general consensus here is that there are two leading explanations of this:
1. Robyn is so much smarter than us mere mortals that it is difficult for us to follow the speed and logic of his thinking.
2. Robyn is only tenuously in touch with the basic norms of everyday social interaction.
The data are equally consistent with both of these.

I feel that Robyn Dawes’s most important contributions are his insistence on intellectual humility and process transparency. To make a good decision that influences the lives of other people, let people know upon what bases the decision is being made and then do it in a fair and consistent manner. People deserve no less than our best decision procedures. To do less is unethical.

As Dawes noted, in almost 2,000 experimental game studies, no one attempted to discover if cooperation would occur in the absence of incentives. . . . Dawes developed an interesting “subtractive method” for his research, where possible alternative explanations for cooperation were controlled in the experimental design, and what could not be controlled became the focus of the manipulation for the next study. Nevertheless, despite such care, there are still experts that still believe that somewhere, there must be such a hidden incentive.

I recall once at a conference in the United Kingdom, during the 1980s when I was still the editor of the Journal of Experimental Social Psychology, I ran into Robyn shortly after I had checked in, but before my wife and I had gone to our room. Robyn was excited about an idea that had to do, as best I could tell in the hotel lobby, with a claim that the false consensus effect was, in fact, not false at all. Robyn had a napkin or some other small piece of paper and he was making notes and writing equations on it for
me to see, and, while I grasped the general drift of his claim, I did not grasp the underlying logic. (Robyn often made me feel like a moron because I failed to see something that was blatantly evident to him.) I asked him to write out the argument and send it to me as a journal editor and I would give him some feedback on the interest of the idea. The sequel, as they say, is history. He sent me the paper; I read it and studied it and dug through a rather opaque argument (it is hard to write clearly for a moron when the validity of the argument is self-evident) and realized that his argument was original, valid, and nearly inaccessible to most of our readers. A revision or two improved the accessibility and the resulting paper is one of the most important I published as an editor. Again, Robyn Dawes was a decade ahead of his peers.

Robyn is the most interesting philosopher I know. Yes, philosopher. His understanding and critical assessment of the foundations of rationality is almost unparalleled among philosophers who do this for a living. And the most interesting thing is that he follows this up with experiments showing that what appears as irrational behavior is in fact rational, and the real problem is that we usually attribute wrong motives to agents.
References

Bartlett, F. C. (1932). Remembering. Cambridge, UK: Cambridge University Press.
Coombs, C. H., Dawes, R. M., & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice-Hall.
Dawes, R. M. (1964). Cognitive distortion. Psychological Reports, 14, 443–459.
Dawes, R. M. (1966). Memory and distortion of meaningful written material. British Journal of Psychology, 57, 77–86.
Dawes, R. M. (1972). Fundamentals of attitude measurement. New York: Wiley.
Dawes, R. M. (1976). Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior (pp. 3–12). Hillsdale, NJ: Erlbaum.
Dawes, R. M. (1979). The robust beauty of improper linear models. American Psychologist, 34, 571–582.
Dawes, R. M. (1980). Social dilemmas. Annual Review of Psychology, 31, 169–193.
Dawes, R. M. (1988). Rational choice in an uncertain world. San Diego, CA: Harcourt Brace Jovanovich.
Dawes, R. M. (1989). Statistical criteria for establishing a truly false consensus effect. Journal of Experimental Social Psychology, 25, 1–17.
18 • Joachim I. Krueger Dawes, R. M. (1990). The potential nonfalsity of the false consensus effect. In R. M. Hogarth (Ed.), Insights in decision making: A tribute to Hillel J. Einhorn (pp. 179–199). Chicago: University of Chicago Press. Dawes, R. M. (1991). Probabilistic versus causal thinking. In D. Ciuhetti & W. M. Grove (Eds.), Thinking clearly about psychology: Vol. 1. Matters of public interest. Essays in honor of Paul Everett Meehl (pp. 235–264). Minneapolis: University of Minnesota Press. Dawes, R. M. (1994). House of cards: Psychology and psychotherapy based on myth. New York: The Free Press. Dawes, R. M. (1998). Behavioral decision making and judgment. In D. Gilbert, S. Fiske, & G. Lindsey (Eds.), The handbook of social psychology (4th ed., Vol. 2, pp. 497–548). Boston: McGraw-Hill. Dawes, R. M. (2001). Everyday irrationality: How pseudoscientists, lunatics, and the rest of us systematically fail to think rationally. Boulder, CO: Westview Press. Dawes, R. M. (2005). An analysis of structural availability biases, and a brief experiment. In K. Fiedler & P. Juslin (Eds.), Information sampling and adaptive cognition (pp. 147–152). New York: Cambridge University Press. Dawes, R. M., & Corrigan, B. (1974). Linear models and decision making. Psychological Bulletin, 81, 95–106. Dawes, R. M., McTavish, J., & Shaklee, H. (1977). Behavior, communication, and assumptions about other people’s behavior in a commons dilemma situation. Journal of Personality and Social Psychology, 35, 1–11. Dawes, R. M., & Messick, D. M. (2000). Social dilemmas. International Journal of Psychology, 35, 111–116. Dawes, R. M., Mirels, H. I., Gold, E., & Donahue, E. (1993). Equating inverse probabilities in implicit personality judgments. Psychological Science, 6, 396–400. Dawes, R. M., & Mulford, M. (1996). The false consensus effect and overconfidence: Flaws in judgment or flaws in how we study judgment? Organizational Behavior and Human Decision Processes, 65, 201–211. Dawes, R. M., Singer, D., & Lemons, F. (1972). An experimental analysis of the contrast effect and its implications for intergroup communication and the indirect assessment of attitude. Journal of Personality and Social Psychology, 21, 281–295. Dawes, R. M., & Smith, T. E. (1985). Attitude and opinion measurement. In G. Lindsey & E. Aronson (Eds.), The handbook of social psychology (3rd ed., Vol. 1, pp. 509–566). New York: Random House. De Finetti, B. (1931). Probabilismo: Saggio critico sulla teoria delle probabilità e sul valore della scienza. In A. Alivera (Ed.), Biblioteca di Filosofia (163–219). Naples, Italy. Fehr, E., & Schmidt, K. (1999). A theory of fairness, competition, and cooperation. The Quarterly Journal of Economics, 114, 159–181.
A Psychologist between Logos and Ethos • 19 Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464–1480. Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world (2nd ed.). New York: Sage. Howard, J. W., & Dawes, R. M. (1976). Linear prediction of marital happiness. Personality and Social Psychology Bulletin, 2, 478–480. Krueger, J. I., (1998). On the perception of social consensus. In M. P. Zanna (Ed.), Advances in experimental social psychology (Vol. 30, pp. 163–240). San Diego, CA: Academic Press. Krueger, J. I., (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16–26. Krueger, J. I., & Acevedo, M. (in press). Perceptions of self and other in the prisoner’s dilemma: Outcome bias and evidential reasoning. American Journal of Psychology, 120, 593–618. LaPiere, R. T. (1934). Attitudes vs. action. Social Forces, 13, 230–237. Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press. Meehl, P. E., & Rosen, A. (1955). Antecedent probability and the efficiency of psychometric signs, patterns, and cutting scores. Psychological Bulletin, 52, 194–216. Orbell, J. M., & Dawes, R. M. (1991). A “cognitive miser” theory of cooperators’ advantage. American Political Science Review, 85, 515–528. Orbell, J. M., van de Kragt, A. J. C., & Dawes, R. M. (1988). Explaining discussion-induced cooperation. Journal of Personality and Social Psychology, 54, 811–819. Schwarz, N. (2001). Feelings as information: Implications for affective influences on information processing. In L. L. Martin & G. L. Clore (Eds.), Theories of mood and cognition: A user’s guidebook (pp. 159–176). Mahwah, NJ: Erlbaum. Tajfel, H., Billig, M. G., Bundy, R. P., & Flament, C. (1971). Social categorization and intergroup behavior. European Journal of Social Psychology, 1, 1–39. Turner, J. C., Hogg, M. A., Oakes, P. J., Reicher, S. D., & Wetherell, M. (1987). Rediscovering the social group: A self-categorization theory. Oxford, UK: Blackwell. Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics and biases. Science, 185, 1124–1131
2
The Gambler’s Fallacy and the Coin’s Memory

Eric Gold
Fidelity Investments

Gordon Hester
Electric Power Research Institute
The gambler's fallacy is the mistaken belief that past repetitions of the same independent, random outcome somehow increase the probability of a different random outcome in the future (Tversky & Kahneman, 1971). The belief that a coin will come up tails after a run of heads is an example of the fallacy. Another example, cited in Hastie and Dawes (2001), is a letter to Dear Abby from a mother who was surprised when she gave birth to her eighth girl in a row:

Dear Abby: My husband and I just had our eighth child. Another girl, and I am really one disappointed woman. I suppose I should thank God that she was healthy, but, Abby, this one was supposed to have been a boy. Even the doctor told me the law of averages were in our favor 100 to 1.

Previous research has demonstrated that such beliefs lead to methodological problems in experimentation (Bush & Morlock, 1959, cited in Colle, Rose, & Taylor, 1974; Friedman, Carterette, Nakatani, & Ahumada, 1968), gambling errors (Metzger, 1985; Oldman, 1974; Wagenaar, 1988), and problems in real-world decision making (McClelland & Hackenberg, 1978). Tversky and Kahneman (1971) argued that
the gambler's fallacy results from judgments of representativeness, that is, judgments of how similar an actual sequence is to a typical sequence generated by the underlying causal process. For example, a sequence of heads and tails generated by flips of a fair coin should have about the same number of heads and tails and should alternate between heads and tails sufficiently often. The previous work, however, does not describe a mechanism for the perceived dependence. Consequently, we conducted experiments in which the same sequence of outcomes does or does not lead to the fallacy, and we propose such a mechanism. In particular, this work is concerned with how people view gambling devices and how the use of those devices either facilitates the gambler's fallacy or reduces it. Our experiments follow from the idea that the gambler's fallacy occurs because people treat the gambling device as an intentional system, attributing volition, the ability to affect outcomes, and memory to the device. We suggest that people attribute behavior to the device and that the action of the device leads to an incorrect perception of dependence among random outcomes. This is not to say that people would explicitly claim that a coin has volition, but their behavior betrays such an implicit belief.

People attribute behavior to inanimate objects in contexts other than gambling. Children learn that the world is animate before they understand the concept of inanimacy. Many elderly people also show animistic tendencies, as do many mentally retarded individuals. Typical adults also engage in anthropomorphism. Everyone can identify with a person who curses at a parking meter or who wonders about the motives of computers. Naive physicists treat moving objects as if they have intentions. Until about 200 years ago, even scientists described all sorts of phenomena animistically.

In The Child's Conception of the World, Piaget (1928) studied the development of animistic thinking. When children begin to describe the world around them, they consider everything to be alive. They believe that intact objects have consciousness, feelings, and purpose, and they think that broken objects are dead. After that, children move through a sequence of four stages. First, they believe that any useful item is alive. During the second stage, they believe that objects that move are alive, and in stage three, they hold this belief only about objects that appear to move by themselves. It is in the last stage that children consider only plants and animals to be alive.

In an attempt to replicate Piaget's findings, Russell and Dennis (1939) prepared a questionnaire about animism, which asked subjects to specify which of 20 items were animate or inanimate and to explain why. The questionnaire was administered verbally to children
with spontaneous explanations followed up by the interviewer. Russell and Dennis's results were consistent with Piaget's. More recent work by Dolgin and Behrend (1984) and Lucas, Linke, and Sedgwick (1979) also supported Piaget's findings. However, studies by Bullock (1985), Gelman and Spelke (1981), and Massey (1988) were more critical of the idea that children have an animistic bias.

Dennis and Mallinger (1949) administered the Russell and Dennis questionnaire to 36 elderly Pittsburgh residents. To their surprise, they discovered that only nine subjects were in stage four; only a minority of their subjects believed that just plants and animals were alive. Of the remainder, 12 subjects were in stage one, seven were in stage two, and eight were in stage three. Subjects were explicit in their beliefs, giving answers such as:

The mirror is living because you can see yourself in it. The knife is living because it cuts and performs work. The dish (broken) is dead because it is of no use. (p. 219)

Applying Russell and Dennis's materials to 600 high school students, Russell (1942) found that as many as 25% of the students were not in stage four. Russell, Dennis, and Ash (1940) reported that of 100 institutionalized mentally retarded subjects, 57% scored in one of the first three stages. Searles (1962) described a prevalence of animism in schizophrenic patients. Surprisingly, nearly one-third of college students had not reached stage four (Dennis, 1953); Cranell (1954) and Bell (1954) reported similar findings.

Whereas Russell, Dennis, Ash, Mallinger, and other researchers argued that adults who show animistic behaviors have childlike tendencies, Brown and Thouless (1965) believed that "such animistic behavior patterns cannot be attributed to immaturity or a confusion of categories, but should rather be regarded as products of an essentially deliberate process" (p. 40). Lowie (1954, cited in Looft & Bartz, 1969) argued that subjects in experiments like Russell's do not literally believe that the objects are alive; rather, the nearly universal use of anthropomorphic language shows that such behavior is considered desirable. Crowell and Dole (1957) asked college students to determine whether items on a list were animate or inanimate, correlating their responses with their year in college, with whether they had taken a course in biology, and with their scores on an aptitude test. They found no relationship between animism and year in school or training in biology, but they found a moderate correlation between lack of animism and aptitude.

Caporael (1986, p. 215) observed that people "have inferred human feelings and motivations as causal explanations for otherwise
inexplicable malfunctioning, and in short, entered (briefly or extensively) into social relations with their automobiles." She argued that the desire to anthropomorphize follows from a need to predict and control one's environment. Caporael also claimed that people anthropomorphize computers. She quoted Branscomb (1979) as arguing that a usable computer "creates in us a warm feeling, and may be described as 'friendly,' 'faithful' or 'obedient.'" She discussed Weizenbaum's (1976) notion that, because the computer is an extension of the body, anthropomorphism is a consequence of the resulting emotional bonding, and Minsky's (1967, p. 120) description of a particular program "as an individual whose range of behavior is uncertain." Dennett (1980, p. 9) suggested that people sometimes deal with a system such as a computer "by explaining and predicting its behavior by citing its beliefs and desires." He referred to such explanations as intentional, noting that "a particular thing is an intentional system only in relation to the strategies of someone who is trying to explain and predict its behavior" (pp. 3–4). Scheibe and Erwin (1979) recorded the spontaneous verbalizations of subjects seated before a computer and discovered evidence for anthropomorphism in 39 out of 40 subjects.

Before the eighteenth century, scientists routinely portrayed the physical world animistically. Aristotle, for example, believed that a falling object maintains its own constant velocity (Champagne, Klopfer, & Anderson, 1980; Shanon, 1976; Toulmin, 1961). Other scientists described changes in inorganic compounds as the result of those compounds striving to become the compounds they are capable of becoming (cf. Toulmin, 1961). Even the noted psychophysicist Gustav Fechner (cited in Brown & Thouless, 1965) described the planets and stars animistically, referring to the Earth as a "self-moving organism."

Examples of treating the physical world animistically can be seen in the modern world by studying naive physics. Many people's intuitive beliefs about the motion of objects are Aristotelian. DiSessa (1982) and Shanon (1976) tried to help introductory physics students unlearn their Aristotelian views. Caramazza, McCloskey, and Green (1981) showed naive subjects a ball tied to the end of a rope and then asked them what would happen if the ball were swung in a circle and suddenly released. Forty-nine percent of the subjects claimed that the ball would follow a curved path, as if the ball somehow had a memory for spinning in a circle.

The studies summarized in this chapter were conducted to discover the nature and mechanisms underlying this kind of animistic projection by examining situations in which the gambler's fallacy can be found and ways in which the fallacy can be reduced or increased. Not
all random, independent sequences lead to the gambler's fallacy, as the following five experiments show by varying the gambling devices that generate identical sequences of outcomes.
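As a concrete reminder of what independence means here, the following minimal simulation (an editorial sketch, not part of the original studies) estimates the probability of heads immediately after a run of four heads in a long sequence of fair flips. The estimate comes out near .5: the coin has no memory.

```python
import random

def heads_after_streak(n_flips=200_000, streak=4, seed=1):
    """Estimate P(heads) immediately following a run of `streak` heads
    in a sequence of independent fair-coin flips."""
    random.seed(seed)
    flips = [random.random() < 0.5 for _ in range(n_flips)]  # True = heads
    # Collect every flip that follows `streak` consecutive heads
    followers = [flips[i] for i in range(streak, n_flips)
                 if all(flips[i - streak:i])]
    return sum(followers) / len(followers)

print(round(heads_after_streak(), 3))  # approximately 0.5
```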
Experiment 1

Suppose subjects bet on the outcome of coin flips and, after some number of flips, witness four heads in a row. Subjects who accept the gambler's fallacy will expect the next outcome to be tails. Suppose, further, that the coin is swapped for a different coin just before the fifth flip. Will subjects still expect tails on the next flip? The idea here is to test whether subjects associate the fallacy with a particular gambling device, consistent with attributing intentional behavior to the device. The original coin's job is to balance its outcomes; the new coin has no such responsibility. If the fallacy weakens when the coin is swapped, that would indicate an implicit belief about the behavior of a particular coin.

Instead of two coins, the first experiment relied on two separate gambling devices: a deck of cards and a coin. The deck of cards was repeatedly cut, with the outcome being either red or black, depending on the suit of the cut card. The coin was painted red on one side and black on the other so that the possible outcomes of the coin flips would match the possible outcomes of the card cuts. For each of two conditions, subjects were repeatedly presented with coin flips, which eventually resulted in four reds in a row. In one condition, subjects would then gamble on the outcome of another coin flip; in the other condition, subjects would gamble on the outcome of a card cut. The experiment tested whether subjects demonstrate the gambler's fallacy after the four reds occur.

An important question is how to measure the gambler's fallacy. The most obvious answer is simply to ask subjects about the likelihood of different outcomes after a run of the same outcome. McClelland and Hackenberg (1978) used this method; their dependent measure was the proportion of people who believed that a girl was more likely than a boy after a run of male births. What people know and what people do may differ, however, so a measure of the fallacy should involve participation in actual gambles. One possibility is to ask subjects to decide between heads and tails after a run of four heads. There is nothing fallacious, however, about choosing heads over tails; both choices lead to the same expected gain. Subjects who choose tails after a string of heads are not making a bad bet. In contrast, this experiment measured the gambler's fallacy by introducing a sure-thing alternative. Subjects chose, for each flip, between some number of points that they would receive no matter
what the outcome versus a payoff received only if an outcome specified by the experimenter came up. In one pilot experiment, for example, subjects were offered either 50 points for sure or 100 points if a specified outcome, which could be either heads or tails, came up. Subjects who chose the gamble received no points if the specified outcome did not occur. The winning outcome varied from flip to flip without any discernible pattern of which outcomes would win the gambles. The important decision occurred after the coin came up heads four times in a row. For this flip only, half of the subjects chose between the sure points and a payoff if heads came up, whereas half of the subjects chose between the sure points and a payoff if tails came up. Subjects demonstrated the gambler's fallacy by tending to choose the sure thing when offered heads as the winning outcome and tending to choose the gamble when offered tails. With this design, subjects were willing to forgo expected gain when confronted with a run of the same outcome.

The results from three pilot experiments were used to determine how many points to offer the subjects for the gamble and the sure thing. The goal was to determine values such that, without any effect of the fallacy, subjects would choose the gamble and the sure thing equally often, thus preventing floor and ceiling effects. The results of these experiments showed that subjects were risk seeking and led to an experimental design with 100 points for the gamble and 70 points for the sure thing.

Method

Subjects
Eighty-eight undergraduates from the University of Pittsburgh participated in the experiment for extra credit in an introductory communications course.

Materials
Gambling Devices. The experiment used three gambling devices: a half dollar painted red on one side and black on the other, a deck of cards, and a large wooden die with three faces painted red and three faces painted black.

Printed Materials. Subjects recorded their responses, the outcome of each trial, and their scores on an answer sheet. There were two versions of the answer sheet. Subjects were asked the two questions given in Table 2.1.
Table 2.1 Questions About the Gambler's Fallacy: Experiment 1

Questions
1. Suppose you flipped a coin five times and each time the coin came up heads. If you flip the coin a sixth time you are…
2. Suppose you flipped a coin five times and each time the coin came up heads. If you flip a different coin for the sixth flip you are…

Answers
a. Much more likely to come up heads than tails
b. Somewhat more likely to come up heads than tails
c. Equally likely to come up heads as tails
d. Somewhat more likely to come up tails than heads
e. Much more likely to come up tails than heads

Procedure
Four groups of subjects were presented with a sequence of 40 coin flips and card cuts where the outcome of each flip or cut was red
or black. For each of these trials, the experimenter specified the winning and losing color; subjects decided whether to gamble on the winning color. Subjects who chose to gamble received 100 points if the card or coin came up the winning color; otherwise they received no points. Alternatively, subjects could choose to receive 70 points; if so, the outcome of the trial had no bearing on their winnings. The experimenter explained that there would be three kinds of gambling devices and that the device in use might be changed at any time. The subjects indicated their choices on an answer sheet before each trial. After each trial, they recorded the outcome and the number of points received. The outcome of each trial was rigged. The subjects were too far from the experimenter to see the results of the coin flips; the experimenter simply announced the predetermined outcomes. The outcomes of the card cuts were arranged by using a trick deck. The first 22 trials were the same for all four groups. Table 2.2 shows, for each trial, the type of gambling device, the rigged outcome, and the winning color. The rigged outcomes of the first 22 trials were chosen so that (1) an equal number of reds and blacks would come up and (2) trials 19 to 22, for which the coin was used, would come up red. Subjects whose behavior was consistent with the gambler’s fallacy would expect black to be the due outcome on the 23rd trial; hence they would tend to gamble if black was specified as the winning color but take the sure payoff if red was specified as the winning color. Table 2.3 presents the details for trial 23, giving, for each group, the type of gambling device used, the rigged outcome, and the winning color. Subjects answered the questions about the gambler’s fallacy at the end of the experiment.
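It is worth making the incentive structure explicit (the arithmetic below is an editorial illustration, not a computation from the chapter). For a subject who regards the outcome as fair and maximizes expected points, the gamble is always inferior to the sure thing:

```latex
E[\text{gamble}] = 0.5 \times 100 = 50 \text{ points} \;<\; 70 \text{ points} = E[\text{sure thing}]
```

Taking the gamble is consistent with expected-point maximization only if the subject's subjective probability of the winning color exceeds 70/100 = .7. The pilot work showed that subjects were risk seeking enough to split evenly between the two options, so the diagnostic signal is not gambling per se but the difference between conditions: a subject who gambles when the "due" color wins, yet takes the sure thing when the other color wins, is revealing the fallacy.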
Table 2.2 The First 22 Trials of Experiment 1

Trial   Gambling Device   Rigged Outcome   Winning Outcome
1       Cards             Black            Red
2       Cards             Red              Black
3       Cards             Red              Red
4       Cards             Red(a)           Red
5       Cards             Black(a)         Red
6       Cards             Red              Black
7       Cards             Red              Red
8       Cards             Black            Black
9       Cards             Black            Red
10      Cards             Black            Red
11      Cards             Red              Red
12      Coin              Black            Red
13      Coin              Black            Red
14      Coin              Red              Black
15      Coin              Black            Black
16      Coin              Black            Red
17      Coin              Red              Black
18      Coin              Black            Black
19      Coin              Red              Black
20      Coin              Red              Black
21      Coin              Red              Red
22      Coin              Red              Black

(a) The rigged card cutting did not work for every trial. The outcomes of trials 4 and 5 for groups I and II were black and red, respectively. This anomaly should not affect the interpretation of the results.
Table 2.3 Trial 23 of Experiment 1

Group   Gambling Device   Rigged Outcome   Winning Outcome
I       Cards             Black            Red
II      Cards             Black            Black
III     Coin              Black            Red
IV      Coin              Black            Black
Results
The subjects' choices for the 23rd trial were used to construct a 2 × 2 × 2 contingency table; this is presented in Table 2.4.
Table 2.4 Subjects' Responses to Trial 23 of Experiment 1

Same Gambling Device
                   Winning color was due   Losing color was due
Chose sure thing    4                      13
Chose gamble       14                       3

Different Gambling Device
                   Winning color was due   Losing color was due
Chose sure thing   12                      12
Chose gamble       15                      15
The upper-right and lower-left cells in each half table indicate responses consistent with a belief that one color is more likely to come up than the other. A subject acting in accordance with the gambler's fallacy would choose the gamble when the winning color is due and choose the sure thing when the losing color is due. The table shows a stronger effect of the gambler's fallacy when the experimenter used the same gambling device than when the experimenter changed the device. Bartlett's test for a device × winning color × subject choice interaction (Fienberg, 1977) revealed a significant three-way effect (z = 2.68, p < .01).

Table 2.5 summarizes the subjects' answers to the questions about the gambler's fallacy. Subjects were counted as accepting the fallacy if they indicated that tails is more likely for the first question and explained their answers in a way consistent with this belief. Subjects were counted as believing that changing the coin lessens the fallacy if they answered that tails is more likely for the same coin than for a different coin. Table 2.6 shows the choices for the 23rd trial for the 51 subjects who responded that they did not believe the fallacy. Again, the table shows a stronger effect of the gambler's fallacy with the same gambling device (z = 2.13, p < .05). Subjects who explicitly know the gambler's fallacy is false still show an implicit belief in it; they are more likely to forgo expected gain when the same coin is used.

Table 2.5 Subjects' Beliefs About the Gambler's Fallacy: Experiment 1

Belief                                                          n    %
Both outcomes are judged equally likely for each question      51   59
The due outcome is more likely after the fifth flip            23   26
The two outcomes are equally likely when the coin is changed   15   17
The due outcome is more likely even when the coin is changed    8    9
Other beliefs                                                  13   15

Table 2.6 Rational Subjects' Responses to Trial 23 of Experiment 1

Same Gambling Device
                   Winning color was due   Losing color was due
Chose sure thing    2                       8
Chose gamble       11                       2

Different Gambling Device
                   Winning color was due   Losing color was due
Chose sure thing    4                       6
Chose gamble        8                      10
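The three-way test can be reconstructed from the cell counts. The sketch below is an editorial reconstruction (the chapter cites Fienberg, 1977, and the authors' exact procedure may differ): it compares the log odds ratios of the two half tables, and it recovers the reported z values for Tables 2.4 and 2.6. The same computation also recovers the z values reported for Experiment 3 below.

```python
import math

def log_odds_z(table1, table2):
    """Compare the association in two 2x2 tables, each given as
    (a, b, c, d) for the layout [[a, b], [c, d]]. The z statistic tests
    whether the two log odds ratios differ, a large-sample version of
    Bartlett's test of the three-way interaction."""
    def log_or(a, b, c, d):
        # log odds ratio and its asymptotic variance
        return math.log((a * d) / (b * c)), 1/a + 1/b + 1/c + 1/d
    l1, v1 = log_or(*table1)
    l2, v2 = log_or(*table2)
    return abs(l1 - l2) / math.sqrt(v1 + v2)

# Table 2.4: same device (4, 13, 14, 3); different device (12, 12, 15, 15)
print(round(log_odds_z((4, 13, 14, 3), (12, 12, 15, 15)), 2))  # 2.68
# Table 2.6: the 51 subjects who explicitly rejected the fallacy
print(round(log_odds_z((2, 8, 11, 2), (4, 6, 8, 10)), 2))      # 2.13
```

(A zero cell, as in Table 2.9 of Experiment 2, would require a continuity correction, such as adding .5 to each count, before the logarithms can be taken.)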
Experiment 2

This experiment replicates the first experiment but with a more subtle change in the gambling device. Instead of two different types of devices, two coins that differed only in size were used.

Method

Subjects
Eighty-four undergraduates from the University of Pittsburgh participated in the experiment for extra credit in an introductory communications course.
Materials
Gambling Devices. The experiment used two gambling devices: a quarter painted red on one side and black on the other and a nickel painted red on one side and black on the other.

Printed Materials. Subjects recorded their choices for each trial, along with the outcome of the trial and their score, on one of two versions of an answer sheet. The answer sheets were identical to the one used for Experiment 1.

Procedure
Four groups of subjects were presented with a sequence of 40 coin flips where the outcome of each flip was either red or black. For each of these flips, the answer sheet specified the winning and losing color; subjects decided whether to gamble on the winning color for 100 points or receive 70 points for sure. The experimenter explained that he would use two different coins and that the coin in use might be changed at any time. The subjects indicated their choices on an answer sheet before each flip. After each flip, they recorded the outcome and the number of points they received.

The first 22 flips were the same for all four groups. Table 2.7 shows, for each flip, the type of coin used, the rigged outcome, and the winning color. This table is identical to the one used for Experiment 1, except for the specification of the gambling device: the nickel was used whenever the cards were used in Experiment 1.
Table 2.7 The First 22 Flips of Experiment 2

Flip   Gambling Device   Rigged Outcome   Winning Outcome
1      Nickel            Black            Red
2      Nickel            Red              Black
3      Nickel            Red              Red
4      Nickel            Red              Red
5      Nickel            Black            Red
6      Nickel            Red              Black
7      Nickel            Red              Red
8      Nickel            Black            Black
9      Nickel            Black            Red
10     Nickel            Black            Red
11     Nickel            Red              Red
12     Quarter           Black            Red
13     Quarter           Black            Red
14     Quarter           Red              Black
15     Quarter           Black            Black
16     Quarter           Black            Red
17     Quarter           Red              Black
18     Quarter           Black            Black
19     Quarter           Red              Black
20     Quarter           Red              Black
21     Quarter           Red              Red
22     Quarter           Red              Black
Design
Like Experiment 1, this experiment consisted of two factors: (1) whether the same or a different coin was used immediately following the run of four reds and (2) whether subjects bet on the outcome that was due or the outcome that was not due. The predictions and analyses were the same as in Experiment 1. Table 2.8 shows, for each group, the type of gambling device, the rigged outcome, and the winning color for flip 23.
Table 2.8 Flip 23 of Experiment 2

Group   Gambling Device   Rigged Outcome   Winning Outcome
I       Nickel            Black            Red
II      Nickel            Black            Black
III     Quarter           Black            Red
IV      Quarter           Black            Black
Table 2.9 Subjects' Responses to Flip 23 of Experiment 2

Same Gambling Device
                   Winning color was due   Losing color was due
Chose sure thing    0                      13
Chose gamble       20                       6

Different Gambling Device
                   Winning color was due   Losing color was due
Chose sure thing    8                      12
Chose gamble       14                      11
Results
Again, the subjects' choices for the 23rd flip were used to construct a 2 × 2 × 2 contingency table; this is presented in Table 2.9. As before, there was a stronger effect of the gambler's fallacy when the experimenter used the same gambling device than when the experimenter changed the device (z = 2.42, p < .01). In some sense, the subjects implicitly believed that the coin was responsible for maintaining a balanced sequence of outcomes.
Experiment 3

If a coin implicitly exercises volition and balances its outcomes, it must possess some kind of memory. One characteristic of memory is that it fades over time. This study used just one coin, which was flipped repeatedly until four reds came up in a row. Subjects, in each of two conditions, stopped after the four reds to review the outcomes of the previous flips. In the first condition, a 24-minute pause was introduced after the run of reds and before the review. The review ensured that subjects would remember the outcomes that occurred prior to the pause. The second condition was similar to the first except that no pause was introduced before the review. As in the first two experiments, the sequence of outcomes was identical in both conditions, as were the probabilities associated with each flip.
Method

Subjects
One hundred and twenty undergraduates from the University of Pittsburgh participated for extra credit in an introductory communications course.

Materials
Gambling Device. The experiment used a half dollar painted red on one side and black on the other.

Printed Materials. Two versions of an answer sheet were used. The sheets were similar to those used for Experiments 1 and 2; only the format differed. Three different word search puzzles were also used. These puzzles consisted of an array of letters and a list of words; the puzzle takers' task was to search for the words in the grid of letters. Some subjects also completed a vocabulary test, the results of which served as a measure of intelligence. The subjects answered the two questions in Table 2.10.

Table 2.10 Questions About the Gambler's Fallacy: Experiment 3

Questions
1. Suppose you flipped a coin five times and each time the coin came up heads. If you flip the coin a sixth time you are...
2. Suppose you flipped a coin five times and each time the coin came up heads. If you wait one day and flip the coin a sixth time you are...

Answers
a. Much more likely to come up heads than tails
b. Somewhat more likely to come up heads than tails
c. Equally likely to come up heads as tails
d. Somewhat more likely to come up tails than heads
e. Much more likely to come up tails than heads

Procedure
Four groups of subjects were run in two sessions. Each subject was presented with 40 flips of the specially prepared coin. For each flip, subjects indicated on the answer sheet their choice between a sure win of 70 points whatever the outcome versus a gamble for 100 points on the outcome specified by the experimenter. After each flip, subjects indicated the outcome, either red or black, and the number of points won. The subjects won if the coin came up the color specified on the answer sheet.
Table 2.11 The First 22 Flips of Experiment 3

Flip   Rigged Outcome   Winning Outcome
1      Black            Red
2      Red              Black
3      Red              Red
4      Red              Red
5      Black            Red
6      Red              Black
7      Red              Red
8      Black            Black
9      Black            Red
10     Black            Red
11     Red              Red
12     Black            Red
13     Black            Red
14     Red              Black
15     Black            Black
16     Black            Red
17     Red              Black
18     Black            Black
19     Red              Black
20     Red              Black
21     Red              Red
22     Red              Black
The outcomes of the coin and the colors that would win the gambles were the same as the flips in Experiment 1. Table 2.11 presents this information for the first 22 flips. These flips were identical for all four groups. After the 22nd flip, the point at which four reds in a row had come up, some of the subjects proceeded immediately to the 23rd flip while other subjects paused for 24 minutes before continuing to the 23rd flip. The subjects who paused worked on the three word search puzzles. So that subjects would not forget the previous outcomes, the experimenter reviewed, immediately after the 11th and 22nd flips, the previous 11 outcomes. For the subjects in the pause conditions, the experimenter reviewed the outcomes after the pause, just before the 23rd flip. Additionally, for the 23rd flip, half of the subjects won their gamble betting on red and half won betting on black. Table 2.12 shows how the four groups varied. Subjects in groups III and IV also completed the vocabulary test. All subjects completed the experiment by answering the two questions about the gambler’s fallacy.
Table 2.12 Flip 23 of Experiment 3

Group   Delay      Rigged Outcome   Winning Outcome
I       Pause      Black            Red
II      Pause      Black            Black
III     No pause   Black            Red
IV      No pause   Black            Black
Design
This experiment consisted of two factors: (1) whether a pause was introduced immediately following the run of four reds and (2) whether subjects were given a bet on the outcome that was due or the outcome that was not due. The dependent measure was whether subjects took the gamble or the sure thing on the critical trial. The results were arranged in a 2 × 2 × 2 three-way contingency table and were analyzed using Bartlett's test. The experiment determined whether subjects given the pause show a weaker effect of the fallacy than subjects who did not receive a pause.

Results
The subjects' choices for the 23rd flip are presented in Table 2.13. There was a weaker effect of the gambler's fallacy with the 24-minute pause than without the pause (z = 2.58, p < .01). Table 2.14 shows the subjects' responses to the questions about the gambler's fallacy; subjects were categorized using the same rules as in Experiment 1. Table 2.15 shows the choices for the 23rd flip for the 77 subjects responding that they did not believe the fallacy. The analysis demonstrates a weaker effect of the gambler's fallacy with a pause than without a pause (z = 1.78, p < .05).

Table 2.13 Subjects' Responses to Flip 23 of Experiment 3

No Pause
                   Winning color was due   Losing color was due
Chose sure thing    8                      24
Chose gamble       22                       5

Pause
                   Winning color was due   Losing color was due
Chose sure thing   13                      18
Chose gamble       16                      14
Table 2.14 Subjects' Beliefs About the Gambler's Fallacy: Experiment 3

Belief                                                 n    %
Both outcomes are equally likely for both questions   77   65
The due outcome is more likely after the fifth flip   30   25
The two outcomes are equally likely after the pause    5    4
The due outcome is more likely even after the pause   25   21
Other beliefs                                         12   10
Table 2.15 Correctly Answering Subjects' Responses to Flip 23 of Experiment 3

No Pause
                   Winning color was due   Losing color was due
Chose sure thing    6                      12
Chose gamble       14                       3

Pause
                   Winning color was due   Losing color was due
Chose sure thing    8                      11
Chose gamble       12                      11
Table 2.16 Results for High and Low Scorers: Experiment 3

High Vocabulary Scores
                   Winning color was due   Losing color was due
Chose sure thing    3                      13
Chose gamble       10                       3

Low Vocabulary Scores
                   Winning color was due   Losing color was due
Chose sure thing    5                      11
Chose gamble       12                       1
A median split of the subjects who did not pause after flip 22 was done based on the vocabulary scores, and a separate contingency table was constructed for high and for low scorers (see Table 2.16). Both groups showed an effect of the gambler's fallacy (high scorers: χ²(1) = 7.60, p < .05; low scorers: χ²(1) = 8.65, p < .01). Bartlett's test provided no evidence for a stronger effect in the contingency table for low scorers than in the contingency table for high scorers (z = .40). Thus, no conclusions could be reached based on the intelligence of the subjects.
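Those chi-square values match a test with the Yates continuity correction. A brief editorial check (not taken from the chapter; scipy applies the Yates correction to 2 × 2 tables by default) recovers both reported statistics:

```python
from scipy.stats import chi2_contingency

# Table 2.16: rows = (chose sure thing, chose gamble),
# columns = (winning color was due, losing color was due)
high_scorers = [[3, 13], [10, 3]]
low_scorers = [[5, 11], [12, 1]]
for label, table in (("high", high_scorers), ("low", low_scorers)):
    chi2, p, dof, expected = chi2_contingency(table)  # Yates-corrected
    print(label, round(chi2, 2))  # high 7.6, low 8.65, as reported
```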
Experiment 4

This experiment looked at explanations for the first three experiments that do not involve attributions made to the gambling device. One possibility is that any kind of interruption would lessen the effect; each of the first three experiments introduced a jarring interruption just after the sequence of four reds. Would the effect decrease, for example, if the experimenter slammed a door or if the lights suddenly went out? Another possibility is that subjects created ad hoc categories (cf. Barsalou, 1983) of outcomes and that it is those category-delimited sequences that subjects believe should be representative. Thus, the effect of the gambler's fallacy might be lessened when a nickel is swapped for a quarter, not because the coin itself is involved but because "outcomes generated by the quarter" is a likely ad hoc category. Interruptions might be one way that ad hoc categories are created.

This experiment relied on an obvious category not tied to the gambling device. Suppose chips are drawn alternately from each of three urns. This is much like the situation in Experiment 1, but here the gambling device is changed after every trial. If the fallacy emerges even though the device changes on every trial, the results support the idea that subjects are forming ad hoc categories, in this case "outcomes generated by the three urns."

Method

Subjects
One hundred and twenty-six subjects from the University of Pittsburgh participated in the experiment for extra credit in an introductory communications course.

Materials
Gambling Devices. A blue, a green, and a pink urn, each filled with 200 white and 200 red poker chips, served as the gambling devices.

Printed Materials. Subjects recorded, for each trial, which urn was used, their choices, the outcome of the trial, and the number of points won on one of two versions of an answer sheet.

Procedure
Subjects made 40 bets on whether red or white chips would be drawn from three differently colored urns. For each bet, an experimenter announced which urn would be used, drew a single chip from one of the three urns, held the chip up, announced the color, and returned the chip to the urn. Before each bet, subjects indicated, on their answer sheets, which urn would be used and whether they wanted a sure 70 points or 100 points if the draw resulted in the chip color specified on the answer sheet. After the draw, subjects indicated what color chip was picked and the number of points won.
Table 2.17 shows, for the first 34 trials, the color of the urn used, the winning chip color, and the rigged outcome. Trials 31 through 34 all resulted in a draw of a red chip; thus, trial 35 was the critical trial. For this trial, subjects were broken up into two groups: About half of the subjects won the bet if a red chip was drawn, and the others won if a white chip was drawn.

Table 2.17 The First 34 Trials of Experiment 4

Trial   Urn Used   Rigged Outcome   Winning Outcome
1       Green      White            Red
2       Blue       Red              White
3       Pink       Red              Red
4       Green      White            Red
5       Blue       White            White
6       Pink       Red              White
7       Green      White            Red
8       Blue       Red              Red
9       Pink       White            White
10      Green      White            Red
11      Blue       White            White
12      Pink       White            Red
13      Green      Red              White
14      Blue       White            White
15      Pink       Red              White
16      Green      Red              Red
17      Blue       Red              Red
18      Pink       White            Red
19      Green      Red              Red
20      Blue       White            White
21      Pink       White            White
22      Green      White            Red
23      Blue       Red              White
24      Pink       White            White
25      Green      Red              Red
26      Blue       White            Red
27      Pink       Red              Red
28      Green      Red              White
29      Blue       White            White
30      Pink       White            Red
31      Green      Red              Red
32      Blue       Red              White
33      Pink       Red              White
34      Green      Red              Red
The subjects were run in two conditions. For groups I and II, the draws were taken in rotation from the green, blue, and pink urns, and on the 35th trial this pattern remained unbroken. For groups III and IV, however, the experimenter broke the pattern by drawing a chip from the pink urn. This sequence is shown in Table 2.18.

Table 2.18 Trial 35 of Experiment 4

Group   Urn Used   Rigged Outcome   Winning Outcome
I       Blue       White            Red
II      Blue       White            White
III     Pink       White            Red
IV      Pink       White            White

Design
The experiment was made up of two factors: (1) whether the urns were used in a patterned or unpatterned way and (2) whether subjects were required to bet on the due color or the color that was not due. The dependent measure was whether subjects took the gamble or the sure thing. The prediction was that if subjects were using ad hoc categories, or if they were delimiting the sequences based on interruptions, then the patterned use of the urns would show an effect of the fallacy and the unpatterned use would not.

Results
A three-way contingency table was constructed based on subjects' responses to trial 35. The top part of Table 2.19 shows the responses of groups I and II (the subjects for whom the pattern of draws was kept), and the middle part shows the responses of groups III and IV (the subjects for whom the pattern was broken). Neither half showed an effect of the fallacy (kept pattern: χ²(1) = .88, n.s.; broken pattern: χ²(1) = 2.08, n.s.), and Bartlett's test showed no significant difference between the two halves (z = .54, n.s.). The subjects were apparently not creating ad hoc categories. A power analysis (Cohen, 1988), summarized in the bottom part of Table 2.19, showed that a large effect would have been discovered with a probability of .99 in the kept pattern condition and .94 in the broken pattern condition. Given the robustness of the gambler's fallacy, the power analysis suggests that there was no effect to be found.
Table 2.19 Subject Responses to Trial 35 of Experiment 4

Kept Pattern
                   Winning outcome was due   Losing outcome was due
Chose sure thing   14                        18
Chose gamble       22                        16

Broke Pattern
                   Winning outcome was due   Losing outcome was due
Chose sure thing    8                        12
Chose gamble       23                        13

Power Analysis
                Kept Pattern   Broke Pattern
Small effect    .13            .11
Medium effect   .71            .56
Large effect    .99            .94
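The power figures can be approximated from first principles: for a chi-square test, Cohen's effect size w implies a noncentrality parameter of lambda = N × w². The sketch below is an editorial reconstruction assuming the conventional w = .1, .3, .5 and the cell totals of Table 2.19 (N = 70 kept, N = 56 broke); it reproduces the kept-pattern column exactly and comes close on the broken-pattern column, where table lookup and rounding in Cohen (1988) plausibly account for the small differences.

```python
from scipy.stats import chi2, ncx2

def chi2_power(w, n, df=1, alpha=0.05):
    """Power of a chi-square test for Cohen's effect size w with n
    observations: noncentrality lambda = n * w**2 (Cohen, 1988)."""
    critical = chi2.ppf(1 - alpha, df)
    return ncx2.sf(critical, df, n * w ** 2)

for n in (70, 56):  # assumed totals: kept pattern, broke pattern
    print(n, [round(chi2_power(w, n), 2) for w in (0.1, 0.3, 0.5)])
# 70 -> [0.13, 0.71, 0.99], matching the table
# 56 -> [0.12, 0.61, 0.96], versus the table's .11, .56, .94
```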
Experiment 5

Why do subjects expect tails after four heads but expect another win after four wins? The answer, in line with the arguments presented in this chapter, might be that subjects attribute different behaviors to gamblers and to gambling devices. In some sense, the job of the coin is to be fair, and the job of the gambler is to win as often as possible. Oldman (1974) reported that, at one British club, croupiers who were on a losing streak were temporarily replaced with employees of higher status, as if the croupiers were somehow responsible for their losses. Keren and Wagenaar (1985) interviewed blackjack players, confirming that gamblers believe they should bet heavily when on a winning streak and reduce their bets when losing. Gilovich, Vallone, and Tversky (1985) argued that people see, in basketball, a "hot hand" of successful baskets even when the streak is just chance. Langer and Roth (1975) showed that subjects bet more when a series of gambles begins with a burst of wins. Jones, Rock, Shaver, Goethals, and Ward (1968) likewise showed that a descending pattern of wins (i.e., more wins at the beginning of the series of gambles) led to a greater expectation of winning than either an ascending or a random pattern of wins. Greenberg and Weiner (1966) found that the amount of money won or lost did not influence betting behavior but that the ratio of wins to losses had a profound effect.

Ayton and Fischer (2004) showed that subjects exhibited the gambler's fallacy based on the outcome of a roulette wheel but believed in a hot hand based on their own betting successes and failures. They also gave
subjects sequences of outcomes, explaining that the sequences were generated either by an athlete (or sports team) or by a random gambling device. Sequences with positive recency were attributed to processes tied to human performance, whereas sequences with negative recency were attributed to chance processes.

For this experiment, subjects were divided into two groups, and both groups eventually won four times in a row. The only difference between the groups was whether the winning could be attributed to the workings of a gambling device or to the performance of the subjects. This was done by creating a special coin for one of the two conditions, with the word win on one side and the word lose on the other side. In this condition, a win would be associated with the gambling device instead of the gambler.

Method

Subjects
Ninety-one students from an introductory course in communications at the University of Pittsburgh participated for extra credit. One subject did not turn in the experimental materials.

Materials
Gambling Devices. The experimenter used two gambling devices: a regular quarter and a slug, the size of a quarter, with the word win printed on one side and the word lose printed on the other side.

Printed Materials. Two versions of an answer sheet were used. The answer sheets allowed subjects to indicate their choices for 18 gambles, the outcomes of the gambles, and the number of points won.

Procedure
Two groups of subjects made choices on the outcomes of 18 coin flips; one group made choices for gambles using the regular quarter, and the other group bet on the outcomes of flips using the specially prepared slug. Both groups played for the same period of time and eventually won four bets in a row. The groups differed, however, in what counted as a win. Subjects in the first group won a bet if the coin came up heads or tails, as specified on the answer sheet. In the second group, subjects won if the word win came up. For each bet, subjects chose between a sure 70 points or a gamble for 100 points. After each flip, subjects recorded whether the flip was a winning or losing flip and the number of points won. The two versions of the answer sheet differed only in what was specified as winning the gamble.
Table 2.20 The First 12 Flips of Experiment 5

Flip   Group I Coin Outcome   Group I Winning Outcome   Group I Result   Group II Coin Outcome
1      Heads                  Tails                     Lose             Lose
2      Tails                  Tails                     Win              Win
3      Tails                  Heads                     Lose             Lose
4      Heads                  Heads                     Win              Win
5      Tails                  Tails                     Win              Win
6      Heads                  Tails                     Lose             Lose
7      Tails                  Heads                     Lose             Lose
8      Tails                  Tails                     Win              Win
9      Heads                  Heads                     Win              Win
10     Tails                  Tails                     Win              Win
11     Heads                  Heads                     Win              Win
12     Tails                  Heads                     Lose             Lose
Table 2.20 shows, for each group, the outcomes of the coin flips and the winning outcomes. All subjects were presented with the same sequence of wins and losses, and subjects in both groups won flips 8 through 11. Their choices for flip 12 were of particular interest.

Design
The design consisted of one factor: whether the regular or the special coin was used. The dependent measure was whether subjects took the gamble or the sure thing on the critical trial. The data were analyzed as a single 2 × 2 contingency table. The prediction was that subjects gambling on the special coin would gamble more conservatively than subjects using the regular coin.

Results
The subjects' responses for flip 12 were used to construct Table 2.21. Subjects who bet on the regular quarter tended to take the gamble, but subjects who bet on the specially prepared coin tended to take the sure thing (χ²(1) = 6.82, p < .01). The conclusion is that subjects attributed to the special coin (with the words win and lose) the responsibility for generating a balanced sequence of wins and losses, whereas with the regular coin, the subjects saw themselves as responsible for their own wins.
Table 2.21 Subjects' Responses to Flip 12 of Experiment 5

                   Special Coin   Regular Coin
Chose sure thing   40             15
Chose gamble       15             20
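(An editorial check: the same Yates-corrected chi-square used for Experiment 3 reproduces this value from the cell counts of Table 2.21.)

```python
from scipy.stats import chi2_contingency

# Table 2.21: rows = (chose sure thing, chose gamble),
# columns = (special coin, regular coin)
chi2, p, dof, expected = chi2_contingency([[40, 15], [15, 20]])
print(round(chi2, 2), round(p, 3))  # 6.82, p < .01, as reported
```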
Overall Conclusion

Together, the experiments presented here suggest that the gambling device plays a role in determining whether people accept or reject the gambler's fallacy and that people attribute intention, volition, and memory to inanimate devices. The first three experiments asked whether an inanimate coin is treated as having animate-like memory and set up situations in which subjects behaved as though it does. The first two experiments show that the expectation of tails following four heads applies only to the coin that generated the four heads; a substitute coin has no memory of the run. The third experiment demonstrates that the effect of the gambler's fallacy diminishes over time, a characteristic of human memory. The fourth experiment argues against ad hoc categories as an alternative explanation to animism. Finally, the last experiment considers additional characteristics of the gambling device: a streak of four wins does not normally produce the gambler's fallacy, but it does when winning and losing are tied to the behavior of the coin rather than to the "skill" of the bettor.

Much research in behavioral decision theory addresses how people deal with difficult judgments. In those cases, a normative approach is infeasible, and people must do something to make the problem tractable. In contrast, judgments like estimating the probability of a coin flip are easy; in fact, the judgment could hardly be easier. In these experiments, subjects notably made the problem more difficult, bringing in unnecessary computational baggage. People here are not just estimating probabilities; they are interacting with the objects around them and sometimes anthropomorphizing those objects. They attribute human-like qualities, such as frail memory, to simple coins and assign to those coins the job of keeping the outcomes balanced. As Robyn Dawes put it, the problem is that "even if the coin did have memory, it wouldn't have the musculature to affect the outcome."
References

Ayton, P., & Fischer, I. (2004). The hot hand fallacy and the gambler's fallacy: Two faces of subjective randomness. Memory & Cognition, 32, 1369–1378.
Barsalou, L. W. (1983). Ad hoc categories. Memory & Cognition, 11, 211–227.
Bell, C. R. (1954). Additional data on animistic thinking. Scientific Monthly, 79, 67–69.
Branscomb, L. M. (1979). The human side of the computer. Paper presented at the Symposium on Computer, Man and Society, Haifa, Israel.
Brown, L. B., & Thouless, R. H. (1965). Animistic thought in civilized adults. The Journal of Genetic Psychology, 107, 33–42.
Bullock, M. (1985). Animism in childhood thinking: A new look at an old question. Developmental Psychology, 21, 217–225.
Bush, R. R., & Morlock, H. C. (1959). Test of a general conditioning axiom for human two-choice experiments (Memorandum MP-1). Department of Psychology, University of Pennsylvania.
Caporael, L. (1986). Anthropomorphism and mechanomorphism: Two faces of the human machine. Computers in Human Behavior, 2, 215–234.
Caramazza, A., McCloskey, M., & Green, B. (1981). Naive beliefs in "sophisticated" subjects: Misconceptions about trajectories of objects. Cognition, 9, 117–123.
Champagne, A. B., Klopfer, L. E., & Anderson, J. H. (1980). Factors influencing the learning of classical mechanics. American Journal of Physics, 48, 1074–1079.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Colle, H. A., Rose, R. M., & Taylor, H. A. (1974). Absence of the gambler's fallacy in psychophysical settings. Perception & Psychophysics, 15, 31–36.
Cranell, C. W. (1954). The responses of college students to a questionnaire on animistic thinking. Scientific Monthly, 78, 54.
Crowell, D. H., & Dole, A. A. (1957). Animism and college students. Journal of Educational Research, 50, 393–395.
Dennett, D. (1980). Brainstorms: Philosophical essays on mind and psychology. Cambridge, MA: MIT Press.
Dennis, W. (1953). Animistic thinking and reasoning among college and university students. Scientific Monthly, 76, 247–249.
Dennis, W. (1957). Animistic thinking among college and high school students in the Near East. Journal of Educational Psychology, 48, 193–198.
Dennis, W., & Mallinger, B. (1949). Animism and related tendencies in senescence. Journal of Gerontology, 4, 218–221.
Dennis, W., & Russell, R. W. (1940). Piaget's questions applied to Zuni children. Child Development, 11, 181–187.
DiSessa, A. A. (1982). Unlearning Aristotelian physics: A study of knowledge-based learning. Cognitive Science, 6, 37–75.
Dolgin, K. G., & Behrend, D. A. (1984). Children's knowledge about animates and inanimates. Child Development, 55, 1646–1650.
Fienberg, S. E. (1977). The analysis of cross-classified categorical data. Cambridge, MA: MIT Press.
Friedman, M. P., Carterette, E. C., Nakatani, L., & Ahumada, A. (1968). Comparisons of some learning models for response bias in signal detection. Perception & Psychophysics, 3, 5–11.
Gelman, R., & Spelke, E. (1981). The development of thoughts about animate and inanimate objects: Implications for research on social cognition. In J. Flavell & L. Ross (Eds.), Social cognitive development (pp. 43–66). New York: Cambridge University Press.
Gilovich, T., Vallone, R., & Tversky, A. (1985). The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology, 17, 295–314.
Greenberg, M. G., & Weiner, B. (1966). Effects of reinforcement history upon risk-taking behavior. Journal of Experimental Psychology, 71, 587–592.
Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world: The psychology of judgment and decision making. Thousand Oaks, CA: Sage.
Jones, E. E., Rock, L., Shaver, K. G., Goethals, G. R., & Ward, L. M. (1968). Pattern of performance and ability attribution: An unexpected primacy effect. Journal of Personality and Social Psychology, 10, 317–341.
Keren, G., & Wagenaar, W. A. (1985). On the psychology of playing blackjack: Normative and descriptive considerations with implications for decision theory. Journal of Experimental Psychology: General, 114, 133–158.
Langer, E. J., & Roth, J. (1975). Heads I win, tails it's chance: The illusion of control as a function of the sequence of outcomes in a purely chance task. Journal of Personality and Social Psychology, 32, 951–955.
Looft, W. R., & Bartz, W. H. (1969). Animism revived. Psychological Bulletin, 71, 1–19.
Lowie, D. E. (1954). Additional data on animistic thinking. Scientific Monthly, 79, 69–70.
Lucas, A. M., Linke, R. D., & Sedgwick, P. P. (1979). Schoolchildren's criteria for "alive": A content analysis approach. Journal of Psychology, 103, 103–112.
Massey, C. M. (1988). The development of the animate-inanimate distinction in preschoolers (Doctoral dissertation, University of Pennsylvania). DAI-B 50/02, Publication No. AAT890836.
McClelland, G. H., & Hackenberg, B. H. (1978). Subjective probabilities for sex of next child: U.S. college students and Philippine villagers. Journal of Population, 1, 132–147.
Metzger, M. A. (1985). Biases in betting: An application of laboratory findings. Psychological Reports, 56, 883–888.
Minsky, M. (1967). Why programming is a good medium for expressing poorly understood and sloppily formulated ideas. In M. Krampen & P. Seitz (Eds.), Design and planning II (pp. 120–125). New York: Hastings House.
Oldman, D. (1974). Chance and skill: A study of roulette. Sociology, 8, 407–426.
Piaget, J. (1928). The child's conception of the world. Totowa, NJ: Littlefield, Adams, & Company.
Russell, R. W. (1942). Studies in animism: V. Animism in older children. Journal of Genetic Psychology, 60, 329–335.
Russell, R. W., & Dennis, W. (1939). Studies in animism: I. A standardized procedure for the investigation of animism. Journal of Genetic Psychology, 55, 389–400.
Russell, R. W., Dennis, W., & Ash, F. E. (1940). Studies in animism: III. Animism in feeble-minded subjects. Journal of Genetic Psychology, 57, 57–63.
Scheibe, K. E., & Erwin, M. (1979). The computer as alter. Journal of Social Psychology, 108, 103–109.
Searles, H. F. (1962). The differentiation between concrete and metaphorical thinking in the recovering schizophrenic patient. Journal of the American Psychoanalytic Association, 10, 22–49.
Shanon, B. (1976). Aristotelianism, Newtonianism and the physics of the layman. Perception, 5, 241–243.
Toulmin, S. (1961). Foresight and understanding. New York: Harper & Row.
Tversky, A., & Kahneman, D. (1971). Belief in the law of small numbers. Psychological Bulletin, 76, 105–110.
Wagenaar, W. A. (1988). Paradoxes of gambling behaviour. Hillsdale, NJ: Lawrence Erlbaum.
Weizenbaum, J. (1976). Computer power and human reason. San Francisco: W. H. Freeman.
3
Being an Advocate for Linear Models of Judgment Is Not an Easy Life

Hal R. Arkes
The Ohio State University
When I accepted my first tenure-track faculty position in 1972, I was already familiar with an article Robyn Dawes had authored in the American Psychologist in 1971. It pertained to the graduate admissions committee of the Department of Psychology at the University of Oregon. Dawes found that a very simple linear combination of a student's undergraduate GPA, GRE scores, and the selectivity of the undergraduate school he or she attended predicted progress in the Oregon graduate program better than did the ratings of the graduate admissions committee. Subsequent publications by Dawes and Corrigan (1974) and Dawes (1979) strengthened the conclusion of that first article: Decide what the criteria are, rate each candidate on those criteria, and then add up the ratings. Do not try to make "holistic" or clinical judgments.

I was completely convinced by Dawes's data. Because I was not then in the research area that has come to be known as judgment and decision making (JDM), my interest in Dawes's research was similar to my interest in astronomy: I was fascinated by it even though it seemed not to have any practical application to my personal situation. Over the next decade my career path changed when I decided that JDM was the research area I wanted to pursue wholeheartedly. In 1987 I became the chair of Ohio University's Department of Psychology, and I decided to apply Dawes's linear model strategy to some of our department's own decision making. First, I was very pleased that the
department's Clinical Section was already using a linear model in its graduate student application process. The Clinical Section annually received about 1,000 preliminary applications to its graduate program. These applications required the applicant to provide only a few pieces of data. Based on the output of a simple linear regression equation, each applicant was either encouraged or discouraged from submitting a full application. The regression equation, of course, was based on data from prior years' applicants. The preliminary application was free, but the full application was expensive; I was pleased that the Clinical Section had adopted this benevolent strategy, which gave applicants cost-free feedback. I thought that converting my local world to a Dawesian way of looking at things might not be as difficult as I had feared.

However, I learned that the regression equation was used only to reduce the pool of applicants to the approximately 100 who were encouraged to submit the full application; the Clinical Section was quite opposed to using the equation to decide which 10 of those 100 should be offered admittance. The faculty preferred to rate all 100 finalists in a holistic manner, even though all of the raters admitted that this was a very time-consuming task. I did not press the point, and I turned my attention to another judgment task, one that was under my sole control.

Each year the department hired a few lecturers who would teach several sections of our introductory statistics course. During my first year as chair, we had the opportunity to rehire a person who had earned spectacular teaching ratings the year before. His application was being considered amidst a pool of other applications. The committee that considered the applicants rejected this highly rated teacher; one committee member gave him an extremely low rating because she divined a lack of enthusiasm in his cover letter. I decided that we would not have a committee the following year; I rated everyone myself. First, I decided on the criteria. Then I rated each candidate on those criteria. Then I added up the ratings. A few faculty members were quite surprised when I told them that I was not going to interview anyone. I explained to them the general philosophy of linear models, the results of the Dawes (1971) analysis, and the highly readable summary in Dawes (1979). Using the linear model method, I hired lecturers for the next three years. I am pleased to say that the people we hired were excellent statistics instructors.

My career took a new turn in 1993 when I temporarily left academia. From 1993 until 1995, and again from 1998 to 2000, I codirected the Program in Decision, Risk, and Management Science at the National Science Foundation (NSF). I was impressed with the semiannual panel meetings at which eight or nine of the best researchers in the field
My career took a new turn in 1993 when I temporarily left academia. From 1993 until 1995 and again from 1998 to 2000 I codirected the Program in Decision, Risk, and Management Science at the National Science Foundation (NSF). I was impressed with the semiannual panel meetings at which eight or nine of the best researchers in the field
considered approximately 100 proposals. The discussions were always serious and thoughtful. However, after the first two panel meetings I noted something that had also attracted the attention of the General Accounting Office (GAO) in its 1994 report, "Peer Review: Reforms Needed to Ensure Fairness in Federal Agency Grant Selection." On page 84 of the report the GAO criticized the fact that panel members exhibited different levels of strictness and leniency when evaluating their assigned proposals. Because only two panelists read each proposal, a principal investigator whose proposal had been assigned to the strictest panelists was less likely to be awarded a grant than one whose proposal had been assigned to the most lenient. At one panel meeting I converted each panelist's ratings to personal z-scores and prepared a new rating sheet. This sheet did not contain, for each proposal, the average of the two panelists' absolute ratings on the five-point NSF scale; instead it contained the average of the two panelists' z-scores. In this way each panelist's ratings were placed on the same calibrated scale. One proposal that would not otherwise have been funded was funded, because the strict panelist who gave it a mediocre "good" pointed out that her "good" had approximately the same z-score as the "very good" of another, lenient panelist. Therefore her "good" should not be considered a middling rating.

Three years later, when I returned to the NSF, our panel had to make a difficult decision between two candidates for a prestigious award. The panel chose candidate A over B. At the NSF the panel's recommendation is only advisory to the program officers, so my codirector and I took a closer look at the rating data. When we converted the panelists' ratings to z-scores, we saw that A had enjoyed the benefit of two very lenient raters, whereas B had not. B's average z-score was higher than A's. We reversed the panel's recommendation.

Blackburn and Hakel (2006) recently examined the ratings of 1,983 posters submitted to three professional conferences. The original ratings were not done using z-scores, but Blackburn and Hakel converted them to z-scores and compared the new ratings to the original ones. Between 17 and 20% of the original accept-or-reject decisions would have been reversed had z-scores been used.
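The calibration procedure is easy to state precisely. The following sketch uses invented panelists and ratings rather than actual NSF data, and it assumes the convention that 1 is the best rating on the five-point scale.

```python
import numpy as np

def personal_z_scores(panelist_ratings):
    """Re-express one panelist's ratings as z-scores computed against
    that panelist's own mean and standard deviation, so that a strict
    and a lenient rater end up on the same calibrated scale."""
    scores = np.array(list(panelist_ratings.values()), dtype=float)
    mu, sigma = scores.mean(), scores.std(ddof=1)
    return {p: (r - mu) / sigma for p, r in panelist_ratings.items()}

# Two hypothetical panelists rate the same four proposals on a
# five-point scale (1 = best). The strict rater's "2" and the lenient
# rater's "1" both fall well below their personal means, so the two
# ratings carry roughly the same calibrated meaning.
strict = personal_z_scores({"P1": 2, "P2": 3, "P3": 4, "P4": 5})
lenient = personal_z_scores({"P1": 1, "P2": 1, "P3": 2, "P4": 3})

# The calibrated rating for each proposal is the mean of the z-scores.
calibrated_P1 = (strict["P1"] + lenient["P1"]) / 2
```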
Before I left the NSF in 1995 three other events occurred that pertained to policy decisions concerning the evaluation of proposals. First, a program officer showed me a letter he had received from Robyn Dawes advocating disaggregated ratings of NSF proposals. In 1995 the NSF had four criteria for proposal evaluation, but only one overall holistic evaluation was required. Dawes was suggesting that each of the four criteria be rated separately—in "disaggregated" fashion—and that the
resulting four ratings then be summed. Of course, this suggestion followed directly from Dawes's research on the advantages of linear models over holistic or "clinical" ratings. I decided to investigate this idea by prevailing upon another program officer to allow me to do a research project at his upcoming panel meeting, at which dissertations would be evaluated. I asked the program officer to have his panelists do two things. First, they should provide the usual overall rating based on the four NSF criteria for dissertations. Second, they should rate each dissertation proposal on each of the four criteria. There were four panelists and 70 proposals. I simply regressed each panelist's overall ratings on the ratings that same panelist gave on the four criteria. The data showed that three of the four panelists were using the official NSF criteria to a significant degree. The fourth panelist was not. Such misbehavior was also noted in the GAO report, in which the NSF and two other agencies were suspected of allowing stealth criteria to influence panel evaluations. To the extent that such unofficial criteria pollute the ratings, the official ones will play a smaller role. This is unfair to those who are unaware of the secret criteria. The program officer noted the mischief I had uncovered and indicated that he would try to rectify the situation. I was optimistic that the Dawesian way of looking at evaluation could improve the proposal rating process at federal agencies.

Nearly simultaneously, two other events occurred that buoyed my hopes in this regard. First, prompted by the GAO report, the NSF decided that it was time to examine its own evaluation policy. The director of the Social, Behavioral, and Economic Science Directorate had been invited to participate in this enterprise, but she asked me to take her place. I joined this group, which was very ably chaired by the deputy director of the NSF, Anne Peterson, also a psychologist. I made a formal presentation to this group, which consisted of a half-hour lecture on such Dawesian topics as linear models, the virtues of disaggregated judgments, and the use of z-scores, plus one more topic. I pointed out that a program should not burden its panelists with discussion of proposals whose ratings were so low that they stood no chance of being funded or so high that they were definitely going to be funded; the panel should discuss only those proposals close to the border. I cited the research of Klahr (1985), who had analyzed the evaluations of the panel of the Human Cognition and Perception Program. Klahr found that if any proposal with an average panel rating better than 1.5 had been automatically funded, and any proposal with a rating worse than 3.5 automatically rejected, not a single decision of the panel would have been overruled.
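A triage rule of the kind suggested by Klahr's analysis takes only a few lines. The sketch below assumes the convention of that scale, on which 1 is the best rating and 5 the worst; the cutoffs are the ones reported by Klahr (1985).

```python
def triage(mean_panel_ratings, fund_cutoff=1.5, reject_cutoff=3.5):
    """Automatically fund proposals rated better (lower) than the fund
    cutoff, automatically reject those rated worse (higher) than the
    reject cutoff, and reserve panel discussion for the border zone."""
    fund = [p for p, r in mean_panel_ratings.items() if r < fund_cutoff]
    reject = [p for p, r in mean_panel_ratings.items() if r > reject_cutoff]
    discuss = [p for p, r in mean_panel_ratings.items()
               if fund_cutoff <= r <= reject_cutoff]
    return fund, discuss, reject
```

Cutoff scores for each program's panel could be generated by rerunning an analysis like Klahr's on that panel's past decisions.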
The panel's workload would have been drastically reduced, however. My recommendation was to do an analysis similar to Klahr's to generate cutoff scores for the panels of each program. I also recommended that the NSF perform an experiment suggested by Dawes. Panels A and B would each independently evaluate the same set of proposals using the old system of assigning one holistic rating to each proposal. A correlation would then be calculated between the ratings given by these two panels to these proposals. Two other panels—C and D—would evaluate the same proposals that were reviewed by panels A and B. However, panels C and D would use disaggregated ratings. Their members would rate each proposal on each of the four NSF criteria, the ratings would be summed, and a correlation would be calculated between the evaluations of the two panels. I suggested that if the correlation between the ratings of the disaggregated panels was statistically higher than the correlation between the ratings of the holistic panels, the NSF should switch to disaggregated ratings. I explained that this recommendation was based on the assumption that higher reliability of the panelists' ratings was a worthwhile goal. Furthermore, the maximum validity coefficient between two variables is equal to the square root of the product of their reliabilities (Kaplan & Saccuzzo, 1997, p. 148). Therefore if disaggregated ratings led to higher reliability than holistic ratings, it was also very likely that disaggregated ratings would foster higher validity. (I also recommended that when this experiment was actually performed, all four panels be told that the ratings of one of the panels would be used to make the actual funding decisions, but that the selection of the panel to make the real decision would be randomly determined at the end of the experiment.)
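The psychometric bound invoked here is the standard attenuation result. If $r_{xx}$ and $r_{yy}$ are the reliabilities of two measures, their correlation cannot exceed

$$ r_{xy} \le \sqrt{r_{xx}\, r_{yy}} . $$

To make the arithmetic concrete (the numbers are illustrative, not NSF data): a rating procedure with reliability .44 can correlate with a criterion measure of reliability .80 no more highly than $\sqrt{.44 \times .80} \approx .59$, whereas a procedure with reliability .18 is capped at $\sqrt{.18 \times .80} \approx .38$. Higher reliability thus raises the ceiling on validity, even though it does not guarantee it.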
The reaction of the small working group ranged from genuine interest to healthy skepticism about the empirical basis for my suggestions. A biologist seemed surprised that merely adding up some impersonal numbers could be superior to a holistic judgment of an expert. After all, the expert could take into account various nuances that might elude a rigid formula. I left the NSF in July of 1995, and this committee was expanded into a larger group that made recommendations to the NSF board in February of 1996. I felt gratified that this group's recommendations included some of my suggestions. However, a year later the NSF board rejected every one of these recommendations and merely adopted two new NSF criteria to replace the quartet of old ones.

At nearly the same time that the NSF was reassessing its evaluation process, National Institutes of Health (NIH) director Harold Varmus convened a committee to do the same thing at his agency. The Review of Grant Application Committee (RGA) was formed. Robyn Dawes was
asked to join the group but was unable to participate personally. He did serve as an outside consultant. Robyn suggested that I serve as a committee member in his stead, which I did beginning in January of 1995. This group also benefited from fine leadership, thanks to the participation of Walter Stolz and Hugh Stamper. This committee was larger than the corresponding NSF one, and it contained mainly NIH administrative personnel. The issues at the NIH were similar to those at the NSF. Calibration of reviewers, the suspected use of unofficial criteria, and the use of holistic rather than disaggregated ratings were all discussed by the RGA members. My recommendations to the NIH were similar to the ones I had made at the NSF. In addition, I brought in articles from the psychological literature suggesting that the NIH's use of a 150-point rating scale was probably not optimal (Landy & Farr, 1980, p. 87; Cicchetti, Showalter, & Tyrer, 1985, p. 31). These investigators showed that if the number of points on a rating scale extends beyond approximately seven, rater reliability either drops or fails to increase.

The reaction to my suggestions was different at the NIH than at the NSF. At the NSF the initial committee members were receptive to the data from the psychological literature. The NIH, however, was not receptive. A "bulletin board" was set up to receive comments from the NIH community. The responses were overwhelmingly negative. I think many of the respondents were people who had been highly successful in receiving NIH funding, and they therefore were not eager to see the evaluation procedure changed. Also, some totally unsupported hypotheses were advanced that reflected negatively on the use of disaggregated ratings. For example, a popular hypothesis was that if the NIH used disaggregated ratings, reviewers would not write comments as expansive or as helpful as the remarks they might make using one holistic rating. Although there were absolutely no data to support this conjecture, it elicited approving nods from people who were looking for some basis for rejecting disaggregated ratings. During one phone call I offered the news that the NSF was considering the use of disaggregated ratings based on the superiority of such ratings demonstrated in the literature. One senior NIH administrator loudly told everyone on the conference call of consultants that she did not care what the NSF thought. Another NIH administrator publicly announced that my advocacy of disaggregated ratings caused less agreement than the NIH liked to have when considering such policy matters. Another person surprised me by saying that he did not want any explicit criteria, and that by suggesting the use of disaggregated ratings, I would be forcing raters to attend to each specific criterion. I replied that each rater would
have to use either implicit criteria or explicit ones, and wouldn’t it be fairer to those who submitted proposals if they knew what the criteria actually were? Needless to say, the NIH opted not to do the experiment recommended by Dawes. They also rejected disaggregated ratings, calibration of reviewers, truncation of the 150-point rating scale, and everything else I suggested. I concluded that converting the federal agencies to a Dawesian approach to evaluation was not going to be as easy as I thought.
Brief Political Interlude
For those readers who are well-versed in evaluation, gloom might be deepening at this point. To forestall this despair I want to interject into the narrative a brief political lesson for those uninitiated in the workings of government. Having spent four years working in Washington, I want to assure everyone that the very large majority of the people working at federal agencies are conscientious individuals who do their best to serve the citizenry. However, many of them are called upon to do things for which they do not have substantial knowledge or training. For example, I met a person who had to make decisions about which educational proposals to fund. This person had no idea what the purpose of a control group was. She opted to provide a lot of your tax money to a proposal that had no control group to which one might compare the performance of the experimental group. I tried to point out to this person why a control group was essential, but I was unable to persuade her. My conjecture is that the content of this proposal was in an area that she believed needed funding, and providing money to this particular investigator would further this goal. In other words, she was trying to be a Good Person, but she lacked the knowledge or training needed to make the judicious decisions required of someone in her position.

Very often Congress passes "enabling legislation," which merely provides the most rudimentary guidelines as to what one of the federal agencies should do to address some problem. The agency reacts to this legislation by having well-meaning bureaucrats try to figure out how to achieve the lofty goals roughly outlined in the legislation. Without the benefit of any pilot data, the agency's employees—many of whom are not particularly knowledgeable about the topic being addressed by the legislation—concoct some program or procedure that they hope will achieve some benefit. No one in this chain of events is trying to waste money or perform poorly.
If Congress provides money to fund proposals in the domains of infectious disease, plasma physics, or nanotechnology, the agency hires some folks in these areas, locks them in a room for a day or two, and lets them out when they finish rating the proposals. This seems like a good procedure to achieve the goals mandated by Congress and the appropriate agency. Now some psychologist who knows nothing about infectious disease, plasma physics, or nanotechnology comes along and insists that disaggregated ratings are a good idea. The agency administrators, who know very little about the scientific literature on psychometrics, judgment and decision making, or evaluation, are understandably wary of overhauling the rating system currently used on tens of thousands of proposals. The Office of Management and Budget (OMB) would have to approve all of the new forms, many investigators who did well under the old evaluation system will get their senators and representatives to harass the agency administrators, and numerous oversight boards will have to be persuaded. It's so much easier just to tweak the criteria a little bit and send those psychologists home. I reluctantly feel a little sorry for the administrators who have to listen to folks like me and Robyn Dawes every 10 years or so when an overhaul of the evaluation procedure is considered. I also feel sorry for folks like me and Robyn Dawes.
The Saga Resumes
I returned to academia in the summer of 2000. I mulled over the various alibis used by opponents of the suggestions I had made to the NIH. (I was not in Washington when the NSF Science Board rejected its committee's suggestions, so I did not hear any of those objections firsthand.) Some of the NIH objections could not be parried, such as "We don't want criteria." My inability to parry this objection was due not to the high quality of the objection but to the fact that no data could be marshaled to address it. However, one prominent objection could be supported or refuted by data. That objection was the reasonable one that none of the studies showing the advantages of disaggregated over holistic ratings had used scientific proposals as the to-be-rated material. For example, Von Winterfeldt and Edwards (1986) reviewed studies that had demonstrated the advantages of disaggregated ratings in such areas as evaluating bank loans. No one had ever used NSF proposals as the to-be-rated material; perhaps the critics were right in their unwillingness to modify the agencies' evaluation systems until the relative performance of disaggregated and holistic ratings could be tested with materials more appropriate to those agencies' goals.
A very important problem had to be overcome, however. How would I decide whether the use of holistic or disaggregated ratings led to better decisions in evaluating proposals? Wouldn't I need a "gold standard" of proposal quality? Then I could ascertain whether the ratings given through the holistic procedure or the disaggregated procedure corresponded more closely to this gold standard. During many lunch conversations I had asked my fellow NSF program officers in the social sciences what they would consider to be a gold standard of proposal quality. We first considered the number of citations to the publications that emanated from each proposal: those proposals whose ensuing publications were cited more often would be deemed objectively higher in quality. The majority of program officers felt that this gold standard was much too difficult to calculate with any degree of confidence. In short, we never could agree on any objective gold standard of proposal quality. Some investigators who had done research in this area had examined not the validity of holistic and disaggregated ratings but their relative reliability (e.g., Ravinder, 1992). If, like Ravinder (1992), I could show that one method had superior reliability, then this superiority would be some evidence that the same method likely had higher validity, too. No variable can correlate with another variable more highly than it correlates with itself, so more reliable measures have more "room" for higher validity to be manifested.

After deciding that I would compare holistic and disaggregated ratings on reliability rather than validity, I now needed access to some proposals to make my research highly relevant to the NIH and NSF. I could get access to successful proposals at the two agencies; such proposals are in the public record. However, there are legal reasons why I could not get access to rejected proposals at the agencies. Thus, it did not appear that I could use actual proposals as the to-be-rated material in my experiment unless I used solely successful ones. The inability to gain access to rejected proposals would result in range restriction in the quality of the to-be-rated proposals. Also, recruiting enough professionals to rate lengthy proposals would be a problem whether such proposals were successful, unsuccessful, or both.

Then I considered submissions to professional conventions. Such submissions are much briefer than full NIH or NSF proposals. If I could get a professional society to grant me permission to use submissions to one of its annual conventions, then perhaps I could recruit enough society members to rate them, half using a holistic rating procedure and half using a disaggregated one. Of course, I would have to pay each rater to evaluate a lot of proposals. This meant I would need a grant to do this research, but before I could ask the NSF for the required funds,
I had to find a professional society that would give me access to the materials I would need. In 2000 the Society for Medical Decision Making (SMDM) was meeting in nearby Cincinnati. I asked the board if I could address them during their executive session to gain their permission to do some research using the proposals to be submitted for their next annual meeting, which would be held in San Diego in 2001. They kindly granted me permission to address them, and in October of 2000 I made my impassioned appeal at the board meeting. The reception was guarded, and a very reasonable objection was raised. Suppose that half of the submitted abstracts were rated lower using rating procedure A compared to B. The people whose proposals had been rated using method A would be understandably upset that their proposals had been thus disadvantaged. I could not think of a quick solution to this problem, and I was about to leave the meeting empty handed. Then Steve Pauker spoke up. He suggested that were I to limit my research to already accepted proposals, nobody could raise this objection. How would I obtain raters for already accepted proposals? The attendees at the convention could do the evaluations, half using holistic and half using disaggregated ratings. Permission granted. Of course, because I was limited to using accepted proposals, I might face the same range restriction problem that I would have faced were I to use only accepted rather than submitted NSF proposals. However, the fraction of NSF proposals accepted is far smaller than the fraction of convention abstracts accepted, so the problem would be less serious using the abstracts submitted to the SMDM convention. There was some aspect of revenge in submitting a proposal to the NSF to request funds to obtain the data needed to refute a main objection to disaggregated ratings—namely, that no study showing the superiority of disaggregated over holistic ratings had used scientific proposals as the to-be-rated material. The NSF had to agree that my proposed project was very important, because NSF personnel had spent a lot of time discussing this very topic during the prior years. I felt confident when I submitted the proposal and gratified when it was funded about a half-year later.
The Big Event I had done applied research before, and I knew that it involved a lot more logistical coordination than the amount needed to test a room full of college freshmen. However, this project was going to require a new level of organization. I obtained the permission of the SMDM president,
Myriam Hunink, and the conference coordinator, Michael Barry. I needed an advance copy of the program so I could prepare evaluation sheets for the disaggregated and holistic ratings of each paper to be presented at the 2001 convention. I asked the SMDM board to approve a list of criteria that would provide the basis for each paper's evaluation. I had to make sure that I had about as many raters in the holistic as in the disaggregated groups for each paper session. I had to get to the convention early to recruit as many raters as possible. I needed to coordinate with the SMDM staff to obtain e-mail addresses for the attendees so that I could recruit even before the convention began. I made arrangements to pay the raters when I returned from the convention. Through various means I was able to recruit 101 members of SMDM to rate one of the seven paper sessions. Each session contained about five papers. About half of the raters rated all of the presentations within one paper session using the holistic method; each rater provided one overall rating, and the criteria upon which this single rating was based were typed at the top of the rating sheet. The other half of the raters provided disaggregated ratings; they rated each paper on each of the five criteria. The five criteria were (1) Significance: Is the topic significant? Does it concern a scientifically important subject, or is it relevant for health policy? (2) Methods: Are the methods scientifically sound? (3) Results: Are actual results presented in enough detail and in an understandable way? (4) Conclusions: Do the conclusions follow from the results? Are they justified? Are the results generalizable? (5) Innovation: Is there something innovative about the presented material?

Two problems became apparent. To be in the experiment, participants had to agree to rate all of the papers within one session. A participant who wanted to listen to the first two papers in the session on pharmacoeconomics and the last three papers in the session on psychiatry and mental health simply could not take part in the study. One physician complained about a related issue: because it was necessary to be present for and rate all of the papers in one session, this research required outstanding bladder control on the part of the participants.

Applied research always has a number of unanticipated problems. The terrorist attack of September 11, 2001, created another predicament. One of the seven sessions was composed largely of foreign participants. Because many of them chose not to fly to the United States in October, their slots in the schedule were taken by others. Consequently the rating forms I had prepared for that session were incorrect. Those performing the holistic and disaggregated ratings during that session tried to write in the correct title and presenter for the substitute talks, but the raters were inconsistent in their ordering of the talks, so we could
not use the data from that session. After discarding the data from this session and the data from a few other wayward participants, I was left with 83 raters for 33 oral presentations occurring during 6 sessions. For each person who performed disaggregated ratings, we calculated the mean rating over the five criteria—this was the re-aggregated rating—and then calculated the mean Spearman correlation between that person's re-aggregated ratings and the re-aggregated ratings of every other person rating the same session using the disaggregated method. For each person who did holistic ratings, we calculated the mean Spearman correlation between that person's holistic ratings and the holistic ratings of every other person rating the same session using holistic ratings.

As can be seen in Table 3.1, in four of the six sessions the disaggregated method was superior, and in two of the sessions the holistic method was superior. However, in only three of the six sessions were the differences significant, and all three favored the disaggregated method. The mean Spearman correlations for the disaggregated method in the opening plenary, health economics, and pharmacoeconomics sessions were significantly (p < .06) higher than the corresponding correlations for the holistic ratings in those sessions. To compare the disaggregated raters in all six sessions with the holistic raters in all six sessions, we used the mean r-to-z transformed interrater correlations within each of the 12 groups (6 sessions × 2 modes of evaluation). Weighting these 12 means by the number of raters in their respective groups, the correlation for the disaggregated method (.44, 95% CI [.33, .56]) exceeded that of the holistic method (.18, 95% CI [.09, .27]); t(81) = 3.90, p < .001. So at last we had evidence that disaggregated ratings led to higher interrater reliability than did holistic ratings.
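For readers who want the computation spelled out, the sketch below shows how such interrater reliabilities can be obtained. The data structures are placeholders rather than our actual data; the full analysis appears in Arkes, Shaffer, and Dawes (2006).

```python
import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def mean_interrater_rho(session_ratings):
    """session_ratings maps each rater to a list of scores, one per
    paper in the session. For disaggregated raters each score is the
    mean over the criteria (the 're-aggregated' rating); for holistic
    raters it is the single overall rating. Returns each rater's mean
    Spearman correlation with every other rater in the session."""
    rhos = {rater: [] for rater in session_ratings}
    for a, b in combinations(session_ratings, 2):
        rho, _ = spearmanr(session_ratings[a], session_ratings[b])
        rhos[a].append(rho)
        rhos[b].append(rho)
    return {rater: float(np.mean(v)) for rater, v in rhos.items()}

def fisher_mean(correlations):
    """Average correlations via Fisher's r-to-z transform, then map
    the mean back to the correlation scale."""
    return float(np.tanh(np.mean(np.arctanh(correlations))))
```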
Table 3.1 Interrater Reliabilities within Each Group as Represented by Mean Spearman Rank-Order Correlations

                        Disaggregated Ratings    Holistic Ratings
Session                 n        Spearman        n        Spearman
Opening plenary*        10       .43             8        .01
Health economics*       6        .59             9        .02
Cost-effectiveness      8        .16             7        .27
Mental health           4        .51             4        –.16
Pharmacoeconomics*      4        .85             5        .52
Technology outcomes     10       .03             8        .29

* Within these sessions the disaggregated ratings had significantly higher interrater reliability than did the holistic ratings (p < .06).

The to-be-rated materials
were scientific presentations of research, which I thought were similar in nature to NIH and NSF proposals. I sent a copy of the results to a few people in the NIH. They did not respond. I do not take their lack of a response to be a discouraging sign. It is not time to reexamine the NIH evaluation process yet again, and when such reconsideration occurs, this research will already be in the literature (Arkes, Shaffer, & Dawes, 2006). Even if the NIH and NSF administrations don’t care to adopt the Dawesian view of evaluation, there might be some bottom-up pressure to do so. After I published a short history of my experience at the NIH and NSF (Arkes, 2003), I received about a half-dozen e-mails from people around the country who were involved with the evaluation of a variety of performances ranging from education to medicine. All of the e-mails had the same theme: (1) “I’m involved along with others in the evaluation of X.” (2) “We rate X holistically.” (3) “The craziest things enter into our evaluation.” (4) “It would be great if everyone just rated in a disaggregated manner using the official criteria.” (5) “Thanks for publishing this article.” I also received the following e-mail from a member of an NSF panel: Thought you might like this anecdote. A few weeks ago I was on a panel for the NSF (name deleted) program—one of 2 psych panels. They work a little differently—all reading of applications is done onsite, but there are the two vague intellectual merit/broader impacts criterion [sic] that no one seems to be able to give a good definition of (though the NSF people wander around and tut-tut about the questions a lot—I suppose that is LIKE a clear definition, only different). Anyway, the way it works is that everyone gets a rating on a 1.00 (perfect) to 4.99 (not worthy) scale (even worse than NIH’s). Throughout the process, the data specialists compute our z-scores, our ranges, and report them back to the panel chair, who is supposed to “sit on” outliers and get them to use the scale the way everyone else is (and there are BIG problems with that: they want 67% to be scored 4.00 to 4.99, with a verbal label of “does not merit support”— this year we had 250 applicants, and about 3 didn’t merit support). THEN, the bottom 67% are “retired” and the remaining files get a “third read.” THEN files which exceed one-point of discrepancy among the raters are identified, and the 3 raters are told to conference to attempt to resolve discrepancies. Then the files are rank-ordered by final scores (assuming the conferences result in changes), and the top 11% get an award, the next 20% either get an award or an honorable mention, etc.
Ok. Our panel rebelled (one of us—that would be me—waving your article around). We insisted on computing final averages by z-scores, we insisted on reading randomly distributed files, and we refused to conference (though we did re-read the discrepant ones just in case anyone wanted to recalibrate; there were few changes). The staff treated us like communists, arguing that they didn't want the program to average z-scores because they didn't have months to "test the algorithm," but we held firm. So we are no doubt hated, but our panel really bonded—it went the smoothest it's ever gone, and who knows—maybe next year the officials will "think of" using z-scores for everything. Thought you'd like tangible evidence of your inspirational effects.

I had two other e-mails from inside the NSF review process. In each case the NSF staff was wary of the spontaneous suggestions made by panelists in NSF's Social, Behavioral, and Economic Science Directorate. Members of these panels knew about my 2003 Psychological Science article in which I discussed my history with the NSF and NIH review processes. I fully understand the reluctance of the NSF staff to adopt rating suggestions brought up during the panel meeting. Those who submitted proposals to this panel would not know that the mode of evaluating their proposals would be different from what it had previously been. Perhaps legal action could be taken against the agency if the staff authorized these spur-of-the-moment suggestions. I think the changes should come from the top down, based on the weight of the scientific evidence, although apparently some insistent panelists were willing to compel change in the other direction.
Linear Models in Medicine Much of my own research is in the domain of medical decision making, where many studies have found that computer-based diagnostic support systems (DSSs) perform better than physicians in a wide variety of diagnostic contexts. For example, Corey and Merenstein (1987) showed that use of a predictive index for acute cardiac ischemia resulted in far more accurate classification of patients than occurred when physicians did not use the index. Ridderikhoff and van Herk (1997) found that a diagnostic support system used in a general practice more than doubled the diagnostic accuracy of unaided physicians. Reviews by Kaplan (2001) and Hunt, Haynes, Hanna, and Smith (1998) confirm that although many studies verify the superiority of DSSs in the diagnostic process, some studies do not (Kassirer, 1994). However, there is
unanimity with regard to one characteristic of diagnostic support systems: they are grossly underutilized (Kaplan, 2001). Many diagnostic aids are rather simple linear amalgamations of signs and symptoms, such as the Alvarado Score (Alvarado, 1986). To calculate this score one simply adds up the points assigned to various symptoms. For example, having a high white blood cell count warrants two points, and having a fever above a certain level garners one point. If a patient has nine or more points, then it is very probable that the patient has appendicitis.
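To show just how simple such an aid is, here is the Alvarado Score as a few lines of code. The point values follow the commonly published MANTRELS summary of the score; Alvarado (1986) is the authoritative source.

```python
# Point values as commonly summarized under the MANTRELS mnemonic;
# consult Alvarado (1986) for the original table.
ALVARADO_POINTS = {
    "migration_of_pain": 1,
    "anorexia": 1,
    "nausea_or_vomiting": 1,
    "right_lower_quadrant_tenderness": 2,
    "rebound_pain": 1,
    "elevated_temperature": 1,   # fever above threshold: one point
    "leukocytosis": 2,           # high white blood cell count: two points
    "left_shift_of_neutrophils": 1,
}

def alvarado_score(findings):
    """The aid is a plain linear combination: add the points for every
    sign or symptom that is present."""
    return sum(ALVARADO_POINTS[f] for f in findings)

# A total of nine or more points makes appendicitis very probable.
```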
Another extremely simple diagnostic aid is the Ottawa Ankle Rule (Stiell, Greenberg, & McKnight, 1992). Approximately 96% of U.S. physicians are aware of this rule, which is very helpful in the diagnosis of ankle fracture. Like the Alvarado Score, it consists of a linear combination of a few symptoms. However, only about 32% of physicians use this very helpful rule either "always" or "most of the time" (Graham, Stiell, & Laupacis, 2001).

I asked one physician why he did not use the Alvarado Score. He replied that he was aware of the rule and its high accuracy level, but he thought his unaided judgment was just as accurate as the Alvarado Score. I was surprised at this answer, because I thought that it was highly unlikely that he had made diagnoses without the rule, then ascertained what the rule's "judgment" was, and then compared the two diagnoses with a gold standard, such as the undisputed presence or absence of an inflamed appendix.

One of the most cogent reasons for not using a decision aid such as a linear scoring rule was articulated by another physician. He thought that if he used the decision aid but disagreed with its diagnosis, he would nevertheless have to heed the aid's recommendation. Why? He thought that if he defied the aid and an adverse outcome occurred, he would be highly vulnerable to a malpractice verdict should the patient file a lawsuit. The opposing side would point out that he had ignored the recommendation of an accurate decision aid that had been published in a highly reputable journal. This physician told me that he thought that decision aids were designed for the "vanilla" patient. However, if his patient were pregnant or diabetic, for example, he might deem the aid to be inapplicable to this particular case. Thus, his defiance of the aid might be entirely justifiable in his opinion, but in hindsight a jury would not agree with him. The way to solve this problem, the doctor concluded, was never to use the aid. Note that this doctor was not explicitly taking issue with the fact that linear models or other simple decision aids could be very helpful in most cases. He was saying that in some instances he thought that a "broken leg" cue might make the aid inapplicable. Paul Meehl (1954)
coined this term to denote a cue, the presence of which would invalidate the prediction of an actuarial model. If Professor Jones generally goes to the movies on Thursday evenings and loves John Wayne movies, then one might predict that this Thursday Professor Jones is extremely likely to go to the movies because there is a John Wayne movie playing. However, he does not go because he has a broken leg and cannot leave the house. This cue, which was not included in the model, completely defeats the model’s prediction. Similarly, the physician claimed that broken leg cues should cause him to ignore the decision aid in some genuine instances. Because there indeed may be a true broken leg cue present in a particular case, there exists the possibility that the decision aid might not be applicable and the physician’s disregard of it may be warranted. The problem is that people tend to perceive broken leg cues when none exist. One reason for the unaided judge’s inconsistency is the false belief that broken leg cues abound. I did not think I could defeat the doctor’s belief that the decision aid should be followed even in the presence of what he perceived to be genuine broken leg cues. However, there does exist some evidence pertaining to people’s opinions of physicians who do or do not use decision aids. Pezzo and Pezzo (2006) provided participants with a written scenario in which either a positive or a negative medical outcome occurred. Half of these subjects were told that the physician used a computer decision aid that was correct 93% of the time. Because unaided physicians were said to be correct only 84% of the time, the doctor described in the scenario opted to use the aid. There were a number of important dependent measures. Subjects’ ratings of decision quality were higher following the positive outcome compared to the negative outcome, but the influence of the outcome on the rating of decision quality was reduced when the computer-assisted decision aid was used. Pezzo and Pezzo also found that after a negative outcome, the doctor who heeded the advice of the decision aid was deemed less negligent than the doctor who did not use an aid. Thus, using a decision aid did not jeopardize the physician, although it is important to note that in this particular experiment physicians only heeded the aid. None defied it. In the design of experiment 2 Pezzo and Pezzo included a condition in which the physician described in the scenario defied the recommendation of the decision aid. In addition, there were groups who read that the physician agreed with the aid, the physician initially did not agree with the aid but heeded it anyway, and the physician did not use any aid at all. Both medical students and laypersons were the participants, and only a negative medical outcome occurred in the scenarios used in this study. The results were that participants perceived the physician
who defied the aid to be more at fault than when the doctor agreed with or heeded the aid. Here is some evidence consistent with my physician colleague’s fear: If the physician uses an aid, defies it, and an adverse outcome occurs, jurors might find the physician to be more at fault. Pezzo and Pezzo also found that raters deemed the physician who defied the aid or the physician who used no aid to be less competent than physicians who heeded or agreed with the aid. One of these results is incongruent with recent data my colleagues and I have collected (Arkes, Shaffer, & Medow, 2007). In several studies we found that a physician who used no aid is deemed to be more—not less—competent than a physician who does use an aid. In these studies we do not mention whether the outcome is positive or negative. In our first study we divided a group of participants into three subgroups of about 115 people each. All three groups were asked to take the role of a patient who had suffered an ankle injury and had gone to their primary physician for diagnosis and treatment. The groups differed in the diagnostic process used by the physician described in the scenario. The doctor used either no decision aid, an unspecified decision aid, or a decision aid developed at a prestigious medical institution. After reading the scenario, the participants were asked to rate the following five criteria: thoroughness of examination, length of wait, diagnostic ability of the physician, professionalism of the physician, and overall satisfaction with the examination. Amazingly, if the physician used a decision aid, the participants rated her significantly lower in diagnostic ability and professionalism compared to the doctor who did not use a decision aid or an aid developed at a prestigious institution. The participants whose physician used a decision aid also rated the examination as significantly lower in overall satisfaction. In a subsequent study we manipulated whether the physician heeded the aid, defied the aid by treating more aggressively than the aid recommended, or defied the aid by treating less aggressively than the aid recommended. Of course, in the control group, no aid use was mentioned. All of the participants were second- or third-year medical students. The results were that the physician who used no aid was rated as having the highest diagnostic ability, as we always have found. The rating of this group significantly exceeded that of the group who defied the aid and treated less aggressively but did not significantly exceed the rating of the group who read that the physician heeded the aid. This study partially replicates Pezzo and Pezzo in that those who defy the aid are treated harshly. It differs from the Pezzo and Pezzo study in that physicians who use no aid are rated the highest in our study but rather low in the Pezzo and Pezzo study. If our results prove to be valid, this
would suggest a sensible reason why physicians might not want to use a decision aid: If their patient knows that they have used such an aid, the patient might have a lower opinion of them. There are some potentially important differences between the Pezzo and Pezzo study and our research. The aid described in the former is said to be more accurate than the unaided physician. We make no such claim in our scenarios. This may account for the rather benevolent view of physicians who use diagnostic aids in the Pezzo and Pezzo research and the rather negative views of these physicians in our studies. I hypothesize that our subjects might assume that a physician is more accurate than an aid. Subjects in the Pezzo and Pezzo research are explicitly told that is not the case. Thus, the evidence is rather mixed with regard to potential jurors’ views of a physician who uses a decision aid. Our written scenario research generally shows that people derogate the diagnostic ability of physicians who do use a decision aid. Pezzo and Pezzo (2006) do not replicate this effect. To help shed more light on this issue, we are collecting data from a much larger national sample of mock jurors using our highly realistic films of the malpractice trial.
Why the Resistance? Justified or unjustified fears of malpractice might be a cause of physicians’ unwillingness to use simple decision aids, but this factor cannot account for the resistance exhibited by most other decision makers. My experience at the NIH and, to a lesser extent, the NSF has led me to consider other causes of the widespread underutilization of decision aids. First, I do not think there is any doubt in anyone’s mind that linear models simply cannot lose in a contest with holistic “gut reactions” if both are based on the same input data. Linear models are boringly consistent. Personal holistic ratings are doomed by the inconsistency of the raters. Low reliability puts an inviolate damper on validity. Thus, I surmise that some people who object to linear models do so because they believe the holistic gut reaction raters have more information available to them than does a linear combination of cues. Of course, if the model is devoid of important cues, then the holistic gut reaction will be superior. My response is that these extra cues should be included in the linear model. Second, I think some opponents of linear models and disaggregated ratings believe that some cues simply cannot be included. Perhaps these cues cannot be articulated well, possibly due to their affective rather than cognitive nature (Wilson & Schooler, 1991; Millar & Tesser, 1986).
While this objection may have merit in the rating of art or other affective domains, I think it has no merit in the rating of scientific proposals. If the policy of the NSF dissertation panel is to evaluate each dissertation proposal on four specific criteria such as "methodology," then I think no problem exists with regard to the panelists' ability to articulate, rate, and discuss each proposal's standing on each criterion.

Third, I reluctantly mention hubris or "cognitive conceit" (Dawes, 1976). Some people simply do not accept the research findings that linear models do better than unaided judgment, all the evidence notwithstanding (Dawes, Faust, & Meehl, 1989). Perhaps I should rephrase that last sentence: Some people simply do not accept the research findings that linear models do better than their unaided judgment. Note that there is not much evidence to diminish this confidence, because hardly anyone has ever tested their own judgment in competition with a linear model over a series of trials. Winston Sieck and I (Sieck & Arkes, 2005) found that overconfidence in one's decisions was extremely persistent, and this confidence led to reluctance to use a decision aid. Sieck and I asked participants in our experiment to decide whether a respondent to an actual survey favored or did not favor physician-assisted suicide (PAS). To help our participants make this judgment we provided five cues, such as each survey respondent's level of alcohol consumption and level of religious service attendance. These cues were combined in a linear regression, and the regression equation's "prediction" was given to the participants. The equation—our decision aid—correctly classified 77% of the survey respondents with regard to their views on PAS.
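A decision aid of this kind is nothing more than a fixed weighted sum of the cues. In the sketch below, two cue names echo the ones just mentioned, but the remaining cues and all of the weights are invented; they are not the coefficients estimated in Sieck and Arkes (2005).

```python
import numpy as np

# Invented weights for five survey cues; the real coefficients were
# estimated by regression from actual survey responses.
CUES = ["alcohol_consumption", "religious_attendance",
        "cue_3", "cue_4", "cue_5"]
WEIGHTS = np.array([0.6, -1.2, 0.3, -0.1, 0.9])
INTERCEPT = 0.2

def aid_says_favors_pas(cue_values):
    """The aid's 'prediction': classify a respondent as favoring
    physician-assisted suicide when the weighted sum of the cues
    exceeds zero. The actual equation classified 77% of survey
    respondents correctly."""
    values = np.asarray(cue_values, dtype=float)
    return float(np.dot(WEIGHTS, values)) + INTERCEPT > 0.0
```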
In experiment 2 of Sieck and Arkes (2005) we took a very heavy-handed approach to encourage decision aid usage by the participants. After each of 120 trials participants found out whether they were correct in their judgment as to whether the survey respondent favored PAS. On each trial participants had access to the equation. Presumably they could eventually gauge whether they were performing better when they used the equation than when they did not. In addition, after each block of 40 trials some participants were asked to consider those times when they defied the prediction of the equation. They had to answer "Yes" or "No" to the query as to whether they were correct at least 50% of the time when they did defy the equation's prediction. In addition, some subjects were also given calibration feedback after 120 trials in which their stated confidence level on the trials was compared to their accuracy level. A graph showed how closely their accuracy matched their stated confidence when they did heed the equation and how closely their accuracy matched their stated confidence when they did not heed the equation. As if this weren't enough, subjects were required to answer a
few questions that mandated their understanding of the fact that their accuracy–confidence calibration was better when they did heed the equation. None of these mental calisthenics reduced overconfidence. When Robyn Dawes, Caryn Christensen, and I investigated decision aid use almost 20 years earlier (Arkes, Dawes, & Christensen, 1986), we found that paying people for good judgment actually decreased people’s reliance on a very helpful decision aid. We also found that people who were more knowledgeable about the topic were less likely to use the aid than people who were only moderately knowledgeable. I think both findings are related to hubris. People who are knowledgeable think they do not need an aid, and people who are playing for “high stakes” think that in such situations their own judgment can be ratcheted up to meet this challenge, whereas boringly consistent linear models cannot be improved at all (also see Ashton, 1990). In the Arkes et al. (1986) paper we reported a study in which people tried to predict which of three players won the Most Valuable Player Award (MVP) in the National League for each of 19 years. We provided no feedback after each year’s trio of candidates appeared on the screen and participants made their selection. Those who were more knowledgeable about baseball used a helpful decision aid less than did people who were only moderately knowledgeable about baseball. Consequently, the “experts” did more poorly on the task. They were more confident, however, and it is this confidence that led to their disdain for the helpful decision aid. What we did not report in the article was another study in which we did provide feedback after every trial. I was present during one of these sessions, sitting in the back of the room and observing the running of the experiment. I was interested when three members of the varsity baseball team entered the room. They were wearing a portion of their uniform, so it was obvious they were in the expert group. They sat together in the back. Among the other participants in the room was a very diminutive female who sat in the front row. The experiment began when the experimenter displayed on the screen the batting average, runs batted in, and home runs of three players in the National League during a particular year. She also displayed for each player where his team finished in the standings. The instructions had contained the accurate statement that for each year, if the participants in the study merely chose the player whose team finished highest in the standings, the participant would be correct about 75% of the time in selecting the player who won the MVP Award that year. I observed that the female in the front of the room quickly selected the player who she thought won the MVP during that first year. I assume her fast response was because she followed the simple decision rule.
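Her rule fits in a single line of code; the data structure is hypothetical.

```python
def pick_mvp(team_standings):
    """The simple decision rule from the MVP study: among the three
    candidates, choose the player whose team finished highest in the
    standings (1 = first place). Used consistently, this rule alone
    picked the actual MVP about 75% of the time."""
    return min(team_standings, key=team_standings.get)

# e.g., pick_mvp({"Player A": 4, "Player B": 1, "Player C": 2})
# returns "Player B"
```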
The three varsity players who sat near me had much longer response latencies. I surmise they were carefully weighing the trade-off between home runs and batting average or between runs batted in and the other indices. They finally made their selections, and the experimenter provided feedback as to the correct answer. Upon hearing the right answer the female in the front of the room let out a high-pitched "Yea!" The varsity guys in the back row just sat there in silence. Trial 2 was a repeat of trial 1: "Yea!" from the front row and silence plus foot shuffling from the back row. Trial 3 was identical, with one exception. After hearing the discouraging feedback, one of the varsity players tore off his hat, threw it on the ground, stood up, and stomped on it. Here was a person who really needed a decision aid but was too "expert" to condescend to use one.

The fourth reason why decision makers resist decision aids is related to the cognitive conceit notion mentioned above. Linear models and other decision aids are based on averages. While it may be true that an aid is correct 93% of the time and an unaided decision maker only 84% of the time, for any particular trial the decision maker does not know if this is one of those instances when the aid is correct and the decision maker is wrong, one of those fewer instances when the aid is wrong and the decision maker is right, or one of those instances when the two contestants would agree. Because the second category is not a null set, there is the temptation to deem the current situation to be one of those cases in which one's gut reaction might be worth honoring. The threat of malpractice might foster a very close look at the current case for just this reason. In a very profound article Einhorn (1986) pointed out that we have to accept error to make less error. By that he meant that we must tolerate the occasional errors that decision aids make, because if we try to improve upon the aid's accuracy by examining additional but worthless cues, we will make more error. Physicians remind me that a malpractice verdict is based upon a particular case, not on one's overall accuracy level. Although the decision aid might win the majority of contests, would it win this one? My response is always, "If you find a cue that has been shown to help you achieve a higher accuracy level, include it in the equation. Do not just use it inconsistently."

A final reason why holistic ratings might be preferred to disaggregated ones was quite clearly expressed by Katarina Witt (2006), who won two Olympic gold medals in figure skating. In describing the new scoring system used to evaluate figure skaters in the 2006 Olympics she writes, "There's no question that the new system is fairer and more quantifiable. . . . on the whole it's a vast improvement. It's more accountable, and quality can be easily defined." What is the new system that
Ms. Witt praises so effusively? “The new system assigns specific values to the technical and expression elements in a skater’s program.” Witt sounds like a Dawesian member of the NSF or NIH advisory committees. The relevant dimensions are identified, and skaters are evaluated separately on these dimensions. Quantifiability, accountability, transparency—this evaluation method sounds ideal. But wait. Ms. Witt is unhappy with the new system. Had the system been in use when she competed in Calgary in 1988, she might not have won the gold medal. She confesses that she “had time between jumps to flirt with the audience. I’m not sure I could have gotten away with that under today’s scoring system.” That’s a problem with explicit dimensions rated in a disaggregated manner. If flirting is not one of the criteria, then points will not be awarded for it. This new scoring system is terrible for poor skaters who are good flirters. Ms. Witt adds, “Or take the 1994 Olympics in Lillehammer, Norway, where American Nancy Kerrigan lost the gold by a small margin to Oksana Baiul of Ukraine. Kerrigan’s performance was technically superior to Baiul’s, but I think Baiul’s personal story of tragedy and comeback was so compelling that it propelled her to the gold. She captured the judges’ hearts, who decided in her favor on the emotion of the moment, and that wouldn’t really happen under the new system. From a logical perspective that’s good.” That’s the trouble with the new system. It’s too logical. It limits its domain of evaluation to figure skating and does not encompass prior tragedies in one’s nonprofessional life. Ms. Witt’s dilemma is an example of another reason I think linear models, disaggregated ratings, DSSs, and decision aids are not popular. They might be logical. They might be fair. They might even be highly accurate, but they do not take into account some of the factors that people might grudgingly agree are inappropriate but that people would very much like to include. Because Robyn Dawes started graduate school as a clinical psychologist, perhaps it is best to close with a response from a doctor who replied to my request to answer this central question: “Why don’t doctors use decision rules when the rules have been shown to be more accurate than the doctors’ unaided judgment?” His response was simple: “For this you need Sigmund Freud.”
References

Alvarado, A. (1986). A practical score for the early diagnosis of acute appendicitis. Annals of Emergency Medicine, 15(5), 557–564.
Arkes, H. R. (2003). The non-use of psychological research at two federal agencies. Psychological Science, 14, 1–6.
Arkes, H. R., Dawes, R. M., & Christensen, C. (1986). Factors influencing the use of a decision rule in a probabilistic task. Organizational Behavior and Human Decision Processes, 37, 93–110.
Arkes, H. R., Shaffer, V. A., & Dawes, R. M. (2006). Comparing holistic and disaggregated ratings in the evaluation of scientific presentations. Journal of Behavioral Decision Making, 19, 429–439.
Arkes, H. R., Shaffer, V. A., & Medow, M. A. (2007). Patients derogate physicians who use a computer-assisted diagnostic aid. Medical Decision Making, 27, 189–202.
Ashton, R. H. (1990). Pressure and performance in accounting decision settings: Paradoxical effects of incentives, feedback and justification. Journal of Accounting Research, 28(Supplement), 148–180.
Blackburn, J. L., & Hakel, M. D. (2006). An examination of sources of peer-review bias. Psychological Science, 17, 378–382.
Cicchetti, D. V., Showalter, D., & Tyrer, P. J. (1985). The effect of number of rating scale categories on levels of interrater reliability: A Monte Carlo investigation. Applied Psychological Measurement, 9, 31–36.
Corey, G. A., & Merenstein, J. H. (1987). Applying the acute ischemic heart disease predictive instrument. Journal of Family Practice, 25, 127–133.
Dawes, R. M. (1971). A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 26, 180–188.
Dawes, R. M. (1976). Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior (pp. 3–11). Hillsdale, NJ: Erlbaum.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571–582.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Einhorn, H. J. (1986). Accepting error to make less error. Journal of Personality Assessment, 50, 387–395.
General Accounting Office. (1994). Peer review: Reforms needed to ensure fairness in federal agency grant selection (GAO/PEMD-94-1). Washington, DC: Author.
Graham, I. D., Stiell, I. G., Laupacis, A., et al. (2001). Awareness and use of the Ottawa Ankle and Knee Rules in 5 countries: Can publication alone be enough to change practice? Annals of Emergency Medicine, 37, 259–266.
Hunt, D. L., Haynes, B., Hanna, S. E., & Smith, K. (1998). Effects of computer-based clinical decision support systems on physician performance and patient outcomes. JAMA, 280(15), 1339–1346.
Kaplan, B. (2001). Evaluating informatics applications: Clinical decision support systems literature review. International Journal of Medical Informatics, 64, 15–37.
Kaplan, R. M., & Saccuzzo, D. P. (1997). Psychological testing: Principles, applications, and issues (4th ed.). Pacific Grove, CA: Brooks-Cole.
Kassirer, J. P. (1994). A report card on computer-assisted diagnosis—The grade: C. New England Journal of Medicine, 330(25), 1824–1825.
Klahr, D. (1985). Insiders, outsiders, and efficiency in a National Science Foundation panel. American Psychologist, 40, 148–154.
Landy, F. J., & Farr, J. L. (1980). Performance rating. Psychological Bulletin, 87, 72–107.
Meehl, P. E. (1954). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press.
Millar, M. G., & Tesser, A. (1986). The effects of affective and cognitive focus on the attitude–behavior relation. Journal of Personality and Social Psychology, 51, 270–276.
Pezzo, M. V., & Pezzo, S. P. (2006). Physician evaluation after medical errors: Does having a computer decision aid help or hurt in hindsight? Medical Decision Making, 26, 48–56.
Ravinder, H. V. (1992). Random error in holistic evaluations and additive decompositions of multiattribute utility—An empirical comparison. Journal of Behavioral Decision Making, 5, 155–167.
Ridderikhoff, J., & van Herk, E. (1997). A diagnostic support system in general practice: Is it feasible? International Journal of Medical Informatics, 45, 133–143.
Sieck, W. R., & Arkes, H. R. (2005). The recalcitrance of overconfidence and its contribution to decision aid neglect. Journal of Behavioral Decision Making, 18, 29–53.
Stiell, I. G., Greenberg, G. H., McKnight, R. D., et al. (1992). A study to develop decision rules for the use of radiography in acute ankle injuries. Annals of Emergency Medicine, 21, 384–390.
Von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. Cambridge: Cambridge University Press.
Wilson, T. D., & Schooler, J. (1991). Thinking too much: Introspection can reduce the quality of preferences and decisions. Journal of Personality and Social Psychology, 60, 181–192.
Witt, K. (2006, February 22). No soul on ice. New York Times, p. A-23.
ER59969.indb 70
3/21/08 10:50:18 AM
4
What Makes Improper Linear Models Tick?
Jason Dana
University of Pennsylvania
Predictions of human behavior are inevitably plagued by error. Perhaps because there are so many factors in the subject matter that cannot be controlled, social scientists often strive for maximum precision in their statistical models. This practice has intuitive appeal: If much of the imprecision of measurement cannot be controlled, the imprecision that is under one’s control (i.e., in the model) should be eliminated wherever possible. However, this quest for precision can lead to more error because the intuition above is incorrect. As measurement becomes poorer, less precise models become more desirable for making inferences about a population of interest from a sample of data. As an example, “improper linear models” (Dawes, 1979) such as equal weighting are quintessentially imprecise, yet they often cross-validate better than “proper” regression models for social science data. This chapter will provide a theoretical rationale for the use of improper linear models that has not been formally articulated. Specifically, they can be understood as what are referred to as “shrinkage” or “regularized” regression models. These models bias predictions conservatively in light of ill-posed prediction problems. Hence, they have the Bayesian motivation of beginning with a prior that predictive power is poor, rather than beginning with diffuse priors about the values of coefficients for various cues. By introducing this conservative bias, shrinkage models (including improper linear models) avoid the most serious errors that regression makes and on average lead to
better out-of-sample predictions. Because data in many social science domains are unreliable, the less precise improper models are the proper models for attaining maximally efficient predictions. The success of improper linear models serves as a case study for illustrating a key idea that Dawes has emphasized in personal communications: We do not need models more precise than our measurements.
Improper Linear Models

Consider a decision to be made under uncertainty in which the decision maker has to predict some criterion or outcome (for instance, predictions of criminal recidivism or predictions of future faculty ratings of graduate students) from a set of cues (for instance, ratings of crime severity and the number of past criminal charges or undergraduate GPA and GRE scores and the quality of the undergraduate institution). Writing on the use of linear models for such decisions, Dawes came to a surprising conclusion: A simple weighted sum of the cues will typically predict better than a human judge and as well as cross-validated regression regardless of how the weights are chosen (Dawes, 1979; Dawes & Corrigan, 1974). The only proviso is that the weights have the correct signs. Thus, if we only know which cues are important and the direction of influence they have on the criterion, almost any set of weights will do. Dawes called this sort of willy-nilly weighting “improper” because the weights are not the result of any explicit statistical procedure.

To get the flavor of improper linear models, consider a few examples. For simplicity, assume that the cue values are standardized on the same scale, as with z-scores:

Unit weights retain only the sign from optimal cue weights. All cues are weighted either 1 or −1. The sign may be determined by the data or it may be assigned a priori based on theory or domain knowledge. In Dawes and Corrigan’s (1974) words, the modeling only requires one to “know what variables to look at and then know how to add” (p. 105).

Correlation weights are each cue’s correlation with the criterion. This approach differs from regression in that all correlations between the cues are ignored.

Single variable rules, as proposed by Gigerenzer and Todd (1999), consider only one cue and ignore all others. These rules have brought renewed interest to the topic of improper models due to several empirical demonstrations of single variables outperforming regression and other models in picking the one of multiple alternatives that has the largest value on a criterion (Gigerenzer & Goldstein, 1996; Gigerenzer & Todd, 1999; Hogarth & Karelaia, 2005). A common example of
a single variable rule is “take the best,” which weights only the cue that discriminates the best on the criterion.

Random weights have also been employed effectively. For example, Dawes and Corrigan (1974) randomly drew cue weights from uniform (0, 1) and normal distributions scaled onto (0, 1) and then placed the proper sign on them. This procedure yielded better predictions than clinical judgment and bootstrapped clinical judgment and was almost as accurate as cross-validated regression coefficients.

Dawes came to his conclusion decades after Meehl’s (1954) review of the clinical–statistical comparison literature pronounced formal models the superior method for such predictions. However, the success of improper linear models was to provide an even greater embarrassment to clinical judgment. If even these weighting schemes could beat clinicians, then perhaps prediction models in general were successful for less superficial reasons than statistical optimization. Further, if these simple schemes worked, then statistical expertise was not necessary to use prediction models, and there were fewer excuses for not doing so. Improper models thus gained traction as a lesson in how human judgment can be unreliable and how prediction rules should be used instead. To this end, the more implausible, imprecise, and improper a model appeared (e.g., random weights), the better it made the point if it outpredicted human judges.

Could the ostensible implausibility of improper models be the same reason that the second part of Dawes’s findings—that improper models could be equal to or better than regression—has not been taken as seriously? Several authors have noted that unit weights incur little loss relative to optimal weights (e.g., Einhorn & Hogarth, 1975; Wainer, 1976) or that precision is relatively unimportant (Green, 1977). However, the suggestion that they could often be better than regression weights is contrary to statistical intuition (Cohen, 1990) and unit weights are rarely used. The general wisdom seems to be that they are “almost as good” as more sophisticated techniques and that when accuracy is paramount, sophisticated techniques should be used. One could hardly be blamed for believing so. Improper models seem haphazardly chosen, and their success seems serendipitous. How could they be better estimates of the best weights than regression techniques, which have a well-defined rationale? Part of the problem is that there is no uniformly accepted rationale behind improper linear models and no formal explication of how and when they work. For example, Einhorn and Hogarth (1975) offered this list of reasons why unit weights work:
1. Unit weights are not estimated from the data and therefore do not consume degrees of freedom.
2. Unit weights are “estimated” without error (they have no standard errors).
3. Unit weights cannot reverse the “true” relative weights of the variables.

A different set of explanations can be distilled from Dawes and Corrigan (1974):
1. Linear models correlate well with each other.
2. Relative weights are impervious to error in the criterion.
3. Error in the cues makes optimal functions more linear.
4. The linear composite is insensitive to deviations from optimal weights.
What is remarkable about these two lists is that there is no overlap between them. The first deals with trade-offs between standard errors and precision, whereas the second deals with the fact that linear models correlate well with each other, so that the particular choice of weights is not important. To be fair, these were early treatments of the subject, and probably nothing on the lists is incorrect. Still, the lack of agreement between them is representative of the literature that followed. There is some awareness that improper linear models can work, but no unified understanding as to why.

In the following sections, I hope to show that improper linear models do have a sound, if implicit, statistical rationale. That is, in the domains and for the purposes to which they have been applied, so-called improper models are actually proper. Of course, all models are wrong if taken literally, so it would seem unnecessary to call these models improper. For example, in all comparisons of linear models, we know that the true nature of prediction may not be precisely linear, but we still find the linear model useful. However, the impropriety of the models with which Dawes was concerned is more fundamental. The cue coefficients are not the result of any optimization process; indeed it is not clear what the objective function is. Further, as will become clearer in the following sections, these improper models are biased and not functions of the sufficient statistic; they do not use all of the information available in the sample. Perhaps most importantly, the models are inconsistent—they do not converge on the best weighting policy as one’s sample becomes infinitely large. Unit weights, for example, do not even get better as one’s sample becomes larger. Even techniques that are meant to improve upon the over-fitting problem, such as many approaches to ridge regression, usually converge
on the least squares regression solution as samples become very large. Improper models are the wrong estimates of the best cue weights, and we know it. Not trying to find the best model might seem unscientific to some, yet if we want to make the most accurate predictions we can, the scientific thing to do is to use the model we expect to be most accurate in the population of interest, not the one that is best in our sample because it uses all of the information in the sample.
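Before turning to those environments, a toy sketch may help fix ideas. The following Python fragment is my own illustration, not anything from Dawes’s studies: the synthetic data and all variable names are invented, and the “take the best” line simply picks the single most valid cue as a stand-in for that rule. It builds each weighting scheme described above and compares the fit of its composite to ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a prediction problem: three correlated cues, noisy criterion.
n, p = 50, 3
common = rng.standard_normal((n, 1))          # shared factor makes the cues redundant
X = common + rng.standard_normal((n, p))
y = X @ np.array([0.5, 0.3, 0.2]) + 2.0 * rng.standard_normal(n)
X = (X - X.mean(axis=0)) / X.std(axis=0)      # z-score the cues, as assumed in the text
y = (y - y.mean()) / y.std()

validities = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(p)])

weights = {
    "unit":          np.sign(validities),                        # keep only the signs
    "correlation":   validities,                                 # cue-criterion correlations
    "take the best": np.eye(p)[np.argmax(np.abs(validities))],   # single most valid cue
    "random":        rng.uniform(0, 1, p) * np.sign(validities), # random, correct signs
    "OLS":           np.linalg.lstsq(X, y, rcond=None)[0],       # "proper" weights
}

for name, w in weights.items():
    r = np.corrcoef(X @ w, y)[0, 1]           # fit of the weighted composite
    print(f"{name:>13}: r = {r:.3f}")
```

Because linear composites with correctly signed weights correlate highly with one another, the improper composites typically land close to the least squares fit even within the fitting sample; the cross-validation comparison is the subject of the sections that follow.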
Social Predictions and Ill Conditioning

To understand why improper linear models have been successful, one must understand the environments in which they have been successful. Almost all of the early applications of improper linear models were predictions of important human outcomes. As Dawes pointed out (Dawes, 1979; Dawes & Corrigan, 1974), improper linear models have been used to predict who will recidivate if paroled, who will succeed in a training program, how well an individual will respond to a certain kind of therapy, and so on. It was in light of the clinical–statistical controversy over making such “social predictions,” as I will henceforth call them, that improper linear models initially garnered interest. It is important to note two conditions that plague the domain of social predictions:

1. The cues are redundant. For example, those who have committed the most serious crimes tend also to have prior criminal charges, and those who went to the most selective institutions may also tend to have higher GRE scores. It is difficult to find a good set of cues for a social prediction that does not have a significant amount of overlap. Often, the cues correlate with each other as much or more than they do with the criterion.
2. The predictions of a statistical model, while better than a human judge’s, are still not that good. Consider the earlier example of predicting faculty ratings of graduate students from the students’ undergraduate GPA, quality of undergraduate institutions, and GREs. Dawes (1971) studied this problem and found that an admissions committee’s predictions correlated .19 with the ultimate ratings, while a unit-weighted model on the three cues correlated .48. The model was much better than the clinical method, but still left over 75% of the variance in ratings unexplained. That sort of limited accuracy is typical for social predictions.

Assuming that these conditions adequately characterize the social predictions for which improper linear models have been successful, a statistical rationale for improper linear models becomes clear. Consider
the ordinary least squares regression model. The object is to estimate the best set of cue weights in the population of interest, which we may call a vector β. The n observations of p cue values are arrayed in an n × p matrix X, and the n observations of the criterion are a vector y. The least squares regression solution to estimating β from the data is then given by the familiar

β̂ = (X′X)⁻¹X′y    (4.1)

Again, we can assume without loss of generality that our data are in standard or z-score format. If so, we may divide n from the right side of Equation 4.1 and write it as

β̂ = S⁻¹r    (4.2)

where S is the correlation matrix among the cues and r is the vector of correlations between each of the cues and the criterion, that is, the cue validities. The redundancy of the cues, that is, the cues being correlated with each other, means we face the familiar problem of multicollinearity. With extreme multicollinearity, S is not invertible and regression coefficients cannot be computed. However, correlated cues become problematic long before computation becomes impossible or dangerously unstable. Indeed, any deviation from uncorrelated cues makes the estimation of the cue weights less efficient. In mathematics, this problem is conceptualized more generally by saying that the S matrix is “ill conditioned.” The more correlated the cues are, the more poorly conditioned S becomes. As S becomes more poorly conditioned, the solution to the system of equations in Equation 4.2 becomes untrustworthy. Specifically, small changes in r will lead to disproportionately large changes in the solution to the unknowns, β̂. Put differently, if a new sample of data were only slightly different, the cue weights could be much different. Accordingly, the standard errors of regression weights are large when S is ill conditioned. The condition number of S is defined as:

κ = ‖S‖ ‖S⁻¹‖    (4.3)

where ‖·‖ is a matrix norm. This number directly relates the amount of change in β̂ to the amount of change in r. Denoting these changes as δβ̂ and δr:

‖δβ̂‖ / ‖β̂‖ = κ (‖δr‖ / ‖r‖)    (4.4)

Equation 4.4 tells us that a 1% change in the data will lead to a κ% change in the estimated cue weights. Note that κ is always ≥ 1 and reaches its minimum of 1 when all intercue correlations are zero. For a thorough discussion of conditioning and its relation to estimator choice, see Casella (1985).

Equation 4.4 makes clear that when we have correlated cues, we can only trust our estimates of the best cue weights if we trust that our sample is highly representative of the population of interest. However, social predictions are characterized by a fairly large amount of error, and these large residuals imply instability unless our sample is large. If one were to resample from the same population, one could expect the data to be somewhat different. In ill-conditioned problems like social prediction, this difference could lead to large changes in the estimated cue weights. Exactly how large a sample will be necessary to trust the regression solution, as opposed to Dawes’s improper weights, is an issue that will be addressed later.
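The sensitivity that Equations 4.2 through 4.4 describe is easy to exhibit numerically. In this sketch (the values of S and r are illustrative choices of mine, not data from the chapter), three cues intercorrelate .9; a roughly 2% perturbation of r moves β̂ by an order of magnitude more:

```python
import numpy as np

S = np.array([[1.0, 0.9, 0.9],                # highly redundant cues
              [0.9, 1.0, 0.9],
              [0.9, 0.9, 1.0]])
r = np.array([0.40, 0.38, 0.36])              # cue validities

# Equation 4.3: condition number kappa = ||S|| * ||S^-1|| (spectral norm).
kappa = np.linalg.norm(S, 2) * np.linalg.norm(np.linalg.inv(S), 2)
print(f"kappa = {kappa:.0f}")                 # 28 for this S

beta = np.linalg.solve(S, r)                  # Equation 4.2: beta = S^-1 r
r2 = r + np.array([0.01, -0.01, 0.0])         # a slightly different sample
beta2 = np.linalg.solve(S, r2)

print(f"relative change in r:    {np.linalg.norm(r2 - r) / np.linalg.norm(r):.1%}")
print(f"relative change in beta: {np.linalg.norm(beta2 - beta) / np.linalg.norm(beta):.1%}")
print(beta, beta2)   # note cue 3 gets a negative weight despite its positive validity
```

Here a change of about 2% in r produces a change of nearly 40% in β̂, an amplification governed by κ = 28 exactly as Equation 4.4 indicates; the negative weight on a positively valid cue is another symptom of the same instability.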
Improper Linear Models as Shrinkage Estimators

Several methods of estimation have addressed the problem of oversensitivity of least squares regression in ill-conditioned problems. Particularly, the Euclidean length of the least squares estimator (the square root of the sum of squared cue weights) is risky for such problems. In some cases, “reining in” the estimates toward 0, which is equivalent to making more conservative or regressive predictions, can bring expected improvement in prediction errors. Estimators that shorten the least squares estimate are often referred to as “shrinkage” estimators. For example, James–Stein estimators (James & Stein, 1961) multiply each least squares cue estimate by a penalty factor <1, whereas ridge regression techniques (originally Hoerl & Kennard, 1970) lower the condition number of S by adding a positive constant to each of its diagonal elements. In general, better conditioning means less sensitive estimators, and the process of adjusting an ill-conditioned S to make it more stable is known as “regularization” (see Brown, 1993, for a discussion of techniques based on regularization).

Shrinkage estimates of β are biased toward 0, whereas least squares estimates are unbiased. However, the decrease in the variance of the shrinkage estimates relative to least squares can sometimes offset this bias and lead to a smaller mean squared error (MSE) on average between
the estimator and β. No estimator can be guaranteed to improve upon least squares in terms of MSE for all β. However, if one has beliefs that bound the range of possibilities for β, it can be shown that improvement in MSE is guaranteed by shrinking each of the least squares cue weights (Eldar, Ben-Tal, & Nemirovski, 2005). Social predictions bound the possibilities for β; if the errors are large and the cues are redundant, the β vector will be relatively short. Thus, improvement over the least squares solution by a shorter estimator, perhaps substantially shorter, is possible. What needs to be verified is whether improper models indeed shrink the least squares estimate and whether this shrinkage leads to an improvement in MSE.

Heuristically, improper models achieve regularization in reverse of ridge regression; they can be construed as removing the off-diagonals of S instead of adding to the diagonals.1 That is, they ignore the correlations among the predictors. Replacing the off-diagonal elements of S with zeros yields correlation weights. Tampering further with the precision of r can make unit weights or even take the best rule. Tampering with S in this way makes it more “invertible” because it is perfectly conditioned, so that we are sure it is a less sensitive estimator.

To this point, we have considered improper models in their raw form, much as Dawes discussed them. However, it may occur to the reader that unit weights and correlation weights are inappropriate for point predictions. Unit weights, for example, imply no regression to the mean at all. Because the purpose of a social prediction is often to rank candidates, for example, applicants to a training program, simply correlating with the right answer is often sufficient. However, improper models can also be multiplied by a fractional constant so as to minimize squared errors for point predictions without changing the model’s R2. In determining this constant, it becomes clear how improper models are a special transformation of the least squares framework in Equations 4.1 and 4.2. Denote the vector of improper cue weights as a. For example, for unit weights, a would be a vector of ones. The least squares scaling is then accomplished by regressing the criterion on a single variable that is the a-weighted sum of the columns of X and then projecting the result back on to a:
β̂_a = a(a′X′Xa)⁻¹a′X′y    (4.5)

or

β̂_a = a(a′Sa)⁻¹a′r    (4.6)
Setting a = β̂_OLS returns β̂_OLS, so that least squares can be conceptualized as a special case of this framework. The least squares form of improper models helps make the shrinkage logic clear. In the unit weight example, each value of one is multiplied by the correlation between y and an equal-weighted sum of the columns in X. The result reduces to a vector of weights each equal to the sum of all validities over the sum of all cue correlations:

(Σᵢ rᵢ) / (Σᵢ Σⱼ Sᵢⱼ)    (4.7)

From Equation 4.7 we can see that as the cue redundancies in the denominator increase relative to the apparent length of the data in the numerator, the weights will shrink. The more ill-conditioned the problem becomes, the more an improper model hedges toward zero. In general, the biasing effect of improper models relative to least squares regression is predictable in two ways:
1. The vector of weights from the least squares form of improper models is always shorter than the vector of ordinary least squares regression weights. That is, the square root of the sum of squared weights is closer to 0 when using improper least squares. Each individual weight in the improper scheme may not be closer to 0, but the vector of weights will be.
2. As the condition number of S increases (meaning it is more poorly conditioned), so does the difference in length between regression coefficients and improper coefficients. That is, the less trustworthy the solution, the more improper models hedge toward ignorance.

The motivation behind this shrinkage can also be thought of as Bayesian. In least squares regression, we are over-relying on our erratic sample to make our estimates. Least squares regression is based on completely diffuse priors about what weights will obtain; it does not know certain values are more plausible in a given situation, for instance, the social prediction. However, the researcher or the decision maker may have prior beliefs, such as believing that a social prediction problem should be highly regressive because of the noisy nature of such problems. This logic is implicit in the use of improper models. The difference between improper models and explicitly Bayesian regression is that the modeler need not have any well-formed priors about the values or the
distributions of values of the cue weights. Rather, the simple belief that the data are not good enough to estimate precise weights is sufficient. Improper models are a rough and ready technique for incorporating such priors.

Yet another way to conceptualize improper models is as a combination of least squares regression weights and zeros (a prediction of the mean). In classical test theory, we would not take an observed score for a true score, but rather some weighted combination of the observed score and the mean that depends on the amount of unreliability. Improper models are a coarse technique for doing the same thing with regression coefficients, where one does not necessarily have a good estimate of the amount of measurement error.

The general point is that the more we dig into our sample and try to extract information from it, the more we implicitly place faith in the reliability of our data, often to the exclusion of placing faith in our prior beliefs. Because our goal is not to explain as much variance as we can in the sample but to explain as much as we can in the future, sophisticated techniques magnify the importance of measurement error. In this way, simple approaches sometimes have virtue because they are robust, whereas highly calibrated models may backfire because they are more precise than our measurements.
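As a numerical check on Equations 4.5 through 4.7 (again with illustrative values of my own choosing), the least squares scaling of unit weights reduces to the ratio in Equation 4.7, and the resulting vector is shorter than the ordinary least squares vector:

```python
import numpy as np

S = np.array([[1.0, 0.6, 0.6],                # intercue correlations
              [0.6, 1.0, 0.6],
              [0.6, 0.6, 1.0]])
r = np.array([0.35, 0.30, 0.25])              # cue validities
a = np.ones(3)                                # improper weights: unit weights

# Equation 4.6: beta_a = a (a'Sa)^-1 a'r  (a'Sa and a'r are scalars here).
beta_a = a * (a @ r) / (a @ S @ a)
print(beta_a)                                 # every weight = 0.1364
print(r.sum() / S.sum())                      # Equation 4.7 gives the same value

beta_ols = np.linalg.solve(S, r)              # Equation 4.2
print(np.linalg.norm(beta_a))                 # 0.236: the improper vector...
print(np.linalg.norm(beta_ols))               # 0.295: ...is shorter than OLS
```

Raising the intercue correlations from .6 to .9 in this sketch shrinks β̂_a further while lengthening the least squares solution, which is point 2 above in action.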
Conditions Favoring Improper Models

Improper models shrink estimates in light of ill-conditioned problems, but relating such shrinkage to improvement in MSE is not straightforward for three reasons. First, improper models not only shrink the length of the cue weight vector, but they change the relationships between cue weights. This situation is unlike some other shrinkage approaches that apply a constant multiplier to least squares coefficients. Second, the application of an a vector of the type Dawes used is somewhat all-or-nothing. At present, we do not have an option to shrink only mildly. Finally, because these models are biased, their MSE will depend on the true nature of β, a quantity that we do not know. Simulation is thus an attractive starting point for specifying the conditions under which improper models will yield improvement over least squares.

A recent large-scale simulation study helps make prescriptions based on properties of one’s sample (Dana & Dawes, 2004). The study conducted a “horserace” between various improper models (unit weights, correlation weights, and take the best) and regression. Public datasets were used so that one could not object that the data were specially created to support improper models. The results were then
replicated in simulation data that perfectly met the assumptions of the linear regression model.

Five large public datasets were used. They ranged from highly predictable physical science data (predicting the age of abalone from various measurements of its size) to survey data that were weakly predictive of outcomes (predicting a measure of job prestige from various psychological scales and self-ratings). However, the cues in all of the datasets were correlated to some degree. Each of these datasets was treated as the population of interest. Calibration samples were then drawn randomly from the larger datasets in sizes of 5p, 10p, 15p, 20p, 30p, and 50p. In each sample, regression coefficients, correlation weights, and a “take the best rule” that used only the best predictor from the sample were calculated. Additionally, regression coefficients were calculated for a model using only a subset of the cues as chosen by Mallows’s Cp statistic.2 The unit weights for all samples were determined a priori by the majority opinion of a sample of colleagues. The sampling and calibration procedure was repeated 50 times for each sample size in each dataset. All sets of weights were then applied in the entire “population” to determine the validated multiple R, the accuracy metric on which they were compared.

Figure 4.1 gives the averaged results for each of the five public datasets, listed top to bottom in order of increasing error of prediction. As expected, as either sample size or inherent predictability increases, regression becomes superior to improper models on validation. The abalone dataset at the top provides an interesting case; the data were highly ill conditioned, as cues such as weight, length, and width are all highly correlated (average of about .89), yet we can see that regression coefficients calculated on even the smallest sample sizes are superior. That is because the age of abalone relates quite lawfully to each of the predictors, and the predictors are all reliably ascertained. In this situation, we can trust in the stability of the data because of the lawfulness of the relationship between size and age of abalone and thus would not want to shrink our estimates. Moving down the order to the bottom three datasets, which all involve social predictions of some nature, one can see that unit weights and correlation weights perform above regression even up to our largest calibration samples, which had 50 times more observations than predictors.

Interpreting the performance of the single variable rule is straightforward: Take the best is better than the other improper models when one predictor’s validity is approaching the model’s validity. This is the case in the abalone data, where any predictor will do nicely. However, when at least two variables are important, as in the National
Football League data belonging to the next highest graph—where one wants to know about the quality of both offense and defense—they are deficient: A set of correlation weights would lose to a single variable in only 1 of the 300 samples of NFL data. Take the best was the only horse that never ran in the lead.

[Figure 4.1 Results from simulations with public data. Five panels (Abalone, NFL, ABC, NES, WLS) plot validated R against calibration sample size (5p to 50p) for unit weights, correlation weights, take the best, regression, and best-Cp regression.]

The horserace procedure was similar for the synthetic data, except that a massive number of datasets were used, and sample sizes of 75p and 100p were also included. Unit weights were simply defined with the
correct signs. A ridge regression estimator was included to compare improper models against a more modest shrinkage technique. The averaged results for correlation weights, unit weights, regression, and ridge regression are depicted in Figure 4.2. For graphical presentation, the results were averaged across samples and datasets within bins reflecting inherent predictability. Because the true error variance is unknown to a decision maker, the graph was conditioned on adjusted R as a proxy measure of the inherent predictability in the population.3

[Figure 4.2 Results from simulated data. Validated R is plotted against sample size (5p to 100p) for regression, ridge regression, correlation weights, and unit weights, within bins of inherent predictability ranging from below .4 to above .9.]

The bottom three bins are particularly relevant. Correlation weights remain above regression and ridge regression calibrated
in samples with 100 observations per predictor, while unit weights are also often superior. An adjusted R2 of .36 is quite large, particularly for social predictions, as are sample sizes of 100 observations per cue. Because nearly all social prediction problems (and most social science predictions) reside in this region, the authors concluded that there is rarely any need to use regression analyses; improper models are the proper approach to social prediction.
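The flavor of the horserace can be reproduced in a few lines. This sketch uses a synthetic “population” of my own construction, not the Dana and Dawes (2004) datasets, and takes unit-weight signs from each sample rather than a priori; it draws calibration samples of several sizes, fits each scheme, and scores it by its validated correlation in the population:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 4
N = 100_000                                        # the synthetic "population"
common = rng.standard_normal((N, 1))
X_pop = 0.7 * common + 0.7 * rng.standard_normal((N, p))    # redundant cues
y_pop = X_pop @ np.array([0.25, 0.15, 0.10, 0.05]) + rng.standard_normal(N)

def validated_R(w):
    return np.corrcoef(X_pop @ w, y_pop)[0, 1]

for n in (5 * p, 20 * p, 50 * p):                  # calibration sample sizes
    scores = {"regression": [], "correlation": [], "unit": []}
    for _ in range(200):
        idx = rng.choice(N, size=n, replace=False)
        X = X_pop[idx]
        Xz = (X - X.mean(0)) / X.std(0)
        yz = (y_pop[idx] - y_pop[idx].mean()) / y_pop[idx].std()
        corr = np.array([np.corrcoef(Xz[:, j], yz)[0, 1] for j in range(p)])
        scores["regression"].append(validated_R(np.linalg.lstsq(Xz, yz, rcond=None)[0]))
        scores["correlation"].append(validated_R(corr))
        scores["unit"].append(validated_R(np.sign(corr)))  # signs from the sample
    print(n, {k: round(float(np.mean(v)), 3) for k, v in scores.items()})
```

In a noisy, redundant setup like this one, the improper schemes tend to lead at the smaller sample sizes, with regression catching up only as n grows, mirroring the pattern in the lower panels of Figure 4.1.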
Discussion

Apparently, regression and other techniques for producing precise cue weights are rarely required for social predictions, where simple unit or correlation weights will be a better estimate of the “true” cue weights. This conclusion does not imply that one’s data are simply junk and that estimating precise weights is a waste of time. The nature of social predictions can be so important that every little bit helps.

Many will object that these techniques work best when multiple R is small (e.g., less than .4 for unit weights), and that then it hardly matters what techniques one uses because the modeling is so poor, but small correlations can be important. For example, consider that 87% of U.S. lung cancers are caused by smoking (American Lung Association, 2005) and suppose that about 20% of adults in the United States smoke (according to the 2004 Centers for Disease Control National Health Interview Survey [2005], it was 20.9%, and rates have been steadily declining). Together, these facts translate to relative incidences of 1 in 8 for smokers and 1 in 200 for nonsmokers. While this increase in risk is clearly important, it corresponds to a correlation between smoking and lung cancer of just less than .29. That correlation is a major motivation behind one of the largest public health campaigns in U.S. history. Similarly, when predicting important criteria such as whom to hire, admit to a training program, parole from prison, or deem mentally competent, every bit of accuracy matters. However, contrary to intuition, the importance of such criteria mandates that we not use the most precise models at our disposal to reduce the imprecision of our predictions.

Some theories of human cognition, such as Hammond’s (1996) theory of pseudo-rationality, hold that humans work in a similar way. Pseudorational people may use rough rules that are suboptimal in a particular environment. These rules may be rational in that they are robust across a universe of possible environments. It is tempting to conclude that the statistical realities discussed here support that view. However, the findings on improper linear models are noteworthy because they are nonintuitive. Dawes and Corrigan initially struggled to publish
their findings (Dawes, personal communication), apparently because editors simply did not believe them. If people were pseudo-rational in the same way that improper models are, then people would not fare so poorly against them. Unfortunately, it is unnatural to resort to making crude inferences for important decisions. It is even less natural to want to accommodate error up front so that one can eventually make fewer errors (Einhorn, 1986).

The findings on improper linear models raise several potential issues for statistical practice more generally. First, we must question if differential weights really have any meaning in many areas of social science. Recently, Taagepera (2005) wrote a sharp critique of the ubiquitous use of regression models, highlighting their differences from models in the physical sciences. He notes that regression analyses reported in social science journal articles typically have as many different free values for cue weights as there are cues, all of which have little or no interpretation. In contrast, functions in the physical sciences usually have coefficients of 1, 2, or ½. Precise valued constant terms have a strong grounding in theory and are often so meaningful that they are named after someone. Certainly, the interpretation of equal weights or even correlation weights is much more straightforward than that of least squares regression weights. The problem is even worse if we often cannot expect those unequal weights to be better estimates of the parametric values than unit weights.

Edwards (1976) questioned whether we should be satisfied with a significant value of R2 when that value is probably not significantly different than the R2 obtained by equally weighting the cues. If we go further to say that the R2 will likely be worse than that of the unit model in a new sample, then differential weights further lose credence. In a noisy environment without an enormous sample, can we ever say anything other than that we have identified the important variables and not which among them are most important? Perhaps we can if we have a strong theory, but that still does not necessitate precise models that assign coefficients based on maximum likelihood in sample data. The same reasoning applies to significant values of individual coefficients. We may be happy if the coefficient on our variable of interest turns up significant in a multiple regression, for then we can say it contributes when “controlling for” the effects of the other variables, but if our differential weights are likely farther from the truth than simple equal weights, is such an inference valid? Perhaps an equally good or better test of the contribution of an individual variable is to test the difference between the R2 of equal-weighted models with and without the variable of interest.
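A sketch of that last suggestion (the data and names here are hypothetical illustrations; a real application would want a resampling test on the difference, which is omitted):

```python
import numpy as np

def equal_weight_R2(X, y):
    """R^2 of a unit-weighted composite of z-scored cues."""
    Xz = (X - X.mean(axis=0)) / X.std(axis=0)
    composite = Xz.sum(axis=1)
    return np.corrcoef(composite, y)[0, 1] ** 2

rng = np.random.default_rng(2)
n = 200
X = rng.standard_normal((n, 4))
y = X @ np.array([0.4, 0.3, 0.3, 0.0]) + rng.standard_normal(n)  # cue 4 is useless

with_v4 = equal_weight_R2(X, y)
without_v4 = equal_weight_R2(X[:, :3], y)      # drop the variable of interest
print(f"R^2 with cue 4:    {with_v4:.3f}")
print(f"R^2 without cue 4: {without_v4:.3f}")  # higher here: cue 4 adds only noise
print(f"difference:        {with_v4 - without_v4:+.3f}")
```

A variable that earns its place raises the equal-weighted R2 when added; one that merely dilutes the composite lowers it, as the useless fourth cue does in this sketch.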
There are important practical advantages to using improper linear models as well. These models make large numbers of right-hand side variables possible. Equations 4.5 and 4.6 show how these models collapse the dimensionality of a regression problem, so that the familiar problems of estimation associated with large numbers of predictors are relieved. Also, a priori–chosen unit weights can make valuable predictions even when the criterion is ill defined. Consider the example of selecting candidates for admission to graduate school. Graduate programs would like to admit students who will graduate, get good jobs, publish papers, and so on, yet none of these suffices as the criterion of interest, which is an ideal concept like “good student.” Dawes (1979) put it as follows:

When deciding which students to admit to graduate school, we would like to predict some future long-term variable that might be termed “professional self-actualization.” … It would be impossible to conduct the study using records from current students, because that variable could not be assessed until at least 20 years after the students had completed their doctoral work. (p. 574)

In this situation, one could still collect those cues that she or he thinks are important and then simply equal weight them, expecting that if there were a measurable criterion, this method would be superior to many sophisticated regression techniques anyway.

Perhaps the most interesting point about improper linear models to consider for future research is that there is nothing special about the particular instances of improper model described here. One could place restrictions on the parameter space of β, for example, by placing an upper bound on what R would be for a given problem and assuming positive manifold. Equation 4.5 could then be the general form for a new type of shrinkage estimator where a would be optimized. Ideally, the choice of a would lower the condition number and shrink the length of the estimator while improving the expected MSE. Rather than a rough taxonomy, the class of models that for the present is called improper could be defined by the types of a that rein in the estimator.

As the theoretical framework through which to understand the success of improper linear models becomes clearer, they will perhaps become more of an accepted and, more importantly, utilized tool in the social sciences. Analogous developments have occurred in time series forecasting. Some practitioners have used exponential smoothing methods (which are biased, inconsistent, etc.) for decades before theoretical papers detailing when and why they worked (first by Muth, 1960; see also Gardner, 1985) ushered them into the canon of accepted
practice. Let us hope improper linear models can be generalized as a class of shrinkage estimator and receive the same esteem as a valid statistical practice.
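For readers unfamiliar with the analogy, simple exponential smoothing is itself a one-line “improper” forecaster. This generic sketch (my own, not drawn from the cited papers) blends each new observation with the running forecast rather than estimating a full model:

```python
def exponential_smoothing(series, alpha=0.3):
    """One-step-ahead forecasts: each forecast is a weighted blend of
    the latest observation and the previous forecast."""
    forecast = series[0]
    forecasts = []
    for x in series:
        forecasts.append(forecast)
        forecast = alpha * x + (1 - alpha) * forecast
    return forecasts

print(exponential_smoothing([10.0, 12.0, 11.0, 13.0, 12.5]))
```

Like unit weights, the method never converges on the optimal forecasting rule, yet its robustness in noisy series is exactly what earned it eventual theoretical respectability.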
Acknowledgments

The author is grateful for support from NIMH National Research Service award No. MH14257 to the University of Illinois while the author was a postdoctoral trainee in the Quantitative Methods Program in the Department of Psychology. This work benefited greatly from discussions with David Budescu and Clintin Davis-Stober.
References

American Lung Association. (2005). Facts about lung cancer. Retrieved from http://www.lungusa.org/site/pp.asp?c=dvLUK9O0E&b=35427
Brown, P. J. (1993). Measurement, regression, and calibration. London: Oxford University Press.
Casella, G. (1985). Condition numbers and minimax ridge-regression estimators. Journal of the American Statistical Association, 80, 753–758.
Centers for Disease Control. (2005). Early release of selected estimates based on data from the 2004 National Health Interview Survey. Retrieved from http://www.cdc.gov/nchs/data/nhis/earlyrelease/200506_08.pdf
Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304–1312.
Dana, J., & Dawes, R. (2004). The superiority of simple alternatives to regression for social science predictions. Journal of Educational and Behavioral Statistics, 29, 317–331.
Dawes, R. (1971). A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 26, 180–188.
Dawes, R. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571–582.
Dawes, R., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.
Edwards, W. (1976). Comment on “Equal weighting in multi-attribute models: A rationale, an example, and some extensions” by Hillel Einhorn. In M. Schiff & G. Sorter (Eds.), Proceedings of the Conference on Topical Research in Accounting (np). New York: New York University Press.
Einhorn, H. J., & Hogarth, R. M. (1975). Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 13, 171–192.
Einhorn, H. J. (1986). Accepting error to make less error. Journal of Personality Assessment, 50, 387–395.
Eldar, Y. C., Ben-Tal, A., & Nemirovski, A. (2005). Robust mean-squared error estimation in the presence of bounded data uncertainties. IEEE Transactions on Signal Processing, 53, 168–181.
Gardner, E. (1985). Exponential smoothing: The state of the art. Journal of Forecasting, 4, 1–38.
Gigerenzer, G., & Goldstein, D. G. (1996). Reasoning the fast and frugal way: Models of bounded rationality. Psychological Review, 103, 650–669.
Gigerenzer, G., & Todd, P. M. (1999). Simple heuristics that make us smart. Oxford, UK: Oxford University Press.
Green, B. F. (1977). Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 12, 263–288.
Hammond, K. R. (1996). Human judgment and social policy. New York: Oxford University Press.
Hoerl, A., & Kennard, R. (1970). Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12, 55–67.
Hogarth, R., & Karelaia, N. (2005). Ignoring information in binary choice with continuous variables: When is less “more”? Journal of Mathematical Psychology, 49, 115–124.
James, W., & Stein, C. (1961). Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 361–379). Berkeley: University of California Press.
Meehl, P. E. (1954). Clinical vs. statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press.
Muth, J. (1960). Optimal properties of exponentially weighted forecasts. Journal of the American Statistical Association, 55, 299–306.
Taagepera, R. (2005). Beyond regression: The need for logical models. Paper presented at the Third Conference of the Belgian Political Science Association. Retrieved from http://www.ut.ee/SORT/05k/kursused/Beyond.pdf
Wainer, H. (1976). Estimating coefficients in linear models: It don’t make no nevermind. Psychological Bulletin, 83, 213–217.
Notes

1. The arguments that follow may not apply to random weights. Their performance is difficult to analyze, since we must first resolve uncertainty about what weights will obtain and then resolve uncertainty about how well they will validate.
2. Another interpretation of an ill-conditioned S matrix is that it is close to not being full rank. Choosing subsets that might have the correct dimensionality is another approach to the problem.
3. One may notice that in the bottom bins, the expected performance is downward sloping across n in the smaller sample sizes. The reason for this apparent anomaly is the instability of the small samples, many of which came from populations whose true R were much larger than the sample adjusted R.
5
Why Meta-Science Should Be Irresistible to Decision Researchers
David Faust
Department of Psychology, University of Rhode Island, and Department of Psychiatry and Human Behavior, Brown Medical School
Foreword and Dedication: In Honor of Robyn Dawes

In a book dedicated to Robyn Dawes, it seems fitting to share a few reflections about the well-deserved object of our attention. My first direct meeting with Robyn occurred many years ago, very early in my academic career and when my status was about that of a nonentity. At my invitation, Robyn had arrived in Rhode Island to lecture to our department the next morning. After I met him at the airport and then got him checked in at the hotel, I queried whether, despite the somewhat late hour, I could ask him a few questions about decision research. Being familiar with his work and probing intellect, I wanted to take advantage of the exceptional educational opportunity afforded me. Robyn happily responded to a few questions, showing no apparent wear or tear or impatience, and thus I thought there would be no harm in asking a few more questions, followed by a few more, and then more and then more still. Finally glancing at my watch, I noticed it was 2:00 a.m. and I realized that if I did not stop, Robyn would continue with the lesson until he was down to a minute or two to sleep, dress, eat, and otherwise prepare for the upcoming morning’s presentation.
From this initial encounter I drew a number of tentative inferences about Robyn, which have been consistently corroborated over the years (and no, I do not think that this is mere confirmatory bias on my part). First, Robyn is simply one of the most generous individuals whom I have ever met. I have been amazed, time and again, with his willingness to provide the benefit of his intellect and thinking not only to me but to many others, no matter a person’s stature or lack thereof. Robyn is extraordinarily motivated by the pursuit of ideas and learning to advance knowledge, and not by self-gain, which is why he is so giving of his time and talents. I have also found Robyn to be highly forgiving of my flaws and those of others, and I consider it a blessing to count him as a friend.

Every coin has two sides, and almost any meaningful compliment gains credibility when the obverse is not disregarded. Thus, I must make two other points before returning to my far more weighty list of glowing attributes. Second, as is sensible given his talents and engrossment in intellectual pursuit, Robyn’s attention can sometimes drift from such practical matters as time management and scheduling, and he is undoubtedly fortunate to have loving and helpful people around him (in particular his wife, Mary Schafer). Third, as I found that night and subsequently, Robyn does not seem overly concerned with following a linear sequence in his oral presentations, something I believe stems partly from his highly original and quick mind. This style of discourse often compelled me to repeat some variation of, “I’m sorry but I didn’t follow you.” I assume I am not alone in this occasional befuddlement, but it is a small downside in comparison to the many gems one gleans along the conversational route.

Fourth, Robyn has remarkable intellectual creativity. He has little extrinsic worry about what is considered the conventional or received view; rather, he formulates his own conclusions about things. On many occasions, in discussions and in his writings, Robyn has formulated original and fresh ideas that correct long-standing and widely held misconceptions or that shed new and productive light on difficult problems.

Fifth, Robyn does not mind telling you when he thinks you are wrong, and he is usually able to set forth strong reasons for his conviction. I have personally found this quality immensely beneficial. Usually, if there is a flaw in my thinking, even a subtle one, Robyn will find it, and thus he will either save me from an error or help me achieve a better end product. Additionally, as I often would run new ideas by Robyn and my former (informal) mentor and colleague Paul Meehl, in the instances in which I passed muster with these two there seemed to be good grounds for confidence that I was not too far afield. On a broader level, and much more
important than any personal benefit I may have gained from Robyn’s input on my work, he has served this same role of master critiquer for many individuals and for much work in psychology and decision making. Although we should not overlook Robyn’s many positive suggestions for addressing problems he described or uncovered, his ability to detect flaws and shortcomings in the first place is truly exceptional and has been of tremendous benefit in advancing knowledge.

Finally, sixth, and related in ways to all of the other points, is Robyn’s penetrating intellect. It is this unusual combination of creativity, independent mindedness and disregard of intellectual fads, an exceptional capacity to detect flaws and shortcomings in ideas, remarkable generosity in sharing his intellectual capabilities and products with others, together with just plain hard work and productivity that have yielded such an important impact on the field. The essence of all this is that Robyn has changed and improved the thinking of many individuals, fulfilling his fundamental aim of promoting rational thought. Robyn has often been controversial, and he is surely tough-minded, but sometimes that is what is needed when one is providing strong medicine for ailing concepts. Robyn is a person of good will who, in areas in which errors and thinking limitations can be measured in human costs, has provided much illumination.

When I was a sophomore in college, my fellow students and I had a far-ranging discussion with our Zen Buddhist photography teacher, who commented that his foremost aim was to leave the darkroom a little better off than he found it. Being, shall I say, sophomoric, we were not quite sure if he was speaking in concrete or metaphorical terms, but of course it was the latter, and I now think of that aim as one key way of determining whether a life is well-lived. Surely Robyn has succeeded impressively by this measuring stick. I am most honored to be a small part of this very worthwhile and well-deserved tribute to Robyn’s intellectual contributions and enduring legacy.
Appraising Performance on a Complex Judgment Task

You are a decision researcher, and you are studying performance on a commonly undertaken, complex judgment task. The performance of this task sometimes has little impact on others, but many other times it has considerable impact, sometimes to the point of spelling the difference between life and death for hundreds or thousands of individuals.
Ironically, the complex judgment task itself entails the evaluation of ideas or appraisals, and thus three hierarchically arranged levels of analysis are involved: (1) the original set of ideas themselves, (2) the evaluation of these ideas (which is the judgment task under study), and (3) the study or appraisal of these evaluations. It is this final or highest level of analysis (3) that is the task of the judgment researcher. To distinguish between these levels, I will refer to level 1 as idea sets (IS), level 2 as the evaluation of these idea sets (EV), and level 3 as the evaluation of these evaluations (EV of EV).

Characteristics of Judgments to Be Appraised: 10 Parameters That Combine to Create Considerable Complexity and Cognitive Strain

The features of the evaluations (EVs) under study generally include most, if not all, of the 10 following characteristics or parameters:

1. When evaluating (EV) idea sets (IS), there is often more than one IS to evaluate, and sometimes as many as a half dozen or more. Thus, performing these EVs frequently necessitates simultaneous and multiple comparative judgments; that is, one is not only comparing an IS to a set of standards, but also comparing the relative merits of differing or competing ISs.
2. ISs are appraised along multiple dimensions. Further, the list of dimensions to be used in appraisal is open (it can potentially be expanded or contracted), and there is often disagreement about the composition of the list or about whether certain items belong on it.
3. Most, if not all, of the dimensions are continuous or a matter of degree and rarely, if ever, dichotomous.
4. Appraising or measuring standing on dimensions, even singly, may not be easy or straightforward, and hence such appraisals are prone to error and disagreement, even among capable, rational, and fair-minded people. These difficulties can arise from various sources. The dimensions themselves may be somewhat vague or amorphous and composed of multiple relevant subcomponents. The amount of available information bearing on the dimensions may be substantial or potentially overwhelming, and it may be hard to determine whether information or outcomes relating to these dimensions should be sorted into the plus or minus column or whether or how much they should count. The end result is that appraising these dimensions often minimally resembles counting noses and in and of itself is a considerably more difficult and complex judgment task.
5. Ratings for these various dimensions often do not rest on a common scale or metric. Consequently, determining how standing on one dimension compares to that on another dimension may be quite difficult, and that difficulty is compounded if one must integrate ratings across dimensions when comparing more than one IS.

6. Although direct comparison across dimensions may be difficult given differing metrics, it is nevertheless clear that ratings across dimensions often conflict within and across EVs of ISs. For example, in the EV of a single IS (IS-1), the appraisal of standing on two dimensions will often differ. Moreover, in a sizable number of cases, judgment of standing on the dimensions will not only be inconsistent but contradictory, with ratings on one or more dimensions falling on the positive side of the ledger and one or more on the negative side. Further, across EVs of two or more ISs, ratings of the dimensions almost always show inconsistency and often conflict. For example, it would be unusual to find that the comparative evaluation of IS-1 versus IS-2 yields uniformly higher ratings for one or the other idea set. Rather, mixed outcomes are far more common, with one idea set considered superior on some dimensions, about equal on others, and inferior on other dimensions. Matters become considerably more complicated when more than two ISs are involved, as is not uncommon, which is likely to produce a seeming hodgepodge of inconsistent or conflicting ratings.

7. No one dimension is dispositive and trumps all other dimensions, or at least trumps all other composites of the other dimensions. For all of the legitimate EVs that the decision researcher needs to consider, each of the dimensions shows probabilistic relations with the outcomes or criteria of interest; thus, it follows that no dimension is deterministic or shows an unwavering association with outcome. There is also good reason to believe that these associations do not approach, and sometimes fall well below, 1.00. This form of probabilistic association is ontologic, that is, it involves limits in the level of true association between dimensions and outcomes, not limits in our level of knowledge about these relations.

8. EVs of ISs face a second fundamental form of uncertainty. In addition to the aforementioned uncertainty grounded in ontology, a second source of uncertainty rests on our epistemologic situation or results from methodologic factors. Whatever
the true associations might be between these dimensions and outcome, that is, the ontological conditions, our knowledge of these associations is imperfect, which is a product of our epistemologic situation. Indeed, for a number of the dimensions, our knowledge may be rather limited or deficient, and sometimes not much better than a coarse guess. We do have good reasons to believe that standing on most or all of the dimensions that are commonly used to evaluate ISs is validly related to outcome, but we are not really sure. As noted, knowledge about the strength of association is limited for most or nearly all of the dimensions, and beliefs about these associations are matters of long-standing dispute among even very capable individuals.

9. Interactions across the dimensions or configural relations may matter. We cannot assume that EVs can be handled optimally, and perhaps in some cases even effectively, by simple linear composites.

10. When performing EVs, the relevant data or informational base is frequently large, if not massive.

The Difficulty of the Appraisal Task

I think almost anyone would agree that the 10 characteristics described above create a complex decision task, and exactly the type with which we struggle cognitively, especially if we lack formal decision aids and instead must depend primarily on subjective or impressionistic judgment. In a related but decidedly less complex decision domain involving clinical versus actuarial decision methods, Meehl (1986) used the following illustration:

Surely we all know that the human brain is poor at weighting and computing. When you check out at the supermarket, you don’t eyeball the heap of purchases and say to the clerk, “Well it looks to me as if it’s about $17.00 worth; what do you think?” The clerk adds it up. There are no strong arguments … from empirical studies … for believing that human beings can assign optimal weights in equations subjectively or that they apply their own weights consistently. (p. 372)

When addressing a more complex judgmental task (but one still considerably less demanding than the EVs described above), Dawes, Faust, and Meehl (1989) commented thusly in following up on Meehl’s earlier statement:
It might be objected that this analogy, offered not probatively but pedagogically, presupposes an additive model that a proponent of configural judgment will not accept. Suppose instead that the pricing rule were, “Whenever both beef and fresh vegetables are involved, multiply the logarithm of 0.78 of the meat price by the square root of twice the vegetable price”; would the clerk and customer eyeball that any better? Worse, almost certainly. When human judges perform poorly at estimating and applying the parameters of a single or component mathematical function, they should not be expected to do better when required to weight a complex composite of these variables. (p. 1672)

The large research base on clinical versus actuarial judgment (Grove & Meehl, 1996; Grove, Zald, Lebow, Snitz, & Nelson, 2000; Meehl, 1954/1996) and extensive literature in such areas as concept formation and cue utilization (see Connolly, Arkes, & Hammond, 2000; Faust, 1984; Gilovich, Griffin, & Kahneman, 2002; Meehl, 1990/2006) show that highly trained professionals are hardly immune from difficulties with data integration, even with judgmental problems far less complex than the EVs outlined above.
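To make the contrast vivid, the configural pricing rule from the quoted passage can be written down in a few lines of Python. This is a minimal sketch; the function name and the sample prices are my own, but the rule itself is quoted verbatim above. The point is that a mechanical procedure applies such a rule flawlessly every time, whereas no shopper could eyeball it:

import math

def configural_price(meat_price, vegetable_price):
    # Meehl's hypothetical configural rule: the logarithm of 0.78 of the
    # meat price multiplied by the square root of twice the vegetable price.
    return math.log(0.78 * meat_price) * math.sqrt(2.0 * vegetable_price)

# Trivial for the machine, hopeless for subjective "eyeballing."
print(round(configural_price(12.50, 4.00), 2))  # prices are illustrative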
Theory Evaluation as an Exemplar of the Need for Meta-Science

The evaluative problem that I have characterized with the 10 parameters is theory evaluation, which is arguably the most important appraisal activity in science. I do not claim that the evaluation of theories always involves all 10 elements, as sometimes the task is easier. However, as I hope will become apparent, most meaningful theory evaluation involves most or all of these parameters (and sometimes more), thereby creating a highly complex cognitive undertaking. I will reiterate each parameter briefly and then discuss its application to theory evaluation. In doing so, I will also start to lay out the rationale for what Paul Meehl and I labeled meta-science, or the science of science.

Reiteration of parameter 1: There is often more than one IS to evaluate, and sometimes a half dozen or more. Thus, performing EVs frequently necessitates simultaneous and multiple comparative judgments; that is, one is not just comparing an IS to a set of standards but also comparing the relative merits of different ISs.
Application of parameter 1: Often, scientists are not comparing a single theory to nature, but they are comparing two or more theories to each other by comparing them to nature.

Reiteration of parameter 2: ISs are appraised along multiple dimensions. Further, the list of dimensions to be used in appraisal is open (it can be potentially expanded or contracted), and there is often disagreement about the composition of the list.

Application of parameter 2: As is of course well known, multiple dimensions are used to appraise theories, or have been proposed for theory evaluation. Certain dimensions appear across most such lists, such as the capacity to predict novel observations; breadth, or the qualitative diversity of the observations explained; fertility, that is, the capacity to stimulate or generate productive research; precision or rigor; parsimony (in one or more of its multiple forms); reducibility; and logical consistency. What I might label cognitive aesthetic criteria also appear on many lists, which are often described by such terms as the elegance or beauty of proposed ideas. Despite these common elements, varying lists often differ on specifics. Further, no competent philosopher of science would say that the list has been decided or finalized and is closed to potential additions.

Reiteration of parameter 3: Most, if not all, of the dimensions are continuous or a matter of degree and rarely, if ever, dichotomous.

Application of parameter 3: Parameter 3 probably needs little explanation. Few, if any, dimensions for theory evaluation are dichotomous or all-or-none, although some might sound as if they are (e.g., logical consistency). Also, among serious scientific competitors, it would be unusual to merit a flat-out zero on any dimension, especially in all but the most nascent sciences or earliest stages of theory development. Imagine a set of statements that scored a zero, for example, on logical consistency. Although the boundary between science and nonscience is often fuzzy, a score of zero on any of a variety of evaluative criteria could serve as a point of clear demarcation.

Reiteration of parameter 4: Due to a variety of factors, appraising or measuring standing on dimensions, even singly, may not be easy or straightforward, and hence such attempts are prone to error and disagreement, even among capable, rational, and fair-minded people. Judging these dimensions is often less like a simple counting exercise and more like subjectively appraising
events or constructs that range from modest to high levels of complexity.

Application of parameter 4: I doubt many would argue that judging standing on such dimensions as reducibility, logical consistency, the capacity to predict novel observations, or parsimony as defined by the simplest explanation that fits the applicable underlying data is comparable to counting the number of GM vehicles salesperson Smith sold in a calendar year.

Reiteration of parameter 5: Ratings for these various dimensions often do not rest on a common scale or metric. Consequently, determining how standing on one dimension compares to that on another dimension may be difficult, much less integrating ratings across dimensions when comparing one IS to another IS or multiple other ISs.

Application of parameter 5: How common might it be, for example, for a group of eminent scientists proposing alternative theories about, say, the causes of certain degenerative central nervous system disorders to all agree to rate every pertinent dimension for theory evaluation on a five-point Likert-type scale? Even if they did agree, what are the chances that their adopted method is optimum or founded on a basis that could truly or properly be justified as sound? As will be discussed further, given the current state of knowledge, we often have limited information about the relative value of various dimensions, and combinations of dimensions, for predicting the long-term fate of theories or for judging their viability. As will also be argued, despite Dawes’s (1979) and Dawes and Corrigan’s (1974) compelling arguments about the relative insignificance of differential weighting of variables across a range of predictive tasks, we cannot dismiss concerns about differential weighting and configural relations in the domain of theory evaluation because both may be of considerable importance at times.

Reiteration of parameter 6: Although direct comparison may be difficult because of varying metrics, it is nevertheless clear that ratings across dimensions are often inconsistent and may conflict within and across EVs. Considerable complication is added because we are often comparing multiple EVs.

Application of parameter 6: Surely little needs to be said to establish the application of this parameter to theory evaluation. It is the rare instance in which a theory is equally strong across all potential evaluative dimensions, or in which, among serious competitors, one theory trumps another uniformly. If
one theory exceeds another along all dimensions, the inferior theory is very unlikely to be a serious competitor, except under highly unusual circumstances (e.g., the scientific community is at a very early stage of development and perhaps dominated by a seriously flawed assumptive network). Note that even strong theories often have relative weaknesses, as, among other things, strength on certain dimensions tends to create weaknesses on others (e.g., breadth tends to be negatively associated with precision), and thus for one theory to exceed a second theory across all dimensions usually signals obvious gross deficiency on the part of the latter. It is far more common for competing theories to each show respective areas of superiority on certain dimensions, with approximate ties on other dimensions. (Of course, these relative ratings are partly in the eye of the particular theory’s proponent, with such possible variation in evaluation being of substantial interest in its own right.) The comparative task becomes considerably more complicated when there are multiple serious competitors, as is not uncommon in science, especially in relatively earlier stages of development or in novel areas.

Reiteration of parameter 7: No one dimension is dispositive and trumps all other dimensions, or at least trumps all other composites of the other dimensions. Stated in another way, the dimensions are all probabilistic, rather than deterministic, indicators of status or outcome. This form of probabilistic association is ontologic, that is, it involves limits in the level of true association between dimensions and outcomes.

Application of parameter 7: Virtually no competent philosopher of science, despite what may be a relative emphasis on one or another dimension for theory appraisal, claims that the particular dimension is dispositive or trumps all other dimensions or combinations of dimensions. For example, despite the strong emphasis on the survival of risky tests, Popper (1962, 1983) did not claim that it trumped all other combinations of dimensions. The lack of dispositive dimensions relates back to the nature of the associations between criteria for theory appraisal and the status or long-term fate of theories: Within a broad range of operating characteristics, these associations are probabilistic, not deterministic. To again use Popper as an example, he never claimed something like the following: There has never been a theory that seemed to have been disconfirmed soundly or convincingly that was subsequently resurrected. Claims about
desirable theory characteristics often contain probabilistic qualifiers, such as “rarely,” “sometimes,” “usually,” or “often” (e.g., see Laudan et al., 1986, for a compelling example). For example, it might be said that sound theories are often productive generators of research. At the most fundamental level, science can be divided into two components: the body of knowledge produced and the methodology used to produce this knowledge. Of major relevance to meta-scientific proposals and efforts, probabilistic associations not only generally characterize relations between the characteristics of knowledge productions (e.g., the dimensions of theories) and their success or long-term fate. Probabilistic relations also generally characterize this second great side of science, that is, the associations between methodology and successful knowledge production. For example, when conducting research, strong methods or investigative strategies do not ensure success, nor do weaker methods ensure failure. There also may be multiple methods that can all produce success (or that will all lead to failure). This is by no means to argue that all methodological choices are equally sound or equally likely to succeed or fail. To the contrary, certain choices are undoubtedly much better or worse bets than others, and some choices or methods are so bad they are almost certain to fail except under the most extraordinarily lucky circumstances. However, as noted, within the range of typical choices, the relation between method and positive outcome is probabilistic. At times the differences in these probabilities are relatively small but cumulatively meaningful, and at other times they are not apparent and are vulnerable to misappraisal, even to a gross extent.

Reiteration of parameter 8: Theory evaluation is subject to two fundamental forms of uncertainty. In addition to the aforementioned uncertainty grounded in ontology, a second source of uncertainty rests on our epistemologic situation or results from methodologic factors. Whatever the true associations between these dimensions and outcome (the ontological situation), our knowledge of these associations is imperfect (our epistemologic situation). Although we cannot be sure, there are good reasons to believe that standing on most or all of the commonly used dimensions for theory evaluation is validly related to status or outcome (see Meehl, 2002). However, our knowledge about the strength of association is limited for most or nearly all of
the dimensions, sometimes reflecting little more than a coarse guess, and beliefs about these associations are matters of longstanding dispute among even very capable individuals.

Application of parameter 8: If one doubts the limited state of knowledge and the lack of agreement about associations between evaluative dimensions and the status or fate of theories, consider the following. When Meehl (1992) had highly qualified methodologists rate the relative importance of various dimensions that are commonly used for evaluating theories, he obtained a correlation among the rankings of .00! There is little question that lists of criteria for evaluating theories often overlap, but this is very different from saying that there is consensus on the relative importance of the dimensions. Although historians and philosophers of science have directed considerable effort toward theory appraisal, their empirical basis almost always rests on the case study method, that is, the study of single or isolated occurrences in the history of science. There is much to be said for the quality and depth of many such efforts and for their potential value for generating ideas. However, case studies have serious deficiencies for testing many such ideas, and especially for determining the associations between dimensions for theory appraisal, or standing on such dimensions, and outcome. It should be obvious that in a probabilistic problem domain with a massive database of scientific occurrences, multiple “supportive” instances can be found for nearly any proposal about the characteristics of theories that promote success or progress. If one believes, for example, that disconfirmation is essential to progress, it should not be too difficult to find multiple episodes in the history of science in which disconfirmation of theoretical proposals seemed critical or decisive, whether or not the true association or overall frequency of occurrence is high or low. Thus, identifying supportive instances merely establishes what nearly everybody already knows, that is, that the occurrence of such events is greater than 0%, information that is of little or no use. What we want to know instead is just how often such events occur, and how these frequencies compare to those of other events. For example, when evaluating competing proposals about scientific advance, it can make a very important difference if one finds that approach A leads to success a lot less often, a little less often, about as often, a little more often,
or far more often than approach B. Similarly, if one dimension generally shows a much higher association with the success of theories than another dimension, and if in our current situation standing on these two dimensions conflicts, then all else being equal, we typically should elevate the more predictive dimension over the less predictive one. The case study method is simply not a trustworthy way to determine frequency of occurrence or comparative frequencies, especially when it is used selectively to uncover or seek out supportive instances for one or another proposal for theory evaluation, which is probably its most frequent application. If reliance on theory attributes is rational, we must believe that these attributes correlate with a theory’s merits (see Faust & Meehl, 2002; Meehl, 2002). Merit may be conceptualized differently depending on one’s leanings. If one tends toward instrumentalism, merit will likely be viewed as a theory’s long-term survival, whereas a scientific realist will focus on a theory’s verisimilitude (i.e., its truth-likeness or the extent to which it provides the closest approximation to truth). However, for both the instrumentalist and scientific realist, rational belief assumes some connection between theory attributes and standing on the qualities of interest. Ultimately, such reliance on theory attributes intersects with beliefs or empirical claims about the history of science; that is, the presumption that over a given class of scientific theories, or across theories overall, an attribute is a correlate of success (regardless of the particular variant used to define it). In turn, these are intrinsically statistical claims, and the best way to evaluate a statistical claim is to gather and compute the statistics. There is no known way to verify or refute statistical claims about empirical relations other than to get the facts and perform the needed calculations.

Reiteration of parameter 9: Interactions across the dimensions or configural relations may matter. We cannot assume that our appraisal task can be handled optimally, or perhaps even well, by simple linear composites.

Application of parameter 9: Imagine a theory that scores 100 on rigor, 100 on logical consistency, but very deficiently (say a 10) on predictive capacity. Is it equal to a theory that scores 70 on rigor, 70 on logical consistency, and 70 on predictive capacity? Almost surely not. Although the example is extreme, the meaning and predictive utility of some dimensions are likely
to be affected or modified by standing on other dimensions, thereby fulfilling one definition of a configural relation (a point made concrete in the sketch following parameter 10). Along with Dawes, I believe that patterns, configurations, and interactions are often of minimal or negligible importance across a range of judgmental tasks and that obsession with such matters often leads decision makers to attend insufficiently to what is commonly far more important—determining what to include and exclude in predictive formulations. However, it is very questionable to generalize these findings whole-cloth to scientific judgment tasks, and especially to theory evaluation. First, there is little doubt that nature often exhibits patterns and interactions, and it is improbable that such phenomena can be captured in linear terms with the exquisite accuracy that advanced sciences frequently seek. Second, at the risk of committing the very same case study fallacy I have just criticized, there do appear to be many examples of scientific success that involve coverage of interactive terms (consider the formulae that appear in any basic text on, say, chemistry, physics, or systems theory). Thus, it seems highly unlikely that the disregard of configural relations and patterns will achieve optimal or near optimal scientific success across a high percentage of scientific problems or domains of study. Third, and related to the other two points, there are almost surely important differences between judgment tasks in which the ceiling in predictive accuracy is fairly low, the state of measurement is limited, and “theories” or assumptive networks are in a rather primitive state of development, in comparison to formulations in areas in which we are in fairly solid (or very solid) scientific shape.

Reiteration of parameter 10: The database pertinent to the evaluation of theories is large, if not massive.

Application of parameter 10: Although much work in the history and philosophy of science focuses on grand theories, much of science is composed of mini-theories (e.g., digestion, visual perception), which surely number in the thousands. Even in the case of these much narrower domains, hundreds or thousands of studies are often required to decide the competition between alternative theories. Thus, across the spectrum of scientific theories grand and not so grand, there is often a very large database for formulation and evaluation. If one is interested in studying the extent to which factors that promote the success of theories generalize across scientific domains or areas, then the overall pertinent database is massive.
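The 100/100/10 example can be made concrete in a few lines of Python. This is a minimal sketch, and the geometric mean is merely one convenient stand-in for a configural (multiplicative) rule, not a rule proposed anywhere in this chapter: an equal-weight linear composite rates the two hypothetical theories as tied, whereas the multiplicative rule penalizes the severe deficit on predictive capacity.

import math

def linear_composite(scores):
    # Equal-weight linear composite: the arithmetic mean of the ratings.
    return sum(scores) / len(scores)

def configural_composite(scores):
    # Geometric mean as a toy configural rule: because the ratings multiply,
    # a severe deficit on any one dimension drags down the whole evaluation.
    return math.prod(scores) ** (1.0 / len(scores))  # requires Python 3.8+

theory_a = [100, 100, 10]  # rigorous and consistent, but predictively deficient
theory_b = [70, 70, 70]    # uniformly adequate

print(linear_composite(theory_a), linear_composite(theory_b))  # 70.0 vs. 70.0
print(round(configural_composite(theory_a), 1),
      round(configural_composite(theory_b), 1))                # ~46.4 vs. 70.0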
Theory Evaluation as an Exemplar for the Study of Scientific Judgment and the Need for Meta-Science

Theory evaluation provides one example of a judgmental task in science that is often highly complex, as consideration of the 10 parameters discussed above illustrates. Ironically, formal scientific methods, which obviously provide tremendous epistemological advantages in the acquisition and appraisal of knowledge, provide limited direct help to scientists engaged in evaluating theories. Furthermore, theory evaluation is not an isolated or unique example of a scientific decision task that demands high-level integrative judgment but for which existent methodology provides limited direct help and suitable quantitative indicators are lacking. Multiple other examples could be cited, such as the journal review process (which involves the appraisal of completed work), the grant review process (which involves predicting the outcome or productivity of proposed work), and the selection of suitable problems or strategies for scientific investigations. In many instances, judgments about such matters ultimately rest in substantial part on subjective mental processes or impressionistic methods. Despite the tremendous success of science, it would be astounding if these types of decision tasks were almost always performed optimally and, thus, if properly developed or extended, decision aids did not improve judgment. It is these kinds of considerations that led Paul Meehl and me to develop meta-science, which emphasizes the evaluation of scientific evaluations (EV–EV) of various types. Given the focus on complex decision making, meta-scientific studies would seem an extremely appealing subject matter for decision researchers, especially when one also considers the importance of science, its demonstrated overall superiority as a method for knowing, and the gains that might be realized by pursuing approaches that enhance scientific judgment. Formal studies across a variety of domains, along with analytic considerations, provide strong grounds to argue that various limits uncovered in human judgment, and particularly in the capacity to integrate complex information subjectively, apply in meaningful ways to the working scientist. For example, Chapman and Chapman (1967, 1969) found that not only laypersons, but also professional psychologists (who often are thoroughly trained in scientific methods and may themselves be active researchers), were prone to the formation of false associations between variables. Arkes (2003; see also chapter 3) studied judgmental restrictions in the grant review process. A broad overview of evidence linking the judgmental limitations of laypersons to those of scientists is
available in Faust (1984), with subsequent overviews provided by Meehl (1992, 2002). None of this is meant to disregard the astounding successes of science, to question whether it is the best knowledge game in town, or to disregard the extremely difficult intellectual problems with which scientists often must struggle (the latter of which, if anything, provides another rationale for decision aids). Despite all this, any comprehensive review of science shows the frequency with which scientists struggle to make advances and face the challenges of failed starts, errors, and grindingly slow progress. Again, relative superiority, and even vast superiority, should not be conflated with optimality. It would be like arguing years ago that riding on horseback is much better than walking and that hence the optimal endpoint in transportation had been achieved.
A Brief Introduction to Meta-Science

The basic premise of what Paul Meehl labeled the Faust–Meehl thesis is as follows: The use of more rigorous methods for studying scientific occurrences, such as representative sampling and sophisticated psychometric procedures, can provide a more accurate description and understanding of the factors that promote scientific success. (By the phrase scientific occurrences, I am referring broadly to the historical track record of scientific activities and decision making as well as ongoing activities.) Such knowledge can help clarify, and sometimes even resolve, long-standing questions within the history and philosophy of science and can provide important assistance to the practicing scientist. Specifically, better knowledge of what leads to scientific success and failure can provide helpful guidance for scientific undertakings. A number of these points can be fleshed out in a little more detail. (For more extensive coverage of meta-science, see Faust, 1984; Faust & Meehl, 1992, 2002; Meehl, 1992, 2002, 2004.) The meta-scientific study of theories involves the development of indices that quantify standing on criteria for theory evaluation. Meehl has designed multiple such indices (Meehl, 1992, 2002), which are offered as initial approximations or possible starting points given the present early state of meta-science and the likely need for refinement or modification as more is learned. One example is the Verisimilitude Index (Meehl, 1992), which is intended to capture and quantify Popper’s notion of risky prediction. This index combines the narrowness or specificity of prediction, the range of possible outcomes, and the size of the discrepancy between the predicted and obtained outcome to produce a numerical result. Such indices are initially examined by using the historical track record. One
selects a sample of theories that were viable competitors, some of which failed and some of which succeeded. Success can be defined in part by such criteria as long-term and nearly unanimous acceptance over the course of time (e.g., five decades or more). Proceeding study by study using all of the research conducted on the relevant theories (or random samples of studies), one plots performance on the Verisimilitude Index over time for each of the competing theories. One might examine, for example, whether there is a point at which the competing theories diverge and how soon this occurs in relation to the relevant scientific community’s proper selection of a winner. More broadly, one is interested in studying various indices, refining them as needed, and determining which indices and combinations of indices provide optimal prediction of theory success. One is also interested in examining the extent to which optimum predictors generalize across domains within and across scientific fields. For example, certain predictors may be highly redundant, and thus leading combinations may exclude a number of indices with strong subjective appeal and elevate others usually considered to be of secondary or minor importance. It would be very surprising if subjective judgment, even among talented scientists, proved as effective as proper quantification and statistical analysis in discerning the association between indices and outcome and in determining optimal combinations. Again, as Meehl said, “When you check out at the supermarket…” Even if such analyses and formalisms afforded only a minor advantage over subjective judgment, their cumulative advantage over time or as the database on theories mounted might prove considerable.1 As noted, a good deal of the work in the history and philosophy of science on theory evaluation has focused on grand theories (e.g., Newtonian physics, evolution). However, much of science, and much of the database of science, involves considerably narrower theories or mini-theories, which are plentiful. Thus, many potential subjects are available for the meta-scientific study of theories, and an initial focus on mini-theories may prove more practical than tackling grand theories. It is not our contention that meta-scientific studies will be easy or free of challenging obstacles, but the importance of science and the potential advantages gained by even modest improvements over subjective judgment methods, which are so pervasive in many scientific endeavors, would seem to justify the effort. As suggested, one objection to meta-science is that science has been very successful without it. I do not disagree with this appraisal, but, again, successful is not the same as optimal, and any close look at science will show that in many areas progress is painstakingly slow or completely impeded.
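To convey the flavor of such an index without claiming fidelity to Meehl’s formal definition, here is a toy sketch in Python. The combination rule and all names are invented for illustration; only the three ingredients (narrowness of the prediction, range of possible outcomes, and size of the predicted-obtained discrepancy) come from the description above.

def toy_verisimilitude(tolerance, outcome_range, predicted, observed):
    # Toy stand-in for a Verisimilitude-style index (not Meehl's formula).
    # Rewards narrow (risky) predictions relative to the range of possible
    # outcomes, and penalizes the normalized discrepancy between the
    # predicted and the obtained value.
    riskiness = 1.0 - tolerance / outcome_range   # narrow interval => risky test
    miss = abs(predicted - observed) / outcome_range
    return max(0.0, riskiness - miss)

# A risky prediction that nearly hits scores high; a vague near-miss scores low.
print(round(toy_verisimilitude(tolerance=2, outcome_range=100,
                               predicted=40, observed=41), 2))   # ~0.97
print(round(toy_verisimilitude(tolerance=50, outcome_range=100,
                               predicted=40, observed=70), 2))   # ~0.20

Plotted study by study over a theory’s track record, scores of this kind are what would let one see when competing theories begin to diverge.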
It might also be argued that indicators or predictors of successful theories, or of good outcomes in other areas of scientific work, are not likely to generalize. The extent of generalization is a concern whether or not one is engaged in meta-scientific studies, and it is exactly the type of question that meta-scientific methods are designed to address. Why not go ahead and conduct the needed research to determine the level of generalization that holds? Additionally, it is exceedingly unlikely that some degree of generalizability will not be found. Would anyone maintain that just because the use of control groups has been effective in hundreds of thousands of studies, there is no good reason to predict it will prove so again in other or future undertakings, especially situations with fundamental parallels to those in the past in which control groups were especially helpful? If there truly is no basis to assume generalization for this or other methodology, almost all scientific efforts would be crippled, and choices from among the universe of possible methods, no matter the past track record or similarity of circumstance, would all have to be considered equal bets, in which case we might as well make these selections randomly. There is every reason to believe that psychologists, given their methodological sophistication, and particularly decision researchers, are well equipped to participate in, if not lead, meta-scientific efforts. After all, what could be more interesting and important than studying scientific judgment under conditions of uncertainty, an area ripe for investigation and one that would seemingly hold great appeal for decision researchers? It is my hope that some readers will consider joining in such meta-scientific endeavors.
Acknowledgment

I offer my sincere thanks to Leslie J. Yonce for her skillful editorial help and suggestions in the preparation of this chapter.
References

Arkes, H. R. (2003). The nonuse of psychological research at two federal agencies. Psychological Science, 14, 1–6.
Chapman, L. J., & Chapman, J. P. (1967). Genesis of popular but erroneous psychodiagnostic observations. Journal of Abnormal Psychology, 72, 193–204.
Chapman, L. J., & Chapman, J. P. (1969). Illusory correlation as an obstacle to the use of valid psychodiagnostic signs. Journal of Abnormal Psychology, 74, 271–280.
Connolly, T., Arkes, H. R., & Hammond, K. R. (Eds.) (2000). Judgment and decision making (2nd ed.). New York: Cambridge University Press.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571–582.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Faust, D. (1984). The limits of scientific reasoning. Minneapolis: University of Minnesota Press.
Faust, D., & Meehl, P. E. (1992). Using scientific methods to resolve questions in the history and philosophy of science: Some illustrations. Behavior Therapy, 23, 195–211.
Faust, D., & Meehl, P. E. (2002). Using meta-scientific studies to clarify or resolve questions in the philosophy and history of science. Philosophy of Science, 69, S185–196.
Gilovich, T., Griffin, D., & Kahneman, D. (Eds.). (2002). Heuristics and biases: The psychology of intuitive judgment. New York: Cambridge University Press.
Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical–statistical controversy. Psychology, Public Policy, and Law, 2, 292–323.
Grove, W. M., Zald, D. H., Lebow, B. S., Snitz, B. E., & Nelson, C. (2000). Clinical vs. mechanical prediction: A meta-analysis. Psychological Assessment, 12, 19–30.
Laudan, L., Donovan, A., Laudan, R., Barker, P., Brown, H., Leplin, J., et al. (1986). Scientific change: Philosophical models and historical research. Synthese, 69, 141–223.
Meehl, P. E. (1954/1996). Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press. (Reprinted with new Preface, 1996, Northvale, NJ: Jason Aronson.)
Meehl, P. E. (1986). Causes and effects of my disturbing little book. Journal of Personality Assessment, 50, 370–375.
Meehl, P. E. (1990/2006). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant using it. Psychological Inquiry, 1, 108–141, 173–180. (Reprinted in Waller, N. G., Yonce, L. J., Grove, W. M., Faust, D., & Lenzenweger, M. F. (Eds.). [2006]. A Paul Meehl reader: Essays on the practice of scientific psychology [pp. 91–167]. Mahwah, NJ: Lawrence Erlbaum.)
Meehl, P. E. (1992). Cliometric metatheory: The actuarial approach to empirical, history-based philosophy of science. Psychological Reports, 71, 339–467.
Meehl, P. E. (2002). Cliometric metatheory II: Criteria scientists use in theory appraisal and why it is rational to do so. Psychological Reports, 91, 339–404.
Meehl, P. E. (2004). Cliometric metatheory III: Peircean consensus, verisimilitude, and asymptotic method. British Journal for the Philosophy of Science, 55, 615–643.
Popper, K. R. (1962). Conjectures and refutations. New York: Basic Books.
Popper, K. R. (1983). Postscript. In Realism and the aim of science (Vol. 1). Totowa, NJ: Rowman and Littlefield.
Notes

1. Although meta-analysis provides an example of a scientific decision aid for integrating information, it is separate and distinct from the indices being described here. Meta-analysis is, of course, a method for determining the presence and extent of effect sizes. It is not a method (or at least a direct one) for appraising the status of theories.
6
The Robust Beauty of Simple Associations

Joachim I. Krueger
Brown University
Since Festinger’s (1954) exposition of social comparison theory, the presumed relative nature of human judgment has loomed large in theories of self-perception, stereotyping, and intergroup relations. The core assumption of this meta-theoretical orientation is that when people cannot judge themselves in absolute terms, they focus on the ways in which they differ from others. Karniol (2003) suggested that self-concepts are primarily composed of traits that people know, or think they know, set them apart from others. McCauley and Stitt (1978) offered an analogous proposal with respect to group perception. Here, an image of a group (or stereotype) is composed of all those traits perceivers consider distinctive, that is, traits that are either more or less common in the group than in the general population. Social identity theory (Tajfel & Turner, 1979) and its offshoots (e.g., self-categorization theory [Turner, Hogg, Oakes, Reicher, & Wetherell, 1987]) combine these two levels of distinctiveness by suggesting that perceived group differences give rise to both the stereotyping of other groups (i.e., outgroups) and to a person’s own “socialized” self. These theories assume that people are motivated to perceive and even exaggerate group differences and to seek comparisons that allow the ingroup and the self to appear in a favorable light (Brown & Capozza, 2007). Relativistic theories place the burden of explaining self-perception and group perception on the perceived differences between two social objects: the self relative to the average person or the ingroup relative to
the outgroup. These differences thereby attain a theoretical status that must be justified ontologically or pragmatically. Both types of justification lead to a search for correlations between the relevant differences and criterion variables. In the area of self-perception, positive differences between self-judgments and judgments of the average person are said to capture a self-enhancement bias. Taylor and Brown (1988) argued that these differences are ontologically valid indices of bias because not everyone can be better than average (but see Moore & Small, chapter 7). In the words of my 8-year-old daughter Lauren, “If everyone were above average, above average would be average.” Taylor and Brown further argued that self–other differences are pragmatically valid because they predict other desirable attributes, such as self-esteem. In the area of group perception, differences between ingroup and outgroup perception (or between ingroup perception and the perception of the general population) are ontologically justified inasmuch as such differences indeed exist (McCauley & Stitt, 1978); they are pragmatically justified inasmuch as they are related to people’s well-being. Again, self-esteem is the typical variable of choice (Hogg & Abrams, 1990). The appeal of the relativistic approach to self- and social perception is that it capitalizes on the idea that judgments make sense in context. Whenever statistics are gathered, the question is, “Compared to what?” (Dawes, 2001). Experimentalists routinely compare data observed in a treatment group with data obtained in a control group. It is the statistically significant difference that leads to the conclusion that a phenomenon has been detected (Krueger, 2001). An effect exists if there “is not nothing” (Dawes, 1991). Similarly, Bayesians know that learning by induction requires a likelihood ratio large enough to yield credible changes from the ratios of base rates to the ratios of posterior probabilities. In a correlational world, however, the question is whether differences (or ratios) predict criterion variables better than their component variables. Thus, there is an alternative to the idea that social comparisons are necessary ingredients of self- and social perception. Conceivably, people have access to at least some of their own attributes without comparing themselves to others, and perhaps they evaluate their ingroups positively or negatively regardless of how they view a salient outgroup (see Brewer, chapter 10). Even Festinger (1954) granted that social comparison is a strategy of last resort, which is deployed when no direct assessment is possible. In this chapter, I review classic and recent work relating to the issue of relative versus absolute perception. Based on methodological considerations and empirical evidence, I suggest that theories staked on discrepancies must meet a higher standard of proof than has hitherto been
assumed. Often, absolute judgments of the self or of social groups work quite well as predictors of socially relevant outcomes such as subjective well-being or intergroup harmony. A parsimonious model can integrate the egocentric biases of self-enhancement and social projection with the intergroup biases of ingroup–outgroup differentiation and ingroup favoritism. Questioning the search for complex interactions among clinical predictor variables, Dawes (1979; see Appendix 1, this volume) demonstrated the “robust beauty” of simple weighted-average models. His insights provide valuable lessons for efforts to trim away unnecessary complexity from relativistic models of social perception.
Self-Perception and Enhancement

How much more intelligent, soulful, better, is everything about us than about anyone else? (Horowicz, 1878, p. 267)

The most popular index of self-enhancement is a direct comparative judgment. Most people, for example, believe they are better than average drivers (Svenson, 1981) and less likely to succumb to genetic diseases (Dawes, 1994). Assuming that people engage in social comparisons, they might as well express them in a single number. Alas, this method conflates accurate self-perception with bias. If the sample of respondents is representative of the population and if the distribution of “true scores” is symmetrical, the average comparative self-judgment should lie at the midpoint of the scale. If it is higher, there must be more false positives (people claiming to be better than average when they are not) than false negatives (people claiming to be worse than average when they are not). The method does not discriminate between the false and the true positives, or between the false and the true negatives. Therefore, correlations between the comparative judgments and criterion variables (e.g., self-esteem) say nothing about the ontological or pragmatic status of these comparisons. It gets worse when judges become more accurate. If everyone accurately ranked the self relative to the average group member, every true positive would be wrongly accused of self-enhancement, and every true negative would be wrongly accused of self-deprecation. A positive correlation between comparative self-judgments and self-esteem would not reveal the benefits of self-enhancement, but simply reveal that positive self-assessments on different instruments tend to go together (Baumeister, Campbell, Krueger, & Vohs, 2003). Klar and Giladi (1999) recognized the need to understand comparative judgments in light of the absolute judgments that presumably underlie them. They found that comparative self-judgments are closely
related to absolute self-judgments and independent of absolute other-judgments (see Chambers & Windschitl, 2004, for a review). When people judge how happy they are relative to the average person, they simply report how happy they are, while ignoring how happy they think the average person is. Fox and Kahneman (1992, p. 224) anticipated this result when writing that it is plausible to assume, for example, that most people are keenly aware of their own happiness or misery in the domain of love, but relatively ignorant of the love lives of others. In the absence of more direct information about other people, one’s own satisfaction can be used to guess how one compares to others. The selective reliance on self-judgments suggests that comparative judgments are not really comparative and only self-enhancing by accident.1 Indeed, for difficult items (e.g., acting), people make low absolute self-judgments, which in turn lead to low comparative judgments suggesting self-deprecation (Kruger, 1999). There are currently three different ways to explain the relative dominance of absolute self-judgments. According to one view, people weight self-referent information more strongly than other-referent information because the former is richer, more deeply encoded, and more accessible. According to another view, people weight whichever information they happen to be attending to most strongly. The information they focus on happens to be self-referent information most of the time, but it does not have to be (Moore & Kim, 2003). A third view also assumes psychological differences in attention to and memory of self- versus other-referent information, but it stresses the statistical differences in regressiveness (Krueger, 2000). Inasmuch as other-judgments contain a larger random error component than self-judgments do, they are more regressive; that is, they are less strongly correlated with judgments of criterion variables (e.g., success at a task or possession of desirable personality traits). It follows that people make positive comparative judgments when they themselves think they succeed, and negative comparative judgments when they think they fail, even if they do not ignore their absolute judgments of others (Moore & Small, chapter 7). Sedikides, Gaertner, and Toguchi (2003) used comparative judgments to defend the universalist view of self-perception. They noted that evidence for self-enhancement, much like evidence for other biases, was long accepted as valid across cultures. Most people were thought to possess a need for positive self-regard and to rely on self-enhancement to meet this need. This view was later challenged as an instance of, if not scientific imperialism, then at least as an instance of unwarranted
generalization (Markus & Kitayama, 1991). In time, empirical studies began to appear showing that members of some collectivist cultures do not self-enhance, but rather self-deprecate (Heine, Lehman, Markus, & Kitayama, 1999). Sedikides et al. (2003) countered this challenge with a challenge of their own. Retaining the premise that there is a universal need for positive self-regard and the idea that self-enhancement meets this need, they suggested that culture does not moderate the overall level of self-enhancement, but that it directs people to domains in which self-enhancement is most effective. Sedikides et al. found that Americans (and people with independent self-construals) rated themselves higher than the average person on individualist traits (e.g., self-reliant, unique), whereas the Japanese (or people with collectivist self-construals) rated themselves higher than the average person on collectivist traits (e.g., cooperative, respectful). The decomposability of comparative judgments raises a challenge to this challenge of a challenge. Western individualists may simply ascribe individualist traits to themselves, and Easterners may simply ascribe collectivist traits to themselves without being overly concerned about how they might differ from the average person in their culture. High self-judgments on culturally prominent traits then merely signal that a person is effectively acculturated. The implication of this analysis is that cultural stereotypes, on which comparisons between individualist and collectivist societies are predicated, may be at least partially accurate (Jussim, 2005). How can some people come to see themselves in individualist terms, whereas others come to see themselves in collectivist terms? Self-categorization theory attributes group differences in self-perception to processes of self-stereotyping or “depersonalization.” For these processes to work, social categorization needs to be salient, which was not the case for the participants in the study by Sedikides et al. (2003). Neither can self-stereotyping account for the differences between individualist and collectivist self-construals within a culture. Act-frequency theory offers a simpler account. Participants might have agreed on what constitutes an individualist or a collectivist behavior and simply summed their own acts within the two trait domains (Buss & Craik, 1983). Comparisons could then take place, but they would involve categories of acts instead of categories of people. In the West, most people may note that they more often act individualistically than collectivistically and draw the corresponding trait inference. The advantage of this comparative strategy is that access to one’s own relative act frequencies is easy compared with access to act frequencies in other cultures.
Social Projection

We are most likely to touch something universal when each of us speaks personally and witnesses to the sliver of truth we have refined from our individual experience. (Keen, 1992, p. 9)

Unlike self-enhancement, social projection generates perceptions of similarity between the self and others. The typical measure is a correlation between judgments of the self and judgments of others (Krueger, 1998a). Experimental work shows that these correlations reflect a projective path from self-assessments to group estimates, instead of the inverse path from group assessments to self-estimates, which would indicate self-stereotyping (Cadinu & Rothbart, 1996; Otten & Epstude, 2006). Inasmuch as most self-knowledge is more deeply encoded than knowledge of others, it is more accessible. Judgments about others are in part constructed (or “projected”) on the basis of this self-knowledge, a process that works most effectively when little individuating knowledge about others or relevant stereotypes regarding their group are available. Dawes, McTavish, and Shaklee (1977; see Appendix 2, this volume) found social projection in the prisoner’s dilemma game. Most cooperators expected others to cooperate, and most defectors expected others to defect. Moreover, the variance in players’ predictions was larger than the variance of predictions made by uninvolved observers, which suggests that players used their own choice in the game to infer what their opponents would do (see Krueger, Acevedo, & Robbins, 2006, for replications). Social projection has traditionally been deemed a cognitive illusion (Ross, Greene, & House, 1977), and some still argue that it is a form of “epistemic egocentrism” that stands in the way of effective perspective-taking (Royzman, Cassidy, & Baron, 2003). According to this view, people ought to do a better job setting aside their own privileged self-knowledge when making predictions about others. Dissenting, Dawes (1989; 1990; Dawes & Mulford, 1996) showed analytically and empirically that people should treat their own behaviors as data because they are lawfully related to what most others do. Through projection, people can reap an accuracy benefit. Their social predictions are less erroneous than they would be without projection. Self-enhancement and social projection can co-occur. People can think that they are better than others, while also being similar. Whereas self-enhancement is typically studied as a mean difference between self-judgments and other-judgments, social projection is studied as a matter of profile similarity (Krueger, 2000). Nevertheless, increases in projection should reduce self-enhancement effects. If people were to
judge others exactly like themselves, no self-enhancement could occur (Krueger, 2002). With respect to both phenomena, it is interesting to note that the bias is not so much an overuse of self-referent information, but a relative neglect of other-referent information. In the case of self-enhancement, we have seen that people dismiss their own absolute judgments of the average person when judging themselves relative to that person. In the case of projection, they base their judgments of the group primarily on their own behaviors, while putting little weight on the behaviors of individual others whom they know (Krueger, 2003). An integrative model of social judgment needs to account for the interplay among diverse heuristics and biases. Recall Taylor and Brown’s (1988) pragmatic justification of self-enhancement, namely, the claim that greater enhancement is associated with better personal adjustment. Statistically, this claim takes the form of a positive difference-score correlation. In $r_{\text{self}-\text{other},\,\text{criterion}}$, “self” refers to absolute self-judgments with regard to a desirable attribute, “other” refers to absolute judgments of the average person with regard to the same attribute, and “criterion” refers to a measure of personal adjustment, usually self-esteem. The difference-score correlation can be recovered from the three correlations among the three measures and the variances of these measures. When the difference-score correlation is written as
\[
r_{(\text{self}-\text{other}),\,\text{criterion}} \;=\; \frac{s_{\text{self}}\, r_{\text{self,criterion}} \;-\; s_{\text{other}}\, r_{\text{other,criterion}}}{\sqrt{s_{\text{self}}^{2} \;+\; s_{\text{other}}^{2} \;-\; 2\, s_{\text{self}}\, s_{\text{other}}\, r_{\text{self,other}}}}\,,
\]
three dependencies are evident. First, self-enhancement (self − other) is positively correlated with the criterion inasmuch as people weight their self-judgments more strongly than their other-judgments (i.e., $r_{\text{self,criterion}} > r_{\text{other,criterion}}$). Second, the difference-score correlation is positive inasmuch as self-judgments are less regressive than other-judgments (i.e., $s_{\text{self}} > s_{\text{other}}$).2 Third, the difference-score correlation is positive inasmuch as people project to others (i.e., $r_{\text{self,other}} > 0$). To illustrate these dependencies, I created eight combinations by selecting reasonable values for each. I assumed that people weigh self-judgments and other-judgments either equally ($r_{\text{self,criterion}} = r_{\text{other,criterion}} = .7$) or egocentrically ($r_{\text{self,criterion}} = .7$; $r_{\text{other,criterion}} = .2$), that other-judgments are nonregressive ($s_{\text{self}} = s_{\text{other}} = 1$) or regressive relative to self-judgments ($s_{\text{self}} = 1$, $s_{\text{other}} = .5$), and that other-judgments reflect either social projection ($r_{\text{self,other}} = .8$) or a bias toward uniqueness ($r_{\text{self,other}} = -.2$).
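To make these dependencies concrete, the following sketch computes the difference-score correlation from the formula above for all eight combinations of assumed values. It is an illustration only: the function and variable names are my own, and the printed values presume the square-rooted denominator given in the reconstruction of the formula.

```python
import itertools
import math

def diff_score_corr(r_sc, r_oc, r_so, s_s=1.0, s_o=1.0):
    """Correlation of the difference score (self - other) with a criterion,
    recovered from the component correlations and standard deviations."""
    numerator = s_s * r_sc - s_o * r_oc
    denominator = math.sqrt(s_s**2 + s_o**2 - 2 * s_s * s_o * r_so)
    return numerator / denominator

# The eight combinations of assumptions described in the text.
weighting = {"equal weight": (0.7, 0.7), "ego weight": (0.7, 0.2)}
regressiveness = {"no regressiveness": 1.0, "regressiveness": 0.5}
projection = {"projection": 0.8, "uniqueness": -0.2}

for (w, (r_sc, r_oc)), (g, s_o), (p, r_so) in itertools.product(
        weighting.items(), regressiveness.items(), projection.items()):
    r = diff_score_corr(r_sc, r_oc, r_so, s_o=s_o)
    print(f"{w:12s} | {g:17s} | {p:10s} | r(S-O,C) = {r:5.2f}")
```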
Figure 6.1 The correlation between self-enhancement and a criterion variable, r(S-O,C), as a function of the regressiveness of other-judgments, the weight put on other-judgments, and social projection.
Figure 6.1 shows the expected variations in the size of the difference-score correlation. Also note that the effect of projection versus uniqueness is stronger under the empirically plausible conditions of egocentric weighting. Because social projection is rather common and because it protects other-judgments from being strongly regressive, the most likely empirical case is given by the combination of egocentric weighting, nonregressive other-judgments, and the presence of social projection. In the present simulation, this set of assumptions yields a difference-score correlation of .50. To infer that this correlation reveals the adaptiveness of self-enhancement is to go beyond the data given. A more reductive interpretation focuses on the underlying statistical associations and the variability of the measures. This level of analysis and interpretation recognizes the possibility that any particular difference-score correlation can arise from data patterns that are dramatically different from one another. Difference scores and correlations involving difference scores have been criticized for their unreliability and for their inability to yield information that is not already contained in the component variables
(Harris, 1963). Regression methods present an alternative. To examine self-enhancement, one can regress self-judgments on other-judgments and correlate the residuals with a criterion variable. The resulting semipartial correlation is then thought to reveal a “weighting bias” accorded to self-knowledge (Kahneman, 2000). This method also raises the question of how social projection might contribute to inferences about the adaptiveness of self-enhancement. The residual-score correlation is given by
\[
\frac{r_{\text{self,criterion}} \;-\; r_{\text{self,other}}\, r_{\text{criterion,other}}}{\sqrt{1 - r_{\text{self,other}}^{2}}}\,.
\]
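The following sketch, with variable names of my own choosing, computes both discrepancy correlations across levels of projection for the egocentric-weighting case. It reproduces the logic, though not necessarily the exact plotted values, of the Figure 6.2 panels.

```python
import numpy as np

def diff_score_corr(r_sc, r_oc, r_so, s_s=1.0, s_o=1.0):
    # Difference-score correlation (see the first formula above).
    return (s_s * r_sc - s_o * r_oc) / np.sqrt(
        s_s**2 + s_o**2 - 2 * s_s * s_o * r_so)

def residual_score_corr(r_sc, r_oc, r_so):
    # Semipartial correlation: self-judgments residualized on
    # other-judgments, correlated with the criterion.
    return (r_sc - r_so * r_oc) / np.sqrt(1 - r_so**2)

# Egocentric weighting (r_self,criterion = .7, r_other,criterion = .2),
# with projection varying from -.2 to .8 as in Figure 6.2.
for r_so in np.round(np.arange(-0.2, 0.81, 0.2), 1):
    print(f"projection {r_so:+.1f}: "
          f"r_difference = {diff_score_corr(0.7, 0.2, r_so):.2f}, "
          f"r_residual = {residual_score_corr(0.7, 0.2, r_so):.2f}")
```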
Like the difference-score correlation, the residual-score correlation increases with egocentric weighting, but it is not directly affected by regressive other-judgments.3 In contrast to the difference-score correlation, the residual-score correlation is influenced by social projection in both its numerator and its denominator. Within an individual study, difference scores and residuals may be correlated because both depend on absolute self-judgments.4 It is less clear, however, whether difference-score correlations and residual-score correlations yield comparable results across studies when the level of social projection varies.

To illustrate the differences between the two types of correlation, consider again a case in which self-judgments and other-judgments are weighted equally ($r_{\text{self,criterion}} = r_{\text{other,criterion}} = .7$) and a case of egocentric weighting ($r_{\text{self,criterion}} = .7$; $r_{\text{other,criterion}} = .2$). In the four panels of Figure 6.2, difference-score and residual-score correlations are plotted against levels of social projection ranging from −.2 to .8. The correlation between the two types of discrepancy correlation varies considerably depending on the underlying assumptions. Even for the empirically most likely case (panel C), the two discrepancy correlations converge only when projection is very high. In contrast, when participants rationally weight the two absolute judgments equally, investigators might falsely conclude that their choice of discrepancy correlation is of no consequence for the question of whether self-enhancement is adaptive.
Figure 6.2 Difference-score correlations and residual-score correlations plotted against social projection. Panel A: Equal weighting and equal variances (r between the two discrepancy correlations not defined). Panel B: Equal weighting and regressive other-judgments (r = -.96).
The lack of a clear correspondence between difference-score and residual correlations should guard against temptations to reify either discrepancy as a construct in its own right. It may not be surprising that people with high global self-esteem are also more likely to endorse specific positive attributes as self-descriptive. It would be wrong to assume, however, that high self-esteem is necessarily a sign of an inflated self-image. A further implication of this analysis concerns work trying to understand how people generate comparative judgments. Correlational studies have shown a close and unique dependence of these judgments on absolute self-judgments (Chambers & Windschitl, 2004). Studies on the relative accessibility of self- and other-judgments have shown that people make self-judgments more rapidly, more reliably, with greater confidence, and with less subjective difficulty (Krueger, 2003). If comparative self–other judgments are merely absolute self-judgments in disguise, even comparative judgments might be made with greater efficiency than absolute other-judgments.
Figure 6.2 Panel C: Egocentric weighting and equal variances (r = .76). Panel D: Egocentric weighting and regressive other-judgments (r = .67).
One chapter in the history of research on social projection involved the search for the truly false consensus effect (TFCE), a term coined by Robyn Dawes. Following Robyn’s advice, Joanna Zeiger and I computed idiographic correlations between participants’ self-judgments and their judgment errors, that is, the differences between their consensus estimates and actual base rates (Krueger & Zeiger, 1993). Most of these correlations were positive, suggesting that people’s self-judgments predicted their estimation errors. Participants who endorsed a judgment item (e.g., “I like poetry”) were more likely to overestimate the percentage
of people who endorse it than to underestimate it, whereas the opposite was true for participants who rejected the item. The difference-score correlation representing the TFCE, $r_{(\text{estimate}-\text{actual}),\,\text{self}}$, can be written as
\[
r_{(\text{estimate}-\text{actual}),\,\text{self}} \;=\; \frac{s_{\text{estimate}}\, r_{\text{estimate,self}} \;-\; s_{\text{actual}}\, r_{\text{actual,self}}}{\sqrt{s_{\text{estimate}}^{2} \;+\; s_{\text{actual}}^{2} \;-\; 2\, s_{\text{estimate}}\, s_{\text{actual}}\, r_{\text{estimate,actual}}}}\,.
\]
The three critical correlations underlying the derived difference-score correlation are $r_{\text{estimate,self}}$ (projection), $r_{\text{actual,self}}$ (validity, or the typicality of the self with respect to the group), and $r_{\text{estimate,actual}}$ (accuracy). If we assume that the variances of the three variables are equal, we find that the TFCE is zero if projection equals validity. We did not realize at the time that under this assumption the TFCE measure is identical to Hoch's (1987) proposal that projection is biased if it exceeds validity. What remains, however, is the tendency of projection to also increase the variance of the estimates, which, as the formula shows, will contribute to positive TFCE values. This feature of the TFCE measure led de la Haye (2001) to note that the presence of bias is overestimated to the extent that the selection of judgment items yields a regressive distribution of actual percentage scores. In a proposal echoing earlier developments in the self-enhancement literature, she recommended that a residual-score correlation be used. Now estimated consensus is regressed on actual consensus, and the residuals are correlated with self-judgments. The formula for this measure is
\[
\frac{r_{\text{estimate,self}} \;-\; r_{\text{estimate,actual}}\, r_{\text{self,actual}}}{\sqrt{1 - r_{\text{estimate,actual}}^{2}}}\,.
\]
Again, within a given study, difference scores and residuals can be expected to be highly correlated, and the two discrepancy-score correlations to lead to similar conclusions. Returning to Figure 6.2, however, we can see how the difference-score correlation and the residual-score correlation will turn out for individuals who differ in the accuracy of their estimates. Now it is the accuracy term that resides in the denominator of both discrepancy correlations and in the numerator of the residual-score correlation. The results are dramatic. Because actual percentages are now the regressive variable, reflecting de la Haye's (2001) concern, difference-score correlations are positive even when projection is equal to validity
(panel B), and they rise as accuracy increases. In contrast, residual-score correlations decrease. In other words, when accuracy is low, the difference-score correlation is the more conservative measure of bias, but when accuracy is high, the residual-score correlation is more conservative. When projection exceeds validity (panels C and D), the difference-score correlation also increases with accuracy, whereas the function for the residual-score correlation is slightly U-shaped. As in the case of self-enhancement correlations, investigators remain hard pressed to decide which measure of bias is the truest. Again, a careful decomposition of each index appears to say more about the structure of people's judgments than do derivative measures like discrepancy correlations.

In the 1990s, a cottage industry (Griffin & Varey, 1996) sprang up to isolate true projective bias from inductively warranted projection. Most of the proposed measures were positively correlated with one another, but none of them emerged as the undisputed gold standard. More direct evidence for the egocentric nature of projection comes from studies showing that people underuse information available about other individuals (Krueger, 2003). Perhaps trying to avoid the psychometric difficulties that beset corrected measures of social projection, investigators returned to simple correlational measures of projection to ask what affects the mean level of projection and what predicts individual differences.
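For readers who want to see the idiographic TFCE computation in miniature, here is a sketch with simulated rather than real data: one respondent's item endorsements, consensus estimates that are pulled toward those endorsements, and the correlation between endorsements and estimation errors. All numerical settings are illustrative assumptions of mine, not values from the original studies.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily; data are hypothetical

n_items = 40
actual = rng.uniform(0.2, 0.8, n_items)                       # actual endorsement rates
self_judgment = (rng.random(n_items) < actual).astype(float)  # one person's yes/no answers

# A projecting respondent: estimates track the base rates but are pulled
# toward the respondent's own answers.
estimate = np.clip(0.5 * actual + 0.3 * self_judgment
                   + rng.normal(0, 0.05, n_items), 0, 1)

error = estimate - actual                 # consensus-estimation error per item
tfce = np.corrcoef(self_judgment, error)[0, 1]
print(f"idiographic TFCE correlation = {tfce:.2f}")  # positive: errors follow the self
```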
Social Categorization and Stereotyping

Of all the experimental manipulations that can be brought to bear on a participant, none affects the strength of social projection more than the way the person is categorized. People project strongly to social ingroups and very little to social outgroups. The only moderator variable that emerged in a meta-analysis was the kind of group involved: the projection differential is larger for minimal laboratory groups than for groups found in the social world (Robbins & Krueger, 2005). This makes Bayesian sense. If people have some information regarding common traits and behaviors in real social groups, their egocentric weighting of the self should ease up (Ames, 2004). In minimal groups, where other-related or stereotypic information is unavailable, projection can emerge in its pure form.

The puzzle is why people do not project more to outgroups. Arguably, projection to outgroups should be stronger than it is because the characteristics of real groups tend to be positively correlated inasmuch as these groups can be lumped together within a common inclusive ingroup. This is especially true for minimal outgroups, which differ
from ingroups only by label. The inductive fallacy is not that people generalize too much within groups, but that they do not generalize enough across groups. It seems that people are too impressed with category boundaries (Krueger & Clement, 1996).

Sluggish outgroup projection has significant implications for stereotyping. In the absence of intergroup contrast effects or content-rich and diversified stereotypes, differential projection can lead people to see outgroups as being different from ingroups. Assuming common empirical values for ingroup (r = .5) and outgroup (r = .1) projection, the correlation between perceived ingroup characteristics and outgroup characteristics is near zero (r = .05). If people were to project to the outgroup nearly as strongly as they project to the ingroup (e.g., r = .4), they would come to see the two groups as somewhat similar (i.e., r = .25).5 Although the meta-analytic correlation between self and outgroup and between ingroup and outgroup is not negative, some theoretical arguments call for such contrasts. Cadinu and Rothbart (1996) proposed that an "oppositeness heuristic" contributes to balance among social perceptions, and Mullen, Dovidio, Johnson, and Copper (1992) argued that contrasting away the outgroup creates a reassuring sense of uniqueness. Negative projection to outgroups can occur, but specific conditions must be met. Riketta (2005) reported that when people are antagonized by an intergroup conflict, those who identify strongly with their own group see members of the other groups as opposites to themselves.6

Perceived group differences, be they the result of differential projection or additional mental contrasts, are thus a reality of intergroup perception (Tajfel, 1969). Now the question is what role these differences are to play in the definition and measurement of social stereotypes. Early theorists realized that although a stereotype could just be a collection of those traits seen as common in the group, it could also be a collection of those traits that are seen as distinctive of the group. Zawadzki (1948) raised both possibilities but remained undecided as to which one was better, and so did Allport (1954). Most stereotype measurement at the time was conducted with the adjective-checklist procedure, which could not address this critical issue. By 1970, simple associationism had gained favor, and investigators asked their respondents to estimate the percentage of group members for whom a given trait was characteristic (Brigham, 1971).

McCauley and Stitt (1978) reversed this development by casting stereotypes as perceived intergroup differences. After finding positive correlations between "diagnostic ratios" and typicality judgments, McCauley and Stitt collected more estimates of the prevalence of certain traits in a target group (Germans), estimates of the prevalence of
these traits in the world population, and judgments of how typical these traits were of the target group. The ratio of the two prevalence estimates was again positively correlated with typicality judgments. However, the simple prevalence estimates for the target group remained the best predictors of typicality. Nevertheless, McCauley and Stitt argued that the diagnostic ratio is the proper measure of stereotyping. If a trait is not uniquely ascribed to a target (group or person), it carries little information. Retaining nondistinctive traits in a mental representation of the target only creates clutter; hence, it is best to ignore them. These arguments have enjoyed wide, though not unanimous, acceptance in the stereotype-measurement literature (Judd & Park, 1993; Kunda & Thagard, 1996; Schneider, 2004).

The distinctiveness approach is appealing because it is valid with respect to categorization. When multiplied by the probability of the group, $p(G)$, the diagnostic ratio (i.e., the probability of a trait T given membership in group G, $p(T|G)$, divided by the base-rate probability of the trait, $p(T)$) gives the probability of group membership given the trait, $p(G|T)$. To define stereotypes in terms of diagnostic ratios is to say that people see traits as typical if they allow inferences about a person's group membership. This rationale turns the standard stereotypic inference on its head. In measuring stereotypes, the target group is known and the traits are to be judged. In other words, stereotype measurement requires the attribution of traits, not the categorization of people.

The ratio measure emphasizes rare traits. Suicide proneness should be highly stereotypic of the Japanese because the suicide rate in that country, though low in absolute terms, is not as low as it is elsewhere (Allport, 1954). If stereotypes are defined by trait distinctiveness, other measures may also be considered. To contain the geometric explosion of the ratio for rare traits, ratios can be rescaled to vary from −1 to +1. Log odds ratios (LOR) emphasize discrepancies at both ends of the scale equally. As McCauley recommended, difference scores can also be used if discrepancies in the middle of the scale are to be emphasized (McCauley & Thangavelu, 1991).7 Use of the difference-score measure highlights the similarities between stereotype measurement, the measurement of self-enhancement, and the measurement of the TFCE. The formula for $r_{(p(T|G)-p(T)),\,\text{TYP}}$ can be written as
\[
r_{(p(T|G)-p(T)),\,\text{TYP}} \;=\; \frac{s_{p(T|G)}\, r_{p(T|G),\text{TYP}} \;-\; s_{p(T)}\, r_{p(T),\text{TYP}}}{\sqrt{s_{p(T|G)}^{2} \;+\; s_{p(T)}^{2} \;-\; 2\, s_{p(T|G)}\, s_{p(T)}\, r_{p(T|G),p(T)}}}\,.
\]
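A small simulation can make the measurement alternatives concrete. The sketch below generates hypothetical prevalence, base-rate, and typicality judgments, then correlates each candidate stereotype measure (simple prevalence, diagnostic ratio, LOR, and difference score) with typicality. All distributions and parameter values are assumptions made for illustration, not empirical estimates.

```python
import numpy as np

rng = np.random.default_rng(1)  # hypothetical judgment data

n_traits = 30
p_t = rng.uniform(0.1, 0.9, n_traits)                            # perceived base rates p(T)
p_tg = np.clip(p_t + rng.normal(0, 0.15, n_traits), 0.01, 0.99)  # prevalence p(T|G)
typicality = p_tg + rng.normal(0, 0.10, n_traits)                # typicality judgments

measures = {
    "p(T|G)":     p_tg,                                          # simple prevalence
    "ratio":      p_tg / p_t,                                    # diagnostic ratio
    "LOR":        np.log((p_tg / (1 - p_tg)) / (p_t / (1 - p_t))),
    "difference": p_tg - p_t,
}
for name, m in measures.items():
    print(f"r({name}, typicality) = {np.corrcoef(m, typicality)[0, 1]:.2f}")
```

Because each discrepancy measure contains p(T|G) as a component, its correlation with typicality can be positive even when, as in this simulation, typicality tracks simple prevalence alone.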
A first observation is that, as predictors of typicality judgments, difference scores could outperform simple trait-prevalence estimates only if the correlation between perceived trait base rates and typicality judgments were negative (i.e., $r_{(p(T|G)-p(T)),\text{TYP}} > r_{p(T|G),\text{TYP}}$ only if $r_{p(T),\text{TYP}} < 0$). Logically, this cannot occur for a majority of groups because there is only one profile of base rates. If traits with low base rates were seen as most typical for most groups, the distinctions among these groups would be nullified. Simple prevalence estimates do not have this problem because they can be tailored individually for each group. The second observation is that trait base rates should be regressive when compared with group-specific prevalence estimates. Inasmuch as $s_{p(T)} < s_{p(T|G)}$, difference-score correlations become more positive. The third observation is that prevalence estimates for groups should be correlated with perceived base rates (i.e., $r_{p(T|G),p(T)} > 0$, like any other part–whole correlation). The larger this correlation is, the larger the difference-score correlation will be. Ironically, it is mostly their ingroups that people see as similar to the overall population (Waldzus, Mummendey, & Wenzel, 2005). When people fail to grant outgroups general human properties (Paladino et al., 2002), they stereotype them in a way that precludes difference-score correlations from detecting it.

The residual-score method has not been used in stereotype measurement, but it further illustrates the hurdles faced by the distinctiveness approach. This approach leads to the expectation that typicality judgments are correlated with prevalence judgments when base rates are controlled. This will be the case unless
\[
r_{p(T|G),\text{TYP}} \;=\; r_{p(T|G),p(T)} \cdot r_{\text{TYP},p(T)}\,.
\]
The critical test of the distinctiveness hypothesis is provided by the inverse correlation. Will typicality judgments be correlated with base rates when simple prevalence judgments are statistically controlled? Several replications of the McCauley and Stitt (1978) work suggest that they are not. Simple prevalence estimates, p(T|G), consistently outperform discrepancy scores as predictors of typicality judgments. When these estimates are controlled, discrepancy scores drop out (Krueger, 1996; Krueger et al., 2003). The zero-order correlations between diagnostic ratios and typicality judgments are spurious because ratios are naturally confounded with their own numerator. Likewise, as students of change scores have noted, differences are confounded with the variable from which another variable has been subtracted (Harris, 1963; Thorndike, 1924). As people ascribe a trait to a larger percentage of group members, the discrepancy between this percentage and
the base rate of the trait (or its perceived prevalence in some reference group) goes up.

Perhaps the poor performance of the diagnostic ratio can be attributed to the fact that respondents rate the typicality of each trait with respect to the target group. When led to focus on that group, their attention may be diverted from the normative comparison with the referent. To examine this possibility, we asked respondents to rate how typical each trait was of a target group (e.g., women) relative to a salient referent group (i.e., men). Even when typicality judgments were couched in comparative terms, discrepancy scores performed poorly; when the two gender groups (American women and men) were judged by members of a national outgroup (Italians), their unique predictive contribution once again disappeared (Krueger, Hall, Villano, & Jones, in preparation).

One experimental study aimed to show that people spontaneously pick out distinctive traits when forming impressions of novel groups. Ford and Stangor (1992) presented several behaviors suggesting varying degrees of intelligence or friendliness in a "red" and a "blue" group. If two groups differ on two traits to the same average degree, the trait with the smaller within-group variance provides the better differentiation. The more homogeneous trait should be stereotypic for one group and counterstereotypic for the other, whereas the more heterogeneous trait should be largely ignored. This prediction represents the "meta-contrast" principle, which is a cornerstone of self-categorization theory (Turner et al., 1987). Although the findings were consistent with this prediction, a closer look at the stimulus materials reveals that the results could also reflect within-group comparisons. The means (and standard deviations) of the heterogeneous trait were 6.05 (SD = 2.50) and 3.70 (SD = 2.25), which meant that only 66% of the high distribution lay above the midpoint of the scale (5), and only 64% of the low distribution lay below the midpoint of the scale. By contrast, the distributions for the homogeneous traits did not overlap (M = 6.05, SD = .38; M = 3.80, SD = .43). The implication is that a homogeneous trait with a high mean scale value could be considered stereotypic not only because it differentiated the target group from the referent group, but also because it could be attributed to each group member. In other words, p(T|G) was higher for the homogeneous trait than for the heterogeneous trait within the target group. As a metaphor for significant effects in analysis of variance, the meta-contrast principle applies to both contrasts between groups and contrasts between traits. The principle itself demands that the largest contrast be taken most seriously.
The challenges to distinctiveness-based stereotyping do not mean that people never overestimate group differences. They often do (Dawes, Singer, & Lemons, 1972; Krueger, 1992). Even differential projection yields this result. What is in doubt is the claim that people encode perceived group differences into stereotyped mental representations. Instead, people may simply encode associations between group labels and personality attributes. Most work on the automatic activation of stereotypes is built on this assumption (Devine, 1989). Stereotypic (i.e., common) attributes come to mind more easily when the group label is presented first (Kawakami, Dovidio, & Dijksterhuis, 2003).

In research on intergroup relations, prejudice is the wayward cousin of stereotyping. Whereas stereotypes are conceptualized as social beliefs, prejudice is considered an attitude. When approached from an associationist perspective, one can ask whether an attitude toward a certain group is positive or negative relative to a neutral point. However, the relativistic approach has also gained popularity in attitude research. Since the publication of a seminal study by Greenwald, McGhee, and Schwartz (1998), the Implicit Association Test (IAT) has been widely used to assess relative preference for the young over the old, Whites over Blacks, or Democrats over Republicans (cf. Arkes & Tetlock, 2004). A person's score on the IAT is the difference obtained by subtracting reaction times in a matched condition from reaction times in a mismatched condition (see the code sketch at the end of this section). A presumed White racist in a matched condition needs to press one key when a White face or a positive word is presented, and a different key when a Black face or a negative word is presented. When the IAT predicts other measures, such as explicit attitudes, it obscures the extent to which variations in the implicit attitude toward the ingroup or variations in the implicit attitude toward the outgroup contribute to the result (Blanton & Jaccard, 2006; Fiedler, Messner, & Bluemke, 2006).8

Like the distinctiveness-based approach to stereotype measurement, the IAT approach to attitude measurement makes the strong assumption that people directly encode group differences. The psychometric threats to the reliability of difference scores should caution investigators against any rash reification (Cohen & Cohen, 1983). Empirically, the reliability of the IAT is modest, hovering around .50 (Greenwald et al., 2002). Perhaps more importantly, any emphasis on discrepancies constitutes a departure from scientific realism. In stereotype measurement, the distinctiveness-based approach must face the question of which discrepancies are encoded. Is the perceived athleticism of White American males a function of the perceived athleticism of African American males or of the perceived athleticism of White American
females? Depending on the reference group, White American males could be perceived as unathletic or as athletic. On the distinctiveness view, the simple association between the trait and the target group should not be encoded, for it would have no meaning. Yet to believe that all possible comparisons are encoded and available for judgment stretches credulity. In the psychometric literature, the shadow of relativism is typically overlooked by investigators who select attributes for which simple associations and differences are perfectly confounded (Judd & Park, 1993). For example, White American males are only compared with African American males, and the trait of athleticism is considered counterstereotypic for the former and stereotypic of the latter.9

One alternative to creeping relativism is explicit relativism. This is the position of self-categorization theory, which assumes that stereotypes derive their meaning entirely from intergroup comparisons and that stereotypes shift whenever the referents of the comparison change (Turner et al., 1987). Such relativism is not without problems. How can one predict, for example, that members of group B are seen as shorter when judged in the presence of group A than in the presence of group C, without acknowledging that members of group A are seen as taller than members of group C? The perceived difference between A and C, be it true or false, can only arise from a comparison of absolute judgments.10
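As a concrete illustration of the IAT's difference-score logic referred to above, here is a minimal sketch with invented reaction times. It is not the scoring algorithm used in published IAT research, which involves additional transformations; the condition means and trial counts are assumptions of mine.

```python
import numpy as np

rng = np.random.default_rng(2)  # reaction times are invented for illustration

rt_matched = rng.normal(700, 80, 40)     # ms; evaluatively congruent key assignment
rt_mismatched = rng.normal(820, 80, 40)  # ms; key assignment reversed

iat_effect = rt_mismatched.mean() - rt_matched.mean()
print(f"IAT effect = {iat_effect:.0f} ms")  # larger = stronger relative preference
```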
Ingroup Favoritism

Compared with their reactions to outgroup members, people like ingroup members more, seek their company more, and have higher expectations of them. The dominant theoretical account of this phenomenon is both relativistic and motivational. Social identity theory (Tajfel & Turner, 1979), self-categorization theory (Turner et al., 1987), and their various offshoots (see Otten, 2005, for a review) emphasize specific need states that impel people not only to differentiate the ingroup from the outgroup but also to ensure that the comparison is a favorable one. To make such a comparison is not only to construct a positive social identity but also to reap the individualist benefit of heightened self-esteem (but see Aberson, Healy, & Romero, 2000, and Rubin & Hewstone, 1998, for the mixed record of the self-esteem hypothesis).

Before accepting this discrepancy-based argument, one should ask whether simple associations can explain ingroup favoritism. The foregoing discussion suggests that two empirically based assumptions suffice for such an explanation. First, self-images tend to be positive. For most individuals, self-judgments are positively correlated with desirability judgments (Krueger, 1998b). Second, most people project more
strongly to ingroups than to outgroups (Robbins & Krueger, 2005). It follows that when self- and group judgments are plotted against trait desirability, ingroup judgments are more regressive than self-judgments, and outgroup judgments are even more regressive (Krueger et al., 2005). Differential regressiveness implies that differences between self-judgments and ingroup judgments (i.e., self-enhancement) increase with trait desirability (Moore & Small, chapter 7). More importantly, differences between ingroup judgments and outgroup judgments (i.e., ingroup favoritism) also increase with trait desirability. The differential-regressiveness account does not require that any discrepancies be reified as stand-alone constructs. Instead, this account predicts that ingroup favoritism can emerge simply as a by-product of positive self-images and differential projection (a small simulation sketch appears at the end of this section). Consistent with this model, ingroup favoritism disappears in the minimal-group paradigm when self-judgments (and thereby differential projection) are statistically controlled (Krueger et al., 2005; Otten & Wentura, 2001; see also Cadinu & Rothbart, 1996; Gaertner & Sedikides, 2005).11

The associationist model can also account for the finding that ingroup favoritism is more closely tied to ingroup love than to outgroup derogation (Brewer, 1999). Because projection to outgroups is only slightly positive, judgments of such groups gravitate toward being unrelated, as opposed to negatively related, to judgments of attribute desirability. The irrelevance of outgroup judgments for ingroup favoritism is reinforced by the finding that people perceive ingroups favorably even when there is no outgroup available for comparisons (Gaertner, Iuzzini, & Witt, 2006).

Finally, the model sheds new light on the role of self-esteem. Gramzow and Gaertner (2005) reported that personal self-esteem predicts ingroup favoritism (note that this is the inverse of the effect predicted by social identity theory). Inasmuch as self-esteem is closely related to the feature-based positivity of the self-image (as measured by idiographic correlations between self-judgments and desirability judgments), this is as expected (Sinha & Krueger, 1998). The model predicts that the correlation between self-esteem and ingroup favoritism will be attenuated when the positivity of the self-image is controlled.

Now consider the IAT. Even members of minimal groups show implicit preferences for the ingroup (Ashburn-Nardo, Voils, & Monteith, 2001; Otten & Moskowitz, 2000). Implicit and explicit attitudes tend to be positively, though weakly, related (Schneider, 2004). If both implicit and explicit measures of ingroup favoritism were assessed in the same study, it might turn out that the associationist account of explicit ingroup favoritism can explain, in part, implicit bias. That is, IAT effects might be attenuated
when differences in social projection are statistically controlled. This may be all the more so because projection itself is partly automatic. Projection might also account for other categorization effects seen with the IAT. When social categorization into ingroups and outgroups becomes less salient because respondents attend to a common superordinate ingroup, the projection differential, and thus ingroup favoritism, is attenuated (Krueger & Clement, 1996). Work with the IAT shows a parallel result. Greenwald et al. (1998) found that implicit ingroup preferences exhibited by Korean and Japanese Americans were weakest among those with the shallowest immersion in their culture of origin. In other words, participants who were more acculturated to their new and common ingroup showed the least ingroup favoritism.
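The simulation sketch promised above shows how positive self-images and differential projection alone can produce both self-enhancement and ingroup favoritism as by-products of differential regressiveness. The weights and noise levels are illustrative assumptions, not empirical estimates.

```python
import numpy as np

rng = np.random.default_rng(3)

n_traits = 50
desirability = rng.normal(0, 1, n_traits)  # standardized trait desirability

# Judgments track desirability at different rates: self-judgments most
# strongly (positive self-image), ingroup judgments less strongly (strong
# projection), and outgroup judgments least (weak projection).
self_j = 0.8 * desirability + rng.normal(0, 0.3, n_traits)
ingroup_j = 0.5 * desirability + rng.normal(0, 0.3, n_traits)
outgroup_j = 0.1 * desirability + rng.normal(0, 0.3, n_traits)

# Both discrepancies grow with desirability without being encoded anywhere.
for label, diff in [("self-enhancement", self_j - ingroup_j),
                    ("ingroup favoritism", ingroup_j - outgroup_j)]:
    print(f"r(desirability, {label}) = {np.corrcoef(desirability, diff)[0, 1]:.2f}")
```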
Emergent Properties?

A great deal of theory and research on self-enhancement and social categorization has accepted the idea that discrepancies between certain types of judgment reveal psychological constructs that exist apart from simple associations between any pair of judgments. According to this view, self-perception, self-enhancement, projection bias, stereotyping, and ingroup favoritism are emergent properties of multiple judgment processes. There is an apparent analogy to claims regarding the unique power of clinical judgment to detect complex patterns that cannot be uncovered by the linear combination of valid cues. Dawes (1979) was skeptical about the emergent-property argument then, and I remain skeptical about the argument now.

The emergent-property argument has been explicitly made with regard to self-enhancement.12 Zuckerman and Knee (1996) pointed out that difference-score measures of self-enhancement cannot reveal anything that is not already given by the variables that constitute these differences. In response, Colvin, Block, and Funder (1996, p. 1255) attempted an ontological justification, suggesting that these correlations are "emergent properties [with] a meaning different from and extending beyond [their] constituent parts." They cited Searle (1995), who defined "an emergent property of a system [as] a property that is explained by the behavior of the elements of the system; but it is not a property of any individual elements and it cannot be explained simply as a summation of the properties of those elements" (p. 62, emphasis added). Searle was concerned with qualitative shifts that arise when, for example, hydrogen and oxygen combine to yield the property of wetness, or when the patterned firing of neurons yields the property of consciousness. No such qualitative shifts occur
in self- or group perception. Here, the arguments regarding emergent properties are quantitative. Self-images are too positive or too negative when compared with judgments of others, and group stereotypes are too extreme or too regressive when compared with perceptions of referent groups. Epistemological justifications of emergent properties are less restrictive. A property is epistemologically emergent if the processes that generate it are not well understood (Bedau, 1997). With regard to self-enhancement, however, an appeal to the limitations of current understanding is unconvincing. The mathematics of how discrepancy correlations arise from the zero-order correlations and the variances of the individual judgment variables are well known. If discrepancy correlations were emergent, correlations between difference scores and self-judgments would also have to be emergent. These correlations are, however, widely recognized as regression effects. The same logic applies to discrepancy correlations representing projection bias, stereotyping, and ingroup favoritism.

My concern in this essay has been that of parsimony. Research on social perception is replete with higher-order, derivative constructs, most of which are assessed and cataloged in isolation (Krueger & Funder, 2004). The relationships among lower-level constructs are rarely examined, and their ability to at least partially account for the more complex ones is widely ignored. A generalized striving for simplicity is a healthy corrective and an epistemic value so brilliantly exemplified by Robyn Dawes's scientific work.
References

Aberson, C. L., Healy, M., & Romero, V. (2000). Ingroup bias and self-esteem: A meta-analysis. Personality and Social Psychology Review, 4, 157–173.
Allport, G. W. (1954). The nature of prejudice. Garden City, NY: Doubleday/Anchor.
Ames, D. R. (2004). Strategies for social inference: A similarity contingency model of projection and stereotyping in attribute prevalence estimates. Journal of Personality and Social Psychology, 87, 573–585.
Arkes, H. R., & Tetlock, P. E. (2004). Attributions of implicit prejudice, or "Would Jesse Jackson 'fail' the implicit association tests?" Psychological Inquiry, 15, 257–278.
Ashburn-Nardo, L., Voils, C. I., & Monteith, M. J. (2001). Implicit associations as the seeds of intergroup bias: How easily do they take root? Journal of Personality and Social Psychology, 81, 789–799.
Baumeister, R. F., Campbell, J., Krueger, J. I., & Vohs, K. (2003). Does high self-esteem cause better performance, interpersonal success, happiness, or healthier lifestyles? Psychological Science in the Public Interest, 4, 1–44.
Bedau, M. (1997). Weak emergence. In J. E. Tomberlin (Ed.), Philosophical perspectives, 11: Mind, causation, and world (pp. 375–399). Boston: Blackwell.
Blanton, H., & Jaccard, J. (2006). Arbitrary metrics in psychology. American Psychologist, 61, 27–41.
Brewer, M. B. (1999). The psychology of prejudice: Ingroup love or outgroup hate? Journal of Social Issues, 55, 429–444.
Brigham, J. C. (1971). Ethnic stereotypes. Psychological Bulletin, 76, 15–33.
Brown, R., & Capozza, D. (2007). Social identities: Motivational, emotional and cultural influences. London: Routledge.
Buss, D. M., & Craik, K. H. (1983). The act frequency approach to personality. Psychological Review, 90, 105–126.
Cadinu, M. R., & Rothbart, M. (1996). Self-anchoring and differentiation processes in the minimal group setting. Journal of Personality and Social Psychology, 70, 661–677.
Chambers, J. R., & Windschitl, P. D. (2004). Biases in social comparative judgments: The role of nonmotivated factors in above-average and comparative-optimism effects. Psychological Bulletin, 130, 813–838.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Colvin, C. R., Block, J., & Funder, D. C. (1996). Psychometric truths in the absence of psychological meaning: A reply to Zuckerman & Knee. Journal of Personality and Social Psychology, 70, 1252–1255.
Dawes, R. M. (1979). The robust beauty of improper linear models in decision making. American Psychologist, 34, 571–582.
Dawes, R. M. (1989). Statistical criteria for establishing a truly false consensus effect. Journal of Experimental Social Psychology, 25, 1–17.
Dawes, R. M. (1990). The potential nonfalsity of the false consensus effect. In R. M. Hogarth (Ed.), Insights in decision making: A tribute to Hillel J. Einhorn (pp. 179–199). Chicago: University of Chicago Press.
Dawes, R. M. (1991). Probabilistic versus causal thinking. In D. Cicchetti & W. M. Grove (Eds.), Thinking clearly about psychology: Vol. 1. Matters of public interest: Essays in honor of Paul Everett Meehl (pp. 235–264). Minneapolis: University of Minnesota Press.
Dawes, R. M. (1994). House of cards: Psychology and psychotherapy built on myth. New York: The Free Press.
Dawes, R. M. (2001). Everyday irrationality: How pseudoscientists, lunatics, and the rest of us systematically fail to think rationally. Boulder, CO: Westview Press.
Dawes, R. M., McTavish, J., & Shaklee, H. (1977). Behavior, communication, and assumptions about other people's behavior in a commons dilemma situation. Journal of Personality and Social Psychology, 35, 1–11.
Dawes, R. M., & Mulford, M. (1996). The false consensus effect and overconfidence: Flaws in judgment, or flaws in how we study judgment? Organizational Behavior and Human Decision Processes, 65, 201–211.
Dawes, R. M., Singer, D., & Lemons, F. (1972). An experimental analysis of the contrast effect and its implications for intergroup communication and the indirect assessment of attitude. Journal of Personality and Social Psychology, 21, 281–295.
de la Haye, A. M. (2001). False consensus and the outgroup homogeneity effect: Interference in measurement or intrinsically dependent processes? European Journal of Social Psychology, 31, 217–230.
Devine, P. G. (1989). Stereotypes and prejudice: Their automatic and controlled components. Journal of Personality and Social Psychology, 56, 5–18.
Festinger, L. (1954). A theory of social comparison processes. Human Relations, 7, 117–140.
Fiedler, K., Messner, C., & Bluemke, M. (2006). Unresolved problems with the "I", the "A", and the "T": A logical and psychometric critique of the Implicit Association Test (IAT). In W. Stroebe & M. Hewstone (Eds.), European review of social psychology (Vol. 17, pp. 74–147). Chichester, UK: Wiley.
Ford, T. E., & Stangor, C. (1992). The role of diagnosticity in stereotype formation: Perceiving group means and variances. Journal of Personality and Social Psychology, 63, 356–367.
Fox, C. R., & Kahneman, D. (1992). Correlations, causes, and heuristics in surveys of life satisfaction. Social Indicators Research, 27, 221–236.
Gaertner, L., Iuzzini, J., & Witt, M. G. (2006). Us without them: Evidence for an intragroup origin of positive in-group regard. Journal of Personality and Social Psychology, 90, 426–439.
Gaertner, L., & Sedikides, C. (2005). A hierarchy within: On the motivational and emotional primacy of the individual self. In M. D. Alicke, D. A. Dunning, & J. I. Krueger (Eds.), The self in social perception (pp. 213–239). Philadelphia: Psychology Press.
Gramzow, R. H., & Gaertner, L. (2005). Self-esteem and favoritism toward novel in-groups: The self as an evaluative base. Journal of Personality and Social Psychology, 88, 801–815.
Greenwald, A. G., Banaji, M. R., Rudman, L. A., Farnham, S. D., Nosek, B. A., & Mellott, D. S. (2002). A unified theory of implicit attitudes, stereotypes, self-esteem, and self-concept. Psychological Review, 109, 3–25.
Greenwald, A. G., McGhee, D. E., & Schwartz, J. L. K. (1998). Measuring individual differences in implicit cognition: The implicit association test. Journal of Personality and Social Psychology, 74, 1464–1480.
Griffin, D. W., & Varey, C. A. (1996). Towards a consensus on overconfidence. Organizational Behavior and Human Decision Processes, 65, 227–231.
Harris, C. W. (1963). Problems in measuring change. Madison: University of Wisconsin Press.
Heine, S. J., Lehman, D. R., Markus, H. R., & Kitayama, S. (1999). Is there a universal need for positive self-regard? Psychological Review, 106, 766–794.
Hoch, S. J. (1987). Perceived consensus and predictive accuracy: The pros and cons of projection. Journal of Personality and Social Psychology, 53, 221–234.
Hogg, M. A., & Abrams, D. (1990). Social motivation, self-esteem, and social identity. In D. Abrams & M. A. Hogg (Eds.), Social identity theory: Constructive and critical advances (pp. 28–47). New York: Springer-Verlag.
Horowicz, A. (1878). Psychologische Analysen auf physiologischer Grundlage [Psychological analyses on a physiological basis]. Magdeburg, Germany: Faber.
Judd, C. M., & Park, B. (1993). Definition and assessment of accuracy in social stereotypes. Psychological Review, 100, 109–128.
Jussim, L. (2005). Accuracy in social perception: Criticisms, controversies, criteria, components and cognitive processes. In M. P. Zanna (Ed.), Advances in experimental social psychology (Vol. 37, pp. 1–93). San Diego, CA: Academic Press.
Kahneman, D. (2000). A psychological point of view: Violations of rational rules as a diagnostic of mental processes. Behavioral and Brain Sciences, 23, 681–683.
Karniol, R. (2003). Egocentrism versus protocentrism: The status of self in social prediction. Psychological Review, 110, 564–580.
Kawakami, K., Dovidio, J. F., & Dijksterhuis, A. (2003). Effect of social category priming on personal attitudes. Psychological Science, 14, 315–319.
Keen, S. (1992). Fire in the belly. New York: Bantam.
Klar, Y., & Giladi, E. E. (1999). Are most people happier than their peers, or are they just happy? Personality and Social Psychology Bulletin, 25, 585–594.
Krueger, J. I. (1992). On the overestimation of between-group differences. In W. Stroebe & M. Hewstone (Eds.), European review of social psychology (Vol. 3, pp. 31–56). Chichester, UK: Wiley & Sons.
Krueger, J. I. (1996). Probabilistic national stereotypes. European Journal of Social Psychology, 26, 961–980.
Krueger, J. I. (1998a). On the perception of social consensus. In M. P. Zanna (Ed.), Advances in experimental social psychology (Vol. 30, pp. 163–240). San Diego, CA: Academic Press.
Krueger, J. I. (1998b). Enhancement bias in the description of self and others. Personality and Social Psychology Bulletin, 24, 505–516.
Krueger, J. (2000). The projective perception of the social world: A building block of social comparison processes. In J. Suls & L. Wheeler (Eds.), Handbook of social comparison: Theory and research (pp. 323–351). New York: Plenum/Kluwer.
Krueger, J. I. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16–26.
Krueger, J. I. (2002). On the reduction of self-other asymmetries: Benefits, pitfalls, and other correlates of social projection. Psychologica Belgica, 42, 23–41.
Krueger, J. I. (2003). Return of the ego—Self-referent information as a filter for social prediction: Comment on Karniol (2003). Psychological Review, 110, 585–590.
Krueger, J. I., Acevedo, M., & Robbins, J. M. (2006). Self as sample. In K. Fiedler & P. Juslin (Eds.), Information sampling and adaptive cognition (pp. 353–377). New York: Cambridge University Press.
Krueger, J. I., & Clement, R. W. (1996). Inferring category characteristics from sample characteristics: Inductive reasoning and social projection. Journal of Experimental Psychology: General, 125, 52–68.
Krueger, J. I., & Funder, D. C. (2004). Towards a balanced social psychology: Causes, consequences and cures for the problem-seeking approach to social behavior and cognition. Behavioral and Brain Sciences, 27, 313–376.
Krueger, J. I., Hall, J. H., Villano, P., & Jones, M. C. (in preparation). Attribution and categorization processes in the representation of gender stereotypes. Brown University.
Krueger, J. I., Hasman, J. F., Acevedo, M., & Villano, P. (2003). Perceptions of trait typicality in gender stereotypes: Examining the role of attribution and categorization processes. Personality and Social Psychology Bulletin, 29, 108–116.
Krueger, J. I., & Zeiger, J. S. (1993). Social categorization and the truly false consensus effect. Journal of Personality and Social Psychology, 65, 670–680.
Kruger, J. (1999). Lake Wobegon be gone! The "below-average effect" and the egocentric nature of comparative ability judgments. Journal of Personality and Social Psychology, 77, 221–232.
Kunda, Z., & Thagard, P. (1996). Forming impressions from stereotypes, traits, and behaviors: A parallel-constraint-satisfaction theory. Psychological Review, 103, 284–308.
Markus, H. R., & Kitayama, S. (1991). Culture and the self: Implications for cognition, emotion, and motivation. Psychological Review, 98, 224–253.
McCauley, C., & Stitt, C. L. (1978). An individual and quantitative measure of stereotypes. Journal of Personality and Social Psychology, 36, 929–940.
McCauley, C., & Thangavelu, K. (1991). Individual differences in sex stereotyping of occupations and personality traits. Social Psychology Quarterly, 54, 267–279.
Moore, D. A., & Kim, T. G. (2003). Myopic social prediction and the solo comparison effect. Journal of Personality and Social Psychology, 85, 1121–1135.
Mullen, B., Dovidio, J. F., Johnson, C., & Copper, C. (1992). In-group–out-group differences in social projection. Journal of Experimental Social Psychology, 28, 422–440.
Otten, S. (2005). The ingroup as part of the self: Reconsidering the link between social categorization, ingroup-favoritism, and the self-concept. In M. D. Alicke, D. Dunning, & J. I. Krueger (Eds.), The self in social perception (pp. 241–265). Philadelphia: Psychology Press.
Otten, S., & Epstude, K. (2006). Overlapping mental representations of self, ingroup, and outgroup: Unraveling self-stereotyping and self-anchoring. Personality and Social Psychology Bulletin, 32, 957–969.
Otten, S., & Moskowitz, G. B. (2000). Evidence for implicit evaluative in-group bias: Affect-biased spontaneous trait inference in a minimal group paradigm. Journal of Experimental Social Psychology, 36, 77–89.
Otten, S., & Wentura, D. (2001). Self-anchoring and ingroup-favoritism: An individual profiles analysis. Journal of Experimental Social Psychology, 37, 525–532.
Paladino, M.-P., Leyens, J.-P., Rodriguez, R., Rodriguez, A., Gaunt, R., & Demoulin, S. (2002). Differential association of uniquely and non uniquely human emotions with the ingroup and the outgroup. Group Processes & Intergroup Relations, 5, 105–117.
Riketta, M. (2005). Cognitive differentiation between self, ingroup, and outgroup: The roles of identification and perceived intergroup conflict. European Journal of Social Psychology, 35, 97–106.
Robbins, J. M., & Krueger, J. I. (2005). Social projection to ingroups and outgroups: A review and meta-analysis. Personality and Social Psychology Review, 9, 32–47.
Ross, L., Greene, D., & House, P. (1977). The "false consensus effect": An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301.
Royzman, E. B., Cassidy, K. W., & Baron, J. (2003). "I know, you know": Epistemic egocentrism in children and adults. Review of General Psychology, 7, 38–65.
Rubin, M., & Hewstone, M. (1998). Social identity theory's self-esteem hypothesis: A review and some suggestions for clarification. Personality and Social Psychology Review, 2, 40–62.
Schneider, D. J. (2004). The psychology of stereotyping. New York: Guilford.
Searle, J. R. (1995). The mystery of consciousness. New York Review of Books, 92, 60–66.
Sedikides, C., Gaertner, L., & Toguchi, Y. (2003). Pancultural self-enhancement. Journal of Personality and Social Psychology, 84, 60–79.
Sinha, R. R., & Krueger, J. (1998). Idiographic self-evaluation and bias. Journal of Research in Personality, 32, 131–155.
Svenson, O. (1981). Are we all less risky and more skillful than our fellow drivers? Acta Psychologica, 47, 143–148.
Tajfel, H. (1969). Cognitive aspects of prejudice. Journal of Social Issues, 25, 79–97.
Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. G. Austin & S. Worchel (Eds.), The social psychology of intergroup relations (pp. 33–47). Monterey, CA: Brooks/Cole.
Taylor, S. E., & Brown, J. D. (1988). Illusion and well-being: A social psychological perspective on mental health. Psychological Bulletin, 103, 193–210.
Thorndike, E. L. (1924). The influence of the chance imperfections of measures upon the relation of initial scores to gain or loss. Journal of Experimental Psychology, 7, 225–232.
Turner, J. C., Hogg, M. A., Oakes, P. J., Reicher, S. D., & Wetherell, M. (1987). Rediscovering the social group: A self-categorization theory. Oxford, UK: Blackwell.
Waldzus, S., Mummendey, A., & Wenzel, M. (2005). When "different" means "worse": In-group prototypicality in changing intergroup contexts. Journal of Experimental Social Psychology, 41, 76–83.
Zawadzki, B. (1948). Limitations of the scapegoat theory of prejudice. Journal of Abnormal and Social Psychology, 43, 127–141.
Zuckerman, M., & Knee, C. R. (1996). The relation between overly positive self-evaluation and adjustment: A comment on Colvin, Block, & Funder (1995). Journal of Personality and Social Psychology, 70, 1250–1251.
Notes

1. Could it be said that instead of comparative judgments being absolute self-judgments in disguise, absolute self-judgments are comparative judgments in disguise? If so, absolute other-judgments would also have to be disguised comparative judgments, and the correlation between self- and other-judgments should be highly negative. Empirically, however, these correlations tend to be moderately positive.
2. In the Taylor and Brown paradigm, $s_{\text{other}}$ is 0 if all respondents judge accurately.
3. Regressive other-judgments could increase the residual-score correlation indirectly by reducing the correlation between self- and other-judgments and the correlation between criterion and other-judgments. I thank Johannes Ullrich for this suggestion.
4. The correlation between self–other differences and self-judgments is
\[
\frac{s_{\text{self}} - r_{\text{self,other}}\, s_{\text{other}}}{\sqrt{s_{\text{self}}^{2} + s_{\text{other}}^{2} - 2\, r_{\text{self,other}}\, s_{\text{self}}\, s_{\text{other}}}}\,.
\]
The correlation between self-judgments and residualized self-judgments (i.e., controlling for other-judgments) is
\[
\sqrt{1 - r_{\text{self,other}}^{2}}\,.
\]
5. This analysis does not assume that people project the same characteristics to the two groups—only that they project at the same rate. Note also the implications of differential projection for self-enhancement. Sluggish outgroup projection means that self-enhancement would increase if people were to compare themselves to the average outgroup member. Difference-score correlations between self-enhancement and criterion variables (e.g., self-esteem) would be reduced, however. That is, hypotheses regarding the adaptiveness of self-enhancement would be harder to test if the self were evaluated within an outgroup context.
6. Negative projection cannot be inductively justified (Dawes, 1989). Rationally, people may see outgroups as opposites of themselves only if they possess sampling information to that effect. If so, projection would not be at issue.
7. When $p(T|G) < p(T)$, the modified ratio is
\[
\frac{p(T|G)}{p(T)} - 1, \quad \text{and otherwise} \quad 1 - \frac{p(T)}{p(T|G)}
\]
(Krueger, Hasman, Acevedo, & Villano, 2003). LORs are given by
\[
\ln \frac{p(T|G)\,/\,(1 - p(T|G))}{p(T)\,/\,(1 - p(T))}\,.
\]
Difference scores can be derived from ratios as
\[
p(T|G) - p(T) \;=\; p(T|G) \cdot \left(1 - \frac{p(T)}{p(T|G)}\right).
\]
Naturally, the correlations among these indices are high.
8. Notice that it is difficult for a White respondent to obtain a "comfortable" IAT score. If the difference score indicates implicit prejudice, the respondent may feel pressured to conduct a self-examination and to come clean. If the score indicates a preference for the outgroup, the respondent also needs to do extra work in order to explain it (while a preference for Blacks would simply signal a positive social identity to a Black respondent). For a White respondent, the only safe score is zero, or something very close to it. The reduction of a "good" score to a point, while bad or strange scores may appear across a wide range, is familiar from research on errors and biases in the judgment literature. Here, rational responding typically has the status of an unconfirmable null hypothesis (Krueger & Funder, 2004).
9. The confound resulting from such trait selection is that the simple measure of $p(T|G)$ is perfectly correlated with any discrepancy measure across such attributes.
10. The argument that perceptions of A and C are also context dependent begs the question of where perceivers begin to make comparisons. Without some anchor in a presumed reality, points for relativism cannot be scored.
11. The model implies that ingroup favoritism should also diminish when self-enhancement is controlled.
12. Colvin et al. (1996) advocated a variation of the Taylor and Brown (1988) method. Instead of subtracting participants' own judgments of the average person from their self-judgments, they proposed that the average peer judgments of these participants be subtracted.
7

When It Is Rational for the Majority to Believe That They Are Better Than Average

Don Moore
Carnegie Mellon University

Deborah Small
University of Pennsylvania
Behavioral decision research draws much of its energy from the fruitful tension between what people do and what people ought to do. Robyn Dawes brought this energy to bear on the false consensus effect when he asked how exactly people ought to use their own behavior to help them predict the behavior of others (Dawes, 1989, 1990; Dawes & Mulford, 1996). Dawes’s insight came from the application of Bayes’s Rule, a fundamental principle of normative decision making that routinely clashes with human intuition. Bayes’s Rule specifies how one ought to update one’s beliefs based on evidence. Say, for instance, I am asked to estimate the probability that someone like me would consent to wearing a big sign reading “REPENT” around campus for an hour. This request is sufficiently unusual that I probably do not have strong prior beliefs about it—I do not begin this problem with much useful information—but I do have access to one useful data point—myself. Would I wear the “REPENT” sign? In their study, Ross, Greene, and House (1977) found a substantial difference between the beliefs of Stanford students who agreed to
wear the sign and those who refused. Those who agreed believed that a greater proportion of other Stanford students would likewise agree than did those who refused. Dawes (1989, 1990) suggested that this so-called false consensus effect might not be so false after all. Given that a majority of people are, by definition, in the majority on any given issue, people can improve the quality of their predictions by assuming that others will tend to behave the way they do. Just so, Bayesian logic suggests that, if people use themselves as a useful data point in making inferences about others, people will expect others to be more similar to them than they actually are. This insight clarified exactly how much people ought to rely on what they know about themselves when making inferences about others. This represented real theoretical progress and opened the door for researchers to ask whether there is a truly false consensus effect that is greater than what Bayesian logic would predict.
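Dawes’s Bayesian point can be made concrete with a small sketch (our illustration, not Dawes’s code). The unknown population rate of agreement is given a uniform Beta(1, 1) prior, an assumption we make purely for illustration, and is then updated with a single observation: the judge’s own choice.

```python
def posterior_mean_agree(prior_a: float = 1.0, prior_b: float = 1.0,
                         i_agreed: bool = True) -> float:
    """Posterior mean of the population agreement rate under a Beta prior,
    after observing one data point: the judge's own behavior."""
    a = prior_a + (1 if i_agreed else 0)
    b = prior_b + (0 if i_agreed else 1)
    return a / (a + b)

# With a uniform Beta(1, 1) prior, a judge who agreed should estimate that
# 2/3 of others would also agree; a judge who refused should estimate 1/3.
# A gap this large is a rational, not a "false," consensus effect.
print(posterior_mean_agree(i_agreed=True))   # 0.666...
print(posterior_mean_agree(i_agreed=False))  # 0.333...
```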
Better-Than-Average Effects

This paper extends the logic of Dawes’s Bayesian critique to a different phenomenon: the so-called better-than-average (BTA) effect. BTA effects describe the tendency for people to believe that they are different from—and better than—others. In the words of Peterson (2000), “Apparently, in our minds, we are all children of Lake Wobegon, all of whom are above average” (p. 45). Indeed, a number of important theories have begun with the assumption that human judgment is biased in this way, and have sought to explain it (Baumeister, 1998; Benabou & Tirole, 2002; Brown, 1998; Daniel, Hirshleifer, & Subrahmanyam, 1998; Dunning, 1993; Epstein, 1990; Greenwald, 1980; Steele, 1988; Taylor & Brown, 1988). By applying the same Bayesian logic that Dawes used in his critique of the false consensus effect, we will see that there are circumstances in which better-than-average beliefs may be justified.
Evidence across many domains has demonstrated the important implications of BTA biases. If stock market investors believe they are better than other investors at identifying promising investments, that might help explain why there is so much more trading activity in most financial markets than would be predicted on the basis of traditional economic theory (Odean, 1998). If CEOs believe they are better managers than are other CEOs, that could help explain why there are so many more corporate acquisitions than there ought to be (Malmendier & Tate, 2005). If disputants believe their claims are more justified than are those of others, it could explain the prevalence of costly and inefficient conflicts such as labor strikes and lawsuits going to trial (Babcock
& Loewenstein, 1997; Neale & Bazerman, 1985). If nations believe their armies are stronger than those of other nations, it could help explain their willingness to go to war (Howard, 1983).
As a rule, scholars have treated BTA beliefs as biases, produced by self-serving cognitions that lead people to see themselves in an unrealistically positive light. In this paper we explore the normative issues surrounding the question of how people ought to infer the abilities and performances of others. In the process, we call into doubt the degree to which BTA effects are really biases. Our theory implies that there are circumstances in which a majority of rational Bayesians ought to believe that they are better than others. This theory presents a number of testable hypotheses. We test some of the key hypotheses in a series of three experiments.
Normative Explanations

Under what circumstances could more than 50% of rational people believe that they are better than average? This can be the case in at least two situations. First, more than 50% of people will be above average in a negatively skewed distribution. For instance, most people have more legs than average.1 If 1% of the population is missing a leg, the average number of legs per person is 1.99, and 99% of the population has more legs than average. Negatively skewed distributions are especially likely whenever ceiling effects limit performance, such as the fact that a surplus of legs (over two) is highly improbable.
The second instance in which rational people will believe, on average, that they are better than average is when they have imperfect information about performance and outcomes are better than expected (Moore & Small, 2007). Consider an illustrative example. I attempt to accomplish some feat that I do not expect everyone to be able to accomplish, such as passing my driver’s test. Let’s say, coming into it, I expect 70% of people to pass. I pass. Therefore, I am above average, but maybe part of the reason for my success is that the task is easy, so I should raise my estimate of the percentage of others who pass. As long as I believe the average pass rate to be below 100%, I will be above average. If everyone expects a 70% pass rate, but the test turns out to be easier than that, the average test-taker will believe that he or she is above average. This explanation holds similarly for tasks on which performance is not limited to 100% (success) and 0% (failure). To be precise, when average performance exceeds average expectations, people will be normatively justified in believing that they are better than average (and better than the median, for that matter).
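A worked version of the driver’s-test arithmetic follows, assuming a Beta prior with mean .70; the chapter specifies only the 70% expectation, so the prior’s strength is our arbitrary choice.

```python
# Shared prior belief: about 70% of people pass the driver's test.
prior_mean, prior_strength = 0.70, 10.0
a, b = prior_mean * prior_strength, (1 - prior_mean) * prior_strength  # Beta(7, 3)

# I take the test and pass (one observed success), so I update my estimate
# of the pass rate for others upward -- but not all the way to 100%.
a_post = a + 1
estimated_pass_rate_others = a_post / (a_post + b)
print(round(estimated_pass_rate_others, 3))  # ~0.727

# My own outcome (a pass) still beats my estimate of the average other, so I
# rationally believe I am above average. If the test is easier than everyone
# expected, most test-takers update in exactly this way at the same time.
```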
We will refer to this as the differential information explanation, because it posits that differential information leads estimates of self and others to be differentially regressive to the prior. Where do priors come from? Often, people have some relevant experience, but evidence suggests that when people are unsure about how outcomes will be distributed, their predictions gravitate toward a uniform distribution in which all outcomes are predicted to be equally likely (Bruine de Bruin, Fischhoff, Millstein, & Halpern-Felsher, 2000; Fox & Rottenstreich, 2003).
We should note that three sensible assumptions are essential to obtaining this differential information effect. First, people have better information about themselves than they do about others. Second, people believe some portion of their own performances is due to their own idiosyncratic abilities or to luck. Without this assumption, rational people would expect others to perform identically to themselves, and there would be no BTA effects. However, the empirical evidence clearly supports the plausibility of this assumption: People use themselves as a useful, if imperfect, starting point for estimating others (Krueger, Acevedo, & Robbins, 2005). Third, when estimating self and others, people have to rely on both their prior expectations and the evidence at their disposal. If they ignore their prior expectations because they believe those expectations to be uninformative, and instead believe that everyone will perform exactly as they have, there will be no BTA effects.
To clarify the differential information explanation, it might be useful to describe how it could explain prior BTA findings. For instance, a number of studies have found that people rate positive personality attributes as more descriptive of themselves than of others (Alicke, 1985; Brown, 1986). People generally express positive traits more often than negative ones. Furthermore, they are more familiar with their own intentions and behaviors than with those of others. The majority of people, who generally try to be friendly, cooperative, and dependable, can reasonably infer that there are many ways for others to embody these traits less than they do, and fewer ways in which others could embody them more. It follows that rational people could believe that they are more friendly, cooperative, and dependable than are others (Fiedler, 1996). On the other hand, for negative behaviors such as being dishonest, phony, or rude, people know that they personally express them rarely, but they cannot be as sure about others. It follows that they ought to infer that these negative traits (and rare behaviors) are less descriptive of themselves than of others (Fiedler, 2000).
Perhaps the most frequently cited BTA effect is Svenson’s (1981) finding that people believe they are better drivers than others. Undeniably, people have more information about their own driving habits than those of others. As a rule, people obey traffic laws and successfully avoid accidents; most of the information they have about themselves is positive. Other drivers might be just as competent, but it is hard to say so with equal confidence, since people have less information about others. It is easy to see the many ways in which others could be worse drivers than oneself. Estimates of others, then, should be more moderate than estimates of self. It follows that people will believe they are better drivers than are others.
It should be obvious at this point that both of the explanations we have offered (skewed distributions and differential information) also hypothesize circumstances in which people will believe themselves to be worse than others. In positively skewed distributions, the majority of people are below average. This is the case, for instance, with income. Income is generally limited to zero at the bottom end. As long as there are a few CEOs raking in a few hundred million dollars, the vast majority of people will have incomes that are below average. One simple alternative is to ask participants to report their percentile ranks instead of comparing themselves with the average: It is statistically impossible for more than half of people to be above the 50th percentile. The differential information explanation would predict that people will believe themselves to be worse than others (and worse than the median) whenever their outcomes are worse or the probability of success is lower than expected.
Recent empirical results bear this prediction out. For instance, people report that they are less likely than others to experience rare events, such as finding a $20 bill on the ground in the next two weeks (Chambers, Windschitl, & Suls, 2003). People believe that they are worse than others at difficult tasks such as coping with the death of a loved one (Blanton, Axsom, McClive, & Price, 2001; Kruger, 1999), and people believe they will lose difficult competitions, such as trivia contests on indigenous vegetation of the Amazon (Moore & Cain, 2007; Moore, Oesch, & Zietsma, 2007; Windschitl, Kruger, & Simms, 2003). Other researchers have provided explanations for these effects based on egocentric biases (for a review, see Chambers & Windschitl, 2004). However, in this paper, we focus on the differential information account because it is a more parsimonious explanation for these effects, because it offers broader explanatory power (Moore & Small, 2007), and because it takes into account the normative question of how people ought to make self–other comparative judgments.
What are some of the testable implications of the differential information theory? We test three. First, estimates of others should be more regressive than are
estimates of self. Second, to the extent that people have imperfect information about their own performances, self-estimates should also be regressive, but less so than estimates of others. The hypothesized patterns are illustrated in Figure 7.1. These first two implications are tested in all three experiments. The first experiment examines beliefs about the likelihood of future life events. The second experiment examines beliefs about performance on a task. A third implication of the differential information explanation is that differential regressiveness should be exaggerated when people expect themselves to be different from others and should be reduced when people expect themselves to be similar to others. This third implication is tested in the third experiment.
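The differential regressiveness behind these implications can be illustrated with a toy simulation (our sketch of the pattern hypothesized in Figure 7.1; the evidence weights are assumptions, not estimates from any data).

```python
import numpy as np

# Everyone shares a prior expectation of 50; estimates blend the prior with
# evidence, and the evidence about others is weighted less than evidence
# about the self (because people know less about others).
rng = np.random.default_rng(1)

n = 1000
prior = 50.0
true_scores = rng.uniform(0, 100, n)

W_SELF, W_OTHER = 0.8, 0.3   # assumed evidence weights (better info about self)
self_est = W_SELF * true_scores + (1 - W_SELF) * prior
other_est = W_OTHER * true_scores + (1 - W_OTHER) * prior

# When actual performance is high, self-estimates exceed other-estimates
# (a BTA belief); when it is low, the pattern reverses (a WTA belief).
high, low = true_scores > 70, true_scores < 30
print(self_est[high].mean() - other_est[high].mean())   # positive
print(self_est[low].mean() - other_est[low].mean())     # negative
```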
Figure 7.1 Estimated scores for self and other (hypothesized). [Line graph: estimated score as a function of actual score, with lines for actual scores, self-estimates, and other-estimates; other-estimates are hypothesized to be the most regressive.]

Experiment 1: Relative Probabilities

Experiment 1 applies the differential information explanation to account for relative probability judgments. For some time, researchers have believed that people were motivationally biased to believe that they are more likely than others to experience positive life events and less likely than others to experience negative life events (Weinstein, 1980). Two recent studies have shown that while people do believe themselves to be above average in their likelihood of experiencing common events
(such as living past 70), they often rate themselves as below average in their likelihood of experiencing rare life events (such as living past 100) (Chambers et al., 2003; Kruger & Burrus, 2004). Experiment 1 replicates these effects and also elicits participants’ beliefs regarding the actual probabilities of experiencing common and rare events for both self and others.

Method

Participants. We recruited 158 participants from the student body at Carnegie Mellon University in Pittsburgh, Pennsylvania. Half of the participants (79 individuals) received payment of $2 for their participation; the other 79 participants received credit toward a course requirement. The participants had an average age of 21 years, and 57% of them were male.

Procedure. Participants saw a list of 24 life events (shown in Table 7.1). These 24 events were chosen because they varied with respect to both their frequency and their valence. Half of these events are positive and half are negative; half of the events are common and half are rare. For each event, we asked participants to estimate the probability that the event would happen to them at some point in their lives. A separate page asked participants to estimate, for each event, the probability that it would occur at some point in the life of the typical participant in the experiment. A third page asked participants to estimate their comparative likelihood of experiencing the event. Participants were asked to estimate their percentile rankings relative to all other participants in the experiment:

If you think you are more likely than anyone else in this experiment to experience the event, enter “100” as your percentile. If you think that you are the least likely person to experience the event, enter “0” as your percentile. If you think your chances are exactly in the middle, enter “50” as your percentile. All numbers between 0 and 100 are acceptable responses.

The experimental materials included two different order manipulations to rule out idiosyncratic effects of order not relevant to the present hypothesis. The first order manipulation varied the sequence of the three estimation tasks (estimating for self, for others, and comparatively) using a Latin-squares design. The three orders were: (1) self-other-comparison, (2) other-comparison-self, (3) comparison-self-other. The second order manipulation varied the order in which the 24
Table 7.1 Estimates of Experiencing Each of 24 Life Events for Comparative Likelihood Expressed as a Percentile Rank and Absolute Probabilities for Self and for the Average Participant (Experiment 1)

Events | Percentile | Self | Other

Positive Events: Rare Events
Record your own music CD | 33.5 (33.5) | 29.1 (37.4) | 31.0 (32.9)
Win over $10 million in the lottery | 22.9 (25.9) | 14.1 (26.3) | 15.6 (26.2)
Inherit over $10,000 from a distant relative | 28.4 (27.9) | 23.9 (29.0) | 35.1 (28.2)
Write a book | 40.0 (30.0) | 35.5 (32.0) | 35.3 (28.5)
Live past the age of 100 | 32.5 (27.2) | 25.2 (29.9) | 23.6 (29.8)
Have over $1 million in savings | 61.5 (26.3) | 61.5 (28.2) | 49.7 (27.5)

Positive Events: Common Events
Have a stranger spontaneously compliment your appearance | 59.3 (25.8) | 64.8 (31.4) | 54.6 (25.9)
Obtain a starting salary over $25,000 per year | 76.3 (21.0) | 87.1 (18.6) | 80.2 (18.0)
Live past the age of 70 | 60.3 (23.5) | 66.1 (24.0) | 62.4 (22.5)
Be elected an officer of an organization | 68.3 (24.4) | 72.7 (27.3) | 60.0 (26.1)
Own your own home | 76.5 (20.4) | 89.1 (16.8) | 79.0 (17.6)
Own your own car | 79.1 (20.9) | 92.0 (17.7) | 84.6 (19.0)

Negative Events: Rare Events
Be killed in a terrorist attack | 31.5 (25.7) | 15.9 (20.2) | 14.5 (19.4)
Be struck by lightning | 32.9 (26.5) | 20.1 (31.8) | 20.2 (30.6)
Become addicted to crack cocaine | 13.1 (21.2) | 6.7 (15.4) | 16.1 (21.7)
Go to jail | 19.9 (22.6) | 9.3 (14.4) | 17.0 (20.2)
Be shot with a gun | 31.2 (26.0) | 15.4 (19.8) | 15.0 (16.9)
Get AIDS | 19.3 (21.0) | 10.5 (16.9) | 15.2 (17.5)

Negative Events: Common Events
Be interrupted during dinner by calls from telemarketers | 54.7 (26.9) | 68.3 (35.7) | 72.0 (32.0)
Slip and fall on the ice | 67.6 (25.2) | 79.2 (26.7) | 78.2 (21.7)
Be in an automobile accident | 50.0 (28.0) | 53.9 (32.6) | 53.6 (29.5)
Attend the funeral of a loved one | 62.9 (26.9) | 79.0 (32.4) | 76.8 (32.6)
Get lost driving in an unfamiliar city | 64.1 (29.5) | 75.6 (29.3) | 69.3 (27.8)
Cut yourself shaving | 59.0 (32.8) | 71.3 (36.3) | 69.6 (32.0)

Note. Standard deviations in parentheses.
events appeared on each of the three pages in the experimental materials. Some participants saw the 24 events in one randomly determined order; the other participants saw the same events in the reverse order.

Results and Discussion

We dropped from the analysis five participants who failed to complete the questionnaire. Results for the 24 life events appear in Table 7.1.
First, we sought to verify replication of BTA and worse-than-average (WTA) effects. To do this, we computed average percentile rankings for each participant for each of the four categories of events (positive/frequent, positive/rare, negative/frequent, and negative/rare). We then subjected these measures to a 2 (valence) × 2 (frequency) repeated-measures ANOVA. Recall that participants were all asked to rate themselves relative to other participants in the experiment, so if participants were accurate in their assessments, the average percentile rank for all participants would have to be 50. The results are consistent with prior results: People rated their relative likelihood of experiencing events as higher when the events were common (mean percentile = 64.7, SD = 15.27) than when they were rare (mean percentile = 30.8, SD = 15.19, F [1, 150] = 359, p < .001, η² = .71). There was also a significant main effect of valence: Participants rated themselves as more likely to experience events that were positive (mean percentile = 53.1, SD = 11.3) than those that were negative (mean percentile = 42.4, SD = 12.2, F [1, 150] = 93, p < .001, η² = .38). The interaction is not significant (F [1, 150] = 1.25, p = .265, η² = .01).
To test for the statistical significance of the hypothesized differences in participants’ estimates of absolute probabilities, we computed means for each participant for each of the four categories of events (positive/frequent, positive/rare, negative/frequent, and negative/rare). We subjected these means to a 2 (valence) × 2 (frequency) × 2 (target: self versus other) repeated-measures ANOVA. The hypothesized target × frequency interaction is significant (F [1, 152] = 88.9, p < .001, η² = .37). Participants rated themselves more likely (M = 74%, SD = 15%) than others (M = 70%, SD = 15%) to experience common events but less likely (M = 22%, SD = 14%) than others (M = 24%, SD = 16%) to experience rare events.
This two-way interaction is qualified by a significant three-way target × frequency × valence interaction (F [1, 152] = 7.79, p = .006, η² = .05). This three-way interaction is illustrated in Figure 7.2.
Figure 7.2 Estimated absolute probabilities of experiencing negative (Panel A) and positive (Panel B) events, Experiment 1. Error bars show standard errors. [Each panel plots mean probability estimates (10–80%) for self and other as a function of event frequency (rare versus common).]
It describes the fact that the hypothesized target × frequency interaction is strongest where it is consistent with motivational bias and smallest where motivation acts against it. The effect is strongest for rare negative events and common positive events. This same effect did not show up in the direct comparative judgments (percentile rankings), since the frequency × valence interaction is not significant there. This
apparent inconsistency is not troublesome for two reasons. First, it is not troublesome for our theory because our theory makes no claims about motivational influences on comparative judgment. Second, it does not imply inconsistency between direct and indirect comparative judgments because the way we measured them did not allow us to infer their percentile ranks from their absolute assessments of self and others.
We acknowledge that the absolute differences in probability estimates for self and other are not large. More important than the absolute size of these differences is their direction. Of 3,672 comparisons (24 events × 153 participants), only 26% go in the opposite direction from that predicted by our theory (i.e., participants estimating that they are more likely to experience rare events and less likely to experience common events than is the typical participant). It is the consistency of this directional difference that produces such a reliable target × frequency interaction.
Not every result of this experiment is perfectly consistent with our expectations. For our participants (ambitious students at a selective university), contrary to our expectations, amassing a million dollars in savings was not seen as a rare event, yet the overall tests are clearly consistent with our hypotheses. Furthermore, the regression argument would predict that the consistency in expected differences between self and other should decrease as the task domain moves away from the extremes in performance or probability. The anomalies we do observe tend to occur where they are least problematic for the differential information theory.
We predicted that, in addition to obtaining BTA effects for common events and WTA effects for rare events, we would find that participants also underestimated their chances of experiencing common events and overestimated their chances of experiencing rare events. While we cannot say for certain how likely our participants will be to experience each of the 24 events during the courses of their lives, the results shown in Table 7.1 appear to be consistent with our hypotheses. Whereas our participants estimated their chances of being struck by lightning at 20%, the average American’s probability is closer to .009% (U.S. Census Bureau, 2002), and whereas our participants estimated that they stood a 14% chance of winning over $10 million in the lottery, at least in Pennsylvania, where they reside, the probability of winning the Powerball jackpot (the prize most likely to exceed $10 million) is closer to 0.0000008% (Pennsylvania Lottery, 2004).2 As for common events, while participants estimated that they stood only an 87% chance of obtaining a starting salary over $25,000 per year, the records of the Carnegie
Mellon Career Center indicate that approximately 97.5% of Carnegie Mellon’s class of 2003 started at jobs with salaries over $25,000 (Carnegie Mellon Career Center, 2003). Finally, whereas our participants estimated that they only had a 71% chance of cutting themselves shaving, virtually everyone will cut themselves shaving at some point.
The findings of Experiment 1 and the differential information explanation can help make sense of a set of inconsistent findings in research on risk perception. Research indicates that people tend to overestimate, sometimes radically, the probability that they will experience rare events. For example, Lerner, Gonzalez, Small, and Fischhoff (2003) reported that after September 11, 2001, Americans estimated their probability of being injured in a terrorist attack as 20%. Other examples come from perceptions of health risks. In one study, smokers reported a 37% chance that they will get cancer due to smoking (Viscusi, 1990). The actual risk that smokers will fall ill with lung cancer is around 5 to 10%. However, when smokers are asked whether they are at more or less risk than other smokers, they frequently report believing their risk is below average (Slovic, 2000; Weinstein, 1984). Similarly, women have been found to overestimate the probability that they will fall ill with breast cancer, often by as much as eight times the true probability (Lipkus, Biradavolu, Fenn, Keller, & Rimer, 2001). At the same time, people often believe that their risk for experiencing these rare events is below average (Woloshin, Schwartz, Black, & Welch, 1999). The regression explanation can account for both these findings: People’s estimates of their own chances are regressive, but their estimates of others’ chances are even more regressive.
Experiment 2: The Trivia Quiz

The first experiment presents evidence consistent with the regression explanation. However, it is limited because it does not include measures of actual performance. This limitation prevented us from assessing the accuracy of judgment. Experiment 2 solves this problem. Experiment 2 also introduces a new test of the differential information explanation. The differential information explanation applies to people comparing themselves with others. When people are comparing other individuals to each other and self-knowledge is not relevant or useful, the regression account would not predict biases in comparative judgments. To test this implication, Experiment 2 includes a condition in which participants are asked to compare two randomly chosen individuals.
Method

Participants. We recruited participants after classes at Carnegie Mellon University. An experimenter invited students in each of six classes to remain for 10 minutes after class and complete an experiment for cash payment; 215 individuals agreed to participate.

Procedure. Each participant was given a packet of questionnaires, beginning with a 10-item trivia quiz and an 11th tiebreaker question. Half the participants received a simple quiz including questions such as “How many inches are there in a foot?” whereas others received a difficult quiz including questions such as “What is Avogadro’s number?” The tiebreaker question (“How many people live in Pennsylvania?”) was scored based on participants’ distance from the correct answer and virtually eliminated the possibility of a tied score.
Next, instructions informed participants that they had earned $3 and that they could wager any amount of it on the trivia competition. If they bet and won, the amount they bet would be doubled. Half the participants bet on whether they would beat a randomly chosen opponent (the tiebreaker question resolved tied scores). The other half of participants bet on whether a randomly selected (anonymous) protagonist would beat a randomly chosen opponent. For those betting on this random protagonist, the random nature of the selection of the protagonist was driven home by asking them to draw a number out of a hat to determine on whom they would be betting.3 The experiment therefore employed a 2 (difficulty) × 2 (protagonist) between-subjects design.

Dependent measures. Participants made both comparative and absolute judgments of performance. Comparative measures included participants’ bets, estimates of their probability of winning, and their responses to the question, “How do you expect that you [the person whose number you drew] will score relative to all other people taking the same test?” The response scale ran from 1 to 7, with labels at 1 (well below average), 4 (average), and 7 (well above average). Participants’ scores on the actual quiz served as measures of absolute performance. In addition, participants were asked to estimate scores for (1) the protagonist (self or the randomly chosen protagonist), (2) the randomly chosen opponent, and (3) the average person. We assessed confidence in these estimates by asking participants to specify (for both the protagonist and opponent) scores above and below their guesses such that they were 90% sure the true score fell within that range.
Results and Discussion

Manipulation Check. Scores on the simple test were higher (M = 8.51 out of 10, SD = 1.48) than scores on the difficult test (M = 1.66 out of 10, SD = 1.26, t [213] = 36.54, p < .001).

Comparative Judgments. To assess the effects of the manipulations on comparative judgments, we subjected participants’ bets to a 2 (difficulty) × 2 (protagonist) ANOVA. The results reveal a main effect for difficulty: Those taking the simple quiz bet more (M = $1.84, SD = $1.01) than did those taking the difficult quiz (M = $1.38, SD = $1.14, F [1, 210] = 11.34, p = .001, η² = .05). These results parallel results of asking participants to estimate the protagonist’s percentile rank. Again, the same 2 × 2 ANOVA reveals a main effect for difficulty, such that the protagonist’s percentile rank is estimated to be higher in the simple quiz condition (M = 57, SD = 17) than in the difficult quiz condition (M = 41, SD = 17, F [1, 209] = 67.23, p < .001, η² = .24).4 However, this main effect is qualified by a significant difficulty × protagonist interaction (F [1, 209] = 23.54, p < .001, η² = .10). The effect of difficulty on comparative evaluation was greater for those betting on self than for those betting on a randomly selected person (see Table 7.2).

Absolute Judgments. To test the effect of the experimental manipulations on participants’ estimates of absolute performance for self and opponent, we conducted a 2 (difficulty) × 2 (protagonist) × 2 (target: protagonist versus opponent) mixed ANOVA with repeated measures on target. Naturally, the results reveal a significant between-subjects main effect of difficulty: Participants taking the simple test predicted higher scores (M = 7.5, SD = 1.4) than did participants taking the difficult test (M = 3.0, SD = 1.5; F [1, 173] = 443.36, p < .001, η² = .72). This main effect is qualified by a significant target × difficulty interaction (F [1, 173] = 25.36, p < .001, η² = .13). Participants estimated that the easy test would be easier for the protagonist (M = 7.9, SD = 1.5) than for the opponent (M = 7.4, SD = 1.5), whereas the difficult test would be more difficult for the protagonist (M = 2.9, SD = 1.6) than for the opponent (M = 3.3, SD = 1.5). This two-way interaction is qualified by the expected three-way target × difficulty × protagonist interaction (F [1, 173] = 8.72, p = .004, η² = .05). This three-way interaction reveals that, consistent with the differential information explanation, differences between scores predicted for protagonist and opponent are greater for those betting on themselves than for those betting on a randomly selected person (see Table 7.2).
Table 7.2 Predicted and Actual Scores by Experimental Condition (Experiment 2)

Measure | Betting on Self: Simple | Betting on Self: Difficult | Betting on a Random Person: Simple | Betting on a Random Person: Difficult
Actual score (out of 10) | 8.58a (1.59) | 1.76b (1.34) | 8.48a (1.43) | 1.60b (1.22)
Bet (up to $3) | $2.13a ($.88) | $1.41b ($1.23) | $1.69a,b ($1.05) | $1.37b ($1.15)
Probability of winning | 64%a (20%) | 37%c (24%) | 51%b (21%) | 39%c (21%)
Percentile rank | 61.3a (17.9) | 30.6d (20.5) | 54.3b (16.7) | 46.4c (12.4)
Protagonist’s score (estimated) | 8.37a (1.46) | 2.22c (1.34) | 7.66a (1.48) | 3.32b (1.62)
Opponent’s score (estimated) | 7.68a (1.02) | 3.06b (1.41) | 7.32a (1.64) | 3.37b (1.57)
Average score (estimated) | 7.12a (1.11) | 3.07b (1.16) | 7.05a (1.48) | 3.29b (1.33)
Size of 90% confidence interval—protagonist | 2.79a (1.43) | 3.09a (1.31) | 4.33b (2.05) | 4.69b (1.94)
Size of 90% confidence interval—opponent | 3.92a (1.75) | 4.39a (2.15) | 4.13a (2.09) | 4.89a (2.14)

Note. Standard deviations in parentheses. Figures on the same row with different superscripts are significantly different from each other (p < .05).
For those betting on a randomly selected person, the estimated score for that protagonist is not significantly different from the score estimated for their randomly chosen opponent for either the easy quiz (t [57] = 1.84, p = .071) or the difficult quiz (t [58] = −1.00, p = .32).

Measures of Regressiveness. Participants were imperfect estimators of performance. As shown in Figure 7.3, participants’ estimates look regressive, because error leads estimates to be less extreme than the actual scores. Consistent with the assumption that individuals have more information about themselves than about others, participants estimated their own scores with greater accuracy than others’ scores. Nevertheless, people estimated their own scores with some error. When performance was low, they overestimated it; those taking the difficult test got an average of only 1.66 (SD = 1.26) correct, yet estimated that they got 2.22 (SD = 1.34) correct. When performance was high, they underestimated it; those taking the simple test got an average of 8.51 (SD = 1.59) correct, yet estimated that they got 8.37 (SD = 1.46) correct. This effect is statistically significant, as shown by the difficulty × measure interaction (F [1, 211] = 71.72, p < .001, η² = .25) in a 2 (difficulty) × 2 (measure: actual versus estimated score) ANOVA with repeated measures on the second factor.
Figure 7.3 Estimated scores for self and opponent by those betting on the self, Experiment 2. [Plot of estimated score against actual score (0–10), with series for actual scores, self-estimates, and other-estimates.] There is more noise and variability in the middle of the scale on Figure 7.3 because there are so few participants with scores of 5 and 6. Most of the scores lie at the extremes.
These results are consistent with the well-documented hard/easy effect in judgments of confidence (Erev, Wallsten, & Budescu, 1994).
Each participant estimated a 90% confidence interval for both protagonist and opponent. Participants knew that their estimates of their own scores were not perfectly accurate; they established confidence intervals around their answers that were, on average, 2.97 points in width (SD = 1.40). Their confidence intervals are even wider when estimating others’ performances (M = 4.39, SD = 2.08). In a 2 (difficulty) × 2 (protagonist) × 2 (target: protagonist versus opponent) mixed ANOVA with repeated measures on target, the main between-subjects effect of protagonist (F [1, 173] = 11.45, p < .01) and the within-subject effect of target (F [1, 173] = 29.91, p < .001) are significant, but they are qualified by a significant target × protagonist interaction (F [1, 173] = 37.75, p < .001), reflecting greater confidence when predicting their own scores than when predicting others’ scores (Table 7.2).
We also expected that participants would use the self as a basis for estimating others, which would also contribute to greater regressiveness in estimates of others. In other words, for those betting on themselves, actual performance would serve as the basis for predicting the performance of the protagonist, which would in turn serve as
the basis for predicting the performance of the opponent. To examine this hypothesis, we conducted a mediational analysis (Baron & Kenny, 1986) in which actual performance and self estimate were used to predict opponent estimate. The results are shown in Figure 7.4A and indicate that estimates of self mediate the effect of actual performance on estimates of opponent. The significance of the indirect effect was tested using Sobel’s (1988) equation (z = 2.00, p < .05). The converse, however, is not true (see Figure 7.4B); opponent estimates do not fully account for the effect of test performance on estimates of self. Both greater error in estimating others’ scores and the use of self as a basis for estimating others contribute to more regressive estimates of others than of self. After just having overestimated their own performances, those taking the difficult quiz overestimated the performances of others to an even greater extent. Similarly, those who had taken the simple quiz underestimated their performances and underestimated the performances of their opponents even more.

Mediational Tests of Comparative Judgment. If it is the more regressive predictions of others that lead to both BTA and WTA effects, then the estimated differences between one’s own score and those of others should mediate the effect of test difficulty on comparative judgments for those who were betting on themselves. To test this hypothesis, we conducted three sets of regressions, shown in Figure 7.4C. First, comparative judgments were regressed on a dummy variable for test difficulty. Indeed, those who took the difficult test rated themselves as worse than those who took the simple test. In the second regression, test difficulty had a dramatic influence on the predicted difference between scores for self and for opponent. When both test difficulty and predicted score difference are used to predict comparative judgments, both remain significant. Differences between predicted absolute scores for self and the group account for 62.2% of the variance in comparative self-evaluation. Nevertheless, these predictions do not fully mediate the relationship between difficulty and relative evaluation because test difficulty remains a significant predictor of comparative evaluation after controlling for differences in predicted absolute scores (Sobel test: z = −.49, p = .63).5
Whereas it is comforting that differences between participants’ absolute estimates of self and other influence self–other comparisons, our theory makes a bolder prediction regarding the confidence with which participants make these estimates of absolute performance.
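For reference, the Sobel statistic used in the mediation analyses above can be computed from the two path coefficients and their standard errors. This is a generic sketch; the path values and standard errors below are hypothetical placeholders, since the chapter reports only the resulting z statistics.

```python
import math

def sobel_z(a: float, se_a: float, b: float, se_b: float) -> float:
    """z statistic for the indirect effect a*b in a simple mediation model,
    where a is the path X -> M and b is the path M -> Y (controlling X)."""
    return (a * b) / math.sqrt(b**2 * se_a**2 + a**2 * se_b**2)

# Hypothetical path estimates and standard errors, for illustration only:
print(round(sobel_z(a=0.95, se_a=0.04, b=0.88, se_b=0.30), 2))
```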
Figure 7.4 Mediation analyses (among those betting on themselves), Experiment 2. * p < .05. ** p < .01. [Path diagrams. Panel A: actual performance → self estimate (β = .95**); self estimate → opponent estimate (β = .88**); direct path from actual performance to opponent estimate, β = .03 (β = .87** bivariate). Panel B: actual performance → opponent estimate (β = .87**); opponent estimate → self estimate (β = .69**); direct path from actual performance to self estimate, β = .31** (β = .95** bivariate). Panel C: test difficulty → predicted difference in scores (β = −.49**); predicted difference in scores → percentile rank (β = .66**); direct path from test difficulty to percentile rank, β = −.25* (β = −.62** bivariate).]
Specifically, we would predict that when self is predicted with much greater confidence than other, both BTA and WTA effects should be stronger, since estimates made with lower confidence will naturally be more regressive. We do have measures of the confidence with which participants made their estimates of absolute performance: their 90% confidence intervals. We constructed a measure of differential confidence by subtracting the size of the confidence interval for protagonist from the size of the confidence interval for opponent. We then correlated this measure of differential confidence with participants’ comparative judgments. Consistent with the regression explanation, that correlation was positive and significant among those who took the simple quiz (r [88]
= .26, p < .05): The greater the difference between the confidence with which they estimated self and other, the more likely they were to rate themselves above average. Also consistent with the regression explanation, that correlation reversed itself among those taking the difficult quiz (r [88] = −.20, p = .058): The greater the difference in confidence between self and other, the more likely participants were to rate themselves below average. The correlation difference test reveals the difference between these two correlations to be statistically significant (Z = 3.06, p < .001).

How Much Variance Does This Theory Account For?

The results clearly replicate prior results showing that task difficulty affects beliefs about relative performance. The results also show that differential regressiveness is indeed at work, but how much of the effect of task difficulty can be accounted for by differential regression? To answer this question, we first have to assess the size of the effect of difficulty. When difficulty is used as the sole independent variable in a regression predicting self-reported percentile rank for those betting on themselves, the resulting R-squared value indicates that difficulty accounts for 39.4% of the variance. Differential regressiveness in participants’ absolute estimates of performance by self and other, as measured by their indirect comparative judgments (estimated score for self minus estimated score for opponent), on its own, accounts for 62.2% of the variance in estimated percentile rank. When both difficulty and the indirect comparison are included as independent variables, the resulting R-squared value indicates that combined they account for 66.9% of the variance in estimated percentile rank. This means task difficulty accounts for 4.7% of the variance in comparative judgments, over and above the differential regressiveness in people’s absolute judgments. This 4.7% represents just 12% of the total effect of difficulty (39.4%) on direct comparative judgments. The implication is that differential regressiveness accounts for 88% of the effect of difficulty on comparative judgments, at least for self-estimates, leaving 12% of the variance for other explanations, such as differential weighting (Klar & Giladi, 1999; Kruger, Windschitl, Burrus, Fessel, & Chambers, 2006).
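The decomposition just described is simple arithmetic on the reported R-squared values and can be checked directly:

```python
# Variance decomposition for Experiment 2 (those betting on themselves),
# using the R-squared values reported in the text above.
r2_difficulty = 0.394   # difficulty alone
r2_indirect = 0.622     # indirect comparison (self minus opponent) alone
r2_combined = 0.669     # both predictors together

unique_difficulty = r2_combined - r2_indirect          # 0.047
share_unexplained = unique_difficulty / r2_difficulty  # ~0.12
print(unique_difficulty, share_unexplained)
# Differential regressiveness thus accounts for ~88% of the difficulty effect.
```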
Among those who had little useful information about the people whose percentile ranks they were estimating, however, this same analysis suggests that differential regressiveness accounts for less than 9% of the effect of difficulty on comparative judgments. Part of the issue here is that the effect of difficulty is so small among this group (R-squared = 6.8%). In any case, our theory would not predict differential regressiveness among this group, given that they do not have better information about the protagonist than the opponent.

Reconciliation With Prior Results. The disappearance of BTA and WTA effects among people betting on a randomly selected individual stands in contrast to the results of Moore and Kim (2003), who replicated BTA and WTA effects even when comparing two other individuals. The key difference between Experiment 2 and their approach is probably related to the degree to which participants focused on the other person on whom they were betting. Moore and Kim (2003, Experiment 4) showed that focusing on the other person was key to the effect. When people took the perspective of the person on whom they were betting, they made predictions about that person in the same way they made predictions about themselves. Sanbonmatsu, Shavitt, Sherman, and Roskos-Ewoldsen (1987) have shown that focusing on a particular target can be enough to lead people to make more regressive estimates of other individuals who are not in focus. However, in the present study participants drew that person’s number from a hat, highlighting the random and anonymous nature of the other.

Skewed Distributions? As noted earlier, the majority of people are above average in a negatively skewed distribution. Of course, one cause of negatively skewed distributions is a performance ceiling. If participants knew that the distributions were skewed and responded to our questions regarding randomly chosen opponents as if we were asking about the average other, it could account for our results. This raises the question of the degree to which BTA effects depend on the presence of a ceiling effect and the degree to which WTA effects depend on a floor. It is noteworthy that most prior findings of BTA and WTA effects occur in domains where there are ceilings and floors in performance, or at least on the scales used to measure performance. Our theory does not depend on ceilings or floors for the production of BTA and WTA effects and thus predicts that we will replicate these biases in comparative judgment even when no ceilings or floors are present. Experiment 3 tests this prediction.
Experiment 3 also tests a second prediction of our theory: that perceived similarity between the target and the referent will moderate BTA and WTA effects. When the two are assumed to be similar, information about the target’s performance will generalize to the referent (Krueger, 2000; Mussweiler, 2003), and one should not expect the target to be much better or worse than the referent.
However, when the two are assumed to be different, our theory would predict that BTA and WTA effects would be stronger.
Experiment 3 allows us to examine the process of updating from prior expectations because it includes measures of expected performance. This is important because our theory posits an important role for expectations. That is, differential regression will only produce BTA effects when the task is easier than expected and will only produce WTA effects when the task is more difficult than expected. This is a reasonable supposition for the tasks used in Experiments 1 and 2, but our case would be more convincing if we measured these beliefs directly. We do so in the third experiment.
Experiment 3: The Dates Test

Method

Participants. The participants were 187 volunteers from Carnegie Mellon University, who participated in exchange for pay.

Task. When participants arrived, they were told that they would be taking “The Dates Test”: “In this experiment, you will take a test that measures how well informed you are about the world in which we live. Each question on the test asks you to identify the calendar year in which some event occurred.” They were to answer as many of these questions as they could, and as accurately as they could, in two minutes. They had to answer every question in sequence and could not skip questions. For every question they answered, they could earn up to 100 points if their answer was exactly right. If they were not exactly right, one point would be deducted for each year their answer was off from the correct answer. The maximum number of points that could be lost was 100, for any answer that was 200 or more years off.

Procedure. After reading the instructions for the task, participants were asked to predict their own scores and the average score on the upcoming test. Then they took the test. Each participant received a set of either easy or difficult questions. The first item on the easy quiz was “At the start of what year, known as ‘Y2K,’ were computers expected to crash due to the ‘millennium bug’?” Answer: 2000. The first item on the difficult quiz was “In what year was Pope Pius the First selected as Pope?” Answer: 142.
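The scoring rule just described translates directly into code (our encoding of the stated rule, not the authors’ implementation):

```python
def dates_test_score(answer: int, correct: int) -> int:
    """100 points for an exact answer, minus one point per year of error,
    floored at -100 (any answer 200 or more years off)."""
    return max(100 - abs(answer - correct), -100)

print(dates_test_score(2000, 2000))  # 100: exactly right
print(dates_test_score(150, 142))    # 92: eight years off
print(dates_test_score(1500, 142))   # -100: more than 200 years off
```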
After having taken the test, participants again estimated their own scores. Then participants read, “Next we are going to ask you to estimate the score of the average participant. But before you do that, please spend about 5 minutes writing down all the reasons why the average participant is likely to have a score that is similar to [different from] your own.” Below these instructions were several blank lines on which they wrote. Participants then estimated the average quiz score. Participants were then asked to estimate precisely how much they expected their score to exceed that of the average person. They were told to use negative numbers to indicate the extent to which their score would be below that of the average person. This was the direct comparative judgment.

Design. The experimental design thus was a 2 (difficulty) × 2 (similarity priming) between-subjects design. The experiment also included measures of participants’ beliefs about scores, both their own and the average score, both before and after having taken the quiz.

Results and Discussion

Pretest Predictions. We submitted pretest predicted scores for self and other to a 2 (difficulty) × 2 (similarity) × 2 (target: self versus other) mixed ANOVA with repeated measures on the third factor. The effect of target emerges as significant (F [1, 180] = 4.93, p = .03, η² = .03). Before taking the test, participants predicted that they would score higher (M = 751, SD = 631) than the average (M = 669, SD = 601). No other effects are significant.

Test Performance. Average scores were higher on the easy (M = 1318, SD = 568) than on the difficult quiz (M = −649, SD = 389, t [185] = 27.58, p < .001). Participants answered more questions on the easy (M = 13.9, SD = 5.42) than the difficult (M = 12.1, SD = 4.57) quiz (t [185] = 2.54, p = .012). The distributions of test scores were roughly normal, as indicated by modest levels of skewness for the easy (.18, SE = .25) and the difficult (−.51, SE = .25) tests. Note that the directions of skew here allow for a conservative test of our hypotheses, in the sense that given positive skew on the easy test, the majority of people (60%) are below average, and given negative skew on the difficult test, the majority of people (56%) are above average.

Posttest Beliefs. To examine participants’ beliefs about their performance relative to others, we submitted their estimated scores for self and for the average other to a 2 (difficulty) × 2 (similarity) × 2 (target:
own estimated score versus estimated average score) mixed ANOVA with repeated measures on the third factor. Of course, the effect of difficulty is significant because estimated scores are higher on the easy than the difficult test (F [1, 182] = 228, p < .001, η² = .56). However, the within-subjects effect of target did not emerge as significant (F [1, 182] = .54, p = .47, η² = .003). This reflects the fact that, after taking the test, participants did not believe they (M = 499, SD = 946) scored any better than did others (M = 516, SD = 738). The results do reveal the expected target × difficulty interaction effect (F [1, 182] = 39.86, p < .0001, η² = .18). Participants reported that on the simple quiz, they did better (M = 1213, SD = 455) than average (M = 1002, SD = 378, t [44] = 3.91, p < .001), but that on the difficult quiz they did worse (M = −206, SD = 829) than average (M = −57, SD = 733, t [45] = −2.21, p = .032).
Furthermore, the predicted three-way target × difficulty × similarity interaction effect emerges as marginally significant (F [1, 182] = 3.71, p = .056, η² = .02). This effect, which is illustrated in Figure 7.5, shows that the target × difficulty effect described above was weakened when participants were primed to think about how the average is similar to them, whereas it was exacerbated when they were primed to think about how the average is different from them. No other effects in this mixed ANOVA emerge as significant.
Figure 7.5 Estimated relative scores, Experiment 3. [Bar graph: estimated score for self minus others (roughly −400 to +400) on the difficult and simple quizzes, shown separately for the different prime and the similar prime.]
Figure 7.6 Estimated scores for self and opponent, both before and after taking the actual test, Experiment 3. Respondents are grouped by their own actual scores, by rounding their scores to the nearest 500 points. [Plot of estimated score against actual score (roughly −1500 to 2500), with series for self actual scores, pretest estimates for self and other, posttest self-estimates, and posttest other-estimates under the different and similar primes.]
Finally, and also consistent with our theory, participants’ estimates of their own scores (M = 1196, SD = 553) were lower than their actual scores (M = 1318, SD = 568) on the easy test (t [93] = 4.23, p < .001). On the difficult test, participants’ estimates were higher (M = −211, SD = 705) than were their actual scores (M = −646, SD = 390, t [91] = −5.72, p < .001). Participants made regressive estimates of themselves, but their estimates of others were even more regressive (see Figure 7.6).

Updating From Priors. The differential information theory is based on the Bayesian notion that people begin with some prior expectation and then update that belief when they get new evidence. To clarify how their prior expectations influenced participants’ evaluations of self and other, we conducted two regressions using posttest estimated scores for self and other as dependent variables. Our differential information theory would predict that people use their own quiz performances to update beliefs about their own scores more than beliefs about others’ scores. Self-knowledge is more useful for predicting the self than it is for predicting others. The regression predicting participants’ posttest beliefs about their own performance employs three independent variables: (1) the participant’s pretest estimated score for self, (2) the participant’s own score on the
quiz, and (3) the difficulty of the quiz the participant received.

Table 7.3 Regressions Predicting Posttest Score Estimates for Self and Other (Experiment 3)

Model 1: Predicting Posttest Beliefs About Own Performance
Independent Variable | Unstandardized B Coefficient
Pretest estimated score for self | .37a (.05)
Own actual score | .61a (.06)
Difficult quiz dummy | −244 (164)
R² | .75a

Model 2: Predicting Posttest Beliefs About Other’s Performance
Independent Variable | Unstandardized B Coefficient
Pretest estimated score for other | .23a (.06)
Own actual score | .33a (.07)
Difficult quiz dummy | −427b (163)
R² | .58a

Note. Standard errors in parentheses. a p < .001. b p < .01.
The results of this regression appear in the first panel of Table 7.3. The second regression, predicting participants’ posttest beliefs about others’ performance, employed the participant’s pretest estimated score for the other and the same last two variables. These results appear in the second panel of Table 7.3. Consistent with our expectations, participants’ own quiz performances exert a weaker influence on posttest estimates of others (B = .32, SE = .07, t = 4.46, p = 1.44 × 10⁻⁵) than on posttest estimates of self (B = .61, SE = .07, t = 8.26, p = 3.01 × 10⁻¹⁴). When participants were estimating their own scores, they had excellent information. They relied heavily on their own scores, but their pretest priors were also a significant influence. When estimating the scores of others, the regression results suggest participants relied less on their own experiences and instead tried to account for the ease or difficulty of the task—hence the significance of quiz difficulty. The significant effects of pretest priors for estimations of both self and others suggest, consistent with our theory, that people’s priors affect their subsequent judgments. They updated from these priors using new information, and since the information they had (their own quiz performances) was more useful for estimating self than for estimating others, this information was weighted more heavily when estimating their own scores than when estimating others’ scores.
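The regressions reported in Table 7.3 follow a standard ordinary-least-squares specification, sketched below with statsmodels on randomly generated stand-in data (the variable names, coefficients, and noise levels are ours, not the authors’):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(2)
n = 187  # same sample size as Experiment 3, but the data are fake
pretest_self = rng.normal(750, 600, n)
own_score = rng.normal(350, 1000, n)
difficult = rng.integers(0, 2, n)

# Fabricated outcome so the example runs end to end.
posttest_self = (0.4 * pretest_self + 0.6 * own_score
                 - 250 * difficult + rng.normal(0, 300, n))

X = sm.add_constant(np.column_stack([pretest_self, own_score, difficult]))
model = sm.OLS(posttest_self, X).fit()
print(model.params)    # unstandardized B coefficients, as in Table 7.3
print(model.rsquared)
```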
to which participants projected information about themselves onto others depended upon similarity priming. When estimating others' scores, those who expected to be similar to others weighted their own scores more heavily (B = .42, SE = .10, t = 4.14, p = 7.60 × 10⁻⁵) than did those who expected others to be different (B = .27, SE = .11, t = 2.53, p = .013).

How Much Variance Does This Theory Account For?
How much of the effect of task difficulty in this experiment can be accounted for by differential regression? We sought to answer this question as we did for Experiment 2. We first computed the percentage of the variance in direct comparative judgments accounted for by the difficulty manipulation. When difficulty is used as the sole independent variable in a regression predicting participants' estimates of the difference between their own score and the average score, the resulting R-squared value indicates that difficulty accounts for 14.3% of the variance. Differential regressiveness in participants' absolute estimates of performance by self and others, as measured by their indirect comparative judgments (estimated score for self minus estimated score for others), on its own, accounts for 29.7% of the variance in their direct comparative judgments. When both difficulty and the indirect comparison are included as independent variables, the resulting R-squared value indicates that combined they account for 33.2% of the variance in direct comparisons. What this means is that task difficulty accounts for 3.5% of the variance in comparative judgments, over and above the differential regressiveness in people's absolute judgments. This 3.5% represents just 24% of the total effect of difficulty (14.3%) on direct comparative judgments. The implication is that differential regressiveness accounts for 76% of the effect of difficulty on comparative judgments in Experiment 3, leaving 24% of the effect for other explanations.
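The partitioning logic can be made concrete with a few lines of Python. The R-squared values are those reported above; the script itself is only an illustration of the arithmetic, not the analysis code used in the experiment.

    # Variance partitioning for Experiment 3, using the R-squared values
    # reported in the text (an illustration of the logic, not the original
    # analysis script).
    r2_difficulty = 0.143  # difficulty alone predicting direct comparisons
    r2_indirect = 0.297    # indirect comparison (self minus other estimate) alone
    r2_combined = 0.332    # difficulty and indirect comparison together

    unique_difficulty = r2_combined - r2_indirect    # 0.035
    share_other = unique_difficulty / r2_difficulty  # ~0.24
    print(f"unique effect of difficulty: {unique_difficulty:.3f}")
    print(f"left for other explanations: {share_other:.0%}")             # 24%
    print(f"due to differential regressiveness: {1 - share_other:.0%}")  # 76%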
General Discussion
The results of the three experiments we present are consistent with the differential information explanation for both BTA and WTA effects. Estimates of performance were regressive, especially when people made estimates about others, about whom they had poorer information. As a consequence, when performance was high, people believed they were better than others. When performance was low, people believed they were worse than others. A great deal of evidence shows that people prefer flattering information about themselves (Baumeister, 1998; Greenwald, 1980; Taylor,
1989). They seek it out and they accept it uncritically (Gilovich, 1991). It is often comforting or flattering to believe that one has done well or that one is better than others, but these theories have trouble accounting for WTA effects. The differential information explanation can account for both BTA and WTA effects. In contrast to the previous theories, it does not suggest an egocentric or self-enhancing bias. Rather, it is a rational, Bayesian explanation for why people might exhibit biases in comparative judgments. Our theory is useful for reconciling BTA and WTA effects with a set of results that appear to contradict them. Research on the so-called hard/easy effect in overconfidence has documented that people overestimate their performance on difficult tasks and tend to underestimate it on easy tasks (Burson, Larrick, & Klayman, 2006; Krueger & Mueller, 2002; Kruger & Dunning, 1999). We should note that the hard/easy effect involves measures of absolute performance, whereas BTA and WTA effects involve measures of relative performance. Regressiveness in self-estimates is sufficient to produce the hard/easy effect (Erev et al., 1994). This pattern is even more pronounced for other-estimates, resulting in the co-occurrence of hard/easy effects and WTA/BTA effects.
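A small simulation illustrates how a single mechanism, differential regressiveness, can produce hard/easy effects in absolute estimates and BTA/WTA effects in relative estimates at the same time. The sketch is hypothetical: the weights, prior, and score distributions are invented for illustration and are not parameters estimated from the experiments.

    import numpy as np

    rng = np.random.default_rng(0)

    def estimates(true_mean, w_self=0.8, w_other=0.4, prior=0.0, n=10_000):
        """Estimates as weighted blends of a shared prior and the person's own score.

        w_self > w_other encodes the differential-information assumption: people
        have better evidence about themselves, so estimates of others shrink
        more toward the prior. All parameter values here are invented.
        """
        scores = rng.normal(true_mean, 1.0, n)  # actual performances
        est_self = w_self * scores + (1 - w_self) * prior
        est_other = w_other * scores + (1 - w_other) * prior
        return scores, est_self, est_other

    for label, mu in [("easy task", 2.0), ("hard task", -2.0)]:
        scores, est_self, est_other = estimates(mu)
        print(f"{label}: actual {scores.mean():+.2f}, "
              f"self-estimate {est_self.mean():+.2f}, "                  # regressive: hard/easy effect
              f"self minus other {(est_self - est_other).mean():+.2f}")  # BTA (+) or WTA (-)

On the easy task the self-estimate falls short of actual performance (underconfidence) while exceeding the other-estimate (a BTA effect); on the hard task both signs flip, reproducing the co-occurrence described above.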
The Normative Question
We have examined BTA and WTA effects to clarify what the appropriate normative judgment is. We have argued, using evidence from three experiments, that people ought to believe that they are better than others when their performance is better than expected. When people learn about their own likely performance or their own probability of experiencing some event, that knowledge is often useful for updating their beliefs about others as well. However, there is a problem. Real people do not obey Bayes's Rule all that well. Sometimes, people appear to neglect priors (such as base rates), overweighting recent evidence (Grether, 1980, 1990). Other times, people appear too conservative, overweighting priors and neglecting useful new evidence (Edwards, 1968; McKelvey & Page, 1990). Which of these errors people commit depends on the order and form in which they acquire information (Hogarth & Einhorn, 1992; Wells, 1992). What is important for our differential information explanation, however, is that although people are imperfect Bayesians, they rarely abandon Bayesian logic completely. Indeed, under some circumstances human judgment is impressively close to Bayesian prescription (Griffiths & Tenenbaum, 2007). However, all that is necessary for our explanation to hold is that people's estimates of others lie between their priors and their beliefs
about themselves. Both the results from our experiments and from other experiments bear this assumption out (Krueger et al., 2005). If people neglected their priors (base rates) and assumed that others behaved exactly as they did, they would, in effect, commit the false consensus effect in grand form. They would predict that others’ scores would be the same as theirs and that others would experience the identical probability of various possible outcomes. If people did the opposite and assumed that their own outcomes were irrelevant for determining the outcomes of others, then learning about their own performances would not shift predictions of others off of baseline. Instead, what we observe is a combination of the two. Across all three experiments, people’s estimations of themselves are highly correlated with their estimates of others, but estimates of others are less extreme (McFarland & Miller, 1990; Miller & McFarland, 1987). We have come full circle, then, back to the false consensus effect.
False Consensus?
Dawes's critique of the false consensus effect did not attempt to argue that there was no such thing as the false consensus effect, only that there was a rational explanation that would hypothesize an effect that looked like the false consensus effect. This useful clarification set the stage for a further refinement: Was there, in fact, a false consensus effect that was stronger than that predicted by normative theory? The answer, as shown by Krueger and Clement (1994), is yes. There really is a false consensus effect, but it is smaller than researchers had assumed before Dawes clarified the matter. Can we accomplish the same refinement of theories of bias in comparative judgment? We believe we can. Our evidence suggests that differential regression caused by differential information can indeed account for a good deal of the variance in comparative judgments. However, it cannot account for 100% of such variance. WTA and BTA effects are stronger than one would predict based solely on differential regressiveness in estimates of absolute performance by self and other (Moore, 2007). Clearly, there are other causes of BTA and WTA effects. One such cause is likely to be differential weighting. A number of researchers have shown that people tend to overweight information about the self when making comparative judgments (Klar & Giladi, 1997, 1999; Kruger, 1999; Windschitl et al., 2003). Kruger and his colleagues (2006) have shown that this, too, can have a rational basis. When people possess high-quality information, it makes sense to give
that information greater weight than they give to mere speculation about the performances of others. We want to conclude by pointing out that it is not our intention to claim that people are perfectly rational or that their judgments are unbiased. Evidence of human irrationality is abundant and undeniable (Bazerman, 2002). Indeed, it is precisely because we take seriously the evidence of imperfections in human judgment that we seek to understand it as best we can. We want to hold research on judgment and decision making to the high standard that Robyn Dawes would set: Before we accuse people of being irrational, we ought to have an excellent idea of precisely what it means to make rational judgments. This approach has two clear benefits. First, it helps us more clearly understand the normative benchmarks that we can use to provide advice to others or make better judgments ourselves. Second, we gain a richer understanding of human judgment by better understanding exactly when it complies with rational prescriptions and when it deviates from them. Finally, we must note that by offering a rational explanation for BTA and WTA effects, we are not saying that they are not real. On the contrary, if there is a rational basis for these effects, we ought to expect them to be particularly durable because we cannot expect to be able to correct or to debias them.
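The Bayesian logic behind Dawes's statistical argument, and behind the rational component of projection discussed above, can be illustrated with a minimal beta-binomial sketch. The Beta(1, 1) prior and the numbers are illustrative only; they come from no study discussed in this chapter.

    # Beta-binomial illustration of Dawes's (1989) point: treating one's own
    # response as a sample of size one licenses some projection onto others.

    def posterior_consensus(a: float, b: float, my_choice: int) -> float:
        """Posterior mean rate of choosing option A, after updating a
        Beta(a, b) prior on one's own choice (1 = chose A, 0 = did not)."""
        return (a + my_choice) / (a + b + 1)

    print(posterior_consensus(1, 1, 1))  # 0.667: choosers of A expect A to be common
    print(posterior_consensus(1, 1, 0))  # 0.333: non-choosers expect it to be rare
    # The rational gap between the two groups already looks like "false
    # consensus"; the truly false consensus effect is projection beyond it.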
Acknowledgments
The authors benefited from the support of the staff and facilities of the Center for Behavioral Decision Research at Carnegie Mellon University. This material is based upon work supported by the National Science Foundation under Grant No. 0451736. Thanks to Robyn Dawes and Joachim Krueger for helpful comments on the manuscript. Please address correspondence via e-mail to
[email protected].
References
Alicke, M. D. (1985). Global self-evaluation as determined by the desirability and controllability of trait adjectives. Journal of Personality and Social Psychology, 49, 1621–1630.
Babcock, L., & Loewenstein, G. (1997). Explaining bargaining impasse: The role of self-serving biases. Journal of Economic Perspectives, 11(1), 109–126.
Baron, R. M., & Kenny, D. A. (1986). The moderator–mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Baumeister, R. F. (1998). The self. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (Vol. 3, pp. 680–740). Boston: McGraw-Hill.
Bazerman, M. H. (2002). Judgment in managerial decision making (5th ed.). New York: Wiley.
Benabou, R., & Tirole, J. (2002). Self-confidence and personal motivation. Quarterly Journal of Economics, 117, 871–915.
Blanton, H., Axsom, D., McClive, K. P., & Price, S. (2001). Pessimistic bias in comparative evaluations: A case of perceived vulnerability to the effects of negative life events. Personality and Social Psychology Bulletin, 27, 1627–1636.
Brown, J. D. (1986). Evaluations of self and others: Self-enhancement biases in social judgments. Social Cognition, 4, 353–376.
Brown, J. D. (1998). The self. Boston: McGraw-Hill.
Bruine de Bruin, W., Fischhoff, B., Millstein, S. G., & Halpern-Felsher, B. L. (2000). Verbal and numerical expressions of probability: "It's a fifty-fifty chance." Organizational Behavior and Human Decision Processes, 81, 115–131.
Burson, K. A., Larrick, R. P., & Klayman, J. (2006). Skilled or unskilled, but still unaware of it: How perceptions of difficulty drive miscalibration in relative comparisons. Journal of Personality and Social Psychology, 90, 60–77.
Carnegie Mellon Career Center. (2003). 2003 post-graduation survey results. Retrieved March 17, 2004, from http://www.studentaffairs.cmu.edu/career/student/student.html
Chambers, J. R., & Windschitl, P. D. (2004). Biases in social comparative judgments: The role of nonmotivational factors in above-average and comparative-optimism effects. Psychological Bulletin, 130, 813–838.
Chambers, J. R., Windschitl, P. D., & Suls, J. (2003). Egocentrism, event frequency, and comparative optimism: When what happens frequently is "more likely to happen to me." Personality and Social Psychology Bulletin, 29, 1343–1356.
Daniel, K. D., Hirshleifer, D. A., & Subrahmanyam, A. (1998). Investor psychology and security market under- and overreactions. Journal of Finance, 53, 1839–1885.
Dawes, R. M. (1989). Statistical criteria for establishing a truly false consensus effect. Journal of Experimental Social Psychology, 25, 1–17.
Dawes, R. M. (1990). The potential nonfalsity of the false consensus effect. In R. M. Hogarth (Ed.), Insights in decision making: A tribute to Hillel J. Einhorn (pp. 179–199). Chicago: University of Chicago Press.
Dawes, R. M., & Mulford, M. (1996). The false consensus effect and overconfidence: Flaws in judgment or flaws in how we study judgment? Organizational Behavior and Human Decision Processes, 65, 201–211.
Dunning, D. (1993). Words to live by: The self and definitions of social concepts and categories. In J. M. Suls (Ed.), Psychological perspectives on the self (Vol. 4, pp. 99–126). Hillsdale, NJ: Erlbaum.
Edwards, W. (1968). Conservatism in human information processing. In B. Kleinmuntz (Ed.), Formal representation of human judgment (pp. 17–52). New York: Wiley.
Epstein, S. (1990). Cognitive-experiential self-theory. In L. A. Pervin (Ed.), Handbook of personality: Theory and research (pp. 165–192). New York: Guilford Press.
Erev, I., Wallsten, T. S., & Budescu, D. V. (1994). Simultaneous over- and underconfidence: The role of error in judgment processes. Psychological Review, 101, 519–527.
Fiedler, K. (1996). Explaining and simulating judgment biases as an aggregation phenomenon in probabilistic, multiple-cue environments. Psychological Review, 103, 193–214.
Fiedler, K. (2000). Beware of samples! A cognitive-ecological sampling approach to judgment biases. Psychological Review, 107, 659–676.
Fox, C. R., & Rottenstreich, Y. (2003). Partition priming in judgment under uncertainty. Psychological Science, 14, 195–200.
Gilovich, T. (1991). How we know what isn't so: The fallibility of human reason in everyday life. New York: Free Press.
Greenwald, A. G. (1980). The totalitarian ego: Fabrication and revision of personal history. American Psychologist, 35, 603–618.
Grether, D. M. (1980). Bayes' rule as a descriptive model: The representativeness heuristic. Quarterly Journal of Economics, 95, 537–557.
Grether, D. M. (1990). Testing Bayes rule and the representativeness heuristic: Some experimental evidence. Journal of Economic Behavior and Organization, 17, 31–57.
Griffiths, T. L., & Tenenbaum, J. B. (2007). Optimal predictions in everyday cognition. Psychological Science, 17, 767–773.
Hogarth, R. M., & Einhorn, H. J. (1992). Order effects in belief updating: The belief-adjustment model. Cognitive Psychology, 24, 1–55.
Howard, M. E. (1983). The causes of war and other essays. Cambridge, MA: Harvard University Press.
Klar, Y., & Giladi, E. E. (1997). No one in my group can be below the group's average: A robust positivity bias in favor of anonymous peers. Journal of Personality and Social Psychology, 73, 885–901.
Klar, Y., & Giladi, E. E. (1999). Are most people happier than their peers, or are they just happy? Personality and Social Psychology Bulletin, 25, 585–594.
Krueger, J. I. (2000). The projective perception of the social world: A building block of social comparison processes. In J. Suls & L. Wheeler (Eds.), Handbook of social comparison: Theory and research (pp. 323–351). Dordrecht, The Netherlands: Kluwer Academic.
Krueger, J. I., Acevedo, M., & Robbins, J. M. (2005). Self as sample. In K. Fiedler & P. Juslin (Eds.), Information sampling and adaptive cognition (pp. 353–377). New York: Cambridge University Press.
Krueger, J. I., & Clement, R. W. (1994). The truly false consensus effect: An ineradicable and egocentric bias in social perception. Journal of Personality and Social Psychology, 67, 596–610.
Krueger, J. I., & Mueller, R. A. (2002). Unskilled, unaware, or both? The better-than-average heuristic and statistical regression predict errors in estimates of own performance. Journal of Personality and Social Psychology, 82, 180–188.
Kruger, J. (1999). Lake Wobegon be gone! The "below-average effect" and the egocentric nature of comparative ability judgments. Journal of Personality and Social Psychology, 77, 221–232.
Kruger, J., & Burrus, J. (2004). Egocentrism and focalism in unrealistic optimism (and pessimism). Journal of Experimental Social Psychology, 40, 332–340.
Kruger, J., & Dunning, D. (1999). Unskilled and unaware of it: How difficulties in recognizing one's own incompetence lead to inflated self-assessments. Journal of Personality and Social Psychology, 77, 1121–1134.
Kruger, J., Windschitl, P. D., Burrus, J., Fessel, F., & Chambers, J. R. (in press). The rational side of egocentrism in social comparisons. Journal of Experimental Social Psychology.
Lerner, J. S., Gonzalez, R. M., Small, D. A., & Fischhoff, B. (2003). Effects of fear and anger on perceived risks of terrorism: A national field experiment. Psychological Science, 14, 144–150.
Lipkus, I. M., Biradavolu, M., Fenn, K., Keller, P., & Rimer, B. K. (2001). Informing women about their breast cancer risks: Truth and consequences. Health Communication, 13, 205–226.
Malmendier, U., & Tate, G. (2005). CEO overconfidence and corporate investment. Journal of Finance, 60, 2661–2700.
McFarland, C., & Miller, D. T. (1990). Judgments of self-other similarity: Just like other people, only more so. Personality and Social Psychology Bulletin, 16, 475–484.
McKelvey, R. D., & Page, T. (1990). Public and private information: An experimental study of information pooling. Econometrica, 58, 1321–1339.
Miller, D. T., & McFarland, C. (1987). Pluralistic ignorance: When similarity is interpreted as dissimilarity. Journal of Personality and Social Psychology, 53, 298–305.
Moore, D. A. (2007). Not so above average after all: When people believe they are worse than average and its implications for theories of bias in social comparison. Organizational Behavior and Human Decision Processes, 102, 42–58.
Moore, D. A., & Cain, D. M. (2007). Overconfidence and underconfidence: When and why people underestimate (and overestimate) the competition. Organizational Behavior and Human Decision Processes, 103, 197–213.
Moore, D. A., & Kim, T. G. (2003). Myopic social prediction and the solo comparison effect. Journal of Personality and Social Psychology, 85, 1121–1135.
Moore, D. A., Oesch, J. M., & Zietsma, C. (in press). What competition? Myopic self-focus in market entry decisions. Organization Science, 18, 440–454.
Moore, D. A., & Small, D. A. (in press). Error and bias in comparative social judgment: On being both better and worse than we think we are. Journal of Personality and Social Psychology.
Mussweiler, T. (2003). Comparison processes in social judgment: Mechanisms and consequences. Psychological Review, 110, 472–489.
Neale, M. A., & Bazerman, M. H. (1985). The effects of framing and negotiator overconfidence on bargaining behaviors and outcomes. Academy of Management Journal, 28, 34–49.
Odean, T. (1998). Volume, volatility, price, and profit when all traders are above average. Journal of Finance, 53, 1887–1934.
Pennsylvania Lottery. (2004, Feb 2). Powerball game information. Retrieved March 17, 2005, from http://www.palottery.com/lottery/cwp/view.asp?a=3&q=457089&lotteryNav=|29736|
Peterson, C. (2000). The future of optimism. American Psychologist, 55, 44–55.
Ross, L., Greene, D., & House, P. (1977). The false consensus effect: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301.
Sanbonmatsu, D. M., Shavitt, S., Sherman, S. J., & Roskos-Ewoldsen, D. R. (1987). Illusory correlation in the perception of performance by self or a salient other. Journal of Experimental Social Psychology, 23, 518–543.
Slovic, P. (2000). Rejoinder: The perils of Viscusi's analyses of smoking risk perceptions. Journal of Behavioral Decision Making, 13, 273–276.
Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural equation models. Sociological Methodology, 13, 291–312.
Steele, C. M. (1988). The psychology of self-affirmation: Sustaining the integrity of the self. In L. Berkowitz (Ed.), Advances in experimental social psychology (Vol. 21, pp. 261–302). New York: Academic Press.
Svenson, O. (1981). Are we less risky and more skillful than our fellow drivers? Acta Psychologica, 47, 143–151.
Taylor, S. E. (1989). Positive illusions: Creative self-deception and the healthy mind. New York: Basic Books.
Taylor, S. E., & Brown, J. D. (1988). Illusion and well-being: A social psychological perspective on mental health. Psychological Bulletin, 103, 193–210.
U.S. Census Bureau. (2002). Statistical abstract of the United States (122nd ed.). Washington, DC: Library of Congress.
Viscusi, W. K. (1990). Do smokers underestimate risks? Journal of Political Economy, 98, 1253–1269.
Weinstein, N. D. (1980). Unrealistic optimism about future life events. Journal of Personality and Social Psychology, 39, 806–820.
Weinstein, N. D. (1984). Why it won't happen to me: Perceptions of risk factors and susceptibility. Health Psychology, 3, 431–457.
Wells, G. L. (1992). Naked statistical evidence of liability: Is subjective probability enough? Journal of Personality and Social Psychology, 62, 739–752.
Windschitl, P. D., Kruger, J., & Simms, E. (2003). The influence of egocentrism and focalism on people's optimism in competitions: When what affects us equally affects me more. Journal of Personality and Social Psychology, 85, 389–408.
Woloshin, S., Schwartz, L. M., Black, W. C., & Welch, H. G. (1999). Women's perceptions of breast cancer risk: How you ask matters. Medical Decision Making, 19, 221–229.
Notes
1. Thanks to Shane Frederick for suggesting this helpful example.
2. This number represents the probability of a single lottery ticket winning, but people routinely buy more than one ticket. However, most people do not buy 17.5 million lottery tickets, which is how many you would need to buy to raise your chances of winning to the 14% participants estimated.
3. To rule out timing effects of this procedure, half of these participants drew their random number before they bet and half after they bet. Because this manipulation did not yield any main or interaction effects with variables of interest, these participants are grouped together in the analyses reported.
4. Degrees of freedom fluctuate slightly between tests, due to missing data for some participants.
5. Similar patterns hold for bets and probability of winning as measures of comparative judgment. Bets, however, are influenced by a number of other motivations, including risk aversion and feelings about gambling.
8
Wishful Thinking in Predicting World Cup Results: Still Elusive
Maya Bar-Hillel
The Hebrew University
David V. Budescu
University of Illinois at Urbana-Champaign
Moty Amar
The Hebrew University and Ono Academic College
Bar-Hillel and Budescu (1995) defined the "desirability effect" as "the inflation of the judged probability of desirable events or the diminution of the judged probability of undesirable events" (p. 71). Desirability effects are related to optimism and to wishful thinking, because inasmuch as desirability effects exist, they contribute to both. Evidence supporting these effects was obtained in several types of studies. Some compared respondents' reported chances that desirable personal life events (e.g., getting a high-paying job after graduation) or undesirable ones (e.g., being involved in an accident) would happen to them versus to various others (see e.g., Weinstein, 1980, 1982). Other studies asked participants to judge the probability of outcomes of contests (e.g., elections [Babad & Yakobos, 1993] or sport contests [Babad, 1987]) in which they personally favored one of the contestants (see also Fischer & Budescu, 1995; Granberg & Brent, 1983).
These results, however, are subject to alternative explanations, such as differential information about one's own actions (or one's favorite contestant) versus others' and biased or selective access to information. For example, most people think their chances of being involved in a car accident are lower than others' (McKenna, 1993). Rather than reflecting desirability bias, or unrealistic optimism, this result may be explained by the following facts: (1) people know more about their own driving skills; (2) people are more aware of the preventive and cautionary actions they themselves take; (3) people pay excessive attention to errors made by other drivers; or (4) a combination of all of these factors. Bar-Hillel and Budescu (1995) argued that desirability effects can be demonstrated unambiguously only with respect to events whose desirability is established through experimental manipulations that are independent of the respondents' background, information, and preferences. In an extended series of experimental studies using this paradigm they found little evidence that the desirability of an outcome can, in and of itself, cause its judged probability to loom larger. In Study 1 Bar-Hillel and Budescu (1995) used aleatory events. Respondents estimated the proportion of cells of a given color in a two-color matrix or of beads of a given color in a jar containing beads in one of two colors. One color was designated as the winning color, but its judged probability of selection did not systematically increase. In Study 2 the stimuli were four scenarios that concerned, respectively, pairs of firms competing for a contract, finalists in a literary competition, basketball teams playing each other, and parents involved in a child custody battle. One of the competitors in each scenario was made desirable by an appropriate manipulation: The respondent imagined holding stock in one firm (pecuniary); one writer was described as severely handicapped (sympathy); one team was identified as Israel's national team (the respondents were Israeli students); one parent was the same gender as the respondent (in the control condition, the parents' genders were not disclosed in their description). These manipulations had the intended effect on the judged desirability of the target events, but they did not increase the judged probability for the desirable contestant to win. For example, although the respondents wanted the Israeli team to win its game, they did not overestimate the probability of this event. In Study 3, the same four scenarios were used. Desirability was manipulated by promising respondents a monetary reward if a designated contestant won. This manipulation was the only one that did actually show a significant desirability effect, albeit a mild one. Finally, in Study 4, respondents followed and predicted the value of the Dow Jones index over a four-week period and were rewarded according to
whether or not the index changed by 20 points or more (in either direction) over a week. In this study respondents were also rewarded for accurate prediction. Although respondents presumably wished to win the reward, they did not inflate the judged probability of the event that would lead to this outcome. A recent comprehensive literature review (Krizan & Windschitl, 2007) confirmed the scarcity of empirical evidence supporting the desirability bias hypothesis. Despite a bias against null results in the literature (see, e.g., Sterling, Rosenbaum, & Weinkam, 1995; Hubbard & Armstrong, 1997), these studies were published under a title that told it all: "The elusive wishful thinking effect" (Bar-Hillel & Budescu, 1995). This chapter reports a field experiment using the same paradigm in the context of real-world competitive sports. Sport is an area in which wishful thinking is notoriously rampant and has been demonstrated experimentally (e.g., Babad & Katz, 1991). Betting in sport is routine and commonplace, with people tending to bet on their own favorites (e.g., the teams they are fans of) and to be overconfident that their bets will succeed (warnings against betting on your favorite team are commonplace in guides to sports betting; e.g., www.betinf.com). A unique opportunity to run this study presented itself in June 2002, when the finals of the World Cup in soccer were being played in Korea and Japan. We conducted an experiment with Israeli participants, involving predictions of the outcomes of World Cup tournament games. Soccer is the most popular team sport in Israel, and most games were being broadcast live in their entirety on a daily basis. Thus, even people who do not follow the sport regularly were subjected to frequent updates on game outcomes in the general news media. This created a felicitous setting for running yet another experiment on the desirability effect but now in a context that elicits involvement quite naturally. It allowed us to test once again whether a paradigm in which desirability is manipulated experimentally would yield desirability bias, this time in a context in which such thinking is commonplace.
Study 1
Method
Participants
Participants were 329 students at The Hebrew University. About half were female, and almost all were between 20 and 26 years of age, with a mean and median age of 24. They were approached in classrooms at the ends of lectures or in public areas such as cafeterias, libraries, and corridors and asked to fill out a short paper-and-
pencil questionnaire regarding the World Cup games. An opportunity to win considerable monetary prizes (see details below) was promised.

Design and Procedure
Eight games were to be played on June 13 and 14, 2002, as part of the last round of the group stage. On June 11 and 12, 2002, participants were handed a questionnaire that referred to these eight games, in which they were asked to estimate the probability that each of the 16 teams would win its game. It was pointed out that for teams playing each other, these probabilities had to sum to 100% (predictions of ties could be indicated using 50%). The questionnaire promised respondents several possible rewards. The reward that was most critical for this study did not depend in any way on the respondent's performance. A coupon accompanying the questionnaire designated one particular team and stated that a sum of 25 NIS (about $5 at that time) would be paid to the bearer of the coupon if the team designated on it won. This coupon embodied the desirability manipulation, under the assumption that it would make the respondent wish that team would win its game. To ensure attention to this team, respondents were asked to write it down on their questionnaires. A second reward of 400 NIS was promised to the winner of a lottery held among all holders of winning coupons (irrespective of the team named on them). Since these two rewards were not dependent on the respondents' performance, other rewards were promised for the purpose of motivating the respondents to strive for accuracy. A payment of 25 NIS was promised to the respondent who would guess correctly the outcome of the largest number of games. Strictly speaking, the respondents were not asked directly which teams would win, but rather for the probability that a team would win. They were deemed to have predicted the outcome of a game correctly if they assigned the team that subsequently won a probability greater than 50%. Since we expected more than one respondent to tie for this reward, a fourth prize of 400 NIS was to be given by lottery among all those who tied for maximum accuracy. Thus, many respondents could expect payment of 25 NIS, by luck or by competence; they could also hope to win one, or both, of the 400 NIS lottery prizes. The questionnaire also asked respondents about their overall level of interest in the competition and their level of soccer expertise and asked them to list their favorite team(s).

Results
Because we collected all the data over just two days, our experimental manipulation was limited to five of the eight games played on June
13 and 14, as listed in Figure 8.1. The figure shows the judged probabilities (in percentages) for these games. Each data point was generated by between 31 and 38 respondents (we removed 13 respondents whose favorite team was one of the 10 listed teams; all numbers refer to the remaining respondents). The team listed on top is the favorite team within each game, namely, the team that, over all respondents, got the higher mean probability for winning. As it turned out, the favorite team was also the winner (Italy and Mexico were tied). The graph shows the mean probability estimate that the favorite team would win, as given by two experimental groups. The mean estimate of respondents whose coupon indicated that they would win 25 NIS if the top listed team won is shown by diamonds. The mean estimate given by respondents who would be rewarded if the rival team won is shown by squares. By our instructions, rival teams enjoy complementary probabilities. The games on the abscissa are ordered to increase monotonically by the diamonds. In all five games the diamonds hover above the squares, indicating that the group that presumably wanted the top team to win (because of the desirability manipulation of the coupons) always gave a mean estimate higher than the one given by those who presumably wanted the bottom team to win. This ordinal pattern has a 1/32 probability under the null (chance) hypothesis, so it is statistically significant.
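The 1/32 figure is simply a sign test over the five independent games; a minimal check in Python (an illustration, not the authors' analysis code):

    # Under the null hypothesis that the coupon has no effect, each of the
    # five independent games has probability 1/2 of showing the diamond
    # (rewarded group's estimate) above the square.
    p_all_five = 0.5 ** 5
    print(p_all_five)  # 0.03125 = 1/32 < .05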
Figure 8.1 Mean probability that the top team listed on the abscissa would win, as judged by those rewarded if it wins (diamonds), and by those rewarded if the rival team wins (squares).
Game by game, the parametric differences between the two groups of respondents were statistically significant only for the Italy–Mexico game (t[61] = 2.35; p < .05; Cohen's d = .59). The difference was also significant over all five games (63.9% versus 57.7%; t[166] = 2.73; p < .05; Cohen's d = .30). In other words, designating a team as the one whose win would bestow 25 NIS caused it to seem more likely to win. The effect was not sufficient to reverse the judgments, as 9 of the 10 means are above the 50% line.

Discussion
Figure 8.1 shows a small, but systematic, effect. The outcomes of 1 of 10 teams were endowed at random with value in a controlled between-subject experimental design by making a monetary reward contingent on that outcome. Presumably, this made that outcome desirable. In all five games, an outcome was judged more probable by those respondents who desired it. This effect was found despite the presence of an equally high monetary incentive for accuracy and the absence of any fans of the pertinent teams. Of all the experiments reported in Bar-Hillel and Budescu (1995), our Study 1 most closely resembles their Study 3. There too the desirability manipulation involved making a monetary reward contingent on one particular outcome out of two possible ones. This also happened to be the only study that yielded what appeared to be a desirability effect. Figure 8.2 shows some of the results of that study in a graph that resembles Figure 8.1 in format (data from Bar-Hillel and Budescu, 1995, Table 7). The diamonds belong to the respondents whose reward was contingent on one outcome, and the squares belong to those whose reward was contingent on the alternative outcome, but both address the diamond outcome. The similar format helps in seeing how similar the pattern of results is in the two figures; in particular, all diamonds hover above the respective squares. Figures 8.1 and 8.2 suggest a genuine effect. However, before inferring that it is the enhanced desirability of some outcome that causes its judged probability to loom larger, alternative interpretations must be ruled out. One such explanation (which we neglected to consider in the 1995 paper) is that affixing a reward to some outcome makes that outcome salient, singling it out, so to speak, among other outcomes as being of particular interest; and it is this marking in itself that causes the inflated probabilities, rather than the fact that the marking happens to have been done by affixing a prize. Such an attentional process is quite distinct from the motivational process implied by the wishful thinking hypothesis.
Figure 8.2 Mean probabilities judged by those rewarded if the target event occurs (diamonds), and by those rewarded if its complementary event occurs (squares). Note: Based on data from Bar-Hillel and Budescu, 1995, Table 7.
Study 2
To test this possibility, we conducted a second experiment, in which desirability and salience were both manipulated. Naturally, we expected to replicate the effect of the desirability manipulation from Study 1. If there were no similar effect for the salience manipulation, we could attribute the results to a desirability bias with confidence. If, however, we were also to find a similar effect for the salience manipulation, parsimony precludes this interpretation of the results.

Method
Participants
Participants were 227 students at The Hebrew University. They were recruited and run as in Study 1.

Design and Procedure
On June 20, 2002, the participants were handed a questionnaire, which referred to the four games (the quarter-finals) that were to be played on June 21 and 22, 2002. Respondents were asked to estimate the probability of each team winning its game and were promised rewards similar to those in Study 1. In addition to manipulating the desirability of one outcome, another team, playing in another game, was made salient by simply stating, "We are particularly interested in team X" and writing the name Team X in boldface.
Given the time constraints, we applied our manipulations to only two games: Spain versus Korea and Turkey versus Senegal. For each participant a team in one of these games was made desirable by means of a promised monetary reward if it won, and a team in the other game was made salient by a mere expression of interest in its outcome. This generated eight distinct conditions (number of respondents in parentheses): Spain rewarded – Senegal salient (26); Spain rewarded – Turkey salient (29); S. Korea rewarded – Senegal salient (28); S. Korea rewarded – Turkey salient (31); Turkey rewarded – Spain salient (23); Turkey rewarded – S. Korea salient (25); Senegal rewarded – Spain salient (13); Senegal rewarded – S. Korea salient (25).1

Results
Figure 8.3 was drawn according to the same guidelines as Figure 8.1. It plots the average probabilities assigned to the events that the favorite teams (i.e., the ones that were predicted to win by a majority of respondents and are listed on top) would win their games.2 The left panel contrasts the responses of those who stood to win money if the top team won with the responses of those who stood to win money if its opponent won. As in the first study, the former estimates were larger than the latter, albeit not significantly. The right panel contrasts the responses of those who were told that the experimenters were particularly interested in the top team with the responses of those who were told that the experimenters were especially interested in its rival team. Again, the former estimates are larger than the latter, but neither difference was statistically significant. As in Study 1, the desirability manipulation is not sufficient to induce an estimation reversal, nor is the salience manipulation.

Figure 8.3 Mean estimated probabilities of the favorite team winning, as a function of its desirability (left panel) and of its salience (right panel).

An eyeball comparison of the two panels shows that the magnitude of the desirability effect is roughly similar to that of the salience effect. We calculated for each team the difference between its mean estimates under the desirability manipulation and the salience manipulation. Two of these differences were negative (−1.9 for Spain, −3.8 for S. Korea), and two were positive (4.9 for Senegal, 7.9 for Turkey), with only the last one significant (t[111] = 2.13; p < .05, Cohen's d = .41). In an additional analysis we calculated for each respondent the difference between his or her two relevant estimates:
d = Prob(team named on coupon) − Prob(team marked as of special interest)
If both manipulations have essentially the same effect, the difference scores should average about 0. Moreover, the distribution of the differences between the estimates given under the two manipulations should be symmetric. Figure 8.4 shows the distribution of the difference scores. Its mean is 0.52, its median and mode are both 0, and its skewness is 0.03, confirming that the distribution is symmetric around 0.
Figure 8.4 The distribution of the difference scores.
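The symmetry check amounts to computing the mean, median, and skewness of the difference scores. The sketch below uses invented scores purely for illustration; the study's actual statistics (mean 0.52, median and mode 0, skewness 0.03) are those reported above.

    import numpy as np
    from scipy.stats import skew

    # Hypothetical difference scores d = P(coupon team) - P(salient team),
    # one per respondent; these values are invented, not the study's data.
    d = np.array([0, 5, -5, 0, 10, -10, 0, 3, -2, 0, -6, 7])
    print(d.mean(), np.median(d), skew(d))
    # Equal desirability and salience effects predict mean ~ 0, median 0,
    # and skewness ~ 0, i.e., a distribution symmetric around zero.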
Discussion
These two studies applied the methodology of Bar-Hillel and Budescu (1995) to experiments embedded in the real context of the World Cup in soccer. The new results confirm their previous findings and help clarify, at least to some degree, the status of the desirability bias. As in Study 3 in Bar-Hillel and Budescu (1995), we found a small, but robust, effect of the monetary incentives associated with particular outcomes on their judged probability. These results are consistent with the standard finding of desirability bias often observed in natural contexts (Babad, 1987; Babad & Yakobos, 1993; Fischer & Budescu, 1995; Granberg & Brent, 1983; Weinstein, 1980, 1982). However, the results in Figure 8.3 highlight the ambiguity of this interpretation. The "wishful thinking" account implies that this result reflects the operation of a judgmental process that is biased in a systematic fashion by the motivational priming of the desirable outcomes or incentives. Yet, we found practically identical results by merely marking one outcome and making it more salient than the others, absent any motivational manipulation. If salience alone affects probability estimates, is it necessary to invoke a motivational account? Unfortunately, our data cannot answer this question conclusively. Future work should seek to address this issue. Until then, the wishful thinking effect remains elusive.
Postscript
We are delighted to have the opportunity to contribute a chapter to this Festschrift honoring Robyn Dawes. We have always admired Robyn's work and have been inspired by his clear thinking. Although the subject matter of this chapter is not on a topic that Robyn worked on, we believe our approach to it is very much in the spirit he brings to his work. Is the false consensus effect really false (Dawes & Mulford, 1996)? Are the richness and flexibility of clinical judgment and judges really superior to "improper linear models" (Dawes, Faust, & Meehl, 1989)? In short, are things, inside and outside the psychologist's lab, always as they seem at first blush? In this chapter (and some earlier work), we did not study whether fans of a soccer team are overconfident that their favorite team will win. They are, and we acknowledge as much, but does this mean they inflate the probability that their favorite team will win because they want it to win? We stripped away the context within which sports fans usually operate to ask whether the cognitive mechanism underlying their wishful thinking could possibly be as simple
and direct as the following: When a victory for Team A is exogenously endowed with enhanced desirability, its judged probability increases. Appearances notwithstanding, our answer is no.
References
Babad, E. (1987). Wishful thinking and objectivity among sports fans. Social Behavior, 2, 231–240.
Babad, E., & Katz, Y. (1991). Wishful thinking—Against all odds. Journal of Applied Social Psychology, 21, 1921–1938.
Babad, E., & Yakobos, E. (1993). Wish and reality in voters' predictions of election outcomes. Political Psychology, 14, 37–54.
Bar-Hillel, M., & Budescu, D. V. (1995). The elusive wishful thinking effect. Thinking and Reasoning, 1, 71–104.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Dawes, R. M., & Mulford, M. (1996). The false consensus effect and overconfidence: Flaws in judgment, or flaws in how we study judgment? Organizational Behavior and Human Decision Processes, 65, 201–211.
Fischer, I., & Budescu, D. V. (1995). Desirability and hindsight biases in predicting results of a multi-party election. In J. P. Caverni, M. Bar-Hillel, H. F. Barron, & H. Jungermann (Eds.), Contributions to decision research I (pp. 185–203). Amsterdam: Elsevier Science (North Holland).
Granberg, D., & Brent, E. (1983). When prophecy bends: The preference expectation link in the U.S. presidential elections, 1952–1980. Journal of Personality and Social Psychology, 45, 477–491.
Hubbard, R., & Armstrong, J. S. (1997). Publication bias against null results. Psychological Reports, 80, 337–338.
Krizan, Z., & Windschitl, P. D. (2007). The influence of outcome desirability on optimism. Psychological Bulletin, 133, 95–121.
McKenna, F. P. (1993). It won't happen to me: Unrealistic optimism or illusion of control? British Journal of Psychology, 84, 39–50.
Sterling, T. D., Rosenbaum, W. L., & Weinkam, J. J. (1995). Publication decisions revisited: The effect of the outcome of statistical tests on the decision to publish and vice versa. The American Statistician, 49, 108–112.
Weinstein, N. (1980). Unrealistic optimism about future life events. Journal of Personality and Social Psychology, 39, 806–820.
Weinstein, N. (1982). Unrealistic optimism about susceptibility to health problems. Journal of Behavioral Medicine, 5, 441–460.
Notes
1. Twenty-seven additional respondents contributed data only to the salience results below. We discarded their desirability results because they indicated that they were fans of one of the four teams in the two games we studied.
2. Here, the two favorites happened to lose.
9
How Expectations Affect Behavior: Fairness Preferences or Fairness Norms?
Cristina Bicchieri
University of Pennsylvania
Since its origin, philosophy has been concerned with fairness: how to define it, how to justify our intuitions about it, and how to lend consistency to the multiplicity of meanings fairness usually takes. For the philosopher, what is at stake is the normativity of our moral judgments, what can possibly ground their "ought" claims. In a dialogue that has lasted more than 15 years, I, the philosopher, and Robyn Dawes, the psychologist, have explored the why and how of many behaviors that we would normally call ethical: cooperation and reciprocity, fairness, benevolence, and altruism. We have had many discussions about how much such behavior is sensitive to the decision context, and the crucial role expectations play in our assessment of a given situation. Understanding how people form fairness judgments, the cognitive dynamics involved in the process, and what drives fair behavior on one occasion and dampens it in another are important steps that any philosopher should take in the direction of building better normative theories. Naturalizing ethics does not mean reducing what ought to be done to what is done: This would be a trivial naturalistic fallacy misstep. What instead needs to be done is to build our normative theories upon the solid foundation of what we know individuals can do, and this is a whole different project. I embarked on this project long ago by trying to show that our ethical norms are just collectively defined and supported social norms. Some such norms are more entrenched than
others, but the cognitive processes underlying norm-following, and the biases we all face in filtering and processing the social information that will ultimately decide whether or not we act in a prosocial way, are essentially the same. Without knowledge of such cognitive processes, and the behaviors they engender, ethics is condemned to remain an abstract and fairly useless endeavor. In what follows I will concentrate upon some experimental results that show what appears to be individuals’ disposition to behave in a fair manner in a variety of circumstances. One common explanation is that individuals have a preference for fairness. The alternative explanation I propose is that, in the right kind of circumstances, individuals obey fairness norms. To say that we obey fairness norms differs from assuming that we have a preference for fairness (Bicchieri 2000, 2006). To follow a fairness norm, we must have the right kind of expectations. We must expect others to follow the norm too and believe that there is a generalized expectation that we will obey the norm in the present circumstances. The preference to obey a norm is conditional upon such expectations.1 Take away some of the expectations, and behavior will significantly change. A conditional preference will thus be stable under certain conditions, but a change in the relevant conditions may induce a predictable preference shift. The predictions of a norm-based theory are thus testable and quite different, at least in some critical instances, from the predictions of theories that postulate a social preference for fairness. When economists postulate fairness preferences, they make two related, important assumptions. The first is that what matters to an agent is the final distribution, not the way the distribution came about (Falk, Fehr, & Fischbacher, 1999): this is a consequentialist assumption. The second assumption is that preferences are stable. Both assumptions are easy to test. When falsified, however, it is less clear who the culprit is. For example, if a person has a stable preference for fair outcomes, we would expect his or her cross-situational behavior to be consistent and insensitive to the circumstances surrounding the specific distributive situation. Whether you are the proposer in an ultimatum or a dictator game should not matter to your choice of how much money to give to a responder. Similarly, information about who the proposer is—a real person or a random device—should not have an effect on one’s propensity to accept or reject its offer. What is observed instead is cross-situational inconsistency. The reason for this inconsistency is not obvious. It is possible that people do care about how a distribution came about and that the process itself matters. For example, we might accept an unequal share of the pie if it comes from a lottery but reject it if it results from an auction. Preferences could still be assumed to be stable, but
in this case what we prefer is a combination of goods and processes to distribute and allocate those goods. On the other hand, preferences may be highly context dependent. Change the context, or the context’s description, and there is a noticeable preference shift. In the latter case, however, making any prediction would require a mapping from contexts to preferences. No such mapping has ever been provided. In what follows I will examine two of the most common games studied by experimental economists. Ultimatum and dictator games come in many flavors and variants, but the simplest, bare versions of both games are in some sense ideal, because they offer a very simplified allocation problem. The good to be allocated (or divided) is money, and the situation is such that most familiar contextual clues are removed. The results of such experiments consistently defy the predictions of traditional rational choice models. Agents are clearly not solely concerned with their monetary payoffs: They care about what other agents get and how they get it. The big challenge has been to enrich traditional rational choice models in such a way that they can explain (and predict) behavior that is not just motivated by material incentives in a variety of realistic contexts. I will compare one of the most interesting and influential new models with my norm-based approach and show that the hypothesis that people obey fairness norms offers a more satisfactory explanation for the phenomena we observe. Where my predictions differ from those of the alternative, social preference model, the data seem to vindicate my model. However, we need many more experiments to test the effects that manipulating expectations (and thus norm compliance) has on behavior.
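To make the idea of a conditional preference concrete, the following deliberately simplified sketch treats norm compliance as a payoff penalty that is switched on only when both kinds of expectations hold. It illustrates the structure of the argument rather than Bicchieri's actual utility function, and every name and parameter value in it is invented.

    def norm_utility(payoff: float, deviation: float, k: float,
                     expect_others_comply: bool, believe_iam_expected: bool) -> float:
        """Material payoff minus a norm-violation penalty that is active
        only when both empirical and normative expectations hold.

        deviation: shortfall from the norm (e.g., 0.5 - offered share)
        k: the individual's sensitivity to the fairness norm
        """
        norm_active = expect_others_comply and believe_iam_expected
        return payoff - (k * max(deviation, 0.0) if norm_active else 0.0)

    # A proposer keeping 80% of the pie (deviation 0.3 from an equal split):
    print(norm_utility(0.8, 0.3, k=2.0, expect_others_comply=True, believe_iam_expected=True))   # 0.2
    print(norm_utility(0.8, 0.3, k=2.0, expect_others_comply=False, believe_iam_expected=True))  # 0.8
    # Removing either expectation deactivates the norm, so the same person can
    # predictably behave fairly under one set of expectations and selfishly
    # under another, as the text argues.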
The Ultimatum Game
In 1982 Güth, Schmittberger, and Schwarze published a seminal study in which they asked subjects to play what is now known as an ultimatum bargaining game. Their goal was to test the predictions of game theory about equilibrium behavior. Their results instead showed that subjects consistently deviate from what game theory predicts. To understand what game theory predicts, and why, let us look at a typical ultimatum game (Figure 9.1). The structure of this game is fairly simple. Two people must split a fixed amount of money M according to the following rules: The proposer (P) moves first and offers a division of M to the responder (R), where the offer can range between M and zero. The responder has a binary choice in each case: to accept the offer or to reject it. If the offer is accepted, the proposer receives M − x, and the responder receives x,
Figure 9.1 Ultimatum game.
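The subgame-perfect prediction that the text goes on to derive can be stated as a short backward-induction sketch, assuming purely money-maximizing players. The code is only an illustration; the one-cent grain, the tie-breaking rule, and the function names are choices made here, not part of the original presentation.

    # Backward induction on the ultimatum game of Figure 9.1 with purely
    # money-maximizing players. Amounts are in cents; M = $10.00.
    M = 1000

    def responder_accepts(offer: int) -> bool:
        # A money-maximizing responder accepts any strictly positive offer;
        # at zero she is indifferent, so assume rejection to break the tie.
        return offer > 0

    def proposer_offer(pie: int) -> int:
        # The proposer anticipates the acceptance rule and offers the
        # smallest amount that will still be accepted.
        return min(x for x in range(pie + 1) if responder_accepts(x))

    offer = proposer_offer(M)
    print(offer, M - offer)  # 1 999: offer one cent, keep $9.99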
where x is the offer amount. If the offer is rejected, each player receives nothing. If rationality is common knowledge, the proposer knows that the responder will always accept any amount greater than zero, because accept dominates reject for any offer greater than zero. Hence P should offer the minimum amount guaranteed to be accepted, and R will accept it. For example, if M = $10 and the minimum available amount is 1 cent, the proposer should offer it, and the offer should be accepted, leaving the proposer with $9.99. Experiments find, however, that nobody offers 1 cent or even 1 dollar. Note that such experiments are always one-shot and anonymous. That is, subjects play the game only once with an anonymous partner and are guaranteed that their choice will not be disclosed. The absence of repetition is important to distinguish between generous behavior that is dictated by a rational, selfish calculation and genuine generosity. If an ultimatum game is repeated with the same partner, or if players suspect that future partners will know of their past behavior, it may be perfectly rational for players who are only interested in their material payoff to give generously if they expect to be on the receiving side at a future time. On the other hand, a receiver who might accept the minimum in a one-shot game might want to reject a low offer at the beginning of a repeated game, in the hope of convincing future proposers to offer more. In the United States, as well as in a number of other countries, the modal and median offers in one-shot experimental games are 40 to
50% of the total amount, and the mean offers are 30 to 40%. Offers below 20% are rejected about half the time.2 These results are robust with respect to variations in the amount of money that is being split and cultural differences (Camerer, 2003). For example, we know that raising the stake from $10 to $100 does not decrease the frequency of rejections of low offers (those between $10 and $20) and that in experiments run in Slovenia, Pittsburgh, Israel, and Tokyo the modal offers were in the range of 40 to 50% (Hoffman, McCabe, & Smith, 1998; Roth, Prasnikar, Okuno-Fujiwara, & Zamir, 1991). If by rationality we mean that subjects maximize expected utility and that they only value their monetary outcomes, then we must conclude that a subject who rejects a nonzero offer is acting irrationally. However, individuals’ behavior across games suggests that money is not the sole consideration, and instead there is a concern for fairness, so much so that subjects are prepared to punish at a cost to themselves those who behave in inequitable ways. A concern for fairness is just one example of a more general fact about human behavior: We are often motivated by a host of factors, of which monetary incentives are one, and often not the most important. We act out of love, envy, spite, generosity, desire to imitate, sympathy, or hatred, to name just a few of the passions and desires that move us to act. When faced with different possible distributions, we usually care about how we fare with respect to others, how the distribution came about, who implemented it, and why. Experiment after experiment has demonstrated that individuals care about others’ payoffs, that they may want to spend resources to increase or decrease such payoffs, and that what they perceive to be the (good or bad) intentions of those they interact with weigh in their decisions. Unfortunately, the default utility function in game theory is a narrowly selfish one: It is selfish because it depicts people who care only about their own outcomes, and it is narrow because motivations like altruism, benevolence, guilt, envy, or hatred are kept out of the picture. Such motives, however, can and should be incorporated into a utility function, and economists have recently started to develop richer, more complex models of human behavior that try to explain what we have always known: People care about other people’s outcomes. Thus, a better way to explain what is observed in experiments (and real life) is to provide a richer definition of rationality: People still maximize their utilities, but the arguments of their utility functions include other people’s utilities. The obvious risk of such models is their “ad hocness”: One may easily explain any data by adjusting the utility function to reflect what looks like envy or altruism or a preference for equal shares. What we need
are utility functions that are general enough to subsume many different experimental phenomena and specific enough to make falsifiable predictions. In what follows I will look at some possible explanations for the generous distributions we observe in ultimatum games and test these explanations against some interesting variations of the game. Such testing is not always easy to conduct. The problem is that we still have quite rudimentary theories of how motives affect behavior, and to test a hypothesis about what sort of motives induce us to act one way or another, we have to be very specific in defining such motives and the ways in which they influence our choices. Let me clarify this statement with an example. Observing the results of ultimatum games, someone might argue that subjects in the proposer's role are behaving altruistically. Others would deny that, saying that people like to give because of the "warm glow" their actions induce in them (Andreoni, 1990), and yet others would say that what we observe is just benevolence—nothing else. To make sense at all, such concepts need to be made as specific as possible, and operational. Take, for example, a distribution (x1, x2) of money between two people. Being an altruist would mean that 1's utility is an increasing function of 2's utility, that is, U1 = f(x2) and ∂U1/∂x2 > 0. Thus, a true altruist would not care about his own share; he would only care about how much the other gets (and the more, the better). A proposer who is a pure altruist would "donate" all the money to the responder, provided he believes the responder only cares about money. Being benevolent instead means that one cares about one's own payoff and the other's, that is, U1 = f(x1, x2). In this case, the first partial derivatives of U1 with respect to x1 and x2 are strictly positive, meaning that the utility of a benevolent player 1 increases both in his own payoff and in player 2's payoff. Depending on a player's degree of benevolence, the proposers will turn out to be more or less generous, but a benevolent attitude on the part of the proposers might explain, prima facie, the results of experimental ultimatum games. The results of typical ultimatum games eliminate the pure altruist hypothesis, because people almost never give more than 50%, but do not eliminate the benevolence hypothesis. If benevolence is a stable character disposition, however, we would expect a certain behavioral stability or consistency in any situation in which a benevolent proposer has to offer a division of money to an anonymous responder. A variant of the ultimatum game is the dictator game, in which the proposers receive a sum of money from the experimenter and decide to split the money any way they choose; the proposer's decision is final in that the responder cannot reject whatever is offered. If we hypothesize
ER59969.indb 192
3/21/08 10:51:04 AM
How Expectations Affect Behavior • 193
that the ultimatum game results reveal that a certain percentage of the population has a benevolent disposition, we should expect to observe roughly the same percentage of generous offers in all those circumstances in which one of the parties, the proposer, is all powerful. In most of the experiments, however, the modal offer is one in which the proposer keeps all the money, and in double-blind experiments 64% of the participants give nothing. Still, it must be mentioned that although the most frequent offer is zero, the mean allocation is 20% (Forsythe, Horowitz, Savin, & Sefton, 1994). These results suggest that people are not totally selfish, but it would be hard to argue they are benevolent unless we are prepared to presume that benevolence is a changeable disposition, as mutable as the circumstances that we encounter.
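To make the operational definitions above concrete, here is a minimal sketch. The particular functional forms (linear for the altruist, square-root for the benevolent player) are my own illustrative assumptions; the text only requires the stated partial derivatives to be positive.

import math

def altruist_utility(x1: float, x2: float) -> float:
    # A pure altruist's utility depends only on the other's share.
    return x2

def benevolent_utility(x1: float, x2: float, w: float = 1.0) -> float:
    # A benevolent player values both shares; the concavity (square roots)
    # is what makes intermediate, generous splits optimal.
    return math.sqrt(x1) + w * math.sqrt(x2)

def best_dictator_offer(utility, M: int) -> int:
    # The allocator keeps M - x and gives x; choose x to maximize utility.
    return max(range(M + 1), key=lambda x: utility(M - x, x))

M = 10
print(best_dictator_offer(altruist_utility, M))    # 10: give everything away
print(best_dictator_offer(benevolent_utility, M))  # 5: the equal split

With these forms, a pure altruist donates the whole stake, while a benevolent allocator with equal weights chooses the equal split, matching the chapter's observation that modal ultimatum offers rule out pure altruism but not benevolence.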
Social Preferences

Altruism and benevolence are just two examples of social preferences. By social preference I refer to how people rank different allocations of material payoffs to self and others. If we stay with the ultimatum game as an example, we can think of other, slightly more complex ways to explain the results we discussed before. The uniformity of responders' behavior suggests that people do not like being treated unfairly. That is, if subjects perceive an offer of 20 or 30% of the money as unfair, they may reject it to punish the greedy proposer, even at a cost to themselves. It is important to emphasize that these experiments were all one-shot, which means the participants were fairly sure of not meeting again; therefore, punishing behavior cannot be motivated as an attempt to convince the other party to be more generous the next time around. Similarly, proposers could not be generous because they were expecting reciprocating behavior in future interactions. One possibility is to assume that both proposers and responders are showing a preference for fair outcomes or an aversion to inequality. We can thus try to explain the experimental results with a traditional rational choice model, where the agents' preferences take into account the payoffs of others. In models of inequality aversion, players prefer both more money and more equal allocations. Though there are several models of inequality aversion, perhaps the best known and most extensively tested is the model of Fehr and Schmidt (1999). This model intends to capture the idea that people may be uneasy, to a certain extent, about the presence of inequality, even if they benefit from the unequal distribution. Given a group of L persons, the Fehr–Schmidt utility function of person i is
Ui(x1, ..., xL) = xi − (αi/(L − 1)) Σj≠i max(xj − xi, 0) − (βi/(L − 1)) Σj≠i max(xi − xj, 0)

where xj denotes the material payoff person j gets. αi is a parameter that measures how much player i dislikes disadvantageous inequality (an envy weight), and βi measures how much i dislikes advantageous inequality (a guilt weight).3 One constraint on the parameters is that 0 < βi < αi, which indicates that people dislike advantageous inequality less than disadvantageous inequality. The other constraint is βi < 1, so that an agent does not suffer terrible guilt when he or she is in a relatively good position. For example, a player would prefer getting more without affecting other people's payoffs even though that results in an increase of the inequality. Applying the model to the game in Figure 9.1, the utility function simplifies to

Ui(x1, x2) = xi − αi(x3−i − xi)  if x3−i ≥ xi
Ui(x1, x2) = xi − βi(xi − x3−i)  if x3−i < xi,  for i = 1, 2
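As a sketch, the general Fehr–Schmidt function transcribes directly into code (names and the numerical check are mine):

def fehr_schmidt_utility(x: list[float], i: int,
                         alpha_i: float, beta_i: float) -> float:
    # x is the vector of material payoffs; i indexes the player (0-based here).
    L = len(x)
    envy = sum(max(xj - x[i], 0.0) for j, xj in enumerate(x) if j != i)
    guilt = sum(max(x[i] - xj, 0.0) for j, xj in enumerate(x) if j != i)
    return x[i] - alpha_i * envy / (L - 1) - beta_i * guilt / (L - 1)

# Two-player check: stake M = 10, offer x = 2, responder with alpha_2 = 0.5:
# U2 = 2 - 0.5 * (8 - 2) = -1, so rejecting (utility 0) is preferred.
print(fehr_schmidt_utility([8.0, 2.0], i=1, alpha_i=0.5, beta_i=0.2))  # -1.0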
Obviously, if the responder rejects the offer, both utility functions are equal to zero, that is, U1reject = U2reject = 0. If the responder accepts an offer of x, the utility functions are as follows:
U1accept(x) = (1 + α1)M − (1 + 2α1)x  if x ≥ M/2
U1accept(x) = (1 − β1)M − (1 − 2β1)x  if x < M/2

U2accept(x) = (1 + 2α2)x − α2M  if x < M/2
U2accept(x) = (1 − 2β2)x + β2M  if x ≥ M/2

The responder should accept the offer if and only if U2accept(x) > U2reject = 0. Solving for x we get the threshold for acceptance: x > α2M/(1 + 2α2). Evidently, if α2 is close to zero, which indicates that player 2 (R) does not care much about being treated unfairly, the responder will accept very stingy offers. On the other hand, if α2 is sufficiently big, the offer has to be close to half to be accepted. In any event, the threshold is not higher than M/2, which means that hyper-fair offers (more than half) are not necessary for the sake of acceptance. Note that for the proposer, the utility function is monotonically decreasing in x when x ≥ M/2. Hence, a rational proposer will not offer more than half of the money. Suppose x ≤ M/2; two cases are possible
depending on the value of β1. If β1 > 1/2, that is, if the proposer feels sufficiently guilty about treating others unfairly, the utility is monotonically increasing in x, and the best choice is to offer M/2. However, if β1 < 1/2, the utility is monotonically decreasing in x, and hence the best offer for the proposer is the minimum one that would be accepted, that is, (a little bit more than) α2M/(1 + 2α2). Last, if β1 = 1/2, it does not matter how much the proposer offers, as long as it is between α2M/(1 + 2α2) and M/2. Note that the other two parameters, α1 and β2, are not identifiable in ultimatum games. As noted by Fehr and Schmidt, the model allows for the fact that individuals are heterogeneous. Different α's and β's correspond to different types of people. Although the utility functions are common knowledge, the exact values of the parameters are not. The proposer, in most cases, is not sure what type of responder he or she is facing. Along Bayesian lines, the proposer's belief about the type of the responder can be formally represented by a probability distribution P on α2 and β2. When β1 > 1/2, the proposer's rational choice does not depend on what P is. When β1 < 1/2, however, the proposer will seek to maximize the expected utility:

EU(x) = P(α2M/(1 + 2α2) < x) × ((1 − β1)M − (1 − 2β1)x)

Therefore, the behavior of a rational proposer in the ultimatum game is determined by the proposer's own type (β1) and his or her belief about the type of the responder. The experimental data suggest that for many proposers, either β is big (β > 1/2) or they estimate the responder's α to be large. The choice of the responder is only determined by the responder's type (α2) and the offer. Small offers are rejected by responders with a positive α. The positive features of the above-described utility function are that it can rationalize both positive and negative outcomes and that it can explain the observed variability in outcomes with heterogeneous types. One of the major weaknesses of this model, however, is that it has a consequentialist bias: Players only care about final distributions of outcomes, not about how such distributions come about.4 As we shall see, more recent experiments have established that how a situation is framed matters to an evaluation of outcomes and that the same distribution can be accepted or rejected depending on "irrelevant" information about the players or the circumstances of play. Another difficulty with this approach is that, if we assume the distribution of types to be constant in a given population, we should observe, overall, the same proportion of "fair" outcomes in ultimatum games. Not only does this not happen,
but we observe individual inconsistencies in behavior across different situations in which the monetary outcomes are the same. If we assume, as is usually done in economics, that individual preferences are stable, we would expect similar behaviors across ultimatum games. If instead we conclude that preferences are context dependent, we should provide a mapping from contexts to preferences that indicates in a fairly predictable way how and why a given context or situation changes one’s preferences. Of course, different situations may change a player’s expectation about another player’s envy or guilt parameters, and we could thus explain why a player’s behavior may change depending upon how the situation is framed. In the case of Fehr and Schmidt’s utility function, however, experimental evidence that I shall discuss later implies that a player’s own β (or α) changes value in different situations, yet nothing in their theory explains why one would feel consistently more or less guilty (or envious) depending on the decision context.
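The acceptance threshold and the proposer's best reply derived above can be sketched as follows, assuming for simplicity that the proposer knows the responder's α2 (in the game proper, only a belief over α2 is available). The parameter values are illustrative.

def fs_acceptance_threshold(M: float, alpha_2: float) -> float:
    # Responder accepts iff x > alpha_2 * M / (1 + 2 * alpha_2); never above M/2.
    return alpha_2 * M / (1 + 2 * alpha_2)

def fs_best_offer(M: float, beta_1: float, alpha_2: float,
                  eps: float = 0.01) -> float:
    if beta_1 > 0.5:   # guilt dominates: offer the equal split
        return M / 2
    if beta_1 < 0.5:   # offer just above the responder's threshold
        return fs_acceptance_threshold(M, alpha_2) + eps
    return M / 2       # beta_1 == 0.5: indifferent within the whole range

print(fs_acceptance_threshold(10, 0.5))              # 2.5: offers below $2.50 rejected
print(fs_best_offer(10, beta_1=0.25, alpha_2=0.5))   # 2.51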
Norms matter

Rule-based approaches are not completely new. Guth (1995), for example, interpreted the results of the ultimatum game as showing that people have rules of behavior such as sharing money equally, and they apply them when necessary. The problem with such solutions is that we need a plausible story about how people change their behavior in response to changes in payoffs and framing. If rules are inflexible, but we observe flexible compliance, there must be something wrong with a rule-based approach. Indeed, a common understanding of norms, one that I have tried to dispel in my definition (see Appendix 9.1), is that they are inflexible behavioral rules that one would apply in any circumstance that calls for them. Nothing could be farther from the truth. To be effective, norms have to be activated by salient cues.5 As I explain (Bicchieri, 2000, 2006), a norm may exist, but it may not be followed simply because the relevant expectations are not there or because one might be unaware of being in a situation to which the norm applies. I have argued that people have conditional preferences for conformity to a norm, in that they would prefer to follow it on condition that (1) they expect others to follow it and (2) they believe that, in turn, they are expected by others to abide by the norm (see Appendix 9.1 and Bicchieri, 2006). Both conditions have to be present to generate conformity. Indeed, there is plenty of evidence that manipulating people's expectations has an effect on norm compliance (Cialdini et al., 1990). Thus, I would argue that belief elicitation in experiments is crucial to determine whether a norm will be perceived as relevant and then followed.
We already know, for example, that telling subjects how others have behaved in a similar game has a profound effect on their choices and that allowing people to communicate before playing the game often results in a cooperative outcome.6 Ultimatum games are an ideal tool to study fair behavior, because they offer a very simple allocation choice. The good to be allocated is money, and the situation is such that most familiar contextual clues are removed. It is thus possible to introduce in this rarefied environment simple contextual information and control for its effects on the perception of what constitutes a fair division. We know that to be fair means different things in different contexts. In some situations being fair means sharing equally. In others it may mean giving more to the needy or to the deserving. In the simplest context, when there is no reason to differentiate between proposer and responder, an equal split is usually called for, but the salience of the equal split solution is lost if subjects are told that offers are generated by a random device (Blount, 1995) or if it is believed that the proposer was otherwise constrained in his or her decision. In both cases responders are willing to accept lower offers. This phenomenon is well known to consummate bargainers: If an unequal outcome can be credibly justified as a case of force majeure, people can be convinced to accept much less than an equal share. Also, variations in the strength of property rights alter the shared expectations of the two players regarding the norm that determines the appropriate division. In the original ultimatum game, the proposer receives what amounts to a monetary gift from the experimenter. As a consequence, the proposer is perceived as having no special right to the money and is expected (at least in our culture) to share it equally with the responder. Because the fairness norm that is activated in this context dictates an equal split, the proposer who is offering little is perceived as stingy and consequently gets punished. Note that the proposer who was constrained in his or her decision is not seen as being intentionally stingy, because intentions do matter only when the choice is perceived as being freely made. To infer another person’s intention or motive, we consider not only the action chosen, but also the actions that were not chosen but, as far as we know, could have been chosen. Since what counts as fair is highly context dependent, a specific context simultaneously gives reasons to expect behavior appropriate to the situation and a clue as to the proposer’s intention, especially when the offer is different from what is reasonably expected in that context. Subjects approach resource sharing or, for that matter, any other situation with implicit knowledge structures (scripts) that detail conditions that
are prototypically associated with sharing tasks. Once we have categorized the particular decision task we face, we enact scripts that tell us how people typically behave and what they expect others to do. However, it must be emphasized that people will display expected, appropriate behavior to the extent that crucial environmental cues match those of well-known prototypical scripts. An interesting question to ask is thus under which conditions an equal sharing norm will be violated. I shall discuss this point more extensively later on, but for now let me say that my hypothesis is that a deviation from equal sharing will be mainly due to (1) the presence of appropriate and acceptable justifications for taking more than an equal share or (2) the shift to a very different script that involves different roles and expectations. An example of the second reason is when the proposer is labeled "seller," and the responder, "buyer"; in this case the proposer offers a lower amount than in the control, and responders readily accept (and expect) less than an equal share (Hoffman, McCabe, Shachat, & Smith, 1994). In this case, the interaction is perceived as being market-like, and in a market script it is deemed equitable that a seller earns a higher return than a buyer. An example of the first reason is when the proposer has "earned" the right to the money by, for example, getting a higher score on a general knowledge quiz (Frey & Bohnet, 1995; Hoffman & Spitzer, 1985). In this case the proposer has an available, acceptable justification for getting more than the equal share. Doing better than someone else in a test is a common and reasonable mechanism, at least in our society, for determining differential access to a shared resource. It thus seems appropriate to many proposers to choose equity over equality in such conditions. There is continuity between real life and experiments with respect to how rights and entitlements, considerations of merit, need, desert, or sheer luck shape our perception of what is fair and what kind of reasons count as acceptable justifications for violating a fairness norm. Cultures differ in their reliance on different allocative and distributive rules, because such rules depend on different forms of social organization. Within a given culture, however, there usually is a consensus about how different goods and opportunities should be allocated or distributed. Cross-cultural studies of ultimatum and dictator games in 15 small-scale societies show quite convincingly that the behavior displayed in such games was highly correlated with the economic organization and social structure of each society (Henrich, Boyd, Bowles, Fehr, & Camerer, 2004). Furthermore, because experimental play is presumably categorized according to the specific sociocultural patterns of each society, the experimental results showed much greater variability than the results of typical ultimatum and dictator games played in modern
Western (or westernized) societies.7 These results lend even more support to the hypothesis that social norms, and the accompanying shared expectations, play a crucial role in shaping behavioral responses to experimental games. A norm-based explanation of the results of experiments with ultimatum and dictator games predicts that whenever proposers are focused upon the relevant expectations, they will behave in a norm-consistent way. In the traditional ultimatum game, the expected cost of not following an equal division rule may be enough to elicit fair behavior. In considering what the responder would accept, the proposer is forced to look at the situation and categorize it as a case in which an equality rule applies. This does not mean the person who follows the norms is in fact fair or places a high value on equitable behavior. As I make plain in my definition of what it takes to follow an existing norm (Appendix 9.1), if a player assigns a sufficiently high probability to the opponent's following the norm and expects to be punished for noncompliance, that player will prefer to conform to a norm even if he or she has no interest in the norm itself. The general utility function I introduced in Bicchieri, 2006, can now be applied to the ultimatum game. Let πi be the payoff function for player i. The norm-based utility function of player i depends on the strategy profile s and is given by

Ui(s) = πi(s) − ki max_{s−j ∈ L−j} max_{m≠j} {πm(s−j, Nj(s−j)) − πm(s), 0}

where ki ≥ 0 is a constant representing i's sensitivity to the relevant norm. Such sensitivity may vary with different norms; for example, a person may be very sensitive to equality and much less so to equity considerations. The first maximum operator takes care of the possibility that the norm instantiation (and violation) might be ambiguous in the sense that a strategy profile instantiates a norm for several players simultaneously (as would be the case, for example, in a social dilemma with three players). The second maximum operator ranges over all the players other than the norm violator. In plain words, the discounting term (multiplied by ki) is the maximum payoff deduction resulting from all norm violations. The model is motivated by people's apparent respect (or disregard) for social norms regarding fairness. In the traditional ultimatum game, the norm usually prescribes a fair amount the proposer ought to offer. The norm functions that represent this norm are the following: N1 is a constant function, and N2 is nowhere defined.8 If the responder (player 2) rejects the offer, the utilities of both players are zero:
U1reject(x) = U2reject(x) = 0

Given that the proposer (player 1) offers x and the responder accepts, the utilities are the following:

U1accept(x) = M − x − k1 max(N1 − x, 0)
U2accept(x) = x − k2 max(N2 − x, 0)

where Ni denotes the amount player i thinks he or she should get or offer according to some social norm applicable to the situation, and ki is nonnegative. Note that k1 measures how much player 1 dislikes deviating from what he or she takes to be the norm. To obey a norm, sensitivity to the norm need not be high. Fear of retaliation may make a proposer with a low k behave according to what fairness dictates but, absent such risk, that player's disregard for the norm will lead him or her to be unfair. For the moment, I assume it is common knowledge that N1 = N2 = N, which is not too unreasonable in the traditional ultimatum game. Again, the responder should accept the offer if and only if U2accept(x) > U2reject = 0, which implies the following threshold for acceptance: x > k2N/(1 + k2). Notice that an offer larger than the norm dictates is not necessary for the sake of acceptance. For the proposer, the utility function is decreasing in x when x ≥ N, so a rational proposer will not offer more than N. Suppose x ≤ N. If k1 > 1, the utility function is increasing in x, which means the best choice for the proposer is to offer N. If k1 < 1, the utility function is decreasing in x, which implies that the best strategy for the proposer is to offer the least amount that would result in acceptance, that is, (a little bit more than) the threshold k2N/(1 + k2). If k1 = 1, it does not matter how much the proposer offers, provided the offer is between k2N/(1 + k2) and N. It should be noted that k1 plays a role very similar to that of β1 in the Fehr–Schmidt model. If we take N to be M/2 and k1 to be 2β1, the two models agree on what the proposer's utility is. It is equally apparent that k2 in this model is analogous to α2 in the Fehr–Schmidt model. There is, however, an important difference between these parameters. The αs and βs in the Fehr–Schmidt model measure people's degree of aversion toward inequality, which is a very different disposition than the one measured by the ks, that is, people's sensitivity to different norms. The latter will usually be a stable disposition, and behavioral changes may thus be caused by changes in focus or in expectations. A theory of norms can explain such changes, whereas a theory of inequity aversion does not. I will come back to this point later.
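A sketch of these norm-based utilities and the resulting best replies, assuming N is commonly known; parameter values are illustrative only.

def u2_accept(x: float, N: float, k2: float) -> float:
    return x - k2 * max(N - x, 0.0)

def norm_acceptance_threshold(N: float, k2: float) -> float:
    # Responder accepts iff x > k2 * N / (1 + k2).
    return k2 * N / (1 + k2)

def norm_best_offer(M: float, N: float, k1: float, k2: float,
                    eps: float = 0.01) -> float:
    if k1 > 1:   # strong norm sensitivity: offer exactly N
        return N
    if k1 < 1:   # offer just above the responder's threshold
        return norm_acceptance_threshold(N, k2) + eps
    return N     # k1 == 1: indifferent within the whole range

M, N = 10.0, 5.0
print(norm_acceptance_threshold(N, k2=1.0))    # 2.5
print(norm_best_offer(M, N, k1=0.5, k2=1.0))   # 2.51
print(u2_accept(2.51, N, k2=1.0))              # 0.02 > 0, so the offer is accepted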
It is also the case that the proposer’s belief about the responder’s type figures in the proposer’s decision when k1 < 1. The belief can be represented by a joint probability over k2 and N2, if the value of N2 is not common knowledge. The proposer should choose an offer that maximizes the expected utility
EU(x) = P(k2N2/(1 + k2) < x) × (M − x − k1(N1 − x))
As will become clear, an advantage this model has over the Fehr– Schmidt model is that it can explain some variants of the traditional ultimatum game more naturally. However, it shares a problem with the Fehr–Schmidt model: They both entail that fear of rejection is the only reason people offer almost-fair amounts rather than lower sums. This prediction, however, could be easily refuted by a parallel dictator game where rejection is not an option.
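Both models give the proposer the same decision structure under uncertainty: choose the offer that maximizes acceptance probability times the utility of acceptance. The sketch below searches a grid of offers for the norm-based version; the discrete belief over k2 is an invented illustration, not data from any experiment.

def expected_utility(x, u_accept, thresholds_with_probs):
    """EU(x) = P(responder's threshold < x) * U_accept(x)."""
    p_accept = sum(p for t, p in thresholds_with_probs if t < x)
    return p_accept * u_accept(x)

M, N, k1 = 10.0, 5.0, 0.5
# Assume k2 is 0.5, 1, or 3 with equal probability, giving thresholds
# k2*N/(1+k2) of 5/3, 2.5, and 3.75 respectively.
belief = [(5 / 3, 1 / 3), (2.5, 1 / 3), (3.75, 1 / 3)]
u1 = lambda x: M - x - k1 * max(N - x, 0.0)

offers = [i / 10 for i in range(101)]
best = max(offers, key=lambda x: expected_utility(x, u1, belief))
print(best, expected_utility(best, u1, belief))  # 3.8 5.6: just clearing the top threshold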
Variations on the Ultimatum Game

So far I have only considered the basic ultimatum game, which is not the whole story. A number of interesting variants of the game exist in the literature, and I now apply the two alternative models to some of them to see if they can tell reasonable stories about what happens in those experiments.

Ultimatum Game With Asymmetric Information and Payoffs

Kagel, Kim, & Moser (1996) designed an ultimatum game in which the proposer is given a certain amount of chips. The chips are worth either more or less to the proposer than they are to the responder. Each player knows how much a chip is worth to him or her but may or may not know that the chip has a different value to the other player. Participants play an ultimatum game over 10 rounds with changing opponents, and this is public knowledge. The particularly interesting setting is one in which the chips are worth three times as much to the proposer, and only the proposer knows it. It turns out that in this case the offer is (very close to) half of the chips and the rejection rate is low. A popular reading of this result is that people merely prefer to appear fair, as a really fair person is supposed to offer about 75% of the chips. As Figure 9.2 shows, proposers offered close to 50% of the chips, and very few such offers were rejected. To analyze this variant formally, we only need a small modification to our original setting. That is, if the responder accepts an offer of x, the proposer actually gets 3(M − x), though, to the responder's knowledge, the proposer only gets M − x.
Figure 9.2 Asymmetric information about chip value. (When chips are worth three times as much to the proposer and only the proposer knows it: proportion of chips offered ≈ 0.47; proportion of offers rejected ≈ 0.08.)
In the Fehr–Schmidt model, the utility function of player 1 (the proposer), if the offer gets accepted, is now the following:

U1accept(x) = (3 + 3α1)M − (3 + 4α1)x  if x ≥ 3M/4
U1accept(x) = (3 − 3β1)M − (3 − 4β1)x  if x < 3M/4

The utility function of the responder upon acceptance does not change, as, to the best of the responder's knowledge, the situation is the same as in the simple ultimatum game. Also, if the responder rejects the offer, both utilities are again zero. It follows that the responder's threshold for acceptance remains the same; he or she accepts the offer if x > α2M/(1 + 2α2). For the proposer, if β1 > 3/4, his or her best offer is 3M/4; otherwise the best offer is the minimum amount above the threshold. An interesting point is that even if a player offers M/2 in the simple ultimatum game, which indicates that β1 > 1/2, that player may not offer 3M/4 in this new condition. This prediction is consistent with the observation that almost no one offers 75% of the chips in the real game. At this point, it seems the Fehr–Schmidt model does not entail a difference in behavior in this new game, but proposers in general do offer more in this new setting than they do in the usual ultimatum game, which naturally leads to the lower rejection rate. Can the Fehr–Schmidt model explain this? One obvious way is to adjust α2 so that the predicted threshold increases, but there is no reason in this case for the responder to change his or her attitude toward inequality. Another explanation might be that under this new setting, the proposer believes that the responder's distaste for inequality increases, for after all, it is the proposer's belief about α2 that affects the offer. This move sounds
as questionable as the last one, but it does point to a reasonable explanation. Because the proposer is uncertain about the responder’s type, the proposer’s belief about α2 should be represented by a nondegenerate probability distribution. The proposer should choose an offer that maximizes his or her expected utility, which in this case is given by the following:
EU(x) = P(α2 < x/(M − 2x)) × ((3 − 3β1)M − (3 − 4β1)x)
The main difference between this expected utility and the one in the simple ultimatum game is that it involves a bigger stake. Hence, it is likely to be maximized at a bigger x unless the distribution (the proposer's belief) over α2 is sufficiently odd. Thus, the Fehr–Schmidt model can explain the phenomenon in a reasonable way. If we apply my model to this new setting, again the utility function of player 2 does not change. The utility function of player 1 (the proposer) given acceptance is changed to

U1accept(x) = 3(M − x) − k1 max(N1′ − x, 0)

I use N1′ here to indicate that the proposer's perception of the fair amount, or the proposer's interpretation of the norm, may have changed due to his or her awareness of the informational asymmetry.9 My model behaves quite similarly to the previous one. Specifically, the responder's threshold for acceptance is still k2N2/(1 + k2). The proposer will and should offer N1′ only if k1 > 3, so people who offer the fair amount in the simple ultimatum game (k1 > 1) may not offer the fair amount in the new setting. That means that even if N1′ = 3M/4, the observation that few people offer that amount does not go against my model. The best offer for most people (k1 < 3) is the smallest amount that would be accepted. However, because the proposer is not sure about the responder's type, the proposer will choose an offer to maximize his or her expected utility, and this in general leads to an increase of the offer, given an increase of the stake. Although it is not particularly relevant to the analysis in this case, it is worth noting that N1′ is probably less than 3M/4 in the situation as thus framed. This point will become crucial in games with obvious framing effects.

Ultimatum Game With Different Alternatives

There is also a very simple twist to the ultimatum game, which turns out to be quite interesting. Falk et al. (2003) introduced a simple ultimatum game where the proposer has only two choices: either offer 2 (and keep 8) or make an alternative offer that varies across treatments
in a way that allows the experimenter to test the effect of reciprocity and inequity aversion on rejection rates. The alternative offers in four treatments are (5,5), (8,2), (2,8), and (10,0). As Figure 9.3 shows, when the (8,2) offer is compared to the (5,5) alternative, the rejection rate is 44.4%, which is much higher than the rejection rates in each of the three alternative treatments. It turns out that the rejection rate depends a lot on what the alternative is. The rejection rate decreases to 27% if the alternative is (2,8), and further decreases to 9% if the alternative is (10,0).10 It is hard for the Fehr–Schmidt model to explain these results. In their consequentialist model there does not seem to be any role for the available alternatives to play. As the foregoing analysis shows, the best reply for the responder is acceptance if x > α2M/(1 + 2α2). That is, different alternatives can affect the rejection rate only through their effects on α2. It is not entirely implausible to say that what could have been otherwise affects one’s attitude towards inequality. After all, one’s dispositions are shaped by all kinds of environmental or situational factors, to which the “path not taken” seems to belong. Still it sounds quite odd that one’s sensitivity to fairness changes as alternatives vary, and in particular, it is not compatible with the assumption of independence of irrelevant alternatives, a common assumption in decision theory. The norm-based model, by contrast, seems to have an easier time. For one thing, my model can explain the data by telling a story about how the norm’s perception might change, and the story, unlike the previous case, can be quite plausible. Recall that my definition of what it takes to follow a norm relies heavily on expectations, both empirical and normative. As I discussed (Bicchieri, 2000, 2006), how we decide and act in a situation depends upon how we interpret, understand, and )LJXUH encode it. Once a situation is categorized as a member of a particular class, a schema (or script) is invoked. Such a script allows us to make Proposer (8,2)
(5,5)
Proposer (8,2)
Proposer
(2,8)
(8,2)
(10,0)
Responder
Responder
Responder
a
a
a
(8, 2)
r (0, 0)
Rejection Rate: 44.4%
(8, 2)
r (0, 0)
(8, 2)
Rejection Rate: 27%
r (0, 0)
Rejection Rate: 9% e
Figure 9.3 Ultimatums with alternatives offers.
inferences about unobservable variables, predict other people's behavior, make causal attributions, and modulate emotional reactions. The script we invoke is the source of both projectible regularities and the legitimacy of our expectations. If, as I argued (Bicchieri, 2006), social norms are embedded into scripts, then the particular way a situation is framed will have a large effect on our expectations about others' behavior and what they expect from us. Thus, a change in the way a situation is framed will induce a change in expectations and have an immediate effect on our focusing (or not focusing) on the norm that has (or has not) been elicited. As the possible alternatives vary, the player may no longer believe that the same norm applies, and it is quite reasonable to conjecture that different alternatives point the responder to different norms (or lack thereof). In the (8,2), (5,5) situation, players are naturally focused on the equal split. The proposer who could have chosen it but did not is sending a clear message about his disregard for fairness. If the expectation of a fair share is violated, the responder will probably feel outraged, attribute a greedy intention to the proposer, and punish him accordingly. If the alternatives are (8,2) or (2,8), few people would expect a proposer to sacrifice for the responder. In real life, situations like this are decided with a coin toss. In the game context, it is difficult to see that any norm would apply to the situation. This is why 70% of the subjects choose the (8,2) split and only 27% reject it. Finally, the choice of (8,2) when the alternative is (10,0) appears quite nice, and indeed the rejection rate is only 9%. When the alternative for the proposer is to offer the whole stake, there is little reason for the responder to think that the norm is still (50%, 50%) or something close to this. Thus, a natural explanation given by my model is that N2 changes (or may be empty) as the alternative varies. The results of this experiment tell us that most people do not have selfish material preferences, in which case they would always accept the (8,2) division. They also tell us that people are not simply motivated by a dislike for inequality, for otherwise we would have observed the same rejection rate in all contexts.

Ultimatum Game With Framing

Framing effects, a topic of continuing interest to psychologists and social scientists, have also been investigated in the context of ultimatum games. Hoffman et al. (1994), for example, designed an ultimatum game in which groups of 12 participants were ranked on a scale of 1 to 12 either randomly or by superior performance in answering questions about current events. The top six were assigned to the role of seller and the rest to the role of buyer. They also ran experiments with the standard
ultimatum game instructions, both with random assignments and assignment to the role of proposer by contest. The exchange and contest manipulations elicited significantly lower offers, but rejection rates were unchanged compared to the standard ultimatum game.11 Figure 9.4 shows that the "exchange" framing significantly lowered offers but also that being the winner of a contest in the traditional ultimatum game had an effect on the proposers' offers. Several other experiments have consistently shown that when the proposer is a contest winner (Frey & Bohnet, 1995) or has "earned the right" to that role (Hoffman & Spitzer, 1985), offers are lower than in the traditional ultimatum game. As I suggested before, in the presence of prototypical, acceptable justifications for deviating from equality, subjects will be induced to follow an equity principle. Framing in this case provides salient cues suggesting that an equity rule is appropriate to the situation. Because, from a formal point of view, these situations are not different from that of a traditional ultimatum game, the previous analysis remains the same. Hence, according to the Fehr–Schmidt model, the framing of the game decreases α2. In other words, the role of a buyer or the knowledge that the proposer was a superior performer or had simply earned the right to his role lowers the responder's concern for fairness. This change does not sound intuitive and demands some explanation. In addition, the proposer has to expect this change in the responder's concern for fairness in order to lower his offer. It is equally, if not more, difficult to see why the framing can lead to different beliefs the proposer has about the responder.
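To make the contrast concrete, the following sketch (with invented parameter values) puts the two candidate explanations side by side, reusing the threshold expressions derived earlier: the Fehr–Schmidt account must move α2 itself, whereas the norm-based account holds k2 fixed and lets the framing swap the norm N.

M = 10.0

def fs_threshold(alpha_2: float) -> float:
    return alpha_2 * M / (1 + 2 * alpha_2)

def norm_threshold(N: float, k2: float) -> float:
    return k2 * N / (1 + k2)

# Fehr-Schmidt: the exchange frame must lower the responder's alpha_2 itself.
print(fs_threshold(0.5), fs_threshold(0.1))                # 2.5 -> ~0.83

# Norm-based: k2 stays fixed; the frame replaces the equality norm (N = 5)
# with an equity norm prescribing a lower offer (say, N = 3).
print(norm_threshold(5.0, 1.0), norm_threshold(3.0, 1.0))  # 2.5 -> 1.5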
Figure 9.4 Entitlement and framing effects. (Approximate percentage of proposers offering $4 or more, by instructions and entitlement: divide $10/random ≈ 85%*, divide $10/contest ≈ 85%*, exchange/random ≈ 45%, exchange/contest ≈ 45%. *About 50% offer $5 for random, $4 for contest.)
In my model, the parameter N plays a vital role again. Although we need more studies about how and to what extent framing affects people's expectations and perception of what norm is being followed, it is intuitively clear that framing, like the examples mentioned above, will change the players' conceptions of what is fair. The exchange framework is likely to elicit a market script where the seller is expected to try to get as much money as possible, whereas the entitlement context has the effect of focusing subjects away from equality in favor of an equity rule. In both cases, what has been manipulated is the perception of the situation and thus the expectations of the players. An individual's sensitivity and concern for norms may be unchanged, but the relevant norm is clearly different from the usual fairness as equality rule.

Dictators With Uncertainty

In a theory of norms, the role of expectations is crucial. Norms and expectations are part of the same package. Focusing people on a norm usually means eliciting certain expectations and, in turn, when people have the right empirical and normative expectations, they will tend to follow the relevant norm. In the traditional ultimatum game, at least in Western societies, the possibility of rejection forces the proposer to focus upon what is expected of him or her.12 In the absence of information about the responder, and without a history of previous games and results as a guide, equal (or almost equal) shares become a focal point. Eliminate the possibility of rejection, and equality becomes much less compelling: For example, we know that in double-blind dictator games, 64% of the proposers keep all the money. The dictator game is particularly interesting as a testing ground for the study of how norms influence behavior, because it illustrates in a clear manner how sensitive we are to the presence, reminder, or absence of others' expectations. In such a decision context, an equal share seems much less compelling. In fact, in a dictator game there is no prima facie clear-cut behavioral rule to follow and, because of that, we can better examine the role expectations (and their manipulation) play in the emergence of a consensual script and, consequently, a social norm. An experiment conducted by Dana, Weber, & Kuang (2003) illuminates this point. The basic setting is a dictator game where the allocator has only two options. The game is played in two very different situations. Under the known condition (KC), the payoffs are unambiguous, and the allocator has to choose between option A (6,1) and option B (5,5), where the first number in the pair is the allocator's payoff, and the second number is the receiver's payoff. Under the unrevealed condition (UC), the allocator has to choose between option A (6,?) and option
B (5,?), where the receiver's payoff is 1 with probability 0.5 and 5 with probability 0.5 (Figure 9.5). Before making a choice, however, the allocator is given the option to find out privately at no cost which game is being played and thus know what the receiver's payoff is. It turns out that 74% of the subjects choose B (5,5) in KC, and 56% choose A (6,?) without revealing the actual payoff matrix in UC.

Figure 9.5 Dictator games with and without uncertainty. (Condition 1: the dictator chooses between A (6,1) and B (5,5). Condition 2: the dictator chooses between A (6,?) and B (5,?), where ? is 1 with probability 0.5 and 5 with probability 0.5, or may first reveal the true payoffs at no cost.)

This result, as Dana et al. (2003) point out, stands strongly against the Fehr–Schmidt model. If we take the revealed preference as the actual preference, choosing (5,5) in KC implies that β1 > 0.2, while choosing (6,?) without revealing in UC implies that β1 < 0.2.13 Hence, unless a reasonable story can be told about β1, the model does not fit the data. If a stable preference for fair outcomes is inconsistent with the above results, can a conditional preference for following a norm show greater consistency? Note that, if we were to assume that Ni is fixed in both experiments, a similar change of k would occur in my model, too.14 However, the norm-based model can offer a natural explanation of the data through an interpretation of Ni. In KC subjects have only two, very clear choices. There is a fair outcome (5,5) and there is an inequitable one (6,1). Choosing (6,1) entails a net loss for the receiver and only a marginal gain for the allocator. Note that the choice framework focuses subjects on fairness, though, as I mentioned before, the usual dictator game has no such focus. Dana et al.'s (2003) example evokes a related situation (one that we frequently encounter) in which we may choose to give to the poor or otherwise disadvantaged: What costs the allocator $1 gives the receiver $4 more, mimicking the multiplier effect that money has for a poor person. In this experiment, what is probably activated is a norm of beneficence, and subjects uniformly respond by choosing (5,5). Indeed, when receivers in Dana
et al.'s experiment were asked what they would choose in the allocator's role, they unanimously chose the (5,5) split as the most appropriate. A natural question to ask is whether we should hold N fixed, thus assuming a variation in people's sensitivity to the norm (k), or if instead what is changing here is the perception of the norm itself. I argue that what changes from the first to the second experiment is the perception that a norm exists and applies to the present situation, as well as expectations about other people's behavior and what their expectations about one's own behavior might be. Recall that in my definition of what it takes for a norm to be followed, a necessary condition is that a sufficient number of people expect others to follow it in the appropriate situations and believe they are expected to follow it by a sufficient number of other individuals. People will prefer to follow an existing norm conditionally upon entertaining such expectations. In KC the situation is transparent, and so are the subjects' expectations. If a subject expects others to choose (5,5) and believes he or she is expected so to choose, that subject might prefer to follow the norm (provided the subject's k, which measures one's sensitivity to N, is large enough). In UC, on the contrary, there is uncertainty as to what the receiver might be getting. To pursue the analogy with charitable giving further, in UC there is uncertainty about the multiplier ("am I giving to a needy person or not?") and thus there is the opportunity for norm evasion: The player can avoid activating the norm by not discovering the actual payoff matrix. Though there is no cost to see the payoff matrix, people will opt not to see it in order to avoid having to adhere to a norm that could potentially be disadvantageous. Thus, a person who chooses (5,5) under KC may choose (6,?) under UC with the same degree of concern for norms. Choosing to reveal looks like what moral theorists call a supererogatory action. We are not morally obliged to perform such actions, but it is awfully nice if we do. Indeed, I believe few people would expect an allocator to choose to reveal, and, similarly, few would be willing to punish an allocator who chooses to remain in a state of uncertainty. A very different situation would be one in which the allocator has a clear choice between (6,1) and (5,5) but is told that the prospective receiver does not even know he or she is playing the game. In other words, the binary choice would focus the allocator, as in the KC condition, on a norm of beneficence, but the allocator would also be cued about the absence of a crucial expectation. If the recipient does not expect the allocator to give anything, is there any reason to follow the norm? This is a good example of what I have extensively discussed in Bicchieri (2000, 2006). A norm exists, the subject knows it and knows the norm applies to the situation, but the
subject’s preference for following the norm is conditional on having certain empirical and normative expectations (see Appendix 9.1). In our example the normative expectations are missing, because the recipient does not know that a dictator game is being played or know his or her part in it. In this case, I predict that a large majority of allocators will choose (6,1) with a clear conscience. This prediction is different from what a fairness preference model would predict, but it is also at odds with theories of social norms as constraints on action. One such theory is Rabin’s (1995) model of moral constraints. Very briefly, Rabin assumes that agents maximize egoistic expected utility, subject to constraints. Thus, our allocator will seek to maximize his or her payoffs but experience disutility if the action taken is in violation of a social norm. However, if the probability of harming another is sufficiently low, a player may circumvent the norm and act more selfishly. Because in Rabin’s model, the norm functions simply as a constraint, beliefs about others’ expectations play no role in a player’s decision to act. Because the (6,1) choice does harm the recipient, Rabin’s model should predict that the number of subjects who choose (6,1) is the same as in the KC of Dana et al.’s (2003) experiment. In my model, however, the choices in the second experiment will be significantly different from the choices we have observed in Dana et al.’s KC condition. To summarize, the norm-based model explains the behavioral changes observed in the above experiments as due to a (potentially measurable) change in expectations. An individual’s propensity to follow a given norm would remain fixed, as would the individual’s preferences. However, because preferences in my model are conditional upon expectations, a change in expectations will have a major, predictable effect on behavior.
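The parameter bounds worked out in notes 13 and 14 can be checked numerically. This sketch (function names are mine) evaluates each model's utility for the KC choice and for the reveal/don't-reveal decision in UC:

def fs_u1(x1, x2, beta1):
    # The allocator is never behind in these payoff pairs, so only beta1 matters.
    return x1 - beta1 * max(x1 - x2, 0)

def norm_u1(x1, x2, k1, N=5):
    # The beneficence norm prescribes 5 for the receiver; shortfalls are discounted.
    return x1 - k1 * max(N - x2, 0)

def prefers_fair_in_kc(u1):
    return u1(5, 5) > u1(6, 1)

def prefers_not_revealing_in_uc(u1):
    not_reveal = 0.5 * u1(6, 5) + 0.5 * u1(6, 1)  # choose A without looking
    reveal = 0.5 * u1(5, 5) + 0.5 * u1(6, 5)      # look, then pick the "nice" option
    return not_reveal > reveal

for beta1 in (0.1, 0.3):   # the bound derived in note 13 is beta1 = 0.2
    u = lambda a, b: fs_u1(a, b, beta1)
    print("FS", beta1, prefers_fair_in_kc(u), prefers_not_revealing_in_uc(u))

for k1 in (0.2, 0.3):      # the bound derived in note 14 is k1 = 0.25
    u = lambda a, b: norm_u1(a, b, k1)
    print("norm", k1, prefers_fair_in_kc(u), prefers_not_revealing_in_uc(u))

# In both loops no single fixed parameter rationalizes both observed choices,
# which is why the norm-based account lets N itself (the perceived norm) change.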
Appendix 9.1: Conditions for a social norm to exist

Let R be a behavioral rule for situations of type S, where S can be represented as a mixed-motive game. We say that R is a social norm in a population P if there exists a sufficiently large subset Pcf ⊆ P such that for each individual i ∈ Pcf:

1. Contingency: i knows that a rule R exists and applies to situations of type S.
2. Conditional preference: i prefers to conform to R in situations of type S on the condition that:
a. Empirical expectations: i believes that a sufficiently large subset of P conforms to R in situations of type S, and either
b. Normative expectations: i believes that a sufficiently large subset of P expects i to conform to R in situations of type S, or
b′. Normative expectations with sanctions: i believes that a sufficiently large subset of P expects i to conform to R in situations of type S, prefers i to conform, and may sanction behavior.
A social norm R is followed by population P if there exists a sufficiently large subset Pf ⊆ Pcf such that, for each individual i ∈ Pf , conditions 2a and either 2b or 2b’ are met for i, and, as a result, i prefers to conform to R in situations of type S.
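For readers who find such definitions easier to scan as code, here is a schematic rendering of the conditions; the boolean fields compress the "sufficiently large subset" quantifiers, which the definition leaves as primitives, so this is a sketch of the logical structure only.

from dataclasses import dataclass

@dataclass
class Individual:
    knows_rule_applies: bool                 # 1. Contingency
    empirical_expectations: bool             # 2a. believes enough others conform
    normative_expectations: bool             # 2b. believes enough others expect conformity
    normative_with_sanctions: bool = False   # 2b'. ...and they may sanction deviations

def prefers_to_conform(i: Individual) -> bool:
    # Conditional preference: condition 1, condition 2a, and (2b or 2b').
    return (i.knows_rule_applies
            and i.empirical_expectations
            and (i.normative_expectations or i.normative_with_sanctions))

population = [Individual(True, True, True), Individual(True, False, True)]
print(sum(prefers_to_conform(i) for i in population))  # 1: only the first conforms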
References

Andreoni, J. (1990). Impure altruism and donations to public goods: A theory of warm-glow giving. Economic Journal, 100, 464–477.
Bicchieri, C. (2000). Words and deeds: A focus theory of norms. In J. Nida-Rumelin & W. Spohn (Eds.), Rationality, rules and structure. Dordrecht, The Netherlands: Kluwer Academic.
Bicchieri, C. (2006). The grammar of society: The nature and dynamics of social norms. Cambridge, UK: Cambridge University Press.
Blount, S. (1995). When social outcomes aren't fair: The effect of causal attributions on preferences. Organizational Behavior and Human Decision Processes, 63, 131–144.
Camerer, C. (2003). Behavioral game theory: Experiments on strategic interaction. Princeton, NJ: Princeton University Press.
Camerer, C., Loewenstein, G., & Rabin, M. (2004). Advances in behavioral economics. Princeton, NJ: Princeton University Press.
Camerer, C., & Thaler, R. H. (1995). Anomalies: Ultimatums, dictators, and manners. Journal of Economic Perspectives, 9(2), 209–219.
Cameron, L. (1995). Raising the stakes in the ultimatum game: Experimental evidence from Indonesia. Princeton University, Department of Economics, Industrial Relations Section. Working Paper, 345.
Cialdini, R., Kallgren, C., et al. (1990). A focus theory of normative conduct: A theoretical refinement and reevaluation of the role of norms in human behavior. Advances in Experimental Social Psychology, 24, 201–234.
Dana, J., Weber, R., & Kuang, J. X. (2003). Exploiting moral wriggle room: Behavior inconsistent with a preference for fair outcomes. Carnegie Mellon Behavioral Decision Research. Working Paper, 349.
Falk, A., Fehr, E., & Fischbacher, U. (2003). Testing theories of fairness—Intentions matter. Institute for Empirical Research in Economics, University of Zürich. Working Paper, 63.
Fehr, E., & Gachter, S. (2000). Fairness and retaliation: The economics of reciprocity. Journal of Economic Perspectives, 14(3), 159–181.
Fehr, E., & Schmidt, K. (1999). A theory of fairness, competition, and cooperation. Quarterly Journal of Economics, 114, 817–868.
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). Fairness in simple bargaining experiments. Games and Economic Behavior, 6, 347–369.
Frey, B., & Bohnet, I. (1995). Institutions affect fairness: Experimental investigations. Journal of Institutional and Theoretical Economics, 151(2), 286–303.
Frey, B., & Bohnet, I. (1997). Identification in democratic society. Journal of Socio-Economics, 26, 25–38.
Guth, W. (1995). On ultimatum bargaining experiments: A personal review. Journal of Economic Behavior and Organization, 27, 329–344.
Guth, W., Schmittberger, R., & Schwarze, B. (1982). An experimental analysis of ultimatum games. Journal of Economic Behavior and Organization, 3, 367–388.
Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., & Gintis, H. (Eds.). (2004). Foundations of human sociality: Ethnography and experiments in 15 small-scale societies. Oxford, UK: Oxford University Press.
Hoffman, E., McCabe, K. A., Shachat, K., & Smith, V. (1994). Preferences, property rights, and anonymity in bargaining games. Games and Economic Behavior, 7, 346–380.
Hoffman, E., McCabe, K. A., & Smith, V. (1998). Behavioral foundations of reciprocity: Experimental economics and evolutionary psychology. Economic Inquiry, 36, 335–352.
Hoffman, E., & Spitzer, M. (1985). Entitlements, rights, and fairness: An experimental examination of subjects' concept of distributive justice. Journal of Legal Studies, 14(2), 259–297.
Kagel, J. H., Kim, C., & Moser, D. (1996). Fairness in ultimatum games with asymmetric information and asymmetric payoffs. Games and Economic Behavior, 13, 100–110.
Rabin, M. (1995). Moral preferences, moral constraints, and self-serving biases. University of California at Berkeley, Department of Economics. Working Paper, 95-241.
Roth, A. E., Prasnikar, V., Okuno-Fujiwara, M., & Zamir, S. (1991). Bargaining and market behavior in Jerusalem, Ljubljana, Pittsburgh, and Tokyo: An experimental study. American Economic Review, 81, 1068–1095.
Notes

1. The conditions for following a norm are formally described in Chapter 1 of Bicchieri, 2006, and in Appendix 1 here.
2. Güth et al. (1982) were the first to observe that the most common offer by proposers was to give half of the sum to the responder. The mean offer was 37% of the original allocation. In a replication of their experiments, they allowed subjects to think about their decision for one week. The mean offer was 32% of the sum, which is still very high.
3. The term max(xj − xi, 0) denotes the maximum of xj − xi and 0; it measures the extent to which there is disadvantageous inequality between i and j.
4. This is a separability of utility assumption: What matters to a player in a game is the payoff at a terminal node. The way in which that node was reached and the possible alternative paths that were not taken are irrelevant to an assessment of the player’s utility at that node. Utilities of terminal node payoffs are thus separable from the path through the tree and from payoffs on unchosen branches.
5. Cues that activate, or bring to mind, a norm may involve a direct statement or reminder of the norm, observing others’ behavior, similarity of the present situation to others in which the norm was used, as well as how often or how recently one has used the norm.
6. I discuss these results and the relevant literature in Bicchieri, 2006, Chapter 4.
7. In some groups, rejections were extremely rare, even when offers were very low, whereas in other groups “hyper-fair” offers were frequently rejected, pointing to very different (but intraculturally shared) interpretations of the experimental situation.
8. Intuitively, N2 should proscribe rejection of fair (or hyper-fair) offers. The incorporation of this consideration, however, will not make a difference in the formal analysis.
9. It is important to note that since norms are very dependent on expectations, informational asymmetries will almost certainly affect norm-following behaviors.
10. Note that 30% of the subjects proposed (8,2) when the alternative was (5,5), 70% proposed (8,2) when the alternative was (2,8), and 100% proposed (8,2) when the alternative was (10,0). Each player played four games, presented in random order, in the same role.
11. Rejections remained low throughout, about 10%. All rejections were of offers of $2 or $3 under the exchange instructions. There were no rejections in the contest entitlement/divide $10 condition, and there was 5% rejection of the $3 and $4 offers in the random assignment/divide $10 condition.
12. I do not want to imply that sanctions are crucial to norm following. They may just reinforce a tendency to obey the norm and serve the function—together with several other indicators—of focusing individuals’ attention on the particular norm that applies to the situation.
13. In KC, choosing option B implies that U1(5,5) > U1(6,1), or 5 − α1·0 > 6 − β1·5. Hence, 5 > 6 − 5β1, and therefore β1 > 0.2. In UC, not revealing and choosing option A implies that U1(6, (.5(5), .5(1))) > U1(.5(5,5), .5(6,5)), since revealing will lead to one of the two “nice” choices with equal probability. We thus get 6 − 3β1 > 2.5 + .5(6 − β1), which implies that β1 < 0.2.
14. According to my model, if we keep Ni constant, choosing option B in KC means that U1(5,5) > U1(6,1), and hence 5 > 6 − 4k1. It follows that k1 > 0.25. In UC, not revealing and choosing option A implies that U1(6, (.5(5), .5(1))) > U1(.5(5,5), .5(6,5)), and hence 6 − 2k1 > 5.5, which implies that k1 < 0.25.
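The threshold algebra in notes 13 and 14 can be checked numerically. The following is a minimal sketch in Python, assuming the Fehr and Schmidt (1999) utility form Ui = xi − αi·max(xj − xi, 0) − βi·max(xi − xj, 0); the payoffs and the .5 probabilities come from the notes, while the function names and sample parameter values are illustrative.

```python
# Numerical check of notes 13 and 14. Only the payoffs and probabilities
# come from the text; names and sample parameter values are illustrative.

def fs_utility(own, other, alpha, beta):
    """Fehr-Schmidt utility for the two-player outcome (own, other)."""
    return own - alpha * max(other - own, 0) - beta * max(own - other, 0)

def uc_not_reveal(beta):
    # Keep 6 without revealing; the other gets 5 or 1 with p = .5 each,
    # so expected utility is 6 - 3*beta.
    return 0.5 * fs_utility(6, 5, 0, beta) + 0.5 * fs_utility(6, 1, 0, beta)

def uc_reveal(beta):
    # Revealing leads to (5,5) or (6,5) with equal probability,
    # i.e., 2.5 + .5*(6 - beta).
    return 0.5 * fs_utility(5, 5, 0, beta) + 0.5 * fs_utility(6, 5, 0, beta)

for beta in (0.1, 0.3):
    # Note 13, KC: B = (5,5) beats A = (6,1) exactly when 5 > 6 - 5*beta,
    # i.e., beta > 0.2 -- False at 0.1, True at 0.3.
    print(fs_utility(5, 5, 0, beta) > fs_utility(6, 1, 0, beta))
    # Note 13, UC: not revealing beats revealing exactly when beta < 0.2
    # -- True at 0.1, False at 0.3. Note 14's k1 behaves the same way
    # around its threshold of 0.25.
    print(uc_not_reveal(beta) > uc_reveal(beta))
```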
10
Depersonalized Trust and Ingroup Cooperation
Marilynn B. Brewer
Ohio State University
Twenty years ago, in a chapter entitled “Ethnocentrism and Its Role in Interpersonal Trust” (Brewer, 1986), I proposed that one function of ingroup formation and ingroup favoritism is providing a solution to the dilemma of social cooperation and trust. The basic argument was that distrust (and noncooperation) dominates trust and cooperation in a standard social dilemma situation unless cooperation can be made contingent on reciprocal cooperation on the part of others. If one’s own cooperation alters the probability that others will cooperate in turn, then contingent cooperation dominates over noncooperation and joint welfare can be maximized.1 This is, of course, the basic logic of reciprocal altruism (Trivers, 1971) and the success of the tit-for-tat strategy in repeated prisoner’s dilemma games (Axelrod, 1984). Reciprocal cooperation, however, has limited utility as a solution to cooperation dilemmas in relatively large groups if it depends on personal knowledge of each of the other participants, such as a history of interpersonal exchange and future cooperation on an individualized basis. On the other hand, cooperation that is contingent on common membership in a bounded social group bypasses the need for such personalized knowledge or the costs of negotiating reciprocity with individual others. If one’s social ingroups are effectively bounded communities of obligatory cooperation,2 this affords a kind of depersonalized trust based on group membership (social identity) alone. All that
is required for group-based trust and cooperation is (1) the mutual knowledge that oneself and the other share a common ingroup membership and (2) the expectation that the other will act in terms of that group membership in dealings with a fellow group member (and vice versa). In effect, one’s own and the other’s behavior are perceived to be constrained by the requirements of group membership and the desire to retain one’s status as an accepted group member. As a consequence, within ingroups the probability of reciprocal cooperation is presumed, a priori, to be high, in contrast to intergroup exchanges where trust and cooperation are contingent on personalized knowledge of the other or a history of negotiated reciprocity. In the 20 years since the publication of the chapter on group-based trust, the basic hypothesis that mere knowledge of shared ingroup membership is sufficient to engender cooperative behavior under conditions of risk and uncertainty has been supported by results of numerous experiments in the social dilemmas literature. Making salient a shared social identity has been demonstrated to increase cooperative behavior between strangers in dyadic exchanges such as prisoner’s dilemma games (e.g., Dion, 1973; Miller, Downs, & Prentice, 1998) and investment games (e.g., Buchan, Croson, & Dawes, 2002; Tanis & Postmes, 2005) and in collective decision-making situations such as resource dilemmas (e.g., Brewer & Kramer, 1986; Brewer & Schneider, 1990; Kramer & Goldman, 1995; Wit & Wilke, 1992) and public goods dilemmas (e.g., De Cremer & Van Vugt, 1999; Wit & Kerr, 2002). Although the finding that mere knowledge of shared group identity increases cooperative behavior is well established, the psychological mechanisms that underlie ingroup cooperation are less clear. In the sections that follow, I will first discuss the evidence for depersonalized trust (i.e., the general expectancy that others will be cooperative within the ingroup) and some of the reasons why shared group membership might engender such expectancies. After that, I will discuss some alternative mechanisms that might promote intragroup cooperation, independent of expectations of reciprocity.
Ingroup Expectancies and Trust

The term depersonalized trust was intended to refer to a willingness to engage in cooperative behavior at some risk to self-interest under the assumption that others in the social exchange will also choose to cooperate, when that assumption is not based on personal knowledge of the motives or intents of the other individuals or any existing interpersonal relationships with them. Of interest here is depersonalized trust that
is based on knowledge of social category or social group membership and the expectations that are associated with that group membership. One form of group-based depersonalized trust would be expectancies based on social stereotypes. If the stereotype of a particular social category includes traits such as honesty and trustworthiness (e.g., pastors, Red Cross volunteers, nurses), then category membership may substitute for personal knowledge about the individual category member to engender expectancies of reciprocal cooperation. Although expectancies based on group (as opposed to personal) reputation may underlie some instances of depersonalized trust, such stereotypes are probably not very widespread. A second form of group-based depersonalized trust is activated by knowledge that the others in a social exchange are members of one’s own social group or social category—independent of the content of stereotypes that may be associated with that group. Ingroup trust is the expectation that others will cooperate with me because we are members of the same group (Kramer & Wei, 1999). The findings from a recent experiment by Tanis and Postmes (2005) provide a good illustration of the role of ingroup expectancies in trust behavior. In this experiment, participants were faced with the decision whether to invest all or part of their monetary payment by transferring it to a randomly assigned partner. Any amount invested would be tripled in value, and then the partner would decide how much (if any) of the total funds to transfer back to the participant. This investment game is frequently referred to as the trust game (cf. Berg, Dickhaut, & McCabe, 1995) because the decision to send money to the partner is assumed to depend on one’s expectations that the other will reciprocate cooperation by sharing the largesse to a reasonable degree.3 In one condition of the experiment, the only information the participants had about their randomly selected partner was his or her group membership, which identified the other either as a member of the participant’s ingroup university or an outgroup university. The rate of trusting behavior (sending money) was significantly higher when the otherwise unknown partner was an ingroup member (66.7%) than when he or she was an outgroup member (41.4%). Correspondingly, ratings of expected reciprocity were significantly higher for ingroup partners than for outgroup partners, and the effect of group membership on trusting behavior was fully mediated by this expectancy difference. Interestingly, however, there was no difference between ingroup and outgroup partners in how they were rated on a measure of perceived trustworthiness. Clearly, the expectancy that ingroup members could be counted on to reciprocate cooperative sharing was not based on any character
attributions of trustworthiness to all members of the ingroup. As the authors put it, “It seems that this difference in trusting behavior and expected reciprocity across groups is related to norms of reciprocity being strong in intra-group interactions, and weaker or absent across group boundaries” (Tanis & Postmes, 2005, p. 422). In other words, participants were more trusting of ingroup members than of outgroup members because only the former were expected to adhere to principles of mutual reciprocity, not because they saw them as inherently more trustworthy people. By contrast, when personalized information about the randomly selected partner was made available, the ingroup–outgroup differential in expected reciprocity and trusting behavior was eliminated, and trusting behavior was mediated by perceptions of the trustworthiness of the other person as an individual. These findings from Tanis and Postmes (2005) underscore the distinction between person-based and group-based (or identity-based) expectations of cooperation. Expectations that ingroup members will be cooperative with oneself do not depend on assumptions about the individual group member’s general propensity to be trustworthy. Rather, it is the assumption that the other will be cooperative because they perceive the situation as an intragroup exchange. This assumption of ingroup cooperation is particularly interesting in situations (such as the trust game) where the participant has no opportunity for future reciprocation (positive or negative) of the other partner’s cooperative (or noncooperative) behavior. Since the partners are unknown to each other, the possibility of future reciprocation or sanction is precluded, yet the expectation that an ingroup member will reward my cooperation is quite high. This suggests that the mechanism for ingroup trust and cooperation lies in group-level processes rather than interpersonal exchange. One basis for expecting ingroup members to behave in a trustworthy manner in accord with ingroup norms is the prospect of sanction from other group members. In my earlier chapter (Brewer, 1986) I speculated that group membership provides a mechanism for increasing the perceived probability of sanctions against failures to reciprocate trust. If defection is regarded as a violation of group norms—as disloyalty to the collective rather than victimization of an individual—it may be detected and punished by any member of the group, not just the individual who has been betrayed. Thus, the expectation that ingroup members can be trusted to cooperate may be based, in part, on the general expectation that “bad” ingroup members will ultimately be caught and sanctioned by the group as a whole. The idea that any group member may take on the role of sanctioning untrustworthiness, even when the exercise of sanctioning is costly to the self, is known as “altruistic punishment”
(Fehr & Gächter, 2002) and is an interesting form of ingroup cooperation in its own right. The assumption that ingroup norm violation will be subject to punishment is also an important element in the link between trust in others’ cooperative behavior and one’s own willingness to cooperate. Unlike the investment game, many social dilemmas (e.g., resource dilemmas, public goods contribution dilemmas) involve a decision whether or not to cooperate with the group as a whole when one’s own cooperative choice does not directly influence the cooperation of others. Under these circumstances, expecting that others will behave cooperatively (i.e., contribute to the public good or restrain consumption of a shared resource) reduces the fear that one’s own cooperation will be wasted (i.e., the “sucker’s payoff”). However, it does not eliminate the self-interested benefit of noncooperation. If everyone else can be expected to cooperate, then noncooperation takes advantage of the others’ contributions to the group welfare and maximizes personal outcomes. Thus, group-based depersonalized trust translates to cooperative behavior only if the individual’s own behavior is constrained by the same group norms that underlie his or her expectations of the others’ behavior. Ingroup-based depersonalized trust solves the cooperation dilemma only if it involves contingent cooperation—contingent not only on group membership per se, but on the perception that the other is a group member in good standing. In the absence of information to the contrary, any member of the ingroup may, by default, be counted on for trustworthy behavior and reciprocal cooperation. However, if a group member has been known to violate these principles of ingroup behavior, trust and cooperation should be withheld, even if the individual is still technically a member of the ingroup. Interestingly, there is evidence that noncontingent cooperation within ingroups is negatively sanctioned (just as failure to reciprocate cooperation is). Individuals who were cooperative with another group member who had a known history of failure to cooperate with other ingroup members were evaluated negatively, and participants were less willing to give resources to such a person than to someone who cooperated only with other cooperators (Mashima & Takahashi, 2005).
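Since the investment game plays a central role in this section, a minimal sketch of its payoff logic may be helpful. Only the tripling of the transferred amount comes from the description above; the endowment of 10, the function name, and the sample decisions are hypothetical.

```python
# A minimal sketch of the investment ("trust") game described above.
# Only the tripling rule comes from the text; the rest is illustrative.

def trust_game(endowment, sent, return_fraction, multiplier=3):
    """Return (sender_payoff, responder_payoff) for one play."""
    pot = sent * multiplier            # the transferred amount is tripled
    returned = return_fraction * pot   # the responder's back-transfer
    return endowment - sent + returned, pot - returned

# Trusting pays only if the responder returns more than a third of the pot:
print(trust_game(10, 10, 0.5))  # (15.0, 15.0): reciprocated trust
print(trust_game(10, 10, 0.0))  # (0.0, 30.0):  exploited trust
print(trust_game(10, 0, 0.5))   # (10.0, 0.0):  trust withheld
```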
Is Trust Necessary for Ingroup Cooperation?

The relationship between trust that others will cooperate and one’s own cooperation has been well established (e.g., Bruins, Liebrand, & Wilke, 1989; De Cremer, Dewitte, & Snyder, 2001; Kerr, 1983; Kramer & Wei, 1999; Parks, Henager, & Scamahorn, 1996; Pruitt & Kimmel,
1977; Rapoport & Eshed Levy, 1989; Schnake, 1991; Yamagishi, 1986). Whatever the basis for the expectations that others will cooperate, such trust reduces the fear of the sucker’s payoff—that one’s own cooperation would be taken advantage of. Thus, it is tempting to believe that if ingroup membership increases trust (i.e., expectations that other ingroup members will be cooperative), this is sufficient to account for the effect of ingroup identity on intragroup cooperative behavior. As hinted at earlier, however, expectations of others’ intentions to cooperate are not of themselves sufficient to generate cooperative behavior. Ingroup trust can be exploited, particularly under conditions of anonymity and diffusion of responsibility. When players are unknown to each other and there is no likelihood of future interaction, it is not rational to anticipate that one would be sanctioned by the group for failure to cooperate even with fellow ingroup members. Indeed, experiments by Yamagishi (1988) found that when the possibility of a sanctioning system is explicitly removed, cooperation in social dilemma situations is significantly reduced in Japan, and Japanese participants become less cooperative than American participants. Yamagishi argues that the dependence on sanctioning systems as a basis for trust is particularly high in Japan and thus decreases trust in others in a situation where a sanctioning system is absent. In that sense, cooperation based on trust in others’ intentions is a vulnerable strategy for inducing and sustaining ingroup cooperation. Anything that undermines the implicit basis for trust also undermines the motivation to cooperate (Mulder, van Dijk, De Cremer, & Wilke, 2006). Further, there is reason to question whether expectations about others’ behavior are actually the cause of cooperative decisions. The evidence for a relationship between trust (operationalized as expectancies of cooperation from others) and own cooperation is almost entirely correlational, with measures of expectations usually assessed after the participant has already made his or her own decision about contributing or sharing. As Robyn Dawes has repeatedly pointed out (e.g., Dawes, McTavish, & Shaklee, 1977), decisions may cause expectations, rather than the other way around. After the fact, individuals may project their own choices onto fellow group members or use expectations to justify their previous actions. Clearly, then, trust in fellow ingroup members that is predicated on sanctioning violations of intragroup cooperative norms is not sufficient to sustain large-group cooperation. Something more than norm-based expectancies about others’ behavior is needed to motivate cooperation within ingroups. At least three candidates for that something extra—additional mechanisms that underlie ingroup cooperative
behavior—have been proposed: (1) heuristic decision making, (2) egoistic projection, and (3) social identification and goal transformation.

Ingroup Cooperation as Heuristic Thinking

Recently, psychologists and economists have begun to explore the role of heuristic decision making in social dilemmas (e.g., Kiyonari, Tanida, & Yamagishi, 2000; Messick, 1999; Weber, Kopelman, & Messick, 2004). A heuristic is a rule of thumb that guides situated decision making without the necessity of elaborate rational calculus. Heuristics themselves may be the product of adaptive strategies (cf. Gigerenzer, 2000), but as decision-making tools they bypass the explicit use of deliberative, rational cost–benefit analysis. The simple rule, if-ingroup-then-cooperate, is such a heuristic. The heuristic may have its origins in early learning about group norms and the long-term adaptive value of intragroup sharing and cooperation, but in the immediate situation none of those considerations may be involved. Salience of the cognitive representation of a shared ingroup is a cue to activation of a cooperative “script.” Another way to think about the heuristics associated with ingroup cooperation is that the salience of ingroup membership reframes the decision-making situation to one of coordination rather than competition (or, as Liebrand et al. [1986] put it, as an issue of “morality” rather than “might”). As Messick (1999) has argued, a critical proximal determinant of how people behave in a social dilemma is how they define the situation they face. Individuals have prototypes of different kinds of situations and associated rules for what constitutes appropriate behavior in those contexts. Faced with a new decision-making situation, elements of the current context are matched to prototypes to determine the way the present situation is defined or framed. Depending on what prototype is invoked by situational cues, the same collective choice dilemma could alternatively be defined as a competitive game, as a risk-taking venture, or as a cooperative coordination problem. Without any differences in the actual structure of the dilemma, differences in framing can lead to dramatic differences in propensity to cooperate. Salient shared group identity may serve as a particularly powerful cue to defining collective choice situations as cooperative exchanges.

Projection of the Self to the Ingroup

When making judgments or estimates of others’ behaviors, opinions, or intentions, the idea that individuals project their own attitudes and beliefs onto others is widely documented. Variously known as false consensus (Ross, Greene, & House, 1977), social projection (Krueger & Acevedo, 2005), or self-anchoring (Otten, 2002), the general
phenomenon is the assumption that people like me will do what I do, and the consequence is often (though not invariably) an overestimation of the prevalence of one’s own attitudes or behaviors. The “people like me” part of this projection process is an important qualification. Numerous studies have demonstrated that individuals are significantly more likely to project their own characteristics to others who are similar or to ingroups rather than outgroups (Clement & Krueger, 2002; Robbins & Krueger, 2005). Thus, projection of the self onto others may best be characterized as an intragroup phenomenon, made more likely when a shared ingroup has been made salient. Projection of one’s own characteristics to the ingroup as a whole has been demonstrated even when the ingroup is a novel social category (Gramzow, Gaertner, & Sedikides, 2001; Otten & Wentura, 2001). The less that is known about the properties or characteristics of the group, the more likely one is to fall back on the self as the only available source of information and to assume that others in the group are the same as oneself (Dawes, 1989). This projection of the self onto the ingroup has been proposed as one mechanism underlying the phenomenon of ingroup bias (e.g., Cadinu & Rothbart, 1996; Otten, 2002). Assuming that most individuals have generally positive views of the self (high self-esteem), attributing one’s own characteristics to others in the ingroup will be biased in the direction of positive traits and behaviors, producing a general positivity in thinking about ingroups that is not extended to outgroups. This relationship between personal self-esteem and ingroup positivity was demonstrated in an experiment by Gramzow and Gaertner (2005), who found that high self-esteem predicted increased favoritism toward a novel ingroup relative to an objectively similar outgroup. This projected positivity of one’s own self-esteem to a novel ingroup has even been shown to be automatic, as demonstrated by implicit evaluation measures (Farnham, 1999; Otten & Wentura, 1999). Ingroup projection can help account for ingroup cooperation if we assume that most people (as social animals) are motivated to cooperate if others also cooperate (i.e., contingent altruism). To the extent that individuals project their own motivation to others in the situation, their willingness to risk making the cooperative choice themselves is increased. Since self-projection is more probable when the others are known to be members of a common ingroup, projection-based cooperation is more likely within an ingroup collective than in the presence of outgroup members or other persons whose group membership is unknown. Given these assumptions, self-projection processes may underlie expectations about ingroup members’ cooperative intent even in the absence of any assumptions about group sanctions for
noncooperation. The problem with this explanation is that it rests on the assumption that most people prefer mutual cooperation (at least among ingroup members) to free-riding (Fehr & Schmidt, 1999), a motive that may itself be dependent on other factors such as social identity and level of identification with the ingroup.

Ingroup Cooperation as Goal Transformation

Willingness to initiate cooperation within ingroups is vital to group survival, so much so that it is likely that multiple, redundant mechanisms have evolved to support and sustain cooperative orientation among members of a bounded social group. As I have already suggested, cooperation based solely on expectations that other group members will simultaneously or sequentially also cooperate (whether based on group norms, anticipated sanctions, or self-projection) is a vulnerable strategy. If there is error in the system, too many free-riders at a particular time, or failure to sanction noncooperation, the depersonalized trust underlying willingness to cooperate may collapse and would be very difficult to reestablish. What is needed are additional mechanisms that are more robust and able to tolerate some imperfection in the system of mutual cooperation. The psychological process of group identification, as elaborated in social identity theory (Tajfel, 1981; Tajfel & Turner, 1979), provides a basis for intragroup cooperation that does not necessarily rely on interpersonal trust in fellow group members. When individuals attach their sense of self to their group membership, they see themselves as interchangeable components of a larger social unit (Turner, Hogg, Oakes, Reicher, & Wetherell, 1987). The consequence of such social identification is not only affective attachment to the group as a whole, but also a shift of motives and values from self-interest to group interest and concern for the welfare of fellow group members. As a result of this redefinition of the self, pursuing the group’s interest becomes a direct and natural expression of self-interest, that is, collective and personal interest are interchangeable. When the definition of self changes, the meaning of self-interest and self-serving motivations also changes accordingly (Brewer, 1991). Group identity involves a transformation of goals from the personal to the collective level (De Cremer & Van Vugt, 1999; Kramer & Brewer, 1986). Goal transformation provides a basis for ingroup cooperation that does not depend directly on expectations that others in the group will reciprocate cooperation. When social identification is strong, contributing to the group welfare is an end in itself, independent of what benefits ultimately accrue to the self. De Cremer and Van Dijk (2002) tested
this idea directly in the context of a public goods dilemma. Participants in their experiment were divided into groups of seven and were each given 300 cents and told that they were free to contribute any amount between 0 and 300 cents to a collective pot. The total amount contributed by the group would be multiplied by two and then would be split equally among all members, with the proviso that the group as a whole had to contribute at least 1,050 cents. If this amount was not reached, any money in the collective pot would be forfeited. (This step-level payoff structure is sketched at the end of this section.) Participants then made their first contribution decisions, following which half of the groups were told that their group had succeeded in making the criterion level of contributions, and the other half of participants were told that their group had failed to meet the criterion. Then all participants made a second round of contribution decisions. Prior to the choice dilemma, level of social identification with the group was manipulated by telling participants either that the study was about individual decision making (weak social identity salience) or that it was about group decision making and that the performance of different groups would be compared (strong social identity salience). Consistent with the findings from previous studies, participants in the strong social identity condition contributed more in the first round than did participants in the weak identification groups. More importantly, failure motivated strong social identity participants to contribute significantly more after feedback than before, whereas weak group identifiers contributed even less after feedback than before. As the authors argue, when identification with the ingroup is weak, group failure is an indication that other members of the group cannot be expected to contribute, and hence one’s own motivation to contribute is undermined. When group identification is strong, however, participants interpret negative group feedback as a signal that their group is in need and as such they should try harder at achieving their group goals. Consistent with the goal-transformation hypothesis, strong group identifiers should exhibit a genuine concern for the group’s welfare, and as such, negative group feedback may be interpreted as a threat to the group’s welfare and a signal that behavioral changes are required, motivating them to cooperate more (see Brewer & Schneider, 1990, for similar findings). Social identity–mediated cooperation is particularly important for large groups facing resource and public goods dilemma problems. In the absence of close monitoring and sanctioning of noncooperation, some basis for intrinsic motivation to cooperate and contribute to the group welfare is essential. Although continued contribution to failing
groups over a long period would not be adaptive, some willingness to sustain (or increase) contributions under conditions of uncertainty about outcomes or others’ behavior may be necessary for maintaining large groups as cooperative communities. Thus, while the formation of bounded ingroups may solve one part of the dilemma of when (and with whom) to cooperate, the capacity for social identification with large social ingroups may have been a necessary additional adaptation to sustain human sociality (Brewer & Caporael, 2006).
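The step-level public goods game used by De Cremer and Van Dijk (2002), described above, is compact enough to sketch. The parameters (seven members, 300-cent endowments, doubling of the pot, 1,050-cent threshold) come from the text; the function name and the sample contribution vectors are hypothetical.

```python
# A minimal sketch of the step-level public goods game described above.
# Parameters follow the text; the sample contributions are hypothetical.

def payoffs(contributions, endowment=300, multiplier=2, threshold=1050):
    """Return each member's payoff (in cents) for one round."""
    pot = sum(contributions)
    if pot < threshold:
        share = 0  # below the criterion, the collective pot is forfeited
    else:
        share = multiplier * pot / len(contributions)
    # Each cent contributed returns only 2/7 of a cent to the contributor,
    # so contributing is individually costly even above the threshold.
    return [endowment - c + share for c in contributions]

print(payoffs([150] * 7))        # threshold just met: 450.0 cents each
print(payoffs([0] + [175] * 6))  # a free-rider nets 600.0; the rest, 425.0
print(payoffs([100] * 7))        # threshold missed: everyone keeps 200
```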
Are Outgroups Necessary for Ingroup Cooperation?

According to some versions of evolutionary psychology, human grouping can only be explained in terms of coalitions of self-interested cooperators who band together to dominate and exploit other groups (Alexander, 1979; Kurzban & Leary, 2001). In effect, from this perspective, ingroups emerge from intergroup conflict, and the idea that intergroup conflict or competition gives rise to intragroup cooperation is widespread in the social sciences (e.g., Sherif, 1966; Sumner, 1906). The primary theories of social identity (Tajfel, 1981; Turner, 1975; Turner et al., 1987) postulate that social identity is activated in contexts where intergroup comparison (and the presence of outgroups) is salient. Consistent with this idea, many social psychological laboratory experiments manipulate social identity in groups by introducing intergroup competition, which is usually quite effective in elevating levels of reported identification with participants’ own groups (e.g., De Cremer & Van Vugt, 1999; De Cremer & Van Dijk, 2002). In contrast to this view, other experiments have found that merely making a common group identity salient (in the absence of any explicit outgroups or intergroup comparison) is sufficient to increase intragroup cooperation (e.g., Brewer & Kramer, 1986; Wit & Kerr, 2002). Indeed, results from both laboratory experiments and field studies indicate that variations in ingroup positivity and social identification do not systematically correlate with degree of bias or negativity toward outgroups (Brewer, 1979; Hinkle & Brown, 1990). For example, in a study of the reciprocal attitudes among 30 ethnic groups in East Africa, Brewer and Campbell (1976) found that almost all of the groups exhibited systematic differential positive evaluation of the ingroup over all outgroups on dimensions such as trustworthiness, obedience, friendliness, and honesty. However, the correlation between degree of positive ingroup
regard and social distance toward outgroups was essentially .00 across the 30 groups (Brewer & Campbell, 1976, p. 85). Since ingroup–outgroup distinctions do not always involve intense (or even mild) competition or conflict over scarce resources, there is need for a theory of the evolution of social groups that does not depend on intergroup conflict per se. Such a theory starts from the recognition that group living represents the fundamental survival strategy that characterizes the human species. In the course of our evolutionary history, humans abandoned most of the physical characteristics and instincts that make possible survival and reproduction as isolated individuals or pairs of individuals, in favor of other advantages that require cooperative interdependence with others to survive in a broad range of physical environments. In other words, as a species we have evolved to rely on cooperation rather than strength and on social learning rather than instinct as basic adaptations. With coordinated group living as the primary survival strategy of the species, the social group, in effect, provided a buffer between the individual organism and the exigencies of the physical environment. Thus, psychological mechanisms that promote and sustain intragroup cooperation evolved from the structural requirements of group living rather than intergroup competition. In light of both paleoanthropological and archaeological evidence, it makes little sense to see conflict as the source of ingroup formation. There is no reason to believe that early hominids lived under dense population conditions in which bands of people lived in close proximity with competition over local resources. It is estimated that group living was well established as early as 2.5 million years ago by our human ancestors, and complex sociality evolved early among our primate ancestors (Foley, 1996). As Alexander (1989) admits, there is no evidence of intergroup conflict in early human evolutionary history. Given the costs of intergroup fighting combined with low population density, group flight rather than fight would seem to be the strategy of choice for our distant ancestors.
Summary

The basic argument in this chapter is that willingness to cooperate as a default strategy in social exchanges with ingroup members was essential to the evolution of interdependent group living. As a consequence, multiple partially redundant mechanisms underlie and sustain cooperation that is premised on shared group membership alone, in the absence of personal knowledge or a history of exchange with specific individuals. Basically, two paths to ingroup cooperation have been postulated.
One relies on depersonalized trust—the expectation that others in the group will cooperate because we share group membership—supported by established cooperative ingroup norms and potential sanctions for noncooperation. The second path is based on social identity processes and associated goal transformation and projection of one’s own cooperative (group-benefiting) motives to fellow group members. No doubt, successful groups of all types rely to some extent on both paths to generate and sustain intragroup cooperation, but different group characteristics may necessitate more or less reliance on trust versus identification as primary mechanisms. Relatively small social groups (e.g., villages) or groups composed of dense networks of interpersonal relationships (e.g., extended kin groups or some collectivist cultures such as Japan) may be characterized as communities of “generalized reciprocity” (Yamagishi, Jin, & Kiyonari, 1999), with cooperation based on mutual trust backed by social sanctions. As groups get larger and more differentiated (e.g., large organizations, nations, etc.), the basis for generalized reciprocity and depersonalized trust may be relatively weak (or heuristic only), and maintenance of intragroup cooperation may depend more on members’ current (and perhaps fluctuating) levels of ingroup identification and consequent concern for group welfare. Technically, the processes associated with ingroup identification (social identity) could be extended to symbolic ingroups of larger and larger size, so long as the group had some bounded identity and basis for mutual recognition of fellow group members. One might even conceive of a global civil society (e.g., Keane, 2003) of identity-based cooperation. However, countervailing forces make strong social identification with ever larger and more inclusive groups relatively unlikely (see Brewer, 1991). Highly inclusive groups are invariably complex and characterized by differentiation into subgroups that (being smaller and more distinctive) elicit stronger levels of social identification than the superordinate group in which they are embedded. Prospects for global interdependence may ultimately depend on development of mechanisms for intergroup cooperation that parallel depersonalized and identity-based trust at the individual level.
References

Alexander, R. D. (1979). Darwinism and human affairs. Seattle: University of Washington Press.
Alexander, R. D. (1989). Evolution of the human psyche. In P. Mellars & C. Stringer (Eds.), The human revolution (pp. 455–513). Princeton, NJ: Princeton University Press.
Axelrod, R. (1984). The evolution of cooperation. New York: Basic Books.
Berg, J., Dickhaut, J. W., & McCabe, K. A. (1995). Trust, reciprocity, and social history. Games and Economic Behavior, 10, 122–142.
Brewer, M. B. (1979). In-group bias in the minimal intergroup situation: A cognitive motivational analysis. Psychological Bulletin, 86, 307–324.
Brewer, M. B. (1986). Ethnocentrism and its role in interpersonal trust. In M. Brewer & B. Collins (Eds.), Scientific inquiry and the social sciences: A volume in honor of Donald T. Campbell (pp. 345–360). San Francisco: Jossey-Bass.
Brewer, M. B. (1991). The social self: On being the same and different at the same time. Personality and Social Psychology Bulletin, 17, 475–482.
Brewer, M. B., & Campbell, D. T. (1976). Ethnocentrism and intergroup attitudes: East African evidence. Beverly Hills, CA: Sage.
Brewer, M. B., & Caporael, L. R. (2006). An evolutionary perspective on social identity: Revisiting groups. In M. Schaller, J. Simpson, & D. Kenrick (Eds.), Evolution and social psychology (pp. 143–161). New York: Psychology Press.
Brewer, M. B., & Kramer, R. M. (1986). Choice behavior in social dilemmas: Effects of social identity, group size, and decision framing. Journal of Personality and Social Psychology, 50, 543–549.
Brewer, M. B., & Schneider, S. (1990). Social identity and social dilemmas: A double-edged sword. In D. Abrams & M. Hogg (Eds.), Social identity theory: Constructive and critical advances (pp. 22–41). New York: Springer-Verlag.
Bruins, J. J., Liebrand, W. B., & Wilke, H. A. (1989). About the saliency of fear and greed in social dilemmas. European Journal of Social Psychology, 19, 155–161.
Buchan, N., Croson, R., & Dawes, R. M. (2002). Swift neighbors and persistent strangers: A cross-cultural investigation of trust and reciprocity in social exchange. American Journal of Sociology, 108, 161–206.
Cadinu, M., & Rothbart, M. (1996). Self-anchoring and differentiation processes in the minimal group setting. Journal of Personality and Social Psychology, 70, 661–677.
Clement, R. W., & Krueger, J. (2002). Social categorization moderates social projection. Journal of Experimental Social Psychology, 38, 219–231.
Cox, J. C. (2004). How to identify trust and reciprocity. Games and Economic Behavior, 46, 260–281.
Dawes, R. M. (1989). Statistical criteria for a truly false consensus effect. Journal of Experimental Social Psychology, 25, 1–17.
Dawes, R. M., McTavish, J., & Shaklee, H. (1977). Behavior, communication, and assumptions about other people’s behavior in a commons dilemma situation. Journal of Personality and Social Psychology, 35, 1–11.
De Cremer, D., Dewitte, S., & Snyder, M. (2001). The less I trust, the less I contribute (or not)? The effects of trust, accountability and self-monitoring in social dilemmas. European Journal of Social Psychology, 31, 93–107.
De Cremer, D., & Van Dijk, E. (2002). Reactions to group success and failure as a function of identification level: A test of the goal-transformation hypothesis in social dilemmas. Journal of Experimental Social Psychology, 38, 435–442.
De Cremer, D., & Van Vugt, M. (1999). Social identification effects in social dilemmas: A transformation of motives. European Journal of Social Psychology, 29, 871–893.
Dion, K. L. (1973). Cohesiveness as a determinant of ingroup-outgroup bias. Journal of Personality and Social Psychology, 28, 163–171.
Farnham, S. D. (1999). From implicit self-esteem to in-group favoritism. Unpublished doctoral dissertation, University of Washington.
Fehr, E., & Gächter, S. (2002). Altruistic punishment in humans. Nature, 415, 137–140.
Fehr, E., & Schmidt, K. M. (1999). A theory of fairness, competition, and cooperation. Quarterly Journal of Economics, 114, 817–868.
Foley, R. (1996). The adaptive legacy of human evolution: A search for the environment of evolutionary adaptedness. Evolutionary Anthropology, 4, 194–203.
Gigerenzer, G. (2000). Adaptive thinking: Rationality in the real world. New York: Oxford University Press.
Gramzow, R., & Gaertner, L. (2005). Self-esteem and favoritism toward novel in-groups: The self as an evaluative base. Journal of Personality and Social Psychology, 88, 801–815.
Gramzow, R., Gaertner, L., & Sedikides, C. (2001). Memory for in-group and out-group information in a minimal group context: The self as an informational base. Journal of Personality and Social Psychology, 80, 188–205.
Hinkle, S., & Brown, R. (1990). Intergroup comparisons and social identity: Some links and lacunae. In D. Abrams & M. Hogg (Eds.), Social identity theory: Constructive and critical advances (pp. 48–70). London: Harvester Wheatsheaf.
Keane, J. (2003). Global civil society? Cambridge, UK: Cambridge University Press.
Kerr, N. L. (1983). Motivation losses in small groups: A social dilemma analysis. Journal of Personality and Social Psychology, 45, 819–828.
Kiyonari, T., Tanida, S., & Yamagishi, T. (2000). Social exchange and reciprocity: Confusion or heuristic? Evolution and Human Behavior, 21, 411–427.
Kramer, R. M., & Brewer, M. B. (1986). Social group identity and the emergence of cooperation in resource conservation dilemmas. In H. Wilke, D. Messick, & C. Rutte (Eds.), Experimental social dilemmas (pp. 129–137). Frankfurt, Germany: Verlag Peter Lang.
Kramer, R. M., & Goldman, L. (1995). Helping the group or helping yourself? Social motives and group identity in resource dilemmas. In D. A. Schroeder (Ed.), Social dilemmas: Perspectives on individuals and groups (pp. 49–67). New York: Praeger.
Kramer, R. M., & Wei, J. (1999). Social uncertainty and the problem of trust in social groups: The social self in doubt. In T. R. Tyler, R. M. Kramer, & O. P. John (Eds.), The psychology of the social self (pp. 145–168). Mahwah, NJ: Lawrence Erlbaum Associates.
Krueger, J. I., & Acevedo, M. (2005). Social projection and the psychology of choice. In M. D. Alicke, D. Dunning, & J. I. Krueger (Eds.), The self in social perception (pp. 17–41). New York: Psychology Press.
Kurzban, R., & Leary, M. R. (2001). Evolutionary origins of stigmatization: The functions of social exclusion. Psychological Bulletin, 127, 187–208.
Liebrand, W., Jansen, R., Rijken, V., & Suhre, C. (1986). Might over morality: Social values and the perception of other players in experimental games. Journal of Experimental Social Psychology, 22, 203–215.
Mashima, R., & Takahashi, N. (2005). What types of others do people regard as “good” in generalized exchange? Paper presented at the International Conference on Social Dilemmas, Kraków, Poland.
Messick, D. M. (1999). Alternative logics for decision making in social settings. Journal of Economic Behavior and Organization, 38, 11–28.
Miller, D. T., Downs, J. S., & Prentice, D. A. (1998). Minimal conditions for the creation of a unit relationship: The social bond between birthdaymates. European Journal of Social Psychology, 28, 475–481.
Mulder, L. B., van Dijk, E., De Cremer, D., & Wilke, H. A. (2006). Undermining trust and cooperation: The paradox of sanctioning systems in social dilemmas. Journal of Experimental Social Psychology, 42, 147–162.
Otten, S. (2002). I am positive and so are we: The self as determinant of favoritism toward novel ingroups. In J. Forgas & K. Williams (Eds.), The social self: Cognitive, interpersonal, and intergroup processes (pp. 273–291). New York: Psychology Press.
Otten, S., & Wentura, D. (1999). About the impact of automaticity in the minimal group paradigm: Evidence from affective priming tasks. European Journal of Social Psychology, 29, 1049–1071.
Otten, S., & Wentura, D. (2001). Self-anchoring and in-group favoritism: An individual profiles analysis. Journal of Experimental Social Psychology, 37, 525–532.
Parks, C. D., Henager, R. F., & Scamahorn, S. D. (1996). Trust and reactions to messages of intent in social dilemmas. Journal of Conflict Resolution, 40, 134–151.
Pruitt, D. G., & Kimmel, M. (1977). Twenty years of experimental gaming: Critique, synthesis, and suggestions for the future. Annual Review of Psychology, 28, 363–392.
Rapoport, A., & Eshed Levy, D. (1989). Provision of step-level public goods: Effects of greed and fear of being gypped. Organizational Behavior and Human Decision Processes, 44, 325–344.
Robbins, J. M., & Krueger, J. (2005). Social projection to ingroups and outgroups: A review and meta-analysis. Personality and Social Psychology Review, 9, 32–47.
Ross, L., Greene, D., & House, P. (1977). The “false consensus effect”: An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301.
Schnake, M. E. (1991). Equity in effort: The “sucker effect” in co-acting groups. Journal of Management, 17, 41–55.
Sherif, M. (1966). In common predicament: Social psychology of intergroup conflict and cooperation. New York: Houghton Mifflin.
Sumner, W. G. (1906). Folkways. New York: Ginn.
Tajfel, H. (1981). Human groups and social categories. Cambridge, UK: Cambridge University Press.
Tajfel, H., & Turner, J. C. (1979). An integrative theory of intergroup conflict. In W. Austin & S. Worchel (Eds.), Social psychology of intergroup relations (pp. 33–47). Chicago: Nelson.
Tanis, M., & Postmes, T. (2005). A social identity approach to trust: Interpersonal perception, group membership and trusting behaviour. European Journal of Social Psychology, 35, 413–424.
Trivers, R. L. (1971). The evolution of reciprocal altruism. Quarterly Review of Biology, 46, 35–57.
Turner, J. C. (1975). Social comparison and social identity: Some prospects for intergroup behaviour. European Journal of Social Psychology, 5, 5–34.
Turner, J. C., Hogg, M., Oakes, P., Reicher, S., & Wetherell, M. (1987). Rediscovering the social group: A self-categorization theory. Oxford, UK: Basil Blackwell.
Weber, J. M., Kopelman, S., & Messick, D. M. (2004). A conceptual review of decision making in social dilemmas: Applying a logic of appropriateness. Personality and Social Psychology Review, 8, 281–307.
Wit, A. P., & Kerr, N. L. (2002). “Me versus just us versus us all”: Categorization and cooperation in nested social dilemmas. Journal of Personality and Social Psychology, 83, 616–637.
Wit, A. P., & Wilke, H. (1992). The effect of social categorization on cooperation in three types of social dilemmas. Journal of Economic Psychology, 13, 135–151.
Yamagishi, T. (1986). The provision of a sanctioning system as a public good. Journal of Personality and Social Psychology, 51, 110–116.
Yamagishi, T. (1988). The provision of a sanctioning system in the United States and Japan. Social Psychology Quarterly, 51, 265–271.
Yamagishi, T., Jin, N., & Kiyonari, T. (1999). Bounded generalized reciprocity: Ingroup boasting and ingroup favoritism. Advances in Group Processes, 16, 161–197.
Yamagishi, T., & Kiyonari, T. (2000). The group as the container of generalized reciprocity. Social Psychology Quarterly, 63, 116–132.
Notes

1. At the time that I was completing this chapter, Robyn Dawes was my colleague as a visiting scholar at the University of California, Santa Barbara. I was fortunate to be able to take advantage of his help and advice in checking the algebra that I used to develop this argument.
2. See also Yamagishi’s conceptualization of a social group as the “container of generalized reciprocity” (Yamagishi, Jin, & Kiyonari, 1999; Yamagishi & Kiyonari, 2000).
3. Recent research in experimental economics suggests that even this assumption may be unfounded. Cox (2004) compared the behavior of endowed senders in the typical trust game (with reciprocity) and an alternative version with no reciprocity (the dictator game). He found that the mean amount of money sent to the recipient in the dictator version of the game was substantially larger than 0. In that condition, 19 out of 30 participants sent some of their endowment to their (randomly assigned) partner in the absence of any expectation of return, suggesting that at least some of the behavior in the so-called trust game is driven by altruistic regard for the other rather than expectations of reciprocity.
11
Must Good Guys Finish Last?
David Messick
Kellogg School of Management, Northwestern University
The question posed in the title of this chapter is an old question that probes the very foundation of human sociality. It is also a question that derives from research with the famous prisoner’s dilemma paradigm that has been taken to represent the underlying incentive structure of a myriad of social situations. The heart of the dilemma is exposed with the simple two-person game represented in Figure 11.1. In this figure, we have the payoff matrix for good guys and bad guys. Good guys make choices that are cooperative, that benefit the other person, and that recognize that if they are interacting with other good guys, they will both do well. Two good guys will get five units each, whereas two bad guys interacting will lose three units each in this hypothetical, denuded, and highly abstract situation. A good guy interacting with a bad guy loses four units, whereas the bad guy thrives with six units. Another way to look at this figure is to ask what you would do if you could choose to be either a good guy or a bad guy without knowing who your interaction partner will be. If you suppose your partner will be a good guy, you would do better by one unit if you chose to be a bad guy; if your partner were to be a bad guy, you would still do a unit better by choosing to be a bad guy. No matter what kind of guy the other is, you get a unit advantage by being a bad one. Hence, good guys finish last. End of story. That may be the end of the story in hypothetical, denuded, and abstract situations that will never be repeated, with others who are hypothetical
and abstract, but it is not necessarily the end of the story in situations that are closer to real human social life. Indeed, Luce and Raiffa (1957), who introduced this game to most social scientists in their path-breaking book, Games and Decisions, after showing why one would lose by being a good guy, say, “If we were to play this game we would not take the second strategy (i.e., be a bad guy) at every move” (p. 100). Their explanation is awkward and unconvincing, but their intuition is that being a bad guy all the time might not be such a good idea. Subsequent research, both theoretical and empirical, has supported this intuition. In the theoretical domain, Kreps, Milgrom, Roberts, and Wilson (1982) showed that in repeated games of the sort depicted in Figure 11.1, being a good guy can be a rational (i.e., not finish last) strategy. The key is that the game is repeated, even for a finite but unknown number of interactions. Also, the authors assume players have assumptions about the nature of the other player, assumptions that include the notion that the other may understand the futility of failing to cooperate at all times. With such assumptions, Kreps et al. (1982) showed that there were equilibria in which good guys could thrive. A somewhat different approach to the theoretical issue was the challenge extended by Axelrod (1984). His project, somewhere between empirical and theoretical, invited scholars to submit computer programs that would interact with each other according to payoff rules like those summarized in Figure 11.1. Specifically, every program submitted would play a repeated series of interactions with every other program submitted, making this a round-robin tournament. The interactions were scored by the total number of units each program accumulated, and the program score was the sum of the individual interaction scores. Somewhat similar to the mathematical results of Kreps et al. (1982), the surprising and robust result of this tournament was that the good guys did not finish last. Indeed, the winning program was one that started by being a good guy and mimicked the interaction partner on every subsequent trial. This was the famous tit-for-tat (TFT) strategy that had been submitted by Anatol Rapoport. The TFT program accumulated more units than any other and did so in a second round after its initial success had been widely publicized.

Figure 11.1 Prisoner’s dilemma.

             Good Guy   Bad Guy
Good Guy     5, 5       -4, 6
Bad Guy      6, -4      -3, -3
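Axelrod’s round-robin setup is easy to reproduce in miniature. The sketch below scores matches with the Figure 11.1 payoffs; the field of entries (three tit-for-tat reciprocators, one unconditional cooperator, one unconditional defector) and the 200-round match length are illustrative choices, not the actual tournament entries.

```python
# A miniature round-robin tournament in the spirit of Axelrod (1984),
# scored with the Figure 11.1 payoffs. The field and match length are
# illustrative; the actual tournament had many more entries.

PAYOFF = {("C", "C"): (5, 5), ("C", "D"): (-4, 6),
          ("D", "C"): (6, -4), ("D", "D"): (-3, -3)}

def all_c(opp_history): return "C"  # unconditional good guy
def all_d(opp_history): return "D"  # unconditional bad guy
def tft(opp_history):   return opp_history[-1] if opp_history else "C"

def match(s1, s2, rounds=200):
    h1, h2, t1, t2 = [], [], 0, 0
    for _ in range(rounds):
        m1, m2 = s1(h2), s2(h1)  # each side reacts to the other's past moves
        p1, p2 = PAYOFF[(m1, m2)]
        t1, t2 = t1 + p1, t2 + p2
        h1.append(m1)
        h2.append(m2)
    return t1, t2

entries = [("TFT-1", tft), ("TFT-2", tft), ("TFT-3", tft),
           ("ALL-C", all_c), ("ALL-D", all_d)]
totals = {name: 0 for name, _ in entries}
for i, (n1, s1) in enumerate(entries):
    for n2, s2 in entries[i + 1:]:
        a, b = match(s1, s2)
        totals[n1] += a
        totals[n2] += b

# TFT never outscores a head-to-head partner (it ties the cooperators and
# finishes 10 units behind ALL-D), yet it tops the aggregate standings.
print(totals)
```

With these payoffs, each TFT entry totals 2,399 units, ALL-C totals 2,200, and ALL-D totals −573, which illustrates the paradox discussed next.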
Axelrod (1984) identified several properties of this program that accounted for its robust success. One of these was that the program was what we might call a “conditional” good guy. It had the property that it would never be the first to be a bad guy, so with any other program that also had this feature of conditional “goodness,” they would always stay in the cell that generated mutually high outcomes. A second feature that was interesting came from the observation that the TFT strategy could never outscore any other program with which it interacted. At best it could accumulate the same amount as the partner, and at worst it could be down by 10 units after one occasion when it was a good guy and the other was a bad guy. Here is a paradox that has deep implications. TFT amasses more units than any other, but it can never, in a head-to-head interaction, amass more than the other. It works by inducing the others to be good guys too, and it is profitable to interact with good guys. This is important. Among the empirical results that show how good guys may avoid finishing last, I will mention two. The first is an early study by Dawes, McTavish, and Shaklee (1977). In this experiment, groups of students played a multiperson version of the game in Figure 11.1. Being a good guy would be costly and, mutatis mutandis, lead to a lower payment. One of the variables that was manipulated in this study was the availability of communication and the content of this communication. The interesting finding was that groups that were permitted to discuss the decision they had to make (individually) had a higher representation of good guys than groups that could not communicate at all or that could only discuss issues not related to the dilemma. Naturally, people in groups with a higher fraction of good guys accumulated more units than the other groups. The good guys in those groups accumulated less than the bad guys in the same groups, but they may have earned more than the bad guys in the no communication or irrelevant communication groups with fewer good guys. Whether you are a good guy or a bad guy, you do better the more good guys you have around. Thus, taken as a whole, the good guys did not necessarily finish last. The second empirical study I want to mention was also conducted by Dawes and his collaborators. This is the important paper by Orbell and Dawes (1993) in which they allowed participants to decline to play a game like the one in Figure 11.1. Previous work, largely motivated by Dawes et al. (1977), had shown that people expect others to make the same type of decisions in these games that they themselves make, so if one adds the option of not playing the game to the hypothetical, denuded, and abstract situation with which we started, what should happen? Bad guys will expect the others to be bad guys and good guys
should expect the others to be good guys by and large. Both want to play with good guys, so who should take the option to not play? It should be the bad guys, and this is precisely what the data showed. Those who declined to play were mainly the bad guys, who would have reduced the earnings of all the others who did play. Thus, while the good guys who did play made less than the bad guys who played, they did not necessarily make less than the bad guys who did not play or the bad guys who played with other bad guys. In a real situation, the bad guys might eventually learn that their belief that they would interact with mainly bad guys was false, and they would no longer opt to withdraw, but this is a story that has not yet been examined. If the bad guys drifted back into the competition, it could spell trouble for the good guys. We started with a denuded, hypothetical, and abstract situation represented in Figure 11.1 and examined some theoretical and empirical findings that suggest that despite the obvious disadvantage good guys suffer in this stark situation, they may do rather well if the relationship extends in time (as most of our social life does) and if good guys can in some way interact mainly with other good guys. At this point I want to shift to a very real, concrete, and contextually rich question that parallels the title of the chapter. Can good firms compete successfully in a competitive environment? To begin to address this question, let me clarify what I mean by "good" firms. Good firms are ones that spend money and other resources on what may be called socially responsible activities such as community development; education for employees and other people; securing health care for employees and others; environmental tidiness and sustainability; the elimination of graft, corruption, and unethical practices; the assumption of civic responsibility to pay (rather than avoid) taxes; and that, generally speaking, work for the common or public good in addition to their own economic well-being. This cluster of activities is often referred to as corporate social responsibility (CSR). In the management literature there are essentially two major positions with regard to this question. The first position is that investments and expenditures for these types of activities waste the firm's resources and constitute irresponsible leadership. Perhaps the most articulate voice for this position is that of Nobel Prize–winning economist Milton Friedman (1970), who argued that the sole responsibility of an executive is to enhance the profitability of the firm. To use the firm's resources to advance causes that are unrelated to the firm's business, Friedman argues, is to tax the shareholders (by taking their resources and diverting them to these causes) without their consent. Moreover, it is not good for business since firms that do not pay the costs to engage in CSR will, all else equal, be
more profitable than those that do. Even if the firm gets some general benefit from having, say, a more educated workforce, firms that fail to make the investment also share in this benefit without having to pay the cost. In terms of our title, these are the "bad" firms, and the ones that pay the cost are the "good" firms. If the social and environmental benefits generated by the CSR of the good firms can be shared by all firms, then the good firms are at an obvious economic disadvantage. The benefits are common and shared, and the costs are private. Those who pay, that is, the good guys, will finish last. The second position in the literature is that the costs paid for CSR are investments that generate a return to the firms that pay for them. Whether or not they do is essentially an empirical question. It is also a question that is frustratingly difficult to answer, for reasons that I will illustrate shortly. The conceptual position expressed by Frank (1996), for instance, and Willard (2002) is that firms that invest in CSR will generate returns that will more than repay the investment, and that the repayment will be specific to the good firm. Frank (1996) argues that being good reduces costs for firms in at least five ways: good firms have an advantage in attracting the best employees, retaining them, attracting loyal customers, reducing legal and contracting costs, and maintaining cheaper and more efficient relationships with other firms. Willard (2002) adds the ability to motivate employees, to reduce environmental costs and risks, and to make financing and capitalization cheaper. These are plausible hypotheses about how good firms can avoid finishing last in a competitive economic environment, but do the data support these claims? Before we examine some of the data that bear on this question, I want to digress and point out that this question is closely related to a question in the evolution of altruism. That latter question is this: "How can a form—a phenotype—succeed when it pays a fitness cost relative to others in the population?" The altruist, by definition, pays a fitness cost that is associated with an enhancement of the fitness of some other conspecifics. One of the answers to this question is that if the altruists are in some fashion segregated from the nonaltruists, they may receive more than the average fitness benefit from the other altruists and hence accrue a higher fitness level than the average nonaltruist. How this segregation may happen is a matter of dispute, but the underlying process is precisely like the experimental effects that Dawes and his colleagues found and that I discussed previously. If people are permitted to discuss a social dilemma and to make promises about their future actions, they tend to be good guys, whereas people who cannot discuss it tend to be less so. This means the good guys find themselves disproportionately
in the presence of other good guys while bad guys find themselves disproportionately in the presence of other bad guys. The bad guys in the group of predominantly good guys will do very well indeed, but the good guys in these groups may well do better than the average bad guy (who will be in the other groups). Messick and van de Geer (1981) have spelled out this apparent paradox in some detail. The relevance of this point to the issue under discussion is that even if there were no direct return on the investment in CSR, it is possible through this "clustering" mechanism that good firms may do better than bad firms. If good firms are more likely than bad firms to be the recipients of the benefits of other good firms, they may be more successful economically than bad firms. The important point here is that factors that tend to make me a good guy tend to make others good guys as well. Therefore, there should be a positive correlation between being a good guy and being amidst good guys. That is all this mechanism requires. Now let's examine some data that pertain to the question of whether good firms must finish last. An important early study was published by Waddock and Graves (1997), who collected measures of financial performance and information on the extent of CSR activities for a large sample of firms. The empirical question was whether there was any relationship between these two sets of indices. The answer was that a moderate but positive relationship existed among the classes of measures. Furthermore, in an effort to tease out causal directions among the measures, the authors did time-lagged regressions in which CSR measures from one year were related to financial performance measures from the following year and vice versa. The evidence suggested that being good led to improved financial performance and that enhanced financial performance led to being good. The authors referred to this pattern as a virtuous circle. To prevent extraneous factors from influencing the results of the regressions, Waddock and Graves (1997) controlled for the industries that the firms were in. This means that variation associated with differences among the industries was eliminated. This process of control may, however, eliminate or reduce the very effects that clustering creates. If there are important differences among industries with regard to the relative concentrations of good guys and bad guys, then controlling for industries may hide the process (clustering) that creates the advantage. To illustrate this possibility, imagine three industries, A, B, and C, in which there are large differences in the density or number of good guys and the resulting performance. (Keep in mind that a firm does better if it gets benefits from other good firms.) We may have a hypothetical case such as that depicted in Figure 11.2, in
which industry A has relatively few good guys and somewhat modest performance, industry B has more good guys and better performance than A, and industry C has the most good guys and the best performance in aggregate.

Figure 11.2 How clustering may change the relationship between performance and the number of good guys. (The figure plots performance against the number of good guys, with one oval for each of industries A, B, and C, and the annotation "MGGFL? Locally – Yes; Globally – No.")

For the population that consists of all three industries, the correlation between being a good guy and performance will be positive (because more good guys will be further to the right in the figure). However, if in doing a regression one controlled for industries, in effect the mean differences between the three ovals would be reduced, and the resulting correlation would be negative, as would the correlation within any of the three industries. This figure is another way to display the fact that whereas good guys may be doomed to finish last locally, they are not doomed to finish last globally. Remember Axelrod's result that TFT, the winner of his tournament, can never win a single interaction. Waddock and Graves (1997) do not report analyses in which the industries were not controlled for, making it impossible to determine whether some of the clustering benefits outlined above are eliminated by their statistical methodology. Thus, the empirical question remains open. The same is true of the interpretational question. The point made above requires that we ask what to control for when we examine the real empirical world. Messick and van de Geer (1981) argued that there is no way to determine whether the local or the global perspective is the correct one. The empirical truth is that by some standards (what we are calling local ones), good firms may finish last, but by global standards, they may finish well ahead of bad firms, even if there are no direct investment returns accruing to good firms. This is a genuine paradox, not an illusion, a paradox that must be kept in mind when thinking about the question. Like Axelrod's TFT, good firms may finish last locally but do very well globally. This is a subtle point that perhaps deserves an illustration.
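First, a minimal numerical sketch of the reversal, using simulated data with made-up parameters rather than Waddock and Graves's sample: within every industry, performance falls as a firm's CSR spending rises, because the costs are private; pooled across industries, the correlation is positive, because good-guy-dense industries enjoy larger shared benefits.

# Simulated illustration of the clustering reversal (Simpson's paradox).
# Industry means and the within-industry slope are invented parameters,
# chosen only to make the sign flip visible.
import random

random.seed(1)
industries = [(2, 10), (5, 20), (8, 30)]    # (mean CSR, mean performance): A, B, C
firms = []                                   # (csr, performance), 50 firms per industry
for base_csr, base_perf in industries:
    for _ in range(50):
        csr = base_csr + random.gauss(0, 1)
        # Within an industry, each extra unit of CSR is a private cost (slope -1).
        perf = base_perf - (csr - base_csr) + random.gauss(0, 1)
        firms.append((csr, perf))

def corr(pairs):
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    vx = sum((x - mx) ** 2 for x, _ in pairs)
    vy = sum((y - my) ** 2 for _, y in pairs)
    return cov / (vx * vy) ** 0.5

print("pooled correlation:", round(corr(firms), 2))          # positive: globally, good guys win
for i, name in enumerate("ABC"):
    within = firms[i * 50:(i + 1) * 50]
    print("within industry", name, round(corr(within), 2))   # negative: locally, they finish last

Controlling for industry, as Waddock and Graves did, amounts to reading only the within-industry lines, which would report that being good does not pay even in a simulated world where, globally, it clearly does.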
Imagine two communities in a country. One community requires its firms to invest in technology that protects the environment, to pay living wages to its employees, and to spend money on education and health for citizens. The other community requires none of these things. The benefits that are provided by the first community are public goods that can be enjoyed by all firms in the community. The environment is greener, the citizens are healthier and better educated, and the wealth of the community is greater, permitting it to do more things for the people who live there. If these benefits exceed the cost that a given firm pays for its share, then the firms in the first community will do better than those in the less demanding community. Now, if a firm moved into the first of these regions and could find a way to avoid paying for these activities, this firm would get all the benefits while paying no costs at all. Thus, locally, the good firms would do poorly with respect to the renegade firm, but they still might do better than the firms in the community where none do good. Although the mathematics of the comparison depend on the costs of the good, the benefits allotted to the others, and the relative concentrations of good guys and bad guys, the fact remains that the good guys need not, on average, do worse than the bad guys. We can now return to the question of whether there is a financial return from an investment in CSR and, if so, what the nature of that return is. In perhaps the most comprehensive review of the data on this question, Orlitzky, Schmidt, and Rynes (2003) conducted a meta-analysis of 52 quantitative studies that examined the relationship between indices of CSR (measures of the extent to which a firm is a good guy) and measures of financial success (measures of how firms were doing in the economic competition). The authors hypothesized that a positive relationship exists between CSR and corporate financial performance (CFP). They also suggested that the causal arrow between these measures points in both directions, that is, that CSR activities increase financial performance and that financial success enhances a firm's ability to invest in CSR. This hypothesis is essentially the virtuous circle hypothesis of Waddock and Graves (1997). However, Orlitzky et al. (2003) go further by proposing two more or less independent paths by means of which CSR investments pay off. The first is that firms learn new competencies, efficiencies, and skills that improve performance. Finding uses for industrial wastes, for instance, can transform what had been costs into saleable products that produce profits. Developing transparent modes of pricing may reduce the costs of negotiations. Bringing stakeholders into planning processes early on may reduce the risks of future litigation and the costs of project delays. These are merely a few of the ways in which more responsible and ethical practices may lead to increased profits. The second route through which CSR activities may enhance financial performance is through reputation building. Firms with good
reputations may find it easier in a multitude of ways to do business. Firms that have the reputation of operating with a larger environment in mind, both a social environment and a temporal one, are less likely to provoke scrutiny and regulation from governmental and other regulatory agencies. Costs of compliance will be lower. Customers may prefer to do business with firms that have a reputation for dealing well with their stakeholders, from employees to suppliers and customers. Both of these routes are what we have called "direct investment" routes. They represent ways in which good firms are directly rewarded for being good. The virtuous circle hypothesis of these authors was clearly confirmed by their meta-analysis. All of the estimates of the relationship between CSR and CFP were positive and robust enough to remain positive even when various sources of error and bias were removed. Furthermore, supporting evidence was found for both of the proposed mediating mechanisms, what might be called the learning effect and the reputation effect. The evidence supporting the latter was stronger, and it supported the hypothesis that the reputation effect might be a more important source of advantage than the learning effect. Far from supporting the notion that in the world of competitive business good guys must finish last, these studies, taken together, support precisely the opposite conclusion, namely, that good guys have a competitive advantage over bad guys. I moved from studies in the world of experimental social science to the world of corporate performance to highlight the fact that this question about how well or poorly good guys do is an exceedingly important one, with consequences not only for our views of the nature of human sociality but also for the costs or benefits of corporate citizenship. There will always be some firms that cheat and get away with it, either because they have governmental protection or because they are especially good cheaters or both, but the bulk of the evidence suggests that doing good does not condemn firms in a competitive marketplace to finishing last. With this point made, I want to return to the social science laboratory for a final point on this question about good guys. The evidence I have reviewed indicates that there are two mechanisms by means of which good guys get an advantage in a social dilemma: (1) having the situation be recurrent (thereby permitting reputation building or learning) or (2) having people segregated in some way so that statistically there are significantly more than an average number of good guys in some groups and fewer than average in others. This segregation,
as Dawes and his collaborators have shown, can be achieved by giving people the option of not playing or by allowing them to discuss the choice that they will have to make. The de facto consequence of both of these manipulations is to create the type of segregation that can reward good guys more than bad guys. (I should point out here that I am going to ignore a complex relationship between the degree of segregation and the cost–benefit structure of cooperating. In short, the more beneficial the cooperative choice relative to its cost, the more random the distribution of cooperators can be for cooperators to have a global advantage. Thus, the two necessary conditions for the evolution of altruism—that the cost of the altruism must be less than the total benefit provided and that the benefits of the altruism cannot be randomly distributed across all others—are linked.) I now want to describe evidence that there is a third, more direct way that good guys may benefit. The presence of good guys in a small group may tend to encourage others in the group to also be good guys and thereby to provide benefits for the initial good guys. Why might this happen? Because in many situations, particularly experimental situations that are not often encountered in daily life, participants may look for cues as to what the appropriate behavior is. This hypothesis is consistent with the conceptual point of view proposed by Weber, Kopelman, and Messick (2004), who suggested that most social dilemmas can be seen as opportunities either for self-enrichment or for group problem solving. If the former, then being a good guy is dumb and misguided, but if the situation is a social problem-solving situation, such as getting through an intersection with a four-way stop sign, cooperating and making sure that everyone is well off is the smart thing to do. Trying to get a leg up can lead to collective disaster. One source of information about what a situation means can be extracted from the actions of others in the situation. Latané and Darley (1970) made this point clearly in their path-breaking research on the "bystander effect." One reason people do not help others in an emergency witnessed by a group is that others are not seen to be helping, lending credence to the hypothesis that the situation is not truly an emergency. In precisely the same way, a consistently good guy may influence a naïve participant's interpretation of what kind of situation he or she is in. If there is a consistent cooperator, a person who cooperates on every occasion, then perhaps the situation is one that calls for cooperation rather than self-promotion. This was the hypothesis Weber (2004) set out to examine in his dissertation. He started by examining a data set from an experimental public goods game conducted by Isaac, Walker, and Williams (1994).
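The structure of such a game is simple enough to state in a few lines. The sketch below uses a linear public goods game of the general kind Isaac, Walker, and Williams studied; the group size, endowment, and multiplier are hypothetical round numbers, not their actual design.

# A linear public goods game: each member keeps whatever she does not
# contribute, plus an equal share of the multiplied common pot. With the
# multiplier below the group size, contributing is individually costly but
# collectively profitable: the defining tension of a social dilemma.
# Parameters are illustrative, not those of Isaac, Walker, and Williams (1994).

def payoffs(contributions, endowment=10, multiplier=2.0):
    pot = multiplier * sum(contributions)
    share = pot / len(contributions)
    return [endowment - c + share for c in contributions]

# A consistent contributor (first member) whose example pulls the others up:
print(payoffs([10, 8, 8, 6]))   # [16, 18, 18, 20]: the CC earns least in her group
# A group with no CC and little cooperation:
print(payoffs([2, 2, 2, 2]))    # [12, 12, 12, 12]: everyone earns less than the CC's 16

This is the arithmetic behind the finding reported next: a consistent contributor earns less than her own groupmates yet can earn more than the average member of groups without one.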
Weber found that groups that contained a consistent contributor (CC), a member who cooperated on every trial, had a higher level of contributions (cooperation) from those who were not themselves the CC than groups that did not contain at least one CC. CCs earned less than the other members of their own groups, but the question is whether they earned less than the average of all the other players. They did not earn less. In fact, they earned significantly more than the average of the others. Weber then went on to experimentally create groups in which some had CCs and others did not. This is not the place for a detailed discussion of Weber's manipulations or his findings, but one pattern was clear. Being a CC did not mean earning less than others in groups without CCs. In two of three studies, including the one in which he reanalyzed the Isaac et al. (1994) data, CCs earned significantly more than the average other, and in the third study the CCs earned more, but not significantly more. Thus, Weber's research indicates yet another mechanism by means of which good guys do not finish last. Good guys set an example, and people emulate them. The emulators improve the outcomes for the good guys. In this chapter I have briefly reviewed some of the evidence pertaining to the claim that people who cooperate in the prisoner's dilemma and other types of social dilemmas must finish last, that is, must earn less than people who do not—the bad guys. Both the theoretical and the empirical evidence indicate that this simple claim, based as it typically is on the hypothetical, abstract, unsituated matrix in Figure 11.1, is incorrect. The claim is incorrect in experimental games and in the world of corporate economic competition, according to the best and most sophisticated analyses available. The claim that good guys finish first is not generally supported unless a global context for ranking is assumed, but there is a clear answer to the claim that good guys must finish last, and that answer is that the claim is flat wrong. Must good guys finish last? Absolutely not!
References

Axelrod, R. (1984). The evolution of cooperation. New York: Basic Books.
Dawes, R. M., McTavish, J., & Shaklee, H. (1977). Behavior, communication, and assumptions about other people's behavior in a commons dilemma situation. Journal of Personality and Social Psychology, 35, 1–11.
Frank, R. H. (1996). Can socially responsible firms survive in a competitive environment? In D. M. Messick & A. E. Tenbrunsel (Eds.), Codes of conduct (pp. 86–103). New York: Russell Sage.
Friedman, M. (1970). The social responsibility of business is to increase its profits. New York Times Magazine, September 13, 122–126.
Isaac, R. M., Walker, J. M., & Williams, A. W. (1994). Group size and the voluntary provision of public goods. Journal of Public Economics, 54, 1–36.
Kreps, D. M., Milgrom, P., Roberts, J., & Wilson, R. (1982). Rational cooperation in the finitely repeated prisoner's dilemma. Journal of Economic Theory, 27, 245–252.
Latané, B., & Darley, J. (1970). The unresponsive bystander: Why doesn't he help? New York: Appleton-Century-Crofts.
Luce, R. D., & Raiffa, H. (1957). Games and decisions. New York: Wiley.
Messick, D. M., & van de Geer, J. P. (1981). A reversal paradox. Psychological Bulletin, 90, 582–593.
Orbell, J. M., & Dawes, R. M. (1993). Social welfare, cooperator's advantage, and the option of not playing the game. American Sociological Review, 58, 787–800.
Orlitzky, M., Schmidt, F. L., & Rynes, S. L. (2003). Corporate social and financial performance: A meta-analysis. Organization Studies, 24, 403–441.
Waddock, S. A., & Graves, S. B. (1997). The corporate social performance–financial performance link. Strategic Management Journal, 18, 303–319.
Weber, J. M. (2004). Catalysts for cooperation: Consistent contributors in public good dilemmas. Unpublished doctoral dissertation, Northwestern University.
Weber, J. M., Kopelman, S., & Messick, D. M. (2004). A conceptual review of decision making in social dilemmas: Applying a logic of appropriateness. Personality and Social Psychology Review, 8, 281–307.
Willard, B. (2002). The sustainability advantage. Gabriola Island, BC, Canada: New Society.
12
Women's Beliefs about Breast Cancer Risk Factors: A Mental Models Approach

Stephanie J. Byram, Lisa M. Schwartz, Steven Woloshin, and Baruch Fischhoff
Decision-Making Needs

All women face the risk of breast cancer. Dealing with that threat requires each woman to understand her personal risk factors, which can vary widely across women and across the lifespan. Some risk factors are potentially under women's control, such as diet, hormone use, exposure to sun and pesticides, screening, and exercise. Understanding these factors allows women to make effective choices about the related activities. Other risk factors are outside their control, such as age and family history. Understanding these factors allows women to put their other life choices into perspective. Acquiring that understanding is, however, a challenge. Risk factors are but one aspect of breast cancer, which also raises issues of detection, diagnosis, and treatment. Breast cancer is but one of many risks arising in women's lives. For each, individuals have limited cognitive and emotional resources, which they must apply to information that is often complex, confusing, and scattered, coming from sources whose content is hard to interpret and whose credibility is hard to assess. The costs of mastering these uncertain facts add to the burden of decision making—and reduce the chances of its success. They also increase the chances of experiencing regret if things go badly: Somewhere in the morass of information was what one needed to make a better choice (Fischhoff, 1992).
Professionals hoping to empower lay people to manage risks need a disciplined approach that (1) identifies the medical knowledge most relevant to these choices, from among all the facts that might be "nice to know"; (2) characterizes current beliefs, including knowledge gaps and misconceptions, in a way that allows evaluation of the adequacy of what people know already; and (3) creates (and evaluates) communications that bridge the gap between current and needed understanding (Fischhoff, 1999a, b, 2005). That can be a lot to ask. The relevant medical science can be complex and changing, such that even dedicated professionals may struggle to sort out the competing claims (Gawande, 2002). Lay beliefs can be sufficiently hard to discern that even empathetic professionals may not fully understand what patients are hiding or trying to say. Saying the right thing carries no guarantee of being heard as intended, and even committed professionals may not realize that patients are reluctant to reveal their ignorance, especially when it might mean surrendering control over fateful decisions. These problems arise even when there is good will all around. Lay decision making becomes more difficult still when professionals have a vested interest in conveying an incorrect impression of the efficacy of the treatments they offer. Patients' most natural fear is that professionals will oversell methods that are unproven—or even demonstrated not to work. That may occur through misleading claims or through misleading disclaimers, which leave the impression of being pro forma (e.g., "past performance does not predict future performance," the drug side effects read in the background of dream scenes on TV ads, impenetrable patient package inserts). As Dawes (1994) has shown, such practices can reflect both deliberate misrepresentation and unwitting self-deception, when professionals fall prey to the judgmental biases often attributed to lay people. For example, they might see illusory correlations as the result of disproportionately remembering successes or not remembering clients who dropped from treatment (perhaps because it was not working). They might take credit for reductions in chronic pain, even when the condition would have improved on its own as the result of regression effects. Such biases can occur through benign cognitive processes and be augmented by motivated cognition, as when people are attuned to spotting evidence that supports a favored hypothesis while scrutinizing it less closely than evidence they do not want or expect to see (Hastie & Dawes, 2001). Ideally, information should be tailored to each individual's specific health decisions. However, that is often an unrealistic aspiration: Both patient and professional may lack the resources, training, and incentives needed to identify, share, and integrate the relevant information.
In those cases, a more feasible goal is systematically developing communications for groups of individuals with similar needs, investing the resources needed to produce focused, unbiased, and comprehensible messages. Those communications would be used unless there are grounds for assuming that better tailoring is possible. Having good general information in circulation could make such tailoring more feasible if it provides individuals with fundamental understanding that they can adapt to specific circumstances. For the past 20 years, we have been pursuing this strategy with a general approach called the mental models method (Fischhoff, 2005; Morgan, Fischhoff, Bostrom, & Atman, 2001). In the spirit of this volume, the approach aspires to be socially responsible by recruiting science to improve the rationality and optimality of individuals' choices. As described below, each application requires the integration of decision science, psychology, and the expertise of subject matter experts. The focus on individuals' decision-making needs means that the applications place secondary importance on the traditional disciplinary needs of the contributing sciences. Rather, they address their interface in ways that inform the generality and usefulness of their results, while potentially identifying new topics worthy of basic research.
The Mental Models Approach

The centerpiece of the mental models approach to risk communication (Atman, Bostrom, Fischhoff, & Morgan, 1994; Bostrom, Atman, Fischhoff, & Morgan, 1994; Morgan et al., 2001) is an expert model, summarizing scientific knowledge relevant to the focal choices. The two formalisms typically used are the decision tree and the influence diagram (Clemen, 1997; Dawes, 1988; Raiffa, 1968; von Winterfeldt & Edwards, 1986). The former is used when the communication focuses on quantitative estimates of the risks and benefits of possible actions. The latter is used when the communication focuses on the processes shaping those risks and benefits. Although choices eventually require quantitative estimates, knowledge of the contributing processes is relevant to individuals who want to understand why the estimates have the values that the experts claim, to monitor the environment for changes, to formulate options that are worth evaluating, or to assemble a core of knowledge that can be applied to multiple decisions. All these reasons could lead women to want someone to identify the relevant breast cancer science and then communicate it in a comprehensible form. Mental models characterize lay beliefs in terms that can be compared to the expert model. Our means for eliciting them is through
semistructured open-ended interviews, asking people to address the factors in the expert model in their own terms. The interviews begin with very general questions of the form: "Tell me what you know or have heard about…." Follow-up questions ask for elaboration on every topic that has been volunteered, both to elicit additional beliefs and to ensure that respondents have been understood as intended. Once spontaneously generated topics have been exhausted, the interviewer raises the other topics in the expert model in increasingly specific terms (e.g., asking first about any exposures to carcinogens in general, then about specific carcinogens and exposure processes, then about possible ways to reduce them). Again, respondents are asked to elaborate on their beliefs until they indicate that they have exhausted what they can share. Prompted beliefs may resemble the ones they would generate were they to engage the topic more fully under natural circumstances. Nonetheless, using increasingly specific questions may make the interviews increasingly reactive, a possibility that can be examined by comparing beliefs shared in response to general questions with those emerging after specific prompts. Open-ended interviews can capture expert model issues in their intuitive formulation as well as allow for the emergence of issues that, rightly or wrongly, escaped expert notice. However, the conduct and analysis of such interviews are resource intensive, meaning relatively few can be conducted. In addition, such studies often pursue a stratified sampling strategy, deliberately seeking individuals from diverse groups, so as to elicit a broad range of beliefs. As a result, assessing the prevalence of beliefs requires structured surveys with large, more representative samples, incorporating the topics and language observed in the interviews. For example, in an application concerning domestic radon, Bostrom, Fischhoff, and Morgan (1992) compared the frequency with which various beliefs are raised in interviews and endorsed in a survey. They found that the two rates were generally similar. Although that need not be the case, exceptions would be informative, for example, if an important belief is not naturally recalled or if interviewees routinely noted that a topic is not important. Using a well-defined expert model as a template typically allows reliable coding of interview protocols. However, it precludes identifying more holistic organizing principles of the sort sought by phenomenological or ethnographic research. Expert models are models in the sense of including variables and relationships believed to mimic real-world phenomena. Were its parameters estimated, an expert model should predict real-world input–output relationships. With most projects, development of the expert model stops with computability, in the sense of specifying the variables and
relationships precisely enough that they could be estimated (and predictions made), were the resources available to satisfy the data demands. For communications focused on conveying the big picture of how risks are created and controlled, clear demarcation of the issues may suffice. Indeed, greater precision may erode understanding by favoring issues that are most readily quantified (e.g., economic versus psychological causes and effects). Achieving computability is often a challenge, insofar as it requires clear communication among experts from different disciplines, who may not have pooled their knowledge before. Casman, Fischhoff, Palmgren, Small, and Wu (2000) describe a case where computations were needed to resolve critical uncertainty about the effectiveness of consumer warnings in the case of Cryptosporidium intrusions in domestic water systems. They found that surveillance systems were so insensitive that warnings could not protect vulnerable populations, making risk communication irrelevant and reliance on it a distraction from providing needed protection. Riley, Small, and Fischhoff (2000) describe a model computing the risks associated with various ways of using methylene chloride–based paint stripper to assess the feasibility of voluntary behavioral controls (as opposed to regulatory restrictions). They found that significant risk reduction was possible if producers could effectively communicate three actions that should be feasible (in the sense of allowing the work to be completed): opening windows, having fans pointed outward, and leaving the space while the chemical is curing (20 minutes). A review of existing labels found that none highlighted these actions, and some never mentioned them at all. Eggers and Fischhoff (2004) found that a court-mandated warning reduced the optimality of consumer decisions regarding a dietary supplement, but not so far that the expected utility of their choices was typically negative. The creation of such models, as well as the estimation of their parameters (where appropriate), requires the exercise of judgment informed by the most relevant research. The elicitation of those judgments follows protocols developed through analytically informed psychological research (Cooke, 1991; Dawes, 1988; Fischhoff, 1989; Morgan & Henrion, 1990). For a problem of any complexity, the model requires input and review from experts in several disciplines. As a result, experts' model might be a better description, recognizing further that expert is a relative, rather than an absolute, term, in the sense that those who know the most about a topic may still know much too little to satisfy decision makers' needs. In addition, the mental models interviews often reveal topics missing from the expert model, either because the experts did not realize their importance to lay people (e.g., caregiver stress in health decisions) or because they did not realize what lay people knew about
how some aspect of the world works (e.g., the role of social support—or involvement in some other activity—in reducing caregiver stress). Mental models need not be models in the sense of mapping objects and operations in people's thinking, such that one could aspire to predict input–output relations accurately. In relatively well-structured domains, psychological research has had success in developing such predictive models (Bartlett, 1932; Ericsson & Simon, 1996; Gentner & Stevens, 1983; Johnson-Laird, 1983; Slovic & Lichtenstein, 1971). However, with many risk problems, the domain is poorly structured, such that learning what is on people's minds is a big part of the research. Moreover, the need to create communications that resonate with natural ways of thinking means that predictive validity is insufficient to guarantee the deeper understanding needed to be able to help (Dawes & Corrigan, 1974). The test of a mental models project is its ability to produce interventions that help people make decisions in their own best interests (or, in the wrong hands, against those interests). However, such interventions are complex enterprises, whose success depends on many factors in addition to the underlying science (e.g., the fidelity of the implementation, the ability to engage and retain participants, the appropriateness of the measurement). In one mental models project seen through from an expert model (Fischhoff, Downs, & Bruine de Bruin, 1998) to a randomized controlled trial of the intervention, Downs et al. (2004) were able to reduce risky sexual behavior and the prevalence of chlamydia reinfection in a population of high-risk young women, using an interactive video available on DVD (see also Bruine de Bruin, Downs, & Fischhoff, 2007). The project's mental models interviews led to helping adolescent females identify choice points and options, however far a sexual encounter had progressed; it was implemented with negotiation techniques based on self-efficacy theory (Maibach & Flora, 1993) as well as behavioral decision research. The interviews also showed a failure to appreciate how small risks mount up through repeated exposure (e.g., a 1% risk per occasion grows to a roughly 63% chance of at least one bad outcome over 100 occasions), a result observed in many other contexts (e.g., Shaklee & Fischhoff, 1990; Slovic, Fischhoff, & Lichtenstein, 1978). That previous research suggested looking for this bias and overcoming it by explicitly calculating cumulative risk (and not just suggesting a long-term perspective). The success of the intervention provides some support for each of its elements—and for the methodology that identified and integrated them. Other mental models projects have addressed topics as diverse as preventing sexual assault (Fischhoff, 1992), selecting emergency contraception (Krishnamurti, Eggers, & Fischhoff, 2006), carbon dioxide sequestration (Palmgren, Morgan, Bruine de Bruin, & Keith, 2004),
evaluating the use of nuclear energy sources in space (Maharik & Fischhoff, 1992), understanding acute local problems with breast implants (Byram, Fischhoff, Embrey, Bruine de Bruin, & Thorne, 2001), and setting tariffs for deregulated electricity transmission (Gregory, Fischhoff, Butte, & Thorne, 2003). Each requires assembling a team of experts and focusing them on the issues that matter to an audience of laypeople by devising a computable model to summarize their respective inputs. Typically, the ensuing interviews have led to revising the expert model, if only to incorporate issues raised by laypeople that cannot be definitively rejected as irrelevant based on existing science. When this criterion is used, expert models can get quite messy, insofar as it is hard to rule out factors that have some very small positive or negative effect. That is the case in the research reported here, where we have adopted an inclusive definition to place lay beliefs with limited expert support in the context of those with more. Whether this is the best analytical strategy for understanding and aiding lay decisions requires an exercise in judgment that, like other aspects of the mental models approach, is open to others’ evaluation.
An Expert Model of Cancer Risk Factors

Breast cancer poses a myriad of decisions, including the intensity of surveillance that women undertake, the diagnostic testing following any suspicious sign, and the intensity of treatment following a diagnosis of cancer. The uncertain variables affecting the outcomes of these decisions include side effects of treatment, the patient's resilience, the diagnosticians' training, and insurance coverage. The details of these "downstream" processes are acutely important to women with cancer. The general picture will occupy all women to some extent, partly as a function of their perception of their personal risk factors. Some may worry too much, others too little. Silverman et al. (2001) report on women's mental models regarding the process as a whole. Byram (1998) presents the full expert model, along with its construction process, which included literature review, consultation with experts, and peer review. Its core team included both psychologists (Byram and Fischhoff) and physicians (Schwartz and Woloshin). Here we focus on possible risk factors. Beliefs about them can play roles in women's decisions about how much to worry about breast cancer risk, which precautions to take, and how to interpret whatever fate befalls them. Figure 12.1 shows the expert model. It seeks to include the major factors thought to affect breast cancer risk by at least some credible medical authorities, at the time of the research (circa 2000), along with some of their indicators.
Figure 12.1 The expert model of breast cancer risk factors.
Its very complexity captures some of the reality facing women: Many things might, conceivably, affect their risk. Moreover, these factors are not easily grouped into categories amenable to separate consideration. Thus, a woman attentive to medical opinions might hear about age, environment, psychosocial factors (e.g., stress), behavior (e.g., smoking, drinking), obesity, exogenous hormonal history (e.g., oral contraceptives, DES use during pregnancy), endogenous hormonal history (as affected by childbearing and maturation), socioeconomic status, radiation, previous breast health events (e.g., ductal carcinoma in situ), genetic status (as defined by BRCA1/2 test results), family history of breast cancer, and ethnic descent. In these brief summaries, quantitative estimates are from Love and Lindsey (1995):

Gender: Although men can get breast cancer, gender is an obvious risk factor.

Age: The older the woman, the higher her chances of getting breast cancer, with about 80% of cases found in women over 50.

Family history: A woman whose mother, sister, or daughter has had breast cancer faces greater risk. The relative risk is 2.1 if one has had breast cancer; 13.6 if two have had it.

Genetics: About 5% of breast cancers are associated with an alteration of the BRCA1 gene (found in 1/300 to 1/800 of women), which entails a 50% chance of breast cancer by age 50 and over 85% by age 70. Men with BRCA1 are not at increased risk. Alterations in BRCA2 are understood more poorly but are thought to act similarly.

Ethnicity: American or Northern European descent entails greater risk of getting breast cancer. (Black women have a higher risk of dying, should they get breast cancer, but disease prognosis was not considered in this part of the model or the corresponding interview segments.)

Breast health: Previous breast cancer entails a relative risk of 1.8. Atypical hyperplasia entails a relative risk of 4.4. Women with lobular carcinoma in situ have a relative risk of 7.2; for ductal carcinoma in situ (which typically does not progress), the relative risk is 11.0. Benign breast conditions (e.g., lumpy breasts) do not increase breast cancer risk.

Radiation: High-dose radiation increases breast cancer risk.

Endogenous hormonal history: If a woman begins menstruating before 12, her relative risk is 1.2; if she is over 55 at menopause with more than 40 menstruating years, her relative risk is 2.0.
If a woman bears her first child before age 20, her relative risk is 0.8; if she does so in her 30s, the relative risk increases to 1.4. If she never has children, the relative risk is 1.6.

Exogenous hormonal history: Less is known about the effects of interventions, such as birth control pills, estrogen replacement therapy, and DES taken when pregnant or in utero.

Socioeconomic status (SES): Breast cancer has higher incidence among women with more education and higher income.

Behavior: Alcohol consumption (>three drinks/week), smoking, and lack of exercise are suspected as correlates of increased risk. One study found that women over 154 pounds and 5'5" in height had 3.6 times the risk of women under 132 pounds and below 5'3". One estimate held 27% of breast cancers to be attributable to dietary fat.

Environment: Some evidence has implicated pesticides, herbicides, electromagnetic fields, artificial light, organochlorides, hormones, cleaning solvents, and some other factors. However, environmental effects are notoriously difficult to establish, even if relatively large, unless there are clear exposures and a signature cancer (e.g., mesothelioma and asbestos).

Psychosocial: Stress or traumatic life events (e.g., divorce, death of a loved one) may be associated with higher breast cancer risk.

Figure 12.1 was completed in 1997. Since that time, evidence regarding breast cancer risk factors has continued to accumulate. Were the model used to compute breast cancer risks, its parameter estimates could be updated. As a computable model, change is needed only if factors in the model have been entirely removed from consideration—or new ones added. For the qualitative beliefs elicited here, the evaluative standard is, arguably, much the same.

Interview Method

Respondents

Forty-one women were recruited with quota sampling from a national frame. The sample was stratified on income (annual household income above or below $25,000), race (White, Black, non-Black minority), and age (under 40 years, 40–49 years, 50–69 years, 70 years and older). Potential respondents were contacted by phone and asked 12 screening questions about their age, race, income, and breast cancer risk. If eligible, they were invited to participate in return for $20. They were also offered a personal breast cancer risk assessment, based on Gail et al. (1989).
Table 12.1 Reported Sample Characteristics

                                             N     %
Age
  <40 years                                  6    15
  40–49 years                               15    37
  50–69 years                               14    34
  ≥70 years                                  6    15
Race
  White                                     21    51
  Black                                     10    24
  Non-Black minority                        10    24
Annual Household Income
  Over $25,000                              20    49
  Under $25,000                             21    51
Highest Level of Education
  Less than high school                      8    20
  High school (or equivalent)               17    41
  Postsecondary                             16    39
General Health Relative to Other Women of Same Age
  Above average                             16    40
  Average                                   18    45
  Below average                              6    15
Family History of Breast Cancer
  None                                      36    87
  Mother or sister                           2     5
  Mother and/or multiple sisters             3     8
Previous Biopsy History
  None                                      32    77
  One                                        6    15
  Two                                        2     5
  Three or more                              1     3
Mammogram History
  None                                       8    20
  Within past year                          17    41
  Within past two years                      7    17
  Within past three years                    2     5
  Four years or longer                       7    17

Note: Five of six respondents under 40 did not have a previous mammogram. Among respondents aged 40 and older, 32 (91%) had ever had a mammogram, compared to 84.5% for similarly aged women across the United States. 60% of respondents aged 40–49 had had a mammogram in the preceding two years, while 55% of those aged 50 and older had had one in the past year (compared to national averages of 64.5% and 58.4%, respectively) (American Cancer Society, 2000).
Women who had had breast cancer were excluded, assuming that their experience would have significantly altered their beliefs. As seen in Table 12.1, on average, respondents reported being 53 years old, experiencing menarche at 13, and giving birth first at 21 (for the 93% who had done so at all).

Procedure

Each potential respondent was interviewed by telephone by one of two individuals. Both interviewers trained extensively, conducted several pretests, and then were evaluated on 12 female volunteers from a registry of female veterans from northern New England. When they had attained similar response detail and interview length, the study began. All interviews were audio-recorded (with the respondent's informed consent). At the end, the interviewer asked for the respondent's address to send a money order and the risk assessment (if desired) and then thanked her. Interviews ranged from 17 to 150 minutes, averaging 45 minutes. Transcripts were checked to ensure continued consistency.

Interview Protocol

The interview began with the general prompt, "Please tell me everything you know about mammograms." It continued with, "Have you ever had a mammogram? [If so, please tell me about your experience.]" and "Do you intend to get a mammogram in the future? Why or why not? [If so, when?]" When a topic was mentioned, the interviewer checked it on a sheet listing each topic in the expert model. When spontaneous responses ended, the interviewer returned to each topic that had been raised with the prompt, "Can you tell me more about [the topic]?" When a respondent's beliefs appeared exhausted, the interviewer raised those expert model topics that had not been discussed. Questions were worded and ordered to avoid providing information or prompting particular responses. Thus, respondents were asked to explain what a mammogram is and why it is used before being asked about its effectiveness. Nonetheless, there is some risk of novel inferences being generated to meet perceived interviewer expectations. As an additional precaution, interviewers were instructed to move on when a subject said, "I don't know," and appeared to have exhausted her beliefs on a topic. Scripted prompts regarding risk factors were, "Who is at risk for getting breast cancer? Are all women at equal risk for getting breast cancer? Why/why not? Think of women the same age as yourself. Do you think
your risk for breast cancer is lower than, the same as, or higher than the average woman? Why? What sorts of things might cause breast cancer? Is there anything a woman can do to reduce her risk for getting breast cancer? Does mammography reduce the risk for getting breast cancer? [If yes,] How?" Interviewers asked how important each mentioned risk factor was, offering "strong," "moderate," and "somewhat" as response options.

Data Analysis

Coding

Each interview was professionally transcribed and then checked against the audiotape by an interviewer. Transcripts were coded into the expert model, adding new codes as they were encountered. For coding purposes, the transcripts were divided into units, defined as ideas with a beginning and an ending concept. A unit could be a phrase, a sentence, or an entire paragraph. For each unit, the beginning concept, ending concept, and connecting link were given codes in the expert model. Each unit was also coded for accuracy, strength, and direction (increasing or decreasing breast cancer risk).

Reliability

Half of the transcripts (20 of 41) were coded twice for reliability. The overall interrater agreement rates for concepts, links, and link features were 70%, 66%, and 90%, respectively. These scores are consistent with past mental model studies, where reliability has ranged from 69% to 85% (Morgan et al., 2001). Given the large number of codes, there should be little agreement by chance. Indeed, kappa statistics, adjusting for agreement rates expected by chance (Sackett, Haynes, Guyatt, & Tugwell, 1991), were almost identical: .70, .66, and .85, respectively. Kappas of .60 to .80 and .80 to 1.0 have been called "substantial" and "almost perfect," respectively (Landis & Koch, 1977). Code reliability was counted at the most detailed level, resulting in the most conservative estimate. Differences between coders were resolved by discussion, with resolution rules incorporated into the coding guidelines.
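For readers unfamiliar with the statistic, the following sketch shows how chance-corrected agreement of this kind (Cohen's kappa) is computed; the two coders' labels are invented for illustration and are not the study's data.

# Cohen's kappa: observed agreement corrected for the agreement two coders
# would reach by chance, given each coder's own code frequencies.
from collections import Counter

def cohens_kappa(coder1, coder2):
    n = len(coder1)
    observed = sum(a == b for a, b in zip(coder1, coder2)) / n
    counts1, counts2 = Counter(coder1), Counter(coder2)
    # Chance agreement: probability that both coders independently pick
    # the same code, summed over all codes.
    expected = sum(counts1[c] * counts2[c] for c in counts1) / n ** 2
    return (observed - expected) / (1 - expected)

coder1 = ["age", "diet", "family", "age", "radiation", "diet", "age", "family"]
coder2 = ["age", "diet", "family", "diet", "radiation", "diet", "age", "age"]
print(round(cohens_kappa(coder1, coder2), 2))  # 0.65, though raw agreement is 0.75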
Qualitative Representations

Figure 12.2 summarizes these responses at an aggregate level. Commonly mentioned risk factors have bold nodes. Commonly mentioned indicators are grayed, with dashed outlines for ones not in the expert model (Figure 12.1), indicating risks seen by respondents but not experts. Conversely, risk factors and indicators with unbolded nodes were relevant to experts, but not respondents.
Figure 12.2 Aggregate mental model, showing commonly mentioned risk factors (bold nodes) and indicators (grayed nodes). Nodes with dashed outlines do not appear in expert model (Figure 12.1).
Quantitative Measures

Each individual mental model was characterized with six knowledge measures. Five were adopted from Bostrom et al. (1992), with weighted accuracy being new.

Completeness: The percentage of expert model nodes addressed (correctly or not).

Accuracy: The product of completeness and the percentage of beliefs evaluated as correct.

Weighted accuracy: The product of accuracy and importance ratings (leading to higher scores for those who have correct beliefs on the topics most important to them).

Specificity: The ratio of specific (e.g., mother) to general concepts (e.g., family history).

Focus: The number of links to a concept, allowing identification of the most central nodes.

Similarity: The number of lay concepts inside and outside the expert model.
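A minimal computational sketch of the first three measures follows. The data structure (a coded belief as a node, a correctness judgment, and an importance weight) and the way importance enters the weighted-accuracy product are our assumptions for illustration; the chapter does not specify the exact aggregation.

# Sketch of completeness, accuracy, and weighted accuracy for one respondent.
# A coded belief is assumed to be (expert_model_node, is_correct, importance),
# with importance scaled to [0, 1]; both choices are illustrative assumptions.

EXPERT_MODEL_NODES = 20   # hypothetical size of the expert model

def knowledge_measures(coded_beliefs, n_nodes=EXPERT_MODEL_NODES):
    addressed = {node for node, _, _ in coded_beliefs}
    completeness = len(addressed) / n_nodes
    pct_correct = sum(1 for _, ok, _ in coded_beliefs if ok) / len(coded_beliefs)
    accuracy = completeness * pct_correct
    mean_importance = sum(w for _, _, w in coded_beliefs) / len(coded_beliefs)
    weighted_accuracy = accuracy * mean_importance
    return completeness, accuracy, weighted_accuracy

beliefs = [("age", True, 1.0), ("family history", True, 1.0),
           ("diet", False, 0.5), ("radiation", False, 0.5),
           ("exercise", True, 0.25)]
print(knowledge_measures(beliefs))  # roughly (0.25, 0.15, 0.0975)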
Family history was often mentioned and rated as very important. Typically, the references were general (e.g., "it's in the family"), with specific beliefs often incorrect, such as inferring personal breast cancer risk from family members having suffered other forms of cancer (e.g., prostate, colon, abdominal, pancreatic), because "cancer runs in the family." No one raised the critical factor of whether relatives had had premenopausal or bilateral breast cancer. Genes were mentioned only generally, with no one noting the BRCA1/BRCA2 genes.

Ethnicity was seldom mentioned but was rated as important. No one said that women of Hispanic, African, or Asian descent face less risk than women of Northern European or North American descent.

Breast health was often mentioned and rated as very important. Many of the specific beliefs were incorrect. One or more respondents erroneously believed that breast cancer risk increased with having a mole or pimple on the breast, large breasts, dried-up milk in the breasts (while weaning a baby), a previous benign lump, or breast implants. No one mentioned three of the four factors in the expert model: DCIS (ductal carcinoma in situ), atypical hyperplasia, and lesions without atypia (with coders looking for lay equivalents of these medical terms). Many other beliefs were too general to evaluate (e.g., "lumps," "cysts," "blood clots," "tissue gone bad," "irregular breast tissue"), even after prompts to elaborate.

Radiation figured prominently as a moderate risk factor in many respondents' mental models. No one mentioned high-dose radiation (the expert factor). Rather, respondents discussed sources with little known risk, like radiation from mammograms and x-rays (e.g., dental, chest x-ray), with a few citing cumulative radiation doses over a lifetime of small exposures. One woman said that living near a government radioactive waste storage site had increased her breast cancer risk.

Socioeconomic status was mentioned by two respondents, who erroneously believed that lower income meant higher risk. One of them noted that lower income women often lack the means to ensure early detection and subsequent treatment.

Endogenous hormone history was mentioned by most respondents and rated as highly important. Their specific beliefs typically cited hormonal fluctuations unrelated to cancer risk: becoming sexually mature, having menstrual cycles, having children, and going through menopause. No respondent knew that risk
increases with length of menstruation history (i.e., early menarche, late menopause) and the age of first childbirth.

Exogenous hormone history (specifically, oral contraceptives and long-term estrogen replacement therapy) was mentioned by a few respondents, who rated it as moderately important. One woman repeatedly mentioned the combined risk of oral contraceptives and smoking. No one mentioned the (expert model) risk of one's mother having taken DES during pregnancy.

Psychosocial factors figured heavily in a few accounts, where they were rated as very important. Some respondents discussed at length the impacts of negative emotions (e.g., being depressed, having a fatalistic attitude about cancer). Some said that fearing or discussing cancer made it more likely. Indeed, a few seemed reluctant to say the word "cancer" during the interview (despite being recruited to talk about it).

Behavioral factors occupied many respondents, who rated them as very important. Diet issues were especially common, although few respondents cited the issues for which there is at least some (inconclusive) scientific evidence: a "high-fat" diet or not having enough vitamins, fruits, or vegetables (Schwartz, Woloshin, & Welch, 2006). More common were references to diet factors without research support, such as increased risk from eating animal products (red meat, dairy, or flesh from an animal with cancer) and drinking coffee/caffeine. Behavioral risk factors that some respondents correctly identified included smoking, alcohol consumption, insufficient exercise, stress, and an unhealthy lifestyle. Incorrectly cited behavioral risks included sexual stimulation (breast fondling) and violent breast trauma (e.g., a blow to the breast, bruising the breast muscle, being in a car accident with the seat belt on, having the breast squashed in a mammogram machine). Several respondents mentioned breastfeeding, disagreeing over whether it increased or decreased their risk. Those who elaborated sometimes offered theories like those under breast health, such as how insufficient breastfeeding allowed milk to pool in the breast, causing abnormalities, or how activating the breast for milk production activated other harmful cells. One said that insufficient mental exercise would increase risk.

Environmental exposures were often mentioned and rated as very important. The most common were ones at work (chemicals, computer monitors) or home (nearby radioactive waste,
chemical companies). Others were toxins in food and water (e.g., pesticides) and factory emissions.

Respondents also raised risk factors outside the expert model, generally rating them as moderately important. A few mentioned bad karma, in the sense of a woman getting breast cancer as a result of something she had done in a previous lifetime. Some cited the effects of other illnesses (e.g., diabetes) or medical history (e.g., metastasis to the breast). Some cited the body breaking down as it aged (e.g., more deformed cells, a "build up" in the blood, slower or faster metabolism, vulnerable physiology, suppressed immune function). Some respondents believed that actions reducing heart disease risks also reduce cancer risks. Some of these beliefs have support (eating a healthy diet); others do not (eating red meat, not drinking enough red wine). A few respondents erroneously generalized germ theories to breast cancer. One worried about eating pigs or cows with cancer; another about touching someone with cancer (especially a person with skin cancer and dry, flaky skin), living in a house whose previous residents had had cancer (especially if remodeling had stirred cancer into the air), or getting a mammogram in a machine whose previous users had breast cancer.

How often a concept was mentioned was unrelated to how important it was rated (by those mentioning it) (r = .16, ns). Thus, some factors were raised often despite being seen as relatively unimportant (e.g., behaviors), while others were important to the few people who mentioned them (e.g., obesity).

Demographic Differences

Our small, but diverse, sample allows only suggestive analysis of group differences. With six performance scores and four demographic variables (age, race/ethnicity, income, education), the large number of comparisons increases the risk of significant results arising by chance. Setting alpha (for t-tests) at .01, there were no significant results. Other studies have found the beliefs of White respondents to be more consistent with standard biomedical accounts than those of other cultural groups (e.g., Chavez, Hubbell, McMullin, Martinez, & Mishra, 1995). Possibly, the open-ended format of the interviews allowed better communication of lay beliefs than do structured surveys, especially if the latter reflect the perspectives of investigators unlike the respondents (Bruine de Bruin, Fischbeck, Stiber, & Fischhoff, 2002).
Controllability

For each respondent, four performance measures (accuracy, completeness, focus, specificity) were computed separately for controllable and uncontrollable risk factors. Controllability was defined as how well a woman could determine her exposure to a risk factor. For example, family history is uncontrollable, whereas diet is controllable. Respondents' beliefs were significantly more accurate (t[35] = −3.28, p = .002) and more specific (t[23] = −2.52, p = .019) for controllable than for uncontrollable risk factors. Completeness (t[41] = 1.20, p = .24) and focus (t[35] = −1.76, p = .088) scores did not differ. Perhaps people tend to refine their knowledge in areas where it can be put to use.

Individual Mental Models

To give a feeling for the variability in individual mental models, we did median splits according to two of the scores, completeness and accuracy (the classification rule is sketched in code below). Figure 12.3 shows the mental models of individual respondents, representing each of the four archetypes. We now discuss them, reporting suggestive, and necessarily speculative, trends for those in each category:

A sparse mental model (32% of the sample) is incomplete and inaccurate (relative to the rest of the sample). For example, the respondent in Figure 12.3a mentioned only gender and unspecified family history. Respondents producing sparse models tended to be 50–69 years old, non-Black minorities, and low income, with less than a high school degree.

A concentrated mental model (20%) is highly accurate but incomplete, indicating a woman who knows a great deal about a few concepts. Figure 12.3b shows such a woman, who knows a lot about diet but mentions little else, other than related exercise and unspecified family history. Such women tended to be 50–69 years old and Black.

A misinformed mental model (12%) is relatively complete and inaccurate. For example, the woman in Figure 12.3c erroneously saw risks in lumpy breasts, breast trauma, and talking about cancer. The few such respondents tended to be 40–49, Black, and low income.

A knowledgeable mental model (37%) is relatively complete and accurate. The example in Figure 12.3d shows a broad range of concepts and, in some places (e.g., behavior), some depth of understanding. These respondents were disproportionately younger than 50 or over 70, White, and higher income, with at least some college education.
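As promised above, here is a minimal sketch of the median-split classification. The respondent scores are hypothetical, and the convention that a score exactly at the median counts as "high" is an assumption, not a detail reported in the study.

```python
# Sketch of the four-archetype classification via median splits on
# completeness and accuracy (hypothetical scores).

from statistics import median

scores = {  # respondent id -> (completeness, accuracy)
    "r01": (0.15, 0.20),
    "r02": (0.60, 0.30),
    "r03": (0.25, 0.70),
    "r04": (0.55, 0.65),
}

comp_median = median(c for c, _ in scores.values())
acc_median = median(a for _, a in scores.values())

def archetype(completeness, accuracy):
    low_comp = completeness < comp_median
    low_acc = accuracy < acc_median
    if low_comp and low_acc:
        return "sparse"         # incomplete and inaccurate
    if low_comp:
        return "concentrated"   # incomplete but accurate
    if low_acc:
        return "misinformed"    # complete but inaccurate
    return "knowledgeable"      # complete and accurate

for rid, (c, a) in scores.items():
    print(rid, archetype(c, a))
```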
Figure 12.3 Mental models of four women: (a) a sparse model, with few correct beliefs; (b) a concentrated model, with accurate beliefs about limited topics; (c) a misinformed model, with many, often incorrect, beliefs; and (d) a knowledgeable model, with many correct beliefs.
As mentioned, category assignments are relative to other respondents. Compared to the expert model, few respondents had knowledgeable or complete mental models.
Discussion

Women must understand the risk factors associated with breast cancer if they are to manage their personal risks appropriately. That knowledge allows them to determine how great the threat is and what can be done about it. This study has attempted to characterize women's beliefs in a way that identifies critical communication needs.

As summarized in Figure 12.2, women as a whole have many different beliefs about risk factors for breast cancer. Moreover, these beliefs generally focus on factors supported by some scientific evidence (diet, family history, age, environment, and behavior). However, while women tend to know which issues matter, the transcripts reveal that their detailed beliefs are often vague or inaccurate.

One recurrent misconception was associating greater hormonal activity with increasing risk. As a result, some respondents erroneously thought that breast cancer risk was greatest early in women's lives and when they have regular menstrual cycles. These results suggest employing a communication strategy that takes advantage of women's interest in hormonal factors and brings their beliefs in line with the science. The best vehicle would be a simple integrating theory of how hormones affect cancer risk. That same small dose of physiology might also help women understand more of the controversy over hormone replacement therapy and how hormones affect cancer treatment and prevention.

Similarly, although respondents were generally aware of family history as a risk factor, they offered few specifics (e.g., none mentioned having a pre- versus postmenopausal relative with breast cancer). Without specifics, general awareness can cause unwarranted fear (as with those who thought having any cancer in the family increases a woman's risks). Unlike hormones, where there may be some integrative physiological explanation, family, genetic, and ethnicity risk factors are disparate facts. Because lists of unrelated facts are hard to remember, risk communications should focus on integrating the most important ones. For example, it is worth knowing that alterations in the BRCA1 gene entail a large increase in risk but are involved in only a fraction of breast cancers.

Within these general patterns, the interviews showed considerable individual variability, including various erroneous beliefs, held by one
woman or another (e.g., about behavior, diet, contagion). Many involve needless concerns, perhaps prompting ill-advised actions and depleting the energy for more effective ones. Unfounded fears include having bad karma, having large breasts, or not breastfeeding enough. Such idiosyncratic beliefs pose a communication challenge. Debunking them would be irrelevant to most message recipients, some of whom might even misremember whether they heard the belief being asserted as true or contradicted. An inventory of idiosyncratic misconceptions might, however, be useful to health care professionals as beliefs that their clients might hold. Perhaps the best way to eliminate such beliefs is to crowd them out by creating strong, accurate mental models, along with reassurance that "this is all that you need to know." Fortunately for comprehension purposes, but unfortunately for prevention purposes, there are not that many strong risk factors for breast cancer, and even fewer are under a woman's control. The greater accuracy of beliefs about controllable factors suggests a ready market for such information.

Our stratified sampling strategy sought to elicit a wide range of possible beliefs rather than to examine group differences. As a result, little can be inferred from the absence of such differences, pending a structured survey administered to a larger sample. One advantage of the semistructured, open-ended interview protocol is that it can reduce the systematic biases possible with structured surveys, which can favor test-wise individuals like their compilers (Bruine de Bruin & Fischhoff, 2000). Structured surveys drawing on such interviews can use the language observed in them, so as to avoid errors due to language and framing alone. They can also test for the prevalence of misconceptions not suspected before the interviews were conducted, the open format having allowed whatever was on respondents' minds to come out (Bruine de Bruin & Fischhoff, 2000).

If different groups show sustained differences in the language they use, targeted communications would be justified. Those could also take advantage of differences in the most relevant facts. For example, messages targeted at younger women might focus on placing hormonal changes in perspective. If the archetypes in Figure 12.3 had reliable predictors, they might benefit from different strategies, such as building up sparse mental models while refining knowledgeable ones. If prevalence here is any guide, these two categories are the largest (32% and 37%, respectively). Communications targeting those with sparse models would likely be overly simplistic for knowledgeable individuals, whereas those targeting the knowledgeable might overwhelm those with sparse beliefs. In addition to deliberate targeting, women could self-select the tailored communication that most suits them. A compromise might target topics where misconceptions or gaps in knowledge are especially prevalent and important (e.g., Six Common Misperceptions about Breast Cancer Risk or What You Don't Know About Breast Cancer, or even What Science Doesn't Know About Breast Cancer—but Suspects). Any such communications would, of course, have to be empirically evaluated to ensure that they capture the nuances of women's beliefs. The design could take advantage of the full text of the interview transcripts and of other related research.

Table 12.2 contrasts the present results with those found in previous studies. Generally speaking, they are consistent in terms of what people get right (e.g., family history, smoking, age) and wrong (e.g., breast trauma, sexual activity).

Table 12.2 Current Results Compared to Previous Studies

What women get right
Found in previous studies: family history; exogenous hormone history (oral contraceptives, estrogen replacement therapy); smoking; older age; high-fat diet; low exercise; excessive drinking.
Not addressed in previous studies: emotional outcomes; gender; taking care of self; chance.

What women get wrong
Found in previous studies: X-ray radiation; type of diet; breast trauma; younger age, sexual maturity; sexual activity; breast feeding; breast surgery; medical history; benign breast disease, fibrocystic disease, lumpy breasts; other types of cancer in the family; not enough vitamins; caffeine, coffee, tea; food additives; hormonal (risk decreases at menopause); environmental pollution (e.g., pesticides); occupational exposure; stress, negative emotions, attitude; fear of/talking about breast cancer; cancer germs.
Not addressed in previous studies: bad karma; body breaking down; other illnesses (e.g., diabetes); radioactive waste; chemical plants; blood clot in breast; straining; low mental activity; clogged lymph nodes; exposure to sun; cumulative radiation; imbalance in the blood (e.g., dirty blood).

What's missing (knowledge gaps)
Childbearing age; childbearing history; treatment radiation; ethnic descent; SES; genetic status; obesity.

Sources for previous literature: Aiken, Fenaughty, West, Johnson, & Luckett (1995); Balshem (1991); Chavez et al. (1995); Gifford (1986); and Mathews et al. (1994).

As mentioned, women seem to use
hormones as an organizing theory, which generally led them astray. Some see hormonal activity as increasing risk (as breasts develop, during menstrual cycles and childbearing periods, while breastfeeding), prompting the desire for screening at early ages. Such thinking leads women to believe that menopause decreases risk, making mammograms seem less needed at a time when they are most valuable.

As mentioned, our interviews supplement studies using structured surveys by providing the details, phrasing, and underlying rationale of the beliefs revealed in them. Communicators need that detail to offer alternative mental models, so that people can deduce appropriate beliefs rather than having to remember isolated assertions. Beliefs observed here, but not previously identified, include the risks imposed by bad karma, general bad health (the body breaking down), other illnesses (diabetes), hidden problems within the breast (a blood clot), and unhealthy behaviors (straining, low mental activity). Some beliefs reflect known scientific uncertainties (e.g., environmental exposures), whereas others indicate new variations (eating animal products, dairy, or pigs or cows with cancer) or downright confusion (red wine reducing cancer). It is important to monitor lay beliefs, as they change over time, particularly when the science is rapidly changing and communications may not keep up.

Any research project faces a tradeoff between the number and intensity of observations. Mental models interviews emphasize intensity, attempting to elicit individuals' full suite of beliefs on a topic, in their intuitive formulation. Doing so facilitates identifying lay concepts that investigators might not otherwise suspect, as well as language that might connect with lay thinking when used in communications. The richness of respondents' wording is preserved in the transcripts. However, the beliefs are coded into an expert model. Doing so sacrifices some of the qualitative analysis of an ethnographic approach in return for allowing quantitative analysis of belief prevalence and performance measures. The precision of the expert model facilitates reliable coding, more so than would be possible with more subjective categories. A price paid for this intensity is limited sample size, allowing only rough statistical estimates. Thus, the method's results are a complement to structured surveys. They provide context for interpreting past results and a foundation for designing future ones, showing beliefs whose prevalence it is important to assess and the language in which they should be expressed.

A mental models study draws on psychology and decision science, as well as subject matter expertise in the domain. Its contributions to those sciences are in breadth, rather than depth. For example, it does not provide rigorous tests of specific psychological hypotheses or
external validation of computational risk models. However, it does afford a structured way to integrate each with related disciplines, as well as an opportunity to identify the boundary conditions on their phenomena and to discover new ones.
Acknowledgments

This research was supported by a National Research Service Award to Stephanie J. Byram (PHS-05-T32-CA76575-02), a New Investigator Award to Lisa Schwartz and Steve Woloshin (co-PIs) through the U.S. Army Medical Research and Materiel Command Breast Cancer Research Program (DAMD17-96-MM-6712), and the National Science Foundation (SBR95-9521914 & SES-0433152). The authors gratefully acknowledge the support of Wändi Bruine de Bruin, Robyn Dawes, Julie Downs, Paul Fischbeck, Roberta Klatzky, Claire Palmgren, Annette Romain, and Jennifer Winder. The views expressed are those of the authors.
References

Aiken, L. S., Fenaughty, A. M., West, S. G., Johnson, J. J., & Luckett, T. L. (1995). Perceived determinants of risk for breast cancer and the relations among objective risk, perceived risk, and screening behavior over time. Women's Health, 1, 27–50.
American Cancer Society. (2000). Breast cancer facts and figures 1999–2000. Atlanta, GA: Author.
Atman, C. J., Bostrom, A., Fischhoff, B., & Morgan, M. G. (1994). Designing risk communications: Completing and correcting mental models of hazardous processes. Part 1. Risk Analysis, 14, 779–788.
Balshem, M. (1991). Cancer, control, and causality: Talking about cancer in a working class community. American Ethnologist, 18, 152–172.
Bartlett, F. C. (1932). Remembering. Cambridge, UK: Cambridge University Press.
Bostrom, A., Atman, C. J., Fischhoff, B., & Morgan, M. G. (1994). Evaluating risk communications: Completing and correcting mental models of hazardous processes. Part 2. Risk Analysis, 14, 789–798.
Bostrom, A., Fischhoff, B., & Morgan, M. G. (1992). Characterizing mental models of hazardous processes: A methodology and an application to radon. Journal of Social Issues, 48(4), 85–100.
Bruine de Bruin, W., Downs, J. S., & Fischhoff, B. (2007). Adolescents' thinking about the risks and benefits of sexual behavior. In M. Lovett & P. Shah (Eds.), Thinking with data (pp. 421–439). Mahwah, NJ: Erlbaum.
Bruine de Bruin, W., Fischbeck, P. S., Stiber, N. A., & Fischhoff, B. (2002). What number is "fifty-fifty"?: Distributing excessive 50% responses in elicited probabilities. Risk Analysis, 22, 725–735.
Bruine de Bruin, W., & Fischhoff, B. (2000). The effect of question format on beliefs about AIDS. AIDS Education and Prevention, 12, 187–198.
Byram, S. (1998). Breast cancer and mammogram screening: Mental models and quantitative assessments of belief. Unpublished doctoral dissertation, Carnegie Mellon University, Pittsburgh, PA.
Byram, S., Fischhoff, B., Embrey, M., Bruine de Bruin, W., & Thorne, S. (2001). Mental models of women with breast implants regarding local complications. Behavioral Medicine, 27, 4–14.
Casman, E., Fischhoff, B., Palmgren, C., Small, M., & Wu, F. (2000). Integrated risk model of a drinking waterborne cryptosporidiosis outbreak. Risk Analysis, 20, 493–509.
Chavez, L. R., Hubbell, F. A., McMullin, J. M., Martinez, R. G., & Mishra, S. I. (1995). Understanding knowledge and attitudes about breast cancer: A cultural analysis. Archives of Family Medicine, 4, 145–152.
Clemen, R. (1997). Making hard decisions. Belmont, CA: Duxbury.
Cooke, R. M. (1991). Experts in uncertainty: Opinion and subjective probability in science. New York: Oxford University Press.
Dawes, R. (1988). Rational choice in an uncertain world. San Diego, CA: Harcourt Brace Jovanovich.
Dawes, R. (1994). House of cards: Psychology and psychotherapy built on myth. New York: Free Press.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.
Downs, J. S., Murray, P. J., Bruine de Bruin, W., White, J. P., Palmgren, C., & Fischhoff, B. (2004). An interactive video program to reduce adolescent females' STD risk: A randomized controlled trial. Social Science and Medicine, 59, 1561–1572.
Eggers, S. L., & Fischhoff, B. (2004). A defensible claim? Behaviorally realistic evaluation standards. Journal of Public Policy and Marketing, 23, 14–27.
Ericsson, K. A., & Simon, H. A. (1996). Protocol analysis: Verbal reports as data. Cambridge, MA: MIT Press.
Fischhoff, B. (1989). Eliciting knowledge for analytical representation. IEEE Transactions on Systems, Man and Cybernetics, 13, 448–461.
Fischhoff, B. (1992). Giving advice: Decision theory perspectives on sexual assault. American Psychologist, 47, 577–588.
Fischhoff, B. (1999a). What do patients want? Help in making effective choices. Effective Clinical Practice, 2, 198–200.
Fischhoff, B. (1999b). Why (cancer) risk communication can be hard. Journal of the National Cancer Institute Monographs, 25, 7–13.
Fischhoff, B. (2005). Decision research strategies. Health Psychology, 21, S9–S16.
Fischhoff, B., Downs, J., & Bruine de Bruin, W. (1998). Adolescent vulnerability: A framework for behavioral interventions. Applied and Preventive Psychology, 7, 77–94.
Gail, M. H., Brinton, L. A., Byar, D. P., Corle, D. K., Green, S. B., Schairer, C., et al. (1989). Projecting individualized probabilities of developing breast cancer for white females who are being examined annually. Journal of the National Cancer Institute, 81, 1879–1886.
Gawande, A. (2002). Complications: A surgeon's notes on an imperfect science. New York: Henry Holt.
Gentner, D., & Stevens, A. L. (Eds.). (1983). Mental models. Hillsdale, NJ: Erlbaum.
Gifford, S. M. (1986). The meaning of lumps: A case study of the ambiguities of risk. In C. R. Janes (Ed.), Anthropology and epidemiology (pp. 213–246). Dordrecht, The Netherlands: D. Reidel.
Gregory, R., Fischhoff, B., Butte, G., & Thorne, S. (2003). A multi-channel stakeholder consultation process for energy deregulation. Energy Policy, 31, 1291–1299.
Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world (2nd ed.). Thousand Oaks, CA: Sage.
Johnson-Laird, P. N. (1983). Mental models. New York: Cambridge University Press.
Krishnamurti, T. P., Eggers, S. L., & Fischhoff, B. (2006). A behavioral decision approach to regulatory law: An application to teens' contraceptive decision-making competence. Manuscript submitted for publication.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33, 159–174.
Love, S. M., & Lindsey, K. (1995). Dr. Susan Love's breast book (2nd ed.). Reading, MA: Addison-Wesley.
Maharik, M., & Fischhoff, B. (1992). The risks of nuclear energy sources in space: Some activists' perceptions. Risk Analysis, 12, 383–392.
Maibach, E., & Flora, E. J. (1993). Symbolic modeling and cognitive rehearsal: Using video to promote AIDS prevention. Communication Research, 20, 517–545.
Mathews, H. F., Lannin, D. R., & Mitchell, J. P. (1994). Coming to terms with advanced breast cancer: Black women's narratives from Eastern North Carolina. Social Science and Medicine, 38, 789–800.
Morgan, M. G., Fischhoff, B., Bostrom, A., & Atman, C. (2001). Risk communication: The mental models approach. New York: Cambridge University Press.
Morgan, M. G., & Henrion, M. (1990). Uncertainty. New York: Cambridge University Press.
Palmgren, C. R., Morgan, M. G., Bruine de Bruin, W., & Keith, D. W. (2004). Initial public perceptions of deep geological and oceanic disposal of carbon dioxide. Environmental Science and Technology, 38, 6441–6450.
Raiffa, H. (1968). Decision analysis: Introductory lectures on choices under uncertainty. Reading, MA: Addison-Wesley.
Riley, D. M., Small, M. J., & Fischhoff, B. (2000). Modeling methylene chloride exposure-reduction options for home paint-stripper users. Journal of Exposure Analysis and Environmental Epidemiology, 10, 240–250.
Sackett, D., Haynes, R., Guyatt, G., & Tugwell, P. (1991). Clinical epidemiology: A basic science for clinical medicine. Boston: Little, Brown.
Schwartz, L. M., Woloshin, S., & Welch, H. G. (2006, March 14). Fat or fiction? Is there a link between dietary fat and cancer risk? Washington Post, Health Section, 1, 4.
Shaklee, H., & Fischhoff, B. (1990). The psychology of contraceptive surprises: Judging the cumulative risk of contraceptive failure. Journal of Applied Social Psychology, 20, 385–403.
Silverman, E., Woloshin, S., Schwartz, L. M., Byram, S., Welch, H. G., & Fischhoff, B. (2001). Women's views of breast cancer risk and screening mammography: A qualitative interview study. Medical Decision Making, 21, 231–240.
Slovic, P., Fischhoff, B., & Lichtenstein, S. (1978). Accident probabilities and seat-belt usage: A psychological perspective. Accident Analysis and Prevention, 10, 281–285.
Slovic, P., & Lichtenstein, S. (1971). Comparison of Bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior & Human Performance, 6, 649–744.
von Winterfeldt, D., & Edwards, W. (1986). Decision analysis and behavioral research. New York: Cambridge University Press.
13
Groups and the Evolution of Good Stories and Good Choices

Linnda R. Caporael
The study of the human mind is so difficult, so caught in the dilemma of being both the object and the agent of its own study, that it cannot limit its inquiries to ways of thinking that grew out of yesterday's physics. (Jerome Bruner, 1990)

Since Darwinism is, among other things, an account of human origins, is it any wonder that it is expected to carry a moral message? (John Maynard Smith, 1987)

In the late 1980s, Robyn Dawes, John Orbell, and Alfonse van de Kragt had interesting data about contribution in public goods games, but no theory. I had a theory, but no data. The title of our paper spoke to our shared interests: "Selfishness examined: Cooperation in the absence of egoistic incentives" (Caporael, Dawes, Orbell, & van de Kragt, 1989). Happily, their empirical findings were supported by a substantial cross-cultural study of cooperation (Henrich et al., 2005). Not so happily, human evolutionary theory continues to be dominated by the heuristic fiction of the "selfish gene," an outdated view of group selection, and a widely unexamined model of the evolution of human sociality as the product of intergroup conflict (Brewer & Caporael, 2006a).

Our reply to commentators, entitled "Thinking in Sociality," set aside the traditional "objectivist view" of human thought as abstract, disembodied, logical, and ahistorical (Lakoff, 1987), a standpoint from which scientific thinking seems to flow easily without much need for
justification. We adopted an alternative standpoint: that humans "had evolved for being social (and learning what that means in our cultures) and not for doing science, philosophy, or other sorts of critical reasoning and discourse" (Caporael et al., 1989, p. 730). In this view, science is not "natural"; it is hard-won. We should expect, and indeed we find, cognitive limitations. Especially under conditions of uncertainty, these could interact with various constructions, including folk psychological notions of human nature.

In his commentary, Anatol Rapoport (1989) echoed these concerns and explicitly tied them to ethics and social responsibility:

The close relationship between cognitive and ethical evaluations in the social sphere points to the importance of coupling challenges to existing social, political, and economic systems with critiques of the images of society and of human nature that provide rationalizations of these systems. . . . [E]vidence of "sociality" as an important component of human motivation deserves serious attention, and the concept itself deserves further theoretical development accompanied by rigorous experimentation. (p. 720)

This chapter addresses the issues Rapoport identified, although in reverse order. It begins with a brief overview of the theoretical development of an evolutionary model of human sociality, then discusses how pre-Darwinian images and stories protrude into scientific accounts of human evolution, and, finally, generalizes narratives as sources of values for rational choice and rational choice as a strategy for choosing among stories.

The evolutionary model is based on a set of hypothesized core configurations in face-to-face groups. These are posited to be the context, or natural environment, for the evolution of uniquely human mental systems. An important theoretical consequence is that self-interest is constrained in humans: They are unable to survive and reproduce without a face-to-face group, creating conditions of obligatory interdependence. Thus, the issue around human sociality is not whether humans are basically selfish or altruistic (they are both). Rather, it is the problem of coordination, the skillful integration of diverse elements into a harmonious operation, regardless of whether behavior is agonistic or altruistic.

Another implication of the model is a functional distinction between narrative and paradigmatic (i.e., logical–scientific) modes of thought (Bruner, 1986). Dawes (2001) makes a similar distinction in his discussion of the effect of stories on rational judgment. In the core configurations model, these modes of thought serve two different epistemic functions, one attempting to reduce ambiguity and the other attempting
to create social coordination. These two modes of thought typically come together in origin accounts, which normally provide justifications for social relationships as well as for the relationships between people and their habitat (Sanday, 1981). Many familiar features of human origin accounts are anomalous from a scientific point of view, but telling from a narrative one. Science and narrative are distinguishable because they are two different types of epistemic projects. In other words, origin stories are a special case where cognitive and ethical evaluations are closely related and provide rationalizations for existing (desired) social systems (Richards, 1987).

That theme is expanded in the last section, where we meet Rudolf Höss, commandant at Auschwitz and a familiar character in Dawes's work. Dawes's (1988a, b, 2001) consequentialist theory, based on avoiding self-contradictory conclusions, tells us how not to decide, but fails to tell us what to choose. An unavoidable conclusion from a consequentialist theory is that Höss and the Nazis were rational with respect to their criteria (Dawes, 1988a, 2001). I suggest those criteria are embedded in collectively shared stories and offer a comparison with the spontaneously self-organizing rescue of the Danish Jews. Moral individual choices and stories are not necessarily obviated by collective ones, just hampered. We can ignore stories at our own peril or take heart from Dawes's critical approach.

Science, social responsibility, and civil society in an increasingly globalized world require a reflexive, or self-conscious, awareness of both stories and rational choice. Stories, recognized as such, can always be retold. In the absence of such awareness, stories cannot be contested; they become "natural." Dawes's guidelines for rational choice can help us sift through stories, not as sources of evidence, but rather as sources of alternative perspectives and possible outcomes. In the stories we tell, as Dawes suggests in his references to Bertrand Russell's stories, we also find pointers to criteria for choices. In a globalizing world, good stories and good choices are epistemic projects.
Trapped in Social Life

Empirically, humans are unable to reproduce and survive to reproductive age without a group. They lack the natural defenses of most mammalian carnivores, they have a long and labor-intensive juvenile period, and they are utterly dependent on collective knowledge and cooperative information sharing among individuals and between generations. Humans may happily pursue their self-interests, but such pursuits are
constrained by obligatory interdependence (Boyd & Richerson, 1985; Caporael, 2001a; Fiske, 2000; Li, 2003).

From Selfish Genes to Repeated Assemblies

The gene's eye view is at the wrong level of analysis for human evolutionary theorizing. Genes are below the level of the organism, blind to the distinctions between social and asocial creatures (Caporael, 2007). As a result, the same theory serves for any species, from oysters to humans, given what we know, or think we know, about them. Post hoc explanations, called "just-so" stories, are easy, but they offer little predictive traction. As Schaller (2002) points out, psychologists don't need neo-Darwinism to tell them that most people benefit their kin more than they do strangers, and as we pointed out in 1989, the success of the gene's eye view lies in its compatibility with a long tradition of methodological individualism in psychology, the social sciences, and even folk psychology.

Leo Buss (1987), an evolutionary biologist, has shown how multicellular individuals, composed of reproductive gametes and nonreproductive somatic cells that "gave up" reproduction, could evolve. He argues that the giving up of reproduction, an instance of "downward causation" (Campbell, 1990b), can be explained in terms of genetic self-interest, but far more predictive power is gained from considering conflicts and synchronies over different levels of organization, where there exist different types of payoffs, obstacles, path dependencies, and locked-in systems. The similarity between methodological individualism and selfish genes may explain the neglect of multilevel selection theory (Buss, 1987; Sober & Wilson, 1998; Szathmáry & Maynard Smith, 1995) and its emerging integration (called "evo-devo") with a variety of developmental systems perspectives (Oyama, 1985; Hendriks-Jansen, 1996; Lickliter & Honeycutt, 2003). These approaches share a view of biological phenomena as hierarchically organized (e.g., DNA, cells, tissues, organisms, groups) and of organisms as the developmental result of the repeated assembly of various resources. These include genetic and epigenetic resources (e.g., genes, nutrition, oxygen, artifacts, language environment, social roles) that have different temporal scales and cycles of reproduction (Caporael, 2003; Jablonka & Lamb, 2005; Wimsatt, 1999; Wimsatt & Griesemer, in press).1 (Psychologists who have concerns about group selection are referred to Brewer and Caporael, 2006a.)

Such repeated assemblies can be defined with respect to specific research questions at useful and appropriate levels of analysis. This perspective transforms the traditional reductionist view of biology as the foundation for psychology, which in turn serves as the foundation
for culture, usually with the implicit promise that all these pieces will come together again somehow. The point here is that genes alone cannot explain "human nature" because our evolutionary history is a product of the co-evolution of genes, culture, and social structure.

Core Configurations

An alternative to the gene's eye view assumes that interactions of evolving hominids with their environments were likely to be more effective as a group process than as individual processes (Caporael et al., 1989). As group size is limited by ecological resources and the morphological features for exploiting them, selection would result in the evolution of perceptual, affective, and cognitive processes that support the development and maintenance of group membership. Even in the modern world, humans are obligately interdependent, unable to reproduce and survive to reproductive age except in a group context.

The topography of the selective environment for humans is based on a consideration of tasks that are necessary for survival and reproduction and on research about group size (Binford, 2001; Dunbar, 1993; Hassan, 1981; Jarvenpa & Brumbach, 1988; Kelley, 1995; Lee & DeVore, 1968). The model consists of four configurations: dyad, task group, deme (or band), and macrodeme (or macroband). (The term deme, from the Greek demos, is used here in its original sense of a neighborhood unit rather than the biological sense of breeding population. I use band only when referring to hunter–gatherer bands.) A core configuration is the joint function of group size and activity. Configurations provide a context for tasks or activities that are specific to that level of organization. That is, each group configuration affords functional possibilities and coordination problems that do not exist at other levels.

Table 13.1 illustrates the core configuration model, assuming an idealized hierarchically structured foraging group. The tasks listed in the table could be characteristic of foraging groups (e.g., Jarvenpa & Brumbach, 1988) and have analogs in present-day life (cf. Hull, 1988). For example, dyads afford the evolution and development of coordinated body movements, such as those used in facial imitation in the mother–infant dyad, interactional synchrony, and human sexual attraction (Perper, 1985). Configurations are a function of both size and task (e.g., a group of five strangers in an elevator is not a task group; the same group becomes a task group when the situation changes to escaping from an elevator stuck between floors). Human mental systems that evolved in the context of particular core configurations can nevertheless be combined and extended to novel tasks. A surgery team involves highly coordinated interactional
Table 13.1 Core Configurations

Dyad. Group size: 2. Modal tasks: sex; infant interaction with older children and adults. Proper function: microcoordination.
Task group. Group size: 5. Modal tasks: foraging, hunting, gathering; direct interface with habitat. Proper function: distributed cognition.
Deme (band). Group size: 30. Modal tasks: movement from place to place; general processing and maintenance; work group coordination. Proper function: shared construction of reality (includes indigenous psychologies), identity.
Macrodeme (macroband). Group size: 300. Modal tasks: seasonal gathering; exchange of individuals, resources, and information. Proper function: stabilizing and standardizing language.

Note: Core configurations are a function of both size and task. Except for dyads, these numbers should be considered as modal estimates.
synchrony as well as distributed cognition.

The core configuration model differs from more familiar approaches because it takes situated activity rather than inclusive fitness theory as a central theoretical proposition. Inclusive fitness is assumed to be a background condition that cannot be violated. In other words, inclusive fitness is a "push back" assumption, but not a "pull" assumption. A selective advantage of core configurations is the coordination of activity and the acquisition, transmission, and maintenance of information and knowledge. The process is sloppy, however. More often than not, people are unable to know in advance what sorts of knowledge will be useful and how (Campbell, 1960), so rather than a streamlined and efficient mental database, we would expect quite a bit of mind clutter to persist (e.g., irrelevant details of unique or first-time events). Like the occasional genetic mutation, an item sometimes turns out to be useful later. Much of the structuring of mental activity must happen outside the skull. Groups and habitat must both function in scaffolding and structuring the ongoing dynamics of mind and action (Freyd, 1983; Shore, 1996; Twain, cited in Wegner, 2002).

Grouping also has its disadvantages: free riding, the spread of pathogens, failure of subgroup function, and loss of group functionality through natural disaster. When subgroups fail, there can be a loss of capacity to coordinate group-level function and a
subsequent failure of groups to socially and even biologically reproduce (Jarvenpa & Brumbach, 1988).

I elaborated this core configurations model elsewhere (e.g., Caporael, 1997a; Caporael & Baron, 1997; Caporael, 2001c; Brewer & Caporael, 2006b). Here I wish to focus on its significance for understanding human knowledge processes. Increasingly, psychologists and other social scientists recognize that groups have epistemic functions (Hardin & Higgins, 1996; Kruglanski, 1990; Kruglanski, Pierro, Mannetti, & De Grada, 2006). The model sketched in Table 13.1 suggests an initial decomposition of functions. Dyads would appear relevant to nonverbal forms of knowledge. Rhythmicity, gaze, pointing, and nodding might be examples for further investigation. Primates have a considerable repertoire of nonverbal signaling that facilitates complex behavior. Task groups of five or so people appear related to concrete understanding almost literally "on the ground," and macrodemes (collections of related demes) of about 300 people appear related to an overall sense of symbolically created "groupishness" or identity. Demes seem to be in the middle, serving as a clearinghouse for the outputs of other structures. In the following brief discussion of task groups, macrodemes, and demes, I point to some simple analogies that might exist between the situated activities of hunter–gatherers and moderns.

Task group configurations afford (among other possibilities) the evolution of cognitive processes that enable tasks such as perception, classification, inference, and contextually cued responses to be distributed over group members (Cole, Hood, & McDermott, 1980; Hutchins, 1996), particularly when the group is confronted with ambiguous or anomalous environmental information or tasks that need to be done.2 Learning a skilled performance also seems to be more effective in a task group than for an individual (Liang, Moreland, & Argote, 1995). Among our imaginary ancestors, a task group might be a foraging party discussing and interpreting signs of animal movements over a landscape or the edible status of a patch of mushrooms. A modern example might be a research group discussing and interpreting the output of various data collection devices (Amann & Knorr Cetina, 1990) or airport control tower personnel gathering around a radar screen to discuss and interpret signs of possible danger given by ambiguous blips on the screen. There are clear constraints on this kind of distributed cognition. The body itself limits the number of people that can communicate in a meaningful way, as do the spatial constraints of how many people can simultaneously scrutinize an animal dropping, a mushroom, a graph, or a radar screen. By no means do task configurations guarantee good data collection or analysis; we have a considerable literature on social
loafing, polarization, clique selfishness, and other varieties of group dysfunction (cf. Poole & Hollingshead, 2005). Nevertheless, relative to individuals or other core group configurations, the small task group of about five people is the most efficient for serving paradigmatic epistemic functions, for example, doing science, sorting out complicated logical arguments, and giving language or gesture to observations.

In contrast, the macrodeme completes the cycle of biological and social reproduction. A Monte Carlo simulation of paleodemographics by Wobst (1974, cited in Hassan, 1981) indicates that about 175 to 475 people, or 7 to 19 bands of 25 persons each, are needed to maintain genetic viability by providing mates for members reaching sexual maturity in a population. Among many hunter–gatherer groups, related bands used to meet in seasonal gatherings as macrobands to perform rituals, play competitive games, and exchange marriage partners, gifts, and information (Birdsell, 1972). Such groups shared a common language, cosmology, and stories of common descent. Macrobands were probably seasonal because of limitations of resources in local ecologies, but as food storage and agriculture took root, macrobands simply became settlements.

In terms of situated activity, there is little that a macrodeme can actually do (e.g., hold the baby, make a decision).3 Its epistemic function appears to be as a "vehicle of knowledge," particularly across time and space (Campbell, 1997; Caporael, 1997b; Donald, 1991). Seasonal macrodemes imply the capacity to cognitively represent entitative groups beyond day-to-day engagement. That is, the group as a unit interacting with other groups persists even if individual members die or change groups. Macrodemes are the prototypes for modern ethnic groups: geographically extended groups of people who do not necessarily have face-to-face relationships but define themselves as having a common origin and share interaction norms, socialization experiences, and rituals of group identity (Gil-White, 2001; Wells, 1998).

Counterintuitively, the macrodeme may be posited as the crucial environment for the evolution of symbolic language. One does not need more than a couple of grunts to learn how to make a hand ax. That is, there is no reason to suppose that language is particularly important for subsistence, including the cooperative hunting of large prey, or for group living itself. This is not to say that language over an evolutionary scale would not be advantageous to all these activities as well as create new possibilities, but language was probably well along before it became important for instructing the young. A novel advantage of language is the opportunity for vicarious knowledge about distant places and events. It needs to become standardized and stabilized to work across groups.
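The mate-availability logic behind Wobst-style simulations, mentioned above, can be conveyed with a toy sketch. All parameters below (uniform ages, a fixed acceptable age window, a single maturing individual) are hypothetical simplifications; the original model tracked age structure, mortality, and mating rules in far more detail.

```python
# Toy Monte Carlo sketch: how the chance that a maturing person finds
# a suitable mate grows with population size (hypothetical parameters).

import random

def chance_of_finding_mate(pop_size, age_window=3, trials=2000):
    """Estimate the probability that one person reaching maturity (age 15)
    finds an opposite-sex partner of similar age in a random population."""
    found = 0
    for _ in range(trials):
        my_sex, my_age = random.choice("MF"), 15
        others = [(random.choice("MF"), random.randint(0, 60))
                  for _ in range(pop_size - 1)]
        if any(sex != my_sex and abs(age - my_age) <= age_window
               for sex, age in others):
            found += 1
    return found / trials

for size in (25, 75, 175, 300, 475):
    print(size, round(chance_of_finding_mate(size), 2))
```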
Macrodemes would also stabilize and standardize myths, norms, origin accounts, interpretations of experience, folk psychologies, common visual symbols, and forms of artifacts, even as such items need to be changed to accommodate new circumstances. All of these human-built elements contribute to creating a value-saturated environment that scaffolds behavior in some directions and not others. A modern analog to macrobands is the professional meeting, including the annual scientific convention. Young people (for jobs) and information are exchanged, scientific argot is standardized and stabilized, and scientific group identity is affirmed. Awards are made, honors are given, and myths are told of founding fathers and heroic discoveries. As for the task group, there are constraints on what macrodemes can do. In the days before slides and PowerPoint presentations, it would have been impossible for 300 people to simultaneously see an object held in one person's hand.

The deme, or band, is the basic economic unit, the first configuration that can be self-sustaining for survival and child rearing (but not reproduction). Historically, the deme is posited as the staging ground for domestic life, including task group coordination, and for cooperative alliances, which are the basis for fissioning when the community exceeds resources or is fractured by conflict (Olsen, 1987). Demes are the locus of practical skills and a clearinghouse for common knowledge, some of which may be mythical and derived from macrodemes, and some of which may be acutely attuned to local conditions, from detailed knowledge of other people to that of the local ecology. An example of the modern deme would be the scientific workshop, with around 25 to 30 people. On the one hand, it is responsive to the hot issues of disciplinary conventions; on the other hand, it organizes a specific research area and competes with similar groups working on the same topic.

To summarize, the task group is implicated in adjustment to local habitats: the interpretation of intersubjective data and inferences from such data. In the absence of modern technology, constraints are imposed by the size and sensory capacities of the body. Macrodemes are implicated in the creation of meaning and group identity and, with the latter, the panoply of symbols, myths, rules, and rituals that distinguish "we-groups" and legitimate various kinds of cross-group relationships and expectations. Both configurations are engaged in different kinds of action and the creation of different kinds of knowledge and meaning. Task groups would be more closely associated with paradigmatic modes of thought and macrodemes with narrative modes. Face-to-face demes help integrate and coordinate both kinds of knowledge. These epistemic functions are not mutually exclusive. Before the development of complex
mathematics, complex stories must have held a considerable amount of usable scientific–technical information. A good example comes from the detailed analysis of Micronesian navigation in Hutchins (1996).

Humans have dramatically altered their social and material environments over the past 10 millennia, especially in the past 300 years. Clearly, cognitive processes hypothesized to have evolved primarily in the context of face-to-face interaction must be at a level of analysis where they would be capable of being extended, combined, and used in new domains (Caporael, 1997a; Smith & Semin, 2004). The case for this claim begins in action. The physical relations between human bodies and the material world have not been altered. Technology can serve as a bridge between the functions of configurations. A group of 500 people given an order to march on a football field is likely to clump and straggle, but it can hardly avoid keeping time to a marching song broadcast over a loudspeaker. In an airplane, a pilot and copilot fiddling with their controls engage fine hand–eye coordination, a relatively ancient human activity; for 400 strangers to share the close quarters of the plane is a novel component of human flight. What does alter, and not in any perceptible linear or progressive fashion, is the reconceptualization of human possibilities and the invention of human nature.

Epistemic Projects, Social Identity, and Imaginaries

The description of core configurations above implies not only different kinds of activities or exchanges between a group of individuals and a habitat. It also posits different kinds of knowledge production and products that contribute to coordination. We tend to think of knowledge production as distinctively human, occurring in a solitary and disembodied individual mind and then transmitted to others. A naturalized approach to knowledge would need to examine its context (i.e., knowledge is "of something"), not only in the case of humans, but also in the case of other animals. We can describe an epistemic project in a general way as a cooperative process producing knowledge of an interface between a group and its habitat. While it may seem odd to think of nonhuman animals as having epistemic projects, Boinski and Garber (2000) have edited a hefty and fascinating volume on the common and complex problem of how animals move in groups. Coordination involves which individuals decide when and in what direction to travel, how other group members learn of the decision, the purpose or goal of moving, whether the movement is toward prey, and the costs and benefits of movement given the expected outcomes for a particular prey. Some social mammals have core configurations as described above: dyads, daytime feeding and hunting groups, and evening sleeping groups. Often some
Often some kind of recognition enables subgroups to rejoin larger affinity groups. There is also evidence for “social rallies” that facilitate the transition from rest periods to hunting, with synchronization of behavior and a division of labor in which some individuals adopt specialized roles for the hunt depending on the roles that other group members adopt (Holekamp, Boydston, & Smale, 2000). Different species, because they have different morphological and ecological characteristics, have different coordinating means (e.g., scents, sounds, movements that cue action) for producing an interface between group and environment.

Humans have many means of coordination, one of them being social identity. In psychology, social identity is typically studied in the context of contributing resources to group welfare or effecting change in self-concept (Brewer, this volume). Increasingly, however, it is proposed as a fulcrum for human action. Clancey (1997) has argued that the trick to knowing what to do next is “a complex choreography of role, involving a sense of place, and a social identity that conceptually regulates behavior” (p. 24, emphasis in original). Social identity, through redefinitions of self, may operate as a “gear-shift” among core configurations (Caporael, 1997a, 2001c), and groups, as previously mentioned, are also increasingly seen as having necessary epistemic functions. These various hypotheses, interpretations, and empirical studies suggest the evolution of social identity in concert with thinking and doing.

Here I propose that paradigmatic and narrative modes of thought are related to different kinds of epistemic projects, core configurations, and social identities. Establishing the nature of these relationships would take considerable research; Table 13.2 and the following sections are attempts to bring these concepts into a single frame.

Bruner (1986) shows that paradigmatic and narrative modes of thought have distinct operating principles. A well-formed argument and a well-formed narrative are recognizably different. The former is evaluated in terms of rules of logic, science, and mathematics; the latter is evaluated in terms of its ability to make sense of, and reflect on, human or human-like intentions and experience. Paradigmatic modes are expected to lead to factual understanding; narrative modes are expected to be lifelike, to provide different views of subjectivity (although later I suggest that origin narratives serve to constrain different views). Perhaps most notably, causality is understood differently. In paradigmatic modes of thought, the search is for the specific factors or the elimination of competing alternatives in an “x causes y” argument. In the narrative, causality lies in the connection between events. The particular causes in a story are chosen precisely because they make sense of the ending.
Table 13.2 Epistemic Projects and Social Identities

Selective Configuration     Modes of Knowing               Identity
Dyad                        Attunement                     Relational identities/situated activities
Task group                  Paradigmatic
Deme                        Local knowledge integration    Collective identity/imaginaries
Macrodeme                   Narrative
It seems odd that humans would have two such completely different modes of thought. Convention suggests that paradigmatic thought might be an invention beginning with Newton, Descartes, or Galileo; in other words, with the origins of Western science. However, although science as an institution (with methods, social identities, consensual practices, and institutionalized skepticism) makes use of paradigmatic modes of thought, it is not the same thing. Humans and other animals have fairly reliable epistemic processes to reduce uncertainty and ambiguity and facilitate prediction, enabling them to eat, move through the world, find shelter, reproduce, and so on. The origins of paradigmatic thought are prehuman, grounded in perceptual and inferential processes that evolved between organism and expected habitat. For humans, the cognitive properties of task group configurations can be seen as an expansion of the cognitive capacities and situated activities of individuals. It seems more promising to take narrative modes of thought as unprecedented (albeit ancient) in the animal world and consider the fit of narrative to patterns of cognition and activity, and not just language and discourse (cf. Lakoff & Johnson, 1980).

Independent of any discussion of modes of thought, Brewer and her colleagues (Brewer & Gardner, 1996; Brewer & Caporael, 2006b; Brewer & Yuki, 2007) have shown that there are two distinct levels of social identity. Relational identity is based on interpersonal relationships with others and defines distinctive self-representations with different structural properties, bases of self-evaluation, and motivational concerns (see also Kashima & Hardie, 2000; Sedikides & Brewer, 2001). In contrast, collective identity involves depersonalized relationships that exist by virtue of common membership in a symbolic group. Collective identities do not require interpersonal knowledge or face-to-face interaction for coordination; rather, they rely on shared symbols and cognitive representations of the group as a unit.
The different types of configurations may require (and engage) relational and collective identification processes to different degrees and under different circumstances. Relational and collective identities may also correspond to Bruner’s (1986) paradigmatic and narrative modes of thought, particularly with respect to the problem-solving capacities of task groups and the construction of shared collective identities through stories and myths.

Campbell’s (1990a) reinterpretation of Asch’s conformity studies4 suggests how cognitive tasks, identity, and modes of thought might come together. Campbell argued that Asch’s studies should be reinterpreted in view of the dependence we have on others for knowledge. Accordingly, the research design should be seen as creating a structural conflict between trust, the respect we have for each other’s reports, and honesty, our duty to report our observations as we see them. If honesty prevails, the subject reports the line as he sees it and fails in trusting the reports of the confederate group members. If trust prevails, the subject completes the consensus of the group but fails in honesty and self-respect (our duty to integrate our own and others’ reports “without deprecating them or ourselves,” Campbell, 1990a, p. 46). Given Campbell’s interpretation, a face-to-face configuration with a paradigmatic task and task group size could evoke personalized responses although the individuals are complete strangers. (However, a testable alternative would be that honesty and trust can also be attributed to ingroup strangers. Yuki, Maddux, Brewer, & Takemura [2005] found that Americans tended to trust strangers based on collective identity, in contrast to Japanese, whose trust for a stranger depended on having a link to a close other.)

Where the main epistemic tasks in small groups might involve personalized trust and honesty, the epistemic issues in macrodemes seem more concerned with good stories and values, obligations, and duties. Stories, parables, norms, laws, circulated images, and urban legends are among the various forms of such abstract knowledge. Humans tend to naturalize many such bodies of knowledge, making them real (i.e., enabling contingent consequences). For example, witchcraft is common across cultures. Although its role in mediating social life is quite variable, where a belief in witches and supernatural forces is legitimized, the evidence supporting witchcraft accusations can still be logically evaluated at local levels (Caporael, 1976; Evans-Pritchard, 1937).

Among the important modern functions of demes and macrodemes are the construction and negotiation of group identity (Brewer & Gardner, 1996), shared reality (Hardin & Higgins, 1996), social representations (Deaux & Philogène, 2001), controlling stereotypes (Fiske, 1993; Steele & Aronson, 2003), artifacts (Norman, 1990), folk psychologies (Heelas & Lock, 1981; Lillard, 1998), cultural narratives such as founding myths and origin stories, and the gathering of them all into world views, or social imaginaries (Taylor, 2002).
Taylor writes of the social imaginary as the way ordinary people imagine their social surroundings. Imaginaries are collectively shared landscapes that make common practices normative and possible and provide a widely shared sense of legitimacy. His focus is on the superordinate group in the large sense of the “Western social imaginary.” However, the term has been applied to other levels of organization as well. The associations between demic structure, epistemic functions, and identity are rarely mutually exclusive, and one of the remarkable features of modern society is the enormous number of social roles that imaginaries coordinate. Ultimately, modern social imaginaries must be constructed by people who belong to multiple, sometimes overlapping, laterally and hierarchically organized groups, and who themselves possess complex social identities (Brewer & Gardner, 1996; Brewer & Pierce, 2005).

How such stories get started and may even come to dominate in a culture or subculture (often because they come to be viewed as natural) is not immediately obvious. It seems most plausible that demes would be a mediating structure because their epistemic function is a local integration between facts and stories and between relational and collective identities. A successful deme among macrodemes would presumably garner more influence and resources. Thus, for example, what began as an “upstart religion” (e.g., Mormonism) or as a challenge to an existing power structure (e.g., Love Canal) may have started in a task group that expanded into, or was absorbed by, a functioning deme, a group of insiders. Geographic regions may have various overlaps of such modern demic structures. In this view, society is hardly the smooth-textured construct that comes to mind in discussions of individual and society. Rather, in dense populations with multiple-role cultures, “society” has a lumpy texture.

I do not mean to imply that modern societies are simply macrodemes with tall buildings and Internet access. Modern humans are parts of many kinds of groups, some of which could be macrodemes; others, fragments of macrodemes; and still others, group structures that do not fit the kind of categorization suggested here. The essential point is that people do dwell within collectively shared landscapes (e.g., dog fancier communities, professional communities, and national communities), social imaginaries that are held together by a variety of practices, including the telling and retelling of shared narratives that order experiences, beliefs, and social relations. Such uniquely human processes have an evolutionary history that is grounded in repeatedly assembled—and observable—issues of bodies, tasks, group size, and coordination.
The products of these processes can be problematic, however. If narratives of human origins are privileged for the coordination and maintenance of group identity, then, in addition to the traditional burdens of proof that science carries, its accounts of human origins have to be distinct from the “images of society and human nature [that rationalize] existing social, political and economic systems” (Rapoport, 1989, p. 720). Because we are both humans and scientists, we are caught in the dilemma of being both object and agent of our own investigations.
Telling Human Stories

When sociobiology burst onto the scene, it threatened to “cannibalize” the social sciences, psychology, and ethics. Its proponents claimed that neo-Darwinism would explain self-interest, altruism, aggression, coyness, cooperation, knowledge, conflict, reciprocity, and sympathy, among other human attributes (Hamilton, 1975; Trivers, 1971; Wilson, 1975). However, there was scant reference to the existing empirical literature in psychology. This was an odd situation. If evolutionary biologists failed to draw on the scientific literature in psychology, the only other basis for their descriptions, explanations, and claims about human characteristics would have been folk psychology, and according to Crawford (1998), this is exactly the case: Folk aphorisms (“blood is thicker than water”) that are consistent with sociobiology are also “good explanations” for behavior. It is no wonder that the evolutionary biologist John Maynard Smith (1987) fretted that sometimes he could not tell the difference between Darwinism used as myth and Darwinism used as science. To him, science was about possibilities in the world, whereas myth was about values. Trained as scientists, most of us want to get the stories out of science (e.g., Dawes, 2001), a view Maynard Smith himself espoused as a young scientist. However, he later changed his mind, arguing that it was important to distinguish myth from science and to recognize myths as sources of values, which even science needed.

In the previous section, I argued that paradigmatic and narrative modes of thought had different evolutionary origins and produced different kinds of epistemic products—facts and stories (Bruner, 1986). In this section I discuss their overlap, focusing on how narrative protrudes into scientific accounts of human origins. Some would argue that science itself is stories all the way down. Among others, Schank and Abelson (1995) offered a “heavy metal” thesis that all knowledge, memory, and understanding are stories. However, this seems to lose a useful and important distinction: It is a fact, sad but true, that Elvis is dead. Stories of his sightings will not socially reconstruct him. (This statement is broader than I intend: Consider the small industry in Elvis impersonation.)
Origin stories can be problematic because (like witchcraft) we legitimize various distinctions and accounts as descriptive (where from other perspectives we can see they are invented). We wish to explain (make meaning about) descriptions of the present in terms of an ancestral past. We tend to be unaware of, or to discount, the extent to which that past is informed by the social imaginary in which we dwell and the limitations of extant scientific evidence.

Evolutionary Images

As participants in a scientific culture, we inherit a rich imaginary about primitive life from illustrated children’s books, museum dioramas, and even cartoons. Stephanie Moser (1998), an archaeologist who specializes in representations of the past, found that scientists and nonscientists alike share an iconography of evolution with roots traceable to ancient Greek and Roman texts. Although these earliest descriptions were textual, they were also rich in descriptive imagery, highly visual, and the subject of paintings in later centuries. In the 8th century B.C., Hesiod (Works and Days) described the five successive stages of humans. The earliest, people of the Golden Age, were like the gods: They lived without sorrow, hard toil, or grief and feasted merrily until they died peacefully, as if overcome by sleep. From there it went downhill, with increasing violence and war. By the fifth generation (into which Hesiod sorely wished he had not been born), parents and children, guests and hosts, comrades, and even brothers disagreed with each other. Men dishonored their parents and sacked cities. There was no respect for the man who kept his oath, and greed, lies, and exploitation ruled the day.

Seven hundred years later, Lucretius (cited in Moser, 1998) introduced an alternative view. Gradual cultural and technological advances characterized human prehistory. Primitive man was strong, muscular, resistant to heat and cold, and lived in packs like beasts. He ate acorns, berries that ripened in winter, and wild strawberries that were bigger and more plentiful in the distant, pre–first century past. These primal humans had no fire, clothes, foundry skills, laws, binding customs, or recognition of the common good. Mating was not bound by social, legal, or economic ties, but was the result of mutual desire, male lust and violence, or presents of food to the female. Represented in woodcuts, paintings, and drawings, the naked and hairy “wild man of the woods,” with cave and suckling female, persisted through the medieval period.

In the 17th century, primitive man met a nascent scientific interest in fossils and what they might tell people about the past.
Figure 13.1 A typical “primeval family” painting by Gabriel von Max, presented to Ernst Haeckel (1894).
At this time, and up to Darwin’s time, the past was still largely a Biblical one (and as we will see later, this past still haunts modern Darwinism when it is applied to humans), and primitive man even became a part of the visual elements in the expulsion from the Garden of Eden. The basic elements of this iconography of the ancestral past continue to be well known today and include items such as caves, animal skins, dark skin, disheveled hairiness, clubs, males with tools, and females suckling infants. Figure 13.1 is illustrative.
Figure 13.2 An unusually attractive “primeval family” sketched by Worthington G. Smith (1894).
A dark picture, it shows a hirsute couple, fleshy and slack-jawed with unkempt hair, modern bodies, and feet with opposable toes. The female, sitting in the dirt, nurses an infant, and behind her stands a coarse, dark man, clinging bent-kneed to a branch at the cave entrance. We might quibble about the opposable toes, but it is not an image that seems particularly out of place, although it is over 100 years old. Its scientific basis is a fossilized partial skullcap discovered in Indonesia, and it received the discriminating approval of the well-known German biologist Ernst Haeckel in 1894.

Our familiar iconography of the ancestral past becomes visible in its absence or rearrangement. Figure 13.2 is a particularly surprising image, also published in 1894. It depicts three naked, attractive, and very modern-seeming people sitting on a log. A bearded man, between the two women, sits comfortably cross-legged, as if he were swinging his foot. One woman, seated near a small group of stone tools, rests her chin in her hands, elbows on knees. A stone knife is bound around her waist. The prototypical club leans on the log, not far from the other female, and everyone is having a good hair day. The British illustrator Worthington G. Smith (1894) was well known for discovering and cataloging a large collection of flint tools, but he had never found a human fossil. Charming as the image was, and challenging to stereotypes about ancient humans, it clearly fails to follow the usual form.

The vivid imagery of the ancient past continues into our modern scientific period.
In the ensuing 100 years, details from scientific evidence might be added to a display, but icons of the past continued to be reproduced. The familiar images became an obstacle to alternative visions of the past. Even so simple and ordinary a representation as a male holding a toddler in his arms would be a remarkable deviation from the iconographical canon.

Conceptual Structures for Human Origins

In 1984 Misia Landau made a telling critique of historical accounts in paleoanthropology. Highly regarded and well-known scenarios of human evolution differed in details but had the form of a heroic folktale or myth (Landau, 1991; Propp, 1968). “Jack and the Beanstalk” is a classic example. The tale passes through a number of familiar and ordered “functions,” or actions, which are predictable elements of story structure (e.g., the ordinary-person-turned-hero, tests to be overcome, a magic gift, triumph and the winning of the prize, and often the final dénouement). The actions move the story along to a narrative turning point (in evolutionary scenarios, the “prime mover” or “initial kick” transforming the hominid to human), a climax, and sometimes even an anticlimax about the hero’s downfall. The sequential placement of events implies temporal relations, turning points, progress, and conclusions, as stories do, but also causal connections, as if there were no alternatives or even more complicated causal factors than those in the tale. The entire production of a story floats on a tacit understanding of shared historical and cultural context (readily provided, as we have seen, by images of the past). Partly because of such tacit assumptions, narratives can be economical, relying on listeners and readers to fill in the blanks, and can carry multiple meanings, typically including a prescriptive moral. The various components of the story—whether in a folktale or a scenario—are chosen to make sense of the ending—known, of course, from the beginning of the tale.

As in earlier accounts, texts evoke a rich and familiar imagery. Barkow (1989, pp. 144–145) begins with human ancestors leaving their tree homes to take advantage of a mosaic environment of woodlands, trees, and savannas. The ancestors become terrestrial because food-bearing trees tend to be unevenly distributed in clumps, and walking between groves was necessary and, hence, favored by selection. Bipedalism may have been the preferred mode because it was energy efficient, or because it freed “the hands to carry food to females as part of courtship, or to carry it to home bases and the offspring therein” (p. 145). A “Promethean process” would have been necessary to give culture an initial kick. The kick, or turning point, in the narrative is “autopredation,” intergroup conflict that culls the slower and weaker and the poor of hearing and vision, and thus selects for strength, disease resistance, and better vision and hearing among the winners.
“It has been conjectured,” intones Barkow, “that we were our own predators and preyed on ourselves” (p. 146). In Barkow’s story, the obstacle, autopredation, is overcome by “groups genetically likely to produce individuals who had more ‘cultural capacity,’” that is, the ability to make weapons, grasp strategies and tactics, foresee contingencies, and cooperate in groups. The anticlimax completes the narrative form. Barkow claims the modern capacity for culture leads to maladaptive minds and culture—individual decisions and cultural practices that result in the “capacity to make fitness-reducing choices” (p. 295).

Political and social concerns can shape the search for prime movers and influence the categories invoked for constructing origin stories. In the early years of the Cold War, a prominent paleontologist proposed cannibalism as a prime mover that would explain the “blood-bespattered, slaughter-gutted archive” of human history from ancient Egypt to World War II (Dart, 1953). In the 1960s, the corporatist “Man the Hunter” (Lee & DeVore, 1968) replaced the “killer ape.” With the rise of the women’s movement, female gathering and the invention of tools for provisioning offspring were proposed (Tanner & Zihlman, 1976), as was the counter-proposition, monogamy (“food-for-sex”; Lovejoy, 1981). Globalization has produced new prime movers: free trade and coalitional arms races (Horan, Bulte, & Shogren, 2005; Flinn, Geary, & Ward, 2005). Such searches for a prime mover should be viewed with suspicion. They signal not only a familiar narrative structure, but also a nonevolutionary mode of constructing scenarios, more similar to special creation stories than to consideration of the embodied processes of variation, development, selective retention, and recursion over spans of geological time.

In practice, primitive society is invented by inversion and projection (Kuper, 1988). Inversion, the assumption that the past is the opposite of the present, was a common technique before ethnology and archaeology were systematically developed (Kuper, 1988). Inversion is also related to the creation of outgroups, more primitive and less civilized than the ingroup and less deserving of membership in a moral community. Much of our culturally inherited imagery of the past is invented by inversion. The notion that man has lost his natural environment is also an instance of inversion, comparing the primitive to the modern and technological. In contrast, projection, assuming the past is like the present, has been more common in recent decades. Using projection, we assume that our ancestors were more like us, and we just have to imagine what we would do if we were ancient hunter–gatherers (or their genes).
A clear illustration of projection is the popular argument that males in ancestral environments who preferred nubile females outreproduced males who lacked nubility-preferring adaptations (Symons, 1992). This might make sense if mating were occurring in a large, dense population with lots of singles bars and gyms. However, the ecological context of evolving humans would have been small, separated populations. We could just as well argue that males with no preferences, that is, a willingness to copulate with just about anything, would outreproduce the picky men. The paleontologist Robert Foley (1996) calls projection a “hindsight method” that views hominid species in terms of what they became—modern humans—rather than in terms of what they were in their specific temporal and ecological contexts.

Of course, not all evolutionary scenarios have the features of folktales, inversion, or projection. For example, Foley’s (1987, pp. 158–162, 176–183) account of the origins of bipedalism is based on consideration of ecology, morphology, and alternatives. Foley compares the habitats available to primates that feed by standing on all fours on top of branches or by hanging below branches, crossed with the various forms of primate locomotion (e.g., knuckle-walking, tree swinging). In his account, there are various options for locomotion, with bipedalism being not particularly attractive, but at least available in an otherwise competitive environment.

After Landau’s (1984) work, paleoanthropologists, who still had to construct scenarios about their data, were more circumspect and certainly far less vivid in their writing. Not much later, Latour and Strum (1986) recommended that the alternative to storytelling was a logical analysis that started by selecting a single currency and sticking to it. Inclusive fitness theory seemed to fit the bill. A language constrained to genotypes, phenotypes, and natural selection hardly suggests a narrative plot. Nevertheless, the technical language is far from innocent, and far from being cut de novo from the whole cloth of evolutionary theory. Human evolutionary theories in general (Bowler, 1986), and the gene’s eye view imported from biology more specifically, piggyback on pre-Darwinian elements of Anglo-American religion (Dupré, 1987; Midgley, 1987; Oyama, 1985) and its familiar web of meanings. Like the immortal soul, the immortal strands of DNA are the essence of the person, somehow more real or true than the ephemeral body, the vessel of the soul, or the phenotype, the vehicle of the genes. None of this would matter much except that it makes for a greased slide between theoretical facts and narrative meaning (Table 13.3).

Evolutionists often identify their work as providing an alternative to William Paley’s (1743–1805) famous argument for a natural theology: Just as the wondrous complexity of a watch implies a watchmaker, so does the wonderful complexity of organisms imply an intelligent designer.
However, Darwin’s “descent with modification” provided an alternative to design: The natural selection of complex characters is a result of the variable and recursive relations between organisms and their environment; it is not a cause. There is no need to invoke design, much less a designer. Yet Dawkins (1987), well known for his antireligious sentiments, constructed natural selection as a religious parallel. Like the divine watchmaker, natural selection as the blind watchmaker also creates, favors, and selects optimal designs. Much as a minister urges his congregation to resist the tyranny of temptation, Dawkins (1976) urged his readers to “rebel against the tyranny of the selfish replicators [genes]” (p. 215). The philosopher Daniel Dennett (1995) echoed this sentiment: “There is a persisting tension between the biological imperatives of our genes on the one hand and the cultural imperatives of our memes [‘infectious ideas’] on the other, but we would be foolish to ‘side with’ our genes…” (p. 365). We humans “can rise above the imperatives of our genes thanks to our memes” (p. 365) and “rebel against the tyranny of the selfish replicators” (p. 471).

George C. Williams (1989), renowned for his work on natural selection and adaptation (Williams, 1966), defended Dawkins against a charge made by Günter Stent, a prominent microbiologist, that rebelling against one’s genes was a biological absurdity. Williams seriously wrote that “whole technologies such as hair dyeing and cosmetic surgery [are] based on attempts of individuals to correct perceived flaws in development controlled by their own genotypes” (1989, p. 214). He also maintained that the only connection between a person today and a person 200 years ago is genetic. Language, knowledge, architecture, and culture are absent from an evolutionary imaginary where genes are the primary agents. More recently, self-help books have been written to provide guidance for rebelling against our Mean Genes (Burnham & Phelan, 2000) and to aid us in the search for Darwinian Happiness (Grinde, 2002).

Thinking in Sociality

Evolutionary scenarios are forms of scientific hypotheses, so we expect them to be consistent with the spirit and practice of science. However, as the discussion above suggests, evolutionary scenarios are more similar in form to narrative modes of thought than to paradigmatic modes of thought. Most origin accounts do not provide scientific explanation so much as they require more research and explanation. In general, origin accounts are ubiquitous across cultures (Brown, 1991; Bruner, 1986), are closely linked to ethnopsychologies (Bruner, 1990; Lillard, 1998), and mediate gender relationships and change reflexively in response to novel ecological demands (Sanday, 1981).
A variety of factors are likely to be intertwined in the persistence and appeal of particular origin accounts, scientific or otherwise. Some of these factors may be related to the narrative mode itself, others to human cognition, and still others to the social functions of narrative accounts.

To some extent, there may be no alternatives to some of the problematic techniques of scenario construction. Nelson (1993, 2003) proposes that autobiographical memory, which emerges from an adaptive mammalian memory-learning system, “underlies all our storytelling, history-making narrative activities, and ultimately, all of our accumulated knowledge systems” (1993, p. 12). Similarly, Landau (1984) argues that subverting narrative structure may be impossible. Even a research report begins with a description of the current state of the art, shifts to the action of the methods section, rises to the turning points of the results, and climaxes with a discussion that includes a final irony—more research is needed. The content, whether of folktale or scenario, typically appears as scientific jargon, but everyday life provides the easily recognizable distinctions and analytic categories: marriage (mating), money (resource acquisition), status (dominance hierarchy), divorce (mate desertion), the “terrible twos” (weaning conflict), adolescent rebellion (parent–offspring conflict), and menopause (declining reproductive value). These are what Schank and Abelson (1995) call story skeletons: summary concepts, often conveyed with a single representative word or phrase. Once chosen, the word or phrase gives the essence of a story. Story skeletons influence the acceptance of confirming evidence and the rejection of contradictory evidence. Not surprisingly, the “evidence” for projective images of the past is all around us, since the present was the source. (Notice that projective images of the past fail to include phenomena such as maternal or infant mortality, which have been largely eliminated as a characteristic of modern daily life.) Summary concepts adopted from everyday discourse impose their own social values on problem choice, methodology, and data in science; conversely, science can transform the valence of the categories, usually by “naturalizing” them.

Evolutionary scenarios invite vivid imagery about long-gone worlds; they deal with fundamental questions about human nature, relationships, and values; they reduce critical thinking; and they attract easy belief (Gilbert, 1991). Although scenarios are not single-authored works in the sense of a vignette or novel, Green and Brock’s research (Green & Brock, 2000; Green, Strange, & Brock, 2002) suggests that scenarios might have considerable power to persuade by transporting readers to a time and place in our shared imaginaries of an ancestral past.
If narrative itself is inescapable, the best we might be able to do under some circumstances is to be acutely aware of the problems and on the lookout for solutions. Cognitive, cultural, and methodological factors sustaining traditional scenarios may be more amenable to intervention. Despite their extensive training, scientists are not cognitively infallible. Human cognition is not abstract, ahistorical, and logical—indeed, the raison d’être of evolutionary studies is to show precisely that. Rarely, however, are scientific products evaluated with respect to the intrusion of fallible heuristics (cf. Shweder & D’Andrade, 1980). Common cognitive phenomena such as belief perseverance, theory-driven data perception, and overconfidence in judgment are potential distortions in reconstructions. Scenario thinking creates a link between present and past: The more detailed a scenario, the better the story, and the easier it is to construct, the more probable it is judged to be (Dawes, 1988b; Kahneman, Slovic, & Tversky, 1982). The difficulty of thinking in terms of what is not available may cause some cultural beliefs to be reified. For example, to demonstrate that cooperation is based on selfish incentives, one must show that cooperation does not occur in the absence of incentives (Caporael et al., 1989). Part of the persistence of evolutionary traditions may arise simply from the way human cognitive machinery works (Kahneman et al., 1982; Nisbett & Ross, 1980).

In addition to the structure of narratives and the way minds work, factors having to do with social functions help explain the peculiarities of evolutionary scenarios (Caporael & Brewer, 1991). Origin myths and beliefs about human nature are foundations for social imaginaries. Origin accounts both reflect and shape expectations and social values, and they explain and justify existing social arrangements (De Vos & Romanucci-Ross, 1975; Shore, 1996). Origin accounts aid in the social coordination of a people over changing circumstances. Based on a study of 156 cultures, Sanday found that origin stories helped people within a culture “to solve the puzzle of sex differences by sorting out how and why the differences came about, what is to be done about the differences and how the two kinds of people resulting from the differences are to relate to one another and to their environment” (1981, p. 4). Within recognizable boundaries of characters and events, origin accounts can be altered to track changing circumstances. When environmental conditions changed (including the social conditions of colonization and its aftermaths), subtle changes in origin myths also occurred.

The same phenomena can be seen in Darwinian accounts. After the emergence of Japanese economic power in the 1980s, rankings based on “racial” categories, once scientifically defunct, were reintroduced with “Mongoloids” at the top of the ranks (cf. Fairchild, 1991).
The entry of women into the workplace and a decline in family income preceded a revision of parental investment theory, so that “males should prefer females [with] above-average ability to produce and control resources” (emphasis added; Barkow, 1989, p. 326). The last example also illustrates a common double-framing in Darwinian language as simultaneously predictive and prescriptive. Evolutionary theory also provides a rationalization for stigmatization: Given limited time and attention, “one should not want to spend one’s affiliation time with those who have little of value to offer in terms of skills or economic or social resources…” (Kurzban & Leary, 2001, p. 194). These categories include the poor, the homeless, the elderly, and the disabled, those lacking economic and social capital who “might activate systems that induce [others] to systematically exclude” them. Without denying the occurrence of stigmatization, we can question the relevance of these modern-day categories to the “ancestral past” and the use of evolutionary theory as a novel way to justify “blaming the victim” for causing their own victimization.5

Origin narratives are value-laden, easily available to the members of a culture, accommodating to changing circumstances, and resistant to scientific falsification. They are prone to well-known heuristics and biases regarding accessibility, representativeness, and scenario effects. Like other historical accounts, they provide an important arena for understanding differences between novice and expert cognition, including the psychology of counterfactuals (Tetlock & Henik, 2005), the need for closure (Kruglanski et al., 2006), and the impact of science on social values (Brem, Ranney, & Schindel, 2003).
Inventing Human Nature

In the previous sections, I offered a model of core configurations and suggested a correspondence, moderated by social identity, between it and two modes of thought, scientific and narrative (Bruner, 1986). Both are significant for the coordination of groups. Science tells us about possible outcomes of choice; stories create values, climates of opinion, justifications, blame and innocence, and the need for rationales. Although we (scientists and citizens in scientific cultures) like to keep these separate, they often overlap (Latour, 1992). It seems reasonable that we should try to eliminate stories, or at least keep them confined to entertaining and escapist fantasies. However, stories are ubiquitous; they have a significant role in teaching and communicating. Their effectiveness in creating groups, as well as in transitioning between group identities, indicates that the level of effort that has been applied to understanding paradigmatic thinking also needs to be applied to understanding narrative thinking (Bruner, 1986).
The role of narrative in human affairs, and its relationship to paradigmatic modes of thought (Tetlock et al., 2000), has been seriously underestimated. Bruner (1986) recognized that scientists may temporarily fill gaps in their knowledge with narratives and wrote that these are quickly replaced when the relevant research is done. It is less clear how narratives actually drive research or change our social imaginaries—which are the context for our scientific theories about human cognition and behavior. In other words, social imaginaries (and the origin stories they draw on) legitimate some accounts of human nature, including origin accounts that distinguish us from them.

The Individual Story: Rudolf Höss and Cold Logic

“I will begin by describing a rational man, Rudolph Hoess, the commandant of Auschwitz for approximately three years. Under his command, 2,900,000 people were murdered” (Dawes, 1988a, p. 20).6 It is no surprise that Rudolf Höss (1996) is a recurrent character in Robyn Dawes’ writing on rational choice (e.g., 1988b, 2001). Höss’s grip on the imagination is by way of a coherence view of rationality. The coherence view tells us how not to make a decision, but it tells us nothing about what decisions to make in the first place. So it is that Höss’s choices were rational with respect to “what he believed to be politically correct goals… determined by his belief system” (Dawes, 2001, p. 39; unless otherwise noted, all following references are also to Dawes, 2001). The same is true for the leadership of the SS, which set out in the Wannsee Protocol a horrifying yet rational agenda, complete with the evaluation of resources, alternatives, and detailed criteria for implementing the “final solution.” The minutes also set out a discussion of degrees of relationship to Jewish kin based on genealogical and marriage relationships. Decision rules based on these assessments could lead to death, forced labor camps, or forced sterilization. A possible “final remnant” who did not die in forced labor would be the most “resistant” and would “have to be treated accordingly, because it is the product of natural selection and would, if released, act as a the [sic] seed of a new Jewish revival…” (Wannsee Protocol, 1942).7

Höss is a foil to a popular argument that the horrors of Nazism, Fascism, and Communism in the 20th century were not the work of sick, evil people, of irrational passions and hatreds, or of emotions and impulses out of control. Dawes writes,
Much as we would wish to find some irrationality in the basic tenets of Nazism, there is a certain cold logic to it, provided we accept the premise that the suffering of the individual human is absolutely of no importance—particularly as opposed to “group élan”—and that the fate of a particular human being is no more important than that of a cobweb in the corner of the room. We add to these premises that human history is meaningful only in terms of struggles between national and ethnic groups, where the group that emerges as the winner is culturally superior, and the members of that group that loses should naturally be made slaves. (2001, pp. 38–39)

It is tempting to see this claim as ironic, yet no matter how the point is put, it remains the same: Rational choice can be put into the service of any goal, no matter how repugnant. How then is it possible to make good choices? Dawes gives us a start with recommendations at the individual level. Later I will connect them to stories at the collective level.

Dawes suggests that Höss was possibly irrational in one respect: making an attribution error. Like his superiors and other Nazis, Höss suppressed his feelings (an aversion to personal murder or to the sobbing grief of mothers watching their children go to their deaths) for the sake of Nazi ideals: It is “not proper” to be influenced by emotion, particularly “feminine” emotions such as empathy or sympathy. He viewed the uncaring attitudes of Jewish prisoners as signs of a “racial characteristic.” However, Höss had privileged access to his own reasons for seeming “emotionless,” and, when observing equally emotionless prisoners, he failed to assume that they, too, might have private reasons for suppressing their emotions.

Dawes makes two points about Höss’s choices. The first is by way of Bertrand Russell’s stories of young love. By eschewing emotions such as love, Höss and other Nazis adopted a disastrous view of rationality as part of their belief system. However, such “pure thought irrationality” can be destroyed by factors, such as love, that lie outside the reasoning system. Second, Höss could have avoided the attribution error by assuming that the Jews were like him, that is, by projecting his own feelings of hidden pity onto those condemned to die at Auschwitz.

The starting points for good choices, then, are love and empathy. As given, however, they are incomplete. They are recommendations at the individual level and, in particular, an individual level devoid of context. Höss, like all of us, lived inside a larger story: Nazism created a social imaginary—a collectively shared landscape that makes common practices normative and possible and provides a widely shared sense of legitimacy (cf. Taylor, 2002).
Being moved to tears does not mean we can infer that Höss would have known what action to take. No doubt there were ways he could have avoided the attribution error, but probably at the cost of being shot and replaced. There were, however, Germans who did take considerable risks to help rescue Jews during the war. In explaining their own behavior, these rescuers claimed that anyone would naturally do what they had done (Dawes, 2001). Objectively, rescuers “did not use their own behavior as a cue as to what others actually did…what these rescuers did, however, was to use their own feelings as a cue to the initial ethical and human response people would have to others in such dire need of help” (p. 153). Dawes identifies this psychological process as projection and writes that it is “exactly what underlies the Golden Rule for treating others as we would like them to treat us. It makes sense only if we assume that we can make a valid inference that other people will react the same way we do when treated in certain ways. For us to decide that people should be treated the same way that we ourselves would wish to be treated, we must decide that other people are hurt by what hurts us, pleased by what pleases us, and so on” (p. 150, emphasis in original).

This discussion of Höss’s attribution error and of rescuers projecting their feelings takes place in a chapter entitled “Connecting Ourselves with Others Without Recourse to a Good Story” (Dawes, 2001).8 The Golden Rule, a heuristic guide—a principle—for rational choice, provides the “story-less connection.” However, I have already suggested that stories are fundamental to human coordination, including shared conceptions of values, and that a story-less connection would be something of a surprise. In the case of Golden Rules, a collective story is embedded in the ambiguity about who this us is. All major world religions (as well as many less major ones) have a Golden Rule, but to whom it applies when the rubber hits the road is quite elastic. We know people project more strongly to ingroups than to outgroups (Robbins & Krueger, 2005). In some religions, the Golden Rule may simply be about doing unto “others” whose personal or group identity remains vague. In other interpretations, the Rule refers to “neighbor” or “brother,” but these relations do not necessarily generate the kind of universal other-concerning feelings that Dawes grants to projection. For example, American Quakers are rightly known for their leadership in the fight against slavery and for their antiviolence activism. Nevertheless, Quakers have a history of racism (“fit for freedom; not for friendship”), and today’s Quakers are engaged in reversing the effects of the past segregated “black bench” in meeting (New England Yearly Meeting Ministry and Counsel Working Party on Racism, 2002).
The example of the Quakers is especially interesting because it reflects the distinction between relational and group collectivism (Brewer & Gardner, 1996; Brewer & Chen, 2007). Nineteenth-century Quakers recognized—and defended at great cost—that they shared with slaves a universal collective identity as members of a human community, even if, like most Americans of their time, they rejected a relational identity based on interpersonal bonds. (In no way should this recollection of past attitudes be taken to diminish the significant contributions of Quaker leadership in American life.) Not just neighbor or brother is specified; many Golden Rules refer to “man” or “fellowman.” It does not seem that anyone had women in mind when Golden Rules were invented (MacKinnon, 2006), and many world religions tolerate wife-beating and other forms of violence against women (Adelman, 2001) despite preaching the importance of their version of the Golden Rule.

The perils of relying on the Golden Rule as projection can be particularly acute in a globalizing world. How does a Westerner understand, by using projection, the familial shame that leads a Kurdish family to murder a daughter and call it an “honor” killing? Individuals may practice their own particular “moral aesthetics” (Carrithers, 2005), that is, their personal feelings and awareness in ongoing situations. Sometimes these may be projective and other times not. Reemtsma (cited in Carrithers, 2005) offers two examples. One is a German psychiatrist who refused to participate in a sterilization program for the mentally handicapped. Her superior tried to motivate her by drawing attention to how different the mentally retarded are from her. She replied to him, “There are many people who are not like me, in the first place, you.” A similar recourse to difference came from another woman (who would put her family at risk by hiding a Jewish girl), who invoked the Biblical story of the Good Samaritan. She did not hold up the Good Samaritan as a model, but rather argued to her husband that he would not want to be like the first two travelers who passed by without helping a man who had been beaten and left to die. In these cases, there may be stories, but they are invoked at personal levels. Carrithers (2005) refers to these as improvisations. The Danes offer an interesting contrast because of their spontaneous collective action to save the country’s Jews. In a sense, the Danes did “project” on the Jews, but they had an enabling story (Buckser, 2001), not a Golden Rule.

The Collective Story: The Danish Rescuers

The Danish occupation as a German “model protectorate” began in 1940 and ended with the imposition of martial law in 1943. In October the German attempt to round up Copenhagen’s 7,000 Jews was thwarted, with overwhelming support throughout the country.
As word of the German plan was leaked, Jews were aided by the initially spontaneous action of thousands of rescuers, from the highest to the lowest social strata. In the first few days, before any organization for rescue, friends, neighbors, acquaintances, and even complete strangers helped with widespread, individual actions. Later, the roundups were denounced by the state church, the Supreme Court, and local organizations; the police force, which had fought against the Danish resistance under the Germans, also refused to take action against the Jews. An entire village cooperated in organizing and hiding a massive escape by boat to Sweden.

Buckser (2001) provides a recent analysis of the Danish rescue. Traditionally, social scientists have offered a variety of explanations, which fall into two main categories: mechanics (e.g., the location of Denmark, a turn in the tide of war) and motivations (e.g., the identification of assimilated “Viking Jews” with the cultural and business life of Copenhagen, and Danish democratic values). While these certainly contribute to an understanding of the Danish rescue, there are still important slips and gaps. To take one example, the Viking Jews were a minority of the Jewish population; thousands of other Jews had fled pogroms and persecution in eastern Europe and Russia early in the century, and as recently as the 1930s, a new wave of refugees arrived. The Jews of Denmark were a variable group, with a considerable number who spoke no Danish, had no income, and were visibly different from the rest of the population. None of this mattered to the rescuers. However, while anti-Semitism was probably lower in Denmark than elsewhere in Europe, it existed before the war, among Danes in Sweden during the war (many of whom had been in the Danish resistance and fled to Sweden to avoid arrest), and briefly after the war.

Dawes (2001) makes a digression that is useful here. When we have simple alternatives to consider, it matters psychologically which alternative we take to be figure—that is, what needs an explanation—and which we take to be ground, that is (to use his term), what we think is “natural.” Buckser (2001) makes just this move. Usually social scientists take the actions of the Danish rescuers as the figure to be explained and the Jews as background, a generic minority group that could stand in for any other minority group. Buckser turns this around and asks, what is it about the Jews of Denmark that explains Danish behavior?

The Danes had a good story. Buckser (2001) argues that the Jews were symbolic of Danish identity during the German occupation, a time when the Danes turned to a nationalist culture and philosophy, Grundtvigism, which had previously transformed Danish politics, agriculture, education, and identity.9
The occupation recalled for many the earlier conflicts with Prussia, in which a Danish majority in the region of northern Schleswig struggled to sustain Danish cultural identity based on Grundtvigian precepts, which centered on a romantic ideal of Danish rural life, customs, and ideals; community solidarity and self-sufficiency; and a lively conversational, intellectual, and cultural life. The Schleswig Danes did more than just celebrate Danish identity, dialects, folklore, symbols, and heroes. They ostracized German language teachers, founded libraries and clubs, built their own schools where possible, and, when that became impossible, sent their children to folk high schools in Denmark. Eventually, the Prussians clamped down, outlawed Danish schools, flooded the area with Germans, and prosecuted some Danish leaders.

The struggle to maintain cultural identity—one’s language, customs, and rituals—despite persecution resonated with the Danish construction of Grundtvigism (Buckser, 2001). The Jews had maintained their own culture, preserved their language, and built their own schools in ways that Grundtvig would have approved. Even the notion of the “eternal Jew” evoked a sense of permanence and collective identity that recalled the “folk spirit” of the Grundtvigian ideal.

And the predicament of the Jews during the war mirrored that which the Grundtvigians imagined for the Danes. Both found themselves without territory or arms, under the heel of a Nazi regime that wanted to eradicate their independent existence, with their own traditions and group spirit their only real resources. (Buckser, 2001, p. 21)

The success of the Jews through the ages testified that Danes could prevail as well. Neither the Jews nor the Danes could be overthrown, went the story of “folk spirit,” because theirs were nations of spirit, not of might.

Two ironies, here oversimplified, could be noted. The first is that both Denmark and Germany had similar origin stories: romanticized ideals of rural life and folk spirit. The Danish folkeånd and the German völkisch ideal glorified Nordic heritage, folkloric arts, individualism, and self-sufficiency, and both promoted institutions that built widely popular and deeply nationalist ideologies (Buckser, 2001; Kurlander, 2002). The communal myths were similar, but Danish and Nazi leaders revised them differently in the years leading up to the war. The second irony is that our “good story” based on Grundtvigian ideals began with Grundtvig’s “blistering critique” in 1810 of Enlightenment ideology, which “had robbed the [Lutheran] church of its soul, replacing its emotional appeal with a dry system of logic and science” (Buckser, 2001, p. 16).
Conclusion: Good Stories and Good Choices

Humans face an odd dilemma. There is no doubt that we have an evolutionary history, as do the other life forms of our planet. However, partly as a result of that evolutionary history, we fill in the inevitable gaps of our knowledge with stories. The core configurations model offers an alternative to folk psychology, imaginaries of the ancestral past, and the various manifestations of egoistic incentive theory, including selfish genes. The model focuses on situated activity in face-to-face groups and the inevitable tensions between an accurate comprehension of the world (given by our evolved morphology interacting with the limited spectrum available to it) and constructions of the world that glue individuals into groups, generation to generation (Caporael, 2001b)—all in the name of reproduction and survival to reproductive age. The heart of the dilemma of being both object and agent of our own studies is that it is human nature to invent human nature. Rationality without stories does not guarantee particularly desirable outcomes (as Rudolf Höss and Nazi Germany indicate), and stories without rationality can trap science into self-referential circles (as our evolutionary imaginaries do).

Not all stories are about using the past to explain the present, however. Another important category of stories is those for imagining the future. Most of the time, the stories about the future are surprisingly like those about the past—in one way or another, both draw on the present. More than the past, however, stories of the future lend themselves to imagining alternative possibilities and, hence, open a way for rational choice among them. Keane (2003) suggests an ethics for a global civil society that might contribute to socially responsible consequentialist choices. He argues that an emergent global civil society "is a logical and institutional precondition for the survival and flourishing of a genuine plurality of different ideals and forms of life" (p. 202). The cultivation of "imparative reasoning" (from the Latin imparare), which he describes as the "learned art of learning through exposure to otherness" (p. 205), could usher in a new kind of story—or better, new conversations—toward a reconceptualization of human possibilities. Some elements of this new conversation include compromise, persuasion by better arguments, the appropriation of others' practices, the commitment to sustaining diverse ways of life, and the explicit recognition that global civil society is part of a "natural tendency" to harmony. Global civil society is a "militant ethic" in that it is intolerant of intolerant opponents (specifically those using violence, including terrorists, out-of-control corporations, rogue nations, or universalizing nations
or religions), yet it requires an anonymous trust despite the different moralities and supporting stories that need (and are needed by) global civil society in order to persist.

In this paper, I have suggested that Bruner's (1986) two modes of thought, narrative and paradigmatic, are products of the evolution (and development) of mind and body at individual and group levels. They are associated with different kinds of actions, relationships, and identities. In some circumstances, such as science and other modes of critical inquiry, it is important for these two modes of thought to be distinct; in other circumstances, such as social responsibility and civic action, their intertwining is essential. Narratives are a fact of human life, and the best ones have many options, as Dawes points out. Nevertheless, we should expect them to be inconsistent and to require rational analysis for coherence. For many domains of action—and rationality—narratives delimit and constrain possibilities. Dawes attributes this to automatic thinking and incomplete specification, and he is right for many cases. In many others, however, beliefs and expectations can be genuinely different in different cultures. There still will be possibilities where we can draw on the human ability to recognize rationality (Dawes, 2001), but social responsibility and civil society in an increasingly globalized world require an acute, self-conscious awareness of both stories and rational choice. If stories are the source of criteria for rational choice, it behooves us to have good ones, which brings up the last question: What's a good story? It's debatable.
Acknowledgments

I am grateful for the support of Dean John Harrington and the School of Humanities and Social Sciences at Rensselaer for sabbatical leave, and also for the generous hospitality of the Konrad Lorenz Institute for Evolution and Cognition Research and its scientific director, Dr. Werner Callebaut. Professor Dan Rogers provided invaluable assistance with the interpretation and deeper historical understanding of the Wannsee Conference minutes. Barbara Katz provided generous help and discussion about the Quakers' contributions to the antislavery movement in America. I also thank Marilynn Brewer and Robyn Dawes for their advice and critique.
References

Adelman, H. T. (2001). Review of the book Silence is deadly: Judaism confronts wifebeating. Shofar, 19, 144–146.
Amann, K., & Knorr Cetina, K. (1990). The fixation of visual evidence. In M. Lynch (Ed.), Representation in scientific practice (pp. 85–121). Cambridge, MA: MIT Press.
Barkow, J. H. (1989). Darwin, sex, and status. Toronto, Canada: University of Toronto Press.
Binford, L. (2001). Constructing frames of reference: An analytical method for archaeological theory building using hunter-gatherer and environmental data sets. Berkeley: University of California Press.
Birdsell, J. B. (1972). An introduction to the new physical anthropology. Chicago: Rand McNally.
Boinski, S., & Garber, P. A. (Eds.). (2000). On the move: How and why animals travel in groups. Chicago: University of Chicago Press.
Bowler, P. J. (1986). Theories of human evolution: A century of debate, 1844–1944. Baltimore, MD: Johns Hopkins University Press.
Boyd, R., & Richerson, P. J. (1985). Culture and the evolutionary process. Chicago: University of Chicago Press.
Brem, S. K., Ranney, M., & Schindel, J. (2003). Perceived consequences of evolution: College students perceive negative personal and social impact in evolutionary theory. Science Education, 87, 181–206.
Brewer, M. B., & Caporael, L. R. (2006a). An evolutionary perspective on social identity: Revisiting groups. In J. A. Simpson, M. Schaller, & D. T. Kenrick (Eds.), Evolution and social psychology (pp. 143–161). Philadelphia: Psychology Press.
Brewer, M. B., & Caporael, L. R. (2006b). Social identity motives in evolutionary perspective. In R. Brown & D. Capozza (Eds.), Social identities: Motivational, emotional, cultural influences. Philadelphia: Psychology Press.
Brewer, M. B., & Chen, Y.-R. (2007). Where (who) are collectives in collectivism? Toward conceptual clarification of individualism and collectivism. Psychological Review, 114, 133–151.
Brewer, M. B., & Gardner, W. (1996). Who is this "We"? Levels of collective identity and self representation. Journal of Personality and Social Psychology, 71, 83–93.
Brewer, M. B., & Pierce, K. P. (2005). Social identity complexity and outgroup tolerance. Personality and Social Psychology Bulletin, 13, 1–10.
Brewer, M. B., & Yuki, M. (2007). Culture and social identity. In S. Kitayama & D. Cohen (Eds.), Handbook of cultural psychology (pp. 307–322). New York: Guilford Press.
Brown, D. E. (1991). Human universals. Philadelphia: Temple University Press.
Bruner, J. S. (1986). Actual minds, possible worlds. Cambridge, MA: Harvard University Press.
Bruner, J. S. (1990). Acts of meaning. Cambridge, MA: Harvard University Press.
Buckser, A. (2001). Rescue and cultural context during the Holocaust: Grundtvigian nationalism and the rescue of the Danish Jews. Shofar, 19, 1–25.
Buller, D. J. (2005). Adapting minds: Evolutionary psychology and the persistent quest for human nature. Cambridge, MA: MIT Press.
Burnham, T., & Phelan, J. (2000). Mean genes. Cambridge, MA: Perseus.
Buss, L. W. (1987). The evolution of individuality. Princeton, NJ: Princeton University Press.
Campbell, D. T. (1960). Blind variation and selective retention in creative thought as in other knowledge processes. Psychological Review, 67, 380–400.
Campbell, D. T. (1990a). Asch's moral epistemology for socially shared knowledge. In I. Rock (Ed.), The legacy of Solomon Asch (pp. 39–55). Hillsdale, NJ: Erlbaum.
Campbell, D. T. (1990b). Levels of organization, downward causation, and the selection-theory approach to evolutionary epistemology. In E. Tobach & G. Greenberg (Eds.), Scientific methodology in the study of mind: Evolutionary epistemology (pp. 1–15). Hillsdale, NJ: Lawrence Erlbaum.
Campbell, D. T. (1997). From evolutionary epistemology via selection theory to a sociology of scientific validity. Evolution and Cognition, 3, 5–38.
Caporael, L. R. (1976). Ergotism: The Satan loosed in Salem? Science, 192, 21–26.
Caporael, L. R. (1997a). The evolution of truly social cognition: The core configurations model. Personality and Social Psychology Review, 1, 276–298.
Caporael, L. R. (1997b). Vehicles of knowledge: Artifacts and social groups. Evolution and Cognition, 3, 39–43.
Caporael, L. R. (2001a). Evolutionary psychology: Toward a unifying theory and a hybrid science. Annual Review of Psychology, 52, 607–628.
Caporael, L. R. (2001b). Natural tensions: Realism and constructivism. In D. L. Hull & C. M. Heyes (Eds.), Selection theory and social construction: The evolutionary naturalistic epistemology of Donald T. Campbell. Albany: SUNY Press.
Caporael, L. R. (2003). Repeated assembly. In S. J. Scher & F. Rauscher (Eds.), Evolutionary psychology: Alternative approaches (pp. 71–90). Boston: Kluwer Academic.
Caporael, L. R. (2007). Evolutionary theory for social and cultural psychology. In E. T. Higgins & A. Kruglanski (Eds.), Social psychology: Handbook of basic principles (pp. 1–13). New York: Guilford Press.
Caporael, L. R., & Baron, R. M. (1997). Groups as the mind's natural environment. In J. Simpson & D. Kenrick (Eds.), Evolutionary social psychology (pp. 317–343). Hillsdale, NJ: Lawrence Erlbaum.
Caporael, L. R., & Brewer, M. B. (1991). The quest for human nature. Journal of Social Issues, 47, 1–9.
Caporael, L. R., Dawes, R. M., Orbell, J. M., & van de Kragt, A. J. C. (1989). Selfishness examined: Cooperation in the absence of egoistic incentives. Behavioral and Brain Sciences, 12, 683–739.
Caporael, L. R., Dawes, R. M., Orbell, J. M., & van de Kragt, A. J. C. (1989). Thinking in sociality. Behavioral and Brain Sciences, 12, 727–739.
Carrithers, M. (2005). Anthropology as a moral science of possibilities. Current Anthropology, 46, 433–456.
Clancey, W. J. (1997). Situated cognition. Cambridge, UK: Cambridge University Press.
Cole, M., Hood, L., & McDermott, R. (1980). Ecological niche picking. In U. Neisser (Ed.), Memory observed (pp. 366–373). San Francisco: Freeman.
Cowan, N. (2000). The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behavioral and Brain Sciences, 24, 87–185.
Crawford, C. (1998). The theory of evolution in the study of human behavior: An introduction and overview. In C. Crawford & D. Krebs (Eds.), Handbook of evolutionary psychology (pp. 1–41). Mahwah, NJ: Erlbaum.
Dart, R. (1953). The predatory transition from ape to man. International Anthropological and Linguistic Review, 1, 201–218.
Dawes, R. M. (1988a). Plato vs. Russell: Hoess and the relevance of cognitive psychology. Religious Humanism, 22, 20–26.
Dawes, R. M. (1988b). Rational choice in an uncertain world. New York: Harcourt Brace Jovanovich.
Dawes, R. M. (2001). Everyday irrationality. Boulder, CO: Westview.
Dawkins, R. (1976). The selfish gene. New York: Oxford University Press.
Dawkins, R. (1987). The blind watchmaker. New York: Norton.
Deaux, K., & Philogène, G. (Eds.). (2001). Representations of the social: Bridging theoretical traditions. Oxford, UK: Blackwell.
Dennett, D. C. (1995). Darwin's dangerous idea. New York: Simon & Schuster.
De Vos, G., & Romanucci-Ross, L. (1975). Ethnicity: Vessel of meaning and emblem of contrast. In G. De Vos & L. Romanucci-Ross (Eds.), Ethnic identity (pp. 363–390). Palo Alto, CA: Mayfield.
Donald, M. (1991). Origins of the modern mind. Cambridge, MA: Harvard University Press.
Dunbar, R. I. M. (1993). Coevolution of neocortical size, group size and language in humans. Behavioral and Brain Sciences, 16, 681–735.
Dupré, J. (1987). Human kinds. In J. Dupré (Ed.), The latest on the best (pp. 327–348). Cambridge, MA: MIT Press.
Evans-Pritchard, E. E. (1937). Witchcraft, oracles, and magic among the Azande. Oxford, UK: Clarendon.
Fairchild, H. H. (1991). Scientific racism: The cloak of objectivity. Journal of Social Issues, 47, 101–115.
Fiske, A. P. (2000). Complementarity theory: Why human social capacities evolved to require cultural complements. Personality and Social Psychology Review, 4, 76–94.
Fiske, S. T. (1993). Controlling other people: The impact of power on stereotyping. American Psychologist, 48, 621–628.
Flinn, M. V., Geary, D. C., & Ward, C. V. (2005). Ecological dominance, social competition, and coalitionary arms races: Why humans evolved extraordinary intelligence. Evolution and Human Behavior, 26, 10–46.
Foley, R. (1987). Another unique species: Patterns of human evolutionary ecology. New York: Longman/John Wiley.
Foley, R. (1996). The adaptive legacy of human evolution: A search for the environment of evolutionary adaptedness. Evolutionary Anthropology, 4, 194–203.
Freyd, J. (1983). Shareability: The social psychology of epistemology. Cognitive Science, 7, 191–210.
Gilbert, D. T. (1991). How mental systems believe. American Psychologist, 46, 107–119.
Gil-White, F. J. (2001). Are ethnic groups biological "species" to the human brain? Current Anthropology, 42, 515–554.
Gould, S. J. (1997). Darwinian fundamentalism. New York Review of Books, 44(10), 34–37.
Green, M. C., & Brock, T. C. (2000). The role of transportation in the persuasiveness of public narratives. Journal of Personality and Social Psychology, 79, 701–721.
Green, M. C., Strange, J. J., & Brock, T. C. (Eds.). (2002). Narrative impact: Social and cognitive foundations. Mahwah, NJ: Erlbaum.
Grinde, B. (2002). Darwinian happiness: Evolution as a guide for living and understanding human behavior. Princeton, NJ: Darwin Press.
Haeckel, E. (1894). Natürliche Schöpfungsgeschichte. Berlin: G. Reimer.
Hamilton, W. D. (1975). Innate social aptitudes of man. In R. Fox (Ed.), Biosocial anthropology (pp. 133–155). New York: Wiley.
Hardin, C. D., & Higgins, E. T. (1996). Shared reality: How social verification makes the subjective objective. In R. M. Sorrentino & E. T. Higgins (Eds.), Handbook of motivation and cognition: Vol. 3. The interpersonal context (pp. 28–84). New York: Guilford Press.
Hassan, F. A. (1981). Demographic archaeology. New York: Academic Press.
Heelas, P., & Lock, A. (Eds.). (1981). Indigenous psychologies. New York: Academic Press.
Hendriks-Jansen, H. (1996). Catching ourselves in the act. Cambridge, MA: MIT Press.
Henrich, J., Boyd, R., Bowles, S., Camerer, C., Fehr, E., Gintis, H., et al. (2005). "Economic man" in cross-cultural perspective: Behavioral experiments in 15 small-scale societies. Behavioral and Brain Sciences, 28, 795–855.
Holekamp, K. E., Boydston, E. E., & Smale, L. (2000). Group travel in social carnivores. In S. Boinski & P. A. Garber (Eds.), On the move: How and why animals travel in groups (pp. 587–627). Chicago: University of Chicago Press.
Horan, R. D., Bulte, E., & Shogren, J. F. (2005). How trade saved humanity from biological exclusion: An economic theory of Neanderthal extinction. Journal of Economic Behavior and Organization, 58, 1–29.
Horowitz, D. L. (2001). The deadly ethnic riot. Berkeley: University of California Press.
Höss, R. (1996). Death dealer: The memoirs of the SS kommandant at Auschwitz. Cambridge, MA: Da Capo.
Hull, D. L. (1988). Science as a process. Chicago: University of Chicago Press.
Hutchins, E. (1996). Cognition in the wild. Cambridge, MA: MIT Press.
Jablonka, E., & Lamb, M. J. (2005). Evolution in four dimensions: Genetic, epigenetic, behavioral and symbolic variation in the history of life. Cambridge, MA: MIT Press.
Jarvenpa, R., & Brumbach, H. (1988). Socio-spatial organization and decision-making processes: Observations from the Chipewyan. American Anthropologist, 90, 598–618.
Kahneman, D., Slovic, P., & Tversky, A. (Eds.). (1982). Judgment under uncertainty: Heuristics and biases. New York: Cambridge University Press.
Kashima, E., & Hardie, E. A. (2000). The development and validation of the Relational, Individual, and Collective Self-Aspects (RIC) Scale. Asian Journal of Social Psychology, 3, 19–48.
Keane, J. (2003). Global civil society? New York: Cambridge University Press.
Kelly, R. L. (1995). The foraging spectrum: Diversity in hunter-gatherer lifeways. Washington, DC: Smithsonian Institution Press.
Kruglanski, A. W. (1990). Lay epistemic theory in social-cognitive psychology. Psychological Inquiry, 1, 181–197.
Kruglanski, A. W., Pierro, A., Mannetti, L., & De Grada, E. (2006). Groups as epistemic providers: Need for closure and the unfolding of group centrism. Psychological Review, 113, 84–100.
Kuper, A. (1988). The invention of primitive society: Transformations of an illusion. New York: Routledge.
Kurlander, E. (2002). The rise of völkisch-nationalism and the decline of German liberalism: A comparison of liberal political cultures in Schleswig-Holstein and Silesia 1912–1924. European Review of History, 9, 23–36.
Kurzban, R., & Leary, M. R. (2001). Evolutionary origins of stigmatization: The functions of social exclusion. Psychological Bulletin, 127, 187–208.
Lakoff, G. (1987). Women, fire and dangerous things. Chicago: University of Chicago Press.
Lakoff, G., & Johnson, M. (1980). Metaphors we live by. Chicago: University of Chicago Press.
Landau, M. (1984). Human evolution as narrative. American Scientist, 72, 262–268.
Landau, M. (1991). Narratives of human evolution. New Haven, CT: Yale University Press.
Latour, B. (1992). We have never been modern. Cambridge, MA: Harvard University Press.
Latour, B., & Strum, S. C. (1986). Human social origins: Oh please, tell us another story. Journal of Social and Biological Structures, 9, 169–187.
Lee, R. B., & DeVore, I. (Eds.). (1968). Man the hunter. Chicago: Aldine.
Li, S.-C. (2003). Biocultural orchestration of developmental plasticity across levels: The interplay of biology and culture in shaping the mind and behavior across the life span. Psychological Bulletin, 129, 171–194.
Liang, D. W., Moreland, R., & Argote, L. (1995). Group versus individual training and group performance: The mediating role of transactive memory. Personality and Social Psychology Bulletin, 21, 384–393.
Lickliter, R., & Honeycutt, H. (2003). Developmental dynamics: Toward a biologically plausible evolutionary psychology. Psychological Bulletin, 129, 819–835.
Lillard, A. (1998). Ethnopsychologies: Cultural variations in theories of mind. Psychological Bulletin, 123, 3–32.
Lovejoy, C. O. (1981). The origins of man. Science, 211, 341–350.
MacKinnon, C. A. (2006). Are women human? Cambridge, MA: Harvard University Press.
Maynard Smith, J. (1987). Science and myth. In N. Eldredge (Ed.), The Natural History reader in evolution (pp. 222–229). New York: Columbia University Press. (Originally published 1984.)
Midgley, M. (1987). Evolution as a religion: A comparison of prophecies. Retrieved April 6, 2005, from http://www.aaas.org/spp/dser/evolution/perspectives/midgley.shtml (Originally published in Zygon, 22, 179–194.)
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81–97.
Moser, S. (1998). Ancestral images: The iconography of human origins. Ithaca, NY: Cornell University Press.
Nelson, K. (1993). The psychological and social origins of autobiographical memory. Psychological Science, 4, 7–14.
Nelson, K. (2003). Self and social functions: Individual autobiographical memory and collective narrative. Memory, 11, 125–136.
New England Yearly Meeting Ministry and Counsel Working Party on Racism. (2002). Fit for freedom, not for friendship. Retrieved June 4, 2006, from http://www.neym.org/ministryandcounsel/racism/freedom_not_friendship.html
Newman, S. A. (2003). The fall and rise of systems biology: Recovering from a half-century of gene binge. GeneWatch, July–August, 8–12.
Nisbett, R., & Ross, L. (1980). Human inference: Strategies and shortcomings of social judgment. Englewood Cliffs, NJ: Prentice-Hall.
Norman, D. A. (1990). The design of everyday things. New York: Doubleday. (Originally published as The psychology of everyday things, Basic Books, 1988.)
Olsen, C. L. (1987). The demography of colony fission from 1878–1970 among the Hutterites of North America. American Anthropologist, 89, 823–837.
Oyama, S. (1985). The ontogeny of information. New York: Cambridge University Press.
Perper, T. (1985). Sex signals: The biology of love. Philadelphia: ISI Press.
Poole, M. S., & Hollingshead, A. B. (Eds.). (2005). Theories of small groups. Thousand Oaks, CA: Sage.
Propp, V. I. (1968). Morphology of the folktale (L. Scott, Trans.). Austin: University of Texas Press.
Rapoport, A. (1989). Egoistic incentive: A hypothesis or an ideological tenet? Behavioral and Brain Sciences, 12, 719–720.
Richards, R. J. (1987). Darwin and the emergence of evolutionary theories of mind and behavior. Chicago: University of Chicago Press.
Robbins, J. M., & Krueger, J. I. (2005). Social projection to ingroups and outgroups: A review and meta-analysis. Personality and Social Psychology Review, 9, 32–47.
Ross, L., Greene, D., & House, P. (1977). The "false consensus effect": An egocentric bias in social perception and attribution processes. Journal of Experimental Social Psychology, 13, 279–301.
Sanday, P. R. (1981). Female power and male dominance: On the origins of sexual inequality. New York: Cambridge University Press.
Schaller, M. (2002). The evidentiary standard of special design is a little bit like heaven. Behavioral and Brain Sciences, 25, 526–527.
Schank, R. C., & Abelson, R. P. (1995). Knowledge and memory: The real story. In R. S. Wyer (Ed.), Advances in social cognition (Vol. 8, pp. 1–86). Hillsdale, NJ: Erlbaum.
Sedikides, C., & Brewer, M. B. (Eds.). (2001). Individual self, relational self, and collective self. Philadelphia: Psychology Press.
Shore, B. (1996). Culture in mind. New York: Oxford University Press.
Shweder, R. A., & D'Andrade, R. G. (1980). The systematic distortion hypothesis. In R. A. Shweder (Ed.), Fallible judgment in behavioral research (New directions for methodology of social and behavioral science, No. 4, pp. 37–58). San Francisco: Jossey-Bass.
Smith, B. H. (2006). Scandalous knowledge: Science, truth and the human. Durham, NC: Duke University Press.
Smith, E. R., & Semin, G. R. (2004). Socially situated cognition: Cognition in its social context. Advances in Experimental Social Psychology, 36, 53–117.
Smith, W. G. (1894). Man, the primeval savage. London: E. Stanford.
Sober, E., & Wilson, D. S. (1998). Unto others: The evolution and psychology of unselfish behavior. Cambridge, MA: Harvard University Press.
Steele, C. M., & Aronson, J. (1995). Stereotype threat and the intellectual test performance of African Americans. Journal of Personality and Social Psychology, 69, 797–811.
Symons, D. (1992). On the use and misuse of Darwinism in the study of human behavior. In J. H. Barkow, L. Cosmides, & J. Tooby (Eds.), The adapted mind (pp. 137–159). New York: Oxford University Press.
Szathmáry, E., & Maynard Smith, J. (1995). The major evolutionary transitions. Nature, 374, 227–232.
Tanner, N., & Zihlman, A. L. (1976). Women in evolution. Part I. Innovation and selection in human origins. Signs, 1, 585–608.
Taylor, C. (2002). Modern social imaginaries. Public Culture, 14, 91–124.
Tetlock, P. E., & Henik, E. (2005). Theory- versus imagination-driven thinking about historical counterfactuals: Are we prisoners of our preconceptions? In D. R. Mandel, D. J. Hilton, & P. Catellani (Eds.), The psychology of counterfactual thinking. New York: Routledge.
Tetlock, P. E., Kristel, O. V., Elson, S. B., Green, M. C., & Lerner, J. S. (2000). The psychology of the unthinkable: Taboo trade-offs, forbidden base rates, and heretical counterfactuals. Journal of Personality and Social Psychology, 78, 853–870.
Tooby, J., & Cosmides, L. (2005). The conceptual foundations of evolutionary psychology. In D. M. Buss (Ed.), The handbook of evolutionary psychology (pp. 5–67). Hoboken, NJ: Wiley.
Trivers, R. L. (1971). The evolution of reciprocal altruism. Quarterly Review of Biology, 46, 35–57.
Wannsee Protocol. (1942). Retrieved August 8, 2006, from http://www.writing.upenn.edu/~afilreis/Holocaust/wansee-transcript.html
Wegner, D. M. (2002). The illusion of conscious will. Cambridge, MA: MIT Press.
Wells, P. S. (1998). Identity and material culture in the later prehistory of central Europe. Journal of Archaeological Research, 6, 239–298.
Williams, G. C. (1966). Adaptation and natural selection. Princeton, NJ: Princeton University Press.
Williams, G. C. (1989). A sociobiological expansion of evolution and ethics. In J. Paradis & G. C. Williams (Eds.), Evolution and ethics (pp. 179–214). Princeton, NJ: Princeton University Press.
Wilson, E. O. (1975). Sociobiology: The new synthesis. Cambridge, MA: Harvard University Press.
Wimsatt, W. C. (1999). Generativity, entrenchment, evolution, and innateness. In V. Hardcastle (Ed.), Biology meets psychology (pp. 139–179). Cambridge, MA: MIT Press.
Wimsatt, W. C., & Griesemer, J. R. (2007). Re-producing entrenchments to scaffold culture: How to re-develop cultural evolution. In R. Sansom & R. N. Brandon (Eds.), Integrating evolution and development: From theory to practice (pp. 227–323). Cambridge, MA: MIT Press.
Yuki, M., Maddux, W. W., Brewer, M. B., & Takemura, K. (2005). Cross-cultural differences in relationship- and group-based trust. Personality and Social Psychology Bulletin, 31, 48–62.

Table 13.3 Parallels Between Biological and Religious Discourse

Biology                                                           | Theology
Immortal genes                                                    | Immortal soul
Phenotypic vehicle of the genes                                   | Bodily vessel of the soul
Selfish genes                                                     | Original sin
Man has lost his natural environment                              | Expulsion from the Garden of Eden
Genes in future generations versus loss from the gene pool        | Salvation versus damnation
Self-interest → future success (representation in the gene pool)  | Charitable acts → future success (a place in heaven)
Optimization of design                                            | Perfection of design
Central dogma (Weismannism)                                       | Dogma
Blind watchmaker                                                  | Divine watchmaker
Natural selection                                                 | God
Notes

1. In an accessible article, Newman (2003) correctly points out that the emerging systems approach in biology is "more of an agenda than a body of results" (p. 12)—although it is becoming increasingly well funded. The term evo-devo is most frequently associated with what used to be embryology, a crucial area excluded from the neo-Darwinian synthesis of the 1930s until recently. Thanks to new technologies and to the new discoveries from the Human Genome Project, development has become an increasingly important area for understanding evolution, and vice versa. For scholars in the human sciences, evo-devo is opening new conceptual arenas and creating new vocabularies. Wimsatt and Griesemer (2007) illustrate "evo-devo-ing" culture.
2. Short-term memory (STM) could be reconceived as a link between sensory perceptions and communication within a task group, whether verbal or nonverbal. Interestingly, Cowan (2000) argues for a pure STM limit of three to five chunks, in contrast to Miller's (1956) more familiar "7 ± 2" compound chunks. There may be similar capacity limits in other sensory processing systems such as odor and vision.

3. A possible and tempting function of macrodemes, particularly for those who posit intergroup conflict as an explanation for sociality, might be certain categories of mob violence. In his comprehensive study of deadly ethnic riots, Horowitz (2001) finds that the violence is "structured, nonrandom, socially sanctioned, destructive rather than appropriative, relatively spontaneous, uncalibrated, and yet precisely focused on certain groups" (p. 524). In his view, such riots involve a level of psychological disinhibition, suggesting an underlying psychology of "the beast within." However, from a sociality perspective, one might argue that such riots inhibit an equally "natural" empathy. Yet another explanation might involve a concatenation of gendered developmental experiences that are common cross-culturally.

4. In these studies, subjects in a group were presented with slides of lines of various lengths and, one at a time, announced which line was the longest. In actuality, there was only one subject in the group; the others were confederates of the experimenter. On some trials, the first "subjects" announced that the second-longest line was the longest, even though it was patently shorter than the longest. A third of Asch's subjects conformed to the confederate majority.

5. Kurzban and Leary (2001) are careful to say that they are not making a statement about the value of poor and homeless people who are shunned by the affluent—rather, they are just making a claim that disadvantaged people provide cues about their inability to furnish future social benefits to the affluent and hence activate the evolved exclusionary systems of those who are better off. I do see this as a value statement, albeit not necessarily an intentional one. This supposedly scientific claim naturalizes rejection, victimization, and the constitution of value itself in economic terms, to the exclusion of other possible values (e.g., wisdom, fascinating storytelling, experience) and in the absence of any evidence that could satisfy an evolutionary claim. It is not surprising that students who know evolutionary theory are comparable to students who are creationists in perceiving negative personal and social impacts of evolutionary theory (Brem, Ranney, & Schindel, 2003).

6. Even in a postcomputer world, a variety of conventions served for the transliteration of foreign languages. I have used the spelling Rudolf Höss, except when quoting specific works that use a different spelling, including the recent edition of his memoirs, with the spelling Rudolph Höss.
7. The reference to "natural selection" is among the revisions and clarifications, made about 12 years ago by Professor Dan Rogers, a historian of modern Germany, to the hasty English translation used in the Nuremberg trials. To help me adjudicate among translations both in print and on the Web, Prof. Rogers sent me a fresh translation based on the original German: "The possible final remnant will, since it doubtless will be composed of that element most capable of offering resistance, have to be treated accordingly, since it represents a natural elite that will have to be responded to upon release as the seed of a new Jewish revival (witness the experience of history)." The important issue is not the reference to natural selection, which is described in the various translations (e.g., attrition due to natural causes), but the qualification about a "possible" remaining remnant (absent in the translation in Höss, 1996), which indicates that the intention of German policies was total genocide, not merely relocation as claimed by Holocaust deniers. Human evolutionists have an obligation to know how scientists' claims about natural selection were used to justify Nazi policies of extermination, forced sterilization, medicalized social control, police intervention in private reproductive choice, and the rejection of respect for life on the purportedly higher ground of science and logic as the ultimate arbiter of "truth," no matter what the harm. Given that the evidential and theoretical value of Darwinism to understanding human cognition and behavior is scientifically disputed (Buller, 2005; Gould, 1997; Smith, 2006), the assertion that unconvinced social scientists "may have been complicit in the perpetuation of vast tides of human suffering" (Tooby & Cosmides, 2005, p. 7) that might have been prevented if the scientific community had turned to Darwinism suggests a lack of basic historical literacy and a primitive conception of science and values.

8. The "good story" in the chapter title refers to the social false-consensus effect (Ross, Greene, & House, 1977), which Dawes disputes is an egoistic bias exaggerating the degree to which others are like oneself. His particular objection is to the way in which the false consensus effect was operationally defined. Projection and the Golden Rule are not considered to be stories; they are credible sources of information under certain conditions.

9. Nikolai Frederik Severin Grundtvig, who gave his name to this movement, argued that every people had a distinctive identity ("folk spirit"), which could be found in rural folklore and culture. Grundtvig became a man of enormous stature in Danish life, contributing to the writing of its constitution, and in the course of a century (he lived until the age of 89) redefined Danish identity as spiritual rather than political-military-industrial. Grundtvigians founded folklore institutes and
societies, political parties and farmers' cooperatives, and schools and folk high schools to educate students "in the informal style of folk storytelling" (Buckser, 2001, p. 17).
Appendix 1

The Robust Beauty of Improper Linear Models in Decision Making

Robyn M. Dawes
University of Oregon
ABSTRACT: Proper linear models are those in which predictor variables are given weights in such a way that the resulting linear composite optimally predicts some criterion of interest; examples of proper linear models are standard regression analysis, discriminant function analysis, and ridge regression analysis. Research summarized in Paul Meehl's book on clinical versus statistical prediction—and a plethora of research stimulated in part by that book—all indicates that when a numerical criterion variable (e.g., graduate grade point average) is to be predicted from numerical predictor variables, proper linear models outperform clinical intuition. Improper linear models are those in which the weights of the predictor variables are obtained by some nonoptimal method; for example, they may be obtained on the basis of intuition, derived from simulating a clinical judge's predictions, or set to be equal. This article presents evidence that even such improper linear models are superior to clinical intuition when predicting a numerical criterion from numerical predictors. In fact, unit (i.e., equal) weighting is quite robust for making such predictions. The article discusses, in some detail, the application of unit weights to decide what bullet the Denver Police Department should use. Finally, the article considers commonly raised technical, psychological, and ethical resistances to using linear models to make important social decisions and presents arguments that could weaken these resistances.
Paul Meehl's (1954) book Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence appeared 25 years ago. It reviewed studies indicating that the prediction of numerical criterion variables of psychological interest (e.g., faculty ratings of graduate students who had just obtained a PhD) from numerical predictor variables (e.g., scores on the Graduate Record Examination, grade point averages, ratings of letters of recommendation) is better done by a proper linear model than by the clinical intuition of people presumably skilled in such prediction. The point of this article is to review evidence that even improper linear models may be superior to clinical predictions.

A proper linear model is one in which the weights given to the predictor variables are chosen in such a way as to optimize the relationship between the prediction and the criterion. Simple regression analysis is the most common example of a proper linear model; the predictor variables are weighted in such a way as to maximize the correlation between the subsequent weighted composite and the actual criterion. Discriminant function analysis is another example of a proper linear model; weights are given to the predictor variables in such a way that the resulting linear composites maximize the discrepancy between two or more groups. Ridge regression analysis, another example (Darlington, 1978; Marquardt & Snee, 1975), attempts to assign weights in such a way that the linear composites correlate maximally with the criterion of interest in a new set of data. Thus, there are many types of proper linear models, and they have been used in a variety of contexts.
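[Editor's illustration.] To make the idea of optimal weighting concrete, here is a minimal sketch of a proper linear model fit by ordinary least squares. The sample size, predictor names, and coefficients are invented for illustration and come from none of the studies discussed in this article.

import numpy as np

rng = np.random.default_rng(0)
n = 200

# Invented predictors: GRE-like and GPA-like scores (hypothetical values)
gre = rng.normal(600.0, 80.0, n)
gpa = rng.normal(3.4, 0.3, n)
# Invented criterion: a noisy linear function of the predictors
criterion = 0.004 * gre + 0.9 * gpa + rng.normal(0.0, 0.8, n)

# "Proper" model: least-squares weights, which maximize the correlation
# between the weighted composite and the criterion in this sample
X = np.column_stack([gre, gpa, np.ones(n)])   # intercept column
beta, _res, _rank, _sv = np.linalg.lstsq(X, criterion, rcond=None)

predicted = X @ beta
# Note: this is the in-sample validity; Dawes's examples cross-validate,
# which matters at small sample sizes
validity = np.corrcoef(predicted, criterion)[0, 1]
print(f"optimal weights: {beta[:2].round(4)}, validity r = {validity:.2f}")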
One example (Dawes, 1971) was presented in this Journal; it involved the prediction of faculty ratings of graduate students. All graduate students at the University of Oregon's Psychology Department who had been admitted between the fall of 1964 and the fall of 1967—and who had not dropped out of the program for nonacademic reasons (e.g., psychosis or marriage)—were rated by the faculty in the spring of 1969; faculty members rated only students whom they felt comfortable rating. The following rating scale was used: 5, outstanding; 4, above average; 3, average; 2, below average; 1, dropped out of the program in academic difficulty. Such overall ratings constitute a psychologically interesting criterion because the subjective impressions of faculty members are the main determinants of the job (if any) a student obtains after leaving graduate school. A total of 111 students were in the sample; the number of faculty members rating each of these students ranged from 1 to 20, with the mean number being 5.67 and the median being 5. The ratings were reliable. (To determine the reliability, the ratings were subjected to a one-way analysis of variance in which each student being rated was regarded as a treatment. The resulting between-treatments variance ratio [η²] was .67, and it was significant beyond the .001 level.) These faculty ratings were predicted from a proper linear model based on the student's Graduate Record Examination (GRE) score, the student's undergraduate grade point average (GPA), and a measure of the selectivity of the student's undergraduate institution. [Footnote: This index was based on Cass and Birnbaum's (1968) rating of selectivity given at the end of their book Comparative Guide to American Colleges. The verbal categories of selectivity were given numerical values according to the following rule: most selective, 6; highly selective, 5; very selective (+), 4; very selective, 3; selective, 2; not mentioned, 1.] The cross-validated multiple correlation between the faculty ratings and predictor variables was .38. Congruent with Meehl's results, the correlation of these latter faculty ratings with the average rating of the people on the admissions committee who selected the students was .19; that is, it accounted for one fourth as much variance. [Footnote: Unfortunately, only 23 of the 111 students could be used in this comparison because the rating scale the admissions committee used changed slightly from year to year.] This example is typical of those found in psychological research in this area in that (a) the correlation with the model's predictions is higher than the correlation with clinical prediction, but (b) both correlations are low. These characteristics often lead psychologists to interpret the findings as meaning that while the low correlation of the model indicates that linear modeling is deficient as a method, the even lower correlation of the judges indicates only that the wrong judges were used.

An improper linear model is one in which the weights are chosen by some nonoptimal method. They may be chosen to be equal, they may be chosen on the basis of the intuition of the person making the prediction, or they may be chosen at random. Nevertheless, improper models may have great utility. When, for example, the standardized GREs, GPAs, and selectivity indices in the previous example were weighted equally, the resulting linear composite correlated .48 with later faculty rating. Not only is the correlation of this linear composite higher than that with the clinical judgment of the admissions committee (.19), it is also higher than that obtained upon cross-validating the weights obtained from half the sample.
An example of an improper model that might be of somewhat more interest—at least to the general public—was motivated by a physician who was on a panel with me concerning predictive systems. Afterward, at the bar with his wife and me, he said that my paper might be of some interest to my colleagues, but success in graduate school in psychology was not of much general interest: "Could you, for example, use one of your improper linear models to predict how well my wife and I get along together?" he asked. I realized that I could—or might. At that time, the Psychology Department at the University of Oregon was engaged in sex research, most of which was behavioristically oriented. So the subjects of this research monitored when they made love, when they had fights, when they had social engagements (e.g., with in-laws), and so on. These subjects also made subjective ratings about how happy they were in their marital or coupled situation. I immediately thought of an improper linear model to predict self-ratings of marital happiness: rate of lovemaking minus rate of fighting. My colleague John Howard had collected just such data on couples when he was an undergraduate at the University of Missouri—Kansas City, where he worked with Alexander (1971). After establishing the intercouple reliability of judgments of lovemaking and fighting, Alexander had one partner from each of 42 couples monitor these events. She allowed us to analyze her data, with the following results: "In the thirty happily married couples (as reported by the monitoring partner) only two argued more often than they had intercourse. All twelve of the unhappily married couples argued more often" (Howard & Dawes, 1976, p. 478). We then replicated this finding at the University of Oregon, where 27 monitors rated happiness on a 7-point scale, from "very unhappy" to "very happy," with a neutral midpoint. The correlation of rate of lovemaking minus rate of arguments with these ratings of marital happiness was .40 (p < .05); neither variable alone was significant. The findings were replicated in Missouri by D. D. Edwards and J. Edwards (1977) and in Texas by Thornton (1977), who found a correlation of .81 (p < .01) between the sex-argument difference and self-rating of marital happiness among 28 new couples. (The reason for this much higher correlation might be that Thornton obtained the ratings of marital happiness after, rather than before, the subjects monitored their lovemaking and fighting; in fact, one subject decided to get a divorce after realizing that she was fighting more than loving; Thornton, Note 1.) The conclusion is that if we love more than we hate, we are happy; if we hate more than we love, we are miserable. This conclusion is not very profound, psychologically or statistically. The point is that this very crude improper linear model predicts a very important variable: judgments about marital happiness.
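[Editor's illustration.] A minimal sketch of an improper, unit-weighted composite in the same spirit as the lovemaking-minus-fighting model. The counts and the criterion below are simulated, not the Alexander data; Dawes used the raw rate difference, whereas the usual unit-weighting recipe shown here standardizes each predictor and adds them with weights of +1 or -1.

import numpy as np

rng = np.random.default_rng(1)
n = 42  # same number of couples as in the Alexander data, purely for flavor

# Invented monitoring counts and a toy happiness criterion
lovemaking = rng.poisson(8, n).astype(float)
fighting = rng.poisson(5, n).astype(float)
happiness = (lovemaking - fighting) + rng.normal(0.0, 3.0, n)

def standardize(x):
    return (x - x.mean()) / x.std()

# Improper model: unit weights (+1 for lovemaking, -1 for fighting)
composite = standardize(lovemaking) - standardize(fighting)
r = np.corrcoef(composite, happiness)[0, 1]
print(f"validity of the unit-weighted difference score: r = {r:.2f}")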
The bulk (in fact, all) of the literature since the publication of Meehl's (1954) book supports his generalization about proper models versus intuitive clinical judgment. Sawyer (1966) reviewed a plethora of these studies, and some of these studies were quite extensive (cf. Goldberg, 1965). Some 10 years after his book was published, Meehl (1965) was able to conclude, however, that there was only a single example showing clinical judgment to be superior, and this conclusion was immediately disputed by Goldberg (1968) on the grounds that even the one example did not show such superiority. Holt (1970) criticized details of several studies, and he even suggested that prediction as opposed to understanding may not be a very important part of clinical judgment. But a search of the literature fails to reveal any studies in which clinical judgment has been shown to be superior to statistical prediction when both are based on the same codable input variables. And though most nonpositivists would agree that understanding is not synonymous with prediction, few would agree that it doesn't entail some ability to predict.

Why? Because people—especially the experts in a field—are much better at selecting and coding information than they are at integrating it. But people are important. The statistical model may integrate the information in an optimal manner, but it is always the individual (judge, clinician, subject) who chooses variables. Moreover, it is the human judge who knows the directional relationship between the predictor variables and the criterion of interest, or who can code the variables in such a way that they have clear directional relationships. And it is in precisely the situation where the predictor variables are good and where they have a conditionally monotone relationship with the criterion that proper linear models work well. [Footnote: Relationships are conditionally monotone when variables can be scaled in such a way that higher values on each predict higher values on the criterion. This condition is the combination of two more fundamental measurement conditions: (a) independence (the relationship between each variable and the criterion is independent of the values on the remaining variables) and (b) monotonicity (the ordinal relationship is one that is monotone). See Krantz, 1972; Krantz, Luce, Suppes, & Tversky, 1971.] [Footnote: The true relationships need not be linear for linear models to work; they must merely be approximated by linear models. It is not true that "in order to compute a correlation coefficient between two variables the relationship between them must be linear" (advice found in one introductory statistics text). In the first place, it is always possible to compute something.] The linear model cannot replace the expert in deciding such things as "what to look for," but it is precisely this knowledge of what to look for in reaching the decision that is the special expertise people have. Even in as complicated a judgment as making a chess move, it is the ability to code the board in an appropriate way to "see" the proper moves that distinguishes the grand master from the expert from the novice (deGroot, 1965; Simon & Chase, 1973). It is not in the ability to integrate information that people excel (Slovic, Note 2). Again, the chess grand master considers no more moves than does the expert; he just knows which ones to look at. The distinction between knowing what to look for and the ability to integrate information is perhaps best illustrated in a study by Einhorn (1972).
Expert doctors coded biopsies of patients with Hodgkin's disease and then made an overall rating of the severity of the process. The overall rating did not predict survival time of the 193 patients, all of whom died. (The correlations of rating with survival time were all virtually 0, some in the wrong direction.) The variables that the doctors coded did, however, predict survival time when they were used in a multiple regression model.

In summary, proper linear models work for a very simple reason. People are good at picking out the right predictor variables and at coding them in such a way that they have a conditionally monotone relationship with the criterion. People are bad at integrating information from diverse and incomparable sources. Proper linear models are good at such integration when the predictions have a conditionally monotone relationship to the criterion.

Consider, for example, the problem of comparing one graduate applicant with GRE scores of 750 and an undergraduate GPA of 3.3 with another with GRE scores of 680 and an undergraduate GPA of 3.7. Most judges would agree that these indicators of aptitude and previous accomplishment should be combined in some compensatory fashion, but the question is how to compensate. Many judges attempting this feat have little knowledge of the distributional characteristics of GREs and GPAs, and most have no knowledge of studies indicating their validity as predictors of graduate success. Moreover, these numbers are inherently incomparable without such knowledge, GREs running from 500 to 800 for viable applicants, and GPAs from 3.0 to 4.0. Is it any wonder that a statistical weighting scheme does better than a human judge in these circumstances?

Suppose now that it is not possible to construct a proper linear model in some situation. One reason we may not be able to do so is that our sample size is inadequate. In multiple regression, for example, b weights are notoriously unstable; the ratio of observations to predictors should be as high as 15 or 20 to 1 before b weights, which are the optimal weights, do better on cross-validation than do simple unit weights. Schmidt (1971), Goldberg (1972), and Claudy (1972) have demonstrated this need empirically through computer simulation, and Einhorn and Hogarth (1975) and Srinivasan (Note 3) have attacked the problem analytically. The general solution depends on a number of parameters such as the multiple correlation in the population and the covariance pattern between predictor variables. But the applied implication is clear. Standard regression analysis cannot be used in situations where there is not a "decent" ratio of observations to predictors.
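[Editor's illustration.] A small simulation conveys why the observations-to-predictors ratio matters. The population weights, noise level, and sample sizes below are invented; the point is the qualitative pattern reported in the simulations cited above, in which cross-validated regression weights lose their edge over unit weights at small ratios.

import numpy as np

rng = np.random.default_rng(2)
p = 5
true_beta = np.array([0.5, 0.4, 0.3, 0.2, 0.1])  # invented population weights

def sample(n):
    X = rng.normal(size=(n, p))
    y = X @ true_beta + rng.normal(0.0, 1.5, n)
    return X, y

for n in (25, 50, 100, 500):                     # ratios of 5:1 up to 100:1
    r_regression, r_unit = [], []
    for _ in range(200):
        X_fit, y_fit = sample(n)                 # derivation sample
        X_new, y_new = sample(2000)              # cross-validation sample
        b, *_ = np.linalg.lstsq(X_fit, y_fit, rcond=None)
        r_regression.append(np.corrcoef(X_new @ b, y_new)[0, 1])
        # Unit weights: simply add the standardized predictors
        r_unit.append(np.corrcoef(X_new.sum(axis=1), y_new)[0, 1])
    print(f"n/p = {n // p:3d}:1   regression r = {np.mean(r_regression):.3f}"
          f"   unit weights r = {np.mean(r_unit):.3f}")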
Another situation in which proper linear models cannot be used is that in which there are no measurable criterion variables. We might, nevertheless, have some idea about what the important predictor variables would be and the direction they would bear to the criterion if we were able to measure the criterion. For example, when deciding which students to admit to graduate school, we would like to predict some future long-term variable that might be termed "professional self-actualization." We have some idea what we mean by this concept, but no good, precise definition as yet. (Even if we had one, it would be impossible to conduct the study using records from current students, because that variable could not be assessed until at least 20 years after the students had completed their doctoral work.) We do, however, know that in all probability this criterion is positively related to intelligence, to past accomplishments, and to ability to snow one's colleagues. In our applicants' files, GRE scores assess the first variable; undergraduate GPA, the second; and letters of recommendation, the third. Might we not, then, wish to form some sort of linear combination of these variables in order to assess our applicants' potentials? Given that we cannot perform a standard regression analysis, is there nothing to do other than fall back on unaided intuitive integration of these variables when we assess our applicants?

One possible way of building an improper linear model is through the use of bootstrapping (Dawes & Corrigan, 1974; Goldberg, 1970). The process is to build a proper linear model of an expert's judgments about an outcome criterion and then to use that linear model in place of the judge. That such linear models can be accurate in predicting experts' judgments has been pointed out in the psychological literature by Hammond (1955) and Hoffman (1960). (This work was anticipated by 32 years by the late Henry Wallace, vice president under Roosevelt, in a 1923 agricultural article suggesting the use of linear models to analyze "what is on the corn judge's mind.") In his influential article, Hoffman termed the use of linear models a paramorphic representation of judges, by which he meant that the judges' psychological processes did not involve computing an implicit or explicit weighted average of input variables, but that it could be simulated by such a weighting. Paramorphic representations have been extremely successful (for reviews see Dawes & Corrigan, 1974; Slovic & Lichtenstein, 1971) in contexts in which predictor variables have conditionally monotone relationships to criterion variables.
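[Editor's illustration.] The bootstrapping procedure is easy to state in code. This sketch uses simulated data and invented coefficients rather than anything from the studies cited: a noisy judge's ratings are regressed on the cues, and the resulting paramorphic model of the judge is then scored against the criterion.

import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 3
cues = rng.normal(size=(n, p))

# Invented true criterion, and a judge with a valid but noisy policy
criterion = cues @ np.array([0.5, 0.3, 0.2]) + rng.normal(0.0, 1.0, n)
judgments = cues @ np.array([0.6, 0.25, 0.15]) + rng.normal(0.0, 1.2, n)

# Bootstrapping: regress the JUDGE'S ratings (not the criterion) on the
# cues, then use that model of the judge in place of the judge
w, *_ = np.linalg.lstsq(cues, judgments, rcond=None)
model_predictions = cues @ w

r_judge = np.corrcoef(judgments, criterion)[0, 1]
r_model = np.corrcoef(model_predictions, criterion)[0, 1]
print(f"judge: r = {r_judge:.2f}   model of the judge: r = {r_model:.2f}")

Because the fitted model keeps the judge's implicit weights while discarding the judge's trial-to-trial noise, the model typically outpredicts the judge it was built from, which is the result the next paragraphs describe.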
The bootstrapping models make use of the weights derived from the judges; because these weights are not derived from the relationship between the predictor and criterion variables themselves, the resulting linear models are improper. Yet these paramorphic representations consistently do better than the judges from whom they were derived (at least when the evaluation of goodness is in terms of the correlation between predicted and actual values).

Bootstrapping has turned out to be pervasive. For example, in a study conducted by Wiggins and Kohen (1971), psychology graduate students at the University of Illinois were presented with 10 background, aptitude, and personality measures describing other (real) Illinois graduate students in psychology and were asked to predict these students' first-year graduate GPAs. Linear models of every one of the University of Illinois judges did a better job than did the judges themselves in predicting actual grade point averages. This result was replicated in a study conducted in conjunction with Wiggins, Gregory, and Diller (cited in Dawes & Corrigan, 1974). Goldberg (1970) demonstrated it for 26 of 29 clinical psychology judges predicting psychiatric diagnosis of neurosis or psychosis from Minnesota Multiphasic Personality Inventory (MMPI) profiles, and Dawes (1971) found it in the evaluation of graduate applicants at the University of Oregon.

The one published exception to the success of bootstrapping of which I am aware was a study conducted by Libby (1976). He asked 16 loan officers from relatively small banks (located in Champaign-Urbana, Illinois, with assets between $3 million and $56 million) and 27 loan officers from large banks (located in Philadelphia, with assets between $.6 billion and $4.4 billion) to judge which 30 of 60 firms would go bankrupt within three years after their financial statements. The loan officers requested five financial ratios on which to base their judgments (e.g., the ratio of present assets to total assets). On the average, the loan officers correctly categorized 44.4 businesses (74%) as either solvent or future bankruptcies, but on the average, the paramorphic representations of the loan officers could correctly classify only 43.3 (72%). This difference turned out to be statistically significant, and Libby concluded that he had an example of a situation where bootstrapping did not work—perhaps because his judges were highly skilled experts attempting to predict a highly reliable criterion. Goldberg (1976), however, noted that many of the ratios had highly skewed distributions, and he reanalyzed Libby's data, normalizing the ratios before building models of the loan officers. Libby found 77% of his officers to be superior to their paramorphic representations, but Goldberg, using his rescaled predictor variables, found the opposite; 72% of the models were superior to the judges from whom they were derived. [Footnote: It should be pointed out that a proper linear model does better than either loan officers or their paramorphic representations. Using the same task, Beaver (1966) and Deacon (1972) found that linear models predicted with about 78% accuracy on cross-validation. But I can't resist pointing out that the simplest possible improper model of them all does best. The ratio of assets to liabilities (!) correctly categorizes 48 (80%) of the cases studied by Libby.]
ER59969.indb 328
3/21/08 10:51:44 AM
Appendix 1 • 329
opposite; 72% of the models were superior to the judges from whom they were derived. Why does bootstrapping work? Bowman (1963), Goldberg (1970), and Dawes (1971) all maintained that its success arises from the fact that a linear model distills underlying policy (in the implicit weights) from otherwise variable behavior (e.g., judgments affected by context effects or extraneous variables). Belief in the efficacy of bootstrapping was based on the comparison of the validity of the linear model of the judge with the validity of his or her judgments themselves. This is only one of two logically possible comparisons. The other is the validity of the linear model of the judge versus the validity of linear models in general; that is, to demonstrate that bootstrapping works because the linear model catches the essence of the judge’s valid expertise while eliminating unreliability, it is necessary to demonstrate that the weights obtained from an analysis of the judge’s behavior are superior to those that might be obtained in other ways, for example, randomly. Because both the model of the judge and the model obtained randomly are perfectly reliable, a comparison of the random model with the judge’s model permits an evaluation of the judge’s underlying linear representation, or policy. If the random model does equally well, the judge would not be “following valid principles but following them poorly” (Dawes, 1971, p. 182), at least not principles any more valid than any others that weight variables in the appropriate direction. Table 1 presents five studies summarized by Dawes and Corrigan (1974) in which validities (i.e., correlations) obtained by various methods were compared. In the first study, a pool of 861 psychiatric patients took the MMPI in various hospitals; they were later categorized as neurotic or psychotic on the basis of more extensive information. The MMPI profiles consist of 11 scores, each of which represents the degree to which the respondent answers questions in a manner similar to patients suffering from a well-defined form of psychopathology. A set of 11 scores is thus associated with each patient, and the problem is to predict whether a later diagnosis will be psychosis (coded 1) or neurosis (coded 0). Twenty-nine clinical psychologists “of varying experience and training” (Goldberg, 1970, p. 425) were asked to make this prediction on an 11-step forced-normal distribution. The second It should be pointed out that a proper linear model does better than either loan officers or their paramorphic representations. Using the same task, Beaver (1966) and Deacon (1972) found that linear models predicted with about 78% accuracy on cross-validation. But I can’t resist pointing out that the simplest possible improper model of them all does best. The ratio of assets to liabilities (!) correctly categorizes 48 (80%) of the cases studied by Libby.
TABLE 1 Correlations Between Predictions and Criterion Values

Example                                         Average    Average      Average      Validity of      Cross-validity   Validity of
                                                validity   validity of  validity of  equal-weighting  of regression    optimal
                                                of judge   judge model  random model model            analysis         linear model
Prediction of neurosis vs. psychosis              .28        .31          .30          .34               .46              .46
Illinois students' predictions of GPA             .33        .50          .51          .60               .57              .69
Oregon students' predictions of GPA               .37        .43          .51          .60               .57              .69
Prediction of later faculty ratings at Oregon     .19        .25          .39          .48               .38              .54
Yntema & Torgerson's (1961) experiment            .84        .89          .84          .97               —                .97

Note. GPA = grade point average.

The second two studies concerned 90 first-year graduate students in the Psychology Department of the University of Illinois who were evaluated on 10 variables that are predictive of academic success. These variables included aptitude test scores, college GPA, various peer ratings (e.g., extraversion), and various self-ratings (e.g., conscientiousness). A first-year GPA was computed for all these students. The problem was to predict the GPA from the 10 variables. In the second study this prediction was made by 80 (other) graduate students at the University of Illinois (Wiggins & Kohen, 1971); in the third study this prediction was made by 41 graduate students at the University of Oregon. The details of the fourth study have already been covered; it is the one concerned with the prediction of later faculty ratings at Oregon. The final study (Yntema & Torgerson, 1961) was one in which experimenters assigned values to ellipses presented to the subjects, on the basis of the figures' size, eccentricity, and grayness. The formula used was ij + kj + ik, where i, j, and k refer to values on the three dimensions just mentioned. Subjects in this experiment were asked to estimate the value of each ellipse and were presented with outcome feedback at the end of each trial.
The problem was to predict the true (i.e., experimenter-assigned) value of each ellipse on the basis of its size, eccentricity, and grayness.

The first column of Table 1 presents the average validity of the judges in these studies, and the second presents the average validity of the paramorphic models of these judges. In all cases, bootstrapping worked. But then what Corrigan and I constructed were random linear models, that is, models in which weights were randomly chosen except for sign and were then applied to standardized variables:

The sign of each variable was determined on an a priori basis so that it would have a positive relationship to the criterion. Then a normal deviate was selected at random from a normal distribution with unit variance, and the absolute value of this deviate was used as a weight for the variable. Ten thousand such models were constructed for each example. (Dawes & Corrigan, 1974, p. 102)

On the average, these random linear models perform about as well as the paramorphic models of the judges; these averages are presented in the third column of the table. Equal-weighting models, presented in the fourth column, do even better. (There is a mathematical reason why equal-weighting models must outperform the average random model.) Unfortunately, Dawes and Corrigan did not spell out in detail that these variables must first be standardized and that the result is a standardized dependent variable. Equal or random weighting of incomparable variables—for example, GRE score and GPA—without prior standardization would be nonsensical.

To see why equal weighting must outperform the average random model, consider a set of standardized variables X_1, X_2, \ldots, X_m, each of which is positively correlated with a standardized variable Y. The correlation of the average of the Xs with Y is equal to the correlation of the sum of the Xs with Y. The covariance of this sum with Y is equal to

\frac{1}{n}\sum_i y_i\left(x_{i1} + x_{i2} + \cdots + x_{im}\right) = \frac{1}{n}\sum_i y_i x_{i1} + \frac{1}{n}\sum_i y_i x_{i2} + \cdots + \frac{1}{n}\sum_i y_i x_{im} = r_1 + r_2 + \cdots + r_m

(the sum of the correlations). The variance of Y is 1, and the variance of the sum of the Xs is m + m(m - 1)\bar{r}, where \bar{r} is the average intercorrelation between the Xs. Hence, the correlation of the average of the Xs with Y is

\left(\sum_i r_i\right) \Big/ \left(m + m(m - 1)\bar{r}\right)^{1/2},

which, because \bar{r} is less than 1 whenever the Xs are not all perfectly intercorrelated, is greater than

\left(\sum_i r_i\right) \Big/ \left(m + m^2 - m\right)^{1/2} = \left(\sum_i r_i\right) \Big/ m = \text{the average } r_i.

Because each of the random models is positively correlated with the criterion, the correlation of their average, which is the unit-weighted model, is higher than the average of the correlations.
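The Dawes and Corrigan construction is easy to simulate. The sketch below (mine; the data-generating weights are arbitrary) draws weights as absolute values of unit normal deviates, applies them to standardized predictors, and compares the average validity of 10,000 such random models with the validity of the unit-weighted model.

    import numpy as np

    rng = np.random.default_rng(1)

    # Standardized predictors, each positively related to the criterion.
    n, m = 500, 5
    X = rng.standard_normal((n, m))
    y = X @ np.array([0.4, 0.3, 0.3, 0.2, 0.1]) + rng.standard_normal(n)

    def validity(pred):
        return np.corrcoef(pred, y)[0, 1]

    # Random models: positive weights drawn as |N(0, 1)| deviates,
    # with signs fixed a priori so each variable enters positively.
    random_r = [validity(X @ np.abs(rng.standard_normal(m)))
                for _ in range(10_000)]

    print("average validity of random models:", round(float(np.mean(random_r)), 3))
    print("validity of unit-weighted model:  ", round(validity(X.sum(axis=1)), 3))

Consistent with the algebra above, the unit-weighted composite comes out ahead of the average random model, since it is proportional to the average of the random models.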
Finally, the last two columns of Table 1 present the cross-validated validity of the standard regression model and the validity of the optimal linear model. Essentially the same results were obtained when the weights were selected from a rectangular distribution. Why? Because linear models are robust over deviations from optimal weighting. In other words, the bootstrapping finding, at least in these studies, has simply been a reaffirmation of the earlier finding that proper linear models are superior to human judgments—the weights derived from the judges' behavior being sufficiently close to the optimal weights that the outputs of the models are highly similar.

The problem of obtaining optimal weights is one that—in the terms of von Winterfeldt and Edwards (Note 4)—has a "flat maximum." Weights that are near the optimum produce almost the same output as do optimal beta weights. Because the expert judge knows at least something about the direction of the variables, his or her judgments yield weights that are nearly optimal (but note that in all cases equal weighting is superior to models based on judges' behavior).

The fact that different linear composites correlate highly with each other was first pointed out 40 years ago by Wilks (1938). He considered only situations in which there was positive correlation between predictors. This result seems to hold generally as long as these intercorrelations are not negative; for example, the correlation between X + 2Y and 2X + Y is .80 when X and Y are uncorrelated. The ways in which outputs are relatively insensitive to changes in coefficients (provided changes in sign are not involved) have been investigated most recently by Green (1977), Wainer (1976), Wainer and Thissen (1976), W. M. Edwards (1978), and Gardiner and Edwards (1975).
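A quick check of that figure, assuming only that X and Y are standardized and uncorrelated:

\operatorname{cov}(X + 2Y,\; 2X + Y) = 2\operatorname{var}(X) + 2\operatorname{var}(Y) = 4, \qquad \operatorname{var}(X + 2Y) = \operatorname{var}(2X + Y) = 5,

so that r = 4/\sqrt{5 \times 5} = .80.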
Dawes and Corrigan (1974, p. 105) concluded that "the whole trick is to know what variables to look at and then know how to add." That principle is well illustrated in the following study, conducted since the Dawes and Corrigan article was published. In it, Hammond and Adelman (1976) both investigated and influenced the decision about what type of bullet should be used by the Denver City Police, a decision having much more obvious social impact than most of those discussed above. To quote Hammond and Adelman (1976):

In 1974, the Denver Police Department (DPD), as well as other police departments throughout the country, decided to change its handgun ammunition. The principal reason offered by the police was that the conventional round-nosed bullet provided insufficient "stopping effectiveness" (that is, the ability to incapacitate and thus to prevent the person shot from firing back at a police officer or others). The DPD chief recommended (as did other police chiefs) that the conventional bullet be replaced by a hollow-point bullet. Such bullets, it was contended, flattened on impact, thus decreasing penetration, increasing stopping effectiveness, and decreasing ricochet potential. The suggested change was challenged by the American Civil Liberties Union, minority groups, and others. Opponents of the change claimed that the new bullets were nothing more than outlawed "dum-dum" bullets, that they created far more injury than the round-nosed bullet, and should, therefore, be barred from use. As is customary, judgments on this matter were formed privately and then defended publicly with enthusiasm and tenacity, and the usual public hearings were held. Both sides turned to ballistics experts for scientific information and support. (p. 392)

The disputants focused on evaluating the merits of specific bullets, confounding the physical effect of the bullets with the implications for social policy; that is, rather than separating questions of what it is the bullet should accomplish (the social policy question) from questions concerning the ballistic characteristics of specific bullets, advocates merely argued for one bullet or another. Thus, as Hammond and Adelman pointed out, social policymakers inadvertently adopted the role of (poor) ballistics experts, and vice versa.

What Hammond and Adelman did was to discover the important policy dimensions from the policymakers, and then they had the ballistics experts rate the bullets with respect to these dimensions. These dimensions turned out to be stopping effectiveness (the probability that someone hit in the torso could not return fire), probability of serious injury, and probability of harm to bystanders. When the ballistics experts rated the bullets with respect to these dimensions, it turned out that the last two were almost perfectly confounded with each other, but not with the first. Bullets do not vary along a single dimension that confounds effectiveness with lethalness. The probability of serious injury or harm to bystanders is highly related to the penetration of the bullet, whereas the probability of the bullet's effectively stopping someone from returning fire is highly related to the width of the entry wound. Since policymakers could not agree about the weights to be given to the three dimensions, Hammond and Adelman suggested that they be weighted equally. Combining the equal weights with the (independent) judgments of the ballistics experts, Hammond and Adelman discovered a bullet that "has greater stopping effectiveness and is less apt to cause injury (and is less apt to threaten bystanders) than the standard bullet then in use by the DPD" (Hammond & Adelman, 1976, p. 395). The bullet was also less apt to cause injury than was the bullet previously recommended by the DPD. That bullet was "accepted by the City Council and all other parties concerned, and is now being used by the DPD" (Hammond & Adelman, 1976, p. 395). (It should be pointed out that there were only eight bullets on the Pareto frontier; that is, there were only eight that were not inferior to some other bullet in both stopping effectiveness and probability of harm, or inferior on one of the variables and equal on the other. Consequently, any weighting rule whatsoever would have chosen one of these eight.) Once again, "the whole trick is to decide what variables to look at and then know how to add" (Dawes & Corrigan, 1974, p. 105).
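As a concrete illustration of knowing what variables to look at and then knowing how to add, here is a minimal sketch (mine; the applicant numbers are invented) of the unit-weighted screening rule discussed throughout this article: standardize each predictor over the applicant pool, sum the z scores, and rank.

    import numpy as np

    # Hypothetical applicant pool: columns are GRE score and undergraduate GPA.
    applicants = np.array([
        [700, 3.9],
        [650, 3.5],
        [780, 3.2],
        [600, 4.0],
        [720, 3.7],
    ], dtype=float)

    # Standardizing makes the incomparable variables comparable;
    # adding the z scores is the unit-weighted improper linear model.
    z = (applicants - applicants.mean(axis=0)) / applicants.std(axis=0)
    score = z.sum(axis=1)

    k = 2  # slots available
    admit = np.argsort(score)[::-1][:k]
    print("admit (row indices):", admit, "composite scores:", np.round(score[admit], 2))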
So why don't people do it more often? I know of four universities (University of Illinois; New York University; University of Oregon; University of California, Santa Barbara—there may be more) that use a linear model for applicant selection, but even these use it as an initial screening device and substitute clinical judgment for the final selection of those above a cut score. Goldberg's (1965) actuarial formula for diagnosing neurosis or psychosis from MMPI profiles has proven superior to clinical judges attempting the same task (no one to my or Goldberg's knowledge has ever produced a judge who does better), yet my one experience with its use (at the Ann Arbor Veterans Administration Hospital) was that it was discontinued on the grounds that it made obvious errors (an interesting reason, discussed at length below). In 1970, I suggested that our fellowship committee at the University of Oregon apportion cutbacks of National Science Foundation and National Defense Education Act fellowships to departments on the basis of a quasi-linear point system based on explicitly defined indices, departmental merit, and need; I was told "you can't systemize human judgment." It was only six months later, after our committee realized the political and ethical impossibility of cutting back fellowships on the basis of intuitive judgment, that such a system was adopted. And so on.

In the past three years, I have written and talked about the utility (and in my view, ethical superiority) of using linear models in socially important decisions. Many of the same objections have been raised repeatedly by different readers and audiences. I would like to conclude this article by cataloging these objections and answering them.
Objections to Using Linear Models

These objections may be placed in three broad categories: technical, psychological, and ethical. Each category is discussed in turn.
Technical

The most common technical objection is to the use of the correlation coefficient; for example, Remus and Jenicke (1978) wrote:

It is clear that Dawes and Corrigan's choice of the correlation coefficient to establish the utility of random and unit rules is inappropriate [sic; inappropriate for what?]. A criterion function is also needed in the experiments cited by Dawes and Corrigan. Surely there is a cost function for misclassifying neurotics and psychotics or refusing qualified students admission to graduate school while admitting marginal students. (p. 221)

Consider the graduate admission problem first. Most schools have k slots and N applicants. The problem is to get the best k (who are in turn willing to accept the school) out of N. What better way is there than to have an appropriate rank? None. Remus and Jenicke write as if the problem were not one of comparative choice but of absolute choice. Most social choices, however, involve selecting the better or best from a set of alternatives: the students who will be better, the bullet that will be best, a possible airport site that will be superior, and so on. The correlation coefficient, because it reflects ranks so well, is clearly appropriate for evaluating such choices.

The neurosis–psychosis problem is more subtle and even less supportive of their argument. "Surely," they state, "there is a cost function," but they don't specify any candidates. The implication is clear: if they could find it, clinical judgment would be found to be superior to linear models. Why? In the absence of such a discovery on their part, the argument amounts to nothing at all. But this argument from a vacuum can be very compelling to people (for example, to losing generals and losing football coaches, who know that "surely" their plans would work "if"—when the plans are in fact doomed to failure no matter what).

A second, related technical objection is to the comparison of average correlation coefficients of judges with those of linear models. Perhaps by averaging, the performance of some really outstanding judges is obscured. The data indicate otherwise. In the Goldberg (1970) study, for example, only 5 of 29 trained clinicians were better than the unit-weighted model, and none did better than the proper one. In the Wiggins and Kohen (1971) study, no judges were better than the unit-weighted model, and we replicated that effect at Oregon. In the Libby (1976) study, only 9 of 43 judges did better than the ratio of assets to liabilities at predicting bankruptcies (3 did equally well). While it is conceded that clinicians should be able to predict the diagnosis of neurosis or psychosis, that graduate students should be able to predict graduate success, and that bank loan officers should be able to predict bankruptcies, the possibility is raised that perhaps the experts used in the studies weren't the right ones. This again is arguing from a vacuum: if other experts were used, then the results would be different. And once again no such experts are produced, and once again the appropriate response is to ask for a reason why these hypothetical other people should be any different. (As one university vice president told me, "Your research only proves that you used poor judges; we could surely do better by getting better judges"—apparently not from the psychology department.)

A final technical objection concerns the nature of the criterion variables. They are admittedly short-term and unprofound (e.g., GPAs, diagnoses); otherwise, most studies would be infeasible. The question is then raised of whether the findings would be different if a truly long-range, important criterion were to be predicted. The answer is that of course the findings could be different, but we have no reason to suppose that they would be. First, the distant future is in general less predictable than the immediate future, for the simple reason that more unforeseen, extraneous, or self-augmenting factors influence individual outcomes. (Note that we are not discussing aggregate outcomes, such as an unusually cold winter in the Midwest, spread out over three months.) Since, then, clinical prediction is poorer than linear prediction to begin with, the hypothesis would hold only if linear prediction got much worse over time than did clinical prediction. There is no a priori reason to believe that this differential deterioration in prediction would occur, and none has ever been suggested to me. There is certainly no evidence. Once again, the objection consists of an argument from a vacuum. Particularly compelling is the fact that people who argue that different criteria or judges or variables or time frames would produce different results have had 25 years in which to produce examples, and they have failed to do so.

Psychological

One psychological resistance to using linear models lies in our selective memory about clinical prediction. Our belief in such prediction is reinforced by the availability (Tversky & Kahneman, 1974) of instances of successful clinical prediction—especially those that are exceptions to some formula: "I knew someone once with . . . who . . ." (e.g., "I knew of someone with a tested IQ of only 130 who got an advanced degree in psychology"). As Nisbett, Borgida, Crandall, and Reed (1976) showed, such single instances often have greater impact on judgment than do much more valid statistical compilations based on many instances.
(A good prophylactic for clinical psychologists basing resistance to actuarial prediction on such instances would be to keep careful records of their own predictions about their own patients—prospective records not subject to hindsight. Such records could make all instances of successful and unsuccessful prediction equally available for impact; in addition, they could serve for another clinical-versus-statistical study using the best possible judge—the clinician himself or herself.)

Moreover, an illusion of good judgment may be reinforced due to selection (Einhorn & Hogarth, 1978) in those situations in which the prediction of a positive or negative outcome has a self-fulfilling effect. For example, admissions officers who judge that a candidate is particularly qualified for a graduate program may feel that their judgment is vindicated when that candidate does well, even though the candidate's success is in large part due to the positive effects of the program. (In contrast, a linear model of selection is evaluated by seeing how well it predicts performance within the set of applicants selected.) Or a waiter who believes that particular people at a table are poor tippers may be less attentive than usual and receive a smaller tip, thereby having his clinical judgment vindicated. (This example was provided by Einhorn, Note 5.)

A second psychological resistance to the use of linear models stems from their "proven" low validity. Here, there is an implicit (as opposed to explicit) argument from a vacuum, because neither changes in evaluation procedures, nor in judges, nor in criteria are proposed. Rather, the unstated assumption is that these criteria of psychological interest are in fact highly predictable, so it follows that if one method of prediction (a linear model) doesn't work too well, another might do better (reasonable), which is then translated into the belief that another will do better (which is not a reasonable inference)—once it is found. This resistance is best expressed by a dean considering graduate admissions who wrote, "The correlation of the linear composite with future faculty ratings is only .4, whereas that of the admissions committee's judgment correlates .2. Twice nothing is nothing." In 1976, I answered as follows (Dawes, 1976, pp. 6–7):

In response, I can only point out that 16% of the variance is better than 4% of the variance. To me, however, the fascinating part of this argument is the implicit assumption that that other 84% of the variance is predictable and that we can somehow predict it.

Now what are we dealing with? We are dealing with personality and intellectual characteristics of [uniformly bright] people who are about 20 years old. . . . Why are we so convinced that this prediction can be made at all? Surely, it is not necessary to read Ecclesiastes every night to understand the role of chance. . . . Moreover, there are clearly positive feedback effects in professional development that exaggerate threshold phenomena. For example, once people are considered sufficiently "outstanding" that they are invited to outstanding institutions, they have outstanding colleagues with whom to interact—and excellence is exacerbated. This same problem occurs for those who do not quite reach such a threshold level. Not only do all these factors mitigate against successful long-range prediction, but studies of the success of such prediction are necessarily limited to those accepted, with the incumbent problems of restriction of range and a negative covariance structure between predictors (Dawes, 1975). Finally, there are all sorts of nonintellectual factors in professional success that could not possibly be evaluated before admission to graduate school, for example, success at forming a satisfying or inspiring libidinal relationship, not yet evident genetic tendencies to drug or alcohol addiction, the misfortune to join a research group that "blows up," and so on, and so forth. Intellectually, I find it somewhat remarkable that we are able to predict even 16% of the variance.

But I believe that my own emotional response is indicative of those of my colleagues who simply assume that the future is more predictable. I want it to be predictable, especially when the aspect of it that I want to predict is important to me. This desire, I suggest, translates itself into an implicit assumption that the future is in fact highly predictable, and it would then logically follow that if something is not a very good predictor, something else might do better (although it is never correct to argue that it necessarily will). Statistical prediction, because it includes the specification (usually a low correlation coefficient) of exactly how poorly we can predict, bluntly strikes us with the fact that life is not all that predictable. Unsystematic clinical prediction (or "postdiction"), in contrast, allows us the comforting illusion that life is in fact predictable and that we can predict it.
Ethical

When I was at the Los Angeles Renaissance Fair last summer, I overheard a young woman complain that it was "horribly unfair" that she had been rejected by the Psychology Department at the University of California, Santa Barbara, on the basis of mere numbers, without even an interview. "How can they possibly tell what I'm like?" The answer is that they can't. Nor could they with an interview (Kelly, 1954). Nevertheless, many people maintain that making a crucial social choice without an interview is dehumanizing. I think that the question of whether people are treated in a fair manner has more to do with whether or not they have been dehumanized than does the question of whether the treatment is face to face. (Some of the worst doctors spend a great deal of time conversing with their patients, read no medical journals, order few or no tests, and grieve at the funerals.) A GPA represents 3½ years of behavior on the part of the applicant. (Surely, not all the professors are biased against his or her particular form of creativity.) The GRE is a more carefully devised test. Do we really believe that we can do a better or a fairer job by a 10-minute folder evaluation or a half-hour interview than is done by these two mere numbers? Such cognitive conceit (Dawes, 1976, p. 7) is unethical, especially given the fact that there is no evidence whatsoever indicating that we do a better job than does the linear equation. (And even making exceptions must be done with extreme care if it is to be ethical, for if we admit someone with a low linear score on the basis that he or she has some special talent, we are automatically rejecting someone with a higher score, who might well have had an equally impressive talent had we taken the trouble to evaluate it.) No matter how much we would like to see this or that aspect of one or another of the studies reviewed in this article changed, no matter how psychologically uncompelling or distasteful we may find their results to be, no matter how ethically uncomfortable we may feel at "reducing people to mere numbers," the fact remains that our clients are people who deserve to be treated in the best manner possible. If that means—as it appears at present—that selection, diagnosis, and prognosis should be based on nothing more than the addition of a few numbers representing values on important attributes, so be it. To do otherwise is cheating the people we serve.
Reference Notes

1. Thornton, B. Personal communication, 1977.
2. Slovic, P. Limitations of the mind of man: Implications for decision making in the nuclear age. In H. J. Otway (Ed.), Risk vs. benefit: Solution or dream? (Report LA 4860-MS). Los Alamos, N.M.: Los Alamos Scientific Laboratory, 1972. [Also available as Oregon Research Institute Bulletin, 1971, 11(17).]
3. Srinivasan, V. A theoretical comparison of the predictive power of the multiple regression and equal weighting procedures (Research Paper No. 347). Stanford, Calif.: Stanford University, Graduate School of Business, February 1977.
4. von Winterfeldt, D., & Edwards, W. Costs and payoffs in perceptual research. Unpublished manuscript, University of Michigan, Engineering Psychology Laboratory, 1973.
5. Einhorn, H. J. Personal communication, January 1979.
References

Alexander, S. A. H. Sex, arguments, and social engagements in marital and premarital relations. Unpublished master's thesis, University of Missouri—Kansas City, 1971.
Beaver, W. H. Financial ratios as predictors of failure. In Empirical research in accounting: Selected studies. Chicago: University of Chicago, Graduate School of Business, Institute of Professional Accounting, 1966.
Bowman, E. H. Consistency and optimality in managerial decision making. Management Science, 1963, 9, 310–321.
Cass, J., & Birnbaum, M. Comparative guide to American colleges. New York: Harper & Row, 1968.
Claudy, J. G. A comparison of five variable weighting procedures. Educational and Psychological Measurement, 1972, 32, 311–322.
Darlington, R. B. Reduced-variance regression. Psychological Bulletin, 1978, 85, 1238–1255.
Dawes, R. M. A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 1971, 26, 180–188.
Dawes, R. M. Graduate admissions criteria and future success. Science, 1975, 187, 721–723.
Dawes, R. M. Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.
Dawes, R. M., & Corrigan, B. Linear models in decision making. Psychological Bulletin, 1974, 81, 95–106.
Deacon, E. B. A discriminant analysis of predictors of business failure. Journal of Accounting Research, 1972, 10, 167–179.
de Groot, A. D. Het denken van den schaker [Thought and choice in chess]. The Hague, The Netherlands: Mouton, 1965.
Edwards, D. D., & Edwards, J. S. Marriage: Direct and continuous measurement. Bulletin of the Psychonomic Society, 1977, 10, 187–188.
Edwards, W. M. Technology for director dubious: Evaluation and decision in public contexts. In K. R. Hammond (Ed.), Judgement and decision in public policy formation. Boulder, Colo.: Westview Press, 1978.
Einhorn, H. J. Expert measurement and mechanical combination. Organizational Behavior and Human Performance, 1972, 7, 86–106.
Einhorn, H. J., & Hogarth, R. M. Unit weighting schemes for decision making. Organizational Behavior and Human Performance, 1975, 13, 171–192.
Einhorn, H. J., & Hogarth, R. M. Confidence in judgment: Persistence of the illusion of validity. Psychological Review, 1978, 85, 395–416.
Gardiner, P. C., & Edwards, W. Public values: Multiattribute-utility measurement for social decision making. In M. F. Kaplan & S. Schwartz (Eds.), Human judgment and decision processes. New York: Academic Press, 1975.
Goldberg, L. R. Diagnosticians vs. diagnostic signs: The diagnosis of psychosis vs. neurosis from the MMPI. Psychological Monographs, 1965, 79(9, Whole No. 602).
Goldberg, L. R. Seer over sign: The first "good" example? Journal of Experimental Research in Personality, 1968, 3, 168–171.
Goldberg, L. R. Man versus model of man: A rationale, plus some evidence for a method of improving on clinical inferences. Psychological Bulletin, 1970, 73, 422–432.
Goldberg, L. R. Parameters of personality inventory construction and utilization: A comparison of prediction strategies and tactics. Multivariate Behavioral Research Monographs, 1972, No. 72-2.
Goldberg, L. R. Man versus model of man: Just how conflicting is that evidence? Organizational Behavior and Human Performance, 1976, 16, 13–22.
Green, B. F., Jr. Parameter sensitivity in multivariate methods. Multivariate Behavioral Research, 1977, 12, 263–287.
Hammond, K. R. Probabilistic functioning and the clinical method. Psychological Review, 1955, 62, 255–262.
Hammond, K. R., & Adelman, L. Science, values, and human judgment. Science, 1976, 194, 389–396.
Hoffman, P. J. The paramorphic representation of clinical judgment. Psychological Bulletin, 1960, 57, 116–131.
Holt, R. R. Yet another look at clinical and statistical prediction. American Psychologist, 1970, 25, 337–339.
Howard, J. W., & Dawes, R. M. Linear prediction of marital happiness. Personality and Social Psychology Bulletin, 1976, 2, 478–480.
Kelly, L. Evaluation of the interview as a selection technique. In Proceedings of the 1953 Invitational Conference on Testing Problems. Princeton, N.J.: Educational Testing Service, 1954.
Krantz, D. H. Measurement structures and psychological laws. Science, 1972, 175, 1427–1435.
Krantz, D. H., Luce, R. D., Suppes, P., & Tversky, A. Foundations of measurement (Vol. 1). New York: Academic Press, 1971.
Libby, R. Man versus model of man: Some conflicting evidence. Organizational Behavior and Human Performance, 1976, 16, 1–12.
Marquardt, D. W., & Snee, R. D. Ridge regression in practice. American Statistician, 1975, 29, 3–19.
Meehl, P. E. Clinical versus statistical prediction: A theoretical analysis and a review of the evidence. Minneapolis: University of Minnesota Press, 1954.
Meehl, P. E. Seer over sign: The first good example. Journal of Experimental Research in Personality, 1965, 1, 27–32.
Nisbett, R. E., Borgida, E., Crandall, R., & Reed, H. Popular induction: Information is not necessarily normative. In J. Carroll & J. Payne (Eds.), Cognition and social behavior. Hillsdale, N.J.: Erlbaum, 1976.
Remus, W. E., & Jenicke, L. O. Unit and random linear models in decision making. Multivariate Behavioral Research, 1978, 13, 215–221.
Sawyer, J. Measurement and prediction, clinical and statistical. Psychological Bulletin, 1966, 66, 178–200.
Schmidt, F. L. The relative efficiency of regression and simple unit predictor weights in applied differential psychology. Educational and Psychological Measurement, 1971, 31, 699–714.
Simon, H. A., & Chase, W. G. Skill in chess. American Scientist, 1973, 61, 394–403.
Slovic, P., & Lichtenstein, S. Comparison of Bayesian and regression approaches to the study of information processing in judgment. Organizational Behavior and Human Performance, 1971, 6, 649–744.
Thornton, B. Linear prediction of marital happiness: A replication. Personality and Social Psychology Bulletin, 1977, 3, 674–676.
Tversky, A., & Kahneman, D. Judgment under uncertainty: Heuristics and biases. Science, 1974, 185, 1124–1131.
Wainer, H. Estimating coefficients in linear models: It don't make no nevermind. Psychological Bulletin, 1976, 83, 213–217.
Wainer, H., & Thissen, D. Three steps toward robust regression. Psychometrika, 1976, 41, 9–34.
Wallace, H. A. What is in the corn judge's mind? Journal of the American Society of Agronomy, 1923, 15, 300–304.
Wiggins, N., & Kohen, E. S. Man vs. model of man revisited: The forecasting of graduate school success. Journal of Personality and Social Psychology, 1971, 19, 100–106.
Wilks, S. S. Weighting systems for linear functions of correlated variables when there is no dependent variable. Psychometrika, 1938, 3, 23–40.
Yntema, D. B., & Torgerson, W. S. Man–computer co-operation in decisions requiring common sense. IRE Transactions of the Professional Group on Human Factors in Electronics, 1961, 2(1), 20–26.
Note

Work on this article was started at the University of Oregon and Decision Research, Inc., Eugene, Oregon; it was completed while I was a James McKeen Cattell Sabbatical Fellow at the Psychology Department at the University of Michigan and at the Research Center for Group Dynamics at the Institute for Social Research there. I thank all these institutions for their assistance, and I especially thank my friends at them who helped. This article is based in part on invited talks given at the American Psychological Association (August 1977), the University of Washington (February 1978), the Aachen Technological Institute (June 1978), the University of Groningen (June 1978), the University of Amsterdam (June 1978), the Institute for Social Research at the University of Michigan (September 1978), Miami University, Oxford, Ohio (November 1978), and the University of Chicago School of Business (January 1979). I received valuable feedback from most of the audiences.
Appendix 2

Behavior, Communication, and Assumptions about Other People's Behavior in a Commons Dilemma Situation

Robyn M. Dawes, Jeanne McTavish, and Harriet Shaklee
University of Oregon and Oregon Research Institute
Two experiments investigated effects of communication on behavior in an eight-person commons dilemma of group versus individual gain. Subjects made a single choice involving a substantial amount of money (possible outcomes ranging from nothing to $10.50). In Experiment 1, four communication conditions (no communication, irrelevant communication, relevant communication, and relevant communication plus roll call) were crossed with the possibility of losing money (loss, no loss). Subjects chose self-serving (defecting) or cooperating responses and predicted the responses of other group members. Results showed defection significantly higher in the no-communication and irrelevant-communication conditions than in the relevant-communication and relevant-communication-plus-roll-call conditions. Loss had no effect on decisions. Defectors expected much more defection than did cooperators. Experiment 2 replicated the irrelevant-communication and cooperation effects and compared the predictions of participants with those of observers. The variance of participants' predictions was significantly greater than that of observers, indicating that participants' decisions were affecting their expectations about others' behavior.

Dawes, R. M., McTavish, J., & Shaklee, H. (1977). Behavior, communication, and assumptions about other people's behavior in a commons dilemma situation. Journal of Personality and Social Psychology, 35, 1–11.
When first discussing the prisoner's dilemma, Luce and Raiffa (1957, p. 97) wrote that there ought to be a law against it. There are laws against some prisoner's dilemmas, but modern societies seem to be inventing new ones at an alarming rate. It is, for example, in each individual's rational self-interest to exploit the environment, pollute, and (in some countries) overpopulate, while the collective effect is worse for everyone than if each individual exercised restraint. Thus, individual behavior when faced with such dilemmas is a matter of increasing interest to both social scientists and people in general.

Social dilemmas—such as that in the Prisoner's Dilemma Game—are characterized by two conditions (Dawes, 1975): (a) the antisocial, or defecting, response is dominating for each individual; that is, no matter what other people do, each individual is best off behaving in an antisocial manner; (b) the result is an equilibrium that is deficient. It is an equilibrium because no player is motivated to change, and it is deficient because all individuals would prefer an outcome in which all cooperated to one in which all chose their dominant strategy of defection.

Within the context of these conditions, it is possible to define a number of different social dilemmas. It turns out, however, that the most common experimental games that have been devised for studying behavior in a dilemma situation are all structurally identical, although often stated differently by different authors. These games are (a) the N-person separable prisoner's dilemma (e.g., Hamburger, 1973), (b) games in which payoffs for cooperation and defection are linear functions, with equal slopes, of the number of cooperators (e.g., Kelley & Grzelak, 1972), and (c) games devised according to the principle that profit for defection accrues directly to the defector while loss (which is greater than the gain) is spread out equally among all the players (Dawes, 1975). The purpose of the present research is to examine behavior in such an experimental game—specifically, to look at the roles of communication about the dilemma and at the relationship between one's own behavior and one's expectations about how other people will behave.

Before proceeding to the empirical study, let us first explain the game as characterized by the gain-to-self, loss-spread-out principle. The game is constructed on the basis of Hardin's (1968) analysis of the "tragedy of the commons." Each player can receive a certain amount c for cooperating, but each player can in addition receive an amount d if he or she chooses to defect. The group as a whole is fined d + λ, with λ > 0, for each defecting choice, each player's share of the fine being (d + λ)/N, where N is the number of players. Thus, each player's motive for defection is d − (d + λ)/N, which will be greater than 0 provided that d > λ/(N − 1), a side condition. Note that if there are m cooperators, then the payoff for cooperation is c − (N − m)(d + λ)/N = [(d + λ)/N]m + c − d − λ; the payoff for defection is the same amount incremented by d. Hence, the payoffs for cooperation and defection are linear functions with equal slopes of the number of cooperators, and as Hamburger (1973) has shown, such payoff functions define games that are equivalent to N-person separable prisoner's dilemmas.

Subjects in the experiments described in this paper met in eight-person groups and were faced with the following choices. A cooperative response earned $2.50 with no fine to anyone. A defecting response earned $12.00 with a fine of $1.50 to each group member, including the defector. Thus, each player had an $8.00 motive to defect ($12.00 − $1.50 − $2.50). Of course, if all defect, no one receives anything. If someone cooperates and two or more other group members defect, the cooperator has a negative payoff.
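The payoff rule just described is compact enough to state in code. The following sketch (mine, not the authors'; it parameterizes the experiment with d = $9.50 and λ = $2.50 so that the per-defection fine (d + λ)/N equals $1.50 in an eight-person group) computes both payoffs and verifies the two defining conditions of the dilemma.

    def payoffs(n_defectors, N=8, c=2.50, d=9.50, lam=2.50):
        """Return (payoff to each cooperator, payoff to each defector).

        Each defecting choice fines every group member (d + lam) / N;
        this models the loss condition (no truncation at zero).
        """
        fine = n_defectors * (d + lam) / N
        return c - fine, c + d - fine

    # Universal cooperation pays $2.50 each; universal defection pays $0.00.
    print(payoffs(0)[0], payoffs(8)[1])

    # Dominance: whatever the k other players do, defecting beats
    # cooperating by d - (d + lam)/N = $8.00.
    for k in range(8):
        assert payoffs(k + 1)[1] - payoffs(k)[0] == 8.00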
This investigation is the first of a series of experiments to identify situational and personal variables important to group-interested or self-interested decisions in this dilemma. Two major variables were of interest here.

Opportunity for communication was the first major variable and was manipulated with the expectation that people faced with this dilemma would have a better chance of resolving it if they could communicate with each other. Communication commonly results in increased cooperation in two-person prisoner's dilemmas (e.g., Deutsch, 1960; Loomis, 1959; Radlow & Weidner, 1966; Swensson, 1967; Wichman, 1972) and had a similar effect in a five-person social dilemma involving hypothetical business decisions (Jerdee & Rosen, 1974). Caldwell (1976) found that communication alone did not seem to be sufficient to affect subjects' decisions; nevertheless, his findings were in the right direction (although not significant) and, as he notes (p. 279), "Perhaps with real money subjects would be less inclined to treat the experiment as a competitive game." (The same possibility applies to most of the other research as well.)

Communication effects could have at least three sources. First, the opportunity to communicate allows group members to get acquainted, which could raise their concern for each other's welfare. Second, the relevant information raised through the discussion and appeals for mutual cooperation could persuade group members to cooperate. Third, group members' statements of their own intended decisions could assure other members of their good intentions, leading to higher rates of cooperation. To distinguish between these possibilities, four communication conditions were included in the present design. The no-communication (N) groups worked silently on an unrelated topic before making their decisions in the game. The irrelevant-communication (I) groups were allowed to get acquainted with each other through a group discussion of an unrelated topic but were not permitted to discuss the group dilemma decision. The relevant-communication (C) groups discussed the dilemma situation before making their decisions, and the relevant-communication-plus-vote (C + V) groups ended their discussion with a roll call in which each group member made a nonbinding declaration of intended decision. If a roll call vote was suggested in the C groups, it was stopped by the experimenters. Thus, considering the groups ordered N, I, C, C + V, each of the possible sources of communication effects was systematically added to the conditions of the previous group to see if it incremented the level of cooperation in the group.

A second major variable of interest concerned possible individual differences in cooperators' and defectors' expectations about others' decisions. Prior work by Kelley and Stahelski (1970) on the prisoner's dilemma indicated that cooperators and competitors maintain different world views, with competitors expecting competition from other players and cooperators expecting either cooperation or competition. Subjects in Experiment 1 were asked to predict other group members' decisions to see if similar individual differences occur in the commons dilemma. Further research in Experiment 2 suggests an alternative to Kelley and Stahelski's interpretation of these differences in expectations.

A final variable manipulation was the possibility of losing money in the game. The possibility for cooperators to lose money might increase the risk associated with a cooperative decision, leading to less cooperation. Alternatively, defectors might be more reluctant to defect if their decision caused other group members to lose money. The net effect of these contrary forces was difficult to predict. Equally difficult, however, was the task of designing an experiment where loss of money was possible without violating experimental ethics. The solution used in the experimental work was to have subjects pool their winnings or losses with friends, truncating pooled losses at zero. Thus, an individual subject could be put in a condition where he or she could lose money. Subjects came to the experiment in groups of four friends. Two of the four went to a loss condition where they individually could lose money; two went to a condition in which their potential personal losses were truncated at zero. When the four friends returned from their decision groups, their earnings were pooled and shared equally. If the net total was negative, money was not taken from the group. Thus, subjects' losses could detract from their friends' gains, but subjects would never owe the experimenter money by the end of the experiment.

In sum, communication (N, I, C, C + V) and loss (loss, no loss) conditions were crossed in a 4 × 2 factorial design. Subjects' decisions and their expectations about others' decisions were the two dependent variables of interest.
Experiment 1

Method

Subjects

Subjects were recruited from newspaper advertisements asking for groups of four friends. Eight such groups were scheduled for each session, so that one member from each "friendship group" could participate in separate "decision-making groups" of eight strangers. Since scheduled groups occasionally did not show up, a total of 284 subjects were run in 40 decision-making groups, rather than the anticipated 320.

Friendship Groups

Friendship groups met initially with an experimenter who informed them that each person would go to a different decision group, where she or he would make a decision with seven other people. The four friends would then return to their friendship group, pool their earnings, and divide them equally among themselves. If the total was negative, no member of the friendship group would receive anything (although people who did not win at least $2.00 were contacted later and paid from $1.00 to $2.50, depending on their initial earnings). One member from each friendship group was sent to each of the four communication conditions. Two went to groups in which it was possible to lose money, two to groups in which negative payoffs were truncated at zero. Thus, the eight groups of four friends separated and formed four groups of eight strangers to play the commons dilemma game.

Decision-making Groups

Payoff matrices were determined according to the rule that each member of the decision group would earn $2.50 for a cooperative choice (O) or $12.00 for a defecting choice (X). All group members were fined $1.50 for each person in the group who chose X. When fewer than eight friendship groups showed up for the experiment, the defectors' payoff was reduced by an appropriate amount: to $10.50 for seven-person groups, $9.00 for six-person groups, etc.

Two payoff conditions were included in the experiment. In the loss condition, the payoff to a cooperator was reduced by $1.50 for every defector in the group; in the no-loss condition, cooperators' negative payoffs were truncated at zero. Table 1 indicates all possible outcomes to decision makers under these two conditions.

Opportunity for communication was manipulated in four communication conditions. In the no-communication condition (N), subjects were not permitted to talk to each other. Subjects in this condition worked silently for 10 minutes on an irrelevant task (estimating the percentage of people at certain income levels in Eugene, Oregon, in the United States, etc.) before making their decisions. In the irrelevant-communication condition (I), subjects discussed the same irrelevant topic for 10 minutes before making their decisions. In the relevant-communication condition (C), subjects discussed the commons dilemma decision for 10 minutes before making their decisions. They were not, however, permitted to take a roll call. In the communication-plus-vote condition (C + V), subjects' 10-minute discussion of the commons dilemma decision ended in a roll call—a nonbinding declaration of intended decision.

Table 1 Payoff Matrix

Loss condition

Number        Number        Payoff     Payoff
choosing X    choosing O    to X       to O
0             8             —          2.50
1             7             10.50      1.00
2             6             9.00       –.50
3             5             7.50       –2.00
4             4             6.00       –3.50
5             3             4.50       –5.00
6             2             3.00       –6.50
7             1             1.50       –8.00
8             0             .00        —

No-loss condition

Number        Number        Payoff     Payoff
choosing X    choosing O    to X       to O
0             8             —          2.50
1             7             10.50      1.00
2             6             9.00       0
3             5             7.50       0
4             4             6.00       0
5             3             4.50       0
6             2             3.00       0
7             1             1.50       0
8             0             .00        —
The two loss conditions were crossed with the four communication conditions in a 2 × 4 factorial design. Five groups were run in each condition.

Procedure

Instructions were read to the decision groups as follows:

I would like to explain the decision-making task in which you will now be participating. To insure that all of our subjects receive exactly the same information, I will have to read the instructions. Please listen carefully. I can answer questions at the end.

This table (Table 1) indicates the possible consequences of the decision each of you will be asked to make. You must decide whether to choose an X or an O. You will have to mark an X or an O on the card in private. If you choose an O, you will earn $2.50 minus a $1.50 fine for every person who chooses X. If you choose X, you will earn $2.50 plus $9.50 minus a $1.50 fine for each person, including yourself, who chooses X. (However, as you can see, your payoffs do not go below zero.) By looking at the first row, for example, you can see that if seven of you choose O and one of you chooses X, then those choosing O will earn $1.00 and the person choosing X will earn $10.50.

You will write your code number and decision on the top of the sheet in your envelope. Your decision will be totally private and none of the other participants in this group will know what you decided. You will each be paid and dismissed separately.

On the sheet please indicate what decision you believe each other person here to be making. Beside the code number of each person, mark X or O to indicate the choice you believe that person to be making. Then indicate your confidence level for each judgment with a number from 50 to 100, with 100 indicating complete confidence. If you are just guessing, the probability is 50–50 that you are correct, so you should mark 50 if you have no confidence at all in your predictions. Questions?

Once questions had been answered and group members understood the decision, they proceeded to 10 minutes of discussion or interpolated task, depending on the communication condition. When the 10 minutes were up, subjects made their decisions and predictions of other group members' decisions. Once outcomes had been determined, subjects returned to their friendship groups, where they divided any net gain among themselves.
Results

Because the groups differed in size, results for defection and predicted defection are presented in percentages. Table 2 shows the average proportion of defectors in each of the eight conditions. An analysis of variance based on arc sine transformations of the proportions (where the group is the unit of analysis) indicates that the effect of communication is extremely significant, F(3, 32) = 9.36, p < .001. The loss manipulation was not only nonsignificant but accounted for virtually no variance, as did the Communication × Loss interaction. As can be seen from Table 2, there is a great deal more defection when subjects cannot communicate about the dilemma, even if they interact for 10 minutes about an irrelevant topic. Moreover, the structured communication with the vote did not elicit any more cooperation than did the unstructured communication (73% versus 72% on the average), despite the fact that every subject in the structured communication condition announced an intention to cooperate.

Table 2 Proportion of Subjects Defecting

             No             Irrelevant     Unrestricted   Communication
Condition    communication  communication  communication  plus vote
Loss             0.73           0.65           0.26           0.16
No loss          0.67           0.70           0.30           0.42

The possible-loss manipulation was not only ineffective in eliciting differential cooperation; it was ineffective in eliciting differential predictions about others' behavior as well. In the results about such prediction, potential loss will therefore not be included as a variable. What will be included is the variable concerning whether the individual making the prediction is a cooperator or a defector.

The correlation between the proportion of defections the subject predicted (not including himself or herself) and whether the subject actually defected was .60 (p < .001). Table 3 presents the average proportion of predicted defection on the part of other subjects made by defectors and cooperators in the four different communication conditions (collapsed across loss versus no loss).

Table 3 Proportion of Subjects Predicted to Defect (Subjects Not Included)

             No             Irrelevant     Unrestricted   Communication
Subject      communication  communication  communication  plus vote      Overall
Defectors        0.65           0.61           0.29           0.30          0.54
Cooperators      0.35           0.42           0.08           0.04          0.16
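The arc sine transformation used in these analyses is the standard variance-stabilizing transform for proportions; a minimal Python version (mine, for illustration, applied here to three of the Table 2 entries):

    import numpy as np

    def arcsine_sqrt(p):
        """Variance-stabilizing transform for proportions in [0, 1]."""
        return np.arcsin(np.sqrt(p))

    # Group-level defection proportions prior to the analysis of variance.
    print(arcsine_sqrt(np.array([0.73, 0.26, 0.16])))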
Appendix 2 • 353
An analysis of variance—again on these proportions transformed to arc sines, and again with the group as the unit of analysis—revealed two strong main effects, one for communication condition and one for defectors versus cooperators.2 Overall, more defection is predicted when people cannot communicate, F(3, 31) = 9.86, p < .001, and defectors predict almost four times as much defection as do cooperators, F(1, 31) = 35.93, p < .001. When the subject's own behavior is included in the prediction (i.e., the defector is predicted to defect and the cooperator to cooperate), the overall predictions become even more discrepant—.60 versus .13. The analysis of variance shows virtually identical results; neither it nor the proportions with the subjects included are presented here. In the results that follow, predictions do not include the subjects themselves.

Table 4 presents the proportion of correct predictions made by defectors and cooperators in the four conditions. Again, the analysis of variance was based on the arc sine transformations of these proportions with the group as the unit of analysis. Subjects overall are more accurate at predicting in the communication conditions, F(3, 31) = 11.30, p < .001, which is not exactly surprising. The overall accuracy of each subject, as measured by proportion correct, is directly affected by the match between the subject's base-rate prediction of defection and the actual base rate of defection. If, for example, a subject predicted a proportion r of defection and a proportion p occurred, then even if the predictions were noncontingent—that is, even if the subject could not predict which subjects would defect and which would not—the expected proportion of correct predictions would be rp + (1 − r)(1 − p). Each subject's actual proportion of correct predictions was therefore corrected for this base-rate accuracy, and the residuals were subjected to an analysis of condition by defection. These residual scores are tiny, averaging .03, but significant, F(1, 248) = 8.69, p < .01.

The data were also analyzed for sex differences. Across conditions, there were no significant sex differences in subjects' decisions, χ2(1) = 1.78, p > .10.

Table 4  Proportion of Subjects Correctly Predicted (Subjects not Included)

Subject       No communication    Irrelevant communication    Unrestricted communication    Communication plus vote    Overall
Defectors          0.56                 0.57                        0.67                          0.74                   0.60
Cooperators        0.35                 0.54                        0.73                          0.76                   0.66
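The base-rate correction just described is simple enough to verify directly. A minimal sketch follows; the particular subject's r, p, and observed accuracy are hypothetical numbers chosen for illustration.

```python
def expected_chance_accuracy(r, p):
    # Expected proportion correct when predictions are noncontingent:
    # a fraction r of the others is predicted to defect and a fraction p
    # actually defects, so matches occur at rate r*p + (1 - r)*(1 - p).
    return r * p + (1 - r) * (1 - p)

# Hypothetical subject: predicts 4 of the 7 others will defect (r = 4/7),
# 3 of the 7 actually defect (p = 3/7), and 5 of the 7 individual
# predictions happen to be correct.
r, p = 4 / 7, 3 / 7
observed = 5 / 7
baseline = expected_chance_accuracy(r, p)   # about .49 here
residual = observed - baseline              # accuracy beyond base-rate matching
print(f"baseline = {baseline:.3f}, residual = {residual:+.3f}")
```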
Finally, the data were analyzed for differences in defection as a function of group size. Due to no-shows, there were groups of size 5 (n = 12), size 6 (n = 5), size 7 (n = 10), and size 8 (n = 13); the overall proportions of defectors were .46, .30, .52, and .54, respectively. The data were subjected to a 4 × 2 analysis of variance with unequal cells (Winer, 1962, p. 242), where the two levels of the second factor were defined by collapsing conditions N and I (which elicited little cooperation) and C and C + V (which elicited much). The effect of group size on proportion of defection was nil, F(3, 32) = 1.50, as was the interaction of group size by condition, F(3, 32) = 1.47.3

Discussion

It is not surprising that when people can communicate, they can solve a dilemma better than when they cannot. Simply getting to know other people did not make much of a difference, at least in the 10-minute discussion of the irrelevant-communication condition. Whether subjects in closer or longer-lasting groups would be better able to elicit implicit cooperation is an open question. For reasons described later in this discussion, that question may remain unanswered. At any rate, groups of strangers can in fact elicit cooperative behavior from each other if they are permitted to communicate about the dilemma for 10 minutes. Even in these groups, however, many people lied about their intentions; although every vote was unanimous in favor of cooperating, there was only a single group in which all people actually cooperated. Nevertheless, the overall rate of cooperation in communicating groups was about 75%. (Interestingly enough, if the rate in each group had been exactly 75%, each cooperator would have ended up with no money or a 50¢ loss.)

A much more striking finding concerns the expectations about others' behavior. Defectors predict approximately four times as much defection as do cooperators. The present study is purely correlational, so it is not possible to determine the degree to which perception of others' intentions influences the decision to cooperate or defect and the degree to which such a decision influences judgments about other people's behavior. Experiment 2 assesses the degree of such influence by comparing the judgments of people who are actually making the decisions in such groups with the judgments of observers. The data concerning the prediction of defection indicate that people can accurately predict overall defection and even the sources of defection. But the prediction of the individual source—when corrected for base rate—is rather feeble.

One of the most significant aspects of this study, however, did not show up in the data analysis. It is the extreme seriousness with which
the subjects took the problem. Comments such as, "If you defect on the rest of us, you're going to have to live with it the rest of your life," were not at all uncommon. Nor was it unusual for people to wish to leave the experimental building by the back door, to claim that they did not wish to see the "sons of bitches" who double-crossed them, to become extremely angry at other subjects, or to become tearful. For example, one subject simply assumed that everyone else would cooperate in the no-communication condition, and she ended up losing $8.00, which matched the amount of money her friends had won. She was extremely upset, wishing to see neither the other members of the decision group nor her friends. We are concerned that her experience may have had a very negative effect on her expectations about other people (although, alas, also making her more realistic).

The affect level was so high that we are unwilling to run any intact groups because of the effect the game might have on the members' feelings about each other. The affect level also militates against examining choice visibility. In pretesting, we did run one group in which choices were made public. The three defectors were the targets of a great deal of hostility ("You have no idea how much you alienate me!" one cooperator shouted before storming out of the room); they remained after the experiment until all the cooperators were presumably long gone. With the exception of the one cooperator who left hurriedly, the experimenters calmed all cooperators who were upset. There was, however, no general "debriefing" procedure because there was no deception in the experiment.4
Experiment 2

The purposes of Experiment 2 were to replicate the findings of Experiment 1 and to explore further the source of the high correlation between subjects' own behavior and their expectations about others. Kelley and Stahelski (1970) attributed similar differences between cooperators and defectors in a prisoner's dilemma to stable differences in world view. According to them, competitive people elicit competition from both cooperative and competitive people. Their consistent experience is that people are competitive, leading to a generalized expectancy that others are like themselves. Cooperators' experience is differentiated according to the behavior of others. Cooperative people meet with cooperation from other cooperators and competition from competitive people, resulting in a belief that others are heterogeneous with respect to the competitive dimension. According to this theory, then, defectors come to the experiment with a predisposition to expect others to behave similarly; cooperative people have no such consistent expectation.
Our interest was in the alternative possibility that the decision itself was affecting subjects' expectations about others. A couple of explanations seemed plausible. One source may be motivational: subjects may feel the need to justify their decision—defectors in order to assuage possible guilt over their decision, cooperators to avoid feeling duped. A second source is cognitive: given the belief that people tend to behave similarly in the same situation, a subject who decides to cooperate or to defect may have a rational basis for believing that others will do likewise. Whatever the source, the subjects' decisions themselves would lead them to believe that others' decisions would be like theirs. The expectation, then, is that the predictions of people who actually make the decision will differ from those of people who observe the same process but make no decision. If the decision is affecting participants' expectations, participant cooperators should expect more cooperation and participant defectors should expect more defection; observers who make no decision should not be similarly biased. Since participants' decisions would distort cooperators' and defectors' predictions in opposite directions, the variance of all participants' predictions should be greater than the variance of observers' predictions. On the other hand, if Kelley and Stahelski are correct, observers and participants should have similar world views and should have the same expectations about others' decisions. Thus, the variance of predictions should be roughly the same for participants and observers, since the proportion of potential cooperators and defectors should be the same in each group (given, of course, random assignment to the roles of participant and observer).

Method

The number of conditions in Experiment 2 was reduced to two. Because potential money loss had no effect in Experiment 1, all losses were truncated at zero. Further, because there was no difference between the no-communication and irrelevant-communication conditions or between the communication and communication-plus-vote conditions, only an irrelevant-communication condition and a relevant-communication condition were run—with subjects in the relevant condition being free to hold a roll-call vote or not, depending upon the group interaction. Finally, there was no need to use "friendship groups" because there was no potential monetary loss.

Subjects

Subjects were recruited through newspaper ads, and 16 were scheduled for each session: 8 were to be assigned as participants and the remaining 8 as observers who viewed the interaction through a one-way
mirror. Because subjects sometimes failed to appear at the experiment, 8 were randomly chosen to be participants, with the remainder serving as observers. The result was that there were 160 participants and 149 observers, with 10 groups run in the irrelevant-communication condition and 10 in the relevant-communication condition. Because the previous results had indicated no effect of sex, no attempt was made to balance men and women equally in the roles of participant and observer.

Procedure

The procedure for decision groups was identical to that of the corresponding conditions in Experiment 1. Instructions to observers were as follows:

You will be observing a decision-making task in which the participants must individually decide between two choices: X and O. The outcome of the individuals in the group depends on the number of individuals choosing X and O. [Copies of the matrix and prediction sheets are distributed.] This will also be explained to the participants and should be clear to you. Before the participants make their decisions, you will have an opportunity to observe a 10-minute discussion. Your task will be to predict what each individual in the group will choose: either X or O. In addition, we will ask you to indicate your confidence level for each prediction with a number from 50 to 100, with 100 indicating complete confidence. If you are just guessing, the probability is 50-50 that you are correct, so you should mark 50 if you have no confidence at all in your predictions. Please make these predictions and confidence ratings individually, without consulting one another. In addition, please refrain from commenting about the group as you observe. I will now instruct the decision-making group and will return shortly. Questions?

Observers were placed behind a one-way mirror and made their predictions at the same time as the participants made their predictions and decisions.

Results

The difference between irrelevant communication and relevant communication was replicated. In the irrelevant-communication condition, 76% of the subjects defected, whereas in the relevant-communication condition only 31% did so. Since all groups contained eight people, a one-way analysis of variance was performed on the number of defectors in the two conditions, the groups themselves again being the unit of analysis. The results were significant, F(1, 18) = 41.51, p < .001. (Only one group
in the irrelevant-communication condition had fewer defectors than did any of the groups in the relevant condition.) Analysis of sex differences showed that females were more likely to cooperate than were males, χ2(1) = 3.6, p < .10. Considering the decisions within each condition, this sex difference was strong in the relevant-communication condition, χ2(1) = 7.6, p < .01, but nonexistent in the irrelevant-communication condition.5

All the other effects for participants replicated. As before, there is more predicted defection in the irrelevant-communication condition (.53 vs. .20), F(1, 16) = 16.65, p < .001, and defectors predict more defection than do cooperators (.56 vs. .18), F(1, 16) = 37.86, p < .001. Also as before, there is a higher proportion of correct predictions in the relevant-communication condition than in the irrelevant-communication condition, but the result is not significant, F(1, 16) = 1.72. The finding of positive residual accuracy, however, was not replicated; in fact, the residual accuracy was negative, averaging −.02.

Figure 1 presents the number of defections predicted by cooperators, defectors, and observers, broken down by condition. In order to make the predictions of participants and observers comparable, each participant was assumed to have made an (implicit) prediction for himself or herself; in addition, each observer was randomly paired with a participant, and the prediction of the observer for that participant was changed if it was incorrect. Thus, the observers and participants were put in the same situation—guessing about seven of the choices and knowing about one.6 The variance of predictions for the participants was 7.50, whereas that for the observers was 4.60; testing the null hypothesis of equal variances, F(159, 148) = 1.63, p < .01. In the irrelevant-communication condition, the variances were 5.24 for participants and 3.20 for observers, F(79, 75) = 1.64, p < .05, while in the relevant-communication condition the respective variances were 3.84 and 3.65, F(79, 72) = 1.05, ns.
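The equal-variance comparison reported here is an F test on the ratio of the two sample variances. A minimal sketch, assuming SciPy, run directly from the summary statistics reported in the text:

```python
from scipy import stats

# Summary statistics from the text: prediction variances of 7.50 for the
# 160 participants and 4.60 for the 149 observers.
var_participants, n_participants = 7.50, 160
var_observers, n_observers = 4.60, 149

F = var_participants / var_observers        # larger variance in the numerator
dfn, dfd = n_participants - 1, n_observers - 1
p_upper = stats.f.sf(F, dfn, dfd)           # upper-tail probability of F(159, 148)
print(f"F({dfn}, {dfd}) = {F:.2f}, p = {p_upper:.4f}")
```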
[Figure 1. Predictions of participant defectors, participant cooperators, and observers. Two panels, (a) irrelevant communication and (b) relevant communication, plot percent predicted to defect against the number predicted to defect (0 to 8) for cooperators, defectors, and observers.]
Discussion

The most important finding of this experiment was that having to make the cooperative or defecting choice apparently did affect the estimates of what other people would do, as well as vice versa. Thus, one's choices in such a dilemma situation not only reflect beliefs about others but also affect those beliefs. There are a number of possible explanations.

First, the effect may be pure rationalization. Having decided to cooperate or defect, the group member may attempt to justify the choice by his or her estimates of what others will do. Clearly, a cooperative choice is not very wise if any other people are going to defect, while a defecting choice may be considered downright immoral if most other people cooperate. Thus, the group member may have a motivational reason for believing that other people will behave in a similar manner; specifically, such a belief helps the individual maintain an image of being a rational, moral person.

Second, there may be two closely related cognitive reasons for the behavior to affect the belief. Individuals may use their own behavior as information about what other people would do: if people from similar cultures tend to behave in similar ways in similar situations, and if I do this, it follows that my peers may do so also. In addition, there is the possibility that as I make up my mind to defect or cooperate, the reasons leading to the choice I finally make become more salient, while those leading to the other choice become less so. Then, when attempting to evaluate what other people will do, I see compelling reasons for doing what I do and less compelling ones for doing the opposite. As suggested by our colleague, David Messick (Reference Note 1), one such reason may involve the ethical implication of the choice: when people perceive an ethical dimension or component in a particular social choice, they may have a tendency to assume that other people will have the same perception. (For example, the individual who
perceives a particular social choice in terms of a fundamental religious struggle between good and evil tends to believe that others would view it along the same dimension and that the atheist or agnostic who denies such a perception is simply lying or deluding himself.) Thus, the act of making a choice considered ethical or rational may define the situation for the chooser as one requiring ethicality or rationality and hence bias the chooser to believe that others will behave in a similar ethical or rational manner.
Reference Note

1. Messick, D. Personal communication, 1975.
References

Caldwell, M. Communication and sex effects in a five-person prisoner's dilemma game. Journal of Personality and Social Psychology, 1976, 33, 273–280.
Dawes, R. M. Formal models of dilemmas in social decision making. In M. Kaplan & S. Schwartz (Eds.), Human judgment and decision processes: Formal and mathematical approaches. New York: Academic Press, 1975.
Deutsch, M. The effect of motivational orientation upon trust and suspicion. Human Relations, 1960, 13, 123–139.
Hamburger, H. N-person prisoner's dilemmas. Journal of Mathematical Sociology, 1973, 3, 27–48.
Hardin, G. The tragedy of the commons. Science, 1968, 162, 1243–1248.
Hays, W. L. Statistics for the social sciences (2nd ed.). New York: Holt, Rinehart & Winston, 1973.
Jerdee, T. H., & Rosen, B. Effects of opportunity to communicate and visibility of individual decisions on behavior in the common interest. Journal of Applied Psychology, 1974, 59, 712–716.
Kelley, H. H., & Grzelak, J. Conflict between individual and common interest in an n-person relationship. Journal of Personality and Social Psychology, 1972, 21, 190–197.
Kelley, H. H., & Stahelski, A. V. Social interaction basis of cooperators' and competitors' beliefs about others. Journal of Personality and Social Psychology, 1970, 16, 66–91.
Loomis, J. Communication: The development of trust and cooperative behavior. Human Relations, 1959, 12, 305–315.
Luce, R. D., & Raiffa, H. Games and decisions: Introduction and critical survey. New York: Wiley, 1957.
Messick, D. M. To join or not to join: An approach to the unionization decision. Organizational Behavior and Human Performance, 1973, 10, 145–156.
Radlow, R., & Weidner, M. Unenforced commitments in "cooperative" and "noncooperative" non-constant-sum games. Journal of Conflict Resolution, 1966, 10, 497–505.
Swensson, R. Cooperation in a prisoner's dilemma game: I. The effects of asymmetric payoff information and explicit communication. Behavioral Science, 1967, 12, 314–322.
Wichman, H. Effects of communication on cooperation in a 2-person game. In L. Wrightsman, J. O'Connor, & N. Baker (Eds.), Cooperation and competition. Belmont, Calif.: Brooks/Cole, 1972.
Winer, B. J. Statistical principles in experimental design. New York: McGraw-Hill, 1962.
Notes

1. Messick's (1973) union dilemma game is also structurally identical, although in a probabilistic context.
2. Some groups had to be omitted because they consisted entirely of cooperators or entirely of defectors; hence the number of degrees of freedom is attenuated to 31. Analyses were also performed using individuals as the unit of analysis; the conclusions were virtually identical.
3. A linear trend analysis using the coefficients −3, −1, +1, +3 and the same error term used in the Winer analysis revealed even less evidence for a size effect: the F value was 1.44. (See Hays, 1973, p. 587, for a description of such a trend analysis.)
4. Subjects who received less than $2.00 for participation were contacted later and paid handsomely (up to $2.00) for filling out a brief questionnaire.
5. Subsequent research—not reported here—supports the findings of Experiment 1; we have never been able to replicate the sex effect.
6. The rationale for this procedure may best be explained by example. Suppose that four of the eight group members defect. A cooperator is facing seven other group members, of whom four (57%) defect; a defector is facing seven other group members, of whom three (43%) defect; an observer is facing eight group members, of whom four (50%) defect. By including the participants' own choices in the predictions, cooperators, defectors, and observers are all facing the same situation—eight group members, of whom four defect. Such an inclusion, however, means that for the participants, one of the "predictions" (of their own behavior) is not a guess but a certainty. To make the observers' predictions strictly comparable to the participants', one of their predictions was turned into a certainty by randomly pairing each observer with a group member and changing the prediction for that group member if it was wrong. Many other means of achieving comparability were considered over a 4-month period, and all were rejected; the reasons for the rejections are too lengthy to detail here.
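The single-degree-of-freedom trend contrast in note 3 can be sketched as follows. The means and group counts come from the text; the error mean square here is an invented placeholder (the original contrast was computed with the Winer-analysis error term, which is not reproduced), so the printed F will not match the reported 1.44.

```python
import numpy as np

def linear_trend_F(means, ns, ms_error, coeffs=(-3, -1, 1, 3)):
    # Linear trend contrast across ordered groups with unequal cell sizes:
    # SS_contrast = (sum c_j * mean_j)^2 / sum(c_j^2 / n_j), tested against
    # the ANOVA error mean square with (1, df_error) degrees of freedom.
    c = np.asarray(coeffs, dtype=float)
    m = np.asarray(means, dtype=float)
    n = np.asarray(ns, dtype=float)
    ss_contrast = (c @ m) ** 2 / np.sum(c ** 2 / n)
    return ss_contrast / ms_error

# Group sizes 5-8: mean defection proportions and numbers of groups are
# taken from the text; ms_error is a hypothetical value for illustration.
print(linear_trend_F(means=[0.46, 0.30, 0.52, 0.54],
                     ns=[12, 5, 10, 13],
                     ms_error=0.05))
```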
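Note 6's equating procedure is easy to misread in prose, so the sketch below simulates it for one observer; all members and predictions in it are hypothetical.

```python
import random

def equate_observer(observer_preds, actual_choices, rng=random):
    # Sketch of the note-6 procedure: pair the observer with one randomly
    # chosen group member and force that single prediction to be correct,
    # so the observer, like a participant, "knows" one of the eight
    # choices and guesses about the other seven.
    preds = dict(observer_preds)
    member = rng.choice(sorted(actual_choices))
    preds[member] = actual_choices[member]
    return preds

# Hypothetical eight-person group: True means "defects." Four members
# defect; the observer starts from naive 50-50 guesses.
actual = {member: member < 4 for member in range(8)}
observer = {member: random.random() < 0.5 for member in range(8)}
print(equate_observer(observer, actual))
```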
Acknowledgments

Experiment 1 was presented at the West Coast Conference on Small Group Research in Victoria, British Columbia, on April 16, 1975. The entire research program was supported by the Advanced Research Projects Agency of the Department of Defense (ARPA Order No. 2449) and was partially monitored by the Office of Naval Research under Contract N00014-75-C-0093. We would like to thank Sol Fulero, Lita Furby, Phil Hyman, Mike Moore, Len Rorer, and Myron Rothbart for their help in this project.
APPENDIX 3
Robyn M. Dawes’s Festschrift Remarks
Several years ago, I saw a two-hour TV special about the Watergate scandal. It sequentially covered all the important events, in the order in which they became public knowledge—ranging from the burglary itself to Nixon’s farewell wave to his staff. The condensation of these events into a mere two hours provided—for me at least—a strong emotional impact. (I almost ended up feeling sorry for Nixon. The important word in this confession remains, however, “almost.”) Now I have just recently read the terrific chapters in this Festschrift. All their authors claim, with varying degrees of justification, that I have had an impact on their work. I see these papers presented together in a brief book (I was the person who insisted that they be as concise as possible). Moreover, I read them sequentially in a comparatively brief period of time—little more than a week. They have had a profound emotional impact on me.
APPENDIX 4
Robyn M. Dawes’s Biography and Selected Work
After two years as a philosophy major at Harvard (BA 1958, cl), Dawes fled that field to enter clinical psychology at Michigan. After two years in clinical psychology, he fled that field to enter mathematical psychology—with a content interest in behavioral decision making, social interaction, and attitude measurement (PhD 1963)—and graduate training in mathematics (to substitute for his linguistic incompetence, which made passing the second foreign-language requirement an impossibility). He hung around Ann Arbor for five years—as a research psychologist at the local VA hospital and a member of the university's Psychology Department—while his first wife completed her PhD, and then he moved as an associate professor to Oregon in 1967, where he became a professor in 1971 and served six years as a department head (acting 1972–1973; regular 1979–1980, 1981–1985). He also worked part time at the Oregon Research Institute, where he was a vice president in 1973–1974 and was fired for insubordination. He moved to Carnegie Mellon University (CMU) in the fall of 1985 as a professor of psychology in the Department of Social and Decision Sciences and served as that department's head for a five-year term. He also served as acting department head for a one-year term in 1996. He is now the Charles J. Queenan, Jr. University Professor. Also, in 1999, he spent seven months as the Olof Palme Professor at the University of Stockholm and Göteborg, where he subsequently (2000) received an honorary degree. In October 2002 he was inducted as a Fellow of the American Academy of Arts and Sciences.
Dawes is the author of over 150 articles and 6 books. House of Cards: Psychology and Psychotherapy Built on Myth (1994) has an impressive—or depressing—praise-to-sales ratio; the previous book, Rational Choice in an Uncertain World (1988), received the 1990 William James Award from the Division of General Psychology of the American Psychological Association (APA). In 2001 Reid Hastie and Dawes published a revised version of Rational Choice (Sage), and on his own Dawes published a book on a favorite topic: Everyday Irrationality: How Pseudo-Scientists, Lunatics, and the Rest of Us Systematically Fail to Think Rationally. His book Mathematical Psychology: An Elementary Introduction, published jointly with his late Michigan colleagues Clyde Coombs and Amos Tversky, also remains a personal favorite of his.

Dawes has served as president of the Oregon Psychological Association (1983–1984), president of the Society for Judgment and Decision Making (1988–1989), member of the APA's national Ethics Committee (1985–1988), and on the executive boards of various scientific organizations. In addition, he has served on the National Research Council's AIDS Research and the Behavioral, Social, and Statistical Sciences Committee and has contributed to its reports AIDS, Sexual Behavior, and Intravenous Drug Use (1989) and AIDS: The Second Decade (1990). He has led a life that might politely be described as "interesting."
Selected Works of Robyn M. Dawes

Books

Coombs, C. H., Dawes, R. M., & Tversky, A. (1970). Mathematical psychology: An elementary introduction. Englewood Cliffs, NJ: Prentice-Hall.
Dawes, R. M. (1972). Fundamentals of attitude measurement. New York: Wiley.
Dawes, R. M. (1988). Rational choice in an uncertain world. San Diego, CA: Harcourt, Brace, Jovanovich.
Dawes, R. M. (1994). House of cards: Psychology and psychotherapy built on myth. New York: The Free Press.
Dawes, R. M. (2001). Everyday irrationality: How pseudoscientists, lunatics, and the rest of us systematically fail to think rationally. Boulder, CO: Westview Press.
Hastie, R., & Dawes, R. M. (2001). Rational choice in an uncertain world (2nd ed.). New York: Russell Sage Foundation.
Articles and Book Chapters

Dawes, R. M. (1966). Memory and distortion of meaningful written material. British Journal of Psychology, 57, 77–86.
Dawes, R. M., & Meehl, P. E. (1966). Mixed group validation: A method for determining the validity of diagnostic signs without using criterion groups. Psychological Bulletin, 66, 63–67.
Dawes, R. M. (1971). A case study of graduate admissions: Application of three principles of human decision making. American Psychologist, 26, 180–188.
Dawes, R. M., Singer, D., & Lemons, F. (1972). An experimental analysis of the contrast effect and its implications for intergroup communication and the indirect assessment of attitude. Journal of Personality and Social Psychology, 21, 281–295.
Dawes, R. M., & Corrigan, B. (1974). Linear models in decision making. Psychological Bulletin, 81, 95–106.
Dawes, R. M. (1975). Graduate admissions criteria and future success. Science, 187, 721–723.
Dawes, R. M. (1975). Formal models of dilemmas in social decision-making. In S. Schwartz & M. F. Kaplan (Eds.), Human judgment and decision processes: Formal and mathematical approaches (pp. 87–106). New York: Academic Press.
Dawes, R. M. (1976). Shallow psychology. In J. Carroll & J. Payne (Eds.), Cognition and social behavior (pp. 3–12). Hillsdale, NJ: Lawrence Erlbaum.
Dawes, R. M., McTavish, J., & Shaklee, H. (1977). Behavior, communication and assumptions about other peoples' behavior in a common dilemma situation. Journal of Personality and Social Psychology, 35, 1–11.
Dawes, R. M. (1979). The robust beauty of improper linear models. American Psychologist, 34, 571–582.
Dawes, R. M. (1980). Social dilemmas. Annual Review of Psychology, 31, 169–193.
Dawes, R. M. (1980). Confidence in intellectual judgments vs. confidence in perceptual judgments. In E. D. Lantermann & H. Feger (Eds.), Similarity and choice. Bern, Switzerland: Hans Huber.
Cockayne, E. J., Dawes, R. M., & Hedetniemi, S. (1980). Total domination in graphs. Networks, 10, 211–219.
van de Kragt, A. J. C., Orbell, J. M., & Dawes, R. M. (1983). The minimal contributing set as a solution to public goods problems. American Political Science Review, 77, 112–122.
Dawes, R. M., & Smith, T. E. (1985). Attitude and opinion measurement. In G. Lindzey & E. Aronson (Eds.), Handbook of social psychology (Vol. I, pp. 509–566). New York: Random House.
Dawes, R. M. (1986). Representative thinking in clinical judgment. Clinical Psychology Review, 6, 425–441.
Dawes, R. M., Orbell, J. M., Simmons, R. T., & van de Kragt, A. J. C. (1986). Organizing groups for collective action. American Political Science Review, 80, 1171–1185.
Orbell, J. M., van de Kragt, A. J. C., & Dawes, R. M. (1988). Explaining discussion-induced cooperation. Journal of Personality and Social Psychology, 54, 811–819.
Caporael, L., Dawes, R. M., Orbell, J. M., & van de Kragt, A. J. C. (1989). Selfishness examined: Cooperation in the absence of egoistic incentives. Behavioral and Brain Sciences, 12, 683–739.
Dawes, R. M. (1989). Statistical criteria for establishing a truly false consensus effect. Journal of Experimental Social Psychology, 25, 1–17.
Dawes, R. M., Faust, D., & Meehl, P. E. (1989). Clinical versus actuarial judgment. Science, 243, 1668–1674.
Dawes, R. M. (1991). Social dilemmas, economic self-interest, and evolutionary theory. In D. R. Brown & J. E. K. Smith (Eds.), Recent research in psychology: Frontiers of mathematical psychology: Essays in honor of Clyde Coombs (pp. 53–79). New York: Springer-Verlag.
Pearson, R. W., Ross, M., & Dawes, R. M. (1991). Personal recall and the limits of retrospective questions in surveys. In J. M. Tanur (Ed.), Questions about questions: Inquiries into the cognitive bases of surveys (pp. 65–94). New York: Russell Sage Foundation.
Dawes, R. M. (1992). The importance of alternative hypotheses—and hypothetical counterfactuals in general—in social science. General Psychologist, 28, 2–7.
Dawes, R. M. (1993). The prediction of the future versus an understanding of the past: A basic asymmetry. American Journal of Psychology, 106, 1–24.
Dawes, R. M., Mirels, H. L., Gold, E., & Donahue, E. (1993). Equating inverse probabilities in implicit personality judgments. Psychological Science, 6, 396–400.
Dawes, R. M., & Mulford, M. F. (1993). Diagnoses of alien kidnappings that result from conjunction effects in memory. Skeptical Inquirer, 18, 50–51.
Orbell, J. M., & Dawes, R. M. (1993). Social welfare, cooperators' advantage, and the option of not playing the game. American Sociological Review, 58, 787–800.
Dawes, R. M. (1994). AIDS, sterile needles, and ethnocentrism. In L. Heath, R. S. Tindale, J. Edwards, E. Posavac, F. B. Bryant, E. Henderson-King, Y. Suarez-Balcazar, & J. Myers (Eds.), Social psychological applications to social issues. III: Applications of heuristics and biases to social issues (pp. 31–44). New York: Plenum Press.
Dawes, R. M., & Orbell, J. M. (1995). The benefit of optional play in anonymous one-shot prisoner's dilemma games. In K. Arrow, R. Mnookin, L. Ross, A. Tversky, & R. Wilson (Eds.), Barriers to conflict resolution (pp. 62–85). New York: Norton.
Dawes, R. M., & Mulford, M. (1996). The false consensus effect and overconfidence: Flaws in judgment, or flaws in how we study judgment? Organizational Behavior and Human Decision Processes, 65, 201–211.
Dawes, R. M. (1998a). Behavioral decision making and judgment. In D. Gilbert, S. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (Vol. II, pp. 497–548). Boston: McGraw-Hill.
Dawes, R. M. (1998b). Standards for psychotherapy. In H. S. Friedman et al. (Eds.), Encyclopedia of mental health (Vol. 3, pp. 589–597). San Diego, CA: Academic Press.
Dawes, R. M., & Messick, D. M. (2000). Social dilemmas. International Journal of Psychology, 35, 111–116.
Swets, J. A., Dawes, R. M., & Monahan, J. (2000). Psychological science can improve diagnostic decisions. Psychological Science in the Public Interest, 1 (special issue).
Dana, J., & Dawes, R. M. (2004). The superiority of simple alternatives to regression for social science predictions. Journal of Educational and Behavioral Statistics, 29, 317–331.
Dawes, R. M. (2005). The ethical implications of Paul Meehl's work on comparing clinical versus actuarial prediction methods. Journal of Clinical Psychology, 61, 1245–1255.
Contributors

Moty Amar completed his BA with honors in economics, sociology, and anthropology. He received his MBA from the Hebrew University in Jerusalem, Israel, in 1999. He is presently writing his PhD dissertation, titled "Generating a Placebo Effect Through Marketing Actions," under the supervision of Prof. Maya Bar-Hillel. Since 2001, Moty has taught marketing and management at The Hebrew University as well as at the Ono Academic College.

Hal R. Arkes is a professor of psychology; professor of health services, management, and policy; and senior scholar in the Center for Health Outcomes, Policy, and Evaluation Studies, all at The Ohio State University. His research interests include medical, economic, and legal decision making. During the 1990s he served for four years as a program officer at the National Science Foundation. He is a former president of the Society for Judgment and Decision Making.

Maya Bar-Hillel is a professor of psychology at The Hebrew University, where she also received her PhD under the supervision of the late Amos Tversky. Maya is former director of The Hebrew University's Center for the Study of Rationality and former president of the Society for Judgment and Decision Making. Like Robyn Dawes, she has always been interested in debunking psychological myths and legends, as well as pseudopsychologies.

Cristina Bicchieri is the Carol and Michael Lowenstein Professor of Philosophy and Legal Studies at the University of Pennsylvania. She has done extensive research in the foundations of game theory and, most recently, on the nature and dynamics of social norms. She is now working on the role of behavioral research in reshaping our views about ethical choices. She is the author of The Grammar of Society: The Nature
and Dynamics of Social Norms (Cambridge University Press, 2006) and Rationality and Coordination (Cambridge University Press, 1993).

Marilynn Brewer received her PhD in social psychology at Northwestern University in 1968 and is currently a professor of psychology and Eminent Scholar in Social Psychology at The Ohio State University. Her primary areas of research are the study of social identity, collective decision making, and intergroup relations, and she is the author of numerous research articles and books in this area. Dr. Brewer was the recipient of the 1996 Lewin Award from SPSSI, the 1993 Donald T. Campbell Award for Distinguished Contributions to Social Psychology from the Society for Personality and Social Psychology, and the 2003 Distinguished Scientist Award from the Society of Experimental Social Psychology. In 2004 she was elected a fellow of the American Academy of Arts and Sciences, and in 2007 she received the Distinguished Scientific Contribution Award from the American Psychological Association.

David V. Budescu is a professor of quantitative psychology at the University of Illinois at Urbana-Champaign. His research is in the areas of human judgment, individual and group decision making under uncertainty and with incomplete and vague information, and statistics for the behavioral and social sciences. He is or has been on the editorial boards of Applied Psychological Measurement, Journal of Behavioral Decision Making, Journal of Mathematical Psychology, Journal of Experimental Psychology: Learning, Memory & Cognition (2000–2003), Multivariate Behavioral Research, Organizational Behavior and Human Decision Processes (1992–2002), and Psychological Methods (1996–2000). He is past president of the Society for Judgment and Decision Making (2000–2001), a fellow of the Association for Psychological Science, and an elected member of the Society of Multivariate Experimental Psychologists.

Stephanie J. Byram received her PhD from the Department of Social and Decision Sciences, Carnegie Mellon University. Her research included studies of how people estimate the time needed to complete tasks, how women understand the risks of breast cancer and the benefits of mammography, how women understand the chronic and acute risks of breast implants, and how to elicit probability judgments, especially for very small risks. Her studies combined analytical and empirical methods, including experiments, structured surveys, and in-depth interviews. She died of breast cancer on June 9, 2001. Knowing Stephanie celebrates her life, with her words, photographs by Charlee Brodsky, and an essay by Jennifer Matesa (University of Pittsburgh Press, 2003).
Linnda R. Caporael is a professor of psychology in the Department of Science and Technology Studies at Rensselaer Polytechnic Institute. As a Fulbright-Hayes Scholar, she studied human ethology at the Institute of Child Development, University of London. She has also been a Visiting Scientist in Invertebrate Paleontology and in Anthropology at the American Museum of Natural History. Her research interests are biology from a cultural perspective and culture from a biological perspective. She uses deviations from rationality, broadly construed, as clues about the implications of humans as a fundamentally social species. Her work has appeared in Science, Behavioral and Brain Sciences, Journal of Personality and Social Psychology, Annual Review of Psychology, and Social Psychology: Handbook of Basic Principles, 2nd Edition (E. T. Higgins & A. Kruglanski, Eds.). She received her PhD from the University of California, Santa Barbara.

Jason Dana is an assistant professor in the Department of Psychology and an affiliate of the Philosophy, Politics, and Economics program at the University of Pennsylvania. He received his PhD in behavioral decision research from Carnegie Mellon University. His research interests lie broadly within the area of judgment and decision making and include topics such as clinical judgment, the psychology of fairness and altruism, and the efficiency of "improper" linear models.

David Faust, PhD, is a professor in the Department of Psychology, University of Rhode Island, and holds an affiliate appointment in the Department of Psychiatry and Human Behavior, Brown University Medical School. Dr. Faust was previously the director of psychology at Rhode Island Hospital and affiliated hospitals. He has published numerous articles and books on such topics as clinical judgment and decision making, psychology and law, and the philosophy of science. He is currently serving as senior author and editor for the upcoming revision of Ziskin's classic text on psychology and law. He has been the recipient of multiple awards and honors in his field.

Baruch Fischhoff is Howard Heinz University Professor in the Department of Social and Decision Sciences and the Department of Engineering and Public Policy at Carnegie Mellon University, where he heads the decision sciences major. His research attempts to address simultaneously issues of basic and applied interest in areas such as valuing environmental goods, improving adolescent decision making, preventing sexual assault, communicating about risks, managing terrorism, and assessing the practical value of basic science. He is a member of the
Institute of Medicine of the National Academy of Sciences and a past president of the Society for Judgment and Decision Making and the Society for Risk Analysis. He currently serves on the Environmental Protection Agency's Scientific Advisory Board, where he chairs the Homeland Security Advisory Committee, and on the Department of Homeland Security Science and Technology Advisory Committee. He has coauthored or edited four books: Acceptable Risk (1981), A Two-State Solution in the Middle East: Prospects and Possibilities (1993), Preference Elicitation (1999), and Risk Communication: The Mental Models Approach (2001).

Eric Gold heads Fidelity Investments' Center for Applied Behavioral Economics. Prior to joining Fidelity in 2006, Dr. Gold founded a software and consulting company, Gold Objects, Inc., where he worked as a consultant to companies such as SEI Investments, Banker's Trust, and Salomon Smith Barney. Gold has also worked at Morgan Stanley, the IBM Thomas J. Watson Research Center, and the Tektronix Computer Research Lab. Gold received a BA in psychology from Cornell University, an MS in experimental psychology from the University of Oregon, an MS in computer science from Yale University, an MA in decision making and an MS in statistics, both from Carnegie Mellon University, and a PhD in behavioral decision theory from Carnegie Mellon.

Gordon Hester received his PhD in Public Policy Analysis from Carnegie Mellon University in 1990. He worked at the Electric Power Research Institute in Palo Alto, California, from 1990 to 2003. There he conducted and managed research on emissions trading, assessment and management of possible risks from electric and magnetic field exposure, and risk communication. Since 2004 he has worked as a personal chef in the San Francisco Bay area.

Joachim I. Krueger is a professor of psychology and human development at Brown University. He received his PhD from the University of Oregon. His research interests include self-perception and intergroup relations. The (ir)rationality of inductive reasoning in social context is the broad theme underlying his work. As a member of the APS Task Force on Self-esteem, he published an extensive report on the causal role of self-esteem in social perception and behavior (Psychological Science in the Public Interest, 2003). With M. Alicke and D. Dunning, he edited a volume on The Self in Social Judgment (Psychology Press, 2005).
David Messick is the Morris and Alice Kaplan Professor Emeritus of Ethics and Decision in Management of the Kellogg School at Northwestern University. He received his PhD at the University of North Carolina, Chapel Hill, and was a professor at the University of California, Santa Barbara, for nearly 30 years before accepting a chair at Kellogg. He is a social psychologist whose research areas are social and ethical decision making and information processing. His recent interests have been in the psychology of leadership.

Don Moore is an associate professor of organizational behavior at Carnegie Mellon University's Tepper School of Business and holder of the Carnegie Bosch Faculty Development chair. He received his PhD in organization behavior from Northwestern University. His research interests include overconfidence; bargaining and negotiation; comparative judgment, especially with regard to when people believe themselves to be better or worse than others; decision making and decision-making biases; and ethical issues in business. His research has appeared in Organizational Behavior and Human Decision Processes, the Journal of Personality and Social Psychology, Organization Science, Experimental Economics, and the Annual Review of Psychology.

Lisa M. Schwartz, MD, MS, is an associate professor of medicine and of community and family medicine at Dartmouth Medical School (Hanover, NH), a general internist, and the co-director of the VA Outcomes Group at the Department of Veterans Affairs Medical Center, White River Junction, VT. Her research (in collaboration with Steven Woloshin) focuses on improving the communication of medical information to patients, physicians, journalists, and policymakers. She is the coauthor of Know Your Chances: Understanding Health Statistics (University of California Press, 2008).

Deborah Small, PhD, is an assistant professor of marketing at the University of Pennsylvania. She also holds a secondary appointment in psychology and is a fellow of the Risk Management and Decision Processes Center. Her research lies at the interface of psychology and economics and examines fundamental processes that underlie judgment and decision making. Much of her work has public policy and social marketing implications. She received her PhD in psychology and behavioral decision research from Carnegie Mellon University.

Steven Woloshin, MD, MS, is an associate professor of medicine and of community and family medicine at Dartmouth Medical School
(Hanover, NH), a general internist, and senior researcher in the VA Outcomes Group at the Department of Veterans Affairs Medical Center, White River Junction, VT. His research (in collaboration with Lisa Schwartz) focuses on improving the communication of medical information to patients, physicians, journalists, and policymakers. He is the coauthor of Know Your Chances: Understanding Health Statistics (University of California Press, 2008).
Subject Index

A
Ad hoc categories, 37–40
Adolescents, 23
Altruism
  reciprocal, 215
  social dilemma games and, 237–238
  ultimatum games and, 192–193
Alvarado Score, 61
Ammunition, evaluation of, 332–334
Animistic thinking, 22–23
Anthropomorphism, 22–24
Applicant selection, 47–48, 334, 338–339
Aristotle, 24
Associationist thinking, 5
Attitude measurement, 6–7
Attribution errors, 301
Autopredation, 293–294
Availability heuristic, 6
B Baiul, Oksana, 68 Base rates attitude–behavior correlation and, 7 clinical assessment and, 1–2 differential explanation effects and, 167–168 Bayesian analysis clinical diagnosis and, 3 false consensus effect and, 141–142 overgeneralization and, 3–4 predictive value of, 8–9 Better-than-average (BTA) effects base rate neglect and, 167–168 dates test experiment, 161–166 general discussion, 166–167
normative explanations, 143–146 overview, 142–143 trivia quiz experiment, 152–161 Bootstrapping, 327–329 Breast cancer risk study data analysis, 257–259 discussion, 261–271 expert model and, 251–254 interview method, 254–257 results, 259–266 Broken leg cues, 61–62 BTA effects, see Better-than-average effects Bullets, evaluation of, 332–334 Bystander effect, 242–243
C Cancer risk, see Breast cancer risk study Categories ad hoc, 37–40 social, 123–129 Chess, 325 Children, 22–23 Clinical judgment base rates and, 1–2, 4 Bayesian thinking and, 3 vs. linear models, 334–339 Clinical Versus Statistical Prediction: A Theoretical Analysis and a Review of the Evidence (Meehl), 322 Coin flips, see Gambler’s fallacy Collective identity, 286–287 Commons dilemma study background, 346–349
377
ER59969.indb 377
3/21/08 10:51:56 AM
378 • Subject Index experiment 1, 349–355 experiment 2, 355–360 Comparison, social categorization and, 123–129 ingroup favoritism and, 129–131 relativistic nature of, 111–112 self-enhancement and, 113–115 social projection and, 116–123 Computability, 248–249 Consistent contributors (CCs), 243 Consumer warnings, 249 Contrast effect, 7 Cooperation, see Ingroup cooperation Core configurations, 279–284 Corporate social responsibility (CSR), 236–241 Correlation attitude measurement and, 6 base rates and, 4, 7 weights used in, 72 Criterion judgments, 5 Criterion variables, 327 CSR (corporate social responsibility), 236–241 Cues broken leg, 61–62 redundant, 75–76
D Danish rescue, 303–305 Dates test experiment, 161–166 Decision trees, 247 Demes, 283, 287 Dependence, 22 Depersonalized trust, 216–217 Desirability effects overview, 175–177 study 1, 177–181 study 2, 181–184 Developmental studies, 22–23 Diagnosis, clinical, see Clinical judgment Diagnostic support systems, 60–64 Dictator games, 192–193, 207–210 Differential explanation effects base rate neglect and, 167–168 dates test experiment, 161–166
ER59969.indb 378
general discussion, 166–167 relative probabilities and, 146–152 trivia quiz experiment, 152–161 Differential information effects defined, 143–144 false consensus effect and, 168–169 testable implications, 145–146 Disaggregated ratings hiring decisions and, 48 methodological problems of studying, 54–56 of NIH proposals, 51–53, 59 of NSF proposals, 48–51, 59–60 political resistance to, 53–54 resistance to, 64–68 Society for Medical Decision Making submissions and, 56–59 of student performance, 47–48 vs. holistic judgments, 8 Disjunctive relations attitude measurement and, 7 overgeneralization and, 2–3
E Elderly individuals, 2–3 Emergent-property argument, 131–132 Epistemic projects, 284–289 Ethical decisions, 300–305 Evo-devo, 278 Evolutionary scenarios, see Origin stories Exclusive sets, 7 Expert models of breast cancer risk, 251–254 overview, 247–251
F Fairness, 187–189; see also Dictator games; Ultimatum games False consensus effect; see also Projection, social; Truly false consensus effect Bayesian logic and, 141–142 debunking of, 9 differential information effects and, 168–169 Favoritism, ingroup, 129–131
3/21/08 10:51:57 AM
Subject Index • 379 Figure skating, 67–68 Firms, good, 236–241
G Gambler’s fallacy boundary conditions of, 6 conclusions, 43 defined, 21 experiments, 25–43 previous research, 21–24 Gambling devices, 22 Gene’s eye view, 278–279 Global vs. local perspectives, 239–240 Goal transformation, 223–225 Golden Rule, 302–303 Granting agencies decision making by, 8 National Science Foundation, 48–51 NIH, 51–53, 59 NSF, 59–60 Grundtvigian ideals, 304–305, 318
H Hesiod, 290 Hiring judgments, 48 Holistic ratings hiring decisions and, 48 student performance and, 47–48 vs. actuarial decision making, 7 vs. disaggregated ratings, 8; see also Disaggregated ratings Höss, Rudolf, 300–302 Hot hands, 40 Human–machine systems, 23–24
I IAT (Implicit Association Test), 130–131 Iconography, evolutionary, 289–293 Identity, social group differences and, 111 levels of, 286–287 as means of coordination, 285 Illusions, cognitive, 116 Imaginaries, social, 288–289, 298 Implicit Association Test (IAT), 130–131 Improper linear models applications of, 8, 326–327
ER59969.indb 379
bootstrapping and, 327–329 choice of predictor variables and, 325–326 conditions favoring, 80–84 defined, 72–75, 323 implications, 84–87 literature on, 324–325 of marital happiness, 324 null hypothesis significance testing and, 9 as shrinkage estimators, 77–80 social predictions and, 75–77 vs. random linear models, 329–331 Incremental validity, 8 Inequality aversion, 193–195, 200 Influence diagrams, 247 Ingroup cooperation depersonalized trust and, 216–220 outgroups and, 225–226 overview, 215–216 reasons for, 220–225 Ingroups favoritism and, 129–131 social projection and, 123–129 Intentional systems, 22 Interviews, 247–248 Inversion, 294–295
K Kerrigan, Nancy, 68
L Linear models, 334–339; see also Disaggregated ratings; Improper linear models; Proper linear models Local vs. global perspectives, 239–240 Lucretius, 290
M Macrodemes, 282–283, 287 Malpractice suits, 61 Marital happiness, 324 Max, Gabriel von, 291–292 Medical decision making diagnostic support systems and, 60–64
3/21/08 10:51:57 AM
380 • Subject Index rating of conference submissions, 56–59 Memory gambler’s fallacy and, 32–36 set relations and, 2–3 Mental models approach, 247–251; see also Breast cancer risk study Mental retardation animistic thinking and, 23 Meta-science need for, 105–106 overview, 106–108 theory evaluation parameters and, 97–104 Minnesota Multiphasic Personality Inventory (MMPI), 328, 329, 334 Modes of thought, 285–286 Most Valuable Player (MVP) awards, 66–67 Multicolinearity, 76 Multilevel selection theory, 278
N Naive physics, 22, 24 Narrative modes of thought, 285–286 Narratives; see also Origin stories moral choices and, 300–305 reasoning errors and, 2–3 role of, 299–300 National Institutes of Health (NIH), 51–53, 59 National Science Foundation (NSF), 48–51, 59–60 Nazism, 300–302 Negatively skewed distributions, 143 Nested relations, 2–3 Norms conditions for existence of, 210–211 ultimatum games and, 196–201 vs. preferences, 187–189 Null hypothesis significance testing (NHST), 9
O Obligatory interdependence, 277–278 Origin stories conceptual structures for, 293–296
ER59969.indb 380
iconography, 289–293 as scientific hypotheses, 296–299 Ottowa Ankle Rule, 61 Outgroups ingroup cooperation and, 225–226 social projection and, 123–129 Overgeneralization Bayesian analysis and, 3–4 disjunctive relations and, 2–3
P Paradigmatic modes of thought, 285–286 Physics, naive, 22, 24 Popper, Karl, 8 Positively skewed distributions, 145 Predictive validity, 4–5, 7 Preferences, social ultimatum games and, 193–196 vs. norms, 187–189 Prejudice, 4–5 Prime movers, 293–294 Prisoner’s dilemma basic format, 233–234 corporate performance and, 236–241 declining to play, 235–236 group communication study, 235 social projection and, 116 tit-for-tat strategy, 215, 234–235 Projection, social categorization and, 123–129 ethics and, 301–303 ingroups and, 221–223 social comparison and, 116–123 Projection of past onto present, 294–295 Proper linear models, 322–323, 326–327; see also Improper linear models; Linear models Proposal ratings at NIH, 51–53, 59 at NSF, 48–51 political aspects of, 53–54 Pseudo-rationality, 84–85 Putative biases, 9
Q Quakers, 302–303
3/21/08 10:51:57 AM
Subject Index • 381
R Random linear models, 331 Random weights, 7, 73 Rationality, 5 Redundant cues, 75–76 Regression analysis, 7, 8–9, 85 Relational identity, 286–287 Relative perception, 111–113 Relative probabilities, 146–152 Repeated assemblies, 278–279 Representativeness, 22 Reputation building, 240–241 Risk communication; see also Breast cancer risk study difficulties with, 245–247 mental models approach, 247–251 Rorschach test, 1–2
S Sample size, 326 Schizophrenia, 23 Scientific hypotheses criteria for appraisal of, 8 origin stories as, 296–299 Self-enhancement, 113–115 Self-esteem, 129 Selfish gene theory criticism of, 278–279 religious parallels, 296 role in evolutionary theory, 275 Self–other comparisons, 8 Self-perception relativistic nature of, 111–113 self-enhancement and, 113–115 Sensitivity, 4 Set relations attitude measurement and, 7 overgeneralization and, 2–3 Shrinkage estimators, 77–80 Significance testing, 9 Single variable rule, 72–73 Smith, Worthington G., 292 Social comparison theory, see Comparison, social Social identity theory, see Identity, social Sociality, obligatory, 277–278 Social predictions, 75–77
ER59969.indb 381
Social projection, see Projection, social Society for Medical Decision Making, 56–59 Specificity, 4 Stereotyping, 123–129 Stories, see Narratives; Origin stories Structural availability bias, 6 Student performance, 47–48, 334, 338–339 Sucker’s payoff, 220 Swimming metaphor, 5–6
T “Take the best” strategy, 73, 81 Task groups, 279, 281 TFCE (truly false consensus effect), 121–122 Theory evaluation; see also Meta-science application to meta-science, 97–104 parameters, 94–96 Tit-for-tat (TFT) strategy, 215, 234–235, 239 Trivia quiz experiment, 152–161 Truly false consensus effect (TFCE), 121–122 Trust, 216–220
U Ultimatum games; see also Dictator games with asymmetric information and payoffs, 201–203 with different alternatives, 203–205 with framing, 205–207 overview, 189–193 social norms and, 196–201 social preferences and, 193–196 Uncertainty, judgment under, 6 Unit weights, 7, 72, 73–74
V Virtuous circle hypothesis, 240–241
W Warnings, 249 Weightings, 7, 72–74 Wishful thinking, see Desirability effects Witt, Katarina, 67–68 Worse-than-average (WTA) effects, 166–167, 168–169
3/21/08 10:51:57 AM
ER59969.indb 382
3/21/08 10:51:58 AM
Author Index

A
Abelson, R. P., 289, 297, 314
Aberson, C. L., 129, 132
Abrams, D., 112, 135
Acevedo, M., 10, 19, 116, 136, 139, 144, 172, 221, 230
Adelman, H. T., 303, 308
Adelman, L., 332, 334, 341
Ahumada, A., 21, 45
Aiken, L. S., 269, 271
Alexander, R. D., 225, 226, 227
Alexander, S. A. H., 324, 340
Alicke, M. D., 144, 169
Allport, G. W., 124, 132
Alvarado, A., 61, 68
Amann, K., 281, 308
Ames, D. R., 123, 132
Anderson, J. H., 24, 44
Andreoni, J., 192, 211
Argote, L., 281, 313
Arkes, H. R., 63, 65, 66, 68, 69, 70, 97, 105, 108, 109, 128, 132
Armstrong, J. S., 177, 185
Aronson, J., 287, 315
Ash, F. E., 23, 46
Ashburn, N. L., 130, 132
Ashton, R. H., 66, 69
Atman, C. J., 247, 271, 273
Axelrod, R., 215, 228, 234, 235, 243
Axsom, D., 145, 170
Ayton, P., 40, 44
B Babad, E., 175, 177, 184, 185 Babcock, L., 142, 169 Balshem, M., 269, 271
Bar-Hillel, M., 175, 176, 177, 180, 184, 185 Barkow, J. H., 293, 299, 308 Baron, J., 116, 137 Baron, R. M., 157, 169, 281, 310 Barsalou, L. W., 37, 44 Bartlett, F. C., 2, 17, 250, 271 Bartz, W. H., 23, 45 Baumeister, R. F., 113, 133, 142, 166, 170 Bazerman, M. H., 143, 169, 170, 173 Beaver, W. H., 329n, 340 Bedau, M., 132, 133 Behrend, D. A., 23, 44 Bell, C. R., 23, 44 Benabou, R., 142, 170 Ben-Tal, A., 78, 88 Berg, J., 217, 228 Bicchieri, C., 188, 196, 199, 204, 205, 209, 211 Billig, M. G., 11, 19 Binford, L., 279, 308 Biradavolu, M., 152, 172 Birdsell, J. B., 282, 308 Birnbaum, M., 323n, 340 Black, W. C., 152, 174 Blackburn, J. L., 49, 69 Blanton, H., 128, 133, 145, 170 Block, J., 131, 133 Blount, S., 197, 211 Bluemke, M., 128, 134 Bohnet, I., 198, 206, 212 Boinski, S., 284, 308 Borgida, E., 336, 342 Bostrom, A., 247, 248, 259, 271, 273 Bowler, P. J., 295, 308 Bowles, S., 198, 212 Bowman, E. H., 329, 340 Boyd, R., 198, 212, 278, 308
Boydston, E. E., 285, 312
Branscomb, L. M., 24, 44
Brem, S. K., 299, 308, 317
Brent, E., 175, 184, 185
Brewer, M. B., 130, 133, 215, 216, 218, 223, 225, 226, 227, 228, 229, 275, 278, 281, 286, 287, 288, 298, 303, 308, 310, 314, 316
Brigham, J. C., 124, 133
Brock, T. C., 297, 311
Brown, D. E., 296, 308
Brown, J. D., 112, 117, 138, 140, 142, 144, 170, 173
Brown, L. B., 23, 24, 44
Brown, P. J., 77, 87
Brown, R., 111, 133, 225, 229
Bruine de Bruin, W., 144, 170, 250, 251, 262, 267, 271, 272, 273
Bruins, J. J., 219, 228
Brumbach, H., 279, 281, 312
Bruner, J. S., 275, 276, 285, 287, 289, 296, 299, 300, 307, 308, 309
Buchan, N., 216, 228
Buckser, A., 303, 304, 305, 309, 319
Budescu, D. V., 156, 171, 175, 176, 177, 180, 184, 185
Buller, D. J., 309, 318
Bullock, M., 23, 44
Bulte, E., 294, 312
Bundy, R. P., 12, 19
Burnham, T., 296, 309
Burrus, J., 147, 159, 172
Burson, K. A., 167, 170
Bush, R. R., 21, 44
Buss, D. M., 115, 133
Buss, L. W., 278, 309
Butte, G., 251, 273
Byram, S., 251, 272, 274
C
Cadinu, M. R., 116, 124, 130, 133, 222, 228
Cain, D. M., 145, 173
Caldwell, M., 347, 360
Camerer, C., 191, 198, 211, 212
Campbell, D. T., 225, 226, 228, 278, 280, 282, 287, 309
Campbell, J., 113, 133
Caporael, L. R., 23, 44, 225, 228, 275, 276, 278, 279, 281, 282, 284, 285, 286, 287, 298, 306, 308, 309, 310
Capozza, D., 111, 133
Caramazza, A., 24, 44
Carrithers, M., 303, 310
Carterette, E. C., 21, 45
Casella, G., 77, 87
Casman, E., 249, 272
Cass, J., 323n, 340
Cassidy, K. W., 116, 137
Chambers, J. R., 114, 120, 133, 145, 147, 159, 170, 172
Champagne, A. B., 24, 44
Chapman, J. P., 105, 108, 109
Chapman, L. J., 105, 108, 109
Chase, W. G., 325, 342
Chavez, L. R., 262, 269, 272
Chen, Y.-R., 303, 308
Christensen, C., 66, 69
Cialdini, R., 196, 211
Cicchetti, D. V., 52, 69
Clancey, W. J., 285, 310
Claudy, J. G., 326, 340
Clemen, R., 247, 272
Clement, R. W., 124, 131, 136, 168, 172, 222, 228
Cohen, J., 39, 44, 73, 87, 128, 133
Cohen, P., 128, 133
Cole, M., 281, 310
Colle, H. A., 21, 44
Colvin, C. R., 131, 133, 140
Connolly, T., 97, 109
Cooke, R. M., 249, 272
Coombs, C. H., 2, 17
Copper, C., 124, 136
Corey, G. A., 60, 69
Corrigan, B., 7, 15, 18, 47, 69, 72, 73, 74, 75, 87, 99, 250, 272, 327, 328, 329, 331, 332, 334, 340
Cosmides, L., 315, 318
Cowan, N., 310, 317
Cox, J. C., 228, 232
Craik, K. H., 115, 133
Crandall, R., 336, 342
Cranell, C. W., 23, 44
Crawford, C., 289, 310
Croson, R., 216, 228
Crowell, D. H., 23, 44
D
Dana, J., 80, 87, 207, 208, 210, 211
D’Andrade, R. G., 298, 314
Daniel, K. D., 142, 170
Darley, J., 242, 244
Darlington, R. B., 322, 340
Dart, R., 294, 310
Dawes, R. M., 2, 3, 6, 7, 9, 10, 11, 13, 15, 17, 18, 19, 21, 45, 47, 48, 65, 66, 69, 71, 72, 73, 74, 75, 80, 87, 96, 99, 109, 112, 113, 116, 128, 131, 133, 134, 139, 141, 142, 170, 184, 185, 216, 220, 222, 228, 235, 244, 246, 247, 249, 250, 272, 273, 275, 276, 277, 289, 298, 300, 302, 304, 307, 310, 322, 324, 328, 329, 331, 332, 334, 337, 338, 339, 340, 341, 346, 360
Dawkins, R., 296, 310
Deacon, E. B., 329n, 340
Deaux, K., 287, 310
De Cremer, D., 216, 219, 220, 223, 225, 228, 229, 230
De Finetti, B., 13, 18
De Grada, E., 281, 312
DeGroot, A. D., 325, 340
De la Haye, A. M., 122, 134
Dennett, D. C., 24, 44, 296, 310
Dennis, W., 22, 23, 44, 46
Deutsch, M., 347, 360
Devine, P. G., 128, 134
DeVore, I., 279, 294, 313
De Vos, G., 298, 310
Dewitte, S., 219, 228
Dickhaut, J. W., 217, 228
Dijksterhuis, A., 128, 135
Dion, K. L., 216, 229
DiSessa, A. A., 24, 44
Dole, A. A., 23, 44
Dolgin, K. G., 23, 44
Donahue, E., 3, 18
Donald, M., 282, 310
Dovidio, J. F., 124, 128, 135, 136
Downs, J., 216, 230, 250, 271, 272, 273
Dunbar, R. I. M., 279, 310
Dunning, D., 142, 167, 171, 172
Dupré, J., 295, 310

E
Edwards, D. D., 324, 340
Edwards, J. S., 324, 340
Edwards, W., 54, 70, 85, 87, 167, 171, 247, 274, 332, 340, 340n, 341
Eggers, S. L., 249, 250, 272, 273
Einhorn, H. J., 67, 69, 73, 85, 87, 88, 167, 171, 325, 326, 337, 341
Eldar, Y. C., 78, 88
Embrey, M., 251, 272
Epstein, S., 142, 171
Epstude, K., 116, 137
Erev, I., 156, 167, 171
Ericsson, A., 250, 272
Erwin, M., 24, 46
Eshed Levy, D., 220, 231
Evans-Pritchard, E. E., 287, 311
F
Fairchild, H. H., 299, 311
Falk, A., 188, 203, 212
Farnham, S. D., 222, 229
Farr, J. L., 52, 70
Faust, D., 65, 69, 96, 97, 103, 106, 109, 184, 185
Fehr, E., 11, 18, 188, 193, 212, 219, 223, 229
Fehr, H. G. E., 198, 212
Fenaughty, A. M., 269, 271
Fenn, K., 152, 172
Fessel, F., 159, 172
Festinger, L., 111, 112, 134
Fiedler, K., 128, 134, 144, 171
Fienberg, S. E., 29, 45
Fischbacher, U., 188, 212
Fischbeck, P. S., 262, 272
Fischer, I., 40, 44, 175, 184, 185
Fischhoff, B., 144, 152, 170, 172, 245, 246, 247, 248, 249, 250, 251, 262, 267, 271, 272, 273, 274
Fiske, A. P., 278, 311
Fiske, S. T., 287, 311
Flament, C., 12, 19
Flinn, M. V., 294, 311
Flora, E. J., 250, 273
Foley, R., 226, 229, 295, 311
Ford, T. E., 127, 134
Forsythe, R., 193, 212
Fox, C. R., 114, 134, 144, 171
Frank, R. H., 237, 243
Frey, B., 198, 206, 212
Freyd, J., 280, 311
Friedman, M., 236, 244
Friedman, M. P., 21, 45
Funder, D. C., 131, 132, 133, 136, 139
G
Gächter, S., 219, 229
Gaertner, L., 114, 130, 134, 137, 222, 229
Gail, M. H., 254, 273
Garber, P. A., 284, 308
Gardiner, P. C., 332, 340
Gardner, E., 86, 88
Gardner, W., 286, 287, 288, 303, 308
Gawande, A., 246, 273
Geary, D. C., 294, 311
Gelman, R., 23, 45
Gentner, D., 250, 273
Gifford, S. M., 269, 273
Gigerenzer, G., 72, 88, 221, 229
Giladi, E. E., 113, 135, 159, 171
Gilbert, D. T., 297, 311
Gilovich, T., 40, 45, 97, 109, 167, 171
Gil-White, F. J., 282, 311
Goethals, G. R., 40, 45
Gold, E., 3, 18
Goldberg, L. R., 324, 325, 326, 327, 328, 329, 334, 335, 341
Goldman, L., 216, 230
Goldstein, D. G., 72, 88
Gonzalez, R. M., 152, 172
Gould, S. J., 311, 318
Graham, I. D., 61, 69
Gramzow, R., 130, 134, 222, 229
Granberg, D., 175, 184, 185
Graves, S. B., 238, 239, 244
Green, B., 24, 44, 73, 88
Green, B. F., Jr., 332, 341
Green, M. C., 297, 311
Greenberg, G. H., 61, 70
Greenberg, M. G., 40, 45
Greene, D., 116, 137, 141, 173, 221, 231, 314, 318
Greenwald, A. G., 4, 19, 128, 131, 134, 142, 166, 171
Gregory, R., 251, 273
Grether, D. M., 167, 171
Griesemer, J. R., 278, 316
Griffin, D., 97, 109, 123, 134
Griffiths, T. L., 167, 171
Grinde, B., 296, 311
Grove, W. M., 97, 109
Grzelak, J., 346, 360
Guth, W., 189, 196, 212, 213
Guyatt, G., 257, 274

H
Hackenberg, B. H., 21, 45
Haeckel, E., 291, 311
Hakel, M. D., 49, 69
Hall, J. H., 127, 136
Halpern-Felsher, B. L., 144, 170
Hamburger, H., 346, 347, 360
Hamilton, W. D., 289, 311
Hammond, K. R., 84, 88, 97, 109, 327, 332, 334, 341
Hanna, S. E., 60, 69
Hardie, E. A., 286, 312
Hardin, C. D., 281, 311
Hardin, G., 346, 360
Harris, C. W., 119, 126, 134
Hasman, J. F., 136, 139
Hassan, F. A., 279, 282, 311
Hastie, R., 3, 19, 21, 45, 246, 273
Haynes, B., 60, 69
Haynes, R., 257, 274
Hays, W. L., 360, 361n
Healy, M., 129, 132
Heelas, P., 288, 311
Heine, S. J., 115, 135
Heinrich, J., 198, 212
Henager, R. F., 219, 230
Hendriks-Jansen, H., 278, 311
Henik, E., 299, 315
Henrich, J., 275, 312
Henrion, M., 249, 273
Hewstone, M., 129, 137
Higgins, E. T., 281, 311
Hinkle, S., 225, 229
Hirshleifer, D. A., 142, 170
Hoch, S. J., 122, 135
Hoerl, A., 77, 88
Hoffman, E., 191, 198, 205, 206, 212
Hoffman, P. J., 327, 341
Hogarth, R. M., 72, 73, 87, 88, 167, 171, 326, 337, 341
Hogg, M. A., 12, 19, 111, 112, 135, 138, 223, 231
Holekamp, K. E., 285, 312
Hollingshead, A. B., 282, 314
Holt, R. R., 325, 341
Honeycutt, H., 278, 313
Hood, L., 281, 310
Horan, R. D., 294, 312
Horowicz, A., 113, 135
Horowitz, D. L., 312, 317
Horowitz, J. L., 193, 212
Höss, R., 300, 312, 318
House, P., 116, 137, 141, 173, 221, 231, 314, 318
Howard, J. W., 7, 19, 324, 341
Howard, M. E., 143, 171
Hubbard, R., 177, 185
Hubbell, F. A., 262, 272
Hull, D. L., 279, 312
Hunt, D. L., 60, 69
Hutchins, E., 281, 284, 312
I
Isaac, R. M., 242, 243, 244
Iuzzini, J., 130, 134
J
Jablonka, E., 278, 312
Jaccard, J., 128, 133
James, W., 77, 88
Jarvenpa, R., 279, 281, 312
Jenicke, L. O., 335, 342
Jerdee, T. H., 347, 360
Jin, N., 227, 231, 232
Johnson, C., 124, 136
Johnson, J. J., 269, 271
Johnson, M., 286, 313
Johnson-Laird, P. N., 250, 273
Jones, E. E., 40, 45
Jones, M. C., 127, 136
Judd, C. M., 125, 135
Jussim, L., 115, 135
K
Kagel, J. H., 201, 212
Kahneman, D., 3, 19, 21, 46, 97, 109, 114, 119, 134, 135, 298, 312, 336, 342
Kaplan, B., 60, 61, 69
Kaplan, R. M., 51, 70
Karelaia, N., 72, 88
Karniol, R., 111, 135
Kashima, E., 286, 312
Kassirer, J. P., 60, 70
Katz, Y., 177, 185
Kawakami, K., 128, 135
Keane, J., 227, 229, 306, 312
Keen, S., 116, 135
Keith, D. W., 250, 273
Keller, P., 152, 172
Kelley, H. H., 346, 348, 355, 360
Kelley, R. L., 279, 312
Kelly, L., 339, 341
Kennard, R., 77, 88
Kenny, D. A., 157, 169
Keren, G., 40, 45
Kerr, N. L., 216, 219, 225, 229, 231
Kim, C., 201, 212
Kim, T. G., 114, 136, 160, 173
Kimmel, M., 219, 230
Kitayama, S., 115, 135, 136
Kiyonari, T., 221, 227, 229, 231, 232
Klahr, D., 50, 70
Klar, Y., 113, 135, 159, 171
Klayman, J., 167, 170
Klopfer, L. E., 24, 44
Knee, C. R., 131, 138
Knorr Cetina, K., 281, 308
Koch, G., 257, 273
Kohen, E. S., 328, 330, 335, 342
Kopelman, S., 221, 231, 242, 244
Kramer, R. M., 216, 217, 219, 223, 225, 228, 229, 230
Krantz, D. H., 325n, 341
Kreps, D. M., 234, 244
Krishnamurti, T. P., 250, 273
Krizan, Z., 177, 185
Krueger, J. I., 9, 10, 19, 112, 113, 114, 116, 117, 120, 121, 123, 124, 126, 127, 128, 129, 130, 131, 132, 133, 135, 136, 137, 139, 144, 160, 167, 168, 171, 172, 221, 222, 228, 230, 231, 302, 314
Kruger, J., 114, 136, 145, 147, 159, 167, 172, 174
Kruglanski, A. W., 281, 299, 312
Kuang, J. X., 207, 211
Kunda, Z., 125, 136
Kuper, A., 294, 312
Kurlander, E., 305, 312
Kurzban, R., 225, 230, 299, 312, 317
L
Lakoff, G., 275, 286, 313
Lamb, M. J., 278, 312
Landau, M., 293, 295, 297, 313
Landis, R., 257, 273
Landy, F. J., 52, 70
Langer, E. J., 40, 45
LaPiere, R. T., 6, 19
Larrick, R. P., 167, 170
Latané, B., 242, 244
Latour, B., 295, 299, 313
Laudan, L., 101, 109
Laupacis, A., 61, 69
Leary, M. R., 225, 230, 299, 312, 317
Lebow, B. S., 97, 109
Lee, R. B., 279, 294, 313
Lehman, D. R., 115, 135
Lemons, F., 7, 18, 128, 134
Lerner, J. S., 152, 172
Li, S.-C., 278, 313
Liang, D. W., 281, 313
Libby, R., 328, 341
Lichtenstein, S., 250, 274, 327, 342
Lickliter, R., 278, 313
Liebrand, W., 219, 221, 228
Lillard, A., 288, 296, 313
Lindsey, K., 253, 273
Linke, R. D., 23, 45
Lipkus, I. M., 152, 172
Lock, A., 288, 311
Loewenstein, G., 143, 169, 211
Looft, W. R., 23, 45
Loomis, J., 347, 360
Love, S. M., 253, 273
Lovejoy, C. O., 294, 313
Lowie, D. E., 23, 45
Lucas, A. M., 23, 45
Luce, R. D., 234, 244, 325n, 341, 346, 360
Luckett, T. L., 269, 271
M
MacKinnon, C. A., 303, 313
Maddux, W. W., 287, 316
Maharik, M., 251, 273
Maibach, E., 250, 273
Malmendier, U., 142, 172
Mannetti, L., 281, 312
Markus, H. R., 115, 135, 136
Marquardt, D. W., 322, 342
Martinez, R. G., 262, 272
Mashima, R., 219, 230
Massey, C. M., 23, 45
Mathews, H. F., 269, 273
Maynard Smith, J., 275, 278, 289, 313, 315
McCabe, K. A., 191, 198, 212, 217, 228
McCauley, C., 111, 112, 124, 125, 126, 136
McClelland, G. H., 21, 45
McClive, K. P., 145, 170
McCloskey, M., 24, 44
McDermott, R., 281, 310
McFarland, C., 168, 172
McGhee, D. E., 4, 19, 128, 134
McKelvey, R. D., 167, 172
McKenna, F. P., 175, 185
McKnight, R. D., 61, 70
McMullin, J. M., 262, 272
McTavish, J., 10, 18, 116, 133, 220, 228, 235, 243
Medow, M. A., 63, 69
Meehl, P. E., 3, 7, 19, 61, 65, 69, 70, 73, 88, 96, 97, 101, 102, 103, 106, 109, 110, 184, 185, 322, 324, 342
Merenstein, J. H., 60, 69
Messick, D. M., 11, 18, 221, 230, 231, 238, 239, 242, 244, 360, 361n
Messner, C., 128, 134
Metzger, M. A., 21, 45
Midgley, M., 295, 313
Milgrom, P., 234, 244
Millar, M. G., 64, 70
Miller, D. T., 168, 172, 216, 230
Miller, G. A., 313, 317
Millstein, S. G., 144, 170
Minsky, M., 24, 45
Mirels, H. I., 3, 18
Mishra, S., 262, 272
Monteith, M. J., 130, 132
Moore, D. A., 114, 136, 143, 145, 160, 172, 173
Moreland, R., 281, 313
Morgan, M. G., 247, 248, 249, 250, 257, 271, 273
Morlock, H. C., 21, 44
Moser, D., 201, 212
Moser, S., 290, 313
Moskowitz, G. B., 130, 137
Mueller, R. A., 167, 172
Mulder, L. B., 220, 230
Mulford, M., 9, 18, 116, 134, 141, 170, 184, 185
Mullen, B., 124, 136
Mummendey, A., 126, 138
Mussweiler, T., 160, 173
Muth, J., 86, 88
N
Nakatani, L., 21, 45
Neale, M. A., 143, 173
Nelson, C., 97, 109
Nelson, K., 297, 313
Nemirovski, A., 78, 88
Newman, S. A., 314, 316
Nisbett, R., 298, 314, 336, 342
Norman, D. A., 287, 314
O
Oakes, P., 12, 19, 111, 138, 223, 231
Odean, T., 142, 173
Oesch, J. M., 145, 173
Okuno-Fujiwara, M., 191, 212
Oldman, D., 21, 40, 46
Olsen, C. L., 283, 314
Orbell, J. M., 10, 11, 19, 235, 244, 275, 310
Orlitzky, M., 240, 244
Otten, S., 116, 129, 130, 137, 221, 222, 230
Oyama, S., 278, 295, 314
P
Page, T., 167, 172
Paladino, M.-P., 126, 137
Palmgren, C., 249, 250, 272, 273
Park, B., 125, 135
Parks, C. D., 219, 230
Perper, T., 279, 314
Peterson, C., 142, 173
Pezzo, M. V., 62, 64, 70
Pezzo, S. P., 62, 64, 70
Phelan, J., 296, 309
Philogène, G., 287, 310
Piaget, J., 22, 46
Pierce, K. P., 288, 308
Pierro, A., 281, 312
Poole, M. S., 282, 314
Popper, K. R., 100, 110
Postmes, T., 216, 217, 218, 231
Prasnikar, V., 191, 212
Prentice, D. A., 216, 230
Price, S., 145, 170
Propp, V. I., 293, 314
Pruitt, D. G., 219, 230
R
Rabin, M., 210, 211, 212
Radlow, R., 347, 361
Raiffa, H., 234, 244, 247, 274, 346, 360
Ranney, M., 299, 308, 317
Rapoport, A., 220, 231, 276, 289, 314
Ravinder, H. V., 55, 70
Reed, H., 336, 342
Reicher, S., 12, 19, 111, 138, 223, 231
Remus, W. E., 335, 342
Richards, R. J., 277, 314
Richerson, P. J., 278, 308
Ridderikhoff, J., 60, 70
Riketta, M., 124, 137
Riley, D. M., 249, 274
Rimer, B. K., 152, 172
Robbins, J. M., 116, 123, 130, 136, 137, 144, 172, 222, 231, 302, 314
Roberts, J., 234, 244
Rock, L., 40, 45
Romanucci-Ross, L., 298, 310
Romero, V., 129, 132
Rose, R. M., 21, 44
Rosen, A., 3, 19
Rosen, B., 347, 360
Rosenbaum, W. L., 177, 185
Roskos-Ewoldsen, D. R., 160, 173
Ross, 298, 318
Ross, L., 116, 137, 141, 173, 221, 231, 314
Roth, A. E., 191, 212
Roth, J., 40, 45
Rothbart, M., 116, 124, 130, 133, 222, 228
Rottenstreich, Y., 144, 171
Royzman, E. B., 116, 137
Rubin, M., 129, 137
Russell, R. W., 22, 23, 44, 46
Rynes, S. L., 240, 244
S
Sabrahmanyam, A., 142, 170
Saccuzzo, D. P., 51, 70
Sackett, D., 257, 274
Sanbonmatsu, D. M., 160, 173
Sanday, P. R., 277, 297, 298, 314
Savin, N. E., 193, 212
Sawyer, J., 324, 342
Scamahorn, S. D., 219, 230
Schaller, M., 278, 314
Schank, R. C., 289, 297, 314
Scheibe, K. E., 24, 46
Schindel, J., 299, 308, 317
Schmidt, F. L., 240, 244, 326, 342
Schmidt, K., 11, 18, 193, 212, 223, 229
Schmittberger, R., 189, 212
Schnake, M. E., 220, 231
Schneider, D. J., 125, 130, 137
Schneider, S., 216, 228
Schooler, J., 64, 70
Schwartz, J. L. K., 4, 19, 128, 134
Schwartz, L. M., 152, 174, 251, 261, 274
Schwarz, N., 14, 19
Schwarze, B., 189, 212
Schweder, R. A., 298, 314
Searle, J. R., 131, 137
Searles, H. F., 23, 46
Sedgwick, P. P., 23, 45
Sedikides, C., 114, 115, 130, 134, 137, 222, 229, 286, 314
Sefton, M., 193, 212
Semin, G. R., 284, 315
Shachat, K., 198, 212
Shaffer, V. A., 63, 69
Shaklee, H., 10, 18, 116, 133, 220, 228, 235, 243, 250, 274
Shanon, B., 24, 46
Shaver, K. G., 40, 45
Shavitt, S., 160, 173
Sherif, M., 225, 231
Sherman, S. J., 160, 173
Shogren, J. F., 294, 312
Shore, B., 280, 298, 314
Showalter, D., 52, 69
Sieck, W. R., 65, 70
Silverman, E., 251, 274
Simms, E., 145, 174
Simon, H. A., 250, 272, 325, 342
Singer, D., 7, 18, 128, 134
Sinha, R. R., 130, 137
Slovic, P., 152, 173, 250, 274, 298, 312, 325, 327, 339n, 342
Smale, L., 285, 312
Small, D. A., 143, 145, 152, 172, 173
Small, M., 249, 272, 274
Smith, B. H., 314, 318
Smith, E. R., 284, 315
Smith, K., 60, 69
Smith, T. E., 6, 18
Smith, V., 191, 198, 212
Smith, W. G., 292, 315
Snee, R. D., 322, 342
Snitz, B. E., 97, 109
Snyder, M., 219, 228
Sobel, M. E., 157, 173
Sober, E., 278, 315
Spelke, E., 23, 45
Spitzer, M., 198, 206, 212
Srinivasan, V., 326, 340n
Stahelski, A. V., 348, 355, 360
Stangor, C., 127, 134
Steele, C. M., 142, 173, 287, 315
Stein, C., 77, 88
Sterling, T. D., 177, 185
Stevens, A. L., 250, 273
Stiber, N. A., 262, 272
Stiell, I. G., 61, 69, 70
Stitt, C. L., 111, 112, 124, 126, 136
Strange, J. J., 297, 311
Strum, S. C., 295, 313
Suls, J., 145, 170
Sumner, W. G., 225, 231
Suppes, P., 325n, 341
Svenson, O., 113, 137, 144, 173
Swensson, R., 347, 361
Symons, D., 295, 315
Szathmáry, E., 278, 315
T
Taagepera, R., 85, 88
Tajfel, H., 11, 19, 111, 124, 129, 137, 138, 223, 225, 231
Takahashi, N., 219, 230
Takemura, K., 287, 316
Tanida, S., 221, 229
Tanis, M., 216, 217, 218, 231
Tanner, N., 294, 315
Tate, G., 142, 172
Taylor, C., 288, 302, 315
Taylor, H. A., 21, 44
Taylor, S. E., 112, 117, 138, 140, 142, 166, 173
Tenenbaum, J. B., 167, 171
Tesser, A., 64, 70
Tetlock, P. E., 128, 132, 299, 300, 315
Thagard, P., 125, 136
Thangavelu, K., 125, 136
Thissen, D., 332, 342
Thorndike, E. L., 126, 138
Thorne, S., 251, 272, 273
Thornton, B., 324, 342
Thouless, R. H., 23, 24, 44
Tirole, J., 142, 170
Todd, P. M., 72, 88
Toguchi, Y., 114, 137
Tooby, J., 315, 318
Torgerson, W. S., 330, 342
Toulmin, S., 24, 46
Trivers, R. L., 215, 231, 289, 315
Tugwell, P., 257, 274
Turner, J. C., 12, 19, 111, 127, 129, 138, 223, 225, 231
Tversky, A., 2, 3, 17, 19, 21, 40, 45, 46, 298, 312, 325n, 336, 341, 342
Tyrer, P. J., 52, 69
V
Vallone, R., 40, 45
Van de Geer, J. P., 238, 239, 244
Van de Kragt, A. J. C., 11, 19, 275, 310
Van Dijk, E., 220, 223, 225, 229, 230
Van Herk, E., 60, 70
Van Vugt, M., 216, 223, 225, 229
Varey, C. A., 123, 134
Villano, P., 127, 136, 139
Viscusi, W. K., 152, 174
Vohs, K., 113, 133
Voils, C. I., 130, 132
Von Winterfeldt, D., 54, 70, 247, 274, 332, 340n
W
Waddock, S. A., 238, 239, 244
Wagenaar, W. A., 21, 40, 45, 46
Wainer, H., 73, 88, 332, 342
Waldzus, S., 126, 138
Walker, J. M., 242, 244
Wallace, H. A., 327, 342
Wallsten, T. S., 156, 171
Ward, C. V., 294, 311
Ward, L. M., 40, 45
Weber, J. M., 221, 231, 242, 244
Weber, R., 207, 211
Wegner, D. M., 280, 315
Wei, J., 217, 219, 230
Weidner, M., 347, 361
Weiner, B., 40, 45
Weinkam, J. J., 177, 185
Weinstein, N., 146, 152, 174, 175, 184, 185
Weizenbaum, J., 24, 46
Welch, H. G., 152, 174, 261, 274
Wells, G. L., 167, 174
Wells, P. S., 282, 315
Wentura, D., 130, 137, 222, 230
Wenzel, M., 126, 138
West, S. G., 269, 271
Wetherell, M., 12, 19, 111, 138, 223, 231
Wichman, H., 347, 361
Wiggins, N., 328, 330, 335, 342
Wilke, H., 216, 219, 220, 228, 230, 231
Wilks, S. S., 332, 342
Willard, B., 237, 244
Williams, A. W., 242, 244
Williams, G. C., 296, 315
Wilson, D. S., 278, 315
Wilson, E. O., 289, 315
Wilson, R., 234, 244
Wilson, T. D., 64, 70
Wimsatt, W. C., 278, 316
Windschitl, P. D., 114, 120, 133, 145, 159, 170, 172, 174, 177, 185
Winer, B. J., 354, 361
Wit, A. P., 216, 225, 231
Witt, K., 67, 70
Witt, M. G., 130, 134
Woloshin, S., 152, 174, 251, 261, 274
Wu, F., 249, 272
Y
Yakobos, E., 175, 184, 185
Yamagishi, T., 220, 221, 227, 229, 231, 232
Yntema, D. B., 330, 342
Yuki, M., 286, 287, 308, 316
Z
Zald, D. H., 97, 109
Zamir, S., 191, 212
Zawadski, B., 124, 138
Zeiger, J. S., 121, 136
Zietsma, C., 145, 173
Zihlman, A. L., 294, 315
Zuckerman, M., 131, 138