Artifacts in Behavioral Research Robert Rosenthal and Ralph L. Rosnow’s Classic Books
Artifacts in Behavioral Research Robert Rosenthal and Ralph L. Rosnow’s Classic Books A Re-issue of ARTIFACT IN BEHAVIORAL RESEARCH EXPERIMENTER EFFECTS IN BEHAVIORAL RESEARCH and THE VOLUNTEER SUBJECT With a Foreword by Alan E. Kazdin
2009
Oxford University Press, Inc., publishes works that further Oxford University’s objective of excellence in research, scholarship, and education.
Oxford New York
Auckland Cape Town Dar es Salaam Hong Kong Karachi Kuala Lumpur Madrid Melbourne Mexico City Nairobi New Delhi Shanghai Taipei Toronto
With offices in Argentina Austria Brazil Chile Czech Republic France Greece Guatemala Hungary Italy Japan Poland Portugal Singapore South Korea Switzerland Thailand Turkey Ukraine Vietnam
Copyright © 2009 by Oxford University Press, Inc.
Published by Oxford University Press, Inc.
198 Madison Avenue, New York, New York 10016
www.oup.com
Oxford is a registered trademark of Oxford University Press
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior permission of Oxford University Press.
Library of Congress Cataloging-in-Publication Data
Rosenthal, Robert, 1933–
Artifacts in behavioral research : Robert Rosenthal and Ralph L. Rosnow’s classic books / Robert Rosenthal, Ralph L. Rosnow.
p. cm.
New combination volume of three books-in-one.
Includes bibliographical references and index.
ISBN 978-0-19-538554-0 (alk. paper)
1. Psychology--Research. 2. Social sciences--Research. I. Rosnow, Ralph L. II. Title.
BF76.5.R635 2009
150.72--dc22
2008038527
Printed in the United States of America on acid-free paper
Foreword
We are extremely fortunate to have this three-in-one volume. To begin, Robert Rosenthal and Ralph Rosnow have been collaborating for nearly 50 years with stellar contributions to research design, statistical evaluation, and methodology. This volume conveys their enduring dedication to advancing and improving behavioral research. They have accorded special attention to the study of artifacts in research, the topic of this volume. The volume provides a convenient way to bring the remarkable body of work that has fallen out of print and, even worse, out of mind. The volume will be extremely valuable for individuals in training as well as for experienced researchers. Each will relish in the rich material whether this is new or rekindled knowledge. Artifact in the context of research refers to influences that are not of direct interest to the investigator. From a methodological standpoint, consideration of and attention to artifacts are no less critical to the design of a study than to the phenomenon of direct interest. An uncontrolled or neglected artifact can account for, explain, and compete with the interpretation the author wishes to place on the data. Perhaps the most familiar artifact in intervention research is the placebo effect. In medication trials, individuals who receive a fake drug, that is, one with no pharmacological properties that could influence the outcome, still improve and improve more than those who receive no drug at all. Because the investigator is interested in the effects of medication, placebo effects become an influence to be controlled. Although placebo effects are the widely recognized poster child of artifacts, there are many other artifacts less well known or discussed. Rosenthal and Rosnow have like no one before codified a large set of artifacts in behavioral science but also science more generally and showed that they in fact can exert significant impact on human and non-human animal behavior and the results of investigations. 
Insufficient attention has been paid to artifacts and their influence. Among the reasons, the very term conveys they are not the main show or interest. A particular phenomenon sparks the interest and curiosity of investigators and perhaps even draws them into the field; artifacts seem to be all those other factors that merely douse the flame. After all, unless one is an anthropologist why should one care about artifacts? The answer is threefold.
First and perhaps the most obvious is the methodological part. Artifacts are influences that must be controlled or addressed to draw valid inferences. It is often easy to complete a research project that focuses solely on the phenomenon of interest and neglect all sorts of plausible other influences. One can then draw conclusions about the results and place a perfunctory limitations paragraph that gives a passing curtsy to what was not controlled. The question for any researcher before a study is conducted is what might be an alternative explanation of the findings if the intervention/independent variable has the predicted impact. Those codified as artifacts are some of the more nuanced answers to the question and hence readily overlooked. Although we may not be interested in a particular
experimental artifact, we certainly do not want one of them to be a more or equally parsimonious interpretation of our results. Second, artifacts are not artifacts. They raise critical substantive issues about subtle factors that influence performance. Beliefs, attributions, expectancies, demand characteristics, subject roles, test sensitization, and more, all elaborated in this volume, are not mere influences to be controlled. They reflect influences— genuine and often more subtle—that need to be understood and harnessed. For example, volunteer subjects behave differently from those who are not volunteers. There is more to this than demographic differences. Influences such as choice, threshold for agreeing to participate, and willingness to take risk or not seeing risk all are critical issues pregnant with theoretical and applied implications. For those artifacts that are about data and data analysis, some of these (e.g., biases in observing and scoring, data fabrication) relate to the pillars of scientific research, well beyond psychology. Understanding more about how they take place and how they can be prevented is critically important. Third, Rosenthal and Rosnow have alerted us to the relativity of artifact. Artifacts are what one is not interested in, but should worry about, from a methodological standpoint. If I study placebo effects, the last thing I want is having my subjects taking real medications that I do not know about. Placebo is my variable of interest, and real medication (e.g., taking pills at home to treat their disorder, using some alternative medicine) is the artifact. Even if all subjects are taking different medications or some are and some are not taking medication, this can interfere with drawing valid conclusions. (Large variability that these alternative treatments introduce can readily reduce power and interfere with demonstrating an effect.) 
In other words, consider a class called “all variables that influence human and non-human animal functioning.” In our individual work we select those we are interested in for study; some subset of the others in which we are not very interested are the artifacts. And, more colloquially, one person’s artifact is another person’s independent variable. I mention this because understanding artifacts and their nuanced effects has broad generality in understanding human behavior.
The present volume elaborates the issues I have raised and makes a connection that to me represents the highest level of sophistication in our field. The volume focuses on artifact but, as the reader will see, three broad topics are included: theory, research design, and data analysis. By elaborating artifacts, we learn that we need new concepts, new theory to explain the effects, new design variations, and new methods of data analysis. Rosenthal and Rosnow have moved social sciences to new heights with this work by influencing and integrating three broad topics.
On a more personal note, over the years I have deeply admired the contributions of Rosenthal and Rosnow. We know from analyses of science that quality and quantity of publications often go hand in hand (e.g., Charles Darwin, Albert Einstein, Gustav Fechner). We can add Rosenthal and Rosnow to this list, as this three-in-one volume illustrates. However, the larger context cannot be neglected, and it includes many other products of their collaboration:
• Beginning Behavioral Research: A Conceptual Primer (Rosnow and Rosenthal)
• Contrast Analysis: Focused Comparisons in the Analysis of Variance (Rosenthal and Rosnow)
• Contrasts and Effect Sizes in Behavioral Research: A Correlational Approach (Rosenthal, Rosnow, and Rubin)
• Essentials of Behavioral Research: Methods and Data Analysis (Rosenthal and Rosnow)
• People Studying People: Artifacts and Ethics in Behavioral Research (Rosnow and Rosenthal)
• Primer of Methods for the Behavioral Sciences (Rosenthal and Rosnow)
• Understanding Behavioral Science: Research Methods for Research Consumers (Rosnow and Rosenthal)
My list, of course, omits many other works (e.g., books, articles, and chapters)— works each has contributed individually outside of this rich collaboration. The artifacts they have detailed are rich in hypotheses and could accommodate legions of dissertations and careers as reflected in such generic questions as, How they work? For whom they work? And how their influence can be harnessed? Our field and sciences more generally would profit from knowing the moderators, mediators, and mechanisms of action. Moreover, relating these artifacts to other influences (e.g., social norms, priming) would contribute greatly to theory. Rosenthal and Rosnow present us with revolutionary concepts, insights, and challenges. I have been greatly influenced by their work and am so delighted to have the opportunity to express my gratitude to them publicly. I have read these volumes and urge you to as well, first selectively to suit your tastes and research, but then fully to enrich the quality of one’s own research and the care with which one interprets all research. The scholarship is superb, the issues are provocative, and the artifacts still wait to be exploited to understand the range of influences on human functioning. It would be difficult to read the volume without shifting a little of one’s work to the study or control of the influences these authors elaborate. Alan E. Kazdin, Ph.D. Yale University September 2008
A Preface to Three Prefaces
We are as sailors who are forced to rebuild their ship on the open sea, without ever being able to start fresh from the bottom up. Wherever a beam is taken away, immediately a new one must take its place, and while this is done, the rest of the ship is used as support. In this way, the ship may be completely rebuilt like new with the help of the old beams and driftwood—but only through gradual rebuilding. Otto Neurath1 Otto Neurath’s brilliant simile of science captures beautifully the structure and evolution of science. New ideas, new theories, new findings, new connections, new methods—all of these constitute the new beams that replace the old. Sometimes the old beams weren’t so bad; it’s just that some newer ones were a bit better. From time to time it wasn’t noticed that particular beams were weakening the ship. Artifacts that weaken the ship of science often go unnoticed. The three books republished in this volume were originally intended to encourage scientists to notice the beams weakened by artifacts and to think about how the weakened beams might be replaced by stronger, though still imperfect, beams. In the present context, the three books can serve as an introduction and a reminder. They are an introduction to the topic of artifacts for graduate students, advanced undergraduates, and younger researchers. They are a reminder to more experienced researchers inside and outside of academia that the problems of artifacts in behavioral research, which they may have first encountered as beginning researchers, have not gone away. For example, problems of experimenter effects have not been solved. Experimenters still differ in the ways in which they see, interpret, and selectively present their data. Experimenters still obtain different responses from their research participants (human or infrahuman) as a function of the experimenters’ transitory states as well as their enduring traits of biosocial, psychosocial, and situational origins. 
Similarly, experimenters’ expectations still are manifested too often as self-fulfilling prophecies. Biomedical researchers seem to have acknowledged and guarded against this problem better than have behavioral researchers, as many biomedical experiments would be considered of dubious scientific value had their experimenters not been blind to experimental condition. Problems of participant or subject effects have also not been eliminated. Research samples are, by necessity, usually drawn from a population of volunteers. These volunteer subjects often differ along many dimensions from those not finding their way into the research. At times, those volunteer subjects may differ from nonvolunteers in the degree to which they are affected by the experimental conditions.

1 Neurath, O. (1921). Antispengler (T. Parzen, Trans.). Munich: Calwey, pp. 75–76.

In addition to our still having to deal with volunteer bias, we must frequently contend with the fact that research participants are suspicious of experimenters’ intent, try to figure out what the experimenters are ‘‘really’’ after, and are concerned about what the experimenter thinks of them. Not only have issues of artifacts in behavioral research not been completely resolved, but also there have been recent reminders that similar concerns have surfaced elsewhere. For instance, in the newly burgeoning discipline of experimental economics, Steven D. Levitt and John A. List (2007) pondered the extent to which behavior observed in a controlled experimental environment is a satisfactory indicator of ‘‘real world’’ behavior.2 In evidence-based medicine, Peter M. Rothwell (2005) has questioned the generalizability of the results of pharmaceutical trials to definable groups of patients in particular clinical settings in routine medical practice.3 David B. Strohmetz (2008) recently published an overview of artifacts in psychological experiments in which he reiterated theorized mechanisms for their mediation that are discussed in the volunteer subject book in this volume,4 and, in another context, Chris T. Allen (2004) has reported his own operationalization and an elegant series of studies in support of such a mediational model.5 Although most of the examples in the three books in this volume emphasize behavioral research, it is fitting that Neurath’s simile implies that the concern with artifacts is not restricted to behavioral or social science. Examples from physics, biology, and other scientific areas are also sprinkled throughout, as well as cases from outside science that imply a common thread in the nature of knowledge and the effort to make sense of the world. We are indebted to numerous people who contributed in unique ways to this volume. We have long benefited from the insights and direct contributions to the first book in this volume made by E. G. 
Boring, Donald Campbell, Robert Lana, William McGuire, Martin Orne, and Milton Rosenberg. We share their earlier vision and hope that behavioral researchers become more acutely aware of the important, unintended, and unwitting effects that the artifacts discussed in that book can have on the progress of our science. In addition to those distinguished tutors and coauthors whose names are noted above, we owe a great deal to the other early artifactologists and ethicists of behavioral research, from the paleopioneers, including Oskar Pfungst and Saul Rosenzweig, to the early pioneers, including John Adair, William Bean, Henry Beecher, Lee Cronbach, Herbert Hyman, John Jung, Jay Katz, Herbert Kelman, Henry Riecken, and Irwin Silverman. We are also grateful to a long list of coworkers—including many brilliant graduate students—with whom we have been fortunate to collaborate, beginning at the University of North Dakota and Boston University and continuing at Harvard University, the University of California, Riverside, and Temple University.

2 Levitt, S. D. & List, J. A. (2007). What do laboratory experiments measuring social preferences reveal about the real world? Journal of Economic Perspectives, 21, 153–174.
3 Rothwell, P. M. (2005). External validity of randomized controlled trials: “To whom do the results of this trial apply?” The Lancet, 365, 82–93.
4 Strohmetz, D. B. (2008). Research artifacts and the social psychology of psychological experiments. Social and Personality Psychology Compass 2(2008): 10.1111/j.1751-9004.2007.00072.x.
5 Allen, C. T. (2004). A theory-based approach for improving demand artifact assessment in advertising experiments. Journal of Advertising, 33(2), 63–73.

We are also grateful for the research and foundational support that we have received over the years, including the support that Rosenthal received from the National Science Foundation, the Spencer Foundation, a James McKeen Cattell Sabbatical Award, and the John D. and Catherine T. MacArthur Foundation and that Rosnow received from the National Science Foundation, the National Institute of Mental Health, and Temple University’s Thaddeus L. Bolton endowment. We thank Lori Handelman, our wonderful editor at Oxford University Press, for her enthusiastic shepherding of this volume. We thank Alan E. Kazdin for taking time from his professorship at Yale, and his simultaneous Presidency of the American Psychological Association, to provide a foreword that clarifies the very concept of artifacts, delineates their importance, sets a context for the topic, and provides suggestions for future research! It is not often that a foreword actually adds to the scholarly literature of the volume that lies ahead, but Alan Kazdin has done just that. Finally, we thank MaryLu Rosenthal and Mimi Rosnow for all the ways in which they have improved our writing, and for all the ways in which they continue to improve us! It was back in Massachusetts, many years ago, that we became friends and began to work together while one of us was teaching at Harvard University and the other at Boston University. It is a deep friendship and collaboration that happily continues today. Robert Rosenthal, Riverside, California Ralph L. Rosnow, Radnor, Pennsylvania October 2008
Contents

Foreword, ALAN E. KAZDIN
A Preface to Three Prefaces

BOOK ONE – ARTIFACT IN BEHAVIORAL RESEARCH
ROBERT ROSENTHAL AND RALPH L. ROSNOW, EDITORS
Preface
CHAPTER 1 Perspective: Artifact and Control, EDWIN G. BORING
CHAPTER 2 Suspiciousness of Experimenter’s Intent, WILLIAM J. MCGUIRE
CHAPTER 3 The Volunteer Subject, ROBERT ROSENTHAL and RALPH L. ROSNOW
CHAPTER 4 Pretest Sensitization, ROBERT E. LANA
CHAPTER 5 Demand Characteristics and the Concept of Quasi-Controls, MARTIN T. ORNE
CHAPTER 6 Interpersonal Expectations: Effects of the Experimenter’s Hypothesis, ROBERT ROSENTHAL
CHAPTER 7 The Conditions and Consequences of Evaluation Apprehension, MILTON J. ROSENBERG
CHAPTER 8 Prospective: Artifact and Control, DONALD T. CAMPBELL

BOOK TWO – EXPERIMENTER EFFECTS IN BEHAVIORAL RESEARCH
ROBERT ROSENTHAL
Preface
Preface to the Enlarged Edition
PART I THE NATURE OF EXPERIMENTER EFFECTS
Chapter 1 The Experimenter as Observer
Chapter 2 Interpretation of Data
Chapter 3 Intentional Error
Chapter 4 Biosocial Attributes
Chapter 5 Psychosocial Attributes
Chapter 6 Situational Factors
Chapter 7 Experimenter Modeling
Chapter 8 Experimenter Expectancy
PART II STUDIES OF EXPERIMENTER EXPECTANCY EFFECTS
Chapter 9 Human Subjects
Chapter 10 Animal Subjects
Chapter 11 Subject Set
Chapter 12 Early Data Returns
Chapter 13 Excessive Rewards
Chapter 14 Structural Variables
Chapter 15 Behavioral Variables
Chapter 16 Communication of Experimenter Expectancy
PART III METHODOLOGICAL IMPLICATIONS
Chapter 17 The Generality and Assessment of Experimenter Effects
Chapter 18 Replications and Their Assessment
Chapter 19 Experimenter Sampling
Chapter 20 Experimenter Behavior
Chapter 21 Personnel Considerations
Chapter 22 Blind and Minimized Contact
Chapter 23 Expectancy Control Groups
Chapter 24 Conclusion
Interpersonal Expectancy Effects: A Follow-up
References

BOOK THREE – THE VOLUNTEER SUBJECT
ROBERT ROSENTHAL AND RALPH L. ROSNOW
Preface
Chapter 1 Introduction
Chapter 2 Characteristics of the Volunteer Subject
Chapter 3 Situational Determinants of Volunteering
Chapter 4 Implications for the Interpretation of Research Findings
Chapter 5 Empirical Research on Voluntarism as an Artifact-Independent Variable
Chapter 6 An Integrative Overview
Chapter 7 Summary
Appendix
References
Author Index
Subject Index
Figure 1 Photograph taken during Eastern Psychological Association meeting in Atlantic City, New Jersey in the early 1960s. Left front to rear: Ralph Rosnow, Robert Rosenthal, Duane Schultz, Sydney Schultz, Leslie Frankfurt, Robert Commack; right front to rear: Frederick Pauling, Gordon Russell, Robert Lana
Figure 2 Robert Rosenthal—Left, Ralph L. Rosnow—Right. Photograph taken by Mimi Rosnow in Philadelphia, Pennsylvania in 2001
BOOK ONE ARTIFACT IN BEHAVIORAL RESEARCH Robert Rosenthal and Ralph L. Rosnow, Editors
Preface
The effort to understand human behavior must itself be one of the oldest of human behaviors. But for all the centuries of effort, there is no compelling evidence to convince us that we do understand human behavior very well. Instead, there are the unsolved behavioral problems of mental illness, racism, and violence, of both the idiosyncratic and institutionalized varieties, to bear witness to how much there is we do not yet know about human behavior. In the face of the urgency of the questions waiting to be answered it should not be surprising that behavioral scientists, and the publics that support them, should suffer from a certain impatience. That impatience is understandable; but perhaps from time to time we need remind ourselves that we have not really been in business for very long. The application of that reasoning and of those procedures which together we call ‘‘the scientific method’’ to the understanding of human behavior is of relatively very recent origin. What we have learned about human behavior in the short period, say from the founding of Wundt’s laboratory in Leipzig in 1879 until now, is out of all proportion to what we learned in preceding centuries. The success of the application of ‘‘scientific method’’ to the study of human behavior has given us new hope for an accelerating return of knowledge on our investment of time and effort. But most of what we want to know is still unknown. The application of what we think of as scientific method has not simplified human behavior. It has perhaps shown us more clearly just how complex it really is. In contemporary behavioral research it is the research subject we try to understand. He serves as our model of man in general or at least of a certain kind of man. We know that his behavior is complex. We know it because he does not behave exactly as does any other subject. We know it because sometimes we change his world ever so slightly and observe his behavior to change enormously. 
We know it because sometimes we change his world greatly and observe his behavior to change not at all. We know it because the “same” careful experiment conducted in one place at one time often yields results very different from one conducted in another place at another time. We know his complexity because he is so often able to surprise us with his behavior.
Much of the complexity of human behavior may be in the nature of the organism. But some of this complexity may derive from the social nature of behavioral research itself. Some of the complexity of man as we know it from his model, the research subject, may reside in the fact that the subject usually knows perfectly well that he is to be a research subject and that this role is to be played out in interaction with another human being, the investigator. That portion of the complexity of human behavior which can be attributed to the social nature of behavioral research can be conceptualized as a set of artifacts to be isolated, measured, considered, and, sometimes, eliminated. This book is designed to consider in detail a number of these artifacts. The purpose is not simply to examine the methodological implications, though that is an important aspect, but also to examine some of the substantive implications. It may be that all ‘‘artifacts,’’ when closely examined, teach us something new about a topic of substantive interest. The introductory chapter, which was written by our late colleague Edwin G. Boring, provides a perspective on artifact and a discussion of the nature of experimental control. The following six chapters are a series of position papers by researchers who have been actively engaged in systematic exploration of various antecedents of artifact in behavioral research, and each writer summarizes the findings in his respective area. Those six essays, in the order of their presentation, are by William J. McGuire on suspiciousness of intent, Robert Rosenthal and Ralph L. Rosnow on volunteer effects, Robert E. Lana on pretest sensitization, Martin T. Orne on demand characteristics, Rosenthal on experimenter expectancy effects, and Milton J. Rosenberg on evaluation apprehension. The final chapter, by Donald T. Campbell, takes into account the separate contributions and tells us something of the future prospects for behavioral research. 
In organizing this volume, the editors have been guided by Herbert Hyman’s comment that the demonstration of systematic error may well mark an advanced state of a science. All scientific inquiry is subject to error, and it is far better to be aware of this, to study the sources in an attempt to reduce it, and to estimate the magnitude of such errors in our findings, than to be ignorant of the errors concealed in the data. One must not equate ignorance of error with the lack of error. The lack of demonstration of error in certain fields of inquiry often derives from the nonexistence of methodological research into the problem and merely denotes a less advanced stage of that profession.*
The editors thank Academic Press for their patience and continued interest throughout the two and one-half year evolution of this book. To our contributors— Boring, Campbell, Lana, McGuire, Orne, and Rosenberg—we are indebted for their thoughtful and thought-provoking essays. Our task was greatly facilitated by separate grants to each of us from the Division of Social Sciences of the National Science Foundation. Edwin G. Boring, who passed away on July 1, 1968, wrote once of the sense of inadequacy of the individual scholar to available information at any moment of his existence and of the feeling sometimes of being overwhelmed by the complexity of nature.
* Hyman, Herbert H. Interviewing in social research. Chicago: University of Chicago Press, 1954. P. 4.
That would explain Kepler’s looking for a geometrical generalization to explain the planets in the solar system, would give a sound basis for the need for all generalization in science. And nowadays we no longer hope to learn about everything that’s in nature, but only about everything that’s already been published about nature, and ultimately we sink, gasping . . . Still, I am content to live in this age. Titchener once said that he would have liked to live in the age when one man could know everything, and that was quite long ago. We must accustom ourselves to an age in which one man never knows more than just enough to use for a given purpose.†

How fortunate our age was to have Edwin G. Boring. We take pride in dedicating this book to his memory.

January 29, 1969
Robert Rosenthal
Ralph L. Rosnow

† Personal communication, November 1, 1967.
1 Perspective: Artifact and Control
Edwin G. Boring
Harvard University
The Concept of Control

If x, then y. That is the formula for John Stuart Mill’s method of agreement. The independent variable is x and the dependent y. It says that x is a sufficient condition of y, and the experimental establishment of this relation has sometimes been thought to be the aim of science. The statement is, however, not enough. It must be coupled with if not-x, then not-y, and the two formulas together constitute Mill’s joint method of agreement and difference, establishing the independent variable, x, as both the sufficient and the necessary condition of y. It is essential to add the method of difference to the method of agreement in order to establish x as necessary to y as well as sufficient. (See Mill, 1843, Bk. III, chap. 8.) In short, not-x is the control, for x is a necessary condition only if it be shown that y does not occur without it. In this sense Mill was a good expositor for the concept of control, although he did not use the term. (On the history of control, see Boring, 1954, 1963, 111–125.)

The principle had not been overlooked before Mill. When Pascal in 1648 planned the experiment for measuring the weight of the air by having a barometer carried 3000 feet up to the top of the Puy-de-Dome so as to show the loss of atmospheric pressure at the greater height, he provided also for a second barometer which was kept at the foot of the mountain and was found to remain unchanged. (Pascal, 1937, 97–112; Cohen, 1948, 71f; Conant, 1951, 39; Boring, 1954, 577f; 1963, 115.) It was really a control. The independent variable was the height, the dependent variable was the atmospheric pressure, and the procedure was the joint method of agreement and difference, two centuries before Mill had named it and laid down the rules.

Actually it is the use of the method of difference, that is to say, of control, that puts rigor into science. A fact is a difference. Something is this and not that.
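
The two formulas, if x then y and if not-x then not-y, can be checked mechanically against a record of observations. What follows is only a minimal sketch in Python, not anything drawn from Mill or from this chapter: the function name `joint_method` and the sample records are hypothetical, and each observation is reduced to a pair of booleans (was x present, did y occur).

```python
from typing import Iterable, List, Optional, Tuple


def joint_method(
    observations: Iterable[Tuple[bool, bool]],
) -> Tuple[Optional[bool], Optional[bool]]:
    """Return (sufficient, necessary) for x with respect to y.

    sufficient: every case with x present also shows y (if x, then y).
    necessary:  every case with x absent also lacks y (if not-x, then not-y).
    A value of None means the record holds no cases that could test the claim.
    """
    obs: List[Tuple[bool, bool]] = list(observations)
    y_with_x = [y for x, y in obs if x]
    y_without_x = [y for x, y in obs if not x]  # the control cases

    sufficient = all(y_with_x) if y_with_x else None
    necessary = (not any(y_without_x)) if y_without_x else None
    return sufficient, necessary


# Hypothetical records of (x present?, y occurred?), invented for illustration.
with_control = [(True, True), (True, True), (False, False), (False, False)]
confounded = [(True, True), (False, True)]   # y also appears without x
no_control = [(True, True), (True, True)]    # method of agreement only

print(joint_method(with_control))  # (True, True)
print(joint_method(confounded))    # (True, False)
print(joint_method(no_control))    # (True, None)
```

The third record is the case of agreement without difference: with no not-x observations, the question of necessity is simply left untested, which is why the sketch reports `None` rather than a vacuous `True`.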
Any observed value has meaning only in relation to some frame of reference, and any quantity only in respect of the scale in which it is set. Lana (Chapter 4) makes this point in his paper in the present volume. The method of concomitant variations, which Mill gave as a separate method, is really an elaboration of his method of difference, for every pair of values of x and y is placed in relation to every other x–y pair from which the first pair differs. The paradigm for concomitant variations is y ¼ f(x) and its determination is the scientific ideal. The joint methods of agreement and difference, if x, 7
8
Book One – Artifact in Behavioral Research
then y and if not-x, then not-y, are really only one pair of cases in the method of concomitant variations, where x is some positive value and also zero. In short, experimental science is the determination of functional relationships of the nature of y = f(x) by the observation of concomitant variations. A fact is a difference, and ideally the use of control is always implied.

All this becomes clearer if we see what happens when we have no control. History depends for its general laws on the method of agreement and ordinarily lacks control because the initial term of an historical causal relationship is not truly an independent variable. To establish inductively a generality by repetition of an observable relationship, one has to wait in history on the recurrence of an x by what is called “chance” (a synonym for the inscrutable ignorance of prognostic causality). One cannot control historical sequence, and there are comparable examples to be found in other descriptive nonexperimental sciences like astronomy, geology and some branches of biology.

Four Meanings

The word control originally meant counter-roll, a master list against which any subsequent special list could be checked and if necessary corrected. Thus the term came to mean a check and so restraint in order to induce or maintain conformity. To control is now to guide. In science the word has had four meanings, successively adopted and all still useful.

(1) Control has long been used in the sense of maintaining constancy of conditions and also for checking an experimental variable to see if it is adhering to its stated or intended specifications. The artifacts with which this volume is primarily concerned are mostly of this kind. The independent variable of y = f(x) is contaminated, often unwittingly, by additional unspecified determinants that affect y. This is the oldest scientific meaning of the word control, one which the discussions of the present volume re-emphasize.
(2) In the late nineteenth century the use of the control experiment or control test came into psychology and to a lesser extent into biology, although not always with that name. For instance, the Hipp chronoscope was calibrated by a ‘‘control hammer,’’ a heavy pivoted hammer which was released electrically, and in falling tripped successively two switches, wired in with the chronoscope so that the time of the hammer’s fall from one switch to the other was measured. Since less variability could be expected of the fall-hammer than of the chronoscope, the hammer was used to calibrate the chronoscope. A number of successive falls constituted a ‘‘control series’’ from which the variable and constant errors of the chronoscope could be computed. (Wundt, 1874, 772; 1911, III, 367.) Control tests, called ‘‘puzzle experiments’’ (Vexirversuche), were used in the early measurements of the cutaneous two-point threshold. The separation of the compass points placed upon the skin was varied and the observer reported whether he felt one or two. Here the artifact that Titchener called the stimulus-error tends to make trouble. If the subject knows that two points are always being placed on his skin, it becomes difficult for him to report a unitary perceptual pattern because he knows that two points are being applied (Boring, 1921, 465–470; 1963, 267–271). Especially is this difficulty present in naive subjects, like McDougall’s primitive people in the Torres Straits who wanted to show off their fineness of perception (McDougall, 1903, 141–223, esp. 189–193). To control this error, single points are mixed in with the double and that control works well to
measure the skill with which the subject can discriminate single from double stimuli. It is not, however, wholly successful in obtaining a description of the perceptual pattern, for the reason that very often a single point does give a very good dual perception. A control of this second type can be vitiated by a failure of control of the first type, for the stimulations by a single point are not always the same. Some unknown factor in this independent variable remains uncontrolled in the sense that it has been left free to vary and does vary. It has been suggested that this variation could be caused by the presence or absence of multiple innervation at the point stimulated (Kincaid, 1918; Boring, 1954, 579f; 1963, 116f, also on multiple innervation, Boring, 1916, 89–93).

At the end of the century Müller and Pilzecker (1900) published their elaborate investigation on the use of the method of right associates in the study of memory. They used principal series (method of agreement) and comparison series (method of difference), what we should nowadays call the experimental and control series. The use of the control test, experiment, or series became almost standard in the twentieth century, and the growth of behavioral psychology in which discrimination plays the fundamental role in assessing the psychological capacities of the subject has practically put the word control out of common usage, for a discrimination is the observation of a difference and Mill’s joint method is now standard. Lana’s discussion (Chapter 4) of the use of the pretest in social research shows both the importance of control and the manner in which it often introduces an artifact by changing the experimental status of the subject before the crucial test.

(3) The use of the control group avoids the difficulty of the pretest artifact by introducing another difficulty.
The control group and the experimental group are independent, since they are constituted of different noncommunicating organisms, but, being different, they are not identical and one cannot tell certainly how nearly identical they are. You may match the two groups, individual for individual, litter-mates of the same bodyweight if the subjects are animals, twins if the subjects are human, or you can make the groups of great size hoping that there is a law of large numbers that will perform the magic of reducing any inscrutable difference to a negligible amount, yet you remain in the dilemma of complementarity. You have gained independence at the expense of assured equivalence. One early example (Hankin, 1890) of the use of a control group is an experiment demonstrating the effective immunization of mice against tetanus. The immunized mice lived when inoculated and the controls died. The experiment is convincing because the difference between a live mouse and a dead one is so large. An acceptable level of confidence that a dead mouse is not alive can be reached without the calculation of a critical ratio for how nearly dead the live mice are or how much life still inheres in the dead ones. This kind of comparison is so easily conceived that it would seem that earlier examples of the use of a control group ought to exist. In psychology the use of the control group came in with the study of the transfer of training where a pretest to establish a base of skill also provided some early practice, learning which could not readily be separated from the formal practice of the experimental group, the effect of which was what was being measured. It is desirable to give a pretest in skill A to both an experimental and a control group, and then to give formal practice in skill B to the experimental group while the control group is being left unpracticed. 
After that, both groups can be tested and compared for improvement in A to see if practice of the experimental group in B led it to more improvement in A than had been furnished by the pretest for the control group. Thorndike and Woodworth (1901, esp. 558) are the first to have introduced this conception into the study of transfer, but their use of it was trivial. The earliest carefully designed study was by Winch (1908), a study which just happens to have been made in the same year that Gosset (1908) published his paper on
how to determine the significance of differences between groups by converting the critical ratio into a t-value. The two developments ran along neck and neck—the use of control groups and the statistical techniques for quantifying the confidence you should feel for the significance of the differences between such groups (the R. A. Fisher confidences). This kind of artifact, due to the impossibility of showing that the differences between the groups are negligible, is not an artifact that this book considers, so we may leave the topic there. Since some of the progress of science occurs by the discovery and correction of the many kinds of artifacts, it would seem that science still has a considerable future. (4) The fourth scientific use of the concept of control is irrelevant here. It is the reversion to the early use of the term that we find in Skinner’s use of the notion of the control and shaping of behavior (Skinner, 1953). It is applied to what may be an intentional production of artifacts and has such social uses as the psychotherapy of behavioral deviations or the eradication of ignorance. Education itself is, of course, an artifact as man promotes it for his fellows. That is what Rousseau thought when he extolled the ‘‘noble savage.’’
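The transfer-of-training design described under meaning (3) above, with a pretest in skill A for both groups, practice in skill B for the experimental group only, and a final comparison of improvement in A, can be illustrated with a minimal Python sketch. All scores are hypothetical and the function names are invented for illustration. The pooled-variance two-sample t used here is the later textbook form of the statistic, swapped in for convenience; it is not claimed to be the exact procedure of Gosset (1908) or of any study cited in this chapter.

```python
from math import sqrt
from statistics import mean, variance


def gains(pretest, posttest):
    """Improvement in skill A for each subject: final test minus pretest."""
    return [after - before for before, after in zip(pretest, posttest)]


def pooled_t(sample_a, sample_b):
    """Student's t for two independent groups, assuming equal variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled_var = (
        (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    ) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(sample_a) - mean(sample_b)) / standard_error


# Hypothetical scores in skill A (assumed data, not taken from any cited study).
experimental_pre = [10, 12, 11, 9, 13]
experimental_post = [15, 16, 15, 13, 18]   # after formal practice in skill B
control_pre = [10, 11, 12, 9, 13]
control_post = [12, 12, 14, 10, 15]        # no practice; pretest effect only

experimental_gain = gains(experimental_pre, experimental_post)
control_gain = gains(control_pre, control_post)

# The control group's gain estimates what the pretest alone contributes;
# the difference between the mean gains is the estimated transfer effect.
transfer_effect = mean(experimental_gain) - mean(control_gain)
t_value = pooled_t(experimental_gain, control_gain)

print(f"mean gain, experimental: {mean(experimental_gain):.2f}")
print(f"mean gain, control:      {mean(control_gain):.2f}")
print(f"transfer effect:         {transfer_effect:.2f}")
print(f"t (df = {len(experimental_gain) + len(control_gain) - 2}): {t_value:.2f}")
```

The sketch makes the dilemma in the text concrete: the t-value quantifies confidence in the difference between the groups, but nothing in the computation can show that the two groups were equivalent before the pretest.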
The Problem of Artifact

Now let us turn more specifically from the understanding of control to the problem of the artifact. Most of the discussion in this volume is concerned with the constancy and specification of experimental conditions. These requirements raise questions of controls of type 1, the discovery and specification of extraneous conditions (for the most part social in nature) that affect the experimental variables. There is always the hope that, once understood, they can be eliminated or at least be made subject to correction. The experimenter’s expectations and personality (Rosenthal, Chapter 6), subjects’ personality (Rosenthal and Rosnow, Chapter 3), their awareness of the experimenter’s intent (McGuire, Chapter 2), or their concern that they are being evaluated (Rosenberg, Chapter 7) may affect the results. The degree to which such factors inhere in the conditions of the experiment needs to be known, so that the ways to avoid their influences can come under consideration. (For an earlier discussion of these matters, see Rosenthal, 1965.)

Certainly one of the most scientifically important sources of error in experimentation lies in the indeterminacy of the specification of the variables. Consider the independent variable. If x, then y, and if not–x, then not–y. But what is x? Mill, the logician, did not have to consider that. To a logician, x is x, but to a scientist x is a variable, identified in words which may easily mean one thing at one time and another at another, or different things in different laboratories or at different periods of history. We have already seen that a single point touching the skin is not always the same stimulus. It may be felt as one, yet sometimes as two (a Vexirfehler).

Take the specific energy of nerves which was well validated by Johannes Müller in 1826 and 1838 and then discredited about seventy-five years later (Boring, 1942, 58–74). How could that have come about? The experiments were all right.
Stimulation of the nerve of any one of the five senses gives rise to the quality appropriate to that sense. The discrediting discovery was the finding that even sensory nerves are passive conductors that deliver qualitatively identical messages. Perhaps this difficulty seems semantic now, but it was very real at the time.
Sense-physiologists had been talking for many years about what it is that the nerves conduct: animal spirits, vis viva, vis nervosa. Müller was making the point that the sensorium does not perceive directly the external world, but only what the nerves bring to it. What each of the five nerves brings is patently different. Vis (force) and energy were not clearly distinguished in 1826 or 1838. If each of the five kinds of nerve brings something different to the brain, the five have different specificities, different “energies.” What Müller almost forgot (actually he did not quite forget it but he did not take it seriously) was that the nerves have another specificity that lies not in what they conduct. They have specificity of projection, and the specificity of perceived quality is the specificity of the connections that the sensory nerves make in the brain. The crucial difference lies not in the nature of the conduction but in whither conduction leads, and it took more than half a century to right this error.

The Wever–Bray effect when it was first discovered furnished another example of the mistaken identity of a variable, this time not the x or y of the observation but a physiologically intervening variable. When electronic amplification came in before 1930, it became possible for these investigators to put electrodes on the auditory nerve of a cat, amplify the voltage, speak to the cat, and hear the sounds in a loudspeaker over the circuit of amplification. This was a dramatic finding, though Wever and Bray reported it with great care and circumspection (Wever and Bray, 1930). Presently, it was discovered that the amplified potentials were not from the impulses of the VIIIth nerve but were induced by the electrical events involved in the action of the organ of Corti in the cochlea (Davis and Saul, 1931; Davis, Derbyshire, Lurie, and Saul, 1934; Boring, 1942, 420–423, 434–436).
Both discoveries were important as bearing upon the electrical nature of receptor response, but the first belief that the finding supported the frequency theory of hearing was due to an incorrect specification of the variables. As a matter of fact the experimental variable is ever so much more complex than is ordinarily supposed, and often the discovery of its nature is a scientific event of considerable importance. For instance, one is seldom sure about the true nature of a stimulus until research has accomplished the analysis. Such was Newton’s discovery of the stimulus for color and especially of the surprisingly complex stimulus for white. Galileo’s discovery of the stimulus for pitch not only put the psychology of tone in readiness for development but also made possible the scientific management of music. The psychology of smell has been long deferred because the nature of its stimulus remained unknown (Boring, 1942, 448). John Dewey in his famous paper on the reflex made the point that the stimulus cannot be presumed but has to be discovered (Dewey, 1896, 370; Boring, 1950, 554). The true stimulus is often an invariant known only as the result of careful research. For instance, the stimulus for apparent visual size under the rule of size constancy and Emmert’s law is the linear size of the retinal image of the perceived object divided by the distance of the object from the observing eye, for that is the invariant (Boring, 1942, 292; 1952, 144–146). A great deal of the progress of science has depended upon the discovery of such invariants (Stevens, 1951, 19–21). In social research the independent variable is less often called a stimulus but its specification as an invariant is just as important. Hypnosis furnishes us good examples of insufficiently specified initial determinants. 
When research on hypnosis is being undertaken, the independent variable is the experimenter’s suggestion and the dependent variable the subject’s behavior, but it is a mistake to think that the
experimenter’s words are all of the suggestion. ‘‘Bring me that rattlesnake,’’ says the experimenter of a live coiled rattlesnake behind invisible glass, and the subject complies until prevented by the glass (Rowland, 1939). Would he have done so had there been no glass? Perhaps, but the ‘‘demand’’ made upon him, to use Martin Orne’s term, was more than the verbal instruction. It included the knowledge that this was an experiment, that you do not get truly injured in an experiment, that there is an experimenter and a university looking out for you. The demand characteristic is much broader than it is explicit. Martin Orne (Chapter 5) shows how the essays in this book deal with instructional demands upon the subject’s behavior, instructions that are enormously amplified by special cues of which in many cases the subject is unaware. The independent variable is insufficiently specified.
A Dilemma

Now, how is the correctness of the specification of the experimental variables to be protected from all these predisposing additions, conscious and unconscious, that the subject adds, often from his knowledge about the experimental situation, to the intended explication of the independent variable? The answer would seem to lie in keeping the subject ignorant of what is going on, but that is difficult. In this respect a group control is better than a control experiment, because the one group has no communication with the other, but this advantage is offset by the fact that one cannot be sure that the two groups are comparable.

Are animals better subjects because they do not know the difference between an experiment and real life? Not always. There was H. M. Johnson’s dog whose threshold for pitch discrimination turned out to be the same as Johnson’s because the dog watched Johnson’s face and wanted to please (Johnson, 1913, 27–31). The skilled horse, Clever Hans, also watched his master and sought to please (Pfungst, 1911, 1965).

It would be better to secure ignorance by not working in a laboratory nor letting the subjects know that they are subjects or that there is an experimenter. That limitation may, however, put you in the position of the historian or the astronomer, limited to the method of agreement without a control. You cannot tell the subject to do this or that without giving away the artificiality of the situation, and, if the intended part of the observation is known to be artificial, other artifacts are almost sure to squeeze in. The investigators whose observation resulted in the publication of When Prophecy Fails infiltrated a fanatical group in order to note what happens when an assured conviction is frustrated (Festinger, Riecken, and Schachter, 1956), but such work is only gross preliminary taxonomy. One would like to get more rigorous facts by the use of experimentation.
The choice between laboratory control and the free uncontrolled behavior of natural phenomena is no new dilemma. Let us for a moment go back seventy-five or even only fifty years to the time when introspection was the principal method of the new experimental psychology. In those days secrecy was the rule about the experiments. Students did not talk about procedures and observations with one another. There was no general discussion of work in progress—except perhaps at the intimate meetings of the little Society of Experimental Psychologists, when graduate students who were subjects in an experiment were excluded from the room when the experiment was being discussed. Procedure without knowledge was the rule, and there was in force as
strong an ethic about discussion of current experiments as there was later about classified war material. Did secrecy work? It must have been helpful. Nowadays, some subjects are undergraduates hired from outside the laboratory; yet gossip spreads in any student group. Perhaps the chief evidence that hypotheses influenced results in the old days lies in the fact that introspection never settled the question of the nature of feeling: as to whether feeling is an independent quality or a kind of sensation and, if so, what kind, as to whether feeling can or cannot become the object of attention and how it is observed if it cannot enter the clear focus of attention. There was always in this crucial introspective matter, a suspicion that laboratory atmosphere (local hypotheses) influenced the findings. No one could produce proof, but right there lies one of the reasons why introspection faded out for lack of confidence; at least systematic experimental introspection lapsed although not the use of psychophysical judgments.

All in all we are left with a dilemma. The experimental method is science’s principal tool: y = f(x) is the goal. Control is necessary and is used even when it is not recognized as such. Every fact is at bottom a difference, and the method of concomitant variations emphasizes this relational characteristic about facts. Nevertheless, in the specification of a variable one always remains uncertain as to how exhaustive the description is. Artifacts adhere implicitly to specification and, when they are discovered, ingenuity may still be unable to circumvent them. When they are not discovered, they may persist for a year or a century and eventually turn out to be the reason why a well-established fact is at long last disconfirmed. For this reason scientific truth remains forever tentative, subject always to this possible eventual disconfirmation. But that is no new idea, is it?
References

Boring, E. G. Cutaneous sensation after nerve-division. Quarterly Journal of Experimental Physiology, 1916, 10, 1–95.
Boring, E. G. The stimulus-error. American Journal of Psychology, 1921, 32, 449–471.
Boring, E. G. Sensation and perception in the history of experimental psychology. New York: Appleton-Century, 1942.
Boring, E. G. A history of experimental psychology. 2nd ed. New York: Appleton-Century-Crofts, 1950.
Boring, E. G. Visual perception as invariance. Psychological Review, 1952, 59, 141–148.
Boring, E. G. The nature and history of experimental control. American Journal of Psychology, 1954, 67, 573–589.
Boring, E. G. History, psychology, and science: selected papers. R. I. Watson and D. T. Campbell (Eds.), New York: Wiley, 1963.
Cohen, I. B. Science, servant of man. Boston: Little, Brown, 1948.
Conant, J. B. On understanding science. New Haven: Yale University Press, 1951.
Davis, H., and Saul, L. J. Action currents in the auditory tracts of the mid-brain of the cat. Science, 1931, 74, 205f.
Davis, H., Derbyshire, A. J., Lurie, M. H., and Saul, L. J. The electrical response of the cochlea. American Journal of Physiology, 1934, 107, 311–332.
Dewey, J. The reflex arc concept in psychology. Psychological Review, 1896, 3, 357–370.
Festinger, L., Riecken, H. W., and Schachter, S. When prophecy fails. Minneapolis: University of Minnesota Press, 1956.
Gosset, W. S. (“Student”). The probable error of a mean. Biometrika, 1908, 6, 1–25.
Hankin, E. H. A cure for tetanus and diphtheria. Nature, 1890, 43, 121–123.
Johnson, H. M. Audition and habit formation in the dog. Behavior Monographs, 1913, 2, no. 3, serial no. 8.
Kincaid, Margaret. An analysis of the psychometric function for the two-point limen with respect to the paradoxical error. American Journal of Psychology, 1918, 29, 227–232.
McDougall, W. Cutaneous sensation. Reports of the Cambridge Anthropological Expedition to Torres Straits. 1903, II, 141–223. Cambridge, England: Cambridge University Press.
Mill, J. S. A system of logic, ratiocinative and inductive, being a connected view of the principles of evidence and the method of scientific investigation. 1843. Reprint: London: Longmans, Green, 1930.
Müller, G. E., and Pilzecker, A. Experimentelle Beiträge zur Lehre vom Gedächtniss. Zeitschrift für Psychologie, Ergbd. I. Leipzig: Barth, 1900.
Pascal, B. 1648. Trans.: The physical treatises of Pascal: the equilibrium of liquids and the weight of the mass of the air. New York: Columbia University Press, 1937.
Pfungst, O. Clever Hans (The horse of Mr. von Osten). 1907. Trans. 1911. Reprint: R. Rosenthal (Ed.), New York: Holt, Rinehart and Winston, 1965.
Riecker, A. Versuche über den Raumsinn der Kopfhaut. Zeitschrift für Biologie, 1874, 10, 177–201.
Rosenthal, R. Introduction. In Pfungst, 1965, op. cit. supra., ix–xlii.
Rowland, L. W. Will hypnotized persons try to harm themselves or others? Journal of Abnormal and Social Psychology, 1939, 34, 114–117.
Skinner, B. F. Science and human behavior. New York: Macmillan, 1953.
Solomon, R. L. An extension of control group design. Psychological Bulletin, 1949, 46, 137–150.
Stevens, S. S. Handbook of experimental psychology. New York: Wiley, 1951.
Thorndike, E. L., and Woodworth, R. S. The influence of improvement in one mental function upon the efficiency of other functions. Psychological Review, 1901, 8, 247–261, 384–395, 553–564.
Vierordt, K. v. Die Abhängigkeit der Ausbildung des Raumsinnes der Haut von Beweglichkeit der Körperteile. Zeitschrift für Biologie, 1870, 6, 53–72.
Wever, E. G., and Bray, C. W. The nature of the acoustic response: the relation between sound frequency and the frequency of impulses in the auditory nerve. Journal of Experimental Psychology, 1930, 13, 373–387.
Winch, W. H. The transfer of the improvement of memory in school-children. British Journal of Psychology, 1908, 2, 284–293.
Wundt, W. Grundzüge der physiologischen Psychologie. 1st ed. Leipzig: Engelmann, 1874.
Wundt, W. Grundzüge der physiologischen Psychologie. 6th ed. III, Leipzig: Engelmann, 1911.
2 Suspiciousness of Experimenter’s Intent

William J. McGuire
University of California, San Diego
Introduction

It is a wise experimenter who knows his artifact from his main effect; and wiser still is the researcher who realizes that today’s artifact may be tomorrow’s independent variable. Indeed, even at a given time, one man’s artifact may be another man’s main effect. The essentially relativistic and ambiguous criterion for calling a variable an “artifact” is well illustrated by the topic on which we focus in this chapter, namely, suspiciousness of the experimenter’s manipulatory intent. However, we shall begin by using the case of response sets to illustrate the transitory nature of the “artifact” status. Response sets serve better to show the stages by which a variable passes from artifact to theoretical focus since, as an older topic of study, they serve to illustrate the total career of an artifact more fully than does the more current “suspiciousness” problem.

After discussing the career of artifacts in general, we shall devote a section to considering the current artifactual status of suspiciousness of experimenter’s intent. This section will consider the antecedent variables that give rise to such suspiciousness and which therefore might be contaminated by it. A fourth section will review some of the theoretical housings in which current research on suspiciousness of experimenter’s intent is embedded. Finally, we shall consider the ethical problem of deception which is inextricably involved in any discussion of suspiciousness of experimenter’s manipulatory intent.

Any full consideration of how suspiciousness of the experimenter’s intent operates in psychological research would quickly broaden to include rubrics, such as guinea pig reactions, placebos, faking good, awareness, etc., each of which carries in its train a long history of experimental investigation.
In the mainstream of American experimental psychology, starting at its source back in the nineteenth century, it has been taken for granted that the subject should be kept unaware of the purpose of the experiment, even when it dealt with such unemotional issues as visual acuity or the serial position effect. It must be admitted that this routine secretiveness has not been universal, and some experimenters have even used themselves as observers and subjects in the areas of psychophysics and rote memorization (Ebbinghaus immediately comes to mind in this regard). An informal recollection of this undisguised research in which the experimenter serves as his own subject inclines me to believe that the results that were obtained in this flagrant manner have replicated remarkably well under covert conditions. Nevertheless, hiding from the subject the true purpose of
one’s experiment has become normative in psychological research. Ignorance is achieved either by noninformation or by misinformation. I suspect that one could discover such ludicrous cases as in, say, the rote memorizing area, where an experimenter who was investigating the serial position effect told his subjects that he was studying the effect of knowledge of results, while an investigator testing a hypothesis about knowledge of results informed his subjects that he was studying the serial position effect. The already existing ‘‘Minsk–Pinsk’’ joke fortunately relieves us from the felt need for the laborious scholarly investigation that would be needed to document this illustration. Inevitably there is growing concern regarding the probability and effect of a growing suspiciousness regarding experimenter’s intent by subjects drawn from heavily-used populations, such as the college sophomore. To some of us who have striven with more zeal than success to interest the student in the introductory classes (from whom the experimental subjects are drawn) in the results of much of the general experimental psychology research, the extent of our secretiveness compulsion in such areas as psychophysics and verbal learning might seem quite unnecessary. The problem seems more one of apathy than overcuriosity. If the student in the classroom seems so unmoved by the beauty of such a relationship as that between rate of presentation and shape of the serial position curve even when the instructor exhibits it to him framed in an enhancing theoretical architecture, why do we feel that special care is necessary to keep the same individual from actively suspecting what we are looking for when he participates as a subject in the laboratory and perhaps reacts so atypically that the results will not be generalizable to a naive population. 
Yet I suspect it was the great felt need for using unsuspecting subjects that has promoted some of the practices in American experimental psychology which seem a little peculiar to the layman. For example, our predilection for using nonhuman subjects, our avoidance of research on certain humanly gripping problems, our use of highly artificial laboratory situations, our avoidance of phenomenological explanatory concepts, etc., have all been partially motivated by our assumption that good research can be done only to the extent that the subject is unaware of the purpose of the investigation. While our secretiveness might seem excessive in such traditional areas as rote learning, one is inclined to take the need for deception more seriously in the case of social and personality research. One might have had only minor worries about the generalizability of research regarding serial position curves even if our data came from subjects who suspected that this was indeed what we were studying. There seem more grounds for worry that disclosure might cause serious loss of generalizability in other areas, such as operant verbal conditioning and attitude change research. In this chapter we shall concentrate on this latter line of research, but before focussing on this question of the extent to which awareness (or suspiciousness) of persuasive intent distorts the results of an attitude change experiment, we shall in the next section outline what we believe to be the life history of an artifact in general, illustrating the sequential stages through which it passes in terms of the ‘‘response bias’’ artifact.
Three Stages in the Life of an Artifact

A review of the progress of psychological interest in a wide variety of artifacts would, we believe, reveal a natural progression of this interest through the three stages of ignorance, coping, and exploitation. At first, the researchers seem unaware
of the variable producing the artifact and tend even to deny it when its possibility is pointed out to them. The second stage begins as its existence and possible importance become undeniable. In this coping phase, researchers tend to recognize and even overstress the artifact’s importance. They give a great deal of attention to devising procedures which will reduce its contaminating influence and its limiting of the generalizability of experimental results. The third stage, exploitation, grows out of the considerable cogitation during the coping stage to understand the artifactual variable so as to eliminate it from the experimental situation. In their attempt to cope, some researchers almost inevitably become interested in the artifactual variable in its own right. It then begins to receive research attention, not as a contaminating factor to be eliminated, but as an interesting independent variable in its own right. Hence, the variable which began by misleading the experimenter and then, as its existence became recognized, proceeded to terrorize and divert him from his main interest, ends up by provoking him to new empirical research and theoretical elaboration. The suspiciousness of persuasive intent artifact has only begun to reach this third stage. Hence, in this section we will illustrate the three stages by recounting the career of its somewhat older sibling in the artifact family, response biases. The preoccupation with response sets and styles developed about five years earlier than that in the awareness of persuasive intent issue, and now has reached a stage that allows a fuller illustration of the total life cycle. This brief consideration of the response bias artifact will also help to give our present discussion some perspective, and illustrate our claim that this three-stage career is a common one for artifacts, not peculiar to the suspiciousness of experimenter intent artifact that mainly concerns us in this chapter. 
The Ignorance Stage

Once the existence of an artifact becomes known, its baleful influence is essentially at an end. Thus the part of its life span during which it achieves its notoriety is essentially an anticlimax. Its deleterious effect on the development of a field of knowledge occurs during the long period of ignorance prior to its achieving the explicit attention of researchers. It is then that it leads the psychologist to draw false conclusions from data, to elaborate or trim his theories in inappropriate ways, and to design new research in ways more likely to confuse than clarify the issue. To those of us who have elected an intellectual vocation, it is gratifying to consider that once we start worrying about an artifact rather than ignoring it, while our peace may be at an end, so also is its harmful impact on the development of knowledge.

Still, there seems to be a considerable inertia about admitting the existence of an artifact. While we are all ready enough to grant that progress inevitably involves upsetting the ongoing routine of things and disturbing the peace, one is always reluctant to admit that his modus operandi involves an artifact. It tends to become a question of whose peace is being disturbed. While our superordinate goal is the discovery of truth, we are reluctant to give up the subordinate goal of believing that our past research was heading inexorably toward such discovery. Hence, it seems necessary that an artifact be discovered and rediscovered several times before it becomes a sufficiently public scandal so that some bright young men seize upon it as a device to pry their elders out of their ruts and find a place in the sun for themselves. Like much else that disturbs and advances a field of knowledge currently, the discovery of artifacts seems to be the work of the associate professors.
Book One – Artifact in Behavioral Research
Our illustrative artifact of response bias exhibits the difficulty of the discovery process. That it took the field so long to become interested in this artifact seems strange for a number of reasons. In the first place, it is a sufficiently obvious problem so that it occurs to the layman, and must have suggested itself constantly, at least preconsciously, to those engaged in testing enterprises. Moreover, so much of psychological research involves ability or personality testing that opportunities to stumble upon such an artifact were constantly available. With the almost stifling amount of current research on response biases, it is hard to believe there was ever a time when the field was not preoccupied with them. As a matter of fact, though, not even several explicit considerations of its existence stirred up any great amount of research interest. Thus, the demonstration by Lorge (1937) and by Lentz (1938) of an acquiescence response set in personality tests seems to have stirred up little interest until a decade passed. It is undoubtedly Cronbach (1941, 1942, 1946, 1950) who deserves to be called ‘‘the father of the response set.’’ Since he called attention to the role of response biases, research interest in this area has grown progressively. It is perhaps significant of a psychological reality in the history of science that he dealt primarily with response biases in abilities tests, where they are more manageable. It was in this manageable area that the possible importance of the artifact was first admitted and attention paid to it. Lorge was sounding the alarm, in connection with response biases in personality tests, where they are somewhat more difficult to cope with, and not surprisingly he found the field little attuned to the drum he was beating. 
In the personality tests area, we are faced with more difficult response biases such as social desirability, while with abilities tests the discussion is more confined to acquiescence or position biases that can be somewhat more easily handled by mechanical adjustments in the wording of questions or in the ordering of responses. By the late 1940s, however, even those working in the personality area admitted the importance of controlling for “faking good” tendencies. At least those who were working with the MMPI realized this problem (Ellis, 1946; Meehl and Hathaway, 1946; Gough, 1947). This resurgence of interest in the social desirability artifact, which has been maintained (or overmaintained) until the present, followed ten years of silence after the earlier reports by Steinmetz (1932) and by Kelly, Miles, and Terman (1936) of the existence and importance of this response bias artifact in personality measurement.
The Stage of Coping

Once a field has admitted the existence of an artifact, as occurred in the case of response biases in the 1940s, researchers in the area devise methods of coping with it so that it does not make the results of experimentation ambiguous or less generalizable. Sophistication in achieving this aim tends to pass through several successive steps during the coping phase. Three of these, rejection, correction, and prevention, can be illustrated in the case of response sets.

1. Detection and rejection as a mode of coping. The most primitive form of coping with an artifact is to detect in which subjects it is operative beyond a certain (arbitrarily determined) amount, and then to reject the data from all the subjects above and accept at face value the data from subjects not above this threshold. Behind this strategy there lurk the double fallacies that there is a magic sieve which can skim off the noise and leave the information behind; and that this can be done by simple dichotomization. Still, it must be
admitted that research is the art of the possible and that we must often compromise by using less-than-perfect methodological tactics. If we will grant that some subjects are more susceptible to the artifactual process than others, then there is a certain logic of desperation in this tactic of rejecting the data from all subjects who exhibit it beyond a certain arbitrarily set amount, and ignoring it in the subjects who do not exceed this preset amount. In this way one hopes to avoid major contamination by the artifact, while admitting that some information is thrown away with the rejected subjects while some artifactualness remains in those subjects who did not exhibit the contamination beyond the preset amount. There are various alternative procedures used in this detection and rejection mode of coping. Minnesota Multiphasic Personality Inventory (MMPI) research furnishes good examples of the use of three of these: catch scales, response counts, and discrepancy scores. Early in the development of this standardized personality inventory, a number of subscales were introduced in order to catch response sets. The most widely used of these are undoubtedly the F, the K, and L scales (though additional scales to catch malingering, social desirability, etc., were all developed by the nineteen-fifties). The notion is that anyone who answers too many questions in a way that is inconsistent, exceedingly rare, too good to be true, etc., should be rejected as manifesting too much response bias to furnish a usable protocol. The response count procedure is closely akin to the use of catch scales. This simply involves counting up the number of responses in a certain category, for example, the number of ‘‘?’’ responses, to detect noncommitment response sets or the number of ‘‘yes’’ responses to check for acquiescence response bias. Here again, when one detects subjects exceeding a certain arbitrarily set level in the use of the response category, their protocols are rejected. 
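The response count tactic just described can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the function names, the toy protocols, and the cutoff values are hypothetical and are not taken from the MMPI or from any published scoring key.

```python
from collections import Counter


def response_counts(protocol):
    """Count how often each response category occurs in one subject's protocol."""
    return Counter(protocol)


def reject_by_count(protocols, category, cutoff):
    """Split subjects into (kept, rejected) using an arbitrary cutoff.

    A subject is rejected when the number of responses in `category`
    exceeds `cutoff`; all other subjects are accepted at face value.
    """
    kept, rejected = {}, {}
    for subject, protocol in protocols.items():
        exceeds = response_counts(protocol)[category] > cutoff
        (rejected if exceeds else kept)[subject] = protocol
    return kept, rejected


# Hypothetical protocols: one response per item, "?" meaning "cannot say".
protocols = {
    "s1": ["yes", "no", "yes", "no", "?", "no"],
    "s2": ["yes", "yes", "yes", "yes", "yes", "no"],
    "s3": ["?", "?", "?", "yes", "?", "no"],
}

# Screen for acquiescence (too many "yes") and for noncommitment (too many "?").
kept_yes, rejected_yes = reject_by_count(protocols, "yes", cutoff=4)
kept_q, rejected_q = reject_by_count(protocols, "?", cutoff=3)
```

The sketch makes the arbitrariness of the tactic visible: moving either cutoff by a single response changes which protocols survive, and a subject just under the cutoff is treated as wholly free of the bias.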
A third tactic employing the detection and rejection mode of coping is the use of discrepancy scores. In general, the several discrepancy approaches involve partitioning the items that measure a given variable, for example, the schizophrenic measure on the MMPI, into two subsets, one of which is made up of obvious items and the other of more subtle items. Subjects are then rejected as trying to conceal their symptoms if the discrepancy between the subtle and the obvious subscores exceeds a certain preset amount. Alternatively, simulated patterns, based on the responses of subjects who have been asked “to fake good,” are used to detect and reject subjects whose protocols reflect an unacceptably high need to appear healthy.

2. The correction mode of coping. The detection and rejection procedures which we just considered had an obvious arbitrariness to them which made them less than ideal as methods for eliminating the effects of the artifact. The use of an arbitrary cutoff point leaves in a considerable amount of the artifactual variance and eliminates a fair amount of the variance due to the factor under investigation. Inevitably, this primitive stage is succeeded by a more sophisticated approach to the problem which we here call the “correction” procedure. The experimenter using this tactic attempts to retain all of the data collected and adjust each person’s scale score for the amount of artifactual variance that contaminates his responses. A classic example of this adjustment procedure is given by the K scale of the MMPI. The subscale is here used, not as a device for detecting and rejecting certain protocols, but as a suppression scale score which furnishes a correction factor for each person’s score, hopefully tailored to his amount of artifactual variance. Other examples of correction procedures involve the use of control groups or conditions.
For example, we might determine a person’s feelings about a subject matter area by giving him a reasoning, retention or perception task involving material from that area and calculating how much his score is affected by motivated distortion, after correcting the raw score for his capacity at this type of task on neutral materials. These correction procedures typically involve elaborate statistical adjustments.
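The two correction tactics described above can be sketched as follows. The sketch rests on assumptions of our own: the linear form of the adjustment, the weight of 0.5, and the function names are illustrative only and are not offered as the published K-correction weights or as any author's actual scoring procedure.

```python
def suppressor_correct(raw_score, suppressor_score, weight):
    """Adjust a scale score by a weighted suppressor (correction) scale score.

    Every protocol is retained; the size of the adjustment depends on how
    much of the artifact the suppressor scale detects in that person.
    """
    return raw_score + weight * suppressor_score


def distortion_index(score_on_loaded_material, score_on_neutral_material):
    """Estimate motivated distortion on a reasoning or retention task.

    The neutral-material score stands in for the person's capacity at the
    task, so the difference is attributed to the subject matter itself.
    """
    return score_on_loaded_material - score_on_neutral_material


# Hypothetical weight and scores, chosen only to show the arithmetic.
WEIGHT = 0.5
guarded = suppressor_correct(raw_score=20, suppressor_score=20, weight=WEIGHT)
candid = suppressor_correct(raw_score=20, suppressor_score=10, weight=WEIGHT)

# Same capacity on neutral material, poorer performance on loaded material.
distortion = distortion_index(score_on_loaded_material=12,
                              score_on_neutral_material=15)
```

Two subjects with identical raw scores end up with different corrected scores, which is the point of the tactic. It is also its weakness, since the result is only as good as the weight and the validity of the suppressor scale.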
3. Prevention modes of coping. The more we learn about the correction modes, the more hypersensitive they seem to be to the validity of the scales. For example, it can be demonstrated that unless our predictor scale correlates at least .70 with the criterion, it is better to develop an additional predictor scale than to develop a suppressor scale to correct the original one (Norman, 1961). In view of the tedium and the indifferent success of the correction modes, it is not uncommon to find that attempts to cope with an artifact develop from an adjustment stage to a prevention stage. These prevention tactics involve use of one or another procedure that avoids the artifact’s occurring or at least its contaminating our obtained scores. In the case of response sets, the prevention approach has taken several forms. One procedure is to use counterbalanced scales, such as keying the items so that ‘‘yes’’ and ‘‘no’’ responses equally often indicate possession of the trait. A second procedure is the use of ipsatizing procedures. These sometimes take the form of a priori ipsatizing as in the use of forced-choice items. In our opinion it is preferable that they take the form of a posteriori ipsatizing as, for example, pattern analysis. A third prevention approach involves utilizing experimental procedures which minimize the likelihood of occurrence of the artifact. For example, one might attempt to minimize the extent of response biases, such as noncommitment, acquiescence, and social desirability, by anonymous administration or by explicit instructions to the respondent. A fourth method for preventing response biases such as social desirability is to use subtly worded items or other disguising procedures. As this second or ‘‘coping’’ stage in the career of an artifact reaches its height, we find methodological tours de force with experimenters using all three tactics. 
They devise administration and scoring procedures that tend to eliminate the response artifact, adjust the scores for such detectable artifactual variance as remains, and eliminate a few of the subjects showing an excessive degree of artifact-proneness. By the time that this elaborate coping response is evoked, one is likely to find that the artifact has already reached the third stage of its career, its apotheosis into an independent variable in its own right.
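Two of the prevention tactics mentioned above, counterbalanced keying and a posteriori ipsatizing, lend themselves to a brief sketch. We assume here, for illustration only, that ipsatizing means expressing each of a person's scale scores as a deviation from his own mean; the function names and data are hypothetical.

```python
def score_counterbalanced(responses, keyed_yes):
    """Score a scale on which "yes" and "no" equally often indicate the trait.

    responses: list of bools, True meaning the subject answered "yes".
    keyed_yes: list of bools, True when "yes" is the trait-indicating answer.
    """
    return sum(1 for answer, key in zip(responses, keyed_yes) if answer == key)


def ipsatize(scale_scores):
    """Express each scale score as a deviation from the person's own mean."""
    mean = sum(scale_scores) / len(scale_scores)
    return [score - mean for score in scale_scores]


# Half of the items are keyed "yes" and half are keyed "no".
keys = [True, False, True, False]
yea_sayer = score_counterbalanced([True, True, True, True], keys)
nay_sayer = score_counterbalanced([False, False, False, False], keys)

# Two profiles with the same shape but different overall elevation.
modest_profile = ipsatize([5, 3, 4])
elevated_profile = ipsatize([7, 5, 6])
```

With balanced keying, indiscriminate yea-saying and nay-saying both earn the same middling score, so acquiescence cannot masquerade as the trait. Ipsatizing removes a uniform elevation across scales, at the cost of discarding whatever true differences in overall level exist between persons.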
The Exploitation Stage

It is rather heartwarming to observe that in the final stage in the career of an artifact, the variable comes into its own. The ugly duckling becomes the Prince Charming that gives rise to a new line of research. Not only in the case of the artifact considered in this chapter, suspiciousness of experimenter’s intent, but in the artifacts considered in several other chapters of the present volume, we find variables which, long considered annoying artifacts to be eliminated, ultimately become independent variables of considerable theoretical interest in their own right. For example, in Chapter 4 we see Lana’s depiction of how the use of “before” tests developed from a methodological device to reduce the impact of initial individual differences to a sensitization variable of intrinsic interest; or we find in Chapter 6 Rosenthal’s account of how the influence of the experimenter’s expectations on the obtained results developed from being a worrisome contamination to the status of a research program on nonverbal communication and social influence.

The case of response bias, which we are using to illustrate this account of the career of an artifact, shows the typical happy ending. Variables like social desirability (Crowne and Marlowe, 1964) or acquiescence (Couch and Keniston, 1960) are now considered interesting individual difference characteristics in their own right, rather than merely contaminants to be eliminated from our personality scales. We even find attempts such as that of Messick (1960) to map out personality space
entirely in terms of what were once regarded only as biases to be eliminated before such an enterprise could get underway. In the case of the suspiciousness artifact to which we devote the remainder of our discussion in this chapter, research interest has only recently entered the third phase. A necessary preliminary to the efficient investigation of an independent variable, or even of an artifact, is that we gain experimental control over it so that the experimenter is able to manipulate it. In the next section we will consider a dozen or so procedures by which the extent of the subject’s suspiciousness of the experimenter’s intent can be manipulated. Almost all of these procedures were developed for other reasons than the manipulation of the subject’s suspiciousness, which indeed is the reason why such suspiciousness was initially considered an artifact. As our interest in the suspiciousness variable enters the third phase, the availability of so many procedures for manipulating it is quite useful. Hence what is, during the second stage of an artifact’s career, considered its deplorable pervasiveness, becomes, in the third phase, a considerable convenience in its study. While the suspiciousness problem is a pervasive one in research, to provide focus for our discussion, all our examples of procedures for manipulating this suspiciousness will be taken from the special area of attitude change and social influence.
Antecedents of the Awareness of Persuasive Intent

In attitude change research, the experimenter traditionally pretends to the subject that his research deals with another topic. If he is studying the effect of peer pressure on conformity, he might employ visual stimuli and represent his study as an investigation of sensory acuity. If he is studying the impact of persuasive messages on beliefs, the experimenter might say he is studying reading comprehension ability and represent the persuasive message as the test material. It seems to be taken for granted that if one admitted the persuasive intent of the communication, the subjects’ behavior could not be interpreted and generalized to the behavior of the naive subjects to whom our theories of persuasion are supposed to apply. Hence, any sign that the subject is suspicious of the persuasive intent of the experimenter is likely to elicit alarm.

It is consequently cause for concern that in at least eleven lines of attitude change research there is reason to suspect that the experimental manipulation, in addition to (or instead of) varying whatever it is intended to vary, might also be affecting the subject’s suspiciousness of persuasive intent. Any relationship which is found might be due not to the originally theorized effect of the manipulation, but to its impact on the subject’s suspiciousness. Some of these possibly artifactual manipulations involve how the source is represented to the subject; others have to do with the contents of the persuasive message; and still others concern the experimental procedures. We shall consider each of these classes of work in turn, discussing their possible artifactual components. We are here considering suspiciousness of persuasive intent as a second stage artifact, that is, a contaminating factor in our experiments which we deplore and attempt to eliminate.
As interest in this variable develops to the third stage of independent variable in its own right, our evaluative reaction to its pervasiveness undergoes a change. The fact that it might be affected by all eleven of these types of
manipulations then becomes a convenience for studying it and a sign of its importance. Its pervasiveness is then seen as increasing its attractiveness for study, rather than as an endemic contaminant that drives us to despair.

Effect of Source Presentation on Subject’s Suspiciousness

A number of variables having to do with how the source is represented to the subject might affect the subject’s suspiciousness of persuasive intent. One such variable is the extent to which the introduction is such that the subject is led to perceive that the source has some profit to gain from his position’s being accepted. A second involves whether or not the source is represented as realizing that the subject is an audience for his message. Whether the source’s presentation is represented as having occurred in the context of a debate or as a noncontroversial presentation is a third such variable. A fourth situation of this type involves the primacy-recency issue, and concerns whether the source presents his message first and therefore to a naive audience, or comes only after they have been exposed to the opposition side and are sensitized to controversiality. We shall consider each of these lines of work as they bear on the question of suspiciousness of persuasive intent. It should be noted that this suspiciousness is an intervening variable. Hence to understand its operation we must answer two questions. To what extent do these antecedent conditions actually affect suspiciousness of persuasive intent? And given that suspiciousness is affected, to what extent is the ultimate dependent variable of opinion change (or whatever) further affected?

1. Perceived disinterestedness of the source.
A number of studies have involved varying the introductory description of the source in such a way that he is represented to some of the subjects as having something to gain from their agreement with his point of view, while for other subjects he is made to appear more disinterested in the point about which he is arguing. It seems reasonable to assume that the former procedure will produce greater suspiciousness of the source’s intent to persuade. For example, a given speech advocating more lenient treatment of juvenile delinquents is judged to be fairer and produces more opinion change when the speaker is identified as a judge or a member of the general public, than when he is identified as someone himself involved in juvenile offenses (Kelman and Hovland, 1953). There is, however, evidence to suggest that the differential persuasiveness of these sources is due to their status difference rather than their differential intent to persuade. Thus Hovland and Mandell (1952) used a speech favoring currency devaluation and attributed it, for half the subjects, to an executive in an importing firm who would stand to profit financially from such devaluation; while for the other half of the subjects, the speaker was represented as a knowledgeable but disinterested academic economist. The speech was judged considerably fairer when it came from this latter, disinterested source but it was equally effective in changing opinions regardless of the source. Put together, the results of the two experiments suggest that by proper portrayal of source disinterestedness one can manipulate suspiciousness of intent to persuade but this differential suspiciousness does not seem to eventuate in any attitude change differential unless the source’s status is also varied. 
In practice, the source variables of disinterestedness, expertise, and status will often be contaminated and so the results of varying any one of them must be interpreted carefully lest the contamination of the characteristic produce misleading results. At any rate, these ‘‘disinterestedness’’ manipulations provide little support at present for the assumption that suspiciousness of the source’s persuasive intent reduces the amount of opinion change he effects.
2. Source’s purported perception of his audience. We might assume that the subject will be more suspicious of the persuasive intent of the source if he is made to perceive that the source knows he is listening than if he believes that he is overhearing the source without the latter’s knowledge. Walster and Festinger (1962) did indeed find that women are more likely to be persuaded by a given conversation if they think they are inadvertently overhearing it rather than when they feel the speakers are aware that they are listening, though this difference was found only with highly involving topics. Subsequent work by Brock and Becker (1965) indicated that the greater effectiveness of overheard communication was even further limited, requiring that the sources argue both in the direction which the audience wants to hear and also on an involving issue. Mills and Jellison (1967) interpreted this limitation of the difference to arguments in desirable directions as indicating that a source is more likely to be judged sincere when he argues in a direction which he knows undesirable to his audience. They found in line with this interpretation that students are more influenced by a speech favoring raising truck license fees if they are told it has originally been given to truck drivers (for whom it would be arguing in an undesirable direction) than when told it has been delivered to railway men (who would have found its conclusion desirable). Walster, Aronson, and Abrahams (1966) also found that a source has more impact when he is perceived as arguing against his own best interest. This set of experiments could be interpreted as indicating that a given message is persuasive to the extent that its source is not perceived as trying to persuade. However, a more precise interpretation would seem to be that the source is more persuasive when he is perceived to be urging an opinion in which he sincerely believes. 
Hence, if the crucial variable here is to be called suspiciousness, it seems that it is suspiciousness of insincerity rather than suspiciousness of persuasive intent that is crucial.

3. Perceived disputatiousness of the source. The orthodox suspiciousness theorizing would suggest that if the source is represented as having given the message in a controversial setting it will evoke more suspiciousness of persuasive intent, and therefore less attitude change impact, than if purportedly given in a noncontroversial setting. Sears, Freedman, and O’Connor (1964) report that subjects respond differently when they anticipate a confrontation of speakers in a debate situation from when they are led to expect simply two uncoordinated opposed speeches. In anticipation of the clear-cut debate, the more highly committed subjects tend to polarize and the less committed subjects to moderate their initial opinions. Somewhat relevant to the suspiciousness hypothesis, though nonsupportive of it, are the results of Irwin and Brockhaus’s (1963) study comparing the effectiveness of two speeches favoring the telephone company, an educational-type talk versus one more explicitly asking for the subject’s approval. The educational one was judged as more interesting, but the one more directly appealing for approval produced more favorableness to A. T. & T. While this difference has been interpreted as indicating that the more explicit advocacy of the company produces more effect, it seems to us that the conditions were such that the differences could have been due to more personally relevant appeals used in the disputatious version, or to the distracting effect of the information in the educational version.
Further evidence that overt disputatiousness and partisanship might actually enhance attitude change impact by clarifying the source’s point is indicated by a study in which Sears (1965) presented material favoring the defense or prosecution in a juridical proceeding and found that this material had more persuasive impact when it was clearly identified as coming from a defense or a prosecution lawyer than when it purportedly came from a neutral lawyer, even though the latter was rated as more trustworthy. 4. Order of presentation as affecting suspiciousness. The primacy-recency variable becomes involved in the suspiciousness question since, as Hovland, Janis, and Kelley
(1953) conjectured, the first side in debate would have the advantage of seeming less controversial than the second, particularly with a noncontroversial issue and in a situation not clearly defined in advance as a debate. An audience would be more inclined to interpret the first side’s presentation as a rounded view of the topic, but when they received the second side it would be much clearer to them that they were now hearing a one-sided viewpoint on an issue where other views were quite possible. In this formulation, primacy effects in persuasion are attributed to the subject’s greater suspiciousness of persuasive intent while listening to the second side. Hovland (1957) finds some suggestive support for this notion in his impression that primacy effects are more pronounced in situations where a single communicator presents both sides than when each side is presented by a different communicator. This suspiciousness hypothesis predicts main order primacy effects and more manageably, a number of interactions between order of presentation and other variables in the communications situation as they affect opinion change. These interaction variables include the controversiality and the familiarity of the issue, the use of suspicion-arousing pretests, etc. We have reviewed this literature in some detail elsewhere (McGuire, 1966, 1968) as has Lana in his chapter in this book and elsewhere (Lana, 1964). In general, the experimental results seem to defy description by the suspiciousness hypothesis. As regards main effect, primacy effects may be somewhat the more common, but recency effects are far from rare. The interactions between the order variable and others, such as issue controversiality, go in the direction opposite to that required by the suspiciousness hypothesis in some studies while confirming it in others. Overall, the primacy-recency results offer little support for the orthodox formulation that suspiciousness of persuasive intent dampens persuasive impact.
Suspicion-Arousing Factors Having to Do with Message Style and Content

Above we considered ways in which source presentation might arouse suspiciousness of persuasive intent and thus purportedly affect the persuasive impact of the message. In this section we shall consider how the content and style of the message might give rise to such suspicions. One such possible variable is whether the conclusion is drawn explicitly within the message, as opposed to being left for the subject’s own inferring. Another content variable which might give rise to suspicion is whether the opposition arguments are completely ignored or taken into consideration within the persuasive message. Still another possible message factor which might give rise to such suspicion is the extremity of the position which is urged. Finally, we shall consider such stylistic characteristics as the dynamism of the delivery as it might affect suspiciousness of persuasive intent. As has already been seen in the case of source factors, the results regarding message factors which we shall review here give surprisingly little support to the notion that arousal of suspiciousness tends to reduce persuasive impact.

1. Explicitness of conclusion drawing. The belief that a conclusion is more persuasive if the person derives it for himself (rather than having it announced to him by the source, however prestigeful) has been current at least since the beginning of the psychoanalytic movement and nondirective therapy in general. Freud indicated that he abandoned hypnotherapy with its stress on therapist suggestion, in favor of psychoanalysis with its stress on the patient’s active participation in the discovery of the bases of his problems, in part because of the incredulity with which many of the therapist-drawn conclusions were received by the patient. Indeed, psychoanalytic theorists have developed an epistemology
as well as a therapy based on the notion that its insights require personal experience and self-analysis, rather than simply external presentation, in order to obtain credence and comprehensibility. There are, of course, other theoretical reasons for advocating that the patient participate actively in the drawing of conclusions regarding the nature of his problem. Any theory of therapy which depended on such concepts as abreaction, emotional catharsis, rapport, transference, etc., would tend to encourage the patient’s active participation in the therapeutic process even aside from credibility factors. However, the notion that the patient is more likely to believe the therapist’s interpretation of his problem if he himself actively participates in the arrival at the conclusion, rather than having the conclusion presented to him passively, provides at least part of the motivation for urging nondirective therapy. The empirical results give little support for this notion that a message is more persuasive if it leaves the conclusion to be drawn by the subject. The early work by Janis and King (1954; King and Janis, 1956) did seem to indicate that a subject was more persuaded by actively improvising a speech, rather than by passively reading or listening to a comparable speech. However, subsequent research has cast considerable doubt on the persuasive efficacy of active improvisation, as reviewed recently by McGuire (1968). The Hovland and Mandell (1952) study indicated that allowing the subject to draw the conclusion for himself, far from being more efficacious, actually produced far less opinion change than when he had the conclusion passively presented to him. A number of other studies have likewise failed to indicate that a message which allows the subject to draw the conclusion for himself, and thus would presumably arouse less suspiciousness of persuasive intent, was more persuasive than was a more explicit conclusion drawing (e.g., Cooper and Dinerman, 1951). 
What we seem to have here is a situation in which any enhanced effectiveness due to the increased credibility that is produced by the subdued, implicit-conclusioned message through its lesser arousal of suspiciousness, is more than cancelled by its loss of effectiveness due to the subject’s failure to get the point. We have been arguing frequently of late that most of the difficulty in persuading the audience (both in laboratory experiments and in naturalistic mass media situations) derives from the difficulty of getting the apathetic audience to attend to and comprehend what we are saying, rather than in overcoming its resistance to yielding to our arguments. The barrier is provided by intellectual indolence, rather than by motivated resistance. It seems quite possible that those who do in fact actually draw for themselves the conclusion of the implicit message may be more persuaded thereby; but it is more apparent that very few do in fact avail themselves of the opportunity actively to draw the conclusion or rehearse the arguments (McGuire, 1964). It is also probable that there is a gradual ‘‘filtering down’’ of the persuasive impact from the explicit premises to the implicit conclusion with the passage of time (Cohen, 1957; Stotland, Katz, and Patchen, 1959; McGuire, 1960, 1968). A cognitive inertia may prevent the need for cognitive consistency from manifesting its full effect on remote issues immediately after the message. The studies cited suggested that with the passage of time these logical ramifications are increasingly discernible in the belief system, as the initial inertia is gradually overcome. Even over time, however, the impact of the implicit message only catches up with, rather than surpasses, that of the explicit message. 2. Treatment of the opposition arguments. 
We might expect that the treatment of the opposition’s arguments would have some influence on the obviousness of our intent to persuade, and thus affect the persuasive efficacy of our message. A message which is completely one-sided, ignoring the existence of opposition arguments of which the subject may be quite aware, should seem more biased and blatantly attempting to persuade than would a message which took into account the opposition arguments by
mentioning them and attempting to deal reasonably with them. Yet the World War II studies in the Army indoctrination program indicated that neither the one-sided nor the ‘‘two-sided’’ message had an overall greater persuasive impact, where the former presented the arguments for one’s own side and ignored completely the opposition arguments while the latter presented one’s own side but at least mentioned and sometimes refuted the opposition arguments (Hovland, Lumsdaine, and Sheffield, 1949). In fact, the latter was not even perceived as more fair a presentation, the impression of objectivity being, if anything, in the reverse direction. This peculiarity may have derived from the peculiar condition that the ‘‘two-sided’’ message ignored one of the most salient opposition arguments, while refuting less salient ones. It may be that to elicit the appearance of objectivity by the mention of the opposition arguments, one loses more credibility than he gains unless he is careful to mention all of the salient counterarguments. As far as the direct persuasive impact of refuting versus ignoring the opposition is concerned, the results seem to indicate that counterarguments which the subjects are likely to think of spontaneously are best refuted and those which would not arise spontaneously are best ignored if one wishes to achieve maximum persuasive impact. Hence, less intelligent subjects and those who are closer in their initial position to the conclusion being urged tend to be more influenced by messages which ignore the opposition arguments, while refuting the opposition argument tends to be more effective with subjects of higher intelligence and those further in the opposition as regards their initial opinions. Refuting, rather than ignoring, the opposition arguments does seem to be superior in developing resistance to subsequent counterattacks. 
The superior immunizing efficacy of mentioning and refuting (rather than ignoring) opposition arguments has been demonstrated by Lumsdaine and Janis (1953), McGuire (1964), Tannenbaum (1966), and others. It should be noted in the present connection, however, that the suspiciousness of persuasive intent mechanism does not seem to play any major part in the immunizing efficacy of considering the opposition arguments. The evidence currently seems to indicate that resistance conferral derives from the motivating threat which the mention of the opposition argument arouses. It is conceivable, though, that the subsequent persuasive attack is less effective also because the prior mention of its arguments makes the subject more suspicious of its persuasive intent. 3. Extremity of message position. Suspiciousness of persuasive intent would seem to occur with greater probability as the position espoused in the message became more and more extreme. In so far as this suspiciousness factor is concerned, increasing the discrepancy between the position urged in the message and the subject’s initial position should progressively reduce the persuasive impact. It would be naive, however, to disregard the likelihood that other processes mediate the relationship between message discrepancy and the amount of opinion change. For example, Anderson and Hovland (1957) postulate that a reverse relationship obtains such that amount of attained opinion change is an increasing function of amount of change urged. This position is plausible since when a discrepancy is quite small, the amount of change produced would be relatively minor even if the message was completely effective, while with large discrepancies, even a partly effective message could produce a considerable absolute change. 
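The arithmetic behind this last point can be illustrated with a minimal sketch. It assumes a simple proportional form in which obtained change equals a fixed fraction of the change urged; the function name, the `effectiveness` parameter, and the scale values are hypothetical choices made for illustration and are not taken from Anderson and Hovland.

```python
def proportional_change(initial: float, advocated: float, effectiveness: float) -> float:
    """Opinion change under an assumed proportional model (illustrative only).

    effectiveness is the fraction of the urged change that is actually
    obtained: 1.0 means the message is completely effective.
    """
    if not 0.0 <= effectiveness <= 1.0:
        raise ValueError("effectiveness must lie between 0 and 1")
    discrepancy = advocated - initial  # amount of change urged
    return effectiveness * discrepancy


# Hypothetical scale positions: the subject starts at 4.0 in both cases.
small = proportional_change(initial=4.0, advocated=5.0, effectiveness=1.0)
large = proportional_change(initial=4.0, advocated=12.0, effectiveness=0.25)
print(small, large)
```

Under these assumed values, a completely effective message urging one scale point of change yields less absolute change than a message that is only one-quarter effective but urges eight points.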
These considerations have led a number of theorists to posit an overall nonmonotonic relationship between amount of obtained change and amount of urged change (Osgood and Tannenbaum, 1955; Sherif and Hovland, 1961), with maximal opinion change occurring at intermediate discrepancies. This theoretical formulation, that as discrepancy becomes quite large it produces sufficient suspiciousness of persuasive intent to overcome the effect postulated in the Anderson and Hovland proportional model, is quite plausible but empirical work has shown that it occurs only at very extreme ranges of discrepancy. Over a surprisingly wide
range, the monotonic relationship holds such that the greater the discrepancy the greater the induced change. It is true though that some experimenters who persevered to the extent of producing extremely wide discrepancies have succeeded in demonstrating a reversal in effectiveness as the position urged became quite extreme (Hovland, Harvey, and Sherif, 1957; Fisher and Lubin, 1958; Whittaker, 1964, etc.). As might be expected from these suspiciousness explanations, the turndown is most likely to occur with low credible sources (Bergin, 1962; Aronson, Turner, and Carlsmith, 1963) and with ambiguous issues (Insko, Murashima, and Saiyadain, 1966), and where commitment to one’s initial position is high (Freedman, 1964; Greenwald, 1964; Miller, 1965). The turndown is probably least likely to occur where the subject experiences a great deal of evaluation apprehension (Zimbardo, 1960), which is discussed more fully in Rosenberg’s chapter in this book. As compared with most of the lines of research we have been considering, the evidence for the straightforward suspiciousness hypothesis (that as the position urged becomes more extreme, suspiciousness of persuasive intent increases and persuasiveness decreases) is fairly encouraging. Still, it should be noted that it takes rather surprising degrees of extremity before any such effect is manifested. 4. Style of presentation. It seems likely that suspiciousness of persuasive intent can be aroused, not only by the content of the message, but also by the style in which it is presented. A dynamic style of presentation seems more likely to arouse such suspicion than does a more subdued style; more conjecturally, an elegantly worded and presented speech might seem more suspicious than an improvised informal style. 
As regards the intensity of presentation variable, Hovland, Lumsdaine, and Sheffield (1949) found no differences either in attitude change or in perceived intent to persuade between two forms of an argumentative presentation used with U.S. Army personnel in World War II, a dynamic documentary style presentation and a subdued narrator style. Greater attention to this intensity of style variable is given by researchers in the speech area than in psychology. Bowers (1964) has attempted to determine the components of judged intensity of language. Both he (Bowers, 1963) and Carmichael and Cronkhite (1965) have found some very slight tendency (not reaching conventional levels of significance) for the more intense speech to produce less attitude change. One study suggests that the use of metaphors may be a special case of language intensity in this regard. Bowers and Osborn (1966) find that highly metaphorical speech, which is judged to constitute a more intense style, produces more attitude change. Possibly metaphor constitutes a special type of intensity in this regard because, as Aristotle and Cicero suggested, it increases the perceived intelligence of the speaker. If so, the mechanism involved in the metaphorical effect might be perceived source competence rather than perceived intent to persuade. It seems reasonable, if not quite compelling, to assume that an informal extemporaneous-seeming presentation would not arouse suspiciousness of intent to persuade quite as saliently as would a more polished and organized presentation. Hence, one would predict that the source employing this latter, more polished style will be perceived more suspiciously and will be less efficacious in producing opinion change. However, a number of counteracting processes would also seem to be operative in connection with this variable.
The more polished style would also be likely to produce a greater comprehension of the message content and would tend to raise the perceived competence of the speaker (Sharp and McClung, 1966). Addington (1965) found no difference in opinion change impact as a function of how many mispronunciations had been introduced into the speech. Miller and Hewgill (1964) found that other inelegancies of speech, such as pauses, did produce a lower perceived competence of the source but did not affect his perceived trustworthiness. In this area of research, differences in the mediating processes
(such as the perception of source characteristics and message comprehension) which were produced by the stylistic variables did not seem to eventuate in any impressive amount of opinion change differentials.
Experimental Setting as a Factor Arousing Suspiciousness
In this section we turn from intrinsic communication variables (such as source and message factors) to a consideration of how extrinsic factors deriving from the experimental setting might affect suspiciousness and, consequently, attitude change. We shall consider such variables as the clarity with which the situation is depicted as a psychological experiment, the use of an attitude pretest which might arouse suspiciousness that one’s persuasibility is being investigated, and the introduction of explicit warnings that the experiment deals with persuasibility. It is in this area that we have the most clear-cut examples of how a procedure or variable which initially attracts attention purely for methodological reasons begins to gain theoretical interest in its own right.
1. Revealing the experimental content. It seems likely that the subject will become more suspicious of the persuasive intent of the messages presented in attitude change research if we reveal to him that he is taking part in an experiment. McGinnies and Donelson (1962) had their subjects read messages to other subjects which advocated a negative attitude toward ecclesiastical matters. It was revealed to half of these subjects that their own attitudes were under investigation. They found some slight evidence that this revelation did reduce the persuasive impact of the message for initially opposed subjects but only in some sub-groups. On the other hand, Silverman (1968) found greater compliance with the message in situations that were clearly designated to the subject as psychological experiments, in keeping with the “demand character” notion considered in more detail in Orne’s chapter in this volume. This interpretation receives further support from the fact that this greater conformity in the revelation condition occurred to a greater extent with subjects who had to identify themselves and with female subjects.
Further evidence on this point is given by studies of the effect of ‘‘debriefing.’’ Deliberate deception in experiments gives rise to the felt necessity on the part of most experimenters who use deception to employ also a ‘‘debriefing’’ or ‘‘catharsis’’ treatment at the end of their experiment. During this final procedure, the true purposes of the experiment are explained to the subject, and the deceptions employed are pointed out to him, along with the reasons why they were employed. We shall return to the ethical considerations in the final section of this chapter; here, we shall focus on the theoretical aspects. There has long been some concern that participation in deception experiments and going through these debriefing procedures produces suspicious, experiment-wise persons who are unsuited to serve as subjects in subsequent experiments because this acquired sophistication will cause them to behave in a way unrepresentative of the more naive population to whom the results are to be generalized. It does seem plausible that the revelation during prior debriefing about the deception used in the earlier experiment will make the subject suspicious about what is going on in subsequent experiments and hence harder to persuade. However, the results to date provide little substantiation for this reasonable concern. Fillenbaum (1966) finds that the performance of the ‘‘faithful’’ subject who has been exposed to prior deceptions yields results little different from those of the more naive subjects. Indeed, though both previously deceived and naive subjects included a sizable number of suspicious persons, suspiciousness did not seem to affect their experimental
performance in any important way. Brock and Becker (1966) find that prior participation in a deception experiment with debriefing produces surprisingly little effect on performance in a subsequent test experiment, even when it follows immediately afterward. Only when the test experiment and the prior debriefing experiment were made ostentatiously similar was performance found to be affected in a substantial way. This work is quite reassuring (or disappointing, depending on one’s initial attitude) regarding the possible contaminating effect of the subject’s suspiciousness of the experimenter’s true purpose in the experiment. Not only do manipulations which seem quite likely to arouse subjects’ suspicion fail to produce any noticeable change in the obtained relationship, but even when one does internal analyses separately for suspicious and non-suspicious subjects, the two groups yield surprisingly similar relationships about the hypothesis in question. In my own research on attitude change, where I usually represent the persuasive communications as part of a test of reading comprehension, we rather routinely introduce near the end of the experiment a questionnaire of some subtlety designed to detect any suspicions regarding the true nature of the experiment. Subjects can then be partitioned on the basis of their responses to high and low suspiciousness of persuasive intent. In many experiments we have analyzed the data separately for the sub-group of subjects who seem to indicate at least a moderately good grasp of the true nature of the experiment, which tends to include about 15% of the total sample. So far, we have never found significant differences between suspicious and non-suspicious subjects as regards the effects of any of the important variables. 
Hence, we have never had to face the anguishing decision as to whether or not we should eliminate from our experiments a particularly suspicious subject, which, as we indicated in a previous section of this paper, is an inadequate methodological solution for the problem and also tends to raise more problems of generalizability than it resolves. Judging from the uniformity of our own results, we suspect that many other researchers have had the same reassuring experience when they performed a similar internal analysis. 2. Pretests as a suspicion arouser. The subject’s suspiciousness that we are investigating his persuasibility in a disguised attitude change experiment seems more likely to arise if we employ a pretest than if we use an after-only design, particularly when the pretest involves an undisguised opinionnaire administered just prior to the persuasive messages. We face here a classical question of experimental design involving the efficiency of ‘‘before-after’’ versus ‘‘after-only’’ designs (Hovland, Lumsdaine, and Sheffield, 1949) and the inclusion of control groups (Solomon, 1949). The current state of this question is reviewed in detail in Lana’s chapter of this volume on ‘‘Pretest Sensitization.’’ To oversimplify somewhat the conclusion to be drawn from the pretest experimentation as regards the current suspiciousness issue, it seems to us that there is evidence from this work of a rather slight depressing effect of using a pretest, as one would expect on the basis of the straightforward suspiciousness hypothesis that the pretest arouses suspicion and therefore reduces the amount of opinion change induced. There are, however, a few experiments in which a test actually enhances the main effect of the manipulation, as one might predict on the basis of a ‘‘demand character’’ interpretation (as discussed more fully in Orne’s chapter of this volume). And quite frequently, the pretest is found to produce no main effect at all. 
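The difference between a pretest main effect and a pretest-by-variable interaction in a two-by-two design can be made concrete with a small sketch. The cell means below are invented for illustration, and the second factor (labelled here as high versus low source credibility) is a hypothetical stand-in for whatever independent variable an experiment manipulates; none of these numbers come from the studies cited.

```python
def factorial_effects(cells):
    """Main effects and interaction from four cell means.

    cells maps (pretested, high_level) -> mean opinion change, where
    high_level marks the higher level of the independent variable.
    """
    effect_pretested = cells[(True, True)] - cells[(True, False)]
    effect_unpretested = cells[(False, True)] - cells[(False, False)]
    pretest_main = (
        (cells[(True, True)] + cells[(True, False)])
        - (cells[(False, True)] + cells[(False, False)])
    ) / 2
    variable_main = (effect_pretested + effect_unpretested) / 2
    interaction = effect_pretested - effect_unpretested
    return {
        "pretest_main": pretest_main,
        "variable_main": variable_main,
        "interaction": interaction,
    }


# Hypothetical case A: the pretest lowers change uniformly (main effect only).
additive = factorial_effects({
    (False, False): 1.0, (False, True): 3.0,
    (True, False): 0.5, (True, True): 2.5,
})

# Hypothetical case B: the pretest also shrinks the variable's effect.
interactive = factorial_effects({
    (False, False): 1.0, (False, True): 3.0,
    (True, False): 0.5, (True, True): 1.5,
})
print(additive)
print(interactive)
```

In case A the independent variable produces the same two-point difference with or without the pretest, so conclusions about it generalize to unpretested subjects. In case B the difference is halved under pretesting, which is the pattern that would complicate generalization.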
Even if we do tentatively accept the working hypothesis that pretests arouse suspicion, which then slightly decreases the main effect of our independent variables, the methodological anguish that such a main effect should provoke can be quite low. We have indicated elsewhere (McGuire, 1966, 1968) that a serious problem of interpretation would occur only if we find that the pretest interacts with our main independent variable. Studies
in which there is an interaction between this design feature and the independent variable are exceedingly rare. Hence, it seems likely that we will be misled, at most, by failing to detect some relationships because of use of a pretest, rather than being misled into finding the ‘‘wrong’’ kind of relationship that would not be generalizable to a more naive, unpretested population. 3. Effects of forewarning of persuasive intent. The most straightforward procedure for investigating the effect of suspiciousness on the amount of opinion change produced would seem to be designing an experiment with a well-disguised attitude change induction, and then explicitly stating the persuasive intent of the communications to half the subjects, while the remainder of the subjects are given a quite different plausible explanation of the materials to be read. We would then expect, in line with the orthodox suspiciousness notion, that the informed subjects made aware of the persuasive intent of the messages would come forearmed and premotivated to resist the predesignated belief-discrepant communications. Rather slight support is given to this common sense notion in several experiments. On the basis of internal analyses within some subsets of subjects, Allyn and Festinger (1961) report that teenage subjects were more influenced by an anti-driving speech when they were led to believe it was being presented to them to study how well they could judge the personality of the speaker rather than to assess their opinions. However, the results from the total subject sample did not confirm this orthodox prediction at the conventional level of significance. McGuire and Papageorgis (1962) found that a forewarning of an impending persuasive attack on certain cultural truisms accepted by the subjects indirectly strengthened their resistance to these attacks when they were given an opportunity to study defensive material in advance. 
However, the forewarning itself did not directly enhance the resistance to the persuasive message when no defensive material intervened. Hastorf and Piper (1951) also failed to detect any resistance to suggestion produced by explicitly reminding the subjects that they had answered a pretest and should give similar answers on the post-test after receiving some normative feedback. Wright (1966) likewise finds no significant across-conditions superiority of a direct over an indirect influence attempt, though he finds some suggestion that when coming from a liked partner, the indirect message is somewhat more effective. Brehm’s (1966) reactance theory leads to the notion that the subject will tend to respond to a source’s attempt at persuasion with a “boomerang” response when the attempt is too blatant. Perhaps the study that comes nearest to indicating any strong resistance-conferral effect of warning is that by Freedman and Sears (1965), and even their results seem somewhat dependent upon the time parameters, and the same might be said of the Kiesler and Kiesler (1964) study. This body of research on the effects of explicit warning of persuasive intent has been frustratingly elusive as regards its implications. There does seem to be a relationship begging to be found, and yet it seems to be hiding out in only certain cells of our experimental design. That it so seldom shows up as an across-condition significant effect suggests that, while in some cells the warning reduces the persuasive impact, under other conditions the warning enhances impact. The source of such powerful interactions may be found in the demand character of the experiment and in the attractiveness of the source, as the research which we will discuss below seems to indicate.
In general, the results of these many lines of research considered in this section are not particularly alarming as regards the possible artifactual nature of results obtained under conditions that might make the subject suspicious of the experimenter’s intent. It might be, of course, that some of the experimental variables did not actually manipulate in any dramatic way the degree of suspiciousness. However, we have seen a number of cases in which the independent variable did seem to manipulate suspiciousness to a considerable extent, and still no overall main effect in terms of differential opinion change eventuated. In a few cases, there was a diminution of communication effectiveness after suspicion was aroused; in the vast
number of experiments, no overall significant difference occurred as a function of suspiciousness; and in a few experiments, arousing suspiciousness actually increased the amount of change. Furthermore, such effects of warning as have been found tend to be main effects which are annoying rather than misleading. Evidence for the more worrisome interaction effects is almost nonexistent. As regards the main effect, where there is evidence of enhanced resistance, it is still unclear what is the mechanism by which suspiciousness reduces attitude change. Does it operate by giving the person a chance to marshal his defenses; or by making it more difficult for him to yield to the outside influence without suffering more loss of self-esteem than he is willing to countenance; or by some other mechanism? Furthermore, we have been forced to suspect that under a number of conditions suspiciousness of intent actually enhances persuasive impact of the message. Again, the mechanism question arises. Such a result could be obtained in various ways, for example by clarifying the demand character of the experiment, or by the channeling of his ingratiation or cooperative motives, etc. In the following section we shall turn to a consideration of what are some of the theoretical formulations that seem called for.
Theoretical Housings of the Suspiciousness Variable
In the previous section we looked somewhat askance at the suspiciousness variable, regarding it somewhat as a poor relation whose advent spelled trouble. In this section we shall look at the suspiciousness variable in a more positive way, asking what interesting processes it might involve and what opportunities for theoretical elaboration and refinement it might offer. We shall first consider some matters of definition to clarify the question regarding just what the subject is supposed to be suspicious about in order for the hypothesized effect to occur, and what areas of behavior suspiciousness is supposed to affect. After dealing with the definitional problem, we shall turn to a consideration of the various mediating factors which seem possibly to be involved with the suspiciousness variable, and which could result in either enhancing or diminishing the persuasive impact of experimental messages. We shall then consider much more briefly some of the temporal considerations and individual difference factors that seem involved in the suspiciousness effects.
The Problem of Definition
That in the previous section we considered as many as eleven rather separate lines of research purportedly giving rise to suspiciousness of experimenter’s intent should lead us to expect that this suspiciousness variable is not a completely homogeneous concept. Hence, some conceptual clarification seems called for here if the results of the suspiciousness variable are not to be unnecessarily confusing. First we shall consider the question of what the person is supposed to be suspicious about in order that the predicted effect occurs. We shall then point out some needed distinctions regarding the several different dependent variables that suspiciousness purportedly affects.
1. Suspiciousness of what?
In the introductory section of this chapter, we pointed out that the suspiciousness variable is practically coterminous with the awareness variable, and hence arises pervasively over the whole range of psychological research. To ask about the effect of ‘‘suspiciousness of experimenter’s intent’’ is to ask what is the effect of awareness of what is going on in the experiment. We pointed out that currently in
psychology this awareness issue has arisen particularly in regard to the work on verbal conditioning and in the area of attitude change and that we would confine our discussion to the latter. Even within the narrow realm of attitude change experiments, several further distinctions are useful to avoid unnecessary confusions. For example, in the experiments involving forewarning of persuasive intent, Papageorgis’s (1967) work indicates we should distinguish between situations in which the person is simply warned that the (unspecified) communication which he is about to hear is designed to persuade him as compared with situations in which he is also warned regarding the precise issue and the side which is to be developed by the message. Still another distinction which seems necessary to facilitate generalization of laboratory results to the real world is the distinction between being aware that the communicator is trying to persuade oneself and being aware that one’s persuasibility is being studied. For example, the former obtains in most naturalistic situations to which we would want to generalize our laboratory results on attitude change, in that the person is at least preconsciously aware that the material with which he is being presented was designed to influence his beliefs and behavior. For example, the average audience being exposed to an advertising presentation, a political speech, a disputation with a friend, etc., is more than a little suspicious that the material with which he is being presented is designed to influence him. Hence, when in the laboratory we strain our intellectual and moral resources in order to design some elaborate deception which will hide from the subject the persuasive nature of the material, we are paradoxically making it more difficult to generalize to the naturalistic situation, even though the researcher frequently justifies the deception as necessary for extrapolation to the real world. 
Are we then making a peculiar logical error in calling suspicious laboratory situations artifactual, rather than regarding situations in which the subject’s suspicions are allayed by deceptions as the artifactual ones? Behind our conventional thinking in this area, there seems to lie the assumption that it is particularly essential to prevent the subject’s becoming aware that his persuasibility is being studied, since this awareness would seriously affect his behavior and it is not operative in the naturalistic situation. Hence, to make his attitude change behavior more comparable between laboratory and naturalistic setting, we try by deception to divert his suspiciousness into some other channel. This strategy represents a peculiar and devious compromise. In the natural setting, the person is suspicious of the persuasive intent of the communication that is being presented to him, but he is not at all suspicious that his reactions to it are being studied. To achieve comparability in the laboratory, we design the situation so that the subject suspects neither that the material presented was designed to persuade him nor that his persuasive reaction is being studied. Without the deception, he would be suspicious both that the material was designed for persuasive purposes and that his own persuasibility is being measured. It can be seen that both the deception experiment and the undisguised experiment deviate from the naturalistic situation in one crucial manner. Perhaps, some reexamination is necessary as to whether the typical twofold deception experiment is any closer to the naturalistic situation to which we wish to generalize than is the somewhat more tolerable (intellectually and morally) fully undisguised experiment. Even more clearly, the situation seems to call for the study of each of these dimensions of awareness separately and in combination, rather than choosing one or the other for exclusive study or, worse, confounding the two. 
So far we have seen that there are several levels of awareness: Is the subject aware that the material was designed to persuade him, is he aware of the issue and side on which it will argue, and is he aware that his attitudinal or behavioral response to the message is being evaluated? There is an even higher level of awareness of persuasive intent, since we can ask further whether the subject is aware of the particular hypothesis being investigated. For example, the experiment might be designed to test the hypothesis that there is a nonmonotonic, inverted U-shaped relationship between fear arousal and the persuasive
impact of the message. The subject could be aware of all the points so far discussed (for example, that the experiment involves persuasion, that it deals with the advocacy of automobile seatbelts, and that the extent to which he is influenced by the material presented to him will be measured) and yet he might be quite unaware of the particular hypothesis about fear appeals. Hence we could produce still higher degrees of awareness of the experimenter’s intent by making differentially clear to him the independent variable in the experiment, its hypothesized relationships to the dependent variable, and the level of the independent variable to which he himself is being exposed. Research results have been unclear about the effect of suspiciousness in general, and also about the differential effects of suspiciousness of these different aspects of the experiment, two deficiencies that are probably interrelated.
2. Clarification of the dependent variables. Since suspiciousness of persuasive intent constitutes a mediating variable in most of the theorizing into which it enters, we must be concerned with “checking our manipulations” as well as with measuring our dependent variable. Thus, if we are testing how communications with implicit versus explicit conclusions affect opinion change via the mediation of suspiciousness of persuasive intent, we must not only measure the dependent variable of opinion change but also have some direct measure of the purported mediating suspiciousness. In some of the research discussed in the previous section, where suspiciousness of persuasive intent was indeed theorized to be operative, there was such a check on the purported mediator. However, in many of the studies cited, suspiciousness of persuasive intent arose as a possible artifact suggested by later commentators and in these cases there usually was no such direct measure of this purported process.
Where the predicted relationship does hold between the antecedent manipulation and the amount of opinion change, but the direct measure of the suspiciousness does not show any difference as a function of the manipulation, the doubt is raised regarding whether this process does indeed enter into the relationship. However, we might alternatively wonder if our measure of this mediator, usually a self-report instrument devised without too much consideration (one tends to worry less about constructing this incidental ‘‘check’’ than about measuring the dependent variable) is indeed adequate to pick up fluctuations in the suspiciousness. Where this suspicion mediator is found to vary in the appropriate direction, the question remains whether this variation is adequate in amount to account for the obtained difference on the dependent variable of opinion change. A covariance analysis could test whether the relationship between the antecedent manipulation and opinion change remains significant, even when we adjust for the variance due to suspiciousness. Here again we would probably draw any conclusions only tentatively, since it is unlikely that we would have any considerable confidence in the quantitative precision of our measuring instrument for suspiciousness. At least three quite different dependent variables have been used in testing how suspiciousness of persuasive intent affects the subject’s persuasibility. In some studies (McGuire and Millman, 1965; Papageorgis, 1967) the dependent variable has been the direct impact on opinions of the warning of persuasive intent, even before the persuasive communications are actually presented. In most studies the dependent variable is the effect of the warning on the persuasive impact of the message when it is actually presented (Allyn and Festinger, 1961; Freedman and Sears, 1965). 
Still other studies (McGuire and Papageorgis, 1962) have investigated the extent to which the suspicion of impending persuasive attack enhances the immunizing efficacy of a prior defense presented before the forewarned attack occurs. These three dependent variables would not be expected to yield exactly the same relationships, and so ignoring the distinction among them is likely to lead to some confusion. More important, analyzed in conjunction, they can help considerably in clarifying the processes involved, since the several mechanisms
Book One – Artifact in Behavioral Research
associated with suspiciousness affect these different dependent variables in somewhat different ways, allowing us to tease out and evaluate the several factors involved.
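The covariance check described above (does the manipulation still relate to opinion change once suspiciousness is adjusted for?) can be sketched in a few lines. This is only an illustration under loud assumptions: the eight data points are invented, the names `manipulation`, `suspicion`, `opinion_change`, and `ols` are hypothetical, and the data are built so that suspiciousness carries the whole effect. The sketch compares the two coefficients only and does not compute the significance test the text mentions.

```python
def ols(rows, y):
    """Least-squares coefficients via the normal equations (Gauss-Jordan)."""
    k = len(rows[0])
    # Augmented matrix [X'X | X'y]
    a = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * yi for r, yi in zip(rows, y))]
        for i in range(k)
    ]
    for col in range(k):
        # Partial pivoting, then normalize the pivot row
        pivot = max(range(col, k), key=lambda idx: abs(a[idx][col]))
        a[col], a[pivot] = a[pivot], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        # Eliminate this column from every other row
        for other in range(k):
            if other != col:
                f = a[other][col]
                a[other] = [vo - f * vc for vo, vc in zip(a[other], a[col])]
    return [a[i][k] for i in range(k)]


# Invented data (assumption): 0 = implicit conclusion, 1 = explicit conclusion
manipulation = [0, 0, 0, 0, 1, 1, 1, 1]
# Invented self-report suspiciousness scores, higher in the explicit condition
suspicion = [1, 2, 3, 2, 3, 4, 5, 4]
# Opinion change constructed to depend on suspiciousness alone
opinion_change = [1 + 2 * s for s in suspicion]

# Effect of the manipulation ignoring the mediator
unadjusted = ols([[1, m] for m in manipulation], opinion_change)[1]
# Effect of the manipulation with suspiciousness as a covariate
adjusted = ols(
    [[1, m, s] for m, s in zip(manipulation, suspicion)], opinion_change
)[1]

print(f"unadjusted effect: {unadjusted:.2f}")
print(f"adjusted effect:   {adjusted:.2f}")
```

With these made-up numbers the unadjusted coefficient is 4 and the adjusted one is essentially 0, the pattern one would expect if suspiciousness fully accounted for the relationship. With a noisy self-report measure of suspiciousness, as the text cautions, the adjusted coefficient would be far harder to interpret.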
Possible Mechanisms for Suspiciousness Effects

Suspiciousness of persuasive intent could have an effect on the amount of opinion change produced via any of a number of mechanisms. The operation of some of these would enhance the persuasive impact while others should mitigate it. Still others seem able to operate in either direction. We shall first consider three mechanisms associated with suspiciousness that are likely to enhance the person’s resistance to persuasion. One of these is that suspiciousness of impending attack should motivate the person to absorb and generate defensive arguments for his own position. A second such factor is that he would, having been warned of an impending attack, be more likely to rehearse actively his defense. A third consideration is that a forewarning would constitute something of a challenge to his self-esteem to demonstrate his ability to stand up for his own beliefs. Fourth, fifth, sixth, and seventh considerations suggest that suspiciousness may well have the opposite effect of enhancing the persuasiveness of the message when it comes. Assuming that the subject was responding to some kind of perceived demand to go along with whatever the experiment entails, making him aware of its persuasive intent would tend to increase the amount of opinion change he would show. If he was trying to ingratiate himself for some reason with the source, any awareness of the persuasive intent should have a similar enhancing effect. Also, since the main obstacle to persuasive effect is often the subject’s failure to perceive accurately the point of the message, being made aware of its purpose should enhance its impact on him. Finally the awareness of persuasive intent brings home to the person that there exists the source, often a person of some status, who holds a view opposite to his own and this would generate conformity pressures even prior to the communication. 
Two other mechanisms which may be involved and whose operation is more ambiguous are set and distraction. Either one of these could be produced by suspiciousness of persuasive intent, and either one could operate by enhancing or diminishing the persuasive impact of the message. In the sections that follow we shall consider each of these nine possible associated mechanisms in turn. 1. Suspiciousness as motivating preparatory defense. McGuire and Papageorgis (1961, 1962) postulated that people tend to underestimate the vulnerability of their beliefs (at least those which they perceive as cultural truisms) and are little motivated spontaneously to develop a defense or even to absorb effectively the bolstering arguments that are presented to them. In a series of studies (McGuire, 1964) it has been demonstrated in accord with this motivational deficit notion that the prior presentation of various kinds of threats to the belief is efficacious in making beliefs more resistant to subsequent strong attacks. However, the type of threat most relevant to the present discussion, forewarning that the forthcoming communication will constitute a persuasive attack on the given belief, is efficacious in enhancing resistance only if presented in conjunction with belief-bolstering material, indicating that both motivation and help in developing a defense must be supplied. The belief bolstering material plus the suspicion arousing threat was more efficacious than the belief bolstering material alone (McGuire and Papageorgis, 1962).
Suspiciousness of Experimenter’s Intent
2. Defense rehearsal consequent on forewarning. Another possible source of resistance to persuasion occasioned by a suspicion of impending attack is that such a forewarning increases the likelihood that the believer will rehearse his belief defenses and thus be better prepared to refute the suspected attack when it comes. That a rehearsal opportunity is important in the resistance-conferring effect of promoting suspiciousness of impending attack is suggested by the studies varying the temporal interval between the threat and the actual arrival of the attack. Freedman and Sears (1965) have shown that a forewarning of impending attack is more efficacious if it comes ten rather than two minutes prior to the attack. McGuire (1962, 1964) has demonstrated that prior mention of weakened attacking arguments or the requirement of active participation in defending one’s beliefs has an accumulative effect over time, for a period of several days at least, in conferring resistance. There is some suggestion that the rehearsal factor which would produce a delayed reaction resistance effect in the case of active participation (McGuire, 1964) occurs also as regards the persistence of opinion change (Watts, 1967). 3. Suspiciousness as enhancing one’s personal commitment to one’s opinion. It was argued by McGuire and Millman (1965) that making the believer suspicious that a forthcoming communication constitutes an attack on his belief tends to engage his self-esteem more explicitly in his response to the communication. The notion here is that people tend to behave so as to maintain their self-esteem and that in our society there are many situations in which yielding to a persuasive communication would be damaging to one’s self-regard, for example, when the issue is a matter of taste or when the source is disreputable, or when one is clearly committed publicly (or at least in one’s own mind) to one’s initial position. 
Since suspiciousness that the communication is designed to persuade puts one’s self-esteem on the line, it might be predicted that the person who is made suspicious will be resistant to the attack when it comes. Actually, the McGuire–Millman (1965) study was designed to test a hypothesis about a different mode of coping with self-esteem needs in the face of an impending persuasive attack. They predicted (and found) that forewarned subjects actually lowered their beliefs on matters of taste on which they had not explicitly committed themselves, in advance of a suspected attack. In this case the forewarning actually weakened the belief, our interpretation being that the believer spontaneously moves his belief in the direction of the impending attack so that he can tell himself afterward that he felt the same way all the time, rather than was influenced by the persuasive message. It should be noted, however, that while under the conditions of the McGuire–Millman study (suspicion of an impending attack weakening the belief), the situation could have been designed so that self-esteem considerations would have produced greater resistance to the attack. 4. Suspiciousness and message perception. We have been stressing here and elsewhere that in most persuasion situations, in the laboratory and in the natural environment as well, we do not confront an audience attentively alert and resolute to resist our arguments, a notion that seems to be the point of departure for more than a little theorizing about persuasibility. Rather, the audience tends to be rather apathetic with little felt need to resist such arguments as get to them but not much inclined to pay attention to the message either. According to our analysis of the situation, the ineffectiveness of persuasive communication more often derives from poor message reception than from unyieldingness to such part of it as is received by the audience. 
Insofar as this conceptualization has general validity, awareness of the persuasive intent of the message would actually augment its opinion change impact. Defining prior to message reception just what the communication is designed to achieve in the way of attitude change could be looked upon as an introductory summary that
facilitates message reception (Hovland, Lumsdaine, Sheffield, 1949). Theorists are becoming increasingly aware that persuasive communication situations are looked upon by the audience more as a problem to be solved than as an intrusion on their autonomy to be resisted (Bauer, 1966). 5. Suspiciousness as clarifying demand character. Since Orne in Chapter 5 of this volume discusses the role of ‘‘demand character’’ in determining the outcome of psychological experiments, we need discuss this matter only briefly here, as it bears on the suspiciousness of persuasive intent issue. The usual psychological subject is a fairly cooperative individual. Sometimes he comes to our laboratory voluntarily (giving rise to problems that are considered more fully by Rosenthal and Rosnow in Chapter 3 of this volume), but even when he comes simply to earn a fee or to fill a course requirement, he tends to enter the experiment in a fairly compliant mood. We would venture a guess that perhaps nine out of every ten subjects in psychological experiments would prefer to help rather than hinder the experimenter. The experimenter, being associated with the university faculty tends to be a fairly benevolent and prestigeful figure to the college students who constitute the majority of our subjects. The student population, perhaps even more than the general population at large, is made up of reasonable, well disposed individuals who value research and are disposed to ‘‘help’’ the experimenter according to their lights in the conduct of this experimentation. Hence any indication in the experimental situation which arouses the subject’s suspiciousness of the persuasive intent of the communication would tend to enhance the amount of opinion change produced. The cooperative subject is likely to assume that if the experimenter presents him with a persuasive message he intends that the audience be persuaded by it. 
The effect of such enhanced compliance on the part of suspicious subjects responding to what they perceive as the demand character of the experiment would be a main effect, such that opinion change would be enhanced across most experimental conditions. Occasionally, we might use experimental conditions such that the suspiciousness would lead the subject to cooperate in some way other than increased compliance. Such situations are more worrisome, since then the suspiciousness would tend to interact with our main independent variable rather than simply to add a constant to the persuasive impact across conditions. An earlier distinction which we made regarding what the subject is suspicious of is relevant here. If the subject is merely suspicious that the intent of the communication is to persuade him, the result should be simply to add a constant to the amount of change produced. If however he is suspicious that a certain hypothesis is being tested, the effect is more worrisome, since he might be complying with what he feels is demanded of him in a way that would make the result difficult to generalize to the population at large which is not responding to any such sophisticated demand character. 6. Suspiciousness and source attractiveness and power. Many theorists have pointed out that in the laboratory and in the natural environment people behave in accord with an ‘‘exchange’’ theory such that, if one person conforms to the other’s persuasive communication, the other incurs an obligation to do a reciprocal favor for the first person. An implication of this exchange theory in the present context is that increasing the subject’s suspiciousness that the persuasive communication is designed to persuade him will, under specifiable conditions of source valence, increase rather than diminish its persuasive impact. Two relevant lines of current work come to mind in this connection. 
The ingratiation work by Jones (1964) would indicate that when an inferior is confronted by the demands of a more powerful source (a set of conditions frequently operative in the laboratory as well as in natural persuasion situations) he can by judicious compliance on selected issues build up ‘‘credit’’ with
the power figure which could serve him well later. Hence, where the subject is inclined to use conformity as an ingratiation tactic, arousing his suspiciousness of persuasive intent will only increase his attitude change. While history may have demonstrated that for controlling the minds and behavior of man, it is better to be feared than loved, love also wins the way to the hearts and minds of men. Mills and Aronson (1965) have demonstrated that a communicator who makes clear his desire to influence the subject’s opinion is more persuasive than one who does not so arouse suspiciousness of persuasive intent, but only when this source is attractive. Where the communicator was unattractive, suspiciousness had little effect on the amount of change produced. In a subsequent study Mills (1967) finds that suspiciousness of persuasive intent enhances the opinion change impact with an attractive source and diminishes it with an unattractive source. The psychodynamics involved here indicate again that it is naive to assume that suspiciousness will routinely result in diminished effectiveness. 7. Suspiciousness and the communication of consensus. McGuire and Millman (1965) explained the anticipatory belief-lowering effect of an announcement of an impending attack on one’s belief as due to a self-esteem preserving tactic. Specifically, one moved one’s belief in the direction of the suspected influence prior to the communication so that one would not have to admit to having been influenced by it. This anticipatory belief lowering following the announcement of an impending persuasive attack has been replicated in other laboratories. However, Papageorgis (1967) has demonstrated that this self-esteem explanation may be superfluous. He has demonstrated that the “anticipatory” belief lowering occurs after simply announcing that the other person holds the divergent belief, even when there is no implication that he is about to present the subject with a persuasive communication. 
Whether there is an additional impact via the self-esteem mechanism when the subject is also told that this other person is about to present him with a persuasive communication remains to be tested. Some suggestion that the self-esteem explanation may also be responsible for the effect is given by the interaction with type of issue which was obtained in McGuire and Millman (1965). 8. Suspicion as establishing einstellung. Arousing the person’s suspicion of the persuasive nature of an impending communication should induce in him a preparatory set that would influence the way in which he perceives the message when it comes and hence its impact on his belief system. Elsewhere (McGuire, 1966), we have considered the evidence for and against this einstellung hypothesis that has been contributed by research on the primacy-recency issue in attitude change research. The implication of this formulation in the present instance is rather ambiguous. Given that suspiciousness of persuasive intent establishes an expectation as to what the content of the message will be, it is hard to predict whether this preparatory set will result in assimilation or contrast in the person’s perception of the content when it actually comes. The Sherif–Hovland (Sherif and Hovland, 1961; Sherif, Sherif and Nebergall, 1965) formulation would suggest that assimilation tends to occur (along with increased opinion change) when the message is close to the subject’s own position; while when the message is more discrepant, the contrast effect (and lessened opinion change) results. The appropriate prediction is even more difficult to make in the present case since we are dealing, not with the subject’s own position as the reference point, but with his suspicion-aroused opinion of where the message will be. 
There is some weak evidence (Ewing, 1942) that a subject who suspects that he is about to hear a quite discrepant communication would tend to perceive the given message as more discrepant from his own position than actually it was. On the other hand, there seems to be an overall tendency in human perception to distort information toward, rather than away
from, one’s own position as a secular trend which is imposed across the operation of other distortion tendencies. 9. Suspiciousness and distraction. Allyn and Festinger (1961) manipulated suspiciousness of persuasive intent by disguising the communication as a test of the subject’s ability to judge the speaker’s personality in one condition, while in the other condition its persuasive intent was revealed. Subsequently, Festinger and Maccoby (1964) argued that the crucial factor here was not the suspiciousness aroused by the revelation of persuasive intent but rather the distraction produced by the personality judgment task. (Since the original effect was quite slight by conventional statistical standards, a wise commentator has aptly written of it with amused patience that, ‘‘seldom has so slight an effect been made to bear so heavy a burden of explanation.’’) In the later study, Festinger and Maccoby (1964) report that when their audience was distracted from the persuasive sound track by an irrelevant amusing film, it showed more attitude change than when they watched a film appropriate to the sound track. Freedman and Sears (1965), however, find little evidence for the distraction effect over and above the effect of warning. McGuire (1966) has conjectured that such effect of the film as may have obtained in the Festinger and Maccoby study was perhaps produced by the pleasant hedonic feeling resulting from its entertaining nature, rather than by the distraction’s lowering of the audiences’ defenses, as Festinger and Maccoby conjectured. That a given message is more persuasive if the audience hears it in a pleasant mood has been demonstrated by Janis, Kaye, and Kirschner (1965) and by Dabbs and Janis (1965). McGuire (1966) has argued further that it would be surprising if distraction did indeed enhance the persuasive impact of a message. 
The prediction of such an enhancement rests on the notion that the audience tends to be waiting and ready to defend themselves against the coming onslaught unless they are distracted so that it reaches them with their defenses down. As we have mentioned at several points in this chapter and elsewhere, we entertain the contrary notion that audiences are typically apathetic, disinclined to attend to the message sufficiently for it to affect them, but disinclined also to resist such of its arguments as reach them. Since what is in shortest supply is motivation and ability to comprehend the message content sufficiently to be affected by it, it seems to us that the distraction would rather weaken than enhance its persuasive impact. Perhaps some resolution of this difference of opinion is found in the work by Rosenblatt (1966; Rosenblatt and Hicks, 1966) which suggests a nonmonotonic relationship between distraction and persuasive effectiveness, with maximum impact occurring with a moderate amount of distraction. He also finds some confounding between distraction and the subject’s suspiciousness. We anticipate that the current movement in psychology toward the concept of the organism as an information processing machine will probably sustain this line of research for some time to come. Indeed, we would venture the prediction that this information-processing theme which is developing in psychology will cause more attention to be paid to the reception mediator, as opposed to the yielding mediator, in determining the relationship of such.
Temporal Considerations Regarding Suspiciousness

The relationship of suspiciousness of persuasive intent to amount of opinion change seems highly dependent on time parameters. We shall review the results of research on the effects of varying the interval between the warning and the attack and also the interval between the attack and the measurements of opinion change effect.
1. The warning-attack interval. A number of studies have indicated that the warning is effective only if it precedes the actual attack. Thus, McGuire (1964) shows that a forewarning of an impending attack increases the immunizing efficacy of a prior defense if it is presented before the defense, but that it has little or no efficacy if the warning comes after the defense. Kiesler and Kiesler (1964) report that an attribution designed to arouse suspiciousness of persuasive intent is effective in reducing the impact of the message if it is presented at the beginning but not if it comes at the end of that message. Greenberg and Miller (1966) find in three replications that if a source is identified as a person of low credibility prior to the presentation of the message, the message has less persuasiveness than when the source is not identified; but there is no retroactive effect such that the low credibility attribution diminishes the impact when it occurs only after the message has been presented. There is further evidence that the warning must not only precede rather than follow the message, but that it should precede the message by some finite time period in order to exhibit its maximum effectiveness. Thus, McGuire (1964) has shown that the resistance conferral produced by having the believer participate in a worrisome active defense shows up more fully against an attack which comes a week later than an attack which follows the defense immediately. He has also shown (McGuire, 1962) that a “refutational” defense which mentions some threatening counterarguments develops its immunizing efficacy increasingly for several days subsequent to its initial presentation. Hence, resistance is greater to an attack that follows this threatening defense by two days than to one which follows it immediately. Freedman and Sears (1965) found that a warning was more efficacious in reducing opinion change if it preceded the attacking message by ten, rather than two, minutes.
2. Interval between attack and measurement of effect. Both the temporal effect just discussed and the one to which we turn here assume an inertia in the cognitive apparatus, such that effects produced by experimental intervention or by persuasive communications in the natural environment manifest themselves only gradually over time. Hence, an immediate post manipulation measure might indicate relationships rather different from those revealed by delayed measures of effect. Above, we conjectured that suspicion of persuasive intent does produce strain in the person but the effect of this strain becomes manifest only as the person has sufficient time and ingenuity to act on the induced motivation. The same gradualism considerations have something of a reciprocal effect when we consider the dampening effect of suspiciousness on opinion change when it is allowed time to be operative. We have in mind that the suspiciousness at the time of the message reception constitutes the “discounting cue” in the Hovland sleeper-effect formulation (Hovland and Weiss, 1951; Kelman and Hovland, 1953). If this analysis is correct, we would expect that suspiciousness of persuasive intent would reduce the immediate opinion change impact but as time passes, allowing the association between the discounting suspiciousness cue and the convincing message content to weaken, the full impact of the persuasive message would begin to manifest itself. Related delayed action effects in persuasion have been more fully discussed elsewhere (McGuire, 1968).
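The sleeper-effect expectation just described can be made concrete with a toy calculation. Everything in it is an assumption introduced for illustration: the exponential weakening of the cue-message association, the function name `net_opinion_change`, and all parameter values are hypothetical and are not taken from the studies cited. It shows only the qualitative shape, with measured change depressed immediately after the message and approaching the full message impact as the discounting cue dissociates.

```python
def net_opinion_change(days, message_impact=1.0, discount=0.6,
                       half_life_days=7.0):
    """Toy model: opinion change observed `days` after the message.

    message_impact: change the message would produce with no suspicion
    discount:       fraction of that impact suppressed at day 0 by the
                    discounting (suspiciousness) cue
    half_life_days: assumed time for the cue-message association to
                    weaken by half (exponential form is an assumption)
    """
    remaining_discount = discount * 0.5 ** (days / half_life_days)
    return message_impact * (1.0 - remaining_discount)


# Measured change at several hypothetical delays after the message
for delay in (0, 7, 14, 28):
    print(delay, round(net_opinion_change(delay), 3))
```

Under these assumed values the immediate measure shows 40 percent of the message's full impact, and delayed measures rise toward the full impact, which is why an immediate posttest and a delayed posttest could suggest different relationships.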
Individual Differences in Suspiciousness

It often happens in the history of the psychological research on any issue that after it has been studied as an across-subject variable for a certain period, attention is turned to individual differences in its manifestation. First individual differences are investigated as they moderate the effect of the variable and then the investigation begins to focus on interaction between the variable and the personality or other individual
difference characteristics. Since we started our discussion in this chapter with an overview of the career of an artifact through the stages of ignorance, coping, and exploitation, it is only appropriate that we conclude our discussion of the substantive issues raised by the suspiciousness of persuasive intent variable with a discussion of individual differences in the operation of this factor. It seems rather evident that people will vary as regards the extent to which our manipulation arouses their suspiciousness. These individual differences involve both ability and motivational variables. For example, quite early in the research on overt versus covert conclusion drawing in the message (which we considered above), attention was turned to the possible role of audience intelligence in moderating any such effect (Cooper and Dinerman, 1951; Hovland and Mandell, 1952; Thistlethwaite, deHaan, and Kamenetzky, 1955) with rather weak evidence for any such ability interaction. More positive evidence in line with the suspiciousness notion was given by the World War II findings (Hovland, Lumsdaine, and Sheffield, 1949) that a message which mentioned the opposition arguments was more effective than one which ignored them if we consider the more intelligent army personnel rather than the less intelligent. Besides the question of individual differences in responsiveness to a deliberate manipulation of suspiciousness, there is the question of idiosyncrasies in the spontaneous arousal of suspiciousness in experimental situations, a topic that has begun to receive attention from Stricker and his colleagues (Stricker, 1967; Stricker, Messick, and Jackson, 1966). They find considerable situational variance in the amount of suspiciousness aroused but an appreciable degree of across-situational generality of suspiciousness for individuals. 
In their studies males show more suspiciousness than females and they also report some tendency for males to show a positive relationship between need for approval and suspiciousness. This would seem to reverse the finding by Rosenthal, Kohn, Greenfield, and Carota (1966) that subjects scoring high on social approval show less awareness of the response-reinforcement contingencies in verbal conditioning situations. However, the resolution may reside in a distinction between suspiciousness and willingness to report suspiciousness. With the focusing of interest on individual difference characteristics as they interact with suspiciousness-arousing manipulations, the latter variable has achieved full status as a respectable psychological issue in its own right, rather than as an artifact to be overcome. The next step should be the demonstration that the effects produced by this erstwhile artifact should themselves be attributed to an artifact yet to be discovered.
Deception and Suspiciousness: The Ethical Dimension

Without experimenter deception, the issue of suspiciousness would never arise. Hence, the methodological and theoretical problems raised by the suspiciousness issue imply that there is already an ethical problem. Kelman (1965, 1967) particularly has called attention to the ethical problems involved specifically in experimenters’ use of deception. The present time seems to be one of rising ethical anxieties among many involved in behavioral science research and, perhaps even more, in lay observers of this research. Some would feel that as unsavory as the use of deception is, there is even greater cause for moral concern in other practices in behavioral
science research such as invasion of privacy, harmful manipulation, etc. Paradoxically, deception is sometimes employed to circumvent these more serious concerns, as when Milgram (1963, 1965) deceived the subject into thinking that he was hurting another person rather than allowing him actually to do so. The argument that worse things are done in behavioral science than deceiving subjects offers cold comfort to the researcher suffering moral qualms. Hence, we shall confine ourselves in the present chapter to discussing deception apart from other ethical issues, since it is the one intrinsically involved in the question of suspiciousness of persuasive intent. It seems undeniable that there is some moral cost in the use of deception in experimental situations. Perhaps few can feel distaste for willful deception more than we scientists, for whom the discovery of truth constitutes the basic moral imperative which lies at the core of our vocation. Most of us feel at least a slight moral revulsion, aesthetic strain, and embarrassment when deceiving a subject in order to create an experimental situation, even when we feel that our deception is in the service of the discovery of a higher and more lasting truth. Even our more crass fellow researchers, who sometimes act as if they enjoy and relish every experimental deception they ever practiced, do show a sign of healthy moral unease in their sharing our compulsion to remove the deception by a suitable debriefing or catharsis explanation at the end of the experiment. The almost universal use of such a postexperimental revelation of the deceptions is particularly impressive evidence of the felt ethical concern, since such a revelation introduces another source of artifact about which researchers worry, namely, the communication of the true purpose of the experiment by earlier participants to later ones (Zemack and Rokeach, 1966). 
The data therefore become contaminated with the suspiciousness artifact which our deception was used to avoid. Some bases for feeling this moral unease over deceiving one’s fellow man, even in an experimental situation in the service of truth, have been discussed more fully by Kelman (1965, 1967). We shall simply state here our opinion that the experimenter who denies he feels any moral qualms about the use of deception in experiments is deceiving himself. While we emphatically insist that the use of deception does involve a moral cost, we equally emphatically insist that it might be necessary to pay this cost and continue to use deception rather than to cease our research. We must first admit here that our notion of ethics involves quantitative considerations, a stand which some of our more absolutistic fellow intellectuals might regard as vulgar. We deny that a practice which involves a moral cost must ipso facto be avoided. Admittedly with more resignation than enthusiasm, we are willing to employ a cost-utility analysis in the ethical evaluation of our behavioral alternatives, as our economist colleagues are applying it to the problems of administrative decision making and program budgeting. We are willing to admit that we are arguing that the signal for stopping a practice is not the discovery that it has a moral cost, but that it has a greater moral cost relative to its moral utility than have other available courses of behavior. It seems to us that the alternatives to using deception in our experiments are to find some way of pursuing the line of research without the use of deception or of giving up the line of research. Let us consider each of these in turn. It has been argued that certain research cannot be done unless the subject is deceived as regards the experimental purpose. 
In an earlier section of this chapter we exhibited an undisguised description of such an experiment to indicate the patent absurdity of taking the subject fully into the experimenter’s confidence and expecting to find generalizable
results. Some might argue that we can disguise the intent of our experiment without the use of deception. For example, we might provide no information to the subject regarding the true purpose of the experiment. The use of active deception is not simply hiding the true intent but providing false information so as to mislead the subject into suspecting another purpose. This practice probably originated in the realization that many subjects will find it psychologically necessary to generate some explanation for what is involved in the experiment in which they are participating. If they are not deceived by a plausible alternative explanation provided by the experimenter, they will derive their own explanation (which may be correct or incorrect as regards the actual purpose, but in either case might equally contaminate the results in unknown ways and reduce their generalizability). Even if the experimenter not only withholds explanation but explicitly requires the subjects not to try to figure out what the experiment is about, many, even with the best of good will, might be unable to restrain their conjecturing about its purpose. Hence, simply not revealing to the subject the purpose of the experiment will perhaps not be as effective in eliciting generalizable results as will actively deceiving him by presenting him with a false purpose. Furthermore, leaving the subject in ignorance or allowing him to deceive himself as to the purpose of the experiment might be felt by some moralists to itself present an ethical problem and to skirt perilously on the fringe of active deception. A diametrically opposed method of carrying on the research without the use of deception is to be blatantly outspoken about the nature of the experiment, as regards various levels discussed in a previous section, and enlisting the subject as an active collaborator in the investigation. 
Such a procedure usually involves some form of role-playing, such that the subject is in effect told what the experiment is about and then asked to adopt the role of subject by behaving as he feels a subject who is actually in the situation would probably behave. The role-playing procedure has been used to good effect by Rosenberg and Abelson (1960) and Kelman (1967), and the role-playing is not necessarily linked to a full disclosure of the experimental purpose. An interesting variant is the “observer” procedure used in Bem’s (1967) radical behavioral technique. In a role-playing procedure, instead of deceiving the subject into thinking that he is lying to a fellow student in return for a one dollar or a 20 dollar “bribe” and then testing him to see how much he believes his own lie, the subject can be asked to imagine that he is telling the lie for one dollar or for 20 dollars and then asked to indicate how much he would probably believe the lie if he had actually told it under the several conditions. This role-playing procedure is particularly attractive in that it avoids not only deception but some of the other ethically troublesome procedures such as involving the subject in psychologically harmful acts. Despite this moral attractiveness of the role-playing procedure, and even though some of the recent studies in this line of work indicate that similar results are obtained from the role-players and “observers” as from actual subjects, we have little confidence that this role-playing procedure will constitute a final solution that will eliminate the deception problem from the psychologist’s list of woes. We feel intuitively that over the wide range of psychological problems, this “public opinion polling” approach of having the quasi-subjects tell us how the experiment would probably come out had we done it will prove quite limited. Still, until its limits are explored it seems a feasible line of research to pursue.
We expect that the success of Kelman and Bem in their lines of research will encourage other investigators to take up the exploration.
Should all attempts to circumvent the deception problem and still continue the research fail, there remains the alternative of ceasing the research altogether. While we feel that considerable effort is worthwhile in order to carry on our research without deception, we ourselves value our research sufficiently so that, rather than give it up altogether, we would think it worthwhile to pay the moral cost of deceiving subjects as to the nature of the experiment, provided we explain to the subject at the end of the experiment the various deceptions to which he was exposed and our reasons for utilizing them. In general, we listen with little enthusiasm to the argument that research that cannot be done without deception should be given up altogether. The alternative of giving up a line of research is one that too many of our colleagues have found too easy to take for us to entertain such behavior on the part of our still-working colleagues with any enthusiasm. For psychologists who are actually engaged in research, we regard the solution of ceasing work the least attractive of the alternatives open to them. On the contrary, we feel that the most besetting moral evil in the psychological community today is indolence. Were we to list the moral problems of psychology, we would cite those who are doing experiments which involve deception far below those who are doing too few experiments or none at all as a source of ethical concern. The besetting offense that we find in the psychological profession, as in so many other sectors of the middle class, is not malfeasance but nonfeasance. It seems to us that the angel of death is likely to come upon more of our colleagues in idleness than in sin. We hope that methodological and moral concern over the problem of deception and subject’s suspiciousness will not be used to add to the ranks of the self-unemployed.
References

Addington, D. W. Effect of mispronunciations on general speaking effectiveness. Speech Monographs, 1965, 32, 159–163.
Allyn, Jane and Festinger, L. The effectiveness of unanticipated persuasive communication. Journal of Abnormal and Social Psychology, 1961, 62, 35–40.
Anderson, N. H. and Hovland, C. I. The representation of order effect in communication research. In C. I. Hovland (Ed.) The order of presentation in persuasion. New Haven: Yale University Press, 1957, 158–169.
Aronson, E., Turner, Judith, and Carlsmith, M. Communicator credibility and communicator discrepancy as determinants of opinion change. Journal of Abnormal and Social Psychology, 1963, 67, 31–36.
Bauer, R. A. A revised model of source effect. Presidential address of the Division of Consumer Psychology, American Psychological Association Annual Meeting, Chicago, Ill., Sept. 1965.
Bem, D. Self-perception: an alternative interpretation of cognitive dissonance phenomena. Psychological Review, 1967, 74, 183–200.
Bergin, A. E. The effect of dissonant persuasive communications on changes in a self-referring attitude. Journal of Personality, 1962, 30, 423–436.
Bowers, J. W. Language intensity, social introversion and attitude change. Speech Monographs, 1963, 30, 345–352.
Bowers, J. W. Some correlates of language intensity. Quarterly Journal of Speech, 1964, 50, 415–420.
Bowers, J. M. and Osborn, M. M. Attitudinal effects of selected types of concluding metaphors in persuasive speech. Speech Monographs, 1966, 33, 147–155.
Brehm, J. Reactance theory. New York: Academic Press, 1966.
Brock, T. C. and Becker, L. A. Ineffectiveness of “overheard” counterpropaganda. Journal of Personality and Social Psychology, 1965, 2, 654–660.
Brock, T. C. and Becker, L. A. “Debriefing” and susceptibility to subsequent experimental manipulation. Journal of Experimental Social Psychology, 1966, 2, 314–323.
Carmichael, C. W. and Cronkhite, G. L. Frustration and language intensity. Speech Monographs, 1965, 32, 107–111.
Cohen, R. A. Need for cognition and order of communication as a determinant of opinion change. In Hovland, C. I. (Ed.) Order of presentation in persuasion. New Haven: Yale University Press, 1957, 79–97.
Cooper, Eunice and Dinerman, Helen. Analysis of the film “Don’t Be A Sucker”: a study of communication. Public Opinion Quarterly, 1951, 15, 243–264.
Couch, A. and Keniston, K. Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 1960, 60, 151–174.
Cronbach, L. J. An experimental comparison of the multiple true-false and the multiple-choice tests. Journal of Educational Psychology, 1941, 32, 533–543.
Cronbach, L. J. Studies of acquiescence as a factor in true-false tests. Journal of Educational Psychology, 1942, 33, 401–415.
Cronbach, L. J. Response sets and test validity. Educational and Psychological Measurement, 1946, 6, 475–494.
Cronbach, L. J. Further evidence on response sets and test design. Educational and Psychological Measurement, 1950, 10, 3–31.
Crowne, D. P. and Marlowe, D. The approval motive. New York: Wiley, 1964.
Dabbs, J. M. and Janis, I. L. Why does eating while reading facilitate opinion change? An experimental inquiry. Journal of Experimental Social Psychology, 1965, 1, 133–144.
Ellis, R. S. Validity of personality questionnaires. Psychological Bulletin, 1946, 43, 385–440.
Ewing, R. A study of certain factors involved in changes of opinion. Journal of Social Psychology, 1942, 16, 63–88.
Festinger, L. and Maccoby, N. On resistance to persuasive communications. Journal of Abnormal and Social Psychology, 1964, 68, 359–366.
Fillenbaum, S. Prior deception and subsequent experimental performance: the “faithful” subject. Journal of Personality and Social Psychology, 1966, 4, 537.
Fisher, S. and Lubin, A. Distance as a determinant of influence in a two-person serial interaction situation. Journal of Abnormal and Social Psychology, 1958, 56, 230–238.
Freedman, J. L. Involvement, discrepancy, and opinion change. Journal of Abnormal and Social Psychology, 1964, 69, 290–295.
Gough, H. G. Simulated patterns on the MMPI. Journal of Abnormal and Social Psychology, 1947, 42, 215–225.
Greenberg, B. L. and Miller, G. R. The effect of low credibility source on message acceptance. Speech Monographs, 1966, 33, 127–136.
Greenwald, H. The involvement-discrepancy controversy in persuasion research. Unpublished Ph. D. dissertation, Columbia University, New York, 1964.
Hastorf, A. H. and Piper, G. W. A note on the effect of explicit instructions on prestige suggestion. Journal of Social Psychology, 1951, 33, 289–293.
Hovland, C. I. Summary and implications. In Hovland, C. I. (Ed.) Order of presentation in persuasion. New Haven: Yale University Press, 1957, 129–157.
Hovland, C. I., Harvey, O. J., and Sherif, M. Assimilation and contrast effects in communication and attitude change. Journal of Abnormal and Social Psychology, 1957, 55, 242–252.
Hovland, C. I., Janis, I. L., and Kelley, H. H. Communication and Persuasion. New Haven: Yale University Press, 1953.
Hovland, C. I., Lumsdaine, A. A., and Sheffield, F. D. Experiments on mass communications. Princeton, N.J.: Princeton University Press, 1949.
Hovland, C. I. and Mandell, W. An experimental comparison of conclusion-drawing by the communicator and by the audience. Journal of Abnormal and Social Psychology, 1952, 41, 581–588.
Hovland, C. I. and Weiss, W. The influence of source credibility on communication effectiveness. Public Opinion Quarterly, 1951, 15, 635–650.
Insko, C. A., Murashima, F., and Saiyadain, M. Communicator discrepancy, stimulus ambiguity and difference. Journal of Personality, 1966, 34, 262–274.
Irwin, J. V. and Brockhaus, H. H. The “teletalk project”: a study of the effectiveness of two public relations speeches. Speech Monographs, 1963, 30, 359–368.
Janis, I. L., Kaye, D., and Kirschner, P. Facilitating effects of “eating-while-reading” on responsiveness to persuasive communications. Journal of Personality and Social Psychology, 1965, 1, 181–186.
Janis, I. L. and King, B. T. The influence of role-playing on opinion change. Journal of Abnormal and Social Psychology, 1954, 49, 211–218.
Jones, E. E. Ingratiation. New York: Appleton-Century-Crofts, 1964.
Kelly, E. L., Miles, Catherine C., and Terman, L. Ability to influence one’s score on a typical paper and pencil test of personality. Character and Personality, 1936, 4, 206–215.
Kelman, H. C. Manipulation of human behavior: an ethical dilemma for the social scientist. Journal of Social Issues, 1965, 21, 31–46.
Kelman, H. C. Human uses of human subjects, the problem of deception in social psychological experiments. Psychological Bulletin, 1967, 67, 1–11.
Kelman, H. C. and Hovland, C. I. “Reinstatement” of the communicator in delayed measurement of opinion change. Journal of Abnormal and Social Psychology, 1953, 48, 327–335.
Kiesler, C. A. and Kiesler, Sara B. Role of forewarning in persuasive communications. Journal of Abnormal and Social Psychology, 1964, 68, 547–549.
King, B. T. and Janis, I. L. Comparison of the effectiveness of improvised vs. nonimprovised role-playing in producing opinion changes. Human Relations, 1956, 9, 177–186.
Lana, R. E. Three interpretations of order effect in persuasive communications. Psychological Bulletin, 1964, 61, 314–320.
Lentz, T. F. Acquiescence as a factor in the measurement of personality. Psychological Bulletin, 1938, 35, 659.
Lorge, I. Gen-like: halo or reality? Psychological Bulletin, 1937, 34, 545–546.
Lumsdaine, A. A. and Janis, I. L. Resistance to “counterpropaganda” produced by one-sided and two-sided “propaganda” presentation. Public Opinion Quarterly, 1953, 17, 311–318.
McGinnies, E. and Donelson, Elaine. Knowledge of experimenter’s intent and attitude change under induced compliance. Dept. of Psychology, University of Maryland, 1963, (Mimeo).
McGuire, W. J. A syllogistic analysis of cognitive relationships. In C. I. Hovland and M. J. Rosenberg (Eds.) Attitude organization and change. New Haven: Yale University Press, 1960, 65–111.
McGuire, W. J. Persistence of the resistance to persuasion induced by various types of prior belief defenses. Journal of Abnormal and Social Psychology, 1962, 64, 241–248.
McGuire, W. J. Inducing resistance to persuasion: some contemporary approaches. In Berkowitz, L. (Ed.) Advances in Experimental Social Psychology. Vol. 1. New York: Academic Press, 1964, 191–229.
McGuire, W. J. Attitudes and opinions. In Farnsworth, P. Annual Review of Psychology. Vol. 17. Palo Alto, Calif.: Annual Review Press, 1966, 475–514.
McGuire, W. J. Attitudes and attitude change. In Lindzey, G. and Aronson, E. (Eds.) Handbook of social psychology. Reading, Mass.: Addison-Wesley, 1968, 136–314.
McGuire, W. J. and Millman, Susan. Anticipatory belief lowering following forewarning of a persuasive attack. Journal of Personality and Social Psychology, 1965, 2, 471–479.
McGuire, W. J. and Papageorgis, D. The relative efficacy of various types of prior belief-defense in producing immunity against persuasion. Journal of Abnormal and Social Psychology, 1961, 62, 327–337.
McGuire, W. J. and Papageorgis, D. Effectiveness of forewarning in developing resistance to persuasion. Public Opinion Quarterly, 1962, 26, 24–34.
Meehl, P. E. and Hathaway, S. R. The K– factor. Journal of Applied Psychology, 1946, 30, 525–564.
Messick, S. Dimensions of social desirability. Journal of Consulting Psychology, 1960, 24, 279–287.
Milgram, S. Behavioral study of obedience. Journal of Abnormal and Social Psychology, 1963, 67, 371–378.
Milgram, S. Liberating effects of group pressure. Journal of Personality and Social Psychology, 1965, 1, 127–134.
Miller, N. Involvement and dogmatism as inhibitors of attitude change. Journal of Experimental Social Psychology, 1965, 1, 121–132.
Miller, G. R. and Hewgill, M. A. The effects of variations in nonfluency in audience ratings of source credibility. Quarterly Journal of Speech, 1964, 50, 36–44.
Mills, J. Opinion change as a function of the communicator’s desire to influence and liking for the audience. Dept. of Psychology, University of Missouri, Columbia, Mo. 1967, (Mimeo).
Mills, J. and Aronson, E. Opinion change as a function of communicator’s attractiveness and desire to influence. Journal of Personality and Social Psychology, 1965, 1, 173–177.
Mills, J. and Jellison, J. M. Effect on opinion change of how desirable the communication is to the audience the communicator addressed. Journal of Personality and Social Psychology, 1967, 6, 98–101.
Norman, W. I. Problem of response contamination in personality assessment. Personality Laboratory, Lackland A. F. Base, Texas, ASD–TN–61–43, May, 1961.
Osgood, C. E. and Tannenbaum, P. H. The principle of congruity in the prediction of attitude change. Psychological Review, 1955, 62, 42–55.
Papageorgis, D. Anticipation of exposure to persuasive message and belief change. Journal of Personality and Social Psychology, 1967, 5, 470–496.
Rosenberg, M. J. and Abelson, R. P. An analysis of cognitive balancing. In C. I. Hovland and M. J. Rosenberg (Eds.) Attitude organization and change. New Haven: Yale University Press, 1960, 112–163.
Rosenblatt, P. C. Persuasion as a function of varying amounts of distraction. Psychonomic Science, 1966, 5, 85–86.
Rosenblatt, P. C. and Hicks, J. M. Pretesting, forewarning, and persuasion. Paper read at Midwestern Psychological Association Annual Convention, Chicago, Illinois, May, 1966.
Rosenthal, R., Kohn, P., Greenfield, Patricia M., and Carota, N. Data desirability, experimenter expectancy, and the results of psychological research. Journal of Personality and Social Psychology, 1966, 3, 20–27.
Sears, D. O. Opinion formation on controversial issues. Dept. of Psychology, University of California, L.A., June 18, 1965, (Mimeo).
Sears, D. O., Freedman, J. L., and O’Connor, E. F. The effects of anticipated debate and commitment on the polarization of audience opinion. Public Opinion Quarterly, 1964, 28, 615–627.
Sharp, H. and McClung, T. Effect of organization on the speaker’s ethics. Speech Monographs, 1966, 33, 182–183.
Sherif, M. and Hovland, C. I. Social Judgment. New Haven: Yale University Press, 1961.
Sherif, Carolyn W., Sherif, M., and Nebergall, R. E. Attitude and Attitude Change. Philadelphia: Saunders, 1965.
Silverman, I. Role-related behavior of subjects in laboratory studies in attitude change. Journal of Personality and Social Psychology, 1968, 8, 343–348.
Solomon, R. L. An extension of control group design. Psychological Bulletin, 1949, 46, 137–150.
Stotland, E., Katz, D., and Patchen, M. The reduction of prejudice through the arousal of self-thought. Journal of Personality, 1959, 27, 507–531.
Steinmetz, H. C. Measuring ability to fake occupational interest. Journal of Applied Psychology, 1932, 16, 123–130.
Stricker, L. J. The true deceiver. Psychological Bulletin, 1967, 68, 13–20.
Stricker, L. J., Messick, S. and Jackson, D. N. Suspicion of deception: implications for conformity research. Journal of Personality and Social Psychology, 1967, 5, 379–389.
Tannenbaum, P. H. The congruity principle revisited: studies in the reduction, induction and personalization of persuasion. In Berkowitz, L. (Ed.) Advances in Experimental Social Psychology. Vol. 4. New York: Academic Press, 1966.
Thistlethwaite, D. L., de Haan, H., and Kamenetzky, J. The effects of “directive” and “nondirective” communication procedures on attitudes. Journal of Abnormal and Social Psychology, 1955, 51, 107–113.
Walster, Elaine, Aronson, E., and Abrahams, D. On increasing the persuasiveness of a lower-prestige communicator. Journal of Experimental Social Psychology, 1966, 2, 325–342.
Walster, Elaine and Festinger, L. The effectiveness of “overheard” persuasive communications. Journal of Abnormal and Social Psychology, 1962, 65, 395–402.
Watts, W. A. Relative persistence of opinion change induced by action compared to passive participation. Journal of Personality and Social Psychology, 1967, 5, 4–15.
Whittaker, J. O. Parameters of social influence in the autokinetic situation. Sociometry, 1964, 27, 88–95.
Wright, P. H. Attitude change under direct and indirect interpersonal influence. Human Relations, 1966, 19, 199–211.
Zemack, R. and Rokeach, M. The pledge to secrecy: a method to assess violations. American Psychologist, 1966, 21, 612.
Zimbardo, P. G. Involvement and communication discrepancy as determinants of opinion conformity. Journal of Abnormal and Social Psychology, 1960, 60, 86–94.
3
The Volunteer Subject¹

Robert Rosenthal
Harvard University
and
Ralph L. Rosnow
Temple University
There is a long-standing fear among behavioral researchers that those human subjects who find their way into the role of “research subject” may not be entirely representative of humans in general. McNemar (1946, 333) put it wisely when he said, “The existing science of human behavior is largely the science of the behavior of sophomores.” Sophomores are convenient subjects for study, and some sophomores are more convenient than others. Sophomores enrolled in psychology courses, for example, get more than their fair share of opportunities to play the role of the research subjects whose responses provide the basis for formulations of the principles of human behavior. There are now indications that these “psychology sophomores” are not entirely representative of even sophomores in general (Hilgard, 1967), a possibility that makes McNemar’s formulation sound unduly optimistic. The existing science of human behavior may be largely the science of those sophomores who both (a) enroll in psychology courses and (b) volunteer to participate in behavioral research. The extent to which a useful, comprehensive science of human behavior can be based upon the behavior of such self-selected and investigator-selected subjects is an empirical question of considerable importance. It is a question that has received increasing attention in the last few years (e.g., London and Rosenhan, 1964; Ora, 1965; Rosenhan, 1967; Rosenthal, 1965).²
¹ Preparation of this chapter, which is an extensive revision of an earlier paper published in Human Relations (Rosenthal, 1965), was facilitated by research grants GS-714, GS-1741, and GS-1733 from the Division of Social Sciences of the National Science Foundation. We want to thank our many colleagues who helped us by sending us unpublished papers, unpublished data, and additional information of various kinds. These colleagues include Timothy Brock, Carl Edwards, John R. P. French, Donald Hayes, E. R. Hilgard, Thomas Hood, Gene Levitt, Perry London, Roberta Marmer, A. H. Maslow, Ray Mulry, Lucille Nahemow, John Ora, Jr., David Poor, David Rosenhan, Dan Schubert, Duane Schultz, Peter Suedfeld, Jay Tooley, Allan Wicker, Abraham Wolf, and Marvin Zuckerman.
² Most of the interest has been centered on the selection of human subjects, which is our concern here, but there are similar problems of the selection and representativeness of those animal subjects that find their way into behavioral research (e.g., Beach, 1950, 1960; Christie, 1951; Kavanau, 1964, 1967; Richter, 1959).
The problem of the volunteer subject has been of interest to many behavioral researchers, and evidence of their interest will be found in the pages to follow. Mathematical statisticians, those good consultants to behavioral researchers, have also interested themselves in the volunteer problem (e.g., Cochran, Mosteller, and Tukey, 1953). Because of their concern we now know a good deal about the implications for statistical procedures and statistical inference of having drawn a sample of volunteers (Bell, 1961). The concern with the volunteer problem has had for its goal the reduction of the nonrepresentativeness of volunteer samples so that investigators may increase the generality of their research results (e.g., Hyman and Sheatsley, 1954; Locke, 1954). The magnitude of the problem is not trivial. The potential biasing effects of using volunteer samples have been clearly illustrated recently. At one large university, rates of volunteering varied from 10 per cent to 100 per cent. Even within the same course, different recruiters visiting different sections of the course obtained rates of volunteering varying from 50 per cent to 100 per cent (French, 1963). At another university, rates of volunteering varied from 26 per cent to 74 per cent when the same recruiter, extending the same invitation to participate in the same experiment, solicited female volunteers from different floors of the same dormitory (Marmer, 1967). Some reduction of the volunteer sampling bias may be expected from the fairly common practice of requiring psychology undergraduates to spend a certain number of hours serving as research subjects. Such a requirement gets more students into the overall sampling urn, but without making their participation in any given experiment a randomly determined event. Students required to serve as research subjects often have a choice among alternative experiments. Given such a choice, will brighter (or duller) students sign up for an experiment on learning? 
Will better (or more poorly) adjusted students sign up for an experiment on personality? Will students who view their consciousness as broader (or narrower) sign up for an experiment that promises an encounter with “psychedelicacies”? We do not know the answers to these questions very well, nor do we know whether these possible self-selection biases would make any difference in the inferences we want to draw.

If the volunteer problem has been of interest and concern in the past there is good evidence to suggest that it will become of even greater interest and concern in the future. That evidence comes from the popular press and the technical literature and it says to us: In the future you, as an investigator, may have less control than ever before over the kinds of human subjects who find their way into your research. The ethical questions of humans’ rights to privacy and to informed consent are more salient now than ever before (Bean, 1959; Clark, et al., 1967; Miller, 1966; Orlans, 1967; Rokeach, 1966; Ruebhausen and Brim, 1966; Wicker, 1968; Wolfensberger, 1967; Wolfle, 1960). One possible outcome of this unprecedented soul-searching is that the social science of the future may, due to internally and perhaps externally imposed constraints, be based upon propositions whose tenability will come only from volunteer subjects who have been made fully aware of the responses of interest to the investigator. However, even without this extreme consequence of the ethical crisis of the social sciences, we still will want to learn as much as we can about the external circumstances and the internal characteristics that bring any given individual into our sample of subjects or keep him out.

Our purpose in this chapter will be to say something of what is known about the act of volunteering and about the characteristics that may differentiate volunteers for
behavioral research from nonvolunteers. Subsequently we shall consider the implications of what we think we know for the representativeness of the findings of behavioral research and for the possible effects on the results of experiments employing human subjects.
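The self-selection concern raised in this introduction can be made concrete with a small simulation. The sketch below is not drawn from any study cited in this chapter. It assumes a hypothetical trait score and a hypothetical rule linking that score to the probability of volunteering, and it only illustrates how a sample of volunteers might differ from the population from which it was recruited. The function name, its parameters, and the volunteering rule are all illustrative assumptions.

```python
import random


def simulate_volunteer_bias(n=10_000, seed=0):
    """Compare a trait's mean among volunteers with its population mean.

    Hypothetical setup (an assumption, not an empirical finding):
    each person has a standard-normal trait score, and the chance of
    volunteering rises with that score.
    """
    rng = random.Random(seed)  # seeded so repeated runs give the same result
    population = [rng.gauss(0.0, 1.0) for _ in range(n)]

    volunteers = []
    for score in population:
        # Assumed volunteering rule: 50% baseline, shifted by the trait,
        # clipped so that everyone has between a 10% and 90% chance.
        p_volunteer = min(0.9, max(0.1, 0.5 + 0.2 * score))
        if rng.random() < p_volunteer:
            volunteers.append(score)

    pop_mean = sum(population) / len(population)
    vol_mean = sum(volunteers) / len(volunteers)
    volunteer_rate = len(volunteers) / n
    return pop_mean, vol_mean, volunteer_rate


if __name__ == "__main__":
    pop_mean, vol_mean, rate = simulate_volunteer_bias()
    print(f"population mean trait: {pop_mean:.3f}")
    print(f"volunteer mean trait:  {vol_mean:.3f}")
    print(f"volunteering rate:     {rate:.3f}")
```

Under these assumed settings the volunteers' mean lies above the population mean even though every individual had some chance of entering the sample. Whether a gap of this kind exists in a given study, and how large it is, remains an empirical question.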
The Act of Volunteering

Finding one’s way into the role of the subject is not a random event. The act of volunteering seems to be as reliable a response as the response to many widely used tests of personality. Martin and Marcuse (1958), employing several experimental situations, found reliabilities of the act of volunteering to range from .67 for a study of attitudes toward sex to .97 for a study of hypnosis. Such stability in the likelihood of volunteering raises a question as to whether there may not also be stability in the attributes associated with the likelihood of volunteering. Several relatively stable attributes that show promise of serving as predictors of volunteering will be discussed later in this chapter. In this section we shall discuss the less stable, more situational determinants of volunteering. It is no contradiction that situational determinants can be powerful even in view of the reliability of the act of volunteering. In studies of the reliability of volunteering, situational determinants tend to be relatively constant from the initial request for volunteers to the subsequent request, so that the role of situational determinants is artificially diminished. Unavailable at present, but worth collecting, are data on volunteering as a simultaneous function of personal characteristics of volunteers and situational determinants of volunteering.

Incentives to Volunteer

Not surprising is the fact that when potential subjects fear that they may be physically hurt, they are less likely to volunteer. Subjects threatened with electric shocks were less willing to volunteer for subsequent studies involving the use of shock (Staples and Walters, 1961). More surprising perhaps is the finding that an increase in the expectation of pain does not lead concomitantly to much of an increase in avoidance of participation.
In one study, for example, 78 per cent of college students volunteered to receive very weak electric shocks, while almost that many (67 per cent) volunteered to receive moderate to strong shocks (Howe, 1960). The difference between these volunteering rates is of only borderline significance (p < .15). The motives to serve science and to trust in the wisdom and authority of the experimenter (Orne, this volume), and to be favorably evaluated by the experimenter (Rosenberg, this volume), must be strong indeed to have so many people willing to tolerate so much for so little tangible reward. But perhaps in Howe’s (1960) experiment the situation was complicated by the fact that there was more tangible reward than usual. The rates of volunteering which he obtained may have been elevated by a $3.00 incentive that he offered in return for participation. The subjects who volunteered for electric shocks may also have been those for whom the $3.00 had more reward value. Volunteers showed a significantly greater (p = .001) “need for cash” than did nonvolunteers. Need for cash, however, was determined after the volunteering occurred, so it is possible that the incentive was viewed as more important by
those who had already committed themselves to participate by way of justifying their commitment to themselves. As the intensity of the plea for participation increases, more subjects are likely to agree to become involved. For an experiment in hypnosis, adding either a lecture on hypnosis or a $35 incentive increased the rate of volunteering among student nurses about equally (Levitt, Lubin, and Zuckerman, 1962). On intuitive grounds one can speculate that students should perceive $35 as more rewarding than a lecture, so possibly the student nurses in this study were responding to the heightened intensity of the request for volunteers. Perhaps the more important it seems to the subject that his participation is to the recruiter, the higher will be the rate of volunteering. That certainly seems to be the case among respondents to a mail questionnaire who, though they did not increase their participation when personalized salutations and true signatures were employed by the investigator, markedly increased their participation when special delivery letters were employed (Clausen and Ford, 1947). Consistent results were also obtained by Rosenbaum (1956), who found that a great many more subjects were willing to volunteer for an experiment on which a doctoral dissertation hung in the balance than if a more desultory request was made. Volunteering also seems to become more likely as it becomes the proper, normative, expected thing to do. If other subjects are seen by the potential volunteer as likely to consent, the probability increases that the potential volunteer also will consent to participate (Bennett, 1955; Rosenbaum, 1956; Rosenbaum and Blake, 1955). And, once the volunteer has consented, it may be that he would find it undesirable to be denied an opportunity actually to perform the expected task. 
Volunteers who were given the choice of performing a task (a) that was more pleasant but less expected or (b) one that was less pleasant but more expected, tended relatively more often to choose the latter (Aronson, Carlsmith, and Darley, 1963). Sometimes it is difficult to distinguish among appeals of increased intensity, appeals that give the impression that volunteering is very much the expected thing to do, and appeals that offer almost irresistible inducements to participation. More subjects volunteer when they get to miss a lecture as a reward, and a great many more volunteer when they get to miss an examination (Blake, Berkowitz, Bellamy, and Mouton, 1956). Being excused from an exam seems to be such a strong inducement that subjects tend to volunteer without exception even when it means that they must raise their hands in class to do it. Under conditions of less extreme incentive to volunteer, subjects seem to prefer less public modes of registering their willingness (Blake et al., 1956) unless almost everyone else in the group also seems willing to volunteer publicly (Schachter and Hall, 1952). Bennett (1955), however, found no relationship between volunteering and the public versus private modes of registering willingness to participate. Schachter and Hall (1952) have performed a double service for students of the volunteer problem. They not only have examined the conditions under which volunteering is more likely to occur but also the likelihoods that subjects recruited under various conditions will actually show up for the experiment to which they have verbally committed their time. The results are not heartening. Apparently it is just those conditions that increase the likelihood of a subject’s volunteering that increase the likelihood that he will not show up when he is supposed to. This should serve to emphasize that it is not enough even to learn who will volunteer and under what circumstances. 
We will also need to learn which people show up, as our science is
based largely on the behavior of those who do. At least in the case of personality tests there is evidence from Levitt, Lubin, and Brady (1962) to suggest that “no-shows” (i.e., volunteers who never show up) are psychologically more like nonvolunteers than they are like “shows” (i.e., volunteers who show up as scheduled).

Subject Involvement

The proposition that subjects are more likely to volunteer the more they are involved or the more they have to gain finds greater support in the literature on survey research than in the literature on laboratory experiments. Levitt, Lubin, and Zuckerman (1959), for example, found no differences between volunteers and nonvolunteers for hypnosis research in their attitudes toward hypnosis. Attitudes of their student nurse subjects were measured by responses to the “hypnotist” picture of the TAT. In contrast, Zamansky and Brightbill (1965) found that male undergraduate volunteers for hypnosis research rated the concept of “hypnosis” more favorably (p = .05) than did nonvolunteers. These same authors also found that subjects who were more susceptible to hypnotic phenomena tended to rate the concept of “hypnosis” more favorably (Brightbill and Zamansky, 1963; Zamansky and Brightbill, 1965). Subjects for hypnosis research, therefore, may select themselves not only for their view of hypnosis but also for their susceptibility to hypnosis. Direct evidence for this possibility has been presented by Boucher and Hilgard (1962). It seems reasonable to speculate that college students majoring in psychology would be more interested in behavioral research than would nonpsychology majors. In an experiment on sensory deprivation twice as many psychology majors volunteered to participate as did nonpsychology majors (Jackson and Pollard, 1966).
Among the motives given for volunteering, curiosity was listed by 50 per cent of the subjects, financial incentive ($1.25 per hour) by 21 per cent, and being of help to “Science” by a surprisingly low 7 per cent. The main reason for not volunteering, given by 80 per cent of those who did not volunteer, was that they had no time available. In support of these results are those of Rosen (1951). His request to undergraduates to take the MMPI met with greater success among students who were more favorably disposed toward psychology and behavioral research. Similarly, Ora (1966) found his volunteers for psychological research to be significantly more interested in psychology than were his nonvolunteers. The greater interest and involvement of volunteers as compared to nonvolunteers also is suggested in the work of Green (1963). He found that when subjects were interrupted during their task performance, nonvolunteers recalled fewer of the interrupted tasks than did volunteers. Presumably the volunteers’ greater involvement facilitated their recall of the tasks that they were not able to complete. It was noted earlier that it is in the literature on survey research that one finds greatest support for the involvement-volunteering relationship. Thus, the more interested a person is in radio and television programming, the more likely he is to answer questions about his listening and viewing habits (Belson, 1960; Suchman and McCandless, 1940). When questions were asked in the 1930’s about the use of radio in the classroom, it was discovered that nonresponders tended to be those who did not own radios (Stanton, 1939).
College graduates are about twice as likely to respond to a mail questionnaire as college drop-outs (Pace, 1939). Shuttleworth (1940) found that those college graduates who responded more promptly to questionnaires had an appreciably lower rate of unemployment (0.5%) than did those who were slower to respond (5.8%). Similar results have been reported by Franzen and Lazarsfeld (1945), Gaudet and Wilson (1940), and Edgerton, Britt, and Norman (1947), all of whom conclude that responders tend to be those individuals who are more interested in the topic under study. A particularly striking example of this relationship can be found in the research of Larson and Catton (1959). Questionnaires were sent to 700 members of a national organization. Of those responding to the first request only 17 per cent of the respondents were thoroughly inactive and presumably disinterested members. Of those members who did not reply even after three requests, about 70 per cent were inactive, presumably disinterested individuals. Sometimes it is not so much a matter of the general interest of the individual as it is his specific attitude toward the issue under discussion that determines whether he will be self-selected into the sample. Matthysse (1966) wrote follow-up letters to research subjects who had been exposed to pro-religious communications. He found that the subjects who replied to his letter were more often those individuals who regarded religious questions as more important, i.e., attached greater importance to the question of the existence of God. Siegman (1956), recruiting subjects for Kinsey-type interviews, found that 92 per cent of those undergraduates who volunteered to be interviewed advocated sexual freedom for women, while only 42 per cent of those who did not volunteer advocated such freedom. 
Data obtained by Benson (1946) suggested that when public policy is under discussion, respondents may be over-represented by individuals with strong feelings against the proposed policy—a kind of political protest vote. Survey literature is rich with suggestions for dealing with these potential sources of bias. One practical suggestion offered by Clausen and Ford (1947) follows directly from the work on involvement. It was discovered that a higher rate of response was obtained if, instead of one topic, a number of topics were surveyed in the same study. People seem to be more willing to answer a lot of questions if at least some of the questions are on a topic of interest to them. Another, more standard technique is the follow-up letter or follow-up phone call that reminds the subject to respond to the questionnaire. However, if the follow-up is perceived by the subject as a bothersome intrusion, then, if he responds at all, his response may reflect an intended or unintended distortion of his actual beliefs. The person who has been reminded several times to fill out the same questionnaire may not approach the task in the same way he would if he were asked only once. There is some evidence from Norman (1948) and from Wallin (1949) which suggests that an increase in the potential respondent’s degree of acquaintanceship with the investigator may lead to an increase in the likelihood of the individual’s cooperation. Similarly, an increase in the perceived status of the investigator may lead to an increase in the rate of cooperation (Norman, 1948; Poor, 1967). Increases in the acquaintanceship with the investigator and in the investigator’s status may, therefore, reduce the volunteer bias, but there is a possibility that one bias may simply be traded for other biases. 
Investigators who are better acquainted with their subjects or who have a higher perceived status may obtain data from their subjects that are different from data obtained by investigators less well-known to their subjects or lower in perceived status (Rosenthal, 1966). We may need to learn with
which biases we are more willing to live, which biases we are better able to assess, and which biases we are better able to control.

The Phenomenology of Volunteering

Responding to a mail questionnaire is undoubtedly different from volunteering for participation in a psychological experiment (Bell, 1961), yet there are likely to be phenomenological similarities. In both cases the prospective data-provider, be he “subject” or “respondent,” is asked to make a commitment of his time for the serious purposes of the data-collector. In both cases, too, there may be an explicit request for candor, and almost certainly there will be an implicit request for it. Perhaps most important, in both cases the data-provider recognizes that his participation will make the data-collector wiser about him without making him wiser about the data-collector. Within the context of the psychological experiment, Riecken (1962) has referred to this as the “one-sided distribution of information.” On the basis of this uneven distribution of information the subject-respondent is likely to feel an uneven distribution of legitimate negative evaluation.³ From the subject’s point of view, the data-collector may judge him to be maladjusted, stupid, unemployed, lower class or in possession of any one of a number of other negative characteristics. The possibility of being judged as any of these might be sufficient to prevent someone from volunteering for either surveys or experiments. The data-provider, on the other hand, can, and often does, negatively evaluate the data-collector. He can call the investigator, his task, or his questionnaire inept, stupid, banal, and irrelevant but hardly with any great feeling of confidence as regards the accuracy of this evaluation. After all, the data-collector has a plan for the use of his data, and the subject or respondent usually does not know this plan, though he is aware that a plan exists.
He is, therefore, in a poor position to evaluate the data-collector’s performance, and he is likely to know it. Riecken (1962) has postulated that one of the major aims of the subject is to “put his best foot forward.” It follows that in both survey and experimental research, the volunteer subject may be the individual who guesses that he will be evaluated favorably. Edgerton, Britt, and Norman (1947) found that contest winners were more likely than losers to respond helpfully to a follow-up questionnaire relevant to their achievement. These same authors convincingly demonstrated the consistency of their results by summarizing work which showed, for example, that (a) parents of delinquent boys are more likely to respond to questionnaires about the boys if the parents have nice things to say, (b) college professors who hold minor and temporary appointments are not so likely to reply usefully to job-related questionnaires, and (c) patrons of commercial airlines are more prompt to return questionnaires about airline usage than non-patrons. Locke (1954) found married respondents more willing than divorced respondents to be interviewed about their marital adjustment. None of these findings deny the interest hypothesis advanced by Edgerton, Britt, and Norman (1947). Indeed, additional evidence, some of which was cited earlier, can simply be interpreted as demonstrating that greater interest in a topic leads to a higher response rate. Nevertheless, on the basis of Riecken’s (1962) analysis and in light of the empirical evidence cited here, we may postulate that another major variable which contributes to the decision to volunteer is the subjective probability of subsequently being favorably evaluated by the investigator. It is trite but necessary to add that this formulation requires more direct empirical test.

³ For a full discussion of the importance to the subject of feeling evaluated by the behavioral scientist who studies him, see Chapter 7 by Milton Rosenberg.

Characteristics of Volunteers

We have discussed some of the less stable characteristics of the volunteer subject that are specifically related to the source and nature of the invitation to volunteer. Now let us consider more stable characteristics of volunteers. We shall proceed attribute by attribute. In principle it would have been desirable to perform such an analysis separately for each type of subject population investigated and for each type of experiment or survey conducted. However, the variations of outcomes of different studies of volunteer characteristics within even a given type of subject sample and within even a given area of research were sufficiently great that it seemed a prematurely precise strategy, given the state of the data.

Sex

The variations in the results of studies of volunteer characteristics are well-illustrated when the characteristic investigated is the subject’s sex. Belson (1960), Poor (1967), and Wallin (1949) reported no sex differences associated with the rate of volunteering in their survey research projects, nor did Hilgard, Weitzenhoffer, Landes, and Moore (1961), Hood (1963), London (1961), and Schachter and Hall (1952) in their experimental laboratory projects. However, for every study that does not find a relationship between volunteering and sex of the respondent, there is one or more that supports such a relationship. Table 3–1 summarizes the results of 12 such studies. Eight of the studies discovered that females volunteered more than males, while the remaining four studies found the inverse relationship. Those studies for which women are more likely to volunteer seem to have in common that they requested subjects to participate in rather standard or unspecified psychological experiments.⁴ The exceptions are the studies by Schachter and Hall (1952) and Wilson and Patterson (1965). The former study asked for volunteers for a study of interpersonal attraction and found no sex differences in rates of volunteering.
The latter study employed a vague request for volunteers to which the New Zealand male undergraduates responded more favorably than females. The experiments by Hilgard et al. (1961) and London (1961) had requested volunteers for hypnosis and neither had found any sex differences in volunteering. London did find, however, that among those subjects who were “very eager” to participate, males predominated. For the hypnosis situation London felt that women were less likely to show such eagerness because of a greater fear of loss of control. Perhaps being very eager to be hypnotized, willing to be electrically shocked (Howe, 1960), and sensation-deprived (Schultz, 1967b), and ready to answer questions about sex behavior (Siegman, 1956) reflect the somewhat greater degree of unconventionality that is more often associated in our culture with males than with females.⁵

⁴ Related to these results are those obtained by Rosen (1951) and Schubert (1964), both of whom found males more likely to volunteer for standard experiments if they showed greater femininity of interests.

⁵ Consistent with this interpretation is the finding by Wolf and Weiss (1965) that, relative to female subjects, male subjects showed the greater preference for isolation experiments.

Table 3–1 Volunteering Rates Among Males and Females

                                                        Percentage volunteering
Author                        Task                      Females     Males      Two-tail p of difference

MORE VOLUNTEERING BY FEMALES
Himelstein (1956)             Psychology experiment     65%         43%        .02
Newman (1956)                 Perception experiment     60%         39%        .02
Newman (1956)                 Personality experiment    59%         45%        .25
Ora (1966)                    Psychology experiments    66%         54%        .001
Rosnow & Rosenthal (1966)     Perception experiment     48%         13%        .02
Rosnow & Rosenthal (1967)ᵃ    Psychology experiment     27%         10%        .005
Schubert (1964)               Psychology experiment     60%         44%        .001
Wicker (1968)                 Questionnaire             56%         38%        .10

MORE VOLUNTEERING BY MALES
Howe (1960)                   Electric shock            67%         81%        .05
Schultz (1967b)               Sensory deprivation       56%         76%        .06
Siegman (1956)                Sex interview             12%         42%        .02
Wilson & Patterson (1965)     Psychology experiment     60%         86%        .005

ᵃ Unpublished data. The experiment on which these data are based is described later in the present chapter.

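
As a reading aid only, the rows of Table 3–1 can be transcribed into a small data structure and tallied. The sketch below is not part of the original analysis: the names `TABLE_3_1` and `summarize` are hypothetical, and the "gap" it reports is simply the difference in percentage points between the two columns, a descriptive figure the authors do not themselves compute.

```python
# Rows transcribed from Table 3-1:
# (author, task, % females volunteering, % males volunteering, two-tail p)
TABLE_3_1 = [
    ("Himelstein (1956)", "Psychology experiment", 65, 43, 0.02),
    ("Newman (1956)", "Perception experiment", 60, 39, 0.02),
    ("Newman (1956)", "Personality experiment", 59, 45, 0.25),
    ("Ora (1966)", "Psychology experiments", 66, 54, 0.001),
    ("Rosnow & Rosenthal (1966)", "Perception experiment", 48, 13, 0.02),
    ("Rosnow & Rosenthal (1967)", "Psychology experiment", 27, 10, 0.005),
    ("Schubert (1964)", "Psychology experiment", 60, 44, 0.001),
    ("Wicker (1968)", "Questionnaire", 56, 38, 0.10),
    ("Howe (1960)", "Electric shock", 67, 81, 0.05),
    ("Schultz (1967b)", "Sensory deprivation", 56, 76, 0.06),
    ("Siegman (1956)", "Sex interview", 12, 42, 0.02),
    ("Wilson & Patterson (1965)", "Psychology experiment", 60, 86, 0.005),
]


def summarize(rows):
    """Count the direction of each sex difference and average the gaps.

    Gaps are in percentage points and are purely descriptive; they are
    an illustrative addition, not a statistic reported in the chapter.
    """
    female_gaps = [f - m for _, _, f, m, _ in rows if f > m]
    male_gaps = [m - f for _, _, f, m, _ in rows if m > f]
    return {
        "female_higher": len(female_gaps),
        "male_higher": len(male_gaps),
        "mean_gap_female_higher": sum(female_gaps) / len(female_gaps),
        "mean_gap_male_higher": sum(male_gaps) / len(male_gaps),
    }


summary = summarize(TABLE_3_1)
print(summary)
```

The tally reproduces the split described in the text (eight studies favoring females, four favoring males). Because the studies differ in sample size, task, and population, the averaged gaps should be read as a rough description rather than as a pooled estimate.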
If one were to attempt to summarize the findings thus far, one might hypothesize that in behavioral research (a) there is a likelihood of females volunteering more than males if the task for which participation is solicited is perceived as relatively standard and (b) there is a likelihood of males volunteering more than females if the task is perceived as unusual. Some modest support for this hypothesis comes from the research of Martin and Marcuse (1958). Volunteers were solicited for four experiments—one in learning, a second on personality, a third involving hypnosis, and a fourth for research on attitudes toward sex. Female volunteers were overrepresented in the first three experiments, those which could be described as relatively more standard. Male volunteers were overrepresented in the sex study. The joint effects on volunteering rates of subjects’ sex and nature of the task for which participation is solicited are probably complicated by other variables. For example, Coffin (1941) long ago cautioned about the complicating effects of the investigator’s sex, and one may wonder along with Coffin and Martin and Marcuse (1958), about the differential effects on volunteer rates among male and female subjects of being confronted with a male versus a female Kinsey interviewer as well as the differential effects on eagerness to be hypnotized of being confronted with a male versus a female hypnotist. Our interest in volunteers is based on the fact that only they can provide us with the data the nonvolunteers have refused us. But not all volunteers, it usually turns out, provide us with the data we need. To varying degrees in different studies there will be those volunteers who fail to keep their experimental appointment. 
These “no-shows” have been referred to as “pseudovolunteers” by Levitt, Lubin, and Brady (1962) who showed that on a variety of personality measures pseudovolunteers are less like volunteers, and more like the nonvolunteers who never agreed to come in the first place. Other studies also have examined the characteristics of experimental subjects who fail to keep their appointments. Frey and Becker (1958) found no sex
differences between subjects who notified the investigator that they would be absent versus those who did not notify him. Though these results argue against a sex difference in pseudovolunteering, it should be noted that the entire experimental sample was composed of extreme scorers on a test of introversion-extraversion. Furthermore, no comparison was given of either group of no-shows with the parent population from which the samples were drawn. Leipold and James (1962) also compared the characteristics of shows and no-shows among a random sample of introductory psychology students who had been requested to serve in an experiment in order to satisfy a course requirement. Again, no sex differences were found. Interestingly enough, however, about half of Frey and Becker’s no-shows notified the experimenter that they would be absent while only one of Leipold and James’s 39 no-shows so demeaned himself. Finally, there is the more recent study by Wicker (1968), in which it was possible to compare the rates of pseudovolunteering by male and female subjects for a questionnaire study. These results also yielded no sex differences. Hence, three studies out of three suggest that failing to provide the investigator with data promised him probably is no more apt to be the province of males than of females.

Birth Order

Stemming from the work of Schachter (1959) there has been increasing interest shown in birth order as a useful independent variable in behavioral research (Altus, 1966; Warren, 1966). A number of studies have attempted to shed light on the question of whether firstborns or only-children are more likely than laterborns to volunteer for behavioral research. But for all the studies conducted there are only a few that suggest a difference in volunteering rates among first- and laterborns to be significant at even the .10 level. It is suggestive, however, that all of these studies found the firstborn to be overrepresented among the volunteering subjects.
Capra and Dittes (1962) found that among their Yale University undergraduates 36 per cent of the firstborns, but only 18 per cent of the laterborns, volunteered for an experiment requiring cooperation in a small group. Varela (1964) found that among Uruguayan male and female high school students 70 per cent of the firstborns, but only 44 per cent of the laterborns, volunteered for a small group experiment similar to that of Capra and Dittes. Altus (1966) reported that firstborn males were overrepresented relative to laterborn males when subjects were asked to volunteer for testing. Altus obtained similar results when the subjects were female undergraduates, but in that case the difference in volunteering rates was not statistically significant. Suedfeld (1964) recruited subjects for an experiment in sensory deprivation and found that 79 per cent of those who appeared were firstborns while only 21 per cent were laterborns. Unfortunately we do not know for this sample of undergraduates what proportion of those who did not appear were firstborn. It seems rather unlikely, however, that the base rate for primogeniture would approach 79 per cent; Altus (1966) was unable to find a higher proportion than 66 per cent for any college population. No differences in rates of volunteering by first- versus laterborns were found in studies by Lubin, Brady, and Levitt (1962b), Myers, Murphy, Smith, and Goffard (1966), Poor (1967), Rosnow and Rosenthal (1967, unpublished data), Schultz (1967a), Ward (1964), Wilson and Patterson (1965), and Zuckerman, Schultz, and
Hopkins (1967). In these studies, in which no p reached even the .10 level, not even the trends were suggestive. In several of the studies relating birth order to volunteering the focus was less on whether a subject would volunteer and more on the type of experiment for which he would volunteer. That was the case in a study by Brock and Becker (1965) who found no differences between first- and laterborns in their choices of individual or group experiments. Studies reported by Weiss, Wolf, and Wiltsey (1963) and by Wolf and Weiss (1965) suggest that preference for participation in group experiments by firstborn versus laterborn subjects may depend on the recruitment method. When a ranking of preferences was employed, firstborns more often volunteered for a group experiment. However, when a simple yes-no technique was employed, firstborns volunteered relatively less for group than for individual or isolation experiments. Thus, most of the studies show no significant relationship between birth order and volunteering. However, in those few studies where there is a significant relationship, the results suggest that it is the firstborn or only child who is more likely to volunteer. This finding might be expected on the basis of work by Schachter (1959) suggesting the greater sociability of the firstborn. It is this variable of sociability to which we now turn our attention.

Sociability

Using as subjects male and female college freshmen, Schubert (1964) observed that volunteers (n = 562) for a “psychological experiment” scored higher in sociability on the Social Participation Scale of the MMPI than nonvolunteers (n = 443). A similar positive relationship between sociability and volunteering has been reported by others.
Martin and Marcuse (1957, 1958) found that female volunteers for an experiment in hypnosis measured higher in sociability on the Bernreuter than female nonvolunteers.⁶ London, Cooper, and Johnson (1962) found a tendency for their more serious volunteers to be somewhat more sociable than those less serious about serving science, with sociability here being defined by the California Psychological Inventory, the 16 PF, and MMPI. Thus, it would appear that volunteers, especially females, are higher in sociability than nonvolunteers. The relationship in fact, however, is not always this simple. Although Lubin, Brady, and Levitt (1962a) observed that student nurses who volunteered for hypnosis scored higher than nonvolunteers on a Rorschach content dependency measure (a finding which is consistent with those above), it also was observed that volunteers were significantly less friendly as defined by the Guilford-Zimmerman. On intuitive grounds the latter finding would appear to be inconsistent with the simple, positive sociability-volunteering relationship. One might expect a positive relationship between sociability and dependency, but certainly not a negative relationship between sociability and friendliness. Despite the confusion, it is clear that in research on hypnosis, differences between volunteers and nonvolunteers are likely to bias results. Boucher and Hilgard (1962) have shown that subjects who are less willing to participate in hypnosis research are clearly more resistant to showing hypnotic behavior when they are conscripted for research.

⁶ Though one would expect the factor of introversion-extraversion to be related to sociability, and so might predict greater extraversion among volunteers, Martin and Marcuse found no differences in introversion-extraversion between volunteers and nonvolunteers while Ora (1966) found volunteers, and especially males, to be significantly more introverted than nonvolunteers. Another surprising finding, by Frey and Becker (1958), is relevant in so far as volunteering may be related to styles of pseudovolunteering. Among those subjects who failed to keep an appointment for an experiment in which they had previously agreed to participate, those who notified the experimenter that they would be unable to attend had lower sociability scores on the Guilford than those who failed to appear without notifying the experimenter. It is difficult to explain this somewhat paradoxical finding that presumably less thoughtful pseudovolunteers are, in fact, more sociable than their more thoughtful counterparts.

A factor that is likely to complicate the sociability-volunteering relationship is the nature of the task for which volunteering is requested. When Poor (1967) solicited volunteers for a psychological experiment, he found the volunteers to be higher in sociability than nonvolunteers on the California Psychological Inventory. However, when the task was completing a questionnaire, the return rate for the less sociable subjects tended to be higher than that for the more sociable (p < .25). If sociability can be defined on the basis of membership in a social fraternity, then other findings become relevant as well. Reuss (1943) obtained higher return rates among fraternity and sorority members (high sociability?) than among independents (lower sociability?). However, Abeles, Iscoe, and Brown (1954–55), in a study in which male undergraduates were invited by the president of their university to complete questionnaires concerning the Draft, the Korean War, college life, and vocational aspirations, found that fraternity men were significantly underrepresented in the initial sample of volunteers. (It can be noted that in a subsequent session, which followed a letter “ordering” the students to participate and then a personal phone call, fraternity men were significantly overrepresented in the volunteer sample.) If sociability can be defined in terms of verbosity, then another study becomes relevant.
In an experiment on social participation, Hayes, Meltzer, and Lundberg (1968) noted that undergraduate volunteers were more talkative than (nonvolunteer) conscripts. The relationship is confounded by the fact that verbosity, since it was observed after the volunteer request, must be considered a dependent variable. Perhaps conscription leads to moodiness and quietude. One cannot be absolutely certain that conscripts would also have been less talkative than the volunteer subjects before the experiment began. In some cases, characteristics of volunteers do tend to remain stable over appeals for participation in different types of tasks. Earlier we described the research of Lubin, Brady, and Levitt (1962a) in which student nurses were asked to volunteer for hypnosis research. In a related study of student nurses, Lubin, Levitt, and Zuckerman (1962) asked for the return of a mailed questionnaire. Volunteers in the hypnosis study were more dependent than nonvolunteers as defined by a Rorschach measure. In the questionnaire study, those who chose to respond were also more dependent than nonresponders despite the fact that a different definition of dependency was employed, viz. one based on the Edwards Personal Preference Schedule. Though there are a good many equivocal results to complicate the interpretation, and possibly even some contradictory findings, at least in the bulk of studies showing any clear difference in sociability between volunteers and nonvolunteers it would appear that volunteers tend to be the more sociable.

Approval Need

Crowne and Marlowe (1964) have elaborated the empirical and theoretical network of consequence that surrounds the construct of approval motivation. Using the Marlowe–Crowne (M–C) Scale as their measure of need for social approval, they
have shown that high scorers are more influenceable than low scorers in a variety of situations. Directly relevant to the present chapter is their finding that high scorers report a greater willingness to serve as volunteers in an excruciatingly dull task. Consistent with this finding is the observation of Leipold and James (1962) that determined male nonvolunteers tend to score lower than volunteers on the M–C. Similarly, Poor (1967) found that volunteers for an experiment, the nature of which was unspecified, scored higher in need for approval than nonvolunteers on the M–C Scale. Poor also found that subjects higher in need for approval were more likely than low need approvals to return mailed questionnaires to the investigator. For both of Poor’s samples the significance levels were unimpressive in magnitude, but impressive in consistency; both ps were .13 (two-tail). C. Edwards (1968) invited student nurses to volunteer for a hypnotic dream experiment and uncovered no difference between volunteers and nonvolunteers in their average M–C scores. This failure to replicate the findings above may have been due to differences in the type of subject solicited or perhaps to the nature of the experiment. Another, rather intriguing, finding was that the need for approval of the volunteers’ best friends was significantly higher than the need for approval of the nonvolunteers’ best friends. Also, the students’ instructors rated the volunteers as significantly more defensive than the nonvolunteers. Thus, at least in their choice of best friends and in their instructors’ judgment, though not necessarily in their own test scores, volunteers appear to show a greater need for social approval. Edwards went further in his analysis of subjects’ scores on the M–C Scale. 
He found a nonlinear trend suggesting that the volunteers were more extreme scorers on the M–C than nonvolunteers, i.e., either too high (which one might have expected) or too low (which one would not have expected). Edwards’ sample size of 37 was too small to establish the statistical significance of the suggested curvilinear relationship. However, Poor (1967), in both of the samples mentioned earlier, also found a curvilinear relationship, and in both samples the direction of curvilinearity was the same as in Edwards’ study. The more extreme M–C scorers were those more likely to volunteer. In Poor’s smaller sample of 40 subjects who were asked to volunteer for an experiment, the curvilinear relationship was not significant. However, in Poor’s larger sample of 169 subjects who were asked to return a questionnaire, the curvilinear relationship was significant at p < .0002 (two-tail). So far our definition of need for approval has depended heavily on the Marlowe–Crowne Scale, but there is evidence that other paper-and-pencil measures might well give similar results. McDavid’s (1965) research, using his own Social Reinforcement Scale, also indicated a positive relationship between approval-seeking and volunteering. Using still another measure of need for approval (Christie-Budnitzky), Hood and Back (1967) found their volunteers to score higher than their nonvolunteers. Their finding was significant for male subjects while for female subjects there was a tendency for the relationship between need for approval and volunteering to depend on the task for which volunteering was solicited. With the one exception, then, of the study by C. Edwards, who recruited a different type of subject for a different type of experiment, it would appear that volunteers tend to be higher than nonvolunteers in their need for approval. 
However, the nonmonotonic function suggested in the results of Edwards and Poor implies that the positive relationship may only hold for the upper range of the continuum. It may be subjects showing medium need for approval who will volunteer the least.
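The earlier remark that Poor's two significance levels were "unimpressive in magnitude, but impressive in consistency" can be made concrete with a small calculation. The sketch below combines the two reported ps of .13 (two-tail) using Stouffer's unweighted Z method. This is offered only as one illustrative way to quantify consistency: it assumes the two samples are independent and that both effects lie in the same direction, and nothing in the text says that Poor or the authors performed this computation. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(two_tailed_ps):
    """Combine same-direction two-tailed p values (Stouffer's unweighted Z).

    Assumption (not from the source): the samples are independent and
    every effect points in the same direction.
    """
    std_normal = NormalDist()
    # Halve each two-tailed p to get the one-tailed p in the shared
    # direction, then convert it to a standard normal deviate.
    z_scores = [std_normal.inv_cdf(1 - p / 2) for p in two_tailed_ps]
    # Sum the deviates and rescale by the square root of their count.
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    one_tailed = 1 - std_normal.cdf(combined_z)
    return combined_z, one_tailed, 2 * one_tailed


# The two ps of .13 (two-tail) reported for Poor's (1967) samples.
combined_z, p_one, p_two = stouffer_combined_p([0.13, 0.13])
print(f"combined z = {combined_z:.2f}, "
      f"one-tailed p = {p_one:.3f}, two-tailed p = {p_two:.3f}")
```

Under these assumptions the combined two-tailed p falls below .05, which is the sense in which two individually weak but consistent results can be more persuasive together than either is alone.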
Conformity
It seems almost tautological to consider the relationship between volunteering and conformity, for the act of volunteering is itself an act of conformity to some authority’s request or invitation to participate. We shall see, however, that conforming to a request to volunteer is by no means identical with, and often not even related to, other definitions of conformity. Crowne and Marlowe (1964) have summarized the evidence that subjects higher in need for approval are more likely than low need approvals to conform to the demands of an experimental task including an Asch-type situation. Since need for approval is positively related both to volunteering (at least in the upper range) and to conformity, one would expect a positive relationship between conformity in the Asch-type situation and volunteering. Foster’s (1961) findings, though not statistically significant, imply such a relationship among male subjects, but just the opposite among females. If volunteers can be characterized as conforming, then one also would expect them to be low in autonomy. Using the Edwards Personal Preference Schedule, such a finding was obtained by C. Edwards (1968) for his sample of student nurses, the volunteers also being judged by their instructors as more conforming than the nonvolunteers. However, diametrically opposite results were obtained by Newman (1956) also using the Edwards Schedule, but where the task was a perception experiment. Both male and female volunteers were significantly more autonomous than male and female nonvolunteers. When an experiment in personality was the task, no difference in autonomy was revealed between volunteers and nonvolunteers. Lubin, Levitt, and Zuckerman (1962) also employed the Edwards Personal Preference Schedule, finding that student nurses who completed and returned a questionnaire scored lower in autonomy (and in dominance) than nonrespondents. The differences were, however, not judged statistically significant. 
To further confound the array of results it must be added that Martin and Marcuse (1957) found male volunteers for a hypnosis experiment to be significantly more dominant than nonvolunteers on the Bernreuter. And, Frye and Adams (1959), also using Edwards’ scales, obtained no appreciable differences on any of the measures between male and female volunteers versus nonvolunteers. There appears to be little consistency in the relationships obtained between conformity and voluntarism. In the majority of studies no significant relationship was obtained between these variables. However, considering only those studies in which a significant relationship was obtained, one might tentatively conclude that the direction of relationship is unpredictable for female subjects but that male volunteers are probably more autonomous than male nonvolunteers.
Authoritarianism
A number of investigators have compared volunteers with nonvolunteers on the basis of several related measures of authoritarianism. Rosen (1951), using the F Scale definition of authoritarianism, found that volunteers for personality research scored lower than nonvolunteers. Newman (1956) also found volunteers to be less authoritarian on the F Scale, but his finding was complicated by the interacting effects of type of experiment and sex of subject. Thus, only when recruitment was
for an experiment in perception and only when the subjects were male was there a significant difference in authoritarianism between volunteers and nonvolunteers. When recruitment was for an experiment in personality, neither male nor female volunteers showed significantly lower authoritarianism than nonvolunteers. Poor (1967), also employing the F scale, found that mail questionnaire respondents were less authoritarian than nonrespondents. However, in soliciting volunteers for an experiment in social psychology, Poor obtained no differences in authoritarianism between volunteers and nonvolunteers. Martin and Marcuse (1957), in their study of volunteers for hypnosis research, employed the Ethnocentrism (E) Scale. Volunteers, especially males, were found to be significantly less ethnocentric than nonvolunteers. However, Schubert (1964), employing MMPI definitions of prejudice and tolerance, obtained no difference between volunteers and non-volunteers for a psychological experiment. Consistent with the general trend toward lower authoritarianism among volunteers are the results of Wallin (1949). He found that participants in survey research are politically and socially more liberal than nonparticipants. Benson, Booman, and Clark (1951) similarly found a more favorable attitude toward minority groups among people who were willing to be interviewed than among those who were not cooperative. Finally, Burchinal (1960) found that undergraduates who completed questionnaires at scheduled sessions were less authoritarian than those students who did not. The bulk of the evidence suggests that volunteers are likely to be less authoritarian than nonvolunteers. This conclusion seems most warranted for those studies in which the subjects were asked to respond either verbally or in written form to questions of a personal nature. 
In all five samples where the task was to answer such personal questions, those subjects who were less authoritarian, broadly defined, were more cooperative.
Conventionality
There is a sense in which the more authoritarian individual is also the more conventional, so that one might expect volunteers for behavioral research to be less conventional than nonvolunteers. That seems most often, but by no means always, to be the case. Thus, Wallin (1949), who found survey respondents to be less authoritarian than nonrespondents, did not find a difference in conventionality between these types. Rosen (1951), however, found that volunteers for personality research were less conventional than nonvolunteers, while C. Edwards (1968) reported the opposite finding. In the latter study, student nurses who volunteered for a hypnotic dream experiment were judged by their instructors to be more conventional than nonvolunteers. A number of studies have discovered that volunteers for Kinsey-type interviews tend, either in their sexual behavior or in their attitudes toward sex, to be more unconventional than nonvolunteers (Maslow, 1942; Maslow and Sakoda, 1952; Siegman, 1956). In order to determine whether this relative unconventionality of volunteers is specific to the Kinsey-type situation, one would need to know if these same volunteers were more likely than nonvolunteers to participate in other types of psychological research. It also would be helpful if one knew whether groups matched on the basis of sexual conventionality, but differing in other types of conventionality, exhibited different rates of volunteering for Kinsey-type interviews.
The Pd scale of the MMPI often is regarded clinically as reflecting dissatisfaction with societal conventions, and higher scorers may be regarded as less conventional than lower scorers. Both London et al. (1962) and Schubert (1964) found volunteers for different types of experiments to be less conventional by this definition, though with their army servicemen subjects, Myers, Murphy, Smith, and Goffard (1966) found volunteers for a perceptual isolation experiment to be more conventional. London et al. and Schubert further found volunteers to score higher on the F scale of the MMPI, which reflects a willingness to admit to unconventional experiences. The Lie scale of the MMPI taps primness and propriety, and high scorers may be regarded as more conventional than low scorers. Although Heilizer (1960) found no Lie scale differences between volunteers and nonvolunteers, Schubert (1964) found that volunteers scored lower. These results are not unequivocal, but in general it would appear that volunteers for behavioral research tend to be more unconventional than nonvolunteers. Six studies support this conclusion. (However, two others find no difference, and two others report the opposite relationship.) It would not be surprising if further research proved that sex differences are significant determinants of the nature of the conventionality-volunteering relationship. Recall that London et al. (1962) concluded, at least for hypnosis research, that females who volunteer may be significantly more interested in the novel and the unusual, whereas for males the relationship is less likely. A finding was noted earlier, under the heading of conformity, that may bear out London et al. That was Foster’s (1961) finding which, though not statistically significant, implied that the relationship between conformity and volunteering may be in opposite directions for males versus females. 
Arousal Seeking
On the basis of his results using over 1,000 subjects, Schubert (1964) has postulated a trait of arousal-seeking on which he found volunteers to differ from nonvolunteers. He notes that volunteers for a ‘‘psychological experiment’’ reported drinking more coffee, taking more caffeine pills, and (among males) smoking more cigarettes than nonvolunteers. All three types of behavior are related conceptually and empirically to arousal seeking. In partial support of Schubert’s results are those obtained by Ora (1966). Though he found volunteers reporting significantly greater consumption of coffee and caffeine pills than was reported by nonvolunteers, there were no differences in cigarette smoking between volunteers and nonvolunteers. However, recent unpublished data collected by Rosnow and Rosenthal (1967), which are described in greater detail later, indicate no overall, significant relationship between volunteering for a ‘‘psychological experiment’’ and either smoking or coffee drinking. In fact, among males, there is a tendency for volunteers to smoke less (p = .07) and to drink less coffee (p = .06) than nonvolunteers. The correlation between smoking and coffee drinking is +.54. Also inconsistent with Schubert’s results are the findings of Poor (1967). In both his questionnaire study and his social psychological experiment, Poor found no significant relationship between smoking and participation by his predominantly male subjects. In fact, in both studies, Poor obtained trends opposite to those of Schubert; participants smoked less and reported drinking less alcohol than nonparticipants. In addition, Myers et al. (1966) found no relationship
between smoking and volunteering for isolation experiments. On the whole, it does not appear that smokers and coffee drinkers are necessarily overrepresented among volunteers for behavioral research. Fortunately, Schubert’s construct of arousal seeking does not rely so heavily on the associated behavior of smoking and coffee drinking. He found that a variety of MMPI scales, associated with arousal seeking, discriminated significantly between volunteers and nonvolunteers. These MMPI characteristics which Schubert found associated with a greater likelihood of volunteering generally coincide with those noted by London et al. (1962). One important exception, however, is that the hypomanic (Ma) scale scores of the MMPI were found by Schubert to correlate positively with volunteering (a result of some importance to the arousal seeking hypothesis), while London et al. found a negative relationship between Ma scores and volunteering for an hypnosis experiment. This reversal weakens the generality of the arousal seeking hypothesis, which is further weakened by Rosen’s (1951) finding that female volunteers scored lower on the Ma scale than female nonvolunteers. Nevertheless, there are other data which lend support to Schubert’s hypothesis that volunteers are more arousal seeking than nonvolunteers. Riggs and Kaess (1955) observed that volunteers were characterized by more cycloid emotionality on the Guilford Scale than nonvolunteers, a result that is not inconsistent with Schubert’s finding of higher Ma scores among volunteers. Howe (1960) reports that volunteers willing to undergo electric shocks were characterized by less need to avoid shock than nonvolunteers, a finding that is not totally tautological and one that is consistent with the arousal seeking hypothesis. However, Riggs and Kaess also found that volunteers were characterized by more introversive thinking on the Guilford than nonvolunteers, a result which is not supportive of the arousal seeking hypothesis. 
Closely related to Schubert’s concept of arousal seeking is the concept of sensation-seeking discussed by Zuckerman, Schultz, and Hopkins (1967). In a number of studies Zuckerman et al. compared volunteers with nonvolunteers on a specially developed Sensation Seeking Scale (SSS). In one study female undergraduates who volunteered for sensory deprivation were found to have scored higher than nonvolunteers on the SSS. In a second study, which also employed female undergraduates, volunteers for a hypnosis experiment scored higher in sensation seeking than nonvolunteers. In a third study, male undergraduates were invited to volunteer for sensory deprivation and/or hypnosis research. Subjects who volunteered for both experiments scored highest in sensation seeking, while those who volunteered for neither task scored lowest on the SSS and also on the Ma scale of the MMPI. The correlation between scores on the SSS and the Ma scale, though statistically significant, is low enough (+.21) that one would not expect the results obtained on the basis of the Ma scale to be attributable entirely to the results obtained with the SSS. In another study in which Schultz (1967b) solicited volunteers for sensory deprivation, male volunteers obtained significantly higher scores on the SSS than male nonvolunteers, while among female subjects a less clearly significant difference in the same direction was revealed. Finally, Schultz (1967c) invited female undergraduates to volunteer for a sensory restriction experiment. On the basis of scores on the Cattell Scales, volunteers could be judged more adventurous than nonvolunteers. At least when arousal seeking is defined in terms of the Sensation Seeking Scale there appears to be substantial support for Schubert’s hypothesis that volunteers are more arousal seeking than nonvolunteers. When other tests or scales are used to
define arousal seeking the results are less consistent, though even then the hypothesis is not completely without support.
Anxiety
There is no dearth of studies comparing the more or less enduring anxiety levels of volunteers and nonvolunteers. Table 3-2 summarizes the results of 11 of those studies that could be most easily categorized as to outcome. In seven of these there appeared to be no difference between volunteers and nonvolunteers, Lubin, Brady, and Levitt (1962a) having employed the IPAT measure of anxiety, and the remaining studies using the Taylor Manifest Anxiety Scale or a close relative of that scale. The tasks for which volunteers were solicited included hypnosis, sensory deprivation, electric shock, Kinsey-type interviews, small group experiments, and an unspecified ‘‘psychological experiment.’’ On the basis of these seven studies one could certainly conclude that, at least in terms of manifest anxiety, volunteers are usually no different than nonvolunteers. However, results of the other four studies listed in Table 3-2 make it difficult to reach this simple conclusion, since these studies show significant differences at around the .05 level. Moreover, the fact that two find volunteers to be more anxious than nonvolunteers, while two others find just the opposite relationship, only complicates the attempt to summarize simply the collective results. One might be tempted to take the algebraic mean of the differences in anxiety level found between volunteers and nonvolunteers in these four studies, but that would be like averaging the temperature of one winter and one summer and concluding that there had been two springs. Furthermore, one cannot attribute the inconsistency in results to the instruments employed to measure anxiety. Scheier (1959) used the IPAT; Myers et al. (1966), Rosen (1951), and Schubert (1964) all employed the usual MMPI scales or derivatives (Depression Scale, Psychasthenia Scale, or Taylor Manifest Anxiety Scale). 
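The warning against averaging opposite results can be illustrated with a simple tally of the 11 studies as they are categorized in Table 3-2. The sketch below is only an illustration: the numeric direction codes (+1, 0, -1) are a coding convention assumed here, not effect sizes reported by any of the studies, and the variable names are hypothetical.

```python
from collections import Counter
from statistics import mean

# Outcome categories follow Table 3-2. The numeric codes are an assumed
# illustrative convention, not data from the studies themselves:
# +1 = volunteers more anxious, 0 = no difference, -1 = volunteers less anxious.
TABLE_3_2 = {
    "Rosen (1951)": +1,
    "Schubert (1964)": +1,
    "Heilizer (1960)": 0,
    "Himelstein (1956)": 0,
    "Hood and Back (1967)": 0,
    "Howe (1960)": 0,
    "Lubin, Brady, and Levitt (1962a)": 0,
    "Siegman (1956)": 0,
    "Zuckerman, Schultz, and Hopkins (1967)": 0,
    "Myers et al. (1966)": -1,
    "Scheier (1959)": -1,
}

# Count how many studies fall in each outcome category.
tally = Counter(TABLE_3_2.values())

# Averaging the direction codes lets the opposite findings cancel out.
mean_direction = mean(TABLE_3_2.values())

# The tally keeps the studies that found a difference visible.
n_with_difference = tally[+1] + tally[-1]

print(f"more anxious: {tally[+1]}, no difference: {tally[0]}, "
      f"less anxious: {tally[-1]}")
print(f"mean direction code: {mean_direction}")
print(f"studies reporting a difference: {n_with_difference} of {len(TABLE_3_2)}")
```

The mean of the direction codes is zero, which by itself would suggest that nothing is going on, while the tally shows that 4 of the 11 studies did find a difference, two in each direction. Keeping the categories separate is what invites the search for a moderating variable such as the nature of the task.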
One possibility to explain the inconsistency concerns the anxiety arousing nature of the tasks for which volunteers were solicited. The tasks for which more anxious subjects volunteered were an MMPI examination (Rosen, 1951) and participating in a ‘‘psychological experiment’’ (Schubert, 1964). The tasks for which less anxious subjects volunteered were a sensory deprivation study (Myers et al., 1966) and one that Scheier (1959) left unspecified but characterized as somewhat threatening. As a working hypothesis let us suggest that although most often there will be no difference in the level of chronic anxiety between volunteers and nonvolunteers, when such a
Table 3–2 Studies of the Anxiety Level of Volunteers versus Nonvolunteers
Volunteers more anxious    No difference                              Volunteers less anxious
Rosen (1951)               Heilizer (1960)                            Myers et al. (1966)
Schubert (1964)            Himelstein (1956)                          Scheier (1959)
                           Hood and Back (1967)
                           Howe (1960)
                           Lubin, Brady, and Levitt (1962a)
                           Siegman (1956)
                           Zuckerman, Schultz, and Hopkins (1967)
difference does occur it will be the more threatening experiment that will draw the less anxious volunteer and the ordinary experiment that will draw the more anxious volunteer. Thus, the more anxious subject worries more about the consequences of his refusing to volunteer, but only so long as the task is not itself perceived as frightening. If it is frightening, then the more anxious and fearful subject may decide that he cannot tolerate the additional anxiety that his participation would engender and so chooses not to volunteer. Some weak support for this hypothesis comes from the work of Martin and Marcuse (1958), who obtained a complicated interaction between volunteering, anxiety level, and task. No differences in anxiety level were found between volunteers and nonvolunteers of either sex when recruiting for experiments in learning or attitudes toward sex. However, when volunteering was requested for a personality experiment, both male and female volunteers were found to be more anxious than nonvolunteers. When volunteering was requested for an experiment on hypnosis, male volunteers were found to be less anxious than male nonvolunteers, a difference that was not obtained among female subjects. These results seem to parallel the results of the studies listed in Table 3-2. Most of the time no differences were found in anxiety level between volunteers and nonvolunteers. When differences were obtained, the volunteers for the more ordinary experiments were more anxious, while the (male) volunteers for the more unusual, perhaps more threatening, experiments were less anxious. The results of the Martin and Marcuse research again emphasize the importance of the variable of subject’s sex as a moderating or complicating factor in the relationship between volunteering behavior and various personal characteristics. 
Further support for the hypothesis that the more fearful the subject, the less he will volunteer for a frightening experiment comes from a study by Brady, Levitt, and Lubin (1961). Seventy-six student nurses were asked to indicate whether they were afraid of hypnosis. Two weeks later, volunteers for an experiment in hypnosis were solicited. Of those nurses who volunteered, 40 per cent had indicated at least some fear of hypnosis while among the nonvolunteers more than double that number (82 per cent) indicated such fear (p < .0002). It should be noted, however, that student nurses who volunteered did not differ in anxiety as measured by the IPAT from those who did not volunteer. Less relevant to the question of volunteering but quite relevant to the related question of who finds their way into the role of research subject is the study by Leipold and James (1962). Male and female subjects who failed to appear for a scheduled psychological experiment were compared with subjects who kept their appointments. Among the female subjects those who appeared did not differ in anxiety on the Taylor Scale from those who did not appear. However, male subjects who failed to appear (the determined nonvolunteers) were significantly more anxious than those male subjects who appeared as scheduled. These findings not only emphasize the importance of sex differences in studies of volunteer characteristics, but also that it is not enough simply to know who volunteers; even among those subjects who volunteer there are likely to be differences between those who actually show up and those who do not.
Psychopathology
We now turn our attention to variables that have been related to global definitions of psychological adjustment or pathology. Some of the variables discussed earlier have also been related to global views of adjustment, but our discussion of them was
intended to carry no special implications bearing on subjects’ adjustment. For example, when anxiety was the variable under discussion, it was not intended that more anxious subjects be regarded as more maladjusted. Indeed, within the normal range of anxiety scores found, the converse might be equally accurate. There is perhaps a score of studies relevant to the question of the psychological adjustment of volunteers versus nonvolunteers. Once again, however, the results are equivocal. About one-third of the studies suggest that volunteers are better adjusted; another third suggest the opposite; and the remainder reveal no difference in adjustment between volunteers and nonvolunteers. We begin by summarizing the studies that indicate that volunteers are psychologically more healthy than nonvolunteers. Self-esteem is usually regarded as a correlate, if not a definition, of good adjustment. Maslow (1942) and Maslow and Sakoda (1952) summarized the results of six studies of volunteering for Kinsey-type interviews dealing with respondents’ sexual behavior. In five cases, volunteers revealed greater self-esteem (but not greater security) than nonvolunteers as measured by Maslow’s own tests. The one case that tended to show volunteers to be lower in self-esteem than nonvolunteers was a sample drawn from a class in abnormal psychology. These students were found to have an atypical distribution of self-esteem scores; both very high and very low scorers were overrepresented among the volunteers. Some time later, Siegman (1956) also solicited volunteers for a Kinsey-type interview, administering his own self-esteem scale to the subjects. He found no differences in self-esteem between volunteers and nonvolunteers. In one of the studies by Poor (1967), subjects were requested to complete and to return a questionnaire. ‘‘Volunteers,’’ thus, may be thought of as those subjects who returned the completed questionnaires. 
Poor employed a measure of self-esteem developed by Morris Rosenberg and found ‘‘volunteers,’’ or responders, to be higher in self-esteem than ‘‘nonvolunteers’’ (p < .07). A study by Pan (1951) of residents of homes for the aged is also, at least indirectly, relevant to this discussion. Pan observed that residents who completed and returned his questionnaires were in better physical health than were nonrespondents. Only because it appears that physical and mental health are somewhat correlated do Pan’s results imply that respondents may also be better adjusted psychologically than nonrespondents. There is, however, good reason to be cautious about these results, for it is possible that the superintendents of the homes may have unduly influenced the composition of the respondent group by distributing the questionnaires primarily to residents in good health. So far we have considered nine samples in which volunteers (i.e., respondents or interviewees) were compared with nonvolunteers (nonrespondents or noninterviewees) on adjustment-related variables. In seven of those nine, volunteers appeared to be the better adjusted. In one sample, nonvolunteers were the better adjusted, and in one other, volunteers did not differ from nonvolunteers in adjustment. On the whole, then, it would seem that in questionnaire or interview studies, respondents will mainly be those subjects who tend to be psychologically well-adjusted. When volunteering is requested for a typical psychological experiment the relationship between adjustment and volunteering becomes more equivocal. Schubert (1964), for example, found no differences in the MMPI scores on neuroticism between volunteers and nonvolunteers for a psychological experiment. He did find, however, that volunteers tended to be more irresponsible than nonvolunteers. In one of Poor’s (1967) studies, subjects were solicited for a psychological experiment.
Using the Rosenberg self-esteem measure, Poor found a tendency toward lower self-esteem among volunteers than among nonvolunteers (p = .14). It will be recalled that Poor also found just the opposite result when solicitation had been of questionnaire returns. Finally, Ora (1966), using a self-report measure of adjustment, found no difference between volunteers and nonvolunteers for various psychological experiments. However, the volunteers perceived themselves in greater need of psychological assistance than did the nonvolunteers even without their feeling themselves more maladjusted. There is little basis for any conclusion to be drawn from the various findings. We noted earlier, in discussing ‘‘shows’’ and ‘‘no-shows,’’ that not all subjects who volunteer actually become part of the final data pool. There are two studies of those subjects who fail to keep research appointments that appear relevant to the adjustment variable. Silverman (1964) found that when participation was requested for a psychological experiment, it was subjects higher in self-esteem (as defined by a modified Janis and Field measure) who more often failed to keep their appointment. This finding seems consistent with that of Poor (1967), who also found subjects lower in self-esteem more likely to end up by contributing data to the behavioral experimenter. Wrightsman (1966), however, observed that subjects who failed to keep their research appointments scored lower in social responsibility, a finding which is inconsistent with Schubert’s (1964) observation that less responsible subjects are those more likely to volunteer. The probable complexity of the relationship between psychopathology and volunteering for a fairly standard, or unspecified psychological experiment is well-illustrated in the study by Newman (1956). Male volunteers were found to be less variable in degree of self-actualization than male nonvolunteers, whereas female volunteers showed greater variability than female nonvolunteers. 
If one accepts self-actualization as a measure of adjustment, the implication is a U-shaped relationship between adjustment and volunteering for females when the task is a psychological experiment, but an upside-down U for males. Whereas the best and the least adjusted males may be less likely to volunteer than those males who are moderately well-adjusted, the best and the least adjusted females may be more likely to volunteer than moderately well-adjusted females. When we turn to a consideration of the somewhat less standard types of behavioral experiments, we find similar difficulties in trying to summarize the relationship between volunteering and adjustment. Schultz (1967c) reports that female undergraduate volunteers for an experiment in sensory deprivation scored higher in emotional stability on the Cattell than nonvolunteers. However, in an experiment on hypnosis, Hilgard, Weitzenhoffer, Landes, and Moore (1961) found that female undergraduate volunteers scored lower in self-control than nonvolunteers. Among male subjects they report no significant difference in self-control between volunteers and nonvolunteers. In their recruitment for volunteers for hypnosis research, Lubin, Brady, and Levitt (1962a, 1962b) found no significant difference between student nurse volunteers and nonvolunteers that could be attributed to differences in adjustment. The tendency was, however, for the volunteers to appear somewhat less well-adjusted than the nonvolunteers as defined by a variety of test scores and by obesity. When the research for which volunteers are solicited takes on a medical appearance, the relationship between psychopathology and volunteering is a little more clear. Pollin and Perlin (1958) and Perlin, Pollin, and Butler (1958) have concluded
that the more intrinsically eager a person is to volunteer for hospitalization as a normal control subject, the more likely he is to be maladjusted. Similarly, Bell (1962) reports that volunteers for studies of the effects of high temperature were more likely to be maladjusted than nonvolunteers. Lasagna and von Felsinger (1954), recruiting subjects for drug research, noted the high incidence of psychopathology among volunteers. A similar finding has also been reported by Esecover, Malitz, and Wilkens (1961), who solicited volunteers for research on hallucinogens. They found that the better adjusted volunteers were motivated more by money, scientific curiosity, or because volunteering was a normally expected occurrence (e.g., as by medical students). Such findings are also consistent with the results of Pollin and Perlin (1958). Thus far the results of studies of volunteers for medical research agree rather well with one another, consistently tending to show greater psychopathology among volunteers than nonvolunteers. To this impression, however, one must add the results obtained by Richards (1960), who compared volunteers and nonvolunteers for a study of mescaline on the basis of their responses to the Rorschach and TAT. Though significant differences were obtained, the nature of those differences was such that no conclusion could be drawn as to which group was the more maladjusted. It can be noted, however, that Richards’ subjects were undergraduates in the medical sciences, where volunteering might well have been the expected behavior for these students who may also have been motivated by a preprofessional interest in drug research. Under such circumstances, one might not expect volunteers to reveal any excess of psychopathology over what one would obtain among nonvolunteers.
Intelligence
There are several studies showing a difference in intellectual performance between volunteers and nonvolunteers. 
Martin and Marcuse (1957) found that volunteers for an experiment in hypnosis scored higher on the ACE than nonvolunteers. In a subsequent study, Martin and Marcuse (1958) solicited volunteers for three additional experiments in personality, learning, and attitudes toward sex. For all four studies combined, volunteers were still found to score higher in intelligence than nonvolunteers. The definition of intelligence again was the score on the ACE, a test employed also by Reuss (1943) in a study of responders and nonresponders to a mail questionnaire. Reuss also found that ‘‘volunteers’’ (i.e., responders) scored higher in intelligence than ‘‘nonvolunteers.’’ Myers et al. (1966), however, employing a U.S. Army technical aptitude measure of intelligence, found no significant relationship between intelligence and volunteering for isolation experiments. In a study of high school juniors, Wicker (1968) compared the degree of participation in behavioral research of ‘‘regular’’ and ‘‘marginal’’ students. Regular students were defined as those scoring at least 105 on an IQ test and who either earned no grades below C in the preceding semester or were children of fathers engaged in managerial or professional occupations. Marginal students were defined as those having scored below 100 in IQ and either having earned two or more D or F grades in the preceding semester or who were children of fathers in ‘‘lower’’ occupational categories. Of the juniors in the regular group 44 per cent found their way into the research project. Of those in the marginal group less than 14 per cent made their way into the project (p < .001).
In requesting student nurses to volunteer for an hypnotic dream experiment, Edwards (1968) found no relationship between IQ and volunteering but, somewhat surprisingly, he observed that volunteers scored significantly lower than nonvolunteers on a test of psychiatric knowledge (p < .01) and that volunteers were also lower in relative class standing (p = .07). In addition, nonvolunteers’ fathers were better educated than the fathers of volunteers (p = .006). These findings seem opposite in direction to those obtained by Wicker, but it can be noted that Edwards’ subjects were more highly selected from the upper end of the ability distribution than were Wicker’s. Edwards’ findings also differ from those of Martin and Marcuse, but this inconsistency cannot be attributed to differences in the general level of intellectual performance found in the two samples.
Brower (1948) found that volunteers for a visual-motor skills experiment performed better at difficult visual-motor tasks than did coerced nonvolunteers, though there was no performance difference in a simple visual-motor task. Wolfgang (1967) solicited volunteers for a concept learning experiment, after which all subjects were administered the Shipley-Hartford test of abstract thinking ability. Male volunteers exhibited better performance than male nonvolunteers, but no difference was revealed between female volunteers and nonvolunteers. A problem common to the studies of Brower and Wolfgang is that volunteer status was established before determining the correlate of volunteering. At least in principle, it is possible that the act of volunteering, or of refusing to volunteer, may affect the subject’s subsequent task performance. Thus, a subject who has been coerced to participate in an experiment may be poorly motivated to perform well at tasks that have been set for him against his will. It remains problematic whether his poor performance antedated his decision not to volunteer. 
In an experiment in which the test to be correlated with volunteering is administered both to volunteers and nonvolunteers once the volunteers have already participated in an experiment, one must examine carefully the nature of the experimental task. If the task were similar to the test one would expect the volunteers to perform better, since the task would provide a practice session for their performance on the test. When the definition of intelligence is in terms of a standard IQ test or even a test of visual-motor skill there are at least a half-dozen samples showing that volunteers perform better than nonvolunteers, three samples showing no difference in IQ, and no samples showing nonvolunteers to perform better than volunteers. It appears that when there are differences in intelligence between volunteers and nonvolunteers, the difference favors the volunteers. When the definition of ‘‘intelligence’’ is in terms of school grades, the relationship of volunteering to intelligence becomes more equivocal. On the one hand, Edwards (1968) found that volunteers stood lower in class rankings than nonvolunteers. On the other hand, Wicker (1968), including grades in his definition of marginality of academic status, found that volunteers performed better than nonvolunteers. The relationship is further complicated by the fact that Rosen (1951) found no difference in grades between male volunteers and male nonvolunteers for behavioral research, but among female subjects the volunteers tended to earn higher grades (p < .10). Poor (1967) obtained no differences in grades or in intellectual interests and aspirations between respondents and nonrespondents to a mail questionnaire nor between volunteers and nonvolunteers for a psychological experiment. Though Abeles, Iscoe, and Brown (1954–1955) found no overall relationship with
volunteering, they did find a tendency toward higher grades among early volunteers. From the few studies in which volunteering has been correlated with school grades, it is difficult to draw any clear conclusions. If there is a relationship between school grades and volunteering it seems to be neither strong nor consistent. Leipold and James (1962) found that female volunteers who showed up for the experiment as scheduled had been earning higher grades in psychology; among male subjects this relationship was not significant. Where grades are so specific to a single course they are likely to be less well-correlated with general intelligence. Perhaps grades in a psychology course are as much a measure of interest in research as they are a measure of intelligence. Such an interpretation would be consistent with the findings of Edgerton, Britt, and Norman (1947). They reported that over a period of several years, winners of science talent contests responded to a mail questionnaire more than did runners-up who, in turn, responded more than ‘‘also rans.’’ While winners of such contests may well be more intelligent than losers, the greater interest of the winners might be the more potent determinant of their cooperation. In a recent study by Matthysse (1966) a number of volunteers for an experiment in attitude change were followed up by mail questionnaire. Somewhat surprisingly, but consistent with Edwards’ (1968) findings, those who responded to the follow-up had scored lower in intellectual efficiency (p < .10) on the California Psychological Inventory. Relevant to intellectual motivation, if not to intellectual performance, is a consideration of the variable of achievement motivation. In his important review of volunteer characteristics, Bell (1962) suggests that volunteers may be higher in need for achievement than nonvolunteers. The evidence for this hypothesis, however, is quite indirect. 
More direct data have become available from research by Lubin, Levitt, and Zuckerman (1962) and by Myers et al. (1966). In their studies of respondents and nonrespondents to a mail questionnaire, and of volunteers and nonvolunteers for an isolation experiment, respectively, a tendency, not statistically significant, was found for respondents and volunteers to score higher in need achievement on the Edwards Personal Preference Schedule than nonrespondents and nonvolunteers.
Education
In most of the studies described in the preceding section the subjects were students, usually in college, and the educational variance was low, a situation which is characteristic of experimental studies but not of survey research. In questionnaire or interview studies the target population is often intended to show considerable variability of educational background. Among survey researchers there has long been a suspicion that better educated people are those more apt to find their way into the final sample. The suspicion is well justified by the data. Study after study has shown that it is the better educated person who finds his responses constituting the final data pool. Since we could find no significant reversals to this relationship, we have simply listed in Table 3-3 the various studies in support of this conclusion.
Social Class
There is a high degree of correlation between amount of education and social class, the latter defined by occupational status. One would expect, therefore, that in survey research those subjects having higher occupational status roles would be more likely
Table 3–3 Studies Showing Respondents in Survey Research to Be Better Educated
Authors                        Date
Benson, Booman, and Clark      1951
Franzen and Lazarsfeld         1945
Gaudet and Wilson              1940
Pace                           1939
Pan                            1951
Reuss                          1943
Robins                         1963
Suchman and McCandless         1940
Wallin                         1949
Zimmer                         1956
to answer questions and to answer them sooner than subjects lower in occupational status. Belson (1960), Franzen and Lazarsfeld (1945), and Robins (1963) have all found professional workers more likely than lower class jobholders to participate in survey research. Similarly, Pace (1939) found professionals more willing to be interviewed and to return their questionnaires more promptly than nonprofessionals. Zimmer (1956), in his study of Air Force officers and enlisted men, found that probability of responding to a questionnaire increased directly as the serviceman’s rank increased. Finally, King (1967) found in his survey of Episcopal clergymen that questionnaires were more likely to be returned by (a) bishops than by rectors, (b) rectors than by curates, and (c) curates than by vestrymen. The sharp and statistically significant break came between the rectors and curates. However, there is some possibility in King’s study that not all of the curates and vestrymen actually received the questionnaires. Even discounting King’s results, the trend is clear. At least for the range of occupational statuses noted here, higher status role occupants are more likely to participate in the survey research process than those lower in status. We have been talking of the volunteer’s own social class as defined by his occupational status. The picture becomes more complicated when we consider the social class of the volunteer’s parents—his class of origin. Edwards (1968) reports that the fathers of volunteers for an hypnotic dream experiment had a lower educational level than the fathers of nonvolunteers. Similarly, Reuss (1943) found that the parents of respondents to a mail questionnaire had less education than the parents of nonrespondents. Rosen (1951) notes that the fathers of female volunteers for psychological research had a lower income than the fathers of nonvolunteers. 
Poor (1967), however, found no relationship between father’s occupational status and either responding to a questionnaire or volunteering for an experiment. The trend, if any, was for the fathers of respondents to have a higher occupational status than the fathers of nonrespondents. Finally, the reader will recall the study by Wicker (1968), in which marginal students participated in research less often than nonmarginal students. For Wicker, father’s occupation was part of the definition of marginality. If father’s occupational status, then, made any difference at all, it was the children of higher status fathers who more often produced data for the behavioral researcher. To summarize, when father’s social class makes the clearest difference, it seems that
children of lower class fathers are the most likely to volunteer. Since the evidence seems so clear that subjects who are themselves lower class members are less likely to volunteer, the hypothesis is suggested that those higher status persons are most likely to volunteer whose background includes vertical social mobility. In keeping with a point made earlier in this chapter, these latter persons may be those who at least in survey research would perceive themselves as having the most interesting and most acceptable answers to the investigator’s questions.
Age
There are over a dozen studies addressed to the question of age differences between volunteers and nonvolunteers. As we already have seen with other variables, often there is no significant difference between the ages of volunteers and nonvolunteers. However, when differences are found they most often suggest that younger rather than older subjects are those who volunteer for behavioral research. Abeles, Iscoe, and Brown (1954–1955) found this to be the case in their questionnaire study. Newman (1956) observed the same relationship for personality and perception experiments, although the age difference in the latter experiment was not statistically significant. In another experiment in perception, however, Marmer (1967) found volunteers to be significantly younger than nonvolunteers. Rosen (1951) found that female volunteers were significantly younger than female nonvolunteers, a difference which, however, did not hold for male subjects. For research with college students, then, even when the general trend is for volunteers to be younger, the sex of the subjects and the type of experiment for which volunteering is requested seem to complicate the relationship between age and volunteering. From studies not employing the usual college samples there is also some evidence for the greater youthfulness of volunteers. 
Myers, Murphy, Smith, and Goffard (1966) requested Army personnel to volunteer for a study of perceptual isolation and found volunteers to be younger. Pan (1951), too, found his respondents among residents of homes for the aged to be younger than the nonrespondents. The same tendency was reported by Wallin (1949). Nevertheless, opposite results also have been reported. Thus, in King’s (1967) study it was the older clergymen who were most likely to reply to a brief questionnaire. In that study, however, age was very much confounded with position in the Church’s status hierarchy. That situation seems to hold for the study by Zimmer (1956) as well. He found older Air Force men to respond more readily to a mail questionnaire, but the older airmen were also those of higher rank. Kruglov and Davidson (1953), in their study of male undergraduates, found the older students more willing to be interviewed than the younger students. Even allowing for the confounding effects of status in the studies by King and by Zimmer, the three studies just described weaken considerably the hypothesis that volunteers tend to be younger than nonvolunteers. That hypothesis is weakened further by several studies showing no differences in age between volunteers and nonvolunteers (Benson, Booman, and Clark, 1951; Edwards, 1968; Poor, 1967). Further evidence for the potentially complicated nature of the relationship between age and volunteering comes from the work of Gaudet and Wilson (1940). They found that their determined nonvolunteers for a personal interview tended to be of intermediate ages with the younger and older householders more willing to
participate. Similarly, curvilinearities have been reported by Newman (1956), though for his collegiate sample the curvilinearity was opposite in direction to that found by Gaudet and Wilson. Especially among Newman’s female subjects, the nonvolunteers showed more extreme ages than did the volunteers. The same tendency was found among male subjects, but it was significant for males only among subjects recruited for a personality experiment, not among those recruited for a perception experiment.
Religion
The data are sparse that bear on the question of the relationship between volunteering and religious affiliation and attitudes. Matthysse (1966) found his respondents to a mail questionnaire to be disproportionately more often Jewish than Protestant and also to be more concerned with theological issues. The latter finding is not surprising since the questionnaire dealt with religious attitudes. Rosen (1951) also found Jews to be significantly overrepresented in his sample of volunteers for psychological research. In addition, Rosen found volunteers to be less likely to attend church services than nonvolunteers. However, Ora (1966) found no relationship between religious preference or church attendance and volunteering for various psychological experiments. In his interview research, Wallin (1949) found Protestants somewhat more likely than Catholics to participate in a study of the prediction of marital success. The tenuousness of these findings is well illustrated in a study by Poor (1967). He, too, found no significant relationship between volunteering and either religious affiliation or church attendance. However, in his study of respondents to a mail questionnaire, he reports a trend toward greater participation among Protestants than among Catholics or Jews, while in his study of volunteers for a psychological experiment, he reports a trend toward greatest participation among Catholics and least participation among Jews. 
In a study of student nurses, Edwards (1968) found no association between volunteering and religious attitudes. In summary, then, it is not possible to make any general statement concerning the relationship between volunteering and either religious affiliation, religious attitudes, or church attendance. On the basis of our earlier discussion, one might speculate that if any relationship does exist it is probably complicated by subjects’ sex and the type of research for which participation is solicited.
Geographic Variables
In his study of respondents to a mail questionnaire, Reuss (1943) found greater participation by subjects from a rural rather than an urban background. In his study of college students, however, Rosen (1951) found no such difference between volunteers and nonvolunteers. Siegman (1956) reports that for a Kinsey-type interview, volunteering rates were higher in an Eastern than in a Midwestern university. Presumably, there were more students of rural origin in the Midwestern sample. Perhaps the nature of the volunteer request interacts with rural-urban origin to determine volunteering rates. Finally, Franzen and Lazarsfeld (1945) found that respondents to their mail questionnaire were overrepresented by residents of the East Central States but underrepresented by residents of New England and the Middle
Atlantic States. In addition, residents of cities with a population of less than 100,000 were overrepresented relative to residents of cities of larger population. In view of the sparseness of the obtained data it seems best to forego any summary of the relationship between geographic variables and volunteering for behavioral research.
Populations Investigated
Before summarizing what is known and not known about differentiating characteristics of volunteers for behavioral research, let us consider the populations that have been discussed. All of the studies cited here sampled from populations of human subjects, situations, tasks, contexts, personal characteristics, and various measures of those characteristics (Brunswik, 1956).
Human Subjects
In his study of subject samples drawn for psychological research, Smart (1966) examined every article in the Journal of Abnormal and Social Psychology appearing in the years 1962–1964. Less than 1 per cent of the studies employed samples from the general population; 73 per cent used college students; and 32 per cent used introductory psychology students. Comparable data from the Journal of Experimental Psychology revealed no studies that had sampled from the general population; 86 per cent of the studies employed college students, and 42 per cent used students enrolled in introductory psychology courses. These data provide strong, current support for McNemar’s 20-year-old criticism of behavioral science’s being largely a science of the behavior of sophomores.
The studies discussed in this chapter provide additional support for McNemar’s contention. The vast majority of the psychological experiments drew their subject samples from college populations. When the studies were in the nature of surveys, a much broader cross-section of subject populations was tapped, but even then college students were heavily represented. Such a great reliance on college populations may be undesirable from the standpoint of the representativeness of design in behavioral research generally, but it does not reflect an ecological invalidity for our present purpose. Sampling of subject populations in studies of volunteer characteristics seems to be representative of sampling of subject populations in behavioral research generally. 
Situations, Tasks, and Contexts
A considerable variety of situations, tasks, and contexts were sampled by the studies discussed here. The tasks for which volunteering was requested included survey questionnaires, Kinsey-type interviews, psycho-pharmacological and medical control studies, and various psychological experiments focusing upon small group interaction, sensory deprivation, hypnosis, personality, perception, learning, and motor skills. Unfortunately, very few studies have employed more than one task; hence, little is known about the effects of the specific task either on the rate of volunteering to undertake it or on the nature of the relationship between volunteering and the personal characteristics of volunteers.
There is, however, some information bearing on this problem. Newman (1956), for example, employed more than one task, asking subjects to volunteer both for a personality and a perception experiment, but he found no systematic effect of these two tasks on the relationships between the variables investigated and the act of volunteering. Zuckerman, Schultz, and Hopkins (1967) similarly found no systematic effect of their two tasks (sensory deprivation and hypnosis) on the relationships between volunteering and those correlates of volunteering in which they were interested. Hood (1963), however, employed four tasks and found a significant interaction between subjects’ sex and type of task. Males were more willing than females to participate in a competitive experimental task but were less willing than females to participate in studies of affiliation behavior and self-revelation, or in a relatively unspecified study. Similarly, Ora (1966) found males particularly reluctant to volunteer for a ‘‘clinical’’ study in which self-revelation was called for. Martin and Marcuse (1958) employed four tasks for which volunteering was requested. They found greater personality differences between volunteers and nonvolunteers for a hypnosis experiment than were found between volunteers and nonvolunteers for experiments in learning, attitudes toward sex, and personality. Of these last three experimental situations, the personality study tended to reveal somewhat greater personality differences between volunteers and nonvolunteers than were obtained in the other two situations. Those differences that emerged from the more differentiating tasks were not specifically related conceptually to the differential nature of the tasks for which volunteering had been requested. These findings should warn us, however, that any differentiating characteristic may be a function of the particular situation for which volunteers were solicited. 
Because of the obvious desirability of being able to speak about characteristics of volunteers for ‘‘generalized’’ behavioral research, a search was made for studies of volunteer characteristics in which the request for volunteers was nonspecific. A number of the studies previously cited met this requirement, usually by asking subjects to volunteer simply for a ‘‘psychological experiment’’ (e.g., Himelstein, 1956; Leipold and James, 1962; McDavid, 1965; Poor, 1967; Schubert, 1964; Silverman, 1964; Ward, 1964; Wilson and Patterson, 1965). The results of those studies revealed no tendency toward results any different from those obtained in studies in which the volunteer’s task was more specifically stated.
Personal Characteristics
In this chapter we have tried to include every finding of a significant difference between volunteers and nonvolunteers that was available. Having once discovered a significant difference, every effort was then made to uncover studies reporting either no differences or differences in the opposite direction. For organizational and heuristic purposes, however, we have grouped the many findings together under a fairly small number of headings. Decisions to group any variables under a given heading were made on the basis of empirically established or conceptually meaningful relationships. It should further be noted that within any category several different operational definitions may have been employed. Thus, we have discussed anxiety as defined by the Taylor Manifest Anxiety Scale as well as by the Pt scale of the MMPI. This practice was made necessary by the limited number of available studies employing
identical operational definitions, except possibly for age and sex. This necessity, however, is not unmixed with virtue. If, in spite of differences of operational definition, the variables serve to predict the act of volunteering, we can feel greater confidence in the construct underlying the varying definitions and in its relevance to the predictive and conceptual task at hand.
Summary of Volunteer Characteristics
For this mass of studies some attempt at summary is essential. Each of the characteristics sometimes associated with volunteering has been placed into one of three groups of statements. In the first group, statements or hypotheses are listed for which the evidence seems strongest. Though in absolute terms our confidence may not be so great, in relative terms we have most confidence in the propositions listed in this group. In the second group, statements or hypotheses are listed for which the evidence, though not unequivocal, seems clearly to lean in favor of the proposition. At least some confidence in these statements seems warranted. In the third group, statements or hypotheses are listed for which the evidence is unconvincing. Little confidence seems warranted in these propositions. Within each of the three groups of statements, the hypothesized relationships are listed in roughly descending order of warranted confidence.
Statements Warranting Most Confidence
1. Volunteers tend to be better educated than nonvolunteers.
2. Volunteers tend to have higher occupational status than nonvolunteers (though volunteers may more often come from a lower status background).
3. Volunteers tend to be higher in the need for approval than nonvolunteers (though the relationship may be curvilinear with least volunteering likely among those with average levels of need approval).
4. Volunteers, especially males, tend to score higher than nonvolunteers on tests of intelligence (though school grades seem not clearly related to volunteering).
5. Volunteers tend to be less authoritarian than nonvolunteers, especially when asked to answer personal questions.
6. Volunteers tend to be better adjusted than nonvolunteers when asked to answer personal questions, but more poorly adjusted when asked to participate in medical research. (In psychological experiments the relationship is equivocal.)
Statements Warranting Some Confidence
7. Volunteers tend to be more sociable than nonvolunteers.
8. Volunteers tend to be more arousal-seeking than nonvolunteers.
9. Volunteers tend to be more unconventional than nonvolunteers.
10. Volunteers tend more often than nonvolunteers to be firstborn.
11. Volunteers tend to be younger than nonvolunteers, especially when occupational status is partialled out.
12. Volunteers tend more often than nonvolunteers to be females when the task is standard and males when the task is unusual.
Statements Warranting Little Confidence
13. Volunteers tend to be more anxious than nonvolunteers when the task is standard and less anxious when the task is threatening.
14. Male volunteers are less conforming than male nonvolunteers.
15. Volunteers tend more often than nonvolunteers to be Jewish.
16. Volunteers more than nonvolunteers tend to be of rural origin when the task is standard and of urban origin when the task is unusual.
It is obvious that the hypothesized relationships require further investigation, especially those falling lower in the lists. There is little reason for thinking that an experimentum crucis would place any of these relationships on firmer footing. Many have been examined in a dozen or even a score of studies. When the summary of that many findings is equivocal, it is readily apparent that it will take more than simply a few new studies to clarify the underlying relationship. As we have already seen, the nature of the relationship between the attributes discussed and the act of volunteering may be complicated by the interacting effects of other variables. Two of the most likely candidates for the role of moderator variable are the nature of the task for which volunteering is solicited and the sex of the subject.
Implications for Representativeness
The results of our analysis suggest that in any given study of human behavior the chances are good that those subjects who find their way into the research will differ appreciably from those subjects who do not. Even if the direction of difference is not highly predictable, it is important to know that volunteers for behavioral research are likely to differ from nonvolunteers in a variety of characteristics. One implication of this conclusion is that limitations may be imposed on the generality of findings from research employing volunteer subjects. It is well known that the violation of the requirement of random sampling complicates the process of statistical inference. This problem is discussed in basic texts on sampling theory and has also been dealt with by some of the workers previously cited (e.g., Cochran, Mosteller, and Tukey, 1953).
Granted that volunteers are never a random sample of the population from which they were recruited, and granting further that a given sample of volunteers differs on a number of important dimensions from a sample of nonvolunteers, we still do not know whether volunteer status is a condition that actually makes any great difference with regard to our dependent variables. It is possible that in a given experiment the performance of the volunteer subjects would not differ at all from the performance of the unsampled nonvolunteers if the latter had actually been recruited for the experiment (Lasagna and von Felsinger, 1954). The point is that substantively we have little idea of the effect of using volunteer subjects. What is needed are series of investigations covering a variety of tasks and situations for which volunteers are solicited, but for which both volunteers and nonvolunteers are actually used. Thus, we could determine in what types of studies the use of volunteers actually makes a difference, as well as the kinds of differences and their magnitude. 
When more information is available, we can, with better conscience, enjoy the convenience of
using volunteer subjects. In the meantime, the best one can do is to hypothesize what the effects of volunteer characteristics might be in any given line of inquiry. Let us take, as an example of this procedure, the much analyzed Kinsey-type study of sexual behavior. We have already seen how volunteers for this type of study tend to have unconventional attitudes about sexuality and may in addition behave in sexually unconventional ways. This tendency, as has frequently been noted, may have had grave effects on the outcome of Kinsey-type research, possibly leading to population estimates of sexual behavior seriously biased in the unconventional direction. The extent of this type of bias could probably be partially assessed over a population of college students among whom the nonvolunteers could be converted into ‘‘volunteers’’ in order to estimate the effect on data outcome of initial volunteering versus nonvolunteering. Clearly, such a study would be less feasible among a population of householders who stood to gain no course credit or instructor’s approval from changing their status of nonvolunteer to volunteer. The experiment by Hood and Back (1967) has special implications for small groups research. At least among male subjects these investigators found volunteers to be more willing than nonvolunteers to disclose personal information to others. Among female subjects the relationship between volunteering and self-disclosure was complicated by the nature of the experiment for which volunteering had been requested. Especially among males, then, small groups experiments that depend upon volunteer subjects may give inflated estimates of group members’ willingness to participate openly in group interaction. In a more standard realm of experimental psychology, Greene (1937) showed that precision in discrimination tasks was related to subjects’ intelligence and type of personal adjustment. 
To the extent that volunteers differ from nonvolunteers in adjustment and intelligence, typical performance levels in discrimination tasks may be misjudged when volunteer samples are employed. It seems reasonable to wonder, too, about the effect of the volunteer variable on the normative data required for the standardization of an intelligence test. Since, at least in the standardization of intelligence tests for adults, volunteers tend to be over-represented, and since volunteers tend to score higher on tests of intelligence, the ‘‘mean’’ IQ of 100 may represent rather an inflation of the true mean that would be obtained from a more truly random sample. It was suggested earlier that it might be useful to assess the magnitude of volunteer bias by converting nonvolunteers to volunteers. However, one problem with increasing the pressure to volunteer in a sample of nonvolunteers is that the experience of having been coerced may change the subjects’ responses to the experimental task. That seems especially likely in situations where nonvolunteers are initially led to believe that they are free not to volunteer. One partial solution to this problem might be to recruit volunteers from among nonvolunteers using increasingly positive incentives, a technique that has met with some success in survey research. Even then, however, we must try to assess the effect on the subject’s response of having been sent two letters rather than one, or of having been offered $2 rather than $1 as spurs to volunteering. If volunteers differ from nonvolunteers in their response to the task set by the investigator, the employment of volunteer samples can have serious effects on estimates of such parameters as means, medians, proportions, variances, skewness, and kurtosis. In survey research, where the estimation of such parameters is the principal goal, biasing effects of volunteer samples could be disastrous. 
In most behavioral experiments, however, interest is not centered so much on such statistics
as means and proportions but rather on such statistics as the differences between means or proportions. The investigator is ordinarily interested in relating such differences to the operation of his independent variable. The fact that volunteers differ from nonvolunteers in their scores on the dependent variable may be quite irrelevant to the behavioral experimenter. He may want more to know whether the magnitude and statistical significance of the difference between his experimental and control group means would be affected if he used volunteers. In other words, he may be interested in knowing whether volunteer status interacts with his experimental variable.
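The distinction drawn here, between volunteers simply scoring differently and volunteer status interacting with the experimental variable, can be made concrete with a small sketch. All cell means below are hypothetical numbers chosen for illustration, and the names `CellMeans` and `interaction` are not from the text.

```python
from dataclasses import dataclass


@dataclass
class CellMeans:
    """Mean score on the dependent variable for one kind of subject."""
    control: float
    experimental: float

    def effect(self) -> float:
        # The statistic most experimenters care about: the difference between means.
        return self.experimental - self.control


def interaction(volunteers: CellMeans, nonvolunteers: CellMeans) -> float:
    """Difference between the two treatment effects (zero means no interaction)."""
    return volunteers.effect() - nonvolunteers.effect()


# Hypothetical case 1: volunteers score 5 points higher in every condition.
additive_nonvol = CellMeans(control=50.0, experimental=60.0)
additive_vol = CellMeans(control=55.0, experimental=65.0)

# Hypothetical case 2: the treatment itself is more effective among volunteers.
interactive_nonvol = CellMeans(control=50.0, experimental=60.0)
interactive_vol = CellMeans(control=55.0, experimental=75.0)

print(interaction(additive_vol, additive_nonvol))        # 0.0
print(interaction(interactive_vol, interactive_nonvol))  # 10.0
```

In the first case a survey researcher estimating the population mean would be misled by a volunteer sample, while an experimenter comparing conditions would not. Only in the second case does the choice of sample change the experimental conclusion.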
Implications for Experimental Outcomes

In this section we shall describe the evidence relevant to the problem of interaction of volunteer status with various experimental variables. Compared to the evidence amassed to show inherent differences between volunteers and nonvolunteers, there is little evidence available from which to decide whether volunteer status is likely to interact with experimental variables. On logical grounds alone one might expect such interactions. If we assume for the moment that volunteers are more often firstborn than are nonvolunteers, some research by Dittes (1961) becomes highly relevant. Dittes found that lessened acceptance by peers affected the behavior of firstborns but not that of laterborns. Still assuming firstborns to be overrepresented among volunteers, a study of the experimental variable of “lessened acceptance” conducted on volunteers might show strong effects, while the same study conducted with a more nearly random sample of subjects might show only weak effects.

We can imagine, too, an experiment to test the effects of some experimental manipulation on the dependent variable of gregariousness. If a sample of highly sociable volunteers were drawn, any manipulation designed to increase gregariousness might be too harshly judged as ineffective simply because the untreated control group would already be unusually high on this factor. The same manipulation might prove effective in increasing the gregariousness of the experimental group relative to the gregariousness of the control group if the total subject sample were characterized by a less restricted range of sociability. At least in principle, then, the use of volunteer subjects could lead to an increase in Type II errors.

The opposite type of error can also be imagined. Suppose an investigator were interested in the relationship between the psychological adjustment of women and some dependent variable.
If female volunteers are indeed more variable than female nonvolunteers on the dimension of adjustment, and if there were some relationship between adjustment and the dependent variable, then the magnitude of that relationship would be overestimated when calculated for a sample of volunteers relative to a sample of nonvolunteers.

So far in our discussion we have dealt only with speculations about the possible effects of volunteer bias on experimental or correlational outcomes. Fortunately, we are not restricted to speculation, since there are several recent studies addressed to this problem.

The Hayes, Meltzer, and Lundberg Study (1968)

In this experiment the investigators were interested in learning the effects on the subject’s vocal participation in dyadic task-oriented groups of (a) his possession of
task-relevant information, (b) his co-discussant’s possession of task-relevant information, and (c) the joint possession of task-relevant information. The dyad’s task was to instruct an outsider how to build a complex tinkertoy structure. The builder had no diagram, but each of the two instructors did. Amount of task-relevant information was varied by the use of good, average, and poor diagrams. One-third of the 120 undergraduates were assigned to each of the three levels of information, and within each of these three groups the co-discussants were given either good, average, or poor information. Within each of the nine conditions so generated, half the subjects were paid volunteers and half were required to serve. Results showed that there were no effects on vocal activity of a subject’s own level of information but that subjects talked least when their partner had most information. In addition, paid volunteers participated significantly more than did the conscripted subjects, a finding also cited earlier in this chapter. The results in which we are most interested are the interactions between the volunteering variable and the experimental manipulations. None of those Fs approached significance; indeed all were less than unity. From this experiment one might conclude that, while volunteers and conscriptees differ in important ways from one another, they are nevertheless similarly affected by the operation of the experimental manipulation. No serious errors of inference would have occurred had the investigators used a sample composed entirely of volunteers. Perhaps, we should be surprised to discover any difference between the volunteers and the conscriptees. Conscriptees, after all, are not nonvolunteers, but rather a mixed group, some of whom would have volunteered had they been invited and some of whom (the bona fide nonvolunteers) would have refused. 
A comparison between a group of volunteers and a group composed of volunteers plus nonvolunteers (in unknown proportion) should not yield a difference so large as a comparison between volunteers and nonvolunteers. Perhaps, then, when volunteers and nonvolunteers are more clearly differentiated there may be significant interactions between the experimental variable and the volunteer variable.

Before leaving the Hayes, Meltzer, and Lundberg study one additional possibility can be noted. Since the conscriptees and the paid volunteers were contacted at different times of the school year, it is possible that the differences between the two groups were confounded by temporal academic variables, e.g., time to next examination period, as well as by differences in the subject pools available at the two periods of the year.

The Rosnow and Rosenthal Study (1966)

The primary purpose of this experiment was to examine the differential effects of persuasive communications on volunteer and nonvolunteer samples. Approximately half of the 42 female undergraduates had volunteered for a fictitious experiment in perception, and half had not volunteered. Both the volunteer and nonvolunteer subjects then were assigned at random to one of three groups. One group of subjects was exposed to a pro-fraternity communication; a second group was exposed to an anti-fraternity communication; and a third group received neither communication. For all subjects, prior opinions about fraternities had been unobtrusively measured one week earlier by means of fraternity opinion items embedded in a 16-item opinion survey. After exposure to pro-, anti-, or no-communication about fraternities, subjects were retested for their opinions.
Table 3-4 Opinion Change among Volunteers and Nonvolunteers

                       Volunteers                  Nonvolunteers
Treatment          Change       Two-tail p     Change        Two-tail p
Pro-fraternity     +1.67 (9)       .20         +2.50 (6)        .15
Control            +0.40 (5)       .90         −0.91 (11)       .15
Anti-fraternity    −3.50 (6)       .05         −1.20 (5)        .50

Note. A positive valence indicates that opinions changed in a pro-fraternity direction; a negative valence, in an anti-fraternity direction. Numbers in parentheses indicate the sample size.
Table 3-4 shows the mean opinion change scores for each experimental condition separately for volunteers and nonvolunteers. The associated probability levels are based on t tests for correlated means, and they suggest that while opinion changes were not dramatic in their p values, they were large in magnitude and greater than might be ascribed to chance. In a set of six ps, only one would be expected to reach the .17 level by chance alone. In this study, three of the six ps reached that level despite the average of only 6 df per group. The only group to show opinion change significant at the .05 level was that comprised of volunteers exposed to the antifraternity communication. Table 3-5 shows an alternative analysis in which the experimental groups’ opinion changes are compared with one another separately for volunteers and nonvolunteers. The overall effect of the pro- versus anti-fraternity communications was not significantly greater among volunteers than among nonvolunteers. The magnitude of the effect, however, reached a p of .004 among volunteers compared to a p of .07 among nonvolunteers. An investigator employing a strict decision model of inference and adopting an alpha level of .05 or .01 would have reached different conclusions had his experiment been conducted with volunteers rather than nonvolunteers. Table 3-5 also shows that for volunteers the anti-fraternity communication was more effective, while for nonvolunteers the pro-fraternity communication was more effective (interaction p < .05). It appears, then, that volunteer status can, at times, interact with experimental manipulations to affect experimental outcomes. We can only speculate on why the particular interaction occurred. Some evidence is available to suggest that faculty experimenters were seen as being moderately antifraternity. 
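The chance argument and the treatment comparisons in this passage can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the six tests are independent (each comes from a separate group of subjects), and it enters the Table 3-4 means with anti-fraternity changes as negative numbers. The function names are hypothetical.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Binomial probability of k or more 'hits' in n independent tests."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Six tests, each with a .17 chance of reaching the .17 level under the null.
n_tests, level = 6, 0.17
expected_hits = n_tests * level  # roughly one hit expected by chance
p_three_or_more = prob_at_least(3, n_tests, level)

# Mean opinion changes transcribed from Table 3-4.
change = {
    "volunteers": {"pro": 1.67, "control": 0.40, "anti": -3.50},
    "nonvolunteers": {"pro": 2.50, "control": -0.91, "anti": -1.20},
}


def contrasts(cell: dict) -> dict:
    """Treatment differences of the kind reported in Table 3-5."""
    return {
        "pro_minus_control": round(cell["pro"] - cell["control"], 2),
        "control_minus_anti": round(cell["control"] - cell["anti"], 2),
        "pro_minus_anti": round(cell["pro"] - cell["anti"], 2),
    }


print(f"expected hits by chance: {expected_hits:.2f}")
print(f"P(3 or more of 6 reach the level): {p_three_or_more:.3f}")
for group, cell in change.items():
    print(group, contrasts(cell))
```

Under these assumptions, three or more of six tests reaching the .17 level is an uncommon outcome, and the differences computed from the Table 3-4 means reproduce the entries of Table 3-5.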
Table 3-5 Effectiveness of One-Sided Communications among Volunteers and Nonvolunteers

Treatment difference     Volunteers     Nonvolunteers
Pro minus control          +1.27           +3.41ᵇ
Control minus anti         +3.90ᵇ          +0.29
Pro minus anti             +5.17ᵃ          +3.70ᶜ

ᵃ p = .004
ᵇ p = .05
ᶜ p = .07

Perhaps volunteers, who tend to show a greater need for approval, felt they would please the experimenter more by being more responsive to his anti-fraternity communication than to his pro-fraternity communication. That does not explain,
however, why nonvolunteers tended to show the opposite effect unless we assume that they also saw the experimenter as being more anti-fraternity and resisted giving in to what they saw as his unwarranted influence attempts.

Within each of the three experimental conditions the pretest–posttest reliabilities were computed separately for volunteers and nonvolunteers. The mean reliability (rho) of the volunteer subjects was .35, significantly lower than the mean reliability of the nonvolunteers (.97) at p < .0005. Volunteers, then, were more heterogeneous in their opinion change behavior, perhaps reflecting their greater willingness to be influenced in the direction they felt was demanded by the situation (see the chapter by Orne). The findings of this study as well as our review of volunteer characteristics suggest that volunteers may more often than nonvolunteers be motivated to confirm what they perceive to be the experimenter’s hypothesis.

The Rosnow and Rosenthal Study (1967)

This experiment will be described in greater detail than the other studies summarized because it has not previously been published.⁷ As in our earlier study (Rosnow and Rosenthal, 1966), the primary purpose was to examine the differential effects of communications on volunteer and nonvolunteer samples. In this study, however, two-sided as well as one-sided communications were employed. Four introductory sociology classes at Boston University provided the 103 male and 160 female subjects. All of the students were invited by their instructors to volunteer for either or both of two fictitious psychological experiments; one of the experiments purported to deal with psycho-acoustics, the other, with social groups. Approximately one week later, all of the subjects in each class simultaneously were presented by the experimenter with one of five different booklets, representing the five treatments in this after-only design.
On the cover page of every booklet were four items, which inquired as to the subject’s (a) sex, (b) cigarette smoking habit, (c) coffee drinking habit, and (d) order of birth. These items were followed, beginning on the next page, by a one-sided, two-sided, or control communication describing a brief episode in a day in the life of ‘‘Jim,’’ a fictitious individual based on the character described by Luchins (1957). The last page of every booklet contained four 9-point graphic scales on which the subject was asked to rate Jim in terms of how (a) friendly or unfriendly, (b) forward or shy, (c) social or unsocial, (d) aggressive or passive he seemed, based on the information contained in the communication. Communications. Two one-sided communications were used. One of the communications was a positive appeal (P), which portrayed Jim as friendly and outgoing: Jim left the house to get some stationery. He walked out into the sun-filled street with two of his friends, basking in the sun as he walked. Jim entered the stationery store which was full of people. Jim talked with an acquaintance while he waited for the clerk to catch his eye. On his way out, he stopped to chat with a school friend who was just coming into the store. Leaving the store, he walked toward school. On his way out he met the girl to whom he had been introduced the night before. They talked for a short while, and then Jim left for school.
⁷ We thank Robert Holz, Robert Margolis, and Jeffrey Saloway for their help in recruiting volunteers and George Smiltens for his help in data processing.
The other one-sided communication was a negative appeal (N). It portrayed Jim as shy and unfriendly: After school Jim left the classroom alone. Leaving the school, he started on his long walk home. The street was brilliantly filled with sunshine. Jim walked down the street on the shady side. Coming down the street toward him, he saw the pretty girl whom he had met on the previous evening. Jim crossed the street and entered a candy store. The store was crowded with students, and he noticed a few familiar faces. Jim waited quietly until the counterman caught his eye and then gave his order. Taking his drink, he sat down at a side table. When he had finished his drink he went home.
Two other descriptions, or two-sided communications, were constructed by combining the positive and negative appeals. When the P description was immediately followed by the N description, without a paragraph indentation between the two passages, we refer to this two-sided communication as PN. When N immediately preceded P, we refer to this two-sided communication as NP. All four communications were introduced by the following passage:

In everyday life we sometimes form impressions of people based on what we read or hear about them. On a given school day Jim walks down the street, sees a girl he knows, buys some stationery, stops at the candy store. On the next page you will find a paragraph about Jim. Please read the paragraph through only once. On the basis of this information alone, answer to the best of your ability the questions on the last page of this booklet.
The control subjects received only the introductory passage above, with the sentences referring to the paragraph on the following page omitted, followed immediately by the four rating scales.

Some of the results of this study have already been given in the relevant sections of our discussion of volunteer characteristics. Thus, females volunteered significantly more than males (χ² = 11.78, df = 2, p < .005); birth order was unrelated to volunteering (χ² = 0.54, df = 2, p > .75); for the total sample, and for female subjects alone, smoking and coffee drinking were unrelated to volunteering. Among male subjects, however, volunteers tended to smoke less (p = .07) and drink less coffee (p = .06) than nonvolunteers. Even among male subjects, smoking and drinking accounted for less than 4 per cent of the variance in volunteering behavior.

Because of the very unequal numbers of subjects within subgroups, all analyses of each of the four ratings made by subjects were based on unweighted means. The analyses of variance of the five treatments by volunteer status by sex of subject showed only significant effects of treatments. For each of the four ratings analyzed in turn, ps for treatment were less than .001. In these overall analyses no other ps were less than .05. Our greatest interest, however, is in the interaction of volunteer status with treatments. This interaction was computed for each of the four dependent variables, and only two of the associated ps were less than .20. For the variable “friendly,” p was .12; for the variable “social,” p was .17. With so many treatment conditions, however, these Fs for unordered means are relatively insensitive, and it may be instructive to examine separately for volunteers and nonvolunteers the specific experimental effects in which we are most interested. Table 3-6 shows separately for volunteers and nonvolunteers the difference between the control group mean and the mean of each of the four experimental groups in turn.
Each of the entries in Table 3-6 is based on the data from both male and female subjects combined without weighting.
Table 3-6 Effectiveness among Volunteers and Nonvolunteers of One-Sided and Two-Sided Communications as Compared with a “Zero” Control

Treatment difference           Ratings        Volunteers     Nonvolunteers
Positive (P) minus control     Friendly         +2.18ᵃ          +1.78ᵃ
                               Forward          +1.45ᵃ          +0.42
                               Social           +1.92ᵃ          +1.30ᵇ
                               Aggressive       +1.15ᵇ          +0.20
Negative (N) minus control     Friendly         −1.55ᵃ          −0.70
                               Forward          −2.04ᵃ          −1.68ᵃ
                               Social           −2.22ᵃ          −1.91ᵃ
                               Aggressive       −1.36ᵃ          −1.56ᵃ
PN minus control               Friendly         −0.58           +0.64
                               Forward          −0.34           −0.24
                               Social           −1.08ᵇ          +0.24
                               Aggressive       −0.34           −0.34
NP minus control               Friendly         +0.30           +0.38
                               Forward          +0.04           −0.52
                               Social           −0.30           +0.06
                               Aggressive       −0.45           −0.52

Note. Ratings could range from +4.00, or favoring strongly the positive appeal, to −4.00, strongly favoring the negative appeal. Entries in the table are differences between means of treatment conditions.
ᵃ p ≤ .01
ᵇ p < .05
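The footnote marks for the one-sided communications in Table 3-6 can be tallied to see how often a volunteer sample and a nonvolunteer sample would lead to different accept-or-reject decisions. This is only a bookkeeping sketch: the marks are transcribed from the table in the order friendly, forward, social, aggressive, with "-" standing for an unmarked entry, and mark "a" is assumed to denote the .01 level.

```python
# Footnote marks for the one-sided appeals, in the order
# friendly, forward, social, aggressive ("-" means no footnote mark).
marks = {
    "volunteers": {"P": "aaab", "N": "aaaa"},
    "nonvolunteers": {"P": "a-b-", "N": "-aaa"},
}

# Which marks count as significant at each alpha (assumes "a" means the .01 level).
ALPHA_MARKS = {0.05: {"a", "b"}, 0.01: {"a"}}


def significant_count(group: str, alpha: float) -> int:
    """Number of the eight one-sided tests reaching alpha for one group."""
    accepted = ALPHA_MARKS[alpha]
    return sum(m in accepted for row in marks[group].values() for m in row)


def disagreements(alpha: float) -> int:
    """Number of tests on which the two samples lead to different decisions."""
    accepted = ALPHA_MARKS[alpha]
    total = 0
    for appeal in ("P", "N"):
        pairs = zip(marks["volunteers"][appeal], marks["nonvolunteers"][appeal])
        total += sum((v in accepted) != (n in accepted) for v, n in pairs)
    return total


for alpha in (0.05, 0.01):
    share = disagreements(alpha) / 8
    print(alpha, significant_count("volunteers", alpha),
          significant_count("nonvolunteers", alpha), share)
```

At either alpha the two samples disagree on three of the eight tests, which is the sense in which conclusions would differ more than one-third of the time.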
It can be seen from Table 3-6 that the two one-sided communications were most effective among all subjects. Although these effects were not significantly greater among volunteers than among nonvolunteers, the trend was in that direction. Of the eight tests for the significance of the effectiveness of one-sided communications, all eight reached the .05 level among volunteers, while only five of the eight tests reached the .05 level among nonvolunteers. An investigator employing similar sample sizes, and a strict decision model of inference with an alpha of either .05 or .01, would arrive at different conclusions over one-third of the time were he to employ volunteer rather than nonvolunteer samples. Most important, perhaps, is that whenever differences in significance levels occurred it was the volunteers who favored the experimental hypothesis. When the communication was positive, volunteers became more positive than nonvolunteers; when the communication was negative, volunteers became more negative than nonvolunteers.

When we consider the effects of two-sided communications there appears to be less volunteer bias. The two-sided communications were ineffective generally regardless of whether they were compared to the control group (as shown in Table 3-6) or to each other (not shown). Of the 16 mean differences indicating the effectiveness of the two-sided communications shown in Table 3-6, only one was significant at the .05 level, just about what one might expect by chance. Nevertheless, that one “effect” occurred among volunteers.

The Marmer Study (1967)

Following a recruitment procedure similar to that used in the preceding study, Marmer administered to both volunteer and nonvolunteer subjects treatments
adapted from a standard deception introduced by cognitive dissonance theorists to study the effects of decisional importance and the relative attractiveness of unchosen alternatives on post-decisional dissonance reduction. The study was carried out in three phases. In the first phase, undergraduate women at Boston University were recruited for a fictitious psychology experiment. The second phase, which began immediately thereafter, consisted of having the subjects— volunteer and nonvolunteer alike—complete an opinion survey that was represented to them as a national opinion poll of college students being conducted by the University of Wisconsin. The third phase was carried out one month later. At that time a third experimenter, who was represented as an employee of the Boston University Communication Research Center, had the subjects choose between two alternative ideas whose importance they had evaluated in Phase II. The ideas included, for example, that there should be more no-grade courses at universities, that students should unionize in order to gain a more powerful voice in running the university, that there should be courses in the use and control of hallucinatory drugs, and that there should be courses in sex education. For half the subjects a condition of high importance was created by informing them that their choices would be taken into consideration by the administration in selecting one idea to be instituted at Boston University the following year. The remaining subjects, constituting a low importance condition, were simply instructed to choose one of the two proffered alternatives, but no information was conveyed to them which would have implied that their decisions had any practical importance. 
Within each of these treatments, attractiveness was manipulated by having approximately half the subjects choose between alternatives that had earlier been rated either close together (high relative attractiveness of the unchosen alternative) or far apart (low attractiveness). As predicted by cognitive dissonance theory, there was a greater spreading apart of the choice alternatives when the subjects re-evaluated their importance under conditions of high versus low manipulated importance (p = .11) and high versus low attractiveness (p < .001). This first dependent variable, however—the spreading apart of the choice alternatives after the subject had irrevocably decided to choose one of the two ideas—was not directly influenced by the volunteer variable nor by any significant interaction of volunteering and either decisional importance or attractiveness. Clearly, then, volunteering does not consistently interact with other independent variables to affect experimental outcomes. It would appear that the deception may have functioned much as the two-sided communication theoretically did in the preceding experiment, in effect removing or disguising demand characteristics that might otherwise favor one direction of response over another.

A second dependent variable, ratings of the survey in Phase III, was employed by Marmer as a check on the success of the manipulation of perceived importance. Somewhat surprisingly, volunteers saw the Boston University survey as less important than did the nonvolunteers. In addition, volunteer status showed a tendency (p < .08) to interact with the experimental manipulation of the importance of the subjects’ rating decisions. Nonvolunteers were more affected than volunteers by that manipulation, a finding that may weaken somewhat our hypothesis that volunteers are more sensitive and accommodating to the perceived demand characteristics of the situation.
Conclusions

We began this chapter with McNemar’s lament that ours is a science of sophomores. We conclude this chapter with the question of whether McNemar was too generous. Often ours seems to be a science of just those sophomores who volunteer to participate in our research and who also keep their appointment with the investigator.

Our purpose in this chapter has been to summarize what has been learned about the act of volunteering and the more or less stable characteristics of those people who are likely to find their way into the role of data-contributor in behavioral research. Later in the chapter we considered the implications of volunteer bias for the representativeness of descriptive statistics and for the nature of the relationships found between two or more variables in behavioral research.

The act of volunteering was viewed as a nonrandom event, determined in part by more general situational variables and in part by more specific personal attributes of the person asked to participate as subject in behavioral research. More general situational variables postulated as increasing the likelihood of volunteering included the following:

1. Having only a relatively less attractive alternative to volunteering.
2. Increasing the intensity of the request to volunteer.
3. Increasing the perception that others in a similar situation would volunteer.
4. Increasing acquaintanceship with, the perceived prestige of, and liking for the experimenter.
5. Having greater intrinsic interest in the subject matter being investigated.
6. Increasing the subjective probability of subsequently being favorably evaluated or not unfavorably evaluated by the experimenter.

On the basis of studies conducted both in the laboratory and in the field, it seemed reasonable to postulate with some confidence that the following characteristics would be found more often among people who volunteer than among those who do not volunteer for behavioral research:

1. Higher educational level,
2. Higher occupational status,
3. Higher need for approval,
4. Higher intelligence,
5. Lower authoritarianism.

With less confidence we can also postulate that more often than nonvolunteers, volunteers tend to be:

6. More sociable,
7. More arousal seeking,
8. More unconventional,
9. More often firstborn,
10. Younger.
Two additional and somewhat more complicated relationships may also be postulated: (a) In survey-type research volunteers tend to be better adjusted than nonvolunteers, but in medical research volunteers tend to be more maladjusted than nonvolunteers. (b) For standard tasks women tend to volunteer more than men, but for unusual tasks women tend to volunteer less than men. These more complicated relationships illustrate the likelihood that there may often be variables that
complicate the nature of the relationship between the act of volunteering and various personal characteristics. Two such moderating variables appear to be the sex of the subject and the nature of the task for which volunteering is requested. Our survey suggests that those who volunteer for behavioral research often differ in significant ways from those who do not volunteer. Most of the research that is summarized here tends to underestimate the effect of these differences on data obtained from volunteer subjects. In most of the studies, comparisons were made only between those who indicated that they would participate as research subjects versus those who indicated that they would not. However, there is considerable evidence to suggest that, of those who volunteer, a substantial proportion will never contribute their responses to the data pool. The evidence suggests that these ‘‘no-shows’’ are more like nonvolunteers than they are like the volunteers who keep their appointments. Therefore, comparing nonvolunteers with verbal volunteers is really comparing nonvolunteers with some other nonvolunteers mixed in unknown proportion with true volunteers. Differences found between nonvolunteers and verbal volunteers will, therefore, underestimate differences between those who do, and do not, contribute data to the behavioral researcher. To the extent that true volunteers differ from nonvolunteers, the employment of volunteer samples can lead to seriously biased estimates of various population parameters. In addition, however, there is the possibility that volunteer status may interact with experimental variables in such a way as to increase the probability of inferential errors of the first and second kind. The direct empirical evidence at this time is rather scanty and equivocal, but there are indirect, theoretical considerations that suggest the possibility that volunteers may more often than nonvolunteers provide data that support the investigator’s hypothesis.
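The underestimation argument made above for "no-shows" can be written as a one-line formula. Suppose, purely for illustration, that a proportion q of verbal volunteers are no-shows who score exactly like nonvolunteers. The mean of the verbal volunteers is then (1 − q) times the true-volunteer mean plus q times the nonvolunteer mean, so the observed gap shrinks to (1 − q) times the true gap. The function name and the numbers below are hypothetical.

```python
def observed_difference(true_difference: float, no_show_rate: float) -> float:
    """Expected gap between verbal volunteers and nonvolunteers.

    Assumes, for illustration only, that no-shows score exactly like
    nonvolunteers, so they contribute nothing to the observed gap.
    """
    if not 0.0 <= no_show_rate <= 1.0:
        raise ValueError("no_show_rate must be between 0 and 1")
    return (1.0 - no_show_rate) * true_difference


# Hypothetical example: a true 10-point gap with one verbal volunteer in four
# failing to keep the appointment.
print(observed_difference(true_difference=10.0, no_show_rate=0.25))  # 7.5
```

The same reasoning applies to the comparison of volunteers with conscripted subjects in the Hayes, Meltzer, and Lundberg study, where the conscripted group mixes would-be volunteers with true nonvolunteers in unknown proportion.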
References

Abeles, N., Iscoe, I., and Brown, W. F. Some factors influencing the random sampling of college students. Public Opinion Quarterly, 1954–1955, 18, 419–423.
Altus, W. D. Birth order and its sequelae. Science, 1966, 151, 44–49.
Aronson, E., Carlsmith, J. M., and Darley, J. M. The effects of expectancy on volunteering for an unpleasant experience. Journal of Abnormal and Social Psychology, 1963, 66, 220–224.
Beach, F. A. The snark was a boojum. American Psychologist, 1950, 5, 115–124.
Beach, F. A. Experimental investigations of species specific behavior. American Psychologist, 1960, 15, 1–18.
Bean, W. B. The ethics of experimentation on human beings. In S. O. Waife and A. P. Shapiro (Eds.), The clinical evaluation of new drugs. New York: Hoeber-Harper, 1959, 76–84.
Bell, C. R. Psychological versus sociological variables in studies of volunteer bias in surveys. Journal of Applied Psychology, 1961, 45, 80–85.
Bell, C. R. Personality characteristics of volunteers for psychological studies. British Journal of Social and Clinical Psychology, 1962, 1, 81–95.
Belson, W. A. Volunteer bias in test-room groups. Public Opinion Quarterly, 1960, 24, 115–126.
Bennett, Edith B. Discussion, decision, commitment and consensus in “group decision”. Human Relations, 1955, 8, 251–273.
Benson, L. E. Mail surveys can be valuable. Public Opinion Quarterly, 1946, 10, 234–241.
Benson, S., Booman, W. P., and Clark, K. E. A study of interview refusal. Journal of Applied Psychology, 1951, 35, 116–119.
Blake, R. R., Berkowitz, H., Bellamy, R. Q., and Mouton, Jane S. Volunteering as an avoidance act. Journal of Abnormal and Social Psychology, 1956, 53, 154–156.
Boucher, R. G., and Hilgard, E. R. Volunteer bias in hypnotic experimentation. American Journal of Clinical Hypnosis, 1962, 5, 49–51.
The Volunteer Subject
89
Brady, J. P., Levitt, E. E., and Lubin, B. Expressed fear of hypnosis and volunteering behavior. Journal of Nervous and Mental Disease, 1961, 133, 216–217. Brightbill, R., and Zamansky, H. S. The conceptual space of good and poor hypnotic subjects: a preliminary exploration. International Journal of Clinical and Experimental Hypnosis, 1963, 11, 112–121. Brock, T. C., and Becker, G. Birth order and subject recruitment. Journal of Social Psychology, 1965, 65, 63–66. Brower, D. The role of incentive in psychological research. Journal of General Psychology, 1948, 39, 145–147. Brunswik, E. Perception and the representative design of psychological experiments. Berkeley: University of California Press, 1956. Burchinal, L. G. Personality characteristics and sample bias. Journal of Applied Psychology, 1960, 44, 172–174. Capra, P. C., and Dittes, J. E. Birth order as a selective factor among volunteer subjects. Journal of Abnormal and Social Psychology, 1962, 64, 302. Christie, R. Experimental naı¨vete´ and experiential naı¨vete´. Psychological Bulletin, 1951, 48, 327–339. Clark, K. E. and Members of the Panel on Privacy and Behavioral Research. Science, 1967, 155, 535–538. Clausen, J. A., and Ford, R. N. Controlling bias in mail questionnaires. Journal of the American Statistical Association, 1947, 42, 497–511. Cochran, W. G, Mosteller, F., and Tukey, J. W. Statistical problems of the Kinsey report. Journal of the American Statistical Association, 1953, 48, 673–716. Coffin, T. E. Some conditions of suggestion and suggestibility. Psychological Monographs, 1941, 53, No. 4 (Whole No. 241). Crowne, D. P., and Marlowe, D. The approval motive. New York: Wiley, 1964. Dittes, J. E. Birth order and vulnerability to differences in acceptance. American Psychologist, 1961, 16, 358. (Abstract). Edgerton, H. A., Britt, S. H., and Norman, R. D. Objective differences among various types of respondents to a mailed questionnaire. American Sociological Review, 1947, 12, 435–444. Edwards, C. 
N. Characteristics of volunteers and nonvolunteers for a sleep and hypnotic experiment. American Journal of Clinical Hypnosis, 1968, 11, 26–29. Esecover, H., Malitz, S., and Wilkens, B. Clinical profiles of paid normal subjects volunteering for hallucinogenic drug studies. American Journal of Psychiatry, 1961, 117, 910–915. Foster, R. J. Acquiescent response set as a measure of acquiescence. Journal of Abnormal and Social Psychology, 1961, 63, 155–160. Franzen, R., and Lazarsfeld, P. F. Mail questionnaire as a research problem. Journal of Psychology, 1945, 20, 293–320. French, J. R. P. Personal communication. August 19, 1963. Frey, A. H., and Becker, W. C. Some personality correlates of subjects who fail to appear for experimental appointments. Journal of Consulting Psychology, 1958, 22, 164. Frye, R. L., and Adams, H. E. Effect of the volunteer variable on leaderless group discussion experiments. Psychological Reports, 1959, 5, 184. Gaudet, H., and Wilson, E. C. Who escapes the personal investigator? Journal of Applied Psychology, 1940, 24, 773–777. Green, D. R. Volunteering and the recall of interrupted tasks. Journal of Abnormal and Social Psychology, 1963, 66, 397–401. Greene, E. B. Abnormal adjustments to experimental situations. Psychological Bulletin, 1937, 34, 747–748. (Abstract). Hayes, D. P., Meltzer, L., and Lundberg, Signe. Information distribution, interdependence, and activity levels. Sociometry, 1968, 31, 162–179. Heilizer, F. An exploration of the relationship between hypnotizability and anxiety and/or neuroticism. Journal of Consulting Psychology, 1960, 24, 432–436. Hilgard, E. R. Personal communication. February 6, 1967. Hilgard, E. R., Weitzenhoffer, A. M., Landes, J., and Moore, Rosemarie K. The distribution of susceptibility to hypnosis in a student population: a study using the Stanford Hypnotic Susceptibility Scale. Psychological Monographs, 1961, 75, 8 (Whole No. 512).
90
Book One – Artifact in Behavioral Research Himelstein, P. Taylor scale characteristics of volunteers and nonvolunteers for psychological experiments. Journal of Abnormal and Social Psychology, 1956, 52, 138–139. Hood, T. C. The volunteer subject: patterns of self-presentation and the decision to participate in social psychological experiments. Unpublished master’s thesis, Duke University, 1963. Hood, T. C., and Back, K. W. Patterns of self-disclosure and the volunteer: the decision to participate in small groups experiments. Paper read at Southern Sociological Society, Atlanta, April, 1967. Howe, E. S. Quantitative motivational differences between volunteers and nonvolunteers for a psychological experiment. Journal of Applied Psychology, 1960, 44, 115–120. Hyman, H., and Sheatsley, P. B. The scientific method. In D. P. Geddes (Ed.), An Analysis of the Kinsey Reports. New York: New American Library, 1954, 93–118. Jackson, C. W., and Pollard, J. C. Some nondeprivation variables which influence the ‘‘effects’’ of experimental sensory deprivation. Journal of Abnormal Psychology, 1966, 71, 383–388. Kavanau, J. L. Behavior: confinement, adaptation, and compulsory regimes in laboratory studies. Science, 1964, 143, 490. Kavanau, J. L. Behavior of captive white-footed mice. Science, 1967, 155, 1623–1639. King, A. F. Ordinal position and die Episcopal Clergy. Unpublished bachelor’s thesis, Harvard University, 1967. Kruglov, L. P., and Davidson, H. H. The willingness to be interviewed: a selective factor in sampling. Journal of Social Psychology, 1953, 38, 39–47. Larson, R. F., and Catton, W. R., Jr. Can the mail-back bias contribute to a study’s validity? American Sociological Review, 1959, 24, 243–245. Lasagna, L., and von Felsinger, J. M. The volunteer subject in research. Science, 1954, 120, 359–361. Leipold, W. D., and James, R. L. Characteristics of shows and no-shows in a psychological experiment. Psychological Reports, 1962, 11, 171–174. Levitt, E. 
E., Lubin, B., and Brady, J. P. The effect of the pseudovolunteer on studies of volunteers for psychology experiments. Journal of Applied Psychology, 1962, 46, 72–75. Levitt, E. E., Lubin, B., and Zuckerman, M. Note on the attitude toward hypnosis of volunteers and nonvolunteers for an hypnosis experiment. Psychological Reports, 1959, 5, 712. Levitt, E. E., Lubin, B., and Zuckerman, M. The effect of incentives on volunteering for an hypnosis experiment. International Journal of Clinical and Experimental Hypnosis, 1962, 10, 39–41. Locke, H. J. Are volunteer interviewees representative? Social Problems, 1954, 1, 143–146. London, P. Subject characteristics in hypnosis research: Part I. A survey of experience, interest, and opinion. International Journal of Clinical and Experimental Hypnosis, 1961, 9, 151–161. London, P., Cooper, L. M., and Johnson, H. J. Subject characteristics in hypnosis research. II. Attitudes towards hypnosis, volunteer status, and personality measures. III. Some correlates of hypnotic susceptibility. International Journal of Clinical and Experimental Hypnosis, 1962, 10, 13–21. London, P., and Rosenhan, D. Personality dynamics. Annual Review of Psychology, 1964, 15, 447–492. Lubin, B., Brady, J. P., and Levitt, E. E. A comparison of personality characteristics of volunteers and nonvolunteers for hypnosis experiments. Journal of Clinical Psychology, 1962, 18, 341–343. (a) Lubin, B., Brady, J. P., and Levitt, E. E. Volunteers and nonvolunteers for an hypnosis experiment. Diseases of the Nervous System, 1962, 23, 642–643. (b) Lubin, B., Levitt, E. E., and Zuckerman, M. Some personality differences between responders and nonresponders to a survey questionnaire. Journal of Consulting Psychology, 1962, 26, 192. Luchins, A. S. Primacy-recency in impression formation. In C. I. Hovland et al., The order of presentation in persuasion. New Haven: Yale University Press, 1957, 33–61. Marmer, Roberta S. The effects of volunteer status on dissonance reduction. 
Unpublished master’s thesis, Boston University, 1967. Martin, R. M., and Marcuse, F. L. Characteristics of volunteers and nonvolunteers for hypnosis. Journal of Clinical and Experimental Hypnosis, 1957, 5, 176–180. Martin, R. M., and Marcuse, F. L. Characteristics of volunteers and nonvolunteers in psychological experimentation. Journal of Consulting Psychology, 1958, 22, 475–479. Maslow, A. H. Self-esteem (dominance feelings) and sexuality in women. Journal of Social Psychology, 1942, 16, 259–293. Maslow, A. H., and Sakoda, J. M. Volunteer-error in the Kinsey study. Journal of Abnormal and Social Psychology, 1952 47, 259–262.
The Volunteer Subject
91
Matthysse, S. W. Differential effects of religious communications. Unpublished doctoral dissertation, Harvard University, 1966. McDavid, J. W. Approval-seeking motivation and the volunteer subject. Journal of Personality and Social Psychology, 1965, 2, 115–117. McNemar, Q. Opinion-attitude methodology. Psychological Bulletin, 1946, 43, 289–374. Miller, S. E. Psychology experiments without subjects’ consent. Science, 1966, 152, 15. Myers, T. I., Murphy, D. B., Smith, S., and Goffard, S. J. Experimental studies of sensory deprivation and social isolation. Technical Report 66–8, Contract DA 44–188–ARO–2, HumRRO, Washington, D.C.: George Washington University, 1966. Newman, M. Personality differences between volunteers and nonvolunteers for psychological investigations. (Doctoral dissertation, New York University School of Education) Ann Arbor, Mich.: University Microfilms, 1956, No. 19,999. Norman, R. D. A review of some problems related to the mail questionnaire technique. Educational and Psychological Measurement, 1948, 8, 235–247. Ora, J. P., Jr. Characteristics of the volunteer for psychological investigations. Technical Report, No. 27, November, 1965, Vanderbilt University, Contract Nonr 2149 (03). Ora, J. P., Jr. Personality characteristics of college freshman volunteers for psychological experiments. Unpublished master’s thesis, Vanderbilt University, 1966. Orlans, H. Developments in federal policy toward university research. Science, 1967, 155, 665–668. Pace, C. R. Factors influencing questionnaire returns from former university students. Journal of Applied Psychology, 1939, 23, 388–397. Pan, Ju-Shu. Social characteristics of respondents and non-respondents in a questionnaire study of later maturity. Journal of Applied Psychology, 1951, 35, 120–121. Perlin, S., Pollin, W., and Butler, R. N. The experimental subject: 1. The psychiatric evaluation and selection of a volunteer population. 
American Medical Association Archives of Neurology and Psychiatry, 1958, 80, 65–70. Pollin, W., and Perlin, S. Psychiatric evaluation of ‘‘normal control’’ volunteers. American Journal of Psychiatry, 1958, 115, 129–133. Poor, D. The social psychology of questionnaires. Unpublished bachelor’s thesis, Harvard University, 1967. Reuss, C. F. Differences between persons responding and not responding to a mailed questionnaire. American Sociological Review, 1943, 8, 433–438. Richards, T. W. Personality of subjects who volunteer for research on a drug (mescaline). Journal of Projective Techniques, 1960, 24, 424–428. Richter, C. P. Rats, man, and the welfare state. American Psychologist, 1959, 14, 18–28. Riecken, H. W. A program for research on experiments in social psychology. In N. F. Washburne, (Ed.), Decisions, values and groups. Vol. II. New York: Pergamon, 1962, 25–41. Riggs, Margaret M., and Kaess, W. Personality differences between volunteers and nonvolunteers. Journal of Psychology, 1955, 40, 229–245. Robins, Lee N. The reluctant respondent. Public Opinion Quarterly, 1963, 27, 276–286. Rokeach, M. Psychology experiments without subjects’ consent. Science, 1966, 152, 15. Rosen, E. Differences between volunteers and non-volunteers for psychological studies. Journal of Applied Psychology, 1951, 35, 185–193. Rosenbaum, M. E. The effect of stimulus and background factors on the volunteering response. Journal of Abnormal and Social Psychology, 1956, 53, 118–121. Rosenbaum, M. E., and Blake, R. R. Volunteering as a function of field structure. Journal of Abnormal and Social Psychology, 1955, 50, 193–196. Rosenhan, D. On the social psychology of hypnosis research. In J. E. Gordon (Ed.), Handbook of clinical and experimental hypnosis. New York: Macmillan, 1967, 481–510. Rosenthal, R. The volunteer subject. Human Relations, 1965, 18, 389–406. Rosenthal, R. Experimenter effects in behavioral research. New York: Appleton-Century-Crofts, 1966. Rosnow, R. L., and Rosenthal, R. 
Volunteer subjects and the results of opinion change studies. Psychological Reports, 1966, 19, 1183–1187. Rosnow; R. L., and Rosenthal, R. Unpublished data (described in this chapter), 1967. Ruebhausen, O. M., and Brim, O. G. Privacy and behavioral research. American Psychologist, 1966, 21, 423–437. Schachter, S. The psychology of affiliation. Stanford, Calif.: Stanford University Press, 1959.
92
Book One – Artifact in Behavioral Research Schachter, S., and Hall, R. Group-derived restraints and audience persuasion. Human Relations, 1952, 5, 397–406. Scheier, I. H. To be or not to be a guinea pig: preliminary data on anxiety and the volunteer for experiment. Psychological Reports, 1959, 5, 239–240. Schubert, D. S. P. Arousal seeking as a motivation for volunteering: MMPI scores and centralnervous-system-stimulant use as suggestive of a trait. Journal of Projective Techniques and Personality Assessment, 1964, 28, 337–340. Schultz, D. P. Birth order of volunteers for sensory restriction research. Journal of Social Psychology, 1967, 73, 71–73. (a) Schultz, D. P. Sensation-seeking and volunteering for sensory deprivation. Paper read at Eastern Psychological Association, Boston, April, 1967. (b) Schultz, D. P. The volunteer subject in sensory restriction research. Journal of Social Psychology, 1967, 72, 123–124. (c) Shuttleworth, F. K. Sampling errors involved in incomplete returns to mail questionnaires. Psychological Bulletin, 1940, 37, 437. (Abstract) Siegman, A. Responses to a personality questionnaire by volunteers and nonvolunteers to a Kinsey interview. Journal of Abnormal and Social Psychology, 1956, 52, 280–281. Silverman, I. Note on the relationship of self-esteem to subject self-selection. Perceptual and Motor Skills, 1964, 19, 769–770. Smart, R. G. Subject selection bias in psychological research. Canadian Psychologist, 1966, 7a, 115–121. Stanton, F. Notes on the validity of mail questionnaire returns. Journal of Applied Psychology, 1939, 23, 95–104. Staples, F. R., and Walters, R. H. Anxiety, birth order, and susceptibility to social influence. Journal of Abnormal and Social Psychology, 1961, 62, 716–719. Suchman, E., and McCandless, B. Who answers questionnaires? Journal of Applied Psychology, 1940, 24, 758–769. Suedfeld, P. Birth order of volunteers for sensory deprivation. Journal of Abnormal and Social Psychology, 1964, 68, 195–196. Varela, J. A. 
A cross-cultural replication of an experiment involving birth order. Journal of Abnormal and Social Psychology, 1964, 69, 456–457. Wallin, P. Volunteer subjects as a source of sampling bias. American Journal of Sociology, 1949, 54, 539–544. Ward, C. D. A further examination of birth order as a selective factor among volunteer subjects. Journal of Abnormal and Social Psychology, 1964, 69, 311–313. Warren, J. R. Birth order and social behavior. Psychological Bulletin, 1966, 65, 38–49. Weiss, J. M., Wolf, A., and Wiltsey, R. G. Birth order, recruitment conditions, and preferences for participation in group versus non-group experiments. American Psychologist, 1963, 18, 356. (Abstract) Wicker, A. W. Requirements for protecting privacy of human subjects: some implications for generalization of research findings. American Psychologist, 1968, 23, 70–72. Wilson, P. R., and Patterson, J. Sex differences in volunteering behavior. Psychological Reports, 1965, 16, 976. Wolf, A., and Weiss, J. H. Birth order, recruitment conditions, and volunteering preference. Journal of Personality and Social Psychology, 1965, 2, 269–273. Wolfensberger, W. Ethical issues in research with human subjects. Science, 1967, 155, 47–51. Wolfgang, A. Sex differences in abstract ability of volunteers and nonvolunteers for concept learning experiments. Psychological Reports, 1967, 21, 509–512. Wolfle, D. Research with human subjects. Science, 1960, 132, 989. Wrightsman, L. S. Predicting college students’ participation in required psychology experiments. American Psychologist, 1966, 21, 812–813. Zamansky, H. S., and Brightbill, R. F. Attitude differences of volunteers and nonvolunteers and of susceptible and nonsusceptible hypnotic subjects. International Journal of Clinical and Experimental Hypnosis, 1965, 13, 279–290. Zimmer, H. Validity of extrapolating nonresponse bias from mail questionnaire follow-ups. Journal of Applied Psychology, 1956, 40, 117–121. Zuckerman, M., Schultz, D. P., and Hopkins, T. 
R. Sensation seeking and volunteering for sensory deprivation and hypnosis experiments. Journal of Consulting Psychology, 1967, 31, 358–363.
4 Pretest Sensitization¹
Robert E. Lana
Temple University
When Max Planck sought an explanation for heat radiating from a black body at high temperatures, he focused not upon radiation per se but upon the radiating atom and thus became one of the men most responsible for beginning a line of thought and research which ended in the formulation of quantum theory. The development of quantum theory eventually led to major reconceptions in the field of physics, which challenged the Newtonian models that were then predominant.

The wave-particle controversy was recognized as a result of the work of Schrödinger and others, and this provided a context for an interpretation of quantum theory and, within that context, for the recognition of what was called the principle of indeterminacy. It is possible to speak of the position and the velocity of an electron as one would in Newtonian mechanics, and one can observe and measure both of these quantities. However, one cannot determine both quantities simultaneously with a limitless degree of accuracy. Relations between quantities such as these are called relations of uncertainty, or indeterminacy. Similar relations can be formulated for other experimental situations. The wave and particle theories of radiation, two complementary explanations of the same phenomenon, were interpreted in such a manner. There were limitations to the use of both the wave and the particle concept. These limitations are expressed by the uncertainty relations, and hence any apparent contradiction between the two interpretations disappears.

The idea of uncertainty in physics can best be illustrated by a Gedanken (theoretical) experiment given by Heisenberg (1958, 47). ‘‘One could argue that it should at least be possible to observe the electron in its orbit. One should simply look at the atom through a microscope of a very high resolving power, then one would see the electron moving in its orbit.
Such a high resolving power could to be sure not be obtained by a microscope using ordinary light, since the inaccuracy of the measurement of the position can never be smaller than the wave length of the light. But a microscope using gamma rays with a wave length smaller than the size of the atom would do. . . . The position of the electron will be known with an accuracy given by the wave length of the gamma ray. The electron may have been practically at rest before the observation. But in the act of observation [italics mine] at least one light quantum of the gamma ray must have passed the microscope and must first have been deflected by the electron. Therefore, the electron has been pushed by the light quantum, it has changed its momentum and velocity, and one can show that the uncertainty of this change is just big enough to guarantee the validity of the uncertainty relations.’’

It is evident from this Gedanken experiment that the very act of measurement negated the possibility of observing the phenomenon as it would have occurred had it not been observed. It is important to note, however, that we are dealing with phenomena at the limits of physical existence, namely those of subatomic physics. Measurement of physical activity farther from this limit (or away from the limit of infinite space and time at the other end of the continuum) is not so sensitive to the influence of the measuring instrument or technique (as, for example, when one measures the speed and position of a freely falling object at sea level). Heisenberg states, ‘‘The measuring device deserves this name only if it is in close contact with the rest of the world, if there is an interaction between the device and the observer . . . . If the measuring device would be isolated from the rest of the world, it would be neither a measuring device nor could it be described in the terms of classical physics at all.’’

One of the implications for life sciences of the interpretation of quantum theory through appeal to uncertainty relations has been pointed out by Niels Bohr (discussed by Heisenberg, 1958, 104–105). He noted that our knowledge of a cell’s being alive may be dependent upon our complete knowledge of its molecular structure. Such a complete knowledge may be achievable only by operations which would destroy the life of the cell. It is, therefore, logically possible that life precludes the complete determination of its underlying physicochemical nature.

¹ Many of the studies done by the author and reported in this chapter were supported by the National Institute of Mental Health, United States Public Health Service.
The Hawthorne Studies

In 1927, Mayo, Roethlisberger, Whitehead, and Dickson (Roethlisberger and Dickson, 1939) began a series of studies in the Hawthorne plant of the Western Electric Company. That series not only launched modern industrial psychology on its current path, but also introduced the idea that the process of measurement in social psychological situations can influence what is being measured and change its characteristics. For our purposes the most pertinent results of these studies are those which are perhaps most general. The original aim of the studies was to examine the effects on production of such work conditions as illumination, temperature, hours of work, rest periods, wage rate, etc. Six female workers were observed. The interesting result was that their production increased no matter what the manipulation. Whether hours of work or rest periods were increased or decreased, production always increased. The reason given by the authors for this effect was that the women felt honored at being chosen for the experiment. They felt that they were a team and worked together for the benefit of the group as a whole.

What I wish to emphasize is that from the point of view of the experimenter, the fact of measurement changed not only the magnitude of the dependent variable (rate of production), but the very nature of the social situation as well. The principle of indeterminacy found in the physical situation is, at least analogously, operating in this social psychological situation. One finds a definite relationship between the observational process of the experimenter and the natural process of the subject. If we narrow the context of our inquiry to the effect of any specific device designed to measure some relevant characteristics of the organism, we will have arrived at the principal point of departure of this paper.
Current Methodology

In much psychological research, the relevant state of the organism must be determined before an experimental treatment is applied. This is necessary since all psychological experiments are designed to test a hypothesis of change from an initial state of the organism to some other state as a result of an experimental treatment. Therefore, some assessment of the magnitude of a given variable is necessary prior to the administration of the experimental treatment.

One may legitimately raise the question of why an experimental hypothesis of change needs to be examined by assessing the value of the dependent variable prior to treatment through the use of a pretest. It is certainly possible to substitute a randomization design for a pretest design. By randomly selecting subjects from a defined population and by randomly assigning them to the various experimental treatments in a given study, one may assume the comparability of these subjects. The scores of the various groups are then directly comparable to one another, and hence a pretest is unnecessary. However, there are some reasons why the use of a pretest is preferable to a randomization design. Given a constant N, the use of a pretest will often increase the precision of measurement by controlling for individual differences within subgroups. In addition, should there be a ‘‘failure’’ of randomization, comparison of the subgroups’ pretest means will tell us so. Of course, it is also possible that a pretest might be administered to detect differences in initial performance, so that the effects of some experimental manipulation taking into account these differences can be examined. However, this has not been of typical interest to researchers utilizing pretest designs in attitudinal studies.
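The precision argument above can be illustrated with a back-of-the-envelope sketch. It assumes, as a simplification not stated in the source, that posttest scores are linearly related to pretest scores with a known correlation and that the posttest is adjusted for the pretest by regression; the function name and the numbers are hypothetical.

```python
import math


def se_of_mean_difference(sd, n_per_group, pretest_posttest_r=0.0):
    """Approximate standard error of the difference between two group means.

    With pretest_posttest_r = 0 this is the posttest-only (randomization)
    case. A nonzero correlation models adjusting the posttest for the
    pretest, which leaves residual variance sd**2 * (1 - r**2). The small
    loss of degrees of freedom from the adjustment is ignored here.
    """
    if not -1.0 <= pretest_posttest_r <= 1.0:
        raise ValueError("correlation must lie between -1 and 1")
    residual_variance = sd ** 2 * (1.0 - pretest_posttest_r ** 2)
    return math.sqrt(2.0 * residual_variance / n_per_group)


# Hypothetical values: SD of 10 and 25 subjects per group.
posttest_only = se_of_mean_difference(10.0, 25)
with_pretest = se_of_mean_difference(10.0, 25, pretest_posttest_r=0.6)
print(round(posttest_only, 3), round(with_pretest, 3))
```

With N held constant, the stronger the pretest-posttest correlation, the smaller the residual variance and the sharper the comparison between groups. With a correlation near zero the pretest buys almost nothing.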
The principal point is that we are interested in demonstrating empirically that a given treatment either succeeds or does not succeed in changing some existing variable in the organism (such as an opinion), and the most direct way of establishing such a fact is to test that variable before and after the application of the treatment. The ideal experiment is one in which the relevant pre-experimental state of the organism is determined without affecting that state by the very measuring process itself. Unfortunately, it is rarely possible to achieve this aim, since it is almost always necessary to manipulate the environment of the subject in some way in order to obtain the measurement. However, it is not impossible to do so, as, for example, in a situation where the subject is not aware that he is being observed and his response recorded (cf. Campbell, 1967). Several control groups, and thus various types of pretreatment manipulation of the subject, are usually necessary in most studies. The intent of this paper is to examine the nature of these pretreatment measures for sources of artifact which disrupt the legitimacy of the conclusions it is possible to draw within the context of social psychology. Since several experimental situations found in psychological research require some preliminary manipulation involving the subject before the treatment can be applied, appropriate controls are necessary to isolate all possible sources of variation. Any manipulation of the subject or of his environment by the experimenter prior to the advent of the experimental treatment, which is to be followed by some measure of
performance, allows for the possibility that the result is due either to the effect of the treatment or to the interaction of the treatment with the prior manipulation. A control group is needed to which is presented the prior manipulation followed by the measure of performance, without the treatment intervening. This control group can then be compared with the experimental group receiving prior manipulation, treatment, and the measure of performance. Should there be a significant difference between the two groups, one may reach a conclusion as to the relative effectiveness of the two methods for increasing or decreasing performance. This control is diagrammed in Table 4-1. Even though the application of this design allows one to make a direct comparison between groups and yields an evaluation of the effect on performance of the prior manipulation and the treatment, it is to be noted that there is no logical possibility of evaluating the effect of the treatment alone on the measure of performance. In order to do this, a second control group must be added. The revised design is shown in Table 4-2. The second control group (Group III), which presents the subject with the treatment and follows this with the measure of performance, now permits us to examine not only the effects of prior manipulation on the measure of performance, and the combined effects of prior manipulation and treatment, but also the effect of the treatment alone. Thus, three possible comparisons may now be made, Group I with Group II, Group I with Group III, and Group II with Group III. However, there is still one source of variation which remains unaccounted for in this design. Conceivably the performance of the subject alone might be quite similar in magnitude to his performance under any or all of the conditions contained in Groups I, II, and III. 
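The three comparisons just listed can be enumerated mechanically. The sketch below encodes only what each of Groups I, II, and III receives before the measure of performance, and reports the components on which each pair of groups differs; the names are illustrative, not part of the original design notation.

```python
from itertools import combinations

# Components each group receives before the measure of performance
# (every group receives the measure of performance itself).
DESIGN = {
    "I": frozenset({"prior manipulation", "treatment"}),
    "II": frozenset({"prior manipulation"}),
    "III": frozenset({"treatment"}),
}


def differing_components(design, group_a, group_b):
    """Return the components received by exactly one of the two groups."""
    return design[group_a] ^ design[group_b]


def all_comparisons(design):
    """Map each pair of groups to the components on which they differ."""
    return {
        (a, b): differing_components(design, a, b)
        for a, b in combinations(sorted(design), 2)
    }


for (a, b), diff in all_comparisons(DESIGN).items():
    print(f"Group {a} vs Group {b}: differ in {sorted(diff)}")
```

Read this way, Group I versus Group II differ only in the treatment, Group I versus Group III differ only in the prior manipulation, and Group II versus Group III differ in both, so that last comparison cannot be attributed to a single component.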
In order to examine this hypothesis, as indicated in Table 4-3, a third and final control group must be added to the design.² The design is now complete and all possible effects on the dependent variable of prior manipulation and treatment have been accounted for. Obviously, the design can be further complicated if the time intervals among the prior manipulation, treatment, and measure of performance are varied. However, the principle of control remains the same.

Table 4–1 Control for Effect of Prior Manipulation

I                         II
Prior Manipulation        Prior Manipulation
Treatment                 Measure of Performance
Measure of Performance

Table 4–2 Controls for Effect of Prior Manipulation and Its Interaction with the Treatment

I                         II
Prior Manipulation        Prior Manipulation
Treatment                 Measure of Performance
Measure of Performance

III
Treatment
Measure of Performance

Table 4–3 Controls for Effects of Prior Manipulation, and Its Interaction with the Treatment and Existing Magnitude of Performance in the Subject

I                         II
Prior Manipulation        Prior Manipulation
Treatment                 Measure of Performance
Measure of Performance

III                       IV
Treatment                 Measure of Performance
Measure of Performance

² Solomon (1949), in a now classic paper, was the first to discuss these groups in systematic detail.

In order to apply the final design presented in Table 4-3, certain assumptions must be made concerning the distribution of subjects in the four groups involved. Subjects must be chosen at random from the population, and randomly assigned to one of the four groups. The assumption is that subjects assigned to any one group will be similar, in all relevant characteristics, to subjects assigned to any of the other groups. A special problem arises if the prior manipulation happens to be a test tapping some already existing quality in the organism, e.g., an opinion questionnaire regarding racial prejudice or a test examining achievement in knowledge of American History, instead of being a direct manipulation such as injecting a drug where no measurement is involved. Since Groups III and IV of Table 4-3 are not exposed to prior manipulation (e.g., an opinion questionnaire) there is no immediate assurance that the groups are initially homogeneous with respect to the opinion being tapped. Yet, it is necessary to arrange the groups as in Table 4-3 if one wishes to control for the effects of the three elements of the design. There are at least two possible solutions to this dilemma. One has been suggested by Solomon (1949). Since Groups I and II have pretest measures taken on them, it is possible to calculate the mean value and the standard deviation for both groups on their questionnaire scores. The mean of these means and the combined standard deviation of the two groups can then be assigned to Groups III and IV as the best estimates of the pretest scores of these groups in lieu of administering a questionnaire to them.
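The estimate just described can be sketched as follows. The phrase ‘‘combined standard deviation’’ is read here as the pooled within-group standard deviation; that reading is an assumption, since the text does not give a formula, and the scores are hypothetical.

```python
import math
from statistics import mean, stdev


def estimate_unpretested_baseline(pretest_group_1, pretest_group_2):
    """Estimate pretest mean and SD for groups that were never pretested.

    Returns the mean of the two pretested groups' means and a combined
    standard deviation. "Combined" is taken to mean the pooled
    within-group SD; this is an assumption, not the source's formula.
    """
    n1, n2 = len(pretest_group_1), len(pretest_group_2)
    if n1 < 2 or n2 < 2:
        raise ValueError("each pretested group needs at least two scores")
    mean_of_means = (mean(pretest_group_1) + mean(pretest_group_2)) / 2.0
    pooled_variance = (
        (n1 - 1) * stdev(pretest_group_1) ** 2
        + (n2 - 1) * stdev(pretest_group_2) ** 2
    ) / (n1 + n2 - 2)
    return mean_of_means, math.sqrt(pooled_variance)


# Hypothetical questionnaire scores for pretested Groups I and II.
group_1 = [12, 15, 11, 14, 13]
group_2 = [13, 16, 12, 15, 14]
est_mean, est_sd = estimate_unpretested_baseline(group_1, group_2)
print(est_mean, round(est_sd, 3))
```

The returned pair would then stand in for the missing pretest mean and standard deviation of Groups III and IV.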
It is then possible to examine the change from pretest mean scores to posttest (measurement of performance) mean scores for all groups, without actually having applied the pretests to Groups III and IV. The original means of Groups I and II are used in the analysis. Degrees of freedom utilized in any tests of significance should be those appropriate for each of the actual four groups. However, this method is tenuous if one has little information as to the comparability of the various groups of subjects. An alternative or complementary solution is to examine for comparability a large number of subjects from the pool from which the final selection of experimental subjects will be chosen. Thus, should there be available 500 comparable subjects from which we wish to choose 100 for our experiment, then the following procedure would be useful. Randomly choose the 100 subjects to be used in the four groups of the experiment. Randomly assign 25 subjects to each of the four treatment groups. Randomly assign two of these four groups to the two pretest conditions, and administer the pretest. Pretest the remaining 400 people in the original pool. Finally, assign the grand mean and grand standard deviation of all 450 pretested subjects to
Groups III and IV, the unpretested groups. (Lana, 1959, discusses this problem and provides a related example.)

There are essentially two types of prior manipulations that are utilized in our basic design. One type (Case I) we may designate as an experimental condition of some sort, such as receiving a jolt of electricity, swallowing a pill, or pressing a button. In this type of manipulation the measure of performance (posttest) is always different from the prior manipulation; the two never require the same task from the subject. Also the prior manipulation is not a measure of performance or of previous condition of the organism as it is, e.g., when opinion questionnaires, spelling tests, etc., are used. Essentially, Case I represents the situation where some pre-treatment applied to the subject is necessary in order to examine the effects of the principal treatment. Strictly speaking, no pretest is involved, but rather a part of the experimental treatment conceptualized as a pre-condition necessary for examination of the dependent variable. In Case II the prior manipulation requires the same kind of a performance from the subjects as does the posttest and is actually a part of the dependent variable (pretest-posttest change).

With the Case I type of pretreatment condition only random assignment of subjects to the initial groups can be used to assure homogeneity of subjects. When the prior manipulation is exactly the same task as the measure of performance (Case II), as, for example, when an opinion questionnaire is used for both, it is also possible to estimate pretest measures for groups which cannot be pretested. These procedures, discussed above, become extremely important for this type of situation although irrelevant when the prior manipulation is nonmensurative. In most of the situations with which we shall be concerned, Case II predominates.
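The pool-based procedure described earlier (draw 100 of 500 comparable subjects, form four groups of 25, pretest two of the groups plus the remaining 400 people, and assign the grand mean and grand standard deviation to Groups III and IV) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name, the `pool_scores` dictionary, and the simulated scores are hypothetical, and the sample standard deviation is used as the "grand standard deviation," a choice the text does not specify.

```python
import random
import statistics


def plan_pretest_estimates(pool_scores, n_experiment=100, seed=0):
    """Sketch of the pool-based estimate for the unpretested groups.

    pool_scores is a hypothetical dict mapping subject id -> the pretest
    score that subject gives when pretested. Only scores of pretested
    subjects are ever read, mirroring the procedure in the text.
    """
    rng = random.Random(seed)
    ids = list(pool_scores)

    # Randomly choose the experimental subjects from the pool.
    chosen = rng.sample(ids, n_experiment)
    chosen_set = set(chosen)
    remaining = [s for s in ids if s not in chosen_set]

    # Split the randomly ordered sample into four equal groups.
    size = n_experiment // 4
    blocks = [chosen[k * size:(k + 1) * size] for k in range(4)]

    # Randomly decide which two groups receive the pretest (I and II).
    rng.shuffle(blocks)
    design = dict(zip(["I", "II", "III", "IV"], blocks))

    # Pretest Groups I and II and everyone left in the pool.
    pretested = design["I"] + design["II"] + remaining
    scores = [pool_scores[s] for s in pretested]

    # Grand mean and (sample) standard deviation, to be assigned to
    # Groups III and IV as their estimated pretest values.
    return design, pretested, statistics.mean(scores), statistics.stdev(scores)


# Simulated pool of 500 subjects (made-up scores, for illustration only).
demo_rng = random.Random(1)
pool = {sid: demo_rng.gauss(50, 10) for sid in range(500)}
design, pretested, grand_mean, grand_sd = plan_pretest_estimates(pool)
```

With a pool of 500 and an experiment of 100, the estimate rests on 450 pretested subjects, as in the text, while no member of Groups III or IV is ever exposed to the pretest.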
Analysis of Case I Designs

Following Solomon’s article (Solomon, 1949) a good deal of work has been done in examining the effects on performance of prior manipulation in interaction with a succeeding treatment. A recent study by Ross, Krugman, Lyerly, and Clyde (1962) develops the four group design for use in certain types of psychopharmacological studies and provides an example of our Case I. Although there may be instances where a psychopharmacological study might include a prior manipulation which taps an existing attribute of the subject, most pretest-treatment-interaction designs in this area are of the type where the prior manipulation is some experimental condition and is therefore not a pretest and is not repeated in the posttest. The experiment by Ross et al. is of this latter type and illustrates the proper statistical analysis to be used with this kind of data. The design is contained in Table 4-4.

Table 4–4 Design of Experiment by Ross, Krugman, Lyerly, and Clyde (1962)
I                 II                   III                 IV
Pill              Pill                 No pill             No pill
Drug              No drug (Placebo)    Drug (disguised)    No drug
Task (tapping)    Task                 Task                Task
It is to be noted that this design fulfills exactly the conditions summarized in Table 4-3 for permitting the assessment of prior manipulation-treatment-interactions, the effects of prior manipulation alone, and the effects of task alone. It is also to be noted that the measure of performance (a tapping task) is different from the prior manipulation which is the swallowing of a pill. Since the prior manipulation in this case is not designed to tap any existing attribute of the individual, the usual random assignment of subjects to the various groups should be utilized. A double classification analysis of variance is the proper method of analysis in examining the various main and interaction effects. The main effects for drug and pill are examined against the mean square for error as is the interaction mean square. Degrees of freedom and appropriateness of error term are determined as in an ordinary double classification analysis of variance. A significant F-ratio for drug would suggest that the drug treatment significantly affected the performance of the task. A significant main effect for pill would suggest that the prior manipulation (actually giving a placebo in the form of a pill) significantly affected the performance of the task. A significant interaction effect would indicate that the effect of the prior manipulation and the treatment taken together affected the task and, therefore, main effects if significant would become more complex to interpret. Actually, if the main effect for drug were significant in our example and the interaction between drug and pill were also significant, but not the main effect for pill, then depending upon the shape and magnitude of the interaction the following interpretation might be made. The drug, in itself, is powerful enough to have an effect on task performance regardless of other factors in the experiment. 
However, if the drug were taken in pill form this factor would also have a significant effect on the performance of the task. Obviously, any theoretical interpretation of these results would have to await further research. Should the interaction alone be significant, the interpretation would be more difficult. This situation is discussed below.
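The double classification analysis of variance described above can be sketched for a balanced design as follows. This is a minimal illustration, not the analysis reported by Ross et al.: the function name and the tapping scores are invented, equal cell sizes are assumed, and only sums of squares, degrees of freedom, and F-ratios are computed (no significance levels).

```python
from statistics import mean


def two_way_anova(cells):
    """Balanced two-factor ANOVA sketch.

    cells maps (level_a, level_b) -> list of scores, with every
    combination of levels present and the same n in each cell.
    Returns sums of squares, degrees of freedom, and F-ratios.
    """
    levels_a = sorted({a for a, _ in cells})
    levels_b = sorted({b for _, b in cells})
    n = len(cells[(levels_a[0], levels_b[0])])
    if any(len(scores) != n for scores in cells.values()):
        raise ValueError("this sketch assumes equal cell sizes")

    grand = mean([x for scores in cells.values() for x in scores])
    cell_mean = {key: mean(scores) for key, scores in cells.items()}
    mean_a = {a: mean([cell_mean[(a, b)] for b in levels_b]) for a in levels_a}
    mean_b = {b: mean([cell_mean[(a, b)] for a in levels_a]) for b in levels_b}

    # Sums of squares for the two main effects, the interaction, and error.
    ss_a = n * len(levels_b) * sum((m - grand) ** 2 for m in mean_a.values())
    ss_b = n * len(levels_a) * sum((m - grand) ** 2 for m in mean_b.values())
    ss_cells = n * sum((m - grand) ** 2 for m in cell_mean.values())
    ss = {"A": ss_a, "B": ss_b, "AxB": ss_cells - ss_a - ss_b}
    ss_error = sum((x - cell_mean[key]) ** 2
                   for key, scores in cells.items() for x in scores)

    df = {"A": len(levels_a) - 1, "B": len(levels_b) - 1}
    df["AxB"] = df["A"] * df["B"]
    df["error"] = len(levels_a) * len(levels_b) * (n - 1)

    # Each effect is tested against the mean square for error.
    ms_error = ss_error / df["error"]
    f_ratio = {k: (ss[k] / df[k]) / ms_error for k in ss}
    return ss, df, f_ratio


# Made-up tapping scores laid out as in Table 4-4: factor A is the prior
# manipulation (pill or no pill), factor B is the treatment (drug or no drug).
scores = {
    ("pill", "drug"): [12, 14, 13, 13],
    ("pill", "no drug"): [10, 9, 11, 10],
    ("no pill", "drug"): [12, 13, 11, 12],
    ("no pill", "no drug"): [9, 10, 8, 9],
}
ss, df, f_ratio = two_way_anova(scores)
```

In these invented data the drug and pill effects are additive, so the interaction sum of squares is zero and the two main effects can be read directly; a nonzero interaction term would call for the more cautious interpretation discussed in the text.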
Analysis of Case II Designs

Within the framework of the pretest-treatment-posttest design (i.e., prior manipulation-treatment-measure of performance), a subject’s initial response to a questionnaire may provide a basis of comparison with later questionnaire responses, and thus a positive or negative correlation of some magnitude might be expected between the individual’s first and second scores to the same questionnaire. This expectation of correlation between successive, similar tasks is the basis for methodological concern with repeated measurements designs. When two treatments are performed in succession, a treatment carryover may occur. The rotation or counterbalanced design (Cochran and Cox, 1957) is specifically intended to give information on such a treatment carryover. In the experimental situations described in this chapter, where a single pretest precedes a single treatment, the possible effects of carryover from pretest to treatment are the same as the treatment-to-treatment carryover. A rotation design is not possible when the first variable is a pretest, since by definition a pretest must precede the treatment. Consequently, any examination of the confounding effects of the pretest with the treatment must be made by experimentally manipulating the
nature and application of the pretest. The major question that remains is directed at the nature of the relationship existing between these two variables. Thus the pretest-treatment-posttest research design is a special case of the general repeated measures design where there are multiple treatments or tests of the same organism over time. Ordinarily in a pretest-posttest design, since the initial score on the pretest has a tendency to be variable over subjects, a covariance design with pretest score as the covariable is appropriate and very useful. However, as we have seen, the very nature of our interest in pretest sensitization disallows for the use of covariance, since, in order to fulfill the requirement of the four-group design, some groups will not have been pretested. Consequently, the principal design needs to be of another type. If one can assume that there is a high probability that the unpretested groups in the four-group-control design would not be significantly different on pretest scores from the groups who were pretested, then a two-by-two factorial analysis of variance can be computed on the posttest scores of the four groups. This analysis will yield main effects for the treatment and for pretesting, and a first order interaction term between the two. Should the interaction effect not be significant, but either or both of the main effects be significant, the interpretation is straightforward. A significant main effect for treatment indicates that the treatment affected the posttest score in either a facilitative or depressive manner, depending on the direction of the mean change scores. A similar interpretation may be made for a significant pretesting main effect. Should the interactive effect of pretesting and treatment be significant, regardless of whether or not the main effects are significant, interpretation would become more complicated (see Lana and Lubin, 1961, 1963). 
When an interaction term is significant it can be concluded that the usual interpretations regarding the data which would ordinarily follow from the hypothesis testing model cannot be made. That is, with a significant interaction effect, the experimenter needs to exercise extreme care in the interpretation of his data and, in many instances, he may have to reexamine the manner in which he has constructed the empirical aspects of his problem. The significant interaction should lead him to reconstruct his hypotheses along somewhat different lines. Scheffe concludes, ‘‘In order to get exact tests and confidence intervals concerning the main effects it is generally necessary with the fixed-effects model (but not the random effects model or mixed model) to assume that there are no interactions.’’ All of the studies discussed below use the fixed-effects model. ‘‘It happens occasionally that the hypothesis of no interactions will be rejected by a statistical test, but the hypothesis of zero main effects for both factors will be accepted. The correct conclusion is then not that no differences have been demonstrated: If there are (any nonzero) interactions there must be (nonzero) differences among the cell means. The conclusion should be that there are differences, but that when the effects of the levels of one factor are averaged over the levels of the other, no difference of these averaged effects has been demonstrated’’ (Scheffe, 1959, 94). The point of this discussion is that since we are specifically looking for significant interaction terms, if we find one it should act as a warning device. The experimenter should then reformulate the problem. In the pretest-treatment-posttest design case, the use of a pretest becomes suspect for the given data of the study.
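Scheffe's point, that a nonzero interaction implies differences among the cell means even when both averaged main effects are zero, can be seen in a small numerical sketch. The cell means below are invented for illustration, and the function name is hypothetical; rows stand for pretested versus unpretested groups and columns for treated versus untreated groups.

```python
def effects_from_cell_means(m):
    """Main-effect and interaction contrasts for a 2 x 2 table of cell means.

    m[i][j] is the mean for pretest level i and treatment level j.
    """
    # Pretest effect: difference between row means averaged over columns.
    pretest_effect = (m[0][0] + m[0][1]) / 2 - (m[1][0] + m[1][1]) / 2
    # Treatment effect: difference between column means averaged over rows.
    treatment_effect = (m[0][0] + m[1][0]) / 2 - (m[0][1] + m[1][1]) / 2
    # Interaction: the treatment difference in row 0 minus that in row 1.
    interaction = (m[0][0] - m[0][1]) - (m[1][0] - m[1][1])
    return pretest_effect, treatment_effect, interaction


# A crossover pattern: treatment raises scores for one row and lowers
# them by the same amount for the other (made-up values).
crossover = [[10, 20],
             [20, 10]]
pretest_effect, treatment_effect, interaction = effects_from_cell_means(crossover)
```

Here both averaged effects are zero while the interaction contrast is not, which is exactly the case in which concluding that "no differences have been demonstrated" would be mistaken.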
Sensitization When the Pretest Involves Learning

Besides providing one of the first formal analyses of a research design capable of structuring an experiment so that a pretest-treatment interaction could be isolated and measured, R. L. Solomon (1949) conducted an experiment demonstrating one source of such an interaction effect. Two grammar school classes were equated for spelling ability by teacher judgment. The three-group control design that was examined earlier was used. Each group was pretested on a list of words of equal difficulty by having the children spell the words. The groups were then given a standard spelling lesson on the general rules of spelling which served as the experimental treatment. The posttest consisted of the same list of words to spell as were used as the pretest. In the analysis which followed, there was some indication that the pretest interacted with the treatment although the usual two-way analysis of variance could not be computed. It was concluded that the taking of the pretest tended to diminish the spelling effectiveness of the subjects. In this study the errors made in the pretest somehow were resistant to the treatment and were made again during the posttest. Here then is an instance of a pretest which, at least in part, is a learning experience (or recall of already learned material) depressing the effect of the treatment, which is another related learning task.

Beginning with a reexamination of Solomon’s results, Entwisle (1961a, 1961b) performed two experiments of her own and found a significant interaction effect among pretesting, IQ and sex. Pretests consisted of several multiple choice questions about state locations of large U.S. cities. Treatment consisted of showing all subjects a slide with the name of a city projected on it for 0.1 second. The subjects then wrote the state name, and immediately afterward the correct state name was shown for 0.1 second. This procedure was repeated for all items in the pretest.
Hence, the treatment consisted of a training session directly relevant to material presented as pretest. The posttest consisted of the same procedure as the pretest. There was no significant main effect of pretesting, but the triple interaction noted above was significant. Although the results are equivocal, there is a suggestion that pretesting aided recall for high IQ individuals and was ‘‘mildly hindering’’ for ‘‘average’’ IQ students. In another training study, Entwisle (1961a) found no significant interaction effect or direct effect of pretesting.

In 1960, Lana and King, using the four-group control design contained in Table 4-5, had all groups read a short summary of the mental health film on ethnic prejudice, ‘‘The High Wall.’’ Two of the groups were asked to recall the summary immediately after the reading as near to the original as possible by writing it out on a sheet of paper. This recall was considered the pretest. Sometime later the film was presented to one group (pretested) that had been asked to recall the summary, and to another group that had not been asked to recall the summary (unpretested). Immediately after presentation of the film all groups were posttested by asking them to recall as near to the original as possible the summary which had been read to them several days before. Accuracy of recall was measured by dividing the story into ‘‘idea units’’ and counting the number of units in each subject’s protocol.

Table 4–5 Experimental Design of Lana and King (1960)

Group I     Group II    Group III   Group IV
Reading     Reading     Reading     Reading
Recall      Recall      -           -
12 days     12 days     12 days     12 days
Film        -           Film        -
Recall      Recall      Recall      Recall

Even though the film used as the treatment has a definite attitudinal component to it and is clearly didactic, our interest in this study was to examine only recall components of the pretest-treatment-posttest experimental design. The results indicated a significant main effect for pretesting and no significant effects for the treatment nor for the pretest-treatment interaction. Although the content of the summary read before the first recall contained no more information than that which could be seen and heard in the film, the act of recalling the written summary was more effective than seeing the film in influencing the precision of the second recall taken after the presentation of the film. In this case, only the fact of the first recall significantly affected later recall. The combination of first recall and film viewing was not as effective in posttest recall. To the extent that a pretest serves as a device for conscious recall of meaningfully connected material, it can serve to influence posttest results. Attitude and opinion questionnaires used as pretests might have the same effect should part of the process of taking such a pretest involve recall of previously held attitudes or opinions. Hicks and Spaner (1962), working with attitudes toward mental patients and hospital experience, found a pretest sensitization effect similar to that shown by Lana and King. The former investigators suggested that a learning factor might have been present in the attitude questionnaire used as the pretest.
In the studies by Solomon, Lana and King, Hicks and Spaner and the two by Entwisle, virtually all possible effects of pretesting, when pretesting was a learning or recall device, were shown. Solomon found a significant interaction between pretest and treatment which had a depressing effect on posttest score. Lana and King found a significant main effect for pretesting, indicating greater recall with pretesting than without pretesting, but no significant pretest-treatment interaction. Entwisle found a pretest-treatment interaction with a salutary effect of the pretest on posttest scores and in another study found no significant pretest-treatment interaction of any kind. Entwisle dismissed her negative results because sex was not a control variable in that study, and she later showed that when it was introduced as a variable a significant pretest-treatment-sex interaction appeared. Even though we have examined only five relevant studies, we are probably safe in assuming that a pretesting procedure which, in whole or part, involves some learning process such as recall of previously learned material may very well have an effect on the magnitude of the posttest score. Ordinarily, if the task or the recall demanded by the pretest procedure is properly understood by the subject, the effect on the posttest should be facilitative. However, as we have seen in the case of Solomon’s results and some of Entwisle’s, depressive effects can also occur. There are different implications of the interpretations of pretest results, depending upon whether the sensitization is direct (Lana and King) or operates in concert with the treatment (Solomon, Entwisle). The former results are simpler to interpret and the experimenter need not be as concerned with the general procedure of pretesting in that experimental situation, since pretesting effects and treatment effects are independent.
When the pretest-treatment interaction effect is significant there is always the danger that the interaction indicates a change in the nature of the empirical
phenomenon and a distortion of that phenomenon so that it is markedly different from what it would have been had a pretest not been used. It is in this situation of measuring attitude by use of a pretest that we see an application of the principle of indeterminacy which seems to operate in a manner analogous to that found in subatomic physics.
Sensitization When the Pretest Involves Opinions and Attitudes

In Solomon’s 1949 article, he indicated in a footnote that some evidence was available that the pretest may reduce the variance of the posttest in attitudinal studies. The implication was that taking an attitudinal pretest may restrict the attention of the subjects so that they are not as variable in their reactions to the treatment as they would have been had they not been required to take a pretest. In 1955, E. V. Piers, at the suggestion of J. C. Stanley, used the Solomon four-group control design to measure teacher attitudes toward students. The pretest consisted of the Minnesota Teacher Attitude Inventory, and the posttest used the Adorno et al. F Scale, a student rating scale and a vocabulary test. No pretest effect of any kind was found.

Lana (1959a) utilized the Solomon design with a questionnaire measuring opinion on vivisection as both the pretest and posttest. The treatment consisted of a taped provivisection appeal. If a pretest sensitization were operating it would more likely be evident in this study, where pretest and posttest devices were identical, than in Piers’ study where they were different. With the same pretest and posttest devices, a recall factor alone should produce some effect from pre-to-posttest, as was noted in the Lana and King, Solomon, and Entwisle studies. There was, however, no significant pretesting main effect nor a significant pretest-treatment interaction effect. Considering the fact that the topic used, ‘‘vivisection,’’ was probably of minor interest to the subjects, one explanation for these results is that they might have been little affected by the tasks asked of them.

Lana (1959b) repeated this study using a topic (ethnic prejudice) which, it seemed reasonable to assume, was more interesting and controversial to the college student subjects than vivisection. The treatment consisted of ‘‘The High Wall,’’ the mental health film used by Lana and King (1960).
Pretest and posttest consisted of a modified version of the California Ethnocentrism Scale. There were no significant main or interactive effects involving the pretest. DeWolfe and Governale (1964) administered to experimental and control groups of student nurses a pretest consisting of the Nurse-Patient Relationship Sort, Fear of Tuberculosis Questionnaire, and the IPAT ‘‘Trait’’ Anxiety Scale. The Nurse-Patient Relationship Sort was given as a posttest at the end of a specified nursing training period. Appropriate controls were utilized to allow for an examination of pretest and pretest-treatment interaction effects. The authors reported that there was no consistent sensitization or desensitization as a result of pretesting. Campbell and Stanley (1966) have indicated that studies by Anderson (1959), Duncan et al. (1957), Sobol (1959), and Zeisel (1947) also reported no sensitizing effect as a result of taking a pretest, when opinions or attitudes were involved. In 1949, Hovland, Lumsdaine and Sheffield, in their classic work on attitudes of the American soldier during World War II, reported that something like a sensitization effect occurred as a consequence of using a pretest in attitudinal studies. They found that there was less attitude change in a group of soldiers administered pretest questionnaires on topics relevant to the war effort than in those not so pretested
where all other experimental conditions were similar. By their own admission, this conclusion was extremely tenuous since the pretested group consisted of soldiers receiving infantry training at one base while the non-pretested group consisted of soldiers receiving armored vehicle training at another base. Also, the demographic characteristics of the two groups of men were not comparable. This study remains the only one involving opinions and attitudes in which even a suggestion of the occurrence of pretest sensitization is made and in which a unidirectional communication (i.e., a communication supporting only one point of view) was used. The overwhelming lack of a pretest sensitization effect when the pretest is used to measure existing opinions or attitudes is as convincing a demonstration as one is likely to find in social psychological research. It seems reasonably safe to use a pretest without concern for its direct or interaction (with the treatment) effects on posttest results. This zero effect seems to be present over a large variety of opinions and attitudes and a large variety of treatment situations, as is evident from the diversity of techniques used in the studies cited here. The vast majority of these studies utilized a one-sided communication geared to influence opinion or attitude change in one possible direction. The Anderson study is an exception, and conceivably in the DeWolfe and Governale study the training of the nurses may have contained incidents that represented both positive and negative positions about the patient qua patient. However, by and large, the studies involved a unidirectional persuasive attempt as the treatment. Within the context of research on order effects in persuasive communications, it was decided to check for pretest sensitization when subjects are exposed to both of two opposed arguments on the same topic. 
This would be much the same situation as, for example, being exposed to conflicting advertisements for similar products or to apparently contrary arguments on political issues by individuals running for the same political office. The presentation of opposed arguments seemed sufficiently different from a treatment using a unidirectional communication to warrant a renewed effort in looking for pretest sensitization. In an experiment by Lana and Rosnow (1963), subjects were divided into various groups such that half of these groups received a questionnaire measuring opinions either on the use of nuclear weapons or on public censorship of written materials. This was accomplished by either handing the questionnaire to the subject and asking him to complete it or interspersing the questionnaire items throughout a regular Psychology I examination, thus ‘‘hiding’’ it from the subject. When the questionnaire is handed directly to the subject and he is asked to complete it, it is highly likely that his attention will be focused directly on the task. When the questionnaire items are interspersed throughout a regular classroom examination, the attention and expectancies of the subject are initially on a topic other than the content of the questionnaire. Conceivably, here was a way to get a measure of initial opinion on a given subject matter, and to reduce any effects of pretest sensitization. Our initial purpose in carrying out this study was to examine possible effects of pretesting on the order effects (primacy-recency) of two opposed communications. Primacy refers to the success in changing opinion of the initial argument of two opposed communications. Recency refers to a similar success of the argument presented second. 
Disregarding this result, a reanalysis of the data indicated that the average opinion change per group (mean absolute differences from pretest to posttest regardless of the direction of change) was significantly greater for groups where the pretest was hidden than for the groups where the pretest was exposed.
The step that next seemed most appropriate was to attempt to demonstrate this pretest sensitization when two opposed communications were used as the treatment and when some subjects received no pretest and others responded to an exposed pretest. Two experiments (Lana, 1964, 1966) were conducted for this purpose. Their results indicated that the no-pretest groups changed their opinions in either direction to a significantly greater degree than did the groups administered the exposed pretest. The mean of the unpretested groups was estimated by computing the mean of the means of the pretested groups. Consistent with our earlier discussion, it was assumed that the groups not pretested were homogeneous with those pretested since they were formed randomly from the same population. These studies tend to support the notion that the pretest can act as a device by which the individual commits himself to maintain his opinion in the face of opposed (i.e., bidirectional) arguments presented after he has made his commitment. Campbell and Brock (1957) have shown that commitment to an attitudinal position inhibits change when commitment is elicited after an initial attempt to influence the subject, but not when response to a precommunication questionnaire constitutes the commitment. Their suggestion, however, is that there are forms of attitudinal commitment, usually made under public conditions, which inhibit opinion or attitude change as a result of materials presented later. Almost without exception, however, no pretest main or interaction effect has been found in the situation where only one opinion or attitude is measured by the pretest and where a unidirectional communication serves as the treatment. Where bidirectional arguments comprise the treatment, pretest sensitization has consistently been present (see Table 4-6). 
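The comparison of opinion change between pretested and unpretested groups can be sketched as follows. This is an illustration only, with invented scores and hypothetical function names. It assumes that change is measured at the group level as the absolute difference between the posttest mean and the pretest mean, with the pretest mean of an unpretested group estimated by the mean of the means of the pretested groups; the exact computations used in the original experiments are not given in the text.

```python
from statistics import mean


def estimated_pretest_mean(pretested_groups):
    """Mean of the pretest means of the pretested groups."""
    return mean([mean(group) for group in pretested_groups])


def absolute_change(pretest_mean, posttest_scores):
    """Absolute change from a pretest mean to the posttest mean,
    regardless of the direction of change."""
    return abs(mean(posttest_scores) - pretest_mean)


# Invented opinion scores for two pretested groups (pretest, posttest).
pre_a, post_a = [4, 6], [5, 7]
pre_b, post_b = [6, 8], [6, 7]

# Invented posttest scores for two groups that received no pretest.
post_c = [8, 9]
post_d = [3, 4]

# Observed pretest means are used where a pretest was given.
change_a = absolute_change(mean(pre_a), post_a)
change_b = absolute_change(mean(pre_b), post_b)

# The estimated pretest mean stands in for the missing pretests.
estimate = estimated_pretest_mean([pre_a, pre_b])
change_c = absolute_change(estimate, post_c)
change_d = absolute_change(estimate, post_d)
```

Because the change is taken as an absolute value, groups that move in opposite directions (as the two unpretested groups do here) both count as having changed, which matches the text's emphasis on change "in either direction."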
Table 4–6 Summary of Sensitization Effects Indicated by Various Experiments

Pretest was a learning device
  No sensitization: Entwisle (1961a)
  Sensitization, main effect of pretest: Lana and King (1960); Hicks and Spaner (1962)
  Sensitization, interaction of pretest with treatment: Solomon (1949); Entwisle (1961b)

Pretest was an attitudinal device (unidirectional)
  No sensitization: Zeisel (1947); Duncan et al. (1957); Anderson (1959); Lana (1959a); Lana (1959b); Sobol (1959); DeWolfe and Governale (1964)
  Sensitization, main effect of pretest: Hovland, Lumsdaine, and Sheffield (1949)

Pretest was an attitudinal device (bidirectional)
  Sensitization: Lana and Rosnow (1963); Lana (1964); Lana (1966)

A possible explanation for these marked differences is the following: If the recipient initially favors the position advocated, then a unidirectional communication should yield greater opinion change, regardless of pretest conditions. This is because the communication would support the recipients’ initial commitment, if no ceiling problem were encountered. If half the subjects supported the position advocated, these subjects would not need to consider their initial opinion when reacting to the posttest.[3] There would be no need to resolve discrepancy between what these individuals wrote on their pretests and the point of view represented in the communication. Both would be consistent with one another. Thus there is no resistance to the communication because of prior commitment for half of the subjects. However, if two opposed arguments were presented as the communication, one of the positions would automatically be discrepant with every subject’s initial commitment. Pretest commitment, challenged by one of the communications, produces resistance to change, and hence the result is a smaller change score from pretest to posttest.
Cautions Associated with the Use of a Pretest

As we have seen, when the mensurative process involves people required to respond to an experimental condition in a manner reflecting their motives, opinions, attitudes, feelings, or beliefs, a pretest that measures these characteristics without influencing that process may be difficult to find. The subject’s awareness of the manipulatory intent of the experimenter, dealt with in detail by McGuire (Chapter 2), the experimenter’s expectations (Rosenthal, Chapter 6), the subject’s concern about being evaluated (Rosenberg, Chapter 7): all of these can exert an influence through use of a pretest, or act as separate effects, and thereby confound the effect of the experimental treatment. Indeed all of the other chapters of this book deal with factors which, though extrinsic to the experimental situation as conceived by the experimenter, can affect the magnitude and quality of the treatment and its effect on behavior.

Campbell and Stanley (1966, 20) have noted, ‘‘In the usual psychological experiment, if not in educational research, a most prominent source of unrepresentativeness is the patent artificiality of the experimental setting and the student’s knowledge that he is participating in an experiment. For human experimental subjects, a higher order problem-solving task is generated, in which the procedures and experimental treatment are reacted to not only for their simple stimulus values, but also for their role as clues in divining the experimenter’s intent.’’ Campbell and Stanley (1966) have also observed that the posttest may create an artificial ‘‘experiment participating effect’’ for the subject if the connections between treatment and posttest (or among pretest, treatment, and posttest) are obvious. One way that this perception on the part of the subject might be changed is by using a different (e.g., equivalent form) posttest than was used as the pretest.
With few exceptions (e.g., Piers, 1955), most of the studies mentioned above used identical pre- and posttests; that is, they are all examples of what was earlier referred to as Case II. At this point the alternative to be explored is that of substituting for the pretest some other technique of observation or measurement of initial standing which lacks the obvious and telltale characteristics of the pretest (i.e., as in Case I). One alternative to administering a pretest, and one which allows a reasonable estimate of the strength of a subject’s opinion and attitude toward some social object, is the use of groups which are naturally homogeneous because of the unified stand of their members on the topic in question. For example, one might speculate that the opinions concerning birth control of members of the Catholic college organization, the Newman Club, would cluster in the negative half of the opinion continuum. In his examination of order effects when opposed communications on the same topic were presented to subjects, Lana (1964b) found that intact groups tend to be more rigid in their commitment to a given opinion or attitude, although more is known about the initial opinions of such groups, which is useful in solving the pretest sensitization problem. It is more difficult to change their opinions via a persuasive communication than those of groups formed randomly. They are, therefore, not the ideal subjects to use in a demonstration of the facility with which opinion or attitude change can be effected under various communicative conditions.
³ This assumes a symmetrical distribution of pretest scores with a mean near the “indifference” point, an assumption which holds true for the great majority of the studies heretofore cited.
Recently, E. J. Webb, D. T. Campbell, R. D. Schwartz, and L. Sechrest (1966) published a book which included a summary of what may be conceived as various alternatives to the pretest questionnaire technique. It is their contention that there are “nonreactive measures” which may be used to determine relative states of the organism, measures which assure the experimenter that the organism is not affected by the process of measurement itself. In short, there are measures which eliminate the operation of the principle of indeterminacy in the opinion or attitude measurement situation. (Reactive measures are those which sensitize the subject to the fact of being measured, or of being an object of concern to the experimenter, and which, therefore, serve in many instances to change the behavior of the subject as a result.) Their point is that for the researcher concerned with social and other complex conditions affecting the organism, a variety of mensurative techniques are available which do not interfere with the process being measured, because the subject is totally unaware of the measurement process.
They divide these ‘‘unobtrusive’’ measures into five categories. The first is that of physical traces. For example, it is possible to infer what displays are most popular at a museum by examining the degree of wear of the floor tiles directly in front of the exhibit. This is a non-reactive measure. In contrast, a reactive measure would be to ask a number of visitors which exhibits they spent the most time at or which they enjoyed the most. Conceivably, an individual asked this question might respond with the name of an exhibit quite at variance with the one he actually spent the most time at or most enjoyed. If the subject wished to appear ‘‘cultured,’’ he might say that he spent more time looking at the fish fossils than he did looking at the stuffed gorilla. The second source of nonreactive data is what Webb et al. call the ‘‘running record,’’ where data of a census nature have been compiled by society for purposes other than those of the experimenter, but which provide useful information for him. Examples include actuarial statistics, city budgets, and voting statistics. Episodic and private records would serve the same function as running records. Simple observation of expressive movements and casual conversation comprise another category of nonreactive measurement. The final type of measure involves the use of hidden mechanical or other devices to record behavior in situations where the subject is unaware of the ongoing measurement. Paying all due respect to the cleverness and imagination shown by Webb, Campbell, Schwartz, and Sechrest in devising and systematizing these nonreactive, unobtrusive measures, the strongest impact on this writer after reading their book was to reinforce his belief in the necessity of looking for ways and devices to utilize reactive measures where the reaction (sensitization) on the part of the subject can
either be measured or be eliminated altogether. The unobtrusive measures listed by these authors are rarely relevant to the research of a good many psychologists who use some sort of pretest measure. However, it should be noted that the authors intended these techniques to be as much ‘‘posttests’’ as ‘‘pretests.’’ Since it has been shown that, at least in attitude research, pretest measures, if they have any impact at all, depress the effect being measured, any differences which can be attributed to the experimental treatment probably represent strong treatment effects. In short, when pretest measures exert any influence at all in attitude research, the effect is to produce a Type II error, which is more tolerable to most psychological researchers than is an error of the first kind. It would seem that a researcher’s decision to use a pretest or, instead, to utilize a randomization design with only posttest measures is partly based upon personal characteristics having little to do with the logic of the experiment. As we have indicated, what one gains in information by utilizing a pretest he sometimes loses in increased sensitization of the subject. What he gains in purity of experimental effect by utilizing a randomization design he loses in knowledge of pre-treatment conditions existing in the organism. In some cases the goals of the experiment set the risk one will take. However, in many situations one is caught between the Scylla of sensitization and the Charybdis of ignorance of pre-existing conditions. The choice of procedure may be arbitrary. If, however, one does choose to utilize some form of the pretest-posttest design, disguising both pretest and posttest as much as possible may reduce sensitization of the subject. For example, as part of an as yet unfinished master’s thesis, Julian Biller hid an attitudinal pretest in a questionnaire ostensibly concerned with student reaction to various university administration policies and to student life in general. 
It was explained to the students that the information was useful to the instructor in shaping his course toward the needs of the students. Since the pretest items were concerned with attitudes toward the American college grading system, they were not conspicuous by their content. In a similar manner, the posttest items were hidden in a different questionnaire presented to the subjects sometime after they had been exposed to a persuasive communication concerning various grading systems. Of course, the astute subject may still recognize the key items in the questionnaire as being related to the communications he listened to sometime in the past. However, recognition and therefore sensitization effects may be minimized by use of this technique. Pretest sensitization might also be minimized by increasing the time between application of the pretest and the presentation of the persuasive communications and the posttest. However, one risks the possibility that factors external to the experimental situation may influence pretest-posttest change scores if the interval between the two is great. Conceivably an optimum time interval between pretest and treatment as well as between treatment and posttest might be found.
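The argument above, that a pretest which depresses the treatment effect pushes the experimenter toward Type II rather than Type I errors, can be illustrated with a small simulation. The sketch below is not taken from the chapter: the effect size of 0.8 standard deviations, the assumption that the pretest removes half of the effect, the group size of 30, and the approximate critical value of 2.0 are all hypothetical values chosen only for illustration.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / se


def rejection_rate(true_effect, dampening, n=30, sims=400, seed=0, critical=2.0):
    """Share of simulated posttest comparisons in which |t| exceeds `critical`.

    `dampening` is the ASSUMED fraction of the treatment effect lost because
    the subjects were pretested (0.0 means no sensitization at all).
    `critical` is a rough two-tailed .05 cutoff for about 58 df (assumed).
    """
    rng = random.Random(seed)
    observed_effect = true_effect * (1.0 - dampening)
    rejections = 0
    for _ in range(sims):
        control = [rng.gauss(0.0, 1.0) for _ in range(n)]
        treated = [rng.gauss(observed_effect, 1.0) for _ in range(n)]
        if abs(welch_t(treated, control)) > critical:
            rejections += 1
    return rejections / sims


if __name__ == "__main__":
    print("real effect, no dampening   :", rejection_rate(0.8, 0.0))
    print("real effect, pretest halves :", rejection_rate(0.8, 0.5))
    print("no real effect              :", rejection_rate(0.0, 0.5))
```

Under these assumed values the dampened condition detects the real effect less often, which is the Type II risk described in the text. When the true effect is zero, dampening changes nothing, so the false-positive rate is not inflated.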
References
Anderson, N. H. Test of a model for opinion change. Journal of Abnormal and Social Psychology, 1959, 59, 371–381.
Campbell, D. T. Administrative experimentation, institutional records, and nonreactive measures. In J. C. Stanley (Ed.), Improving experimental designs and statistical analysis. Chicago: Rand McNally, 1967.
Campbell, D. T. and Stanley, J. C. Experimental and quasi-experimental designs for research. Chicago: Rand McNally, 1966.
Campbell, E. H. and Brock, T. The effects of “commitment” on opinion change following communications. In C. I. Hovland, et al., The order of presentation in persuasion. New Haven: Yale University Press, 1957.
Cochran, W. G. and Cox, Gertrude M. Experimental design (2nd ed.). New York: Wiley, 1957.
DeWolfe, A. S. and Governale, C. N. Fear and attitude change. Journal of Abnormal and Social Psychology, 1964, 69, 119–123.
Duncan, C. P., O’Brien, R. B., Murray, D. C., Davis, L., and Gilliland, A. R. Some information about a test of psychological misconceptions. Journal of General Psychology, 1957, 56, 257–260.
Entwisle, Doris R. Attensity: Factors of specific set on school learning. Harvard Educational Review, 1961, 31, 84–101. (a)
Entwisle, Doris R. Interactive effects of pretesting. Educational and Psychological Measurement, 1961, 21, 607–620. (b)
Heisenberg, W. Physics and philosophy. New York: Harper Torch Books, 1958.
Hicks, J. M. and Spaner, F. E. Attitude change and hospital experience. Journal of Abnormal and Social Psychology, 1962, 65, 112–120.
Hovland, C. I., Lumsdaine, A. A. and Sheffield, F. D. Experiments on mass communication. Princeton: Princeton University Press, 1949.
Lana, R. E. Pretest-treatment interaction effects in attitudinal studies. Psychological Bulletin, 1959, 56, 293–300. (a)
Lana, R. E. A further investigation of the pretest-treatment interaction effect. Journal of Applied Psychology, 1959, 43, 421–422. (b)
Lana, R. E. and King, D. J. Learning factors as determiners of pretest sensitization. Journal of Applied Psychology, 1960, 44, 189–191.
Lana, R. E. and Lubin, A. Use of analysis of variance techniques in psychology. Progress Report to the National Institute of Mental Health, United States Public Health Service No. M–4113(A), March, 1961.
Lana, R. E. and Lubin, A. The effect of correlation on the repeated measures design. Educational and Psychological Measurement, 1963, 23, 729–739.
Lana, R. E. and Rosnow, R. L. Subject awareness and order effects in persuasive communications. Psychological Reports, 1963, 12, 523–529.
Lana, R. E. The influence of the pretest on order effects in persuasive communications. Journal of Abnormal and Social Psychology, 1964, 69, 337–341. (a)
Lana, R. E. Existing familiarity and order of presentation of persuasive communications. Psychological Reports, 1964, 15, 607–610. (b)
Lana, R. E. Inhibitory effects of a pretest on opinion change. Educational and Psychological Measurement, 1966, 26, 139–150.
Piers, Ellen V. An abstract of effects of instruction on teacher attitudes: extended control group design. Bulletin of the Maritime Psychological Association, 1955 (Spring), 53–56.
Ross, S., Krugman, A. D., Lyerly, S. B., and Clyde, D. J. Drugs and placebos: a model design. Psychological Reports, 1962, 10, 383–392.
Roethlisberger, F. J. and Dickson, W. J. Management and the worker. Cambridge, Massachusetts: Harvard University Press, 1939.
Scheffé, H. The analysis of variance. New York: Wiley, 1959.
Sobol, M. G. Panel mortality and panel bias. Journal of the American Statistical Association, 1959, 54, 52–68.
Solomon, R. L. An extension of control group design. Psychological Bulletin, 1949, 46, 137–150.
Webb, E. J., Campbell, D. T., Schwartz, R. D., and Sechrest, L. Unobtrusive measures: nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Zeisel, H. Say it with figures. New York: Harper, 1947.
5 Demand Characteristics and the Concept of Quasi-Controls¹
Martin T. Orne²
Institute of the Pennsylvania Hospital and University of Pennsylvania
¹ The substantive work reported in this paper was supported in part by Contract #Nonr 4731 from the Group Psychology Branch, Office of Naval Research. The research on the detection of deception was supported in part by the United States Army Medical Research and Development Command Contract #DA-49-193-MD-2647.
² I wish to thank Frederick J. Evans, Charles H. Holland, Edgar P. Nace, Ulric Neisser, Donald N. O’Connell, Emily Carota Orne, David A. Paskewitz, Campbell W. Perry, Karl Rickels, David L. Rosenhan, Robert Rosenthal, and Ralph Rosnow for their thoughtful criticisms and many helpful suggestions in the preparation of this manuscript.
Special methodological problems are raised when human subjects are used in psychological experiments, mainly because subjects’ thoughts about an experiment may affect their behavior in carrying out the experimental task. To counteract this problem psychologists have frequently felt it necessary to develop ingenious, sometimes even diabolical, techniques in order to deceive the subject about the true purposes of an investigation (see Stricker, 1967; Stricker, Messick, and Jackson, 1967). Deception may not be the only, nor the best, way of dealing with certain issues, yet we must ask what special characteristic of our science makes it necessary to even consider such techniques when no such need arises in, say, physics. The reason is plain: we do not study passive physical particles but active, thinking human beings like ourselves. The fear that knowledge of the true purposes of an experiment might vitiate its results stems from a tacit recognition that the subject is not a passive responder to stimuli and experimental conditions. Instead, he is an active participant in a special form of socially defined interaction which we call “taking part in an experiment.”
It has been pointed out by Criswell (1958), Festinger (1957), Mills (1961), Rosenberg (1965), Wishner (1965), and others, and discussed at some length by the author elsewhere (Orne, 1959b; 1962), that subjects are never neutral toward an experiment. While, from the investigator’s point of view, the experiment is seen as permitting the controlled study of an individual’s reaction to specific stimuli, the situation tends to be perceived quite differently by his subjects. Because subjects are active, sentient beings, they do not respond to the specific experimental stimuli with which they are confronted as isolated events but rather they perceive these in the total context of the experimental situation. Their understanding of the situation is based upon a great deal of knowledge about the kind of realities under which scientific research is conducted, its aims and purposes, and, in some vague way, the kind of findings which might emerge from their participation and their responses. The response to any specific set of stimuli, then, is a function of both the stimulus and the subject’s recognition of the total context.
Under some circumstances, the subject’s awareness of the implicit aspects of the psychological experiment may become the principal determinant of his behavior. For example, in one study an attempt was made to devise a tedious and intentionally meaningless task. Regardless of the nature of the request and its apparently obvious triviality, subjects continued to comply, even when they were required to perform work and to destroy the product. Though it was apparently impossible for the experimenter to know how well they did, subjects continued to perform at a high rate of speed and accuracy over a long period of time. They ascribed (correctly, of course) a sensible motive to the experimenter and meaning to the procedure. While they could not fathom how this might be accomplished, they also quite correctly assumed that the experimenter could and would check their performance³ (Orne, 1962). Again, in another study, subjects were required to carry out such obviously dangerous activities as picking up a poisonous snake or removing a penny from fuming nitric acid with their bare hands (Orne and Evans, 1965). Subjects complied, correctly surmising that, despite appearances to the contrary, appropriate precautions for their safety had been taken.
In less dramatic ways the subject’s recognition that he is not merely responding to a set of stimuli but is doing so in order to produce data may exert an influence upon his performance. Inevitably he will wish to produce “good” data, that is, data characteristic of a “good” subject.
To be a “good” subject may mean many things: to give the right responses, i.e., to give the kind of response characteristic of intelligent subjects; to give the normal response, i.e., characteristic of healthy subjects; to give a response in keeping with the individual’s self-perception, etc. If the experimental task is such that the subject sees himself as being evaluated he will tend to behave in such a way as to make himself look good. (The potential importance of this factor has been emphasized by Rosenberg, 1965; see Chapter 7.) Investigators have tended to be intuitively aware of this problem and in most experimental situations tasks are constructed so as to be ambiguous to the subject regarding how any particular behavior might make him look especially good. In some studies investigators have explicitly utilized subjects’ concern with the evaluation in order to maximize motivation. However, when the subject’s wish to look good is not directly challenged, another set of motives, one of the common bases for volunteering, will become relevant. That is, beyond idiosyncratic reasons for participating, subjects volunteer, in part at least, to further human knowledge, to help provide a better understanding of mental processes that ultimately might be useful for treatment, to contribute to science, etc. This wish, which, despite currently fashionable cynicism, is fortunately still the mode rather than the exception among college student volunteers, has important consequences for the subject’s behavior. Thus, in order for the subject to see the data as useful, it is essential that he assume that the experiment is important, meaningful, and properly executed. Also, he would hope that the experiment work, which tends to mean that it prove what it attempts to prove.
Reasons such as these may help to clarify why subjects are so committed to seeing a logical purpose in what would otherwise appear to be a trivial experiment, why they are so anxious to ascribe competence to the experimenter and, at the end of a study, are so concerned that their data prove useful. The same set of motives also helps explain why subjects often will go to considerable trouble and tolerate great inconvenience provided they are encouraged to see the experiment as important. Typically they will tolerate even intense discomfort if it seems essential to the experiment; on the other hand, they respond badly indeed to discomfort which they recognize as due to the experimenter’s ineptness, incompetence, or indifference. Regardless of the extent to which they are reimbursed, most subjects will be thoroughly alienated if it becomes apparent that, for one reason or another, their experimental performance must be discarded as data. Interestingly, they will tend to become angry if this is due to equipment failure or an error on the part of the experimenter, whereas if they feel that they themselves are responsible, they tend to be disturbed rather than angry.
The individual’s concern about the extent to which the experiment helps demonstrate that which the experimenter is attempting to demonstrate will, in part, be a function of the amount of involvement with the experimental situation. The more the study demands of him, the more discomfort, the more time, the more effort he puts into it, the more he will be concerned about its outcome. The student in a class asked to fill out a questionnaire will be less involved than the volunteer who stays after class, who will in turn be less involved than the volunteer who is required to go some distance, who will in turn be less involved than the volunteer who is required to come back many times, and so on.⁴
Insofar as the subject cares about the outcome, his perception of his role and of the hypothesis being tested will become a significant determinant of his behavior. The cues which govern his perception, which communicate what is expected of him and what the experimenter hopes to find, can therefore be crucial variables. Some time ago I proposed that these cues be called the “demand characteristics of an experiment” (Orne, 1959b). They include the scuttlebutt about the experiment, its setting, implicit and explicit instructions, the person of the experimenter, subtle cues provided by him, and, of particular importance, the experimental procedure itself. All of these cues are interpreted in the light of the subject’s past learning and experience. Although the explicit instructions are important, it appears that subtler cues from which the subject can draw covert or even unconscious inference may be still more powerful.
Recognizing that the subject’s knowledge affects his performance, investigators have employed various means to disguise the true purpose of the research, thereby trying to alter the demand characteristics of experimental situations in order to make them orthogonal to the experimental effects. Unfortunately, the mere fact that an investigator goes to great lengths to develop a “cute” way to deceive the subject in no way guarantees that the subject is, in fact, deceived. Obviously, it is essential to establish whether the subject or the experimenter is the one who is deceived by the experimental manipulation.
³ These pilot studies were performed by Thomas Menaker.
⁴ Obviously, how the subject is treated will affect his motivation in this regard. If the experimenter seems casual, disinterested, or worse yet, incompetent, the subject will both resent it and mobilize little investment. On the other hand, if the experimenter seems both to care about the outcome and to be competent, subjects will often want to help even at great inconvenience to themselves. Thus we have frequently seen subjects return from distant cities to complete a study.
Demand Characteristics and Experimenter Bias
Demand characteristics and the subject’s reaction to them are, of course, not the only subtle and human factors which may affect the results of an experiment. Experimenter bias effects, which have been studied in such an elegant fashion by Rosenthal (1963; 1966), also are frequently confounding variables. Experimenter bias effects depend in large part on experimenter outcome expectations and hopes. They can become significant determinants of data by causing subtle but systematic differences in (a) the treatment of subjects, (b) the selection of cases, (c) the observation of data, and (d) the recording of data, and by causing (e) systematic errors in the analysis of data. To the extent that bias effects cause subtle changes in the way the experimenter treats different groups, they may alter the demand characteristics for those groups. In social psychological studies, demand characteristics may, therefore, be one of the important ways in which experimenter bias is mediated.
Conceptually, however, the two processes are very different. Experimenter bias effects are rooted in the motives of the experimenter, but demand characteristic effects depend on the perception of the subject. The effects of bias are by no means restricted to the treatment of subjects. They may equally well function in the recording of data and its analysis. As Rosenthal (1966) has pointed out, they can readily be demonstrated in all aspects of scientific endeavor, “N rays” being a prime example. Demand characteristics, on the other hand, are a problem only when we are studying sentient and motivated organisms. Light rays do not guess the purpose of the experiment and adapt themselves to it, but subjects may. The repetition of an experiment by another investigator with different outcome orientation will, if the findings were due to experimenter bias, lead to different results. This procedure, however, may not be sufficient to clarify the effects of demand characteristics.
Here it is the leanings of the subject, not of the experimenter, that are involved. In a real sense, for the subject an experiment is a problem-solving situation. Riecken (1962, 31) has succinctly expressed this when he says that aspects of the experimental situation lead to ‘‘a set of inferential and interpretive activities on the part of the subject in an effort to penetrate the experimenter’s inscrutability. . . .’’ For example, if subjects are used as their own controls, they may easily recognize that differential treatment ought to produce differential results, and they may act accordingly. A similar effect may appear even when subjects are not their own controls. Those who see themselves as controls may on that account behave differently from those who think of themselves as the ‘‘experimentals.’’ It is not conscious deception by the subject which poses the problem here. That occurs only rarely. Demand characteristics usually operate subtly in interaction with other experimental variables. They change the subject’s behavior in such a way that he is often not clearly aware of their effect. In fact, demand characteristics may be less effective or even have a paradoxical action if they are too obvious. With the constellation of motives that the usual subject brings to a psychological experiment, the ‘‘soft sell’’ works better than the ‘‘hard sell.’’ Rosenthal (1963) has reported a similar finding in experimenter bias: the effect is weakened, or even reversed, if the experimenter is paid extra to bias his results.
It is possible to eliminate the experimenter entirely, as was suggested by Charles Slack⁵ some years back in a Gedanken experiment. He proposed that subjects be contacted by mail, be asked to report to a specific room at a specific time, and be given all instructions in a written form. The recording of all responses as well as the reinforcement of subjects would be done mechanically. This procedure would go a long way toward controlling experimenter bias. Nevertheless, it would have demand characteristics, as would any other experiment which we might conceive; subjects will always be in a position to form hypotheses about the purpose of an experiment. Although every experiment has its own demand characteristics, these do not necessarily have an important effect on the outcome. They become important only when they interact with the effect of the independent variable being studied. Of course, the most serious situation is one in which the investigator hopes to generalize from an experiment, where one set of demand characteristics typically operates, to a real-life situation which lacks an analogous set of conditions.
Pre-Inquiry Data as a Basis for Manipulating Demand Characteristics
A recent psychophysiological study (Gustafson and Orne, 1965) takes one possible approach to the clarification of demand characteristic effects. The example is unusual only because some of its demands were deliberately manipulated and treated as experimental variables in their own right. The results of the explicit manipulation enabled us to understand an experimental result which was otherwise contrary to field findings.
In recent years there have been a number of studies on the detection of deception (more popularly known as “lie detection”) with the galvanic skin response (GSR) as the dependent variable. In one such study, Ellson, Davis, Saltzman, and Burke (1952) reported a very curious finding. Their experiment dealt with the effect which knowledge of results can have on the GSR. After the first trial, some subjects were told that their lies had been detected, while others were told the opposite. This produced striking results on the second trial: those who believed that they had been found out became harder to detect the second time, while those who thought they had deceived the polygraph on Trial 1 became easier to detect on Trial 2. This finding, if generalizable to the field, would have considerable practical implications. Traditionally, interrogators using field lie detectors go to great lengths to show the suspect that the device works by “catching” the suspect, as it were. If the results of Ellson et al. were generalizable to the field situation, the very procedure which the interrogators use would actually defeat the purpose for which it was intended by making subsequent lies of the suspect even harder to detect.
Because the finding of Ellson et al. runs counter to traditional practical experience, it seemed plausible to assume that additional variables might be involved in the experimental situation. The study by Ellson et al. was therefore replicated by Gustafson and Orne⁶ with equivocal results. Postexperimental interviews with subjects revealed that many college students apparently believe that the lie detector works with normal individuals and that only habitual liars could deceive a polygraph. Given these beliefs, it was important for the student volunteers that they be detected. In that respect the situation of the experimental subjects differs markedly from that of the suspect being interrogated in a real life situation.
⁵ Personal communication, 1959.
⁶ Unpublished study, 1962.
Fortunately, with the information about what most experimental subjects believe, it is possible to manipulate these beliefs and thereby change the demand characteristics of the Ellson et al. study. Two groups of subjects were given different information about the effectiveness of the lie detector. One group was given information congruent with this widely held belief and told: “This is a detection of deception experiment. We are trying to see how well the lie detector works. As you know, it is not possible to detect lying in the case of psychopathic personalities or habitual liars. We want you to try your very best to fool the lie detector during this experiment. Good luck.” These instructions tried to maximize the kind of demand characteristics which might have been functioning in the Ellson et al. study, and it was assumed that the subjects would want to be detected in order to prove that they were not habitual liars. The other group was given information which prior work (Gustafson and Orne, 1963) had shown to be plausible and motivating; they were told, “This is a lie detection study and while it is extremely difficult to fool the lie detector, highly intelligent, emotionally stable, and mature individuals are able to do so.” The demand characteristics in this case were designed to maximize the wish to deceive.
From that point on, the two groups were treated identically. They drew a card from an apparently randomized deck; the card had a number on it which they were to keep secret. All possible numbers were then presented by a prerecorded tape while a polygraph recorded the subjects’ GSRs.
Table 5–1 Number of Successful and Unsuccessful Detections on Trial I for the Two Subgroups of the n Detected and n Deceive Groupsᵃ ᵇ

| Group and outcome | Told detected (subsequently) | Told not detected (subsequently) | χ² between columns 1 and 2 |
|---|---|---|---|
| ‘‘Need to be Detected Group’’ | | | χ² = 1.31 n.s. |
| Detected | 9 | 13 | |
| Not Detected | 7 | 3 | |
| ‘‘Need to Deceive Group’’ | | | χ² = 0.17 n.s. |
| Detected | 13 | 11 | |
| Not Detected | 3 | 5 | |
| χ² between n Detected and n Deceive Groups | χ² = 1.31 n.s. | χ² = 0.17 n.s. | |

Note. From: L. A. Gustafson and M. T. Orne, ‘‘Effects of perceived role and role success on the detection of deception,’’ Journal of Applied Psychology, 49, 1965, 412–417. Copyright (1965) by the American Psychological Association, and reproduced by permission.
ᵃ Note that Ss were not given information about the success of detection until after the trial on which these data are based.
ᵇ A multiple chi-square contingency analysis (Sutcliffe, 1957) was used to analyze the departures from expected frequencies in the entire table. Neither the chi-square components for each variable alone, nor the interaction between variables, were significant.

Table 5–2 Number of Successful and Unsuccessful Detections on Trial II for the Two Subgroups of the n Detected and n Deceive Groupsᵃ

| Group and outcome | Told detected (subsequently) | Told not detected (subsequently) | χ² between columns 1 and 2 |
|---|---|---|---|
| ‘‘Need to be Detected Group’’ | | | χ² = 10.28, p < .005 |
| Detected | 4 | 14 | |
| Not Detected | 12 | 2 | |
| ‘‘Need to Deceive Group’’ | | | χ² = 15.36, p < .001 |
| Detected | 15 | 3 | |
| Not Detected | 1 | 13 | |
| χ² between n Detected and n Deceive Groups | χ² = 12.96, p < .001 | χ² = 12.55, p < .001 | |

Note. From: L. A. Gustafson and M. T. Orne, ‘‘Effects of perceived role and role success on the detection of deception,’’ Journal of Applied Psychology, 49, 1965, 412–417. Copyright (1965) by the American Psychological Association, and reproduced by permission.
ᵃ A multiple chi-square contingency analysis here shows that neither information given, nor motivation (n Detect vs. n Deceive) have significant effects by themselves. The relevant chi-square values, calculated from partitioned subtables, are .25 (p > .95) and .00 respectively (df = 1). However, successful detection does depend significantly on the interaction between information and motivation (χ² = 30.94; p < .001; df = 1).

On the first such trial, the ‘‘detection ratios’’ (that is, the relative magnitudes of the critical GSR responses) in the two groups were not significantly different (see Table 5-1). When the first trial was over, the experimenter gave half the subjects in each group the impression that they had been detected, by telling them what their number had been. (The experimenter
had independent access to this information.) The other half were given the impression that they had fooled the polygraph, the experimenter reporting an incorrect number to them. A table of random numbers was used to determine, independent of his actual GSR, which kind of feedback each subject received. A second detection trial with a new number was then given. The dramatic effects of the feedback in interaction with the original instructions are visible in Table 5-2. Two kinds of subjects now gave large GSRs to the critical number: those who had wanted to be detected but yet had not been detected, and also those who had hoped to deceive and yet had not deceived. (This latter group is analogous to the field situation.) On the other hand, subjects whose hopes had been confirmed now responded less and thus became harder to detect, regardless of what those hopes had been. Those who had wanted to be detected, and indeed had been detected, behaved physiologically like those who had wanted to deceive and indeed had deceived. This effect is an extremely powerful but also an exceedingly subtle one. The differential pretreatment of groups is not apparent on the first trial. Only on the second trial do the manipulated demand characteristics produce clear-cut differential results, in interaction with the independent variable of feedback. Furthermore, we are dealing with a dependent measure which is often erroneously assumed to be outside of volitional control, namely a physiological response—in this instance, the GSR. This study serves as a link toward resolving the discrepancy between the laboratory findings of Ellson et al. (1952) and the experience of interrogators using the ‘‘lie detector’’ in real life. It appeared possible in this experiment to use simple variations in instructions as a means of varying demand characteristics. 
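The χ² values printed in Tables 5-1 and 5-2 can be checked from the cell counts alone. The sketch below is not taken from Gustafson and Orne. It assumes that each 2 × 2 comparison is a chi-square with Yates' continuity correction, an assumption made here only because it reproduces the printed figures. The function names and the dictionary layout are illustrative choices, not part of the source.

```python
def yates_chi_square(a, b, c, d):
    """Chi-square with Yates' continuity correction for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2 (the continuity correction), but never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


# (detected, not detected) counts transcribed from Tables 5-1 and 5-2.
# Keys are (group, feedback); the key names are labels chosen for this sketch.
TRIAL_1 = {
    ("n_detected", "told_detected"): (9, 7),
    ("n_detected", "told_not_detected"): (13, 3),
    ("n_deceive", "told_detected"): (13, 3),
    ("n_deceive", "told_not_detected"): (11, 5),
}
TRIAL_2 = {
    ("n_detected", "told_detected"): (4, 12),
    ("n_detected", "told_not_detected"): (14, 2),
    ("n_deceive", "told_detected"): (15, 1),
    ("n_deceive", "told_not_detected"): (3, 13),
}


def within_group(trial, group):
    """Compare the two feedback subgroups inside one instruction group."""
    return yates_chi_square(*trial[(group, "told_detected")],
                            *trial[(group, "told_not_detected")])


def between_groups(trial, feedback):
    """Compare the two instruction groups inside one feedback column."""
    return yates_chi_square(*trial[("n_detected", feedback)],
                            *trial[("n_deceive", feedback)])


for label, trial in (("Trial I", TRIAL_1), ("Trial II", TRIAL_2)):
    for group in ("n_detected", "n_deceive"):
        print(label, group, round(within_group(trial, group), 2))
    for feedback in ("told_detected", "told_not_detected"):
        print(label, feedback, round(between_groups(trial, feedback), 2))
```

Under this assumption every computed value agrees with the printed one to within 0.01. The sketch covers only the pairwise comparisons; it does not attempt the multiple contingency analysis (Sutcliffe, 1957) mentioned in the table notes.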
The success of the manipulation may be ascribed to the fact that the instructions themselves reflected views that emerged from interview data, and both sets of instructions were congruent with the experimental procedure. Only if instructions are plausible—a function of their congruence with the subjects’ past knowledge as well as with the experimental procedure—will
they be a reliable way of altering the demand characteristics. In this instance the instructions were not designed to manipulate the subjects’ attitude directly; rather they were designed to provide differential background information relevant to the experiment. This background information was designed to provide very different contexts for the subjects’ performance within the experiment. We believe this approach was effective because it altered the subjects’ perception of the experimental situation, which is the basis of demand characteristics in any experiment. It is relevant that the differential instructions in no way told subjects to behave differently. Obviously subjects in an experiment will tend to do what they are told to do (that is the implicit contract of the situation), and to demonstrate this would prove little. Our effort here was to create the kind of context which might differentiate the laboratory from the field situation and which might explain differential results in these two contexts. Plausible verbal instructions were one way of accomplishing this end. (Also see Cataldo, Silverman, and Brown, 1967; Kroger, 1967; Page and Lumia, 1968; Silverman, 1968.) Unless verbal instructions are very carefully designed and pretested they may well fail to achieve such an end. It can be extremely difficult to predict how, if at all, demand characteristics are altered by instructions, and frequently more subtle aspects of the experimental setting and the experimental procedure may become more potent determinants of how the study is perceived.
Dealing with Demand Characteristics

Studies such as the one described in which the demand characteristics are deliberately manipulated contribute little or nothing to the question of how they can be delineated. In order to design the lie detection experiment in the first place, a thorough understanding of the demand characteristics involved was essential. How can such an understanding be obtained? As was emphasized earlier, the problem arises basically because the human subject is an active organism and not a passive responder. For him, the experiment is a problem-solving situation to be actively handled in some way. To find out how he is trying to handle it, it has been found useful to take advantage of the same mental processes which would otherwise be confounding the data. Three techniques were proposed which do just that. Although apparently different, the three methods serve the same basic purpose. For reasons to be explained later, I propose to call them ‘‘quasi-controls.’’

Postexperimental Inquiry

The most obvious way of finding out something about the subject’s perception of the experimental situation is the postexperimental inquiry. It never fails to amaze me that some colleagues go to the trouble of inducing human subjects to participate in their experiments and then squander the major difference between man and animal: the ability to talk and reflect upon experience. To be sure, inquiry is not always easy. The greatest danger is the ‘‘pact of ignorance’’ (Orne, 1959a) which all too commonly characterizes the postexperimental discussion. The subject knows that if he has ‘‘caught on’’ to some apparent deception and has an excess of information about the experimental procedure he
may be disqualified from participation and thus have wasted his time. The experimenter is aware that the subject who knows too much or has ‘‘caught on’’ to his deception will have to be disqualified; disqualification means running yet another subject, still further delaying completion of his study. Hence, neither party to the inquiry wants to dig very deeply. The investigator, aware of these problems and genuinely more interested in learning what his subjects experienced than in the rapid collection of data, can, however, learn a great deal about the demand characteristics of a particular experimental procedure by judicious inquiry. It is essential that he elicit what the subject perceives the experiment is about, what the subject believes the investigator hopes and expects to find, how the subject thinks others might have reacted in this situation, etc. This information will help to reveal what the subject perceives to be a good response, good both in tending to validate the hypothesis of the experiment and in showing him off to his best advantage. To the extent that the subject perceives the experiment as a problem-solving situation where the subject’s task is to ascertain the experiment’s true nature, the inquiry is directed toward clarifying the subject’s beliefs about its true nature. When, as is often the case, the investigator will have told the subject in the beginning something about why the experiment is being carried out, it may well be difficult for the subject to express his disbelief since to do so might put him in the position of seeming to call the experimenter a liar. For reasons such as these, the postexperimental interview must be conducted with considerable tact and skill, creating a situation where the subject is able to communicate freely what he truly believes without, however, making him unduly suspicious or, worse yet, cueing him as to what he is to say. 
Using another investigator to carry out the inquiry will often maximize communication, particularly if the other investigator is seen as someone who is attempting to learn more about what the subject experiences. However, it is necessary to avoid having it appear as though the inquiry is carried out by someone who is evaluating the experimenter since the student subject may identify with what he sees to be the student experimenter and try to make him look good rather than describing his real experience. The situational factors which will maximize the subject’s communicating what he is experiencing are clearly exceedingly complex and conceptually similar to those which need to be taken into account in clinical situations or in the study of taboo topics. Examples of the factors are merely touched upon here. It would be unreasonable to expect a one-to-one relationship between the kind of data obtained by inquiry and the demand characteristics which were actually perceived by the subject in the situation. Not only do many factors militate against fully honest communication, but the subject cannot necessarily verbalize adequately what he may have dimly perceived during the experiment, and it is the dimly perceived factors which may exert the greatest effect on the subject’s experimental behaviors. More important than any of these considerations, however, is the fact that an inquiry may be carried out at the end of a complex experiment and that the subject’s perception of the experiment’s demand characteristics may have changed considerably during the experiment. For example, a subject might ‘‘catch on’’ to a verbal conditioning experiment only at the very end or even in retrospect during the inquiry itself, and he may then verbalize during the inquiry an awareness that will have had little or no effect on his performance during the experiment. For this reason, one may
wish to carry out inquiry procedures at significant junctures in a long experiment.⁷ This technique is quite expensive and time-consuming. It requires running different sets of subjects to different points in the experiment, stopping at these points as if the experiment were over (for these subjects it, in fact, is), and carrying out inquiries. While it would be tempting to use the same group of subjects and to continue to run them after the inquiry procedure, such a technique would in many instances be undesirable because exhaustive inquiries into the demand characteristics, as the subject perceives them at a given point in time, make him unduly aware of such factors subsequently. While inquiry procedures may appear time-consuming, in actual practice they are relatively straightforward and efficient. Certainly they are vastly preferable to finding at the conclusion of a large study that the data depend more on the demand characteristics than on the independent variables one had hoped to investigate. It is perhaps worth remembering that, investigators being human, it is far easier to do exhaustive inquiry during pilot studies when one is still motivated to find out what is really happening than in the late stages of a major investigation. Indeed this is one of the reasons why pilot investigations are an essential prelude to any substantive study.

Non-experiment

Another technique, and a very powerful one, for uncovering the demand characteristics of a given experimental design is the ‘‘pre-inquiry’’ (Orne, 1959a) or the ‘‘non-experiment.’’⁸ This procedure was independently proposed by Riecken (1962). A group of persons representing the same population from which the actual experimental subjects will eventually be selected are asked to imagine that they are subjects themselves. They are shown the equipment that is to be used and the room in which the experiment is to be conducted.
The procedures are explained in such a way as to provide them with information equivalent to that which would be available to an experimental subject. However, they do not actually go through the experimental procedure; it is only explained. In a non-experiment on a certain drug, for example, the participant would be told that subjects are given a pill. He would be shown the pill. The instructions destined for the experimental subjects would be read to him. The participant would then be asked to produce data as if he actually had been subjected to the experimental treatment. He could be given posttests or asked to fill out rating scales or requested to carry out any behavior that might be relevant for the actual experimental group. The non-experiment yields data similar in quality to inquiry material but obtained in the same form as actual subjects’ data. Direct comparison of non-experimental data and actual experimental data is therefore possible. But caution is needed. If these two kinds of data are identical, it shows only that the subject population in the actual experiment could have guessed what was expected of them. It does not tell us whether such guesses were the actual determinants of their behavior.

⁷ These results may also be conceptualized in terms of learning theory.
⁸ Ulric Neisser suggested this persuasive term.

Kelman (1965) has recently suggested that such a technique might appropriately be used as a social psychological tool to obviate the need for deception studies. While the economy of this procedure is appealing, and working in a situation where subjects become quasi-collaborators instead of objects to be manipulated is more satisfying to many of us, it would seem dangerous to draw inferences to the actual situation in real life from results obtained in this fashion. In fact, when subjects in pre-inquiry experiments perform exactly as subjects do in actual experimental situations, it becomes impossible to know the extent to which their performance is due to the independent variables or to the experimental situation.

In most psychological studies, when one is investigating the effect of the subject’s best possible performance in response to different physical or psychological stimuli, there is relatively little concern for the kind of problems introduced by demand characteristics. The need to concern oneself with these issues becomes far more pronounced when investigating the effect of various interventions such as drugs, psychotherapy, hypnosis, sensory deprivation, conditioning of physiological responses, etc., on performance or experiential parameters. Here the possibility that the subject’s response may inadvertently be determined by altered demand characteristics rather than the process itself must be considered. Equally subject to these problems are studies where attitude changes rather than performance changes are explored. The investigator’s intuitive recognition that subjects’ perceptions of an experiment and its meaning are very likely to affect the nature of their responses may have been one of the main reasons why deception studies have been so popular in the investigation of attitude change. Festinger’s cognitive dissonance theory (1957) has been particularly attractive to psychologists probably because it makes predictions which appear to be counterexpectational; that is, the predictions made on the basis of intuitive ‘‘common sense’’ appear to be wrong whereas those made on the basis of dissonance are both different and borne out by data.
Bem (1967) has shown in an elegant application of pre-inquiry techniques that the findings are not truly counterexpectational in the sense that subjects to whom the situation is described in detail but who are not really placed in the situation are able to produce data closely resembling those observed in typical cognitive dissonance studies. On the basis of these findings, Bem (1967) appropriately questions the assertion that the dissonance theory allows counterexpectational predictions. His use of the pre-inquiry effectively makes the cognitive dissonance studies it replicates far less compelling by showing that subjects could figure out the way others might respond. It would be unfortunate to assume that Bem’s incisive critique of the empirical studies with the pre-inquiry technique makes further such studies unnecessary. On the contrary, his findings merely show that the avowed claims of these studies were not, in fact, achieved and provide a more stringent test for future experiments that aim to demonstrate counterexpectational findings. It would appear that we are in the process of completing a cycle. At one time it was assumed that subjects could predict their own behavior, that in order to know what an individual would do in a given situation it would suffice merely to ask him. It became clear, however, that individuals could not always predict their behavior; in fact, serious questions about the extent to which they could make any such predictions were raised when studies showing differences between what individuals thought they do and what they, in fact, do became fashionable. With a sophisticated use of the pre-inquiry technique Bem (1967) has shown that individuals have more knowledge about what they might do than has been ascribed to them by psychologists. Although it is possible to account for a good deal of variance in behavior in this way, it is clear that it will not account for all of the variance. We are confronted now with a peculiar paradox. 
When
pre-inquiry data correctly predict the performance of the subject in the actual experiment (the situation that is most commonly encountered), the experimental findings strike us as relatively trivial, in part, because at best we have validated our intuitive common sense but also because we cannot exclude the nagging doubt that the subject may have merely been responsive to the demand characteristics in the actual experiment. Only when we succeed in setting up an experiment where the results are counterexpectational in the sense that a pre-inquiry would yield different findings from those obtained from the subjects in the actual situation can we be relatively comfortable that these findings represent the real effects of the experimental treatment rather than being subject to alternative explanations. For the reasons discussed above, pre-inquiry can never supplant the actual investigation of what subjects do in concrete situations although, adroitly executed, it becomes an essential tool to clarify these findings.

Simulators

This principle can be carried one step further to provide yet another method for uncovering demand characteristics: the use of simulators (Orne, 1959a). Subjects are asked to pretend that they have been affected by an experimental treatment which they did not actually receive or to which they are immune. For subjects to be able to do this, it is crucial that they be run by another experimenter who they are told is unaware of their actual status, and who in fact really is unaware of their status. It is essential that the subjects be aware that the experimenter is blind as well as that the experimenter actually be blind for this technique to be effective. Further, the fact that the experimenter is ‘‘blind’’ has the added advantage of forcing him to treat simulators and actual subjects alike.
This technique has been used extensively in the study of hypnosis (e.g., Damaser, Shor, and Orne, 1963; Orne, 1959a; Orne and Evans, 1965; Orne, Sheehan, and Evans, 1968). For an extended discussion, see Orne (1968). It is possible for unhypnotized subjects to deceive an experimenter by acting as though they had been hypnotized. Obviously, it is essential that the simulators be given no special training relevant to the variables being studied, so that they have no more information than what is available to actually hypnotized subjects. The simulating subjects must try to guess what real subjects might do in a given experimental situation in response to instructions administered by a particular experimenter. This design permits us to separate experimenter bias effects from demand characteristic effects. In addition to his other functions, the experimenter may be asked to judge whether each subject is a ‘‘real’’ or a simulator. This judgment tends to be random and unrelated to the true status of the subjects. Nevertheless, we have often found differences between the behaviors of subjects contingent on whether or not the experimenter judges that they are hypnotized or just simulating. Such differences may be ascribed to differential treatment and bias, whereas differences between actually hypnotized subjects and actual simulators are likely to be due to hypnosis itself. Again, results obtained with this technique need careful evaluation. It is important not to jump to a negative conclusion if no difference is found between deeply hypnotized subjects and simulators. Such data are not evidence that hypnosis consists only of a reaction to demand characteristics. It may well have special properties. But so long as a given form of behavior is displayed as readily by simulators as by ‘‘reals,’’ our procedure has failed to demonstrate those properties. The problem here
is the same as that discussed earlier in the pre-inquiry. Most likely there will be many real effects due to hypnosis which can be mimicked successfully by simulators. However, only when we are able to demonstrate differences in behavior between real and simulating subjects do we feel that an experiment is persuasive in demonstrating that a given effect is likely to be due to the presence of hypnosis.
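The two contrasts that the blind-experimenter design separates can be illustrated with a small sketch. Each subject carries two labels: actual status (hypnotized or simulating) and the status the experimenter guessed. Averaging a response measure over one label while ignoring the other gives the contrast attributed to hypnosis itself and the contrast attributed to differential treatment and bias. The record layout, the function name `marginal_difference`, and the four scores are invented placeholders for illustration, not data from any of the studies cited.

```python
from statistics import mean


def marginal_difference(records, factor):
    """Mean score of subjects labelled 'real' on `factor` minus mean of those labelled 'simulator'."""
    reals = [r["score"] for r in records if r[factor] == "real"]
    sims = [r["score"] for r in records if r[factor] == "simulator"]
    return mean(reals) - mean(sims)


# Invented placeholder records: one subject per cell of the 2x2 design.
records = [
    {"actual": "real", "judged": "real", "score": 8},
    {"actual": "real", "judged": "simulator", "score": 6},
    {"actual": "simulator", "judged": "real", "score": 5},
    {"actual": "simulator", "judged": "simulator", "score": 3},
]

# Contrast on true status: the difference the text treats as likely due to hypnosis.
actual_effect = marginal_difference(records, "actual")
# Contrast on the experimenter's guess: the difference ascribed to treatment and bias.
judged_effect = marginal_difference(records, "judged")
print(actual_effect, judged_effect)
```

Because the experimenter's judgments are described as unrelated to true status, the two contrasts are roughly independent in such a design. A contrast near zero on actual status would mean only that the procedure failed to demonstrate special properties of hypnosis, not that none exist.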
Quasi-Controls: Techniques for the Evaluation of Experimental Role Demands

The three techniques discussed above are not like the usual control groups in psychological research. They ask the subject to participate actively in uncovering explicit information about possible demand characteristic effects. The quasi-control subject steps out of his traditional role, because the experimenter redefines the interaction between them to make him a co-investigator instead of a manipulated object. Because the quasi-control is outside of the usual experimenter-subject relationship, he can reveal the effects of this relationship in a new perspective. An inquiry, for example, takes place only after the experiment has been defined as ‘‘finished,’’ and the subject joins the experimenter in reflecting on his own earlier performance as a subject. In the non-experiment, the quasi-control cooperates with the experimenter in second-guessing what real subjects might do. Most dramatically, the simulating subject reverses the usual relationship and deceives the experimenter. It is difficult to find an appropriate term for these procedures. They are not, of course, classical control groups since, rather than merely omitting the independent variable, the groups are treated differently. Thus we are dealing with treatment groups that facilitate inference about the behavior of both experimental and control groups. Because these treatment groups are used to assess the effect that the subject’s perception of being under study might have upon his behavior in the experimental situation, they may be conceptualized as role demand controls in that they clarify the demand characteristic variables in the experimental situation for the particular subject population used. As quasi-controls, the subjects are required to participate and utilize their cognitive processes to evaluate the possible effect that thinking about the total situation might have on their performance.
They could, in this sense, be considered active, as opposed to passive, controls. A unique aspect of quasi-controls is that they do not permit inference to be drawn about the effect of the independent variable. They can never prove that a given finding in the experimental group is due to the demand characteristics of the situation. Rather, they serve to suggest alternative explanations not excluded by the experimental design employed. The inference from quasi-control data, therefore, primarily concerns the adequacy of the experimental procedure. In this sense, the term design control or evaluative control would be justified. Since each of these various terms focuses upon different but equally important aspects of these comparison groups, it would seem best to refer to them simply as quasi-controls. This explicitly recognizes that we are not dealing with control groups in the true sense of the word and are using the term analogously to the way in which Campbell and Stanley (1963) have used the term quasi-experiments. However, while they think of quasi-experiments as doing the best one can in situations where ‘‘true experiments’’ cannot be carried out, the concept of quasi-controls is intended to refer specifically to techniques for the assessment of demand characteristic variables in
order to evaluate how such factors might affect the experimental outcome. The term ‘‘quasi-’’ in this context says that these techniques are similar to, but not really, control groups. It does not mean that these groups are any less important in helping to evaluate the data obtained from human subjects. In bridging the gap from the laboratory experiments to situations where the individual does not perceive himself to be a subject under investigation, techniques of this kind are of vital importance. It is frequently pointed out that investigators often discuss the experimental procedures with colleagues in order to clarify their meaning. Certainly many problems in experimental design will be obvious only to expert colleagues. These types of issues have typically been discussed in the context of quantitative methods and have led to some more elaborate techniques of experimental design. There is no question that expert colleagues are sensitive to order effects, baseline phenomena, practice effects, sampling procedures, individual differences, and so on, but how a given subject population would, in fact, perceive an experimental procedure is by no means easily accessible to the usual tools of the psychologists. Whether in a deception experiment the subject may be partially or fully aware of what is really going on is a function of a great many cues in the situation not easily explicated, and the prior experience of the subject population which might in some way be relevant to the experiment is also not easily ascertained or abstracted by any amount of expert discussion. The use of quasi-controls, however, allows the investigator to estimate these factors and how they might affect the experimental results.
The kind of factors which we are discussing here relate to the manner in which subjects are solicited (for example, the wording of an announcement in an ad), the manner in which the secretary or research assistant answers questions about the proposed experiment when subjects call in to volunteer, the location of the experiment (i.e., psychiatric hospital versus aviation training school), and, finally, a great many details of the experimental procedure itself which of necessity are simplified in the description, not to speak of the subtle cues made available by the investigator himself. Quasi-controls are designed to evaluate the total impact of these various cues upon the particular kind of population which is to be used. It will be obvious, of course, that a verbal conditioning experiment carried out with psychology students who have been exposed to the original paper is by no means the same as the identical experiment carried out with students who have not been exposed to this information. Again, quasi-controls allow one to estimate what the demand characteristics might be for the particular subject population being used. Quasi-controls serve to clarify the demand characteristics but they can never yield substantive data. They cannot even prove that a given result is a function of demand characteristics. They provide information about the adequacy of an investigative procedure and thereby permit the design of a better one. No data are free of demand characteristics but quasi-controls make it possible to estimate their effect on the data which we do obtain.
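The asymmetry of inference described in this section can be sketched as a comparison between quasi-control data (for example, pre-inquiry or simulator scores) and data from the actual experiment. The sketch below uses an exact permutation test on the difference in means, with a non-significant result standing in loosely for "the two kinds of data are identical"; both that choice and the 0.05 threshold are assumptions made for illustration, and a formal equivalence test would be stricter. The function names and the scores are hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    extreme = 0
    total = 0
    # Enumerate every way of relabelling the pooled scores into two groups.
    for chosen in combinations(range(len(pooled)), len(group_a)):
        chosen = set(chosen)
        a = [x for i, x in enumerate(pooled) if i in chosen]
        b = [x for i, x in enumerate(pooled) if i not in chosen]
        total += 1
        # Small tolerance so that ties with the observed difference count as extreme.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total


def permissible_reading(p_value, alpha=0.05):
    """Map the comparison onto the cautious readings described in the text."""
    if p_value < alpha:
        return ("Quasi-control and experimental data differ: guessing what was "
                "expected does not by itself reproduce the experimental result.")
    return ("No reliable difference: demand characteristics remain a possible "
            "alternative explanation, but they have not been shown to be the cause.")


# Invented placeholder scores, for illustration only.
pre_inquiry = [4, 5, 6, 5]
experimental = [5, 4, 6, 5]
p = exact_permutation_p(pre_inquiry, experimental)
print(p, permissible_reading(p))
```

Neither branch returns a substantive conclusion about the independent variable, which mirrors the point that quasi-controls speak only to the adequacy of the experimental procedure. Exhaustive enumeration is practical only for small groups; larger samples would need random resampling instead.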
The Use of Quasi-Controls to Make Possible a Study Manipulating Demand Characteristics

When extreme variations of experimental procedures are still able to elicit surprisingly similar results or identical experimental procedures carried out in different laboratories yield radically different results, the likelihood of demand characteristic effects must be seriously considered. An area of investigation characterized in this
way was the early studies on ‘‘sensory deprivation.’’ The initial findings attracted wide attention because they not only had great theoretical significance for psychology but seemed to have practical implications for the space program as well. A review of the literature indicated that dramatic hallucinatory effects and other perceptual changes were typically observed after the subject had been in the experiment approximately two-thirds of the total time; however, it seemed to matter relatively little whether the total time was three weeks, two weeks, three days, two days, twenty-four hours, or eight hours. Clearly, factors other than physical conditions would have to account for such discrepancies. As a first quasi-control we interviewed subjects who had participated in such studies.⁹ It became clear that they had been aware of the kind of behavior that was expected of them. Next, a pre-inquiry was carried out, and, from participants who were guessing how they might respond if they were in a sensory deprivation situation, we obtained data remarkably like those observed in actual studies.¹⁰ We were then in a position to design an actual experiment in which the demand characteristics of sensory deprivation were the independent variables (Orne and Scheibe, 1964). Our results showed that these characteristics, by themselves, could produce many of the findings attributed to the condition of sensory deprivation. In brief, one group of the subjects was run in a ‘‘meaning deprivation’’ study which included the accoutrements of sensory deprivation research but omitted the condition itself. They were required to undergo a physical examination, provide a short medical history, and sign a release form; they were ‘‘assured’’ of the safety of the procedure by the presence of an emergency tray containing various syringes and emergency drugs, and were taken to a well-lighted cubicle, provided food and water, and given an optional task.
After taking a number of pretests, the subjects were told that if they heard, saw, smelled, or experienced anything strange they were to report it through the microphone in the room. They were again reassured and told that if they could not stand the situation any longer or became discomforted they merely had to press the red ‘‘panic button’’ in order to obtain immediate release. They were then subjected to four hours of isolation in the experimental cubicle and given posttests. The control subjects were told that they were controls for a sensory deprivation study and put in the same objective conditions as the experimental subjects. Table 5-3 summarizes the findings which indicate that manipulation of the demand characteristics by themselves could produce many findings that had previously been ascribed to the sensory deprivation condition. Of course, neither the quasi-controls nor the experimental manipulation of the demand characteristics sheds light on the actual effects of the condition of sensory deprivation. They do show that demand characteristics may produce similar effects to those ascribed to sensory deprivation.
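One way to read the pretest and posttest means in Table 5-3 is to compute each group's change and compare the two changes. The sketch below does only that arithmetic for a subset of rows transcribed from the table. It is not the analysis Orne and Scheibe report: the F, t, and U statistics in the table cannot be recovered from group means alone, and a difference in mean change says nothing about significance. The dictionary layout and function names are illustrative choices.

```python
# (pretest mean, posttest mean) for a subset of rows transcribed from Table 5-3.
TABLE_5_3_MEANS = {
    "Mirror Tracing (errors)": {"experimental": (28.1, 19.7), "control": (35.8, 15.2)},
    "Word Recognition (N correct)": {"experimental": (17.3, 15.6), "control": (15.2, 12.3)},
    "Digit Symbol (N correct)": {"experimental": (98.2, 109.9), "control": (99.2, 111.9)},
    "Tapping speed (N completed)": {"experimental": (33.9, 32.2), "control": (32.9, 35.0)},
    "Tracing speed (N completed)": {"experimental": (55.6, 52.3), "control": (53.1, 58.4)},
}


def mean_change(pre, post):
    """Posttest mean minus pretest mean, rounded to the table's one decimal place."""
    return round(post - pre, 1)


def change_by_group(table):
    """Return {test: {group: change in mean}} for every row in the table."""
    return {
        test: {group: mean_change(pre, post) for group, (pre, post) in groups.items()}
        for test, groups in table.items()
    }


for test, changes in change_by_group(TABLE_5_3_MEANS).items():
    gap = round(changes["experimental"] - changes["control"], 1)
    print(f"{test}: experimental {changes['experimental']:+}, "
          f"control {changes['control']:+}, difference in change {gap:+}")
```

On tracing speed, for example, the experimental mean falls by 3.3 while the control mean rises by 5.3, which is the direction of effect that would be expected if the "meaning deprivation" instructions alone impaired performance.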
The Problem of Inference
Great care must be taken in drawing conclusions from experiments of this kind. In the case of the sensory deprivation study, the demand characteristics of the laboratory and those which might be encountered by individuals outside of the laboratory differ radically. In other situations, however, such as in the case of hypnosis, the expectations of subjects about the kind of behavior hypnosis ought to elicit in the laboratory
9 Unpublished study.
10 Stare, F., Brown, J., and Orne, M. T. Demand characteristics in sensory deprivation studies. Unpublished seminar paper, Massachusetts Mental Health Center and Harvard University, 1959.
Table 5-3 Summary and Analysis of Ten Tests for Control and Experimental Groups

Test and group                              Pretest M   Posttest M   Difference statistic
Mirror Tracing (errors)                                              F = 1.67a
  Experimental                              28.1        19.7
  Control                                   35.8        15.2
Spatial Orientation
  Angular deviation                                                  F = .25a
    Experimental                            45.7        53.9
    Control                                 52.5        59.1
  Linear deviation                                                   F = 3.34b
    Experimental                            5.3         5.4
    Control                                 6.4         5.7
Word Recognition (N correct)                                         t = .50
  Experimental                              17.3        15.6
  Control                                   15.2        12.3
Reversible Figure (rate per minute)                                  F = 1.54a
  Experimental                              29.0        35.0
  Control                                   20.1        25.0
Digit Symbol (N correct)                                             F = .05a
  Experimental                              98.2        109.9
  Control                                   99.2        111.9
Mechanical Ability
  Tapping speed (N completed)                                        F = 2.26
    Experimental                            33.9        32.2
    Control                                 32.9        35.0
  Tracing speed (N completed)                                        F = 4.57b
    Experimental                            55.6        52.3
    Control                                 53.1        58.4
  Visual pursuit (N completed)                                       F = .22a
    Experimental                            5.7         8.9
    Control                                 5.7         9.2
Simple Forms (N increment distortions)                               U = 19c
  Experimental                              —           3.1
  Control                                   —           0.8
Size Constancy (change in steps)                                     t = 1.03a
  Experimental                              —           0.6
  Control                                   —           0.0
Spiral Aftereffect
  Duration, seconds                                                  F = .99a
    Experimental                            24.4        27.1
    Control                                 15.6        16.1
  Absolute Change                                                    t = 3.38d
    Experimental                            —           7.0
    Control                                 —           2.7
Logical Deduction (N correct)                                        t = 1.64
  Experimental                              —           20.3
  Control                                   —           22.1

Note. F = adjusted postexperimental scores, analysis of covariance; t = t tests; U = Mann-Whitney U test, where plot of data appeared grossly abnormal. (From: M. T. Orne and K. E. Scheibe, “The contribution of nondeprivation factors in the production of sensory deprivation effects: The psychology of the ‘panic button,’” Journal of Abnormal and Social Psychology, 68, 1964, 3–12. Copyright [1964] by the American Psychological Association, and reproduced by permission.)
a Indicates differences between groups were in predicted direction.
b p < .05, one-tailed.
c p = .01, one-tailed.
d p < .001, nondirectional measure.
are similar to the kind of expectations which patients might have about being hypnotized for therapeutic purposes. To the extent that the hypnotized individual’s behavior is determined by these expectations we might obtain similar findings in certain laboratory contexts and certain therapeutic situations. When demand characteristics become a significant determinant of behavior, valid and accurate predictions can be made only about another situation where the same kind of demand characteristics prevails. In the case of sensory deprivation studies, accurate predictions would therefore not be possible; but even in the studies with hypnosis, we might still be observing an epiphenomenon which is present only as long as consistent and stable expectations and beliefs are present. In order to get beyond such an epiphenomenon and find intrinsic characteristics, it is essential that we evaluate the effect that demand characteristics may have. To do this we must seek techniques specifically designed to estimate the likely extent of such effects.
Psychopharmacological Research as a Model for the Psychological Experiment
What are here termed the demand characteristics of the experimental situation are closely related to what the psychopharmacologist considers a placebo effect, broadly defined. The difficulty in determining what aspects of a subject’s performance may legitimately be ascribed to the independent variable as opposed to those which might be due to the demand characteristics of the situation is similar to the problem of determining what aspects of a drug’s action are due to pharmacological effect and what aspects are due to the subject’s awareness that he has been given a drug. Perhaps because the conceptual distinction between a drug effect and the effect of psychological factors is readily made, perhaps because of the relative ease with which placebo controls may be included, or most likely because of the very significant consequences of psychopharmacological research, considerable effort has gone into differentiating pharmacological action from placebo effects. A brief review of relevant observations from this field may help clarify the problem of demand characteristics. In evaluating the effect of a drug it has long been recognized that a patient’s expectations and beliefs may have profound effects on his experiences subsequent to
the taking of the drug. It is for this reason that the use of placebos has been widespread. The extent of the placebo effect is remarkable. Beecher (1959), for example, has shown that in battlefield situations saline solution by injection has 90 per cent of the effectiveness of morphine in alleviating the pain associated with acute injury. In civilian hospitals, postoperatively, the placebo effect drops to 70 per cent of the effectiveness of morphine, and with subsequent administrations drops still lower. These studies show not only that the placebo effect may be extremely powerful, but that it will interact with the experimental situation in which it is being investigated. It soon became clear that it was not sufficient to use placebos so long as the investigator knew to which group a given individual belonged. Typically, when a new, presumably powerful, perhaps even dangerous medication is administered, the physician takes additional care in watching over the patient. He tends to be not only particularly hopeful but also particularly concerned. Special precautions are instituted, nursing care and supervision are increased, and other changes in the regime inevitably accompany the drug’s administration. When a patient is on placebo, even if an attempt is made to keep the conditions the same, there is a tendency to be perfunctory with special precautions, to be more cavalier with the patient’s complaints, and in general to be less concerned and interested in the placebo group. For these reasons, the doctor, as well as the patient, is required to be blind as to the true nature of a drug; otherwise differential treatment could well account for some of the observed differences between drug and placebo (Modell and Houde, 1958). The problems discussed here would be conceptualized in social psychological terms as E-bias effects or differential E-outcome expectations. 
What would appear at first sight to be a simple problem (to determine the pharmacological action of a drug as opposed to those effects which may be attributed to the patient’s awareness that he is being treated by presumably effective medication) turns out to be extremely difficult. Indeed, as Ross, Krugman, Lyerly, and Clyde (1962) have pointed out, and as discussed by Lana (Chapter 4), the usual clinical techniques can never evaluate the true pharmacological action of a drug. In practice, patients are given a drug and realize that they are being treated; therefore one always observes the pharmacological action of the drug confounded with the placebo effect. The typical study with placebo controls compares the effect of placebo and drug versus the effect of placebo alone. Such a procedure does not get at the psychopharmacological action of the drug without the placebo effect, i.e., the patient’s awareness that he is receiving a drug. Ross et al. elegantly demonstrate this point by studying the effect of chloral hydrate and amphetamine in a 3 × 3 design. Amphetamine, chloral hydrate, and placebo were used as three agents with three different instructions: (a) administering each capsule with a brief description of the amphetamine effect, (b) administering each capsule with a brief description of the chloral hydrate effect, and (c) administration without the individual’s awareness that a drug was being administered. Their data clearly demonstrate that drug effects interact with the individual’s knowledge that a drug is being administered. For clinical psychopharmacology, the issues raised by Ross et al. are somewhat academic since in medical practice one is almost always dealing with combinations of placebo components and drug effects. Studies evaluating the effect of drugs are intended to draw inference about how drugs work in the context of medical practice.
To the extent that one would be interested in the psychopharmacological effect as such—that is, totally removed from the medical context—the type of design Ross et al. utilized would be essential.
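The structure of the design attributed to Ross et al. can be laid out as a small sketch. The code below simply enumerates the nine agent-by-instruction cells and picks out two comparisons: the one a typical placebo-controlled study makes, and the one that holds awareness at zero. The label strings and function names are hypothetical conveniences, and mapping the "typical study" onto a pair of cells is an interpretation of the passage above rather than something Ross et al. state in this form.

```python
from itertools import product

# Three agents and three instruction conditions, as described in the text.
# The label strings are illustrative shorthand.
AGENTS = ("amphetamine", "chloral hydrate", "placebo")
INSTRUCTIONS = (
    "described as amphetamine",
    "described as chloral hydrate",
    "no awareness of a drug",
)


def design_cells():
    """Return every agent-by-instruction cell of the 3 x 3 design."""
    return list(product(AGENTS, INSTRUCTIONS))


def conventional_contrast(agent, described_as):
    """Cells compared in a typical placebo-controlled study.

    Both groups believe they are taking a drug, so the comparison is
    drug plus awareness versus awareness alone.
    """
    return (agent, described_as), ("placebo", described_as)


def unaware_contrast(agent):
    """Cells that compare drug and placebo with awareness removed."""
    unaware = INSTRUCTIONS[2]
    return (agent, unaware), ("placebo", unaware)


for cell in design_cells():
    print(cell)
```

Only the full set of cells shows whether a drug's effect changes with what the subject has been told, which is the interaction the passage describes.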
In psychology, experiments are carried out in order to determine the effect of an independent variable so that it will be possible to draw inference to non-experimental situations. Unfortunately the independent variables tend to be studied in situations that are explicitly defined as experimental. As a result, one observes the effect of an experimental context in interaction with a particular independent variable versus the effect of the experimental context without this variable. The problem of the experimental context in which an investigation is carried out is perhaps best illustrated in psychopharmacology by research on the effects of meprobamate (known under the trade names of Equanil and Miltown). Meprobamate had been established as effective in a number of clinical studies but, when carefully controlled investigations were carried out, it did not appear to be more efficacious than placebo. The findings from carefully controlled studies appeared to contradict a large body of clinical observations which one might have a tendency to discount as simply due to placebo effect. It remained for Fisher, Cole, Rickels, and Uhlenhuth (1964) to design a systematic investigation to clarify this paradox, using physicians displaying either a ‘‘scientific,’’ skeptical attitude toward medication or enthusiasm about the possible help which the drug would yield. The study was run double-blind. The patients treated by physicians with a ‘‘scientific’’ attitude toward medication showed no difference between drug and placebo; however, those treated by enthusiastic physicians clearly demonstrated an increased effectiveness of meprobamate! It would appear that there is a ‘‘real’’ drug effect of meprobamate which may, however, be totally obscured by the manner in which the drug is administered. The effect of the drug emerges only when medication is administered with conviction and enthusiasm. 
The striking interaction between the drug effects and situation-specific factors not only points to limitations in conclusions drawn from double-blind studies in psychopharmacology but also has broad methodological implications for the experimental study of psychological processes. An example of these implications from an entirely different area is the psychotherapy study by Paul (1966) which showed differences in improvement between individuals expecting to be helped at some time in the future and a matched control group who were not aware that they were included in the research.
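The logic of the meprobamate finding described above is that of an interaction: the drug-versus-placebo difference depends on the physician's attitude. A minimal sketch of that contrast follows. The cell means are invented placeholders on an arbitrary improvement scale, chosen only to show the arithmetic; they are not results from Fisher, Cole, Rickels, and Uhlenhuth (1964), and the function names are hypothetical.

```python
# Hypothetical cell means on an arbitrary improvement scale.
# These numbers are placeholders for illustration only; they are NOT
# data from Fisher, Cole, Rickels, and Uhlenhuth (1964).
EXAMPLE_MEANS = {
    ("enthusiastic", "drug"): 8,
    ("enthusiastic", "placebo"): 5,
    ("skeptical", "drug"): 4,
    ("skeptical", "placebo"): 4,
}


def drug_minus_placebo(means, attitude):
    """Drug-versus-placebo difference among physicians with one attitude."""
    return means[(attitude, "drug")] - means[(attitude, "placebo")]


def interaction_contrast(means):
    """Difference between the two drug-versus-placebo differences."""
    return (drug_minus_placebo(means, "enthusiastic")
            - drug_minus_placebo(means, "skeptical"))


print(drug_minus_placebo(EXAMPLE_MEANS, "enthusiastic"))  # 3
print(drug_minus_placebo(EXAMPLE_MEANS, "skeptical"))     # 0
print(interaction_contrast(EXAMPLE_MEANS))                # 3
```

A study run only with skeptical physicians would observe the zero difference and could miss a drug effect that appears under a different manner of administration.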
Dealing with the Placebo Effect: An Analogy to Dealing with Demand Characteristics
Drug effects that are independent of the patient’s expectations, beliefs, and attitudes can of course be studied with impunity without concern about the psychological effects that may be attributed to the taking of medication. For example, the antipyretic (fever-reducing) effect of aspirin is less likely to be influenced by the patient’s beliefs and expectations than is the analgesic effect, though even here an empirical approach is considerably safer than a priori assumptions. Of greatest relevance are the psychological effects of drugs. The problems encountered in studying these effects, while analogous to those inherent in other kinds of psychological research, seem more evident here. Since the drug constitutes a tangible independent variable (subject to study by pharmacological techniques), it is conceptually easily distinguished from another set of independent variables, psychological in nature, that also may play a crucial part in determining the patient’s response.
The totality of these non-drug effects, which are a function of the patient’s expectations and beliefs in interaction with the medical procedures that are carried out, the doctor’s expectations, and the manner in which he is treated, has been conceptualized as the placebo effect. This is, of course, analogous to the demand characteristic components in psychological studies; the major difference is that the concept of placebo component directly derives from methodological control procedures used to evaluate it. The placebo is intended to produce the same attitudes, expectations, and beliefs of the patient as would the actual drug. The double-blind technique is designed to equate the environmental cues which would interact with these attitudes. For this model to work, it is essential that the placebo provide subjective side effects analogous to the actual drug lest the investigator and physician be blind but the patient fully cognizant that he is receiving a placebo. For these reasons an active placebo should be employed which mimics the side effects of the drug without exerting a central pharmacological action. With the use of active placebos administered by physicians having appropriate clinical attitudes in a double-blind study, a technically difficult but conceptually straightforward technique is available for the evaluation of the placebo effect. This approach satisfies the assumptions of the classical experimental model. One group of patients responds to the placebo effect and the drug, the other group to the placebo effect alone, which permits the investigator to determine the additive effect that may be attributed to the pharmacological action of the drug. Unfortunately such an ideal type of control is not generally available in the study of other kinds of independent variables. This is particularly true regarding the context of such studies.
Thus, the placebo technique can be applied in clinical settings where the patient is not aware that he is the object of such study whereas psychological studies most frequently are recognized as such by our subjects who typically are asked to volunteer. Because a true analog to the placebo is not readily available, quasi-control techniques are being proposed to bridge the inferential gap between experimental findings and the influence of the experimental situation upon the subject who is aware that he is participating in an experiment. The function of quasi-controls to determine the possible contextual effects of an experimental situation is perhaps clarified best when we contrast them with the use of placebos in evaluating possible placebo effect. Assume that we wish to evaluate an unknown drug purporting to be a powerful sedative and that neither pharmacological data nor placebo controls are available as methodological tools. All that we are able to do is to administer the drug under a variety of conditions and observe its effects. This is in many ways analogous to the kind of independent variable that we normally study in psychology. In fact, in this example the unknown drug will be sodium amytal, a powerful hypnotic with indisputable pharmacological action. On giving the drug the first time, with considerable trepidation of course, we might well observe relatively little effect. Then as we get used to the drug a bit we might see it causes relaxation, a lessening of control, perhaps even some slurring of speech; in fact, some of the kind of changes typically associated with alcohol. At this point, working with relatively small dosages of the drug, we would find that there were wide individual differences in response, some individuals actually becoming hyperalert, and one might wonder to what extent the effects could be related to subjects’ beliefs and expectations. 
Under these circumstances the inquiry procedures discussed earlier could be carried out after the drug had been given. One
would focus the inquiry on what the subject feels the drug might do, the kind of side effects he might expect, what he anticipated he would experience subsequent to taking the drug, what he thought we would have expected to happen, what he believed others might have experienced after taking the drug, etc. Data of this kind might help shed light on the patient’s behavior. Putting aside the difficulty of interpreting inquiry material, and assuming we are capable of obtaining a good approximation of what the subject really perceived, we are still not in a position to determine the extent to which his expectations actually contributed to the effects that had been observed. Consider if a really large dose of amytal had been given: essentially all subjects would have gone to sleep and would most likely have correctly concluded they had been given a sleeping pill—the inquiry data in this instance being the result of the observed effect rather than the cause of it. Inquiry data would become suggestive only if (in dealing with relatively small dosages) it were found that subjects who expected or perceived that we expected certain kinds of effects did in fact show these effects whereas subjects who had no such expectations failed to show the effects. Even if we obtained such data, however, it would still be unclear whether the subject’s perceptions were post hoc or propter hoc. The most significant use of inquiry material would be in facilitating the recognition of those cues in the situation which might communicate what is expected to the subject so that these cues could be altered systematically. Neither subject nor investigator is really in a position to evaluate how much of the total effect may legitimately be ascribed to the placebo response and how much to drug effect. 
Evaluation becomes possible only after subsequent changes in procedure can be shown to eliminate certain effects even though the same drug is being administered, or, conversely, subjects’ perceptions upon inquiry are changed without changing the observed effect. The approach then would be to compare the effect of the drug in interaction with different sets of demand characteristics in order to estimate how much of the total effect can reasonably be ascribed to demand characteristic components. (The paper previously mentioned by Ross et al. [1962] reports precisely such a study with amytal and showed clear-cut differences.) It is clear that the quasi-control of inquiry can only serve to estimate the adequacy of the various design modifications. Inference about these changes must be based on effects which the modifications are shown to produce in actual studies of subjects’ behavior. The non-experiment can be used in precisely the same manner. It has the advantage and disadvantage of eliminating cues from the drug experience. Here one would explain to a group of subjects drawn from the usual subject pool precisely what is to be done, show them the drug that is to be taken, give them the identical information provided to those individuals who actually take the drug, and, finally, ask them to perform on the tests to be used as if they had received the drug. This procedure has the advantage that the experimenter need no longer infer what the subject could have deduced about what was expected and how these perceptions could then have affected his performance. Instead of requiring the experimenter to interpret inquiry data and make many assumptions about how presumed attitudes and beliefs could manifest themselves on the particular behavioral indices used, the subject provides the experimenter with data in a form identical to that provided by those individuals who actually take the drug. 
The fact that the non-experimental subject yields data in the form identical to that yielded by the actual experimental subject must not, however, seduce the investigator into believing that the data are in other ways equivalent. Inference from such a
procedure about the actual demand characteristic components of the drug effect would need to be guarded indeed. Such findings merely indicate that sufficient cues are present in the situation to allow a subject to know what is expected and these could, but need not, be responsible for the data. To illustrate with our example, if in doing the non-experiment one tells the subject he will be receiving three sleeping capsules and then asks him to do a test requiring prolonged concentration, the subject is very likely to realize that he ought to perform as though he were quite drowsy and yield a significantly subnormal performance. The fact that these subjects do behave like actual subjects receiving three sleeping capsules of sodium amytal would not negate the possible real drug effects which, in our example, are known to be powerful. The only thing it indicates is that the experimental procedure allows for an alternative explanation and needs to be refined. Again, the non-experiment would facilitate such refinement: if subjects instead of being told that they would receive sleeping capsules were told we are investigating a drug designed to increase peripheral blood flow and were given a description of an experimental procedure congruent with such a drug study, they would not be likely to show a decrement in performance data. However, subjects who were run with the drug and such instructions would presumably yield the standard subnormal performance. In other words, the quasi-control of the non-experiment has allowed us to economically assess the possible effects of instructional sets rather than allowing drug inference. It is an efficient way of clarifying the adequacy of experimental procedures as a prelude to the definitive study.11 A somewhat more elaborate procedure would be to instruct subjects to simulate.12 It would be relatively easy to use simulators in a fashion analogous to that suggested in hypnosis research. 
Two investigators would be employed, one who would administer the medication and one who would carry out all other aspects of the study. The simulators would, instead of receiving the drug, be shown the medication, would read exactly the information given to the drug subjects, but would be told they would not be given the drug. Instead, their task would be to deceive the other experimenter and to make him think they had actually received the drug. They would further be told the other experimenter was blind and would not know they were simulating; if he really caught on to their identity, he would disqualify them; therefore, they should not be afraid they would give themselves away since, as long as they were not disqualified, they were doing well. The subject would then be turned over to the other experimenter who would, in fact, be blind as to the true status of the subject. The simulating subject, under these circumstances, would get no more and no less information than the subject receiving the actual drug (except cues of subjective side effects from the drug). He would be treated by the experimenter in essentially the
11 Obviously, extreme caution is needed in interpreting differences in performance of the individuals actually receiving the drug and that of the non-experimental control. The subjects in a non-experiment cannot really be given the identical cues and role support provided the subject who is actually taking the drug. While the identical instructions may be read to him, it is essentially impossible to treat such subjects in the same fashion. Obviously the investigator is not concerned about side effects, possible dangers, etc. A great many cues which contribute to the demand characteristics, including drug side effects, are thus different for the subject receiving the actual drug, and differences in performance could be due to many aspects of this differential treatment.
12 The use of simulators as an alternative to placebo in psychopharmacological studies was suggested by Frederick T. Evans.
same fashion. This procedure avoids some of the possible difficulties of differential treatment inherent in the non-experiment. Even under these circumstances, however, if both groups produce, let’s say, identical striking alterations of subjective experience, it would still be erroneous to conclude that there is no drug effect. Rather one would have to conclude that the experimental procedure is inadequate, that the experience of the subjects receiving the drug could (but need not) be due to placebo effects. Whether this is in fact the case cannot be established with this design. The only conclusion which can be drawn is that the experimental procedure is not adequate and needs to be modified. Presumably an appropriate modification of the demand characteristics would, if there is a real drug effect, eventually allow a clear difference to emerge between subjects who are receiving drugs and subjects who are simulating. The interpretation of findings where the group of subjects receiving drugs performs differently from those who are simulating also requires caution. While such findings suggest that drug effect could not be due simply to the demand characteristics because it differs from the expectations of the simulators who are not exposed to the real treatment, the fact that the simulating group is a different treatment group must be kept in mind. Thus, some behavior may be due to the request to simulate. Greater evasiveness on the part of simulating subjects, for example, could most likely be ascribed to the act of simulation. Greater suspiciousness on the part of a simulator could equally be a function of the peculiar situation into which the subject is placed. These observations underline the fact that the simulator, who is a quasi-control, is effective primarily in clarifying the adequacy of the research procedure. 
The characteristic of this treatment group is that it requires the subjects actively to participate in the experiment in contrast to the usual control group which receives the identical treatment omitting only the drug, as would be the case when placebos are used properly. The problem of inference from data obtained through the use of quasi-controls is seen relatively easily when one attempts to evaluate the contributions which demand characteristics might make to subjects’ total behavior after receiving a drug. Clearly the placebo design properly used is the most adequate approach. This will tell us how much of the behavior of those individuals receiving the drug can be accounted for on the basis of their receiving a substance which is inert as to the specific effect but which mimics the side effects when the total experimental situation and treatment of the subject are identical. The placebo effect is the behavioral consequence which results from the demand characteristics which are (1) perceived and (2) responded to.13 In other words, in any given context there are a large number of demand characteristics inherent in the situation and the subject responds only to those aspects of the demand characteristics which he perceives (there will be many cues which are not recognized by a particular subject) and, of those aspects of the demand characteristics which are perceived at some level by the subject, only some will have a behavioral consequence. One might consider any given experiment as having demand characteristics which fall into two groups: (a) those which will be perceived and responded to and are, therefore, active in creating experimental effects (that is, they will operate differentially between groups) and (b) those which are present in the situation
13 For a discussion of the placebo response, see Beecher (1959) and Honigfeld (1964). Undoubtedly it has a large number of components and is influenced by both situational and personality factors; particularly its relationship to suggestibility (Evans, 1967) is of considerable interest and is by no means clear. These issues, however, go beyond the scope of this paper.
but either are not readily perceived by most subjects or, for one reason or another, do not lead to a behavioral response by most of the subjects. Quasi-control procedures tend to maximally elicit the subject’s responses to demand characteristics. As a result, the behavior seen with quasi-control subjects may include responses to aspects of the demand characteristics which for the real subjects are essentially inert. All that possibly can be determined with quasi-controls is what could be salient demand characteristics in the situation; whether the subjects actually respond to those same demand characteristics cannot be confirmed. Placebo controls or other passive control groups such as those for whom demand characteristics are varied as independent variables are necessary to permit firmer inference.
A Final Example
The problem of inference from quasi-controls is illustrated in a study (Evans, 1966; Orne and Evans, 1966) carried out to investigate what happens if the hypnotist disappears after deep hypnosis has been induced. This question is by no means easy to examine. The hypnotist’s disappearance must be managed in such a way as to seem plausible and truly accidental in order to avoid doing violence to the implicit agreement between subject and hypnotist that the latter is responsible for the welfare of the former during the course of the experiment. Such a situation was finally created in a study requiring two sessions with subjects previously trained to enter hypnosis readily. It was explained to them that in order to standardize the procedure all instructions, including the induction and termination of hypnosis, would be carried out by tape recording. The experimenter’s task was essentially that of a technician: turning on the tape recorder, applying electrodes, presenting experimental materials, etc. He did not say anything throughout the study since every item of instruction was given by means of the tape recorder. Each subject came for two such sessions. During the second session, while the subject was deeply hypnotized and tapping his foot in rhythm with hallucinated music, the tape recorder came to a grinding halt, the light went out on the desk, and the experimenter tried in vain to reactivate the machine by flicking the switch. He then muttered under his breath that the fuse must be blown and dashed from the room. The subject’s behavior was observed through a one-way screen throughout the experiment in order to determine what would happen once the experimenter had left the room. Would hypnosis be terminated immediately once the subject was alone as some theories would predict, or would it take a period of time for the subject to pull himself out of hypnosis as one would expect if the condition were some special altered state?
Despite the “cuteness” of the deception procedure of the simulated power failure there was no certainty whether the subject would accept it as such. For this reason quasi-controls were run. In this particular paradigm, simulating subjects were the most appropriate kind of quasi-control and these were included without the hypnotist being aware which subjects were in fact deeply hypnotizable and which were unhypnotizable individuals instructed to simulate. The findings were clear-cut: five out of the six deeply hypnotized subjects did not arouse immediately when the hypnotist left. They required varying periods of time to apparently pull themselves out of the state. When they did, they appeared somewhat disoriented and confused. In contrast, five out of the six simulating subjects continued to simulate hypnosis for the whole half-hour period until the technician returned, presumably having fixed the fuse. During the subsequent
postexperimental inquiry it became clear that the simulators caught on to the fact that the power failure was contrived and that they behaved as they thought a hypnotized subject would if there were a true power failure. The simulating subjects quite cogently pointed out that having been instructed to fool the experimenter put them on their guard lest they themselves be fooled. In contrast, only one of the hypnotized subjects suspected that the power failure was contrived. Because the quasi-control subjects are a special treatment group their purpose is not primarily as a comparison with actual subject populations; rather, it is to clarify the adequacy of the experimental procedure. Since the simulators caught on to the fact that the power failure was contrived it was plausible that at some level the hypnotized subjects could have responded to the same cues though they did not verbalize such awareness. Therefore despite statistically significant differences between these two groups, these results do not permit any conclusions to be drawn about the nature of the hypnotic phenomenon. The behavior of the quasi-controls demonstrates an inadequacy of the deception procedure and indicates that the behavior of the hypnotized subjects could not be interpreted at face value. Having found the deception manipulation wanting we re-examined in detail the cues which might allow the subject to become aware that the power failure was a planned part of the experiment and that his behavior would actually be observed subsequent to it. 
The kind of nitty-gritty details which most investigators fail to publish and which, even if they wished to, most editors will delete for the sake of brevity, are crucial in this regard: to wit, in the study just described, to facilitate observation after the “power failure,” the overhead lights remained on and only the bright desk lamp went out as the tape recorder stopped; an obvious one-way mirror had always been in the room to which we had hoped (in vain) the subject would have adapted; further, a small polygraph continued to run in the experimental room after the “power failure,” etc. We now redesigned the study with considerable attention to such details. The one-way screen was eliminated and replaced by another less well-known and carefully hidden means of observation. When the power failure occurred all lights in the room were extinguished. Just enough light was filtered into the room through the window curtains from daylight (or from a strategically placed parking light outside at night) to make observation possible. Furthermore, physiological recordings were made by means of a large Ampex instrumentation tape recorder and a small polygraph. Both of these machines stopped, along with the lights and the tape recorder, with the power failure. (Actually, the wires from the subject were run through to another room at some distance in order to permit continuous recording of relevant data.) When the experimenter left the room the hall lights were out. Every effort was made by scrupulous attention to such details to make the power failure plausible. With these revised conditions simulating and hypnotized subjects were run in another experiment by a blind experimenter. Here five out of six simulating subjects stopped simulating the moment the technician left the room, indulging in exploratory and other private behavior clearly indicating that they felt themselves to be unobserved.
On the other hand, the hypnotized group behaved much as they had in the earlier experiment, pulling themselves out of this state only after a considerable period of time had elapsed. This behavior was in line with the data provided during an inquiry, indicating that both real and simulating subjects accepted the power failure as genuine. Significant differences were again obtained but, in contrast to the findings of the first study, those from the second study allow meaningful inference. The behavior of
the quasi-controls clearly indicates that the power failure was accepted as genuine and therefore it is plausible to accept the behavior of the hypnotized individuals at face value. It will be clear that the purpose served by the quasi-controls was simply to determine whether or not there were sufficient cues in the experimental situation to allow the subjects to surmise that the power failure was staged rather than spontaneous. However, this issue is crucial if we hope to draw inference to a situation which is perceived by the subject as extra-experimental. Though it entailed a great deal of trouble to test the adequacy of the deception manipulation, without quasi-controls we would have had no empirical test of the procedure’s adequacy. It is not possible, without techniques such as this, to evaluate the kind of cues presented to the subject in an experimental situation, and yet the extent to which such cues are present will determine the kind of inference which legitimately can be drawn from experimental findings.
Conclusion

Research with human subjects introduces a new set of difficulties because the subjects are sentient beings who are affected by the act of observation and, particularly in experimental contexts, are by no means neutral to the outcome of the study. The kinds of variables which affect subjects’ perceptions about the experiment, its purposes, what one hopes to find, how they may perform as good subjects, and so forth (especially those not specifically communicated but rather inherent in what the subject learns about the experiment and the procedure itself) have been termed the demand characteristics of the experimental situation. The nature of the effects of demand characteristics is such that certain findings may be observed (and may even be replicated in laboratory situations) but be specific to the experimental situation. In order to make inference beyond the experimental context to phenomena occurring outside the laboratory the possible effects introduced by demand characteristics must be considered. These difficulties have led some to suggest that psychologists must leave the laboratory and conduct research exclusively in naturalistic settings. Certainly it is desirable to obtain data of this kind, but the experimental paradigm remains the most powerful tool of analysis we have available. Although we must recognize the problem of inference from one context to another, other sciences have had to do likewise. Thus, aerodynamics has had to develop conversion factors before data obtained in the wind tunnel could be safely applied to a plane in flight. Similarly, inference from the action of an antibiotic in the test tube to its medical effects on the organism depends on recognition that effects in vitro may differ from those in vivo. We cannot afford to give up either laboratory research or observation in a naturalistic setting. Both kinds of data are an integral part of behavioral science.
In addition to the usual control procedures which are recognized as necessary in isolating the action of an independent variable in any experiment, studies with human subjects require a set of controls designed to look at the effect of the experimental technique itself. These controls do not permit a direct inference about the independent variable. Rather, they are designed to allow the investigator to estimate the effects which are due to the situation under which a study is being carried out. The term quasi-control has been suggested to differentiate these techniques from the more typical control measures. The kinds of quasi-controls outlined here all share the
feature that they utilize the ability of subjects to reflect upon the context in which they are being investigated, as a means of understanding the way in which this context might affect their own and other subjects’ behavior. Undoubtedly other quasi-controls will need to be developed in order to facilitate inference about human behavior from one context to another. While the difficulty of inference from one context to another is recognized by all scientists, psychology and the other behavioral sciences are in a peculiar position. The object of our study is man. The implications of our research relate to man’s behavior. It is not surprising that our findings are of considerable interest to individuals outside of scientific disciplines. Studies in the behavioral sciences tend increasingly to affect policy decisions. Even the scientist in pure research may find his data quoted as the basis of a decision where he himself would feel there is little relevance. Whether we welcome this tendency or view it with alarm, it seems likely to continue. With the increasing interest in and dissemination of knowledge about behavioral research, it becomes important to see what is needed before meaningful generalization is possible. This problem is particularly acute in experimental work, although the Hawthorne studies (Roethlisberger and Dickson, 1939) demonstrate that it also exists in research outside of the laboratory. Perhaps our responsibility extends beyond our subjects and our disciplines, to include a concern with the kinds of generalizations which may be drawn from our work. The leap is one which others are so eager to make that we can hardly avoid considering it ourselves.
References

Beecher, H. K. Measurement of subjective responses: Quantitative effects of drugs. New York: Oxford University Press, 1959.
Bem, D. J. Self-perception: An alternative interpretation of cognitive dissonance phenomena. Psychological Review, 1967, 74, 183–200.
Campbell, D. T., and Stanley, J. C. Experimental and quasi-experimental designs for research. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963.
Cataldo, J. F., Silverman, I., and Brown, J. M. Demand characteristics associated with semantic differential ratings of nouns and verbs. Educational and Psychological Measurement, 1967, 27, 83–87.
Criswell, Joan H. The psychologist as perceiver. In R. Tagiuri and L. Petrullo (Eds.), Person perception and interpersonal behavior. Stanford: Stanford University Press, 1958, 95–109.
Damaser, Esther C., Shor, R. E., and Orne, M. T. Physiological effects during hypnotically requested emotions. Psychosomatic Medicine, 1963, 25, 334–343.
Ellson, D. G., Davis, R. C., Saltzman, I. J., and Burke, C. J. A report on research on detection of deception. (Contract N6onr–18011 with Office of Naval Research) Bloomington, Indiana: Department of Psychology, Indiana University, 1952.
Evans, F. J. The case of the disappearing hypnotist. Paper read at American Psychological Association, New York, September, 1966.
Evans, F. J. Suggestibility in the normal waking state. Psychological Bulletin, 1967, 67, 114–129.
Festinger, L. A theory of cognitive dissonance. New York: Row and Peterson, 1957.
Fisher, S., Cole, J. O., Rickels, K., and Uhlenhuth, E. H. Drug-set interaction: The effect of expectations on drug response in outpatients. In P. B. Bradley, F. Flügel, and P. Hoch (Eds.), Neuropsychopharmacology. Vol. 3. New York: Elsevier, 1964, 149–156.
Gustafson, L. A., and Orne, M. T. Effects of heightened motivation on the detection of deception. Journal of Applied Psychology, 1963, 47, 408–411.
Gustafson, L. A., and Orne, M. T. Effects of perceived role and role success on the detection of deception. Journal of Applied Psychology, 1965, 49, 412–417.
Honigfeld, G. Non-specific factors in treatment. I: Review of placebo reactions and placebo reactors. Diseases of the Nervous System, 1964, 25, 145–156.
Kelman, H. C. The human use of human subjects: The problem of deception in social-psychological experiments. Paper read at American Psychological Association, Chicago, September, 1965.
Kroger, R. O. The effects of role demands and test-cue properties upon personality-test performance. Journal of Consulting Psychology, 1967, 31, 304–312.
Mills, T. M. A sleeper variable in small group research: The experimenter. Paper read at American Sociological Association, St. Louis, September, 1961.
Modell, W., and Houde, R. W. Factors influencing the clinical evaluation of drugs: With special reference to the double-blind technique. Journal of the American Medical Association, 1958, 167, 2190–2199.
Orne, M. T. The nature of hypnosis: Artifact and essence. Journal of Abnormal and Social Psychology, 1959, 58, 277–299. (a)
Orne, M. T. The demand characteristics of an experimental design and their implications. Paper read at American Psychological Association, Cincinnati, September, 1959. (b)
Orne, M. T. On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 1962, 17, 776–783.
Orne, M. T. The simulation of hypnosis: Method, rationale, and implications. Paper presented at the meeting of the Society for Clinical and Experimental Hypnosis, Chicago, November, 1968.
Orne, M. T., and Evans, F. J. Social control in the psychological experiment: Antisocial behavior and hypnosis. Journal of Personality and Social Psychology, 1965, 1, 189–200.
Orne, M. T., and Evans, F. J. Inadvertent termination of hypnosis on hypnotized and simulating subjects. International Journal of Clinical and Experimental Hypnosis, 1966, 14, 61–78.
Orne, M. T., and Scheibe, K. E. The contribution of nondeprivation factors in the production of sensory deprivation effects: The psychology of the “panic button.” Journal of Abnormal and Social Psychology, 1964, 68, 3–12.
Orne, M. T., Sheehan, P. W., and Evans, F. J. Occurrence of posthypnotic behavior outside the experimental setting. Journal of Personality and Social Psychology, 1968, 9, 189–196.
Page, M. M., and Lumia, A. R. Cooperation with demand characteristics and the bimodal distribution of verbal conditioning data. Psychonomic Science, 1968, 12, 243–244.
Paul, G. L. Insight vs. desensitization in psychotherapy: An experiment in anxiety reduction. Stanford, Calif.: Stanford University Press, 1966.
Riecken, H. W. A program for research on experiments in social psychology. Paper read at Behavioral Sciences Conference, Albuquerque, 1958. In N. F. Washburne (Ed.), Decisions, values and groups. Vol. 2. New York: Pergamon Press, 1962, 25–41.
Roethlisberger, F. J., and Dickson, W. J. Management and the worker. Cambridge, Mass.: Harvard University Press, 1939.
Rosenberg, M. J. When dissonance fails: On eliminating evaluation apprehension from attitude measurement. Journal of Personality and Social Psychology, 1965, 1, 28–42.
Rosenthal, R. On the social psychology of the psychological experiment: The experimenter’s hypothesis as unintended determinant of experimental results. American Scientist, 1963, 51, 268–283.
Rosenthal, R. Experimenter effects in behavioral research. New York: Appleton-Century-Crofts, 1966.
Ross, S., Krugman, A. D., Lyerly, S. R., and Clyde, D. J. Drugs and placebos: A model design. Psychological Reports, 1962, 10, 383–392.
Silverman, I. Role-related behavior of subjects in laboratory studies of attitude change. Journal of Personality and Social Psychology, 1968, 8, 343–348.
Stricker, L. J. The true deceiver. Psychological Bulletin, 1967, 68, 13–20.
Stricker, L. J., Messick, S., and Jackson, D. N. Suspicion of deception: Implications for conformity research. Journal of Personality and Social Psychology, 1967, 5, 379–389.
Sutcliffe, J. P. A general method of analysis of frequency data for multiple classification designs. Psychological Bulletin, 1957, 54, 134–137.
Wishner, J. Efficiency: Concept and measurement. In O. Milton (Ed.), Behavior disorders: Perspectives and trends. Philadelphia: Lippincott, 1965, 133–154.
6 Interpersonal Expectations: Effects of the Experimenter’s Hypothesis¹
Robert Rosenthal
Harvard University
The social situation which comes into being when a behavioral scientist encounters his research subject is a situation of both general and unique importance to the behavioral sciences. Its general importance derives from the fact that the interaction of experimenter and subject, like other two-person interactions, may be investigated empirically with a view to teaching us more about dyadic interaction in general. Its unique importance derives from the fact that the interaction of experimenter and subject, unlike other dyadic interactions, is a major source of our knowledge in the behavioral sciences. To the extent that we hope for dependable knowledge in the behavioral sciences, we must have dependable knowledge about the experimenter-subject interaction specifically. Without an understanding of the data collection situation we can no more hope to acquire accurate information for our disciplines than astronomers and zoologists could hope to acquire accurate information without their understanding of the operation of their telescopes and microscopes. It is for these reasons that increasing interest has been shown in the investigation of the experimenter-subject interaction system. And the outlook is anything but bleak. It does seem that we can profitably learn about those effects which the behavioral scientist unwittingly may have on the results of his research.
Unintended Effects of the Experimenter

It is useful to think of two major types of effects, which the behavioral scientist can have upon the results of his research. The first type operates, so to speak, in the mind, in the eye, or in the hand of the investigator. It operates without affecting the actual response of the human or animal subjects of the research; it is not interactional. The second type of experimenter effect is interactional; it operates by affecting the actual response of the subject of the experiment. It is a sub-type of this latter effect, the effects of the investigator’s expectancy or hypothesis on the results of his research, that will occupy most of the discussion. First, however, some examples of other effects of the investigator on his research will be mentioned.

¹ Preparation of this chapter and much of the research summarized here was supported by research grants (G–17685; G–24826; GS–177; GS–714; and GS–1741) from the Division of Social Sciences of the National Science Foundation.
Observer Effects

In any science, the experimenter must make provision for the careful observation and recording of the events under study. It is not always so easy to be sure that one has, in fact, made an accurate observation. That lesson was learned by the psychologists, who were grateful to learn it, but it was not the psychologists who focused attention on it originally. It was the astronomers. Just near the end of the 18th century, the royal astronomer at the Greenwich Observatory, Maskelyne, discovered that his assistant, Kinnebrook, was consistently “too slow” in his observations of the movement of stars across the sky. Maskelyne cautioned Kinnebrook about his “errors” but the errors continued for months. Kinnebrook was discharged. The man who might have saved that job was Bessel, the astronomer at Königsberg, but he was 20 years too late. It was not until then that he arrived at the conclusion that Kinnebrook’s “error” was probably not willful. Bessel studied the observations of stellar transits made by a number of senior astronomers. Differences in observation, he discovered, were the rule, not the exception (Boring, 1950). This early observation of the effects of the scientist on the observations of science made Bessel perhaps the first student of the psychology of scientists. More contemporary research on the psychology of scientists has shown that, while observer errors are not necessarily serious, they do tend to occur in a biased manner. By that we mean that, more often than we would expect by chance, when errors of observation occur they tend to give results more in the direction of the observer’s hypothesis (Rosenthal, 1966).

Recording Errors. As data collectors observe the behavior of their subjects, their observations must in some way be recorded. It is no revelation to point out that errors of recording do occur, but it may be of interest to try to obtain some estimates of the incidence of such errors.
Table 6-1 shows four such estimates based on an older study of errors in recording responses to a telepathy task, two more recent studies of errors in recording responses to a person perception task, and one recent study of errors in recording responses to a numerosity estimation task. The next to last column shows the range of misrecording rates to be in the neighborhood of one per cent and the last column shows that perhaps over two-thirds of the errors that do occur are biased in the direction of the observer’s hypothesis.
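The misrecording rates described for Table 6-1 can be recomputed from the per-study counts. The following is a minimal sketch in Python, assuming only the observer, recording, and error counts reported in that table; the names `STUDIES`, `error_rate_percent`, and `pooled_summary` are hypothetical and introduced purely for illustration. It suggests that the combined rate of 1.09 per cent is a pooled rate (total errors over total recordings) rather than a simple average of the four study percentages.

```python
# Counts transcribed from Table 6-1: (study, observers, recordings, errors).
# The variable and function names below are illustrative, not from the source.
STUDIES = [
    ("Kennedy and Uphoff, 1939", 28, 11_125, 126),
    ("Rosenthal, Friedman, Johnson, et al., 1964", 30, 3_000, 20),
    ("Persinger, Knutson, and Rosenthal, 1968", 11, 828, 6),
    ("Weiss, 1967", 34, 1_770, 30),
]


def error_rate_percent(errors, recordings):
    """Return misrecordings as a percentage of all recordings."""
    return 100.0 * errors / recordings


def pooled_summary(studies):
    """Sum the counts across studies and compute the pooled error rate."""
    observers = sum(s[1] for s in studies)
    recordings = sum(s[2] for s in studies)
    errors = sum(s[3] for s in studies)
    return observers, recordings, errors, error_rate_percent(errors, recordings)


if __name__ == "__main__":
    for name, _, recordings, errors in STUDIES:
        print(f"{name}: {error_rate_percent(errors, recordings):.2f}%")
    observers, recordings, errors, rate = pooled_summary(STUDIES)
    print(f"Combined: {observers} observers, {recordings} recordings, "
          f"{errors} errors, {rate:.2f}%")
```

Because the pooled rate weights each study by its number of recordings, the large telepathy study contributes most heavily to the combined figure.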
Table 6–1 Recording Errors in Four Experiments

Study       Observers   Recordings   Errors   Error %   Bias %
1. a            28        11,125       126     1.13%     68%
2. b            30         3,000        20      .67%     75%
3. c            11           828         6      .72%     67%
4. d            34         1,770        30     1.69%     85%
Combined       103        16,723       182     1.09%     71%

a Kennedy and Uphoff, 1939.
b Rosenthal, Friedman, Johnson, et al., 1964.
c Persinger, Knutson, and Rosenthal, 1968.
d Weiss, 1967.

Table 6–2 Biased Computational Errors in Three Studies

                                                      Experimenters
Study                                         Total N   Erring N   Erring %   Bias %
Laszlo and Rosenthal, 1967                        3         3        100%      100%
Rosenthal, Friedman, Johnson, et al., 1964       30        18         60%       67%
Rosenthal and Hall, 1968 a                        1         1        100%      100%
Combined                                         34        22         65%       73%

a For a sample of five research assistants performing 5,012 calculations and transcriptions there were 41 errors detected on recheck for an .82 per cent error rate.

A more preliminary assessment of computational errors is summarized in Table 6-2. About two-thirds of the experimenters err computationally, though it
seems safe to suggest that, given enough computations to perform, all experimenters will make computational errors. More interesting is the combined finding that nearly three out of four experimenters, when they do err computationally, err in the direction of their hypothesis. In general, the magnitudes of errors, both biased and unbiased, tend to be small, and the overall effects of recording and computational errors on grand means of different treatment conditions tend to be trivial. A few of the experimenters studied, however, made errors sufficiently large and sufficiently non-canceling to have affected the conclusions of an experiment in which they were the only data recorders and data processors. Successive independent checking and rechecking of a set of observations can give us whatever degree of accuracy is needed, though an absolute zero level of error seems an unlikely goal to achieve. A more historical and theoretical discussion of observer effects and their control is available elsewhere (Rosenthal, 1966).

Interpreter Effects

The interpretation of the data collected is part of the research process, and a glance at any of the technical journals of contemporary behavioral science will suggest strongly that, while we only rarely debate one another’s observations, we often debate the interpretation of those observations. It is as difficult to state the rules for accurate interpretation of data as it is to state the rules for accurate observation of data but the variety of interpretations offered in explanation of the same data imply that many of us must turn out often to be wrong. The history of science generally, and the history of psychology more specifically, suggest that more of us are wrong longer than we need to be because we hold our theories not quite lightly enough. The common practice of theory monogamy has its advantages, however. It does keep us motivated to make more crucial observations.
In any case, interpreter effects seem less serious than observer effects. The reason is that the former are public while the latter are private. Given a set of observations, their interpretations become generally available to the scientific community. We are free to agree or disagree with any specific interpretation. Not so with the case of the observations themselves. Often these are made by a single investigator so that we are not free to agree or disagree. We can only hope that no observer errors occurred, and we can, and should, try to repeat the observations. Examples of interpreter effects in the physical, biological, and behavioral sciences are not hard to come by, and to an earlier theoretical discussion and
inventory of examples (Rosenthal, 1966) we need only add some recent instances. In the physical sciences, Polanyi (1967) refers to the possible interpretations of those data which appeared to support Velikovsky’s controversial theory dealing with the history of our planet and the origin of the planet Venus. In the same paper, Polanyi gives us other examples of interpreter effects and places them all into a broad conceptual framework in which the antecedent plausibility plays a prominent role. In a disarming retraction of an earlier interpretation, Bradley (1968, 437) told how the microscopic particles he found turned out not to be the unmineralized fossil bacteria he had originally believed them to be. The minute spheres, it turned out, were artifactually formed fluorite. “I was as completely taken in as Don Quixote . . . .” Carlson and Armelagos (1965) discuss a considerably more macroscopic “find” reported in the literature of paleopathology. They argue convincingly that the prehistoric curved bark bands earlier interpreted as orthopedic corsets were actually hoods for Indian cradleboards. Additional recent discussions of interpreter effects in the behavioral sciences can be found in Honorton (1967) and Shaver (1966).

Intentional Effects

It happens sometimes in undergraduate laboratory science courses that students “collect” and report data too perfect to be true. (That probably happens most often when students are taught to be scientists by being told what results they must get to do well in the course, rather than being taught the logic of scientific inquiry and the value of being quite open-eyed and open-minded.) Unfortunately, the history of science tells us that not only undergraduates have been dishonest in science.
Intentional effects, though rare, must be regarded as part of the inventory of the effects of the investigator himself, and some historically important cases from the physical, biological, and behavioral sciences have been described in detail elsewhere (Rosenthal, 1966). Additional anecdotal evidence is presented for the case of behavioral research by Roth (1965) and Gardner (1966). Martin Gardner, professional magician and editor of the “Mathematical Games” section of the Scientific American, provides an excellent summary of how behavioral scientists can be deceived by their over-eager subjects of research in dermo-optical perception, and also how they can prevent such deception. Intentional effects, interpreter effects, and observer effects all operate without the investigator’s influencing his subject’s response to the experimental task. In those effects of the experimenter himself to which we now turn, we shall see that the subject’s response to the experimental task is influenced.

Biosocial Effects

The sex, age, and race of the investigator have all been found to affect the results of his research (Rosenthal, 1966). What we do not know and what we need to learn is whether subjects respond differently simply to the presence of experimenters varying in these biosocial attributes or whether experimenters varying in those attributes behave differently toward their subjects and, therefore, obtain different responses from them because they have, in effect, altered the experimental situation for their subjects. So far the evidence suggests that male and female experimenters conduct
the “same” person perception experiment quite differently so that the different results they obtain may be attributable to those unintentionally different manipulations. Male experimenters, for example, were found in two experiments to be more friendly to their subjects than female experimenters (Rosenthal, 1967). Biosocial attributes of the subject can also affect the experimenter’s behavior, which, in turn, may affect the subject’s responses. In one study, for example, the interactions between experimenters and their subjects were recorded on sound films. It was found that only 12 per cent of the experimenters ever smiled at their male subjects, while 70 per cent of the experimenters smiled at their female subjects. Smiling by the experimenters, it was discovered, affected the subjects’ responses. From this evidence and from some more detailed analyses which suggest that female subjects may be more protectively treated by their experimenters (Rosenthal, 1966, 1967), it might be suggested that in the psychological experiment, chivalry is not dead. This news may be heartening socially, and it is interesting psychologically, but it is very disconcerting methodologically. Sex differences are well established for many kinds of behavior. But a question must now be raised as to whether sex differences which emerge from psychological experiments are due to the subject’s genes, morphology, enculturation, or simply to the fact that the experimenter treated his male and female subjects differently so that, in a sense, they were not really in the same experiment at all. So far we have seen that both the sex of the experimenter and the sex of the subject can serve as significant determinants of the way in which the investigator conducts his research. In addition, however, we find that when the sex of the experimenter and the sex of the subject are considered simultaneously, certain interaction effects emerge.
Thus, male experimenters contacting female subjects, and female experimenters contacting male subjects, tend to require more time to collect portions of their data than do male or female experimenters contacting subjects of the same sex (Rosenthal, 1967). This tendency for opposite-sex dyads to prolong their data-collection interactions has also been found in a verbal conditioning experiment by Shapiro (1966). Other interesting interaction effects occur when we examine closely the sound motion pictures of male and female experimenters contacting male and female subjects. Observations of experimenters’ friendliness were made by two different groups of observers. One group watched the films but did not hear the sound track. The other group listened to the sound track but did not view the films. From the resulting ratings, a measure of motor or visual friendliness and an independent measure of verbal or auditory friendliness were available. (The correlation between ratings of friendliness obtained from these independent channels was only .29.) Among male experimenters, there was a tendency (not statistically significant) for their movements to show greater friendliness than their tone of voice, and to be somewhat unfriendly toward their male subjects in the auditory channel of communication. It was among the female experimenters that the more striking effects occurred. The females were quite friendly toward their female subjects in the visual channel but not in the auditory channel. With male subjects, the situation was reversed significantly. Though not friendly in the visual mode, female experimenters showed remarkable friendliness in the auditory channel when contacting male subjects. The quantitative analysis of sound motion pictures is not yet far enough developed that we can say whether such channel discrepancy in the communication of
friendliness is generally characteristic of women in our culture, or only of advanced female students in psychology, or only of female investigators conducting experiments in person perception. Perhaps it would not be farfetched to attribute the obtained channel discrepancy to an ambivalence over how friendly they ought to be. Quite apart from considerations of unintended effects of the experimenter, such findings may have some relevance for a better understanding of communication processes in general. Though the sex of the experimenter does not always affect the performance of the subject, in a great many cases it does. Gall and Mendelsohn (1966) found, for example, that male experimenters elicited more creative problem solutions than did female experimenters and that, in general, female subjects were more affected in their performance than male subjects by the sex of the experimenter. Cieutat (1965), on the other hand, found that female experimenters elicited intellectual performance from children superior to that obtained by male experimenters. In addition, children tended to perform better for examiners of the opposite sex. Follow-up research by Cieutat and Flick (1967), however, found these effects to be appreciably diminished. Glixman (1967) presents partial support for the proposition that the sex of the experimenter may interact with the type of task required to determine, in part, the subject’s response, and Kintz, Delprato, Mettee, Persons, and Schappe (1965) have data suggesting that sex of experimenter may be a variable affecting the maze behavior of albino rats. The race or ethnic grouping of the experimenter also often affects the subject’s response. Vaughn (1963) found that Maori and pakeha experimenters differentially affected the responses of Maori and pakeha school-children such that the Maori children preferred figures of their own race considerably more when the experimenter was also Maori. 
Wenk (1966) found that Negro subjects scored appreciably higher on nonlanguage tests of intellectual functioning when the examiner was Negro rather than white. Summers and Hammonds (1966) found that the presence of a Negro investigator considerably decreased self-reports of anti-Negro prejudice. But in a more clinical context, Womack and Wagner (1967) found only personal characteristics other than race to affect the patients’ responses to professionally identified interviewers.
Psychosocial Effects
Experimenters who differ along such personal and social dimensions as anxiety, need for approval, status, and warmth tend to obtain different responses from their research subjects, and a summary of the effects of these and other variables is available (Rosenthal, 1966). But what, for example, does the more anxious experimenter do in the experiment that leads his subjects to respond differently? We might expect more anxious experimenters to be more fidgety, and that is just what they are. Experimenters scoring higher on the Taylor Manifest Anxiety scale have been observed from sound-motion pictures (Rosenthal, 1967) to show a greater degree of general body activity and to have a less dominant tone of voice. What effects just such behavior on the part of the experimenter will have on the subjects’ responses depends no doubt on the particular experiment being conducted and, very likely, on various characteristics of the subject as well. In any case, we must assume that a more anxious experimenter cannot conduct just the same experiment as a less anxious
Book One – Artifact in Behavioral Research
experimenter. It appears that in experiments which have been conducted by just one experimenter, the probability of successful replication by another investigator is likely to depend on the similarity of his personality to that of the original investigator. Anxiety of the experimenter is just one of the experimenter variables affecting the subjects’ responses in an unintended manner. Crowne and Marlowe (1964) have shown that subjects who score high on their scale of need for approval tend to behave in such a way as to gain the approval of the experimenter. Now there is evidence that suggests that experimenters who score high on this measure also behave in such a way as to gain approval from their subjects. Analysis of filmed interactions showed that experimenters scoring higher on the Marlowe–Crowne scale spoke to their subjects in a more enthusiastic and a more friendly tone of voice. In addition, they smiled more often at their subjects and slanted their bodies more toward their subjects than did experimenters lower in the need for approval. Earlier research by Towbin (1959) has shown that the examiner’s power to control his patient’s fate can be a partial determinant of the patient’s Rorschach responses, though the status of the examiner, independent of his power to control the patient’s destinies, had little effect. We might suppose that a Roman Catholic priest would obtain different responses to personal questions asked of Roman Catholic subjects than would a Roman Catholic layman. That was the question addressed in an experiment by Walker, Davis, and Firetto (1968). They had a layman and a priest, each garbed sometimes as layman and sometimes as priest, administer a series of personal questions to male and female subjects. The results were complex but interesting, male and female subjects responding differentially not so much to priest versus layman but rather to whether the priest and layman were playing their true roles or simulating those roles. 
While this study showed no simple effect of being contacted by a priest as opposed to a layman, an earlier study did show such differences (Walker and Firetto, 1965). ‘‘Warmer’’ experimenters have also been found often to obtain quite different responses from their subjects than ‘‘cooler’’ experimenters. Some of the more recent support for this proposition comes from the work of Engram (1966) with children and of Goldblatt and Schackner (1968) with college students. These latter workers found that their subjects’ judgments of affect in photographs were dramatically influenced by the degree of friendliness shown by the data collectors. A pioneering study by Malmo, Boag, and Smith (1957) showed that within-experimenter variation could also serve as a powerful unintended determinant of subjects’ responses. A particular data-collector’s variations in feeling state were found to be related to his subjects’ physiological responses. When the experimenter had a ‘‘bad day,’’ his subjects’ heart rate showed significantly greater acceleration than when he had a ‘‘good day.’’ Surprisingly, the data collector’s feeling state was not particularly related to his own physiological responses.
Situational Effects
The degree of acquaintanceship between experimenter and subject, the experimenter’s level of experience, and the things that happen to him before and during his interaction with his subject have all been shown to affect the subject’s responses (Rosenthal, 1966). Most recently, for example, Jourard (1968) has shown that experimenters better acquainted with their subjects and more open to them obtain
not only the more open responses we might expect on the basis of reciprocity, but also obtain superior performance in a paired associate learning task. 1. Experimenter Experience. The kind of person the experimenter is before he enters his laboratory can in part determine the responses he obtains from his subjects. From the observation of experimenters’ behavior during their interaction with their subjects there are some clues as to how this may come about. There is also evidence that the kind of person the experimenter becomes after he enters his laboratory may alter his behavior toward his subjects and lead him, therefore, to obtain different responses from his subjects. In the folklore of psychologists who conduct experiments, there is the notion that sometimes, perhaps more often than we would expect, subjects contacted early in an experiment behave differently from subjects contacted later. There may be something to this bit of lore even if we make sure that subjects seen earlier and later in an experiment come from the same population. The difference may be due to changes over the course of the experiment in the behavior of the experimenter. From what we know of performance curves, we might predict both a practice effect and a fatigue effect on the part of the experimenter. There is evidence for both. In experiments which were filmed (Rosenthal, 1966), experimenters became more accurate and faster in the reading of their instructions to their later-contacted subjects. That seems simply to be a practice effect. In addition, experimenters became more bored or less interested over the course of the experiment as observed from their behavior in the experimental interaction. As we might also predict, experimenters became less tense with more experience. The changes which occur in the experimenters’ behavior during the course of their experiment affect their subjects’ responses. 
In the experiments which were filmed, for example, subjects contacted by experimenters whose behavior changed as described rated stimulus persons as less successful (Rosenthal, 1966). 2. Subject Behavior. The experimenter-subject communication system is a complex of intertwining feedback loops. The experimenter’s behavior, we have seen, can affect the subject’s response. But the subject’s behavior can also affect the experimenter’s behavior, which in turn affects the subject’s behavior. In this way, the subject plays a part in the indirect determination of his own response. The experimental details are given elsewhere (Rosenthal, 1966; Rosenthal, Kohn, Greenfield, and Carota, 1965). Briefly, in one experiment, half the experimenters had their experimental hypotheses confirmed by their first few subjects, who were actually accomplices. The remaining experimenters had their experimental hypotheses disconfirmed. This confirmation or disconfirmation of their hypotheses affected the experimenters’ behavior sufficiently so that from their next subjects, who were bona fide and not accomplices, they obtained significantly different responses not only to the experimental task, but on standard tests of personality as well. These responses were predictable from a knowledge of the responses the experimenters had obtained from their earlier-contacted subjects. There is an interesting footnote on the psychology of the accomplice which comes from the experiment alluded to. The accomplices had been trained to confirm or to disconfirm the experimenter’s hypothesis by the nature of the responses they gave the experimenter. These accomplices did not, of course, know when they were confirming an experimenter’s hypothesis or, indeed, that there were expectancies to be confirmed at all. 
In spite of the accomplices’ training, they were significantly affected in the adequacy of their performance as accomplices by the expectancy the experimenter had of their performance, and by whether the experimenter’s hypothesis was being confirmed or disconfirmed by the accomplices’ responses. We can think of the accomplices as experimenters, and the experimenters as the accomplices’ targets or ‘‘victims.’’ It is interesting
to know that experimental targets are not simply affected by experimental accomplices. The targets of our accomplices, like the subjects of our experimenters, are not simply passive responders. They ‘‘act back.’’ 3. Experimental Scene. One of the things that happens to the experimenter which may affect his behavior toward his subject, and thus the subject’s response, is that he falls heir to a specific scene in which to conduct his experiment. Riecken (1962) has pointed out how much there is that we do not know about the effects of the physical scene in which an experimental transaction takes place. We know little enough about how the scene affects the subject’s behavior; we know even less about how the scene affects the experimenter’s behavior. The scene in which the experiment takes place may affect the subject’s response in two ways. The effect may be direct, as when a subject judges others to be less happy when his judgments are made in an ‘‘ugly’’ laboratory (Mintz, 1957). Or, the effect may be indirect, as when the scene influences the experimenter to behave differently and this change in the experimenter’s behavior leads to a change in the subject’s response. Evidence that the physical scene may affect the experimenter’s behavior comes from some data collected with Suzanne Woolsey. We had available eight laboratory rooms which were varied as to the ‘‘professionalness,’’ the ‘‘orderliness,’’ and the ‘‘comfortableness’’ of their appearance. The 14 experimenters of this study were randomly assigned to the eight laboratories. Experimenters took the experiment significantly more seriously if they had been assigned to a laboratory which was both more disordered and less comfortable. These experimenters were graduate students in the natural sciences or in law school. Perhaps they felt that scientifically serious business is carried on best in the cluttered and severely furnished laboratory which fits the stereotype of the scientist’s ascetic pursuit of truth. 
In this same experiment, subjects described the behavior of their experimenter during the course of the experiment. Experimenters who had been assigned to more professional appearing laboratories were described by their subjects as significantly more expressive-voiced, more expressive-faced, and as more given to the use of hand gestures. There were no films made of these experimenters interacting with their subjects, so we cannot be sure that their subjects’ descriptions were accurate. There is a chance that the experimenters did not really behave as described but that subjects in different appearing laboratories perceived their experimenters differently because of the operation of context effects. The direct observation of experimenters’ behavior in different physical contexts should clear up the matter to some extent. 4. The Principal Investigator. More and more research is carried out in teams and groups so that the chances are increasing that any one experimenter will be collecting data not for himself alone. More and more there is a chance that the data are being collected for a principal investigator to whom the experimenter is responsible. The basic data are presented elsewhere (Rosenthal, 1966), but here it can be said that the response a subject gives his experimenter may be determined in part by the kind of person the principal investigator is and by the nature of his interaction with the experimenter. More specifically, personality differences among principal investigators, and whether the principal investigator has praised or reproved the experimenter for his performance of his data-collecting duties, affect the subjects’ subsequent perception of the success of other people and also affect subjects’ scores on standardized tests of personality (e.g., Taylor Manifest Anxiety scale). In one experiment, there were 13 principal investigators and 26 experimenters. 
The principal investigators first collected their own data and it was found that their anxiety level correlated positively with the ratings of the success of others (pictured in photographs) they obtained from their subjects (r = .66, p = .03). Each principal investigator was then to employ two research assistants. On the assumption that principal investigators
select research assistants who are significantly like or significantly unlike themselves, the two research assistants were assigned to principal investigators at random. That was done so that research assistants’ scores on the anxiety scale would not be correlated with their principal investigator’s anxiety scores. The randomization was successful in that the principal investigators’ anxiety correlated only .02 with the anxiety of their research assistants. The research assistants then replicated the principal investigators’ experiments. Remarkably, the principal investigators’ level of anxiety also predicted the responses obtained by their research assistants from their new samples of subjects (r = .40, p = .07). The research assistants’ own level of anxiety, while also positively correlated with their subjects’ responses (r = .24), was not as good a predictor of their subjects’ responses as was the anxiety level of the principal investigators. Something in the covert communication between the principal investigator and his research assistant altered the assistant’s behavior when he subsequently contacted his subjects. We know that the effect of the principal investigator was mediated in this indirect way to his assistant’s subjects, because the principal investigator had no contact of his own with those subjects. Other experiments show that the data obtained by the experimenter depend in part on whether the principal investigator is male or female, whether the principal investigator makes the experimenter self-conscious about the experimental procedure, and whether the principal investigator leads the experimenter to believe he has himself performed well or poorly at the same task the experimenter is to administer to his own subjects. The evidence comes from studies in person perception, verbal conditioning, and motor skills (Rosenthal, 1966). 
As we would expect, these effects of the principal investigator on his assistant’s subjects are mediated by the effects on the assistant’s behavior toward his subjects. Thus, experimenters who have been made more self-conscious by their principal investigator behave less courteously toward their subjects, as observed from films of their interactions with their subjects. In a different experiment, involving this time a verbal conditioning task, experimenters who had been given more favorable evaluations by their principal investigator were described by their subsequently contacted subjects to be more casual and more courteous. These same experimenters, probably by virtue of their altered behavior toward their subjects, obtained significantly more conditioning responses from their subjects. All ten of the experimenters who had been more favorably evaluated by their principal investigator showed conditioning effects among their subjects, but only five of the nine experimenters who felt unfavorably evaluated obtained any conditioning.
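The ten-of-ten versus five-of-nine comparison above can be examined with a small calculation. The text reports no significance test for this particular contrast, so the following is only an illustrative sketch: it assumes a one-sided Fisher exact test is a reasonable way to look at such a 2x2 table, and the function name fisher_one_sided is hypothetical.

```python
from math import comb


def fisher_one_sided(a, b, c, d):
    """Upper-tail Fisher exact p for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose top-left cell is at least as large as `a`.
    """
    row1, row2 = a + b, c + d          # row totals
    col1 = a + c                       # first-column total
    total = row1 + row2
    denom = comb(total, col1)          # number of ways to fill column 1
    a_max = min(row1, col1)            # largest possible top-left cell
    numer = sum(
        comb(row1, k) * comb(row2, col1 - k)
        for k in range(a, a_max + 1)
    )
    return numer / denom


# Rows: favorably vs. unfavorably evaluated experimenters (counts from the text).
# Columns: obtained conditioning vs. did not (assumed layout for illustration).
p_value = fisher_one_sided(10, 0, 5, 4)
print(f"one-sided Fisher exact p = {p_value:.4f}")
```

Under these assumptions the observed table is the most extreme one its margins allow, so the one-sided probability is just the probability of that single table.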
Modeling Effects
From the fields of survey research, child development, clinical psychology, and from laboratory experiments, there is a reasonable amount of evidence to suggest that the nature of the data collector’s own task performance may be a nontrivial determinant of his subject’s subsequent task performance (Rosenthal, 1966). Though most of the evidence for such modeling effects comes from studies in which the experimenter-subject contact is very brief, there are some studies, usually of the ‘‘field study’’ variety, that are based on more prolonged contact. One of these, the classic study of Escalona (1945), shows that modeling effects do not depend on verbal communication. The subjects were 50 babies, most of them less than one year old. On alternate days they were given orange juice and tomato juice and many of the babies drank more heartily of one juice than of the other. It turned out that the ladies who fed the babies also had marked preferences for one juice over the other and that babies fed by orange
juice preferrers preferred orange juice, while babies fed by tomato juice preferrers preferred tomato juice. When babies were reassigned to new feeders with a different preference, the babies tended to change their preference to coincide with that of the new feeder. In another long term experiment, Yando and Kagan (1966) found that first-graders taught by teachers whose decision-making was ‘‘reflective’’ became significantly more reflective during the course of the school year relative to the children taught by teachers whose decision-making was ‘‘impulsive.’’ In a recent report of a more short-term interpersonal contact, Barnard (1968) found that the experimenter’s degree of disturbance on hostile phrases of a phrase-association test was significantly predictive of subjects’ subsequent degree of disturbance on hostile phrases. Similarly, in a Rorschach study, Marwit (1968) found that experimenters whose vocal behavior was more hostile elicited significantly more hostile behavior from their subjects. Finally, Klinger (1967) has shown that even when based entirely on nonverbal cues, an experimenter who appeared more achievement-motivated elicited significantly more achievement-motivated responses from his subjects.
The Experimenter’s Expectancy
In the discussion just concluded we have considered briefly some sources of artifact deriving from the experimenter himself. We have seen that a variety of personal and situational variables associated with the experimenter may unintentionally affect the subject’s responses. Our discussion was not exhaustive but only illustrative and a number of sources are available for obtaining a more complete picture (Krasner and Ullman, 1965; Masling, 1960, 1966; McGuigan, 1963; Rosenthal, 1966; Sarason, 1965; Sattler and Theye, 1967; Stevenson, 1965; Zax, Stricker, and Weiss, 1960). We turn our attention now to a somewhat more detailed consideration of another potential source of artifact associated with the experimenter: his research hypothesis. The particular expectation a scientist has of how his experiment will turn out is variable, depending on the experiment being conducted, but the presence of some expectation is virtually a constant in science. The independent and dependent variables selected for study by the scientist are not chosen by means of a table of random numbers. They are selected because the scientist expects a certain relationship to emerge among them. Even in those less carefully planned examinations of relationships called ‘‘fishing expeditions’’ or, more formally, ‘‘exploratory analyses,’’ the expectation of the scientist is reflected in the selection of the entire set of variables chosen for examination. Exploratory analyses of data, like real fishing expeditions, do not take place in randomly selected pools. These expectations of the scientist are likely to affect the choice of the experimental design and procedure in such a way as to increase the likelihood that his expectation or hypothesis will be supported. That is as it should be. No scientist would select intentionally a procedure likely to show his hypothesis in error. 
If he could too easily think of procedures that would show this, he would be likely to revise his hypothesis. If the selection of a research design or procedure is regarded by another scientist as too ‘‘biased’’ to be a fair test of the hypothesis, he can test the hypothesis employing oppositely biased procedures or less biased procedures by which to demonstrate the greater value of his hypothesis. The designs and procedures
employed are, to a great extent, public knowledge, and it is this public character that permits relevant replications to serve the required corrective function. The major concern of this chapter is with the effects of the experimenter’s expectation on the responses he obtains from his subjects. The consequences of such an expectancy bias can be quite serious. Expectancy effects on subjects’ responses are not public matters. It is not only that other scientists cannot know whether such effects occurred in the experimenter’s interaction with his subjects; the investigator himself may not know whether these effects have occurred. Moreover, there is the likelihood that the experimenter has not even considered the possibility of such unintended effects on his subjects’ responses. This is not so different from the situations already discussed wherein the subject’s response is affected by any attribute of the experimenter. Later, the problem will be discussed in more detail. For now it is enough to note that while the other attributes of the experimenter affect the subject’s response, they do not necessarily affect these responses differentially as a function of the subject’s treatment condition. Expectancy effects, on the other hand, always do. The sex of the experimenter does not change as a function of the subject’s treatment condition in an experiment. The experimenter’s expectancy of how the subject will respond does change as a function of the subject’s treatment condition. Although the focus of this chapter is primarily on the effects of a particular person—the experimenter—on the behavior of a specific other—the subject—it should be emphasized that many of the effects of the experimenter, including the effects of his expectancy, may have considerable generality for other social relationships. 
That one person’s expectation about another person’s behavior may contribute to a determination of what that behavior will actually be has been suggested by various theorists. Merton (1948) developed the very appropriate concept of ‘‘self-fulfilling prophecy.’’ One prophesies an event and the expectation of the event then changes the behavior of the prophet in such a way as to make the prophesied event more likely. Gordon Allport (1950) applied the concept of interpersonal expectancies to an analysis of the causes of war. Nations expecting to go to war affect the behavior of their opponents-to-be by the behavior which reflects their expectations of armed conflict. Nations that expect to remain out of wars at least sometimes manage to avoid entering into them. Drawn from the general literature, and the literatures of the healing professions, survey research, and laboratory psychology, there is considerable evidence for the operation of interpersonal self-fulfilling prophecies. This evidence, while ranging from the anecdotal to the experimental, with emphasis on the former, permits us to begin consideration of more recent research on expectancy effects with possibly more than very gentle priors (Mosteller and Tukey, 1965). The literatures referred to have been reviewed elsewhere (Rosenthal, 1964a,b, 1965, 1966; Rosenthal and Jacobson, 1968), but it may be of interest here to give one illustration from experimental psychology. The example is one known generally to psychologists as a case study of an artifact in animal research. It is less well known, however, as a case study of the effect of experimenter expectancy. While the subject sample was small, the experimenter sample was very large indeed. The case, of course, is that of Clever Hans (Pfungst, 1911). Hans, it will be remembered, was the horse of Mr. von Osten, a German mathematics teacher. By means of tapping his foot, Hans was able to add, subtract, multiply, and divide. 
Hans could spell, read, and solve problems of musical
harmony. To be sure, there were other clever animals at the time, and Pfungst tells about them. There was ‘‘Rosa,’’ the mare of Berlin, who performed similar feats in vaudeville, and there was the dog of Utrecht, and the reading pig of Virginia. All these other clever animals were highly trained performers who were, of course, intentionally cued by their trainers. Von Osten, however, did not profit from his animal’s talent, nor did it seem at all likely that he was attempting to perpetrate a fraud. He swore he did not cue the animal, and he permitted other people to question and to test the horse even without his being present. Pfungst and his famous colleague, Stumpf, undertook a program of systematic research to discover the secret of Hans’ talents. Among the first discoveries made was that if Hans could not see the questioner, then the horse was not clever at all. Similarly, if the questioner did not himself know the answer to the question, Hans could not answer it either. Still, Hans was able to answer Pfungst’s questions as long as the investigator was present and visible. Pfungst reasoned that the questioner might in some way be signaling to Hans when to begin and when to stop tapping his foot. A forward inclination of the head of the questioner would start Hans tapping, Pfungst observed. He tried then to incline his head forward without asking a question and discovered that this was sufficient to start Hans tapping. As the experimenter straightened up, Hans would stop tapping. Pfungst then tried to get Hans to stop tapping by using very slight upward motions of the head. He found that even the raising of his eyebrows was sufficient. In fact, even the dilation of the questioner’s nostrils was a cue for Hans to stop tapping. When the questioner bent forward more, the horse would tap faster. This added to the reputation of Hans as brilliant. 
That is, when a large number of taps was the correct response, Hans would tap rapidly until he approached the region of correctness, and then he would begin to slow down. It was found that questioners typically bent forward more when the answer was a long one, gradually straightening up as Hans got closer to the correct number. For some experiments, Pfungst discovered that auditory cues functioned additively with visual cues. When the experimenter was silent, Hans was able to respond correctly 31 per cent of the time in picking one of many placards with different words written on it, or cloths of different colors. When auditory cues were added, Hans responded correctly 56 per cent of the time. Pfungst himself then played the part of Hans, tapping out responses to questions with his hand. Of 25 questioners, 23 unwittingly cued Pfungst as to when to stop tapping in order to give a correct response. None of the questioners (men and women of all ages and occupations) knew the intent of the experiment. When errors occurred, they were usually only a single tap from being correct. The subjects of this study, including an experienced psychologist, were unable to discover that they were unintentionally emitting cues. Hans’ amazing talents, talents rapidly acquired too by Pfungst, serve to illustrate the power of the self-fulfilling prophecy. Hans’ questioners, even skeptical ones, expected Hans to give the correct answers to their queries. Their expectation was reflected in their unwitting signal to Hans that the time had come for him to end his tapping. The signal cued Hans to stop, and the questioner’s expectation became the reason for Hans’ being, once again, correct. Not all of Hans’ questioners were equally good at fulfilling their prophecies. Even when the subject is a horse, apparently, the attributes of the experimenter make a
considerable difference in determining the subject’s response. On the basis of his studies, Pfungst was able to summarize the characteristics of those of Hans’ questioners who were more successful in their covert and unwitting communication with the horse. Among the characteristics of the more successful unintentional influencers were those of tact, an air of dominance, attention to the business at hand, and a facility for motor discharge. Pfungst’s observations of 60 years ago seem not to have suffered excessively for the lack of more modern methods of scaling observations. To anticipate some of the research findings to be presented later, it must be said that Pfungst’s description seems also to fit those experimenters who are more likely to affect their human subject’s response by virtue of their experimental hypothesis. In summarizing his difficulties in learning the nature of Clever Hans’ talents, Pfungst felt that he had been too long off the track by ‘‘looking for, in the horse, what should have been sought in the man.’’ Perhaps, too, when we conduct research in the behavioral sciences we are sometimes caught looking at our subjects when we ought to be looking at ourselves. It was to this possibility that much of the research to be reviewed here was addressed.
Animal Learning
A good beginning might have been to replicate Pfungst’s research, but with horses hard to come by, rats were made to do (Rosenthal and Fode, 1963a). A class in experimental psychology had been performing experiments with human subjects for most of a semester. Now they were asked to perform one more experiment, the last in the course, and the first employing animal subjects. The experimenters were told of studies that had shown that maze-brightness and maze-dullness could be developed in strains of rats by successive inbreeding of the well- and the poorly-performing maze-runners. Sixty laboratory rats were equitably divided among the 12 experimenters. 
Half the experimenters were told that their rats were maze-bright while the other half were told that their rats were maze-dull. The animal’s task was to learn to run to the darker of two arms of an elevated T-maze. The two arms of the maze, one white and one gray, were interchangeable; and the ‘‘correct’’ or rewarded arm was equally often on the right as on the left. Whenever an animal ran to the correct side he obtained a food reward. Each rat was given 10 trials each day for five days to learn that the darker side of the maze was the one which led to the food. Beginning with the first day and continuing on through the experiment, animals believed to be better performers became better performers. Animals believed to be brighter showed a daily improvement in their performance, while those believed to be dull improved only to the third day and then showed a worsening of performance. Sometimes an animal refused to budge from his starting position. This happened 11% of the time among the allegedly bright rats; but among allegedly dull rats it happened 29% of the time. When animals did respond correctly, those believed to be brighter ran faster to the rewarded side of the maze than did even the correctly responding rats believed to be dull (z = +2.05). When the experiment was over, all experimenters made ratings of their rats and of their own attitudes and behavior vis-à-vis their animals. Those experimenters who had been led to expect better performance viewed their animals as brighter, more pleasant, and more likeable. These same experimenters felt more relaxed in their
contacts with the animals and described their behavior toward them as more pleasant, friendly, enthusiastic, and less talkative. They also stated that they handled their rats more often and also more gently than did the experimenters expecting poor performance.

The next experiment to be described also employed rat subjects, using this time not mazes but Skinner boxes (Rosenthal and Lawson, 1964). Because the experimenters (39) outnumbered the subjects (14), experimenters worked in teams of two or three. Once again about half the experimenters were led to believe that their subjects had been specially bred for excellence of performance. The experimenters who had been assigned the remaining rats were led to believe that their animals were genetically inferior.

The learning required of the animals in this experiment was more complex than that required in the maze learning study. This time the rats had to learn in sequence and over a period of a full academic quarter the following behaviors: to run to the food dispenser whenever a clicking sound occurred, to press a bar for a food reward, to learn that the feeder could be turned off and that sometimes it did not pay to press the bar, to learn new responses with only the clicking sound as a reinforcer (rather than the food), to bar-press only in the presence of a light and not in the absence of the light, and, finally, to pull on a loop which was followed by a light which informed the animal that a bar-press would be followed by a bit of food.

At the end of the experiment the performance of the animals alleged to be superior was, in fact, superior to that of the allegedly inferior animals (z = +2.17) and the difference in learning favored the allegedly brighter rats in all five of the laboratory sections in which the experiment was conducted. Just as in the maze learning experiment, the experimenters of the present study were asked to rate their animals and their own attitudes and behaviors toward them.
Once again those experimenters who had expected excellence of performance judged their animals to be brighter, more pleasant, and more likeable. They also described their own behavior as more pleasant, friendly, enthusiastic, and less talkative, and they felt that they tended to watch their animals more closely, to handle them more, and to talk to them less. One wonders what was said to the animals by those experimenters who believed their rats to be inferior. The absolute amount of handling of animals in this Skinner box experiment was considerably less than the handling of animals in the maze learning experiment. Nonetheless, those experimenters who believed their animals to be Skinner box bright handled them relatively more, or said they did, than did experimenters believing their animals to be dull. The extra handling of animals believed to be brighter may have contributed in both experiments to the superior learning shown by these animals. In addition to the differences in handling reported by the experimenters of the Skinner box study as a function of their beliefs about their subjects, there were differences in the reported intentness of their observation of their animals. Animals believed to be brighter were watched more carefully, and more careful observation of the rat’s Skinner box behavior may very well have led to more rapid and appropriate reinforcement of the desired response. Thus, closer observation, perhaps due to the belief that there would be more promising responses to be seen, may have made more effective teachers of the experimenters expecting good performance. Cordaro and Ison (1963) employed 17 experimenters to conduct conditioning experiments with 34 planaria. Five of the experimenters were led to expect that their
worms (two apiece) had already been taught to make many turning and contracting responses. Five of the experimenters were led to expect that their worms (also two apiece) had not yet been taught to make many responses and that in “only 100 trials” little turning and contracting could be expected. The seven experimenters of the third group were each given both these opposite expectancies, one for each of their two worms. Behavior of the worms was observed by the experimenters looking down into a narrow (½") and shallow (¼") v-shaped trough into which each worm was placed.

The results of the Cordaro and Ison experiment are easily summarized. Regardless of whether the experimenter prophesied the same results for both his worms or prophesied opposite results for his two worms, when the experimenter expected more turning and contracting he obtained more turning and contracting (z > +3.25). Similar results in studies of planaria have been obtained in two studies reported by Hartry (1966) and in studies of rats by Ingraham and Harrington (1966), and Burnham (1966) to whose study we shall later return.

From the results of these studies we cannot be sure that the behavior of the animal was actually affected by the expectation of the experimenter, though that possibility cannot be ruled out. It is also possible, however, that only the experimenter’s perception of the animal’s behavior was affected by his hypothesis. That view of the results as examples of observer effects is quite plausible in the case of the planaria studies. Those worms, after all, are hard to see. That same view, however, is far less plausible in the case of the rat studies. It is difficult, for example, to confuse a rat’s going out on the right arm of the maze with his going out on the left arm.
Everything we know about the effects of handling on performance, the effects of set on reaction time (which would make an experimenter expecting bright Skinner box performance a faster and better ‘‘shaper’’), and the base rate for recording errors, suggest that it is more plausible to think that the rat’s behavior was affected than that the experimenters saw so badly or lied so much. But what about those worms? Surely an experimenter cannot affect a worm in a trough to behave differently as a function of his expectation? Perhaps not, but perhaps so. Ray Mulry (1966) has pointed out in a personal communication how when the control of an unconditioned stimulus to the worm is not automatic, the experimenter may unwittingly teach worms differentially by his application of the unconditioned stimulus. Even in fully automated set-ups, however, we cannot yet rule out the possibility that worms can be affected differentially by a closely observing experimenter. Stanley Ratner (1966) has suggested in a personal communication that changes in the respiration or even temperature of the experimenter might (or might not) affect the worm’s response. Relative to the small worm, in a small amount of water, the closely watching experimenter presents a potentially large source of various physical stimuli. The hypothesis of worm sensitivity to experimenter respiration changes is especially interesting in view of earlier research on dogs suggesting that they were substantially influenced by changes in their trainers’ respiration (Rosenthal, 1965). We have now described a number of experiments in which the effects of experimenter expectancy were investigated in studies of animal learning. We have given only enough details of a few of these studies to show the type of research conducted, but it would be desirable to have some systematic way to summarize the results of all the experiments conducted, including those only briefly mentioned. 
Our need to develop some systematic way to summarize runs of studies is greater when shortly we turn to a consideration of human subjects, for there we shall have over 80 studies to consider.
Appraisal of a Research Domain

We all join in the clarion calls for “more research” and echo the sentiment that this or that research is in (1. some, 2. much, 3. sore, 4. dire) need of replication. But sometimes we seem not quite sure of what to do with the replication when we have it. Behavioral scientist X finds A > B at p < .05. What shall be his conclusion after replication as a function of his second result? If replication yields B > A at p < .05 he can conclude that A is not always larger than B, that A is too often too different from B or, and only in this case is he likely to err, that on the average, A = B.

For the moment putting aside considerations of statistical power differences between replications, it would seem that considerable information could be conveyed by just the direction of difference in the two studies and the associated p values. These p values can be handily traded in for standard normal deviates (z) and they, in turn, can be added, subtracted, multiplied, and divided. An algebraically signed normal deviate gives the direction and likelihood of a difference, while an unsigned normal deviate gives the nondirectional likelihood. If we have 10 experiments showing A > B at p = .05 and 10 experiments showing B > A at p = .05, then the average directional z is zero but the average nondirectional z is large enough so that we would be rash to conclude that A = B. Instead we would probably want to conclude that A and B differ too often, but unpredictably, and the research task might then be to reduce this unpredictability.

There is another important advantage to translating the results of runs of experiments to the standard normal deviate equivalents of the p values obtained. That advantage accrues from the fact that the sum of a set of standard normal deviates when divided by the square root of the number of zs, yields an overall z that tells the overall likelihood of the obtained results considered as a set (Mosteller and Bush, 1954).
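The arithmetic just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the p of .05 is treated as one-tailed (the text does not say), and the function names `p_to_z` and `combined_z` are invented here for the example.

```python
from math import sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def p_to_z(p_one_tailed, in_predicted_direction=True):
    """Trade a one-tailed p for an algebraically signed standard normal deviate."""
    z = STD_NORMAL.inv_cdf(1.0 - p_one_tailed)
    return z if in_predicted_direction else -z


def combined_z(zs):
    """Sum of the zs divided by the square root of the number of zs."""
    return sum(zs) / sqrt(len(zs))


# Ten experiments showing A > B at p = .05 and ten showing B > A at p = .05
# (assumption: one-tailed p values).
zs = [p_to_z(0.05)] * 10 + [p_to_z(0.05, in_predicted_direction=False)] * 10

mean_directional = sum(zs) / len(zs)                     # essentially zero
mean_nondirectional = sum(abs(z) for z in zs) / len(zs)  # about 1.645

print(mean_directional, mean_nondirectional)
print(combined_z([abs(z) for z in zs]))  # overall nondirectional z for the set
```

The signed average cancels while the unsigned average does not, which is the sense in which it would be rash to conclude that A = B.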
In the summary of what is now known about the effects on research results of the experimenter’s expectancy, we shall want to make use of these helpful characteristics of the standard normal deviate. But our purpose will not be solely to summarize the results of what we know about expectancy effects. An additional purpose is to employ these data as an illustration of how we may deal in a more global, overall way with the results of runs of experiments which, despite their differences in sampling, in procedure, and in outcome, are all addressed essentially to the same hypothesis or proposition. There may be sufficient usefulness to the method to warrant its more widespread adoption by those undertaking a comprehensive review of a given segment of the literature of the behavioral sciences. With the increased interest in the effect of the experimenter’s expectancy there have been increasing numbers of literature summaries (Rosenthal, 1963, 1964a, 1966, 1968a; Barber and Silver, 1968). These summaries have been generally cumulative, but none have been sufficiently systematic. Even the most recent ones have considered less than half the available experimental evidence. Altogether, well over a hundred studies are known to have been conducted, all based on independent samples and all addressed to the central proposition that interpersonal expectancies in the research situation may be a significant source of artifact in behavioral research. For over 80 of these, more or less formal reports are available. These reports include published papers, papers presented at professional meetings, doctoral dissertations, master’s theses, honors theses, and some unpublished manuscripts. The remaining studies, in various stages of completion or availability posed a
problem; should they be included or not? It was decided that any research, even if lacking a formal report, would be included if (a) at least an informal description of the procedures were available and (b) the raw data were available for analysis. An additional dozen studies met this criterion and are listed in the references as ‘‘unpublished data.’’ Of the remaining studies several will shortly become available to the writer and two reports have been seen but not yet used. Both of these reports presented data insufficient for the computation or even reasonable estimation of a z value. One of these reports claimed effects of experimenter expectancy and one claimed the absence of such effects. For most of the studies summarized, no exact p was given, only that p was less than or greater than some arbitrary value. In those cases, exact ps were computed for our present purpose. Sometimes there was more than a single overall test of the same hypothesis of expectancy effect and in those cases median ps are presented. In a few studies, orthogonal overall tests were made of the effects of two different types of expectancies and in those cases only the more extreme z was retained after a correction was made for having made two tests. The correction involved doubling the p before finding the corrected associated z value. The same correction was employed in those cases where a specific p was computed and it was stated that another test yielded a larger p but without that p being given. The basic paradigm in the experiments to be summarized has been to establish two groups of experimenters and to generate for each group a different expectation for their subjects’ responses. For the studies following this paradigm it was a straightforward procedure to compute the overall exact p along with its corresponding directional z value. 
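The correction for having made two tests can be illustrated as follows. This is a sketch rather than a record of the computations actually performed: it assumes one-tailed p values, and the function name `corrected_z_for_two_tests` is hypothetical.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()


def corrected_z_for_two_tests(z_first, z_second):
    """Retain the more extreme of two zs, double its p, and convert back to z.

    Assumption: the p being doubled is the one-tailed p of the retained z.
    """
    extreme = max(z_first, z_second, key=abs)
    if extreme == 0:
        return 0.0  # nothing to correct; avoids inv_cdf(0.0)
    p = 1.0 - STD_NORMAL.cdf(abs(extreme))
    corrected = STD_NORMAL.inv_cdf(1.0 - 2.0 * p)
    return corrected if extreme > 0 else -corrected


# Hypothetical pair of orthogonal tests, not taken from any study in the review.
print(corrected_z_for_two_tests(2.33, 0.50))
```

Doubling the p always pulls the retained z toward zero, so the correction can only make the summary more conservative.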
For the purpose of overall assessment, it was convenient to divide the obtained zs into three groups: those of +1.28 or greater, those falling between −1.28 and +1.28, and those of −1.28 or smaller. We expect 10 per cent of the results to fall in the first group, 80 per cent of the results to fall in the second group, and 10 per cent of the results to fall in the third group under the hypothesis of no expectancy effect. For the sake of simplicity, exact zs were recorded only for zs greater than +1.28 or less than −1.28 with zs falling between those values entered simply as .00. The net effect of this procedure, as we shall see, was to make our overall assessments somewhat too conservative or Type II Error-prone, but advantages of simplicity and clarity seemed to outweigh this disadvantage. The sign of the z value, of course, was positive when the difference between groups was in the predicted direction and negative when the difference was in the unpredicted direction.

A special problem occurred, however, for those studies in which the basic paradigm was extended to include additional experimental or control group conditions. Thus, there were some studies in which a control group was included whose experimenters had been given no expectations for their subjects’ responses. In those cases the z value associated with the overall test of expectancy effects has a meaning somewhat different from the situation in which there are only two experimental groups. A large nondirectional z means that the experimenter’s expectancy made some difference in subjects’ responses but it is not so easy to prefix the algebraic sign of the z. It often happened that the control group differed more from the two experimental groups than the experimental groups differed from each other. Essentially, then, considering all available studies there were two hypotheses being tested rather than just one.
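The three-way grouping and the recording rule can be checked with a short sketch. The names `recorded_z` and `expected_proportions` are invented for the illustration; the cutoff of 1.28 and the treatment of a z of exactly 1.28 as falling in the outer group follow the wording of the text.

```python
from statistics import NormalDist

CUTOFF = 1.28  # boundary between the middle group and the two outer groups


def recorded_z(z):
    """Keep the exact z at or beyond the cutoff; enter anything between as .00."""
    return z if abs(z) >= CUTOFF else 0.0


def expected_proportions(cutoff=CUTOFF):
    """Proportions expected in each group under the hypothesis of no effect."""
    upper = 1.0 - NormalDist().cdf(cutoff)
    return {"upper": upper, "middle": 1.0 - 2.0 * upper, "lower": upper}


print(expected_proportions())  # roughly 10, 80, and 10 per cent
print([recorded_z(z) for z in (2.05, 1.10, -0.40, -1.60)])
```

Because every z between the cutoffs is entered as .00, the sum of recorded zs in the predicted direction can only understate the sum of the exact zs whenever the middle-group results lean in that direction.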
The first hypothesis is that experimenters’ expectations significantly affect their subjects’ responses in some way and it is tested by considering the absolute magnitude of the zs obtained. The second hypothesis is that
experimenters’ expectations affected their subjects’ responses in such a way as to lead to too many responses in the direction of the experimenter’s expectation. This hypothesis is tested by considering the algebraic magnitude of the zs obtained. For those studies in which the overall z was only a test of the first hypothesis, an additional directional z was also computed addressed to the question of the degree to which the experimental manipulation of expectations led to hypothesis-confirming responses. One more difficult decision was to be made. That had to do with determining the number of studies to be counted for each paper. Many papers described more than one experiment but sometimes investigators regarded these as several studies with no overall test of significance and sometimes investigators pooled the data from several studies and tested the overall significance. The guiding principle employed in an earlier summary (Rosenthal, 1968a) was to count as more than one experiment only those within a given paper that employed both a different sample of subjects and a substantial difference in procedure. For this more comprehensive review it was felt to be more informative to count as a separate experiment those that employed either a different sample of subjects or a substantial difference in procedure for some of their subjects. It was often difficult to decide when some procedural difference was substantial and quite unavoidably this had to remain a matter of the writer’s judgment. There is no doubt that other workers might have classified some procedures as substantially different that were here regarded as essentially similar, and that some studies treated separately here would have been regarded by others as essentially the same. The major protection against serious errors of inference due to this matter of judgment comes from subsequent analyses that consider all research done by a given principal investigator or at a given laboratory as a single result. 
Judgment and, therefore, possible error also entered into the calculation of each of the many z values. Methods of dealing with multiple p values have been referred to, but sometimes (e.g., when no overall p had been computed) it was necessary to decide on the most appropriate overall test. Here, too, it seems certain that, for any given study, different workers might have chosen different tests as most appropriate. Because of the large amounts of raw data analyzed by the writer, and because of the many secondary analyses performed when the original was felt to be inappropriate, a rule of thumb aimed primarily at the goal of convenience was developed. Given a choice of several more-or-less equally defensible procedures (e.g., multiple regression, analysis of covariance, treatments by levels analysis of variance) the most simple procedure was selected (e.g., treatments by levels) with the criterion of simplicity geared to the use of a desk calculator. This rule of thumb probably had the effect of decreasing more zs than it increased, since, in general, the more elegant procedures use more of the information in the data and generally lead to a reduction of Type II errors. This is likely to be especially true when the distribution of obtained z values is as radically skewed as the one obtained. Despite these sources of errors of conservatism, the possibility of biased judgment and sheer error on the part of the writer cannot and should not be ruled out. As protection against the possibility of these biases we shall later want to make some very stringent corrections.

Since we have already summarized partially the results of studies of expectancy effect employing animal subjects, that seems a good sub-domain with which to illustrate our systematic summarization procedure. Table 6-3 lists nine studies testing the hypothesis of expectancy effects. Burnham conducted a single study by our criterion, but the next three sets of authors conducted two apiece. For Cordaro and
Table 6–3 Expectancy Effects in Studies of Animal Learning

Study                                                Standard normal deviate
Code number  Authors                  Year   Experiment  Nondirectional  Directional
1.           Burnham                  1966   I           1.50            +1.95
2.           Cordaro and Ison         1963   I           3.96            +3.96
3.           Cordaro and Ison         1963   II          3.25            +3.25
4.           Hartry                   1966   I           5.38            +5.38
5.           Hartry                   1966   II          3.29            +3.29 (b)
6.           Ingraham and Harrington  1966   I           1.48            +1.48
7.           Ingraham and Harrington  1966   II          2.10            +2.10
8.           Rosenthal and Fode       1963a  I           2.33            +2.33
9.           Rosenthal and Lawson     1964   I           2.17            +2.17

             Sum                                         25.46           +25.91
             √9                                          3               3
             z                                           8.49            +8.64
             p <                                         1/(million)²    1/(million)²

(a) See also Rosenthal, 1967b, 1967c.
(b) Not based on exact p; exact z probably exceeds 5 or 6.
Ison and for Ingraham and Harrington one experiment was defined as that in which one group of experimenters was given one expectation while the other group of experimenters was given the opposite expectation. In both these papers, the second experiment was defined as that in which another group of experimenters held positive expectations for one group of their subjects and negative expectations for another group of their subjects. To put it another way, when experimenter expectancy was a between-experimenter source of variation that was regarded as an experiment different from that in which experimenter expectancy was a within-experimenter source of variation. Finally, the experiments reported by Hartry were simply two independent experiments conducted at different times by different experimenters and with differences in procedures. Interestingly, it was the second study, with its tighter controls for inexperience of the experimenters, for observer errors, and for intentional errors, that showed the greater magnitude of expectancy effect with planaria subjects. For each of the studies in Table 6-3, two z values are given. The first is the z associated with an overall test of the hypothesis that the groups of experimenters employed showed some difference. The second is the z associated with the specific test that experimenters given one expectation obtained data in the direction of that expectation more than when experimenters were given some other expectation. For each column of zs the sum of the zs is indicated, as is the square root of the number of zs and the new z obtained when the former is divided by the latter. Finally, the p associated with the overall z is given. As it turned out for studies involving animal subjects, the nondirectional zs are identical in absolute value to the directional zs in every case except one. 
That was because in only that study was there a comparison among more groups than simply those reflecting each of two different expectations. The combined probability of obtaining the overall z based either on the nondirectional or the directional zs is infinitesimally low. In order to bring the combined p value for the directional test to .05, another 239 experiments with an average directional z value of exactly .00 would have to be conducted.
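The figure of 239 additional experiments can be recovered from the directional column of Table 6-3. The sketch below assumes that "the .05 level" means a one-tailed criterion (z of about 1.645), an assumption that happens to reproduce the figure in the text; the function name `null_studies_needed` is invented for the example.

```python
from statistics import NormalDist


def null_studies_needed(zs, alpha=0.05):
    """Number of further studies averaging z = .00 that would pull the
    combined z (sum of zs over root of their number) down to the critical value.

    Solves sum(zs) / sqrt(len(zs) + k) = z_crit for k.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha)  # one-tailed criterion (assumption)
    return (sum(zs) / z_crit) ** 2 - len(zs)


# Directional zs from Table 6-3.
animal_directional = [1.95, 3.96, 3.25, 5.38, 3.29, 1.48, 2.10, 2.33, 2.17]

k = null_studies_needed(animal_directional)
print(k)  # a little over 239
```

Adding studies with a z of exactly .00 leaves the sum of zs unchanged and enlarges only the divisor, which is why the required number grows with the square of the sum.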
Table 6–4 Expectancy Effects in Studies of Animal Learning by Principal Investigators

Principal investigator         Standard normal deviate
Code number  Name          Nondirectional  Directional
I            Burnham       1.50            +1.95
II           Ison          5.11            +5.11
III          Hartry        6.15            +6.15
IV           Harrington    2.54            +2.54
V            Rosenthal     3.19            +3.19

             Sum           18.49           +18.94
             √5            2.24            2.24
             z             8.25            +8.46
             p <           1/(million)²    1/(million)²
In addition to a ‘‘per experiment’’ appraisal it was mentioned earlier that a ‘‘per principal investigator’’ appraisal might protect us from some erroneous inferences. Table 6-4 gives the ‘‘per principal investigator’’ results and they are found to be very similar to those based on experiments. It would take over 127 new principal investigators obtaining an average directional z value of exactly .00 to bring the combined p value to the .05 level. The zs given for each principal investigator are usually based on the method of combining ps described earlier in detail. In a few cases, however, an overall test of the expectancy effect for the several studies was already available and then that overall value was employed.
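The step from the per experiment zs of Table 6-3 to the per principal investigator zs of Table 6-4 can be sketched as follows. The grouping of papers under investigators is read off the two tables, and the sketch assumes every investigator's z was obtained by the sum-over-root-n method, which the text says was usual but not universal. The recomputed values agree with the tabled ones to within a few hundredths; the small gaps are presumably due to rounding in the original hand calculations.

```python
from math import sqrt


def combined_z(zs):
    """Sum of the zs divided by the square root of the number of zs."""
    return sum(zs) / sqrt(len(zs))


# Directional zs from Table 6-3, grouped by principal investigator
# (grouping inferred from Tables 6-3 and 6-4).
directional_by_investigator = {
    "Burnham": [1.95],
    "Ison": [3.96, 3.25],
    "Hartry": [5.38, 3.29],
    "Harrington": [1.48, 2.10],
    "Rosenthal": [2.33, 2.17],
}

per_investigator = {
    name: combined_z(zs) for name, zs in directional_by_investigator.items()
}
overall = combined_z(list(per_investigator.values()))

for name, z in per_investigator.items():
    print(name, round(z, 2))
print("overall", round(overall, 2))
```

Treating each investigator as a single result reduces the number of zs from nine to five, so a laboratory that contributed two experiments no longer counts twice in the overall test.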
Human Subjects

So far we have given only the results of studies of expectancy effect in which the subjects were rats or worms. Most of the research available, however, is based on human subjects and it is those results that we now consider. In this set of experiments at least 20 different specific tasks have been employed, but some of these tasks seemed sufficiently related to one another that they could reasonably be regarded as a family of tasks or a research area. These areas include human learning and ability, psychophysical judgments, reaction time, inkblot tests, structured laboratory interviews, and person perception. We consider each in turn.

Learning and Ability

Table 6-5 summarizes the results of the per experiment and the per principal investigator analysis. There appeared to be no appreciable effects of the experimenter’s expectancy on subjects’ performance of (a) the Wechsler Adult Intelligence Scale (Getter et al.), (b) the Block Design subtest of the same Scale (Wartenberg-Ekren), (c) a color-recognition task (Timaeus and Lück), and (d) a dot-tapping task (Wessler). Two of the experiments (Kennedy et al.) employed a verbal conditioning task and in the second of these studies those experimenters expecting greater “conditioning” obtained greater conditioning than did the experimenters expecting less conditioning. In this experiment, as indicated by the footnote of Table 6-5, as in many others, there was also an interaction effect (defined as
Table 6–5 Expectancy Effects in Studies of Learning and Ability

Study                                                 Standard normal deviate
Code number  Authors                   Year   Experiment  Nondirectional  Directional
1.           Getter, M, H, W           1967   I           .00             .00
2.           Hurwitz and Jenkins       1966   I           1.28            +1.28
3.           Johnson                   1967   I           3.89            +3.89 (a)
4.           Kennedy, E, W             1968   I           .00             .00
5.           Kennedy, C, B             1968   I           2.27            +2.27 (a)
6.           Larrabee and Kleinsasser  1967   I           1.60            +1.60
7.           Timaeus and Lück          1968b  I           .00             .00
8.           Wartenberg-Ekren          1962   I           .00             .00
9.           Wessler                   1968b  I           .00             .00

SUMMARY
By Study                   Sum                            9.04            +9.04
                           √9                             3               3
                           z                              3.01            +3.01
                           p <                            .0015           .0015
By Principal Investigator  Sum                            8.38            +8.38
                           √8                             2.83            2.83
                           z                              2.96            +2.96
                           p =                            .0015           .0015

(a) Indicates that experimenter expectancy interacted with another variable at z > |1.28|.
z ≥ +1.28 or z ≤ −1.28) between experimenter expectancy and some other variable. In this study the interaction took the form that those experimenters of a more “humanistic” or optimistic disposition obtained greater biasing effects of expectations than did experimenters of a more “deterministic” or pessimistic disposition.

The earlier experiment by Kennedy et al. also varied expectations about subjects’ conditioning scores but half the time the experimenters had visual contact with their subjects (i.e., sat across from them) and half the time they had no visual contact with subjects (i.e., sat behind them). Because the same experimenters were employed in both conditions, we count this as only one experiment and the overall test of significance showed the directional z < +1.28 and no interaction at z = |1.28| between expectancy and visual contact. It was nevertheless of interest to note that in the face-to-face condition, experimenters expecting greater conditioning obtained greater conditioning (z = +1.95) than did the experimenters expecting less conditioning. In the condition of no visual contact the analogous z was very close to zero. These results, though summarized as a “failure to replicate,” do suggest that visual cues may be important to the communication of experimenter expectations to the subjects of a verbal conditioning experiment as they were important to the communication of expectations to Clever Hans. The second experiment in the Kennedy series was conducted with experimenter and subject in face-to-face contact.

Especially instructive for its unusual within-subject experimental manipulation was the study by Larrabee and Kleinsasser. They employed five experimenters to administer the Wechsler Intelligence Scale for Children (WISC) to 12 sixth-graders of average intelligence. Each subject was tested by two different experimenters, one administering the even-numbered items and the other administering the odd-numbered items.
For each subject, one of the experimenters was told that the child was of above-average intelligence while the other experimenter was told that the child was of below-average intelligence. When the child’s experimenter expected superior performance the total IQ earned was 7.5 points higher on the average than when the child’s experimenter expected inferior performance. When only the performance subtests of the WISC were considered, the advantage to the children of having been expected to do well was less than three IQ points and could easily have occurred by chance. When only the verbal subtests of the WISC were considered, the advantage of having been expected to do well, however, exceeded 10 IQ points. The particular subtest most affected by experimenters’ expectancies was Information. The results of this study are especially striking in view of the very small sample size (12) of subjects employed.

In the experiment by Hurwitz and Jenkins the tasks were not standardized tests of intelligence, but rather two standard laboratory tests of learning. Three male experimenters administered a rote verbal learning task and a mathematical reasoning task to a total of 20 female subjects. From half their subjects the experimenters were led to expect superior performance; from half they were led to expect inferior performance. In the rote learning task, subjects were shown a list of pairs of nonsense syllables and were asked to remember one of the pair members from a presentation of the other pair member. Subjects were given six trials to learn the syllable pairs. Somewhat greater learning occurred on the part of the subjects contacted by the experimenters believing subjects to be brighter although the difference was not large numerically and z < +1.28; subjects alleged to be brighter learned 11 per cent more syllables.
The curves of learning of the paired nonsense syllables, however, did show a difference between subjects alleged to be brighter and those alleged to be duller. Among ‘‘brighter’’ subjects, learning increased more monotonically over the course of the six trials than was the case for ‘‘duller’’ subjects. (The coefficient of determination between accurate recall and trial number was .50 for the ‘‘bright’’ subjects and .25 for the ‘‘dull’’ subjects; z > 2.00 but not used in assessing overall significance.) In the mathematical reasoning task, subjects had to learn to use three sizes of water jars in order to obtain exactly some specified amount of water. On the critical trials the correct solution could be obtained by a longer and more routine procedure which was scored for partial credit or by a shorter but more novel procedure which was given full credit. Those subjects whose experimenters expected superior performance earned higher scores than did those subjects whose experimenters expected inferior performance. Among the latter subjects, only 40 per cent ever achieved a novel solution, while among the allegedly superior subjects 88 per cent achieved one or more novel solutions. Subjects expected to be dull made 57 per cent again as many errors as did subjects expected to be bright. In this experiment, with two tasks performed by each subject, the overall z was based on subjects’ performance on both tasks. The final experiment to be mentioned in this section is of special importance because of the elimination of plausible alternatives to the hypothesis that it is the subject’s response that is affected by the experimenter’s expectancy. In his experiment, Johnson employed the Stevenson marble-dropping task. Each of the 20 experimenters was led to believe that marble-dropping rate was related to intelligence. More intelligent subjects were alleged to show a greater increase in rate of marble-dropping over the course of six trials. 
Each experimenter then contacted eight subjects, half of whom were alleged to be brighter than the remaining subjects.
Interpersonal Expectations: Effects of the Experimenter’s Hypothesis
The recording of the subject’s response was by means of an electric counter, and the counter was read by the investigator who was blind to the subject’s expectancy condition. As can be seen from Table 6-5, the results of this study, one of the best controlled in this area, were the most dramatic. Experimenters expecting a greater increase in marble-dropping rate obtained a greater increase than they did when expecting a lesser increase. In this study, too, there was an interaction effect between the expectation of the experimenter, the sex of the experimenter, and the sex of the subject. Same-sex dyads showed a greater effect of experimenter expectation (z = 1.80).
Considering the studies of human learning and ability as a set, it appears that the effects of experimenter expectancy may well operate as unintended determinants of subjects’ performance. The magnitudes of the effects obtained, however, are considerably smaller than those obtained with animal subjects. It would take only another 21 experiments with an average directional z of .00 to bring the overall p to the .05 level compared to the 239 experiments required in the area of animal learning. Another 18 principal investigators averaging zero z results would bring the combined p to .05 in the area of human learning and abilities. Because there are few experiments in this set employing exactly the same task, it is difficult to be sure of any pattern in the magnitudes of zs obtained in individual studies or by individual investigators. Perhaps it does appear, however, that standardized intelligence tests employed with adults are relatively not so susceptible to the effects of the experimenter’s expectancy. The Hurwitz and Jenkins results, however, weaken that conclusion somewhat. With only one study each for color recognition and dot-tapping perhaps any conclusion would be premature.
Psychophysical Judgments
Table 6-6 shows the results of nine studies employing tasks we may refer to loosely as requiring psychophysical judgments, and Table 6-7 shows the per investigator summary. Five of the six studies yielding directional zs < +1.28 employed a number estimation task (Adair; Müller and Timaeus; Shames and Adair, I, II; Weiss).

Table 6–6 Expectancy Effects in Studies of Psychophysical Judgments

Study                                         Standard normal deviate
Code number  Authors                          Nondirectional  Directional
1.           Adair, 1968 I                          .00           .00a
2.           Horst, 1966 I                         1.74         +1.94
3.           Müller and Timaeus, 1967 I            1.88           .00a
4.           Shames and Adair, 1967 I               .00           .00a
5.           Shames and Adair, 1967 II              .00           .00
6.           Weiss, 1967 I                         1.39           .00a
7.           Wessler, 1968b I                       .00           .00
8.           Zoble, 1968 I                         3.29         +3.70
9.           Zoble, 1968 II                        2.02         +2.02
             Sum                                  10.32         +7.66
             √9                                    3             3
             z                                     3.44         +2.55
             p                                      .0003         .006

a Indicates that experimenter expectancy interacted with another variable at z ≥ |1.28|.
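The summary rows of Table 6-6 follow from simple arithmetic: the standard normal deviates are summed and the sum is divided by the square root of the number of studies. The Python sketch below illustrates that arithmetic for the nine psychophysical studies. The function names `combined_z` and `one_tailed_p` are hypothetical, and the p values come from the exact upper tail of the normal curve, so they may differ slightly from the rounded entries printed in the table.

```python
import math


def combined_z(zs):
    """Combine standard normal deviates: sum of the zs over sqrt(number of zs)."""
    return sum(zs) / math.sqrt(len(zs))


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate (assumed one-tailed)."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# z values as listed in Table 6-6, in study order 1 through 9.
nondirectional = [0.00, 1.74, 1.88, 0.00, 0.00, 1.39, 0.00, 3.29, 2.02]
directional = [0.00, 1.94, 0.00, 0.00, 0.00, 0.00, 0.00, 3.70, 2.02]

z_nondirectional = combined_z(nondirectional)
z_directional = combined_z(directional)

print("nondirectional z:", round(z_nondirectional, 2))
print("directional z:", round(z_directional, 2))
print("nondirectional p:", one_tailed_p(z_nondirectional))
print("directional p:", one_tailed_p(z_directional))
```

Because a study entered with a z of .00 adds nothing to the sum but still enlarges the divisor, null results pull the combined z toward zero without reversing its sign.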
Book One – Artifact in Behavioral Research

Table 6–7 Expectancy Effects in Studies of Psychophysical Judgments by Principal Investigators

Principal investigator          Standard normal deviate
Code number  Name               Nondirectional  Directional
I            Adair                    .00           .00
II           Horst                   1.74         +1.94
III          Timaeus                 1.88           .00
IV           Weiss                   1.39           .00
V            Wessler                  .00           .00
VI           Zoble                   3.77         +4.06
             Sum                     8.78         +6.00
             √6                      2.45          2.45
             z                       3.58         +2.45
             p                        .0002         .007
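Table 6-7 lists six investigators for the nine studies of Table 6-6. The text does not spell out how the per investigator entries were obtained. One possibility, offered here only as an assumption, is that each investigator's studies are first pooled by the same sum-over-root-k rule and the pooled values are then combined across investigators. The grouping of studies under investigators in the sketch (for example, placing both Shames and Adair studies under Adair) is likewise inferred from the two tables, and all names in the code are hypothetical. The results come close to the tabled entries, differing only in the second decimal place.

```python
import math
from collections import defaultdict


def combined_z(zs):
    """Combine standard normal deviates: sum of the zs over sqrt(number of zs)."""
    return sum(zs) / math.sqrt(len(zs))


# (assumed principal investigator, nondirectional z, directional z), one row
# per study of Table 6-6. The investigator assignments are an inference.
studies = [
    ("Adair", 0.00, 0.00),
    ("Horst", 1.74, 1.94),
    ("Timaeus", 1.88, 0.00),   # Mueller and Timaeus
    ("Adair", 0.00, 0.00),     # Shames and Adair I
    ("Adair", 0.00, 0.00),     # Shames and Adair II
    ("Weiss", 1.39, 0.00),
    ("Wessler", 0.00, 0.00),
    ("Zoble", 3.29, 3.70),     # Zoble I
    ("Zoble", 2.02, 2.02),     # Zoble II
]

# Collect each investigator's zs, then pool them into one value per column.
by_investigator = defaultdict(lambda: ([], []))
for name, nondir, direc in studies:
    by_investigator[name][0].append(nondir)
    by_investigator[name][1].append(direc)

per_investigator = {
    name: (combined_z(nondirs), combined_z(direcs))
    for name, (nondirs, direcs) in by_investigator.items()
}

# Combine the pooled values across investigators.
overall_nondirectional = combined_z([nd for nd, _ in per_investigator.values()])
overall_directional = combined_z([d for _, d in per_investigator.values()])

for name, (nd, d) in per_investigator.items():
    print(f"{name:8s} {nd:5.2f} {d:+5.2f}")
print("overall ", round(overall_nondirectional, 2), round(overall_directional, 2))
```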
Adair, though he found no main effect of experimenter expectancy, did find that the magnitude of expectancy effect could be predicted from a knowledge of the sex of experimenter and sex of subject. Greater expectancy effects were found when experimenter and subject were of the opposite rather than the same sex (z = 2.33). Müller and Timaeus found that the effect of experimenter expectation was to decrease the variability of obtained responses relative to a control group, while Weiss found that relative to the control subjects, subjects whose experimenters had been given any expectation underestimated the number of dots presented. Shames and Adair (I) found that those experimenters who were judged by their subjects as more courteous, more pleasant, and more given to the use of head gestures showed a tendency (all zs > 1.96) to obtain data opposite to that which they had been led to expect.
The experiments by Horst and by Wessler both employed a line length estimation task. Data presented by Wessler suggest that the z associated with the effect of experimenter expectancy might well be > +1.28 but because it could not be determined exactly from the data available, and because no effect appeared in two other tasks administered to the same subjects, we count the z as .00. Horst, however, found line length estimation to be affected by the experimenter’s expectancy and more so by those experimenters rated by their subjects as more pleasant, bolder, and less awkward. In addition, Horst found (just as Weiss did) that, relative to the control subjects, subjects whose experimenters had been given any expectation showed a greater tendency to underestimate.
The largest effects of interpersonal expectancies were found in the studies of tone length discrimination by Zoble. In both his studies, which differed from each other in the mental sets induced in the subjects, he found that the experimenter’s expectancy was a significant determinant of subjects’ discriminations. 
In addition, while he found that either the visual or the auditory channel was probably sufficient to serve as mediator of expectancy effects, the data suggested that the visual channel was more effective than the auditory channel (z = +1.44).
On the whole, the area of psychophysical judgment, particularly when the judgment is of numerosity, seems less susceptible to the effects of experimenter expectancy than the other areas considered so far. The number of additional experiments with a mean directional z of .00 required to bring the overall p to the .05 level for
the area of psychophysical judgments is only a dozen. On a per investigator basis, only seven additional investigators with mean .00 findings would bring the cumulative p to .05.
Reaction Time
Table 6-8 shows the results of three studies by three different investigators in which the dependent variable was one form or another of reaction time. Employing visual stimuli, McFall found no effects of experimenter expectancy, but Wessler did. Wessler also found that the effects of experimenter expectancy were greater on earlier trials and that, over time, there was a monotonic decrease of expectancy effect (z = 1.65).
The remaining experiment, by Silverman, employed verbal rather than visual stimuli. Silverman employed 20 students of advanced psychology as experimenters to administer a word association test to 333 students of introductory psychology. Half the experimenters were led to expect that some of their subjects would show longer latencies to certain words than would their control group subjects. The remaining experimenters were given no expectations and served as an additional control condition. Results showed that latencies did not differ between the two baseline conditions but that when experimenters expected longer latencies from their subjects, they obtained longer latencies. For some of the experimenters, Silverman found a significant tendency to commit scoring errors in the direction of their expectations, but there was some evidence to suggest that such scoring errors could not very well account for the effects obtained. Silverman had found an interaction of experimenter expectation by sex of experimenter and sex of subject (z = 1.95). The nature of this interaction was the same as that found by Adair. When experimenters and subjects were of the opposite sex they showed greater expectancy effects than when they were of the same sex. 
Silverman’s plausible reasoning was that scoring errors on the part of the experimenters ought not to be differentially related to their subjects’ sex as a function of their own sex.
Because of the small number of experiments conducted in this area, it would take only one additional study (or principal investigator) with an associated z of zero to bring the combined p level to .05. Thus though two of the three studies obtained zs of +1.28 or greater, we cannot have the confidence in the expectancy hypothesis that seems warranted for other research.

Table 6–8 Expectancy Effects in Studies of Reaction Time

Study                                  Standard normal deviate
Code number  Authors                   Nondirectional  Directional
1.           McFall, 1965 I                  .00           .00
2.           Silverman, 1968 I              1.88         +1.88a
3.           Wessler, 1966, 1968a I         1.46         +1.46a
             Sum                            3.34         +3.34
             √3                             1.73          1.73
             z                              1.93         +1.93
             p <                             .03           .03

a Indicates that experimenter expectancy interacted with another variable at z ≥ |1.28|.
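The counts of additional null results quoted for each research area can be approximated from the table sums. The text does not state its formula, so the following Python sketch rests on an assumption: studies with a z of .00 leave the sum of directional zs unchanged while increasing the number of studies, and the combined z is compared with 1.645, the one-tailed .05 point of the normal curve. The function name `additional_null_results` is hypothetical.

```python
Z_CRIT = 1.645  # assumed criterion: one-tailed .05 point of the normal curve


def additional_null_results(sum_z, k, z_crit=Z_CRIT):
    """Solve sum_z / sqrt(k + x) = z_crit for x, the number of added z = .00 results."""
    return (sum_z / z_crit) ** 2 - k


# Directional sums and numbers of results taken from Tables 6-6, 6-7, and 6-8.
psychophysical_studies = additional_null_results(7.66, 9)
psychophysical_investigators = additional_null_results(6.00, 6)
reaction_time_studies = additional_null_results(3.34, 3)

print(round(psychophysical_studies, 1))        # text: "only a dozen"
print(round(psychophysical_investigators, 1))  # text: "only seven"
print(round(reaction_time_studies, 1))         # text: "only one"
```

Under these assumptions the computed values fall within one of the counts given in the text, which appear to have been rounded to whole studies.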
Table 6–9 Expectancy Effects in Studies of Inkblot Tests

Study                                     Standard normal deviate
Code number  Authors                      Nondirectional  Directional
1.           Marwit, 1968 I                    1.80          +1.80
2.           Marwit and Marcia, 1967 I         3.25          +3.25
3.           Masling, 1965, 1966 I             2.05          +2.05
4.           Strauss, 1968 I                   2.32            .00a

SUMMARY
By Study                     Sum               9.42          +7.10
                             √4                2              2
                             z                 4.71          +3.55
                             p                  .0000015       .0002
By Principal Investigator    Sum               7.95          +5.63
                             √3                1.73           1.73
                             z                 4.60          +3.25
                             p =                .0000025       .0006

a Indicates that experimenter expectancy interacted with another variable at z ≥ |1.28|.
Inkblot Tests
Table 6-9 summarizes by study and by principal investigator the results of research on expectancy effects employing as the dependent variable subjects’ perceptions of inkblot test materials. In the first of these studies, Masling employed 14 graduate student experimenters to administer the Rorschach to a total of 28 subjects. Half the experimenters were led to believe that it would reflect more favorably upon themselves if they obtained more human than animal responses. The remaining experimenters were given the opposite value, belief, or expectation. All experimenters were forcefully warned not to coach their two subjects and all administrations of the Rorschach were tape-recorded. Results showed that experimenters led to prize animal responses obtained one-third again as high an animal to human response ratio as did the experimenters led to prize human responses. Analysis of the tape recordings revealed no evidence favoring the hypothesis that differential verbal reinforcement of subjects’ responses might have accounted for the differences obtained. In addition, none of the subjects reported that their experimenter seemed to show any special interest in any particular type of response. The cues by which an experimenter unintentionally informs his subject of the desired response appear likely to be subtle ones.
In the experiment by Marwit and Marcia, 36 advanced undergraduate experimenters administered five of the Holtzman inkblots to a total of 53 students of elementary psychology. Some of the experimenters expected many responses from their subjects either on the basis of their own hypotheses or because that was what they had been led to expect. The remaining experimenters expected few responses from their subjects. The overall results showed that experimenters expecting more responses obtained more responses than did experimenters expecting fewer responses. 
Among the experimenters who had developed their own hypotheses, those who expected more responses obtained 59 per cent more responses than did the experimenters who expected fewer responses. Among the experimenters who were given ‘‘ready-made’’ expectancies, those who expected more obtained 61 per cent more responses than those who expected fewer responses.
In this experiment almost one-third of the experimenters admitted to being aware that their own expectancy effects were under investigation. Interestingly enough, this admitted awareness bore no relationship to magnitude of expectancy effect exerted. In addition, there was no overall relationship between the number of verbal inquiries made by experimenters and the number of responses obtained from their subjects (r = .07). However, an interesting reversal of what we might expect occurred when it was shown that the subgroup of experimenters who asked the most questions were the experimenters who had been led to expect few responses (z = 2.40). Finally, there was an interesting tendency, not found in earlier studies, for those inkblots shown later to manifest greater expectancy effects than those inkblots shown earlier. If some form of unintended reinforcement were employed by the experimenters, it is at least unlikely to have been anything so obvious as differential questioning.
In his experiment, Strauss employed five female experimenters each of whom was to administer the Rorschach to six female undergraduates. For two of the subjects each experimenter was led to expect an “introversive” experience balance; for another two of the subjects each experimenter was led to expect an “extratensive” experience balance; and for the remaining two subjects, experimenters were given no expectations. Subjects’ actual experience balance, measured by the difference between relative standings in human movement (M) and color (Sum C) production, was found overall to be unrelated to the experimenter’s expectancy. There was an interesting difference, however, in the variability of obtained responses across the three treatment groups (F max = 19.22, p < .05), with the control group of subjects for whom experimenters had been given no expectation showing least individual differences among experimenters. 
Relative to the control group subjects, subjects contacted by the different experimenters in both conditions of expectancy obtained experience balance scores that were both too high and too low. As a check on the success of the induction of the expectations, Strauss asked his experimenters to predict the experience balance that would be obtained from each subject. The analysis showed very clearly that, on the average, experimenters predicted experience balance scores very much in line with those they had been led to expect. In addition, however, an interaction effect was obtained (z > 2.58) which showed that one of the five experimenters predicted results opposite to those he had been led to expect, while another predicted an unusually great difference between his introversive and extratensive subjects, but in the right direction. These two extreme predictors, it turned out, both showed a tendency to obtain responses opposite to those they had been led to expect (mean expectancy effect in standard score units = -1.74). The remaining three experimenters all obtained more positive effects (mean expectancy effect in standard score units = +1.45). With so small a sample of experimenters (df = 3) such a comparison (z = 1.44) can be at best suggestive, but it may serve to alert other investigators to the interesting possibility that experimenters, when given a prophecy for a subject’s behavior, may be more likely to fulfill that prophecy if they believe neither too much nor too little that the prophecy will be fulfilled.
In the most recent of the inkblot experiments, Marwit employed 20 graduate students in clinical psychology as his experimenters and 40 undergraduate students of introductory psychology as his subjects. Half the experimenters were led to expect some of their subjects to give many Rorschach responses and especially a lot of
animal responses. Half the experimenters were led to expect some of their subjects to give few Rorschach responses but proportionately a lot of human responses. Results showed that subjects who were expected to give more responses gave more responses (z = +1.55) and that subjects who were expected to give a greater number of animal relative to human responses did so (z = +2.04). Marwit also found trends for the first few responses to have been already affected by the experimenter’s expectancy and for later-contacted subjects to show greater effects of experimenter expectancy than earlier-contacted subjects.
In summarizing the four inkblot experiments we can say that three investigations obtained substantial effects of experimenters’ expectancies while one investigation did not. Perhaps we can account for the differences in results in terms of differences in procedure. In the three experiments showing expectancy effects, the expectancy induced was for a more simple Rorschach response: animal or human content in one study, number of responses in another, and both in the third. In the study showing no significant expectancy effect, the expectancy induced was for a more complex response, one involving a relationship of two response categories to one another, human movement, and color. In the three experiments showing expectancy effects, each experimenter entertained the same hypothesis for each of his experimental group subjects. In the experiment not showing the expectancy effect, each experimenter contacted subjects under opposite conditions of expectation. There have been many experiments with human and animal subjects showing that expectancy effects may occur even when the different expectations are held in the mind of the same experimenters. There are, however, a number of studies showing that under these conditions, from 12 to 20 per cent of the experimenters show significant reversals of expectancy effect. 
The word significant is italicized to emphasize that we speak not of failures to obtain data in the predicted direction, but of obtaining data opposite to that expected with non-Gaussian gusto (Rosenthal, 1967c, 520). In the study by Strauss not showing expectancy effects on the obtained Rorschach experience balance, such extreme reversals were not obtained but the sample of experimenters was small (five). Finally, in the study not showing expectancy effects, the experimenters were more experienced than those of the studies that did show expectancy effects. There is, however, some other evidence to suggest that more experienced, more competent, and more professional experimenters may be the ones to show greater rather than smaller expectancy effects (Rosenthal, 1966). Pending the results of additional research, perhaps all we can now say is that some inkblot responses may, under some conditions, be fairly susceptible to the effects of the experimenter’s expectancy. For the set of experiments described here, the overall p level can be brought to the .05 level by the addition of 15 new results of an average directional z value of .00. For the set of principal investigators, the overall p level can be brought to the .05 level by the additional results of nine principal investigators obtaining an average directional z value of .00. Later we shall have occasion to discuss more systematically the results of studies of experimenter expectancy as a function of the laboratory in which they were conducted. For now it should only be mentioned that the one study of inkblot responses showing no overall directional effect of experimenter expectancy was the one conducted in the writer’s laboratory.
Table 6–10 Expectancy Effects in Studies of Structured Laboratory Interviews

Study                                      Standard normal deviate
Code number  Authors                       Nondirectional     Directional
1.           Cooper, E, R, D, 1967 I            3.37             +3.37
2.           Jenkins, 1966 I                    1.34             +1.34a
3.           Pflugrath, 1962 I                  1.75               .00
4.           Raffetto, 1967, 1968 I             5.24             +5.24a
5.           Rosenthal, P, V, F, 1963b I        1.48             +1.48
6.           Timaeus and Lück, 1968a I          1.55             +1.55
             Sum                               14.73            +12.98
             √6                                 2.45              2.45
             z                                  6.01             +5.30
             p                             1/500 million     1/10 million

a Indicates that experimenter expectancy interacted with another variable at z ≥ |1.28|.
Structured Laboratory Interviews
Table 6-10 shows the per-investigation and per-investigator results of the research in what must be the most miscellaneous of our research areas. In one of the earliest of these studies, Pflugrath investigated the effects of the experimenter’s expectancy on scores earned on a standardized paper-and-pencil test of anxiety. He employed nine graduate student counselors, each of whom was to administer the Taylor Manifest Anxiety Scale to two groups of students of introductory psychology. In each group there was an average of about eight subjects. Three of the experimenters were led to believe that their groups of subjects were very anxious, three were led to believe that their groups of subjects were not at all anxious, and three were given no expectations about their subjects’ anxiety level. Pflugrath found no difference among the three treatment conditions in his analysis of variance but his χ² was large. The bulk of the obtained χ² was due to the fact that, among experimenters led to expect high anxiety scores, more subjects actually scored lower in anxiety. This finding, while certainly not predicted, was at least interpretable in the light of the experimenters’ status as counselors-in-training. Told that they would be testing very anxious subjects who had required help at the counseling center, these experimenters may well have brought their developing therapeutic skills to bear upon the challenge of reducing these subjects’ anxiety. If the subject’s performance on even well-standardized paper-and-pencil tests, administered in a group situation, may be affected by the experimenter’s perception of the subject, then it is not unreasonable to suppose that such effects may occur with some frequency in the more intense and more personal relationship that characterizes the more typical clinical assessment situation. 
As a check on the success of his experimental induction of expectancies, Pflugrath asked his experimenters to predict the level of anxiety they would actually find in each of their groups of subjects. Although there was a tendency for examiners to predict the anxiety level that they had been led to expect, this tendency did not reach an associated z of +1.28. The experimentally manipulated expectations, then, were not very effectively induced. It is of interest to note, however, that all three of the experimenters who specifically predicted higher anxiety obtained higher anxiety
scores than did any of the experimenters who specifically predicted lower anxiety (z > +1.65). Because the results of the Pflugrath experiment showed some effects in the predicted direction and some effects in the opposite direction, the directional z is entered as .00 in Table 6-10. The nondirectional z, however, retains the information that some differences were associated with the effects of experimenter expectancy.
The experiment by Raffetto was addressed to the question of whether the experimenter’s expectation for greater reports of hallucinatory behavior might be a significant determinant of such reports. Raffetto employed 96 paid, female volunteer students from a variety of less advanced undergraduate courses to participate in an experiment on sensory restriction. Subjects were asked to spend one hour in a small room that was relatively free from light and sound. Eight more advanced students of psychology served as the experimenters, with each one interviewing 12 of the subjects before and after the sensory restriction experience. The preexperimental interview consisted of factual questions such as age, college major, and college grades. The postexperimental interview was relatively well-structured, including questions to be answered by “yes” or “no” as well as more open-ended questions (e.g., “Did you notice any particular sensations or feelings?”). Postexperimental interviews were tape-recorded. Half the experimenters were led to expect high reports of hallucinatory experiences, and half were led to expect low reports of hallucinatory experiences. Obtained scores of hallucinatory experiences ranged from zero to 32 with a grand mean of 5.4. Of the subjects contacted by experimenters expecting more hallucinatory experiences, 48 per cent were scored above the mean on these experiences. Of the subjects contacted by experimenters expecting fewer hallucinatory experiences, only 6 per cent were scored above the mean. 
Since in this experiment the experimenters scored their own interviews for degree of hallucinatory experience, it is possible that scoring errors accounted for part of the massive effects obtained. It seems unlikely, however, in the light of what we now know of such errors that effects as dramatic as these could have been due entirely to scoring errors even if such errors were very great. Fortunately this question can be answered in the future since Raffetto did tape record the interviews conducted so that they can be rated by “blind” observers. When Raffetto himself checked the experimenters’ scoring he found no significant scoring errors, but we must note, as did Raffetto, that he was not blind to the interviewers’ condition of expectancy. The work of Beez (1968), however, amply documents the fact that such dramatic effects of expectancy may occur even in the absence of scoring errors.
In the experiment conducted by Rosenthal et al., 18 graduate students served as experimenters in a study of verbal conditioning conducted with 65 undergraduate subjects. Half the experimenters were led to expect from their subjects high rates of awareness of having been conditioned, while the remaining experimenters were led to expect low rates of awareness of having been conditioned. Questionnaires assessing subjects’ degree of awareness were scored blindly by two psychologists. Of the subjects expected to show a low degree of awareness, 43 per cent were subsequently judged as “aware.” Of the subjects expected to show a high degree of awareness, 68 per cent were subsequently judged as “aware.”
In the experiment by Timaeus and Lück of the University of Cologne, subjects were asked to estimate the level of aggression to be found in a Milgram type
experiment. When experimenters had been led to expect high levels of aggression, they obtained higher levels of aggression than when experimenters had been led to expect lower levels of aggression.
Jenkins found in her experiment that factual information about a stimulus person could be communicated unintentionally from experimenter to subject. Subjects contacted by experimenters believing one set of factual statements to be true of a stimulus person more often also believed those factual statements to be true of the stimulus person than did subjects contacted by experimenters believing the opposite factual statements to be true of the stimulus person.
The experiment by Cooper et al. is one we shall consider later in more detail. Their research showed that the degree of certainty of having to take a test as a function of the degree of preparatory effort was successfully predicted from a knowledge of the experimenters’ expectations. Considering the results of these experiments (or principal investigators) as a set, it would require 56 additional studies (or investigators) finding a mean directional z of .00 to bring the overall p level to .05.
Person Perception
Table 6-11 shows the results of 57 studies of expectancy effect in which a standardized task of person perception was employed. Table 6-12 shows the analogous results based not on studies but on principal investigators. The basic paradigm of these investigations has been sufficiently uniform that we need only an illustration (Rosenthal and Fode, 1963b I). Ten advanced undergraduate and graduate students of psychology served as the experimenters. All were enrolled in an advanced course in experimental psychology and were already involved in conducting research. Each student-experimenter was assigned as his subjects a group of about 20 students of introductory psychology. 
The experimental procedure was for the experimenter to show a series of ten photographs of people’s faces to each of his subjects individually. The subject was to rate the degree of success or failure shown in the face of each person pictured in the photos. Each face could be rated as any value from -10 to +10, with -10 meaning extreme failure and +10 meaning extreme success. The 10 photos had been selected so that, on the average, they were rated as neither successful nor unsuccessful, but rather as neutral with an average numerical score of zero. All ten experimenters were given identical instructions on how to administer the task to their subjects and were given identical instructions to read to their subjects. They were cautioned not to deviate from these instructions. The purpose of their participation, it was explained to all experimenters, was to see how well they could duplicate experimental results which were already well-established. Half the experimenters were told that the “well-established” finding was such that their subjects should rate the photos as being of successful people (ratings of +5) and half the experimenters were told that their subjects should rate the photos as being of unsuccessful people (ratings of -5). Results showed that experimenters expecting higher photo ratings obtained higher photo ratings than did experimenters expecting lower photo ratings.
Although all of the other experiments shown in Table 6-11 were also intended as replications of the basic finding, most of the work summarized was designed particularly to learn something of the conditions which increase, decrease, or otherwise modify the effects of experimenter expectancy. That intent has
Table 6–11 Expectancy Effects in Studies of Person Perception

Study                                                 Standard normal deviate
Code number  Authors                                  Nondirectional   Directional
1.           Adair and Epstein, 1967 I                     1.65           +1.65
2.           Adair and Epstein, 1967 II                    1.64           +1.64
3.           Adler, 1968 I                                 4.42           +4.42
4.           Adler, 1968 II                                2.33           -2.33
5.           Adler, 1968 III                               1.50           -1.50
6.           Barber, C, F, M, C, B, 1967 I                  .00             .00a
7.           Barber, C, F, M, C, B, 1967 II                 .00             .00
8.           Barber, C, F, M, C, B, 1967 III                .00             .00
9.           Barber, C, F, M, C, B, 1967 IV                 .00             .00
10.          Barber, C, F, M, C, B, 1967 V                 1.58             .00
11.          Bootzin, 1968 I                               2.14           +2.14b
12.          Bootzin, 1968 II                              1.64           -1.64b
13.          Bootzin, 1968 III                             1.44           +1.44b
14.          Carlson and Hergenhahn, 1968 I                 .00             .00b
15.          Carlson and Hergenhahn, 1968 II                .00             .00b
16.          Connors, 1968 I                                .00             .00b
17.          Connors and Horst, 1966 I                      .00             .00b
18.          Fode, 1967 I                                  2.81           +2.81b
19.          Horn, 1968 I                                  2.01           +2.01
20.          Jenkins, 1966 I                               1.61           +1.61b
21.          Laszlo and Rosenthal, 1967 I                  1.80           +1.80b
22.          Marcia, 1961 I                                 .00             .00b
23.          Marcia, 1961 II                                .00             .00b
24.          McFall, 1965 I                                 .00             .00b
25.          Moffatt, 1966 I                                .00             .00
26.          Nichols, 1967 I                                .00             .00b
27.          Persinger, 1962 I                              .00             .00b
28.          Persinger, K, R, 1966 I                       1.88           +1.88b
29.          Persinger, K, R, 1968 I                       1.64           +1.64b
30.          Rosenthal and Fode, 1963b I                   2.46           +2.46
31.          Rosenthal and Fode, 1963b II                  3.94           +3.44
32.          Rosenthal and Fode, 1963b III                 1.64           +1.64b
33.          Rosenthal, F, J, F, S, W, V, 1964 I           1.52           -1.52b
34.          Rosenthal, K, G, C, 1965 I                    1.69           +1.69b
35.          Rosenthal, K, G, C, 1965 II                    .00             .00b
36.          Rosenthal and Persinger, 1968 I               1.29           +1.29b
37.          Rosenthal and Persinger, 1968 II               .00             .00b
38.          Rosenthal, P, M, V, G, 1964a I                1.44           +1.44b
39.          Rosenthal, P, M, V, G, 1964a II                .00             .00b
40.          Rosenthal, P, M, V, G, 1964b I                2.33             .00b
41.          Rosenthal, P, M, V, G, 1964b II               2.58             .00
42.          Rosenthal, P, V, F, 1963a I                   2.17           +2.33b
43.          Rosenthal, P, V, M, 1963 I                     .00             .00
44.          Rosenthal, P, V, M, 1963 II                   1.96           +1.96b
45.          Shames and Adair, 1967 I                      1.70           +1.70
46.          Smiltens, 1966 I                              1.28           -1.28b
47.          Trattner, 1966, 1968 I                         .00             .00b
48.          Uno, F, R, 1968 I                             1.99           -1.99b
49.          Uno, F, R, 1968 II                             .00             .00b
50.          Uno, F, R, 1968 III                            .00             .00
51.          Uno, F, R, 1968 IV                            2.17           -2.17b
52.          Uno, F, R, 1968 V                              .00             .00
53.          Weick, 1966 I                                 2.33           +2.33b
54.          Wessler, 1968b I                               .00             .00
55.          Wessler and Strauss, 1968 I                   1.65             .00
56.          White, 1962 I                                 2.81           -1.51b
57.          Woolsey and Rosenthal, 1966 I                 1.34           +1.34b
             Sum                                          68.38          +30.72
             √57                                           7.55            7.55
             z                                             9.06           +4.07
             p                                        1/(million)²         .000025

a See also Rosenthal, 1967d, 1968b.
b Indicates that experimenter expectancy interacted with another variable at z ≥ |1.28|.
characterized the work of 18 of the 20 principal investigators listed in Table 6-12. It was the role of auditory cues, for example, that engaged the interest of Adair and Epstein. They first conducted a study which was essentially a replication of the basic experiment on the self-fulfilling effects of experimenters’ hypotheses. Results showed that, just as in the original studies, experimenters who expected the perception of success from their subjects fulfilled their expectations as did the experimenters who had prophesied the perception of failure by their subjects.

During the conduct of this replication experiment, Adair and Epstein tape-recorded the experimenters’ instructions to their subjects. The second experiment was then conducted not by “live” experimenters, but by tape-recordings of experimenters’ voices reading standard instructions to their subjects. When the tape-recorded instructions had originally been read by experimenters expecting success perception by their subjects, the tape-recordings evoked greater success perceptions from their subjects. When the tape-recorded instructions had originally been read by experimenters expecting failure perception by their subjects, the tape-recordings evoked greater failure perceptions from their subjects. Self-fulfilling prophecies, it seems, can come about as a result of a prophet’s voice alone. Since, in the experiment described, all experimenters read standard instructions, self-fulfillment of prophecies may be brought about by the tone in which the prophet prophesies.

Table 6–12 Expectancy Effects in Studies of Person Perception by Principal Investigators

Principal investigator         Standard normal deviate
Code number   Name             Nondirectional   Directional
I             Adair            2.88             +2.88
II            Adler            4.77              .00
III           Barber           1.40              .00
IV            Bootzin          3.02              .00
V             Carlson           .00              .00
VI            Connors           .00              .00
VII           Fode             2.81             +2.81
VIII          Horn             2.01             +2.01
IX            Jenkins          1.61             +1.61
X             Marcia            .00              .00
XI            McFall            .00              .00
XII           Moffat            .00              .00
XIII          Nichols           .00              .00
XIV           Persinger         .00              .00
XV            Rosenthal        6.91             +3.52
XVI           Smiltens         1.28             −1.28
XVII          Trattner          .00              .00
XVIII         Weick            2.33             +2.33
XIX           Wessler           .00              .00
XX            White            2.81             −1.51

              Sum              31.83            +12.37
              √20              4.47             4.47
              z                7.12             +2.77
              p                1/[(4 million)(10⁵)]   .003

Adler, in her recent research, investigated the effects on experimenter expectancy of several experimenter sets or orientations toward outcomes. When experimenters were made to feel that it was important to obtain certain results, experimenters obtained the expected results. When experimenters were made to feel that it was very important to follow certain scientific procedures, they obtained results significantly opposite to those that they had been led to expect. In the control condition, in which no special orientations toward outcome were generated, experimenters also showed the reversal tendency. For the particular sample of
experimenters and subjects employed, it seems possible that a general process-consciousness was operating that contributed to the reversal effect among the experimenters of the control group.

Many of the experiments listed in Table 6-11 with an associated directional z < +1.28 showed one or more interaction effects of experimenter expectancy and some other variable. These interactions and those found between experimenter expectancy and other variables in the earlier described research areas will not be described here but will be drawn upon in a later discussion of factors complicating the effects of experimenter expectancy. For many of the experiments listed with z < +1.28, there is no ready explanation for the low z, but sometimes the design of the experiment was intentionally such as to minimize the effects of experimenter expectancy. Thus Carlson and Hergenhahn (II) interposed a screen between experimenters and subjects and used a tape recorder to administer instructions to subjects (I, II), both of these procedures having been suggested as techniques for the reduction of expectancy effects (Rosenthal, 1966). Similarly, Moffat’s experimenters were made to remain mute.

It has been suggested that higher status experimenters may show greater expectancy effects (Rosenthal, 1966). In only nine of the studies listed in Table 6-11 was no attempt made to have the experimenters exceed their undergraduate subjects in class standing, age, or training in psychology (Bootzin, I, II; Carlson and Hergenhahn, I, II; Barber et al., I, II, III, IV, V). Only one of these nine studies, or 11 per cent,
showed a directional z of +1.28 or greater, about what we might expect by chance. Of the remaining 45 studies employing college samples (Persinger, Knutson, and Rosenthal, 1966, 1968, and Trattner, 1966, 1968, employed neuropsychiatric patients), 19, or 42 per cent, showed an associated z of +1.28 or greater. Apparently, for college samples, when the experimenter’s status exceeds that of his subject in the person perception task, the chances are almost quadrupled that expectancy effects will be obtained compared to the situation in which experimenter and subject are of the same status. This latter situation, of course, is relatively rare not only in our sample of studies but also in the real world of laboratory experiments.

On the whole, the person perception task seems less susceptible to the effects of experimenter expectancy than most of the other areas investigated, though the large number of studies conducted makes the overall combined p a fairly stable one. It would take the addition of 278 studies (or 36 principal investigators) with a mean directional z of .00 to bring the overall p to the .05 level. Compared to all other research areas combined, however, the person perception task shows fewer directional z results of +1.28 or greater (χ² = 6.51, p ≅ .01).
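The summary rows of Tables 6-11 and 6-12 (Sum, square root of the number of entries, z) suggest that each overall z is the sum of the individual zs divided by the square root of their number. The following Python sketch illustrates that arithmetic under this assumption. The function names are ours, chosen for illustration, and the one-tailed p is simply read from the standard normal distribution.

```python
from math import sqrt
from statistics import NormalDist


def combined_z(sum_of_z: float, n_entries: int) -> float:
    """Combine independent zs: their sum divided by the square root of their number."""
    return sum_of_z / sqrt(n_entries)


def one_tailed_p(z: float) -> float:
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


# Sums taken from the summary rows of Table 6-11 (57 studies).
directional = combined_z(30.72, 57)
nondirectional = combined_z(68.38, 57)
# Directional sum from Table 6-12 (20 principal investigators).
by_investigator = combined_z(12.37, 20)

print(f"studies, directional z = {directional:.2f}, p = {one_tailed_p(directional):.6f}")
print(f"studies, nondirectional z = {nondirectional:.2f}")
print(f"investigators, directional z = {by_investigator:.2f}")
```

Under this reading the tabled values of +4.07, 9.06, and +2.77 are recovered to two decimals, which lends some support to the assumption about how the rows were computed.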
An Overview of Expectancy Effects

Now that we have considered the results of studies of expectancy effects for seven areas of research, it will be convenient to have a summary. Table 6-13 presents such a summary based on experiments, and Table 6-14 presents a summary based on principal investigators. For each research area the combined zs, both nondirectional and directional, are given as well as the per cent of the studies (or investigators) that reached the specified value of z. The next to last row of Tables 6-13 and 6-14 gives the grand overall zs based on all studies and all investigators. The experiment by Wessler (1968) was represented in each of three research areas and that by Jenkins in each of two research areas. For each of these studies the mean z was used as the entry in the next to last row of Tables 6-13 and 6-14 in order to have each entry based on independent samples.

Table 6–13 Expectancy Effects in Seven Research Areas

                                        Nondirectional z         Directional z
Research area                Studies    z        % ≥ |1.28|      z         % ≥ +1.28
Animal Learning              9          8.49     100%            +8.64     100%
Learning and Ability         9 a        3.01     44%             +3.01     44%
Psychophysical Judgments     9 a        3.44     56%             +2.55     33%
Reaction Time                3          1.93     67%             +1.93     67%
Inkblot Tests                4          4.71     100%            +3.55     75%
Laboratory Interviews        6 b        6.01     100%            +5.30     83%
Person Perception            57 a,b     9.06     60%             +4.07     39%
All Studies                  94 c       14.35    67%             +9.82     50%
Binomial test z (N = 94)                11.39                    +12.92

a Indicates a single experiment represented in each of three areas.
b Indicates a different experiment represented in each of two areas.
c Three entries were nonindependent and the mean z across areas was used for the independent entry.

Table 6–14 Expectancy Effects in Seven Research Areas by Principal Investigators

                                             Nondirectional z         Directional z
Research area                Investigators   z        % ≥ |1.28|      z         % ≥ +1.28
Animal Learning              5               8.25     100%            +8.46     100%
Learning and Ability         8 a             2.96     50%             +2.96     50%
Psychophysical Judgments     6 a             3.58     67%             +2.45     33%
Reaction Time                3               1.93     67%             +1.93     67%
Inkblot Tests                3               4.60     100%            +3.25     67%
Laboratory Interviews        6 b             6.01     100%            +5.30     83%
Person Perception            20 a,b          7.12     55%             +2.77     30%
All Investigators            48 c            13.28    71%             +9.55     52%
Binomial test z (N = 48)                     8.81                     +9.71

a Indicates a single investigator represented in each of three areas by the same subject sample.
b Indicates another investigator represented in two areas by the same subject sample.
c Three entries were non-independent and the mean z across areas was used for the independent entry.

Based either on these overall zs of all studies (or all investigators), or on the results of the binomial tests shown in the last row of Tables 6-13 and 6-14, the overall
p associated with expectancy effects is infinitesimally small. In both tables it can be seen that though the combined nondirectional zs are larger than the combined directional zs in the next to last row, the directional zs are larger when based on the binomial test. The reason, of course, is that we expect twice as many of the nondirectional zs to reach a given magnitude and the binomial test knows that fact.

In comparing the likelihoods of expectancy effects in the various research areas, the use of the zs may be somewhat misleading. A large number of investigations with only a moderately large number of zs reaching a specified level will make for a very large z, while a small number of investigations with a relatively large number of zs reaching a specified level will make for a smaller z. For this reason, the percentage of zs reaching a specified level may be a better basis on which to compare the likelihood of expectancy effects in the various research areas. By chance, we expect 10 per cent of the directional zs to reach or exceed +1.28, but half of all directional zs reach that value. Effects of the experimenter’s expectancy are found most often in studies of animal learning, laboratory interviews, inkblot tests, and reaction time. They are found least often in studies of psychophysical judgments and person perception, and about half the time in studies of human learning and ability.

There is one sense in which some of the entries of Table 6-14 are not independent. Some of the principal investigators conducted experiments in more than one area of research. In addition, we have so far considered as principal investigators anyone reporting an experiment regardless of the laboratory of origin. For these reasons it was felt to be instructive to summarize the results of all experiments conducted in different laboratories with each laboratory given equal weight with every other. Table 6-15 lists 29 laboratories and the principal investigator associated with each.
Again the overall probabilities are very low, and the median laboratory had about two-thirds of its experimental results reach a directional z value of +1.28 compared to the 10 per cent we would expect by chance. While we expect one of the 29 laboratories to show a directional z of +1.82 or greater by chance, Table 6-15 shows that 15 of the laboratories obtained zs of that value or greater. Though with so many laboratories we would expect one directional z of −1.82 to occur by chance, it is of interest to note that the one negative z of that size was obtained in the laboratory of a different culture, that of Japan.
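The binomial test rows and the chance expectations quoted above can be approximated with a few lines of Python. The sketch below assumes that the binomial test z is the usual normal approximation, (observed − expected) / standard deviation, which is our reading and is not stated in the text. The count of 47 studies is inferred from the reported 50 per cent of 94, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def binomial_z(hits: int, n: int, p_chance: float) -> float:
    """Normal approximation to the binomial: (observed - expected) / sd."""
    expected = n * p_chance
    sd = sqrt(n * p_chance * (1.0 - p_chance))
    return (hits - expected) / sd


def expected_by_chance(n: int, z_cut: float) -> float:
    """Number of n independent results expected to reach z_cut or more by chance."""
    return n * (1.0 - NormalDist().cdf(z_cut))


# 50 per cent of 94 studies (47, inferred) reached a directional z of +1.28,
# where 10 per cent would be expected by chance.
all_studies = binomial_z(47, 94, 0.10)
# Number of the 29 laboratories expected to show z >= +1.82 by chance alone.
chance_labs = expected_by_chance(29, 1.82)

print(f"binomial test z = {all_studies:.2f}")
print(f"laboratories expected by chance = {chance_labs:.2f}")
```

The first value comes out close to the +12.92 entered in Table 6-13, and the second is close to one laboratory, in line with the statement in the text. Small differences are to be expected because the counts were reconstructed from rounded percentages.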
Table 6–15 Expectancy Effects Obtained in Different Laboratories

                                                          Nondirectional z      Directional z
No.  Investigator       Location               Studies    z       % ≥ |1.28|    z          % ≥ +1.28
 1.  Adair              Manitoba               6          2.04    50%           +2.04 b    50%
 2.  Adler              Wellesley              3          4.77    100%           .00       33%
 3.  Barber             Medfield a             5          1.40    20%            .00 b     0%
 4.  Bootzin            Purdue                 3          3.02    100%           .00 b     67%
 5.  Burnham            Earlham                1          1.50    100%          +1.95      100%
 6.  Carlson            Hamline                2           .00    0%             .00 b     0%
 7.  Cooper             CC, CUNY               1          3.37    100%          +3.37      100%
 8.  Getter             Connecticut            1           .00    0%             .00       0%
 9.  Harrington         Iowa State             2          2.54    100%          +2.54      100%
10.  Hartry             Occidental             2          6.15    100%          +6.15      100%
11.  Horn               Geo. Washington        1          2.01    100%          +2.01      100%
12.  Ison               Rochester              2          5.11    100%          +5.11      100%
13.  Johnson            New Brunswick          1          3.89    100%          +3.89 b    100%
14.  Kennedy            Tennessee              2          1.61    50%           +1.61 b    50%
15.  Larrabee           South Dakota           1          1.60    100%          +1.60      100%
16.  Marcia             SUNY, Buffalo          2          3.58    100%          +3.58      100%
17.  Masling            SUNY, Buffalo          2          2.44    100%          +1.45 b    50%
18.  McFall             Ohio State             2           .00    0%             .00 b     0%
19.  Moffat             British Columbia       1           .00    0%             .00       0%
20.  Persinger          Fergus Falls a         2          2.50    100%          +2.50 b    100%
21.  Raffetto           San Francisco State    1          5.24    100%          +5.24 b    100%
22.  Rosenthal          Harvard                35         8.04    69%           +4.83 b    49%
23.  Silverman          SUNY, Buffalo          1          1.88    100%          +1.88 b    100%
24.  Timaeus            Cologne                3          1.98    67%            .00 b     33%
25.  Uno                Keio (Tokyo)           5          1.86    40%           −1.86 b    0%
26.  Wartenberg-Ekren   Marquette              1           .00    0%             .00       0%
27.  Weick              Purdue                 1          2.33    100%          +2.33      100%
28.  Wessler            St. Louis              3          1.80    67%            .00 b     33%
29.  Zoble              Franklin and Marshall  2          3.77    100%          +4.06      100%

     Sum                                                  74.43                 +54.28
     Sum/√29                                              13.81                 +10.07
     Means                                                2.57    71%           +1.87      61%
     Medians                                              2.04    100%          +1.88      67%
     Binomial test z (N = 29)                             8.47                  9.32

a State Hospitals.
b Indicates that experimenter expectancy interacted with other variables at |z| ≥ 1.28.
Because so much of the business of the behavioral sciences is transacted at certain specified p levels, the percentage of experiments and of laboratories reaching each of a set of standard p levels is shown in Table 6-16.² In addition, the last row shows the number of future replicates obtaining a directional mean z of exactly .00 required to bring the overall p to the .05 level.

² Since the preparation of this chapter another nine experiments, by four principal investigators, became available (Becker, 1968; Minor, 1967, 1967a; Peel, 1967; Zegers, 1968). The combined p for the nine experiments was .03 (z = +5.59); for the four investigators, p < .04 (z = +3.58).
Table 6–16 Percentage of Experiments and Laboratories Obtaining Results at Specified p Levels

p                                           Experiments (N = 94)   Laboratories (N = 29)
.10                                         50%                    62%
.05                                         35%                    52%
.01                                         17%                    38%
.001                                        12%                    28%
.0001                                       5%                     21%
.00001                                      3%                     14%
.000001                                     2%                     14%
Grand Sum z                                 95.27                  54.28
Tolerance for Future Negative Results a     3,260                  1,060

a Replicates required to bring overall p to .05, assuming all replicates to yield a mean z of .00 exactly.
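The tolerance figures in the last row of Table 6-16 appear to follow from asking how many added replicates with a mean z of zero would lower the combined z (the sum of zs over the square root of their number) to the one-tailed .05 criterion. The Python sketch below works through that algebra. The rounded criterion of 1.645 and the function name are our assumptions, not something the text specifies.

```python
Z_05 = 1.645  # one-tailed critical z for p = .05 (rounded; an assumption)


def tolerance_for_null_results(sum_of_z: float, n: int, z_crit: float = Z_05) -> float:
    """Number k of added results with mean z = 0 that brings the combined z to z_crit.

    Solves sum_of_z / sqrt(n + k) = z_crit for k.
    """
    return (sum_of_z / z_crit) ** 2 - n


# Grand sums of z and numbers of entries taken from Table 6-16.
experiments = tolerance_for_null_results(95.27, 94)
laboratories = tolerance_for_null_results(54.28, 29)

print(f"experiments: {experiments:.0f}")
print(f"laboratories: {laboratories:.0f}")
```

With these inputs the results fall within one replicate of the tabled 3,260 and 1,060, so the tabled values are consistent with this reading of the computation.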
Earlier, the possibility was raised that certain judgments and computations made by the present writer might be in error so that a correction factor for these errors would be desirable. In addition, there is the possibility that studies showing no effects of experimenter expectancy might be less likely to be reported or called to the attention of the writer. This latter possibility cannot be ruled out in any way, though, at the time of this writing, interest in publication of “negative findings” seems as great as interest in publication of “positive findings” of expectancy effects.

As a fairly stringent correction for the possibility of the writer’s errors and for the possibility of a biased availability of studies, we assume that the total number of experiments and of laboratories is ten times greater than that reported here. The factor of ten was selected on the basis of the widespread, intentionally exaggerated, and perhaps cynical fear among behavioral scientists that any given critical value of p gives the proportion of experiments conducted that come to public knowledge (Rosenthal, 1966). Since the directional z defined as worth listing in this review was that associated with a p of .10, the factor of ten was selected. If we assume that instead of 94 experiments conducted as tests of the hypothesis of expectancy effects there were actually 940 conducted, what becomes of the overall combined z? It goes to +3.10, p < .001, assuming that the additional 846 experiments found a mean directional z of zero exactly. Similarly, if we assume that instead of 29 investigating laboratories there were 290, the overall combined z for laboratories goes to +3.18, p < .0008, assuming that the additional 261 laboratories found a mean directional z of zero exactly.

Additional protection against any errors leading us to entertain with insufficient basis the possibility of expectancy effects comes from considering any |z| < 1.28 to be a z of zero.
As Table 6-16 shows, the distribution of zs is highly skewed such that too many are very much greater than zero. It seems most likely, therefore, though the check would have been an onerous task, that the bulk of the zs considered as equal to .00 was actually also skewed such as to give too many zs of positive value.

Principal Investigators

With so many experiments in a series it becomes possible to examine the relationship between outcome and various characteristics of the principal investigator. Of the
94 experiments, 18 were conducted primarily by female principal investigators. Of these studies 44 per cent yielded directional zs of +1.28 or greater compared to the 51 per cent of studies conducted by male principal investigators, a difference which is quite trivial (χ² = .07). In 37 of the 94 experiments, the principal investigator was a student, and in 43 per cent of these studies the directional z reached or exceeded +1.28. In those studies in which the senior investigator was not a student (e.g., a faculty member), 54 per cent of the results reached or exceeded that value of z. The difference, however, was very small in the sense of χ² (.71).

For reasons described in greater detail elsewhere (Rosenthal, 1966), it was felt to be desirable to compare the outcomes of experiments conducted in the present writer’s laboratory with those conducted elsewhere. Of the 35 experiments conducted in the writer’s laboratory, 49 per cent showed a directional z of +1.28 or greater. Of the 59 experiments conducted elsewhere, 51 per cent showed zs of that magnitude. The difference between these percentages was trivial (χ² = .00). Of the 35 experiments conducted in the writer’s laboratory, 15 were conducted by students, and of these 15 investigations, only four, or 27 per cent, showed a directional z of +1.28 or greater. That proportion was just half the proportion of 54 per cent found in the remaining 79 studies (χ² = 2.86, p < .10). Since the sample sizes of the experiments conducted by the writer’s students were not systematically lower than those of the remaining investigations, the differences in outcomes were probably not due to differences in statistical power, though such differences might account for the failure of individual studies to reach a directional z of +1.28. There seems to be no ready explanation for the differences in outcome, but one hypothesis, suggested by the work of Adler (1968), may be considered.
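The χ² values quoted for these comparisons between groups of studies can be checked approximately in Python. The sketch below assumes a 2 × 2 chi-square with Yates’ continuity correction, which is our inference from the reported values and not a method named in the text. Apart from the 4 of 15 given above, the cell counts are reconstructed from rounded percentages, and the function name is hypothetical.

```python
def yates_chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square (1 df) with Yates' correction for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # The correction reduces |ad - bc| by n/2, but never below zero.
    corrected = max(abs(a * d - b * c) - n / 2.0, 0.0)
    return n * corrected ** 2 / (row1 * row2 * col1 * col2)


# Columns: directional z of +1.28 or more, directional z below +1.28.
# Students in the writer's laboratory (4 of 15) vs. the remaining studies (43 of 79, inferred).
students_vs_rest = yates_chi_square(4, 11, 43, 36)
# Female (8 of 18, inferred) vs. male (39 of 76, inferred) principal investigators.
female_vs_male = yates_chi_square(8, 10, 39, 37)

print(f"students vs. remaining studies: {students_vs_rest:.2f}")
print(f"female vs. male investigators: {female_vs_male:.2f}")
```

Under these assumptions the two results round to the 2.86 and .07 reported above.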
Adler’s results suggested that experimenters made particularly sensitive to the importance of “following scientific procedures” tended to obtain data that not only did not confirm their expectations but actually tended significantly to disconfirm their expectations. Perhaps a similar phenomenon may occur among principal investigators. Emphasis on the investigator’s remaining blind to experimenters’ treatment conditions may generate a sensitivity to procedures among student investigators that tends to reverse the directionality of expectancy effects. Consistent with such a hypothesis would be the finding that among these students, a greater proportion obtain zs of −1.28 or less. Although the total number of such negative zs is too small to permit strong inference, it is of interest to note that 13 per cent of the students’ experiments found zs that low compared to 8 per cent of the remaining investigations.

Magnitude of Expectancy Effects

So far we have discussed the results of studies of expectancy effect only in terms of the zs obtained. By itself such information does not tell us how large the effects of expectancy tend to be. Given a very large sample size, even effects of trivial magnitude can reach any specified level of z. We want, therefore, to have some estimates of the magnitude of expectancy effects quite apart from the question of the “reality” of the phenomenon. One such estimate can be obtained by computing the proportion of experimenters whose obtained responses have been brought into line with their expectations. For this computation we need the mean of the responses obtained by each experimenter
in each of two different conditions of expectation. For those experiments in which each experimenter was given one expectation for some of his subjects and a different expectation for other subjects, the mean difference between responses of the two groups of subjects is all that is needed. If an experimenter obtained more of the expected responses from the subjects of whom he expected them than from the other subjects, that experimenter is counted as showing expectancy effects.

For those experiments in which experimenters were given the same expectancy for all their subjects, a preliminary computation was required. The grand mean response obtained was computed separately for all experimenters given one expectation and again for all experimenters given the opposite expectation. An experimenter in the condition of expecting more X type responses was counted as showing expectancy effect if his mean obtained responses showed more X than did the grand mean of the experimenters in the condition led to expect fewer X responses. An experimenter in the condition of expecting fewer X type responses was counted as showing expectancy effect if his mean obtained responses showed fewer X than did the grand mean of the experimenters in the condition led to expect more X responses. The analogous procedure was also employed for estimating the proportion of subjects whose responses were in the direction of their experimenter’s expectancy.

Table 6-17 shows the results of the analyses performed. There were 27 studies for which the counts for subjects could be made with moderate effort. The selection was based not on random sampling but rather on the availability of the data required. Data were also available from two other studies but because both were associated with such unusually large directional z values, they were not included in the analysis.
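The two counting rules described above can be sketched in Python. The data are hypothetical and serve only to illustrate the procedure. We also assume that each grand mean is the mean of the experimenters’ own means, which the text does not spell out, and the function names are ours.

```python
from statistics import mean


def proportion_biased_within(pairs: list[tuple[float, float]]) -> float:
    """Each pair holds one experimenter's mean response from subjects of whom
    more X was expected and from subjects of whom less X was expected."""
    hits = sum(1 for expected_more, expected_less in pairs if expected_more > expected_less)
    return hits / len(pairs)


def proportion_biased_between(expect_more: list[float], expect_fewer: list[float]) -> float:
    """Each list holds one mean response per experimenter. An experimenter is
    counted when his mean lies on the expected side of the grand mean of the
    experimenters given the opposite expectation."""
    grand_more = mean(expect_more)
    grand_fewer = mean(expect_fewer)
    hits = sum(1 for m in expect_more if m > grand_fewer)
    hits += sum(1 for m in expect_fewer if m < grand_more)
    return hits / (len(expect_more) + len(expect_fewer))


# Hypothetical mean responses, for illustration only.
within = proportion_biased_within([(5.1, 4.2), (3.9, 4.4), (6.0, 5.5)])
between = proportion_biased_between([5.0, 4.1, 6.2], [3.8, 4.6, 5.9])

print(f"within-experimenter design: {within:.2f}")
print(f"between-experimenter design: {between:.2f}")
```

Ties are counted here as showing no expectancy effect, a choice the text leaves open.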
The mean directional z of the subsample of studies employed was identical to that of all the experiments we have considered. Approximately 60 per cent of subjects gave responses consistent with the expectation of their experimenter. For the analysis based on experimenters, more of the studies provided the necessary information so that we have the data based on 57 experiments. The median directional z of these experiments, however, was less than +1.28 so that the sample is biased slightly in the direction of overrepresenting studies scored as .00 z values. Approximately 70 per cent of experimenters obtained data in the direction of their hypotheses.

Table 6–17 Proportions of Subjects and Experimenters Showing Expectancy Effects

                                       Subjects   Experimenters
Number of Studies                      27         57
Median z of studies                    +1.28      .00
Number of Ss or Es (N)                 1370       523
Mean N per Study                       51         9
Weighted Percent of Biased Ss or Es    59%        69%
Median Percent of Biased Ss or Es      62%        75%

How are we to account for the difference in proportion of subjects versus proportion of experimenters affected by expectancy effects? It was possible, of course, that the difference was in some way only an artifact of the difference in samples of experiments yielding the appropriate information. If the difference were not an
artifact, however, it would suggest that expectancy effects are relatively more widespread among experimenters but that the effects per experimenter are relatively smaller. This interpretation is made more plausible by the results of an analysis comparing the proportion of experimenters showing expectancy effects with the proportion of subjects affected for just those experiments for which both types of information were available. There were 26 such samples, and for 24 of them the proportions of affected subjects and experimenters were either both above the grand median or both below it (z = 3.94). The median percentage of affected subjects was 66 per cent, and the median percentage of affected experimenters was 75 per cent. The latter value is identical to the percentage based on all 57 studies so that it seems likely that the studies for which subject data were available were not unrepresentative of the larger number of studies for which experimenter data were available.

Because neither in the case of the analysis based on subjects nor in that based on experimenters was the analysis sufficiently exhaustive, nor even necessarily representative, we should not take these estimates as very precise. Perhaps as a crude guide to the estimation of expectancy effects and to the planning of the sample sizes required in future research, we can give as a reasonable index that about two-thirds of subjects and of experimenters will give or obtain responses in the direction of the experimenter’s expectancy.

Though we have been able to arrive at some estimate, however crude, of the magnitude of expectancy effects, we will not know quite how to assess this magnitude until we have comparative estimates from other areas of behavioral research. Such estimates are not easy to come by ready-made, but it seems worthwhile for us to try to obtain such estimates in the future.
Although in individual studies, investigators occasionally give the proportion of variance accounted for by their experimental variable, it is more rare that systematic reviews of bodies of research literature give estimates of the overall magnitude of effects of the variable under consideration. It does not seem an unreasonable guess, however, to suggest that in the bulk of the experimental literature of the behavioral sciences, the effects of the experimental variable are not impressively “larger,” either in the sense of magnitude of obtained zs or in the sense of proportion of subjects affected, than the effects of experimenter expectancy.

The best support for such an assertion would come from experiments in which the effects of experimenter expectancy are compared directly, in the same experiment, with the effects of some other experimental variable believed to be a significant determinant of behavior. Fortunately, there are two such experiments to shed light on the question.

The first of these was conducted by Burnham (1966). He had 23 experimenters each run one rat in a T-maze discrimination problem. About half the rats had been lesioned by removal of portions of the brain, and the remaining animals had received only sham surgery which involved cutting through the skull but no damage to brain tissue. The purpose of the study was explained to the experimenters as an attempt to learn the effects of lesions on discrimination learning. Expectancies were manipulated by labeling each rat as lesioned or nonlesioned. Some of the really lesioned rats were labeled accurately as lesioned but some were falsely labeled as unlesioned. Some of the really unlesioned rats were labeled accurately as unlesioned but some were falsely labeled as lesioned. Table 6-18 shows the standard scores of the ranks of performance in each of the four conditions. A higher score indicates superior performance.
Table 6–18 Discrimination Learning as a Function of Brain Lesions and Experimenter Expectancy

                     Expectancy
Brain state          Lesioned    Unlesioned    Σ         z of difference
Lesioned             46.5        49.0          95.5
Unlesioned           48.2        58.3          106.5     +1.40 a
Σ                    94.7        107.3
z of difference            +1.60 b

a By unweighted means F test; z = +1.47 by U test.
b By unweighted means F test; z = +1.95 by U test.

Animals that had been lesioned did not perform as well as those that had not been lesioned and animals that were believed to be lesioned did not perform
as well as those that were believed to be unlesioned. What makes this experiment of special interest is that the effects of experimenter expectancy were at least as great as those of actual removal of brain tissue (the z associated with the interaction was only about 1.0). A number of techniques for the control of experimenter expectancy effects have been described elsewhere in detail (Rosenthal, 1966). One of these techniques, the employment of expectancy control groups, is well illustrated by Burnham’s design. The experimenter expectancy variable is permitted to operate orthogonally to the experimental variable in which the investigator is ordinarily most interested. Ten major types of outcomes of expectancy-controlled experiments have been outlined and Burnham’s result fits most closely that outcome labeled as Case 3 (Rosenthal, 1966, 382). If an investigator interested in the effects of brain lesions on discrimination learning had employed only the two most commonly employed conditions, he could have been seriously misled by his results. Had he employed experimenters who believed the rats to be lesioned to run his lesioned rats and compared their results to those obtained by experimenters running unlesioned rats and believing them to be unlesioned, he would have greatly overestimated the effects on discrimination learning of brain lesions. For the investigator interested in assessing for his own area of research the likelihood and magnitude of expectancy effects, there appears to be no substitute for the employment of expectancy control groups. For the investigator interested only in the reduction of expectancy effects, other techniques such as blind or minimized experimenter–subject contact or automated experimentation (Kleinmuntz and McLean, 1968; McGuigan, 1963; Miller, Bregman, and Norman, 1965) are among the techniques that may prove to be useful. 
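The logic of an expectancy control design can be illustrated with the four cell entries of Table 6-18. The Python sketch below separates the effect of the actual treatment from the effect of the experimenter’s belief and compares both with the difference that the usual two-group comparison would show. The function and key names are hypothetical, and the effects are simple differences between marginal means, not the F or U tests reported in the table.

```python
def expectancy_control_effects(cells: dict[tuple[str, str], float]) -> dict[str, float]:
    """cells maps (actual state, believed state) to a mean score; each state is
    coded 'treated' or 'control'."""
    tt = cells[("treated", "treated")]
    tc = cells[("treated", "control")]
    ct = cells[("control", "treated")]
    cc = cells[("control", "control")]
    return {
        # Actual treatment, averaged over what the experimenters believed.
        "treatment": ((ct + cc) - (tt + tc)) / 2.0,
        # Experimenter belief, averaged over the actual treatment.
        "expectancy": ((tc + cc) - (tt + ct)) / 2.0,
        # Difference shown by the usual two-group comparison, in which
        # belief always matches the actual treatment.
        "two_group_difference": cc - tt,
    }


# Cell entries of Table 6-18 (standard scores; higher means better performance).
burnham = {
    ("treated", "treated"): 46.5,  # lesioned, believed lesioned
    ("treated", "control"): 49.0,  # lesioned, believed unlesioned
    ("control", "treated"): 48.2,  # unlesioned, believed lesioned
    ("control", "control"): 58.3,  # unlesioned, believed unlesioned
}
effects = expectancy_control_effects(burnham)
print({name: round(value, 1) for name, value in effects.items()})
```

The two-group difference is always the sum of the two main effects, so in these data somewhat more than half of what the conventional comparison would ascribe to the lesion is carried by the experimenters’ expectancy.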
The first of the experiments to compare directly the effects of experimenter expectancy with some other experimental variable employed animal subjects. The next such experiment to be described employed human subjects. Cooper, Eisenberg, Robert, and Dohrenwend (1967) wanted to compare the effects of experimenter expectancy with the effects of effortful preparation for an examination on the degree of belief that the examination would actually take place. Each of ten experimenters contacted ten subjects; half of the subjects were required to memorize a list of 16 symbols and definitions that were claimed to be essential to the taking of a test that had a 50–50 chance of being given, while the remaining subjects, the “low effort” group, were asked only to look over the list of symbols. Half of the experimenters were led to expect that “high effort” subjects would be more certain of actually having to take the test, while half of the experimenters were led to expect that “low effort” subjects would be more certain of actually having to take the test. Table 6-19 gives the subjects’ ratings of their degree of certainty of having to take the test.

Table 6–19 Certainty of Having to Take a Test as a Function of Preparatory Effort and Experimenter Expectancy

                     Expectancy
Effort level         High      Low       Σ        z of difference
High                 +.64      −.40      +.24
Low                  +.56      −.52      +.04     +0.33 a
Σ                    +1.20     −.92
z of difference          +3.37 a

a By F test.

There was a very slight tendency for subjects who had exerted greater effort to believe more strongly that they would be taking the test. Surprising in its magnitude was the finding that experimenters expecting to obtain responses of greater certainty obtained such responses to a much greater degree than did experimenters expecting responses of lesser certainty. The ratio of expectancy effect to effort effect mean squares exceeds 112. In the terms of the discussion of expectancy control groups referred to earlier, these results fit well the so-called case 7 (Rosenthal, 1966, 384).

Had this experiment been conducted employing only the two most commonly encountered conditions, the investigators would have been even more seriously misled than would have been the case in the earlier-mentioned study of the effects of brain lesions on discrimination learning. If experimenters, while contacting high effort subjects, expected them to show greater certainty, and if experimenters, while contacting low effort subjects, expected them to show less certainty, the experimental hypothesis might quite artifactually have appeared to have earned strong support. The difference between these groups might have been ascribed to effort effects while actually the difference seems due almost entirely to the effects of the experimenter’s expectancy.
Moderating Variables
Except for the very first few experiments in each of the research domains described, the bulk of the 94 studies summarized were not designed primarily to test the hypothesis of expectancy effects. Rather, these studies were designed to learn something of the conditions which increase, decrease, or otherwise modify the effects of experimenter expectancy. Approximately half of the experiments (49 per cent) and half of the laboratories (52 per cent) obtained one or more interactions of experimenter expectancy with some other variable with an associated |z| > 1.28. Many of the specific interactions were investigated in more than one experiment.
Sex of Participants
In a great many of the experiments summarized, it would have been possible to examine the interaction of expectancy effects with sex of experimenter, sex of
subject, and sex of dyad. This was done, however, or reported in only a fraction of the studies so that it was not possible to have an exhaustive inventory of such interactions. Therefore, we cannot sensibly employ the technique of combining zs to obtain an overall estimate of the interaction of experimenter expectancy with the sex of the participants. What was possible was to find those experiments in which a relationship was found or reported in which z reached an absolute value of 1.28. Summaries based on such results, then, will have little to offer in the way of estimating the frequency of a relationship. Instead they will be limited to estimating the proportion of results in a specific direction for just that subsample of studies in which results reached the specified value of |z|.

Table 6-20 Expectancy Effects as a Function of Sex of Participants in Studies of Person Perception

I Sex of experimenter      II Sex of subject      III Sex of dyad
Studyᵃ        z            Study        z         Study       z
14, 15      +1.44          14, 15     −1.44       6–10      +1.51
22, 23      +1.41          18         −2.85       38        +1.64
39          +1.96          39         +2.58       39        +1.51
43ᵇ         +2.07          40, 41     +1.96
                           42         +1.64

ᵃ Numbers refer to those of Table 6-11.
ᵇ Refers to expectancy effects transmitted via research assistants; see Rosenthal, 1966, 232.

Table 6-20 shows the directional zs associated with interactions of expectancy effects with sex of participants for studies of person perception. In the first column, a positive z means that male experimenters showed greater expectancy effects than did female experimenters. When more than a single study is associated with a single z, it means that the interaction was based on the combined samples. The finding of all four zs as positive suggests that when differences in expectancy effects are found between male and female experimenters, male experimenters tend to show the greater expectancy effect. It is interesting to note that all six of the studies listed in this first column were tabulated as showing directional zs < +1.28. Though it would be difficult to attach an exact p value to this result, the fact that such consistent results of tests of interactions were obtained from studies showing no main effects of experimenter expectancy, puts an additional strain on the credibility of the null hypothesis that expectancy effects do not occur. All the interactions shown in Table 6-20 were based on the person perception task.
The reason for this was that for no other task was there more than a single study available to shed light on the nature of the interaction of expectancy effects and sex of experimenter and/or subject. Another experiment testing the difference between male and female experimenters in magnitude of expectancy effect was available. That was the study by Raffetto (1967, 1968) of reports of hallucinatory experiences. At z = −1.65 he found that for this task it was female experimenters who showed the greater expectancy effects. It seems possible that whether male or female experimenters show the greater expectancy effects may depend upon the specific nature of the experiment conducted.
In the second column of Table 6-20 a positive z means that female subjects were more susceptible to the effects of experimenter expectancy. There is no consistency to these results and perhaps all that can be said is that sometimes male subjects and sometimes female subjects show greater susceptibility to expectancy effects. Of the seven studies represented in Column II, five were tabulated earlier as showing directional zs < +1.28. In his experiment employing a marble-dropping task, Johnson (1967) had found female subjects to be more susceptible than male subjects to expectancy effects (z = +1.72) at least under some conditions.
In the third column of Table 6-20 we find the results of a highly specific three-way interaction between sex of subject, sex of experimenter, and experimenter expectancy. The first of these studies (38) found net positive expectancy effects among male experimenters contacting either male or female subjects and among female experimenters contacting female subjects. However, when female experimenters contacted male subjects, the expectancy effect was reversed, with subjects responding in the direction opposite to that which experimenters had been led to expect. Just that same pattern was obtained in two other analyses. Of the seven studies represented in Column III, all but one were tabulated earlier as showing directional zs < +1.28.
Three other studies have reported interactions involving simultaneously the sex of experimenter and subject and magnitude of expectancy effect. Johnson (1967), in his marble-dropping experiment, found that when experimenter and subject were of the same sex there were greater expectancy effects than when experimenter and subject were of the opposite sex (z = +1.80). Just the opposite results, however, were obtained by Adair (1968) employing a numerosity estimation task (z = −2.33) and by Silverman (1968) employing a reaction time measure (z = −1.61).
Both these investigators found greater expectancy effects when experimenters and subjects were of the opposite sex. The joint effects of experimenter and subject sex may sometimes be significant determinants of the direction and magnitude of expectancy effects, but it seems likely that the type of task employed may be a further complicating variable.
Experimenter Dominance
On the basis of a variety of evidence presented elsewhere (Rosenthal, 1966), it was suggested that experimenters showing greater dominance or a greater degree of professionalness in their behavior were likely to show greater effects of their experimental hypotheses. This interaction of a specific experimenter characteristic with magnitude of expectancy effect has recently received some fairly strong support in three experiments conducted by Bootzin (1968). In all three studies, Bootzin found more dominant experimenters to show greater effects of their induced expectations. The three obtained zs were +2.05, +3.30, and +2.17; the combined z was +4.35, p < .000008. This result may well be related to the finding that where there are differences between male and female experimenters in magnitude of expectancy effects, it is the male experimenters who are likely to show the greater effects. It seems reasonable to suppose that, in general, male experimenters are likely to be classed as more dominant than are female experimenters.
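The combined z reported for Bootzin’s three experiments can be reproduced under one common rule for combining independent standard normal deviates: add the zs and divide by the square root of their number. The text does not spell out the formula at this point, so treating it as this rule is an assumption, and the function names below are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def combine_zs(zs):
    """Combine independent directional zs: their sum divided by sqrt(count)."""
    return sum(zs) / sqrt(len(zs))

def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)

# The three zs reported for Bootzin (1968).
bootzin_zs = [2.05, 3.30, 2.17]
combined = combine_zs(bootzin_zs)
p_value = one_tailed_p(combined)

print(f"combined z = {combined:.2f}")
print(f"one-tailed p = {p_value:.7f}")
```

With these inputs the combined z lands within rounding of the +4.35 given in the text, and the one-tailed probability falls below .000008.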
Other Variables
There are a good many other variables that have been shown to interact significantly with the effects of experimenter expectancy. Later, we shall have occasion to refer to some, but because so many of these interactions have been described elsewhere in some detail (Rosenthal, 1966), we need give here only some illustrations. Through the employment of accomplices serving as the first few subjects it was learned that when the responses of the first few subjects confirmed the experimenter’s hypothesis, his behavior toward his subsequent subjects was affected in such a way that these subjects tended to confirm further the experimenter’s hypothesis. When accomplices serving as the first few subjects intentionally disconfirmed the expectation of the experimenter, the real subjects subsequently contacted were affected by a change in the experimenter’s behavior also to disconfirm his experimental hypothesis. It seems possible, then, that the results of behavioral research can, by virtue of the early data returns, be determined partially by the performance of just the first few subjects (Rosenthal, 1966). In some of the experiments conducted, it was found that when experimenters were offered a too-large and a too-obvious incentive to affect the results of their research, the effects of expectancy tended to diminish. It speaks well for the integrity of student-experimenters that when they felt bribed to obtain the data they had been led to expect, they seemed actively to oppose the principal investigators. There was a tendency for those experimenters to “bend over backward” to avoid the biasing effects of their expectation, but sometimes with their bending so far backward that the results of their experiments tended to be significantly opposite to the results they had been led to expect (Rosenthal, 1966).
In several experiments in which each experimenter was given two different expectancies for two allegedly different subsamples of subjects, the distribution of expectancy effects showed a significant and interesting skew. In each of three such studies, which were not at all homogeneous in the overall magnitude of expectancy effects obtained, a significant minority of experimenters obtained results more negative in direction than could reasonably be expected by chance. These three studies are summarized in Table 6-21 in which the first listed study employed animal subjects and the others employed human subjects performing the photo rating task. Since each experimenter had contacted some subjects under different conditions of expectation, magnitude of expectancy effect was defined simply as the mean response obtained under one condition of expectation minus the mean response obtained under the opposite condition of expectation. In order to make the units of measurement of the different studies more comparable, each distribution of difference scores was divided into ten equal intervals, five above an absolute difference score of .00 and five below. All three studies show a substantial minority (14 to 20 per cent) of experimenters to obtain data significantly opposite to what they had been led to expect. This type of finding suggests the possibility that there are some experimenters who react to being given an expectancy either by bending over backward to avoid biasing their data, or perhaps because of resentment at being told what to expect, by in some way showing the expectancy inducer that he was wrong to make the prediction he made. If these minority reactions to induced expectancies were widespread, it might be of interest to try to learn the personal correlates of membership in this subset of experimenters who react to induced expectations with such negative and non-Gaussian gusto.

Table 6-21 Proportions of Experimenters Showing Various Magnitudes of Expectancy Effects in Three Studies

            I & H,        R,P,M,V,G,     R,P,M,V,G,
            1966, II      1964b, I       1964b, II      Combined
Effect      (N = 15)      (N = 13)       (N = 7)        (N = 35)
+5            .00           .00            .00            .00
+4            .00           .00            .00            .00
+3            .13           .08            .00            .09
+2            .40           .31            .00            .29
+1            .27           .23            .43            .29
−1ᵃ           .00           .23            .43            .17
−2            .00           .00            .00            .00
−3            .00           .00            .00            .00
−4            .13           .08            .00            .09
−5            .07           .08            .14            .09
zᵇ          +2.58         +2.33          +2.58          +4.33

ᵃ Includes .00 effect.
ᵇ For asymmetry.

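The Combined column of Table 6-21 can be recovered as a sample-size-weighted average of the three studies’ proportions. The sketch below is only an arithmetic check under that assumption. The study labels are shortened, the interval labels are illustrative, and the five lower intervals are taken to be −1 through −5.

```python
# Proportion of experimenters in each effect interval, ordered +5 ... +1, -1 ... -5.
# Each entry is (number of experimenters, proportions) as read from Table 6-21.
studies = {
    "I & H, 1966, II": (15, [.00, .00, .13, .40, .27, .00, .00, .00, .13, .07]),
    "R,P,M,V,G, 1964b, I": (13, [.00, .00, .08, .31, .23, .23, .00, .00, .08, .08]),
    "R,P,M,V,G, 1964b, II": (7, [.00, .00, .00, .00, .43, .43, .00, .00, .00, .14]),
}
intervals = ["+5", "+4", "+3", "+2", "+1", "-1", "-2", "-3", "-4", "-5"]

total_n = sum(n for n, _ in studies.values())

# Weight each study's proportion by its number of experimenters.
combined = [
    round(sum(n * props[i] for n, props in studies.values()) / total_n, 2)
    for i in range(len(intervals))
]

# Share of experimenters in the two most negative intervals, per study.
strongly_negative = {
    name: round(props[-2] + props[-1], 2) for name, (n, props) in studies.items()
}

for label, value in zip(intervals, combined):
    print(f"{label}: {value:.2f}")
print(strongly_negative)
```

The per-study shares in the two most negative intervals correspond to the “14 to 20 per cent” minority mentioned in the text.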
On the basis of an earlier analysis suggested by Fred Mosteller (Rosenthal, 1966, 312–313), it was proposed that one effect of experimenter expectancies might be not on measures of central tendency but on measures of variance. In an experiment employing three groups of experimenters, two experimental groups and one control, it was found that experimenters of the control group obtained data that were more variable than the data obtained by experimenters of the two experimental groups. Two more recent studies showed similar effects. In an experiment on verbal conditioning, Kennedy, Edwards, and Winstead (1968) found that those experimenters who had been given no expectation obtained more variable responses than did experimenters expecting either high or low rates of conditioning (z = +2.22). Similarly, in an experiment on judging the frequency of light flashes, Müller and Timaeus (1967) found that control group experimenters obtained more variable responses than did experimenters expecting either overestimation or underestimation (z = +1.88). In both of these experiments, as was often the case in studies showing interaction effects, the directional zs associated with the main effects of experimenter expectancy were less than +1.28 and, therefore, were recorded as zs of .00.
Earlier, reference was made to the experiment by Adler (1968) in which the set given the experimenters was an important determinant of the direction of the subsequent expectancy effects. Other such results have also been reported (Rosenthal, 1966; Rosenthal and Persinger, 1968) as have results showing the effects of subject set on the direction and magnitude of expectancy effects (Rosenthal, 1966; White, 1962).
In a number of studies where there was a conflict between what an experimenter had been led to expect and what he himself actually expected, these two sources of hypothesis were found to interact significantly (Bootzin, 1968; Nichols, 1967; Strauss, 1968); but sometimes they did not (Marcia, 1961; Marwit & Marcia, 1967). For two samples of male experimenters, it has been reported that those who exchanged fewer glances with their subjects during the instruction-reading phase
of the person perception experiment, subsequently showed greater expectancy effects (Rosenthal, 1966, 268). The more recent work of Connors (1968) bears out this finding (z = +2.12). Other studies of variables complicating the effects of experimenter expectancy have investigated the effects of experimenter and subject need for approval, experimenter and subject anxiety, degree of acquaintanceship between experimenter and subject, experimenter status, and characteristics of the laboratory in which the interaction occurs. In general, the results of these studies have been complex, with far too many results of large zs, but with the signs sometimes in one direction and sometimes in the other. For many of these moderating variables there appear to be meta-moderating variables (Rosenthal, 1966).
The Mediation of Expectancy Effects
How are we to account for the results of the experiments described? How does an experimenter unintentionally inform his subjects just what response is expected of them? Our purpose in this section is to review the evidence that may shed light on this question. First, however, we must take up the proposition that there is nothing to be explained, that our talk about an artifact is based on nothing but other artifacts.
Expectancy Effects as Artifacts
Cheating and recording errors have been suggested as prime candidates for consideration as the artifacts leading to the false conclusion that experimenters’ expectancies may serve as significant partial determinants of subjects’ responses (Barber and Silver, 1968; Rosenthal, 1964a). There is no way to rule out with any certainty the operation of either intentional “errors” or errors of observation in most of the individual experiments investigating the effects of experimenter expectancy; but there is no way to rule out the operation of these errors in the vast majority of the research in the behavioral sciences. What we can do is to rule out the operation of cheating and observer errors as necessary factors operating in studies of expectancy effects. There are a number of experiments which do permit us to rule out the operation of such errors. Earlier, the experiment by Adair and Epstein (1967, II) was described. It will be recalled that in this study there were no experimenters, only tape recordings of the voices of experimenters, and tape recordings cannot err either intentionally or unintentionally. In this experiment, in which subjects recorded their own responses, the directional z associated with expectancy effects was +1.64. The experiment by Johnson (1967) similarly ruled out the operation of intentional or observer errors. The recording of subjects’ responses was accomplished by an electrical system which did the bookkeeping.
The tallies were then transcribed by the principal investigator who was blind to the experimental condition of experimenter expectancy in which each subject had been contacted. Despite the tightness of the controls for cheating and for observer errors, Johnson’s results showed a very large effect of experimenter expectancy with a directional z of +3.89. The experiment by Weick (described in Rosenthal, 1966) was another in which cheating and observer errors were unlikely to occur. That experiment was conducted
in a classroom under the watchful eyes of students in a class in experimental social psychology. Despite the restraint such an audience might be presumed to impose on the intentional errors or the careless errors of an experimenter, the obtained directional z was +2.33.
Because of the small size of the animals involved, experiments employing planaria would seem to be especially prone to quasi-intentional errors or to errors of recording. Since it is often difficult to judge the behavior of planaria, experimenters might too often judge or claim a response to have occurred when that response was expected. Hartry (1966) conducted two experiments on the effects of experimenter expectancy on the results of studies of planaria performance. In one of these studies, special pains were taken to reduce the likelihood of observer or intentional errors. Experimenters were given more intensive training, an instructor was present during the conduct of the experiment, and three observers were present to record the worm’s response. Quite surprisingly, the effects of experimenter expectancy were greater in the study with better controls for observer errors and cheating. In the less well-controlled study, experimenters expecting more responses obtained an average of 73 per cent more responses than did the experimenters expecting fewer responses. In the better-controlled study, however, experimenters expecting more responses obtained an average of 211 per cent more responses than did the experimenters expecting fewer responses.
The experiment by Persinger, Knutson, and Rosenthal (1968) was filmed and tape-recorded without the knowledge of experimenters or subjects. Independent observers then recorded subjects’ responses directly from the tape recordings and these recordings were compared to those of the original experimenters.
It was found that .72 per cent of the experimenters’ transcriptions were in error and that .48 per cent of the transcriptions erred in the direction of the experimenters’ hypothesis while .24 per cent of the transcriptions erred in the direction opposite to that of the experimenters’ hypothesis. These latter errors, however, tended to be larger than the errors favoring the hypothesis, so that the mean net error per experimenter was .0003 in the direction opposite to the experimenters’ expectancies and so trivial in magnitude that analyses based on either the corrected or uncorrected transcriptions gave the same results (directional z of +1.64). Analysis of the films of this and of other experiments (Rosenthal, 1966), in which experimenters did not know they were being filmed, gave no evidence to suggest any attempts to cheat on the part of the experimenters. Similarly, other analyses of the incidence of recording errors show their rates to be too low to account for the results of studies of experimenter expectancy or most other studies for that matter.
It is, of course, possible that in any single experiment in the behavioral sciences, cheating or recording errors may occur to a sufficient extent to account for the obtained results. It seems unlikely, however, that any replicated findings of the behavioral sciences, especially if replicated in different laboratories, could reasonably be ascribed either to intentional errors or to recording errors.
Our discussion has been of cheating and of observer errors serving as artifacts in the production of an effect which can itself be regarded as an artifact in behavioral research, the expectancy of the experimenter. Our discussion would be incomplete, however, without a systematic consideration of what it would mean if we had found effects of experimenter expectancy to be associated with artifacts of cheating and of observer errors. Earlier discussions of this problem have, unfortunately, been
incomplete in this regard (Barber and Silver, 1968; Rosenthal, 1964a). Barber and Silver, for example, suggest that if it could be established that such meta-artifacts as cheating and observer errors accounted for the results of studies showing expectancy effects at some specified level of z, then this would be sufficient to rule out the effects of experimenter expectancy as a source of artifact in other research. Unfortunately, the situation is a good deal more complex than that simple inference would suggest.

Table 6-22 Schema for the Consideration of Experimental Results as a Function of Artifacts and Meta-Artifacts

                            Effects of meta-artifact
Experimental results     Decrease z     Trivial effect     Increase z
PRIMARY VARIABLE
  Positive z             Case 1         Case 2             Case 3
  Trivial z              Case 4         Case 5             Case 6
  Negative z             Case 7         Case 8             Case 9
EXPECTANCY EFFECT
  Positive z             Case 10ᵃ       Case 11            Case 12
  Trivial z              Case 13        Case 14            Case 15
  Negative z             Case 16        Case 17            Case 18

ᵃ The best documented case of cheating among experimenters to come to our attention occurred in research involving animal subjects in which allegedly dull animals were helped to perform better, thus decreasing the effects of experimenter expectancy.

Table 6-22 presents a schema for the consideration of a variety of experimental outcomes in relation to the artifact of expectancy effects and the meta-artifacts of intentional and recording errors. We let the “primary variable” stand for whatever a given behavioral researcher is currently investigating, other than expectancy effects. The three columns of Table 6-22 represent the three broad classes of effects of cheating or recording errors: (a) effects decreasing the obtained z, (b) effects of arbitrarily trivial magnitude, and (c) effects increasing the obtained z. The suggestion by Barber and Silver was essentially to look only at the cell labeled Case 12. If, in an experiment on expectancy effects, there were errors to inflate the z, then we need not concern ourselves any longer with the role of expectancy effects as an artifact in behavioral research. The conclusion, of course, does not follow. We want to consider the rest of the possible outcomes. In Case 15, for example, we have a trivial z for expectancy effect which may have been made trivial by the meta-artifact which increased the z from a negative to a near zero level. We want to know about Case 16 to be sure that a negative z for expectancy effects was not due to the meta-artifact, or about Case 13 to be sure that a near zero z was not depressed from a positive z by cheating or recording errors.
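The layout of Table 6-22 amounts to a three-way classification: which variable is under study, what z was obtained, and how the meta-artifact acted on that z. The sketch below encodes that numbering so the cases named in the discussion can be looked up. The function and label names are illustrative, not the authors’.

```python
VARIABLES = ("primary variable", "expectancy effect")
OUTCOMES = ("positive z", "trivial z", "negative z")
META_EFFECTS = ("decrease z", "trivial effect", "increase z")

def case_number(variable, outcome, meta_effect):
    """Return the case number for one cell of the schema in Table 6-22."""
    block = VARIABLES.index(variable)      # 0 for cases 1-9, 1 for cases 10-18
    row = OUTCOMES.index(outcome)          # the obtained experimental result
    col = META_EFFECTS.index(meta_effect)  # effect of cheating or recording errors
    return 9 * block + 3 * row + col + 1

# The cases singled out in the discussion.
print(case_number("expectancy effect", "positive z", "increase z"))  # 12
print(case_number("expectancy effect", "trivial z", "increase z"))   # 15
print(case_number("expectancy effect", "negative z", "decrease z"))  # 16
print(case_number("expectancy effect", "trivial z", "decrease z"))   # 13
print(case_number("primary variable", "positive z", "increase z"))   # 3
```

Reading the schema this way makes plain that Case 12 is only one of eighteen cells, which is the point of the argument against looking at it alone.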
Although our empirical evidence suggests most errors due to cheating or misrecording to be trivial, it must be kept in mind that these errors can cut two ways. They can artifactually deflate the obtained effects as much as they can artifactually inflate the obtained effects. What makes our schema still more complicated is the necessity for considering simultaneously the effect of our meta-artifact on expectancy effects relative to its effect on the primary variable. There is no basis in data to think so, but if we assume for the moment that Case 12 effects were found, we would want to compare their
magnitude and frequency with those of the Case 3 effects. If it were found that Case 12 occurs often but Case 3 occurs seldom, then we would legitimately begin to wonder whether expectancy effect research might not be particularly prone to meta-artifact. But our inquiry would be far from over since it must first be seen whether Cases 10, 13, and 16 are not also overrepresented relative to Cases 1, 4, and 7. What we want, in short, is something like a 3 × 2 contingency table that would permit us to say something of the effect of our meta-artifact on experimenter expectancy and of the relative effects of the meta-artifact and of experimenter expectancy on the primary variable.
Operant Conditioning
If intentional errors and recording errors will not do as explanations of the results of studies of expectancy effect, what will? The most obvious hypothesis was that experimenters might quite unwittingly reinforce those responses of their subject that were consistent with their hypothesis. Any small reinforcer might serve: a smile, a glance, a nod. Under an hypothesis of operant conditioning, we would expect to find that the very first response of a given subject is not affected by the experimenter’s expectancy and that, in general, later responses are more affected than earlier responses. Elsewhere, there is a summary of four experiments showing that, on the average, expectancy effects are greater for the subject’s very first response than for his later responses (Rosenthal, 1966, 289–293). A more recent experiment by Wessler (1966) also showed a decrease in expectancy effect from the subjects’ earlier to later responses (z = +1.65). The experiment by Adair and Epstein (1967), in which tape recordings served as experimenters, also served to rule out the operation of operant conditioning as a necessary mediator of expectancy effects.
Additional, though “softer,” evidence that operant conditioning was not a factor in the mediation of expectancy effects has been presented by Marwit (1968) and Masling (1965, 1966), though Marwit and Marcia’s (1967) data suggested that sometimes operant conditioning might be a factor. Just as was the case in our consideration of cheating and recording errors as explanations of expectancy effect, we cannot conclude that operant conditioning never operates as a mechanism mediating expectancy effects. What we can conclude, just as in the case of cheating and recording errors, is that expectancy effects do occur in the absence of operant conditioning. Operant conditioning, like cheating and observer errors, cannot explain the results of studies of expectancy effect.
Communication Channels
The fact that the very first response of an experimental subject can be affected by the expectancy of the experimenter suggests that the mediation of expectancy effects must occur, at least sometimes, during that phase of the data-collection situation in which the experimenter greets, seats, and instructs his subject. Some beginnings have been made to learn what the experimenter does unintentionally during this phase of the experiment to inform his subject of the expected response. These beginnings are not characterized by spectacular success (Rosenthal, 1966). Data of a more modest sort, however, are beginning to sketch some picture of the classes of cues likely to be involved in the mediation of expectancy effects.
There are two experiments to show that auditory cues alone may be sufficient to mediate expectancy effects. One of these is the study by Adair and Epstein (1967) in which subjects heard only the instructions tape-recorded earlier by experimenters given different expectancies. The z for expectancy effect based upon voice alone was +1.64. The other experiment was by Troffer and Tart (1964) in which the experimenters were all experienced hypnotists. They were to read standard passages to subjects in each of two conditions which may have affected the expectation of the experimenters. When experimenters had reason to expect lower suggestibility scores, their voices were found to be significantly less convincing in their reading of the instructions to their subjects (z = +2.81). This result was obtained despite the fact that experimenters (a) were cautioned to treat their subjects identically, (b) were told that their performances would be tape-recorded, and (c) were all aware of the problem of experimenter effects. This experiment tells us of the importance of the auditory cues, but because there was a plausible rival hypothesis to the hypothesis of expectancy effects the study was not included in our earlier summary of studies of expectancy effects. That a hypnotist-experimenter’s expectancy may affect his treatment of a research subject has been documented earlier, though the sample sizes involved only a single hypnotist-experimenter and a single subject (Shor and Schatz, 1960).
The two experiments described suggest that auditory cues may be sufficient to serve as mediators of expectancy effects. There are two additional experiments in support of this proposition, both of which have the additional merit of permitting estimates of the effects on the magnitude of expectancy effects of subjects’ having available only auditory cues as compared to having access to both auditory and visual cues.
The possibility of obtaining such estimates depends on having available at least three groups of experimenters. For two of these groups, subjects must have access to both visual and auditory cues from their experimenters, but each group of experimenters must have a different expectation for their subjects’ responses. The difference between the mean response obtained by experimenters of these two groups is considered the base line of magnitude of expectancy effect when both channels of information are available. The third group of experimenters is given one of the two possible expectations, but subjects’ access to visual cues from these experimenters is cut off. The difference between the mean response obtained by experimenters in this condition and the mean response obtained by experimenters expecting the opposite response is considered the magnitude of expectancy effect when only auditory cues are available. This magnitude can be divided by the base line magnitude for an estimate of the proportion of expectancy effect obtained when only auditory cues were available. The two experiments meeting these requirements have been tabulated earlier as Rosenthal and Fode, 1963b, II and as Zoble, 1968 (the former study was a master’s thesis by Fode). Fode’s study employed the person perception task and his data showed that 47 per cent of the total expectancy effect was obtained when subjects had access only to auditory cues from their experimenter. Zoble’s study employed a task requiring subjects to make tone-length discriminations but his results were remarkably similar to Fode’s. Zoble’s data showed that 53 per cent of the total expectancy effect was obtained when subjects were restricted to purely auditory cues. The combined z associated with finding expectancy effects with only auditory cues available to subjects was +4.01.
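The three-group estimate described above reduces to a ratio of two differences between means. The following sketch spells out that arithmetic. The function name, the argument names, and the sample means are hypothetical and are not data from Fode’s or Zoble’s studies; the sketch also assumes that the third group was given the “high” expectation.

```python
def auditory_share(both_channels_high, both_channels_low, auditory_only_high):
    """Proportion of the expectancy effect obtained with auditory cues alone.

    both_channels_high, both_channels_low: mean responses obtained by the two
    groups of experimenters whose subjects had visual and auditory cues.
    auditory_only_high: mean response obtained by the third group, assumed to
    hold the 'high' expectation, whose subjects had auditory cues only.
    """
    # Base line: expectancy effect with both channels available.
    baseline = both_channels_high - both_channels_low
    if baseline == 0:
        raise ValueError("no base line expectancy effect to compare against")
    # Effect with auditory cues only, measured against the opposite expectation.
    auditory_effect = auditory_only_high - both_channels_low
    return auditory_effect / baseline

# Hypothetical means, chosen only to illustrate the calculation.
share = auditory_share(both_channels_high=2.0,
                       both_channels_low=-1.0,
                       auditory_only_high=0.5)
print(f"{share:.0%} of the base line effect with auditory cues only")
```

The same function would serve for a visual-only group by passing that group’s mean in place of the auditory-only mean.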
Additional evidence for the importance of the auditory channel to the mediation of expectancy effects comes from an analysis by Duncan and Rosenthal (1968). Sound motion pictures were available of three male experimenters administering the person perception task to 10 different subjects. An analysis of the experimenters’ vocal emphases showed that no subject was exposed to identical differential emphases of those portions of the instructions that listed the subject’s response alternatives. All five subjects who heard relatively greater vocal emphasis on the response alternatives associated with high photo ratings subsequently assigned higher photo ratings than did any of the five subjects who heard relatively greater vocal emphasis on the response alternatives associated with low photo ratings (z = +2.65). The three experimenters on whose differential vocal emphases these paralinguistic analyses were made had been selected because they were known to have shown expectancy effects. We expect, therefore, by definition, to find a large correlation between the various expectancies given each experimenter and the mean photo rating given by the subject contacted under each different expectation. That correlation was +.60 (z = +1.75), a finding obviously not reported in support of the hypothesis of expectancy effect, but rather to establish a base line for comparison. The correlation between an experimenter’s differential vocal emphasis on the various response alternatives in the instructions read to subjects and the subjects’ subsequent response was +.72 (z = +2.33). That was a promising chain of correlations. The experimenter’s expectancy predicted his subjects’ responses, and the differential vocal emphasis of the experimenter also predicted his subjects’ responses. It remained only to show that the experimenter’s expectancy was a good predictor of how he read his instructions to his subjects. Then everything would fall nicely into place.
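The z of +2.65 reported earlier in this paragraph for the vocal-emphasis result is close to what an exact argument gives. If the five subjects who heard greater emphasis on the high-rating alternatives were no different from the other five, every split of the ten ratings into two groups of five would be equally likely, and only one split puts all five of the highest ratings in the first group. The sketch below converts that one-tailed probability to a standard normal deviate. Whether the authors computed z in this way is an assumption.

```python
from math import comb
from statistics import NormalDist

n_subjects = 10
n_high_emphasis = 5

# Number of ways to choose which five of the ten ratings belong to the first group.
n_splits = comb(n_subjects, n_high_emphasis)

# Only one split gives complete separation in the predicted direction.
p_one_tailed = 1 / n_splits

# Standard normal deviate with that upper-tail probability.
z = NormalDist().inv_cdf(1 - p_one_tailed)

print(f"splits: {n_splits}, p = {p_one_tailed:.4f}, z = {z:.2f}")
```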
Unfortunately, that is not what we found. The correlation between an experimenter’s expectancy and his instruction-reading behavior was only +.24, a correlation that is difficult to defend as being really different from zero with a maximum of eight degrees of freedom. The correlation between experimenters’ differential vocal emphases and their subjects’ subsequent photo ratings with the effects of experimenter expectancy partialed out showed no shrinkage; it was +.74. Therefore, though this analysis gave further evidence of the importance of the auditory channel of communication, it did not turn out to provide the key to the specific signal employed by subjects to learn what it was that their experimenter expected. Evidence for such a signal would have been provided only if the correlation between an experimenter’s expectation and his differential vocal emphasis during instruction reading had been substantial.³
With all the data available to suggest the importance of the auditory channel in the mediation of expectancy effect, it should not surprise us that those studies of expectancy effect permitting the subject little or no auditory access to the experimenter generally failed to obtain expectancy effects. That was the case in the two studies listed for Carlson and Hergenhahn (1968) and in that listed for Moffatt (1966), all three of these studies having been tabulated as showing directional zs of less than +1.28.
³ Rosenberg, in collaboration with Duncan, has recently replicated the effects on subjects’ responses of differential emphasis in the instruction-reader’s listing of response alternatives. That research was based, of course, on a different sample of experimenters. For a more detailed discussion of the interaction between differential vocal emphasis and subjects’ evaluation apprehension, see the chapter by Rosenberg in this volume.
The same result occurred in one group of experimenters in the study
conducted by Fode (Rosenthal and Fode, 1963b, II), though in that study the overall effects of experimenter expectancy were still associated with a z > 3.00. So far we have focused on the auditory channel of communication, but there are also data available to show the importance of the visual channel. One important finding comes from the research by Zoble (1968) described earlier. As one of his many experimental groups, Zoble had one group of subjects who had access only to visual cues from their experimenter. Despite the fact that Zoble’s results helped to support the importance of auditory cues, his data nevertheless showed that visual cues were more effective than auditory cues in the mediation of expectancy effects (z = +1.44). Whereas those subjects who had access only to auditory cues were affected by their experimenter’s expectancy only 53 per cent as much as those subjects who had access to both visual and auditory cues, those subjects who had access only to visual cues were affected by their experimenter’s expectancy 75 per cent as much as those subjects who had access to both information channels. Zoble’s results suggest a possible nonadditivity of the information carried in the visual and auditory channels. It may be that, when subjects are deprived of either visual or auditory information, they focus more attention on the channel that is available to them. This greater attention and perhaps greater effort may enable subjects to extract more information from the single channel than they could, or would, from that same channel if it were only one part of a two-channel information input system. Much earlier we tabulated the results of two studies of verbal conditioning by Kennedy’s group. In one of those studies, Kennedy, Edwards, and Winstead (1968) found the overall directional z associated with expectancy effects to be less than +1.28.
That experiment we count as a directional z of .00 in our bookkeeping system, but for our present purpose we can afford a closer look at that study. Part of the time experimenters were face-to-face with their subjects, and part of the time subjects had no visual access to their experimenter. The failure of the overall directional z to reach +1.28 seems due entirely to the condition in which subjects were deprived of visual cues from their experimenter. When the analysis was based only on the condition in which visual cues were available, the directional z for expectancy effect was +1.95. Both the studies described suggest that visual cues may also be important for the mediation of expectancy effects, though the experiment by Fode (Rosenthal and Fode, 1963b, II) found mute but visible experimenters to exert no expectancy effects. Further indirect evidence for the importance of visual cues comes from the experiment by Woolsey and Rosenthal (1966). In the first stage of that experiment, subjects had no visual access to their experimenters, but in the second stage they did. When the screens were removed from between experimenters and subjects, expectancy effects became significantly greater (z = +2.04). This evidence must be held very lightly, however, since experimenters contacting subjects with visual contact differed in several other ways from experimenters contacting subjects without visual contact. One difference was that experimenters with visual contact had gained greater experience, and more experienced experimenters appear to show greater expectancy effects, a topic to which we now turn.
Expectancy Effects as Interpersonal Learning
For a number of experiments on expectancy effects, sound motion pictures were available that had been obtained without the experimenters’ or subjects’ prior
knowledge. The analyses of some of these films have been reported elsewhere (Friedman, 1967; Friedman, Kurland, and Rosenthal, 1965; Rosenthal, 1966; Rosenthal, 1967; Rosenthal, Friedman, and Kurland, 1966). For all the hundreds of hours of careful observation, and for all the valuable things learned about experimenter-subject interaction, no well-specified system of unintentional cueing has been uncovered. But if the students of experimenter behavior do not know how experimenters unintentionally cue their subjects to give the expected response, then how do experimenters themselves know how? Perhaps they do not know, but perhaps within the context of the given experiment they can come to know. Expectancy effects may be a learned phenomenon and learned in interaction with a series of research subjects. Each experimenter may have some types of unintended signaling in common with other experimenters, but beyond that each experimenter may have some unique unintended signals that work only for him. Whether this is so is a problem for the psycholinguist, the paralinguist, the kinesicist, and the sociolinguist. But if there were this unique component to the unintentional cueing behavior of the experimenter, it might account for our difficulty in trying to isolate very specific but very widespread cueing systems. The experimenter, who very likely knows no more than we about his cueing behavior, may begin his experiment with little ability to exert expectancy effects. But all the time, in his interaction with his first subject, he is emitting a myriad of unprogrammed and unintended cues in the visual and auditory channels. If whatever pattern of cues he is emitting happens to affect the subject’s response, so that the experimenter obtains the response he expects to obtain, that pattern of cues may be more likely to recur with the next subject. 
In short, obtaining an expected response may be the reinforcement required to shape the experimenter’s pattern of unintentional cueing. Subjects, then, may teach experimenters how to behave kinesically and paralinguistically so as to increase the likelihood that the next subject’s response will be more in the direction of the experimenter’s expectancy. Our old friend Pfungst, the student of Clever Hans, found that as experimenter-questioners gained experience in questioning Hans, they became better unintentional signalers to Hans. If we are seriously to entertain the proposition that expectancy effects are learned in an interpersonal context, then we must be able to show that, in fact, experimenters are more successful in their unintentional influencing of subjects later, rather than earlier, in the sequence of subjects contacted. Elsewhere there is a report of six analyses investigating this question. In three of the samples studied, subjects contacted later in the series showed greater effects (z > +1.28) of experimenter expectancy, while three of the samples showed no order effect (Rosenthal, 1966). The overall directional z in support of the learning hypothesis was +2.73 (Σz = +6.70, N = 6). Since that earlier summary a number of other relevant findings have become available. Connors and Horst (1966), in whose research the overall magnitude of expectancy effect did not reach a directional z of +1.28, nevertheless found that later contacted subjects showed significantly (z = +1.81) greater expectancy effects than did earlier contacted subjects. That same result was obtained by Uno, Frager, and Rosenthal (1968 II), a study in which the overall magnitude of expectancy effect was tabulated as a z of .00 although later contacted subjects showed significantly greater expectancy effects (z = +1.70). In the two other studies by this group showing no overall expectancy effect (z = .00), there were no order effects reaching
a z of |1.28|. In the two studies by this same group showing negative expectancy effects (zs = −1.99, −2.17), the first showed an increase of the negative expectancy effect over time (z = +1.85) but the second showed a decrease (z = −1.46). Altogether, then, there are 12 studies investigating the tendency for expectancy effects to increase as more subjects are seen. Six of the results support the hypothesis at z > +1.28, one of the results runs counter to the hypothesis at z < −1.28, and five of the results neither support nor run counter to the hypothesis. The overall directional z in support of the hypothesis is +3.06, p = .0011. The five studies by Uno’s group were conducted in Japan, and for just that set of studies the combined z is less than +1.28. The remaining seven studies were conducted in the United States and for them the combined z was +3.21. Whether this difference may be due to differences in communication patterns between the two cultures is currently under investigation. For the time being, at least, it seems reasonable to believe that when there is a difference in magnitude of expectancy effect from earlier to later contacted subjects, it is among the later-seen subjects that expectancy effects are likely to be larger. The hypothesis that the mediation of expectancy effects is learned by experimenters in the interpersonal context of the experiment seems worthy of further investigation.
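The combined figures in this section can be checked with a short calculation. The sketch below assumes that the overall directional z is formed by the Stouffer method, that is, the sum of the individual zs divided by the square root of the number of studies; that assumption is consistent with the figures 6.70, N = 6, and +2.73 reported for the first six analyses, but the function names are illustrative and the individual zs of the 12 studies are not all given in the text, so only the reported totals are used.

```python
import math
from statistics import NormalDist


def combined_z_from_sum(sum_of_zs, n_studies):
    """Stouffer combination: sum of directional zs over sqrt(number of studies)."""
    return sum_of_zs / math.sqrt(n_studies)


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


# First six analyses: reported sum of zs = 6.70 with N = 6.
z_six = combined_z_from_sum(6.70, 6)
print(round(z_six, 3))  # close to the reported +2.73

# All 12 studies: reported overall directional z = +3.06.
p_twelve = one_tailed_p(3.06)
print(round(p_twelve, 4))  # close to the reported p = .0011
```

The one-tailed probability is used here because the zs are directional: a positive z counts for the learning hypothesis and a negative z counts against it.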
Research on Unintended Influence
Quite apart from the methodological implications of research on experimenter expectancy effects, there are substantive implications for the study of interpersonal relationships. Perhaps the most general implication is that people can engage in effective unprogrammed and unintended communication with one another and that this process of unintentional influence can be investigated experimentally. A great deal of effort within the behavioral sciences has gone into the study of such intentional influence processes as education, persuasion, coercion, propaganda, and psychotherapy. In each of these cases, the influencer intends to influence the recipient of his message and the message is usually encoded linguistically. Without diminishing efforts to understand these processes better, greater effort should perhaps be expended to understand the processes of unintentional influence in which the message is often encoded nonlinguistically. The question, in short, is how people “talk” to one another holding constant what it is they say. At the present time, not only do we not know the specific signals by which people unintentionally influence one another, we do not even know all the channels of communication involved. There is reason, though, to be optimistic. There appears to be a great current increase of interest in nonlinguistic behavior as it may have relevance for human communication (e.g., Sebeok, Hayes, and Bateson, 1964). Most interest seems to have been centered in the auditory and visual channels of communication and those are the channels investigated in the research described in this chapter. Other sense modalities will also bear investigation, however. For example, Geldard (1960) has brought into focus the role of the skin senses in human communication and has presented evidence that the skin may be sensitive to human speech.
Even when the sense modality involved is the auditory, it need not be only speech and speech-related stimuli to which the ear is sensitive. Kellogg (1962), and Rice and Feinstein (1965) have shown that, at least among blind humans, audition can provide a surprising amount of information about the environment.
Employing a technique of echo ranging, Kellogg’s subjects were able to assess accurately the distance, size, and composition of various external objects. The implications for interpersonal communication of these senses and of olfaction, or of even less commonly discussed modalities (e.g., Ravitz, 1950, 1952), are not yet clear but are worthy of more intensive investigation. Since expectancies of another person’s behavior seem often to be communicated to that person unintentionally, the basic experimental paradigm employed in our research program might be employed even if the interest were not in expectancy effects per se. Thus if we were interested in unintentional communication among different groups of psychiatric patients, some could be given expectancies for others’ behavior. Effectiveness of unintentional influence could then be measured by the degree to which other patients were influenced by expectancies held of their behavior. There might be therapeutic as well as theoretical significance to knowing what kind of psychiatric patients were most successful in the unintentional influence of other psychiatric patients. The experiment by Persinger, Knutson, and Rosenthal (described in Rosenthal, 1966) employed such a paradigm. Twelve experimenters administered a standard photo-rating task to 94 neuropsychiatric patients who could be classified as either relatively more anxious than hostile (schizophrenic or neurotic) or as relatively more hostile than anxious (paranoid or character disorder). Each experimenter was led to expect half his subjects to judge the stimulus photos as being of more successful people while the remaining subjects were expected to judge the photos as being of less successful people. What made this experiment unusual was that the experimenters were themselves patients in a mental hospital who had been classified into the same categories as their subjects.
Just as was the case with graduate student and advanced undergraduate student experimenters, mental patient experimenters obtained responses from their mental patient subjects consistent with their expectations (z = +1.88). Our primary interest in this study, however, was to examine the magnitude of unintended influence or communication as a joint function of the experimenters’ and subjects’ nosologies. Results showed that when both experimenters and subjects could be characterized as more anxious than hostile, experimenters showed the greatest positive unintended influence. However, when both experimenters and subjects were characterized more by hostility than anxiety, the predicted unintended communication was least effective. Findings of this kind may have implications for the treatment of psychiatric disorders. The belief is increasing that an important source of informal treatment is the association with other patients. If, as seems likely, such treatment is more unintentional than intentional, then the grouping of patients might be arranged so that patients are put into contact with those other patients with whom they can “talk” best, even if this “talk” be nonlinguistic. Perhaps success as an unintentional influencer of another’s behavior also has relevance for the selection of psychotherapists to work with certain types of patients. The general strategy of trying to “fit the therapist to the patient” has been considered and has aroused considerable interest (e.g., Betz, 1962). That such selection may be made on the basis of unintentional communication patterns may also be suggested. In one recent study, it was found that the degree of hostility in the doctor’s speech was unrelated to his success in getting alcoholic patients to accept treatment. However, when the content of the doctor’s speech was filtered out, the degree of hostility found
in the tone of his voice alone was significantly and negatively related to his success in influencing alcoholics to seek treatment (Milmoe, Rosenthal, Blane, Chafetz, and Wolf, 1965; see also Milmoe, Novey, Kagan, and Rosenthal, 1968). One variable in particular, the “AB” variable, has been employed in a promising series of studies relevant to patient–therapist pairing (Betz, 1962; Berzins and Seidman, 1968; Carson, 1967). There are indications that so-called “A” type therapists (as defined by a paper-and-pencil test) are more effective with more disturbed patients while “B” type therapists are more effective with less disturbed patients. With these indications in mind, we conducted a series of studies in which A and B type experimenters administered the standard photo rating task to subjects under different conditions of expectation. The general prediction was that the differential effectiveness of unintended communication by A and B type experimenters vis-à-vis their subjects would parallel the differential therapeutic effectiveness of A and B type therapists vis-à-vis their patients. Three such studies were conducted (Jenkins, 1966; Persinger, Knutson, and Rosenthal, 1968; Trattner, 1966, 1968). For her sample of college student experimenters and subjects, Jenkins found that B type experimenters showed greater effects of their expectations than did A type experimenters (z = +2.72). Although the literature of the AB variable is addressed more to mental patients than to college students, we need only assume that college students are not so disturbed as schizophrenic patients to have Jenkins’ finding lend some support to the proposition that B type influencers are more effective with less disturbed influencees. In his experiment, Trattner employed psychiatric aides as experimenters and hospitalized schizophrenics as his subjects.
Following the standard procedure, some subjects were represented to their experimenters as success perceivers, others as failure perceivers. When A type experimenters contacted more chronically disturbed (process-type) patients, experimenters showed greater effects of their expectations than when they contacted less chronically disturbed (reactive-type) patients. Similarly consistent with what we might expect on the basis of the AB literature, the B type experimenters were more successful unintentional influencers when they contacted the less chronically disturbed patients than when they contacted the more chronically disturbed patients (interaction z = +1.81). In the study by Persinger et al., a variety of mental patients served as subjects while male and female ward personnel served as experimenters. Once again effectiveness of unintended communication was defined by the degree to which experimenters obtained the responses they had been led to expect. In this experiment patients were not selected on the basis of severity of disturbance but rather on the basis of primary categorization as relatively more anxious than hostile (schizophrenic and neurotic) or as relatively more hostile than anxious (paranoid and character disorder). Results showed that greater expectancy effects were exerted by A type male experimenters and by B type female experimenters when patients were categorized as relatively more anxious. When patients were categorized as relatively more hostile, it was the B type male experimenters and the A type female experimenters who showed the greater unintended effects of their expectations (interaction z = 2.06). The results of these studies lend support to the idea that the AB variable may be important in the prediction of interpersonal influence, but that is not the reason for their having been reported here. Rather the major purpose has been to illustrate the
potential utility to studies of unintended interpersonal influence of the interpersonal expectancy paradigm. It seems to be an uncomplicated procedure to induce in one member of a dyad (A) an expectancy for the behavior of the other member (B). On the basis of the experiments summarized in this chapter, the odds are not unfavorable that the expectancy will be communicated to the other member of the dyad. With that a likely occurrence, the student of processes of covert communication has a focus for his attention. He will be looking for what A does differently in the interaction as a function of what is expected from B. Depending on the tastes and questions of the investigator, either an experimental or an observational approach can be employed. If the investigator were interested, for example, in finding out the proportion of information carried in various channels of communication, the openness of these channels could be systematically varied. If the investigator were more interested in a global description of communication processes, he might permit all channels to remain fully open while trying to give as complete a description as possible of the type and amount of information carried in each channel. This can be partially accomplished by having different observers focus on channels that have been artificially isolated from one another. In the case of sound motion pictures, for example, some observers can be given access only to the silent film while others are given access only to the sound track. Potentially as instructive as the analysis of individual channels of communication may be the analysis of differences between the signals sent through different channels.
Beyond the Laboratory
The vast majority of the experiments summarized in this chapter were conducted in psychological laboratories. Even when the subjects were not sophomores but psychiatric patients, the interaction between experimenter and subject took place in a setting that unquestionably spelled “laboratory.” For reasons quite apparent to any reader of this volume on artifacts in behavioral research, there are considerable advantages to testing laboratory-derived relationships in nonlaboratory settings. The laboratory gives us convenience, the control of variance-increasing variables, and perhaps a sense of security. The world beyond the laboratory gives us inconvenience, frequent increases in error variance, and a feeling of insecurity when mostly we do our work in the basement of the Psychology Building. But if a relationship obtained in the laboratory is to be viewed as uncontaminated by the procedures, subjects, and setting of the laboratory itself, it must be taken out of the artificial light of the lab and examined in the harsher light of the world beyond. In the case of the variable of interpersonal expectancy, if we should want to regard it as a phenomenon of general interest and one not restricted in its implications to the data-collecting work of the behavioral scientist, we must see whether interpersonal expectancies can also be made to show themselves in other interpersonal contexts. The context selected was that of ongoing educational systems (Rosenthal and Jacobson, 1968). All of the children in an elementary school serving a lower socioeconomic status neighborhood were administered a non-verbal test of intelligence. The test was disguised as one that would predict intellectual “blooming.” There were
18 classrooms in the school, three at each of the six grade levels. Within each grade level the three classrooms were composed of children with above average ability, average ability, and below average ability, respectively. Within each of the 18 classrooms approximately 20 per cent of the children were chosen at random to form the experimental group. Each teacher was given the names of the children from her class who were in the experimental condition. The teacher was told that these children had scored on the “test for intellectual blooming” such that they would show remarkable gains in intellectual competence during the next eight months of school. The difference between the experimental group and the control group children, then, was in the minds of the teachers. Eight months later, at the end of the school year, all of the children were retested with the same IQ test. This intelligence test, while relatively nonverbal in the sense of requiring no speaking, reading, or writing, was not entirely nonverbal. Actually there were two subtests: one requiring a greater comprehension of English, a kind of picture vocabulary test; the other requiring less ability to understand any spoken language, but more ability to reason abstractly. For shorthand purposes we refer to the former as a “verbal” subtest and to the latter as a “reasoning” subtest. The pretest correlation between these subtests was +.42. For the school as a whole, the children of the experimental groups showed only a slightly greater gain in verbal IQ (2 points) than did the control group children. However, in total IQ (4 points), and especially in reasoning IQ (7 points), the experimental group children gained appreciably more than did the control group children. When educational theorists have discussed the possible effects of teachers’ expectations, they have usually referred to the children at lower levels of scholastic achievement.
It was interesting, therefore, to find that in the present study, children of the highest level of achievement showed as great a benefit as did the children of the lowest level of achievement of having their teachers expect intellectual gains. At the end of the school year of this study, all teachers were asked to describe the classroom behavior of their pupils. Those children from whom intellectual growth was expected were described as having a significantly better chance of becoming successful in the future, as significantly more interesting, curious, and happy. There was a tendency, too, for these children to be seen as more appealing, adjusted, and affectionate and as lower in the need for social approval. In short, the children from whom intellectual growth was expected became more intellectually alive and autonomous, or at least they were so perceived by their teachers. We have already seen that the children of the experimental group gained more intellectually so that perhaps it was the fact of such gaining that accounted for the more favorable ratings of these children’s behavior and aptitude. But a great many of the control group children also gained in IQ during the course of the year. We might expect that those who gained more intellectually among these undesignated children would also be rated more favorably by their teachers. Such was not the case. The more the control group children gained in IQ the more they were regarded as less well adjusted, as less interesting, and as less affectionate. From these results, it would seem that when children who are expected to grow intellectually do so, they are considerably benefited in other ways as well. When children who are not especially expected to develop intellectually do so, they seem either to show accompanying undesirable behavior or at least are perceived by their teachers as showing such
undesirable behavior. If a child is to show intellectual gain it seems to be better for his real or perceived intellectual vitality and for his real or perceived mental health if his teacher has been expecting him to grow intellectually. It appears worthwhile to investigate further the proposition that there may be hazards to unpredicted intellectual growth. A closer analysis of these data, broken down by whether the children were in the high, medium, or low ability tracks or groups, showed that these effects of unpredicted intellectual growth were due primarily to the children of the low ability group. When these slow track children were in the control group so that no intellectual gains were expected of them, they were rated more unfavorably by their teachers if they did show gains in IQ. The greater their IQ gains, the more unfavorably were they rated, both as to mental health and as to intellectual vitality. Even when the slow track children were in the experimental group (so that IQ gains were expected of them) they were not rated as favorably relative to their control group peers as were the children of the high or medium track despite the fact that they gained as much in IQ relative to the control group children as did the experimental group children of the high group. It may be difficult for a slow track child, even one whose IQ is rising, to be seen by his teacher as a well-adjusted child and as a potentially successful child, intellectually. The effects of teacher expectations had been most dramatic when measured in terms of pupils’ gains in reasoning IQ. These effects on reasoning IQ, however, were not uniform for boys and girls. Although all the children of this lower socioeconomic status school gained dramatically in IQ, it was only among the girls that greater gains were shown by those who were expected to bloom compared to the children of the control group. 
Among the boys, those who were expected to bloom gained less than did the children of the control group (interaction F = 9.27, p = .003). In part to check this finding, the experiment originally conducted on the West Coast was repeated in a small Midwestern town (Rosenthal and Evans, 1968). This time the children were from substantial middle-class backgrounds, and this time the results were completely and significantly reversed. Now it was the boys who showed the benefits of favorable teacher expectations. Among the girls, those who were expected to bloom intellectually gained less in reasoning IQ than did the girls of the control group (interaction F = 9.10, p = .003). Just as in the West Coast experiment, however, all the children showed substantial gains in IQ. These results, while they suggest the potentially powerful effects of teacher expectations, also indicate the probable complexity of these effects as a function of pupils’ sex, social class, and, as time will no doubt show, other variables as well. In both the experiments described, IQ gains were assessed after a full academic year had elapsed. However, the results of another experiment suggest that teacher expectations can significantly affect students’ intellectual performance in a period as short as 2 months (Anderson and Rosenthal, 1968). In this small experiment, the 25 children were mentally retarded boys with an average pretest IQ of 46. Expectancy effects were significant only for reasoning IQ and only in interaction with membership in a group receiving special remedial reading instruction in addition to participating in the school’s summer day camp program (p < .03). Among these specially tutored boys, those who were expected to bloom showed an expectancy disadvantage of nearly 12 IQ points; among the untutored boys who were participating only in the school’s summer day camp program, those who were expected to bloom showed
an expectancy advantage of just over three IQ points. (For verbal IQ, in contrast, the expectancy disadvantage of the tutored boys was less than one IQ point, while the expectancy advantage for the untutored boys was over two points). The results described were based on posttesting only two months after the initiation of the experiment. Follow-up testing was undertaken seven months after the end of the basic experiment. In reasoning IQ, the boys who had been both tutored and expected to bloom intellectually made up the expectancy disadvantage they had shown after just two months. Now, their performance change was just like that of the control group children, both groups showing an IQ loss of four points over the nine month period. Compared to these boys who had been given both or neither of the two experimental treatments, the boys who had been given either tutoring or the benefit of favorable expectations showed significantly greater gains in reasoning IQ scores (p < .025). Relative to the control group children, those who were tutored showed a 10 point advantage while those who were expected to bloom showed a 12 point advantage. While both tutoring and a favorable teacher expectation were effective in raising relative IQ scores, it appeared that when these two treatments were applied simultaneously, they were ineffective in producing IQ gains over the period from the beginning of the experiment to the nine month follow-up. One possible explanation of this finding is that the simultaneous presence of both treatments led the boys to perceive too much pressure. The same pattern of results reported for reasoning IQ was also obtained when verbal IQ and total IQ were considered, though the interaction was significant only in the case of total IQ (p < .03). In the experiment under discussion, a number of other measures of the boys’ behavior were available as were observations of the day-camp counselors’ behavior toward the boys. 
Preliminary analysis suggests that boys who had been expected to bloom intellectually were given less attention (p = .09) by the counselors and developed a greater degree of independence (p < .02) compared to the boys of the control group. Another study, this time conducted in an East Coast school with upper middle class pupils, again showed the largest effect of teachers’ expectancies to occur when the measure was of reasoning IQ (Conn, Edwards, Rosenthal, and Crowne, 1968). In this study, both the boys and the girls who were expected to bloom intellectually showed greater gains in reasoning IQ than did the boys and girls of the control group, and the magnitude of the expectancy effect favored the girls very slightly. Also in this study, we had available a measure of the children’s accuracy in judging the vocal expressions of emotion of adult speakers. It was of considerable theoretical interest to find that greater benefits of favorable teacher expectations accrued to those children who were more accurate in judging the emotional tone expressed in an adult female’s voice. These findings, taken together with the research of Adair and Epstein (1967) described earlier, give a strong suggestion that vocal cues may be important in the covert communication of interpersonal expectations. In all the experiments described so far, the same IQ measure was employed, the Flanagan (1960) Tests of General Ability. Also employing the same instrument with his sample of first graders, Claiborn (1968) found a tendency (z = −1.45) for children he designated as potential bloomers to gain less in IQ than the children of the control group. With fifth grade boys as his subjects and males as teachers, Pitt (1956) found no effect on achievement scores of arbitrarily adding or subtracting ten IQ points to the children’s records. In her study, Heiserman (1967) found no effect of teacher expectations on her 7th graders’ stated levels of occupational aspiration.
There have been two studies in which teachers’ expectations were varied not for specific children within a classroom but rather for classrooms as a whole (Biegen, 1968; Flowers, 1966). In both cases, the performance gains were greater for those classrooms expected by their teachers to show the better performance. A radically different type of performance measure was employed in the research by Burnham (1968); not intelligence or scholastic achievement this time, but swimming ability. His subjects were boys and girls aged 7–14 attending a summer camp for the disadvantaged. None of the children could swim at the beginning of the two week experimental period. Half the children were alleged by the camp staff to have shown unusual potential for learning to swim as judged from a battery of psychological tests. Children were, of course, assigned to the “high potential” group at random. At the end of the two week period of the experiment all the children were retested on the standard Red Cross Beginner Swimmer Test. Those children who had been expected to show greater improvement in swimming ability showed greater improvement than did the children of the control group. We may conclude now with the brief description of just one more experiment, this one conducted by Beez (1968), who kindly made his data available for the analyses to follow. This time the pupils were 60 preschoolers from a summer Headstart program. Each child was taught the meaning of a series of symbols by one teacher. Half the 60 teachers had been led to expect good symbol-learning and half had been led to expect poor symbol-learning. Most (77 per cent) of the children alleged to have better intellectual prospects learned five or more symbols, but only 13 per cent of the children alleged to have poorer intellectual prospects learned five or more symbols (p < 2 in one million).
In this study the children’s actual performance was assessed by an experimenter who did not know what the child’s teacher had been told about the child’s intellectual prospects. Teachers who had been given favorable expectations about their pupil tried to teach more symbols to their pupil than did the teachers given unfavorable expectations about their pupil. The difference in teaching effort was dramatic. Eight or more symbols were taught by 87 per cent of the teachers expecting better performance, but only 13 per cent of the teachers expecting poorer performance tried to teach that many symbols to their pupil (p < 1 in 10 million). These results suggest that a teacher’s expectation about a pupil’s performance may sometimes be translated not into subtle vocal nuances but rather into overt and even dramatic alterations in teaching style. The magnitude of the effect of teacher expectations found by Beez is also worthy of comment. In all the earlier studies described, one group of children had been singled out for favorable expectations while nothing was said of the remaining children of the control group. In Beez’ short-term experiment, it seemed more justified to give negative as well as positive expectations about some of the children. Perhaps the very large effects of teacher expectancy obtained by Beez were due to the creation of strong equal but opposite expectations in the minds of the different teachers. Since strong negative expectations doubtless exist in the real world of classrooms, Beez’s procedure may give the better estimate of the effects of teacher expectations as they occur in everyday life. In the experiment by Beez, it seems clear that the dramatic differences in teaching style accounted at least in part for the dramatic differences in pupil learning. However, not all of the obtained differences in learners’ learning were due to the differences in teachers’ teaching. Within each condition of teacher expectation, for
Table 6–23 Expectancy Effects in Educational Settings

Study                         Year    Directional standard    Dependent variable
                                      normal deviate
 1. Anderson and Rosenthal    1968      .00ᵃ                  Total IQ
 2. Beez                      1968    +4.67                   Symbol learning
 3. Biegen                    1968    +1.83                   Achievement
 4. Burnhamᵇ                  1968    +2.61ᵃ                  Swimming skill
 5. Claiborn                  1968    −1.45ᵃ                  Total IQ
 6. Conn, et al.              1968      .00ᵃ                  Total IQ
 7. Flowers                   1966    +1.60                   Achievement + IQ
 8. Heiserman                 1967      .00                   Aspiration
 9. Pitt                      1956      .00                   Achievement
10. Rosenthal and Evans       1968      .00ᵃ                  Total IQ
11. Rosenthal and Jacobson    1968    +2.11ᵃ                  Total IQ

    Sum                               +11.37
    √11                                 3.32
    z                                 +3.42
    p                                   .00033

ᵃ Indicates that teacher expectancy interacted with another variable at |z| ≥ 1.28.
ᵇ See also Burnham and Hartsough (1968).
example, there was no relationship between number of symbols taught and number of symbols learned. In addition, it was also possible to compare the performances of just those children of the two conditions who had been given an exactly equal amount of teaching benefit. Even holding teaching benefits constant, the difference favored the children believed to be superior (t = 2.89, p < .005, one-tail) though the magnitude of the effect was now diminished by nearly half. We have now seen at least a brief description of 11 studies of the effects of interpersonal expectancies in natural learning situations. That is too many to hold easily in mind and Table 6-23 provides a convenient summary. For each experiment, the directional standard normal deviate is given as well as a brief identification of the dependent variables employed. As has been the custom in this chapter, a standard normal deviate greater than −1.28 and smaller than +1.28 has been recorded as zero. Of the five experiments tabulated as showing no main effect of teacher expectation, it should be noted that three of them showed significant interactions of teacher expectation with some other primary variable such as special tutoring (study 1), accuracy of emotion perception (6), and sex of pupil (10). The combined one-tail p of the main effects of teacher expectancy in the studies shown in Table 6-23 is less than 1 in 3,000. It would take an additional 37 studies of a mean associated z value of .00 to bring the overall combined p to above .05.⁴
⁴ Combining the ps of the 105 studies of Table 6-24 together with the results of the nine studies of footnote 2 gives a grand sum z of +112.23. The overall z associated with this set of results is +10.51 and 4,540 new experiments with a mean z of .00 are required to bring the overall p level to .05. (As this chapter went to press, the results of another study of teacher expectation effects became available. Meichenbaum, Bowers, and Ross, at the University of Waterloo, found that favorable teacher expectations led to a significant increase in the appropriateness of classroom behavior of a sample of adolescent female offenders (df = 12, z = +2.02).)
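The combined probabilities quoted in the text and in the footnote can be retraced with a short calculation. The sketch below is not the authors' own procedure in code form. It assumes the combination rule shown at the foot of Table 6-23 (the sum of the z values divided by the square root of the number of studies, often called the Stouffer method) and it assumes that the count of additional null studies is the number k at which sum z / sqrt(N + k) falls to the one-tail .05 cutoff of about 1.645. The function names are hypothetical. Results may differ from the printed values in the last digit, since the printed values appear to rest on rounded intermediate figures.

```python
import math
from statistics import NormalDist

# Directional z values for studies 1-11 as listed in Table 6-23.
Z_EDUCATIONAL = [0.00, 4.67, 1.83, 2.61, -1.45, 0.00, 1.60, 0.00, 0.00, 0.00, 2.11]


def combined_z(sum_z, n_studies):
    """Sum of z values divided by the square root of the number of studies."""
    return sum_z / math.sqrt(n_studies)


def one_tail_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


def null_studies_needed(sum_z, n_studies, alpha=0.05):
    """Extra studies with mean z = 0 at which the combined one-tail p equals alpha.

    Assumption: solves sum_z / sqrt(n_studies + k) = z_crit for k.
    """
    z_crit = NormalDist().inv_cdf(1.0 - alpha)
    return (sum_z / z_crit) ** 2 - n_studies


# The 11 educational studies of Table 6-23.
sum_z = sum(Z_EDUCATIONAL)
z_edu = combined_z(sum_z, len(Z_EDUCATIONAL))
print("sum z:", round(sum_z, 2))
print("combined z:", round(z_edu, 2))
print("one-tail p:", round(one_tail_p(z_edu), 5))
print("null studies needed:", math.ceil(null_studies_needed(sum_z, len(Z_EDUCATIONAL))))

# The 114 studies of the footnote (105 tabulated plus nine more).
print("overall z:", round(combined_z(112.23, 114), 2))
print("null studies needed:", round(null_studies_needed(112.23, 114)))
```

Working from the grand sum rather than from individual p values keeps the arithmetic transparent: adding studies with z = 0 leaves the numerator unchanged and only enlarges the denominator, which is why so many null results are needed to offset a large sum.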
Table 6–24 Percentage of Studies of Expectancy Effects in Laboratories and Educational Settings Obtaining Results at Specified p Levels

p               Laboratories    Educational settings
                (N = 94)        (N = 11)
.10             50%             45%
.05             35%             36%
.01             17%             18%
.001            12%              9%
.0001            5%              9%
.00001           3%              9%
.000001          2%              0%

Grand Sum z     +95.27          +11.37
Mean z           +1.01           +1.03
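For readers who wish to relate the p levels of Table 6-24 to the z values used elsewhere in the chapter, the following sketch converts each level to the standard normal deviate a study must reach to be counted in that row, and recomputes the two mean z entries from the grand sums. It assumes the tabled p levels are one-tail, which is consistent with the chapter's use of 1.28 as the cutoff for recording a nonzero z; the function names are hypothetical.

```python
from statistics import NormalDist

# The p levels listed in the first column of Table 6-24.
P_LEVELS = [0.10, 0.05, 0.01, 0.001, 0.0001, 0.00001, 0.000001]


def z_cutoff(p_one_tail):
    """Standard normal deviate whose upper-tail probability equals p_one_tail."""
    return NormalDist().inv_cdf(1.0 - p_one_tail)


def mean_z(grand_sum_z, n_studies):
    """Average z per study, as in the last row of the table."""
    return grand_sum_z / n_studies


for p in P_LEVELS:
    print(f"p = {p:g}: z of at least {z_cutoff(p):.2f}")

print(f"laboratories: mean z = {mean_z(95.27, 94):.2f}")
print(f"educational settings: mean z = {mean_z(11.37, 11):.2f}")
```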
Shall we view this set of experiments in natural learning situations in isolation or would it be wiser to see them simply as more of the same type of experiment that has been discussed throughout this chapter? Since the type of experimental manipulation involved in the laboratory studies is essentially the same as that employed in the studies beyond the laboratory, it seems more parsimonious to view all the studies as members of the same set. If, in addition to the communality of experimental procedures, we find it plausible to conclude a communality of outcome patterns between the laboratory and field experiments, perhaps we can have the greater convenience and power of speaking of just one type of effect of interpersonal expectancy. Table 6-24 allows each reader to make his own test for goodness of fit. At each level of p, we find the proportion of laboratory and educational studies reaching that or a lower level of p. The agreement between the two types of studies appears to be remarkably close. In addition, the mean z value obtained from the 94 laboratory experiments is nearly identical to that obtained from the 11 studies of educational settings. If there were a systematic difference in sample size between studies conducted in laboratories and those involving teachers, then we might expect to find that for similar zs the average effects would be smaller in magnitude for the set comprised of larger sample sizes. For this reason, it seems necessary also to compare magnitudes of expectancy effect for studies involving experimenters and teachers. Table 6-25 shows this comparison. For 57 studies of experimenters and for six studies of teachers, it was possible to calculate the proportion of each that was affected in the predicted direction by their expectancy. Again the agreement is very good. Depending upon the particular method of computation selected, about 7 out of 10
Table 6–25 Proportions of Experimenters and Teachers Showing Expectancy Effects
                                        Experimenters    Teachers
Number of Studies                       57               6
Median z                                .00              .00
Number of Es or Ts                      523              115
Mean N per Study                        9                19
Weighted Percent of Biased Es or Ts     69%              75%
Median Percent of Biased Es or Ts       75%              66%
experimenters or 7 out of 10 teachers can be expected to show the effects of their expectation on the performance of their subjects or pupils. This chapter began its discussion of interpersonal expectancy effects by suggesting that the expectancy of the behavioral researcher might function as a self-fulfilling prophecy. This unintended effect of the investigator’s research hypothesis must be regarded as a potentially damaging artifact. But interpersonal self-fulfilling prophecies do not operate only in laboratories and while, when there, they may act as artifacts, they are more than that. Interpersonal expectancy effects occur also among teachers and, there seems no reason to doubt it, among others as well. What started life as an artifact continues as an interpersonal variable of theoretical and practical interest. Today’s artifact, as Bill McGuire so wisely said, is tomorrow’s main effect; and tomorrow is today.
References
Adair, J. G. Demand characteristics or conformity? Suspiciousness of deception and experimenter bias in conformity research. Unpublished manuscript, University of Manitoba, 1968.
Adair, J. G., and Epstein, J. Verbal cues in the mediation of experimenter bias. Paper presented at the meeting of the Midwestern Psychological Association, Chicago, May, 1967.
Adler, N. E. The influence of experimenter set and subject set on the experimenter expectancy effect. Unpublished AB thesis, Wellesley College, 1968.
Allport, G. W. The role of expectancy. In H. Cantril (Ed.), Tensions that cause wars. Urbana, Illinois: University of Illinois Press, 1950, 43–78.
Anderson, D. F., and Rosenthal, R. Some effects of interpersonal expectancy and social interaction on institutionalized retarded children. Proceedings of the 76th Annual Convention of the American Psychological Association, 1968, 479–480.
Barber, T. X., Calverley, D. S., Forgione, A., McPeake, J. D., Chaves, J. F., and Bowen, B. Five attempts to replicate the experimenter bias effect. Unpublished manuscript, Harding, Mass.: Medfield Foundation, 1967.
Barber, T. X., and Silver, M. J. Fact, fiction, and the experimenter bias effect. Psychological Bulletin Monograph Supplement, 1968, 70, 1–29.
Barnard, P. G. Interaction effects among certain experimenter and subject characteristics on a projective test. Journal of Consulting and Clinical Psychology, 1968, 32, 514–521.
Becker, H. G. Experimenter expectancy, experience, and status as factors in observational data. Unpublished master’s thesis, University of Saskatchewan, 1968.
Beez, W. V. Influence of biased psychological reports on teacher behavior and pupil performance. Proceedings of the 76th Annual Convention of the American Psychological Association, 1968, 605–606.
Berzins, J. I., and Seidman, E. Differential therapeutic responding of A and B quasi-therapists to schizoid and neurotic communications. Unpublished manuscript, University of Kentucky, 1968.
Betz, J. Experiences in research in psychotherapy with schizophrenic patients. In H. H. Strupp and L. Luborsky (Eds.), Research in psychotherapy. Washington, D.C.: American Psychological Association, 1962, 41–60.
Biegen, D. A. Unpublished data. University of Cincinnati, 1968.
Bootzin, B. R. The experimenter: a credibility gap in psychology. Unpublished manuscript, Purdue University, 1968.
Boring, E. G. A history of experimental psychology, (2nd ed.). New York: Appleton-Century-Crofts, 1950.
Bradley, W. H. Unmineralized fossil bacteria: A retraction. Science, 1968, 160, 437.
Burnham, J. R. Experimenter bias and lesion labeling. Unpublished manuscript, Purdue University, 1966.
Burnham, J. R. Effects of experimenter’s expectancies on children’s ability to learn to swim. Unpublished master’s thesis, Purdue University, 1968.
Burnham, J. R., and Hartsough, D. M. Effect of experimenter’s expectancies (“the Rosenthal effect”) on children’s ability to learn to swim. Paper presented at the meeting of the Midwestern Psychological Association, Chicago, May, 1968.
Carlson, J. A., and Hergenhahn, B. R. Use of tape-recorded instructions and a visual screen to reduce experimenter bias. Unpublished manuscript, Hamline University, 1968.
Carlson, R. L., and Armelagos, G. J. Cradleboard hoods, not corsets. Science, 1965, 149, 204–205.
Carson, R. C. A and B therapist “types”: A possible critical variable in psychotherapy. Journal of Nervous and Mental Diseases, 1967, 144, 47–54.
Cieutat, V. J. Examiner differences with the Stanford-Binet IQ. Perceptual and Motor Skills, 1965, 20, 317–318.
Cieutat, V. J., and Flick, G. L. Examiner differences among Stanford-Binet items. Psychological Reports, 1967, 21, 613–622.
Claiborn, W. L. An investigation of the relationship between teacher expectancy, teacher behavior and pupil performance. Unpublished doctoral dissertation, Syracuse University, 1968.
Conn, L. K., Edwards, C. N., Rosenthal, R., and Crowne, D. Perception of emotion and response to teachers’ expectancy by elementary school children. Psychological Reports, 1968, 22, 27–34.
Connors, A. M. Two experimenter behaviors as mediators of experimenter expectancy. Unpublished AB thesis, Harvard University, 1968.
Connors, A. M., and Horst, L. The relationship between subjects’ unbiased response tendencies and subsequent responses under two conditions of experimenter expectancy. Unpublished manuscript, Harvard University, 1966.
Cooper, J., Eisenberg, L., Robert, J., and Dohrenwend, B. S. The effect of experimenter expectancy and preparatory effort on belief in the probable occurrence of future events. Journal of Social Psychology, 1967, 71, 221–226.
Cordaro, L., and Ison, J. R. Observer bias in classical conditioning of the planarian. Psychological Reports, 1963, 13, 787–789.
Crowne, D. P., and Marlowe, D. The approval motive. New York: Wiley, 1964.
Duncan, S., and Rosenthal, R. Vocal emphasis in experimenters’ instruction reading as unintended determinant of subjects’ responses. Language and Speech, 1968, 11, Part 1, 20–26.
Escalona, S. K. Feeding disturbances in very young children. American Journal of Orthopsychiatry, 1945, 15, 76–80.
Flanagan, J. C. Test of general ability: technical report. Chicago: Science Research Associates, 1960.
Flowers, C. E. Effects of an arbitrary accelerated group placement on the tested academic achievement of educationally disadvantaged students. Unpublished doctoral dissertation, Teachers College, Columbia University, 1966.
Fode, K. L. The effects of experimenters’ anxiety, and subjects’ anxiety, social desirability and sex, on experimenter outcome-bias. Unpublished doctoral dissertation, University of North Dakota, 1967.
Friedman, N. The social nature of psychological research. New York: Basic Books, 1967.
Friedman, N., Kurland, D., and Rosenthal, R. Experimenter behavior as an unintended determinant of experimental results. Journal of Projective Techniques and Personality Assessment, 1965, 29, 479–490.
Gall, M., and Mendelsohn, G. A. Effects of facilitating techniques and subject-experimenter interaction on creative problem solving. Unpublished manuscript, University of California, Berkeley, 1966.
Gardner, M. Dermo-optical perception: A peek down the nose. Science, 1966, 151, 654–657.
Geldard, F. A. Some neglected possibilities of communication. Science, 1960, 131, 1583–1588.
Getter, H., Mulry, R. C., Holland, C., and Walker, P. Experimenter bias and the WAIS. Unpublished data, University of Connecticut, 1967.
Glixman, A. F. Psychology of the scientist: XXII. Effects of examiner, examiner-sex, and subject-sex upon categorizing behavior. Perceptual and Motor Skills, 1967, 24, 107–117.
Goldblatt, R. A., and Schackner, R. A. Categorizing emotion depicted in facial expressions and reaction to the experimental situation as a function of experimenter “friendliness.” Paper presented at the meeting of the Eastern Psychological Association, Washington, D.C., April, 1968.
Hartry, A. Experimenter bias in planaria conditioning. Paper presented at the meeting of the Western Psychological Association, Long Beach, April, 1966.
Heiserman, M. S. The relationship between teacher expectations and pupil occupational aspirations. Unpublished master’s thesis, Iowa State University, Ames, 1967.
Honorton, C. Review of C. E. M. Hansel’s ESP: A scientific evaluation. New York: Scribner’s, 1966. Journal of Parapsychology, 1967, 31, 76–82.
Horn, C. H. The field dependent and field independent person’s response to experimenter bias. Unpublished manuscript, George Washington University, 1968.
Horst, L. Research in the effect of the experimenter’s expectancies: a laboratory model of social influence. Unpublished manuscript, Harvard University, 1966.
Hurwitz, S., and Jenkins, V. The effects of experimenter expectancy on performance of simple learning tasks. Unpublished manuscript, Harvard University, 1966.
Ingraham, L. H., and Harrington, G. M. Experience of E as a variable in reducing experimenter bias. Psychological Reports, 1966, 19, 455–461.
Jenkins, V. The unspoken word: A study in non-verbal communication. Unpublished AB thesis, Harvard University, 1966.
Johnson, R. W. Subject performance as affected by experimenter expectancy, sex of experimenter, and verbal reinforcement. Unpublished master’s thesis, University of New Brunswick, 1967.
Jourard, S. M. Project replication: Experimenter-subject acquaintance and outcome in psychological research. Unpublished manuscript, University of Florida, 1968.
Kellogg, W. N. Sonar system of the blind. Science, 1962, 137, 399–404.
Kennedy, J. J., Cook, P. A., and Brewer, R. R. An examination of the effects of three selected experimenter variables in verbal conditioning research. Unpublished manuscript, University of Tennessee, 1968.
Kennedy, J. J., Edwards, B. C., and Winstead, J. C. The effects of experimenter outcome expectancy in a verbal conditioning situation: A failure to detect the “Rosenthal Effect.” Unpublished manuscript, University of Tennessee, 1968.
Kennedy, J. L., and Uphoff, H. F. Experiments on the nature of extrasensory perception: III. The recording error criticism of extrachance scores. Journal of Parapsychology, 1939, 3, 226–245.
Kintz, B. L., Delprato, D. J., Mettee, D. R., Parsons, C. E., and Schappe, R. H. The experimenter as a discriminative stimulus in a T-maze. Psychological Record, 1965, 15, 449–454.
Kleinmuntz, B., and McLean, R. S. Computers in behavioral science: Diagnostic interviewing by digital computer. Behavioral Science, 1968, 13, 75–80.
Klinger, E. Modeling effects on achievement imagery. Journal of Personality and Social Psychology, 1967, 7, 49–62.
Krasner, L., and Ullman, L. P. (Eds.) Research in behavior modification: New developments and implications. New York: Holt, Rinehart and Winston, 1965.
Larrabee, L. L., and Kleinsasser, L. D. The effect of experimenter bias on WISC performance. Unpublished paper. Psychological Associates, St. Louis, 1967.
Laszlo, J. P., and Rosenthal, R. Subject dogmatism, experimenter status, and experimenter expectancy effects. Unpublished manuscript, Harvard University, 1967.
Malmo, R. B., Boag, T. J., and Smith, A. A. Physiological study of personal interaction. Psychosomatic Medicine, 1957, 19, 105–119.
Marcia, J. E. Hypothesis-making, need for social approval, and their effects on unconscious experimenter bias. Unpublished master’s thesis, Ohio State University, 1961.
Marwit, S. J. An investigation of the communication of tester-bias by means of modeling. Unpublished doctoral dissertation, State University of New York at Buffalo, 1968.
Marwit, S. J., and Marcia, J. E. Tester bias and response to projective instruments. Journal of Consulting Psychology, 1967, 31, 253–258.
Masling, J. The influence of situational and interpersonal variables in projective testing. Psychological Bulletin, 1960, 57, 65–85.
Masling, J. Differential indoctrination of examiners and Rorschach responses. Journal of Consulting Psychology, 1965, 29, 198–201.
Masling, J. Role-related behavior of the subject and psychologist and its effects upon psychological data. In D. L. Levine (Ed.), Nebraska Symposium on Motivation, Lincoln, Nebraska: University of Nebraska Press, 1966. 67–103.
McFall, R. M. “Unintentional communication”: The effect of congruence and incongruence between subject and experimenter constructions. Unpublished doctoral dissertation, Ohio State University, 1965.
McGuigan, F. J. The experimenter: A neglected stimulus object. Psychological Bulletin, 1963, 60, 421–428.
Merton, R. K. The self-fulfilling prophecy. Antioch Review, 1948, 8, 193–210.
Miller, G. A., Bregman, A. S., and Norman, D. A. The computer as a general purpose device for the control of psychological experiments. In R. W. Stacy and B. D. Waxman (Eds.), Computers in biomedical research. Vol. I. New York: Academic Press, 1965. 467–490.
Milmoe, S., Novey, M. S., Kagan, J., and Rosenthal, R. The mother’s voice: Postdictor of aspects of her baby’s behavior. Proceedings of the 76th Annual Convention of the American Psychological Association, 1968, 463–464.
Milmoe, S., Rosenthal, R., Blane, H. T., Chafetz, M. E., and Wolf, I. The doctor’s voice: postdictor of successful referral of alcoholic patients. Journal of Abnormal Psychology, 1967, 72, 78–84.
Minor, M. W. Experimenter expectancy effect as a function of evaluation apprehension. Unpublished doctoral dissertation, University of Chicago, 1967.
Minor, M. W. Unpublished data, University of Chicago, 1967. (a)
Mintz, N. On the psychology of aesthetics and architecture. Unpublished manuscript, Brandeis University, 1957.
Moffat, M. C. Unpublished data. University of British Columbia, 1966.
Mosteller, F., and Bush, R. R. Selected quantitative techniques. In G. Lindzey (Ed.) Handbook of social psychology, Vol. I. Cambridge, Mass.: Addison-Wesley, 1954. 289–334.
Mosteller, F., and Tukey, J. W. Data analysis, including statistics. Unpublished manuscript, Harvard University, 1965.
Müller, W., and Timaeus, E. Conformity behavior and experimenter bias. Unpublished manuscript, University of Cologne, 1967.
Nichols, M. Data desirability and experimenter expectancy as unintended determinants of experimental results. Unpublished data, Harvard University, 1967.
Peel, W. C. Jr. The influence of the examiner’s expectancy and level of anxiety on the subject’s responses to the Holtzman Inkblots. Unpublished master’s thesis, Memphis State University, 1967.
Persinger, G. W. The effect of acquaintanceship on the mediation of experimenter bias. Unpublished master’s thesis, University of North Dakota, 1962.
Persinger, G. W., Knutson, C., and Rosenthal, R. Communication of interpersonal expectations among neuropsychiatric patients. Unpublished data, Harvard University, 1966.
Persinger, G. W., Knutson, C., and Rosenthal, R. Communication of interpersonal expectations of ward personnel to neuropsychiatric patients. Unpublished data, Harvard University, 1968.
Pflugrath, J. Examiner influence in a group testing situation with particular reference to examiner bias. Unpublished master’s thesis, University of North Dakota, 1962.
Pfungst, O. Clever Hans (the horse of Mr. von Osten): a contribution to experimental, animal, and human psychology. (Translated by C. L. Rahn) New York: Holt, 1911. Republished by Holt, Rinehart and Winston, 1965.
Pitt, C. C. V. An experimental study of the effects of teachers’ knowledge or incorrect knowledge of pupil IQ’s on teachers’ attitudes and practices and pupils’ attitudes and achievement. Unpublished doctoral dissertation, Columbia University, 1956.
Polanyi, M. The society of explorers. Encounter, 1967.
Raffetto, A. M. Experimenter effects on subjects’ reported hallucinatory experiences under visual and auditory deprivation. Unpublished master’s thesis, San Francisco State College, 1967.
Raffetto, A. M. Experimenter effects on subjects’ reported hallucinatory experiences under visual and auditory deprivation. Paper presented at the meeting of the Midwestern Psychological Association, Chicago, May, 1968.
Ravitz, L. J. Electrometric correlates of the hypnotic state. Science, 1950, 112, 341–342.
Ravitz, L. J. Electrocyclic phenomena and emotional states. Journal of Clinical and Experimental Psychopathology, 1952, 13, 69–106.
Rice, C. E., and Feinstein, S. H. Sonar system of the blind: size discrimination. Science, 1965, 148, 1107–1108.
Riecken, H. W. A program for research on experiments in social psychology. In N. F. Washburne (Ed.), Decisions, values, and groups, Vol. 2. New York: Pergamon Press, 1962. 25–41.
Rosenthal, R. On the social psychology of the psychological experiment: The experimenter’s hypothesis as unintended determinant of experimental results. American Scientist, 1963, 51, 268–283.
Rosenthal, R. The effect of the experimenter on the results of psychological research. In B. A. Maher (Ed.) Progress in experimental personality research. Vol. 1. New York: Academic Press, 1964. 79–114. (a)
208
Book One – Artifact in Behavioral Research Rosenthal, R. Experimenter outcome-orientation and the results of the psychological experiment. Psychological Bulletin, 1964, 61, 405–412. (b) Rosenthal R. Clever Hans: a case study of scientific method. Introduction to Pfungst, O. Clever Hans: (the horse of Mr. von Osten). New York: Holt, Rinehart and Winston, 1965. ix–xlii. Rosenthal, R. Experimenter effects in behavioral research. New York: Appleton-Century-Crofts, 1966. Rosenthal, R. Covert communication in the psychological experiment. Psychological Bulletin, 1967, 67, 356–367. (a) Rosenthal, R. Experimenter expectancy, experimenter experience, and Pascal’s Wager. Psychological Reports, 1967, 20, 619–622. (b) Rosenthal, R. Experimenter expectancy, one tale of Pascal, and the distribution of three tails. Psychological Reports, 1967, 21, 517–520. (c) Rosenthal, R. The eternal triangle: Investigators, data, and the hypotheses called null. Unpublished manuscript, Harvard University, 1967. (d) Rosenthal, R. Experimenter expectancy and the reassuring nature of the null hypothesis decision procedure. Psychological Bulletin Monograph Supplement, 1968, 70, 30–47. (a) Rosenthal, R. On not so replicated experiments and not so null results. Journal of Consulting and Clinical Psychology, 1968, in press. (b) Rosenthal, R., and Evans, J. Unpublished data, Harvard University, 1968. Rosenthal, R. and Fode, K. L. The effect of experimenter bias on the performance of the albino rat. Behavioral Science, 1963, 8, 183–189. (a) Rosenthal, R., and Fode, K. L. Three experiments in experimenter bias. Psychological Reports, 1963, 12, 491–511. (b) Rosenthal, R., Friedman, C. J., Johnson, C. A., Fode, K. L., Schill, T. R., White, C. R., and VikanKline, L. L. Variables affecting experimenter bias in a group situation. Genetic Psychology Monographs, 1964, 70, 271–296. Rosenthal, R., Friedman, N., and Kurland, D. 
Instruction-reading behavior of the experimenter as an unintended determinant of experimental results. Journal of Experimental Research in Personality, 1966, 1, 221–226. Rosenthal, R., and Hall, C. M. Computational errors in behavioral research. Unpublished data, Harvard University, 1968. Rosenthal, R., and Jacobson, L. Pygmalion in the classroom: Teacher expectation and pupils’ intellectual development. New York: Holt, Rinehart and Winston, 1968. Rosenthal, R., Kohn, P., Greenfield, P. M., and Carota, N. Experimenters’ hypothesis-confirmation and mood as determinants of experimental results. Perceptual and Motor Skills, 1965, 20, 1237–1252. Rosenthal, R., and Lawson, R. A longitudinal study of the effects of experimenter bias on the operant learning of laboratory rats. Journal of Psychiatric Research, 1964, 2, 61–72. Rosenthal, R., and Persinger, G. W. Subjects’ prior experimental experience and experimenters’ outcome consciousness as modifiers of experimenter expectancy effects. Unpublished manuscript, Harvard University, 1968. Rosenthal, R., Persinger, G. W., Mulry, R. C., Vikan-Kline, L. L., and Grothe, M. Changes in experimental hypotheses as determinants of experimental results. Journal of Projective Techniques and Personality Assessment, 1964, 28, 465–469. (a) Rosenthal, R., Persinger, G. W., Mulry, R. C., Vikan-Kline, L. L., and Grothe, M. Emphasis on experimental procedure, sex of subjects, and the biasing effects of experimental hypotheses. Journal of Projective Techniques and Personality Assessment, 1964, 28, 470–473. (b) Rosenthal, R., Persinger, G. W., Vikan-Kline, L. L., and Fode, K. L. The effect of early data returns on data subsequently obtained by outcome-biased experimenters. Sociometry, 1963, 26, 487–498. (a) Rosenthal, R., Persinger, G. W., Vikan-Kline, L. L., and Fode, K. L. The effect of experimenter outcome-bias and subject set on awareness in verbal conditioning experiments. Journal of Verbal Learning and Verbal Behavior, 1963, 2, 275–283. 
(b) Rosenthal, R., Persinger, G. W., Vikan-Kline, L. L., and Mulry, R. C. The role of the research assistant in the mediation of experimenter bias. Journal of Personality, 1963, 31, 313–335. Roth, J. A. Hired hand research. American Sociologist, 1965, 1, 190–196. Sarason, I. G. The human reinforcer in verbal behavior research. In L. Krasner and L. P. Ullman (Eds.), Research in behavior modifications: New developments and implications. New York: Holt, Rinehart and Winston, 1965. 231–243.
Sattler, J. M., and Theye, F. Procedural, situational, and interpersonal variables in individual intelligence testing. Psychological Bulletin, 1967, 68, 347–360. Sebeok, T. A., Hayes, A. S., and Bateson, M. C. (Eds.) Approaches to semiotics. The Hague: Mouton, 1964. Shames, M. L., and Adair, J. G. Experimenter bias as a function of the type and structure of the task. Paper presented at the meeting of the Canadian Psychological Association, Ottawa, May, 1967. Shapiro, J. L. The effects of sex, instructional set, and the problem of awareness in a verbal conditioning paradigm. Unpublished master’s thesis, Northwestern University, 1966. Shaver, J. P. Experimenter bias and the training of observers. Proceedings of the Utah Academy of Sciences, Arts, and Letters, 1966, Part I, 43, 143–152. Shor, R. E., and Schatz, J. A critical note on Barber’s case-study on ‘‘Subject J’’. Journal of Psychology, 1960, 50, 253–256. Silverman, I. The effects of experimenter outcome expectancy on latency of word association. Journal of Clinical Psychology, 1968, 24, 60–63. Smiltens, G. J. A study of experimenter expectancy effects with two expectancies being manipulated. Unpublished AB thesis, Harvard University, 1966. Stevenson, H. W. Social reinforcement of children’s behavior. In L. P. Lipsitt and C. C. Spiker (Eds.), Advances in child development and behavior. Vol. 2. New York: Academic Press, 1965. 97–126. Strauss, M. E. Examiner expectancy: effects on Rorschach Experience Balance. Journal of Consulting and Clinical Psychology, 1968, 32, 125–129. Summers, G. F., and Hammonds, A. D. Effect of racial characteristics of investigator on self-enumerated responses to a Negro prejudice scale. Social Forces, 1966, 44, 515–518. Timaeus, E., and Lück, H. E. Experimenter expectancy and social facilitation I: Aggression under the condition of coaction. Unpublished manuscript, University of Cologne, 1968. (a) Timaeus, E., and Lück, H. E. 
Experimenter expectancy and social facilitation II: Stroop-test performance under the condition of audience. Unpublished manuscript, University of Cologne, 1968. (b) Towbin, A. P. Hostility in Rorschach content and overt aggressive behavior. Journal of Abnormal and Social Psychology, 1959, 58, 312–316. Trattner, J. H. The Whitehorn-Betz ‘‘AB’’ Scale and the communication of expectancies to ‘‘process’’ and ‘‘reactive’’ schizophrenics. Unpublished manuscript, Harvard University, 1966. Trattner, J. H. The Whitehorn-Betz AB Scale and the communication of expectancies to high and low social competence schizophrenics. Unpublished master’s thesis, Northwestern University, 1968. Troffer, S. A., and Tart, C. T. Experimenter bias in hypnotist performance. Science, 1964, 145, 1330–1331. Uno, Y., Frager, R. D., and Rosenthal, R. Interpersonal expectancy effects among Japanese experimenters. Unpublished data, Harvard University, 1968. Vaughan, G. M. The effect of the ethnic grouping of the experimenter upon children’s responses to tests of an ethnic nature. British Journal of Social and Clinical Psychology, 1963, 2, 66–70. Walker, R. E., Davis, W. E., and Firetto, A. An experimenter variable: The psychologist-clergyman. Psychological Reports, 1968, 22, 709–714. Walker, R. E., and Firetto, A. The clergyman as a variable in psychological testing. Journal for the Scientific Study of Religion, 1965, 4, 234–236. Wartenberg-Ekren, U. The effect of experimenter knowledge of a subject’s scholastic standing on the performance of a reasoning task. Unpublished master’s thesis, Marquette University, 1962. Weick, K. E. Unpublished data, University of Minnesota, 1966. Weiss, L. R. Experimenter bias as a function of stimulus ambiguity. Unpublished manuscript, State University of New York at Buffalo, 1967. Wenk, E. A. Notes on some motivational aspects of test performance of white and Negro CYA inmates. Unpublished manuscript, Deuel Vocational Institution, Tracy, California, 1966. Wessler, R. 
L. The experimenter effect in a task-ability problem experiment. Unpublished doctoral dissertation, Washington University, 1966. Wessler, R. L. Experimenter expectancy effects in psychomotor performance. Perceptual and Motor Skills, 1968, 26, 911–917. (a) Wessler, R. L. Experimenter expectancy effects in three dissimilar tasks. Unpublished manuscript, St. Louis University, 1968. (b)
Wessler, R. L., and Strauss, M. E. Experimenter expectancy: a failure to replicate. Psychological Reports, 1968, 22, 687–688. White, C. R. The effect of induced subject expectations on the experimenter bias situation. Unpublished doctoral dissertation, University of North Dakota, 1962. Womack, W. M., and Wagner, N. N. Negro interviewers and white patients. Archives of General Psychiatry, 1967, 16, 685–692. Woolsey, S. H., and Rosenthal, R. Unpublished data, Harvard University, 1966. Yando, R. M., and Kagan, J. The effect of teacher tempo on the child. Unpublished manuscript, Harvard University, 1966. Zax, M., Stricker, G., and Weiss, J. H. Some effects of nonpersonality factors on Rorschach performance. Journal of Projective Techniques, 1960, 24, 83–93. Zegers, R. A. Expectancy and the effects of confirmation and disconfirmation. Journal of Personality and Social Psychology, 1968, 9, 67–71. Zoble, E. J. Interaction of subject and experimenter expectancy effects in a tone length discrimination task. Unpublished AB thesis, Franklin and Marshall College, 1968.
7 The Conditions and Consequences of Evaluation Apprehension
Milton J. Rosenberg
University of Chicago
Just as it keeps rats pressing levers, intermittent reinforcement keeps psychologists theorizing and neologizing. The best reinforcer I know is not the student’s imitation of his professor’s crotchets, nor is it a ‘‘successful replication’’ of one’s experiment by another: instead it is to have some theoretical term that one has coined be often quoted and then to watch the quotation marks fade away as the term begins to enjoy some common usage. To put a phrase into the language (even if that language is spoken by only a few dozen others) confirms the sometimes faltering sense that one has really said something. This seems to have begun to happen with the term ‘‘evaluation apprehension’’ which I first used in some unpublished documents in 1960–1961 and in an obscure article in 1963, and which I then explicated in a more visible one in 1965. Yet the diffusion of the term is not at all due to its being the key to some arcane and profound insight. Most experimental psychologists had long since come to the unhappy awareness that their subjects were prone to ‘‘faking it’’ and, particularly, to faking it ‘‘good.’’ But as a sort of contrast to Mark Twain’s aphorism about people and the weather, the problem of self presentation in experiments seemed to be something that virtually nobody was talking about¹ though a great deal could be done about it. That the term ‘‘evaluation apprehension’’ has recently gained some currency must, then, be due to its helping to fill a need—the need, I should say, for experimental psychologists, social and otherwise, to come to terms with an obvious and fascinating source of trouble in their experimental procedures and rituals. In recent years my own sense of that need has led me beyond the initial conceptualization and into this possibly paradoxical commitment: to try to do systematic experimentation on evaluation apprehension as a source of systematic bias in psychological experiments. This chapter is intended as a rather loose, narrative account of the main directions taken and the major findings gleaned in that research program. All but the first of the studies to be described are previously unpublished, though they have been presented in various colloquia over the last two years. Some of these studies will be described in full detail in forthcoming articles; and it is my ambition to bring all this work, and related studies, into tight but expansive focus in an as yet unwritten book.
¹ One clear voice that helped break the silence was that of Henry Riecken. In a valuable article published in 1962 he proffered a general view of the psychological experiment as a sort of ritualized exchange between subject and experimenter. An important aspect of the exchange dynamic, as he saw it, was the subject’s desire to ‘‘put his best foot forward.’’ However, in Riecken’s view, this was basically a source of ‘‘unintended variance’’ in data and the possibility that it could exert systematic influence making for false confirmation or disconfirmation of hypotheses was not directly examined. Also focused upon the self-presentation process were the inquiries by Edwards (1957) and Crowne and Marlowe (1964) concerning the ‘‘social desirability’’ variable. In distinction to the work described in this chapter, their basic interest has been with the contaminating influence of positive self-presentation upon psychological testing and its results rather than upon psychological experiments.
Evaluation Apprehension as Concept and Process
To begin, I had better not assume that the partial diffusion of the term ‘‘evaluation apprehension’’ has also spread abroad its full intended conceptual meaning. Thus what is called for, first, is a statement of definition. Then I shall need to outline my conception of how evaluation apprehension gets aroused and, after arousal, sometimes interacts with features of the experimental situation in ways that produce systematic biasing of experimental response data. Following these necessary preliminaries I shall turn, in the last portion of this introductory section, to some of the reasoning that lies behind the basic conceptualization and then we can begin to look at its research implications.
What, then, is the working conception of evaluation apprehension around which my recent research and this chapter are organized? The summary given in an earlier article (Rosenberg, 1965) is, I think, worth repeating here:
It is proposed that the typical human subject approaches the typical psychological experiment with a preliminary expectation that the psychologist may undertake to evaluate his (the subject’s) emotional adequacy, his mental health or lack of it. Members of the general public, including students in introductory psychology courses, have usually learned (despite our occasional efforts to persuade them otherwise) to attribute special abilities along these lines to those whose work is perceived as involving psychological interests and skills. Even when the subject is convinced that his adjustment is not being directly studied he is likely to think that the experimenter is nevertheless bound to be sensitive to any behavior that bespeaks poor adjustment or immaturity. In experiments the subject’s initial suspicion that he may be exposing himself to evaluation will usually be confirmed or disconfirmed (as he perceives it) in the early stages of his encounter with the experimenter. 
Whenever it is confirmed, or to the extent that it is, the typical subject will be likely to experience evaluation apprehension; that is, an active, anxiety-toned concern that he win a positive evaluation from the experimenter, or at least that he provide no grounds for a negative one. Personality variables will have some bearing upon the extent to which this pattern of apprehension develops. But equally important are various aspects of the experimental design such as the experimenter’s explanatory ‘pitch,’ the types of measures used, and the experimental manipulations themselves. Such factors may operate with equal potency across all cells of an experiment; but we shall focus upon the more troublesome situation in which treatment differences between experimental groups make for differential arousal and confirmation of evaluation apprehension. The particular difficulty with this state of affairs is that subjects in groups experiencing comparatively high levels of evaluation apprehension will be more
prone than subjects in other groups to interpret the experimenter’s instructions, explanations, and measures for what they may convey about the kinds of responses that will be considered healthy or unhealthy, mature or immature. In other words, they will develop hypotheses about how to win positive evaluation or to avoid negative evaluation. And usually the subjects in such an experimental group are enough alike in their perceptual reactions to the situation so that there will be considerable similarity in the hypotheses at which they separately arrive. This similarity may, in turn, operate to systematically influence experimental responding in ways that foster false confirmation of the experimenter’s predictions.
What suggests this view of the secret side of the structured transaction between experimenter and subject? What, if anything, confirms the view? One answer to the first of these questions concerns the modal theme that is usually encountered when one engages subjects in extended postexperimental discussion. Experienced experimenters who bother to talk to their subjects have all heard questions like these: ‘‘How did I do—were my responses (answers) normal?’’ ‘‘What were you really trying to find out, whether I’m some kind of neurotic?’’ ‘‘Did I react the same as most people do?’’ If one goes further in postexperimental inquiry, as I have regularly tried to do in recent years in my experimental work on attitude change (see Abelson et al., 1968; Rosenberg et al., 1960), and asks subjects to attempt a reconstruction of their private experience of the experimental transaction, one often picks up another theme that I take to be quite significant. Subjects will report—sometimes with uncertainty and sometimes with great clarity—that they were burdened or preoccupied with the question ‘‘What is the real purpose of this experiment?’’ and that when some striking aspect of the experimental situation was revealed to them (whether through further instructions from the experimenter or, often, through first encounter with the instrument designed to elicit dependent variable measures) this generated a flash of ‘‘insight’’ about what the experimenter was ‘‘really trying to find out about me.’’ Though such ‘‘insights’’ are almost always incorrect they are of the sort that is capable of affecting the subject’s further behavior in the experimental situation. The fact that such influence upon experimental responding has occurred is often the precise burden of the subject’s remarks. 
Thus, conversations with subjects (and also with graduate students and colleagues as they muse upon their memories from undergraduate years when they were the recruited subjects rather than the recruiters) have helped to shape the basic conception of the evaluation apprehension process. Yet another contributing influence has been the fact that experienced experimental social psychologists seem to share a certain basic style when engaged in professional ‘‘yesbutism.’’ What I mean is that when they suspect that someone else’s data ‘‘are too neat—and the hypothesis can’t be that true’’ their first line of reinterpretation is usually to suggest that something about the experimental instructions or manipulations probably ‘‘aroused’’ the subjects in some unintended way or direction. Who has not heard reinterpretations similar to these illustrative ones? ‘‘The instructions probably made the subjects in the experimental group quite anxious about how they would be accepted and that, rather than the attributions of expertise as such, would be enough to make them conform to the views of the other group members.’’ Or: ‘‘By telling the subjects that prejudiced people are people who have repressed their hostility toward parents you are really making it necessary for them to show that the tolerance message influences
them; it isn’t ‘insight’ that accounts for the change, it’s their need to get the psychologist’s approval.’’ The penchant for this sort of reinterpretation in terms of self-presentation dilemmas is widespread. Is this simply because it is a normative style in our profession; or has it become so because it reflects a persisting social psychological reality in the conduct of psychological research? Obviously, I would suggest that the latter is the case. But if observations and speculations of the sort that I have indicated have helped to suggest the evaluation apprehension view, they do not, of course, in any way serve to confirm it. Confirmation can only be accomplished through further research. Thus, one of the basic aims of the experimental program that I and my various colleagues have been conducting has been to demonstrate that evaluation apprehension, once aroused, can significantly influence dependent variable data. We have intended also to show that this influence often works not merely to increase ‘‘random error variance’’ but rather that it exerts systematic bias upon experimental responding; i.e., it ‘‘tilts’’ data distributions toward one or the other end of the response continuum and thus generates ‘‘significant’’ findings that happen also to be illusory ones. We have had other purposes in mind as well—particularly to investigate the conditions under which evaluation apprehension is more or less likely to be aroused and, if aroused, more or less likely to induce systematic bias in dependent variable data. I shall return to these matters later. Our first task is to review and discuss some ‘‘demonstration’’ studies. What they are intended to demonstrate is, simply, that when evaluation apprehension is aroused (and when it is coupled with the provision of cues that hint how the normal or ‘‘healthy’’ person would be likely to respond) this can induce systematic bias. 
Of course, it must be clearly understood that any demonstration that this can happen does not establish that it always or usually will happen. But there is no point in worrying about evaluation apprehension at all or in spending effort on trying to control and reduce it, unless we have first satisfied ourselves that it can actually be shown to exert biasing influence upon experimental responding.
Demonstration Through Altered Replication
There are at least two ways in which our basic point can be demonstrated. The one that will occupy us now is, in essence, a classic strategy. It is the one that is commonly employed when one suspects that the findings obtained in some reported ‘‘successful’’ experiment are in reality not due to the validity of the experimenter’s hypothesis but to some unintended influence let loose by his poorly designed operations of manipulation or measurement. In this strategy one redesigns the suspected operations and repeats the experiment. If, despite the operational changes, the original findings are replicated one now has presumptive evidence that one’s objections and doubts were ill taken; if meaningfully different, nonreplicative data are obtained, one has some claim (though it should not be over-indulged) to emit the prideful chortle: ‘‘I told you so’’ or ‘‘Thus do I refute Professor Berkeley.’’
How does this bear upon our intention to confirm, by empirical demonstration, that unsuspected arousal of evaluation apprehension does sometimes generate false confirmations of hypotheses? Obviously, when we suspect that this has happened,
and where we have a speculative interpretation of how it happened, we may undertake an altered replication of the original study. The object would be to change those operations which we believe to have aroused evaluation apprehension and to have fostered the expectation that a certain way of responding would bring positive evaluation from the experimenter. If such an altered replication were to yield data that, as predicted, were quite different from the findings of the original study this could be taken as evidence that our original concern over evaluation apprehension was neither excessive nor misplaced. In effect, such an outcome would be a demonstration, through the construct validation method, that evaluation apprehension can generate systematic bias in experimental data—though, of course, a single successful instance would hardly stand as an incontrovertibly definitive demonstration. The strategy that I have just described was the first one employed in the demonstration phase of our inquiry into the evaluation apprehension phenomenon. The substantive area of concern was research in support of a basic hypothesis derived from cognitive dissonance theory. Some early experiments—most notably those by Festinger and Carlsmith (1959) and Cohen (in Brehm and Cohen, 1962) had supposedly confirmed the hypothesized relationship: when counterattitudinal advocacy (i.e., arguing in support of an attitude position opposite to one’s own true conviction) is undertaken with little justification (e.g., for a small monetary reward) this will induce more attitude change in the advocate than when counterattitudinal advocacy is undertaken with strong justification (e.g., for a comparatively large monetary reward). However, as many observers (among them Chapanis and Chapanis, 1964; Brown, 1962) have pointed out, dissonance studies of this type confront the subjects (particularly those in ‘‘low dissonance’’ experimental groups) with startling and ambiguous experiences and conditions. 
Agreeing with Chapanis and Chapanis that a likely consequence will be the arousal of ‘‘suspicion,’’ I thought it possible to be even more specific about the intervening, response-affecting, patterns of arousal that may occur with subjects in such studies. The particular case in point upon which we focussed was the well-known study by Cohen. In this experiment Yale undergraduates had been recruited to write essays in support of a position opposite to the one they actually held on a currently salient campus issue. The issue concerned ‘‘the actions of the New Haven police’’ in a recent campus riot. The undergraduates uniformly felt that the police had behaved badly. The essay they were requested to write was on the topic: ‘‘Why the actions of the New Haven police were justified.’’ Having appeared at randomly chosen dormitory rooms the experimenter requested the potential subject to write such an essay and as an inducement offered a financial reward of either $.50, $1.00, $5.00 or $10.00. After the essay had been completed the experimenter asked the subject to fill out an attitude measure indicating how much he approved or disapproved of ‘‘the actions of the New Haven police.’’ As this measure was handed to him the subject was invited to take into account, if he so chose, the pro-police arguments he had just improvised in writing the counterattitudinal essay. The prediction derived from dissonance theory was that the lesser magnitude of reward would generate a greater magnitude of dissonance and thus greater attitude change: that is, an inverse monotonic relationship was expected between the amount of money offered to elicit the counterattitudinal advocacy and the degree of attitude change toward the pro-police position. This prediction was apparently confirmed;
the $.50 reward group showed greatest attitude change in the pro-police direction, the $1.00 group next greatest change and the $5.00 and $10.00 reward groups did not differ from a control group which, without any prior counterattitudinal advocacy, had merely filled out an attitude scale concerning the question of whether ‘‘the actions of the New Haven police’’ were justified. On the basis of attitude theory considerations that need not be reviewed here I thought that the opposite prediction made more sense: that the degree of attitude change would be a positive, rather than an inverse, function of the amount of monetary payment that was offered to elicit the counterattitudinal advocacy. Also it seemed likely to me that Cohen’s results could be due to an unsuspected arousal of evaluation apprehension and a strong, but implicit, cueing which would have led most low dissonance (i.e., high reward) subjects to withhold evidence that they had influenced themselves in the pro-police direction. Exactly what leads us toward this sort of interpretation of what really happened in this and similar early dissonance experiments on attitude change? The answer can best be conveyed by some extended quotations from the original article (Rosenberg, 1965) which posed the evaluation apprehension reinterpretation of the Cohen study and then went on to report the altered replication by which that reinterpretation was tested. It seems quite conceivable that in certain dissonance experiments the use of surprisingly large monetary rewards for eliciting counterattitudinal arguments may seem quite strange to the subject, may suggest that he is being treated disingenuously. This in turn is likely to confirm initial expectations that evaluation is somehow being undertaken. 
As a result the typical subject, once exposed to this manipulation, may be aroused to a comparatively high level of evaluation apprehension; and, guided by the figural fact that an excessive reward has been offered, he may be led to hypothesize that the experimental situation is one in which his autonomy, his honesty, his resoluteness in resisting a special kind of bribe, are being tested. Thus, given the patterning of their initial expectations and the routinized cultural meanings of some of the main features of the experimental situation, most low-dissonance subjects may come to reason somewhat as follows: ‘‘they probably want to see whether getting paid so much will affect my own attitude, whether it will influence me, whether I am the kind of person whose views can be changed by buying him off.’’ The subject who has formulated such a subjective hypothesis about the real purpose of the experimental situation will be prone to resist giving evidence of attitude change: for to do so would, as he perceives it, convey something unattractive about himself, would lead to his being negatively evaluated by the experimenter. On the other hand, a similar hypothesis would be less likely to occur to the subject who is offered a smaller monetary reward and thus he would be less likely to resist giving evidence of attitude change.
On the basis of these speculative considerations I suggested, regarding Cohen’s experiment, that in this study, as in others of similar design, the low-dissonance (high-reward) subjects would be more likely to suspect that the experimenter had some unrevealed purpose. The gross discrepancy between spending a few minutes writing an essay and the large sum offered, the fact that this large sum had not yet been delivered by the time the
subject was handed the attitude questionnaire, the fact that he was virtually invited to show that he had become more positive toward the New Haven police: all these could have served to engender suspicion and thus to arouse evaluation apprehension and negative affect toward the experimenter. Either or both of these motivating states could probably be most efficiently reduced by the subject refusing to show anything but fairly strong disapproval of the New Haven police; for the subject who had come to believe that his autonomy in the face of a monetary lure was being assessed, remaining ‘antipolice’ would demonstrate that he had autonomy; for the subject who perceived an indirect and disingenuous attempt to change his attitude and felt some reactive anger, holding fast to his original attitude could appear to be a relevant way of frustrating the experimenter. Furthermore, with each step of increase in reward we could expect an increase in the proportion of subjects who had been brought to a motivating level of evaluation apprehension or affect arousal.
But such a reinterpretation is merely another instance of applied ‘‘wise-guyism’’ unless one attempts to put it to a close and demanding further experimental test. To properly employ the altered replication strategy that I have already described, it was necessary to remove the posited evaluation apprehension dynamic, or at least to subdue it, and otherwise to hew as closely as possible to the design and operations of the original study. How might the first of these desiderata best be implemented? The reinterpretation in terms of evaluation apprehension had an obvious methodological implication. If the posited data biasing dynamic had actually occurred this had been made possible by the fact that the experimenter conducted both the dissonance arousal and subsequent attitude measurement. For evaluation apprehension and negative affect, if they had been aroused in the high reward subjects, would have been focused upon the experimenter; and it would have been either to avoid his negative evaluation or to frustrate him, or both, that the high reward subject would hold back (from the experimenter and possibly even from himself) any evidence that he had been influenced by the pro-police arguments that he had elaborated in the essay he had just completed. Thus, quoting again from the original article (Rosenberg, 1965), these considerations led us toward the basic alteration employed in our replication: The most effective way then to eliminate the influence of the biasing factors would be to separate the dissonance arousal phase of the experiment from the attitude measurement phase. The experiment should be organized so that it appears to the subject to be two separate, unrelated studies, conducted by investigators who have little or no relationship with each other and who are pursuing different research interests. 
In such a situation the evaluation apprehension and negative affect that are focused upon the dissonance-arousing experimenter would probably be lessened and, more important, they would not govern the subject’s responses to the attitude-measuring experimenter and to the information that he seeks from the subject.
We need not tarry here over the details of the staging of the two-experiment disguise. It will suffice to say that the disguise (judged by what the subjects said in quite probing postexperimental interviews) worked well, and that adaptations of it have since been used successfully both by others (e.g., Carlsmith, Collins, and Helmreich, 1966) and in my own continuing research program on attitude change (Rosenberg, 1968).
Book One – Artifact in Behavioral Research
Nor do we have to linger over precise descriptions of the instructions and measurement procedures used with the subjects. Except for changes required by our use of the two-experiment disguise all but two aspects of the procedure were identical with those used by Cohen in the original experiment. The two deviations from the original experiment were necessitated by the fact that it was conducted at Yale University and the altered replication at Ohio State University. Thus in the second study Yale undergraduates did not serve as subjects and the issue for counterattitudinal advocacy could not be the same one employed at Yale. The issue that was used concerned the subjects’ attitudes toward a proposed ban upon any further participation by the O.S.U. football team in the Rose Bowl contest. Such a ban had been enacted, and later rescinded, by the faculty senate during the previous year and extreme student opposition had been expressed through demonstrations and some riot-like group activity. The experimental subjects wrote essays favoring the restoration of the Rose Bowl ban. The three experimental groups wrote the essays for promised rewards (delivered after completion of the essay) of $.50, $1.00 and $5.00 respectively. A control group merely took the dependent variable measure—a questionnaire on seven different campus issues, one of which was the Rose Bowl ban, while another dealt with the desirability of O.S.U. abandoning its policy of giving athletic scholarships. On the Rose Bowl issue the Kruskal–Wallis one-way analysis of variance disclosed a significant relationship (p < .001) and inspection shows this to be of the positive, monotonic type: the larger the financial reward for counterattitudinal advocacy the greater the degree of attitude change (as estimated by comparison to the baseline attitude data provided by the control group). 
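The Kruskal–Wallis analysis mentioned above ranks all scores across the groups and asks whether the rank sums differ more than chance would allow. The following is a minimal sketch of the H statistic only, written without the tie correction and without the conversion to a probability; the group labels follow the study, but the scores are hypothetical placeholders, not the study's data, and the function names are illustrative.

```python
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic (no tie correction) for a list of groups."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)


# HYPOTHETICAL attitude scores, invented purely for illustration.
hypothetical_scores = {
    "control": [1.2, 1.5, 2.0, 1.8],
    "$.50": [2.1, 2.4, 1.9, 2.6],
    "$1.00": [2.5, 2.2, 3.0, 2.8],
    "$5.00": [3.4, 3.1, 3.9, 3.6],
}

h = kruskal_wallis_h(list(hypothetical_scores.values()))
print(f"H = {h:.2f} on {len(hypothetical_scores) - 1} degrees of freedom")
```

A large H relative to the chi-square distribution with (number of groups minus one) degrees of freedom indicates an overall difference among groups; the direction of that difference, as in the text, still has to be read from the group values themselves.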
The $.50 and $1.00 groups showed greater favorability toward the Rose Bowl ban than did the control group (p < .01) and less favorability than the $5.00 group (p < .02).² A similar overall finding (p < .005) was obtained on the athletic scholarship issue, though the differences between the groups were of lesser (but still significant) magnitude. This finding was also predicted, and is interpreted as evidence of some generalization of the main attitude change effect to a related, antiathletic issue.

Avoiding the lure of another theoretical area I have so far said nothing about the substantive issues in this experiment. And I shall resist the temptation to do so now, except to note that the positive relationship obtained between degree of reward for counterattitudinal advocacy and degree of resultant attitude change confirms the prediction drawn from my own affective-cognitive consistency theory and disconfirms the prediction derived from dissonance theory. But these issues of attitude theory need not be examined here. They are fully treated in some of my earlier publications (Rosenberg 1956; 1960a; 1960b; 1968) and in a published debate between myself and Aronson, the latter writing as an advocate of a sophisticated, modified version of dissonance theory (Aronson, 1966; Rosenberg, 1966).

² The probabilities reported here as confirming the differences between the groups in this study are all based upon the one-tailed test. Throughout this chapter the same convention has been employed whenever the direction of a difference was predicted, though, as will be seen, most of the findings would easily retain their statistical significance even if the more stringent, but less appropriate, two-tailed standard were applied. Within the tables summarizing the statistical findings a designation of “N.S.” (i.e., not significant) represents a probability value larger than .10, usually considerably larger.

Before I turn away completely from the whirlpool of attitude theory around which I have been skirting, I should like to make clear that the controversy concerning
The Conditions and Consequences of Evaluation Apprehension
counterattitudinal advocacy effects was not, by any means, fully resolved on the basis of this one study. Indeed, new issues have since been discovered in this by now middle-aged area of theoretical debate, experiment and counter-experiment. But the two-experiment disguise is now fairly standard in this particular research area. Also, the fact that under some conditions, at least, the ‘‘incentive’’ rather than the ‘‘dissonance’’ relationship does obtain is now credited by the main participants in the persisting debate, though they continue to disagree (see the contributions of Janis, Carlsmith, Collins, Aronson, and Rosenberg in Abelson et al., 1968) about the nature and provenance of those conditions. Of greater pertinence at the moment are two points that have nothing to do with counterattitudinal advocacy as such, though they are grounded upon the Rose Bowl counterattitudinal advocacy study. The results of the altered replication can be taken as at least an indirect demonstration of the possibility that evaluation apprehension is capable of inducing systematic bias in experimental responding, and thus of generating undetected Type I or Type II errors (in the sense of invalid confirmations and disconfirmations of hypotheses). The second point is that such bias effects need not remain undetected, nor need they be left in the realm of the merely suspected. Variations of the altered replication strategy could probably be designed in most instances where an evaluation apprehension artifact is suspected to have induced systematic bias in the array of dependent variable data. Inventiveness and care in the design of altered replications, and a readiness to resort to them frequently could probably do much to improve the reliability of the data that experimental psychologists collect to test hypotheses and in reaction to which they often develop new hypotheses. 
Evaluation apprehension is by no means the only conceivable source of systematic biasing of data, nor is it an equally threatening possibility in all realms of psychological research. But whenever our experiments are heavy on surprise and whenever the experimenter’s purposes are likely to seem mysterious to subjects (or whenever subjects are likely to sense disingenuousness in the experimenter’s explanatory communication), we would do well to adopt the cautionary stance of obsessive concern over the evaluation apprehension problem. And having adopted this stance, we would do well to go beyond mere obsession or mere disputatiousness and get back to the laboratory where we can put our suspicions to test by conducting the relevantly redesigned altered replications. Anyone who resorts to this strategy, however, had better be prepared to find himself at the receiving end of the ironic justice process. For the criticized and their partisans can reverse the tactic on the aspiring critic. An altered replication designed to remove a suspected evaluation apprehension contaminant from some previously reported experiment can, in itself, be interpreted as having been contaminated by evaluation apprehension or by some other biasing force (e.g., experimenter expectancy, demand characteristics, subject presensitization). The mind reels, and one’s strength does quaver a bit, at the conceivable prospect of an infinite regress in which a study is designed to take systematic bias out of a study that was designed to take systematic bias out of a study that was . . . but in all likelihood, even fully consensual devotion to the expunging of evaluation apprehension effects will stop far short of such total, unlimited doubt. At some point it should become clear that particular experimental paradigms and particular substantive areas of experimental inquiry have been pretty thoroughly ‘‘debugged.’’ And meanwhile,
whatever temporary disruption, confusion, and outraged pride may result, the ultimate outcome can be trusted to be beneficial—not only in that it will probably elevate the trustworthiness of data in the contested area, but also because there is nothing more restorative of the scientific temper than an occasional encounter with the hard, intractable fact that one has made, and remains capable of making, mistakes.
Demonstration Through Manipulated Arousal and Cueing

The construct validation strategy, useful as it has been in theory testing generally and in our own research program, is really a version of the Platonic analogy of the cave. The shadows that are projected across the wall (i.e., our data) denote that something is passing between us and the sun, but we are still tantalizingly out of direct contact with its substance. Thus, the foregoing study and others of similar design, though they seem to confirm the reality of the evaluation apprehension dynamic, do not bring us into direct contact with it. To look more closely at that process it appeared necessary to arouse it, rather than reduce it as was done in the Rose Bowl study.

My first effort in this direction was undertaken in an experiment in which I had valuable collaboration from Dr. Raymond Mulry who was, at that time, one of my graduate students. Our working plan was simple, perhaps even crude. Through a printed “Background Information Sheet” we conveyed to two separate experimental groups the following points: They were about to participate in a study of social perception, in which they were to judge how much they liked or disliked various pictured persons. Past research by others, they were informed, had shown that liking-disliking reactions to strangers were correlated with personality, particularly with whether the rater was psychologically “mature” or “immature.” To one experimental group it was disclosed further that the main burden of the past research (various invented journal articles were cited) was that psychologically mature and healthy people show greater liking for strangers than do immature people. To a second experimental group the printed communication conveyed the opposite: past research had shown that it was psychologically immature and comparatively unhealthy persons who showed greater liking for strangers.
Beyond this crucial point of difference the two forms of the manipulative communication again converged. All of the past research, it was asserted, had been done with subjects in face-to-face contact with real strangers. Would the same relationship between psychological maturity and liking hold for mere photographs of unknown people? This, the subjects were told, was a question that we planned to pursue in further research. But first it was necessary to ‘‘standardize’’ a set of photographs; to determine how much, on the average, they elicited liking or disliking reactions. Thus, in the present study, according to the concluding paragraph of the Background Information Sheet, we were not testing the personalities of our subjects; rather, we were simply establishing normative data against which we would later compare the liking-disliking ratings elicited from subjects whose personality qualities had already been assessed. The simplicity and directness of this manipulation make clear its intended purpose. We were attempting to arouse evaluation apprehension by confirming, for our
subjects, the sort of expectancy that subjects often bring to experiments; namely, in the present instance, that as researchers we were ordinarily interested, among other things, in personality assessment. And furthermore we were cueing our subjects about what past research had shown (and thus about the likely content of our own expectations) concerning the ways in which ‘‘mature’’ and ‘‘immature’’ people tended to react to strangers. Why did we add that we had no idea whether the same relationship would hold with pictures as with reactions to directly encountered real persons? Partly to enhance the general credibility of our communication and partly to reduce what might otherwise be a too overwhelming influence upon the individual’s judgments of the pictures. Also this made the present study somewhat more comparable to many others in which the common strategy (whatever has gone before) is to provide overt reassurance that the subjects’ personalities are not being scrutinized. Apart from the two experimental groups (both aroused to evaluation apprehension and cued either toward liking or disliking responses respectively) we also set up a control group. These subjects received a brief neutral communication which did nothing to arouse evaluation apprehension or to provide directional cueing. The data from this group served, then, as the baseline against which we could assess the significance of the deflections, toward the liking and disliking ends of the scale, of the experimental subjects’ self-reported judgments. I have already suggested that the Rose Bowl altered replication study could be interpreted as providing indirect evidence that evaluation apprehension can contaminate experimental data but not that it always or usually will. The same stricture is all the more applicable in limiting the meaning of the present study. 
Only a failure to find significant differences between the mean liking ratings of the three groups could be taken as definitive; for this would mean that, even under optimum conditions, evaluation apprehension does not get aroused or, if aroused, does not affect experimental data. But if significant differences were obtained just what would they tell us? Merely that the data biasing dynamic that we suspect to be unintentionally induced in certain kinds of experiments can be intentionally induced by rather direct manipulation. In essence, then, we were giving ourselves a chance to increase the pertinence of the null hypothesis or provisionally to reject it. If the data seemed to allow the latter (i.e., if they showed that, at least by intentional amplification, evaluation apprehension can be made to affect experimental responding) we would also be in a position to carry our inquiry a few steps further. We would then be able to ask what kinds of people, situational definitions, and experimental tasks tend to facilitate or diminish the operation of the process in which data are systematically biased under the influence of evaluation apprehension.

These foregoing considerations set the context in which we can now proceed to discuss the findings of the first evaluation apprehension manipulative study; and they are equally relevant to the various other studies that followed it and employed the same basic design paradigm. Obviously I would have no claim to write this chapter if the results of this first study, and of the others that followed upon it, had failed to render the null hypothesis improbable. Thus there will be little surprise in the disclosure that in the first of these studies a large and significant difference was obtained between the two experimental groups. For the 12 pictures of male faces (each rated on a 21-point like-dislike scale ranging from +10 to −10) the algebraic sums of each subject’s judgments were
computed. The means of these scores for separate groups of male and female subjects and the probabilities of the differences between various pairs of means are displayed in Table 7-1.

Table 7–1 Like-Dislike Mean Sums for Groups; and Probabilities of Differences Between Groups

                                 Liking                  Disliking   Liking vs.   Control vs.   Liking vs.
                                 treatment   Control     treatment   control      disliking     disliking
Males judging male pictures      +23.20      −11.25      −8.65       p < .0001    N.S.          p < .0001
Females judging male pictures    +25.90      +4.15       −25.45      p < .03      p < .01       p < .0001

For male subjects in the experimental group that was cued to think that mature people like strangers, the mean algebraic sum of the ratings was +23.20. For the male subjects cued to think that immature people show greater liking for strangers, the mean algebraic sum was −8.65. The significance of the difference between these groups (computed by the Mann–Whitney Rank Sum statistic, as are most of the other simple differences between groups that are reported in this chapter), was clearly established (p < .0001).

However, as reference to Table 7-1 makes clear, we also encountered an interesting complication. The disliking treatment did not, in fact, exert a significant influence upon the male subjects who received it. This is apparent from the fact that the picture ratings from the unmanipulated control group are, on the average, just as negative (the mean is −11.25) as those from the disliking group and, of course, there is no significant difference between these groups. Does this signify that the disliking cueing that we employed was simply not credible? Or that, though credible, subjects could not bring themselves to behave in opposition to the normative standard (at least with typical middle-class Americans) that whatever our private disposition may be, strangers are to be approached with external affability? Either of these interpretations would be plausible if it were not for various other available findings. The most striking is that with the separate groups of female subjects (judging pictures of males, it should be remembered) the disliking treatment does influence the picture ratings and they are as deviant from the control mean in the negative direction (p < .01) as the mean for the liking treatment is in the positive direction (p < .03).
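The scoring and comparison described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' procedure: each subject's score is the algebraic (signed) sum of 12 ratings on the +10 to −10 scale, and two groups are compared with the Mann–Whitney U statistic computed by direct pair counting. The ratings are hypothetical, the function names are invented for this sketch, and the conversion of U to a probability is omitted.

```python
def algebraic_sum(ratings, low=-10, high=10):
    """Signed sum of one subject's like-dislike ratings (each in low..high)."""
    for rating in ratings:
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside {low}..{high}")
    return sum(ratings)


def mann_whitney_u(xs, ys):
    """U for xs: number of (x, y) pairs with x > y, counting ties as one half."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u


# HYPOTHETICAL ratings of 12 pictures by three subjects per group.
hypothetical_liking_group = [
    [5, 3, 2, 4, 1, 0, 2, 3, 1, 2, 0, 1],
    [2, 1, 0, 3, -1, 2, 1, 0, 2, 1, 1, 0],
    [4, 4, 3, 2, 2, 1, 3, 2, 1, 0, 2, 1],
]
hypothetical_disliking_group = [
    [-2, -1, 0, 1, -3, -1, 0, -2, 1, -1, 0, -1],
    [0, -1, -2, -1, 1, 0, -1, -2, 0, -1, -1, 0],
    [1, 0, -1, -2, -1, 0, 1, -1, -2, 0, -1, -1],
]

liking_sums = [algebraic_sum(r) for r in hypothetical_liking_group]
disliking_sums = [algebraic_sum(r) for r in hypothetical_disliking_group]
u = mann_whitney_u(liking_sums, disliking_sums)
print(f"U = {u} out of a possible {len(liking_sums) * len(disliking_sums)}")
```

U equals the number of between-group pairs in which the first group's score is the higher one, so a value near the maximum (the product of the two group sizes) or near zero signals a strong separation between the groups.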
Furthermore, even with the male subjects, there is evidence suggesting that a personality-linked variable has mediated the influence of the disliking treatment. For all subjects we had available their scores on the Marlowe–Crowne (1961) Social
Desirability (SD) Scale which had been administered sometime before the present experiment was conducted. When the male subjects are split into high and low halves on the Social Desirability Scale distribution we find that a trend in the predicted direction is visible between the High SD subjects in the control and disliking groups respectively. But with the Low SD subjects that trend is reversed and approaches significance (p < .10) in the counterhypothesis direction. If the latter group had shown a trend no stronger than that obtained from the former the overall finding would have supported the predicted relationship at an acceptable level of statistical significance. Thus it is the Low SD males who, needing less approval from others (and, we may assume, from the experimenter), are not willing to respond against the normative grain and win a judgment of normality by representing themselves as disliking certain strangers more than they otherwise would. There is a glimmer of a paradox in these last data; for generally one would expect people with a high need for approval (the High SD scorers) also to show a more persisting proclivity for representing themselves as positively disposed toward random others. In fact, passing beyond the data from the disliking condition, we find that the social desirability factor did exert the expected influence within the experimental group that was cued to believe that liking strangers is a sign of maturity. The High SD male subjects in that condition give more extreme liking scores (the mean sum of their picture ratings is +34.77) than do the comparable Low SD subjects (whose mean score is +13.72). The difference between these groups is significant (p < .03).
Despite the few tantalizing ambiguities that I have discussed above, the overall import of this first manipulative study seemed quite clear; with intentional arousal of evaluation apprehension the subsequent directional cueing does ‘‘take’’—that is, it influences the subjects’ experimental responding. Postexperimental inquiry indicated to our tentative satisfaction that these results were not due to any easy comprehension of our unrevealed purposes. The subjects usually insisted that the preliminary material that they read concerning ‘‘earlier studies’’ on reactions to strangers had not particularly influenced them. I do not take these reports as veridical; but neither do I think that they are due to a simple intention to deceive the experimenter. From interviewing conducted after data collection in this study and others, I have formed the impression that subjects will usually obscure from themselves the extent to which they regulate their responding so as to win favorable judgments from the experimenter. And though I cannot anchor the following judgment on a base of hard data I would hazard the psychologically obvious interpretation that this sort of motivated inattention is due to the typical subject’s need to conserve a positive image of himself even as he half-knowingly seeks to make a positive impression upon the experimenter. Upon completion of this first demonstration study it would have been possible to plunge directly into studies concerned with variables that facilitate or suppress the evaluation apprehension-data biasing process. But the rating of pictures for their likability is a rather special sort of task and, as we have seen, certain complications did arise on the dislike cueing side of the experiment. To satisfy ourselves that the process under study was a fairly general one, it seemed necessary to adapt the basic experimental paradigm to some other and quite different sorts of experimental tasks. 
Two further studies of this type were successfully carried out with male undergraduate subjects at Dartmouth College. I shall describe them somewhat more briefly than the preceding study, since they are useful here only in adding some empirical
weight to an assertion that I have already registered more than once: that is, that evaluation apprehension combined with some hints about how “normal people” react (and thus implying something about how the experimenter’s approval can be obtained) does exert systematic biasing influence upon experimental responding.

In one of these additional studies I was joined by two coinvestigators, Philip Corsi (who developed the basic experimental design and operations) and Edward Holmes, both of whom were advanced undergraduate students in the Psychology Department of Dartmouth College. We used an extremely simple task: the subject taps upon a key with his right and left index fingers for six separate ten-second intervals, half of these with one finger and half with the other. The number of taps is automatically registered on a Veeder Root meter. Normally there is a considerable discrepancy between the performance of the two fingers, the index finger of the dominant hand producing more taps than that of the nondominant hand. As the subject entered the experimental room he was asked by the experimenter: “Did you take the general abilities test and the personality inventory during freshman week?” The purpose of this query was to stir some initial prompting toward evaluation apprehension. Following the administration of three brief abilities tests focused on verbal and symbolic skills (and intended to rouse the subjects’ interest in their performances) the experimenter proceeded to give a memorized, verbal explanation of the finger-tapping task. For the control subjects (eleven right-handed undergraduates) this consisted only of a simple description of the task. Working with only this information these subjects did, indeed, produce more taps with their right than with their left index fingers. The mean difference (the sum of right index taps minus the sum of left index taps) was 22.45 for this group.
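The dependent measure in the tapping study is a simple difference score. As a rough sketch, assuming three ten-second intervals per finger as the description implies, each subject's score is the sum of right-finger taps minus the sum of left-finger taps, and the group figure is the mean of those scores. The tap counts below are hypothetical and the function names are illustrative.

```python
def tapping_difference(right_taps, left_taps):
    """Sum of right index finger taps minus sum of left index finger taps."""
    return sum(right_taps) - sum(left_taps)


def mean_difference(subjects):
    """Mean of the right-minus-left difference scores across subjects."""
    if not subjects:
        raise ValueError("at least one subject is required")
    differences = [tapping_difference(right, left) for right, left in subjects]
    return sum(differences) / len(differences)


# HYPOTHETICAL tap counts: (three right-finger intervals, three left-finger intervals).
hypothetical_control_subjects = [
    ([50, 52, 51], [45, 44, 46]),
    ([48, 49, 47], [41, 40, 42]),
]

print(mean_difference(hypothetical_control_subjects))
```

A smaller mean difference in a cued group than in the control group is what the "virtually equal" cue would be expected to produce for right-handed subjects.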
With an experimental group of the same number of right handed subjects the preliminary communication contained some additional information designed to heighten evaluation apprehension and to turn it in a particular response direction. Thus they were told that recent research with graduate students at Yale and at the University of Michigan had turned up the surprising finding that the number of taps with the nondominant index finger was virtually equal to the number with the dominant index finger. The clear implication was that people with higher intelligence (or perhaps of higher educational attainment) performed differently than did other, more ordinary, persons. The result was striking. The mean difference between the sums of right and left index finger taps was only 10.73 and this was quite significantly (p < .005) different from the comparable score of 22.45 obtained with the control group. A clear hint about the relation between performance on the experimental task and the likely evaluation that the experimenter would draw from the subject’s performance had produced a ‘‘transcendence effect.’’ The experimental subjects performed far more efficiently with the left index finger than subjects (both our own control subjects and those in many other studies) ordinarily do. One further finding from this study is of particular interest. The control and experimental data just described were obtained under a ‘‘full feedback’’ condition; that is, the meter registering the cumulative number of taps was visible to each subject as he went through the separate right and left finger tapping trials. But on six other trials with these same subjects we used a ‘‘partial feedback’’ condition in which the meter was exposed only after the ten-second tapping interval had been
completed. (For half of the subjects the partial feedback trials, and for the other half the full feedback trials, were run first.) Under the partial feedback condition the same sort of difference is obtained between the experimental and control groups but it is of lesser magnitude (p < .03) and the difference between the means for the two groups is correspondingly smaller. It seems clear, then, that the continuous availability of information about the quality of his performance will assist the subject in guiding that performance in the direction that he expects will bring him a favorable evaluation. A more ambitious study was undertaken at the same time. This one followed our original experimental paradigm (as used in the picture-rating study) in all its basic details, except for using a quite different sort of experimental task. In two basic experimental conditions and in a separate control condition the subjects (20 in each group) worked with a booklet of addition problems, each page of which had the same number of total digits. The booklet was far too lengthy for completion and the subject was free to stop whenever he chose to do so. Each fourth page in the booklet presented some scales from the Nowlis (1965) Mood Adjective Check List which the subject was to fill out before he went on to the next page of addition problems. Again the experiment began with the subject reading a Background Information Sheet. This document, modeled after the one used in the original picture rating study, explained that past research by other investigators had revealed a relationship between personality competence and the amount of efficiency shown, and pleasure experienced, in executing dull, routine tasks. One form of the information sheet went on to report that the clear finding from many studies (again fictional citations were given) was that the mature, psychologically healthy person experienced more pleasure and was more efficient at such tasks than was the immature person. 
There followed a paragraph explaining the probable psychological basis of this finding (i.e., emotionally healthy people are less prone to distraction and enjoy challenges more than emotionally handicapped persons). The text then went on to explain that all of the previous studies had dealt with ‘‘motoric and manual’’ tasks and there was no proof that the same relationship would hold true with clerical or other ‘‘paper and pencil’’ tasks. This, it was explained, was the question that would be addressed in later research that the experimenters planned, but first it was necessary to develop a standard task and to determine what the typical levels of performance would be. Thus the subject was about to participate in a pilot study focused not upon his personality but rather upon determining average performance levels for various kinds of clerical tasks—and at present the task whose utility was to be explored was arithmetic addition. The content of the Background Information Sheet used with the other experimental group was, of course, exactly parallel in form but opposite in content. It reported that past data with routine motoric performances had shown that psychologically healthy and mature persons were less efficient at carrying out such tasks and got less pleasure from them than did psychologically immature persons. Again a brief psychological explanation of the basis for this finding was offered and this was followed by exactly the same further comments that were used with the first version of the manipulative communication. Obviously this manipulation, just as the one used in the picture rating study, was likely to exert a strong force toward arousal of evaluation apprehension and at the
same time would provide unambiguous cues that could be used to regulate experimental responding so as to maximize the chance that one would be judged ‘‘normal’’ by the psychological experimenter. However, despite the directness of the manipulation, postexperimental questionnaires and postexperimental interviewing with a sample of the subjects revealed little acknowledged penetration of the purpose of the experiment. Subjects did show accurate recall for the content of the Background Information Sheet but usually they insisted that they did not feel that their personalities were being scrutinized; instead, they reported that they had simply worked on the problems until they got bored or fatigued. But, though evaluation apprehension or concern for performing in the ‘‘normal’’ way typically was not acknowledged to the interviewer (and, possibly, not fully acknowledged to the self) it did clearly influence the actual performances of the subjects on the arithmetic addition task. That this is the case is clear from the findings presented in Table 7-2. The means reported there are from the two experimental groups. The table also displays means from a control group whose members worked on the addition problems without any previous arousal of evaluation apprehension, thus establishing the baseline performance levels against which the experimental groups can be judged. It is clear that the two experimental groups differ from one another in the predicted direction. On the average, the subjects who were led to believe that mature people tend to be comparatively dysphoric and inept toward routine tasks completed ten less addition problems than did the opposite experimental group (p < .03). Similarly, they correctly solved eleven fewer problems than did the other experimental group (p < .003). 
Table 7–2 Performance Means for Groups; and Probabilities of Differences Between Groups

                                      Pleasure and              Displeasure and
                                      efficiency                inefficiency      Pleasure vs.   Control vs.    Pleasure vs.
                                      treatment      Control    treatment         control        displeasure    displeasure
Mean number of problems completed     57.96          59.58      47.46             N.S.           p < .03        p < .03
Mean number of problems correct       43.39          41.00      32.25             N.S.           p < .005       p < .003
Mean efficiency index                 0.73           0.67       0.63              p < .05        N.S.           p < .03
(No. correct / No. complete)

However, examination of Table 7-2 will quickly suggest that the treatment emphasizing that mature people do not perform well on dull tasks was a far stronger influence upon the performance than was the opposite treatment. While the former experimental group differs significantly from the control group on both the number
of problems completed and the number solved, the latter group does not. An obvious proposition virtually suggests itself and it is one that may well deserve an important place in the general theory of evaluation apprehension processes that is emerging as we pursue our experimental program. Simply stated it is this: cues suggesting a response pattern that is likely to bring approval from the experimenter will have stronger influence upon actual responding when that pattern is also less effortful in execution. There is, of course, an alternative interpretation that is quite plausible in the present instance: the ‘‘displeasure and inefficiency’’ version of the Background Information Sheet may simply have been more credible, more in accord with the initial expectations of the subjects. After all it does seem likely, at a common sense level, that only ‘‘odd’’ people will enjoy routine, repetitious tasks. But this interpretation is weakened by the fact that, by a somewhat subtler analysis, we find that the subjects in the opposite experimental group were also influenced in their performances on the addition task. Apart from comparing the groups on the total number of problems completed and correctly completed we went on to compute for each subject an index based upon the ratio between these two separate scores. Dividing the number of problems correctly solved by the number completed we obtain a meaningful estimate of the quality of the subject’s performance in relation to the scope of that performance. As Table 7-2 shows, on this index the experimental groups are again significantly different (p < .03) from one another in the predicted direction. However in this instance the group cued to the expectation that mature people perform poorly does not differ from the control group. Though they are completing fewer problems, the percent of these that are correctly completed is the same as with the control group. 
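The efficiency index described above is computed for each subject and then averaged, which is why a group's mean index need not equal the ratio of its two group means (for the pleasure and efficiency group, 43.39 / 57.96 is roughly 0.75, while the tabled mean index is 0.73). The following is a minimal sketch of that calculation in Python; the subject scores and the function name `efficiency_index` are hypothetical and are not data from the study.

```python
from statistics import mean


def efficiency_index(correct: int, completed: int) -> float:
    """Return problems correct divided by problems completed for one subject."""
    if completed <= 0:
        raise ValueError("completed must be positive")
    if not 0 <= correct <= completed:
        raise ValueError("correct must lie between 0 and completed")
    return correct / completed


# Hypothetical (completed, correct) scores for three subjects in one group.
scores = [(60, 45), (40, 36), (20, 10)]

# Index computed subject by subject, then averaged across the group.
indices = [efficiency_index(correct, completed) for completed, correct in scores]
mean_of_indices = mean(indices)

# Ratio of the group totals; with equal group sizes this is the same
# as dividing the mean number correct by the mean number completed.
ratio_of_means = sum(c for _, c in scores) / sum(n for n, _ in scores)

print(f"mean of per-subject indices: {mean_of_indices:.3f}")
print(f"ratio of group means:        {ratio_of_means:.3f}")
```

The two summaries differ whenever subjects who complete many problems are more or less accurate than those who complete few, so a per-subject index carries information that the two group means alone do not.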
But the opposite experimental group does differ from the control group. On the average the subjects in this experimental group complete a few less problems and solve a few more. In consequence the difference between the control group and the ‘‘pleasure and efficiency’’ cued group attains significance (p < .05). Clearly these experimental subjects have been putting somewhat more effort into the task; they have been concentrating more closely on the truly tiresome task of adding columns of digits, and in consequence they have attained somewhat greater accuracy. Equally interesting and meaningful in the light of the finding just reported, are some additional findings obtained with the Mood Adjective Check List. Avoiding the task of describing the scoring or analytical procedures, I shall content myself here with simply reporting that on some of the subscales of this instrument we find the experimental groups differing either significantly or at borderline levels from one another and, in some instances, from the control group as well. The subjects in the group cued to think that normal people enjoy routine tasks characterize themselves as feeling less dysphoric while doing the addition problems than those cued to think that normal people do not enjoy them. And these characterizations tend to persist across the various intervals (every fourth page in the addition problem booklet) at which the subjects were required to report upon their mood states. I have reviewed three manipulative studies each of which successfully demonstrated our basic point: that systematic bias in experimental responding can be produced through the arousal of evaluation apprehension and the cueing of particular response patterns as likely to foster positive evaluation.
However, two defects of this group of studies are apparent and they should be noted here. The first is, simply, that they do not cover as broad a range of experimental tasks as could be desired. Suspecting that virtually any type of experimental performance could be systematically biased by the evaluation apprehension process, we might well have gone on to similar studies in such diverse areas as conditioning and other learning phenomena, psychophysical judgment, impression formation, concept formation, and many other areas. Particularly I should have liked to test the proposition that the degree of attitude change shown by subjects (in their responses to questionnaires administered after a persuasive communication has been received) will be influenced by the prior suggestion that attitude change reflects a mature quality of ‘‘openmindedness’’ or an immature quality of ‘‘inconstancy.’’ Further work along some of these lines is planned. But, happily, the task of demonstrating the broad relevance of evaluation apprehension as a data biasing process has now been taken up by some other investigators. By rather different experimental techniques than those that I have employed, Silverman (1968; Silverman and Regula, 1968) has been providing some evidence that could be easily fitted to the general picture developed here. And Sigall, Aronson, and Van Hoose (1968) have recently reported a study in which subjects are exposed to evaluation apprehension cueing and also to the ‘‘demand characteristic’’ of the experimenter’s expectancy about their performances, the latter ostensibly based upon the scientific hypothesis he is testing. With subjects for whom both forces converge, suggesting that a certain mode of responding will prove the experimenter’s hypothesis and also make the subject appear a competent and adequate personality, strong influence on experimental responding is obtained. 
With another group of subjects these forces are made to diverge, so that the subject must violate the experimenter’s hypothesis if, as he sees it, he is to appear competent and psychologically adequate. The typical subject yields to the latter rather than the former force. Thus, even with a strong demand characteristic opposing it, the evaluation apprehension dynamic is found to exert a statistically significant influence upon the experimental responding of the subjects. Interesting and heartening as such studies are, much more experimental exploration will be required before we can take as established the claim that the systematic biasing of data through the evaluation apprehension dynamic is a general phenomenon, one that can be made to occur over the vast range of response dimensions with which modern experimental psychology is concerned. My expectation is that such a program of ‘‘parametric’’ exploratory studies would in fact reveal considerable generality of this sort. At the same time it would probably disclose that certain types of experimental responding are more prone, and others more resistant, to this type of systematic biasing. Indeed, I think it likely that one would also find that, within a given behavioral realm, certain directions of responding are more easily affected by evaluation apprehension pressures than are others. This has already become apparent through our discoveries that liking of strangers or inefficient performance on routine tasks are more readily inducible response patterns than are their opposites. A momentary lapse into unrestrained programmatic fantasy (an easy indulgence if one puts aside the fact that someone must actually undertake the vast labors that are contemplated) suggests the desirability of constructing, through empirical
techniques, a sort of evaluation apprehension atlas of response dimensions. The hundreds of types of elicited behaviors which now serve as dependent variables in psychological research could be separately submitted to evaluation apprehension cueing of the sort employed in our demonstration experiments. The degree of influenceability of each particular response pattern (and of separate response directions) could then be assessed. Ideally, this would need to be done with systematic variation in types of subjects, types of evaluation apprehension arousal, and types of directional cueing. The result would probably have high payoff in terms of increasing our ability to do uncontaminated, bias-free research—or at least to come closer in approaching that utopian state of affairs. I said earlier that I perceive two main defects in the group of demonstration studies described here. The first, as discussed above, can be handled only by doing more demonstration (and parametric exploration) studies over the broad range of common dependent variables employed in psychological research. The second defect is one that bears upon the way in which such further studies might be conducted. What I have in mind is the fact that in all of the foregoing studies the manipulation had two separate components: evaluation apprehension was aroused or heightened by our telling the subjects, in a fairly direct way, that the responses they were about to make would have some revelatory significance concerning their own personalities; then, in a separate and subsequent portion of the communication, some hints (usually rather strong ones) were given concerning the response differences that might be expected as between normal and abnormal or ‘‘mature’’ and ‘‘immature’’ persons. Are both portions of the induction required? For that matter, can subtler inductions be used without the loss of the systematic bias effects? 
These questions point up a basic limit in the group of demonstration studies so far reviewed: namely, that they have not featured enough cross experiment systematic variation in ways of inducing evaluation apprehension. When such variation is attempted what are we likely to find? Both through speculative rumination and also in the light of some of the data from additional studies that I shall shortly discuss, I am willing to hazard some informed guesses. The first is that the evaluation apprehension biasing effect does not depend upon providing the subjects with an initial statement defining the experimenter as one who is interested in the study of personality or who is otherwise sensitive to the personality revealing implications of the data he is collecting. When this is done it probably does boost the data biasing process, but the same sort of process is likely to be set in motion merely by providing some cues suggesting that one mode of responding as compared to another is more ‘‘normal’’ or ‘‘competent’’ or ‘‘mature.’’ The latter strategy was the one employed by Sigall, Aronson, and Van Hoose (1968) and it was sufficient to induce significant systematic bias. However, what of the situation in which no direct cueing toward the ‘‘normal’’ pattern of response is provided? Surely this is the typical state of affairs in experiments in which evaluation apprehension is an inadvertent rather than an intended influence upon subjects’ responding. Theoretical analysis has rather persuaded me (and some studies, reported later, on the mediation of the experimenter expectancy effect have turned persuasion toward conviction) of this basic point: arousing the subject to the general expectancy that his personality competence will be available for judgment by the psychological experimenter sets him examining salient aspects
of the situation for what they might reveal about ‘‘the way a normal person would respond.’’ In other words, when a general state of evaluation apprehension has been aroused by intention (as we attempt to do with the first portion of the Background Information Sheet communication) or unintentionally, direct cueing of the normality-revealing behavioral model is not required. Subtler hints will be picked up and private hypotheses will be formulated by the subjects—and, to the extent that the separate subjects attend to the same hints and draw the same interpretations, systematic biasing of response data will be likely to occur. An implied methodological corrective is lurking in these last comments. Though I shall return to it at a later point it deserves bold preliminary iteration here: techniques for reducing any stirrings of evaluation apprehension that subjects bring into an experimental situation, for disconfirming any initial concern that their psychological adequacy or inadequacy will be open to judgment, are bound to improve the trustworthiness of the data collected in that experiment. In setting the stage for the foregoing discussion of the limits of our manipulation techniques I asked: can subtler inductions be used without loss of the systematic bias effects? By ‘‘subtler’’ I mean communications which do the work of our Background Information Sheet (i.e., arousing general evaluation apprehension, or cueing the subject in a particular response direction, or both) far less explicitly, with more ‘‘natural’’ indirectness. I am fairly sure that the answer is yes—that such subtler manipulation will induce systematic bias in experimental responding. To this purpose, a number of further demonstration studies have been planned, but not yet executed. 
If their results are successful they will give us a stronger empirical basis than we have yet established, for the claim that the evaluation apprehension dynamic does often operate where it is usually unsuspected: for example, in experiments undertaken to test substantive issues and hypotheses relevant to important matters of psychological theory. But, if this point is not yet fully established through our demonstration efforts I must, nevertheless, confess that the studies we have already completed (both those described above and those that follow in the next sections) have considerably strengthened my own original suspicion, namely: Evaluation apprehension does contaminate a fair portion of the experimental work now being conducted over the broad range from social psychology to psychophysics. To be sure, as I make this declaration I am mindful of various limiting considerations: some of my readers will certainly think it a considerable leap beyond the data—and they are right; but scientific inquiry, like other more muscular pursuits, is advanced by the judicious use of audacity. Also I am mindful that this sort of j’accuse, as it concerns any experiment in which one suspects that evaluation apprehension has distorted the data, cannot be sustained by a hundred, let alone three, demonstration experiments; instead the logic of inquiry forces us back to the necessity for undertaking carefully designed altered replication studies. However, the more we can learn about evaluation apprehension through intentionally arousing it, the better equipped we will be to search it out and bring it to heel through the altered replication strategy. Thus, in further research I and my colleagues have gone on creating evaluation apprehension and expanding our inquiry to encompass subsidiary variables which may work to heighten or reduce its influence upon experimental responding. I shall now turn to a review and discussion of some of these further studies.
Variables Influencing the Evaluation Apprehension Process

Though all of our original demonstration studies showed clear main effects, there was a fair degree of intersubject variance within, as well as between, conditions. This suggested that uncontrolled factors relating to the subjects’ personalities, their sensitivities to aspects of the situation, and their patterns of past experience as subjects might be affecting how much evaluation apprehension they felt and how they were acting to reduce it. Clearly a host of variables might be found to influence the evaluation apprehension data biasing process—and the direction of such influence might be either to facilitate or subdue the overall operation of the process. I found it useful to conceive such ‘‘booster’’ and ‘‘suppressor’’ variables as falling into five major categories. They could be: personality attributes (or overall personality patterns) of the subject; aspects of the subject’s recent, preexperimental experience; aspects or attributes of the experimenter; or of the experimental setting; or of the experimental task. We need not think of this taxonomy as the most logical of all possible ones, nor need we assert that it would incorporate all relevant variables. Its main value was, simply, that it was enough to get us started. But we are just barely started on this line of inquiry. While many relevant variables are easily conceivable, only four major ones have been investigated in specific experiments. The results which I shall shortly present have been quite informative both in confirming our initial hypotheses and also, in two of these studies, by disclosing certain more complex interactions which have, in turn, suggested some new lines of theoretical speculation. 
The four variables upon which this work has so far focused are: the need for approval as an attribute of personality, the salience of the ‘‘clinical’’ orientation as an attribute of the experimenter or of the experimental setting, the experimenter’s ‘‘gate-keeper’’ power over the subject, and the ambiguity of the experimental stimulus materials. I have already reported that the need for approval (as indexed by scores on the Social Desirability Scale) seemed to play a response affecting role in the first of our demonstration studies. The same appeared to be true in the study reported above in which ‘‘efficient’’ and ‘‘inefficient’’ performance on routine addition problems were separately cued as reflecting personality competence. In this instance we found that under the cueing treatment suggesting that bored and inefficient performance on routine tasks is a correlate of ‘‘maturity,’’ subjects with high (upper 50 per cent) Social Desirability scores did not differ from the control group. However those with low Social Desirability scores completed and correctly solved significantly fewer problems. This finding though unexpected is a meaningful one, in the sense that hindsight suggests some hypotheses that deserve further experimental testing: perhaps those with strong approval need have in the past been more concerned with (and, thus, more rewarded for) seeking social acceptance through overt display of competence; or perhaps, because of their greater general concern with techniques of ingratiation, they find the cueing communication less credible than do the subjects who are comparatively low in the need for social approval. The personality attribute of need for approval, as indexed by SD scores, was studied within the context of our original demonstration studies. Our more directly
focused work on booster and suppressor effects began with two studies dealing with another variable: the definition of the experimenter or the experimental situation as high or comparatively low in clinical interest. In effect this can be viewed as a simple ‘‘additive’’ variable: that is, another direct force making for evaluation apprehension in the subject and thus, in our typical experimental situation, adding to the weight of forces that have already served to create that motivating state. In the first of our studies on the clinical interest variable, the same sort of Background Information Sheet that was used in our earlier studies was read by all experimental subjects. Once again its first portion was designed to rouse general evaluation apprehension while its second portion conveyed the clear hint for one group that mature people are comparatively high, on liking for strangers, and for the other group, on disliking for strangers. Control subjects received no such preliminary communication. So far this study is essentially a replication of our original demonstration experiment. The additional variable was introduced through a few memorized sentences which the experimenter addressed to the subject. The experimenters (senior students in an advanced experimental psychology course at Dartmouth College) each ran six subjects: two who had received the liking cueing, two the disliking cueing, and two control subjects. To one subject from each of these three categories the experimenter represented himself as having a ‘‘clinical’’ orientation. He did this by saying, just before presenting the instructions for the picture rating task, that he was quite fascinated by the experiment as, indeed, he was by psychology generally. ‘‘I guess,’’ he continued, ‘‘that’s because I’m always thinking about what makes people tick. 
That’s why I’m hoping to go into psychiatry after I finish medical school.’’ In the opposite ‘‘nonclinical’’ role that he played with his three other subjects each experimenter said that he did not particularly see the importance of the present experiment. He continued: ‘‘For that matter I’m not sure what I’m doing in this course but they said, at the School of Engineering, that I had to take it.’’ The purpose of this manipulation was, then, simply to convey that the experimenter either had or lacked interest in understanding and judging the personalities of others. At the same time, for the ‘‘clinical’’ role, it was clear that the experimenter was not yet professionally trained or skilled in this direction. As he alternated between these two roles the experimenter ran his subjects without any knowledge of whether they were in the control group or in the groups that had been respectively cued to the suggestion that liking or disliking for strangers was characteristic of psychologically mature persons. One hundred and fifty subjects gave their liking-disliking ratings for 15 photographs of male faces and the data from 130 of these were analyzed. (The data from the 20 other subjects were discarded because postexperimental questionnaire data showed that they had not understood or retained the content of the like-dislike portion of the communication.) The data clearly indicate that the definition of the experimenter as either having or lacking a clinical orientation does, as predicted, have some influence upon the amount of systematically biased responding by the subjects. Under both the clinical and nonclinical experimenter conditions the control subjects (who received neither evaluation apprehension arousal nor directional cueing) lean toward an overall liking response pattern; and there is no difference between the mean algebraic sums of the ratings for the control subjects run by clinical and nonclinical experimenters. For the
former the mean of the algebraic sums is +25.20 and for the latter +23.00. In the clinical experimenter condition the subjects who received the ‘‘disliking is mature’’ cueing have a mean sum of +.13, while under the nonclinical experimenter condition the mean sum is +5.80. Apparently somewhat greater deflections away from the control group basal levels are occurring under the clinical condition. However, in both instances the differences from the relevant control groups are quite significant (p < .00005 and p < .0003, respectively). A more clear-cut booster effect is obtained with the subjects who received the ‘‘liking is mature’’ cueing. The subgroup run by nonclinical experimenters has a mean sum of +20.80; and this is not significantly different from the mean for the nonclinical, control group. However, the liking subgroup run by clinical experimenters shows a mean of +39.12. This differs significantly both from the means of the clinical control group (p < .01) and the nonclinical, liking-cued group (p < .003). The following conclusions seem reasonable: In this subject population there was some tendency, as indicated by the control group data, to give moderately positive judgments of the pictured persons. Thus, for the liking-cued groups, the information that liking of strangers is a sign of maturity was congruent with their initial response disposition. But the identification of the experimenter as having a special interest in ‘‘what makes people tick’’ and in ‘‘psychiatry’’ operated to raise the stakes for the subjects run by the clinical experimenter. To guarantee the winning of a positive evaluation from him the typical subject in this group strives to give extreme, and thus unambiguous, proof that he possesses the defined hallmark of the mature, psychologically healthy person. These are, then, unexceptional data. 
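The pattern of cell means reported above is easier to compare when each cued group is expressed as a deflection from the control group run under the same experimenter role. The sketch below does that subtraction in Python using the six mean algebraic sums given in the text (all of them positive); the dictionary layout and the function name `deflection` are illustrative choices, not part of the original analysis, and no significance tests are recomputed.

```python
# Mean algebraic sums of the liking ratings as reported in the text,
# keyed by (experimenter role, cueing condition).
mean_sums = {
    ("clinical", "control"): 25.20,
    ("clinical", "disliking"): 0.13,
    ("clinical", "liking"): 39.12,
    ("nonclinical", "control"): 23.00,
    ("nonclinical", "disliking"): 5.80,
    ("nonclinical", "liking"): 20.80,
}


def deflection(role: str, cue: str) -> float:
    """Cued-group mean minus the control mean for the same experimenter role."""
    return round(mean_sums[(role, cue)] - mean_sums[(role, "control")], 2)


for role in ("clinical", "nonclinical"):
    for cue in ("disliking", "liking"):
        print(f"{role:12s} {cue:10s} deflection from control: {deflection(role, cue):+.2f}")
```

On these figures the disliking-cued groups fall about 25 points (clinical) and about 17 points (nonclinical) below their controls, while the liking-cued group rises about 14 points above its control only under the clinical role.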
They seem to confirm the obvious and predicted relationship between the aroused strength of evaluation apprehension (the clinical definition having served to increase it, and the nonclinical definition to decrease it) and the degree to which the subject relies upon a response style he believes to be indicative of normal or attractive personality. But, persisting in the mood of parametric exploration rather than theoretical expansion, the following simple question might be asked: Must the experimenter directly define himself as having a special interest or ability in evaluating personalities? Or can the same sort of evaluation atmosphere be induced by other means? One additional experiment seemed to confirm the latter possibility. In this study we employed the same Background Information Sheet as in the previous one. By this means we again provided both for arousal of evaluation apprehension and directional cueing of responses in the ‘‘liking’’ and ‘‘disliking’’ directions respectively. And again we attempted to strengthen the evaluation apprehension dynamic by introducing an additional clinical implication into the experimental situation. Thus, before they read the Background Information Sheet the subjects in one main treatment read a printed announcement concerning an impending study. This told them that ‘‘Dr. P. J. Schroeder,’’ a clinical psychologist from another institution, had asked our cooperation in recruiting subjects for a large study on ‘‘student personality and adjustment in college life.’’ This study was being conducted on various different campuses. Participation in it would involve the subject’s being interviewed by Dr. Schroeder and allowing him to administer various ‘‘projective tests of personality.’’ Dr. Schroeder, it was made clear, would treat the findings as completely confidential and, specifically, he would not disclose them to the experimenter. 
The subjects were asked to sign for appointments ‘‘for this other, unrelated project’’ if they were so inclined. Virtually all the subjects did sign.
When we compare the subjects in this treatment to others who were not exposed to it we find the former showing stronger directional bias effects than the latter. Under the ‘‘Schroeder is coming’’ condition the difference in ratings between subjects cued in the liking and disliking directions respectively is clearly significant (p < .02). Under the standard condition comparable to the prior experiment, but lacking any extra clinical implication, the comparable finding is p < .10. (Smaller samples were used in this study than in the previous one; and with variances of about the same magnitude the overall probabilities are, as would be expected, somewhat larger.) As I have already suggested, these are studies of limited import and they offer no major surprises. Essentially, their value lies in lending support to this basic point: any aspect of the experimenter (or of the situation or setting in which he is encountered) that adds some further implication of interest in psychological evaluation will tend to increase the influence of the evaluation apprehension dynamic upon the subject’s experimental responding. This statement assumes, of course, that some other provocations toward evaluation apprehension are also acting upon the subjects as, for example, the information that we conveyed through the Background Information Sheet. However, it would seem quite likely that our additional factors (i.e., the undergraduate experimenter’s confessed clinical interest or the subject’s elicited commitment to participate in a later personality evaluative study) could operate as sufficient factors in and of themselves. Further research would be required to confirm this rather obvious speculation. But obvious relationships (even when they raise questions about the underlying and somewhat obscure sequences of events that mediate them) are less compelling than findings that raise new and unexpected issues. 
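The parenthetical remark about sample size can be illustrated numerically: with the same mean difference and the same variance, a smaller sample gives a larger standard error and therefore a larger probability value. The sketch below uses a two-sided normal approximation for a difference between two independent group means; this test, the function name `two_sided_p`, and all numbers are assumptions chosen for illustration, since the text does not report the sample sizes or the test that was used.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff: float, sd: float, n_per_group: int) -> float:
    """Two-sided p for a difference between two group means (normal approximation).

    Assumes equal group sizes and a common standard deviation `sd`.
    """
    standard_error = sd * sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    return 2 * (1 - NormalDist().cdf(z))


# Hypothetical values: identical mean difference and spread, different group sizes.
p_large = two_sided_p(mean_diff=10, sd=20, n_per_group=40)
p_small = two_sided_p(mean_diff=10, sd=20, n_per_group=15)

print(f"larger samples:  p = {p_large:.3f}")
print(f"smaller samples: p = {p_small:.3f}")
```

Holding the effect and the variance fixed, only the group size changes between the two calls, so the difference in the printed probabilities reflects sample size alone.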
Therefore, rather than linger over the findings reviewed above, I shall turn now to some further preliminary studies concerning other variables. In both of these studies the major hypotheses were confirmed, but certain unexpected relationships were also encountered; and they are of a type that promises to deepen our inquiry into the operation of the evaluation apprehension dynamic. In one of these studies we attempted to examine the consequences of making the experimenter a ‘‘gatekeeper’’ for the subjects. By this we meant, simply, that the experimenter was to be perceived by the subjects as likely to allow some, but not all of them into some rewarding activity area. In addition to setting up gatekeeper and nongatekeeper conditions we also treated the manipulation of evaluation apprehension in a new way. In previous studies our Background Information Sheet had been designed to arouse (or confirm and amplify) evaluation apprehension; and control subjects who did not receive the Background Information Sheet provided the necessary baseline data. However, in the present study we used two forms of the Background Information Sheet, one intended to confirm and strengthen original promptings toward evaluation apprehension and the other to disconfirm and weaken them. To convey fully the nature of this experiment I must linger briefly over the actual operations employed to establish the gatekeeper and nongatekeeper and high and low evaluation apprehension variables. The gatekeeper manipulation consisted of the experimenter (myself) reading aloud a letter that, as he explained it, he had recently received from the executive director of the ‘‘National Policy Foundation’’ on whose advisory board, as the letter made clear, the experimenter himself had served. The letter invited the experimenter
to nominate some forty undergraduate students to participate in a ‘‘regional collegiate council’’ that would be discussing, as part of a funded national study, college student experiences and views on general issues concerning campus life and educational practices. These students, the letter made clear, would be well paid and through their participation they would ‘‘help to educate the experts and, hopefully, the academic and governmental hierarchs who will be receiving the progress reports and the final report of the total study.’’ Having read this letter the experimenter said he had decided not to choose from among his own students but rather from the many subjects who were participating in ‘‘this week’s experimental sessions,’’ and that he would be ‘‘contacting some of you about this within the next week or so.’’ In the nongatekeeper portion of the study we elected to say nothing along these lines, rather than to provide some contrasting or filler material. The communication designed to arouse and suppress general evaluation apprehension explained that the study the subject was about to participate in was being conducted by the Personality Assessment Project (high evaluation apprehension) or the Mathematical Psychology Project (low evaluation apprehension). The low evaluation apprehension communication stressed that we were ‘‘interested in integrating the various dimensions or factors involved in social perceptual processes into a more simplified model by a method which statisticians term ‘stochastic-inferential mathematical modeling’.’’ With these, and similar comments, we intended to create in the subject the reassurance that we were studying some rather obscure and technical issues and had no interest in his unique personality. 
Thus, a further paragraph informed the subject that: ‘‘the photo ratings of all participants will be considered as a group and an overall average score and statistical estimate of distribution variance will be obtained for this set of photographs.’’ The high evaluation apprehension communication informed the subjects that ‘‘these photographs comprise a recently developed Test of Psychological Maturity. The pattern of responses given to these photographs by psychologically mature undergraduate students has been shown to be significantly different from the responses of those manifesting psychological maladjustment.’’ The communication went on to assert that our research was pointed toward further development of the picture rating personality test as an ‘‘instrument for picking out from the large college population those students who would be judged inadequate in their emotional adjustment or generally prone to psychological disorders.’’ Postexperimental questionnaire data established that the low evaluation apprehension communication did significantly ‘‘reassure’’ those who read it while the high evaluation apprehension communication did increase anxiety and set the person toward concern over what impression he would make upon the experimenter. The contrast with our earlier uses of the Background Information Sheet should be clear. Not only were we attempting to remove evaluation apprehension in some subjects while strengthening it in others, but we were also providing no cueing that directly reported that liking or disliking for strangers had been found to be characteristic of psychologically mature persons. Instead, a more limited, or one might say, less obvious and intrusive form of directional cueing was employed. Each experimental subject, after he had been exposed to the gatekeeper or nongatekeeper and high or low evaluation apprehension manipulations, read a two-paragraph communication which simply reported that previous research with the pictures he
was about to rate had shown that most people judged them positively (liking) or negatively (disliking). While one third of the total subject population of 148 males received this form of the liking cueing, and another third the disliking cueing, the remaining third received no directional cueing and thus served as a control group. In the actual administration of this experiment we were able to achieve a high level of efficiency by use of the language laboratory at the University of Chicago. Subjects were run in groups of eight to twelve. Each subject occupied a separate work booth. Seated in the booth he first heard the experimenter deliver the gatekeeper ‘‘pitch’’ or, for the nongatekeeper subjects, a brief and quite neutral introductory statement. The subject then read the high or low evaluation apprehension manipulation which, under instruction, he had removed from an envelope placed on the table within his booth. He then went on, unless a cueing control subject, to read the directional cueing communication. Following this he gave, on a rating sheet, his liking-disliking judgments for each of the 15 pictured faces as they were projected on a screen easily visible to all subjects. After this rating sheet had been completed the sequence of pictures was presented again while the subject rated each of the pictured persons for ‘‘how successful’’ they had been. A third exposure of the pictures was then given while the subjects rated the pictured persons for ‘‘how intelligent’’ they appeared.3 All pictures were exposed for ten seconds each, with a following ten second interval during which the subject wrote his rating on a scale from −10 to +10. Two postexperimental questionnaires, administered both before and after a thorough debriefing, provided strong evidence that the manipulations had been successful and that very little suspicion had been aroused as to our real purpose. 
I have so far described the procedures of this study without any direct reference to the hypotheses that guided it. However, they are probably already apparent. The gatekeeper manipulation was intended to increase the desirability of winning a positive evaluation from the experimenter; for this would now have the additional payoff value of increasing the probability of being chosen for membership in the interesting and remunerative student discussion group that was being set up by the ‘‘National Policy Foundation.’’ Thus we predicted that response dependence upon the directional cueing would be greater for subjects in the gatekeeper condition than for those in the nongatekeeper condition. Similarly we expected that subjects receiving the high evaluation apprehension manipulation would show stronger response bias effects than those receiving the communication that was designed to reduce evaluation apprehension. And, of course, we were interested in the possibility of a meaningful interaction between the two major variables, and also their respective and combined interactions with the like-dislike cueing variable. This rather complex study, with 12 separate cells in a 2 × 2 × 3 design, and with considerable data drawn from postexperimental questionnaires and inquiry, yielded a great deal of information; and full presentation and analysis can only be attempted in a lengthy, separate article. Thus I shall dwell here only upon some of the major findings and their probable meaning.
3 It should be clear that the subjects had not received any directional cueing concerning the personality revealing relevance of judgments that others have been successful or are intelligent. However, judging another as possessing these qualities would represent a positive evaluation of him. Thus we expected some generalization from the subjects’ judgments on the like-dislike dimension onto these two other judgmental scales. Also, evidence of such generalization (or of such indirect cueing effects) could be taken as an additional measure of the degree to which the directional cueing was utilized by the subject.
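The layout of the design and of the main dependent measure can be sketched in a few lines of Python. This is only an illustrative sketch: the factor labels and the function name `algebraic_sum` are invented here for the example, while the counts (two gatekeeper conditions, two evaluation apprehension levels, three cueing conditions, 15 pictures, ratings from −10 to +10) come from the description above.

```python
from itertools import product

# Factor levels as described in the text; the label strings are illustrative.
GATEKEEPER = ("gatekeeper", "nongatekeeper")
APPREHENSION = ("high", "low")
CUEING = ("liking", "disliking", "none")

# Every combination of the three factors is one cell of the design.
cells = list(product(GATEKEEPER, APPREHENSION, CUEING))

N_PICTURES = 15                 # pictured faces rated by each subject
SCALE_MIN, SCALE_MAX = -10, 10  # rating scale for each picture


def algebraic_sum(ratings):
    """Return the signed sum of one subject's ratings across all pictures."""
    if len(ratings) != N_PICTURES:
        raise ValueError("expected one rating per picture")
    if any(not SCALE_MIN <= r <= SCALE_MAX for r in ratings):
        raise ValueError("rating outside the scale")
    return sum(ratings)


if __name__ == "__main__":
    print(len(cells), "cells")
    print("possible range of a subject's sum:",
          algebraic_sum([SCALE_MIN] * N_PICTURES),
          "to",
          algebraic_sum([SCALE_MAX] * N_PICTURES))
```

On these assumptions a subject's algebraic sum can range from −150 to +150, which gives a sense of scale for the cell means reported in the tables.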
Table 7-3 presents the mean algebraic sums of the liking ratings for the six cells that received the gatekeeper treatment and, separately, for the six cells in the nongatekeeper treatment. The probabilities of the differences between relevant pairs of cells are also presented. Reference to these tables will help to illuminate the findings from the separate analyses of variance that were carried out for both the gatekeeper and nongatekeeper conditions. In both analyses we obtained clear evidence of a cueing effect. The algebraic sums of the subjects’ ratings on the like-dislike dimension strongly reflect the cueing that was received: those who got positive cueing gave more positive ratings than those who got no cueing and these, in turn, gave more positive ratings than those who got negative cueing. In the nongatekeeper half of the experiment the p value for this effect is less than .0001. Also the effect does appear to generalize to the ratings of ‘‘success’’ (p < .05) and ‘‘intelligence’’ (p < .006). Considering only the four groups that received cueing in either the positive or negative direction (i.e., eliminating the two no cueing, control groups) our other
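Table 7–3 Like-Dislike Mean Sums for Groups; and Probabilities of Differences Between Groups

A. Nongatekeeper condition
                                Liking treatment    Control    Disliking treatment
High evaluation apprehension    +33.50              +5.42      −19.93
Low evaluation apprehension     +13.67              +2.83      −12.00
Probabilities of differences between pairs of groups (listed in printed order; the cell pairings are not preserved in this layout): p < .0002; p < .02; p < .004; N.S.; p < .05; N.S.; N.S.; p < .05; p < .003.

B. Gatekeeper condition
                                Liking treatment    Control    Disliking treatment
High evaluation apprehension    +13.82              +5.09      −17.25
Low evaluation apprehension     +18.00              −2.33      −26.83
Probabilities of differences between pairs of groups (listed in printed order; the cell pairings are not preserved in this layout): N.S.; p < .02; N.S.; N.S.; p < .003; p < .002; p < .00003; p < .06; p < .004.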
major prediction was confirmed. In the nongatekeeper portion of the study a significant interaction is obtained between cueing and evaluation apprehension level as regards the liking ratings (p < .03). This is due to the fact that when subjects have been roused to a state of evaluation apprehension their picture ratings are more extremely influenced by either the positive or negative cueing than when they are at a low or suppressed level of evaluation apprehension. Thus, the mean of the algebraic sums for the high evaluation apprehension subjects who received positive cueing is some 53 points more positive than the mean for the high evaluation apprehension subjects who received negative cueing. For the low evaluation apprehension group the comparable discrepancy, while in the same direction, is only 25 points. Similar effects of lesser magnitude and statistical significance are obtained when we compare the two evaluation apprehension groups on their ratings of the pictures for success and intelligence. The probabilities for the overall evaluation apprehension by cueing interactions on these two dependent variables are less than .12 and .19, respectively. In passing, it is worth noting that within the high evaluation apprehension condition the differences between the scores from the positively and negatively cued subjects are significant at probabilities of .008 or less for each of the three dependent variables; while the parallel analysis with the low evaluation apprehension subjects yields a significant probability only on the liking ratings. I have dwelt upon these results because they suggest a point of particular interest both as concerns an emerging theory of the self-presentation process and also as they bear upon an important methodological issue. 
The kind of directional cueing intentionally provided in this study is often unintentionally present in other research situations, both of experimental and survey form (e.g., the respondent in the typical public affairs study often has a fairly clear idea, whether accurate or not, of ‘‘how most people would probably answer’’ on some of the more salient issues). More ‘‘valid’’ data (i.e., more accurate self-representations) are likely to be obtained when we attempt to reduce evaluation apprehension through some preliminary communication which disconfirms the subject’s or respondent’s concern that his psychological maturity (or, for that matter, his ‘‘public spiritedness’’ or ‘‘patriotism’’) may be open to assessment and evaluative judgment. Yet the fact is that even with an apparently successful reduction of evaluation apprehension (judging by the postexperimental questionnaire data from the low evaluation apprehension subjects) the directional cueing still exerts some influence. Probably this indicates some residuum of persisting evaluation apprehension and, if so interpreted, it points up the necessity for developing even more effective techniques for giving subjects or respondents the sort of reassurance which allows them to be their typical selves (i.e., uninfluenced by situational and inadvertent cueing factors) when reporting on their own judgmental or attitudinal processes. So far the discussion has been restricted to the findings from the nongatekeeper portion of the experiment. With the data from the gatekeeper portion of this study we encounter a number of interesting patterns, particularly when they are viewed in relation to the comparable nongatekeeper experimental groups. 
Whereas the positively and negatively cued groups in the nongatekeeper, low evaluation apprehension condition differed significantly only on their liking ratings of the pictures, but not on the success or intelligence ratings, the low evaluation apprehension groups who received the gatekeeper manipulation show significant cueing effects on the liking, success, and intelligence ratings (respectively, p < .00003, p < .0002, p < .0005).
This is further reflected in the difference between the liking means for the positively and negatively cued, low evaluation apprehension groups. Under the nongatekeeper condition this difference is 25.67, while the comparable difference under the gatekeeper condition is 44.83. For the success ratings the differences between the means for the two cueing groups are 3.60 for the low evaluation apprehension nongatekeeper condition and 37.34 for the low evaluation apprehension gatekeeper condition. With the ratings of intelligence the respective difference scores are 10.80 and 38.73. It is clear that when we make the subject dependent upon the experimenter’s judgment of him we restore something like evaluation apprehension. The subject, knowing that the experimenter is a psychologist and probably desiring that he ‘‘let him through the gate’’ to a rewarding experience, regulates his responding by reference to the cues that tell him how ‘‘most others’’ respond. So far the results from the gatekeeper condition confirm our original hypotheses. However, where we examine the data from the high evaluation apprehension gatekeeper subjects, one major surprise is encountered: Unlike the results with the low evaluation apprehension subjects, the introduction of the gatekeeper condition (which was intended as an extra force compelling the subject toward reliance upon the directional cueing) seems in fact to reduce such reliance for the positively, but not negatively, cued group. In the high evaluation apprehension nongatekeeper and gatekeeper conditions the mean algebraic sums for the liking ratings in the absence of any directional cueing are 5.42 and 5.09, respectively. 
But whereas the liking sums for the positively cued subjects in the former group have a mean of 33.50 (and thus the difference between the control and positively cued subjects is 28.08), in the latter group the positively cued subjects yield a mean sum of only 13.82 (making the difference between the control and positively cued subjects only 8.73). Similar findings are obtained with the dependent variables of success ratings and intelligence ratings. A possible interpretation is that the combination of the high evaluation apprehension and gatekeeper treatments strains the subjects’ credulity or, perhaps, puts them under a degree of tension which inhibits or otherwise disrupts their readiness to be influenced by the directional cueing. But the absence of the same pattern with the negatively cued groups limits the applicability of this interpretation. Subtler possibilities have occurred to us, but their explication had best await the results of further data analyses that are yet to be executed. These last findings comprise one of the valuable surprises of which I spoke earlier; and I must confess considerable interest in further experimental investigation in this particular realm as well as considerable frustration over the tantalizing ambiguity that presently beclouds the issue. Among many further subsidiary findings obtained in this experiment I shall mention only one other. A postexperimental index of the ‘‘anxiety’’ aroused by the high evaluation apprehension communication is strongly correlated with the degree to which the subjects in the experimental groups were influenced by the directional cueing that they received. This serves to reinforce our general theoretical view while also suggesting the importance of apprehension-proneness as a mediating, personality-linked variable. 
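The difference scores quoted in this discussion can be recomputed from the cell means of Table 7-3. The sketch below is offered only as a check on the arithmetic. The names `TABLE_7_3`, `cueing_spread`, and `positive_shift` are invented for the example. Most cell assignments are confirmed by the differences stated in the text; the placement of −17.25 (gatekeeper, high apprehension, disliking) and −2.33 (gatekeeper, low apprehension, control) is read from the table layout and should be treated as an assumption.

```python
# Mean algebraic sums of the liking ratings, as read from Table 7-3.
# Keys: (gatekeeper condition, evaluation apprehension level).
TABLE_7_3 = {
    ("nongatekeeper", "high"): {"liking": 33.50, "control": 5.42, "disliking": -19.93},
    ("nongatekeeper", "low"): {"liking": 13.67, "control": 2.83, "disliking": -12.00},
    # -17.25 and -2.33 are placed by table layout (assumption, see lead-in).
    ("gatekeeper", "high"): {"liking": 13.82, "control": 5.09, "disliking": -17.25},
    ("gatekeeper", "low"): {"liking": 18.00, "control": -2.33, "disliking": -26.83},
}


def cueing_spread(cell):
    """Mean for the positively cued group minus mean for the negatively cued group."""
    return round(cell["liking"] - cell["disliking"], 2)


def positive_shift(cell):
    """Mean for the positively cued group minus mean for the uncued control group."""
    return round(cell["liking"] - cell["control"], 2)


if __name__ == "__main__":
    for (condition, apprehension), cell in TABLE_7_3.items():
        print(f"{condition:14s} {apprehension:5s} "
              f"spread={cueing_spread(cell):6.2f} "
              f"positive shift={positive_shift(cell):6.2f}")
```

Read this way, the spread for the low evaluation apprehension groups grows from 25.67 to 44.83 when the gatekeeper manipulation is added, while the positive shift for the high evaluation apprehension groups falls from 28.08 to 8.73, which is the unexpected pattern described above.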
While I have not here attempted a full description of the procedures of this complex study or of all the available analyses, enough has been presented to make clear the basis for the following conclusions: Evaluation apprehension has again
been shown to be a factor, or process, that mediates systematic biasing of the sort that is due to cueing (in this study, somewhat more indirect cueing than in our previous work) of the preferred pattern of experimental responding. A second variable, namely the perception of the experimenter as a ‘‘gatekeeper’’ (i.e., as one who controls access to further reward or ego-enhancement) has been shown to facilitate yielding to directional cueing, particularly when evaluation apprehension has been brought to a low, or inoperative, level. But the combination of high evaluation apprehension and the gatekeeper variables has not, as we thought it would, worked to maximize the degree of influence upon experimental responding that is exerted by directional cueing. Whether this is due to some artifactual considerations (or to some unintended and subtler pattern of evaluation apprehension that has, in turn, generated a more obscure response strategy) or whether it is our first encounter with a truly general effect remains to be determined through further research. In general this study does appear to add force to the claim that evaluation apprehension can contaminate the data gathering process, and it directs us toward a more complex consideration of other variables that interact with evaluation apprehension. The last study that I shall treat in this section was, in all but two respects, a close duplicate of the one just described. Thus, its design and procedures can be outlined in short compass. Subjects were again exposed to the high and low evaluation apprehension treatments and then to either positive or negative directional cueing or to no cueing at all. Again the experiment was conducted in the language laboratory setting with each subject working in a separate booth and all viewing and rating the projected pictures at the same time. 
The two major differences between this study and the previous one were: all subjects were female undergraduates; and, in place of the gatekeeper manipulation, we attempted a systematic, two-stage variation in which the pictures to be rated were presented under conditions of high and low ambiguity. Operationally, this meant that under the nonambiguity condition each successive picture was exposed in sharp focus for ten full seconds and the subject was to give her rating only after the exposure was completed. In the ambiguity condition a stable level of poor focus (low resolution) was employed and the picture was exposed for only three seconds. The basic hypothesis which led us toward this study on the evaluation apprehension × cueing × ambiguity interaction was that biased experimental responding due to evaluation apprehension (in interaction with directional cueing) will be a direct partial function of the degree of ambiguity in the stimulus materials to which such responding is coordinated. Basic to this prediction was the notion that the stimulus attributes of the particular pictures do, in interaction with the subject’s own judgmental standards, exert some influence upon his ratings. This is likely to be true even when a larger part of the variance in the ratings is controlled by the arousal and directional channeling of evaluation apprehension. To make the pictures more ambiguous is to make the stimulus attributes less readily available. This, in sum, should foster a further intensification of the subject’s reliance upon such cueing as he may have received and thus the bias effects should be intensified. Table 7-4 presents the mean algebraic sums of the liking ratings for the six cells in the ambiguity treatment and, separately, for the six cells in the nonambiguity treatment. The significant differences reported in the table help to make clear the findings from the analyses of variance that we carried out for both the ambiguity and nonambiguity conditions.
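Table 7–4 Like-Dislike Mean Sums for Groups; and Probabilities of Differences Between Groups

A. Nonambiguity condition
                                Liking treatment    Control    Disliking treatment
High evaluation apprehension    +24.50              +12.64     −26.27
Low evaluation apprehension     +7.73               +7.64      −17.08
Probabilities of differences between pairs of groups (listed in printed order; the cell pairings are not preserved in this layout): N.S.; p < .001; N.S.; N.S.; p < .0005; N.S.; N.S.; p < .003; p < .006.

B. Ambiguity condition
                                Liking treatment    Control    Disliking treatment
High evaluation apprehension    +9.85               −10.08     −7.00
Low evaluation apprehension     +12.69              −8.67      −16.17
Probabilities of differences between pairs of groups (listed in printed order; the cell pairings are not preserved in this layout): p < .09; N.S.; p < .04; N.S.; N.S.; p < .02; N.S.; N.S.; p < .008.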
Analysis of variance of the ratings from the half of the study in which the subjects rated the unambiguous photographs reveals comparatively strong cueing effects. On the liking ratings the cueing effect is highly significant (p < .0001); and for the success and intelligence ratings they are of borderline significance (p < .15, p < .07 respectively). Analysis of variance of the ratings from the half of the study run under the condition of stimulus ambiguity also reveals a significant main effect for the liking ratings (p < .002), but no effect for the success ratings (p < .68), and a borderline effect for the intelligence ratings (p < .14). However, while the like-dislike directional cueing exerts the predicted influence, we find that two other expectations are not directly confirmed: Within the separate ambiguity conditions we do not find that the high evaluation apprehension subjects are significantly more influenced by both types of directional cueing than are the low evaluation apprehension subjects; nor do we find significantly more biasing of responses in the cued directions in the ambiguous as compared to the nonambiguous treatments. Instead, what stands out is a complex interaction between direction of
cueing, low and high evaluation apprehension and the ambiguity-nonambiguity variable. This significant interaction can best be described in these terms: Under the ambiguity condition the positive directional cueing has greater influence upon the liking ratings than does the negative directional cueing; under the nonambiguity condition the negative directional cueing has greater influence than the positive directional cueing; and while this pattern is visible with both high evaluation apprehension and low evaluation apprehension subjects, it is somewhat stronger with the former under the nonambiguity condition and with the latter under the ambiguity condition. Further and more complex analysis of these data, and of related data gathered with an extensive postexperimental questionnaire, carries us partway toward unraveling the meaning of the triple interaction reported above. But all such interpretation remains uncertain without recourse to further, replicative study. At this point the speculative path that seems most accessible is one which highlights the interaction between our independent variables and the special meaning of the experimental task. This speculative path begins with the assumption that most (American middle class) persons take it to be socially desirable to show openness and liking toward others. Those subjects who have received cueing suggesting that most persons in past research have rated the pictures negatively face a conflict between their own expectations or half-shaped hypothesis and the directional cueing that has been addressed to them. With stimulus ambiguity high they may, in the resultant state of uncertainty, fall back upon their own, original expectations; and thus the positive cueing works more effectively upon them than does the negative cueing. 
However, with high clarity and detail in the photographs, typical subjects may be able to find evidence in facial and expressive characteristics onto which they can more readily impose the negative judgments that, according to the negative directional cueing, are typically made by ‘‘most people’’ who view these particular photographs. That the yielding to the negative cueing under the nonambiguous condition is greater for high evaluation apprehension than low evaluation apprehension subjects (the difference between control and dislike group means being 38.91 for the former and 24.72 for the latter) suggests the further pertinence of the interpretation offered here: for the high evaluation apprehension subjects, believing they are undergoing indirect personality assessment, have a greater stake in regulating their responses in the cued direction. In effect, our interpretation, reduced to its simplest form, suggests this further hypothesis: to yield to directional cueing that endorses an unpracticed response style, the person needs ‘‘something to work with,’’ that is, some supporting aspects in the experimental situation or in the proffered stimulus material which will enable him to view his yielding to the directional cueing as having some basis in ‘‘reality’’ rather than solely in his need to win a positive evaluation. Clearly this line of speculation, if strengthened by later research, moves our inquiry into self-presentation processes toward a subtler and more difficult kind of theorizing; one which will have to give fuller representation than heretofore to the limits and lures that the total experimental context provides for the subject who is attempting to regulate his experimental responding in a way that serves both his need for approval from others and, at the same time, from himself. 
As I said in opening this section, ‘‘we are just barely started on this line of inquiry.’’ Having now reviewed our completed studies on variables that strengthen or reduce the data-biasing influence of evaluation apprehension I am all the more
sensitive to the fact that this work has a decidedly preliminary air about it. Much more inquiry is required and as it proceeds we must get beyond our present and too simple classificatory taxonomy of variables and into the construction of a process or systems model of the flow of the evaluation apprehension dynamic. Further work along these lines, both experimental and theoretical, is contemplated. But for now we can, I think, conclude that at least this much has been established: Between the initial arousal of evaluation apprehension and the ultimate tilting of experimental responses in the direction that, as the subject sees it, will maximize positive evaluation, there is scope for influence through many intervening and subsidiary variables. The few we have so far investigated appear to me to derive their influence in either or both of two ways: they may directly affect the subject’s perceptions of how his responses will be judged; or they may affect his estimate of the importance of winning a positive evaluation from the particular experimenter in his particular experimental setting.
Evaluation Apprehension and the Experimenter Expectancy Effect
Three research strategies have been featured in the work I have already reported: altered replication, demonstration experiments and experiments on intervening or additive variables. Yet one other related research approach has figured in our recent work on the evaluation apprehension process. Simply described, this involves manipulating evaluation apprehension (by arousing or confirming it for some groups and suppressing or disconfirming it for others) and then examining the consequences for some other phenomenon or relationship of psychological interest. In general this strategy would appear to be relevant whenever one suspects that evaluation apprehension operates as a mediating or facilitating condition for an already established relationship between other variables. Directly illustrative of my meaning is the possibility that evaluation apprehension may well be involved in the experimenter expectancy effect: that is, a state of concern over whether the experimenter will judge him as ‘‘normal’’ or ‘‘abnormal’’ may affect the way in which the subject perceives the experimenter’s meanings, preferences, and aspirations within the experimental situation. To be specific: if, as the work of Rosenthal (1966) and Friedman (1967) suggests, the experimenter’s expectancy is subtly communicated by aspects of his expressive style, the subject who is possessed of a concern over evaluation may well be more closely and accurately attuned to such indirect communication; or he may be more motivated to act upon the basis of what has been indirectly communicated. To investigate such a possibility, then, one would attempt to replicate a standard experimenter expectancy study with at least two groups of subjects: one aroused to a high level of evaluation apprehension and one in which all tendencies toward this pattern of concern have been effectively diminished. 
In a sense this research strategy can be viewed not as a fourth and new one, but as a variant of the altered replication approach described in the first section of this chapter; but in this variant, instead of eliminating evaluation apprehension we attempt also to arouse it. However one wishes to classify it, this strategy has proved effective in the one realm in which it has already been employed. As the title of this section and the illustration offered above have already suggested, that realm has been the further study of the mediation of the experimenter expectancy effect.
Before I turn to an account of our studies in this area I should like to comment briefly upon the relationship between my own preoccupation with the evaluation apprehension process and the work of other investigators of the ‘‘social psychology of the psychological experiment.’’ From the record of research (much of which is summarized in other chapters in this volume) on demand characteristics, subject presensitization, volunteer effects, and experimenter expectancy effects, it seems abundantly clear that there are a number of sources of systematic bias in experimental data. For a long time these went unsuspected and, it can be assumed, contributed considerable nonrandom error to the data through which theoretical propositions were tested or inspired. In the main I am persuaded by the work of others that the various processes that have been conceived as making for systematic bias do, in fact, have considerable operative force. And, obviously, I think and have tried to show that the same is true of the evaluation apprehension process. We have then developed an empirically verified catalogue of data-biasing variables and processes. So far so good. But it seems apparent to me that we have now reached a stage at which we need not be content with a mere catalogue. Some larger, more integrative theory of the experimental-transactional process is required. The development of such a theory will afford intellectual satisfaction in itself; but, equally important, it will probably also contribute to a richer understanding of the role of self-representational dynamics in nonexperimental, interaction situations; and, of course, it will promise considerable further advance in improving the methods of research design and execution in all those disciplines (psychology is only one) whose data are gathered through interaction between the investigator and other, investigated persons. 
I shall not presume to suggest the possible shape of a full and general integrative theory of the experimental process, though in the concluding section I shall risk a few preliminary speculations upon some aspects of such a theory. However, at this point I want only to register this obvious point: the development of this sort of theory will be advanced by—indeed, it may require—the prior investigation of the interaction and overlap between the biasing processes that are now separately delineated in our catalogue. A few investigations of this type have already been attempted; the study by Sigall, Aronson, and Van Hoose (1968) discussed above, is one. The three studies I shall now describe represent another such contribution. They are all focused upon the interaction between evaluation apprehension and experimenter expectancy. More particularly they are attempts to test the proposition already advanced: that is, that the experimenter expectancy effect is mediated or facilitated by evaluation apprehension. At the same time, the last of these studies also bears upon another important aspect of the experimenter expectancy effect; namely, the paralinguistic content of the experimenter’s communications to the subject. In our first effort in this realm my coinvestigator was Marshall Minor and a portion of this study served as his doctoral dissertation (Minor, 1967). We had two basic purposes: to replicate Rosenthal’s finding that the expectancy held by an experimenter can introduce ‘‘experimenter bias’’ into the research situation so that the expectancy is confirmed by the response data elicited from subjects; and, as I have already indicated, to show that the experimenter bias effect is mediated by evaluation apprehension. Particularly we hypothesized that the experimenter bias effect will be intensified when subjects believe that their experimental responses
may be utilized to evaluate their psychological adequacy, and that the effect will be diminished when they define the situation as one in which their psychological adequacy is not likely to be evaluated. The design of this study called for 16 naive male experimenters (eight given the +5 expectancy and eight the −5 expectancy) to separately run four subjects (two male and two female) through the Rosenthal picture rating task. (In this standardized task the subject rates each of a series of pictured persons on a scale from −10 to +10 for ‘‘whether the person pictured has been experiencing success or failure.’’) Half of all subjects would have already been roused to a high level of evaluation apprehension and the other half to a low or ‘‘suppressed’’ level of evaluation apprehension. Balancing in the assignment of subjects to experimenters was to be arranged so as to enable a statistical control for sequence effects, sex differences, and other possible unintended influences upon the response data. Difficulties in recruiting the full complement of subjects during the University of Chicago summer quarter of 1966, and the failure of one experimenter to keep his scheduled appointment, reduced the actual situation to one in which 15 experimenters ran 23 male and 16 female subjects. However, it was possible to maintain partial balance in subject assignment and to effect statistical analytic controls for the ‘‘holes’’ in the matrix of experimenter-subject pairs that were actually completed. The latter type of control was made possible by use of the University of Chicago MESA 95 computer program (we are indebted to Professor Darrel Bock for introducing us to this program and for helping us to fit it to our needs) in which, as each mean square for an effect is determined, a transformation is made on the model by eliminating previously estimated effects from the succeeding mean squares.
This made it possible for us to get an estimate of the expectancy × evaluation apprehension interaction that was independent of sex and sequence variables. (A full account of the handling of this and other analytic problems is found in Minor, 1967, and the special analysis of variance program employed is described in Bock, 1965.) Some words about procedures and operations are required before we review the basic findings of this study. The naive experimenters, all male graduate students in the school of education, reported individually for the scheduled experiment in whose execution they had promised to assist. After receiving a standardized background introduction to the experimental task each experimenter read a document giving further instructions. These expectancy-establishing instructions comprised a modified version of one of the methods used by Rosenthal (1966). The crucial content was in the last paragraph:
For your information, we have found from past research that certain types of people tend to rate these pictures in very particular ways. On the basis of personality test data that we collected earlier this quarter from the subjects you will be running, you should expect them to give an overall average rating of +5 (−5). This is because our earlier testing has shown that all of the subjects whom you will be running are in the category that we are calling ‘‘success perceivers’’ (‘‘failure perceivers’’). Thus if you will follow instructions and use proper experimental procedure, they will rate the pictures more extremely positive (extremely negative)—thus obtaining an average of about +5 (−5)—than would be the case if your subjects had simply been selected at random.
Each subject, before being introduced to the experimenter who was to administer the experimental task, was asked to ‘‘please read this sheet which will give you a brief description of the purpose of today’s experiment.’’ Half of the subjects read an explanation intended to arouse a high and stable magnitude of evaluation apprehension. The other half read an explanation designed to reduce and, hopefully, eliminate any tendency toward evaluation apprehension that the subjects might have brought with them into the experimental situation. The full logic that lay behind the construction of these two communications (which are similar to the ones used in our earlier studies on the gatekeeper and ambiguity variables) will not be traced here. However, much of it should be apparent from the content of the high and low evaluation apprehension communications which are fully reproduced below.
High Evaluation Apprehension Communication
‘‘Today, you will be participating in a psychological experiment; and shortly you will be assigned to an experimenter who will explain the task to you. Although we are not able to answer any questions until after the experiment is over, we do want to give you a brief description of the purpose of the experiment. This should make participating more interesting and meaningful for you. Also, a growing number of psychological researchers are beginning to realize that they have an ethical responsibility to make the purpose of their experiments known to the individuals who are helping them out by participating in their research.
‘‘We are interested in studying social perception (i.e., how people perceive other people). More specifically, we want to find the factors which increase or decrease the accuracy of an individual’s perception of other people. Certainly, with the lack of understanding in the world today, we do need to find out as much as we can about the reasons for inaccurate social judgment.
‘‘Prior research by ourselves and others indicates that, typically, poor social perception is associated with psychopathology. That is, people who are not able to accurately perceive how other people are feeling, or what they are experiencing, usually are found to be psychologically maladjusted. Much of our initial research in this area indicates that on the basis of performance on the social perception task, we can pick out from a college population those students who would be judged clinically to be maladjusted.
‘‘Several other researchers have presented data which support the preceding findings. Morgan and Provino (J. of Abnormal and Social Psychology, 1963) for example, report that in a college setting, the Social Perception Test could make rather subtle discriminations between varying degrees of emotional maladjustment and normalcy.
‘‘The purpose of today’s experiment, therefore, is to replicate the previous results, and thus to test further the generality of the finding that people who cannot accurately judge what other people are experiencing tend to be psychologically maladjusted.’’
Low Evaluation Apprehension Communication
‘‘Today, you will be helping us to collect some preliminary data which we will use in setting up a subsequent research project. Shortly, you will be assigned to an experimenter who will explain the task to you. Time does not permit us to answer any questions, but we are able to give you a brief description of the purpose of the study. This should make participating more interesting and meaningful for you.
‘‘We are interested in studying social perception (i.e., how people perceive other people). More specifically, we want to find the factors (e.g., fatigue, practice, etc.) which increase or decrease the accuracy of an individual’s perception of other people.
‘‘Before we can investigate these different factors, however, we have to know how people perceive the feelings and experiences of others when these experimental factors are not present.
‘‘That is, we need a control, or standardization, group to use as a baseline against which we can judge the effects that our experimental factors have on social perception. This is the reason for your participation today.
‘‘We intend to average the performance of all of the students participating today, so that we will have a measure of how subjects perform on the task when such experimental variables as fatigue and prior practice are not present. This information will allow us to judge the effects which our experimental variables have when they are used with a subsequent group of students.
‘‘In other words, today’s group will help us to find out how subjects typically perform on the task. Later, we can use the data we receive here to judge the performances of subsequent experimental groups of subjects.’’
As in the typical Rosenthal experiment, interaction between experimenter and subject was held to a minimum level in which the experimenter read the picture rating instructions to the subject and collected his ratings for each of the ten pictures. Upon completion of this phase the subject, no longer in contact with the experimenter, filled out an extensive postexperimental questionnaire and was thoroughly interviewed. The same was done with each experimenter after he had completed running all of his assigned subjects.
In a last phase experimenters and subjects were brought together for a full ‘‘debriefing’’ and for extended discussion, considerable care being taken to alleviate any lingering concern that might be felt by subjects who had been assigned to the high evaluation condition.
From the full analysis of variance three significant findings were obtained: Between experimenters, the expectancy variable (+5 versus −5 experimenter expectancy) controls a significant portion of the variance in their subjects’ ratings of how successful the pictured persons have been (p < .05). In the ‘‘within experimenters’’ analysis the sex of the subjects operates significantly (p < .03) reflecting a general tendency for females in either the +5 or −5 expectancy groups to rate the pictured persons as less successful than the respective male subjects in the same expectancy treatments. Most relevant to our major interest is the finding of a rather strong interaction between expectancy and evaluation apprehension (p < .02). The basis for this significant interaction is clearly revealed by a comparison of the mean photo ratings obtained from the +5 and −5 expectancy groups under both the high and low evaluation apprehension conditions respectively. (The male-female proportions are roughly equivalent in each of these four groups.) With evaluation apprehension reduced or suppressed (i.e., under the low evaluation apprehension treatment) the mean picture ratings for the +5 and −5 expectancy groups are −.78 and −.59 respectively. The difference between these means is not significant. Under the high evaluation apprehension condition the +5 and −5 expectancy group means are +.16 and −1.06 respectively. This difference is significant at a probability lower than .002. Figure 7-1 provides a graphic representation of our basic finding that the experimenter expectancy effect is obtained, as predicted, when evaluation apprehension has been aroused and is not obtained when evaluation apprehension has been reduced or eliminated.
Figure 7–1 Response to Experimenter Expectancy as a Function of Evaluation Apprehension Level. [Plot not reproduced: mean of means of picture ratings (vertical axis, from Unsuccessful to Successful) at Low EA and High EA, with separate lines for the +5 and −5 expectancy groups; the difference between the groups is N.S. at Low EA and p < .002 at High EA. Predicted order of means, with obtained means: High EA +5 EE, +.16; Low EA +5 EE, −.78; Low EA −5 EE, −.59; High EA −5 EE, −1.06.]
Many additional aspects of the data analysis serve to develop further detail on the picture sketched here and to further strengthen our overall conclusions. These matters will be more fully reported in a separate publication. However, one particular subsidiary finding is worth noting here. An index reflecting the degree to which each experimenter was successful in inducing bias under the high evaluation apprehension condition was computed. This index was separately correlated with the scores from various questionnaires that were administered to the experimenters after they had run all their subjects. The two strongest correlations obtained were those with the Marlowe–Crowne Social Desirability scale (r = .40, p < .06) and the Sarason Test Anxiety Scale (r = .57, p < .01). This suggests the possibility that something like evaluation apprehension is involved not only in mediating the responsiveness of the subject to the experimenter’s bias-inducing cues but perhaps also in setting the experimenter to emit such cues. At any rate, these findings suggest an empirical hypothesis worthy of further and more direct study, namely: that assigned expectancies will have a greater influence upon subject performance when the experimenters to whom these expectancies have been assigned have a high need for approval and a tendency to be apprehensive over the evaluation of their own competence.
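The expectancy by evaluation apprehension interaction in this first study can be checked by simple arithmetic on the four cell means reported for it (−.78 and −.59 under low evaluation apprehension, +.16 and −1.06 under high, as also listed with Figure 7-1). The Python sketch below is only an illustration: the dictionary layout and the function names are hypothetical, and it works from the rounded published means rather than from the original analysis of variance, so it reproduces the pattern of the means and not the significance tests.

```python
# Cell means of the picture ratings as reported in the text.
# Keys are (evaluation apprehension level, assigned expectancy);
# this layout is a hypothetical convenience, not the authors' data format.
cell_means = {
    ("low", "+5"): -0.78,
    ("low", "-5"): -0.59,
    ("high", "+5"): 0.16,
    ("high", "-5"): -1.06,
}


def expectancy_effect(ea_level):
    """Mean rating under the +5 expectancy minus the mean under the -5 expectancy."""
    return cell_means[(ea_level, "+5")] - cell_means[(ea_level, "-5")]


def interaction_contrast():
    """Expectancy effect at high EA minus the expectancy effect at low EA."""
    return expectancy_effect("high") - expectancy_effect("low")


if __name__ == "__main__":
    for level in ("low", "high"):
        print(f"{level} EA expectancy effect: {expectancy_effect(level):+.2f}")
    print(f"interaction contrast: {interaction_contrast():+.2f}")
```

On these means the expectancy effect is about −.19 under low evaluation apprehension (a slight reversal, reported as not significant) and about +1.22 under high evaluation apprehension, so the contrast between the two is about 1.41 scale points.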
Upon completion of the analysis of this study we decided to attempt an altered and expanded replication. The major intended changes were these: to completely fill the matrix of required experimenter-subject combinations and thus handle the problem of sequence effects without recourse to the sort of statistical corrections that were required in the previous study; to run an ‘‘evaluation apprehension control’’ condition in which we would attempt to neither increase nor diminish the subjects’ original, nonmanipulated, evaluation apprehension level; to run a ‘‘zero expectancy’’ as well as a +5 and −5 expectancy condition. A further purpose was to try out a way of staging the study which combined some features of mass administration (e.g., subjects reading the initial evaluation apprehension communications while waiting in a large reception room) with the individual running of subjects on the picture rating task. Our hope in this last regard was to increase the efficiency of our own experimental procedures and those employed earlier by the Rosenthal group. In this study, then, each of 33 male experimenters (11 each having been given the +5, 0, or −5 expectancies respectively) ran 3 male subjects on the Rosenthal picture rating task (one each having first received the high, low, or control evaluation apprehension communication respectively). The first two of these communications were slightly modified versions of the ones used in the earlier study and the last was a simpler one that merely advised the subject that he would shortly be assigned to an experimenter and asked him to wait until called upon. The main results of this study can be quickly told. We failed to replicate the basic experimenter expectancy effect. Analysis of postexperimental questionnaire and interview data from both subjects and experimenters suggests that this was due to our having failed to provide a credible experimental staging.
In attempting to maximize efficiency in the routing of subjects and experimenters we seem to have aroused considerable suspicion about our own unrevealed purposes and about the actual contents of the communications intended to manipulate evaluation apprehension. Without going further into the details of our post-hoc analysis of the suspicion-arousing aspects of the experimental procedures used, it may be said that a number of valuable cautionary points became clear to us and that we have profited from these in our further attempts at experimental investigation of experimenter bias, evaluation apprehension and kindred processes. In fact, on the basis of our first attempt at running a partially group-administered experiment in this realm, we were able to develop a different approach which was used in our next study in this sequence. This approach is just as efficient or more so, and yet seems to keep suspicion and other intrusive artifacts at a very low level. Before turning to a description of the major study just referred to it will be necessary to briefly describe a study that was stimulated by our earlier work but was not undertaken as part of our research program. Starkey Duncan, a clinical psychologist who had been working on paralinguistic aspects of communication within the psychotherapeutic situation, became interested in our work on experimenter bias effects. Through our joint consultations he came to the conclusion that the mediation of biasing cues in the Rosenthal paradigmatic situation might largely depend upon variations in the nonlinguistic aspects of the experimenter’s spoken communications to the subject. Particularly, he conjectured that the way in which the experimenter varied the intensity, intonation, pitch, and rhythm aspects of his reading of the instructions for the picture rating task might
convey to the subject an extra-linguistic (or, more properly, a ‘‘paralinguistic’’) indication of the experimenter’s expectancy regarding the responses the subject was about to make. Duncan and Rosenthal proceeded to design a preliminary study to test this hypothesis. From films provided by Rosenthal, Duncan transcribed sound tapes of vocal readings of the instructions; three from each of two comparatively high biasing experimenters and four from a third high biasing experimenter. Together with the films from which the tapes were made, Rosenthal also provided the picture rating data obtained from the respective subjects who had received these separate vocal readings of the instructions. The taped readings were blindly coded on a number of different paralinguistic dimensions. The coding procedure used was based upon Duncan’s earlier work. This procedure is extremely detailed and, with trained coders, yields high inter-judge reliability scores. While the coding method will not be further described here, the results of this preliminary study can be simply summarized. Based only upon the coding of the instruction-reading tapes, Duncan was able to demonstrate that a large amount of the variance in the mean picture ratings given by the subjects could be accounted for by reference to the ‘‘Differential Emphasis Score’’ for each of the separate instruction readings that the respective subjects had received. The Differential Emphasis score is a single index which reflects the degree to which the experimenter, in his vocal reading, has emphasized (through variations in volume, pitch, rhythm, etc.) either ‘‘success’’ or ‘‘failure’’ and either the positive or negative ends of the rating scale. 
The correlation between differential vocal emphasis and the subjects’ subsequent picture ratings was +.72 (p < .01); and all subjects who had heard greater emphasis on the rating alternatives associated with success subsequently rated the photos as being of more successful people than the subjects who heard readings that placed greater emphasis on the failure alternatives (p < .001). An additional finding of considerable interest was that the correlation between experimenters’ assigned expectancies and the Differential Emphasis Scores was only .24. This suggests that though the pattern of emphasis used by the experimenters is influenced by the assigned expectancy it often varies from that expectancy in the direction of giving either greater or lesser than average emphasis to it. It suggests further that even where the relation between assigned expectancy and the subjects’ picture ratings is low, the experimenter may actually be influencing the subject (through his deviant pattern of vocal emphasis) a good deal more than has previously been suspected. From this preliminary study it seemed clear that with paralinguistic analysis considerable further progress could be achieved in pursuit of the difficult question of just how experimenter expectancy effects are mediated. Since the Duncan–Rosenthal study had used a variant of the method of postdiction it seemed especially desirable to attempt a more ambitious and more fully controlled study. We would reverse the procedure, moving from postdiction to prediction; this would be accomplished by exposing subjects to vocal readings selected for their paralinguistic direction (i.e., ‘‘success’’ or ‘‘failure’’) and the degree of differential paralinguistic emphasis.
Thus by experimental manipulation we could gain a closer and more stringent test of the hypothesis that in the typical experimenter bias study (and also in studies that may be inadvertently contaminated by bias effects) the subject’s responses are influenced through paralinguistic aspects of the experimenter’s communication to him.
At the same time we planned to extend our earlier inquiry into the way in which the experimenter expectancy effect is mediated by the subject’s state of evaluation apprehension. Thus the next study (in which I was joined by Duncan and Jonathan Finkelstein) was an experimental, manipulative investigation of the separate and interacting influences upon subjects’ response patterns of both paralinguistic emphasis and evaluation apprehension. In the first phase we obtained from a number of colleagues and graduate students taped readings of the basic instructions for the Rosenthal picture rating task. Our request was that the first reading be given in an ‘‘objective and balanced’’ manner and that subsequent readings be ‘‘slightly shaded’’ in either a positive (i.e., ‘‘success’’ stressing) or negative (i.e., ‘‘failure’’ stressing) direction. None of these ‘‘experimenters’’ heard the readings of any other and each went about his ‘‘balancing’’ and ‘‘shading’’ in strictly his own manner. After all the resulting speech samples were transcribed and scored for paralinguistic Differential Emphasis we were able to select a set of nine readings (three from each of three readers) to be used in the study. The three instruction readings taken from one of these experimenters were scored as balanced (i.e., no differential emphasis), moderate positive (i.e., intermediate bias toward an emphasis on perceiving the pictured persons as successful), and strong positive, respectively. From a second reader we had balanced, moderate negative, and strong negative readings; and from a third we had balanced, moderate positive, and moderate negative readings. In the basic design of this study each of the nine instruction tapes was combined with each of three evaluation apprehension conditions, thus yielding a 27 cell design. The evaluation apprehension conditions employed were High, Control, and Low. 
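As a rough illustration of how the nine instruction tapes cross with the three evaluation apprehension conditions, the following Python sketch enumerates the cells of the design. The reader labels (`reader_1` and so on) and the function name are hypothetical conveniences, not names used in the study; only the readings and the conditions themselves are taken from the description above.

```python
from itertools import product

# Readings contributed by each of the three readers, as described in the text.
# The reader labels are hypothetical placeholders.
READINGS = {
    "reader_1": ["balanced", "moderate positive", "strong positive"],
    "reader_2": ["balanced", "moderate negative", "strong negative"],
    "reader_3": ["balanced", "moderate positive", "moderate negative"],
}
EA_CONDITIONS = ["high", "control", "low"]


def design_cells():
    """Return every (reader, reading, EA condition) cell of the design."""
    tapes = [
        (reader, reading)
        for reader, readings in READINGS.items()
        for reading in readings
    ]
    return [
        (reader, reading, ea)
        for (reader, reading), ea in product(tapes, EA_CONDITIONS)
    ]


if __name__ == "__main__":
    cells = design_cells()
    print(len(cells), "cells")
    for cell in cells[:3]:
        print(cell)
```

Crossing the nine tapes with the three conditions gives the 27 cells of the design.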
As in our earlier studies the evaluation apprehension manipulations were effected through a ‘‘Background Information Sheet’’ read by the subject. The evaluation apprehension bolstering and suppressing communications were similar to those used in earlier studies. The control evaluation apprehension group received no Background Information Sheet and was given no advance ‘‘explanation of the experiment.’’ The subjects were 216 female undergraduates (eight per cell) who had volunteered in response to telephone calls requesting their participation in a study of person-perception. No payment or other rewards were offered. All experimental sessions were run in the University of Chicago language laboratory. In this facility the separate listening booths with multi-channel receivers could be easily adapted to a basic requirement of our design: namely, that within each administration group (N varied from 8 to 12 for the successive groups) each of the three thirds would respectively hear one of the three different readings of the instructions recorded by a single experimenter. At the beginning of the experimental session, after all subjects were seated in their randomly assigned booths, they first heard a taped message thanking them for coming and, for the high and low evaluation apprehension groups, directing them to read the Background Information Sheet which was in a packet in front of the subject. After a five-minute pause for this purpose (control subjects were run in separate groups and had no such pause) each subject heard one of the taped readings of the Rosenthal instructions. Immediately following this the photographs to be rated for degree of ‘‘success’’ or ‘‘failure’’ were projected onto a screen in front of the
booths, each for ten seconds. The subjects recorded their own ratings on a standardized rating sheet which they were also required to sign. After the rating sheets had been collected a postexperimental questionnaire was distributed and following its completion and collection all subjects went through a thorough debriefing and were pledged to keep the purpose and design of the study confidential. Before data analysis was undertaken 35 subjects were eliminated on the basis of important manipulation validation items from the post-experimental questionnaire. Thirteen were eliminated because they indicated that they had been aware of the purpose of the experiment; and 22 were eliminated either because they were in the low evaluation apprehension conditions and rated the Background Information Sheet as ‘‘anxiety arousing’’ or in the high evaluation apprehension condition and rated the Background Information Sheet ‘‘reassuring.’’ Analysis of the data was based on the mean picture rating for each subject. Because comparisons between experimenters were not made in transcribing or scoring the readings of the instructions, and thus were not reflected in the Differential Emphasis scores, it was necessary to adjust the subject means to take into account any differences among the experimenters. For each experimenter, therefore, the mean of all subjects in his control condition (i.e., the mean of the picture rating means from the subjects who heard his ‘‘balanced’’ reading of the instructions and had not received evaluation apprehension manipulation) was subtracted from the separate means of all his other subjects (i.e., those who heard his ‘‘biased’’ readings). These adjusted scores were used in our basic analysis. 
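The adjustment just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual procedure: the record layout (the keys `reader`, `reading`, `ea`, and `mean_rating`) and the function names are hypothetical, and the numbers in the demonstration are invented solely to show the arithmetic.

```python
from statistics import mean


def control_baselines(subjects):
    """Mean rating in each reader's control cell (balanced reading, no EA manipulation)."""
    by_reader = {}
    for s in subjects:
        if s["reading"] == "balanced" and s["ea"] == "control":
            by_reader.setdefault(s["reader"], []).append(s["mean_rating"])
    return {reader: mean(values) for reader, values in by_reader.items()}


def adjusted_scores(subjects):
    """Subtract the reader's control-cell mean from each biased-reading subject's mean."""
    baselines = control_baselines(subjects)
    return [
        {**s, "adjusted": s["mean_rating"] - baselines[s["reader"]]}
        for s in subjects
        if s["reading"] != "balanced"  # only subjects who heard a biased reading
    ]


if __name__ == "__main__":
    # Invented numbers, for illustration only; these are not data from the study.
    demo = [
        {"reader": "A", "reading": "balanced", "ea": "control", "mean_rating": 0.5},
        {"reader": "A", "reading": "balanced", "ea": "control", "mean_rating": 1.5},
        {"reader": "A", "reading": "positive", "ea": "high", "mean_rating": 2.0},
        {"reader": "A", "reading": "negative", "ea": "low", "mean_rating": 0.25},
    ]
    for row in adjusted_scores(demo):
        print(row["reading"], row["ea"], row["adjusted"])
```

Centering each reader's subjects on that reader's own control cell removes any overall difference between readers, so that the adjusted scores reflect only the shading of the reading and the evaluation apprehension condition.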
Preliminary analysis indicated no significant difference in bias induction between the subjects who heard the moderate and strong positive readings from one of the experimenters or between the subjects who heard the moderate and strong negative readings from another of the experimenters. However, on inspection, clear differences were visible between those who heard positive and negative readings respectively, while those who heard ‘‘balanced’’ readings occupied an intermediate position. It was apparent, then, that somewhat subtler shadings of volume, pitch, and rhythm were just as effective as more pronounced ones in conveying a differential emphasis which influenced subject response patterns. In our further analysis we combined the data from subjects who had heard the moderate and strong positive readings of the instructions and, separately, from those who heard the moderate and strong negative readings. When we tested the difference in scores between all subjects who had received the positive differential emphasis and those who had received the negative differential emphasis, the predicted main effect was confirmed (p < .02). In a further and more detailed analysis the mean scores from the six separate cells were arranged in the ascending order that could be predicted from the assumption that the effects of differential emphasis would be facilitated to the degree that evaluation apprehension was experienced. That predicted order was: high EA, negative differential emphasis; control EA, negative emphasis; low EA, negative emphasis; low EA, positive emphasis; control EA, positive emphasis; high EA, positive emphasis. An analysis of variance was executed to determine whether the predicted order did, in fact, obtain. The resultant linear trend was found to be significant (p < .02). Figure 7-2 reveals the basis for this summary statistic and reports further probabilities obtained through application of the Mann–Whitney Rank Sum Test. Thus
Figure 7–2 Response to Paralinguistic Differential Emphasis as a Function of Evaluation Apprehension Level. [Plot not reproduced: mean of means of picture ratings (vertical axis, from Unsuccessful to Successful) at Low EA, Control EA, and High EA, with separate lines for positive and negative bias; the difference between the lines is N.S. at Low EA, p < .02 at Control EA, and p < .006 at High EA. Predicted order of means, with obtained means as listed: High EA Positive, +.84; Control EA Positive, +.57; Low EA Positive, +.07; Low EA Negative, +.42; Control EA Negative, −.20; High EA Negative, −.39.]
subjects who had first read the Background Information Sheet designed to remove or reduce evaluation apprehension were apparently uninfluenced by the differential emphases conveyed in the various instruction tapes that they heard; no significant difference is found between the scores of the low evaluation apprehension groups that respectively heard positively and negatively biased readings of the instructions. On the other hand, when subjects have read a communication designed to confirm and heighten evaluation apprehension their picture ratings are very strongly influenced by the paralinguistic shading of the instruction tapes in either positive or negative directions (p < .006). Similarly we find that when evaluation apprehension is not manipulated (i.e., when, in the evaluation apprehension control condition, it is allowed to operate at a level that we may assume to be set by the interaction between the experimental task and the subject’s personality) we obtain a smaller, but still significant, difference between subjects exposed to positively and negatively shaded readings of the instructions. The scope of this difference (p < .02) is roughly the same as that reported in typical successful experiments by the Rosenthal group. Further and more detailed analysis of these data remains to be carried out— particularly an analysis program that will draw upon material gathered through an open-ended postexperimental questionnaire. But on the basis of the main findings reported above we feel that we can now more clearly discern the nature and dynamics
of the experimenter bias effect. Loosely stated, the process appears to be one in which subtle paralinguistic shadings in the experimenter’s communications do convey his expectancies or preferences as regards the response choices that the subject must make.4 Whether the subject will be attuned to these paralinguistic cues or, if attuned, whether he will allow them to influence his experimental responding, may depend upon a number of things; but we are now in a position to conclude that one of these considerations, and probably an extremely important one, is whether the subject has come to perceive the experiment as one in which the experimenter is likely to form judgments about the subject’s psychological adequacy or attractiveness.
Some Recommendations and Open Issues
In bringing to conclusion a chapter that has already probably taxed the reader by its length and detail I shall resist the temptation to elaborate further upon my basic argument and its supporting evidence. By way of general summary it shall suffice to say that I have tried to explicate a conceptualization of the evaluation apprehension process and that all of the present studies appear to show that this process does induce systematic bias in experimental responding. Some of the research studies that have been reviewed have served an additional purpose: they have delineated and examined certain variables that appear to facilitate or restrict the operation of the evaluation apprehension biasing process. On the basis of the present studies I think it reasonable to put the seal of provisional validation (and the judgment that they are worthy of further experimental study) upon the following propositions:
The biasing influence of evaluation apprehension upon response data will be reduced if those data are collected by an experimenter other than the one whose evaluative judgment was the original focus of the subject’s concern.
When a response pattern cued as likely to bring positive evaluation is also counternormative, subjects high on the need for approval will be more likely to produce that response pattern than subjects low on the need for approval.
The availability of continuous feedback about the quality of a subject’s performance will facilitate his shaping that performance in the direction he thinks likely to earn him a favorable evaluation from the experimenter.
The less effortful the response direction that has been cued as likely to bring positive evaluation, the more will the subject go in that direction.
When the experimenter is perceived by the subject as having “power” over him (in the sense of controlling his access to some goal region or activity) this will foster the biasing of his responses in cued directions; and this will be particularly likely in the absence of other conditions that directly arouse evaluation apprehension.
4. Of course, we cannot logically rule out the possibility that still other modes of mediation such as “kinesic” cueing may play a role in conveying the experimenter’s expectancy to the subject. What is clear is that no such additional channel of indirect communication was open in the present study since the only experienced differences between the three separate instruction readings contributed by any single “experimenter” lay in their differential paralinguistic emphases. Also relevant in this connection is this further fact: In the present study, the magnitude of the experimenter expectancy effect under both the control and high evaluation apprehension conditions was as great as, or greater than, that obtained in most experiments concerned with this type of bias. This strongly suggests that differential paralinguistic emphasis is the main, if not the only, process through which the direction of the experimenter’s expectancy is transmitted to the subject.
The Conditions and Consequences of Evaluation Apprehension
When the subject expects that a particular type of judgmental response will earn him positive evaluation from the experimenter, and when that type of response is also counternormative or unpracticed, his adoption of it will be facilitated by clarity in the stimuli to be judged.
Still other propositions supported or suggested by the research reviewed in this chapter could be summarized. But the past work is prologue to present and future concerns. Thus, my main purpose in this concluding section will be to address some interesting implications and open issues that seem to be suggested by the studies that have been reported here. The first of these is the question of whether one can draw from the present research and analysis any clear prescription concerning the conduct of psychological and related forms of research. A number of fairly obvious recommendations do come easily to mind. One of these I have already suggested: the altered replication approach does seem to afford a way of testing reinterpretations of experiments whenever it is suspected that the original data were influenced by inadvertent arousal of evaluation apprehension. This strategy can and should be more widely employed. Disputatious reinterpretation of the other man’s research is easier than research itself; but the latter legitimates the former and assesses its relevance. Thus, whenever possible, these activities should be joined. Another prescription follows rather obviously from our studies, described in the previous section, which demonstrated that evaluation apprehension mediates the experimenter-expectancy effect. These studies do seem to show that evaluation apprehension and its data-distorting effects can be eliminated (or, at least, minimized) if one defines the experimental situation in a certain way for the subject.
Whatever the particular details of such preliminary communications, they should lead the subject to perceive at least two things about the experimenter and his experiment: that his interest is focused not so much upon individuals in their uniqueness as upon aggregates of persons in their normative or nomothetic aspect; that some purpose far more technical (and perhaps more “dry”) than personality study is being prosecuted by the experimenter.5 Credible messages to this effect, or “accidental” revelations of the same order, can probably be rather easily developed and almost as easily pretested. Undoubtedly, content and style will need to be varied with types of subjects and types of experimental situations; but if interest in handling this problem became widespread we would soon develop a technology of evaluation apprehension control that would, I think, contribute significantly to improving the quality and trustworthiness of psychological research. However, such a change in standard procedure would raise an important problem concerning an aspect of experimental method that has, in recent years, become quite ritualized. Should the postexperimental “debriefing” include an explanation of the evaluation apprehension problem and of the way it was brought under control? Recently, some commentators have argued that debriefing should not be conducted unless it is required to reduce anxieties or ego-injuries directly due to the experiment,
5. At the same time it would probably be necessary to guard against making the experiment seem so empty of purpose or relevance as to destroy the subject’s motivation to remain psychologically involved in it. Clearly, some art (and some validation of its products) will be required in the further development of techniques for limiting and reducing evaluation apprehension.
Book One – Artifact in Behavioral Research
and unless it is also clear that the debriefing itself will not embarrass the subject or diminish his self-esteem by demonstrating his gullibility. Against these considerations I would give great weight to the notion that experimenters have an ethical obligation to be as frank as possible with their subjects, even though full, ingenuous revelation must be deferred until all data are collected. Nor do I think that such revelation need be degrading. Whether the subject comes out of the debriefing feeling tricked, and exposed as an “easy mark,” or whether he comes out with a sense of having participated in a useful endeavor in which he played an important part and was honorably treated would, I think, depend largely upon the secret motives and visible style of the experimenter. Surely, as Kenneth Ring (1967) has suggested, the “fun and games” approach to experimental social psychology degrades subjects, trivializes research and, I would add, quite probably activates the evaluation apprehension dynamic so as to induce unsuspected but sizeable systematic bias in resultant data. Candid and thorough debriefing, unmarred by any proclivity toward gloating, can do much for the experimenter’s self-image and probably it also serves the enrichment of the subject’s experience and knowledge. However, it does generate a further problem, as much for experimenters who may employ evaluation apprehension control procedures as for experimenters pursuing other approaches. I refer, of course, to the risk that, despite the elicitation from the subject of a pledge to say nothing about the experiment to other potential subjects, the vessel of secrecy may spring leaks. This, in turn, may spoil the host culture of the naive subject pool without the experimenter knowing that anything of the sort has happened.
It is my impression that the pledge to postexperimental secrecy is usually internalized when a bond of mutual trust has been woven; and I am not aware of anything that works better to insure that bond than full and candid postexperimental debriefing. Furthermore, the postexperimental discussion that can be opened up by mutual debriefing tends to free the subject to reveal much of his own recent, subjective experience in the experimental situation. The information thereby gleaned can be of considerable help in determining whether evaluation apprehension or other contaminating processes may have been operating during the experimental transaction. Such discussion also provides the experimenter with a fairly comfortable occasion for asking whether ‘‘you had heard anything about this experiment from a previous subject,’’ just as it provides a facilitating context in which subjects are likely to respond to that query with candor. I am aware, of course, that I am dealing here in lore and impressions. Clearly, more systematic research is required on the effects of the debriefing strategy upon the subject’s self-esteem, upon his maintenance of the secrecy pledge and, for that matter, upon the value of his introspections about his experiences in the experiment. But until such a body of research has been undertaken and reported I do think it reasonable to hew to the general standard favoring postexperimental revelation of all deceptions and of all major experimental purposes; and this should include discussion of the evaluation apprehension problem and of the techniques that have been used to bring that problem under control. I shall turn now to a second major matter that requires some discussion. Among various other issues heretofore unattended is a question that any thoughtful reader must have already conceived as he has worked through these pages: Is the desire to win the experimenter’s approval, and his judgment that one is psychologically
adequate, the only motive of interpersonal relevance activated in the subject during the experimental transaction? Assuredly the case cannot be that simple. Even if we reduce our range of conjectural scan to patterns of motivational arousal that directly affect the subject’s way of relating to the experimenter (and, thus, how he responds in the latter’s experimental situation) a number of other possibilities come easily to mind. Though they are probably far less common than the process upon which this chapter has focused they do require discussion. At least one of these additional data-biasing processes is quite familiar to all psychological experimenters of sufficient experience and sensitivity. There are some subjects who most of the time (and many subjects who some of the time) are likely to want to confound the experimenter, to disconfirm what they perceive as his expectations, to violate what they construe as his apparent scientific hypothesis. I am mindful that this observation partially contradicts the view elaborated by Martin Orne (1962). For him the experimenter’s hypothesis is a ‘‘demand characteristic’’ to which subjects, by the very nature of their role, are prone to yield. This may often happen though, as I shall argue later in this section, when it does it is probably mediated less by a general role-based standard of cooperativeness than by the evaluation apprehension dynamic. Under what sorts of circumstances, or with what kinds of persons, does the opposite tend to occur; that is, what accounts for the not uncommon instance in which the subject’s purpose seems to be to ‘‘screw up the works’’? With an ease and haste that may bespeak defensiveness, psychologists are often prone to interpret such behavior as due to general hostility, to character-based ‘‘anality,’’ or to lingering reverberations of the oedipal revolt against authority figures. Such may indeed be the case with occasional subjects. 
But another dynamic process seems to me to be far more common. Evaluation apprehension, when strongly experienced, may sometimes generate a sort of reactive anger toward the experimenter; or it may be so intolerable as to require immediate ‘‘distancing’’ from the experimental situation and its evaluative implications. Either of these purposes, and yet other comparably defensive ones, can be served by turning the tables on the experimenter and giving indirect expression to a negative evaluation of him. Given the constraints of the usual experimental situation, the most effective way of doing this may often be to disrupt the experimenter’s enterprise by emitting just those responses which will, as the subject sees it, confound or disappoint him. Also if this can be done with a ‘‘light’’ style, with some visibly amused irresponsibility, a further defensive stratagem is brought into operation. The subject may then be able to believe that he has destroyed the evaluative significance of the experimental transaction; for, if he is clearly not taking the situation seriously, his behavior cannot be meaningfully interpreted as saying much about his true psychological nature or competence. From the viewpoint of the experimenter, the problem posed by this sort of process is not so much that it may occur as that it may not be easily or reliably discerned. While skillful postexperimental inquiry may be of some use in reducing this problem, there is, I think, another important alternative. There may well be some personality patterns and some foci of regnant conflict that tend to heighten the likelihood that subjects will take recourse to the ‘‘confound the experimenter’’ strategy. The question begs for early investigation and psychologists interested in
the social psychology of the experiment will need to turn their investigative skills in this direction. Equally compelling and probably even more readily open to systematic investigation, is the question of what attributes of the experimenter and of his instructions and preliminary explanations work toward the same effect. It is my untested impression that experimenters who are perceived by subjects as rather severe and unrevealing while, at the same time, intrusively ‘‘nosy,’’ are the ones most likely to arouse special data biasing patterns of resistance in some of their subjects. And obviously, it could be hypothesized also that the same is true of experiments that are perceived (or misperceived) as probing too deeply into anxiety-laden or low self-esteem areas of the private self. At least one other rather obvious bias-inducing pattern requires discussion in the present context even though, as far as I know, it has not been submitted to any systematic study whatever. I have in mind the occasional sounding of the ‘‘cry for help’’ by a genuinely troubled or unhappy subject who thinks he ought to be, but presently is not, a patient. Undoubtedly, this is far less common than the aspiration to appear ‘‘normal’’ and win a positive evaluation, but just how uncommon it is I do not know. From my own experience and that of colleagues with whom I have discussed the matter I would hazard this judgment: with some small number of undergraduate subjects (and, perhaps, most often with freshmen at times of situational stress) contact with a ‘‘psychologist’’ does activate the regressive longing for some show of support and sympathy from a wise, compassionate parent surrogate. When this does occur a number of problems arise. 
The most important, in the light of our methodological concern, is that out of this background there may issue a pattern of experimental responding opposite to that with which this chapter has been most concerned, but just as troublesome for its data biasing consequences. Where involvement in the experimental situation fosters the tendency to emit the ‘‘cry for help’’ the subject, utilizing the same directional cues as are available to other subjects and either with or without fully conscious intent, may shape his experimental responding so as to make himself appear ‘‘abnormal’’ or troubled or anxious. In consequence, his pattern of experimental responding will lack valid bearing upon the hypotheses that are being put to experimental test. The obvious corrective, again, is to submit the problem to systematic scrutiny through further research. A paradigm experimental situation such as the picture rating task used in some of our studies can be employed, and to it there can be attached fairly clear implications of ‘‘normal’’ and ‘‘abnormal’’ response patterns. Response deflection in the ‘‘abnormal’’ direction could be taken as an index of the motivation toward negative or ‘‘needful’’ self-representation. And variations in such an index could be examined against coordinate variation in personality indices, in systematic manipulations of the experimenter’s style, the experimental script, and prior inductions of psychological stress. Out of some such research program there would probably emerge a set of useful cautionary strictures that would help to further reduce the problem of systematic bias in psychological and kindred types of research data. My purpose in these last few pages has been to note that, in addition to the subject’s striving toward positive self-representation as a way of reducing evaluation apprehension, there are some other, related trends which may also induce systematic
bias in response data. Returning to our main focus upon the former process, I should now like to address an issue that has haunted the discussion at a number of points but has not yet been fully confronted: What is the relationship between the evaluation apprehension dynamic and such other sources of systematic bias as the experimenter expectancy and demand characteristic processes? The answer that I think most acceptable, though only in a provisional way, is already implicit in my earlier discussion of the two experiments in which we found that the activation of evaluation apprehension facilitated, and its reduction obliterated, Rosenthal’s experimenter expectancy effect. In their separate research and theorizing Rosenthal and, to a lesser degree, Orne have both emphasized the experimenter side of the experimenter-subject interaction: that is, they have delineated and demonstrated that experimenters do indirectly reveal what sorts of responses they would welcome from their subjects and they have also shown that this does, somehow, affect the responses of those subjects. However, they have had far less to say about the subject’s side of the transaction; about the patterns of concern, apprehension, and ego-defensiveness which move him toward acting out, or at least coordinating to, the experimenter’s implicit demands. It is, of course, true that Rosenthal has addressed this issue in some of his fascinating side-excursions (Rosenthal, 1966) into the personality attributes of comparatively biasable and unbiasable subjects. But what has been required as well is a narrower or more process-oriented focus upon the actual psychological events that carry the subject through the experiment and up to the point at which he “delivers” the elicited gift of his responses.6 The evaluation apprehension process as defined in this chapter and as exemplified in our various studies appears now to be an important part of the subject side of the total experimental transaction.
In the emerging general theory of the “social psychology of the experiment” it does not replace the account of experimenter expectancy effects developed by Rosenthal. Rather it extends it and perhaps also deepens that account by adding further clarity about the conditions under which experimenter bias is likely to be induced. As regards the demand characteristic process posited by Orne, the present approach does inevitably raise some difficulties and disposes me toward one note of disagreement. This concerns the motivational-perceptual pattern which facilitates the subject’s yielding to the “experimenter’s scientific hypothesis.” Where the experimenter’s true hypothesis is clear to the subject (and I would think that usually it is not) yielding to it would most likely be mediated by the expectation that this will somehow bring approval or other immediate social rewards from the experimenter. To be sure Orne might be interpreted as saying that positive self-evaluation is being sought by the subject, particularly in that he may take pleasure in viewing himself as an accommodating and helpful person. But the present studies,
6. While it has been insufficiently developed, this sort of concern has not been totally ignored during the short period in which the social psychology of the experiment has commanded intellectual interest. Some usefully provocative beginnings in this direction were elaborated by Riecken in his seminal article (1962); and Orne, despite the experimenter-oriented nature of the demand characteristic concept, has also been somewhat sensitive to these matters. However, while the focus upon subject processes has not been totally absent in earlier speculative writing, it has lagged in development. Perhaps this is due to its having been obscured by the deserved figural prominence of the work on experimenter expectancy and demand phenomena. The proper corrective lies not in abandoning the latter interest but in restoring and expanding our concern with the former.
coupled with the very pertinent one by Sigall, Aronson, and Van Hoose (1968), suggest that evaluation apprehension focused upon the experimenter is a more potent and more basic pattern of subject sensitivity. Thus, I would hazard the hypothesis that the subject’s readiness to help the experimenter make his scientific point, if experienced at all, is an instrumental stage in his search for reassuring evidence that the experimenter judges him as an acceptable or even attractively ‘‘normal’’ person. My basic argument, then, is that our focus on evaluation apprehension adds to the picture developed by Rosenthal and other major contributors. By carrying us beyond the kind of biasing processes which can be traced to variability in the experimenter’s behavior it directs us toward those which may be due to the figural highlights and ambiguities of the experiment itself. While they do not logically require it, the experimenter-oriented theories sometimes tend to view the subject as a comparatively passive recipient of implicit ‘‘messages’’ or ‘‘cues’’ from the experimenter. This would suggest that where such cues are absent or imperceptible, systematic bias would be unlikely to occur. In distinction, a subject-oriented theory of the experimental transaction views the subject as seeking something from the experimental experience. In the present theoretical view that ‘‘something’’ is the experimenter’s judgmental validation of the subject’s psychological adequacy and on this basis, the ultimate maintenance or enhancement of the subject’s self-esteem. However, whether this or some other private purpose animates the typical subject is of less importance for the moment than the altered perspective that is opened to us when we lay basic stress upon the subject as seeker. 
From this emphasis there follows the necessary recognition that even when there is no direct cueing conveyed through the experimenter’s behavior, the subject may be prone to construct some personal interpretation of the ‘‘true meaning’’ of the experiment. More often than not, he will speculatively examine the instructions he has received, the overall rationale that has been provided, the procedures and measuring devices to which he has been exposed; and out of the questions these raise for him and the hints they convey to him he will, if at all possible, draw some meaning, some guiding hypothesis about what is really being investigated and how he can best display himself to the investigator. In this view, then, the experimental situation and, for that matter nonexperimental research situations as well, can activate the subject to search for their meaning. Whether the meaning found is often focused upon the evaluation theme, as I have argued, or upon yet other themes, there ensues a consequence as intellectually fascinating as it is methodologically troublesome. The subject’s final ‘‘definition of the situation’’ will affect his responding and thus will be reflected in the dependent variable data. To turn again to the problem of improving research procedures, the foregoing argument clearly suggests a further caution. The danger of inadvertent systematic bias in response data cannot be fully reduced by effective elimination of the experimenter expectancy and demand characteristic problems. We must remain sensitive to the possibility that the subject, no matter how acquiescent or calm he appears, may be actively processing his impressions toward the development of some interpretive hypothesis, one that will lead him to adopt a response strategy that may distort the resulting data. An analogue for this whole process is provided by the larger number of our present studies, excepting those focused upon the experimenter expectancy
phenomenon. In the former group of studies the systematic biasing of the subject’s response patterns was not demonstrably due to any intraexperimenter or interexperimenter variations in behavioral style. Rather, the differences in subjects’ performances could be directly traced to the fact that the preparatory materials they read contained hints that they could then rather easily shape into hypotheses about the purpose, or the indirect revelatory significance, of the experiment. In substantive research focused upon other psychological issues and conducted by experimenters who do not intend their experimental procedures to induce systematic bias, the suspicions aroused and the hints conveyed by the instructions, manipulations, and measures may be of more obscure origin and less certain import. Yet ‘‘seeking’’ subjects are prone to pick up whatever cues may be available in the structural and procedural detail of the experiment itself. The more figural and prominent are the cues of this type, the more likely that separate subjects will come to the same or similar interpretive hypotheses about how to assure positive evaluation for themselves, or, for that matter, about how to reach still other social goals that they may be seeking. In consequence, it will be more likely that a systematic bias in one or another response direction will result. In contrast, the more obscure and the more numerous such provocations toward suspicion and interpretation, the more likely that subjects will reach comparatively unique interpretive hypotheses; and this will tend to foster ‘‘random’’ rather than systematic bias. Either way, the consequence is an increase in the possibility that intrinsically valid hypotheses will be ‘‘disconfirmed’’ and intrinsically invalid ones ‘‘confirmed.’’ Thus it becomes imperative that we submit to far closer scrutiny the processes by which subjects engage in active information seeking, ambiguity reduction and the development of interpretive hypotheses. 
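The contrast just drawn between systematic and “random” bias can be made concrete with a small simulation. The Python sketch below is an illustration only and is not a procedure taken from the studies reviewed in this chapter; the function names, the unit-variance response model, and every numerical value are assumptions chosen for the example. A shared shift stands in for prominent cues that lead subjects to the same interpretive hypothesis, and a subject-specific shift stands in for obscure cues that lead to idiosyncratic ones.

```python
import random
import statistics


def simulate_group(n, true_mean, shared_shift, idiosyncratic_sd, rng):
    """Simulate n responses under a hypothetical bias model.

    shared_shift: bias every subject applies in the same direction
        (assumed stand-in for prominent, figural cues).
    idiosyncratic_sd: spread of subject-specific biases
        (assumed stand-in for obscure, numerous cues).
    """
    responses = []
    for _ in range(n):
        unbiased = rng.gauss(true_mean, 1.0)  # response absent any bias
        private_bias = rng.gauss(0.0, idiosyncratic_sd)
        responses.append(unbiased + shared_shift + private_bias)
    return responses


def standardized_difference(treated, control):
    """Mean difference divided by the pooled standard deviation."""
    diff = statistics.mean(treated) - statistics.mean(control)
    pooled_var = (statistics.variance(treated) + statistics.variance(control)) / 2
    return diff / pooled_var ** 0.5


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is repeatable
    n = 2000

    # Case 1: no true effect, but cues in the treated condition push
    # every subject the same way, so a spurious difference appears.
    control = simulate_group(n, 0.0, 0.0, 0.0, rng)
    treated = simulate_group(n, 0.0, 0.8, 0.0, rng)
    print("systematic bias, no true effect:",
          standardized_difference(treated, control))

    # Case 2: a true effect of 0.5, first without bias, then with large
    # idiosyncratic biases in both groups, which dilute the effect.
    clean = standardized_difference(
        simulate_group(n, 0.5, 0.0, 0.0, rng),
        simulate_group(n, 0.0, 0.0, 0.0, rng))
    noisy = standardized_difference(
        simulate_group(n, 0.5, 0.0, 3.0, rng),
        simulate_group(n, 0.0, 0.0, 3.0, rng))
    print("true effect, unbiased:", clean)
    print("true effect, idiosyncratic bias:", noisy)
```

Under these assumed values the first comparison shows a sizeable apparent effect where none exists, and the last shows a much smaller standardized effect than the unbiased comparison even though the underlying difference is identical. The sketch renders the logic of the argument only; it says nothing about how large such biases are in actual experiments.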
Whether subjects engage in such activities with full ‘‘consciousness’’ (i.e., with purposive self-direction and ratiocinative clarity) or, as I think more likely, with intersubject and intrasubject variability in motivation, effort, and attentiveness, is an interesting issue but not a crucial one. At the present stage what is most important is that we translate our research interest in such processes into the more specific questions that will make possible their controlled investigation. In my view the most useful focus of the required further research effort would be to ask just what variables determine when and how subjects go about formulating hypotheses; and what other variables influence the content and certainty of those hypotheses and the ways in which they are transformed into actual, data-yielding responses. Equally important, of course, is the search for conditions which reduce the likelihood that such activities will take place at all. The reduction or elimination of evaluation apprehension (or the structuring of an experiment so that evaluation apprehension never arises) appears to be one such important condition. But there are probably others and their discovery would be a great boon to the whole experimental enterprise. Until all these matters have been more fully clarified through further research it is necessary that experimenters strive to abandon the image of the ‘‘average’’ subject as a passive and patient human component within a total experimental system; a component that, by processing inputs into outputs, somehow automatically reveals immutable psychological laws. Having said this much I must hasten to add that I do believe that such laws exist in nature, and that the experimental method has been and will remain essential to the task of apprehending and confirming them.
Those psychologists who have responded to recent research on the social psychology of the experiment with despair over the prospects of the experimental method itself are, I think, guilty of unjustifiable reactive depression and are casting out the baby with the bath. When they call for renewed recourse to ‘‘field studies,’’ to ‘‘natural observation’’ with ‘‘non-reactive measures,’’ and to phenomenological inquiry they are doing the behavioral disciplines a useful service. Those ways of gathering data (though equally open to systematic bias effects) can do a great deal to enrich inquiry into the regularities that govern man’s psychological development and his functioning in relation to the persons and institutions that define his existence. However, when such critics suggest that the experimental God is dead, they appear to have missed the point implicit in all research on the social psychology of the experiment. That point is that the experimental method can readily be used to perfect, or at least to significantly improve, itself. Any experimental demonstration of some source of systematic bias and of the process by which it operates immediately suggests procedures for the control and elimination of that source of bias. Another heartening consideration is, simply, that on the basis of present knowledge a great deal is already known about how to reduce the dangers of contamination and systematic bias. Such knowledge can also inform the critical evaluation of the worth of particular experiments as these are reported. The wheat, then, can even now often be separated from the chaff—and the yield is not a grossly unfavorable one. A truly exciting and optimistic prospect has been opened by a decade of work on the social psychology of the experiment and I hope that it has been further advanced by the present inquiry into the evaluation apprehension process. 
We are approaching the point at which we may achieve a practical (if not philosophically perfected) solution to the classic epistemological problem of detaching the knower from the known; of allowing the order inherent in behavioral and social processes to tell us its own true story without any distortion due to promptings from the listener or failings of his listening device. The velocity of further advance toward the improvement of both experimental and nonexperimental investigative procedures is likely to increase as research on the social psychology of social inquiry is vigorously prosecuted. And if, on occasion, one is troubled by the ostensible paradox that the processes inducing systematic bias may operate in our very investigations of systematic bias, there are at least two types of reassurance available. The lesser one is that every investigation in this realm profits the succeeding one; error should fall away as we continue to ‘‘zero in’’ toward the goal of bias-free research. The greater reassurance is that paradox itself is a goad toward intellectual and scientific adventurousness; the more closed off and ostensibly circular the problem, the more deserving it is of assault and solution.
References
Abelson, R. P., Aronson, E., McGuire, W. J., Newcomb, T. M., Rosenberg, M. J., and Tannenbaum, P. H. (Eds.) Theories of Cognitive Consistency: A Sourcebook. Chicago: Rand McNally, 1968.
Aronson, E. The psychology of insufficient justification: an analysis of some conflicting data. In S. Feldman (Ed.) Cognitive Consistency. New York: Academic Press, 1966.
Bock, R. D. A computer program for univariate and multivariate analysis of variance. Proceedings of the I.B.M. Computer Symposium on Statistics. White Plains, New York: I.B.M. Data Processing Division, 1965, 69–111.
Brehm, J. W., and Cohen, A. R. Explorations in Cognitive Dissonance. New York: Wiley, 1962.
Brown, R. Models of attitude change. In R. Brown, E. Galanter, E. Hess, and G. Mandler. New Directions in Psychology. New York: Holt, Rinehart and Winston, 1962, 1–85.
Carlsmith, J. M., Collins, B. E., and Helmreich, R. L. Studies in forced compliance: I. The effect of pressure for compliance on attitude change produced by face-to-face role-playing and anonymous essay writing. Journal of Personality and Social Psychology, 1966, 4, 1–13.
Chapanis, N., and Chapanis, A. Cognitive dissonance: five years later. Psychological Bulletin, 1964, 61, 1–22.
Crowne, D. P., and Marlowe, D. A new scale of social desirability independent of psychopathology. Journal of Consulting Psychology, 1960, 24, 349–354.
Crowne, D. P., and Marlowe, D. The Approval Motive. New York: Wiley, 1964.
Edwards, A. L. The Social Desirability Variable in Personality Assessment and Research. New York: Dryden, 1957.
Festinger, L., and Carlsmith, J. M. Cognitive consequences of forced compliance. Journal of Abnormal and Social Psychology, 1959, 58, 203–210.
Friedman, N. The Social Nature of Psychological Research: The Psychological Experiment as a Social Interaction. New York: Basic Books, 1967.
Minor, M. W. Experimenter Expectancy Effect as a Function of Evaluation Apprehension. Unpublished doctoral dissertation, University of Chicago, 1967.
Nowlis, V. Research with the Mood Adjective Check List. In S. S. Tomkins and C. E. Izard (Eds.), Affect, Cognition, and Personality. New York: Springer, 1965.
Orne, M. On the social psychology of the psychological experiment: with particular reference to demand characteristics and their implications. American Psychologist, 1962, 17, 776–783.
Riecken, H. W. A program for research on experiments in social psychology. In N. F. Washburne (Ed.), Decisions, Values, and Groups. Vol. 2. New York: Pergamon Press, 1962.
Ring, K. Experimental social psychology: some sober questions about some frivolous values. Journal of Experimental Social Psychology, 1967, 3, 113–123.
Rosenberg, M. J. Cognitive structure and attitudinal affect. Journal of Abnormal and Social Psychology, 1956, 53, 367–372.
Rosenberg, M. J. An analysis of affective-cognitive consistency. In Rosenberg, M. J., Hovland, C. I. et al., Attitude Organization and Change. New Haven: Yale University Press, 1960. (a).
Rosenberg, M. J. Cognitive reorganization in response to the hypnotic reversal of attitudinal affect. Journal of Personality, 1960, 28, 39–63. (b).
Rosenberg, M. J. When dissonance fails: on eliminating evaluation apprehension from attitude measurement. Journal of Personality and Social Psychology, 1965, 1, 28–42.
Rosenberg, M. J. Some limits of dissonance: toward a differentiated view of counter-attitudinal performance. In S. Feldman (Ed.) Cognitive Consistency. New York: Academic Press, 1966.
Rosenberg, M. J. Hedonism, inauthenticity, and other goads toward expansion of a consistency theory. In R. P. Abelson, E. Aronson, W. J. McGuire, T. M. Newcomb, M. J. Rosenberg, and P. H. Tannenbaum (Eds.) Theories of Cognitive Consistency: A Sourcebook. Chicago: Rand McNally, 1968.
Rosenberg, M. J., Hovland, C. I., McGuire, W. J., Abelson, R. P., and Brehm, J. W. Attitude Organization and Change. New Haven: Yale University Press, 1960.
Rosenthal, R. Experimenter Effects in Behavioral Research. New York: Appleton-Century-Crofts, 1966.
Sigall, H., Aronson, E., and Van Hoose, T. The Cooperative Subject: Myth or Reality? Dept. of Psychology, University of Texas, 1968 (mimeographed).
Silverman, I. Role-related behavior of subjects in laboratory studies of attitude change. Journal of Personality and Social Psychology, 1968, 8, 343–348.
Silverman, I., and Regula, C. R. Evaluation apprehension, demand characteristics, and the effects of distraction on persuasibility. Journal of Social Psychology, 1968, 75, 273–281.
8 Prospective: Artifact and Control¹
Donald T. Campbell
Northwestern University
Logic of Inference
If we had remained with the definitional operationism of our recent past, we would not have known the problems with which this volume deals. Our experimental setups and our measurement procedures would have been treated as definitional of our theoretical concepts. Conceptualizing them as definitions would have excluded recognizing them as biased, as systematically imperfect as well as randomly errorful. Definitional operationism did indeed lull some into an uncritical complacency and reification of test scores, but fortunately the major practitioners of science had too little contact with or too little faith in philosophy of science to be misled. While logical positivists were defining intelligence in terms of the Stanford-Binet, 1916 edition, Terman was already initiating revisions designed to make it a less biased and more accurate measure of intelligence, a goal which clearly showed that for him, his test was not the definition. Similarly, every physicist working with a measurement device such as the galvanometer knows that in practice it fails of perfect reflection of electrical potential differences because of the effects of gravity, friction, inertia, field forces, etc. (e.g., Wilson, 1952). While compensated and corrected design may minimize these sources of error, on theoretical grounds the galvanometer is known to be subject to systematic biases, the elucidation of which is itself a history of cumulative scientific achievement rather than of logical revelation. If definitional operationism and other accoutrements of logical positivism now are recognized as misleading, how are we to understand our predicament as knowers, and in such a way as to make philosophical sense out of the prototypic activities of this book?
¹ Supported in part by National Science Foundation Grant GS1309X. This paper was written while Fulbright Lecturer in Social Psychology at the University of Oxford. I am indebted to my host Michael Argyle both for generous hospitality and for help with this paper.
For me, the orientation of Karl Popper (1959, 1963; Campbell, in press) and that partial common denominator shared with Polanyi (1958), Toulmin (1953, 1961), Kuhn (1962), and Quine (1953), although they might be the last to acknowledge any such, seems most appropriate. I shall try to present an aspect of this orientation, albeit through metaphors that are perhaps unorthodox. Following Popper, I honor Hume as a logician and reject him as an inductivist psychologist. Hume called attention to the scandal of induction, to the fact that
scientific generalizations are not logically proven or provable. While most modern philosophers take this to be a mere technicality, a mere statement of the inappropriateness of analytic logic to contingent truth, it is Popper’s strength to recognize this as a fundamental limitation. Not only are scientific truths logically unproven, they also lack certainty in any other sense—inductive, empirical, scientific, or implicative. Yet they are in some sense ‘‘established.’’ The best of theories if not ‘‘confirmed’’ are at least ‘‘corroborated.’’ Logic is relevant to the statement of the situation. The ‘‘scandal of induction’’ can be expressed by noting that science makes use of an invalid logical argument, making the error of the ‘‘undistributed middle,’’ or of ‘‘affirming the consequent.’’ But while invalid, the argument is not useless. The logical argument of science has this form: If Newton’s theory A is true, then it should be observed that the tides have period B, the path of Mars shape C, the trajectory of a cannonball form D. Observation confirms B, C, & D. Therefore Newton’s theory A is true.
We can see the fallacy of this argument by viewing it as an Euler diagram: The invalidity comes from the existence of the cross-hatched area, i.e., other possible explanations for B, C, and D being observed. But the syllogism is not useless. If observations inconsistent with B, C, and D are found, these validly reject the truth of Newton’s theory A. The argument is thus highly relevant to a winnowing process, in which predictions and observations serve to weed out the most inadequate theories. Furthermore, if the predictions are confirmed, the theory remains one of the possible true explanations. This asymmetry between logically valid rejection and logically inconclusive confirmation is the main thrust of Popper’s emphasis on falsifiability. The truism is now safely ensconced in elementary presentations of inductive logic, without the necessity of citations to Popper (e.g., Hempel, 1966; Salmon, 1963).
There is another decision locus in the process upon which Popper’s critics have focused: do the observations in fact confirm the predictions? It was assumed in the above that this decision could be and had been made. At this level, falsifiability and confirmability are more logically symmetrical. And at this level, observations always falsify a quantified prediction if carried out with sufficient precision. At this level the tolerance for accuracy which scientists actually allow is a social system function, determined by the degree of development of the science, the degree of experimental control achieved, and the sharpness of competition from other theories. Thus for Einstein’s prediction of the bending of starlight passing the sun, as in the 1919 eclipse, a predicted value of 1.745 seconds of arc has been ‘‘confirmed’’ by values of 1.61″, 1.98″, 1.72″, 2.2″, and 2.0″.
Let us look in more detail at the Euler circles and the relation of confirmed predictions to the truth or credibility of a theory.
It is our inescapable predicament that we cannot prove the theory. We must work within the limitations there diagrammed. What we as scientists do is to try in some practical way to ‘‘empty’’ the cross-hatched area, to make it as small as possible. We do this by expanding as greatly as possible the number, range, and precision of confirmed predictions. The larger and more precise the set, the fewer possible alternative singular explanations, even though this number still remains in some sense infinite.
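The asymmetry between inconclusive confirmation and valid rejection described above can be checked mechanically with a truth table. The following is only a minimal sketch: the helper names `is_valid` and `implies` are illustrative inventions, and the two Boolean variables stand in for "theory A is true" and "predictions B, C, and D are observed."

```python
from itertools import product


def implies(a, b):
    """Material conditional: 'if a then b'."""
    return (not a) or b


def is_valid(premises, conclusion, n_vars=2):
    """An argument form is valid if the conclusion holds in every
    truth assignment in which all of the premises hold."""
    for values in product([False, True], repeat=n_vars):
        if all(p(*values) for p in premises) and not conclusion(*values):
            return False  # counterexample: premises true, conclusion false
    return True


# a = "theory A is true"; b = "predictions B, C, D are observed"

# Confirmation: if A then B; B is observed; therefore A.
affirming_consequent = is_valid(
    premises=[lambda a, b: implies(a, b), lambda a, b: b],
    conclusion=lambda a, b: a,
)

# Rejection: if A then B; B is not observed; therefore not A.
modus_tollens = is_valid(
    premises=[lambda a, b: implies(a, b), lambda a, b: not b],
    conclusion=lambda a, b: not a,
)

print(affirming_consequent, modus_tollens)  # False True
```

The single counterexample to the confirming form (A false while B is nonetheless observed) corresponds to the cross-hatched area of the diagram: the predictions hold for some reason other than the theory.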
More important, we in fact pay little or no attention to the mere logical possibility of alternate theories, to the merely logical existence of a cross-hatched area. Toulmin has stated the point well: Again, philosophers sometimes assert that a finite set of empirical observations can always be explained in terms of an infinite number of hypotheses. The basis for this remark is the simple observation that through any finite set of points an infinite number of mathematical curves can be constructed. If there were no more to ‘explanation’ than curve-fitting, this doctrine would have some bearing on scientific practice. In fact, the scientist’s problem is very different: in an intellectual situation which presents a variety of demands, his task is—typically—to accommodate some new discovery to his inherited ideas, without needlessly jeopardizing the intellectual gains of his predecessors. This kind of problem has an order of complexity quite different from that of simple curve-fitting: far from his having an infinite number of possibilities to choose between, it may be a stroke of genius for him to imagine even a single one. (Toulmin, 1961, 113–115)
It is only when there exist actually developed alternative explanations, that is, when there are known contents to the cross-hatched area, that validity questions arise for theories whose predictions have been confirmed. It was because there were no actually developed rivals that Newton’s theory was regarded as certainly true for 200 years, even by such critical epistemologists as Kant. The cross-hatched area was empty in any practical sense. But the logical correctness of Hume’s analysis of scientific truth is brought home as a relevant problem for scientific induction by the fact of the subsequent overthrow of Newton’s theory for that of Einstein.
The situation is in fact even sloppier than this. When a theory such as Newton’s has, in current fact, no near rivals at all, and when it predicts exquisitely well an enormous range of phenomena, we tend to forgive it a few mispredictions. Thus, as Kuhn (1962) emphasizes, there were known in Newton’s day systematic errors of prediction, as of the precession of the perihelion of Mercury, which would have invalidated it at that time had Einstein’s theory then been available. The truer picture is one of a competition between developed and preponderantly corroborated theories for an overall superiority in pattern matching (Campbell, 1966). Thus the only process available for establishing a scientific theory is one of ‘‘eliminating plausible rival hypotheses.’’
Insofar as these interaction effects are present, they are of a dampening nature, leading to an underestimation of the law under investigation. They are not of the sensitization type that would produce pseudo effects utterly untypical of the natural situation. In this case, the ‘‘plausibility’’ of the rival hypothesis as stated in 1957 was based partly upon an appeal to common-sense knowledge of psychological processes. However, my presentation did contain what now appears to have been an erroneous pseudo-citation, and one of considerable persuasive power.
It will add to the manifest uniformity of Lana’s review of empirical findings if I take space here to set the record straight. What I reported was that in the ‘‘Cincinnati looks at the United Nations’’ study, it was only the pretested panel that showed any awareness of, or effects from, a very intensive public communications effort. The study involved two parts. The main one compared two separate but randomly equivalent samples of 1000, one taken before the campaign, one after, finding essentially no differences. The results of this had been published in the paper I cited. By oral report from one of the authors of that
paper I learned of the subsidiary study done for Columbia’s Bureau of Applied Social Research, which involved reinterviewing the pretest sample. By this oral report I learned, or thought I learned, of the outcome of that still unpublished study, and this is what I reported. It seemed a most apt illustration, just what was needed to make the point. Years later, Claire Selltiz, in the process of revising Research Methods in Social Relations (Selltiz, Jahoda, Deutsch, and Cook, 1959), spent a great deal of effort trying to track down a more precise reference, but without success, and therefore presented the retest study as a hypothetical possibility. Eventually there turned up a duplicated report by Glock (1958) presenting quite different findings: there were no significant differences between the pretested and unpretested posttest samples and not even trends in the direction of a sensitization or communication-enhancement effect. (A later presentation in Campbell and Stanley [1963] set the facts straight but did not give the detailed apology here presented.) This anecdote illustrates, I hope, the dependence of the argument upon the hypothesized laws, hypothesized empirical regularities, and therefore, the relevance of empirical fact to the establishment of a ‘‘necessary’’ scientific control.
One important implication of this argument is that it is not failure-to-control in general that bothers us, but only those failures of control which permit truly plausible rival hypotheses, laws with a degree of scientific establishment comparable to or exceeding that of the law our experiment is designed to test. Thus our current standards of experimental design represent a scientific achievement, an empirical product, not a logical dispensation. They represent generally verified hypothetical laws—in the philosophers’ terms, contingent, descriptive, synthetic, and therefore corrigible ‘‘truths,’’ rather than logical or analytic truths.
The ‘‘control group’’ is a feature we psychologists are taught as axiomatically required. It is seldom noted that it, or its analogue, is totally missing from most of the 19th century physics, chemistry, and physiology from which we took our methodological models. As Boring (1954) documents, it was invented as recently as 1907 to control for a plausible rival hypothesis quite specific to psychology, namely that pretests would produce gains in performance even in the absence of experimental treatments (that is what we can call a main effect of testing in contrast to the interaction effect of testing described above). Were it regularly to be found that there were no practice effects, this reason for the control group would be eliminated. This is not likely to be so for much of experimental psychology, but might be for persuasion studies, as Lana has noted. Our typical synchronous pretest-posttest-control group design also controls for other hypothetical rival explanations of change (Campbell, 1957). But for an experimental psychologist studying the learning of nonsense syllables, none of these are plausible enough explanations for a gain in memory so that he typically does without. Even in the most primitive one-group pretest-posttest design, the need for a pretest would not be there if it were not for the empirical facts that test-retest correlations are greater than zero and that individuals are not all equal. Superimposed upon the simple control group are other required control groups whose empirical justification remains more obvious. Thus in cortical ablation studies, the sham-operation control group reflects the empirical fact of surgical shock. With accumulating evidence about the nature of surgical shock, it becomes an utterly implausible explanation for many ablation and electrode-stimulation effects, and as a result is dropping out of use. The placebo control group reflects the very well
established law as to the therapeutic effects of believing one has received a curative treatment (see Orne, Chapter 5). Where pharmaceutical research is being tested in terms of the very general variable of illness-health, it still remains essential. For much more specific effects, it can often be skipped. The double-blind placebo control group reflects the further empirical law of the effect of experimenter faith when administering the pill, and gets us into the realm of facts which Rosenthal has so well documented (1966; and Chapter 6). A major way in which this volume contributes to the science of psychological method is thus in establishing the need for new control groups. From Orne come ‘‘demand character’’ control groups. From Rosenthal come the high expectancy and low expectancy treatment replications. From Rosenberg come the evaluation apprehension control groups, and the recommendation of experimental arrangements disguising the administrative relation between treatment and posttest. McGuire’s and Rosenthal’s chapters in this volume provide confirmation of the law-like character of threats to validity, in showing how such variables can shift from being control problems to being focal.
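The way a control group removes a main effect of testing, as discussed above, can be illustrated with a small simulation. This is a hedged sketch rather than a reconstruction of any study cited here: the function name `simulate`, the sample size, and the numerical values for the practice gain and the treatment effect are all arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean


def simulate(n=2000, practice_gain=3.0, treatment_effect=2.0, seed=0):
    """Pretest-posttest control group design with random assignment.

    Every subject gains `practice_gain` merely from having taken the
    pretest (a main effect of testing); only the experimental group
    additionally receives `treatment_effect`. All values are assumed.
    """
    rng = random.Random(seed)
    gains = {"experimental": [], "control": []}
    for _ in range(n):
        group = "experimental" if rng.random() < 0.5 else "control"
        ability = rng.gauss(50, 10)              # stable individual difference
        pretest = ability + rng.gauss(0, 2)      # measurement error
        posttest = ability + practice_gain + rng.gauss(0, 2)
        if group == "experimental":
            posttest += treatment_effect
        gains[group].append(posttest - pretest)
    return mean(gains["experimental"]), mean(gains["control"])


exp_gain, ctl_gain = simulate()

# A one-group design would credit the whole experimental gain to the
# treatment; subtracting the control gain removes the practice effect.
print(f"experimental gain: {exp_gain:.2f}")
print(f"control gain:      {ctl_gain:.2f}")
print(f"difference:        {exp_gain - ctl_gain:.2f}")
```

If `practice_gain` is set to zero, the control group's mean gain hovers near zero and the one-group estimate is no longer biased, which mirrors the point that this reason for the control group rests on an empirical regularity rather than on logic.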
A Typology of Artifacts, Biases, or Threats to Valid Inference
While threats to validity or artifacts can come from any aspect of the experimental process, and while a complete typology is not possible, it may help to lay out some recurrent types of artifact.
Confounded Aspects of the Experimental Treatment: Main Effects
For this purpose, we can regard as aspects of the experimental treatment all features which differ between experimental and control groups. Inevitably many of these are irrelevant to the theoretical variable we are manipulating. They are ‘‘instrumental incidentals.’’ Such features are arbitrary, in that there might have been other implementations, but are unavoidable in that had there not been these incidentals, there would have had to have been others. No ‘‘pure treatments’’ are possible. Each one of these irrelevant details is a potential rival explanation of an effect. A major class of artifacts is of this type. In this volume, it is exemplified by Rosenberg’s specific claim that the relevant manipulation was not cognitive dissonance but evaluation apprehension. A still more general one is Rosenthal’s criticism of a vast array of experiments, to the effect that differential experimenter expectation, rather than more specific treatment details, was the essential treatment variable.
Two ways of controlling a specific exemplar of this type of confound emerge. First, there is the way of the new control group, or the expanded-content control group. That is, the control group treatment is modified to include more of what was previously only experienced by the experimental group. The sham operation and the placebo control groups are of this sort. Thus a control group given equivalent evaluation apprehension, or demand-character, or experimenter expectancy is created, increasing the common denominator so that experimental and control groups no longer differ on this feature.
Second, there is the opportunistic search for new modes of implementation in which the theoretical variable is exemplified without this particular rival variable, that is, through a modification of the experimental group.
As a control, this involves an a priori preference for parsimony, for it is always possible that two separate irrelevancies, a different one in each setting, explain the superficially consistent results. Again, we must disregard this possibility except insofar as specific versions are developed and plausible.
Confounded Aspects of the Treatment: Interaction Effects
There may be a genuine effect of the theoretical variable that is specific to (or inhibited by) particular vehicular components. Again, the potentialities are so numerous that we pay attention only to explicitly elaborated and plausible hypotheses of this nature. More than that, we are even more likely to disregard or judge implausible an interaction effect than a main effect. Perhaps part of the reason for this is that main effects are more easily handled. Probably more important is the very general inferential generalization that main effects are more probable than interaction effects. This would be analogous to, or perhaps even a part of, Mill’s inductive presupposition that nature is orderly. Such a generalization might be descriptively true, and it would seem to me worth an actuarial survey of analyses-of-variance in Ph.D. dissertations (a less biased sample than published research). But even if this is not descriptive of nature as we find her, it is descriptive of the knowable aspects of nature, a biased sample upon which science and simpler knowledge processes necessarily focus (Campbell, in press). By knowledge we mean, in part, usable reidentifiable samenesses in settings that are not identical. If the highest order interactions with the specifics of space, time, and attributes are always significant, then no generalization is possible, and hence no knowledge and no science. A successfully established main effect is a much more general generalization than is an interaction effect.
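The distinction between a main effect and an interaction effect can be made concrete with the cell means of a 2 × 2 layout (treatment absent or present, crossed with two settings). The sketch below is illustrative only; the function name `factorial_effects` and the cell means are hypothetical values, not data from any study discussed here.

```python
def factorial_effects(cells):
    """Effects from the four cell means of a 2 x 2 design.

    `cells[(treatment, setting)]` is a mean outcome, with each factor
    coded 0 or 1. Returns (treatment main effect, setting main effect,
    interaction contrast).
    """
    main_treatment = ((cells[(1, 0)] + cells[(1, 1)])
                      - (cells[(0, 0)] + cells[(0, 1)])) / 2
    main_setting = ((cells[(0, 1)] + cells[(1, 1)])
                    - (cells[(0, 0)] + cells[(1, 0)])) / 2
    # Treatment effect in setting 1 minus treatment effect in setting 0.
    interaction = ((cells[(1, 1)] - cells[(0, 1)])
                   - (cells[(1, 0)] - cells[(0, 0)]))
    return main_treatment, main_setting, interaction


# Hypothetical case 1: the treatment adds 4 points in both settings.
additive = {(0, 0): 10, (1, 0): 14, (0, 1): 12, (1, 1): 16}

# Hypothetical case 2: the treatment works only in setting 1.
setting_specific = {(0, 0): 10, (1, 0): 10, (0, 1): 12, (1, 1): 20}

print(factorial_effects(additive))          # (4.0, 2.0, 0)
print(factorial_effects(setting_specific))  # (4.0, 6.0, 8)
```

Both hypothetical cases show the same averaged treatment effect of 4, yet only in the first does that figure describe what happens in each setting. In the second, the nonzero interaction contrast signals that the "law" is specific to one setting and does not generalize to the other.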
Much of the basis for the recalibration of measurement dimensions is the search for that option of quantification which turns the most regularities into main effects. Rival hypotheses in this class are not controlled by the expanded-content control group approach, but would usually be by the altered experimental treatment. Note that this latter as a control here too involves the a priori preference for parsimony and the presumptive bias in favor of main effects. For it would always be possible that the apparent general confirmation of the law in two settings was in fact a coincidence of two separate specific interaction effects.
Background Interactions
Background here refers to those common features shared by both experimental and control groups. Inevitably these involve many features unspecified in theory, probably even more numerous in the social sciences than in the physical. All of these are potential sources of interaction effects with the theoretically relevant aspect of the treatment or, to be sure, with an irrelevant one of the aspects confounded with the treatment. Again, these are so numerous that we can only pay attention to the plausible and well developed rival hypotheses. Hypotheses of interaction in this category, in the prior category, and in several to come below (as on subject selection) are in some sense not as serious threats as those in the first category. They represent only potential limitations on the generality of a law already established in one setting. It is only when that one setting is ‘‘artificial’’
and when we are interested primarily in applying our generalizations to other settings than that artificial laboratory, that such threats to validity worry us. The possible pretest sensitization in persuasion studies, discussed above, is of this nature. It should be recognized that an elegant science of persuasion restricted to pretested audiences would be a quite worthy scientific achievement, even if of little practical value, and that, by and large, the physical sciences have been preoccupied with predicting exclusively in laboratory settings, although, to be sure, in their truly impressive achievements they predict from one artificial laboratory to others of quite different structure. In this case too, the expanded-content control group approach will not work, and a changed experimental treatment provides a general control.
Interactions with Population Characteristics
This paper is written in the context of ‘‘true experiments,’’ not ‘‘quasi-experiments’’ (Campbell and Stanley, 1963), and hence population differences do not appear as spurious sources of main effects, i.e., as differences between experimental and control groups, but instead as potential limitations on the generality of laws observed in a study done with a specific population. We in social psychology may inherit a misleading super-ego ideal from sociology, to the effect that this should be solved by representative sampling from some universe of theoretical relevance, perhaps of all mankind. (Our emphasis on randomization to achieve equivalence between quite parochial experimental and control groups should not be confused with the sociologists’ emphasis upon randomization to achieve representativeness of some specified population [Campbell and Stanley, 1963, 23].) This is, to be sure, an unpracticed ideal. But it is so out of keeping with what we know of science that it should be removed even from our philosophy of science.
A consideration of the time dimension will help show its utter unreasonableness. In the physical sciences, the presumption that there are no interactions with time (except those of daily, lunar, seasonal, and other cycles) has proved to be a reasonable one. But for the social sciences, a consideration of the potentially relevant population characteristics shows that changes over time (e.g., a 30-year comparison of college students) produce differences fully as large as synchronous social class and sub-cultural differences. To representatively sample from our intended universe of generalization would require representative sampling in time, an obvious impossibility. More typical of science is the case of Nicholson and Carlisle. Taking in May, 1800, a very parochial and idiochronic sample of Soho water, inserting into it a very biased sample of copper wire, into which flowed a very local electrical current, they obtained hydrogen gas at one electrode, oxygen at the other, and uninhibitedly generalized to all the water in the world for all eternity. It was a hypothetical generalization, to be sure, rather than a proven fact. There have been by now many studies of the effect of ‘‘impurities’’ in the water upon hydrolysis, but these too have been done on very biased samples. The idea of a representative sampling of all the waters of the world, or of England, never occurred even as an ideal. The very concept of ‘‘impurities,’’ of segregating the contents of water into the ‘‘pure’’ stuff and the alien contents, is one which would never have emerged had a representative sampling approach to water been employed. In the successful sciences, generalizations
have never been ‘‘inductive’’ in the sense of summarizing what had been observed within the bounds of the generalization, but instead have always been presumptive, albeit guided by prior laws. The limitations to the generalization have emerged from checking in nonrepresentative ways on an initial bold generalization. Scientists assumed that hydrolysis held true universally until it was shown otherwise. In this light, had we achieved one, there would be no need to apologize for a successful psychology of college sophomores, or even of Northwestern University coeds, or of Wistar strain white rats. Exciting and powerful laws would then be presumed to hold for all men or all vertebrates at all times, until specific applications of that presumption proved wrong. We already are at this latter stage, but even here a representative sampling of species or school populations is not the answer. Theory-guided, dimensional explorations, as in comparing primates widely varying in evolutionary development, are in the typical path of science.
Thus it would be a fine achievement, even though not science of proven universality, to have a lawful psychology of volunteer subjects (e.g., Rosenthal and Rosnow, Chapter 3). However, here too, when a specific plausible hypothesis has been developed, predicting restrictions on a generalization we very much want to make as to nonvolunteering populations, we attempt to control it. Not only would we want to generalize to such nonlaboratory populations for reasons of applied science, we in experimental social psychology also aspire on pure science grounds to bridging generalizations to the unavoidably nonexperimental social sciences. Also to be noted in the volunteer subject problem is the fact that the plausible interaction hypothesis affects not just one treatment variable, but a very large class of them, e.g., to the effect that volunteer subjects will show the results they believe the experimenter wants in any experiment.
Such a hypothesis is indeed threatening enough, so that if empirically justified, it would make us want to shift populations for our basic exploratory studies.
Confounded Aspects of Measurement: Main Effects
Every measuring device, like every treatment, is dimensionally complex with many theoretically irrelevant vehicular components. The measured effects of the treatment could be due to one of these irrelevancies. Such artifacts in measurement have generated a vast literature. There are no doubt hundreds of studies of response-sets in questionnaires, attitude tests, personality tests, etc. (e.g., Cronbach, 1946, 1950; Rorer, 1965; Campbell, Siegman, and Rees, 1967). Social desirability provides another vast literature (Edwards, 1957; Block, 1965). For ratings, there are halo effects and implicit rater theories of personality. There are artifacts in scores of dyadic discrepancy, inter-personal perception, and pattern similarity (Cronbach, 1958; Corsini, 1956; Silverman, 1959). Other measures, including observations, traces, and actuarial records, have analogous problems (Webb et al., 1966). So far, such artifacts have been used in the literature in criticism of the interpretation of correlations and group differences, rather than in criticism of experiments. Thus social class differences in F-Scale authoritarianism have been criticized as due to the intelligence (Christie, 1954) or acquiescent response-set component of the measure rather than the authoritarian component, but similar criticisms of attitude change studies involving the F-Scale have not appeared. It is perhaps for this reason
that this large segment of the research artifact literature is not represented in this volume. Control for these problems is through the use of multiple measures differing in vehicular or method components (Campbell and Fiske, 1959; Campbell, 1960). In most laboratory experimentation more of this could be done than usually is, and with much less addition to research cost than would be involved in methodological replication of treatments. Probably more is done than is reported, because having multiple measures generates the jeopardy of discrepant results which are a great embarrassment to write up.
Confounded Aspects of Measurement: Interactions with Treatments
Even when the measured change is due to the theoretically relevant aspects of the measure, the irrelevant method components can condition the reaction—the observed reaction thus may be specific to this particular mode of measurement. Again, control of such a plausible rival hypothesis lies in alternative measurement devices. (Note that the interaction of the relevant aspects of pretest measurement with the treatment has been discussed above.)
Controlling Artifacts
In the previous section, several distinguishable modes of control have been presented. There are:
1. Expanded-content control groups, in the tradition of sham operations and placebos;
2. Treatment replication with altered methods; and
3. Multiple methods in measurement.
The present section continues this discussion with three additional points more general in nature.
Controlling Plausible Rival Hypotheses through Supplementary Variation
This heading refers to a very general technique of partial or inferential control useable for many settings in which direct or complete control is not possible. While its primary application has been in quasi-experimental settings, it is available also in experimental ones. One noteworthy implication is that clarity of inference sometimes may be improved by deliberately reducing the quality of part of the data collected. Let us begin with such an illustration. In our study of cultural differences in susceptibility to optical illusion (Segall et al., 1966) one of the plausible rival explanations of the differences obtained was in terms of variations in the administration of the visual tasks by the various anthropologists involved. To control this, we made what seems to me now to have been the amazingly brave decision to deliberately debase half of our best controlled data collection. Instructions were for the test pages to be held vertically at four feet from the respondents’ eyes, not the easiest position to achieve if the same person is also to record results. In our Evanston sample of 200, collected on a door-to-door survey sample basis, half were done correctly, and the other half were administered as in table-top presentation, at one and one half feet from the eyes with the booklet in a horizontal position. The latter was thought to be more slovenly than any actual administration, but in the likely direction of deviation. These two conditions did
Prospective: Artifact and Control
produce differences, but small ones, not at all sufficient to explain the cultural differences which were five to seven times larger. Our resultant power of inferences was greater than had all of the Evanston data been of the best quality.

The study also provides a second illustration. After the major body of data had been collected, published research appeared indicating that inspection time differences were a possible plausible rival explanation. These we “controlled” by collecting a new Evanston sample in which two exposure times were used, one very brief, the other much larger than was likely to have occurred in any sample. Here again, while there were differences, they were of much too small a magnitude to explain the major cultural differences.

In his book on data quality control, Naroll (1962) divides ethnographies used in quantitative cross-cultural comparisons into two or more levels of quality. For example, if the ethnographer lived in the area for two or more years, and if he learned the local language, the ethnography would be classed as of high quality. Variables such as belief in witchcraft and non-European positions in childbirth turn out to be correlated with data quality, being more apt to be noted by those with better acquaintance. Any two such variables will as a result show a spurious correlation with each other. By introducing the variation in data quality, Naroll is able to rule out or confirm the existence of such spurious correlation.

Bitterman (1965), working in the quite different laboratory of comparative psychology, has arrived at the more general methodological precept of which these quality variations are one illustration:

I do not, of course, know how to arrange a set of conditions for the fish which will make sensory and motor demands exactly equal to those which are made upon the rat in some given experimental situation. Nor do I know how to equate drive level or reward value in the two animals.
Fortunately, however, meaningful comparisons still are possible, because for control by equation we may substitute what I call control by systematic variation. Consider, for example, the hypothesis that the difference between the curves which you see here is due to a difference, not in learning, but in degree of hunger. The hypothesis implies that there is a level of hunger at which the fish will show progressive improvement, and put in this way, the hypothesis becomes easy to test. We have only to vary level of hunger widely in different groups of fish, which we know well how to do. If, despite the widest possible variation in hunger, progressive improvement fails to appear in the fish, we may reject the hunger hypothesis. Hypotheses about other variables also may be tested by systematic variation. With regard to the question of reversal learning, I shall simply say here that progressive improvement has appeared in the rat under a wide variety of experimental conditions—it is difficult, in fact, to find a set of conditions under which the rat does not show improvement. In the fish, by contrast, reliable evidence of improvement has failed to appear under a variety of conditions. (Bitterman, 1965, 396–410)
As applied for the control of artifacts, two types can be distinguished. On the one hand there is interpolating or bracketing variation, in which the supplementary variation includes the whole likely range or more. The two illustrations from the optical illusions study are of that nature. As controls, these assume linearity or monotonicity of laws, i.e., that intermediate values would have intermediate effect. This is usually a reasonable enough assumption to render the threat implausible if the extreme bracketing values find it so.
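The bracketing logic can be made concrete with a small numerical sketch. Under the monotonicity assumption just stated, the largest shift the artifact could produce anywhere inside the bracket is the difference between the outcomes at the two extreme conditions, and the rival hypothesis becomes implausible when the substantive difference is several times larger than that shift. The function name and all numbers below are hypothetical illustrations, not data from the optical illusion study.

```python
def bracketing_ratio(outcome_at_low_extreme, outcome_at_high_extreme,
                     substantive_difference):
    """Compare a substantive difference with the largest shift a bracketed
    artifact could cause.

    Assumes monotonicity: intermediate values of the nuisance variable
    produce intermediate outcomes, so the two extremes bound its effect.
    """
    max_artifact_shift = abs(outcome_at_high_extreme - outcome_at_low_extreme)
    if max_artifact_shift == 0:
        return float("inf")  # the artifact produced no shift at all
    return abs(substantive_difference) / max_artifact_shift


# Hypothetical illustration: mean scores under the two bracketing
# administrations, and a hypothetical difference between two groups.
ratio = bracketing_ratio(outcome_at_low_extreme=10.0,
                         outcome_at_high_extreme=12.0,
                         substantive_difference=12.0)
print(ratio)
```

A ratio well above 1 means that the artifact, even at its extremes, is too small to account for the difference; a ratio near or below 1 leaves the rival hypothesis open.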
Book One – Artifact in Behavioral Research
Second, there is extrapolating variation, in which we do not have full access to all values of the dimension, and to achieve our control must extrapolate outside the range of explored values to unobtainable values. The problem of volunteer respondents might be such. What we would like to do is to extrapolate to the nonvolunteering population, but in even the best we can do, some degree of volunteering is required. A degree of control is introduced if one adds a much more extremely voluntary situation, more voluntary than would normally be used. If these two degrees of voluntarism show the same laws, we extrapolate, assuming monotonicity, to the condition of no voluntarism at all. Here the assumptions involved seem intuitively less plausible than in the bracketing case, but are still plausible enough to make such a control worth adding. In this case too we have added a body of deliberately poorer data.

A widespread utilization of supplementary variation as a control is in the common practice of classifying respondents on the basis of post-session interviews as to degree of awareness of the experimenter’s purpose, and checking the replication of the same laws in such subgroups. McGuire’s research, in this volume, has extended this by deliberately introducing more extreme degrees of awareness on an experimental basis.

Heteromethod Replication

The reiterated history in research on artifacts is for an exciting original one-treatment, one-measure experiment to be criticized with specific plausible rival hypotheses, and to be followed up by a series of experiments with expanded control groups or changed treatment method, or changed measurement method, until the original finding is doubly confirmed or rejected in favor of some rival interpretation. Even where the field is active and research is cheap, this cycle takes a good ten years or more, as illustrated by Rosenberg’s work on the dissonance experiments, reported in this volume.
Any strategy which would cut down on this wasteful procedure would seem at first glance to be worth introducing. If one reviews the control comments in the previous section on types of artifact, one notes a very general utility to varied experimental implementation. Multiple methods of measurement have a parallel value. There emerges the suggestion of routinely programming heteromethod replication in the initial research phase. Each Ph.D. research would, for example, be required to induce the treatment variable in two methodologically independent ways, and for each implementation to measure the effects by two independent methods. If an hypothesized law was initially confirmed in all of the four heteromethod replications thus generated, most of the probably plausible rival hypotheses would have been ruled out in advance (without having ever been explicitly formulated). If all four were not consistent, but if there were several strong effects, the candidate would be left with a challenging empirical puzzle upon which to work, but without the temptation to over-strong theoretical claims which he would have had if he had only seen one part of the puzzle. This methodological precept is, however, not recommended. If I judge our present theoretical successes and experimental skills properly, full confirmation would almost never be found. The process would in general be much more discouraging than present practice, so much more so that many would cease research altogether. Journal editors would almost certainly reject an honest presentation, under current
standards against publishing negative results on novel hypotheses, and these standards are probably correct from the point of view of optimal collective information processing (Campbell, 1959, 168–170). The social system of science requires sufficient motivation to produce scientific investigations in redundant number. That motivation is much higher when each investigator believes that he has the optimal method, and, when his expectations are confirmed, believes that he has proven a true theory. Some degree of over-optimism may be necessary, both in anticipation and in retrospect upon accomplished research. So too with a perspective on artifacts in general, perhaps it is motivationally best not to anticipate these as an overwhelmingly likely aspect of all research, but instead to close our eyes to their general possibility, and to regard each such challenge as it appears as a specific local anomaly in an otherwise straightforward scientific quest. If we are indeed in an extremely difficult arena, then there is even a motivational utility in the regular occurrence of exciting findings which later are discounted as artifacts. These provide exciting rewards to the would-be discoverers, and exciting rewards to the successful critics (the more exciting the greater the reputation of the false claims). These are rewards and motivation for experimental work and empirical exploration. Both would be lost under a procedure that effectively screened out overoptimistic pseudo-confirmations of exciting theories.

Disguised Experiments in Natural Settings

While the formalism of the previous section provided a useful general perspective on possible artifacts, it serves to fragment a central class of plausible rival hypotheses with which this volume deals.
This we can call awareness of experimentation, or as I once labeled it in 1957, reactive arrangements: In any of the experimental designs, the respondents can become aware that they are participating in an experiment, and this awareness can have an interactive effect, in creating reactions to X [experimental treatment] which would not occur had X been encountered without this ‘‘I’m a guinea pig’’ attitude. Lazarsfeld (1948), Kerr (1945), and Rosenthal and Frank (1956), all have provided valuable discussions of this problem. Such effects limit generalizations to respondents having this awareness, and preclude generalization to the population encountering X with nonexperimental attitudes. The direction of the effect may be one of negativism, such as an unwillingness to admit to any persuasion or change. This would be comparable to the absence of any immediate effect from discredited communicators, as found by Hovland (1953). The result is probably more often a cooperative responsiveness, in which the respondent accepts the experimenter’s expectations and provides pseudoconfirmation. Particularly is this positive response likely when the respondents are self-selected seekers after the cure that X may offer. The Hawthorne studies (Roethlisberger and Dickson, 1939) illustrate such sympathetic changes due to awareness of the experimentation rather than to the specific nature of X. The problem of reactive arrangements is distributed over all features of the experiment which can draw the attention of the respondent to the fact of experimentation and its purposes. The conspicuous or reactive pretest is particularly vulnerable, inasmuch as it signals the topics and purposes of the experimenter. For communications of obviously persuasive aim, the experimenter’s topical intent is signaled by the X itself, if the
communication does not seem a part of the natural environment. Even for the posttest-only groups, the occurrence of the posttest may create a reactive effect. The respondent may say to himself, “Aha, now I see why we got that movie.” This consideration justifies the practice of disguising the connection between O [observation or measurement] and X . . . as through having different experimental personnel involved, using different facades, separating the settings and times, and embedding the X-relevant content of O among a disguising variety of other topics. (Campbell, 1957, 308–309)
Many, although not all, of the artifacts covered in the previous chapters are subsumable under the hypothesis that the results are what they are only because the subjects were aware that they were being experimented with, including the possibility of differential awareness on the part of the experimental group. Orne’s demand characteristics (Chapter 5) are entirely of this nature, although the placebo effects which he also reviews are not, inasmuch as they no doubt are also characteristic of nonexperimental medical applications of drugs. Pretest sensitization (Chapter 4), had it been empirically established, would have been in this class, and possibly the commitment effect of pretest is also due to awareness that one’s pretest and posttest scores will be experimentally compared. Volunteering for an experiment (Chapter 3) implies awareness of an experiment to be volunteered for. Experimenter effects (Chapter 6) are not in general of this nature (indeed, the Pygmalion effects occur with neither teacher nor pupil aware of the experiment) but those aspects of them due to respondent cooperation with the perceived goal of the experiment are. Suspiciousness of experimenter’s intent (Chapter 2) is a near synonym of the category head here used, and McGuire’s measures and experimental manipulations are specific illustrations of it. Evaluation apprehension (Chapter 7) is entirely a matter of awareness of experimentation, at least in the illustrations Rosenberg provides. The interaction effects of measurement which have led to efforts at indirect attitude measurement (Campbell, 1950; Kidder and Campbell, in press) and unobtrusive measurement (Webb et al., 1966) are of this nature. The issue of ‘‘experimental realism’’ versus ‘‘mundane realism’’ (Aronson and Carlsmith, 1968) is to a large extent a matter of awareness. While much of the research reported in this volume is reassuring, much of it is not. 
The alternative of simulation (e.g., Brown, 1962; Kelman, 1967; Orne, Chapter 5) is clearly useful as an auxiliary, but unappealing as a substitute because it carries awareness of experiment to an extreme. The obvious cure for all these artifacts is the disguised experiment in which the respondents (if not the experimenters) are unaware of participating in an experiment, are unaware that they are ‘‘being experimented with.’’ Such experiments are best done in natural rather than laboratory settings, not because natural settings are more representative of the target of generalization, but rather because in natural settings respondents do not suspect they are being experimented with. Laboratories, in general, are perceived as just that, i.e., as settings for experiments. The force of the argument may be strengthened if we note that most of the laboratory studies with dramatic ‘‘experimental realism’’ achieve this by distracting the respondent with some plausible facade or cover story while introducing the treatment as an incidental or accidental event. Thus French (1944) assembled groups for discussion purposes and then used smoke seeping under the door as an experimental treatment. Orne (Chapter 5) uses an ‘‘accidental’’ power failure, Darley and
Latané (1968) an epileptic seizure. For some, the real experiment is among the respondents waiting to serve in the experiment. For the innumerable experiments using a confederate, the treatment is the performance of a fellow respondent. The incidental fact that one experimenter was Negro, the other Caucasian has been used (Rankin and Campbell, 1955). The respondent has frequently been led to believe that he is the experimenter (e.g., Festinger and Carlsmith, 1959; Milgram, 1963), and so forth. All of these are efforts to use the natural aspects of the setting, to evade the effects of awareness of experimentation. The utility of these deceptions is being lost through publicity, but can be regained for a while by moving out of the laboratory entirely.

Social psychology has had by now enough experience with disguised experiments in natural settings to provide the basis of a mature taxonomy and methodology. Webb and his associates (1966) have provided very useful beginnings, although their focus is on measurement rather than experimental treatments. Aronson and Carlsmith (1968) provide another part of the framework. The projected book by Gross, Collins, and Bryan (in preparation) may fill the bill as also may Rosenblatt and Miller (in preparation), but the task has not yet been done, nor will this chapter do it. However, a few paragraphs and illustrations are in order. Preparatory to the illustrations, two general issues will be raised that provide perspectives for evaluation.

1. Content restrictions. What are needed for a fully disguised natural setting experiment are a natural mode of contact to persons (or to social units small enough and numerous enough so that random assignment to treatments achieves effective equation) with the mode of contact being private enough so that there is no awareness that other units are getting different treatments, and with a natural response available in the same setting relevant as a measure of effect.
If they are to retain naturalness, such settings cannot be created at will for all possible treatments and with all possible measures. Instead, they must be opportunistically hit upon. Any given setting will inevitably impose great restrictions on the kinds of problems that can be studied in it. These restrictions will be upon the kinds of experimental variables that can be implemented and upon the modes of measurement available.

2. Deceit, debriefing, and other ethical issues. Disguised experiments obviously involve deceit at some level, and as McGuire (Chapter 2, part 5) and Kelman (1967) make clear, this is an undesirable feature only justified by more important considerations. One of these considerations is the moral value of producing a nontrivial social science. In any such comparative weighing of competing values, the degree of each becomes relevant, for example the magnitude of the deceit. In terms of pain to the liar (the experimenter), white lies are less painful than black ones (McGuire’s “active deceit”), and while they may be equally damaging to the recipient, he would likewise judge them less immoral, due to our linguistic legalism. Lying of either sort is less painful and less immoral when occurring in a setting where it is both expected and justified by convention. In terms of debasing language and our communal ability to depend upon the verbal reports of others (Asch, 1952; Campbell, 1965), the effect is greater the more that lying is conspicuously exhibited by high prestige models. For all of these, the adaptation-level created by other segments of social practice provides a relativistic comparison base. A flagrant lie is less immoral introduced into a language community where such lies are frequent than when it is a novelty.

In these terms, disguised naturalistic experiments vary greatly, and probably on balance present no greater problems than do laboratory ones. They probably depend
more on nonverbal or white lies, less on direct deceit. They operate typically in arenas of discourse already more debased by deceit than are the halls of learning. If lying is revealed, the modeling impact is presumably less than it is in the professor-student relationship. But natural settings generally lack the implicit convention of acceptable lying which the psychology laboratory may be achieving.2 A separate problem is that of obtaining the respondent’s permission, a problem which has become of great practical importance now that half of our research support requires it. This is obviously an impossibility in the experimental setups to be described here, if disguise and unawareness is to be maintained. On the other hand, in those settings using means and ranges of communication that are within the public domain, and which nonexperimenters are using freely without such permission, this becomes an utterly unreasonable requirement. Another ethical problem is that of invasion of privacy. This is not a necessary aspect of disguised naturalistic experiments, and indeed is an impossibility in some. Anonymity of records is an aspect of the problem. However, when potentially embarrassing material is collected in a manner that makes possible linking it with the person’s name, the threat to the invasion of privacy is made worse by the disguise and the lack of permission. Injury, including humiliation and insult, is a problem no greater in degree than in laboratory experiments. Debriefing, explaining to the respondent the true nature of the experiment, apologizing for the deception, and if possible, providing feedback of the results are procedures characteristic of the self-announced campus laboratory and generally omitted from the disguised field experiment. While such debriefing has come to be a standard part of deception experiments in the laboratory, it has many ethical disadvantages. 
It is many times more of a comfort to the experimenter for his pain at deceiving than to the respondent who may learn in the process of his own gullibility, conformity, cruelty, or bias. It provides modeling and publicity for deceit and thus serves to debase language for the respondent as well as for the experimenter. It reduces the credibility of the laboratory and undermines the utility of deceit in future experiments.3 Argyle (1962), Milton Rokeach (personal communications), Stollak (1967), McGuire (Chapter 2), and Aronson and Carlsmith (1968) have called attention to these disadvantages, and they are strong enough to justify elimination of debriefing in those cases where the experimental treatment falls within the range of the respondent’s ordinary experience, merely being an experimental rearrangement of normal-level communications. This normal range is certainly exceeded in the Asch (1956) studies that present eight fellow Swarthmore students in solid contradiction of what would ordinarily have been a simple perceptual judgment, or the Milgram (1963) studies in which the respondent had to administer strong electrical shocks to a fellow student. It is probably exceeded in persuasive communications containing fictitious facts on important topics, but is probably not exceeded in most persuasion studies. In experimental social psychology, we are doomed to wear out our laboratories. For this reason we are already leaving the college in favor of the high school, the grammar school, and the street. Publicity will eventually contaminate these laboratories too, but this process will be greatly increased, and public anger over the deception not reduced, by debriefing in disguised naturalistic experiments.

2 Indeed, one practical way of avoiding the ethical problem on campus would be to announce to all members of the subject pool at the beginning of the term, “In about half of the experiments you will be participating in this semester, it will be necessary for the validity of the experiment for the experimenter to deceive you in whole or in part as to his exact purpose. Nor will we be able to inform you as to which experiments these were or as to what their real purpose was, until after all the data for the experiment have been collected. We give you our guarantee that no possible danger or invasion of privacy will be involved, and that your responses will be held in complete anonymity and privacy. We ask you at this time to sign the required permission form, agreeing to participate in experiments under these conditions.” This would merely be making explicit what is now generally understood, and probably would not worsen the problem of awareness and suspicion that now exists.

3 The gleeful reporting of deception experiments in introductory psychology texts and lectures is probably still more important in regard to these last two points.

3. A range of classic studies. Gosnell (1927) sent persuasive messages to registered voters urging them to vote, and used precinct records to later determine whether or not the members of different experimental and control groups had voted, achieving an entirely inconspicuous experiment using a range of communications well within normal limits. While today there would be distrust of Chicago’s precinct records, other cities’ are still useable. Here is a laboratory which should have been reused a hundred times by now, but so far as I know, it has not been reused even once. While the topic is very narrow, the persuasive messages could vary along a wide range of the experimental dimensions utilized in laboratory persuasion studies. The value of this laboratory would greatly increase if one could use how the person voted as well as whether he voted. While this information is not public for individuals, it is public for precincts as a whole, and this became the sampling unit for Hartmann’s (1936) classic study of rational versus emotional political leaflets. Used in a state like California in which voters get to vote on issues as well as persons, a very wide range of persuasion theories could be tested. Again, Hartmann’s laboratory has not been reused. In these studies, permission and debriefing would seem totally unwarranted, unless the content of the communications contained libel or falsehood, and if so, debriefing after the election would certainly raise a storm of justified protest.
Thus the range of experimental stimuli is certainly limited—but could still cover one-sided versus two-sided communications, extremity of position advocated, or degree of adulatory vocabulary. Gosnell and Hartmann were both advocating sides they genuinely believed in. (Hartmann was himself running for mayor on the Socialist ticket.) This sincerity, and the related nondeception, would be lost in primacy-recency studies in which both of opposing alternatives are advocated, unless experimenters of opposing advocacies collaborated, or unless an experimenter got endorsements for the appropriate messages from the opposing sides. This is moving into the white-lie area, but on the other hand all that need be manipulated is the when and to-whom of messages that are going to get partial and haphazard distribution anyway. (In such studies one would often give disproportionate publicity to minor issues, just because of the relative absence of other communications on the same topic.) Using this laboratory for conformity studies (Campbell, 1951) one would abstain from feedback of falsified public opinion poll results, so readily used in the college laboratory, and limit one’s comparisons to the presence or absence of feedback, and the source (precinct, state, or nation) from which the feedback came. This limitation would in some cases represent a very real sacrifice in clarity of experimental inference, but to present falsified poll results would be an intolerable tampering with the ballot, quite different in kind than had not the action of voting been involved. Most disguised field experiments provide more limited laboratories than this, and are opportunistically hit upon for very specific purposes. Thus in a conformity study Lefkowitz, Blake, and Mouton (1955) modeled walking across an intersection against the light, in high-status or low-status clothing, and observed the effect upon an observer’s tendency to violate the traffic light. 
Schwartz and Skolnick (1962) manipulated the contents of applicant briefs sent to employers of temporary summer resort help, studying the effect of a criminal record upon employability. Schwartz and Orleans (1967) used income tax returns to measure aroused fear of legal sanctions. Bryan and Test (1966) created the altruism opportunity of helping a woman with a flat tire: with and without a prior helping model. Page (1958) randomly applied motivating comments on student
papers and measured impressive effects on later classroom tests. Doob and Gross (1968) used the horn-honking response of the car behind and the experimental treatments of failing to go when the light went green, in high-status versus low-status cars. While some of these are such restricted laboratories that one can hardly imagine any other problem being studied in them, some could be more broadly used. Thus the Schwartz and Skolnick setting could be used for a wide variety of topics in impression formation or the presentation of self, albeit with a very impoverished dimensionality of effect measures. Even a technique seemingly so narrowly focused on honesty as the lost letter technique (Merritt and Fowler, 1948) lends itself to the addition of many other variables. By addressing envelopes to attitude-relevant groups, Milgram (Milgram, Mann, and Harter, 1965; Milgram, 1969) has obtained performance measures of attitude of seemingly high validity. By leaving the envelopes unsealed, Gross (1968) has been able to use variations in letter content to manipulate a variety of variables.

4. Employment. Essential to experimentation is arbitrary control over some segment of a person’s time. It is this arbitrariness which cuts the causal links between past conditions and experimental treatments and which makes possible randomly assigning equivalent samples to different treatments. The greater this arbitrary control, the greater the multipurpose experimental utility. One such setting is provided by the employment situation. I will neglect here its use in applied experiments focused on the employer’s problems and administrative options (e.g., Feldman, 1937; Kerr, 1945), and focus instead on the uses of employment for research in theoretical social psychology. Adams (1963) has set an outstanding example in his studies of pay inequity and work produced.
Typical is his use of short term part-time employees, in which the wages paid represent a research cost of the same order of magnitude as paying subjects in the manifest laboratory. Stuart Cook (1964) has used this setting in his classic study (as yet unpublished) of the effect of equal-status contact on race attitudes. The admirable study of Rokeach and Mezei (1966) used a related setting, the employment agency, in replicating a finding already demonstrated in more artificial laboratories. (Close inspection of their results, however, suggests some degree of leaning-over-backward in the direction of fair-play in interracial contacts, a trend possibly symptomatic of reactive arrangements.) In these illustrative studies, no extreme or damaging treatments were involved, there being merely a scheduling of experiences that some would have been, or might well have been, exposed to anyway. For these studies, debriefing would seem unnecessary, if not unwise, and the ethical problems of deception minimal. But this is of course a matter of the nature of the treatments. Consider in contrast the use of military assignment power to create a realistic threat of imminent death (Berkun, Bialek, Kern, and Yagi, 1962; Daily Palo Alto Times, 1959; Argyle, 1960), a landmark in unethical excess of scientific zeal.

5. Encounters in public places. There is a considerable range of experimental stimuli that can be administered in the chance encounters of strangers. Bryan and Test (1967) set up a Salvation Army kettle on the sidewalk and varied the ethnicity of the Salvation Army member and the presence or absence of a model giving. Feldman (1968) had locals and foreigners ask for directions, ask for help in mailing a letter, and ask if the respondent had dropped the dollar bill just found.
Cook, Bean, Cialdini, Krovetz, and Ray (submitted) have done an interesting study in which a somewhat unusually forward but nonthreatening stranger complimented a woman walking on a campus, the effects being measured by a charity appeal set up farther along the path (and by interviewers who were posing as survey takers and who asked the women questions about their reactions to the different kinds of compliment). Milgram, Bickman and Berkowitz (submitted) set up experimental seed-crowds of varying size and noted the number of passers-by who were thereby
attracted. Sommer’s (1959) approach to interpersonal space lends itself to such settings, through asking strangers questions, sitting next to them on public conveyances, at cafeteria tables, etc. The possibilities are wide in range, including the ethically unacceptable. Rumors already provide reports of epileptic seizures enacted on streets, of experimental taxi drivers introducing unexplained delays to frustrate anxious passengers, and the like.

6. Sample solicitation. For persuasion studies, a broadly flexible disguised laboratory is provided by all those settings in which custom sanctions making appeals to strangers. Intrinsically, a response measure is made available in the natural response to the appeal. Selling, fund raising, and petition circulating exemplify the sanctioned goals; direct mail, telephone, and door-to-door contacts, the sanctioned means. Survey research establishments provide a readily transferable sampling technology and staff (and what more poetic reversal than to have public opinion surveyors pose as salesmen). Door-to-door or letter-to-letter variations in the persuasive appeal provide an elegant opportunity for random equivalence without respondents being aware of experimentation. (Spatial separation of comparison groups receiving different appeals would often be desirable to avoid suspicion through respondents comparing experiences.) It is a commentary on the ethics of white lies that the experimenter and the door-bell-ringers would feel better about the deception if a genuine interest in the fund collection or the product promotion could be incorporated, as it often could be by offering one’s services to the relevant causes. Note here an additional financial advantage, in that costs-of-collection are legitimately deductible from the proceeds in much charitable fund raising. Salesmen’s commissions would have a similar role.
These ‘‘sincere’’ façades would also expedite getting the solicitation permits that most police departments now require. One can envisage primacy-recency studies in which funds were solicited alternately for the White Citizen’s League and the Black Power Coalition. One could study the effect of degrees of fear-arousing appeals for nuclear disarmament on the sales of air-raid-shelter construction plans. For fund raising, the amount given provides a relevant quantification, and the comments of even the noncontributors can be graded in favorableness. For sales, the dichotomous sale-no-sale can be enriched by a series of Guttman scale steps, through offering postcards which can be used for postponed purchase decisions, booklets with additional information, etc. For petitions, the natural measure is dichotomous, but not unusable on that account, and comments are codable as in opinion surveys (although face-to-face recording of comments would be out). In some settings, a mild and a strong version of the petition could be offered without reducing plausibility. Blake and his associates (Blake, Mouton, and Hain, 1956; Helson, Blake, and Mouton, 1958) have pioneered the use of petitions in nonlaboratory campus studies, as have also Gore and Rotter (1963). In the advertising industry there are some highly applied experiments with direct mail advertising. Cook and Insko (1968) have used mailed letters of differing contents as experimental treatments. Brock (1965) has used a salesman in a store to administer varying experimental treatments. It is probable that door-to-door sales companies have done some deliberate experimentation in techniques. But, by and large, this vast range of possibilities has not been utilized for the purposes of science. 
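The random equivalence and spatial separation described for door-to-door appeals can be illustrated with a short sketch. It assumes, purely for illustration, that households are grouped into blocks (for example, street segments) and that each whole block receives a single version of the appeal, so that neighbors are unlikely to compare differing appeals. The function name `assign_appeals`, the block labels, and the appeal labels are hypothetical and do not come from the studies cited.

```python
import random
from collections import Counter


def assign_appeals(blocks, appeals, seed=None):
    """Randomly assign one appeal version to each block of households.

    Hypothetical helper: whole blocks (not single doors) are the unit of
    randomization, so every household in a block hears the same appeal.
    """
    rng = random.Random(seed)      # seeded generator for a reproducible plan
    shuffled = list(blocks)        # copy so the caller's list is untouched
    rng.shuffle(shuffled)          # put the blocks in random order
    # Deal the shuffled blocks out to the appeal versions in turn, which keeps
    # the group sizes within one block of each other.
    return {block: appeals[i % len(appeals)] for i, block in enumerate(shuffled)}


if __name__ == "__main__":
    blocks = [f"block-{n}" for n in range(1, 11)]   # illustrative labels only
    plan = assign_appeals(blocks, ["mild appeal", "strong appeal"], seed=7)
    print(Counter(plan.values()))                   # group sizes per appeal
```

Randomizing by block rather than by household is one possible design choice: it reduces the chance of respondents comparing experiences, but any analysis would then probably need to treat the block, not the household, as the unit of assignment.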
While opinion surveys would seem apt to invoke a ‘‘guinea pig’’ effect, they have become enough a part of the public scene so that several theoretically oriented experimenters have used them to present varied treatments in disguised field experiments (e.g., Abelson and Miller, 1967; Freedman and Fraser, 1966; Miller and Levy, 1967). They are less disguised than sample solicitations in general. Artifices have to be added to introduce persuasive content or other treatment, while solicitation provides the occasion for persuasion. On the other hand they offer the special advantage of justifying verbal attitude measures.
Customers as well as salesmen may be experimenters. The old civil-rights approach of ‘‘test cases’’ provides such an experimental paradigm for field experimentation on the effect of race on access to housing, overnight lodging (La Piere, 1934), service in restaurants (Kutner, Wilkins, and Yarrow, 1952), etc. Franzen (1950) experimentally varied customer’s behavior in scaling the willingness of druggists to give medical advice. Jung (1959) experimented by presenting automobile dealers with customers of varying degrees of gullibility. Feldman (1968) varied the ethnicity of customers and studied merchants’ acceptance of inadvertent overpayment and taxi drivers’ overcharges. E. Schaps⁴ is studying helping behavior among shoe salesmen as a function of customer dependency (e.g., a woman with or without a broken heel on the shoe she is wearing), costs or comparison level for alternatives (other customers waiting for service or not), and visibility (the experimental customer has a friend with her or not). The main dependent variable is the number of shoes shown the customer, although codings of verbal responses are also available. Ethical issues in the design of the study include whether or not to debrief the salesman and remunerate him for his time. Arguing against these procedures are the following considerations. The treatment does not exceed the frustratingness of the regular range of female customers, some 30 per cent of whom do not make purchases in any given visit. Given sufficient budget and the use of a large number of experimental customers, a desirable feature in any event (Hammond, 1954; Brunswik, 1956), a final purchase could be included without social waste. The setting is one in which the social norms for deceit have already been debased not only through deceit in salesmanship, but also through the use of pseudo-customers to check on employee courtesy, effectiveness, and honesty, entrapments which also invade privacy by attaching the acts to the salesman’s name. 
In contrast, the research customer provides complete privacy and anonymity. Debriefing would probably not reduce the salesman’s frustration, but merely change its target. For a professional proud of his sophistication and cynicism, it would be painful to learn he’d been had. Some damage to the future utility of natural settings would result even from the salesman’s private communications, and the possibility of journalistic publicity would be greatly enhanced. The nature of the experimental treatment is the crucial factor, and those treatments requiring debriefing should probably not be used anyway without the respondent’s permission. On the other hand there are the ethical values of a relevant and dependable social science, and our desperate shortage of appropriate laboratories.
7. Artifacts. A spirit of advocacy has slipped into the presentation of the previous paragraphs, but this must not be allowed to blind us to the fact that disguised field experiments share the epistemological predicament described so pessimistically in the earlier sections of this paper. It is only the one family of artifacts related to awareness of participating in an experiment that is controlled. Of the artifacts treated in this volume, it is obvious that experimenter effects will be likely in most of the natural settings here described, aggravated in some by the experimenter having to record verbal reactions after having left the respondent. For each of them, the treatment variable and the response measure will turn out to be conceptually complex, with irrelevant aspects frequently responsible for the results, either as main effects or as modifying interactions. For the natural response measures involved, Webb and associates (1966), in spite of their generally optimistic tone, have provided detailed grounds for pessimism. In the end, expanded-content control groups or replication with varied treatments and measures will be required just as they have been for laboratory studies.
⁴ ‘‘Some determinants of helping and exploitative behaviors in a field situation,’’ PhD dissertation, in preparation. (Northwestern University.)
Summary
The logic of scientific inference indicates that experiments cannot prove theories, but only probe them. For every theory-corroborating experimental result there are an infinity of rival explanations potentially available, a few of which we must attend to because they are both explicitly advocated and have a plausibility comparable to that of the theory corroborated. A major class of these plausible rival hypotheses is methodological artifacts introduced through irrelevant vehicular aspects of the experimental treatment or the measuring device, either as main effects or interactions. Control can never be complete in ruling out all plausible rival hypotheses in advance. As a rule, research must seek out ways of controlling each artifact as it is developing, through means that are specific to each combination of artifact hypothesis and theoretical variable. But general-purpose controls are discovered for recurrent classes of artifacts, and these become the empirically developed methodological requirements of a field. General strategies of control include expanding the content of the control group, varying the vehicular irrelevancies of the treatment variable, varying the method of measurement, and supplementary variation in data quality. Because most of the important artifact hypotheses in laboratory social psychology are made possible by the respondent’s awareness that he is participating in an experiment, attention is given to the techniques and ethics of disguised experiments in natural, nonlaboratory settings. Such experiments do not avoid the general artifact problem, but just this one type.
References
Abelson, R. P., and Miller, J. C. Negative persuasion via personal insult. Journal of Experimental Social Psychology, 1967, 3, 321–333.
Adams, J. S. Toward an understanding of inequity. Journal of Abnormal and Social Psychology, 1963, 67, 422–436.
Argyle, M. Report to the Council of the British Psychological Society on my dealings with the APA Committee on Scientific and Professional Ethics and Conduct. June 24, 1960. (Mimeo)
Argyle, M. Experimental studies of small social groups. In A. T. Welford, M. Argyle, O. V. Glass, and J. N. Morris, (Eds.), Society: Problems and methods of study, London: Routledge and Kegan Paul, 1962, 77–89.
Aronson, E., and Carlsmith, J. M. Experimentation in social psychology. Handbook of Social Psychology, (2nd ed.), Volume 2, Reading, Massachusetts: 1968, 1–79.
Asch, S. E. Social Psychology, Englewood Cliffs, New Jersey: Prentice Hall, 1952.
Asch, S. E. Studies of independence and conformity: I. A minority of one against a unanimous majority. Psychological Monographs, 1956, 70, No. 9 (whole number 416).
Berkun, M., Bialek, H. M., Kern, R. P., and Yagi, K. Experimental studies of psychological stress in man. Psychological Monographs, 1962, 76 (15, whole number 534) 39.
Bitterman, M. E. Phyletic differences in learning. American Psychologist, 1965, 20, 396–410.
Blake, R. R., Mouton, J. S., and Hain, J. D. Social forces in petition signing. Southwest Social Science Quarterly, 1956, 36, 385–390.
Block, J. The challenge of response sets. New York: Appleton-Century-Crofts, 1965.
Boring, E. G. The nature and history of experimental control. American Journal of Psychology, 1954, 67, 573–589.
Brock, T. C. Communicator-recipient similarity and decision change. Journal of Personality and Social Psychology, 1965, 1, 650–654.
Brown, R. Models of attitude change. In R. Brown, E. Galanter, E. H. Hess, & G. Mandler (Eds.), New directions in psychology. New York: Holt, Rinehart & Winston, 1962.
Brunswik, E. Perception and the representative design of psychological experiments. (2nd ed.) Berkeley: University of California Press, 1956.
Bryan, J., and Test, M. A. Models and helping: naturalistic studies of aiding behavior. Journal of Personality and Social Psychology, 1967, 6, 400–407.
Campbell, D. T. The indirect assessment of social attitudes. Psychological Bulletin, 1950, 47 (1), 15–38.
Campbell, D. T. On the possibility of experimenting with the ‘‘bandwagon’’ effect. International Journal of Opinion and Attitude Research, 1951, 5 (2), 251–260. Reprinted in H. Hyman and E. Singer (Eds.), Readings in Reference Group Theory and Research. New York: The Free Press, 1968, 452–460.
Campbell, D. T. Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 1957, 54 (4), 297–312.
Campbell, D. T. Methodological suggestions from a comparative psychology of knowledge processes. Inquiry (University of Oslo Press), 1959, 2, 152–182.
Campbell, D. T. Recommendations for APA test standards regarding construct, trait, or discriminant validity. American Psychologist, 1960, 15, 546–553.
Campbell, D. T. Variation and selective retention in socio-cultural evolution. In H. R. Barringer, G. I. Blanksten, and R. W. Mack, (Eds.), Social change in developing areas: a reinterpretation of evolutionary theory. Cambridge, Massachusetts: Schenkman, 1965, 19–49.
Campbell, D. T. Pattern matching as an essential in distal knowing. In K. R. Hammond, (Ed.), Egon Brunswik’s Psychology. New York: Holt, Rinehart and Winston, 1966, 81–106.
Campbell, D. T. Evolutionary Epistemology. In P. A. Schilpp (Ed.) The philosophy of Karl R. Popper. In: The library of living philosophers. La Salle, Illinois: The Open Court Publishing Co., (volume in press).
Campbell, D. T., and Fiske, D. W. Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 1959, 56 (2), 81–105.
Campbell, D. T., and Stanley, J. C. Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally, 1963, 171–246. Reprinted as Experimental and quasi-experimental design for research. Chicago: Rand McNally, 1966.
Campbell, D. T., Siegman, C. R., and Rees, M. B. Direction-of-wording effects in the relationships between scales. Psychological Bulletin, 1967, 68, 293–303.
Christie, R. Authoritarianism reexamined. In R. Christie and M. Jahoda (Eds.), Studies in the scope and the method of the authoritarian personality. New York: The Free Press, 1954, 123–196.
Corsini, R. J. Understanding and similarity in marriage. Journal of Abnormal and Social Psychology, 1956, 52, 327–332.
Cook, S. W. Desegregation and attitude change. Address to the Southeastern Psychological Association, 1964. (Mimeo)
Cook, T. D., Bean, J. R., Cialdini, R. B., Krovetz, M. L., and Ray, A. A. Three contexts of ingratiation, and their effects on attributions, affect, and donating to charity: Two field experiments. (Submitted for publication).
Cook, T. D., and Insko, C. A. Persistence of attitude change as a function of conclusion reexposure: A laboratory-field experiment. Journal of Personality and Social Psychology, 1968, 9, 322–328.
Cronbach, L. J. Response sets and test validity. Educational and Psychological Measurement, 1946, 6, 475–494.
Cronbach, L. J. Further evidence on response sets and test design. Educational and Psychological Measurement, 1950, 10, 3–31.
Cronbach, L. J. Proposals leading to analytic treatment of social perception scores. In R. Tagiuri and L. Petrullo (Eds.), Person perception and interpersonal behavior. Stanford: Stanford University Press, 1958, 353–379.
Daily Palo Alto Times, Psychologists protest tests by Army to see recruits reaction to danger. Thursday, August 13, 1959, p. 5.
Darley, J. M., and Latané, B. Bystander intervention in emergencies: Diffusion of responsibility. Journal of Personality and Social Psychology, 1968, 8, 377–383.
Doob, A. N., and Gross, A. E. Status of frustrator as an inhibitor of horn-honking responses. Journal of Social Psychology, 1968, 76, 213–218.
Edwards, A. L. The social desirability variable in personality assessment and research. New York: Dryden, 1957.
Feldman, H. Problems in labor relations. New York: Macmillan, 1937.
Feldman, R. E. Response to compatriot and foreigner who seek assistance. Journal of Personality and Social Psychology, 1968, 10, 202–214.
Festinger, L., and Carlsmith, J. M. Cognitive consequences of forced compliance. Journal of Abnormal and Social Psychology, 1959, 58, 203–210.
Franzen, R. Scaling responses to graded opportunities. Public Opinion Quarterly, 1950, 14, 484–490.
Freedman, J. L., and Fraser, S. C. Compliance without pressure: The foot-in-the-door technique. Journal of Personality and Social Psychology, 1966, 4, 195–202.
French, J. R. P. Organized and unorganized groups under fear and frustration. University of Iowa Studies in Child Welfare, 1944, 20, 229–309.
Glock, C. Y. The effects of reinterviewing in panel research. 1958. Multilith of a chapter to appear in P. F. Lazarsfeld (Ed.), The study of short run social change, in preparation.
Gore, P. M., and Rotter, J. B. A personality correlate of social action. Journal of Personality, 1963, 31, 58–64.
Gosnell, H. F. Getting out the vote: an experiment in the stimulation of voting. Chicago: University of Chicago Press, 1927.
Gross, A. E. Some determinants of honesty in a naturalistic situation. Talk presented at the Western Psychological Association, San Diego, California, March, 1968.
Gross, A. E., Collins, B., and Bryan, J. Experiments in social psychology. New York: Wiley (in preparation).
Hammond, K. R. Representative vs. systematic design in clinical psychology. Psychological Bulletin, 1954, 51, 150–159.
Hartmann, G. W. A field experiment on the comparative effectiveness of ‘‘emotional’’ and ‘‘rational’’ political leaflets in determining election results. Journal of Abnormal and Social Psychology, 1936, 31, 99–114.
Helson, H., Blake, R. R., and Mouton, J. S. Petition-signing as adjustment to situational and personal factors. Journal of Social Psychology, 1958, 48, 3–10.
Hempel, C. G. Philosophy of natural science. Englewood Cliffs, New Jersey: Prentice Hall, 1966.
Hovland, C. I., Janis, I. L., and Kelley, H. H. Communication and persuasion. New Haven: Yale University Press, 1953.
Jung, A. F. Price variations among automobile dealers in Chicago, Illinois. Journal of Business, 1959, 32, 315–326.
Kelman, H. C. The human use of human subjects. Psychological Bulletin, 1967, 67, 1–11, reprinted in Kelman, H. C. A time to speak: On human values and social research. San Francisco: Jossey-Bass, 1968.
Kerr, W. A. Experiments on the effect of music on factory production. Applied Psychological Monographs, 1945, No. 5.
Kidder, L., and Campbell, D. T. The indirect testing of social attitudes. In G. Summers (Ed.), (to be a chapter in book on Attitude Theory and Measurement, in press).
Kuhn, T. The structure of scientific revolutions. Chicago: University of Chicago Press, 1962.
Kutner, B., Wilkins, C., and Yarrow, P. R. Verbal attitudes and overt behavior involving racial prejudice. Journal of Abnormal and Social Psychology, 1952, 47, 649–652.
La Piere, R. T. Attitudes vs. actions. Social Forces, 1934, 13, 230–237.
Lazarsfeld, P. F. Training guide on the controlled experiment in social research. Columbia University, Bureau of Applied Social Research, 1948. (Mimeo)
Lefkowitz, M., Blake, R. R., and Mouton, J. S. Status factors in pedestrian violation of traffic signals. Journal of Abnormal and Social Psychology, 1955, 51, 704–706.
Merritt, C. B., and Fowler, R. G. The pecuniary honesty of the public at large. Journal of Abnormal and Social Psychology, 1948, 43, 90–93.
Milgram, S. Behavioral study of obedience. Journal of Abnormal and Social Psychology, 1963, 67, 371–378.
Milgram, S. The lost-letter technique: An unusual way to predict the outcome of elections, sentiments on integration, the strength of communist influence in Hong Kong, and the orientation of Americans to communists and fascists. Psychology Today, 1969, June (in press).
Milgram, S., Bickman, L., and Berkowitz, L. Note on the drawing power of crowds. (Submitted for publication, 1968).
Milgram, S., Mann, L., and Harter, S. The lost-letter technique: A tool of social research. Public Opinion Quarterly, 1965, 29, 437–438.
Miller, N., and Levy, B. H. Defaming and agreeing with the communication as a function of emotional arousal, communication extremity, and evaluative set. Sociometry, 1967, 30, 158–175.
Naroll, R. Data quality control. New York: The Free Press, 1962.
Page, E. B. Teacher comments and student performance: A seventy-four classroom experiment in school motivation. Journal of Educational Psychology, 1958, 49, 173–181.
Petrie, H. G. The strategy sense of ‘‘methodology.’’ Philosophy of Science, 1968, 35, 248–257.
Polanyi, M. Personal knowledge: Toward a post-critical philosophy. London: Routledge and Kegan Paul, 1958.
Popper, K. R. The logic of scientific discovery. London: Hutchinson, or New York: Basic Books, 1959.
Popper, K. R. Conjectures and refutations. London: Routledge and Kegan Paul, New York: Basic Books, 1963.
Quine, W. V. From a logical point of view. Cambridge, Massachusetts: Harvard University Press, 1953.
Rankin, R. E., and Campbell, D. T. Galvanic skin response to Negro and white experimenters. Journal of Abnormal and Social Psychology, 1955, 51(1), 30–33.
Roethlisberger, F. J., and Dickson, W. J. Management and the worker. Cambridge, Massachusetts: Harvard University Press, 1939.
Rokeach, M., and Mezei, L. Race and shared belief as factors in social choice. Science, 1966, 151, 167–172.
Rorer, L. G. The great response-style myth. Psychological Bulletin, 1965, 63, 129–156.
Rosenblatt, P. C., and Miller, N. Experimental method. In C. G. McClintock, (Ed.), Experimental Social Psychology. (To be published by Holt, Rinehart, and Winston, in preparation.)
Rosenthal, D., and Frank, J. D. Psychotherapy and the placebo effect. Psychological Bulletin, 1956, 53, 294–302.
Rosenthal, R. Experimenter effects in behavioral research. New York: Appleton-Century-Crofts, 1966.
Salmon, W. Logic. Englewood Cliffs, New Jersey: Prentice Hall, 1963.
Schwartz, R. D., and Skolnick, J. H. Two studies of legal stigma. Social Problems, 1962, 10, 133–142.
Schwartz, R. D., and Orleans, S. On legal sanctions. University of Chicago Law Review, 1967, 34(2), 274–300.
Selltiz, C., Jahoda, M., Deutsch, M., and Cook, S. W. Research methods in social relations. (Rev. ed.) New York: Holt-Dryden, 1959.
Segall, M. H., Campbell, D. T., and Herskovits, M. J. The influence of culture on visual perception. Indianapolis: Bobbs-Merrill, 1966.
Silverman, L. H. A Q-sort study of the validity of evaluations made from projective techniques. Psychological Monographs, 1959, 73(7, Whole No. 477).
Sommer, R. Studies in personal space. Sociometry, 1959, 22, 247–260.
Stollak, G. E. Obedience and deception research. American Psychologist, 1967, 22, 678.
Toulmin, S. The philosophy of science. London: Hutchinson, 1953.
Toulmin, S. Foresight and understanding: An inquiry into the aims of science. Bloomington: Indiana University Press, 1961.
Webb, E. J., Campbell, D. T., Schwartz, R. D., and Sechrest, L. B. Unobtrusive measures: nonreactive research in the social sciences. Chicago: Rand McNally, 1966.
Wilson, E. B. An introduction to scientific research. New York: McGraw-Hill, 1952.
BOOK TWO EXPERIMENTER EFFECTS IN BEHAVIORAL RESEARCH Robert Rosenthal
Preface
The effort to understand human behavior must itself be one of the oldest of human behaviors. But for all the centuries of effort, there is no compelling evidence to convince us that we do understand human behavior very well. The application of that reasoning and of those procedures which together we call ‘‘the scientific method’’ to the understanding of human behavior is of relatively recent origin. What we have learned about human behavior in the short period—let us say from the founding of Wundt’s laboratory in Leipzig in 1879 until now—is out of all proportion to what we learned in preceding centuries. The success of the application of scientific methods to the study of human behavior has given us new hope for an accelerating return of knowledge on our investment of time and effort. But most of what we want to know is still unknown. The application of scientific methods has not simplified human behavior. It has perhaps shown us more precisely just how complex it really is. In the contemporary behavioral science experiment, it is the research subject we try to understand. He serves as our model of man in general, or at least of a certain kind of man. We know that his behavior is complex. We know it because he does not behave exactly as any other subject behaves. We know it because sometimes we change his world ever so slightly and observe his behavior to change enormously. We know it because sometimes we change his world greatly and observe his behavior to change not at all. We know it because the ‘‘same’’ careful experiment conducted in one place at one time yields results very different from the results of an experiment conducted in another place at another time. We know his complexity because he is so often able to surprise us with his behavior. Most of this complexity of human behavior may be in the nature of the organism. But some of it may derive from the social nature of the psychological or behavioral experiment itself. 
Some of the complexity of man as we know it from his model, the research subject, resides not in the subject himself but rather in the particular experimenter and in the interaction between subject and experimenter.
That portion of the complexity of human behavior which can be attributed to the experimenter as another person and to his interaction with the subject is the focus of this book. Whatever we can learn about the experimenter and his interaction with his subject becomes uniquely important to the behavioral scientist. To the extent that we hope for dependable knowledge in the behavioral sciences generally, and to the extent that we rely on the methods of empirical research, we must have dependable knowledge about the researcher and the research situation. In this sense, the study of the behavioral scientist-experimenter is crucial; there are important implications for how we conduct and how we assess our research. There is another sense in which the study of the experimenter and his interaction with his subject is important. In this sense, it is not at all crucial that the experimenter happens to be the collector of scientific data. He might as well be a teacher interacting with his student, an employer interacting with his employee, a healer interacting with his patient, or any person interacting with another. In this sense, the experimenter himself serves as a model of man or of one kind of man. His subject also serves as a model, and the interaction between them, the situation arising from their encounter, serves as a model of other more or less analogous situations. From the behavior of the experimenter, we may learn something of consequence about human behavior in general. This book is divided into three parts. The first deals with the general nature of the effects an experimenter may have on the results of his research. The second describes a program of research on the effects of a particular type of experimenter variable on the results of research. The third takes up some methodological implications of the data presented. Part I consists of two sections. 
The first contains a discussion of those effects of the experimenter that do not influence the subject’s response even though they may affect the results of the research. When the experimenter serves as observer of the subject’s behavior, when he records the data, summarizes, analyzes, and interprets the data, he may err in significant ways but not by directly affecting the subject’s response. However, when the experimenter interacts with the subject, his own more enduring attributes, his attitudes, and his expectancies may prove to be significant determinants of the subject’s behavior in the experiment. These effects of the experimenter are discussed in the second section of Part I. The last chapter of Part I provides a historical introduction to the experimenter variable that is central to the second major part of this book. That variable is the experimenter’s orientation toward the outcome of his research. The hypothesis is put forward that the experimenter’s hypothesis, his expectancy, can be a significant determinant of the results of his research. Part II begins with a presentation of the evidence that an experimenter’s expectancy may serve as a self-fulfilling prophecy of his subjects’ responses when the subjects are either humans or animals. In these and in the following chapters, the evidence is presented in sufficient detail for the research to be critically evaluated by the reader without reference to papers published elsewhere. This seems particularly necessary in a work that purports to offer some suggestions for the further development of behavioral research methodology.
In the second section of Part II, some factors are discussed that have been shown to augment, to neutralize, or to reverse the effects of the experimenter’s expectancy on the results of his research. These factors include subjects’ expectancies, the nature of data earlier obtained by the experimenter, the motive states aroused in the experimenter, and the subjects’ view of the experimenter. What are the factors that make possible the dramatic effects of the experimenter’s expectancy? The third section of Part II is addressed to this question. Those characteristics and behaviors of the experimenter associated with greater exertion of unintentional influence are discussed. Those characteristics of experimental subjects associated with a greater susceptibility to the influence of the experimenter’s hypothesis are presented. Finally, those cues that might serve to communicate the experimenter’s expectancy to his subjects are considered. The evidence put forth in Part II of the book has clear methodological implications for the behavioral researcher. But beyond the methodological implications there are substantive implications as well, for what is evidence for the effects experimenters can have on their subjects is also, more generally, evidence for the importance in human relations of unintentional interpersonal influence and, more specifically, the interpersonal influence that stems from one person’s expectancy of another’s behavior. It seems not overly important that the possibility of unintentional influence has been demonstrated. No one will probably be very surprised. What does seem important is that the process of unintended social influence can be observed in the laboratory, and that its dynamics can now be more fully and more systematically investigated. Part III deals with a number of methodological implications. 
In the first section of Part III, the generality of experimenter effects is discussed and a conceptual schema presented which should make it easier to talk about the operating characteristics of experimenters. Also the general problem of replications and their assessment is related to the earlier sections of the book. In the second section of Part III, concrete proposals are offered which the behavioral scientist can employ to reduce and/or assess the effects of his and his surrogate’s expectancies on the results of his research. An effort has been made to have these suggestions be useful, and they are offered with due regard for the practical problems of getting research done, getting it done expeditiously, and getting it done economically. The suggestions made for the control of experimenter expectancy effects will not, in all probability, solve the problem of ‘‘experimenter bias.’’ But that does not seem discouraging. In the short time that ‘‘scientific method’’ has been applied to the study of human behavior it has shown itself to be a good and robust teacher. There are things we have learned about human behavior in spite of the possible operation of experimenter expectancy effects. We may do still better by the addition of even imperfect safeguards. Whether we will ever be able to account for all the sources of variance deriving from the experimenter remains a moot question. It does not differ in kind from the question of whether we will ever be able to account for all the sources of variance deriving from the subject. It is the question of whether the concept of indeterminacy applies because it is in the nature of the universe or whether it applies because of how much there is we do not yet know. Both views have been held by
distinguished contributors to our understanding of nature. Thus, how each reader of this volume answers this question for himself may make little difference in terms of what we want to know and will be able to learn. The more meaningful question, perhaps, is whether we can account for increasing proportions of the total variance in experiments by a consideration of experimenter expectancy (and related) effects, and whether we can, by some form of intervention, reduce these sources of error. I owe much to many people who, on many counts, contributed in one way or another to the thinking and to the research that resulted in this book. I cannot thank them all. The authors of a book or a paper read a decade ago—they will forgive me if I express their idea, and in less eloquent language than theirs and without acknowledgment, and for having forgotten that the idea was not mine in the first place. But there are those I can thank, and happily. Donald T. Campbell, Harold B. Pepinsky, and Henry W. Riecken all provided more intellectual stimulation and personal encouragement than I could hope to repay. So many people, I cannot recall them all, have given of their time to make available to me reprints, their own and others, and references they knew would be of interest to me. None has been more generous than Professor William B. Bean, Head of the Department of Internal Medicine, State University of Iowa College of Medicine, and a wise and knowledgeable student of error in science. This book would not have been written nor would it have been worth the writing without the research program that forms its core. This research was supported initially by a grant from the University of North Dakota Faculty Research Committee, and since 1961 by the Division of Social Sciences of the National Science Foundation (G-17685, G-24826, GS-177, GS-714). Without the support of the Foundation much of the research could not have been conducted. This book owes much to that support. 
The research on which much of the book is based was not conducted by me alone. It owes much to the work of my colleagues both senior and junior. Reed Lawson, Edward Halas, and John Gaito were not only coauthors of joint research, but my tutors as well. Kermit Fode, Linda V. Kline, Gordon Persinger, and Ray Mulry collaborated for a period of years on our research—from their undergraduate days through various advanced degrees. Other collaborators included Jack Friedman, Paul Kohn, Patricia Greenfield, Mardell Grothe, and Noel Carota (all collaborators on several occasions) and Neil Friedman, Suzanne Haley, Daniel Kurland, Carl Johnson, Thomas Schill, and Ray White. All these collaborators would surely join me in thanking our far more numerous collaborators of a different kind: the many experimenters and the many subjects upon whose participation our research program was dependent, and whose behavior we were privileged to observe. A number of people kindly read and commented on various portions of the manuscript: Elliot Aronson, Neil Friedman, David Marlowe, Fred Mosteller, Theodore Newcomb, Martin Orne, Karl Weick, and the following members of the latter’s seminar in experimental social psychology: Gordon Fitch, I. Helbig, Michael Langley, Donald Penner, Dan Ray, Marion Reed, Edward Ypma, and Joseph Zuro. Kenneth MacCorquodale and Milton Rosenberg read and improved the entire manuscript. To them my debt is greater still. I want to thank each of these readers for his help, and absolve them of any responsibility for remaining errors and inelegancies of expression. These inelegancies would have been still more considerable had I not had
the benefit of some earlier tutorials from that scholarly, wise, and kind tutor, E. G. Boring. The typing of various parts of this manuscript and the consequent improvements of spelling and punctuation were expertly undertaken by Betty Burnham, Nancy Johnson, Susan Novick, and Kathy Sylva. For endless putting aside of dishes and laundry to listen to an idea or a paragraph and typing it and improving it and for countless other assistances I thank my wife, Mary Lu. For being interested in their father’s homework I thank Roberta, David, and Virginia. R. R.
Preface to the Enlarged Edition
Ten years have passed since the completion of the original edition of Experimenter Effects in Behavioral Research, and a follow-up is in order. Our focus will be on studies of interpersonal expectation effects as these occur both in laboratory settings and in everyday life. The ten year span has seen more than a ten-fold increase in research on interpersonal expectations, and there are now well over 300 studies specifically designed to investigate the occurrence, the importance, and the operating characteristics of interpersonal self-fulfilling prophecies. To summarize all this research in detail would require a book of its own rather than an epilogue and, indeed, someday I hope to write such a book. In our present epilogue there is space only for some summaries and some illustrations. —R. R.
Part I THE NATURE OF EXPERIMENTER EFFECTS
EXPERIMENTER EFFECTS NOT INFLUENCING SUBJECTS’ BEHAVIOR
Chapter 1. The Experimenter as Observer
Chapter 2. Interpretation of Data
Chapter 3. Intentional Error
EXPERIMENTER EFFECTS INFLUENCING SUBJECTS’ BEHAVIOR
Chapter 4. Biosocial Attributes
Chapter 5. Psychosocial Attributes
Chapter 6. Situational Factors
Chapter 7. Experimenter Modeling
Chapter 8. Experimenter Expectancy
1 The Experimenter as Observer
It was the science of astronomy that made clear that the scientific observer was an imperfectly calibrated instrument. In the closing years of the eighteenth century, Maskelyne, the astronomer royal at the Greenwich Observatory, discovered that his assistant, Kinnebrook, was consistently “too slow” in his observation of the movement of stars across the sky. During the next six months, despite Maskelyne’s admonition, Kinnebrook’s recording continued to lag behind Maskelyne’s own recording of the times of stellar transits. Maskelyne then felt forced to discharge Kinnebrook. Some twenty years later, Bessel, the astronomer at Königsberg, studied this incident and concluded that Kinnebrook’s “error” must have been beyond his control. Bessel then compared his own observations of stellar transits with those of other senior astronomers and discovered that differences in observation were the rule, not the exception. Furthermore, he found these differences or “personal equations” to vary over time. These important events in the history of the notation of observer error have been described and documented by Boring (1950).
The Generality of Observer Effects
The plan of the next few pages is to indicate some of the disciplines that have shown a self-conscious awareness of the problem of observer effects. The intent is not to be exhaustive, but rather to be sufficiently representative to establish some consensus with the reader regarding the generality of the phenomena.
The Physical Sciences
Newton did not have much confidence in his own observational ability, and on at least one occasion, the lack of confidence seemed justified. Boring (1962a) noted that Newton did not see and report the absorption lines in the prismatic solar spectrum, which were visible with Newton’s apparatus, because of his theoretically based expectations. Boring put it aptly and beautifully: “To the observing scientist, hypothesis is both friend and enemy” (p. 601). Boring’s suggestion that observer
effects may not be random with respect to the observer’s hypothesis is agreed with by N. R. Hanson (1958) and E. B. Wilson (1952). Another dramatic example of observer errors (errors that were both nonrandom and widespread among observers) has been reported by Rostand (1960). In 1903, Blondlot discovered ‘‘N-rays,’’ which appeared to make reflected light more intense. This phenomenon was viewed by many great observers, including many famous scientists of the day. Only a few were unable to detect the phenomenon, which later was evaluated as at least a colossal compounded observer error if not a downright fraud. Interestingly, as this evaluation became generally known, the effects of ‘‘N-rays’’ could no longer be observed. Discussion of observer effects, especially as they have been operative in the physical sciences, often ends by reference to modern instruments which serve to eliminate observer effects. That these effects may be brought under partial control by mechanical means seems reasonable enough. That instrumentation may not eliminate observer effects must also be considered. If the instrument is a dial, it must be read by a human observer. If the instrument is a computer, the print-out must also be read by an observer. Observer effect, or variability in the reading of scales, has been noted by Yule, writing in the Journal of the Royal Statistical Society (1927). A general error tendency found was the inclination to read scales to quarters of intervals rather than to tenths. Empirical analysis of his own observer effect revealed to Yule his tendency to avoid the number 7 as a final digit and to favor the numbers 8, 9, 0, and 2. That this particular bias was not at all unique to Yule was demonstrated in a still earlier work by Bauch (1913). The digit preference phenomenon has also revealed itself in large sample data collection enterprises. 
In an age census conducted in England and Wales, both males and females showed a preference for the digit 0 and an avoidance of the digit 1 in the units place of their age statements. Yule planned to investigate observer errors in scale reading in more systematic fashion but did not do so. His plan, revealing his awareness of the role of psychological factors in observer errors, was to relate the nature of the error or effect to the nature of the observer.
The Biological Sciences
The counting of blood cells is a routine and important procedure in biological research and in the practice of medicine. For many years, the standard textbooks published data setting the “maximum allowable discrepancy” between blood cell counts of successive samples of blood. Then, in 1940, Berkson, Magath, and Hurn reported a way of counting blood cells more accurately than was ordinarily possible. Each blood cell was pierced by a stylus a single time, and each piercing was recorded electrically. After collecting many series of blood samples, the investigators were led to the inescapable conclusion that laboratory technicians had for years routinely reported blood cell counts that could have agreed with one another so well only 15 to 34 percent of the time. “Published studies involving erythrocyte counts, as well as standard texts, disallow discrepancies between successive counts so small that they would in most instances necessarily be exceeded as a matter of chance if counts were accurately made and faithfully recorded” (p. 315). The story has many similarities to the story of the “N-rays.” Observations were made by many observers, over a long period of time, which were consistent with the observers’ expectations but inconsistent with the realities of nature as subsequently defined.
In the field of agricultural statistics, observer effects have been well demonstrated by Cochran and Watson (1936). These investigators enlisted the aid of 12 experienced observers who believed themselves able to select young plants whose heights would vary in truly random fashion. When actually put to the task, it was found that observers selected plants or shoots that were neither representative nor random. Because these observer errors were not randomly distributed around the ‘‘true’’ values, the errors were appropriately defined as biased. Bias, it was found, did not remain constant from sampling unit to sampling unit. In the observation of shoot heights, as in the observation of stellar transits, observer effects were not easily predictable. In the field of experimental genetics, Fisher (1936) cites Dr. J. Rasmussen, who mentioned that in experimental genetics he, as well as his assistants, showed an unconscious bias to select the best plants first for observation. This type of observer effect, or more specifically ‘‘bias,’’ like that shown in the selection of shoot heights and that shown to occur in other situations by Yule and Kendall (1950), has led these authors to propose that man may simply be unable to select random sets of events to be observed without such external aids as tables of random numbers. Perhaps, the most important case of observer effects in the history of experimental genetics is the one involving the work of Gregor Mendel. Mendel, it will be recalled, expected that when hybrid pea plants were self-fertilized 75 percent of the offspring would show the dominant phenotype and 25 percent would show the recessive phenotype. That, almost exactly, was what Mendel’s observations subsequently showed. Considering the relatively small sample sizes reported, Fisher (1936), in a closely reasoned logical and statistical analysis, showed that Mendel could not reasonably have obtained the data he reported. The data were just too good to be likely. 
We may at least hypothesize the existence of an observer effect or, because of its directionality, a bias, in either Mendel, his assistant, or both. If the biased error was due to the work of an assistant, the case does not stand alone (Shapiro, 1959). Alfred Binet, of intelligence testing fame, working then in the area of physical anthropology, was forced to discharge a research assistant who made errors in cephalometric measurements (Wolf, 1961). These errors, too, were not randomly distributed but rather were in the direction of the hypothesis. These errors, like those possibly committed by Mendel’s assistant, were not necessarily errors of observation. In any case, we can see that the history of science has often repeated the Kinnebrook episode. In a treatise on the octopus, Lane (1960) asserts that scientists may “equate what they think they see, and sometimes what they want to see, with what actually happens” (p. 85). W. B. Bean (1953), a thoughtful student of the role of error in science, presents the following data: In 1901, Leser claimed an association of cherry angioma, an easily observable skin condition, with malignant disease. Leser’s first assistant, Müller, found that 49 of 50 cancer patients had cherry angiomas, but among a control series of 300 noncancer patients, he found only a handful. On the basis of theoretical considerations and especially the inability to replicate this result, it appeared most likely to have been a case of observer error. Bean wondered, “Was the wish father to the thought, was Müller a too avid helper or an unbelievably bad observer?” (p. 241). Bean has also called attention to the work of Feinstein (1960) and M. L. Johnson (1953). It was the former who pointed out the observer error involved in the use of the stethoscope in cardiac diagnostics. Feinstein asked that physicians as well as their
stethoscopes be calibrated. Johnson cited the case of a radiologist who saw a button “on a vest” rather than in the throat where it lodged because, presumably, buttons occur more frequently on vests than in throats. In experiments on observer processes, Johnson found medical students observing quite inaccurately when presented with two x-rays of hands for study. Johnson, in this paper entitled “Seeing’s Believing,” was prompted to say, “Our assumptions define and limit what we see, i.e., we tend to see things in such a way that they will fit in with our assumptions even if this involves distortion or omission. We therefore may invert our title and say ‘Believing Is Seeing’” (p. 79). We might find it instructive to consider the data bearing on the question of the reliability of medical or psychological diagnoses. There is ample evidence that the diagnostic process has great unreliability, but this phenomenon does not quite fit our conception of observer error. In too many cases where the unreliability is great, the defining characteristics of the classes to which assignment is to be made are all too vague. We may therefore have interpretive errors or perhaps not even that. When nosological categories are carefully defined and objective criteria of inclusion are available, then, when error occurs, we may more legitimately regard it as observer error. Bean (1948) found in nutritional examinations that experienced physicians disagreed in the diagnosis of nutritional deficiency even when objective standards were available. Speaking of observer errors, and of others as well, Bean stated, “Our aim must not be to deny error, but to learn from it, avoiding the stability it gets from repetition” (p. 54).
The Behavioral Sciences
Harry Stack Sullivan has called attention to the problem of observer effect on the social sciences generally and on the “social science of psychiatry” in particular (1936–37).
More than most investigators, he was aware of the extent to which the observer entered into transaction with the object of the observation. Sullivan, of course, was not alone in this awareness, an awareness eloquently expressed by Wirth (1936), and somewhat later by Bakan (1962), Colby (1960), and Kubie (1956). The psychotherapy relationship may be viewed appropriately as a data-collecting situation with the therapist in the position of observing his patient’s responses. Both the lore of the practicing clinician and the evidence of more formal investigations point to the omnipresent effects of the clinical observer. Events occurring in the clinical interaction are often unobserved or at least unreported by the clinician. Events not occurring in the clinical interaction are sometimes reported erroneously by the clinician. And often, the errors may be shown to be related to the personal characteristics of the clinical observer, particularly to his personal ‘‘blind spots’’ (Cutler, 1958; Garfield & Affleck, 1960; Levitt, 1959; Sarason, 1951; Strupp, 1959; Wallach & Strupp, 1960; Zirkle, 1959). The observation of planaria. Although observer effects may be less obvious in a laboratory than in a clinical setting, it is nevertheless clear that they do occur. A well-designed experiment by Cordaro and Ison (1963) nicely illustrates the fact. The behavior to be observed was the number of head turns and body contractions made by planaria (flatworms placed low on the phylogenetic scale). For half the worms, seven observers were led to expect a very high incidence of turning and contracting. For the remaining worms, the same observers were led to expect a very
low incidence of turning and contracting. The worms observed under the two conditions of expectation were, of course, essentially identical. Results of this phase of the experiment showed that observers reported twice as many head turns and three times as many body contractions when their expectation was for high rates of response as when their expectation was for low rates of response. The basic plan of this experiment was repeated employing a new set of ten observers. This time, however, half the experimenters were to observe only “high-response-producing” worms, and the remaining observers were given only “low-response-producing” worms. Again there was no real difference between the two “types” of worms. The results of this phase of the experiment found nearly five times as many head turns and twenty times as many contractions reported by the observers who expected high levels of responding as reported by observers expecting low levels of responding. The observers employed in the experiments cited were undergraduate college students enrolled in an introductory psychology course. It may be, of course, that the degree of observer bias shown would not be found among more experienced observers, a possibility pointed out by Shinkman and Kornblith (1965) and by Cordaro and Ison themselves. Some data are available which have a bearing on this question. In an experiment investigating “natural” individual differences among workers interested in planaria, it was found that differences among these experimenters in the number of turning, contraction, and other responses obtained were for the most part statistically significant (Rosenthal & Halas, 1962). This experiment differed from that of Cordaro and Ison in two ways. First, the experimenters were not given any false expectancies but were engaged in “actual” research on behavior modification in planaria. Second, all eight of these experimenters were more experienced than any of those employed by Cordaro and Ison.
Half had master’s degrees at the time, and the set of eight experimenters averaged just under three publications each. At least six of the experimenters are still active in psychological research and four have Ph.D.s. As might be expected, the absolute magnitudes of differences among experimenters in numbers of responses observed were not so large as those found by Cordaro and Ison. In no case did one observer report twice as many turns as another, although the differences obtained were statistically significant more often than not. Observations of body contractions, however, were subject to surprisingly large observer effects. The largest discrepancy occurred when one observer reported nearly seven times as many contractions as his comparison observer. The smallest discrepancy was one in which an observer reported nearly twice as many contractions as his comparison observer. Each of the results presented was based on a minimum of 900 trials (observations) per observer. It seems reasonable to conclude that even experienced observers may differ in their perception of the behavior of planaria. A somewhat different but perhaps more serious problem, however, is that in which observer effects interact with experimental conditions (Rosenthal & Halas, 1962). Table 1-1 illustrates just this effect for those two experimenters for whom the most complete data were available. These experimenters were among the most academically advanced and the most experienced in research. Each experimenter tried to condition six planaria to respond by turning or contracting to a light which had been paired with an electric shock. As a
Table 1–1 Two Experiments in the Learning of Turning in Worms

                     Experimenter I              Experimenter II
                  Experimental   Control      Experimental   Control
First block           10.5        11.5             9.0          8.0
Second block          10.0         9.0            11.0         11.0
Third block           11.5         9.5            15.0         10.5
Fourth block          12.5         9.0            15.5         11.0
Fifth block            9.0        12.0            17.0         11.0
Sixth block            8.0         9.5            16.0         11.5
Means                 10.2        10.1            13.9         10.5
p (difference)              NS                          .02
control procedure, each experimenter had six other planaria to which the light was administered without the shock. The results of Table 1-1 show the mean number of turns for six blocks of 25 trials each. One experimenter obtained “conditioning” of turning responses, and one did not. Experimenter II not only obtained more turning in his experimental group than in his control group, but his experimental animals showed an increase in turning in each subsequent block. The correlation (rho) between number of turns per block and block order was .94, p = .02. However, the control group for this experiment also showed a tendency to turn more often on later trials (rho = .77, p = .10), although the rate of increase was much more gradual. Table 1-2 shows the analogous data for body contractions. Once again there is no difference between the mean responses of the experimental and control groups for experimenter I, but there is a surprise in the difference between the mean number of contractions observed for the experimental and control groups by experimenter II. This time planaria in the control group responded more than planaria in the experimental group. This reversal of the results of experimenter II is the more surprising as turning and contracting responses have been found to be so well correlated that they are commonly added together to form a “total response” score. For these experienced experimenters, therefore, it can be concluded that there are individual differences in the extent to which behavior modifications in planaria are observed and that the particular differences found are affected by the specific type of behavior being observed.
Table 1–2 Two Experiments in the Learning of Contraction in Worms

                     Experimenter I              Experimenter II
                  Experimental   Control      Experimental   Control
First block            2.0         0.5             0.5          2.0
Second block           0.5         1.0             0.0          0.5
Third block            0.0         0.0             0.5          1.0
Fourth block           1.0         1.0             0.5          2.0
Fifth block            1.0         1.5             0.5          1.5
Sixth block            1.0         2.0             1.0          2.0
Means                  0.9         1.0             0.5          1.5
p (difference)              NS                          .005
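The rank correlation reported for experimenter II's experimental animals can be rechecked from the block means printed in Table 1-1. The following is a minimal sketch, assuming that rho is Spearman's rank correlation computed as the Pearson correlation of ranks (with tied values given the mean of their ranks). The function names and the dictionary layout are illustrative and do not come from the original study, and because the tabled values are block means, other entries need not reproduce the reported coefficients exactly.

```python
from statistics import mean

# Block means of turning responses, transcribed from Table 1-1.
TURNS = {
    ("I", "experimental"): [10.5, 10.0, 11.5, 12.5, 9.0, 8.0],
    ("I", "control"): [11.5, 9.0, 9.5, 9.0, 12.0, 9.5],
    ("II", "experimental"): [9.0, 11.0, 15.0, 15.5, 17.0, 16.0],
    ("II", "control"): [8.0, 11.0, 10.5, 11.0, 11.0, 11.5],
}


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


blocks = [1, 2, 3, 4, 5, 6]
rho_exp_ii = spearman_rho(blocks, TURNS[("II", "experimental")])
mean_exp_ii = mean(TURNS[("II", "experimental")])
print(round(rho_exp_ii, 2), round(mean_exp_ii, 1))
```

Only the fifth and sixth blocks are out of rank order for this group, which is why the coefficient falls just short of 1.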
Although the foregoing data have been cited as evidence of observer effects, alternative explanations are possible. It could be that the effects arose by chance. That does not seem likely, however, since the animals were assigned to experimenters and to experimental conditions at random. The likelihood of the experimenter’s effect being due to chance is given by the p values of Tables 1-1 and 1-2, and these p values are low. It could also be that the planaria behavior was correctly observed but incorrectly recorded. Recording errors of such magnitude, however, are too rare to serve as likely explanations, as will be shown in the next section of this chapter. Intentional errors are generally only a remote possibility and, on the basis of personal acquaintance, for these particular experimenters, a virtual impossibility. One remaining possibility is that in some way the experimenters’ behavior affected the behavior of the planaria. This can be only a speculation, but it seems at least possible that one or the other of the two experimenters unintentionally treated his animals differentially as a function of whether they were in the treatment or control group. It cannot be assumed that experimenter II showed such a difference while experimenter I did not. It could as well be argued that, except for the programmed differences in treatment, experimenter II treated his animals identically and that the differences he obtained are those attributable only to the experimental conditions. Differential behavior toward the animals of the two groups of experimenter I might have ‘‘improperly’’ reduced the ‘‘true’’ difference between the experimental and control animals. Research with rabbits, as with planaria, has shown significant effects associated with the particular experimenter employed. Brogden (1962) found that inexperienced experimenters required more trials in which to condition rabbits. 
As in the planaria research, the aim was to elicit an avoidance reaction to light which had been paired with an electric shock. Unlike the situation for the planaria research, however, the experimenter effect disappeared with further practice on the part of the inexperienced experimenters. Neither in the case of the rabbits nor in the case of the planaria can it be specified just what the experimenters did differently that could have led to such different records of animal learning. In a later chapter (Chapter 8) there will be occasion to discuss this problem again. Recording errors. As experimenters observe the behavior of their subjects, their observations must in some way be recorded. It comes as no surprise that errors of recording have been demonstrated and that these errors are not always self-canceling. A self-canceling set of errors is one in which errors inflating a category are exactly offset by errors deflating that category. If an observer of the turning of worms records three turns that did not occur and fails to record three turns that did occur, he has committed six errors which have canceled each other out. Kennedy and Uphoff (1939) performed a careful study of recording errors in experiments in extrasensory perception. Briefly, the task for the observers was to record the investigator’s guesses as to the nature of the symbol being “transmitted” by the observer. The symbols employed were the standard ones used in such research and included circles, squares, stars, crosses, and wavy lines. Each trial consisted of 25 cards, five of each of the five symbols. Because the guesses made for the observers had been predetermined, it was possible to count the number of recording errors. A total of 28 observers recorded a grand total of 11,125 guesses, of which 126, or 1.13 percent, were misrecorded. All observers made at least one error (one observer
made 16), and the modal number of errors per observer was four. Some of the errors committed increased the telepathy scores (45.2 percent), some decreased it (21.4 percent), and some had no effect (33.4 percent). There was, then, a general tendency to make recording errors that increased the telepathy scores. Kennedy and Uphoff knew which observers had favorable attitudes to extrasensory perception and which had unfavorable attitudes. The analysis of errors by believers and disbelievers showed that each type of observer tended to err in the direction favorable to his attitude, though these biased errors were quite small. Believers in telepathy made 71.5 percent more errors increasing telepathy scores than did disbelievers. Disbelievers made 100 percent more errors decreasing the telepathy scores than did believers. Very similar findings have been reported by Sheffield and Kaufman (1952). In an experiment in psychokinesis, they filmed the actual fall of the dice which subjects were trying to influence. They found subjects believing in the phenomenon to make more tallying errors in favor of the hypothesis. Subjects who disbelieved made more of the opposite type of tallying error. Recording, as well as computational, errors by experimenters have also been studied in an experiment on the perception of people (Rosenthal, Friedman, Johnson, Fode, Schill, White, & Vikan, 1964). In that experiment, each subject wrote on a small writing pad his rating of the degree of success or failure experienced by persons pictured in photographs. The 30 experimenters of this study transcribed these ratings to a master data sheet. A comparison of the experimenters’ 3,000 transcriptions with their subjects’ recordings revealed that only 20 errors had occurred. The 0.67 percent rate of misrecording approached the 1 percent rate just exceeded (1.13 percent) by Kennedy and Uphoff. 
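The misrecording rates quoted above follow directly from the reported counts, and the idea of a self-canceling set of errors can be made concrete by separating gross from net error. The sketch below uses only figures given in the text. The helper names are illustrative, and the per-direction counts are back-calculated from the published percentages, so they are approximations rather than reported values.

```python
def percent(part, whole):
    """Express part as a percentage of whole."""
    return 100.0 * part / whole


def gross_and_net(spurious, missed):
    """Return (gross, net) error for one recording category.

    spurious: events recorded that did not occur (inflating errors)
    missed:   events that occurred but were not recorded (deflating errors)
    A net error of zero means the set of errors was self-canceling.
    """
    return spurious + missed, spurious - missed


# Kennedy and Uphoff: 126 misrecordings among 11,125 recorded guesses.
kennedy_uphoff_rate = percent(126, 11_125)

# Person perception study: 20 errors among 3,000 transcriptions.
person_perception_rate = percent(20, 3_000)

# Reported share of the 126 errors by their effect on telepathy scores.
direction_share = {"increased": 45.2, "decreased": 21.4, "no_effect": 33.4}

# Approximate counts implied by those percentages (an inference, not reported data).
approx_counts = {k: round(126 * v / 100) for k, v in direction_share.items()}

# The worm-turning example: three spurious turns and three missed turns.
worm_example = gross_and_net(spurious=3, missed=3)

print(round(kennedy_uphoff_rate, 2), round(person_perception_rate, 2))
print(approx_counts, worm_example)
```

The gross figure is what a count of misrecordings measures, whereas only the net figure distorts a category total. That is why the directional breakdown of the 126 errors matters more than their overall rate.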
Probably because each of the experimenters of the experiment in person perception made only one fourth as many observations as did Kennedy and Uphoff’s observers, 18 experimenters made no recording errors whatever. Some experimenters had been given an expectation that they would obtain high ratings of the photos from their subjects, whereas some experimenters had been given the opposite expectation. Nine of the 12 experimenters who made any recording errors erred in the direction of their expectation, and their errors tended to be larger (p = .05). The computational task for the experimenters in the study under discussion was simply to sum the 20 ratings given by each of their five subjects. Of the 30 experimenters, 18 made a computational error, and 12 of these erred in the direction of their expectation. Those experimenters who were more likely to make computational errors in the direction of their hypothesis also tended to make larger computational errors. In this same experiment, all subjects rated their experimenters on the variable of “honesty” during the conduct of the experiment. This was a very impressionistic rating since the experimenters could not actually have been “dishonest” even if they had been so inclined. During the experiment, the co-investigators had all experimenters under surveillance (a fact apparent to all experimenters). In spite of the subjectivity of the ratings of the experimenters’ honesty made by the subjects, these ratings predicted better than chance (p = .02) whether the experimenter would subsequently favor his hypothesis in the making of computational errors. It should be noted, however, that all experimenters were rated as being quite honest: +8.5 (extremely honest) was the mean rating assigned to experimenters who did not err in
the direction of their expectation; +6.8 (moderately to highly honest) was the mean rating of experimenters who did err in the direction of their hypothesis. In the same experiment, it was possible to relate the occurrence of recording errors to the occurrence of computational errors. The correlation of .48 (p = .01) showed that experimenters who erred in data transcription tended to err in data processing. Somewhat surprisingly, however, those experimenters who erred in the direction of their expectancy in their recording errors were not any more likely (rho = .05) to err in the same direction in their computational errors. The making of numerical errors seemed, then, to be a consistent characteristic, but directionality of error vis-à-vis expectation did not. We should note here that the overall effects of both recording and computational errors on the grand means of the different treatment conditions of the experiment reported were negligible. An occasional experimenter did have some real effect on the data he obtained; an effect that, at least in principle, could be serious if an entire experiment depended on an experimenter who was prone to err numerically. In a recent experiment conducted by John Laszlo, three experimenters conducted the same basic experiment in person perception employing a total of 64 subjects. In this study, all three experimenters made computational errors. For the most accurate experimenter, 6 percent of his computations were in error. The other experimenters erred 22 and 26 percent of the time. The magnitudes of the errors were quite small, but for all three experimenters, a majority (75 percent overall) of the errors tended to favor the experimental hypothesis, though the frequency of these biased errors did not reach statistical significance.
In spite of the apparent regularity of the occurrence of such errors, little attention has been given to real or alleged numerical errors in the scientific literature of psychology (Hanley & Rokeach, 1956; Wolins, 1962).
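The kind of audit described above, re-adding each subject's ratings and noting whether a discrepancy favors the experimenter's expectation, can be sketched in a few lines. This is only an illustrative sketch: the function name `audit_sums`, the `Audit` record, and the coding of the expectation as +1 (high ratings expected) or -1 (low ratings expected) are assumptions made here, not part of the original studies.

```python
from dataclasses import dataclass


@dataclass
class Audit:
    n_errors: int              # reported totals that differ from the true sum
    n_toward_expectation: int  # errors whose sign matches the expectation


def audit_sums(ratings_by_subject, reported_totals, expected_direction):
    """Recompute each subject's total and classify computational errors.

    expected_direction is a hypothetical coding: +1 if the experimenter
    expected high ratings, -1 if low ratings were expected.
    """
    n_errors = 0
    n_toward = 0
    for ratings, reported in zip(ratings_by_subject, reported_totals):
        discrepancy = reported - sum(ratings)
        if discrepancy != 0:
            n_errors += 1
            # An error "favors" the expectation when it shifts the total
            # in the expected direction.
            if discrepancy * expected_direction > 0:
                n_toward += 1
    return Audit(n_errors, n_toward)
```

Applied to one experimenter's records, the two counts give the raw material for the tallies reported in the text (how many experimenters erred at all, and how many erred in the direction of their expectation).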
Conceptualization of Observer Effects
In the mapping out of the generality of observer effects, we have had only broad hints at certain definitions and differentiations which must now be made more explicit. Later on, in Part III, we will consider these matters in greater detail. By ‘‘observer effects’’ or ‘‘observer error’’ we have referred to overstatement or understatement of some criterion value. When two observers disagree in an observation, each may be said to err with respect to the other. Both may be said to err with respect to some third observation which may, for various reasons, be a more or less usefully employed criterion. Given a population of observations, we may choose to define some central value (such as the mean or mode) as the ‘‘true’’ value and regard all observations not falling at that value as being more or less in error as a direct function of their distance from the central value. Observer errors or effects may be distinguished from observer ‘‘bias’’ by the fact that observer errors are randomly distributed around a ‘‘true’’ or ‘‘criterion’’ value. Biased observations tend to be consistently too high or too low and may bear some relation to some characteristics of the observer (Roe, 1961), the observation situation (Pearson, 1902), or both. In considering the act or sequence of acts constituting the observation in the scientific enterprise, we may distinguish conceptually among locations of error or bias. The error of ‘‘apprehending’’ occurs when there is some sort of misrecording
between the event observed and the observer of the event. We may include here such diverse sources of apprehending error as differing locations of observers (Gillispie, 1960) or angles of observation (George, 1938), imperfections in the sensory apparatus, central relay systems, cortical projection areas, and the like. The error of recording may be distinguished conceptually from the apprehending error. In the case of recording error, we assume first an errorless act of apprehending followed by a transcription of the event (to paper, to the ear of another observer, or to another instrument) which differs from the event as correctly apprehended. In actual practice, of course, when an event or observation is recorded in error with respect to some criterion, we cannot locate the error as having occurred either in apprehending, in transcribing, or in both processes. There is no certain method for isolating an apprehending error unconfounded with a recording error, though introspective reports may be suggestive. Computational errors are more clearly distinguishable from the foregoing errors since they involve the incorrect manipulation of recorded events. Incorrectness is usually defined here by the formal rules of arithmetical operations. In some of the cases of ‘‘observer error,’’ the criterion or ‘‘true’’ value of the observation is so vague and ephemeral that we cannot properly speak of errors of apprehending or recording (or computation). Such would be the case, for example, with psychiatric classification. When ‘‘error’’ occurs in this situation we may more appropriately speak of ‘‘error of interpretation.’’ Interpretation effects will be discussed in the next chapter. Finally, throughout this chapter the assumption has been made that the classes of errors discussed occurred without the intent of the observer. Those occasions when intent is involved in the production of an erring observation will be discussed in Chapter 3.
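The distinction drawn in this section between observer error (deviations scattered randomly around a criterion value) and observer bias (deviations consistently too high or too low) can be made concrete with a small sketch. The operationalization is an assumption made here for illustration: bias is taken as the mean signed deviation from the criterion, and random error as the spread of the deviations. The function name `describe_observer` is hypothetical.

```python
from statistics import mean, pstdev


def describe_observer(observations, criterion):
    """Summarize one observer's readings against a criterion value.

    bias   : mean signed deviation (near 0 for an unbiased observer)
    spread : population standard deviation of the deviations
             (the random-error component)
    """
    deviations = [x - criterion for x in observations]
    return {"bias": mean(deviations), "spread": pstdev(deviations)}


# Hypothetical readings of a quantity whose criterion value is 10.
erring_observer = describe_observer([9, 11, 10, 10], criterion=10)
biased_observer = describe_observer([12, 12, 12, 12], criterion=10)
```

Under these assumptions the first observer errs without bias (deviations cancel), while the second is perfectly consistent yet biased upward. Note that the sketch presupposes a usable criterion, which, as the text observes, is not always available.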
The Control of Observer Effect
A powerful, necessary, though insufficient, tool for the control of observer error is our awareness of the phenomenon. The role of various mechanical apprehenders and recorders in the reduction of observer error has been noted earlier. As Boring (1950) pointed out, these mechanizations do not replace the human observer; rather, they postpone human observation to some other, more convenient time and circumstance of reapprehension and rerecording. If mechanization reduces observer error—and it very likely does—there remain still subsequent errors of ‘‘re’’-observation. Yule (1927) was relatively optimistic that observer training could eliminate observer error. This optimism seemed unshared by Fisher (1936) in citing Rasmussen. The most critical control of observer error is probably woven into the fabric of science by the tradition of replication. Frequent replication of observations serves to establish the definition of observer errors. It does not, however, eliminate the problem, since replicated observations made under similar conditions of anticipation, instrumentation, and psychological climate may, by virtue of their intercorrelation, all be in error with respect to some external criterion (Pearson, 1902). An excellent example of this, as mentioned, is Rostand’s (1960) discussion of the infamous N-rays. Perhaps the great contribution of the skeptic, the disbeliever, in any given scientific observation is the likelihood that his anticipation, psychological
climate, and even instrumentation may differ enough so that his observation will be more an independent one. Error, in the sense of discrepancy, will then have a greater chance of being revealed. Which of two contradictory sets of observations will be regarded as error-free depends on sets of criteria subsequently adopted by the assessing community.
2 Interpretation of Data
Identical observations are often interpreted differently by different scientists, and that fact and its implications are the subject of this chapter. Interpretation effects are most simply defined as any difference in interpretations. The difference may be between two or more interpreters, or an interpreter and such a generalized interpreter as an established theory or an ‘‘accepted’’ interpretation of a cumulative series of studies. As in the observer effect, the interpreter effect, or difference, does not necessarily imply a unidirectional phenomenon. When observations are nonrandomly distributed around a true value, we refer to them as ‘‘biased observations.’’ Similarly, when interpretations do not vary randomly—and usually they do not—we may refer to them as ‘‘biased.’’ Note that we do not thereby imply that the biased interpreter is ‘‘wrong’’ with respect to some notion of ‘‘true interpretation,’’ but only that his interpretation is predictable. It does not seem as reasonable to postulate the central value interpretation as the true interpretation as it does to postulate the central value observation as the true observation. The distinction between an observation itself and the interpretation of an observation is not always simple. Some observations require a greater component of interpretation than others. If we observe the behavior of worms there seems to be less interpretation required to observe whether there is a worm present than to observe whether the worm is completely immobile. If we choose to observe a very small worm, however, even the observation of its presence or absence may require a larger interpretive element. 
Interpretations or constructions of data have an enormous range of generality, from the interpretation of a speck as worm or not-worm, through the interpretation of a person’s speech as schizophrenic or not schizophrenic, to the interpretation of measurements of the speed of light as damaging to or irrelevant to Einstein’s theory of relativity. At the lowest level of generality, differences in interpretation could easily be regarded as observer effects. At the highest level of generality, differences in interpretation are nothing more than differences in theoretical positions. Even at the higher levels of generality, however, differences in interpretation may affect the accuracy of observations. This can occur in two ways. First, a given theory or interpretive framework may affect the perceptual process in such a way as to increase
errors of observation in the direction of greater consistency with the theory. Such effects are clearly implied by some of the evidences presented in the last chapter and by the extensive literature on need-determined perception (Dember, 1960; see also Campbell, 1958; Sanford, 1936; Stephens, 1936; Zillig, 1928). Second, a given interpretive framework may function to keep ‘‘off the market’’ data that may weaken the tenability of the theory. Such underrepresentation of data contradictory to prevailing theories would bias the ‘‘true’’ value of an observation. Since the ‘‘true’’ value of an observation was defined in terms of some central value of available observations, it seems obvious that by ignoring observations at variance with the existing central value that value will become more and more stable statistically and psychologically. If in the history of science the proponents of a dominant theory have often thus shepherded the current central or true observation into the direction supporting their theory, they have also often been responsible for the fact that observations were being made at all. Theoretical biases are mixed blessings. They are selectively attentive to data that if completely unbiased by theory would not have been collected at all.
The Physical Sciences
In 1887, Michelson and Morley conducted their famous experiment on the speed of light. Their report showed that whether the light signals were sent out in the direction of the earth’s motion or not, the speed was the same. It is said that this counterintuitive result was the stimulus for Einstein to develop his theory of relativity in 1905. The Michelson–Morley experiment was important to relativity theory, and, in fact, the result seemed required by it. But there are two facts that must be added. First, according to Einstein, the Michelson–Morley experiment had nothing to do with his original formulation of relativity theory. Second, the results of the Michelson–Morley experiment were probably in error, and there did appear to be an ‘‘ether drift.’’ Defined by a difference in the speed of light as a function of the signal’s direction in relation to the earth’s motion, this ‘‘ether drift’’ could have jeopardized relativity theory. That it did not illustrates interpreter effects in science. Michael Polanyi (1958) and Arthur Koestler (1964) have given the details. In 1902, some 15 years after the Michelson–Morley experiment, W. M. Hicks showed some ether drift in their original observations. Then, from 1902 to 1926, D. C. Miller repeated the experiment with improved instrumentation thousands of times and consistently obtained a drift of from eight to nine kilometers per second. Still later, W. Kantor, using still more elegant instrumentation, also showed that the speed of light did depend on the motion of the observer. So well established was relativity theory that Miller’s work was essentially ignored (though that was difficult, since he presented his complete evidence in 1925 to the American Physical Society, of which he was then president). It is true, as Polanyi tells us, that there was other evidence from different workers for the absence of ether drift as required by relativity theory.
But that evidence was not available when Miller presented his data nor for the many years before that he had been making his observations. How do we decide whether there really was an artifact in Miller’s work, so that people did well to ignore it? Is there a possibility that some physicist, had he been taught to take apparently sound
data seriously, might, because of these inconsistent data, have so modified relativity theory that it would be more powerful by far? Such questions, if they are answerable at all, E. G. Boring would refer to history for verdict. Miller’s data were ignored but they were available. Sometimes the effects of interpretation of data are such as to keep those data unavailable. Bernard Barber (1961) tells of some well-known instances. One such was Lord Kelvin’s interpretation of Roentgen’s X-rays as a hoax, a kind of N-ray phenomenon in reverse. Several instances of workers’ inability to publish papers that seemed to the judges to be paradoxical were also documented. The most interesting of these, because it represents a kind of controlled experiment, was the case of Lord Rayleigh. In 1886, he submitted a paper entitled ‘‘An Experiment to Show That a Divided Electric Current May Be Greater in Both Branches Than in the Mains.’’ He was, at the time, already well known. In some way his name became detached from the paper, however, and it was rejected. Shortly afterward the name somehow became collated with the paper, which was then found to have sufficient merit for acceptance. But perhaps the most useful illustration, for its recency and for its charm in the telling, is the case of Michael Polanyi himself and his theory of the adsorption (adhesion) of gases on solids (Polanyi, 1963). In 1914, he first published his theory and within a few years had adduced convincing experimental evidence on its behalf. But the then current conception of atomic forces made his theory unacceptable. Asked to state his position publicly, Polanyi was chastised by Einstein for showing a ‘‘total disregard’’ of what was then ‘‘known’’ about the structure of matter. Said Polanyi, ‘‘Professionally, I survived the occasion only by the skin of my teeth’’ (p. 1011). Polanyi, of course, was subsequently credited with having been correct. 
His analysis of the role of that orthodoxy in science which kept his evidence from being considered is remarkable for its balance, objectivity, and the lack of bitterness, a bitterness that characterized Planck’s reaction to the resistance he encountered (Barber, 1961). Polanyi felt that the rejection of his theory and his evidence was unavoidable and even proper given the state of knowledge at the time. Although recognizing the danger of orthodoxy in repressing contradictory evidence, he points out that the journals could easily become flooded with nonsense in the face of a too great tolerance of dissent. This moderate view of orthodoxy is much the same as that expressed by Florian Znaniecki in his classic work The Social Role of the Man of Knowledge (1940).
The Biological Sciences
Mosteller put it well: ‘‘. . . perhaps sometimes the data are not ready to be looked at— and it is not that the anomalies aren’t at all noticed, but that they aren’t discussed much because no one knows just what to say’’ (Personal communication, 1964). Perhaps that is the reason why Mendel’s now classic monograph, Experiments in Plant Hybridization, first presented in 1865, had to wait to become important until de Vries, Correns, and Tschermak found something to say about it, all independently of each other, and all in 1900. Perhaps, too, less was found to say about Mendel’s work because of his, for that time odd, applications of mathematics to botany, and because of Mendel’s relative lack of scientific stature (Barber, 1961). Even after people found
things to say about Mendel’s data, however, no one looked at it closely enough because it was so easy to interpret in accordance with each one’s own theoretical orientation. Fisher (1936) put it: ‘‘Each generation, perhaps, found in Mendel’s paper only what it expected to find; . . . Each generation, therefore, ignored what did not confirm its own expectations’’ (p. 137). Mendel’s case is not unique in the history of biology. Darwin, Lister, Pasteur, Semmelweis, and their observations tended to be ignored or rejected, and these are only some of the better-known cases. They and others, less well known, have been chronicled by Barber (1961), Fell (1960), Koestler (1964), and Zirkle (1960). Sometimes in science the situation is not that there is too little that can be said about the data but rather too much. A number of equally plausible interpretations are available, and that leads neither to rejecting the observations nor to ignoring them. It leads to an assimilation of the data to the various theoretical positions that can make use of them. Wolf (1959) gives us a good example based on Morris’ data which found London tramway motormen with a higher incidence of coronary heart disease than tramway conductors. (The data may, for our purpose, be regarded as free from observer effect.) The original interpretation of these data was in terms of the relationship between sedentary occupations and heart disease, the motorman sitting while performing his task, the conductor moving about more. One alternative interpretation offered by Wolf was that motormen, because of their sedentary work, might be gaining weight faster and that it was the weight gain which led to a higher incidence of heart disease. Wolf presented the additional interpretation that the lessened social interaction with other people required by the motorman’s job when compared to the conductor’s job might also be the critical variable.
Other interpretations are of course possible, including those which postulate that individuals prone to heart disease, because of their biological or psychological makeup, tend to select or be assigned to the front end of the trolley. Here, then, are alternative interpretations whose relative tenabilities could easily be established by further observation. The initial data were immediately important theoretically (and practically) because there were theories available that could make sense of the observations and could be tested further by performing the experiments implied by the various interpretations of the data. When the experiments are well designed and well executed, the experimenter has a better chance to ‘‘. . . escape from his own preferences in interpreting his results’’ (Boring, 1959; p. 3).
The Behavioral Sciences
In the example of interpretation differences just given, it was assumed that there were no observer errors. Who is a motorman and who is a conductor seemed an easy observation on which to achieve consensus. The presence or absence of heart disease, however, is a somewhat more equivocal judgment (Feinstein, 1960). We are hard put to decide whether diagnostician differences are observer effects or interpretation effects. If we may assume that cardiologists hear the same ‘‘lub-dub’’ through their stethoscopes and see the same tracings of the electrocardiogram, we would be inclined to regard diagnostic variations as differences of interpretation. In the applied behavioral sciences of psychiatry and clinical psychology, the diagnosis or categorization of behavior is a common enterprise. Differences in the
interpretation of behavioral data are well illustrated by differences in diagnoses. The magnitude of such differences has been reported by Star (1950). During the Second World War, psychiatric examiners interviewed army recruits for the purpose of rejecting any who might be too severely disturbed to function as soldiers. The most extreme difference in the rate of rejection found one induction center rejecting 100 times more recruits than another. Although this magnitude of difference is unusual, the generality of differences in the interpretation of abnormal behavior seems well established (Hyman, Cobb, Feldman, Hart, & Stember, 1954). In the diagnosis of abnormal behavior, the large effects of interpreters are probably due to the vagueness of the defining characteristics of the various diagnostic categories. Of itself this would increase unreliability. If this source of unreliability were the only one, however, we would expect interpreter differences to be unbiased or unpredictable. But that is not the case. Robinson and Cohen (1954), for example, found that there were significant biases in the psychological evaluations of 30 patients by 3 psychological examiners. The authors related the biases in evaluation to the personality differences among the examiners, a relationship postulated by Henry Murray in 1937 and supported in a number of studies (e.g., Filer, 1952; Harari & Chwast, 1959; Rotter & Jessor, undated, circa 1947). In this discussion of interpreter effects among diagnosticians we have assumed that the examinee’s behavior on which the interpretations were based was not itself affected by the examiner. Sometimes the examiner does affect the patient’s behavior and markedly so. These effects will be discussed beginning with Chapter 4. Before leaving the area of clinical diagnosis or interpretation, it should be emphasized that diagnostic differences occur in other areas, perhaps even to as great an extent. 
Jones (1938), for example, has shown the degree of disagreement in the assessment of the nutritional health of schoolchildren. Not only did diagnosticians disagree with one another but they also differed from their own earlier assessments. Sometimes in nutritional diagnosis, as in psychological diagnosis, we can speak of biased or directional or predictable differences among diagnosticians (Bean, 1948; Bean, 1959). An informal report by Wooster (1959) nicely illustrates such biased diagnosis. Wooster tells the possibly apocryphal story of 200 patients who were to be classified as obese, normal, or underweight. Leaner physicians tended to classify patients as more obese than did obese physicians. One variable that has been shown especially likely to bias the assessment of behavior is the expectancy of the observer or interpreter. Rapp (1965) tells us about an especially carefully conducted experiment which demonstrates this expectancy bias. Rapp’s experiment, it will be seen, could be equally well viewed as a study of observer effect or of interpreter effect. It deals with data falling in the range of experimenter effects that are difficult to categorize clearly. The setting of the experiment was a nursery school, and the task for each of eight pairs of observers was to describe objectively the behavior of a single child as it occurred within one minute. One member of each pair of observers was led to believe that the child to be observed was feeling ‘‘under par.’’ The other member of the pair of recorders was led to believe that the child was feeling ‘‘above par.’’ Actually all the eight children included for observation had been selected so that their behavior would not be extreme in either the above or below par direction. Results of this study showed that seven of the eight pairs of observers wrote descriptions of the children’s behavior that were detectably biased in the direction of their expectation (p = .003).
An example of the biasing effect of expectations, one that seems to be more clearly an example of an interpreter effect, is given by Cahen (1965). His subjects, 256 prospective schoolteachers, were each asked to score several test booklets ostensibly filled out by children being tested for academic readiness. Each of the 30 test items was to be scored on a four-point scale using a scoring manual which gave examples of answers of varying quality. On each of the answer booklets to be scored some ‘‘background’’ information was provided for that child. This background information included an alleged IQ score, the purpose of which was to create an expectation in the scorer that the child whose booklet was being scored was (1) above average, (2) average, or (3) below average in intellectual ability. The scoring of the tests supported Cahen’s hypothesis that children thought to be brighter would receive higher scores for the same performance than would children believed to be less able. The assessment of cultures like the assessment of individuals is subject to widely divergent interpretations (Hyman et al., 1954). Oscar Lewis and Robert Redfield described the Mexican village of Tepoztlan in quite different ways. Redfield presented a picture of a highly cooperative, integrated, and happy society relative to Lewis’ picture of an uncooperative, poorly integrated society whose members seemed anything but happy. Reo Fortune and Margaret Mead described the Arapesh in significantly different terms. For Mead, but not for Fortune, the Arapesh were a placid, domestic people characterized by a maternal temperament. In such cases of anthropological disagreement, we are hard put to account adequately for differing interpretations. It is important to know that such differences occur, but it would be most valuable to know why. 
If, for example, we could show a general tendency for female workers to perceive cultures as more peaceful, we could begin to write some general terms into the anthropological personal equations. In the absence of such data, we are left with the unsatisfactory alternative of noting differences without adequately understanding them. Sometimes an anthropological interpretive effect can be understood as an illustration of a well-known principle of perception. Such seems to be the case for data cited by Campbell (1959). The evaluation of the drabness or liveliness of Russian cities was found to depend on the order in which the cities were visited. Cities visited earlier on a tour were judged more drab than those visited later. ‘‘Against the adaptation level based upon experience with familiar US cities, the first Russian city seemed drab and cold indeed. But stay in Russia modified the adaptation level, changed the implicit standard of reference so that the second city was judged against a more lenient standard’’ (1959, p. 11). (Here and elsewhere [1958] Campbell has provided inventories of sources of error relevant to our discussion of interpreter as well as observer effects.) A major attempt to assess the biasing effects of different anthropological interpreters has been made by Raoul Naroll (1962). His method of data quality control is designed to compare anthropological reports made under more favorable conditions with reports made under less favorable conditions. Thus, staying in the field for over a year is associated with reports of higher rates of witchcraft attribution than staying in the field for less than a full year. Length of stay in the field is, then, a biasing factor but one for which it seems reasonable to assume that the longer stay gives a truer picture than does the shorter stay. 
Length of stay in a culture does not, however, bias reports of drunken brawling, so we see that conditions of observation or interpretation may bias reports of some behaviors but not others.
Another test of the quality of anthropological reports notes the investigator’s knowledge of the native language. Whether he knows the language tends in fact to be related to his report, not only of witchcraft attribution but of protest suicide as well. A third test described by Naroll is the distinction between a professional and nonprofessional investigator. The anthropologist is the former, and, in this context, the missionary the latter. Although in general we might expect professionals to be more accurate, Naroll suspects that, at least for reports of witchcraft attribution, missionaries may be more reliable than anthropologists. In summary, Naroll’s method allows us not only to assess the extent of bias in a series of anthropological reports but to institute controls for these as well. Perhaps more than any other, the survey research literature has shown a sophisticated awareness of interpreter and related effects; the already classic work of Hyman and his collaborators (1954) shows this fact most clearly. In their discussion of interviewer effects they describe the impact of interviewers’ expectations on their interpretation of respondents’ replies. Smith and Hyman (1950) provide the example. Recordings were made of two interviews. One of the respondents was a political isolationist described additionally as provincial and prejudiced. The other respondent, chosen to contrast markedly with the first, was an interventionist. In each interview, responses were included that objectively reflected equivalent sentiments on the part of both respondents. However, the interviewers were greatly affected by the respondent’s overall orientation in assessing these matched replies. One of the questions dealt with the amount of money spent by the United States for European recovery. Answers to this question by the isolationist and interventionist both actually suggested that we were spending an appropriate amount. 
However, when these same answers were coded or interpreted by interviewers who had been given the isolationist versus interventionist set, the results were dramatically altered. The isolationist’s response was interpreted as meaning that we were spending too much for European recovery by 53 per cent of the interpreters. The interventionist’s response, which had been equated with the isolationist’s response, was interpreted as meaning that we were spending too much by only 9 per cent of the interpreters. Another question on which the replies of the two respondents had been equated dealt with the respondents’ interest in our policy toward Spain. Actually both respondents’ replies indicated some interest, and 99 per cent of the interpreters so coded the interventionist’s reply. In contrast, however, only 76 per cent of the coders so interpreted the isolationist’s reply. For our most recent examples of interpreter effects we turn to experimental research in psychology. A recent paper summarizes 25 experiments in which eyelid conditioning was related to the subjects’ level of anxiety as measured by a paper-and-pencil test (Spence, 1964). Considerable theoretical importance is associated with the direction of this relationship, the more highly anxious subjects having been postulated to show the greater learning. In 21 of the 25 experiments the greater learning did in fact occur (p = .002) among the more anxious subjects, though the differences were not statistically significant for every individual comparison. The interpretive effects arise from the finding that 16 of the 17 studies carried out in the Iowa laboratory showed the predicted effect (p < .001), while in the other laboratories 5 out of 8 studies showed the predicted effect (p > .70). A great many differences in procedure and in sampling could, of course, easily account for these differences. One major interpretation offered to account for the differences, however, was that the studies not conducted at Iowa
employed smaller sample sizes. Such an interpretation would simply be a restatement of the fact that the power of a statistic increases with the sample size if it were not for the fact that three of the eight smaller sample studies showed mean differences in the unpredicted direction. In this case, the interpretation that a larger sample size would lead to differences in the predicted direction can only be made if it assumes that later-run subjects differ significantly from earlier-run subjects and systematically so in the predicted direction. An example of an oppositely biased interpretation would be to suggest that if the eight experiments conducted at different laboratories had employed larger sample sizes their results would have been still significantly more different from the Iowa studies than they actually were. Other recent examples of interpreter differences may be found in discussions of extrasensory perception (Boring, 1962b; Murphy, 1962) and of social psychology (Chapanis & Chapanis, 1964; Jordan, 1964; Silverman, 1964; Weick, 1965). Earlier in the discussion of the natural sciences, reference was made to the fact that sometimes interpreter differences lead to keeping data ‘‘off the market.’’ This, of course, also occurs in the behavioral sciences. Sometimes it occurs directly, as in an explicit or implicit editorial decision to not publish certain kinds of experiments. Such decisions, of course, are inevitable given that the demand for space in scientific literature far exceeds the supply. Often the data thus kept off the market are negative results which are themselves often difficult to account for. (The problem of negative results will be discussed in greater detail in Part III.) One good reason for keeping certain data off the market is that the particular data may be wrong. This suspicion may be raised about a particular observation within a series that is very much out of line with all the others. 
But the question of how to deal with such discordant data is not easily answered (Rider, 1933; Tukey, 1965). Kety’s (1959) caution is most appropriate: ‘‘[I]t is difficult to avoid the subconscious tendency to reject for good reason data which weaken a hypothesis while uncritically accepting those data which strengthen it’’ (p. 1529). Wilson (1952) and Wood (1962) give similar warnings.
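The probabilities quoted above for the eyelid conditioning studies (21 of 25 experiments overall, 16 of 17 at Iowa, 5 of 8 elsewhere) can be checked with a simple sign test. The sketch below is only an illustration: the text does not say which test produced the printed values, so an exact two-sided binomial test with an even chance of either direction is assumed here, and the function name `sign_test_two_sided` is hypothetical. Under that assumption the Iowa and non-Iowa results fall on the same sides of the quoted thresholds (p < .001 and p > .70); the pooled value need not reproduce the printed .002 exactly, since the original computation is not specified.

```python
from math import comb


def sign_test_two_sided(successes: int, n: int) -> float:
    """Exact two-sided sign-test p-value, assuming each study is equally
    likely to fall in the predicted or the unpredicted direction."""
    # Work with the larger of the two counts so the tail is the upper one.
    k = max(successes, n - successes)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # Double the one-sided tail for a two-sided test, capped at 1.
    return min(1.0, 2 * upper_tail)


if __name__ == "__main__":
    # Counts taken from the passage; the labels are descriptive only.
    studies = [
        ("all laboratories", 21, 25),
        ("Iowa laboratory", 16, 17),
        ("other laboratories", 5, 8),
    ]
    for label, hits, total in studies:
        p = sign_test_two_sided(hits, total)
        print(f"{label}: {hits} of {total} in predicted direction, p = {p:.4f}")
```

The comparison makes the interpretive point concrete: with only eight studies, five results in the predicted direction are entirely consistent with chance, so appealing to sample size alone does not turn the non-Iowa studies into support for the hypothesis.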
The Control of Interpreter Effects
Some interpreter effects are fully public events and some are not. If the interpretation of a set of public observations is uncongenial to our own orientation we are free to disagree. The public nature of these interpretive differences insures that in time they may be resolved by the addition of relevant observations or the development of new mental matrices which allow the reconciliation of heretofore opposing theoretical orientations (Koestler, 1964). When interpreter effects operate to keep observations off the market, however, they are less than fully public events. If an investigator simply scraps one of his observations as having been made in error there is no one to disagree and attempt to use the discordancy in a reformulation of an existing theory or as evidence against its tenability. When negative results are unpublishable, the fact of their negativeness is not a publicly available observation. When unpopular results are unpublishable, they are kept out of the public data pool of science. All these examples are clear-cut illustrations of interpreter effects which reduce the ‘‘publicness’’ of science. There are less clear-cut cases, however.
As in Mendel’s case, the observations are sometimes available but so little known and so little regarded that for practical purposes they are unavailable publicly. Sometimes it is our unawareness of their existence that keeps them out of science, but sometimes they are known at least to some but ‘‘. . . they lie outside of science until someone brings them in’’ (Boring, 1962b, p. 357). That, of course, is the point made earlier, that we may know of the existence of data but not what can be said of them. When we speak, then, of the control of interpreter effects we do not necessarily mean that there should be none. In the first place, their elimination would be as impossible as the elimination of individual differences (Morrow, 1956; Morrow, 1957). In the next place, their elimination would more likely retard than advance the development of science (Bean, 1958). Only those interpreter effects that serve to keep data from becoming publicly available or those that are very close to being observer effects should be controlled. As for the interpreter effects of a public nature that involve the impassioned defense of a theory, Turner (1961a) put it thus: ‘‘In the matter of making discoveries, unconcern is not a promising trait. But the desire to gain the truth must be balanced by an equally strong desire not to be played false’’ (p. 585).
3 Intentional Error
Intentional error production on the part of the experimenter is probably as relatively rare an event in the psychological experiment as it is in the sciences generally (Turner, 1961b; Shapiro, 1959; Wilson, 1952). Nevertheless, any serious attempt at understanding the social psychology of psychological research must consider the occurrence, nature, and control of this type of experimenter effect.
The Physical Sciences
Blondlot’s N-rays have already been discussed as a fascinating example of observer effect. Rostand (1960) has raised the question, however, whether their original ‘‘discovery’’ might not have been the result of overzealousness on the part of one of Blondlot’s research assistants. Were that the case then we could learn from this example how observer or interpreter effects may derive from intentional error even when the observers are not the perpetrators of the intentional error. This certainly seemed to be the case with the famous Piltdown man, that peculiar anthropological find which so puzzled anthropologists until it was discovered to be a planted fraud (Beck, 1957). A geologist some two centuries ago, Johann Beringer, uncovered some remarkable fossils including Hebraic letters. ‘‘The[se] letters led him to interpret earth forms literally as the elements of a second Divine Book’’ (Williams, 1963, p. 1083). Beringer published his findings and their important implications. A short time after the book’s publication a ‘‘fossil’’ turned up with his name inscribed upon it. Beringer tried to buy back copies of the book which were by now circulating, but the damage to his reputation had been done. The standard story had been that it was Beringer’s students who had perpetrated the hoax. Now there is evidence that the hoax was no schoolboy prank but an effort on the part of two colleagues to discredit him (Jahn & Woolf, 1963). Here again is a case where interpreter effects on the part of one scientist could be in large part attributed to the intentional error of others. A more recent episode in the history of archaeological research, and one far more difficult to evaluate, has been reported on the pages of The Sunday Observer. Professor L. R. Palmer, a comparative philologist at Oxford, has called into question
Sir Arthur Evans’ reconstruction of the excavations at Knossos (Crete). These reconstructions were reported in 1904 and then again in 1921. The succession of floor levels, each yielding its own distinctive type of pottery, was called by Palmer a ‘‘complete figment of Evans’ imagination.’’ Palmer’s evidence came from letters that contradicted Evans’ reconstruction—letters written by Evans’ assistant, Duncan Mackenzie, who was in charge of the actual on-site digging. These letters were written after Evans had reported his reconstruction to the scientific public. Evans did not retract his findings but rather in 1921 he reissued his earlier (1904) drawing. Palmer felt that the implications of these events for our understanding of Greece, Europe, and the Near East were ‘‘incalculable’’ (Palmer, 1962). In subsequent issues of The Observer Evans had his defenders. Most archaeologists (e.g., Boardman, Hood) felt that Palmer had little reason to attack Evans’ character and question his motives, though, if they are right, questions about Duncan Mackenzie’s might be implied. The Knossos affair serves as a good example of a possible intentional error which could conceivably turn out to have been simply an interpreter effect—a difference between an investigator and his assistant. One thing is clear, however: whatever did happen those several decades ago, the current debate in The Observer clearly illustrates interpreter differences. C. P. Snow, scientist and best-selling novelist, has a high opinion of the average scientist’s integrity (1961). Yet he refers to at least those few cases known to scientists in which, for example, data for the doctoral dissertation were fabricated. In one of his novels, The Affair, he deals extensively with the scientific, social, and personal consequences of an intentional error in scientific research (1960). Other references to intentional error, all somewhat more pessimistic in tone than was C. P. 
Snow, have been made by Beck (1957), George (1938), and Noltingk (1959).
The Biological Sciences
When, two chapters ago, observer effects were under discussion the assumption was made that intentional error was not at issue. Over the long run this assumption seems safely tenable. However, for any given instance it is very difficult to feel certain. We must recall: (1) Fisher’s (1936) suspicion that Mendel’s assistant may have deceived him about the results of the plant breeding experiments; (2) Bean’s (1953) suspicion that Leser’s assistant may have tried too hard to present him with nearly perfect correlations between harmless skin markings and cancer; (3) Binet’s suspicion over his own assistant’s erring so regularly in the desired direction in the taking of cephalometric measurements (Wolf, 1961). One of the best known and one of the most tragic cases in the history of intentional error in the biological sciences is the Kammerer case. Kammerer was engaged in experiments on the inheritance of acquired characteristics in the toad. The characteristic acquired was a black thumb pad, and it was reported that the offspring also showed a black thumb pad. Here was apparent evidence for the Lamarckian hypothesis. A suspicious investigator gained access to one of the specimens, and it was shown that the thumb pad of the offspring toad had been blackened, not by the inherited pigment, but by India ink (MacDougall, 1940). There cannot, of course, be any question in this case that an intentional error had been perpetrated, and Kammerer recognized that prior to his suicide. To this day, however, it cannot be said with
certainty that the intentional error was of his own doing or that of an assistant. A good illustration of the operation of interpreter effects is provided by Zirkle (1954) who noted that scientists were still citing Kammerer’s data, and in reputable journals, without mentioning its fraudulent basis. More recently, two cases of possible data fabrication in the biological sciences came to light. One case ended in a public exposé before the scientific community (Editorial Board, 1961); the other ended in an indictment by an agency of the federal government (Editorial Board, 1964).
The Behavioral Sciences
The problem of the intentional error in the behavioral sciences may not differ from the problem in the sciences generally. It has been said, however, that at least in the physical sciences, error of either intentional or unintentional origin is more quickly checked by replication. In the behavioral sciences replication leads so often to uninterpretable differences in data obtained that it seems difficult to establish whether ‘‘error’’ has occurred at all, or whether the conditions of the experiment differed sufficiently by chance to account for the difference in outcome. In the behavioral sciences it is difficult to specify as explicitly as in the physical sciences just how an experiment should be replicated and how ‘‘exact’’ a replication is sufficient. There is the additional problem that replications are carried out on a different sample of human or animal subjects which we know may differ very markedly from the original sample of subjects. The steel balls rolled down inclined planes to demonstrate the laws of motion are more dependably similar to one another than are the human subjects who by their verbalizations are to demonstrate the laws of learning. In survey research the ‘‘cheater problem’’ among field interviewers is of sufficient importance to have occasioned a panel discussion of the problem in the International Journal of Attitude and Opinion Research (1947). Such workers as Blankenship, Connelly, Reed, Platten, and Trescott seem to agree that, though statistically infrequent, the cheating interviewer can affect the results of survey research, especially if the dishonest interviewer is responsible for a large segment of the data collected. A systematic attempt to assess the frequency and degree of interviewer cheating has been reported by Hyman, Cobb, Feldman, Hart, and Stember (1954). Cheating was defined as data fabrication, as when the interviewer recorded a response to a question that was never asked of the respondent. 
Fifteen interviewers were employed to conduct a survey, and unknown to them, each interviewed one or more ‘‘planted’’ respondents. One of the ‘‘planted’’ interviewees was described as a ‘‘punctilious liberal’’ who qualified all his responses so that no clear coding of responses could be undertaken. Another of the planted respondents played the role of a ‘‘hostile bigot.’’ Uncooperative, suspicious, and unpleasant, the bigot tried to avoid committing himself to any answer at all on many of the questions. Interviews with the planted respondents were tape recorded without the interviewers’ knowledge. It was in the interview with the hostile bigot that most cheating errors occurred. Four of the interviewers fabricated a great deal of the interview data they reported, and these interviewers tended also to cheat more on interviews with the punctilious liberal, although, in general, there was less cheating in that interview. Frequency of cheating, then, bore some relation to the specific data-collection situation and was at least to some extent predictable from one situation to another.
In science generally, the assumption of predictability of intentional erring is made and is manifested by the distrust of data reported by an investigator who has been known, with varying degrees of certainty, to have erred intentionally on some other occasion. In science, a worker can contribute to the common data pool a bit of intentionally erring data only once. We should not, of course, equate the survey research interviewer with the laboratory scientist or his assistants. The interviewer in survey research is often a part-time employee, less well educated, less intelligent, and less interested in the scientific implications of the data collected than are the scientist, his students, and his assistants. The survey research interviewer has rarely made any identification with a scientific career role with its very strong taboos against data fabrication or other intentional errors, and its strong positive sanctions for the collection of accurate, ‘‘uncontaminated’’ data. Indeed, in the study of interviewers’ intentional errors just described, the subjects were less experienced than many survey interviewers, and this lack of experience could have played its part in the production of such a high proportion of intentional errors. In that study, too, it must be remembered, the design was such as to increase the incidence of all kinds of interviewer effects by supplying unusually difficult situations for inexperienced interviewers to deal with. However, even if these factors increased the incidence of intentional error production by 400 percent, enough remains to make intentional erring a fairly serious problem for the survey researcher (Crespi, 1945–1946; Cahalan, Tamulonis, & Verner, 1947; Mahalanobis, 1946). A situation somewhere between that of collecting data as part of a part-time job and collecting data for scientific purposes exists in those undergraduate science courses in which students conduct laboratory exercises. 
These students have usually not yet identified to a great extent with the scientific values of their instructors, nor do they regard their laboratory work as simply a way to earn extra money. Data fabrication in these circumstances is commonplace and well-known to instructors of courses in physics and psychology alike. Students’ motivation for cheating is not, of course, to hoax their instructors or to earn more money in less time but rather to hand in a ‘‘better report,’’ where better is defined in terms of the expected data. Sometimes the need for better data arises from students’ lateness, carelessness, or laziness, but sometimes it arises from fear that a poor grade will be the result of an accurately observed and recorded event which does not conform to the expected event. Such deviations may be due to faulty equipment or faulty procedure, but sometimes these deviations should be expected simply on the basis of sampling error. One is reminded of the Berkson, Magath, and Hurn (1940) findings which showed that laboratory technicians were consistently reporting blood counts that agreed with each other too well, so well that they could hardly have been accurately made. We shall have occasion to return to the topic of intentional erring in laboratory course work when we consider the control of intentional errors. For the moment we may simply document that in two experiments examined for intentional erring by students in a laboratory course in animal learning, one showed a clear instance of data fabrication (Rosenthal & Lawson, 1964), and the other, while showing some deviations from the prescribed procedure, did not show any evidence of outright intentional erring (Rosenthal & Fode, 1963a). In these two experiments, the incidence of intentional erring may have been reduced by the students’ belief that their data were collected not simply for their own edification but also for use by others for serious scientific purposes. 
Such error reduction may be postulated if we can assume that
data collected only for laboratory learning are less ‘‘sacred’’ than those collected for scientific purposes. Student experimenters are often employed as data collectors for scientific purposes. In one such study Verplanck (1955) concluded that following certain reinforcement procedures the content of conversation could be altered. Again employing student experimenters Azrin, Holz, Ulrich, and Goldiamond (1961) obtained similar results. However, an informal post-experimental check revealed that data had been fabricated by their student experimenters. When very advanced graduate student experimenters were employed, they discovered that the programmed procedure for controlling the content of conversation simply did not work. Although it seems reasonable to assume that more-advanced graduate students are generally less likely to err intentionally, few data are at hand for documenting that assumption. We do know, of course, that sometimes even very advanced students commit intentional errors. Dr. Ralph Kolstoe has related an instance in which a graduate student working for a well-known psychologist fabricated his data over a period of some time. Finally, the psychologist, who had become suspicious, was forced to use an entrapment procedure which was successful and led to the student’s immediate expulsion. What has been said of very advanced graduate students applies as well to fully professional scientific workers. It would appear that the incidence of intentional errors is very low among them, but, again, few data are available to document either that assumption or its opposite. Most of the cases of ‘‘generally known’’ intentional error are imperfectly documented and perhaps apocryphal. In the last chapter there was occasion to discuss those types of interpreter effects which serve to keep certain data off the market either literally or for all practical purposes. 
It was mentioned that sometimes data were kept out of the common exchange system because no one knew quite what to say about them. Sometimes, though, data are kept off the market because the investigator knows all too well what will be said of them. Such intentional suppression of data damaging to one’s own theoretical position must be regarded as an instance of intentional error only a little different from the fabrication of data. What difference there is seems due to the ‘‘either-or-ness’’ of the latter and the ‘‘shades of grayness’’ of the former. A set of data may be viewed as fabricated or not. A set of legitimate data damaging to a theory may be withheld for a variety of motives, only some of which seem clearly self-serving. The scientist may honestly feel that the data were badly collected or contaminated in some way and may therefore hold them off the market. He may feel that while damaging to his theory their implications might be damaging to the general welfare of mankind. These and other reasons, not at all self-serving, may account for the suppression of damaging data. Recently a number of workers have called attention to the problem of data suppression, all more or less stressing the self-serving motives (Beck, 1957; Garrett, 1960; Maier, 1960). One of these writers (Garrett) has emphasized a fear motive operating to suppress certain data. He suggests that young scientists fear reprisal should they report data that seem to weaken the theory of racial equality. Sometimes the suppression of data proceeds, not by withholding data already obtained, but by insuring that unwanted data will not be collected. In some cases we are hard put to decide whether we have an instance of intentional error or an instance of incompetence so magnificent that one is reduced to laughter. Consider, for
example, (1) an investigator interested in showing the widespread prevalence of psychosis who chooses his sample entirely from the back wards of a mental hospital; (2) an investigator interested in showing the widespread prevalence of blindness who chooses his sample entirely from a list of students enrolled in a school for the rehabilitation of the blind; (3) an investigator interested in showing that the aged are very well off financially who chooses his sample entirely from a list of white, noninstitutionalized persons who are not on relief. The first two examples are fictional, the third, according to the pages of Science, unfortunately, is not. (One sociologist participating in that all too real ‘‘data’’-collecting enterprise was told to avoid apartment dwellers.) A spokesman for a political group which made use of these data noted helpfully that the survey was supported by an organization having a ‘‘conservative outlook’’ (Science, 1960). The issue, of course, is not whether an organization having a ‘‘liberal outlook’’ would have made similar errors either of incompetence or of intent but rather that such errors do occur and may have social as well as scientific implications.
The Control of Intentional Error
The scientific enterprise generally is characterized by an enormous degree of trust that data have been collected and reported in good faith, and by and large this general trust seems well justified. More than simply justified, the trust seems essential to the continued progress of the various sciences. It is difficult to imagine a field of science in which each worker feared that another might at any time contaminate the common data pool. Perhaps because of this great faith, science has a way of being very harsh with those who break the faith (e.g., Kammerer’s suicide) and very unforgiving. A clearly established fraud by a scientist is not, nor can it be, overlooked. There are no second chances. The sanctions are severe not only because the faith is great but also because detection is so difficult. There is virtually no way a fraud can be detected as such in the normal course of events. The charge of fraud is such a serious one that it is leveled only at the peril of the accuser, and suspicions of fraud are not sufficient bases to discount the data collected by a given laboratory. Sometimes such a suspicion is raised when investigators are unwilling to let others see their data or when the incidence of data-destroying fires exceeds the limits of credibility (Wolins, 1962). It would be a useful convention to have all scientists agree to an open-data-books policy. Only rarely, after all, is the question of fraud raised by him who wants to see another’s data, although other types of errors do turn up on such occasions. But if there is to be an open-books system, the borrower must make it convenient for the lender. A request to ‘‘send me all your data on verbal conditioning’’ made of a scientist who has for ten years been collecting data on that subject rightly winds up being ignored. 
If data are reasonably requested, the reason for the request given as an accompanying courtesy, they can be duplicated at the borrower’s expense and then given to the borrower. Such a data-sharing system not only would serve to allay any doubts about the extent and type of errors in a set of data but would, of course, often reveal to the borrower something very useful to him though it was not useful to the original data collector. The basic control for intentional errors in science, as for other types of error, is the tradition of replication of research findings. In the sciences generally this has
sometimes led to the discovery of intentional errors. Perhaps, though, in the behavioral sciences this must be less true. The reason is that whereas all are agreed on the desirability or even necessity of replication, behavioral scientists have learned that unsuccessful replication is so common that we hardly know what it means when one’s data don’t confirm another’s. Always there are sampling differences, different subjects, and different experimenters. Often there are procedural differences so trivial on the surface that no one would expect them to make a difference, yet, when the results are in, it is to these we turn in part to account for the different results. We require replication but can conclude too little from the failure to achieve confirming data. Still, replication has been used to suggest the occurrence of intentional error, as when Azrin’s group (1961) suggested that Verplanck’s (1955) data collectors had deceived him. In fact, it cannot be established that they did simply because Azrin’s group had been deceived by their data collectors. Science, it is said, is self-correcting, but in the behavioral sciences especially, it corrects only very slowly. It seems clear that the best control of intentional error is its prevention. In order to prevent these errors, however, we would have to know something about their causes. There seems to be agreement on that point but few clues as to what these causes might be. Sometimes in the history of science the causes have been so idiosyncratic that one despairs of making any general guesses about them, as when a scientist sought instant eminence or to embarrass another, or when an assistant deceived the investigator to please him. Crespi (1945–1946) felt that poor morale was a cause of cheating among survey research interviewers. But what is the cause of poor morale? 
And what of the possibility that better morale might be associated with worsened performance, a possibility implied by the research of Kelley and Ring (1961)? Of course, we need to investigate the problem more systematically, but here the clarion call for ‘‘more research’’ is likely to go unheeded. Research on events so rare is no easy matter. There is no evidence on the matter, but it seems reasonable to suppose that scientists may be affected by the widespread data fabrication they encountered in laboratory courses when they were still undergraduates. The attitude of acceptance of intentional error under these circumstances might have a carry-over effect into at least some scientists’ adult lives. Perhaps it would be useful to discuss with undergraduate students in the various sciences the different types of experimenter effects. They should, but often do not, know about observer effects, interpreter effects, and intentional effects, though they quickly learn of these latter effects. If instructors imposed more negative sanctions on data fabrication at this level of education, perhaps there would be less intentional erring at more advanced levels. Whereas most instructors of laboratory courses in various disciplines tend to be very conscious of experimental procedures, students tend to show more outcome-consciousness than procedure-consciousness. That is, they are more interested in the data they obtain than in what they did to obtain those data. Perhaps the current system of academic reward for obtaining the ‘‘proper’’ data reinforces this outcome-consciousness, and perhaps it could be changed somewhat. The selection of laboratory experiments might be such that interspersed with the usual, fairly obvious demonstrations there would be some simple procedures that demonstrate phenomena that are not well understood and are not highly reliable. Even for students who ‘‘read
ahead’’ in their texts it would be difficult to determine what the ‘‘right’’ outcome should be. Academic emphasis for all the exercises should be on the procedures rather than on the results. What the student needs to learn is, not that learning curves descend, but how to set up a demonstration of learning phenomena, how to observe the events carefully, record them accurately, report them thoroughly, and interpret them sensibly and in some cases even creatively. A general strategy might be to have all experiments performed before the topics they are designed to illustrate are taken up in class. The spirit, consistent with that endorsed by Bakan (1965), would be ‘‘What happens if we do thus-and-so’’ rather than ‘‘Now please demonstrate what has been shown to be true.’’ The procedures would have to be spelled out very explicitly for students, and generally this is already done. Not having been told what to expect and not being graded for getting ‘‘good’’ data, students might be more carefully observant, attending to the phenomena before them without the single set which would restrict their perceptual field to those few events that illustrate a particular point. It is not inconceivable that under such less restrictive conditions, some students would observe phenomena that have not been observed before. That is unlikely, of course, if they record only that the rat turned right six times in ten trials. Observational skills may sharpen, and especially so if the instructor rewards with praise the careful observation and recording of the organism’s response. The results of a laboratory demonstration experiment are not new or exciting to the instructor, but there is no reason why they cannot be for the student. The day may even come when classic demonstration experiments are not used at all in laboratory courses, and then it need not be dull even for the instructor. 
That the day may really come soon is suggested by the fact that so many excellent teachers are already requiring that at least one of the scheduled experiments be completely original with the student. That, of course, is more like Science, less like Science-Fair. If we are seriously interested in shifting students’ orientations from outcome-consciousness to procedure-consciousness there are some implications for us, their teachers, as well. One of these has to do with a change in policy regarding the evaluation of research. To evaluate research too much in terms of its results is to illustrate outcome-consciousness, and we do it very often. Doctoral committees too often send the candidate back to the laboratory to run another group of subjects because the experiment as originally designed (and approved by them) yielded negative results. Those universities show wisdom that protect the doctoral candidate from such outcome-consciousness by regarding the candidate’s thesis proposal as a kind of contract, binding on both student and faculty. The same problem occurs in our publication policies. One can always account for an unexpected, undesired, or negative result by referring to the specific procedures employed. That this occurs so often is testament to our outcome-consciousness. What we may need is a system for evaluating research based only on the procedures employed. If the procedures are judged appropriate, sensible, and sufficiently rigorous to permit conclusions from the results, the research cannot then be judged inconclusive on the basis of the results and rejected by the referees or editors. Whether the procedures were adequate would be judged independently of the outcome. To accomplish this might require that procedures only be submitted initially for editorial review or that only the resultless section be sent to a referee or, at least, that an evaluation of the procedures be set down before the referee or editor reads the
results. This change in policy would serve to decrease the outcome-consciousness of editorial decisions, but it might lead to an increased demand for journal space. This practical problem could be met in part by an increased use of ‘‘brief reports’’ which summarize the research in the journal but promise the availability of full reports to interested workers. Journals such as the Journal of Consulting Psychology and Science are already making extensive use of briefer reports. If journal policies became less outcome-conscious, particularly in the matter of negative results, psychological researchers might not unwittingly be taught by these policies that negative results are useless and might as well be suppressed. In Part III negative results will be discussed further. Here, as long as the discussion has focused on editorial policies which are so crucial to the development of our scientific lifestyles and thinking modes, it should be mentioned that the practice of reading manuscripts for critical review would be greatly improved if the authors’ names and affiliations were routinely omitted before evaluation.¹ Author data, like experimental results, detract from the independent assessment of procedures.
1 Both Gardner Lindzey and Kenneth MacCorquodale have advocated this procedure. The usual objection is that to know a man’s name and affiliation provides very useful information about the quality of his work. Such information certainly seems relevant to the process of predicting what a man will do, and that is the task of the referee of a research proposal submitted to a research funding agency. When the work is not being proposed but rather reported as an accomplished fact, it seems difficult to justify the assessment of its merit by the reputation of its author.
4 Biosocial Attributes
In the last three chapters some effects of experimenters on their research have been discussed. These effects have operated without the experimenter directly influencing the organisms or materials being studied. In this chapter, and in the ones to follow, the discussion will turn to those effects of experimenters that operate by influencing the events or behaviors under study. The physical and biological sciences were able to provide us with illustrations of those experimenter effects not influencing the materials studied. It seems less likely that these sciences could provide us with examples of experimenter effects that do influence the materials studied. The speed of light or the reaction of one chemical with another or the arrangement of chromosomes within a cell is not likely to be affected by individual differences among the investigators interested in them. As we move from physics, chemistry, and molecular biology to those disciplines concerned with larger biological systems, we begin to encounter more examples of how the investigator can affect his subject. By the time we reach the level of the behavioral sciences there can be no doubt that experimenters may unintentionally affect the very behavior in which they are interested. Christie (1951) tells us how experienced observers in an animal laboratory could judge which of several experimenters had been handling a rat by the animal’s behavior in a maze or while being picked up. Gantt (1964) noted how a dog’s heart rate could drop dramatically (from 160 to 140) simply because a certain experimenter was present. The importance to an animal’s performance of its relationship to the experimenter has also been pointed out for horses (Pfungst, 1911), sheep (Liddell, 1943), and porpoises (Kellogg, 1961). If animal subjects can be so affected by their interaction with a particular experimenter, we would expect that human subjects would also be, perhaps even more so. 
Our primary focus in this and in the following chapters will be on those characteristics of experimenters that have been shown to affect unintentionally the responses of their human subjects. The study of individual differences among people proceeds in several ways. Originally it was enough to show that such characteristics as height, weight, and intelligence were distributed throughout a population and that the shape of the distribution could be specified. Later when the fact and shape of individual differences were well known, various characteristics were correlated with one another.
That led to answers to questions of the sort: are men or women taller, heavier, brighter, longer-lived? From these studies it was learned which of the characteristics studied were significantly associated with many others. It was found that age, sex, social class, education, and intelligence, for example, were all variables that made a great deal of difference if we were trying to predict other characteristics. Always, though, it was a characteristic of one person that was to be correlated with another characteristic of that person. In undertaking the study of individual differences among experimenters, the situation has become more complex and even more interesting. Here we are interested in relating characteristics of the experimenter, not to other of his characteristics, but rather to his subjects’ responses. The usual study of individual differences is not necessarily social psychological. The relationship between person A’s sex and person A’s performance on a motor task is not of itself social psychological. But the relationship between person A’s sex and person B’s performance on a motor task is completely social psychological. That person A happens to be an experimenter rather than a parent, sibling, friend, or child has special methodological importance but no special substantive importance. It has special methodological importance because so much of what has been learned by behavioral scientists has been learned within the context of the experimenter-subject interaction. If the personal characteristics of the data collector have determined in part the subjects’ responses, then we must hold our knowledge the more lightly for it. There is no special substantive importance in the fact that person A is an experimenter rather than some other person because as a model of a human organism behaving and affecting others’ behavior, the experimenter is no more a special case than is a parent, sibling, friend, or child. 
Whether we can generalize from the experimenter to other people is as open a question as whether we can generalize from parent to friend, friend to child, child to parent. There are experiments by the dozen which show that different experimenters obtain from their comparable subjects significantly different responses (Rosenthal, 1962). In the pages to follow, however, major consideration is given only to those studies showing that a particular type of response by an experimental subject is associated with a particular characteristic of the experimenter. Experimenter attributes that have been shown to be partial determinants of subjects’ responses are sometimes defined independently of the experiment in which their effect is to be assessed. That is the case for such biosocial characteristics as sex, race, age, religion, and for such psychometrically determined variables as anxiety, hostility, authoritarianism, and need for social approval. Sometimes the relevant experimenter attributes can be defined only in terms of the specific experimental situation and the specific experimenter-subject interaction. Such attributes include the status of the experimenter relative to the status of the subject, the warmth of the experimenter-subject relationship, and such experiment-specific events as whether the experimenter feels himself approved by the principal investigator or whether the subject has surprised him with his responses. Quite a little is known about the relationship between these different experimenter variables and subjects’ behavior, but little is known of the mechanisms accounting for the relationships. For example, we shall see that male and female experimenters often obtain different responses from their subjects. But that may be due to the fact that males and females look different or that males and females conduct the experiment slightly differently, or both of these. Does a dark-skinned survey interviewer
obtain different responses to questions about racial segregation because of his dark skin or because he asks the questions in a different tone of voice or because of both these factors? In principle, we can distinguish active from passive experimenter effects. Active effects are those associated with unintended differences in the experimenter’s behavior that can be shown to influence the subject’s responses. Passive effects are those associated with no such differences in the behavior of the experimenters and therefore must be ascribed to their appearance alone. In practice, the distinction between active and passive effects is an extremely difficult one, and no experiments have yet been reported that would be helpful in making such a distinction. It may help illustrate the distinction between active and passive effects to describe a hypothetical experiment designed to assess the relative magnitudes of these effects. Suppose that female experimenters administering a questionnaire to assess anxiety obtain consistently higher anxiety scores from their subjects than do male experimenters. To simplify matters we can assume that the questionnaire is virtually self-administering and that the experimenter is simply present in the same room with the subject. Our experiment requires 10 male experimenters and 10 females, each of whom administers the anxiety scale individually to 15 male subjects. For one-third of their subjects, the experimenters excuse themselves and say that their presence is not required during the experiment and that they will be busy with other things which take them to the other side of an obvious one-way mirror. From there they can from time to time “see how you are doing.” Another third of the subjects are told the same thing except that the experimenter explains that he has to leave the building. The light is left on in the room on the other side of the one-way mirror so that the subject can see he is not being observed.
The final third of the subjects are contacted in the usual way with the experimenter sitting in the same room but interacting only minimally. Table 4-1 shows some hypothetical results. Mean anxiety scores are shown for subjects contacted by male and female experimenters in each of the three conditions.

Table 4–1 Mean Anxiety Scores as a Function of Experimenter Sex and Presence

| Condition | Sex of experimenter: Male | Sex of experimenter: Female | Difference | Sum |
|---|---|---|---|---|
| I Experimenter present | 14 | 20 | +6 | 34 |
| II Experimenter absent but observing | 18 | 22 | +4 | 40 |
| III Experimenter absent, not observing | 23 | 23 | 0 | 46 |
| Sum | 55 | 65 | 10 | 120 |

Female experimenters again obtained higher anxiety scores but not equally so in each condition. We learn that when the experimenter is neither present nor observing, the sex-of-experimenter effect has vanished. The brief greeting period was apparently insufficient to establish the sex effect, but the physical presence of the experimenter appears to augment the effect. For convenience assuming all differences to be significant, we conclude that female experimenters obtain higher anxiety scores from their subjects only if the subjects feel observed by their experimenters. We cannot say, however, whether the greater sex-of-experimenter effect in the “experimenter present” condition was due to any unintended behavior on the part of the experimenters or whether their physical presence was simply a more constant
reminder that they were being observed by an experimenter of a particular sex. If the results had shown no difference between Conditions I and II, we could have concluded that the sex effect is more likely a passive rather than an active effect. That seems sensible since the belief of being observed by an experimenter of a given sex, without any opportunity for that experimenter to behave vis-à-vis his subject, was sufficient to account for the obtained sex effects. Often in our discussion of the effects of various experimenter attributes on subjects’ responses we shall wish that data of the sort just now invented were really available. Sex, age, and race are variables so immediately assessable that there is a temptation to assume them to be passive in their effects. That assumption should be held lightly until it can be shown that the sex, age, and race of an experimenter are not correlated with specific behaviors in the experiment. Conversely, experimenter’s “warmth” sounds so behavioral that we are tempted to assume that it is active in its effects. Yet a “warm” experimenter may actually have a different fixed appearance from a cooler experimenter. The order of discussion of experimenter attributes proceeds in this and the following chapters from (1) those that appear most directly obvious (i.e., sex) to (2) those that are thought to be relatively fixed psychological characteristics (i.e., need for approval) to (3) those that seem quite dependent on the interpersonal nature of the experiment to (4) those that are very highly situational. This organization is arbitrary and it should be remembered that many of the attributes discussed may be correlated with each other.
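The reasoning applied to the hypothetical results of Table 4-1 can be retraced with a few lines of arithmetic. The following is only an illustrative sketch in Python: the names `TABLE_4_1`, `sex_effect`, `column_sums`, and `presence_increment` are introduced here for convenience and are not part of the original discussion, and the numbers are the invented means from the table, not real data.

```python
# Hypothetical mean anxiety scores from Table 4-1 (invented data).
# Each condition maps to (male-experimenter mean, female-experimenter mean).
TABLE_4_1 = {
    "I": (14, 20),    # experimenter present
    "II": (18, 22),   # experimenter absent but observing
    "III": (23, 23),  # experimenter absent, not observing
}


def sex_effect(table):
    """Female-minus-male difference in mean anxiety for each condition."""
    return {cond: female - male for cond, (male, female) in table.items()}


def column_sums(table):
    """Sums of the male and female columns, as in the table's bottom row."""
    male_total = sum(male for male, _ in table.values())
    female_total = sum(female for _, female in table.values())
    return male_total, female_total


effects = sex_effect(TABLE_4_1)
male_total, female_total = column_sums(TABLE_4_1)

# Additional sex effect associated with physical presence, over and above
# the effect found when the subject merely believes he is being observed.
presence_increment = effects["I"] - effects["II"]

for cond, diff in effects.items():
    print(f"Condition {cond}: female - male = {diff:+d}")
print("Column sums (male, female):", male_total, female_total)
print("Presence increment (I minus II):", presence_increment)
```

A zero difference in Condition III corresponds to the vanishing of the sex effect when the subject is neither accompanied nor observed. A presence increment of zero would correspond to the case, described above, in which Conditions I and II do not differ and a passive effect seems the more likely reading.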
Experimenter’s Sex
A good deal of research has been conducted which shows that male and female experimenters sometimes obtain significantly different data from their subjects. It is not always possible to predict for any given type of experiment just how subjects’ responses will be affected by the experimenter’s sex, if indeed there is any effect at all. In the area of verbal learning the results of three experiments are illustrative. Binder, McConnell, and Sjoholm (1957) found that their attractive female experimenter obtained significantly better learning from her subjects than did a husky male experimenter, described as an “ex-marine.” Some years later Sarason and Harmatz (1965) found that their male experimenter obtained significantly better learning than did their female experimenter. Ferguson and Buss (1960) round out this illustration by their report of no difference between a male and a female experimenter. This last experiment also provides a clue as to how we may reconcile these inconsistent but statistically quite real findings. Ferguson and Buss had their experimenters behave aggressively to some of their subjects and neutrally to others. When the experimenter behaved more aggressively, there was decreased learning. If we can assume that Binder and associates’ ex-marine officer gave an aggressive impression to his subjects, their results seem consistent with those of Ferguson and Buss. However, we would have to assume further that Sarason and Harmatz’s female experimenter was perceived as more aggressive by her subjects, and for this we have no good evidence. Another experiment by Sarason (1962), in any case, tends to weaken or at least to complicate the proffered interpretation. In this study, Sarason employed 10 male and 10 female experimenters in a verbal learning experiment. Subjects were
to construct sentences and were reinforced for the selection of hostile verbs by the experimenter’s saying ‘‘good’’ or by his flashing a blue light. More hostile experimenters of both sexes tended to obtain more hostile responses (p < .10). If we can assume that those experimenters earning higher hostility scores behaved more aggressively toward their subjects, then we have a situation hard to reconcile with the results presented by Ferguson and Buss. A further complication in the Sarason experiment was that the relationship between experimenter hostility and the acquisition of hostile responses was particularly marked when the experimenters were males rather than females. Perhaps, though, the recitation of hostile verbs is a very special case of verbal learning, especially when it is being correlated with the hostility of the experimenters. One wonders whether more hostile experimenters would also be more effective reinforcing agents for first-person pronouns. Sarason and Minard (1963) provide the answer, which, though a little equivocal, must be interpreted as a ‘‘no.’’ Hostility of experimenters neither alone nor in interaction with sex of experimenter affected the rate of selecting the first-person pronouns which were reinforced by the eight male and eight female experimenters of this study. Of very real interest to our general discussion of experimenter attributes and situational variables was the finding that the verbal learning of first-person pronouns was a complex function of experimenter sex, hostility, and prestige; subject sex, hostility, and degree of personal contact between experimenter and subject. It appears that at least in studies of verbal conditioning, when an experiment is so designed as to permit the assessment of complex interactions, these interactions are forthcoming in abundance. Only rarely, however, are most of them predictable or even interpretable. 
In tasks requiring motor performance as well as in verbal learning, for young children as well as for college students, the sex of the experimenter may make a significant difference. Stevenson and Odom (1963) employed two male and two female experimenters to administer a lever-pulling task to children ages six to seven and ten to eleven. From time to time the children were rewarded for pulling the lever by being shown various pictures on a filmstrip. During the first minute, no reinforcements were provided in order that a baseline for each subject’s rate of pulling could be determined. Even during this first minute, significant sex-of-experimenter effects were found (p < .001). Subjects contacted by male experimenters made over 30 per cent more responses than did subjects contacted by female experimenters. This large effect was the more remarkable for the fact that the experimenter was not even present during the subject’s task performance. Experimenters left their subjects’ view immediately after having instructed them. Stevenson, Keen, and Knights (1963) provide additional data that male experimenters obtain greater performance than female experimenters in a simple motor task, in this case, dropping marbles in a hole. As in the other experiment, the first minute served as a base rate measure after which the experimenter began regularly to deliver compliments on the subjects’ performance. This time the subjects were younger still, ages three to five. Subjects contacted by male experimenters dropped about 18 per cent more marbles into the holes than did subjects contacted by female experimenters during the initial one-minute period (p < .05). As expected, female experimenters’ subjects increased their rate of marble dropping after the reinforcement procedure began. Relative to the increasing performance of subjects contacted by female experimenters, those contacted by males showed a significant decrement
of performance during the following period of reinforced performance (p < .01). The interpretation the investigators gave to the significant sex-of-experimenter effect was particularly appropriate to their very young subjects. Such young children have relatively much less contact with males, and this may have made them anxious or excited over the interaction with the male experimenter. For simple tasks this might have served to increase performance which then fell off as the excitement wore off from adaptation or from the soothing effect of the experimenter’s compliments. The anxiety-reducing aspect of these statements might have more than offset their intended reinforcing properties. We have already encountered the fact of interaction in the study of sex of experimenter in the work of Sarason (1962). One of the most frequently investigated variables, and one that often interacts with experimenter’s sex, is the sex of the subject. Again we take our illustration from Stevenson (1961). The task, as before, is that of dropping marbles, and after the first minute the experimenters begin to reinforce the children’s performance by regularly complimenting them. The six male and six female experimenters administered the task to children in three age groups: three to four, six to seven, and nine to ten. Although the individual differences among the experimenters of either sex were greater than the effect of experimenter sex itself, there was a tendency for male experimenters to obtain slightly higher performance from their subjects (t = 1.70, p < .10, pooling individual experimenter effects and all interactions). When the experimenters began to reinforce their subjects’ performance after the first minute, female experimenters obtained a greater increase in performance than did male experimenters, but only for the youngest (3–4) children. Among the oldest children (9–10) there was a tendency (p < .10) for a reversal of this effect.
Among these children, male experimenters obtained the greater increase in performance. These findings show how sex of experimenter can interact with the age of subjects. It was among the middle group of children (age 6–7) that the sex of subjects became an interacting variable most clearly. Male experimenters obtained a greater increase of performance from their female subjects, and female experimenters obtained the greater increase from their male subjects. Although less significantly so, the same tendency was found among the older (9–10) children. Stevenson’s alternative interpretations of these results were in terms of the psychoanalytic theory of development as well as in terms of the relative degree of deprivation of contact with members of the experimenter’s sex. The interacting effects of the experimenter’s and subject’s sex are not restricted to those studies in which the subjects are children. Stevenson and Allen (1964) had eight male and eight female experimenters conduct a marble-sorting task with 128 male and 128 female college students. For the first 90 seconds subjects received no reinforcement for sorting the marbles by color. Thereafter the experimenter paid compliments to the subject on his or her performance. Once again, there were significant individual differences among the experimenters of both sexes in the rate of performance shown by their subjects. In addition, however, a significant interaction between the sex of subjects and sex of experimenters was obtained. When male experimenters contacted female subjects and when female experimenters contacted male subjects, significantly more marbles were processed than when the experimenter and subject were of the same sex. This difference was significant during the first 30 seconds of the experiment and for the entire experiment as well. Even further support for the generality of the interaction of experimenter and subject sex was provided by
Stevenson and Knights (1962), who obtained the now predicted interaction when the subjects were mentally retarded, averaging an IQ of less than 60. In trying to understand their obtained interactions, Stevenson and Allen postulated that the effects could be due to the increased competitiveness, higher anxiety, or a greater desire to please when the experimenter was of the opposite sex. There is no guarantee, however, as Stevenson (1965) points out, that experimenters may not treat subjects of the opposite sex differently than subjects of the same sex. A little later in this section some data will be presented which bear on this hypothesis. If the interaction between experimenter and subject sex is significant in such tasks as marble sorting and the construction of simple sentences, we would expect the phenomenon as well when the subjects’ tasks and responses are more dramatic ones. Walters, Shurley, and Parsons (1962) conducted an experiment in sensory deprivation which is instructive. Male and female subjects were floated in a tank of water for 3 hours and then responded to five questions about their experiences during their isolation period. Half the time subjects were contacted by a male experimenter, half the time by a female. The questions dealt with (1) feelings of fright, (2) the most unpleasant experience, (3) sexual feelings, (4) anything learned about oneself from the experience, and (5) what the total experience was reminiscent of. All responses were coded on a scale which measured the degree of psychological involvement or unusualness of the phenomena experienced. If a subject reported no experience, his score was 0. If he reported hallucinations with real feeling, the response was scored 5, the maximum. Intermediate between these extremes was a range of scores from 1 to 4. For two of the questions the interaction between sex of experimenter and sex of subject was significant. 
To the question dealing with sexual feelings, subjects contacted by an experimenter of the same sex gave replies earning psychological ‘‘richness’’ scores 3 times higher than when contacted by an experimenter of the opposite sex. This was the most significant finding statistically and in terms of absolute magnitude. In a subsequent study, although in smaller and less significant form, the same effect was obtained (Walters, Parsons, & Shurley, 1964). This particular interaction seems less difficult to interpret than that found for the marble sorting experiment. Even in an experimental laboratory, subjects regard the ‘‘mixed company’’ dyad as not a place to discuss sexual matters freely. In survey research, as in the experimental laboratory, the inhibiting effects of ‘‘mixed company’’ dyads have been demonstrated. Benney, Riesman, and Star (1956) reported that when given an opportunity to assess the cause of abnormal behavior, respondents gave sexual interpretations about 25 per cent more often when their interviewer was of their own, rather than the opposite, sex. About the same percentage difference occurred when a fuller, frank discussion of possible sexual bases for emotional disturbance was invited. Interestingly, moralistic responses were more frequent when the interviewer and respondent were of the opposite sex. Apparently, then, in interviewer-respondent dyads, sex matters are less likely to be brought up spontaneously in mixed company, but if they are brought up by the interviewer, opposite-sexed respondents are more likely to take a negative, harsh, or moralistic stance than same-sexed respondents. Additional evidence for this interpretation has been presented by Hyman and coworkers (1954). In projective methods of appraising personality, the sex of the experimenter has also been found to affect the subjects’ responses—sometimes. Masling (1960) has summarized this literature which consists of some studies showing a sex effect, and some not.
Table 4–2 Mean Photo Ratings by Four Combinations of Subjects

| Sex of experimenter | Sex of subject: Male | Sex of subject: Female |
|---|---|---|
| Male | +0.14 | +0.40 |
| Female | +0.31 | −1.13 |
Earlier in this chapter the question was raised whether the effects of experimenter attributes were passive or active. That is, do different experimenters elicit different responses because they have a different appearance, because they behave differently toward their subjects, or both? Some data relevant to, but not decisive for, these questions are available. The task was one of person perception. Subjects were asked to rate the degree of success or failure reflected in the faces of people pictured in photographs. The ratings of the photographs could range from −10, extreme failure, to +10, extreme success. The standardization of these particular photos was such that their mean rating was actually zero, or very neutral with respect to success or failure. There were five male and five female experimenters who contacted 35 female and 23 male subjects. About half the interactions between experimenters and subjects were filmed without the knowledge of either. Details of the procedure, but not the data to be reported here, have been described elsewhere (Rosenthal, Friedman, & Kurland, 1965). Table 4-2 shows the mean photo ratings obtained by the male and female experimenters from their male and female subjects. Only the results from those 33 subjects whose interaction was filmed are included. Female subjects, when contacted by female experimenters, tended to rate the photographs as being of less successful persons than did the other three combinations of experimenter and subject sex (p < .05), which did not differ from one another. When the sex of subjects was disregarded it was found that male experimenters were significantly (p < .05) more variable (σ = 1.97) in the data they obtained from their subjects than were female experimenters (σ = 0.61). (A similar tendency was obtained by Stevenson [1961], though there the effect was not so significant statistically.)
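The size of the contrast just described can be checked roughly from the published cell means. The sketch below, in Python, rests on several assumptions that are not in the original report: the female-experimenter, female-subject mean is taken as −1.13, the sign being inferred from the statement that this combination rated the photographs as less successful; the other three means are averaged without weights because the cell sizes are not given; and the two variability figures (1.97 and 0.61) are treated as standard deviations. All variable names are hypothetical.

```python
from statistics import mean

# Cell means from Table 4-2. The minus sign on the female-experimenter,
# female-subject cell is an assumption inferred from the text.
female_female = -1.13
other_cells = [0.14, 0.31, 0.40]

# Unweighted contrast of the female-female cell against the other three.
# Cell sizes are not reported, so this is illustrative only.
contrast = female_female - mean(other_cells)

# Variability of ratings obtained by male and by female experimenters,
# assumed here to be standard deviations.
sd_male_experimenters = 1.97
sd_female_experimenters = 0.61
variance_ratio = (sd_male_experimenters / sd_female_experimenters) ** 2

print(f"contrast = {contrast:.2f}")
print(f"variance ratio = {variance_ratio:.2f}")
```

Under these assumptions the female-female dyads fall well over one scale point below the other three combinations, and the variance of ratings obtained by male experimenters is roughly ten times that obtained by female experimenters. Neither figure is a significance test; the degrees of freedom needed for one are not given in the text.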
When the sex of the subjects was considered, it developed that when experimenters and subjects were of the same sex the variability of subjects’ ratings (σ = 1.68) was significantly (p = .06) greater than when the dyads were composed of opposite-sexed persons (σ = 0.78). Some data are available which suggest that the effects of experimenter sex are active rather than simply passive. It appears that male and female experimenters behave differently toward their subjects in the experiment. In connection with two other studies observations were made of the experimenters’ glancing, smiling, posture, activity level, and the accuracy of his reading of the instructions (Friedman, 1964; Katz, 1964). Both workers kindly made their raw data available for this analysis. During the brief period preceding the experimenter’s formal instructions to the subject, the experimenter asked the subject for such identifying data as name, age, class, and college major. In this preinstruction period there was no difference between male and female experimenters in the number of glances they exchanged with their subjects. However, experimenters tended to exchange more glances with their female subjects. When interacting with male subjects, 38 per cent of the experimenters exchanged at least some glances, but when interacting with females 90 per cent exchanged glances. The average number of glances exchanged
with male subjects was .31 and with females .75 (p < .10). This finding that females drew about 2.4 times as many glance exchanges as males is close numerically to the ratio of 2.9 reported by Exline (1963), in spite of the differences in the group composition, experimental procedures, and measures of glancing behavior employed in his and the present study. During the reading of the formal instructions to subjects, an interaction appeared in the glances exchanged. Now experimenters exchanged more than twice as many glances with subjects of their own sex (mean = 1.44) as with subjects of the opposite sex (mean = 0.62) (p < .10). In this experiment the subject’s task was to rate the 10 photos in sequence, and during this rating phase of the experiment, the experimenter’s task was to present the photos in the correct order. Richard Katz made observations of the experimenters’ glancing behavior separately for those times when the experimenter was actually presenting a photograph and when the experimenter was preparing to present the next stimulus. There was an interesting difference in the glancing behavior of experimenters as a function of the phases of the stimulus presentation. During the photo presentations male subjects were glanced at more (mean = 1.9) than female subjects (mean = 1.5), the difference not reaching significance (p < .20). During the preparation periods, however, male subjects were glanced at less (mean = 1.1) than female subjects (mean = 1.7). This interaction effect was significant (p < .05) and was shown by all but one of the experimenters. During the presentation period the subject is somewhat “on the spot.” The experimenter is just sitting expectantly, and the subject has to do something and wants to do it well. It could easily be that during this mutually tense moment experimenters avoid eye contact with their female subjects in order to spare them any embarrassment.
This seems an especially reasonable interpretation in the light of recent data provided by Exline, Gray, and Schuette (1965), who reported that eye contact was reduced during interviews creating greater tension. In the moments following the subject’s response the pressure is off. As the experimenters prepare their next stimulus for presentation, they need not fear for their female subjects’ tension, and indeed their increased glancing at this point toward their female subjects may serve to reassure them that all is well. Looking at the subject during the rating period of the experiment is in fact correlated with smiling at the subject (rho = .63, p = .10), although smiling at the subject is very rare during this stage of the experiment and, during either the presentation or the preparation period considered separately, is not significantly related to glancing. From these results, it can be seen that experimenters do in fact behave differently toward their subjects and that the differences are related sometimes to the sex of the subject, sometimes to the sex of the experimenter, and sometimes to both these variables. The particular pattern of experimenter behavior described suggests that at least in the psychological experiment, chivalry is not dead. Female subjects seem to be treated more attentively and more considerately than male subjects. While discussing the differences in experimenter behavior during the stimulus presentation and stimulus preparation periods, another example of experimenter sex effect can be given. All five of the female experimenters showed more smiling during the preparation than the presentation period with an average 35 per cent increase of smiles (p < .05). Among male experimenters, however, only one showed any increase, and the average increase was only about 2 per cent. It appears that
sometimes during those moments of the experimental procedure when the need for formality and austerity seems lessened, females, even when functioning as quite competent experimenters, behave more as females usually do. Those sociological writers who have been concerned with sex role differentiation would probably not be surprised either at these data or at their interpretation. Parsons (1955), Parsons, Bales, and Shils (1953), and Zelditch (1955) have all commented on the feminine role as that of greater socioemotional concern and the masculine role as that of greater concern with task accomplishment. The data presented so far and those to follow support this conception. Not only is the female more of a socioemotional leader when she is the leader but she seems much more to be led socioemotionally when she is the follower. For example, during the brief period preceding the formal instructions, the female subjects were smiled at significantly more often than were male subjects, regardless of the sex of the experimenter (p < .05). When contacting female subjects, 70 per cent of the experimenters smiled at least a little, but when contacting male subjects only about 12 per cent did so. The mean amount of smiling at female subjects by all experimenters was 0.50; at male subjects it was only 0.06. During the subsequent reading of the instructions, all experimenters showed less smiling and only 40 per cent of the experimenters smiled at female subjects, but no experimenter smiled even a little at any male subject. Most of the smiling in this phase of the experiment was done by female experimenters (mean = 0.57) rather than males (mean = 0.10), though this difference was not very significant (p = .15). To summarize, female experimenters tended to smile more, and female subjects were recipients of significantly more smiles.
Although no one has written what Friedman (1964) calls an etiquette for the psychological experimenter, the reaction of most laboratory psychologists to these data has been to assume that female experimenters might be less competent at conducting experiments if they smile more than they ‘‘should.’’ Smiling seems frivolous in such a serious interaction as that between experimenter and subject. But data are available which show that females, by an important criterion, are at least as competent as males. According to scoring categories developed by Friedman (1964), a scale of accuracy of instruction reading was developed. Errors in the reading of instructions would lower the score from the maximum possible value of 2.00. A more competent experimenter, as a minimum, should read the instructions to subjects as they were written. Accuracy of instruction reading, then, is an index of experimental competence, though not, of course, the only one. Table 4-3 shows the male and female experimenters’ mean accuracy scores when the subjects were males and when they were females. For both male and female subjects, female experimenters read their instructions more accurately than did male experimenters (combined
Table 4–3 Accuracy of Instruction Reading

                          Sex of subject
Sex of experimenter       Male      Female
Male                      1.62      1.50
Female                    2.00      1.87
Difference                +.38      +.37
p                          .20       .12
p < .05). Among female experimenters, 80 per cent read their instructions perfectly to all subjects, whereas only 20 per cent of male experimenters were that accurate. Considering the total number of times instructions were read to subjects, female experimenters read them perfectly to 88 per cent of their subjects, whereas male experimenters read them perfectly only to 56 per cent of their subjects. There were no effects of experimenter’s or subject’s sex on the speed with which the experiment proceeded except during those periods of the rating task itself when the experimenter was preparing to show the next stimulus photo. Table 4-4 shows the mean time in seconds required during this part of the interaction by male and female experimenters when contacting male and female subjects. The only significant effect was of the interaction variety. Male experimenters were significantly slower in their preparation for presenting the next stimulus photo when the subjects were females than when they were males. Similarly female experimenters were slower when interacting with male rather than female subjects, although this tendency was not significant statistically. With the average male experimenter in his early twenties and the average female subject in her late teens, it appeared almost as though the male experimenters sought to prolong this portion of their interaction with their female subjects. This period of the experiment was earlier interpreted as having tension-releasing characteristics compared to the periods of tension increase (stimulus presentation) which preceded and followed these preparation periods. The few extra seconds of relaxed contact may have been stretched somewhat because of their intrinsic social interest when the dyads were of opposite-sexed members. Because the prerating periods were such busy times for the experimenter, we would not expect him to utilize them for even covertly social purposes.
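The crossover pattern described for Table 4-4 can be made concrete with a little arithmetic. The Python sketch below computes the two simple effects and the interaction contrast from the four cell means. The assignment of means to cells follows the reading of the table that agrees with the text (an assumption about the table layout), and the names `prep_time` and `simple_effect` are illustrative only.

```python
# Mean stimulus-preparation times in seconds (Table 4-4), keyed by
# (sex of experimenter, sex of subject). The cell assignment is an
# assumption: it is the reading of the table that agrees with the text.
prep_time = {
    ("male", "male"): 33.2,
    ("male", "female"): 38.4,
    ("female", "male"): 45.4,
    ("female", "female"): 40.3,
}


def simple_effect(experimenter_sex):
    """Female-subject mean minus male-subject mean for one experimenter sex."""
    return (prep_time[(experimenter_sex, "female")]
            - prep_time[(experimenter_sex, "male")])


male_effect = round(simple_effect("male"), 1)      # slower with female subjects
female_effect = round(simple_effect("female"), 1)  # slower with male subjects

# Interaction contrast: how much the subject-sex effect differs between
# male and female experimenters.
interaction = round(male_effect - female_effect, 1)

# Equivalent view: opposite-sex dyads versus same-sex dyads.
opposite = (prep_time[("male", "female")] + prep_time[("female", "male")]) / 2
same = (prep_time[("male", "male")] + prep_time[("female", "female")]) / 2

print(male_effect, female_effect, interaction)
print(round(opposite, 2), round(same, 2), round(opposite - same, 2))
```

Under this reading, opposite-sex dyads took about five seconds longer on average than same-sex dyads, which is the interaction the paragraph describes. The sketch reproduces only the differences among means; the significance levels depend on variability data that the table does not give.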
Observations were also available which told the degree to which the experimenter leaned in the direction of each of his subjects. Experimenters were seated diagonally across the edge of a table from their subjects so that the leaning was in a sideways direction that tended to bring experimenter and subject closer together. Table 4-5 shows the mean index numbers describing how much male and female experimenters
Table 4–4 Time Required for Stimulus Preparation

                          Sex of subject
Sex of experimenter       Male      Female    Difference    p
Male                      33.2      38.4      +5.2          <.01
Female                    45.4      40.3      -5.1          >.20
Difference                +12.2     +1.9
p                         <.20      >.20
Table 4–5 Degree of Leaning Toward Subjects

                          Sex of subject
Sex of experimenter       Male      Female
Male                      1.35      0.99
Female                    0.75      0.96
tended to reduce the distances between themselves and their male and female subjects during the entire rating period. The results for the entire interaction are similarly significant, although the instruction-reading and preinstruction periods by themselves did not show significant effects. When female subjects were contacted there was no sex-of-experimenter effect. When subjects were males, however, male experimenters leaned closer than did female experimenters (p < .05). Relative to male experimenters, females may have been more bashful or modest in assuming any posture that would move them closer to their male subjects. During the reading of the instructions male experimenters tended (p < .10) to show a higher level of general body activity (mean = 6.2) than did female experimenters (mean = 4.4). This was true regardless of the sex of the subjects contacted. Then, in the period during which subjects made their actual photo ratings, there was a tendency for all experimenters to show a greater degree of general body activity when their subjects were males (mean = 4.4) rather than females (mean = 3.9). This difference was not very significant statistically, however (p = .15). In our culture, general body activity is associated more with males (Kagan & Moss, 1962, p. 100); and male psychological experimenters, as any other members and products of their culture, do show more body activity in the experiment. That both male and female experimenters may show greater activity when contacting male subjects suggests that there may have been a kind of activity contagion and legitimation in the interactions with male subjects, who, we can only assume, were themselves more active during the experiment. Unfortunately, systematic observations have not yet been made of the subjects’ activity level. Table 4-6 is relevant to the interpretation.
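One way to read Table 4-6 is to ask whether the experimenter's own sex and the sex of the subject contribute separately to body activity. The sketch below, in Python, recomputes the marginal means from the four cell means and checks how much is left in each cell once both marginal effects are removed. The cell layout is the one implied by the marginal means in the table, and the helper names are illustrative assumptions rather than anything from the original analysis.

```python
# Mean body-activity ratings during the rating task (Table 4-6), keyed by
# (sex of experimenter, sex of subject). The layout is inferred from the
# marginal means printed in the table.
activity = {
    ("male", "male"): 4.6,
    ("male", "female"): 4.1,
    ("female", "male"): 4.3,
    ("female", "female"): 3.8,
}
sexes = ("male", "female")


def mean(values):
    """Arithmetic mean of any iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


grand = mean(activity.values())
by_experimenter = {e: mean(activity[(e, s)] for s in sexes) for e in sexes}
by_subject = {s: mean(activity[(e, s)] for e in sexes) for s in sexes}

# Residual left in each cell after removing the grand mean and both
# marginal effects; residuals near zero mean the two effects simply add.
residual = {
    (e, s): activity[(e, s)] - (by_experimenter[e] + by_subject[s] - grand)
    for e in sexes
    for s in sexes
}

print({k: round(v, 2) for k, v in by_experimenter.items()})
print({k: round(v, 2) for k, v in by_subject.items()})
print(round(grand, 2), max(abs(r) for r in residual.values()) < 1e-9)
```

With these means the residuals vanish, so the tabled pattern is purely additive: the experimenter's own sex and the subject's sex each shift activity by a fixed amount, with no interaction between them.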
Although none of the effects reach statistical significance, it can be seen that during the rating task on which these means are based, male experimenters move more, and most of all when contacting male subjects. Female experimenters move less, and least of all when contacting female subjects. Sex differences in the degree of motility of the experimenters seem to be well augmented by the hypothesized contagion and legitimation effects of being in interaction with people who very likely vary in their own degree of body motility. Another line of evidence is available that male and female experimenters behave differently as they conduct their psychological experiments. Suzanne Haley kindly made the raw data available for this analysis. She had 12 male and 2 female experimenters administer the same photo-rating task to 86 female subjects. After the experiment, subjects were asked to rate their experimenters on how well they liked them and on 26 behavioral variables—e.g., degree of friendliness of the experimenter. Table 4-7 shows the ratings of the experimenter’s behavior as a function of the sex of the experimenter. The correlations are point biserials and
Table 4–6 Experimenter’s Body Activity

                          Sex of subject
Sex of experimenter       Male      Female    Means
Male                      4.6       4.1       4.35
Female                    4.3       3.8       4.05
Means                     4.45      3.95      4.20
when positive in sign indicate that it was the male experimenters who were rated higher on the scales listed in the first column. The first column gives the correlations resulting from this analysis. It can be seen from the table that the female subjects of this experiment rated their male experimenters as more friendly in general, as having more pleasant and expressive voices, and as being more active physically. The nine correlations tabulated for this analysis were those significant at the .05 level out of the total of 26 possible. (As might be expected from the obtained correlations, male experimenters were also better liked, rpb = .34, p < .005.) The magnitudes of the tabulated correlations tend to be conservative because only 10 of the 86 subjects were contacted by female experimenters. The median correlation of +.27 becomes +.40 when corrected for this imbalance. Some additional preliminary data are available which suggest the stability of these correlations. The same photo-rating task was administered by 15 male and 3 female experimenters to a total of 57 subjects; 40 females and 17 males. All these interactions were recorded on sound film and then rated by three observers for just the preinstruction-reading period on the dimensions listed in Table 4-7. The right side of the table gives the correlations. Male experimenters were judged more friendly and pleasant as before. With one exception, the correlations between the sex and behavior of the experimenter were similar to those obtained from the analysis of Haley’s data. That exception was the variable of pleasantness of voice, which in this replication was reversed in sign though very small in magnitude. Since in this study only 16 df were available, only two of the correlations reached even the .10 level of significance.
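Two pieces of arithmetic in this passage can be checked directly. The text does not say how the correction for imbalance was carried out; the Python sketch below assumes one common approach, converting the point-biserial correlation to a standardized mean difference at the observed 10:76 split and then back to a correlation at a 50:50 split. It also tests the observer correlations from Table 4-7 against tabled critical values of t for 16 df. The function names are illustrative, and the negative sign on the pleasant-voiced correlation is taken from the text's statement that it was reversed in sign.

```python
import math


def rebalance_point_biserial(r, n_small, n_total):
    """Re-express a point-biserial r as if the two groups were split 50:50.

    Assumed approach (not stated in the text): convert r to a standardized
    mean difference at the observed split, then convert back with p = q = .5.
    """
    p = n_small / n_total
    q = 1 - p
    d = r / math.sqrt(p * q * (1 - r ** 2))
    return d * 0.5 / math.sqrt(1 + 0.25 * d ** 2)


def t_for_r(r, df):
    """t statistic for testing a correlation against zero, df = N - 2."""
    return r * math.sqrt(df / (1 - r ** 2))


corrected = rebalance_point_biserial(0.27, n_small=10, n_total=86)
print(round(corrected, 2))  # 0.4

# Two-tailed critical values of t for 16 df, from standard tables.
T_CRIT_10, T_CRIT_05 = 1.746, 2.120

# Observer correlations in the order listed in Table 4-7.
observer_r = [0.47, 0.28, 0.36, 0.35, 0.41, -0.11, 0.20, 0.20, 0.31]
reaching_10 = [r for r in observer_r if abs(t_for_r(r, 16)) >= T_CRIT_10]
print(reaching_10)  # [0.47, 0.41]
```

Under these assumptions the rebalanced median comes out at about .40, in line with the figure given, and only the correlations for friendliness and enthusiasm clear the .10 criterion with 16 df.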
From the results of both these studies it seems reasonable to conclude that, either by asking the subjects themselves or by asking observers who were not participants in the experiment, the behavior and manner of experimenters are associated with their sex. For the person perception task employed, and when interacting primarily with female subjects, male experimenters behave in a more friendly, personally involved, and physically active manner. Since two of the three observers who rated the experimenters were themselves females, this conclusion must be tempered by the
Table 4–7 Ratings of Experimenters and Sex of Experimenter

                          Source of ratings
                          Subjects            Observers
Ratings                   rpb       p         rpb       p
Friendly                  +.32      .005      +.47      .05
Pleasant                  +.37      .001      +.28      --
Interested                +.27      .02       +.36      --
Encouraging               +.27      .02       +.35      --
Enthusiastic              +.27      .02       +.41      .10
Pleasant-voiced           +.42      .001      -.11      --
Expressive-voiced         +.26      .02       +.20      --
Leg activity              +.30      .01       +.20      --
Body activity             +.23      .05       +.31      --
Median                    +.27      .02       +.31      --
Table 4–8 Video and Audio Channel Ratings of Experimenters and Sex of Subject

                          Observation channel
Ratings                   Video: rpb    Audio: rpb
Liking                    .29           .39
Friendly                  .21           .35
Pleasant                  .33           .29
Encouraging               .30           .31
Honest                    .33           .26
Relaxed                   .32           .21
Median                    .31           .30
possibility that female subjects or observers are biased to perceive male experimenters in the direction indicated. For the 18 experimenters and 57 subjects whose interactions were recorded on film, there were consistent differences in the way experimenters were judged to behave when their subjects were males (N = 17) as compared to females (N = 40). For the preliminary data now available, the instruction-reading phase of the interaction was rated by one group of observers (N = 4) who could see the films but not hear the sound track. Another group of judges (N = 3) heard the sound track of the films but could not see the interaction. Table 4-8 shows the correlations between the sex of the subject contacted and the ratings of the experimenter separately for those observers who could see but not hear and those who could hear but not see the interaction between experimenters and subjects. Of 17 ratings that could be made under both conditions, 6 showed a correlation of .20 or larger under both conditions of observation. In every one of these 6 cases the direction of the correlation was the same under both conditions of observation, and the numerical values agreed closely. Judging both by looking at the experimenters and also by listening to their tone of voice, experimenters were more likable, pleasant, friendly, encouraging, honest, and relaxed when contacting female subjects than when contacting male subjects. The absolute size of the correlations would probably have been larger if there had been a more nearly equal division of male and female subjects (50:50 rather than 70:30) and if the reliability of the observers’ judgments had been higher. The median reliability of the video variables tabulated was only .37 and of the audio variables it was .17. Corrected for attenuation the median of the correlations under the video condition becomes .53, p < .03, and under the audio condition the median correlation becomes .65, p < .01.
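The correction for attenuation mentioned here is, in its classical form, the observed correlation divided by the square root of the product of the two reliabilities. The Python sketch below applies that formula to the tabled medians, treating the sex of the subject as perfectly reliable so that only the observers' reliability enters. This is an illustration of the technique under stated assumptions, not a reconstruction of the original computation, and the function name is hypothetical.

```python
import math


def correct_for_attenuation(r, reliability_x, reliability_y=1.0):
    """Classical correction: r / sqrt(r_xx * r_yy).

    reliability_y defaults to 1.0 on the assumption that one variable
    (here, sex of subject) is measured without error.
    """
    return r / math.sqrt(reliability_x * reliability_y)


# Median correlations and median reliabilities quoted in the text.
video = correct_for_attenuation(0.31, 0.37)
audio = correct_for_attenuation(0.30, 0.17)
print(round(video, 2), round(audio, 2))  # 0.51 0.73
```

Applied to the medians in this way, the formula gives roughly .51 and .73 rather than the .53 and .65 reported. The reported values were presumably obtained differently, for example by correcting each rating with its own reliability before taking the median, so the sketch shows the direction and rough size of the adjustment rather than the exact figures.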
From the preliminary analysis of the filmed interactions between experimenters and subjects it seems that male experimenters behave more warmly than do female experimenters, at least when the subjects are primarily females. In addition, both male and female experimenters behave more warmly toward their female than toward their male subjects. The more molecular observations (e.g., glancing) reported earlier and made by Neil Friedman and Richard Katz, in general, tend to support these conclusions with one exception. That was the finding that female experimenters, at least sometimes, smiled more at their subjects than did male experimenters. The results for the effect on experimenter behavior of the sex of the subject contacted, however, are sufficiently stable to warrant retention of the
conclusion that in the psychological experiment, a certain degree of chivalry is maintained. Within the past few years a number of investigators have pointed out the interacting effects of experimental variables and the sex of subjects (Carlson & Carlson, 1960; Hovland & Janis, 1959; Kagan & Moss, 1962; McClelland, 1965; Sarason, Davidson, Lighthall, Waite, & Ruebush, 1960). Both simple, across-the-board sex differences and interacting sex differences may have multiple sources, including those that are genetic, morphological, endocrinological, sociological, and psychological. To this list must now be added the variable of differential treatment of male and female subjects. An experiment employing male and female subjects is likely to be a different experiment for the males and for the females. Because experimenters behave differently to male and female subjects even while administering the same formally programmed procedures, male and female subjects may, psychologically, simply not be in the same experiment at all. In order to assess the extent to which obtained sex differences have been due to differential behavior toward male and female subjects, it would be necessary to compare sex differences obtained in those studies that depended for their data on a personal interaction with the subject and those that did not. It would be reassuring to learn that sex differences obtained in a personal interaction between experimenter and subject were also obtained in mailed-out questionnaires and in experiments in which instructions to subjects were tape-recorded and self-administered. In Part III, such methodological implications will be considered in detail.
Experimenter’s Age
As in the case of the experimenter’s sex, the age of the experimenter can be readily judged, fairly and accurately, by the subject. There has been less work done to assess the effects of the experimenter’s age on subjects’ responses than has been the case for experimenter’s sex. What work has been done suggests that, at least sometimes, the experimenter’s age does affect the subject’s response. One recent investigation was carried out by Ehrlich and Riesman (1961). Their analysis was of the data collected from a national sample of adolescent girls and included the girls’ responses to four questions of a more or less projective nature. One of these questions, for example, involved the presentation of a picture of a group of girls in which someone suggested they all engage in behavior that one of the girl’s parents had forbidden. The respondent was to say what that particular girl’s response would be to the group’s suggestion. The answers to the four questions could be coded as to whether they would be socially acceptable or unacceptable by parental standards. The interviewers in this survey were all women, primarily of middle-class background, and ranging in age from the early twenties to the late sixties. The most dramatic effects of the interviewers’ ages were found to depend on the subjects’ ages. Among respondents aged 15 or younger there was only the smallest tendency for younger interviewers to be given more “unacceptable” type responses. Interviewers under 40 received 6 per cent more such replies than interviewers over 40. Among the older girls, however, those over 15, the younger interviewers evoked 44 per cent more unacceptable responses than did the older interviewers. It was the older girls,
then, who were more sensitive to the age differences among interviewers and who, perhaps, felt relatively freer to say ‘‘unacceptable’’ things to people closer to themselves in age. In the case of interviewer’s age, then, the effects were found not to be simple but rather interactive. Often, as we saw earlier in this chapter, the effects of experimenter’s sex were similarly interactive rather than simple. The results just now reported tell us of the relationship between a data collector’s age and the subjects’ responses, but they do not tell us whether it is the age per se that makes the difference. Older interviewers differ in various ways from younger ones, and perhaps they behave differently toward their subjects as well. Just this question was raised by Ehrlich and Riesman. They had available some psychometric data on their interviewers, including scores on their ascendance or dominance. There was a tendency, though not statistically significant, for the older interviewers to score as more ascendant. Presumably this difference in personality test scores was reflected in differences in behavior during the interview. The less imposing behavior of the younger interviewers may have made it easier for the older girls to voice their less acceptable responses. An analysis cited earlier in connection with the effects of experimenter’s sex also provides evidence bearing on the effects of experimenter’s age (Benney, Riesman, & Star, 1956). The data suggest that when the response required is a frank discussion of sexual maladjustment, the age of the data collector makes some difference, but particularly so when the age of the subject is considered. Among subjects under 40 there were 10.5 per cent more frank responses to interviewers under 40 than to interviewers over 40. However, among respondents over 40 there were 52.2 per cent more frank responses for the younger than for the older interviewers. 
Combining male and female interviewers and male and female subjects, when both participants are over 40, a frank discussion of sex matters is simply less likely to occur.
Experimenter’s Race
The skin color of the experimenter may also affect the responses of the subject (Cantril, 1944; Williams, 1964), though not all types of responses are equally susceptible (Williams & Cantril, 1945). Some of the evidence for the survey research situation is provided by Hyman et al. (1954). Just as older interviewers tended to receive more “proper” or acceptable responses from some of their subjects, so did white interviewers receive more proper or acceptable responses from their Negro respondents than did Negro interviewers. The data cited were collected during World War II. Half the Negro respondents were interviewed by white, half by Negro, interviewers. One of the questions asked was whether Negroes would be treated better or worse by the Japanese in the event they won the war. When interviewed by Negroes, only 25 per cent of the respondents stated that they would be worse off under Japanese than under American rule. When interviewed by whites, however, 45 per cent stated that they would be worse off under Japanese rule (p < .001). When interviewed by whites, only 11 per cent of the Negroes stated that the army was unfair to Negroes, but when the interviewers were Negroes, 35 per cent of respondents felt the army was discriminatory (p < .001). Additional evidence of this type is presented by Summers and Hammonds (1965), who also present some interesting data of their own. Their data, complementing the
Hyman data, suggest further the interacting nature of the skin color of the experimenter and the skin color of the subject. In their survey research, the respondents were white and were contacted by a research team consisting sometimes of two whites and sometimes of one white and one Negro. The questionnaire was concerned with racial prejudice. When both investigators were white, 52 per cent of the respondents showed themselves to be highly prejudiced. When one of the investigators was Negro, only 37 per cent were equally prejudiced. These results (p < .001) are the most remarkable for the fact that subjects responded in writing and anonymously. Just as Negro respondents were shown to say the ‘‘proper’’ thing more often to a white interviewer, so too did white respondents say the ‘‘right’’ thing more often to Negro data collectors. The experimenter’s skin color also interacts with other characteristics of the subject to affect the subject’s response. In the Summers and Hammonds study, those respondents whose father’s income was higher showed a greater sensitivity to the race of the data collector. When father’s income was below $5,000, 17 per cent of the subjects decreased their stated degree of racial prejudice when one experimenter was Negro (p < .50). When father’s income was over $5,000 but less than $10,000, 30 per cent of respondents claimed less prejudice (p < .005). When father’s income was over $10,000 there was a 38 per cent reduction in admitted prejudice (p < .005). As socioeconomic status increases, the lessons of politeness and social sensitivity seem better taught and better learned. The same trend appears when church attendance is substituted for father’s income. When church attendance is minimal, only 13 per cent of subjects show a decrease in admitted racial prejudice when one investigator is Negro. 
When church attendance is moderate, 21 per cent (p < .05) show a decrease of prejudice, and when church attendance is very regular, 44 per cent (p < .001) show sensitivity to the race of the experimenter. In this case, the lessons of the church seem to be the same as the lessons of the social class. Even when the response investigated is physiological, the race of the experimenter may affect that response. Rankin and Campbell (1955) showed that the galvanic skin response showed a greater increase if the experimenter adjusting the apparatus was Negro rather than white. More recently, Bernstein (1965) reported that basal skin impedance (measured in kilohms) was higher when the experimenter was white rather than Negro regardless of the race of the subject. In general, the effect of experimenter’s race on subjects’ physiological responses is poorly understood and, up to the present, little studied. A number of studies are available which suggest that performance on various psychological tests may be affected by the race of the experimenter. Employing a test of expression of hostility, Katz, Robinson, Epps, and Waly (1964) carried out just such a study employing a white and Negro experimenter. Half the time the Negro subjects had their task structured as an affectively neutral research procedure, and half the time the task was structured as an intelligence test. When the task was presented as a neutral one there were no significant effects of the experimenter’s race on subjects’ hostility scores. However, when the task was structured as an intelligence test, significantly less hostility was obtained when the experimenter was white (p < .01). The authors’ interpretation of this finding was that Negroes tended to control their hostility more when contacted by a white rather than a Negro experimenter. This interpretation is very much in line with that implied by the data from
survey research studies in which Negroes gave more ‘‘proper’’ responses to their white as compared to Negro interviewers. When the tests really are tests of intellectual functioning of various kinds, the race of the experimenter also has its effects. Thus, Katz and his coworkers describe an experiment in which the task was similar to one of the subtests of standard tests of intelligence, in this case digit-symbol substitution. When the task was structured as a test of coordination, the Negro subjects performed better for the white than for the Negro experimenter. It was as though the subjects were unwilling to demonstrate their ‘‘good sense of rhythm’’ to the Negro but quite willing to demonstrate it for the white experimenter who might, in their eyes, have expected it. When the same task was structured as an intelligence test, performance was relatively better with the Negro than with the white experimenter. Perhaps again these subjects were doing what they perceived to be the socially appropriate thing—in this case performing not so brightly for the white experimenter. There are, too, studies that showed no effects of experimenter skin color on subjects’ intellectual performance. In the same study described, for example, Katz and his associates found no effects of the experimenter’s race on the adequacy of subjects’ concept formation. Other examples of negative results are given by Canady (1936) and Masling (1960).
Experimenter’s Religion
The experimenter’s religion as a variable affecting subjects’ responses has been investigated primarily in the area of survey research. Hyman and his collaborators (1954) give us one example. In 1943, over 200 subjects were interviewed by Jewish and Gentile data collectors who asked whether Jews had too much, too little, or the right amount of influence in the business world. Of the Gentile subjects contacted by Gentile interviewers, 50 per cent felt that Jews had too much influence. When the interviewers were Jewish, however, only 22 per cent thought so. Once again the respondents seemed to have said the right thing. One caution in the interpretation of these data was advanced by Hyman et al. In this study, interviewers were free to pick their own respondents within certain limits, so that Jewish interviewers might, perhaps unwittingly, have chosen more sympathetic Gentile respondents. Robinson and Rohde (1946) varied both the appearance of Jewishness and the Jewishness of the interviewer’s name in their study of the effect of perceived religion of the interviewer on the extent of anti-Semitic responses in public opinion research. When interviewers neither looked Jewish nor gave Jewish names, about 23 per cent of respondents felt that Jews had too much power. When the interviewer was Jewish-appearing but did not give a Jewish name, about 16 per cent of subjects felt Jews had too much power. When the interviewer looked Jewish and gave a Jewish name, only 6 per cent of respondents felt Jews had too much power. In this study, the samples assigned the different types of interviewer were well matched, so that the results are more likely due to the respondent’s perception of the interviewer rather than to a selection bias on the part of the data collector. Unlike the situation described earlier when race of experimenter was the variable, it was the lower economic status subjects who were more sensitive to the religion of the investigator.
Much of what has been learned about the effects of various biosocial attributes of the data collector on the responses obtained from subjects has come from the field of survey research. This seems natural enough as has been pointed out by Hyman et al. (1954) and by Mosteller in a personal communication (1964). In that field the numbers of data collectors are large enough to permit the systematic evaluation of interviewer differences with or without an attempt to relate these differences to specific attributes of the interviewer. But there is no reason to assume that the effects obtained in survey research of various experimenter attributes would not hold in such other data-collecting contexts as the laboratory experiment. Particularly for the variables of experimenter age and religion, however, there is little direct evidence to date that they operate in the laboratory as they do in the field. The general conclusion to be drawn from much of the research reviewed here seems to be that subjects tend to respond in the way they feel to be most proper in the light of the investigator’s attributes. That subjects in experiments as well as respondents in surveys want to do the right thing and want to be well evaluated has been suggested by Orne (1962), Riecken (1962), and Rosenberg (1965). Before leaving the general topic of the biosocial attributes of the experimenter as determinants of subjects’ responses, it would be well to repeat a caution suggested earlier. There is no way to be sure that any of the effects discussed so far are due to the physical characteristics of the experimenter rather than to some correlated variables. In fact, it was found quite likely, especially for the variable of experimenter’s sex, that experimenters differing in appearance also behave differently toward their subjects. It could be this behavioral variation more than the variation of physical attributes that accounts for the effects on subjects’ responses.
5 Psychosocial Attributes
The experimenter attributes discussed in the last chapter were all readily assessable by inspection. The experimenter attributes to be discussed now are also readily assessable, but not simply by inspection. The anxiety or hostility of those experimenters functioning well enough to be experimenters at all must be assessed more indirectly, sometimes by simply asking the experimenter about it, more often by the use of standard psychological instruments.
Experimenter’s Anxiety
Winkel and Sarason (1964) have shown that the anxiety level of the experimenter may interact complexly with subject variables and with experimental conditions in determining the verbal learning of the experimental subjects. They employed 24 male experimenters, all undergraduates, half of whom scored high on a scale of test anxiety and half of whom scored low. Subjects were 72 male and 72 female students of introductory psychology. Half the subjects scored as high-anxious and half as low-anxious. Results showed that when the experimenters were more anxious there was no difference between male and female subjects in their performance on the verbal learning task. However, when the experimenters were less anxious, female subjects performed better than males. The optimal combination of the experimenter’s anxiety and the subject’s anxiety and sex was that in which the subject was a low-anxious female in contact with a low-anxious experimenter. In this condition performance was better than in any of the others. This interaction was further complicated by the still higher order interaction which involved the additional variable of the type of instructions given the subjects. When the experimenter attribute under investigation is anxiety, just as in the case of experimenter’s sex, extremely complicated interactions tend to emerge if the experiment allows for their assessment. Sarason (1965) describes an unpublished study by Barnard (1963) which showed that degree of disturbance of the experimenter as determined from a phrase association task was a predictor of the subjects’ degree of disturbance in the same task. When the task is the interpretation of ink blots rather than the learning of verbal materials, the anxiety level of the experimenter as defined by his own Rorschach
Book Two – Experimenter Effects in Behavioral Research
responses also makes a difference. More anxious experimenters obtained from their subjects Rorschach responses interpreted as more hostile and more passive than the responses obtained by experimenters judged less anxious. In addition, the more anxious experimenters obtained from their subjects more fantasy material and a higher degree of judged self-awareness (Cleveland, 1951; Sanders & Cleveland, 1953).

When the task involved memory for digits, a subtest of many standard tests of intelligence, the degree of “adjustment” or anxiety of the experimenter affected subjects’ performance (Young, 1959). The measure of adjustment, a variable correlated generally with anxiety, was based on the Worchel Self Activity Inventory administered to introductory psychology students. These students then served as experimenters and administered the digit span test to their peers. Subjects who were contacted by more poorly adjusted experimenters performed better at the task than did subjects contacted by the better adjusted experimenters. The results of this study are not consistent with those found by Winkel and Sarason (1964) for a verbal learning task. In that experiment, described above, anxiety of the experimenter was an effective variable only in interaction with subject variables or instruction variables. If anything, the more anxious experimenters tended to obtain less adequate performance. That seemed also to be the case for some data reported by McGuigan (1963). The more neurotic of his nine experimenters tended to obtain the poorer performance from their subjects in a learning task.

From the studies considered, it seems safe to conclude that the experimenter’s anxiety level (or perhaps adjustment level) may affect subjects’ responses for a variety of tasks; but the nature of the effect is not predictable on the basis of our current knowledge. This conclusion is borne out by the results of the two experiments reported next.
In both studies, the task was that described earlier which required subjects to rate the success or failure of people pictured in photographs. In both experiments, the experimenters had been tested for anxiety level defined by the Taylor (1953) Scale of Manifest Anxiety. In one of these studies 40 experimenters administered the photo rating task to 230 subjects, half of whom were males, half females. In this study more anxious experimenters obtained higher ratings of success of the photos they asked their subjects to rate. The correlation was +.48, p = .02 (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). In the other experiment, 26 experimenters administered the same photo rating task to 115 female subjects. In this experiment, it was the less anxious experimenters who obtained higher ratings of the success of the photos they presented to their subjects. The correlation this time was −.54, p < .01 (Rosenthal, Kohn, Greenfield, & Carota, 1965).

Final evidence for the complexity of the relationship between experimenter’s anxiety and subject’s response comes from a study of verbal conditioning (Rosenthal, Kohn, Greenfield, & Carota, 1966). In that experiment 19 male experimenters conducted the verbal reinforcement procedures with 60 female subjects. Sentences were to be constructed by the subjects and each sentence was to begin with any one of six pronouns (Taffel, 1955). Whenever first-person pronouns were selected the experimenter said the word “good.” The increase in the usage of first-person pronouns from the beginning to the end of the experiment was the measure of verbal conditioning. This time the high- and low-anxious experimenters did not differ from each other in the degree of verbal conditioning shown by their subjects. However, both high- and low-anxious experimenters obtained significantly more conditioning than did those experimenters who scored as medium-anxious (p = .08).
Psychosocial Attributes
There is little information available to suggest what it is about the appearance or behavior of more or less anxious experimenters that might affect their subjects’ responses. Only the barest clues are available from a preliminary analysis of the sound motion pictures mentioned earlier of experimenters interacting with subjects. Based only upon the ratings of the brief preinstruction phase of the experiment in which the experimenter asked for the subject’s name, age, class, and major field, more anxious experimenters were judged to be more active in their leg movements (r = +.42, p = .08) and in the movement of their entire body (r = +.41, p = .09). These relationships tend only to add to the construct validity of the anxiety scale employed. We might expect that more anxious experimenters would be somewhat more fidgety in their interaction with their subjects. The movement variables mentioned were rated by four undergraduate observers who saw the films but did not hear the sound track. Three additional undergraduate observers listened to the sound track but did not see the films. Based on their ratings, those experimenters who scored as more anxious were judged to have a less dominant tone of voice (r = −.43, p = .07) and a less active tone of voice (r = −.44, p = .07).¹ More anxious experimenters, then, may behave toward their subjects in a way that communicates their tension through excessive fidgeting and a meeker, less self-assured tone of voice. (This impression is strengthened by some unpublished data kindly made available by Ray Mulry. Analysis of these data showed more anxious experimenters to be rated more shy [r = +.23, p = .06] by their subjects during an experiment involving motor performance.) From the evidence presented, this constellation of experimenter behavior seems sometimes to increase, sometimes to decrease, and sometimes not to affect the subjects’ performance at all. To make a notable understatement: more research is needed, much more.
Experimenter’s Need for Approval

Crowne and Marlowe (1964) have shown that the need for social approval as measured by their scale (the Marlowe-Crowne Social Desirability Scale) predicts cautious, conforming, and persuasible behavior in a variety of experimental situations. Until recently only “subjects” had been administered this instrument, but now there are a few studies that have related the “experimenter’s” need for approval to his subject’s responses in various experimental situations. Mulry (1962), for example, employed 12 male experimenters to administer to some 69 subjects a pursuit rotor task requiring perceptual-motor skill. A number of tests, including the Marlowe-Crowne SD Scale, were administered to the experimenters. Mulry found a tendency for experimenters scoring higher on the need for approval to obtain superior performance on the pursuit rotor task. Experimenters higher in the need for approval obtained especially good performance from their male subjects when the experimenters had been led to believe that they themselves were good at a pursuit rotor task.

¹ This general pattern of correlations between experimenter anxiety and experimenter behavior was also found on analysis of the instruction-reading period of the experiment. Some of the correlations became somewhat smaller, some became somewhat larger. During this period of the experiment, too, experimenters scoring as more anxious on the Taylor Scale were judged as more tense by the film observers. With or without benefit of sound track the correlation was the same: +.40 (p < .10).
The unpredictability of the effects of the experimenter’s anxiety on his subject’s responses is matched by the unpredictability of the effects of experimenter’s need for approval. Thus, in one experiment employing the person perception task described earlier, experimenters lower in need for approval obtained ratings of the photos as being of more successful people. The correlation was −.32, p = .10 (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). In another experiment, also cited earlier, it was experimenters higher in need for approval who obtained more “success” ratings. That correlation was +.38, p = .05 (Rosenthal, Kohn, Greenfield, & Carota, 1965). Within a single experiment, Marcia (1961) obtained similarly unpredictable relationships. He employed seven male experimenters and six female experimenters to administer the same standard person perception task to subjects. Among male experimenters, the correlation between their need for approval scores and their subjects’ ratings of “success” was −.27. Among female experimenters the analogous correlation was +.43. These two correlations, although not significantly different from zero, nor from each other for such small sample sizes, do suggest that the sex of the experimenter may interact with such experimenter attributes as need for approval to affect the subjects’ responses. In one of the experiments cited earlier, the experimenter’s need for approval was not related to the subject’s susceptibility to the verbal reinforcements of the experimenter (Rosenthal, Kohn, Greenfield, & Carota, 1966). In that experiment, however, each experimenter was rated by each of his subjects on his behavior during the interaction with that subject. Although the anxiety level of the experimenter was found to be unrelated to any of the subjects’ ratings, that was not the case for the experimenter’s need for social approval.
Table 5-1 shows the correlations between subjects’ ratings of their experimenters and the experimenters’ need for approval. The pattern of correlations obtained apparently did not affect the subjects’ responses in this experiment on verbal conditioning. But presumably where a quieter, more enthusiastic but less likable experimenter would affect his subjects differently, we would expect the experimenter’s need for approval to affect his subjects by way of these different behaviors. That experimenters higher in need for approval should be less well liked is predictable from the work of Crowne and Marlowe (1964). However, that they should be less personal does not seem to follow from what is known of the need for approval. If anything, these experimenters should try too hard to be friendly, thereby becoming less popular with their subjects.

Table 5–1 Experimenter’s Need for Approval and Experimental Behavior as Seen and Heard “Subjectively”

Behavior        Correlation    p
Personal        −.32           .02
Loud            −.27           .05
Enthusiastic    +.27           .05
Talkative       −.22           .10
Likable         −.22           .10

Once again we look to the preliminary analysis of the filmed interactions between experimenters and subjects as they transact their preinstructional business. The experimenters’ need for approval was not found to be related to any of the observations made of the films without benefit of the sound track. When observers had access to both visual and auditory cues, only three variables were found to be related to the experimenters’ need for approval. Experimenters higher in need for approval were judged to have a more expressive face (r = +.42, p = .08), to smile more often (r = +.44, p = .07), and to slant their bodies more in the direction of their subjects (r = +.39, p = .10). (Ratings of these last two variables were made available by Neil Friedman and Richard Katz.) These findings are just what we would expect from the person higher in need for approval (Crowne & Marlowe, 1964). It was the observations made of the sound track alone that yielded the most interesting information.² Table 5-2 shows the larger correlations between experimenter’s need for approval and ratings by the “objective” observers, i.e., those who were not themselves subjects of the experimenter.

Table 5–2 Experimenter’s Need for Approval and Experimental Behavior as Heard “Objectively”

Behavior             Correlation    p
Personal             +.57           .02
Friendly             +.47           .05
Dominant             +.46           .05
Speaks distinctly    +.46           .05
Expressive-voiced    +.45           .06
Active               +.41           .10
Likable              +.40           .10
Enthusiastic         +.39           .10
Pleasant-voiced      +.39           .10

These “tone-of-voice” variables partially agree with the observations made by subjects themselves in a different experiment and given a different task (Table 5-1). In both cases experimenters higher in need for approval were judged as more enthusiastic. In the verbal conditioning experiment, however, subjects found these experimenters less personal and less likable, whereas in the photo-rating experiment independent judges of the experimenters’ tone found them more personal and more likable if they were higher in need for approval. It is pleasant to acknowledge the consistencies but difficult to account for the differences. The two experiments differed in the nature of the experimental tasks, the samples of experimenters, and the type of judgments made of the experimenter’s behavior. The subject of the experiment is closer physically to the experimenter and may observe things not observable by the “objective” observer of the motion picture or sound track record.
On the other hand, the interacting subject is much busier than the “objective” observer, who can attend completely to the experimenter’s behavior without having another task of his own to perform. A reconciliation of the differences is possible if we can assume that to be judged personal and likable from the tone of voice alone is not at all the same thing as to be judged similarly on the basis of all available sense modalities. What does seem clear is that experimenters higher or lower in need for approval are likely to behave differently in interaction with their subjects. Sometimes, but not always, this differential behavior is likely to affect the subject’s response.

² The same pattern of correlations was also obtained when the behavior during the instruction-reading period was analyzed, though fewer of the correlations reached the .10 level of significance.
Experimenter’s Birth Order

The order of birth within the family is not, in the usual sense, a psychological variable. It is not defined in terms of the subject’s behavior except in the narrow sense that it is usually the subject’s statement of his ordinal position, which is used as the operational definition of the variable. Since Schachter’s already classic work (1959), birth order has been investigated by many workers and has been shown to bear significant relationships to other, more “properly” psychological variables. One experiment shows that for the person perception task described earlier, firstborn experimenters tend to obtain higher ratings of the success of persons pictured in photos than do later-born experimenters (χ² = 5.85, p = .02; Rosenthal, Kohn, Greenfield, & Carota, 1965). Another experiment, however, the one employing the verbal conditioning procedure, showed no effects on subjects’ performance of the experimenter’s birth order (Rosenthal, Kohn, Greenfield, & Carota, 1966). In that study, it may be recalled, subjects made judgments of their experimenter’s behavior during the experimental transaction. Table 5-3 shows the correlations between these ratings of the experimenter’s behavior and his birth order. The general picture that emerges is that, as experimenters, firstborns are faster but more reluctant speakers, employing fewer body and facial movements and expressions, than their later-born counterparts. In this verbal conditioning experiment, this combination of characteristics differentiating firstborn from later-born experimenters appeared to have no effect on subjects’ responses; in other experiments it might. In the experiment by Mulry (1962) already cited, there was no relationship between the birth order of the experimenter and the motor performance of his subjects.
An analysis of the ratings these subjects made of their experimenters, however, showed that firstborn experimenters behaved differently during the experiment than did later-borns. Firstborn experimenters were rated as more mature (r = +.24, p = .05) and more defensive (r = +.22, p = .07) than later-borns, which seems consistent with the picture that emerges from Table 5-3 of firstborns as somewhat more staid and motorically controlled people. Further analysis of Mulry’s data, however, revealed that firstborn experimenters were also rated as more talkative (r = +.24, p = .05) than later-borns. This is directly opposite to the relationship reported in Table 5-3 and is not easily reconciled by the fact that Mulry’s task was motor while the other task was verbal.

Table 5–3 Experimenter’s Earlier Birth and Behavior in a Verbal Conditioning Experiment

Behavior           Correlation    p
Talkative          −.37           .006
Slow-speaking      −.32           .02
Body activity      −.32           .02
Trunk activity     −.27           .05
Hand gestures      −.26           .05
Expressive face    −.24           .08

A third experiment in which the birth order of the experimenter could be correlated with his behavior during the experiment was the study in person perception which had been filmed. In this experiment there was no relationship between the experimenter’s birth order and the degree of success perceived by his subjects in the faces to be judged. However, during the instruction-reading phase of the interaction, firstborn experimenters were seen and heard to behave more actively and officiously than later-born experimenters. Table 5-4 shows the relevant correlations.

Table 5–4 Experimenter’s Earlier Birth and Behavior in a Person Perception Experiment

Behavior            Correlation    p
Hand gestures       +.50           .05
Body activity       +.48           .05
Head activity       +.47           .05
Arm gestures        +.41           .10
Important-acting    +.41           .10

Observations made during the brief preinstructional phase were not significantly correlated with the experimenter’s birth order, though the correlations based on that phase were all in the same direction as those based on the instruction period. The results shown in Table 5-4 are opposite in direction to those obtained in the verbal conditioning study (Table 5-3). It cannot be said whether the difference is due to the different tasks employed in the two studies or to the fact that the observers in the one case were the subjects themselves rather than external observers of sound motion pictures. As for the variable of talkativeness which yielded opposite relationships in the verbal conditioning and the motor performance experiments, it was not significantly related to birth order in the filmed study. We are left with the unsatisfying conclusion that the birth order of the experimenter only sometimes affects the responses he obtains from his subjects; that more often his birth order is related to his behavior in the experimental interaction; and that the nature of this behavior seems to interact at least with the type of experiment he is conducting.
Experimenter’s Hostility

The work of Sarason (1962) and of Sarason and Minard (1963) has already been cited in connection with the effects of experimenter’s sex. It will be recalled that greater hostility of the experimenter was predictive of obtaining more hostile verbs in a sentence construction task (Sarason, 1962). This was especially the case when the subjects, too, tended to be more hostile. Among experimenters scoring low in hostility, those subjects scoring high in hostility emitted 9 percent fewer hostile verbs than did subjects scoring low in hostility. Among experimenters scoring high in hostility, those subjects scoring high in hostility emitted 17 percent more hostile verbs than did subjects scoring low in hostility. The interaction was significant at the .05 level. When the experimenters reinforced subjects’ use of first-person pronouns by saying “good,” the hostility level of the experimenter was again found to make a difference, this time by affecting the increase in the use of the reinforced responses from earlier to later trials. Actually, it was the interaction of experimenter’s hostility and his ascribed prestige that led to the dramatic effects obtained (Sarason & Minard, 1963). The increase in the use of the reinforced responses was only 4 percent when the experimenter was low in hostility and high in prestige and only 5 percent when he
was high in hostility and low in prestige. The increase, however, was 47 percent when the experimenter was high in both hostility and prestige, and it was 52 percent when he was low in both hostility and prestige. Once again the complex nature of the effects of experimenter attributes on subjects’ responses is demonstrated; and once again, the explanation is far from intuitively obvious. Additional evidence is presented by Sarason (1965), who cites the unpublished work of Barnard (1963). Barnard administered a test of hostility to both subjects and experimenters and found that subjects contacted by less hostile experimenters showed a greater degree of disturbance on a phrase association test than did subjects contacted by more hostile experimenters. The importance of distinguishing between overt and covert hostility levels of experimenters has been made clear by the work of Sanders and Cleveland (1953). Nine graduate students in psychology administered Rorschachs to a large sample of undergraduate students. Overt hostility was defined in terms of subjects’ ratings of their experimenter. Covert hostility was defined in terms of the experimenter’s own Rorschach responses. Subjects’ responses reflecting hostility increased when experimenters were high on covert hostility but decreased when their experimenters were high on overt hostility. Overtly hostile experimenters may have intimidated their subjects into giving more benign responses, and covertly hostile experimenters may have legitimated subtly the expression of hostile responses. What seems especially needed at this time is information on the actual behavior of experimenters classified as high or low in hostility—behavior that presumably creates quite different standards for the appropriateness of subjects’ responses.
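The crossover pattern in the Sarason and Minard (1963) percentages reported above becomes plain when the four figures are arranged as a 2 × 2 table and the usual contrasts are computed. The sketch below is our own arithmetic on the cell values given in the text, not an analysis taken from the original report; the variable and function names are illustrative.

```python
# Percent increase in reinforced responses, keyed by
# (experimenter hostility, experimenter prestige), as given in the text.
increase = {
    ("low", "high"): 4,
    ("high", "low"): 5,
    ("high", "high"): 47,
    ("low", "low"): 52,
}

def main_effect_hostility(cells):
    """Mean of the high-hostility cells minus mean of the low-hostility cells."""
    high = (cells[("high", "high")] + cells[("high", "low")]) / 2
    low = (cells[("low", "high")] + cells[("low", "low")]) / 2
    return high - low

def main_effect_prestige(cells):
    """Mean of the high-prestige cells minus mean of the low-prestige cells."""
    high = (cells[("high", "high")] + cells[("low", "high")]) / 2
    low = (cells[("high", "low")] + cells[("low", "low")]) / 2
    return high - low

def interaction_contrast(cells):
    """Prestige effect among high-hostility experimenters minus the
    prestige effect among low-hostility experimenters."""
    at_high_hostility = cells[("high", "high")] - cells[("high", "low")]
    at_low_hostility = cells[("low", "high")] - cells[("low", "low")]
    return at_high_hostility - at_low_hostility

print("hostility main effect:", main_effect_hostility(increase))   # -2.0
print("prestige main effect: ", main_effect_prestige(increase))    # -3.0
print("interaction contrast: ", interaction_contrast(increase))    # 90
```

Both main effects lie within a few percentage points of zero, while the interaction contrast is 90 points: neither attribute matters much by itself, but whether the two attributes match (both high or both low) matters a great deal.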
Experimenter’s Authoritarianism

On the basis of the California F Scale, Peggy Cook-Marquis (1958) obtained groups of experimenters and subjects who were high-authoritarian, low-authoritarian, and acquiescent. Experimenters administered tests of problem solving to their subjects. Performance on these problems was not related to experimenter personality. However, when attitudes toward different forms of teaching methods were assessed, it was found that high-authoritarian experimenters were less effective in influencing these attitudes than were the low-authoritarian or the acquiescent experimenters. The interpretation given these results by Cook-Marquis, with which it seems easy to agree, was that high authoritarians might not themselves believe in unstructured teaching techniques and that they were therefore less convincing in trying to influence their subjects to approve more of these techniques.

The work of Mulry (1962) has already been cited in connection with the need for approval and birth order variables. In his experiment, employing the pursuit rotor task, his twelve experimenters had also been assessed for authoritarianism by the use of the California F Scale. Authoritarianism of the experimenter was a factor in determining subjects’ perceptual-motor performance only in interaction with the experimenter’s belief about his own ability at the pursuit rotor task. Those experimenters who were low in authoritarianism and who felt themselves not to be good at the pursuit rotor task obtained superior performance from their subjects compared to the other combinations of experimenter’s authoritarianism and perception of their own adequacy at the motor task they administered to their subjects.
Although Mulry’s more authoritarian experimenters did not obtain significantly different data from their subjects (unless other variables were considered simultaneously), their subjects were affected differentially by contact with them. Thus, subjects contacted by more authoritarian experimenters described themselves as less satisfied with their participation in the experiment (r = −.27, p = .03) and as less interested in the experiment (r = −.23, p = .06). In addition, the more authoritarian experimenters were judged by their subjects to be less consistent in their behavior during the experiment (r = −.27, p = .03). Though it did not seem to occur in this study, it seems reasonable to suppose that there are experiments in which experimenters who thus affect their subjects’ reactions will obtain different responses from them in the experimental task posed. There are some data that suggest that this is so. From the analysis of sound motion pictures of experimenters administering the person perception task, it has been found that experimenters who are judged to be less consistent in their behavior tend to obtain ratings of the photos as being of more successful people (r = −.35, p < .01). If more authoritarian experimenters are less consistent in their conduct of the person perception experiment, as they were in Mulry’s motor performance experiment, we would expect that more authoritarian experimenters would obtain ratings of photos as being of more successful people. This prediction could be tested for only a small sample of six experimenters who had been administered the California F Scale and who also conducted a person perception experiment described in detail elsewhere (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964a, p. 467). The mean rating of success obtained by the three more authoritarian experimenters was +0.27 and that obtained by the three less authoritarian experimenters was −1.06. The difference was significant at the .06 level (t = 2.75).
Experimenter’s Intelligence

Perhaps because experimenters, even “student-experimenters,” tend to be so highly selected for intelligence, there has been little effort expended to study the effects of experimenter’s intelligence on subjects’ responses. The restriction of the range of IQ scores found among a set of experimenters would tend to reduce dramatically the correlation between their IQ and their subjects’ performance. In the Mulry (1962) experiment, no relationship was found between the intelligence test scores of the experimenter and his subject’s perceptual-motor performance. There was a tendency, however, for experimenters’ intelligence to interact with subjects’ sex in such a way that male subjects earned particularly high performance scores when their experimenters scored lower on the Shipley-Hartford Test of intelligence. Once again, subjects’ ratings of their experimenters were available. Experimenters scoring higher in intelligence were rated by their subjects as more consistent in their behavior (r = +.29, p = .02) and more physically active as reflected in greater amount of body movements (r = +.20, p = .10). In addition, subjects contacted by brighter experimenters were more satisfied with their participation in the experiment (r = +.26, p < .05).
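The attenuating effect of a restricted range of IQ can be illustrated with the standard formula for direct selection on one variable (often called Thorndike’s Case II). The sketch below assumes a linear, homoscedastic relationship and selection on IQ alone. The population correlation of .50 and the ratios of standard deviations are hypothetical values chosen for illustration, not estimates from any study discussed here, and the function name is ours.

```python
import math

def restricted_correlation(r_full, sd_ratio):
    """Correlation expected in a range-restricted sample.

    r_full   -- correlation in the unrestricted population (hypothetical)
    sd_ratio -- SD of the selected variable in the restricted sample divided
                by its SD in the unrestricted population (1.0 = no restriction)
    """
    scaled = r_full * sd_ratio
    return scaled / math.sqrt(1.0 - r_full ** 2 + scaled ** 2)

# Hypothetical illustration: a population correlation of .50 observed
# in samples whose IQ spread is progressively narrower.
for sd_ratio in (1.0, 0.75, 0.5, 0.25):
    print(sd_ratio, round(restricted_correlation(0.50, sd_ratio), 3))
```

With these illustrative values, halving the standard deviation of IQ lowers a correlation of .50 to roughly .28, which shows how a highly selected set of experimenters could mask a relationship that exists in the wider population.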
Experimenter’s Dominance

Reference has already been made to the work of Ehrlich and Riesman (1961). They had available scores on a scale of ascendance or dominance earned by the interviewers
employed in a study of adolescent girls. Those interviewers who were more ascendant and who appeared more task-oriented, as defined by a scale of “objectivity,” obtained different responses from their subjects than did the remaining interviewers. Responses in this study were defined in terms of the social unacceptability of the reply. When interviewers scored high on both ascendance and objectivity, they obtained 38 percent fewer socially unacceptable responses than did interviewers scoring lower on these scales. No-nonsense type interviewers are, it would seem, more likely to draw no-nonsense type responses.

Sarason (1965) has summarized an unpublished dissertation by Symons (1964) which shows that subjects contacted by more dominant experimenters make more negative self-references than do subjects contacted by less dominant experimenters. There is also evidence that subjects contacted by more dominant experimenters make more negative references to other people. The correlation between ratings of the experimenter’s dominance throughout an entire experiment and his subjects’ rating other people as having experienced failure was +.34, p < .005. (This particular experiment is discussed further in the chapter dealing with the communication of experimenters’ expectancies.) These findings make tempting the psychoanalytic interpretation that dominant experimenters evoke more hostility which, because it cannot be safely directed toward the source, is turned either inward, as in the Symons study, or against an external scapegoat. This interpretation is weakened somewhat by the fact that in data collected by Suzanne Haley, experimenters described as more “pushy” tended to obtain ratings of other people as more successful rather than more unsuccessful as we would have predicted from our interpretation.
To the extent that these definitions of dominance derive from the experimenter’s behavior in the experiment rather than from standard psychological instruments, their further discussion seems best postponed until the next section. There will be found a more detailed consideration of other, more fully social psychological variables. One of these variables, that of experimenter status, seems particularly related to the variable of experimenter dominance as inferred from his behavior in the experiment.
Social Psychological Attributes

The biosocial attributes of experimenters which have been discussed are, usually, immediately apparent to the subject. The psychological attributes discussed are not, usually, so immediately apparent to the subject, although as we have seen, there are often behavioral correlates of an experimenter’s psychological characteristics. In this section the discussion turns more fully to those attributes of the experimenter that are defined neither by his appearance nor by his answers to items of a psychological test or questionnaire. Sometimes the definition of these social psychological attributes is directly and simply behavioral, as in the case of the attribute of “warmth.” Sometimes the definition is only indirectly behavioral, as in the case of an experimenter’s status, and not at all simple, in the sense that the relative status of an experimenter who is an army captain will be determined by whether the subject is an army private or a major.
Experimenter’s Relative Status

In most laboratory research the subjects are undergraduates and the experimenters range in academic status from being advanced undergraduates, through the various levels of graduate students, all the way through the various status levels of the faculty, from new Ph.D. to senior professor. In military research settings, the status of the experimenter in terms of absolute rank is immediately apparent to the subjects, though an additional source of status, as we shall see, may derive from the setting in which the research is conducted. This effect of the setting or of the sponsorship of the research is well known to have an important influence in survey research (Hyman et al., 1954). Surveys conducted by the FBI are likely to earn a degree of cooperation quite different from that earned by a manufacturer of so-called washday products.

Regardless of how the experimenter derives his relative status or prestige in the eyes of his subject, that status often affects not only whether the subject will respond (Norman, 1948) but also how he will respond. An example of this has already been given in the discussion of experimenter’s hostility. There we saw that the prestige of the experimenter interacted with his hostility level to serve as a determinant of subjects’ susceptibility to verbal reinforcement (Sarason & Minard, 1963). Experimenter’s prestige in that experiment was defined in terms of formality of dress, of manner, and of request for participation. Experimenter’s prestige was found to interact with another variable: access to visual cues from the experimenter’s face. When subjects could not see the experimenter’s face and when he was in the low status condition, there was a decrease in the effect of his reinforcements on the subject’s responses.
Perhaps subjects felt that if the experimenter wasn’t very serious and furthermore, wasn’t even looking, it couldn’t matter too much whether his verbal utterances of ‘‘good’’ were taken seriously or not. In this experiment 16 experimenters were employed; in general, Sarason and his collaborators have employed large samples of experimenters. For the experimenter attribute of status, most of the relevant studies are based on sample sizes of only two or three experimenters. Still, they may be usefully considered. In a study of the control of verbal behavior of fifth-grade children, Prince (1962) employed two experimenters differing markedly in prestige. The more prestigious experimenter was more influential in controlling his subjects’ responses. This is as we would expect and is consistent, generally, with the results of Sarason and Minard. However, just as other variables were found to interact with experimenter status in that study, so too do we find such interactions in the following. Ekman and Friesen (1960) employed two military experimenters to administer a photo judging task to army recruits. Sometimes the experimenters were presented to the subjects as officers, sometimes as enlisted men. Sometimes experimenters reinforced subjects for liking the persons pictured in the photos and sometimes for disliking them. The overall results, although not clear-cut, suggested that the officer-experimenter was more effective at increasing subjects’ rate of disliking photographs, whereas the enlisted-man–experimenter was more effective at increasing subjects’ rate of liking photographs. That is a result similar to the one found when photos were being rated for their success or failure and more dominant experimenters drew more failure ratings. The officer role seems a more dominant one than that of enlisted man. One plausible interpretation, related to that proposed earlier, of the present data is that the
Table 5–5 Experimenter’s Status and Success at Controlling Subjects’ Verbal Behavior

Variable               Correlation    p
Businesslike           +.43           .001
Professional           +.33           .01
Loud                   −.31           .02
Behaved consistently   +.26           .05
recruit-subjects were given the ‘‘go-ahead’’ by the officer to be aggressive when his presence might itself have made them feel aggressive. Here, in a sense, was a chance to combine the experimenter-required conformity with the subject-desired aggressiveness. When the experimenter was an enlisted man, as the subjects themselves were, they may have felt more friendly and, therefore, found it easier to increase their rate of liking the persons pictured in photos. In this particular experiment, the authors point out, the differences in status between the experimenters might have been diminished in their subjects’ eyes because both were staff members of the high status organization carrying out the research. In one experiment on verbal reinforcement the experimenter’s status was defined in terms of his behavior during his interaction with the subject. It was assumed that a more professional, businesslike, less noisy, and more consistent experimenter would be ascribed a higher status by his subjects. The 19 male experimenters of this study said the word ‘‘good’’ whenever the subjects used first-person pronouns (Rosenthal, Kohn, Greenfield, & Carota, 1966). Table 5-5 shows the correlations between the increase in the use of first-person pronouns over the course of the experiment as a function of subjects’ perception of their experimenter’s behavior during the experiment. Higher status experimenters, as defined by their subjects’ perception of their behavior, were significantly more influential in changing their subjects’ responses. In this particular study we cannot be certain that subjects’ ratings of their experimenters actually reflected differences in that behavior. Possibly those subjects more susceptible to the influence of the experimenter only perceived him differently than did less influenceable subjects. 
It is also possible that having been influenced by an experimenter, subjects described that experimenter according to their conception of the sort of person by whom they would permit themselves to be influenced. Even if these more influential experimenters did not, in fact, behave as their subjects stated, it is instructive to note the pattern of characteristics ascribed to more influential experimenters. At least the stereotype of the behavior of more influential experimenters includes their being seen as behaving in a way associated with higher status. We gain some support for the idea that experimenters who influence their subjects’ responses more behave in a more professional way from a study by Barber and Calverley (1964a). In their experiment in hypnosis the single experimenter sometimes adopted a forceful, authoritative tone of voice and sometimes a lackadaisical one. Subjects accepted more suggestions when offered in the authoritative tone than when offered in a bored, disinterested tone. These variables of interest, enthusiasm, and expressiveness of tone were also employed in the verbal conditioning study cited, and in that study, too, were related to the experimenter’s success at influencing verbal behavior. Experimenters who influenced their subjects more were rated by them as more interested (r = +.43, p < .001), more enthusiastic (r = +.28, p < .05),
and more expressive-voiced (r = +.24, p < .10). The general impression obtained from the studies relevant to the experimenter’s status is that when the subject’s task involves conforming to an experimenter’s influence (as in studies of verbal conditioning or hypnosis), higher status experimenters are more successful in obtaining such conformity. That seems to be the case whether the experimenter’s status is defined in terms of such external symbols as dress or insignia or in terms of status-earning behaviors during the interaction with the subject. This conclusion seems consistent also with the general literature on social influence processes, though there the influencer is not usually an experimenter (e.g., Berg and Bass, 1961). Other investigators who have discussed the effect of the experimenter’s status on subjects’ susceptibility to his influence include Glucksberg and Lince (1962), Goranson (1965), Krasner (1962), and Matarazzo, Saslow, and Pareis (1960). The effect of experimenter status can operate even when the subject’s response is not a direct measure of social influenceability. Thus, Birney (1958) found that his two faculty experimenters obtained responses from subjects reflecting a higher need for achievement than did his student experimenter. Subjects may feel a greater need to achieve when in interaction with others who have probably achieved more; or at least subjects may feel it would be more proper to respond with more achievement responses in such company. The effect of the experimenter’s being a faculty member, especially if he is known to the subject, has also been illustrated by McTeer (1953). In many of the studies bearing on the effects of the experimenter’s status, the samples of experimenters have been small, so that any number of factors other than status could have accounted for the differences obtained. Thus not only do faculty experimenters differ in status from student experimenters but they are likely to be older as well.
In those studies where larger samples of experimenters were employed, the experimenters were usually aware that their status effects were being investigated, and this in itself might have made them perform the experiment somewhat differently. Where the subjects’ perceptions of the experimenter’s behavior were used to define status it was noted that the behavior that actually occurred was not necessarily the same as that reported by the subjects. What seems especially needed, then, is a study in which the status of the experimenter is varied without the experimenter’s knowledge of this variation. Just such a study was carried out by John Laszlo, who made his data available for the analysis reported here. There were 3 experimenters who administered the photo-rating task to 64 subjects. Half the time the subjects were told they would be contacted by a prestigious investigator and half the time by ‘‘just a student.’’ Each experimenter, then, obtained data from subjects when he ‘‘was’’ a higher status and a lower status person without his knowledge of that fact.

Table 5–6 Experimenter’s Status and the Means of Subjects’ Photo Ratings

                Status
Experimenter    High     Low     Difference
A               −.58     −.18    +.40
B               −1.70    −.57    +1.13
C               −.69     −.56    +.13
Mean            −.99     −.44    +.55

Table 5-6 shows the tendency for
experimenters who were ascribed the lower status to obtain ratings of the photos as being of more successful persons. Although this was not an experiment of verbal reinforcement, the results are reminiscent of those of Ekman and Friesen (1960), who also found a tendency for a lower status experimenter to obtain more favorable reactions to photographs. More directly analogous, for having employed the same task, are the two studies cited in the section dealing with experimenter’s dominance. One of these studies yielded results just like those obtained by Laszlo, but the other obtained results in the opposite direction. In the Laszlo study, the results were not significant statistically, although all three experimenters showed the same tendency. In this experiment, too, another finding that did not reach statistical significance, but which is of interest, nevertheless, was that the effect of the experimenter’s status was larger among subjects scoring higher on Rokeach’s (1960) scale of dogmatism. These are just those subjects who would be expected to be more susceptible to the effects of the status of those with whom they interact. This finding receives support from the work of Das (1960), who employed four experimenters to administer a test of body sway suggestibility. The status of the experimenters varied from department chairman to attendant. Higher status experimenters obtained more body sway from their subjects (p < .05), but it was the more suggestible subjects who showed the effects of experimenter’s prestige while the less suggestible subjects did not. The data presented from Laszlo’s study are supported by the results of another unpublished study employing the same photo-rating task. This is the study in which 19 experimenters contacted 57 subjects and were filmed during their interaction with the subjects. 
None of the ratings of the experimenter’s behavior during the brief preinstruction period predicted subjects’ judgments of the success of the persons pictured in the photos. However, ratings made during the instruction-reading phase of the experiment did. Table 5-7 shows the significant correlations between subjects’ ratings of ‘‘success’’ and the ratings of experimenter’s behavior made from simultaneously viewing the films and hearing the sound track. Other ratings were also significantly predictive of subjects’ responses, but only those are listed here that may be used to define the status level of the experimenter. Those experimenters who behaved more professionally and consistently and showed less body activity and talkativeness obtained lower ratings of success from their subjects. It seems reasonable to regard such experimenters as achieving higher status in their subjects’ eyes by virtue of their behavior. In general, these results are very much in line with the trends obtained by Laszlo.
Table 5–7 Experimenter’s Status and Subjects’ Ratings of Photos as Successful

Variable               Correlation    p
Behaved consistently   −.35           .01
Professional           −.23           .10
Talkative              +.26           .05
Leg activity           +.29           .05
Trunk activity         +.24           .10
Experimenter’s Warmth An experiment by Ware, Kowal, and Baker (1963) is illustrative. Two experimenters alternated playing a warm, solicitous, democratic role and one that was cool, brusque, and autocratic. The task set for the military subjects of this study was one of signal detection. Regardless of the various conditions of environmental stimulation occurring during the signal detection task, those subjects who had been contacted by the warmer-acting experimenter detected signals significantly better than did those contacted by the cooler-acting experimenter (p < .05). When the dependent variable was the production of verbal responses, the warmth of the experimenter was also an effective independent variable. Reece and Whitman (1962) defined ‘‘warm’’ experimenter behavior in terms of leaning toward the subject, looking directly at the subject, smiling, and keeping the hands still. Cold behavior was defined in terms of the experimenter leaning away from the subject, looking around the room, not smiling, and drumming his fingers. Subjects were, of course, able to judge correctly which was the warm and which the cold behavior, and this behavior affected their verbal output. Predictably, this was greater when the experimenter was warmer. This particular study is important not only because of its content but because of its method as well. Although there are a number of studies that manipulate warmth of experimenter, there are few that attempt to specify so carefully the motor behavior of the experimenter that is to be part of the picture of warmth. In the area of projective testing, Masling (1960) has discussed the effects of the examiner’s warmth on the subject’s productions. In an experiment by Lord (1950), for example, three examiners administered the Rorschach under warm, cool, and neutral styles of interaction. Subjects contacted by examiners in the warm condition gave ‘‘richer,’’ more imaginative Rorschachs than did subjects contacted under the cold condition. 
Interestingly, the differences among the three female examiners in the responses they obtained were greater than the differences among the three experimental conditions. Perhaps the ‘‘natural’’ warmth or coldness of the examiners was a more crucial variable than the role-played warmth or coldness. A good illustration of the magnitude of difference in subjects’ responses which may be associated with the experimenter’s warmth or coldness comes from research by Luft (1953). He employed an undergraduate female experimenter who administered 10 homemade ink blots to 60 freshman subjects, half of them males, half females. The task for each subject was simply to indicate those of the blots that were liked and those that were disliked. Half the time the experimenter played a warm, friendly role. Half the time she played a cool, unfriendly role, which included asking the subjects some questions about current affairs which they were sure to be unable to answer accurately. Subjects contacted by the experimenter in the warm role liked 7.6 of the 10 blots. Those contacted by the cold-role experimenter liked 3.1 blots (t = 9.7). Among those subjects treated coldly, 57 percent disliked most of the cards; among those treated warmly, only a single subject (3 percent) disliked most of the cards. There was no effect of the sex of the subject by itself or in interaction with the experimental treatment. Luft’s interpretation of the results bears repeating. ‘‘Like me and I will like your inkblots; reject me and I will reject them’’ (p. 491). Additional evidence that a cold examiner or experimenter may obtain different responses in storytelling tasks is available from the work of Bellak (1944) and of Rodnick and Klebanoff (1942).
They found critical treatment of the subjects to increase the incidence of aggressive themes. Assuming cold experimenters to be relatively more stressful stimuli for their subjects, there is still more evidence that a cold experimenter may, by his coldness, alter the subject’s responses in a variety of tasks. Masling (1960) gives an excellent summary of the relevant literature on projective testing. Subjects’ performance on an intelligence test may also be affected by the warmth of the examiner. Gordon and Durea (1948) administered the Stanford Binet Scale to 40 eighth-grade pupils. Half of these children were treated more coolly by the examiners. The result was that relative to the more warmly treated children, the coolly treated lost over six IQ points. Some data supportive of this result were collected by Wartenberg-Ekren, who kindly made the data available for further analysis. In her experiment 8 male examiners administered a visual-motor test of intelligence (Block Design) to 32 male subjects. Each examiner was rated by his subjects on his behavior during the administration of the test. The 21 scales employed were similar to those used in other studies described in this chapter. Two of the scales were significantly related to the subjects’ performance. Examiners rated by their subjects as more casual (p = .01) and as more talkative (p = .02) obtained superior performance on the intelligence test administered. By themselves these variables do not seem convincingly related to warmth. Table 5-8 shows the intercorrelations of five of the variables on which examiners were rated as well as the correlations of each with subjects’ performance. The correlations are based on the mean ratings ascribed to examiners and the mean performance each obtained. There being only eight examiners, a correlation of .62 is required for significance at the .10 level and a .71 is needed for the .05 level.
Because of the high intercorrelations, each examiner was given a cluster rating by adding the individual ratings together. The correlation between these cluster scores and subjects’ performance was +.79, p = .03. It seems reasonable to regard this cluster as one reflecting warmth. A word of caution is necessary, however. It is possible that subjects who performed more adequately felt differently about their examiners because of it and rated them differently, not because their behavior differed, but because of the subjects’ own improved mood. It is also possible that better performers at this particular task simply rate other people higher on the particular variables in the warmth cluster. The interpretation that examiners did, in fact, behave as described by their subjects, and that this casual, pleasant, encouraging syndrome fostered better performance, is not too far-fetched and is consistent with the data from the Gordon and Durea experiment.

Table 5–8 Examiner Warmth and Subjects’ Intellectual Performance

                   Description of examiners
                   Casual   Talkative   Expressive face   Encouraging   Subjects’ performance
Casual             —                                                    +.83
Talkative          +.76     —                                           +.81
Expressive face    +.75     +.87        —                               +.61
Encouraging        +.72     +.71        +.67              —             +.52
Pleasant-voiced    +.56     +.66        +.45              +.86          +.44
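The significance thresholds quoted for correlations based on eight examiners can be checked by hand. The following is a minimal sketch, assuming the usual t test of a Pearson correlation with n − 2 degrees of freedom; the two-tailed critical t values for df = 6 (1.943 and 2.447) are taken from a standard t table rather than from the text, and the function names are illustrative only.

```python
import math

# Two-tailed critical t values for df = 6, copied from a standard t table.
# These constants are an assumption of this sketch; the text does not give them.
CRITICAL_T_DF6 = {0.10: 1.943, 0.05: 2.447}


def r_to_t(r: float, n: int) -> float:
    """Convert a Pearson r based on n pairs into a t statistic with n - 2 df."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))


def critical_r(t_crit: float, n: int) -> float:
    """Smallest absolute r that reaches the given critical t with n - 2 df."""
    df = n - 2
    return t_crit / math.sqrt(t_crit * t_crit + df)


n_examiners = 8

# Threshold correlations at the .10 and .05 levels, rounded to two decimals.
thresholds = {alpha: round(critical_r(t, n_examiners), 2)
              for alpha, t in CRITICAL_T_DF6.items()}

# t statistic for the cluster-score correlation of +.79 with performance.
cluster_t = r_to_t(0.79, n_examiners)
```

Under these assumptions the thresholds come out at .62 and .71, matching the values cited, and the cluster correlation of +.79 corresponds to a t of roughly 3.2 on 6 degrees of freedom.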
In a subsequent chapter dealing more thoroughly with problems of subjects’ ratings of their experimenters, some evidence will be presented that suggests that subjects see their experimenters somewhat as their experimenters see themselves. This fact increases our confidence that what subjects say their experimenters did is, in fact, related to what their experimenters did do. There is evidence for this, too, from the survey research literature. One example relevant both to this point and to the attribute of warmth is a study by Brown (1955). He reported on a national survey conducted by the National Opinion Research Center in which subjects were to rate the interviewer’s behavior during the data-collection transaction. Better rapport in the interview was associated with fewer avoidable ‘‘don’t know’’ responses on the part of the subjects and with an increase in the number of usable responses given to open-ended questions. How the interviewer ‘‘really’’ behaved we cannot know. It is possible that more forthright subjects evaluate their questioners more favorably. It is also possible that after obtaining some forthright answers from subjects, data collectors in fact became more competent, or warmer, or happier, and that the subjects’ record of the interviewer’s behavior, although ‘‘accurate,’’ has actually been determined by the subject’s own behavior. All these processes may be operating, and yet there can be a kernel of correlation between the interviewer’s actual behavior and his subjects’ perception of that behavior. That, at least, is suggested by the data Brown obtained. In some of the studies of the effects of the experimenter’s warmth it was not the experimenter’s behavior that was varied independently or even assessed as it occurred naturally. Rather, the set given to the subject was varied in such a way that sometimes he expected the experimenter to be a warm, likable person and sometimes a cold, unlikable person. 
Though not originally employed to study experimenter-subject interaction, this manipulation has come to be associated with the earlier work of Back (1951). McGuigan (1963) describes an unpublished dissertation by Spires (1960) which employed just such a manipulation in a study of verbal conditioning. Spires found better conditioning to occur when subjects had been led to expect a warm experimenter, a finding borne out by Sapolsky’s work (1960), which was conducted at about the same time. In Spires’ study, most of the effect of the subject’s set was actually associated with a particular personality characteristic of the subject. Subjects scoring higher on an ‘‘obsessive-compulsive’’ dimension, as defined by the Pt scale of the MMPI, were little affected by the set they had been given about the experimenter’s warmth. However, subjects scoring high on an ‘‘hysteria’’ dimension, as defined by the Hy scale of the MMPI, showed a very large effect of the set they had been given. When experimenters believed to be warmer said ‘‘good’’ to reinforce subjects’ responses, those scoring high on the Hy scale increased their use of the reinforced pronouns about 80 percent, whereas those scoring low on the Hy scale increased their use of these words only about 15 percent. From the results of the studies cited so far and from others (e.g., Sampson & French, 1960; Smith, 1961), it seems reasonable to conclude that when the subject’s performance is a measure of influenceability, more influence is exerted by a warm, or warmly perceived, experimenter than by a cold, or coldly perceived, experimenter. The extent of the effect of experimenter warmth, however, appears to interact with subject variables and, very probably, with experimenter variables and situational variables as well. Some of the cited studies of experimenter warmth have defined warmth in terms of the experimenter’s behavior, and others have defined warmth in terms of the
subject’s expectation of the experimenter’s behavior. An experiment by Crow (1964) employed both definitions simultaneously. Although only a small study, employing 13 subjects and 4 experimenters, the results are instructive enough to warrant the telling of some of the details. Half the subjects had been found to have a conception of psychological experimenters as relatively warm in manner. The remaining subjects tended to expect experimenters to behave more coldly toward their subjects. Half the time experimenters played the part of a warm experimenter after the manner of Reece and Whitman (1962). That is, they smiled more at their subjects, leaned toward them, and looked at them more. Half the time experimenters played a cold role, defined by leaning away from their subjects, not smiling, avoiding eye contact, and drumming their fingers. Three tasks were administered to the subjects. One of these was a spool-packing task in which spools of thread were placed into an empty box, removed, repacked, removed, and so on for the duration of the task period. Another task called for the subjects to cross out all the W’s on a page of randomly arranged letters. Both of these tasks have been employed or are similar to those employed by investigators interested in learning just how far subjects will go in cooperating with a psychological experimenter (e.g., Crowne & Marlowe, 1964; Orne, 1962). The third task administered to the subjects was a homemade version of a standard subtest of intelligence (digit symbol) which required the learning of a simple code for translating numbers into symbols. Tables 5-9, 5-10, and 5-11 give the mean performance scores for each of the three tasks. The raw scores have been converted to standard scores from the raw data available in Crow’s report. Most of the results vary from task to task, except that in each case the performance was best when the experimenter behaved warmly and was contacting subjects who expected to be treated warmly. 
The average standard score for this subgroup was +1.43; that for the remaining subgroups was −.48 (t = 3.62, p < .10, df = 2). Closer study of the marginals of Tables 5-9, 5-10, and 5-11 suggests an interesting interaction effect involving the type of task and the relative effects of the experimenter’s behavior compared to the effects of subjects’ expectations.

Table 5–9 Experimenter Warmth and Subject’s Performance in a Spool-Packing Task

                         Experimenter behavior
Subject’s expectation    Warm     Cold     Mean
Warm                     +1.32    −.51     +.405
Cold                     −1.32    +.51     −.405
Mean                     0        0

Table 5–10 Experimenter Warmth and Subject’s Performance in a Letter-Canceling Task

                         Experimenter behavior
Subject’s expectation    Warm     Cold     Mean
Warm                     +1.53    −.33     +.60
Cold                     −1.25    +.05     −.60
Mean                     +.14     −.14

Table 5–11 Experimenter Warmth and Subject’s Performance in a Digit Symbol Task

                         Experimenter behavior
Subject’s expectation    Warm     Cold     Mean
Warm                     +1.43    −1.32    +.055
Cold                     +.29     −.40     −.055
Mean                     +.86     −.86

Table 5–12 Effects of Experimenter’s Warmth Defined by Either Experimenter Behavior or Subject’s Expectation

Task               Experimenter behavior   Subject’s expectation   Difference
Spool-packing      .0                      +.81                    −.81
Letter-canceling   +.28                    +1.20                   −.92
Digit symbol       +1.72                   +.11                    +1.61
Mean               +.67                    +.71                    −.04

The lower marginals in Table 5-9, for example, show that for the spool-packing experiment there was no main effect for
the experimenter’s behavior. The right-hand marginals, however, show that the mean performance of subjects expecting warm treatment was superior by +.81 to that of subjects expecting cooler treatment. Table 5-12 gives the analogous values for each of the three tasks. A plus sign preceding the standard score data indicates that performance was superior in the warmer condition. For the spool-packing and letter-canceling tasks, the experimenter’s actual behavior made virtually no difference compared to the subject’s expectation, which had a more substantial effect on the subject’s performance. The situation was reversed for the digit symbol task. There, the subject’s expectation made no difference but the experimenter’s behavior made a good deal of difference in the subject’s performance. The last column of Table 5-12 summarizes the interaction (t = 25.9, df = 1, p < .05). For simple tasks with little meaning, subjects’ expectations may assume a greater importance, because subjects who view experimenters more favorably may view their tasks more favorably, thereby transforming a compellingly inane procedure into one that simply ‘‘must’’ have more value. The experimenter’s behavior may lose relative importance just because of the peculiarity of the task itself which absorbs the subject’s attention. In the quasi-intelligence test, expectations about experimenters’ behavior may become less salient because now the task is one like those the subject has been performing for years in school settings. The experimenter becomes more like those others in the student’s life who have administered tests (usually teachers) and is to be evaluated more in terms of his actual behavior. The expectation of the experimenter’s behavior becomes less important as soon as the subject finds the experimental situation to have required no special expectation at all because of its resemblance to the school situation.
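The entries of Table 5-12 follow arithmetically from the cell means of Tables 5-9, 5-10, and 5-11. The sketch below works through that derivation, with each cell keyed by (subject's expectation, experimenter behavior). The dictionary layout and function names are illustrative assumptions, and the final contrast is only one plausible way of arriving at a t near the reported value, not necessarily the computation actually used.

```python
import math

# Standard-score cell means from Tables 5-9, 5-10, and 5-11, keyed by
# (subject's expectation, experimenter behavior).
CELLS = {
    "spool-packing": {
        ("warm", "warm"): 1.32, ("warm", "cold"): -0.51,
        ("cold", "warm"): -1.32, ("cold", "cold"): 0.51,
    },
    "letter-canceling": {
        ("warm", "warm"): 1.53, ("warm", "cold"): -0.33,
        ("cold", "warm"): -1.25, ("cold", "cold"): 0.05,
    },
    "digit symbol": {
        ("warm", "warm"): 1.43, ("warm", "cold"): -1.32,
        ("cold", "warm"): 0.29, ("cold", "cold"): -0.40,
    },
}


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


def warmth_effects(cells):
    """Return (behavior effect, expectation effect), each warm mean minus cold mean."""
    behavior = (mean(v for (_, b), v in cells.items() if b == "warm")
                - mean(v for (_, b), v in cells.items() if b == "cold"))
    expectation = (mean(v for (e, _), v in cells.items() if e == "warm")
                   - mean(v for (e, _), v in cells.items() if e == "cold"))
    return behavior, expectation


effects = {task: warmth_effects(cells) for task, cells in CELLS.items()}
differences = {task: b - e for task, (b, e) in effects.items()}

# One possible contrast (an assumption of this sketch): the digit symbol
# difference against the two simple tasks, with the error term taken from
# the spread between those two tasks, which leaves 1 degree of freedom.
simple = [differences["spool-packing"], differences["letter-canceling"]]
simple_mean = mean(simple)
variance = sum((d - simple_mean) ** 2 for d in simple) / (len(simple) - 1)
standard_error = math.sqrt(variance * (1 / len(simple) + 1))
t_interaction = (differences["digit symbol"] - simple_mean) / standard_error
```

Rounded to two decimals, the effects and differences reproduce the rows of Table 5-12, and the contrast comes out near 26, close to the reported t of 25.9.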
If that were the case, we might expect that expectations about the warmth of teachers would have been an effective determinant of subjects’ performance. Those expecting teachers to be warmer should have performed better at the task most similar to that usually administered by teachers. Such data, unfortunately, are not available, and the interpretation offered remains an
unsupported speculation. However, the fact that in an intelligence test-like task warmer-behaving experimenters obtained superior performance seems quite consistent with the data presented earlier. In the discussion of the effects of experimenter warmth on responses to projective tests we encountered the work of Luft (1953). He had shown that subjects contacted by warmer experimenters were more favorably inclined toward ink blots. If a warmer experimenter draws more ‘‘liking’’ responses to blots we might expect that he would also draw more favorable responses to photos of people. Some indirect evidence is available from the experiment in person perception which had been filmed. Experimenters whose instruction-reading behavior was judged from both film and sound track to be more personal (r = +.28, p < .05) and more interested (r = +.23, p < .10) obtained ratings of photos as being of more successful people. These are weak findings, however, because for the variables ‘‘friendly’’ and ‘‘pleasant’’ the corresponding correlations were much lower than we would have expected (rs = +.12 and +.09) if warm experimenters dependably obtained more ‘‘success’’ ratings from their subjects. When subjects rated their experimenters on these same four variables in an experiment conducted by Suzanne Haley there was only the smallest trend for experimenters rated more positively by their subjects to obtain ratings of the photos as more successful. Although we cannot always say exactly what the effect will be, the status and warmth of an experimenter often affect the responses given him by his subjects. Here, and elsewhere (Edwards, 1954; Rosenthal, 1963a), when that point was made, the emphasis has been on research employing human subjects.
There appear to be no experiments on the effects of more enduring experimenter attributes on the performance of their animal subjects, but there are, nevertheless, sufficiently compelling anecdotes to make us suspect that even the performance of animals depends to some degree on the personality of the investigator (Christie, 1951; Maier, 1956; Pfungst, 1911; Rosenthal, 1965a). To summarize, and in the process oversimplify grossly, what seems to be known about the effects of the experimenter’s status and warmth: Higher status experimenters tend to obtain more conforming but less pleasant responses from their subjects. Warmer experimenters tend to obtain more competent and more pleasant responses from their subjects.
6 Situational Factors
More than an experimenter’s score on a test of anxiety, his status and warmth are defined and determined in part by the nature of the experimental situation and the particular subject being contacted. The experimenter ‘‘attributes’’ to be considered now are still more situationally determined. That is, the degree of warmth an experimenter shows one subject may be correlated with the degree of warmth he shows other subjects. But whether he ‘‘accidentally’’ encounters a subject with whom he has had prior social contact seems less likely to be an enduring attribute and more purely situational. The distinction is, nevertheless, arbitrary. Experimenters who are acquainted with a subject may differ in associated personality characteristics which make them more likely to be acquainted with other subjects as well. The effects of prior acquaintanceship thus may be due not simply to the prior contact as such, but to correlated variables as well.
Experimenter’s Acquaintanceship

When the experimenter has had prior contact with his subject, even when that contact is brief, the subject may respond differently in the experimental task. When the task was an intelligence test, the study by Sacks (1952) is the most interesting. Her subjects, 30 children all about three years old, were divided into three experimental groups. With the children of group A she spent one hour each day for 10 days in a nursery school, participating as a good, interested teacher. With the children of group B, she spent the same amount of time but her role was that of a dull-appearing, uninterested teacher. With the children of group C, she had no prior contact. The results were defined in terms of changes in intelligence test scores from before to after treatment. Group A gained 14.5 IQ points (p < .01), group B gained 5.0 IQ points (p < .05), while the no-contact control group gained only 1.6 IQ points. This study illustrates not only the effects of prior contact but also the effects of the warmth of that contact. When the experimenter had played a warmer role the gain in IQ was 9.5 IQ points greater than when she had played a cooler role (p = .02). There may be an interaction between the effects of prior contact and the particular experimenter in determining the effects on children’s intellectual performance.
Marine (1929), for example, spent time with somewhat older schoolchildren and found this prior contact to have no effect on the children’s gain in IQ points. Most clinicians feel that anxiety serves to lower intellectual performance under ordinary conditions. Prior contact with the experimenter may serve to lower any anxiety about being contacted by a stranger and thereby lead to a relative increase in IQ. When the experimenter, in addition, is warmer, anxiety may be still further reduced, thereby raising still more the level of intellectual performance. This interpretation could be tested by having subjects high and low on test anxiety and high and low in fear of strangers receive prior contact or no prior contact. Those more anxious over tests and those more fearful of strangers should profit most from prior contact with the experimenter, and probably also from contact with a warmer experimenter. The effects of prior contact also seem to depend on the task set for the subject. When the task is a simple, repetitive motor task such as dropping marbles into holes, complete strangers seem to be more effective reinforcers than experimenters known to the subjects—in one case, the preschool subject’s own parents (Stevenson, Keen, & Knights, 1963). This is just what we would expect on the basis of Hullian learning theory. When the response is a simple one, easily available to the subject, an increase in anxiety, such as we expect to occur in the presence of strangers, increases the performance level. When the response is a difficult one, not easily available to the subject, as in an intelligence test, an increase in anxiety makes these less available responses still less likely to occur because the more available responses, more often wrong, become more likely due to the so-called multiplicative effect of drive. A recent experiment by Berkowitz (1964) is relevant. 
He employed 39 chronic schizophrenic and 39 medically hospitalized normals in a study of the effects of prior warm contact, prior cold contact, and no prior contact on reaction time scores. Early trials were not reinforced, but later trials were reinforced by the experimenter’s complimenting the subject for his performance. Psychiatrically normal subjects who had prior contact, either warm or cold in character, were slower in reacting than were normal subjects who had no prior contact. Of the two prior contact groups, those subjects who had experienced a warmer interaction showed the slower reaction time. Berkowitz’s interpretation of these results in terms of drive level fits well with the interpretation of the results of the Stevenson et al. study just mentioned. Because the task is a simple one, the less the anxiety or drive level, the poorer the performance. Prior contact, it was suggested earlier, reduces anxiety, and with a warm experimenter more so than with a cold one. In Berkowitz’s study, the results for the schizophrenic patients were somewhat different. They, too, showed the slowest reaction time when their experimenter had been warm in prior contact. However, there was no difference between reaction times of subjects with cold prior contact and those with no prior contact. For schizophrenics, perhaps, cold prior contact does not reduce anxiety as it does for psychiatric normals. With college students as subjects, Kanfer and Karas (1959) investigated the effects of prior contact on the conditioning of first-person pronouns. There were four groups of subjects; three had prior contact with the experimenter and the fourth did not. During their prior contact one group of subjects was made to feel successful at a brief intelligence test, another group was made to feel unsuccessful, and the third group was given no feedback. All three groups who had experienced prior contact with the experimenter conditioned faster than did the group with no prior contact. 
If it can be assumed that learning the contingency in a verbal conditioning experiment is
somewhat challenging intellectually, then the results of this study seem consistent with those of Sacks (1952), who found intellectually challenging tasks to be performed better after prior contact with the experimenter. Kanfer and Karas, however, found no difference in performance among the three groups who had prior contact with their experimenter. Such a difference might have been expected from the results of the studies described here. The lack of any difference might have been due to the fact that during the prior contact subjects took a brief IQ test, which might have made them all sufficiently anxious to weaken the effects of the different types of feedback received about their performance. The change to the simpler verbal conditioning task might have reduced the anxiety of all three groups to below the level of the control group, for whom the experimenter-subject interaction was new, strange, and therefore possibly more anxiety-arousing. There is also the possibility that the prior contact subjects retained their high anxiety levels through the verbal conditioning task and that more anxious subjects perform better at that task. That is what we expect if the task is not challenging intellectually. The two opposing interpretations must remain unreconciled for want of the relevant data. Even when anxiety is defined by a standard test such as the Taylor Scale of Manifest Anxiety or a near relative rather than by an experimental manipulation, it is not well established whether more or less anxious subjects show more or less verbal conditioning (Rosenthal, 1963d). Verbal conditioning may turn out to be less difficult than most items of an intelligence test but more difficult than such performances as reaction time or eyelid conditioning (Spence, 1964), and that may account for the equivocality of the data available. There are some conclusions, though, that can be drawn about the effects on the subject’s performance of prior contact with the experimenter.
Often, at least, such contact makes a difference (Krasner, 1962; Wallin, 1949). When the performance required is difficult, prior contact, especially when of a ‘‘warm’’ quality, seems to improve performance. When the task is simple, prior contact may worsen performance, although, it seems safe to assume, subjects may feel more relaxed about it. When the task is of medium difficulty, no clear prediction is possible except that how the subject is occupied during the prior contact may make the major difference.
Experimenter’s Experience

It seems reasonable to suppose that a more experienced experimenter, one who has conducted more experiments or at least repeated a certain experiment more often, may behave differently in the experiment than a less experienced experimenter. This difference in behavior alters the stimuli offered the subject so that we might expect him to behave differently. We have already seen at least one experiment in which the experience of the experimenter seemed to affect the speed of learning of his subjects, and these subjects were rabbits (Brogden, 1962). The less experienced experimenter obtained a slower rate of learning than did more experienced experimenters. When the subject’s task was to construct stories to TAT stimuli, there was a tendency for examiners who had administered fewer TATs to elicit more storytelling material (Turner & Coleman, 1962). In the experiment in person perception, which was recorded on sound film, some of the 19 experimenters had prior experience. They had served in one of two other studies in which their task was also to present the
photos of faces to their subjects and record subjects’ ratings of success or failure. In this study, there was no effect on subjects’ ratings of the stimuli associated with experimenters’ having had prior experience in the experimenter role. However, from the analysis of the films and of the sound track, it was evident that the more experienced experimenters behaved differently during the course of the brief preinstructional period and during the reading of the instructions. Interestingly, it was in the sound track rather than in the film or in the film combined with sound track that the differences emerged. Table 6-1 shows the larger correlations between experimenters’ behavior and their prior experience. During both the preinstructional period and the instruction reading itself, the more experienced experimenters spoke in a less personal tone of voice and less distinctly. They read the instructions with less expression and gathered the initial background information from the subjects in a less pleasant and less enthusiastic tone of voice. It may be that the nature of the task was such that having been through it all before, the more experienced experimenters were simply bored. The boredom, however, if that is what it was, was revealed through tone of voice and not through motor behavior. It is of special interest to note that observers who had access to the sound track and also to the film could not make the tone of voice judgments as well. When the information is in the sound track rather than in the film, viewing the film while listening to the sound probably results in a decreased signal-to-noise ratio (Jones & Thibaut, 1958). The film then only distracts the judges. In this analysis there was even a trend for some of the correlations based on the judgments of the film to be opposite in direction from those based on judgments of the sound track.

Table 6-1 Experimenter’s Experience and Behavior: Observed from Sound Track Only

Preinstructional period
Variable             Correlation    p
Personal             −.56           .02
Enthusiastic         −.41           .10
Pleasant-voiced      −.41           .10
Speaks distinctly    −.50           .05

Instructional period
Variable             Correlation    p
Personal             −.43           .08
Expressive-voiced    −.44           .07
Speaks distinctly    −.64           .005
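As a rough aid to reading correlations of the size reported in Table 6-1, a correlation can be converted to a t statistic on n − 2 degrees of freedom. The sketch below assumes n = 19 (the number of experimenters mentioned above); the source does not state the n behind each tabled correlation, so the function name `t_from_r`, the sample size, and the use of a standard tabled critical value are illustrative assumptions, not a reconstruction of the authors' computation. Only the magnitudes of the correlations are used.

```python
import math


def t_from_r(r: float, n: int) -> float:
    """Convert a Pearson correlation r based on n pairs to a t statistic (df = n - 2)."""
    df = n - 2
    return r * math.sqrt(df / (1.0 - r * r))


# Assumption: n = 19 experimenters; the source does not confirm n for each entry.
N_ASSUMED = 19
# Two-tailed .05 critical value of t for df = 17, taken from standard t tables.
T_CRIT_05_DF17 = 2.110

t_largest = t_from_r(0.64, N_ASSUMED)   # largest magnitude in Table 6-1
t_smallest = t_from_r(0.41, N_ASSUMED)  # smallest magnitude in Table 6-1

print(f"t for |r| = .64: {t_largest:.2f}")
print(f"t for |r| = .41: {t_smallest:.2f}")
print("exceeds the .05 critical value:",
      t_largest > T_CRIT_05_DF17, t_smallest > T_CRIT_05_DF17)
```

Under this assumed n, the largest tabled correlation clears the conventional .05 threshold while the smallest does not, which is at least in line with the pattern of p values shown in the table.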
Although the differences in vocal behavior between more and less experienced experimenters did not affect the subjects’ responses in the present study, it is not difficult to imagine experimental tasks wherein such behavioral differences among experimenters could affect subjects’ task performance. Studies in verbal conditioning are one such class of studies. Here the tone of the experimenter as he utters his ‘‘good’s’’ and ‘‘um-hmm’s’’ may make a substantial difference, and one wants to know whether more experienced reinforcers obtain better conditioning and, if they do, whether it is because their tone of voice is different.
Even when the experimenter has had no prior experience in that role, his experience changes during the course of his first experiment. At the end of his first experiment he is more experienced than at the beginning. Sarason (1965) reports a finding from an unpublished study by Barnard (1963) which illustrates that even during the course of a single experiment, the behavior of the experimenter can change systematically. In the Barnard study experimenters administered a phrase association task to their subjects. The degree of associative disturbance shown by the subjects seemed to be related, at least sometimes, to the prior experience of the experimenter during this study. Barnard’s experimenters also reported a drop in anxiety over the course of the experiment which might have accounted for the effects of experimenters’ experience on subjects’ degree of disturbance. In the experiment recorded on film, the serial order in which each subject was seen was correlated with the experimenter’s behavior. It was thereby possible to learn whether later-contacted subjects were meeting an experimenter whose behavior had changed from that shown earlier subjects. Considering only the preinstructional period, none of the ratings of the experimenters’ behavior correlated “significantly” (p < .05) with the serial order of the subject contacted. Behavior during the instruction period, however, did seem to be affected by the number of subjects the experimenter had seen previously. Table 6-2 shows the larger correlations obtained when judgments were based on the observation of films without sound track. Table 6-3 shows the correlations obtained when the sound track was added to the films for a different group of observers.

Table 6-2 Serial Order of Subject Contacted and Experimenter’s Behavior: Silent Film

Variable           Correlation    p
Active             −.32           .02
Body activity      −.32           .02
Trunk activity     −.32           .02
Leg activity       −.30           .03
Expressive face    −.24           .08

Table 6-3 Serial Order of Subject Contacted and Experimenter’s Behavior: Film and Sound Track

Variable            Correlation    p
Interested          −.31           .02
Active              −.24           .10
Enthusiastic        −.23           .10
Encouraging         −.23           .10
Relaxed             +.26           .06
Leaning toward S    −.26           .06
Head nodding        −.25           .07
Accuracy            +.25           .07
Time                −.31           .03

The general decrease of motor activity during the instruction period as successive subjects were contacted seems consistent with Barnard’s report of decreased experimenter anxiety over the course of an
experiment. Again, the addition of another channel of information resulted in a decrease of “correlational information” about these variables. When the sound track was added, only one of the variables shown in Table 6-2 remained significantly correlated with the serial order of subject contacted, and even that correlation was reduced substantially. Table 6-3 shows, however, that the addition of the sound track made possible the observation of different behaviors which were determined in part by the serial order of subjects contacted. Experimenters seemed to become less interested and less involved in their interaction with later-contacted subjects but more relaxed as well. They read their instructions more rapidly and more accurately to later than to earlier subjects, which suggests an expected practice effect. (Although not significant statistically, experimenters who had participated in an earlier experiment and thus were more experienced, by that definition, also tended to be more accurate [r = +.26] and faster [r = −.18] in reading their instructions.) In this experiment, as in Barnard’s, experimenters seem to relax over the course of an experiment and, in this study, to become somewhat more bored though more proficient as well. Also, in this study, the behavior changes shown by the experimenters seemed to affect their subjects’ responses to the photo-judging task. Later-contacted subjects tended to rate the photos as being of more unsuccessful people than did earlier subjects (r = −.31, p < .02). It may be that over the course of an experiment the data collector acquires greater comfort and competence and thereby greater status. For the photo-rating task employed, it was shown earlier that experimenters judged to have higher status did tend to obtain ratings of photos as of more unsuccessful people. From the evidence available it seems safe to conclude that the amount of experience of an experimenter may affect the responses collected from his subjects.
This seems to be the case when experience is defined either over several experiments or within a single experiment.
Experimenter Experiences

Not only the amount of experience the experimenter has accumulated but also the experiences he has encountered in his role as data collector may affect his subjects’ responses. Earlier, in discussing the effects of experimenter’s warmth, it was suggested that the subject’s response may affect the experimenter’s behavior in his transaction with the subject. But, since the experimenter’s behavior may influence the subject’s response, it is easy to view the experimenter-subject system as one of complex feedbacks. The response given by the subject may itself affect his next response at the same time it affects his experimenter’s response, which will also affect the subject’s next response. Focusing on the experimenter, the same analysis is possible. His behavior affects his own subsequent behavior but also affects the subject’s response, which, in turn, affects the experimenter’s next response. The resulting complex of intertwining feedback loops may be incredibly complex but no more complex than that characterizing other dyadic interactions (Jones & Thibaut, 1958). In this section the discussion will deal with such ongoing effects on the experimenter that have repercussions on the responses he obtains from his subjects. The subject’s own effect on the experimenter will be considered as well as such other
influences as the physical characteristics of the laboratory in which the experimenter works and the nature of his interaction with any principal investigator to whom he may be responsible.

Subjects’ behavior. An experiment by Heller, Myers, and Kline (1963) demonstrates the effects of a subject’s behavior on the interviewer’s behavior. Each of 34 counselor-interviewers contacted 4 subject-clients in a clinical context. Actually, each counselor interviewed the same four “clients,” who were accomplices of the investigators and trained to play one of four roles. Two clients played a dominant role, and one of these was friendly about it, the other hostile. The other two clients played a dependent role, one friendly, the other hostile. Observations of interviewer behavior revealed that contact with a dominant client led to interviewers’ behaving in a more dependent manner (mean dominance score = 12.1), while contact with a more dependent client led to more dominant behavior (mean = 15.2, p < .001). When interviewers contacted more hostile clients they responded in a less friendly fashion (mean = 11.4) than when they contacted friendly clients (mean = 21.4, p < .0005). These results were just those the investigators had predicted. In this study it is reasonable to think of the actor-clients as the experimenters and the interviewers as the subjects. However, the interviewers’ perception of their own role was more like that of data collector than of experimental subject. This may have reduced the obtained effects, since the role of subject is thought to include greater susceptibility to social influence than is the role of data collector, whether the collector be “experimenter,” “examiner,” “therapist,” or “interviewer.” When the task employed by the experimenters was the administration of an intelligence test, Masling (1959) found results analogous to those obtained by Heller, Myers, and Kline.
Also employing actor-subjects, Masling found warmer subjects treated in more friendly fashion by the examiners. For a situation in which the experimenter was trying to follow a more highly programmed procedure with his subjects, Matarazzo provides an illuminating anecdote (personal communication, 1964). The basic data are reported elsewhere (Matarazzo, Wiens, & Saslow, 1965), but briefly, the study was of the effect of the duration of an interviewer’s utterance on the duration of the subject’s utterance. The interviews were divided into three periods. During the first and third periods the interviewer tried to average utterances of five seconds. During the middle period he tried to average ten-second utterances. Regardless of the patterns employed (e.g., 5, 10, 5; 10, 5, 10; 5, 15, 5) the subject’s average length of utterance was a function of the length of the interviewer’s utterance. Matarazzo raised the possibility of a feedback effect upon the interviewer associated with the subject’s length of utterance. Unless he paid strict attention to his average length of utterance, it seemed that his own length of utterance was being affected by the subject’s length of utterance. Thus in one experiment, the interviewer overshot his target length of five seconds by only 6 percent in the first of the three periods; then, in the third period, after the subject had increased the length of his utterances in the second period, the interviewer overshot his target by 22 percent (p < .01). This effect disappeared completely when the investigator kept this phenomenon in mind. Subsequently, when not attentive to it, the hysteresis occurred again. This time the interviewer achieved the target length of five seconds perfectly in the first period of the interview. In the third period, however, after the increasing length of his subject’s utterances in the second period, he overshot his target time by 10 percent (p < .01).
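The overshoot percentages just quoted can be restated as implied average utterance lengths. The short sketch below is only arithmetic on the figures given above (a five-second target overshot by 6 and 22 percent in one experiment, and by 0 and 10 percent in a later one); the helper name `implied_duration` and the labels are hypothetical.

```python
def implied_duration(target_seconds: float, overshoot_percent: float) -> float:
    """Average utterance length implied by overshooting a target by a given percentage."""
    return target_seconds * (1.0 + overshoot_percent / 100.0)


TARGET = 5.0  # seconds: the interviewer's target in the first and third periods

# (label, overshoot in percent) as reported in the text
reported = [
    ("first experiment, period 1", 6.0),
    ("first experiment, period 3", 22.0),
    ("later experiment, period 1", 0.0),
    ("later experiment, period 3", 10.0),
]

durations = {label: implied_duration(TARGET, pct) for label, pct in reported}
for label, seconds in durations.items():
    print(f"{label}: {seconds:.2f} s")
```

Read this way, the drift amounts to roughly half a second to one second per utterance, which suggests how easily it could go unnoticed by an interviewer not attending to it.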
What happens to an experimenter during the course of his experiment may alter his behavior toward his subjects in such a way as to affect subjects’ (1) judgments of the degree of success shown by standard stimulus persons, (2) responses on standard tests of personality, and (3) test-retest reliabilities of personality tests. In Part II of this book, Chapter 12, dealing with the effects of early data returns, will give the details. Briefly, for now, 26 experimenters administered the photo-judging task to a total of 115 female subjects. Half the experimenters were led to expect that their subjects would see the stimulus persons as successful and half were led to expect their subjects to see the stimulus persons as unsuccessful. Accomplices were trained to rate the photos sometimes as of successful people and sometimes as of unsuccessful people. Regardless of their initial expectancy, half the experimenters had their expectancies confirmed and half had their expectancies disconfirmed by their first two subjects who were the accomplices. That is, half of the experimenters who were expecting ratings of success (+5) obtained ratings of success, while the other half obtained ratings of failure (−5). Half the experimenters expecting ratings of failure (−5) obtained such ratings, and the other half obtained ratings of success (+5) from their “subjects.” Subsequently, when the experimenters contacted real subjects, the mean rating of the photos obtained by experimenters whose expectations had been confirmed was −1.55; that obtained by experimenters whose expectations had been disconfirmed was −0.79 (p = .05). It may be that the confirmation of expectancies gave added confidence to these experimenters, a confidence reflected in a more professional, assured manner. Earlier, data were presented that suggested that such a more professional, prestigious experimenter was likely to obtain ratings of the photos as being of more unsuccessful people.
Before and after the experiment, subjects were tested with the Taylor Manifest Anxiety Scale and the Marlowe-Crowne Social Desirability Scale. Whether their experimenter had his initial expectations confirmed or disconfirmed did not affect subjects’ level of anxiety. However, subjects whose experimenters had their expectancies confirmed showed a significant increase in their social desirability scores compared to the subjects whose experimenter’s expectancies had been disconfirmed (p < .05). It can again be hypothesized that confirmatory responses increased the experimenter’s self-confidence, leading to his behaving in a more professional manner. In the section dealing with the effects of experimenter status, we saw that increases in status and authority on the part of the experimenter lead to a greater degree of propriety in the responses he obtains from subjects. That seems to be what happened in this experiment as well. Changes in test scores are a different matter from changes in test reliability. All subjects may earn higher or lower scores on a retest without the retest reliability being affected. The retest reliability of the subjects’ scores on the social desirability scale was not affected significantly by the confirmation or the disconfirmation of their experimenter’s expectation, though there was a slight decrease when the experimenter’s hypothesis had been disconfirmed (r = .74 vs. r = .66). When their experimenter’s expectation had been disconfirmed, the reliability of subjects’ anxiety scores was lower (r = +.80) than when their experimenter’s expectation had been confirmed (r = +.90, p of difference = .06). It is interesting to speculate on the possibility that the behavior of a more self-confident experimenter is such as to increase the retest reliability of his subjects’ test scores. It may be that the general retest-taking set provided by such an experimenter is one for consistency of
performance. This set could operate in spite of a general tendency for the experimenter’s manner to affect subjects’ retests uniformly, with the result being like that of adding a constant to an array of scores. Such a constant does not, of course, affect the correlation coefficient. As mentioned, in experiments employing accomplices whose task it is to influence the interviewer, or the examiner, or the experimenter, it is sometimes useful to regard the accomplice as the experimenter and the data collector or clinician as the subject. In the experiment under discussion, the accomplices may be regarded as experimenters of a kind, since they were making the programmed responses. So, too, were the experimenters, but their behavior in carrying out the directions of the experiment could vary, within limits, without their being regarded as incompetent experimenters who were “spoiling” the experiment. There is no direct measure of the experimenters’ behavior in this experiment, as it was not filmed, but there is good evidence that their behavior affected the performance of the accomplices. It will be remembered that half the time accomplices were to give +5 and half the time −5 responses to the photos presented by their experimenters. Sometimes these responses confirmed the experimenter’s expectancy, sometimes they disconfirmed it. Accomplices did not, of course, know that they were confirming or disconfirming by their responses, or that the experimenters had any expectancy at all. All accomplices came close to giving their target ratings of +5 or −5 when considering that the photos’ standardized value was approximately zero. The mean rating given (disregarding signs which, of course, were not disregarded by the accomplices) by accomplices in the four conditions described was 3.99, or about one scale unit too close to the neutral side of the scale of success or failure ( = .27).
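Returning briefly to the reliability argument made earlier in this section, two of its steps can be checked numerically. The sketch below first shows, on invented test and retest scores, that adding the same constant to every retest score leaves the Pearson correlation unchanged. It then compares two independent correlations of .90 and .80 with the Fisher r-to-z method. The source does not say which test produced its reported p of .06, nor how many subjects were in each condition; the group sizes of 57 (about half of the 115 subjects), the scores, and all function names are assumptions made purely for illustration.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical test and retest scores, invented for illustration only.
test_scores = [12, 15, 9, 20, 17, 11, 14]
retest_scores = [13, 14, 10, 19, 18, 10, 15]
# Every subject's retest score rises by the same constant.
shifted_retest = [score + 3 for score in retest_scores]

r_original = pearson_r(test_scores, retest_scores)
r_shifted = pearson_r(test_scores, shifted_retest)
print(f"r before shift: {r_original:.4f}, after shift: {r_shifted:.4f}")


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic and two-tailed p for the difference between two independent correlations."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (math.atanh(r1) - math.atanh(r2)) / standard_error
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-tailed normal probability
    return z, p


# Assumption: about 57 subjects per condition; the source does not give these n's.
z_stat, p_value = fisher_z_difference(0.90, 57, 0.80, 57)
print(f"z = {z_stat:.2f}, two-tailed p = {p_value:.3f}")
```

With these assumed group sizes the difference between the two reliabilities falls near the conventional .05 boundary, the same neighborhood as the p reported in the text, though the agreement should not be read as confirming the assumed n's.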
Table 6-4 shows the mean absolute ratings given by accomplices to the experimenters of each of the four experimental conditions. The means have been converted to standard scores. If the numerical values given the experimenters had been equivalent in the four cells, all standard scores would have been close to zero. As it was, the accomplices assigned at random to the experimenters expecting and receiving ratings of the photos as failures gave ratings too close to the neutral end of the scale. In this they were significantly different from the accomplices in the other three conditions (p < .02).

Table 6-4 Distance from Target Values of Ratings Made by Accomplices

                   Experimenter’s Expectancy
                   +5         −5
Confirmation       +.29       −1.70
Disconfirmation    +.63       +.78

It must be emphasized that only the experimenters were given an expectancy and only the experimenters experienced confirmation or disconfirmation. In some way, the experimenter’s behavior was such as to drive the accomplices’ ratings off the target and into the direction of the neutral point if the experimenter expected and obtained negative ratings from the accomplice-subjects. Because there was no direct observation of the experimenter-accomplice interaction, the interpretation is speculative. It may be that experimenters expecting subjects to see failure in others feel sorry for such subjects. Under a hypothesis of projection, these subjects would be viewed as feeling themselves to be failures. When the experimenters expecting
failure perceptions from their subjects have these expectancies disconfirmed, they need no longer feel sorry for their subjects. However, when they learn from the accomplices in the confirming condition that they do indeed see others as unsuccessful, they may react with special warmth and friendliness to these subjects suspected of feeling inadequate. This warmth, which has been shown to increase the perception of success of others, may similarly influence the accomplices in spite of the fact that they have learned a part to play and, most likely, are quite unaware of being so influenced by the experimenters they believe to be their ‘‘marks’’ or ‘‘targets.’’ But if this interpretation were sound, what about those accomplices who also rate photos as unsuccessful for those experimenters expecting ratings of success? Would we not expect the experimenters to be warmer, too, to these failure perceivers? We would, ordinarily, but the effects of disconfirmation may be to disconcert the experimenter so that he cannot be an effective ‘‘therapist’’ for his unwilling and unneedful ‘‘client.’’ From the evidence presented in this section, it seems clear that the subject’s behavior can affect the experimenter’s behavior which, in turn, may have further effects on the subject’s behavior. Each participant in the interaction affects not only the other but himself as well. The effect on the participant by the participant himself may be direct or indirect. It is direct when he recognizes the response he has made, and this recognition affects the probability of a subsequent response. It is indirect when his response alters the behavior of the other participant in such a way that the new response affects his own subsequent response. It makes no difference whether we speak from the viewpoint of the subject or of the experimenter. It makes no difference whether the experimenter is interacting with a bona fide subject or an accomplice. Experimenters do not simply affect subjects. 
Accomplices do not simply affect their targets. Subjects and targets both ‘‘act back.’’ Characteristics of the laboratory. Riecken (1962) has pointed out how much there is we do not know about the effects of the physical scene in which an experimental transaction occurs. We know little enough about how the scene affects the subject’s response; we know still less about how the particular laboratory setting affects the experimenter. Riecken wondered about the effect on his subjects of the experimenter’s white coat. Perhaps that makes him more of a scientist in his subject’s eyes. Perhaps it does and perhaps, too, it makes him more of a scientist in his own eyes. If ‘‘clothes can make the man,’’ then perhaps, too, a laboratory can make a scientist feel more the part. What impresses and affects the subject may impress and affect the experimenter. Perhaps the most senior of the laboratory directors is not susceptible to such effects. Even if he is not, however, we must ask what percentage of his laboratory’s data he himself collects. It is perhaps more common for more data to be collected by less senior personnel who might be affected by the status of the setting in which they contact their subjects. So many psychology departments are housed in ‘‘temporary’’ buildings with space shortages that one wonders about the systematic effects possible if indeed the physical scene affected both subject and experimenter. There is evidence that subjects’ responses may be affected by the ‘‘laboratory’s’’ characteristics. Mintz (1957) found that negative print photos of faces were judged more energetic and more pleased in a ‘‘beautified’’ room, more ‘‘average’’ in an average room, and less energetic and less pleased in an ‘‘uglified’’ room (p < .01). Observations of the two experimenters who administered the photo-judging tasks
suggested that they, too, were affected by the rooms in which they conducted the experiments. Not only were their own ratings of the photos affected by their locale, but so too was their attitude toward the experiment and their behavior toward their subjects.
Some data collected together with Suzanne Haley show the effects of laboratory room characteristics on subjects and possibly their effects on experimenters’ behavior. The experiment required subjects to rate photos of faces for degree of success experienced. There were 14 experimenters, 86 subjects, and 8 laboratory rooms. Experimenters and subjects were assigned to rooms at random. Each room was rated by 13 experimenters (not including the one who used that room) on the following four dimensions: (1) how professional the room was in appearance, (2) how high the status was of the room’s characteristic user judging from the physical appearance, (3) how comfortable the room was, and (4) how disorderly the room was. None of the room characteristics were significantly related to the subjects’ ratings of the photos of faces. However, the characteristics of the rooms were significantly related to a large proportion of the 26 ratings subjects made of their experimenter’s behavior. Table 6-5 shows the correlations between the experimenters’ behavior as judged by their subjects and the room characteristics of ‘‘professional’’ and ‘‘disordered.’’ The room characteristic of ‘‘status of the user’’ is omitted since its correlation with ‘‘professional’’ was .98. The room characteristic ‘‘comfortable’’ is not listed because only one of the 26 judgments of experimenter behavior reached the .05 level. That one correlation showed the experimenters in more comfortable rooms to have a less pleasant voice (r = −.24, p = .03). Because it occurred as the only significant relationship in a set of 26 correlations, it is best mentioned and put aside.
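The reason for setting a lone significant correlation aside can be made concrete. The following is a minimal sketch, not part of the original analysis, of how many of 26 tests would be expected to reach the .05 level by chance alone and how likely it is that at least one would. It assumes the 26 tests are independent, which ratings of the same experimenters almost certainly are not, so the figures are only a rough guide. The function names are hypothetical.

```python
# Chance expectations for a family of significance tests.
# Assumption (not from the source): the tests are independent.

def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of tests reaching `alpha` when every null is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float) -> float:
    """Probability that one or more tests reach `alpha` by chance alone."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_tests, alpha = 26, 0.05
print(f"expected chance results: {expected_false_positives(n_tests, alpha):.2f}")
print(f"P(at least one):         {prob_at_least_one(n_tests, alpha):.2f}")
```

Under these assumptions a little more than one ‘‘significant’’ correlation is expected by chance, and the probability of at least one is roughly three in four, so a single p = .03 among 26 correlations carries little weight.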
Table 6–5 Experimenter Behavior and Characteristics of His Laboratory

                      Professional          Disordered
Variable              Correlation   p       Correlation   p
Talkative             +.24          .03     +.08          —
Loud                  +.25          .02     +.09          —
Pleasant-voiced       +.04          —       +.19          .10
Expressive-voiced     +.22          .05     +.09          —
Hand gestures         +.32          .005    +.23          .05
Arm gestures          +.21          .05     +.11          —
Trunk activity        +.17          —       +.26          .02
Leg activity          +.31          .007    +.15          —
Body activity         +.22          .05     +.21          .05
Expressive face       +.32          .005    +.05          —
Encouraging           +.12          —       +.25          .02
Friendly              +.19          .10     +.07          —
Relaxed               +.20          .07     +.10          —
Interested            +.16          —       +.20          .07

When the room is a more professional-appearing locale for the experimental interaction, experimenters behave, or at least are seen as behaving, in a more motorically and verbally active manner. They are seen also to be somewhat more at ease and friendly. The pattern is not very different when the laboratory is described as more disordered, and that may be due to the substantial correlation of +.41 between the professionalness and disorderedness of the lab. There is no way to be sure whether the characteristics of the room affected only the subjects’ judgments of their experimenters (as Mintz’s subjects judged photo negatives of faces differently in different rooms) or whether experimenters were sufficiently affected by their surroundings to have actually behaved differently. Both mechanisms could, of course, have operated. If only the subjects’ perceptions were affected, that still argues that we take more seriously than we have Riecken’s (1962) invitation to study the effects of the physical scene on subjects’ responses. If the experimenter appears differently to subjects as a function of the scene, subjects might respond differently for him in some experimental tasks, though in the present task of judging the success of others they did not.
There is one thin line of evidence that the behavior of the experimenters was, in fact, affected by the characteristics of the rooms to which they had been randomly assigned. All experimenters were asked to state the purpose of the experiment at its conclusion, in order that their degree of suspiciousness about the intent of the study might be assessed. In addition, their written statements were assessed for the degree of seriousness with which these graduate students appeared to take the experimenter role. Those experimenters who had been assigned to a more disordered room were less suspicious of the true intent of the experiment. The correlation obtained was −.42, but with the small number of experimenters (14) this was not statistically significant. How seriously the experiment was taken, however, did appear related to the rooms to which they had been assigned. If the room was more disordered, experimenters were more serious in their statements about their perception of the intent of the experiment (r = +.39). In addition, if the room was more comfortable, they were less serious in their written statements (r = −.45).
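The two correlations with seriousness just reported can be combined into a single multiple correlation once the correlation between the two room characteristics is known; the passage that follows gives that value as +.32. The following is a minimal sketch of the standard two-predictor formula, offered as a check and not as the authors' own computation. It takes the correlation for comfortable rooms as negative, as the wording ‘‘less serious’’ implies, and the function names are hypothetical.

```python
import math


def multiple_r(r_y1: float, r_y2: float, r_12: float) -> float:
    """Multiple correlation of y with two predictors, computed from the
    three pairwise correlations (standard two-predictor formula)."""
    r_squared = (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)
    return math.sqrt(r_squared)


def f_ratio(r: float, n: int, k: int = 2) -> float:
    """F statistic for a multiple R based on k predictors and n cases."""
    r2 = r * r
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


r_disordered = 0.39    # seriousness with disordered room (from the text)
r_comfortable = -0.45  # seriousness with comfortable room (sign assumed negative)
r_between = 0.32       # comfortable with disordered (from the text)

R = multiple_r(r_disordered, r_comfortable, r_between)
print(f"multiple R = {R:.2f}")
print(f"F(2, 11)   = {f_ratio(R, n=14):.2f}")
```

With these rounded inputs the formula gives a multiple R of about .72, close to the .73 reported; the small gap is plausibly due to rounding of the three correlations. The corresponding F of roughly 6 on 2 and 11 degrees of freedom lies between the tabled .05 and .01 critical values (about 3.98 and 7.21).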
These two findings taken together are unlikely to have occurred by chance, since the room characteristics of comfortable and disordered were positively correlated (r = +.32). The two room characteristics together predicted the seriousness of subsequent written statements with a multiple R of .73 (p < .02). Since the nature of the experimenter’s room predicted his subsequent behavior, it seems more reasonable to think that it might have affected his behavior during the experiment as well. It is not too clear, however, why a more disordered, less comfortable room should make the experimenters view the experiment more seriously. Perhaps these graduate students, who were not in psychology, felt that a scientifically serious business was carried on best in the cluttered and severely furnished laboratory some of them may have encountered in the psychology departments of colleges at which they were undergraduates, and which seems to fit the stereotype of the scientist’s ascetic pursuit of truth.
It seems, then, that the physical scene in which the subject interacts with his experimenter may affect the subject’s response in two ways. First, the scene may affect directly the subject’s response by making him feel differently. Second, the scene may affect the experimenter’s behavior, which in turn affects the subjects’ responses to the experimental task. Research on the physical scene as an unintended determinant of the subject’s and experimenter’s behavior is in its infancy. What data there are suggest the wisdom of collecting more.
The principal investigator. With more and more research carried out in teams and groups, the chances are increasing that any given experimenter will be collecting data not for himself alone. More and more, there is a principal investigator to whom the experimenter is responsible for the data he collects. The more enduring personal
characteristics of the principal investigator as well as the content and style of his interaction with the experimenter can affect the responses the subjects give the experimenter. In telling of the effects of the subjects’ responses on the experimenter’s behavior, an experiment was mentioned in which the expectation of the experimenter was confirmed half the time and disconfirmed half the time. In that same experiment, two other variables were studied, both relating to the effects of the principal investigator on the data obtained by the experimenters. One of these variables was the affective tone of the relationship between experimenter and principal investigator; the other was the individual differences between principal investigators. After the experimenters contacted their first few subjects (who were actually accomplices) they were given feedback by one of two principal investigators on how well they had done their work as experimenters. Half the experimenters were praised for their performance, half were reproved. Each of two principal investigators contacted half of the 26 experimenters and administered praise and reproof equally often. When the experimenters had been praised before contacting their real subjects, those subjects rated photos as being of more unsuccessful people (mean = −1.60) than when experimenters had been reproved (mean = −.74, p < .05). When experimenters had been either praised or reproved by one of the principal investigators, their subjects subsequently rated people as less successful (mean = −1.57) than when experimenters had been either praised or reproved by the other principal investigator (mean = −.78, p < .05). Both the kind of person the principal investigator is, as well as the content of his interaction with the experimenter, affect the responses subjects give their experimenter.
Praising an experimenter (and contact with a certain type of principal investigator) may have the same effect on his behavior toward his subjects that confirming his expectation does. He feels, and therefore acts, in a more professional, self-confident manner, a pattern of behavior already shown to lead to ratings by subjects of others as less successful. A reminder is in order that we do not know the reasons for this reaction on the part of subjects to a more professional, confident, higher status experimenter. It has been suggested earlier that, in a military or an academic setting, a higher status experimenter may evoke more negative feeling which is displaced onto the stimulus persons. In the military setting, negative feeling toward an officer may be well institutionalized. In the academic setting, the higher status or more professional-acting experimenter may be seen as a more effective ‘‘poker and pryer’’ into the mind of the subject (Riecken, 1962); he is, therefore, more to be feared. It may even be that undergraduate subjects ‘‘know’’ or intuit something of Freud’s concept of projection and feel that if they see too much success in photos they will be regarded as immodest by the higher status experimenter. As already shown, ‘‘proper’’ responses are more often given to data collectors of higher status in both laboratory and field research.
In the experiment described, subjects had been tested for anxiety and social desirability before and after the experiment. There were no effects on subjects’ social desirability scores associated with which of the two principal investigators had contacted their experimenter early in the experiment. However, subjects whose experimenters had been either praised or reproved by one of the principal investigators showed a significantly greater increase in anxiety over the course of the experiment than did subjects whose experimenters had earlier been contacted by the other principal investigator (χ² = 7.71, p < .01).
There is additional evidence of the effect of the principal investigator on the data obtained by his research assistants. In this experiment there were 13 principal investigators, each of whom was randomly assigned two research assistants (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). Before the principal investigators received their ‘‘research grants,’’ which allowed them to hire their research assistants, they had themselves served as experimenters in the person perception task. The principal investigators’ scores on the Taylor Anxiety Scale correlated significantly with their subjects’ ratings of the success of others (rho = +.66, p = .03). Remarkably enough, the principal investigators’ anxiety also predicted the photo ratings their assistants obtained from their different sample of subjects (rho = +.40, p < .07). In ‘‘real-life’’ research situations, such a correlation could be enhanced by the possibility that principal investigators employ research assistants who are similar to themselves in personality. A correlation between an attribute of the principal investigator and his assistant’s obtained data could then be nothing more than the effect of the assistant’s personality on the subject’s response. This has been well established by now and is not so intriguing. In the study described, however, assistants were assigned at random to their principal investigator. The correlation between the principal investigator’s anxiety level and that of his assistants was only .02. Therefore, it must be that the nature of the principal investigator’s interaction with his assistants altered their behavior in such a way as to affect their subjects’ responses. The principal investigator affected the subject by affecting the data collector. It should be emphasized that the principal investigator never even saw the subjects who had been assigned to his research assistants.
In this same experiment, there was no effect of the principal investigator’s need for social approval on the photo ratings obtained by his assistants, although that correlation (.16) was in the same direction as that between the principal investigator’s need for approval and his own subjects’ perception of the success of persons pictured in photos (.49). Finally, the correlation between the average ‘‘success’’ ratings obtained by any principal investigator and those obtained by his own research assistants from a different sample of subjects was +.38, which, for the sample of 13 principal investigators, was not significant. Omitting the three female principal investigators raised this correlation to +.75, p < .02, suggesting a possible interaction effect. Table 6-6 shows that such an interaction did occur. The mean photo ratings of success, in standard score form, obtained by the experimenters are shown separately for those whose principal investigators had themselves obtained mean ratings of either success or failure from their own subjects. When the principal investigator was a male, his assistants obtained ratings significantly similar to those he had obtained. When the principal investigator was female, the assistants obtained data significantly opposite to the data she had obtained. The sample of female principal investigators, especially, is small but the data are clear. The responses a subject gives his experimenter depend not only, as we saw much earlier, on the sex of the experimenter, but on the sex of his experimenter’s principal investigator as well.

Table 6–6 Experimenters’ Data as a Function of Data Obtained by Their Male and Female Principal Investigators

                                 Photo ratings obtained by principal investigators
Sex of principal investigator    Success    Failure    Difference    t       p
Male                             +.63       −.54       +1.17         2.09    .07
Female                           −1.31      +1.22      −2.53         2.38    .05
Difference                       +1.94      −1.76      +3.70
t                                2.07       2.35       3.12
p                                .08        .05        .02

Finally, there is an experiment in person perception in which, after training the experimenters, the principal investigators called their attention to the fact that only if they followed proper experimental procedures could the experimenters expect to obtain the results desired by the principal investigators (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964b). There were 15 male experimenters who conducted the person perception experiment with a total of 60 female subjects. Those eight experimenters whose principal investigators had made them self-conscious about their procedure obtained ratings of persons as significantly less successful (mean = −.57) than did experimenters who had not been made self-conscious (mean = +.37, p < .06). A subsample of the interactions had been filmed so that there were clues available as to how the more self-conscious experimenters might have behaved differently toward their subjects so as to obtain judgments of others as being more unsuccessful. The observations again come from three groups of observers. One group had access to the film and sound track, one group saw the film but did not hear the sound track, and one group heard only the sound track. None of the observations made by this last group of observers was related to the experimentally created self-consciousness of the experimenters.
During the brief preinstructional transaction, observers who saw only the film found a tendency for more self-conscious experimenters to behave more dominantly (r = +.41, p < .10). When the sound track was added to the films, these experimenters, who had been ‘‘put on the spot’’ by the principal investigators, were judged less relaxed (r = −.45, p = .06) and less courteous (r = −.40, p < .10). During the instruction-reading period, as Table 6-7 shows, the behavior of the more procedure-conscious experimenters was judged less likable from observing the films with or without the sound track. They were judged less courteous only when their tone of voice could be heard, less interested only when their tone could not be heard. Judged to be more slow-speaking from the observation of the sound film, that was not the case from a hearing of the sound track alone. Although the addition of information via a different sense modality does not always add usable information, it sometimes does, even when we would not expect it to. Finally, we note that more self-conscious experimenters are judged less honest. So the picture we have is of the principal investigator’s admonition affecting the experimenter’s behavior by making him less likable, less courteous in tone, faster speaking, and more ‘‘dishonest,’’ by which is meant, probably, more subtly ‘‘pushy’’ or influential. (The particular subtle influence they were probably exerting on their subjects will be discussed later on in Part II. It has to do with the expectancy for particular responses from the subject.) In an earlier chapter, that which dealt with observer effects, we saw that subjects, as well as observers, could be quite sensitive to the ‘‘bias’’ of the experimenter and were likely to code this information into the category of ‘‘honesty.’’ The general behavior shown by these experimenters is, as we have seen earlier, that kind of behavior which leads often, but not always, to subjects’ responding with more negative ratings of the success of the stimulus persons.

Table 6–7 Instruction-Reading Behavior as a Function of Procedure Consciousness Induced by Principal Investigators

                      Observation channels
                      Sound films        Silent films
Variable              r        p         r        p
Likable               −.45     .06       −.41     .10
Courteous             −.43     .07       −.15     —
Interested            −.07     —         −.40     .10
Slow-speaking         −.43     .07       —        —
Honest                −.42     .08       −.58     .01

The results of the last three studies described show that the interaction with the principal investigator can affect the experimenter’s interaction with his subjects and, thereby, the responses he obtains from them. The precise direction of the effect, however, seems difficult to predict. In the first study described, the principal investigator’s reproof led to the experimenter’s obtaining ratings of others as more successful. In the second study, the more anxious the principal investigator was, the more successful were the perceptions of others obtained by his research assistants.
An anxious principal investigator may affect the experimenter as a reproving one, so these two studies are not inconsistent. However, we cannot assume that the more anxious principal investigator simply made the experimenter more anxious and that this altered anxiety level affected the subjects so that they perceived more success in others. It must be remembered that in the discussion of the effects of experimenters’ anxiety, one study showed that more anxious experimenters obtained ratings of others as more successful but another study showed the opposite effect. In the third experiment, the only one in which we could see what happened in the experimenter-subject interaction, experimenters who were made more conscious of their procedures by their principal investigators obtained ratings of the stimulus people as less successful. The opposite result, although less intuitively appealing, would seem to have been more consistent with the results of the other two studies. More self-conscious experimenters should perhaps have been somewhat like reproved ones or like those in contact with an anxious principal investigator. We are left with little confidence that we can predict the specific effect on subjects’ responses from a knowledge of the nature of the experimenter’s interaction with the principal investigator. We can have considerable confidence, however, that the nature of the interaction between experimenter and principal investigator can affect the subjects’ responses in some way. Not all the evidence for this assertion comes from the person perception experiment. Mulry’s experiment (1962) called for the experimenters to administer a pursuit rotor task to their subjects. Experimenters had been trained to administer this task by having themselves served as subjects for the principal investigator. Half the experimenters were told by the principal investigator that they were very good at the perceptual-motor skills involved. 
Half the experimenters were led to believe their
own performance was not a good one. There was no effect of this feedback on the performance of the experimenters’ subjects. However, experimenters who had been complimented by their principal investigator were perceived quite differently by their own subjects than were the less fortunate experimenters. Complimented experimenters were seen to behave in a more interested (r = +.31, p = .01), more enthusiastic (r = +.24, p = .05), and more optimistic (r = +.29, p = .02) manner. From this it seems that even though the behavior of the experimenter is affected by his interaction with the principal investigator, that does not always affect the subject to respond differently.
The next, and final, study to be considered shows an instance in which it does (Rosenthal, Kohn, Greenfield, & Carota, 1966). The experimental task was a standard one for studies of verbal conditioning. Subjects constructed sentences and, after the establishment of a basal level, were reinforced by the experimenter’s saying ‘‘good’’ whenever first-person pronouns were employed. There were 19 experimenters who contacted a total of 60 subjects. Before the experiment began, the principal investigators gave experimenters indirect and subtle personal evaluations. Half the experimenters were evaluated favorably, half were evaluated unfavorably. Within each of these conditions, half the evaluations dealt with the experimenter’s intelligence, half with his influenceability. Thus, half the favorably evaluated experimenters were subtly informed that they were regarded as very intelligent by the principal investigators; half were evaluated as resistant to manipulation by others. Unfavorably evaluated experimenters were led to believe they were regarded by the principal investigators as either less intelligent or more manipulatable by others.
Experimenters who felt more favorably evaluated by their principal investigators were significantly more successful at obtaining increased use of first-person pronouns by their subjects. Table 6-8 shows the mean increase in the number of such words emitted from the operant level to the end of the experiment. There was no difference in the magnitude of the effect associated with the particular attribute evaluated. All ten of the experimenters who felt favorably evaluated obtained an increase in their subjects’ use of the reinforced words (p = .001), but only five of the nine who felt unfavorably evaluated obtained any increase (p = 1.00).

Table 6–8 Conditioning Obtained by Experimenters as a Function of Their Principal Investigator’s Evaluation

                      Evaluation
Attribute             Favorable   Unfavorable   Difference   t       p
Intelligence          3.0         1.3           1.7          1.89    .10
Influenceability      3.1         0.8           2.3          2.56    .03

An interesting additional finding was that even during the operant level of responding, before any reinforcements were provided, experimenters obtained a greater number of first-person pronouns (mean = 9.8) when their principal investigator’s evaluation was favorable than when it was unfavorable (mean = 8.3, p = .10). This was not an artifact based on a relationship between the operant level and the operant to terminal block increase. The correlation between operant level and conditioning score was only .10. Perhaps an experimenter who feels favorably evaluated by his supervisor makes his subjects more willing to make up more personal statements, quite apart from being a more effective reinforcer.
In this study, subjects were asked to describe their experimenter’s behavior in a series of 28 rating scales. Table 6-9 shows the larger correlations between subjects’ observations of the experimenter’s behavior and the favorableness of his evaluation by the principal investigator. The correlations are what we would expect. Feeling more favorably evaluated, the experimenter is less tense and more pleasant, and these characteristics could reasonably make him a more effective reinforcer and a person for whom more ‘‘personal’’ (i.e., first person) sentences are constructed.

Table 6–9 Experimenters’ Behavior as a Function of Favorable Evaluation by Their Principal Investigator

Variable              Correlation   p
Casual                +.33          .01
Courteous             +.27          .05
Pleasant              +.24          .08
Expressive-voiced     +.24          .08
Trunk activity        −.26          .05

Earlier, the inconsistency of the effect of a principal investigator’s interaction with the data collector on the subject’s response was noted. It is interesting, however, that in those three studies in which the experimenter’s behavior was observed, either by the subjects themselves or by external observers of the sound films, the results do show a certain consistency. Experimenters who, in their interaction with the principal investigator, were made to feel (1) less self-conscious, (2) more successful at the experimental task, and (3) more intelligent or less manipulatable, all seemed to behave toward their subjects in a more positive, likable, interpersonal style. In two of these three experiments, all employing different tasks, this behavior on the part of the experimenter probably affected the responses of the subjects in their performance of the experimental tasks.
Conclusion
From all that has been said and shown it seems clear that there are a great many variables that affect the subject’s response other than those variables which, in a given experiment, are specifically under investigation. The kind of person the experimenter is, how he or she looks and acts, may by itself affect the subject’s response. Sometimes the effect is a direct and simple one, but sometimes, too, the effect is found to interact with subject characteristics, task characteristics, or situational characteristics. Not only the kind of person the experimenter ‘‘is’’ but the things that happen to him before and during the experiment affect his behavior in such a way as to evoke different responses from his subjects. The subject’s behavior may have feedback effects on his own subsequent behavior not only directly but also by changing the experimenter’s behavior, which then alters the subject’s response. The room in which the experiment is conducted not only may affect the subject’s response directly but may affect it indirectly as well, by also affecting the behavior of
the experimenter as he interacts with his subject. Such a change in experimenter behavior, of course, alters the experimental conditions for the subject. The experimenter and the subject may transact the experimental business as a dyad, but often there is, in effect, a triadic business. The non-present third party is the principal investigator, who, by what he is, and what he does, and how he does it in his dyadic interaction with the experimenter, indirectly affects the responses of the subject he never comes to meet. He changes the experimenter’s behavior in ways that change the subject’s behavior.
Of all the possible variables associated with the experimenter, only those have been discussed for which enough evidence has been accumulated that we may say these often make a substantial difference. Probably they make less of a difference where the phenomena under investigation are very robust. There are experiments in psychophysics, learning, and psychopharmacology in which the average obtained responses may be only trivially (even if ‘‘significantly’’ in the statistical sense) affected by the experimenter’s attributes. Increasing dosages of ether are more likely to produce unconsciousness, regardless of the attributes of the experimenter, though the shape of the curve may be altered by his unique characteristics and behaviors. Most of the behavioral science research carried out today is of the ‘‘50 subjects, p = .01’’ type. That means, of course, accounting for something like 13 percent of the variance in subjects’ responses from a knowledge of our treatment conditions or a reduction in predictive errors of about 6 percent. Because the effects of our independent variables, though unquestionably ‘‘real,’’ are usually so fragile, we must be especially concerned about the effects of experimenter attributes.
The methodological implications of the experimenter effects discussed will be treated more fully in Part III. Only a few points need be mentioned here.
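The figures quoted above for the ‘‘50 subjects, p = .01’’ study can be checked with a short calculation. The following is a minimal sketch that converts a just-significant t into a correlation, squares it to get the proportion of variance accounted for, and computes 1 − sqrt(1 − r²) as the reduction in predictive error. Two assumptions are not from the text: the two-tailed critical t of about 2.68 for 48 degrees of freedom is taken from standard tables, and ‘‘reduction in predictive errors’’ is read as the index of forecasting efficiency. The function names are hypothetical.

```python
import math


def r_from_t(t: float, df: int) -> float:
    """Correlation equivalent to a t statistic on df degrees of freedom."""
    return t / math.sqrt(t * t + df)


def error_reduction(r: float) -> float:
    """Proportional reduction in the standard error of prediction,
    1 - sqrt(1 - r^2) (the index of forecasting efficiency)."""
    return 1.0 - math.sqrt(1.0 - r * r)


n_subjects = 50
df = n_subjects - 2
t_critical = 2.68  # assumed: tabled two-tailed .01 value of t near 48 df

r = r_from_t(t_critical, df)
print(f"r                      = {r:.2f}")
print(f"variance accounted for = {r * r:.1%}")
print(f"error reduction        = {error_reduction(r):.1%}")
```

Under these assumptions the correlation is about .36, the variance accounted for is about 13 percent, and the reduction in predictive error falls between 6 and 7 percent, in line with the figures given in the text.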
First, very little has been said so far about the effects of experimenter attributes on the ‘‘results of research.’’ Generally the wording has been in terms of effects on the subject’s response. Such effects may alter the ‘‘results of research,’’ but they may not. In that research which seeks to estimate a population mean from the mean of a sample, experimenter effects do change the ‘‘results of research.’’ Examples include much of the work performed by survey research organizations. If we want to estimate the average degree of favorableness to a national policy, a well-dressed, high-status-appearing, older gentleman is likely to draw responses different from those obtained by a more shabbily dressed, bearded young man presumed to be from a nearby college. If we want to standardize a new test—that is, estimate the national mean and standard deviation—or do sex behavior surveys, the results may be affected directly by the experimenter’s effects on his subjects’ responses. But much, perhaps most, psychological research is not of this sort. Most psychological research is likely to involve the assessment of the effects of two or more experimental conditions on the responses of the subject. If a certain type of experimenter tends to obtain slower learning from his subjects, the ‘‘results of his experiment’’ are affected not at all so long as his effect is constant over the different conditions of the experiment. Experimenter effects on means do not necessarily imply effects on mean differences. In the survey research or test standardization type of research, the data tend to be collected by many different interviewers, examiners, or experimenters. We may be fortunate, and in the given sample of data collectors the various effects due to their characteristics or experiences may be canceled out. However, they may not be, as when there is a tendency for the data collectors to be selected on strict criteria,
implicit or explicit, in such a way that the N different experimenters are more nearly N times the same experimenter. There will be more to say about this in Part III. In the laboratory experiment, the effect of a given experimenter attribute or experience may interact with the treatment condition. We have seen earlier that this does happen when the experimenter is aware, and usually he is, which subjects are undergoing which different treatments. To use two experimenters, one for each treatment condition, of course, confounds any effects of the experimenter with the effects of the treatments, so that an assessment of treatment effects is impossible. Any method that makes it less likely that experimenter effects will interact with treatment conditions would reduce our problem of assessing adequately the effects of our treatment conditions. More will be said of this in Part III, but for now, the not very surprising conclusion is that for the control of the effects of experimenter attributes, as for the control of the other effects discussed in earlier chapters, we must rely heavily on the process of replication.
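The distinction drawn in this section between experimenter effects on means and on mean differences can be sketched with a small numerical example. The numbers are invented for illustration only: a hypothetical treatment mean of 12, a control mean of 10, and arbitrary experimenter shifts. The sketch contrasts an experimenter effect that is constant across conditions, one that interacts with the conditions, and one that is confounded with them because a different experimenter runs each condition.

```python
# Hypothetical numbers throughout: the condition means and the
# experimenter shifts are invented purely to illustrate the argument.

TRUE_MEANS = {"treatment": 12.0, "control": 10.0}


def observed_difference(shift_in_treatment: float, shift_in_control: float) -> float:
    """Treatment-minus-control difference after adding experimenter shifts."""
    treatment = TRUE_MEANS["treatment"] + shift_in_treatment
    control = TRUE_MEANS["control"] + shift_in_control
    return treatment - control


# Case 1: one experimenter lowers every score by the same amount.
constant = observed_difference(-3.0, -3.0)

# Case 2: the same experimenter's effect differs by condition (an interaction).
interacting = observed_difference(-3.0, 0.0)

# Case 3: a different experimenter runs each condition, so the experimenters'
# effects are confounded with the treatment.
confounded = observed_difference(+1.5, -3.0)

print("constant effect:   ", constant)
print("interacting effect:", interacting)
print("confounded effect: ", confounded)
```

Only in the first case does the observed difference equal the true difference of 2. In the other two the experimenter's contribution cannot be separated from the treatment effect within a single experiment, which is one reason the text leans on replication.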
7 Experimenter Modeling
In this chapter the discussion will turn to an ‘‘attribute’’ of the experimenter which, like those considered just before, is also defined in terms of the particular experiment being conducted. That attribute is the performance of the experimenter himself of the same task he sets his subjects. For some experiments, then, this experimenter attribute will be a more enduring characteristic, such as intelligence or authoritarianism. For other experiments, this attribute will be a less enduring one, such as an opinion on a timely public issue, though such less enduring attributes may often be related to more enduring ones. When there is a significant relationship between the experimenter’s own performance of the particular task he requires of his subjects and the performance he obtains from his subjects, we may speak of an experimenter’s ‘‘modeling’’ effect. The evidence for this effect comes from the literature of survey research, clinical psychology, and laboratory experiments.
Survey Research
In the area of survey research, many investigators have assessed the effect of the interviewer’s own opinion, attitude, or ideology on the responses obtained from respondents. The basic paradigm has been to ask the interviewers who are to be used in a given project to respond to the questionnaire themselves. The responses these interviewers subsequently obtain from their respondents are then correlated with their own responses. The correlation obtained becomes the estimate of opinion bias or ideology bias. The interpretation of such a correlation is not, however, always straightforward. If interviewers are allowed any choice in the selection of interviewees, they may simply be selecting like-minded respondents. If interviewers are not allowed any choice in interviewee selection but respondents are not randomly assigned to interviewers, the same problem may result. Thus, if interviewers are each assigned a sample of respondents from their home neighborhoods, the opinions of interviewers and respondents are likely to come precorrelated, because opinions are related to neighborhoods. If, however, respondents are randomly assigned to interviewers, and if errors of observation, recording, and coding can be eliminated, at least statistically, the resulting correlation between interviewers’ opinions and their
respondents’ opinions provides a good measure of modeling effects. Evidence for the phenomenon of interviewer modeling effects has been discussed and summarized elsewhere (Hyman et al., 1954; Maccoby & Maccoby, 1954). Here it will do to note that, in some of the many relevant studies, modeling effects were found to occur and in others they were found either to occur not at all or only trivially. Where modeling effects have been found, they have ordinarily been positive. That is, the subjects’ responses have tended to be similar in direction to those of the interviewer. In a minority of cases, however, the effects of the interviewer’s own opinion or ideology have been negative, so that subjects responded in a direction significantly opposite to that favored by the interviewer himself (Rosenthal, 1963b). An early study by Clark (1927), while not definitive, is illustrative of positive modeling effects. Two interviewers inquired of 193 subjects how much of their time was devoted to various daily activities. One of the interviewers was more athletically inclined than the other, and he found that his subjects reported a greater amount of time spent in athletic activities than did the subjects contacted by the less athletic interviewer. It is possible that the sampling problems mentioned or observer, recorder, or interpreter effects accounted for the obtained modeling effect. It seems equally reasonable to think that in the presence of the interviewer appearing and behaving more athletically, the respondents actually gave more athletic responses. Perhaps while in this interviewer’s presence they were better reminded of the athletic activities in which they did engage. Or it could also have been that it seemed to respondents more ‘‘proper’’ to be more athletic in interaction with an athlete from a college campus. 
On many campuses, an athlete is attributed a higher status, and we have seen in our discussion of this attribute that subjects do tend to give more ‘‘proper’’ responses to higher status data collectors. A more recent study reported by Hyman is equally interesting (1954). The data were collected by the Audience Research Institute in 1940. Respondents were given a very brief description of a proposed motion picture plot and were asked to state whether they would like to see such a movie. There were both male and female interviewers to contact the male and female subjects. Responses obtained by interviewers depended significantly (p < .005) on their sex and, perhaps, on the respondent’s inference of what movies the interviewers would, because of their sex, themselves enjoy. One of the film plots described was that for ‘‘Lawrence of Arabia.’’ When male and female subjects were asked about this film by interviewers of their own sex, male subjects were 50 percent more often favorable to the film than were female subjects. However, when the interviewer was of the opposite sex, male subjects responded favorably only 14 percent more often than female subjects. It appeared plausible to reason that subjects responded by ‘‘preferring’’ those movies which, judging from the sex of the interviewer, they thought would be preferred by them. It is interesting to raise the question of whether subjects of field research or laboratory research tend, in general, to respond in such a way as to reduce the perceived differences between themselves and the data collector with whom they interact. No answer is available to this question at the present time, and surely it is highly oversimplified, as an assertion. It may, however, be a reasonable one if both the participants’ attributes and the nature of the data collection situation are considered. 
From all we seem to know at present, these factors are all likely to interact with the subject’s motives to be less different from the data collector. Two sources of
such motives are obvious. One is the wish to be similar in order to smooth the social interaction. The other is the wish to be more like a person who very often enjoys, either continuously or at least situationally, a position of higher status. To ‘‘keep up with’’ that Jones who is a data collector, one must behave as one believes a Jones would behave in the same situation.
Clinical Psychology
It is often said of clinical psychological interactions that the clinician models his patients somewhat after his own image. When the clinical interaction is the protracted one of psychotherapy it seems especially easy to believe that such effects may occur. If it seemed plausible to reason that subjects in research tended to respond as they believed the experimenter would, then it is the more plausible to argue that such effects occur when the ‘‘subject’’ is a patient who may have all the motives of the experimental subject to respond in such a manner and, in addition, the powerful motive of hope that his distress may be relieved. Graham (1960) reports an experiment that is illustrative. Ten psychotherapists were divided into two groups on the basis of their own perceptual style of approach to the Rorschach blots. Half the therapists tended to see more movement in the ink blots relative to color than did the remaining therapists, who tended to see more color. The 10 therapists saw a total of 89 patients for eight months of treatment. Rorschachs administered to the patients of the two groups of therapists showed no differences before treatment. After treatment the patients seen by the relatively more movement-perceiving therapists saw significantly more movement themselves. Patients seen by the relatively more color-perceiving therapists saw significantly more color after treatment. This is exactly the sort of evidence required to establish modeling effects in the psychotherapeutic relationship. There is, of course, considerable literature on the effects of psychotherapy, and when changes have been shown to occur, the behavior of the patient becomes more like that of his therapist. This body of evidence is not directly relevant to a consideration of modeling effects.
The reason is that assuming therapists’ behavior to be more ‘‘normal’’ than their patients’, and defining patient improvement as a change toward more normal behavior, it must follow that patients change their behavior in the direction of their therapist’s behavior when they improve. Therefore, evidence of the kind provided by Graham is required. What must be shown is not simply that patients become more like therapists, but that they become more like their own particular therapist than does the patient of a different therapist. Further evidence for modeling effects of the therapist comes from the work of Bandura, Lipsher, and Miller (1959), who found that more directly hostile therapists were more likely to approach their patients’ hostility, whereas less directly hostile therapists tended to avoid their patients’ hostility. The approach or avoidance of the hostile material, then, tended to determine the patient’s subsequent dealing with topics involving hostility. Not surprisingly, when therapists tended to avoid the topic, patients tended to drop it as well. The work of Matarazzo and his colleagues has already been cited (Matarazzo, Wiens, & Saslow, 1965) in connection with the effects of the subject on his experimenter’s response. That, of course, was material quite incidental to their interest in the anatomy of the interview. The amount of evidence they have
Table 7–1 Subjects’ Changes in Duration of Speech as a Function of Interviewer’s Changes (After Matarazzo, Wiens, & Saslow, 1965, p. 199)

                        Changes in speech duration
Interviewer’s Target    Interviewer    Subject
+200%                   +204%          +109%
+100                    +94            +111
+100                    +87            +93
0                       +4             +2
0                       0              −8
−50                     −38            −43
−50                     −48            −45
−67                     −64            −51
accumulated is compelling. It seems clear, as one example of their work, that increases in the speaking time of the interviewer are followed by increases in the speaking time of the subjects, who in this case were 60 applicants for Civil Service employment. Table 7-1 shows the increases and decreases in the average length of subjects’ speaking time as a function of increases and decreases in the interviewer’s speaking time. (The first column shows the target values the interviewer was trying to achieve, by and large very successfully.) The rank correlation between changes in the interviewer’s length of utterance and his subjects’ changes in length of utterance was +.976 (p < .001). On the average, subjects’ utterances are five or six times longer than those of the interviewer. But clearly, from these data, patterns of behavior shown by the interviewer can serve as the blueprint for how the subject should respond. Similar results have been reported by Heller, Davis, and Saunders (1964). There were 12 graduate student interviewers to talk with a total of 96 subjects. Half the interviewers were instructed to behave in a verbally more active manner, and half were instructed to be less active verbally. During every minute of the 15 minutes recorded, subjects spoke more if their interviewer had been more verbally active than if he had been less verbally active. Subjects contacted by more talkative interviewers spent about 16 percent more time in talk than did their peers assigned to more laconic interviewers (p < .02). In another connection we cited the work of Heller, Myers, and Vikan-Kline (1963). Now we need only a reminder of their findings relevant to the present discussion. Friendlier ‘‘clients’’ (experimenters) evoked friendlier interviewer (subject) behavior, an example of positive modeling effects.
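The rank correlation of +.976 reported for the data of Table 7-1 can be checked with a short computation. The sketch below assumes that the entries in the lower rows of the table are decreases (negative percentages), as the targets of the interviewer imply, and it uses the simple formula for Spearman's rho that applies when there are no tied ranks. The function names are illustrative only.

```python
def ranks(values):
    """Rank values from largest (rank 1) to smallest.

    Assumes no tied values, which holds for the two columns used here.
    """
    ordered = sorted(values, reverse=True)
    return [ordered.index(v) + 1 for v in values]


def spearman_rho(x, y):
    """Spearman rank correlation for untied data: 1 - 6*sum(d^2) / (n(n^2-1))."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Percentage changes in speech duration from Table 7-1
# (lower rows taken as decreases).
interviewer = [204, 94, 87, 4, 0, -38, -48, -64]
subject = [109, 111, 93, 2, -8, -43, -45, -51]

rho = spearman_rho(interviewer, subject)
print(round(rho, 3))  # 0.976
```

Only the first two rows exchange ranks between the two columns, so the sum of squared rank differences is 2 and rho is 1 − 12/504, or about .976.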
More dominant ‘‘clients’’ evoked less dominant interviewer behavior, an example not only of negative modeling effects but also of the fact that interviewers, and presumably also experimenters, may sometimes be modeled by their ‘‘clients’’ or subjects just as these are modeled by the interviewer or experimenter. There is a sense in which the studies described so far are not true examples of modeling effects, though they are relevant to a consideration of such effects. The reason is that the therapists or interviewers were not assessed at exactly the same task or performance at which their patients, interviewees, or subjects were assessed, and not necessarily by the interviewer himself. These studies have all been instructive, however, in showing that the behavior of the interviewer along any dimension may
affect the analogous behavior of the subject, though we are still unsure of the mechanisms by which these effects operate. There is a difference, of course, in the degree of structure provided for therapists, interviewers, and experimenters as to how closely they must follow a given program or plan. In all the studies described so far, the clinicians were relatively free as to what they could say or do at any time. In the studies by Matarazzo and his colleagues only the length of each utterance was highly programmed, not the content of the utterance. In the studies by Heller and his colleagues the degree of dominance and friendliness was programmed into the stimulus persons, but they, too, were free to vary other aspects of their behavior as they felt it to be required. Of even greater relevance, then, to an understanding of the effects of the more highly programmed experimenter is the study of the effects of the psychological examiner. The manuals for the administration of psychological tests are often as explicit as the directions given to a psychological experimenter in a laboratory. The reduced freedom of the examiner and of the experimenter to behave as they would should reduce the magnitude of modeling effects, or so it would seem. One experiment employing psychological examiners, and bearing on the consideration of modeling effects, was carried out by Berger (1954). All eight of his examiners had been pretested on the Rorschach. After each of the examiners had administered the Rorschach to his subjects, correlations were computed between the examiners’ own Rorschach scores on 12 variables and the responses they had subsequently obtained from their subjects. Two of the 12 variables showed a significant positive correlation between the examiners’ scores and their subjects’ scores. Examiners who tended to organize their percepts into those very commonly seen obtained more such popular percepts from their subjects (rho = +.86, p = .01).
Examiners who tended to use the white space of the ink blots more often, obtained from their subjects a greater use of such white space (rho = +.80, p = .03). Another example of modeling effects in the more standard clinical interaction of psychological testing comes from the work of Sanders and Cleveland (1953). Again the Rorschach was the test administered. All 9 of the examiners were given the Rorschach, and they, in turn, administered the Rorschach to 30 subjects each. For each examiner and for each subject a Rorschach anxiety score and a Rorschach hostility score were computed. There was no relationship between the examiner’s own anxiety level and the mean anxiety level reflected in the Rorschachs he obtained. However, those three examiners whose own hostility scores were highest obtained significantly higher hostility scores from their subjects (mean = 16.6) than did those three examiners whose own hostility scores were lowest (mean = 13.5, p < .05). Two more informal reports conclude the discussion of modeling effects found in clinical settings. Funkenstein, King, and Drolette (1957) were engaged in a clinical experiment on reactions to stress in which it was necessary to test patients. Typically, patients showed anger in their responses. However, one of the experimenters found himself filled with doubts and anxieties about the studies undertaken. Every patient tested by this experimenter showed severe anxiety responses. Finally, the classic study of Escalona (1945) is cited to illustrate that the effects under discussion do not depend on verbal communication channels. The scene of the research was a reformatory for women in which the offenders were permitted to have their babies. There were over 50 babies altogether, and 70 percent of these were less than one year old.
Part of the feeding schedule was for the babies to be given orange juice half the time and, on alternate days, tomato juice. Often the babies, many under four months of age, preferred one of these juices but disliked the other. The number of orange juice drinkers was about the same as the number of tomato juice drinkers. The ladies who cared for the babies also turned out to have preferences for either orange or tomato juice. When the feeders of the baby disliked orange juice, the baby was more likely to dislike orange juice. When the feeder disliked tomato juice, the baby similarly disliked tomato juice. When babies were reassigned a new feeder who preferred the type of juice opposite to the one preferred by the baby, the baby changed its preference to that of its feeder.
Laboratory Experiments
A number of laboratory studies mentioned earlier have suggested that even in these somewhat more highly structured interactions, modeling effects may occur. Thus Cook-Marquis (1958) found that high-authoritarian experimenters were unable to convince their subjects of the value of nonauthoritarian teaching methods. Presumably, such experimenters could not convincingly persuade subjects to accept communications they themselves found unacceptable. Barnard’s work (1963) similarly suggested the operation of modeling effects. He used a phrase association task and found that subjects contacted by experimenters showing a higher degree of associative disturbance also showed a higher degree of disturbance than did subjects contacted by experimenters showing less such disturbance. Even before such experiments had been conducted, F. Allport (1955) had suggested that the experimenter might suggest to the subject, quite unintentionally, his own appraisal of the experimental stimulus and that such suggestion could affect the results of the experiment. Similarly, in the area of extrasensory perception the work of Schmeidler and McConnell (1958) has raised the question of whether the experimenter’s belief in the phenomenon of ESP could influence the subject’s belief in ESP. In this area of research such belief tends to be associated with performance at ESP tasks. Subjects who believe ESP to be possible (‘‘sheep’’) seem to perform better than subjects who believe ESP to be impossible (‘‘goats’’). From this it follows (and perhaps this should be more systematically investigated) that experimenters who themselves believe in ESP may, by affecting their subjects’ belief, obtain better performance at ESP tasks than do experimenters not believing in ESP.
Most of the research explicitly designed to assess the modeling effects of the data collector has come from the field of survey research, some has come from the area of clinical psychological practice, and, until very recently, virtually none has come from laboratory settings. In part, the reason for this may be the greater availability for study of the interviewers of field research and even of clinicians compared to the availability for study of laboratory experimenters. But that does not seem to be the whole story. There is a general belief, perhaps largely justified, that the greater ‘‘rough-and-tumble’’ of the field and of the clinic might naturally lead to increased modeling and related effects. The behavior of the interviewer and of the clinician is often less precisely programmed than the behavior of the experimenter in the laboratory, so that their unintended influences on their subjects and patients could come about more readily. In the laboratory, it is often believed, these unintended
effects are less likely because of the more explicit programming of the experimenter’s behavior. The words ‘‘experimenter behavior’’ are better read as ‘‘instructions to subjects,’’ since this is usually the only aspect of the experimenter’s behavior that is highly programmed. Sometimes, when the experimenter is to play a role, he is told to be warm or cold, and then other aspects of his behavior are more programmed, but still not very precisely so. Of course, we cannot program the experimenter so that there will be no unplanned influence on his subjects. We cannot do this programming because we do not know precisely what the behavior is that makes the difference— that is, affects the subjects to respond differently than they would if the experimenter were literally an automaton. In the light of these considerations, we should not be too surprised to learn that modeling effects may occur in the laboratory as well as in the field and in the clinic. There is no reason to believe that even with instructions to subjects held constant, experimenters in laboratories cannot influence their subjects as effectively, and as unintentionally, as interviewers in the field or clinicians in their clinical settings. Furthermore, there is no reason to suppose that the interpersonal communication processes that mediate the unintended influence are any different in the laboratory than they are in the field, or in the clinic, or in interpersonal relationships generally. At present, we must settle for an evaluation of the occurrence of modeling effects in laboratory settings. For a full understanding of how these effects operate, we must wait for the results of research perhaps not yet begun. There is a series of nine experiments specifically designed to assess the occurrence and magnitude of modeling effects in a laboratory setting. This series of studies, conducted between 1959 and 1964, employed the person perception task already described. 
Subjects were asked to rate a series of 10 or 20 photos on how successful or unsuccessful the persons pictured appeared to be. In all nine studies, experimenters themselves rated the photos before contacting their subjects. This was accomplished as part of the training procedure—it being most convenient to train experimenters by having them assume the role of subject while the principal investigators acted in the role of experimenter. For each study, modeling effects were defined by the correlation between the mean rating of the photos by the different experimenters themselves and the mean photo rating obtained by each experimenter from all his subjects. The number of experimenters (and therefore the N per correlation coefficient) per study ranged from 10 to 26. The number of subjects per study ranged from 55 to 206. The number of subjects per experimenter ranged from 4 to 20, the mean falling above 5. In all, 161 experimenters and about 900 subjects were included. All experimenters employed in the first eight studies were either graduate students or advanced undergraduate students in psychology or guidance. In the last experiment, there were two samples of experimenters. One consisted of nine law students, the other was a mixed group of seven graduate students primarily in the natural sciences. Subjects were drawn from elementary college courses, usually from psychology courses, but also from courses in education, social sciences, and the humanities. All of the experiments were designed to test at least one hypothesis about experimenter effects other than modeling effects—as, for example, the effects of experimenters’ expectancy. All studies, then, had at least two treatment conditions, the effects of which would have to be partially transcended by modeling effects. Table 7-2 shows the correlation (rho) obtained in each of the nine studies between the experimenters’ own ratings of the photos and the mean rating they subsequently
Table 7–2 Modeling Effects in Studies of Person Perception

Experiment                                                                   Correlation    N
1. Rosenthal and Fode (1963b)                                                +.52           10
2. Hinkle (personal communication, 1961)                                     +.65           24
3. Rosenthal, Persinger, Vikan-Kline, and Fode (1963a)                       +.18           12
4. Rosenthal, Persinger, Vikan-Kline, and Fode (1963b)                       +.31           18
5. White (1962)                                                              −.07           18
6. Rosenthal, Persinger, Vikan-Kline, and Mulry (1963)                       −.32           26
7. Persinger (1962)                                                          −.49           12
8. Rosenthal, Persinger, Mulry, Vikan-Kline, and Grothe (1964a; 1964b)       +.14           25
9. Haley and Rosenthal (unpublished, 1964) I                                 −.18           9
   Haley and Rosenthal (unpublished, 1964) II                                +.54           7
Total                                                                                       161
obtained from their subjects. The correlations are listed in the order in which they were obtained so that the experiment listed as No. 1 was the first conducted and No. 9 the last. There is a remarkable inconsistency of obtained correlations, the range being from −.49 to +.65. (Taken individually, and with the df based on the number of experimenters, only the correlation of +.65 [p < .001] differed significantly from zero [at p < .10]. This correlation of +.65, obtained in experiment No. 2 was not, however, available for closer study.) Employing the method described by Snedecor (1956) for assessing the likelihood that a set of correlations are from the same population, the value of χ² was 23.52 (df = 9, p = .006). The same analysis omitting the data from experiment No. 2 yielded χ² of 13.17 (df = 8, p = .11). It seems from these results that in different studies employing the person perception task there may be variable directions and magnitudes of modeling effects which, for any single study, might often be regarded as a chance fluctuation from a population correlation of zero. Disregarding the direction of the correlations which turned out to be negative surprisingly often, we see that the proportions of variance in subjects’ mean photo ratings accounted for by a knowledge of the experimenters’ own responses to the experimental task varied from less than 1 percent to as much as 42 percent. Sometimes, then, modeling effects are trivial, sometimes large, a finding consistent with the results of the survey research literature. There the opinion of the interviewer sometimes makes a difference and sometimes not. When there is a difference, it is sometimes sizable, sometimes trivial. Examination of Table 7-2 shows that for the first eight experiments, there is a fairly regular decrease in the magnitude of the correlations obtained (p < .05). The interpretation of this trend holding for the first eight studies is speculative.
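The test of whether the correlations of Table 7-2 could all come from the same population can be sketched as follows. The sketch assumes that the procedure is the familiar one in which each correlation is converted to Fisher's z and weighted by N − 3; treating the rank correlations as if they were product-moment correlations is a further simplification. Because the table gives the correlations rounded to two places, and because the details of the original computation are not given here, the values printed need not reproduce the reported figures of 23.52 and 13.17 exactly. The function name is illustrative only.

```python
import math

# Rank correlations and numbers of experimenters from Table 7-2
# (experiment No. 9 contributes two samples).
correlations = [0.52, 0.65, 0.18, 0.31, -0.07, -0.32, -0.49, 0.14, -0.18, 0.54]
sample_sizes = [10, 24, 12, 18, 18, 26, 12, 25, 9, 7]


def homogeneity_chi_square(rs, ns):
    """Chi-square for the hypothesis that all correlations share one
    population value: sum of (n - 3) * (z - weighted mean z)^2, where z is
    Fisher's transformation of r. Degrees of freedom = number of r's - 1."""
    zs = [math.atanh(r) for r in rs]
    weights = [n - 3 for n in ns]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    chi_square = sum(w * (z - mean_z) ** 2 for w, z in zip(weights, zs))
    return chi_square, len(rs) - 1


chi_all, df_all = homogeneity_chi_square(correlations, sample_sizes)

# Same analysis with experiment No. 2 (r = +.65, N = 24) left out.
chi_without_2, df_without_2 = homogeneity_chi_square(
    correlations[:1] + correlations[2:], sample_sizes[:1] + sample_sizes[2:]
)

# Proportion of variance accounted for is the squared correlation.
variance_explained = [r ** 2 for r in correlations]

print(round(chi_all, 2), df_all)
print(round(chi_without_2, 2), df_without_2)
print(round(min(variance_explained), 4), round(max(variance_explained), 4))
```

The squared correlations bear out the range given in the text: the smallest correlation (.07 in absolute value) accounts for less than 1 percent of the variance, and the largest (.65) for about 42 percent.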
Over the five years in which these experiments were conducted, the probability would seem to increase that experimenters might learn that they themselves were the focus of interest. This recognition may have led to their trying to avoid any modeling effects on their subjects. By trying too hard, they may have reversed the behavior that leads to positive modeling effects in such a way that negative modeling effects resulted. In a later chapter, dealing with the effects of excessive reward, some evidence will be presented that suggests that such ‘‘bending over backward’’ does occur. The last study listed in Table 7-2 shows that, even within the same experiment, the use of different samples of experimenters can lead to different directions and
magnitudes of modeling effects. Among the nine law students there were no large modeling effects, and the tendency, if any, was for negative effects. Among the seven graduate students, who were primarily in the physical sciences, the tendency was for larger and positive, though not significant, modeling effects. The two correlations could from statistical considerations alone have been combined, but because it was known that these two samples differed in a number of other characteristics, this was not done. The law student experimenters, for example, themselves rated the photos as being of more successful people (rpb = +.57, p < .05), and from their written statements of the purpose of the experiment were judged more serious (r = +.62, p < .02) and less suspicious that their own behavior was under study (r = −.74, p < .005). This last finding argues somewhat against the earlier interpretation that as experimenters were more likely to be suspicious of being studied they would tend to bend over backward to avoid modeling their subjects. The lawyers’ behavior during the experiment also seemed to be different from that of the mixed sample of graduate students. Table 7-3 shows the larger point biserial correlations between subjects’ ratings of their experimenters and experimenters’ sample membership. The young attorneys were judged by their subjects to be friendlier and more active and involved both vocally and motorically. It seems well established that, at least for these particular samples, the lawyers and graduate students treated their subjects differently; but there is nothing in the pattern of differences to tell us how it may have led to differences in modeling effects. Later, in Part II, we shall see that, if anything, this pattern of behavior is associated with greater unintended effects of the experimenter, though those effects are not of modeling but of the experimenter’s expectancy. Among the first eight experiments there was one (No.
8) that had been filmed. Unfortunately, this was an experiment that showed virtually no modeling effects. Still it might be instructive to see what the behavior was of experimenters who themselves rated the photos as being of more successful people. At least in some of the studies such behavior may affect the photo ratings of the subjects. During the brief preinstructional period of the experiment, there was little experimenter behavior from which one could postdict how he had rated the success of photos of others. Those who had rated the photos as more successful were judged from the film alone to behave less consistently (r = −.40, p < .10) than those who had rated photos as of less successful people. Such a single relationship could easily
Table 7–3 Experimenter Behavior Distinguishing Law Students from Graduate Students

Variable           Correlation    p
Friendly           +.37           .001
Pleasant           +.26           .02
Likable            +.23           .05
Interested         +.30           .01
Pleasant-voiced    +.43           .001
Loud               +.24           .05
Hand gestures      +.24           .05
Head activity      +.27           .02
Leg activity       +.30           .01
Table 7–4 Experimenter Behavior and the Perception of Success of Others

Variable           Correlation    p
Personal           −.42           .08
Interested         −.40           .10
Expressive face    −.44           .07
Fast speaking      +.51           .03
have occurred by chance, however. During the instruction-reading phase of the experiment, observers who saw the film but heard no sound track judged more success-rating experimenters as less enthusiastic (r = −.42, p = .08). These experimenters were judged from the sound track alone to behave in a more self-important manner (r = +.41, p < .10). Observers who had access to both the film and sound track made the most judgments found to correlate with the experimenter’s own perception of the success of others. Table 7-4 shows the larger correlations. Relatively more success-perceiving experimenters seemed less interested, less expressive, and faster speaking than their less success-perceiving colleagues. Ordinarily we expect such behavior to result in subjects subsequently rating photos of others as less successful, and if that had occurred there would have been a negative modeling effect. Instead, there was virtually none at all, a little positive if anything. In the study conducted in collaboration with Haley, these general results were reversed. At least as defined by subjects’ ratings, those experimenters who rated the photos as more successful behaved in a more friendly (r = +.25, p = .02) manner. We are left knowing only that the behavior of experimenters rating photos as of more successful persons differs significantly, but not consistently, from the behavior of less success-perceiving experimenters. Tritely but truly put, more research is needed. There is a more recently conducted experiment, in which the task was to construct sentences beginning with any of six pronouns (Rosenthal, Kohn, Greenfield, & Carota, 1966). The procedures called for the experimenter’s saying ‘‘good’’ whenever the subject chose a first-person pronoun. But before these reinforcements began, subjects were permitted to generate sentences without reinforcements, in order that an operant or basal level could be established.
Before experimenters contacted their subjects, they, too, constructed sentences without receiving any reinforcements. Modeling effects are defined again by a correlation coefficient, this time between the experimenter’s operant level of choosing to begin sentences with first-person pronouns and his subjects’ subsequently determined operant levels. This was the experiment, described earlier, in which the experimenters were subtly evaluated by their principal investigator. Half the experimenters were evaluated on their intelligence, half on their influenceability. Within each of these groups half the experimenters were evaluated favorably, half unfavorably. Table 7-5 shows the correlations representing modeling effects for each of the four groups of experimenters. There was a general tendency for experimenters who had been favorably evaluated to show negative modeling effects and for experimenters who had been unfavorably evaluated, especially as to their intelligence, to show positive modeling effects. (In the earlier discussion of the effects of evaluation by the principal investigator, it was mentioned that, in this particular experiment, the favorably evaluated experimenters were the ones who also obtained the significantly greater amount of conditioning from their subjects.)
Table 7–5
Modeling Effects of Experimenters as a Function of Their Principal Investigator’s Evaluation
                   Evaluation
Attribute          Favorable   Unfavorable   z Difference   p
Intelligence       −.88*       +.997**       3.58           .0005
Influenceability   −.74        +.03          .98            —
Mean               −.81*       +.52          2.24           .03
*p ≤ .05  **p ≤ .005
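The z Difference column of Table 7–5 compares the correlation obtained under favorable evaluation with the one obtained under unfavorable evaluation. The text does not say how these values were computed, nor how many experimenters fell in each group. The following is only a minimal sketch, assuming the conventional Fisher r-to-z test for two independent correlations; the function name and the group sizes in the usage line are hypothetical and are not taken from the study.

```python
import math


def fisher_z_difference(r1, n1, r2, n2):
    """Compare two independent correlations (assumed method: Fisher r-to-z).

    r1, r2 are the correlations; n1, n2 are the numbers of pairs behind each
    (hypothetical here, since the source does not report them).
    Returns the z statistic for r2 - r1 and its two-tailed p value.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each group needs more than 3 pairs")
    z1 = math.atanh(r1)  # Fisher transformation of r1
    z2 = math.atanh(r2)  # Fisher transformation of r2
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))  # standard error of z2 - z1
    z = (z2 - z1) / se
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return z, p_two_tailed


# Usage with the Intelligence row of Table 7-5 and made-up group sizes of 5.
z, p = fisher_z_difference(-0.88, 5, 0.997, 5)
print(round(z, 2), round(p, 5))
```

Because the standard error depends only on the group sizes, the same pair of correlations yields a very different z as the number of experimenters per group changes, which is why the result of this sketch should not be expected to reproduce the tabled value.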
The experiment under discussion and that conducted with Haley are the only ones within which comparisons are made between different sets of experimenters. The favorably evaluated experimenters of the one study and the lawyers of the other study both showed negative modeling effects, and both were evaluated by their subjects as more interpersonally pleasant. The unfavorably evaluated experimenters and the natural scientists both showed positive modeling effects and were both evaluated generally as less pleasant. This consistency between the two studies was especially heartening in view of the fact that the two studies employed different tasks, sentence construction in the one case, person perception in the other. It may be that subjects evaluate as more pleasant those experimenters who are not unintentionally influencing their subjects to respond as they would themselves respond. Or it may be that experimenters who are ‘‘really’’ more pleasant interpersonally, either characteristically or because they have been made that way by their interaction with the principal investigator, bend over backward to avoid modeling their subjects, while less favorably evaluated experimenters and those characteristically less pleasant interpersonally behave in such a way as to obtain positively modeled responses. This interpretation can be applied to the series of person perception studies which showed modeling effects to become more negative over time. In most of these studies there were one or more principal investigators who were involved with several of the studies. Perhaps as the principal investigators gained more experience in conducting such experiments they became more relaxed and pleasant toward the experimenters, so that, unintentionally, experimenters of the later studies felt less tense and less ‘‘on the spot’’ than experimenters of the earlier studies. 
Such unintentionally increased comfort on the part of the experimenters in later studies could account for an increase in their pleasantness toward their subjects, an increase that, in one way or another, seems to lead to negative modeling effects. From all the evidence considered, it seems sensible to conclude that modeling effects occur at least sometimes in psychological research conducted in field or laboratory. We find it difficult, however, to predict the direction and magnitude of modeling effects. In survey research, they tend usually to be positive but variable as to magnitude. In laboratory studies, modeling effects are variable not only in magnitude but in direction as well. The interpretation of the variability of direction of modeling effects that is best supported by the evidence, though still not well established, is that a happier, more pleasant, less tense experimenter seems to model his subjects negatively. The less pleasant, more tense experimenter seems to model his subjects positively. Just why that should be is not at all clear.
396
Book Two – Experimenter Effects in Behavioral Research
Problems in the control of modeling and of related effects of the experimenter will be treated in Part III. One methodological implication follows from the possible relationship between the direction of modeling and the pleasantness of the experimenter’s behavior. If a pleasant experimenter models negatively and an unpleasant experimenter models positively, then perhaps a more nearly neutral experimenter models not at all. If research were to show that this were the case, we could perhaps reduce modeling effects either by the selection of naturally neutral experimenters or by inducing more randomly selected experimenters to behave neutrally. If our selection of experimenters were fairly random with respect to the characteristic of pleasantness, and if we did not systematically change our assistants’ degree of pleasantness in our interaction with them, we might hope for the modeling effects of the more and less pleasant data collectors to cancel each other out. Replication, therefore, is required for the assessment and control of an effect of the experimenter.
8 Experimenter Expectancy
The preceding chapters have dealt with the effects of various attributes of the experimenter on the responses he obtains from his subjects. Some of these attributes were quite stable (i.e., the sex of the experimenter) and some were quite situational (i.e., the experiences the experimenter encountered while conducting his experiment). In this chapter, the discussion turns to another ‘‘attribute’’ highly dependent on the specific experiment being conducted—the expectancy the experimenter has of how his subjects will respond. Much of the remainder of this book deals with this variable. In Part II the emphasis will be on the experimental evidence that supports the proposition that what results the experimenter obtains from his subjects may be determined in part by what he expects to obtain. In Part III, the emphasis will be on various methodological implications of this proposition, including what may be done to minimize the unintended effect of the experimenter’s expectancy. The particular expectation a scientist has of how his experiment will turn out is variable, depending on the experiment being conducted, but the presence of some expectation is virtually a constant in science. The independent and dependent variables selected for study by the scientist are not chosen by means of a table of random numbers. They are selected because the scientist expects a certain relationship to appear between them. Even in those less carefully planned examinations of relationships called ‘‘fishing expeditions’’ or, more formally, ‘‘exploratory analyses’’ the expectation of the scientist is reflected in the selection of the entire set of variables chosen for examination. Exploratory analyses of data, like real fishing ventures, do not take place in randomly selected pools. These expectations of the scientist are likely to affect the choice of the experimental design and procedure in such a way as to increase the likelihood that his expectation or hypothesis will be supported. 
That is as it should be. No scientist would select intentionally a procedure likely to show his hypothesis in error. If he could too easily think of procedures that would show this, he would be likely to revise his hypothesis. If the selection of a research design or procedure is regarded by another scientist as too “biased” to be a fair test of the hypothesis, he can test the hypothesis employing oppositely biased procedures or less biased procedures by which to demonstrate the greater value of his hypothesis. The designs and
procedures employed are, to a great extent, public knowledge, and it is this public character that permits relevant replications to serve the required corrective function. In the behavioral sciences, especially, where statistical procedures are so generally employed to guide the interpretation of results, the expectation of the investigator may affect the choice of statistical tests. Unintentionally, the investigator may employ more powerful statistical tests when his hypothesis calls for his showing the untenability of the null hypothesis. Less powerful statistics may be employed when the expectation calls for the tenability of the null hypothesis. As in the choice of design and procedure, the consequences of such an unintentional expectancy bias are not serious. The data can, after all, be reanalyzed by any disagreeing scientist. Other effects of the scientist’s expectation may be on his observation of the data and on his interpretation of what they mean. Both these effects have already been discussed in the opening chapters of this book. The major concern of this chapter will be with the effects of the experimenter’s expectation on the responses he obtains from his subjects. The consequences of such an expectancy bias can be quite serious. Expectancy effects on subjects’ responses are not public matters. It is not only that other scientists cannot know whether such effects occurred in the experimenter’s interaction with his subjects but the investigator himself may not know whether these effects have occurred. Moreover, there is the likelihood that the experimenter has not even considered the possibility of such unintended effects on his subjects’ response. That is not so different from the situations already discussed wherein the subject’s response is affected by any attribute of the experimenter. Later, in Part III, the problem will be discussed in more detail. 
For now it is enough to note that while the other attributes of the experimenter affect the subject’s response, they do not necessarily affect these responses differentially as a function of the subject’s treatment condition. Expectancy effects, on the other hand, always do. The sex of the experimenter does not change as a function of the subject’s treatment condition in an experiment. The experimenter’s expectancy of how the subject will respond does change as a function of the subject’s treatment condition. Although the focus of this book is primarily on the effects of a particular person, an experimenter, on the behavior of a specific other, the subject, it should be emphasized that many of the effects of the experimenter, including the effects of his expectancy, may have considerable generality for other social relationships. That one person’s expectation about another person’s behavior may contribute to a determination of what that behavior will actually be has been suggested by various theorists. Merton (1948) developed the very appropriate concept of ‘‘self-fulfilling prophecy.’’ One prophesies an event, and the expectation of the event then changes the behavior of the prophet in such a way as to make the prophesied event more likely. Gordon Allport (1950) has applied the concept of interpersonal expectancies to an analysis of the causes of war. Nations expecting to go to war affect the behavior of their opponents-to-be by the behavior which reflects their expectations of armed conflict. Nations who expect to remain out of wars, at least sometimes, manage to avoid entering into them.
Expectancy Effects in Everyday Life
A group of young men, studied intensively by Whyte (1943), “knew how well a man should bowl.” On some evenings the group, especially its leaders, “knew” that a given member would bowl well. That “knowledge” seemed predictive, for on such an evening the member did bowl well. On other evenings it was “known” that a member would bowl poorly. And so he did, even if he had been the good bowler of the week before. The group’s expectancy of the members’ performance at bowling seemed, in fact, to determine that performance. Perhaps the morale-building banter offered to the member who was expected to perform well helped him to do so by reducing anxiety, with its interfering effects. The communication to a member that he would do poorly on a given evening may have made his anxiety level high enough to actually interfere with his performance. Although not dealing specifically with the effects of one person’s expectancy on another’s behavior, some observations made at the turn of the century by Jastrow (1900) are relevant. He tells of the bicycle rider who so fears that he may fall that his coordination becomes impaired and he does fall. “So in jumping or running and in other athletic trials, the entertainment of the notion of a possible failure to reach the mark lessens the intensity of one’s effort, and prevents the accomplishment of one’s best.” We may disagree with Jastrow over his interpretation of the effects of expectancy on performance, but that such effects occur seems well within common experience. In these examples Jastrow did not specify that the expectancy of falling or of failing came from another person, but as we saw in the example provided by Whyte, it often does. Jastrow also gives the details of a well-documented case of expectancy effects in the world of work. The setting was the United States Census Bureau in 1890. The Hollerith tabulating machine had just been installed. 
This machine, something analogous to a typewriter, required the clerks to learn some 250 positions compared to the two-score positions to be learned in typing. All regarded the task as quite difficult, and Hollerith himself estimated that a trained worker should be able to punch about 550 cards per day, each card containing about 10 punches. It took 2 weeks before any clerk achieved that high a rate, but gradually, the hundreds of clerks employed were able to perform at even higher levels but only at great emotional cost. Workers were so tense trying to achieve the records established that the Secretary of the Interior forbade the establishment of any minimum number of cards to be punched per day. At this point 200 new clerks were brought in to augment the work force. They knew nothing of the work and, unlike the original group, had no training nor had they ever seen the machines. These workers’ chief asset was that no one had told them of the task’s great ‘‘difficulty.’’ Within 3 days this new group of clerks was performing at the level attained by the initial group after 5 weeks of indoctrination and 2 weeks of practice. Among the initial group of workers, those who had been impressed by the difficulty of the task, many became ill from overwork when they achieved a level of 700 cards per day. Needless to say there was no such illness among the group of workers who had no reason to believe the task to be a difficult one. Within a short time, one of these new clerks was punching over 2,200 cards per day.
The effects on a person’s behavior of the expectancies others had of that behavior are further illustrated in an anecdote related by the learning theorist E. R. Guthrie (1938). He told how a shy, socially inept young lady became self-confident and relaxed in social contacts by having been systematically treated as a social favorite. A group of college men had arranged the expectancies of those coming in contact with her so that socially facile behavior was expected of her. In a somewhat more scholarly report, Shor (1964) showed that in automobile driving, one driver’s expectancy of another’s behavior was communicated to that driver automotively in such a way as to increase the likelihood that the expected behavior would occur.
Education is one of the socially most important areas of everyday life in which expectancy effects have been regarded as central. With increasing concern over the education of economically, racially, and socially disadvantaged children, more and more attention has been paid to the effect of our expectancy of a child’s intellectual performance on that child’s performance. MacKinnon (1962) put it this way: “If our expectation is that a child of a given intelligence will not respond creatively to a task which confronts him, and especially if we make this expectation known to the child, the probability that he will respond creatively is very much reduced” (p. 493). The same position has been stated also by Katz (1964), Wilson (1963), and Clark (1963), who speaks of the deprived child becoming “the victim of an educational self-fulfilling prophecy” (p. 150). Perhaps the most detailed statement of this position is that made by the authors of Youth in the Ghetto (Harlem Youth Opportunities Unlimited, Inc., 1964). In this report considerable evidence is cited which shows that the culturally deprived child shows a relative drop in academic performance and IQ as he progresses from the third to the sixth grade. 
Until recently, however, there has been no experimental evidence that teachers’ expectations of a child’s performance actually affect that performance. Now there are data that show quite clearly that when teachers expect a child’s IQ to go up it does go up. The effect is consistent, not always large, but sometimes very dramatic (e.g., 20-point IQ gains). The data, not yet fully analyzed, were collected in collaboration with Lenore Jacobson and will be reported fully elsewhere.
Expectancy Effects in Survey Research
Perhaps the classic work in this area was that of Stuart Rice (1929). A sample of 2,000 applicants for charity were interviewed by a group of 12 skilled interviewers. Interviewers talked individually with their respondents, who had been assigned in a wholly nonselected manner. Respondents ascribed their dependent status to factors predictable from a knowledge of the interviewers’ expectancies. Thus, one of the interviewers, who was a staunch prohibitionist, obtained 3 times as many responses blaming alcohol as did another interviewer regarded as a socialist, who, in turn, obtained half again as many responses blaming industrial factors as did the prohibitionist interviewer. Rice concluded that the expectancy of the interviewer was somehow communicated to the respondent, who then replied as expected. Hyman and his colleagues (1954) disagreed with Rice’s interpretation. They preferred to ascribe his remarkable results to errors of recording or of interpretation. What the correct interpretation is, we cannot say, for the effects, if of observation or of expectancy, were
private ones. In either case, of course, the results of the research were strikingly affected by the expectancy of the data collector.
One of the earliest studies deliberately creating differential expectancies in interviewers was that conducted by Harvey (1938). Each of six boys was interviewed by each of five young postgraduates. The boys were to report to the interviewers on a story they had been given to read. Interviewers were to use these reports to form impressions of the boys’ character. Each interviewer was given some contrived information about the boys’ reliability, sociability, and stability, but told not to regard these data in assessing the boys. Standardized questions asked of the interviewers at the conclusion of the study suggested that biases of assessment occurred even without interviewers’ awareness and despite conscious resistance to bias. Harvey felt that the interviewers’ bias evoked a certain attitude toward the boys which in turn determined the behavior to be expected and then the interpretation given. Again, we cannot be sure that subjects’ responses were actually altered by interviewer expectancies. The possibility, however, is too provocative to overlook.
Wyatt and Campbell (1950) trained over 200 student interviewers for a public opinion survey dealing with the 1948 presidential campaign. Before collecting their data, the interviewers guessed the percentage distribution of responses they would obtain to each of five questions. For four of the five questions asked, interviewers tended to obtain more answers in the direction of their expectancy, although the effect was significant in the case of only one question. Those interviewers expecting more of their respondents to have discussed the campaign with others tended to obtain responses from their subjects that bore out their expectancy (p = .02). Interviewers had also answered the five questions themselves, so that an assessment of modeling effects was possible. 
These effects were not significant. More recent evidence for expectancy effects in survey research comes from the work of Hanson and Marks (1958), and a very thorough discussion can be found in Hyman et al. (1954).
Expectancy Effects in Clinical Psychology
Although it was the sociologist Merton who developed the concept of the self-fulfilling prophecy, particularly for the analysis of such large-scale social and economic phenomena as racial and religious prejudice and the failure of banks, the concept was applied much earlier and in a clinical context. Albert Moll (1898) spoke specifically of clinical phenomena in which “the prophecy causes its own fulfillment” (p. 244). He mentioned hysterical paralyses cured at the time it was believed they could be cured. He told of insomnia, nausea, impotence, and stammering all coming about when their advent was most expected. But his particular interest was in the phenomenon of hypnosis. It was his belief that subjects behaved as they believed they were expected to behave. Much later, in 1959, Orne showed that Moll was right, and still more recent evidence (Barber & Calverley, 1964b) gives further confirmation, though Levitt and Brady (1964) showed that the subject’s expectation did not always lead to a confirming performance. In the studies just now cited we were not dealing specifically with the hypnotist’s expectancy as an unintended determinant of the subject’s response. It was more a case of the subject’s expectancy as a determinant of his own response. As yet there have
been no reports of studies in which different hypnotists were led to have different expectations about their subjects’ performance. That is the kind of study needed to establish the effects of the hypnotist’s expectation on his subject’s performance. Kramer and Brennan (1964) do have an interpretation of some data that fits the model of the self-fulfilling prophecy. They worked with schizophrenics and found them to be as susceptible to hypnosis as college undergraduates. In the past, schizophrenics had been thought far less hypnotizable. Their interpretation suggested that, relative to the older studies, their own approach to the schizophrenics communicated to them the investigators’ expectancy that the patients could be hypnotized.
In the area of psychotherapy, a number of workers have been impressed by the effects of the self-fulfilling prophecy. One of the best known of these was Frieda Fromm-Reichmann (1950). She spoke, as other clinicians have, of iatrogenic psychiatric incurabilities. The therapist’s own belief about the patient’s prognosis might be a determinant of that prognosis. Strupp and Luborsky (1962) have also made this point. These clinical impressions are supported to some extent by a few more formal investigations. Heine and Trosman (1960) did not find the patient’s initial expectation of help to be related to his continuance in treatment. They did find, however, that when the therapist and patient had congruent expectations, patients continued longer in treatment. Experimental procedures to help ensure such congruence have been employed by Jerome Frank and Martin Orne with considerable success (Frank, 1965). Goldstein (1960) found no client-perceived personality change to be related to the therapist’s expectancy of such change. However, the therapist’s expectancy was related to the duration of psychotherapy. 
Additionally, Heller and Goldstein (1961) found the therapist’s expectation of client improvement significantly correlated (.62) with a change in the client’s attraction to the therapist. These workers also found that after 15 sessions, the client’s behavior was no more independent than before, but that their self-descriptions were of more independent behavior. The therapists employed in this study generally were favorable to increased independence and tended to expect successful cases to show this decrease in dependency. Clients may well have learned from their therapists that independent-sounding verbalizations were desired and thereby served to fulfill their therapist’s expectancy. The most complete discussion of the general importance to the psychotherapeutic interaction of the expectancy variable is that by Goldstein (1962). But hypnosis and psychotherapy are not the only realms of clinical practice in which the clinician’s expectancy may determine the outcome. The fatality rates of delirium tremens have recently not exceeded about 15 per cent. However, from time to time new treatments of greatly varying sorts are reported to reduce this figure almost to zero. Gunne’s work in Sweden summarized by the staff of the Quarterly Journal of Studies on Alcohol (1959) showed that any change in therapy led to a drop in mortality rate. One interpretation of this finding is that the innovator of the new treatment expects a decrease in mortality rate, an expectancy that leads to subtle differential patient care over and above the specific treatment under investigation. A prophecy again may have been self-fulfilled. Greenblatt (1964) describes a patient suffering from advanced cancer who was admitted to the hospital virtually dying. He had been exposed to the information that Krebiozen might be a wonder drug, and some was administered to him. His improvement was dramatic and he was discharged to his home for several months. 
He was then exposed to the information that Krebiozen was probably ineffective. He
relapsed and was readmitted to the hospital. There, his faith in Krebiozen was restored, though the injections he received were of saline solution rather than Krebiozen. Once again he was sufficiently improved to be discharged. Finally he was exposed to the information that the American Medical Association denied completely the value of Krebiozen. The patient then lost all hope and was readmitted to the hospital, this time for the last time. He died within 48 hours. Such an anecdote might not be worth the telling were it not for the fact that effects almost as dramatic have been reported in more formal research reports on the effects of placebo in clinical practice. Excellent reviews are available of this literature (e.g., Honigfeld, 1964; Shapiro, 1960; Shapiro, 1964; Shapiro, 1965), which show that it is not at all unusual to find placebo effects more powerful than the actual chemical effects of drugs whose pharmacological action is fairly well understood (e.g., Lyerly, Ross, Krugman, & Clyde, 1964). In his comprehensive paper, Shapiro (1960) cites the wise clinician’s admonition: “You should treat as many patients as possible with the new drugs while they still have the power to heal” (p. 114). The wisdom of this statement may derive from its appreciation of the therapeutic role of the clinician’s faith in the efficacy of the treatment. This faith is, of course, the expectancy under discussion. The clinician’s expectancy about the efficacy of a treatment procedure is no doubt subtly communicated to the patient with a resulting effect on his psychobiological functioning.
Expectancy Effects in Experimental Psychology
There is an analysis of 168 studies that had been conducted to establish the validity of the Rorschach technique of personality assessment. Levy and Orr (1959) categorized each of these studies on each of the following dimensions: (1) the academic versus nonacademic affiliation of the author; (2) whether the study was designed to assess construct or criterion validity; and (3) whether the outcome of the study was favorable or unfavorable to the hypothesis of Rorschach validity. Results showed that academicians, more interested in construct validity, obtained outcomes relatively more favorable to construct validation and less favorable to criterion validation. On the basis of their findings these workers called for more intensive study of the researcher himself. “For, intentionally or not, he seems to exercise greater control over human behavior than is generally thought” (p. 83). We cannot be sure that the findings reported were a case of expectancy effect or bias. It might have been that the choice of specific hypotheses for testing, or that the choice of designs or procedures for testing them, determined the apparently biased outcomes. At the very least, however, this study accomplished its task of calling attention to the potential biasing effects of experimenters’ expectations.
Perhaps the earliest study that employed a straightforward experimental task and directly varied the expectancy of the experimenter was that of Stanton and Baker (1942). In their study, 12 nonsense geometric figures were presented to a group of 200 undergraduate subjects. After several days, retention of these figures was measured by five experienced workers. The experimenters were supplied with a key of “correct” responses, some of which were actually correct but some of which were incorrect. Experimenters were explicitly warned to guard against any bias associated with their having the keys before them and thereby unintentionally
influencing their subjects to guess correctly. Results showed that the experimenter obtained results in accordance with his expectations. When the item on the key was correct, the subject’s response was more likely to be correct than when the key was incorrect. In a careful replication of this study, Lindzey (1951) emphasized to his experimenters the importance of keeping the keys out of the subjects’ view. This study failed to confirm the Stanton and Baker findings. Another replication by Friedman (1942) also failed to obtain the significance levels obtained in the original. Still, significant results of this sort, even occurring in only one out of three experiments, cannot be dismissed lightly. Stanton (1942a) himself presented further evidence which strengthened his conclusions. He employed a set of nonsense materials, 10 of which had been presented to subjects, and 10 of which had not. Experimenters were divided into three groups. One group was correctly informed as to which 10 materials had been exposed, another group was incorrectly informed, and the third group was told nothing. The results of this study also indicated that the materials that the experimenters expected to be more often chosen were, in fact, more often chosen. An experiment analogous to those just described was conducted in a psychophysical laboratory by workers (Warner & Raible, 1937) who interpreted their study within the framework of parapsychological phenomena. The study involved the judgment of weights by subjects who could not see the experimenter. The latter kept his lips tightly closed to prevent unconscious whispering (Kennedy, 1938). In half the experimental trials the experimenter knew the correct response and in half he did not. Of the 17 subjects, only 6 showed a large discrepancy from a chance distribution of errors. However, all 6 of these subjects made fewer errors on trials in which the experimenter knew which weight was the lighter or heavier. 
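The weight of this result rests on all 6 of the more affected subjects erring less often in the same direction. The source does not name the significance test used, so the following is only a sketch, assuming a two-tailed exact sign test in which each subject is equally likely to fall in either direction under the null hypothesis; the function name is illustrative.

```python
from math import comb


def sign_test_p(k, n):
    """Two-tailed exact sign test (assumed test; not named in the source).

    k of n subjects fall in one direction; under the null hypothesis each
    subject is equally likely to fall in either direction.
    """
    tail = max(k, n - k)  # size of the more extreme split
    one_tailed = sum(comb(n, i) for i in range(tail, n + 1)) / 2 ** n
    return min(1.0, 2 * one_tailed)  # double for two tails, capped at 1


print(sign_test_p(6, 6))  # 0.03125
```

With 6 of 6 subjects in the same direction, the one-tailed probability is (1/2)^6 = 1/64, and doubling it gives about .03.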
At least for those 6 subjects who were more affected by the experimenter’s knowledge of the correct response, the authors’ conclusion seems justified (p = .03). As an alternative to the interpretation of these results as ESP phenomena, they suggested the possibility of some form of auditory cue transmission to subjects.

Among the most recent relevant studies in the area of ESP are those by Schmeidler and McConnell (1958). These workers found that subjects who believed ESP possible (‘‘sheep’’) performed better at ESP tasks than subjects not believing ESP possible (‘‘goats’’). These workers suggested that an experimenter, by his presentation, might affect subjects’ self-classification, thereby increasing or decreasing the likelihood of successful ESP performance. Similarly, Anderson and White (1958) found that teachers’ and students’ attitudes toward each other might influence performance in classroom ESP experiments. The mechanism operating here might also have been one of certain teachers’ expectancies being communicated to the children whose self-classification as sheep or goats might thereby be affected. The role of the experimenter in the results of ESP research has been discussed recently by Crumbaugh (1959), and much earlier by Kennedy (1939), as a source of evidence against the existence of the phenomenon. No brief is filed here for or against ESP, but if, in carefully done experiments, certain types of experimenters obtain certain types of ESP performances in a predictable manner, as suggested by the studies cited, further evidence for experimenter expectancy effects will have been adduced (Rhine, 1959).

In a more traditional area of psychological research (memory), Ebbinghaus (1885) called attention to experimenter expectancy effects. In his own research he noted that his expectancy of what data he would obtain affected the data he
subsequently did obtain. He pointed out, furthermore, that the experimenter’s knowledge of this expectancy effect was not sufficient to control the phenomenon. This finding, long neglected, will be discussed further in Part II when the question of early data returns is taken up. Another possible case, and another classic, has been described by Stevens (1961). He discussed the controversy between Fechner and Plateau over the results of psychophysical experiments designed to determine the nature of the function describing the operating characteristics of a sensory system. Plateau held that it was a power function rather than a log function. Delboeuf carried out experiments for Plateau, but obtained data approximating the Fechnerian prediction of a log function. Stevens puzzled over these results which may be interpretable as experimenter expectancy effects. Either by implicitly expecting the Fechnerian outcomes or by attempting to guard against an anti-Fechnerian bias, Delboeuf may have influenced the outcome of his studies. It would appear that Pavlov was aware of the possibility that the expectancy of the experimenter could affect the results of experiments. In an exchange of letters in Science, Zirkle (1958) and Razran (1959), in discussing Pavlov’s attitude toward the concept of the inheritance of acquired characteristics, give credence to a statement by Gruenberg (1929, p. 327): ‘‘In an informal statement made at the time of the Thirteenth International Physiological Congress, Boston, August, 1929, Pavlov explained that in checking up these experiments, it was found that the apparent improvement in the ability to learn, on the part of successive generations of mice, was really due to an improvement in the ability to teach, on the part of the experimenter! 
And so this ‘proof’ of the transmission of modifications drops out of the picture, at least for the present.’’ Probably the best-known and most instructive case of experimenter expectancy effects is that of Clever Hans (Pfungst, 1911). Hans, it will be remembered, was the horse of Mr. von Osten, a German mathematics teacher. By means of tapping his foot, Hans was able to add, subtract, multiply, and divide. Hans could spell, read, and solve problems of musical harmony. To be sure, there were other clever animals at the time, and Pfungst tells about them. There was ‘‘Rosa,’’ the mare of Berlin, who performed similar feats in vaudeville, and there was the dog of Utrecht, and the reading pig of Virginia. All these other clever animals were highly trained performers who were, of course, intentionally cued by their trainers. Mr. von Osten, however, did not profit from his animal’s talent nor did it seem at all likely that he was attempting to perpetrate a fraud. He swore he did not cue the animal, and he permitted other people to question and test the horse even without his being present. Pfungst and his famous colleague, Stumpf, undertook a program of systematic research to discover the secret of Hans’ talents. Among the first discoveries made was that if the horse could not see the questioner, Hans was not clever at all. Similarly, if the questioner did not himself know the answer to the question, Hans could not answer it either. Still, Hans was able to answer Pfungst’s questions as long as the investigator was present and visible. Pfungst reasoned that the questioner might in some way be signaling to Hans when to begin and when to stop tapping his hoof. A forward inclination of the head of the questioner would start Hans tapping, Pfungst observed. He tried then to incline his head forward without asking a question and discovered that this was sufficient to start Hans’ tapping. As the experimenter straightened up, Hans would stop tapping. 
Pfungst then tried to get Hans to stop tapping by using very slight upward motions of the head. He found that even the
raising of his eyebrows was sufficient. Even the dilation of the questioner’s nostrils was a cue for Hans to stop tapping. When a questioner bent forward more, the horse would tap faster. This added to the reputation of Hans as brilliant. That is, when a large number of taps was the correct response, Hans would tap very, very rapidly until he approached the region of correctness, and then he began to slow down. It was found that questioners typically bent forward more when the answer was a long one, gradually straightening up as Hans got closer to the correct number. For some experiments, Pfungst discovered that auditory cues functioned additively with visual cues. When the experimenter was silent, Hans was able to respond correctly 31 per cent of the time in picking one of many placards with different words written on it, or cloths of different colors. When auditory cues were added, Hans responded correctly 56 per cent of the time. Pfungst himself then played the part of Hans, tapping out responses to questions with his hand. Of 25 questioners, 23 unwittingly cued Pfungst as to when to stop tapping in order to give a correct response. None of the questioners (males and females of all ages and occupations) knew the intent of the experiment. When errors occurred, they were usually only a single tap from being correct. The subjects of this study, including an experienced psychologist, were unable to discover that they were unintentionally emitting cues. Hans’ amazing talents, talents rapidly acquired too by Pfungst, serve to illustrate further the power of the self-fulfilling prophecy. Hans’ questioners, even skeptical ones, expected Hans to give the correct answers to their queries. Their expectation was reflected in their unwitting signal to Hans that the time had come for him to stop his tapping. The signal cued Hans to stop, and the questioner’s expectation became the reason for Hans’ being, once again, correct. 
Not all of Hans’ questioners were equally good at fulfilling their prophecies. Even when the subject is a horse, apparently, the attributes of the experimenter make a considerable difference in determining the response of a subject. On the basis of his studies, Pfungst was able to summarize the characteristics of those of Hans’ questioners who were more successful in their covert and unwitting communication with the horse. What seemed important was:
1. That the questioner have ability and ‘‘tact’’ in dealing with animals generally.
2. That he have an air of quiet authority.
3. That he concentrate on the correct answer, both expecting and wishing for it.
4. That he have a facility for motor discharge or be gesturally inclined.
5. That he be in relative good health.
Pfungst summarized eloquently the difficulties of uncovering the nature of Clever Hans’ talents. Investigators had been misled by ‘‘looking for, in the horse, what should have been sought in the man.’’ Additional examples of just such looking in the wrong place and more extensive references are to be found elsewhere (Rosenthal, 1964b; Rosenthal, 1965a). There is a more recent example of possible expectancy effects, and this time the subjects were humans. The experiment dealt with the Freudian defense mechanism of projection (Rosenthal, 1956; Rosenthal, 1958). A total of 108 subjects was composed of 36 college men, 36 college women, and 36 hospitalized patients with paranoid symptomatology. Each of these three groups was further divided into three subgroups
receiving success, failure, or neutral experience on a task structured as and simulating a standardized test of intelligence. Before the subjects’ experimental treatment conditions were imposed, they were asked to rate the degree of success or failure of persons pictured in photographs. Immediately after the experimental manipulation, subjects were asked to rate an equivalent set of photos on their degree of success or failure. The dependent variable was the magnitude of the difference scores from pre- to post-ratings of the photographs. It was hypothesized that the ‘‘success’’ treatment condition would lead to greater subsequent perception of other people’s success, whereas the ‘‘failure’’ treatment condition would lead to greater subsequent perception of other people’s failure as measured by the pre-post difference scores.

An analysis (which was essentially unnecessary to the main purpose of the study) was performed which compared the mean pre-ratings of the three experimental treatment conditions. These means were as follows: success, −1.5; neutral, −0.9; and failure, −1.0. The pre-rating mean of the success treatment group was significantly lower (p = .01) than the other means. In terms of the hypothesis under test, a lower pre-rating by this group would tend to lead to significantly different difference scores even if the post-ratings were identical for all treatment conditions. Without the investigator’s awareness, the cards had been stacked in favor of obtaining results confirming the hypothesis under test. It should be emphasized that the success and failure groups’ instructions had been identical, verbatim, during the pre-rating phase of the experiment. (Instructions to the neutral group differed only in that no mention was made of the experimental task, since none was administered to this group.) The investigator, however, was aware for each subject which experimental treatment the subject would subsequently be administered.
‘‘The implication is that in some subtle manner, perhaps by tone, or manner, or gestures, or general atmosphere, the experimenter, although formally treating the success and failure groups in an identical way, influenced the success subjects to make lower initial ratings and thus increase the experimenter’s probability of verifying his hypothesis’’ (Rosenthal, 1956, p. 44). As a further check on the suspicion that success subjects had been differently treated, the conservatism-extremeness of pre-ratings of photos was analyzed. The mean extremeness-of-rating scores were as follows: success, 3.9; neutral, 4.4; and failure, 4.4. The success group rated photos significantly (p = .001) less extremely than did the other treatment groups. Whatever the manner in which the experimenter differentially treated those subjects he knew were destined for the success condition, it seemed to affect not only their mean level of rating but their style of rating as well. It was these puzzling and disconcerting results that led to the experiments to be described in Part II.
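The point that a lower pre-rating can by itself produce the predicted difference scores is easy to check with a little arithmetic. The sketch below is illustrative only: it treats the three reported pre-rating means as negative values (an assumption about their signs, based on the success mean being described as the lowest) and pairs them with a purely hypothetical post-rating mean that is the same in every condition. The names `pre_means`, `HYPOTHETICAL_POST`, and `difference_scores` are not from the original study.

```python
# Pre-rating means by treatment condition. The signs are an assumption:
# the text describes the success group's mean as the lowest of the three.
pre_means = {"success": -1.5, "neutral": -0.9, "failure": -1.0}

# Hypothetical: suppose the treatments had no effect at all, so every
# condition produced the same post-rating mean (value chosen arbitrarily).
HYPOTHETICAL_POST = -1.0


def difference_scores(pre, post):
    """Return post-minus-pre difference scores for each condition."""
    return {cond: round(post - value, 2) for cond, value in pre.items()}


diffs = difference_scores(pre_means, HYPOTHETICAL_POST)
for cond, d in diffs.items():
    print(f"{cond:8s} pre={pre_means[cond]:+.1f} "
          f"post={HYPOTHETICAL_POST:+.1f} diff={d:+.1f}")
```

Under these assumptions the success group shows the largest shift toward ‘‘success’’ although nothing happened between the two ratings, which is the sense in which the cards were stacked in favor of the hypothesis.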
Part II STUDIES OF EXPERIMENTER EXPECTANCY EFFECTS
EXPERIMENTAL DEMONSTRATION OF EXPERIMENTER EXPECTANCY EFFECTS
Chapter 9. Human Subjects
Chapter 10. Animal Subjects

FACTORS COMPLICATING EXPERIMENTER EXPECTANCY EFFECTS
Chapter 11. Subject Set
Chapter 12. Early Data Returns
Chapter 13. Excessive Rewards

VARIABLES RELEVANT TO THE COMMUNICATION OF EXPERIMENTER EXPECTANCY EFFECTS
Chapter 14. Structural Variables
Chapter 15. Behavioral Variables
Chapter 16. Communication of Experimenter Expectancy
9 Human Subjects
The evidence presented up to this point that the expectancy of the experimenter may in part determine the results of his research has been at least somewhat equivocal. Some of the evidence has been anecdotal. Some has required the untenable assumption that the expectancy of the experimenter, and not some correlated variable, had led to the effects observed. That is the case in any study in which the data collector estimates beforehand the data he will obtain and then obtains data significantly in that direction. In such cases it could be that experimenters who expect certain kinds of data differ in other attributes from their colleagues and that it is these attributes, rather than the expectancy, that influence the subjects’ response. The most clear-cut evidence for the effects of the experimenter’s expectancy, therefore, must come from experiments in which experimenters are given different expectancies. Of the studies examined, that by Stanton and Baker (1942) comes closest to meeting this requirement of the experimental induction of an expectancy. That study does require, however, the assumption that experimenters will expect the subjects to answer correctly the items being presented. The same assumption is required to interpret the case of Clever Hans as an experiment in expectancy effects. The studies to be described now seem to be fairly straightforward tests of the hypothesis of the effects of the experimenter’s expectancy on his research results.
The Person Perception Task

In earlier chapters there has been occasion to refer often to the person perception task. The details of the standardization should be described. Fifty-seven photographs of faces ranging in size from 2 × 3 cm to 5 × 6 cm were cut from a weekly news magazine and mounted on 3 × 5 in. white cards. These were presented to 70 male and 34 female students, enrolled in an introductory psychology class at the University of North Dakota. Subjects were instructed to rate each photo on a rating scale of success or failure. The scale, shown in Figure 9-1, ran from −10, extreme failure; to +10, extreme success; with intermediate labeled points. Each subject was seen individually by the author who read to each the following instructions:

Extreme Failure   Moderate Failure   Mild Failure   Mild Success   Moderate Success   Extreme Success
−10 −9 −8 −7 −6 −5 −4 −3 −2 −1 +1 +2 +3 +4 +5 +6 +7 +8 +9 +10
Figure 9–1 The Empathy Test Rating Scale

Instructions to Subjects. I am going to read you some instructions. I am not permitted to say anything which is not in the instructions nor can I answer any questions about this experiment. OK? We are in the process of developing a test of empathy. This test is designed to show how well a person is able to put himself into someone else’s place. I will show you a series of photographs. For each one I want you to judge whether the person pictured has been experiencing success or failure. To help you make more exact judgments you are to use this rating scale. As you can see the scale runs from −10 to +10. A rating of −10 means that you judge the person to have experienced extreme failure. A rating of +10 means that you judge the person to have experienced extreme success. A rating of −1 means that you judge the person to have experienced mild failure, while a rating of +1 means that you judge the person to have experienced mild success. You are to rate each photo as accurately as you can. Just tell me the rating you assign to each photo. All ready? Here is the first photo. (No further explanation may be given, although all or part of the instructions may be repeated.)

From the original 57 photos, 10 were selected for presentation to male subjects and 10 were selected for presentation to female subjects. All 20 photos were rated on the average as neither successful nor unsuccessful, and for each the mean rating evoked fell between +1 and −1. The distributions of ratings evoked by each of the photos were also symmetrical. The 10 photos composing the final sets of stimuli for male subjects and the 10 for female subjects were rated on the average as exactly zero.¹
The First Experiment²

Ten of the eleven students in a class in undergraduate experimental psychology served as experimenters. All were psychology majors, and three of them were first-year graduate students in psychology. All but two of the experimenters were males. Subjects were 206 students enrolled in a course in introductory psychology (92 males and 114 females). Because subjects were given class credit for participating in the experiment, most of the class volunteered, thus reducing the selective effect of using volunteer subjects (Rosenthal, 1965b). Each experimenter contacted from 18 to 24 subjects.

¹ Four years later, at the same university, a sample of 14 experimenters administered the photo-rating task to a sample of 28 female subjects. Each experimenter contacted 2 subjects. The grand mean photo rating obtained was .004. It should be noted, however, that the demonstration of expectancy effects does not depend on the ‘‘validity’’ of the standardization. The standardization sample was useful to determine the characteristics of the stimuli, but it is not employed as a comparison or control group in any of the experiments described in this book.

² This study and the first replication have been reported earlier (Rosenthal & Fode, 1961; Rosenthal & Fode, 1963b).
The experimenters’ task was structured as a laboratory exercise to see whether they could replicate ‘‘well-established’’ experimental findings as ‘‘students in physics labs are expected to do.’’ Experimenters were told to discuss their project with no one and to say nothing to their subjects other than what was in the Instructions to Subjects. All experimenters were paid a dollar an hour except that if they did a ‘‘good job’’ they would be paid double: two dollars an hour. All ten experimenters received identical instructions except that five experimenters were told that their subjects would average a +5 rating on the ten neutral photos. The other experimenters were told that their subjects would average a −5 rating. Thus the only difference between the two groups of experimenters was that one group had a plus mark written in front of the ‘‘5’’ while the other group had a minus mark written in front of the ‘‘5.’’ As a part of the experimenters’ training, each of them also rated the standardized set of ten photos. The exact instructions to experimenters were as follows:

Instructions to Experimenters. You have been asked to participate in a research project developing a test of empathy. You may have seen this project written up in the campus newspaper. There is another reason for your participation in this project, namely, to give you practice in duplicating experimental results. In physics labs, for example, you are asked to repeat experiments to see if your findings agree with those already well established. You will now be asked to run a series of Ss and obtain from each ratings of photographs. The experimental procedure has been typed out for you and is self-explanatory. DO NOT DISCUSS THIS PROJECT WITH ANYONE until your instructor tells you that you may. You will be paid at the rate of $1.00 per hour for your time. If your results come out properly—as expected—you will be paid $2.00 instead of $1.00. The Ss you are running should average about a (+ or −) 5 rating.
Just read the instructions to the Ss. Say nothing else to them except hello and goodbye. If for any reason you should say anything to an S other than what is written in your instructions, please write down the exact words you used and the situation which forced you to say them. GOOD LUCK!

The results of this experiment are shown in Table 9-1. Each entry represents the mean photo rating obtained by one experimenter from all his subjects. The difference between the mean ratings obtained by experimenters expecting success (+5) ratings and those expecting failure (−5) ratings was significant at the .007 level (one-tailed p, t = 3.20, df = 8). All experimenters expecting success ratings obtained higher ratings than did any experimenter expecting failure ratings. Such nonoverlapping of distributions occurs only rarely in behavioral research and has a probability of .004 (one-tailed, for N1 = N2 = 5). The mean ratings obtained by the two female experimenters, one in each treatment condition, did not differ from the mean ratings obtained by the male experimenters of their respective experimental conditions. The grades earned by all experimenters in their experimental psychology course were not related to either the mean photo ratings obtained from subjects or the magnitude of the biasing phenomenon.

Table 9–1 Experimenters’ Expectancy and Their Subjects’ Mean Ratings of Success

                Expectancy
                +5        −5
                +.66      +.18
                +.45      +.17
                +.35      +.04
                +.31      −.37
                +.25      −.42
Means           +.40      −.08
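The two figures quoted for Table 9-1, the t value and the probability of completely nonoverlapping groups, can be roughly rechecked from the tabled means. The sketch below is only an illustration under stated assumptions: it uses an ordinary pooled-variance t test (the text does not say exactly how t was computed), it takes the two lowest entries in the −5 column as negative so that the column mean agrees with the reported −.08, and the names `pooled_t` and `p_nonoverlap` are hypothetical. Because the tabled values are rounded, the recomputed t should land near, not exactly on, the reported 3.20.

```python
import math
from statistics import mean

# Mean photo ratings obtained by each experimenter (Table 9-1).
expect_success = [0.66, 0.45, 0.35, 0.31, 0.25]    # led to expect +5
expect_failure = [0.18, 0.17, 0.04, -0.37, -0.42]  # led to expect -5


def pooled_t(a, b):
    """Independent-samples t with a pooled variance estimate; returns (t, df)."""
    ss_a = sum((x - mean(a)) ** 2 for x in a)
    ss_b = sum((x - mean(b)) ** 2 for x in b)
    df = len(a) + len(b) - 2
    pooled_var = (ss_a + ss_b) / df
    se = math.sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (mean(a) - mean(b)) / se, df


t_value, df = pooled_t(expect_success, expect_failure)

# Chance that all five values in one named group exceed all five in the
# other, if every split of the ten ranks into two groups is equally likely.
n1, n2 = len(expect_success), len(expect_failure)
p_nonoverlap = 1 / math.comb(n1 + n2, n1)

print(f"t = {t_value:.2f}, df = {df}")
print(f"P(complete separation) = {p_nonoverlap:.4f}")
```

The nonoverlap probability is simply one favorable arrangement out of the 252 ways of choosing which five of ten ranks belong to the success-expecting group, which rounds to the .004 given in the text.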
The First Replication The magnitude of the expectancy effects obtained was not readily believable, and a replication was performed by Kermit Fode (1960). There were other reasons for this study, which will be discussed in the chapter dealing with the communication of the experimenter’s expectancy. Here, only those portions of the study are reported that served the replication function. Twelve of the 26 male students enrolled in an advanced undergraduate course in industrial psychology were randomly assigned to serve as experimenters. In this sample of experimenters, few were psychology majors; most were majoring in engineering sciences. Subjects were 86 students enrolled in a course in introductory psychology (50 males and 36 females). These subjects were also given class credit for participating in the experiment. Each experimenter contacted from 4 to 14 subjects. The procedure of this experiment was just as in the preceding study with the exception that experimenters did not handle the photos. Instead, each set of ten photos was mounted on cardboard and labeled so that subjects could call out their ratings of each photo to their experimenter. It was thought that less handling of the photos might serve to reduce the effects of experimenters’ expectancies on the data obtained from subjects. There were two reasons for this thinking. First, if the experimenter did not hold each stimulus photo, the subject would have the experimenter in his field of vision much less often and the number of cues observed by the subject should be reduced. That had been Pfungst’s experience with Clever Hans. The second reason, related to the first, was the suspicion that the movements of the hand in which the experimenter held the stimulus photo might serve a cueing function. (This was the thinking about the one change in procedure, but the change itself was not one of the variables investigated formally. 
Rather, the change was required so that the two replication groups would not differ from other experimental groups of the experiment in procedure.)

The results of the replication are shown in Table 9-2. As in the original experiment, half the experimenters had been led to expect ratings of success (+5) and half had been led to expect ratings of failure (−5). The difference between these two groups of experimenters in the responses they obtained from their subjects was again significant, this time at the .0003 level (one-tailed p, t = 4.99, df = 10). Once again, all experimenters expecting ratings of success obtained ratings of the photos as more successful than did any of the experimenters expecting failure ratings.
Table 9–2 Experimenters’ Expectancy and Subjects’ Mean Ratings: Replication

                Expectancy
                +5        −5
                +3.03     +1.00
                +2.76     +0.91
                +2.59     +0.75
                +2.09     +0.46
                +2.06     +0.26
                +1.10     −0.49
Means           +2.27     +0.48

The Second Replication

There is one more experiment by Fode (1965) which is sufficiently similar to the two described already to be usefully regarded as another replication. Later, in the chapter dealing with experimenter characteristics associated with greater and lesser expectancy effects, other aspects of that study will be considered. Here, we consider only the two most relevant groups employed by Fode. There were eight experimenters, all advanced undergraduate students in industrial psychology, the same course from which the experimenters of the first replication were drawn, but, of course, in a different year. The 90 subjects were all enrolled in an introductory psychology course (55 males and 35 females). Each experimenter contacted from 9 to 13 subjects. The procedure was as in the original experiment. The major difference between this and the original experiment was that experimenters had been selected for their characteristic level of anxiety defined by the Taylor Scale of Manifest Anxiety. The eight experimenters whose results will be described were all medium anxious. Half were randomly assigned to a group led to expect success (+5) ratings, and half were assigned to a group led to expect failure (−5) ratings.

The results of this second replication are shown in Table 9-3. Once again, experimenters expecting ratings of people as more successful obtained ratings of higher success than did experimenters expecting ratings of people as failures, this time with an associated p value of .005 (one-tailed, t = 3.96, df = 6). Once again, too, the distributions did not overlap. Every experimenter expecting positive ratings obtained positive ratings, and every experimenter expecting negative ratings obtained negative ratings. Table 9-4 gives a summary of the magnitude of expectancy effects obtained in each of the three experiments described. Employing Stouffer’s method suggested by Mosteller and Bush (1954) gave a combined probability for the three experiments of one in about two million.

Table 9–3 Experimenters’ Expectancy and Subjects’ Mean Ratings: Second Replication

                Expectancy
                +5        −5
                +1.51     −0.31
                +0.64     −0.49
                +0.47     −0.65
                +0.13     −1.02
Means           +0.69     −0.62
Table 9–4 Summary of Three Basic Replicates

                Expectancy
Experiment      +5        −5        Difference    t       df    One-Tail p
I               +0.40     −0.08     +0.48         3.20    8     .007
II              +2.27     +0.48     +1.79         4.99    10    .0003
III             +0.69     −0.62     +1.31         3.96    6     .005
Means           +1.12     −0.07     +1.19
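The combined probability of about one in two million can be approximated from the three one-tailed p values in Table 9-4. The following is a minimal sketch of the unweighted form of Stouffer’s method: each p is converted to a standard normal deviate, the deviates are summed, and the sum is divided by the square root of the number of studies. Whether any weighting was applied in the original computation is not stated, so the unweighted form is an assumption, and the function name `stouffer_combined_p` is hypothetical.

```python
import math
from statistics import NormalDist

# One-tailed p values reported for experiments I, II, and III (Table 9-4).
p_values = [0.007, 0.0003, 0.005]


def stouffer_combined_p(ps):
    """Unweighted Stouffer combination of one-tailed p values.

    Returns the combined z and its upper-tail (one-tailed) probability.
    """
    std_normal = NormalDist()
    # Convert each one-tailed p to the z that cuts off that upper tail.
    z_scores = [std_normal.inv_cdf(1 - p) for p in ps]
    z_combined = sum(z_scores) / math.sqrt(len(ps))
    # cdf(-z) is the upper-tail area beyond +z for a symmetric distribution.
    return z_combined, std_normal.cdf(-z_combined)


z, p = stouffer_combined_p(p_values)
print(f"combined z = {z:.2f}, combined one-tailed p = {p:.2e}")
print(f"roughly one in {1 / p:,.0f}")
```

Under these assumptions the combined z is a little under 5 and the combined probability comes out on the order of one in two million, in line with the figure given in the text.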
Some Discussion

It seems reasonable to conclude from these data that the results of an experiment may be determined at least in part by the expectations of the experimenter. Since the experimenters had all read from the identical instructions, some more subtle aspects of their behavior toward their subjects must have served to communicate their expectations to their subjects. From experimental procedures and from more naturalistic observation of experimenters interacting with their subjects, some things have been learned about the communication of expectancies. What is known of this communication will be discussed in a subsequent chapter. We may note in passing, however, that of the studies described just now, one (II of Table 9-4) in which the experimenters were less often in the subjects’ field of vision, and in which experimenters did not handle the stimulus photos, did not show a decrement in the biasing effect of the experimenter’s expectancy. Surprisingly, that study was the one to show the greatest magnitude of biasing effect. It may at least be concluded that the communication of the experimenter’s expectancy does not depend either on his handling of the stimulus materials or on his being within the subject’s constant view. From this alone, it seems that the communication processes involved are not quite like those discovered by Pfungst to apply to Clever Hans. Hans, it will be recalled, did suffer a loss of unintended communication when he lost visual contact with his experimenter.³

In the first few chapters of this book there was a discussion of a number of effects of experimenters which did not affect their subjects’ responses but which could affect the results of their research. It should be considered whether errors of observation or interpretation, or even intentional errors, could have accounted for the findings reported. Errors of observation and of interpretation are hard to discriminate in these experiments. The subject calls out a number and the experimenter records it as he hears it. We do know that errors of recording occur and that they tend to occur in the direction of the experimenter’s expectancy. But the evidence presented in earlier portions of this book suggests that the magnitude of such errors is most often trivial. Intentional errors could have occurred, but they, too, are unlikely to have led to three sets of nonoverlapping distributions.

³ There was another effect possibly due to the different conditions of experiment II. All experimenters of this study tended to obtain ratings of photos as more successful, regardless of their expectancy, than did the experimenters of the other two studies (p < .01, χ² = 6.8, df = 1). It is possible that experimenters of study II, having less to do during their interaction with the subjects, were perceived by them as less important or of lower status. In the chapter dealing with the effects of the experimenter’s status, some evidence was presented which suggested that lower status experimenters did tend to obtain ratings of these photos as being of more successful people.
The hypotheses of recording errors and of intentional errors seem further weakened by the microgeography of the experimental interactions. The subjects sat in such relation to the experimenter that they could see what the experimenter recorded on his data sheet. For either recording errors or intentional errors, therefore, the subject was in a position to correct the experimenter’s entry.4 Finally, from the filmed and direct observations of other experiments in progress, it could be determined that experimenters do record the response as given by the subject. In the filmed studies, not all responses could be checked, however, because there were places where the sound track was too poor to be sure what response the subject had given. In the experiments described, the experimenters were offered extra pay for ‘‘a good job.’’ Perhaps the expectancy effect depends on such extrinsic incentives. On the basis of just these experiments no answer is possible. Later, however, there will be experiments that did not offer such additional incentives to experimenters to obtain biased responses. In fact, we shall encounter evidence suggesting that with increased incentive, the effects of expectancy are reduced or even thrown into a reversal of direction. Questions of the generality of expectancy effects have been discussed in the preceding chapter. In Part III there will be a detailed statement of the generality of expectancy effects based on the research program designed specifically to investigate them. For now, however, we should consider the task employed. On first glance it would seem that neutral photos would, because of their neutrality, make subjects especially watchful of cues from the experimenter to guide them in their ratings. If the photos could ‘‘be’’ anything, successful or unsuccessful, then even minor cues should make it easy to influence the subject’s response. 
It must be considered, however, that the meaning of neutral is not “anything.” The stimulus value, the “reality” of the stimulus, is a specific numerical value, zero. For one group of subjects to rate the photos as significantly different from that zero value, or from the value established by a control group of subjects, is not, therefore, a trivial deviation.

In the three experiments described there was a source of ecological invalidity which should be discussed. That was the fact that experimenters contacted subjects under only a single condition of expectancy. Subjects were expected to be either success perceivers or failure perceivers. In “real” research it is more common for the same experimenter to contact the subjects of both the experimental and the control groups. The question must therefore be raised whether expectancy effects occur also when the same experimenter contacts subjects for whom he has differing expectancies.

An experiment that is similar to the ones described so far and which sheds light on this matter is one conducted by Laszlo. He employed three male experimenters to administer the photo-rating task to 64 female subjects. Each of the experimenters contacted from 18 to 23 subjects. For half these subjects the experimenters were led to expect positive ratings of the success of others (+5), and for half they were led to expect negative ratings (−5). The order in which experimenters contacted each “type” of subject was random. Table 9-5 shows the mean photo ratings obtained by each experimenter under each type of expectancy. All three of the experimenters obtained higher ratings of success when expecting such ratings than when not expecting such ratings (p = .04, one-tailed, t = 3.61, df = 2). The mean magnitude of the expectancy effect was +.46, which was very close to the value of +.48 obtained in the original experiment (I of Table 9-4). In Laszlo’s study there was also no extra pay offered to experimenters for obtaining the expected data. Apparently neither the extra incentive offered for “good” data nor the holding of only a single expectancy for all subjects could account for the results of the three experiments described earlier.

Table 9-5  Experimenters’ Expectancy and Subjects’ Mean Ratings: Alternating Expectancies

                     Expectancy
Experimenter      +5       −5     Difference
A               −.13     −.67       +.54
B               −.51     −.72       +.21
C               −.96    −1.59       +.63
Means           −.53     −.99       +.46

It should be noted, however, that in the Laszlo study, the distributions of mean photo ratings obtained under the two conditions of expectancy did overlap. In that sense at least, the results are less dramatic than those of the other three studies. Whether this was due to some dampening effect of the expectancies’ varying for the experimenters cannot be determined. The Laszlo study differed also in that half the time a higher status was ascribed to the experimenter, and half the time a lower status. This procedural difference might also account for the possibly weakened effect of the experimenter’s expectancy. In a subsequent chapter dealing with the personal characteristics of more successful unintentional influencers, some additional evidence is presented which also shows that the effects of experimenters’ expectancies do not depend upon their contacting subjects under only a single condition of expectancy.

4. We know from the observation of other experiments employing the same task that occasionally subjects do correct their experimenter’s data entry. We cannot be absolutely certain, however, that subjects generally do not let errors observed by them go by without comment. Possibly those of our subjects who corrected their experimenter were unusual. Perhaps they were lower in the need for social approval. An interesting experiment would be to have a sample of experimenters intentionally misrecord their subjects’ responses in plain view of their subjects. One wonders how often these “errors” will be called to the experimenter’s attention, under what conditions, and by what type of subject.
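The significance test reported for the Laszlo data can be retraced from the three per-experimenter differences in Table 9-5 (+.54, +.21, +.63). The following is a minimal sketch, assuming the reported t is an ordinary paired (one-sample) t on those three differences; the function names are illustrative and do not come from the source. The closed-form tail probability used here holds only for exactly 2 degrees of freedom.

```python
import math

# Per-experimenter differences (+5 condition minus -5 condition), from Table 9-5.
differences = [0.54, 0.21, 0.63]


def paired_t(diffs):
    """Return (t, df) for a one-sample t-test of the mean difference against zero."""
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    standard_error = math.sqrt(variance / n)
    return mean / standard_error, n - 1


def one_tailed_p_df2(t):
    """Upper-tail probability of Student's t with exactly 2 degrees of freedom."""
    return 0.5 - t / (2.0 * math.sqrt(2.0 + t * t))


t_value, df = paired_t(differences)
p_value = one_tailed_p_df2(t_value)
print(f"t = {t_value:.2f}, df = {df}, one-tailed p = {p_value:.3f}")
```

Computed from the rounded table entries, t comes out at about 3.60 with a one-tailed p of about .035, which agrees with the reported t = 3.61 and p = .04 to within what rounding of the tabled differences would plausibly produce.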
Another question that must be raised is the extent to which the expectancy effects demonstrated were due, not to the expectancy of the experimenters, but to the expectancy of the author. If that were entirely the case it would not, of course, eliminate the evidence for the effects of the experimenter’s expectancy. It would, however, reduce considerably the number of cases in the sample of experimenters studied from several hundred to one. We would have, then, a longitudinal case study of the expectancy effects of a single investigator, the author. In some of the early studies in the research program such effects of the principal investigator cannot be ruled out. Thus there were studies in which the author ushered subjects into the experimenters’ rooms without being blind to the experimenters’ expectancies. Knowing that a given subject was destined for a ‘‘success’’-expecting experimenter may have led the author to treat these subjects differently, in such a way as to affect
their photo ratings. Even when the walk with the subject from waiting room to laboratory is short, such effects cannot be ruled out. Later studies in the research program eliminated these potential effects. The details of the safeguards against the principal investigator’s expectancy will be given in later chapters. For now it should be mentioned that in many of the studies conducted the investigators did not know which experimenters had what expectancies until the experiment was completely finished.

A point to be developed later is that ten experiments performed in a single laboratory may be worth less than the same ten experiments conducted in ten different laboratories. Most of the experiments reported in this book were conducted in a single “laboratory,” or at least involved one common investigator. For this reason it is especially important to look to other laboratories for evidence to support or to infirm the hypothesis of the expectancy effect of the psychological experimenter. Some such evidence was reported in the last chapter, and a few more recent reports are relevant.

In a demonstration employing the same task described here, Karl Weick had two experimenters conduct the person perception experiment in front of his class in experimental social psychology. One experimenter was led to expect success ratings from his five subjects; the other was led to expect failure ratings from his five subjects. The results are given in more detail in the chapter dealing with the communication of expectancies. Briefly, the experimenter expecting positive ratings obtained a mean rating of +1.18, whereas the experimenter expecting negative ratings obtained a mean rating of 0.50. The difference was significant at the .01 level, one-tailed.
There is a very recent experiment by Masling (1965) in which he gave “special training” to a group of 14 graduate students in a “new method of learning the Rorschach procedure.” Half the examiners were led to believe that experienced examiners obtained a relatively greater proportion of human percepts in the ink blots. The remaining examiners were led to believe that experienced examiners obtained relatively more animal percepts from their subjects. All the examiner-subject interactions were tape-recorded. Examiners led to believe that more experienced examiners obtained relatively more human percepts obtained a ratio of 1.8 animal percepts to each human percept. Examiners led to believe that obtaining animal percepts was more desirable obtained an animal-to-human percept ratio of 2.4 (p = .04). If these examiners also expected to obtain the responses they probably desired, this experiment would be an excellent demonstration of expectancy effects. Even if they did not, however, this study illustrates, with data from a different laboratory, that cognitions of the experimenter may affect the subject’s response by shepherding it into the desired (and perhaps the expected) direction. Interestingly, the analysis of the tape recordings of the examiner-subject interactions revealed no differential reinforcements of subjects’ responses that could account for the differences obtained by the two groups of examiners.

Still more recently and even more directly, Marwit and Marcia (1965) tested the effects of experimenter expectancies on their subjects’ responses to a Rorschach-like task. They employed 36 undergraduate students of experimental psychology to administer a modified Holtzman inkblot test to a total of 54 students enrolled in introductory psychology. Half the experimenters were asked to evolve their own hypotheses as to whether normal college students would give many or few responses
to the inkblot stimuli. The remaining experimenters were given “ready-made” hypotheses as to whether subjects would give many or few responses. About two thirds of the experimenters evolving their own hypotheses expected their subjects to give many responses and one third expected few responses. About two thirds of the experimenters given ready-made hypotheses were, therefore, led to expect many responses to the inkblot stimuli, and one third were led to expect few responses.

The results of the Marwit and Marcia study showed that it made no difference whether experimenters evolved their own hypotheses or were given ready-made hypotheses. In both cases, experimenters expecting more responses to inkblots obtained more responses to inkblots. Among experimenters who originated their own hypotheses, those who expected more responses obtained 52 percent more responses than did those expecting fewer responses. Among experimenters who were given their expectations by the principal investigators, those led to expect more responses obtained 55 percent more responses than did experimenters led to expect fewer responses. For both groups of experimenters combined, these numerically large expectancy effects were also very significant statistically (p = .00025, one-tail, t = 3.76, df = 50).

Marwit and Marcia had felt that the number of questions asked by the experimenters of their subjects might serve to communicate their expectation to their subject. That, they found, was not the case. Whereas greater questioning of subjects was associated with significantly more responses from subjects among experimenters who evolved their own hypotheses, exactly the opposite relationship was found among experimenters who had been given “ready-made” expectancies. There was a general tendency, too, for expectancy effects to increase during the course of the interaction with each subject.
Although this trend cannot establish that experimenters were employing any system of differential reinforcement, this learning curve at least suggests that such reinforcement was a possibility. Alternatively, it might have been the subject who reinforced the experimenter’s unintentional communication behavior. This possibility will be discussed in more detail in the chapter dealing with the communication of expectancy effects. Troffer and Tart (1964) reported on some relevant experimenter effects obtained from a sample of eight experimenters. The sample was particularly interesting in that these experimenters were fully aware of the problem of ‘‘experimenter bias.’’ The experiment called for subjects to be tested on the Stanford Hypnotic Susceptibility Scale (Weitzenhoffer & Hilgard, 1962). Half the time the experimenters administered the scale after a hypnotic induction procedure. Half the time they administered the scale without having attempted any hypnotic induction. All experimenter-subject interactions were tape-recorded, and the experimenters knew that these recordings were being made. The very first item of the suggestibility tests was found to have been read differently to subjects depending on whether the experimenter had or had not gone through the induction procedure. Judges listening to the tapes rated experimenters as speaking in a more relaxed, somnolent, solicitous, and convinced tone when they had gone through the hypnotic procedure before testing their subjects. Whatever the precise cues, judges could correctly assess whether the experimenter had or had not carried out the induction procedure prior to his administration of item No. 1 of the Stanford Scale. Excluding one judge who could not differentiate better than chance, the remaining six judges were correct 73 percent of the time, where 50 percent would have been expected by chance (p < .005). As it happened, that one
judge who performed only at a chance level was the only one who felt that the experiment would not turn up anything. The authors of this report provide two interpretations, either or both of which might have accounted for the results. The first interpretation suggests that the act of having gone through an induction procedure essentially ‘‘warms up’’ the experimenter and makes him a more effective hypnotist. The second interpretation, more relevant to our immediate concern, suggests that experimenters expected better performance in the condition involving hypnotic induction. Expecting such better performance led them to put more into their reading of the item to their subjects. It should be noted that all eight of the experimenters favored the first, or ‘‘warm-up,’’ interpretation over the second, or ‘‘expectancy,’’ interpretation. So although we cannot be sure that we have here a case of expectancy effects, we do have excellent evidence that even seasoned experimenters, cautioned to treat their subjects identically, were unable to do so. Instead, these ‘‘bias-wise’’ experimenters treated their subjects as they would have to be treated to increase the likelihood of the confirmation of the hypothesis. Smallest in sample size, but perhaps the most ‘‘lifelike,’’ of the relevant studies from other laboratories is a study by Rosenhan (1964). There will be occasion to cite his work again when the topic of expectancy control groups is treated in Part III. Briefly, Rosenhan had established through correlational research a certain complex pattern of relationships between hypnosis and various types of conformity behavior. Then he and a research assistant set out independently to replicate these findings. Before beginning the replication, Rosenhan showed the assistant the pattern of correlations he had originally obtained; only he reversed the sign of every correlation coefficient. 
Thus the larger positive correlations became the larger negative correlations, the negatives became the positives. The data the assistant subsequently obtained from her subjects were significantly different from those obtained by Rosenhan in his own replication. In most cases, he reports, the correlations obtained by the assistant were opposite in sign to those obtained by him, but were, of course, in line with the correlations she had been led to expect. In spite of identically programmed procedures, two “real” experimenters obtained significantly opposite data; to each came what was expected. Rosenhan points out that the two experimenters differed in more ways than just in the nature of the expectancy held by each. There were differences in sex, age, status, and experience, and any or all of these could have contributed to the obtained reversals. Rosenhan’s conclusion, however, was that compared to the possible effects of these correlated variables, “It seems far more likely that the differences obtained in the hypnosis-conformity study were a function of the different expectations and hypotheses held by the experimenters” (p. 27).5

The basic experiments designed to test the hypothesis of the effect of the experimenter’s expectancy require one additional comment. That has to do with the fact that in every case deception was involved. There was deception of the subjects in their being told that their task was a test of empathy. There was deception of the experimenters in their not being told that it was their behavior which was of the greatest interest, and in their being given false information about the subjects in order that expectancies could be induced. Deception is a necessary commonplace in psychological research. One does not give subjects the California F Scale and ask them to “fill out this test which tells how authoritarian you are.” Though that might make an interesting experiment, it is just not the way the instrument can be employed. If it and many psychological techniques are to be used at all, the purpose must be disguised, and that, of course, is deception. The problem will be discussed again in Part III. For now the fact of deception must be accepted, and the hope must be that the knowledge acquired through this necessary deception is worth the price of having deceived.

5. Rosenhan (1964) also describes and analyzes another case which could be interpreted as a case of the experimenter’s expectancy determining his behavior toward the subject in such a way as to fulfill his experimental prophecy. These data, which will not be described here, are especially interesting in that they involve a report by a co-author of a technical paper of an experimenter’s behavior toward that co-author at a time when he was a bona fide subject. It represents, therefore, a sophisticated subject’s eye-view of unintended experimenter behavior.
10 Animal Subjects
In the last chapter a question was raised as to the generality of the effects of the experimenter’s expectancy. The experiments to be described in this chapter were designed to extend the generality of these effects. It was felt that a major gain in the generality of the phenomenon depended on the demonstration that expectancy effects might operate with different species of subjects. Accordingly, the subjects of these experiments were rats rather than humans. There were differences other than the change in subjects’ species between these experiments and the original ones employing human subjects. In the animal studies, as in some of the later human studies, the experimenters were offered no special incentives for obtaining data consistent with the experimental hypothesis. In addition, in both the studies to be described now, closer supervision was possible of the experimenter’s conduct of his experiment. There was, therefore, greater opportunity to note instances of error of observation, recording, and response interpretation as well as any intentional errors.
Maze Learning

In the first experiment employing animal subjects, the experimenters were 12 of the 13 students enrolled in a laboratory course in experimental psychology. All the experimenters had been performing laboratory experiments with human subjects during the entire semester. The present study was arranged as their last experiment of the term, and the first to employ animal subjects. The following written instructions were given to each experimenter (Rosenthal & Fode, 1963a):

Instructions to Experimenters. The reason for running this experiment is to give you further experience in duplicating experimental findings and, in addition, to introduce you to the field of animal research and overcome any fears that you may have with regard to working with rats.

This experiment is a repetition of work done on Maze-Bright and Maze-Dull rats. Many studies have shown that continuous inbreeding of rats that do well on a maze leads to successive generations of rats that do considerably better than “normal” rats. Furthermore, these studies have shown that continuous inbreeding of rats that do badly on a maze leads to successive generations of rats that do considerably worse than “normal” rats.
Thus, generations of Maze-Bright rats do much better than generations of Maze-Dull rats. Each of you will be assigned a group of five rats to work with. Some of you will be working with Maze-Bright rats, others will be working with Maze-Dull rats. Those of you who are assigned the Maze-Bright rats should find your animals on the average showing some evidence of learning during the first day of running. Thereafter performance should rapidly increase. Those of you who are assigned the Maze-Dull rats should find on the average very little evidence of learning in your rats. The experiment itself will involve a discrimination learning problem. The animals will be rewarded only if they go to the darker of two platforms. In order that the animals do not simply learn a position response, the position of the darker platform will be varied throughout each day’s running.
The apparatus employed by the experimenters was a simple elevated T-maze described by Ehrenfreund (1952) and built to his specifications. The two arms of the maze were interchangeable; one was painted white, the other a dark gray. For the experimenters’ use at the conclusion of the experiment, a questionnaire was constructed on which could be rated their satisfaction with the experiment, their feelings about the subjects, and a description of their own behavior during the experiment. Each scale ran from −10 (e.g., extremely dissatisfied) to +10 (e.g., extremely satisfied) with intermediate labeled points. On this questionnaire form, space was also provided for each experimenter to describe how he felt before, during, and after the experiment.

The subjects of this experiment were 65 naïve, Sprague-Dawley albino rats which ranged in age from 9 to 15 weeks. Thirteen groups of five each were formed in such a way as to make differences in mean age per group a minimum. Each group was composed of two male and three female animals and ranged in mean age from 12 to 13 weeks. Each group was housed in two cages, segregated by sex, and, several days before the beginning of the experiment, placed on 23-hour food deprivation.

The experimental procedure was described briefly in the instructions to experimenters. On the day the course instructor announced the details of the final experiment of the semester, the laboratory assistant entered the classroom announcing that the “Berkeley Rats” had arrived. Instructions were read to the experimenters and explained further where necessary. Each experimenter was then asked to rate on a 20-point scale how much he or she thought they would like working with the rats. None had any prior experience with animal subjects. On the basis of these ratings, six pairs of experimenters were formed, matched on their estimated liking of the rats.
For each pair, one member was randomly assigned a group of subjects that had been labeled “Maze-Bright,” while the other member of the pair was assigned a group labeled “Maze-Dull.” Thus, the experimental treatment was the information that an experimenter’s rats were Maze-Bright or Maze-Dull. Actually, of course, the groups had been labeled bright or dull randomly but with the restriction that differences in mean age per group per matched pair be at a minimum. Before actually running any subjects, each experimenter was asked to rate on a 20-point rating scale (+10, extremely well, to −10, extremely poorly) exactly how
well he thought his animals would perform. Each subject received one hour of handling and maze experience before being run in the maze. During the maze experience, subjects could obtain food from either arm of the T-maze. Each subject was run ten times a day for five days. For each trial the experimenter recorded whether it was correct or incorrect as well as the time required to complete the response. The darker arm of the maze was always reinforced and the white arm was never reinforced. The darker arm appeared equally often on the right and on the left, although the particular patterning of correct position was developed randomly for each day of the experiment and followed by all experimenters. It was mentioned that 12 of the 13 students in a particular course served as the experimenters. The thirteenth student was an undergraduate research assistant who had worked for almost a year on the program of research of which this experiment was a part. Although it seemed unlikely that any of the students in that class knew about the existence of this research project, and of the thirteenth student’s connection with it, steps were taken to minimize the likelihood that such a connection could be made. The undergraduate research assistant therefore participated in the experiment just as any other experimenter but with the fully conscious motivation to get as good performance from her animals as possible without violating the formally programmed procedures. An advantage of her being in this class was that since the course instructor rarely observed the actual conduct of the course experiments, she could serve as an observer of the experimental procedures actually employed without arousing the self-consciousness that might have been incurred had the course instructor observed the experimental procedures. After the end of the semester during which this experiment took place, one of the experimenters became associated with the research program. 
He was thus also able to give valuable information on actual procedures employed by the experimenters during the conduct of the experiment. All reports made by these assistants were held in confidence and at no time was the name of a specific experimenter mentioned.

Table 10-1 shows the mean number of correct responses per subject for those six experimenters who believed they were running Maze-Bright rats, for the six who believed they were running Maze-Dull rats, and for the research assistant who was aware that the rats were neither bright nor dull but who was trying to obtain maximum performance from them. Performance of the animals run by experimenters believing them to be bright was significantly better on the first, fourth, and fifth days. In addition, when the data from all five days of the experiment were combined, t was again significant, this time with a one-tailed p of .01.

Table 10-1  Mean Number of Correct Responses

                                        Days
Experimenter sample        1       2       3       4       5     Mean
Research assistant       1.20    3.00    3.80    3.40    3.60    3.00
“Maze-Bright”            1.33    1.60    2.60    2.83    3.26    2.32
“Maze-Dull”              0.73    1.10    2.23    1.83    1.83    1.54
“Bright” > “Dull”       +0.60   +0.50   +0.37   +1.00   +1.43   +0.78
t                        2.54    1.02    0.29    2.28    2.37    4.01
p (one-tail)              .03     .18     .39     .04     .03     .01
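The difference row of Table 10-1 and the shape of the two learning curves can be checked directly from the tabled daily means. The sketch below is only an arithmetic check on the published means; the variable and function names are illustrative, and it does not reproduce the t tests, which would require the per-experimenter data that the table does not give.

```python
# Daily mean numbers of correct responses, transcribed from Table 10-1.
maze_bright = [1.33, 1.60, 2.60, 2.83, 3.26]
maze_dull = [0.73, 1.10, 2.23, 1.83, 1.83]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def is_strictly_increasing(values):
    """True when every value exceeds the one before it (a monotonic rise)."""
    return all(earlier < later for earlier, later in zip(values, values[1:]))


# "Bright" minus "Dull" for each day, and for the five-day means.
daily_advantage = [round(b - d, 2) for b, d in zip(maze_bright, maze_dull)]
overall_advantage = round(mean(maze_bright) - mean(maze_dull), 2)

print("daily advantage:", daily_advantage)
print("overall advantage:", overall_advantage)
print("bright curve monotonic:", is_strictly_increasing(maze_bright))
print("dull curve monotonic:", is_strictly_increasing(maze_dull))
```

The recomputed differences match the tabled row, and only the “Maze-Bright” series rises on every successive day.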
Inspection of the day-by-day means for each group of experimenters shows that the “bright” animals’ performance increased monotonically as might be expected if learning were occurring. The obtained monotonic increase could be expected by chance only six times in a hundred. The “dull” animals’ performance, on the other hand, increased only to day three, dropping on the fourth day and not changing on the fifth. The differences in obtained functions as well as the differences between performance means suggest that learning was less likely among rats run by experimenters believing them to be dull.

Table 10-1 also shows that, except for the first day of the experiment, the experimenter who was a research assistant and trying explicitly to obtain good performance from her rats did obtain better performance than did the experimenters believing their animals to be bright (p < .05, one-tail, t = 2.38). While her obtained performance function was not a monotonically increasing one, interpretation of this seems restricted by the fact that she ran relatively few animals compared to the number in the two experimental groups. Interpretation of the obtained t suggests that an experimenter who is explicitly “biased” to obtain good performance from animal subjects obtains even better performance than do experimenters who are biased to expect good performance but not explicitly instructed to obtain it.

Of the 300 occasions when subjects were run (60 subjects × 5 days) there were 60 occasions when the animal made no response at all. On the average, then, in one out of every five sessions the animals refused to make a choice. This relatively poor performance may have been due to the difficulty of the discrimination problem, the limiting of pretraining to one hour, or the inexperience of the experimenters in running animals. At any rate, these no-response occasions were not equally distributed between the experimental groups.
There were 17 such occasions among the “bright” subjects and 43 among the “dull,” a division significant at the .001 level. Since the “dull” animals made fewer responses, it was possible that the results shown in Table 10-1 were confounded, as animals responding more are likely to respond correctly more often. In order to partial out the effects of greater nonresponding among the “dull” rats, the mean time in minutes required to make only correct responses was computed for each day separately for the two experimental groups. The obtained mean times are shown in Table 10-2. Although for any given day the running times did not differ significantly between the two treatment groups, the difference for the entire experiment was found to be significant (p < .02, one-tail, t = 3.50). Thus, animals run by experimenters believing them to be bright made their correct choices more rapidly than did the rats run by experimenters believing their rats to be dull.

Table 10-2  Time Required to Make Correct Responses

                                        Days
Experimenter sample        1       2       3       4       5     Mean
Research assistant       5.45    1.63    2.04    0.74    0.68    2.11
“Maze-Bright”            3.13    2.75    2.05    2.09    1.75    2.35
“Maze-Dull”              3.99    4.76    3.20    2.18    3.20    3.47
“Bright” < “Dull”       +0.86   +2.01   +1.15   +0.09   +1.45   +1.12
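The uneven split of no-response occasions reported above (17 among the “bright” animals, 43 among the “dull”) can be examined with a simple goodness-of-fit calculation. The text does not say which test produced the .001 level, so the equal-split chi-square with 1 degree of freedom used here is an assumption, and the function names are illustrative only.

```python
import math


def chi_square_equal_split(count_a, count_b):
    """Chi-square (df = 1) for the departure of two counts from a 50/50 split."""
    expected = (count_a + count_b) / 2
    return ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected


def chi_square_p_df1(statistic):
    """Upper-tail probability of a chi-square variate with 1 degree of freedom."""
    return math.erfc(math.sqrt(statistic / 2))


no_response_bright = 17
no_response_dull = 43
total_sessions = 60 * 5  # 60 subjects, each run on 5 days

statistic = chi_square_equal_split(no_response_bright, no_response_dull)
p_value = chi_square_p_df1(statistic)
no_response_rate = (no_response_bright + no_response_dull) / total_sessions

print(f"chi-square = {statistic:.2f}, p = {p_value:.4f}")
print(f"share of sessions with no response = {no_response_rate:.2f}")
```

Under that assumption the statistic is about 11.27 and the probability falls below .001, in line with the significance level given in the text; the no-response rate of .20 corresponds to the one session in five mentioned there.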
Inspection of the day-by-day means for the two treatment groups shows that the ‘‘bright’’ animals tended to improve more steadily than did the ‘‘dull’’ animals. The related question may be raised of whether ‘‘bright’’ rats simply ran faster or whether they actually improved their performance compared to the ‘‘dull’’ rats. Comparing the running time of the ‘‘dull’’ animals on their first and fifth days yielded a t of less than one, suggesting that this group did not improve their performance significantly. The comparable t for the ‘‘bright’’ animals was 1.77, which has a p of .06 (one-tailed test), suggesting that this group probably did improve their performance during the course of the experiment. Table 10-2 also shows that the experimenter who was actually a research assistant obtained the shortest mean running time per correct response. Except for day one, on which her animals ran slowest of any group, her rats performed better than those run by experimenters believing their rats to be bright. This trend serves to support the earlier interpretation that an experimenter who is explicitly ‘‘biased’’ to obtain good performance from animal subjects obtains better performance than do less explicitly ‘‘biased’’ experimenters. (It is also possible, of course, that by chance this particular experimenter was the most competent.) To what extent could the obtained results have been due to intentional or other errors on the part of the experimenters? The two experimenters who subsequently worked with the research program on experimenter expectancy had been in a position to observe most, but not all, of the actual experimental procedures. There were no observed instances of rats not actually being run or of the making of incorrect entries on the data sheets. There was, however, a total of five observed instances of deviation from programmed procedure when experimenters prodded subjects to run the maze. 
Two of these instances occurred among experimenters running “bright” rats while three occurred among experimenters running “dull” rats. It appears unlikely from this distribution of instances of procedural deviation that the differences obtained between the treatment groups could be ascribed to gross procedural or intentional errors on the part of the experimenters. In addition, the superior performance of the animals run by the research assistant shows that intentional errors are not needed to explain the good performance of animals run by experimenters believing them to be bright.

A question of some interest deals with whether both groups of experimenters were biased by their expectation or whether only one of the groups was actually biased, with the other group obtaining data no different from what they might have obtained had they been given no expectation. Prior to running their animals, all experimenters had been asked to predict the performance they would obtain from their animals. It was possible, therefore, to compute a correlation between the data the experimenter expected to obtain and the data he subsequently obtained. Such a correlation can serve as an index of the degree of expectancy effect. In this experiment there was a shrinkage of the correlations due to the experimenters within each condition having predicted what they were led to expect. This decreased the variation of predictions and, therefore, the correlations obtained. The rank correlation between expected and obtained performance was +.43 for the experimenters running “bright” rats and +.41 for those running “dull” rats. Since there were only six experimenters in each group, these correlations did not reach statistical significance (although when the correlations were combined, the one-tailed p ranged between .12 and .007 depending on the method of combination).

Table 10–3 Descriptions of Subjects’ and of Experimenters’ Behavior

                           Belief about subject
                           “Bright”    “Dull”       t      Two-tail p
Subjects’ Behavior
  1. Bright                  4.2        3.0       2.94       .04
  2. Pleasant                4.8        0.0       1.77       .15
  3. Likable                 4.8        2.2       0.92       .40
Experimenters’ Behavior
  1. Satisfied               3.0        2.5       2.10       .10
  2. Relaxed                 8.7        4.8       5.11       .005
  3. Pleasant                6.7        2.8       2.56       .05
  4. Friendly                5.3        1.3       2.61       .05
  5. Enthusiastic            5.5        0.2       1.51       .19
  6. Nontalkative            6.2        3.2       1.19       .29
  7. Gentle handling         6.5        2.7       1.95       .11
  8. Much handling           5.2        0.3       1.17       .30

These findings suggest that the two groups of
experimenters were probably biased by their expectations to about the same degree, although of course in opposite directions. At the conclusion of the experiment, each experimenter made ratings of his subjects, of his satisfaction with the experiment, and of his behavior during the experiment. These ratings were designed to suggest the mechanisms whereby the experimenters unintentionally influenced their animals to perform as the experimenter expected. The mean ratings on these scales are shown separately for each group of experimenters in Table 10-3. Only those scales have been listed which differentiated experimenters with different expectancies in both the experiments described in this chapter. Experimenters who expected good performance from their animals saw them as brighter, somewhat more pleasant, and more likable. These experimenters were more satisfied with their participation in the experiment and felt more relaxed in their contacts with the rats. They described their behavior toward their animals as more pleasant, more friendly, somewhat more enthusiastic, and less talkative. Of course, we cannot be sure of the sense modality by which the experimenter’s expectancy is communicated to the subject. Rats are sensitive to visual, auditory, olfactory, and tactual cues (Munn, 1950). These last, the tactual, were perhaps the major cues mediating the experimenter’s expectancy to the animal. The attitudinal ratings described, which differentiated the two groups of experimenters, may well have been translated into the quantity and quality of their handling of the animals. Table 10-3 suggests that experimenters expecting and obtaining better performance handled their rats more and also more gently than did the experimenters expecting and obtaining poorer performance. After a description of the second experiment employing rat subjects, further evidence will be presented that increased handling can improve performance (e.g., Bernstein, 1957). 
At the end of each experimenter’s questionnaire, space was provided for any comments he might wish to make. These comments suggested that the experimenters were unaware of their differential handling of the animals as a function of their expectancy. In addition, these comments made it appear still more unlikely that there were intentional errors being committed. Nine of the 12 experimenters
spontaneously reported feeling good when the animals performed well, and feeling badly when they performed poorly. These comments were equally distributed between the experimental groups, four in the ‘‘dull,’’ five in the ‘‘bright.’’ Since even the experimenters expecting poor performance stated that they felt better if subjects performed better, it seems unlikely that they would have done anything to worsen their subjects’ performance, at least intentionally. In fact, because it was academically important to the experimenter that he demonstrate the ‘‘laws of learning’’ in his experiment, the pressures on the experimenters expecting poor performance were to get good performance, and good learning. Any intentional errors, if they did occur, should therefore have operated to reduce those effects of the experimenters’ expectancy that were demonstrated in this study.
Operant Learning

The second experiment designed to demonstrate the effects of the experimenter’s expectancy on his animal subjects was conducted at the Ohio State University in the laboratory of Professor Reed Lawson, at his invitation (and has been reported earlier, Rosenthal & Lawson, 1964). A considerable gain in generality accrues from this fact. It was hinted at earlier and will be discussed in detail in Part III, but for now it is enough to make the point that replications conducted in a different laboratory are worth more than those conducted in one’s own. Other gains in generality deriving from this second experiment with animal subjects will be mentioned shortly when this experiment is compared more systematically to the one already described.

In this experiment there were 30 male and 9 female students enrolled in a course in experimental psychology to serve as the experimenters. At the very beginning of the course all experimenters were given the following written instructions:

Instructions to Experimenters. The reason for running these experiments is to give you experience in duplicating experimental findings and, in addition, to introduce you to the field of animal research and overcome any fears you might have with regard to working with rats. The experiments are all repetitions of work done recently on Skinner Box–Bright and Skinner Box–Dull rats. Many studies have shown that continuous inbreeding of rats that do well on Skinner box problems, such as those you will be running, leads to successive generations of rats that do considerably better than “normal” rats. Furthermore, these studies have shown that continuous inbreeding of rats that do badly on Skinner box problems, such as those you will be running, leads to successive generations of rats that do considerably worse than “normal” rats. Thus generations of Skinner Box–Bright rats do much better than generations of Skinner Box–Dull rats. Each of you will be assigned to a group to work with.
Some groups will be working with Skinner Box–Bright rats, others will be working with Skinner Box–Dull rats. Those of you who are assigned the Skinner Box–Bright rats should find your animals on the average showing some evidence of learning during even the early stages of each of your experiments. Thereafter, performance on each of your experiments should rapidly increase. Those of you who are assigned the Skinner Box–Dull rats should find on the average very little evidence of learning in your rats. You should, however, not
become discouraged, since it has been found that even the dullest rats can, in time, learn the required responses. If you are interested in learning more about the details of the experiments on breeding rats for brightness and dullness, your lab instructors can give you references to the work done by Tryon and others at the University of California at Berkeley and elsewhere.

The animals employed in this experiment were 16 female laboratory rats (all 80 days old) drawn from the animal colony maintained by The Ohio State University Department of Psychology. They were randomly assigned to one of two groups. One group of eight rats was assigned to home cages which had been labeled “Skinner Box–Bright,” while the other group was assigned to home cages which bore the labels “Skinner Box–Dull.” Early in the course, two of the animals labeled “Dull” died, so that the maximum number available for the subsequent experiments was eight rats labeled “Bright” and six labeled “Dull.” All were on a feeding regimen of one-half hour ad lib access to food daily throughout the eight weeks of the study.

The basic equipment employed in the studies was commercially made (Scientific Prototype Co.) demonstration Skinner boxes with feeders that dispensed 45 mg P. J. Noyes pellets. Experimenters followed the laboratory manual of Homme and Klaus (1957), except that food pellets were used instead of water as reinforcement.

The questionnaire employed in the last study, in which experimenters could rate their satisfaction with their participation in the experiments, their feelings about their animals, and their description of their own behavior during the conduct of the experiments, was again administered at the conclusion of the study. A few new scales were added to this questionnaire.

At the beginning of the study each experimenter was assigned to one of five laboratory periods, to each of which had been assigned one or two “bright” and one or two “dull” rats.
Assignment to laboratory sections could not be random, since there were only certain times that certain experimenters were able to schedule their laboratory section. Within each laboratory section, however, experimenters were randomly assigned to the animals to be run during that section. At least two experimenters were assigned to each subject, and the mean number of experimenters per subject was 2.7. Each laboratory team performed three different functions during each of the experiments: that of experimenter, timer, and recorder. These functions were rotated among the members of each laboratory team. For those teams consisting of only two members, the functions of timer and recorder were usually performed by the same person.

A total of seven experiments was performed, each of which is described in detail in the manual mentioned earlier (Homme & Klaus, 1957). A brief description of each follows:

1. Magazine Training. Training the rat to run to the magazine and eat whenever the feeder was clicked. Latencies were recorded for each click, and the dependent variable was defined as the mean latency on the first and last ten clicks of the session.

2. Operant Acquisition. Training the rat to bar-press. Number of bar-pressing responses per minute was recorded, and the dependent variable was defined as the mean number of responses during the first and last ten minutes of the session.

3. Extinction and Spontaneous Recovery. Number of responses per minute was again recorded, and the dependent variable was defined as the number of minutes elapsed until the animal showed two response-free minutes. Data were analyzed separately for the two parts of this experiment.

4. Secondary Reinforcement. In this experiment the animals’ responses were reconditioned and partially reextinguished. Subsequent responses were reinforced by the clicking sound without presentation of food. The dependent variable was again defined as the number of minutes elapsed until the animal showed two response-free minutes, while getting click reinforcements.

5. Stimulus Discrimination. Training the rat to bar-press only in the presence of a light and not in the absence of the light. For each trial of this experiment the experimenters recorded the latency for the reinforced response and the number of responses occurring under the nonreinforced condition until a criterion of 30 seconds of no responding had been reached. The dependent variable was defined as the ranks of the mean latencies of the first and last ten trials added to the ranks of the mean number of nonreinforced responses during these same trials.

6. Stimulus Generalization. Demonstrating that animals trained to respond only in the presence of a 110-volt light would show a decrease in response rate as the voltage was decreased to 70 v, to 35 v, and finally to 0 v. For each of the four test periods, the number of responses was recorded, and the dependent variable was defined as the probability for each subject that his response decrements as a function of stimulus decrements could have occurred by chance. The ranks of these probabilities would, of course, be identical, or nearly so, with the ranks of any other index of monotonic decrease.

7. Chaining of Responses. Conditioning a loop-pulling response which was followed by the light which signaled the animal that a bar-press would produce a food pellet. The number of complete chains per minute was recorded and the dependent variable was defined as the mean number of completed chains during the first and last ten minutes.
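The extinction criterion used in experiments 3 and 4 (minutes elapsed until the animal showed two response-free minutes) can be made concrete with a short sketch. The text does not say whether the two response-free minutes had to be consecutive, or whether they count toward the elapsed time; the function below assumes both, and the per-minute counts are invented purely for illustration.

```python
from typing import Optional, Sequence


def minutes_to_extinction(responses_per_minute: Sequence[int],
                          quiet_minutes_required: int = 2) -> Optional[int]:
    """Return the number of minutes elapsed when the criterion is met.

    Assumptions (not stated in the source): the response-free minutes must be
    consecutive, and the elapsed time includes those quiet minutes.
    Returns None if the session ends before the criterion is reached.
    """
    quiet_run = 0
    for minute, count in enumerate(responses_per_minute, start=1):
        # Extend the run of quiet minutes, or reset it on any response.
        quiet_run = quiet_run + 1 if count == 0 else 0
        if quiet_run == quiet_minutes_required:
            return minute
    return None


# Hypothetical bar-press counts per minute for one rat (illustrative only).
session = [9, 7, 4, 0, 2, 1, 0, 0, 3]
print(minutes_to_extinction(session))
```

Under a different reading of the criterion (for example, any two response-free minutes, consecutive or not) the reset step would simply be dropped.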
The students were expected to complete each of these studies in one 2-hour period each week, excepting the stimulus discrimination study, which was allotted two periods to complete. If a team did not complete a study within the scheduled time, they had to return to the laboratory in their free time and continue working until their subject was ready to go on to the next scheduled experiment. Even more than in the last experiment, then, the experimenters were all well motivated to have their animals learn as well as possible.

Comparison with the maze learning experiment. There were several differences between this and the first study. The studies were done at different universities using different learning tasks and apparatus. In this study there were fewer subjects, 14 compared to 60, but more experimenters, 39 instead of 12. In addition, this was a longitudinal study lasting about 8 weeks and a minimum of 14 hours spent with each animal, while the earlier study lasted 1 week with only 5 hours spent with each group of animals. In the present study, in spite of rotating their team functions, all experimenters spent a minimum average of four hours working with their rat, whereas in the earlier study no experimenter spent more than one hour with any one of his five animals.

In the earlier study experimenters worked alone and were much of the time unobserved by the laboratory supervisor. Whereas those instances of procedural deviation that came to light were found to be randomly distributed over the two treatment conditions, the present study provided better control over this possibility, since a laboratory instructor was present during each of the laboratory periods.
Perhaps more important than the control of procedural deviation was the control of gross cues to the animals. Thus, if an experimenter, because of his belief that a rat was dull, handled the animal roughly, the laboratory instructor was there to point out to the experimenter that his rat would never learn unless he were better treated. In the present study, too, the motivations of the experimenters were quite different. In the earlier study it was found that experimenters felt better when their rats learned well but there was no external sanction for their learning well. In the present situation the rat in effect had to learn in order that the experimenter could write a report, get a grade, and go on to the next study. An additional motivational difference was possibly associated with the differing roles of the laboratory instructors in the earlier and the present study. In the earlier study, the lone laboratory instructor reinforced the experimenters’ beliefs that poor performance was accounted for by the rats’ ‘‘dullness.’’ As it happened, and not by design, in the present study, only one of the three laboratory instructors did so. Another instructor, as it happened, evaded any reference to the rats’ brightness or dullness, while the third instructor told his students that there was no such thing in the final analysis as a dull rat, only a dull experimenter! Quite accidentally, then, a small sample of ‘‘climates’’ was acquired apparently more or less favorable to the occurrence of experimenter expectancy effects. This variation in ‘‘climates’’ would serve to increase somewhat the generality of any obtained findings. Any one team of experimenters performed their experiments in only one of these three climates. Preliminary inspection of the results revealed that for several of the experiments there were such extremely deviant scores that the use of interval scale statistics seemed inappropriate. 
Therefore, in each experiment, the obtained scores were converted to ranks and the treatment effect evaluated by means of the Mann–Whitney U test. Since on the average each experiment was not conducted by some team of experimenters from each treatment condition, the mean raw rank for each treatment group was not comparable from experiment to experiment. In order to achieve comparability of mean ranks across experiments, and to legitimize their addition, all ranks were converted to Guilford’s (1954) C-scale scores. Table 10-4 shows these normalized mean ranks for each treatment group as well as the rank correlation of the performances in that experiment with the performances of the preceding experiment.

Table 10–4 Mean Ranks of Operant Learning for Seven Experiments

                                    Belief about subject                 Correlation with
Experiment                          “Bright”    “Dull”     One-tail p    Preceding Experiment
I     Magazine training               4.4        5.8         .13           –
II    Operant acquisition             4.3        6.2         .09           .25
IIIA  Extinction                      4.2        5.8         .12           .08
IIIB  Spontaneous recovery            4.6        5.0         .48           .25
IV    Secondary reinforcement         4.7        5.5         .17           .37
V     Stimulus discrimination         4.0        6.3         .008          .38
VI    Stimulus generalization         4.3        5.8         .02           .59 (p < .05)
VII   Response chaining               5.8        3.8         .92           .45
      Means (Total)                   4.5        5.5         .015          .35

A lower mean rank is assigned to a superior
performance. The overall probability that the superior learning shown by the rats labeled as ‘‘Bright’’ could have occurred by chance was .015 (one-tail). Inspection of the mean ranks for all eight comparisons shows that in every case but one, performance was superior when the experimenters expected a superior performance. It appears likely then that experimenters’ belief or expectation about the performances of their animals was responsible in part for the performances obtained. Inspection of the p levels for the eight comparisons suggests no trend for subsequent treatment effects to become either more or less significant. The combined p level for the first four comparisons was .035, and for the last four comparisons it was .025. The question of correlated performances should be raised. That is, did the differences between the treatment groups arise during the first experiment and then simply maintain themselves over subsequent experiments? The answer to this question will tell us in part whether the seven experiments were nothing more than a single experiment replicated seven times. In Table 10-4 the last column shows that in most cases less than 15 percent of the variance of performances in any experiment could be accounted for by the performances in the preceding experiment. Only the correlation between performances in the experiments on stimulus discrimination and generalization was significant at the .05 level, one-tail test. A good illustration that the degree of interexperimental correlation was not sufficient to regard the seven experiments as only a single experiment is provided by examination of the results of the response-chaining experiment in Table 10-4. In that experiment subjects’ performances correlated .45 with their performances in the preceding experiment, this correlation accounting for about 20 percent of the variance. 
Yet in spite of this, the obtained mean differences in performance differed significantly from each other and were in the opposite directions. Although the animals’ performances from experiment to experiment were not accounting for much of the variance of subsequent experiments, there was a tendency for later performances to be better predictors of subsequent performances. This increase over time of these correlations was significant at the .01 level (rho = +.90). Such an increase suggests that, over time, the animals may have been more and more “permanently” affected by their experimenters’ differential treatment.

The original assignment of subjects to treatment conditions had been random, but the question may fairly be asked whether by chance animals labeled “Bright” might not in fact have been brighter, especially in view of the small sample size.¹ This question cannot be answered directly, but the likelihood of this factor accounting for the obtained results can be evaluated. If the obtained results had been due to preexperimental differences among the animals rather than to the labeling treatment, we would have expected correlations differing significantly from zero between subjects’ performance in an experiment and their performance in the subsequent experiment. As an additional check on this question, the following comparison was made. Those four “dull” rats who participated in both experiments I and II were matched with those four “bright” rats who also performed in both experiments and whose performances in experiment I were most similar to those of the “dull” rats. The mean normalized rank of performance in experiment I for the four “dull” rats was 5.5, while for the four “bright” rats the mean was 5.8.

¹ Max Bershad and Leon Pritzker pointed out this problem and clarified some of the issues involved.

The mean normalized rank of performance in
experiment II was 6.0 for these ‘‘dull’’ rats and 3.8 for these ‘‘bright’’ rats, the difference being significant at the .10 level, one-tail U test. Thus, when considering experiment I as the basis for ‘‘preexperimental’’ matching, the subsequent experiment showed no change in the direction of difference, or the degree of its significance, from what was obtained without preexperimental matching. It seems reasonable, then, to assume that if preexperimental differences in ability favored the ‘‘bright’’ animals, this could at most have affected only the results of experiment I. It was mentioned earlier that all animals had been assigned to one of five laboratory periods, one or more of which were supervised by one of three laboratory instructors. It was also mentioned that each of these instructors appeared to provide a somewhat different ‘‘climate,’’ which might be interpreted as more or less favorable to the occurrence of experimenters’ expectancy effects. Table 10-5 shows the normalized mean ranks for all the experiments combined for each treatment group listed by laboratories. In all five laboratories the treatment effects were in the predicted direction, with p levels ranging from .07 to .25 and the combined p < .005. The differences in obtained p levels for such a small sample of laboratories do not seem to warrant elaborate interpretation, although it seems safe to say that ‘‘climates’’ or lab periods did not seem to make much difference. At the conclusion of the experiments, each experimenter filled out the questionnaire described earlier. Table 10-6 shows the mean rating for each treatment condition on each of the scales for which the results were given for the maze learning experiment. The last two scales were new for this experiment and the ‘‘amount of handling’’ ratings were made separately for handling before each experiment and after each experiment. 
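The rank correlation of about .90 reported earlier for the increase over time in the inter-experiment correlations can be checked against the last column of Table 10-4. The sketch below ranks the seven correlations (tied values share their average rank) and correlates those ranks with the order of the experiments. The text does not spell out the computation, so this is an assumed reconstruction; the helper names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 (smallest) upward, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over any run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)


# Correlations with the preceding experiment, from Table 10-4 (II through VII).
order_in_series = [1, 2, 3, 4, 5, 6, 7]
correlations = [0.25, 0.08, 0.25, 0.37, 0.38, 0.59, 0.45]

rho = spearman_rho(order_in_series, correlations)
print(round(rho, 2))
```

With these seven values the result rounds to .90, in agreement with the figure given in the text.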
In this study, too, experimenters believing their rats to be bright were significantly more satisfied with their participation in the experiments than were those believing their animals to be dull. However, even these latter experimenters were remarkably satisfied (6.6) compared to either the experimenters running ‘‘bright’’ rats (3.0) or those running ‘‘dull’’ rats (2.5) in the earlier study. This much greater satisfaction of the experimenters in the later study may have been due in part to the nature of the experiments performed. These experiments were an integral part of the course content, and the principles they were designed to demonstrate were covered in lectures by the course instructor. In the earlier study there was relatively much less relationship of the experiment to the content of the course. Furthermore, in the earlier study, although the ‘‘bright’’ rats learned faster, none really learned well. In the later study, although ‘‘bright’’ rats again learned faster, almost all animals did learn eventually.
Table 10–5 Mean Ranks of Operant Learning for Five Laboratories

               Belief about subject
Laboratory     “Bright”    “Dull”     One-tail p
A                4.3        5.3         .08
B                4.9        6.5         .07
C                5.1        5.8         .25
D                3.7        4.6         .21
E                4.1        6.0         .07
Means            4.4        5.6         .02
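The combined p reported in the text for the five laboratories can be approximated from the one-tail p levels in Table 10-5. The text does not name the method of combination; the sketch below assumes the sum-of-z (Stouffer) method, which treats the five laboratories as independent, and under that assumption it gives a combined p just under .005.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_combined_p(one_tail_ps):
    """Combine independent one-tail p levels by summing their normal deviates."""
    std_normal = NormalDist()
    # Convert each p to the standard normal deviate it cuts off in the upper tail.
    z_values = [std_normal.inv_cdf(1 - p) for p in one_tail_ps]
    combined_z = sum(z_values) / sqrt(len(z_values))
    return 1 - std_normal.cdf(combined_z)


# One-tail p levels for laboratories A through E, from Table 10-5.
lab_ps = [0.08, 0.07, 0.25, 0.21, 0.07]
combined = stouffer_combined_p(lab_ps)
print(f"combined one-tail p = {combined:.4f}")
```

Other combination methods give different values from the same five p levels, so agreement here should be read as suggestive rather than as confirmation of the method actually used.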
Table 10–6 Descriptions of Subjects’ and of Experimenters’ Behavior

                              Belief about subject                            Both studies
                              “Bright”    “Dull”       t      One-tail p      combined p
Subjects’ Behavior
  1. Bright                     5.0        -2.6      2.64       .02             .002
  2. Pleasant                   5.2         3.7      1.07       .16             .05
  3. Likable                    4.1         2.2      1.28       .12             .08
Experimenters’ Behavior
  1. Satisfied                  9.1         6.6      4.40       .0005           .0003
  2. Relaxed                    6.3         6.1       –          –              .03
  3. Pleasant                   6.4         5.3      2.04       .04             .005
  4. Friendly                   5.8         3.9      1.31       .11             .02
  5. Enthusiastic               4.2         1.9      1.31       .11             .04
  6. Nontalkative              -0.2        -2.7      1.11       .15             .07
  7. Gentle handling            5.8         5.7       –          –              .13
  8. Much handling            (-1.0)      (-2.8)      –          –             (.10)
     a. before experiments     -1.2        -2.3       –          –              .14
     b. after experiments      -0.8        -3.3      1.17       .14             .07
  9. Watching subjects          9.3         8.3      2.16       .03              –
 10. Talking to subjects       -3.1         0.7      2.41       .02              –
The overall descriptions which differentiated the two groups of experimenters were much the same as in the first animal study. The last column of Table 10-6 shows the combined probabilities derived from both experiments. For both experiments, experimenters believing their subjects to have been bred for good learning judged their rats’ behavior to be brighter, more pleasant, and more likable. These experimenters were also happier about their conduct of the experiment, and more so in this than in the preceding study. That makes sense, because in the present study, experimenters had much to be dissatisfied about if their animal learned poorly. They had to come in after hours and get their animal ‘‘caught up.’’ In both studies, experimenters assigned ‘‘bright’’ rats felt more relaxed in their contacts with their animals and felt their own behavior showed a more pleasant, friendly, and enthusiastic approach. These more global attitudes may have been translated more specifically into a less talkative ‘‘interaction’’ with the animal (one wonders what experimenters might have been saying to their ‘‘dull’’ rats), more handling, and more gentle handling. As we would expect, the absolute amount of handling was considerably less in the operant learning experiments than in the maze learning study, regardless of the experimenters’ belief about their animals’ ability. What handling took place in the operant learning studies was confined to conveying animals from home cage to Skinner box and back again. In the maze learning study the animals were similarly handled in transport but also were handled after each trial when they were returned to the starting box of the T-maze. In the operant learning study separate descriptions were given of handling of animals before and after the experiments. 
Table 10-6 shows that experimenters expecting better performance handled their animals about 33 percent more after each experiment, whereas experimenters expecting poorer performance handled their animals about 44 percent less after each experiment. Handling in this experiment was generally quite gentle, even among experimenters
expecting poor performance. That was not true for the first experiment and may have been due to the presence of the laboratory instructor, who could note and call attention to any rough handling. Therefore, most of the handling of this second experiment may have been positively reinforcing to the animals. Experimenters expecting good performance may have rewarded their rats tactually after a good performance, and such reward may have improved the animals’ performance in the subsequent experiment. Experimenters expecting poor performance withheld their positive tactual reinforcement more after each experiment, perhaps because there was little to be even unintentionally reinforcing about. We cannot decide easily whether this differential postexperimental reinforcement was only a consequence of the subjects’ performance or whether it was a partial determinant of that performance. It might well have been both. Also in the operant learning study, experimenters kept a closer watch on their rats’ behavior if they expected better performance, although all experimenters watched their animals very closely. In operant learning studies, closer observation of the subject may lead to more appropriate and more rapid reinforcement of the response desired. So the closer watching, perhaps due to the expectation that there would be more promising responses to be seen, may have made better ‘‘teachers’’ of the experimenters expecting good performance. The last item on each questionnaire was an open-ended one asking each experimenter to say in his own words how he felt about the experiments. Nineteen completed questionnaires were obtained from those who had worked with ‘‘bright’’ rats, and 17 were received from those who had worked with ‘‘dull’’ rats. 
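The percentage changes in handling cited above follow from the “before” and “after” ratings in Table 10-6 if each change is expressed relative to the size of the “before” rating. That reading of the calculation is an assumption, since the text does not show the arithmetic; the function name is hypothetical.

```python
def relative_change(before: float, after: float) -> float:
    """Change from 'before' to 'after' as a fraction of the magnitude of 'before'."""
    return (after - before) / abs(before)


# Mean "much handling" ratings from Table 10-6 (before, after each experiment).
bright = relative_change(before=-1.2, after=-0.8)
dull = relative_change(before=-2.3, after=-3.3)

print(f"bright: {bright:+.1%}")
print(f"dull:   {dull:+.1%}")
```

The first figure comes out at about one third, matching the 33 percent in the text. The second comes out near 43.5 percent, close to the 44 percent cited; the small difference may reflect rounding of the published means.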
Table 10-7 shows the percentage of each group of experimenters who spontaneously mentioned (1) the benefit they derived from the experiments, (2) how interesting the experiments were, (3) the difficulties of getting their animals to learn anything. The pattern of spontaneous responses follows closely the pattern we expect from the analysis of the more formally coded responses. One of the open-ended comments was especially interesting: “Our rat, number X, was in my opinion, extremely dull. This was especially evident during training for discrimination. Perhaps this might have been discouraging but it was not. In fact, our rat had the ‘honor’ of being the dullest in all the sections. I think that this may have kept our spirits up because of the interest . . . in [our] rat.” As a matter of fact, the animal in question was one of the two animals performing at the median level on the discrimination problem as well as for all the experiments taken as a whole. The cited comment serves to point out anecdotally the importance to the experimenters of the type of rat they were running.

Table 10–7 Spontaneous Comments by Experimenters

                                 Belief about subject
Comments                         “Bright”    “Dull”     One-tail p
1. Beneficial experience           63%        41%         .16
2. Interesting experience          53%        18%         .04
3. Uneducable subject               5%        47%         .007
4. No comments made                 0%        12%         .22

None of the 34 written comments even remotely suggested that any experimenter was aware that the subjects had not been
specially bred. The impression of the three laboratory instructors confirmed this lack of suspicion on the part of the experimenters. On the last day of the course, after the experiments and questionnaires had been completed, the entire study was explained to all the experimenters. There appeared to be great interest and animation on their part. One reaction, though, was surprising, and that was the sudden increase in sophistication about sampling theory in the experimenters who had been assigned ‘‘dull’’ rats. Many of these experimenters pointed out that, of course, by random sampling, the two groups of rats would not differ on the average. However, they continued, under random sampling, some of the ‘‘dull’’ rats would really be dull by chance and that their animal was a perfect example of such a phenomenon.
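As a consistency check on Table 10-7, its percentages can be converted back to approximate head counts, given the 19 and 17 questionnaires returned by the two groups. This check is added here and is not a calculation from the source; it assumes each percentage is a rounded share of the group's returned questionnaires. Under that assumption the number of experimenters who wrote a comment comes to 34, the count given in the text.

```python
# Questionnaires returned per group, as stated in the text.
returned = {"bright": 19, "dull": 17}

# Percentage of each group making no comment, from Table 10-7.
no_comment_pct = {"bright": 0, "dull": 12}


def implied_count(percent: float, group_size: int) -> int:
    """Nearest whole number of experimenters implied by a rounded percentage."""
    return round(percent / 100 * group_size)


# Experimenters who wrote a comment = returned questionnaires minus silent ones.
commenters = sum(
    returned[g] - implied_count(no_comment_pct[g], returned[g]) for g in returned
)
print(commenters)
```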
Some Discussion

The results of the experiments reported suggest that experimenters’ expectancies may be significant determinants of the results of their research employing animal subjects. The overall combined probability that the results of the two experiments could have arisen by chance was .0007. The conditions of the second experiment, particularly, suggest that the mediation of this expectancy biasing phenomenon may be extremely subtle. It appears unlikely that nonsubtle differences in the treatment and handling of the animals would have gone unnoticed and uncorrected by the various laboratory instructors whose task it was to supervise the learning of the experimenters via the learning of the subjects.

The question occurs, however, whether the laboratory instructors might not have been biased observers. That is a possibility, but it will be recalled that the three instructors seemed to have different biases or orientations toward the experiment; yet in each one’s laboratory, the results were quite comparable. In addition, the teaching function of the laboratory instructors was such as to diminish the effects of their students’ expectancies. That was because they tended to give more help and advice to the experimenters whose animals were performing more poorly, a fact that would tend, of course, to offset the treatment effects.

What can be said specifically about the several operant learning experiments showing greater or lesser expectancy bias? Are certain types of tasks that rats may be called upon to perform more susceptible to the biasing effects of the experimenter’s hypothesis? It seems doubtful that the data can answer this question. We may feel most confident in the experimenters’ tendency to obtain biased data on stimulus discrimination and generalization type experiments.
However, it might prove most useful, for the present at least, to regard the median obtained p level of .13 as our best estimate of the median p level to be obtained, with similar sample sizes, if we were to continue sampling the population of operant learning experiments. Taking this view, our more extreme p levels, those closer to zero and those closer to one, would be regarded as sampling fluctuations. Later in Part II some evidence will be presented that suggests that experimenters’ descriptions of their own behavior during the experiment are borne out rather well by their subjects’ descriptions of that behavior. In these studies employing animal subjects there was no independent check from the subjects as to how they were treated by their experimenters. But as a source of possible interpretations of the
Book Two – Experimenter Effects in Behavioral Research
results obtained, it can do no harm to assume the veridicality of the experimenters’ self-descriptions. Shall we, then, regard the experimenters’ behavior toward their subjects as antecedents or as consequents of the subjects’ performance? Perhaps it makes most sense to regard experimenters’ behavior as both. Thus, initially, those experimenters expecting their animals to perform poorly treated them in some subtle fashion such as to produce dull behavior, whereas those experimenters expecting bright performance treated their rats accordingly. Those initial differences in the treatments accorded the animals might have led to different performances by subjects which could, in turn, reinforce experimenters’ expectations about their animals and maintain the subtle differences in the treatment of the ‘‘bright’’ and ‘‘dull’’ rats. The specific cues by which an experimenter communicates his expectancy to his animal subjects probably vary with the type of animal, the type of experiment, and perhaps even the type of experimenter. With Clever Hans as subject, the cues were primarily visual, but auditory cues were also helpful. That seemed also to be true when the subjects were dogs rather than one unusual horse. The experiments were carried out by H. M. Johnson (1913), who knew of Pfungst’s work with Clever Hans. Johnson believed that the alleged auditory discriminations shown by dogs were due to the experimenter’s unintentional communication to the animals of the expectancy that such discrimination was possible. Just as Hans’ questioners betrayed their expectancy of the horse’s ability to answer questions, so did experimenters betray to their canine subjects how they should respond to confirm the experimenter’s expectancy. The specific cues, Johnson felt, were the experimenter’s posture, respiration, and the pattern of strain and relaxation of the muscles of the head and body. 
Just as in the case of Hans’ questioners, such cues were obviously of an unintended, involuntary nature. As a control for the Clever Hans phenomenon, Johnson conducted the standard series of experiments on discrimination, but with the modification that the dogs could not see him at all. To control auditory cues at least partially, Johnson suggested that the experimenters not watch the dogs’ responses so that they could not respond differentially and involuntarily as a function of whether the dog’s response was the expected one. When all the appropriate controls had been employed, Johnson found that dogs could no more perform the discriminations with which they had been credited than Hans could solve problems of calculus. When the subjects are rats instead of horses or dogs, the unintended cues from the experimenter might also be visual or auditory, but they could also be olfactory for all the little that is known of the matter at the present time. The best hypothesis to account for the results of the two experiments described in this chapter is probably that the quality and quantity of handling communicated the experimenters’ hypotheses. In the study of operant learning, closer observation of the rats’ response could have led to more clever teaching of the animals believed to be brighter. But that explanation would not do for the maze learning experiment. Handling differences seem the best explanation for both experiments. Animals believed to be brighter and more pleasant may well be handled more ‘‘pleasantly,’’ and less fearfully, more gently and more often. Such handling could alter the animals’ behavior and lead to still greater changes in handling patterns. Christie (1951) has told that he and others have been able to postdict which experimenter had handled an animal by observing the rat’s behavior while in a maze or while being picked up. Support for these and
Animal Subjects
similar informal observations is available from more formally collected data which show that rats that are handled more learn better (Bernstein, 1952; Bernstein, 1957). From the experiments described in this chapter, we cannot be certain of the role of handling patterns as the mediators of the experimenters’ expectancies, nor of whether such other channels as the visual, olfactory, and auditory were involved. Experiments are needed, and could be performed, that would clear up the matter. Earlier when the discussion was of observer errors, an experiment by Cordaro and Ison (1963) was described. That experiment employed the same paradigm as that described in this chapter, but this time the subjects were flatworms (planaria). Experimenters obtained responses from their planaria which were dramatically in the direction of their expectations. The results, very reasonably, were interpreted as due to biased observations of the worms’ responses. It cannot be ruled out as an alternative interpretation, however, that the subjects’ responses might have been affected by the experimenters’ expectations. Visual, olfactory, auditory, and tactual cues do not seem likely candidates as the channels of unintended influence of an experimenter on his worm subjects. But perhaps changes in respiration of the experimenter affected the turbulence of the water medium in which the planaria swam and influenced them to respond differentially. That the experimenter’s respiration may be affected by his expectation was pointed out, of course, by Johnson, though for his dogs such changes meant visual cues, rather than mechanical stimulation. What seems to be needed in the area of research with planaria is an experiment suggested by Wernicke and described by Moll (1910) (for use with human subjects) in which a glass partition is placed between the experimenter and, in this case, his worm, to see whether this reduces the amount of ‘‘observer’’ error. 
If it does, it may well mean that the behavior of planaria can, like that of horses, dogs, and rats, be affected by the unintended communication of the experimenter’s expectancy. This chapter may be concluded by recalling a clinical and clever observation by Bertrand Russell, who, however, was referring more to the effects of the programmed experimental procedures than to the unprogrammed effects of the experimenter’s expectancy (1927, pp. 29–30).

The manner in which animals learn has been much studied in recent years, with a great deal of patient observation and experiment. Certain results have been obtained as regards the kinds of problems that have been investigated, but on general principles there is still much controversy. One may say broadly that all the animals that have been carefully observed have behaved so as to confirm the philosophy in which the observer believed before his observations began. Nay, more, they have all displayed the national characteristics of the observer. Animals studied by Americans rush about frantically, with an incredible display of hustle and pep, and at last achieve the desired result by chance. Animals observed by Germans sit still and think, and at last evolve the solution out of their inner consciousness.
11 Subject Set
‘‘Did I do right?’’ That was the question in the mind of one of the subjects. She had rated the standard 10 photos and, along with other subjects, had been asked to tell her feelings about the experiment. She did not mean by this question that she worried whether she earned a good score on the ‘‘empathy test,’’ though that might have been part of it. She was more worried whether she had performed properly her role as ‘‘Subject of a Psychological Experiment.’’ Another subject verbalized it, ‘‘I was wondering if I was doing the experiment the way it should be done.’’ That subjects in psychological experiments think and worry about such matters has been pointed out increasingly in the last few years. ‘‘Part of the experimental task,’’ says Joan Criswell, ‘‘relates to performing adequately as an experimental subject’’ (1958, p. 104). In this chapter the discussion will be of various subject sets as they complicate the effects of the experimenter’s expectancy. Martin Orne (1962) has shown the lengths to which subjects will go to give adequate performances. For some years he has been trying to find experimental tasks so tedious, dull, or meaningless that experimental subjects would refuse to do them or would soon discontinue them. No such tasks have been found. Subjects want to be good subjects; they don’t want to waste their own time or the experimenter’s. For Orne, being a good subject means ultimately that the subject wants to validate the experimental hypothesis. Such a motive on the part of the subject would help answer a question that may have occurred to some readers of this book. Granted that experimenters can communicate their expectations to their subjects, why do subjects act so as to confirm these expectations? Orne’s answer seems to do partially for this question, but there are others. 
There seem to be too many subject pools where subjects seem too indifferent, or too disturbed, or too distracted, or too giggly for these scientific motivations to be always the primary ones. Orne lists motives for subjects’ participating in experiments other than the advancement of science. These include the fulfilling of course requirements, pay, or the hope of improving personal adjustment. Criswell (1958) lists these motives and adds curiosity about research, boredom, less pleasant alternative pursuits, and a need to ingratiate oneself with the experimenter. This last motive also helps explain why subjects who correctly ‘‘read’’ the experimenter’s unintentionally communicated expectancy generally
go along with it rather than choosing to disregard it or to defy it. (Jones [1965] has pointed out that successful ingratiation requires some subtlety rather than simple compliance. Simple compliance leads to a relative loss of esteem. In the experimenter-subject interaction, the subject’s going along can hardly be called simple compliance, since the ‘‘requests’’ for certain responses are unintended or covert.) Going along with the experimenter’s covert request satisfies the ingratiation motive subtly while at the same time satisfying the motive to make a useful contribution by confirming the experimenter’s hypothesis, as Orne suggested. Other workers have stressed the social nature of the psychological experiment (e.g., Bakan, 1953; Friedman, 1964; Mills, 1962; Tuddenham, 1960). One of the most important and one of the most systematic analyses of the social nature of the psychological experiment was that by Riecken (1962). After describing the features characteristic of experiments, Riecken notes three aims of the subject. The first of these is the attainment of those rewards he feels his due from having accepted the invitation to participate. These rewards may include course credit, money, and psychological insight. The second aim of the subject is to ‘‘penetrate the experimenter’s inscrutability and discover the rationale of the experiment.’’ The third aim of the subject, for which the second aim is instrumental, is to ‘‘represent himself in a favorable light’’ or ‘‘put his best foot forward.’’ This third aim of the experimental subject is also discussed in detail by Rosenberg (1965), who has shown the systematic effects that ‘‘evaluation apprehension’’ may have on the outcome of the experiment. The task the experimenter formally sets for the subject is only one problem the subject must solve. 
Riecken called attention also to the subject’s ‘‘deutero-problem,’’ the problem of ‘‘doping out the experiment’’ so his performance can be an appropriate one, and one that will lead to favorable evaluation. The solution to the subject’s deutero-problem comes from his preconceptions of psychological research, from the formally programmed procedures of the experiment, from the physical scene, and from the unprogrammed ‘‘procedures’’ such as the experimenter’s unintended communications. A good example of the possible effects of the physical scene on the subject’s solution of his deutero-problem comes from a comment by Veroff (1960). In reviewing a research program on affiliative behavior, he wondered whether the amount of such behavior might not be affected by the sign which read ‘‘Laboratory for Research in Social Relations.’’ Such scene effects are likely to be constant for all conditions of an experiment, so that the inferences drawn about experimental effects need not be affected. Nevertheless, it would be interesting to change the signs from time to time to see to what extent affiliative behavior occurs in ‘‘The Laboratory for the Study of Social Conformity.’’ Such cues to the solution of the deutero-problem have been called the ‘‘demand characteristics’’ of the experimental situation by Orne (1962). He has shown in his research program that in a variety of experiments, subjects perform as they believe they are expected to perform. Thus if subjects believe that hypnosis implies catalepsy of the dominant hand, they show such catalepsy when hypnotized. If they are not led to believe such catalepsy to be part of hypnosis, they do not show it when hypnotized (Orne, 1959).
Experimenter Expectancy and Subject Set

One purpose of the experiment to be described now was to learn of the effects of demand characteristics operating independently of but jointly with experimenters’ expectancies. A second purpose was to test the generality of the effects of experimenters’ expectancies. The studies reported so far have established the occurrence of the phenomenon in the research domains of human perception and animal learning. Here the occurrence of the phenomenon is examined in a different but equally lively area of research. Such an area is that of verbal conditioning (Krasner, 1958), and one of its more hotly debated aspects is the question of the role of awareness in successful verbal conditioning (Dulany & O’Connell, 1963; Eriksen, 1960; Eriksen, 1962; Spielberger & DeNike, 1966). It should be emphasized that the purpose of this study was not to answer the question of whether such learning without awareness can occur. The purpose, rather, was to learn whether studies of the role of awareness in learning might be affected by the phenomena of experimenter expectancy effects and the effects of demand characteristics or subjects’ set.

One provocative hint as to the possible role of demand characteristics in determining rates of awareness in studies of verbal conditioning is provided by Krasner (1958). He told of a subject who, during the course of a verbal conditioning experiment, spontaneously verbalized the correct contingency. On the subsequent inquiry for awareness, however, this same subject gave perfectly ‘‘unaware’’ responses. This interesting occurrence might be accounted for by the subject’s perception of the demand characteristics of the situation as being, ‘‘You ought not to be aware of the contingency if you wish to regard yourself as a ‘good subject’.’’

The experiment has been described elsewhere (Rosenthal, Persinger, Vikan-Kline, & Fode, 1963b).
Briefly, there were 18 graduate students to serve as experimenters, all enrolled in a graduate course in educational psychology, and all but one were males. There were 65 subjects, 57 of them females, most of whom were freshmen or sophomores. Each experimenter presented to each of his subjects individually the standardized series of 20 photos of faces described earlier. Subjects were asked, as before, to rate each photo on the apparent success or failure that the person pictured had been experiencing. This time, however, experimenters were instructed to reinforce all positive ratings made by their subjects. Each experimenter was individually trained by an investigator, who did not know to which treatment condition the experimenter would be assigned. The exact instructions to experimenters and the instructions they were to read to their subjects were mimeographed and given to each experimenter at the conclusion of his training session. Instructions to Experimenters. You have been asked to participate in a research project studying the phenomena of conditioning. The reason for your participation in this project is to standardize results of experiments dealing with conditioning. There is the problem in psychological research of different examiners getting somewhat different data on the same tests as a function of individual differences. Therefore, to standardize the tests it is better methodological procedure to use groups of experimenters.
You will now be asked to run a series of subjects and obtain from each ratings of photographs. The experimental procedure is as follows: After recording the data from each subject at the top of the recording sheet, and reading the instructions to the subject, you are ready to begin. Take photo #1 and say: ‘‘This is photo #1,’’ and hold it in front of the subject until he tells you his rating, which you will write down on the recording sheet. Continue this procedure through the 20 photos. Do not let any subject see any photo for longer than 5 seconds. After each subject, total the ratings of the 20 photos and find the average (mean).

Previous research in verbal conditioning has shown that subjects may be conditioned to give a certain number by verbal reinforcement. In this study we want you to say ‘‘good’’ after every plus rating up to the number five (+5), and ‘‘excellent’’ after every rating of plus five and over. Do not say anything for minus ratings. As you would suspect, you should shortly be receiving very high ratings from your subjects, about a +5 or higher.

The Marlowe-Crowne Social Desirability scores of your subjects are such that they, on a postexperimental interview, will (not)¹ very likely be aware of having been conditioned. That is to say, this test (the Marlowe-Crowne) is able to predict beforehand which subjects will be aware of having been conditioned. The present study is designed to verify the reliability of the Marlowe-Crowne.

Just read the instructions to the subjects. Say nothing else to them except hello and goodbye and ‘‘excellent’’ and/or ‘‘good.’’ If for any reason you should say anything to a subject other than that which is written in your instructions, please write down the exact words you used and the situation which forced you to say them. Good Luck!

Instructions to Subjects. I am going to read you some instructions.
I am not permitted to say anything which is not in the instructions, nor can I answer any questions about this experiment. OK?

Immediately after this experiment is over, you will be asked the purpose of this experiment, that is, what is really going on. See how perceptive you can be in determining the true intent of this experiment.²

Now I will show you a series of photographs. For each one I want you to judge whether the person pictured has been experiencing success or failure. To help you make more exact judgments you are to use this rating scale. As you can see, the scale runs from −10 to +10. A rating of −10 means that you judge the person to have experienced extreme failure. A rating of +10 means that you judge the person to have experienced extreme success. A rating of −1 means that you judge the person to have experienced mild failure, while a rating of +1 means that you judge the person to have experienced mild success. You are to rate each photo as accurately as you can. Just tell me the rating you assign to each photo. All ready? Here is the first photo. (No further explanation may be given, although all or part of the instructions may be repeated.)

The instructions given the experimenters carried one of the treatment dimensions. Half were led to expect that their subjects would subsequently be aware of having been conditioned, and half were led to believe that their subjects would not be aware.

¹ This word was inserted in the instructions to half of the experimenters.
² For half of the subjects the sentence, ‘‘We are in the process of standardizing a test,’’ was substituted for this paragraph.
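The reinforcement rule given in the instructions to experimenters can be written out as a short sketch. It is only an illustration: the function name and the sample ratings are hypothetical, and the handling of a rating of exactly zero (no response) is an assumption, since the instructions mention only plus and minus ratings.

```python
from typing import Optional


def reinforcement(rating: int) -> Optional[str]:
    """Return the word the experimenter says after a photo rating, or None."""
    if not -10 <= rating <= 10:
        raise ValueError("ratings run from -10 to +10")
    if rating >= 5:
        return "excellent"  # plus five and over
    if rating > 0:
        return "good"       # plus ratings below +5
    return None             # minus ratings (and, by assumption, zero)


# Hypothetical ratings from one subject (the study used 20 photos).
ratings = [-3, 2, 5, 8, -1]
said = [reinforcement(r) for r in ratings]

# Experimenters were also asked to record the mean rating per subject.
mean_rating = sum(ratings) / len(ratings)
```

Treating +5 itself as earning ‘‘excellent’’ follows the wording ‘‘plus five and over’’ in the instructions.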
The instructions read to subjects by their experimenter were designed so that half the subjects would view it as a ‘‘good subject’’ performance to be aware of having been conditioned, and half the subjects would not be given such a set.

After the subjects had been contacted by their experimenter they were given two questionnaires to be filled out in succession. The first questionnaire was the one used by Matarazzo, Saslow, and Pareis (1960). It asked simply two questions: (1) ‘‘The purpose of this experiment was:’’ and (2) ‘‘My evidence for this is:’’. About a half page of space was provided for the answers to each of these questions.

Orne (1962) has suggested that subjects agree to a ‘‘pact of ignorance’’ to not ‘‘see through’’ the ostensible purposes of the experiment. The experience of the research program under discussion supports this view. For this reason, it was felt that a very vague, general inquiry for subjects’ awareness, with no incentives offered for reports of awareness, would be favorable to low rates of awareness among the subjects. An inquiry that offers many suggestions to subjects about what may have been going on in the experiment was felt to create a set favorable to seeing through the experiment. This is probably due not so much to the cues provided by the questions themselves, though that is a factor. More important, probably, is the general set given the subject that it must be acceptable role performance to have a lot of hunches and suspicions if the investigators themselves expect it enough to print up forms with questions that hint at seeing through. The second questionnaire was patterned after Levin’s (1961) and was chosen to elicit higher rates of awareness. It is as follows:

The Second Questionnaire.
1. Did you usually give the first number which came to your mind?
2. How did you go about deciding which of the numbers to use?
3. Did you think you were using some of the numbers more often than others? Which numbers? Why?
4. What did you think the purpose of this experiment was?
5. What did you think about while going through the photos?
6. While going through the photos did you think that you were supposed to rate them in any particular way?
7. Did you get the feeling that you were supposed to change the ratings of the photos as you were giving them?
8. Were you aware of anything else that went on while you were going through the photos?
9. Were you aware of anything about the experimenter?
10. Were you aware that the experimenter said anything? If so, what?

Note: Answer the following questions only if you were aware of anything said by the experimenter.
11. What did the saying of the word or words by the experimenter mean to you?
12. Did you try to figure out what made the experimenter say anything or why or when he did?
13. How hard would you say that you tried to figure out what was making the experimenter say the word or words?
    very hard    fairly hard    not hard at all
14. What ideas did you have about what was making the experimenter say the word or words?
15. While going through the photos did you think that what the experimenter said had anything to do with the way that you rated the photos? How?
All awareness-testing questionnaires were scored blindly, independently, and without pretraining by two psychologists according to the criteria set forth below as modified for this study from the criteria employed by Matarazzo, Saslow, and Pareis (1960). The scoring weights were constructed by asking judges to arrange the five criteria, typed on small cards, along a yardstick with the distances between cards to represent differences in degree of awareness. Five faculty members and two graduate students involved in dissertation research served as judges and the following scores represent the median yardstick points assigned to each criterion (divided by three).

Criteria for Scoring Awareness

Score    Criterion
0        S had no idea of the purpose of the experiment, or had a completely wrong hypothesis, or made absolutely no mention of E’s reinforcing verbalizations.
3        S mentioned the reinforcement, but did not connect it with a specific class of ratings; or S brought up the possibility of certain ratings being reinforced (including the correct response class), but in conjunction with other incorrect hypotheses.
6        S stated that some specific class of ratings was being reinforced, but named the wrong response class; or he stated the correct response class but did so along with an incorrect hypothesis.
10       S correctly stated the specific response class being reinforced, and did not state an incorrect hypothesis in addition.
12       S correctly stated the specific response class being reinforced, stated no incorrect hypothesis, and correctly differentiated the use of ‘‘good’’ from the use of ‘‘excellent’’ as reinforcers.
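The construction of the scoring weights described above (median yardstick point per criterion, divided by three) can be sketched as follows. The judges’ placements are invented purely for illustration and are not the study’s data; they assume positions read in inches along a 36-inch yardstick, and were chosen so that the medians reproduce the published weights.

```python
from statistics import median

# Hypothetical placements, one row per judge (seven judges), one column
# per criterion card. These numbers are invented, not taken from the study.
placements = [
    [0, 8, 17, 30, 36],
    [0, 9, 18, 29, 36],
    [0, 10, 18, 30, 36],
    [0, 9, 19, 31, 36],
    [0, 9, 18, 30, 36],
    [0, 7, 20, 30, 35],
    [0, 11, 16, 32, 36],
]


def scale_weights(rows):
    """Median placement of each card across judges, divided by three."""
    return [median(column) / 3 for column in zip(*rows)]


weights = scale_weights(placements)
```

Using the median rather than the mean keeps a single judge’s extreme placement from shifting a weight, which may be one reason a median was preferred with only seven judges.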
Reliability of scoring was nearly perfect for questionnaire number one (r = .99), and adequate for questionnaire number two (r = .87). Both correlations were based on an N of 63 subjects who completed the questionnaires. The two subjects who did not complete the questionnaires were excused when they began using terms like ‘‘contingency’’ and all but argued the merits of open-ended versus structured interviews to tap awareness! As it developed, both had run a verbal conditioning study of their own the preceding semester as part of an undergraduate research program. As far as could be determined, there were no more than these two sophisticates. In all cases where the two scorers’ ratings of a questionnaire differed, the mean of the two ratings was the final score assigned that questionnaire.

Questionnaire effects. Questionnaire number two (Q2) evoked much higher awareness than did questionnaire number one (Q1). On Q1, 70 percent of all subjects earned awareness scores of zero while on Q2 only 2 percent earned zero scores. On Q1, 19 percent of the average experimenter’s subjects earned awareness scores of 8 or higher, whereas on Q2, 56 percent earned such scores. Of all subjects, 78 percent earned a different awareness score on Q2 than they had on Q1, and 100 percent of these scored more aware on Q2 (binomial p less than .00001).
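The binomial probability quoted for the Q1 to Q2 shift can be checked with a simple sign test. The sketch below assumes that the 78 percent refers to the 63 subjects who completed both questionnaires (about 49 subjects) and that the test is one-tailed; neither detail is stated in the text, and the function name is hypothetical.

```python
from math import comb


def sign_test_p(n_changed: int, n_higher: int) -> float:
    """One-tailed probability of at least n_higher increases among
    n_changed changed scores if increases and decreases were equally likely."""
    favorable = sum(comb(n_changed, k) for k in range(n_higher, n_changed + 1))
    return favorable / 2 ** n_changed


n_subjects = 63                        # assumed base for the 78 percent
n_changed = round(0.78 * n_subjects)   # about 49 subjects changed score
p = sign_test_p(n_changed, n_changed)  # every changed score was higher on Q2
```

Under these assumptions the probability falls many orders of magnitude below the reported bound of .00001, so the conclusion does not depend on the exact count of subjects who changed.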
The correlation (rho) between experimenters’ obtained rate of awareness on Q1 and Q2 was +.58 (one-tail p = .01, df = 16). The correlation (phi) between subjects’ Q1 and Q2 awareness scores was +.50 (p < .0001). The bimodal distribution of awareness scores, especially on Q1 where the distribution was markedly discontinuous and asymmetrical, suggested that for practical purposes the awareness scale was a two-point rather than a five-point scale. Accordingly, in all subsequent analyses a subject was defined as aware if his awareness score was 8 or higher. This number was chosen because it represented the lower limit of the upper distribution of scores on Q1.

Treatment effects. Tables 11-1 and 11-2 show the percentage of the average experimenter’s subjects within each experimental condition who were judged aware by Q1 and Q2. Analysis of the data from both questionnaires showed that experimenters expecting their subjects to be aware obtained higher rates of awareness than did experimenters expecting their subjects to be unaware (p = .07, one-tail). Subjects given the set to see through the experiment tended to be aware more often than the remaining subjects, though this difference was not reliable statistically. For both questionnaires, however, the most statistically significant differences were found to exist between the conditions where experimenter expectancy and subject set were both favorable to increased awareness and the conditions where both were unfavorable to awareness (p = .02, one-tail, t = 2.46). What we do not know, but what must be learned, is how an experimenter who expects awareness from his subjects treats them compared to the way an experimenter who does not expect awareness treats his subjects. Perhaps experimenters expecting their subjects to be aware are less subtle in their reinforcement of subjects’ responses.
Perhaps, too, they convey a conspiratorial impression of ‘‘we both know what’s going on here.’’ Such an attitude may legitimate the subject’s subsequent verbalization of what he knows.

Table 11–1 Percentage of Subjects Judged Aware: Questionnaire 1

                     Experimenter’s expectancy
Subject’s set    Awareness    Nonawareness    Means
Awareness           31            13            23
Nonawareness        24            10            14
Means               28            11            19

Table 11–2 Percentage of Subjects Judged Aware: Questionnaire 2

                     Experimenter’s expectancy
Subject’s set    Awareness    Nonawareness    Means
Awareness           78            40            62
Nonawareness        55            47            50
Means               68            43            56

In order to determine the effect of experimenters’ reinforcement of higher ratings, it was necessary to employ an additional control group of experimenters which, like the experimental groups, expected high photo ratings but, unlike the experimental groups, was not programmed to reinforce them. Four additional experimenters, drawn from the same class as the other experimenters, were accordingly led simply to expect high photo ratings. These experimenters contacted a total of 26 subjects drawn from the same class from which the other subjects had been drawn. These subjects were given no set to ‘‘see through’’ the experiment. Table 11-3 shows the mean photo ratings obtained by experimenters in the four experimental groups. The mean photo rating obtained by experimenters of the control group was +.49. Those subjects given a set to ‘‘see through’’ the experiment gave photo ratings that were no different from those given by the control group subjects. They were, however, significantly lower than those given by reinforced subjects with no set to ‘‘see through’’ the experiment (p = .02, two-tail).

Table 11–3 Mean Photo Ratings

                     Experimenter’s expectancy
Subject’s set    Awareness    Nonawareness    Means
Awareness           .58           .63           .60
Nonawareness       1.24          1.20          1.22
Means               .91           .92           .91

All experimental groups and the control group showed approximately linear curves of acquisition of positive ratings, all having about the same slope (35 degrees). Degree of awareness as judged from Q2 was found to be unrelated to photo ratings given. These results suggest that whether or not a subject may be judged aware of having been verbally reinforced for a certain response, he is significantly less likely to make the desired responses when he has been given a set to ‘‘see through’’ an experimental procedure. The set to ‘‘see through’’ may well carry an implication of ‘‘don’t go along.’’ The question of whether any of the groups showed verbal conditioning must remain unanswered. All groups showed similar acquisition curves, but this may have been due, in part, to a photo order effect. An earlier study showed that experimenters biased to expect negative data also obtained linear acquisition curves of positive ratings (Rosenthal, Fode, Vikan-Kline, & Persinger, 1964).
The Phenomenology of the Subject

Riecken stated that one of the experimental subject’s major aims in the experiment is to ‘‘penetrate the experimenter’s inscrutability and discover the rationale of the experiment’’ (1962). Even a cursory reading of the 60 questionnaires (Q1 and Q2 combined) confirms this hypothesis. After more careful study, each of the 60 questionnaires was assigned to one of the following three categories: (1) Subject stated that the purpose of the experiment was to standardize a test, and there was no indication of any other suspicion on his part. (2) Subject stated or implied that the purpose was to assess the role of reinforcement in changing responses with or without mentioning standardization of the ‘‘empathy’’ test. (3) Subject stated or implied that the purpose was really in some way to assess the subject other than by (or in addition to) simply trying to reinforce certain responses. Only 17 percent of the questionnaires fell into group 1, whereas 42 percent fell into each of the other two
Book Two – Experimenter Effects in Behavioral Research
categories. Thus two out of five subjects held hypotheses about the rationale of the experiment which had little to do with this particular study but which sampled fairly well the kind of thing that contemporary research psychologists might very well be after. Several subjects thought they were being assessed for degree of conformity or resistance to conformity. Several hypothesized that we were measuring their degree of racial prejudice based on the fact that one of the photos was of a Negro. Perhaps the most common hunch was that the pictures were really ‘‘projective’’ devices designed to tap everything from self-concepts to optimism or pessimism. This finding lends support to Riecken’s formulation that subjects tend to see the psychologist-researcher as a ‘‘poker’’ and a ‘‘pryer’’ into one’s inner recesses (1962). All in all, subjects were quite actively engaged in formulating hypotheses, some of which would imply a fairly high level of sophistication. Whereas on Q2 most subjects were aware, a number made specific reference to their handling of this information. One subject, who had not been rated as aware, said she was sure her experimenter was trying to get her to change her ratings, but she knew that since this was a rigorous standardization situation, this could not possibly be the case. One clearly aware subject manifested unmistakable signs of guilt over ruining the study through her being aware, while several mentioned the contingency and their decision to go along or not go along. Perhaps the most striking illustration of the complexity of the problem of our penetrating the subject’s inscrutability came from a questionnaire which in reply to Q1 stated: ‘‘The purpose of this experiment was to standardize a test. They used a standardized rating scale and the person administrating [sic] the test was not allowed to say anything. This kept his influence out. 
The fact that we all make our judgment on the pictures would give returns which would show the standardization of them.’’ Not five seconds later this same subject on Q2 replied to the same question about the purpose of the experiment: ‘‘To see how much the words given by the tester influenced me.’’ In this study the experimenter-subject interactions lasted as little as five minutes. This seems to be an unusually short period for the development of ‘‘transference’’ reactions toward an experimenter. Nevertheless, one of our subjects made very clear reference to the sexual implications of being alone with an experimenter in a small room and described some of her experimenter’s characteristics in inordinate detail. Even if such responses were quite rare, they would nevertheless serve to remind us that subjects are far from being the automated data production units that Riecken (1962) has suggested is a frequent current view of the subject. But such ‘‘transference’’ reactions are not all that rare. In the research program under discussion one or two such responses are obtained in every experiment. This is the more remarkable not only for the brevity of the experimenter-subject interaction but also because there is very little getting acquainted possible. The tasks chosen for this research program were intentionally designed to minimize experimenter-subject interactions so that it would be difficult to get the unintentional influencing we might more readily expect in more elaborate experimenter-subject interactions or in clinical interactions. In a study of verbal conditioning employing a more standard sentence-construction task there was further evidence that subjects are sometimes interested in their experimenter as a person rather than simply as an inscrutable scientist-psychologist (Rosenthal, Kohn, Greenfield, & Carota, 1966). 
In that study 20 percent of the subjects made some reference to one or more physical characteristics of their experimenter which were ‘‘irrelevant’’ to the experimenter’s role performance.
Subject Set
References were made to the experimenter’s posture, clothing, facial blemishes, eyeglasses, dental condition, and relative attractiveness. When the research is somewhat more clinical we expect more of the ‘‘transference’’ reactions to the experimenter, and we get them. Klein (1956) has discussed this problem, and Whitman (1963) gives a more recent illustration. The research was in the area of dreaming responses. There were 10 volunteer subjects who were wakened and asked to report their dreams whenever their eye movements suggested to the monitors that dreaming might be going on. About one third of the dreams dealt overtly with the experimental situation, about one third dealt covertly with the experimental situation, and about one third did not appear to deal at all with the experimental situation. There was the predictable evaluation apprehension Rosenberg has described (1965) and a great variety of emotionally significant reactions to the experimenter. Female subjects tended to view him as more seductive, males as more sadistic. Some dreams found him incompetent; some found him potentially therapeutic but not helpfully motivated; some found him a cold, exploiting scientist, or even quite unscrupulous. This more clinical research may evoke stronger reactions to the experimenter, but from the evidence presented earlier we must conclude that such reactions may also occur in less psychodynamically loaded interactions, though perhaps to a lesser degree.
Some Recapitulation and Discussion
The data suggest that in studies of verbal conditioning concerned with the role of awareness, the experimenter’s expectancy of awareness can significantly affect the rates of awareness he will obtain. In addition, subjects’ perceived demand characteristics of the experimental situation appear to play a role in the determination of subsequent awareness rates, although to a smaller and less reliable degree. Furthermore, the form of inquiry for awareness makes a significant difference in obtained awareness rates. However, when an experimenter expects more awareness from his subjects at the same time that his subjects have a positive sanction to see through the experiment, he tends to obtain higher rates of awareness from his subjects than when the converse conditions are true, regardless of the form of inquiry employed in this study. Although none of the obtained p levels was striking, the magnitude of the effects was. Thus on Q1, three times as many subjects of one condition were aware as were aware in the opposite treatment condition. This magnitude of effect may have been even greater had it not been for an unplanned-for difficulty in the experimental procedure. Many of the subjects in the standard set condition were able to see some earlier contacted subjects filling out the questionnaires. It seems likely that this might have established demand characteristics among the standard set subjects favorable to greater attentiveness and seeing through, sets reasonably to be expected from the subjects’ prospect of having to answer questions about the experimental procedure. Krasner (1958), in his review of 31 articles on verbal conditioning, found that over one half of them reported no awareness at all on the part of any subject. In all these studies, only 5 percent of all subjects were reported to be aware. In the present study, even on the crude Q1 19 percent of the subjects were aware, and on Q2 about 56 percent were aware.
The reason for this discrepancy is not entirely clear.
The possibility that all our subjects had perceived the demand characteristics as generally favorable to awareness cannot be ruled out. Communication among subjects was possible and likely, but this is probably true for most verbal conditioning studies (and other studies as well, though it is rare to find the problem taken seriously). In addition, it must be admitted that the author’s expectation was for a high rate of awareness; and although it was possible to remain blind for membership in treatment conditions, subjects did have to be contacted by members of the research group, however briefly, so that they could be directed to the proper laboratory rooms and, later, have the questionnaires administered. In another verbal conditioning experiment, the one employing a sentence construction task, the rates of awareness obtained were somewhat lower (Rosenthal, Kohn, Greenfield, & Carota, 1966). In that study 17 percent of the subjects were clearly aware, 8 percent were somewhat aware, and 75 percent were clearly unaware. In this experiment, as in the one described in detail, several aware subjects noted their difficulty in trying to decide whether to go along with the experimenter’s attempt to influence their response. Subjects can sometimes verbalize their deutero-problem very well. We cannot tell from answers to an awareness questionnaire whether the subject, if ‘‘aware,’’ was aware during the experiment. We cannot tell whether the subject, if ‘‘unaware,’’ is responding as unaware because he thinks that is the proper thing to do. There is a kind of subject especially interested in saying and doing the proper thing, and that is the subject who scores high on the Marlowe-Crowne Scale of Social Desirability (Crowne & Marlowe, 1964). In the sentence construction experiment, subjects scoring high on this scale gave significantly fewer aware responses (r = −.30, p = .02), and so did more anxious subjects (r = −.22, p = .10).
These more anxious subjects and those higher in the need for approval may well have viewed the good subject role as that which permits no ‘‘seeing through.’’ That they show their desire to please the experimenter in other ways is suggested by the fact that subjects higher in need for approval arrive earlier at the site of the experiment (r = +.40, p = .003). The results of the qualitative analyses of the awareness questionnaires of this second study of verbal conditioning yielded much the same sort of information as the first. Subjects often did not believe the formal explanations offered them. Riecken (1962) was right to call for a more systematic investigation of subjects’ perceptions of experimental situations. The kinds of hunches subjects had about the true purposes of the two experiments, although frequently wrong for these particular studies, were uncomfortably accurate in assessing the kinds of research in which contemporary psychologists were likely to ask them to participate. The day of the naive sophomore may rapidly be drawing to a close.
Conflicting Expectations
In an experiment by C. R. White (1962), the expectations of the experimenters were varied as in other studies, but subjects were given varying expectations about the stimuli they would be asked to judge. His experimenters were 18 graduate students in counseling and guidance, and his subjects were 108 undergraduates enrolled in educational psychology.
In earlier studies in this research program there had always been only two expectancies. Thus, when the task was that of judging the success or failure of persons pictured in photos, half the experimenters were led to expect success ratings and half were led to expect failure ratings. White, however, employed six different expectancies, to each of which three experimenters were assigned at random. These six expectancies were not single numbers but rather a set of overlapping ranges of expectancies, the means of which were −6, −3, −0.5, +0.5, +3, and +6. Within each of these conditions of experimenter expectancy, subjects were divided into six groups, each with one of the expectancies analogous to those induced in the experimenters. These expectancies were induced by telling subjects that the particular photos they would be shown had been found earlier to evoke ratings of about −6 (or −3, −0.5, +0.5, +3, +6). The six conditions of experimenter expectancy, each with six conditions of subject expectancy, were analyzed by means of a 6 × 6 analysis of variance design. Results of the analysis revealed a significant interaction effect (p = .001) between experimenters’ and subjects’ expectancies. Table 11-4 shows the mean ratings obtained by experimenters expecting either positive or negative ratings for subjects for whom either positive or negative ratings had been suggested. The data support the interpretation of contrast effects (Helson, 1964; Sherif & Hovland, 1961). Subjects predisposed to rate low, when contacted by experimenters expecting to obtain high ratings, gave the lowest ratings (p = .01), suggesting a kind of opinion entrenchment. Similarly, when subjects were given sets to rate high but were confronted by experimenters expecting low ratings, they gave the highest ratings, though this finding was statistically less significant. White kindly made his data available for this further analysis.
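The pattern in Table 11-4 can be made concrete with a small sketch. It is only an illustration, not White's analysis: it works from the four printed cell means rather than the raw ratings, and it assumes that the cell for negatively set subjects contacted by positively expecting experimenters is −.43, the sign that reproduces the printed marginal means. The variable names are hypothetical.

```python
# Cell means keyed by (subject's expectancy, experimenter's expectancy),
# transcribed from Table 11-4. The minus sign on -0.43 is an assumption:
# it is the value that reproduces the printed marginal means.
cells = {
    ("positive", "positive"): 0.59,
    ("positive", "negative"): 0.81,
    ("negative", "positive"): -0.43,
    ("negative", "negative"): 0.42,
}
levels = ("positive", "negative")

# Marginal means: average each factor over the levels of the other.
subject_means = {s: sum(cells[(s, e)] for e in levels) / 2 for s in levels}
experimenter_means = {e: sum(cells[(s, e)] for s in levels) / 2 for e in levels}

# Interaction contrast: cells where the two expectancies agree
# minus cells where they disagree.
interaction = (
    cells[("positive", "positive")] + cells[("negative", "negative")]
    - cells[("positive", "negative")] - cells[("negative", "positive")]
)

# Cells with the lowest and highest mean rating.
lowest = min(cells, key=cells.get)
highest = max(cells, key=cells.get)

print("subject means:", subject_means)
print("experimenter means:", experimenter_means)
print("interaction contrast:", round(interaction, 2))
print("lowest cell:", lowest, "highest cell:", highest)
```

Under these assumptions the lowest cell is the negatively set subjects with positively expecting experimenters and the highest is the positively set subjects with negatively expecting experimenters, which is the ‘‘entrenchment’’ pattern described in the text. The contrast itself says nothing about significance, which depends on the within-cell variability reported by White.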
In addition to the ‘‘entrenchment’’ effect shown by subjects contacted by experimenters with contrasting expectations, another factor was found to contribute to our understanding of the interaction effect. Table 11-5 shows quartiles of subjects arranged in descending order of discrepancy between their own expectancy and their experimenter’s expectancy. The correlation between discrepancy of expectations and mean photo rating was −.96 (p = .05). Regardless of the direction of experimenters’ and subjects’ expectations, subjects tended to rate the photos as more successful when their expectancy and that of their experimenter were in greater accord. It may be that when experimenters and subjects had similar expectancies, their experimental interaction was a smoother, more pleasant experience, with less conflict for the subject over whether to be influenced by the investigator who had induced the subject’s expectancy or to be influenced by the subtle cues of his own experimenter. The lack of conflict may have been reflected in his perceiving others more cordially, that is, as more successful.

Table 11–4 Mean Photo Ratings as a Function of Experimenter and Subject Expectancies

                          Experimenter’s expectancy
Subject’s expectancy      Positive    Negative    Means
Positive                    .59         .81        .70
Negative                   −.43         .42        .00
Means                       .08         .62        .35

Table 11–5 Mean Ratings as a Function of Discrepancy between Experimenters’ and Subjects’ Expectancies

Quartile    Mean discrepancy    Mean rating
1                 9.2               .35
2                 5.3               .43
3                 2.7               .50
4                 0.3               .59

The experiments described in this chapter have served in part to extend the generality of the effects of the experimenter’s expectancy. More particularly they have shown the combined effects of the experimenter’s expectancy and the subject’s set or his perception of the demand characteristics of the experimental situation (Orne, 1962). The examination of such joint effects is only just now being mapped out for inquiry. But from what data there are, both quantitative and qualitative, the conclusion seems warranted that what is in the head of the subject and in the head of the experimenter can unintentionally affect the results of psychological research.
12 Early Data Returns
In the last few chapters the effects of the experimenter’s expectancy on his subjects’ responses have been considered. In the present chapter, to some extent, we reverse the direction of the predictions and consider the effects of the subjects’ responses on the experimenter’s expectancy. Except in the most exploratory of experimental enterprises, the experimenter’s expectancies are likely to be based upon some sort of observed data. These data need not have been formally acquired. They may derive from quite casual observations of behavior made by the experimenter himself or even by another observer. Since some sort of data are the most likely determinants of experimenter expectancies, we may fairly ask: what about the data obtained early in an experiment? What are their effects upon data subsequently obtained within the same experiment? Perhaps early data returns that confirm the experimenter’s hypothesis strengthen the expectancy and thus make it more likely that subsequent data will also be confirmatory. Perhaps early data returns that disconfirm the experimenter’s expectancy lead to a revision of the expectancy in the direction of the disconfirming data obtained, thereby making it more likely that subsequent data will continue to disconfirm the original hypothesis but support the revised hypothesis. That the ‘‘early returns’’ of psychological research studies can have an effect on experimenters’ expectancies was noted and well discussed by Ebbinghaus (1885). After saying that investigators notice the results of their studies as they progress, he stated: ‘‘Consequently it is unavoidable that, after the observation of the numerical results, suppositions should arise as to general principles which are concealed in them and which occasionally give hints as to their presence. 
As the investigations are carried further, these suppositions, as well as those present at the beginning, constitute a complicating factor which probably has a definite influence upon the subsequent results’’ (p. 28). He went on to speak of the pleasure of finding expected data and surprise at obtaining unexpected data, and continued by stating that where ‘‘average values’’ were obtained initially, subsequent data would tend also to be of average value, and where ‘‘especially large or small numbers are expected it would tend to further increase or decrease the values’’ (p. 29). Ebbinghaus was, of course, speaking of himself as both experimenter and subject. Nevertheless, on the basis of his thinking and of the reasoning described earlier, it was decided to test Ebbinghaus’ hypothesis of the effect of early data returns on data
subsequently obtained by experimenters. There was also an interest in learning whether the male experimenters would have a greater biasing effect upon their female subjects than upon their male subjects. This seemed reasonable in view of the general finding in the literature that female subjects are more susceptible to interpersonal influence processes. Finally, there was interest in whether the effects of early data returns would operate uniformly throughout the series of subjects contacted by experimenters or whether earlier- or later-contacted subjects would be more affected.
The Effects of Early Returns
The experiment has been described in more detail elsewhere (Rosenthal, Persinger, Vikan-Kline, & Fode, 1963a). Briefly, there were 12 male graduate students in education to serve as experimenters. The subjects were 55 undergraduates, mostly freshmen and sophomores, enrolled in beginning courses in psychology and education. About half were males and half were females. In this experiment, as in others investigating some factors complicating the effects of the experimenter’s expectancy, it was decided that the same task as that employed in the original studies demonstrating expectancy effects should be used. Since the purpose of these studies was not simply to replicate the basic findings but to learn more about the variables affecting the operation of expectancy effects, the studies were kept comparable with respect to the basic task employed. In the present study, therefore, each experimenter presented to each of his subjects individually the standardized series of 20 photos of faces and asked that the subject rate each photo on the apparent success or failure that the person pictured had been experiencing. Subjects were to use the rating scale described earlier to help them make their judgments. Before contacting his subjects, each experimenter was individually instructed and briefly trained as to the experimental procedure. The exact instructions to experimenters were mimeographed and given to the experimenter when he came in for his training session. These instructions, as well as the instructions each experimenter read to his subjects, were similar to those presented in earlier chapters. It should be noted that the investigator who instructed the experimenters did not know into which treatment group any experimenter would be assigned. The overall design of the experiment was to randomly establish three groups of four experimenters each, all with a bias, expectation, or hypothesis to obtain high positive ratings from their subjects.
For one group of experimenters this bias, expectation, or hypothesis was to be confirmed by their first two pretest (‘‘good data’’) subjects. For a second group this bias, expectation, or hypothesis was to be disconfirmed by their first two subjects (‘‘bad data’’). The third group of experimenters was to serve as a control group (‘‘normal data’’). The first group to be run was the control group of four experimenters, each of whom contacted six naïve subjects. Of those eight subjects who were run as the pretest subjects by the control experimenters, four were selected on the basis of their having free time when the remaining experimenters were to contact their subjects. These four subjects agreed to serve as accomplices and were instructed to give average ratings of +5 to the photo-rating task for the ‘‘good data’’ experimenters and −5 for the ‘‘bad data’’ experimenters. Each of the accomplices then gave ‘‘good data’’ to two experimenters and ‘‘bad data’’ to two experimenters. Accomplices appeared equally often as the first-run and second-run subject. Treatment groups of experimenters were thus defined: the ‘‘good data’’ group experimenters each contacted two accomplices who gave them the expected data, followed by four subjects who were naïve. The ‘‘bad data’’ group experimenters each contacted two accomplices who gave them data opposite to that expected, followed by four subjects who were naïve. The ‘‘normal data,’’ or control group, experimenters each contacted six naïve subjects. The dependent variable was defined as the mean of the photo ratings given by the last four subjects contacted by each experimenter in each condition. All these subjects were, of course, naïve, and it was hypothesized that the experience of having obtained ‘‘good data’’ should lead those experimenters to obtain ‘‘better’’ subsequent data, whereas the experience of having obtained ‘‘bad data’’ should lead those experimenters to obtain ‘‘worse’’ subsequent data in relation to the control group. Except for the accomplices, all subjects were randomly assigned to experimenters with the necessary restriction that they have free time during experimenters’ available free time. One experimenter contacted only three test subjects instead of four.
Treatment effects. Table 12-1 shows the mean rating obtained by each experimenter and each treatment group, from their two pretest subjects and their four test subjects. For the ‘‘good’’ and ‘‘bad’’ data groups the two pretest subjects were, of course, the accomplices. Although none of the pairs of accomplices actually gave mean ratings of either +5 or −5 as they had been instructed, the treatments were considered adequate, since none of the groups’ pretest ratings showed any overlap.
Table 12-2 summarizes the analysis of variance of the means of the mean ratings obtained by each of the experimenters from his four test subjects. The obtained F of 4.91 for linear regression was significant beyond the .03 level (one-tail), the ordering of means having been predicted. The difference between the means of the two experimental groups was significant at the .01 level (one-tail, t = 3.14, df = 6).

Table 12–1 Mean Photo Ratings

Treatment        Experimenter    Pretest subjects    Test subjects
‘‘Good data’’    E1                    3.68               0.51
                 E2                    3.45               1.19
                 E3                    2.03               0.83
                 E4                    1.90               0.68
                 Mean                  2.77               0.80
Control          E5                    1.30               1.24
                 E6                    0.43               0.80
                 E7                    0.18               0.34
                 E8                   −0.70              −0.06
                 Mean                  0.30               0.58
‘‘Bad data’’     E9                   −2.38               0.16
                 E10                  −3.08               0.55
                 E11                  −3.83              −0.09
                 E12                  −4.38               0.10
                 Mean                 −3.42               0.18
Table 12–2 Analysis of Variance of Mean Photo Ratings

Source                  df      MS          F
Early returns            2     (0.4324)    (2.74)
  Linear regression      1      0.7750      4.91
  Deviation              1      0.0898
Error                    9      0.1577
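The linear-regression F and the t test reported with Table 12-2 can be approximately retraced from the per-experimenter test means in Table 12-1. The sketch below is an illustration under stated assumptions rather than the authors' computation: minus signs lost in the printed table are restored so that each column reproduces its printed group mean, the contrast weights (+1, 0, −1) for the predicted ordering ‘‘good’’ > control > ‘‘bad’’ are assumed, and only the linear term and the error term are recomputed. All names are hypothetical.

```python
from math import sqrt
from statistics import mean

# Mean test-subject ratings per experimenter, transcribed from Table 12-1.
# Minus signs are an assumption, restored to match the printed group means.
good = [0.51, 1.19, 0.83, 0.68]      # "good data" experimenters E1-E4
control = [1.24, 0.80, 0.34, -0.06]  # control experimenters E5-E8
bad = [0.16, 0.55, -0.09, 0.10]      # "bad data" experimenters E9-E12


def sum_of_squares(values):
    """Sum of squared deviations of the values from their own mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


groups = [good, control, bad]
n = len(good)  # experimenters per group

# Error term: pooled within-group variance, df = 3 * (n - 1) = 9.
ms_error = sum(sum_of_squares(g) for g in groups) / (len(groups) * (n - 1))

# Linear contrast for the predicted ordering (assumed weights +1, 0, -1).
weights = [1, 0, -1]
contrast = sum(w * mean(g) for w, g in zip(weights, groups))
ss_linear = n * contrast ** 2 / sum(w * w for w in weights)  # df = 1
f_linear = ss_linear / ms_error

# Two-group t test ("good" vs. "bad"), pooled variance, df = 2 * (n - 1) = 6.
pooled_variance = (sum_of_squares(good) + sum_of_squares(bad)) / (2 * (n - 1))
t_good_vs_bad = (mean(good) - mean(bad)) / sqrt(pooled_variance * (2 / n))

print("MS error:", round(ms_error, 4))
print("MS linear:", round(ss_linear, 4))
print("F linear:", round(f_linear, 2))
print("t (good vs. bad):", round(t_good_vs_bad, 2))
```

With these inputs the error mean square and the linear term agree with the tabled 0.1577 and 0.7750 to four decimals, and the t value comes out close to the reported 3.14; small differences are to be expected because the table entries are rounded.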
Table 12–3 Mean Photo Ratings by Sex of Subject

Treatment        Male subjects    Female subjects
‘‘Good data’’        .61               .99
Control              .52               .62
‘‘Bad data’’         .35               .08
Sex effects. Table 12-3 shows the mean obtained ratings, considering male and female subjects separately within treatments. The grand mean rating by males of +.49 did not differ from the grand mean rating by females of +.56. Inspection of the treatment means suggests that the treatment effect may have been more powerful in its action upon female than upon male subjects. The differences between the means of the control and experimental groups were greater for the female than for the male subjects (t = 7.21, df = 1, p = .10, two-tail). The remaining analyses were carried out on male and female subjects combined, since it was found that the random assignment of subjects to experimenters had resulted in a proportional sex distribution for both treatments and order of test subjects. (In a subsequent chapter dealing with expectancy effects as a function of subject characteristics there will be a fuller discussion of the relevance of subjects’ sex.)
Order effects. In order to determine whether the effects of ‘‘early returns’’ showed an order effect, the mean of the first two test subjects was compared to the mean of the second two test subjects for each group. Although these trends were not very significant statistically (p = .13, two-tail, t = 5.42, df = 1), they seem to be worth noting. Whereas the control group showed a mean rating change of only +.02, the ‘‘bad data’’ group showed a mean rating change of −.59 and the ‘‘good data’’ group showed a shift of +.44, suggesting that the effect of early returns becomes more marked later in the process of gathering subsequent data. After having run only the first of two test subjects, the mean obtained ratings were barely different from each other, although they were in the predicted directions (Table 12-4).

Table 12–4 Mean Photo Ratings by First and Last Two Test Subjects

Treatment        First subjects    Last subjects    Difference
‘‘Good data’’        0.58              1.02            +0.44
Control              0.57              0.59            +0.02
‘‘Bad data’’         0.47             −0.12            −0.59

Further support for a hypothesis of a ‘‘delayed action effect’’ can be seen in the sequence of correlations between the mean pretest ratings and the ratings obtained from the first-run test subjects, then the second-, third-, and fourth-run test subjects. This sequence of correlations (rhos) was .04, .07, .43, .41. The last two correlations were significantly higher than the first two (p = .02, two-tail, t = 8.20, df = 2). Thus, considering subjects in order, it appears that the later-contacted subjects account for more of the correlation found overall and reported next. The correlation (rho) between the data that experimenters obtained from their two pretest subjects and that obtained from all four of their test subjects was +.69 (p < .01, one-tail, t = 3.02, df = 10). This correlation is identical with the one obtained if data from the control group experimenters are omitted. The effect of ‘‘early returns’’ of data, then, may have been great enough to account for up to 47 percent of the variance of the data obtained from the four subsequent subjects. In order to learn of the relative contribution to this correlation of the first-run pretest subject alone, and the second-run pretest subject alone, analogous correlations were computed between the data they gave their experimenter and the data that experimenter subsequently obtained from his four test subjects. For the first-run pretest subject, the rho obtained was +.55 (p < .05), whereas for the second-run pretest subject, rho was +.74 (p < .01). The difference between these rhos did not even approach statistical significance, and therefore we cannot conclude that the effect of ‘‘good’’ or ‘‘bad’’ early returns is strengthened by adding more early returns. What is of real interest, however, is the possibility that the data from only one subject may have such a marked effect on subsequently obtained data. This interpretation, or rather speculation, must be tempered by the fact that all experimenters did in fact have two pretest subjects and that the effect on subsequent test subjects might have been quite different had only one pretest subject been employed.
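The overall rank correlation between pretest and test data can be retraced from the experimenter means in Table 12-1. The following is a minimal sketch, not the authors' own procedure: it assumes the tabled means are the values that were ranked, it restores minus signs so that each column reproduces its printed group mean, and it uses the textbook sum-of-squared-rank-differences formula, which is valid here only because there are no tied values. Function and variable names are hypothetical.

```python
from math import sqrt

# Mean ratings per experimenter (E1-E12), transcribed from Table 12-1.
# Minus signs are an assumption, restored to match the printed group means.
pretest = [3.68, 3.45, 2.03, 1.90,       # "good data" group
           1.30, 0.43, 0.18, -0.70,      # control group
           -2.38, -3.08, -3.83, -4.38]   # "bad data" group
test = [0.51, 1.19, 0.83, 0.68,
        1.24, 0.80, 0.34, -0.06,
        0.16, 0.55, -0.09, 0.10]


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rho from the sum of squared rank differences (no ties)."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# All twelve experimenters.
rho_all = spearman_rho(pretest, test)

# Experimental groups only (control experimenters E5-E8 omitted).
rho_experimental = spearman_rho(pretest[:4] + pretest[8:], test[:4] + test[8:])

# Control group alone.
rho_control = spearman_rho(pretest[4:8], test[4:8])

# t statistic for rho with n - 2 degrees of freedom, and variance accounted for.
t_all = rho_all * sqrt(len(pretest) - 2) / sqrt(1 - rho_all ** 2)
variance_explained = rho_all ** 2

print("rho, all experimenters:", round(rho_all, 2))
print("rho, control omitted:", round(rho_experimental, 2))
print("rho, control only:", round(rho_control, 2))
print("t:", round(t_all, 2), "variance explained:", round(variance_explained, 2))
```

Under these assumptions rho rounds to .69 both with and without the control experimenters, and the control group alone gives a rho of 1. The t value and the squared rho land close to, but not exactly on, the reported 3.02 and 47 percent, which is what one would expect if those figures were derived from the rounded rho.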
One other factor complicates the interpretation of these obtained correlations. Rather than the early returns affecting subsequently obtained data, it may be that the experimenter, by virtue of his personality and technique in a given situation, tends to elicit similar responses from all his subjects, thus tending to inflate the obtained correlations. This possibility is most obvious for the control group experimenters who contacted only naïve subjects from the very beginning. However, it is also possible that the experimenter affects even the accomplices in a systematic way. Table 12-1 shows that there was considerable variability in the ways in which the accomplices were able to comply with the request to give +5 or −5 ratings to their experimenters. Whether the accomplice-within-treatments variability was due to initial accomplice variance, effect-of-experimenter variance, or an interaction variance remains an interesting question subject to further study. At any rate, this question does make any simple interpretation of the obtained correlations tenuous. Examination of the rank correlation between pretest and test data for the control group alone shows it to be unity. In addition, the first-run pretest subject means correlated perfectly with the second-run pretest subject means. We are faced with the same difficulty in interpretation: was it experimenter effect or subject effect? Thus, although these subjects were not accomplices, their ratings may have served to give the experimenter hypotheses which he then went on to confirm.
Some Recapitulation and Discussion
The data suggest that Ebbinghaus’ hypothesis, that early data returns can affect subsequently obtained data, was correct not only in the situation where the
experimenter serves as his own subject as Ebbinghaus originally formulated it, but also in the situation in which the experimenter is contacting others as his subjects. When the first one or two subjects give ‘‘good’’ or expected data, data obtained from subsequently contacted subjects tend also to be ‘‘good.’’ When the first one or two subjects give ‘‘bad’’ or unexpected data, data from subsequent subjects tend also to be ‘‘bad.’’ For the male experimenters employed in this study, it appears that the effect of early returns may operate more powerfully upon female than upon male subjects. The possibility was also suggested that the effect of early returns may not make itself immediately apparent, but that the effect may be delayed or cumulated to somewhat later-contacted subjects. The finding that nearly half the variance of the test subjects’ photo ratings could be accounted for by a knowledge of the pretest subjects’ ratings was mentioned as difficult to interpret. It might well be that some portion of this variance is due to the effect of early returns and some to more enduring experimenter effects, such as his personality × technique × experimental-situation interaction, which effects might be distributed similarly over all the subjects contacted by the experimenter. How might the effects of early data returns on subsequent data be explained? When the early returns of an experiment are ‘‘good,’’ the hypothesis with which the experimenter undertook the study is partially confirmed in his own mind and thereby strengthened, with a possible increase in the biasing phenomenon for subsequent subjects. The experimenter’s mood may also be considerably brightened (Carlsmith & Aronson, 1963), and this might lead him to be seen as a more ‘‘likable,’’ ‘‘personal,’’ and ‘‘interested’’ person in his interaction with subsequent subjects.
There is some evidence, to be presented in subsequent chapters, which suggests that such experimenter behavior increases the effects of his expectancy on his subjects’ responses. That the flow of incoming data can indeed effect changes in an experimenter’s mood has been suggested by Wilson (1952) and has been recently, charmingly, and autobiographically documented by Griffith (1961). He told how, as the data came in, ‘‘Each record declared itself for or against . . . [me] . . . and . . . [my] . . . spirit rose and fell as wildly as does the gambler’s whose luck supposedly expresses to him a higher love or rejection’’ (p. 307). The situation for the experimenter whose early returns are ‘‘bad’’ may be similarly analyzed, although the situation may be more complicated in this case. If the ‘‘bad’’ early returns are perceived by the experimenter as disconfirmation of his hypothesis, he may experience a mood change making him less ‘‘likable,’’ ‘‘personal,’’ and ‘‘interested,’’ thereby possibly decreasing his effectiveness as an expectancy biasing experimenter. It may also be that for some experimenters the ‘‘bad’’ early returns form the basis for a revised hypothesis, confirmation of which is then obtained from subsequently contacted subjects.
A Second Study of the Effects of Early Returns

Essentially, then, two major variables have been proposed to help us understand the effects of early data returns. The more cognitive one has been called “hypothesis confirmation,” the more affective one has been called “mood.” It was suggested that the former implies an experimenter mood change; but, of course, mood change does not imply hypothesis confirmation or disconfirmation. Interest in the experiment reported now was in the relative proportion of variance of the early returns effect which could be ascribed to the operation of mood alone.

In the first experiment reported in this chapter, data were defined as “good” if they represented higher ratings of success of the persons pictured in the photos. In order that the definition of “good,” or expected, data not be equated with higher ratings of the photos, the nature of the initial expectancy was also varied in the second study. Half the experimenters were thus led to expect +5 (success) ratings, as in the earlier study, but half the experimenters were led this time to expect −5 (failure) ratings from their subjects. Hypothesis confirmation or disconfirmation was again varied by the use of accomplices serving as the first two subjects. These accomplices gave data either in accord with or opposite to the experimenters’ expectancy or hypothesis. Mood or hedonic tone was experimentally varied by having one of the investigators praise or reprove experimenters for their technique of “running subjects” after the accomplices had performed the experimental task but before the “real” subjects had been through the procedure. Praise was designed to induce a good mood in experimenters, reproof a bad mood.

The details of the experiment have been reported elsewhere (Rosenthal, Kohn, Greenfield, & Carota, 1965). The experimenters were 26 Harvard College seniors, all but one of whom were writing an undergraduate thesis in the Department of Social Relations. Experimenters administered the photo-rating task to 115 female subjects, all of whom were enrolled as undergraduates in a college of elementary education. Experimenters were trained in the experimental procedures by an investigator who did not know to which experimental conditions experimenters would be assigned.

The accomplices who served as the first two “subjects” were students at another women’s college and were selected on the basis of “trustworthiness.” That is, they were well known to one of the investigators before the experiment began. Twelve accomplices in all participated in the experiment, which was conducted on two evenings. Eight were used each evening, and six participated both evenings. Each accomplice served as “subject” for three or four experimenters an evening. Each was instructed to give photo ratings averaging as close to +5 or −5 as possible without using the same numbers suspiciously often. In half the conditions these ratings confirmed the expectancy previously induced in experimenters (“good” early returns). In the other half, the accomplices’ ratings disconfirmed the initial expectancy (“bad” early returns).

After experimenters had contacted their first two subjects (accomplices), one of the two male investigators serving as critics entered each experimental room and either praised or reproved the experimenter. In the praise conditions, the critic entered, picked up the experimenter’s data sheets, studied them first with wrinkled brow and then with an increasingly pleased expression, and, smiling, finally said approximately the following:

“Your data follow an almost classical pattern. Haven’t seen results that good in a long time. I’d tell you more specifically what’s so good about them, except that it wouldn’t be really cricket to do that now; perhaps later. Anyway, I’m sure you must be running things very competently to draw data patterns like that. Obviously, you’ve run subjects before this. Well, keep up the good work with the rest of them. See you later.”
In the reproof conditions, the critic entered, picked up the experimenter’s data sheets, studied them with a wrinkled brow for about thirty seconds, began to frown, and then said approximately the following:

“Your data certainly follow a strange pattern. Haven’t seen results like those in a long time. I’d tell you more specifically what bothers me except that it wouldn’t be really cricket to do that now; perhaps later. Anyway, I’m sure you must be doing something strange to draw data patterns like that! I don’t imagine you’ve run subjects before this. Maybe empirical research is not your cup of tea. Well, please try to be very careful for the rest of them. See you later.”

Then each experimenter contacted from three to six “real” subjects in succession. After they had completed their portion of the experiment, experimenters who had been reproved were told that they really had done a very good job.

Combination of the four variables described above, namely (1) +5 or −5 initial expectancy, (2) confirmation or disconfirmation of expectancy (i.e., “good” or “bad” early returns), (3) praise or reproof, and (4) critic 1 or critic 2, yielded 16 experimental conditions (arranged in a 2 × 2 × 2 × 2 factorial design). Experimenters, accomplices, and research rooms were randomly assigned to conditions. Both critics were also randomly assigned to conditions, except that the number of praises and reproofs that each administered was equalized as closely as possible. The accomplices did not know what the treatment conditions of the experiment were, and the critics were blind as to the particular conditions in which they were carrying out their praise or reproof.

Effects of early returns.
Table 12-5 shows that expectancy effects, defined as the difference between data obtained when expecting +5 ratings and when expecting −5 ratings, were greater when the early data returns were “good” (p = .05, one-tail, t = 1.71) and that they were smaller and in the wrong direction when early returns were “bad” (p > .50). Table 12-5 also shows an unexpected main effect on subjects’ ratings of the nature of the experimenter’s early data returns. Subjects tended to rate photos as being of more successful people when their experimenter’s early returns were disconfirming. This particular result has been discussed earlier in the chapter dealing with situational factors affecting subjects’ responses. When the effects of early data returns were considered separately for those experimenters contacted by critic 1 and those contacted by critic 2, a significant difference emerged. For experimenters contacted by critic 1, the effects of early returns were marked, whereas for experimenters contacted by critic 2 there were no effects of early returns, only a tendency for all experimenters to obtain data consistent with their initial expectancy. The personality of the principal investigator who interacts with the experimenter can, therefore, affect the relationship between early returns and subsequent subjects’ responses. That the principal investigator can serve as a “moderator variable” (Marks, 1964) was suggested and discussed in the chapter dealing with situational influences on experimenter effects.

Table 12-5 Mean Photo Ratings by Initial Expectancy and Early Returns

Expectancy      “Good data”     “Bad data”
+5                −1.16           −0.96
−5                −1.94           −0.62
Difference        +0.78           −0.34

Effects of mood. Whether experimenters were praised or reproved did not affect the magnitude of their expectancy effects, nor did it affect the magnitude of the early returns effect. There was, however, an interaction: the effect of praise or reproof on expectancy effects depended on which of the two critics had contacted the experimenters (p = .10). Experimenters contacted by critic 1 showed greater expectancy effects when praised rather than reproved. Experimenters contacted by critic 2 showed greater expectancy effects when reproved rather than praised. Such an interaction makes it virtually impossible to draw any conclusions about the effects of mood, as induced in this study, on experimenter expectancy effects. Who does the praising or reproving is more important than the fact of praise or reproof.

Early returns as sources of expectancy. When early returns confirmed initial expectancies, experimenters showed the greatest expectancy effects. There was also a tendency, when expectancies were disconfirmed, for experimenters to obtain data opposite to those expected initially. From this it might be inferred that the disconfirming early returns formed the basis for a revised expectancy. This inference would have an increased plausibility if it could be shown that early returns within treatment conditions often predicted the data subsequently obtained from real subjects. Some relevant data are available from the study described. Within each of the eight conditions shown in Table 12-6, a correlation (r) was computed between the magnitude of the mean ratings given the experimenters by their first two subjects (accomplices) and the magnitude of ratings subsequently obtained from real subjects. That such correlations could be other than trivial in magnitude seemed unlikely in view of the very restricted range of early returns within any conditions.
All accomplices within any of the experimental conditions had, of course, been programmed to give the same responses. The average deviation of the early data given experimenters by the accomplices within the eight conditions was only 0.5. Table 12-6 shows the obtained correlations; only one was not positive (binomial p < .04, one-tail). The mean z transformed correlation was +.79 (p = .002, one-tail, df = 10). Inspection of Table 12-6 suggests that this overall correlation may mask a difference between those correlations obtained when accomplices were confirming as opposed to disconfirming the experimenters’ initial hypotheses or expectations. When initial expectancies were being confirmed the mean r was +.96 (p < .0005, one-tail, df = 6). However, when experimenters’ initial expectancies were being disconfirmed, the mean r (+.23) was not significantly greater than zero. It was, however, significantly lower than the mean r of +.96 (z = 2.57, p = .01, two-tail).

Table 12-6 Correlations between Early Returns and Subsequent Subjects’ Responses

Evaluation        Expectancy     “Good data”     “Bad data”
Praise              +5             +.67*           +.88
Praise              −5             +.73            +.06
Reproof             +5             +.69*           −.76
Reproof             −5             +.999           +.44
Weighted mean                      +.96            +.23

*Four experimenters per cell; all other cells had three experimenters per cell.

At least those experimenters, then, whose initial expectancies were confirmed by their early data returns tended to obtain data from subsequent subjects that were similar to the data obtained from earlier contacted subjects; and this in spite of an artificially restricted range of early data returns. However, two quite different factors may have been operating to bring this about: (1) an experimenter personality factor or (2) an expectancy factor. If the personality factor were operant, experimenters would have affected the accomplices in the same way in which they subsequently affected their real subjects. Accomplices were, after all, free to vary at least a little in the ratings they produced. If, on the other hand, the expectancy factor were operating, the data produced by the accomplices would serve to modify the original expectancy: by a good deal when early returns were disconfirming, and by just a little when early returns were confirming. (The fact that the correlations were larger when the early returns were more similar to the initial expectancies seems best understood as an instance of “assimilation effects” as described by Helson [1964] and Sherif and Hovland [1961]. In psychophysics and attitude change alike, small deviations are often more accommodated to than large changes.) If the personality factor were operative, one would expect experimenters to have a relatively constant effect on the responses obtained from accomplices and from real subjects regardless of order. Therefore, the correlation between responses obtained from accomplices and those obtained from subjects contacted later should be no higher than the correlation between responses obtained from real subjects contacted early and real subjects contacted later.
The correlation between responses given to experimenters by accomplices and those given by real subjects subsequent to the first two was +.85 (p = .005, one-tail, df = 6). The correlation between responses given by the first two real subjects and those given by subsequently contacted real subjects was only +.16, a correlation not significantly greater than zero but appreciably lower than +.85 (p = .06, two-tail). These findings seem inconsistent with the hypothesis of experimenter personality effect but consistent with the hypothesis of experimenter expectancy. The earliest collected data, if they are not too inconsistent with the initial expectancy, may serve to modify or to specify more precisely the experimenter’s expectancy.

Delayed action effect. In the earlier experiment on the effect of early data returns there was a “delayed action effect,” with accomplices’ data affecting later-contacted real subjects more than earlier-contacted real subjects. In the second study no such effect was found. There was, however, a tendency for experimenters’ initial expectancies to become more effective for later than for earlier contacted subjects. Initial expectancies (+5 vs. −5) had no effect on data obtained from the first two real subjects. Among subsequently contacted real subjects, however, those contacted by experimenters initially expecting +5 ratings gave higher ratings (−.35) than did subjects contacted by experimenters expecting −5 ratings (−1.21) (t = 1.73, p = .09, two-tail, df = 83).
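The binomial probability quoted with Table 12-6 (only one of the eight within-condition correlations was not positive) can be checked with a one-tailed sign test. The following is a minimal sketch, not a record of the original computation: the function name is hypothetical, and it assumes that under the null hypothesis each correlation is equally likely to be positive or not.

```python
from math import comb

def sign_test_one_tail(n_positive: int, n_total: int) -> float:
    """P(at least n_positive positive signs out of n_total) when p = 0.5."""
    tail = sum(comb(n_total, k) for k in range(n_positive, n_total + 1))
    return tail / 2 ** n_total

# Table 12-6: eight correlations, seven of them positive.
p = sign_test_one_tail(7, 8)
print(round(p, 4))  # 0.0352, in line with the reported p < .04
```

The tail probability is 9/256: one way to obtain eight positive signs plus eight ways to obtain exactly seven.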
An Increase in Generality

The data reported by Ebbinghaus and those reported in this chapter are not the only evidence relevant to the hypothesis of early returns as determinants of subsequent data. There is a recent study by McFall (1965) that is at least suggestive. Working in a different laboratory and at another university, McFall employed 14 experimenters to administer the photo-rating task to 56 subjects. (The particular photos employed were different from those employed in the earlier studies.) From half the subjects experimenters were led to expect +5 ratings of success and from half they were led to expect −5 ratings. Within each condition of experimenter expectancy half the subjects were given a set to respond to the stimulus photos with very fast responses. This set was induced by the use of conspicuous timing devices, and it led to the elimination of expectancy effects. McFall reasoned that the greater the number of subjects who had this set for speed who were contacted by experimenters, the more disconfirming returns would be obtained. He therefore analyzed the effects of experimenter expectancy separately for two stages of the experiment differing in the amount of disconfirming data obtained. When a good deal of disconfirming data had been obtained, experimenters showed no expectancy effects whatever. However, when there had been less disconfirming data, significant expectancy effects were obtained (p < .05). Table 12-7 shows the magnitude of these effects.

Table 12-7 Expectancy Effects as a Function of Amount of Expectancy Disconfirmation (After McFall, 1965)

                         Expectancy
Disconfirmation        +5        −5        Difference       t        p <
Less                  +.51      −.40         +.91          1.86      .05
More                  −.06      +.04         −.10
Mean                  +.22      −.18         +.40          1.04      .15

McFall replicated this experiment, but with half the experimenters expecting a shorter reaction time and half expecting a longer reaction time. When the experimenters’ expectancies were of reaction times rather than magnitude of ratings, hardly any expectancy effects emerged (p < .30).

From the experimental results presented in this chapter it seems that Ebbinghaus was right. The data obtained early in an experiment may be significant determinants of the data obtained later in the same experiment. The methodological implications of these findings will be discussed in more detail in Part III.
For now it is enough to raise the question of whether it might not be useful to try to remain uninformed as to how the data are turning out until the experiment is completed. That would mean that the experimenter collecting the data would have to be kept uninformed about what data are “good” and what data are “bad.” But even if the data collector were uninformed of this at the beginning, there would be the difficulty of keeping him uninformed. If the experimenter reported the early returns to the principal investigator, there would probably be more or less subtle reactions to this report, reactions that might cue the experimenter as to whether data were falling well or poorly. If the data seemed to be coming well, judging from the principal investigator’s reaction, the data collector might make a point of not modifying his behavior toward his subjects. The data might, therefore, continue to come in as before. If the data seemed to be coming poorly, judging from the principal investigator’s reaction, the data collector might make a point, even if not consciously, of modifying his behavior toward his subsequently contacted subjects. Such a change in behavior, midway through an experiment, might lead to troublesome interactions of experimental treatment conditions with the number of subjects contacted. Later-collected data might turn out to support the experimental hypothesis because of the unintended change in the experimenter’s behavior.
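Throughout this chapter an expectancy effect is scored as the mean rating obtained under the +5 expectancy minus the mean rating obtained under the −5 expectancy. As a minimal sketch, the Difference rows of Tables 12-5 and 12-7 can be re-derived from the cell means. The helper name is hypothetical, and the signs of the cell means are an assumption: they are entered as below because that choice reproduces the printed differences.

```python
def expectancy_effect(mean_plus5: float, mean_minus5: float) -> float:
    """Difference score: mean rating under +5 expectancy minus mean under -5."""
    return round(mean_plus5 - mean_minus5, 2)

# Table 12-5 (all four cell means assumed to be negative)
good = expectancy_effect(-1.16, -1.94)  # confirming ("good") early returns
bad = expectancy_effect(-0.96, -0.62)   # disconfirming ("bad") early returns

# Table 12-7 (after McFall, 1965)
less = expectancy_effect(0.51, -0.40)   # less disconfirmation
more = expectancy_effect(-0.06, 0.04)   # more disconfirmation

print(good, bad, less, more)  # 0.78 -0.34 0.91 -0.1
```

A positive score means the obtained ratings moved in the direction the experimenters expected; a negative score means they moved the other way.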
13 Excessive Rewards
Little has been said so far about the effects of the experimenter’s motives on the operation of expectancy effects. From studies employing animal subjects it appeared that the experimenter’s expectancy might be a more important determinant of the results of the experiment than the experimenter’s motives. In these studies experimenters were motivated to have all their animal subjects learn rapidly; yet the animals’ learning was impaired when the experimenter expected poor performance.

In some of the earlier studies employing human subjects, experimenters were offered a special incentive to obtain data consistent with their expectancy. In a few studies they were promised two dollars instead of the standard one dollar for their participation if their “data came out as expected.” In these studies there were no control groups to show whether these incentives increased the effects of the experimenter’s expectancy, although it is known from other studies that such incentives are not necessary for the demonstration of expectancy effects. Shortly, more formal evidence will be presented which shows the complicating effects of varying types and sizes of incentives on the operation of expectancy effects.

For any scientist to carry on any research, he must be motivated to do so, and probably more than casually so. It is, after all, a lot of trouble to plan and conduct an experiment. The motivation to conduct research is usually related to certain motivations associated with the results of the research. Rarely is the investigator truly disinterested in the results he obtains from his research, but very likely some scientists are more disinterested than others. The same scientist may be more or less disinterested in the results of his research on different occasions (Roe, 1961).

A number of workers have discussed the implication for science of the motivation of the experimenter vis-à-vis the results of his research. In his preface to Mannheim’s Ideology and Utopia, Wirth spoke of the personal investment of the scientist in his research. Though he spoke more directly to the problem of interpreter effects than to the problem of expectancy effects, his remarks bear repeating: “The fact that in the realm of the social the observer is part of the observed and hence has a personal stake in the subject of observation is one of the chief factors in the acuteness of the problem of objectivity in the social sciences” (1936, p. xxiv). Similarly, Beck (1957, p. 201) has stated: “Each successive step in the method of science calls forth a greater emotional investment and adds to the difficulties of remaining objective. When the ego is involved, self criticism may come hard. (Who ever heard of two scientists battling to prove the other right?)”

More recently Ann Roe (1959; 1961) discussed the scientist’s commitment to his hypothesis and suggested that creative advance may depend on it. She went on to caution us, however, to be aware of the intense bias that accompanies our involvement. (In an effort to implement this caution against bias, Roe devised the “Dyad Refresher Plan” for checking biases in clinical work. Lamentably this plan, which suggests periodic recalibration of the clinician, has not been well accepted.) Motivational factors in the scientist affecting the work he does have been discussed by others as well (Barzun & Graff, 1957; Bingham & Moore, 1941; Griffith, 1961; Reif, 1961). But perhaps the most eloquent and most balanced brief statement on this topic was that by William James: “. . . science would be far less advanced than she is if the passionate desires of individuals to get their own faiths confirmed had been kept out of the game. . . . If you want an absolute duffer in an investigation, you must, after all, take the man who has no interest whatever in its results: he is the warranted incapable, the positive fool. The most useful investigator, because the most sensitive observer, is always he whose eager interest in one side of the question is balanced by an equally keen nervousness lest he become deceived” (1948, p. 102).
The First Experiment: Individual Subjects

In the first experiment to investigate the effects of varying incentives on experimenter expectancy effects, 12 graduate students in education served as the experimenters (Rosenthal, Fode, & Vikan-Kline, 1960). They administered the standard photo-rating task to a total of 58 undergraduate students, of whom 30 were males and 28 females. Instructions read to subjects were those described in the preceding chapters. All experimenters were led to expect mean ratings of about +7 from all their subjects.

The motivation level of the experimenters was defined by the incentive offered for a “good job,” that is, obtaining high ratings of photos from their subjects. The more moderately motivated group of six experimenters was told that the rate of pay would be two dollars per hour for a “good job,” exactly as in some of the earlier described studies. The more highly motivated group of six experimenters, however, was told that the rate of pay would be five dollars for a good job. In addition to the feeling that more highly motivated experimenters would show more biasing effects, it was also felt that subjects’ motivation level might be an important variable affecting experimenter expectancy bias. Accordingly, all subjects were randomly assigned to a paid or an unpaid group. All experimenters contacted paid and unpaid subjects alternately. Paid subjects were told that they would receive fifty cents for their five minutes of participation.

In connection with another study, each subject and each experimenter was asked to fill in a lengthy questionnaire concerning their reaction to the experiment after their part in it was finished. In addition, before each experimenter contacted any subjects he was asked to predict as accurately as possible the average rating he would actually obtain from his subjects. All experimenters were told to leave the doors to their research rooms open, as one of the research supervisors might drop in at any time. It was hoped that this might minimize the possibility of actual cheating by the more motivated experimenters.

In this study, magnitude of expectancy effect was defined in two ways. Higher mean obtained ratings, that is, those closer to +7, were regarded as more biased. In addition, a higher positive correlation between the data specifically predicted by experimenters and the data subsequently obtained by them was regarded as an index of greater expectancy effect. Since experimenters had been led to expect a mean rating of +7, their specific predictions tended to cluster around that value. Consequently, the obtained correlations should underestimate the “true” correlations between predicted and obtained data because of the restriction of range of the experimenters’ predictions.

Table 13-1 shows the mean ratings obtained by the more and less rewarded experimenters, each contacting paid and unpaid subjects. The analysis of variance revealed no differences in data obtained as a function of either of the treatments operating alone. The mean rating obtained by the more moderately motivated experimenters, when contacting unpaid subjects, did seem to be more biased than did the other three means of Table 13-1 (p = .03, two-tail, t = 6.94, df = 2). This was surprising, since it had been thought that the more highly motivated experimenters, especially when contacting paid subjects, would show the greatest expectancy effects. This first hint that an increase in motivation level (especially of experimenters, but of subjects as well) might decrease the effects of the experimenter’s expectancy was further checked.
Table 13-1 Mean Photo Ratings

                           Experimenter’s motivation
Subject’s motivation       High ($5)       Moderate ($2)
High (paid)                   .96               .99
Moderate (unpaid)            1.24              2.28

Table 13-2 shows the correlations (rhos) between the data experimenters had specifically predicted they would obtain and the data they subsequently did obtain. This criterion (or definition) of experimenter expectancy effect correlated +.80 with the definition of expectancy effect based on photo ratings obtained (Table 13-1). Only among moderately motivated experimenters contacting moderately motivated subjects was the magnitude of expectancy effect significantly greater than zero. Disregarding motivation of subjects, it was found that the more moderately motivated experimenters biased their subjects’ responses more than did the more highly motivated experimenters. The latter group, in fact, tended to show a negative or “reverse” expectancy effect, though this was not statistically significant.

Table 13-2 Correlations Between Experimenters’ Predicted and Obtained Ratings

                           Experimenter’s motivation
Subject’s motivation       High ($5)       Moderate ($2)
High (paid)                  −.60              +.24
Moderate (unpaid)            −.31              +.84*
All subjects                 −.31              +.99**

*p = .04 (two tail)
**p = .01 (two tail)

The results just described led to a further investigation of the effect of the experimenter’s motivation level. There was curiosity at this point, too, about two other variables. One was the degree of explicitness of the instructions to obtain expectancy biased data. Heretofore these had all been relatively implicit. The other variable of interest was whether expectancy effects could occur in the situation wherein an experimenter contacts a number of subjects simultaneously (Rosenthal, Friedman, Johnson, Fode, Schill, White, & Vikan-Kline, 1964).
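The second index of expectancy effect in this experiment is a rank correlation (rho) between the mean rating each experimenter predicted and the mean rating he then obtained. The sketch below shows how such a rho is computed for six experimenters. The numbers are invented purely for illustration and are not data from the study, the function name is hypothetical, and the textbook formula used here assumes there are no tied values.

```python
def spearman_rho(x, y):
    """Spearman rank correlation for two sequences without tied values."""
    def ranks(values):
        order = sorted(values)
        return [order.index(v) + 1 for v in values]

    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical predicted and obtained mean ratings for six experimenters.
predicted = [6.0, 6.5, 7.0, 7.5, 8.0, 5.5]
obtained = [0.9, 1.4, 1.1, 2.3, 2.0, 0.4]

print(round(spearman_rho(predicted, obtained), 2))  # 0.89
```

Because rho depends only on rank order, it remains usable when the predictions cluster tightly around one value, although a narrow range of predictions makes the ordering itself less stable.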
The Second Experiment: Subjects in Groups

In this second experiment 30 advanced male undergraduates, primarily from the College of Engineering, served as experimenters. With the modifications to be noted, they administered the photo-rating task to a total of 150 subjects, 90 males and 60 females, all of whom were students of introductory psychology. Experimenters were divided into five treatment groups of six experimenters each. Four of the treatment groups were led to expect high positive ratings (+5) of the photos from their subjects, and the remaining group was led to expect high negative ratings (−5) from subjects.

In order to test the effect of reward or motivation, half of the experimenters expecting high positive ratings from their subjects were given a dollar bill and told that if they did a “better” job (that is, obtained closer-to-expected ratings) than an unknown partner, they could keep their dollar and get their partner’s dollar as well. However, they were told that if their partner did a better job, he would get their dollar. The other half of the experimenters expecting high positive ratings were not involved in this betting situation and were thus considered the less motivated or less rewarded group.

In order to test the effect of the explicitness of instructions to bias their subjects’ responses, half the experimenters expecting to obtain high positive ratings were told to do whatever they could to obtain the expected data, but without deviating from the written instructions to subjects. The other half of the experimenters expecting high positive ratings were simply led to expect these ratings, and were thus considered to be less explicitly biased. The six experimenters expecting high negative ratings were given the more explicit instructions to bias their subjects and were also given the two-person sum-zero game condition of motivation.

The entire experiment was carried out in a single evening. The locale was an armory in which 30 tables were arranged in a roughly circular pattern. Experimenters sat on one side of the table, and their five subjects sat across from them. Each photo was presented to each of the five subjects in turn. Subjects recorded their ratings on a small pad, and these pads were shown to the experimenter so that only he could see them. The purpose of having this double recording of ratings was to provide a check on any recording errors experimenters might make.

The treatment conditions were actually imposed by differential wording of “last-minute instructions,” which experimenters found at their tables the night of the experiment. Each of the treatments was represented equally often in the different parts of the armory. Order of arrival then determined assignment of experimenters to tables, with earlier and later arrivers represented equally often among treatments. Subjects were similarly assigned to experimenters, except that each table had three male and two female subjects. Positions of males and females vis-à-vis the experimenter were systematically varied within treatments.

Each experimenter was asked to predict the specific mean photo rating he would obtain. After the photos had been rated, each subject filled out a series of 20-point rating scales describing his experimenter’s behavior during the course of the experiment. Each scale ran from −10 (e.g., extremely unfriendly) to +10 (e.g., extremely friendly). Experimenters also rated their own behavior on the same set of rating scales.

Table 13-3 shows the mean ratings obtained by the more and less motivated experimenters under more and less explicit instructions to obtain expectancy biased data. More motivated experimenters tended to show less expectancy effect than did less motivated experimenters (p = .05). Explicitness of instructions to obtain biased data had at best only an equivocal effect. Table 13-4 shows the correlations between data experimenters had specifically predicted they would obtain and the data they subsequently did obtain. None of these correlations was significantly different from zero. The analogous treatment conditions from the two experiments described here were those in which more and less motivated experimenters were implicitly set to bias the responses of unpaid subjects.
Table 13–3 Mean Photo Ratings

                    Experimenter’s motivation
Instructions        High        Moderate
Explicit            −.23*       −.02
Implicit            −.35        −.17

* +.14 was the comparable mean obtained by experimenters expecting ‘‘−5’’ ratings.

Table 13–4 Correlations between Experimenters’ Predicted and Obtained Ratings

                    Experimenter’s motivation
Instructions        High        Moderate
Explicit            .00*        −.10
Implicit            −.21        +.59

* −.21 was the comparable correlation among experimenters expecting ‘‘−5’’ ratings.

Table 13-5 shows that rather similar correlations were obtained under the analogous conditions of the two studies.

Table 13–5 Correlations between Experimenters’ Predicted and Obtained Ratings for Analogous Treatment Conditions of Two Studies

                    Experimenter’s motivation
                    High        Moderate        p Difference (Two-Tail)
Study I             −.31        +.84            .06
Study II            −.21        +.59            .27
Means               −.26        +.74            .05

Considered together, these studies suggest that ‘‘excessive’’ rewards for expectancy effects actually led to decreased expectancy effects under the particular conditions of the two experiments described. The more explicitly instructed-to-bias and more highly motivated experimenters expecting ‘‘−5’’ ratings obtained ratings higher than those of any treatment condition imposed on experimenters expecting ‘‘+5’’ mean ratings. This finding (two-tail, p = .13), together with the negative correlations representing magnitude of expectancy effects in the more highly motivated treatment conditions, suggests the possibility that these experimenters were actively biased into a reversed direction. In an earlier chapter, the finding that experimenters’ computational errors in data processing were not randomly distributed was reported. It can be reasoned that those more motivated experimenters whose computational errors favored the induced expectancy might also have biased their subjects’ responses into the direction of the induced expectancy. By similar reasoning, those more motivated experimenters whose computational errors did not favor the induced hypothesis should not have biased their subjects’ responses into the direction of their expectancy. This latter group of experimenters might be the one most likely to show a ‘‘reverse bias’’ effect. Among the less motivated experimenters there was no difference in data obtained by experimenters erring computationally in the direction of the hypothesis (+5) and those erring in the opposite direction. Among the more motivated experimenters, however, this difference was significant at the .10 level (two-tail). The mean photo rating obtained by more motivated experimenters subsequently erring computationally in the direction of their hypothesis (+5) was +.23, whereas the mean obtained by those not erring in this direction computationally was −.55. (These means, of course, were corrected for the effect of the errors themselves.)
The effect of excessive reward seems to be to increase the variability of data obtained (p < .05). Some experimenters appear to be significantly more biased by excessive reward, whereas some experimenters appear to be significantly less, or even negatively, biased.
Some Discussion
Why might one effect of excessive incentive to bias subjects’ data be to reduce or even to reverse the expectancy effect of the experimenters? In a postexperimental group discussion with the experimenters of the second study, many of them seemed somewhat upset by the experimental goings-on. Several of them used the term ‘‘payola,’’ suggesting that they felt that the investigators were bribing them to get
‘‘good’’ data, which was, in a sense, true. Since money had been mentioned and dispensed to only the more motivated experimenters of this study, it seems likely that they were the ones perceiving the situation in this way. Kelman (1953) found that subjects under higher motivation to conform to an experimenter showed less such conformity than did subjects under lower conditions of motivation. One of several of Kelman’s interpretations was that the subjects who were rewarded more may have felt more as though they were being bribed to conform for the experimenter’s own benefit, thus making subjects suspicious and resentful, and therefore less susceptible to experimenter influence. This interpretation fits the present situation quite well. Ferber and Wales (1952) concluded from their data that in the interviewing situation if the interviewer is biased and knows it, he may show a ‘‘negative’’ bias as part of an overreactive attempt to overcome his known bias. Although this was not found consistently in their study, it seems to help in the interpretation of the ‘‘reverse’’ bias phenomenon. Dr. Raymond C. Norris (in a personal communication) has related an anecdote that seems to illustrate nicely this bending over backward to insure freedom from bias: ‘‘Briefly, the situation involved an experiment in which the faculty member took a rather firm position consistent with Hullian theory and the student, being unschooled in Hullian theory, took a directly contradictory point of view based on some personal experience he had had in a similar situation. They discussed the expectations at some length and each experimenter was familiar with the point of view that the other was advancing. However, each felt some very deep commitment to his own expectations. When the results of the experiment were analyzed it was found that there was a significant treatment-by-experimenter interaction. 
Further analysis demonstrated that this interaction consisted of each experimenter producing the results that the other expected. Through some deep soul searching and interrogation both the experimenters became convinced that they had bent over backwards to avoid biasing the results in the direction of their prediction and consequently produced results antagonistic to their predictions.’’ Mills (1958) found that honest sixth graders became more anti-cheating under conditions of high motivation to cheat. In addition, he found that under conditions of high restraint against cheating, subjects became more anti-cheating than did similarly highly motivated subjects under conditions of low restraint against cheating. Mills interpreted this finding on the basis of the high-restraint group’s feeling perhaps more suspected and therefore denouncing cheating more vehemently. During the conduct of both experiments reported in this chapter, experimenters had good reason to feel suspected. In the first study experimenters were asked to keep the doors to their research rooms open so the principal investigators might check up on them. In the second study a corps of eight researchers was in constant circulation about the armory in order to detect and correct any procedural deviations. The more highly motivated experimenters may have felt that their behavior in particular was being evaluated because of their higher stakes. This may have led many of them to bend over backward to avoid appearing in any way dishonest by biasing their subjects. Festinger and Carlsmith (1959) found that subjects under conditions of lower reward, when forced to behave overtly in a manner dissonant from their private belief, changed their private beliefs toward their publicly stated ones more than did subjects under conditions of higher reward. This finding and those of Kelman (1953)
and of Mills (1958) have been interpreted within the framework of Festinger’s theory (1957) of cognitive dissonance. The finding from the present studies, that more motivated experimenters tend to show a reverse bias effect, is consistent with the studies reviewed and may also be interpreted within the framework of dissonance theory (Rosenthal, Friedman, Johnson, Fode, Schill, White, & Vikan-Kline, 1964). Perhaps the most parsimonious interpretation of the reversal of expectancy effects is in terms of Rosenberg’s concept of evaluation apprehension (1965 and in personal communication). The experimenters of the studies described were, of course, a special kind of subject who may well have experienced some apprehension about what the principal investigators might think of them. Both the autonomy and honesty of the experimenters may have been challenged by offering large incentives for affecting the results of their research. By bending over backward, the experimenters could establish that they would not be either browbeaten or bribed to affect their subjects’ responses as their principal investigators had ‘‘demanded.’’ Such an interpretation speaks well of the integrity of the experimenters, but it must be noted that the results of the research were still affected, though in this case in the direction opposite to that expected. No single interpretation of the results of these studies is entirely satisfactory. That is because the locus of the reversal of expectancy effects cannot be determined from the data available. The interpretations offered so far have assumed the locus to fall within the experimenter. In view of the experimenters’ very obvious concern with the question of bribery, the assumption seems reasonable. It is also possible, however, that the locus of the reversal of expectancy effects falls within the subject. That is what we would expect if the more motivated experimenter tried too hard to influence his subjects. 
The subjects feeling pushed by an experimenter who ‘‘comes on too strong’’ may resist his efforts at influencing their responses. This self-assertion through noncompliance may operate in the service of evaluation apprehension if the subject feels that the experimenter would think the less of him if he complied. Orne (1962), Schultz (1963), and Silverman (1965) have all suggested that when influence attempts become more obvious, subjects become less influenceable. This may be a trend of the future as much or more than a fact of the past. As more and more subjects of psychological experiments become acquainted with the results of the classic research in conformity (e.g., Asch, 1952) there may be more and more determination to show the experimenter that the subject is not to be regarded as ‘‘one of those mindless acquiescers’’ which instructors of elementary psychology courses are likely to teach about. As likely as not, in the studies described in this chapter, the locus of the reversal of expectancy effects is to be found both in the experimenters (who were subjects) and in the subjects of these ‘‘subjects.’’
Subjects’ Perception of Their Experimenter
At the conclusion of the first experiment described in this chapter, each subject was asked to fill out a questionnaire describing the behavior of his experimenter during the course of the experiment. Experimenters completed the same forms describing their own behavior during the experiment. Neither experimenters nor subjects knew beforehand that they would be asked to complete these questionnaires, and no one save the investigators saw the completed forms (Rosenthal, Fode, Friedman, & Vikan-Kline, 1960). These forms consisted of 27 twenty-point rating scales ranging from −10 (e.g., extremely discourteous) to +10 (e.g., extremely courteous). The more desirable-sounding poles of the scales appeared about equally often on the right and left of the page. All scales were completed by all 12 experimenters and by 56 of the 58 subjects. Table 13-6 shows the mean ratings of the experimenters by their subjects and by themselves. Both sets of ratings reflected very favorably on the experimenters. The profile of the experimenters as they were viewed by subjects showed remarkable similarity to the profile of the experimenters as viewed by themselves. The rank correlation between profiles was .89, p < .0005. (It has been shown elsewhere that such correlations are increased by the commonly shared stereotype of the psychological experimenter [Rosenthal & Persinger, 1962].) To summarize and facilitate interpretation of the obtained ratings, all variables were intercorrelated and cluster-analyzed. Table 13-7 defines the four clusters that emerged. The associated B-coefficients are all considerably larger than generally deemed necessary to establish the significance of a cluster (Fruchter, 1954). For mnemonic purposes we may label Cluster I as ‘‘Casual-Pleasant,’’ Cluster II as ‘‘Expressive-Friendly,’’ Cluster III as ‘‘Kinesic’’ or ‘‘Gestural Activity,’’ and Cluster IV as ‘‘Enthusiastic-Professional.’’ The only scale not accommodated within any cluster was ‘‘quiet (nonloud).’’

Table 13–6 Mean Ratings of Experimenters’ Behavior

Rating scale                  By subjects      By experimenters
Satisfied with experiment     1.44             5.09
Liking                        5.16 (of E)      5.00 (of S)
Honest                        7.56             8.55
Friendly                      5.50             4.82
Personal                      0.36             1.45
Quiet (nontalkative)          1.89             4.73
Relaxed                       5.82             4.73
Quiet (nonloud)               2.79             2.55
Casual                        5.80             5.36
Enthusiastic                  2.49             3.45
Interested                    4.12             5.09
Courteous                     6.89             7.09
Businesslike                  5.45             6.36
Professional                  4.66             5.45
Pleasant-voiced               7.39             6.27
Slow-speaking                 1.55             2.45
Expressive-voiced             3.16             2.18
Encouraging                   3.53             4.27
Behaved consistently          6.31             6.82
Pleasant                      6.40             5.82
Use of hand gestures          2.02             2.18
Use of head gestures          0.72             0.36
Use of arm gestures           2.18             2.27
Use of trunk                  2.75             2.27
Use of legs                   2.86             2.55
Use of body                   1.27             0.09
Expressive face               2.57             0.73
Table 13–7 Cluster Analysis of Subjects’ Perceptions of Experimenters’ Behavior

Cluster I: B = 6.48 (mean rating = 5.91)
Honest, Casual, Relaxed, Pleasant, Courteous, Businesslike, Slow-speaking, Pleasant-voiced, Behaved consistently

Cluster II: B = 3.97 (mean rating = 2.57)
Liking, Friendly, Personal, Interested, Encouraging, Expressive face, Expressive-voiced, Use of hand gestures, Satisfied with experiment

Cluster III: B = 9.10 (mean rating = 1.96)
Use of head gestures, Use of arm gestures, Use of trunk, Use of body, Use of legs

Cluster IV: B = 3.55 (mean rating = 3.01)
Enthusiastic, Professional, Quiet (nontalkative)
Table 13-7 also shows the subjects’ mean rating of their experimenters for each of the four clusters. The mean rating on the ‘‘Casual-Pleasant’’ cluster was significantly higher (p = .002) than the mean rating on the ‘‘Expressive-Friendly’’ and the ‘‘Enthusiastic-Professional’’ clusters, which were not significantly different from each other. The mean rating on these latter two clusters, in turn, was significantly higher than the mean rating on the ‘‘Kinesic Cluster’’ (p = .002). The 12 experimenters rated for their dyadic behavior had not uniformly influenced the responses they obtained from their subjects. We turn now to the question of whether those experimenters who showed greater positive expectancy effects were perceived by their subjects as in any way different in their experimental interaction from those experimenters who showed less or even a reversal of expectancy effects. In order to answer this question, all experimenters were ranked according to the magnitude of their expectancy effect. For this purpose, magnitude of expectancy effect was defined as the discrepancy between the data an experimenter specifically predicted he would obtain and the data he actually did obtain. The smaller this discrepancy, the greater the expectancy effect; the greater this discrepancy, the less the expectancy effect. Table 13-8 shows the correlations (rhos) between each of the behavioral variables and magnitude of expectancy effect. The median magnitude of the obtained correlations was .35 (p = .02, two-tail). Considering only those correlations reaching a p = .10, experimenters showing greater positive expectancy effects were viewed by their subjects as more interested, likable, and personal; as slower speaking and more given to the use of hand, head, and leg gestures and movements.
Table 13–8 Perceptions of Experimenters’ Behavior and Magnitude of Expectancy Effects

Variables                     Correlation
Satisfied with experiment     .17
Liking                        .56*
Honest                        .06
Friendly                      .15
Personal                      .53*
Quiet (nontalkative)          .09
Relaxed                       .31
Quiet (nonloud)               .41
Casual                        .35
Enthusiastic                  .39
Interested                    .71***
Courteous                     .04
Businesslike                  .26
Professional                  .21
Pleasant-voiced               .43
Slow-speaking                 .64**
Expressive-voiced             .47
Encouraging                   .12
Behaved consistently          .23
Pleasant                      .24
Use of hand gestures          .61**
Use of arm gestures           .34
Use of head gestures          .52*
Use of trunk                  .34
Use of legs                   .55*
Use of body                   .43
Expressive face               .39

* p ≤ .10   ** p ≤ .05   *** p ≤ .01 (two-tail)

As some of these relationships may have occurred by chance because of the number of correlations computed, we may obtain a more stable picture of the relationship between subjects’ perceptions of their experimenter and the magnitude of his expectancy effects by looking at the four obtained clusters rather than at the 27 variables. The median correlations with magnitude of expectancy effect of the variables in Clusters I and IV were .26 and .21, respectively; neither was significantly greater than a correlation having a p = .50. The median correlations with magnitude of expectancy effect of the variables in Clusters II and III were .47 and .43, respectively. Both of these median correlations were significantly greater than a correlation often to be expected by chance (ps were .04 and .01, respectively, two-tail). More expectancy biased experimenters, then, were characterized by higher loadings on the ‘‘Expressive-Friendly’’ and the ‘‘Kinesic’’ clusters. These findings suggested that kinesic and possibly paralinguistic (e.g., tone of voice) aspects of the experimenter’s interaction with his subjects served to communicate the experimenter’s expectancy to his subjects. Further evidence bearing on this formulation will be presented in the following chapters.
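The cluster medians reported above can be checked against the tabled values. The following is a minimal sketch in Python, assuming the correlation magnitudes exactly as printed in Table 13-8 and the cluster memberships listed in Table 13-7; the names `rho`, `clusters`, and `cluster_medians` are ours and purely illustrative, not part of the original analysis.

```python
from statistics import median

# Correlation magnitudes as printed in Table 13-8 (one entry per rating scale).
rho = {
    "Satisfied with experiment": 0.17, "Liking": 0.56, "Honest": 0.06,
    "Friendly": 0.15, "Personal": 0.53, "Quiet (nontalkative)": 0.09,
    "Relaxed": 0.31, "Quiet (nonloud)": 0.41, "Casual": 0.35,
    "Enthusiastic": 0.39, "Interested": 0.71, "Courteous": 0.04,
    "Businesslike": 0.26, "Professional": 0.21, "Pleasant-voiced": 0.43,
    "Slow-speaking": 0.64, "Expressive-voiced": 0.47, "Encouraging": 0.12,
    "Behaved consistently": 0.23, "Pleasant": 0.24,
    "Use of hand gestures": 0.61, "Use of arm gestures": 0.34,
    "Use of head gestures": 0.52, "Use of trunk": 0.34,
    "Use of legs": 0.55, "Use of body": 0.43, "Expressive face": 0.39,
}

# Cluster memberships as listed in Table 13-7.
clusters = {
    "I": ["Honest", "Casual", "Relaxed", "Pleasant", "Courteous",
          "Businesslike", "Slow-speaking", "Pleasant-voiced",
          "Behaved consistently"],
    "II": ["Liking", "Friendly", "Personal", "Interested", "Encouraging",
           "Expressive face", "Expressive-voiced", "Use of hand gestures",
           "Satisfied with experiment"],
    "III": ["Use of head gestures", "Use of arm gestures", "Use of trunk",
            "Use of body", "Use of legs"],
    "IV": ["Enthusiastic", "Professional", "Quiet (nontalkative)"],
}

# Median correlation within each cluster, and across all 27 scales.
cluster_medians = {
    name: median([rho[scale] for scale in members])
    for name, members in clusters.items()
}
overall_median = median(list(rho.values()))

for name, value in cluster_medians.items():
    print(f"Cluster {name}: median correlation = {value:.2f}")
print(f"All 27 scales: median correlation = {overall_median:.2f}")
```

Because every cluster has an odd number of scales, each median is simply the middle tabled value, so the check involves no interpolation or rounding.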
Predicting Computational Errors
The particular experimenters under discussion were those who participated in the first study described in this chapter. In the second study described, the subjects were also asked to make ratings of their experimenter’s behavior during the conduct of the experiment. Eleven of the experimenters in that study made computational errors in their data processing in the direction of their hypothesis. These experimenters were rated an average of +6.8 on the scale of ‘‘honesty’’ by their 55 subjects. The 65 subjects of the remaining 13 experimenters rated them as +8.5, on the average, on the same scale. The difference between these mean ratings was significant at the .02 level (two-tail, t = 2.67, df = 22). Thus, whereas all experimenters were rated as quite honest (whatever that word might have meant to the subjects), those experimenters who later made computational errors in their hypothesis’ favor were seen as somewhat less honest. Just how subjects were able to predict their experimenters’ computational errors from their judgments of experimenters’ behavior during the experiment is a fascinating question for which we presently have no answer. Clearly, however, subjects learn a good deal about their experimenter in the brief interaction of the person-perception experiment conducted. When a subject rates the behavior of an experimenter, we may do well to take his rating seriously.
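The significance level reported for the honesty comparison can be checked from the t and df alone. The sketch below, in Python, assumes an ordinary two-sample t test with experimenters (11 and 13) as the units of analysis, which is what df = 22 implies; the source does not name the test. The function names are ours, and the two-tailed p is obtained by numerically integrating the Student t density with Simpson’s rule rather than by any library routine.

```python
import math

def t_density(x: float, df: int) -> float:
    """Probability density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_tailed_p(t: float, df: int, steps: int = 2000) -> float:
    """Two-tailed p for a t statistic, via Simpson's rule on [0, |t|]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    upper = abs(t)
    h = upper / steps
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        weight = 4 if i % 2 else 2
        total += weight * t_density(i * h, df)
    area_zero_to_t = total * h / 3
    # The density is symmetric, so the area between -|t| and +|t| is twice this.
    return 1 - 2 * area_zero_to_t

n_erring, n_not_erring = 11, 13          # experimenters in each group (from the text)
df = n_erring + n_not_erring - 2         # degrees of freedom for a two-sample t test
p = two_tailed_p(2.67, df)
print(f"df = {df}, two-tailed p = {p:.3f}")
```

Under these assumptions the p value falls between .01 and .02, in keeping with the statement that the difference was significant at the .02 level.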
14 Structural Variables
Are there some experimenters who, more than others, unintentionally affect the results of their research? Are there some subjects who, more than others, are susceptible to the unintentional influence of their experimenter’s expectancy? The present chapter is addressed to these questions. The answers to these questions should increase our general understanding of the effects of the experimenter’s expectancy and perhaps provide us with some clues to the effective control of these effects. In addition, the answers to these questions may suggest to us whether the unintentional influence processes under study are facilitated by factors similar to those that facilitate the more usually investigated processes of social influence.
Experimenter and Subject Sex
A number of studies have been conducted to learn the role of experimenter and subject sex in the operation of experimenter expectancy effects. In the two experiments to be described first there was an additional purpose to be served (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964a). Most of the experiments described up to now had suffered from a certain nonrepresentativeness of design. The experimenters employed had expected all their subjects to give them a specific type of response. In ‘‘actual’’ psychological experiments, experimenters normally expect two or more different kinds of responses from their subjects, depending upon the experimental condition to which the subject has been assigned. In some psychological experiments, experimenters first collect data from subjects in one experimental condition and then later from subjects in the other condition(s). In other experiments, data from subjects representing all conditions are collected on the same occasion, with the order of appearance of subjects from different conditions either systematically or randomly varied. The additional purpose of the experiments to be reported, then, was to learn whether the effects of experimenter expectancy might be generalized to include the two data collection situations described. Accordingly, in the first experiment, experimenters collected data from subjects under one condition of expectancy; then, some time later, these same experimenters collected data from a fresh sample of subjects under an opposite condition of expectancy. In the second experiment, experimenters collected all their data on a single occasion but with opposite expectancies of the responses to be obtained from subjects randomly distributed through the series. If experimenter expectancy effects could occur under these conditions, it would lend further support to the suggestion made in an earlier chapter that experimenters’ hypotheses about subjects’ responses might change in the midst of an experiment, and still serve as self-fulfilling prophecies.
The First Experiment
Three male and two female graduate students in counseling and guidance served as experimenters. The subjects were 52 undergraduate students enrolled in various elementary courses; 23 were males, 29 were females. Each experimenter presented to each of his subjects individually the standard 10-photo rating task. All experimenters were, as before, to read identical instructions to all their subjects. Experimenters were individually trained, and the importance of their role in the experiment was impressed upon them. Each experimenter was randomly assigned an average of 10 subjects. In the first stage of this study, three experimenters were each told that personality test data available from the subjects he would be contacting suggested that these subjects would give mean photo ratings of about +5. The remaining two experimenters (one male, one female) were led to expect opposite results, mean ratings of −5. At this time each experimenter collected data from about five subjects. Several weeks later, each experimenter contacted about five more subjects. This time those experimenters who had earlier been led to expect mean ratings of +5 were led to expect ratings of −5. The expectancies of those experimenters initially expecting −5 ratings were similarly reversed. Explanations to experimenters were simply that this second set of subjects had opposite personality characteristics.
For each subject, magnitude of experimenter expectancy effect upon that subject was defined as the difference score between that subject’s mean photo rating and the mean of the ratings by subjects contacted under the opposite condition of expectancy. A plus sign meant that the direction of difference was the predicted one; that is, the subject’s rating was higher if the experimenter expected him to rate +5 or lower if the experimenter expected him to rate −5. The analysis of variance of the effects of experimenter expectancy as a function of experimenter and subject sex yielded only a significant interaction (F = 3.05, p = .10). Table 14-1 shows the mean expectancy bias score for each sex of experimenter by each sex of subject. For each of the four conditions the tabulated ps represent the likelihood that for the particular combination of experimenter and subject sex the magnitude of expectancy effect could have occurred by chance. Among the female experimenters contacting male subjects there was a nonsignificant trend for the data obtained to be opposite to that expected. Among the three remaining groups the combined p of experimenters’ expectancy affecting subjects was .0005 (z = 3.39).

Table 14–1 Expectancy Effects as a Function of Sex of Experimenter and Subject: First Experiment

E sex       S sex       Mean       t        df      p (Mean = 0)
Male        Male        +1.33      3.01     16      .01
Male        Female      +.78       1.54     14      .16
Female      Male        −.25       .47      5       .66
Female      Female      +1.22      2.11     13      .06

The Second Experiment
Six advanced undergraduate and beginning graduate students enrolled in psychology courses served as experimenters; three were females. All had served as experimenters in an earlier experiment and were familiar with the experimental procedure (Persinger, 1962). The subjects were 35 undergraduate students enrolled in various elementary courses; 22 were females, and 13 were males. The experimental task and general instructions to experimenters were as in the first experiment. Each experimenter collected data from about six randomly assigned subjects, seen consecutively in a single session. Experimenters were told that their subjects were of two personality types and that some would therefore average +5 photo ratings while others would average −5 ratings. Before meeting each subject, the experimenter was told to which ‘‘group’’ that subject belonged. Experimenter expectancies thus were varied randomly on what amounted to a roughly alternating schedule. In order to detect gross, intentional procedural deviations, however, all experimenters were observed during their contacts with all subjects. About half these transactions were permanently recorded on 16 mm sound film. Neither experimenters nor subjects were aware of this monitoring, and no intentional procedural deviations were noted. The analysis of variance yielded significant main effects of experimenter and subject sex. Male experimenters obtained more biased responses than did female experimenters (F = 4.20, p = .05). Female subjects were more susceptible to experimenter expectancy effects than were male subjects (F = 7.67, p = .01). The interaction was, however, somewhat too large to ignore entirely (F = 1.79, p = .20).
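The combined z reported for the first experiment (z = 3.39) can be approximately reproduced from the tabulated ps of Table 14-1. The sketch below, in Python, assumes that the combination was made by the method of adding standard normal deviates (often called the Stouffer method) after converting each two-tailed p to a one-tailed p in the predicted direction; the source does not say which method was used, so this is an assumption, and the function name `combined_z` is ours. Because the tabled ps are rounded to two places, only approximate agreement should be expected.

```python
from math import sqrt
from statistics import NormalDist

def combined_z(two_tailed_ps):
    """Combine independent results, all in the predicted direction,
    by summing their standard normal deviates and dividing by sqrt(k)."""
    std_normal = NormalDist()
    # Halve each two-tailed p to get a one-tailed p, then convert it to a z score.
    zs = [std_normal.inv_cdf(1 - p / 2) for p in two_tailed_ps]
    return sum(zs) / sqrt(len(zs))

# Two-tailed ps for the three first-experiment conditions showing the predicted
# effect: male E with male S, male E with female S, female E with female S.
first_experiment_ps = [0.01, 0.16, 0.06]

z = combined_z(first_experiment_ps)
p_one_tailed = 1 - NormalDist().cdf(z)
print(f"combined z = {z:.2f}, one-tailed p = {p_one_tailed:.4f}")
```

The one reversed condition (female experimenters with male subjects) is left out here, as it is in the text, since the method as sketched assumes that every result lies in the same direction.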
Table 14-2 shows the mean expectancy bias score for the four treatment combinations. In this experiment the tendency found earlier for female experimenters contacting male subjects to obtain responses opposite to those expected was found again and more markedly so (p = .01). For both experiments together the combined p for this reversal was .03 (z = 1.93). In this second study the combined p of the predicted effects’ occurrence among the other three conditions was <.002 (z = 3.06); for both experiments p < .0001 (z = 4.55).

Table 14–2 Expectancy Effects as a Function of Sex of Experimenter and Subject: Second Experiment

E sex       S sex       Mean       t        df      p       p (Both Experiments)
Male        Male        +.72       .94      5       .40     .01
Male        Female      +1.54      2.17     10      .06     .01
Female      Male        −1.37      3.70     6       .01     .03
Female      Female      +.99       3.41     10      .01     .001

Considering the results of both experiments together it appears that male experimenters may unintentionally bias the data collected from both male and female subjects. For female experimenters, on the other hand, the situation is more complex. Influencing their female subjects to give predicted responses, they seem to obtain opposite results from their male subjects. It appears from this significant reversal that the subtle influence process that mediates experimenter expectancy effects is perceived accurately enough by the male subjects; otherwise we would have expected no expectancy effects at all. Male subjects, however, may feel somehow they ought not to let female experimenters influence them in this way. In response, they may then tend to give their female experimenters data opposite to that subtly requested as a demonstration of their masculine independence. The results of these two experiments also extend the generality of earlier findings bearing on the effects of experimenter expectancies on the responses obtained from their subjects. The plausibility of the earlier suggestion that experimenters may alter their hypotheses in midexperiment and then obtain data in accord with the revised expectancy seems somewhat increased. A subsequent study was undertaken to further evaluate the effects of subjects’ sex (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964b). The procedure in this study was very similar to that of the study just reported. Eight male experimenters conducted the photo rating experiment with 32 female subjects. Another 5 male experimenters conducted the same experiment with 13 male subjects. Each experimenter was led to expect about half his subjects to make photo ratings of +5; the remaining subjects were expected to make ratings of −5. Only 38 percent of the male subjects gave ratings in the expected direction, whereas 69 percent of the female subjects gave such biased ratings. For this sample of male experimenters, then, female subjects were clearly more susceptible to the biasing effects of their experimenters’ hypotheses (p < .05, two-tail). Data presented by Marcia (1961) were analyzable for differential biasing effects of male and female experimenters.
Among his female experimenters the correlation between the experimenters’ expectancy and the data subsequently obtained was only −.34. Among male experimenters the analogous correlation was +.62. However, because the total number of experimenters was only 13, the difference between these rhos was not very significant statistically (p = .16, two-tail). An experiment conducted by Persinger (1962) also investigated sex of experimenters and of subjects as factors in the operation of expectancy effects. We shall have occasion to refer to his study in more detail later in this chapter. For now we may simply state that the combination of experimenter and subject sex showing the most extreme biasing effects was that wherein male experimenters contacted female subjects (p = .07, two-tail). One other experiment also suggested that female subjects were more susceptible (p = .10, two-tail) to the expectancy effects of their experimenters. This experiment was summarized in an earlier chapter (Rosenthal, Persinger, Vikan-Kline, & Fode, 1963a). The overall picture that seems now to emerge is that, in general, male experimenters show more significant expectancy effects than do female experimenters. This finding, a fairly consistent one, may be due in part to the specific inability of female experimenters to bias the data given them by male subjects. One possible reason for this was advanced earlier. A number of studies have been conducted in which no differences were found between male and female subjects in their susceptibility to experimenter expectancy effects. We have never found, however, a situation wherein male subjects were significantly more susceptible. With one exception, it does seem safe to conclude that where a sex difference does occur it is the female subjects who show the greater
susceptibility. The possible exception to this has just been reported by Silverman (1965). He studied the effects of experimenters’ expectancies on subjects’ latencies in a word association task. The results suggested that male experimenters unintentionally influenced their female subjects more and that female experimenters influenced their male subjects more. This interaction may be specifically associated with the task employed by Silverman. The susceptibility under discussion is to a subtle, unintended form of social influence. We have no empirical basis for assuming that those characteristics increasing susceptibility to experimenter expectancy effects should be the same characteristics found to increase susceptibility to other, more commonly investigated forms of social influence. Should this prove to be the case, however, we may entertain the hope of a ‘‘bootstrap’’ operation. That is, the results may extend the generality of the research findings in the area of social influence processes while at the same time finding a conceptual niche within that somewhat well-articulated area of investigation. In the question of relating subject sex to susceptibility to interpersonal influence, we find ourselves fortunate indeed. For a variety of situations of less subtle forms of influence, female subjects have consistently been found more influenceable (Aas, O’Hara, & Munger, 1962; Coffin, 1941; Crutchfield, 1955; Hovland & Janis, 1959; Jenness, 1932; London & Fuhrer, 1961; Simmons & Christy, 1962). To expect a sex-linked genetic determination of response to specific situations by college sophomores would no doubt be to expect too much. To the extent that sex predicts susceptibility to interpersonal influence we may postulate that cultural sanctions are operating which serve to approve women’s influenceability more than men’s. Some interesting data tend to bear this out for the situation wherein a biased experimenter is the source of influence (Rosenthal & Fode, 1963). 
More influenceable, or more successfully biased, female subjects were better liked by their influencing experimenters (rho = +.59). More influenceable, or more biased, male subjects were very significantly less liked by their influencing experimenters (rho = −.54). In this experiment, in which all experimenters were male, subjects may have been liked to the degree to which their experimenter felt they fit their culturally prescribed roles of female-acquiescence and male-autonomy. That females may prove to be more docile subjects from an experimenter’s point of view has been interestingly if only accidentally reported. Foster (1961) briefly discussed attrition rates of subjects due to their suspicion that his Asch-type conformity (1952) situation was rigged. About 32 percent of his male subjects were suspicious enough to warrant their being dropped from the experiment, whereas only about 13 percent of his female subjects claimed this degree of suspicion. A fascinating if moot question might be: were the girls actually less suspicious or did they by virtue of some greater degree of acquiescence simply have the greater “decency” not to report to an experimenter something that they believed he might not wish to hear?
If female experimenters were usually less successful in the unintentional influencing of their male subjects, we might also expect that female experimenters would be less successful in influencing their male research assistants to obtain the data predicted. One experiment, which has been cited earlier and will be discussed again later, could be analyzed to help answer this question (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). There were 10 male experimenters and 3 female experimenters who were given one of two opposite expectancies for responses to the photo-rating task. Table 14-3 shows the magnitudes of expectancy effects defined as the mean difference in ratings obtained under each of the two expectancies. A plus sign means that the difference was in the direction of the expectancy. The first row of Table 14-3 shows that female experimenters affected their own subjects’ responses almost as much as did the male experimenters. Each experimenter then served as principal investigator and trained two research assistants. Experimenters were told not to tell their assistants what data the experimenter was expecting them to obtain. The second row of Table 14-3 shows that male experimenters were able to communicate their expectancy to their male research assistants in some covert manner. Female experimenters, however, were significantly less successful at influencing their male research assistants to obtain the data they expected the assistants to obtain from their new samples of subjects. Among male experimenters there was a correlation of +.66 (p < .05, two-tail) between the magnitude of an experimenter’s expectancy effects on his own subjects and the magnitude of his research assistants’ expectancy effects on their subjects. Among the female experimenters the correlation was in the opposite direction (rho = −1.00, p = .33, two-tail), though with so few female experimenters the correlation could not reach significance. The results are at least suggestive. Male principal investigators tend to influence their male research assistants to influence their subjects unintentionally to about the same degree as they themselves influenced their own subjects. Female principal investigators do not show this tendency. In fact, they tend to show a reversal of this effect. Just as male subjects may resist the unintentional influence attempts of their female experimenters, male research assistants may resist the unintentional influence attempts of their female principal investigators.

Table 14–3 Expectancy Effects of Experimenters and of Their Research Assistants as a Function of Experimenter’s Sex

                            Experimenter’s sex
                            Male      Female    Difference    One-tail p
Experimenter’s subjects     +.37      +.30      +.07          NS
Assistants’ subjects        +.32      −.07      +.39          .07
Difference                  +.05      +.37
One-tail p                  NS        .10

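The two-tail p of .33 quoted for the three female experimenters can be checked by direct enumeration. The sketch below is an illustration, not the authors’ stated procedure: it assumes an exact permutation test on untied ranks, and the function names are hypothetical. With three pairs there are only six possible orderings, and two of them (rho = +1.00 and rho = −1.00) are at least as extreme as the observed value.

```python
from itertools import permutations


def spearman_rho_untied(x_ranks, y_ranks):
    """Spearman rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(x_ranks)
    d_squared = sum((a - b) ** 2 for a, b in zip(x_ranks, y_ranks))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def exact_two_tail_p(observed_rho, n):
    """Share of all n! rank orderings whose |rho| is at least |observed_rho|."""
    base = tuple(range(1, n + 1))
    rhos = [spearman_rho_untied(base, perm) for perm in permutations(base)]
    extreme = sum(1 for r in rhos if abs(r) >= abs(observed_rho) - 1e-12)
    return extreme / len(rhos)


# Three experimenters whose two sets of ranks are perfectly reversed
# (hypothetical ranks; any perfectly reversed ordering gives the same rho).
rho_female = spearman_rho_untied((1, 2, 3), (3, 2, 1))
p_female = exact_two_tail_p(rho_female, n=3)
print(rho_female, round(p_female, 2))
```

Under these assumptions no correlation based on three experimenters can produce a two-tail p smaller than 2/6, which is why even a perfect reversal could not reach significance.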
Experimenter and Subject Anxiety
An experiment conducted by Fode (1965) was designed specifically to show whether experimenters’ or subjects’ anxiety level might be related to the occurrence of expectancy effects. He employed a total of 16 experimenters each of whom administered the standard photo-rating task to an average of 10 subjects. Some experimenters were led to expect positive ratings of the photos and some were led to expect negative ratings. Anxiety level was defined by scores on the Taylor Manifest Anxiety Scale (MAS). Magnitude of expectancy effects was defined by the difference scores between ratings obtained under the two conditions of expectancy. A plus sign preceding this difference score means that higher ratings were obtained by experimenters expecting higher rather than lower ratings. Table 14-4 shows the results of the Fode experiment. Examination of the marginals shows that medium-anxious experimenters exerted the greatest expectancy effects upon their subjects. Similarly, medium-anxious subjects showed the greatest susceptibility to experimenter expectancy effects.

Table 14–4 Expectancy Effects at Three Levels of Experimenter and Subject Anxiety (After Fode, 1965)

                       Experimenter’s anxiety
Subject’s anxiety      High      Medium    Low       Mean
High                   +1.17     +1.23     +0.04     +0.82
Medium                 +0.72     +2.13     +1.85     +1.57
Low                    +0.23     +0.08     −0.18     +0.04
Mean                   +0.71     +1.15     +0.57     +0.81

Six additional experiments (94 experimenters; 432 subjects), all employing the same task, were designed in part to bring further evidence to bear on the question raised by Fode. In these studies, anxiety was also defined by MAS scores or by a near relative. Experimenters and subjects were again classified as high, medium, or low anxious if they fell into the top, center, or bottom third of their sample’s distribution of anxiety scores. Table 14-5 summarizes the results of all seven studies. In three samples of experimenters, medium anxiety level was associated with the greatest effect of experimenter expectancy. In two samples high anxiety level, and in one sample low anxiety level, was associated with greatest expectancy effects. Most of these findings must be regarded as significant statistically in spite of their remarkable inconsistency. A similarly chaotic pattern emerges when we consider the results for samples of subjects. Those subjects found to be most susceptible to experimenter expectancy effects were found to be high-anxious in three samples, medium-anxious in two samples, and low-anxious in two samples. All these results also tended to be statistically significant. We can safely conclude only that experimenters’ and subjects’ levels of anxiety are significantly (but very unpredictably) related to the occurrence of expectancy effects.

Table 14–5 Anxiety Levels Maximizing Expectancy Effects

                               Experimenters                Subjects
Investigators                  N     Level     p            N      Level           p
Fode, 1965                     16    Medium    .001         167    Medium          .04
Persinger, 1962                12    Low       .05          43     Low             .07
RR, GP, KF, 1962               10    High      .13          --     --              --
RR, GP, LVK, RM, 1963          29    High      .05          200    High            .12
RR, PK, PG, NC, 1965           14    Medium    .08          28     High and low    .05
RR, GP, RM, LVK, MG, 1962      29    Medium    .08          86     High            .10
Vikan-Kline, 1962              --    --        --           75     Medium          .06

If the results bearing on the relationship between subjects’ anxiety level and susceptibility to subtle, unintended, interpersonal influence are confusing, they are at least matched in equivocality by the results from other areas of research on susceptibility to interpersonal influence. The relationship of anxiety, also usually defined by scores on the MAS or a next of kin, to influenceability has been rather frequently investigated. More-anxious subjects have been found more susceptible to interpersonal influence in “conditioning” situations by a number of workers (Gelfand & Winder, 1961; Haner & Whitney, 1960; Sarason, 1958; Taffel, 1955), although others have not found anxiety to be a relevant variable in the same situation (Buss & Gerjuoy, 1958; Dailey, 1953; Eriksen, Kuethe, & Sullivan, 1958; Matarazzo, Saslow, & Pareis, 1958). More-anxious subjects were found more influenceable by persuasive communication (Fine, 1957; Janis, 1955), and more-conforming subjects were characterized as more anxious under conditions of stress by Crutchfield (1955); Goldberg, Hunt, Cohen, and Meadow (1954) found more-anxious females to be more conforming. On the other hand, several studies have shown more-anxious subjects to be less persuasible, a contradictory finding which was reviewed and partially reconciled by Cervin, Joyner, Spence, and Heinzl (1961). These workers showed that more-emotional subjects were indeed more persuasible when under conditions of public commitment. Their study, and others requiring subjects to make their responses publicly, seems most relevant to our interest in the social psychology of the psychological experiment, since in most experimental situations the subject’s response is made in the presence of his minimum public of the experimenter. Left unexplained are data obtained by Kuethe (1960). In his classroom drama situation, more-acquiescent and less-anxious subjects were more susceptible to social influence. Goldberg et al. (1954) found their less-anxious male subjects to be more influenceable.
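As a check on the reading of Table 14-4 given earlier in this section, the marginal means can be recomputed from the nine cell entries. This is a minimal sketch under two assumptions that the text does not confirm: the marginals are treated as unweighted means of the rounded cells, and the low-subject, low-experimenter cell is taken to be negative (the sign that reproduces the printed marginals). Variable names are hypothetical.

```python
# Cell entries of Table 14-4. Rows are subject anxiety, columns are
# experimenter anxiety, both ordered high, medium, low.
# Assumption: the last cell is -0.18 (its sign is inferred from the marginals).
cells = [
    [1.17, 1.23, 0.04],   # high-anxious subjects
    [0.72, 2.13, 1.85],   # medium-anxious subjects
    [0.23, 0.08, -0.18],  # low-anxious subjects
]
levels = ["high", "medium", "low"]


def mean(values):
    """Unweighted arithmetic mean."""
    values = list(values)
    return sum(values) / len(values)


subject_means = [mean(row) for row in cells]             # row marginals
experimenter_means = [mean(col) for col in zip(*cells)]  # column marginals
grand_mean = mean(v for row in cells for v in row)

most_susceptible_subjects = levels[subject_means.index(max(subject_means))]
most_biasing_experimenters = levels[experimenter_means.index(max(experimenter_means))]

print([round(m, 2) for m in subject_means])
print([round(m, 2) for m in experimenter_means])
print(most_susceptible_subjects, most_biasing_experimenters)
```

The recomputed means fall within .01 of the printed marginals; small discrepancies are to be expected if the published marginals were computed from unrounded or differently weighted data.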
In a review of the relationship between postural sway suggestibility and neuroticism or anxiety, Heilizer (1960) summarized the equivocal findings obtained, and found in his own study, utilizing a more precise measure of postural sway, no relationship between suggestibility and either neuroticism or anxiety. In attempting to summarize the relationship between anxiety and susceptibility to interpersonal influence for a variety of situations we must settle for the unsatisfying conclusion that anxiety often makes a difference but we cannot accurately predict whether more or less anxious subjects will be the more influenceable. One factor in the equivocality of the obtained relationships between anxiety and susceptibility to social influence processes, generally, may be the curvilinear nature of the underlying relationship. This curvilinearity appears clearly in our own data (Table 14-5), although admittedly we may be dealing with a special case of interpersonal influenceability. Another factor possibly contributing to the obtained equivocality of relationships is the anxiety level of the experimenter. Inspection of Table 14-5 suggests that the particular level of subject anxiety associated with greatest susceptibility to the influence of experimenter expectancy may depend on the anxiety level of the experimenter. When the experimenter showing greatest expectancy effects is high-anxious, chances are that the most susceptible subjects will also be high-anxious. When the experimenter showing greatest expectancy effects is low-anxious, chances are that the more susceptible subjects will also be low-anxious. The correlation (rho) between the level of anxiety characterizing most biasing experimenters and the level of anxiety characterizing most susceptible subjects was +.64. Although this correlation was not significant statistically (p = .18, two-tail), based as it was on only six samples, the potential implications are important enough to warrant further consideration. In research on experimenter expectancy effects and possibly in other research dealing with more traditional situations of interpersonal influence, the influence process may proceed most effectively when the source and target of influence are more alike in level of anxiety. Anxiety similarity or perhaps nondissimilarity may be a correlate of rapport just as racial and religious similarity often seem to be (Hyman et al., 1954). When the variables of experimenter and subject hostility were employed, Sarason (1962) also found the greatest influence exerted upon subjects when experimenters and subjects were similar. He too employed an interpretation of experimenter-subject similarity. Even if the hypothesis of similarity or of nondissimilarity were upheld by further research, we would be left with the problem of accounting for the differences in the absolute level of anxiety associated with maximal interpersonal influence. We could easily posit gross situational variables that might account for these differences, except that in the studies reported in Table 14-5 the situation was about as “uniform” as it is likely to be in social psychological research. In time, we can hope for a developing structuring of the complex relationships between social influence processes and the anxiety status of the participants. It seems, however, from a research logistics point of view, that we should not expect any one or two or three experiments to provide the necessary integrating information. When seven experiments yield only an equivocal hypothesis, how many more may be required to impose a meaningful structure on a domain of data characterized by such complexity?
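The rank correlation of +.64 between experimenter and subject anxiety levels can be approximated from Table 14-5. The sketch below rests on coding choices that are assumptions rather than anything stated in the text: levels are scored low = 1, medium = 2, high = 3; only studies reporting both an experimenter and a subject level are used; the study whose subjects were “high and low” contributes two pairs, which yields the six samples mentioned; and tied values receive average ranks. All function names are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of the ranks they span."""
    ordered = sorted(values)
    rank_of = {}
    for v in set(values):
        first = ordered.index(v) + 1          # first rank occupied by v
        count = ordered.count(v)              # number of tied observations
        rank_of[v] = first + (count - 1) / 2  # mean of ranks first..first+count-1
    return [rank_of[v] for v in values]


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman rho as the Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


LEVEL = {"low": 1, "medium": 2, "high": 3}

# (experimenter level, subject level), read from Table 14-5 under the
# assumptions described above.
pairs = [
    ("medium", "medium"),  # Fode, 1965
    ("low", "low"),        # Persinger, 1962
    ("high", "high"),      # RR, GP, LVK, RM, 1963
    ("medium", "high"),    # RR, PK, PG, NC, 1965 (subjects: high ...)
    ("medium", "low"),     # RR, PK, PG, NC, 1965 (... and low)
    ("medium", "high"),    # RR, GP, RM, LVK, MG, 1962
]

experimenter_levels = [LEVEL[e] for e, _ in pairs]
subject_levels = [LEVEL[s] for _, s in pairs]
rho = spearman_rho(experimenter_levels, subject_levels)
print(round(rho, 2))
```

That this coding reproduces the reported value does not show it is the coding the authors used; it only shows that the figure is consistent with the table as reconstructed here.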
One of the reasons that the subject’s level of anxiety is only a poor predictor of susceptibility to social influence may be that experimenters find it hard to treat subjects with differing levels of anxiety in even a roughly equivalent manner. There may be some experimenters who are sufficiently perceptive to be able to differentiate the anxiety levels of their subjects and to treat them differently as a function of their involuntary assessment of their anxiety level. If subjects differing in level of anxiety (or in other characteristics) are not treated similarly, they cannot be said to be in the same experiment, an argument that has been made in earlier chapters. An experiment by Pflugrath (1962) suggests that at least some experimenters do treat their subjects differently as a function of their perceived anxiety level. The basic purpose of Pflugrath’s study was to find out whether experimenter expectancy effects could operate under conditions of group personality testing. Because group testing situations minimize personal contact between experimenter and subject and maximize the physical distance between them, it was thought that expectancy effects would be, at most, trivial. Pflugrath employed nine experimenters, all graduate students in counseling and guidance, to administer the Taylor Manifest Anxiety Scale (MAS) to 142 students enrolled in introductory psychology, all of whom had already taken the MAS on an earlier occasion. Three of the experimenters were told the subjects they would be testing were highly anxious. Three were told their subjects were quite nonanxious, and three were told nothing about their subjects. Each experimenter administered the MAS to two groups of randomly assigned subjects. Number of subjects in each group ranged from 5 to 10 with a mean of 8. Instructions to experimenters explained that their subjects had been seen in the student counseling center, a fact that may assume some importance in our interpretation of Pflugrath’s data.
The overall analysis of the results showed no significant differences in anxiety scores earned by the subjects of the three groups of experimenters. Among the subjects of the control group, 47 percent showed a decrease in anxiety score from their pretest level (p > .70). Among the subjects whose experimenters believed them to be nonanxious, 57 percent showed a decrease in anxiety (p = .30). Among the subjects whose experimenters believed them to be highly anxious, 70 percent showed a decrease in anxiety (p < .005). When we recall that the experimenters were counselors-in-training this result seems reasonable. Told that they would be testing very anxious subjects who had required interpersonal assistance at the counseling center, these experimenters may well have brought all their counseling skills to bear upon the challenge of reducing their subjects’ anxiety. This unprogrammed “therapy” by psychological experimenters may operate more frequently in behavioral research than we may like to believe. In a good deal of contemporary behavioral research, subjects are exposed to conditions believed to make them anxious. What might be the effect, on the outcome of experiments of this sort, of the covert therapeutic zeal and/or skill of various investigators carrying out this research? Might certain investigators typically conclude “no difference” because they unwittingly tend to dilute the effects of treatment conditions? And conversely, might others be led to conclude “significant difference” by their unintentionally increasing the anxiety of subjects known to belong to the “more anxious condition” of an experiment? Clearly these are not necessarily effects of experimenters’ expectancies, but they are effects of experimenter attributes which may have equally serious implications for how we do research.
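The text does not say which test produced the p values attached to the percentages of subjects showing a decrease in anxiety. One conventional choice for such data is an exact sign test against a 50:50 null, sketched below. Everything specific here is an assumption: the function name is hypothetical, subjects with no change are ignored, and the group size of 47 is simply 142 divided by three conditions, since the actual number of subjects per condition is not given.

```python
from math import comb


def sign_test_two_tail(decreases, n):
    """Exact two-tail binomial p for `decreases` out of n under a 50:50 null."""
    k = max(decreases, n - decreases)  # size of the larger group
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * upper_tail)


# Hypothetical group size: 142 subjects spread over three conditions.
hypothetical_n = 47
for label, share in [("control", 0.47), ("nonanxious", 0.57), ("anxious", 0.70)]:
    decreases = round(share * hypothetical_n)
    p = sign_test_two_tail(decreases, hypothetical_n)
    print(label, decreases, round(p, 3))
```

Because the group sizes are guessed, the printed values should be read only as showing the ordering of the three conditions, not as a reconstruction of the published p values.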
Experimenter and Subject Need for Approval
A very extensive series of experiments by Crowne and Marlowe (1964) and their coworkers has shown the importance of the approval motive to our understanding of susceptibility to interpersonal influence. Higher need for social approval as defined by scores on the Marlowe-Crowne Social Desirability Scale (M-C SD) and related measures (Ismir, 1962; Ismir, 1963) usually (Buckhout, 1965), but not always (Spielberger, Berger, & Howard, 1963), characterizes those subjects who comply more with experimentally varied situational demands. The eight samples of experimenters for whom expectancy effects were correlated with need for approval (M-C SD) are identified in Table 14-6. Early in the series of studies, it became evident that the nature of the relationship depended upon the experimenter’s level of anxiety. In Table 14-6, therefore, the correlations between degree of experimenter expectancy effects and M-C SD scores are tabulated separately for high, medium, and low levels of experimenter anxiety. In a few cases, this method of analysis was not possible, and the entire sample of experimenters was then classified as medium anxiety. Considering only the medium-anxious samples of experimenters, all but one of the correlations were positive (p < .05, two-tail). Considering the high- and low-anxious samples of experimenters, however, the correlations tended to be more negative than positive (p < .05, two-tail). Experimenters at a medium or unclassified level of anxiety thus showed greater expectancy effects if they scored higher on need for approval (median rho = +.62), whereas experimenters at either high or low anxiety levels showed just the opposite relationship (median rho = −.40). Why this should be is not at all clear, and just as in the case of experimenter and subject anxiety discussed in the preceding section, we may guess that it will take considerable effort to impose a structure upon this complex array of data. At any rate, our findings do suggest that, at least for some situations, the predictive power of the M-C SD scale may be still further increased by controlling for the influencer’s level of anxiety.

Table 14–6 Need for Approval and Expectancy Effects at Three Levels of Anxiety

                                                   Anxiety level
                               High                Medium              Low
Investigators                  N       Rho         N       Rho         N       Rho
Fode, 1965                     4       −.25        8       +.80        4       −.80
Marcia, 1961                   --      --          13      +.74        --      --
Persinger, 1962                --      --          12      +.58*       --      --
RR, GP, KF, 1962               4       0.00        4       +.95        4       −.40
RR, GP, LVK, RM, 1963
  Sample I                     4       −.40        4       +.65        4       −.85
  Sample II                    5       +.07        7       +.15        5       +.13
RR, PK, PG, NC, 1965           (14)    (−.55)**    14      −.55        (14)    (−.55)**
RR, GP, RM, LVK, MG, 1962      6       −.60        9       +.21        10      +.25

* For an atypical sample of subjects personally acquainted with their experimenter, the analogous correlation was −.34, p = .15.
** At each level of anxiety the relationship was similar.

Let us suppose for a moment that our data had more clearly shown us that experimenters scoring higher on a scale of approval need biased their subjects’ responses more. (Such a finding has in fact been reported by Buckhout [1965] for a situation in which the influence attempt was quite intentional.) Even then the interpretation would not be straightforward. These experimenters might affect their subjects more in order to please the investigators who created demands upon them. Alternatively, these experimenters might simply be more effective unintentional influencers without their necessarily wanting to please the principal investigators any more than the lower need approval experimenters. Either or both of these mechanisms might reasonably be expected to operate. It could well be argued, on the basis of the literature relating need approval to social influenceability, that we should not have expected any particular correlation between experimenters’ expectancy effects and M-C SD scores. In the more usual experiment, it is the high-need-approval subject who makes the conforming response for the experimenter. In our studies, the experimenter-subject, in order to conform to our demands, must successfully influence other, subordinate subjects. He is able to conform, then, only in an indirect and extremely complicated way. Given samples of experimenters who do influence their subjects’ responses, we might expect that those subjects scoring higher in need for approval would be more influenceable.
In none of our samples of subjects was this hypothesis confirmed. Subjects’ need for approval has consistently been found unrelated to degree of susceptibility to experimenter expectancy effects. It appears that susceptibility to the subtle, unintended social influence of a biased experimenter is not as predictable from subjects’ need for approval as is susceptibility to more clearly intended forms of interpersonal influence.
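The two median correlations quoted for Table 14-6 in this section can be recomputed from the tabled values. In the sketch below the minus signs on the unsigned entries are an assumption (a value carrying no plus sign is read as negative, which is the reading that reproduces the reported medians), and the parenthesized entries for the study analyzed as a single sample are included at both the high and the low level.

```python
from statistics import median

# Rho values from Table 14-6; signs on unsigned entries are assumed negative.
medium_rhos = [0.80, 0.74, 0.58, 0.95, 0.65, 0.15, -0.55, 0.21]
high_rhos = [-0.25, 0.00, -0.40, 0.07, -0.55, -0.60]
low_rhos = [-0.80, -0.40, -0.85, 0.13, -0.55, 0.25]

median_medium = median(medium_rhos)               # midpoint of +.58 and +.65
median_high_low = median(high_rhos + low_rhos)    # midpoint of two -.40 entries

positive_medium = sum(1 for r in medium_rhos if r > 0)

print(median_medium, median_high_low, positive_medium)
```

The medium-anxiety median falls midway between +.58 and +.65, which the text reports as +.62; seven of the eight medium-anxiety correlations are positive, matching the statement that all but one were.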
Experimenter-Subject Acquaintanceship
Were there such a thing as a typical psychological experiment it would be likely to involve an experimenter who was unacquainted with all or most of his subjects. There are numerous situations, however, when a major proportion of the subjects are acquaintances of the experimenter. This is often the case with pilot studies and follow-up studies. It is also likely to be the case where an experimenter-teacher employs his intact classes as subjects or when the experimenter is located at a smaller college, clinic, hospital, or industrial setting from which he will draw his subject samples. Granting only that experimenters sometimes contact acquainted subjects, it would be worthwhile to know whether this changes the likelihood of the operation of expectancy effects. This question would, in fact, be of interest even if no experimenter ever contacted a prior-acquainted subject. Acquaintanceship, itself, is an inevitable result of an experimenter-subject interaction, and the degree of acquaintanceship varies directly with the time and intensity of the experimenter-subject interaction sequence. The longer an experiment lasts and the greater the information exchanged, the more acquainted will the participants be at some point in the sequence, e.g., its half-life. Although acquaintanceship cannot readily be eliminated, it can and has been experimentally varied. In an earlier chapter several such studies were cited, and it was found that subjects’ performance might be affected by prior acquaintance with the data collector. The study by Kanfer and Karas (1959) is most relevant to our interest here. They found that prior acquaintance with the experimenter increased subjects’ susceptibility to the influence of the experimenter’s reinforcing verbal behavior.
These findings, together with the comprehensive work of Newcomb (1961), suggest that those subjects who are more acquainted with their experimenters should be more susceptible to expectancy effects. This might be even more true in the case of less obvious, unintended influence processes. Where the cues to the subject are only poorly programmed, the acquainted subject would seem to have the better chance of accurately “reading” the unintended signals sent by the experimenter. The first data obtained bearing on the relationship between acquaintanceship and magnitude of expectancy effects were incidental to another purpose to be described later (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). In that study, experimenters were asked to predict what sort of photo ratings of success or failure they would subsequently obtain from their six randomly assigned subjects. Magnitude of expectancy effect was defined by the correlation (rho) between ratings experimenters expected to obtain and the data they later did obtain. A subsample was available of 10 male experimenters who had been previously acquainted with one or more of their subjects and unacquainted with one or more of their subjects. Based only on data obtained from unacquainted subjects, the correlation defining expectancy effects was −.05. The analogous correlation based on these experimenters’ acquainted subjects was +.69 (p = .04, two-tail). Thus expectancy effects operated for this subsample of experimenters only when their subjects were acquaintances. The effects of acquaintanceship in this analysis were, if anything, underestimated because of the tolerant definition of “acquainted.” An experimenter who had only said “hello” to a subject on some earlier occasion was classified as acquainted. Even passing acquaintanceships may be nonrandom in origin, and it may be that willingness to be influenced by the acquaintance is a factor in the origination of the relationship. This selective factor, together with the greater reinforcement value of an acquaintance demonstrated by Kanfer and Karas (1959) and the possibly greater ability of acquainteds to “read” each other’s interpersonal signals, may best account for the data reported here. A subsequent experiment was designed to deal more explicitly with the relationship between acquaintanceship and experimenter expectancy effects (Persinger, 1962). Five male and seven female advanced undergraduate students served as experimenters, and 83 beginning undergraduate students served as subjects in a photo-rating experiment. Half the experimenters were led to expect ratings of success, and half were led to expect ratings of failure of the persons pictured in the photos. Each experimenter contacted both male and female subjects and had prior acquaintanceship with half his subjects. The results of this study again showed that male experimenters exerted significantly greater expectancy effects upon acquainted than upon unacquainted subjects (p = .005, one-tail, t = 2.64, df = 29). This did not hold for female experimenters, however. In fact, there was a tendency, though not significant, for female experimenters to show greater expectancy effects with unacquainted subjects. The interpretation of acquaintanceship dynamics offered earlier, therefore, may be applicable only to male experimenters, though Silverman (1965) has recently reported a case in which a female experimenter, working in the area of angle estimation, had her hypothesis confirmed by acquainted subjects, her own students, but disconfirmed, and significantly so, by unacquainted subjects.
Experimenter Status
Data in the social sciences may be collected by experimenters differing greatly in the status or prestige ascribed them by their subjects. The distinguished professor, the new instructor, the graduate research assistant, the able undergraduate, all represent points on a scale of status consensually ascribed them by virtue of their position within the academic world. Similar status scaling for potential data collectors in clinical, industrial, military, and survey research settings would not be difficult. The general question of concern here is whether the experimenter’s status makes a difference with respect to the results of his research. In an earlier chapter dealing with several experimenter attributes, some evidence was presented that suggested that experimenter status could, in fact, be a partial determinant of the data he obtained from subjects. The more specific question then becomes whether experimenter status is a factor in the operation of experimenter expectancy effects. Do higher status experimenters obtain data more in accord with their expectancy than do lower status experimenters? It was to this question that the following study by Vikan-Kline (1962) was addressed. Six male faculty members and six male graduate students served as experimenters. Since all were psychologists familiar with research on experimenter expectancy effects, the expectancy-inducing procedure employed in earlier studies could not be used. Instead all experimenters were asked to somehow subtly influence half their subjects to rate photos as successful and influence half their subjects in the opposite direction. There was a total of 85 introductory psychology students who served as subjects. About half were males.
Before any subject was ushered into the research room, the experimenter was informed whether he should try to influence that subject to give ratings of success or of failure. No instructions were given as to how to influence the subjects. Indeed such instructions could not be given, since no one knew how this subtle form of interpersonal influence was mediated. It was hoped that subsequent to the data collection, experimenters’ verbal reports of how they tried to influence subjects could be used as a source of hypotheses for further research. Perhaps those who proved to be more successful influencers would be able to tell us of having used different techniques of influence. Although this study differed from others in the program in that experimenters were fully aware of trying to influence their subjects, it was hoped that the mechanisms employed by unintentional influencers might be similar if less overt. From the subjects’ point of view, most of whom knew neither the faculty nor graduate student experimenters, the definition of status was a name card placed on each experimenter’s desk. The graduate students had the words “psychology grad. student” written under their names. The faculty experimenters had the words “professor of psychology” written under their names. All experimenters had been rated as to apparent age by a sample of 14 colleagues; the apparent age (late twenties to early thirties) of the Ph.D. group members was higher than that of any member of the graduate student group (early to mid-twenties). All experimenters dressed similarly, i.e., white shirts, ties, and jackets. Results showed that the faculty experimenters were more successful at influencing their subjects to yield the desired data, but only among subjects contacted later in the experiment. In fact, Table 14-7 shows that early in the series of subjects, the faculty experimenters were, if anything, less successful influencers than the graduate student experimenters.
Although the graduate students never did influence their subjects much, there was a tendency for them to grow less influential later in the series of subjects contacted. This trend can be interpreted within the framework of the data presented in the earlier chapter on the effects of early data returns. Having obtained initially “poor” data these experimenters went on to collect “worse” subsequent data. The studies of the early returns effect had been based on samples of graduate student experimenters. The early returns effect may be less likely to occur for faculty experimenters. Possibly they were less threatened by their earlier inability to produce the desired effects and were thus freer to learn from their early subjects what techniques of influence might be most effective.

Table 14-7 Expectancy Effects as a Function of Experimenter’s Status

                     Experimenter status
Order of subject     Student    Faculty    Difference    p
First half            −.12       −.88        −.76        NS
Last half             −.75      +2.42       +3.17        .01
Difference            −.63      +3.30
Two-tail p             NS        .01

What techniques were employed? Experimenters tended to employ several. In reading the instructions they tried to emphasize that portion of the description of the photo-rating scale which contained reference to the desired responses. If they were trying to influence positive ratings, they were friendlier in general, smiled more, and were more “accepting.” They behaved in cooler fashion when trying to obtain negative ratings. When they obtained responses of the desired type they were more likely to look interested and pleasant, sometimes even smiling. This sort of reinforcement behavior was not so consistent nor so blatant, however, that we can regard this study as one of typical operant conditioning. Unfortunately no differences in self-description of attempted influencing behavior were found between those experimenters who were and those who were not successful influencers.
In general, the results of this study agree quite well with the general literature relating status to influential behavior (Coffin, 1941; Cole, 1955; Goranson, 1965; Homans, 1961; Hovland & Weiss, 1951; Lefkowitz, Blake, & Mouton, 1955; Mausner, 1953, 1954; Mausner & Bloch, 1957; Raven & French, 1958; Wuster, Bass, & Alcock, 1961). For a variety of situations, some of which were summarized in an earlier chapter, people with higher status are more likely to influence others successfully. If we can generalize from this literature and from the experiment reported, it appears that the higher status experimenter, in part because of his greater competence, more markedly affects his subjects’ responses in the direction of his hypothesis. We shall have more to say about this formulation a little later. For now it is interesting to note that Pfungst (1911) had also found a relationship between the questioner’s status or “air of authority” and the likelihood of his getting the correct response from Clever Hans. Even with a horse as subject, unintentional influence was more likely when the experimenter was more self-assured.
In the experiment by Vikan-Kline the status ascribed to the experimenters was confounded with their age, with their actual status, and possibly with the techniques of influence employed.
A clearer understanding of the effects of status per se would require an experiment in which the status of the same experimenter was varied experimentally and without his knowledge. The experiment by John Laszlo, referred to in an earlier chapter, employed three experimenters to administer the photo-rating task to 64 subjects. Half the time experimenters were led to expect ratings of success (+5), and half the time they were led to expect ratings of failure (−5). Half the subjects in each of these conditions were told that their experimenter was “just a student,” and the remaining subjects were led to believe the experimenter was of higher status. The subjects had all been administered Rokeach’s (1960) scale of dogmatism. Table 14-8 shows the magnitude of expectancy effect as a function of the status ascribed to the experimenter and the subject’s level of dogmatism. More-dogmatic subjects showed a greater susceptibility to the experimenter’s unintended influence, as had been expected (p < .05, two-tail). Surprisingly, however, it was the experimenters with lower ascribed status who showed the greater expectancy effects, regardless of the level of their subjects’ dogmatism (p < .05, two-tail).

Table 14-8 Expectancy Effects as a Function of Experimenter’s Status and Subjects’ Dogmatism

                         Experimenter status
Subjects’ dogmatism      High       Low        Difference
High                     +.83      +1.69         −.86
Low                      −.88       +.09         −.97
Difference              +1.71      +1.60
Perhaps the subjects of this experiment felt sorry for the experimenter who was “just a student” and, therefore, were more willing to be influenced by him. In any case, it appears from this study that the status effects obtained by Vikan-Kline were probably not due to the labeling of her experimenters as students or as professors. More likely, the different appearance of her high- and low-status experimenters and possibly differences in the degree of self-assurance shown by her faculty and student experimenters accounted for her findings.
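The marginal entries of Table 14-8 follow arithmetically from its four cell means, and a short calculation makes the pattern easy to check. The sketch below is illustrative only: the dictionary keys and function names are invented for this example, and the cell values are the tabled means, with the low-dogmatism, high-status cell taken as negative, as the tabled differences require.

```python
# Cell means of expectancy effect from Table 14-8, keyed by
# (subjects' dogmatism, ascribed experimenter status).
# The key names are hypothetical labels chosen for this sketch.
CELLS = {
    ("high", "high"): 0.83,
    ("high", "low"): 1.69,
    ("low", "high"): -0.88,
    ("low", "low"): 0.09,
}


def status_difference(dogmatism):
    """High-status minus low-status effect at one level of dogmatism."""
    return round(CELLS[(dogmatism, "high")] - CELLS[(dogmatism, "low")], 2)


def dogmatism_difference(status):
    """High-dogmatism minus low-dogmatism effect at one level of status."""
    return round(CELLS[("high", status)] - CELLS[("low", status)], 2)


if __name__ == "__main__":
    for level in ("high", "low"):
        print("dogmatism", level, "status difference:", status_difference(level))
    for level in ("high", "low"):
        print("status", level, "dogmatism difference:", dogmatism_difference(level))
```

Both status differences come out negative (the lower ascribed status goes with the larger effect) and both dogmatism differences come out positive, which is the pattern described in the text.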
Characteristics of the Laboratory
Riecken (1962) has pointed out the potential importance to the results of psychological research of the characteristics of the laboratory in which the experiment is conducted. It seems reasonable to suggest that the room in which the subject is contacted by his experimenter will convey information to the subject about the sort of person the experimenter might be. If a tour is undertaken of research rooms and offices used by graduate students and faculty members in a university setting, great individual differences may be observed. Some rooms look impressive, some look very professional, some very comfortable, some inordinately neat or bare. While room characteristics may reflect the status of the occupant, the occupant may also derive certain characteristics in the eyes of his subjects from the scene in which the experimental contact occurs.
The experiment to be reported now, conducted in collaboration with Suzanne Haley, was designed in part to vary characteristics of experimenters by the variation of the scenes in which they contacted their subjects. A total of 16 experimenters administered the standard person perception task to a total of 72 female undergraduate subjects. Most of the experimenters were males and enrolled either in the Harvard Law School (N = 9) or in the Harvard Graduate School in the area of the natural sciences (N = 7). Each experimenter expected half his subjects to perceive the photos of faces as quite successful and expected half his subjects to perceive them as quite unsuccessful. In an effort to reduce the overall magnitude of experimenter expectancy effects, a screen was placed between experimenter and subject so that during the course of the data collection they could not see each other. In order to control for experimenter recording errors, subjects recorded their own responses. Finally, in order to eliminate the effects of early data returns, experimenters were not permitted to see the responses made by any of their subjects.
The experimental interactions took place on a single evening in eight different rooms to which experimenters were randomly assigned. Most of these rooms served as offices for psychology graduate students and faculty members. Each room was rated after the experiment by all 16 experimenters on the following dimensions:
1. How professional is the room in appearance?
2. How impressive is the room, i.e., what is the status of the person who normally occupies it?
3. How comfortable is the room, especially from the subjects’ point of view?
4. How disorderly is the room?
Ratings were made on a scale ranging from zero (e.g., not at all professional) to 10 (e.g., maximally professional). The first three scales were found to be highly intercorrelated (mean rho = +.78), and a single scale of status was constructed by summing the scores on all three scales for each room. The mean reliability of these three scales was .89. The scale of disorder showed a correlation of only +.29 with the combined status scale and showed a reliability of +.99. Table 14-9 shows the correlations of room status and room disorder with magnitude of experimenter expectancy effect. Magnitude of expectancy effect was defined, as before, as the difference between mean ratings obtained from subjects believed to be success perceivers and those believed to be failure perceivers. The law student and graduate student experimenters are listed separately because the latter group showed significantly greater expectancy effects than did the former.

Table 14-9 Correlations between Room Characteristics and Expectancy Effects

Experimenters        No. of rooms    Status    Disorder
Law students              7           +.64       +.61
Graduate students         6           +.48       +.89
Means                                 +.58       +.77
Two-tail p                             .10        .02

For both samples of experimenters, the higher the status of the room in which the subject was contacted, the greater were the expectancy effects. This finding adds to our confidence in the hypothesis that, with the exception noted earlier, the higher the status of the experimenter, the greater his unintended influence on his subjects.
For both samples of experimenters, the greater the disorder in the experimental room, the greater were the effects of the experimenters’ expectancy. We saw in the last chapter that expectancy effects were likely to be greater when the experimenter was perceived as more likable and more personal. The disorderliness of an experimental room may have relevance to this dimension of interpersonal style. None of the rooms were disorderly to a chaotic degree. Within these limits the more disorderly room may be seen as reflecting the “living” style of a more likable and more personal experimenter.
At the conclusion of this experiment, the experimenters were told about the phenomenon of expectancy effects, shown published articles describing some of the earlier studies, and asked to repeat the experiment with a different sample of 86 female subjects.
In this repetition of the experiment no screens were placed between the experimenters and their subjects. Half the experimenters were asked to try to avoid the operation of expectancy effects, half were asked to try to maximize them. Within each of these conditions, half the experimenters were told that in the original study they had shown significant unintended influence. The remaining experimenters were told that they had shown no real expectancy effects. Table 14-10 shows the mean increase of expectancy effect for each of the four experimental conditions from the first study to the replication. Experimenters who were trying to influence their subjects showed significantly greater expectancy bias than did experimenters trying to avoid expectancy effects (p < .05, two-tail). (Somewhat surprisingly, however, even the experimenters trying to avoid them showed a tendency to increase their expectancy effects when the screens had been removed, p < .20.) Those experimenters who had been told that they had shown no expectancy bias tended to show a greater increase in expectancy effects than did the experimenters who were told they had shown expectancy bias (p = .11, two-tail). Perhaps those experimenters who believed they had biased the results of their first study felt chastised for it and made some special efforts to retard the communication of their expectancy to their second set of subjects, even when they had been instructed to maximize expectancy effects.

Table 14-10 Increase in Expectancy Effects for Four Kinds of Experimenters

                               Orientation to expectancy effect
Influence “history”            Maximize    Minimize    Difference
“Successful” influencer         +1.08       +0.23        +.85
“Unsuccessful” influencer       +1.33       +0.41        +.92
Difference                       −.25        −.18

Table 14-11 Room Characteristics and Expectancy Effects: Second Sample of Subjects

Experimenters        No. of rooms    Status    Disorder
Law students              7           −.07       +.25
Graduate students         5           −.30       −.90*

*p < .05, two-tail.

When the changes in magnitude of expectancy effect from the original to the replication study were examined separately for law student and graduate student experimenters, an interesting difference emerged. Among law student experimenters, 66 percent of the subjects were more influenced in the second study than in the first (p = .02, χ² = 6.25), disregarding the particular experimental conditions. Having learned about subtle communication processes between the time of the first and second samples, young lawyers may have felt it desirable for attorneys to be able to communicate subtly with other people. Among the graduate students, most of whom were in the sciences, the subtle communication process may have been not only less prized but perhaps even abhorred as a cause of spoiling experiments. Regardless of their experimental condition, these young scientists showed a significant decrease in expectancy effects in the second study.
Of their subjects, 73 percent were less biased than were the subjects of their first sample (p < .05, χ² = 4.54).
In the second stage of this experiment, laboratory room characteristics were again correlated with magnitude of expectancy effects. As can be seen from Table 14-11, the correlations obtained earlier were not replicated under the conditions of the second phase of the study. Among the law student experimenters the correlations tended simply to go toward zero. Among the graduate students, however, they tended to go in the opposite direction. If it is reasonable to think that our more science-oriented graduate students were trying to avoid the spoiling effects of their expectancy bias, the reversals of the correlations make sense. The higher status and more disordered rooms, then, still predicted the biasing effects of the experimenter’s expectancy, only now the science-oriented graduate students expected, and hoped, to obtain data opposite to that which they had been led to expect. Their “real” expectancy may have become that which would avoid the effects of the induced expectancy, and subjects were more inclined to go along with these, now reversed, unintended communications in the higher status, more disordered laboratories.
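The two chi-square values quoted for the change from the first to the second study (6.25 and 4.54) can be related to their probability levels with a short calculation. The sketch below assumes one degree of freedom and a test of the observed proportion against an even split; the text does not say how the statistics were computed, so both points are assumptions, and the function names and the example counts are hypothetical.

```python
import math


def chi_square_even_split(count_a, count_b):
    """Chi-square (1 df) for testing two observed counts against a 50/50 split."""
    total = count_a + count_b
    expected = total / 2.0
    return ((count_a - expected) ** 2 + (count_b - expected) ** 2) / expected


def chi_square_p_df1(chi_square):
    """Upper-tail probability of a chi-square value with 1 degree of freedom.

    With 1 df the statistic is a squared standard normal deviate, so the
    tail area can be written with the complementary error function.
    """
    return math.erfc(math.sqrt(chi_square / 2.0))


if __name__ == "__main__":
    for reported in (6.25, 4.54):  # chi-square values quoted in the text
        print(reported, round(chi_square_p_df1(reported), 3))
    # Hypothetical counts, not taken from the study:
    print(chi_square_even_split(30, 10))
```

Under these assumptions the probabilities returned for 6.25 and 4.54 fall below the .02 and .05 levels given in the text.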
15 Behavioral Variables
In the last chapter a number of experimenter characteristics were shown to be related to the operation of experimenter expectancy effects. The particular characteristics discussed could all be assessed before the experimenter entered his laboratory. In this chapter, the discussion of experimenter variables will continue, but now the emphasis will be on the experimenter’s behavior in his interaction with the subject. Much earlier, in Part I, we saw that various structural variables such as the sex of the experimenter were correlated with the behavior of the experimenter as he interacts with his subjects. The behavioral variables to be discussed in this chapter, therefore, are not independent of the more structural variables discussed in the last chapter, but they do warrant special attention. The communication of expectancy effects to subjects must depend on something the experimenter does. If experimenters of a certain kind, as measured before the experiment begins, exert greater expectancy effects on their subjects, it is very likely due to their behaving differently toward their subjects. The observations of the experimenter’s behavior to be discussed now come from two major sources. The first of these sources is the direct observation of experimenter behavior usually by the subject himself. The second of these sources is the observation of experimenter behavior by a variety of observers of sound motion pictures of experimenters interacting with subjects.
Direct Observations of Experimenter Behavior
In the chapter dealing with the effects of excessive rewards, some preliminary data were presented which gave some idea of how subjects perceived the behavior of their experimenter. The subjects’ ratings of their experimenter’s behavior were intercorrelated and cluster analyzed (Fruchter, 1954). In three subsequent experiments, this procedure was repeated. Table 15-1 lists the clusters and B-coefficients of the original study as well as those of the later experiments. A B-coefficient greater than 1.5 may be interpreted as indicating a significant cluster, and a B-coefficient of 3.0 indicates a relatively tightly intercorrelated set of variables. The mean B-coefficients for the four studies show that the first three clusters held up quite well, but Cluster IV, composed as it was of only three variables, tended to vanish. The first three clusters may prove to be a useful way of organizing subjects’ perceptions of their experimenter’s behavior in future studies. They may also be of value in other investigations of interpersonal perception. They have not, however, proved themselves related to magnitude of experimenter expectancy effects in most of the studies carried out.

Table 15-1 Cluster Analyses of Four Samples of Subjects’ Perceptions of Their Experimenter’s Behavior

                                                     Clusters
                                       I          II           III        IV
                                       Casual     Expressive   Kinesic    Enthusiastic
Investigators                          pleasant   friendly     cluster    professional
RR, KF, JF, LVK, 1960                  6.48       3.97         9.10        3.55
RR, GP, LVK, RM, 1963, Sample I        4.43       3.00         3.77       −0.58
RR, GP, LVK, RM, 1963, Sample II       1.69       2.33         4.86        0.33
RR, JF, CJ, KF, TS, RW, LVK, 1964      0.67       6.01         9.12       −2.73
Mean                                   3.32       3.83         4.83        0.14

For five samples of experimenters, their subjects’ perceptions of their behavior were correlated with the degree to which the experimenters exerted expectancy effects upon their subjects. Experimenters who obtained data closest to that which they had been led to expect were ranked as most biased. Those who obtained data most unlike that which they had been led to expect were ranked as least biased. Three samples of experimenters employed the photo-rating task in which each experimenter contacted his subjects individually (Rosenthal, Fode, Friedman, & Vikan-Kline, 1960; Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). The data for the first sample were those presented in the chapter dealing with the effects of excessive reward. A fourth sample of experimenters also employed the photo-rating task, but the experimenter contacted subjects in groups of five (Rosenthal, Friedman, Johnson, et al., 1964). The fifth study employed a standard verbal conditioning paradigm, in which subjects were rewarded with a “good” for the correct choice of pronoun in forming a sentence. Although differential expectancies were created in experimenters leading them, presumably, to emit unintended cues to their subjects, they also were intentionally trying to influence their subjects by their contingent saying of “good.” In this experiment, therefore, both intentional and unintentional influence processes were operating (Rosenthal, Kohn, Greenfield, & Carota, 1966). In this study, each experimenter contacted his subjects individually. The average sample size for each of the five experiments was about 20 experimenters and about 100 subjects.
Table 15-2 shows the four variables that were least situation-specific in their correlation with magnitude of expectancy effect. None of the correlations was impressively high, though all held up as statistically significant over the five replications. Also heartening was the fact that each of the four variables represented a different cluster. As we would expect, therefore, the intercorrelations among these variables were generally low. The only exception to this was the mean correlation of +.40 between professional and businesslike. Considering these two variables together it appears that for a variety of situations of unintended and intended influence, of individual and group contact with subjects, in person perception and verbal conditioning tasks, the experimenter with the more professional manner is more likely to exert his influence on his subjects. This finding is consistent with those presented in the preceding chapter in which experimenters with higher status, as independently determined, showed greater influencing of their subjects’ responses.

Table 15-2 Subjects’ Ratings of Their Experimenter’s Behavior and Magnitude of Expectancy Effect: Applicable Under All Conditions

Behavior            r       df     Two-tail p
Businesslike       +.31     105      .005
Expressive voice   +.26     105      .01
Professional       +.25     105      .01
Use of legs        +.22     105      .03

An interesting extension of this finding was possible from one of our studies. Subjects’ ratings of how professional their experimenter appeared correlated significantly (+.59, p = .05, two-tail) with the degree of expectancy effect exerted by the investigators who had trained these experimenters. It may be that more professional-mannered experimenters train their assistants to be more professional-mannered and therefore more influential. In this study, which has already been referred to (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963), research assistants were randomly assigned to investigators. In the real-life situation, selection factors no doubt combine with training effects to make research assistants and other colleagues more like each other than would be true of any random set of experimenters. This diminished variability due to selection and training may serve to make the results of research coming from any laboratory less variable than we would expect from randomly chosen experimenters, even with techniques of research held constant.
The correlation between experimenter expectancy effects and “expressive voice” suggests that part of the communication to the subject of what it is the experimenter expects him to do is carried by the inflection and tone given to the verbal instructions to subjects. Similarly, the correlation between experimenters’ expectancy effects and use of legs suggests that movement or kinesic patterns also play a role in the mediation of experimenter expectancy effects (and in other situations of interpersonal influence as well, as Birdwhistell, 1963, has suggested).
Why leg movements in particular should serve this function is not at all clear. All experimenters have been rated by their subjects as showing relatively few movements of any sort. In particular, leg movements are the least frequent of the infrequent movements. Perhaps because they are rare, very minor leg adjustments are more noted by subjects and responded to. Exactly what sort of leg adjustments are employed unintentionally by the experimenters, and what their immediate effect on the subject might be, is a question for further research.
The variables just discussed were those which held across all five samples of experimenters. When we omit the verbal conditioning sample, leaving us only those samples in which experimenter influence was more likely to be purely unintentional and in which the photo-rating task was employed, several additional variables become significant. Table 15-3 shows these additional characteristics.

Table 15-3 Subjects’ Ratings of Their Experimenter’s Behavior and Magnitude of Expectancy Effect: Applicable When Influence Is Unintentional Only

Behavior            r       df     Two-tail p
Important-acting   +.33     39       .04
Relaxed            +.27     48       .07
Head gestures      +.23     48       .11

That experimenters who are perceived as acting more important should show greater unintended influence is related to the more general finding of professional-mannered experimenters’ exerting greater influence. The mean correlation of this variable with the variable “professional” for these samples was +.42. For the person perception task, the use of head movements became an additionally possible source of cues to subjects. For these samples the mean correlation between use of head and leg movements was only +.26, suggesting that it was not simply greater movement but differential movement of body areas that served as sources of information to subjects as to what it was the experimenter expected from them.
For these samples of experimenters, those who appeared more relaxed (vs. nervous) were more effective expectancy communicators. Freedom from tension is consistent (r = +.56) with the picture of the more professional, higher status experimenter. In addition, however, this finding suggests that the movement patterns we have discussed were not gross, random activity patterns, as we might expect from an anxious experimenter, but rather more finely differentiated patterns of kinesic activity. The mean correlation between “relaxed” and use of head gestures was only −.06.
When we omit from our five samples only that sample in which experimenters contacted subjects in a group situation, the additional variables shown in Table 15-4 emerge as significant. When contact between experimenter and subject is one-to-one, those experimenters showing greater interest influence their subjects more, regardless of which task was employed and regardless of the degree of intentionality of the influence process. The slower-speaking experimenter probably can better give differential emphasis to the instructional proceedings, thereby giving subjects more information from which to “decide” what responses are expected by the experimenter. None of the experimenters are perceived as being very enthusiastic, but those who are somewhat more so, influence their subjects more.
This would probably not be true for samples of overenthusiastic experimenters, who would likely be seen by their subjects as not too businesslike or professional. At lower levels of this variable, enthusiasm is related strongly to degree of experimenter interest (r = +.58) and significantly, but less strongly, to professional manner (r = +.35).

Table 15-4 Subjects’ Ratings of Their Experimenter’s Behavior and Magnitude of Expectancy Effect: Applicable Only When Interaction Is Dyadic

Behavior            r       df     Two-tail p
Interested         +.42     84       .001
Slow-speaking      +.28     84       .01
Enthusiastic       +.20     84       .07

When we consider only those three samples of experimenters who influenced their subjects only unintentionally and in a face-to-face interaction, two additional variables emerge as related to degree of expectancy effects. The more important of these was the variable “personal-impersonal.” Most experimenters are rated as neither very personal nor very impersonal, but a bit toward the latter end of the scale. Among these samples of experimenters, those who exerted more unintended interpersonal influence were perceived as significantly more personal (r = +.46, df = 27, p = .02). That more “personal” experimenters should be more influential makes rather good sense, given the face-to-face nature of the interaction in these experiments. A less sensible finding was that experimenters seen as more courteous tended to influence their subjects a little less (r = −.30, df = 27, p = .11). On the average all experimenters were viewed as very courteous, and ratings on this variable showed the lowest variability (S.D. = 2.2) of all 27 variables. Given this high general degree of courtesy it may be that extreme courtesy was perceived as aloofness by subjects who were more readily influenced by a more personal experimenter.
In two of the three samples of experimenters we have just been discussing, the interaction between experimenters and subjects was monitored by another experimenter who had trained the data collector. This observer simply sat in on the interaction and rated the experimenter’s behavior on the same variables employed by the subjects. Table 15-5 shows the only three variables that gave promise of showing a relationship with the magnitude of experimenter expectancy effects. Most surprising was the finding that those experimenters who were rated as using fewer arm movements influenced their subjects more. The surprise was due to the generally high positive correlations among all the movement variables, two of which had already been shown to be positively related to experimenter expectancy effects.
A tentative explanation of this finding is that, to an external observer of the dyadic interaction, excessive movement by the experimenter is interpreted as nonpurposive and, therefore, a reflection of both tension and an unprofessional manner. Behaving consistently in interaction with subjects is one of the qualifications of a competent, professional experimenter. The obtained correlation of +.39 between behaving consistently and exerting expectancy effects upon subjects further strengthens the emerging picture of the more competent, professional experimenter’s showing the greater expectancy effects. The slower-speaking experimenter’s greater opportunity to convey information to the subject about the experimenter’s expectancy has already been discussed. This hypothesis appears to hold up regardless of whether the observation of “slow-speaking” is made by a participant-observer subject or by an external observer. Before leaving this section, it must be emphasized that these external observations were made in the situation of face-to-face contact between experimenter and a single subject and might not hold for the situation in which several subjects are contacted as a group.

Table 15-5 Observers’ Ratings of Experimenter’s Behavior and Magnitude of Expectancy Effect: Applicable Only When Interaction Is Dyadic

Behavior               r       df     Two-tail p
Arm gestures          −.41     18       .08
Behaved consistently  +.39     18       .10
Slow-speaking         +.36     18       .13

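The link between a correlation, its degrees of freedom, and a two-tail p of the kind tabled in this chapter can be sketched numerically. The sketch assumes that the r values are ordinary product-moment correlations tested with the usual t statistic; the source does not state which test was used, so this is an assumption, and the function names and the Simpson's-rule integration are choices made for this illustration only.

```python
import math


def t_from_r(r, df):
    """Convert a correlation and its degrees of freedom to a t statistic."""
    return r * math.sqrt(df / (1.0 - r * r))


def t_density(x, df):
    """Probability density of Student's t distribution at x."""
    log_constant = (
        math.lgamma((df + 1) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * math.log(df * math.pi)
    )
    return math.exp(log_constant - (df + 1) / 2.0 * math.log1p(x * x / df))


def two_tail_p(t, df, steps=2000):
    """Two-tail p: one minus the area between -|t| and +|t| (Simpson's rule)."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    width = upper / steps  # steps must be even for Simpson's rule
    total = t_density(0.0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * width, df)
    central_area = 2.0 * total * width / 3.0
    return max(0.0, 1.0 - central_area)


if __name__ == "__main__":
    # r and df for the three rows of Table 15-5
    for r, df in ((-0.41, 18), (0.39, 18), (0.36, 18)):
        print(r, df, round(two_tail_p(t_from_r(r, df), df), 3))
```

Under this assumption the computed probabilities come out close to, though not identical with, the tabled values, which is about what rounding or a slightly different test would produce.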
Because our external observers were older, more sophisticated, and more professional, we might be tempted to regard their observations of the experimenters’ behavior as somewhat more “valid” than the observations made by the participant-observer subjects. At this stage of our knowledge to make such an assumption seems unwarranted. Perhaps because of their greater direct involvement and perhaps because of their less sophisticated, more implicit theories of human interaction, the subjects’ perceptions of their experimenters may be, in a sense, even more valid than the external observers’. Phenomenologically, the subjects were more really present during the interaction with their experimenter.
Returning now to subjects’ assessments of experimenters’ behavior, we find two additional variables to bear a relation to degree of expectancy effects. In the situation wherein groups of subjects are contacted by an experimenter, greater body movement by the experimenter is associated with the exertion of greater unintended influence (r = +.43, df = 22, p = .04). Whereas subtle movements of the legs and head may be sufficient to carry information to the subject when he is alone in his interaction with the experimenter, the more gross cues of total body movements may be required to convey equivalent information in the group situation. Individual subjects may not see the experimenter quite as well in the group situation, and, in addition, subjects may be emitting significant interpersonal messages to each other via their movement patterns which serve to distract attention away from signals emitted by the experimenter. In the group situation as in the dyadic, subjects rate all their experimenters as very honest. Those experimenters, however, who influenced their subjects more were seen as somewhat less honest than the less influential experimenters (r = −.33, df = 22, p = .12).
In some way, subjects seem able to sense the process of unintended interpersonal influence and evaluate it as undesirable.
In the verbal conditioning study wherein experimenters were in part intentionally attempting to influence their subjects, several additional variables bore a relation to degree of experimenter influence (Table 15-6). In this situation, compared to the person perception task employed in the other samples, all experimenters were very active. They talked (“good”) during the process of the subjects’ responding, whereas in the person perception studies, experimenters served only as recorders during the data production phase. Under these conditions of experimenters’ fairly obvious attempts to influence subjects’ responses, a louder experimenter might have been viewed as a brow-beating influencer who could best be dealt with by negative conformity. The more consistent behavior of the more influential experimenter is in accord with our evolving view of the more effective influencer as the more competent, professional experimenter. The most interesting correlation may be the negative relationship between experimenter’s influence and the degree to which he acts importantly. This relationship is opposite to that obtained under conditions of unintended influence. When the experimenter has already assumed the role of important reinforcer of desired responses, the still more important-acting experimenter may be seen as overbearing rather than simply important or high status. The “overinfluencer,” the experimenter who seems to push too hard, may be a less successful influencer than the more modest, professional experimenter who more quietly communicates his wishes to his subjects. For this sample of experimenters the correlation between “loud” and “important-acting” was +.40 (p < .005).

Table 15-6 Subjects’ Ratings of Experimenter’s Behavior and Magnitude of Experimenter Influence: Applicable When Influence Is At Least Partially Intentional

Behavior               r       df     Two-tail p
Loud                  −.27     58       .04
Behaved consistently  +.24     58       .07
Important-acting      −.22     58       .10

From all the data available based on subjects’ perceptions of their experimenters, four dimensions emerge that seem relevant to distinguishing experimenters who are more or less likely to exert the unintended influence of their expectancy upon their subjects:
1. Professional status. Experimenters who are more important, professional, businesslike, and consistent exert greater expectancy effects upon their subjects.
2. Interpersonal style. Experimenters who are more relaxed, interested, enthusiastic, and personal exert greater expectancy effects upon their subjects, but probably only so long as they maintain a professional manner, and do not permit the experiment to become a “social hour.”
3. Kinesic communication. Experimenters who employ subtle kinesic signals from the leg and head regions exert greater expectancy effects upon their subjects. These kinesic signals may still be effective at higher levels of overtness if subjects are not paying full attention to the experimenter. However, if the kinesic signals become very obvious, they are likely to lead to a diminution of expectancy effects, because they will detract from the professional demeanor of the experimenter.
4. Paralinguistic communication. Experimenters who speak slowly and in an expressive, nonmonotonous tone exert greater expectancy effects upon their subjects. The way in which the experimenter delivers his programmed input (instructions, greeting, leavetaking) probably serves to communicate his expectancy to his subject.
It is through the kinesic and paralinguistic channels of communication that the experimenter may convey the information to the subject as to what responses are expected. The greater professional status of the experimenter who unintentionally influences his subjects more may serve to legitimize for the subject his conformity to the experimenter’s expectancy. The more personal interpersonal style of the more influential experimenter may motivate the subject to want to fulfill the status-legitimized expectancy of the experimenter. If the experimenter becomes subtly bored, tense, or distant, the subject may subtly retaliate by disconfirming the experimenter’s expectancy, even though it may be perceived as legitimate. If the experimenter lacks professional status in the eyes of the subject, it may be irrelevant that he is interested and personal; he may be viewed as having no right to expect the subject’s conformity to his expectancy. If a high status, personal, interested experimenter cannot communicate effectively through the kinesic and/or paralinguistic channels, his influence will fail simply because the subject cannot learn what it is the experimenter really expects him to do. It seems likely that some experimenters communicate more effectively via the kinesic and some via the paralinguistic channels. Similarly, some subjects may be more influenceable simply because they are more accurate decoders of signals sent via the experimenter’s particular channel of ‘‘choice.’’
Book Two – Experimenter Effects in Behavioral Research
One difficulty with the data based on subjects’ perceptions of experimenters’ behavior must be emphasized. In all these studies, subjects assessed their experimenter’s behavior after they had made their responses for the experimenter. It is possible, then, that those subjects who felt they had been influenced by their experimenter went on to describe their experimenter, not as he was, but as he ought to have been for them to have been influenced by him. If this were the case we would have learned, not what sorts of experimenters influence subjects unintentionally, but rather what sorts of characteristics people ascribe to experimenters to justify their having been influenced. This too would, of course, be worth knowing. In any case, there is no way out of the dilemma created by asking subjects to assess their experimenters. If we asked subjects to describe their experimenter before they respond for him, the characteristics ascribed might easily serve as a basis for subjects’ ‘‘deciding’’ whether or not to accept the influence of the experimenter. The act of having ascribed high status and personalness to an experimenter may be reason enough for subjects to behave as though these attributes had a validity independent of their own assessment. Dissonance reduction cuts both ways.
Filmed Observations of Experimenter Behavior
Useful as it was to have subjects serve as observers of their experimenter’s behavior, it became apparent that external observers would be needed to tell us how the experimenters “really” behaved when interacting with their subjects. These external observations would be important in their own right, and, in addition, they could serve to validate or invalidate the hypotheses generated from the subjects’ observations of the behavior shown by experimenters exerting greater or lesser expectancy effects. From sitting in on experimenter-subject interactions it became clear that only a fraction of the behavior of an experimenter could be observed, recalled, and reported. Pfungst’s (1911) experience in tracking down the cues that questioners gave to his clever friend, Hans the horse, suggested too that “just watching” might be too coarse a methodological sieve with which to strain out the possibly tiny cues that communicated the experimenter’s expectancy to his subject. What seemed most needed was the opportunity to observe the experimenter-subject interaction, the possibility of reobserving it, and, then, observing it again. Sound motion pictures seemed to provide the best permanent record of how the experimenter behaved vis-à-vis his subject.
Reference to the sound films taken has already been made in earlier chapters. Now some of the details of the filming procedure are reported (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1962). There were five different samples of experimenters whose interactions with their subjects were filmed. All had in common that they employed the photo-rating task but differed in the specific hypotheses to which the studies were addressed. Not including an analysis of the films, the substantive results of these studies have been reported in the appropriate chapters of this book and, in somewhat different form, elsewhere (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964a; 1964b).
Altogether there were 24 male and 5 female experimenters who administered the photo-rating task to 164 subjects, of whom about 75 percent were females. Of the 29 experimenters, 24 were graduate students enrolled in a course in advanced educational psychology. The other 5 experimenters (2 males, 3 females) were advanced
Behavioral Variables
undergraduates enrolled in psychology courses. The subjects were undergraduates enrolled in elementary courses in psychology, education, English, history, and government. For the students enrolled in psychology courses, serving as subjects was a course requirement. Subjects from other courses were encouraged by their instructors to volunteer but were not required to participate. The subject population, therefore, was a mixed group of volunteers and nonvolunteers. Each experimenter contacted from three to eight subjects, and the mean number of subjects per experimenter was between five and six.
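The sample figures quoted above can be cross-checked against the per-sample counts given in Table 15-7. The following is only a minimal sketch of that arithmetic: the dictionary layout and variable names are illustrative assumptions, and the third sample is labeled B3 here, following the text.

```python
# Per-sample counts as reported in Table 15-7.
# The labels follow the text (A1, A2, B3, B4, B5); the data layout is an assumption.
samples = {
    "A1": {"experimenters": 4, "subjects": 23, "filmed": 15},
    "A2": {"experimenters": 6, "subjects": 37, "filmed": 24},
    "B3": {"experimenters": 7, "subjects": 40, "filmed": 20},
    "B4": {"experimenters": 8, "subjects": 48, "filmed": 22},
    "B5": {"experimenters": 4, "subjects": 16, "filmed": 10},
}

# Sum each column across the five samples.
totals = {
    key: sum(sample[key] for sample in samples.values())
    for key in ("experimenters", "subjects", "filmed")
}

# Mean number of subjects contacted per experimenter.
mean_subjects = totals["subjects"] / totals["experimenters"]

print(totals)
print(f"mean subjects per experimenter: {mean_subjects:.2f}")
```

Under these counts the column sums reproduce the reported totals of 29 experimenters, 164 subjects, and 91 filmed interactions, and the mean falls between five and six subjects per experimenter, as stated.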
Experimental Groups
Table 15-7 presents a summary of the characteristics of each of the five experimental samples, including the number of experimenters and subjects involved in each and the number of dyadic interactions that were filmed. A description of the five experimental groups follows.
Group A1. These experimenters had served as experimenters in an earlier study (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). In that experiment some experimenters had been led to expect either ratings of success or of failure from their subjects. In the present study, those experimenters who had earlier been led to expect mean photo ratings of +5 were led to expect mean ratings of −5. Those experimenters earlier led to expect ratings of −5 were now led to expect ratings of +5.
Group A2. These experimenters had also served as experimenters in an earlier study (Persinger, 1962). Some had been in a +5 expectancy condition, others in a −5 expectancy condition. In the present study, these experimenters were led to expect that some of their subjects would give photo ratings of +5, whereas other subjects would give photo ratings of −5. This condition and the preceding one had in common the fact that the experimenters had served as data collectors before in the same task. On the average, these experimenters had contacted about six subjects in their earlier data collection. The subjects contacted by these experimenters had all seen the photos to be rated when they were part of a standardization group. In the standardization study, subjects were shown the photos in their classrooms by means of an opaque projector.
Group B3. These experimenters had not served before in the role of data collector. Each experimenter’s group of subjects was divided into thirds. For the first third of
Table 15–7 Characteristics of Five Filmed Experiments

Experiment                   A1      A2      B3      B4      B5      Total
Experienced experimenter     yes     yes     no      no      no      mixed
Ego-involved experimenter    no      no      no      yes     yes     mixed
Subjects’ sex                mixed   mixed   female  female  male    mixed
N of experimenters           4       6       7       8       4       29
N of subjects                23      37      40      48      16      164
N of filmed interactions     15      24      20      22      10      91
his subjects contacted, each experimenter was given no expectation as to the photo ratings likely to be obtained from his subjects. The second third of these subjects were contacted with half of the experimenters led to expect +5 and half led to expect −5 ratings. For the final third of the subjects, each experimenter was led to expect ratings opposite to those he had expected from the second third of his subjects. All subjects in this group were females.
Group B4. This group was identical with the just preceding group except that when their expectations of +5 or −5 data were induced, these experimenters were told that the “prediction” of subjects’ ratings depended on the experimenter’s following instructions and proper experimental procedure. This sentence was intended as a very mildly ego-involving manipulation. All subjects in this group were also females.
Group B5. This group was identical with the one just preceding except that all the subjects were males rather than females.
For all groups, the experimental manipulation of experimenters’ expectancies was as follows:
According to several personality tests we have given the (next) subject(s), we are able to predict how they will rate the photos.¹ Some of these subjects tend to rate the photos, on the average, extremely high; and some tend to rate them, on the average, extremely low. The (next) subject(s) that you will run should average a +5 (or −5) which is a pretty high (or low) average.
The experimenters serving in groups B3, B4, and B5 were assigned to these groups at random. Within all groups subjects were randomly assigned to their experimenters.
The Filming Procedure
All experimenters contacted all subjects in the same experimental room. This room was very large, measuring about 50 by 20 feet. Experimenters and subjects sat near the entrance to this room near one of the 20-foot walls. Their chairs were arranged so that they would be partially facing each other and partially facing the far 20-foot wall.
¹ For the ego-involving manipulation, the following sentence was inserted here: “That is, if the experimenter follows his instructions and proper experimental procedure.”
At a point about 10 feet from this far wall a sound-insulated wall was constructed, shortening the experimental room and creating a smaller room for the placement of a camera. A Bach Auricon Pro-600 sound movie camera employing Kodak Tri-X Reversal Film, tripod-mounted, and equipped with a Pan Cinor “zoom lens” was the recording instrument. The camera was focused on the experimenter and subject some 30 feet distant through an 8 by 8 inch double-glassed window built into the specially erected wall. This window was equipped with a wooden shutter which was operated from the observation room. In the experimental room the window was camouflaged by a glass-front, false-backed bookcase containing both books and old empty picture frames. These frames were intended to simulate the frame of the observation window and to give the impression that old frames were simply being stored in the old bookcase. A microphone, connected to the camera audio system, was concealed in the false motor case of an 8-inch electric fan fastened to the ceiling directly above the table at which the experimenter and subject sat. The coaxial cable connecting the
microphone to the camera equipment was made to appear as an extension cord from the fan to a wall socket. This was accomplished by connecting a false cable between the point where the true cable entered the camera room wall and a nearby wall socket. In order to provide a system of communication between investigators in the camera room and those controlling the flow of subjects and experimenters, two army surplus field telephones were used. One telephone was located in the camera room and was positioned in such a way that the camera operator was able to manipulate the camera and telephone at the same time. The other telephone was located in a small room off the subjects’ waiting-room area, and both the telephone operator and the telephone were concealed from the subjects’ vision and hearing. For the purpose of testing the success of the camouflaging of the camera and microphone, a number of graduate students and faculty members sat in the chairs intended for experimenters and subjects, and tried to locate the camera and microphone. None were able to do so. When permitted to approach within a few feet of the window, however, they were able to see the camera through the window if the window shutter was fully open. For this reason, a number of tables were placed between the experimental chairs and the observation window so that if an experimenter wandered around the room between subjects, he could not approach too close to the window. As it turned out, few experimenters left their chairs, and none came close to the window. Out of the 91 experimenter-subject interactions filmed, in whole or in part, there were 4 or 5 subjects whose eyes dwelled on the bookcase long enough to make us fear their suspicions. When this occurred, the camera operator telephoned the subject router to conduct a postexperimental interview with that subject. 
These interviews suggested that while subjects were suspicious that the real intent of the experiment was being withheld, none hinted at a suspicion of being observed. Following these instances of overattention to the bookcase that camouflaged the window, we watched for an increased rate of looking at the bookcase by subsequent subjects. We could detect no such increase. Of course, had experimenters and subjects been aware that they were being observed, their behavior in the brief data-collecting interaction would no doubt have been affected.
All experimenters contacted all their subjects at a single sitting. The film capacity of the camera was such that not all the dyadic interactions could be filmed. Systematic sampling of the interactions was undertaken with an effort made to film the contact with one subject from the first third, one from the second third, and one from the final third of subjects contacted by each experimenter. This resulted in a good, though not perfect, distribution of subjects contacted by the experimenter with a +5, a −5, or no expectancy. The experimenter was the focus for the camera, and virtually every frame shows his or her face and trunk. Most of the time the subject’s face and trunk were also on camera, but whenever the camera “zoomed” in for a tighter close-up of the experimenter, the subject could not be seen. Because of the finding from subjects’ observations of their experimenter that leg movements might be important sources of cues, the camera moved back from time to time so that full length pictures of the experimenter and subject could be obtained. Most of the 91 interactions were filmed in their entirety, but sometimes the film ran out before the experimenter could finish obtaining the photo ratings from the last
subject scheduled to be filmed. Sometimes, too, an experimenter was very slow, or a subject was, so that to film the entire interaction would have meant losing the subsequent interaction with a subject of the opposite expectancy. Since one of the main reasons for this filming was to learn what experimenters did differently in their interaction with subjects from whom they expected different ratings, very lengthy interactions were interrupted. Sometimes, for example, an experimenter would make small talk before or after recording the “face sheet” data from the subject. Such small talk was sometimes not filmed. When subjects were very slow in making their ratings, the rating period was sometimes interrupted. There was good reason to believe that the communication of the experimenter’s expectancy occurred before the subject made his first response, the evidence for this to come in the next chapter. Therefore, interrupting the rating period was felt to be particularly preferable to losing the prerating period with another subject.
A Preliminary Analysis
There were 15 experimenters for whom films were available of their interaction with subjects for whom they had been given no expectancy. Experimenters’ behavior in interaction with these subjects, therefore, reflected their “typical” behavior in the experiment uncomplicated by the addition of any formal, uniform experimental hypothesis. To be sure, each experimenter may have entertained hypotheses about the responses to be obtained from each of these “practice” subjects, but these idiosyncratic expectancies were probably fairly randomly distributed among the experimenters. The five investigators served as the first observers and independently rated each experimenter contact on the following variables:
1. Dominance: the extent to which the experimenter was clearly in charge of the situation.
2. Liking: the extent to which the observer liked the experimenter.
3. Activity: the extent to which the experimenter manifested gross and nonessential movements.
4. Professional: the extent to which the experimenter showed professional “good form” in his role as experimenter.
5. Friendly: the extent to which the experimenter was friendly to his subjects.
For each variable, ratings could range from 1 (least possible) to 10 (most possible). No effort was made to be more precise in either the definitions of the variables or the rating scale. At most, these observations of experimenters’ behavior were designed to serve as sources of hypotheses. All five observers had been involved in the collection of this data, and there was no way of assessing the degree to which any of them might not be blind to (1) the experimental condition under which each subject had been contacted, and (2) the magnitude of any experimenter’s expectancy effect defined by the difference between data obtained from subjects under each condition of expectancy. All observers reported that ‘‘as far as they could tell’’ they were blind. Table 15-8 shows the correlation between the mean of the observers’ ratings and magnitude of expectancy effects as well as the intercorrelations among the ratings. In this analysis, the degree of expectancy effects may safely be regarded as the
Table 15–8 Observations of Experimenter Behavior and Magnitude of Expectancy Effect

Behavior        Expectancy effect   Dominant   Liking     Activity   Professional
Dominant        +.53**
Likable         +.54**              +.38
Activity        −.48*               −.44*      +.12
Professional    +.63***             +.38       +.36       −.65***
Friendly        −.03                +.12       +.60***    +.59***    −.19

* p ≤ .10   ** p ≤ .05   *** p ≤ .02
dependent variable, even though the subjects’ responses could be observed, since these observations of the experimenters were all made while practice subjects were being contacted; that is, no expectancies had as yet been induced in the experimenters. Greater expectancy effects were shown subsequently by those experimenters who were judged as more professional, more in charge of the situation, less hyperactive, and who were better liked by the observers. The intercorrelations among the ratings suggest that several of these four variables tend to cluster together. Hyperactive experimenters were seen as less professional and less in charge of the experimental situation. These three variables seem to constitute the professional status dimension discussed earlier (mean intercorrelation = +.50). Observers’ liking of experimenters shows a lower mean correlation with the professional status variables (+.29) and may be related to the interpersonal style dimension discussed earlier. Somewhat inconsistent was the finding that the variable of friendliness to subjects, which was correlated significantly with the liking variable, showed no relationship to magnitude of expectancy effects. Friendliness was significantly positively associated for this sample with hyperactivity and negatively, but not significantly, with professionalness. It may be that friendliness as judged by the observers was a hyperfriendliness which interfered with the professional business of the interaction. Some support for the finding, though not necessarily for its interpretation, comes from the recent report by Silverman (1965) that subjects are less responsive to the demands of a possibly too-friendly experimenter. The possibility that the data presented here might be contaminated by the observers’ exposure to criterion information was raised earlier. Some evidence is available, however, which suggests that these observers’ ratings may yet have a measure of validity.
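The two mean correlations quoted above (+.50 among the professional status variables, +.29 between liking and those variables) can be approximated from the entries of Table 15-8. The sketch below rests on two assumptions that the text does not state: that the signs were aligned by taking absolute values, and that the averaging was done through Fisher's z transformation. The function name is hypothetical.

```python
import math

def mean_correlation(correlations):
    """Average correlations via Fisher's z transform.

    Assumption: signs are aligned by taking absolute values, and the
    averaging method is Fisher's z; the source does not say how its
    means were computed.
    """
    z_values = [math.atanh(abs(r)) for r in correlations]
    return math.tanh(sum(z_values) / len(z_values))

# Intercorrelations among dominant, activity, and professional (Table 15-8).
status_cluster = [-0.44, +0.38, -0.65]

# Correlations of liking with dominant, activity, and professional.
liking_with_status = [+0.38, +0.12, +0.36]

status_mean = mean_correlation(status_cluster)
liking_mean = mean_correlation(liking_with_status)

print(f"professional status cluster: {status_mean:.2f}")
print(f"liking with status variables: {liking_mean:.2f}")
```

A plain arithmetic mean of the absolute values gives nearly the same figures, so the choice of averaging method matters little at this level of rounding.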
Subsequent blind observations of the films were carried out by Neil Friedman (1964) and Richard Katz (1964). Some of these observations served to anchor our cruder observations to more precise, uncontaminated ones. Thus, the more “dominant” of this sample of 15 experimenters observed when contacting subjects in the “no expectancy” condition were found to be significantly older and therefore higher status (Vikan-Kline, 1962) (r = +.54), less likely to show gross body activity (r = −.57), and more expeditious in instructing their subjects (r = +.40). More “professional” experimenters were found to make significantly fewer errors in reading their instructions to their subjects (r = −.57). More “active” experimenters were found to show more gross body activity (r = +.63). Better liked experimenters looked (r = +.64) and smiled (r = +.51) more at their subjects throughout the experiment. Finally, more “friendly” experimenters smiled more at
Table 15–9 Mean Duration of Three Stages of Experimental Interactions

Stages                      Preinstructional   Instructional   Rating   Total
Mean duration (seconds)     36.4               67.7            94.6     196.1
Standard deviation          8.8                10.8            34.3     42.3
Number of interactions      80                 85              65       65
their subjects during the entire experiment (r = +.87). The nature and magnitude of these correlations suggest that the more global and possibly contaminated observations might, after all, be reasonably valid, especially in view of the fact that every one of the correlations with the “anchoring” variables reported was higher than the highest interobserver reliability of the more global variables (r = +.34). In all the subsequent analyses of the filmed sessions, observers were kept uninformed as to the particular expectancy the filmed experimenter held for the responses of any of his subjects. All observations made of the films subsequently were made separately for the three stages of the experimental interactions. The first or preinstructional phase required the experimenter to obtain and record the subject’s name, age, sex, marital status, year in school, and major field of study. The second stage required the experimenter to read the standard instructions to his subject. The third stage was that during which the experimenter presented each of the 10 photos to his subjects and recorded their response for each. Very roughly, the first or preinstructional phase lasted only about half a minute; the instruction reading lasted only about a minute; and the rating period about a minute and a half. Table 15-9 shows the mean time in seconds required for each stage of the experiment. The standard deviations are also given as well as the number of dyadic interactions upon which each mean and standard deviation are based.
More Molecular Variables
Neil Friedman (1964) and Richard Katz (1964) undertook a very careful analysis of the appearance and behavior of the experimenters. Friedman observed for each experimenter the sort of clothing he wore, how often he smiled and glanced at his subject, how often he exchanged glances with the subject, how accurately he read the instructions to the subject, and how long each phase of the interaction lasted.
Katz observed for each experimenter his smiling, glancing, direction of gaze, head and body activity, body position relative to the subject, and the manner of holding the stimulus photo during the rating period. For most of the variables, observations were made not only for each of the three stages of the experiment but at many specific points within each of the stages. Therefore, the total number of observational variables involved was well over 200. Two separate analyses were made by both Friedman and Katz; one for the first two samples of experimenters combined (A1 and A2) and one for the last three samples of experimenters combined (B3, B4, B5). From the analyses completed so far of these data (Friedman, 1964; Friedman, Kurland, & Rosenthal, 1965; Katz, 1964; Rosenthal, Friedman, & Kurland, 1965), it appeared that the behavior of the experimenter during the instructional period was the best predictor of the magnitude of his subsequent expectancy effects. The degree
to which each subject was influenced by his experimenter’s expectancy was defined as follows: When the experimenter’s expectancy was for a +5 response, the subject’s magnitude of influenceability was his mean rating of the photos minus the grand mean photo rating of all subjects (not only those who had been filmed) for whom that experimenter had an expectancy for a −5 response. When the experimenter had an expectancy for a −5 response, the subject’s “bias” score was his mean photo rating subtracted from the grand mean photo rating of all subjects for whom that experimenter had an expectancy for a +5 response.
For just those experimenters of conditions B3, B4, and B5, all of whom were males, all of whom had no prior research experience, and all of whom contacted either male or female subjects but never both, only a single cluster of behaviors, all of them during the instruction period, predicted subsequent expectancy effects. Experimenters showed greater subsequent expectancy effects if they exchanged fewer glances with their subject (r = −.41, p = .02), read their instructions with fewer errors (r = −.42, p = .02), and required less time to read the standard instructions, particularly the first short paragraph (r = −.43, p = .02). The mean intercorrelation of this cluster of variables was +.37. This more businesslike, no-nonsense experimenter is just the kind we would expect, on the basis of all the earlier evidence, to be the more effective unintentional influencer (Friedman, Kurland, & Rosenthal, 1965).
The male and female experimenters of conditions A1 and A2 had all had prior research experience, and all contacted both male and female subjects. Correlations between these experimenters’ behavior and their subsequent expectancy effects were computed separately for male and female experimenters.
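The definition of the subject's influenceability or "bias" score given above can be sketched in a few lines. This is only an illustration of the verbal definition: the function name, argument names, and the ratings used in the example are hypothetical and are not data from the study.

```python
from statistics import mean

def bias_score(subject_ratings, expectancy, opposite_grand_mean):
    """Bias score for one subject, following the verbal definition in the text.

    subject_ratings: the subject's ratings of the photos.
    expectancy: +5 or -5, the expectancy the experimenter held for this subject.
    opposite_grand_mean: grand mean photo rating of all subjects for whom the
        same experimenter held the opposite expectancy.
    A positive score means the subject's ratings lay in the expected direction.
    """
    subject_mean = mean(subject_ratings)
    if expectancy == +5:
        return subject_mean - opposite_grand_mean
    if expectancy == -5:
        return opposite_grand_mean - subject_mean
    raise ValueError("expectancy must be +5 or -5")

# Hypothetical ratings of 10 photos, for illustration only.
plus_subject = [2, 1, 3, 0, 2, 1, 2, 3, 1, 1]          # experimenter expected +5
minus_subject = [-1, 0, -2, 1, -1, 0, -1, 0, 0, -1]    # experimenter expected -5

print(bias_score(plus_subject, +5, opposite_grand_mean=0.4))
print(bias_score(minus_subject, -5, opposite_grand_mean=1.0))
```

Because the comparison value is always the grand mean obtained under the opposite expectancy, the score has the same sign convention in both conditions, which lets scores from +5 and −5 subjects be pooled when correlating them with experimenter behavior.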
Among male experimenters, those who exchanged fewer glances with their subjects (r = −.59, p = .02) and required less time to read the instructions (r = −.52, p < .05) subsequently showed greater expectancy effects. These findings and the fact that the more unintentionally influential experimenters showed less gross body activity (r = −.66, p < .01) support still further the hypothesis that more professional experimenters show greater expectancy effects. In this replication, however, there was an unexpected reversal of the relationship between the accuracy of instruction reading and subsequent expectancy effects. Now it was the experimenters who made more errors in reading their instructions who subsequently showed greater expectancy effects (r = +.65, p < .01). A number of hypotheses were suggested to account for this reversal in terms of the different characteristics of the samples of experimenters, but none were tenable after further analyses (Rosenthal, Friedman, & Kurland, 1965). Among female experimenters of these same experimental conditions, none of the behaviors during the instruction-reading period predicted significantly the subsequent magnitude of expectancy effect. Compared to male experimenters there was a significant reversal of direction, however, in the correlations between their behavior and their subsequent expectancy effects. Female experimenters who exchanged more glances with their subjects (r = +.45) and were slower in reading their instructions (r = +.36) showed greater expectancy effects on their subjects’ subsequent responses. (There was no relationship between their accuracy in reading the instructions and subsequent expectancy effects.) Although the sample of female experimenters was too small to make much of this reversal, it may be reasonable to expect that female experimenters show more effective interpersonal influence when they are more interpersonally oriented.
That would be consistent with those theoretical formulations (Parsons & Bales, 1955) and those summaries of relevant data
(McClelland, 1965) that suggest that men function more typically and effectively by a greater stress on task orientation relative to women, who function more typically and effectively by a greater stress on socioemotional orientation. ‘‘Average’’ subjects in ‘‘average’’ experiments may be better able to respond to the subtle cues of experimenters who are playing out their socially expected role behaviors. These subtle biasing cues may tend to be overshadowed and obscured by any behavior that, by its unexpectedness, calls for all of the subject’s attention.
More Molar Variables
For the most part, the relatively more molecular observations described above were extremely reliable. The interobserver reliabilities showed a median correlation of over .80, with many of the variables showing nearly perfect (1.00) reliabilities (e.g., accuracy, speed of instruction reading, glancing at subjects). The analysis of all these data is far from complete and they have already shown their value. Yet, the utility of these more molecular variables as predictors of subsequent expectancy effects was somewhat disappointing in view of the relatively few correlations that reached significance out of the hundreds computed. It seemed likely that the more molecular variables were missing qualitative aspects of experimenters’ behavior which might have interpersonal communication value. A glance is not just a glance, in interpersonal communication, but rather a friendly glance, a dominating glance, an interested glance, an encouraging glance. Too little is known at the present time about the exact features of a facial expression or a body movement that make one glance or one smile different from another glance or another smile. But ordinary people in everyday life seem able to make these judgments. It was, therefore, decided to employ a sample of undergraduate students as paradigm observers. As members of the culture, they should be able to make the required judgments, not perfectly, but well enough. The particular judgments they were asked to make were the same global judgments that subjects had been making for some years about their experimenter’s behavior. These judgments, made on 20-point rating scales, had proven to be useful; and in any case, it would be good to know whether external observers of an experimenter’s behavior could predict his expectancy effect as well as subjects could postdict it. The basic variables were those presented in the chapter dealing with the effects of excessive reward.
Four additional variables were employed: “active,” “dominant,” “important-acting,” and “speaks distinctly.” The first two were added because they had been employed in the preliminary analysis of the films and had been found promising. The third was added because it had been employed in earlier studies and seemed relevant; the last was added for these analyses.
A total of ten undergraduate students, six females and four males, rated each experimenter’s behavior separately for the preinstructional period, the instruction-reading period, and the period in which the subject made his ratings of the level of success of the stimulus person. Three observers (one male, two females) made their ratings while watching the film and hearing the sound track. Four observers (one male, three females) made their ratings while watching the film but without hearing the sound track. Three observers (two males, one female) did not see the film at all but made their ratings solely from listening to the sound track.
The reason for the last two groups of observers was the thought that the meaning of a gesture or a tone might be thrown into bolder relief if it were not cross-referenced by another channel of communication. Single channel (i.e., visual or auditory) judgments could also be examined for discrepancy, and “channel discrepancy” has long been thought to be an important factor in normal and abnormal human communication (Allport & Vernon, 1933), though the evidence for this assumption has been at the anecdotal level (Bateson, Jackson, Haley, & Weakland, 1956; Ringuette & Kennedy, 1964).
Of the 29 experimenters whose interactions with their subjects had been filmed, only 19 (16 males and 3 females) were filmed in interaction both with subjects from whom they expected +5 ratings and with subjects from whom they expected −5 ratings. These 19 experimenters were observed in sessions with 48 subjects (33 females, 15 males) for whom they had one of these two opposite expectancies.
We shall consider first the behavior of the experimenter during just the instruction-reading period as a predictor of his subsequent expectancy effects. The first look was at the interobserver reliability within each of the three conditions of observation. Table 15-10 shows some of the median reliability coefficients (r) of the sets of observers under each condition of observation. For the 30 observations made with access to both communication channels, the highest median reliability obtained was only +.50, and the median of the 30 median reliabilities was only +.28. Because some of the observations were impossible under the other two conditions of observation (e.g., loudness of voice in the absence of sound track) the numbers of reliability coefficients possible for each channel are shown in Table 15-10. In all conditions of observation there were some negative reliabilities.
The median reliabilities were similarly depressing, and the maximum reliabilities were reminiscent more of validity than of reliability coefficients. In spite of these unencouraging findings, the means of the observers’ judgments within each condition were correlated with the experimenters’ subsequent expectancy effects. The results were surprising. Of the 77 correlations, 17 (or 22 percent) were significant at the .05 level (r ≥ .29). The correlations predicting expectancy effects from experimenter behavior were clustered separately within each condition of observation. Table 15-11 shows the variables constituting each of the five clusters obtained. Every variable included in any cluster showed a significant correlation with expectancy effect (p < .10), and each of these correlations is shown in Table 15-11. Table 15-12 shows for each of the five clusters its mean correlation with every other cluster, the strength or unity of the cluster expressed by the mean intercorrelation of the cluster’s variables with each other, and the mean correlation of the cluster of behavioral ratings with the magnitude of subsequent expectancy effects.

Table 15–10 Some Median Interobserver Reliabilities Under Three Conditions of Observation of Instructional Behavior

Reliability   Visual plus auditory   Visual only      Auditory only
              (30 variables)         (24 variables)   (23 variables)
Lowest        −.12                   −.26             −.10
Median        +.28                   +.27             +.12
Highest       +.50                   +.55             +.32

Table 15–11 Clusters of Experimenter Instructional Behaviors Predicting Subsequent Expectancy Effects

Cluster   Observation channel    Behaviors and their correlations (r) with expectancy effects
I         Visual                 Relaxed +.56, Likable +.46, Professional +.45, Honest +.40, Casual +.29
II        Visual                 Dominant +.32
III       Visual                 Businesslike +.33, Arm gestures not used +.25
IV        Auditory               Pleasant, Honest, Expressive voice, Active, Friendly, Personal, Not acting important: −.33, −.30, −.30, −.25, −.25, −.25
V         Visual plus auditory   Leg gestures −.44, Hand gestures −.37, Arm gestures −.34, Body gestures −.32, Trunk gestures −.31, Head gestures −.30, Nervous −.30

Table 15–12 Intercorrelations among Clusters of Experimenter Instructional Behavior

Cluster   II      III     IV      V       Mean r         Mean r
                                          intracluster   expectancy effects
I         +.13    +.19    .00     −.29    +.65           +.43
II        —       +.03    +.12    +.04    +1.00          +.32
III               —       −.10    −.46    +.51           +.29
IV                        —       +.07    +.59           −.28
V                                 —       +.51           −.34

The most significantly predictive cluster (I) might be labeled as the behavior of the likable-professional as perceived by paradigm observers judging only his visually communicated behavior. The likable-professional was relaxed (not tense), gave an honest visual impression, and was casual (“not pushy”) in his approach to his subjects. This constellation of behaviors is just the one we would expect on the basis of (1) our possibly contaminated preliminary analysis, (2) the data available from subjects’ perceptions of their experimenter, and (3) the more experimental evidence available from this research program and those of others. Clusters II and III, based also on the visual channel, strengthen our overall impression of the competent, professional experimenter as the one who shows greater expectancy effects, although both these clusters are independent of Cluster I and of each other.
Cluster IV, based as it was on only the sound track of experimenters reading the same written instructions to subjects, reflected the behavior of a pleasant, expressive-voiced experimenter. Such experimenters tended to show less expectancy effects, perhaps because such a tone made the interaction into more of an amiable social situation rather than a more task-oriented one. It was suggested earlier that such a too-friendly tone may interfere with the unintentional influence exerted by the experimenter. Cluster V was a “nervous activity” cluster; and behaviorally tense, hyperactive experimenters showed less expectancy effect.
Such experimenters could hardly be viewed as professional and competent, and we have seen that lacking this perception of the experimenter, subjects are unlikely to be influenced by their experimenter’s expectancy. The correlation between this nervous activity cluster and the “likable-professional” cluster was not high (r = −.29), so that it was possible to be professional and yet nervous. Such experimenters probably were also unable to influence their subjects’ responses unintentionally. The nervous hyperactivity would probably interfere with the subject’s decoding of the unintended message by providing a context of excessive “noise” or distracting inputs.
One of the reasons for having observers judge the behavior of experimenters using only a single channel of communication was to learn about the effects of channel discrepancy. Of the total of 30 variables on which behavior was judged there were 17 for which judgments were available from both groups of single-channel judges. Relaxed-nervous, for example, could be judged from silent films or sound track. The remaining 13 variables could be judged from only one of the two single-channel conditions of observation (e.g., expressive voice or leg activity). For each of the 17 variables judged from both visual cues alone and from auditory cues alone, a channel discrepancy score was computed simply by subtracting the mean rating assigned an experimenter in the auditory channel from the mean rating assigned that experimenter in the visual channel. A large discrepancy score on any variable simply means that the experimenter is rated higher on that variable in the visual than in the auditory modality. Table 15-13 shows those 7 of the 17 possible correlations between magnitude of channel discrepancy and magnitude of subsequent expectancy effects that reached a p < .10. All seven of these variables were highly intercorrelated, forming a cluster with a mean intercorrelation of +.52.
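The channel discrepancy score defined above can be sketched in a few lines. This is only an illustration: the ratings are hypothetical values on the 20-point scales, and the function name is not taken from the original analysis.

```python
from statistics import mean

def channel_discrepancy(visual_ratings, auditory_ratings):
    """Mean visual-only rating minus mean auditory-only rating
    for one experimenter on one variable."""
    return mean(visual_ratings) - mean(auditory_ratings)

# Hypothetical ratings of one experimenter on "relaxed" (20-point scale).
visual = [15, 14, 16, 13]   # four observers who saw the silent film
auditory = [9, 11, 10]      # three observers who heard only the sound track

score = channel_discrepancy(visual, auditory)
print(score)  # positive: rated more relaxed visually than vocally
```

A positive score means the experimenter was rated higher on the variable in the visual than in the auditory channel, which is the direction of discrepancy that the correlations in Table 15-13 relate to expectancy effects.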
This cluster is very similar to the likable-professional cluster reported earlier when the visual channel was considered by itself rather than in relation to the auditory channel. Apparently the most effective unintentional influencer must “look” like a likable-professional but must, in addition, not sound like one, perhaps because a too likable tone of voice while reading the instructions would detract from the task orientation of the interacting dyad. The subject is watching the experimenter, usually, while the instructions are being read, but the subject’s main task is to hear and understand these instructions.

Table 15–13 Visual Minus Auditory Channel Discrepancies in Experimenter Instructional Behavior and Subsequent Expectancy Effects

Behavior       r       p
Likable        +.44    .003
Relaxed        +.44    .003
Pleasant       +.42    .005
Honest         +.41    .007
Professional   +.33    .03
Casual         +.28    .07
Friendly       +.26    .08
Mean           +.37    .01

These channel discrepancies in behavior may have a very different meaning depending on the sex of the experimenter and the sex of the subject (Rosenthal, 1965c). Future work will be done in this area, but for now it may be interesting to see the effects of the subject’s sex in the channel discrepancies of these primarily male experimenters. The point biserial correlations between the sex of the subject and the degree to which the experimenter shows a given behavior more in the visual than in the auditory channel were computed. For this analysis interactions with 57 subjects were available, and there were 17 possible correlations. Table 15-14 shows the five correlations with p ≤ .10. All five of these variables were highly intercorrelated, forming a cluster with a mean intercorrelation of +.63. These experimenters showed greater “dominant-enthusiasm” in the visual than in the auditory mode when interacting with male subjects. When interacting with female subjects, they showed their enthusiastic-dominance relatively more in the auditory than in the visual channel. One wonders whether such a relationship may have relevance not only for a better understanding of the social psychology of the psychological experiment but also for a better understanding of interpersonal communication in general.

Table 15–14 Visual Minus Auditory Channel Discrepancies in Experimenter Instructional Behavior and Sex of Subject

Behavior       r       p
Dominant       +.53    .0001
Enthusiastic   +.35    .01
Interested     +.26    .06
Personal       +.25    .07
Professional   +.22    .10
Mean           +.32    .02

At least we know that the results obtained and just reported are not a unique function of the instruction-reading situation in a psychological experiment. Channel discrepancies of behavior in the half minute of the preinstructional period were also computed. This was that brief and more informal period during which the experimenter asked for and recorded the subject’s name, age, and other such data. Channel discrepancies were again correlated with the sex of the subject. For this analysis 50 subjects were available, and again there were 17 possible correlations. Table 15-15 shows the seven correlations with p ≤ .10. Again these variables were highly intercorrelated, and the cluster’s mean intercorrelation was +.45. Although this cluster of behaviors was not the same as that found during the instruction-reading period, it is similar enough that it may be regarded also as a “dominant-enthusiasm” cluster.
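A point biserial correlation is simply a Pearson correlation in which one variable takes only two values. The sketch below illustrates the computation under stated assumptions: the subject's sex is coded 1 for male and 0 for female (the text does not give the coding), the discrepancy scores are invented, and the function names are illustrative.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def point_biserial(binary, scores):
    """Point biserial r: Pearson r with one variable coded 0/1."""
    return pearson_r(binary, scores)

# Hypothetical data: 1 = male subject, 0 = female subject (assumed coding),
# paired with the experimenter's visual-minus-auditory discrepancy score.
subject_is_male = [1, 1, 1, 0, 0, 0]
discrepancy = [3.0, 2.0, 4.0, 1.0, 0.0, 2.0]

r_pb = point_biserial(subject_is_male, discrepancy)
print(round(r_pb, 2))
```

With this coding, a positive coefficient means the behavior was shown more in the visual than in the auditory channel with male subjects, which is how the positive values in Tables 15-14 and 15-15 are interpreted in the text.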
Even during this brief period experimenters showed greater enthusiastic-dominance in the visual than in the auditory mode when contacting male subjects, whereas with female subjects they showed relatively greater enthusiastic-dominance in the auditory channel.

Table 15–15 Visual Minus Auditory Channel Discrepancies in Experimenter Preinstructional Behavior and Sex of Subject

Behavior       r       p
Dominant       +.36    .02
Active         +.34    .03
Relaxed        +.31    .03
Enthusiastic   +.30    .04
Interested     +.27    .06
Businesslike   +.26    .07
Friendly       +.24    .10
Mean           +.30    .04

When channel discrepancy in the preinstructional period was employed as a predictor of subsequent expectancy effects, the results were less impressive than when channel discrepancy in the instruction-reading period had been employed. Still, the results were consistent, if less striking. Experimenters who showed greater interest in the visual than in the auditory mode exerted greater subsequent expectancy effects (r = +.30, p = .06). Experimenters who showed a more businesslike manner in the auditory than in the visual channel showed greater expectancy effects (r = −.30, p = .06). Although these correlations were not very significant statistically, they suggest an early form of the pattern which emerges more clearly in the instructional period. (The correlation between these variables was +.11, which makes the multiple correlation +.45, p = .02.)
When speaking of “channel discrepancy” the “discrepancy” has always been taken by subtracting the mean rating of an experimenter’s behavior judged only from the sound track from the mean rating of his behavior judged only from the silent film. Therefore, channel discrepancy has been a directional discrepancy. In clinical lore the importance of communication channels’ carrying opposite messages does not depend on a given direction of difference. It seemed interesting to see whether disregarding the direction of channel discrepancy would teach us something about the unintentional influence of an experimenter’s expectancy. We turn now to the correlation between the absolute discrepancy between the visual and auditory channels, sign of difference disregarded, and the magnitude of expectancy effects.
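The multiple correlation quoted in parentheses above can be checked with the standard two-predictor formula, R² = (r₁² + r₂² − 2·r₁·r₂·r₁₂) / (1 − r₁₂²). The sketch below assumes the interest discrepancy correlates +.30 with expectancy effects and the businesslike discrepancy −.30 (the direction implied by “more businesslike in the auditory than in the visual channel”), with a correlation of +.11 between the two predictors. The function name is illustrative.

```python
from math import sqrt

def multiple_r(r_y1, r_y2, r_12):
    """Multiple correlation of a criterion with two predictors,
    given the two validities and the predictor intercorrelation."""
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)

# Values from the text; the negative sign of the second validity is an
# assumption based on the stated direction of the businesslike effect.
R = multiple_r(0.30, -0.30, 0.11)
print(round(R, 2))  # 0.45
```

With these inputs the result rounds to .45, the value reported in the text; treating both validities as positive would instead give about .40.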
Considering the experimenter’s behavior during the brief preinstructional phase we find, in Table 15-16, that channel discordance in three behavioral variables (out of a possible 17) significantly predicted subsequent expectancy effects. Regardless of which modality, the visual or the auditory, conveyed the greater interest, enthusiasm, or professionalness of manner, the greater the disagreement between channels, the greater the subsequent expectancy effects. Perhaps such channel discordance so confuses the subject, perhaps even without his awareness, that he tries especially hard to “read” the unintended messages from the experimenter so that he may better learn what really is expected of him. The particular variables shown in Table 15-16 were well clustered, with a mean intercorrelation of +.53.

Table 15–16 Absolute Channel Discordance in Preinstructional Behavior and Subsequent Expectancy Effects

Behavior       r       p
Interested     +.46    .005
Enthusiastic   +.41    .01
Professional   +.35    .03
Mean           +.41    .01

When we turn to the instructional period for an examination of the absolute channel discordance as a predictor of expectancy effects we again find three significant predictors. Table 15-17 shows them, and we note that two of them, “relaxed” and “honest,” were also significant predictors when their algebraic channel discrepancies were considered. The correlation between the algebraic and absolute discrepancy for the variable “honest” was +.65, so perhaps this variable does not mean anything different from what we have seen before. However, the correlation between algebraic and absolute channel discrepancy for the variable “relaxed” was only −.07, so in this case we do have a different variable. During the instruction period experimenters who show less discordance of the visual and verbal channels in their tension level and in their enthusiasm level go on to exert greater expectancy effects. (The correlation between these variables was only +.18.) Perhaps during the instruction period, too much channel discordance, of a less systematic sort than is implied in algebraic discrepancy, confuses the subject at the very moment he is to receive the experimenter’s unintended cues to what the “right” answer might be. Unsystematic channel discordance may be a good way to get a subject to be attentive, but when it serves as background to unintended specific communications, it may serve as just so much noise.

Table 15–17 Absolute Channel Discordance in Instructional Behavior and Subsequent Expectancy Effects

Behavior       r       p
Relaxed        −.37    .01
Enthusiastic   −.36    .02
Honest         +.26    .08

Although we have discussed the question of channel discrepancies in the preinstructional behavior of the experimenter, we have not yet considered the preinstructional behaviors in each of the three conditions of observation that predict subsequent expectancy effects. Table 15-18 shows the highest, midmost, and lowest median reliability coefficient (r) of the sets of observers within each condition of observation. As in the case of the interobserver reliabilities of the rating of instructional behavior, the correlations are so low that one wonders how such variables can be predictive of anything.
It should be kept in mind that the mean of the observers’ ratings is a much more stable estimate of the experimenter’s behavior than is the rating of any single observer. Individual observer idiosyncrasies are probably canceled out in taking the mean observation as the definition of the experimenter’s behavior.
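The stabilizing effect of averaging can be illustrated with the Spearman-Brown formula, a standard psychometric result that the text itself does not invoke. The sketch below assumes the observers are interchangeable and treats the median interobserver correlations in Table 15-18 as stand-ins for the mean correlation between pairs of observers; the function name is illustrative.

```python
def spearman_brown(single_r, k):
    """Estimated reliability of the mean of k raters, given the
    typical correlation between pairs of single raters."""
    return k * single_r / (1 + (k - 1) * single_r)

# Median single-observer reliabilities from Table 15-18, used here as
# rough stand-ins for the mean pairwise correlation (an assumption).
both_channels = spearman_brown(0.16, 3)   # three observers, film plus sound
visual_only = spearman_brown(0.20, 4)     # four observers, silent film

print(round(both_channels, 2), round(visual_only, 2))
```

Under these assumptions a typical single-observer reliability of .16 rises to roughly .36 for the mean of three observers, and .20 rises to about .50 for the mean of four: still modest, but noticeably more dependable than any one observer's rating.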
Table 15–18 Some Median Interobserver Reliabilities Under Three Conditions of Observation of Preinstructional Behavior

Reliability   Visual plus auditory   Visual only      Auditory only
              (30 variables)         (24 variables)   (23 variables)
Lowest        −.16                   −.12             −.23
Median        +.16                   +.20             +.08
Highest       +.39                   +.41             +.45

Table 15–19 Clusters of Experimenter Preinstructional Behaviors Predicting Subsequent Expectancy Effects
Observation channels: visual; visual plus auditory

Cluster   Behavior               r
I         Likable                +.39
          Relaxed                +.34
          Personal               +.30
          Enthusiastic           +.24
II        Honest                 −.25
III       Not acting important   −.37
IV        Relaxed                +.42
          Dominant               +.31
          Active                 +.29
V         Leg gestures           −.31
          Trunk gestures         −.25

Table 15–20 Intercorrelations among Clusters of Experimenter Preinstructional Behavior

Cluster   II      III     IV      V       Mean r         Mean r
                                          intracluster   expectancy effects
I         −.06    −.26    +.25    .00     +.54           +.32
II        —       +.36    .00     +.21    1.00           −.25
III               —       −.30    +.34    1.00           −.37
IV                        —       −.17    +.46           +.34
V                                 —       +.42           −.28
In the preinstructional period none of the behavior observations made from the sound track alone were predictive of subsequent expectancy effects. Table 15-19 shows for the visual channel and for the visual-plus-auditory channels the clusters of behaviors predicting subsequent expectancy effects. All the variables shown predict expectancy effects at p < .10, and the magnitude of the predictive correlation is given for each variable. Table 15-20 shows the intercorrelations among the five predictive clusters which emerged, the mean intercorrelation of the variables forming a cluster (cluster unity), and the mean correlation of each cluster of behaviors with subsequent expectancy effects.
Although the preinstructional behavior of the experimenter showed five clusters predicting expectancy effects, just as did the experimenter’s instructional behavior, the clusters were not quite the same, nor were they composed of as many variables. (The comparison is between Tables 15-19 and 15-11.) In the visual mode, compared to his instructional behavior, the experimenter’s preinstructional behavior was as relaxed and likable but less professional, less dominant, and less “honest,” or scrupulous. The preinstructional period was less clearly defined for the experimenter than the instruction-reading period, and the general impression is that the more effectively influential experimenter is less formal during this less formal stage of the experiment than he will later become. In the visual-plus-auditory mode the more unintentionally influential experimenter again showed his lack of tension as well as showing that he was very much in charge of the situation.
The overall impression we have of the behavior of the experimenter who shows greater expectancy effects is that he is professional, competent, likable, and relaxed, particularly in his movement patterns, while avoiding an overly personal tone of voice that might interfere with the business at hand. When his interactions with the subject are not highly programmed by the design of the experiment, he relaxes his professional demeanor a bit, and perhaps engages his subject’s attention more by showing discrepancies between his movement patterns and his tone of voice. When his interactions with the subject are more formally programmed, he becomes more formal in manner and sends more congruent messages through his movement patterns and his tone of voice. The behavioral variables considered in this chapter, and the more structural variables considered in the last chapter, which are predictive of experimenter expectancy effects, may not be so different from the variables that predict other forms of interpersonal influence. Perhaps unintended social influence processes are governed by the same principles that govern intentional interpersonal influence. The major differences between intentional and unintentional influence processes may turn out to be associated with the system of communication or signal transmission employed. When the influence is intentional, as in studies of compliance, persuasive communication, or verbal conditioning, the signals of the influencer are both highly programmed and overt. When the influence is unintentional, the signals of the influencer are less overt and probably occur in a context of greater noise. In the next chapter we shall discuss the problem of signal transmission in the experimental situation.
16 Communication of Experimenter Expectancy
There is now a good deal of evidence bearing on the question of what structural and behavioral characteristics of the experimenter tend to increase the operation of expectancy effects. But perhaps the most important question of all remains to be answered. That question, of course, is how does the experimenter inform his subject what it is he expects the subject to do? Data from several studies, including the leisurely analysis of films, suggest that no gross errors are responsible. Experimenters do not tell their subjects in words or even in any obvious gestures what it is they expect from them. Errors of observation and of recording, although they do occur, occur so rarely as to be trivial to any explanation of experimenter expectancy effects.
Intentional Communication of Expectancies
If experimenters were asked to communicate their expectancy to their subjects we might hope that the cues they employ intentionally might be simple exaggerations of cues employed unintentionally in the real experimental situation. As a first step, however, it would be necessary to learn whether observers could accurately “read” the expectancy intentionally being communicated by the experimenter. Six graduate students and one faculty member of our research group served as the subjects of the first study. One of the graduate students administered the photo-rating task to the remaining subjects. The experimenter chose a number between −10 and +10 as the expectancy he would try to communicate to the subjects. The subjects’ task was to try to “read” the experimenter’s expectancy. Table 16-1 shows the results of two such attempts to read an intentionally communicated expectancy. Observers’ accuracy was significantly better than chance (p = .007, one-tail). Only 1 of the 12 judgments was seriously in error. The observers were unable to verbalize the source of the cues they had employed in making what were regarded as uncanny judgments. The possibility of extrasensory perception was somewhat lightly raised, and this possibility was tested. A standard deck of Rhine ESP cards was employed, but in the short runs employed no evidence for ESP emerged. No extended series of runs was required, since the ESP effect would have to emerge as significantly as our “reading” of experimenter cues if it were to serve as an explanation of these “readings.” Subsequent conversation with J. B. Rhine (April 4, 1961) revealed that he did not feel that ESP was a strong enough or predictable enough phenomenon to account for the communication of experimenters’ expectancies to their subjects.

Table 16–1 Subjects’ “Readings” of Experimenters’ Expectancies

            Experimenter’s expectancy
            −4.0      +3.5
Readings    +7.0      +5.5
            −2.0      +4.5
            −2.0      +4.0
            −2.0      +4.0
            −3.0      +3.5
            −3.0      +3.0
Means       −0.83     +4.08

The study just described was not, of course, a fair test of our general hypothesis. It included only a single experimenter who was free to choose expectancies that might well have been biased in some way. It became necessary, therefore, to employ more experimenters and to assign their expectancies at random. In addition, in order to permit more leisurely study of the communication of cues, a sound film record of the experimenters’ behavior was desirable.
The first film made was of three experimenter-subject interactions. Expectancies between −10 and +10 were randomly assigned to experimenters who tried to influence their subjects to rate the standard photos in the desired way but without being too obvious about it. This film was viewed by 52 observers who tried to read the experimenters’ expectancy and state their reason for their judgment. Three of the observers were faculty members, one was a representative of a publishing firm who happened by, and the rest were graduate students from a midwestern and an eastern university. The second film was of five experimenter-subject interactions. Expectancies were again assigned at random. This film was viewed by 11 observers. One observer was a faculty member, one the wife of a faculty member, and the rest were graduate students. Three of these had served as observers of the first film; otherwise there was no overlap of observers for the two films. Table 16-2 shows for each film the experimenter’s randomly assigned expectancy and the mean “reading” of that expectancy by the observers.
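The accuracy of such readings can be summarized with a rank correlation (rho) between the readings and the assigned expectancies. The sketch below is a minimal pure-Python version applied to the mean readings in Table 16-2, with entries lacking a plus sign taken as negative. It is an illustration only: correlations computed on mean readings are not the same as correlations computed separately for each observer, and the function names are hypothetical.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / sqrt(sxx * syy)

# Assigned expectancies and mean readings from Table 16-2.
film_1 = spearman_rho([-10.0, -1.0, 2.0], [-4.6, 1.3, 2.5])
film_2 = spearman_rho([7.0, 5.0, 3.0, -9.0, -9.0], [6.1, -1.5, -0.4, -5.3, -2.2])
print(round(film_1, 2), round(film_2, 2))
```

Tied expectancies, such as the two equal values in film II, receive the average of the ranks they span, which is why the correlation is computed on ranks rather than with the shortcut formula based on squared rank differences.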
For each film, a correlation (rho) was obtained between each observer’s “reading” of the experimenters’ expectancies and the actual expectancies. The median of the 52 correlations thus obtained for film I was +.88 (p < .00001). The median of the 11 correlations obtained for film II was +.72 (p < .001). These results leave little doubt that observers can “read” experimenters’ expectancies with great accuracy, at least when these are being deliberately communicated.

Table 16–2 Observers’ “Readings” of Experimenters’ Randomly Assigned Expectancies

          Expectancies
          Experimenters’   Guessed (mean)
Film I    −10.0            −4.6
          −1.0             +1.3
          +2.0             +2.5
Film II   +7.0             +6.1
          +5.0             −1.5
          +3.0             −0.4
          −9.0             −5.3
          −9.0             −2.2

It might be expected that when observers can agree so well on experimenters’ expectancies they would agree on the channel by which these expectancies were communicated. This was not at all the case, however. The numerous hypotheses advanced by the observers showed little agreement among themselves. Two major dimensions emerged, however, along which differences in hypotheses could be ordered: temporality and sense modality. Thus about half the hypotheses emphasized the experimenters’ reaction to subjects’ responses. For these observers, expectancy communication occurred only after subjects began responding and followed a differential reinforcement paradigm. For the other observers, expectancy communication occurred before the subject made even his first response. Within each of these schools of thought or observation there were some observations favoring a visual-kinesic mode and some favoring an auditory-paralinguistic mode of communication.
Table 16-3 summarizes the specific observations of behaviors that were hypothesized by our “reinforcement theorists” to increase the likelihood of desired responses and decrease the likelihood of undesired responses. Observers agreed that no two experimenters seemed to show the same patterns of differential reinforcement. In addition, it was their impression that the same experimenter employed different patterns as a function of the sex of subjects.

Table 16–3 Differential Reinforcers of Desired and Undesired Responses

Positive reinforcers                  Negative reinforcers
Smiling                               Head shaking
Head nodding                          Raising eyebrows
Looking happier                       Looking surprised
Looking more interested               Looking disappointed
Recording response more vigorously    Repeating response
                                      Pencil tapping
                                      Holding photo up longer
                                      Tilting photo forward
                                      “Throwing” photo down

Those observers who felt that experimenters communicated their expectancies to their subjects before subjects began responding produced two types of hypotheses. The first of these emphasized the manner of delivering instructions about the rating scale subjects were to use. When experimenters mentioned the anchoring points in the region of the expected data they were said to use greater emphasis, to stammer, to speak more slowly, to speak faster, to make more reading errors, and to point a little longer at the region of the scale including the desired data. The second type of hypothesis held by these observers emphasized the general atmosphere created by experimenters even before they came to the critical section of the instructions. Specific examples included the creation of a “positive” tone by experimenters who expected positive ratings and a “negative” tone by those who expected negative ratings. These observers also reported greater looking at subjects by experimenters who expected positive ratings and greater eye avoidance by experimenters who expected negative ratings. Table 16-4 gives a summary of the six most common “theories” of expectancy communication on the basis of the three major dimensions that differentiate them.

Table 16–4 Six Theories of Expectancy Communication

Theory   Temporality                 Modality   Specificity
I        After subject’s response    Visual     Specific cues
II       After subject’s response    Auditory   Specific cues
III      Before subject’s response   Visual     Specific cues
IV       Before subject’s response   Auditory   Specific cues
V        Before subject’s response   Visual     General atmosphere
VI       Before subject’s response   Auditory   General atmosphere

Additional data, bearing this time on the unintentional communication of experimenter expectancies, were kindly made available by Karl Weick (1963). He employed two experimenters, each of whom administered the photo-rating task to five introductory psychology students. One experimenter was led to expect positive ratings; the other was led to expect negative ratings. The entire experiment was conducted in front of Weick’s class in experimental social psychology. Table 16-5 shows the results of this study. The experimenter expecting higher ratings obtained higher mean ratings than did the experimenter expecting lower ratings (t = 2.93, p = .01, one-tail). For one of the experimenters, the classroom observers were unable to offer any clear hypotheses. For the other experimenter, the observers felt that when he obtained expected data he recorded them very rapidly but that he was slow to record unexpected responses. In addition, he was observed to “really stare” at his subjects in response to unexpected data. For one of these experimenters, then, the cues observed when the expectancy was communicated unintentionally were similar to those observed (at least by holders of Theory I, see Table 16-4) when the communication was deliberate.

Table 16–5 Weick’s Classroom Demonstration of Experimenter Expectancy Effects

                          Expectancy
                          +5        −5
Subjects’ mean ratings    +2.20     −1.35
                          +0.50     +1.10
                          +0.75     −1.35
                          +1.85     −0.65
                          +0.60     −0.25
Mean                      +1.18     −0.50

The data presented so far which suggest that small cues from the experimenter serve to communicate his expectancy to the subject are not without precedent. In an earlier chapter we discussed some of these now classic cases. There was Clever Hans (Pfungst, 1911), who could read the experimenter’s expectancy from his head, eyebrow, and nostril movements. There was the phenomenon of unintentional whispering to subjects in ESP research (Kennedy, 1938). A short time later Kennedy (1939) suggested that involuntary movements by experimenters in ESP research might provide subjects with kinesthetic, tactual, auditory, and visual cues to the expected response. Much earlier, Moll (1898) had discussed Wernicke’s warning of the involuntary cueing by muscle tremors of subjects in experiments in clairvoyance. The reading by subjects of cues in ESP research could apparently occur without their awareness and even at normally sub-threshold values of illumination (Miller, 1942).
There have been several serious investigations of the dramatic cue-reading ability of apparently hypersensitive individuals. Foster (1923) summarized five experiments designed to test the ability of a “sensitive” who could locate hidden materials by means of a divining rod. The results of these studies suggested that when all possible sources of cues from the experimenter and from observers were removed, the subject was no longer able to locate hidden items such as watches, coins, and water mains.
The eliminated cues included those of the visual-kinesic and auditory-paralinguistic modes.
Stratton (1921) reported an extensive series of experiments conducted on a famous ‘‘muscle reader,’’ Eugen de Rubini. Both E. Tolman and W. Brown were present during most of the series. The subject’s task was to select 1 of 10 books which had been chosen by one of the experimenters. Contact between this experimenter (guide) and the subject was established by a slack watch chain held by both. Although no observing experimenter could detect any sensory cue emitted by the guide to the subject, Rubini was able to select the correct item significantly (100 percent) more often than could be accounted for by chance. Even when the watch chain was not employed, the subject’s performance was 50 percent better than chance alone. When possible auditory cues were reduced, the subject’s performance actually improved, but when visual cues were reduced his performance deteriorated. Only when all visual and auditory cues were reduced drastically, however, was the subject’s performance clearly no better than chance. An unexpected finding was that when the guide tried consciously to help Rubini, the latter’s performance fell off. Apparently those cues actually used by this subject were not easily inferred, a finding in accord with the data presented earlier in this chapter.
Other workers have also suggested the importance to the experimental situation of unintended cues given off by the experimenter (Edwards, 1950). In his critique of Kalischer’s research with dogs, Johnson (1913) suggested that the animals could correctly anticipate the experimenter’s responses from his posture, muscle tonus, and respiratory changes. Johnson employed a control series in which the dogs could not
Book Two – Experimenter Effects in Behavioral Research
see the experimenter. These controls led to the disappearance of the animals’ alleged discriminatory ability. Among fifth-grade children, Prince (1962) found that a not so subtle marking of the experimenter’s data sheet served as a reinforcer of verbal behavior. A more subtle and unintended data-recording cue was reported by Wilson (1952). In a task requiring the discrimination of the presence and absence of a light, it was found that subject performance varied as a function of the data-recording system. That system yielding the best ‘‘discrimination’’ was one in which a longer pen scratch by the experimenter was associated with one of the alternatives. (A fuller discussion of some of these cases of communication by means of small cues is available elsewhere [Rosenthal, 1965a].)
Experimental Restriction of Communication
In the preceding section we saw the potential relevance for the communication of experimenter expectancies of the visual-kinesic and auditory-paralinguistic modalities. In this section data will be presented that were in part designed to show which of these two channels of communication might be the more important (Fode, 1960; Rosenthal & Fode, 1963). The standard photo-rating task was administered to 103 male and 77 female students in introductory psychology classes by 24 male advanced undergraduate engineers. In this study experimenters did not show each of the 10 photos to their subjects. Instead the entire set of photos was mounted on a rectangular board so that subjects could rate aloud the success or failure of the persons pictured without experimenters’ handling the photos. Six of the experimenters were randomly assigned to the control group and were led to expect photo ratings of −5 from their subjects. The remaining experimenters were all led to expect ratings of +5 and were randomly assigned to one of the following three experimental groups:
(1) Visual cues. These six experimenters were fully visible to their subjects but remained entirely silent after greeting subjects and handing them their written instructions.
(2) Auditory cues. These six experimenters were permitted to read their instructions to their subjects but were shielded from subjects’ view by sitting behind a screen immediately after greeting their subjects.
(3) Visual plus auditory cues. These six experimenters read their instructions to their subjects and remained in full view throughout the experiment. This group was identical with the control group in procedure and differed only in the induced expectancy.
Each experimenter contacted an average of 7.5 subjects.
Magnitude of expectancy effect was defined as the difference between the mean photo rating obtained by each experimenter of the experimental groups (+5) and the mean of the mean photo ratings obtained by the experimenters of the control group (−5). Table 16-6 shows the magnitude of expectancy effect for each experimenter of each experimental condition. Negative numbers indicate that those experimenters’ obtained ratings were in the direction opposite to their expectancy. The experimenters of the visual cue group showed no effect of experimenter expectancy. The auditory cue group of experimenters showed significant expectancy effects (t = 3.19, p = .005). The visual plus auditory group of experimenters showed
Table 16–6 Magnitude of Expectancy Effects Under Three Conditions of Cue Communication
Communication channel
         Visual    Auditory    Visual plus auditory
          .57       1.35        2.55
          .42        .98        2.28
          .09        .83        2.11
         −.16        .79        1.61
         −.64        .62        1.58
         −.71        .47        0.62
Means    −.07        .84        1.79
much more significant expectancy effects, and the magnitude of these effects was significantly greater than those of the auditory cue group (t = 2.63, p = .02). This finding suggests that a combination of the visual-kinesic and auditory-paralinguistic channels is most effective in the communication of experimenter expectancies. As to the differential effectiveness of the visual and auditory modalities, the data are equivocal. One interpretation suggests that auditory cues alone are more important than visual cues alone. (By ‘‘auditory’’ is meant, of course, the noncontent or paralinguistic aspects of speech, since all experimenters gave identical instructions to their subjects.) An alternative interpretation, however, suggests that the visual cue group’s mute behavior in their experimental sessions may have struck their subjects as so peculiar and ‘‘unnecessarily’’ unfriendly that they reacted with a negative conformity to their silent experimenters’ expectancy. Both of these alternative interpretations must at this time be regarded as more or less speculative. (We do know, however, from the work of Troffer and Tart [1964] that the auditory channel can be sufficient to communicate the expectancy of experimenters in hypnosis research.)
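As a check on the arithmetic behind Table 16-6, the short Python sketch below recomputes the channel means from the per-experimenter magnitudes. It is an illustration only, not the authors' analysis. The three negative signs in the visual column are an assumption: they are the signs that reproduce the tabled visual-channel mean. The function name `expectancy_effect` and the ratings used to demonstrate it are hypothetical.

```python
from statistics import mean

# Per-experimenter magnitudes of expectancy effect, as tabled.
# Assumption: the last three visual entries are negative, which is the
# sign pattern that reproduces the tabled visual-channel mean.
effects = {
    "visual": [0.57, 0.42, 0.09, -0.16, -0.64, -0.71],
    "auditory": [1.35, 0.98, 0.83, 0.79, 0.62, 0.47],
    "visual_plus_auditory": [2.55, 2.28, 2.11, 1.61, 1.58, 0.62],
}


def expectancy_effect(experimenter_mean, control_mean_of_means):
    """Magnitude as defined in the text: an experimental-group experimenter's
    mean photo rating minus the mean of the control experimenters' means."""
    return experimenter_mean - control_mean_of_means


# Mean magnitude per communication channel, rounded as in the table.
channel_means = {channel: round(mean(values), 2)
                 for channel, values in effects.items()}

# Hypothetical ratings, for illustration of the definition only.
example_effect = expectancy_effect(0.50, -0.25)

for channel, value in channel_means.items():
    print(f"{channel}: {value:+.2f}")
```

A negative value from `expectancy_effect` corresponds to ratings in the direction opposite to the induced expectancy.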
Temporal Localization
In the preceding section we discussed the role of two sense modalities in the communication of experimenter expectancy effects. Both modalities had been emphasized by observers of films of intentional cue production. In this section the question of when expectancy effects are communicated will be discussed. Some of the observers had suggested that the communication of expectancies occurred very early in the experimenter-subject negotiation. Others felt that this communication occurred only after the subjects began making their responses. These observers were essentially suggesting an operant conditioning paradigm with positive reinforcers emitted by experimenters in response to subjects’ emission of expected responses. Negative reinforcers were thought to follow the occurrence of unexpected responses. In the films in question these events undoubtedly did occur, since the experimenters were intentionally trying to influence their subjects and consciously employed nonverbal reinforcements. The purpose of the data analysis to be presented here was to learn whether under the more ecologically valid or more representative conditions of unintended influence, operant conditioning was necessary for the
operation of experimenter expectancy effects (Rosenthal, Fode, Vikan-Kline, & Persinger, 1964). From the experiments completed at the time, those three were selected for analysis that met the following criteria: (1) that experimenters contacted their subjects individually using the same photo-rating task and identical instructions, (2) that experimenters and subjects be in full view of each other throughout the experiment, and (3) that there be one group of experimenters led to expect photo ratings of +5 while another group was led to expect ratings of −5.
To test the hypothesis that operant conditioning, as commonly defined (e.g., Krasner, 1958; Krasner, 1962), was necessary for the operation of experimenter expectancy effects, the following analysis was made. For each of the three experiments, the mean photo ratings obtained of the first photo only were compared for the +5 and −5 expecting experimenters. If the differences in ratings obtained by the oppositely outcome-biased experimenters were as great on the first photo as for all ten photos, we could reject the ‘‘theories’’ of expectancy communication which require operant conditioning as a necessary condition, since no reinforcement was possible until after this first rating.
To test the hypothesis that operant conditioning augments experimenter expectancy effects, the following analysis was made. For each of the three experiments the mean ratings obtained by experimenters of each treatment condition were plotted for each of the ten photos in sequence. Magnitude of experimenter expectancy effects, defined as the difference in mean rating, should show an increase over time if operant conditioning served to augment the phenomenon. Table 16-7 shows the magnitude of experimenter expectancy effect (mean difference) for the first photo alone, and for all ten photos together, for each of the three experiments.
In addition, Table 16-7 shows, for each experiment, the t and p level (one-tail) as well as the number of experimenters and the mean number of subjects contacted by each. In the two studies having the larger number of experimenters, the magnitude of expectancy effect (mean difference) was somewhat greater for the first photo alone than for all ten photos combined. For all three studies combined, the grand mean difference based on the first photo alone was 1.34, and that based on all ten photos
Table 16–7 Effect of Experimenter Expectancy on Ratings of the First Photo Alone and of All Ten Photos
Experiment               I        II       III
N of Es                  10       12       8
N of Ss per E            21       7        11
First Photo Only
  Mean difference        0.71     2.16     0.89
  t                      1.76     1.50     0.74
  p                      .08      .08      .25
All Ten Photos
  Mean difference        0.50     1.79     1.30
  t                      2.19     4.97     3.94
  p                      .03      .0005    .005
was 1.23. This finding clearly indicates that operant conditioning is not necessary for the communication of expectancy effects. Masling (1965), in his study of examiner effects in influencing subjects’ Rorschach responses, was also unable to show that it was a pattern of examiner reinforcement that accounted for subjects’ biased responses. Comparison of the ts shown in Table 16-7 and their associated p levels does show that expectancy effects are more statistically significant when comparisons are based on the differences in ratings of all ten photos. This was due to the increased stability of the mean differences, resulting from their being based on ten times as many actual raw scores. The combined p of the mean differences based on first photos alone was < .03 (z = 2.01).
Figure 16-1 shows the grand mean photo ratings for each treatment condition for photos grouped in sequence, for all three experiments combined (unweighted). Inspection of this figure shows that magnitude of experimenter expectancy effects, defined as the difference in mean rating, changed very little over time (p > .50), suggesting that verbal conditioning need not serve to augment expectancy effects.
In our earlier discussion of Weick’s classroom experiment on expectancy effects we noted that the observers’ reports suggested the operation of differential reinforcement of expected and unexpected responses by subjects. It is of special interest, therefore, to note that in this study, too, significant experimenter expectancy effects emerged before any reinforcement was possible. The magnitude of expectancy effect on the very first photo was +5.00 (p < .02, one-tail, t = 2.84, df = 8).
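The grand means and the combined p reported for Table 16-7 can be reproduced from the tabled values. The Python sketch below is a reader's check resting on two assumptions that the text does not state: that the grand means weight each experiment by its number of experimenters, and that the three one-tail p levels were combined by summing standard normal deviates and dividing by the square root of the number of studies (the Stouffer method). Both assumptions happen to reproduce the reported figures. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Values from Table 16-7, in the order Experiments I, II, III.
n_experimenters = [10, 12, 8]
first_photo_diff = [0.71, 2.16, 0.89]
all_photos_diff = [0.50, 1.79, 1.30]
first_photo_p = [0.08, 0.08, 0.25]  # one-tail p levels


def weighted_mean(values, weights):
    """Mean of values weighted by weights (assumed: N of experimenters)."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def stouffer_z(one_tail_ps):
    """Combine one-tail p levels: convert each to a standard normal
    deviate, sum the deviates, and divide by sqrt(number of studies)."""
    normal = NormalDist()
    z_scores = [normal.inv_cdf(1 - p) for p in one_tail_ps]
    return sum(z_scores) / sqrt(len(z_scores))


grand_first = weighted_mean(first_photo_diff, n_experimenters)
grand_all = weighted_mean(all_photos_diff, n_experimenters)
combined_z = stouffer_z(first_photo_p)
combined_p = 1 - NormalDist().cdf(combined_z)

print(f"grand mean, first photo: {grand_first:.2f}")
print(f"grand mean, all photos:  {grand_all:.2f}")
print(f"combined z = {combined_z:.2f}, one-tail p = {combined_p:.3f}")
```

An unweighted average of the three mean differences does not match the reported grand means, which is why the weighting by number of experimenters is assumed here.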
Figure 16–1 Mean Photo Ratings Obtained in Sequence (mean rating plotted against photo numbers 1, 2–4, 5–7, and 8–10, with separate curves for the +5 and −5 expectancy conditions)
Figure 16–2 Mean Photo Ratings Obtained in Sequence (Weick’s Data) (mean rating plotted against photo numbers 1, 2–4, 5–7, 8–10, and 11–20, with separate curves for the +5 and −5 expectancy conditions)
Figure 16-2 illustrates that for Weick’s experiment the very first responses were more affected than were the subsequent responses. Magnitude of expectancy effect, after the first response, was fairly stable throughout the first 10 photos. In this study the standard 20-photo set was employed, and Figure 16-2 shows that for the last 10 photos, expectancy effects tended to diminish significantly (p = .08).
The fact that experimenter expectancy effects manifest themselves so early in the data collection process has important implications for the further study of the mediation of expectancy effects. It suggests that during the very brief period in which the experimenter greets, seats, and instructs his subject, the results of the experiment may be partially determined. The very special importance of the first few moments of the experiment has also been suggested by Kimble (1962) in experiments on eyelid conditioning and by Stevenson and Odom (1963), who found that the sex of the experimenter affected the subjects’ performance even though the experimenter left the subject after giving instructions and before the subject began responding. From what we now know about the ‘‘when’’ of the communication of expectancies it seems that in future studies we must focus our attention on the brief pre-data-collection phase of the experimental interaction in order to discover the ‘‘how’’ of the communication of expectancies.
The Problem of Signal Specification
Even if we knew the precise moment, if there were one, when the experimenter unintentionally signals his subject to respond in a certain way, and even if we could specify with near-certainty which experimenters would successfully influence their subjects’ responses, we would still be left with the basic riddle of the Clever Hans
phenomenon. Exactly what does the experimenter do differently when he expects a certain response compared to what he does when he expects the opposite response? No upward movement of the head, or eyes, no dilation of the nostrils, have yet been shown to be the critical signals to subjects as they were in the case of Clever Hans. Nor should we push that analogy too far. Hans, after all, had only to receive a signal to stop a repetitive movement, the tapping of his foot. A simple signal by our experimenters to their subjects that they were responding ‘‘properly’’ would not do as a hypothesis, since we saw earlier that the crucial communication occurs before the subjects’ first response.
The first attempt to see what experimenters did differently when interacting with subjects for whom they held opposite expectancies employed the original five molar variables described in the last chapter. For each of his interactions with a subject, each experimenter had been rated on his dominance, likability, activity, professional manner, and friendliness. The ratings made by the five original observers of an experimenter’s interaction with a subject from whom he expected negative ratings of the success of others (−5) could be subtracted from the ratings made of his behavior when contacting a subject from whom he expected positive ratings (+5). These difference scores tended to be quite small compared to the differences found among different experimenters. Table 16-8 illustrates this fact by showing for each variable the greatest obtained difference between the means of different experimenters and the greatest difference obtained by any single experimenter between subjects under the two experimental conditions. Those variables showing the greatest within-experimenter variation were also those showing the greatest between-experimenter variation (r = .91, p = .03).
Table 16-9 shows the mean differences in experimenter behavior vis-à-vis those subjects from whom photo ratings of +5 were expected and those from whom ratings of −5 were expected. These mean differences are tabulated separately for experimenters showing large positive expectancy effects (> +1.00), those showing no expectancy effects (< +1.00, > −1.00), and those showing large negative or reverse expectancy effects (< −1.00). None of these mean differences was significant at the .05 level for any of the variables listed. However, since 9 of the 15 ts computed had an associated p < .20 (too many to be reasonably ascribed to chance), it appears that on the whole experimenters behaved differently toward their subjects depending on whether they believed them to be success- or failure-perceiving subjects. The profile of difference scores of the positively biased experimenters was significantly opposite to the profile of the unbiased experimenters (rho = −1.00, p = .02), and tended to be opposite to the profile of the negatively biased experimenters (rho = −.70, p = .20).
Table 16–8 Maximum Obtained Differences in Experimenter Behavior
Behavior        Within experimenters    Between experimenters
Dominant         .60                    3.70
Likable          .80                    4.20
Activity         .90                    4.00
Professional    1.60                    4.95
Friendly        2.00                    4.80
Means           1.18                    4.33
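The correlation between within-experimenter and between-experimenter variation can be recomputed from the five rows of Table 16-8. The Python sketch below assumes that the reported r is an ordinary product-moment correlation across the five behavioral variables; that assumption reproduces the reported value. The helper `pearson_r` is written out by hand so that the example does not depend on any particular library version.

```python
from math import sqrt

# Maximum obtained differences from Table 16-8, in the order:
# Dominant, Likable, Activity, Professional, Friendly.
within = [0.60, 0.80, 0.90, 1.60, 2.00]
between = [3.70, 4.20, 4.00, 4.95, 4.80]


def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sum_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sum_xx = sum((a - mean_x) ** 2 for a in x)
    sum_yy = sum((b - mean_y) ** 2 for b in y)
    return sum_xy / sqrt(sum_xx * sum_yy)


r = pearson_r(within, between)
within_mean = sum(within) / len(within)
between_mean = sum(between) / len(between)

print(f"r = {r:.2f}")
print(f"means: within = {within_mean:.2f}, between = {between_mean:.2f}")
```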
Table 16–9 Differences in Experimenter Behavior as a Function of Expectancy
Magnitude of expectancy effects
                Positive effect (N = 8)    No effect (N = 8)       Negative effect (N = 7)
Behavior        Mean*    t       p         Mean*    t       p      Mean*    t       p
Dominant        +.04     —       —         −.18     1.94    .10    −.24     1.98    .09
Likable         −.25     1.68    .15       +.14     1.76    .13    −.19     1.66    .15
Activity        −.11     —       —         +.04     —       —      +.36     2.27    .06
Professional    +.19     —       —         −.20     1.11    —      −.31     1.55    .17
Friendly        −.52     1.60    .16       +.44     1.81    .12    −.11     —       —
* (Mean rating under ‘‘+5’’ expectancy condition) minus (mean rating under ‘‘−5’’ expectancy condition).
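The rank correlations among the three profiles of difference scores can be recomputed from the tabled means. The Python sketch below is a reader's check under two assumptions: that the tabled means without a plus sign are negative (the reading that agrees with the verbal description of the profiles), and that rho is the Spearman rank correlation in its no-ties form, rho = 1 − 6Σd²/[n(n² − 1)]. The helper names are illustrative.

```python
# Mean difference scores, in the order:
# Dominant, Likable, Activity, Professional, Friendly.
# Assumption: means tabled without a plus sign are negative.
positive_biased = [0.04, -0.25, -0.11, 0.19, -0.52]
unbiased = [-0.18, 0.14, 0.04, -0.20, 0.44]
negative_biased = [-0.24, -0.19, 0.36, -0.31, -0.11]


def ranks(values):
    """Rank values from 1 (smallest) upward; assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation using the no-ties formula."""
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


rho_pos_unbiased = spearman_rho(positive_biased, unbiased)
rho_pos_negative = spearman_rho(positive_biased, negative_biased)
rho_neg_unbiased = spearman_rho(negative_biased, unbiased)

print(f"positive vs. unbiased: {rho_pos_unbiased:+.2f}")
print(f"positive vs. negative: {rho_pos_negative:+.2f}")
print(f"negative vs. unbiased: {rho_neg_unbiased:+.2f}")
```

With only five variables a rho of 1.00 in absolute value requires the two rank orders to be exact mirror images, which is what the positive and unbiased profiles show.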
The profile of difference scores of this latter group did not differ from the profile of the unbiased group (rho = +.70, p = .20). Thus positively biased experimenters behaved in a relatively less professional but more friendly and likable manner toward those subjects they believed to be failure perceivers (‘‘−5’s’’). The unbiased experimenters showed just the opposite configuration of behavior and, in addition, behaved somewhat more dominantly toward their ‘‘−5’’ subjects. It appeared almost as though the positive biasing experimenters were trying to be especially nice to the subjects they believed to be failure perceivers, while the unbiased experimenters were just the opposite, perhaps from trying too hard to avoid treating their subjects differentially. The behavior of negative biasing experimenters seemed to vary most as a function of their expectancy. Like the unbiased experimenters, they were relatively less dominant and professional toward their ‘‘+5’’ subjects. But like the positively biased experimenters, they were also less likable toward their ‘‘+5’’ subjects. Unlike either of the other groups of experimenters, the reverse biasers behaved more actively vis-à-vis their ‘‘+5’’ subjects. The differential behavior of this group of experimenters might have been due either to their efforts to avoid affecting their data and/or to ‘‘faulty’’ cueing behavior toward their subjects.
It seems unlikely that a simple increase in ‘‘friendliness,’’ for example toward a ‘‘−5’’ subject, leads to obtaining the lowered photo ratings we would expect if bias were to occur. Table 16-10 shows the correlations between subjects’ photo ratings and the mean of their experimenter’s rated behavior. Higher ratings (disregarding expectancy) were obtained by experimenters who were more active, more friendly,
Table 16–10 Correlations between Subjects’ Photo Ratings and Their Experimenter’s Typical Behavior
Behavior        r       p
Dominant        −.34    .005
Likable         +.08    —
Activity        +.18    .13
Professional    −.24    .05
Friendly        +.20    .09
less dominant, and less professional (findings consistent with those reported in Part I of this book). It may well be that changes in experimenter behavior have entirely different meanings to their subjects depending on the usual behavior of that experimenter. We saw in an earlier chapter that more positively biased experimenters were different from less biased experimenters before they even began their experiment. With respect to the behavioral variables being considered here, the more biased experimenter is ordinarily more professional, dominant, likable, and less active.
On the basis of the data presented in this section, we still cannot say how changes in experimenter behavior lead to changes in their subjects’ responses. Of considerable interest, however, was the finding that experimenters’ expectancies lead to changes in their own behavior regardless of whether they bias their data positively, negatively, or not at all. Given that a more dominant experimenter subsequently exerts greater expectancy effects on his subjects, as we saw in the last chapter, if he becomes somewhat less dominant vis-à-vis a subject from whom he expects +5 ratings, that subject will tend to rate the photos as less successful, thereby leading toward negative or reversed expectancy effects. This can be seen from Table 16-9. Perhaps a simpler way of expressing this relationship is to give the correlation between the difference in behavior manifested toward subjects from whom different responses are expected and the difference in the responses subsequently obtained from them. For the behavior ‘‘dominant’’ this correlation was +.47 (p = .03, df = 21), which means that if an experimenter was more dominant toward a ‘‘+5’’ than toward a ‘‘−5’’ subject he tended to obtain higher ratings from his ‘‘+5’’ than from his ‘‘−5’’ subjects. Of the five behavioral variables under discussion this was the only one that reached statistical significance.
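For readers who wish to see how a correlation of this size relates to a significance level, the Python sketch below converts r and its degrees of freedom to a t statistic with the conventional formula t = r·sqrt(df / (1 − r²)). The text does not say which test was used, so the formula is an assumption, and the function name `t_from_r` is illustrative. The resulting t can be referred to a table of Student's t on the same degrees of freedom.

```python
from math import sqrt


def t_from_r(r, df):
    """Conventional t statistic for testing a correlation against zero.
    Assumption: df is the number of pairs minus two."""
    return r * sqrt(df / (1 - r * r))


# Correlation reported for differential dominance: r = +.47 on 21 df.
t_dominant = t_from_r(0.47, 21)

print(f"t = {t_dominant:.2f} on 21 df")
```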
Although more-dominant experimenters generally tend to obtain lower photo ratings from their subjects, when they show an increase in dominance they tend to obtain higher photo ratings. Why this should be is far from clear, but it does suggest the possible importance of the effects of the subject’s adaptation level (Helson, 1964).
The five behavioral variables being discussed were based on the judgments of five investigators whose ratings might have been contaminated. It was, therefore, important to see whether other observers who could not have been contaminated would make observations that would show similar correlations. In the last chapter, observations by uncontaminated observers were described. For the five behaviors we have been discussing, the mean ratings of experimenters contacting ‘‘−5’’ subjects were subtracted from the mean ratings of experimenters contacting ‘‘+5’’ subjects. These difference scores were correlated, as before, with the photo ratings made by ‘‘−5’’ subjects subtracted from photo ratings made by ‘‘+5’’ subjects. When judgments of experimenter behavior were made of only the brief preinstructional period, none of the correlations were significant, regardless of whether the behavioral observations were based on the silent films, the sound track, or the sound films. When judgments of experimenter behavior were made of the instructional period, by observers of the sound film, the only correlation to reach significance was the same one that reached significance when the possibly contaminated observers had been employed. Experimenters showing more-dominant behavior toward their ‘‘+5’’ subjects tended to obtain the expected higher photo ratings from them. The correlation was +.50 (p = .04, df = 16), very close to the value of +.47 obtained by the original observers. When the observations were based on the silent films alone, experimenters showing
Table 16–11 Differential Instructional Behavior of the Experimenter as Predictor of Subjects’ Differential Responding
Observation channel
Visual
  Professional            +.64
  Not acting important    +.50
  Honest                  +.43
  Courteous               +.41
  Mean r = +.50
  Mean intercorrelation = +.41
Visual plus auditory
  Talkative               +.63
  Dominant                +.50
  Dishonest               +.43
  Mean r = +.52
  Mean intercorrelation = +.27
in their motor behavior an increase in professional manner toward their ‘‘+5’’ subjects relative to their ‘‘−5’’ subjects, obtained the expected higher photo ratings from them (r = +.64, p = .008, df = 15). None of the differential behaviors of the experimenters toward their ‘‘+5’’ or ‘‘−5’’ subjects as judged from sound track alone predicted significantly the differential photo ratings subsequently obtained from ‘‘+5’’ and ‘‘−5’’ subjects.
It may be recalled that these samples of observers had made ratings of other aspects of the experimenter’s behavior than just those five we have been discussing. Table 16-11 shows the significant correlations (p ≤ .10) between differences in instructional behavior vis-à-vis ‘‘+5’’ and ‘‘−5’’ subjects and differences in photo ratings subsequently obtained. None of the correlations based upon observations of experimenter behavior in the auditory channel alone reached a p ≤ .10. Within each of the other two channels of observation the variables predicting subjects’ differential responses as a function of experimenter expectancy were not well clustered. Within each channel the mean correlation predicting expectancy effect was higher than the mean intercorrelation. Judging from the visual-plus-auditory channel, experimenters who behaved in a more talkative, dominant, and ‘‘dishonest’’ way toward subjects from whom they expected +5 responses tended to obtain such responses. Judging from the visual channel alone, experimenters who acted in a more professional, more courteous, less important, and more ‘‘honest’’ manner toward their ‘‘+5’’ subjects tended to obtain the expected responses from them. The most puzzling aspect is the very different meaning of ‘‘honest’’ when it is judged from the visual channel alone compared to the visual-plus-auditory channel.
As we would expect from the opposite directions of the correlations between honesty and subjects’ responses, treating a subject in a more honest way judging from the visual channel alone is seen as treating him in a less honest way judging from the visual-plus-auditory channel (r = −.62, p < .01). These findings do not solve our problem of finding the key to the communication of expectancies, but there is a lesson for future studies of interpersonal communication. Adding a channel of communication does not simply strengthen the meaning of a message; it may, in fact, reverse that meaning.
Table 16-12 shows the significant (p ≤ .10) correlations between the differential preinstructional behavior shown vis-à-vis ‘‘+5’’ and ‘‘−5’’ subjects and the differential responses subsequently shown by these subjects. Judged from the visual-plus-auditory channels, experimenters who showed more hand gestures
Table 16–12 Differential Preinstructional Behavior of the Experimenter as Predictor of Subjects’ Differential Responding
Observation channel
Visual
  Trunk gestures            −.50
  Important-acting          −.48
  Behaved inconsistently    −.47
  Mean r = −.48
  Mean intercorrelation = +.09
Auditory
  Loud                      −.49
  Mean r = −.49
  Mean intercorrelation = –
Visual plus auditory
  Loud                      +.65
  Hand gestures             +.47
  Speaks distinctly         +.46
  Mean r = +.53
  Mean intercorrelation = +.38
and whose speech was more ‘‘loud and clear’’ toward their ‘‘+5’’ than toward their ‘‘−5’’ subjects obtained more positive ratings from them than from their ‘‘−5’’ subjects. Again we find a judgment of behavior reversed as a function of the channel of communication. Judging from the sound track alone, experimenters obtained relatively higher ratings from subjects from whom they expected higher ratings if they were less loud. Judging from the visual channel alone, experimenters who showed less trunk activity, acted less important, and behaved less inconsistently toward their ‘‘+5’’ subjects obtained higher ratings from them than from their ‘‘−5’’ subjects.
In the last chapter the concept of channel discrepancy was described. Because there were observations of behavior based only on the visual channel and observations based only on the auditory channel, a difference could be computed. Such characteristic channel discrepancies were found to be useful predictors of subsequent expectancy effects. What has not yet been discussed is whether a given experimenter’s changes in channel discrepancy as a function of the expectancy he has of a given subject’s response can be used to predict that subject’s subsequent response. For 17 of the behavioral variables, channel discrepancies could be computed. The mean channel discrepancy an experimenter showed in interacting with a ‘‘−5’’ subject was subtracted from the mean channel discrepancy he showed in interacting with a ‘‘+5’’ subject. The resulting difference scores were correlated with the mean rating of the stimulus photos subsequently given by ‘‘−5’’ subjects subtracted from the mean photo rating given by ‘‘+5’’ subjects.
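The difference-score procedure just described can be made concrete with a small Python sketch. It computes the visual-minus-auditory discrepancy for one behavioral variable, either retaining or ignoring the sign of the difference, and then subtracts the discrepancy shown toward a ‘‘−5’’ subject from that shown toward a ‘‘+5’’ subject. The ratings and the function names are hypothetical and serve only to show the order of the subtractions.

```python
def channel_discrepancy(visual, auditory, absolute=False):
    """Visual-channel rating minus auditory-channel rating of the same
    behavior; with absolute=True the sign of the difference is ignored."""
    difference = visual - auditory
    return abs(difference) if absolute else difference


def discrepancy_change(plus5_ratings, minus5_ratings, absolute=False):
    """Discrepancy toward a '+5' subject minus discrepancy toward a
    '-5' subject. Each argument is a (visual, auditory) pair of ratings."""
    toward_plus5 = channel_discrepancy(*plus5_ratings, absolute=absolute)
    toward_minus5 = channel_discrepancy(*minus5_ratings, absolute=absolute)
    return toward_plus5 - toward_minus5


# Hypothetical ratings of one experimenter on one behavioral variable.
plus5_ratings = (6.0, 4.0)    # (visual, auditory) with a '+5' subject
minus5_ratings = (3.0, 4.5)   # (visual, auditory) with a '-5' subject

signed_change = discrepancy_change(plus5_ratings, minus5_ratings)
unsigned_change = discrepancy_change(plus5_ratings, minus5_ratings,
                                     absolute=True)

print(f"sign retained: {signed_change:+.1f}")
print(f"sign ignored:  {unsigned_change:+.1f}")
```

In this made-up case the two versions differ sharply because the discrepancy reverses direction between the two subjects, which the unsigned version cannot register.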
For the preinstructional period alone and for the instructional period alone, 17 correlations were available based on algebraic channel discrepancies (visual minus auditory, retaining the sign of the difference); and 17 correlations were available based on absolute channel discrepancies (visual minus auditory, disregarding the sign of the difference). Table 16-13 shows the significant (p ≤ .10) correlations between differences in channel discrepancies in the experimenter’s behavior shown toward ‘‘+5’’ as compared to ‘‘−5’’ subjects and differences between these subjects’ subsequent photo ratings. Based on observations made in the preinstructional period alone, those experimenters who showed greater absolute channel discordance in their degree of
Table 16–13 Differential Channel Discrepancies as Predictors of Subjects’ Differential Responding
Experimental period
Preinstructional
  Behavior        r       p
  Casual          +.74    .003
  Likable         −.57    .04
  Multiple R =    .85     .001
Instructional
  Behavior        r       p
  Honest*         +.46    .06
  Dominant        +.41    .10
  Multiple R =    .59     .04
* This was an algebraic discrepancy (visual > auditory). All other discrepancies were absolute, that is, sign of discrepancy was ignored.
casualness but less discordance in their likability toward their ‘‘+5’’ subjects subsequently obtained the expected higher photo ratings from these subjects (R = .85, p < .001). Based on observations made in the instructional period alone, those experimenters who showed greater absolute channel discordance in their degree of dominance, and who showed greater visual than auditory honesty toward their ‘‘+5’’ than their ‘‘−5’’ subjects, subsequently obtained the expected higher photo ratings from their ‘‘+5’’ than from their ‘‘−5’’ subjects (R = .59, p < .04). It seems best to forego the speculation required to interpret the specifics of these findings. In general, however, these results demonstrate the importance of changes in the discrepancies between channels of communication as predictors of subsequent interpersonal influence.
Additional analyses of sound motion pictures of experimenters interacting with their subjects are in progress. We may or may not find more specific signals by means of which experimenters communicate to their subjects what it is that is expected of them. Not finding such specific cues may not mean that there are no such cues but only that we do not yet know enough about subtle signaling systems to be able to find them. If, in fact, there were no specific cues to be found, then more molar changes in the behavior of the experimenter might serve as the nonspecific influencers of the subjects’ behavior. Mention has already been made of these nonspecific changes in experimenter behavior as antecedents of subjects’ subsequent differential responses. We cannot be sure, however, that these changes in experimenter behavior are themselves conveyors of information to the subjects as to how they should respond. Possibly, those subjects who later go on to confirm or disconfirm the experimenter’s hypothesis affect the experimenter differently early in the experiment.
The experimenter then behaves differently toward these subjects but without necessarily conveying response-related information to the subject. In other words, differential treatment by the experimenter may be quite incidental to the question of whether a subject goes on to confirm or disconfirm the experimenter’s hypothesis. What has been learned that we can accept with confidence is the understatement that interpersonal communication processes are enormously complex and that they may be still more complex when the communication is unintentional.
Learning to Communicate Unintentionally
If, after hundreds of hours of careful observation, no well-specifiable system of unintentional signaling has been uncovered, how do experimenters ‘‘know’’ how to influence their subjects unintentionally? Perhaps the knowledge of interpersonal influence processes is a tacit knowledge. As Polanyi (1962) has put it, ‘‘There are things that we know but cannot tell’’ (p. 601). One question that could be answered in part was whether an experimenter ‘‘knows’’ better how to influence his subjects later on in the process of data collection. If an experimenter is more successful in unintentional influencing later than he was earlier, it would be reasonable to think that in part, unintentional influence was a learned phenomenon. That was just what happened in the case of Clever Hans. Pfungst (1911) found that as questioners gained experience in asking Hans to respond they became more successful in unintentionally signaling to Hans when to stop his tapping. In the chapter dealing with the effects of early data returns, two experiments were described. In both these studies subjects contacted during the last half of the experiment were more influenced by the experimenters’ expectancy than were subjects contacted during the first half of the experiment (ps were .02 and .01 respectively). Data collected with Suzanne Haley were similarly analyzed, and later-contacted subjects again showed greater effects of the experimenters’ expectancy (p = .12). For these three experiments with a total of 54 experimenters, the combined p was less than .001, but it must be mentioned that it was not always possible to be sure that earlier- and later-contacted subjects did not differ in some other ways as well. In Weick’s study reported earlier, there was no increase in expectancy effects when the two experimenters were contacting later as compared to earlier subjects in the sequence.
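The text does not say how the three p values were combined. As a rough check only, the sketch below assumes that each p is one-tailed and that the pooling was done by adding standard normal deviates (the Stouffer method); both are assumptions, and `stouffer_combined_p` is a hypothetical helper name introduced for illustration.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_combined_p(p_values):
    """Combine one-tailed p values by summing their standard normal deviates.

    Assumptions (not stated in the source): the studies are independent and
    each p is one-tailed in the predicted direction.
    """
    std_normal = NormalDist()
    # Convert each p to the z score that cuts off that upper-tail area.
    z_scores = [std_normal.inv_cdf(1.0 - p) for p in p_values]
    # A sum of k independent unit-normal z scores has variance k.
    combined_z = sum(z_scores) / sqrt(len(z_scores))
    # Upper-tail probability of the combined z.
    return 1.0 - std_normal.cdf(combined_z)

combined = stouffer_combined_p([0.02, 0.01, 0.12])
print(f"combined one-tailed p = {combined:.5f}")
```

Under these assumptions the pooled value falls below .001, which agrees with the figure reported, although other combining methods could give a somewhat different result.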
Vikan-Kline’s (1962) data, reported two chapters ago, showed no order effect among her lower status experimenters but did show higher status experimenters to increase their expectancy effects as a function of number of subjects contacted (p = .01). Although the evidence is not conclusive, it does seem that, on the whole, later-contacted subjects are more influenced by the experimenter’s expectancy than earlier-contacted subjects. Over the course of an experiment, experimenters may learn to communicate their expectancies more effectively. This learning hypothesis is strengthened somewhat by the findings of the studies on experimenter-subject acquaintanceship. The two studies summarized earlier found greater acquaintanceship associated with greater expectancy effects. In part, of course, this may have been due to the greater willingness of people to be influenced by prior acquaintances. In addition, however, acquaintanceship implies a longer joint history with greater opportunity for learning how the interpersonal influence process operates with the specific other. Acquaintances, presumably, not only have greater reinforcement value for each other but probably can better read each other’s cues, unintentional as well as intentional. If the experimenter were indeed learning to increase his unintended influence, who would be the teacher? Most likely, the subject would be the teacher. It seems to be rewarding to have one’s expectations confirmed (Aronson, Carlsmith, & Darley, 1963; Carlsmith & Aronson, 1963; Harvey & Clapp, 1965; Sampson
& Sibley, 1965). Therefore, whenever the subject responds in accordance with the experimenter’s expectancy, the likelihood is increased that the experimenter will repeat any covert communicative behavior that may have preceded the subject’s confirming response. Subjects, then, may quite unintentionally shape the experimenter’s unintended communicative behavior. Not only does the experimenter influence his subjects to respond in the expected manner, but his subjects may well evoke just that unintended behavior that will lead subjects to respond as expected. As the work of Hefferline (1962) and of Pfungst (1911) suggests, such communication may not fall under what we commonly call ‘‘conscious control.’’
Part III METHODOLOGICAL IMPLICATIONS
SOME THEORETICAL CONSIDERATIONS
Chapter 17. The Generality and Assessment of Experimenter Effects
Chapter 18. Replications and Their Assessment
THE CONTROL OF EXPERIMENTER EXPECTANCY EFFECTS
Chapter 19. Experimenter Sampling
Chapter 20. Experimenter Behavior
Chapter 21. Personnel Considerations
Chapter 22. Blind and Minimized Contact
Chapter 23. Expectancy Control Groups
Chapter 24. Conclusion
17 The Generality and Assessment of Experimenter Effects
As behavioral scientists, what should be our reaction to the evidence presented in this book? Three different reactions to the presentation of some of the data from this book have actually been observed: (1) the incredulous, (2) the gleeful, (3) the realistic. The incredulous reactor (who may not have read this far) feels vaguely that all of this is just so much nonsense and that if it is not completely nonsense, at least it does not apply to him. The gleeful reactor (who may have read this far, but may read no further) has ‘‘known all along that experiments in the behavioral sciences were riddled with error.’’ He does not do or like empirical research. He is gleeful because, paradoxically, he reads into the experimental evidence presented in this book his justification for his epistemology that knowledge of the world comes through revelation rather than observation. After all, if observation is subject to observer influence, is he not justified in his eschewal of observation? The realistic reactor (and the choice of terms is intentionally positively evaluative) has read this far more or less critically and has wondered a bit whether some of his own research might have been affected by his own expectancies or more enduring attributes. Much of what follows is for that reader who, although skeptical by training, is not incredulous; who, although interested, is not overdeterminedly gleeful; who, although reminiscing about his own research, is not contemplating giving up the scientific enterprise. It is for the reader who agrees with Hyman and his co-authors (1954) when they say: ‘‘Let it be noted that the demonstration of error marks an advanced state of a science. All scientific inquiry is subject to error, and it is far better to be aware of this, to study the sources in an attempt to reduce it . . . than to be ignorant of the errors concealed in the data’’ (p. 4).
The Generality of Experimenter Effects
How pervasive are the unintended effects of the experimenter on the results of his research, and how much ought we to worry about them in our day-to-day research activities? The answer to the first part of this question seems simple. We don’t know. No one knows. It seems reasonable to suppose that there may be experimenters doing
experiments the results of which are unaffected by the experimenters themselves. Unfortunately, we don’t know who they are or which of their experiments, if not all, are immune to their own unintended effect. This lack of specificity in our knowledge suggests the answer to the second part of our question. It seems more prudent to worry than not to worry about experimenter effects in our day-to-day research. One type of experimenter effect, that of his hypothesis or expectancy, has received our special attention in this book. For this special case of experimenter effect we can sketch out the evidence bearing on the question of its generality. After the manner of Brunswik’s conception (1956) of the representative design of experiments, we may specify the sampling domains of experimenters, subjects, tasks, and contexts employed in the experiments described in this book. Experimenters Altogether, there have been well over 350 experimenters employed in the studies described. About 90 percent of these were males. All but a handful (faculty experimenters) were graduate or undergraduate students. In all cases, however, experimenters were academically more advanced than were their experimental subjects. Graduate student experimenters were drawn from classes in psychology, education, biology, physics, engineering, and law. Undergraduate student experimenters were drawn primarily from courses in experimental, industrial, and clinical psychology, statistics, and the social sciences. In most cases experimenters were volunteers, but in others the class as a whole was urged by its instructor to participate—a practice that led to essentially nonvolunteer populations (Rosenthal, 1965b). Most experimenter samples were paid for their participation, but many were not. Thus, although sampling of experimenters has been fairly broad, it has been broad only within various student populations. Does any of the work reported, then, have any real relevance to the ‘‘real’’ experimenter? 
The gleeful reactor mentioned earlier may too quickly say ‘‘yes.’’ The incredulous reactor may too quickly say ‘‘no.’’ In his discussion of the generality of interviewer effects, Hart (obviously a realistic reactor) put it this way:
Generalization of our conclusions to researchers of greater maturity and sophistication than these subjects has to be made, therefore, with due and proper caution. It would be dangerous, however, though consoling, for the mature and sophisticated interviewer to assume that he is not equally subject to the operation of the same error-producing factors affecting the varied group of interviewers covered by the studies we are here reporting. As a matter of fact, the available evidence suggests that, while the sophisticated interviewer may be less subject to variable errors of a careless sort, he is probably equally subject to certain biasing errors (1954, pp. ix–x).
Indeed, we can go further than Hart. If anything, our data suggest fairly strongly that more professional, more competent, higher status experimenters are more likely to bias the results of their research than are the more amateurish data collectors. Most experimenters, like most interviewers, are task-oriented, but the experimenters whom we have studied (and those ‘‘real’’ ones we have known) seem to be much more interested in the subject’s response than the survey interviewer apparently is in his respondent’s reply (Hyman et al., 1954, p. 270). But that seems not hard to understand. The experimenter, as compared to the survey interviewer, is less of a
‘‘hired hand’’ who, if he performs poorly, can simply take another job. At least to some extent the professional career of the experimenter depends on the responses his subjects give him in the experimental situation. At first glance this may seem farfetched. Actually it is quite analogous to the situation in the other sciences. The behavior of a noble gas or of heavenly bodies can clearly affect the professional career of the physical scientist interested in such behavior. It gives, or does not give, him something to report or to guide his next experiment or observation. If the experimenter is not the principal investigator, but a student of the principal investigator, his professional career may still depend, much more than the survey interviewer’s, on his subjects’ responses. The student bears a much more special relationship to the principal investigator than the interviewer bears to his employing agency. The student is the only employee or one of a handful of employees. The interviewer may be one of thousands of employees. The student experimenter is likely to learn immediately what his employer’s reaction is to his subject’s responses. The interviewer’s feedback may be much delayed or even absent altogether. In short, the experimenter, be he principal investigator or research assistant, has much more at stake than does the interviewer. If he cares so much more about how his subject performs for him in the experimental situation, it seems reasonable to suppose that he may be more likely to communicate something of this concern to his subject than the typical interviewer is likely to do. At the present time we cannot say with certainty whether very highly experienced professional experimenters are more or less likely to bias their subjects’ responses than less experienced experimenters, although all the evidence available suggests that more professional, more competent, higher status experimenters show the greater expectancy effects. 
In any case, we should note the trend that as experimenters become highly experienced they become less and less likely to contact their subjects directly. As investigators become better established they are more and more likely to acquire more and more assistants who will do the actual data collection. These assistants range from an occasional postdoctoral student through the various levels of graduate students. Increasingly, even undergraduate students are collecting data to be used for serious scientific purposes. Undergraduate research assistants, for example, are the only ones available at many excellent liberal arts colleges with active research programs in the behavioral sciences. For some time original research has been required of at least some undergraduate candidates for honors degrees, and this trend is increasing. More and more we shall probably see undergraduates collecting data for serious purposes under the expanding programs supported by the federal government as part of the movement to encourage the earlier selection of careers in research. The Undergraduate Research Participation Program of the National Science Foundation is a prime example. With more and more ‘‘real’’ data being collected by less and less experienced experimenters, it appears that our student experimenters are not as unrepresentative of the ‘‘real’’ world of data collection after all. But suppose for a moment that it were indeterminant that there were ‘‘real’’ experimenters in the world who were like the graduate and undergraduate students we employed. How seriously would that restrict the generality of the data presented? Of course, we could not be certain of any answer to that question. But in a relative sense, it does not seem far-fetched to use students as models of student and faculty researchers—certainly much less far-fetched (as Marcia has pointed out in personal communication, 1961) than using a
Sprague-Dawley albino rat as the model for man. But we have learned enough of consequence about human behavior from both sophomore and Sprague-Dawley that we do not feel too uncomfortable about even this degree of generalization. If these generalizations seem tenable, then even more does our generalization from student to ‘‘real’’ data collector seem tenable. Subjects There have been well over 2,000 human subjects employed in the studies described. About 60 percent have been female. Most of the subjects were undergraduates and were drawn from courses in liberal arts, education, and business. The greatest single contributing course was introductory psychology. Most of the subjects were volunteers, but many were urged by their instructors to participate and so became more like a non-volunteer population. In some of the studies subjects were paid; usually they were not. In the case of animal subjects, 80 rats of two different species from two different laboratories were employed. About two thirds were females. The subjects employed in our research were very much like those typically used in behavior research, and there appears to be little risk in generalizing from our subjects to subjects-in-general. Of course, nonstudent subjects are employed now and then in behavioral research, but this is so relatively rare that McNemar was led to state sadly: ‘‘The existing science of human behavior is largely the science of the behavior of sophomores’’ (1946, p. 333). Situations There is no standard way in which we can describe the ‘‘situations’’ in which the experimenter-subject dyads transacted their business in the thirty or so studies that have been carried out. But certainly a part of any experimental situation is the task the subject is asked to perform. The most frequently employed task has been that of rating photos for the degree of success or failure the person pictured has been experiencing. 
The exact instructions to the subjects, the training of experimenters to administer the task, and the exact mode of administration have all been varied. Nevertheless, in spite of the variations in this task and in spite of the fact that the task is a fairly typical one in psychological research, no single task can be regarded as an adequate sample of the many tasks psychologists have asked their subjects to perform. Accordingly, other tasks have been employed, including verbal conditioning, standardized and projective psychological tests; and for animal subjects, learning in T-mazes and Skinner boxes. Most of the studies described have been carried out at the University of North Dakota, the Ohio State University, Harvard University, and two smaller universities in Ohio. They were carried out at different times during the academic year and during summer sessions. Length of time elapsing between the contacting of the first and last subjects in a given study has varied from a few hours to several months. In most cases a number of experimenters were simultaneously contacting their subjects, each of whom was individually seen by his experimenter. In other studies, different experimenters contacted their subjects individually but at different times. In one study all subjects were contacted by their experimenter as a group.
The rooms in which experimenter-subject transactions occurred differed considerably. These ranged from a large armory (in which 150 subjects were simultaneously contacted by 30 experimenters) to individual rooms barely large enough for two chairs and a small table. Some of the rooms had one-way-vision mirrors and microphones in view; others did not. Some of the rooms were furnished so as to convey the impression that the occupant was a person of high status; some were furnished to convey the opposite impression. Earlier we asked the question of the generality of the effects of the experimenter on the results of his experiment. We are now in a position to conclude, at least for one type of experimenter effect (that of his hypothesis or expectancy), that the phenomenon may well be a fairly general one. This conclusion seems warranted by the variety of experimenter, subject, and situation or context domains sampled and by the fact that expectancy effects have been shown to occur in other than experimental laboratories. Some of this evidence was presented in Part I and some will be touched upon in the final chapter. The generality of the phenomenon of experimenter expectancy effects suggests the need to consider in some detail the implications for psychological research methodology. We will turn our attention first to the problem of the assessment of experimenter effects.
The Assessment of Experimenter Effects
So far in this book we have found it sufficient to give only very general definitions of certain operating characteristics of the experimenter. In this section we shall see somewhat more formal definitions of some of these characteristics. Whenever we speak here of ‘‘an experimenter’’ or ‘‘a subject’’ we imply that whatever is said applies as well to a homogeneous set of experimenters or subjects unless specifically restricted to the single case.
I. Experimenter Effect
Experimenter effect is defined as the extent to which the datum obtained by an experimenter deviates from the ‘‘correct’’ value. The measure of experimenter effect (or experimenter error) is some function of the sum of the absolute (unsigned) deviations of that experimenter’s data about the ‘‘correct’’ value. It is, therefore, a measure of gross or total error.
A. Data. Data are defined as the performance or responses made by the experimenter’s subjects. The term ‘‘data’’ may be applied to (1) the ‘‘raw’’ response, (2) the conversion of the raw response to quantitative terms, and (3) any subsequent transformation of the quantitative terms.
1. Response. A subject’s response is that behavior of the subject which the experimenter has defined as being of interest. We may use this term to refer to the subject’s behavior in both absolute and relative terms, both before and after quantitative transformation. For example, in an experiment comparing one or more experimental groups and one or more ‘‘control’’ groups, a subject’s response might be defined as the ‘‘raw’’ (untransformed) response produced or as the difference between that raw response and the mean of any other group.
B. ‘‘Correct value.’’ The ‘‘correct’’ or ‘‘true’’ datum is established by reasoned fiat. In some cases there are reasonable bases for the choice of the true or correct value. In a censuslike investigation of age, birth records may serve as the ‘‘correct’’ value against which subjects’ responses may be compared. In a study involving college grades the registrar’s records may serve as the criterion against which to compare subjects’ statements of grades. In both of these examples, we should note, it is entirely possible that the official records are ‘‘in error’’ in some absolute sense and that the subject’s response is more accurate. But on the whole, we are more inclined to trust the official bookkeepers of society, not because they are error-free, but because in many situations they seem to have the ‘‘best’’ data most of the time. But there are no books kept on a given subject’s pursuit rotor performance or his political ideology (but affiliation, yes), or sex life, or verbal learning, or small group interaction patterns. We find ourselves hard put to establish a criterion value. In survey research (Hyman et al., 1954) this is often done by sending out more experienced data collectors whose obtained data are then assumed to be more accurate than those collected by more inexperienced data collectors. That this may be so is reasonable but is so far from having been well established that it may be a misleading assumption. Similarly, in anthropological research, it has been suggested that better rapport with informants leads to more accurate data (Naroll, Naroll, & Howard, 1961; Naroll, 1962). This, too, is a reasonable assumption but probably also a risky one. Realistically, we must content ourselves with the fact that in most behavioral research the ‘‘true’’ data are unknown except as we obtain them in behavioral inquiry. 
One solution that may serve for the time being is the democratic but not very satisfying one of assuming equal likelihood of error in all experimenters until shown otherwise. On the basis of this assumption, we take the mean data obtained from roughly comparable samples of subjects to be our ‘‘true’’ mean. The more experimenters that have collected such data, in fact, the ‘‘truer’’ will our ‘‘true’’ mean be.
II. Experimenter Bias
Experimenter bias is defined as the extent to which experimenter effect or error is asymmetrically distributed about the ‘‘correct’’ or ‘‘true’’ value. The measure of experimenter bias is some function of the algebraic sum of the deviations of that experimenter’s data about the ‘‘correct’’ value. It is, therefore, a measure of net error. We should note here that for a single subject’s score or a single mean we can only judge whether that score or mean is accurate or not if we are given a criterion of ‘‘correctness.’’ If the score or mean is accurate, well and good. If it is not accurate, we cannot evaluate whether the inaccuracy is biased or not. In a sense, of course, it is biased, since it must represent a net deviation from the correct value. But we would have to have at least one other score or mean to test properly the hypothesis of bias. If a subsequently drawn score or mean were to fall equally distant from, and on the opposite side of, the correct value, we would necessarily reject the notion of a biased data collector.
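The definitions of experimenter effect (gross error) and experimenter bias (net error) can be made concrete with a small sketch. The definitions say only ‘‘some function of the sum’’ of deviations; the mean deviation is used here as one convenient choice, which is an assumption. The function names and the ratings are hypothetical and serve for illustration only.

```python
from statistics import mean

def experimenter_effect(data, correct_value):
    """Gross error: mean of the absolute (unsigned) deviations."""
    return mean(abs(x - correct_value) for x in data)

def experimenter_bias(data, correct_value):
    """Net error: mean of the algebraic (signed) deviations."""
    return mean(x - correct_value for x in data)

def working_true_mean(data_by_experimenter):
    """Mean of the experimenters' means, used as a stand-in 'true' value."""
    return mean(mean(d) for d in data_by_experimenter.values())

# Hypothetical ratings, invented for illustration.
ratings = {
    "E1": [1.0, -1.0, 2.0, -2.0],  # scattered, but the errors cancel
    "E2": [2.0, 3.0, 2.0, 1.0],    # errors all lean in one direction
}
correct = 0.0
for name, data in ratings.items():
    print(name,
          experimenter_effect(data, correct),
          experimenter_bias(data, correct))
print("working-true mean:", working_true_mean(ratings))
```

In this sketch the first experimenter shows an appreciable effect with no net bias, while for the second the effect and the bias coincide because every deviation has the same sign.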
III. Experimenter Consistency
Experimenter consistency is defined as the extent to which the data obtained by an experimenter from a single subject or sample vary minimally among themselves. The measure of experimenter consistency is some function of the sum of the absolute deviations of that experimenter’s obtained data about his mean datum obtained. The commonly used measure in this case would, of course, be the variance or standard deviation. In the case of experimenter effect and experimenter bias we could take a simple evaluative position: we are likely to be against both. In the case of experimenter consistency the situation is more complex. Whereas we may be against marked inconsistency, we should also worry about hyperconsistency.1 If the experimenter is very inconsistent he is inefficient in the sense that he will have to obtain a larger number of responses to establish a reliable mean value. Such inconsistency of obtained responses may be due to random variations in his behavior vis-à-vis his subjects, including minor deviations from both his programmed procedures and his unprogrammed modal ‘‘interpersonal style.’’ If, on the other hand, the experimenter is significantly undervariable in the data he obtains, his increased ‘‘efficiency’’ is bought at the cost of possible bias. Such possible bias has been well illustrated in the earlier cited study of the error of estimate of blood cell counts (Berkson, Magath, & Hurn, 1940). These workers showed that successive blood counts were significantly undervariable and that this bias could be attributed to an expectancy and desire on the part of the observer for the close agreement of successive counts. Whatever the observer’s initial expectancy might be, his counts agree too often with this expectancy. In the absence of any special initial expectancy, it seems reasonable that the early data might have special significance as determinants of subsequent counts.
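A minimal sketch of how consistency and possible hyperconsistency might be examined follows. It assumes that a reference set of responses is available against which an experimenter’s variance can be compared; the function names and data are hypothetical, and no significance test is computed.

```python
from statistics import variance

def consistency(data):
    """Sample variance of one experimenter's data (smaller = more consistent)."""
    return variance(data)

def variance_ratio(reference_data, experimenter_data):
    """Reference variance divided by the experimenter's variance.

    Values well above 1 suggest the experimenter may be undervariable.
    Whether a ratio is significant would be judged against an F table with
    (len(reference_data) - 1, len(experimenter_data) - 1) degrees of freedom.
    """
    return variance(reference_data) / variance(experimenter_data)

# Hypothetical responses, invented for illustration.
reference = [1.0, 3.0, 5.0, 7.0, 9.0]
suspect = [4.0, 5.0, 5.0, 5.0, 6.0]
print(consistency(reference), consistency(suspect),
      variance_ratio(reference, suspect))
```

Both samples have the same mean, so the large ratio reflects only the difference in spread; a ratio well below 1 would instead point to marked inconsistency.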
Early data returns, as they influence the central tendencies (rather than the variability) of subsequent data, were discussed in an earlier chapter. It is interesting to note that in the first study described in that chapter (Table 12-1) those experimenters whose early data were biased as to their central tendency (means) obtained subsequent data that were biased not only with respect to central tendency but with respect to variance as well. Variances obtained by experimenters obtaining more biased early returns tended to be significantly smaller than variances obtained by experimenters obtaining relatively unbiased early returns (F = 4.06, df = 6, 3, p < .10). It may be, then, that unusually restricted variance or hyperconsistency can serve as a clue to the possible biasing of central tendencies. When we speak of ‘‘hyper’’-consistency or inconsistency we imply that we know the ‘‘true’’ or ‘‘correct’’ variance. The situation for variance is essentially the same as it was for the mean or any other measure of central tendency. We never really know the ‘‘true’’ value, but we can make reasonable choices of a ‘‘working-true’’ value. In a few cases we again can turn to public records from which ‘‘true’’ variances may be computed. We can use as our ‘‘true’’ value the variance obtained by some paragon experimenter or group of experimenters. In our earlier discussion of ‘‘correct values’’ we pointed out some difficulties of this technique, difficulties that apply equally well for variances as for scores or means of scores. For practical purposes, at this stage of our knowledge, we must probably rely on some method of sampling experimenters to arrive at some estimate of a ‘‘correct’’ variance. Such sampling may help us avoid the bias associated with the employment of experimenters who, fortuitously, may be overconsistent or underconsistent.
1 I want to thank Fred Mosteller for pointing out this problem and for calling my attention to the Berkson et al. (1940) study.
Before leaving this section, two kinds of experimenter deviation from normality of response distribution will be mentioned. Even assuming a properly consistent and unbiased experimenter, his distribution of obtained responses may contain too many high or low responses (skewness or asymmetry). In addition, his distribution of obtained responses may contain too many or too few responses at or near the mean. When we speak here of ‘‘too high’’ (or low) and of ‘‘too many’’ (or few) we mean it with respect to the normal distribution. Whether the ‘‘true’’ distribution is, in fact, normal is the same sort of question we have asked before when discussing ‘‘correct’’ scores, means, and variances; and our answer is essentially the same. These two kinds of experimenter deviation from normality of response distribution have been discussed only briefly because, at the present time, we have no evidence that they are in any way serious for the usual conduct of psychological research. It is the rare psychological research paper that deals in any central way with the absolute magnitudes of skewness or kurtosis. It would seem interesting, however, to assess an experimenter’s distribution of obtained responses for these characteristics, since in real life situations these deviations may prove to be indicative of error or bias in the means.
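For readers who wish to assess an experimenter’s obtained distribution for these two characteristics, the sketch below uses simple moment-based estimates of skewness and excess kurtosis. These particular estimators are one common choice and are an assumption here, not a method given in the text; the function names and data are hypothetical.

```python
from statistics import fmean, pstdev

def skewness(data):
    """Third standardized moment; 0 for a symmetric distribution."""
    m, s = fmean(data), pstdev(data)
    return fmean([((x - m) / s) ** 3 for x in data])

def excess_kurtosis(data):
    """Fourth standardized moment minus 3; 0 for a normal distribution."""
    m, s = fmean(data), pstdev(data)
    return fmean([((x - m) / s) ** 4 for x in data]) - 3.0

# Hypothetical response sets, invented for illustration.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
too_many_low = [1.0, 1.0, 1.0, 1.0, 6.0]
print(skewness(symmetric), skewness(too_many_low))
print(excess_kurtosis(symmetric))
```

With samples as small as these the estimates are very unstable, so in practice they would be computed on an experimenter’s full set of obtained responses.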
A Typology of Experimenter Operating Characteristics
We have emphasized three major concepts dealing with the data-obtaining characteristics of experimenters: effect, bias, and consistency. We may consider these three variables as dichotomous for the sake of simplicity, although recognizing that, in fact, they are continuous variables. The three ‘‘concepts’’ in all possible combinations permit the following seven-category typology of experimenters’ operating characteristics:
I. ACCURATE (1)
II. INACCURATE
   A. Unbiased
      1. Consistent (2)
      2. Inconsistent (3)
   B. Biased
      1. Consistent
         a. net high (4)
         b. net low (5)
      2. Inconsistent
         a. net high (6)
         b. net low (7)
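The typology can also be written as a small decision rule. The sketch below treats the three characteristics as dichotomous, as the text does for simplicity; the function name and its parameters are hypothetical and are introduced only for illustration.

```python
def experimenter_type(accurate, biased=False, consistent=True, net_high=True):
    """Return the type number (1 to 7) for a dichotomized experimenter.

    The remaining arguments are ignored for an accurate experimenter, and
    net_high is ignored for an unbiased one.
    """
    if accurate:
        return 1
    if not biased:
        return 2 if consistent else 3
    if consistent:
        return 4 if net_high else 5
    return 6 if net_high else 7

# Example: an inaccurate, biased, inconsistent experimenter whose data run low.
print(experimenter_type(accurate=False, biased=True,
                        consistent=False, net_high=False))
```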
Figure 17–1 Schematic Typology of Experimenter Operating Characteristics. Panels: ACCURATE (1); INACCURATE, UNBIASED, CONSISTENT (2) and INCONSISTENT (3); INACCURATE, BIASED, CONSISTENT, POSITIVE (4) and NEGATIVE (5); INACCURATE, BIASED, INCONSISTENT, POSITIVE (6) and NEGATIVE (7).
Figure 17-1 illustrates each of the seven types of experimenters, each of whom has drawn two samples of N subjects. In each cubicle or semi-cubicle the two distributions of responses are shown in relation to the ‘‘correct’’ value (indicated by the arrow), and the number corresponding to the experimenter type is shown in the upper right corner. For the sake of clarity we have not considered cases of significantly decreased variability or hyperconsistency. The type (1) experimenter is accurate; that is, he obtains data that are correctly consistent or variable about the mean of his obtained data, his data vary only negligibly about the ‘‘correct’’ value and, therefore, can be only negligibly biased. We can see from Figure 17-1 that the accurate experimenter is also maximally efficient. He can provide us with the desired estimate of the ‘‘correct’’ value with far fewer responses than can any other experimenter. All other experimenters [(2) to (7)] are inaccurate, but we vastly prefer the inaccuracy of types (2) and (3), the unbiased experimenters. Their data will, in the long run, also give us a good estimate of the ‘‘correct’’ value. Between experimenters (2) and (3) we prefer (2) because his greater consistency permits us to draw our conclusions with fewer subjects. Among biased experimenters [(4) to (7)] we have no strong preferences. From the point of view of estimating the ‘‘correct’’ value, a positive (net high) [(4) and (6)] bias does not differ from a negative (net low) [(5) and (7)]. There may, however, be a slight preference for the consistent [(4) and (5)] over the inconsistent [(6) and (7)] biased experimenter. Bias can be more quickly determined for the consistent experimenter, and that may be useful information. It may prevent his collecting additional, unusable data. Let us assume for the moment that most experimenters will show one or another form of bias to a greater or lesser extent. 
It still seems possible to obtain an unbiased estimate of the ‘‘correct’’ value although the cost will be greater. If we can assume a fairly symmetrical distribution of biases among a population of experimenters, the mean of the data obtained by a number of experimenters is likely to be unbiased. More subjects will be required, and more experimenters, and that is why the cost is
greater. If our biased experimenters are consistent [types (4) and (5)], the cost per experimenter will be lower than if they are inconsistent [types (6) and (7)]. We should note that if we employ a set of experimenters of opposite biases, the total variance of subjects’ responses, disregarding who their experimenter was, will be quite inflated because the variance attributable to the two types of experimenters will be added to the normal individual difference variance among an individual experimenter’s subjects. We shall have more to say later about the important principle of ‘‘balancing biases’’ which was suggested by Mosteller (1944).
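The balancing argument above can be illustrated with a small numerical sketch. All values below are hypothetical (a "correct" value of 5.0, two experimenters with equal and opposite biases of 2.0, and three subjects each). Under those assumptions the pooled mean recovers the "correct" value, while the pooled variance equals the within-experimenter variance plus the variance between experimenter means.

```python
from statistics import mean, pvariance

CORRECT = 5.0                        # hypothetical "correct" value
subject_offsets = [-1.0, 0.0, 1.0]   # hypothetical individual differences
biases = {"E_high": +2.0, "E_low": -2.0}  # opposite, symmetrical biases

# Each experimenter's obtained responses: correct value + bias + subject offset.
data = {name: [CORRECT + b + off for off in subject_offsets]
        for name, b in biases.items()}
pooled = [x for xs in data.values() for x in xs]

within = mean(pvariance(xs) for xs in data.values())      # individual differences
between = pvariance([mean(xs) for xs in data.values()])   # experimenter effect
pooled_mean = mean(pooled)
pooled_var = pvariance(pooled)

print(f"pooled mean     = {pooled_mean:.2f}")
print(f"within variance = {within:.3f}")
print(f"between variance= {between:.3f}")
print(f"pooled variance = {pooled_var:.3f}")
```

With equal numbers of subjects per experimenter the decomposition is exact. The inflation of the pooled variance is the price paid for the unbiased mean.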
Biased Response Magnitude vs. Biased Inference
In our definitions of “data” and “response” we stated that these terms could be used to refer not only to the absolute measure of subjects’ behavior but also to the difference between that measure and a comparison measure. Therefore, the data distributions shown in Figure 17-1 may for the sake of generality be viewed either as arrays of raw data obtained from homogeneously treated subjects or as arrays of difference scores arising, for example, from the differences between experimental and control manipulations. What we must consider now is the fact that an experimenter may be very biased in the raw data he obtains and yet be completely unbiased in the inferences his data allow him to make. Put more generally, inaccuracy in the order of magnitude of data obtained may be quite independent of the inaccuracy of the inferences to be drawn from the differences between data obtained from the groups to be compared. We can illustrate this point best by restricting our discussion to the occurrence and nonoccurrence of only one type of inaccuracy: e.g., bias. Tables 17-1, 17-2, 17-3, and 17-4 show the four possible situations:
1. Data magnitude unbiased; inference unbiased
2. Data magnitude unbiased; inference biased
3. Data magnitude biased; inference unbiased
4. Data magnitude biased; inference biased
In Table 17-1 we are interested in comparing Ex’s data with the “correct values” as defined by the means of Es a, b, and c. We see in this case that the responses Ex obtained from his subjects are just like those obtained by the criterion experimenters. In addition, the difference between the data obtained from experimental and control group subjects is identical when we compare Ex’s value with the “correct value.” We conclude that Ex showed no bias in either response magnitude obtained (column III) or inference permissible on the basis of obtained differences (column IV).
Table 17–1 Unbiased Response Magnitude and Unbiased Inference

             I Experimental   II Control   III Sum   IV Difference
Ex                1.2            0.8         2.0         0.4
Ea                1.3            0.9         2.2         0.4
Eb                1.2            0.8         2.0         0.4
Ec                1.1            0.7         1.8         0.4
“Correct”         1.2            0.8         2.0         0.4
Table 17–2 Unbiased Response Magnitude and Biased Inference

             I Experimental   II Control   III Sum   IV Difference
Ex                1.0            1.0         2.0         0.0
Ea                1.3            0.9         2.2         0.4
Eb                1.2            0.8         2.0         0.4
Ec                1.1            0.7         1.8         0.4
“Correct”         1.2            0.8         2.0         0.4
Table 17–3 Biased Response Magnitude and Unbiased Inference

             I Experimental   II Control   III Sum   IV Difference
Ex                1.7            1.3         3.0         0.4
Ea                1.3            0.9         2.2         0.4
Eb                1.2            0.8         2.0         0.4
Ec                1.1            0.7         1.8         0.4
“Correct”         1.2            0.8         2.0         0.4
Table 17-2, however, shows that although Ex was unbiased in response magnitude obtained (column III), he was biased in the inference permissible from his experiment (column IV). He was the only experimenter not to obtain the ‘‘correct’’ mean difference of 0.4. In this example Ex might have been biased even further in the direction opposite to that of the correct mean difference. That is, he might have obtained significantly higher values from the subjects in his control group than from the subjects in the experimental group. At the same time, his obtained response magnitude might have remained unbiased. Table 17-3 shows that our protagonist, Ex, has obtained the same difference between his experimental and control subjects that was obtained by the criterion experimenters (column IV). However, the response magnitude he obtained was significantly greater than that obtained by the more ‘‘accurate’’ experimenters (column III). If the purpose of the experiment was simply to establish that the subjects of the experimental group would outperform the subjects of the control group, our Ex has not led us at all astray. However, if there was, in addition to an interest in the experimental-control group difference, an intrinsic interest in the actual values obtained, Ex’s data would have been very misleading. Table 17-4 shows that in this example our Ex has obtained responses of significantly greater magnitude than were obtained by the more ‘‘accurate’’ experimenters (column III). In addition, he found no difference between the subjects of his experimental and control groups and was, with respect to the criterion experimenters, in biased error (column IV). With a given obtained response magnitude, our Ex might have been biased into the opposite direction with his control subjects outperforming his experimental subjects. He might also have been biased if he had obtained, say, a mean difference of 0.8. 
In this case we would not worry at all if we simply wanted to be able to claim the superiority of the experimental over the control subjects.
Table 17–4 Biased Response Magnitude and Biased Inference

             I Experimental   II Control   III Sum   IV Difference
Ex                1.5            1.5         3.0         0.0
Ea                1.3            0.9         2.2         0.4
Eb                1.2            0.8         2.0         0.4
Ec                1.1            0.7         1.8         0.4
“Correct”         1.2            0.8         2.0         0.4
However, if we had some intrinsic interest in the magnitude of the difference favoring the experimental group, we would have been misled. Suppose that our experimental treatment in this case was a very costly surgical procedure, whereas our control treatment was an inexpensive medical procedure. Let us say that a mean difference of 0.4 represents a statistically significant but clinically trivial improvement in patient comfort. But let us say that a mean difference of 0.8 represents a dramatic clinical improvement in the patient. On the basis of our single experimenter’s research, we might institute a surgical procedure that, on balance of cost against utility, is simply not worth it. This is only one example where we might be interested not so much in showing the significance of a difference but in showing its absolute magnitude.
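The four situations of Tables 17-1 through 17-4 can be checked mechanically. The sketch below takes the "correct" values as the means of the criterion experimenters Ea, Eb, and Ec, then compares Ex's sum (column III) and difference (column IV) with them. The tolerance `TOL` is a hypothetical stand-in for the significance test the text has in mind. It is not part of the original procedure.

```python
from statistics import mean

# (experimental, control) means for the criterion experimenters.
criterion = {"Ea": (1.3, 0.9), "Eb": (1.2, 0.8), "Ec": (1.1, 0.7)}

# Ex's (experimental, control) means in each of the four tables.
ex_by_table = {
    "17-1": (1.2, 0.8),
    "17-2": (1.0, 1.0),
    "17-3": (1.7, 1.3),
    "17-4": (1.5, 1.5),
}

TOL = 0.05  # hypothetical tolerance, used in place of a significance test

correct_exp = mean(e for e, _ in criterion.values())
correct_ctl = mean(c for _, c in criterion.values())
correct_sum = correct_exp + correct_ctl    # column III criterion
correct_diff = correct_exp - correct_ctl   # column IV criterion


def classify(exp, ctl):
    """Return (magnitude_biased, inference_biased) for one experimenter."""
    magnitude_biased = abs((exp + ctl) - correct_sum) > TOL
    inference_biased = abs((exp - ctl) - correct_diff) > TOL
    return magnitude_biased, inference_biased


results = {table: classify(*values) for table, values in ex_by_table.items()}
for table, (mag, inf) in results.items():
    print(f"Table {table}: magnitude biased={mag}, inference biased={inf}")
```

Keeping the two checks separate mirrors the point of the section: the sum and the difference can each be biased without the other.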
The Practical Problem of Assessment
Our discussion of the assessment of experimenter effects has been largely theoretical. Now we consider the “real” world of research. Here we have experimenters conducting experiments that, because of differences in subject sampling, instrumentation, and procedure, cannot reasonably be compared directly to any other experiments. How are we to assess the operating characteristics (i.e., the accuracy) of these experimenters? The answer is simple enough: it can’t be done. Any data obtained by a single experimenter may be due as much to the experimenter as to his treatment conditions. No experimental data derived from a single experimenter can be considered as anything more than highly provisional unless replicated by at least one other investigator. We may assume any given experimenter to be accurate until the first replication is carried out. If there is very close agreement between the results of the replication and of the original study, the hypothesis of experimenter accuracy is not discredited, though of course it is not confirmed either. If the results tend to be quite different but not significantly opposite in direction, we may suspend judgment until further replications are carried out. If the results are significantly opposite in direction, we are more assured than ever that the results are biased with respect to each other. Our solution again is to demand further replication. We may find that after a series of replications our original study and the first replication yielded the two most discrepant results, with all subsequent replications filling in the central area of what now begins to look like a normal distribution. Now we, in practice, can conclude (or more accurately, define) the first two studies as each yielding biased data, biased with respect to the grand mean data obtained and opposite in direction. On the other hand,
if after our original study and one replication, the next several studies agree clearly with one of the first two, we may decide that the mean of the results of the studies in agreement will constitute our ‘‘correct’’ value in terms of which we define the other earlier study as quite biased. In any case, then, replication is essential not only to assess the accuracy of obtained data but also to help us correct for any inaccuracy of data. In general, the more discrepant the results from two or more subjects, the more subjects are needed to establish certain parameters. And, in general, the more discrepant the results of two different experiments, the more replications of the entire study by different experimenters are required. In view of the importance of replications to the conclusions we will draw about experimenters’ operating characteristics, and ultimately about nature, we will focus our attention further on the problem of replication.
18 Replications and Their Assessment
The crucial role of replication is well established in science generally. The undetected equipment failure, the rare and possibly random human errors of procedure, observation, recording, computation, or report are known well enough to make scientists wary of the unreplicated experiment. When we add to the possibility of the random ‘‘fluke,’’ common to all sciences, the fact of individual organismic differences and the possibility of systematic experimenter effects in at least the behavioral sciences, the importance of replication looms larger still to the behavioral scientist. What shall we mean by ‘‘replication’’? Clearly the same experiment can never be repeated by a different worker. Indeed, the same experiment can never be repeated by even the same experimenter (Brogden, 1951). At the very least, the subjects and the experimenter himself are different over a series of replications. The subjects are usually different individuals and the experimenter changes over time, if not necessarily dramatically. But to avoid the not very helpful conclusion that there can be no replication in the behavioral sciences, we can speak of relative replications. We can order experiments on how close they are to each other in terms of subjects, experimenters, tasks, and situations. We can usually agree that this experiment, more than that experiment, is like a given paradigm experiment. When we speak of replication (and, in a sense, this entire book is an argument that we do so) in this section, we refer to a relatively exact repetition of an experiment.
The Replication Shortage and Inferential Models
In the real world we may count two sorts of replications: those carried out and those reported. The latter, unfortunately, are a special case of the former and certainly not a random subsample. The difference in number between replications carried out and those reported is some unknown dark figure, a figure that depends, to some extent at least, on our view of statistical inference. The “null-hypothesis decision procedure” (Rozeboom, 1960), advocated by many statisticians, tends to establish certain critical p values as the definitions of whether a difference has “truly” been obtained. Now this might be nothing more than a semantic convention if it were not for a tendency
among authors and editors to prefer publication of results with an associated p value less than some critical point, usually .05 or .01.¹ This tends to result in the publication of a biased sample of experiments (Bakan, 1965; McNemar, 1960; Smart, 1964; Sterling, 1959). It has usually been argued that published experiments are biased in the direction of Type I errors in that record is only made of the “.05 Hits” while the “.06–.99 Misses” are kept off the market. That may well be true. However, it can be argued that Type II errors may also be increased by the adoption of critical p values. Suppose that a series of experiments has been carried out, all making similar comparisons between an experimental and a control condition. None of the results obtained by the five experimenters were statistically “significant.” None are published, and the experimenters may not even be aware of the existence of four replications of their work. Table 18-1 gives the hypothetical results of the five studies. Although even the combined (say, by Fisher’s method) probabilities of the five studies may not reach some conventional level of significance, we note that in each study the experimental group performance exceeds the control group performance, and by a similar amount in each case. Considering these five differences, they are very unlikely to have occurred if the differences between the experimental and control conditions were, in fact, symmetrically distributed about zero (t = 9.50, df = 4, p < .001). There is a sense, then, in which Type II errors can be increased by our tendency to withhold publication of results not achieving a given level of significance (see also, Mosteller & Bush, 1954; Mosteller & Hammel, 1963).
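The two calculations mentioned above can be reproduced from the Table 18-1 values. The sketch below runs a one-sample t test on the five experimental-minus-control differences, then combines the five p values by Fisher's method. The closed form for the chi-square tail is valid because the degrees of freedom (twice the number of studies) are even. Variable names are illustrative only.

```python
from math import exp, factorial, log, sqrt
from statistics import mean, stdev

# Hypothetical results from Table 18-1.
experimental = [8.5, 7.0, 9.0, 9.5, 7.5]
control = [7.0, 5.5, 8.0, 7.5, 6.0]
p_values = [0.30] * 5

# One-sample t test of the five differences against zero.
diffs = [e - c for e, c in zip(experimental, control)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / sqrt(n))
df = n - 1

# Fisher's method: -2 * sum(ln p) is chi-square with 2k degrees of freedom.
k = len(p_values)
chi2 = -2 * sum(log(p) for p in p_values)
half = chi2 / 2
# Upper-tail probability of chi-square with an even number (2k) of df.
combined_p = exp(-half) * sum(half**i / factorial(i) for i in range(k))

print(f"t = {t_stat:.2f}, df = {df}")
print(f"Fisher chi-square = {chi2:.2f} on {2 * k} df, combined p = {combined_p:.2f}")
```

The t statistic comes out at about 9.49, which agrees with the reported 9.50 up to rounding. The combined p stays well above .05, as the text anticipates.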
In order to benefit properly from replications actually carried out, it is essential that these be routinely published, even if only as brief notes with fuller reports available from the experimenter, from a university library, or from the American Documentation Institute.² Without such availability our efforts to learn about behavioral phenomena in general (and, more specifically to the point of this book, our efforts to assess the effects of the experimenter) will continue to be seriously hampered.
Table 18–1 Hypothetical Results of Five Experiments

Experiment   Experimental   Control   Difference    p
1                8.5          7.0        +1.5      .30
2                7.0          5.5        +1.5      .30
3                9.0          8.0        +1.0      .30
4                9.5          7.5        +2.0      .30
5                7.5          6.0        +1.5      .30
Means            8.3          6.8        +1.5

1 A number of other workers have questioned the utility of the accept-reject model of inference (Bakan, 1965; Conrad, 1946; Eysenck, 1960; Wolf, 1961). Evidence that there are, psychologically if not statistically, critical p values (or “inferential cliffs”) among established investigators as well as the upcoming generation of graduate students has been presented recently (Rosenthal & Gaito, 1963; Beauchamp & May, 1964; Rosenthal & Gaito, 1964).
2 Similar pleas have been made by Wolf (1961), Wolins (1959), and Goldfried and Walters (1959). These last authors have proposed the publication of a special Journal of Negative Results patterned after the Psychological Abstracts.
It has often been lamented of late that too few investigators concern themselves with more or less precise replications (e.g., Lubin, 1957). As an enterprise, replication, it has been said, lacks status. Who, then, on any large scale will provide us with the necessary replications? McGuigan’s (1963) data and Woods’ (1961) suggest that there are now enough experiments carried out and reported by multiple authors for there to be no hardships in subdividing these studies into as many complete replicates as there are investigators. The total investment of time would not be increased, but the generality of the results would be. Although such replication within projects would help us assess experimenter effects to some extent, we may feel that such replication is not quite the same as a truly ‘‘independent’’ replication carried out by an experimenter in a different laboratory. The problem of the potentially dependent or correlated nature of replicators bears further comment.
Correlated Replicators
To begin with, an investigator who has devoted his life to the study of vision, or of psychological factors in somatic disorders, is less likely to carry out a study of verbal conditioning than is the investigator whose interests have always been in the area of verbal learning or interpersonal influence processes. To the extent that experimenters with different research interests are different kinds of people (and if we have shown that different kinds of people, experimenters, are likely to obtain different data from their subjects) then we are forced to the conclusion that within any area of behavioral research the experimenters come precorrelated by virtue of their common interests and any associated characteristics. Immediately, then, there is a limit placed on the degree of independence we may expect from workers or replications in a common vineyard. But for different areas of research interest, the degree of correlation or of similarity among its workers may be quite different. Certainly we all know of workers in a common area who obtain data quite opposite from that obtained by colleagues. The actual degree of correlation, then, may not be very high. It may, in fact, even be negative, as with investigators holding an area of interest in common but holding opposite expectancies about the results of any given experiment. A common situation in which research is conducted nowadays is within the context of a team of researchers. Sometimes these teams consist entirely of colleagues; often they are composed of one or more faculty members and one or more students at various stages of progress toward a Ph.D. Experimenters within a single research group may reasonably be assumed to be even more highly intercorrelated than any group of workers in the same area of interest who are not within the same research group.
And perhaps students in a research group are more likely than a faculty member in the research group to be more correlated with their major professor. There are two reasons for this likelihood. The first is a selection factor. Students may select to work in a given area with a given investigator because of their perceived and/or actual similarity of interest and associated characteristics. Colleagues are less likely to select a university, area of interest, and specific project because of a faculty member at that university. The second reason why a student may be more correlated with his professor than another professor might be is a training factor. A student may have had a large proportion of his research experience under
the direction of a single professor. Another professor, though he collaborates with his colleagues, has most often been trained in research elsewhere by another person. Although there may be exceptions, even frequent ones, it seems reasonable, on the whole, to assume that student researchers are more correlated with their adviser than another adviser might be. The correlation of replicators that we have been discussing refers directly to a correlation of attributes and indirectly to a correlation of data these investigators will obtain from their subjects. The issue of correlated experimenters or observers is by no means a new one. Over 60 years ago Karl Pearson spoke of ‘‘the high correlation of judgments . . . [suggesting] an influence of the immediate atmosphere, which may work upon two observers for a time in the same manner’’ (1902, p. 261). Pearson believed the problem of correlated observers to be as critical for the physical sciences as for the behavioral sciences.
Replication Assessment
What we have had to say about correlated replicators has implications for the assessment of replications. Such assessment may serve two goals: (1) to help us make a general statement of how well studied a given area of inquiry or a specific relationship might be, (2) to help us make a general statement of what the available evidence, taken as a whole, has to say about the nature of the relationship studied. Not only the worker who wants to summarize formally, as in a journal article (e.g., in the Psychological Bulletin), what is known of a given relationship but any investigator contemplating work in an area somewhat new to him might profit from some numerical system of replication assessment. Such a system is suggested here.³ The basic unit is the single experiment conducted by a single experimenter. Assuming a “perfectly” designed and executed study, we assign a value of 1.00. This would assume for a given research question, and standard sample size, N, that the appropriate (as defined by the consensus of colleagues) experimental treatment and control groups were employed, and that the data collector was effectively blind to the treatment group membership of each subject. Now this may seem like vague information with which to assign a numerical value to the soundness of an experiment, but the fact is that we are constantly making judgments of this sort anyway, and sometimes with even less information. There appears to be at least fair agreement on a ranking of the soundness of single studies in formal and informal seminars on research methodology. The really difficult step is the assignment of a numerical value. It should be noted that our interest at the moment is not in the assessment of the experimenter but of the experiment. Thus, we could find the experimental vs. control comparison in which we were interested regardless of whether the investigator was primarily interested in that particular comparison or not.
3 I want to thank Fred Mosteller for his helpful discussion of this procedure.
Certain comparisons of great interest to a given worker are often buried as a few sentences in a report by an investigator who has only an incidental interest in that comparison. In other words, the intent of the investigator is irrelevant to our purposes. It is the validity of the comparison that concerns us. Similarly, we are not concerned with the conclusion a
given investigator draws from his comparison, for such conclusions vary greatly in the degree to which they derive directly from the data. If an investigator finds A > B, it is that inequality which concerns us, not his explanation of how it came about. That explanation may be important, but it is not relevant to the question of replication as we are discussing it. If we grant that some agreement can be reached on the assessment of the single experiment, we can state the general principle that a replication of that experiment which obtains similar results is maximally convincing if it is maximally separated from the first experiment along such dimensions as time, physical distance, personal attributes of the experimenters, experimenters’ expectancy, and experimenters’ degree of personal contact with each other. The number of dimensions (n) that may prove useful in the future is not known at present, but we can restate our principle in geometric terms. The value of replications with similar results is maximized when the distance between replicates in the n-dimensional space is maximized.
The Replication Index
Now for a concrete example of how we might score a set of replicates to determine how much we know about a given relationship. An investigator conducts a sound study with only some minor imperfections of design or procedure. The mean rating assigned by a seminar of competent methodologists is .80. In a few months he replicates the study and his new score of .80 is added to his old. Now we “know” 1.60’s worth. One of his students replicates, and though we have argued that students are likely to be correlated with their professors, the student is a different person. We multiply the student’s replication value of .80 by 2 to weight the fact of lessened correlation of replicates. The student’s points (1.60) are added to his professor’s (1.60) for a total of 3.20 points.
Now, a colleague down the hall replicates the work, a friend, perhaps, who may still not be regarded as uncorrelated but who was trained by other people and who came to the same department for reasons other than working on this problem with this colleague. Doing the study in a very similar way the colleague earns a .80 for the study, but to credit his presumably lesser correlatedness we multiply that value by 3. He has taught us 2.40’s worth. We sum his points with those obtained until now and have 5.60. If the replication were carried out in a different laboratory by an investigator not known personally to the original worker or his correlated replicators we might want to assign an even higher weight, e.g., 5. Conducted by this stranger, a replication might give us 4.00 points to be added to the previously cumulated total of 5.60. So far, our hypothetical replicators have all found similar results, and all had no reason to expect otherwise. But now there is a researcher for whom the results, by now reported, make no sense whatever. His theoretical position would postulate just the opposite outcomes from those reported. Furthermore, he doesn’t know the original investigator personally, or any of the previous replicators, isn’t a thing like any of them, and to top it all off, his laboratory is halfway or more across the country. He replicates. His study’s basic .80 value gets us 8.00. The weighting of 10, which seems quite large, is due in no small measure to his expectancy, which is opposite to
all the other replicators’. We now have a cumulated replication value of 17.60. If we wanted to, we could establish a scale of evaluation such that our score of 17.60 represents a fairly respectable level of replicatedness. We could, for example, regard a total of less than 2.00 as hardly representing real replication at all, a total of 5.00 or more as a good beginning, and values over 10.00 as fairly respectable. The weighting system described and the particular weights arbitrarily employed in our examples are obviously intended only to be suggestive of the considerations relevant to a more precise system. We can sum up some of the major characteristics of the scoring system:
1. A very badly done experiment profits us little, upon even many replications. As the score per unweighted replicate approaches 0.00, no amount of replication can help us.
2. Replications by different investigators are worth more to us than replications by the same investigator.
3. The more different the replicators are from each other, the more value accrues to the total replicational effort.
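The worked example above reduces to a weighted sum of soundness scores. The sketch below reproduces its running totals, using the soundness value (.80) and the weights (1, 1, 2, 3, 5, 10) from the text. The function names and the label for the 2.00 to 5.00 range, which the text leaves unnamed, are assumptions added for illustration.

```python
# (replicator, soundness score, independence weight), as in the example.
replicates = [
    ("original investigator", 0.80, 1),
    ("same investigator, repeated", 0.80, 1),
    ("student of the investigator", 0.80, 2),
    ("colleague down the hall", 0.80, 3),
    ("stranger in a different laboratory", 0.80, 5),
    ("distant investigator, opposite expectancy", 0.80, 10),
]


def replication_index(reps):
    """Sum of soundness x weight over all replicates, rounded to 2 places."""
    return round(sum(score * weight for _, score, weight in reps), 2)


def describe(total):
    """Map a total onto the rough evaluation scale suggested in the text."""
    if total < 2.00:
        return "hardly real replication"
    if total > 10.00:
        return "fairly respectable"
    if total >= 5.00:
        return "a good beginning"
    return "unlabelled range"  # the text does not name the 2.00-5.00 band


running = 0.0
for who, score, weight in replicates:
    running += score * weight
    print(f"{who:45s} +{score * weight:5.2f} -> {running:5.2f}")

total = replication_index(replicates)
print(f"replication index = {total:.2f} ({describe(total)})")
```

Because every term is multiplied by the soundness score, a score near 0.00 keeps the index near zero however many replications are added, which is the first characteristic listed above.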
The replication index yields a summary statement of how well studied a given problem is, regardless of whether replication results are consistent or inconsistent. However, the index also yields a summary statement of how confident we can be of the specific results obtained if the results are all in the same direction. In the not infrequent situation where some replication results are in opposite directions, we apply the scoring system separately to all those replications yielding results in one direction and then again to those replications yielding results in the opposite direction. The difference between the two scores obtained gives some indication of which result is better established. It is entirely possible that the scoring system suggested can help clarify a set of opposite results. Suppose that of ten experiments five have found A > B and five have found A < B. If one of these sets of five studies was carried out by a single investigator and one or two of his students, whereas the other set was carried out by less correlated experimenters, including some with opposite expectancies, there could be an overwhelming superiority in the points earned by the latter set of replications. This would be especially true if, in addition, there were some reason for assigning a lower score for the individual replicates in the set of studies conducted by the more correlated replicators. At least in some cases, then, it seems more valuable to compare contradictory sets of data on our replication index than to simply say there are five studies “pro” and five studies “con.” There may, of course, still be those puzzling situations where the pro studies and con studies each earn high and similar replication index scores.
The Generality Index
We have talked very much as though the replications discussed were virtually “exact.” The index of replication can also be applied, however, to only approximate replications.
If we were interested in the effects of anxiety on intellectual performance, a more or less ‘‘exact’’ replication would require more or less identical procedures for arousing anxiety and measuring intellectual performance. We could as well apply our index to not-so-exact ‘‘replications’’ in which different arousal
procedures and different measures of intellectual functioning were employed. (The less exact the replications, the more the individual study’s score for ‘‘soundness’’ may vary.) A higher score on the replication index for a given research question implies greater generality for the results, assuming these results to be fairly consistent. Because of our special interest in the experimenter, we have dealt primarily with the problem of interexperimenter correlation in our discussion of the assessment of replication. If we were interested in the more general problem of generality, as we often are, we could readily extend our index to include other, nonexperimenter factors increasing the generality of our data. Thus, in the example given earlier of the effect of anxiety on intellectual performance we might give more points on a generality index for a ‘‘replication’’ that employed different methods for arousing anxiety and for measuring intellectual performance. If a sample of males were employed where females had been employed before, or grocery clerks where college students had been employed before, or animals where humans had been employed before, we would weight more heavily the contribution of the ‘‘replication’’ to the generality index. In effect, the generality index can differ from the replication index only to the extent that the replications are only approximately similar experiments.
Anecdotal Replication
In order that we not be wasteful of information we must have a place in our replication index or generality index for information derived from sources other than formal experiments. For an appropriate example we may return to our hypothetical study of the effect of anxiety on intellectual performance. Suppose that the experimenter in his role as educator has observed many instances in which students’ anxiety has lowered their examination performance. Suppose further that our investigator has never observed an instance in which anxiety (of a given magnitude) led to improved examination performance. If other people had also made the same observation and also found no negative instances, we would have some additional evidence for the relationship between anxiety level and intellectual performance. Such evidence we usually regard as anecdotal, and that term often carries a negative connotation. On the other hand, however, we can argue that there is a continuity of more and less elegant circumstances of observation which ranges from the fairly crude anecdote to the more elegant anecdotes of the ethologist, the survey researcher, and finally the variable-manipulating experimenter employing control or comparison groups. We can argue further that the most elegant experiment differs from the cruder anecdote only as to the plausibility of the conclusions reached on its basis. Such plausibility, in the final analysis, is defined in psychological terms, such as the degree of belief or conviction it inspires in qualified workers in the area. The well-controlled experiment, then, may be seen as a more formal anecdote, more or less convincing, as with any anecdote, as a function of who “tells” it, how well and carefully it is told, and how relevant it is to the question under study.
If we can assign ‘‘soundness’’ points to the experiment, and weight these points to establish a replication or generality index, we ought to be able to do the same thing for the cruder anecdote. The ‘‘soundness’’ points assigned would usually be some value lower than
if it were a more systematic anecdote, as is the formal experiment. Arbitrarily, let us assign a score of .10 to any ‘‘well-told’’ anecdote for which no contrary anecdote can be found after honest efforts to find them.4 This search for negative instances is central and can be most usefully pursued by enlisting the reminiscences or observations of workers whose theoretical position would suggest contrary anecdotes. In practice, anecdotes on either side of a theoretical question are likely to cancel each other out. Where they do not, we have fairly powerful sources of additional evidence. The weighting of the soundness scores of anecdotes can be as was described for more formal experiments: more weight given to replicated anecdotes as a function of the noncorrelatedness of the raconteur. Such weights, then, might vary from ‘‘1’’ for a new consistent anecdote by the same teller to ‘‘10’’ for a consistent anecdote told by a very different observer whose theoretical orientation would suggest a contrary anecdote. In order to encourage more systematic observations and discourage an interpretation of these remarks as favorable to a swing to anecdotes as major or even exclusive sources of evidence, we can add the restriction that very informal anecdotes are not scored as greater than zero value in a replication or generality index unless the score on that index has already achieved a given level (e.g., a 2.00 score) on the basis of more formal research. There are situations in which anecdotes of greater or lesser elegance are actually more valuable than more formal experiments. Consider some research question that has been well replicated by different experimenters, such that a very respectable replication index score has been achieved. Assume further that the different experiments yield results quite consistent with one another (e.g., A > B). 
But now suppose that a fair number of less formal anecdotes, including very casual observation, experiments in nature, and field studies, are also quite consistent with each other but inconsistent with the results of the more formal laboratory experiments, such that A < B always. In such a case it may be that the formal experiments as a set are biased with respect to more ‘‘real-lifelike’’ situations. This sort of bias could occur even though the experimenters were completely unbiased in the sense in which we have used that term. It could well be that the very laboratory nature of the experimenter-subject interaction systematically so changes the situation that the more usual extraexperimental response is quite reversed. This effect of the experimental situation on subjects’ responses has been frequently discussed and even labeled (e.g., experimental back-action or backlash effect). The demand characteristics of the experimental situation (Orne, 1962), although varying from experiment to experiment, may have, for a given type of study, such communality that the results of even an entire set of experiments may be quite biased. For this reason, and because of other special characteristics of the laboratory experimental situation (Riecken, 1954; Riecken, 1962), there may be occasions on which anecdotes, less formal than the experiment, may be more valuable than additional laboratory experiments. One view of the more informal source of evidence that emerges in part from what we have said is that there are phases in systematic inquiry in which more anecdotal evidence is more likely to have special relevance. Before a program of experiments is undertaken, informal evidence seems useful in guiding the direction of, or even in
4 For the situation where the anecdote is of the somewhat formal sort (anthropological field reports), Naroll (1962) has made an outstanding contribution through his development of the ‘‘observation quality index.’’ This is essentially a method for assessing the reliability of the raconteur.
justifying the very existence of, the experimental program. Then later, at the completion of the program, a systematic search for (preferably new) anecdotal evidence seems indicated to reassure us that the general findings of the more formal research program are consistent with more nearly everyday experience. Nothing in what we have said about the formal experimental situation should be so construed that the laboratory setting comes somehow to be regarded as ‘‘unreal.’’ Different it is, of course. But at the same time, it is as real a situation as any other, though perhaps less common than the word ‘‘everyday’’ implies (Mills, 1962). Whether we can reasonably generalize from the laboratory to ‘‘everyday’’ life, then, is an empirical question to be answered by observing both, rather than a philosophical question to be answered on any a priori grounds.5
5 The same reasoning can be applied to the often-asked question of whether we can reasonably generalize from studies of animal behavior to human behavior.
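The scoring of anecdotes proposed earlier in this chapter can be sketched in a few lines. The sketch uses the figures suggested in the text (a soundness score of .10 per uncontradicted anecdote, weights from 1 to 10, and a gate of 2.00 on the index earned from formal research). The way the pieces are combined (soundness multiplied by weight, then summed) is only an assumption made for illustration, and the names `Anecdote` and `anecdotal_contribution` are hypothetical.

```python
from dataclasses import dataclass

ANECDOTE_SOUNDNESS = 0.10  # arbitrary per-anecdote score suggested in the text
FORMAL_THRESHOLD = 2.00    # example gating level mentioned in the text


@dataclass
class Anecdote:
    weight: float               # 1 (same teller) up to 10 (very different, opposed observer)
    contradicted: bool = False  # True if a contrary anecdote was found


def anecdotal_contribution(formal_index, anecdotes):
    """Points contributed by anecdotes.

    Assumption for illustration: each uncontradicted anecdote adds
    soundness x weight, and the contributions are summed.
    """
    # Anecdotes count for nothing until formal research has reached the gate.
    if formal_index < FORMAL_THRESHOLD:
        return 0.0
    total = 0.0
    for anecdote in anecdotes:
        if not 1 <= anecdote.weight <= 10:
            raise ValueError("weight must lie between 1 and 10")
        if anecdote.contradicted:
            continue  # a contrary anecdote was found, so no points are given
        total += ANECDOTE_SOUNDNESS * anecdote.weight
    return total
```

Under these assumptions, a research question whose formal index stands below 2.00 gains nothing from anecdotes, however many are told, while one at or above 2.00 gains most from anecdotes told by observers whose theoretical orientation would have suggested the opposite.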
19 Experimenter Sampling
Much of this book has been devoted to showing that an experimenter’s expectancy may be an unintended determinant of the results of his research. This chapter and those to follow are addressed to the question of what can be done to control the effects of the experimenter’s expectancy. A number of strategies will be proposed. Some of these strategies will be recognized as direct attempts to minimize expectancy effects. Somewhat paradoxically, some of these strategies will be recognized as attempts to maximize these effects. In this chapter we shall discuss strategies that seek neither to minimize nor to maximize but rather to randomize and ‘‘calibrate’’ experimenter expectancies. In preceding chapters we have alluded to the advantages accruing from the employment of samples of experimenters rather than the more usual single data collector. In this chapter some of these advantages will be discussed in more detail. The employment of samples of data collectors is already a common practice in survey research (Hyman et al., 1954). In part this is due to the logistic problem of trying to obtain responses from perhaps thousands, or even millions, of respondents. In part too, however, the practice of sampling data collectors is part of a self-conscious strategic attempt to assess the influence of the data collector on the results of the survey (e.g., Mahalanobis, 1946). In other kinds of psychological research (e.g., laboratory experiments), the number of subjects contacted is low enough for a single experimenter to collect all the data easily. The necessity for employing samples of experimenters in these cases is not logistic but strategic.1 It was stated earlier that in principle we cannot assess the experimenter’s accuracy at all without having at least one replication to serve as the reference point for the definition of accuracy. 
And as our sample of experimenters increases in size beyond two, we are in an increasingly good position to assess not only the experimenter’s accuracy but his bias and consistency as well.
1 The practical problem of obtaining samples of data collectors for laboratory research was discussed in the last chapter and has been found not at all insurmountable.
Subdividing Experiments
With the sample size of subjects fixed, the larger the sample of experimenters, the smaller the subsample of subjects each data collector must contact. Subdivision of the experiment among several experimenters may in itself serve to reduce the potential biasing effects of the experimenter.
Learning to bias. We have suggested that experimenter bias may be a learned phenomenon, and that within a given experiment the experimenter may learn from the subjects’ response how to influence subjects unintentionally. This learning process takes time, and with fewer subjects from whom to learn the unintentional communication system there may be less learning to bias. Even if the interpretation of bias as a learned phenomenon were in error, the basic evidence that bias increases as a function of the number of subjects contacted by each experimenter should encourage the use of more experimenters and fewer subjects per experimenter.
Maintaining blindness. A second advantage gained when each experimenter contacts fewer subjects is related particularly to the method of blind contact with subjects. In discussing that method in a subsequent chapter it will be suggested that if enough subjects were contacted, the experimenter might unintentionally ‘‘crack the code’’ and learn which subjects are members of which experimental group and/or the nature of the experimental treatment subjects had received. The fewer subjects each experimenter contacts, the less chance of an unwitting breakdown of the blind procedure.
Early returns. A third advantage of having fewer subjects contacted by each experimenter, a ‘‘psychological’’ advantage, derives from the finding that early data returns may have a biasing effect upon subsequent data. With more experimenters the entire experiment can be completed more quickly if facilities are available for the simultaneous collection of data by different experimenters.
With all the results of a study ‘‘nearly in’’ there is less need for the principal investigator to get a glimpse of the early returns and hence less chance for the operation of the biasing effects of these returns. A limiting case of contacting fewer subjects would, in fact, eliminate entirely the effect of early data returns on the biasing behavior of the data collector. If each experimenter contacted only a single subject in each treatment condition, his obtained data could not, of course, influence the data of any other subject in the same condition. Where there are no later data, there can be no effect of early data returns. Although there may be some merit to the procedure of allowing each experimenter only a single subject per experimental condition there are two drawbacks. One of these is logistic. It would not be very efficient to train a data collector for a given experiment and have him contact only a single subject per condition. On the other hand, there may be situations in which the utility of the procedure outweighs the increased cost. The other drawback to this procedure is that it provides us with no estimate of individual differences among subjects. The variation among subjects within conditions is confounded with the variation among experimenters. This may not be too serious, however. If we can be satisfied with an estimate of the effect due to the treatment condition and that due to the differences among experimenters, we may
be willing to forego the within cells mean square. Even if each experimenter contacts only a single subject in a single condition we could still evaluate the effects of the treatment, although we could get no estimate of the variation among either subjects or experimenters.
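The trade-off just described can be made concrete with a small sum-of-squares decomposition. The sketch below assumes a layout in which every experimenter contacts exactly one subject in each treatment condition. The function name, the dictionary keys, and any scores passed to it are hypothetical, and the decomposition is the standard one for a two-way table without replication rather than a procedure specified in the text.

```python
def two_way_no_replication(scores):
    """Partition the variation when scores[e][t] is the single subject's
    score obtained by experimenter e under treatment t.

    Returns a dict mapping each source to (sum of squares, degrees of freedom).
    """
    n_exp = len(scores)
    n_trt = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_exp * n_trt)

    exp_means = [sum(row) / n_trt for row in scores]
    trt_means = [sum(scores[e][t] for e in range(n_exp)) / n_exp
                 for t in range(n_trt)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_exp = n_trt * sum((m - grand) ** 2 for m in exp_means)
    ss_trt = n_exp * sum((m - grand) ** 2 for m in trt_means)
    # What is left over mixes subject differences with any
    # experimenter-by-treatment interaction; the two cannot be separated.
    ss_resid = ss_total - ss_exp - ss_trt

    return {
        "treatment": (ss_trt, n_trt - 1),
        "experimenter": (ss_exp, n_exp - 1),
        "residual": (ss_resid, (n_exp - 1) * (n_trt - 1)),
        # One subject per cell leaves no degrees of freedom within cells.
        "within_cells": (0.0, 0),
    }
```

With three experimenters and two treatments, for instance, the call yields separate treatment and experimenter terms and a two-degree-of-freedom residual, but the within-cells entry has zero degrees of freedom, which is the estimate the text says we forgo.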
Increasing Generalizability
If there were no effect of earlier upon later obtained data, nor indeed any form of experimenter expectancy effect, we would still benefit greatly from the employment of samples of experimenters. As Brunswik (1956) and Hammond (1954) have pointed out, this would greatly increase the generality of our research results. Because of differences in appearance and behavior, different experimenters serve as different stimuli to their subjects, thereby changing to a greater or lesser degree the experimental situation as the subject confronts it. When only a single experimenter has been employed, we have no way of knowing how much difference it would have made if a different experimenter had been employed. The results of the research are then confounded with the stimulus value of the particular experimenter. We would have little confidence in a prediction of the results of a subsequent experiment employing a different experimenter except the prediction that the result would probably be different. The more experimenters we employ the better, but even the modest addition of a single experimenter helps a great deal. We not only would be able to predict that the result of a subsequent experiment would fall somewhere near the mean of our two experimenters’ results but would be able to say something of how much deviation from this value is likely. In other words, with as few as two experimenters we can make a statement of experimenter variance. In principle, of course, this line of reasoning holds only when experimenters are sampled randomly. A little later, we shall speak of automated data-collection systems (ADCS) and shall stress their value as a means of avoiding differential treatment of subjects. Here it must be added that any ADCS has its own special stimulus value (McGuigan, 1963). We can then regard any given ADCS with its particular stimulus settings as just another experimenter, although a very ‘‘standardized’’ one.
To increase the generality of the obtained results, therefore, we must sample a variety of ADCS’s or at least a variety of settings of a single ADCS. The employment of samples of data collectors, necessitated by their individual differences, may be viewed as a boon to, rather than the price of, behavioral research. Built-in replications, although they bring with them the data collector as a source of variance (which can be measured and handled statistically), also bring a greater robustness to our research findings.2 From the point of view now, not so much of generality but of the control of experimenter expectancy effects, there are three conditions involving experimenter sampling which will be discussed in turn. In the first of these conditions, the sampled
2 It is the name of Brunswik that rightly comes to mind when we speak of the increased generality deriving from the sampling of experimenters and their associated procedural variations. But it would be a mistake to assume that ‘‘more classic’’ or ‘‘traditional’’ workers in the field of experimental design would disagree with Brunswik. R. A. Fisher (1947), for example, though speaking of procedural variation not explicitly associated with different data collectors, makes the same point.
experimenters’ expectancies are unknown and indeterminable. In the second of these conditions, experimenters’ expectancies are known before the sampling. In the third condition, experimenters’ expectancies are known only after the sampling has occurred.
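The earlier remark that two experimenters already permit a statement of experimenter variance can be illustrated with a short calculation. The sketch below treats each experimenter's mean obtained result as one observation and uses the sample variance from Python's standard `statistics` module. The function name and any numbers supplied to it are hypothetical, and the calculation presumes, as the text requires, that experimenters were sampled randomly.

```python
import statistics


def experimenter_summary(experimenter_means):
    """Return (mean, sample variance) of the experimenters' mean results.

    The variance is None when only one experimenter was employed, because
    a single experimenter gives no basis for estimating it.
    """
    mean = statistics.mean(experimenter_means)
    if len(experimenter_means) < 2:
        return mean, None
    return mean, statistics.variance(experimenter_means)
```

With a single experimenter the function can report only a mean. Adding one more experimenter is enough to obtain a variance estimate, although an estimate resting on two experimenters is of course a very rough one.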
Expectancies Unknown
Population Characteristics
There may be experiments in which we decide to employ a sample of experimenters but in which we have no way of assessing the experimenters’ expectancies. We may draw such a sample from a variety of populations differing in the number of restrictions imposed. Perhaps the least restrictive population of potential experimenters would be all those who are physically and intellectually capable of serving as experimenters. If we choose such a population we earn perhaps the greatest degree of generalizability of our data, but at the cost of representativeness of the real world or ecological validity. Ecological validity is sacrificed, however, only in the sense that most experimenters who have in the past collected data have been drawn from less broadly defined populations. Most experimenters in a given experiment are not simply organismically capable of collecting the data. They are further selected on the basis of an interest in research generally and an interest in the particular research question they are trying to answer. They may be further selected on the basis of the expectancy they hold about the outcome. They may, as a corollary, be selected for personality characteristics associated with people doing research in a given area of behavioral science and having certain outcome orientations. Because real experimenters are so highly selected (i.e., drawn from such a relatively restricted population of capable data collectors), it might be very difficult to draw a large sample of such experimenters for our purposes. There is, however, a trend for less highly selected experimenters to collect data for serious scientific purposes. Not only more and more graduate students are collecting behavioral data but undergraduates as well. As this trend continues and accelerates, our employment of less fully professional experimenters will become more and more representative of the ‘‘real world’’ of data collection.
At least it seems not at all farfetched to draw samples of advanced undergraduate students in the behavioral sciences and generalize from their results to what we might expect from advanced undergraduate research assistants. It seems, then, that we may not be sacrificing too much ecological validity, after all, by employing samples of less than fully professional data collectors. The random assignment of experimenters to experiments gets around the potential problems of self-selection on the basis of hypotheses. Experimenters, naturally enough, spend their time collecting data relevant to a question to which they are likely to expect a given answer. If the investigator, though he may have an expectancy about the outcome, employs a random sample of data collectors, he may protect the data from the effects of his own expectancy. This would be especially true if the sampling of experimenters were combined with some of the control strategies described in subsequent chapters. If indeed there are personality
characteristics or other attributes associated with an experimenter’s choice of research question, the data collected by that experimenter are likely to show a certain amount of error, though not necessarily bias. The random assignment of experimenters also gets us around this potential problem of self-selection for correlated attributes.
Cancellation of Biases
Simply selecting our experimenters at random does not imply that they will have no expectancies. The expectancies they do have, however, are more likely to be heterogeneous, and the more so as we have not tried to select experimenters very much like the experimenters who have in the past collected data within a given area of research. The more heterogeneous the expectancies, the greater the chance that the effects of expectancies will, at least partially, cancel each other out. The classic discussion of the canceling of biases is that by Mosteller (1944) for the situation of the survey research interviewer, a situation that in principle does not differ from that of the laboratory experimenter. If we can hope for a canceling of expectancy bias, we can also hope for a canceling of modeling biases. But where the experimenters’ expectancies and their own task performances are unknown, we can only hope for such a cancellation. And even if this information were available we could not count on a cancellation. The various expectancies represented in our sample may be held with different intensities, resulting in different magnitudes of expectancy effect. Or particular expectancies may be correlated with personality characteristics or other attributes that are themselves associated with the degree of unintended influence exerted by the experimenter. An example may be helpful. Suppose we want only to standardize a set of photographs such as those we have often used as to the degree of success or failure reflected by the persons pictured.
We select at random 20 experimenters, all enrolled in a course in experimental psychology. For the sake of simplicity let there be only two expectancies among experimenters: (1) that the photos will be rated as successful and (2) that they will be rated as unsuccessful. Let us suppose further that the ‘‘true’’ mean value of the photos is at the exact point of indifference. If ten of our experimenters expect success ratings and ten expect failure ratings, and the magnitudes of their expectancy effects are equal, we obtain a grand mean rating that is quite unbiased. That situation is the one we hope for. But now suppose that the ten experimenters who expect to obtain success ratings differ from the experimenters expecting failure ratings in being more self-confident, more professional in manner, more businesslike, and more expressive-voiced. These are the experimenters, we have already seen, who are more likely to influence their subjects in the expected direction. The ten experimenters expecting failure ratings do not equitably influence their subjects in the opposite direction. Their mean obtained rating is, therefore, at the point of indifference, and they cannot serve to cancel the biasing effects of our more influential success-expecters. The grand mean rating obtained will be biased in the ‘‘success’’ direction. Troublesome as this situation may be, we should note that it is still better than having employed only a few self-selected, success-expecting experimenters. In this particular example we would have been best served by selecting only those experimenters who could not implement their
expectancy. But, of course, in our example we have given ourselves information not ordinarily so readily available. The hoped-for cancellation of bias may also fail for reasons residing in the experimental task. A good example might involve a ‘‘ceiling effect.’’ Suppose a large number of children have been tested on a group administered form of a new perceptual-motor task. The testing was done under those conditions of administration maximizing their performance as the originator of the task intended. Now suppose that to establish the reliability of the task performance all the children are retested, this time with an individually administered alternate form of the task. Again, we employ twenty data collectors, and again they have one of two possible expectancies about the children they will test: (1) that they are very well-coordinated and (2) that they are very poorly coordinated. By their manner during the interaction with the children, those experimenters expecting poor performance obtain poorer performance. On the average the children’s performance on this alternate form retest is lower by some amount than it was on the originally administered test. We can see that this bias cannot be canceled. Controlling for scoring errors, the youngsters tested by experimenters expecting good performance cannot perform any better than they did on the pretest. Regardless of any experimenter characteristics facilitating unintentional influence, organismic limits permit no biasing in the direction of better performance. The grand mean of our obtained retest data has been biased in the low direction by the inability of half the experimenters to exert equivalent and opposite bias. In this case, the retest reliability of the task has also been biased in the low direction. 
Interestingly enough, if the experimenters expecting very good performance had been able to bias their subjects’ performance equivalently there would have been no bias in the grand mean performance obtained, but the correlation between the pretest and posttest would have been even further lowered. If all experimenters showed the same expectancy effect, the grand mean performance would have been maximally biased, but the retest reliability would not, of course, have been affected at all. This assumes, as we have here, that any experimenter of one expectancy exercises the same magnitude of effect as any other experimenter in that same expectancy condition. An interesting example of asymmetrical effects of bias has been reported by Stember and Hyman (1949). In their analysis of an opinion survey they found that interviewers holding the more common opinion tended to report data that inflated the number of respondents to be found with that same opinion. Interviewers holding the less common opinion, however, inflated the ‘‘don’t know’’ category rather than the category of their own opinion. In this case, which we can regard as modeling bias, we again see a failure of the cancellation of bias. The grand mean response was inflated in the more commonly held opinion category. One interpretation of this unexpected finding proposes that an expectancy bias may have been operating simultaneously. Thus, if it is generally known what the majority opinion is, and if it is known also that there is a heavy majority, then all interviewers may have the expectancy that they will obtain majority opinions at least most of the time. This expectancy by itself may inflate the expected majority opinion category. For interviewers whose own opinion is the majority opinion, their modeling bias may act in conjunction with their expectancy bias to inflate the majority opinion category even more. 
However, the minority opinion-holding interviewers have a modeling bias which runs counter to their expectancy and serves in fact to cancel it.
Left with neither an unopposed modeling bias nor unopposed expectancy, the neutral ‘‘don’t know’’ category is inflated. Whether this is what happened in the Stember and Hyman study or not cannot be easily determined. But this analysis does illustrate the possibility that opposing biases within the same experimenter may cancel each other and that consonant biases may reinforce each other. It should be mentioned that the bias in this study could have been one of interpretation or coding rather than a bias affecting the subjects’ response, but this does not alter the relevance of the illustration. We have already suggested that the experimenter’s attitude toward the results of his research may affect his observation, recording, computation, and interpretation as well as his subjects’ responses. From all that has been said it seems clear that we cannot depend on the complete cancellation of biases in a sample of experimenters. But the argument for sampling experimenters is still strong. At least by sampling experimenters we have the possibility of cancellation of biases, whereas if we use only a single experimenter we can be absolutely certain that no cancellation of bias is possible.
Homogeneity of Results
Employing samples of experimenters will often provide us with considerable reassurance. If all of a sample of experimenters obtain similar data we will not err very often if we assume that no bias has occurred and that, in fact, no effects whatever associated with the experimenter have occurred. On these occasions we have good reason for arguing that only one experimenter would have been required. But, obviously, there is no way of knowing this heartening fact without having first employed experimenter sampling. The homogeneity of obtained results should not be so reassuring to us if the sampling of experimenters has been very restrictive.
If our sample included only data collectors holding one expectancy regarding the data to be obtained, as might occur if we selected only experimenters who had selected a given hypothesis for investigation, our results would be homogeneous still, but biased too. The homogeneity of obtained results is convincing in direct proportion to the heterogeneity of the experimenters’ expectancies and other experimenter attributes. In this section we have discussed the advantages of sampling experimenters even though their expectancies are unknown and indeterminable. Under these circumstances we benefit greatly in terms of the increased generalizability of our data, but our controls for expectancy effects are at best haphazard. No correction formulas can be written to control statistically the effects of experimenter expectancies. To write such corrections we must know what experimenters’ expectancies (and related sources of error) are like. In some cases these expectancies may be well known before sampling, and in other cases, although not known before sampling, they can be assessed after sampling. We will discuss next the situation in which experimenter expectancies are generally known before the sampling occurs.
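Before moving on, the two failures of cancellation discussed in this section (unequal influence among experimenters, and a ceiling on performance) can be illustrated numerically. The sketch below is only a toy model: each experimenter is assumed to shift his subjects' mean response by a fixed amount in the direction of his expectancy, and every number in it (the size of the shifts, the true values, the ceiling) is hypothetical. The function name `grand_mean` is likewise introduced only for illustration.

```python
def grand_mean(true_value, expectancy_shifts, ceiling=None):
    """Grand mean of the data obtained by a sample of experimenters.

    expectancy_shifts holds one signed shift per experimenter; an optional
    ceiling caps what any experimenter's subjects can achieve.
    """
    obtained = []
    for shift in expectancy_shifts:
        value = true_value + shift
        if ceiling is not None:
            value = min(value, ceiling)  # performance cannot exceed the limit
        obtained.append(value)
    return sum(obtained) / len(obtained)


# Photo-rating example: 10 success-expecters and 10 failure-expecters,
# with the true mean rating at the point of indifference (0.0).
equal_influence = [+1.0] * 10 + [-1.0] * 10    # biases cancel
unequal_influence = [+1.0] * 10 + [0.0] * 10   # failure-expecters exert no influence

# Ceiling example: subjects already performed at their limit (50.0),
# so only the experimenters expecting poor performance can move the data.
retest_shifts = [+5.0] * 10 + [-5.0] * 10

unbiased = grand_mean(0.0, equal_influence)
biased_toward_success = grand_mean(0.0, unequal_influence)
biased_low = grand_mean(50.0, retest_shifts, ceiling=50.0)
```

In this toy model the first grand mean stays at the point of indifference, the second is pulled half a unit toward ‘‘success,’’ and the third falls below the true value of 50 because the upward shifts are absorbed by the ceiling. None of this information would ordinarily be available to the investigator, which is why the cancellation can only be hoped for.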
Expectancies Known Before Sampling
In all the sciences there are investigators whose theories and hypotheses are so well known or so easily inferred that there can be wide agreement on the nature of their
expectancy for the results of their research. Academic scientist A designs and conducts an experiment in order to demonstrate that his expectancy is warranted. Academic scientist B may design and conduct an experiment in order to demonstrate his expectancy that scientist A’s expectancy is unwarranted. This, of course, is scientific controversy at its best: taken into the laboratory for test. If they are in the behavioral sciences, our two scientists may design and conduct quite different experiments to arrive at their conclusions. Each may obtain the expected results whether or not any unintended biasing effect occurred, and feel his own position to be strengthened. So long as they conducted different experiments, we can have nothing to say about the occurrence of expectancy effects. Sooner or later members of one camp are likely to attempt a more or less exact replication of the other camp’s experiment. If they obtain data in agreement with the original data, we are somewhat reassured that the role of expectancy effects in either study was minimal. But if they obtain contradictory data, can we attribute the difference to expectancy effects? Probably not, because geographic and temporal factors, subject population, and experimenter attributes all covaried with the possible expectancy effect.
Collaborative Disagreement
For the resolution of theoretical and empirical issues important enough to engage the interest of two or more competent and disagreeing scientists, it seems worthwhile to coordinate their efforts more efficiently. At the design stage the opponents might profitably collaborate in the production of a research plan which by agreement would provide a resolution of the difference of opinion. At the stage of data collection, too, the opponents may collaborate either in person or by means of assistants provided by both scientists.
Conducted at the same place, using a common pool of subjects and the same procedures, the two (or more) replicates should provide similar results. If they do not, we may attribute the difference either to the effects of the differing expectancies or to experimenter variables correlated with the differing expectancies. Such collaboration of disagreeing scientists has taken place. Disproportionately often this ‘‘committee approach’’ to the resolution of scientific controversy has been applied to controversies involving either ‘‘borderline areas’’ of science or areas having major economic or social implications. Such an approach has been suggested for the investigation of parapsychological phenomena, for alleged cancer cures, and for study of the effects of smoking on the likelihood of developing cancer. In such cases even the scientific layman can readily infer the scientific ‘‘antagonist’s’’ expectancies or at least some potential sources of such expectancies. When the press described the distinguished panel of scientific ‘‘judges’’ preparing the United States Public Health Service report on the effects of smoking, it carefully noted for each member whether he was or was not himself a smoker. Laymen (and some scientists) were forced to reject the hypothesis that the committee’s evaluation might have been biased by their expectancies or preferences by noting that its conclusions were uncorrelated with their own smoking habits. On the other hand, when the press reported the dissenting view of scientists employed by the tobacco industry, the report was clearly if implicitly written in the tone of ‘‘Well, what else would you expect?’’ It is, of course, not necessarily true that the expectancy of an
industry-employed scientist is due to economic factors. The expectancy may have preceded the employment and indeed may have been a factor in the particular employment sought. But the source of the expectancy may be more relevant to a consideration of ethical rather than scientific questions. The origin of an expectancy may be quite irrelevant to the degree of its effect upon data obtained or upon interpretations of data.

It seems very reasonable that the “committee approach” to scientific investigation has been applied to areas of great interest to the general public. But the more technical, less generally appealing issues to which most scientists direct their attention deserve equal effort to minimize sources of error.

One special problem may arise when established scientists collaborate with a sincere wish to eliminate the effects of their expectancy. In their contacts or their surrogates’ contacts with subjects they may bend over backward to avoid biasing the results of their experiment. This “bending over backward,” an effect described in an earlier chapter, may lead each investigator to obtain data biased in the direction of his opponents’ hypothesis. For this reason, and for even greater control of expectancy effects, the sampling of experimenters with known expectancies is best combined with the control techniques to be described in subsequent chapters.³

Another difficulty of the “committee method” of controlling for expectancy effects is that it is likely to involve only a small though well-known sample of experimenters. With a smaller sample of data collectors it becomes more difficult to assess the sources of variation if there is disagreement among the data collected by different investigators.
Expectancies Determined After Sampling

Established investigators involved in a visible scientific difference of opinion are in sufficiently short supply for us to have to turn elsewhere for larger samples of data collectors. If less visible experimenters are to be employed we are unlikely to know their expectancies regarding the outcome of their research. But their expectancies can be determined after they are selected and before they collect any data.⁴ Not only their expectancies but their own task performance may be determined so that modeling effects may also be assessed and controlled. In addition, other experimenter attributes known to affect, or suspected of affecting, subjects’ responses may be determined and then controlled. The experimenter’s own performance and many other attributes are easy to determine. As part of the training procedure experimenters may be asked to serve as subjects. They learn the procedure they will have to follow while at the same time giving us a measure of their own task performance. Other experimenter attributes, if relevant, can be determined by direct observation (e.g., sex), from public records (e.g., age), by direct questioning (e.g., religion), or by means of standardized tests (e.g., intelligence). Some of these same methods may be used in the determination of experimenters’ expectancies, which we now discuss in more detail.

³ The analysis of the data collected by disagreeing, collaborating investigators can proceed, in the simple case, in the same way as in any “expectancy-controlled” experiment. In the terminology introduced in the last chapter of this section, one experimenter would be contacting subjects in the A and D cells while his collaborating opponent would be contacting subjects in the B and C cells. The interpretation of various possible outcomes of this 2 × 2 design would also proceed as described in the chapter dealing with expectancy control groups. The basic logic of employing oppositely expecting experimenters is, of course, the same as that underlying the use of expectancy control groups. The main difference is that in the former case we find the experimenter’s expectancy whereas in the latter case we induce it.

⁴ If the determination of expectancy or of some other experimenter variable were made after the experimenters’ data collection we could not properly regard it as an independent variable. The data collection itself might have influenced the experimenter’s expectancy and some (e.g., anxiety) but not all (e.g., birth order) other experimenter attributes. Such contamination renders measures of experimenter variables useless as a means of controlling for their effect by statistical techniques.

Determination of Expectancies

Inexperienced experimenters. If we are going to employ a fairly large sample of experimenters it is less likely that we can obtain very highly experienced data collectors. More inexperienced experimenters such as advanced undergraduates may have no particular expectancy about the result of a given experiment. They may not know enough about the area to have developed an expectancy. For these experimenters we can describe the experiment in detail and ask them to make a “guess” about how subjects will respond. This guess then may serve as the expectancy statement. The form of the guess may vary from an open-ended verbal or written statement, through a ranking of alternatives to an absolute rating of the several alternatives. If there are several possible expectancies or several degrees of one expectancy, we may want to assign to the open-ended statement some numerical value. This can be accomplished by having judges rank or rate these statements on the direction and magnitude of the implied expectancy. Ranking or absolute rating of alternatives by experimenters will give us numerical values of the expectancy in a more direct way and may, therefore, be preferred.

Experienced experimenters. If our sample of experimenters is composed of more sophisticated data collectors there is a greater likelihood that expectancies are better developed.
We may still use the methods sketched out for inexperienced experimenters, but we have other alternatives. We may, for example, ask the experimenters’ colleagues to rate their expectancies based on their knowledge of the experimenters’ theoretical orientations. Or, we can make such judgments ourselves based on reading the reports published by our experimenters or even perhaps their term papers. The reliability of these judgments made of an experimenter’s expectancy by his colleagues or from his written documents must, of course, be checked. With sophisticated experimenters who are already familiar with the research literature, we can ask them to write out or tell us “what previous research has shown” should be the outcome of the experiment and “how well the research was done.” This “state of the art” paper or monologue can be quantitatively judged for the expectancy it seems to imply.

Correcting for Expectancy Effect

In some cases we will find expectancies distributed only dichotomously; either a result is expected or it is not. At other times we will have an ordering of expectancies in terms of either ranks or absolute values. In any of these cases we can correlate the results obtained by the experimenters with their expectancies. If the correlation is
both trivial in magnitude and insignificant statistically, we can feel reassured that expectancy effects were probably not operating. If the correlation, however, is either large numerically or significant statistically, we conclude that expectancy effects did occur. These can then be ‘‘corrected’’ by such statistical methods as partial correlation or analysis of covariance. These same corrections can be applied if significant and/or large correlations are obtained between the results of the experiment and experimenters’ own task performance or other attributes.
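The correlational check and the partial-correlation correction just described can be illustrated with a minimal sketch. The data below are hypothetical (one entry per experimenter) and the variable names `expectancy`, `predictor`, and `result` are illustrative assumptions, not taken from any study reported here. The sketch first correlates experimenters' expectancies with the results they obtained, then re-estimates the relation between a substantive predictor and the results with expectancy held constant.

```python
from math import sqrt

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def partial_corr(x, y, z):
    """First-order partial correlation of x and y, holding z constant."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical data, one entry per experimenter (assumed for illustration).
expectancy = [1, 2, 2, 3, 4, 4, 5, 6]                 # rated expectancy
predictor = [3, 1, 4, 2, 5, 3, 6, 4]                  # variable of substantive interest
result = [2.1, 2.0, 3.2, 2.9, 4.1, 3.8, 5.2, 4.9]     # mean score obtained

r_expectancy = pearson(expectancy, result)    # did results follow expectancy?
r_raw = pearson(predictor, result)            # uncorrected relation
r_corrected = partial_corr(predictor, result, expectancy)

print(f"expectancy vs. result:          r = {r_expectancy:.2f}")
print(f"predictor vs. result:           r = {r_raw:.2f}")
print(f"same, expectancy held constant: r = {r_corrected:.2f}")
```

The sketch reports correlations only; judging whether a given correlation is "trivial" or "significant" would still require the usual significance test for the sample size at hand.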
20 Experimenter Behavior
In the last chapter some techniques for the control of experimenter expectancy effects were suggested which depended on the determination of experimenter behavior, expectancy, and other attributes before the data collection process began. In this chapter we shall consider some related controls which, however, depend on the determination of experimenter behavior, expectancy, and other attributes during and after the data collection process.
Observation of Experimenter Behavior

The Public Nature of Science

The public nature of the scientific process is one of its defining characteristics. All we do as scientists is determined in part by our intent that others be able to do it too. All we learn as scientists is intended to be learned as well by any other scientist with appropriate background and interests. Our research reports reflect this intent. We try to make public the reasoning that led to our research, how we conducted the research, what the results were, and how we interpreted these results. We expect scientists to differ in the reasoning that leads to an experiment and in the interpretation of the data. It is because of the public nature of the reasoning and interpretation of scientists that any one may disagree with the reasoning and interpretations of any other. Not so, however, for the data per se. We must simply accept them as given. With an absolutely complete description of the circumstances of their collection, this would create no problem. But particularly in the behavioral sciences, such a complete description is impossible. We cannot even give a description of some of the most relevant variables affecting our results, because we don’t know what they are. In psychological experiments we could give more detailed descriptions than we now do of such variables as temperature, pressure, illumination, the nature and arrangement of the furniture, the physical and personal characteristics of the data collector, and, perhaps most important, his behavior. We do, of course, describe the experimenter’s programmed behavior, but there are literally thousands of experimenter behaviors that are not described. Generally it is
not even known that these behaviors have occurred. We have no good vocabulary for describing them if we knew of them. Yet, for all this, these unprogrammed experimenter behaviors do affect the results of the experiment, as has been shown in earlier chapters. These behaviors constitute perhaps the least public stage of the scientific enterprise. They are less serious for occurring “behind closed doors,” as Beck (1957) has reminded us, than for having been insufficiently studied and ruled out as sources of unintended variance.

We stand to learn a great deal from making the data collection process more public because it will allow us to define the conditions of the experiment more precisely. We must give other scientists the opportunity of seeing what the experiment was so that, just as in the case of our calculations, our reasoning, and our interpretation, they will be at liberty to disagree with us. They should be free to decide whether the experimental manipulation, which we claim, was or was not successfully implemented. They should be free to decide whether our programmed behavior actually ran according to the program. Of course, if we knew what relevant variables we were not now reporting, we could simply add that information to our research reports. Since we do not, we must open the data collection process to a wide angle look so that we can learn what variables must in the future also be reported.

Observation Methods

A variety of methods are available for the observation of experimenters’ behavior during the data collection process. These methods include the use of various kinds of human and mechanical observers. Each method and each combination of methods has its own special advantages and disadvantages, which each investigator must weigh in deciding on which method to employ.

Subjects as observers. Earlier in this book we have seen how the subjects themselves may be employed as observers.
Immediately after the experiment is over for the subject he may be asked to describe his experimenter’s behavior. We have most often employed a series of rating scales to help the subject with his description—but open-ended questions, adjective checklists, Q sorts, and other techniques could be employed as well. More qualitative, less constrained descriptions have the advantage that they may suggest additional categories of experimenter behavior which may prove to be related to unintended sources of variance in the results of the research. For some purposes of control, such qualitative descriptions may have to be quantified, and although it is generally possible to do so, it is not necessarily a convenient or easy matter. Numbered rating scales have the advantage of being easy to work with but presuppose that we have some prior information, or at least guesses, about the relevant categories. Perhaps the most useful method of making observations is to combine the quantitative (e.g., rating scale) and qualitative (e.g., open-ended question). At the very least the qualitative observations can serve as the basis for subsequent, more formal categories. One question that arises is whether the subject should be told before he contacts the experimenter that he will be asked to describe the experimenter’s behavior. If the subject knows he is to describe the experimenter, he may make more careful observations. However, he may also be distracted from the experimental task and
therefore perform as a rather atypical subject. In addition, his having been asked by the principal investigator to carefully observe the experimenter may significantly alter the nature of the subject’s relationship to his experimenter. The subject may feel himself to be in a kind of collusion with the principal investigator and not at all subordinate in status to the experimenter. His increase in status relative to the status of the experimenter may make him less susceptible to the unintentional influence of the experimenter. The gain of more careful, sensitized observation of experimenter behavior accruing from the subject’s set to observe may be offset by the loss of ecological validity arising from the subject’s altered concentration on his task and his altered relative status. An empirical evaluation of the gains and losses may be obtained if half the subjects of the experiment are told beforehand to observe the experimenter carefully, and half the subjects are told nothing about their subsequent task of describing the experimenter. The two groups of subjects can then be compared both on their description of the experimenter and on the performance of their experimental task. The employment of subjects as observers is clearly a case of participant observership, and this is its greatest strength and greatest weakness. Being very much within the experimental situation gives the best opportunity to note what transpired through a variety of communication channels. At the same time the subject as participant is busy with his own task performance and perhaps too deeply involved in the interaction to be ‘‘objective.’’ Alternative methods of observation of experimenter behavior, therefore, become important. Expert observers. Anthropologists, clinical psychologists, psychiatrists, all make their living in part by the careful observation of behavior. These and other experts in observation may be employed to observe the experimenter’s behavior. 
The methods of making the observations may be as described in the case of subjects as observers. These expert observers may (1) sit in on the experiment; (2) observe through a one-way window; or if not superficially too dissimilar from the subjects of the experiment, (3) serve as ‘‘subjects’’ themselves. In this last case they might, unlike real subjects, be able to retain their ‘‘objectivity’’ in interaction with the experimenter because of the nature of their training, while retaining the greatest possible access to the modalities in which the experimenter can be said to ‘‘behave’’ (e.g., visual, auditory, olfactory, tactual). Obviously, if the expert sits in on the experiment, the experimenter knows he is being observed. This may alter his behavior in the experiment so that we can no longer learn what his ‘‘natural’’ behavior would be like. Observation of experimenter behavior without the experimenter’s knowledge may then be necessary either by covert observation or by the expert’s serving as subject. This, of course, raises the question of the propriety of deception for scientific purposes, a question discussed more fully in the chapter describing expectancy control groups. If it were established by careful research that experimenters behave no differently when they believe themselves to be observed, we could in good scientific conscience eliminate the method of covert observation, a method no one really likes to use anyway. Representative observers. Different observers may see the same behavior in different ways. Since we are primarily interested in the effects of experimenters’ behavior on their subjects’ performances, we could argue that the
observers of the experimenters should be like the subjects of the experimenters. Observers can be drawn from the same population from which subjects are drawn in the hope that they will be responsive to the same aspects of experimenter behavior to which their peer group members, the subjects, were responsive. These subject-representative observers could be asked to function in much the same way as the expert observers. They may miss some behavioral subtleties an ‘‘expert’’ might observe, but they may also attend to the more relevant aspects of the experimenter’s behavior. Other populations of observers that might profitably be employed include colleagues of the experimenter, the principal investigator and his colleagues, randomly selected groups, or specialized groups that might be particularly sensitive to certain aspects of experimenter behavior. Thus, actors, speech teachers, singers; dancers, physical education teachers; photographers and caricaturists may be especially sensitive to verbal, motor, and postural behavior, respectively. Mechanical ‘‘observers.’’ During any given period of the experimental interaction, the experimenter’s behavior occurs just once. If any behavior goes unobserved by a human observer it is lost and not recoverable. Fortunately, there are mechanical systems of permanently recording the experimenter’s behavior. These mechanical systems, including sound tape recordings, silent film, sound film, and television tapes, are becoming increasingly available, technically more effective, and economically more feasible. These recording systems differ from each other in the completeness of the recording of behavior, in the speed with which the records are available for use, and in their cost. Tape recording is the most practical system of permanent recording. The machines are readily available, inexpensive, and easy to use. However, they record only that behavior which can be heard, not that which can be seen. 
Silent film, sound film, and video-tape do record the behavior that can be seen and, in their less elaborate forms, can be surprisingly inexpensive. Silent 8 mm film can be used with a tape recorder to provide a convenient record of behavior which can be both seen and heard. Sound films, while more expensive and less easily available, provide a still better (more synchronized) record. Developments in photographic technology make it no longer necessary to have studio conditions before good films can be obtained. This seems quite important, since the bright lights and seating arrangements formerly required might significantly affect the experimenter’s behavior. Whether experimenters should be informed that their behavior is being recorded is both a scientific and ethical question, and some of these issues will be discussed later. It goes without saying, of course, that any records of experimenter behavior obtained with or without his knowledge must be treated with utmost confidentiality. An analogy to a clinically privileged communication is appropriate. None of the systems eliminates the need for the human observers, but they do allow for the more leisurely observation of behavior. The types of judgments made by the observers of the permanently recorded behavior of the experimenter may be the same as those made by a direct observer of the experimental interaction. In addition, however, some more mechanical modes of categorizing are available (e.g., see Mahl & Schulze, 1964). Observation of a permanent record (on tape or film) of behavior is more ‘‘forgiving’’ than the direct observation of the behavior as it occurs originally. Behavior
missed on first observation can be observed on second, third, or fourth observation. Larger groups of observers can simultaneously make their judgments, a logistic advantage that becomes increasingly more important as the judgments of behavior become more difficult or unreliable. For any observations of experimenter behavior, the reliability among observers must be calculated. In general, more molecular observations (e.g., the experimenter is or is not smiling) will be found to be more reliable than more molar observations (e.g., the experimenter is or is not friendly). However, that does not imply that variations in more molecular experimenter behavior will prove to be better predictors of unintended variation in the data obtained from subjects. On the contrary, experience with both the more molar and the more molecular observations suggests that the former may serve the more useful predictive function. This may be due to the fact that such global judgments as ‘‘friendly,’’ although not made as reliably, carry more social meaning than the more molecular observations of glancing or smiling. There may be just too many ways to smile and too many ways to glance, each with a different social meaning. Perhaps by indexing our more molecular observations in future studies—i.e., a friendly glance, a condescending smile—we can increase both the reliability and the predictive value of observations of experimenter behavior. Before leaving the topic of mechanical recording of experimenters’ behavior, mention should be made of the potential value of a very special kind of observer of these records. Milton Rosenberg has suggested in a personal communication (1965) that the experimenter whose behavior was filmed might find it especially instructive to study his own behavior as an experimenter. He might be able to raise questions or hypotheses missed by other, less personally involved observers. 
There is also the possibility, however, that the experimenter himself would have more to learn than to teach of his own behavior and of its effects on others. The experience of listening to one’s own psychotherapy behavior or supervising students’ psychotherapy training by means of tape recordings suggests that often the therapist, and perhaps the experimenter too, “is the last to know.”

Reduction of Bias by Observer Presence

The question of the effect of an observer’s presence on the experimenter’s behavior has already been raised. Here we raise the more specific question of the effect of an observer’s presence on the experimenter’s unintentional influence on his subjects. It seemed reasonable to suppose that the presence of an observer might reduce the experimenter’s unintentional communication of his expectancy to his subjects. An observer’s presence might serve to inhibit even those communications from the experimenter of which the experimenter is unaware. Some data are available that provide a preliminary answer to this question (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964a). In this experiment the standard photo-rating task was administered by 5 experimenters to about 10 subjects each. For half the subjects, experimenters were led to expect ratings of success, and for half the subjects, experimenters were led to expect ratings of failure. For each experimenter, several of his interactions with subjects were monitored by one of the principal investigators who sat in during the experiment.
Table 20-1 Observer Presence and Expectancy Effects

Experimenter   Unobserved   Observed
A                 +2.68       +0.25
B                 +2.49       +0.20
C                 +1.20       -2.00
D                 +0.44       +2.95
E                 -0.33       -0.88
Means             +1.30       +0.10
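As a quick arithmetic check on Table 20-1, the sketch below recomputes the two column means and the change in each experimenter's expectancy effect when observed. Each entry is the mean photo rating from subjects believed to be high (success) raters minus the mean from subjects believed to be low (failure) raters. The minus signs on three entries (E unobserved, C and E observed) are an assumption: they are inferred from the printed column means and from the discussion of experimenter C's reversal, since the signs did not survive in the extracted table.

```python
# Expectancy-effect scores from Table 20-1 (high-expectancy mean minus
# low-expectancy mean). The three negative signs are inferred, not printed.
unobserved = {"A": 2.68, "B": 2.49, "C": 1.20, "D": 0.44, "E": -0.33}
observed = {"A": 0.25, "B": 0.20, "C": -2.00, "D": 2.95, "E": -0.88}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

mean_unobserved = mean(unobserved.values())
mean_observed = mean(observed.values())

# Change in expectancy effect attributable to being observed, per experimenter.
change = {name: round(observed[name] - unobserved[name], 2) for name in unobserved}

print(f"mean unobserved: {mean_unobserved:+.2f}")
print(f"mean observed:   {mean_observed:+.2f}")
for name, delta in change.items():
    print(f"experimenter {name}: change when observed = {delta:+.2f}")
```

With the inferred signs the column means round to the printed +1.30 and +0.10. The per-experimenter changes show why the grand means alone are misleading: four experimenters move downward when observed while experimenter D moves sharply upward.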
Table 20-1 shows the effects on magnitude of expectancy effect of the experimenter’s having been observed. The numbers in each column represent the mean photo rating obtained from subjects believed to be low (failure) raters subtracted from the mean photo rating obtained from subjects believed to be high (success) raters. The difference between the grand means might suggest that expectancy effects were reduced by the presence of an observer. Such a conclusion would be misleading, however. For four of the five experimenters (A, B, C, D) expectancy effects were significantly affected by an observer’s presence. Two of the experimenters (A, B) showed a significant reduction of expectancy effects when observed. A third experimenter (C) not only showed a reduction of the expected biasing effect when observed, but actually tended to obtain data significantly opposite to that which he had been led to expect. His bias went into reverse gear. The fourth experimenter (D) tended to obtain unbiased data except when he was observed. At those times, he obtained data significantly biased in the predicted direction. These somewhat complex results can be interpreted best by postulating that different experimenters interpret an observer’s presence in different ways. Those experimenters whose biasing effects disappeared or even reversed in the observer’s presence may have interpreted the monitoring as an attempt to guard against subtle differential treatment of the subjects. This interpretation may have led to a reduction or reversal in any such differential treatment. The experimenter whose expectancy effect became more clearly pronounced in the observer’s presence may have interpreted the monitoring as an attempt to insure that the experiment would turn out well—i.e., lead to ‘‘proper,’’ predicted results. It would seem worthwhile in future experiments to vary systematically the impression conveyed to monitored experimenters as to the real purpose of the observer’s presence. 
This might shed light on whether the hypothesis of different meanings is tenable. For the present, we cannot draw any simple conclusions about the effects of an observer’s presence on the experimenter’s expectancy effects. If our sample of experimenters were larger, we could say, perhaps, that monitoring makes a difference four out of five times, but we could not be sure whether expectancy effects would be significantly increased or decreased among these affected experimenters.

Correcting for Experimenter Behavior

Once we have observed what the experimenter does in the experimental situation, we are in a position to make some correction for those of his unprogrammed behaviors
that have affected the results of his research. Suppose that in a certain experiment all subjects were to be treated identically by their experimenter. Suppose further that the observers, who did not know which subjects were in the experimental or control conditions, noted that the experimenter (who might not be blind to subjects’ experimental condition) behaved differently toward the subjects of the experimental and control groups. Any difference in the performance of the subjects of the two conditions might then be partially or entirely due to the experimenter’s differential behavior. By the use of such techniques as analysis of covariance or partial correlation we can assess the effects of the treatment condition, holding constant that experimenter behavior which was confounded with the subjects’ treatment condition. Individual differences among experimenters in the data they obtain from their subjects can similarly be controlled by knowledge of the experimenters’ behavior in the experiment.

Experimenters’ behavior may change during the course of an experiment. Practice may change their behavior vis-à-vis their subjects, and so may fatigue or boredom. Even if these phenomena do not bias the results of the experiment in the direction of the experimenter’s hypothesis, they may have undesired effects, generally an increase in Type II errors. Variability in experimenter behavior over time associated with variability in subject performance over time will tend to increase (erroneous) failures to reject the null hypothesis.

A fairly extreme correction for unprogrammed experimenter behavior is to drop the data obtained by an experimenter whose behavior was in some way very deviant or unacceptable. An example of such behavior might be an experimenter’s unwitting omission of a critical sentence in his instructions to his subject. Such correction by elimination might seem to be an easy matter. In fact, it is not.
Experimenters tend to behave in a normally distributed manner, and it will be troublesome to decide that this experimenter’s behavior (e.g., instruction reading) is barely acceptable while that experimenter’s behavior is barely unacceptable. The final decision to drop or not to drop the data obtained by a given experimenter from a given subject is itself highly susceptible to the interpretive bias of the principal investigator. At the very least, such a decision should be made without knowledge of the subject’s performance. Ideally, the rules for dropping data will be written before the experiment is begun and the decision to drop made by independent judges whose task is only to decide whether a given behavior violates a given rule. For the near future, at any rate, this will be no easy matter, for we have looked at so few experimental interactions that we hardly have any rules (even of thumb) for what constitutes an adequate, true-to-program set of experimenter behaviors. As Friedman (1964) has pointed out, for the psychological experiment, there is not yet an etiquette. Although a major deviation of an experimenter’s behavior from the behavior intended (usually implicitly) by the principal investigator may significantly alter the intended experimental conditions, such a deviation does not necessarily result in either an alteration in the subject’s response or in an alteration of the magnitude of the experimenter’s expectancy effect. Whether such alterations have occurred can and should be specifically determined for each experiment.
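The covariance adjustment mentioned earlier in this section, assessing the treatment effect while holding constant an experimenter behavior that was confounded with treatment condition, can be sketched as follows. The data are hypothetical: `warmth_*` stands for an observer's rating of the experimenter's behavior toward each subject and `score_*` for that subject's performance; both names and all values are assumptions made for illustration. The sketch computes only the adjusted mean difference that underlies a one-covariate analysis of covariance (using the pooled within-group slope), not the accompanying significance test.

```python
def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)

def adjusted_difference(y_exp, x_exp, y_ctl, x_ctl):
    """Return (raw difference, covariate-adjusted difference, pooled slope).

    y_* are subject scores and x_* are ratings of experimenter behavior.
    The adjustment removes the part of the group difference predicted by
    the groups' differing covariate means.
    """
    my_e, mx_e = mean(y_exp), mean(x_exp)
    my_c, mx_c = mean(y_ctl), mean(x_ctl)
    # Pooled within-group sums of products and squares.
    sxy = (sum((x - mx_e) * (y - my_e) for x, y in zip(x_exp, y_exp))
           + sum((x - mx_c) * (y - my_c) for x, y in zip(x_ctl, y_ctl)))
    sxx = (sum((x - mx_e) ** 2 for x in x_exp)
           + sum((x - mx_c) ** 2 for x in x_ctl))
    slope = sxy / sxx
    raw = my_e - my_c
    adjusted = raw - slope * (mx_e - mx_c)
    return raw, adjusted, slope

# Hypothetical data: the experimenter was rated warmer toward experimental subjects.
warmth_exp = [4, 5, 6, 7, 8]
score_exp = [3.1, 3.4, 4.2, 4.4, 5.1]
warmth_ctl = [2, 3, 4, 5, 6]
score_ctl = [2.0, 2.7, 2.9, 3.6, 4.0]

raw, adjusted, slope = adjusted_difference(score_exp, warmth_exp, score_ctl, warmth_ctl)
print(f"raw treatment difference:      {raw:.2f}")
print(f"pooled within-group slope:     {slope:.3f}")
print(f"difference adjusted for warmth: {adjusted:.2f}")
```

In these made-up data nearly all of the apparent treatment difference disappears once the experimenter's differential behavior is held constant, which is the situation the text warns about.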
Inferring Experimenter Behavior

Sometimes when there has been no direct observation of the experimenter’s behavior we can still make useful inferences about his behavior during the experiment. Such inferences can be based on an analysis of the results of his research and related personal characteristics. Some of these inferences are made on more quantitative bases, others on more qualitative bases.

Quantitative Bases of Inference

In earlier chapters we have emphasized that some replication of an experiment was required in order to assess an experimenter’s accuracy. Some such assessment is, however, possible even when only a single experimenter and a single experiment are involved. Replication in this sense involves a partition of the experiment into earlier and later phases. Table 20-2 shows the results of a hypothetical experiment comparing the effects of two teaching methods (one old, one new) on subjects’ performance. The experiment has been subdivided into six periods with subjects of both groups represented equally in all periods. For both teaching methods, subjects who are contacted in later phases of the experiment perform better than do subjects contacted earlier (rho = 1.00, p = .01). So long as subjects have been randomly assigned to phases of the experiment, and the possibility of feedback from earlier to later subjects eliminated, we might reasonably infer that the experimenter’s behavior has changed over the course of the experiment. He has perhaps become a better “teacher” or examiner, as it were, and in terms of the hypothesis under test, this need not be of too great concern. More troublesome is the fact that the superiority of the new teaching method has shown an increase from earlier to later phases of the experiment (rho = .97, p < .02). It is not easy to interpret this interaction.
It may be due to the fact that the new or experimental method becomes more effective when employed by a more effective, more experienced teacher (or when tested by a more effective examiner), which the experimenter has become over time. On the other hand, it is also possible that the experimenter has treated the subjects of the two groups in an increasingly differential way, and in a way unrelated to the teaching methods themselves. We might suspect
Table 20–2 Effects of Two Teaching Methods as a Function of Experimental Period

                       Methods
Period    Experimental    Control    Difference
1         1.2             1.1        +0.1
2         1.4             1.2        +0.2
3         1.5             1.3        +0.2
4         1.9             1.5        +0.4
5         2.0             1.6        +0.4
6         2.2             1.7        +0.5
Means     1.7             1.4        +0.3
Book Two – Experimenter Effects in Behavioral Research
this especially if the experimenter was the teacher as well as the examiner or, if he was only the examiner, then one not blind to treatment conditions. If efforts had been made to keep the experimenter-examiner blind, we might suspect a gradual ‘‘cracking of the code’’ by the experimenter. In addition (or alternatively), we might hypothesize that the experimenter was unwittingly learning to bias the results of the experiment or simply becoming a more effective influencer by virtue of his growing professionalness of manner. The nature of the specific experiment may suggest which interpretation of such an order-x-treatment interaction effect is most reasonable. The method of subdividing experiments is most effective when the experiment is designed specifically for this purpose. Pains can be taken to assign equal or proportional numbers of subjects of all experimental conditions to each phase of the experiment. Pains can also be taken to insure that there will be no feedback of information from earlier- to later-contacted subjects. But the logic of the method can be approximately and usefully applied to experiments that have already been conducted. The subdivision of the experiment can take place on a post hoc basis and can, at the very least, raise interesting questions (e.g., overall significant differences may prove entirely attributable to subjects contacted in only one phase of the experiment). In our hypothetical example we have used a correlational method of analysis only for the sake of simplicity. The basic method of analysis of subdivided experiments can be a treatment-x-order design, in which we would hope that only the treatment effect would prove significant. Even though the main effect due to order is not significant there may be a significant linear regression which should be checked. 
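The correlational reading of Table 20-2 can be retraced with a short sketch. It computes the rank correlation of experimental period with performance and with the experimental-minus-control difference, and also the least-squares slope of the difference across periods as one simple check on linear trend. The function names are illustrative, and treating tied differences by average ranks is an assumption, since the text does not say how ties were handled. Under that assumption the two correlations come out at about 1.00 and .97, in line with the values quoted for the table.

```python
from statistics import mean

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

def spearman_rho(x, y):
    """Rank correlation: product-moment correlation of the average ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))

def linear_slope(x, y):
    """Least-squares slope of y on x (change in y per unit of x)."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)

# Values copied from Table 20-2.
periods = [1, 2, 3, 4, 5, 6]
experimental = [1.2, 1.4, 1.5, 1.9, 2.0, 2.2]
difference = [0.1, 0.2, 0.2, 0.4, 0.4, 0.5]

rho_performance = spearman_rho(periods, experimental)
rho_difference = spearman_rho(periods, difference)
slope_difference = linear_slope(periods, difference)

print(f"period vs. performance: rho = {rho_performance:.2f}")
print(f"period vs. difference:  rho = {rho_difference:.2f}")
print(f"difference gained per period: {slope_difference:.2f}")
```

The sketch reports no significance levels; with only six periods these are better read from exact tables for the rank correlation than from a large-sample approximation.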
Such significant linear regression without significant main effects of order occurs when performance in subsequent phases changes by very small but very regular increments or decrements (see, e.g., Snedecor, 1956, p. 347). In an earlier chapter it was suggested that significant decreases in the variance of subjects’ performance might serve as a clue to experimenter expectancy effects. Over time an experimenter may unwittingly alter his behavior in some way such as to “shepherd” his subjects’ responses into an increasingly narrower range. We can assess the likelihood of this phenomenon in a manner analogous to that described for assessing experimenter effects upon mean performance. Table 20-3 shows the hypothetical variance of subjects’ responses for the experiment on teaching methods described earlier. Later-contacted subjects of both
Table 20–3 Performance Variances as a Function of Experimental Period

                       Methods
Period    Experimental    Control    Difference
1         15              15         0
2         14              14         0
3         12              14         2
4         11              13         2
5         9               12         3
6         8               12         4
treatment conditions show decreasing variance of performance. This decrease, like the improved performance scores shown in Table 20-2, may be due to the experimenter’s increasing skill. Some of his randomly variable behavior may have dropped out, so that he is treating subjects more consistently. In practice, however, we cannot say whether he began the experiment overly variable in behavior and became more “properly” consistent, or whether he began the experiment “appropriately” variable in behavior and became a more effectively biasing experimenter later on. To help us answer that question, we need additional data of a normative sort about the magnitude of variance ordinarily to be expected. Perhaps more serious than the systematic decline in subject variability is the differential decline in variability between the two treatment conditions. The decrease in variance of the experimental subjects’ performance is proceeding more rapidly than the decrease in performance variance of the control subjects. Whether this is a “natural” consequence of that particular teaching method as a function of experimenter experience, or whether it reflects some phenomenon related to expectancy effects, can be assessed only indirectly, as suggested earlier in the case of differential increments in mean performance. (In general, the reasoning that has been applied to the variability of the experimenter’s data may also be applied to its skewness and kurtosis.)
Qualitative Bases of Inference
We have seen that we may be able to make useful inferences about the experimenter’s behavior on the basis of the data he has obtained. Crude as this basis of inference may seem, we may at times have even less basis from which to infer something about the experimenter’s behavior and yet be forced to make such inferences.
Suppose we knew that in a given experiment a large number of computational errors had occurred and that these computational errors were nonrandomly distributed with respect to the hypothesis under test. We might have some weak empirical grounds for inferring that the experimenter’s interaction with his subjects was also biased. The research showing a relationship between magnitude of expectancy effects and computational errors has, unfortunately, not yet been replicated (Rosenthal, Friedman, Johnson, Fode, Schill, White, & Vikan-Kline, 1964). Are we ever justified in using an experimenter’s reputation as a basis for inferring what his behavior during an experiment was like? Probably not very often, if at all. There are few scientists who have a clearly documented history of producing data consistently biased by their expectancy. On those occasions when workers are heard to say about a research result, ‘‘Oh, you can’t believe that, it came out of X’s lab,’’ or ‘‘Nobody can ever replicate X’s work anyway,’’ there is likely to be little documented basis for the statement. In almost a decade of trying to follow up such statements, I have only seldom been personally convinced by the ‘‘evidence’’ that I should believe data less because of the lab or the investigator from whom the data came. Often, perhaps, such a reputational statement means little more than, ‘‘They don’t get the data we get, or which we think they ought to get.’’ At the present time, then, an investigator’s reputation for erring, or more specifically, for obtaining data influenced by his expectancy, does not appear to provide an adequate basis for inferring such error or bias. In principle, however, it could. Not
‘‘reputation’’ in the loose sense, then, but performance characteristics, can be assessed if we are willing to take the trouble. Such assessment will be discussed in the next chapter. In the present chapter we have discussed the experimenter’s behavior in the experiment as the source of expectancy effects and as a vehicle for their control. The methods of observation and of control suggested here are not, of course, intended to substitute for those strategies of control presented in the preceding and following chapters. Rather, they are intended to serve as additional tools for the control of expectancy effects which will sometimes prove especially appropriate and, at other times, especially inappropriate. In general, each investigator interested in controlling for expectancy effects will have to select one or more of the strategies presented in this volume (or others overlooked here) on the basis that it (or they) will best serve the purpose for a given experiment.
21 Personnel Considerations
In the last two chapters suggestions were made which were designed to help control experimenter expectancy effects in specific experiments. In this chapter, which draws upon some of the suggestions made earlier, we shall consider on a possibly more general basis the selection and training of experimenters as an aid to the control of experimenter expectancy effects. If there were certain kinds of data collectors who never influenced their subjects unintentionally, we could make a point of having only these experimenters collect our data. If the amount and type of an experimenter’s training were significant predictors of his expectancy effects, we could establish training programs for data collectors such that its graduates’ data would be unaffected by their expectancies or hypotheses.
The Selection of Experimenters
The importance of careful selection of data collectors has been recognized by social scientists working in the area of survey research (e.g., Cahalan, Tamulonis, & Verner, 1947; Harris, 1948). Hyman and his collaborators (1954) have an excellent discussion of the personal characteristics of interviewers who are more prone to make various errors during the data collection process. These errors include errors in asking the programmed questions, errors in probing for further information, errors in recording the response, and cheating errors. We may summarize these errors as all being relevant to interviewer competence. More competent interviewers make fewer errors. But lack of competence is probably not the problem when the data collector is a psychological experimenter. Better educated, better motivated, more carefully selected, holding more scientific values, psychological experimenters may well be more competent in the sense of doing what is asked and doing it accurately than are the less highly selected interviewers employed to assist in the conduct of large-scale surveys. Furthermore, more competent, more accurate experimenters are more, rather than less, likely to show expectancy effects on their subjects’ responses, as we saw in an earlier chapter. Hyman and his co-workers do discuss personal correlates of interviewers who showed a greater biasing effect, but the biasing was usually of the observer bias or
interpreter bias variety. As far as could be determined, there have been no studies relating interviewer characteristics to interviewer expectancy bias in which, by independent observation of the respondents’ replies, it could be determined that the bias affected the subjects’ responses rather than the observation, interpretation, or recording of responses. In fact, as Hyman’s group points out, even when the term ‘‘bias’’ includes net errors of observation, recording, and interpretation, ‘‘Evidence on what variables might be used as predictors of tendencies to ideological or expectation biases is almost nonexistent’’ (p. 302). What this literature, so well reviewed by Hyman’s group, offers us, then, is a principle rather than a body of information to apply to the situation of the psychological experimenter. In an earlier chapter are described the personal characteristics of experimenters exerting greater expectancy effects. Although these characteristics were theoretically interesting, the magnitude of their correlation with experimenter bias seems too low to be useful for selection purposes. In any case, it seems unlikely that we would select as experimenters people who are less professional, less consistent, of lower status, more tense, and less interested. Purposeful selection of such people might lower the degree of bias but at the cost of other, perhaps far more serious, errors. When we read a journal report of an experiment, we put our faith in the experimenter’s having been professional, consistent, and competent, or we would doubt that the experiment was conducted as reported. And if we must maintain high standards of competence, we will necessarily have a harder time of developing selection devices that will predict experimenter expectancy effects. 
As our experimenters become more homogeneous with respect to such variables as intelligence we will find these variables to predict bias less and less well as a simple statistical consequence; i.e., a reduction in the variance of either of two variables to be correlated leads to a reduction in the resulting correlation. What methods, then, can be used to select experimenters?
The Method of Sample Experiments
Hyman and his collaborators (1954) suggest the use of performance tests in the case of interviewer selection, and as employed in survey research organizations, this technique appears to be effective, at least in minimizing coding bias. The ultimate in job sample techniques applied to the situation of the experimenter would involve his actually conducting one or more standard experiments with subjects whose usual responses were known beforehand. For each prospective experimenter, his expectancy of the results of the particular experiment to be conducted could be determined. Subjects would be randomly drawn from a population whose mean response had been determined.1 Consistent significant deviations in the responses obtained by our prospective experimenter would define him to be a biased data collector. The bias might be in the direction of his expectancy or in the opposite direction. The extent to which his deviations in obtained data could have occurred by chance would be determined by standard statistical tests. But what if, for a given experiment, a prospective experimenter showed clearly a propensity to bias the outcome, whereas for another experiment he showed only
1. The problem of establishing the correct value of the mean response (no small matter) was discussed in the chapter dealing with the assessment of experimenter effect.
a propensity for obtaining accurate data? No evidence is available for suggesting whether such is a likely state of affairs. We do not know the degree of generality of experimenters’ biasing tendencies over a sample of experiments. Ideally, we would have a large sample of prospective experimenters conduct a series of standard experiments in each of several different areas of research. What we would be likely to discover is that (1) there is a general factor defined by a tendency to bias over a large range of types of experiments, (2) there are group factors defined by a tendency to bias in certain types of experiments, and (3) there are specific factors defined by a tendency to bias in only certain specific experiments. We might find further that some specific experiments or some types of experiments are more commonly free of bias, whereas others are more likely to show biasing effects of experimenters. Figure 21-1 shows the hypothetical profiles of three experimenters who have undergone our somewhat elaborate selection procedure. Experimenter A shows a tendency to exert expectancy effects in all his research— most especially so in studies of emotional behavior in humans, but less so in studies of perception. Experimenter B tends to bias the results of his learning research only, but for both human and animal subjects. Experimenter C shows the biasing effects of his expectancy only in studies of the emotional behavior of animals. If we had to conduct all eight of these hypothetical experiments using these three data collectors, we would be able to choose one or more to conduct each experiment with some hope of avoiding biasing effects associated with the experimenter’s expectancy. In almost every case we would prefer experimenter C, and if we were in the market for a research assistant we would hire him, all other things being equal.
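The screening step of the sample-experiment method, in which the data obtained by a prospective experimenter are compared with a population mean determined beforehand, might be sketched as a one-sample z test. Everything specific here is hypothetical: the function names, the sixteen scores, the population mean of 0.0, and above all the assumption that the population standard deviation (2.0) is also known. With an estimated standard deviation a t test would be the more usual choice. The sketch covers a single standard experiment, whereas the method calls for consistent deviations across several.

```python
from math import sqrt
from statistics import NormalDist, mean

def bias_z_test(obtained, population_mean, population_sd):
    """Compare an experimenter's obtained mean with the known population mean.

    Returns the z statistic and its two-sided p value. The population
    standard deviation is assumed known, which is an illustrative simplification.
    """
    standard_error = population_sd / sqrt(len(obtained))
    z = (mean(obtained) - population_mean) / standard_error
    p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_two_sided

def classify(z, p, expected_direction, alpha=0.05):
    """Label the deviation relative to the experimenter's stated expectancy.

    expected_direction is +1 if he expected scores above the population
    mean and -1 if he expected scores below it.
    """
    if p >= alpha:
        return "no detectable bias"
    same_direction = (z > 0) == (expected_direction > 0)
    return "biased toward expectancy" if same_direction else "biased against expectancy"

# Hypothetical responses obtained by one prospective experimenter who
# expected scores above the population mean.
obtained_scores = [1, 2, 0, 3, 1, 2, 1, 0, 2, 1, 3, 1, 0, 2, 1, 0]
z, p = bias_z_test(obtained_scores, population_mean=0.0, population_sd=2.0)
verdict = classify(z, p, expected_direction=+1)
print(f"z = {z:.2f}, p = {p:.4f}: {verdict}")
```

Repeating the test over each experiment in a standard battery would yield the kind of per-experiment profile that the hypothetical experimenters A, B, and C illustrate.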
[Figure 21–1 Expectancy Effects as a Function of Type of Experiment. Vertical axis: magnitude of expectancy effect (0 to 5). Horizontal axis: experiment number (1 to 8), with type of subject alternating human and animal, and type of experiment given as perception, learning, motivation, and emotion. Profiles are plotted for experimenters A, B, and C.]
In proposing what amounts to a personal validity index we are suggesting a procedure that, at least in its most ideal form, cannot be appropriately employed by the ordinary principal investigator in search of a research assistant. For one thing, this simply cannot be done to the graduate student or undergraduate who wants to work for a given investigator. Such selection may be educationally inappropriate from the student’s point of view. If the experiments are designed only as a selection device there is also a certain indignity involved for the student. The procedure is expensive, time-consuming, and boring. It requires some institutionalized system of implementation. It does, in fact, suggest the creation of a new profession with its own system of selection and training. We shall return to this recommendation later. Our primary concern in this discussion of the selection of experimenters has been selection for minimal expectancy effects. But the job sample method permits us also to screen out potential experimenters who find it hard to carry out highly programmed procedures. Although there is a correlation between intelligence and accuracy in carrying out instructions, it is not likely to be very high among a selected group of potential experimenters. Yet, we do note individual differences in the skill with which our research assistants carry out their behavior programs. The direct observation or the recording on sound tapes or sound films of the potential experimenter’s behavior during the standard sample experiment permits us to assess whether any procedural deviations are too great to tolerate. By the observation, either directly or by way of sound films, of the experimenter’s behavior over the course of one or more experiments, we may be able to gauge his ability to learn to behave in that standard fashion we would like. 
Ultimately, it may prove feasible to develop tests and questionnaires that will predict an experimenter’s proneness to the exertion of expectancy (and related) effects. But the validation of any such instruments requires an ecologically valid criterion. Such a criterion is provided by the job sample method described. The personal validity index, which can be computed separately for each experiment, for related types of experiments, or for all experiments in the standard battery, can simply be correlated with already existing or specially devised instruments. The major advantage of the development of such instruments is, of course, economic. Tests can be administered more quickly and cheaply than experiments can be conducted.
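Anyone validating such instruments against the personal validity index would also meet the statistical point made earlier in the chapter: among a highly selected group of experimenters, a reduction in the variance of a predictor lowers its correlation with the criterion. The small simulation below illustrates this under purely hypothetical assumptions. The variable names, the linear relation with slope 0.5, the normal distributions, and the selection cutoff of 1.0 are all invented for the illustration and come from no study cited here.

```python
import random
from statistics import mean

def pearson_r(x, y):
    """Product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den

rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical applicant pool: a test score and a validity index that
# depends on it linearly plus independent noise.
n_applicants = 5000
test_score = [rng.gauss(0, 1) for _ in range(n_applicants)]
validity_index = [0.5 * s + rng.gauss(0, 1) for s in test_score]

r_full = pearson_r(test_score, validity_index)

# Keep only highly selected applicants (test score above the cutoff),
# which shrinks the variance of the predictor.
cutoff = 1.0
selected = [(s, v) for s, v in zip(test_score, validity_index) if s > cutoff]
selected_scores, selected_index = zip(*selected)
r_selected = pearson_r(selected_scores, selected_index)

print(f"all applicants:      r = {r_full:.2f} (n = {n_applicants})")
print(f"selected applicants: r = {r_selected:.2f} (n = {len(selected)})")
```

The underlying relation is identical in both groups; only the range of the predictor differs. A low correlation within a selected group therefore says little about how well an instrument would predict in an unselected one.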
The Training of Experimenters
In our discussion of the training of experimenters we will define training broadly enough to include the variable of experience as a kind of on-the-job training. Again we find most of the relevant literature to come from the field of survey research.
Amount of Training
Hyman’s group (1954), in summarizing a number of studies of interviewer competence and bias, concluded that more experienced interviewers were somewhat more competent and less likely to bias their results. This conclusion, tentatively offered,
they tempered by pointing out that selective retention of interviewers might have been operating. The better interviewers may have greater longevity with the research organization—experience may be the dependent rather than the independent variable. Other workers disagree with even the modest conclusion drawn by Hyman et al. Cantril (1944) reported that training did not make much difference in the quality of data obtained. Similarly Eckler and Hurwitz (1958) reported that census interviewers showed no decrease in net errors for at least certain types of questions when additional training was provided. The lore of psychological research suggests that more experienced experimenters are at least more competent than the more inexperienced. This may well be true. As in the case of the interviewer, there may be a selective retention within the craft of those who can do competent data collection—competent in the sense of following directions. But the lore of psychological research suggests less about the relationship between experimenter’s experience and the magnitude of his expectancy effects.2 There have been two experiments that, taken together, provide at least some indirect evidence bearing on the effect of experience or training on the magnitude of expectancy bias (Rosenthal, Persinger, Mulry, Vikan-Kline, & Grothe, 1964a; 1964b). In one of these studies (a) all experimenters had served as data collectors once before, whereas in the other (b) none had any prior research experience. The two studies were not specifically designed for this comparison, and so any conclusions are tentative, at best. It did appear, however, that magnitude of expectancy effects was not particularly related to the experimenters’ experience. If anything, the more experienced experimenters showed a greater biasing effect and were less variable in the degree of their bias. It would even be reasonable to suggest that more experienced experimenters should show greater expectancy effects. 
With more experience, experimenters gain in self-confidence and perhaps behave in a more professional manner. In an earlier chapter we saw that more professional experimenters exerted more biasing influence on their subjects. Further indirect evidence comes from an examination of the results of the second of the two cited studies (b). Among that group of inexperienced experimenters some were made more conscious of their procedures as determinants of the results of their research. This amounted to a minimal training (or educational) effort. The experimenters who were made minimally more procedure-conscious showed greater biasing effects in the data they obtained from their subjects. The minimal training procedure may have led these experimenters to feel the importance of their role as data collector more keenly. They may have conveyed to their subjects a certain sense of status enhancement, and as we saw in an earlier chapter, experimenters of higher perceived status tend to exert a greater degree of expectancy effect. In summary, we must be impressed by the absence of well-established findings bearing on the relationship between experimenter expectancy effects and experimenter training and experience. If forced to draw some conclusion from what evidence there is, we might conclude that better trained, more experienced experimenters are likely to be more competent in carrying out their research with minimal procedural deviation. But at the same time, because the more experienced, better
2. The lore of anthropological field research, however, suggests that a better (i.e., professionally) trained observer is less likely to be in error (Naroll, 1962). This certainly seems reasonable, but an experimental demonstration, even if it bore out the lore on the average, would probably show very considerable overlap, with some “amateur” observers erring less than some professionals.
trained experimenter is likely to enjoy a higher status in the eyes of his subjects and behave in a more professional manner, those very slight procedural deviations that do occur are more likely to result in the effective communication and influencing effect of the experimenter’s expectancy.
Type of Training
There are virtually no data available that would suggest to us how we should train experimenters to maximize their competence generally and to minimize their expectancy effects specifically. There is an undocumented assumption, however, that if only we tell our experimenters about the pitfalls of bias, everything will be all right. What little evidence there is bearing on just this point, however, suggests that this assumption is quite unjustified. Troffer and Tart (1964) showed that fairly experienced experimenters who understood the problem of expectancy effect and presumably tried to guard against it nevertheless treated subjects differently depending on whether subjects were in the experimental or the control condition. There are also some data, collected by Suzanne Haley, which suggest that, if anything, expectancy effects increase when experimenters are asked to try to avoid them. It seems incongruous that psychologists, who have been so helpful to education, business, industry, and the military in setting up and evaluating training programs, have not turned their attention to the training of psychological experimenters, the members of their own family. We have been like the physician who neglects his own health, the mental health expert who neglects his own family. If we look at procedures currently employed for the training of our research assistants, we find no systematic pattern, not even explicit assumptions about training for data collection. Many experimenters, perhaps most, have never been observed in the data collection process or have even heard a lecture about it.
As researchers, then, we lag far behind those ‘‘applied’’ fields, so scorned by some, in the application of the principles of learning. The clinical psychologist, in contrast, is thoroughly trained in his data collection process. He is observed in his interaction with his patients and given feedback. We may lament the lack of validation of various methods of supervision, but at least the methods exist to be evaluated, and significant research has shown that even very subtle aspects of training and supervision can be empirically investigated (Kelley & Ring, 1961). In the area of survey research, most organizations have training manuals which, although also unvalidated as to their value in error reduction, at least represent some self-conscious thinking about the problem (Hyman et al., 1954). Perhaps the ultimate in concern with problems of both selection and training of data collectors is reflected in the procedures employed by the Institute for Sex Research (more commonly known as the ‘‘Kinsey Group’’). Over the several decades of their research they have employed only 3 percent of the applicants who were considered for employment (Pomeroy, 1963). The grand total of nine interviewers employed by the Institute was, then, an extremely carefully selected group. The Institute knew the criteria its interviewers had to meet, and selected accordingly. Although we might wish for an empirical evaluation of the success of their
selection and training procedures, we could hardly hope for more caution in the selection of data collectors. One of the reasons that we, as psychologists, have paid so little attention to our own training as experimenters may stem from the combination of a specific belief and a specific value about data collection. The belief is that data collection is simple, if not simple-minded, and that anybody who can reach graduate school can carry out an experimental procedure. The associated value is that data are to be highly prized but data collection is not. The young postdoctoral psychologist can hardly wait to turn the burdens of data collection over to his graduate student assistants. Not all psychologists share this belief and the associated value, of course. Extremely sophisticated investigators have pointed out informally that some of their graduate research assistants can, and some cannot, carry out at least some experimental procedures. The fact that some cannot is often learned fairly late in the game, often at some cost to both the experimenter and the principal investigator. Such instances are a tribute to our neglect of both selection and training procedures.
The Professional Experimenter
Science implies observation and the collection of data. The scientist is responsible for the collection of the appropriate data, but it need not be his eye at the telescope or microscope nor his pencil mark indicating how a respondent will vote. In survey research the interviewer is not the scientist. In medical research the laboratory technician is not the scientist. Each of these data collectors is a member of an honorable profession and is perhaps more expert and less biased than the scientist himself. In psychology, however, the scientist himself is commonly the experimenter, or if he has “outgrown” the running of subjects, a scientist-in-training is the collector of the data.
This hampers both the selection and training of experimenters who are both competent and unbiased. What is needed is a new profession, the profession of behavioral experimenter analogous to the professional interviewer and the laboratory technician. The professional experimenter will be well selected, well trained, and well paid. He will enter the profession because he is interested in data collection and not because it is expected as something to be done before an advanced degree can be earned and as something to be delegated quickly to one’s own graduate students. Careful, expensive selection and training procedures will be warranted because of the greater longevity of the new professional’s data collection career. There will be no conflict between educational and scientific aims as there may be in the case of a brilliant student of science who simply happens to be inept at collecting data in behavioral experiments. At present, such a student may be discouraged from a scientific career because of this one ineptitude. This becomes his loss and ours. There is no reason why he should not conceive of needed experiments, design them, evaluate their results, and report them to the rest of us. The actual data collection can be turned over to an institute set up at a university or privately, as in the case of various survey research organizations (e.g., National Opinion Research Center). This proposal does not imply that data collection would no longer be a part of graduate education or that much research would not continue to be done as it now is. But there is no necessary correlation between the educational function of serving as
experimenter and the scientific function of data collection. Divisions of labor might sensibly evolve. One very natural division would be between pilot studies and large-scale replications or cross validations. The former would more likely be conducted by the individual investigator and his assistants. The latter might most profitably be contracted to a large research agency which selects and trains professional experimenters and conducts research on contract. If this proposal should seem radical, we need only remind ourselves that one can already have surveys conducted, tests validated, and experimental animals bred to order. What is proposed here simply extends the limits of the kind of data that could become available on a contract basis. The details of setting up institutes for the selection and training of professional experimenters and the conduct of behavioral research are complex. They would be expensive and would require the support probably of both universities and interested federal agencies. The various agencies now functioning most nearly like the proposed institutes would need to be consulted so that their experience could be profitably utilized. The “ideal” selection procedure suggested earlier in this chapter could be employed along with others. Different training procedures could be developed with continuing evaluation of their relative effectiveness in increasing experimenter competence and decreasing biased errors. In addition to the development of manuals which may or may not prove to be helpful, more job-related procedures may be introduced. Trainee experimenters could observe the data-collecting behavior of “ideal” experimenters directly or on film. The trainee’s own performance could be monitored directly by supervisors or, if on film, by supervisors and by the trainee himself to learn of any procedural deviations.
In the early days of the development of such a new profession, variability of procedures of selection and training would be especially important. Amount and type of the trainee’s educational background, intelligence, motor skill, personality variables, and the didactic and performance types of training methods should all be permitted to vary so that the effectiveness of various types of experimenters and training programs may be assessed. The emotional investment of the professional experimenter would be in collecting the most accurate data possible. That is the performance dimension on which his rewards would be based. His emotional investment would not be in obtaining data in support of his hypothesis. Hypotheses would remain the business of the principal investigator and not of the data collector. There might, in general, be less incentive to obtain biased data by the professional experimenter than by the scientist-experimenter or the graduate student-experimenter. Still, professional experimenters will have or develop hypotheses, and the strategies for the control of expectancy effects described in the last two chapters and in the next two chapters can be employed. In fact, they can be more effectively employed with professional experimenters because there will be less conflict with educational goals. The professional experimenter wants to be kept blind, but the graduate student might properly feel imposed upon if he were kept from knowing what research he was conducting. Some of the values to be acquired by the professional experimenter are, of course, already found among behavioral scientists, but their increased articulation might have a beneficial feedback effect on those of us back at the universities. We do too often judge a piece of research, not by its careful execution and the data’s freedom from error, but by whether the results confirm our expectations. Many universities
Personnel Considerations
give implicit recognition to this tendency by protecting their doctoral candidates with a kind of contract. The essence of this contractual procedure is that the soundness of a piece of research is to be judged without reference to the results. If a qualified group of judges (i.e., the doctoral committee) feels that a piece of research is well designed then it must be acceptable no matter how the data fall.3 That such contractual arrangements exist is a good and reasonable thing. That such contractual arrangements are necessary is a somewhat sad and sobering situation. It is a situation that suggests we are too often more interested in demonstrating that we already ‘‘know’’ how nature works than in trying to learn how, in fact, she does work.
3 There are, of course, additional reasons for this form of contractual protection of the student, e.g., the possibility of staff turnover, changing standards, and changing interests.
22 Blind and Minimized Contact
Blind Contact
It seems plausible to reason that if the experimenter does not know whether the subject is in the experimental or the control group, then he can have no validly based expectancy about how the subject ‘‘should’’ respond. The experimenter ‘‘blind’’ to the subject’s treatment condition cannot be expected unintentionally to treat subjects differentially as a function of their group membership. This is an old and effective idea in the field of pharmacology. The so-called single blind study refers to the situation in which the patient or subject is kept from knowing what drug has been administered. When both subject and experimenter (physician) are kept from knowing what drug has been administered, the procedure is called ‘‘double-blind’’ (Beecher, 1959; Levitt, 1959; Wilson, 1952).1 This technique is over 120 years old, having been employed by members of the Vienna Medical Society at least as early as 1844 (Haas, Fink, & Hartfelder, 1963). Haas and his co-authors have recently presented rather convincing evidence that the use of the double-blind study is more than warranted. In a review of nearly 100 placebo studies, involving thousands of subjects and many different disorders, they observed that the placebo works best when the double-blind method has been employed. Apparently, when the experimenter (doctor) does not know that the substance given his subject (patient) is inert, he expects, and gets, a better result. Psychologists have been slow to adopt the double-blind method for other than psychopharmacological research (Shapiro, 1960, p. 125), though Wolf (1964) reports that in 1889 Delboeuf proposed the double-blind method for research in hypnosis.
1 There is a certain amount of confusion about the exact usage of the term ‘‘double-blind.’’ Always the subject is blind, but sometimes the other ‘‘blind’’ person is the subject’s personal physician, sometimes the research physician, sometimes the person who actually dispenses the drug, and sometimes several of these. There is talk, too, of triple-blind, quadruple-blind, etc., to add to the confusion. We will adopt a usage in speaking of the psychological experiment such that a double-blind study is one in which no one having direct contact with subjects is permitted to know what the subjects’ treatment condition will be, is, or has been, until the experiment is over. ‘‘Double-blind’’ for us will mean ‘‘total-blind.’’
It is the unusual data collector today who does not know whether his subject is a member of the experimental or control group (e.g., Babich, Jacobson,
Bubash, & Jacobson, 1965). The suggestion to have experimenters contact their subjects under blind conditions is implied not only logically but empirically as well, if we may draw on the data of the pharmacologists. In addition, it is not a suggestion that would work an impossible hardship on the researcher. More and more data are being collected by less and less sophisticated (or, at least, less academically advanced) student research assistants who could be kept uninformed of the hypothesis and overall design of the experiment, as well as the treatment conditions to which each subject belongs. If these students were too sophisticated, or if the principal investigator preferred to do so from educational and ethical considerations, assistants could be told exactly why they must remain blind. In order to be somewhat more convinced about the efficacy of the double-blind procedure among psychological experimenters, however, it was decided to try the technique out (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963).
A Test of Blind Experimentation
Fourteen graduate students (11 males and 3 females) administered the standard person perception experiment to a total of 76 introductory psychology students (about half were males and half were females). As in earlier studies, half the experimenters were led to expect low photo ratings and half were led to expect high photo ratings from their subjects. Experimenters were told that those who adhered most strictly to the experimental procedure and obtained the ‘‘best’’ data would be awarded ‘‘research grants.’’ At the conclusion of this phase of the experiment all experimenters were given small ‘‘grants’’ of $14. Of this amount $10 represented their ‘‘salary’’ for continuing in the role of ‘‘principal investigators,’’ and $4 was used to pay their research assistants. To each of the 13 experimenters who were able to continue in the experiment as ‘‘principal investigators’’ two research assistants were randomly assigned.
All but three of these assistants were males. Assistants were trained in the experimental procedure and were paid for their time by the experimenters. Each research assistant then conducted the photo-rating experiment with a new sample of six introductory psychology students. Of the total of 154 subjects contacted by the research assistants, about half were males. Unlike the original instructions to experimenters, the instructions to the research assistants made no mention of what ratings should be expected from the subjects. Experimenters were led by their instructions to expect their research assistants to obtain data from their subjects of the same sort they had themselves obtained from their earlier-contacted subjects. Experimenters were warned not to leak to their assistants the magnitude of data that experimenters had themselves obtained from their subjects. Research assistants, then, were running ‘‘blind.’’ In spite of the fact that, as a set, these experimenters did not bias their subjects’ responses to a very great extent, what bias did exist was transmitted to their research assistants. In spite of the attempt to keep research assistants blind, those whose ‘‘principal investigators’’ had biased their subjects’ responses more also biased their own subjects’ responses more. The correlation between the magnitude of experimenters’ biasing effect and the magnitude of their research assistants’ biasing effect was .67 (p = .01).
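As a rough plausibility check on the reported correlation, the usual conversion of a correlation coefficient into a t statistic can be sketched as follows. The number of pairs is an assumption: the text does not say over how many units the correlation was computed, and 13 (the number of continuing experimenters) is used here only for illustration. The function name is likewise hypothetical.

```python
import math

def t_from_r(r, n_pairs):
    """Convert a Pearson r based on n_pairs observations into a t statistic
    with n_pairs - 2 degrees of freedom."""
    return r * math.sqrt(n_pairs - 2) / math.sqrt(1 - r ** 2)

r = 0.67        # correlation reported in the text
n_pairs = 13    # ASSUMPTION: one pair per continuing experimenter

t = t_from_r(r, n_pairs)
print(f"t({n_pairs - 2}) = {t:.2f}")
```

Under that assumption the statistic falls between the two-tailed critical values for df = 11 at the .05 level (2.201) and the .01 level (3.106), which is at least compatible with a reported p of about .01. A different unit of analysis would give a different t.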
Here, then, is an interesting case of unintended interpersonal influence once-removed, which has important substantive social psychological implications which will be discussed later on. The methodological implications, however, are clear. Simply not telling our research assistants what to expect from a given subject (i.e., whether they are experimental or control subjects) does not insure real blindness. In some subtle way, by tone and/or gesture, experimenters may unintentionally overinform their research assistants. The principle of the double-blind method is not impugned by our findings; but the difficulty of implementing and maintaining the required experimenter ‘‘blindness’’ is emphasized.2
Additional Problems of Maintaining Blindness
We have shown that the principal investigator may be a source of the inadvertent failure of the double-blind method. Here we will show that another source of such failure may be the subject himself. This is well known in pharmacological research. When active and inert chemical substances are compared, sometimes the active substance has an irrelevant but obvious side effect. Some patients given the ‘‘real’’ drug may change color, for example. Thus the experimenter knows, at least for these subjects, that they are more likely to be in the drug group than the placebo control group. In psychological experiments, too, such ‘‘side effects’’ may occur. Assume an experiment in which anxiety is the independent variable. People who have just been through an anxiety-arousing experience or who score high on a test of anxiety may behave differently in an experimental situation. The ‘‘blind’’ experimenter may then covertly ‘‘diagnose’’ the level of anxiety and, if he knows the hypothesis, bias the results of the experiment in the expected direction or, by bending over backward to avoid bias, ‘‘spoil’’ the study. There are many experimental treatments or measurements that may be assessed unintentionally by the ‘‘blind’’ data collector.
A recent example of this derives from a finding that subjects scoring high in need for social approval arrived earlier at the site of the experiment (r = +.40, p = .003; Rosenthal, Kohn, Greenfield, & Carota, 1965). In effect, to see a subject arrive is to know something about him that often is meant to remain unknown. Arrival time, overt anxiety, skin color changes, and potentially hundreds of other, more subtle signs may break down the most carefully arranged double-blind study.
2 The work of Martin Orne (1962) also suggests that even the ‘‘single blind’’ method is not all that easy to achieve. Although no investigator would tell his subjects what their response ‘‘ought’’ to be, there may be cues from the situation (even if not from the experimenter himself) that unintentionally communicate to the subjects how they are expected to behave.
Irrelevant Expectancies
Even a truly blind experimenter is likely to have or to develop some expectancy about his subjects’ behavior. If he does not know the experimental hypotheses (i.e., how the subjects ought to behave), then his idiosyncratic expectancies are likely to be irrelevant to the hypotheses under investigation. From the point of view of the particular design of the study, however, these more or less ‘‘random’’ idiosyncratic hypotheses may serve to increase the error variance and, from the
principal investigator’s point of view, increase the likelihood of Type II errors. If the experimenter did not know the hypotheses being tested but did know to which group each subject belonged, the results of the study are more likely to be biased in the direction of the hypothesis (or opposite to it) rather than biased irrelevantly with respect to the hypothesis. We can illustrate these considerations by returning to our earlier example of the study of the effects of anxiety on intellectual performance. Suppose the principal investigator exposes a random half of his subjects to an anxiety-arousing experience, while the remaining subjects are exposed to a situation involving no anxiety arousal. The experimenter who collects the intellectual performance data does not know the hypothesis or the treatment group membership of any subject. Suppose, however, that the experimenter has the irrelevant covert hypothesis that tall, thin people tend to be unusually bright. He therefore unintentionally treats them somewhat differently, and as a result they obtain higher performance scores. Table 22-1 illustrates the effect of this irrelevant hypothesis on the results of the experiment. The intellectual performance scores are tabulated as they might occur in each group with and without the effects of the irrelevant expectancy of the data collector. Note that there was only one tall, thin subject in each group whose performance score was affected by the data collector’s expectancy. In each case a 5-point effect on these subjects is observed. The table shows that the mean performance scores are barely affected by this particular constant error, that the effect of anxiety (i.e., mean difference) is unchanged, but that the t values and p levels are affected. Even for the relatively minor experimenter effect we have illustrated, the increase in Type II errors is clearly shown. 
(Particularly damaging, this error, if the principal investigator follows an accept–reject decision model and does not take note of the large mean differences obtained).
Table 22–1 Effect of Idiosyncratic Hypothesis on Results of a Double-Blind Study
                     True values               Affected values
                     Anxiety    No Anxiety     Anxiety    No Anxiety
Scores               112        117            112        117
                     114        119            114        119
                     116        121            116        121
                     118        123            123*       128*
                     120        125            120        125
Means                116        121            117        122
σ²                   8          8              16         16
Mean difference            5.0                       5.0
t                          2.50                      1.77
p                          <.05                      >.10
Decision                   reject                    not reject
Error                      none                      Type II
* Affected scores.
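The t values in Table 22-1 can be checked with a short calculation. The sketch below assumes an ordinary independent-samples t test with a pooled variance estimate (the text does not name the test, but this choice reproduces the tabled values); the function and variable names are illustrative only.

```python
import math

def pooled_t(group_a, group_b):
    """Independent-samples t statistic using a pooled variance estimate."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    ss_a = sum((x - mean_a) ** 2 for x in group_a)   # sum of squares, group A
    ss_b = sum((x - mean_b) ** 2 for x in group_b)   # sum of squares, group B
    pooled_var = (ss_a + ss_b) / (n_a + n_b - 2)
    std_err = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean_b - mean_a) / std_err

# Scores from Table 22-1
true_anxiety = [112, 114, 116, 118, 120]
true_control = [117, 119, 121, 123, 125]
affected_anxiety = [112, 114, 116, 123, 120]   # one tall, thin subject: +5
affected_control = [117, 119, 121, 128, 125]   # one tall, thin subject: +5

T_CRIT_05 = 2.306  # two-tailed critical t at the .05 level, df = 8

for label, anx, ctl in [("true", true_anxiety, true_control),
                        ("affected", affected_anxiety, affected_control)]:
    t = pooled_t(anx, ctl)
    decision = "reject" if abs(t) > T_CRIT_05 else "not reject"
    print(f"{label}: t = {t:.2f}, decision = {decision}")
```

The mean difference is 5 points in both cases. Only the within-group sums of squares change (from 40 to 80 in each group), and that alone is enough to move the result across the .05 criterion.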
If the experimenter is not entirely blind but knows that subjects belong to two groups, and which subjects to which group, the mean difference between groups is more likely to be affected. This is true even though the experimenter does not know what the treatment conditions are. It will be apparent to him that a difference is expected, and he may covertly, and perhaps irrelevantly, hypothesize which group is to be the better performing and behave differently to the subjects of the two groups as a result of his hypothesis. On a chance basis, half the time this should tend to help support the principal investigator’s hypothesis and half the time it should tend to weaken it. But, in either case, we can be misled as to the nature of the real state of affairs. It seems, therefore, highly desirable that the experimenter be unaware of which subjects constitute a group even when he does not know what treatments have been administered to any group.
Procedures Helping to Maintain Blindness
We have seen that both subject and principal investigator can serve as sources of unintended cues leading to the breakdown of experimenter blindness. In the next major section of this chapter we shall discuss various strategies that may help maintain blindness by helping to reduce contact between an experimenter and his subject. In this section we shall discuss two strategies designed to help the experimenter maintain blindness in spite of his having some contact with the principal investigator.
Avoiding feedback from the principal investigator. The first of these strategies is implied by the findings described in the chapter dealing with the effects of early data returns (the Ebbinghaus effect): the data collector should not tell the principal investigator the nature of the early returns.
This is a bit of a psychological hardship for a research group eager to learn whether they do or don’t ‘‘have something.’’ Still, many studies are conducted within a short enough period of time that the hardship would not be excessive.3
3 Here is another advantage to be gained from employing a large sample of experimenters. An experiment can be completed so much sooner with a number of experimenters, working sometimes even simultaneously, that there is a far less urgent desire on the part of the principal investigator to learn how the data are coming out.
Any contact with the principal investigator, including many unavoidable sorts, is likely to increase the chance of a breakdown of ‘‘blindness,’’ but the report of early returns may be especially damaging. Suppose that over the course of an experiment a blind experimenter is unintentionally having some sort of variable effect on subjects. For example, early in the data collection process he may be smiling more at subjects he sees as more anxious, but later on he smiles somewhat less at them. If the early data returns are reported to the principal investigator there will probably be subtle or overt positive or negative reactions to the news. If the reaction is positive, the principal investigator’s pleasure may serve as a reinforcer for the data collector’s unprogrammed experimental behavior (in this case, his differential smiling). What was a randomly variable bit of unprogrammed behavior coincidentally serving to affect subjects’ behavior in the predicted direction now becomes a systematically biasing behavior on the part of the experimenter which will continue throughout the rest of the data collection process. If the early data returns are in the unpredicted direction and the principal
investigator’s reaction is negative, the data collector may change his randomly variable unprogrammed behavior possibly to another more ‘‘biasing’’ mode of unprogrammed behavior.4
The ‘‘total-blind’’ design. The second strategy to be described is one we have frequently employed and found quite useful in our own research program. This strategy, when applicable (and it often is), gives virtually complete assurance of the maintenance of experimenter blindness, usually so difficult to obtain. Following the terminology of the pharmacological researcher we may call it the ‘‘total-blind’’ method because no one knows the treatment condition to which any subject is assigned. In our simple situation in which half the experimenters are led to expect high photo ratings (+5) and half the experimenters are led to expect low photo ratings (−5), these expectations were induced by a written statement of how subjects ‘‘would perform.’’ In a small study employing only 10 experimenters a different research room might be assigned to each experimenter. The 10 sets of instructions, five inducing the +5 expectancy and five inducing the −5 expectancy, would be randomly and blindly assigned to the 10 rooms. The 10 experimenters then would be randomly assigned to their rooms, where they would read over their ‘‘last-minute instructions’’ which, in fact, were the means for creating the experimental conditions. Not until the conclusion of the experiment, when the experimenters’ ‘‘last-minute instructions’’ would be picked up along with the data sheets, would anyone know in what experimental treatment each experimenter (or subject) had been.5 In the more complex situation, where there were several different expectancies and other experimental manipulations, the very same procedures were followed. To illustrate the more complex situation, consider an experiment requiring 4 conditions and 6 experimenters per condition.
4 The presence and variety of unprogrammed experimenter behaviors during the experiment have been emphasized by Friedman (1964). These unprogrammed behaviors (e.g., smiling or glancing at the subject) cannot be regarded as ‘‘wrong’’ because no one has laid down the ground rules for ‘‘right’’ smiling and glancing or nonsmiling and nonglancing behavior. It would be an error to state simply that none of this behavior should occur in an experiment. The absence of certain socially expected facial, gestural, and tonal behaviors may have a far more unusual, even bizarre, effect upon subjects’ behavior than their presence (Rosenthal & Fode, 1963b). In speaking of these unprogrammed interpersonal behaviors of experimenters we should note that they do not necessarily have any implications for biasing the results of an experimental vs. control group comparison. So long as the unprogrammed behavior is either constant or only randomly variable, these behaviors cannot serve to mediate experimenter expectancy effects. Only when subjects are differentially treated with respect to these unprogrammed behaviors as a function of their treatment condition can these behaviors serve to mediate experimenter expectancy effects.
5 Subjects, of course, were randomly assigned to experimenters (or experimental rooms) but with the restriction that the number of subjects per room be as nearly equal as possible.
If we have 4 experimental rooms we divide the experiment into 6 replicates; if we have 8 experimental rooms we divide the study into 3 replicates; if we have 12 experimental rooms we divide the study into 2 replicates, assuming in all cases that we can arrange to have different experimenters contact their subjects simultaneously. Within each replicate each experimental condition is represented equally. Experimental treatments, induced by written ‘‘instructions,’’ are put into envelopes, coded, randomly assigned to research rooms, and not associated with any given experimenter until the experiment is over. Of course, the same logic can be applied even if only one experimental room were available. It is our impression, however, that experimenters (or subjects) who find
their way into early vs. later stages of an experiment may be nonrandomly different. Therefore, if we can have early- and later-participating experimenters or subjects equally represented in each treatment condition there may be less confounding of the treatment condition with these temporally associated personal characteristics. If we had only one room available for the data collection process, therefore, we would prefer to have each set of four experimenters represent all four experimental conditions. Random assignment of the four experimenters to the four conditions would have the additional advantage that if there were an unexpected attrition of experimenters toward the end of the experiment, there would be a more nearly equal distribution of experimenters among the various conditions. All that we have said about our own research employing experimenters can be equally applied to other research employing subjects directly, that is, without intervening experimenters. All those experiments in which a written (or a tape-recorded) communication serves as the experimental manipulation can therefore be run ‘‘totally blind.’’ There appear to be few, if any, areas of behavioral research in which this strategy cannot be appropriately employed at least some of the time. Before leaving the topic of blind contacts, mention should be made of a paradoxical question raised by Milton Rosenberg in a personal communication (1965). He suggested the interesting possibility that experimenters, knowing they were blind, might expect, and therefore obtain, significantly more variable data. The idea is sufficiently intriguing and sufficiently important in terms of leading to increased Type II errors that the implied experiment should clearly be carried out.
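The replicate-by-room bookkeeping of the ‘‘total-blind’’ design described in this section (4 conditions, 6 experimenters per condition, coded envelopes randomly assigned to rooms) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name, the four-digit envelope codes, and the condition labels A to D are hypothetical, and the sketch does not cover the single-room case discussed above.

```python
import random

def total_blind_schedule(conditions, experimenters, n_rooms, seed=None):
    """Build a room schedule that never shows a treatment condition.

    Returns (schedule, sealed_key):
      schedule   -- rows of (replicate, room, experimenter, envelope_code)
      sealed_key -- envelope_code -> condition, to be consulted only after
                    the data sheets have been collected
    """
    rng = random.Random(seed)
    n_cond = len(conditions)
    if n_rooms % n_cond or len(experimenters) % n_rooms:
        raise ValueError("n_rooms must be a multiple of the number of "
                         "conditions and divide the experimenters evenly")
    n_replicates = len(experimenters) // n_rooms
    order = list(experimenters)
    rng.shuffle(order)                                   # experimenters -> slots
    codes = rng.sample(range(1000, 10000), len(order))   # opaque envelope labels
    schedule, sealed_key = [], {}
    for rep in range(n_replicates):
        # each condition is represented equally often within a replicate
        envelopes = list(conditions) * (n_rooms // n_cond)
        rng.shuffle(envelopes)                           # envelopes -> rooms
        for room in range(n_rooms):
            slot = rep * n_rooms + room
            sealed_key[codes[slot]] = envelopes[room]
            schedule.append((rep + 1, room + 1, order[slot], codes[slot]))
    return schedule, sealed_key

conditions = ["A", "B", "C", "D"]                       # hypothetical labels
experimenters = [f"E{i:02d}" for i in range(1, 25)]     # 4 x 6 = 24
for n_rooms in (4, 8, 12):
    schedule, sealed_key = total_blind_schedule(conditions, experimenters,
                                                n_rooms, seed=0)
    print(n_rooms, "rooms ->", schedule[-1][0], "replicates")
```

The schedule handed to staff carries only replicate, room, experimenter, and envelope code. The mapping from code to condition is kept apart, which mirrors the requirement that treatments not be associated with any experimenter until the experiment is over.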
Minimized Contact
In describing the blind contact strategy in general, we pointed out that it was by no means always easy or even possible to achieve. Therefore, if we could eliminate experimenter-subject contact altogether it would seem that we would then also eliminate the operation of experimenter expectancy effects.
Automated Data Collection
The day may yet come when the elimination of the experimenter, in person, will be a widespread, well-accepted practice. Through the use of computers we may generate hypotheses, sample hypotheses, sample the experimental treatment conditions from a population of potential manipulations, select our subjects randomly, invite their participation, schedule them, instruct them, record and analyze their responses, and even partially interpret and report the results. Even if this fantasy were reality we would not be able to eliminate the experimenter completely. He or his surrogates or colleagues must program the computer and thereby influence the output. However, there would at least be no unprogrammed differential treatment of subjects as a function of their experimental conditions. In short, although experimenter or even machine effects cannot be completely eliminated, we can at least hope for unbiased effects. Progress is being made along the trail to automation and the elimination of experimenter-subject contact from certain stages of research. Not necessarily
because of an interest in reducing expectancy effects, many researchers employing animal subjects have fairly complex automated data-collection systems (McGuigan, 1963). This automation, however, generally applies only to the period of the animal’s data production and not to all his pre- and extra-experimental experience (Christie, 1951). Experimenter handling patterns in transporting animals from home cage to the experimental work area and back may vary not only across experimenters but within experimenters as a function of the treatment condition to which the animal belongs.6 Even if the animals were transported to and from their home cages without human contact there might still be an opportunity to treat differentially the animals of different treatment conditions. Animals in their home cages may be treated differently as a function of their cage labels even if these labels are in code and the handler is not formally ‘‘an experimenter’’ or data collector. He still knows something of psychological research procedures—i.e., that different behavior is expected of members of different experimental groups. Earlier in this chapter we showed how knowledge of which subjects constitute a treatment group can affect the results of the research even when the hypotheses being tested and the specific treatment conditions are unknown. Automated data collection, when the subjects are humans, also appears to be on the increase and, as in the case of animal studies, particularly among researchers employing operant techniques. Written or tape-recorded instructional methods certainly seem to eliminate experimenter-subject contact. Although these methods should reduce any opportunity for the communication of experimenter expectancy effects, they would not eliminate such opportunity if there were nonblind experimenter-subject contact before the data collection phase of the experiment. 
6 Such differential handling of animals as a function of their experimental condition was postulated earlier as a major factor in the mediation of experimenter expectancy effects to animal subjects.
Experiments Requiring Human Interaction
It can be argued that there are experiments that make no sense unless a human interaction can occur between experimenter and subject, situations in which a written communication or tape recording alone simply would not do.
Experimenter constancy. If this must be, then from all we have said the experimenter’s behavior should be as nearly constant as possible. There is a way in which we can achieve perfect constancy of experimenter behavior, and that is to employ the identical experimenter’s input into each experimenter-subject interaction. This can be accomplished by filming an experimenter’s required behavior in the experiment, including sound track. The sound film can then be used to instruct the subject. This alone would be little better than the tape-recorded instruction method if subjects could see that the experimenter was on film. However, where one-way mirrors or a television camera (with or without film) and closed-circuit television monitoring facilities are available, it would be a simple matter to give the impression of a ‘‘live’’ interaction. The situation could be structured for subjects so that they felt they could observe the experimenter and he could observe them via the monitoring system. In this way constancy of experimenter behavior could be assured without
sacrificing the impression of ‘‘liveness’’ of interaction which may be crucial in certain experimental conditions. Restricting cues available to the subject. Where experimenter-subject contact cannot be eliminated completely, it can at least be minimized. Earlier we showed how the reduction of the available channels of communication between experimenter and subject might reduce the effects of the experimenter’s expectancy. Thus, interposing a screen between experimenter and subject would reduce the available channels for the communication of expectancy effects from experimenter to subject. However, cues from the subject to the experimenter have also been shown to increase the likelihood of experimenter expectancy effects by serving to break down the experimenter’s blindness. It would, therefore, be desirable to restrict cues made available by the subject to the experimenter. Restricting cues available to the experimenter. Cues to the experimenter tend to increase as he interacts with more subjects and as he interacts more with each subject. An incidental advantage of employing a group of experimenters, then, is that each contacts fewer subjects and therefore has less opportunity unwittingly to ‘‘crack the code’’ of a blind procedure—i.e., to learn which subjects constitute a group and what a subject’s experimental condition might have been. In this way the advantages of experimenter blindness may better be maintained. During his contact with any subject, the experimenter may avoid some important unintentional cues from the subject by having subjects record their own responses. There are many experimental procedures in work with human subjects wherein responses are coded in a fairly simple system. Subjects could then often be requested to record their own response on a clearly laid-out data sheet. If this procedure were followed whenever possible four advantages would accrue: (1) The experimenter’s chances of remaining blind would be increased. 
(2) The experimenter, by simply not looking at the data sheets, could avoid that influence on his subsequent subjects attributable to his knowledge of the early data returns. (3) Experimenters’ recording errors, which, though rare, tend to be biased when they do occur, would be virtually eliminated. (4) The amount of interpersonal contact between experimenter and subject would be reduced, thereby reducing the opportunity for the subtle communication of the experimenter’s expectancy (or other bias or effect) to his subjects.
Some combined procedures. The use of an ordinary tape recorder may be combined with the use of a screen interposed between experimenter and subject to achieve some of the advantages of using a filmed experimenter to contact the subjects. This alternative procedure requires less expensive equipment than the closed-circuit television monitor. The tape recorder, out of sight of the subject behind the experimenter’s screen, could be used to instruct the subject (without the subject’s awareness) if earphones were provided. The impression, then, would be that the experimenter was speaking over a telephone-like device. This method assures instructional constancy, elimination of visual cues during the interaction, and when further combined with subjects’ self-recording of response, no effects due to early data returns. At the same time the experimenter is physically present, therefore perhaps more ‘‘real’’ as required by some experimental manipulations. The only opportunity for unprogrammed experimenter input would be during the greeting phase of the data collection process before the experimenter retires behind his screen (and even this greeting phase could be eliminated). If the experimenter is also blind
to the subject’s treatment condition, however, this should not be particularly damaging. The use of earphones for the subject has the additional advantage that it may help the experimenter maintain his blindness. Suppose the treatment conditions were to be created by the taped instructions. The tape could be constructed so that the different instructions appeared in some random sequence unknown to the experimenter. He would play one segment of the tape after another for sequences of subjects without knowing what instructions the subject had received through the earphones.

Effects of “Absent” Experimenters

We mentioned earlier that the experimenter could never be eliminated completely, even from a fantasied computer-run experiment. We repeat that restriction here. There always are (and always will be) decisions that must be made by the experimenter which may unintentionally affect the subjects’ responses. These decisions, however, should have no unprogrammed, differential effect upon the subjects constituting the different experimental conditions.

One sort of research that is erroneously believed to involve no effect of the experimenter is the mail survey. Letters requesting information (often from psychologists themselves) are legion. There is no face-to-face contact between the questioner and the respondent. Yet the wording of the letter may yield different rates of return and, among the respondents, different kinds of responses. Different data collectors interested in the same questions are likely to ask them in different ways, thereby eliciting different responses. The advantage to the mail-survey technique, however, is that there, at least, we can specify exactly what the experimenter’s stimulus value was. We can completely capture it simply by having a copy of the letter sent.

Just as letters convey something of the writer, so even more do tapes and films. Experimenter attributes cannot be eliminated. Their effects can only be distanciated, randomized, or held constant over treatment conditions to avoid a bias in the comparison between groups. At the present time, there is such a dearth, relatively, of studies employing a more or less “absent” experimenter that it is difficult to assess the effect of such absence per se on the subjects’ responses. What evidence there is suggests that a more or less “absent” experimenter often does affect subjects’ behavior by his absence (Felice, 1961; Masling, 1960). There is at present, however, no reason to assume that an “absent” experimenter can affect his different experimental groups differentially and in the direction of his experimental hypothesis.

The Double Standard for Expectancy Control

Before leaving the topic of minimized contact between experimenter and subject, one further observation must be made. Earlier in this book and elsewhere in more detail (Rosenthal, 1965a) we have reviewed some early attempts to minimize experimenter-subject contact. It is striking that so many of these efforts at greater control occurred in what might be called “borderline” areas of psychology. Even today it would be the rash parapsychologist who would not make every effort to minimize the contact between the experimenter and subject in a study of
extrasensory perception. And this is all to the good. Also to the good is the fact that nonparapsychologists would be outraged if such controls against expectancy effects were not employed. But not all to the good is the fact that some of these same workers might be outraged if in their own less ‘‘borderline’’ areas of inquiry they were required to institute the same degree of control over their own expectancy effects. Clearly we have a double standard of required degree of control. Those behavioral data found hard to believe are checked and controlled far more carefully than those behavioral data found easier to believe (e.g., Babich et al., 1965). What this amounts to, then, is a widespread interpretive bias which may serve to make it easier to demonstrate easily expected findings and harder to demonstrate intuitively less likely outcomes. In the overall conduct of the business of the behavioral sciences, this may lead to a pervasive bias to support hypotheses in keeping with beliefs of the times. Obviously the solution is not to make it easier to ‘‘demonstrate’’ unlikely events such as clairvoyance, rod-divining, talking animals, or muscle-reading. What is called for is the setting of equally strict standards of control against expectancy effects in the more prosaic, perhaps more important, everyday bread-and-butter areas of behavioral research. There should be no double standard; every area of behavioral inquiry should require the greatest possible control over the potential effects of experimenter expectancy and other sources of scientific error.
23 Expectancy Control Groups
In the last chapter we discussed some strategies attempting to minimize experimenter expectancy effects. In the present chapter we shall discuss a strategy which essentially attempts to maximize experimenter expectancy effects by the employment of “expectancy control groups.” The logic of the control group was well developed by Mill in 1843 and, according to Boring (1954), had been anticipated a century earlier by Hume and two centuries earlier by Bacon and Pascal.1 At least since the beginning of this century, psychologists have with increasing frequency employed control groups in their experiments (Solomon, 1949). Expectancy control groups represent a specific set of controls derived directly from the research demonstrating the effects of experimenters’ expectancy on the results of their research.

Consider any experiment in which the effects of an experimental and a control treatment are to be compared.2 The experiment is likely to be conducted because a difference between the experimental and control group is expected. Table 23-1 shows the generally resultant confounding of experimental treatment conditions with the experimenter’s expectancy. Cell A represents the condition in which the experimental treatment is administered to subjects by a data collector who expects the occurrence of the treatment effect. Cell D represents the condition in which the absence of the experimental treatment is associated with a data collector who expects the nonoccurrence of the treatment effect. But ordinarily the investigator is interested in the treatment effects unconfounded with experimenter expectancy. The addition of the appropriate expectancy control groups will permit the evaluation of the treatment effect separately from the expectancy effect. A “complete expectancy control” requires the addition of both cells B and C, whereas a “partial expectancy control” requires the addition of either cell B or C. Subjects in cell B are those who will not receive the experimental treatment but who will be contacted by an experimenter who expects a treatment effect. Subjects in cell C are those who will receive the experimental treatment but who will be contacted by an experimenter who expects no treatment effect.

Table 23–1 Confounding of Treatments with Experimenter Expectancy

                    Treatment conditions
Expectancy          Experimental    Control
Occurrence          A               B
Nonoccurrence       C               D

1 Boring tells how Pascal, in 1648, had his brother-in-law, Perier, perform an experiment with the Torricellian tube (barometer). As the tube was carried higher up a mountain the column of mercury dropped lower and lower. A control tube was left at the bottom of the mountain monitored by an observer to note whether there was any change in the level of mercury. A number of readings were made of the mercury level at the top of the mountain and one halfway down. Measures of pressure at any two of the three levels of altitude illustrate Mill’s method of difference, and measures at all three points illustrate the more elegant special case of that method: the method of concomitant variation. A much earlier example of the use of control groups comes to us from ancient Egypt (Jones, 1964). That particular research if conducted today might have been titled: “Citron Ingestion as a Determinant of Longevity among Snake-bitten Animals.”
2 Although our example employs an experimental manipulation as the independent variable, the discussion applies as well to any comparison between groups, no matter how they are constituted. For our purpose a comparison between “experimental” and “control” groups does not differ from a comparison between male vs. female subjects, conforming vs. nonconforming subjects, or “high” vs. “low” anxious subjects.
Hypothetical Outcomes

The results of the case of an experimental vs. a control group comparison with “complete expectancy control” are most simply evaluated by a two-way analysis of variance yielding a main effect attributable to the experimental treatment, a main effect attributable to experimenter expectancy, and an interaction of these two effects. For the sake of simplicity, we may say that any of these three sources of variance can be only (1) significant and large, or (2) significant and small, or (3) insignificant and virtually zero.

Large Treatment Effects

Table 23–2 Expectancy-Controlled Experiments Showing Large Treatment Effects

                            Treatment conditions
Case    Expectancy          Experimental    Control
1       Occurrence          6.0             0.0
        Nonoccurrence       6.0             0.0
2       Occurrence          6.0             1.0
        Nonoccurrence       5.0             0.0
3       Occurrence          6.0             3.0
        Nonoccurrence       3.0             0.0

Table 23-2 shows some likely hypothetical results of a complete expectancy-controlled experiment in which the treatment effects are significant statistically and large
numerically. The numbers in the cells represent the mean data obtained from the subjects in that condition. For the sake of clarity we may assume that the mean square within cells is so small that any numerical differences are statistically significant. Case 1 shows that whereas the experimental treatment had a powerful effect upon subjects’ performance, experimenter expectancy had neither any effect in itself nor did it enter into interaction with the experimental treatment. A result such as this not only reassures us that our treatment per se ‘‘works,’’ but also impugns the generality of experimenter expectancy effects. Case 2 shows almost the same magnitude of difference between the average performance of experimental vs. control subjects as we found in Case 1. However, experimenter expectancy effects were significant, though small, and trivial relative to the powerful effects of the treatment condition. Case 3 shows that the treatment effects, although still large and significant, are no greater than the effects of experimenter expectancy. These first three cases, then, show increasing effects of experimenter expectancy, although in each case we would correctly conclude that the experimental condition had a significant effect with expectancy effects controlled. If we had omitted the expectancy controls, would we have erred seriously? Not if we were interested only in showing that the experimental treatment affected subjects’ performance. However, if we were at that more advanced stage of inquiry where we would like to be able to state with some accuracy the magnitude of the experimental effect we would have been misled—not, of course, in Case 1, where expectancy had no effect whatever. But in Case 2 we would have overestimated slightly the power of the experimental treatment. And in Case 3 we would have overestimated seriously the power of the treatment under study. 
From Table 23-2 we can see that for Case 1 the difference between performances of those experimental and control groups which are normally confounded with expectancy (cell A–cell D) is 6.0 points. This is the same difference obtained by experimenters expecting either the occurrence (cell A–cell B) or the nonoccurrence (cell C–cell D) of the treatment effects. For Case 2 the normally confounded experiment uncontrolled for expectancy would have yielded a difference of 6.0, where only a 5.0 was attributable to treatment unconfounded with expectancy. For Case 3, a 6.0 difference, of which only half was attributable to unconfounded treatment effect, would have been claimed.

Interacting Treatment Effects

The three cases discussed have shown no interaction effects. Cases 4 and 5, shown in Table 23-3, however, both show significant interaction effects in addition to significant main effects. In each case all three sources of variance are equal in magnitude. The main effects of experimental treatment and experimenter expectancy, hence, are not interpretable apart from the interaction effect. In Case 4 the experimental treatment has an effect on subjects’ performance only when the experimenter expects such performance. If this outcome were the “true” state of affairs, then the omission of the expectancy control groups would have been quite serious. The experimental treatment would have been regarded as significantly effective and large in magnitude when, in fact, such a conclusion, unqualified, would have been extremely misleading.
Table 23–3 Expectancy-Controlled Experiments Showing Interacting Treatment Effects

                            Treatment conditions
Case    Expectancy          Experimental    Control
4       Occurrence          6.0             0.0
        Nonoccurrence       0.0             0.0
5       Occurrence          6.0             6.0
        Nonoccurrence       6.0             0.0
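The main effects and interaction described for Cases 1 through 5 can be recomputed directly from the cell means in Tables 23-2 and 23-3. The following Python sketch is only an illustration. The names `CASES` and `effects` are hypothetical, and scaling the interaction contrast by one half, so that it sits on the same scale as the two main effects, is an assumption made here rather than something the text specifies.

```python
# Cell means (A, B, C, D) copied from Tables 23-2 and 23-3.
# A: experimental treatment, expectancy for occurrence
# B: control treatment,      expectancy for occurrence
# C: experimental treatment, expectancy for nonoccurrence
# D: control treatment,      expectancy for nonoccurrence
CASES = {
    1: (6.0, 0.0, 6.0, 0.0),
    2: (6.0, 1.0, 5.0, 0.0),
    3: (6.0, 3.0, 3.0, 0.0),
    4: (6.0, 0.0, 0.0, 0.0),
    5: (6.0, 6.0, 6.0, 0.0),
}


def effects(a, b, c, d):
    """Return (treatment, expectancy, interaction, confounded) contrasts."""
    # Treatment main effect: difference between the column means.
    treatment = (a + c) / 2 - (b + d) / 2
    # Expectancy main effect: difference between the row means.
    expectancy = (a + b) / 2 - (c + d) / 2
    # Interaction: half the difference between the two simple treatment
    # effects (the one-half scaling is an assumption of this sketch).
    interaction = ((a - b) - (c - d)) / 2
    # The usual two-group experiment compares cell A with cell D only.
    confounded = a - d
    return treatment, expectancy, interaction, confounded


for case, cells in CASES.items():
    print(case, effects(*cells))
```

Under these assumptions, Case 3 yields a treatment effect of 3.0 against a confounded difference of 6.0, and in Cases 4 and 5 all three contrasts have a magnitude of 3.0, which agrees with the statement that the three sources of variance are equal in magnitude.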
Similarly misleading would be conclusions based on the situation shown in Case 5 if expectancy control groups had not been employed. In this case either the experimental treatment or the experimenter’s expectancy was sufficient to affect the subjects’ performance in the expected direction, to the same degree, and without any summative effects attributable to the combined effects of treatment and expectancy. (In Case 5 the suspicion might arise that a special instance of Case 3 had occurred where a low ceiling on the dependent variable measure had prevented a higher mean score from being obtained in cell A.)

Small Treatment Effects

In Table 23-4 we see a number of outcomes in which the effects attributable to the experimental treatment are either trivial or absent altogether. Case 6 shows the simplest of all outcomes. Nothing made any difference; not the experimental treatment, not the experimenter’s expectancy, and not the interaction. This, like Case 1, is one of the few situations in which an omission of the expectancy control groups would not have increased our errors of interpretation. But as with Case 1, it seems impossible to know beforehand, without our (or someone’s) ascertaining empirically that such would be the result. Case 7 shows only a large and significant expectancy effect. It is like Case 1 in showing only a large main effect, but unlike Case 1 in that the omission of expectancy controls would have caused serious error. Not only would the effect of the treatment have been “significant,” but the magnitude of the effect (cell A–cell D) would have been thought to be 6.0, as in Case 1, rather than the zero it really was.

Table 23–4 Expectancy-Controlled Experiments Showing Small Treatment Effects

                            Treatment conditions
Case    Expectancy          Experimental    Control
6       Occurrence          0.0             0.0
        Nonoccurrence       0.0             0.0
7       Occurrence          6.0             6.0
        Nonoccurrence       0.0             0.0
8       Occurrence          6.0             5.0
        Nonoccurrence       1.0             0.0
Case 8, somewhat analogous to Case 2, shows a significant and large expectancy effect and a significant but relatively trivial treatment effect. The omission of expectancy controls in this case, as in Case 7, would have greatly misled us as to the magnitude of the treatment effect. Had we been interested only in establishing any difference favoring the experimental over the control conditions, however, we would not have been misled by the omission of expectancy controls.

Other Treatment Effects

Only a few of the possible outcomes of expectancy-controlled experiments have been presented. Some of the outcomes not described here in detail make some sort of psychological sense, but many do not. Examples of those that do would be situations in which one main effect is significant and very large relative to the other main effect and interaction, which are small but significant (Table 23-5). Examples of outcomes making less psychological sense include those many possible situations in which (1) the means of the control conditions unpredictably exceed the means of the experimental conditions, and/or (2) the means obtained under the expectancy-for-nonoccurrence exceed the means obtained under the expectancy-for-occurrence situation, and/or (3) some interaction of these reversed main effects. To say that some of these outcomes are less sensible psychologically is not to say that they will be rare. The unexpected, reversed finding is quite frequent in the behavioral sciences. Adding to the likelihood of less sensible findings are some of the data presented earlier in this book demonstrating the “bending-over-backward” effect. Although not the usual result, there are occasions (e.g., when the rewards are psychologically excessive) when experimenters try so hard to avoid letting their expectancy influence their subjects that the subjects are influenced to respond in the direction opposite to that consistent with the experimenter’s expectancy.
If unexpected treatment effects join synergistically with an experimenter bending over backward to avoid biasing, we may obtain a significant “reversed interaction.” In this case, for example, only that cell (D) in the control condition assigned an experimenter not expecting the treatment effect would show the predicted treatment effect.

Table 23–5 Expectancy-Controlled Experiments Showing Interpretable Interacting Main Effects

                            Treatment conditions
Case    Expectancy          Experimental    Control
9       Occurrence          6.0             3.0
        Nonoccurrence       6.0             0.0
10      Occurrence          6.0             6.0
        Nonoccurrence       3.0             0.0

Partial Expectancy Controls

So far we have discussed only the use of “complete expectancy controls.” The use of “partial expectancy controls” (employing either cell B or cell C rather than both) is best considered only if the alternative is to use no expectancy control at all.
The relative loss of information incurred, when only partial rather than full expectancy controls are employed, depends on the ‘‘true’’ outcome of the experiment. Thus, although for most outcomes we would be better off to use partial rather than no expectancy control, there are ‘‘true’’ outcomes involving interaction effects (e.g., Cases 4, 5, and 10) for which the use of only partial controls could lead to seriously erroneous conclusions about the relative effects of the treatments vs. the effects of the experimenter’s expectancy. The problem currently is that we have no good basis for deciding what the true outcome would have been if expectancy had been fully controlled. As complete expectancy control groups are employed more and more, we may accumulate enough information to sensibly decide for what type of study we can afford to omit one (or both) of the expectancy controls. If for some reason, perhaps logistic, only two of the four cells can be employed, what is our best choice? We may choose either one of the comparisons within rows (cells A vs. B or C vs. D) or one of the comparisons within diagonals (cells B vs. C or A vs. D, the ‘‘usual’’ comparison). By defining the ‘‘true’’ magnitude of the treatment effect as the difference between the column means in the completely controlled expectancy design, the relative merits of the use of within-rows vs. within-diagonal comparisons may be illustrated. Table 23-6 has been derived from the hypothetical outcomes of the fully controlled experiments shown in Tables 23-2 to 23-5. For each comparison, the magnitude of error in obtaining the ‘‘true’’ magnitude of the treatment effect has been listed. It can be seen that the choice of a within-row comparison leads to fewer errors, and none so large as some of those obtained if the choice is of a within-diagonal comparison. 
Furthermore, errors in obtaining the “true” treatment effect occur in the within-row comparison only in those cases where the completely controlled experiment shows an interaction effect. It makes no systematic difference which of the two within-rows comparisons we choose (cells A vs. B or cells C vs. D). If we chose to make within-diagonal comparisons, however, it would make a very important systematic difference whether we employed cells B vs. C or cells A vs. D. The former comparison, except for Cases 1 and 6, always underestimates the “true” treatment effect, whereas the latter, more typical comparison, with the same two exceptions, always overestimates it.

Table 23–6 Magnitude of Error as a Function of Choice of Comparison

        Comparisons
Case    Within rows (A-B or C-D)    Within diagonals (A-D or B-C)
1       0.0                         0.0
2       0.0                         1.0
3       0.0                         3.0
4       3.0                         3.0
5       3.0                         3.0
6       0.0                         0.0
7       0.0                         6.0
8       0.0                         5.0
9       1.5                         1.5
10      1.5                         4.5

It seems clear, then, that if for any reason only two cells can be employed, the experimenter should have the same expectancy in both, favoring either the occurrence or the nonoccurrence of the treatment effect. Although the two-cell, within-row, partial expectancy-controlled experiment is preferable to the ordinary within-diagonal experiment, it is no substitute for the complete expectancy-controlled experiment nor even for the three-cell design (B or C omitted) described. The three-cell design, at least, has the very real advantage of affording us a replication of the usual experiment (cells A vs. D) uncontrolled for expectancy.
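The entries of Table 23-6 can be rederived from the cell means of Tables 23-2 to 23-5, using the definition of the “true” treatment effect as the difference between the column means. The Python sketch below is offered as an illustration under stated assumptions: the names `CASES`, `signed_errors`, and `error_table` are hypothetical, and the B vs. C comparison is taken as cell C minus cell B (experimental minus control) so that its sign is comparable with the other three estimates.

```python
# Cell means (A, B, C, D) copied from Tables 23-2 to 23-5.
# A, C: experimental treatment; B, D: control treatment.
# A, B: expectancy for occurrence; C, D: expectancy for nonoccurrence.
CASES = {
    1: (6.0, 0.0, 6.0, 0.0),
    2: (6.0, 1.0, 5.0, 0.0),
    3: (6.0, 3.0, 3.0, 0.0),
    4: (6.0, 0.0, 0.0, 0.0),
    5: (6.0, 6.0, 6.0, 0.0),
    6: (0.0, 0.0, 0.0, 0.0),
    7: (6.0, 6.0, 0.0, 0.0),
    8: (6.0, 5.0, 1.0, 0.0),
    9: (6.0, 3.0, 6.0, 0.0),
    10: (6.0, 6.0, 3.0, 0.0),
}


def signed_errors(a, b, c, d):
    """Signed error of each two-cell estimate of the treatment effect."""
    # "True" effect: difference between the column means of the full design.
    true = (a + c) / 2 - (b + d) / 2
    return {
        "A-B": (a - b) - true,  # within rows, expectancy for occurrence
        "C-D": (c - d) - true,  # within rows, expectancy for nonoccurrence
        "A-D": (a - d) - true,  # within diagonals, the usual experiment
        "C-B": (c - b) - true,  # within diagonals, B vs. C (assumed sign)
    }


def error_table():
    """Map case -> (within-rows magnitude, within-diagonals magnitude)."""
    table = {}
    for case, cells in CASES.items():
        err = signed_errors(*cells)
        table[case] = (abs(err["A-B"]), abs(err["A-D"]))
    return table


for case, (rows, diagonals) in error_table().items():
    print(f"Case {case:2d}  rows {rows:.1f}  diagonals {diagonals:.1f}")
```

In this sketch the two within-row errors always have equal magnitude and opposite sign, and each equals the half-scaled interaction contrast, while the two within-diagonal errors equal plus and minus the expectancy main effect. That is consistent with the observation that within-row errors arise only when an interaction is present.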
Implementation of Expectancy Controls

We have discussed the general design of expectancy control groups but have had little to say so far about some practical issues raised by this methodological suggestion. For example, we stated at the beginning of this chapter that experimenters’ expectancies tended to be preconfounded with their experimental and control conditions. How can we create those conditions in which the experimenters’ expectancy runs counter to the predicted effects of the experimental conditions (cells B and C)? A number of methods are available for creating expectancy control groups, and these will be described shortly. However, most of these methods involve the withholding of information from, and giving of false information to, the experimenters. A number of ethical questions are raised by such deception.

Ethical Considerations

Perhaps the major question to ask is whether the distasteful though necessary deception is warranted by the potential importance of the result of the expectancy-controlled experiment. Since virtually no scientific behavioral research can be univocally described as too trivial to warrant adequate controls, it would seem that most research conducted in the behavioral sciences should be expectancy-controlled. It is a moot question whether the deception of data collectors should score lower on an evaluative scale than the production of research results that may be subject to serious error, error that could be assessed by the employment of appropriate controls. How serious is the deception of data collectors (or subjects) in general? The widespread use of placebo control groups in pharmacological research, especially when conducted under double-blind conditions, suggests that no harmful effects of deception need occur. Placebo and double-blind deceptions have shown themselves to warrant use by the greater knowledge of drug action they have given us.
In our own research on experimenter expectancy effects we have employed deception of necessity and have found no harmful effects. Factually erroneous information given data collectors can be quickly (and cognitively) corrected. We have found no hostility (to be affectively corrected) by the data collectors resulting from an explanation of how expectancies had been created. On the contrary, the data collectors seemed intrigued and wondered why expectancy controls were not routinely employed. Of course, there is no question that hostility can be evoked by the deception of data collectors or subjects. It is my very strong impression—if I may insert here a
clinical ‘‘footnote’’—that such hostility is evoked not by the fact of deception itself but by the manner of deceiving, the personalized nature of the deception, and the manner of subsequent explanation or ‘‘de-hoaxing.’’ These variables serve the data collectors (or subjects) as sources of clues to the experimenter’s underlying motivation for having employed the deception. If subjects are satisfied that these motivations are primarily rational (e.g., for ‘‘science’’) rather than primarily irrational (e.g., for ‘‘fun,’’ to be ‘‘cute’’ or clever, to be hostile), they will react with appreciation of the necessity for deception, rather than with hostility at having been deceived. Note that we are not speaking here of experimenters’ ‘‘true’’ motives in employing (or not employing) deception; rather we are speaking of the subjects’ perception of these motives. (The general problem of the deception of subjects has been discussed recently by Vinacke [1965], by Wolfle [1960], and in most detail, by Kelman [1965].) The Induction of Experimenter Expectancies Ascribing subject characteristics. One method for creating experimenter expectancies calls for a statement of subject characteristics. Subjects assigned to cell B are described to their experimenters as having characteristics such that their response in the experiment will be like that of the subjects in cell A. Subjects assigned to cell C are described to their experimenters as having characteristics such that their response will be like that of subjects in cell D. If the experiment were one we have described earlier, the effect of anxiety upon intellectual performance among college students, cell B experimenters could be told their subjects were a bit below average intellectually for college students. Cell C experimenters could be told their subjects were a bit above average intellectually. 
It goes without saying, of course, that subjects are assigned at random to the experimental conditions or equated on intellectual performance.

Ascribing experimental conditions. Another method for generating expectancy control groups is by the labeling of treatment conditions. In those cases where the experimenter does not himself administer the experimental treatment, he is told that the cell B subjects have received the experimental treatment and that cell C subjects have received the control “treatment.” In the example we have used, cell B experimenters would be told that their subjects had undergone the anxiety-arousing experience, and cell C experimenters would be told that their subjects had not, that they were part of the control group.

Disparagement of treatment effectiveness. A third method of generating expectancy control groups involves a relative disparagement of treatment effectiveness. In this method, cell C experimenters are “shown” that the specific experimental treatment of their subjects “could not possibly” have the predicted effect on their behavior. Cell B experimenters are “shown” that, whereas the subjects of the treatment condition may show the predicted response, the subjects of the control group will show that response just as much if not more. Thus, in the example used, cell C experimenters would have pointed out to them that the particular “anxiety-arousal” treatment could not really make anyone anxious. Cell B experimenters would have pointed out to them that the particular nature of the “control” condition was such that it might make subjects even more anxious than the “experimental” condition.
Theory reversal. A fourth method of generating expectancy control groups is that of hypothesis or ‘‘theory reversal.’’ It can best be used when less academically advanced or less expert data collectors are employed. In this method cell B and cell C experimenters are provided with a plausible rationale (possibly buttressed by ‘‘earlier results’’ or results in the literature that are consistent with the rationale) for expecting just the opposite sort of relative outcome from the experimental and control subjects. In our example, cell B and cell C experimenters might be shown how the usual control situation in an experiment generates anxiety whereas the anxiety-arousing treatment merely puts a ‘‘sharp edge’’ on the subjects, leading to improved intellectual performance. Although practically none of the outlined methods for creating expectancy control groups have been employed by investigators not specifically interested in the effect of experimenter expectancy, there is an ingenious exception to be found in the work of Rosenhan (1964). In a study of the relationship between hypnosis and conformity Rosenhan (1963) found more hypnotizable subjects to conform more under certain conditions and to conform less under other conditions. He and a research assistant then attempted to replicate these findings. Essentially he assigned himself as an experimenter to cells A and D and the assistant to cells B and C. He employed the technique of ‘‘theory reversal’’ by showing the assistant the results of his earlier study but with the signs preceding the correlation coefficients reversed. Results of this expectancy-controlled study showed that Rosenhan himself obtained data similar to those he had obtained earlier, but the assistant obtained data similar to the opposite of the data obtained earlier but consistent with her hypothesis. 
Rosenhan rightly points out that since the two experimenters differed in many ways other than expectancy it cannot be concluded with great certainty that it was the expectancy difference that led to the opposite experimental results. Nevertheless, the results of this expectancy-controlled study showed that experimenter attributes (including expectancy) might account for some of the differences in results reported in the literature. Intentional influence. A fifth method for the creation of expectancy control groups, and one not requiring the employment of deception, is the ‘‘method of intentional influence.’’ This technique can be used with very sophisticated as well as quite unsophisticated experimenters. Experimenters in cell B are quite aware that their subjects are in the control condition, but they are told to try to influence them to respond as though they had received the experimental treatment, but without deviating from the detailed procedure followed by experimenters of cell D. Experimenters in cell C are aware that their subjects are in the experimental condition, but they are told to try to influence them to respond as though they had been in the control condition, but again, without deviating from the detailed procedure followed by experimenters of cell A. The advantages of this technique have already been mentioned—i.e., that it involves no deception and can be used with experimenters of any degree of sophistication. The major disadvantage of this technique is its lack of symmetry of cells B and C with cells A and D. Experimenters in these last two cells are not making any conscious effort to influence their subjects, whereas experimenters in cells B and C are. Thus, intentionality of influence is confounded with the ‘‘primary’’ (cells A and D) vs. ‘‘expectancy control group’’ (cells B and C) comparison.
Unintentional communication. A sixth method for creating expectancy control groups differs from all those described so far in that no expectancy is ever explicitly communicated to the experimenters of the expectancy control groups. This method is based on the difficulty of maintaining double-blind contact which was documented in the last chapter. Experimenters of cells A and B are trained by experimenters expecting large treatment effects, and experimenters of cells C and D are trained by experimenters expecting no treatment effects. The experimenters serving as trainers are likely to subtly and unintentionally communicate these expectancies to their trainee-experimenters. The expectancies of the trainer-experimenters can be created by any one of several of the methods described earlier. Another method of creating expectancies in the trainers would be to describe the experiment in which the trainees will be employed without mentioning that cells B and C would be formed as randomly divided subgroups of cell A and cell D trainees. Finally, expectancies may be created in the trainers by having them actually participate as experimenters in cells A and D of the experiment. Cell A trainers, of course, would contact only cell A (and B) trainees, and cell D trainers would contact only cell D (and C) trainees. Subjects’ responses. A final method for creating expectancy control groups is one that also never explicitly communicates an expectancy to the experimenters. It is a method that derives from studies of the effects of early data returns and the finding that experimenter expectancies may be altered by these early returns. Half the experimenters of cell A are, through the use of accomplices, provided with disconfirming early returns, thereby making them more like cell C experimenters. Half the experimenters of cell D are similarly provided with disconfirming data, thereby making them more like cell B experimenters. 
If only one experimenter is available, we can then employ him in cells A and B or in cells C and D. We saw earlier that this was preferable to employing experimenters in the diagonal conditions (e.g., cells A and D). More generally, the procedure of providing experimenters with planned early returns can be employed to augment some of the methods described earlier for creating expectancy control groups. Thus, if the “method of ascribing subject characteristics” has been employed, the induced expectancy would be greatly strengthened by the first few subjects’ providing the expected data. More details of the procedure for providing confirming or disconfirming early returns through the use of accomplices were presented in the chapter dealing with the effects of early data returns.
In our discussion of various methods for generating expectancy control groups we have tried to be suggestive rather than exhaustive. Entirely different methods, a variety of subtypes of the methods mentioned, or combinations of the several methods may be most useful for a certain area of behavioral research and a specific research question.
Experimenter Assignment
Perhaps the ideal way in which to use expectancy control groups is to take a large and random sample of experimenters and assign them randomly to the various subconditions of the experimental design we have been discussing. The general advantages of a large number of experimenters have already been stressed in earlier chapters. But
Expectancy Control Groups
the absence of such a pool of experimenters does not rule out the use of expectancy-controlled designs.
One experimenter. Even if only a single experimenter is available, experiments can be expectancy-controlled. Subjects in cells A and D would be contacted as in ordinary experimental procedure. By using certain of the methods described earlier for creating expectancies, the same experimenter can also contact subjects in cells B and C. If more than one experimenter is available, all may be employed in each experimental condition. (In such a case the analysis of the data changes from the simple 2 × 2 [treatment × expectancy] analysis of variance to the more complex 2 × 2 × N [treatment × expectancy × experimenters] analysis in which each experimenter may be regarded as a replicator of the 2 × 2 experiment [e.g., Lindquist, 1953]3.)
Two experimenters. If only two experimenters are available it would probably be best if each could contact subjects in all four conditions, but some alternatives are possible and may, for a particular experiment, be necessary. Thus, Rosenhan (1964) could not very well have placed himself into the B and C cells of his expectancy-controlled experiment, nor, by his technique, placed his assistant in the A and D cells. With two experimenters the four cells can be divided equally in three ways: (1) One experimenter contacts only those subjects to be seen with an expectancy for occurrence of the treatment effect (cells A and B), while the other experimenter contacts the remaining subjects (cells C and D). (2) One experimenter contacts only the treatment condition subjects (cells A and C), while the other contacts only control condition subjects (cells B and D). (3) One experimenter conducts the “basic-uncontrolled” experiment (cells A and D), while the other contacts the subjects in the expectancy control groups (cells B and C).
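One way to see what each of these three equal divisions does to the design is to compare the experimenter split with the contrasts of the 2 × 2 layout. The following is only a sketch: the +1/-1 coding, the variable names, and the helper `confounded_contrast` are illustrative assumptions and are not part of the original procedure. Cells A and B are those seen with an expectancy for occurrence of the treatment effect, and cells A and C are the treatment cells.

```python
# Hypothetical +1/-1 contrast codes for the four cells of the 2 x 2 design.
CELLS = ("A", "B", "C", "D")
CONTRASTS = {
    "treatment": {"A": +1, "B": -1, "C": +1, "D": -1},   # A, C = treatment cells
    "expectancy": {"A": +1, "B": +1, "C": -1, "D": -1},  # A, B = effect expected
}
# The interaction contrast is the cell-wise product of the two main-effect codes.
CONTRASTS["interaction"] = {
    c: CONTRASTS["treatment"][c] * CONTRASTS["expectancy"][c] for c in CELLS
}


def confounded_contrast(first_experimenter_cells):
    """Return the name of the contrast that coincides with a two-experimenter
    split of the cells, or None if the split matches no contrast exactly."""
    split = {c: (+1 if c in first_experimenter_cells else -1) for c in CELLS}
    for name, codes in CONTRASTS.items():
        same = all(codes[c] == split[c] for c in CELLS)
        mirrored = all(codes[c] == -split[c] for c in CELLS)
        if same or mirrored:
            return name
    return None


# The three equal divisions, described by the cells of the first experimenter.
divisions = {1: {"A", "B"}, 2: {"A", "C"}, 3: {"A", "D"}}
confounded = {number: confounded_contrast(cells) for number, cells in divisions.items()}
```

Under these assumptions, the split between the two experimenters duplicates the expectancy contrast in division 1, the treatment contrast in division 2, and the interaction contrast in division 3, so a difference between the two experimenters cannot be separated from that source of variance.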
In each of these three divisions the effect of the experimenter’s attributes is confounded with one of the sources of variance.4 Thus, in division 1 individual differences between the two experimenters could significantly alter the magnitude of the expectancy effect. In division 2 these differences could affect the treatment effect, and in division 3 they could affect the interaction. Divisions 1 and 2 are probably not usable (although their analogue is sometimes employed in research, as when different experimenters contact subjects in the treatment and control conditions). Division 3 does seem useful. If the effects of experimenters’ attributes are constant for the subjects of the two cells contacted by each experimenter, then at least neither of the main effects should be affected, although their interpretation may be complicated by a significant interaction which could far exceed either or both of the main effects. This division, across the diagonals of our basic design, will be remembered as analogous to the expectancy-controlled study conducted by Rosenhan (1964).
3 For appropriate application, Lindquist (1953) or a comparable text should be consulted with special attention to the fact that error terms in fixed constants models are not analogous to error terms in mixed models. The 2 × 2 × N design, for example, may be regarded as either model depending on what basis was used for the selection of the experimenters.
4 If two experimenters divided the experiment unequally, it can be seen that the three cells of one experimenter and the single cell of the other would yield results confounded with the experimenter.
More than two experimenters. For samples of experimenters larger than two but smaller than about eight, the best strategy would appear to be either (a) using each
experimenter in each cell, or (b) confounding the interaction with experimenter differences as in division 3, above, or perhaps best of all, (c) a combination of a and b such that about half the experimenters available are assigned to each method. The particular advantage of this method is that it permits a comparison of the results of the two strategies. Using the results of the replication(s) in which each experimenter functions in all four cells (strategy a) may help us assess whether a large interaction in the results of a replication employing strategy b is due to confounding with experimenters or is more likely to be independent of experimenters. With a small number of experimenters, however, we can never answer that question with great confidence because we will not be able to assess adequately the effects of the orders and sequences in which experimenters contacted the subjects of the different treatment conditions (cells). More than seven experimenters. If we have eight or more experimenters available, we can begin to think seriously of assigning them at random to one of the four cells of our basic design, thereby eliminating the problem of assessing order and sequence effects.5 It also becomes more possible to test the significance of the difference in results obtained by experimenters within treatments as well as the effects of the treatment condition, the expectancy condition, and their interaction. If the use of such a ‘‘nested’’ (Winer, 1962) or groups-within-treatments (Lindquist, 1953) analysis shows no significant individual differences between experimenters within cells, we can simply forget that the subjects were contacted by different data collectors and use the individual differences among subjects within cells as the error term against which to evaluate the other sources of variation. 
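The random assignment of eight or more experimenters to the four cells can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `assign_experimenters`, the experimenter labels, and the seed are all hypothetical, and keeping the cell sizes as equal as possible is a design choice of the sketch, not a requirement stated in the text.

```python
import random


def assign_experimenters(experimenters, cells=("A", "B", "C", "D"), seed=None):
    """Randomly assign experimenters to cells, keeping cell sizes as equal
    as possible (sizes differ by at most one)."""
    rng = random.Random(seed)      # local generator so the draw is reproducible
    shuffled = list(experimenters)
    rng.shuffle(shuffled)          # random order of experimenters
    assignment = {cell: [] for cell in cells}
    for position, experimenter in enumerate(shuffled):
        # Deal the shuffled experimenters to the cells in rotation.
        assignment[cells[position % len(cells)]].append(experimenter)
    return assignment


# Eight hypothetical experimenters, two per cell after assignment.
experimenters = [f"E{i}" for i in range(1, 9)]
assignment = assign_experimenters(experimenters, seed=1966)
```

Because each experimenter works in only one cell, order and sequence effects across cells do not arise, which is the advantage the text attributes to this arrangement.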
Alternatively, with very large samples of experimenters, the mean scores obtained by each experimenter may be used as the basic data, which can be analyzed by the standard 2 × 2 analysis of variance.
More Complex Designs
Throughout the discussion of expectancy control groups we have kept the basic design as simple as possible for illustrative purposes. Thus, our basic experiment has been the comparison of a single experimental treatment condition with a control condition. The principle of expectancy control groups can, however, be applied to more complex designs. In some situations the complexity of the expectancy-controlled design increases proportionally to the complexity of the added experimental groups. In other situations, however, the increase in complexity is disproportional.
Proportional increase in complexity. As an example of a proportional increase in complexity, we note that the simple complete expectancy-controlled experiment (2 × 2) may be subdivided into two subexperiments, each conducted with subjects at one of two levels of some personal characteristic (e.g., sex, anxiety, need for achievement). In this case our basic 2 × 2 design becomes a 2 × 2 × 2 design, assuming that the level of subject characteristic does not in itself affect experimenters’ expectancies regarding the effects of the treatment condition. Our four basic groups (A, B, C, D) have become eight groups representing a proportional increase in complexity.
Disproportional increase in complexity. A disproportional increase in complexity may be required by the addition of a single experimental treatment condition if the preexisting expectancy of its effect were opposite to that of the original treatment condition. For illustration we return to our example of a study of the effects of anxiety arousal on intellectual performance. Suppose we add an experimental condition in which subjects are actively reassured about their intellectual performance. The hypothesis might be that this group of subjects would show an improved performance relative to the “no-treatment” control group, whereas the anxious subjects would show an impairment. We may now want to have three conditions or levels of experimenter expectancy rather than the two we have employed in earlier examples. If we did, our basic 2 × 2 design would become not a 2 × 3 but rather a 3 × 3. Our four basic groups, therefore, have become nine groups. For logistic reasons we may not be able to entertain such a complex design. If this is the case, then what we have said earlier about partial expectancy controls will apply. Therefore, if we can employ only three groups of our design, all three groups should be contacted by experimenters holding the same expectancy. This would represent a within-row comparison rather than the fully uncontrolled (for expectancy) diagonal comparison in which expectancy would be confounded completely with experimental condition. In any specific experimental situation the basic principle of expectancy control can be applied by the investigator, although the specific form of the design will be determined both by the nature of the research question and by consideration of the resources available for the research.
5 With only four experimenters, if we assigned one to each of the four cells, the effects of cells would be confounded with individual differences among experimenters. With random assignment of experimenters to cells, this confounding is less and less of a problem as our sample of experimenters becomes larger and larger.
Controlling for Subject Expectancy
In this chapter we have discussed the use of special control groups to control only the effects of experimenter expectancy.6 In an earlier chapter, however, we showed that, at least in some experimental situations, the subjects’ expectancies or outcome orientations could also be unintended determinants of the results of our research. There may be some experiments in which these subject expectancy effects are large and perhaps as important as, or even more important than, experimenter expectancy effects. By employing the principles for generating experimenter expectancy control groups we can control for subjects’ expectancy.
We can illustrate this best by imagining the following experiment: We want to learn the effects of alcohol on verbal learning. Our basic design is to have half our randomly assigned subjects consume a given quantity of beverage alcohol while the remaining subjects consume a soft drink. For the sake of simplicity let us suppose that experimenters and subjects alike are convinced that the ingestion of the experimental dosage of alcohol will impair verbal learning. We can control for experimenter expectancy by having experimenters believe that half the subjects in the treatment (alcohol) condition are in the control (soft-drink) condition (cell C). Similarly we could have experimenters in the control condition believe that half of their subjects are in the experimental condition (cell B). So far we have dealt only with our old friends: cells A, B, C, and D.
Because subjects believe strongly (let’s say) that alcohol impairs their verbal learning, our old A, B, C, D design is confounded by subjects’ expectancies. Table 23-7 shows the situation. In our old cell A, subjects expect the effects of alcohol, but in cell B they do not. In our old cell D, subjects expect no effects of alcohol, but in cell C they do. Our basic experiment of alcohol vs. soft-drink has been doubly confounded. If we used only the basic groups of cell A vs. cell D, any differences might be due to the effect of alcohol, the effect of experimenter expectancies, the effect of subject expectancies, or any of several possible interaction effects.
To control fully for subject expectancy we must add cells A1, B1, C1, and D1, as shown in Table 23-7. These cells may, for this hypothetical experiment, be generated by the use of a nonalcoholic beverage which has an alcohol-like taste (for cells B1 and D1) and the use of an alcoholic beverage which has a nonalcoholic taste (for cells A1 and C1).7 Instead of, or in addition to, the variation of the tastes of the substances ingested, verbal statements to subjects could be used to vary their performance expectancies.
The analysis of the data of this double-expectancy-controlled experiment could proceed as a straightforward 2 × 2 × 2 analysis of variance. We can, therefore, assess the independent effects of the alcohol, the subjects’ and the experimenters’ expectancies, the interactions between any two of these independent variables, and the interaction of all three. If all eight groups could not be managed, the design could be cut in half by employing any two rows of cells shown in Table 23-7. Then all experimenters (rows 1 and 2) or none (rows 3 and 4) would expect the effects of alcohol. Or all subjects (rows 1 and 3) or none (rows 2 and 4) would expect the effects of alcohol. Any of these four subdivisions would be helpful, but none would permit a comparison of the effects of experimenter vs. subject expectancy. However, the use of the two rows in which experimenters’ and subjects’ expectancies disagreed (rows 2 and 3) would permit such a comparison.
6 The control of other experimenter effects, including modeling effects and effects due to various other experimenter attributes, depends on their measurement rather than on their experimental induction; they have been discussed in the chapter dealing with the sampling of experimenters.
Table 23–7 Double Confounding of Treatments with Experimenter and Subject Expectancy

       Expectancy                         Treatment conditions
Row    Experimenter      Subject          Experimental    Control
1      Occurrence        Occurrence       A               B1
2      Occurrence        Nonoccurrence    A1              B
3      Nonoccurrence     Occurrence       C               D1
4      Nonoccurrence     Nonoccurrence    C1              D

7 It goes without saying that all our subjects are social drinkers and have volunteered to ingest alcohol, though with the understanding that not all volunteers will necessarily receive alcohol.
If the purpose of the experiment were to permit generalization to the real-life social drinking situations in which alcohol was consumed, rather than to evaluate the effects of a chemical upon verbal learning, we would prefer that subdivision of the experiment in which subjects’ and experimenters’ expectancies were in agreement (rows 1 and 4). In most real-life social drinking situations both the drinker and his ‘‘evaluator’’ are aware of whether alcohol has been consumed, although there are certainly exceptions to this. Employing this subdivision of the full experiment does not permit us to compare the effects of subjects’ or experimenters’ expectancies. However, we may be less interested in that comparison for some purposes, sacrificing it for the greater ecological validity of this subdivision. Another advantage of this subdivision is that it includes a replicate of the ‘‘usual’’ experiment (cell A vs. cell D). If, for some reason, we could employ only two of the eight groups, we may choose any one of the four rows, since in each we have equated for both subjects’ and experimenters’ expectancies. On the grounds of ecological validity, however, we would probably prefer row 1 or row 4 to rows 2 or 3. And because it might be easier to implement practically, we might prefer row 1 over row 4. For any other number of groups to be chosen from the full complement of eight groups, the choice would be made on bases similar to those presented just now and also in the section dealing with partial expectancy controls.
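The consequences of choosing any two rows of Table 23-7 can be checked mechanically. The sketch below encodes each row by the two expectancies it holds and classifies every pair of rows; the dictionary `ROWS`, the function `classify`, and the wording of the labels are illustrative assumptions introduced here, not terms from the original design.

```python
from itertools import combinations

# Rows of Table 23-7 as (experimenter expects the effect, subject expects the effect).
ROWS = {
    1: (True, True),
    2: (True, False),
    3: (False, True),
    4: (False, False),
}


def classify(row_1, row_2):
    """Describe what a half-design built from two rows holds constant."""
    (exp_1, subj_1), (exp_2, subj_2) = ROWS[row_1], ROWS[row_2]
    if exp_1 == exp_2:
        return "experimenter expectancy held constant"
    if subj_1 == subj_2:
        return "subject expectancy held constant"
    if exp_1 == subj_1:
        # Both expectancies vary across the rows and match within each row.
        return "expectancies always agree"
    return "expectancies always disagree"


# Classify all six possible pairs of rows.
half_designs = {pair: classify(*pair) for pair in combinations(ROWS, 2)}
```

Under this encoding, rows 1 and 2 or rows 3 and 4 hold experimenter expectancy constant, rows 1 and 3 or rows 2 and 4 hold subject expectancy constant, rows 1 and 4 form the subdivision in which the two expectancies agree, and rows 2 and 3 form the one in which they disagree.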
Combining Methods of Control
The control group designs described in this chapter can be combined with other methods for the control of experimenter effects which were described in earlier chapters. We can, for example, minimize the contact between experimenters and subjects of expectancy-controlled experiments. This should reduce the communication of our experimenters’ expectancy to their subjects, but unless we have an intrinsic interest in these expectancy effects, that is all to the good. As contact with subjects is reduced further and further, in principle we have less and less need for the employment of any expectancy controls at all.
Combining of minimized (or blind) contact with expectancy control groups has a very special advantage. It provides us with an opportunity to assess the success of the minimization (or blindness) of contact. If contact has been successfully minimal (and/or blind), we should find no significant main effect of experimenter expectancies nor any interaction involving experimenter expectancy. Finding such effects would be sufficient evidence for concluding that the minimization (or blind) procedure had been ineffective.
A still more powerful combination of controls for expectancy effects might include sampling experimenters, determining their expectancies, applying the expectancy control group procedure, and maintaining blind and minimized contact. This combination of controls might reduce to an absolute minimum the biasing effect of experimenters’ expectancies. The basic 2 × 2 design described in this chapter could then be extended into a third dimension: the dimension of “idiosyncratic” or “natural” expectancy. If there were only two types of idiosyncratic expectancies, our overall design might become a 2 × 2 × 2 arrangement: the experimental vs. control treatment, the experimenters led to expect a treatment effect vs. those led to
expect no treatment effect, and experimenters “naturally” expecting a treatment effect vs. those expecting no treatment effect. If we had a range of idiosyncratic or “natural” expectancies rather than only two, we could elongate the design to have three, four, or even more levels of “natural” expectancy. One advantage of generating this third dimension of experimenters’ “natural” expectancies is that it may help to reduce the variation between experimenters (within cells). Within any one of the basic four cells of our expectancy-controlled experiment, experimenters’ obtained data may vary because of variation in initial idiosyncratic expectancies. Another advantage of generating the dimension of experimenters’ idiosyncratic expectancies stems from the finding that expectancy bias is maximized when experimentally induced and idiosyncratic expectancies are in agreement (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). In “real” experiments, the data collector has no conflicting expectancies arising from the imposition of an “artificial” expectancy upon the preexisting one. We may, therefore, obtain a more accurate estimate of the upper limits of the effects of expectancy bias as it occurs in “real” data collectors by allowing both types of expectancy to operate jointly.
In addition to the combination of methods for the control of expectancy effects already mentioned we might employ one or more methods of observing experimenters’ behavior in interaction with their subjects. Their behavior vis-à-vis their subjects could be regarded as the dependent variable for one analysis. If we found experimenters in the various conditions to show no significant differences in behavior, we would feel more confident in the substantive results of the experiment. If we found significant differences in the behavior of experimenters in different conditions, we would feel more confident that our trouble in setting up the various control conditions was warranted.
The experimenters’ behavior toward their subjects can also be regarded as an independent variable. If we choose to so regard it, then we will learn something more, not only about methodological matters but about some substantive issues in unintentional interpersonal influence as well.
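The check on blind or minimized contact described earlier in this section asks whether the expectancy main effect and the interaction depart from zero. A rough sketch of the three contrasts of the basic 2 × 2 design, computed from cell means, is given below. The function name `two_by_two_effects` and the cell means are hypothetical, and the contrasts only describe the size of each effect; judging significance would still require an analysis of variance with an appropriate error term.

```python
def two_by_two_effects(mean_a, mean_b, mean_c, mean_d):
    """Return the treatment, expectancy, and interaction contrasts
    computed from the four cell means of the 2 x 2 design."""
    # Cells A and C are treatment cells; cells B and D are control cells.
    treatment = (mean_a + mean_c) / 2 - (mean_b + mean_d) / 2
    # Cells A and B are seen with an expectancy for the effect; C and D are not.
    expectancy = (mean_a + mean_b) / 2 - (mean_c + mean_d) / 2
    # Interaction: does the treatment difference depend on the expectancy held?
    interaction = (mean_a - mean_b) - (mean_c - mean_d)
    return {"treatment": treatment, "expectancy": expectancy, "interaction": interaction}


# Hypothetical cell means in which only the treatment makes a difference,
# the pattern expected if blind or minimized contact has succeeded.
effects = two_by_two_effects(mean_a=6.0, mean_b=4.0, mean_c=6.0, mean_d=4.0)
```

With these invented means the treatment contrast is 2.0 and the expectancy and interaction contrasts are both 0.0.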
Expectancy Controls: Cost versus Utility
In assessing the employment of expectancy control groups we can weigh their cost against their utility. We have already weighed one type of cost: the need to withhold information or give false information to our data collectors. In general, the utility of controlling for experimenter expectancy seems to outweigh heavily the innocuous deception necessitated by most methods of generating expectancy control groups. What about other costs? The number of subjects required for an experiment is not increased, the time per subject-contact is not increased and, sometimes, the number of data collectors involved is not even increased. The creation of expectancy control groups takes additional time in the planning stages of the experiment and, if more experimenters are employed, in the training stage. But this amount of time is measured in hours and minutes, not in months and weeks, and therefore should prove to be no real obstacle. If a larger number of experimenters are employed there may also be a small increase in the financial cost of the experiment, not because more hours are involved in all, but because an experimenter employed for a total of
one hour must usually be paid more per hour than one employed for 50 hours. But this cost, too, is relatively small.8 If a larger number of experimenters are employed than would normally be the case, an additional utility can be achieved if the procedure of simultaneous experimenter-subject contacts is employed, as suggested in the last chapter. The total time from beginning to end of the data collection can be greatly reduced, and this is an efficiency that is not hard to appreciate. All in all, the total costs of conducting expectancy-controlled experiments seem trivial in relation to the utility of the method. But it can be said that costs are easier to assess than utility. Utility for whom? Are there not areas in the behavioral sciences that just do not require controls for experimenter expectancy? To this it must be said that there may be, but we don’t know that to be the case. And if it is the case, we don’t know which areas are the immune ones. The employment of expectancy-controlling designs is perhaps the only way in which we can find out. In a sense, we must use these controls to learn whether we can afford to do without them.
8 Not really a cost, but a problem associated with the long-term usage of expectancy controls should be made explicit. As Milton Rosenberg has pointed out in a personal communication (1965), it might not take too long before the usual sources of research assistants are exhausted in the sense that in these circles everyone would know all about expectancy controls. In that case less sophisticated experimenters must be employed, though they will not indefinitely stay unsophisticated. It is for these reasons especially that serious consideration should be given in the near future to the development of the new profession of data collector described earlier.
24 Conclusion
The social situation that comes into being when an experimenter encounters his research subject is one of both general and unique importance to the social sciences. Its general importance derives from the fact that the interaction of experimenter and subject, like other two-person interactions, may be investigated empirically with a view to teaching us more about dyadic interaction in general. Its unique importance derives from the fact that the interaction of experimenter and subject, unlike other dyadic interactions, is a major source of our knowledge in the social sciences. To the extent that we hope for dependable knowledge in the social sciences generally, we must have dependable knowledge about the experimenter-subject interaction specifically. We can no more hope to acquire accurate information for our disciplines without an understanding of the data collection situation than astronomers and zoologists could hope to acquire accurate information for their disciplines without their understanding the effects of their telescopes and microscopes. For these reasons, increasing interest has been shown in the investigation of the experimenter-subject interaction system. And the outlook is anything but bleak. It does seem that we can profitably learn of those effects that the experimenter unwittingly may have on the results of his research. In the last five chapters, a variety of suggestions have been put forward which show some promise as controls for the effects of the experimenter in general and for the effects of his expectancy in particular. In Table 24-1 these suggestions are summarized as ten strategies or techniques. For each one, the consequences of its employment are listed, and for the last three, additional brief summaries are shown in Tables 24-2, 24-3, and 24-4.
The Social Psychology of Unintentional Influence
Quite apart from the methodological implications of research on experimenter expectancy effects there are substantive implications for the study of interpersonal relationships. Perhaps the most compelling and most general implication is that people can engage in highly effective and influential unprogrammed and unintended communication with one another and that this process of unintentional influence can be investigated experimentally.
Table 24–1 Strategies for the Control of Experimenter Expectancy Effects
1. Increasing the number of experimenters:
   decreases learning of influence techniques
   helps to maintain blindness
   minimizes effects of early data returns
   increases generality of results
   randomizes expectancies
   permits the method of collaborative disagreement
   permits statistical correction of expectancy effects
2. Observing the behavior of experimenters:
   sometimes reduces expectancy effects
   permits correction for unprogrammed behavior
   facilitates greater standardization of experimenter behavior
3. Analyzing experiments for order effects:
   permits inference about changes in experimenter behavior
4. Analyzing experiments for computational errors:
   permits inference about expectancy effects
5. Developing selection procedures:
   permits prediction of expectancy effects
6. Developing training procedures:
   permits prediction of expectancy effects
7. Developing a new profession of psychological experimenter:
   maximizes applicability of controls for expectancy effects
   reduces motivational bases for expectancy effects
8. Maintaining blind contact:
   minimizes expectancy effects (see Table 24-2)
9. Minimizing experimenter-subject contact:
   minimizes expectancy effects (see Table 24-3)
10. Employing expectancy control groups:
   permits assessment of expectancy effects (see Table 24-4)

Table 24–2 Blind Contact as a Control for Expectancy Effects
A. Sources of breakdown of blindness
   1. Principal investigator
   2. Subject (“side effects”)
B. Procedures facilitating maintenance of blindness
   1. The “total-blind” procedure
   2. Avoiding feedback from the principal investigator
   3. Avoiding feedback from the subject

Table 24–3 Minimized Contact as a Control for Expectancy Effects
A. Automated data collection systems
   1. Written instructions
   2. Tape-recorded instructions
   3. Filmed instructions
   4. Televised instructions
   5. Telephoned instructions
B. Restricting unintended cues to subjects and experimenters
   1. Interposing screen between subject and experimenter
   2. Contacting fewer subjects per experimenter
   3. Having subjects or machines record responses

Table 24–4 Procedures for Generating Experimenter Expectancies
1. Ascribing subject characteristics
2. Ascribing experimental conditions
3. Disparagement of treatment effectiveness
4. Theory reversal
5. Intentional influence
6. Unintentional communication
7. Early data returns
A great deal of effort within social psychology has gone into the study of such intentional influence processes as education, persuasion, coercion, propaganda, and psychotherapy. In each of these cases the influencer intends to influence the recipient of his message, and the message is usually encoded linguistically. Without diminishing efforts to understand these processes better, greater effort should be expended to understand the processes of unintentional influence in which the message is often encoded nonlinguistically. The question, in short, is how people ‘‘talk’’ to one another without ‘‘speaking.’’ At the present time we not only do not know the specific signals by which people unintentionally influence one another, we do not even know all the channels of communication involved. There is cause, though, to be optimistic. There appears to be a great current increase of interest in non-linguistic behavior as it may have relevance for human communication (e.g., Sebeok, Hayes, & Bateson, 1964). Most interest seems to have been centered in the auditory and visual channels of communication, and those are the channels investigated in the present research program. Other sense modalities will bear investigation, however. For example, Geldard (1960) has brought into focus the role of the skin senses in human communication and has presented evidence that the skin may be sensitive to human speech. Even when the sense modality involved is the auditory, it need not be only speech and speech-related stimuli to which the ear is sensitive. Kellogg (1962) and Rice and Feinstein (1965) have shown that at least among blind humans, audition can provide a surprising amount of information about the environment. Employing a technique of echo ranging, Kellogg’s subjects were able to assess accurately the distance, size, and composition of various external objects. 
The implications for interpersonal communication of these senses and of olfaction or of even less commonly discussed modalities (e.g., Ravitz, 1950; Ravitz, 1952) are not yet clear but are worthy of more intensive investigation. Since expectancies of another person’s behavior seem often to be communicated to that person unintentionally, the basic experimental paradigm employed in our research program might be employed even if the interest were not in expectancy effects per se. Thus if we were interested in unintentional communication among different groups of psychiatric patients, some could be given expectancies for others’ behavior. Effectiveness of unintentional influence could then be measured by the degree to which other patients were influenced by expectancies held of their behavior. There might be therapeutic as well as theoretical significance to knowing what kind of psychiatric patients were most successful in the unintentional influence of other psychiatric patients. The following experiment is relevant and was conducted with Clifford Knutson and Gordon Persinger.
Twelve hospitalized psychiatric patients served as “experimenters.” On the basis of their scores on the appropriate MMPI scales, three were classified as schizophrenic, three were classified as paranoid, three as character disorders, and three as neurotics. Each “experimenter” administered the standard photo-rating task to male and female patients, one each in each of the same four diagnostic categories. All subjects were acutely rather than chronically disturbed. From half the subjects in each of the four diagnostic groups, experimenters were led to expect photo ratings of success, and from half they were led to expect ratings of failure. It was somewhat surprising to find, even with this unusual sample of experimenters and of subjects, that overall expectancy effects were significant. Nine of the 12 experimenters obtained mean ratings from their subjects which were in the direction of their expectancies (p = .07), and 65 percent of all subjects gave ratings in the direction their experimenter expected (p = .002, χ² = 9.05). Magnitude of expectancy effect was not related directly to the nosology of the experimenter, nosology of the subject, or sex of the subject, though there were some significant interactions. The analysis of the data is not yet completed, but there is a finding that may be illustrative of the type of information such research may yield. Among the psychotic subjects, experimenters exerted greater unintentional influence on schizophrenic subjects if they were themselves schizophrenic, but they exerted less unintentional influence on paranoids if they were themselves paranoid (p < .10). Among nonpsychotic subjects, experimenters exerted greater influence on neurotic subjects if they were themselves neurotic, but they exerted less influence on subjects with character disorders if they themselves had been diagnosed as character disorders (p < .10).
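The first of the significance levels reported above, for 9 of the 12 experimenters, can be checked by hand. The source does not say which test produced p = .07; one reading consistent with that value is a one-tailed sign test, which is sketched here strictly as an assumption. The function name `sign_test_upper_tail` is hypothetical.

```python
from math import comb


def sign_test_upper_tail(successes, n):
    """One-tailed probability of observing at least `successes` results in the
    predicted direction out of n, when either direction is equally likely."""
    favorable = sum(comb(n, k) for k in range(successes, n + 1))
    return favorable / 2 ** n


# Nine of 12 experimenters obtained data in the direction of their expectancy.
p_experimenters = sign_test_upper_tail(9, 12)  # 299 / 4096
```

The result, 299/4096, rounds to .07. The chi-square value reported for the subjects is not reproduced here because the text does not give the number of subjects entering that test.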
Tables 24-5 and 24-6 show the magnitudes of expectancy effects for each of these combinations of experimenter and subject nosology. Scores for expectancy effects were defined in the usual manner, i.e., mean ratings obtained from subjects from whom positive (+5) ratings were expected minus the mean ratings obtained from subjects from whom negative (−5) ratings were expected.

Table 24–5 Similarity of Experimenter’s and Subject’s Diagnosis as Determinant of Expectancy Effects: Psychotic Subjects

                         Experimenter’s diagnosis
Subject’s diagnosis      Same       Different    Difference
Schizophrenic            +2.53      −0.59        +3.12
Paranoid                 −0.65      +1.35        −2.00
Difference               +3.18      −1.94        +5.12*

* p < .10 two-tail.

Table 24–6 Similarity of Experimenter’s and Subject’s Diagnosis as Determinant of Expectancy Effects: Nonpsychotic Subjects

                         Experimenter’s diagnosis
Subject’s diagnosis      Same       Different    Difference
Neurotic                 +3.67      +0.07        +3.60
Character disorder       +0.53      +1.81        −1.28
Difference               +3.14      −1.74        +4.88*

* p < .10 two-tail.

To summarize these preliminary results we might say that schizophrenic and neurotic patients are best “talked to” by patients of their own diagnostic category, whereas paranoid and character disorder patients are least well “talked to” by patients of their own diagnostic category. Of the four diagnostic groups, it is the schizophrenic and neurotic patients who show the greatest degree of overt anxiety and who, perhaps, feel best understood by equally anxious influencers. The paranoid and character disorder patients, both characterized by more overt hostility, may be especially sensitive to and resistant to the hostility of their paranoid and character-disordered influencers.

Findings of the kind described may have implications for the interpersonal treatment of psychiatric disorders. The belief is increasing that an important source of informal treatment is the association with other patients. If, as seems likely, such treatment is more unintentional than intentional, then the grouping of patients might be arranged so that patients are put into contact with those other patients with whom they can “talk” best, even if this “talk” be nonlinguistic. Perhaps success as an unintentional influencer of another’s behavior also has relevance for the selection of psychotherapists to work with certain types of patients. The general strategy of trying to “fit the therapist to the patient” has been employed with considerable success and has aroused considerable interest (e.g., Betz, 1962). That such selection may be made on the basis of unintentional communication patterns may also be suggested. In one recent study, it was found that the degree of hostility in the doctor’s speech was unrelated to his success in getting alcoholic patients to accept treatment. However, when the content of the doctor’s speech was filtered out, the degree of hostility found in the tone of his voice alone was significantly and negatively related to his success in influencing alcoholics to seek treatment (r = −.65, p = .06, two-tail [Milmoe, Rosenthal, Blane, Chafetz, & Wolf, 1965]).
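The scoring rule and the “Difference” entries of Tables 24-5 and 24-6 can be sketched as simple arithmetic. The function names and the sample ratings below are hypothetical; the cell values are those of the two tables, with signs taken as implied by the printed differences.

```python
from statistics import mean

def expectancy_effect(ratings_expected_high, ratings_expected_low):
    """Mean rating from subjects expected to rate +5 minus the mean rating
    from subjects expected to rate -5 (the usual scoring rule)."""
    return mean(ratings_expected_high) - mean(ratings_expected_low)

def difference_of_differences(first_row, second_row):
    """Each row is (same-diagnosis effect, different-diagnosis effect).
    Returns the corner entry of the table: row 1 difference minus row 2 difference."""
    same1, diff1 = first_row
    same2, diff2 = second_row
    return (same1 - diff1) - (same2 - diff2)

# Hypothetical ratings, for illustration only.
example_score = expectancy_effect([2, 3, 4], [-1, 0, 1])

# Table 24-5: schizophrenic vs. paranoid subjects.
psychotic = difference_of_differences((2.53, -0.59), (-0.65, 1.35))
# Table 24-6: neurotic vs. character disorder subjects.
nonpsychotic = difference_of_differences((3.67, 0.07), (0.53, 1.81))
print(round(psychotic, 2), round(nonpsychotic, 2))
```

The two corner values come out as 5.12 and 4.88, the starred entries of the tables.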
Expectancy Effects in Everyday Life

The concept of expectancy has been of central importance for many psychological theorists (e.g., Allport, 1950; Kelly, 1955; Rotter, 1954; Tolman, 1932), and Goldstein (1962) has reviewed the role of expectancy as a construct of interest to psychologists. Expectancy as a determinant of behavior has most often been investigated with an eye to learning the extent to which an individual’s expectancy might determine his own subsequent behavior. The construct of expectancy as employed in this book has been more specifically interpersonal. The question, for us, has concerned the extent to which one person’s expectancy of another’s behavior might serve as determinant of that other’s behavior.

In everyday life people do have expectations of how others will behave. These expectations usually are based on prior experience, direct or indirect, with those other people’s behavior. Scientist and layman seem agreed that predictions of future behavior are best based on past behavior. If this assumption were untenable there would be no behavioral sciences. If past behavior were unrelated to future behavior, then there could only be the humanist’s interest to prompt us to study behavior, not the scientist’s. But if expectations are only based upon history how do they influence future events?
It is unpleasant to have one’s expectations disconfirmed, though that is not always the case. An unexpected inheritance need not lead to negative feelings. But often it is more pleasant to have one’s expectations confirmed than disconfirmed. The evidence for this comes from experiments in which the expectancy is of an event that will befall the ‘‘expecter’’ (Aronson & Carlsmith, 1962; Carlsmith & Aronson, 1963; Festinger, 1957; Harvey & Clapp, 1965; Sampson & Sibley, 1965). The ‘‘expecter’’ seems to behave in such a way as to confirm his expectancy about what will befall him or how he will act (Aronson, Carlsmith, & Darley, 1963). It seems to be not too great an extension to think that if one’s expectancy is not of one’s own behavior but of another’s, one will also behave in such a way as to influence that other to behave in an expected way. Whatever its basis, whether to achieve greater cognitive order, stability, predictability, or to maintain cognitive consonance (Festinger, 1957), there appears to be a motive to fulfill one’s interpersonal expectancy. Interpersonal expectancies in everyday life are likely to be accurate predictors of another’s behavior for two reasons. The first reason is that expectancies are often realistic and veridical—i.e., based on prior experience with the other’s behavior. The second reason is that, other things being equal, we may behave in such a way as to bring about the accuracy of our interpersonal expectations. If we expect a person to be friendly, it may be ‘‘true’’ because we have experienced him as friendly, or a credible source claims to have experienced him as friendly (Kelley, 1949). In addition to this experience-derived component, however, there is the self-fulfilling prophecy component. Expecting him to be friendly, we may behave in a more friendly fashion and, therefore, evoke a more friendly response. 
The fact that there appear to be two components to the accuracy of interpersonal predictions, hypotheses, or expectations has implications for research methods in expectancy effects. If we simply ascertain people’s expectations of others’ behavior and correlate these with the others’ subsequent behavior, the two components of experiential accuracy and self-fulfilling accuracy will be confounded. If we take the appropriate safeguards, however, we can eliminate the self-fulfilling accuracy component (as in asking people to ‘‘predict’’ behavior that has already occurred). We can also randomize the experiential accuracy component by ‘‘assigning’’ expectancies at random, and that is the strategy adopted in much of the research described in this book. What we do not yet know, and what is worth the learning, is the magnitude of expectancy effects, of the self-fulfilling type, in important everyday social interactions. The experimenter-subject dyad may profitably be viewed as a social influence system different from, but yet similar to, other social influence systems. It seems most fruitful at the present time to emphasize the similarity and to make the working assumption that the principles governing the unintentional influence processes of the experimenter-subject dyad are not different from those governing influence processes of the more commonly investigated type. The experimental approach to the study of unintended social influence process can be extended from the special setting of the scientific experiment to other such special settings as psychotherapy, and to such more general settings as the classroom, industry, and government. Although we might prefer a more experimental demonstration of the phenomenon, a number of research investigators and practicing clinicians have called attention to the process whereby a psychotherapist’s or other healer’s expectation of his patient’s course in treatment may be communicated to his patient with subsequent effects on
the course of the treatment. Goldstein (1962) has given us an excellent picture of what is now known and how much there is yet to be learned in this regard. In the field of vocational rehabilitation, the staff of the Human Interaction Research Institute concluded that the expectancies of the staff seemed to lead to commensurate client performance. They evaluated a project that attempted to demonstrate that mentally retarded young men could learn to be gainfully employed. ‘‘The staff found that when they expected him to assume some personal responsibility, he was able to do so’’ (Coffey, Dorcus, Glaser, Greening, Marks, & Sarason, 1964, p. 11). The effect of one person’s expectancy on another’s behavior has important implications for public policy. In their famous comparison of racially integrated vs. segregated housing patterns, Deutsch and Collins (1952) discussed the social standards of interracial behavior: ‘‘. . . people tend to behave as they are expected to behave. The expectations of others in a social situation, particularly if these others are important to the individual, help to define what is the appropriate behavior’’ (p. 588). This does not surprise social scientists. But what might surprise us is the degree to which the arbitrary definition of ‘‘appropriate behavior’’ can be implicit yet clearly discernible in social interaction by the person who would have these definitions serve as a guide to behavior. In the educational system a child’s reputation precedes him through the succession of classrooms leading from his first school day to his eventual graduation. We need to learn the extent to which that reputation itself serves as the definition for the child of how he should behave in school. If a bright child earns a reputation as bright and then performs brightly, we consider that all is going well. But what if a bright child in some not-at-all impossible way earns a reputation as dull? 
Will his teachers’ perception of him and their expectations of his behavior then lead to duller behavior than need be the case? Or if a duller child, reputed to be bright, is treated as bright in the communication system with his teachers, will he then, in fact, tend to become more bright? We shall return to this question shortly. The complexity and subtlety of the communication of one’s expectancy of another’s behavior to that other is well emphasized by reference to that experiment in which expectancy effects were transmitted from the experimenter through his research assistant to the subject (Rosenthal, Persinger, Vikan-Kline, & Mulry, 1963). It appeared from that experiment that in the two-person interaction between subject and research assistant there was a nonpresent third party, the primary experimenter. This nonpresent other appeared to communicate his expectancy through the research assistant but without having simply made the assistant a passive surrogate for himself. The research assistant, while serving as a ‘‘carrier’’ for the nonpresent influencer, was still able to exert his own influence in an additive manner to the influence of the nonpresent participant. This interpersonal influence, once-removed, is no all-or-none phenomenon. The more a person is able to influence others subtly, the more effectively he is able to make other people carriers of his subtle, unplanned influence. How far this chain of subtle interpersonal influence can extend, complicating itself at each link, is not known, nor is the pattern of interpersonal communication of which the chain is woven. But these unknowns lead to fascinating practical and theoretical questions of the extent of interpersonal influence, once-removed. Some of the more obvious ones include: When the senior psychotherapist or physician believes the more junior healer’s patient to have a good or poor prognosis, is the ‘‘assessment,’’ whether explicit or implicit, really only an assessment? 
Or is it really a prophecy which stands a chance of being self-fulfilled?
When the master teacher or school principal believes a junior teacher’s pupils to be fast learners, or believes a special group of pupils to be slow learners, is this belief (well founded or not, and verbalized or not) likely to accelerate or decelerate these pupils’ educational progress? Similarly, will the expectancies of performance, explicit or implicit, of civilian and government employers, military commanders, athletic coaches, and symphony orchestra conductors be transmitted ultimately to the employees, the troops, the athletes, and the musicians with a consequent effect on their performance?
The Last Experiment

This is a book primarily about, and of, research. It seems appropriate, therefore, to end with the description of one more experiment. Several times now there has been mention of the possibility that teachers’ expectancies of their pupils’ ability might in fact be a partial determinant of those pupils’ ability. The experiment to be described was conducted with Lenore Jacobson. The procedure was basically the same as in the experiments on the effects of the experimenter’s expectancy. All of the children in an elementary school were administered a nonverbal group test of intelligence which was disguised as a test that would predict academic “blooming.” There were 18 classes, 3 at each of all 6 grade levels. Within each of the grade levels one of the classes was of above average ability children; a second class was of average ability children, and a third class was of below average ability children. A table of random numbers was employed to assign about 20 percent of the children in each of the 18 classes to the experimental condition. Each teacher was given the names of these children “who would show unusual academic development” during the coming school year. That was the experimental manipulation. At the end of the school year the children were retested with the same group intelligence test.

The analysis of the data is not complete, but some of the results can be given. Table 24-7 shows the excess of IQ points gained in each class by the children whose teachers expected such gains when compared to the control subjects. A plus sign preceding an entry means that the children who were expected to show more gain of IQ points did show more gain. For the 18 classes combined, those children whose teachers expected them to gain in performance showed a significantly greater gain in IQ than did the control children, though the mean relative gain in IQ was small. Teachers’ expectancies, it turned out, made little difference in the upper grades.
But at the lower levels the effects were dramatic. First graders purported to be bloomers gained 15.4 IQ points more than did the control children, and the mean relative gain in one classroom was 24.8 points. In the second grade, the relative gain was 9.5 IQ points, with one of the classes showing a mean gain of 18.2 points. These effects were especially surprising in view of the large gains in IQ made by the control group, which had to be surpassed by the experimental groups. Thus first graders in the control group gained 12.0 IQ points (compared to the 27.4 points gained by the experimentals) and second graders gained 7.0 IQ points (compared to the 16.5 points gained by the experimentals), gains somewhat larger than might simply be ascribed to practice effects. It is possible that the entire school was affected to some degree by being involved in an experiment with consequent good effects on the children’s performance.

Table 24–7 Teacher Expectancy Effects: Gain in IQ of Experimental over Control Groups (After Eight Months)

                    Initial ability level
Grades              Higher       Average    Lower       Weighted means
1                   +11.2        +9.6       +24.8**     +15.4***
2                   +18.2***     −2.9       +6.1        +9.5*
3                   −4.3         +9.1       −6.3        0.0
4                   0.0          +0.2       +9.0        +3.4
5                   −0.5         †          +1.2        0.0
6                   −1.3         +1.2       −0.5        −0.7
Weighted means      +3.6         +4.6       +2.8        +3.8*

* p < .02 one tail. ** p < .006 one tail. *** p < .002 one tail.
† Part of the posttest was inadvertently not administered in this class.

It was somewhat reassuring to find that the gains made by the experimental group children were not made at the expense of the control group children. In fact, the greater the gain made by the experimental group children, the greater the gain made by the control group children in the same class. The rank correlation between gains made by the experimental and control children for the 17 classes that could be compared was +.57 (p = .02, two-tail).

The teachers had themselves administered the group IQ posttests, so that the question arose whether the gain in IQ of the experimental group might be due to differential behavior of the teacher toward these children during the testing. Three of the classes were retested by a school administrator not attached to the particular school employed. She did not know which of the children were in the experimental conditions. On the average the results of her testing showed somewhat greater effects of the teachers’ expectancies. In the class in which the experimental group children had earned a 25 IQ point gain in excess of the control group children’s gain, the experimental group children showed an additional 8 IQ point gain when retested by the “blind” examiner. It seems unlikely, then, that the IQ gains are attributable only to an examiner effect of the teacher.

That teacher expectancy effects should be more pronounced at the lower grades makes good sense. In the lower grades the children have not yet acquired those reputations that become so difficult to change in the later grades and which give teachers in subsequent grades the expectancies for the pupil’s performance. With every successive grade it would be more difficult to change the child’s reputation. The magnitude of expectancy effect showed a fairly regular decline from first to sixth grade (rho = +.83, p = .05, two-tail).
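The assignment step of this experiment, in which about 20 percent of the children in each class were picked at random, can be sketched as follows. This is only an illustration: the original study used a table of random numbers, whereas the sketch substitutes a seeded pseudo-random generator, and the function name, roster format, and pupil labels are all hypothetical.

```python
import random

def assign_expectancies(class_rosters, fraction=0.20, seed=0):
    """For each class, pick roughly `fraction` of the pupils at random to be
    named to the teacher as likely 'bloomers'; everyone else is a control."""
    rng = random.Random(seed)  # seeded so the assignment can be reproduced
    assignment = {}
    for class_id, pupils in class_rosters.items():
        k = round(fraction * len(pupils))
        assignment[class_id] = set(rng.sample(pupils, k))
    return assignment

# Hypothetical rosters: 18 classes of 20 pupils each.
rosters = {c: [f"pupil_{c}_{i}" for i in range(20)] for c in range(18)}
chosen = assign_expectancies(rosters, seed=1)
print(len(chosen), len(chosen[0]))
```

Because the selection is made within each class, every class contributes both experimental and control children, which is what allows the class-by-class comparisons of Table 24-7.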
There are important substantive implications for educational practice in the results of this experiment. In addition there are important methodological implications for the design of experiments in education which seek to establish the success of various new educational practices. Such implications will be discussed elsewhere in
detail, but for now we can simply call attention to the need for expectancy control groups. If experimenters can, and if teachers can, then probably healers, parents, spouses, and other ‘‘ordinary’’ people also can affect the behavior of those with whom they interact by virtue of their expectations of what that behavior will be. Of course, we must now try to learn how such communication of expectancies takes place. Considering the difficulties we have had in trying to answer that same question for the case of experimenters, whose inputs into the experimenter-subject interaction could be so relatively easily controlled and observed, we should not expect a quick or an easy solution. But there may be some consolation drawn from the conviction that, at least, the problem is worth the effort.
Interpersonal Expectancy Effects: A Follow-up
Consistency Over Time

In 1969, just three years after the original appearance of the present volume, a provisional summary of the literature on interpersonal expectations was published which listed and surveyed 105 studies of interpersonal expectations (Rosenthal, 1969). Table 1 shows an overall comparison of the results of these early 105 studies with the results of 206 subsequent studies. The first two columns show the number of studies conducted within each of eight areas of research as these areas were defined by the earlier review (Rosenthal, 1969). An analysis of the proportion of all studies conducted in each time period falling into each research area showed that, overall, there was a large shift in the areas receiving research attention (χ² = 42.9, df = 7, p < .001). Most of this shift was due to changes in two of the eight research areas. Studies of person perception decreased dramatically from 53% of all studies conducted up to 1969 to only 27% of all studies conducted since 1969. Studies of everyday life situations, including studies of teacher expectations, increased dramatically from 10% of all studies conducted up to 1969 to 41% of all studies conducted since 1969.

The third and fourth columns of Table 1 show the proportion of studies reaching the .05 level of significance for each of the eight research areas. All of the research areas both before and after 1969 show very substantially higher proportions of significant results than the proportion of .05 that would be expected by chance. Considering the research areas separately, none of them show a significant (p = .10, two-tail) change in the proportion of results reaching significance before 1969 as compared to after 1969. For both the older and the newer studies, about one-third reach the .05 level, about seven times more than we would expect if there were in fact no significant relationship between experimenters’ or teachers’ expectations and their subjects’ or pupils’ subsequent behavior. Results so striking could essentially never occur if there were really no such relationship (χ² = 585, z = 24.2).

Table 1 Overall Comparison of Results of Studies Before and After 1969

                              Number of Studies         Proportion Reaching p ≤ .05
Research Area                 To 1969    Since 1969     To 1969    Since 1969
Reaction Time                 3          3              .33        .33
Inkblot Tests                 4          5              .75        .20
Animal Learning               9          5              .89        .40
Laboratory Interviews         6          16             .33        .38
Psychophysical Judgments      9          14             .33        .50
Learning and Ability          9          24             .22        .29
Person Perception             57         57             .25        .30
Everyday Situations           11         85             .36        .38
Median                        9          15             .33        .36
Total                         108a       209a           .35        .35

a Three of these entries are nonindependent, i.e., they occur in more than one area.
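The shift in research attention across the eight areas can be checked directly from the study counts in Table 1. The sketch below assumes that the reported statistic is an ordinary chi-square test of independence on the 8 × 2 table of counts; the function name is illustrative and not taken from the source.

```python
def chi_square_independence(rows):
    """Pearson chi-square for a table given as a list of (before, after) counts."""
    col_totals = [sum(row[j] for row in rows) for j in range(2)]
    grand_total = sum(col_totals)
    chi2 = 0.0
    for row in rows:
        row_total = sum(row)
        for j, observed in enumerate(row):
            expected = row_total * col_totals[j] / grand_total
            chi2 += (observed - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df

# Number of studies to 1969 and since 1969, in the row order of Table 1.
counts = [(3, 3), (4, 5), (9, 5), (6, 16), (9, 14), (9, 24), (57, 57), (11, 85)]
chi2, df = chi_square_independence(counts)
print(round(chi2, 1), df)
```

With these counts the statistic comes to about 42.9 on 7 degrees of freedom, and most of it is contributed by the person perception and everyday situations rows, in line with the text.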
A Brief Overview

Table 2 summarizes the results of all the studies, published and unpublished, that I was able to find up to the time of this writing. The first column gives the total number of studies that fall into each of the eight research areas that were defined for convenience in tabulating results (Rosenthal, 1969). Adding over all eight areas yields a grand total of 317 studies. Six of those studies, however, were not independent but involved dependent variables that fell into more than one research area. The total net number of independent studies, therefore, was 311. The second column of Table 2 shows the approximate number of degrees of freedom upon which the average study in that area was based. These data are included to give some feel for the typical size of the studies conducted in each area. The range of these means over the eight research areas was from 21 to 124 with a median of 48.5. All the values are reasonably homogeneous except for the mean of 124 df for the reaction time studies. That appears to be significantly larger than the remaining seven values and can be classified a statistical outlier at p < .05 (Snedecor and Cochran, 1967, p. 323). Examination of the six studies of reaction time suggested no hypothesis as to why these studies should be so substantially larger than those of other areas of research.

Table 2 Expectancy Effects in Eight Research Areas

                                                          Proportion     Estimated      Estimated z
                              Number       Estimated      of Studies     Mean Effect    (Standard Normal Deviate)
Research Area                 of Studies   Mean df for t  Reaching       Size (σ)       Random      Truncated
                                                          p ≤ .05
Reaction Time                 6            124            .33            0.23           +3.11       +2.62
Inkblot Tests                 9            28             .44            0.84           +4.23       +4.05
Animal Learning               14           21             .64            1.78           +6.96       +7.56
Laboratory Interviews         22           52             .36            0.27           +6.21       +5.80
Psychophysical Judgments      23           34             .43            1.31           +9.02       +6.61
Learning and Ability          33           52             .27            0.72           +3.82       +4.60
Person Perception             114          54             .27            0.51           +3.82       +6.55
Everyday Situations           96           45             .38            1.44           +5.92       +11.70
Median                        22.5a        48.5           .37b           0.78c          +5.08       +6.18

a Six entries occur in more than one area.
b These proportions do not differ significantly from each other, χ² = 9.7, df = 7, p > .20.
c The correlation between columns 3 and 4 was .778, p = .012.
The third column of Table 2 shows the proportion of studies reaching the .05 level of significance in the predicted direction. The range of these proportions was from .27 to .64 with a median proportion of .37. These eight proportions did not differ significantly from each other (χ² = 9.7, df = 7, p > .20). Thus, as we learned earlier from Table 1, about one-third of the studies investigating the effects of interpersonal expectations show such effects to occur at the .05 level of significance and one-third is a reasonable estimate regardless of the particular area of inquiry.
Effect Size

In the earlier follow-up to the present book (Rosenthal, 1969) over a hundred studies were listed; some of them were described in some detail, and it was shown that the number of significant results occurring ruled out the possibility that sampling fluctuations or capitalization on chance could account for the large number of significant results obtained. Some effort was also made in that earlier follow-up to give estimates of the size of the effects of interpersonal expectations. It appears to me now, however, that the issue of effect size was not well handled. Since the preparation of the earlier overview Cohen’s (1969) superb book Statistical Power Analysis for the Behavioral Sciences was published, and its treatment of effect sizes in relation to power considerations made a lasting impression. In the present follow-up, therefore, an effort was made to provide estimates of effect size more useful than those provided in the earlier review.

The primary index of effect size employed in the present study is the statistic “d” defined as the difference between the means of the two groups being compared, divided by the standard deviation common to the two populations (Cohen, 1969, p. 18). The great advantage of this index is that it permits us to compare the magnitudes of effects for a large variety of measures. It frees us from the particular scale of measurement and allows us to speak of effects measured in standard deviation (σ) units. There are many different measures of effect size that could have been employed in the present follow-up, each with its special advantages and disadvantages. The measure “d” was chosen for its simplicity and because such a large proportion of the studies of interpersonal expectancy effects involve simply a comparison of an experimental with a control group by means of a t test (or F with df = 1 for the numerator), and d is particularly useful for that situation both conceptually and computationally. (For a recent example of the extensive use of “d” as an index of effect size in the behavioral sciences see Rosenthal and Rosnow, 1975.)

Ideally it would have been best to go back to the over 300 studies of interpersonal expectations and to compute for each one the effect size in σ units. For the present follow-up, which is to some degree provisional, it was not possible to be exhaustive. Instead, a doubly stratified random sample (with planned oversampling) of 75 studies was chosen to permit the estimation of effect sizes. The first stratification was on the dimension of research area. For the two areas with fewer than 10 studies, reaction time and inkblot tests, all studies were included. For the remaining six areas, ten studies were included for each area. Thus, areas with fewer studies were oversampled in comparison to areas with more studies. In the area of animal learning, for example, 71% of the studies were included while in the area of person perception only 9% of the studies were included.
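The index d defined in this section can be sketched for two small groups of scores. The sketch assumes that the common population standard deviation is estimated by the pooled within-group standard deviation, which is one conventional choice and is not spelled out in the text; the function name and the scores are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(experimental, control):
    """Difference between group means in pooled standard deviation units."""
    n1, n2 = len(experimental), len(control)
    # Pooled variance: within-group variances weighted by their degrees of freedom.
    pooled_var = ((n1 - 1) * variance(experimental)
                  + (n2 - 1) * variance(control)) / (n1 + n2 - 2)
    return (mean(experimental) - mean(control)) / sqrt(pooled_var)

# Hypothetical scores, for illustration only.
d = cohens_d([2.0, 4.0, 6.0], [1.0, 3.0, 5.0])
print(d)
```

Here the means differ by 1 and the pooled standard deviation is 2, so d is 0.5: the experimental group lies half a standard deviation above the control group, whatever the original scale of measurement.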
The second stratification was on the statistical significance of the primary result of the studies in each area. Arbitrarily, the five most significant studies were included for each area, and five studies were selected at random from the remaining studies in each area. The latter five studies, of course, were weighted in proportion to the size of the population of available studies so that there would be no bias favoring studies of greater statistical significance. An example will be useful. There were 33 studies of the effects of experimenter expectations on the learning and ability scores of their subjects. The mean effect size of the five most significant studies was 1.25. The mean effect size of the five studies randomly selected from the remaining 28 studies was 0.63. The estimated effect size for all 33 studies was 0.72, a value much closer to that of the random five studies’ mean than to that of the high significance five studies’ mean. The means are weighted by 5 and N − 5, respectively, so that the overall estimated effect size is given by $[5\bar{X}_{\text{Top}} + (N - 5)\bar{X}_{\text{Random}}]/N$, where N is the total number of studies conducted in that area.

The fourth column of Table 2 shows these estimated effect sizes for each of the eight research areas. The range is from 0.23 for studies of reaction time to 1.78 for studies of animal learning, with a median effect size of 0.78. In Cohen’s terminology, then, these effect sizes range from small (.20) through medium (.50) to large (.80) and, for three of the research areas, to very large (Cohen, 1969, p. 38). It is interesting to note that there was a large and significant correlation of .78 between the estimated effect size and the proportion of studies reaching significance in the various areas of research.
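The weighting just described can be verified with the learning and ability example. The function name below is illustrative; the numbers are those given in the text.

```python
def estimated_mean_effect(top_mean, random_mean, n_total, n_top=5):
    """Weight the mean of the `n_top` most significant studies by n_top and the
    mean of the randomly sampled studies by the remaining n_total - n_top."""
    return (n_top * top_mean + (n_total - n_top) * random_mean) / n_total

# Learning and ability: 33 studies, top-five mean 1.25, random-five mean 0.63.
estimate = estimated_mean_effect(1.25, 0.63, 33)
print(round(estimate, 2))
```

The result rounds to 0.72, the value entered for learning and ability in Table 2. The random five stand in for 28 of the 33 studies, which is why the estimate lies so much closer to 0.63 than to 1.25.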
The fifth column of Table 2 shows the standard normal deviate (z) associated with the combined results of each of the eight areas of research according to the method of Mosteller and Bush (1954) and as employed in the earlier follow-up (Rosenthal, 1969). For each of the studies sampled, the obtained level of significance was converted to its associated algebraic standard normal deviate with a positive sign indicating that the result was in the direction of the hypothesis of interpersonal expectancy effects. The combined z was then computed according to the formula: $[5\bar{z}_{\text{Top}} + (N - 5)\bar{z}_{\text{Random}}]/\sqrt{N}$. It is clear from column five of Table 2 that all areas of research showed overall significant effects of interpersonal expectancy.

The final column of Table 2 shows the standard normal deviates of the combined results based not on sampling the studies of each research area but on the direct computation of the standard normal deviate for all studies in each area. In order to be consistent with the procedure of the earlier review (Rosenthal, 1969), however, any z falling between −1.27 and +1.27 was entered as zero, a procedure which tends to lead to combined results that are too conservative. The results shown in this column also show significant overall effects of interpersonal expectancies in all research areas.

In order to get a better understanding of the probable ranges of effect sizes for the various areas of research, confidence intervals were computed and these are shown in columns five and six of Table 3. For each area of research the 95% confidence interval suggests the likely range of the effect size for that area. If we claim that the effect size falls within the range given we will be correct 95% of the time. The confidence intervals are wide because their computation was based on such small samples of studies (i.e., 6, 9, or 10).
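The two ways of combining standard normal deviates described above can be sketched as follows. The function names and the input values are hypothetical and chosen only to make the arithmetic easy to follow; the sketch also assumes that “between −1.27 and +1.27” excludes the endpoints.

```python
from math import sqrt

def combined_z_sampled(mean_z_top, mean_z_random, n_total, n_top=5):
    """Estimate the sum of all N deviates from the sampled means
    (top studies weighted n_top, random studies weighted N - n_top),
    then divide by sqrt(N)."""
    estimated_sum = n_top * mean_z_top + (n_total - n_top) * mean_z_random
    return estimated_sum / sqrt(n_total)

def combined_z_truncated(z_values, cutoff=1.27):
    """Direct combination over all studies, entering any z strictly
    inside (-cutoff, +cutoff) as zero before summing."""
    kept = [z if abs(z) >= cutoff else 0.0 for z in z_values]
    return sum(kept) / sqrt(len(z_values))

# Hypothetical inputs, for illustration only.
print(combined_z_sampled(2.0, 0.5, 25))
print(combined_z_truncated([0.5, -1.0, 2.0, 3.0]))
```

In the second call the two small deviates are entered as zero. One of them was positive and one negative, so whether truncation raises or lowers the combined z depends on the mix of small results that are set aside.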
When each of the eight confidence intervals is compared with all other confidence intervals, we find that only two of the 28 comparisons show non-overlapping confidence intervals. The reaction time confidence interval is lower than the confidence intervals for animal learning and psychophysical judgments. Studies of reaction time appear to have a particularly narrow confidence interval but this result could have occurred by chance. The standard error of the mean effect size for the reaction time research is not a significant outlier, nor is the standard error of the mean effect size for any other research area. When we consider the total set of 75 studies sampled, we find the 95% confidence interval to lie between an effect size of 0.62 and 1.22, corresponding to effect magnitudes ranging from medium to very large in Cohen’s (1969) terminology.

The first three columns of Table 3 show the various ingredients required for the computation of the confidence interval, the number of studies sampled, the estimated standard deviation, and the standard error of the mean. The last column of Table 3 reports the correlations obtained within each research area between the effect size measured in σ units and the degree of statistical significance measured in standard normal deviates. These correlations were overwhelmingly positive ranging from +.38 to +.87 with a median correlation of +.74. Such high correlations are what we would expect as long as the sample sizes employed within the various research areas are relatively homogeneous.

Table 3 Confidence Intervals for Mean Effect Sizes in Eight Research Areas

                           Number of                Standard    Mean      95% Confidence      Correlation Between
                           Studies      Estimated   Error of    Effect    Interval            Effect Size and Level
Research Area              Sampled      S           the Mean    Size      From      To        of Significance (z)
Reaction Time              6            0.11        .04         0.23      +0.13     0.33      .82
Inkblot Tests              9            1.17        .39         0.84      −0.06     1.74      .85
Animal Learning            10           1.69        .53         1.78      +0.58     2.98      .69
Laboratory Interviews      10           0.94        .30         0.27      −0.41     0.95      .87
Psychophysical Judgments   10           1.15        .36         1.31      +0.50     2.12      .38
Learning and Ability       10           1.49        .47         0.72      −0.34     1.78      .60
Person Perception          10           0.87        .28         0.51      −0.12     1.14      .79
Everyday Situations        10           2.74        .87         1.44      −0.53     3.41      .51
Median                     10           1.16        .38         0.78      −0.09     1.76      .74
Total                      75           1.33        .15         0.92      +0.62     1.22      .68
Alternative Indices

So far we have reported effect sizes only in σ units. In earlier reviews, however, effect sizes were reported in terms of the percentage of experimenters or teachers who obtained responses from their subjects or students in the direction of their expectations (Rosenthal, 1969, 1971). If there were no effect of interpersonal expectations we would expect about half the experimenters or teachers to obtain results in the direction of their expectation while the remaining half obtained results in the opposite direction. The results of the earlier reviews, based on over 60 studies, suggested that about two-thirds of the experimenters and teachers, the “expecters,” obtained results in the predicted direction. For purposes of comparison with those earlier analyses, Table 4 was prepared.

The first column shows that for 87 studies of experimenter expectations about two-thirds of the experimenters obtained results in the direction of their expectation. The second column of Table 4 shows that the results for studies of teacher expectations were about the same, with the obtained percentages of biased experimenters or teachers corresponding to an effect size of about one standard deviation (Cohen, 1969; Friedman, 1968). Although these estimates are based on 117 studies, we should not have any greater confidence in them than in the estimates based on the 75 studies that were sampled more randomly and with stratification. The reason for extra caution in the case of the percentages of biased “expecters” is that these studies were not chosen at random but rather because sufficient data were available in these studies to permit the convenient calculation of the percentage of biased “expecters.” We cannot tell how these studies might differ from the remaining studies. However, the results obtained from these 117 studies were very much in line with the results obtained from the more systematically sampled set of 75 studies. Several of the research areas showed larger average effect sizes, and several showed smaller average effect sizes than those obtained from the potentially less representative 117 studies, and these latter results fall well within the 95% confidence interval of the mean effect size based on the more systematically sampled studies. There was, of course, considerable overlap between these two sets of studies.

Table 4 Percentages of Experimenters, Teachers, Subjects, and Pupils Showing Expectancy Effects

                                                  “Expecter”                   “Expectee”
                                                  Experimenters   Teachers     Subjects   Pupils
Number of Studies                                 87              30           52         13
Median z (approximation)                          1.25            1.32         1.28       1.97
Number of Es, Ts, Ss, or Ps                       909             340          2,748      515
Mean N per Study                                  10              11           53         40
Weighted Percent of Biased Es, Ts, Ss, or Ps      66%             69%          60%        63%
Median Percent of Biased Es, Ts, Ss, or Ps        69%             70%          64%        65%
Approximate Effect Size in σ Units of
  Median Percent of Biased Persons a              1.01            1.06         0.71       0.75

a See Table 1 of Friedman, 1968.
The third and fourth columns of Table 4 report the analogous data from the point of view of the subjects of biased experimenters and the pupils of biased teachers. Once again we expect that if no expectancy effects are operating, half the subjects or pupils will respond in the direction of their ‘‘expecter’s’’ induced expectation while half will respond in the opposite direction. For both subjects and pupils just under two-thirds show the predicted expectancy bias, a rate of bias equivalent to approximately three-quarters of a standard deviation. There may be a somewhat greater degree of sampling bias in the studies of ‘‘expectees’’ than in the studies of ‘‘expecters’’ simply because far fewer studies reported results in sufficient detail to permit an analysis of ‘‘expectee’’ bias rates. Still, the results of these studies are very
consistent with the results of the studies sampled more systematically, falling well within the 95% confidence interval of the mean effect size based on the more systematically sampled studies.
Expectancy Control Groups

Chapter 23 dealt in detail with the utilization of expectancy control groups which permit the comparison of the effect size of the variable of interpersonal expectancy with the effect size of some other variable of psychological interest which is not regarded as an “artifact” variable. Chapter 23 was exclusively theoretical in the sense that there were no studies available that had employed the suggested paradigm. That situation has changed, and there are now a number of studies available that permit a direct comparison of the effects of experimenter expectancy with such other psychological effects as brain lesions, preparatory effort, and persuasive communications.

The first of these was conducted by Burnham (1966). He had 23 experimenters each run one rat in a T-maze discrimination problem. About half the rats had been lesioned by removal of portions of the brain, and the remaining animals had received only sham surgery which involved cutting through the skull but no damage to brain tissue. The purpose of the study was explained to the experimenters as an attempt to learn the effects of lesions on discrimination learning. Expectancies were manipulated by labeling each rat as lesioned or nonlesioned. Some of the really lesioned rats were labeled accurately as lesioned but some were falsely labeled as unlesioned. Some of the really unlesioned rats were labeled accurately as unlesioned but some were falsely labeled as lesioned. Table 5 shows the standard scores of the ranks of performance in each of the four conditions. A higher score indicates superior performance. Animals that had been lesioned did not perform as well as those that had not been lesioned, and animals that were believed to be lesioned did not perform as well as those that were believed to be unlesioned.
Table 5 Discrimination Learning as a Function of Brain Lesions and Experimenter Expectancy: After Burnham

                                    Experimenter Expectancy             Statistics of Difference (rows)
Actual Brain State                  Lesioned    Unlesioned    Σ         t        p       Effect Size
Lesioned                            46.5        49.0          95.5
Unlesioned                          48.2        58.3          106.5     1.62     .06     0.71
Σ                                   94.7        107.3
Statistics of Difference (columns)
  t                                 2.19
  p                                 .02
  Effect Size                       0.95

What makes this experiment of special interest is that the effects of experimenter expectancy were actually larger than those of actual removal of brain tissue although this difference was not significant. Ten major types of outcomes of expectancy-controlled experiments were outlined in Chapter 23, and Burnham’s result fits most closely that outcome labeled as Case 3 (p. 382). If an investigator interested in the effects of brain lesions on discrimination learning had employed only the two most commonly employed conditions, he could have been seriously misled by his results. Had he employed experimenters who believed the rats to be lesioned to run his lesioned rats and compared their results to those obtained by experimenters running unlesioned rats and believing them to be unlesioned, he would have greatly overestimated the effects on discrimination learning of brain lesions.

For the investigator interested in assessing for his own specific area of research the likelihood and magnitude of expectancy effects, there appears to be no fully adequate substitute for the employment of expectancy control groups. For the investigator interested only in the reduction of expectancy effects, other techniques such as blind or minimized experimenter–subject contact or automated experimentation are among the techniques that may prove to be useful (see Chapters 19–22).

The first of the experiments to compare directly the effects of experimenter expectancy with some other experimental variable employed animal subjects. The next such experiment to be described employed human subjects. Cooper, Eisenberg, Robert, and Dohrenwend (1967) wanted to compare the effects of experimenter expectancy with the effects of effortful preparation for an examination on the degree of belief that the examination would actually take place. Each of ten experimenters contacted ten subjects; half of the subjects were required to memorize a list of 16 symbols and definitions that were claimed to be essential to the taking of a test that had a 50-50 chance of being given, while the remaining subjects, the “low effort” group, were asked only to look over the list of symbols. Half of the experimenters were led to expect that “high effort” subjects would be more certain of actually having to take the test, while half of the experimenters were led to expect that “low effort” subjects would be more certain of actually having to take the test. Table 6 gives the subjects’ ratings of their degree of certainty of having to take the test.
Table 6 Certainty of Having to Take a Test as a Function of Preparatory Effort and Experimenter Expectancy: After Cooper et al.

                                    Experimenter Expectancy        Statistics of Difference (rows)
Effort Level                        High       Low       Σ         t        p       Effect Size
High                                +.64       −.40      +.24
Low                                 +.56       −.52      +.04      0.33     .37     0.07
Σ                                   +1.20      −.92
Statistics of Difference (columns)
  t                                 3.48
  p                                 .0004
  Effect Size                       0.71

There was a very slight tendency for subjects who had exerted greater effort to believe more strongly that they would be taking the test. Surprising in its relative magnitude was the finding that experimenters expecting to obtain responses of greater certainty obtained such responses to a much greater degree than did experimenters expecting responses of lesser certainty. The size of the expectancy effect was ten times greater than the size of the effort effect. In the terms of the discussion of expectancy control groups, these results fit well the so-called case 7 (p. 384). Had this experiment been conducted employing only the two most commonly encountered conditions, the investigators would have been even more seriously misled than would have been the case in the earlier mentioned study of the effects of brain lesions on discrimination learning. If experimenters, while contacting high effort subjects, expected them to show greater certainty, and if experimenters, while contacting low effort subjects, expected them to show less certainty, the experimental hypothesis might quite artifactually have appeared to have earned strong support. The difference between these groups might have been ascribed to effort effects while actually the difference seems due almost entirely to the effects of the experimenter’s expectancy.

As part of a very large research undertaking involving 780 subjects, Miller (1970) conducted three sub-studies that permitted the comparison of the effects of persuasive communications (pro vs con) with the effects of experimenters’ expectations. Table 7 gives the results of the comparisons.

Table 7 Results of Three Comparisons of the Effects of Persuasive Communications with the Effects of Experimenters’ Expectancy: After Miller

                                             Statistics of       Pro vs Con        Experimenter
Study                                        Difference          Communication     Expectancy
I (df = 76)                                  t                   5.30              1.58
                                             p                   .0001             .06
                                             Effect Size (σ)     1.22              0.36
II (df = 76)                                 t                   1.97              3.56
                                             p                   .03               .0002
                                             Effect Size (σ)     0.45              0.82
III (df = 76)                                t                   4.04              2.69
                                             p                   .0001             .004
                                             Effect Size (σ)     0.93              0.62
Mean (Miller only)                           t                   3.77              2.61
                                             p                   .0001             .005
                                             Effect Size (σ)     0.86              0.60
Median (Miller, Burnham, Cooper, et al.)     t                   1.97              2.69
                                             p                   .03               .004
                                             Effect Size (σ)     0.71              0.71

In two of the three analyses the effects of pro vs con persuasive communications were greater than the effects of the experimenters’ expectancies, and the average effect size for persuasive communications was somewhat larger than the average effect size of experimenters’ expectations (.86 vs .60). When we consider all five analyses together, those of Burnham and Cooper et al. as well as those of Miller, we find that the median size of the effects of experimenter expectations was just as large as the median size of the effects of the psychological variables against which expectancy effects had been pitted, .71 in both cases. Five studies are not very many upon which to base any but the most tentative conclusions. Nevertheless, it does seem that it can no longer be said without considerable new evidence that the effects of interpersonal expectations, while “real,” are trivial in relation to “real” psychological variables.
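The summary figures quoted in this comparison can be rechecked from the effect sizes given in Tables 5, 6, and 7. The following is a minimal sketch; the variable and function names are illustrative assumptions, and the inputs are the rounded table entries.

```python
from statistics import mean, median

# Each pair is (effect size of the other variable, effect size of experimenter expectancy).
miller = [(1.22, 0.36), (0.45, 0.82), (0.93, 0.62)]  # Table 7, Studies I to III
burnham = (0.71, 0.95)                               # Table 5: brain lesion vs expectancy
cooper = (0.07, 0.71)                                # Table 6: effort vs expectancy


def summarize(pairs, stat):
    """Apply stat separately to the other-variable and expectancy effect sizes."""
    others = [other for other, _ in pairs]
    expectancy = [exp for _, exp in pairs]
    return stat(others), stat(expectancy)


miller_means = summarize(miller, mean)
all_medians = summarize(miller + [burnham, cooper], median)
print(miller_means)
print(all_medians)
```

Because the inputs are already rounded to two decimals, the Miller mean for persuasive communications comes out a shade above the .86 reported in the text; the medians over all five analyses are .71 for both variables.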
An Analysis of Doctoral Dissertations

In our overview of research on interpersonal expectations the results of 311 independent studies were summarized. These studies were all the ones I could locate employing the usual formal and informal bibliographic search procedures. Psychological Abstracts, Dissertation Abstracts International, programs of national and regional psychological, sociological, and educational conventions, various computer assisted searches, and word of mouth were all employed to maximize the chances of finding all studies of interpersonal expectancy effects.

Nevertheless, it was possible that many studies could not be retrieved because they were regarded by their authors as uninteresting or counter-intuitive, or overly complex, or whatever. Such studies may have shown preponderantly negative results. Could it be that the studies that were retrievable represented roughly the 5% of the results that by chance might have been significant at the 5% level, while the studies that were not retrievable represented roughly the 95% of the studies that showed no effect (see page 553)? That seems unlikely. Since 109 studies were found showing expectancy effects at p ≤ .05 and 202 studies were found not showing expectancy effects at p ≤ .05 it would mean that if 1,549 studies had been conducted but not reported, or at least not found by the present search, and if all 1,549 showed no significant effects, there would still be an overall significant effect of interpersonal expectancy.

To make the point a bit more strongly we take into account the actual significance levels of the 311 studies collected, rather than just whether they did or did not reach the .05 level. The sum of the standard normal deviates associated with the significance levels of the 311 studies was about +367. Adding 49,457 new studies with a mean standard normal deviate of zero would lower the overall combined standard normal deviate to +1.645 (p = .05). It seems unlikely that there are file drawers crammed with the unpublished results of nearly 50,000 studies of interpersonal expectations! Sampling bias, then, cannot reasonably explain the overall significant results of studies of interpersonal expectancies. Nevertheless the studies that could be found might still differ in various ways from the studies that could not be found.
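The arithmetic behind the figure of nearly 50,000 unretrieved studies can be sketched as follows. The sketch assumes the combined z is the sum of the individual z values divided by the square root of the number of studies, which is the Mosteller and Bush method referred to earlier in the chapter; the function names are illustrative.

```python
import math


def combined_z(sum_z, n_studies):
    """Combined standard normal deviate: sum of z values over sqrt(number of studies)."""
    return sum_z / math.sqrt(n_studies)


def zero_effect_studies_needed(sum_z, n_found, z_crit=1.645):
    """Number of additional studies averaging z = 0 that lowers the combined z to z_crit."""
    total_needed = (sum_z / z_crit) ** 2
    return total_needed - n_found


sum_z, n_found = 367.0, 311          # values quoted in the text ("about +367")
extra = zero_effect_studies_needed(sum_z, n_found)
check = combined_z(sum_z, n_found + extra)
print(round(extra), round(check, 3))
```

With the rounded sum of +367 the result is within a handful of studies of the 49,457 quoted in the text; the small gap presumably reflects rounding of the sum of z values.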
It would be quite useful to examine any subset of studies for which we could be more sure of having found all the research performed. Such a situation exists to some extent in the case of doctoral dissertations. If the dissertation is accepted by the university where it is conducted it can be well-retrieved through Dissertation Abstracts International (DAI). Dissertations not accepted because the results are “nonsignificant” (see page 591) or in a direction displeasing to one or more members of the student’s committee, will not of course be retrievable. Even if we could get a very large proportion of all dissertation research results through DAI we could not assume an unbiased sample of research studies. Sampling bias might be reduced, but other biases might be operating. Dissertation researchers may be less experienced investigators, less prestigeful in the eyes of their research subjects, and less competent in the conduct of their research. All of these factors have been implicated as variables moderating the effects of interpersonal expectancies.

Despite these difficulties it was felt to be worthwhile to compare the results obtained in the dissertation vs non-dissertation research included in our stratified random sample of 75 studies. Table 8 shows the results of this comparison. The first two columns show the number of studies in each research area that were dissertations or non-dissertations. Just over one-third (35%) of all the studies were doctoral dissertations, and the proportion of dissertations did not vary significantly from research area to research area (χ² = 6.56, df = 7, p ≅ .50). The third and fourth columns of Table 8 show the mean effect sizes in σ units of the dissertations and non-dissertations of each research area, and the fifth column shows the weighted mean effect size for all studies in that research area. The effect sizes of columns 3 and 4 are too large, however, because of the oversampling of studies showing more significant results.
Table 8 Mean Effect Sizes Estimated Separately for Dissertations and Non-Dissertations

                            Number of Studies     Uncorrected Effect Size   Mean Effect   Corrected Effect Size
                            Diss.    Non-Diss.    Diss.     Non-Diss.       Size          Diss.    Non-Diss.    Difference
Research Area               (WD)a    (WO)b        (ED)c     (EO)d           (ET)e         (X)f     (Y)g         (Y − X)
Reaction Time               3        3            0.21      0.25            0.23          0.21     0.25         +0.04
Inkblot Tests               3        6            0.53      1.00            0.84          0.53     1.00         +0.47
Animal Learning             1        9            0.44      2.22            1.78          0.38     1.94         +1.56
Laboratory Interviews       4        6            0.74      0.33            0.27          0.40     0.18         −0.22
Psychophysical Judgments    2        8            1.86      1.56            1.31          1.50     1.26         −0.24
Learning and Ability        3        7            0.44      1.16            0.72          0.34     0.89         +0.55
Person Perception           5        5            2.74      3.96            0.51          0.42     0.60         +0.18
Everyday Situations         5        5            1.38      2.42            1.44          1.05     1.83         +0.78
Median                      3        6            0.64      1.36            0.78          0.41     0.94         +0.32
Total                       26       49           1.20      1.66            0.92          0.74     1.02         +0.28

a Number of dissertations in sample.
b Number of non-dissertations in sample.
c Estimated effect size based on dissertations only, uncorrected for oversampling of studies with large effect sizes.
d Estimated effect size based on non-dissertations only, uncorrected for oversampling of studies with large effect sizes.
e Estimated total effect size based on stratified random sampling.
f Estimated effect size based on dissertations only, corrected for oversampling of studies with large effect sizes: $X = \dfrac{E_D (W_D + W_O) E_T}{E_D W_D + E_O W_O}$
g $Y = (E_O / E_D)\,X$

The sixth column of Table 8, therefore, shows the estimated mean effect size of the doctoral dissertations after correction for the oversampling of more statistically significant outcomes. The corrected effect size (X) was computed according to the formula:

$X = \dfrac{E_D (W_D + W_O) E_T}{E_D W_D + E_O W_O}$

The seventh column of Table 8 shows the estimated mean effect size of the non-dissertations after correction for the oversampling of more significant outcomes. The corrected effect size (Y) was computed according to the formula:

$Y = (E_O / E_D)\,X$

with X defined as above. The final column of Table 8 shows that over the eight research areas the differences in effect sizes between dissertations and non-dissertations ranged from about a quarter of a σ favoring the dissertations to about one-and-a-half σ favoring the non-dissertations with a median difference favoring the non-dissertations by about one-third of a σ unit. The differences in effect sizes between dissertations and non-dissertations are not significant statistically, and median or mean effect sizes for either dissertations or non-dissertations fall very comfortably within the 95% confidence intervals for medians or means of the 75 studies as shown in the bottom two rows of Table 3. The tendency for dissertations to show somewhat smaller effect sizes might be due to a reduction in sampling bias in retrieving dissertations as compared to non-dissertations, or it might be due to the introduction of one or more biases associated with dissertation research e.g., less experienced, less prestigeful, and less skilled investigators. A potentially powerful biasing factor might be introduced by dissertation researchers if they were unusually procedure-conscious in the
conduct of their research. There are indications that such researchers may tend to obtain data that are substantially biased in the direction opposite to their expectations (Rosenthal, 1969, page 234).
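The correction applied in Table 8 can be illustrated with a short Python sketch of the two formulas for X and Y. The function and argument names are assumptions introduced for the example; the inputs are the Animal Learning entries of Table 8.

```python
def corrected_effect_sizes(w_d, w_o, e_d, e_o, e_t):
    """Corrected dissertation (X) and non-dissertation (Y) effect sizes.

    w_d, w_o -- numbers of dissertations and non-dissertations sampled
    e_d, e_o -- uncorrected mean effect sizes of the two groups
    e_t      -- mean effect size estimated from the stratified random sample
    """
    x = e_d * (w_d + w_o) * e_t / (e_d * w_d + e_o * w_o)
    y = (e_o / e_d) * x
    return x, y


# Animal Learning row of Table 8.
x, y = corrected_effect_sizes(w_d=1, w_o=9, e_d=0.44, e_o=2.22, e_t=1.78)
weighted_mean = (1 * x + 9 * y) / (1 + 9)
print(round(x, 2), round(y, 2), round(weighted_mean, 2))
```

It follows algebraically from the two formulas that the correction keeps the ratio of the two uncorrected effect sizes while rescaling them so that their weighted mean equals the stratified estimate; the last printed value checks this for the example row.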
Controls for Cheating and Recording Errors

Elsewhere it has been shown that although the occurrence of cheating or recording errors on the part of experimenters and teachers cannot be definitively ruled out, the occurrence of such intentional or unintentional errors cannot reasonably account for the overall obtained effects of interpersonal expectations (Rosenthal, 1969, pages 245–249). Indeed, experiments were described that showed major effects of interpersonal expectations despite the impossibility of the occurrence of either cheating or recording errors. More recently in two ingenious experiments, Johnson and Adair (1970; 1972) were able to assess the relative magnitudes of intentional and/or recording errors. In both experiments the overall effects of interpersonal expectations were modest (.30 and .33) and cheating or recording errors accounted for 30% of these effects (.09 and .10, respectively). Thus, even where cheating and/or recording errors can and do occur, they cannot reasonably be invoked as an “explanation” of the effects of interpersonal expectations.

In the process of reviewing the procedures employed in the 311 studies under review here, it was possible to identify a subset of 36 studies that employed special methods for the elimination or control of cheating or observer errors or permitted an assessment of the possibility of intentional or unintentional errors. These methods included employing tape recorded instructions, data recording by blind observers, and videotaping of the interaction between the subject and the data-collector. The results of these 36 studies that had employed such safeguards were of special interest. If cheating and recording errors really played a major role in “explaining” interpersonal expectancy effects, then we would expect that studies guarding against such errors would show no effects of interpersonal expectation or at least would show only a very diminished likelihood of obtaining such effects.
Table 9 shows the proportion of these special 36 studies reaching various levels of significance in the unpredicted and predicted directions compared to the analogous proportion of the remaining 275 studies. The results are unequivocal. The more carefully controlled studies are more likely (p = .01) rather than less likely to show significant effects of interpersonal expectations than the studies permitting at least the possibility of cheating and/or recording errors. The mean standard normal deviate for the specially controlled studies was +1.72 while that for the remaining studies was +1.11. Just why these specially controlled studies should be more likely than the remaining studies to yield significant effects is not immediately obvious. The median sample size employed in these studies was about the same as the median sample size employed in all 311 studies. Perhaps those investigators careful enough to institute special safeguards against cheating and/or observer errors are also careful enough to reduce nonsystematic errors to a minimum thereby increasing the precision and power of their experiments.

Table 9 Effects of Special Controls Against Cheating on the Proportion of Studies Reaching Given Levels of Significance

                                            Expected      Special Controls   Other Studies   Total
                        z                   Proportion    (N = 36)           (N = 275)       (N = 311)
Unpredicted Direction   −3.09               .001          .00                .00             .00
                        −2.33               .01           .00                .01             .01
                        −1.65               .05           .06                .04             .04
Not Significant         −1.64 to +1.64      .90           .39a               .64b            .61c
Predicted Direction     +1.65               .05           .56d               .32d            .35
                        +2.33               .01           .33                .17             .19
                        +3.09               .001          .19                .11             .12
                        +3.72               .0001         .17                .05             .07
                        +4.27               .00001        .11                .04             .05
                        +4.75               .000001       .06                .03             .04
                        +5.20               .0000001      .06                .02             .02

a Mean z = +1.72.
b Mean z = +1.11.
c Mean z = +1.18.
d χ² that these proportions differ = 6.54, p = .01.

A subgroup of the 36 specially controlled studies was of special interest; that subgroup was the set of 18 that were also doctoral dissertations. Examination of the effect sizes of these studies might permit a reasonable estimate of the effect sizes obtained in studies that were both error-controlled and less susceptible to sampling bias since it does appear that dissertations are more retrievable than non-dissertations. Other biases might, of course, be introduced such as the possible lower levels of experience and prestige of dissertation researchers. Still, examination of the subgroup of specially controlled dissertations at least focuses on careful dissertation researchers or on dissertation researchers whose committee members are careful. Table 10 lists the 18 studies of this subgroup along with the effect size obtained in each. The mean and median effect sizes of these specially controlled dissertations are slightly larger than those found for the 26 dissertations examined in Table 8 (that set of dissertations includes some of the 18 dissertations of Table 10). The 95% confidence interval around the mean effect size runs from +0.26 to +1.30 or from small to very large indeed. This confidence interval includes completely the confidence interval estimated for all 311 studies and based upon the 75 studies of Table 3. Whereas 35% of all 311 studies were significant at the .05 level in the predicted direction, 56% of these specially controlled dissertations were significant at the .05 level. This was not due to any tendency for these studies to employ larger sample sizes. The median df for all 311 studies was 48 and the median df for these specially controlled dissertations was also 48. The mean standard normal deviate for these 18 studies was +1.86 (median = +1.84).
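The mean, median, and 95% confidence interval reported for the 18 specially controlled dissertations of Table 10 can be reproduced with the sketch below. Two assumptions are made and should be read as such: the first two effect sizes (Anderson, 1971 I and II) are entered as negative, since that reading reproduces the reported mean, and the interval is taken to be a t-based interval with 17 degrees of freedom, using the tabled critical value 2.110.

```python
import math
from statistics import mean, median, stdev

# Effect sizes (sigma units) from Table 10; signs of the first two are an assumption.
effect_sizes = [
    -0.43, -0.20, 1.89, 0.53, 1.55, 0.81, 0.44, 4.08,  # Everyday Situations
    0.55, 0.21, 0.15, 1.16,                            # Person Perception
    0.19, 0.28, 1.74, 0.04,                            # Learning and Ability
    0.19,                                              # Laboratory Interviews
    0.90,                                              # Inkblot Tests
]
T_CRIT_DF17 = 2.110  # two-tailed .05 critical value of t for 17 df


def confidence_interval(values, t_crit):
    """t-based confidence interval for the mean of values."""
    m = mean(values)
    se = stdev(values) / math.sqrt(len(values))  # standard error of the mean
    return m - t_crit * se, m + t_crit * se


low, high = confidence_interval(effect_sizes, T_CRIT_DF17)
print(round(mean(effect_sizes), 2), round(median(effect_sizes), 3))
print(round(low, 2), round(high, 2))
```

The width of the interval is driven largely by the single very large effect size of +4.08, which inflates the estimated standard deviation of this small sample.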
Table 10 Effect Sizes of Doctoral Dissertations Employing Special Controls for Cheating and Observer Errors

Area                       Study                        Effect Size (σ)
Everyday Situations        Anderson, 1971 I             −0.43
                           Anderson, 1971 II            −0.20
                           Beez, 1970                   +1.89
                           Carter, 1969                 +0.53
                           Keshock, 1970                +1.55
                           Maxwell, 1970                +0.81
                           Seaver, 1971                 +0.44
                           Wellons, 1973                +4.08
Person Perception          Blake (and Heslin) 1971      +0.55
                           Hawthorne, 1972              +0.21
                           Mayo, 1972                   +0.15
                           Todd, 1971                   +1.16
Learning and Ability       Johnson, 1970 I              +0.19
                           Johnson, 1970 II             +0.28
                           Page, 1970                   +1.74
                           Yarom, 1971                  +0.04
Laboratory Interviews      Gravitz, 1969                +0.19
Inkblot Tests              Marwit, 1968                 +0.90

Median                                                  +0.48
Mean                                                    +0.78
95% Confidence Interval                                 +0.26 to +1.30

Correcting Errors of Data Analysis

Although there is reason to believe that this sample of 18 specially controlled doctoral dissertations reflected the work of unusually careful researchers, it must be noted that errors of data analysis occurred with some frequency in this special
sample as they did in the remainder of the 311 studies we have surveyed. Sometimes these errors were trivial and sometimes they were large. Sometimes the errors were such that expectancy effects were claimed to be significant when they were not, and sometimes the errors were such that expectancy effects were claimed to be not significant when they were very significant. Such was the case in the otherwise excellent experiment by Keshock (1970) listed in Table 10. The pupils were 48 Black inner city boys aged 7 to 11 and in grades 2 to 5. Within each grade level half the children were reported to their teachers as showing an ability level one σ greater than their actual scores. For control group children the actual scores were reported to the teachers. There were three dependent variables: intelligence, achievement, and motivation. The data analysis for intelligence and for motivation employed the appropriate blocking on grade level and showed no effects of teacher expectations on intelligence but a very large effect (+1.55σ) on motivation. However, in the analysis of the achievement data, no blocking was employed despite a correlation (eta) between grade level and total achievement of .86. In short, the massive effects of grade level were inadvertently pooled into the within condition error term instead of being removed from the error term by blocking. Accordingly, the effects of teacher expectations were claimed to be non-significant. Fortunately, Keshock wisely provided the raw gain scores for all children for the achievement variables so that a desk calculator re-analysis was a simple matter. The components of the total achievement gain scores were a reading gain score (grand mean = +3.1, S = 7.0) and an arithmetic gain score (grand mean = +2.1, S = 6.8); these components were substantially correlated, r = +.59, a correlation higher than that often found between subtests of ability tests (e.g., r = .43 for TOGA; Rosenthal and Jacobson, 1968, page 68). Table 11 shows the results of the reanalysis.

Table 11 Excess of Gains in Achievement of Experimental Over Control Group Pupils Due to Favorable Teacher Expectations: After Keshock

Grade     Reading     Arithmetic     Total     z         Effect Size (σ)
2         9.99        11.67          21.66     +5.20     +3.85
3         6.33        6.84           13.17     +3.72     +2.34
4         1.16        1.50           2.66      +0.81     +0.47
5         4.67        3.66           8.33      +2.45     +1.48
Mean      5.54        5.92           11.46     +5.70     +2.04

Gains in performance were substantially greater for the children whose teachers had been led to expect greater gains in performance. The sizes of the effects varied across the four grades from nearly half a σ unit to nearly four σ units. For all subjects combined, the effect size was over two standard deviations. The entry for this study in Table 10 shows the median effect size obtained for the three dependent variables.

Interestingly, in another carefully conducted doctoral dissertation carried out by a colleague of Keshock’s at about the same time, at the same university, under the same committee members in part, significant effects of teacher expectations on intelligence (Binet) were obtained although effects on achievement were not found to be significant (Maxwell, 1970). An enlightening footnote on the sociology of science is provided by the fact that a faculty member serving on both doctoral committees, subsequently published a study of her own reporting no significant expectancy effects. In her report neither of her own students’ doctoral dissertations was cited although other research reporting no significant expectancy effects was cited, including an article published in the year following the completion of both dissertations.
The External Validity of Interpersonal Expectancy Effects

The book for which this is the epilogue ended with the description of an experiment designed to extend the external validity of the construct of the interpersonal self-fulfilling prophecy to everyday life situations. This experiment came to be known as the Pygmalion Experiment and it was reported in detail by Rosenthal and Jacobson (1968). Although a wealth of subsequent research has considerably weakened the impact of criticism of the Pygmalion research, it should be noted here for the sake of completeness that such criticism was forthcoming, with vigor, from several educational psychologists. Before proceeding we must examine at least the more famous of these criticisms. In his well-known article in the Harvard Educational Review, Jensen (1969) makes three criticisms. The first of these is that the child had been employed as the unit of analysis rather than the classroom, and that if the classroom had been employed the analysis would have
Interpersonal Expectancy Effects: A Follow-up
yielded only negligible results. That was an unfortunate criticism for several reasons. First, because analyses by classrooms had not only been performed but quite clearly reported (page 95), and second, because for total IQ the significance level changed only trivially in going from a “per child” to a “per classroom” analysis, specifically from a probability of 2% to a probability of 3%! For reasoning IQ, incidentally, the per classroom analysis led to even more significant results than had the per child analysis. That fact, however, also printed on page 95, was not mentioned by Jensen.

Jensen’s second criticism was that the same IQ test had been employed both for the pretest and for the post-test and that practice effects were thus maximized. Regrettably, Jensen did not state how the results of a randomized experiment could be biased by practice effects. If practice effects were so great as to drive everyone’s performance up to the upper limit, or ceiling, of the test, then practice effects could operate to diminish the effects of the experimental manipulation, but they could not operate to increase the effects of the experimental manipulation.

Jensen’s third criticism was that the teachers themselves administered the group tests of IQ. This criticism was unfortunate for two reasons. First, because Jensen neglected to mention to his readers what Rosenthal and Jacobson had mentioned to theirs, namely, that when the children were retested by testers who knew nothing of the experimental plan, the effects of teacher expectations actually increased, rather than decreased. Second, Jensen implied that teacher-administered tests are unreliable compared to individually administered tests of intelligence (which is true) and that, therefore, the excess in IQ gain of the experimental over the control group children might be due to test undependability (which is not true).
In fact, decreased test reliability makes it harder, not easier, to obtain significant differences between experimental and control groups when such differences are real. In short, Jensen’s “criticism” would have served to account for the failure to obtain differences between the experimental and control group children if no differences had been found. In no way can such a criticism be used to explain away an obtained difference no matter how uncongenial to one’s own theoretical position.

Another critique of the Pygmalion experiment was published by R. L. Thorndike (1968), but since that review has been answered point for point elsewhere (Rosenthal, 1969a), we can summarize it here quite briefly. The general point was that the IQ of the youngest children was badly measured by the test employed and, therefore, that any inference based on such measurement must be invalid. The facts, however, are these: (a) The validity coefficient of the reasoning IQ subtest regarded as worthless by Thorndike was in fact .65, a value higher than that often advanced in support of the validity of IQ tests. The calculation of validity reported here was based on data readily available in the report of Pygmalion, and the calculation could have been made by any interested reader. (b) Even if the IQ measure had been seriously unreliable, Thorndike failed to show how unreliability could have led to spuriously significant results. As we saw earlier in the discussion of Jensen’s critique, unreliability could make it harder to show significant differences between the experimental and control groups but it could not make it easier, as Thorndike erroneously implied. (c) Even if the reasoning IQ data for the youngest children had been omitted from the analysis, there would still have been a significant effect of teacher expectations for the remaining classrooms as measured by reasoning IQ (p = .001).
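The argument that unreliability can only shrink a real group difference, never manufacture one, can be illustrated with a small sketch. It assumes the classical test theory model (observed score = true score + independent random error), which is an assumption brought in here for illustration and is not a calculation from the Pygmalion data. Under that model a true standardized difference d is expected to shrink to d times the square root of the reliability. The function names and the numbers used are hypothetical.

```python
import math
import random
import statistics


def attenuated_effect(d_true, reliability):
    """Expected standardized difference when the measure has the given reliability."""
    if not 0.0 < reliability <= 1.0:
        raise ValueError("reliability must be in (0, 1]")
    return d_true * math.sqrt(reliability)


def simulate_observed_effect(d_true, reliability, n_per_group=20000, seed=0):
    """Simulate two randomized groups measured with error; return the observed d."""
    rng = random.Random(seed)
    # True scores have SD 1; the error SD is chosen so that
    # var(true) / var(observed) equals the requested reliability.
    error_sd = math.sqrt((1.0 - reliability) / reliability)
    control = [rng.gauss(0.0, 1.0) + rng.gauss(0.0, error_sd)
               for _ in range(n_per_group)]
    treated = [rng.gauss(d_true, 1.0) + rng.gauss(0.0, error_sd)
               for _ in range(n_per_group)]
    pooled_sd = math.sqrt(
        (statistics.variance(control) + statistics.variance(treated)) / 2.0
    )
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd


# A true difference of half a standard deviation, measured with reliability .50.
print(attenuated_effect(0.5, 0.5))
print(simulate_observed_effect(0.5, 0.5))
```

In this sketch the random error is added identically to both randomized groups, so it inflates the spread of scores without shifting either mean; the observed difference in standard deviation units can therefore only be pulled toward zero.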
By far the most ambitious critique of Pygmalion was undertaken in a long-term study and re-analysis of the basic data by Elashoff and Snow (1970, 1971). That
critique was actually published as a book and has been answered by Rosenthal and Rubin (1971).1 The gist of the critique and of the reply can be given briefly. Elashoff and Snow transformed the data of Pygmalion in various ways some of which were very seriously biased. Yet despite the use of eight transformations not employed by Rosenthal and Jacobson, not a single transformation gave a noticeably different result from any reported by Rosenthal and Jacobson. Thus, for total IQ every transformation employed by Elashoff and Snow for all children of the experiment gave a significant result when a significant result had been claimed by Rosenthal and Jacobson, and no transformation gave a significant result when no significant result had been claimed by Rosenthal and Jacobson (Rosenthal and Rubin, 1971, page 142). When verbal and reasoning IQ were also considered, the various transformations employed by Elashoff and Snow in fact turned up more significant effects of teacher expectations than had been claimed by Rosenthal and Jacobson. Table 12 compares the excess of gain in IQ by experimental over control group children as defined by Rosenthal and Jacobson and as defined by the median of the nine measures analyzed by Elashoff and Snow. The comparisons are based on data provided by Elashoff and Snow’s Tables 20, 21, and 22 (1971). The very high degree of agreement between the original measure employed and the median of the transformation measures is evident in the table by inspection and is supported by the 0.95 Pearson product moment correlation and the 0.93 Spearman rank correlation between the original and the transformed measures. For all the effort expended, the re-analyses by Elashoff and Snow changed nothing as Table 12 shows. Many studies of teacher expectation effects have been conducted since the Pygmalion Experiment (Rosenthal, 1973). 
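The Pearson correlation quoted for Table 12 can be recomputed from the table's nine grade-by-measure entries (three grade groups for each of Total, Verbal, and Reasoning IQ). The sketch below uses the values exactly as they are printed in this copy of the table; the list names RJ and ES and the helper pearson_r are illustrative. Only the product moment correlation is recomputed here.

```python
import math

# Expectancy advantage scores as printed in Table 12, in the order
# Total IQ (grades 1&2, 3&4, 5&6), Verbal IQ (same grades), Reasoning IQ (same grades).
RJ = [11.0, 1.8, 0.2, 10.1, 4.6, 2.0, 12.6, 8.7, 4.8]  # Rosenthal and Jacobson
ES = [10.8, 1.8, 0.1, 8.7, 5.6, 1.0, 12.6, 8.5, 0.5]   # median of Elashoff and Snow's scores


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a sequence with zero variance has no correlation")
    return sxy / math.sqrt(sxx * syy)


print(round(pearson_r(RJ, ES), 2))
```

With the entries as printed, the result rounds to the .95 reported in the text.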
Table 12 Comparison of Expectancy Advantage Scores in Total, Verbal, and Reasoning IQ Employed in Rosenthal and Jacobson vs. the Median of Nine Scores Employed in Elashoff and Snow

            Total IQ            Verbal IQ           Reasoning IQ
Grades      RJ        ES        RJ        ES        RJ        ES
1&2         11.0*     10.8*     10.1*     8.7*      12.6      12.6
3&4          1.8       1.8       4.6      5.6        8.7       8.5*
5&6           .2        .1       2.0      1.0        4.8        .5
Total        3.8*      †         2.1      †          7.1**     †

* Two-tailed p ≤ .05
** Two-tailed p ≤ .01
† Not reported by ES.

1 The reply by Rosenthal and Rubin was written in response to the Elashoff and Snow monograph dated 1970. The preparation of the various drafts of that monograph occupied several years but Rosenthal and Rubin were asked to prepare their reply in two weeks. In addition, Rosenthal and Rubin were shown only the 1970 version of the monograph and were not permitted to respond to the 1971 version which included the deletion of some information particularly damaging to the Elashoff and Snow position (e.g., in Tables 23 and 24 of their widely circulated 1970 monograph).

However, the bulk of the 311 studies we have surveyed in this epilogue have been studies of interpersonal expectation effects in laboratory situations rather than in such everyday situations as schools, clinics, or industries. We can best examine the external validity, or generality, of the interpersonal expectancy effect by comparing the results of those studies conducted in laboratories with those studies conducted in more “real-life” situations. Such comparisons have been made implicitly in Tables 1, 2, 3, 4, 8, and 10 but now we address the question explicitly. Table 13 compares the results of studies of interpersonal expectancy effects in laboratory situations with everyday situations, using data drawn from earlier tables. The two kinds of studies are similar in average size of study and in the percentages of experimenters or teachers (and subjects or pupils) showing the biasing effects of interpersonal expectations. Effect sizes as measured in σ units tend to be larger, on the average, for everyday situations than in laboratory situations, but they are also substantially more variable so that the effect size expected for any single study can be less accurately predicted.

Table 13 Comparison of Studies of Interpersonal Expectancy Effects in Everyday Situations with Laboratory Situations

                                                 Laboratory    Everyday      Combined
                                                 Situations    Situations    Situations
Mean df                                          52            45            48
% Biased “Expecters”                             69%           70%           69%
% Biased “Expectees”                             64%           65%           64%
Effect Size (σ)
  Sampled Dissertations (n = 26)                 0.40          1.05          0.74
  Specially Controlled Dissertations (n = 18)    0.54          1.08          0.78
  Estimated Total: All Studies (N = 311)         0.72          1.44          0.92
  Estimated S of Effect Size                     1.15          2.74          1.33
  95% Confidence Interval of Total ES
    From:                                        −0.06         −0.53         +0.62
    To:                                          +1.74         +3.41         +1.22

A final comparison is given in Table 14 which gives the proportion of studies reaching various levels of significance in the predicted and unpredicted directions for studies conducted in laboratory vs everyday situations. Results for the two types of studies are in close agreement with studies conducted in everyday situations showing significant results in the predicted direction slightly more often. The results shown in Tables 13 and 14 strongly support the conclusion that interpersonal expectancy effects are at least as likely to occur in everyday life situations as in laboratory situations. That conclusion is based on evidence from so many studies that dozens of additional studies cannot appreciably alter the conclusion without their showing very significant reverse effects of interpersonal expectations. The phenomenon is general across many situations, and it is not necessarily small in magnitude. Often it is very large.

Table 14 Proportion of Studies Reaching Given Levels of Significance

                                                          Type of Study
                                           Expected       Laboratory     Everyday       Total
                         z                 Proportion     Situations     Situations     (N = 311)
                                                          (N = 215)      (N = 96)
Unpredicted Direction    −3.09             .001           .00            .00            .00
                         −2.33             .01            .01            .00            .01
                         −1.65             .05            .05            .01            .04
Not Significant          −1.64 to +1.64    .90            .61            .61            .61a
Predicted Direction      +1.65             .05            .34            .38            .35b
                         +2.33             .01            .19            .19            .19
                         +3.09             .001           .11            .14            .12
                         +3.72             .0001          .05            .10            .07
                         +4.27             .00001         .03            .08            .05
                         +4.75             .000001        .03            .05            .04
                         +5.20             .0000001       .02            .03            .02

a Grand mean of all zs = +1.18.
b χ² that this exceeds expected proportion = 585, z = 24.2.
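The note to Table 14 reports χ² = 585 and z = 24.2 for the excess of studies significant in the predicted direction over the 5% expected by chance. The passage does not say how those figures were computed. The sketch below shows one reading that comes close to them: a one-degree-of-freedom chi-square with a continuity correction, applied to an observed count of 109 studies out of 311. Both the count (0.35 × 311 rounded to a whole number) and the continuity correction are assumptions made for this illustration, and the function names are hypothetical.

```python
import math


def upper_tail_p(z):
    """One-tailed probability that a standard normal deviate exceeds z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def chi_square_vs_expected(n_studies, n_hits, expected_p, continuity=True):
    """1-df chi-square comparing an observed count with the count expected by chance.

    Returns (chi_square, z), where z is the square root of chi-square.
    """
    expected_hits = n_studies * expected_p
    expected_misses = n_studies - expected_hits
    diff = abs(n_hits - expected_hits)
    if continuity:
        # Continuity correction: an assumption, not stated in the source.
        diff = max(diff - 0.5, 0.0)
    chi2 = diff ** 2 / expected_hits + diff ** 2 / expected_misses
    return chi2, math.sqrt(chi2)


N_STUDIES = 311
# Assumption: a proportion of .35 of 311 studies is read as 109 studies.
N_SIGNIFICANT = round(0.35 * N_STUDIES)

chi2, z = chi_square_vs_expected(N_STUDIES, N_SIGNIFICANT, 0.05)
print(round(chi2), round(z, 1))

# The "Expected Proportion" column follows from the normal curve, e.g. for z = 1.65:
print(round(upper_tail_p(1.65), 2))
```

Without the continuity correction the same count gives a chi-square of about 591, so the corrected version is the closer match to the printed value; that agreement is suggestive rather than proof of the method actually used.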
Future Research

What, then, is there left for us to find out? Almost everything of consequence. What are the factors increasing or decreasing the effects of interpersonal expectations, i.e., what are the moderating variables? What are the variables serving to mediate the effects of interpersonal self-fulfilling prophecies? Only some bare beginnings have been made to address these questions. The role of moderating variables has been considered elsewhere and is being actively surveyed at the present time (Rosenthal, 1969). The variables serving to mediate the effects of interpersonal expectancies have also been considered elsewhere, and for the teacher-pupil interaction a four factor “theory” has been proposed (Rosenthal, 1969; 1973). This “theory” suggests that teachers, counselors, and supervisors who have been led to expect superior performance from some of their pupils, clients, or trainees, appear to treat these “special” persons differently than they treat the remaining not-so-special persons in roughly four ways:

Climate. Teachers appear to create a warmer socio-emotional climate for their “special” students.
Feedback. Teachers appear to give to their “special” students more differentiated feedback as to how these students have been performing.
Input. Teachers appear to teach more material and more difficult material to their “special” students.
Output. Teachers appear to give their “special” students greater opportunities for responding.
Work on this four-factor theory is currently in progress. Much of the research on interpersonal expectancies has suggested that mediation of these expectancies depends to some important degree on various processes of nonverbal communication (Rosenthal, 1969; 1973). Moreover, there appear to be important differences among experimenters, teachers, and people generally in the clarity of their communication through different channels of nonverbal communication. In addition, there appear to be important differences among research subjects, pupils, and people generally, in their sensitivity to nonverbal communications transmitted through different nonverbal channels. If we knew a great deal more about differential sending and receiving abilities we might be in a much better position to address the general question of what kind of person (in terms of sending abilities) can most effectively influence covertly what kind of other person (in terms of receiving abilities). Thus, for example, if those teachers who best communicate their expectations for children’s intellectual performance in the auditory channel were assigned children whose best channels of
reception were also auditory, we would predict greater effects of teacher expectation than we would if those same teachers were assigned children less sensitive to auditory channel nonverbal communication. Ultimately, then, what we would want would be a series of accurate measurements for each person describing his or her relative ability to send and to receive in each of a variety of channels of nonverbal communication. It seems reasonable to suppose that if we had this information for two or more people we would be better able to predict the outcome of their interaction regardless of whether the focus of the analysis were on the mediation of interpersonal expectations or on some other interpersonal transaction. Our model envisages people moving through their “social spaces” carrying two vectors or profiles of scores. One of these vectors describes the person’s differential clarity in sending messages over various channels of nonverbal communication. The other vector describes the person’s differential sensitivity to messages sent over various channels of nonverbal communication. Diagrammatically for any given dyad:

                                      Channels
                                      1    2    3    4    5    6  ....  K
MATRIX A    Person A Sending
            Person B Receiving

MATRIX B    Person B Sending
            Person A Receiving
Within each of the two matrices the scores on channels of sending of the sender can be correlated with the scores on channels of receiving of the receiver. Given a fixed average performance level for senders and receivers, a higher correlation reflects a greater potential for more accurate communication between the dyad members since the receiver is then better at receiving the channels which are the more accurately encoded channels of the sender. The mean (arithmetic, geometric, or harmonic) of the correlations in Matrix A and Matrix B reflects how well the dyad members “understand” each other’s communications. That mean correlation need not reflect how well the dyad members like each other, however, only that A and B should more quickly understand each other’s intended and unintended messages including how they feel about one another.

As a start toward the goal of more completely specifying accuracy of sending and receiving nonverbal cues in dyadic interaction we have been developing an instrument designed to measure differential sensitivity to various channels of nonverbal communication: The Profile of Nonverbal Sensitivity or PONS (Rosenthal, Archer, DiMatteo, Koivumaki, and Rogers, 1976). It is our hope that research employing the PONS and related measures of skill at sending and receiving messages in various channels of nonverbal communication along with related research will help us to unravel the mystery of the mediation of interpersonal expectancy effects. That is the hope; but the work lies ahead.
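The two-matrix model described above can be sketched in code. In this illustration each person carries a sending profile and a receiving profile over the same K channels; the correlation for Matrix A pairs Person A's sending with Person B's receiving, and the correlation for Matrix B pairs Person B's sending with Person A's receiving. The arithmetic mean is used to combine them, which is one of the three means the text allows. All names and all profile scores below are hypothetical and are not PONS data.

```python
import math


def pearson_r(x, y):
    """Pearson product moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a flat profile has no correlation")
    return sxy / math.sqrt(sxx * syy)


def dyad_accuracy(a_send, a_receive, b_send, b_receive):
    """Return (r for Matrix A, r for Matrix B, arithmetic mean of the two)."""
    r_matrix_a = pearson_r(a_send, b_receive)  # A sends, B receives
    r_matrix_b = pearson_r(b_send, a_receive)  # B sends, A receives
    return r_matrix_a, r_matrix_b, (r_matrix_a + r_matrix_b) / 2.0


# Hypothetical clarity and sensitivity scores over K = 6 channels.
teacher_send = [0.9, 0.7, 0.5, 0.4, 0.3, 0.2]
teacher_receive = [0.6, 0.6, 0.5, 0.7, 0.4, 0.3]
pupil_send = [0.5, 0.6, 0.4, 0.8, 0.3, 0.2]
pupil_receive = [0.8, 0.6, 0.5, 0.4, 0.2, 0.1]

r_a, r_b, r_mean = dyad_accuracy(teacher_send, teacher_receive,
                                 pupil_send, pupil_receive)
print(round(r_a, 2), round(r_b, 2), round(r_mean, 2))
```

Because a correlation ignores the overall level of each profile, this index captures only the match between the sender's clearer channels and the receiver's more sensitive ones, in line with the text's qualification about a fixed average performance level.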
References

Anderson, D. F. Mediation of teachers’ expectancy with normal and retarded children. Unpublished doctoral dissertation, Harvard University, 1971.
Beez, W. V. Influence of biased psychological reports on “teacher” behavior and pupil performance. Unpublished doctoral dissertation, Indiana University, 1970.
Blake, B. F., & Heslin, R. Evaluation apprehension and subject bias in experiments. Journal of Experimental Research in Personality, 1971, 5, 57–63.
Burnham, J. R. Experimenter bias and lesion labeling. Unpublished manuscript, Purdue University, 1966.
Carter, R. M. Locus of control and teacher expectancy as related to achievement of young school children. Unpublished doctoral dissertation, Indiana University, 1969.
Cohen, J. Statistical power analysis for the behavioral sciences. New York: Academic Press, 1969.
Cooper, J., Eisenberg, L., Robert, J., & Dohrenwend, B. S. The effect of experimenter expectancy and preparatory effort on belief in the probable occurrence of future events. Journal of Social Psychology, 1967, 71, 221–226.
Elashoff, J. D., & Snow, R. E. A case study in statistical inference: Reconsideration of the Rosenthal-Jacobson data on teacher expectancy. Technical Report No. 15, Stanford Center for Research and Development in Teaching, School of Education, Stanford University, December, 1970.
Elashoff, J. D., & Snow, R. E. (Eds.) Pygmalion reconsidered. Worthington, Ohio: Charles A. Jones, 1971.
Friedman, H. Magnitude of experimental effect and a table for its rapid estimation. Psychological Bulletin, 1968, 70, 245–251.
Gravitz, H. L. Examiner expectancy effects in psychological assessment: The Bender Visual Motor Gestalt Test. Unpublished doctoral dissertation, University of Tennessee, 1969.
Hawthorne, J. W. The influence of the set and dependence of the data collector on the experimenter bias effect. Unpublished doctoral dissertation, Duke University, 1972.
Jensen, A. R. How much can we boost IQ and scholastic achievement? Harvard Educational Review, 1969, 39, 1–123.
Johnson, R. W. Inducement of expectancy and set of subjects as determinants of subjects’ responses in experimenter expectancy research. Unpublished doctoral dissertation, University of Manitoba, 1970.
Johnson, R. W., & Adair, J. G. The effects of systematic recording error vs. experimenter bias on latency of word association. Journal of Experimental Research in Personality, 1970, 4, 270–275.
Johnson, R. W., & Adair, J. G. Experimenter expectancy vs. systematic recording error under automated and nonautomated stimulus presentation. Journal of Experimental Research in Personality, 1972, 6, 88–94.
Keshock, J. D. An investigation of the effects of the expectancy phenomenon upon the intelligence, achievement and motivation of inner-city elementary school children. Unpublished doctoral dissertation, Case Western Reserve University, 1970.
Marwit, S. J. An investigation of the communication of tester-bias by means of modeling. Unpublished doctoral dissertation, State University of New York at Buffalo, 1968.
Maxwell, M. L. A study of the effects of teacher expectation on the I.Q. and academic performance of children. Unpublished doctoral dissertation, Case Western Reserve University, 1970.
Mayo, C. C. External conditions affecting experimental bias. Unpublished doctoral dissertation, University of Houston, 1972.
Miller, K. A. A study of “experimenter bias” and “subject awareness” as demand characteristic artifacts in attitude change experiments. Unpublished doctoral dissertation, Bowling Green State University, 1970.
Mosteller, F., & Bush, R. R. Selected quantitative techniques. In G. Lindzey (Ed.), Handbook of social psychology, Vol. I. Cambridge, Mass.: Addison-Wesley, 1954. pp. 289–334.
Page, J. S. Experimenter-subject interaction in the verbal conditioning experiment. Unpublished doctoral dissertation, University of Toronto, 1970.
Rosenthal, R. Interpersonal expectations: Effects of the experimenter’s hypothesis. In R. Rosenthal & R. L. Rosnow (Eds.), Artifact in behavioral research. New York: Academic Press, 1969. pp. 181–277.
Rosenthal, R. Empirical vs. decreed validation of clocks and tests. American Educational Research Journal, 1969, 6, 689–691.
Rosenthal, R. Teacher expectations and their effects upon children. In G. S. Lesser (Ed.), Psychology and educational practice. Glenview, Ill.: Scott, Foresman, 1971. pp. 64–87.
Rosenthal, R. On the social psychology of the self-fulfilling prophecy: Further evidence for Pygmalion effects and their mediating mechanisms. New York: MSS Modular Publication, Module 53, 1973.
Rosenthal, R., Archer, D., DiMatteo, M. R., Koivumaki, J. H., & Rogers, P. L. Measuring sensitivity to nonverbal communication: The PONS Test. Unpublished manuscript, Harvard University, 1976.
Rosenthal, R., & Jacobson, L. Pygmalion in the classroom. New York: Holt, Rinehart and Winston, 1968.
Rosenthal, R., & Rosnow, R. L. The volunteer subject. New York: Wiley-Interscience, 1975.
Rosenthal, R., & Rubin, D. B. Pygmalion reaffirmed. In J. D. Elashoff & R. E. Snow (Eds.), Pygmalion reconsidered. Worthington, Ohio: Charles A. Jones, 1971. pp. 139–155.
Seaver, Jr., W. B. Effects of naturally induced teacher expectancies on the academic performance of pupils in primary grades. Unpublished doctoral dissertation, Northwestern University, 1971.
Snedecor, G. W., & Cochran, W. G. Statistical methods. (6th ed.) Ames, Iowa: Iowa State University Press, 1967.
Thorndike, R. L. Review of Pygmalion in the classroom. American Educational Research Journal, 1968, 5, 708–711.
Todd, J. L. Social evaluation orientation, task orientation, and deliberate cuing in experimenter bias effect. Unpublished doctoral dissertation, University of California, Los Angeles, 1971.
Wellons, K. W. The expectancy component in mental retardation. Unpublished doctoral dissertation, University of California, Berkeley, 1973.
Yarom, N. Temporal localization and communication of experimenter expectancy effect with 10–11 year old children. Unpublished doctoral dissertation, University of Illinois, 1971.
Book Two – Experimenter Effects in Behavioral Research
References
Aas, A., O’Hara, J. W., & Munger, M. P. The measurement of subjective experiences presumably related to hypnotic susceptibility. Scand. J. Psychol., 1962, 3, 47–64. Allport, F. H. Theories of perception and the concept of structure. New York: Wiley, 1955. Allport, G. W. The role of expectancy. In H. Cantril (Ed.), Tensions that cause wars. Urbana, Ill.: Univer. of Illinois Press, 1950. Pp. 43–78. Allport, G. W., & Vernon, P. E. Studies in expressive movement. New York: Macmillan, 1933. Anderson, Margaret, & White, Rhea. A survey of work on ESP and teacher-pupil attitudes. J. Parapsychol., 1958, 22, 246–268. Aronson, E., & Carlsmith, J. M. Performance expectancy as a determinant of actual performance. J. abnorm. soc. Psychol., 1962, 65, 178–182. Aronson, E., Carlsmith, J. M., & Darley, J. M. The effects of expectancy on volunteering for an unpleasant experience. J. abnorm. soc. Psychol., 1963, 66, 220–224. Asch, S. E. Social psychology. Englewood Cliffs, N.J.: Prentice-Hall, 1952. Azrin, N. H., Holz, W., Ulrich, R., & Goldiamond, I. The control of the content of conversation through reinforcement. J. exp. anal. Behav., 1961, 4, 25–30. Babich, F. R., Jacobson, A. L., Bubash, Suzanne, & Jacobson, Ann. Transfer of a response to naive rats by injection of ribonucleic acid extracted from trained rats. Science, 1965, 149, 656–657. Back, K. W. Influence through social communication. J. abnorm. soc. Psychol., 1951, 46, 9–23. Bakan, D. Learning and the scientific enterprise. Psychol. Rev., 1953, 60, 45–49. Bakan, D. A standpoint for the study of the history of psychology. Paper read at Amer. Psychol. Ass., St. Louis, September, 1962. Bakan, D. The mystery-mastery complex in contemporary psychology. Amer. Psychologist, 1965, 20, 186–191. (a) Bakan, D. The test of significance in psychological research. Unpublished manuscript, Univer. of Chicago, 1965. (b) Bandura, A., Lipsher, D. H., & Miller, Paula E. 
Psychotherapists’ approach-avoidance reactions to patients’ expression of hostility. Amer. Psychologist, 1959, 14, 335. (Abstract) Barber, B. Resistance by scientists to scientific discovery. Science, 1961, 134, 596–602. Barber, T. X., & Calverley, D. S. Effect of E’s tone of voice on ‘‘hypnotic-like’’ suggestibility. Psychol. Rep., 1964, 15, 139–144. (a) Barber, T. X., & Calverley, D. S. Toward a theory of hypnotic behavior; effects on suggestibility of defining the situation as hypnosis and defining response to suggestions as easy. J. abnorm. soc. Psychol., 1964, 68, 585–593. (b) Barnard, P. G. Interaction effects among certain experimenter and subject characteristics on a projective test. Unpublished doctoral dissertation, Univer. of Washington, 1963. Barzun, J., & Graff, H. F. The modern researcher. New York: Harcourt, Brace & World, 1957. Bateson, G., Jackson, D. D., Haley, J., & Weakland, J. H. Toward a theory of schizophrenia, Behav. Sci., 1956, 1, 251–264. Bauch, M. Psychologische Untersuchungen u¨ber Beobachtungsfehler. Fortschr. Psychol., 1913, 1, 169–226.
652
References
653 Bean, W. B. An analysis of subjectivity in the clinical examination in nutrition, J. appl. Physiol., 1948, 1, 458–468. Bean, W. B. Cherry angioma—a digression on the longevity of error. Trans. Assoc. Amer. Physicians, 1953, 66, 240–249. Bean, W. B. A critique of criticism in medicine and the biological sciences in 1958. Perspectives in biol. Med., 1958, 1, 224–232. Bean, W. B. The natural history of error. Trans. Assoc. Amer. Physicians, 1959, 72, 40–55. Beauchamp, K. L., & May, R. B. Replication report: Interpretation of levels of significance by psychological researchers. Psychol. Rep., 1964, 14, 272. Beck, W. S. Modern science and the nature of life. New York: Harcourt, Brace & World, 1957. Beecher, H. K. Measurement of subjective responses: Quantitative effects of drugs. New York: Oxford, 1959. Bellak, L. The concept of projection: an experimental investigation and study of the concept. Psychiat., 1944, 7, 353–370. Benney, M., Riesman, D., & Star, Shirley A. Age and sex in the interview. Amer. J. Sociol., 1956, 62, 143–152. Berg, I. A., & Bass, B. M. (Eds.) Conformity and deviation. New York: Harper & Row, 1961. Berger, D. Examiner influence on the Rorschach. J. clin. Psychol., 1954, 10, 245–248. Berkowitz, H. Effects of prior experimenter–subject relationships on reinforced reaction time of schizophrenics and normals. J. abnorm. soc. Psychol., 1964, 69, 522–530. Berkson, J., Magath, T. B., & Hurn, Margaret. The error of estimate of the blood cell count as made with the hemocytometer. Amer. J. Physiol., 1940, 128, 309–323. Bernstein, A. S. Race and examiner as significant influences on basal skin impedance. J. pers. soc. Psychol., 1965, 1, 346–349. Bernstein, L. A note on Christie’s ‘‘Experimental Naivete´ and Experiential Naivete´.’’ Psychol. Bull., 1952, 49, 38–40. Bernstein, L. The effects of variations in handling upon learning and retention. J. comp. physiol. Psychol., 1957, 50, 162–167. Betz, Barbara J. 
Experiences in research in psychotherapy with schizophrenic patients. In H. H. Strupp & L. Luborsky (Eds.), Research in psychotherapy. Washington, D.C.: American Psychological Association, 1962. Pp. 41–60. Binder, A., McConnell, D., & Sjoholm, Nancy A. Verbal conditioning as a function of experimenter characteristics. J. abnorm. soc. Psychol., 1957, 55, 309–314. Bingham, W. V. D., & Moore, B. V. How to interview. (3rd rev.) New York: Harper & Row, 1941. Birdwhistell, R. L. The kinesic level in the investigation of the emotions. In P. H. Knapp (Ed.), Expression of the emotions in man. New York: International Universities Press, 1963. Pp. 123–139. Birney, R. C. The achievement motive and task performance: A replication. J. abnorm. soc. Psychol., 1958, 56, 133–135. Blankenship, A. B. The effect of the interviewer upon the response in a public opinion poll. J. consult. Psychol., 1940, 4, 134–136. Boring, E. G. A history of experimental psychology (2nd ed.) New York: Appleton-Century-Crofts, 1950. Boring, E. G. The nature and history of experimental control. Amer. J. Psychol., 1954, 67, 573–589. Boring, E. G. Science and the meaning of its history. The Key Reporter, 1959, 24, 2–3. Boring, E. G. Newton and the spectral lines. Science, 1962, 136, 600–601. (a) Boring, E. G. Parascience. Contemp. Psychol., 1962, 7, 356–357. (b) Brogden, W. J. Animal studies of learning. In S. S. Stevens (Ed.), Handbook of experimental psychology. New York: Wiley, 1951. Pp. 568–612. Brogden, W. J. The experimenter as a factor in animal conditioning. Psychol. Rep., 1962, 11, 239–242. Brown, J. M. Respondents rate public opinion interviewers. J. appl. Psychol., 1955, 39, 96–102. Brunswik, E. Perception and the representative design of psychological experiments. Berkeley: Univer. of California Press, 1956. Buckhout, R. Need for social approval and dyadic verbal behavior. Psychol. Rep., 1965, 16, 1013–1016. Buss, A. H., & Gerjuoy, L. R. Verbal conditioning and anxiety. J. abnorm. soc. 
Psychol., 1958, 57, 249–250. Cahalan, D., Tamulonis, V., & Verner, Helen W. Interviewer bias involved in certain types of opinion survey questions. Int. J. Opin. Attit. Res., 1947, 1, 63–77.
654
Book Two – Experimenter Effects in Behavioral Research Cahen, L. S. An experimental manipulation of the ‘‘Halo Effect’’: A study of teacher bias. Unpublished manuscript, Stanford Univer., 1965. Campbell, D. T. Systematic error on the part of human links in communication systems. Information and Control, 1958, 1, 334–369. Campbell, D. T. Systematic errors to be expected of the social scientist on the basis of a general psychology of cognitive bias. Paper read at Amer. Psychol. Ass., Cincinnati, Sept., 1959. Canady, H. G. The effect of ‘‘rapport’’ on the I.Q.: A new approach to the problem of racial psychology. J. Negro Educ., 1936, 5, 209–219. Cantril, H., & research associates. Gauging public opinion. Princeton: Princeton Univer. Press, 1944. Carlsmith, J. M., & Aronson, E. Some hedonic consequences of the confirmation and disconfirmation of expectancies. J. abnorm. soc. Psychol., 1963, 66, 151–156. Carlson, E. R., & Carlson, Rae. Male and female subjects in personality research. J. abnorm. soc. Psychol., 1960, 61, 482–483. Cervin, V. B., Joyner, R. C., Spence, J. M., & Heinzl, R. Relationship of persuasive interaction to change of opinion in dyadic groups when the original opinions of participants are expressed privately and publicly. J. abnorm. soc. Psychol., 1961, 62, 431–432. Chapanis, Natalia P., & Chapanis, A. Cognitive dissonance: Five years later. Psychol Bull., 1964, 61, 1–22. Christie, R. Experimental naivete´ and experiential naivete´. Psychol. Bull., 1951, 48, 327–339. Clark, E. L. The value of student interviewers. J. pers. Res., 1927, 5, 204–207. Clark, K. B. Educational stimulation of racially disadvantaged children. In A. H. Passow (Ed.), Education in depressed areas. New York: Teachers College, Columbia Univer., 1963. Pp. 142–162. Cleveland, S. The relationship between examiner anxiety and subjects’ Rorschach scores. Microfilm Abstr., 1951, 11, 415–416. Cochran, W. G., & Watson, D. J. 
An experiment on observer’s bias in the selection of shoot-heights. Empire J. exp. Agriculture, 1936, 4, 69–76. Coffey, H. S., Dorcus, R. M., Glaser, E. M., Greening, T. C., Marks, J. B., & Sarason, I. G. Learning to work. Los Angeles: Human Interaction Research Institute, 1964. Coffin, T. E. Some conditions of suggestion and suggestibility. Psychol. Monogr., 1941, 53, No. 4 (Whole No. 241). Colby, K. M. An introduction to psychoanalytic research. New York: Basic Books, 1960. Cole, D. L. The influence of task perception and leader variation on auto-kinetic responses. Amer. Psychologist, 1955, 10, 343. (Abstract) Conrad, H. S. Some principles of attitude-measurement: A reply to ‘‘opinion-attitude methodology.’’ Psychol. Bull., 1946, 43, 570–589. Cook-Marquis, Peggy. Authoritarian or acquiescent: Some behavioral differences. Paper read at Amer. Psychol. Ass., Washington, D.C., Sept., 1958. Cordaro, L., & Ison, J. R. Observer bias in classical conditioning of the planarian. Psychol. Rep., 1963, 13, 787–789. Crespi, L. P. The cheater problem in polling. Publ. Opin. Quart., 1945–46, 9, 431–445. Criswell, Joan H. The psychologist as perceiver. In R. Tagiuri, & L. Petrullo (Eds.), Person perception and interpersonal behavior. Stanford, Calif.: Stanford Univer. Press, 1958. Pp. 95–109. Crow, Linda. Public attitudes and expectations as a disturbing variable in experimentation and therapy. Unpublished paper, Harvard Univer., 1964. Crowne, D. P., & Marlowe, D. The approval motive. New York: Wiley, 1964. Crumbaugh, J. C. ESP and flying saucers: A challenge to parapsychologists. Amer. Psychologist, 1959, 14, 604–606. Crutchfield, R. S. Conformity and character. Amer. Psychologist, 1955, 10, 191–198. Cutler, R. L. Countertransference effects in psychotherapy. J. consult. Psychol., 1958, 22, 349–356. Dailey, J. M. Verbal conditioning without awareness. Unpublished doctoral dissertation, Univer. of Iowa, 1953. Das, J. P. Prestige effects in body-sway suggestibility. J. abnorm. 
soc. Psychol., 1960, 67, 487–488. Delboeuf, J. L. R. Le magnétisme animal à propos d’une visite à l’école de Nancy. Paris: Alcan, 1889.
Dember, W. N. The psychology of perception. New York: Holt, Rinehart and Winston, 1960. Deutsch, M., & Collins, Mary E. The effect of public policy in housing projects upon interracial attitudes. In G. E. Swanson, T. M. Newcomb, E. L. Hartley, et al. (Eds.), Readings in social psychology. (Rev. ed.) New York: Holt, Rinehart and Winston, 1952. Pp. 582–593. Dulany, D. E., & O’Connell, D. C. Does partial reinforcement dissociate verbal rules and the behavior they might be presumed to control? J. verb. Learn. verb. Behav., 1963, 2, 361–372. Ebbinghaus, H. Memory: A contribution to experimental psychology (1885). Translated by H. A. Ruger, & Clara E. Bussenius. New York: Teachers College, Columbia Univer., 1913. Eckler, A. R., & Hurwitz, W. N. Response variance and biases in censuses and surveys. Bull. de L’Institut International de Statistique, 1958, 36–2e, 12–35. Editorial Board, Consumer Reports. Food and Drug Administration. Consumer Reports, 1964, 29, 85. Editorial Board, Science. An unfortunate event. Science, 1961, 134, 945–946. Edwards, A. L. Experimental design in psychological research. New York: Holt, Rinehart and Winston, 1950. Edwards, A. L. Experiments: Their planning and execution. In G. Lindzey (Ed.), Handbook of social psychology. Vol. 1. Cambridge, Mass.: Addison-Wesley, 1954. Ehrenfreund, D. A study of the transposition gradient. J. exp. Psychol., 1952, 43, 81–87. Ehrlich, June S., & Riesman, D. Age and authority in the interview. Publ. Opin. Quart., 1961, 25, 39–56. Ekman, P., & Friesen, W. V. Status and personality of the experimenter as a determinant of verbal conditioning. Amer. Psychologist, 1960, 15, 430. (Abstract) Eriksen, C. W. Discrimination and learning without awareness: a methodological survey and evaluation. Psychol. Rev., 1960, 67, 279–300. Eriksen, C. W. (Ed.) Behavior and awareness. Durham, N.C.: Duke Univer. Press, 1962. Eriksen, C. W., Kuethe, J. L., & Sullivan, D. F. 
Some personality correlates of learning without verbal awareness. J. Pers., 1958, 26, 216–228. Escalona, Sibylle K. Feeding disturbances in very young children. Amer. J. Orthopsychiat., 1945, 15, 76–80. Also in G. E. Swanson, T. M. Newcomb, & E. L. Hartley et al. (Eds.), Readings in Social Psychology (Rev. ed.) New York: Holt, Rinehart and Winston, 1952. Pp. 29–33. Exline, R. V. Explorations in the process of person perception: Visual interaction in relation to competition, sex, and need for affiliation. J. Pers., 1963, 31, 1–20. Exline, R. V., Gray, D., & Schuette, Dorothy. Visual behavior in a dyad as affected by interview content and sex of respondent. J. pers. soc. Psychol., 1965, 1, 201–209. Eysenck, H. J. The concept of statistical significance and the controversy about one-tailed tests. Psychol. Rev., 1960, 67, 269–271. Feinstein, A. R. The stethoscope: a source of diagnostic aid and conceptual errors in rheumatic heart disease. J. chronic Dis., 1960, 11, 91–101. Felice, A. Some effects of subject-examiner interaction on the task performance of schizophrenics. Dissert. Abstr., 1961, 22, 913–914. Fell, Honor B. Fashion in cell biology. Science, 1960, 132, 1625–1627. Ferber, R., & Wales, H. G. Detection and correction of interviewer bias. Publ. Opin. Quart., 1952, 16, 107–127. Ferguson, D. C., & Buss, A. H. Operant conditioning of hostile verbs in relation to experimenter and subject characteristics. J. consult. Psychol., 1960, 24, 324–327. Festinger, L. A theory of cognitive dissonance. New York: Harper & Row, 1957. Festinger, L., & Carlsmith, J. M. Cognitive consequences of forced compliance. J. abnorm. soc. Psychol., 1959, 58, 203–210. Filer, R. M. The clinician’s personality and his case reports. Amer. Psychologist, 1952, 7, 336. (Abstract) Fine, B. J. Conclusion-drawing, communication, credibility, and anxiety as factors in opinion change. J. abnorm. soc. Psychol., 1957, 54, 369–374. Fisher, R. A. Has Mendel’s work been rediscovered? Ann. 
Sci., 1936, 1, 115–137. Fisher, R. A. The design of experiments. (4th ed.) Edinburgh and London: Oliver & Boyd, 1947. Fode, K. L. The effect of non-visual and non-verbal interaction on experimenter bias. Unpublished master’s thesis, Univer. of North Dakota, 1960. Fode, K. L. The effect of experimenters’ and subjects’ anxiety and social desirability on experimenter outcome-bias. Unpublished doctoral dissertation, Univer. of North Dakota, 1965.
Foster, R. J. Acquiescent response set as a measure of acquiescence. J. abnorm. soc. Psychol., 1961, 63, 155–160. Foster, W. S. Experiments on rod-divining. J. appl. Psychol., 1923, 7, 303–311. Frank, J. Discussion of Eysenck’s ‘‘The effects of psychotherapy.’’ Int. J. Psychiat., 1965, 1, 150–152. Freiberg, A. D., Vaughn, C. L., & Evans, Mary C. Effect of interviewer bias upon questionnaire results obtained with a large number of investigators. Amer. Psychologist, 1946, 7, 243. (Abstract) Friedman, N. The psychological experiment as a social interaction. Unpublished doctoral dissertation, Harvard Univer., 1964. Friedman, N., Kurland, D., & Rosenthal, R. Experimenter behavior as an unintended determinant of experimental results. J. proj. Tech. pers. Assess., 1965, 29, 479–490. Friedman, Pearl. A second experiment on interviewer bias. Sociometry, 1942, 5, 378–379. Fromm-Reichmann, Frieda. Principles of intensive psychotherapy. Chicago: Univer. of Chicago Press, 1950. Fruchter, B. Introduction to factor analysis. Princeton, N.J.: Van Nostrand, 1954. Funkenstein, D. H., King, S. H., & Drolétte, Margaret E. Mastery of stress. Cambridge, Mass.: Harvard Univer. Press, 1957. Gantt, W. H. Autonomic conditioning. In J. Wolpe, A. Salter, & L. J. Reyna (Eds.), The conditioning therapies. New York: Holt, Rinehart, and Winston, 1964. Pp. 115–126. Garfield, S. L., & Affleck, D. C. Therapists’ judgments concerning patients considered for therapy. Amer. Psychologist, 1960, 15, 414. (Abstract) Garrett, H. E. On un-American science reporting. Science, 1960, 132, 685. Geldard, F. A. Some neglected possibilities of communication. Science, 1960, 131, 1583–1588. Gelfand, Donna M., & Winder, C. L. Operant conditioning of verbal behavior of dysthymics and hysterics. J. abnorm. soc. Psychol., 1961, 62, 688–689. George, W. H. The scientist in action: A scientific study of his methods. New York: Emerson, 1938. Gillispie, C. C. 
The edge of objectivity. Princeton: Princeton Univer. Press, 1960. Glucksberg, S., & Lince, D. L. The influence of military rank of experimenter on the conditioning of a verbal response. Tech. Mem. 10–62, Human Engineering Lab., Aberdeen Proving Ground, Maryland, 1962. Goldberg, S., Hunt, R. G., Cohen, W., & Meadow, A. Some personality correlates of perceptual distortion in the direction of group conformity. Amer. Psychologist, 1954, 9, 378. (Abstract) Goldfried, M. R., & Walters, G. C. Needed: Publication of negative results. Amer. Psychologist, 1959, 14, 598. Goldstein, A. P. Therapist and client expectation of personality change in psychotherapy. J. counsel. Psychol., 1960, 7, 180–184. Goldstein, A. P. Therapist-patient expectancies in psychotherapy. New York: Pergamon Press, 1962. Goranson, R. E. Effects of the experimenter’s prestige on the outcome of an attitude change experiment. Paper read at Midwest. Psychol. Assoc., Chicago, May, 1965. Gordon, L. V., & Durea, M. A. The effect of discouragement on the revised Stanford Binet Scale. J. genet. Psychol., 1948, 73, 201–207. Graham, S. R. The influence of therapist character structure upon Rorschach changes in the course of psychotherapy. Amer. Psychologist, 1960, 15, 415. (Abstract) Greenblatt, M. Controls in clinical research. Unpublished paper. Tufts Univer. School of Medicine, 1964. Griffith, R. Rorschach water percepts: A study in conflicting results. Amer. Psychologist, 1961, 16, 307–311. Gruenberg, B. C. The story of evolution. Princeton, N.J.: Van Nostrand, 1929. Guilford, J. P. Psychometric methods. (2nd ed.) New York: McGraw-Hill, 1954. Guthrie, E. R. The psychology of human conflict. New York: Harper & Row, 1938. Haas, H., Fink, H., & Härtfelder, G. The placebo problem. Psychopharmacol. Serv. Cent. Bull., 1963, 2, 1–65. Hammond, K. R. Representative vs. systematic design in clinical psychology. Psychol. Bull., 1954, 51, 150–159. Haner, C. F., & Whitney, E. R. 
Empathic conditioning and its relation to anxiety level. Amer. Psychologist, 1960, 15, 493. (Abstract)
Hanley, C., & Rokeach, M. Care and carelessness in psychology. Psychol. Bull., 1956, 53, 183–186. Hanson, N. R. Patterns of discovery. Cambridge, England: Cambridge Univer. Press, 1958. Hanson, R. H., & Marks, E. S. Influence of the interviewer on the accuracy of survey results. J. Amer. Statist. Assoc., 1958, 53, 635–655. Harari, C., & Chwast, J. Class bias in psychodiagnosis of delinquents. Amer. Psychologist, 1959, 14, 377–378. (Abstract) Harlem Youth Opportunities Unlimited, Inc. Youth in the ghetto. New York: Author, 1964. Harris, Natalie. Introducing a symposium on interviewing problems. Int. J. Opin. Attit. Res., 1948, 2, 69–84. Hart, C. W. Preface to H. H. Hyman, W. J. Cobb, J. J. Feldman, C. W. Hart, & C. H. Stember, Interviewing in social research. Chicago: Univer. of Chicago Press, 1954. Harvey, O. J., & Clapp, W. F. Hope, expectancy, and reactions to the unexpected. J. pers. soc. Psychol., 1965, 2, 45–52. Harvey, S. M. A preliminary investigation of the interview. Brit. J. Psychol., 1938, 28, 263–287. Hefferline, R. F. Learning theory and clinical psychology—an eventual symbiosis? In A. J. Bachrach (Ed.), Experimental foundations of clinical psychology. New York: Basic Books, 1962. Pp. 97–138. Heilizer, F. An exploration of the relationship between hypnotizability and anxiety and/or neuroticism. J. consult. Psychol., 1960, 24, 432–436. Heine, R. W., & Trosman, H. Initial expectations of the doctor-patient interaction as a factor in continuance in psychotherapy. Psychiatry, 1960, 23, 275–278. Heller, K., Davis, J. D., & Saunders, F. Clinical implications of laboratory studies of interpersonal style. Paper read at Midwest. Psychol. Assoc., St. Louis, May, 1964. Heller, K., & Goldstein, A. P. Client dependency and therapist expectancy as relationship maintaining variables in psychotherapy. J. consult. Psychol., 1961, 25, 371–375. Heller, K., Myers, R. A., & Vikan-Kline, L. Interviewer behavior as a function of standardized client roles. J. consult. 
Psychol., 1963, 27, 117–122. Helson, H. Adaptation-level theory. New York: Harper & Row, 1964. Homans, G. C. Social behavior: Its elementary forms. New York: Harcourt, Brace & World, 1961. Homme, L. E., & Klaus, D. J. Laboratory studies in the analysis of behavior. Pittsburgh: Lever Press, 1957. Honigfeld, G. Non-specific factors in treatment. Dis. nerv. Syst., 1964, 25, 145–156, 225–239. Hovland, C. I., & Janis, I. L. (Eds.) Personality and persuasibility. New Haven, Conn.: Yale Univer. Press, 1959. Hovland, C. I., & Weiss, W. The influence of source credibility on communication effectiveness. Publ. Opin. Quart., 1951, 15, 635–650. Hyman, H. H., Cobb, W. J., Feldman, J. J., Hart, C. W., & Stember, C. H. Interviewing in social research. Chicago: Univer. of Chicago Press, 1954. Ismir, A. A. The effects of prior knowledge of the TAT on test performance. Psychol. Rec., 1962, 12, 157–164. Ismir, A. A. The effect of prior knowledge, social desirability, and stress upon the Thematic Apperception Test performance. Unpublished doctoral dissertation, Univer. of North Dakota, 1963. Jahn, M. E., & Woolf, D. J. (Eds.) The lying stones of Dr. Johann Bartholomew Adam Beringer, being his Lithographiae Wirceburgensis. Berkeley: Univer. of California Press, 1963. James, W. Essays in pragmatism. New York: Hafner, 1948. Janis, I. L. Anxiety indices related to susceptibility to persuasion. J. abnorm. soc. Psychol., 1955, 51, 663–667. Jastrow, J. Fact and fable in psychology. Boston: Houghton Mifflin, 1900. Jenness, A. The role of discussion in changing opinions regarding a matter of fact. J. abnorm. soc. Psychol., 1932, 27, 279–296. Joel, W. The interpersonal equation in projective methods. J. proj. Tech., 1949, 13, 479–482. Johnson, H. M. Audition and habit formation in the dog. Behav. Monogr., 1913, 2, No. 3 (Serial No. 8). Johnson, M. L. Seeing’s believing. New Biology, 1953, 15, 60–80. Jones, E. E. Conformity as a tactic of ingratiation. Science, 1965, 149, 144–150. Jones, E. 
E., & Thibaut, J. W. Interaction goals as bases of inference in interpersonal perception. In R. Tagiuri, & L. Petrullo (Eds.), Person perception and interpersonal behavior. Stanford, Calif.: Stanford Univer. Press, 1958. Pp. 151–178.
Jones, F. P. Experimental method in antiquity. Amer. Psychologist, 1964, 19, 419. Jones, R. H. Physical indices and clinical assessments of the nutrition of school children. J. Royal Statist. Soc., Part I, 1938, 101, 1–34. Jordan, N. The mythology of the non-obvious—autism or fact? Contemp. Psychol., 1964, 9, 140–142. Kagan, J., & Moss, H. J. Birth to maturity. New York: Wiley, 1962. Kanfer, F. H., & Karas, Shirley C. Prior experimenter-subject interaction and verbal conditioning. Psychol. Rep., 1959, 5, 345–353. Katz, D. Do interviewers bias poll results? Publ. Opin. Quart., 1942, 6, 248–268. Katz, I. Review of evidence relating to effects of desegregation on the intellectual performance of Negroes. Amer. Psychologist, 1964, 19, 381–399. Katz, I., Robinson, J. M., Epps, E. G., & Waly, Patricia. The influence of race of the experimenter and instructions upon the expression of hostility by Negro boys. J. soc. Issues, 1964, 20, 54–59. Katz, R. Body language: A study in unintentional communication. Unpublished doctoral dissertation, Harvard Univer., 1964. Kelley, H. H. The effects of expectations upon first impressions of persons. Amer. Psychologist, 1949, 4, 252. (Abstract) Kelley, H. H., & Ring, K. Some effects of ‘‘suspicious’’ versus ‘‘trusting’’ training schedules. J. abnorm. soc. Psychol., 1961, 63, 294–301. Kellog, W. N. Porpoises and sonar. Chicago: Univer. of Chicago Press, 1961. Kellog, W. N. Sonar system of the blind. Science, 1962, 137, 399–404. Kelly, G. A. The psychology of personal constructs. New York: Norton, 1955. Kelman, H. C. Attitude change as a function of response restriction. Hum. Relat., 1953, 6, 185–214. Kelman, H. C. The human use of human subjects: The problem of deception in social-psychological experiments. Paper read at Amer. Psychol. Assoc., Chicago, September, 1965. Kennedy, J. L. Experiments on ‘‘unconscious whispering.’’ Psychol. Bull., 1938, 35, 526. (Abstract) Kennedy, J. L. 
A methodological review of extra-sensory perception. Psychol. Bull., 1939, 56, 59–103. Kennedy, J. L., & Uphoff, H. F. Experiments on the nature of extra-sensory perception: III. The recording error criticism of extra-chance scores. J. Parapsychol., 1939, 3, 226–245. Kety, S. S. Biochemical theories of schizophrenia, Part I. Science, 1959, 129, 1528–1532; Part II, 1590–1596. Kimble, G. A. Classical conditioning and the problem of awareness. J. Pers., 1962, 30, Supplement: C. W. Eriksen (Ed.), Behavior and awareness, 27–45. Klein, G. S. Perception, motives and personality. In J. L. McCary (Ed.), Psychology of personality: Six modern approaches. New York: Logos, 1956. Pp. 121–199. Koestler, A. The act of creation. New York: Macmillan, 1964. Kramer, E., & Brennan, E. P. Hypnotic susceptibility of schizophrenic patients. J. abnorm. soc. Psychol., 1964, 69, 657–659. Krasner, L. Studies of the conditioning of verbal behavior. Psychol. Bull., 1958, 55, 148–170. Krasner, L. The therapist as a social reinforcement machine. In H. H. Strupp, & L. Luborsky (Eds.), Research in psychotherapy. Washington, D.C.: Amer. Psychol. Ass., 1962. Pp. 61–94. Kubie, L. S. The use of psychoanalysis as a research tool. Psychiat. Res. Rep., 1956, 6, 112–136. Kuethe, J. L. Acquiescent response set and the psychasthenia scale: An analysis via the aussage experiment. J. abnorm. soc. Psychol., 1960, 61, 319–322. Lane, F. W. Kingdom of the octopus. New York: Sheridan House, 1960. Lefkowitz, M., Blake, R. R., & Mouton, Jane S. Status factors in pedestrian violation of traffic signals. J. abnorm. soc. Psychol., 1955, 51, 704–706. Levin, S. M. The effects of awareness on verbal conditioning. J. exp. Psychol., 1961, 61, 67–75. Levitt, E. E. Problems of experimental design and methodology in psychopharmacology research. In R. H. Branson (Ed.), Report of the conference on mental health research. Indianapolis: Assoc. Advance. Ment. Hlth. Res. Educ., 1959. Levitt, E. E., & Brady, J. P. 
Expectation and performance in hypnotic phenomena. J. abnorm. soc. Psychol., 1964, 69, 572–574. Levy, L. H., & Orr, T. B. The social psychology of Rorschach validity research. J. abnorm. soc. Psychol., 1959, 58, 79–83. Liddell, H. S. The alteration of instinctual processes through the influence of conditioned reflexes. In S. S. Tomkins, Contemporary psychopathology. Cambridge, Mass: Harvard Univer. Press, 1943.
Lindquist, E. F. Design and analysis of experiments in psychology and education. Boston: Houghton Mifflin, 1953. Lindzey, G. A note on interviewer bias. J. appl. Psychol., 1951, 35, 182–184. London, P., & Fuhrer, M. Hypnosis, motivation, and performance. J. Pers., 1961, 29, 321–333. Lord, Edith. Experimentally induced variations in Rorschach performance. Psychol. Monogr., 1950, 64, No. 10. Lubin, A. Replicability as a publication criterion. Amer. Psychologist, 1957, 12, 519–520. Luft, J. Interaction and projection. J. proj. Tech., 1953, 17, 489–492. Lyerly, S. B., Ross, S., Krugman, A. D., & Clyde, D. J. Drugs and placebos: The effects of instructions upon performance and mood under amphetamine sulphate and chloral hydrate. J. abnorm. soc. Psychol., 1964, 68, 321–327. Maccoby, Eleanor E., & Maccoby, N. The interview: A tool of social science. In G. Lindzey (Ed.), Handbook of social psychology. Vol. I. Cambridge, Mass.: Addison-Wesley, 1954. Pp. 449–487. MacDougall, C. D. Hoaxes. New York: Macmillan, 1940. MacKinnon, D. W. The nature and nurture of creative talent. Amer. Psychologist, 1962, 17, 484–495. Mahalanobis, P. C. Recent experiments in statistical sampling in the Indian Statistical Institute. J. Royal Statist. Soc., 1946, 109, 325–370. Mahl, G. F., & Schulze, G. Psychological research in the extralinguistic area. In T. A. Sebeok, A. S. Hayes, & Mary C. Bateson (Eds.), Approaches to semiotics. The Hague: Mouton, 1964. Pp. 51–124. Maier, N. R. F. Frustration theory: Restatement and extension. Psychol. Rev., 1956, 63, 370–388. Maier, N. R. F. Maier’s law. Amer. Psychologist, 1960, 15, 208–212. Marcia, J. Hypothesis-making, need for social approval, and their effects on unconscious experimenter bias. Unpublished master’s thesis, Ohio State Univer., 1961. Marine, Edith L. The effect of familiarity with the examiner upon Stanford-Binet test performance. Teach. Coll. contr. Educ., 1929, 381, 42. Marks, M. R. 
How to build better theories, tests and therapies: The off-quadrant approach. Amer. Psychologist, 1964, 19, 793–798. Marwit, S., & Marcia, J. Tester-bias and response to projective instruments. Unpublished paper. State Univer. of New York at Buffalo, 1965. Masling, J. The effects of warm and cold interaction on the administration and scoring of an intelligence test. J. consult. Psychol., 1959, 23, 336–341. Masling, J. The influence of situational and interpersonal variables in projective testing. Psychol. Bull., 1960, 57, 65–85. Masling, J. Differential indoctrination of examiners and Rorschach responses. J. consult. Psychol., 1965, 29, 198–201. Matarazzo, J. D., Saslow, G., & Pareis, E. N. Verbal conditioning of two response classes: Some methodological considerations. J. abnorm. soc. Psychol., 1960, 61, 190–206. Matarazzo, J. D., Wiens, A. N., & Saslow, G. Studies in interview speech behavior. In L. Krasner, & L. P. Ullman (Eds.), Research in behavior modification: New developments and implications. New York: Holt, Rinehart and Winston, 1965. Pp. 181–210. Mausner, B. Studies in social interaction: III. Effect of variation in one partner’s prestige on the interaction of observer pairs. J. appl. Psychol., 1953, 37, 391–393. Mausner, B. The effect of one partner’s success or failure in a relevant task on the interaction of observer pairs. J. abnorm. soc. Psychol., 1954, 49, 557–560. Mausner, B., & Bloch, Barbara L. A study of the additivity of variables affecting social interaction. J. abnorm. soc. Psychol., 1957, 54, 250–256. McClelland, D. C. Wanted: A new self-image for women. In R. J. Lifton (Ed.), The woman in America. Boston: Houghton Mifflin, 1965. Pp. 173–192. McFall, R. M. ‘‘Unintentional communication’’: The effect of congruence and incongruence between subject and experimenter constructions. Unpublished doctoral dissertation, Ohio State Univer., 1965. McGuigan, F. J. The experimenter: A neglected stimulus object. Psychol. Bull., 1963, 60, 421–28. 
McNemar, Q. Opinion-attitude methodology. Psychol. Bull., 1946, 43, 289–374. McNemar, Q. At random: Sense and nonsense. Amer. Psychologist, 1960, 15, 295–300. McTeer, W. Observational definitions of emotion. Psychol. Rev., 1953, 60, 172–180.
Merton, R. K. The self-fulfilling prophecy. Antioch Rev., 1948, 8, 193–210. Miller, J. G. Unconsciousness. New York: Wiley, 1942. Mills, J. Changes in moral attitudes following temptation. J. Pers., 1958, 26, 517–531. Mills, T. M. A sleeper variable in small groups research: The experimenter. Pacific sociol. Rev., 1962, 5, 21–28. Milmoe, Susan, Rosenthal, R., Blane, H. T., Chafetz, M. E., & Wolf, I. The doctor’s voice: Postdictor of successful referral of alcoholic patients. Unpublished paper, Harvard Univer., 1965. (J. abnorm. Psychol., 1966, in press.) Mintz, N. On the psychology of aesthetics and architecture. Unpublished paper, Brandeis Univer., 1957. Moll, A. Hypnotism. (4th ed.) New York: Scribner, 1898. Moll, A. Hypnotism. Translated by A. F. Hopkirk. (4th enlarged ed.) London: Walter Scott; New York: Scribner, 1910. Morrow, W. R. Psychologists’ attitudes on psychological issues: I. Constrictive method-formalism. J. gen. Psychol., 1956, 54, 133–147. Morrow, W. R. Psychologists’ attitudes on psychological issues: II. Static-mechanical-elementarism. J. gen. Psychol., 1957, 57, 69–82. Mosteller, F. Correcting for interviewer bias. In H. Cantril, Gauging public opinion. Princeton: Princeton Univer. Press, 1944. Pp. 286–288. Mosteller, F., & Bush, R. R. Selected quantitative techniques. In G. Lindzey (Ed.), Handbook of social psychology. Cambridge, Mass.: Addison-Wesley, 1954. Pp. 289–334. Mosteller, F., & Hammel, E. A. Review of Naroll, R. Data quality control—a new research technique. New York: Macmillan, 1962. J. Amer. Statist. Assoc., 1963, 58, 835–836. Mulry, R. C. The effects of the experimenter’s perception of his own performance on subject performance in a pursuit rotor task. Unpublished master’s thesis, Univer. of North Dakota, 1962. Munn, N. L. Handbook of psychological research on the rat. Boston: Houghton Mifflin, 1950. Murphy, G. Science in a straight jacket? Contemp. Psychol., 1962, 7, 357–358. 
Murray, H. A. Techniques for a systematic investigation of fantasy. J. Psychol., 1937, 3, 115–143. Naroll, Frada, Naroll, R., & Howard, F. H. Position of women in childbirth. Amer. J. Obstetrics Gynecol., 1961, 82, 943–954. Naroll, R. Data quality control—a new research technique. New York: Macmillan, 1962. Newcomb, T. M. The acquaintance process. New York: Holt, Rinehart and Winston, 1961. Noltingk, B. E. The human element in research management. Amsterdam: Elsevier, 1959. Norman, R. D. A review of some problems related to the mail questionnaire technique. Educ. psychol. Measmt., 1948, 8, 235–247. Orne, M. T. The nature of hypnosis: Artifact and essence. J. abnorm. soc. Psychol., 1959, 58, 277–299. Orne, M. T. On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. Amer. Psychologist, 1962, 17, 776–783. Palmer, L. R. New evidence in Knossos affair. London: The Sunday Observer, 1962 (Feb. 11). Parsons, T. The American family: Its relations to personality and to the social structure. In T. Parsons, & R. F. Bales, Family, socialization and interaction process. New York: Free Press, 1955. Pp. 3–33. Parsons, T., & Bales, R. F. Family, socialization and interaction process. New York: Free Press, 1955. Parsons, T., Bales, R. F., & Shils, E. A. Working papers in the theory of action. New York: Free Press, 1953. Pearson, K. On the mathematical theory of errors of judgment with special reference to the personal equation. Phil. Trans. Roy. Soc. London, 1902, 198, 235–299. Persinger, G. W. The effect of acquaintanceship on the mediation of experimenter bias. Unpublished master’s thesis, Univer. of North Dakota, 1962. Pflugrath, J. Examiner influence in a group testing situation with particular reference to examiner bias. Unpublished master’s thesis, Univer. of North Dakota, 1962. Pfungst, O. Clever Hans (the horse of Mr. von Osten): A contribution to experimental, animal, and human psychology. 
Translated by C. L. Rahn. New York: Holt, 1911. Polanyi, M. Personal knowledge. Chicago: Univer. of Chicago Press, 1958. Polanyi, M. Tacit knowing: Its bearing on some problems of philosophy. Reviews of Modern Physics, 1962, 34, 601–616.
Polanyi, M. The potential theory of adsorption. Science, 1963, 141, 1010–1013. Pomeroy, W. B. Human sexual behavior. In N. L. Farberow (Ed.), Taboo topics. Englewood Cliffs, N.J.: Prentice-Hall, 1963. Pp. 22–32. Prince, A. I. Relative prestige and the verbal conditioning of children. Amer. Psychologist, 1962, 17, 378. (Abstract) Quart. J. Stud. Alcohol, Editorial staff. Mortality in delirium tremens. No. Dak. Rev. Alcoholism, 1959, 4, 3. Abstract of Gunne, L. M. Mortaliteten vid delirium tremens. Nord. Med., 1958, 60, 1021–1024. Rankin, R., & Campbell, D. Galvanic skin response to Negro and white experimenters. J. abnorm. soc. Psychol., 1955, 51, 30–33. Rapp, D. W. Detection of observer bias in the written record. Unpublished manuscript, Univer. of Georgia, 1965. Raven, B. H., & French, J. R. P., Jr. Group support, legitimate power and social influence. J. Pers., 1958, 26, 400–409. Ravitz, L. J. Electrometric correlates of the hypnotic state. Science, 1950, 112, 341–342. Ravitz, L. J. Electrocyclic phenomena and emotional states. J. clin. exp. Psychopathol., 1952, 13, 69–106. Razran, G. Pavlov the empiricist. Science, 1959, 130, 916. Reece, M. M., & Whitman, R. N. Expressive movements, warmth, and verbal reinforcements. J. abnorm. soc. Psychol., 1962, 64, 234–236. Reif, F. The competitive world of the pure scientist. Science, 1961, 134, 1957–1962. Rhine, J. B. How does one decide about ESP? Amer. Psychologist, 1959, 14, 606–608. Rice, C. E., & Feinstein, S. H. Sonar system of the blind: Size discrimination. Science, 1965, 148, 1107–1108. Rice, S. A. Contagious bias in the interview: A methodological note. Amer. J. Sociol., 1929, 35, 420–423. Rider, P. R. Criteria for rejection of observations. Washington Univer. Stud.: New Ser., Sci. Tech., 1933, 8. Riecken, H. W. A program for research on experiments in social psychology. In N. F. Washburne (Ed.), Decisions, values and groups. Vol. II. New York: Pergamon Press, 1962. Pp. 25–41. Riecken, H. W., et al. 
Narrowing the gap between field studies and laboratory experiments in social psychology: A statement by the summer seminar. Items, Soc. Sci. Res. Council, 1954, 8, 37–42. Ringuette, E. L., & Kennedy, Gertrude L. An experimental investigation of the double-bind hypothesis. Amer. Psychologist, 1964, 19, 459. (Abstract) Robinson, D., & Rohde, S. Two experiments with an anti-Semitism poll. J. abnorm. soc. Psychol., 1946, 41, 136–144. Robinson, J., & Cohen, L. Individual bias in psychological reports. J. clin. Psychol., 1954, 10, 333–336. Rodnick, E. H., & Klebanoff, S. G. Projective reactions to induced frustration as a measure of social adjustment. Psychol. Bull., 1942, 39, 489. (Abstract) Roe, Ann. Man’s forgotten weapon. Amer. Psychologist, 1959, 14, 261–266. Roe, Ann. The psychology of the scientist. Science, 1961, 134, 456–459. Rokeach, M. The open and closed mind. New York: Basic Books, 1960. Rosenberg, M. J. When dissonance fails: On eliminating evaluation apprehension from attitude measurement. J. pers. soc. Psychol., 1965, 1, 28–42. Rosenhan, D. Hypnosis, conformity, and acquiescence. Amer. Psychologist, 1963, 18, 402. (Abstract) Rosenhan, D. On the social psychology of hypnosis research. Educat. Test. Serv. Research Memo. Princeton, N.J.: Educ. Test. Serv., March, 1964. Also to appear as Chapter 13 in J. E. Gordon (Ed.), Handbook of experimental and clinical hypnosis. Rosenthal, R. An attempt at the experimental induction of the defense mechanism of projection. Unpublished doctoral dissertation, Univer. of California at Los Angeles, 1956. Rosenthal, R. Projection, excitement, and unconscious experimenter bias. Amer. Psychologist, 1958, 13, 345–346. (Abstract) Rosenthal, R. Variation in research results associated with experimenter variation. Unpublished paper, Harvard Univer., 1962. Rosenthal, R. Experimenter attributes as determinants of subjects’ responses. J. proj. Tech. pers. Assess., 1963, 27, 324–331. (a)
Rosenthal, R. Experimenter modeling effects as determinants of subject’s responses. J. proj. Tech. pers. Assess., 1963, 27, 467–471. (b) Rosenthal, R. On the social psychology of the psychological experiment: The experimenter’s hypothesis as unintended determinant of experimental results. Amer. Scient., 1963, 51, 268–283. (c) Rosenthal, R. Subject susceptibility to experimenter influence. Unpublished paper, Harvard Univer., 1963. (d) Rosenthal, R. The effect of the experimenter on the results of psychological research. In B. A. Maher (Ed.), Progress in experimental personality research. Vol. I. New York: Academic Press, 1964. Pp. 79–114. (a) Rosenthal, R. Experimenter outcome-orientation and the results of the psychological experiment. Psychol. Bull., 1964, 61, 405–412. (b) Rosenthal, R. Clever Hans: A case study of scientific method. Introduction to Pfungst, O., Clever Hans. New York: Holt, Rinehart and Winston, 1965. Pp. ix–xlii. (a) Rosenthal, R. The volunteer subject. Hum. Relat., 1965, 18, 389–406. (b) Rosenthal, R. Covert communications and tacit understandings in the psychological experiment. Paper read at Amer. Psychol. Assoc., Chicago, September, 1965. (c) Rosenthal, R., & Fode, K. L. The problem of experimenter outcome-bias. In D. P. Ray (Ed.), Series research in social psychology. Symposia studies series, No. 8, Washington, D.C.: National Institute of Social and Behavioral Science, 1961. Rosenthal, R., & Fode, K. L. The effect of experimenter bias on the performance of the albino rat. Behav. Sci., 1963, 8, 183–189. (a) Rosenthal, R., & Fode, K. L. (Psychology of the scientist: V) Three experiments in experimenter bias. Psychol. Rep., 1963, 12, 491–511. (b) Rosenthal, R., Fode, K. L., Friedman, C. J., & Vikan-Kline, L. Subjects’ perception of their experimenter under conditions of experimenter bias. Percept. mot. Skills, 1960, 11, 325–331. Rosenthal, R., Fode, K. L., & Vikan-Kline, L. 
The effect on experimenter bias of varying levels of motivation of Es and Ss. Unpublished manuscript, Harvard Univer., 1960. Rosenthal, R., Fode, K. L., Vikan-Kline, L., & Persinger, G. W. Verbal conditioning: Mediator of experimenter expectancy effects? Psychol. Rep., 1964, 14, 71–74. Rosenthal, R., Friedman, C. J., Johnson, C. A., Fode, K. L., Schill, T. R., White, C. R., & VikanKline, L. Variables affecting experimenter bias in a group situation. Genet. Psychol. Monogr., 1964, 70, 271–296. Rosenthal, R., Friedman, N., & Kurland, D. Instruction-reading behavior of the experimenter as an unintended determinant of experimental results. Paper read at East. Psychol. Ass., Atlantic City, April, 1965. (J. exp. Res. Pers., 1966, in press.) Rosenthal, R., & Gaito, J. The interpretation of levels of significance by psychological researchers. J. Psychol., 1963, 55, 33–38. Rosenthal, R., & Gaito, J. Further evidence for the cliff effect in the interpretation of levels of significance. Psychol. Rep., 1964, 15, 570. Rosenthal, R., & Halas, E. S. Experimenter effect in the study of invertebrate behavior. Psychol. Rep., 1962, 11, 251–256. Rosenthal, R., Kohn, P., Greenfield, Patricia M., & Carota, N. Experimenters’ hypothesisconfirmation and mood as determinants of experimental results. Percept. mot. Skills, 1965, 20, 1237–1252. Rosenthal, R., Kohn, P., Greenfield, Patricia M., & Carota, N. Data desirability, experimenter expectancy, and the results of psychological research, J. Pers. soc. Psychol., 1966, 3, 20–27. Rosenthal, R., & Lawson, R. A longitudinal study of the effects of experimenter bias on the operant learning of laboratory rats. J. Psychiat. Res., 1964, 2, 61–72. Rosenthal, R., & Persinger, G. W. Let’s pretend: Subjects’ perception of imaginary experimenters. Percept. mot. Skills, 1962, 14, 407–409. Rosenthal, R., Persinger, G. W., & Fode, K. L. Experimenter bias, anxiety, and social desirability. Percept. mot. Skills, 1962, 15, 73–74. Rosenthal, R., Persinger, G. 
W., Mulry, R. C., Vikan-Kline, L., & Grothe, M. A motion picture study of 29 biased experimenters. Unpublished data, Harvard Univer., 1962. Rosenthal, R., Persinger, G. W., Mulry, R. C., Vikan-Kline, L., & Grothe, M. Changes in experimental hypotheses as determinants of experimental results. J. proj. Tech. pers. Assess., 1964, 28, 465–469. (a)
References
663 Rosenthal, R., Persinger, G. W., Mulry, R. C., Vikan-Kline, L., & Grothe, M. Emphasis on experimental procedure, sex of subjects, and the biasing effects of experimental hypotheses. J. proj. Tech. pers. Assess., 1964, 28, 470–473. (b) Rosenthal, R., Persinger, G. W., Vikan-Kline, L., & Fode, K. L. The effect of early data returns on data subsequently obtained by outcome-biased experimenters. Sociometry, 1963, 26, 487–498. (a) Rosenthal, R., Persinger, G. W., Vikan-Kline, L., & Fode, K. L. The effect of experimenter outcome-bias and subject set on awareness in verbal conditioning experiments. J. verb. Learn, verb. Behav., 1963, 2, 275–283. (b) Rosenthal, R., Persinger, G. W., Vikan-Kline, L., & Mulry, R. C. The role of the research assistant in the mediation of experimenter bias. J. Pers., 1963, 31, 313–335. Rostand, J. Error and deception in science. New York: Basic Books, 1960. Rotter, J. B. Social learning and clinical psychology. Englewood Cliffs, N.J.: Prentice-Hall, 1954. Rotter, J. B., & Jessor, Shirley. The problem of subjective bias in TAT interpretation. Unpublished manuscript, Ohio State Univer. (undated, circa 1947). Rozeboom, W. W. The fallacy of the null-hypothesis significance test. Psychol. Bull., 1960, 57, 416–428. Russell, B. Philosophy. New York: Norton, 1927. Sacks, Eleanor L. Intelligence scores as a function of experimentally established social relationships between child and examiner. J. abnorm. soc. Psychol., 1952, 47, 354–358. Sampson, E. E., & French, J. R. P. An experiment on active and passive resistance to social power. Amer. Psychologist, 1960, 15, 396. (Abstract) Sampson, E. E., & Sibley, Linda B. A further examination of the confirmation or nonconfirmation of expectancies and desires. J. pers. soc. Psychol., 1965, 2, 133–137. Sanders, R., & Cleveland, S. E. The relationship between certain examiner personality variables and subject’s Rorschach scores. J. proj. Tech., 1953, 17, 34–50. Sanford, R. N. 
The effects of abstinence from food upon imaginai processes: A preliminary experiment. J. Psychol., 1936, 2, 129–136. Sapolsky, A. Effect of interpersonal relationships upon verbal conditioning. J. abnorm. soc. Psychol., 1960, 60, 241–246. Sarason, I. G. Interrelationships among individual difference variables, behavior in psychotherapy, and verbal conditioning. J. abnorm, soc. Psychol., 1958, 56, 339–344. Sarason, I. G. Individual differences, situational variables, and personality research. J. abnorm. soc. Psychol., 1962, 65, 376–380. Sarason, I. G. The human reinforcer in verbal behavior research. In L. Krasner, & L. P. Ullman (Eds.), Research in behavior modification: New developments and implications. New York: Holt, Rinehart and Winston, 1965. Pp. 231–243. Sarason, I. G., & Harmatz, M. G. Test anxiety and experimental conditions. J. pers. soc. Psychol., 1965, 1, 499–505. Sarason, I. G., & Minard, J. Interrelationships among subject, experimenter, and situational variables. J. abnorm. soc. Psychol., 1963, 67, 87–91. Sarason, S. B. The psychologist’s behavior as an area of research. J. consult. Psychol., 1951, 15, 278–280. Sarason, S. B., Davidson, K. S., Lighthall, F. F., Waite, R. R., & Ruebush, B. K. Anxiety in elementary school children. New York: Wiley, 1960. Schachter, S. The psychology of affiliation. Stanford, Calif.: Stanford Univer. Press, 1959. Schmeidler, Gertrude, & McConnell, R. A. ESP and personality patterns. New Haven: Yale Univer. Press, 1958. Schultz, D. P. Time, awareness, and order of presentation in opinion change. J. appl. Psychol., 1963, 47, 280–283. Science and Politics: AMA attacked for use of disputed survey in ‘‘Medicare’’ lobbying, Science, 1960, 132, 604–605. Sebeok, T. A., Hayes, A. S., & Bateson, Mary C. (Eds.) Approaches to semiotics. The Hague: Mouton, 1964. Shapiro, A. K. A contribution to a history of the placebo effect. Behav. Sci., 1960, 5, 109–135. Shapiro, A. K. Factors contributing to the placebo effect. Amer. J. 
Psychother., 1964, 18, 73–88. Shapiro, A. K. Iatroplacebogenics. Unpublished paper, Montefiore Hospital, New York City, 1965.
664
Book Two – Experimenter Effects in Behavioral Research Shapiro, A. P. The investigator himself. In S. O. Waife, & A. P. Shapiro, (Eds.), The clinical evaluation of new drugs. New York: Hoeber-Harper, 1959. Pp. 110–119. Sheffield, F. D., Kaufman, R. S., & Rhine, J. B. A PK experiment at Yale starts a controversy. J. Amer. Soc. Psychical Res., 1952, 46, 111–117. Sherif, M., & Hovland, C. I. Social judgment. New Haven: Yale Univer. Press, 1961. Shinkman, P. G., & Kornblith, Carol L. Comment on observer bias in classical conditioning of the planarian. Psychol. Rep., 1965, 16, 56. Shor, R. E. Shared patterns of nonverbal normative expectations in automobile driving. J. soc. Psychol., 1964, 62, 155–163. Silverman, I. In defense of dissonance theory: Reply to Chapanis and Chapanis. Psychol. Bull., 1964, 62, 205–209. Silverman, I. Motives underlying the behavior of the subject in the psychological experiment. Paper read at Amer. Psychol. Ass., Chicago, September, 1965. Simmons, W. L., & Christy, E. G. Verbal reinforcement of a TAT theme. J. proj. Tech., 1962, 26, 337–341. Smart, R. G. The importance of negative results in psychological research. Canad. Psychologist, 1964, 5a, 225–232. Smith, E. E. Relative power of various attitude change techniques. Paper read at Amer. Psychol. Ass., New York, September, 1961. Smith, H. L., & Hyman, H. H. The biasing effect of interviewer expectations on survey results. Publ. Opin. Quart., 1950, 14, 491–506. Snedecor, G. W. Statistical methods. (5th ed.) Ames, Iowa: Iowa State University Press, 1956. Snow, C. P. The Affair. London: Macmillan, 1960. Snow, C. P. The moral un-neutrality of science. Science, 1961, 133, 256–259. Solomon, R. L. An extension of control group design. Psychol. Bull., 1949, 46, 137–150. Spence, K. W. Anxiety (drive) level and performance in eyelid conditioning. Psychol. Bull., 1964, 61, 129–139. Spielberger, C. D., Berger, A., & Howard, Kay. 
Conditioning of verbal behavior as a function of awareness, need for social approval, and motivation to receive reinforcement. J. abnorm. soc. Psychol., 1963, 67, 241–246. Spielberger, C. D., & De Nike, L. D. Descriptive behaviorism versus cognitive theory in verbal operant conditioning. Psychol. Rev., 1966, 73, in press. Spires, A. M. Subject-experimenter interaction in verbal conditioning. Unpublished doctoral dissertation, New York Univer., 1960. Stanton, F. Further contributions at the twentieth anniversary of the Psychological Corporation and to honor its founder, James McKeen Cattel. J. appl. Psychol., 1942a, 26, 16–17. Stanton, F., & Baker, K. H. Interviewer bias and the recall of incompletely learned materials. Sociometry, 1942, 5, 123–134. Star, Shirley A. The screening of psychoneurotics: comparison of psychiatric diagnoses and test scores at all induction stations. In S. A. Stouffer, L. Guttman, E. A. Suchman, P. F. Lazarsfeld, Shirley A. Star, & J. A. Clausen, Measurement and prediction. Princeton: Princeton Univer. Press, 1950. Pp. 548–567. Stember, C. H., & Hyman, H. H. How interviewer effects operate through question form. Int. J. Opin. Attit. Res., 1949, 3, 493–512. Stephens, J. M. The perception of small differences as affected by self interest. Amer. J. Psychol., 1936, 48, 480–484. Sterling, T. D. Publication decisions and their possible effects on inferences drawn from tests of significance—or vice versa. J. Amer. Statist. Assoc., 1959, 54, 30–34. Stevens, S. S. To honor Fechner and repeal his law. Science, 1961, 133, 80–86. Stevenson, H. W. Social reinforcement with children as a function of CA, sex of E, and sex of S. J. abnorm. soc. Psychol., 1961, 63, 147–154. Stevenson, H. W. Social reinforcement of children’s behavior. In L.P. Lipsitt, & C. C. Spiker, (Eds.), Advances in child development and behavior. Vol. 2. New York: Academic Press, 1965. Stevenson, H. W., & Allen, Sara. 
Adult performance as a function of sex of experimenter and sex of subject. J. abnorm. soc. Psychol., 1964, 68, 214–216. Stevenson, H. W., Keen, Rachel, & Knights, R. M. Parents and strangers as reinforcing agents for children’s performance. J. abnorm. soc. Psychol., 1963, 67, 183–186.
References
665 Stevenson, H. W., & Knights, R. M. Social reinforcement with normal and retarded children as a function of pretraining, sex of E, and sex of S. Amer. J. ment. Defic., 1962, 66, 866–871. Stevenson, H. W., & Odom, R. D. Visual reinforcement with children. Unpublished manuscript, Univer. of Minnesota, 1963. Stratton, G. M. The control of another person by obscure signs. Psychol. Rev., 1921, 28, 301–314. Strupp, H. H. Toward an analysis of the therapist’s contribution to the treatment process. Amer. Psychologist, 1959, 14, 336. (Abstract) Strupp, H. H., & Luborsky, L. (Eds.) Research in psychotherapy. Washington, D.C.: Amer. Psychol. Ass., 1962. Sullivan, H. S. A note on the implications of psychiatry, the study of interpersonal relations, for investigations in the social sciences. Amer. J. Sociol., 1936–37, 42, 848–861. Summers, G. F., & Hammonds, A. D. Toward a paradigm of respondent bias in survey research. Unpublished paper, Univer. of Wisconsin, 1965. Symons, R. T. Specific experimenter-subject personality variables pertinent to the influencing process in a verbal conditioning situation. Unpublished doctoral dissertation, Univer. of Washington, 1964. Symposium: Survey on problems of interviewer cheating. Int. J. Opin. attit. Res., 1947, 1, 93–106. Taffel, C. Anxiety and the conditioning of verbal behavior. J. abnorm. soc. Psychol., 1955, 51, 496–501. Taylor, Janet A. A personality scale of manifest anxiety. J. abnorm. soc. Psychol., 1953, 48, 285–290. Tolman, E. C. Purposive behavior in animals and men. New York: Century, 1932. Troffer, Suzanne A., & Tart, C. T. Experimenter bias in hypnotist performance. Science, 1964, 145, 1330–1331. Tuddenham, R. D. The view from Hovland Headland. Contemp. Psychol., 1960, 5, 150–151. Tukey, J. W. Data analysis and the frontiers of geophysics. Science, 1965, 148, 1283–1289. Turner, G. C., & Coleman, J. C. Examiner influence on thematic apperception test responses. J. proj. Tech., 1962, 26, 478–486. Turner, J. 
On being fair though one-sided. Science, 1961, 134, 585. (a) Turner, J. What laymen can ask of scientists. Science, 1961, 133, 1195. (b) Udow, A. B. The ‘‘interviewer effect’’ in public opinion and market research surveys. Arch. Psychol., 1942, 39, No. 277. Veroff, J. Anxious first-borns. Contemp. Psychol., 1960, 5, 328–329. Verplanck, W. S. The control of the content of conversation: Reinforcement of statements of opinion. J. abnorm. soc. Psychol., 1955, 51, 668–676. Vikan-Kline, L. The effect of an experimenter’s perceived status on the mediation of experimenter bias. Unpublished master’s thesis, Univer. of North Dakota, 1962. Vinacke, W. E. Laboratories and lives. Paper read at Amer. Psychol. Ass., Chicago, September, 1965. Wallach, M. S., & Strupp, H. H. Psychotherapist’s clinical judgments and attitudes toward patients. J. consult. Psychol., 1960, 24, 316–323. Wallin, P. Volunteer subjects as a source of sampling bias. Amer. J. Sociol., 1949, 54, 539–544. Walters, Cathryn, Parsons, O. A., & Shurley, J. T. Male-female differences in underwater sensory isolation. Brit. J. Psychiat., 1964, 109–110, 290–295. Walters, Cathryn, Shurley, J. T., & Parsons, O. A. Differences in male and female responses to underwater sensory deprivation: An exploratory study. J. nerv. ment. Dis., 1962, 135, 302–310. Ware, J. R., Kowal, B., & Baker, R. A., Jr. The role of experimenter attitude and contingent reinforcement in a vigilance task. Unpublished paper, U.S. Army Armor Human Research Unit, Fort Knox, Kentucky, 1963. Warner, L., & Raible, Mildred. Telepathy in the psychophysical laboratory. J. Parapsychol., 1937, 1, 44–51. Wartenberg-Ekren, Ursula. The effect of experimenter knowledge of a subject’s scholastic standing on the performance of a reasoning task. Unpublished master’s thesis, Marquette Univer., 1962. Weick, K. E. When prophecy pales: The fate of dissonance theory. Psychol. Rep., 1965, 16, 1261–1275. Weitzenhoffer, A. M., & Hilgard, E. R. 
Stanford Hypnotic Susceptibility Scale: Form C. Palo Alto, California: Consulting Psychologists Press, 1962.
666
Book Two – Experimenter Effects in Behavioral Research White, C. R. The effect of induced subject expectations on the experimenter bias situation. Unpublished doctoral dissertation, Univer. of North Dakota, 1962. Whitman, R. M. Drugs, dreams and the experimental subject. J. Canad. Psychiat. Assoc., 1963, 8, 395–399. Whyte, W. F. Street corner society. Chicago: Univer. of Chicago Press, 1943. Williams, F., & Cantril, H. The use of interviewer rapport as a method of detecting differences between ‘‘public’’ and ‘‘private’’ opinion. J. soc. Psychol., 1945, 22, 171–175. Williams, J. A. Interviewer-respondent interaction: A study of bias in the information interview. Sociometry, 1964, 27, 338–352. Williams, L. P. The Beringer hoax. Science, 1963, 140, 1083. Wilson, A. B. Social stratification and academic achievement. In A. H. Pas-sow (Ed.), Education in depressed areas. New York: Teachers College, Columbia Univer., 1963. Pp. 217–235. Wilson, E. B. An introduction to scientific research. New York: McGraw-Hill, 1952. Winer, B. J. Statistical principles in experimental design. New York: McGraw-Hill, 1962. Winkel, G. H., & Sarason, I. G. Subject, experimenter, and situational variables in research on anxiety. J. abnorm. soc. Psychol., 1964, 68, 601–608. Wirth, L. Preface to K. Mannheim, Ideology and Utopia. New York: Harcourt, Brace & World, 1936. Wolf, I. S. Perspectives in psychology, XVI. Negative findings. Psychol. Rec., 1961, 11, 91–95. Wolf, S. Human beings as experimental subjects. In S. O. Waife, & A. P. Shapiro (Eds.), The clinical evaluation of new drugs. New York: Hoeber-Harper, 1959. Pp. 85–99. Wolf, Theˆta H. An individual who made a difference. Amer. Psychologist, 1961, 16, 245–248. Wolf, Theˆta H. Alfred Binet: A time of crisis. Amer. Psychologist, 1964, 19, 762–771. Wolfle, D. Research with human subjects. Science, 1960, 132, 989. Wolins, L. Needed: Publication of negative results. Amer. Psychologist, 1959, 14, 598. Wolins, L. Responsibility for raw data. 
Amer. Psychologist, 1962, 17, 657–658. Wood, F. G. Pitfall. Science, 1962, 135, 261. Woods, P. J. Some characteristics of journals and authors. Amer. Psychologist, 1961, 16, 699–701. Wooster, H. Basic research. Science, 1959, 130, 126. Wuster, C. R., Bass, M., & Alcock, W. A test of the proposition: We want to be esteemed most by those we esteem most highly. J. abnorm. soc. Psychol., 1961, 63, 650–653. Wyatt, D. F., & Campbell, D. T. A study of interviewer bias as related to interviewers’ expectations and own opinions. Int. J. Opin. Attit. Res., 1950, 4, 77–83. Young, R. K. Digit span as a function of the personality of the experimenter. Amer. Psychologist, 1959, 14, 375. (Abstract) Yule, G. U. On reading a scale. J. Roy. statist. Soc., 1927, 90, 570–587. Yule, G. U., & Kendall, M. G. An introduction to the theory of statistics. (14th ed.) New York: Hafner, 1950. Zelditch, M. Role differentiation in the nuclear family: A comparative study. In T. Parsons, & R. F. Bales, Family, socialization and interaction process. New York: Free Press, 1955. Pp. 307–351. Zillig, Maria. Einstellung und Aussage. Z. Psychol., 1928, 106, 58–106. (Translated by Irene Jerison.) Zirkle, C. Citation of fraudulent data. Science, 1954, 120, 189–190. Zirkle, C. Pavlov’s beliefs. Science, 1958, 128, 1476. Zirkle, C. A conscience in conflict. Book review in Science, 1960, 132, 890. Zirkle, G. A. Some potential errors in conducting mental health research. In R. H. Branson (Ed.), Report of the conference on mental health research. Indianapolis: Assoc. Advance. Ment. Hlth. Res. Educ, 1959. Znaniecki, F. The social role of the man of knowledge. New York: Columbia Univer. Press, 1940.
BOOK THREE THE VOLUNTEER SUBJECT Robert Rosenthal and Ralph L. Rosnow
Preface
The idea is not new that persons volunteering for behavioral research may not be fully adequate models for the study of human behavior in general. Anthropologists, economists, political scientists, psychologists, sociologists, and statisticians are by now well aware of the problem of volunteer bias. In Bill McGuire’s (1969) terms, we have almost passed out of the ignorance stage in the life of this particular artifact. Our purpose in writing this book was to tell what is known about the volunteer subject so that:
1. The complete passing of the stage of ignorance of this artifact would be hastened,
2. We might dwell more productively on McGuire’s second stage in the life of an artifact, the coping stage, and
3. A beginning might be made in nudging our knowledge of volunteering as an artifact to McGuire’s final stage, the stage of exploitation.
We hope our book will help ensure that the problem of the volunteer subject will not only be recognized and dealt with but also that volunteering for research participation may come to be seen as behavior of interest in its own right and not simply as a source of artifact.
To the extent that volunteers differ from nonvolunteers, the employment of volunteer subjects can have serious effects on estimates of such parameters as means, medians, proportions, variances, skewness, and kurtosis. In survey research, where estimation of such parameters is the principal goal, biasing effects of volunteer subjects could be disastrous. In a good deal of behavioral research, however, interest is centered less on such statistics as means and proportions and more on such statistics as differences between means and proportions. The experimental investigator is ordinarily interested in relating such differences to the operation of his independent variable. The fact that volunteers differ from nonvolunteers in their scores on the dependent variable may be of little interest to the behavioral experimenter. He may want more to know whether the magnitude of the difference between his experimental and control group means would be affected if he used volunteers. In other words, he may be
interested in knowing whether volunteer status interacts with his experimental variable. In due course, we shall see that such interactions do indeed occur, and that their occurrence does not depend upon the main effects of volunteering.
The audience to whom this book is addressed is made up of behavioral and social scientists, their graduate students, and their most serious undergraduate students. Although psychologists and sociologists may be the primary readers, we believe that anthropologists, economists, political scientists, statisticians, and others of, or friendly to, the behavioral sciences may be equally interested.
One way in which this book differs from most other empirically oriented works in the behavioral and social sciences is in its strong emphasis on effect sizes. Only rarely have we been content to specify that a relationship was “significant.” Whenever practical we have tried to include estimates of the size of various effects. Quite apart from the particular topic of this book we believe that routinely specifying effect sizes may be a generally useful procedure in the behavioral sciences. When specifying effect sizes we have most often employed r or σ as the unit of measurement. To give a rough indication of whether a particular effect was “large,” “medium,” or “small” we follow Cohen’s (1969) guidelines: Large effects are defined as r ≥ .50 or ≥ .80σ; medium effects are defined as r ≅ .30 or ≅ .50σ; and small effects are defined as r ≅ .10 or ≅ .20σ.
Our research on the volunteer subject was supported by grants from the Division of Social Sciences of the National Science Foundation, which provided us the flexibility to pursue leads to whatever ends seemed most promising. Both of us were on leave from our home institutions during the academic year in which this book was written, 1973–1974, and we express our appreciation to Harvard University and Temple University for this time that was given us. 
One of us (RR) was on sabbatical leave for the full academic year and supported in part by a fellowship from the Guggenheim Foundation; the other (RLR) was on study leave during the first term while an academic visitor at the London School of Economics and then on leave of absence while a visiting professor at Harvard University during the second term. We are grateful for the support provided by these institutions and especially to the following individuals: Hilde Himmelweit, Brendan Maher, and Freed Bales. For permission to reprint figures and other copyright data, we thank American Psychologist, Educational and Psychological Measurement, Journal of Experimental Research in Personality, Journal of Experimental Social Psychology, Journal of Personality and Social Psychology, and Psychological Reports. For their devoted and expert typing we thank Donna DiFurio and Mari Tavitian. We are also grateful to Zick Rubin for his helpful suggestions. And finally, we thank our best friends Mary Lu and Mimi for the many ways in which they improved the book—and its authors.
Robert Rosenthal
Ralph L. Rosnow
June 1974
1 Introduction
There is a growing suspicion among behavioral researchers that those human subjects who find their way into the role of research subject may not be entirely representative of humans in general. McNemar put it wisely when he said, “The existing science of human behavior is largely the science of the behavior of sophomores” (1946, p. 333). Sophomores are convenient subjects for study, and some sophomores are more convenient than others. Sophomores enrolled in psychology courses, for example, get more than their fair share of opportunities to play the role of the research subject whose responses provide the basis for formulations of the principles of human behavior. There are now indications that these “psychology sophomores” are not entirely representative of even sophomores in general (Hilgard, 1967), a possibility that makes McNemar’s formulation sound unduly optimistic. The existing science of human behavior may be largely the science of those sophomores who both enroll in psychology courses and volunteer to participate in behavioral research. The extent to which a useful, comprehensive science of human behavior can be based upon the behavior of such self-selected and investigator-selected subjects is an empirical question of considerable importance. It is a question that has received increasing attention in recent years (e.g., Adair, 1973; Bell, 1961; Chapanis, 1967; Damon, 1965; Leslie, 1972; Lester, 1969; London and Rosenhan, 1964; Maul, 1970; Ora, 1965; Parten, 1950; Rosenhan, 1967; Rosenthal, 1965; Rosenthal and Rosnow, 1969; Rosnow and Rosenthal, 1970; Schappe, 1972; Silverman and Margulis, 1973; Straits and Wuebben, 1973; Wells and Schofield, 1972; Wunderlich and Becker, 1969). Some of this recent interest has been focused on the actual proportion of human research subjects that are drawn from the collegiate setting. Table 1-1 shows the results of six studies investigating this question during the 1960s. 
The range is from 70% to 90%, with a median of 80%. Clearly, then, the vast majority of human research subjects have been sampled from populations of convenience, usually the college or university in which the investigator is employed. The studies shown in Table 1-1 span only an eight-year period, a period too short for the purpose of trying to discern a trend. Nevertheless, there is no reason for optimism based on a comparison of the earliest with the latest percentages. If anything, there may have
Table 1–1 College Students as a Percentage of Total Subjects Employed

Author                    Source                                         Years       % College Students
Smart (1966) I            Journal of Abnormal and Social Psychology      1962–1964   73
Smart (1966) II           Journal of Experimental Psychology             1963–1964   86
Schultz (1969) I          Journal of Personality and Social Psychology   1966–1967   70
Schultz (1969) II         Journal of Experimental Psychology             1966–1967   84
Jung (1969)               Psychology Department Survey                   1967–1968   90
Higbee and Wells (1972)   Journal of Personality and Social Psychology   1969        76
Median                                                                               80
been a rise in the percentage of human subjects who were college students. The concern over the use of college students as our model of persons in general is based not only on the very obvious differences between college students and more representative persons in age, intelligence, and social class but also on the suspicion that college students, because of their special relationship with the teacher-investigator, may be especially susceptible to the operation of the various mechanisms that together constitute what has been called the social psychology of behavioral research (e.g., Adair, 1973; Fraser and Zimbardo, n.d.; Haas, 1970; Lana, 1969; Lester, 1969; McGuire, 1969a; Orne, 1969, 1970; Rosenberg, 1969; Straits and Wuebben, 1973; Straits, Wuebben, and Majka, 1972). Although most of the interest in, and concern over, the problem of subject selection biases has been centered on human subjects, we should note that analogous interests and concerns have been developing among investigators employing animal subjects. Just as the college student may not be a very good model for the “typical” person, so the laboratory rat may not be a very good model for the typical rodent nor for a wild rat, nor even for another laboratory rat of a different strain (e.g., Beach, 1950, 1960; Boice, 1973; Christie, 1951; Ehrlich, 1974; Eysenck, 1967; Kavanau, 1964, 1967; Richter, 1959; Smith, 1969).
Volunteer Bias
The problem of the volunteer subject has been of interest to many behavioral researchers and evidence of their interest will be found in the chapters to follow. Fortunately for those of us who are behavioral researchers, however, mathematical statisticians have also devoted attention and effort to problems of the volunteer subject (e.g., Cochran, 1963; Cochran, Mosteller, and Tukey, 1953; Deming, 1944; Hansen and Hurwitz, 1946). Their work has shown the effects of the proportion of the population who select themselves out of the sample by not volunteering or responding, on the precision of estimates of various population values. These results are depressing to say the least, showing as they do how rapidly the margin of error increases as the proportion of nonvolunteers increases even moderately. In the present volume we shall not deal with the technical aspects of sampling and “nonresponse” theory, but it will be useful at this point to give in quantitative terms an example of volunteer bias in operation.
Table 1–2 Example of Volunteer Bias in Survey Research

                                      Response to Three Mailings
                                      First   Second   Third   Nonrespondents   Total Population
Basic data
  Number of respondents               300     543      434     1839             3116
  % of population                     10      17       14      59               100
  Mean trees per respondent           456     382      340     290              329
Cumulative data
  Mean trees per respondent (Y1)      456     408      385     –                –
  Mean trees per nonrespondent (Y2)   315     300      290     –                –
  Difference (Y1 – Y2)                141     108      95      –                –
  % of nonrespondents (P)             90      73       59      –                –
  Bias [(P) × (Y1 – Y2)]              127     79       56      –                –
The basic data were presented by Cochran (1963) in his discussion of nonresponse bias. Three waves of questionnaires were mailed out to fruit growers, and the number of growers responding to each of the three waves was recorded, as was the remaining number of growers who never responded. One of the questions dealt with the number of fruit trees owned, and for just this question, data were available for the entire population of growers. Because of this fortunate circumstance, it was possible to calculate the degree of bias due to nonresponse, or nonvolunteering, present after the first, second, and third waves of questionnaires. Table 1-2 summarizes these calculations and gives the formal definition of volunteer bias. The first three rows of Table 1-2 give the basic data provided by Cochran: (1) the number of respondents to each wave of questionnaires and the number of nonrespondents, (2) the percentage of the total population represented by each wave of respondents and by the nonrespondents, and (3) the mean number of trees actually owned by each wave of respondents and by the nonrespondents. Examination of this third row reveals the nature of the volunteer bias: the earlier responders owned more trees, on the average, than did the later responders. The remaining five rows of data are based on the cumulative number of respondents available after the first, second, and third waves. For each of these waves, five items of information are provided: (1) the mean number of trees owned by the respondents up to that point in the survey (Y1); (2) the mean number of trees owned by those who had not yet responded up to that point in the survey (Y2); (3) the difference between these two values (Y1 – Y2); (4) the percentage of the population that had not yet responded up to that point in the survey (P); and (5) the magnitude of the bias up to that point in the survey, defined as P × (Y1 – Y2). 
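The cumulative rows of Table 1-2 can be recomputed from the three basic rows. The following is a minimal sketch in Python, assuming that Y1 and Y2 are respondent-weighted means and that P enters the bias formula as a proportion rather than a percentage; the names `counts`, `mean_trees`, `weighted_mean`, and `bias_after_wave` are illustrative and do not come from Cochran or from the text.

```python
# Basic data from Table 1-2: three mailing waves, then the nonrespondents.
counts = [300, 543, 434, 1839]      # number of growers in each group
mean_trees = [456, 382, 340, 290]   # mean trees per grower in each group


def weighted_mean(ns, means):
    """Mean of group means, weighted by group size."""
    return sum(n * m for n, m in zip(ns, means)) / sum(ns)


def bias_after_wave(k):
    """Return (Y1, Y2, P, bias) after the first k mailing waves.

    Y1: mean for everyone who has responded so far.
    Y2: mean for everyone who has not yet responded.
    P:  proportion of the population not yet responding.
    """
    y1 = weighted_mean(counts[:k], mean_trees[:k])
    y2 = weighted_mean(counts[k:], mean_trees[k:])
    p = sum(counts[k:]) / sum(counts)
    return y1, y2, p, p * (y1 - y2)


for k in (1, 2, 3):
    y1, y2, p, bias = bias_after_wave(k)
    print(f"after wave {k}: Y1={y1:.0f}  Y2={y2:.0f}  P={p:.0%}  bias={bias:.0f}")
```

Rounded to whole trees, the bias values from this sketch agree with the Bias row of Table 1-2 (127, 79, and 56). Under these assumptions P × (Y1 – Y2) is algebraically the same as Y1 minus the population mean, so the bias is simply the amount by which the respondents' mean overstates the true mean.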
Examination of this last row shows that with each successive wave of respondents there was an appreciable decrease in the magnitude of the bias. This appears to be a fairly typical result of studies of this kind: increasing the effort to recruit the nonvolunteer decreases the bias in the sample estimates. Cochran gives considerable advice on how to minimize bias, given the usually greater cost of trying to recruit a nonvolunteer compared to the cost of recruiting a more willing respondent. Before leaving the present example of the calculation of bias, we should note again that in most circumstances of behavioral research we can compute the proportion of our population who fail to participate (P) and the statistic of interest for those who
volunteer their data (Y1); but we cannot compute the statistic of interest for those who do not volunteer (Y2) so that it is often our lot to be in a position to suspect bias but to be unable to give an accurate quantitative statement about its magnitude. Put most simply, then, the concern over the volunteer problem has had for its goal the reduction of bias, or unrepresentativeness, of volunteer samples so that investigators might increase the generality of their research results (e.g., Ferber, 1948–1949; Ford and Zeisel, 1949; Hyman and Sheatsley, 1954; Locke, 1954). The magnitude of the problem is not trivial. The potential biasing effects of using volunteer samples have been illustrated recently and clearly. At one large university, rates of volunteering varied from 100% down to 10%. Even within the same course, different recruiters visiting different sections of the course obtained rates of volunteering varying from 100% down to 50% (French, 1963). At another university, rates of volunteering varied from 74% down to 26% when the same recruiter, extending the same invitation to participate in the same experiment, solicited female volunteers from different floors of the same dormitory (Marmer, 1967). A fuller picture of the rates of volunteering obtained in various kinds of studies will be presented in Chapter 2 (e.g., Tables 2-1, 2-2, and 2-3). Some reduction of the volunteer sampling bias might be expected from the fairly common practice of requiring psychology undergraduates to spend a certain number of hours serving as research subjects. Such a requirement gets more students into the overall sampling urn, but without making their participation in any given experiment a randomly determined event (e.g., Johnson, 1973b; King, 1970; MacDonald, 1972b). Students required to serve as research subjects often have a choice between alternative experiments. Given such a choice, will brighter (or duller) students sign up for an experiment on learning?
Will better (or more poorly) adjusted students sign up for an experiment on personality? Will students who view their consciousness as broader (or narrower) sign up for an experiment that promises an encounter with ‘‘psychedelicacies’’? We do not know the answers to these questions very well, nor do we know whether these possible self-selection biases would necessarily make any difference in the inferences we want to draw. If the volunteer problem has been of interest and concern in the past, there is good evidence to suggest that it will become of even greater interest and concern in the future. Evidence of this comes from the popular press and the technical literature, and it says to us that in the future investigators may have less control than ever before over the kinds of human subjects who find their way into research. The ethical questions of humans’ rights to privacy and to informed consent are more salient now than ever before (Adair, 1973; Adams, 1973; Bean, 1959; Clark et al., 1967; Etzioni, 1973; Katz, 1972; Martin, Arnold, Zimmerman, and Richart, 1968; May, Smith, and Morris, 1968; Miller, 1966; Orlans, 1967; Rokeach, 1966; Ruebhausen and Brim, 1966; Sasson and Nelson, 1969; Steiner, 1972; Sullivan and Deiker, 1973; Trotter, 1974; Walsh, 1973; Wicker, 1968; Wolfensberger, 1967; Wolfle, 1960). One possible outcome of this unprecedented soul-searching is that the social science of the future may, because of internally and perhaps externally imposed constraints, be based upon propositions whose tenability will come only from volunteer subjects who have been made fully aware of the responses of interest to the investigator. However, even without this extreme consequence of the ethical crisis of the social sciences, we still will want to learn as much as we can about the external
circumstances and the internal characteristics that bring any given individual into our sample of subjects or keep him out. Our purpose in the chapters to follow will be to say something about the characteristics that serve to differentiate volunteers for behavioral research from nonvolunteers and to examine in some detail what is known about the motivational and situational determinants of the act of volunteering. Subsequently we shall consider the implications of what is now known about volunteers and volunteering for the representativeness of the findings of behavioral research. We shall give special attention to experimental research that suggests that volunteer status may often interact with the independent variables employed in a variety of types of research studies. There is increasing reason to believe that such interactions may well occur, although the effects will not always be very large (Cope and Kunce, 1971; Cox and Sipprelle, 1971; Marlatt, 1973; Oakes, 1972; Pavlos, 1972; Short and Oskamp, 1965). Finally, an artifact influence model will be described that will help to provide an integrative overview.
The Reliability of Volunteering
Before we turn to a consideration of characteristics differentiating volunteers from nonvolunteers, it will be useful to consider the evidence for the reliability of the act of volunteering. If volunteering were a purely random event (i.e., completely unreliable), we could not expect to find any stable relationships between volunteering and various personal characteristics. As we shall see in the next chapter, however, there are a good many characteristics that have been found to relate predictably to the act of volunteering. The reliability of volunteering, then, could not reasonably be expected to be zero on psychometric grounds alone. To check these psychometric expectations, Table 1-3 was constructed. In some of the studies listed, the reliabilities had been computed, while in the remaining studies sufficient data were provided for us to be able to estimate the reliabilities ourselves.
Table 1–3 The Reliability of Volunteering Behavior

Author                                  Index      Magnitude   p      Type of Study
1. Barefoot (1969) I                    rpb        .45         .001   Various experiments
2. Barefoot (1969) II                   rpb        .42         .001   Various experiments
3. Dohrenwend and Dohrenwend (1968)     [unclear]  .24 a, b    .02    Interviews
4. Laming (1967)                        r          .22 b       .05    Choice-reaction
5. Martin and Marcuse (1958) I          rt         .91 a       .001   Learning
6. Martin and Marcuse (1958) II         rt         .80 a       .001   Personality
7. Martin and Marcuse (1958) III        rt         .67 a       .001   Sex
8. Martin and Marcuse (1958) IV         rt         .97 a       .001   Hypnosis
9. Rosen (1951)                         [unclear]  .34         .05    Personality
10. Wallace (1954)                      C          .58         .001   Questionnaires
Median                                             .52         .001

a In these studies the second request was to volunteer for the same research as the first request.
b All subjects had been volunteers at one time, so that the reliabilities were probably lowered by a restriction of range of the volunteering variable.
Table 1-3 shows that the empirical evidence is consistent with our psychometric expectations. For the 10 studies shown, the median reliability coefficient was .52, with a range going from .97 to .22. As a standard against which to compare these reliability coefficients, we examined the subtest intercorrelations for what is perhaps the most widely used and carefully developed test of intelligence, the Wechsler Adult Intelligence Scale (WAIS) (Wechsler, 1958). Repeated factor analyses of this test have shown that there is a very large first factor, g, that in its magnitude swamps all other factors extracted, typically accounting for 10 times more variance than other factors. The full-scale WAIS, then, is an excellent measure of this first factor, or g. The range of subtest intercorrelations for the WAIS is reported by Wechsler to go from .08 to .85, with a median of .52, which, by coincidence, is the median value of the reliabilities of volunteering shown in Table 1-3. Of the 10 studies shown in Table 1-3, 8 requested volunteering for studies to be conducted in the laboratory, while only 2 requested cooperation in survey research conducted in the field. The median reliability of the laboratory studies was .56, while the median for the field studies was .41. Because there were only 2 studies in the latter group, however, we can draw no conclusions about this difference. The types of studies for which volunteering had been requested were quite variable for the samples shown in Table 1-3 and there were not enough studies in any one category to permit inferences about the type of research task for which volunteering was likely to be particularly reliable or unreliable. Also, as indicated in the footnote to Table 1-3, 5 of the studies requested people to volunteer a second time for the same task. In the remaining 5 studies the second and subsequent requests were for volunteering for a different study. 
If there are propensities to volunteering, then surely these propensities should be more stable when persons are asked to volunteer for the same, rather than a different, type of research experience. The data of Table 1-3 bear out this plausible inference. The median reliability for studies requesting volunteers for the same task is .80, while the median reliability for studies requesting volunteers for different tasks is only .42. Both these median reliabilities are very significantly different from zero, however (p much less than .001). Although 10 studies are not very many upon which to base such a conclusion, these results suggest that volunteering may have both general and specific predictors. Some people volunteer reliably more than others for a variety of tasks, and these reliable individual differences may be further stabilized when the particular task for which volunteering was requested is specifically considered.
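The medians quoted in this section can be recomputed directly from the magnitudes listed in Table 1-3. The short Python sketch below does so. The grouping into same-task studies follows footnote a of the table; the grouping into field studies (the interview and questionnaire studies, numbers 3 and 10) is our reading of the "Type of Study" column and should be treated as an assumption, as should the helper name `median_for`.

```python
from statistics import median

# Reliability coefficients from Table 1-3, keyed by study number.
RELIABILITY = {1: .45, 2: .42, 3: .24, 4: .22, 5: .91,
               6: .80, 7: .67, 8: .97, 9: .34, 10: .58}
# Footnote a: the second request was for the same research as the first.
SAME_TASK = {3, 5, 6, 7, 8}
# ASSUMPTION: the two field (survey) studies are numbers 3 and 10.
FIELD = {3, 10}

ALL = set(RELIABILITY)


def median_for(study_numbers):
    """Median reliability over the given study numbers."""
    return median(RELIABILITY[n] for n in study_numbers)


if __name__ == "__main__":
    groups = {
        "all studies": ALL,
        "same task": SAME_TASK,
        "different task": ALL - SAME_TASK,
        "laboratory": ALL - FIELD,
        "field": FIELD,
    }
    for name, numbers in groups.items():
        print(f"{name:15s} n={len(numbers):2d} median={median_for(numbers):.3f}")
```

With an even number of studies the median is the mean of the two middle values, which is why the overall figure falls between two tabled coefficients rather than on one of them.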
2 Characteristics of the Volunteer Subject
In this chapter we consider the evidence that volunteer subjects differ from their reluctant peers in more or less systematic ways. We shall proceed attribute by attribute from those that are regarded as biosocial through those that are regarded as psychosocial, to those that must be seen as geosocial. Our inventory of attributes was developed inductively rather than deductively. Our question, put to the archives of social science (i.e., the ‘‘literature’’), was, in what ways have volunteers been found to differ or not to differ from nonvolunteers? Volunteers have been compared to nonvolunteers in many ways, but there is to be found in these hundreds of comparisons an underlying strategy: to discover those attributes in which volunteers differ from nonvolunteers that have a special likelihood of being related to the respondents’ or subjects’ responses on the dependent variable of ultimate interest. Thus, in survey research, return rates of questionnaires by males and females have been compared not simply because sex is easy to define but because males and females have often been found to have different attitudes, opinions, and beliefs. If they did not, it would make little difference in terms of survey results that women may be overrepresented at least in the early waves of response to a mailed questionnaire. In psychological research on the effects of sensory restriction, volunteers have been compared to nonvolunteers on the attribute of sensation-seeking, precisely because sensation-seekers may plausibly be regarded as likely to respond differently to sensory restriction than those less in search of new sensations. We return to these implications of volunteer status for the interpretation of research findings in Chapter 4.
Assessing the Nonvolunteer
A general question may occur to those readers not familiar with the specific literatures of volunteer effects or response bias: How does one find out the attributes of those who do not volunteer to participate in psychological experiments, those who do not answer their mail or their telephone or their doorbell? It seems difficult
enough to determine the sex and age of the person not answering the doorbell; how does one get his or her score on the California F Scale? A number of procedures have been found useful in comparing characteristics of those more or less likely to find their way into the role of data producer for the social or behavioral scientist. These methods can be grouped into one of two types, the exhaustive and the nonexhaustive. In the exhaustive method, all potential subjects or respondents are identified by their status on all the variables on which volunteers and nonvolunteers are to be compared. In the nonexhaustive method, data are not available for all potential subjects or respondents, but they are available for subjects or respondents differing in likelihood of finding their way into a final sample. There follow some examples of each of these classes of methods.
Exhaustive Methods
Archive Based Sampling Frames (Test, then Recruit).
In this method, the investigator begins with an archive containing for each person listed all the information desired for a comparison between volunteers and nonvolunteers. Requests for volunteers are then made some time later, sometimes years later, and those who volunteer are compared to those who do not volunteer, on all the items of information in the archive in which the investigator is interested. Many colleges, for example, administer psychological tests and questionnaires to all incoming freshmen during an orientation period. These data can then be used not only to compare those who volunteer with those who do not volunteer for a psychological experiment later that same year, but also to compare respondents with nonrespondents to an alumni-organization questionnaire sent out 10 years later.
Sample Assessment (Recruit, then Test).
In this method, volunteers for behavioral research are solicited, usually in a college classroom context, so that volunteers and nonvolunteers can be identified. Shortly thereafter, a test and/or a questionnaire is administered to the entire class by someone ostensibly unrelated to the person who recruited the volunteers. Volunteers can then be compared to nonvolunteers on any of the variables measured in the classwide testing or surveying.
Nonexhaustive Methods
Second-Level Volunteering (the Easy to Recruit).
In this method all the subjects or respondents are volunteers to begin with, and any required data can easily be obtained from all. From this sample of volunteers, volunteers for additional research are then recruited and these second-level, or second-stage, volunteers can be compared to second-level nonvolunteers on the data available for all. Differences between the second-level volunteers and nonvolunteers are likely to underestimate the differences between volunteers and true nonvolunteers, however, since even the second-level nonvolunteers had at least been first-level volunteers. This problem of underestimation of volunteer–nonvolunteer differences is generally characteristic of all the nonexhaustive methods. These methods, by virtue of their not including true nonvolunteers, all require extrapolation on a gradient of volunteering. The underlying assumption, not unreasonable yet no doubt often wrong, is that those who volunteer repeatedly, those who volunteer with less incentive, or those who volunteer more quickly are further down the curve from those who volunteer less often, with
Figure 2–1 Hypothetical illustration of extrapolating probable characteristics of nonvolunteers from characteristics of volunteers found at various levels of volunteering [vertical axis: need for social approval; horizontal axis: nonvolunteer, first-level volunteer, second-level volunteer, third-level volunteer]
more incentive, or more slowly, and still further down the curve from those who do not volunteer at all. Thus, two or more levels of volunteering eagerness are employed to extrapolate roughly to the zero level of volunteering. If, for example, repeated volunteers were higher in the need for social approval than one-time volunteers, it might be guessed that the nonvolunteers would be lower still in need for social approval. Figure 2-1 shows such a theoretical extrapolation.
Increasing the Incentive (the Hard to Recruit).
In this method, volunteers are solicited from some sampling frame or list. After a suitable interval, another request for volunteers is made of those who were identified originally as nonvolunteers, and this process may be repeated five or six times. Characteristics of those volunteering at each request are then plotted as data points from which a tentative extrapolation may be made to the characteristics of those who never respond. This method is frequently employed in survey research, and, in general, it is possible to get an increasingly sharp picture of the nonrespondent with successive waves of requests for participation in the survey. This method, it may be noted, is something like the reverse of the method of second-level volunteering. The latter method generates its data points by finding those more and more willing to participate, while the method of increasing the incentive generates its data points by finding those more and more unwilling to participate. An example of the successful application of the method of increasing the incentive was the case of volunteer bias described in Chapter 1. The data shown in Table 1-2 are plotted in Figure 2-2. In this case the mean number of trees owned by nonrespondents would have been quite accurately extrapolated from the data provided by the respondents to the earlier waves.
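To make the idea of extrapolating from successive waves concrete, the following Python sketch fits a straight line to the three wave means plotted in Figure 2-2 and projects it one step further. Two assumptions are ours and not part of the original analysis: that a least-squares straight line is an adequate extrapolation rule, and that the nonrespondents can be treated as a notional fourth wave. The helper name `fit_line` is also hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


WAVES = [1, 2, 3]
MEAN_TREES = [456.0, 382.0, 340.0]   # wave means plotted in Figure 2-2
OBSERVED_NONRESPONDENTS = 290.0      # nonrespondent mean plotted in Figure 2-2

# ASSUMPTION: nonrespondents are placed at x = 4, one step past the third wave.
slope, intercept = fit_line(WAVES, MEAN_TREES)
predicted = intercept + slope * 4

if __name__ == "__main__":
    print(f"slope per wave: {slope:.1f}")
    print(f"extrapolated nonrespondent mean: {predicted:.1f}")
    print(f"observed nonrespondent mean: {OBSERVED_NONRESPONDENTS:.1f}")
```

Under these assumptions the line projects a nonrespondent mean of about 277 trees against the 290 shown in Figure 2-2, a miss that is smaller than the drop between any two successive waves.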
Latency of Volunteering (the Slow to Reply).
In this method only a single request for volunteers is issued, but the latency of the volunteering response is recorded.
Table 2–1 Studies Showing Higher Overall Rates of Volunteering by Females

                                                                                  Percentage Volunteering
Author                                          Task                              Females  Males  Two-Tail p of Difference
Barker and Perlman (1972)                       Questionnaire (overall)           76       62     .03
                                                  a. Parent–child relations       80       58     .03
                                                  b. Sexual behavior              73       64     .38
Ellis, Endo, and Armer (1970)                   Questionnaire                     50       32     .001
Gannon, Nothern, and Carroll (1971)             Questionnaire                     66       57     .04
Hill, Rubin, and Willard (1973)                 Questionnaire (overall)           53       40     .003
                                                  a. State college                39       35     .22
                                                  b. Small private univ.          52       41     .001
                                                  c. Catholic univ.               46       34     .001
                                                  d. Large private univ.          63       45     .001
Himelstein (1956)                               Psychology experiment             65       43     .02
MacDonald (1972a)                               Psychology experiment (overall)   74       59     .001
                                                  a. Pay incentive                73       48     .002
                                                  b. Love of science              61       56     .52
                                                  c. Extra credit                 89       73     .04
Mann (1959)                                     Questionnaire                     78       70     .05
Mayer and Pratt (1966)                          Questionnaire                     78       73     .003
Newman (1956)                                   Psychology experiments (overall)  60       41     .001
                                                  a. Perception experiment        60       39     .001
                                                  b. Personality experiment       59       45     .13
Ora (1966)                                      Psychology experiments            66       54     .001
Philip and McCulloch (1970)                     Questionnaire                     71       47     .05
Pucel, Nelson, and Wheeler (1971)               Questionnaire (overall)           60       42     .001
                                                  a. No incentives                49       28     .08
                                                  b. One incentive                59       38     .001
                                                  c. Two incentives               62       41     .001
                                                  d. Three incentives             66       58     .36
Rosnow, Holper, and Gitter (1973)               Psychology experiment             47       34     .006
Rosnow and Rosenthal (1966)                     Perception experiment             48       13     .02
Rosnow, Rosenthal, McConochie, and Arms (1969)  Psychology experiments            27       10     .002
Rothney and Mooren (1952)                       Questionnaire                     62       49     .001
Schubert (1964)                                 Psychology experiment             60       44     .001
Sheridan and Shack (1970)                       Sensitivity training research     42       20     .06
Weiss (1968)                                    Psychology experiment             81       61     .02
Wicker (1968b)                                  Questionnaire                     56       38     .12
Characteristics of those responding at each of two or more levels of latency are then employed as data points from which to extrapolate to the characteristics of those who do not volunteer at all. This method has been used primarily in survey research, and it appears to have some promise. Nevertheless, it is probably less effective as a basis for extrapolating to nonvolunteers or nonrespondents than the method of increasing the incentive. This method can be combined with the method of increasing the
Figure 2–2 Illustration of the method of increasing the incentive based on data of Table 1-2 [vertical axis: mean trees per respondent; horizontal axis: response wave; plotted values: first wave, 456; second wave, 382; third wave, 340; nonrespondents, 290]
incentive, and the trends within waves or requests can be compared with the trends between waves or requests. Figure 2-3 shows a theoretical extrapolation to the nonrespondent from an analysis of both the between and within wave variation of respondent characteristics. Figure 2-3 was drawn in such a way as to suggest that the within-first-wave latency could give as good an extrapolation to the nonrespondent, or to the very late respondent, as the between-wave data. That is surely an unduly optimistic view of the situation, but it illustrates the principle involved. In a real-life example it is entirely possible for the within-wave slopes to be opposite in sign from the between-wave slopes. We turn now to a consideration of the various characteristics that have been found to be associated with volunteering in studies employing some of the methods described. In our review of the literature we have tried to identify the main threads of the literatures that differentiate volunteers from nonvolunteers.
Sex
It will become readily apparent that there are few characteristics that unequivocally differentiate volunteers from nonvolunteers. The first attribute we consider, sex, will serve as an illustration. There are studies showing that females volunteer more
Figure 2–3 Hypothetical illustration of extrapolating probable characteristics of nonvolunteers from characteristics of volunteers responding early and late within each of three waves of requests for participation [vertical axis: level of education; horizontal axis: response wave (first wave, second wave, third wave, nonrespondents); separate points mark early and late response within each wave]
than males; there are studies showing that males volunteer more than females; and there are studies showing no difference between the sexes in their likelihood of volunteering for participation in behavioral research. Table 2-1 shows the results of studies in which females showed a significantly greater likelihood of volunteering than did males. Some of these were series of studies, in which cases the table shows the overall results of the series as well as the results of each of the independent studies. Thus, the first entry in Table 2-1, research by Barker and Perlman (1972), found that for two studies employing questionnaires, 76% of the women and 62% of the men responded (p < .03). One of the questionnaires dealt with parent–child relations, and one dealt with sexual behavior standards. In the first of these studies, there was a difference of 22% in response rates favoring women, while in the second study, the response rate favored women by only 9%. As it turns out, this second study is one of the few in which there was a higher rate of responding or volunteering for women in a study investigating sexual behavior, and even here there was an appreciable drop in the difference between female and male volunteer rates compared to that difference for the non–sex-behavior questionnaire. In several cases in this table and elsewhere exact probabilities much higher than the conventional .05 level are shown instead of symbolizing such probabilities in the perhaps more familiar way simply as ns (for nonsignificant at .05). We provided these exact probabilities to facilitate the estimation of power by interested readers (Cohen, 1969) and to enable interested readers to combine probabilities should they wish to do so (Mosteller and Bush, 1954).
In the research by Hill, Rubin, and Willard (1973), students at four colleges were invited to participate in a study of dating couples. Overall, women replied more often than did men to the first request (53% versus 40%), but the size of this sex difference in response rates varied considerably among the four colleges. Thus, at a state college the volunteering rate among females was only 4% higher than among males, while at a large private university the volunteering rate was 18% higher among females than among males. In this study, a number of students replying to the initial inquiry were invited to participate further, along with their dating partner. Of these students, however, only 46% of the females actually did participate, while 56% of the males did so. At first glance this result might appear to be a reversal of the finding that females are more willing to participate in behavioral research. Such a conclusion might be premature, however. All the students invited to participate had already agreed to do so; therefore, their actual participation as a couple may have depended primarily upon their partner’s willingness to participate. The higher participation rate among the male students, then, may reflect the greater willingness of their female partners to participate, compared with the willingness of the female students’ male partners. We have seen now that even when the overall results show women more likely than men to volunteer, the magnitude of the difference can be affected both by the nature of the task for which volunteering is solicited and by characteristics of the target sample. The study by MacDonald (1972a) shows that, in addition, these magnitudes can be affected by the nature of the incentives offered to the potential volunteers.
Thus, when volunteers were solicited only on the basis of their ‘‘love of science,’’ women were only slightly more likely (5%) to volunteer than men, but when pay was offered, women were very much more likely (25%) to volunteer than men. There are a number of other studies that also show women more likely than men to volunteer for behavioral research. These studies are not included in Table 2-1, because the percentages of females or of males responding were not given or because the significance of the difference between percentages was not available. Studies by Lowe and McCormick (1955), Thistlethwaite and Wheeler (1966), and Tiffany, Cowan, and Blinn (1970) show higher rates of responding (or responding earlier) by females to survey questionnaires, while research by May, Smith, and Morris (1968) showed higher rates of volunteering by females for psychological research. Related to these results are those obtained by Rosen (1951) and Schubert (1964), both of whom found males more likely to volunteer for standard experiments if they showed greater femininity of interests. Table 2-2 shows the results of studies in which males showed a significantly greater likelihood of volunteering than did females. The study by Britton and Britton (1951) is unusual in that it is one of the few studies showing that in a mailed questionnaire survey, men respond more than women. Their study surveyed retired teachers and found that those who had held administrative positions or who had taught at the college level were more likely to respond than those who had taught at the high school or elementary school level. It seems likely that men were overrepresented in the administrator and college-teacher categories so that it may have been this higher preretirement status rather than their gender per se that led to the overrepresentation of men. 
Later we shall see that persons of higher occupational status are generally more likely to respond to questionnaires than their colleagues of lower occupational status.
Table 2–2 Studies Showing Higher Overall Rates of Volunteering by Males

                                                                    Percentage Volunteering
Author                          Task                                Females  Males  Two-Tail p of Difference
Britton and Britton (1951)      Questionnaire                       34       57     .006
Howe (1960)                     Electric shock                      67       81     .05
MacDonald (1969)                Electric shock (overall)            78       92     .04
                                  a. Firstborns                     69       97     .006
                                  b. Laterborns                     85       87     1.00
Schopler and Bateson (1965)     High temperature (overall)          33       61     .01
                                  a. Recruiter less dependent       25       71     .004
                                  b. Recruiter more dependent       40       50     .68
Schultz (1967b)                 Sensory deprivation                 56       76     .06
Siegman (1956)                  Sex interview                       17       42     .02
Wilson and Patterson (1965)     Psychology experiment               60       86     .005
The study by Wilson and Patterson (1965) is also unusual in that males volunteered more than females in response to a somewhat nonspecific request for volunteers made to New Zealand undergraduate students. In the vast majority of studies of volunteering for general or unspecified psychological experiments, females volunteer more than do males. The fact that this unusual result was obtained in New Zealand may help to explain it, but one wonders how. The remaining studies of Table 2-2 have in common that the task for which volunteers were solicited was physically or psychologically stressful and relatively unconventional. These tasks required tolerance of electric shocks, high temperatures, sensory deprivation, and personally asked questions about sex behavior. These are all tasks that give the (ordinarily fairly young) male volunteer an opportunity to assert what the culture defines for him as ‘‘his masculinity.’’ Additional evidence for this interpretation can be found in the research of Siess (1973), who discovered that compared to females, males significantly preferred experiments requiring them to withstand changes in gravitational pressure, deprivation of oxygen, rapid temperature changes, fatigue, sensory deprivation, and drug effects. Similarly, Wolf and Weiss (1965) found males expressing greater preference than females for isolation experiments. Some further support may be found, too, in the research of Martin and Marcuse (1958), who solicited volunteers for four experiments in the areas of learning, personality, hypnosis, and sex. Female volunteers were overrepresented in the first three of these areas (approximately 30%, 51%, and 43% volunteering rates for females versus 27%, 35%, and 27% volunteering rates for males respectively), while male volunteers were overrepresented in the fourth area, the study of sex behavior (approximately 25% of females and 52% of males volunteered). 
The studies by Martin and Marcuse are not listed in our tables since the percentages calculable from their data could only be quite approximate. In addition to the studies shown in Table 2-2, three others were discovered in which males found their way more easily into behavioral research. Crossley and Fink (1951) found that women refused to be interviewed more often than men, and Katz and Cantril (1937) found that mail returns overrepresented men rather than women. This latter study, however, also showed that persons of higher socioeconomic status
Table 2–3 Studies Showing No Reliable Sex Differences in Volunteering

                                                                                  Percentage Volunteering
Author                                          Task                              Females  Males
Back, Hood, and Brehm (1963); Hood (1963)       Psychology experiments            57       54
Bergen and Kloot (1968–1969)                    Psychology experiment (overall)   66       73
                                                  a. More personal appeal         76       71
                                                  b. Less personal appeal         56       76
Diamant (1970)                                  Sex survey                        35       51
Ebert (1973)                                    Questionnaire                     79       75
Francis and Diespecker (1973)                   Sensory deprivation               47       48
London (1961)                                   Hypnosis research                 42       40
Loney (1972)                                    Sex survey (overall)              66       72
                                                  a. Heterosexual persons         92       100
                                                  b. Homosexual persons           50       56
Olsen (1968)                                    Psychology experiments            26       25
Poor (1967)                                     Questionnaire                     36       23
Raymond and King (1973)                         Research participation            42       41
Rosnow and Suls (1970)                          Psycholinguistic experiment       40       30
Rubin (1973a; in press)                         Airport interviews (overall)      67       63
                                                  a. Handwriting sample           67       71
                                                    1. Female experimenter        77       71
                                                    2. Male experimenter          57       70
                                                  b. Self-description             67       56
                                                    1. Female experimenter        78       67
                                                    2. Male experimenter          56       45
Tune (1968)                                     Sleep research                    84       84
Tune (1969)                                     Sleep research                    88       90
Wicker (1968a)                                  Church survey (overall)           87       78
                                                  a. First request                50       61
                                                  b. Second request               74       45
Wolf (1967)                                     Psychology experiments (overall)  48       44
                                                  a. Summer session               36       40
                                                  b. Regular term                 64       52
Wolfgang (1967)                                 Learning experiment               48       47
responded more, and it may be that, for this sample, males were of higher socioeconomic status than women (e.g., listed in Who’s Who versus being on relief). Thus, socioeconomic status rather than sex per se may have accounted for the results. Finally, Teele (1967) found that among relatives of former mental patients, males were slightly more likely than females (r = .13) to participate in voluntary associations. Listing this study as in support of the proposition that males volunteer more than females requires that we view participation in such associations as related to volunteering for participation in behavioral research. Table 2-3 shows the results of studies in which there was no reliable difference in volunteering rates between males and females. As we would expect, the average absolute difference between volunteering rates of males and females is smaller in this table than are the average absolute differences to be found in Tables 2-1 and 2-2. But we should note that there are some substantial absolute differences to be found in this table, nevertheless. When that occurs, it is because the sample sizes on which the
Book Three – The Volunteer Subject
percentages are based are too small for the differences in percentages to reach statistical significance. That was the situation, for example, in the study by Bergen and Kloot (1968–1969). When a less personal appeal for volunteers was made, 56% of the 39 women, but 76% of the 29 men, agreed to participate, a difference that, while not significant, can hardly be considered small. Cohen (1969) provides useful guidelines for the interpretation of the magnitude of differences between proportions. The detectability of differences between proportions depends not only upon the difference between the proportions but on the absolute value of the two proportions as well. However, when proportions are transformed by the relationship $\phi = 2 \arcsin \sqrt{\text{Proportion}}$, the differences between the $\phi$ values are equally detectable, regardless of their absolute level. Cohen regards a difference between two $\phi$ values as small when its value is .2, medium when its value is .5, and large when its value is .8. The difference between the arcsin transformed proportions of .56 and .76 is .43, or approximately a medium-size effect. We should note that a good many of the significant differences in volunteering rates between males and females shown in Tables 2-1 and 2-2 represent effects smaller than this effect size simply because those smaller effects were based on larger sample sizes. Another nontrivial difference in volunteering rates between males and females can be noted in the study by Diamant (1970). In that study, volunteers were recruited for a sex survey, and 51% of the males and 35% of the females agreed to participate. Thus, although the difference was not significant statistically, it seems to agree with other results, reported earlier, showing men to be more willing to participate in sex research. The magnitude of this effect (51%–35%) in arcsin units is .32, about halfway between a small and a medium-size effect.
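The arcsin arithmetic quoted above can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and the symbol h for the difference between two transformed proportions is an assumption made for illustration.

```python
import math


def arcsin_transform(proportion: float) -> float:
    """Return 2 * arcsin(sqrt(proportion)) for a proportion in [0, 1]."""
    if not 0.0 <= proportion <= 1.0:
        raise ValueError("proportion must lie between 0 and 1")
    return 2.0 * math.asin(math.sqrt(proportion))


def effect_size_h(p1: float, p2: float) -> float:
    """Absolute difference between two arcsin-transformed proportions."""
    return abs(arcsin_transform(p1) - arcsin_transform(p2))


# Bergen and Kloot (less personal appeal): 56% of women, 76% of men.
print(round(effect_size_h(0.56, 0.76), 2))
# Diamant (sex survey): 35% of females, 51% of males.
print(round(effect_size_h(0.35, 0.51), 2))
```

Read against Cohen's benchmarks of .2, .5, and .8, the first value sits near a medium effect and the second between small and medium, in line with the text.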
Loney’s (1972) results were quite similar. Among both heterosexual and homosexual persons, males were more willing to participate in a survey of sex behavior. Arcsin transformed or not, the difference in volunteering rates was quite small for the homosexual sample but not at all trivial for the heterosexual sample. The 8-percentage-point difference between 92% and 100% is much larger in arcsin units (.57) than the 6-percentage-point difference between 50% and 56% (.12 arcsin units). A very informative study was conducted by Rubin (1973a; in press), who sought his volunteers in airport departure lounges. He employed male and female interviewers to solicit volunteering either for a study of handwriting or for a study of self-descriptions. Although Rubin found no important overall differences in volunteering rates between males and females, he found the equivalent of a three-way interaction between type of task for which volunteering was solicited, sex of interviewer, and sex of subject. When requests were for handwriting samples, subjects volunteered for an experimenter of the same sex more than when requests were for self-descriptions. In the latter case, both male and female subjects volunteered more for a female experimenter. To a lesser extent subjects also volunteered more for female experimenters when the request was for handwriting samples. Rubin’s study is one of the few to have investigated simultaneously the effects on volunteering rates of both type of task and sex of recruiter. His results suggest that it was a good idea. 
Actually, Coffin (1941) cautioned long ago that there might be complicating effects of the experimenter’s sex, and we might wonder along with Coffin and Martin and Marcuse (1958) about the differential effects on volunteer rates among male and female subjects of being confronted with a male versus a female Kinsey interviewer, as well as the differential effects on eagerness to be hypnotized of being confronted with a male versus a female hypnotist.
Characteristics of the Volunteer Subject
In addition to the studies listed in Table 2-3, several others were found in which there were no reliable differences between males and females in rates of volunteering. Belson (1960), Bennett and Hill (1964), Fischer and Winer (1969; two different studies), Sirken, Pifer, and Brown (1960), and Wallin (1949) reported no sex differences in volunteering rates in their survey research projects, nor did Hilgard, Weitzenhoffer, Landes, and Moore (1961), Hood (1963), MacDonald (1972b), Mackenzie (1969), Mulry and Dunbar (n.d.), Rosen (1951), Schachter and Hall (1952), and Stein (1971) in their laboratory or clinic projects. One more study appears to be at least tangentially relevant, that by Spiegel and Keith-Spiegel (1969). These investigators asked for college student volunteers not for a psychological experiment but for participation in a companionship program in which students would spend time with psychiatric patients over a 10-week period. No significant differences were found in rates of volunteering by female (39%) versus male (33%) students.
Summarizing the Data
Tables 2-1, 2-2, and 2-3 contain quite a lot of information and it will be useful to recast that information into more manageable forms. The strategy adopted here is based on the “stem-and-leaf” technique developed by John Tukey (1970). Stem-and-leaf plots look like histograms and indeed they are that, but they permit retention of all the raw data at the same time so that no information need be lost in the grouping process. Table 2-4 shows four parallel stem-and-leaf displays, two for studies requesting volunteering for physically or psychologically stressful studies (e.g., electric shock,
Table 2–4 Stem-and-Leaf Plots of Volunteering Rates in Percentage Units
Stem | General Studies, Females (N = 51) | General Studies, Males (N = 51) | Stress Studies, Females (N = 12) | Stress Studies, Males (N = 12)
10 |  |  |  | 0
9 |  | 0 | 2 | 7
8 | 01489 | 46 | 5 | 17
7 | 134678889 | 0011356 | 3 | 16
6 | 000122345666 | 117 | 79 | 4
5 | 0026667799 | 24467788 | 06 | 016
4 | 022267889 | 0011113455557789 | 07 | 28
3 | 4669 | 02445889 | 5 |
2 | 67 | 0358 | 5 |
1 |  | 03 | 7 |
Maximum | 89 | 90 | 92 | 100
Quartile 3 | 73 | 61 | 71 | 84
Median | 60 | 45 | 53 | 67.5
Quartile 1 | 48 | 38 | 37.5 | 50.5
Minimum | 26 | 10 | 17 | 42
Q3 − Q1 | 25 | 23 | 33.5 | 33.5
σ̂ | 19 | 17 | 25 | 25
S | 15.8 | 18.3 | 23.3 | 19.8
Mean | 59.1 | 49.1 | 54.7 | 68.6
high temperature, sensory deprivation, sex behavior) and two for all other studies. Within each type of study, there are two parallel stem-and-leaf plots, one showing the distribution of percentages of females volunteering and the other showing the distribution of percentages of males volunteering. For each of our stem-and-leaf plots, the numbers to the left of the dividing line (the stems) represent the leading digit(s) of the percentage, while the digits to the right of the dividing line (the leaves) represent the final digit of the percentage. In the leftmost plot of Table 2-4 there are no leaf entries for the stems of 10 or 9. That means that no percentages of 90% or above were found for females in general studies of volunteering. In that same plot the stem of 8 is followed by leaves of 0, 1, 4, 8, and 9 (by convention not separated by commas). We read those five leaves as representing a recording of the following scores: 80, 81, 84, 88, and 89. Coarse examination of the parallel stem-and-leaf displays for studies in general shows the heavy center region to be higher for females (modal stem = 6) than for males (modal stem = 4), while somewhat the opposite is true for the displays for studies involving stress (n too small to use modal stems). A more formal summary of each stem-and-leaf display is given below each plot. There we find the highest percentage obtained; the percentages corresponding to the third, second (median), and first quartiles; and the lowest percentage obtained. In addition, the difference between the third and first quartiles is given, along with the rough estimate σ̂ [(Q3 − Q1)(.75)], the unbiased estimate of σ (S), and the mean. Skews, bobbles, and bunchings can be found and described employing the plots themselves and/or the summaries given below the plots. For our present purpose, and for most readers, it may be sufficient to note that for studies in general, the median rate of volunteering is 60% for women and 45% for men.
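The construction of a stem-and-leaf display and of the summary printed beneath it can be sketched as follows. This is an illustrative sketch, not the authors' procedure. The twelve percentages are the female rates for stress studies as read off Table 2-4. The quartile rule (the median of each half of the ordered data, leaving out the middle value when N is odd) is an assumption that happens to reproduce the tabled quartiles. The function names are hypothetical.

```python
import statistics
from collections import defaultdict


def stem_and_leaf(percentages):
    """Map each stem (leading digits) to a string of its leaves (final digits)."""
    leaves = defaultdict(list)
    for value in sorted(percentages):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(str(leaf))
    return {stem: "".join(digits) for stem, digits in leaves.items()}


def summarize(percentages):
    """Return the summary rows listed beneath each plot in Table 2-4."""
    values = sorted(percentages)
    half = len(values) // 2
    # Assumed quartile rule: median of the lower half and of the upper half.
    q1 = statistics.median(values[:half])
    q3 = statistics.median(values[-half:])
    return {
        "maximum": values[-1],
        "quartile3": q3,
        "median": statistics.median(values),
        "quartile1": q1,
        "minimum": values[0],
        "q3_minus_q1": q3 - q1,
        "sigma_hat": 0.75 * (q3 - q1),    # rough estimate of sigma
        "S": statistics.stdev(values),    # estimate with N - 1 in the denominator
        "mean": statistics.mean(values),
    }


# Female volunteering rates (%) in the 12 stress studies, read from Table 2-4.
stress_females = [92, 85, 73, 67, 69, 50, 56, 40, 47, 35, 25, 17]

plot = stem_and_leaf(stress_females)
for stem in range(10, 0, -1):
    print(f"{stem:>2} | {plot.get(stem, '')}")
print(summarize(stress_females))
```

Because each leaf keeps the final digit, the raw percentages can be read back from the display, which is the property the text emphasizes.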
For studies involving stress, however, the median rate of volunteering is only 53% for women but over 67% for men. Considering the rates of volunteering by women for the two types of studies suggests that the median rates are not so very different (χ² = 0.10), although there is a hint that volunteering rates of women may be more variable in the stress studies than in the general studies (interquartile range and S are both larger for the stress studies, the latter nearly significantly: F = 2.17, df = 11,50, p < .07). When we consider the rates of volunteering by men for the two types of studies, however, we find a substantial difference with the greater likelihood of volunteering occurring for the stress studies (median of 67.5% versus 45%). Such a difference in arcsin units (.46) corresponds to what Cohen considers a medium-size effect, and it is significant (p < .025, χ² = 5.06) despite the small sample size of the stress studies. Relative to females (median volunteer rate for all 63 studies = 59%), then, males are likely to overvolunteer somewhat for stressful studies (67.5%) and to undervolunteer quite a lot for all other kinds of studies (45%). Although Table 2-4 has been useful, it does not preserve the information that for each of the 63 studies listed, there was both a male and female volunteering rate (i.e., that there was blocking by studies). Table 2-5 preserves this information by showing stem-and-leaf plots of the differences between volunteering rates with the male percentage always subtracted from the female percentage. Comparison of the plots for studies in general and for studies involving stress shows that they are virtually mirror images of one another. For studies in general, females volunteer more than males 84% of the time, while in studies involving stress, males volunteer more than females 92% of the time (χ² = 24.7, df = 1, p very small). Before leaving
Table 2–5 Stem-and-Leaf Plots of Differences in Volunteering Rates Between Females and Males
Stem | General Studies (N = 51) | Stress Studies (N = 12)
+3 | 5 |
+3 |  |
+2 | 59 |
+2 | 011112224 |
+1 | 667888 |
+1 | 01112223334 |
+0 | 5556889 | 9
+0 | 1112344 |
0 | 0 |
−0 | 24 | 12
−0 |  | 68
−1 | 13 | 04
−1 |  | 6
−2 | 03 | 0
−2 | 6 | 58
−3 |  |
−3 |  |
−4 |  |
−4 |  | 6
Maximum | +35 | +9
Quartile 3 | +18 | −4
Median | +11 | −12
Quartile 1 | +3 | −22
Minimum | −26 | −46
Q3 − Q1 | 15 | 18
σ̂ | 11 | 14
S | 12.6 | 14.6
Mean | +9.4 | −13.9
our discussion of studies involving stress versus studies not involving stress, we should note that the latter studies can be readily divided into those involving questionnaires and those involving the more typical behavioral laboratory research. Sex differences in volunteering rates were examined as a function of this breakdown. The median difference in volunteering rate (percent of women volunteering minus percent of men volunteering) was 13 for the 21 questionnaire studies and 10.5 for the 30 remaining studies, a difference that did not approach significance (χ² = 0.13).
Pseudovolunteering
Our interest in volunteers is based on the fact that only they can provide us with the data that the nonvolunteers have refused us. But not all volunteers, it usually turns out, provide us with the data we need. To varying degrees in different studies there will be those volunteers who fail to keep their experimental appointment. These “no-shows” have been referred to as “pseudovolunteers” by Levitt, Lubin, and Brady (1962), who showed that on a variety of personality measures, pseudovolunteers are
not so much like volunteers as they are like the nonvolunteers who never agreed to come in the first place. Other studies also have examined characteristics of experimental subjects who fail to keep their appointments, but somewhat surprisingly, there is no great wealth of data bearing on the rates of pseudovolunteering one may expect to find in one’s research. What makes this dearth of data surprising is that every investigator obtains such data routinely in every study conducted. One suspects, however, that experienced investigators have a fairly adequate sense of the proportion of no-shows to expect for any given type of study conducted at any given time of year. Knowing approximately how many subjects will not appear is very useful in planning the number of sessions required to fill the sampling quotas. Speaking from experience, the present authors can offer the suggestion, however, that this “adequate sense” can be quite inadequate. In a variety of studies we have found pseudovolunteering rates to range from perhaps 50% to −5%. The −5% occurred in a rather large-size study in which every subject appeared on schedule, with several bringing along friends who had not volunteered but wanted either to earn the $1.00 offered as compensation or to keep their friends company. Table 2-6 gives the results of 20 more formal studies of pseudovolunteering. For these studies considered as a set, we might guess that as many as a third of our volunteers might never appear as promised. No claim is made, however, that these studies are in any way representative. There appears to be no systematic research on variables affecting rates of pseudovolunteering, but there are procedures widely employed for keeping those rates at a minimum, such as reminder letters and telephone calls. There is some likelihood, however, based on the survey research literature that it is the potential pseudovolunteer who is less likely to be at home, to read his or her mail, or answer his or her telephone.
Table 2–6 Some Rates of Pseudovolunteering
Author | % Failing to Appear
Belson (1960) | 16
Bergen and Kloot (1968–69) | 10
Craddick and Campitell (1963) | 42
Dreger and Johnson (1973) | 36
Kirby and Davis (1972) | 41
Leipold and James (1962) | 36
Levitt, Lubin, and Brady (1962) | 30
MacDonald (1969) I | 14
MacDonald (1969) II | 32
Minor (1967) | 37
Tacon (1965) | 40
Valins (1967) I | 37
Valins (1967) II | 3
Waters and Kirk (1969) I | 24
Waters and Kirk (1969) II | 31
Waters and Kirk (1969) III | 30
Waters and Kirk (1969) IV | 37
Wicker (1968a) | 12
Wicker (1968b) | 38
Wrightsman (1966) | 19
Median | 31.5
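The planning use of a no-show rate mentioned earlier can be illustrated with simple arithmetic. The sketch below recomputes the median of the 20 rates in Table 2-6 and then estimates how many volunteers would have to be scheduled to fill a quota. The quota of 40 and the function name are hypothetical, and the calculation assumes that a future study's no-show rate equals the rate supplied, which the text cautions may not hold.

```python
import math
import statistics

# Percentages failing to appear in the 20 studies of Table 2-6.
no_show_percentages = [16, 10, 42, 36, 41, 36, 30, 14, 32, 37,
                       40, 37, 3, 24, 31, 30, 37, 12, 38, 19]


def volunteers_to_schedule(quota: int, no_show_rate: float) -> int:
    """Smallest number of volunteers whose expected attendance meets the quota."""
    if not 0.0 <= no_show_rate < 1.0:
        raise ValueError("no_show_rate must be at least 0 and below 1")
    return math.ceil(quota / (1.0 - no_show_rate))


median_percentage = statistics.median(no_show_percentages)
print(median_percentage)

# Hypothetical quota of 40 subjects, planned at the median no-show rate.
print(volunteers_to_schedule(40, median_percentage / 100))
```

Given the wide range of rates in the table (3% to 42%), a planner might rerun the calculation with a pessimistic rate as well as the median.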
There is also little information available to suggest whether males or females are more likely not to show. In their research, Frey and Becker (1958) found no sex differences between subjects who notified the investigator that they would be absent versus those who did not notify him. Although these results argue against a sex difference in pseudovolunteering, it should be noted that the entire experimental sample was composed of extreme scorers on a test of introversion–extraversion. Furthermore, no comparison could be made of either group of no-shows with the parent population from which the samples were drawn. Leipold and James (1962) also compared the characteristics of shows and no-shows among a random sample of introductory psychology students who had been requested to serve in an experiment in order to satisfy a course requirement. Again, no sex differences were found. Interestingly enough, however, about half of Frey and Becker’s no-shows notified the experimenter that they would be absent, while only one of Leipold and James’s 39 no-shows did so. Finally, there are the more recent studies by Belson (1960) and Wicker (1968b), in which it was possible to compare the rates of pseudovolunteering by male and female subjects for a questionnaire study. These results also yielded no sex differences. Hence, four studies out of four suggest that failing to provide the investigator with data promised him may not be any more the province of males than of females. Four studies, however, are not very many, and it would be premature to defend staunchly the null hypothesis on the basis of so little data.
Birth Order
Ever since the seminal work of Schachter (1959) there has been an increased interest in birth order as a useful independent variable in behavioral research (Altus, 1966; Belmont and Marolla, 1973; Warren, 1966; Weiss, 1970). A number of studies have investigated the question of whether firstborns or only children are more likely than laterborns to volunteer for behavioral research. Table 2-7 lists the authors of such studies and tabulates for each author the number of results showing firstborns to volunteer more and the number of results showing laterborns to volunteer more. The final column of the table lists the number of results by each investigator that were significant at approximately the 5% level and whether it was the firstborns or laterborns that had volunteered more. The most striking finding is that about half the studies listed show firstborns to have volunteered more and about half show laterborns to have volunteered more. Of the studies showing significant differences in volunteering rates, seven show firstborns to have volunteered more and three show laterborns to have volunteered more. It seems clear, then, that no overall conclusion can be drawn as to which ordinal position is likely to be overrepresented in a sample of volunteers. However, it would probably be unwise to conclude that there is no relationship between birth order and volunteering, since, disregarding direction of outcome, there are too many studies showing some significant relationship to make the null hypothesis defensible. Perhaps we can gain a better understanding of these results by considering the type of research for which volunteering was requested as a function of the type of results obtained. Table 2-8 shows such a listing. Ward (1964) found laterborns to volunteer more (55%) than firstborns (32%) for an optional experiment in
Table 2–7 Studies Comparing Rates of Volunteering by Firstborns and Laterborns
Author | Results Showing Greater Volunteering by Firstborns (FB) | Results Showing Greater Volunteering by Laterborns (LB) | Number Significant
Altus (1966) | 2 | 0 | 1 FB
Capra and Dittes (1962) | 1 | 0 | 1 FB
Diab and Prothro (1968) | 1 | 1 | 0
Ebert (1973) | 1 | 0 | 1 FB
Eisenman (1965) | 0 | 1 | 1 LB
Fischer and Winer (1969) | 1 | 2 | 0
Lubin, Brady, and Levitt (1962b) | a | a | 0
MacDonald (1969) | 1 | 1 | 0
MacDonald (1972a) | 2 | 1 | 0
MacDonald (1972b) | a | a | 0
Myers, Murphy, Smith, and Goffard (1966) | a | a | 0
Olsen (1968) | 2 | 0 | 2 FB
Poor (1967) | 1 | 0 | 0
Rosnow, Rosenthal, McConochie, and Arms (1969) | 0 | 1 | 0
Schultz (1967a) | 0 | 1 | 0
Stein (1971) | 0 | 1 | 1 LB
Suedfeld (1964) | 1 | 0 | 1 FB
Suedfeld (1969) | 0 | 1 | 0
Varela (1964) | 1 | 0 | 1 FB
Wagner (1968) | 0 | 1 | 0
Ward (1964) | 4 | 2 | 1 LB
Wilson and Patterson (1965) | 1 | 0 | 0
Wolf (1967) | 1 | 1 | 0
Wolf and Weiss (1965); Weiss, Wolf, and Wiltsey (1963) | 1 | 3 | 0
Zuckerman, Schultz, and Hopkins (1967) | 0 | 2 | 0
Sum | 21 | 19 | 10
a. Unable to determine direction of difference.
which subjects were to work on a task while alone. However, Suedfeld (1964) found firstborns to volunteer more than laterborns for an experiment in sensory deprivation which was also optional and also involved the subject’s being alone. Given a significant relationship between birth order and volunteering, then, the direction of the difference cannot very well be dependent exclusively on either (a) the required nature of the experiment or (b) the solitary nature of the subject’s participation. Capra and Dittes (1962) found that among their Yale University undergraduates 36% of the firstborns, but only 18% of the laterborns, volunteered for an experiment requiring cooperation in a small group. Varela (1964) found that among his Uruguayan male and female high school students 70% of the firstborns, but only 44% of the laterborns, volunteered for a small-group experiment similar to that of Capra and Dittes. Altus (1966) and Olsen (1968) reported that firstborn males were overrepresented relative to laterborn males when subjects were asked to volunteer for testing. Altus as well as Olsen obtained similar results when the subjects were female
Table 2–8 Types of Research for Which Volunteering Was Requested for Each Type of Outcome
Firstborns Volunteer Significantly More | No Significant Difference | Laterborns Volunteer Significantly More
Sensory deprivation | General experiments | Optional experiment, working alone
Small-group research | Hypnosis | Psychotherapy research
Psychological testing | Interview | Strong electric shock
Questionnaire | Psychoacoustic research |
 | Questionnaire |
 | Sensory deprivation |
 | Small-group research |
Number of results in each outcome: 7 | 30 | 3
undergraduates, but in Altus’ study the difference in volunteering rates was not significant statistically. In his survey of college students, Ebert (1973) found that only-children were significantly more likely to respond than were children with siblings. Thus, for studies requesting volunteers for small-group research, for psychological testing, and for mailed questionnaires, when there is a significant difference, it seems to favor the greater participation of the firstborn. The remaining studies showing a significant relationship between birth order and volunteering are those by Stein (1971) and Eisenman (1965). Stein studied psychotherapy patients drawn from a psychological clinic and found that his laterborns were more cooperative in completing the required pre- and posttests than were the firstborns. Eisenman studied student nurses and found that more laterborns than firstborns were willing to participate in a study involving severe electric shock. In both these studies there appears to be an element of unusually high stress. In the study by Stein, we can assume that the clinic patients were under considerable stress simply by virtue of their having required psychological assistance. In the study by Eisenman, we can assume that persons confronted with very strong electric shock would also be under considerable stress or threat. Perhaps, then, when the experiment involves a very high level of stress either as a trait of the subject or as a characteristic of the task, and when there is a significant relationship between birth order and volunteering, it is the laterborns who will volunteer more. An apparent weakening of this hypothesis may be noted in the first column of Table 2-8. There we find that an experiment in sensory deprivation obtained significantly higher rates of volunteering by firstborn rather than laterborn subjects. We are inclined to think of sensory deprivation as a stressful experience—hence, the weakening of the stress hypothesis. 
In this particular study of volunteering for sensory deprivation (Suedfeld, 1964), however, Suedfeld’s recruiting technique was quite reassuring in tone. Suedfeld (1968) has shown further that a reassuring orientation toward sensory deprivation specifically lowers the anticipated stress in firstborn as compared to laterborn subjects. Indeed, there is an indication from the research of Dohrenwend, Feldstein, Plosky, and Schmeidler (1967) that when recruitment specifically tries to arouse anxiety about sensory deprivation, laterborns may tend to volunteer significantly more than firstborns. Taken together, then, the Suedfeld results and those of Dohrenwend et al. actually tend to
strengthen the hypothesis that when great stress is involved, firstborns will volunteer significantly less than laterborns. In several of the studies relating birth order to volunteering the focus was less on whether a subject would volunteer and more on the type of experiment for which he would volunteer. That was the case in a study by Brock and Becker (1965), who found no differences between first- and laterborns in their choices of individual or group experiments. Studies reported by Weiss, Wolf, and Wiltsey (1963) and by Wolf and Weiss (1965) suggest that preference for participation in group experiments by firstborn versus laterborn subjects may depend on the form in which the subject is asked to indicate his volunteer status. When a ranking of preferences was employed, firstborns more often volunteered for a group experiment. However, when a simple yes–no technique was employed, firstborns volunteered relatively less for group than for individual or isolation experiments. Just as the nature of the research for which volunteering is requested may moderate the relationship between birth order and volunteering, so there are other variables that have been suggested as serving potential moderating functions in this relationship. Thus, MacDonald (1969) has shown that sex may operate interactively with birth order and volunteering such that firstborn females are most likely to avoid an electric shock, while firstborn males are least likely to avoid an electric shock, with male and female laterborns falling in between. MacDonald (1972a) has also found the nature of the incentive to volunteer to interact with birth order and volunteering behavior. 
Firstborns were appreciably more likely to volunteer for extra credit than for either pay or “love of science.” Several investigators (MacDonald, 1969; Fischer and Winer, 1969) have also suggested that the degree of intimacy or face-to-faceness of the recruiting situation might affect the relationship between birth order and volunteering. One suspects that the hypotheses put forward by all these authors could well be investigated simultaneously. Male and female firstborns and laterborns could be recruited under more versus less intimate conditions and offered extra credit or not. If a really large population of subjects was available, the degree of stressfulness of the task for which recruitment is solicited could also be varied orthogonally to the group versus individual nature of subjects’ interaction in the experiment. One suspects that a doctoral dissertation lurks here. Before summing up what we think we may know about the relationship between birth order and volunteering, we must consider the possible relationship between birth order and pseudovolunteering. Even if there were no differences between firstborns and laterborns in volunteering rates, if one of these were more likely to be a no-show, then the other would be overrepresented in the subject pool that finally produced data. MacDonald (1969) has investigated the matter. He found that 86% of the firstborns kept their appointments for an experiment ostensibly involving painful electric shock, while only 68% of the laterborns appeared. If further research were to support this direction and magnitude of effect for a variety of experimental situations, one might infer that the final subject sample may overrepresent the more docile, less suspicious type of subject. MacDonald (1969) found that while 22% of the laterborns were suspicious about the true purpose of the experiment, only 8% of the firstborns were suspicious (p < .01).
To sum up the state of the evidence bearing on the relationship between volunteering and birth order, we can begin by saying that most of the studies
conducted found no significant relationship between these variables. Still, there were many more studies that did find a significant relationship than we could easily ascribe to the operation of chance. For those studies in which firstborns volunteered significantly more, there was perhaps a more intimate recruitment style and a greater likelihood that the request was for participation in an experiment requiring group interaction. For those studies in which firstborns volunteered significantly less, there was perhaps a greater degree of stress experienced by those approached for participation.
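The claim that there were more significant results than chance would readily produce can be given a rough numerical form. The sketch below takes the counts from Table 2-8 (7 + 3 significant results out of 7 + 30 + 3) and computes the binomial probability of at least that many significant results if each had a 5% chance of reaching significance. Treating the 40 results as independent trials is an assumption made only for illustration, since several results come from the same investigators, and the function name is hypothetical.

```python
from math import comb


def binomial_upper_tail(successes: int, trials: int, p: float) -> float:
    """Probability of at least `successes` successes in `trials` independent trials."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


significant_results = 7 + 3    # firstborns more + laterborns more (Table 2-8)
total_results = 7 + 30 + 3     # all results tabulated in Table 2-8
alpha = 0.05                   # approximate significance level used in the text

expected_by_chance = total_results * alpha
tail_probability = binomial_upper_tail(significant_results, total_results, alpha)

print(expected_by_chance)
print(tail_probability)
```

Under these assumptions only about two significant results would be expected by chance, so ten is far more than the null hypothesis comfortably allows, although the direction of the effect remains mixed.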
Sociability
The bulk of the evidence seems to suggest that those persons more likely to volunteer for participation in behavioral research are more sociable than those less likely to volunteer. Table 2-9 summarizes this evidence by listing, separately for the more experimental studies and for the studies employing questionnaires, those results showing volunteers to be more sociable, those studies finding no difference in sociability between volunteers and nonvolunteers, and those studies showing volunteers to be less sociable. Of the 19 studies listed in Table 2-9, only 1 found that volunteers could be considered significantly less sociable than nonvolunteers. Abeles, Iscoe, and Brown (1954–1955) reported a study in which male undergraduates were invited by the president of their university to complete questionnaires concerning the draft, the Korean conflict, college life, and vocational aspirations. Results showed that members of fraternities, who might be regarded as more sociable, were significantly
Table 2–9 Studies of the Relationship Between Volunteering and Sociability
Outcome | Psychological Experiments | Questionnaires
Volunteers found more sociable (n = 12) | Hayes, Meltzer, and Lundberg (1968); London, Cooper, and Johnson (1962); MacDonald (1972a); MacDonald (1972b); Martin and Marcuse (1957, 1958); Poor (1967); Schubert (1964) (n = 7) | Abeles, Iscoe, and Brown (1954–1955); Kivlin (1965); Lehman (1963); Reuss (1943); Tiffany, Cowan, and Blinn (1970) (n = 5)
No difference (n = 6) | Myers, Smith, and Murphy (1967); Martin and Marcuse (1958) I; Martin and Marcuse (1958) II (n = 3) | Bennett and Hill (1964); Ebert (1973); Poor (1967) (n = 3)
Volunteers found less sociable (n = 1) | (n = 0) | Abeles, Iscoe, and Brown (1954–1955) (n = 1)
less likely to reply spontaneously to the questionnaire. In a subsequent session, which followed both a letter “ordering” the students to participate as well as a personal telephone call, fraternity members were significantly more likely to participate. Reuss (1943), in his earlier research, had found fraternity and sorority members to be more responsive to requests for participation in behavioral research. The three questionnaire studies showing no differences in volunteering as a function of sociability had in common only that all three sampled college, or former college, students. The five questionnaire studies showing volunteers to be more sociable also included studies of college students but were not restricted to them. Thus Kivlin (1965) sampled Pennsylvania dairy farmers; and Tiffany, Cowan, and Blinn (1970) surveyed former psychiatric patients. Considering only the questionnaire studies, then, it appears that respondents are likely to be more sociable than nonrespondents, a relationship possibly stronger for noncollege than for college samples. When we consider the relationship between volunteering and sociability for more laboratory-oriented behavioral research, we find a similar picture. Again, three studies reported no positive relationship between sociability and volunteering. Perhaps one of these exceptions (Myers, Smith, and Murphy, 1967) can best be understood in terms of the task for which volunteers were recruited: sensory deprivation. It seems reasonable to think that, while in general sociable subjects will volunteer more than unsociable subjects, in an experiment in which their sociability will be specifically frustrated sociable subjects will be less likely to volunteer. The other exceptions are results reported by Martin and Marcuse (1958), who found no differences in sociability between volunteers and nonvolunteers for sex research or, for males only, for hypnosis research.
In the research relating sociability to volunteering, a variety of measures of sociability have been employed. Employing as subjects male and female college freshmen, Schubert (1964) observed that volunteers (n = 562) for a psychological experiment scored higher in sociability than nonvolunteers (n = 443) on the Social Participation Scale of the MMPI. Poor (1967) employed the California Psychological Inventory Scale of Sociability and obtained the same results, and MacDonald (1972a, b) found that both those more likely to volunteer at all and those who volunteer earlier had made more new friends since coming to college. In their experiment on social participation, Hayes, Meltzer, and Lundberg (1968) found undergraduate volunteers to be more talkative than undergraduate (nonvolunteer) conscripts. Talkativeness appears to be related to sociability, but in this case talkativeness was measured after the volunteering or nonvolunteering occurred, so that it may have been situationally determined. Perhaps volunteers talked more because they were engaged in an experiment for which they had volunteered and nonvolunteers talked less because they had been coerced into participation. Although it seems reasonable to think so, we cannot be certain that conscripts would have been less talkative than volunteers even before the recruitment procedures were begun. In their research, London, Cooper, and Johnson (1962) found serious volunteers for hypnosis to be somewhat more sociable than those less serious about serving science, with sociability defined by the California Psychological Inventory, the 16 PF, and the MMPI. Martin and Marcuse (1957, 1958) found female volunteers for hypnosis to score higher on sociability than female nonvolunteers as measured by the Bernreuter Personality Inventory.
Another study of volunteers for hypnosis research appears relevant to the discussion of the variable of sociability, although it was not listed in Table 2-9 because its results were too difficult to interpret. Lubin, Brady, and Levitt (1962a) asked student nurses to volunteer for hypnosis and found volunteers to score higher on dependency than nonvolunteers on a measure based on Rorschach content. Dependency may be positively related to sociability and so this result appears to support the general results showing volunteers to be more sociable. However, Lubin et al. also found that these same volunteers were significantly less friendly than the nonvolunteers, as defined by scores on the Guilford–Zimmerman Temperament Survey. Friendliness may also be positively related to sociability, and so this result appears not to support the general conclusion that volunteers tend to be more sociable. It is possible, of course, that there is more than one kind of sociability and that it is the dependency aspects of sociability rather than the friendliness aspects that may predispose persons to volunteer for hypnosis. This interpretation still would not account for the significant negative relationship between friendliness and volunteering for hypnosis, although it would account for the absence of a positive relationship. There is another study that investigated the relationship between volunteering and dependency. In their questionnaire study of student nurses, Lubin, Levitt, and Zuckerman (1962) found that those who responded were more dependent than those who did not, the measure of dependency having been based on several subscores of the Edwards Personal Preference Schedule (EPPS). Thus, for two requests for volunteering (hypnosis and questionnaire) and for two measures of dependency (Rorschach and EPPS), volunteers were found to be more dependent than nonvolunteers. 
Finally, there is a surprising result relating sociability to thoughtfulness among a sample of pseudovolunteers. Frey and Becker (1958) found that those of their pseudovolunteers who notified the investigator that they would be unable to attend had lower sociability scores on Guilford’s scale than did the pseudovolunteers who simply never appeared. It is difficult to explain this somewhat paradoxical finding that presumably more thoughtful pseudovolunteers were in fact less sociable than their less thoughtful counterparts. To summarize, now, the evidence bearing on the relationship between volunteering and sociability, it seems safe to conclude that when a significant relationship does occur, it is much more likely to be positive than negative. Furthermore, whether the request is for participation in a psychological experiment or in a survey study, the probability is good that a positive relationship will be found at a significant or nearly significant level. No very powerful moderating variables were found, although sex remains a usual candidate, as does the nature of the task for which volunteering was solicited. Sociable persons who normally volunteer most may not volunteer much for unsociable tasks.
Extraversion
The variable of extraversion would seem to be related conceptually to the variable of sociability, so that we might expect to find volunteers more extraverted than nonvolunteers, since volunteers appear to be more sociable than nonvolunteers. That expectation is not supported by the evidence. Table 2-10 shows that two studies found volunteers to be more extraverted, two studies found volunteers to be more introverted, and four studies found volunteers not to differ from nonvolunteers in extraversion.

Table 2–10 Studies of the Relationship Between Volunteering and Extraversion

Outcome                              Author                                   Instrument
Volunteers found more extraverted    Jaeger, Feinberg, and Weissman (1973)ᵃ   Omnibus Personality Inventory
                                     Schubert (1964)ᵇ                         MMPI
No difference                        Francis and Diespecker (1973) Iᶜ         Eysenck Personality Inventory
                                     Francis and Diespecker (1973) IIᶜ        16 PF
                                     Martin and Marcuse (1958)ᵇ               Bernreuter Personality Inventory
                                     Rosen (1951)ᵇ                            MMPI
Volunteers found more introverted    Ora (1966)ᵇ                              Self-report
                                     Tune (1968)ᵈ                             Heron Inventory

ᵃ Group experience. ᵇ Psychological experiment. ᶜ Sensory deprivation. ᵈ Sleep research.

There are not enough studies in Table 2-10 to permit strong inferences about characteristics of those studies showing volunteers to be more extraverted versus more introverted, versus those studies showing no differences, but we should consider at least the types of instruments employed and the nature of the tasks for which volunteering was solicited as a source of clues. Inspection of the last column of Table 2-10 suggests no very clear differences between the three sets of instruments employed to measure extraversion. Perhaps the nature of the tasks for which volunteering was requested may provide some clues to help differentiate the three classes of outcomes. One of the studies finding volunteers to be more extraverted and one of those finding volunteers to be more introverted requested volunteers for general psychological experiments. The other study finding volunteers to be more extraverted requested volunteers for participation in an extracurricular group experience, a finding that makes good sense. More extraverted persons should be more likely to want to interact in groups than less extraverted persons. The remaining study finding volunteers to be more introverted requested volunteers to participate in a 56-day study of sleep patterns in which subjects were to record their hours of sleep each day in the privacy of their own home. That such a solitary task should be preferred by more introverted persons also makes good sense. The remaining studies, those finding no differences between volunteers and nonvolunteers, requested either participation in general psychological research or in studies of sensory deprivation. No very clear summary of the relationship between volunteering and extraversion is possible. 
Although there is no net difference in outcomes favoring either the view that volunteers are more introverted or the view that they are more extraverted, there are probably too many studies showing some significant relationship to permit a conclusion of ‘‘no relationship.’’ The task for the future is to learn what types of studies are likely to find significant positive relationships and what types of studies are likely to find significant negative relationships. We do have a hint, at least, that
when the task for which volunteering is requested involves group interaction, volunteers may tend to be more extraverted, while when the task involves no interaction either with other subjects or even with the experimenter, volunteers may tend to be more introverted.
Self-Disclosure
Sydney Jourard (1971) has done for the variable of self-disclosure what Stanley Schachter has done for the variable of birth order. In both cases a large volume of research has followed upon the integrative work of these scholars. We have already considered the relationship between birth order and volunteering; we consider now the relationship between volunteering and the predisposition to be self-disclosing. Hood and Back (1971) solicited volunteers for one or more experiments varying in the degree to which the situation would be (1) competitive, (2) under the strong control of the experimenter, (3) affiliative, or (4) self-revealing. The Jourard measure of self-disclosure had been administered to all subjects, and results showed very clearly for male subjects that those scoring higher on self-disclosure were more likely to volunteer. The magnitude of this effect was very large, with volunteers scoring 1.42 standard deviation units higher than nonvolunteers (p < .001). For female subjects, the relationship between volunteering and self-disclosure was complicated by the nature of the experiments for which volunteering occurred and not in any predictable manner. Females generally, however, scored very much higher than males in self-disclosure, and there is a possibility that the complexity of the results for females may have been partially caused by problems of ceiling effects. Sheridan and Shack (1970) invited 81 undergraduates enrolled in a personal-adjustment type of course to volunteer for a seven-session program in sensitivity training, and 23 agreed to do so. As measured by Shack’s Epistemic Orientation Inventory, those who volunteered were higher in self-exploration by over half a standard deviation than those who did not volunteer. In addition, volunteers scored more than two-thirds of a standard deviation higher than nonvolunteers on the Spontaneity Scale of Shostrom’s Personal Orientation Inventory. 
These results, too, seem consistent with the hypothesis that volunteers are more likely to be self-disclosing than their nonvolunteering peers. Alm, Carroll, and Welty (1972) and Welty (n.d.) reported the results of a study in which 56 students of introductory sociology were asked to volunteer for an experiment in psycholinguistics. The 16 students who volunteered were then compared on the Twenty Statements Test to the 40 students who did not. Volunteers told significantly more about their beliefs, aspirations, and preferences than did nonvolunteers, with a median p level of .05 and a median effect size of two-thirds of a standard deviation. In addition, volunteers had fewer omissions than did nonvolunteers (p < .05, effect size = .57). Finally, Aiken and Rosnow (1973) asked 374 high school students to describe how they thought subjects in psychological research were expected to behave. A majority felt that subjects should be frank and trustful, a combination that suggests proneness to self-disclosure. Although there is as yet no very large literature dealing with the relationship of volunteering to self-disclosure, the evidence that is available points to the
probability of such a relationship. It might be hypothesized further that particularly for studies requiring subjects to reveal something personal about themselves, those who volunteer may be considerably more self-disclosing than those who do not volunteer.
Altruism
Although there are many motives operating that lead subjects to volunteer for behavioral research, at least some of these motives are altruistic in nature (Orne, 1969). It is reasonable, therefore, to expect that people who are highly altruistic may be more willing to volunteer to serve science or to help the recruiter than people who are not so highly altruistic, but there are surprisingly few studies that have directly investigated that hypothesis. Wicker (1968a) surveyed 105 members of liberal Protestant churches and found that those who responded sooner to his questionnaire contributed about 80% more financially to their church than did those more reluctant to reply. Although this result is consistent with the hypothesis that more altruistic persons are more likely to contribute their data to the behavioral data pool, alternative interpretations are plausible. Thus, as we shall see later, those higher in socioeconomic status are also more likely to participate in behavioral research. In Wicker’s study, therefore, the more generously contributing early responders may simply have been more able to afford larger contributions than the less generously contributing later responders. We cannot be sure, then, whether Wicker’s earlier responders were actually more altruistic or whether they merely had greater financial resources. Raymond and King (1973) solicited volunteers for research in a variety of undergraduate classes and then administered the Rokeach Instrumental Value Scale to volunteers and nonvolunteers. Results showed that volunteers rated “helpful” as significantly more important than did nonvolunteers (effect size = .38). Valuing helpfulness appears to be related conceptually to altruism, although surely it is not quite the same thing. In a general way, however, this study does seem to support the hypothesis that more altruistic persons are more likely to volunteer. 
Barker and Perlman (1972) administered Jackson’s Personality Research Form (PRF) to all the students enrolled in introductory psychology. Some time later, questionnaires dealing with parent–child relationships were sent to some of the subjects, and respondents were compared with nonrespondents on PRF variables. Respondents were found to score higher on the nurturance measure than nonrespondents, a result consistent with the hypothesis that more altruistic persons are more likely to volunteer. Here again, however, an alternative interpretation is possible. It may be that the content of the questionnaire, parent–child relations, simply was of greater interest to more nurturant subjects, and it appears generally to be the case that questionnaires of greater interest to a particular person are more likely to be filled in and returned. In a scaling study reported in detail in Chapter 5, Aiken and Rosnow (1973) identified students enrolled in introductory psychology as volunteers or nonvolunteers. Subjects were then asked to assess the similarity of various types of activities including ‘‘being a subject in a psychology experiment.’’ For both volunteers and nonvolunteers, serving in an experiment was seen as significantly more related to
altruism than to any other class of experience described (being obedient, being evaluated, being relaxed, being inconvenienced). In addition, this relationship was more marked for volunteers than for nonvolunteers in the sense that there was a greater difference between the similarity between serving as a subject and altruism and the similarity between serving as a subject and the next most similar experience for volunteers (.65 of a standard deviation) than for nonvolunteers (.35 of a standard deviation). Aiken and Rosnow also asked high school students to describe the expected behavior of the subject in a psychological experiment. More than half (56%) of these respondents, none of whom had ever before participated in psychological research, felt that subjects should behave in a ‘‘helpful’’ way. To summarize now what is known about the relationship between volunteering and altruism, we can only say that the evidence is indirect and suggestive rather than direct and strongly supportive. Direct evidence might show that altruistic persons are more likely to volunteer for behavioral research.
Achievement Need
A rich network of relationships has been developed relating the need for achievement to a great variety of other variables (e.g., McClelland, 1961). A close-up of this network should surely reveal research conducted on the relationship between need for achievement and volunteering for behavioral research, and so it does. Atkinson (1955) has, for example, suggested that the fear of failure of persons low in need for achievement should prevent them from volunteering for psychological research, research in which their performance might be evaluated and found wanting. Persons high in need for achievement, on the other hand, should perhaps find the opportunity to perform and be evaluated a challenging opportunity. On the whole, this formulation seems to be supported by the evidence. Table 2-11 summarizes the results of nine studies investigating the relationship between volunteering and need for achievement. Seven of the nine studies showed those higher in need for achievement to be more likely to volunteer for behavioral research.
Table 2–11 Studies of the Relationship Between Volunteering and Need for Achievement

Greater Volunteering Among Subjects:
High in Nach                                      Low in Nach
Bass (1967)ᵃ                                      Heckhausen, Boteram, and Fisch (1970)ᵃ
Burdick (1956) (Atkinson, 1955)ᵃ                  Spiegel and Keith-Spiegel (1969)ᵃ, ᶜ
Cope (1968)ᵃ
Lubin, Levitt, and Zuckerman (1962)
Myers, Murphy, Smith, and Goffard (1966)
Neulinger and Stein (1971)ᵃ
Spiegel and Keith-Spiegel (1969)ᵇ

ᵃ Significant at p < .05. ᵇ Male subjects. ᶜ Female subjects.
The two studies reporting unpredicted results are different in several ways from those reporting the theoretically more expected outcomes. Thus, the research by Heckhausen, Boteram, and Fisch (1970) was presumably conducted at a German university, while the research by Spiegel and Keith-Spiegel (1969) solicited volunteers not for psychological research but for a companionship program in which volunteers were to spend one hour per week for 10 weeks with a psychiatric patient. In this study, too, it was only the female subjects for whom the volunteers scored lower in attraction to fame and power than did the nonvolunteers. Among the male subjects of this same research, however, volunteers scored higher on the test measuring attraction to wealth, power, or fame (the Spiegel Personality Inventory). Thus, all of the studies requesting volunteers for behavioral research and presumably conducted in the United States found those higher in need for achievement to be more willing to volunteer, and four of the six studies in this set showed effects significant at the .05 level. Effect sizes ranged from small (one-sixth of a standard deviation) in the study by Cope (1968) to large in the study by Neulinger and Stein (1971) (a difference of .77 between percentages transformed to arcsin units). Adding to the generality of the results is the fact that need for achievement was defined in a variety of ways for the studies summarized. In the research reported by Bass (1967) and Bass, Dunteman, Frye, Vidulich, and Wambach (1963), subjects were administered Bass’s Orientation Inventory, which permitted grouping of the subjects into those who were task-oriented, interaction-oriented, or self-oriented. Among the task-oriented subjects, whom we may regard as more likely to be high in need for achievement, 71% volunteered for an unspecified research study, while only 60% of the subjects classifiable into the alternative categories did so. 
When more standard measures of need for achievement were employed by Burdick (1956), 77% of those scoring above the median in achievement motivation volunteered for an experiment, while only 58% of those scoring below the median volunteered. A still different definition of need for achievement was involved in the study by Cope (1968). In his research, students who had dropped out of college were surveyed and early responders were compared to late responders on their fears of academic failure. As expected, later responders experienced significantly greater fear of failure than did earlier responders, although, as indicated before, the magnitude of the effect was small. Neulinger and Stein studied women undergraduates who had scored as either intellectually oriented or as socially oriented on the Stein Self-Description Questionnaire. Subjects were then invited to participate in a psychological experiment. Of the intellectually oriented women, those presumably more achievement-motivated, 87% agreed to participate; while among the more socially oriented women, those presumably less achievement-motivated, only 53% agreed to participate. The difference in volunteering rates obtained in this study, although large, may nevertheless have been attenuated by the fact that all subjects (volunteers and nonvolunteers) had actually been volunteers in the first instance, volunteers for taking the Stein Self-Description Questionnaire. Earlier we noted that in the study by Spiegel and Keith-Spiegel, female volunteers for a companionship program scored lower in attraction to fame and power than did the female nonvolunteers. The finding by Neulinger and Stein that their female undergraduates volunteered more when they were presumably higher in need for achievement suggests that
the opposite result obtained by Spiegel and Keith-Spiegel may have been caused more by the nature of the task for which volunteering was solicited than by the sex of the subjects. In the two remaining studies finding volunteers to score higher on need for achievement, those by Lubin, Levitt, and Zuckerman (1962) and by Myers, Murphy, Smith, and Goffard (1966), need for achievement was defined by scores on the Edwards Personal Preference Schedule. In both of these studies, however, the differences between volunteers and nonvolunteers were not significant statistically. We have seen earlier that the act of volunteering is a necessary, but not sufficient condition, for the volunteer actually to find his way into the behavioral researcher’s data set. Also necessary for that outcome is that the volunteer actually appear for his or her appointment with the data collector. Two of the studies summarized in Table 2-11 made special note of the rates of pseudovolunteering by volunteers high and low in need for achievement. Burdick (1956; Atkinson, 1955) found that of those who volunteered, 58% of the high-scorers, but only 42% of the low scorers, actually showed up for their research appointment. The difference was not trivial, but it was also not significant (χ² = 1.12). In their study, Neulinger and Stein (1971) found that 92% of the presumably more achievement-motivated and 88% of the presumably less achievement-motivated volunteers kept their appointments with the experimenter. Although neither of these two studies found a significant difference between the pseudovolunteering rates of presumably high versus presumably low achievement-motivated volunteers, the trends suggest that higher scorers on “need for achievement” (Nach) may better keep their promise to participate. 
From an overview of the results of the research relating volunteering to achievement motivation, the indications are strong that, for American samples at least, volunteers for behavioral research are likely to show higher levels of achievement motivation than nonvolunteers.
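The effect size quoted above for Neulinger and Stein (1971), "a difference of .77 between percentages transformed to arcsin units," can be checked with a few lines of arithmetic. The sketch below assumes that the transformation meant is the usual arcsine transformation of a proportion, 2·arcsin(√p), whose difference is commonly called Cohen's h; the function name `cohens_h` and the dictionary of rates are illustrative only, and the percentages are simply those reported in the text for Bass, Burdick, and Neulinger and Stein.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Difference between two proportions after the arcsine
    transformation phi = 2 * asin(sqrt(p)) (assumed here to be the
    transformation the text refers to)."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2


# Volunteering rates as reported in the text:
# (higher need-for-achievement group, lower need-for-achievement group).
reported_rates = {
    "Bass (1967)": (0.71, 0.60),
    "Burdick (1956)": (0.77, 0.58),
    "Neulinger and Stein (1971)": (0.87, 0.53),
}

for study, (high, low) in reported_rates.items():
    print(f"{study}: h = {cohens_h(high, low):.2f}")
```

Under this assumption the 87% versus 53% comparison comes out at about .77, which agrees with the value given in the text; the other two values are not reported by the authors in these units and are shown only for comparison.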
Approval Need
Crowne and Marlowe (1964) have elegantly elaborated the empirical and theoretical network that surrounds the construct of approval motivation. Employing the Marlowe–Crowne Scale as their measure of need for social approval, they have shown that high-scorers are more easily influenced than low-scorers in a variety of situations. Directly relevant to this section is their finding that high-scorers report a significantly greater willingness to serve as volunteers in an excruciatingly dull task, and the difference they obtained was substantial in magnitude (about two-thirds of a standard deviation). Table 2-12 shows that there were 10 additional studies in significant support of the result reported by Crowne and Marlowe, while another 8 studies did not find a significant difference between volunteering rates by subjects high or low in need for approval.

Table 2–12 Studies of the Relationship Between Volunteering and Need for Approval

Significantly Greater Volunteering by
Subjects Higher in Need for Approval        No Significant Difference
Craddick and Campitell (1963)               Bennett and Hill (1964)
Crowne and Marlowe (1964)                   Edwards (1968a)
Hood and Back (1971)ᵃ                       Hood and Back (1971)ᵇ
Horowitz and Gumenik (1970)                 MacDonald (1972a) I
MacDonald (1972a)                           MacDonald (1972a) II
McDavid (1965)                              MacDonald (1972b)
Mulry and Dunbar (undated)                  Poor (1967) I
Olsen (1968) I                              Poor (1967) II
Olsen (1968) II
Olsen (1968) III
Olsen (1968) IV

ᵃ Male subjects. ᵇ Female subjects.

Craddick and Campitell (1963) asked all of their research subjects to return for an additional session one month later. Those of their subjects scoring higher in need for approval were considerably more likely to return (73%) than were their subjects scoring lower in need for approval (43%). Employing a different measure of need for approval, the Christie–Budnitzky measure, Hood and Back (1971) found the same result for their male subjects, while for their female subjects the relationship between volunteering and approval-need appeared to depend on the task for which volunteering was requested. Still another measure of approval motivation was employed in the studies by McDavid (1965) and by Horowitz and Gumenik (1970). In both studies McDavid’s Social Reinforcement Scale (SRS) was employed, and in both studies volunteers scored higher than did nonvolunteers. In McDavid’s study there were actually two kinds of volunteers, those who volunteered for extra grade credit and those who had already earned their maximum allowable extra grade credit, the dedicated volunteers. The mean SRS scores were significantly higher for the bonus-seeking volunteers than for the nonvolunteers, and they were significantly higher for the dedicated volunteers than for the bonus-seeking volunteers. The magnitude of the difference in SRS scores between nonvolunteers and dedicated volunteers was nearly a full standard deviation (.95). The SRS scores of the less dedicated volunteers fell about halfway between the scores of the nonvolunteers and those of the dedicated volunteers. In the experiment by Horowitz and Gumenik, the magnitude of the difference in SRS scores between nonvolunteers and volunteers who had been offered no reward for volunteering was almost as large (.80). In the research by MacDonald (1972a) volunteers were solicited under three different conditions: for pay, for extra credit, and for love of science. Only in the condition in which pay was offered were subjects high in need for approval on the Marlowe-Crowne Scale more willing to volunteer (77%) than subjects low in need for approval (57%). Under both other conditions of recruitment, there was essentially no difference in volunteering rates as a function of approval need. Mulry and Dunbar (n.d.) 
employed a subset of the Marlowe–Crowne Scale items, but their comparison was not of volunteers with nonvolunteers but of those who volunteer earlier within a semester with those who volunteer later in a semester. Subjects volunteering early in the term earned modified Marlowe–Crowne Scale scores nearly half a standard deviation higher (.45) than those of subjects volunteering late in the term.
In her research, Olsen (1968) examined the relationship between volunteering and need for approval (Marlowe–Crowne) under conditions in which male or female subjects expected to be more versus less favorably evaluated by the experimenter. In all four groups the relationship was positive, but it was significantly greater (point biserial r = .57) under conditions of favorable expectations than under conditions of unfavorable expectations (point biserial r = .26). As the theory of approval motivation predicts, those high in need of approval should be relatively less eager to volunteer for experiments in which they are likely to be evaluated unfavorably. When we examine the studies reporting no significant difference in volunteering between those high or low in approval need, we find no systematic factors that might serve to differentiate these studies from those that did find significant differences. Six of the eight studies employed as their measure of approval motivation the Marlowe–Crowne Scale, an instrument that, as we have seen, has often revealed significant differences. The remaining two studies employed as their measures the California Psychological Inventory’s Good Impression Scale (Bennett and Hill, 1964) and the Christie–Budnitzky measure (Hood and Back, 1971). The latter study had found a significant positive correlation between volunteering and approval need, but only for male subjects. The sex difference per se cannot account for the difference in outcomes, since many studies employing female subjects had found a significant relationship between volunteering and approval motivation. It is possible, however, that sex and a particular instrument may serve as co-moderator variables, but we simply lack the data to evaluate the plausibility of this hypothesis. The nature of the incentives to volunteer also does not appear to differentiate the studies showing significant versus nonsignificant relationships between approval need and volunteering. 
In the two studies by MacDonald (1972a) showing no significant relationship, the incentives were extra credit and love of science, but there were also studies showing quite significant relationships that employed similar incentives. Although the overall results were not significant, a closer look at the studies by C. Edwards (1968a) and by Poor (1967) may be instructive. In one of his studies, Poor solicited volunteers for a psychological experiment and found that volunteers scored half a standard deviation higher on the Marlowe–Crowne Scale than did the nonvolunteers. In his other study, Poor compared respondents with nonrespondents to a mailed questionnaire. In this study respondents scored about a quarter of a standard deviation higher in need for approval than did the nonrespondents. Neither of these studies, however, reached statistical significance as commonly defined, but both did reach the .07 level (one-tailed). Carl Edwards (1968a) invited student nurses to volunteer for a hypnotic dream experiment and found no relationship between need for approval and volunteering. If this study were the only one to find no relationship, we might suppose it could have been caused by either the special nature of the task for which volunteers were solicited or the special nature of the subject population. There are other studies, however, also finding no relationship between volunteering and approval motivation in which more common tasks and more typical subjects were employed. It seems unlikely, therefore, that we can readily ascribe Edwards’ finding to either his task or his population. Edwards reported some intriguing additional results, however. The need for approval scores of the volunteers’ best friends was significantly higher than the need for approval scores of the nonvolunteers’ best friends. In addition, the student nurses’ instructors rated the volunteers as significantly more defensive than
the nonvolunteers. If we were to reason that people chose people like themselves as their best friends, then we might conclude that at least in their choice of best friends, as well as in their instructors’ judgment, Edwards’ volunteers may have shown a higher need for approval not reflected in their own test scores. Edwards went further in his analysis of subjects’ scores on the Marlowe–Crowne Scale. He found a nonlinear trend suggesting that the volunteers were more extreme scorers than the nonvolunteers—that is, either too high (which one would have expected) or too low (which one would not have expected). Edwards’ sample size of 37 was too small to establish the statistical significance of the suggested curvilinear relationship. However, Poor (1967), in both of the samples mentioned earlier, also found a curvilinear relationship, and in both samples the direction of curvilinearity was the same as in Edwards’ study. The more extreme scorers on the Marlowe–Crowne Scale were those more likely to volunteer. In Poor’s smaller sample of 40 subjects who had been asked to volunteer for an experiment, the curvilinear relationship was not significant. However, in Poor’s larger sample of 169 subjects who had been asked to return a questionnaire, the curvilinear relationship was significant. None of the magnitudes of these curvilinear relationships was very great, however. Employing r as an index of relationship as suggested by Friedman (1968), we find the curvilinear correlations to be .19, .12, and .30 for the study by Edwards, the experiment by Poor, and the survey by Poor, respectively— effects that range from small to medium in Cohen’s (1969) definition of effect magnitudes. Before concluding our discussion of the relationship between volunteering and approval motivation, two other studies must be mentioned, one dealing with choice of subject role, the other with pseudovolunteering. 
Efran and Boylin (1967) found in a sample of male undergraduate students, all of whom had volunteered for some form of participation, that volunteers higher in need for approval were less willing to serve as discussants in groups than were volunteers lower in approval need (r = .39, p < .01). Presumably the more approval-motivated volunteers felt they were more likely to be evaluated in a negative way when they were "on display" in an experiment rather than performing the less conspicuous role of observer. Finally, in their study of pseudovolunteering, Leipold and James (1962) compared (on the Marlowe–Crowne Scale) subjects who appeared for their research appointments with those who did not appear. For female subjects there was essentially no difference, but for male subjects those who appeared tended to be higher in approval motivation. This result, while not significant statistically, was of moderate size (.41). A summary of the research investigating the relationship between volunteering and approval motivation is not unduly complicated. When a significant linear relationship was obtained, and more than half the time it was, it always showed that volunteering was more likely among persons high rather than low in approval motivation. In three of the studies finding no significant linear relationship, there were indications of a curvilinear trend such that medium-scorers on the measure of approval need were least likely to volunteer. In trying to understand the reasons why the remaining studies showing no significant relationship failed to do so, it seems unlikely that any single moderator variable can bear the burden of responsibility. However, it is possible that the joint effects of two or more moderator variables may account for the obtained differences in outcomes. Our best candidates for playing the
role of co-moderator variables are (1) the type of task for which volunteering is solicited, (2) the incentives offered to the potential volunteer, (3) the instrument employed to measure approval motivation, and (4) the sex of the prospective volunteers.
Conformity
At first glance it would appear to be essentially tautological to consider the relationship between volunteering and conformity, since the act of volunteering is itself an act of conformity to some authority’s request for participation. We shall see, however, that conforming to a request to volunteer is by no means identical with, and often not even related to, other definitions of conformity. Indeed, Table 2-13 shows that when there is a significant relationship between conformity and volunteering, and that occurs about 30% of the time, the relationship is found to be negative. Subjects lower in conformity, or higher in autonomy, are more likely to volunteer than their more conforming, or less autonomous, colleagues. We should note that although most of the studies listed in Table 2-13 found no significant relationship between volunteering and conformity, there are far more studies that did find a significant relationship than we would expect if there were, in fact, no relationship between these variables in nature. Fisher, McNair, and Pillard (1970) studied subjects high or low in physiological awareness. Among those high in such awareness, there was no relationship between volunteering and social acquiescence scores. Among those low in physiological awareness, however, only 29% of the highly acquiescent subjects volunteered for a drug study, compared to 80% of the less acquiescent subjects (p < .02). All of the subjects in this study had volunteered earlier for psychological testing, so, as is usually the case in a study of second level volunteers, the effects obtained may have been diminished in magnitude by the absence of a sample of true nonvolunteers.
Table 2–13 Studies of the Relationship Between Volunteering and Conformity

Significantly Greater Volunteering by Subjects Lower in Conformity:
    Fisher, McNair, and Pillard (1970)
    MacDonald (1972a)
    MacDonald (1972b)
    Martin and Marcuse (1957)ᵃ
    Newman (1956)

No Significant Difference:
    Edwards (1968a, 1968b)
    Fisher, McNair, and Pillard (1970)
    Foster (1961)
    Frye and Adams (1959)
    Lubin, Levitt, and Zuckerman (1962)
    Martin and Marcuse (1957)ᵇ
    Martin and Marcuse (1958) I
    Martin and Marcuse (1958) II
    Martin and Marcuse (1958) III
    McConnell (1967)
    Newman (1956)
    Spiegel and Keith-Spiegel (1969)

ᵃ Male subjects. ᵇ Female subjects.
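The argument that 5 significant results out of the 17 listed in Table 2-13 are more than chance would produce can be illustrated with a binomial calculation. The sketch below is not taken from the source: it assumes that the 17 tabled results are independent tests and that each has a .05 chance of reaching significance when no true relationship exists. The function name is illustrative only.

```python
from math import comb

def prob_at_least(k, n, alpha=0.05):
    """Binomial probability of k or more significant results among n
    independent tests, each with false-positive probability alpha."""
    return sum(comb(n, i) * alpha**i * (1 - alpha)**(n - i)
               for i in range(k, n + 1))

# Counts read from Table 2-13: 5 significant results out of 17 listed.
# Independence of the 17 results is an assumption of this sketch.
p_chance = prob_at_least(5, 17)
print(f"P(5 or more of 17 significant by chance) = {p_chance:.4f}")
```

Under these assumptions the probability is well below .01, which agrees with the text's reading that the significant results are too numerous to attribute to chance alone.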
In his study, MacDonald (1972a) employed two measures of conformity: MacDonald’s own conformity scale (the C-20) and Barron’s Independence of Judgment Scale. On both measures, volunteers were found to be significantly less conforming than nonvolunteers (p < .005; p < .015), although the magnitude of the effects was not large (.24 and .21, respectively). In his other study, MacDonald (1972b) found that those who signed up for an experiment earlier in the semester were significantly less conforming than their more dilatory colleagues (p = .01, effect size = .32). In their research, Martin and Marcuse (1957, 1958) requested four different samples of subjects to volunteer, each for a different psychological experiment. For three of these studies (learning, personality, and sex research) there was no relationship between volunteering and dominance as defined by the Bernreuter Personality Inventory. For the fourth study, however, that requesting volunteers for hypnosis research, male volunteers were found to score as significantly more dominant than nonvolunteers (p < .05), and the magnitude of the effect was substantial (.62). Among the female subjects of this hypnosis study, however, the relationship between volunteering and dominance was not only not significant but, indeed, was in the opposite direction. Employing as his measure the Edwards Personal Preference Schedule, Newman (1956) found volunteers to be significantly more autonomous than nonvolunteers (p = .01) when the request was for volunteering for a perception experiment. When the request was for volunteering for a personality experiment, however, no significant relationship was obtained. Three studies, listed in Table 2-13 as showing no significant relationship between volunteering and conformity, also employed the Edwards Personal Preference Schedule; those were studies by Carl Edwards (1968a), Frye and Adams (1959), and Lubin, Levitt, and Zuckerman (1962).
The latter two studies found essentially no significant differences, but the study by Edwards actually found a number of nearly significant differences that tended to be conceptually contradictory of one another. Thus, on the EPPS, Edwards’ volunteers for sleep and hypnosis research scored lower on autonomy (p < .08; effect size = .66) and on dominance (p < .06; effect size = .69) than did the nonvolunteers, results that are directionally consistent with those of Lubin et al., who had also employed student nurses as subjects in their survey research. In addition, Edwards’ volunteer student nurses were rated by their psychiatric nursing instructors to be more conforming than the nonvolunteers. In apparent contradiction to these results, however, was the finding that on Carl Edwards’ own Situation Preference Inventory, volunteers scored as less cooperative in their approach to interpersonal interaction when compared to nonvolunteers (p < .06; effect size = .72). It is the contradictory nature of Edwards’ interesting findings, rather than their not having quite reached the .05 level of significance, that led us to list his study as one not showing a significant relationship between conformity and volunteering. We may note, too, that despite their inconsistent directionality, Edwards’ obtained relationships were substantial in magnitude. A similarly complex result was obtained in the study by Spiegel and Keith-Spiegel (1969). Male volunteers were found to be more yea-saying on the Couch–Keniston Scale but also more self-assertive on the Spiegel Personality Inventory than were male nonvolunteers, an apparently inconsistent set of results. For female subjects the results were exactly opposite. We should recall, however,
that volunteering in this study was not for a behavioral research experience but for a program of companionship with psychiatric patients. The as yet unmentioned studies of Table 2-13 were those by Foster (1961) and McConnell (1967). McConnell found essentially no relationship between volunteering and scores on (1) a sociometric measure of dependency or (2) a measure of dependency based on the Rotter Incomplete Sentence Blank. Furthermore, there was no relationship between the two measures of dependency. Finally, Foster found no relationship between volunteering for an experiment and conformity in an Asch-type situation. To summarize the data bearing on the relationship between volunteering and conformity, we found most often that the relationship was not significant. However, there were too many studies finding a significant relationship to permit a conclusion that, in nature, there is no relationship. Furthermore, all five studies reporting significant relationships agreed in direction: less conforming subjects volunteered more. In comparing those experiments in which a significant relationship was found with those in which no significant relationship was found, no single moderating variable is clearly implicated. Both types of outcomes have been found for male and female subjects, for a variety of incentives to volunteer, for a variety of types of task for which volunteering was solicited, and for a variety of measures of conformity, broadly defined. One clue does emerge from an analysis of the potential co-moderating effects of sex and type of research for which volunteering was requested. Based on the results of three studies, we might expect that when the task falls in the "clinical" domain and the subjects are females, greater volunteering may occur among the more, rather than less, conforming subjects.
Clinical is here defined as involving research on hypnosis, sleep, or group counseling, and the studies providing these indications are those by Martin and Marcuse (1957), Edwards (1968a, 1968b), and Lubin, Levitt, and Zuckerman (1962). Even if further research were to bear out this fairly specific prediction, however, we would not expect the magnitude of the effect to be very large.
Authoritarianism
Volunteers have been compared to nonvolunteers on a variety of measures of authoritarianism, and overall the results suggest that volunteers tend to be less authoritarian than nonvolunteers. The most commonly used measure of authoritarianism has been the California F Scale; Table 2-14 shows the results of eight studies in which volunteers for behavioral research were compared to nonvolunteers on the F Scale. The entries for each study are the approximate effect sizes defined as the degree of authoritarianism of the volunteers subtracted from the degree of authoritarianism of the nonvolunteers, with the difference divided by the standard deviation of the sample’s F Scale scores. An effect size with a positive sign, therefore, means that volunteers were less authoritarian than the nonvolunteers. Where it was possible to do so, effect sizes were calculated separately for males and females. The most striking result of an inspection of Table 2-14 may be that all 11 effect sizes are positive in sign, a result that would occur only very rarely if there were
no relationship between volunteering and authoritarianism (p < .002). In addition, 42% of the listed results showed significant relationships. We would expect to find only 5% obtaining such results if there were no relationship between volunteering and authoritarianism. Neither sex of the sample nor the type of task for which volunteering was requested appears to be strongly related to the magnitude of the effect obtained. There may be a hint that type of task may interact with sex of sample in affecting the strength of relationship. Thus, in personality research, female subjects show a stronger relationship between volunteering and authoritarianism than do male subjects, while in perception research the opposite trend is found. There are not enough studies available, however, to permit any rigorous test of this indication. Before leaving Table 2-14, we may note that the grand median effect size is about a quarter of a standard deviation with a range from very nearly 0 to about 1 1/3 standard deviations.

Table 2–14 Studies of the Relationship Between Volunteering for Behavioral Research and Authoritarianism (California F Scale)

Effect Sizes (in σ Units)

Author and Task                           Male      Female    Combinedᵇ  Total
Horowitz and Gumenik (1970): experiment   --        --        --ᶜ
MacDonald (1972a): experiment             --        --        +.16ᵃ
MacDonald (1972b): experiment             --        --        +.29ᵇ
Newman (1956): perception                 +.52ᵃ     +.15      --
Newman (1956): personality                +.03      +.31      --
Poor (1967): questionnaire                +.26      +.22      --
Poor (1967): social psychology            --        --        +.01
Rosen (1951): personality                 +.65ᵃ     +1.31ᵃ    --
Median                                    .39       .26       .16        .26
Number of results                         4         4         4          12
% significant                             50        25        50         42
% positive signs                          100       100       100        100

ᵃ p < .05. ᵇ Separate analysis for males and females not possible. ᶜ Effect size could not be computed; result not significant.

Table 2-15 shows the results of 13 studies in which volunteers were compared to nonvolunteers on a variety of measures of authoritarianism. Once again, there are far more significant results than we would expect if there were no underlying relationship between volunteering and authoritarianism. Of the 22 results, 45% were significant, but not all of these showed volunteers to be less authoritarian. Of the 4 results showing volunteers to be more authoritarian (negative effect sizes), 3 were significant. One of these, the male sample of the study by Spiegel and Keith-Spiegel (1969), actually did not solicit volunteers for research but for a service project in which volunteers were to serve as companions to psychiatric patients.
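As an illustration of the effect-size convention and of the sign reasoning applied to Table 2-14, the following sketch recomputes the grand median of the 11 tabled effect sizes and the probability that all 11 would share one sign by chance. It is not the authors' stated procedure: the two-tailed sign test is an assumption about how the reported p < .002 could be obtained, and the function names are illustrative.

```python
from statistics import median

def effect_size(nonvolunteer_mean, volunteer_mean, sd):
    """Effect size in standard deviation units, as defined in the text:
    nonvolunteers' authoritarianism minus volunteers', divided by the
    sample standard deviation. Positive means volunteers scored lower."""
    return (nonvolunteer_mean - volunteer_mean) / sd

def sign_test_all_same(n):
    """Two-tailed probability that all n signs agree when each sign is
    equally likely to be positive or negative (assumed null model)."""
    return 2 * 0.5 ** n

# The 11 effect sizes that could be computed in Table 2-14.
table_2_14 = [0.52, 0.03, 0.26, 0.65,   # male samples
              0.15, 0.31, 0.22, 1.31,   # female samples
              0.16, 0.29, 0.01]         # combined samples

grand_median = median(table_2_14)
p_value = sign_test_all_same(len(table_2_14))
print(grand_median, p_value)
```

With hypothetical F Scale means of 110 for nonvolunteers and 100 for volunteers and a standard deviation of 20, `effect_size(110, 100, 20)` gives half a standard deviation in the direction of less authoritarian volunteers.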
Table 2–15 Studies of the Relationship Between Volunteering and Authoritarianism (Other Measures)

Effect Sizes (in σ Units)

Author and Task                                     Male      Female    Combinedᵃ  Total
Benson, Booman, and Clark (1951): home interview    --        --        +.39ᵇ
Burchinal (1960): questionnaire                     +.12      +.39ᵇ     --
Cope (1968): questionnaire                          --        --        −.19ᵇ
Ebert (1973): questionnaire                         --        --        −.13ᵇ
Martin and Marcuse (1958): hypnosis                 +1.10ᵇ    +.41      --
Martin and Marcuse (1958): learning                 --ᶜ       --ᶜ       --
Martin and Marcuse (1958): personality              --ᶜ       --ᶜ       --
Martin and Marcuse (1958): sex                      --ᶜ       --ᶜ       --
Raymond and King (1973): research                   --        --        +.37ᵇ
Schubert (1964): experiment                         +.05      −.07      --
Spiegel and Keith-Spiegel (1969)ᵈ                   −.85ᵇ     +.29      --
Teele (1967): home interview                        +.02      +.20ᵇ     --
Wallin (1949): questionnaire                        +.13ᵇ     +.18ᵇ     --
Median                                              .08       .24       .12        .16
Number of results                                   9         9         4          22
% significant                                       33        33        100        45
% positive signs                                    83        83        50         75

ᵃ Separate analysis for males and females not possible. ᵇ p < .05. ᶜ Effect size could not be computed; result not significant. ᵈ Employed F Scale but was not a request for behavioral research volunteering (psychiatric companionship program).

The other 2 studies finding volunteers to be significantly more authoritarian, those by Cope (1968) and Ebert (1973), had in common that they both employed return of mailed questionnaires as their definition of volunteering and that self-reported political conservatism was their definition of authoritarianism. The reversal of the usual effect in these two studies is probably related more to their use of self-report
measures of political conservatism than to their definition of volunteering. Tables 2-14 and 2-15 show an additional 6 results of studies of respondents to questionnaires, and all 6 of these show volunteers to be less authoritarian than nonvolunteers, with a median effect size of .20 which is exactly equal to the grand median effect size of all the results shown in Tables 2-14 and 2-15. Just why a self-report measure of political conservatism should be positively related to returning a questionnaire is not at all clear, however. Other measures or definitions of authoritarianism employed in the studies listed in Table 2-15 include indices of prejudice or intolerance (Benson, Booman, and Clark, 1951; Schubert, 1964), F Scale-like measures (Burchinal, 1960; Martin and Marcuse, 1958), the valuing of obedience (Raymond and King, 1973), interviewer rating (Wallin, 1949), and self-report (Teele, 1967). The study by Teele was not actually a study of volunteering for behavioral research but of general social participation. One other feature of Table 2-15 should be noted; the effect sizes tend to be larger for females than for males. Of the six results available for females and males separately, five (83%) of the females’ results exceed the table’s median effect size of .16, while only one (17%) of the males’ results exceeds that value (p < .10). If we omit the two studies in which volunteering was not for behavioral research (Spiegel and Keith-Spiegel, 1969; Teele, 1967), this difference in effect sizes between males and females is slightly diminished, however. An overall assessment of the relationship between volunteering and authoritarianism must conclude that volunteers tend to be less authoritarian than nonvolunteers. Of 27 results in which effect sizes could be estimated, 85% are in directional support of this summary (p < .001). There are some indications that
this relationship may be somewhat stronger for females than for males except possibly for studies investigating hypnosis or perception. For all results considered, the magnitudes of the effects range from very small to very large, but the most likely effect size to occur appears to be about one-fifth of a standard deviation.
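The comparison of female and male effect sizes in Table 2-15 (five of six female results above the table median of .16, against one of six male results) can be checked with an exact test on the resulting 2 × 2 table. The source does not say which test produced the reported p < .10, so the two-tailed Fisher exact test used below is an assumption, as are the function names.

```python
from math import comb

def hypergeom_prob(a, row1, row2, col1):
    """Probability of a 2x2 table with `a` in the top-left cell, given
    fixed row totals (row1, row2) and first-column total col1."""
    return comb(row1, a) * comb(row2, col1 - a) / comb(row1 + row2, col1)

def fisher_exact_two_tailed(a, b, c, d):
    """Two-tailed Fisher exact p for the table [[a, b], [c, d]]: the sum
    of the probabilities of all tables no more likely than the observed."""
    row1, row2, col1 = a + b, c + d, a + c
    p_obs = hypergeom_prob(a, row1, row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = [hypergeom_prob(x, row1, row2, col1) for x in range(lo, hi + 1)]
    # Small tolerance so that ties with the observed table are included.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))

# Rows: females, males. Columns: above .16, not above .16.
p_sex_difference = fisher_exact_two_tailed(5, 1, 1, 5)
print(round(p_sex_difference, 3))
```

Under this assumed test the probability falls below .10 but not below .05, which is consistent with the cautious wording of the text about a sex difference in effect sizes.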
Conventionality
Volunteers tend to be less authoritarian than nonvolunteers, and since there is a sense in which less authoritarian people are also less conventional, we might expect to find that volunteers would be less conventional than nonvolunteers. Table 2-16 shows that this tends to be the case, but not unequivocally so. There are 11 studies showing volunteers to be less conventional with effect sizes ranging from .20 to .86 (median = .40). Five studies reported no difference in conventionality between volunteers and nonvolunteers (median effect size = .06) and 4 studies reported that volunteers were more conventional with effect sizes ranging from .21 to .63 (median = .27). For all 20 studies of Table 2-16, the median study shows volunteers to be less conventional with an effect size of .20. In trying to understand the variables that may serve to differentiate the three types of outcome shown in Table 2-16, we consider in turn the sex of the samples employed, the type of research for which volunteering was requested, and the definition of the variable of conventionality. Of the 20 studies available, 7 employed male samples, 6 employed female samples, and 7 employed mixed samples for which separate secondary analyses by sex could not be carried out. The results of these three types of studies were quite similar with median effect sizes of .20, .19, and .28 respectively, and all in the same direction. The ratios of proportion of significant results showing conventionals to volunteer more to proportion of significant results showing unconventionals to volunteer more were also very similar for these three types of studies (.40, .33, .33, respectively). The gender composition of our three groups of studies, then, does not appear to be a critical variable in accounting for differences in outcome.
Table 2–16 Studies of the Relationship Between Volunteering and Conventionality

Unconventionals Volunteer Significantly More:
    Jaeger et al. (1973)
    Kaats and Davis (1971) I
    Kaats and Davis (1971) II
    Kivlin (1965)
    London et al. (1962)
    Maslow (1942)
    Reid (1942)
    Rosen (1951)
    Schubert (1964) I
    Schubert (1964) II
    Siegman (1956)

No Significant Difference:
    Cohler et al. (1968)
    Cope (1968)
    Heilizer (1960)
    Rosen (1951)
    Wallin (1949)

Conventionals Volunteer Significantly More:
    Ebert (1973)
    Edwards (1968a)
    Myers et al. (1966) I
    Myers et al. (1966) II
In considering the types of research for which volunteering was solicited, we turn first to those studies reporting volunteers to be more conventional. The studies by Ebert (1973) and Edwards (1968a) were questionnaire and sleep–hypnosis studies, respectively; and since similar types of studies have found volunteers to be less conventional, it seems unlikely that the nature of these tasks can be implicated as a moderator variable. The studies by Myers, Murphy, Smith, and Goffard (1966), however, were studies of sensory isolation conducted with army servicemen. These were the only studies of Table 2-16 that employed either that task or that type of subject so that either or both task and sample may have accounted for the reversal of the more usual finding. Similar analyses of tasks employed in studies finding no significant differences revealed no special features that might have accounted for the failure to find some significant relationship between volunteering and conventionality. The requests for volunteers in this set of studies included invitations for (1) "further research" (Cohler, Woolsey, Weiss, and Grunebaum, 1968), (2) questionnaire response (Cope, 1968; Wallin, 1949), (3) hypnosis (Heilizer, 1960), and (4) personality research (Rosen, 1951). When we consider the studies finding volunteers to be more unconventional, we find that four of the studies requested volunteers for research on sex behavior (Kaats and Davis, 1971, I, II; Maslow, 1942; Siegman, 1956). These studies included both male and female subjects and the median effect size was above average for the studies under discussion and substantial in absolute terms (.50). In all four of these studies, conventionality was defined in terms of sex behavior or views of sex roles.
The other studies finding volunteers to be more unconventional included two that solicited volunteers for an unspecified experiment (Schubert, 1964 I, II) and one that solicited volunteers for a "group experience" (Jaeger, Feinberg, and Weissman, 1973). The remaining studies finding volunteers to be more unconventional employed tasks that had also been employed in studies reporting no significant difference or even a significant reversal. Included in this group are questionnaire studies (Kivlin, 1965; Reid, 1942), hypnosis research (London, Cooper, and Johnson, 1962), and personality research (Rosen, 1951). In the studies under discussion, a wide variety of measures of conventionality have been employed. Seven of the studies shown in Table 2-16 employed the Pd scale of the MMPI as their measure of unconventionality. The median outcome of these seven studies showed volunteers to be more unconventional with an effect size of .20, the same value associated with the grand median outcome of all the studies listed in Table 2-16. For the studies of sex behavior, the definition of unconventionality was in terms of unconventionality of sexual attitudes and behavior as mentioned earlier. The remaining studies each employed a different definition of conventionality. Overall, there appeared to be no relationship between the direction and magnitude of the association between volunteering and conventionality and the type of definition of conventionality that was employed. In summarizing the relationship between volunteering and conventionality, we note first that 75% of the relevant studies show a significant relationship and that 55% show volunteers to be significantly less conventional. Studies of sex behavior show a particularly consistent tendency to yield such results and with a substantial magnitude of effect.
Of the studies showing significant results in the opposite direction, there was some indication that when recruitment was for sensory isolation research, more conventional servicemen would volunteer.
Arousal-Seeking
Schubert (1964) and Zuckerman, Schultz, and Hopkins (1967) have proposed that volunteers for behavioral research are likely to be more motivated than nonvolunteers to seek arousal, sensation, or input. Table 2-17 shows the results of studies investigating the relationship between volunteering and arousal-seeking. Although there are a good many studies showing no significant relationship, and even a few showing a significant (p < .10) reversal of the relationship, there is a large set of studies in support of the hypothesis. Furthermore, there are a number of studies showing effect sizes greater than a full standard deviation (Aderman, 1972; Zuckerman et al., 1967 I and 1967 II), and the median effect size for the studies showing volunteers to be more arousal-seeking was .52. For the three studies showing volunteers to be less arousal-seeking, the effect sizes were smaller (.23, .24, and .54).

Table 2–17 Studies of the Relationship Between Volunteering and Arousal-Seeking

Arousal-Seekers Volunteer Significantly More:
    Aderman (1972)
    Barker and Perlman (1972)
    Howe (1960)
    Jaeger et al. (1973)
    Myers et al. (1967)
    Riggs and Kaess (1955)
    Schubert (1964)
    Schultz (1967b)
    Schultz (1967c)
    Siess (1973)
    Zuckerman et al. (1967) I
    Zuckerman et al. (1967) II
    Zuckerman et al. (1967) III

No Significant Difference:
    MacDonald (1972a)
    MacDonald (1972b)
    Myers et al. (1966)
    Ora (1966)
    Poor (1967) I
    Poor (1967) II
    Rosen (1951) I
    Waters and Kirk (1969) I
    Waters and Kirk (1969) II
    Waters and Kirk (1969) III

Arousal-Seekers Volunteer Significantly Less:
    London et al. (1962)
    Rosen (1951) II
    Rosnow et al. (1969)

In trying to understand the factors that might account for the obtained differences in outcome, we grouped the studies by the nature of the task for which volunteering had been solicited; Table 2-18 shows the results. Studies involving stress (e.g., electric shock, temperature extremes), sensory isolation, and hypnosis showed much larger effect sizes than did other types of research. The 11 results comprising these three types of studies included 8 (73%) with an effect size greater than a third of a standard deviation in the predicted direction of greater volunteering by subjects higher in arousal-seeking. The 14 results comprising the remaining types of studies included only 2 with effects that large (14%). Significance testing is of limited value in this type of situation, but should one want the results of a Kolmogorov–Smirnov test of the maximum difference of cumulative distributions, p would be found to be less than .02 (χ² = 8.42, df = 2).

Table 2–18 Median Effect Sizes (in σ Units) Obtained in Seven Types of Studies

Type of Study            Effect Size   Number of Studies
Stress                   +.54          4
Sensory isolation        +.54          5ᵃ
Hypnosis                 +.54          3ᵃ
Group research           +.25          2
Questionnaire            +.12          2
Personality-clinical     +.12          4
General experiments      .00ᵇ          6

ᵃ One study is listed in both categories. ᵇ Approximation only.

About half of the studies examining the relationship between volunteering and arousal-seeking were conducted with samples composed of both males and females, samples for which it was not possible to obtain separate estimates of effect sizes for the two sexes. The remaining studies were conducted with subjects of one sex only, or they could be analyzed for each sex separately. Table 2-19 gives the median effect sizes for male, female, and mixed samples for studies involving stress (including sensory isolation) or hypnosis and for more general studies. For both types of studies, results based on male samples are quite similar to results based on female samples and the grand median effect sizes for males (.46) and females (.51) are very nearly the same. Interestingly, but inexplicably, those studies for which the data were not analyzable separately for males and females were the studies most likely to show no relationship between volunteering and arousal-seeking. Examination of the number of studies in each cell of Table 2-19 also shows an interesting relationship between type of study and whether data could be analysed separately for each sex. Of studies investigating stress or hypnosis, 77% studied males and/or females separately, while in the remaining, more general studies only 31% did so (χ² = 4.30, p < .05, φ = .39).
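The two test statistics reported in this passage can be reproduced from the counts given in the text and in Table 2-19. The sketch below is a reconstruction rather than the authors' stated procedure. It assumes the large-sample chi-square approximation to the two-sample Kolmogorov–Smirnov test (4D²·n1·n2/(n1 + n2), with 2 degrees of freedom) and a 2 × 2 chi-square with Yates's continuity correction, because those choices return the reported values; the function names are illustrative.

```python
from math import exp, sqrt

def ks_chi_square(prop1, n1, prop2, n2):
    """Chi-square approximation (df = 2) for the two-sample
    Kolmogorov-Smirnov statistic D, the maximum difference between
    two cumulative proportions."""
    d = abs(prop1 - prop2)
    chi2 = 4 * d**2 * n1 * n2 / (n1 + n2)
    p = exp(-chi2 / 2)  # upper-tail probability of chi-square with df = 2
    return chi2, p

def yates_chi_square_phi(a, b, c, d):
    """Yates-corrected chi-square (df = 1) for the table [[a, b], [c, d]]
    and the phi coefficient computed as sqrt(chi2 / N)."""
    n = a + b + c + d
    numerator = n * (abs(a * d - b * c) - n / 2) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = numerator / denominator
    return chi2, sqrt(chi2 / n)

# 8 of 11 stress, isolation, or hypnosis results versus 2 of 14 other
# results exceeded one third of a standard deviation.
ks_chi2, ks_p = ks_chi_square(8 / 11, 11, 2 / 14, 14)

# Table 2-19 counts: 10 of 13 stress or hypnosis studies and 5 of 16
# other studies reported males and/or females separately.
chi2, phi = yates_chi_square_phi(10, 3, 5, 11)
print(round(ks_chi2, 2), round(ks_p, 3), round(chi2, 2), round(phi, 2))
```

That both reported values are recovered suggests, without proving, that these were the computations behind them; an uncorrected 2 × 2 chi-square on the same counts would give a larger value.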
This difference may be due to investigators of stress and hypnosis having more often found sex to be a variable moderating the outcomes of their research than has been the case for investigators of more standard research topics. Our own analysis of sex differences in volunteering suggests that although women generally volunteer more readily than men, they volunteer substantially less than men when the research for which they are being recruited involves stress. This finding, that men and women volunteers are differentially more or less representative depending on the type of research for which volunteering was solicited, further supports the practice of analyzing the results of research separately for male and female subjects.

Table 2–19 Median Effect Sizes (in σ Units) by Type of Study and Sex of Subjectᵃ

                 Stress and Hypnosis    Other Studies       Total
Subject Sex      Effect    N            Effect    N         Effect    N
Males            .61       6            .24       2         .46       8
Females          .80       4            .33       3         .51       7
Mixed            .00       3            .00       11        .00       14
Total            .52       13           .00       16        .25       29

ᵃ Values given as .00 are approximations.

For the studies listed in Table 2-17, various definitions of arousal-seeking have been employed. A specially developed measure, the Sensation-Seeking Scale, was
employed in seven of these studies (Schultz, 1967b; Waters and Kirk, 1969, I, II, III; Zuckerman et al., 1967, I, II, III) with a median effect size of .54. Eight studies employed as their definition of arousal-seeking the use of alcohol, caffeine, or tobacco (MacDonald, 1972a, b; Myers et al., 1966; Ora, 1966; Poor, 1967, I, II; Rosnow et al., 1969; and Schubert, 1964) with a median effect size of approximately zero. We cannot conclude, however, that the measure of arousal-seeking employed determines whether there will be a significant relationship between volunteering and arousal-seeking. The reason is that the particular measure of arousal-seeking employed is highly correlated with the type of study conducted. Thus, 71% of the studies employing the Sensation-Seeking Scale were in our group of stress studies, while only 12% of the studies employing the use of chemicals were in our group of stress studies. For the studies employing these two measures of arousal-seeking, then, we cannot decide whether it is the measure employed or the type of task that accounts for the difference in magnitude of the effect of arousal-seeking on volunteering. It may well be both. Among the remaining studies, in which various measures of arousal-seeking were employed, there was still a tendency for the stress studies to show a greater effect size (median = .46) than the remaining studies (median = .25). It appears, then, that while the use of the Sensation-Seeking Scale may lead to a greater relationship between volunteering and arousal-seeking, even when other measures are employed, studies of stress and hypnosis show a greater effect of arousal-seeking on volunteering than do more typical studies. Table 2-20 summarizes this state of affairs.
Other measures that have been employed to investigate the relationship between arousal-seeking and volunteering include ratings of tasks (Aderman, 1972), Jackson’s Personality Research Form ‘‘Play’’ Scale (Barker and Perlman, 1972), Shockavoidance (Howe, 1960), the Omnibus Personality Inventory ‘‘Impulsive’’ Scale (Jaeger, Feinberg, and Weissman, 1973), the Ma Scale of the MMPI (London, Cooper, and Johnson, 1962; Rosen, 1951, I, II), the Thrill-Seeking Scale (Myers, Smith, and Murphy, 1967), the Guilford Cycloid Scale (Riggs and Kaess, 1955), the Cattell 16-PF index of adventurousness (Schultz, 1967c), and the Harmavoidance Scale of the Jackson Personality Research Form (Siess, 1973). This last study differed from the others in that subjects were asked to judge their preferences for various types of experiments rather than being asked directly to volunteer for behavioral research.

Table 2–20 Median Effect Sizes (in σ Units) by Type of Study and Type of Measureᵃ

                           Stress and Hypnosis    Other Studies      Total
Type of Measure            Effect      N          Effect      N      Effect      N
Sensation-Seeking Scale    .71         5          .00         2      .54         7
Chemical usage             .00         1          .00         7      .00         8
Other measures             .46         6          .25         5      .44        11
Total                      .50        12          .00        14      .24        26
ᵃ Values given as .00 are approximations.

In summarizing the results of studies of the relationship between volunteering and arousal-seeking, we note first that overall, volunteers are more arousal-seeking than
nonvolunteers, with a median effect size of about a quarter of a standard deviation. The type of study for which volunteering is requested, however, appears to act as a moderator variable such that studies of stress, sensory isolation, and hypnosis tend to show this relationship to a much greater degree than do more ordinary kinds of studies. It is, in fact, possible that for these ordinary kinds of studies, the true effect size may be only trivially greater than zero. There is also some evidence to suggest that the relationship between volunteering and arousal-seeking may be somewhat greater when the latter variable is defined in terms of scores on a particular instrument, the Sensation-Seeking Scale.
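The confounding of measure with type of study can be checked directly from the study counts reported in Table 2-20. What follows is a minimal sketch in Python and is not part of the original analysis; the dictionary layout and the names `counts` and `stress_share` are our own assumptions, and only the counts themselves come from the table.

```python
# Study counts from Table 2-20: (stress and hypnosis studies, other studies).
# The data structure and names are illustrative assumptions, not the authors'.
counts = {
    "Sensation-Seeking Scale": (5, 2),
    "Chemical usage": (1, 7),
    "Other measures": (6, 5),
}


def stress_share(stress, other):
    """Proportion of the studies using a measure that were stress or hypnosis studies."""
    return stress / (stress + other)


shares = {measure: stress_share(*n) for measure, n in counts.items()}
total_studies = sum(stress + other for stress, other in counts.values())

for measure, share in shares.items():
    print(f"{measure}: {share:.1%} of studies were stress or hypnosis studies")
print(f"Total studies: {total_studies}")
```

The shares for the Sensation-Seeking Scale (5 of 7) and for chemical usage (1 of 8) correspond to the 71% and 12% cited earlier, which is why the effect of the measure cannot be separated from the effect of the type of study in these data.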
Anxiety

Many investigators have addressed the question of whether volunteers differ from nonvolunteers in their level of anxiety. Table 2-21 gives the results of 35 such studies. Although most of the results were not significant statistically, there are far too many results that were significant to permit a conclusion that there is no relationship between volunteering and anxiety level. Unfortunately, however, from the point of view of simple interpretation there are nearly equal numbers of studies showing volunteers to be more anxious and less anxious than nonvolunteers. Two variables, fairly well confounded with one another, may serve as moderators for the relationship between volunteering and anxiety: the nature of the sample of subjects and the nature of the task for which volunteers had been solicited. All of the studies finding volunteers to be significantly more anxious were based on college students who had been asked to volunteer for fairly standard behavioral research.

Table 2–21 Studies of the Relationship Between Volunteering and Anxiety

Volunteers Significantly More Anxious
  Barefoot (1969) I
  Barefoot (1969) II
  Heckhausen et al. (1970)
  Jaeger et al. (1973)
  Martin and Marcuse (1958) I
  Riggs and Kaess (1955)
  Rosen (1951) I
  Rosen (1951) II
  Schubert (1964)

No Significant Difference
  Barefoot (1969) III
  Barefoot (1969) IV
  Barefoot (1969) V
  Barefoot (1969) VI
  Carr and Whittenbaugh (1968)
  Cope (1968)
  Francis and Diespecker (1973)
  Heilizer (1960)
  Himelstein (1956)
  Hood and Back (1971)
  Howe (1960)
  Lubin et al. (1962a)
  Martin and Marcuse (1958) II
  Martin and Marcuse (1958) III
  Martin and Marcuse (1958) IV
  Philip and McCulloch (1970) I
  Siegman (1956)
  Zuckerman et al. (1967) I
  Zuckerman et al. (1967) II

Volunteers Significantly Less Anxious
  Cohler et al. (1968)
  Martin and Marcuse (1958) V
  Myers et al. (1966)
  Myers et al. (1967) I
  Myers et al. (1967) II
  Philip and McCulloch (1970) II
  Scheier (1959)

Of the studies finding volunteers to be less anxious, on the other hand, only two (29%)
employed college samples (Martin and Marcuse, 1958, V; Scheier, 1959) and only two employed relatively standard requests for research participation (Cohler et al. 1968; Scheier, 1959). Of the studies making less standard requests of noncollege samples, three studies solicited volunteers for sensory deprivation from a population of military servicemen (Myers et al., 1966; 1967, I, II), and one study solicited questionnaire responses from males who had attempted suicide some time earlier (Philip and McCulloch, 1970). These four studies suggest that when the tasks for which volunteering is solicited are likely to be perceived as stressful or when the subject populations have undergone severe stress in their fairly recent past, only those who are relatively less anxious will be likely to volunteer. A more anxious, more fearful person may simply be unwilling to expose himself to further anxiety-arousing situations. It is interesting to note, too, that all four of these studies employed male subjects, so that our very tentative interpretation may be restricted to male subjects. The study by Martin and Marcuse (1958, V) also found less anxious males to volunteer more, this time for a study of hypnosis. The study by Cohler et al. (1968), however, employed female subjects and also found the volunteers to be less anxious than the nonvolunteers. These volunteers were actually second-level volunteers drawn from a sample of mothers all of whom had answered newspaper advertisements requesting volunteers willing to participate in research on parent–child relations. Because the second-level volunteers of this research were compared to first-level volunteers rather than to true nonvolunteers, the size of the effect obtained may be an underestimate of the true effect size. Nevertheless, the effect size was substantial (.47). 
The remaining study reporting volunteers to be significantly less anxious (Scheier, 1959) employed both male and female subjects who were asked to volunteer for a relatively unspecified study characterized as somewhat threatening. To recapitulate our hypothesis or interpretation, it appears that when the task for which volunteering is requested is potentially stressful or threatening and/or when the subjects are drawn from the world beyond the academic and when there is some significant difference to be found, it will be the less anxious persons who will be more likely to volunteer. For these studies the median effect size was .36. When the task for which volunteering is solicited is fairly standard and/or when the subjects are college students and when there is some significant difference to be found, it will be the more anxious persons who will be more likely to volunteer. These subjects may be concerned over what the recruiter will think of them should they refuse to cooperate, or they may simply be the subjects who are willing to run both the very minor perceived risk of participating in an experiment and the very minor perceived risk of admitting that they experience some anxiety. For these studies, the median effect size was .34. While we may have a reasonable hypothesis for differentiating studies reporting volunteers to be significantly more versus less anxious, we do not have good hypotheses differentiating studies reporting some significant effect from studies reporting no significant effect. Thus, eight studies reporting no significant difference in anxiety between volunteers and nonvolunteers employed college samples and fairly standard types of tasks, so that we might have expected volunteers to be more anxious. For these studies, however, the median effect size was essentially zero. 
Of the remaining studies, four were of hypnosis (Heilizer, 1960; Lubin et al., 1962a; Martin and Marcuse, 1958 II; Zuckerman et al., 1967 I), two of sex research (Martin and Marcuse, 1958 IV; Siegman, 1956), two of sensory deprivation
(Francis and Diespecker, 1973; Zuckerman et al., 1967 II), two of neuropsychiatric patients (Carr and Whittenbaugh, 1968; Philip and McCulloch, 1970 I), and one of electric shock (Howe, 1960). For these studies we might have expected to find the volunteers to be less anxious than the nonvolunteers, but that was not the case. The median effect size again was essentially zero. Further analysis of the studies shown in Table 2-21 failed to turn up any general suggestive relationship between gender composition of the samples and the studies’ outcome. Investigators of the relationship between volunteering and anxiety have employed a wide variety of measures of anxiety. The most commonly used measures, however, have been a derivative or a subtest of the MMPI. Fourteen studies employed the Taylor Manifest Anxiety Scale and five employed either the Pt or D scales of the MMPI. There was a tendency for studies employing these measures more often to find no significant difference than was the case for the wide variety (11) of other measures employed. Thus, of studies employing the Taylor Scale or the Pt (or D) scales, only 32% reported a significant result, while for the remaining studies 62% reported a significant result. It must be noted, however, that the use of any given measure of anxiety was partially confounded with the type of experiment for which recruitment of volunteers had been undertaken. Less relevant to the question of volunteering but quite relevant to the related question of who finds their way into the role of research subject is the study by Leipold and James (1962). Male and female subjects who failed to appear for a scheduled psychological experiment were compared with subjects who kept their appointments. Among the female subjects, those who appeared did not differ in anxiety on the Taylor Scale from those who did not appear. 
However, male subjects who failed to appear—the determined nonvolunteers—were significantly more anxious than those male subjects who appeared as scheduled, and the effect was strong (.84). These findings not only emphasize once again the potential importance of sex as a moderating variable in studies of volunteer characteristics but also that it is not enough simply to know who volunteers; even among those subjects who volunteer there are likely to be differences between those who actually show up and those who do not. To summarize the results of studies of the relationship between volunteering and anxiety, we note first that most studies report no significant relationship; however, a far larger percentage (46%) than could be expected to occur if there were actually no relationship between volunteering and anxiety does show a significant relationship. Among the studies showing significant relationships, those employing college students as subjects and fairly standard types of tasks tend to find volunteers to be more anxious than nonvolunteers. Studies employing nonstudent samples and potentially stressful types of tasks, on the other hand, tend to find volunteers to be less anxious than nonvolunteers. In both kinds of studies the magnitudes of the effects obtained tend to be about one-third of a standard deviation.
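The observation that 46% significant results is far more than chance would produce can be illustrated with a simple binomial calculation. This is only a rough sketch under assumptions that are ours rather than the text's: that the 35 results are independent of one another, and that each would reach significance with probability .05 if there were truly no relationship. The counts (9 + 7 significant results among 35) are taken from Table 2-21.

```python
from math import comb

n_studies = 35          # results listed in Table 2-21
n_significant = 9 + 7   # volunteers more anxious + volunteers less anxious
alpha = 0.05            # assumed per-study false-positive rate (our assumption)


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) for X distributed as Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


observed_share = n_significant / n_studies
tail_probability = binomial_upper_tail(n_significant, n_studies, alpha)

print(f"Observed share of significant results: {observed_share:.0%}")
print(f"P(at least {n_significant} of {n_studies} by chance): {tail_probability:.2e}")
```

Under these assumptions the chance probability is vanishingly small, which is consistent with the conclusion that some relationship exists even though its direction varies from study to study.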
Psychopathology

We now turn our attention to variables that have been related to global definitions of psychological adjustment or pathology. Some of the variables discussed earlier have also been related to global views of adjustment, but our discussion of them was intended to carry no special implications bearing on subjects’ psychopathology. For example, when anxiety was the variable under discussion, it was not intended that more anxious subjects be regarded as more maladjusted. Indeed, within the normal range of anxiety scores found, the converse might be equally accurate.

Table 2–22 Studies of the Relationship Between Volunteering and Psychopathology

Volunteers Significantly More Maladjusted
  Bell (1962)
  Conroy and Morris (1968)
  Esecover et al. (1961)
  Lasagna and von Felsinger (1954)
  London et al. (1962)
  Lubin et al. (1962a, b)
  Ora (1966)
  Philip and McCulloch (1970) I
  Pollin and Perlin (1958)
  Riggs and Kaess (1955)
  Rosen (1951) I
  Schubert (1964)
  Silverman (1964)
  Stein (1971)
  Valins (1967)

No Significant Difference
  Bennett and Hill (1964)
  Dohrenwend and Dohrenwend (1968)
  Francis and Diespecker (1973)
  Myers et al. (1966)
  Philip and McCulloch (1970) II
  Poor (1967) I
  Rosen (1951) II
  Siegman (1956)
  Tune (1968)
  Tune (1969)

Volunteers Significantly Better Adjusted
  Carr and Whittenbaugh (1968)
  Kish and Hermann (1971)
  Maslow (1942)
  Maslow and Sakoda (1952)
  Poor (1967) II
  Schultz (1967c)
  Sheridan and Shack (1970)
  Tiffany et al. (1970)
  Wrightsman (1966)

Table 2-22 shows the results of 34 studies of the relationship between volunteering and psychopathology. Most of the studies (71%) have reported a significant relationship, with 15 studies (44%) finding volunteers to be more maladjusted and nine studies (26%) finding volunteers to be better adjusted. In trying to understand the factors operating to produce such contradictory, yet so frequently significant, results, special attention was given to (1) the nature of the task for which volunteering had been solicited, (2) the gender composition of the samples, and (3) the measures of psychopathology employed. Table 2-23 shows the median effect sizes, in standard deviation units, obtained in eight types of studies. A positive effect size indicates that volunteers were found to be more poorly adjusted; a negative effect size indicates that volunteers were found to be less poorly adjusted.

Table 2–23 Median Effect Sizes (in σ Units) Obtained in Eight Types of Studies

Type of Study           Effect Size    Number of Studies
High temperature        +.80            1
General experiments     +.49           11
Drugs and hypnosis      +.47            5
Sensory deprivation      .00ᵃ           3
Field interviews         .00ᵃ           1
Questionnaire           −.10            9
Sex behavior            −.39            3
Sensitivity training    −.61            1
ᵃ Approximation only.

Volunteers for studies of (1) the effects on performance of
high temperatures, (2) general psychological research topics, and (3) the effects of drugs and hypnosis tended to show greater degrees of psychopathology than did nonvolunteers with a median effect size of about half a standard deviation (.49). There was essentially no difference between volunteers and nonvolunteers in degree of maladjustment when recruitment was for studies involving sensory deprivation, field interviews, or questionnaires. However, when recruitment was for studies of sex behavior or for exposure to sensitivity training, volunteers were found to be better adjusted than nonvolunteers with a median effect size of exactly half a standard deviation. The pattern of results described permits no simple or obvious interpretation. The studies for which volunteers tend to be more maladjusted, however, may have in common that subjects will be exposed to unusual situations (heat, drugs, or hypnosis) or believe that they may be exposed to unusual situations (general experiments). Perhaps more maladjusted subjects feel that if they were to react inappropriately under such circumstances, it would not be because of their (perhaps vaguely suspected) psychopathology but because of the unusualness of the research situation; that is, when the situation is acutely unusual, subjects may feel that they cannot be held responsible for their own chronically unusual behavior. At the same time, volunteering would put these more maladjusted subjects into contact with someone probably perceived as an expert in adjustment who might be able to help the volunteer with his or her adjustment problems. In short, the more maladjusted volunteer may see an opportunity for personal help in a context wherein the risk of ‘‘looking too bad’’ is low. After all, if the volunteer were to behave strangely, it could be ascribed to the heat, the drugs, the hypnosis, or the strange situation of the psychological experiment. 
In research on sexual behavior or in sensitivity training, however, the situation may be somewhat different. While the more maladjusted person might still be motivated to obtain psychological assistance, such assistance might not be available at such low cost. In the area of sex behavior, for example, he or she will have to tell the investigator things that the more maladjusted person is more likely to believe will make him or her look bad (either because of ‘‘too little’’ or ‘‘too much’’ or the ‘‘wrong kind’’ of sex activity). Furthermore, in this situation the subject cannot easily ascribe any ‘‘strangeness’’ of behavior to the strangeness of the situation. Similarly, in the sensitivity-training research, the more maladjusted subject may feel that his or her insensitivity may become abundantly clear in the group interaction and any inadequacy of behavior again cannot be easily attributed to the situation, since, the maladjusted subject may feel, all the other group members will be behaving perfectly adequately. Accordingly, sensitivity-training research and sex research may be situations the more maladjusted person will try to avoid. We turn now to a consideration of the possible moderating effects of the gender composition of the samples listed in Table 2-22 on the results of those studies. Most (65%) of the results shown were based on combined samples of males and females. Six studies employed only male subjects and six employed only female subjects; overall, there were no differences in the results obtained. When we consider the type of study for which volunteering was solicited along with sex of sample, however, there are some indications that these variables may interact to serve as co-moderator variables. Three studies of sensory deprivation were conducted in which the samples were composed of either males or females. The two results based exclusively on male
subjects found essentially no relationship between volunteering and psychopathology (effect size ≈ 0.0; Francis and Diespecker, 1973; Myers, Murphy, Smith, and Goffard, 1966). The single study of sensory deprivation employing only female subjects, however, showed the volunteers to be psychologically more stable, as defined by the 16 PF measure, than the nonvolunteers with an effect size of .44 (Schultz, 1967c). In three of the questionnaire studies, the samples were composed exclusively of males or exclusively of females. The single result based on female subjects showed essentially no relationship between volunteering and psychopathology (effect size ≈ 0.0; Philip and McCulloch, 1970 II). The two studies based on male subjects both showed very large effects of psychopathology on volunteering, but the results were in opposite directions from one another. For their male subjects, Philip and McCulloch (1970, I) found that former psychiatric patients who had more often attempted suicide were more likely to respond to their questionnaire, with an effect size of .80. For their male subjects, however, Kish and Hermann (1971) found that patients discharged from an alcoholism treatment program were far more likely to respond to a questionnaire if their condition had improved, and the effect size was very large (approximately 1.4 units). Rosen (1951) conducted his research on volunteering with both male and female subjects. Recruitment was for participation in a personality experiment, and psychopathology was defined by scores on the Paranoia Scale of the MMPI. Among male subjects, there was essentially no relationship between volunteering and psychopathology, with a very small trend for better adjusted males to volunteer more (effect size = .09). Among female subjects, however, those who volunteered scored as significantly more maladjusted, with an effect size of .49.
There were two studies of sex behavior that employed only female subjects (Maslow, 1942; Maslow and Sakoda, 1952), and in both it was found that women scoring higher in self-esteem were more willing to volunteer, with effect sizes of .65 and .39, respectively. In a study of sex behavior that employed a combined sample of males and females, however, Siegman (1956) found no relationship between volunteering and self-esteem. There are not enough studies of male versus female samples who have been requested to volunteer for any specific type of research to permit any conclusion about the moderating effects of sex on the relationship between volunteering and psychopathology. There does appear to be sufficient evidence, however, to suggest the possibility that sex may indeed operate as a moderator variable but perhaps in different directions for different types of tasks for which recruiting was undertaken. For the 34 results shown in Table 2-22, a total of 20 different definitions of psychopathology were employed. Table 2-24 shows these 20 measures, the median effect size obtained in studies employing each of them, and the number of studies in which each was employed. An effect size with a positive sign indicates that volunteers were found to be more maladjusted, while an effect size with a negative sign indicates that volunteers were found to be better adjusted.

Table 2–24 Median Effect Sizes (in σ Units) Obtained in Studies Employing Various Measures of Psychopathology

Measure                                     Effect Size    Number of Studies
Lykken Emotionality                         +.85           1
Silverman Self-Esteem                       +.60           1
Guilford–Zimmerman                          +.49           1
Guilford STDCR                              +.48           2
Prior Psychiatric Hospitalization           +.48           1
Psychiatric Evaluation                      +.46           4
Suicide Attempts                            +.40           2
California Psychological Inventory          +.24           1
Paranoia Scale (MMPI)                       +.22           5
Rosenberg Self-Esteem                       +.10           2
Eysenck Neuroticism                          .0ᵃ           1
Heron Neuroticism                            .0ᵃ           3
Shostrom Personal Orientation Inventory      .00           2
Siegman Self-Esteem                          .0ᵃ           1
Middletown NP Index                          .0ᵃ           1
Self-Esteem, Self-Rated                     −.10           1
Revised Social Responsibility Scale         −.41           1
16 PF                                       −.44           1
Maslow Self-Esteem                          −.52           2
Improvement of Alcoholism                  −1.40           1
ᵃ Approximation only.

A cursory examination of Table 2-24 does not suggest that the type of measure employed was a major determinant of the outcomes obtained. Thus, Silverman’s measure of self-esteem was employed in a study finding volunteers to score lower in self-esteem (+.60), while Maslow’s measure of self-esteem was employed in 2 studies finding volunteers to score higher in self-esteem (−.52). It appears unlikely that these reversed outcomes could be ascribed to a strong negative correlation between these
2 measures of self-esteem. More likely, it is the nature of the task for which volunteering was requested that serves as the moderator variable. In general, paper-and-pencil measures of adjustment appear to be well scrambled among the positive, negative, and zero-magnitude effect sizes. A closer examination of Table 2-24, however, does suggest one hypothesis about possible moderating effects of the type of measure employed on the relationship between volunteering and adjustment. When the definition of adjustment was clinical rather than psychometric, the results were more likely to show volunteers to be less well adjusted than the nonvolunteers. These clinical definitions included prior psychiatric hospitalization, psychiatric evaluation, history of suicide attempts, and improvement of alcoholism. The studies employing these definitions were all conducted either in medical contexts or under medical auspices, so that the measures employed are confounded with a particular context. The median effect size of the eight studies employing clinical definitions of psychopathology was +.46. When research is conducted in a medical, and primarily a psychiatric, context, it appears especially plausible to think that more maladjusted persons volunteer more in the explicit or implicit hope of receiving assistance with their psychological problems. Among the better adjusted volunteers for these more medical studies, especially for studies testing drug effects (Esecover, Malitz, and Wilkens, 1961; Lasagna and von Felsinger, 1954; Pollin and Perlin, 1958), motives for volunteering
were more likely to include payment, scientific curiosity, or group expectations (as in the case of medical students, for example) than was the case among the more poorly adjusted volunteers. Two of the studies listed in Table 2-22 defined their nonvolunteers in terms of failing to appear for their appointments (Silverman, 1964; Valins, 1967). Normally such studies would not be included in listings of studies comparing volunteers with nonvolunteers. In these particular cases, however, the grouping appeared reasonable. In the study by Silverman, for example, all persons in a pool that was required to serve as subjects comprised the target population. Of 40 subjects asked to come for the experiment, 15 did not appear. Among these 15 there were a number that called to say that they could not attend at the specified hour. These subjects are not like our ‘‘true’’ pseudovolunteers who simply do not show up. Analysis of the data in this study grouped together as the nonvolunteers those who notified the investigator and those who did not. Those who did appear were somewhat more like volunteers than would normally be the case when research participation is a course requirement, because this requirement had not been enforced. In the study by Valins, both the ‘‘volunteers’’ and ‘‘nonvolunteers’’ (no-shows) had actually participated in an earlier phase of his research, so that they were second-level subjects, whereas we usually think of pseudovolunteers as never finding their way into the research situation. In our summary of the studies of the relationship between volunteering and psychopathology we must first note the very high proportion (71%) of studies showing a significant relationship. 
No case could be made that volunteering and adjustment are not related; however, while there are many studies showing volunteers to be better adjusted than nonvolunteers (26% of the total studies), there are even more studies showing volunteers to be less well adjusted than nonvolunteers (44% of the total studies). Studies showing the latter result appear to have in common a potentially unusual situation (e.g., drugs, hypnosis, high temperature, or vaguely described experiments) in which any unusual behavior by the volunteer might more easily be attributed to the situation. Studies showing more maladjusted subjects to volunteer less appear to have in common a situation in which volunteers must be self-revealing and in which any unusual behavior by the volunteer cannot easily be attributed to the situation but may be attributed instead to the volunteer’s psychopathology. Volunteers for studies that are conducted in medical contexts and that employ clinical rather than psychometric definitions of psychopathology (e.g., prior psychiatric hospitalization, psychiatric diagnosis, history of suicide attempts) appear to be particularly likely to be more maladjusted than nonvolunteers, especially when there are no external incentives to volunteer. More maladjusted persons may be more likely to volunteer for this type of research in particular because of implicit or explicit hopes of receiving some form of psychological assistance with their psychological problems.
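The percentages cited in this summary follow directly from the study counts in Table 2-22, and the counts in Table 2-23 should add up to the same total. The short Python sketch below simply re-derives them; it is an arithmetic check rather than a reanalysis, and the variable names and the `percent` helper are our own illustrative choices.

```python
# Study counts taken from Table 2-22; the names are illustrative assumptions.
more_maladjusted = 15
no_difference = 10
better_adjusted = 9
total = more_maladjusted + no_difference + better_adjusted


def percent(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100 * part / whole


pct_more = percent(more_maladjusted, total)
pct_better = percent(better_adjusted, total)
pct_significant = percent(more_maladjusted + better_adjusted, total)

# Number of studies for each of the eight types of study in Table 2-23.
studies_by_type = [1, 11, 5, 3, 1, 9, 3, 1]

print(f"More maladjusted: {pct_more:.0f}%")
print(f"Better adjusted: {pct_better:.0f}%")
print(f"Any significant relationship: {pct_significant:.0f}%")
print(f"Table 2-23 accounts for {sum(studies_by_type)} of {total} studies")
```

Both tables account for the same 34 results, so the breakdown by type of study and the breakdown by direction of effect describe one and the same set of studies.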
Intelligence

A good deal of evidence has accumulated bearing on the relationship between volunteering and intelligence. Table 2-25 shows the results of 37 studies that have examined that relationship.

Table 2–25 Studies of the Relationship Between Volunteering and Intelligence

Volunteers Significantly More Intelligent
  Abeles, Iscoe, and Brown (1954–1955)
  Brower (1948)
  Conroy and Morris (1968)
  Cudrin (1969)
  Ebert (1973)
  Eckland (1965)
  Edgerton, Britt, and Norman (1947)
  Ellis, Endo, and Armer (1970)
  Frey (1973) I
  Frey (1973) II
  Martin and Marcuse (1958) I
  Martin and Marcuse (1958) IIᵃ
  Neulinger and Stein (1971)
  Reuss (1943)
  Rothney and Mooren (1952)
  Thistlethwaite and Wheeler (1966)
  Tune (1968)
  Weigel et al. (1971)
  Wicker (1968b)
  Wolfgang (1967)

No Significant Difference
  Bennett and Hill (1964)
  Diab and Prothro (1968)ᵃ
  Kaess and Long (1954)
  Mann (1959)
  Martin and Marcuse (1958) IIIᵃ
  Martin and Marcuse (1958) IVᵃ
  Mulry and Dunbar (n.d.)
  Myers et al. (1966)ᵃ
  Poor (1967) I
  Poor (1967) II
  Rosen (1951)ᵃ
  Spiegel and Keith-Spiegel (1969)ᵃ
  Toops (1926)
  Tune (1969)
  Underwood et al. (1964)

Volunteers Significantly Less Intelligent
  Edwards (1968a)ᵃ
  Matthysse (1966)

ᵃ Somewhat more unusual experiments; see text for description.

Although there are a good many results (15) showing no
relationship between volunteering and intelligence, there are even more (20) showing volunteers to be significantly more intelligent, while only 2 results show volunteers to be significantly (p < .10) less intelligent. The median effect size obtained was .50 for studies showing volunteers to be more intelligent and approximately zero for all the remaining studies. A large percentage of the 37 studies shown in Table 2-25 employed mailed questionnaires (49%) and the results of these studies were compared to the remaining, more laboratory-oriented studies. The median effect size for all questionnaire studies was .29, while for the remaining studies it was .16; in both cases the median effect sizes were in the same direction, that showing volunteers to be more intelligent. The laboratory studies could be further subdivided into those that requested volunteers for more-or-less standard or unspecified experiments and those that requested volunteers for somewhat more unusual experiences, including hypnosis, sleep and hypnosis, sensory isolation, sex research, small-group and personality research, and participation in a program of companionship for psychiatric patients. The median effect size for the 10 more typical or unspecified types of studies was .40, while for the 8 more ‘‘unusual’’ types of studies the median effect size was approximately zero. The 8 studies comprising this latter set are marked with a superscript in Table 2-25. There was 1 additional experiment that does not fall into either the set of more standard studies or the set of less standard studies. This additional study by Cudrin (1969) solicited volunteers for medical research among the inmates of a federal prison. Prisoners who volunteered to be exposed to malaria or to gonorrhea
scored significantly higher than nonvolunteers on the Revised Beta Examination, with an effect size of .71. The two most commonly employed measures or definitions of ‘‘intelligence’’ were IQ scores on a variety of standardized tests (e.g., ACE, Beta, Henmon–Nelson, Mill Hill Vocabulary, Raven’s Matrices, SAT, Shipley Hartford) and school grades. Studies employing neither of these two types of measures employed a wide variety of other measures (e.g., motor skills performance, graduation from college, science achievement, intellectual efficiency, paired associate learning scores, study habits, marginality of student status, and even college administrators’ knowledge of the validity of tests they were employing at their college). Table 2-26 shows the median effect sizes obtained in studies employing each of three types of measures for questionnaire and nonquestionnaire studies. The median effect size of these six cells is about a quarter of a standard deviation, a value appreciably lowered by the results of nonquestionnaire studies employing IQ tests or school grades as their definitions of intelligence. A closer examination of these studies reveals, however, that the zero effect sizes are caused by those studies we have referred to as somewhat less routine. While, in general, brighter or better-performing people are more likely to volunteer for relatively standard behavioral research, it appears that when the task for which volunteering is requested becomes more unusual, variables other than intelligence become more effective in predicting volunteering. The vast majority of studies listed in Table 2-25 (78%) employed both male and female subjects. The number of studies employing only male or female subjects was, therefore, too small to permit very stable inferences about the possible role of sex of sample as a variable moderating the relationship between volunteering and intelligence. 
The median effect size of the six studies employing only male subjects was .44, while the effect sizes of the two studies employing only female subjects were þ.73 (Neulinger and Stein, 1971) and .65 (Edwards, 1968a). The latter study, showing volunteers to rank lower in class standing, was conducted with student nurses who were asked to volunteer for sleep and hypnosis research. Although sex of sample and type of task may have combined to serve as co-moderator variables, there is simply not enough evidence available to permit such a conclusion with any confidence. There is one more study relevant to the question of the relationship of intelligence to finding one’s way into the role of subject for behavioral research. The study by Leipold and James (1962), because it was of pseudovolunteering, was not listed in
Table 2–26 Median Effect Sizes (in σ Units) by Type of Study and Type of Measureᵃ

Type of Measure    Questionnaire      Nonquestionnaire    Total
                   Effect    N        Effect    N         Effect    N
IQ tests           .24       7        .00ᵇ      9         .18       16
School grades      .29       7        .00ᶜ      5         .14       12
Other measures     .38       4        .41       5         .41       9
Total              .29       18       .16       19        .23       37

ᵃ Values given as .00 are approximations.
ᵇ Median effect size for the four more ordinary experiments = .34.
ᶜ Median effect size for the two more ordinary experiments = .20.
Table 2-25. Those who kept their appointments with the investigator had earned higher grades in their psychology examinations than those who failed to appear, with an effect size of .34. The differences were significant for the females (effect size = .52) but not for the males (effect size = .10). A summary of the results of studies of the relationship between volunteering and intelligence can be relatively simple. When there is a significant relationship reported, and very often there is, it is overwhelmingly likely to show volunteers to be more intelligent. However, there appears to be a class of studies in which no relationship is likely to be found. That is the class of studies requesting volunteers for somewhat more unusual psychological experiences such as hypnosis, sensory isolation, sex research, and small-group and personality research.
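As a rough arithmetic check on the statement that the six cells of Table 2-26 have a median of about a quarter of a standard deviation, the cell medians can be fed to a median function directly. The second calculation, which substitutes the footnoted medians for the "more ordinary" experiments in place of the two .00 cells, is an illustration added here and is not an analysis reported in the text; all variable names are hypothetical.

```python
from statistics import median

# Median effect sizes for the six cells of Table 2-26, in SD units.
table_2_26 = {
    ("questionnaire", "IQ tests"): 0.24,
    ("questionnaire", "school grades"): 0.29,
    ("questionnaire", "other measures"): 0.38,
    ("nonquestionnaire", "IQ tests"): 0.00,
    ("nonquestionnaire", "school grades"): 0.00,
    ("nonquestionnaire", "other measures"): 0.41,
}

# Illustration only: swap in the footnoted medians for the more ordinary
# nonquestionnaire experiments (.34 for IQ tests, .20 for school grades).
ordinary_only = dict(table_2_26)
ordinary_only[("nonquestionnaire", "IQ tests")] = 0.34
ordinary_only[("nonquestionnaire", "school grades")] = 0.20

all_cells = median(table_2_26.values())
ordinary_cells = median(ordinary_only.values())

print(f"median of the six cells as tabled:       {all_cells:.3f}")
print(f"median with the ordinary-experiment values: {ordinary_cells:.3f}")
```

The gap between the two medians gives a sense of how much the two zero cells pull down the six-cell figure.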
Education

Most of the work on the relationship between level of education achieved and probability of finding one’s way into the investigator’s data pool has been conducted in the area of survey research rather than in the area of laboratory research. That is not surprising, given that most laboratory studies employ college students as subjects. These students, often drawn from a single freshman or sophomore introductory course, show only a very small degree of variance in their level of education, so that no marked correlation between education and volunteering (or any other variable) could reasonably be expected. In survey research, on the other hand, the target population is often intended to show considerable variation in level of education attained. Survey researchers have long suspected that better educated people are more likely to find their way into the final sample obtained. That suspicion is remarkably well supported by the data. Table 2-27 lists 26 studies investigating the relationship between responding or volunteering and level of education. Every one of these studies shows the better-educated person more likely to participate in

Table 2–27 Studies Showing Respondents or Volunteers to Be Better Educated

Less Personal Contact (e.g., questionnaires)
Barnette (1950a); Baur (1947–1948); Donald (1960); Eckland (1965); Franzen and Lazarsfeld (1945); Gannon, Nothern, and Carroll (1971); Katz and Cantril (1937); Kivlin (1965); Pace (1939); Pan (1951); Reuss (1943); Rothney and Mooren (1952); Suchman (1962); Suchman and McCandless (1940); Wallace (1954); Wallin (1949); Zimmer (1956)

More Personal Contact (e.g., interviews)
Benson, Booman, and Clark (1951); Cohler, Woolsey, Weiss, and Grunebaum (1968); Dohrenwend and Dohrenwend (1968); Gaudet and Wilson (1940); Kirby and Davis (1972); Meier (1972); Robins (1963); Stein (1971); Teele (1967)
behavioral research, and in only two studies were the results not significant (Cohler et al., 1968; Dohrenwend and Dohrenwend, 1968). Here, then, is one of the clearest relationships obtained between participation in behavioral research and any personal characteristic of the subject. For all the studies considered together, the median effect size was .40. However, there appeared to be some moderating effects of the degree of personal contact between investigator and respondent on the magnitude of the relationship. Thus, for the 17 studies involving little or no personal contact between questioner and respondent, the median effect size was .58. For the remaining 9 studies in which there was a greater degree of personal contact between questioner and respondent, the median effect size was only .32, a full quarter of a standard deviation lower. Most of the studies of this latter group involved face-to-face contact between investigator and respondent, so that anonymity was not possible. The studies of the former group were essentially all questionnaire studies in which anonymity could often be offered and in which no personal pressure to cooperate could be brought to bear on the potential respondent. Under these circumstances, better-educated persons may better appreciate the scientific value of their cooperation, and/or better-educated persons may feel they will have more favorable things to say about themselves in their responses to the questionnaires. Not all of the studies requiring more personal contact between investigator and respondent were of the typical field-interview variety. Thus, the study by Cohler et al. (1968) invited a group of mothers who had already participated in research to volunteer for additional research on parent–child relationships. 
The study by Meier (1972) was based on the “biggest public health experiment ever: the 1954 field trial of the Salk poliomyelitis vaccine.” The study by Stein (1971) was of patients in a university outpatient clinic who were invited to participate in research on psychotherapy. Finally, the study by Teele (1967), while a not atypical field interview study, did not directly investigate volunteering for behavioral research. Instead, general participation in voluntary associations was correlated with education level. In all four of these studies, as in the remaining, somewhat more typical, survey research studies, better-educated persons volunteered more. Most of the studies (69%) of Table 2-27 employed both male and female subjects. Comparison of the results of studies employing only male or female subjects showed no marked difference in median effect size obtained either for questionnaire-type studies (males = .64; females = .53) or for studies requiring more personal contact with the investigator (males = .18; females = .14). A summary of the relationship between volunteering and education can be unusually unequivocal. Better-educated people are more likely to participate in behavioral (usually survey) research and especially so when personal contact between investigator and respondent is not required.
Social Class

Many definitions of social class are based at least in part on level of education attained or on other variables that tend to be correlated with level of education. It will come as no surprise, therefore, that social class is related to volunteering just as educational level was found to be. Table 2-28 shows, however, that the evidence in the case of social class is not quite so univocal as it was for education. Of the 46
Table 2–28 Studies of the Relationship Between Volunteering and Social Class

Volunteers Significantly Higher in Social Class
Adams (1953); Barnette (1950a, b); Belson (1960); Britton and Britton (1951); Clark (1949); Crossley and Fink (1951); Fischer and Winer (1969) I; Fischer and Winer (1969) II; Franzen and Lazarsfeld (1945); Hilgard and Payne (1944); Hill, Rubin, and Willard (1973); Katz and Cantril (1937); King (1967); Kirby and Davis (1972) I; Kirby and Davis (1972) II; Kivlin (1965); Lawson (1949); MacDonald (1972b); Martin et al. (1968) I; Mayer and Pratt (1966); Meier (1972); Pace (1939); Pucel, Nelson, and Wheeler (1971); Robins (1963); Rothney and Mooren (1952); Stein (1971); Suchman (1962); Teele (1967); Tune (1968); Wallace (1954); Wicker (1968b); Zimmer (1956)

No Significant Difference
Cohler et al. (1968); Dohrenwend and Dohrenwend (1968); Ellis, Endo, and Armer (1970); McDonagh and Rosenblum (1965); Poor (1967) I; Poor (1967) II; Rosen (1951) I; Tune (1969); Waters and Kirk (1969)

Volunteers Significantly Lower in Social Class
Donald (1960); Edwards (1968a); Martin et al. (1968) II; Reuss (1943); Rosen (1951) II
studies listed, 32, or 70%, do show that those defined as higher in social class are significantly more likely to participate in behavioral research. However, there are also 9 studies (20%) that report no significant relationship and another 5 studies (11%) that report a relationship significantly in the opposite direction. With so many studies investigating the relationship between volunteering and social class, it will be useful to have a stem-and-leaf plot of the magnitudes of the effects obtained. For 40 of the 46 studies of Table 2-28, an effect size in standard deviation units could be computed or estimated with reasonable accuracy. Table 2-29 shows the stem-and-leaf plot along with a summary. The left-hand column of the plot lists the first digit, the “stem,” of the effect sizes given in two-digit proportions of a standard deviation. The digits to the right of the stem are the second digits, the “leaves,” of the various results. Thus, the leaf 7 following the stem .8 means that one study obtained an effect size of .87. The leaves 0, 3, and 5, following the stem .7 mean that there were three studies with effect sizes between .70 and .79, namely, .70, .73, and .75. Effect sizes with positive signs
Table 2–29 Stem-and-Leaf Plot of Effect Sizes (in σ Units) Obtained in Studies of the Relationship Between Volunteering and Social Class

Plot (N = 40)
[Stem-and-leaf display not reproduced: stems run from +.8 down to −.4, with two outlying stems at −1.0 (leaf 6) and −1.4 (leaf 0).]

Summary
Maximum      +.87
Quartile3    +.42
Median       +.28
Quartile1     .00
Minimum     −1.40
Q3 − Q1      +.42
σ̂             .32
S             .44
Mean         +.19
prefixed indicate that volunteering was more likely for persons higher in social class, while effect sizes with negative signs prefixed indicate that volunteering was less likely for persons higher in social class. The stem-and-leaf display and its summary to the right show a skew in the distribution such that there are a few very low negative effect sizes straggling away from the main bunching of the distribution. In our efforts to understand this skew, we round up our usual candidates for the role of moderator variables: the type of task for which volunteering was requested, the definition or measure of social class, and the sex of the sample employed. Table 2-30 shows the median effect sizes obtained in 10 types of studies. The most typical, least stressful, least threatening, and least dangerous types of tasks appear to be concentrated in the center of the listing. Large positive effect sizes were obtained in studies of the effects of exposure to air pollution (Martin et al., 1968 I) and of the personal counseling process (Kirby and Davis, 1972 I, II; Stein, 1971). Very large negative effect sizes were obtained in studies of sleep and hypnosis (Edwards, 1968a)

Table 2–30 Median Effect Sizes (in σ Units) Obtained in 10 Types of Studies

Type of Study                   Effect Size    Number of Studies
Exposure to air pollution       +.87           1
Counseling research             +.63           3
Questionnaires in laboratory    +.53           2
Mailed questionnaires           +.29           19
Field interviews                +.17           5
Psychological research          +.14           4
Interviews in laboratory        +.13           2
Personality research            −.24           2
Sleep and hypnosis research     −1.06          1
Exposure to malaria             −1.40          1
and of exposure to malaria (Martin et al., 1968 II). These results are puzzling, and there are not enough studies in either of our extreme groups to permit any strong inferences. Perhaps subjects higher in social class have a feeling of noblesse oblige when the risks are not unduly great. Experimental exposure to air pollution may not be much worse than living in a typical industrial urban area, and participating in research on counseling may not involve any degree of threat greater than that surmounted in having decided to seek counseling in the first place. When the risks appear to be greater, subjects higher in social class may prefer to leave them to those lower in social class. Malaria is not part of the typical industrial urban picture, and it is traditionally left to prisoners or to servicemen to volunteer for that type of risk. That, at least, is what Martin, Arnold, Zimmerman, and Richart (1968) found. Not one of their 28 professional persons volunteered to contract malaria, but 8% of their firemen and policemen volunteered, 27% of their maintenance personnel and welfare recipients volunteered, and 67% of their prisoners volunteered. The situation for sleep and hypnosis research is less clear. The danger is less obvious in the physical sense, but perhaps those higher in social class are concerned that they might reveal something personal about themselves in the course of such research. Loss of control is at least believed to be more likely under conditions of either sleep or hypnosis. Some support for this interpretation comes from the finding that those higher in social class are also less willing to volunteer for research in personality. There, too, there may be a perceived potential for loss of privacy, a loss that may be more costly for those higher in social class, who have had greater opportunity to experience it. We repeat our caution that these interpretations are highly speculative and are in any case based only on a handful of studies. 
The next handful may show us to be quite clearly in error. We turn now to a consideration of the possible moderating effects on the relationship between volunteering and social class of the particular definitions or measures of social class that have been employed. Table 2-31 shows the median effect sizes obtained when nine different measures of social class were employed. Here our interpretation of the results can be less complex. Of the four measures of social class showing the smallest effect sizes, three employed definitions based on the social class of the subject’s father (occupation, education, and income) rather than on his or her own social class. Of the 40 studies for which effect sizes could be computed or estimated, 11 employed father’s status as the measure of social class with a median

Table 2–31 Median Effect Sizes (in σ Units) Obtained When Employing Nine Measures of Social Class

Type of Measure                 Effect Size    Number of Studies
Listing in Who’s Who            +.62           1
Military rank                   +.40           1
Composite index                 +.39           6
Income                          +.36           5
Own occupation                  +.29           15
Father’s occupation             +.12           4
Home and appliance ownership    +.10           1
Father’s education              +.06           4
Father’s income                  .00           3
Table 2–32 Median Effect Sizes (in σ Units) by Sex of Sample and Definition of Social Class

Sample Sex    Own Class        Parental Class    Total
              Effect    N      Effect    N       Effect    N
Male          +.40      8      +.38      2       +.40      10
Mixed         +.30      19      .00      7       +.28      26
Female        −.15      2      −.77      2       −.33      4
Total         +.34      29      .00      11      +.28      40

Approximate Residuals
Sample Sex    Own Class    Parental Class    Total
Male          −.15         +.15              .00
Mixed          .00          .00              .00
Female        +.15         −.15              .00
Total          .00          .00              .00
effect size of approximately zero. For the remaining 29 studies, the median effect size was +.34. Table 2-32 shows the median effect size for studies defining social class in terms of the subject’s own status versus father’s status, for studies employing male, female, and mixed samples. Examination of the right-hand column suggests a rather clear effect of sex of sample. Those studies employing only female subjects show a significant reversal of our general finding. Three of the 4 studies employing female subjects found higher-status females to volunteer significantly less than lower-status females. Because there are only 4 such studies, we cannot draw a firm conclusion about the matter, especially since sex of sample is partially confounded with the type of study for which participation had been solicited. Thus, the 2 studies showing the largest (negative) effect sizes were a study of sleep and hypnosis (Edwards, 1968a; effect size = −1.06) and a study of personality (Rosen, 1951 II; effect size = −.48). We cannot be sure, then, whether these significantly reversed results were due to the sex of the sample, the potentially more threatening nature of the task, or both. In addition to the effects of sex of sample and definition of social class employed, Table 2-32 suggests an interaction of these two factors. The bottom half of the table shows the residuals when these two effects have been removed. For studies employing male samples, social status and volunteering are relatively more positively correlated when social status is defined in terms of parental social class than when it is defined in terms of the subject’s own social class. For studies employing female samples, social status and volunteering are relatively more negatively correlated when social status is defined in terms of parental social class than when it is defined in terms of the subject’s own social class.
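The text does not say how the residuals in the bottom half of Table 2-32 were obtained. One reading that comes close to the tabled values is the ordinary additive decomposition: subtract from each cell the unweighted row mean and column mean of the six cell entries, then add back the grand mean. The sketch below assumes that method, and it also assumes negative signs for the two female cells (−.15 and −.77), as implied by the discussion of the reversed results for female samples. Function and variable names are hypothetical.

```python
# Cell medians from Table 2-32 (rows: sample sex; columns: definition of class).
# Assumption: the two female-sample cells are negative, as the text implies.
cells = {
    "male":   {"own": 0.40, "parental": 0.38},
    "mixed":  {"own": 0.30, "parental": 0.00},
    "female": {"own": -0.15, "parental": -0.77},
}


def additive_residuals(table):
    """Remove row and column effects (unweighted means) from a two-way table."""
    rows = list(table)
    cols = list(next(iter(table.values())))
    grand = sum(table[r][c] for r in rows for c in cols) / (len(rows) * len(cols))
    row_mean = {r: sum(table[r][c] for c in cols) / len(cols) for r in rows}
    col_mean = {c: sum(table[r][c] for r in rows) / len(rows) for c in cols}
    # residual = cell - row mean - column mean + grand mean
    return {
        r: {c: table[r][c] - row_mean[r] - col_mean[c] + grand for c in cols}
        for r in rows
    }


residuals = additive_residuals(cells)
for sex, row in residuals.items():
    print(sex, {definition: round(value, 2) for definition, value in row.items()})
```

Under these assumptions the male and female residuals come out at about ∓.15 and ±.15, and the mixed-sample residuals are within .01 of zero, which agrees with the approximate residuals tabled.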
For studies employing combined samples of males and females, the results are just what we expect from an examination of the column and row medians: volunteering is positively related to social class only when social class is defined in terms of the subject’s own status rather than in terms of his parents’ social class. Because the interaction described is so complex and because the number of studies in three of the four cells showing large residuals is so small (i.e., 2), it seems wisest to forego interpretation until further research establishes the reliability of the interaction. A summary of the relationship between volunteering and social class that would be robust even to a good number of future contradictory results would state only that, in general, participation in behavioral research is more likely by those higher in
social class when social class is defined by the volunteer’s own status rather than by his parents’ status. There are some additional, but much weaker, findings as well. Persons higher in social class may volunteer somewhat more than usual when the task is slightly risky in biological or psychological terms, but they may volunteer a good deal less than usual, or even less than lower-class subjects, when the task is quite risky in a biological or psychological sense. There may also be a tendency for females to show less of the positive relationship between social class and volunteering than males.
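The stem-and-leaf convention described for Table 2-29 (first digit of the effect size as the stem, second digit as the leaf) can be sketched in a few lines. The example below is only an illustration: it uses the six effect sizes that the text and Table 2-30 name individually (.87, .70, .73, .75, and the two large negative values for sleep and hypnosis research and for exposure to malaria), not the full set of 40, and the function name and label format are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(effect_sizes):
    """Return display lines: sign and first digit(s) as stem, last digit as leaf."""
    rows = defaultdict(list)
    for value in effect_sizes:
        hundredths = round(abs(value) * 100)   # e.g. .87 -> 87, 1.06 -> 106
        stem, leaf = divmod(hundredths, 10)    # 87 -> (8, 7), 106 -> (10, 6)
        sign = -1 if value < 0 else 1
        rows[(sign, stem)].append(leaf)

    lines = []
    # Largest positive stem on top, most negative stem at the bottom.
    ordered = sorted(rows, key=lambda k: (k[0], k[0] * k[1]), reverse=True)
    for sign, stem in ordered:
        label = f"{'+' if sign > 0 else '-'}{stem // 10 or ''}.{stem % 10}"
        leaves = " ".join(str(d) for d in sorted(rows[(sign, stem)]))
        lines.append(f"{label} | {leaves}")
    return lines


# Only the values singled out in the text; the remaining 34 are not listed there.
named_values = [0.87, 0.70, 0.73, 0.75, -1.06, -1.40]
lines = stem_and_leaf(named_values)
print("\n".join(lines))
```

Each printed row pairs one stem with its sorted leaves, so the row for stem +.7 carries the leaves 0, 3, and 5 mentioned in the text.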
Age

A good bit of evidence is available to help us assess the relationship between age and volunteering for behavioral research. Table 2-33 shows the results of 41 studies of this relationship, results that permit no simple overall conclusion. While there are a good many studies (44%) that show no significant relationship between volunteering and age, there are far too many studies showing volunteers to be significantly younger (34%) or older (22%) than could reasonably be expected to occur if there were really no relationship between these variables. Before considering the possible role as moderator variables of the sex of the samples employed or the types of studies for which volunteering was requested, a stem-and-leaf plot of the effect sizes obtained will give a useful overview. Table 2-34 presents such a plot along with its summary. Effect sizes with positive signs prefixed indicate that volunteering was more likely for younger persons, while effect sizes with negative signs prefixed
Table 2–33 Studies of the Relationship Between Volunteering and Age

Volunteers Significantly Younger
Abeles, Iscoe, and Brown (1954–1955); Belson (1960); Dohrenwend and Dohrenwend (1968); Donald (1960); Kaats and Davis (1971) I; Lowe and McCormick (1955); Marmer (1967); Myers et al. (1966); Newman (1956) I; Newman (1956) II; Pan (1951); Rosen (1951) I; Wallin (1949) I; Wallin (1949) II

No Significant Difference
Baur (1947–1948); Benson, Booman, and Clark (1951); Diamant (1970) I; Diamant (1970) II; Ebert (1973); Edwards (1968a); Gaudet and Wilson (1940); Kaats and Davis (1971) II; Kaess and Long (1954); Martin et al. (1968); Newman (1956) III; Newman (1956) IV; Philip and McCulloch (1970); Poor (1967) I; Poor (1967) II; Rosen (1951) II; Stein (1971); Tune (1968)

Volunteers Significantly Older
Crossley and Fink (1951); Gannon, Nothern, and Carroll (1971); King (1967); Kruglov and Davidson (1953); Mackenzie (1969); Mayer and Pratt (1966); Sirken, Pifer, and Brown (1960); Tune (1969); Zimmer (1956)
Table 2–34 Stem-and-Leaf Plot of Effect Sizes (in σ Units) Obtained in Studies of the Relationship Between Volunteering and Age

Plot (N = 39)
[Stem-and-leaf display not reproduced: stems run from +.6 down to −.4.]

Summary
Maximum      +.64
Quartile3    +.17
Median        .00
Quartile1    −.14
Minimum      −.47
Q3 − Q1      +.31
σ̂             .23
S             .26
Mean         +.07
indicate that volunteering was more likely for older persons. The distribution of effect sizes appears to be strongly centered at about zero and if we did not know that there were significantly more significant results than could be expected by chance (χ² = 92, df = 1, p near zero) we might conclude that there was “on the average” no relationship between age and volunteering. The challenge of this stem-and-leaf plot is to try to discover the variables that lead to significant positive versus significant negative relationships between age and volunteering. A fine-grained analysis of the different types of studies for which volunteers had been solicited revealed no substantial differences in effect sizes obtained. Table 2-35 shows the median effect sizes obtained when types of studies are coarsely grouped into laboratory versus survey research studies, for each type of outcome. Although the number of studies in each cell is too small to warrant any firm conclusions, there is a hint in the table that when a significant relationship is obtained between age and volunteering, the magnitude of the effect may be larger in either direction in laboratory studies (+.43; −.47) than in survey research (+.24; −.21). That is a surprising result, given that the laboratory studies were so often conducted with college students, who show relatively little age variance, compared to the respondents of the surveys, who frequently span half a century in their ages. On psychometric grounds alone, we would have expected volunteering to correlate less with age in any sample showing low age variance.

Table 2–35 Median Effect Sizes (in σ Units) by Type of Study and Type of Outcomeᵃ

Type of Outcome       Laboratory       Survey           Total
                      Effect    N      Effect    N      Effect    N
Volunteers younger    +.43      6ᵇ     +.24      8      +.35      14
No difference          .00      10      .00      8       .00      18
Volunteers older      −.47      1      −.21      8ᵇ     −.24      9
Total                 +.08      17      .00      24     +.03      41

ᵃ Values given as .00 are approximations.
ᵇ Effect size for one of these studies could not be determined.
Table 2-35 also suggests that there may be a difference in the pattern of outcomes for laboratory versus survey research studies. Thus, among the surveys, 67% show a significant relationship between age and volunteering, while among the laboratory studies only 41% do so. Among the studies that show a significant relationship, we find a difference in the distribution of outcomes for the laboratory versus survey studies. Thus, among the surveys conducted, 50% of the significant outcomes show respondents to be younger, while among the laboratory studies conducted, 86% of the significant outcomes show volunteers to be younger. A slightly different way of summarizing the distribution of outcomes shown in Table 2-35 is to note that among the laboratory studies there is only a single result out of 17 (6%) showing volunteers to be significantly older than nonvolunteers, a result that could reasonably be attributed to chance. Among the surveys, however, there are 8 results out of 24 (33%) showing respondents to be older than nonrespondents, a result that cannot be reasonably attributed to chance. An effort was made to detect any communalities among the nine studies that found respondents to be older than nonrespondents. This effort was not successful, but for five of the nine studies, it appeared that social status may have been confounded with age such that it may have been volunteers’ higher status rather than age per se that led to their higher rates of research participation. Thus, in King’s (1967) study, older clergymen were more likely than younger clergymen to reply to a brief questionnaire. King’s older clergymen, however, tended to occupy higher positions in the status hierarchy of the church. Similarly, Zimmer’s (1956) older Air Force men were more likely to respond to a mailed questionnaire, but these older men were also those of higher military rank. 
Sirken, Pifer, and Brown (1960) found older physicians more likely to respond to a questionnaire than younger physicians, and it may be that these older physicians were professionally and financially better established than the younger physicians. Mayer and Pratt (1966) surveyed people who had been involved in automobile accidents and found that younger drivers were less likely to respond than older drivers. In terms of insurance records, police records, and public opinion, younger drivers appear to have a lower status than older drivers in the domain of driving performance. Finally, in his experimental study of high school students, Mackenzie (1969) found older students to volunteer more. Perhaps within the high school subculture, the older student is ascribed higher status than the younger student. On the basis of these studies, then, it appears reasonable to propose that in some of the studies finding older persons to volunteer more for behavioral research, we may expect to find a positive relationship between age and social status, so that it may be status rather than age that is the effective determinant of volunteering behavior. Of the 41 studies investigating the relationship between volunteering and age, 14 were based on samples of males, 8 were based on samples of females, and 19 were based on samples of males and females combined. Table 2-36 shows the distribution of results of studies employing each of these three types of samples. The distributions of results employing male subjects and combined samples are very similar to each other. However, the distribution of results of studies employing only female subjects appears to differ substantially from the other two distributions. Studies employing only female subjects were much more likely (62% of the time) to find volunteers significantly younger than were studies employing male subjects (29%) or combined samples (26%).
Table 2–36 Distribution of Research Outcomes by Sex of Sample

Sample Sex        Volunteers Younger    No Significant Difference    Volunteers Older    Sum
                  (N = 14), %           (N = 18), %                  (N = 9), %
Male (N = 14)     29                    43                           29                  100
Mixed (N = 19)    26                    47                           26                  100
Female (N = 8)    62                    38                           0                   100
Total (N = 41)    34                    44                           22                  100
Earlier, we noted that survey research studies were much more likely than laboratory studies to find older persons more willing to participate in behavioral research. Analysis of type of study conducted as a function of sex composition of the sample employed suggests that these variables are not independent. Thus, among studies employing female subjects, 75% were laboratory studies, while among all remaining studies, only 33% were laboratory studies. Because the number of studies in each resulting cell is sometimes quite small, we cannot isolate clearly the effects on the relationship between age and volunteering of the sex composition of the sample from the type of study for which volunteering was solicited. Indications from such an analysis, however, suggest that both sex of sample and type of study may operate as variables moderating the relationship between age and volunteering. In our summary of the studies investigating the relationship between age and participation in behavioral research, we can be very confident of only a very modest proposition: that far too often some significant relationship between age and volunteering was obtained. Considering only those studies finding a significant relationship, there was a tendency for more studies to find volunteers to be younger rather than older when compared to nonvolunteers. This tendency appears to be due in large part to the results of laboratory studies, which much more often than survey results report volunteers to be younger rather than older and which also report larger magnitudes of this relationship. There was also a tendency for studies employing female samples to report that volunteers tend to be younger. Many of the studies reporting volunteers or respondents to be older have in common the fact that their older subjects are also higher in social status, broadly defined. 
For these studies, therefore, it may be subjects’ higher status rather than their greater age that serves as the effective partial determinant of volunteering.
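The contrast drawn earlier between 1 significantly "older" result out of 17 laboratory studies and 8 out of 24 surveys can be illustrated with an exact binomial tail probability. The text does not state how chance expectation was assessed, so the sketch below rests on assumptions of its own: the studies are treated as independent, and each is given a .025 probability of a significant result in the "older" direction under the null hypothesis (half of a two-tailed .05 criterion). The function name is hypothetical.

```python
from math import comb


def upper_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: chance rate of a significant result in one stated direction.
p_null = 0.025

lab = upper_tail(17, 1, p_null)      # at least 1 of 17 laboratory studies
survey = upper_tail(24, 8, p_null)   # at least 8 of 24 surveys

print(f"P(>= 1 of 17 significant by chance) = {lab:.3f}")
print(f"P(>= 8 of 24 significant by chance) = {survey:.2e}")
```

Under these assumptions a single significant laboratory result is unremarkable, whereas eight significant survey results would be extremely unlikely by chance alone, in line with the reading given in the text.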
Marital Status

There is no great profusion of evidence bearing on the relationship between volunteering and marital status. As is so often the case in our survey of correlates of volunteering, however, there are far more studies showing a significant relationship than could reasonably be expected if the true correlation between volunteering and marital status were essentially zero. Table 2-37 shows the results of 11 studies divided into those in which there was some versus no personal contact between the investigator and the subject. There were 4 studies showing married persons to volunteer significantly more, 5 studies showing no significant relationship between volunteering and marital status, and 2 studies showing married persons to volunteer significantly less than single persons. Most of the studies involving personal contact
Table 2–37 Studies of the Relationship Between Volunteering and Marital Status

More volunteering by married persons (N = 4)
  No Personal Contact: Donald (1960); Gannon, Nothern, and Carroll (1971); Tiffany, Cowan, and Blinn (1970); Zimmer (1956)
  Personal Contact: (none)

No difference (N = 5)
  No Personal Contact: Baur (1947–1948); Ebert (1973)
  Personal Contact: Belson (1960); Crossley and Fink (1951); Edwards (1968a)

Less volunteering by married persons (N = 2)
  No Personal Contact: Scott (1961)
  Personal Contact: Stein (1971)
between the investigator and his subjects found no relationship between volunteering and marital status. However, most of the studies that did not require personal contact between investigator and subject showed married persons more likely to participate in behavioral research. Five of the 11 studies of Table 2-37 employed both male and female subjects, 4 studies employed only female subjects, and 2 studies employed only male subjects. No very noteworthy differences in median effect sizes were obtained as a function of differences in sex composition of the samples employed. The trend, if one could be discerned at all, was for the studies employing male samples to show greater volunteering by married persons than studies employing either female or combined samples. In summary, then, there appears to be a significant, but not too predictable, relationship between volunteering and marital status. For studies requiring no personal contact between investigator and subject, however, subjects who are married are more likely to volunteer than are subjects who are single.
Religion

There are a number of studies that have examined the relationship between volunteering and affiliation with the Catholic, Jewish, and Protestant faiths. Table 2-38 shows that most of these studies have not reported significant differences but once again there are more significant results than would entitle us to conclude that there was essentially no relationship between volunteering and religious affiliation. Of the 17 studies listed in Table 2-38, there were 12 that permitted a comparison of volunteering rates of Jewish versus Catholic and/or Protestant subjects. Nine of these 12 studies showed Jewish subjects volunteering more than non-Jewish subjects; 4 of these 9 studies found the differences to be significant statistically, while none of the 3 studies showing Jewish subjects to volunteer less found the differences to be significant statistically. These 3 studies did not appear to differ from the remaining 9 studies in the type of task for which volunteering had been solicited. Actually, rates of volunteering could be computed for Jews in six studies, for Protestants in seven studies, and for Catholics in eight studies. The median rates of volunteering or responding were 63% for Jewish, 46% for Protestant, and 48% for Catholic respondents. Although the median rates of volunteering available for Catholic and Protestant respondents are very nearly the same, we should note that
738
Book Three – The Volunteer Subject Table 2–38 Studies of the Relationship Between Volunteering and Religious Affiliation
Jews Volunteer Significantly More than Non-Jews
Protestants Volunteer Significantly More than Catholics
No Significant Difference
Fischer and Winer (1969) I Matthysse (1966) Stein (1971) Teele (1967) I
Lawson (1949) Suchman (1962) Wallin (1949)
Dohrenwend & Dohrenwend (1968) Ebert (1973) Fischer and Winer (1969) II Hill, Rubin, and Willard (1973) MacDonald (1972a) Ora (1966) Poor (1967) I Poor (1967) II Rosen (1951) Teele (1967) II
of eight studies permitting a comparison between volunteering rates of Protestant versus Catholic subjects, the three results that were significant statistically all showed Protestant subjects to be more likely to volunteer than Catholic subjects. All three of these studies involved mailed questionnaires, while only one of the five studies showing no significant differences involved a mailed questionnaire. We can pose as an hypothesis, at least, the proposition that especially in mailed-questionnaire studies, Protestants are more likely to volunteer than Catholics. There are a number of studies available that have investigated the relationship between volunteering and degree of interest in, and commitment to, religious activity. Table 2-39 shows the results of 11 such studies. Most of these studies show no significant relationship, but again there are more showing a significant relationship than we would expect if there were really no association between volunteering and degree of religious interest. Four out of 5 of the studies showing a significant relationship report greater volunteering by those who are more interested in, and more active in, religious activity. While 1 of these 5 studies shows a large effect size (.75), the median effect size is quite small (þ.17). Three of the four studies showing significantly greater volunteering by those more interested in religion are questionnaire studies, while only two of the remaining
Table 2–39 Studies of the Relationship Between Volunteering and Religious Interest
Volunteers Significantly More Interested in Religion
No Significant Difference
Volunteers Significantly Less Interested in Religion
Ebert (1973) MacDonald (1972a) Matthysse (1966) Wallin (1949)
Edwards (1968a) MacDonald (1972b) McDonagh and Rosenblum (1965) Ora (1966) Poor (1967) I Poor (1967) II
Rosen (1951)
Characteristics of the Volunteer Subject
739
seven are questionnaire studies. This finding suggests the hypothesis that it is particularly in questionnaire studies that those who are more interested in religious matters are more likely to participate in behavioral research. A number of different definitions of degree of religious interest and activity were employed in the studies listed in Table 2-39. Seven of these studies defined religious interest in terms of frequency of church attendance, while the remaining definitions were more attitudinal in nature. There appeared to be no relationship between the type of definition of religious interest employed and the nature of the relationship between religious interest and participation in behavioral research. To summarize the relationship between volunteering and religious affiliation, it appears that Jews are somewhat more likely to volunteer than either Catholics or Protestants. In questionnaire studies, there is the further tendency for Protestants to be more likely to respond than Catholics. When we consider the relationship between volunteering for behavioral research and degree of religious interest and activity, it appears that, particularly in questionnaire studies, those who are more interested or active in religious matters are more likely to find their way into the data pool of the behavioral researcher.
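The argument used repeatedly in this section, that there are more significant results than would be expected if there were really no relationship, can be illustrated with a small calculation. The sketch below is not drawn from the studies reviewed. It assumes, purely for illustration, that each study is an independent test at the .05 level, so that the number of significant results expected by chance follows a binomial distribution. The function name `prob_at_least` and the .05 level are assumptions; the counts are those of Tables 2-38 and 2-39.

```python
from math import comb

def prob_at_least(k, n, alpha=0.05):
    """Probability of k or more significant results among n independent
    studies when each is significant by chance with probability alpha.
    (alpha = .05 is an assumed conventional level, not a value from the text.)"""
    return sum(comb(n, i) * alpha**i * (1 - alpha)**(n - i)
               for i in range(k, n + 1))

# (number of studies, number reporting a significant difference)
tables = {
    "Table 2-38, religious affiliation": (17, 7),
    "Table 2-39, religious interest": (11, 5),
}

chance_probabilities = {
    label: prob_at_least(k, n) for label, (n, k) in tables.items()
}

for label, p in chance_probabilities.items():
    print(f"{label}: {p:.6f}")
```

Under these assumptions both probabilities are very small, which is the sense in which the tables contain more significant results than chance alone would lead us to expect. Studies are seldom strictly independent, so the figures are best read as rough guides only.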
Size of Town of Origin

A modest number of studies provide evidence on the relationship between volunteering and size of town of the subject’s origin. Table 2-40 shows that (1) 4 studies have reported significantly greater volunteering by persons from smaller towns, (2) 6 studies have reported no significant relationship, and (3) no studies have reported significantly less volunteering by persons from smaller towns. Of the 10 studies listed in Table 2-40, 6 employed questionnaires and 4 employed more typical laboratory tasks. Four of the 6 questionnaire studies reported higher rates of responding by those from smaller towns, while none of the 4 laboratory studies reported a significant difference in volunteering rates between those from smaller, rather than larger, towns. The effect of size of town of origin on volunteering, then, may be limited to questionnaire-type studies. It should be noted that the effect sizes obtained tended to be quite modest; the largest effect obtained was about .40 and the median effect size was only .16. In summary, then, it appears that, at least for studies employing questionnaires, those persons coming from smaller towns are more likely to participate in behavioral research than are those persons coming from larger towns.

Table 2–40 Studies of the Relationship Between Volunteering and Size of Town of Origin
Persons From Smaller Towns Volunteer Significantly More: Ebert (1973); Franzen and Lazarsfeld (1945); Reuss (1943); Sirken, Pifer, and Brown (1960)
No Significant Difference: Britton and Britton (1951) I; Britton and Britton (1951) II; MacDonald (1972a); MacDonald (1972b) I; MacDonald (1972b) II; Rosen (1951)
Summary of Volunteer Characteristics

Proceeding attribute by attribute, we have now considered a large number of studies investigating the question of how volunteer subjects differ from their more reluctant colleagues. Both the number of studies and the number of attributes are large, so that we are in urgent need of a summary. Table 2-41 provides such a summary. The first column lists the characteristic more often associated with the volunteer subject (except for the extraversion variable, which was significantly associated with volunteering as often as was introversion). The second column lists the number of studies providing evidence on the relationship between volunteering and the characteristic in question. The minimum requirement for admission to our list of volunteer characteristics was that there be at least three statistically significant results, in either direction, bearing on the relationship between any characteristic and volunteering. The third column of Table 2-41 lists the percentage of the total number of relevant results that reported a significant relationship between volunteering and the listed characteristic. The range of percentages runs from 25% to 100%, indicating that for all the characteristics listed, there were more significant results than would be expected if there were actually no relationship between the characteristic listed and volunteering.

Table 2–41 Summary of Studies of Volunteer Characteristics
Volunteer Characteristic | Number of Studies | % of Total Studies Found Significant | % of Total Studies Significantly Favoring Conclusion | % of Significant Studies Favoring Conclusion
A. Female | 63 | 44 | 35 | 79
B. Firstborn | 40 | 25 | 18 | 70
C. Sociable | 19 | 68 | 63 | 92
D. Extraverted | 8 | 50 | 25 | 50
E. Self-disclosing | 3 | 100 | 100 | 100
F. Altruistic | 4 | 100 | 100 | 100
G. Achievement motivated | 9 | 67 | 44 | 67
H. Approval motivated | 19 | 58 | 58 | 100
I. Nonconforming | 17 | 29 | 29 | 100
J. Nonauthoritarian | 34 | 44 | 35 | 80
K. Unconventional | 20 | 75 | 55 | 73
L. Arousal-seeking | 26 | 62 | 50 | 81
M. Anxious | 35 | 46 | 26 | 56
N. Maladjusted | 34 | 71 | 44 | 62
O. Intelligent | 37 | 59 | 54 | 91
P. Educated | 26 | 92 | 92 | 100
Q. Higher social class | 46 | 80 | 70 | 86
R. Young | 41 | 56 | 34 | 61
S. Married | 11 | 55 | 36 | 67
T1. Jewish > Protestant or Protestant > Catholic | 17 | 41 | 41 | 100
T2. Interested in religion | 11 | 45 | 36 | 80
U. From smaller town | 10 | 40 | 40 | 100
Median | 20 | 57 | 42 | 80

Although the third column indicates that all the listed characteristics are too often associated with volunteering, it does not provide evidence on the direction of the relationship. Thus, in the first row, the characteristic listed is “female” and the third column shows that 44% of the 63 relevant studies found some significant relationship between being female and volunteering. Some of these relationships were positive, however, while others were negative. It is the fourth column that gives the percentage of all relevant studies that showed volunteers to be more likely to show the characteristic listed in the first column. Thus, in the first row, the fourth column shows that 35% of the 63 relevant studies found females to be significantly more likely than males to volunteer for research participation. The range of percentages listed in the fourth column runs from 18% to 100%, indicating that for all the characteristics listed, there were more significant results than would be expected if volunteers were not actually more likely to be characterized by the attribute listed in the first column of Table 2-41. Even this fourth column, however, does not give sufficient information, since it is possible that there was an equally large percentage of the total number of relevant studies that yielded results significant in the opposite direction. That is exactly what occurred in the fourth row listing “extraverted” as the volunteer characteristic. Of the eight relevant studies 50% showed a significant relationship between volunteering and extraversion (column 3) and 25% showed that extraverts were significantly more likely to volunteer than introverts (column 4). The difference between column 3 and column 4, however, shows that an equal number of studies (25%) showed a significantly opposite effect. As a convenient way of showing the net evidence for a specific relationship between volunteering and any characteristic, column five was added.
This final column lists the percentage of all significant results that favor the conclusion that volunteers are more often characterized by the attribute listed in the first column. The range of percentages listed runs from 50% to 100%. This range indicates that for some characteristics all significant results favor the conclusion implied by the first column, while for others the evidence is equally strong for the conclusion implied by the first column and for the opposite of that conclusion. This latter situation occurred only once and that was in the case of the attribute we have already discussed, extraversion.

Table 2-41 lists the characteristics of volunteers in the order in which we originally discussed them. Table 2-42 lists the characteristics by the degree to which we can be confident that they are indeed associated with volunteering. Four groups of characteristics are discriminable, and within each of these groups the characteristics are listed in approximately descending order of the degree of confidence that we can have in the relationship between volunteering and the listed characteristic. The definition of degree of confidence involved an arbitrary, complex multiple cutoff procedure in which a conclusion was felt to be more warranted when (1) it was based on a larger number of studies, (2) a larger percentage of the total number of relevant studies significantly favored the conclusion, and (3) a larger percentage of just those studies showing a significant relationship favored the conclusion drawn. These three criteria are based on the second, fourth, and fifth columns of Table 2-41. The minimum values of each of these three criteria required for admission to each of the four groups of characteristics (as shown in Table 2-42) are shown in Table 2-43. To qualify for “maximum confidence,” a relationship had to be based on a large number of studies, of which a majority significantly favored the conclusion and of which the vast majority of just the significant outcomes favored the conclusion drawn. To qualify for “considerable confidence,” a large number of studies was also required, but the fraction of total studies significantly supporting the conclusion drawn was permitted to drop somewhat below one-third. The percentage of significant results that favored the conclusion, however, was still required to be large (73%). The major difference between the categories of “considerable” and “some” confidence was in the number of studies available on which to base a conclusion, although some characteristics that often had been investigated were placed into the “some” category when the fraction of significant studies favoring the conclusion fell to below two-thirds. The final category of “minimum confidence” comprised characteristics that did not so clearly favor one direction of relationship over the other and characteristics that had not been sufficiently often investigated to permit a stable conclusion.

Table 2–42 Volunteer Characteristics Grouped by Degree of Confidence of Conclusion
I. Maximum Confidence: 1. Educated; 2. Higher social class; 3. Intelligent; 4. Approval-motivated; 5. Sociable
II. Considerable Confidence: 6. Arousal-seeking; 7. Unconventional; 8. Female; 9. Nonauthoritarian; 10. Jewish > Protestant or Protestant > Catholic; 11. Nonconforming
III. Some Confidence: 12. From smaller town; 13. Interested in religion; 14. Altruistic; 15. Self-disclosing; 16. Maladjusted; 17. Young
IV. Minimum Confidence: 18. Achievement-motivated; 19. Married; 20. Firstborn; 21. Anxious; 22. Extraverted

Table 2–43 Cut-off Requirements for Each Degree of Confidence
Degree of Confidence | Number of Studies | % of Total Studies | % of Significant Studies | Mean of Last Two Columns
Maximum | 19 | 54 | 86 | 70
Considerable | 17 | 29 | 73 | 51
Some | 3 | 29 | 61 | 45
Minimum | 3 | 18 | 50 | 34

To put the basis for the grouping shown in Tables 2-42 and 2-43 in a slightly different way, we can say that degree of confidence of a conclusion is based on the degree to which future studies reporting no significant relationships, or even relationships significantly in the opposite direction, would be unlikely to alter the overall conclusion drawn. Thus, for example, when 24 of 26 studies show volunteers to be significantly better educated than nonvolunteers, it would take a good many studies showing no significant relationship and even a fair number of studies showing a significantly opposite relationship before we would decide that volunteers were not, on the whole, better educated than nonvolunteers.
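The grouping of Table 2-42 can be checked against the minimum values of Table 2-43 with a few lines of code. The sketch below is only an illustration. It transcribes the second, fourth, and fifth columns of Table 2-41 and confirms that each characteristic meets the minimum values listed for its own group. The names `cutoffs`, `members`, and `meets_cutoffs` are hypothetical. Because the procedure is described as arbitrary and complex, the minimum values are treated here as necessary conditions only, not as a complete classification rule.

```python
# Minimum values from Table 2-43: (number of studies,
# % of total studies favoring, % of significant studies favoring).
cutoffs = {
    "maximum": (19, 54, 86),
    "considerable": (17, 29, 73),
    "some": (3, 29, 61),
    "minimum": (3, 18, 50),
}

# Columns 2, 4, and 5 of Table 2-41, grouped as in Table 2-42.
members = {
    "maximum": {
        "Educated": (26, 92, 100),
        "Higher social class": (46, 70, 86),
        "Intelligent": (37, 54, 91),
        "Approval-motivated": (19, 58, 100),
        "Sociable": (19, 63, 92),
    },
    "considerable": {
        "Arousal-seeking": (26, 50, 81),
        "Unconventional": (20, 55, 73),
        "Female": (63, 35, 79),
        "Nonauthoritarian": (34, 35, 80),
        "Jewish > Protestant or Protestant > Catholic": (17, 41, 100),
        "Nonconforming": (17, 29, 100),
    },
    "some": {
        "From smaller town": (10, 40, 100),
        "Interested in religion": (11, 36, 80),
        "Altruistic": (4, 100, 100),
        "Self-disclosing": (3, 100, 100),
        "Maladjusted": (34, 44, 62),
        "Young": (41, 34, 61),
    },
    "minimum": {
        "Achievement-motivated": (9, 44, 67),
        "Married": (11, 36, 67),
        "Firstborn": (40, 18, 70),
        "Anxious": (35, 26, 56),
        "Extraverted": (8, 25, 50),
    },
}

def meets_cutoffs(values, minimums):
    """True when every criterion is at or above its minimum value."""
    return all(v >= m for v, m in zip(values, minimums))

# Does every characteristic satisfy the minimum values of its own group?
all_consistent = all(
    meets_cutoffs(values, cutoffs[group])
    for group, rows in members.items()
    for values in rows.values()
)

# "Mean of Last Two Columns" in Table 2-43.
cutoff_means = {
    group: (pct_total + pct_sig) / 2
    for group, (_, pct_total, pct_sig) in cutoffs.items()
}

print(all_consistent)
print(cutoff_means)
```

The minimum values alone do not reproduce the grouping. Achievement-motivated (9 studies, 44%, 67%), for instance, meets the minimum values for “some confidence” yet is listed under “minimum confidence,” which is consistent with the description of the procedure as complex.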
So far in our summary of characteristics associated with volunteering, we have counted all relevant studies, paying no attention to the type of task for which volunteering had been requested nor to the sex of the sample of subjects nor to the particular operational definition of the characteristic employed in each study. Yet each of these variables has been found to affect the relationship between volunteering and some of the characteristics investigated. On the one hand, then, our summary so far has been robust in the sense that conclusions drawn with good levels of confidence transcend the effects of these moderator variables. On the other hand, our summary has not been as precise as it might have been had we taken into account the effects of our several moderator variables. Hence, we conclude with a listing of the conclusions that seem warranted by the evidence, taking into account the effects of various moderator variables. The order of our listing follows that shown in Table 2-42, beginning with the conclusions warranting maximum confidence and ending with the conclusions warranting minimum confidence. Within each of the four groups, the conclusions are also ranked in approximate order of the degree of confidence we can have in each.

Conclusions Warranting Maximum Confidence
1. Volunteers tend to be better educated than nonvolunteers, especially when personal contact between investigator and respondent is not required.
2. Volunteers tend to have higher social-class status than nonvolunteers, especially when social class is defined by respondents’ own status rather than by parental status.
3. Volunteers tend to be more intelligent than nonvolunteers when volunteering is for research in general, but not when volunteering is for somewhat less typical types of research such as hypnosis, sensory isolation, sex research, and small-group and personality research.
4. Volunteers tend to be higher in need for social approval than nonvolunteers.
5. Volunteers tend to be more sociable than nonvolunteers.

Conclusions Warranting Considerable Confidence
6. Volunteers tend to be more arousal-seeking than nonvolunteers, especially when volunteering is for studies of stress, sensory isolation, and hypnosis.
7. Volunteers tend to be more unconventional than nonvolunteers, especially when volunteering is for studies of sex behavior.
8. Females are more likely than males to volunteer for research in general, but less likely than males to volunteer for physically and emotionally stressful research (e.g., electric shock, high temperature, sensory deprivation, interviews about sex behavior).
9. Volunteers tend to be less authoritarian than nonvolunteers.
10. Jews are more likely to volunteer than Protestants, and Protestants are more likely to volunteer than Catholics.
11. Volunteers tend to be less conforming than nonvolunteers when volunteering is for research in general, but not when subjects are female and the task is relatively “clinical” (e.g., hypnosis, sleep, or counseling research).

Conclusions Warranting Some Confidence
12. Volunteers tend to be from smaller towns than nonvolunteers, especially when volunteering is for questionnaire studies.
13. Volunteers tend to be more interested in religion than nonvolunteers, especially when volunteering is for questionnaire studies.
14. Volunteers tend to be more altruistic than nonvolunteers.
15. Volunteers tend to be more self-disclosing than nonvolunteers.
16. Volunteers tend to be more maladjusted than nonvolunteers, especially when volunteering is for potentially unusual situations (e.g., drugs, hypnosis, high temperature, or vaguely described experiments) or for medical research employing clinical, rather than psychometric, definitions of psychopathology.
17. Volunteers tend to be younger than nonvolunteers, especially when volunteering is for laboratory research and especially if they are female.

Conclusions Warranting Minimum Confidence
18. Volunteers tend to be higher in need for achievement than nonvolunteers, especially among American samples.
19. Volunteers are more likely to be married than nonvolunteers, especially when volunteering is for studies requiring no personal contact between investigator and respondent.
20. Firstborns are more likely than laterborns to volunteer, especially when recruitment is personal and when the research requires group interaction and a low level of stress.
21. Volunteers tend to be more anxious than nonvolunteers, especially when volunteering is for standard, nonstressful tasks and especially if they are college students.
22. Volunteers tend to be more extraverted than nonvolunteers when interaction with others is required by the nature of the research.
3 Situational Determinants of Volunteering
In the last chapter we examined the evidence bearing on the relationship between volunteering and a variety of more-or-less stable personal characteristics of those given an opportunity to participate in behavioral research. In the present chapter we shall examine the evidence bearing on the relationship between volunteering and a variety of more-or-less situational variables. As was the case in our discussion of the more stable characteristics of volunteers, our inventory of situational determinants was developed inductively rather than deductively. The question we put to the archives of social science was, What are the variables that tend to increase or decrease the rates of volunteering obtained? The answer to our question has implications for both the theory and practice of the behavioral sciences. If we can learn more about the situational determinants of volunteering, we will also have learned more about the social psychology of social influence processes. If we can learn more about the situational determinants of volunteering, we will also be in a better position to reduce the bias in our samples that derives from volunteers’ being systematically different from nonvolunteers in a variety of characteristics.
Material Incentives

Considering the frequency with which money is employed as an incentive to volunteer, there are remarkably few studies that have examined experimentally the effectiveness of money as an incentive. The results of those studies that have investigated the issue are somewhat surprising in the modesty of the relationships that have been obtained between payment and volunteering. Thus, Mackenzie (1969) found high school students to be significantly more likely to volunteer when offered payment, but the magnitude of the relationship was not very large (≅ .3). When college students were employed, MacDonald (1972a) found no effect of offered payment on rates of volunteering. In his study, however, the offer of payment interacted with sex of subject such that payment increased females’ volunteering from 61% to 73%, while it decreased males’ volunteering from 56% to 48% (p ≅ .10).
In a mail survey of attitudes toward retail stores, Hancock (1940) employed three experimental conditions: (1) questionnaire only, (2) questionnaire with a promise of 25 cents to be sent on receipt of the completed questionnaire, and (3) questionnaire with 25 cents enclosed. Only 10% of the questionnaires were returned in usable form by those who received only the questionnaire. The addition of the promise of a future “reimbursement” for cooperation increased the usable return rate to 18%, while the addition of the “payment in advance” increased the return rate to 47%! An enormous reduction of potential bias, then, was effected by a fairly modest financial incentive that probably served to obligate the respondent to the survey organization. This technique of payment in advance may be seen as more a psychological than an economic technique in enlisting the assistance of the potential volunteer. Later we shall describe additional evidence bearing on the relationship between volunteering and experimentally aroused guilt. In the studies described so far, the actual amounts of money offered were quite small, ranging from 25 cents to $1.50. In a study by Levitt, Lubin, and Zuckerman (1962), however, the financial incentive was more substantial: $35.00 for three to six hours of hypnosis research. The subjects were a group of student nurses who had not volunteered for research in hypnosis. Of the nonvolunteers, half were exposed to a lecture on hypnosis and half were offered the $35. Somewhat surprisingly, there was essentially no difference in volunteering rates associated with the different additional “motivators” employed. A similar procedure of trying to convert nonvolunteers to volunteers was explored by Bass, Dunteman, Frye, Vidulich, and Wambach (1963; Bass, 1967). Three target groups of subjects were established by employing Bass’s Orientation Inventory: (1) task-oriented, (2) interaction-oriented, and (3) self-oriented persons.
When no pay had been offered, task-oriented subjects tended to volunteer most. Nonvolunteers were then offered pay ($1.50) for participation and this did increase the total number of subjects now willing to participate. The interesting results, however, were the differential effects of adding a financial incentive on the three groups of nonvolunteers. The self-oriented subjects were most influenced by the offer of pay (57% volunteering), followed by the task-oriented subjects (44%) and trailed by the interaction-oriented subjects (36%). Some people are predictably more likely than others to be greatly affected by an offer of payment for service as a research participant. Table 3-1, based upon data provided by Bass, summarizes the results of the research. The subtable A gives the proportion of each type of subject to volunteer under conditions of pay and no pay as well as the proportion who never volunteered. Subtable B gives the proportions in each cell after having corrected for the differences in the row margins. The purpose of this procedure is to highlight cell effects uninfluenced by the frequently arbitrary differences in the margin totals (Mosteller, 1968). The procedure is to divide each cell entry by its row total to equalize the row margins and then to divide the new cell values by the column total to equalize those, and so on, until further iteration produces no change in the cell values. Subtable C expresses the values above as residuals from the expected values so as to highlight the more extreme deviates. The first row shows that under conditions of no pay, the task-oriented subjects volunteer too often, while the self-oriented subjects volunteer too rarely. The second row shows that self-oriented subjects volunteer too often for pay, while both other groups of subjects volunteer too rarely for pay. 
The third row shows that interaction-oriented subjects too often fail to volunteer at all, while the other two groups do not fail to volunteer “often enough.”

The research by Bass’s group, although very instructive, did not actually involve an experimental manipulation of financial incentive, since all the nonvolunteers were offered pay after it was learned that they had not volunteered. This research, then, constituted an important contribution to the literature on volunteering, following an individual difference approach to the problem. Another study to employ such an approach was that by Howe (1960). He found that volunteers showed a much greater need for cash than did the nonvolunteers (effect size = 1.10, p < .001). Howe had offered payment of $3.00 for participation. In this study, need for cash was determined after the volunteering occurred, so it is possible that the incentive was viewed as more important by those who had already committed themselves to participate, by way of justifying their commitment to themselves and perhaps to the investigator. Another approach to the study of the role of financial incentives in volunteering has been to ask subjects to state their reason for having volunteered. Jackson and Pollard (1966) found that only 21% of their subjects listed payment ($1.25) as a reason for volunteering, while 50% listed curiosity as a reason. Interestingly, only 7% listed being of help to science as a reason for volunteering, and 80% of those who did not volunteer gave as their reason that they had no time available. In their study of prisoners volunteering for research on malaria, Martin, Arnold, Zimmerman, and Richart (1968) found about half to report that payment was the major reason for their participation, while the other half reported altruistic motives as their major reason for participation. Another motive that may operate with prisoners who volunteer for research is the implicit hope that a parole board might take into favorable account the prisoner’s participation in research.

Table 3–1 Proportion of Subjects Volunteering for Three Types of Subjects (After data provided by Bass, 1967)
Type of Subject | Task-Oriented | Interaction-Oriented | Self-Oriented | Sum
A. Basic data
Volunteered: no pay | .71 | .62 | .57 | 1.90
Volunteered: pay | .13 | .14 | .25 | 0.52
Never volunteered | .16 | .24 | .18 | 0.58
Sum | 1.00a | 1.00a | 1.00a | 3.00
B. Standardized margins
Volunteered: no pay | .41 | .32 | .27 | 1.00
Volunteered: pay | .28 | .27 | .45 | 1.00
Never volunteered | .31 | .41 | .28 | 1.00
Sum | 1.00 | 1.00 | 1.00 | 3.00
C. Residuals (from .333)
Volunteered: no pay | +.08 | -.01 | -.06
Volunteered: pay | -.05 | -.06 | +.11
Never volunteered | -.03 | +.08 | -.05
a N = 42 for each group.
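The margin-standardizing procedure described for Table 3-1 can be sketched in a few lines. This is an illustrative sketch, not the authors’ or Mosteller’s own computation: the function name `standardize_margins`, the convergence tolerance, and the iteration limit are assumptions. The input values are those of Subtable A, with rows for the three volunteering outcomes and columns for the three types of subject.

```python
def standardize_margins(table, tol=1e-10, max_iter=1000):
    """Alternately divide cells by row totals and by column totals
    until another pass no longer changes the cell values."""
    cells = [row[:] for row in table]
    for _ in range(max_iter):
        previous = [row[:] for row in cells]
        # Equalize the row margins.
        cells = [[value / sum(row) for value in row] for row in cells]
        # Equalize the column margins.
        column_totals = [sum(column) for column in zip(*cells)]
        cells = [[value / total for value, total in zip(row, column_totals)]
                 for row in cells]
        largest_change = max(
            abs(new - old)
            for new_row, old_row in zip(cells, previous)
            for new, old in zip(new_row, old_row)
        )
        if largest_change < tol:
            break
    return cells

# Subtable A of Table 3-1. Rows: volunteered without pay, volunteered
# for pay, never volunteered. Columns: task-, interaction-, self-oriented.
basic_data = [
    [0.71, 0.62, 0.57],
    [0.13, 0.14, 0.25],
    [0.16, 0.24, 0.18],
]

standardized = standardize_margins(basic_data)
# Residuals from the value expected if all cells were equal (1/3).
residuals = [[value - 1 / 3 for value in row] for row in standardized]

for row in standardized:
    print([round(value, 2) for value in row])
for row in residuals:
    print([round(value, 2) for value in row])
```

Rounded to two decimal places, the standardized values and residuals should agree with Subtables B and C. Because every row and column is forced to the same total, what remains in the cells reflects only the association between type of subject and volunteering outcome.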
That such hopes are implicit rather than explicit stems from the typical structuring of such research for the prisoner in terms that make clear that volunteering will in no way affect the prisoner’s term of time to be served. Nevertheless, parole boards are likely to know that a prisoner participated in research, and they are likely to view this with increasing favor as a function of the degree of jeopardy into which they feel the prisoner has placed himself. Similar results to those reported by Martin et al. (1968) have been obtained by
Nottingham (1972) in his study of attitudes and information-seeking behavior. Payment was the primary reason given for subjects’ participation by 45%, while another 34% gave altruistic motives as their primary reason for participation. There are some suggestions in the psychiatric literature that the use of financial incentives might change the nature of the volunteer sample in the direction of greater psychiatric stability. Esecover, Malitz, and Wilkens (1961), in their research on hallucinogens, found that the more well adjusted volunteers were motivated more by payment than were the less well adjusted volunteers. Other motivations operating to get the better-adjusted subjects into the volunteer sample included scientific curiosity or normative expectations that volunteering would occur, as in samples of medical students. Such findings tend to be consistent with those of Pollin and Perlin (1958). In our discussion so far, we have examined the relationship between volunteering and financial incentives. There is some fascinating preliminary evidence to suggest that whether volunteers were paid or not may be a significant determinant of the direction and magnitude of the effects of the experimental conditions to which the paid and unpaid volunteers are assigned at random. Thus, Weitz (1968) employed paid and unpaid college student volunteers and paid and unpaid nonvolunteers in a study of the effects on subjects’ responses of biased intonation patterns in experimenters’ reading of experimental instructions. She reported that unpaid volunteers were significantly more influenced to respond in the direction of the experimenter’s biased vocal emphasis than were the other three groups (p < .02). In a similar study conducted with high school students, Mackenzie (1969) found that the results depended on which of two investigators was in charge of the study. 
For one of the investigators, it was the paid nonvolunteers who were most influenced to respond in the direction of the biased intonation pattern found in the tape-recorded instructions. For the other investigator, it was the paid volunteer subjects who were most influenced, while the paid nonvolunteers were least influenced. Taken together, these studies suggest that volunteer status and level of financial incentive can jointly interact with the effects of the investigator’s independent variable. There are not yet enough studies available, however, to permit any conclusion about the particular moderating effects that volunteering and incentive level are likely to have on the operation of any particular independent variable. So far in our discussion of material incentives the particular incentives have been financial. There are other studies that have investigated the effects on volunteering of material incentives that were more in the nature of small gifts and courtesies. Thus, Pucel, Nelson, and Wheeler (1971) in their survey of 1128 graduates of post-highschool vocational–technical schools in the Minneapolis area, enclosed various numbers and combinations of small incentives with their mailed questionnaires. These incentives included small pencils, packets of instant coffee, use of green paper, and a preliminary letter telling of the subsequent arrival of the questionnaire. When respondents were grouped on the basis of the number of incentives employed (0, 1, 2, and 3), a monotonic relationship (p < .01) emerged between response rates and number of incentives received (43%, 51%, 55%, and 63%). In this study, females tended to volunteer more than males overall but appeared to be somewhat less affected by the employment of incentives than were the males. Thus, adding a third incentive increased the volunteering rate of females by only 4 percentage points, while it increased the volunteering rate of males by 17 percentage points. 
Enclosing a stamped, self-addressed envelope with mailed questionnaires also appears to increase the response rate appreciably (Dillman, 1972). Feriss (1951),
for example, found that such an enclosure increased the response rate from 26% to 62%, while Price (1950) found that stamped return envelopes increased the response rate over unstamped return envelopes by 11% (from 17% to 26%). The generally lower response rate obtained by Price was probably caused by respondents’ having to pay $6.00 in membership fees to join a national organization. A frequently employed incentive to volunteering is the offering of extra academic credit. On intuitive grounds this might appear to be a powerful incentive indeed, but there is surprisingly little experimental evidence to tell us just how effective extra credit might be in increasing rates of volunteering. This question is just one more that MacDonald (1972a) has helped to answer in his important research. While 58% of his subjects volunteered when no incentive was offered and 57% volunteered for modest pay, 79% volunteered when extra academic credit was offered. MacDonald’s research dealt with a positive academic incentive. The research by Blake, Berkowitz, Bellamy, and Mouton (1956), on the other hand, dealt with negative academic incentives. They found that more subjects volunteered when their reward was getting to miss a lecture, and a great many more volunteered when their reward was getting to miss an examination. We have already raised the possibility that one incentive for participation in behavioral research might be the volunteer’s perception that expert assistance with personal problems might be a side benefit. The evidence for this hypothesis is suggestive and intuitively appealing, but it is far from conclusive. When the research takes on a medical cast, however, there is more direct evidence available. In his longitudinal study of the normative aging of veterans, for example, Rose (1965) found volunteers to give as one reason for their participation the likelihood of prevention of serious illness by their obtaining regular tests and checkups. 
Because the evidence is still in somewhat preliminary form, we cannot be overly confident in our summary of the relationship between volunteering and the material incentives offered to the potential volunteer. Financial incentives do appear to increase rates of volunteering but not dramatically so, at least not for the small amounts of money that are customarily offered. Small gifts and courtesies may be somewhat more effective in raising rates of volunteering than somewhat larger amounts of cash, particularly if they are given before the potential volunteer has decided whether or not to volunteer and if they are outright gifts, not contingent on the subject’s decision. Thus, the symbolic value of a small gift may far outweigh its cash value. A small gift may obligate the recipient to participate and may further impress the recipient with the seriousness of the giver’s purpose. Both these variables, the subject’s sense of obligation and the seriousness of the recruiter’s intent, will be discussed in more detail later in this chapter. There is additional evidence to suggest that the effect of material incentives on volunteering may be moderated by stable personal characteristics of the potential volunteer. Finally, volunteering and level of incentive offered appear to moderate the effects of independent variables such that research results may be less replicable when levels of incentive and volunteering status are altered in subsequent replications.
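The response-rate comparisons summarized in this section can be checked with a few lines of arithmetic. The following Python sketch is an illustration only: the function names are hypothetical, and the figures are simply the rates quoted above for Pucel, Nelson, and Wheeler (1971), Feriss (1951), and Price (1950). It separates an absolute gain in percentage points from a relative gain, and tests whether a series of rates rises strictly with the number of incentives.

```python
def percentage_point_gain(before, after):
    """Absolute change between two rates expressed in percent."""
    return after - before


def relative_gain(before, after):
    """Change expressed as a percentage of the starting rate."""
    return (after - before) / before * 100


def is_monotonic_increasing(values):
    """True when every value is strictly larger than the one before it."""
    return all(a < b for a, b in zip(values, values[1:]))


# Response rates for 0, 1, 2, and 3 incentives (Pucel, Nelson, and Wheeler, 1971).
pucel_rates = [43, 51, 55, 63]

# (before, after) response rates quoted in the text, in percent.
envelope_studies = {
    "Feriss (1951)": (26, 62),
    "Price (1950)": (17, 26),
}

if __name__ == "__main__":
    print("Pucel rates monotonic:", is_monotonic_increasing(pucel_rates))
    for study, (before, after) in envelope_studies.items():
        points = percentage_point_gain(before, after)
        relative = relative_gain(before, after)
        print(f"{study}: +{points} points, +{relative:.1f}% relative")
```

Keeping the two kinds of gain apart matters when the starting rates differ as much as they do between these two surveys, because the same number of percentage points represents a much larger relative change from a low baseline.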
Aversive Tasks
Not surprising is the finding that when potential subjects fear that they may be physically stressed, they are less likely to volunteer. Subjects threatened with electric
shocks were less willing to volunteer for subsequent studies involving the use of shock (Staples and Walters, 1961). More surprising perhaps is the finding that an increase in the expectation of pain does not lead concomitantly to much of an increase in avoidance of participation. In one study, for example, 78% of college students volunteered to receive very weak electric shocks, while almost that many (67%) volunteered to receive moderate-to-strong shocks (Howe, 1960). The difference between these volunteering rates was of only borderline significance (p < .15). The motives to serve science and to trust in the wisdom and authority of the experimenter (Orne, 1969) and to be favorably evaluated by the experimenter (Rosenberg, 1969) must be strong indeed to have so many people willing to tolerate so much, for so little tangible reward. But perhaps in Howe’s (1960) experiment the situation was complicated by the fact that there was more tangible reward than usual. The rates of volunteering that he obtained may have been elevated by a $3.00 incentive that he offered in return for participation. Eisenman (1965) also recruited subjects for research involving electric shock. He found that volunteering rates under conditions of strong shock were affected by subjects’ birth order. Firstborns were nearly unanimous in their refusal to submit to strong shock. When the task involved an isolation experience, Suedfeld (1969) reported, firstborns were likely to volunteer less than laterborns if the recruiting procedure made the isolation experience appear frightening. Some evidence in support of this hypothesis was put forward by Dohrenwend, Feldstein, Plosky, and Schmeidler (1967). These workers found sensory deprivation to be more aversive to firstborns than to laterborns (p = .05, one-tail), suggesting that firstborn volunteers must have been quite eager to serve as subjects to be willing to undergo so much anxiety.
Dohrenwend’s group also suggested that because subjects may volunteer in order to affiliate with the investigator, volunteers might be especially prone to the biasing effects of the experimenter’s expectations. When volunteering is requested for research with a medical cast, the results are similar to what we would expect in the case of behavioral research. The more severe the perceived risk or stress to which the volunteer is to be exposed, the less likely he is to volunteer. Thus, in the research by Martin et al. (1968) people volunteered more for exposure to risks perceived as less serious: 79% volunteered to be exposed to air pollution, 47% to be exposed to the common cold, 38% to be exposed to new drugs, and 32% to be exposed to malaria. While all subgroups studied showed the same monotonic decrease, the slopes were quite different from group to group, with prisoners showing the most gentle slope and professional persons showing the steepest slope (93% for air pollution down to 0% for malaria). Earlier we saw that recruitment conditions might interact with independent variables of a psychological nature. We should note here that recruitment conditions may also interact with independent variables of a biological nature. Thus, Brehm, Back, and Bogdonoff (1964) found that those who volunteered to fast with only a low level of incentive not only reported themselves as less hungry but actually showed less physiological evidence of hunger. Although a number of investigators have discussed the incidence and consequences of subjects’ fears in the psychological experiment (e.g., Gustav, 1962; Haas, 1970) there is little evidence available to tell us exactly what type of study inspires how much fear in what type of subject. Such information would be very useful in helping us to estimate the selection biases likely to operate in different kinds
of research. A somewhat related line of inquiry has been undertaken, however, by Sullivan and Deiker (1973), who asked subjects to indicate the type of experiments for which they would theoretically be likely to volunteer. When the research called for alteration of subjects’ level of self-esteem, 80% said they would volunteer; when the research called for experimental pain, 52% said they would volunteer; but when the research called for the experimental induction of unethical behavior, only 39% said they would volunteer. In summary, there does appear to be a tendency for subjects to volunteer less for more painful, stressful, or dangerous tasks when the pain, stress, or danger is either psychological or biological. There are further indications that personal characteristics of the subject and the level of incentive offered may act as variables moderating the relationship between volunteering and the aversiveness of the task.
Guilt, Happiness, and Competence
There is increasing evidence to suggest that the feeling states experienced by subjects at the time of their recruitment can significantly affect their probability of volunteering for behavioral research. The evidence, while not overwhelming in number of studies conducted, is unusually unequivocal in terms of drawing causal inference, based as it is on direct experimental manipulations of subjects’ subjective experience.

Freedman, Wallington, and Bless (1967) conducted three experiments investigating the effects of induced guilt on volunteering. In their first study, half their male high school subjects were induced to lie to the investigator. These subjects had been told by a confederate about the experimental task, the Remote Associates Test, but reported to the investigator that they had not heard about the task. Subsequently, 65% of those who had been induced to lie volunteered, compared to only 35% of those who had not been induced to lie (p < .05). In their second study, they arranged to have some of their college student subjects knock over a stack of 1000 note cards that had been prepared for a doctoral dissertation. Of these presumably guilt-ridden subjects, 75% agreed to volunteer compared to 38% of subjects who had not knocked over the note cards (p < .02). Somewhat surprisingly, the effects of guilt were much less pronounced when volunteering was to assist the person who had been harmed by the subjects’ “clumsiness” than when volunteering was to assist another unharmed person. Volunteering rates were 60% versus 50% for the guilty versus nonguilty subjects asked to help the harmed person, but they were 90% versus 25% for the guilty versus nonguilty subjects asked to help a nonharmed person.
In their third study, an altered replication of their second study, recruiting was always on behalf of the harmed person, but for half the subjects there would be contact with that person and for half the subjects there would be no contact. Overall, the presumably guilty subjects volunteered significantly (p = .02) more (56%) than the nonguilty subjects (26%). The effects of guilt were much less pronounced when volunteering required contact with the harmed person (41% versus 35%) than when volunteering did not require contact with the harmed person (73% versus 18%). Although the difference between these differences was not significant in either the second or the third study described, taken together they are very suggestive of the hypothesis that while guilt may increase volunteering, it is more likely to do so when personal contact with the
victim can be avoided. Apparently subjects want to atone for their guilt while avoiding the awkwardness of personal contact with their unintended victim.

In their research on the relationship between guilt and volunteering, Wallace and Sadalla (1966) divided their 39 subjects into three groups. In two of these groups, subjects were led to believe that they had ruined a piece of laboratory equipment, but in one group this damage was “detected” by the experimenter and in the other it remained undetected. The third group did not “ruin” the equipment. Of the 13 subjects in each group, only 2 of those who had not ruined the equipment volunteered to participate in a subsequent stress experiment, 5 volunteered from the undetected group, and 9 volunteered from the detected group (p < .05). Apparently private guilt may increase volunteering but public guilt may increase it even more.

In his study of the relationship between guilt and volunteering, Silverman (1967) employed much younger subjects: sixth-grade boys and girls from a primarily lower-middle-class background. Children were exposed to the temptation to cheat and then given an opportunity to volunteer for further experiments lasting from one minute to one hour. In every case, volunteering meant sacrificing time from the children’s recess. Surprisingly, children who cheated were slightly less likely to volunteer for further research. The difference in results between this study and those summarized earlier might have been the result of the difference in ages or social class of the subjects involved or of the fact that in the earlier studies the subjects’ transgressions appeared to harm someone while the cheating of the children in the Silverman study may have been more in the nature of a “victimless crime.”

Earlier, in our discussion of material incentives to volunteering, we reported that the enclosure of 25 cents with a questionnaire raised the response rate from 10% to 47% (Hancock, 1940).
This dramatic difference in response rates can be interpreted as possibly caused by the operation of guilt. If we assume that very few people would trouble themselves sufficiently to return the 25 cents, then we would have a great many people in possession of money they had done nothing to earn. That might have resulted in guilt feelings for many of the recipients, guilt feelings that might have been reduced by their answering the questionnaire accompanying the unsolicited cash. Other feeling states than guilt have been employed as independent variables in the study of situational determinants of volunteering. Aderman (1972) found that subjects who had been made to feel elated were considerably more likely to volunteer for research (52%) than subjects who had been made to feel relatively depressed (31%). Holmes and Appelbaum (1970) assigned their subjects to positive, negative, or control experiences and subsequently asked them to volunteer for additional hours of research participation. Subjects who had been exposed to positive experience in the subject role volunteered for significantly more research time (2.0 hours) than did the subjects of the control group (0.7 hours) while those who had been exposed to negative experience in the subject role volunteered slightly more (0.9 hours) than the control group subjects. In their research, Kazdin and Bryan (1971) examined the relationship between volunteering and subjects’ feelings of competence. Subjects made to feel more competent volunteered to donate blood dramatically more (54%) than did the comparison group subjects (21%). A replication of this research found similar results when one research assistant was employed, but obtained no such effect when a different research assistant was employed. Such interactions of experimental
manipulations with the person of the data collector are, unfortunately, not particularly rare in the behavioral sciences (Rosenthal, 1966, 1969).

To summarize now the indications bearing on the relationship between volunteering and feeling states, we can be somewhat confident that under certain conditions, subjects feeling guilty are more likely to volunteer for behavioral research. This relationship may be stronger when contact with the subject’s unintended victim can be avoided and when the cause of the subject’s guilt is public rather than private. In addition, the relationship may not hold for lower-middle-class children who feel they have not harmed another person by their transgression. There is also some evidence to suggest that subjects made to “feel good” or to feel competent are more likely to volunteer for research participation.
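Two of the comparisons in this section lend themselves to a short worked example. The Python sketch below is illustrative and rests on assumptions not stated in the text: the function names are hypothetical, and a Pearson chi-square test is assumed for the Wallace and Sadalla (1966) counts, since the test actually used is not reported here. The first function computes the “difference between differences” for the Freedman, Wallington, and Bless (1967) rates; the second tests whether volunteering is independent of group membership.

```python
import math


def difference_of_differences(a_guilty, a_control, b_guilty, b_control):
    """Interaction contrast: guilt effect in condition A minus guilt effect in B."""
    return (a_guilty - a_control) - (b_guilty - b_control)


def chi_square_independence(volunteers, group_sizes):
    """Pearson chi-square for a groups x (volunteered, declined) table."""
    overall_rate = sum(volunteers) / sum(group_sizes)
    chi2 = 0.0
    for yes, n in zip(volunteers, group_sizes):
        expected_yes = n * overall_rate
        expected_no = n * (1 - overall_rate)
        chi2 += (yes - expected_yes) ** 2 / expected_yes
        chi2 += ((n - yes) - expected_no) ** 2 / expected_no
    return chi2


# Freedman et al. (1967), rates in percent, guilty versus nonguilty subjects.
# Second study: helping a nonharmed person (90 vs 25) vs the harmed person (60 vs 50).
second_study = difference_of_differences(90, 25, 60, 50)
# Third study: no contact with the harmed person (73 vs 18) vs contact (41 vs 35).
third_study = difference_of_differences(73, 18, 41, 35)

# Wallace and Sadalla (1966): volunteers out of 13 in the no-damage,
# undetected, and detected groups.
chi2 = chi_square_independence([2, 5, 9], [13, 13, 13])
# With (3 - 1) * (2 - 1) = 2 degrees of freedom, the upper-tail
# probability of chi-square has the closed form exp(-x / 2).
p_value = math.exp(-chi2 / 2)

if __name__ == "__main__":
    print("Second study contrast (points):", second_study)
    print("Third study contrast (points):", third_study)
    print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")
```

Under these assumptions the chi-square probability falls below .05, in line with the significance level quoted for the three-group comparison. The percentage contrasts by themselves say nothing about significance, which depends on sample sizes that are not given in the text.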
Normative Expectations
Volunteering for behavioral research appears to become more likely the more it is viewed as the normative, expected, proper thing to do. One suggestive indication of this hypothesis comes from the experience of psychology departments that, even when they do not require research participation from their students, create a climate of normative expectations that results in reasonably successful recruitment of volunteers for behavioral research (Fraser and Zimbardo, n.d.; Jung, 1969). A more specific theoretical position has been set out, and a more explicit empirical investigation has been undertaken, by Schofield (1972). In her research, subjects were asked to read and evaluate a number of sex education pamphlets as part of an investigation of their effectiveness. In the normative expectation condition, subjects were led to believe that almost all other subjects had volunteered to read a large number of these pamphlets. Subjects in this condition volunteered to read an average of 18 pamphlets, compared to the subjects of the control condition, who volunteered to read an average of only 14 pamphlets.

In his research, Rubin (1969, 1973b) employed four groups of young couples. Half the couples were rated as being more strongly in love than the remaining couples. Half the couples in each of these two groups were asked to interact with their fellow couple member, and half were asked to interact with an opposite-sexed member of a different couple. Subjects from all four groups were then asked to volunteer for a couples T-group experience over a weekend. Three times more couples who were both strongly in love and who had interacted with their fellow couple member volunteered (39%) than did couples who were either less strongly in love (13%) or who were strongly in love but had interacted with a member of a different couple (14%).
This higher rate of volunteering may have been brought about by those couples’ (i.e., those who were strongly in love and who had been asked to interact with each other) feeling that they were the type whose patterns of interaction during the T-group experience would be of greater interest to scientists studying the behavior of couples in love. Since all the couples in this research program had agreed to help in the scientific task of learning more about couples, those who had been studied as couples and who were strongly in love may have felt most keenly the pressure of normative expectations to participate in the T-group experience. A relatively extreme case of the effect of normative expectations on volunteering has been documented by Ross, Trumbull, Rubinstein, and Rasmussen (1966). They
reported a study in which 34 Naval Reserve officers had volunteered to participate in a two-week seminar on problems of fallout shelters. When the officers arrived for their seminar, they were simply put into a fallout shelter for a five-day experiment. Although the officers were permitted to withdraw from the experiment, the pressure of normative expectations was so great that not one of the men asked to be excused.

Normative expectations have also been employed widely in academic contexts. Thus, medical students, medical residents, clinical-psychology interns, and similar groups of advanced students have often been employed in research in which their refusal to “volunteer” would have been regarded as a fairly marked violation of normative expectations (e.g., Esecover, Malitz, and Wilkens, 1961). There are other studies also suggesting that when other persons are seen by the potential volunteer as likely to consent, the probability increases that the potential volunteer will also consent to participate (Bennett, 1955; Rosenbaum, 1956; Rosenbaum and Blake, 1955). Interestingly, it appears that once the volunteer has consented, he may find it undesirable to be denied an opportunity actually to perform the expected task. Volunteers who were given the choice of performing a task that was either more pleasant but less expected or less pleasant but more expected tended relatively more often to choose the latter (Aronson, Carlsmith, and Darley, 1963).

In summary of the relationship between volunteering and normative expectations, it does seem warranted to suggest as a strong hypothesis that persons are more likely to volunteer for behavioral research when volunteering is viewed as the normative, expected, appropriate thing to do.
Public Versus Private Commitment
The evidence is equivocal about the relationship between volunteering and whether the commitment to volunteer is made in public or in private. As we might expect, when normative expectations are in support of volunteering, a public commitment to volunteer may result in higher rates of volunteering. Thus, in Schofield’s (1972) experiment, when subjects believed that others would see their written commitment to evaluate sex-education pamphlets, subjects agreed to evaluate 20 pamphlets, compared to only 12 when subjects believed others would not see their written commitment. Consistent with this result was that of Mayer and Pratt (1966), who surveyed persons involved in automobile accidents. Their high rate of returns (about 75%) may have been caused by the respondents’ knowledge that their identities as persons involved in accidents were a matter of public record and that a failure to respond would be known and regarded as a transgression of a normative expectation. When incentives to volunteer are not very strong, there is evidence to suggest that subjects may volunteer more when volunteering is more private (Blake, Berkowitz, Bellamy, and Mouton, 1956), unless almost everyone else in the group also seems willing to volunteer publicly (Schachter and Hall, 1952). Bennett (1955), however, found no relationship between volunteering and the public versus private modes of registering willingness to participate.

Schachter and Hall (1952) have performed a double service for students of the volunteer problem. They not only examined conditions under which volunteering was more likely to occur but also the likelihoods that subjects recruited under various conditions would actually show up for the experiment to which they had verbally
committed their time. The results were not heartening. Apparently it is just those conditions that increase the likelihood of a subject’s volunteering that increase the likelihood that he will not show up when he is supposed to. This should serve further to emphasize that it is not enough to learn who will volunteer and under what circumstances. We will also need to learn more about which people show up, as our science is based largely on the behavior of those who do. In the case of personality tests there is evidence from Levitt, Lubin, and Brady (1962) to suggest that pseudovolunteers are psychologically more like nonvolunteers than they are like true volunteers. However, Jaeger, Feinberg, and Weissman (1973) found that pseudovolunteers were psychologically more like volunteers than nonvolunteers. It is difficult to summarize with confidence the relationship between volunteering and public versus private commitment. If there is a trend in the research evidence, perhaps it is that public commitment conditions increase volunteering when volunteering is normatively expected but decrease volunteering when nonvolunteering is normatively expected.
Prior Acquaintanceship
There is some evidence that an increase in the potential respondent’s degree of acquaintanceship with the recruiter may lead to an increase in volunteering (Norman, 1948; Wallin, 1949). When high school graduates were surveyed, those who responded promptly were likely to have been part of an experimental counseling program in high school that provided them with additional personal attention (Rothney and Mooren, 1952). These students presumably were better acquainted with, and felt closer to, the sponsors of the survey than did the students who had not received the additional personal attention. In his survey of 79 universities known to employ psychological tests, Toops (1926) found those respondents who knew him better personally to return their replies more promptly. Similarly, Kelley (1929) found in his survey of his Stanford University colleagues that replies were more likely to be received from those with whom he was better acquainted.

Increases in the acquaintanceship with the investigator may reduce volunteer bias, but there is a possibility that one bias may be traded in for others. Investigators who are better acquainted with their subjects may obtain data from them that is different from data obtained by investigators less well known to their subjects (Rosenthal, 1966). We may need to learn which biases we are better able to assess, which biases we are better able to control, and with which biases we are more willing to live.

Related to the variable of acquaintanceship with the recruiter are the variables of recruiter friendliness or “personalness.” Hendrick, Borden, Giesen, Murray, and Seyfried (1972), for example, found that for their seven-page questionnaire higher return rates were obtained when their cover message was more flattering to the respondent as long as the cover message did not also try to be too flattering of the researchers themselves.
This effect of flattery, incidentally, did not occur when only a one-page questionnaire was employed. MacDonald (1969) has suggested that the personalness of the recruitment procedure may interact with the birth order of the subject such that firstborns may be more likely to volunteer than laterborns only when recruitment is more personalized. When attempts to personalize the relationship between recruiter and potential respondent have been relatively superficial, there has usually been no marked effect
on volunteering. Thus, the use of respondents’ names as a technique of personalization has been judged ineffective by both Clausen and Ford (1947) and by Bergen and Kloot (1968–1969). The latter investigators, however, found an interesting interaction between using respondents’ names and the respondents’ sex. Among female subjects, using names increased the volunteering rate from 56% to 76%, while using names decreased the volunteering rate from 76% to 71% among male subjects. Although these interaction effects were substantial in magnitude, they could not be shown to be significant, because of the small size of the sample. In his research, Dillman (1972) compared the effects of using metered mail versus postage stamps on the return rates of questionnaires. The assumption was that postage stamps were more personal than postage meters, but no difference in return rates was found. A summary of the relationship between volunteering and degree of prior acquaintanceship with the recruiter suggests that when subjects actually know the recruiter, they are more likely to agree to participate in the recruiter’s research. When subjects are not personally acquainted with the recruiter, the addition of a more personal touch to the recruiting procedure may, under certain conditions that are not yet well understood, also increase the rate of volunteering.
Recruiter Characteristics
Although there is considerable evidence bearing on the unintended effects on research results of a variety of experimenter characteristics (Rosenthal, 1966, 1969), there is no comparably systematic body of data bearing on the effects on volunteering of various personal characteristics of the recruiter. Nevertheless, there are at least suggestive indications that such effects may occur. There are studies, for example, suggesting directly and indirectly that volunteering may be more likely as the status or prestige of the recruiter is increased (Epstein, Suedfeld, and Silverstein, 1973; Mitchell, 1939; Norman, 1948; Poor, 1967; Straits, Wuebben, and Majka, 1972).

A number of investigators have examined the effect on volunteering of the sex of the recruiter. Ferrée, Smith, and Miller (1973) found that female recruiters were significantly more successful in soliciting volunteers than were male recruiters. Similarly, Weiss (1968) found that subjects who had been in experiments conducted by female experimenters were significantly more likely to volunteer for further research than were subjects who had been in experiments conducted by male experimenters. Also consistent with these findings were the results reported by Rubin (1973a; in press). He found that female recruiters obtained a volunteer rate of 74%, compared to the rate of 58% obtained by male recruiters (p < .001). Rubin also found a tendency for the sex of the recruiter to make a greater difference for female than for male subjects. In addition, for male subjects, the type of task for which volunteering was solicited interacted with the sex of the recruiter. When recruitment was for a study of handwriting analysis, there was no effect of the sex of the recruiter. However, when recruitment was for a study of how people describe themselves, female recruiters obtained a volunteering rate of 67%, while male recruiters obtained a rate of only 45% (p < .01).
Rubin suggested that this difference might be caused by American men’s difficulty in communicating intimately with other men. A very large study of the interaction of sex of the recruiter and sex of the subject was conducted by Tacon (1965). He employed five male and five female recruiters to
solicit volunteering from 980 male and female subjects. When recruiters were of the same sex as the subject, there were significantly more refusals to participate and significantly fewer subjects who actually appeared for the scheduled experiment than when the recruiters were of the opposite sex. These results, although significant (p < .05), were quite modest in magnitude. Finally, in an experiment by Olsen (1968), the effects of sex of recruiter were found to interact with the type of task, at least for male subjects. When recruitment was for an experiment in learning, male subjects volunteered more for a female recruiter. When recruitment was for an experiment in personality, male subjects volunteered more for a male recruiter. There are several other experiments that have investigated individual differences between recruiters in their success at obtaining volunteers. Mackenzie (1969) employed two graduate student recruiters to solicit volunteering among high school students and found that the two recruiters differed significantly in the rates of volunteering obtained. Hood and Back (1971) also found different recruiters to obtain different rates of volunteering but only when the subjects were female. Kazdin and Bryan (1971) also found individual differences in volunteering rates obtained by two recruiters but only in interaction with the level of perceived competence that had been experimentally induced in subjects. Finally, Schopler and Bateson (1965) found that the level of the recruiter’s dependency on the subject interacted with the sex of the subject in producing different rates of volunteering. When the recruiter was in great need of help, females volunteered more than when he was less dependent (40% versus 25%). When the recruiter was in great need of help, however, males volunteered less than when he was less dependent (50% versus 71%). Our summary of the relationship between volunteering and recruiter characteristics must be quite tentative. 
Although there are a good many studies showing significant individual differences among recruiters in their obtained volunteering rates, there are not enough studies relating obtained volunteering rates to specific recruiter characteristics to permit any firm conclusions. There are some indications, however, that recruiters higher in status or prestige are more likely to obtain higher rates of volunteering. There are also indications that female recruiters may be more successful than male recruiters in obtaining agreement to volunteer. This relationship may be modified, however, by such variables as the sex of the subject and the type of task for which volunteering was solicited.
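Several findings in this and the preceding section are interactions in which the effect of a recruiting variable reverses direction across subject groups. As a rough illustration, the Python sketch below (hypothetical function names, using only the percentages quoted from Schopler and Bateson, 1965, and Bergen and Kloot, 1968–1969) computes the simple effect within each sex and flags a crossover when the two effects have opposite signs.

```python
def simple_effect(rate_with, rate_without):
    """Change in volunteering rate (percentage points) within one subject group."""
    return rate_with - rate_without


def is_crossover(effect_a, effect_b):
    """True when the two simple effects point in opposite directions."""
    return effect_a * effect_b < 0


# Schopler and Bateson (1965): recruiter in great need of help vs less dependent.
dependency_female = simple_effect(40, 25)
dependency_male = simple_effect(50, 71)

# Bergen and Kloot (1968-1969): respondents' names used vs not used.
names_female = simple_effect(76, 56)
names_male = simple_effect(71, 76)

if __name__ == "__main__":
    print("Dependency effects:", dependency_female, dependency_male,
          "crossover:", is_crossover(dependency_female, dependency_male))
    print("Name-use effects:", names_female, names_male,
          "crossover:", is_crossover(names_female, names_male))
```

A sign check of this kind describes only the direction of the effects. Whether an interaction is statistically reliable depends on the sample sizes, which is why a contrast that is large in magnitude can still fall short of significance in a small sample.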
Task Importance
There is a fair amount of evidence to suggest that volunteering rates are likely to increase when the task for which volunteering is requested is seen as important. There are several ways in which greater task importance can be communicated to the potential volunteer, including intensity or urgency of recruitment requests, high status or prestige of the recruiter, and offers of large material incentives. We have already examined the evidence suggesting the latter two variables may serve to increase volunteering rates. Indeed, it appears difficult at times to distinguish between the variables of task importance and material incentives. Thus, we noted earlier that a $35 incentive to volunteer increased the volunteering rate in the research by Levitt, Lubin, and Zuckerman (1962). We cannot distinguish, however, the effects of the
magnitude of this generous incentive from the communication to the subjects that this research must be important indeed to warrant so large an expenditure of money on the part of the investigators. Undergraduate students are generally well aware of the importance to advanced graduate students of the progress of their doctoral dissertation, and Rosenbaum (1956) found substantially greater volunteering for an experiment in which a doctoral dissertation hung in the balance than for an experiment in which a more desultory request was made. Weiss (1968) found that subjects who had been in an experimental condition of higher importance also volunteered more for subsequent research than did subjects in an experimental condition of lower importance. In his research, Wolf (1967) found a modest (but not significant) increase in volunteering when the instructor of the course in which recruitment took place gave his strong endorsement of the research. In the comparison group condition, the instructor was absent, but the fact that recruitment occurred with his knowledge and approval probably meant that the instructor endorsed the research implicitly. This implicit level of endorsement may have decreased the magnitude of the effect of task importance from the effect size that might have been obtained in a comparison of groups receiving strong instructor endorsement versus true instructor neutrality. This experiment by Wolf also illustrates the difficulty of distinguishing sometimes between the variables of task importance and normative expectations. An instructor’s strong endorsement bespeaks both his view of the task as important and his explicit expectation that students should volunteer. Task importance has also been found to interact with sex of subject and with the ingratiation tactics of the recruiter. Thus, Hendrick et al. 
(1972) found that increasing the perception of the recruiters’ seriousness of purpose increased volunteering except when the recruitment message simultaneously flattered the respondent. Schopler and Bateson (1965) found that recruiters in more urgent need of volunteers obtained more volunteering from female subjects but less volunteering from male subjects than did recruiters in less urgent need of volunteers. Based on his assessment of the literature, Dillman (1972) employed certified mail as a method of increasing the rate of return of his mailed questionnaires, although he did not employ the use of certified mail as an independent variable. That was done, however, by Sirken, Pifer, and Brown (1960) in their survey of physicians on the topic of death statistics. They found certified mail follow-ups to be 80% effective, compared to the effectiveness of 60% of regular mail. In their survey of the next of kin of the deceased, these workers also found certified mail to be more effective than regular mail. In his follow-up of Fisk University graduates, Phillips (1951) found special delivery letters to be 64% effective, compared to the effectiveness of 26% of ordinary first-class mail. Similar effectiveness of special delivery letters was demonstrated by Clausen and Ford (1947). The use of certified and special delivery letters would appear to be an effective means of emphasizing to potential respondents the importance of the survey at hand and the seriousness of purpose of the researchers. Earlier, in our discussion of small courtesies offered to potential respondents, we noted that the employment of stamped, self-addressed envelopes tended to increase the rate of response (Ferriss, 1951; Price, 1950). One explanation of this result is in terms of the increased convenience to the respondent, but a plausible alternative explanation is in terms of the increased perception of the importance of the survey and the increased urgency of the request to participate.
In summary, there does appear to be a positive relationship between volunteering and level of perceived task importance. The evidence in support of this summary comes not only from studies directly manipulating task importance. Those studies that have varied the status or prestige of the recruiter, or the level of material incentive offered, have also provided evidence, although of a more indirect kind, that adds to the tenability of our tentative conclusion.
Subject Interest
More-General Interests
There is a large body of data to suggest that subjects who are more interested in behavioral research, or more interested in and involved with the particular topic under investigation, will be more likely to volunteer. The better to test this and related hypotheses, Adair (1970) developed the Psychology Research Survey, which permits an assessment of the subject’s favorableness to psychology. He found that volunteers for behavioral research scored significantly higher in favorableness to psychology than nonvolunteers (p < .05, effect size = .37). The differences obtained were somewhat larger when volunteering was requested with no offer of pay and substantially larger than when volunteering was requested for nutritional, rather than behavioral, research. Such results add appreciably to the discriminant validation of Adair’s measure of favorableness toward psychology. Even when working with a population of coerced volunteers, Adair’s measure of favorableness to psychology was able to discriminate between more- and less-eager participants. Adair found that those subjects who signed up earlier for their required research participation scored significantly higher than did those who signed up later. This study was replicated twice, and the results of one of these replications were significant in the predicted direction, although the results of the other, while also in the predicted direction, did not reach significance. Not only do subjects more favorable to psychology appear to volunteer more, but there are indications as well that once in the experiment they may be more willing to comply with the task-orienting cues of the experiment (p < .02; r = .28; Adair and Fenton, 1971). In their research, Kennedy and Cormier (1971) administered their own measures of favorableness to behavioral research and to participating in an experiment, to paid and unpaid volunteers and to subjects who had been required to serve. 
For both measures, both paid and unpaid volunteers were found to be significantly more favorable in attitude than the subjects who had not volunteered, with effect sizes ranging from .66 to .35. All three groups of subjects were employed in a verbal conditioning experiment that showed greatest conditioning on the part of the unpaid volunteers. This result, although not significant (effect size = .27), is very much in line with the result obtained by Adair and Fenton (1971). Twenty years before these last two studies were conducted, Rosen (1951) investigated the relationship between volunteering and attitudes toward psychology. Results showed that volunteers found psychology to be significantly more rewarding and enjoyable than nonvolunteers, and they were more favorably disposed toward psychological experiments (median correlation = .17). In his more recent research, Ora (1966) found volunteers to be significantly more interested in psychology than were nonvolunteers (r = .25).
Employing their own measures of favorableness to scientific research, psychological research, participating as a subject, and the department’s policy about research participation, Wicker and Pomazal (1970) investigated the relationship between these attitudes and volunteering for research participation. Results showed no relationship between general attitudes toward scientific or psychological research and volunteering but did show a significant positive relationship (p < .01) between volunteering and favorableness toward serving as a subject (r = .17) and toward the department’s policy on research participation (r = .19). Research by Meyers (1972) also showed that volunteers were more favorable to the conduct of experimentation than were nonvolunteers, although this result was not significant statistically because of the very small sample sizes involved (for p = .05, and an effect size of .50, power was only 28%). In the studies described so far, the investigators examined explicitly the hypothesis that volunteering was more likely by those subjects with attitudes favorable to behavioral research. There are additional studies, however, in which the assessment of attitudes was less direct. Thus, it seems reasonable to suppose that college students majoring in psychology would be more interested in, and more favorably disposed toward, the field than nonmajors. Jackson and Pollard (1966) found in their study of sensory deprivation that twice as many psychology majors volunteered as did other majors. Similarly, Black, Schumpert, and Welch (1972) found far greater commitment to the experimental task on the part of their volunteers (more advanced students) than on the part of their “nonvolunteers” (less advanced students), who were actually a combined group of volunteers and nonvolunteers. 
Despite the tendency for contrasts between groups of volunteers and of mixed volunteers and nonvolunteers to be diminished by the fact of overlapping memberships, these investigators found the effects on task commitment to be very large (1.56). The substantive results of the experiment, incidentally, were greatly affected by volunteer status. Intermittent knowledge of results led to much greater resistance to extinction in perceptual–motor performance than did constant knowledge of results, and this effect was larger for volunteers (effect size = 1.41) than for nonvolunteers (0.56). In their research, Mulry and Dunbar (n.d.) compared subjects participating earlier in the term with those participating later. Perhaps we can view the earlier participants as the more eager volunteers. Results showed that these more eager volunteers spent significantly more time on each of their questionnaire items than did the more reluctant participants. Presumably, the greater amount of time spent per item reflected the greater seriousness of purpose of the more eager volunteers. An interesting additional finding of this study was the very strong tendency (p < .001) for the participants believed to be more eager, to arrive earlier for their experimental appointment (eta = .40). Additional evidence for the assumption that the early participants were more interested in, and more favorably inclined toward, psychology comes from the finding that although early and later participants did not differ in general ability (SAT scores), the early participants earned significantly higher psychology course grades than did the later participants (p < .025; effect size = .40). The greater interest and involvement of volunteers as compared to nonvolunteers is also suggested in the work of Green (1963). He found that when subjects were interrupted during the performance of their task, volunteers recalled more of the interrupted tasks than did nonvolunteers. 
Presumably the volunteers’ greater involvement and interest facilitated their recall of the tasks that they were not able to complete.
More-Specific Interests
In the studies considered so far, investigators have examined the relationship between volunteering and somewhat general attitudes toward behavioral research. We consider now the research relating volunteering to more-specific attitudes toward particular areas of inquiry. We consider first the area of hypnosis. Boucher and Hilgard (1962) and Zamansky and Brightbill (1965) found that subjects holding more-favorable attitudes toward hypnosis were more likely to volunteer for hypnosis research, although Levitt, Lubin, and Zuckerman (1959) found results in the opposite direction but not significantly so. Particularly threatening to the validity of inferences drawn from studies employing volunteer subjects are the results of several studies showing that volunteers may be more susceptible to hypnosis than nonvolunteers (Bentler and Roberts, 1963; Boucher and Hilgard, 1962; Coe, 1964; Shor and Orne, 1963; Brightbill and Zamansky, 1963; Zamansky and Brightbill, 1965). In a number of these studies, the effect sizes were substantial; as large as .75 in the research of Shor and Orne (see also Hilgard, 1965).
The literature of survey research also suggests greater willingness to participate by those more favorable toward, or more interested in, the topic under investigation. Persons more interested in radio and television were found to be more likely to answer questions about their listening and viewing habits (Belson, 1960; Suchman and McCandless, 1940). When Stanton (1939) inquired about teachers’ use of radio in the classroom he found that those who owned radios were more likely to respond. Reid (1942) replicated Stanton’s research twice but with a sample of school principals rather than teachers. In one of his studies he found no difference in the use of radio in the classroom between those responding earlier to his questionnaire and those more leisurely in their response. 
In his other study, however, Reid found that earlier responders owned more radios and used them significantly more than did later responders. In a related result, Rollins (1940) found in a survey on the use of commercial airlines that early responders were more than twice as likely as later responders to have flown. Consistent results have also been obtained in studies of sex behavior, religion, and public policy. Diamant (1970), Kaats and Davis (1971), and Siegman (1956) all found volunteers to be sexually more experienced and/or sexually more permissive than nonvolunteers. These results were not only significant statistically but in some cases very large in magnitude. Thus, Siegman found that 92% of his volunteers advocated sexual freedom for women, while only 42% of his nonvolunteers did so. Matthysse (1966) wrote follow-up letters to research subjects who had been exposed to proreligious communications. He found respondents to be those who regarded religious matters as relatively more important. Benson (1946) obtained data suggesting that in studies of public policy, respondents may be overrepresented by individuals with strong feelings against the proposed policy—a kind of political protest vote. In their research, Edgerton, Britt, and Norman (1947), Franzen and Lazarsfeld (1945), Gaudet and Wilson (1940), Mitchell (1939), and Scott (1961) all concluded that responders to surveys tended to be more interested in the topic under study than the nonresponders. The vast majority of studies of the relationship between volunteering and interest on the part of the potential respondent are correlational rather than experimental. Few investigators have experimentally varied respondents’ degree of interest or
involvement, but Greenberg’s (1956) rare exception showed that such investigations may prove valuable. His approach was to employ argumentative role-playing to increase respondents’ ego-involvement. These respondents then went on to provide the investigator with more information, fewer stereotyped answers, more thoughtful answers, and fewer “don’t knows.”
Organizational and Interpersonal Bonds
Another body of literature supports the hypothesis that volunteering is a positive function of subjects’ degree of interest and involvement in the content of the research. In this body of literature, the subject’s interest is defined in terms of his level of activity in, and affiliation with, formal organizations or his level of commitment and favorableness to interpersonal relationships that are the subject of the investigation. Table 3-2 summarizes the results of these studies, which are grouped into four sets on the basis of the definition of interest or involvement. In the first set of seven studies, interest was defined by the subject’s level of activity in such local organizations as colleges, unions, and churches and in such national organizations as the League of Women Voters. In six of the seven studies, subjects who volunteered were more active in their relevant organization, such activity ordinarily being part of the information requested by the sponsors of the research. The one exception to this general trend was obtained by Lawson (1949), who investigated the topic of gambling in England. He found that bookmakers, who presumably were vitally interested in gambling, responded substantially less often (9%) than did any other group contacted (up to 48%). That result, of course, runs counter to our hypothesis, but there are other results in the Lawson study that do support our hypothesis. 
Thus, among three groups of clergymen, ordered on the degree to which their positions were strongly against gambling, most returns were obtained from clergymen whose position was most strongly against gambling (44%) and fewest returns were obtained from clergymen least opposed to gambling (13%). Clergymen who held intermediate positions against gambling showed an intermediate rate of response (26%). As a possible example of professional courtesy stands the additional finding that of all groups contacted, psychologists showed the highest response rate (48%).
Table 3–2 Studies of the Relationship Between Volunteering and Subjects’ Interest as Defined by Organizational and Interpersonal Bonds
Organizational Bonds
Level of activity: Donald (1960); Laming (1967); Larson and Catton (1959); Lawson (1949)a; Phillips (1951); Schwirian and Blaine (1966); Wicker (1968a)
Degree of affiliation: Britton and Britton (1951); Kish and Barnes (1973); Lehman (1963); Wallace (1954)
Interpersonal Bonds
Favorableness to treatment: Carr and Whittenbaugh (1968); Kaess and Long (1954)b; Kish and Herman (1971)
Commitment to partner: Hill, Rubin, and Willard (1973); Kirby and Davis (1972); Rubin (1969, pp. 197–198); Locke (1954)
a Results in opposite direction.
b No difference obtained.
In the second set of four studies of Table 3-2, interest was defined by the subject’s degree of affiliation with the organization about which questions were asked and/or which directly sponsored the research (e.g., college attended by one’s children, YMCA, and Time magazine). The results of all four studies were consistent with the interest hypothesis. In the third set of three studies of Table 3-2, interest was defined by the subject’s favorableness to treatment procedures as measured by his continuation in, or benefit from, his treatment. Two of the three studies were in support of the interest hypothesis. In the fourth set of four studies, all involving research on couples, interest was defined by subjects’ degree of commitment to their partners; all four studies were in support of the interest hypothesis.
Of the 18 studies listed in Table 3-2, 16 were in support of the interest hypothesis (usually at the .05 level), 1 showed no relationship between interest and volunteering, and 1 showed a result in the opposite direction. Even this last study, however, yielded some additional results that were in support of the interest hypothesis. For most of the studies listed, the volunteering rates for the more- and less-interested subjects were available. The median volunteering rate for the more-interested persons was 57%, compared to the median rate for the less-interested persons of only 28%. In terms of magnitude of effect, then, the interest or involvement of the subject appears to be one of the most powerful determinants of volunteering.
Survey research literature is rich with suggestions for dealing with potential sources of volunteer bias. 
One practical suggestion offered by Clausen and Ford (1947) follows directly from the work on respondent interest and involvement. It was discovered that a higher rate of response was obtained if, instead of one topic, a number of topics were surveyed in the same study. People seem to be more willing to answer a lot of questions if at least some of the questions are on a topic of interest to them. Another, more standard technique is the follow-up letter or follow-up telephone call to remind the subject to respond to the questionnaire. However, if the follow-up is perceived by the subject as a bothersome intrusion, then, if he responds at all, his response may reflect an intended or unintended distortion of his actual beliefs. The person who has been reminded several times to fill out the same questionnaire may not approach the task in the same way as he would if he had been asked only once, although Eckland (1965) has shown that high levels of prodding need not necessarily lead to the production of data that are factually inaccurate. Finally, there is another extensive body of literature that bears directly and indirectly on the relationship between volunteering and subjects’ level of interest in the topic under investigation. That is the literature (examined in Chapter 2) showing very clearly that better-educated persons are more likely to volunteer for participation in behavioral research. In some of the studies the questionnaires dealt specifically with the respondents’ level of education, so that the assumption appears warranted that a more educated person would be more interested in a survey involving educational experience. In other studies, questionnaires did not deal specifically with respondents’ level of education but with a wide variety of matters. The
assumption we must make in these cases to bring them to bear on the interest hypothesis is that better educated persons are generally more interested in a wider variety of topics than are less well educated persons. That assumption does not appear to be farfetched, however. A similar line of reasoning also brings to bear on the interest hypothesis the additional and related bodies of literature showing more intelligent persons and those scoring higher on social-class variables to be more likely to volunteer for behavioral research. In summarizing the results of the research on the relationship between volunteering and subjects’ level of interest in the topic under investigation, we can be unusually brief and unusually unequivocal. Not only do interested subjects volunteer more than uninterested subjects, but the size of the effect appears to be substantial.
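The size of this effect can be made concrete for the median volunteering rates reported for Table 3-2 (57% for more-interested versus 28% for less-interested persons). The text does not say which effect-size index would apply to these two rates, so the following is only a minimal sketch that assumes Cohen's h, the difference between arcsine-transformed proportions; the function name cohens_h is ours and purely illustrative.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between two arcsine-transformed proportions.

    Assumption: h is used here only as one conventional index for
    comparing two rates; the source text does not name an index.
    """
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie between 0 and 1")
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


# Median volunteering rates quoted in the text for the Table 3-2 studies.
more_interested = 0.57
less_interested = 0.28

h = cohens_h(more_interested, less_interested)
print(f"raw difference in rates: {more_interested - less_interested:.2f}")
print(f"Cohen's h (assumed index): {h:.2f}")
```

By Cohen's conventional benchmarks an h near .5 is usually described as a medium effect, so a value of this order would be in keeping with the description of the effect as substantial, although medians taken across heterogeneous studies should be read cautiously.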
Expectation of Favorable Evaluation
In this section we shall examine the evidence suggesting that subjects are more likely to volunteer when they have reason to believe that they will be evaluated favorably by the investigator. When a subject is invited to volunteer, he is asked to make a commitment of his time for the serious purposes of the investigator. The responses subjects will be asked to make during the course of their participation in the research will make the investigator wiser about the subject without making the subject wiser about the investigator. Within the context of the psychological experiment, Riecken (1962) has referred to this as the “one-sided distribution of information.” On the basis of this uneven distribution of information the subject or respondent is likely to feel an uneven distribution of legitimate negative evaluation. From the subject’s point of view, the investigator may judge him to be maladjusted, stupid, unemployed, lower class, or in possession of any one of a number of other negative characteristics. The possibility of being judged as any one of these might be sufficient to prevent someone from volunteering for either surveys or experiments. The subject, of course, can, and often does, negatively evaluate the data-collector. He can call the investigator, his task, or his questionnaire inept, stupid, banal, and irrelevant but hardly with any great feeling of confidence as regards the accuracy of this evaluation. After all, the data-collector has a plan for the use of his data, and the subject or respondent usually does not know this plan, although he is aware that a plan exists. He is, therefore, in a poor position to evaluate the data-collector’s performance, and he is likely to know it. 
Although few would deny the importance of other motivations of the subject in behavioral research, there is wide agreement among investigators that one of the important motives of the research subject is to “look good.” Riecken has suggested that one major aim of the research subject is to “put his best foot forward.” Rosenberg (1965, 1969) has shown the importance to an understanding of the social psychology of behavioral research of the concept of “evaluation apprehension.” Other investigators have further increased our understanding of this concept, broadly defined, through both their empirical efforts and their theoretical analyses (e.g., Adair, 1973; Argyris, 1968; Belt and Perryman, 1970; Kennedy and Cormier, 1971; Sigall, Aronson, and Van Hoose, 1970; Silverman and Shulman, 1970; Weber and Cook, 1972). While the bulk of the work in this area is of recent vintage, sophisticated theoretical attention was accorded this and related motives of the
research subject by Saul Rosenzweig (1933) before many of the above investigators were born.
If research participants are indeed so concerned with what the investigator will think of them, we would expect them to volunteer more when their expectation is greater that they are likely to be evaluated favorably by the investigator. The evidence suggests strongly that such is indeed the case. Table 3-3 lists 19 studies showing greater volunteering when subjects are in a position to say more favorable things about themselves (or their children). The studies listed in the first column are those in which the favorable self-reports were about occupational or educational achievement. Those listed in the second column are those in which the favorable self-reports were about psychiatric or gender-related adjustment. Although not all the studies listed in Table 3-3 showed the differences to be significant statistically, all the studies showed greater volunteering by those who had more favorable things to say about themselves in terms of achievement or adjustment. Not only are the results remarkably consistent in direction, but they also suggest that the effect sizes may be quite substantial. For eight of the studies, the volunteering rates of those with more versus less favorable things to say about themselves were available. The median rates were 84% and 49%, respectively. Further support for the hypothesis under examination comes from three studies that could not be easily subsumed under the headings of Table 3-3. Mayer and Pratt (1966), in their survey of persons involved in automobile accidents, found higher response rates by those who had been passengers, rather than drivers, of automobiles involved in accidents. Presumably passengers could not be considered to be at fault, and nothing they reported could serve to implicate them, an advantage not shared by those who had been driving when the accident occurred. 
Wicker (1968a), in his survey of church members, found those more likely to respond who could report more favorably on their attendance at church. Finally, in what was perhaps the only investigation that specifically varied the independent variable of probability of favorable evaluation, Olsen (1968) found significantly greater volunteering when subjects were more likely to be favorably evaluated (36%) than when they were less likely to be favorably evaluated (15%).
Table 3–3 Studies of the Relationship Between Volunteering and Favorableness of Self-Reports About Achievement and Adjustment
Achievement: Barnette (1950a, b); Baur (1947–1948); Bradt (1955); Cope (1968); Eckland (1965) I; Eckland (1965) II; Edgerton, Britt, and Norman (1947); Gannon, Nothern, and Carroll (1971); Jones, Conrad, and Horn (1928); Kelley (1929); Kirchner and Mousley (1963); Rothney and Mooren (1952); Shuttleworth (1940); Toops (1926)
Adjustment: Anastasiow (1964); Ball (1930); Loney (1972); Milmoe (1973); Speer and Zold (1971)
Olsen’s study is especially important because in so many of the other studies summarized it is not possible clearly to differentiate the variable of expectation of favorable evaluation from the variables of subject interest, intelligence, education, and social class. Subjects may be more interested in things in which they have done well, and we are usually in no position to say whether it is their interest or their having done well that prompts them to respond by participating in our research. Better-educated, more-intelligent subjects and those classified as higher in social class are by common cultural definition in a position to say “better” things about themselves than those less well educated, less intelligent, or those classified as lower in social class. The very clear results showing higher rates of volunteering by those more interested in the research area, by those who are better educated or more intelligent, and by those classified as higher in social class may, in a sense, provide additional support for the hypothesis that volunteering rates increase with the increased expectation of favorable evaluation.
Our summary of the relationship between volunteering and expectation of favorable evaluation can be brief and unequivocal. Increased expectation of favorable evaluation by the investigator not only increases the probability of volunteering but it appears to increase that probability to a substantial degree.
Summary of Situational Determinants of Volunteering
In this chapter we have examined the evidence bearing on the relationship between volunteering and a variety of more-or-less situational variables. The evidence has not always been as plentiful or as direct as it was in the case of the relationship between volunteering and more-or-less stable characteristics of the potential participant in behavioral research. Nevertheless, there appears to be sufficient evidence to permit a summary of the present stage of our knowledge. Table 3-4 lists the situational determinants of volunteering by the degree of confidence we can have that each is indeed associated with volunteering. Four groups of determinants are discriminable, and within each group the determinants are listed in approximately descending order of the degree of confidence we can have in the relationship between volunteering and the listed determinant. The definition of degree of confidence was based both on the number of studies relevant to the relationship under consideration and on the proportion of the relevant studies whose results supported a directional hypothesis. To qualify for “maximum confidence” a relationship had to be based on at least 20 studies, and at least 6 out of 7 studies had to be in support of the relationship. To qualify for “considerable confidence” a relationship had to be based on at least 10 studies, and at least two-thirds had to be in support of the relationship. To qualify for “some confidence” a relationship had to be based either on 3 studies all of which were in support of the relationship or on 9 studies most of which were in support of the relationship with none showing a significant reversal of the relationship. 
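The qualifying rules just stated can be read as a small decision procedure. The sketch below is only one possible reading: it assumes that "3 studies" and "9 studies" mean at least that many, that "most" means more than half, and that any relationship failing all three rules falls into the lowest category. The function name confidence_level and the example tallies are hypothetical and are not taken from the studies reviewed.

```python
def confidence_level(total: int, supporting: int, significant_reversals: int = 0) -> str:
    """Classify a relationship by the study-count rules described in the text.

    Assumptions (not stated in the source): the thresholds of 3 and 9
    studies are read as minimums, and "most" is read as more than half.
    """
    if total < 0 or not 0 <= supporting <= total:
        raise ValueError("supporting must be between 0 and total")

    # Maximum: at least 20 studies, at least 6 out of 7 in support.
    if total >= 20 and supporting * 7 >= total * 6:
        return "maximum"
    # Considerable: at least 10 studies, at least two-thirds in support.
    if total >= 10 and supporting * 3 >= total * 2:
        return "considerable"
    # Some: 3 studies all in support, or 9 studies mostly in support
    # with no significant reversal.
    all_support = total >= 3 and supporting == total
    mostly_support = (
        total >= 9 and supporting * 2 > total and significant_reversals == 0
    )
    if all_support or mostly_support:
        return "some"
    # Anything else fails the minimum standards.
    return "little"


# Hypothetical tallies, for illustration only: (total, supporting, reversals).
examples = {
    "relationship A": (24, 22, 0),
    "relationship B": (12, 9, 0),
    "relationship C": (9, 6, 0),
    "relationship D": (4, 2, 1),
}
for name, (total, supporting, reversals) in examples.items():
    print(name, "->", confidence_level(total, supporting, reversals))
```

Integer cross-multiplication is used for the 6-out-of-7 and two-thirds thresholds so that borderline tallies are not misclassified by floating-point rounding.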
Relationships not meeting these minimum standards are listed under the heading of ‘‘little confidence.’’ We conclude our summary with a listing of the conclusions that seem warranted by the evidence, taking into account the effects of various moderator variables where
Table 3–4 Situational Determinants of Volunteering Grouped by Degree of Confidence of Conclusion
Maximum confidence:
1. Subject interest
2. Expectation of favorable evaluation
Considerable confidence:
3. Task importance
4. Guilt, happiness, and competence
5. Material incentives
Some confidence:
6. Recruiter characteristics
7. Aversive tasks
8. Normative expectations
Minimum confidence:
9. Prior acquaintanceship
10. Public versus private commitment
these are suggested by the data. The order of our listing follows that shown in Table 3-4, beginning with the conclusions warranting maximum confidence and ending with the conclusions warranting minimum confidence. Within each of the four groups, the conclusions are also ranked in approximate order of the degree of confidence we can have in each.
Conclusions Warranting Maximum Confidence
1. Persons more interested in the topic under investigation are more likely to volunteer.
2. Persons with expectations of being more favorably evaluated by the investigator are more likely to volunteer.
Conclusions Warranting Considerable Confidence
3. Persons perceiving the investigation as more important are more likely to volunteer.
4. Persons’ feeling states at the time of the request for volunteers are likely to affect the probability of volunteering. Persons feeling guilty are more likely to volunteer, especially when contact with the unintended victim can be avoided and when the source of guilt is known to others. Persons made to ‘‘feel good’’ or to feel competent are also more likely to volunteer.
5. Persons offered greater material incentives are more likely to volunteer, especially if the incentives are offered as gifts in advance and without being contingent on the subject’s decision to volunteer. Stable personal characteristics of the potential volunteer may moderate the relationship between volunteering and material incentives.
Conclusions Warranting Some Confidence
6. Personal characteristics of the recruiter are likely to affect the subject’s probability of volunteering. Recruiters higher in status or prestige are likely to obtain higher rates of volunteering, as are female recruiters. This latter relationship is especially modifiable by the sex of the subject and the nature of the research.
7. Persons are less likely to volunteer for tasks that are more aversive in the sense of their being painful, stressful, or dangerous biologically or psychologically. Personal characteristics of the subject and level of incentive offered may moderate the relationship between volunteering and task aversiveness.
8. Persons are more likely to volunteer when volunteering is viewed as the normative, expected, appropriate thing to do.
Conclusions Warranting Minimum Confidence
9. Persons are more likely to volunteer when they are personally acquainted with the recruiter. The addition of a ‘‘personal touch’’ may also increase volunteering.
10. Conditions of public commitment may increase rates of volunteering when volunteering is normatively expected, but they may decrease rates of volunteering when nonvolunteering is normatively expected.
Suggestions for the Reduction of Volunteer Bias
Our review of the literature bearing on situational determinants of volunteering suggests fairly directly a number of steps that may prove to be useful in reducing the magnitude of volunteer bias. A list of recommendations, offered in a tentative spirit and subject to further empirical test, follows:
1. Make the appeal for volunteers as interesting as possible, keeping in mind the nature of the target population.
2. Make the appeal for volunteers as nonthreatening as possible so that potential volunteers will not be ‘‘put off’’ by unwarranted fears of unfavorable evaluation.
3. Explicitly state the theoretical and practical importance of the research for which volunteering is requested.
4. Explicitly state in what way the target population is particularly relevant to the research being conducted and the responsibility of potential volunteers to participate in research that has potential for benefiting others.
5. When possible, potential volunteers should be offered not only pay for participation but small courtesy gifts simply for taking time to consider whether they will want to participate.
6. Have the request for volunteering made by a person of status as high as possible and preferably by a woman.
7. When possible, avoid research tasks that may be psychologically or biologically stressful.
8. When possible, communicate the normative nature of the volunteering response.
9. After a target population has been defined, an effort should be made to have someone known to that population make the appeal for volunteers. The request for volunteers itself may be more successful if a personalized appeal is made.
10. In situations where volunteering is regarded by the target population as normative, conditions of public commitment to volunteer may be more successful; where nonvolunteering is regarded as normative, conditions of private commitment may be more successful.
A hasty reading of these recommendations gives the impression that they are designed only to increase rates of volunteering and thus to decrease volunteer bias. A more careful reading reveals that the recommendations may have other beneficial effects as well. They should make us more careful and thoughtful not only in how we make our appeals for volunteers but in our planning of the research itself. Our relations with our potential subjects may become somewhat more reciprocal and more human, and our procedures may become more humane. Finally, if we are to tell our subjects as much as possible about the significance of our research as though they were another granting agency, which in fact they are, granting us time instead of money, then we will have to give up trivial research.
4 Implications for the Interpretation of Research Findings
We concentrate now on the implications of the preceding discussions for the validity of inferred causal relationships in behavioral research and their generalizability beyond the particular circumstances in which they were demonstrated. Cook and Campbell (1974) have drawn a useful distinction between several threats to the tenability of inferred relationships in behavioral experimentation, and their typology encompasses this difference. The good experiment, they have observed, clearly establishes the temporal antecedence of the causal relationship; is strong enough to demonstrate that cause and effect covary; rules out alternative explanations and confounding variables; and is sufficiently representative to assure the robustness of the causal relationship. In our earlier discussions we examined studies that focused on voluntarism as a dependent variable, from which we could postulate some likely personality and demographic differences between willing and unwilling subjects as well as situational determinants of volunteering. Now we treat volunteer status as an independent variable and explore how such differences can in fact make a difference in the interpretation of research findings. We begin, however, with a consideration of the dilemma prompted by recent ethical concerns in psychology and their ramifications for the control of sampling errors in general.
An Ethical Dilemma
What is called error in behavioral experimentation will depend on the purposes of the research (Winer, 1968). Error may be an independent variable and the main object of study in one case, a randomly occurring source of imprecision in another, and a systematic source of bias in a third. Speculating on the life cycle of inquiries into the nature of nonrandom errors like volunteer bias, McGuire (1969a) described three stages he named ignorance, coping, and exploitation. Initially, researchers seem unaware of the variable producing systematic error and may deny its existence. Once it becomes certain that the variable exists, means of coping with it are sought and attention focuses on techniques for reducing its contaminating influence. In the
third stage, the variable is seen as a significant independent factor in its own right and not just an unintentional contaminant to be eliminated: ‘‘Hence, the variable which began by misleading the experimenter and then, as its existence became recognized, proceeded to terrorize and divert him from his main interest, ends up by provoking him to new empirical research and theoretical elaboration’’ (McGuire, 1969a, pp. 15–16). If this cycle has accelerated in the case of the volunteer variable, it may be the result of the sense of urgency attached to ethical concerns that have been voiced with increasing frequency of late. At the root of those concerns is a question about the personal responsibilities of scientists for assuring the moral acceptability of their research. Beecher (1970) and Kelman (1965, 1967, 1968, 1972), among others, have examined the issue in depth as it applies to medical and psychological experimentation, and the ramifications of certain proposed ethical guidelines for research with human subjects have been articulated in symposia as well as in the journals (cf. Sasson and Nelson, 1969). Lately, some of that discussion has shifted from a concern over ethical precautions to the conviction that there must be legal guarantees made that research subjects (including informed volunteers) be assured of compensation and financial protection against mental and physical risks (Havighurst, 1972; Katz, 1972). Impetus was lent to this important issue in 1954 as a result of Edgar Vinacke’s comments in the American Psychologist. Vinacke questioned the ethicality of ‘‘dissimulation’’ experiments, studies where ‘‘the psychologist conceals the true purpose and conditions of the experiment, or positively misinforms the subjects, or exposes them to painful, embarrassing, or worse, experiences, without the subjects’ knowledge of what is going on’’ (p. 155). 
While recognizing the methodological desirability of naivety in subjects, Vinacke asked whether it was not time to consider the ethical bounds of experimental deceptions: So far as I can tell, no one is particularly concerned about this. . . . In fact, one can note an element of facetiousness. It has reached the point where a man who reads a paper at the APA convention is almost embarrassed when he adds, with a laugh, that the subjects were given the ‘‘usual’’ post-session explanations. . . . What is at stake? Do subjects really feel happy about their experiences in some of these emotionally stressful experimental situations, even after a standardized attempt to reassure them? What possible effects can there be on their attitudes toward psychologists, leaving out entirely any other consequences? Beyond this, what sort of reputation does a laboratory which relies heavily on deceit have in the university and community where it operates? What, in short, is the proper balance between the interest of science and the thoughtful treatment of the persons who, innocently, supply the data? . . . Perhaps it is going too far to propose that an APA committee be appointed to look into the ethical precautions to be observed in human experimentation, but it is a possibility.
The Zeitgeist may not have been propitious for such an inquiry in 1954, but the time was certainly ripe a decade and a half later when a committee of psychologists was formed by the American Psychological Association (APA) to draft a set of ethical guidelines for research with human subjects (Cook, Kimble, Hicks, McGuire, Schoggen, and Smith, 1971, 1972). The final version, comprised of the following ten principles adopted by the APA Council, was published in January 1973, in the APA journal American Psychologist: 1. In planning a study the investigator has the personal responsibility to make a careful evaluation of its ethical acceptability, taking into account these Principles for research with human beings. To the extent that this appraisal, weighing scientific and
humane values, suggests a deviation from any Principle, the investigator incurs an increasingly serious obligation to seek ethical advice and to observe more stringent safeguards to protect the rights of the human research participant. 2. Responsibility for the establishment and maintenance of acceptable ethical practice in research always remains with the individual investigator. The investigator is also responsible for the ethical treatment of research participants by collaborators, assistants, students and employees, all of whom, however, incur parallel obligations. 3. Ethical practice requires the investigator to inform the participant of all features of the research that reasonably might be expected to influence willingness to participate, and to explain all other aspects of the research about which the participant inquires. Failure to make full disclosure increases the investigator’s responsibility to maintain confidentiality, and to protect the welfare and dignity of the research participant. 4. Openness and honesty are essential characteristics of the relationship between investigator and research participant. When the methodological requirements of a study necessitate concealment or deception, the investigator is required to ensure the participant’s understanding of the reasons for this action and to restore the quality of the relationship with the investigator. 5. Ethical research practice requires the investigator to respect the individual’s freedom to decline to participate in research or to discontinue participation at any time. The obligation to protect this freedom requires special vigilance when the investigator is in a position of power over the participant. The decision to limit this freedom increases the investigator’s responsibility to protect the participant’s dignity and welfare. 6.
Ethically acceptable research begins with the establishment of a clear and fair agreement between the investigator and the research participant that clarifies the responsibilities of each. The investigator has the obligation to honor all promises and commitments included in that agreement. 7. The ethical investigator protects participants from physical and mental discomfort, harm and danger. If the risk of such consequences exists, the investigator is required to inform the participant of that fact, to secure consent before proceeding, and to take all possible measures to minimize distress. A research procedure may not be used if it is likely to cause serious and lasting harm to participants. 8. After the data are collected, ethical practice requires the investigator to provide the participant with a full clarification of the nature of the study and to remove any misconceptions that may have arisen. Where scientific or humane values justify delaying or withholding information, the investigator acquires a special responsibility to assure that there are no damaging consequences for the participant. 9. Where research procedures may result in undesirable consequences for the participant, the investigator has the responsibility to detect and remove or correct these consequences, including, where relevant, long-term aftereffects. 10. Information obtained about the research participants during the course of an investigation is confidential. When the possibility exists that others may obtain access to such information, ethical research practice requires that this possibility, together with the plans for protecting confidentiality, be explained to the participants as a part of the procedure for obtaining informed consent.
Of particular interest from a methodological standpoint are those four principles (3 through 6) advocating informed, volitional consent. Apart from questions about (1) the merits, real or apparent, of such recommendations (Baumrind, 1964, 1971, 1972; Beckman and Bishop, 1970; Gergen, 1973; Kerlinger, 1972; May, 1972; Seeman, 1969), (2) whether or not they could be legislated and the legislation effectively enforced (Alumbaugh, 1972; Pellegrini, 1972; Smith, 1973), and (3) who should be delegated the role of ethical ombudsman (Adams, 1973), there is also scientific interest in whether compliance with the letter of the law could accidentally introduce an element of error that might jeopardize the tenability of inferred causal relationships. To reveal to a subject the exact substance of the research in which he is participating might distort his reaction and thus ultimately limit the applicability of the findings. A study by Resnick and Schwartz (1973) is empirically illustrative of the nature of the problem. The experiment centered on a simple, widely used verbal-conditioning task developed by Taffel (1955) as a method for repeating the operant conditioning effects demonstrated by Greenspoon (1951) and Ball (1952) in a more controlled fashion. All the subjects in Resnick and Schwartz’s study were volunteers half of whom had been forewarned as to the exact nature of the Taffel procedure following provisional APA guidelines, and the rest of whom had not. The subject and the experimenter were seated at opposite sides of a table, and the experimenter passed a series of 3-by-5-inch cards to the subject one at a time. Printed on each card was a different verb and six pronouns (I, WE, YOU, THEY, SHE, HE); the subject was told to construct a sentence containing the verb and any of the six pronouns. On the first 20 (operant) trials the experimenter remained silent. 
On the following 80 trials he reinforced the subject with verbal approval (‘‘Good’’ or ‘‘Mmm-hmmm’’ or ‘‘Okay’’) every time a sentence began with I or WE. The difference in results between the two groups is shown in Fig. 4-1. The uninformed subjects conditioned as had been expected, but those who had been fully informed along APA standards showed a reversal in the conditioning rate. (Every set of differences between groups was significant beyond the .01 level for blocks 1–4, and the groups were not different
Figure 4–1 Mean number of I–WE sentences constructed by ethically informed and uninformed subjects (after Resnick and Schwartz, 1973). [Line graph. Vertical axis: mean number of I–WE sentences, 0 to 15. Horizontal axis: Operant, Block 1, Block 2, Block 3, Block 4. Curves: uninformed volunteers; fully informed volunteers.]
at the operant level; a significant trend over trials was reported for both the informed and uninformed groups.) One plausible explanation for the difference was that subjects in the informed group may have become suspicious about the ‘‘real’’ demands of the study when its purpose was spelled out for them so blatantly, and feeling their freedom of response restricted by that awareness, their frustrations may have led them to respond counter to the experimenter’s expressed intent. Does ethically forewarning subjects trigger paranoid ideation in otherwise trusting individuals? Resnick and Schwartz mentioned that several subjects in the informed group said they felt they were involved in an elaborate double-reverse manipulation; none of the uninformed subjects expressed this belief. One wonders how different our current laws of verbal learning might be if all the earlier studies in this area had been carried out under the conditions of informed consent that Resnick and Schwartz inferred from the APA ethical code. These research findings emphasize the complexity of the ethical dilemma. Fully informed voluntarism, at the same time that it may satisfactorily cope with one ethical concern, could be contraindicated for another. Pity the poor experimenter confronted with a conflict of values in which he must weigh the social ethic against the scientific ethic and decide which violation would constitute the greater immorality. Should the current social temper of the times persist, an already complicated issue may become increasingly more complex in the future. Given whatever strengths are added to the civil libertarian movement in psychology by a society also properly concerned with individual rights, could this unprecedented soul-searching ultimately lead to a behavioral science whose data were drawn from an elite subject corps of informed volunteers? Kelman (1972, p. 
1006) has talked about a ‘‘subjects’ union’’ and a ‘‘bill of rights’’ for research subjects, and Argyris (1968) mentioned that enterprising students at two universities have also started thinking in this direction with the notion of organizing a group along the lines of Manpower, but which, instead of secretaries, would offer informed volunteers to interested experimenters: They believe that they can get students to cooperate because they would promise them more money, better de-briefing, and more interest on the part of the researchers (e.g., more complete feedback). When this experience was reported to some psychologists their response was similar to the reactions of business men who have just been told for the first time that their employees were considering the creation of a union. There was some nervous laughter, a comment indicating surprise, then another comment to the effect that every organization has some troublemakers, and finally a prediction that such an activity would never succeed because ‘‘the students are too disjointed to unite’’. (p. 189)
However, perhaps we need not end on an overly pessimistic note. Is it possible that some of these fears by psychologists could be a projection of their own stringent ethical views? So the results of a study by Sullivan and Deiker (1973) would suggest. These investigators compared the questionnaire responses of a sample of American psychologists with those of undergraduate students at Louisiana State University and Wheaton College to assess their perceptions of ethical issues in human research. The respondents were each presented with a hypothetical experiment from among a group of experiments characteristically varying in stress, physical pain, or the threat to a subject’s self-esteem, and the questions to the respondents focused on ethical issues. For example, the students were asked if they would have volunteered for the experiment had they known its exact nature; they were also asked whether the
deception involved seemed to them unethical. The psychologists were also questioned on the propriety of using deception, and they had to say whether subjects showing up for the experiment would have volunteered otherwise. Surprising certainly to most psychologists at least, the students’ answers revealed a high percentage of volunteering even after aversive aspects of the studies became known. And, in general, there was a discernible difference in the reported perceptions of the professionals and the students that showed the psychologists as having the more stringent ethical views. If ever a decision need be made about who should regulate what happens in our experiments, one favoring the professional psychologist should produce the more conservative watchdogs according to these data. (Both groups, incidentally, most often identified as unethical the offering of course grades as a pressure to get more students to participate as research subjects.)
The Threat to Robustness
We began this book by reiterating McNemar’s familiar assessment of psychology as ‘‘the science of the behavior of sophomores’’ and emphasized his concern, voiced more than a quarter of a century ago, that college students were being used so exclusively by research psychologists as to seriously restrict the robustness of experimental findings in psychological research. In light of the idea of unionizing research subjects and organizing informed volunteers for interested experimenters, McNemar’s observation may now seem too conservative an assessment. A science of informed and organized volunteer sophomores may be on the horizon. It will be recalled from Chapter 1 that, as McNemar ascertained, a very large chunk of the research on normal adults may be restricted to the reactions of college undergraduates (see Table 1-1). Since no one can be certain of all the significant ways in which college students as research subjects may differ from the rest of the population, one can only speculate on the dangers of conveniently tapping the most accessible universe (Higbee and Wells, 1972; White and Duker, 1971). Smart warned:
Such students are probably at the peak of their learning and intellectual abilities and this could mean that many findings in learning, especially verbal learning, could be special to the college student with limited applicability to other groups. Some might argue that only the speed of learning would be different in the college population, and that the general principles of learning would be the same in any group. However, the college student is selected for verbal learning ability and we have little evidence that this is a trivial consideration. We have little indication that the public school graduate or high school drop-out learns according to the same principles, because the question has never been investigated. (1966, pp. 119–120)
The threat to generalizability from using volunteer subjects can be seen as a specific case of sampling bias. To the extent that a pool of volunteers was different from the population at large, the resulting positive or negative bias could lead to an overestimate or underestimate of population parameters. See, for example, Fig. 4-2, which depicts roughly the positive bias that might result from using other than random sampling to estimate IQ parameters. We noted that volunteers scored higher on intelligence tests than nonvolunteers. If one relies entirely on volunteer subjects to standardize norms for an IQ test, it follows that the estimated population mean could
Figure 4–2 The curve symbolized as Y represents a theoretical normal distribution of IQs in the general population, and the curve labeled X represents a theoretical normal distribution of IQs among volunteers. To the extent that the mean of X is different from the mean of Y, the resultant bias constitutes a threat to the generalizability of the data. [Figure labels: bias of X = Y – X; true value (Y); estimated value (X).]
be artificially inflated by this procedure. Merely increasing the size of the sample would not reduce such bias, although improving the representativeness of sampling procedures certainly should (Kish, 1965). Another illustration of this type of sampling bias was described by Maslow and Sakoda (1952), who hypothesized that Kinsey’s survey research findings on human sexual behavior were distorted by the interviewees’ volunteer status. For example, Diamant (1970) has shown that estimates of premarital sexual behavior can be inflated when only volunteer subjects are interviewed. He found college-age male volunteers more apt to report having experienced sexual intercourse than male nonvolunteers and also to be significantly more permissive than the nonvolunteers in their attitudes about sex. Similar results were reported by Kaats and Davis (1971) and by Siegman (1956). Kinsey and his associates conducted a series of intensive interviews of about 8,000 American men and 12,000 American women in order to uncover the predominant sexual customs in the United States (Kinsey, Pomeroy, and Martin, 1948; Kinsey, Pomeroy, Martin, and Gebhard, 1953). Their fascinating findings became a source of intense discussion and controversy when the question of sampling representativeness was raised by critics (Cochran, Mosteller, and Tukey, 1953; Dollard, 1953; Hyman and Sheatsley, 1954). Might Kinsey’s interviewees, by virtue of their willingness to cooperate in the research, have shared other characteristics that distinguished them from the rest of the population and thereby restricted the generality of the data? Maslow and Sakoda explored this idea empirically in a study they designed with Kinsey. It was arranged for Kinsey to set up an office near the Brooklyn College campus and for Maslow to make an appeal for volunteers in his psychology classes. 
This made it possible to compare the volunteers and nonvolunteers and see how they differed in ways that could affect the generalizability of Kinsey’s data. Earlier, Maslow (1942) had discovered that people who are high in self-esteem often have relatively unconventional sexual attitudes and behavior, and he and Sakoda now observed that students who volunteered for the Kinsey interview tended to be higher in self-esteem than nonvolunteers. Based on this correlational evidence, Maslow and Sakoda concluded that there may have been an overestimation
of population means in Kinsey’s original data because his subjects had all been willing participants. Of course, it is also possible to conceive of situations where population means were underestimated because volunteer subjects were used. Since volunteers are generally less authoritarian than nonvolunteers, standardizing a test that correlated positively with authoritarianism on volunteer subjects should yield underestimates of the true population mean. The important point is that insofar as any of the variables discussed in Chapter 2 (see again Tables 2-41 and 2-42) may be conceived as a threat, directly or indirectly, to generalizability when the subjects are willing participants, there might automatically result estimates of population parameters that were seriously in error. In the same way, we might expect that routinely sampling just volunteer subjects in behavioral research should jeopardize the robustness of any research conclusions if the subjects’ educational background, social class, intelligence, need for social approval, sociability, and so on, could be indicated as pertinent differentiating characteristics related to the estimation of parametric values for the research problem in question. In studies involving any sort of stress, the subjects’ sex, arousal-seeking inclinations, and anxiety could be crucial differentiating characteristics affecting the estimation of parameters. In clinical research, nonconformity may be suspect; in medical research, the subjects’ psychological adjustment; in laboratory behavioral experimentation, their age; and so on.
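The sampling-bias argument of this section, in particular the point that enlarging a volunteer sample does not remove the bias, can be illustrated with a small simulation. This is only a sketch under loudly stated assumptions that do not come from the studies reviewed here: a hypothetical IQ scale with mean 100 and standard deviation 15, and an invented logistic rule under which higher-scoring persons are more likely to volunteer. All function names are hypothetical.

```python
import math
import random
import statistics

POP_MEAN, POP_SD = 100.0, 15.0  # assumed IQ scale, for illustration only


def volunteer_probability(iq):
    # Assumed selection rule: the chance of volunteering rises with IQ.
    return 1.0 / (1.0 + math.exp(-(iq - POP_MEAN) / POP_SD))


def random_sample(n, rng):
    # Every member of the population is equally likely to be drawn.
    return [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]


def volunteer_sample(n, rng):
    # Keep drawing persons until n of them have agreed to take part.
    sample = []
    while len(sample) < n:
        iq = rng.gauss(POP_MEAN, POP_SD)
        if rng.random() < volunteer_probability(iq):
            sample.append(iq)
    return sample


def bias(sample):
    # Estimated mean minus the true population mean.
    return statistics.fmean(sample) - POP_MEAN


rng = random.Random(0)  # fixed seed so the run is repeatable
results = {}
for n in (100, 2000, 20000):
    results[n] = (bias(random_sample(n, rng)), bias(volunteer_sample(n, rng)))
    print(f"n={n:>6}  random bias={results[n][0]:+.2f}  "
          f"volunteer bias={results[n][1]:+.2f}")
```

Under these assumptions the bias of the random sample shrinks toward zero as n grows, whereas the bias of the volunteer sample settles at a positive value of several IQ points. Reversing the sign inside the assumed selection rule would produce the corresponding underestimate.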
Motivational Differences and Representativeness
There is another side to this problem, which has to do with the naturalistic motivations of human beings. Some scaling research to be discussed in the next chapter suggests that nonvolunteer types of subjects may amplify the distinction between work and nonwork activities and judge research participation as a work-oriented experience. This may help to account for why such subjects are ordinarily so unenthusiastic about participating in psychology experiments when they are unpaid captives, and it might suggest that nonvolunteers would also be less cooperative in real-life work situations where they were forced to participate without personal financial benefits. From this slightly different perspective, volunteer status might be seen as merely another organismic variable, like all the other myriad variables that affect human behavior. Indeed, insofar as volunteer status as an organismic variable was related to the dependent variable of an investigation, the study of voluntarism could be the raison d’être for the research (cf. Sommer, 1968; Sarason and Smith, 1971). Illustrative of this point was an experiment by Horowitz (1969) on the effects of fear-arousal on attitude change. A fear-appeal is a persuasive communication that arouses emotional tensions by threatening the recipient in some way and then providing a conclusion reassuring him of relief should he change his attitude in the recommended direction. It is a technique commonly used by political propagandists. For example, in the 1972 American presidential campaign, which saw Richard M. Nixon running against George McGovern, one anti-McGovern television spot showed a hand sweeping away toy soldiers and miniature ships and planes while a voice announced, ‘‘The McGovern defense plan—he would cut the Marines by one-third; he would cut Air Force personnel by one-third and interceptor planes by one-half; he’d cut Navy personnel
by one-fourth, the Navy fleet by half and carriers from 16 to 6.’’ The announcer then recalled a statement about McGovern made by Sen. Hubert H. Humphrey during the presidential primaries, when the two Democrats were fiercely competing for their party’s nomination, ‘‘It isn’t just cutting into the fat, it isn’t just cutting into manpower, it’s cutting into the very security of this country.’’ Finally, the spot shifted to President Nixon aboard a Navy ship, visiting troops in Vietnam. With ‘‘Hail to the Chief’’ playing in the background, the message concluded, ‘‘President Nixon doesn’t believe we should play games with our national security; he believes in a strong America to negotiate for peace from strength; that’s why we need him now more than ever.’’ An earlier application of essentially the same political stratagem was a Democratic-sponsored TV spot in the 1964 presidential campaign that showed a little girl plucking petals from a daisy while a voice was heard gloomily counting down to zero. As the last petal disappeared, the screen was filled with a frightening atomic explosion and President Lyndon Johnson’s voice was heard saying, ‘‘These are the stakes . . . to make a world in which all of God’s children can live, or go into the dark . . . we must either love each other, or we must die.’’ The message concluded with the announcer urging viewers to vote for Johnson: ‘‘The stakes are too high for you to stay home.’’ In both cases, the idea was to evoke a feeling of anxiety by associating the opponent’s view—Sen. George McGovern or Sen. Barry Goldwater—with some highly threatening event whose probability of occurrence would purportedly increase were he elected to the presidency. Emotional relief was provided by the reassuring recommendation at the end, that voting for the incumbent president was the surest way to avoid disaster. 
In communications research, a question of long standing has been the effect of inducing emotional tensions by persuasion on attitude change pursuant to the fear-arousing message (cf. Rosnow and Robinson, 1967, p. 147ff.). Is attitude change directly or inversely proportional to the amount of anxiety aroused by a fear-appeal? What would be the effect of a persuasive communication inducing greater (or lesser) emotional tension? Would compliance be greater if the threat to personal security were more strongly emphasized? Most present-day research on these questions grew out of an experiment by Janis and Feshbach (1953) using three different intensities of fear in a communication on dental hygiene. Their results showed that while all three forms were equally effective in teaching factual content, the greatest compliance was to the position advocated in the least threatening communication. Presumably, communications that elicit a great deal of fear or anxiety may provoke a defensive reaction that interferes with acceptance of the message; hence, the greater the threat, the less attitude change in the recommended direction. While several studies have supported the Janis–Feshbach relationship (DeWolfe and Governale, 1964; Haefner, 1956; Janis and Terwilliger, 1962; Kegeles, 1963; Nunnally and Bobren, 1959), others have suggested the opposite relationship: the more fear, the more attitude compliance (Berkowitz and Cottingham, 1960; Insko, Arkoff, and Insko, 1965; Niles, 1964; Leventhal, Singer, and Jones, 1965; Leventhal and Niles, 1964), and there have been various alternative proposals for adjudicating this difference in outcomes (Janis, 1967; Leventhal, 1970; McGuire, 1966, 1968a, b, 1969b). In examining this literature closely, Horowitz (1969) noticed that experiments that used volunteers tended to produce the positive relationship between fear-arousal and attitude change, whereas experiments using captive subjects tended toward the inverse relationship.
Reasoning that volunteers and nonvolunteers may be
differentially disposed to felt emotional-persuasive demands, Horowitz then studied the difference in persuasibility of willing and unwilling subjects by randomly assigning them to two groups, in one of which there was a high level of fear aroused and in the other of which there was a low level of fear aroused. (There was another aspect to the experiment; it dealt with the number of exposures to the materials, but it is not relevant to the present discussion.) The high-fear group read pamphlets on the abuse and effects of drugs and watched two Public Health Service films on the hazards of LSD and other hallucinogens and the dangerous effects of amphetamines and barbiturates. The low-fear group did not see the films; they read pamphlets on the hazards of drug abuse, but the vivid verbal descriptions of death and disability were omitted. To measure the subjects’ attitude changes, they were all administered a postexperimental questionnaire asking them to respond anonymously to 10-point scales corresponding to recommendations contained in the persuasive appeals. Also, to provide a check on the different levels of emotional arousal presumed to be operating in the two groups, the subjects responded to another scale ranging from ‘‘It did not affect me at all’’ to ‘‘It made me greatly concerned and upset.’’ The check on the manipulation of fear-arousal revealed that the treatments were successful (p < .01) and that the volunteers were significantly more affected by the arousal manipulations than were the nonvolunteers (effect size = .74). More important, though, the attitude-change data clearly indicated that voluntarism was a crucial organismic variable for assessing the generality of the fear-arousal relationship. Volunteers exhibited greater attitude compliance than nonvolunteers (p < .01, effect size = 1.01), and as shown in Fig. 4-3, the predicted positive relationship between fear-arousal and influenceability was obtained for volunteers (effect size = .91) and the expected inverse relationship for nonvolunteers (effect size = .47; p < .01). The illuminating methodological contribution of this study is its empirical emphasis on the fact that the validity of behavioral data must always be interpreted within the operating motivational context (cf. Adair, 1971; Oakes, 1972).
Figure 4–3 Mean attitude changes of volunteers and nonvolunteers in response to low- and high-fear appeals. The higher the attitude-change score, the more congruent was the postexperimental attitude with the position advocated in the fear-appeals (after Horowitz, 1969). [Plot: mean attitude change, scaled 0 to 8, on the vertical axis; low fear and high fear on the horizontal axis; separate lines for volunteers and nonvolunteers.]
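The crossover pattern described for the Horowitz study can be made concrete with a little arithmetic. The following is only a minimal sketch in Python: the four cell means are invented for illustration and are not the values Horowitz reported, and the function names (`simple_effect`, `pooled_effect`, `interaction_contrast`) are hypothetical. It shows how a positive fear effect among volunteers and a negative one among nonvolunteers combine when volunteer status is ignored.

```python
# Hypothetical cell means on an attitude-change scale. These numbers are
# invented for illustration and are NOT the values reported by Horowitz (1969).
CELL_MEANS = {
    ("volunteer", "low_fear"): 4.0,
    ("volunteer", "high_fear"): 7.0,
    ("nonvolunteer", "low_fear"): 4.0,
    ("nonvolunteer", "high_fear"): 2.0,
}


def simple_effect(status: str) -> float:
    """High-fear mean minus low-fear mean within one volunteer-status group."""
    return CELL_MEANS[(status, "high_fear")] - CELL_MEANS[(status, "low_fear")]


def pooled_effect(share_volunteers: float) -> float:
    """Fear effect observed when both groups are mixed and status is ignored.

    Assumes the same share of volunteers in both fear conditions, so the
    pooled effect is a weighted average of the two simple effects.
    """
    if not 0.0 <= share_volunteers <= 1.0:
        raise ValueError("share_volunteers must lie between 0 and 1")
    return (share_volunteers * simple_effect("volunteer")
            + (1.0 - share_volunteers) * simple_effect("nonvolunteer"))


def interaction_contrast() -> float:
    """Difference between the two simple effects (status-by-fear interaction)."""
    return simple_effect("volunteer") - simple_effect("nonvolunteer")


if __name__ == "__main__":
    print("fear effect, volunteers:", simple_effect("volunteer"))
    print("fear effect, nonvolunteers:", simple_effect("nonvolunteer"))
    print("interaction contrast:", interaction_contrast())
    for share in (0.0, 0.25, 0.5, 0.75, 1.0):
        print(f"share of volunteers = {share:.2f}: "
              f"pooled fear effect = {pooled_effect(share):+.2f}")
```

Under these assumed means, a sample made up entirely of volunteers shows a positive fear effect, a fully captive sample shows a negative one, and a mixed sample falls somewhere in between, so the sign of the apparent relationship depends on how the subjects were recruited.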
The Threat to the Validity of Inferred Causality
Voluntarism as an organismic variable complicates interpretation of the generalizability of research data; the threat it poses to the validity of inferred causal relationships is no less serious. For example, if volunteer status correlated with the dependent variable of a study, the reduction in individual variation of subjects on the criterion variable owing to the increased homogeneity of the experimental and control groups could result in nonrejection of the null hypothesis when it was actually false. Suppose we wanted to assess the validity of a new educational procedure that was purported to make people less rigid in their thinking. A simple way of testing this hypothesis would be to compare volunteers who were randomly assigned to a control group with others who were assigned to an experimental group that was administered the new procedure. However, people who are low in authoritarianism (as volunteers apparently are) may also be less rigid in their thinking (Adorno, Frenkel-Brunswik, Levinson, and Sanford, 1950), and it therefore follows that our control subjects could already be unusually low on the dependent variable. The effect in this case, which would be to minimize the true difference between the experimental and control groups, would increase the likelihood of a type-II error. It is also possible to imagine volunteer situations in which one was led to make the opposite type of error. Suppose someone wanted to find out how persuasive a propaganda appeal was before using it in the field. Again, a simple procedure would be to expose volunteers to the appeal and compare their reactions with a control group of unexposed subjects. If the magnitude of opinion changes in the exposed group significantly exceeded changes occurring in the control, one would probably conclude that the propaganda was an effective persuasive device.
However, we noted before that volunteers tend to be higher in the need for approval than nonvolunteers, and people who score high in approval-need apparently are more readily influenced than low-scorers (Buckhout, 1965; Crowne and Marlowe, 1964). Hence, it follows that the volunteers who were exposed to the propaganda could have overreacted to it and that the comparison of their change scores with those in the control group may have exaggerated the true impact of the propaganda. There is a more subtle side to the problem of the validity of research conclusions that these Gedanken experiments allude to only very indirectly; it concerns errors systematically occurring in behavioral research because of its social nature.
Demand Characteristics
The results of experimental research can be attributed to controlled and uncontrolled sources. Controlled sources are conditions that are experimentally imposed or manipulated; uncontrolled sources are conditions left untreated and conceptualized as random errors when their own antecedents cannot be specified and as artifacts, or systematic errors—the two terms are interchangeable—when their antecedents are known (McGuire, 1969a; Rosnow, 1971). Complexities of human behavior that can be traced to the social nature of the behavioral research process can be seen as a set of
artifacts to be isolated, measured, considered, and sometimes reduced or eliminated. Such complexities stem from the fact that subjects are usually aware that they are being studied and that their role is to be played out in interaction with another human being, the investigator. An early discussion of the general threat to validity resulting from this special role relationship was Rosenzweig’s 1933 paper in Psychological Review on the psychology of the experimental situation. In this perceptive account, Rosenzweig considered several ways in which psychological experimentation had more inherent complexities than research in the natural sciences, complexities resulting from the fact that ‘‘one is obliged to study psychological phenomena in an intact conscious organism that is part and parcel of a social environment’’ (p. 337). For instance, one peculiarity of psychological experimentation is that the ‘‘thing’’ being studied, the research subject, has motives, and they may propel him in directions that could threaten the validity of research conclusions. When self-report measures are used, as is typically the case in attitude research, there could be a disruptive interaction of the subject’s observational set and the experience he was reporting about. ‘‘It is the interdependence of the experimentally imposed conditions and the contribution of the individual personality which must, in the last analysis, serve as the basis for interpreting experimental results,’’ Rosenzweig (1952, p. 344) advised in a subsequent paper on the artifact problem. Several of Rosenzweig’s ideas have appeared in the important research and writings of other, more contemporary ‘‘artifactologists’’ (cf. Silverman and Shulman, 1970). Some of the most impressive research in this area is that of Martin Orne (1969, 1970) on the biasing effects of subjects’ compliance with the demand characteristics of the experimental situation. 
Orne’s contention is that a principal motive in this situation is cooperation and that research subjects, and perhaps volunteers especially, play out their experimental roles guided by an altruistic wish to help science and human welfare in general. The term demand characteristic, which derived from Kurt Lewin’s (1929) concept of Aufforderungscharakter (Orne, 1970), refers to the totality of task-orienting cues that govern subjects’ hypotheses about role expectations. According to this view, a person who agrees to be a subject in an experiment also implicitly agrees to cooperate with a wide range of actions without questioning their purpose or duration, and almost any request should automatically be justified by the fact that it was ‘‘part of an experiment’’ (Orne, 1962a, p. 777). To illustrate, Orne (1970) explored the idea that subjects in experimental hypnosis will behave in whatever ways they think are characteristic of hypnotized individuals. To test this hypothesis, he concocted a novel characteristic of hypnotic behavior—‘‘catalepsy of the dominant hand.’’ This experimental demand was then demonstrated to a large college class in a lecture on hypnosis using three ‘‘volunteers’’ who had been previously hypnotized and given a posthypnotic suggestion that the next time they entered hypnosis they would exhibit a waxen flexibility of their dominant hand. In another class, the same lecture on hypnosis was given except that there was no mention of catalepsy. One month later, students from both classes were invited to participate as research subjects in an experiment. When they came to the laboratory and were hypnotized by the experimenter, catalepsy of the dominant hand was exhibited by almost all the subjects who had attended the lecture, suggesting that catalepsy was characteristic of the hypnotized state. None of the subjects who had attended the control lecture showed this phenomenon. 
Other research has shown similar effects of compliance with demand characteristics. Silverman (1968) randomly assigned undergraduates to four groups that read a
250-word argument in favor of using closed-circuit television tapes to give lectures to large classes. He found that the students showed more persuasibility when they were told that they were subjects in an experiment than if this demand was not explicitly conveyed to them and that they complied more when they had to sign their opinions than when they were tested anonymously. Page (1968) manipulated demand characteristics in a figure–ground perception experiment and found that subjects, particularly those having some elementary knowledge of psychology, were apt to respond in the ways they thought the experimenter anticipated they should behave. Other studies have probed the contaminating influence of demand characteristics in such varied contexts as prisoners’ dilemma games (Alexander and Weil, 1969), attitude change (Kauffmann, 1971; Page, 1970; Rosnow, 1968; Sherman, 1967; Silverman and Shulman, 1969), verbal operant conditioning (Adair, 1970b; Levy, 1967; Page, 1972; White and Schumsky, 1972) and classical conditioning (Page, 1969; Page and Lumia, 1968), the autokinetic effect (Alexander, Zucker, and Brody, 1970; Bruehl and Solar, 1970), hypnosis and sensory deprivation (Coe, 1966; Jackson and Kelley, 1962; Orne, 1959, 1969; Orne and Scheibe, 1964; Raffetto, 1968), perceptual defense (Sarbin and Chun, 1964), psychophysics (Juhasz and Sarbin, 1966), taking tests (Kroger, 1967), comparative psychotherapeutic evaluations (McReynolds and Tori, 1972), autonomic activity (Gustafson and Orne, 1965; Orne, 1969), and small-group research on leadership and conformity (Allen, 1966; Bragg, 1966; Geller and Endler, 1973; Glinski, Glinski, and Slatin, 1970; Leik, 1965). This collection of studies emphasizes the ubiquitous threat to the tenability of research conclusions stemming from the special role relationship between subject and experimenter in behavioral research.
The Role of Research Subject
The conception of research subjects’ performances as role behaviors is an idea that originated in the sociological contention that human performance is affected by the social prescriptions and behavior of others. The theory of role, which sprang from the dramaturgical analogy, argues that social behavior can be conceived as a response to demands associated with specific propriety norms and that individual variations in performance can be expressed within the framework created by these factors (Thomas and Biddle, 1966); that is, people occupy different positions—the position of experimenter or research subject, for example—and situational regularities in their behavior are a function, in part, of the prescribed covert norms and overt demands associated with such positions and of certain mediating factors (e.g., their original willingness to accept the position) that could make them more-or-less compliant with role expectations (cf. Alexander and Knight, 1971; Biddle and Thomas, 1966; Jackson, 1972; Sarbin and Allen, 1968). If we may proceed on the assumption that within the context of our Western culture the transitional role of research subject in a psychology experiment is now well understood by most normal adults who find their way into our experiments (cf. Epstein, Suedfeld, and Silverstein, 1973) and that the stereotypic role expectation associated with this position is that of an alert and cooperative individual (or the ‘‘good subject’’), then it is possible to recast these ideas in a social-influence mold and provide a simple model to depict the role behavior of research subjects in response to demand characteristics. Some recently gathered evidence for this assumption also implies that the good-subject stereotype may be well ingrained. With the assistance of Susan Anthony and Marianne Jaeger, the question, ‘‘How do you think the typical human subject is expected to behave in a psychology experiment?’’ was directed to 374 experimentally naïve Pennsylvania high school boys and girls, who responded by circling any characteristics they wished from a list of positive and negative ones (Aiken and Rosnow, 1973). There was a high degree of similarity in the male and female rankings (r = .90, p < .0001), and the collective results (which were alluded to in Chapter 2 in connection with the differentiating characteristics of self-disclosure and altruism) are summarized in Table 4-1. Although it was revealed that none of the students had ever participated as a subject in a psychology study or could recall ever having been asked to do so, the image of this role projected in their minds was congruent with the assumption above. ‘‘Cooperative’’ and ‘‘alert’’ were the most frequently circled characteristics, and the only others noted by the majority of students—‘‘observant,’’ ‘‘good-tempered,’’ ‘‘frank,’’ ‘‘helpful,’’ ‘‘trustful,’’ ‘‘logical’’—also reinforce the good-subject stereotype. Of course, not all subjects will automatically conform to this role expectation, just as not all human beings will consistently conform to societal norms (Latané and Darley, 1970; Macaulay and Berkowitz, 1970), although recalcitrant subjects may be relatively rare in our experimental studies. If one imagines a continuum representing the range of compliance to what we presume is the prescribed behavior usually associated with the role of research subject, it would be the intensely eager and cooperative subject who anchors the compliance end of the continuum (cf. Rosnow, 1970) and the recalcitrant type who anchors the counter-compliance end. Neither type is probably all that common, however; most of our volunteers and even our coerced research subjects in psychology probably fall somewhere around the middle or between the middle and the compliance end.
Table 4–1 Characteristics Ranked According to the Percentage of Students Who Circled Them in Response to the Question, ‘‘How do you think the typical human subject is expected to behave in a psychology experiment?’’ Percentages Are Given in Parentheses; Tied Characteristics Share a Rank (After Aiken and Rosnow, 1973)
1. (78) cooperative
2. (72) alert
3. (60) observant
4. (57) good-tempered; (57) frank
6. (56) helpful
7. (52) logical; (52) trustful
9. (43) efficient
10. (42) agreeable
11. (41) conscientious
12. (39) useful
13. (38) curious
14. (34) confident
15. (30) good; (30) loyal
17. (24) punctual
18. (19) inoffensive; (19) critical
20. (16) sophisticated; (16) idealistic
22. (15) outspoken; (15) wordy
24. (14) argumentative; (14) daring; (14) impressionable
27. (13) jumpy; (13) excited; (13) crafty
30. (12) gracious; (12) grateful; (12) untiring
33. (11) nonconforming; (11) moody; (11) excitable; (11) disturbed; (11) irritable
38. (10) touchy
39. (9) self-possessed; (9) wholesome
41. (8) amusing
42. (7) eccentric; (7) irrational
44. (6) egotistical; (6) shy; (6) boastful
47. (5) painstaking; (5) careless; (5) unsocial
50. (4) daydreamer
51. (3) prudent; (3) childish; (3) impolite; (3) petty
55. (2) disrespectful; (2) impractical; (2) wasteful
One way of conceptualizing these role enactments was to view the reaction to demand characteristics as the dependent variable in a social-influence paradigm where the predominant sources of artifact operate via a few main intervening variables, and there is a detailed discussion of this artifact-influence model in Chapter 6. For present purposes, it will suffice to say that voluntarism as an artifact-independent variable is presumed to operate on the motivational mediator in the chain, the subject’s acquiescent, counteracquiescent, or nonacquiescent response-set pursuant to demand characteristics.
Volunteer Status and the Motivation Mediator
There are alternative possibilities as to how experimental performances could be affected by this action. For example, it might be thought that volunteers and nonvolunteers would be differentially responsive to task- and ego-orienting cues. This hypothesis was first elaborated by Green (1963), who sought to clarify some of the conditions that promote and inhibit the psychological phenomenon in which people tend to recall a larger proportion of interrupted than completed tasks (cf. Zeigarnik, 1927). Green wondered if volunteers, because they would be curious about, and interested in, the experimental task, might therefore be strongly disposed to comply with task-orienting cues. Captive nonvolunteers, on the other hand, might be more oriented to ego cues in order to paint a good picture of themselves, for Green reasoned that it could have been those subjects’ concerns about being unfavorably evaluated that led them to reject the volunteering solicitation in the first place and that being drafted into an experiment could intensify their ego needs and bolster ego responsiveness. Given that willing and unwilling subjects may indeed be motivationally different in their concerns with task- and ego-orienting cues, does it follow then that volunteers and nonvolunteers would be differentially responsive to demand characteristics?
In an experimental situation where demand characteristics were operating in conflict with ego-orienting cues, would volunteers’ and nonvolunteers’ contrasting motivations guide their experimental performances in ways that could threaten the tenability of research conclusions? For example, say that they perceived cooperative behavior as likely to result in an unfavorable assessment of their intellectual capacities. How would this cognition affect the experimental behaviors of volunteers and nonvolunteers? Would they react differently to the motivational dilemma, volunteers perhaps being the demand-responsive subjects and nonvolunteers, the ego-responsive subjects? Silverman (1965) has also discussed this possibility, and Vidmar and Hackman (1971) raised it again recently in discovering that their volunteer subjects in a small-group interaction study produced longer written products than did a replication group of conscripted subjects. The volunteers also complained more about not having sufficient time to work on the experimental tasks—findings that are certainly suggestive of a stronger task-oriented set. An alternative hypothesis can be derived with reference to the concept of approval-need. We posited that volunteers have a relatively high approval-need, and it has been shown that high-need-for-approval subjects have a high affiliation-need (Crowne and Marlowe, 1964). One means of satisfying the affiliation need could be found in volunteering for certain kinds of research participation, although once committed to participating, the volunteer’s approval motivation might emerge as the prepotent determiner of his experimental behavior. Crowne and Marlowe drew
a parallel between what is sometimes an unusually favorable self-image that some subjects project and the characters in Eugene O’Neill’s play The Great God Brown. It is especially those subjects who have a high need for social approval who fit this image. The characters in O’Neill’s play wear masks to conceal their identities from each other, just as volunteers may hide aspects of their true personality, projecting instead a more uniformly favorable image that seems likely to gain them the social approval they desire. To the extent that their experimental behavior was uncharacteristic of their naturalistic behavior, a threat to the tenability of any conclusions drawn from the experimental data would be posed. Such subjects could be initially different from the rest of the population because of their high approval-need, and this need could contribute further to invalidation by motivating them to respond to demand characteristics in ways that could distort their reactions to the experimental treatment. There is other work that also tentatively links voluntarism to the motivation mediator. In a study of the effects of signaled noncontingent rewards on human operant behavior, Remington and Strongman (1972) administered contingent or random presentations of a conditioning stimulus and an unconditioned stimulus to college volunteers, while the subjects worked on a time-related baseline schedule of reinforcement. In contrast to earlier studies in which operant acceleration was obtained on differential reinforcement of low rates schedules (DRL) and a suppression on variable interval schedules (VI), their subjects on DRL showed no effect, while those on VI showed a response facilitation—which the authors tentatively interpreted as the result of a volunteer species-by-schedule interaction. 
Brower (1948) compared the performances of volunteers and captive subjects in a visual–motor skills experiment where the task was to traverse a brass-plate apparatus with a stylus under three conditions: (1) when the subjects could directly observe their own performance, (2) when they could only observe their performance through a mirror, and (3) when blindfolded. The volunteers consistently made fewer errors than the captive subjects (although differences in mean error rates were significant at the .05 level only in conditions 2 and 3), results that, together with earlier findings, Brower thought indicative of the influence of motivational differences connected with the recruitment procedures. In a similar vein, a study by Cox and Sipprelle (1971) also has argued for the link between voluntarism and experimental motivation. Unable to replicate a finding by Ascough and Sipprelle (1968) demonstrating operant control of heart rate without the subjects’ (verbalized) awareness of the stimulus–response contingency, Cox and Sipprelle set about to discover the reason for their failure. Wondering if it could have been because of differences in motivation between the two subject pools—Ascough and Sipprelle having used undergraduates who volunteered for the experiment, and the replication failure being based on the responses of coerced subjects—Cox and Sipprelle experimented with using a small monetary incentive to elicit responding from their unwilling subjects. Each time there was a verbal reinforcement, the subject was awarded a penny. Examining some of the collective findings from the two studies, which are shown in Fig. 4-4, statistically similar rates of autonomic conditioning were obtained among verbally reinforced volunteers and a small group of nonvolunteers who were both verbally and monetarily reinforced, and these comparable rates were significantly different from those obtained among verbally conditioned nonvolunteers and a noncontingent control group.
Figure 4–4 Combined accelerate and decelerate heart rate changes (after Cox and Sipprelle, 1971). [Plot: mean heart rate change, scaled 0 to 10, on the vertical axis; trial blocks 0 to 4 on the horizontal axis; separate lines for verbally reinforced volunteers, verbally + monetarily reinforced nonvolunteers, verbally reinforced nonvolunteers, and noncontingent controls.]
Interpreting the combined results as evidence that an acquired reinforcer can
compensate for an initial lack of motivation, the investigators’ conclusions, which were stated in the context of their failure to replicate the earlier findings, conveyed the notion that the replication failure was seen somehow as less ‘‘internally’’ valid than results with the experimentally motivated subjects; that is, they implied that the results might jeopardize the validity of inferred causality, as opposed to the generalizability of the presumed relationship. Finally, Black, Schumpert, and Welch (1972) compared voluntary and nonvoluntary subjects in a perceptual–motor skills experiment with a slight twist. The subjects had to track a target circle on a pursuit rotor, and they were assigned to groups with a predetermined level of feedback for their performance. Thus, one group of subjects got 100% feedback; another, 50%; a third, no feedback; and a fourth group, complete feedback but only after the initial set of trials. The novel aspect to the study was that the subjects were told that they could drop out whenever they began to feel bored. As one might anticipate from the preceding studies, the volunteers showed considerably more staying power than did the nonvolunteers; and the more feedback the subjects got, the more persistent did the volunteers become. Although these studies tentatively link voluntarism with the motivation mediator, it should be recognized that the results could just as easily be interpreted in the context of external as internal validity, to draw upon Campbell’s (1957) familiar distinction that external validity refers to sampling representativeness and internal validity to the plausible existence of a causal relationship. The discussion now shifts to findings that are more directly indicative of the influence of this artifact-independent variable on the validity of inferred cause-and-effect relationships in behavioral research.
5 Empirical Research on Voluntarism as an Artifact-Independent Variable
So far, with the exception of the study by Horowitz (Chapter 4), our discussion of the threat from subjects’ volunteer status to the validity of inferred causality has been speculative. Conjectures were derived mainly from a knowledge of characteristics that distinguished willing from unwilling subjects, from which it was possible to hypothesize different types of error outcomes. In this chapter, we summarize the main results of a series of studies that experimentally probed the subtle influence of this artifact in psychological experimentation. We begin with three exploratory studies that were designed to determine whether volunteer status actually had any marked influence on experimental outcomes and to identify possible circumstances in which this biasing effect could be reduced. We then describe an experiment that demonstrated the sort of type-I–type-II confounding postulated earlier and next a study that linked voluntarism as an artifact-independent variable more definitely with the motivation mediator. We then turn to the question posed earlier concerning whether willing and unwilling subjects are differentially responsive to task- and ego-orienting cues, and finally we examine the role expectations of voluntary and nonvoluntary subjects. Before proceeding to a discussion of this program of research, it may be useful to reemphasize an important idea about the nature of the volunteer construct that is apparently a source of confusion for some psychologists.
A recent criticism of this research by Kruglanski (1973) has argued that because there are many reasons why a person might or might not volunteer to participate in behavioral research, volunteer status is, therefore, a nonunitary construct and, hence, “of little possible interest.” That is a troublesome chain of logic, however, as it would make much of psychology, perhaps most of it, “of little interest.” Patients who commit suicide, for example, do so for many different reasons; yet it is quite useful to characterize certain behaviors as suicidal. Suicide prevention centers have been established throughout the United States to conduct research and provide services for this class of persons who do what they do for many different reasons. People score high or low on the California F Scale, the Taylor Manifest Anxiety Scale, the MMPI scales, CPI scales, and various tests of ability, interest, and
Book Three – The Volunteer Subject
achievement, all for a variety of reasons, including many situational determinants. It would be a mistake to conclude that persons cannot usefully be grouped on the basis of their test scores, their suicidal status, or their willingness to participate in behavioral research because their scores or their status may have multiple determinants. However, perhaps the basic difficulty in grasping this idea lies in the confusion between constructs and measures thought to be related to constructs. One does not, of course, compute the reliability of constructs but of measures. And, as noted earlier, when the measure of volunteering is simply stating one’s willingness to participate, the reliabilities for volunteering range from .67 to .97, values that compare very favorably with the reliabilities of virtually any type of psychological test. When reliability is defined in terms of volunteering for sequentially different types of tasks, significant reliabilities are again obtained (Barefoot, 1969; Laming, 1967; Rosen, 1951; Wallace, 1954; median correlation = .42, every p < .05). The magnitudes of correlation are comparable to those found between subtests of standard tests of achievement and intelligence (cf. Rosnow and Rosenthal, 1974). Our point is that the exclusion from psychological theory of constructs that are not “unitary” might not be prudent, since this implies the abandonment of such constructs as intelligence, which have been studied profitably for decades employing orthogonal factor structures (i.e., nonunitary elements). Although there were minor variations in procedure, a similar overall design was used in all the following studies. Since college undergraduates are apparently the most frequently used research subjects in social and experimental psychology (Higbee and Wells, 1972; Schultz, 1969; Smart, 1966), the subjects were culled from sections of introductory courses in behavioral science.
They were identified as willing or unwilling subjects by their having been asked to volunteer for a psychology experiment. By comparing the names of the students who signed up for the experiment with the class roster, we could label subjects as volunteers and nonvolunteers (although in some cases the nonvolunteer groups may have contained a few absentee volunteers). In order to establish whether there were any differences in experimental outcome or in ratings of role expectations between willing and unwilling subjects, the reactions to an entirely different experimental treatment in which they all participated, or their responses to a questionnaire, were then subdivided according to the subjects’ volunteer status. There was, of course, a potential danger in this procedure of comparing verbal volunteers and nonvolunteers, since not everyone who signs up for an experiment will show up for his appointment. Cope and Kunce (1971) discovered differences in some linguistic behaviors between early shows and later shows, some of the latter being subjects who missed an earlier appointment and had to be rescheduled. Indeed, to the extent that verbal voluntarism could be confounded by this pseudovolunteering artifact, observed differences might be overestimates or underestimates of true differences between willing and unwilling subjects. While the collective data on the problem are equivocal, as discussed in Chapter 2, still they suggest that inferred experimental differences between true volunteers and nonvolunteers would probably be underestimated by this verbal voluntarism comparison (e.g., Conroy and Morris, 1968; Jaeger, Feinberg, and Weissman, 1972; Levitt, Lubin, and Brady, 1962).
Exploratory Research
Study 1
The first study explored whether volunteer status could have a biasing effect on experimental outcomes that might jeopardize the validity of inferred causal relationships, as it was not firmly established yet that this artifact variable was of more than passing theoretical concern. The possibility was investigated in a traditional attitude-change experiment in which we sought to determine whether volunteers responded any differently to a persuasive communication than nonvolunteers (Rosnow and Rosenthal, 1966). The subjects were 42 undergraduate women, 20 of whom had been identified as volunteers by their affirmative response to a request for psychological research subjects. Initially, they were all administered a lengthy opinion survey in which were embedded four 7-point items about American college fraternities. One week later, the subject pool was randomly divided into three groups. Two of these groups heard and read a pro- or antifraternity argument and the third group served as a noncommunication control. At the conclusion of the treatment, attitudes toward social fraternities were again measured and each subject was asked to guess the purpose of the study. (The subjects’ perceptions were generally in agreement about the experimenter’s persuasive intent, although not a single subject mentioned the voluntarism solicitation or said anything that might imply that she drew a connection between the attitude manipulation and the recruitment procedure.) Table 5-1 gives the mean attitude changes of volunteers and nonvolunteers in the three treatment groups. Overall, the direction of change was toward whichever side was advocated (p < .005), although volunteers changed more than nonvolunteers if the communication was antifraternity (effect size = .72) and less so if the communication was profraternity (effect size = .26).
The interaction p was less than .05, the magnitude of this effect being 1.7 units, or about twice the size of what Cohen (1969) would call large. A simple explanation for the reversal would have been that voluntarism correlated with initial attitude scores. Recalling the discussion in Chapter 2, if volunteers are more sociable than nonvolunteers, perhaps volunteers also are more favorably disposed in their attitudes toward social fraternities. Because the volunteers in this study could have been far to the positive side of the attitude scale, there may have been little room for them to move much further in the
Table 5–1 Mean Attitude Changes of Volunteers and Nonvolunteers (After Rosnow and Rosenthal, 1966)a
                 Profraternity Argument   Antifraternity Argument   Control Group
Volunteers       +1.7                     −3.5b                     +0.4
Nonvolunteers    +2.5b                    −1.2                      −0.9
Mean changes     +2.1b                    −2.4c                     −0.3
a Positive scores reflect profraternity attitude gains and negative scores, antifraternity gains. Means differed from their respective control group mean at the indicated p level.
b p < .05.
c p < .10.
profraternity direction; maybe the antifraternity side was the only open direction in which their attitudes could sway. On the other hand, if volunteers initially leaned more in the antifraternity direction than nonvolunteers, this too could account for the reversal as the volunteers might then have been more receptive to the antifraternity arguments than those that were counter to their original attitudes. In either case, an attitude difference between volunteers and nonvolunteers on the pretest questionnaire would automatically raise doubts about the methodological significance of a treatment-by-volunteer-status interaction. In fact, however, the data provided no basis of support for either generalization; a comparison of the pretest scores revealed only a negligible difference between the volunteers and nonvolunteers, neither group’s original attitudes deviating very far from the middle of the scale. Another tentative explanation for the reversal was that the volunteers, possibly because of their higher approval-need, were more strongly motivated than the nonvolunteers to confirm what they thought to be the experimenter’s hypothesis. If volunteers are more accommodating to demand characteristics than nonvolunteers and if the subjects perceived the (faculty) communicator as being antifraternity, this could explain why the volunteers were so much more responsive to the negative communication. We found some circumstantial support for this idea when we asked a comparison group of undergraduate women from the same population to rate their impressions of the communicator’s attitude toward fraternities.
On a nine-point bipolar scale anchored at the negative end with the label “extremely antifraternity,” the mean rating by these subjects was −1.1 (less than zero at p < .05), which reinforced the argument that our volunteers could have been more responsive to the communication that was congruent with their perceptions of the experimenter’s attitude and less responsive to the incongruent argument. Consistent with this interpretation was an additional finding, which is shown in Table 5-2, that the average reliability in before-after attitudes of the volunteers (rho = .35) was significantly lower than the average reliability in attitudes of the nonvolunteers (rho = .97). Although the volunteers’ attitudes were appreciably less reliable than the attitudes of the nonvolunteers (p < .0005), there were no significant effects on reliability of the treatment conditions within subjects of either volunteer or nonvolunteer status. That the volunteers were heterogeneous in their attitude change could be tentatively interpreted as evidence of their greater willingness to be influenced in whatever direction they felt was demanded by the situation. From these thin findings, we could now hypothesize that volunteers may be more sensitive and accommodating to demand characteristics than captive nonvolunteers, and a second exploratory study was designed to examine this possibility.
Table 5–2 Rank-Order Correlations for Pretest-Posttest Attitude Reliabilities as a Function of Volunteer Status (After Rosnow and Rosenthal, 1966)
                 Profraternity Argument   Antifraternity Argument   Control Group
Volunteers       .49                      .76                       .52
Nonvolunteers    .99a                     .98a                      .95a
a p < .05.
Study 2
In this follow-up experiment, attitudes were observed in a situation in which the major directional response cues were contained in the communications themselves, and thus, the experimenter’s private views on controversial issues were not a source of competing experimental expectations (Rosnow, Rosenthal, McConochie, and Arms, 1969). If our interpretation of study 1 were correct, it would follow that volunteers should be more compliant in their attitude responses when they were clear on the demand characteristics of the situation. In order to manipulate the clarity of directional response demands, one-sided and two-sided personality sketches were substituted for the fraternity arguments of the first study. It was reasoned that equipollent two-sided communications, because they contained contradictory response cues, would obscure the directional demand characteristics of the situation. This time, the subjects were 263 captive undergraduate men and women, 53 of whom were verbal volunteers for a psychology experiment. At random, the students were assigned to one of five treatment groups, and all were presented the following introductory passage to read about a fictitious character named Jim (adapted from research on impression formation by Luchins, 1957):
In everyday life we sometimes form impressions of people based on what we read or hear about them. On a given school day Jim walks down the street, sees a girl he knows, buys some stationery, stops at the candy store.
One group of subjects then read a short, extraverted personality sketch designed to portray Jim as friendly and outgoing; a second group read an introverted sketch, picturing him as shy and unfriendly. In two other groups, both communications were presented in a counterbalanced sequence, and the fifth group of subjects, which received only the introductory passage, served as a nondirectional control. In each case, the subjects had to rate Jim on four 9-point bipolar attitude scales for friendliness, forwardness, sociability, and aggressiveness. If our hypothesis were correct, we should expect more extreme congruent attitudes among volunteers than nonvolunteers who were exposed to the one-sided communications (where response cues were straightforward and consistent) and no appreciable attitude differences between volunteers and nonvolunteers in the two communication groups where response cues were contradictory.
Table 5–3 Composite Attitude Scores of Volunteers and Nonvolunteers After Reading a One-Sided or Two-Sided Personality Sketch (After Rosnow, Rosenthal, McConochie, and Arms, 1969)a
Treatment                        Volunteers   Nonvolunteers
Extraverted (E) minus control    +6.8         +3.7
Introverted (I) minus control    −7.2         −5.9
EI minus control                 −2.3         +0.3
IE minus control                 −0.5         −0.5
a The more highly positive or negative the score, the more extreme were the subjects’ attitudes, positive scores favoring the extraverted characterization and negative scores, the introverted characterization.
The results are summarized in Table 5-3, which shows for volunteers and nonvolunteers the difference between the average total score in the control group
and the average total score in each of the four experimental groups. Although the overall effects were not significantly greater among volunteers than nonvolunteers, the trend of the findings was in the predicted direction, and the interaction of volunteering with positive versus negative one-sided communications showed that volunteers were significantly (p < .05) more susceptible to one-sided communications than were nonvolunteers. In addition, the effect size was even larger than that of the previously described study. Thus, when the personality sketch was slanted in the positive direction, it was the volunteers who became more positive in their attitudes; and when the communication was slanted in the negative direction, it was the volunteers who became more negative. Completely consistent with the demand clarity interpretation were the results with the opposing communications. These stimuli were generally ineffective regardless of whether they were compared to the control group or to each other.
Study 3
Encouraged by these data trends, we toyed with the clarity notion once more before formalizing the hypothesis and putting it to a more stringent experimental test. This time, we wondered if a successful deception study might not function in a manner similar to the two-sided communication in study 2, in effect obfuscating the directional response demands of the situation. An opportunity to test this idea arose when Roberta Marmer McConochie, then a graduate student in communications research at Boston University, agreed to tack on the voluntarism variable in a thesis experiment involving a demonstrably successful cognitive dissonance deception (Rosnow, Rosenthal, McConochie, and Arms, 1969). There were two phases to the deception, which was carried out on 109 undergraduate women in a student dormitory at Boston University (60 volunteers and 49 nonvolunteers).
In the first phase, the subjects all participated in what was represented to them as a national opinion survey. Embedded in the lengthy opinion questionnaire were 12 statements that the students rated on an 11-point scale to indicate how important they perceived such ideas as whether there should be more no-grade courses at universities, courses in sex education, student unionization, instruction on the use and control of hallucinatory drugs, and so on. One month later, another experimenter, who represented himself as a researcher from Boston University’s Communication Research Center, told the students that the Center was conducting a follow-up to some of the questions from the national survey. Each student was handed a questionnaire (specially designed for her) in which were listed 2 of the 12 statements rated earlier. Since, according to Festinger’s (1957) theory, important decisions or decisions where the unchosen alternative is relatively attractive should produce more cognitive dissonance than unimportant decisions or decisions where the unchosen alternative is unattractive, McConochie studied both these factors. Importance was manipulated by deceiving some of the subjects into believing that Boston University planned to put into practice one of the ideas listed in the questionnaire and that the students’ preferences would be taken into account by the administration when making its final choice. Attractiveness of the unchosen alternative was manipulated by selecting ideas for the students to choose between which had been rated similarly (relatively ‘‘attractive’’) or dissimilarly (relatively ‘‘unattractive’’) on the initial questionnaire. After the students revealed their preferences,
Table 5–4 Reduction of Postdecision Dissonance by Changing the Importance of the Chosen and Unchosen Alternatives for Volunteers and Nonvolunteers (After Rosnow, Rosenthal, McConochie, and Arms, 1969)a
Change from First to Second Rating for:
Experimental Variables        Chosen Alternative   Unchosen Alternative   Net Change
Volunteers
  Low importance
    Low attractiveness        +0.1                 +0.4                   −0.3
    High attractiveness       +0.3                 −1.9                   +2.2b
  High importance
    Low attractiveness        +0.2                 −0.7                   +0.9
    High attractiveness       +0.6                 −1.6                   +2.2b
Nonvolunteers
  Low importance
    Low attractiveness        −0.5                 +0.1                   −0.6
    High attractiveness       +1.2                 −1.3                   +2.5b
  High importance
    Low attractiveness        +0.4                 −0.7                   +1.1
    High attractiveness       +2.1                 −0.3                   +2.4b
a Positive scores indicate an increase in importance, and negative scores a decrease. Net change is the change for the chosen alternative minus the change for the unchosen, which indicates the net spreading-apart of the alternatives following the choice. A positive net change is evidence of dissonance reduction.
b Net change significantly greater than zero, p < .05.
they once again rated all 12 items for importance. The dissonance theory prediction was that they should spread apart the values of the choice alternatives in direct proportion to the amount of cognitive dissonance they experienced. The results of the study, which are shown in Table 5-4, did suggest differences between volunteers and nonvolunteers. The attractiveness variable especially (but also the importance variable) appeared to have less effect on volunteers than on nonvolunteers with regard to the chosen alternative. For the unchosen alternative, volunteers showed a greater effect of attractiveness than did nonvolunteers. Consistent with this interpretation, there was a highly significant four-way interaction in a voluntarism by importance by attractiveness by repeated measures analysis of variance of the change data (p < .001). This was hardly a predicted interaction to be sure, but it indicated nevertheless that volunteering can make a difference. Of course, since cognitive dissonance researchers do not usually look separately at the alternatives but only at the net change data, an overall decision about the success of the manipulation would be only indirectly affected by subjects’ volunteer status. Other than this complex interaction, no other interaction with voluntarism was statistically significant (every F < 1). However, consistent with the dissonance predictions, there was a greater spreading apart of the choice alternatives in the high than in the low attractiveness condition (p < .001) as well as in the high versus low importance conditions (p = .11). Clearly, volunteer status does not always directly interact with other experimental variables to bias research conclusions, although it may affect validity in an indirect way. (How differential sensitivity to demand characteristics could cope with a difference under high but not low importance was unclear.)
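The net change column of Table 5-4 is defined in the table note as the change for the chosen alternative minus the change for the unchosen alternative. The short Python sketch below simply redoes that arithmetic from the chosen and unchosen entries of the table. The dictionary layout and every name in it (TABLE_5_4, net_change, NET_CHANGES) are illustrative assumptions and not part of the original study; rounding to one decimal is assumed only to match the precision of the table.

```python
# Changes in rated importance from the first to the second rating, copied
# from Table 5-4 as (chosen alternative, unchosen alternative).
# The data layout and all names here are illustrative assumptions.
TABLE_5_4 = {
    ("volunteers", "low importance", "low attractiveness"): (0.1, 0.4),
    ("volunteers", "low importance", "high attractiveness"): (0.3, -1.9),
    ("volunteers", "high importance", "low attractiveness"): (0.2, -0.7),
    ("volunteers", "high importance", "high attractiveness"): (0.6, -1.6),
    ("nonvolunteers", "low importance", "low attractiveness"): (-0.5, 0.1),
    ("nonvolunteers", "low importance", "high attractiveness"): (1.2, -1.3),
    ("nonvolunteers", "high importance", "low attractiveness"): (0.4, -0.7),
    ("nonvolunteers", "high importance", "high attractiveness"): (2.1, -0.3),
}


def net_change(chosen, unchosen):
    """Spreading apart of the alternatives: chosen change minus unchosen change."""
    return round(chosen - unchosen, 1)


# One net change per cell of the design.
NET_CHANGES = {cell: net_change(*changes) for cell, changes in TABLE_5_4.items()}

for cell, value in NET_CHANGES.items():
    # A positive value is read in the table note as dissonance reduction.
    print(", ".join(cell), f"{value:+.1f}")
```

Computed this way, the four high-attractiveness cells give the largest positive net changes for volunteers and nonvolunteers alike, which is the pattern the text describes for the attractiveness variable.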
Type-I and Type-II Error Outcomes
In Chapter 4, we derived hypothetical type-I and type-II error outcomes from relationships between volunteer status and certain dependent variables. Our exploratory findings, flimsy though they were, pointed now to a way of testing the notion experimentally. When questionnaires are employed as before-and-after measures in attitude research, such as in study 1, it is plausible that the pretest in conjunction with the treatment may sensitize the subject so that he approaches the treatment differently than if he had not been pretested. Repeated attitude measurements should easily convey to the subject that some attitude change is expected, particularly if the measurements are obviously related to the experimental treatment (Orne, 1962a); however, whether or not the subject complied with this demand characteristic should depend on his motivation. Whether pretest sensitization was a potential source of invalidity in attitude research is a question that has aroused considerable theoretical and empirical concern over the years (Campbell and Stanley, 1966; Hovland, Lumsdaine, and Sheffield, 1949; Lana, 1969; Ross and Smith, 1965; Solomon, 1949). Surprisingly, the results of a dozen or more laboratory studies were remarkably consistent in revealing either no appreciable systematic effects of pretesting (Lana, 1959a, b; Lana and Rosnow, 1968) or a moderate dampening effect (Brooks, 1966; Lana, 1964, 1966; Lana and Rosnow, 1963; Nosanchuk and Marchak, 1969; Pauling and Lana, 1969). Summarizing these results, Lana concluded that “when pretest measures exert any influence at all in attitude research, the effect is to produce a Type II error, which is more tolerable to most psychological researchers than is an error of the first kind” (1969, p. 139).
Our exploratory findings on voluntarism led us to wonder whether the null and depressive effects in this body of research on pretest sensitization could have been caused by the failure to distinguish between motivational sets among subjects, for the results were based almost exclusively on the reactions of captive audiences. The actual subject samples might be visualized as pools of captive nonvolunteers mixed in some unknown proportion with potential volunteers. Since the ratio of nonvolunteers to volunteers frequently favors the former in college populations, the samples might have consisted largely of nonvolunteer types, or those subjects likely not to comply very strongly with demand characteristics. Furthermore, in the few instances in attitude research where a facilitative effect of pretesting was seen, the subjects were all rather willing participants—either volunteers or involved, presumably eager and highly motivated ‘‘nonvolunteers’’ (Crespi, 1948; Hicks and Spaner, 1962; Star and Hughes, 1950). It therefore followed that the directional effect of pretesting on subjects’ approaches to a treatment in attitude research might be predicated upon their original willingness to participate as research subjects. Since our exploratory studies suggested that volunteers may be more compliant with demand characteristics than nonvolunteers, it could be hypothesized that the effect of using a before-after design on volunteers should be in the direction of a type-I error, whereas the effect with nonvolunteers should be in the type-II direction. In other words, a facilitative effect of pretesting was predicted for volunteers’ responses to the treatment, and a dampening effect was predicted for nonvolunteers.
Research Procedure
To test these hypotheses, Rosnow and Suls (1970) randomly assigned 146 undergraduate men and women, including 50 verbal volunteers, to four forms of a booklet containing the various pretest and treatment combinations in the design developed by Solomon (1949), as shown in Table 5-5. Thus, group I was administered a pretest questionnaire, followed by the experimental communication. Group II received the same pretest but an irrelevant control communication. Group III received the experimental communication and no pretest, and group IV, only the control communication. All the subjects were then simultaneously administered a posttest questionnaire that was identical to the pretest. Using this basic research design, it was possible to calculate the amount of attitude change for the experimental group (group I) that was caused by the summative action of the pretest (Pr), extraneous events (Ex), the experimental treatment (Tr), and the interaction between the pretest and the succeeding events in time (Int). Given the functional relationships (after Solomon, 1949) $Y_1 = f(\mathrm{Pr} + \mathrm{Ex} + \mathrm{Tr} + \mathrm{Int})$, $Y_2 = f(\mathrm{Pr} + \mathrm{Ex})$, $Y_3 = f(\mathrm{Ex} + \mathrm{Tr})$, and $Y_4 = f(\mathrm{Ex})$, where $Y_1$ through $Y_4$ represent the mean posttest scores in groups I–IV, respectively, the interaction between the pretest and succeeding temporal events was determined by the following subtractive-difference formula: $d = Y_1 - (Y_2 + Y_3 - Y_4)$, where $d$ might be positive or negative, depending upon the psychological effects of the pretest on the way the subject approached the experimental treatment. A positive difference score would imply a facilitative effect of pretesting and a negative difference would imply a depressive effect. The 388-word experimental communication, which was represented as an excerpt from an editorial by a contributing science editor of The New York Times, discussed the scientific and social implications of the purported discovery of a new element called galaxium and the implications of nuclear research in general.
Table 5–5 Design of the Rosnow and Suls Pretest Sensitization Experiment (After Solomon, 1949)
                                Pretest
Treatment                       Yes         No
Experimental communication      Group I     Group III
Control communication           Group II    Group IV
It reported that an American nuclear physicist warned that the new element was difficult to handle because of its constantly changing nuclear structure, and that when changes occurred, radioactive particles were emitted into the surrounding atmosphere, resulting in harmful effects to living tissues. The communication added that an economist said that continued large-scale federal financial support of nuclear research could produce a sharp decline in the value of American currency and that the resulting
inflation would lower the standard of living in the United States. The communication also warned that further nuclear research might produce another force for mankind to fear and misuse, and it concluded with the statement, “We have enough to worry about now without devoting our minds to the production of an even more overwhelming destructive power.” The 348-word control communication, also represented as an excerpt from The Times, was on the subject of sexual promiscuity among college students. The pretest and posttest questionnaires were identical, each containing four attitude statements along with instructions to the subjects to indicate their agreement or disagreement on a nine-point bipolar scale. Corresponding to the tricomponential conceptualization of attitude (Chein, 1948; Katz and Stotland, 1959; Kothandapani, 1971; Krech, Crutchfield, and Ballachey, 1962), one item tapped the affective component of attitude; another item, the conative component; a third, the cognitive; the fourth, a combination of affective and cognitive components. The four items were as follows:
1. Affective: Nuclear scientists in this country are responsible men who should be allowed to do their research without interference from people in other fields.
2. Conative: If asked to do so, I would probably be willing to sign a petition protesting the federal government’s large-scale support of nuclear research.
3. Cognitive: Continued large-scale funding of nuclear research will most likely result in a lowering of the standard of living in the United States.
4. Affective-cognitive: Continued large-scale nuclear research will bring the world closer to destruction.
Responses to each item were scored on the basis of a scale ranging from 1 (strong disagreement with the view advocated in the experimental communication) to 9 (strong agreement).
Results
Before proceeding to the main analyses, two preliminary assumptions had to be considered. One assumption was that the treatment groups were initially comparable, and the other assumption had to do with the effectiveness of the experimental communication. There were several possible analyses that could be undertaken to test the two assumptions. One means by which to validate the first was to compare the pretest scores in groups I and II with the posttest scores in group IV. If the assumption of initial comparability were correct, there should be no appreciable differences between these three groups of attitude scores. Results of a series of unweighted-means analyses of variance supported the contention of initial comparability for three out of four questionnaire items. There were no overall differences obtained for items 1, 2, and 4 (every F < 1), although differences significant beyond the .05 level were obtained for item 3. A supplementary check on the assumption of initial comparability was to determine whether there were any pretest differences between volunteers and nonvolunteers in groups I and II versus the posttest scores in group IV. These results were consistent with the first analysis, disclosing differences between volunteers and nonvolunteers significant at the conventional level only for item 3 and negligible differences associated with the other three items. Hence, with the exception of the subjects’ attitudes regarding the cognitive statement, it could be concluded that the groups were initially comparable in their attitudes for items 1, 2, and 4. Since
the preliminary assumption of initial intergroup comparability was not met for item 3, this item was eliminated from the main analysis. To provide an overall check on the second assumption, that the experimental communication was effective, correlated t tests were computed on the pretest versus posttest scores of subjects in groups I and II. The results were supportive of the preliminary assumption. Significant attitude gains in the expected direction were found in the group that had been subjected to the experimental communication for all three remaining items. Therefore, it could be concluded that the experimental communication was an effective attitude-change agent. Given that the assumptions of initial comparability in attitudes and potency of the experimental treatment were satisfactorily met, a 2 × 2 × 2 analysis of variance was computed on the summed posttest attitude scores for the acceptable items. The three factors in the analysis were (1) whether the subjects had been pretested or not, (2) whether they received the experimental or control communication, and (3) their volunteer status. Should the experimental hypotheses be correct, a three-way interaction was expected. Evidence of a simple pretesting-by-treatment effect would be a two-way interaction between factors 1 and 2. As predicted, the three-way interaction was statistically significant beyond the conventional level (effect size = .41). The two-way interaction, consistent with the earlier experimental findings, was not significant (F < 1). To determine the overall direction of posttest differences, a subtractive-difference procedure was applied to the composite posttest means using the formula derived before, or $d = Y_1 - (Y_2 + Y_3 - Y_4)$. The results, which are shown separately for each item in Table 5-6, provided strong support for the experimental hypotheses. The composite difference scores for volunteers and nonvolunteers were all opposite in sign and of similar magnitudes.
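The subtractive-difference procedure can be retraced with a few lines of arithmetic. The Python sketch below applies d = Y1 − (Y2 + Y3 − Y4) to the posttest means reported in Table 5-6. It treats the posttest means as directly additive, which is a simplifying assumption made here for illustration; under that reading the pretest, extraneous, and treatment components cancel and only the interaction term remains. The dictionary layout and all names (POSTTEST_MEANS, subtractive_difference, INTERACTIONS) are hypothetical, and rounding to one decimal is assumed only to match the table.

```python
# Posttest means (Y1, Y2, Y3, Y4) for groups I-IV, copied from Table 5-6,
# keyed by volunteer status and attitude item number.
# The data layout and all names here are illustrative assumptions.
POSTTEST_MEANS = {
    "volunteers": {
        1: (6.6, 4.6, 5.8, 5.1),
        2: (5.7, 4.8, 5.2, 5.5),
        4: (6.6, 4.0, 5.4, 5.5),
    },
    "nonvolunteers": {
        1: (5.2, 6.1, 5.7, 5.1),
        2: (5.0, 5.9, 5.7, 5.5),
        4: (3.9, 5.7, 5.2, 4.4),
    },
}


def subtractive_difference(y1, y2, y3, y4):
    """d = Y1 - (Y2 + Y3 - Y4), rounded to one decimal like the table."""
    return round(y1 - (y2 + y3 - y4), 1)


# One difference score per volunteer status and attitude item.
INTERACTIONS = {
    status: {item: subtractive_difference(*means) for item, means in items.items()}
    for status, items in POSTTEST_MEANS.items()
}

for status, items in INTERACTIONS.items():
    for item, d in items.items():
        # Positive d is read as facilitative, negative d as depressive.
        direction = "facilitative" if d > 0 else "depressive"
        print(f"{status:13s} item {item}: d = {d:+.1f} ({direction})")
```

Every difference score comes out positive for volunteers and negative for nonvolunteers, with comparable absolute sizes within each item.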
This finding could certainly explain why the earlier research failed to yield the pretesting-by-treatment interaction. Differences between willing and unwilling subjects could have canceled each other, thereby producing small and insignificant two-way interactions.

Table 5–6 Summary of Posttest Means and Results of Subtractive-Difference Procedure with Volunteers and Nonvolunteers (After Rosnow and Suls, 1970)
Attitude Item | Volunteers | Nonvolunteers
1 | 6.6 − (4.6 + 5.8 − 5.1) = 1.3 | 5.2 − (6.1 + 5.7 − 5.1) = −1.5
2 | 5.7 − (4.8 + 5.2 − 5.5) = 1.2 | 5.0 − (5.9 + 5.7 − 5.5) = −1.1
4 | 6.6 − (4.0 + 5.4 − 5.5) = 2.7 | 3.9 − (5.7 + 5.2 − 4.4) = −2.6

Conclusions
The findings thus demonstrated that the directionality of the effect of pretesting on subjects’ reactions to a communication in attitude-change research could be largely predicated upon their original willingness to participate as research subjects. The main finding was that pretested volunteer subjects were more accommodating, and pretested nonvolunteers less accommodating, to an attitude manipulation. This suggests that using a before-after design can lead to overestimates of the attitudinal effects of persuasive communications when the subjects are motivated positively
Book Three – The Volunteer Subject
pursuant to the demand characteristics of the situation and to underestimates when they are captive nonvolunteers. We must mention, however, that recently there has also been an alternative explanation to account for the failure to isolate the simple interaction between the effect of pretesting and that of exposure to a persuasive communication. This alternative view, the commitment hypothesis, argues that subjects may become ‘‘frozen’’ in their original positions when they have to commit themselves on a serious controversial issue (Lana and Menapace, 1971; Pauling and Lana, 1969). Hence, to the extent that giving one’s opinions may arouse evaluation anxieties about seeming too gullible if one were subsequently to change them very much, a person should become more firmly entrenched in his initial position. Some support for this interpretation was provided by Lana and Menapace (1971) in a study of adult rehabilitation patients whose opinions on hospital conditions were measured before and after they underwent routine hospitalization. Demand characteristics were manipulated by informing some of the patients that the researcher expected their opinions to change during the course of their hospital stay and that they would be tested again later on to determine whether they had actually changed their minds. Commitment was varied by having the patients declare their initial opinions either aloud to the researcher or else silently to themselves. The results of the study would seem to favor the commitment hypothesis in that there was less opinion change among patients who were made aware of the researcher’s expectations and had ‘‘publicly’’ committed themselves than among those aware patients who had originally stated their feelings privately. 
Excluding the possibility of a boomerang effect in the aware group because of the blatancy of the demand characteristics, it may also be possible to reconcile these results with the demand hypothesis by postulating that different factors may predominate in different situations. Thus, while commitment may be a dominant variable in motivational conflicts of the sort investigated by Lana and Menapace, the demand hypothesis may be favored in experimental situations where there are no deep personal conflicts between evaluative cues and task-orienting cues. As the debate raises an empirical question, we were able to pit these two competing hypotheses against one another in the usual laboratory context simply by manipulating the subjects’ anonymity of response (Rosnow, Holper, and Gitter, 1973). On the one hand, the commitment hypothesis would predict a greater likelihood of pretest-by-treatment interaction under comparable conditions of anonymity than nonanonymity, since it follows that nonanonymous subjects would be more committed to their pretest opinions and therefore less responsive to the treatment than anonymous subjects. On the other hand, the demand characteristics hypothesis would predict that nonanonymous subjects, being more easily identifiable as “good” or “bad” subjects, would be more apt to comply with experimental demands than anonymous subjects, and so, there would be a lesser facilitative effect of pretesting under anonymity than nonanonymity conditions. In this case, using the same four-group design and stimuli as in the study by Rosnow and Suls, the results did indeed favor the demand hypothesis with the predicted subtractive-difference effect being greater for subjects who were treated nonanonymously (p < .08).
We have now seen that in before–after attitude-change experiments the probability of type-I errors appears greater when the subjects are willing participants and that the probability of type-II errors appears greater when the subjects have the characteristics of nonvolunteers. More important, though, the results of the Rosnow
and Suls study implied that pretesting may have the capacity to distort the relationships that emerge between independent and dependent variables in attitude-change studies when the subjects’ motivational sets are taken into account, thereby jeopardizing the validity of inferred causal relationships. Given the volunteer’s strong inclination to comply with task-orienting cues, now a stronger case could also be made that voluntarism operates as a motivation mediator. An opportunity to test the idea more directly arose next in an experiment on verbal operant conditioning.
The ‘‘Good Subject’’ in Verbal-Conditioning Research The experimental task in this study was the well-known Taffel conditioning procedure in which the experimenter’s use of the word good serves as a verbal reinforcement for the subject’s emission of I–WE responses, a procedure detailed in Chapter 4 in connection with Resnick and Schwartz’ ethical forewarning study. This popular experimental procedure had been employed in over 300 verbal-conditioning studies (Greenspoon, 1962; Kanfer, 1968; Krasner, 1958; Salzinger, 1959; Williams, 1964), with the rather consistent finding that good tended to facilitate I–WE sentences. What is it about the Taffel procedure that produces a change in verbal behavior? Researchers have been sharply divided on the answer. From a behavioristic viewpoint it has been argued that verbal behavior, like other forms of behavior, comes under the control of the reinforcing stimulus. Cognitive theorists maintain that it is the subject’s awareness of the reinforcement contingency that is necessary for changes in verbal behavior to occur. Discussions generated by these contrasting theoretical orientations have been summarized by Spielberger and DeNike (1966) and Kanfer (1968), and Kanfer raised the interesting possibility that inherent methodological complications may be an inadvertent source of confounding artifact. Given this perplexing state of affairs, it was thought that a fruitful alternative approach might be to view the procedure from the phenomenology of the research subject. Proceeding on the assumption that the typical subject in this experiment could regard his participation as a problem-solving task in which he tried to guess its purpose and the experimenter’s intent (Dulany, 1962; Krasner, 1962), it follows that once having arrived at a plausible explanation, the subject must next determine whether to respond to the perceived coercive demands of the situation. 
Certainly in this particular case the demand characteristics seem simple and straightforward. Even despite a subject’s not having been expressly instructed to this effect, he should readily infer that an I–WE response is required in order to elicit the experimenter’s verbal approval. Furthermore, since most of the studies had used volunteers, one would think that the very subjects most frequently used in this research might be those most apt to comply with this salient coercive demand. Research Procedure The general procedure was conducted in two phases (Goldstein, Rosnow, Goodstadt, and Suls, 1972). First, undergraduate women in sections of an introductory psychology class were casually instructed about verbal conditioning procedures as a regular part of their classroom lecture. (Recall Resnick and Schwartz’s finding that extensive
forewarning resulted in countercompliant conditioning behavior.) Second, several days later 23 verbal volunteers and 22 nonvolunteers were drawn from among these informed students and from other sections that were experimentally naive, and all the subjects were individually subjected to the standard Taffel conditioning procedure. Immediately afterward, each subject was ushered into an adjoining room by another experimenter, who administered Dulany’s (1962) awareness questionnaire. This popular questionnaire was designed to assess three distinct classes of the subject’s hypotheses about the experiment and his intention to act upon his beliefs: a reinforcement hypothesis (awareness of the reinforcement contingency), a behavioral hypothesis (his awareness of expected, or ‘‘correct,’’ behaviors), and his behavioral intention (his motivation to act upon his hypothesis). The decision to use separate experimenters was predicated on our desire to minimize any bias in the administration and recording of awareness responses and to encourage the subjects to express their honest opinions. To safeguard against expectancy effects, both experimenters were kept unaware of the subjects’ volunteer status throughout the study, and the second experimenter received no information about the quality of the subjects’ conditioning performances. Results For convenience of analysis, the 80 conditioning trials were divided into four equal blocks, the first block of 20 defining the operant base rate for the critical I–WE responses and blocks labeled 1, 2, and 3 constituting learning segments also of 20 trials each (see Fig. 5-1). A comparison of volunteers and nonvolunteers at the operant level revealed no appreciable initial differences between the two groups. To provide a difference measure of conditioning, the mean rate of emission of I–WE responses at the operant level was subtracted from that in block 3. 
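The blocking and difference measure just described can be illustrated with a short sketch. The response sequence below is invented purely for illustration, and the function names are hypothetical. The sketch computes the gain for a single subject, whereas the study reports group means.

```python
def block_counts(trials, block_size=20):
    """Count critical (I-WE) responses in consecutive blocks of trials.

    `trials` is a sequence of booleans, True where the sentence the subject
    constructed began with I or WE.
    """
    if len(trials) % block_size != 0:
        raise ValueError("number of trials must be a multiple of block_size")
    return [
        sum(trials[start:start + block_size])
        for start in range(0, len(trials), block_size)
    ]


def conditioning_gain(trials, block_size=20):
    """Critical responses in the final block minus those in the operant (first) block."""
    counts = block_counts(trials, block_size)
    return counts[-1] - counts[0]


# Hypothetical subject (not data from the study): 5, 8, 10, and 12 critical
# responses in the operant block and blocks 1, 2, and 3, respectively.
example_trials = (
    [True] * 5 + [False] * 15
    + [True] * 8 + [False] * 12
    + [True] * 10 + [False] * 10
    + [True] * 12 + [False] * 8
)

if __name__ == "__main__":
    print(block_counts(example_trials))       # per-block counts
    print(conditioning_gain(example_trials))  # block 3 minus operant block
```

Averaging such gains within the volunteer and nonvolunteer groups would give a group-level difference measure of the kind compared in the analysis.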
The data showed mean increases of 6.6 critical responses for volunteers and a corresponding increase of 2.9 for nonvolunteers, the difference in gain scores being significant beyond the .05 level with an effect size of .72. A repeated measures analysis of variance of volunteer status and all four blocks also disclosed significant differences in the emission of critical responses, and there was a significant interaction effect as well, the rate of increase being markedly higher among volunteers than nonvolunteers. A secondary data analysis was calculated in which volunteer status was one factor and prior information about the treatment was a second factor. As shown in Fig. 5-1, naïve nonvolunteers conditioned less successfully than either informed or naïve volunteers (p < .05). The results of a 2 × 2 analysis of variance for conditioning (block 3 minus the operant block) yielded one significant effect: volunteers conditioned better than nonvolunteers (p < .05). Hence, prior information about the treatment was apparently of less importance (F < 1) than the subjects’ volunteer status. We should note at this point, however, that the absence of a boomerang conditioning effect among informed volunteers is not, as it might seem on the surface, necessarily at odds with Resnick and Schwartz’s finding (see Chapter 4) if one hypothesizes a curvilinear relationship between the clarity of demand characteristics and demand compliance. Here, prior information exposure was casually induced in the context of a regular classroom lecture, whereas Resnick and Schwartz explicitly forewarned their subjects about the nature of the experimental procedures in a methodically
Figure 5–1 Mean number of I–WE sentences constructed by informed and naive subjects (after Goldstein, Rosnow, Goodstadt, and Suls, 1972). [Line graph: vertical axis, mean number of I–WE sentences (0 to 20); horizontal axis, operant block and Blocks 1 to 3; separate curves for informed volunteers, naïve volunteers, informed nonvolunteers, and naïve nonvolunteers.]
obtrusive manner inspired by the provisional APA ethical guidelines. We shall refer to this inverted U-shaped function again in the next chapter. Regarding the Dulany questionnaire responses for awareness, there were no significant differences between volunteers and nonvolunteers for the reinforcement hypothesis or the behavioral hypothesis (F < 1), although volunteers, as anticipated, more often reported intentions to produce the correct response class (effect size = .67, p < .05). Prior information exposure had no significant effects on any of the awareness measures (which was not unexpected given the straightforward demands operating in this situation). Of course, our use of a postexperimental awareness measure carried certain problems of interpretation (cf. Greenspoon and Brownstein, 1967; Spielberger and DeNike, 1966). Such measures are typically used to corroborate hypotheses about awareness as an intervening factor in verbal conditioning. Aside from the epistemological difficulties inherent in such ex post facto reasoning is the problem of interpreting just what an awareness measure actually measures. Responses to an awareness interview, like those to sentence construction cards in the Taffel task, are also subject to conscious control and could be correlated with performance change. Subjects might intentionally report low or high awareness as a
means of expressing or withholding their cooperation from an experimenter. This is an important problem that awaits more definitive inquiry, particularly in its broader implications for demand controls in general. Finally, it is worth mentioning that a study by Kennedy and Cormier (1971) also implies differences in verbal conditioning rates as a function of the mode of recruitment. These investigators compared three groups of subjects on the Taffel task: a group of unpaid volunteers, a group of subjects who were recruited for $2, and a group of required participants. While overall they found no appreciable difference between groups by an analysis of variance, combining two of them and then recalculating differences in conditioning rates shows that the unpaid volunteers exceeded the other groups by a ratio of better than 3:1 (7.9 versus 2.3, effect size = .28, z = 1.39).
When Compliance Shifts to Self-Defense The sixth study in this series (Rosnow, Goodstadt, Suls, and Gitter, 1973) provided an independent test of a provocative finding by Sigall, Aronson, and Van Hoose (1970) challenging the ubiquitous belief in the cooperative nature of psychological research subjects. These investigators had reported a complicated deception experiment in which there were two sets of cues occurring simultaneously in one of the treatment conditions, one set of cues ostensibly related to the demand characteristics of the treatment and a competing set that conveyed the idea that demand compliance would result in the respondent being unfavorably evaluated on an important psychological dimension. Instead of cooperating with the experimenter, the subjects responded in the direction they would have seen as promoting a favorable image. The results were interpreted as evidence of the prepotency of ego-defensive motives when demand compliance is incongruent with a favorable self-presentation. The present study explored the responsiveness of willing and unwilling subjects in circumstances where there were conflicting task- and ego-orienting cues. Research Procedure The subjects were 123 undergraduate women from the introductory psychology course, 59 of whom were verbal volunteers. Two response demands were intended to operate concurrently in the study, one communicating manipulatory intent and the other, the expected direction of the subjects’ responding. As our earlier research had suggested that before–after procedures may effectively convey manipulatory intent, this simple routine was adopted for transmitting the first demand cue. Directionality was communicated by exposing subjects to the one-sided extraverted or introverted personality sketches of ‘‘Jim’’ used in the earlier exploratory research (study 2). 
The before-and-after questionnaires now contained three 9-point scales on which the subject recorded the degree to which she perceived Jim to be friendly or unfriendly, forward or shy, social or unsocial, and the pretest was preceded by the same explanatory passage used in the earlier investigation. Subjects were unobtrusively assigned to four groups, two of which received the extraverted characterization and two of which received the introverted message; half the groups participated in a ‘‘complementary’’ condition and half the groups
participated in a ‘‘competing’’ condition. In devising these two conditions, our intent was to convey in a subtle, but unequivocal, way that compliance with demand characteristics could result in a favorable or unfavorable image, depending on the experimental condition. Hence, an indirect means of imparting this information was sought that, although straightforward and attention-getting, would not be so transparent as to reveal the true nature of the deception. For this purpose a young actor was employed who was planted in each of the classes as another student. In the complementary condition the actor interrupted the experimenter while test booklets were being explained and said that he heard from a friend in another class who had already participated in the study that the experimenter was trying to prove that ‘‘only people with really high IQ’s are able to come up with the correct impression of someone from a short paragraph.’’ To the confederate’s question whether this was really the aim of the study, the experimenter said he would answer later, when the experiment was completed. Thus, demand compliance should have been perceived as congruent with the desire to project a favorable image. In the competing condition, the opposite intent was conveyed by having the confederate say that he had heard that the experimenter was trying to prove that ‘‘only people with really low IQ’s would even attempt to form impressions of a person just from a short description.’’ In this case, demand compliance should have been perceived as incongruent with the desire to present a favorable image. Results The results of the experiment are shown in Fig. 5-2, which summarizes the data separately for volunteers and nonvolunteers. The valences associated with change scores (see vertical axis) indicate whether opinion changes were in the direction implied by demand characteristics. 
It can be seen that in three cases out of four there were sharp decreases in demand compliance from the complementary to the competing condition, the one reversal (which was statistically insignificant) being among volunteers who received the extraverted sketch. Where there was any change significant beyond the .05 level, it was in the predicted direction. Consistent with this observation, there was a significant statistical interaction of message directionality and the conflicting cues factor in a three-way analysis of variance (p < .05) and a borderline main effect for the conflicting cues factor (p < .06). Based on discussions by Green (1963) and Silverman (1970), it was thought that volunteers would be more strongly impelled by coercive demands, and nonvolunteers, by ego-defensive cues. Contrary to expectation there was only a very negligible voluntarism by conflicting cues interaction in the three-way analysis of variance (F < 1) and also a nearly zero triple interaction. However, there may be a hint of the hypothesized relationship in the fact that one group of volunteers responded slightly counter to evaluative cues. Since the same treatment was simultaneously administered to both volunteers and nonvolunteers, it would be difficult to explain this exception as a failure of the treatment. The results also provided an opportunity to check on our earlier finding that volunteers are more compliant with demand characteristics than nonvolunteers. While there was no conventionally significant main effect of volunteer status or a voluntarism by message directionality interaction, it can be seen that there was a tendency in the predicted direction. Volunteers in three out of four conditions
Figure 5–2 Mean impression changes of volunteers and nonvolunteers. Positive and negative signs are independent of the direction of the stimulus. Positive scores simply denote gains in whatever direction, positive or negative, was communicated by an extraverted or introverted sketch, and negative scores reveal demand countercompliance (after Rosnow, Goodstadt, Suls, and Gitter, 1973). [Line graph: vertical axis, mean impression change (−0.5 to +2.0); horizontal axis, complementary and competing conditions; separate lines for volunteers and nonvolunteers under the introverted and extraverted sketches.]
exhibited greater congruent change than nonvolunteers. Moreover, even though two out of four cases were quite insignificant (t < 1 for extraverted–complementary and introverted–competing conditions), relationships in the two cases that were statistically significant or approached the conventional level repeated our previous experimental finding (one-tailed p < .01, effect size = .96 for extraverted–competing and one-tailed p < .10, effect size = .56 for introverted–complementary). Taken together with the earlier findings, these results would certainly support the idea of a greater motivational tendency among volunteers than nonvolunteers to play the role of good subject.
Role Expectations of Volunteers and Nonvolunteers To summarize our findings thus far, the results of the preceding six studies provided compelling evidence of the subtle threat stemming from volunteer status as an artifact stimulus variable to the validity of inferred causal relationships. They showed, moreover, a definite link between voluntarism and the motivation mediator alluded to in the previous chapter. However, while they may also seem to imply differences in role perceptions between volunteers and nonvolunteers that could account for their energetic or sluggish experimental behaviors, the evidence on this score, being of an indirect nature, is at best circumstantial and one would prefer to ask
subjects themselves what their role expectations are. Hence, it was with this idea in mind that the final study in this series was designed (Aiken and Rosnow, 1973). In our earlier discussions we mentioned that there has been some controversy centering on the role that subjects adhere to for research participation and what their experimental motivations are. One position holds that research subjects respond to demand characteristics because they hope and expect ‘‘that the study in which they are participating will in some material way contribute to human welfare in general’’ (Orne, 1962, p. 778). Consistent with this notion was Straits, Wuebben, and Majka’s (1972, p. 516) conclusion, having factor analyzed the reports of respondents who simulated research participation, that ‘‘subjects are apparently willing to forego ‘hedonistic’ considerations for ‘altruistic’ goals such as those of advancing science and helping the experimenter.’’ An alternative view attributes role behaviors to expectations of obedience to authority. For example, Fillenbaum (1966, p. 537) characterized experimental behaviors as ‘‘dutifully going along with the instructions,’’ and Sigall, Aronson, and Van Hoose (1970, p. 9) speculated, ‘‘Subjects are obedient in the sense that if they are instructed to do something, they fulfill that request.’’ A similar idea was contained in Kelman’s (1972) analysis of the rights of subjects in terms of relative power and legitimacy. The subject is compliant owing to his perception of experimenters as legitimate authorities to be obeyed and of himself ‘‘as having no choice, particularly since the investigators are usually higher in status and since the agencies sponsoring and conducting research may represent (or at least appear to represent) the very groups on which the subjects are dependent’’ (Kelman, 1972, p. 990). 
A third view implies that role behaviors may be a function of subjects’ heightened approval needs in response to evaluative expectations that are endemic to behavioral research participation. The typical subject, it is hypothesized, approaches the typical experiment with the expectation of being evaluated on some pertinent psychological dimension and guides his behavior accordingly (Rosenberg, 1965, p. 29). Which of these interpretations best defines the predominant role expectation of subjects for standard experimental research (the altruism hypothesis, the obedience to authority hypothesis, or the evaluation hypothesis), and are there differences in role expectations between voluntary and nonvoluntary subjects? These were the questions to which this final study in the series was addressed. Since we sought verbal reports that were not themselves blatantly biased by the operation of demand characteristics, the study employed a kind of “projective” method of inquiry and scaling procedure that brought together the altruism, obedience, and evaluation hypotheses to paint a multidimensional picture of subjects’ expectations for experimental research participation. Research Procedure Again, the subjects were drawn from sections of an introductory psychology class and identified as volunteers or nonvolunteers by the identical procedure used in the preceding studies. The recruitment procedure produced a list of 53 volunteers and 100 nonvolunteers. All the subjects were given computer output sheets on which to respond directly by judging each of 55 unique pairs of 11 different stimuli that were presented in a complete paired-comparison schedule. One stimulus, the target situation, stated
simply, “being a subject in a psychology experiment.” The remaining 10 stimuli were designed either to represent experiences related to the three role hypotheses or to serve as control situations for tapping positive or negative affect. The 10 comparison stimuli and 5 conditions they represented were as follows:

Altruism hypothesis
1. giving anonymously to charity
2. working free as a laboratory assistant
Obedience to authority hypothesis
3. obeying a no-smoking sign
4. not arguing with the professor
Evaluation hypothesis
5. taking a final exam
6. being interviewed for a job
Positive control
7. spending the evening with a good friend
8. taking a walk in the woods
Negative control
9. going to school on the subway
10. having to work on a weekend or holiday.
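A complete paired-comparison schedule pairs every stimulus with every other stimulus exactly once, so 11 stimuli yield 11 × 10 / 2 = 55 pairs, 10 of which involve the target. The sketch below enumerates such a schedule from the stimuli listed above. The shuffling step, the `seed` parameter, and all function names are illustrative assumptions rather than a description of the procedure actually used.

```python
import random
from itertools import combinations

TARGET = "being a subject in a psychology experiment"

COMPARISON_STIMULI = [
    "giving anonymously to charity",
    "working free as a laboratory assistant",
    "obeying a no-smoking sign",
    "not arguing with the professor",
    "taking a final exam",
    "being interviewed for a job",
    "spending the evening with a good friend",
    "taking a walk in the woods",
    "going to school on the subway",
    "having to work on a weekend or holiday",
]

STIMULI = [TARGET] + COMPARISON_STIMULI


def paired_comparison_schedule(stimuli, seed=None):
    """Return every unordered pair of stimuli once, in a shuffled order.

    `seed` is a hypothetical convenience for reproducing one subject's order.
    """
    pairs = list(combinations(stimuli, 2))
    rng = random.Random(seed)
    rng.shuffle(pairs)
    return pairs


if __name__ == "__main__":
    schedule = paired_comparison_schedule(STIMULI, seed=1)
    print(len(schedule), "pairs in the schedule")
    print(sum(TARGET in pair for pair in schedule), "pairs involve the target")
```

Giving each subject a different seed would produce a different presentation order over the same 55 pairs.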
The instructions to the subjects began: There are many situations in which you might find yourself which are quite similar in the expectations you have about them. For example, playing in a tennis match or being in a debating contest evoke similar competitive expectations. In contrast, there are other situations which are probably quite different in the expectations you have about them. For example, jogging around the block versus watching television after a hard day of work evoke obviously different expectations, the first being very active, the second being very passive.
The subjects were then instructed to consider other pairs of experiences like these and to rate them on how similar the experiences were in the expectations the subjects had about them. A 15-point scale was provided for this purpose, with 15 representing maximum similarity (‘‘very similar expectations’’) and one representing maximum dissimilarity (‘‘very dissimilar expectations’’). After practicing on some sample pairs, the subjects rated the similarity of their expectations for the 55 pair members made up of the 11 different stimuli. Each subject received the pairs in a different random sequence to control for order effects. For the purpose of assessing reliability of judgment, the subjects were then administered a second output sheet containing 15 pairs chosen from across the pair sequence and asked to repeat their judgments. Data Analysis For each subject, a Pearson product–moment correlation was computed on original versus repeated judgments to the 15 reliability pairs. The median reliability
coefficient among volunteers was .68 (p < .01), and among nonvolunteers it was .48 (p < .05). Since the inclusion of unreliable protocols could tend to obscure or confound differences between groups, only those volunteers and nonvolunteers whose protocols were reliable at r(df = 13) = .51, p = .05, were retained for further analysis. Thirteen volunteer protocols and 52 nonvolunteer protocols were discarded for failure to meet this criterion. The median reliability coefficients of the remaining 40 volunteers and 48 nonvolunteers were .77 (p < .01) and .79 (p < .01). In passing, we should perhaps note that the difference in reliabilities was significant well beyond the conventional level (χ² = 9.6, p < .002), although what this means is not very clear at this point. Could some of the volunteer subjects, being brighter and higher in the need for social approval than the nonvolunteers, have been trying to appear more consistent? Alternatively, might the volunteers have shown greater care in answering questions? Further research may shed light on these questions. Of the 55 total judgments each subject made, 10 were direct assessments of the target stimulus (“being a subject in a psychology experiment”) with the comparison stimuli. From these judgments five scores were computed for each subject, the scores representing the similarity of each of the five comparison categories (altruism, obedience, evaluation, positive control, negative control) to the target stimulus. Within each category this score was the sum of the judged similarities of the paired situations in that category to the target stimulus. These results, which are shown in Table 5-7, reveal that the category with highest similarity to the target stimulus among both volunteers and nonvolunteers was that corresponding to Orne’s altruism hypothesis. A 2 × 5 unweighted means analysis of variance of group (volunteer, nonvolunteer) × category (altruism hypothesis . . .
negative control), followed by post hoc analysis, confirmed this finding. The significant category main effect of F(4, 344) = 21.4, p < .001, indicated differential similarity of the various categories to the target stimulus. That there was no interaction effect, F(4, 344) = 1.8, p > .10, indicated that category profiles were similar across groups. Lastly, the group main effect reflects the higher mean judgments of the volunteers over the nonvolunteers, F(1, 86) = 12.4, p < .001, which repeats the finding in our second exploratory study, described earlier.

Table 5–7 Judged Similarity of the Target Stimulus to Five Categories of Paired Experiences (after Aiken and Rosnow, 1973)a

Subject’s            Experiential Categories
Volunteer Status     Altruism      Obedience     Evaluation    Positive      Negative
                     Hypothesis    Hypothesis    Hypothesis    Control       Control
Volunteers           20.3 (7.1)    12.8 (7.2)    14.5 (6.8)    15.1 (8.8)    12.7 (5.8)
Nonvolunteers        16.9 (7.4)     9.4 (6.9)    14.5 (6.5)    10.0 (6.9)     9.4 (6.1)

a The maximum possible score was 30. The higher the score, the greater was the perceived similarity in role expectations between the target stimulus (“being a subject in a psychology experiment”) and the combined category stimuli. Standard deviations are indicated in parentheses.

(Whether in the first case this means that “being a subject” has
greater semantic meaning for volunteers than for nonvolunteers is an intriguing question and awaits further exploration.) As regards the post hoc analysis of category means using the Newman–Keuls technique, the altruism category, as stated above, was seen as significantly more similar to the target stimulus than all other categories (p < .01 in all cases). The evaluation category was closer to the target situation than was the positive control category (p < .05), the obedience to authority category (p < .01), or the negative control category (p < .01), and no differences were noted between these last three categories. Next, multidimensional scaling (MDS) was used to explore the structure underlying subjective similarity judgments between all situations. MDS is conceptually similar to factor analysis in that the purpose of MDS is to extract the dimensions (analogous to factors) underlying sets of interstimulus similarities (analogous to intervariable correlations). In addition, MDS furnishes for each stimulus its projections on the dimensions of the underlying configuration (analogous to factor loadings). As in factor analysis, the number of dimensions chosen to represent the structure of a set of judgments depends on the goodness-of-fit of a solution to the original judgments as well as on the interpretability of the dimensions of the solution. For the present MDS application, the mean judged similarity to each category pair was calculated separately for volunteers and nonvolunteers. These mean judgments were then analyzed with the TORSCA nonmetric MDS procedure (Young, 1968). Four-, three-, two-, and one-dimensional solutions were derived separately for volunteers and nonvolunteers so that goodness-of-fit could be examined as a function of solution dimensionality. The measure of fit in the TORSCA procedure, stress (Kruskal, 1964), is actually a ‘‘badness-of-fit’’ measure in that the larger the stress, the poorer the fit. 
Stress curves for the volunteer and nonvolunteer solutions showed elbows at two dimensions, thus indicating that two dimensions were appropriate for representing the original judgments of both groups. Among volunteers the stress values were .18, .08, .05, and .03 for the one-through-four-dimensional solutions respectively, and among nonvolunteers the corresponding stress values were .21, .09, .06, and .02. The raw judgments of the 88 individual subjects were next rescaled with the INDSCALE MDS procedure (Carroll and Chang, 1970), a technique to treat subjects and stimuli simultaneously. By using this procedure, a single stimulus configuration is recovered. The configuration is that unique solution which best represents the judgments of all individual subjects. For each subject, weights are simultaneously calculated that best represent the importance of each dimension to the subject. The procedure thus assumes that there is a common set of stimulus dimensions available to all observers but that individual subjects perceive particular stimulus dimensions as more-or-less salient when they judge the similarity between stimulus pairs. From this scaling procedure, three kinds of information are obtained: (1) a unique stimulus configuration with projections of the stimuli on the dimensions thereof; (2) for each individual a set of weights, one for each dimension of the stimulus configuration, which represents the extent to which the various stimulus dimensions are reflected in the subjects’ judgments; and (3) a measure of goodness-of-fit for each subject’s original judgments to the common stimulus configuration as weighted by the subject’s unique dimension weights.
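The elbow criterion applied to the stress values above can be made concrete with a small sketch. The rule used here (largest difference between the stress drop achieved on reaching a given dimensionality and the drop gained by adding one more dimension) is a common heuristic and an assumption of this illustration; it is not presented as the procedure the authors followed, and the function name is hypothetical.

```python
# Stress values reported in the text for the one- through four-dimensional
# TORSCA solutions (stress is a badness-of-fit measure: lower is better).
STRESS = {
    "volunteers": [0.18, 0.08, 0.05, 0.03],
    "nonvolunteers": [0.21, 0.09, 0.06, 0.02],
}


def elbow_dimension(stress):
    """Return the dimensionality at which the stress curve bends most sharply.

    Heuristic (an assumption, not the authors' stated procedure): among the
    interior points, choose the one where the drop in stress on reaching that
    dimensionality most exceeds the drop gained by adding one more dimension.
    """
    best_dim, best_bend = None, float("-inf")
    for i in range(1, len(stress) - 1):
        drop_to_here = stress[i - 1] - stress[i]
        drop_after = stress[i] - stress[i + 1]
        bend = drop_to_here - drop_after
        if bend > best_bend:
            best_dim, best_bend = i + 1, bend  # dimensions are 1-indexed
    return best_dim


for group, values in STRESS.items():
    print(group, elbow_dimension(values))
```

Under this heuristic both curves bend at two dimensions, in agreement with the reading of the stress curves given in the text. Interpretability of the dimensions, which the text names as a second criterion, is not captured by any such numerical rule.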
[Figure 5–3: plot not reproduced. Dimension 1 runs from nonwork-oriented to work-oriented; Dimension 2 runs from pleasant to unpleasant. Situations plotted: 1 Giving anonymously to charity; 2 Working free as a laboratory assistant; 3 Obeying a no-smoking sign; 4 Not arguing with the professor; 5 Taking a final exam; 6 Being interviewed for a job; 7 Spending the evening with a good friend; 8 Taking a walk in the woods; 9 Going to school on the subway; 10 Having to work on a weekend or holiday; 11 Being a subject in a psychology experiment.]
Figure 5–3 Two-dimensional INDSCALE solution based on individual judgments of 88 subjects. Lines labeled “altruism,” “evaluation,” and “obedience” show the actual relative distances between these three categories and the target stimulus (“being a subject in a psychology experiment”). The proximity between any pair of situations reflects the perceived similarity in role expectations for the pair members (after Aiken and Rosnow, 1973)
The two-dimensional stimulus configuration derived from the INDSCALE procedure is shown in Fig. 5-3. The first dimension can be interpreted as work orientation, with the nonwork experiences of “taking a walk in the woods” and “spending the evening with a good friend” at one extreme and the highly work-oriented experiences at the other. The second dimension can be interpreted as affective, with “giving anonymously to charity” at one extreme and “taking a final exam” at the other. In the scaling solution the ordering of distances of the altruism, evaluation, and obedience categories from the target stimulus was as reported before. Altruism stimuli, it can now be seen, were closest to the target stimulus, followed by the evaluation and obedience to authority stimuli. The correlations of individual subjects’ original judgments and weighted distances derived from scaling were significant for all subjects (for 87 out of 88 subjects, p < .01; for the last subject, p < .05). Among volunteers the median correlation was .67 with a semi-interquartile range of .10 and for nonvolunteers the median correlation was .65 with a .06 semi-interquartile range. In sum, the judgments of all individual subjects could be represented in the common stimulus configuration given in Fig. 5-3. Lastly, we turn to the question of whether volunteers and nonvolunteers could be differentiated on the basis of the weight placed on each dimension in making their judgments. For this purpose the dimension weights of individual subjects derived from the INDSCALE procedure were examined to test for differentiation of groups on the basis of dimension salience. This analysis revealed that the nonvolunteers placed more weight on the work-orientation dimension than did the volunteers, with mean weights of .36 and .28, respectively, F(1, 86) = 9.3, p < .01. And the volunteers placed slightly more weight on the pleasant–unpleasant dimension than did the
nonvolunteers, with mean weights of .55 versus .49, and F(1, 86) = 3.3, p < .10. As a final check, using the weights for the two dimensions as predictors in a discriminant analysis of volunteers versus nonvolunteers, the groups were significantly differentiated by the combination of weights, F(2, 85) = 4.7, p < .05. However, only weights assigned to the work-orientation dimension contributed to the distinction between groups, and F < 1 for the affective dimension.
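To see what a difference in dimension weights implies, it may help to sketch the weighted Euclidean distance that underlies the INDSCALE model, in which each subject (or here, each group mean) stretches or shrinks the common dimensions. The weights below are the group means reported in the text; the stimulus coordinates are hypothetical values chosen only for illustration, since the projections in Fig. 5-3 are not given numerically, and the function name is likewise an assumption.

```python
import math

# Mean dimension weights reported in the text, ordered as
# (work orientation, pleasant-unpleasant).
MEAN_WEIGHTS = {
    "volunteers": (0.28, 0.55),
    "nonvolunteers": (0.36, 0.49),
}

# HYPOTHETICAL coordinates, for illustration only: two situations that differ
# mainly on the work-orientation dimension.
COORDS = {
    "taking a walk in the woods": (-0.6, 0.4),
    "having to work on a weekend or holiday": (0.6, 0.3),
}


def weighted_distance(x, y, weights):
    """Weighted Euclidean distance between two stimuli in the common space."""
    return math.sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(weights, x, y)))


walk = COORDS["taking a walk in the woods"]
work = COORDS["having to work on a weekend or holiday"]
for group, weights in MEAN_WEIGHTS.items():
    print(group, round(weighted_distance(walk, work, weights), 3))
```

With these illustrative coordinates the pair lies farther apart under the nonvolunteers’ weights than under the volunteers’, which is one way of picturing what heavier weighting of the work-orientation dimension means for judged dissimilarity.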
Summary of the Research Findings and Conclusions
To reiterate the earlier conclusion, the results of the six experimental studies can be seen as evidence of the subtle threat deriving from voluntarism as an artifact-independent variable to the validity of inferred causal relationships. Volunteers were consistently more accommodating than nonvolunteers as long as the demand characteristics were presented clearly, and this motivational difference had the capacity to distort the relationships emerging between independent and dependent variables. In the study on error outcomes, it was discovered that the systematic biasing effect of pretesting subjects in before–after attitude research was to increase the likelihood of type-I decisions when the subjects are willing participants and to promote type-II decisions when they are captive nonvolunteers. (We use “type-I” in this context as shorthand for obtaining significant effects of an independent variable in a sample when such effects would not be significant in a more representative sample and “type-II” for obtaining no significant effects in a sample when such effects would be significant in a more representative sample.) A biasing behavioral effect of voluntarism was also obtained in the fifth experiment, on verbal conditioning, although the results, when considered in the light of Resnick and Schwartz’s findings, implied to us that demand clarity may operate in a curvilinear fashion with demand compliance. This is an important theoretical assumption to which we return in the next chapter. As regards the sixth experimental study, further support was provided for the Sigall, Aronson, and Van Hoose hypothesis that subjects, given a strongly felt conflict between their perceptions of demand characteristics and presenting a good appearance, may tend to opt for the latter behavioral alternative. 
Contrary to expectation, there was no strong evidence that nonvolunteers were any more responsive to ego-defensive cues than volunteers, although there was at least a hint of this possibility. Given these experimental findings, what can be said now about the pervasive influence of voluntarism as a potential artifact-independent variable? In view of the fact that several hundred studies have used the verbal conditioning paradigm on volunteer subjects (Resnick and Schwartz, 1973), and who knows how many hundreds of social influence studies have used the before–after opinionnaire design, the fact of having isolated this artifact in these broad domains of experimental inquiry should at least suggest the variable’s potential pervasiveness. However, to focus on the factor of pervasiveness to the exclusion of other important criteria may be misleading in the implications for the conceptualization of artifacts in general. Thus, in the next chapter we argue that artifact stimuli may not typically function as autonomous variables able to be ranked according to their pervasiveness as independent sources of systematic error. Probably like any other variable in social psychology, artifact stimuli also interact in complex ways, their pervasiveness as
independent contaminants depending on what other factors are operating. This point will be emphasized in the next chapter in generating an integrative model of the artifact influence process in which volunteer status is postulated to be one of an array of intervening factors. Turning finally to the seventh study and the question posed earlier concerning which of three views best defines the customary role expectations of subjects for experimental research, those findings revealed elements of all three conceptions in the spectrum of role expectations that subjects associated with research participation. No single formulation was exclusively valid or invalid, although there was a strong indication that altruistic and evaluative situations, in that order, bore the closest resemblance to the target stimulus with substantially less similarity in role expectations noted between situations tapping obedience to authority. These results provide compelling support both for Orne’s contention that the typical subject associates altruistic expectations with the experience of participating in a psychology experiment as well as for Rosenberg’s conception of the experience as being heavily weighted by evaluative role expectations. As regards differences between willing and unwilling subjects, the results showed that nonvoluntary subjects put significantly heavier emphasis on a work-oriented dimension than did volunteers, whereas the volunteers placed more weight on an affective dimension. That nonvolunteers tended to amplify the distinction between situations involving work and nonwork activities and judged research participation more as a work-oriented experience seems consistent with a finding by Straits and Wuebben (1973) that it is the negative aspects of experimental participation that are likely to be more salient for nonvoluntary subjects (with the reverse result occurring among volunteers). 
The collective findings may help to explain why nonvolunteers can be so unenthusiastic about participating as research subjects as well as their generally sluggish behavior when they are unpaid, captive subjects; that is, being compelled to work when there is no personal monetary gain attached to the effort could be seen as promoting uncooperative behavior in subjects who are perhaps already more attuned to the negative aspects of the ‘‘work’’ situation.
6 An Integrative Overview
We have been discussing how the subject’s volunteer status may determine his role enactment and thereby affect the tenability of research findings based on that behavior. While it is probably true that all social-influence theories bear to some extent on this issue, William McGuire’s (1968a, 1969b) information-processing formulation has provided us with an illuminating vantage point from which to examine the interaction of factors such as voluntarism, demand clarity, and the like, in influencing subjects’ role behaviors. This hybrid model visualizes the social influence process as a six-step Markov chain (communication → attention → comprehension → yielding → retention → behavior), the dependent variable being influenceability, or what is broadly interpreted as a tendency to change in response to social pressure. Hence, altering one’s opinions on a matter-of-fact or a matter-of-taste is included, as are judgments and movements in reaction to group pressures and normative expectancies. The fundamental postulate of the theory is that influenceability is a positive function of two requisite mediational steps: effective reception of the stimulus (through attention and comprehension) and yielding to what was understood. Influenceability is affected insofar as the subject adequately receives the stimulus and yields to the point received, and any additional variables are thought to affect influenceability indirectly by acting on receptivity, yielding, or a residual factor representing the probability of all other relevant processes. Given that influenceability may be liberally generalized to encompass the subject’s reaction to demand characteristics, an abbreviated and slightly modified version of McGuire’s model can be fitted to the data on experimental artifacts. 
In the proposed view, the status of voluntarism as an artifact-independent variable is seen as one of an array of usually uncontrolled stimuli that operate on a few main intervening variables affecting demand compliance. We begin this chapter with an outline of the postulated artifact-influence model and conclude with a discussion of alternatives for dealing with the ensuing methodological problem. (A survey of general factors jeopardizing the validity of various experimental and quasi experimental designs can be found in the monograph by Campbell and Stanley, 1966, and more recently in a chapter by Cook and Campbell, 1974.)
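The postulate that influenceability depends on two requisite steps, reception and yielding, together with a residual factor, can be illustrated with a brief sketch. The multiplicative combination used here is an assumption made for illustration; the text above says only that influenceability is a positive function of the mediators, and the function and parameter names are hypothetical.

```python
def influenceability(p_reception, p_yielding, p_residual=1.0):
    """Probability of influence under an ASSUMED multiplicative combination.

    Each argument is the probability that the corresponding mediating step
    succeeds. Because both reception and yielding are requisite, a value of
    zero for either one forces the result to zero.
    """
    for p in (p_reception, p_yielding, p_residual):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    return p_reception * p_yielding * p_residual


# A subject who receives the stimulus clearly but is reluctant to yield,
# compared with one who would yield readily but barely receives it.
print(influenceability(0.9, 0.2))
print(influenceability(0.2, 0.9))
```

The point of the sketch is only that a variable acting on one mediator cannot compensate for complete failure at another, which is what “requisite” implies.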
The Artifact-Influence Model
The central thesis of the proposed view is that there are three mutually exclusive and exhaustive states of behavior that are the end products of three conjoint mediators, and artifact-independent variables affect the ultimate outcomes of studies by indirectly impinging on the behavioral states at any of the mediating points (Rosnow and Aiken, 1973). Table 6-1 defines the three mutually exclusive and exhaustive states of overt behavior corresponding to this liberalized conception of influenceability. In essence, a trichotomy of role enactments is postulated, consisting of compliant behavior, noncompliant behavior, and countercompliant behavior. It is theorized that these behavioral states are the end products of the mediating variables defined in Table 6-2 and that artifact-determining stimuli, such as the subject’s volunteer status, affect role behaviors only indirectly by acting on the appropriate mediating variable. Thus, a subject could adequately receive the demand characteristics operating in an experimental situation or some proportion thereof, or he could be unreceptive or unclear about what demands were operating. Second, his motivation to comply with such demands could range from acquiescence to counteracquiescence. Third, he might or might not have the capability of responding to the demand characteristics he received.

Table 6–1 Mutually Exclusive and Exhaustive Behavioral Influenceability States Pursuant to the Demand Characteristics of an Experiment

State of behavior            Notation   Description
Compliant behavior           B+         Cooperative behavior in capitulation to the demand characteristics of the situation.
Noncompliant behavior        B0         Behavior overtly unaffected by the demand characteristics of the situation.
Countercompliant behavior    B−         Behavior antithetical to the demand characteristics of the situation.

Table 6–2 Mutually Exclusive and Exhaustive Subject States for Each of Three Mediating Variables

Mediator      State                 Notation   Description
Receptivity   Adequate              R+         Subject effectively receives demand characteristics.
              Inadequate            R0         Subject fails to receive, or inadequately receives, demand characteristics.
Motivation    Acquiescent           A+         Subject is in an acquiescent mood pursuant to demand characteristics.
              Nonacquiescent        A0         Subject is not motivated to respond overtly to demand characteristics.
              Counteracquiescent    A−         Subject is in a counteracquiescent mood pursuant to demand characteristics.
Capability    Capable               C+         Subject is capable of manifesting his demand motivation behaviorally.
              Incapable             C0         Subject is incapable of manifesting his demand motivation behaviorally.

[Figure 6–1: tree diagram not reproduced. Demands branch into R+ and R0; R+ branches into A+, A−, and A0; A+ and A− each branch into C+ and C0. Terminal states by path: [1] R+, A+, C+ → B+; [2] R+, A+, C0 → B0; [3] R+, A−, C+ → B−; [4] R+, A−, C0 → B0; [5] R+, A0 → B0; [6] R0 → B0.]
Figure 6–1 Sequences of determining states leading to compliance, noncompliance, and countercompliance (after Rosnow and Aiken, 1973)

Very generally speaking, the three motivational states in Table 6-2 correspond to Willis’ (1965) familiar concepts of conformity, anticonformity, and
independence, although with regard to the third motivational state, no distinction will be drawn between unmotivated and motivated nonacquiescence as both should lead to the same behavioral outcome (cf. Kelvin, 1971; Stricker, Messick, and Jackson, 1970). The theoretical sequences leading to the three behavioral outcomes are depicted in Figure 6-1 in the form of a tree diagram. There is only one branch of the tree that leads to compliance with demand characteristics. This branch, labeled 1, requires adequate reception, a positive motivation, and the capability to pursue that motivation. There is also only one branch that leads to counter-compliance with demand characteristics. This branch, labeled 3, requires adequate reception, a counteracquiescent motivation, and the capability to express that negative motivation behaviorally. All the remaining paths lead to noncompliance. Path 6 is limited by the receptivity state; path 5, by the motivational state; and paths 2 and 4, by the capability state. If the subject failed to perceive any demand characteristics, he could not possibly act on them (path 6). If he received demand characteristics but was unmotivated to act on them, they could not affect his experimental behavior (path 5). If the subject perceived demand characteristics, was motivated in an acquiescent direction, and yet was unable to comply, the demand characteristics could not influence his experimental behavior (path 2). Neither could they distort his experimental reaction if he received them and was motivated in a counteracquiescent direction but lacked the capacity to manifest his negative motivation behaviorally (path 4). The complete artifact-influence sequence is thus conceptualized as a five-step chain when compliance or countercompliance is the end product.
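The six paths just described amount to a small decision procedure, which can be written out as a sketch. The state labels follow Tables 6-1 and 6-2 (with “A-” and “B-” standing in for the counteracquiescent and countercompliant states), and the path numbers follow Figure 6-1; the function name and the string encoding of states are assumptions made for this illustration.

```python
from itertools import product

RECEPTIVITY = ("R+", "R0")
MOTIVATION = ("A+", "A0", "A-")
CAPABILITY = ("C+", "C0")


def behavior(receptivity, motivation, capability):
    """Map the three mediator states to (behavioral state, path number)."""
    if (receptivity not in RECEPTIVITY or motivation not in MOTIVATION
            or capability not in CAPABILITY):
        raise ValueError("unknown mediator state")
    if receptivity == "R0":
        return "B0", 6  # demands not received: nothing to act on
    if motivation == "A0":
        return "B0", 5  # received, but no motivation to respond
    if motivation == "A+":
        # acquiescent: compliance only if the subject is capable
        return ("B+", 1) if capability == "C+" else ("B0", 2)
    # counteracquiescent: countercompliance only if the subject is capable
    return ("B-", 3) if capability == "C+" else ("B0", 4)


# Enumerate every combination of mediator states and its outcome.
for r, a, c in product(RECEPTIVITY, MOTIVATION, CAPABILITY):
    print(r, a, c, "->", behavior(r, a, c))
```

Enumerating all twelve combinations shows that exactly one leads to compliance and exactly one to countercompliance; every other combination collapses into noncompliance, as the tree diagram indicates.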
Steps in the Artifact Influence Process
Origins of Demand Characteristics
In coining the term demand characteristics (which comprise both uncontrolled explicit and implicit task-orienting cues), Orne (1962a, p. 779) speculated on their varied origins in rumors and campus scuttlebutt (cf. Wuebben, 1967; Taub and
Farrow, 1973), in information conveyed during the original solicitation for subjects, in the person of the experimenter and the experimental setting, and in the experimental procedure itself. It was shown in the preceding chapter that when an attitude questionnaire is given twice with some intervening persuasive communication, this may convey manipulatory intent, a type of implicit demand characteristic also anticipated by Orne. The subject’s preconceptions of the experimental situation as a function of his prior experimental experiences could be another unexpected source of demand characteristics—as, for example, in the subject’s preconceived ideas about the characteristics of hypnotic behavior (Orne, 1959, 1970). (Recall in Chapter 4 our mention of the experiment in which catalepsy of the dominant hand was concocted as a characteristic of hypnosis.) Other sources of demand characteristics could be the experimenter’s sex and scientific experience, his personality and expectations, and his modeling behavior (Kintz, Delprato, Mettee, Persons, and Schappe, 1965; Rosenthal, 1966, 1967, 1969). Modeling cues may arise when the investigator projects his own views onto the subject. While it is difficult to predict the direction and magnitude of modeling effects in laboratory experimentation, in survey research they are usually positive but variable in magnitude (Rosenthal, 1966, p. 112 ff.). The interpretation of the variability of direction in modeling effects that seems best supported by the evidence, although still not firmly established, is that a happier, more pleasant, less tense experimenter will tend to model his subjects negatively, whereas less pleasant, more tense experimenters will produce positive modeling effects. Exactly why this should be is not clear, although it is possible to speculate on the phenomenon within the framework of reinforcement theory. 
The emotional tone conveyed by the experimenter could be a stimulus source of satisfaction or dissatisfaction that, by (unwittingly) rewarding or punishing the subject for his experimental behavior, also helps to shape it in a particular fashion (cf. Jones and Cooper, 1971). If there is a methodological lesson to be learned from the informative research on modeling behavior in children, it is that demand characteristics are cognitively processed and that compliance will depend on the cognitive and emotional linkages with such behavior in the subject’s mind (cf. Bryan and London, 1970; Rosenhan, 1969; Rosenhan and White, 1967). However, in order to be cognitively processed, first the cues must be adequately received.
Receptivity
A number of studies point tentatively to such factors as the scientific atmosphere of the experimental setting and the sheer amount of the subject’s experience or prior knowledge as possible determinants of the facility for recognizing and interpreting demand characteristics (Goodstadt, 1971; Holmes, 1967; Holmes and Applebaum, 1970; Milgram, 1965; Stumberg, 1925; cf. discussion by Weber and Cook, 1972). The experimenter’s professional manner may serve to legitimize for the subject his cooperative behavior (Rosenthal, 1966, p. 258), and recent findings, which lend support to this idea, suggest that the practiced experimenter, by his professional behavior, will prompt his subjects to perceive the experiment in a more scientific light (Straits, Wuebben, and Majka, 1972). Silverman, Shulman, and Wiesenthal (1970) had introductory psychology students participate in two experiments. In the first, they were either deceived and
debriefed or else took part in an ordinary memory study without deception. For their second participation, which occurred one week later, the subjects were all given a series of tests to check on how compliant with demand characteristics they now were. What is relevant to the present discussion from the standpoint of the receptivity mediator is that the deception apparently sensitized the subjects to possible ulterior evaluation cues in the second experimental session. In conformity research, suspiciousness of deception (which could be enhanced by communication between subjects) appears to predominate among young male subjects and to correlate positively with social desirability, ascendance, self-esteem, and intelligence (Stricker, Messick, and Jackson, 1967). Other work suggests an even more basic and intuitive principle for the operation of this initial mediator in the artifact-influence chain, which is that insofar as demand characteristics are simple and straightforward, they are likely to be clearly received (Adair and Schachter, 1972; Silverman, 1968). However, research also implies that role behaviors may be as much a function of the blatancy as the clarity of demand characteristics (a point to which we return shortly).
Motivation
The studies in Chapter 5 suggested that the second mediator in the artifact-influence chain, the subject’s motivation, may be influenced by his original willingness to participate. In that series of studies in which volunteer status was the independent variable and the clarity of demand characteristics was held roughly constant across levels of verbal voluntarism in each experiment, willing subjects exhibited role behaviors implying greater acquiescence than captive nonvolunteers. 
It is difficult to explain such behavior simply as a self-selection of conformist people in volunteer situations, as the relationship between voluntarism and conformity (Chapter 2), while not conclusive, suggested that volunteers were, if anything, less conforming than nonvolunteers. Self-selection on the basis of high social desirability or need for approval is, however, a very real possibility. A cognitive dissonance hypothesis that volunteering induces self-justification needed to cooperate is one plausible alternative. Silverman (1965) has developed this interesting idea, reasoning that subjects are motivated to make the experimental outcome successful in order to justify the time and effort they will have had to expend. Hence, any reward for participating in the experiment, whether it be monetary or psychological, should reduce this dissonance and the need to comply with demand characteristics. Unfortunately, the evidence on this score is mixed. Weitz (1968) did a standard experimenter-expectancy manipulation using Harvard undergraduates, in which paid and unpaid voluntary and nonvoluntary subjects were exposed to positive and negative expectancy demands. As one would predict from cognitive dissonance theory, compliance with demand characteristics was greatest among the unpaid volunteers. In fact, there was a strong boomerang effect for the paid volunteers. However, when a similar experiment was carried out by Mackenzie (1969) using high school students, his results were quite the opposite of a cognitive dissonance outcome. In this case, the paid volunteers showed considerable demand compliance, almost as much as a group of unpaid nonvolunteers. Silverman also reminds us that the cognitive dissonance interpretation is contrary to Ward and Sandvold’s (1963) notion that salary increases the subject’s sense of obligation and thus makes him more amenable to the experimenter’s demands.
Certainly there are other plausible explanations besides cognitive dissonance theory for volunteers’ usually strong predilection to comply with demand characteristics. From Bem’s (1970) theory that behavior determines attitudes, one might posit that the act of volunteering could be cognitively interpreted by subjects as overt evidence of their own acquiescence, a cognition that feeds on itself and to which they submit in responding to the demand characteristics of the situation. Volunteering may be the foot in the door that opens the way to further compliance (cf. Freedman and Fraser, 1966). Alternatively, given that altruistic role expectations may be strongly associated with psychological research participation in our culture (cf. final study in Chapter 5), perhaps volunteers are compliant as a function of salient social expectancies. Since, from another point of view, helping behavior can be seen as instrumentally rewarding in terms of the psychological benefits accrued by the donor (cf. Foa, 1971; Handfinger, 1973; Weiss, Buchanan, Alstatt, and Lombardo, 1971), possibly demand compliance serves a similar function for some research subjects (cf. Eisenman, 1972). As long as there is not a conflict between demand compliance and the desire to project a favorable image, the fact of having been made to feel apprehensive about being evaluated also seems to bolster acquiescence tendencies. Minor (1970; cf. Johnson, 1973a) succeeded in producing experimenter expectancy effects only when his subjects were made to feel ego-involved in their performances. Henchy and Glass (1965) found that the presence of an audience enhanced the emission of dominant responses at the expense of subordinate responses only under conditions where the audience was perceived as an evaluative element of the situation. 
Several studies have shown that anonymous subjects are less compliant with demand characteristics than subjects who can be individually identified by the experimenter (Rosnow, Goodstadt, Suls, and Gitter, 1973; Rosnow, Holper, and Gitter, 1973; Silverman, 1968). If all these different manipulations were linked to Rosenberg’s (1965, 1969) concept of evaluation apprehension, but merely involved different levels of evaluation anxiety, the collective results might imply a monotonic relationship with acquiescence when task- and ego-orienting cues are operating in harmony. However, when evaluation apprehension and demand characteristics are strongly in conflict, as in the sixth study in Chapter 5 and in the experiment by Sigall, Aronson, and Van Hoose (1970; see also Page, 1971a), this seems to dampen acquiescence, with deceptions sometimes increasing the tendency for favorable self-presentations and decreasing demand compliance (Silverman, Shulman, and Wiesenthal, 1970). The experiment by Resnick and Schwartz (Chapter 4), when viewed together with the verbal-conditioning results with informed and naïve volunteers and nonvolunteers in Chapter 5, implies another condition in which negative motivations may prevail. Subjects who feel their freedom to act is constrained by demand characteristics may approach the experimental situation from a counteracquiescent set. Thus, in a follow-up to Horowitz’s study on fear arousal, discussed in Chapter 4, Horowitz and Gumenik (1970) systematically replicated the procedure in light of implications of Brehm’s (1966) reactance theory, which would predict boomerang effects for subjects denied freedom of choice. This time, volunteers and nonvolunteers for a previous study were either given or not given an opportunity to select the particular required experiment in which to participate. The subjects who were allowed a choice showed greater acceptance of the recommendations of the fear appeal as the level of
emotional arousal was increased from low to high, whereas the nonvolunteers who hadn’t any freedom in the selection of experiments reacted in the opposite way. In other research on demand motivation, Rubin and Moore (1971) found suspicious subjects with low authoritarian scores likely to react against perceived demands, and there is a hint that the emotional tone of subjects’ attitudes toward psychology may correlate with their motivational set when participating in a psychological experiment (Adair, 1969, 1970; Adair and Fenton, 1970). There are other variables as well which may ultimately be shown to have a depressive effect on motivations. Modeling cues were mentioned earlier as possibly affecting subjects’ instrumental role behaviors (cf. Feldman and Scheibe, 1972; Silverman, Shulman, and Wiesenthal, 1972). Klinger (1967) showed that nonverbal cues from an experimenter who appeared more achievement-motivated elicited significantly more achievement-motivated responses from his subjects. Evidence of counterconditioning behavior was identified in subjects labeled as psychopathic or sociopathic (Cairns, 1961; Johns and Quay, 1962; Quay and Hunt, 1965), an effect that could perhaps also generalize to anti-establishment role behaviors in the research situation. The experimenter’s manner and temperament, the subject’s suspicion about the nature of the research or his frustrating earlier experiences, and his perception of role expectancies—these are promising candidates and await more definitive examination (cf. Brock and Becker, 1966; Cook, Bean, Calder, Frey, Krovetz, and Reisman, 1970; Fillenbaum, 1966; Grabitz-Gniech, 1972; Gustafson and Orne, 1965; Marquis, 1973; McGuire, 1969a; Silverman and Kleinman, 1967; Silverman and Shulman, 1969, 1970; Silverman, Shulman, and Wiesenthal, 1970). 
Capability
Residual factors affecting the subject’s capability to enact a particular role behavior could include specific situational determinants having a pervasive effect on the entire subject pool as well as idiosyncratic characteristics (cf. Rosenzweig, 1952). If a critical response is beyond the subject’s reaction threshold, he should be incapable of responding to demand characteristics whatever his levels of receptivity and motivation. By the same token, if his latitude of movement was artificially restrained by observational or measurement boundaries, demand compliant behavior should be nil. A subject in a before–after attitude experiment might end up noncompliant, for example, if his pretest response were so extreme that he could not respond more extremely in the direction of the manipulation on the posttest. A subject might also be biologically incapable of complying with the experimenter’s wishes or expectations. Not all volunteers will be capable of conforming to some coercive demands no matter how intensely motivated and knowledgeable they are. Although it has been documented that some nonhypnotized control subjects can perform the ‘‘human-plank trick’’ on being instructed to do so by a competent investigator (Barber, 1969; Orne, 1962b), no doubt even the most acquiescent volunteer would fail to comply with this experimental demand if he lacked the sheer physical stamina to lie suspended in midair with only his head and feet resting on supports. However, because the self-selection process in research volunteering tends to weed out incapability, just as incompetence is weeded out by self-selection in volunteering for extraexperimental altruistic acts (Kazdin and Bryan, 1971), this mediating factor may be of little practical consequence for understanding the
antecedents of role behaviors in psychological experimentation. Furthermore, most experienced investigators are usually attuned to these obvious limitations and are very careful to select research settings that are well within the bounds of their subjects’ capabilities. The settings that have been traditionally chosen by laboratory researchers may favor not only elicitation of the desired critical responses within the subjects’ normal capabilities but also any undesired responses to demand characteristics lurking in those situations.
Behavior
In the parent formulation it is presumed that independent variables relate to influenceability primarily through the combined mediation of receptivity and yielding (McGuire, 1968b). The artifact-influence model visualizes artifact stimuli as relating to demand behaviors mainly through the combined mediation of receptivity and motivation. For example, Figure 6-2 expresses the postulated relationship between the demand clarity factor briefly discussed in the preceding chapter and role behavior pursuant to the demand characteristics of the situation. The inverted U-shaped function alluded to earlier can now be envisioned as the product of a positive association with receptivity and a negative association with motivation; that is, demand receptivity is postulated as increasing with an increase in demand clarity, while acquiescence is postulated as decreasing with an increase in demand clarity. When cues are so patently obtrusive as to restrict seriously the subject’s latitude of movement, the resulting increase in psychological reactance should have a dampening effect on motivations, even to the point of arousing a counteracquiescent set. We assume that other artifact stimuli affect role behaviors in a similar combinatorial fashion by indirectly impinging on these primary mediating states.
Figure 6–2 Postulated curvilinear relationship (shown by a solid line) between demand clarity and demand compliance. Horizontal axis: clarity of demand characteristics; vertical axis: probability of demand compliance; curve labels: receptivity mediator, motivation mediator.
The factor of evaluation apprehension can also be reinterpreted with the help of the combinatorial rule. Assuming that receptivity would remain roughly level throughout, where task- and ego-orienting cues are operating harmoniously evaluation apprehension may bolster motivations to comply. In this case, demand compliance would be congruent with the subject’s desire to project a favorable image and there should be a positive association with motivation (see Figure 6-3a). Alternatively, when incongruent task- and ego-orienting cues coexist, evaluation apprehension may have a depressive effect on motivations. In this case, demand compliance would be incongruent with the subject’s desire to project a favorable image, and following from data already discussed, ego-defensive needs should emerge as the predominant motivating force. With receptivity again at a roughly constant level, there would be a negative association with motivation (Figure 6-3b).
Figure 6–3 Postulated curvilinear relationships (shown by solid lines) between evaluation apprehension (EA) and demand compliance when (a) task- and ego-orienting cues are operating harmoniously and (b) task- and ego-orienting cues are operating nonharmoniously. Horizontal axes: (a) level of congruent EA, (b) level of incongruent EA; vertical axis: probability of demand compliance; curve labels in each panel: motivation, receptivity.
One can infer the simple relationship between these mediating states and demand behavior in a study by Koenigsberg (1971) using the verbal-conditioning paradigm, although the results do not bear directly on our postulated combinatorial effects. In this study, subjects were divided into four groups on the basis of their responses to a postexperimental awareness-motivation questionnaire. Some of the subjects in each group had been conditioned by an experimenter with positive expectancies for their behavior and the others, by an experimenter having negative expectations, although we are concerned here only with the classification main effect. There are some difficulties in interpretation because of possible ambiguities in his grouping of responses. 
However, the four groups might be classified as (1) subjects who were aware of the demand characteristics and were positively motivated (e.g., ‘‘I felt that I was doing what he wanted me to’’ and ‘‘I thought it might be of some benefit to him.’’); (2) subjects with a counteracquiescent motivational set (e.g., ‘‘I did not wish him to control my thinking.’’); (3) a conglomerate of nonacquiescent subjects and subjects operating on poor receptivity (e.g., ‘‘I didn’t recognize a pattern and didn’t worry about following a pattern’’ and ‘‘I started to make it balance out at times.’’); and (4) subjects who reported that they were unaware of the demand characteristics. The differences in I–WE responding between the four groups were statistically significant at the .01 level, and the order of magnitude of correct responses could
probably have been predicted by the artifact-influence model. Among the unaware subjects (group 4), the level of correct responding was 39.3. The acquiescent group (group 1) surpassed this ‘‘baseline’’ mean, giving an average level of correct responding of 52.6; the recalcitrant subjects (group 2) were below the baseline with an average of 30.7; and group 3 fell at roughly the same level of responding as group 4 or 41.4.
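The combinatorial rule and the Koenigsberg comparison can be made concrete with a small numerical sketch. It is illustrative only: the linear forms chosen for receptivity and motivation, the 0-to-1 clarity scale, and all function and variable names are assumptions introduced here rather than part of the artifact-influence model itself; only the four group means are taken from the text.

```python
# Illustrative sketch only. The functional forms and names below are
# assumptions made for this example; the four group means are the
# values reported in the text for Koenigsberg (1971).

def receptivity(clarity):
    """Assumed to rise linearly with demand clarity (0.0 to 1.0)."""
    return clarity


def motivation(clarity):
    """Assumed to fall linearly with demand clarity (0.0 to 1.0)."""
    return 1.0 - clarity


def demand_compliance(clarity):
    """Combinatorial rule: compliance as the product of the two mediators."""
    return receptivity(clarity) * motivation(clarity)


# Evaluate the product over a coarse grid of clarity values and find
# where it is largest (an inverted U under the assumed forms).
grid = [i / 10 for i in range(11)]
curve = [demand_compliance(c) for c in grid]
peak_clarity = grid[curve.index(max(curve))]

# Reported mean levels of correct responding, by classification group.
group_means = {
    "1 aware, positively motivated": 52.6,
    "2 counteracquiescent": 30.7,
    "3 nonacquiescent or poorly receptive": 41.4,
    "4 unaware (baseline)": 39.3,
}
baseline = group_means["4 unaware (baseline)"]

# Deviation of each group from the unaware baseline.
deviations = {g: round(m - baseline, 1) for g, m in group_means.items()}

if __name__ == "__main__":
    print("clarity at peak compliance:", peak_clarity)
    for group, dev in deviations.items():
        print(f"{group}: {dev:+.1f}")
```

Under these assumed forms the product peaks at a clarity of 0.5. The deviations from the unaware baseline are +13.3 for group 1, -8.6 for group 2, and +2.1 for group 3, which matches the ordering described above.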
Inferential Validity as an Assessment Criterion
In cataloguing the predominant artifact stimuli that distort dependent measures in behavioral research, our assumption has been that these confounding factors (of which the subject’s volunteer status was considered to be one) operated via a few mediating variables. A practical advantage of this integrative position is that rather than having to concern oneself with juggling an unmanageable number of control procedures for specifically coping with each artifact stimulus separately, it is possible to conceptualize the artifact problem from a broader perspective and seek ways of establishing the tenability of research findings at a more workable level of analysis. The concept of inferential validity provides one convenient, practicable criterion for assessing the tenability of inferred causal relationships in behavioral research. The object of experimental research in psychology is to observe psychological processes in a precisely controlled standard situation where it is possible to manipulate some given independent variable or interfere carefully with some normally occurring relationship while holding everything else constant. To be able to draw legitimate inferences from such findings to situations outside the research setting presumes that the experimental conditions adequately reflect the actual processes under investigation. Questions of inferential validity ask whether conclusions drawn from a set of research data are also tenable in the corresponding naturalistic environment where the subject’s behavior is not influenced by situationally localized experimental cues (Rosnow and Aiken, 1973). 
Whether conclusions from experimental studies are inferentially valid should, according to this view, hinge fundamentally on the subject’s awareness of the nature of the situation and how his behavior is guided by this knowledge; that is, insofar as the subject is responsive to the investigative aura of the laboratory or field setting in which he finds himself, he is probably attuned to localized cues on the basis of which he can generate hypotheses about the aims or expectations of the research. In general, we are assuming that role-restricting experimental contingencies of the sort just discussed can initially come to bear only if the subject becomes aware of the special nature of the situation. His awareness may directly or indirectly stem from cues that he received before or during the treatment or experimental manipulation or that were contained somehow in the research setting itself. Awareness might also be the result of inadvertent measurement or observation cues or of forewarning by the investigator about some aspect of the setting because of the investigator’s ethical concerns. Since in actual practice manipulation and measurement are usually very closely intertwined in an obvious way, the same role-mediating contingencies could be operating throughout a research study. We turn now to procedures aimed either at bypassing the inferential invalidity problem or at partialing out systematic error that may have contributed to a biased experimental outcome.
Field Research with Unobtrusive Measures
Using very young or abnormally unsophisticated subjects, while it might lead to a low level of awareness, would also automatically limit the generality of the research data (Rosenzweig, 1933; Stricker, Messick, and Jackson, 1969, p. 349). An approach that aims to skirt the artifact problem while also encouraging a high degree of robustness has been to use techniques of disguised experimentation in natural, nonlaboratory settings (Campbell, 1969). The assumption is that by emphasizing naturalistic social demands and not laboratory demand characteristics, the subject’s motivations and capabilities pursuant to the latter will not become an issue affecting his responses to the former. Webb, Campbell, Schwartz, and Sechrest (1966) have provided an excellent treatment of this research strategy as well as a compelling rationale for the greater utilization of unobtrusive measures. Instead of a single, critical experiment on a problem, they have recommended a sequence of linked investigations, each of which examines some different substantive outcropping of the hypothesis. Through triangulation of the data, the experimenter can logically strip away plausible rival explanations for his findings. The field experimental approach was used by Rosenbaum (1956; Rosenbaum and Blake, 1955) in research on the prepotency of social pressures as a determinant of volunteering behavior. Several students at the University of Texas were seated at desks in a library. In one condition an accomplice seated at the same desk as a student agreed to participate in the experimenter’s research. In another condition the accomplice refused to participate; in the control condition the accomplice was absent. Significantly more students volunteered in the first condition than in the control, and significantly fewer students volunteered in the second condition than in the control. 
Thus, in a naturalistic experiment on experimental recruiting, it was demonstrated that witnessing another person volunteer or refuse to volunteer strongly influenced the observer’s own decision to volunteer. In other applications of this approach, Gelfand, Hartmann, Walder, and Page (1973) studied the kinds of people who report shoplifters by having a confederate actually stuff drugstore merchandise into her handbag while other customers were around. Doob and Gross (1968) explored an aspect of frustration–aggression by having someone drive up to a signal-controlled intersection and remain stopped for a time in order to provoke a horn-honking response from the car behind. Mann and Taylor (1969) studied positioning in long queues for football tickets, Batman shirts, and chocolate bars to investigate the effect of people’s motives upon their estimates of how many others were standing in line ahead of them. Milgram, Bickman, and Berkowitz (1969) examined the drawing power of different sized crowds on a busy Manhattan street by having confederates stop for a minute and look up at a sixth-floor window. Hartmann (1936) explored the effect on voting patterns of emotional and rational Socialist party leaflets distributed during a local election in Pennsylvania in which his own name was on the ballot. As one can readily discern, this research approach has inspired many ingenious studies (cf. Bickman and Henchy, 1972; Evans and Rozelle, 1973; Swingle, 1973), although, in general, field research has still seen comparatively limited usage among social psychologists (Fried, Gumpper, and Allen, 1973). However, while field experimentation accompanied by unobtrusive measurement is a methodology that certainly has the potential for contributing theoretically enlightening findings, it is also true in practice that it seldom permits the
degree of precision in the measurement of dependent variables afforded by laboratory research. Mindful of this possible limitation, other investigators have turned to the use of experimental deceptions as a way of capturing some degree of naturalistic realism in their research. When a good deception manipulation is effectively set into operation, it should arouse the subjects in much the same way they would be affected by a complex of naturalistic stimuli (cf. Aronson and Carlsmith, 1968; Rosenzweig, 1933, p. 346f). Laboratory Deceptions. Quite different kinds of deceptions were incorporated in some of the scenarios reported in the preceding chapter. It will be recalled that some subjects in the third exploratory study (on cognitive dissonance) were intended to believe that their university had commissioned the survey because the administration was interested in putting one of the ideas into practice the following year. In this case, the rationale for using deception was to determine whether it might function in the way a two-sided communication seemed to work in the previous exploratory study, in effect to obfuscate laboratory demand characteristics. Stricker (1967) discovered that nearly one-fifth of the research appearing in 1964 in four major social psychological journals involved some form of deception, and Menges (1973) found the same percentage in a survey of articles published during 1971 in a more varied assortment of psychological journals. Besides concerns about the ethicality of this approach to social inquiry, the question has been asked repeatedly whether researchers can really be certain about who the ‘‘true’’ deceiver is in their studies. It is important to have effective guidelines for evaluating the results of deception studies (cf. 
Orne, Thackray, and Paskewitz, 1972; Stricker, Messick, and Jackson, 1969), particularly inasmuch as subjects are often reluctant to confess their awareness of being deceived or of having prior information (Levy, 1967; Newberry, 1973; Taub and Farrow, 1973). Concerned about this problem, Orne (1969) proposed the use of ‘‘quasi-control’’ subjects for ferreting out the potential biasing effects of role-mediating variables in situations of this sort. Quasi-control subjects, who typically are from the same population as the research subjects, step out of their traditional role (the experimenter-subject interaction being redefined to make quasi-control subjects ‘‘co-investigators’’ instead of manipulated objects) and clinically reflect upon the context in which an experiment was conducted. They are asked to speculate on ways in which the context might influence their own and research subjects’ behaviors. One quasi-control approach employs research subjects who function as their own controls. The procedure involves eliciting from the subjects by judicious and exhaustive inquiry, either in a postexperimental interview or in an interview conducted after a pilot study, their perceptions and beliefs about the experimental situation without making them unduly suspicious or inadvertently cuing them about what to say. Recently, there has been some attention paid to testing different postexperimental inquiry procedures that seek to optimize the likelihood of detecting demand awareness (Page, 1971b, 1973). There is, of course, the recurrent problem of demand characteristics in the inquiry itself, a potential hazard that researchers will have to be mindful of when adopting this strategy. A second quasi-control procedure is preinquiry, in which control subjects are provided with information about the experiment that is equivalent to that available to an actual research subject and they are asked to imagine themselves to be research subjects (cf. Straits, Wuebben, and
Majka, 1972). Another alternative is like the second in that the experimental situation is simulated for control subjects whose task is to imagine how they might behave if the situation were real, but in this case blind controls and quasi-controls are simultaneously treated by an experimenter who the subjects know is unaware of their status. The present theoretical orientation suggests a bit of structure that could be imposed on the questions presented to a quasi-control subject, such as inviting him to comment on feelings of acquiescence and evaluation apprehension and on his perceptions of cues revealing the experimenter’s expectations. The main deficiency of the procedure is that it provides only a very rough estimate of awareness of demand contingencies, still leaving the investigator with the knotty problem of trying to determine the extent to which his dependent measure was actually contaminated by uncontrolled factors of the type described. In an effort to bypass this last problem, the ‘‘bogus pipeline’’ was conceived by Jones and Sigall (1971) as a testable paradigm for assuring more objectively valid data by directly incorporating deception in the dependent measurement. The format of the bogus pipeline is comparable to most self-rating attitude measures except that the subject is led to believe that a physiological monitoring device can reliably catch him when he is lying. Claims for the efficacy of this measurement paradigm currently rest on its power to attenuate social desirability responses in the expression of negative interpersonal sentiments (Sigall and Page, 1971, 1972). While assurances of inferential validity and measurement sensitivity are debatable at this stage (Jones and Sigall, 1973; Ostrom, 1973), the approach does provide an imaginative first step for trying to circumvent the artifact problem in laboratory behavioral measures of attitude and affect. 
In seeking alternatives to deception manipulations, other investigators have resorted to role play and laboratory simulations as a way of trying vicariously to duplicate naturalistic realism without sacrificing the precision afforded by experimental control. Role Play. An illustration of the use of role play was a study by Greenberg (1967) that replicated earlier social psychological findings on anxiety and affiliation. In a well-known series of experiments by Schachter (1959), subjects were allowed to wait with others or to wait alone before participating in a research project. For some subjects (the ‘‘high-anxiety’’ group) the anticipated project was described as involving painful electric shocks, while for others (the ‘‘low-anxiety’’ group) it was represented as involving no physical discomfort. Significantly more of the high-anxiety subjects chose to wait with others in a similar plight, a finding that inspired the conclusion that misery likes not only company but company that is equally miserable. Another finding in the research was that anxious firstborns and only children showed this propensity to affiliate with others more so than did laterborns. In Greenberg’s role-play study, the subjects were instructed to act as if the situation were real and they were audience to a scenario that was closely modeled after that employed in the original experiments. Although the statistical significance of Greenberg’s role-play findings was mixed, the direction of his results was completely consistent with the earlier experimental findings. There have been several more exploratory inquiries on role play since Greenberg’s study (Darroch and Steiner, 1970; Horowitz and Rothschild, 1970; Wicker and Bushweiler, 1970; Willis and Willis, 1970), and the issue has drawn considerable discussion pro and con. However, opinions remain divided on the
efficacy of role play as a dependable substitute for experimental research, some favoring it on moral or methodological grounds (Brown, 1962, p. 74; Kelman, 1967; Ring, 1967; Schultz, 1969) and others opposing it (Aronson and Carlsmith, 1968; Carlson, 1971; Freedman, 1969; McGuire, 1969; Miller, 1972). At this point, the exploratory findings are not sufficiently definitive to swing the consensus of scientific opinion one way or the other. Intuitively, one might speculate that role play will ultimately be shown to have only very restricted applicability as a methodological substitute for deception research. That there have already been differences discovered between role play and deception studies does not establish which methodology has the greater potential for inferential validity, since all the studies so far have concentrated on the comparative reliability of the two approaches. Even if role play should be preferred on ethical grounds (which is a debatable conclusion), this would not in and of itself guarantee that resulting data were inferentially valid. Perhaps the effectiveness of the role-play methodology can be improved by incorporating emotionally involving role enactments (e.g., Janis and Mann, 1965), by introducing incentives or other factors that might strengthen a subject’s personal commitment to his pretend behavior (cf. Boies, 1972; Elms, 1969), or by programming greater congruence between role demands and the personalities of the actors (Moos and Speisman, 1962; O’Leary, 1972). The issue of inferential validity is an empirical question, and while role play may prove to have some limited potential as an experimental substitute, the validity question certainly merits further careful inquiry before the role-play procedure is universally sanctioned (cf. Holmes and Bennett, 1974; Orne, 1971). Dual Observation. 
Up to this point we have been discussing some strategies for circumventing the inferential invalidity problem (1) by doing research in field settings with unobtrusive measures so that the subjects are blind to the investigative nature of the situation, (2) by employing laboratory deceptions having the capability of engrossing subjects in the same way that a complex of naturalistic social stimuli would, or (3) by using role simulations as a vicarious means of capturing naturalistic realism without sacrificing the precision of laboratory control. We also described two measurement approaches for dealing with this problem, one aimed at teasing out the potential biasing effects of role-mediating demand characteristics (Orne’s quasi-control procedure) and the other designed to avoid confounding artifacts by directly incorporating deception into the dependent measurement (Jones and Sigall’s bogus pipeline procedure). There is an alternative measurement paradigm that is more deeply rooted in the present theoretical position. It is the method of dual observation, which aims at estimating the inferential invalidity of experimental findings. The conception of dual observation is simply to ‘‘reobserve’’ the original critical responses outside the laboratory in an atmosphere where the subject is not cognizant of whatever experimental implications of his behavior could have affected his initial responding (Rosnow and Aiken, 1973). This does not mean that he must be unaware of being observed, only that he does not connect the fact of his now being observed with the experiment or the experimenter. In most cases, the difference between observations (minus the effects of random error or other nuisance variables) could provide an estimate of the invalidity that is a function of the totality of demand-associated artifacts operating in the experimental setting. 
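The arithmetic behind such an estimate can be sketched as follows. The scores are hypothetical, and the function name and the choice of a paired-difference summary are assumptions made for illustration rather than a procedure specified in the text. The sketch treats each subject as observed twice, once in the laboratory and once outside it, and reports the standard error of the mean difference as a rough index of random error.

```python
# Illustrative sketch with hypothetical data; not a procedure given in the text.
from math import sqrt
from statistics import mean, stdev


def dual_observation_estimate(lab_scores, reobserved_scores):
    """Return (mean paired difference, standard error of that mean).

    lab_scores: critical responses observed in the experimental setting.
    reobserved_scores: the same subjects' responses observed later, where
    the original demand contingencies should no longer be salient.
    """
    if len(lab_scores) != len(reobserved_scores):
        raise ValueError("each subject needs both observations")
    if len(lab_scores) < 2:
        raise ValueError("need at least two subjects")
    # Laboratory score minus reobserved score, subject by subject.
    diffs = [lab - out for lab, out in zip(lab_scores, reobserved_scores)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))


# Hypothetical scores for five subjects.
lab = [7, 6, 8, 5, 7]
outside = [5, 6, 6, 4, 6]
estimate, std_error = dual_observation_estimate(lab, outside)

if __name__ == "__main__":
    print(f"estimated invalidity: {estimate:.2f} (SE {std_error:.2f})")
```

A mean difference that is large relative to its standard error would suggest that demand-associated artifacts contributed to the laboratory responses. A difference near zero would be consistent with inferentially valid findings, within the limits of so small a hypothetical sample.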
In effect, the approach is to manipulate the subject’s response set, and in this respect it is similar to other useful measurement procedures also altering response
set. The use of anonymity in measurement or of a bogus pipeline deception are other good examples of inducing a special response set in subjects. Indeed, establishing a particular set that is inclined to produce more valid, reliable, or representative responses is a familiar idea that has been examined both by experimental and social psychologists. Brunswik (1956, pp. 93 ff.) discussed the possibility of separating subjects’ attitudes statistically within the specific framework of perception. The phenomenological research he described experimented with creating a naïve perceptual attitude and comparing the effects against an analytical attitude. In half the cases he had the subject take a critical, or ‘‘betting,’’ stance, and in half the cases he had the subject take an uncritical set. More recently, Jourard (1968, 1969) has argued that disclosure by experimenters should result in more honest responding by subjects, and Hood and Back (1971) found self-disclosure to the experimenter to be a prevalent tendency among volunteer subjects (cf. Cozby, 1973; Lyons, 1970, p. 25). However, what is distinctive about dual observation is that it is based on the notion of repeated measurement of the laboratory-dependent variable under conditions in which the original demand contingencies should no longer be affecting the subject’s response set. By way of illustration, a qualitative variation of this procedure was the multiple observation stratagem employed in a study of laboratory posthypnotic suggestion by Orne, Sheehan, and Evans (1968). Highly suggestible subjects who were under hypnosis and a control group that simulated hypnosis were given the suggestion that for the following 48 hours they would respond by touching their forehead every time they heard the word experiment. 
Initially, the experimenters tested the suggestion in the experimental setting, and then, to gauge the inferential validity of their laboratory findings, they observed the dependent variable again, this time unbeknownst to the subject presumably, by having a secretary in the waiting room confirm the time for which the subject was scheduled ‘‘to come for the next part of the experiment.’’ Later, she asked him whether it was all right to pay him ‘‘now for today’s experiment and for the next part of the study tomorrow.’’ On the following day she met the subject with the question, ‘‘Are you here for Dr. Sheehan’s experiment?’’ One might also imagine dual observation being used to test the inferential validity of a verbal-conditioning interview, a laboratory opinion-change procedure, or any of a wide variety of experimental paradigms. By reobserving the critical responses under conditions where the original theoretical contingencies were no longer salient or relevant or where they were unclear, absent, or ignored, it should be possible to assess the inferential validity of the original experimental behaviors. Insko (1965) studied verbal conditioning by having his assistants telephone students at the University of Hawaii and read statements to them about the possibility of celebrating a springtime Aloha Week. Half the subjects were reinforced with approval (‘‘good’’) whenever they gave a positive response, and the other half were positively reinforced whenever they gave a negative response. One week later, an item about Aloha Week was surreptitiously embedded in a ‘‘Local Issues Questionnaire’’ that undergraduates filled out during class. Aronson (1966) experimented with using confederates to coax information about an experimental participation from their friends. Evans and Orne (1971) spied on subjects when the experimenter had left the room by employing an old, but little used, method of observation. 
Instead of the all too familiar one-way mirror, they watched the subject from behind a framed silk-screen painting that fitted into the natural decor of the room. The painting, which covered a hole in the wall,
acted as a one-way screen when the observer looked from the dark to the lighter side. Variations of these and other techniques could easily be adapted for use in assessing the inferential invalidity of experimental findings. There are, of course, ethical implications to be considered in the use of dual observation, and experimenters must weigh the advantages and disadvantages of the paradigm.
Representative Research Design
There is one final aspect to the methodological issue we have been discussing. It concerns procedures for assuring (or being able to specify) the generality of causal experimental relationships given that the relationships are in fact inferentially valid. We are referring to the robustness of causal relationships, which was a focus of attention in Chapter 4. In the specific case of the volunteer subject, the generality question translates into ways of assuring greater participation by nonvoluntary subjects. We mentioned in Chapter 1 that psychologists in academic institutions have achieved some measure of success in reducing volunteer sampling bias by the practice of requiring undergraduate majors to spend a specified number of hours serving as research subjects. However, which experiment he participates in is often left to the student in order not to encroach too much on his freedom of choice. While the practice of compulsory participation undoubtedly draws more nonvoluntary subjects into the overall sampling pool, their participation in any one experiment will not be a random event. Brighter students might sign up for learning experiments; gregarious students, for social interaction studies; men, for unconventional experiments. For this reason, and because the APA code may be interpreted by some as officially countermanding this practice, new procedures may have to be tested for reducing volunteer bias by motivating the more traditionally nonvolunteer types of individuals to volunteer for experimental participation. The sociological and statistical literature on survey research is rich with suggestions for ameliorating volunteer sampling error, although some of the correction procedures travel hand in hand with other methodological problems. 
For example, research by Norman (1948) and Wallin (1949) implies that an increase in the potential respondent’s degree of acquaintance with the investigator will lead to an increase in the likelihood of volunteering (see also Chapter 3). Similarly, an increase in the perceived status of the investigator should result in greater cooperation (Norman, 1948; Poor, 1967; see also Chapter 3). Although acquaintance with the experimenter might reduce volunteer bias by drawing more subjects into the sampling urn, it is conceivable that another type of error might be introduced as a result. Because there is evidence that experimenter-expectancy effects increase in likelihood the better acquainted that experimenters and subjects are (Rosenthal, 1966), in this case one would need to decide which type of error he would rather live with as well as which type could be more easily controlled. A more subtle aspect to this problem was defined by Egon Brunswik; it concerns restrictions imposed on the generality of research conclusions as a function of the nonrepresentativeness of the research design that specified how the data were to be collected and partitioned. In the classical design, all the variables on the stimulus side are held constant except for the independent variable, x, and effects are then observed on the subject side upon the dependent variable, y. Expressed in the semantics of
causal analysis, the rationale for the studies on artifact can thus be seen as proceeding from the assumption that the independent variable in the statement y = f(x) is contaminated by unspecified determinants also affecting the dependent variable, the essence of the artifact problem stemming from the indeterminacy of the specification of what actually constitutes x and not-x (Boring, 1969). In the case of robustness, the problem usually concerns the extent to which a causal relationship is generalizable on the subject side; such generalizability should be proportional to the adequacy of the sampling procedures of all the relevant population variables. Brunswik (1947, 1955), however, pointed out another interesting issue that is a concomitant of the latter problem. Since research designs drawn from the classical design may impose restrictions on stimulus representativeness, causal generalization may also be limited as much on the stimulus side as on the subject side. To correct for this deficiency, Brunswik proposed that researchers sample from among stimuli and situations as well as from among subject populations. Hammond (1948, 1951, 1954) has discussed the utility of this approach as it applies to clinical experimentation in particular. To illustrate its present applicability, suppose that we wished to test the hypothesis that male and female volunteers and nonvolunteers are differentially responsive to the experimental expectations of male and female investigators. If we proceeded from the classical design, we might seek a representative sample of volunteer and nonvolunteer subjects of both sexes and then assign them to a male or female experimenter whose experimental orientation was prescribed beforehand using a counterbalanced procedure.
However, the extent to which any inferentially valid conclusions were generalizable to other subjects, other experimenters, and other experimental situations would be influenced by the fact that the only element representatively sampled was the subject variable. Since our design did not representatively sample from among populations of experimenters and situations, it could be hazardous to generalize beyond these meager limits. Generalizability could be further affected, however, if there were other relevant variables unrepresented that fell on neither the stimulus nor the subject side. For example, there could be biases in the interpretation of our data stemming from the ways in which psychologists of different political ideologies or different cultural or biographical backgrounds fulfilled their scientific role (Innes and Fraser, 1971; Pastore, 1949; Sherwood and Nataupsky, 1968). Indeed, one might conceive of a hierarchical regression of relevant distal and proximal variables worth representing. While it is obviously impossible to sample every potentially relevant variable in a study, Brunswik’s contribution was to help us avoid a double standard in scrutinizing subject generality while ignoring the generality of other significant elements of the research situation.
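The contrast between the classical design and Brunswik's proposal can be illustrated with a small assignment sketch. Everything here is hypothetical: the function names, the pools of experimenters and situations, and the simplification that the classical plan uses a single fixed experimenter and setting. It is only meant to show what "sampling on the stimulus side" looks like in an assignment plan, not how any cited study was run.

```python
import random

def classical_plan(subjects, experimenter, situation):
    """Classical design (simplified): only subjects vary; stimulus side is fixed."""
    return [
        {"subject": s, "experimenter": experimenter, "situation": situation}
        for s in subjects
    ]

def representative_plan(subjects, experimenters, situations, seed=0):
    """Representative design: experimenter and situation are also sampled."""
    rng = random.Random(seed)  # seeded so the plan can be reproduced
    return [
        {
            "subject": s,
            "experimenter": rng.choice(experimenters),
            "situation": rng.choice(situations),
        }
        for s in subjects
    ]

# Hypothetical pools, invented for illustration only.
subjects = [f"subject_{i}" for i in range(1, 9)]
experimenters = ["experimenter_A", "experimenter_B", "experimenter_C", "experimenter_D"]
situations = ["laboratory", "classroom", "field"]

fixed = classical_plan(subjects, experimenters[0], situations[0])
sampled = representative_plan(subjects, experimenters, situations, seed=1)
```

In the fixed plan any conclusion is tied to one experimenter and one setting; in the sampled plan the same subjects are spread over a sample of both, which is the sense in which generality on the stimulus side becomes something the data can speak to.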
7 Summary
Nearly 30 years have passed since Quinn McNemar cautioned researchers that the routine practice of sampling populations of convenience was causing our science of human behavior to become “the science of the behavior of sophomores.” Yet if recent data are representative, showing that from 70% to 90% of studies on normal adults have drawn subjects from the collegiate setting, that old warning may still ring true. Indeed, McNemar’s assessment may eventually prove too sanguine, as not only do the data suggest an increase in the percentage of research subjects who are college students, but recent ethical concerns make one wonder if the behavioral science of the near future may have to draw its data exclusively from an elite subject corps of informed volunteers. The extent to which a useful, comprehensive science of human behavior can be based upon the behavior of such self-selected and investigator-selected subjects is an empirical question of broad importance, and the preceding chapters have dwelled on various significant aspects of this problem.
Reliability of Volunteering
How reliable is the act of volunteering to be a research subject? If volunteering were an unreliable event, we could not expect to find any stable relationships between it and various personal characteristics of willing and unwilling subjects, nor would it be logical to presume to study the experimental effects of voluntarism when it can be presented as an independent variable in the research paradigm. However, there are many personal characteristics that do relate predictably to the act of volunteering. Moreover, statistical measures of reliability tend to be satisfactorily high whether the measure of volunteering is simply stating one’s willingness to participate or whether it is defined in terms of volunteering for sequentially different types of research tasks. From the available data, the median overall reliability for volunteering was computed as .52, which, by way of comparison, is identical to the median reliability of subtest intercorrelations as reported by Wechsler for the WAIS. For studies requesting volunteers for the same task the median reliability was .80, and for studies asking for volunteers for different tasks it was .42. Although the number of studies on the reliability of volunteering is not large (10 studies),
the findings do suggest that volunteering, like IQ, may have both general and specific predictors. Some people volunteer reliably more than others for a variety of tasks, and these reliable individual differences may be further stabilized when the particular task for which volunteering was requested is specifically considered.
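To make the notion of a reliability coefficient for a yes/no act concrete, the sketch below computes a phi coefficient (the Pearson correlation for two binary variables) between responses to two separate requests for volunteers. This is an assumption for illustration: the text does not say which index the ten studies used, and the data and the function name `phi_coefficient` are invented.

```python
import math

def phi_coefficient(first, second):
    """Phi coefficient between two binary (0/1) volunteering decisions."""
    if len(first) != len(second):
        raise ValueError("both requests must cover the same persons")
    pairs = list(zip(first, second))
    a = sum(1 for x, y in pairs if x and y)          # volunteered both times
    b = sum(1 for x, y in pairs if x and not y)      # first request only
    c = sum(1 for x, y in pairs if not x and y)      # second request only
    d = sum(1 for x, y in pairs if not x and not y)  # never volunteered
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom

# Hypothetical responses of ten persons to two requests (1 = volunteered).
request_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
request_2 = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
reliability = phi_coefficient(request_1, request_2)
```

With these invented responses eight of ten persons behave the same way on both occasions, and the coefficient comes out at .60.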
Assessing the Nonvolunteer
How do researchers determine the attributes of those who do not volunteer to participate? Several procedures have been found useful, and they can be grouped into one of two types, the exhaustive and the nonexhaustive. In the exhaustive method, all potential subjects are identified by their status on all the variables on which volunteers and nonvolunteers are to be compared. They may be tested first and then recruited, as when the investigator begins with an archive of data on each person and then, sometimes years later, makes a request for volunteers. For example, incoming freshmen are routinely administered a battery of tests in many colleges, and these data can then be drawn upon in future comparisons. Another variation is to recruit subjects and then test them. In this case, subjects for behavioral research are solicited, usually in a college classroom context, and the names of the volunteers and nonvolunteers are sorted out by using the class roster; shortly thereafter, a test or some other material is administered to the entire class by someone ostensibly unrelated to the person who recruited the volunteers. In the nonexhaustive method, data are not available for all potential subjects, but they are available for those differing in likelihood of finding their way into a final sample. Thus, one variation of the method uses the easy-to-recruit subject, although, because true nonvolunteers are not available, it requires extrapolation on a gradient of volunteering. The procedure in this case is to tap a population of volunteer subjects repeatedly so as to compare second-stage volunteers with first-stage volunteers, and so on. If repeated volunteers, for example, were higher in the need for social approval than one-time volunteers, then by extrapolating these data roughly to the zero level of volunteering it could be tentatively concluded that nonvolunteers might be lower still in approval need.
Another variation gets at the hard-to-recruit subject by repeatedly increasing the incentive to volunteer, a method frequently used in survey research to tease more respondents into the sampling urn. Still another variation focuses on the slow-to-reply subject. In this case only a single request for volunteers is issued, and latency of volunteering is the criterion for dividing up the waves of respondents, as well as the basis for extrapolating to nonrespondents.
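The extrapolation logic shared by these nonexhaustive variations can be sketched as a straight-line projection across waves of respondents. The sketch assumes a linear trend, equally spaced waves, and that nonrespondents sit one step beyond the last wave; none of these assumptions comes from the text, and the wave means are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired values."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct wave positions")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def extrapolate_nonrespondents(wave_means, position=None):
    """Project the trend across waves to a hypothetical nonrespondent position."""
    xs = list(range(1, len(wave_means) + 1))  # wave 1 = earliest or easiest to recruit
    slope, intercept = fit_line(xs, wave_means)
    if position is None:
        position = len(wave_means) + 1  # assumed: one step past the last wave
    return intercept + slope * position

# Hypothetical mean approval-need scores for three successive waves.
wave_means = [20.0, 18.0, 16.0]
estimate = extrapolate_nonrespondents(wave_means)
```

Here the means fall by two points per wave, so the projected value for nonrespondents is 14.0. Such a figure is only as good as the assumed trend, which is why the text treats conclusions of this kind as tentative.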
Volunteer Characteristics
Examining studies that used these various procedures for assessing the nonvolunteers, we drew the following conclusions about characteristics that may reliably differentiate willing and unwilling subjects:
Conclusions Warranting Maximum Confidence
1. Volunteers tend to be better educated than nonvolunteers, especially when personal contact between investigator and respondent is not required.
2. Volunteers tend to have higher social-class status than nonvolunteers, especially when social class is defined by respondents’ own status rather than by parental status.
3. Volunteers tend to be more intelligent than nonvolunteers when volunteering is for research in general but not when volunteering is for somewhat less typical types of research such as hypnosis, sensory isolation, sex research, small-group and personality research.
4. Volunteers tend to be higher in need for social approval than nonvolunteers.
5. Volunteers tend to be more sociable than nonvolunteers.
Conclusions Warranting Considerable Confidence
6. Volunteers tend to be more arousal-seeking than nonvolunteers, especially when volunteering is for studies of stress, sensory isolation, and hypnosis.
7. Volunteers tend to be more unconventional than nonvolunteers, especially when volunteering is for studies of sex behavior.
8. Females are more likely than males to volunteer for research in general, but less likely than males to volunteer for physically and emotionally stressful research (e.g., electric shock, high temperature, sensory deprivation, interviews about sex behavior).
9. Volunteers tend to be less authoritarian than nonvolunteers.
10. Jews are more likely to volunteer than Protestants, and Protestants are more likely to volunteer than Catholics.
11. Volunteers tend to be less conforming than nonvolunteers when volunteering is for research in general but not when subjects are female and the task is relatively “clinical” (e.g., hypnosis, sleep, or counseling research).
Conclusions Warranting Some Confidence
12. Volunteers tend to be from smaller towns than nonvolunteers, especially when volunteering is for questionnaire studies.
13. Volunteers tend to be more interested in religion than nonvolunteers, especially when volunteering is for questionnaire studies.
14. Volunteers tend to be more altruistic than nonvolunteers.
15. Volunteers tend to be more self-disclosing than nonvolunteers.
16. Volunteers tend to be more maladjusted than nonvolunteers, especially when volunteering is for potentially unusual situations (e.g., drugs, hypnosis, high temperature, or vaguely described experiments) or for medical research employing clinical rather than psychometric definitions of psychopathology.
17. Volunteers tend to be younger than nonvolunteers, especially when volunteering is for laboratory research and especially if they are female.
Conclusions Warranting Minimum Confidence
18. Volunteers tend to be higher in need for achievement than nonvolunteers, especially among American samples.
19. Volunteers are more likely to be married than nonvolunteers, especially when volunteering is for studies requiring no personal contact between investigator and respondent.
20. Firstborns are more likely than laterborns to volunteer, especially when recruitment is personal and when the research requires group interaction and a low level of stress.
21. Volunteers tend to be more anxious than nonvolunteers, especially when volunteering is for standard, nonstressful tasks and especially if they are college students.
22. Volunteers tend to be more extraverted than nonvolunteers when interaction with others is required by the nature of the research.
Situational Determinants
What are the variables that tend to increase or decrease the rates of volunteering obtained? The answer to this question has implications both for the theory and practice of the behavioral sciences. If we can learn more about the situational determinants of volunteering, we will also have learned more about the social psychology of social-influence processes and, in terms of methodology, be in a better position to reduce the bias in our samples that derives from volunteers being systematically different from nonvolunteers on a variety of personal characteristics. As with the previous list of conclusions, our inventory of situational determinants was developed inductively, based on an examination of a fairly sizable number of research studies:
Conclusions Warranting Maximum Confidence
1. Persons more interested in the topic under investigation are more likely to volunteer.
2. Persons with expectations of being more favorably evaluated by the investigator are more likely to volunteer.
Conclusions Warranting Considerable Confidence
3. Persons perceiving the investigation as more important are more likely to volunteer.
4. Persons’ feeling states at the time of the request for volunteers are likely to affect the probability of volunteering. Persons feeling guilty are more likely to volunteer, especially when contact with the unintended victim can be avoided and when the source of guilt is known to others. Persons made to “feel good” or to feel competent are also more likely to volunteer.
5. Persons offered greater material incentives are more likely to volunteer, especially if the incentives are offered as gifts in advance and without being contingent on the subject’s decision to volunteer. Stable personal characteristics of the potential volunteer may moderate the relationship between volunteering and material incentives.
Conclusions Warranting Some Confidence
6. Personal characteristics of the recruiter are likely to affect the subject’s probability of volunteering. Recruiters higher in status or prestige are likely to obtain higher rates of volunteering, as are female recruiters. This latter relationship is especially modifiable by the sex of the subject and the nature of the research.
7. Persons are less likely to volunteer for tasks that are more aversive in the sense of their being painful, stressful, or dangerous biologically or psychologically. Personal characteristics of the subject and level of incentive offered may moderate the relationship between volunteering and task aversiveness.
8. Persons are more likely to volunteer when volunteering is viewed as the normative, expected, appropriate thing to do.
Conclusions Warranting Minimum Confidence
9. Persons are more likely to volunteer when they are personally acquainted with the recruiter. The addition of a “personal touch” may also increase volunteering.
10. Conditions of public commitment may increase rates of volunteering when volunteering is normatively expected, but they may decrease rates of volunteering when nonvolunteering is normatively expected.
Suggestions for Reducing Volunteer Bias
Our assessment of the literature dealing with the situational determinants of volunteering led us to make a number of tentative suggestions for the reduction of volunteer bias. Implementing these suggestions may serve not only to reduce volunteer bias but also to make us more thoughtful in the planning of the research itself. Our relations with potential subjects may become increasingly reciprocal and human and our procedures may become more humane. Our suggestions follow in outline form:
1. Make the appeal for volunteers as interesting as possible, keeping in mind the nature of the target population.
2. Make the appeal for volunteers as nonthreatening as possible so that potential volunteers will not be “put off” by unwarranted fears of unfavorable evaluation.
3. Explicitly state the theoretical and practical importance of the research for which volunteering is requested.
4. Explicitly state in what way the target population is particularly relevant to the research being conducted and the responsibility of potential volunteers to participate in research that has potential for benefiting others.
5. When possible, potential volunteers should be offered not only pay for participation but small courtesy gifts simply for taking time to consider whether they will want to participate.
6. Have the request for volunteering made by a person of status as high as possible, and preferably by a woman.
7. When possible, avoid research tasks that may be psychologically or biologically stressful.
8. When possible, communicate the normative nature of the volunteering response.
9. After a target population has been defined, an effort should be made to have someone known to that population make the appeal for volunteers. The request for volunteers itself may be more successful if a personalized appeal is made.
10. In situations where volunteering is regarded by the target population as normative, conditions of public commitment to volunteer may be more successful; where nonvolunteering is regarded as normative, conditions of private commitment may be more successful.
An Ethical Dilemma
The 1973 APA code of ethical guidelines for research with human subjects indirectly raises an issue as to whether compliance with the letter of the law might jeopardize the tenability of inferred causal relationships. A verbal-conditioning study by Resnick and Schwartz was presented as illustrative of one horn of the dilemma in that volunteer subjects who were forewarned of the nature of the research along the lines of the APA standards showed a boomerang effect in the conditioning rate, a reaction quite contrary to our present laws of verbal learning. The ethical dilemma results from the likelihood that fully informed voluntarism, while it may satisfy the moral concern of researchers, may be contraindicated for the scientific concern in many cases; and experimenters must weigh the social ethic against the scientific ethic in deciding which violation would constitute the greater moral danger. Other research, by Sullivan and Deiker, was reassuring at least from the point of view of the societal concern in implying that professional psychologists may be ultraconservative watchdogs because of their stringent ethical views. However, if the current social temper of the times persists, an already complicated issue may become further compounded in the future as greater restrictions are placed on the kinds of recruitment conditions that are ethically permissible.
Robustness of Research Findings
In light of these developments emphasizing more use of fully informed consent procedures in recruiting subjects for behavioral research, it is important to be aware of the threat to the generalizability of data from using voluntary subjects exclusively. For example, to the extent that a pool of volunteers differed from the population at large, the resulting positive or negative bias would lead to overestimates or underestimates of certain population parameters. Suppose we relied entirely on volunteer subjects to standardize norms for a new test of social approval. Since volunteers tend to be higher in approval need than nonvolunteers, our estimated population mean would be artificially inflated by this procedure. Of course it is also possible to conceive of situations where population means were underestimated because only volunteers were used. The important point is that routinely sampling volunteer subjects could lead to estimates of population parameters that were seriously in error. Another way in which volunteer status can affect the generalizability of inferences has to do with the naturalistic motivations of human beings. Insofar as volunteer status as an organismic variable was related to the dependent variable of an investigation, the study of voluntarism could be the substantive basis for the research. From this slightly different perspective, volunteer status can be seen as merely another organismic variable like all the myriad variables affecting human behavior. A good case in point was Horowitz’s study on the effects of fear-arousal on attitude change. He noticed that research that used voluntary subjects tended to produce a positive relationship between fear-arousal and attitude change and that research using captive subjects tended toward the inverse relationship.
Reasoning that volunteers and nonvolunteers may be differentially disposed to felt emotional-persuasive demands, he set about to demonstrate the difference in persuasibility of
these two types of subjects by assigning them either to a condition of high or low fear-arousal. Consistent with this hypothesis, the attitude-change data clearly indicated that voluntarism was an important organismic variable for assessing the generality of the fear-arousal relationship. The important point in this case was the empirical emphasis on the fact that behavioral data must always be interpreted within the motivational context in which they occurred.
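The earlier point in this section, that norms standardized on volunteers alone will misestimate a population mean, can be written out as a short derivation. The symbols are introduced here for illustration and do not appear in the text: $p$ is the proportion of the population that would volunteer, and $\mu_V$ and $\mu_N$ are the means of volunteers and nonvolunteers on the measure in question.

```latex
\begin{align*}
\mu &= p\,\mu_V + (1 - p)\,\mu_N
  && \text{(population mean as a mixture of the two groups)} \\
\text{bias} &= \mu_V - \mu
  && \text{(a volunteer-only sample estimates } \mu_V\text{)} \\
 &= \mu_V - p\,\mu_V - (1 - p)\,\mu_N \\
 &= (1 - p)\,(\mu_V - \mu_N)
\end{align*}
```

On this reading the estimate is inflated when volunteers score higher than nonvolunteers (as with approval need), deflated when they score lower, and the size of the error grows with both the group difference and the share of the population that does not volunteer.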
The Artifact Problem
Artifacts are systematic errors stemming from specifiable uncontrolled conditions and can be traced to the social nature of the behavioral research process. Our own empirical study of volunteer artifacts has stressed their occurrence in experimental contexts, although to the extent that demand characteristics may be operating outside the laboratory in psychological research, there is reason to believe that volunteer status can also be isolated as a mediating source of nonexperimental artifacts. The way in which voluntarism can affect causal inferences in this case has to do with threats to the tenability of inferred causal relationships stemming from the occurrence of demand characteristics; and a series of experiments to demonstrate and to probe this effect was discussed. For example, it was possible to imagine the effects of subjects’ volunteer status interacting with other variables to increase either the likelihood of obtaining significant effects of an independent variable in a selected sample when such effects would not be significant in a more representative sample (type I) or the likelihood of obtaining no significant effects in a selected sample when such effects would be significant in a more representative sample (type II). These types of confounding were demonstrated when volunteers and nonvolunteers participated in an attitude-change experiment using the familiar four-group design by Solomon for determining a pretest-by-treatment interaction, and the results may also have helped to unravel a puzzle about why previous studies in the area had failed to demonstrate the biasing effect of pretesting.
Other research in this program of experiments used the verbal operant conditioning paradigm to suggest that volunteers may be more accommodating to demand characteristics than nonvolunteers, and this motivational difference was shown overall to have the capacity to distort the relationships emerging between independent and dependent variables. Contrary to expectation, however, there was no strong evidence that nonvolunteers were any more responsive to ego-defensive cues than volunteers, though there was at least a hint of this possibility.
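The pretest-by-treatment interaction examined in the Solomon four-group design can be sketched as a difference of differences computed separately for volunteers and nonvolunteers. The scores below are invented and the function name is hypothetical; the sketch only shows the shape of the comparison, not the analysis or results of the experiments summarized above.

```python
from statistics import mean

def pretest_by_treatment_interaction(posttests):
    """Return treatment effects with and without a pretest, and their difference.

    `posttests` maps (pretest_condition, treatment_condition) to posttest scores
    for the four cells of a Solomon four-group design.
    """
    m = {cell: mean(scores) for cell, scores in posttests.items()}
    effect_pretested = m[("pretest", "treatment")] - m[("pretest", "control")]
    effect_unpretested = m[("no_pretest", "treatment")] - m[("no_pretest", "control")]
    # A nonzero difference means pretesting changed the apparent treatment effect.
    return effect_pretested, effect_unpretested, effect_pretested - effect_unpretested

# Hypothetical posttest attitude scores, invented for illustration only.
volunteers = {
    ("pretest", "treatment"): [6.0, 8.0],
    ("pretest", "control"): [4.0, 4.0],
    ("no_pretest", "treatment"): [5.0, 5.0],
    ("no_pretest", "control"): [4.0, 4.0],
}
nonvolunteers = {
    ("pretest", "treatment"): [4.0, 6.0],
    ("pretest", "control"): [4.0, 4.0],
    ("no_pretest", "treatment"): [5.0, 5.0],
    ("no_pretest", "control"): [4.0, 4.0],
}

volunteer_result = pretest_by_treatment_interaction(volunteers)
nonvolunteer_result = pretest_by_treatment_interaction(nonvolunteers)
```

In these made-up data the interaction term is 2.0 for volunteers and 0.0 for nonvolunteers, so a sample made up only of one type of subject would give a different answer about the biasing effect of pretesting than a sample made up of the other. A real analysis would of course test the interaction for significance rather than compare raw contrasts.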
Motivational Elements
Other research was addressed to the question of the role expectations of voluntary and nonvoluntary subjects. Since verbal reports were sought that were not themselves blatantly confounded by the operation of demand characteristics, an indirect method of “projective” inquiry was used in which several hypotheses could be brought together to paint a multidimensional picture of subjects’ expectations for experimental research participation. Multidimensional scaling revealed altruistic and evaluative expectancies (in that order) to be the predominant role expectations
associated with research participation, thus supporting the theoretical views of both Orne and Rosenberg. The nonvolunteers tended to amplify the distinction between situations involving work and nonwork activities and judged research participation more as a work-oriented experience. Viewed in the light of findings by Straits and Wuebben that it is the negative aspects of experimental participation which are likely to be more salient for nonvoluntary subjects (with the reverse result occurring among volunteers), the collective data suggested one reason why nonvolunteers can be so unenthusiastic about participating as research subjects, as well as their generally sluggish task behavior when they are unpaid, captive subjects. Being compelled to work at a task where there is no personal monetary incentive attached to the effort may promote uncooperative behaviors in subjects who are perhaps already more attuned to the negative aspects of the “work” situation.
An Artifact-Influence Model
We spoke of pervasiveness in connection with volunteer artifacts; however, to focus on this factor to the exclusion of other important criteria may also be misleading in the implications for the conceptualization of artifacts in general. Artifact-producing stimuli may not typically function as autonomous variables capable of being rank-ordered according to their pervasiveness as independent sources of systematic error. Probably like any other variables, they also interact in complex ways, their pervasiveness as independent contaminants depending on what other factors are operating. This point was emphasized in presenting an integrative model of the artifact-influence process in which volunteer status was postulated to be one of an array of intervening factors. The theoretical model presumed three mutually exclusive and exhaustive states of behavior (compliance, countercompliance, and noncompliance with demand characteristics) and posited that artifact-independent variables, such as the subject’s volunteer status, affect the ultimate outcomes of studies by indirectly impinging on the behavioral states at any of three mediating points. How some combinations of artifact stimuli may relate to demand behaviors through the combined mediation of receptivity and motivation was shown. For example, the factor of evaluation apprehension could be reinterpreted with the aid of the combinatorial rule by assuming a positive association with motivation when task- and ego-orienting cues are operating harmoniously and a negative association with motivation when these cues are in conflict. Finally, various procedures were discussed either for circumventing the artifact problem or for teasing out any biasing effect when the difficulty cannot be avoided.
Appendix
Since the preparation of this book, we have gained access to a number of additional studies that provide information about characteristics of volunteers. Table A-1 lists these studies under the appropriate characteristics, in the order in which the characteristics are discussed in the book. Studies are listed as supporting a relationship if the result was significant at .05 or if a clear but nonsignificant trend was obtained. On the whole, the addition of these studies tends to increase our confidence in the pattern of volunteer characteristics described in our summary.
Table A–1 Additional Studies of Volunteer Characteristics
SEX
  Females volunteer more
    Ferree, Smith, and Miller (1973)
    Schaie, Labouvie, and Barrett (1973)
    Streib (1966)
  No difference
    Dreger and Johnson (1973)
    Loewenstein, Colombotos, and Elinson (1962)
SOCIABILITY
  Volunteers more sociable
    Donnay (1972)
    Loewenstein, Colombotos, and Elinson (1962)
  No difference
    Dreger and Johnson (1973)
EXTRAVERSION
  Volunteers more extraverted
    Burdick and Stewart (1974)
    McLaughlin and Harrison (1973)
    Silverman and Margulis (1973)a
  Volunteers less extraverted
    Ramsay (1970)
ACHIEVEMENT NEED
  Volunteers more achievement motivated
    Burns (1974)
APPROVAL NEED
  Volunteers more approval motivated
    Schofield (1974)
AUTHORITARIANISM
  Volunteers less authoritarian
    Loewenstein, Colombotos, and Elinson (1962)
    Silverman and Margulis (1973)a
CONVENTIONALITY
  No difference
    Dreger and Johnson (1973)
ANXIETY
  Volunteers less anxious
    Dreger and Johnson (1973)
PSYCHOPATHOLOGY
  Volunteers more maladjusted
    Burdick and Stewart (1974)
  No difference
    Dreger and Johnson (1973)
    Loewenstein, Colombotos, and Elinson (1962)
    McLaughlin and Harrison (1973)
    Streib (1966)
INTELLIGENCE
  Volunteers more intelligent
    Donnay (1972)
    Riegel, Riegel, and Meyer (1967)
    Schaie, Labouvie, and Barrett (1973)
  Volunteers less intelligent
    Maas (1956)
EDUCATION
  Volunteers better educated
    Loewenstein, Colombotos, and Elinson (1962)
    Streib (1966)
SOCIAL CLASS
  Volunteers higher in social class
    Politz and Brumbach (1947)
    Robinson and Agisim (1951)
    Speer and Zold (1971)
    Streib (1966)
  No difference
    Burdick and Stewart (1974)
    Loewenstein, Colombotos, and Elinson (1962)
    Riegel, Riegel, and Meyer (1967)
AGE
  Volunteers younger
    Ferree, Smith, and Miller (1973)
    Jones, Conrad, and Horn (1928)
    Loewenstein, Colombotos, and Elinson (1962)
    Riegel, Riegel, and Meyer (1967)
  No difference
    Robinson and Agisim (1951)
MARITAL STATUS
  No difference
    Robinson and Agisim (1951)
RELIGION
  Protestants volunteer more than Catholics
    Streib (1966)
  No difference
    Loewenstein, Colombotos, and Elinson (1962)
a Volunteers for personality assessment research compared to volunteers for color preferences research.
In addition to the studies of Table A-1 we also gained access to several studies relevant to our understanding of the situational determinants of volunteering. Thus, Robinson and Agisim (1951) found that enclosing 25 cents with their mailed questionnaires increased their returns substantially. They also found that when the return envelope bore a postage stamp rather than a business reply permit, returns were nearly 8% greater. Maas (1956) surveyed former university students and found that potential respondents who had been asked to volunteer before were significantly more likely to participate than were those who had never before been approached; the effect size was .35. In their research, Doob and Ecker (1970) compared the rates of volunteering obtained by an experimenter wearing or not wearing an eyepatch. When volunteering involved no future contact with the experimenter, the “stigmatized” experimenter obtained a much higher volunteering rate (69%) than did the “unstigmatized” experimenter (40%). However, when volunteering required further interaction with the stigmatized experimenter, his success rate dropped to 34% (compared to the 32% of the unstigmatized experimenter with whom further interaction was required). Politz and Brumbach (1947) reported that respondents to a survey of radio listening habits were reliably more likely to listen to the radio stations studied than were the nonrespondents. Miller, Pokorny, Valles, and Cleveland (1970) showed that former alcoholic patients were significantly more likely to cooperate with requests for follow-up interviews if their social, marital, and vocational adjustment was more satisfactory (effect size > .38). Finally, the very recent and interesting study by Parlee (1974) suggests the possibility that levels of estrogen or progesterone in women might be significant correlates of volunteering.
Book Three – The Volunteer Subject
References
Abeles, N., Iscoe, I., and Brown, W. F. (1954–1955). Some factors influencing the random sampling of college students. Public Opinion Quarterly, 18, 419–423.
Ad Hoc Committee on Ethical Standards in Psychological Research (1973). Ethical Principles in the Conduct of Research with Human Participants. Washington, D.C.: American Psychological Association.
Adair, J. G. (1970a). Pre-experiment attitudes toward psychology as a determinant of subject behavior. Paper read at Canadian Psychological Association, Winnipeg, May.
Adair, J. G. (1970b). Preexperiment attitudes towards psychology as a determinant of experimental results: Verbal conditioning of aware subjects. Proceedings of the 78th American Psychological Association Meeting, 5, 417–418.
Adair, J. G. (1972a). Coerced versus volunteer subjects. American Psychologist, 27, 508.
Adair, J. G. (1972b). Demand characteristics or conformity?: Suspiciousness of deception and experimenter bias in conformity research. Canadian Journal of Behavioural Science, 4, 238–248.
Adair, J. G. (1973). The Human Subject: The Social Psychology of the Psychological Experiment. Boston: Little, Brown.
Adair, J. G., and Fenton, D. P. (1970). Subjects’ attitudes toward psychology as a determinant of experimental results. Paper presented at the Midwestern Psychological Association meeting, Cincinnati, Ohio.
Adair, J. G., and Fenton, D. P. (1971). Subject’s attitudes toward psychology as a determinant of experimental results. Canadian Journal of Behavioural Science, 3, 268–275.
Adair, J. G., and Schachter, B. S. (1972). To cooperate or to look good?: The subjects’ and experimenters’ perceptions of each others’ intentions. Journal of Experimental Social Psychology, 8, 74–85.
Adams, M. (1973). Science, technology, and some dilemmas of advocacy. Science, 180, 840–842.
Adams, S. (1953). Trends in occupational origins of physicians. American Sociological Review, 18, 404–409.
Aderman, D. (1972). Elation, depression, and helping behavior. Journal of Personality and Social Psychology, 24, 91–101.
Adorno, T. W., Frenkel-Brunswik, E., Levinson, D. J., and Sanford, R. N. (1950). The Authoritarian Personality. New York: Harper.
Aiken, L. S., and Rosnow, R. L. (1973). Role expectations for psychological research participation. Unpublished manuscript, Temple University.
Alexander, C. N., Jr., and Knight, G. W. (1971). Situated identities and social psychological experimentation. Sociometry, 34, 65–82.
Alexander, C. N., Jr., and Sagatun, I. (1973). An attributional analysis of experimental norms. Sociometry, 36, 127–142.
Alexander, C. N., Jr., and Weil, H. G. (1969). Players, persons, and purposes: Situational meaning and the prisoner’s dilemma game. Sociometry, 32, 121–144.
Alexander, C. N., Jr., Zucker, L. G., and Brody, C. L. (1970). Experimental expectations and autokinetic experiences: Consistency theories and judgmental convergence. Sociometry, 33, 108–122.
Allen, V. L. (1966). The effect of knowledge of deception on conformity. Journal of Social Psychology, 69, 101–106.
Alm, R. M., Carroll, W. F., and Welty, G. A. (1972). The internal validity of the Kuhn-McPartland TST. Proceedings of the American Statistical Association, pp. 190–193.
Altus, W. D. (1966). Birth order and its sequelae. Science, 151, 44–49.
Alumbaugh, R. V. (1972). Another “Malleus maleficarum”? American Psychologist, 27, 897–899.
Anastasiow, N. J. (1964). A methodological framework for analyzing non-responses to questionnaires. California Journal of Educational Research, 15, 205–208.
Argyris, C. (1968). Some unintended consequences of rigorous research. Psychological Bulletin, 70, 185–197.
Aronson, E. (1966). Avoidance of inter-subject communication. Psychological Reports, 19, 238.
Aronson, E., and Carlsmith, J. M. (1968). Experimentation in social psychology. In G. Lindzey and E. Aronson, Eds., The Handbook of Social Psychology, rev. ed. Vol. II. Reading, Massachusetts: Addison-Wesley.
Aronson, E., Carlsmith, J. M., and Darley, J. M. (1963). The effects of expectancy on volunteering for an unpleasant experience. Journal of Abnormal and Social Psychology, 66, 220–224.
Ascough, J. C., and Sipprelle, C. N. (1968). Operant verbal conditioning of autonomic responses. Behavior Research and Therapy, 6, 363–370.
Atkinson, J. (1955). The achievement motive and recall of interrupted and completed tasks. In D. C. McClelland, Ed., Studies in Motivation. New York: Appleton-Century-Crofts, pp. 494–506.
Back, K. W., Hood, T. C., and Brehm, M. L. (1963). The subject role in small group experiments. Paper presented at the meetings of the Southern Sociological Society, Durham, North Carolina, April. Technical Report # 12.
Ball, R. J. (1930). The correspondence method in follow-up studies of delinquent boys. Journal of Juvenile Research, 14, 107–113.
Ball, R. S. (1952). Reinforced conditioning of verbal and nonverbal stimuli in a situation resembling a clinical interview. Unpublished doctoral diss., Indiana University.
Barber, T. X. (1969). Hypnosis: A Scientific Approach. New York: Van Nostrand.
Barefoot, J. C. (1969). Anxiety and volunteering. Psychonomic Science, 16, 283–284.
Barker, W. J., and Perlman, D. (1972). Volunteer bias and personality traits in sexual standards research. Unpublished manuscript, University of Manitoba.
Barnette, W. L., Jr. (1950a). Report of a follow-up of counseled veterans: I Public Law 346 versus Public Law 16 clients. Journal of Social Psychology, 32, 129–142.
Barnette, W. L., Jr. (1950b). Report of a follow-up of counseled veterans: II Status of pursuit of training. Journal of Social Psychology, 32, 143–156.
Bass, B. M. (1967). Social behavior and the orientation inventory: A review. Psychological Bulletin, 68, 260–292.
Bass, B. M., Dunteman, G., Frye, R., Vidulich, R., and Wambach, H. (1963). Self, interaction, and task orientation inventory scores associated with overt behavior and personal factors. Educational and Psychological Measurement, 23, 101–116.
Baumrind, D. (1964). Some thoughts on ethics of research: After reading Milgram’s Behavioral Study of Obedience. American Psychologist, 19, 421–423.
Baumrind, D. (1971). Principles of ethical conduct in the treatment of subjects: Reaction to the draft report of the Committee on Ethical Standards in Psychological Research. American Psychologist, 26, 887–896.
Baumrind, D. (1972). Reactions to the May 1972 draft report of the ad hoc committee on ethical standards in psychological research. American Psychologist, 27, 1083–1086.
Baur, E. J. (1947–1948). Response bias in a mail survey. Public Opinion Quarterly, 11, 594–600.
Beach, F. A. (1950). The snark was a boojum. American Psychologist, 5, 115–124.
Beach, F. A. (1960). Experimental investigations of species specific behavior. American Psychologist, 15, 1–18.
Bean, W. B. (1959). The ethics of experimentation on human beings. In S. O. Waife and A. P. Shapiro, Eds., The Clinical Evaluation of New Drugs. New York: Hoeber-Harper, pp. 76–84.
Beckman, L., and Bishop, B. R. (1970). Deception in psychological research: A reply to Seeman. American Psychologist, 25, 878–880.
Beecher, H. K. (1970). Research and the Individual: Human Studies. Boston: Little, Brown.
Bell, C. R. (1961). Psychological versus sociological variables in studies of volunteer bias in surveys. Journal of Applied Psychology, 45, 80–85.
Bell, C. R. (1962). Personality characteristics of volunteers for psychological studies. British Journal of Social and Clinical Psychology, 1, 81–95.
Belmont, L., and Marolla, F. A. (1973). Birth order, family size, and intelligence. Science, 182, 1096–1101.
Belson, W. A. (1960). Volunteer bias in test-room groups. Public Opinion Quarterly, 24, 115–126.
Belt, J. A., and Perryman, R. E. (1970). The subject as a biasing factor in psychological experiments. Paper read at the meetings of S.W.P.A., St. Louis, April.
Bem, D. J. (1970). Beliefs, Attitudes, and Human Affairs. Belmont, Calif.: Brooks/Cole.
Bennett, C. M., and Hill, R. E., Jr. (1964). A comparison of selected personality characteristics of responders and nonresponders to a mailed questionnaire study. Journal of Educational Research, 58, 178–180.
Bennett, E. B. (1955). Discussion, decision, commitment and consensus in “group decision.” Human Relations, 8, 251–273.
Benson, L. E. (1946). Mail surveys can be valuable. Public Opinion Quarterly, 10, 234–241.
Benson, S., Booman, W. P., and Clark, K. E. (1951). A study of interview refusal. Journal of Applied Psychology, 35, 116–119.
Bentler, P. M., and Roberts, M. R. (1963). Hypnotic susceptibility assessed in large groups. International Journal of Clinical and Experimental Hypnosis, 11, 93–97.
Bergen, A. V., and Kloot, W. V. D. (1968–1969). Recruitment of subjects. Hypothese: Tijdschrift voor Psychologie en Opvoedkunde, 13, no. 1, 11–15.
Berkowitz, L., and Cottingham, D. R. (1960). The interest value and relevance of fear-arousing communication. Journal of Abnormal and Social Psychology, 60, 37–43.
Bickman, L., and Henchy, T., Eds. (1972). Beyond the Laboratory: Field Research in Social Psychology. New York: McGraw-Hill.
Biddle, B. J., and Thomas, E. J., Eds. (1966). Role Theory: Concepts and Research. New York: Wiley.
Black, R. W., Schumpert, J., and Welch, F. (1972). A “partial reinforcement extinction effect” in perceptual–motor performance: Coerced versus volunteer subject populations. Journal of Experimental Psychology, 92, 143–145.
Blake, R. R., Berkowitz, H., Bellamy, R. Q., and Mouton, J. S. (1956). Volunteering as an avoidance act. Journal of Abnormal and Social Psychology, 53, 154–156.
Boice, R. (1973). Domestication. Psychological Bulletin, 80, 215–230.
Boies, K. G. (1972). Role playing as a behavior change technique: Review of the empirical literature. Psychotherapy: Theory, Research and Practice, 9, 185–192.
Boring, E. G. (1969). Perspective: Artifact and control. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press.
Boucher, R. G., and Hilgard, E. R. (1962). Volunteer bias in hypnotic experimentation. American Journal of Clinical Hypnosis, 5, 49–51.
Bradt, K. (1955). The usefulness of a post card technique in a mail questionnaire study. Public Opinion Quarterly, 19, 218–222.
Brady, J. P., Levitt, E. E., and Lubin, B. (1961). Expressed fear of hypnosis and volunteering behavior. Journal of Nervous and Mental Disease, 133, 216–217.
Bragg, B. W. (1966). Effect of knowledge of deception on reaction to group pressure. Unpublished master’s thesis, University of Wisconsin.
Brehm, J. W. (1966). A Theory of Psychological Reactance. New York: Academic Press.
Brehm, M. L., Back, K. W., and Bogdonoff, M. D. (1964). A physiological effect of cognitive dissonance under stress and deprivation. Journal of Abnormal and Social Psychology, 69, 303–310.
Brightbill, R., and Zamansky, H. S. (1963). The conceptual space of good and poor hypnotic subjects: A preliminary exploration. International Journal of Clinical and Experimental Hypnosis, 11, 112–121.
Britton, J. H., and Britton, J. O. (1951). Factors in the return of questionnaires mailed to older persons. Journal of Applied Psychology, 35, 57–60.
Brock, T. C., and Becker, G. (1965). Birth order and subject recruitment. Journal of Social Psychology, 65, 63–66.
Brock, T. C., and Becker, L. A. (1966). “Debriefing” and susceptibility to subsequent experimental manipulations. Journal of Experimental Social Psychology, 2, 314–323.
Brooks, W. D. (1966). Effects of a persuasive message upon attitudes: A methodological comparison of an offset before-after design with a pretest–posttest design. Journal of Communication, 16, 180–188.
Brower, D. (1948). The role of incentive in psychological research. Journal of General Psychology, 39, 145–147.
Brown, R. (1962). Models of attitude change. In R. Brown, E. Galanter, E. H. Hess, and G. Mandler, Eds., New Directions in Psychology, Vol. I. New York: Holt, Rinehart, and Winston.
Bruehl, D. K. (1971). A model of social psychological artifacts in psychological experimentation with human subjects. Unpublished doctoral diss., University of California at Berkeley.
Bruehl, D. K., and Solar, D. (1970). Systematic variation in the clarity of demand characteristics in an experiment employing a confederate. Psychological Reports, 27, 55–60.
Bruehl, D. K., and Solar, D. (1972). Clarity of demand characteristics in an experimenter expectancy experiment. Paper presented at the Western Psychological Association meeting, Portland, Oregon.
Brunswik, E. (1947). Systematic and Representative Design of Psychological Experiments. Berkeley: University of California Press.
Brunswik, E. (1955). Representative design and probabilistic theory in a functional psychology. Psychological Review, 62, 193–217.
Brunswik, E. (1956). Perception and the Representative Design of Psychological Experiments. Berkeley: University of California Press.
Bryan, J. H., and London, P. (1970). Altruistic behavior by children. Psychological Bulletin, 73, 200–211.
Buckhout, R. (1965). Need for approval and attitude change. Journal of Psychology, 60, 123–128.
Burchinal, L. G. (1960). Personality characteristics and sample bias. Journal of Applied Psychology, 44, 172–174.
Burdick, H. A. (1956). The relationship of attraction, need achievement, and certainty to conformity under conditions of a simulated group atmosphere. Unpublished doctoral diss., University of Michigan.
Burdick, J. A., and Stewart, D. Y. (1974). Differences between “Show” and “No Show” volunteers in a homosexual population. Journal of Social Psychology, 92, 159–160.
Burns, J. L. (1974). Some personality attributes of volunteers and of nonvolunteers for psychological experimentation. Journal of Social Psychology, 92, 161–162.
Cairns, R. B. (1961). The influence of dependency inhibition on the effectiveness of social reinforcement. Journal of Personality, 29, 466–488.
Campbell, D. T. (1957). Factors relevant to the validity of experiments in social settings. Psychological Bulletin, 54, 297–312.
Campbell, D. T. (1969). Prospective: Artifact and control. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press.
Campbell, D. T., and Stanley, J. C. (1966). Experimental and Quasi-experimental Designs for Research. Chicago: Rand-McNally.
Capra, P. C., and Dittes, J. E. (1962). Birth order as a selective factor among volunteer subjects. Journal of Abnormal and Social Psychology, 64, 302.
Carlson, R. (1971). Where is the person in personality research? Psychological Bulletin, 75, 203–219.
Carr, J. E., and Wittenbaugh, J. A. (1968). Volunteer and nonvolunteer characteristics in an outpatient population. Journal of Abnormal Psychology, 73, 16–17.
Carroll, J. D., and Chang, J. (1970). Analysis of individual differences in multidimensional scaling via an n-way generalization of Eckart–Young decomposition. Psychometrika, 35, 282–319.
Chapanis, A. (1967). The relevance of laboratory studies to practical situations. Ergonomics, 10, 557–577.
Chein, I. (1948). Behavior theory and the behavior of attitudes: Some critical comments. Psychological Review, 55, 175–188.
Christie, R. (1951). Experimental naïveté and experiential naïveté. Psychological Bulletin, 48, 327–339.
Clark, K. E. (1949). A vocational interest test at the skilled trades level. Journal of Applied Psychology, 33, 291–303.
Clark, K. E. et al. (1967). Privacy and behavioral research. Science, 155, 535–538.
Clausen, J. A., and Ford, R. N. (1947). Controlling bias in mail questionnaires. Journal of the American Statistical Association, 42, 497–511.
Cochran, W. G. (1963). Sampling Techniques. 2nd ed. New York: Wiley.
Cochran, W. G., Mosteller, F., and Tukey, J. W. (1953). Statistical problems of the Kinsey report. Journal of the American Statistical Association, 48, 673–716.
Coe, W. C. (1964). Further norms on the Harvard Group Scale of Hypnotic Susceptibility, Form A. International Journal of Clinical and Experimental Hypnosis, 12, 184–190.
Coe, W. C. (1966). Hypnosis as role enactment: The role demand variable. American Journal of Clinical Hypnosis, 8, 189–191.
Coffin, T. E. (1941). Some conditions of suggestion and suggestibility. Psychological Monographs, 53, no. 4 (Whole no. 241).
Cohen, J. (1969). Statistical Power Analysis for the Behavioral Sciences. New York: Academic Press.
Cohler, B. J., Woolsey, S. H., Weiss, J. L., and Grunebaum, H. H. (1968). Childrearing attitudes among mothers volunteering and revolunteering for a psychological study. Psychological Reports, 23, 603–612.
Conroy, G. I., and Morris, J. R. (1968). Psychological health among volunteers, non-volunteers and no shows. Paper presented at Southeastern Psychological Association Meeting, Roanoke, Va., April.
Cook, S. W., Kimble, G. A., Hicks, L. H., McGuire, W. J., Schoggen, P. H., and Smith, M. B. (1971). Ethical standards for psychological research. APA Monitor, 2, no. 7, 9–28.
Cook, S. W., Kimble, G. A., Hicks, L. H., McGuire, W. J., Schoggen, P. H., and Smith, M. B. (1972). Ethical standards for research with human subjects. APA Monitor, 3, i–xix.
Cook, T. D., Bean, J. R., Calder, B. J., Frey, R., Krovetz, M. L., and Reisman, S. R. (1970). Demand characteristics and three conceptions of the frequently deceived subject. Journal of Personality and Social Psychology, 14, 185–194.
Cook, T. D., and Campbell, D. T. (1974). The design and conduct of quasi-experiments and true experiments in field settings. In M. D. Dunnette, Ed., Handbook of Industrial and Organizational Psychology. Chicago: Rand-McNally, in press.
Cope, C. S., and Kunce, J. T. (1971). Unobtrusive behavior and research methodology. Journal of Counseling Psychology, 18, 592–594.
Cope, R. G. (1968). Nonresponse in survey research as a function of psychological characteristics and time of response. Journal of Experimental Education, 36, 32–35.
Cox, D. E., and Sipprelle, C. N. (1971). Coercion in participation as a research subject. American Psychologist, 26, 726–728.
Cozby, P. C. (1973). Self-disclosure: A literature review. Psychological Bulletin, 79, 73–91.
Craddick, R. A., and Campitell, J. (1963). Return to an experiment as a function of need for social approval. Perceptual and Motor Skills, 16, 930.
Crespi, L. P. (1948). The interview effect in polling. Public Opinion Quarterly, 12, 99–111.
Croog, S. H., and Teele, J. E. (1967). Religious identity and church attendance of sons of religious intermarriages. American Sociological Review, 32, 93–103.
Crossley, H. M., and Fink, R. (1951). Response and nonresponse in a probability sample. International Journal of Opinion and Attitude Research, 5, 1–19.
Crowne, D. P., and Marlowe, D. (1964). The Approval Motive. New York: Wiley.
Cudrin, J. M. (1969). Intelligence of volunteers as research subjects. Journal of Consulting and Clinical Psychology, 33, 501–503.
Damon, A. (1965). Discrepancies between findings of longitudinal and cross-sectional studies in adult life: Physique and physiology. Human Development, 8, 16–22.
Darroch, R. K., and Steiner, I. D. (1970). Role playing: An alternative to laboratory research? Journal of Personality, 38, 302–311.
Deming, W. E. (1944). On errors in surveys. American Sociological Review, 9, 359–369.
Dewolfe, A. S., and Governale, C. N. (1964). Fear and attitude change. Journal of Abnormal and Social Psychology, 69, 119–123.
Diab, L. N., and Prothro, E. T. (1968). Cross-cultural study of some correlates of birth order. Psychological Reports, 22, 1137–1142.
Diamant, L. (1970). Attitude, personality, and behavior in volunteers and nonvolunteers for sexual research. Proceedings, 78th Annual Convention, American Psychological Association, pp. 423–424.
Dillman, D. A. (1972). Increasing mail questionnaire response for large samples of the general public. Agricultural Research Center Scientific Paper No. 3752. Washington State University. Also in Public Opinion Quarterly, 36, 254–257.
Dittes, J. E. (1961). Birth order and vulnerability to differences in acceptance. American Psychologist, 16, 358 (Abstract).
Dohrenwend, B. S., and Dohrenwend, B. P. (1968). Sources of refusals in surveys. Public Opinion Quarterly, 32, 74–83.
Dohrenwend, B. S., Feldstein, S., Plosky, J., and Schmeidler, G. R. (1967). Factors interacting with birth order in self-selection among volunteer subjects. Journal of Social Psychology, 72, 125–128.
Dollard, J. (1953). The Kinsey report on women: “A strangely flawed masterpiece.” New York Herald Tribune, Sept. 13, 1953, Section 6.
Donald, M. N. (1960). Implications of nonresponse for the interpretation of mail questionnaire data. Public Opinion Quarterly, 24, 99–114.
Donnay, J. M. (1972). L’affiliation: Son substrat dynamique et ses implications sur le plan comportemental et intellectuel. Psychologica Belgica, 12, 175–187.
Doob, A. N., and Ecker, B. P. (1970). Stigma and compliance. Journal of Personality and Social Psychology, 14, 302–304.
Doob, A. N., and Gross, A. E. (1968). Status of frustrator as an inhibitor of horn-honking responses. Journal of Social Psychology, 76, 213–218.
Dreger, R. M., and Johnson, W. E., Jr. (In press). Characteristics of volunteers, non-volunteers, and no-shows in a clinical follow-up. Journal of Consulting and Clinical Psychology.
Dulany, D. E. (1962). The place of hypotheses and intentions: An analysis of verbal control in verbal conditioning. In C. E. Eriksen, Ed., Behavior and Awareness. Durham, N. C.: Duke University Press.
Ebert, R. K. (1973). The reliability and validity of a mailed questionnaire for a sample of entering college freshmen. Unpublished doctoral diss., Temple University.
Eckland, B. K. (1965). Effects of prodding to increase mailback returns. Journal of Applied Psychology, 49, 165–169.
Edgerton, H. A., Britt, S. H., and Norman, R. D. (1947). Objective differences among various types of respondents to a mailed questionnaire. American Sociological Review, 12, 435–444.
Edwards, C. N. (1968a). Characteristics of volunteers and nonvolunteers for a sleep and hypnotic experiment. American Journal of Clinical Hypnosis, 11, 26–29.
Edwards, C. N. (1968b). Defensive interaction and the volunteer subject: An heuristic note. Psychological Reports, 22, 1305–1309.
Efran, J. S., and Boylin, E. R. (1967). Social desirability and willingness to participate in a group discussion. Psychological Reports, 20, 402.
Ehrlich, A. (1974). The age of the rat. Human Behavior, 3, 25–28.
Eisenman, R. (1965). Birth order, aesthetic preference, and volunteering for an electric shock experiment. Psychonomic Science, 3, 151–152.
Eisenman, R. (1972). Experience in experiments and change in internal–external control scores. Journal of Consulting and Clinical Psychology, 39, 434–435.
Ellis, R. A., Endo, C. M., and Armer, J. M. (1970). The use of potential nonrespondents for studying nonresponse bias. Pacific Sociological Review, 13, 103–109.
Elms, A. C., Ed. (1969). Role Playing, Reward, and Attitude Change. New York: Van Nostrand.
Epstein, Y. M., Suedfeld, P., and Silverstein, S. J. (1973). The experimental contract: Subjects’ expectations of and reactions to some behaviors of experimenters. American Psychologist, 28, 212–221.
Esecover, H., Malitz, S., and Wilkens, B. (1961). Clinical profiles of paid normal subjects volunteering for hallucinogenic drug studies. American Journal of Psychiatry, 117, 910–915.
Etzioni, A. (1973). Regulation of human experimentation. Science, 182, 1203.
Evans, F. J., and Orne, M. T. (1971). The disappearing hypnotist: The use of simulating subjects to evaluate how subjects perceive experimental procedures. International Journal of Clinical and Experimental Hypnosis, 19, 277–296.
Evans, R. I., and Rozelle, R. M., Eds. (1973). Social Psychology in Life. Boston: Allyn and Bacon.
Eysenck, H. J. (1967). The Biological Basis of Personality. Springfield, Ill.: Charles C. Thomas.
Feldman, R. S., and Scheibe, K. E. (1972). Determinants of dissent in a psychological experiment. Journal of Personality, 40, 331–348.
Ferber, R. (1948–1949). The problem of bias in mail returns: A solution. Public Opinion Quarterly, 12, 669–676.
Ferree, M. M., Smith, E. R., and Miller, F. D. (1973). Is sisterhood powerful? A look at feminist belief and helping behavior. Unpublished manuscript, Harvard University.
Ferriss, A. L. (1951). A note on stimulating response to questionnaires. American Sociological Review, 16, 247–249.
Festinger, L. (1957). A Theory of Cognitive Dissonance. Evanston, Ill.: Row, Peterson; Reissued by Stanford University Press, 1962.
Fillenbaum, S. (1966). Prior deception and subsequent experimental performance: The “faithful” subject. Journal of Personality and Social Psychology, 4, 532–537.
Fischer, E. H., and Winer, D. (1969). Participation in psychological research: Relation to birth order and demographic factors. Journal of Consulting and Clinical Psychology, 33, 610–613.
Fisher, S., McNair, D. M., and Pillard, R. C. (1970). Acquiescence, somatic awareness and volunteering. Psychosomatic Medicine, 32, 556.
Foa, U. G. (1971). Interpersonal and economic resources. Science, 171, 345–351.
Ford, R. N., and Zeisel, H. (1949). Bias in mail surveys cannot be controlled by one mailing. Public Opinion Quarterly, 13, 495–501.
Foster, R. J. (1961). Acquiescent response set as a measure of acquiescence. Journal of Abnormal and Social Psychology, 63, 155–160.
Francis, R. D., and Diespecker, D. D. (1973). Extraversion and volunteering for sensory isolation. Perceptual and Motor Skills, 36, 244–246.
Franzen, R., and Lazarsfeld, P. F. (1945). Mail questionnaire as a research problem. Journal of Psychology, 20, 293–320.
Fraser, S. C., and Zimbardo, P. G. (n.d.). Subject compliance: The effects of knowing one is a subject. Unpublished manuscript, New York University.
Freedman, J. L. (1969). Role playing: Psychology by consensus. Journal of Personality and Social Psychology, 13, 107–114.
Freedman, J. L., and Fraser, S. C. (1966). Compliance without pressure: The foot-in-the-door technique. Journal of Personality and Social Psychology, 4, 195–202.
Freedman, J. L., Wallington, S. A., and Bless, E. (1967). Compliance without pressure: The effect of guilt. Journal of Personality and Social Psychology, 7, 117–124.
French, J. R. P. (1963). Personal communication. Aug. 19.
Frey, A. H., and Becker, W. C. (1958). Some personality correlates of subjects who fail to appear for experimental appointments. Journal of Consulting Psychology, 22, 164.
Frey, P. W. (1973). Student ratings of teaching: Validity of several rating factors. Science, 182, 83–85.
Fried, S. B., Gumpper, D. C., and Allen, J. C. (1973). Ten years of social psychology: Is there a growing commitment to field research? American Psychologist, 28, 155–156.
Friedman, H. (1968). Magnitude of experimental effect and a table for its rapid estimation. Psychological Bulletin, 70, 245–251.
Frye, R. L., and Adams, H. E. (1959). Effect of the volunteer variable on leaderless group discussion experiments. Psychological Reports, 5, 184.
Gannon, M. J., Nothern, J. C., and Carroll, S. J., Jr. (1971). Characteristics of nonrespondents among workers. Journal of Applied Psychology, 55, 586–588.
Gaudet, H., and Wilson, E. C. (1940). Who escapes the personal investigator? Journal of Applied Psychology, 24, 773–777.
Gelfand, D. M., Hartmann, D. P., Walder, P., and Page, B. (1973). Who reports shoplifters?: A field-experimental study. Journal of Personality and Social Psychology, 25, 276–285.
Geller, S. H., and Endler, N. S. (1973). The effects of subject roles, demand characteristics, and suspicion on conformity. Canadian Journal of Behavioural Science, 5, 46–54.
Gergen, K. J. (1973). The codification of research ethics: Views of a doubting Thomas. American Psychologist, 28, 907–912.
Glinski, R. J., Glinski, B. C., and Slatin, G. T. (1970). Nonnaivety contamination in conformity experiments: Sources, effects, and implications for control. Journal of Personality and Social Psychology, 16, 478–485.
Goldstein, J. H., Rosnow, R. L., Goodstadt, B. E., and Suls, J. M. (1972). The “good subject” in verbal operant conditioning research. Journal of Experimental Research in Personality, 6, 29–33.
Goodstadt, B. E. (1971). When coercion fails. Unpublished doctoral diss., Temple University.
Grabitz-Gniech, G. (1972). Versuchspersonenverhalten: Erklärungsansätze aus Theorien zum sozialen Einfluss. Psychologische Beiträge, 14, 541–549.
Green, D. R. (1963). Volunteering and the recall of interrupted tasks. Journal of Abnormal and Social Psychology, 66, 397–401.
Greenberg, A. (1956). Respondent ego-involvement in large-scale surveys. Journal of Marketing, 20, 390–393.
Greenberg, M. S. (1967). Role playing: An alternative to deception? Journal of Personality and Social Psychology, 7, 152–157.
Greene, E. B. (1937). Abnormal adjustments to experimental situations. Psychological Bulletin, 34, 747–748 (Abstract).
Greenspoon, J. (1951). The effect of verbal and nonverbal stimuli on the frequency of members of two verbal response classes. Unpublished doctoral diss., Indiana University.
Greenspoon, J. (1962). Verbal conditioning and clinical psychology. In A. J. Bachrach, Ed., Experimental Foundations of Clinical Psychology. New York: Basic Books.
Greenspoon, J., and Brownstein, A. J. (1967). Awareness in verbal conditioning. Journal of Experimental Research in Personality, 2, 295–308.
Gustafson, L. A., and Orne, M. T. (1965). Effects of perceived role and role success on the detection of deception. Journal of Applied Psychology, 49, 412–417.
Gustav, A. (1962). Students’ attitudes toward compulsory participation in experiments. Journal of Psychology, 53, 119–125.
Haas, K. (1970). Selection of student experimental subjects. American Psychologist, 25, 366.
Haefner, D. P. (1956). Some effects of guilt-arousing and fear-arousing persuasive communications on opinion change. Unpublished doctoral diss., University of Rochester.
Hammond, K. R. (1948). Subject and object sampling—a note. Psychological Bulletin, 45, 530–533.
Hammond, K. R. (1951). Relativity and representativeness. Philosophy of Science, 18, 208–211.
Hammond, K. R. (1954). Representative vs. systematic design in clinical psychology. Psychological Bulletin, 51, 150–159.
Hancock, J. W. (1940). An experimental study of four methods of measuring unit costs of obtaining attitude toward the retail store. Journal of Applied Psychology, 24, 213–230.
Handfinger, B. M. (1973). Effect of previous deprivation on reaction for helping behavior. Paper presented at the Eastern Psychological Association meeting, Washington, D.C.
Hansen, M. H., and Hurwitz, W. N. (1946). The problem of non-response in sample surveys. Journal of the American Statistical Association, 41, 517–529.
Hartmann, G. W. (1936). A field experiment on the comparative effectiveness of “emotional” and “rational” political leaflets in determining election results. Journal of Abnormal and Social Psychology, 31, 99–114.
Havighurst, C. C. (1972). Compensating persons injured in human experimentation. Science, 169, 153–169.
Hayes, D. P., Meltzer, L., and Lundberg, S. (1968). Information distribution, interdependence, and activity levels. Sociometry, 31, 162–179.
Heckhausen, H., Boteram, N., and Fisch, R. (1970). Attraktivitätsänderung der Aufgabe nach Misserfolg. Psychologische Forschung, 33, 208–222.
Heilizer, F. (1960). An exploration of the relationship between hypnotizability and anxiety and/or neuroticism. Journal of Consulting Psychology, 24, 432–436.
Henchy, T., and Glass, D. C. (1968). Evaluation apprehension and the social facilitation of dominant and subordinate responses. Journal of Personality and Social Psychology, 10, 446–454.
Hendrick, C., Borden, R., Giesen, M., Murray, E. J., and Seyfried, B. A. (1972). Effectiveness of ingratiation tactics in a cover letter on mail questionnaire response. Psychonomic Science, 26, 349–351.
Hicks, J. M., and Spaner, F. E. (1962). Attitude change and mental hospital experience. Journal of Abnormal and Social Psychology, 65, 112–120.
Higbee, K. L., and Wells, M. G. (1972). Some research trends in social psychology during the 1960s. American Psychologist, 27, 963–966.
Hilgard, E. R. (1965). Hypnotic Susceptibility. New York: Harcourt, Brace and World.
Hilgard, E. R. (1967). Personal communication. Feb. 6.
Hilgard, E. R., and Payne, S. L. (1944). Those not at home: Riddle for pollsters. Public Opinion Quarterly, 8, 254–261. Hilgard, E. R., Weitzenhoffer, A. M., Landes, J., and Moore, R. K. (1961). The distribution of susceptibility to hypnosis in a student population: A study using the Stanford Hypnotic Susceptibility Scale. Psychological Monographs, 75, 8 (Whole no. 512). Hill, C. T., Rubin, Z., and Willard, S. (1973). Who volunteers for research on dating relationships? Unpublished manuscript, Harvard University. Himelstein, P. (1956). Taylor scale characteristics of volunteers and nonvolunteers for psychological experiments. Journal of Abnormal and Social Psychology, 52, 138–139. Holmes, D. S. (1967). Amount of experience in experiments as a determinant of performance in later experiments. Journal of Personality and Social Psychology, 2, 289–294. Holmes, D. S., and Applebaum, A. S. (1970). Nature of prior experimental experience as a determinant of performance in a subsequent experiment. Journal of Personality and Social Psychology, 14, 195–202. Holmes, D. S., and Bennett, D. H. (1974). Experiments to answer questions raised by the use of deception in psychological research: I. Role playing as an alternative to deception; II. Effectiveness of debriefing after a deception; III. Effect of informed consent on deception. Journal of Personality and Social Psychology, 29, 358–367. Holmes, J. G., and Strickland, L. H. (1970). Choice freedom and confirmation of incentive expectancy as determinants of attitude change. Journal of Personality and Social Psychology, 14, 39–45. Hood, T. C. (1963). The volunteer subject: Patterns of self-presentation and the decision to participate in social psychological experiments. Unpublished master’s thesis, Duke University. Hood, T. C., and Back, K. W. (1967). Patterns of self-disclosure and the volunteer: The decision to participate in small groups experiments.
Paper read at Southern Sociological Society, Atlanta, April. Hood, T. C., and Back, K. W. (1971). Self-disclosure and the volunteer: A source of bias in laboratory experiments. Journal of Personality and Social Psychology, 17, 130–136. Horowitz, I. A. (1969). Effects of volunteering, fear arousal, and number of communications on attitude change. Journal of Personality and Social Psychology, 11, 34–37. Horowitz, I. A., and Gumenik, W. E. (1970). Effects of the volunteer subject, choice, and fear arousal on attitude change. Journal of Experimental Social Psychology, 6, 293–303. Horowitz, I. A., and Rothschild, B. H. (1970). Conformity as a function of deception and role playing. Journal of Personality and Social Psychology, 14, 224–226. Hovland, C. I., Lumsdaine, A. A., and Sheffield, F. D. (1949). Experiments on Mass Communication. Princeton, NJ: Princeton University Press. Howe, E. S. (1960). Quantitative motivational differences between volunteers and nonvolunteers for a psychological experiment. Journal of Applied Psychology, 44, 115–120. Hyman, H., and Sheatsley, P. B. (1954). The scientific method. In D. P. Geddes, Ed., An Analysis of the Kinsey Reports. New York: New American Library. Innes, J. M., and Fraser, C. (1971). Experimenter bias and other possible biases in psychological research. European Journal of Social Psychology, 1, 297–310. Insko, C. A. (1965). Verbal reinforcement of attitude. Journal of Personality and Social Psychology, 2, 621–623. Insko, C. A., Arkoff, A., and Insko, V. M. (1965). Effects of high and low fear-arousing communication upon opinions toward smoking. Journal of Experimental Social Psychology, 1, 256–266. Jackson, C. W., Jr., and Kelley, E. L. (1962). Influence of suggestion and subject’s prior knowledge in research on sensory deprivation. Science, 132, 211–212. Jackson, C. W., and Pollard, J. C. (1966). Some nondeprivation variables which influence the ‘‘effects’’ of experimental sensory deprivation. 
Journal of Abnormal Psychology, 71, 383–388. Jackson, J. A., Ed. (1972). Role. Cambridge: Cambridge University Press. Jaeger, M. E., Feinberg, H. K., and Weissman, H. N. (1973). Differences between volunteers, nonvolunteers, and pseudovolunteers as measured by the Omnibus Personality Inventory. Paper presented at the Eastern Psychological Association meeting, Washington, D.C. Janis, I. L. (1967). Effects of fear arousal on attitude change: Recent developments in theory and experimental research. In L. Berkowitz, Ed., Advances in Experimental Social Psychology, Vol. III. New York: Academic Press.
Janis, I. L., and Feshbach, S. (1953). Effects of fear-arousing communications. Journal of Abnormal and Social Psychology, 48, 78–92. Janis, I. L., and Gilmore, J. B. (1965). The influence of incentive conditions on the success of role playing in modifying attitudes. Journal of Personality and Social Psychology, 1, 17–27. Janis, I. L., and Mann, L. (1965). Effectiveness of emotional role-playing in modifying smoking habits and attitudes. Journal of Experimental Research in Personality, 1, 84–90. Janis, I. L., and Terwilliger, R. F. (1962). An experimental study of psychological resistances to fear-arousing communication. Journal of Abnormal and Social Psychology, 65, 403–410. Johns, J. H., and Quay, H. C. (1962). The effect of social reward on verbal conditioning in psychopathic and neurotic military offenders. Journal of Consulting Psychology, 26, 217–220. Johnson, R. W. (1973a). Inducement of expectancy and set of subjects as determinants of subjects’ responses in experimenter expectancy research. Canadian Journal of Behavioural Science, 5, 55–66. Johnson, R. W. (1973b). The obtaining of experimental subjects. Canadian Psychologist, 14, 208–211. Jones, E. E., and Sigall, H. (1971). The bogus pipeline: A new paradigm for measuring affect and attitude. Psychological Bulletin, 76, 349–364. Jones, E. E., and Sigall, H. (1973). Where there is ignis, there may be fire. Psychological Bulletin, 79, 260–262. Jones, H. E., Conrad, H., and Horn, A. (1928). Psychological studies of motion pictures: II. Observation and recall as a function of age. University of California Publications in Psychology, 3, 225–243. Jones, R. A., and Cooper, J. (1971). Mediation of experimenter effects. Journal of Personality and Social Psychology, 20, 70–74. Jourard, S. M. (1968). Disclosing Man to Himself. Princeton, NJ: Van Nostrand. Jourard, S. M. (1969). The effects of experimenters’ self-disclosure on subjects’ behavior. In C.
Spielberger, Ed., Current Topics in Community and Clinical Psychology, Vol. I. New York: Academic Press. Jourard, S. M. (1971). Self-Disclosure: An Experimental Analysis of the Transparent Self. New York: Wiley-Interscience. Juhasz, J. B., and Sarbin, T. R. (1966). On the false alarm metaphor in psychophysics. Psychological Record, 16, 323–327. Jung, J. (1969). Current practices and problems in the use of college students for psychological research. Canadian Psychologist, 10, 280–290. Kaats, G. R., and Davis, K. E. (1971). Effects of volunteer biases in studies of sexual behavior and attitudes. Journal of Sex Research, 7, 26–34. Kaess, W., and Long, L. (1954). An investigation of the effectiveness of vocational guidance. Educational and Psychological Measurement, 14, 423–433. Kanfer, F. H. (1968). Verbal conditioning: A review of its current status. In T. R. Dixon and D. L. Horton, Eds., Verbal Behavior and General Behavior Theory. Englewood Cliffs, NJ: Prentice-Hall. Katz, D., and Cantril, H. (1937). Public opinion polls. Sociometry, 1, 155–179. Katz, D., and Stotland, E. (1959). A preliminary statement to a theory of attitude structure and change. In S. Koch, Ed., Psychology: A Study of a Science, Vol. III. New York: McGraw-Hill. Katz, J. (1972). Experimentation with Human Beings: The Authority of the Investigator, Subject, Professions, and State in the Human Experimentation Process. New York: Russell Sage Foundation (with the assistance of A. M. Capron and E. S. Glass). Kauffmann, D. R. (1971). Incentive to perform counterattitudinal acts: Bribe or gold star? Journal of Personality and Social Psychology, 19, 82–91. Kavanau, J. L. (1964). Behavior: Confinement, adaptation, and compulsory regimes in laboratory studies. Science, 143, 490. Kavanau, J. L. (1967). Behavior of captive white-footed mice. Science, 155, 1623–1639. Kazdin, A. E., and Bryan, J. H. (1971). Competence and volunteering. Journal of Experimental Social Psychology, 7, 87–97. Kegeles, S. S. (1963).
Some motives for seeking preventative dental care. Journal of the American Dental Association, 67, 110–118. Kelley, T. L. (1929). Scientific Method. Columbus: Ohio State University Press.
Kelman, H. C. (1965). Manipulation of human behavior: An ethical dilemma for the social scientist. Journal of Social Issues, 21, 31–46. Kelman, H. C. (1967). Human use of human subjects: The problem of deception in social psychological experiments. Psychological Bulletin, 67, 1–11. Kelman, H. C. (1968). A Time to Speak. San Francisco: Jossey-Bass. Kelman, H. C. (1972). The rights of the subject in social research: An analysis in terms of relative power and legitimacy. American Psychologist, 27, 989–1016. Kelvin, F. (1971). Socialization and conformity. Journal of Child Psychology and Psychiatry, 12, 211–222. Kennedy, J. J., and Cormier, W. H. (1971). The effects of three methods of subject and experimenter recruitment in verbal conditioning research. Journal of Social Psychology, 85, 65–76. Kerlinger, F. N. (1972). Draft report of the APA committee on ethical standards in psychological research: A critical reaction. American Psychologist, 27, 894–896. King, A. F. (1967). Ordinal position and the Episcopal Clergy. Unpublished bachelor’s thesis, Harvard University. King, D. J. (1970). The subject pool. American Psychologist, 25, 1179–1181. Kinsey, A. C., Pomeroy, W. B., and Martin, C. E. (1948). Sexual Behavior in the Human Male. Philadelphia: Saunders. Kinsey, A. C., Pomeroy, W. B., Martin, C. E., and Gebhard, P. H. (1953). Sexual Behavior in the Human Female. Philadelphia: Saunders. Kintz, B. L., Delprato, D. J., Mettee, D. R., Persons, C. E., and Schappe, R. H. (1965). The experimenter effect. Psychological Bulletin, 63, 223–232. Kirby, M. W., and Davis, K. E. (1972). Who volunteers for research on marital counseling? Journal of Marriage and the Family, 34, 469–473. Kirchner, W. K., and Mousley, N. B. (1963). A note on job performance: Differences between respondent and nonrespondent salesmen to an attitude survey. Journal of Applied Psychology, 47, 223–224. Kish, G. B., and Barnes, J. (1973).
Variables that affect return rate of mailed questionnaires. Journal of Clinical Psychology, 29, 98–100. Kish, G. B., and Hermann, H. T. (1971). The Fort Meade Alcoholism Treatment Program. Quarterly Journal of Studies on Alcohol, 32, 628–635. Kish, L. (1965). Survey Sampling. New York: Wiley. Kivlin, J. E. (1965). Contributions to the study of mail-back bias. Rural Sociology, 30, 322–326. Klinger, E. (1967). Modeling effects on achievement imagery. Journal of Personality and Social Psychology, 7, 49–62. Koenigsberg, R. A. (1971). Experimenter–subject interaction in verbal conditioning. Unpublished doctoral diss., New School for Social Research. Kothandapani, V. (1971). Validation of feeling, belief, and intention to act as three components of attitude and their contribution to prediction of contraceptive behavior. Journal of Personality and Social Psychology, 19, 321–333. Krasner, L. (1958). Studies of the conditioning of verbal behavior. Psychological Bulletin, 55, 148–170. Krasner, L. (1962). The therapist as a social reinforcement machine. In H. Strupp and L. Luborsky, Eds., Research in Psychotherapy, Vol. II. Washington: American Psychological Association. Krech, D., Crutchfield, R. S., and Ballachey, E. L. (1962). Individual in Society. New York: McGraw-Hill. Kroger, R. O. (1967). The effects of role demands and test-cue properties upon personality test performance. Journal of Consulting Psychology, 31, 304–312. Kruglanski, A. W. (1973). Much ado about the ‘‘volunteer artifacts.’’ Journal of Personality and Social Psychology, 28, 348–354. Kruglov, L. P., and Davidson, H. H. (1953). The willingness to be interviewed: A selective factor in sampling. Journal of Social Psychology, 38, 39–47. Kruskal, J. B. (1964). Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika, 29, 1–27. Laming, D. R. J. (1967). On procuring human subjects. Quarterly Journal of Experimental Psychology, 19, 64–69. Lana, R. E. (1959a). 
A further investigation of the pretest-treatment interaction effect. Journal of Applied Psychology, 43, 421–422.
Lana, R. E. (1959b). Pretest-treatment interaction effects in attitudinal studies. Psychological Bulletin, 56, 293–300. Lana, R. E. (1964). The influence of the pretest on order effects in persuasive communications. Journal of Abnormal and Social Psychology, 69, 337–341. Lana, R. E. (1966). Inhibitory effects of a pretest on opinion change. Educational and Psychological Measurement, 26, 139–150. Lana, R. E. (1969). Pretest sensitization. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press. Lana, R. E., and Menapace, R. H. (1971). Subject commitment and demand characteristics in attitude change. Journal of Personality and Social Psychology, 20, 136–140. Lana, R. E., and Rosnow, R. L. (1963). Subject awareness and order effects in persuasive communications. Psychological Reports, 12, 523–529. Lana, R. E., and Rosnow, R. L. (1969). Effects of pretest-treatment interval on opinion change. Psychological Reports, 22, 1035–1036. Larson, R. F., and Catton, W. R., Jr. (1959). Can the mail-back bias contribute to a study’s validity? American Sociological Review, 24, 243–245. Lasagna, L., and von Felsinger, J. M. (1954). The volunteer subject in research. Science, 120, 359–361. Latané, B., and Darley, J. M. (1970). The Unresponsive Bystander: Why Doesn’t He Help? New York: Appleton-Century-Crofts. Lawson, F. (1949). Varying group responses to postal questionnaires. Public Opinion Quarterly, 13, 114–116. Lehman, E. C., Jr. (1963). Tests of significance and partial returns to mailed questionnaires. Rural Sociology, 28, 284–289. Leik, R. K. (1965). ‘‘Irrelevant’’ aspects of stooge behavior: Implications for leadership studies and experimental methodology. Sociometry, 28, 259–271. Leipold, W. D., and James, R. L. (1962). Characteristics of shows and no-shows in a psychological experiment. Psychological Reports, 11, 171–174. Leslie, L. L. (1972). Are high response rates essential to valid surveys?
Social Science Research, 1, 323–334. Lester, D. (1969). The subject as a source of bias in psychological research. Journal of General Psychology, 81, 237–248. Leventhal, H. (1970). Findings and theory in the study of fear communications. In L. Berkowitz, Ed., Advances in Experimental Social Psychology, Vol V. New York: Academic Press, 1970. Leventhal, H., and Niles, P. (1964). A field experiment on fear-arousal with data on the validity of questionnaire measures. Journal of Personality, 32, 459–479. Leventhal, H., Singer, R., and Jones, S. (1965). Effects of fear and specificity of recommendation upon attitudes and behavior. Journal of Personality and Social Psychology, 2, 20–29. Levitt, E. E., Lubin, B., and Brady, J. P. (1962). The effect of the pseudovolunteer on studies of volunteers for psychology experiments. Journal of Applied Psychology, 46, 72–75. Levitt, E. E., Lubin, B., and Zuckerman, M. (1959). Note on the attitude toward hypnosis of volunteers and nonvolunteers for an hypnosis experiment. Psychological Reports, 5, 712. Levitt, E. E., Lubin, B., and Zuckerman, M. (1962). The effect of incentives on volunteering for an hypnosis experiment. International Journal of Clinical and Experimental Hypnosis, 10, 39–41. Levy, L. H. (1967). Awareness, learning, and the beneficent subject as expert witness. Journal of Personality and Social Psychology, 6, 365–370. Lewin, K. (1929). Die Entwicklung der Experimentellen Willenspsychologie und die Psychotherapie. Leipzig: Verlag von S. Hirzel. Locke, H. J. (1954). Are volunteer interviewees representative? Social Problems, 1, 143–146. Loewenstein, R., Colombotos, J., and Elinson, J. (1962). Interviews hardest-to-obtain in an urban health survey. Proceedings of the Social Statistics Section of the American Statistical Association, 160–166. London, P. (1961). Subject characteristics in hypnosis research: Part I. A survey of experience, interest, and opinion. 
International Journal of Clinical and Experimental Hypnosis, 9, 151–161. London, P., Cooper, L. M., and Johnson, H. J. (1962). Subject characteristics in hypnosis research. II. Attitudes towards hypnosis, volunteer status, and personality measures. III. Some correlates of hypnotic susceptibility. International Journal of Clinical and Experimental Hypnosis, 10, 13–21.
London, P., and Rosenhan, D. (1964). Personality dynamics. Annual Review of Psychology, 15, 447–492. Loney, J. (1972). Background factors, sexual experiences, and attitudes toward treatment in two ‘‘normal’’ homosexual samples. Journal of Consulting and Clinical Psychology, 38, 57–65. Lowe, F. E., and McCormick, T. C. (1955). Some survey sampling biases. Public Opinion Quarterly, 19, 303–315. Lubin, B., Brady, J. P., and Levitt, E. E. (1962a). A comparison of personality characteristics of volunteers and nonvolunteers for hypnosis experiments. Journal of Clinical Psychology, 18, 341–343. Lubin, B., Brady, J. P., and Levitt, E. E. (1962b). Volunteers and nonvolunteers for an hypnosis experiment. Diseases of the Nervous System, 23, 642–643. Lubin, B., Levitt, E. E., and Zuckerman, M. (1962). Some personality differences between responders and nonresponders to a survey questionnaire. Journal of Consulting Psychology, 26, 192. Luchins, A. S. (1957). Primacy-recency in impression formation. In C. I. Hovland, W. Mandell, E. H. Campbell, T. C. Brock, A. S. Luchins, A. R. Cohen, W. J. McGuire, I. L. Janis, R. L. Feierabend, and N. H. Anderson, The Order of Presentation in Persuasion. New Haven: Yale University Press. Lyons, J. (1970). The hidden dialogue in experimental research. Journal of Phenomenological Psychology, 1, 19–29. Maas, I. (1956). Who doesn’t answer? Bulletin of the British Psychological Society, 29, 33–34. Macaulay, J., and Berkowitz, L., Eds. (1970). Altruism and Helping Behavior. New York: Academic Press. MacDonald, A. P., Jr. (1969). Manifestations of differential levels of socialization by birth order. Developmental Psychology, 1, 485–492. MacDonald, A. P., Jr. (1972a). Characteristics of volunteer subjects under three recruiting methods: Pay, extra credit, and love of science. Journal of Consulting and Clinical Psychology, 39, 222–234. MacDonald, A. P., Jr. (1972b).
Does required participation eliminate volunteer differences? Psychological Reports, 31, 153–154. Mackenzie, D. (1969). Volunteering, paralinguistic stress, and experimenter effects. Unpublished manuscript, Harvard University. Mann, L., and Taylor, K. F. (1969). Queue counting: The effect of motives upon estimates of numbers in waiting lines. Journal of Personality and Social Psychology, 12, 95–103. Mann, Sister M. J. (1959). A study in the use of the questionnaire. In E. M. Huddleston, Ed. Sixteenth Yearbook of the National Council on Measurements Used in Education. New York: National Council on Measurements Used in Education, pp. 171–179. Marlatt, G. A. (1973). Are college students ‘‘real people’’? American Psychologist, 28, 852–853. Marmer, R. S. (1967). The effects of volunteer status on dissonance reduction. Unpublished master’s thesis, Boston University. Marquis, P. C. (1973). Experimenter-subject interaction as a function of authoritarianism and response set. Journal of Personality and Social Psychology, 25, 289–296. Martin, D. C., Arnold, J. D., Zimmerman, T. F., and Richart, R. H. (1968). Human subjects in clinical research—a report of three studies. New England Journal of Medicine, 279, 1426–1431. Martin, R. M., and Marcuse, F. L. (1957). Characteristics of volunteers and nonvolunteers for hypnosis. Journal of Clinical and Experimental Hypnosis, 5, 176–180. Martin, R. M., and Marcuse, F. L. (1958). Characteristics of volunteers and nonvolunteers in psychological experimentation. Journal of Consulting Psychology, 22, 475–479. Maslow, A. H. (1942). Self-esteem (dominance feelings) and sexuality in women. Journal of Social Psychology, 16, 259–293. Maslow, A. H., and Sakoda, J. M. (1952). Volunteer-error in the Kinsey study. Journal of Abnormal and Social Psychology, 47, 259–262. Matthysse, S. W. (1966). Differential effects of religious communications. Unpublished doctoral diss., Harvard University. Maul, R. C. (1970). NCATE accreditation. 
Journal of Teacher Education, 21, 47–52. May, W. T., Smith, H. W., and Morris, J. R. (1968). Consent procedures and volunteer bias: A dilemma. Paper read at Southeastern Psychological Association, Roanoke, April. May, W. W. (1972). On Baumrind’s four commandments. American Psychologist, 27, 889–890.
Mayer, C. S., and Pratt, R. W., Jr. (1966). A note on nonresponse in a mail survey. Public Opinion Quarterly, 30, 637–646. McClelland, D. C. (1961). The Achieving Society. Princeton, NJ: Van Nostrand. McConnell, J. A. (1967). The prediction of dependency behavior in a standardized experimental situation. Dissertation Abstracts, 28, (5-b), 2127–2128. McDavid, J. W. (1965). Approval-seeking motivation and the volunteer subject. Journal of Personality and Social Psychology, 2, 115–117. McDonagh, E. C., and Rosenblum, A. L. (1965). A comparison of mailed questionnaires and subsequent structured interviews. Public Opinion Quarterly, 29, 131–136. McGuire, W. J. (1966). Attitudes and opinions. Annual Review of Psychology, 17, 475–514. McGuire, W. J. (1968a). Personality and attitude change: An information-processing theory. In A. G. Greenwald, T. C. Brock, and T. M. Ostrom, Eds., Psychological Foundations of Attitudes, New York: Academic Press. McGuire, W. J. (1968b). Personality and susceptibility to social influence. In E. F. Borgatta and W. W. Lambert, Eds., Handbook of Personality Theory and Research. Chicago: Rand-McNally. McGuire, W. J. (1969a). Suspiciousness of experimenter’s intent. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press. McGuire, W. J. (1969b). The nature of attitudes and attitude change. In G. Lindzey and E. Aronson, Eds., The Handbook of Social Psychology, rev. ed., Vol III. Reading, Mass.: Addison-Wesley. McLaughlin, R. J., and Harrison, N. W. (1973). Extraversion, neuroticism and the volunteer subject. Psychological Reports, 32, 1131–1134. McNemar, Q. (1946). Opinion-attitude methodology. Psychological Bulletin, 43, 289–374. McReynolds, W. T., and Tori, C. (1972). A further assessment of attention-placebo effects and demand characteristics in studies of systematic desensitization. Journal of Consulting and Clinical Psychology, 38, 261–264. Meier, P. (1972).
The biggest public health experiment ever: The 1954 field trial of the Salk poliomyelitis vaccine. In J. M. Tanur, Ed., Statistics: A Guide to the Unknown, San Francisco: Holden - Day, pp. 2–13. Menges, R. J. (1973). Openness and honesty versus coercion and deception in psychological research. American Psychologist, 28, 1030–1034. Meyers, J. K. (1972). Effects of S recruitment and cuing on awareness, performance, and motivation in behavioral experimentation. Unpublished master’s thesis, Ohio State University. Milgram, S. (1965). Some conditions of obedience and disobedience to authority. Human Relations, 18, 57–75. Milgram, S., Bickman, L., and Berkowitz, L. (1969). Note on the drawing power of crowds of different size. Journal of Personality and Social Psychology, 13, 79–82. Miller, A. G. (1972). Role playing: An alternative to deception? American Psychologist, 27, 623–636. Miller, B. A., Pokorny, A. D., Valles, J., and Cleveland, S. E. (1970). Biased sampling in alcoholism treatment research. Quarterly Journal of Studies on Alcohol, 31, 97–107. Miller, S. E. (1966). Psychology experiments without subjects’ consent. Science, 152, 15. Milmoe, S. E. (1973). Communication of emotion by mothers of schizophrenic and normal young adults. Unpublished doctoral diss., Harvard University. Minor, M. W., Jr. (1967). Experimenter expectancy effect as a function of evaluation apprehension. Unpublished doctoral diss., University of Chicago. Minor, M. W. (1970). Experimenter-expectancy effect as a function of evaluation apprehension. Journal of Personality and Social Psychology, 15, 326–332. Mitchell, W., Jr. (1939). Factors affecting the rate of return on mailed questionnaires. Journal of the American Statistical Association, 34, 683–692. Moos, R. H., and Speisman, J. C. (1962). Group compatibility and productivity. Journal of Abnormal and Social Psychology, 57, 190–196. Mosteller, F. (1968). Association and estimation in contingency tables. 
Journal of the American Statistical Association, 63, 1–28. Mosteller, F., and Bush, R. R. (1954). Selected quantitative techniques. In G. Lindzey, Ed. Handbook of Social Psychology, Vol. I. Cambridge, Mass.: Addison-Wesley. Mulry, R. C., and Dunbar, P. (n.d.). Homogeneity of a subject sample: A study of subject motivation. Unpublished manuscript, University of Indiana.
Myers, T. I., Murphy, D. B., Smith, S., and Goffard, S. J. (1966). Experimental studies of sensory deprivation and social isolation. Technical Report 66–8, Contract DA 44–188–ARO–2, HumRRO, Washington, DC: George Washington University. Myers, T. I., Smith, S., and Murphy, D. B. (1967). Personological correlates of volunteering for and endurance of prolonged sensory deprivation. Unpublished manuscript. Neulinger, J., and Stein, M. I. (1971). Personality characteristics of volunteer subjects. Perceptual and Motor Skills, 32, 283–286. Newberry, B. H. (1973). Truth telling in subjects with information about experiments: Who is being deceived? Journal of Personality and Social Psychology, 25, 369–374. Newman, M. (1956). Personality Differences Between Volunteers and Nonvolunteers for Psychological Investigations. Doctoral diss., New York University School of Education. Ann Arbor, Mich: University Microfilms, No. 19, 999. Niles, P. (1964). The relationship of susceptibility and anxiety to acceptance of fear-arousing communications. Unpublished doctoral diss., Yale University. Norman, R. D. (1948). A review of some problems related to the mail questionnaire technique. Educational and Psychological Measurement, 8, 235–247. Nosanchuk, T. A., and Marchak, M. P. (1969). Pretest sensitization and attitude change. Public Opinion Quarterly, 33, 107–111. Nottingham, J. A. (1972). The N and the out: Additional information on participants in psychological experiments. Journal of Social Psychology, 88, 299–300. Nunnally, J., and Bobren, H. (1959). Variables governing the willingness to receive communications on mental health. Journal of Personality, 27, 38–46. Oakes, W. (1972). External validity and the use of real people as subjects. American Psychologist, 27, 959–962. Ohler, F. D. (1971). The effects of four sources of experimental bias: Evaluation apprehension, cueing, volunteer status, and choice.
Dissertation Abstracts International, 32 (5-A), 2800. O’Leary, V. E. (1972). The Hawthorne effect in reverse: Trainee orientation for the hard-core unemployed woman. Journal of Applied Psychology, 56, 491–494. Olsen, G. P. (1968). Need for approval, expectancy set, and volunteering for psychological experiments. Unpublished doctoral diss., Northwestern University. Ora, J. P., Jr. (1965). Characteristics of the volunteer for psychological investigations. Technical Report, No. 27, November. Vanderbilt University, Contract Nonr 2149 (03). Ora, J. P., Jr. (1966). Personality characteristics of college freshman volunteers for psychological experiments. Unpublished master’s thesis, Vanderbilt University. Orlans, H. (1967). Developments in federal policy toward university research. Science, 155, 665–668. Orne, M. T. (1959). The nature of hypnosis: Artifact and essence. Journal of Abnormal and Social Psychology, 58, 277–299. Orne, M. T. (1962a). On the social psychology of the psychological experiment: With particular reference to demand characteristics and their implications. American Psychologist, 17, 776–783. Orne, M. T. (1962b). Problems and research areas. In Medical Uses of Hypnosis, Symposium 8, April. New York: Group for the Advancement of Psychiatry. Orne, M. T. (1969). Demand characteristics and the concept of quasi-controls. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press. Orne, M. T. (1970). Hypnosis, motivation, and the ecological validity of the psychological experiment. Nebraska Symposium on Motivation, 18, 187–265. Orne, M. T. (1971). The simulation of hypnosis: Why, how, and what it means. International Journal of Clinical and Experimental Hypnosis, 19, 183–210. Orne, M. T. (1972). Can a hypnotized subject be compelled to carry out otherwise unacceptable behavior?: A discussion. International Journal of Clinical and Experimental Hypnosis, 20, 101–117. Orne, M. T., and Scheibe, K. E. (1964). 
The contribution of nondeprivation factors in the production of sensory deprivation effects: The psychology of the ‘‘panic button.’’ Journal of Abnormal and Social Psychology, 68, 3–12. Orne, M. T., Sheehan, P. W., and Evans, F. J. (1968). Occurrence of posthypnotic behavior outside the experimental setting. Journal of Personality and Social Psychology, 9, 189–196. Orne, M. T., Thackray, R. I., and Paskewitz, D. A. (1972). On the detection of deception: A model for the study of physiological effects of psychological stimuli. In N. S. Greenfield and R. A. Sternbach, Eds., Handbook of Psychophysiology. New York: Holt, Rinehart and Winston.
Ostrom, T. M. (1973). The bogus pipeline: A new ignis fatuus? Psychological Bulletin, 79, 252–259. Pace, C. R. (1939). Factors influencing questionnaire returns from former university students. Journal of Applied Psychology, 23, 388–397. Page, M. M. (1968). Modification of figure-ground perception as a function of awareness of demand characteristics. Journal of Personality and Social Psychology, 9, 59–66. Page, M. M. (1969). Social psychology of a classical conditioning of attitudes experiment. Journal of Personality and Social Psychology, 11, 177–186. Page, M. M. (1970). Role of demand awareness in the communicator credibility effect. Journal of Social Psychology, 82, 57–66. Page, M. M. (1971a). Effects of evaluation apprehension on cooperation in verbal conditioning. Journal of Experimental Research in Personality, 5, 85–91. Page, M. M. (1971b). Postexperimental assessment of awareness in attitude conditioning. Educational and Psychological Measurement, 31, 891–906. Page, M. M. (1972). Demand awareness and the verbal operant conditioning experiment. Journal of Personality and Social Psychology, 23, 372–378. Page, M. M. (1973). On detecting demand awareness by postexperimental questionnaire. Journal of Social Psychology, 91, 305–323. Page, M. M., and Lumia, A. R. (1968). Cooperation with demand characteristics and the bimodal distribution of verbal conditioning data. Psychonomic Science, 12, 243–244. Pan, Ju-Shu (1951). Social characteristics of respondents and non respondents in a questionnaire study of later maturity. Journal of Applied Psychology, 35, 120–121. Parlee, M. B. (1974). Menstruation and voluntary participation in a psychological experiment: A note on ‘‘volunteer artifacts.’’ Unpublished manuscript, Radcliffe Institute, Cambridge, Mass. Parten, M. (1950). Surveys, Polls, and Samples: Practical Procedures. New York: Harper. Pastore, N. (1949). The Nature–Nurture Controversy. New York: King’s Crown Press. Pauling, F. J., and Lana, R. E. (1969).
The effects of pretest commitment and information upon opinion change. Educational and Psychological Measurement, 29, 653–663. Pavlos, A. J. (1972). Debriefing effects for volunteer and nonvolunteer subjects’ reactions to bogus physiological feedback. Paper read at Southern Society for Philosophy and Psychology, St. Louis, March–April. Pellegrini, R. J. (1972). Ethics and identity: A note on the call to conscience. American Psychologist, 27, 896–897. Perlin, S., Pollin, W., and Butler, R. N. (1958). The experimental subject: 1. The psychiatric evaluation and selection of a volunteer population. American Medical Association Archives of Neurology and Psychiatry, 80, 65–70. Philip, A. E. and McCulloch, J. W. (1970). Test–retest characteristics of a group of attempted suicide patients. Journal of Consulting and Clinical Psychology, 34, 144–147. Phillips, W. M., Jr. (1951). Weaknesses of the mail questionnaire: A methodological study. Sociology and Social Research, 35, 260–267. Politz, A., and Brumbach, R. (1947). Can an advertiser believe what mail surveys tell him? Printers’ Ink, June 20, 48–52. Pollin, W., and Perlin, S. (1958). Psychiatric evaluation of ‘‘normal control’’ volunteers. American Journal of Psychiatry, 115, 129–133. Poor, D. (1967). The social psychology of questionnaires. Unpublished bachelor’s thesis, Harvard College. Price, D. O. (1950). On the use of stamped return envelopes with mail questionnaires. American Sociological Review, 15, 672–673. Pucel, D. J., Nelson, H. F., and Wheeler, D. N. (1971). Questionnaire follow-up returns as a function of incentives and responder characteristics. Vocational Guidance Quarterly, 19, 188–193. Quay, H. C., and Hunt, W. A. (1965). Psychopathy, neuroticism, and verbal conditioning. Journal of Consulting Psychology, 29, 283. Raffetto, A. M. (1968). Experimenter effects on subjects’ reported hallucinatory experiences under visual and auditory deprivation. 
Paper presented at the Midwestern Psychological Association meeting, Chicago, Ill. Ramsay, R. W. (1970). Introversion-extraversion and volunteering for testing. British Journal of Social and Clinical Psychology, 9, 89.
Raymond, B., and King, S. (1973). Value systems of volunteer and nonvolunteer subjects. Psychological Reports, 32, 1303–1306. Reid, S. (1942). Respondents and non-respondents to mail questionnaires. Educational Research Bulletin, 21, 87–96. Remington, R. E., and Strongman, K. T. (1972). Operant facilitation during a pre-reward stimulus: Differential effects in human subjects. British Journal of Psychology, 63, 237–242. Resnick, J. H., and Schwartz, T. (1973). Ethical standards as an independent variable in psychological research. American Psychologist, 28, 134–139. Reuss, C. F. (1943). Differences between persons responding and not responding to a mailed questionnaire. American Sociological Review, 8, 433–438. Richards, T. W. (1960). Personality of subjects who volunteer for research on a drug (mescaline). Journal of Projective Techniques, 24, 424–428. Richter, C. P. (1959). Rats, man, and the welfare state. American Psychologist, 14, 18–28. Riecken, H. W. (1962). A program for research on experiments in social psychology. In N. F. Washburne, Ed., Decisions, Values and Groups. Vol. II, New York: Pergamon, pp. 25–41. Riegel, K. F., Riegel, R. M., and Meyer, G. (1967). A study of the dropout rates in longitudinal research on aging and the prediction of death. Journal of Personality and Social Psychology, 5, 342–348. Riggs, M. M., and Kaess, W. (1955). Personality differences between volunteers and nonvolunteers. Journal of Psychology, 40, 229–245. Ring, K. (1967). Experimental social psychology: Some sober questions about some frivolous values. Journal of Experimental Social Psychology, 3, 113–123. Robins, L. N. (1963). The reluctant respondent. Public Opinion Quarterly, 27, 276–286. Robinson, R. A., and Agisim, P. (1951). Making mail surveys more reliable. Journal of Marketing, 15, 415–424. Rokeach, M. (1966). Psychology experiments without subjects’ consent. Science, 152, 15. Rollins, M. (1940). 
The practical use of repeated questionnaire waves. Journal of Applied Psychology, 24, 770–772. Rose, C. L. (1965). Representativeness of volunteer subjects in a longitudinal aging study. Human Development, 8, 152–156. Rosen, E. (1951). Differences between volunteers and non-volunteers for psychological studies. Journal of Applied Psychology, 35, 185–193. Rosenbaum, M. E. (1956). The effect of stimulus and background factors on the volunteering response. Journal of Abnormal and Social Psychology, 53, 118–121. Rosenbaum, M. E., and Blake, R. R. (1955). Volunteering as a function of field structure. Journal of Abnormal and Social Psychology, 50, 193–196. Rosenberg, M. J. (1965). When dissonance fails: On eliminating evaluation apprehension from attitude measurement. Journal of Personality and Social Psychology, 1, 28–42. Rosenberg, M. J. (1969). The conditions and consequences of evaluation apprehension. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press. Rosenhan, D. (1967). On the social psychology of hypnosis research. In J. E. Gordon, Ed., Handbook of Clinical and Experimental Hypnosis. New York: Macmillan, pp. 481–510. Rosenhan, D. (1968). Some origins of concern for others. Educational Testing Service Research Bulletin, no. 68–33. Rosenhan, D., and White, G. M. (1967). Observation and rehearsal as determinants of prosocial behavior. Journal of Personality and Social Psychology, 5, 424–431. Rosenthal, R. (1965). The volunteer subject. Human Relations, 18, 389–406. Rosenthal, R. (1966). Experimenter Effects in Behavioral Research. New York: Appleton-Century-Crofts. Rosenthal, R. (1967). Covert communication in the psychological experiment. Psychological Bulletin, 67, 356–367. Rosenthal, R. (1969). Interpersonal expectations: Effects of the experimenter’s hypothesis. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press. Rosenthal, R., and Rosnow, R. L. (1969). 
The volunteer subject. In R. Rosenthal and R. L. Rosnow, Eds., Artifact in Behavioral Research. New York: Academic Press. Rosenzweig, S. (1933). The experimental situation as a psychological problem. Psychological Review, 40, 337–354.
Rosenzweig, S. (1952). The investigation of repression as an instance of experimental idiodynamics. Psychological Review, 59, 339–345. Rosnow, R. L. (1968). One-sided versus two-sided communication under indirect awareness of persuasive intent. Public Opinion Quarterly, 32, 95–101. Rosnow, R. L. (1970). When he lends a helping hand, bite it. Psychology Today, 4, no. 1, 26–30. Rosnow, R. L. (1971). Experimental artifact. In The Encyclopedia of Education, Vol. III. New York: Macmillan and Free Press. Rosnow, R. L., and Aiken, L. S. (1973). Mediation of artifacts in behavioral research. Journal of Experimental Social Psychology, 9, 181–201. Rosnow, R. L., Goodstadt, B. E., Suls, J. M., and Gitter, A. G. (1973). More on the social psychology of the experiment: When compliance turns to self-defense. Journal of Personality and Social Psychology, 27, 337–343. Rosnow, R. L., Holper, H. M., and Gitter, A. G. (1973). More on the reactive effects of pretesting in attitude research: Demand characteristics or subject commitment? Educational and Psychological Measurement, 33, 7–17. Rosnow, R. L., and Robinson, E. J., Eds. (1967). Experiments in Persuasion. New York: Academic Press. Rosnow, R. L., and Rosenthal, R. (1966). Volunteer subjects and the results of opinion change studies. Psychological Reports, 19, 1183–1187. Rosnow, R. L., and Rosenthal, R. (1970). Volunteer effects in behavioral research. In K. H. Craik, B. Kleinmuntz, R. L. Rosnow, R. Rosenthal, J. A. Cheyne, and R. H. Walters, New Directions in Psychology, Vol. IV. New York: Holt, Rinehart and Winston. Rosnow, R. L., and Rosenthal, R. (1974). Taming of the volunteer problem: On coping with artifacts by benign neglect. Journal of Personality and Social Psychology, 30, 188–190. Rosnow, R. L., Rosenthal, R., McConochie, R. M., and Arms, R. L. (1969). Volunteer effects on experimental outcomes. Educational and Psychological Measurement, 29, 825–846. Rosnow, R. L., and Suls, J. M. (1970). 
Reactive effects of pretesting in attitude research. Journal of Personality and Social Psychology, 15, 338–343. Ross, J. A., and Smith, P. (1965). Experimental designs of the single stimulus, all-or-nothing type. American Sociological Review, 30, 68–80. Ross, S., Trumbull, R., Rubinstein, E., and Rasmussen, J. E. (1966). Simulation, shelters, and subjects. American Psychologist, 21, 815–817. Rothney, J. W. M., and Mooren, R. L. (1952). Sampling problems in follow-up research. Occupations, 30, 573–578. Rubin, Z. (1969). The social psychology of romantic love. Unpublished doctoral diss., University of Michigan. Rubin, Z. (1973a). Disclosing oneself to a stranger: I. Effects of reciprocity, anonymity, sex roles, and demand characteristics. Unpublished manuscript, Harvard University. Rubin, Z. (1973b). Liking and Loving: An Invitation to Social Psychology. New York: Holt, Rinehart and Winston. Rubin, Z. (In press). Disclosing oneself to a stranger: Reciprocity and its limits. Journal of Experimental Social Psychology. Rubin, Z., and Moore, J. C., Jr. (1971). Assessment of subjects’ suspicions. Journal of Personality and Social Psychology, 17, 163–170. Ruebhausen, O. M., and Brim, O. G. (1966). Privacy and behavioral research. American Psychologist, 21, 423–437. Salzinger, K. (1959). Experimental manipulation of verbal behavior: A review. Journal of General Psychology, 61, 65–94. Sarason, I. G., and Smith, R. E. (1971). Personality. Annual Review of Psychology, 22, 393–446. Sarbin, T. R., and Allen, V. L. (1968). Role theory. In G. Lindzey and E. Aronson, Eds., The Handbook of Social Psychology, rev. ed. Vol. I. Reading, Mass.: Addison-Wesley. Sarbin, T. R., and Chun, K. T. (1964). A confirmation of the choice of response hypothesis in perceptual defense measurement. Paper presented at the Western Psychological Association meeting, Portland, Ore. Sasson, R., and Nelson, T. M. (1969). The human experimental subject in context. The Canadian Psychologist, 10, 409–437. 
Schachter, S. (1959). The Psychology of Affiliation. Stanford: Stanford University Press.
Schachter, S., and Hall, R. (1952). Group-derived restraints and audience persuasion. Human Relations, 5, 397–406. Schaie, K. W., Labouvie, G. V., and Barrett, T. J. (1973). Selective attrition effects in a fourteen-year study of adult intelligence. Journal of Gerontology, 28, 328–334. Schappe, R. H. (1972). The volunteer and the coerced subject. American Psychologist, 27, 508–509. Scheier, I. H. (1959). To be or not to be a guinea pig: Preliminary data on anxiety and the volunteer for experiment. Psychological Reports, 5, 239–240. Schofield, J. W. (1972). A framework for viewing the relation between attitudes and actions. Unpublished doctoral diss., Harvard University. Schofield, J. W. (1974). The effect of norms, public disclosure, and need for approval on volunteering behavior consistent with attitudes. Unpublished manuscript, Harvard University. Schopler, J. (1967). An investigation of sex differences on the influence of dependence. Sociometry, 30, 50–63. Schopler, J., and Bateson, N. (1965). The power of dependence. Journal of Personality and Social Psychology, 2, 247–254. Schopler, J., and Matthews, M. W. (1965). The influence of the perceived causal locus of partner’s dependence on the use of interpersonal power. Journal of Personality and Social Psychology, 2, 609–612. Schubert, D. S. P. (1964). Arousal seeking as a motivation for volunteering: MMPI scores and central-nervous-system-stimulant use as suggestive of a trait. Journal of Projective Techniques and Personality Assessment, 28, 337–340. Schultz, D. P. (1967a). Birth order of volunteers for sensory restriction research. Journal of Social Psychology, 73, 71–73. Schultz, D. P. (1967b). Sensation-seeking and volunteering for sensory deprivation. Paper read at Eastern Psychological Association, Boston, April. Schultz, D. P. (1967c). The volunteer subject in sensory restriction research. Journal of Social Psychology, 72, 123–124. Schultz, D. P. (1969). 
The human subject in psychological research. Psychological Bulletin, 72, 214–228. Schwirian, K. P. and Blaine, H. R. (1966). Questionnaire-return bias in the study of blue-collar workers. Public Opinion Quarterly, 30, 656–663. Scott, C. (1961). Research on mail surveys. Journal of the Royal Statistical Society, Ser. A, 124, 143–195. Seeman, J. (1969). Deception in psychological research. American Psychologist, 24, 1025–1028. Sheridan, K., and Shack, J. R. (1970). Personality correlates of the undergraduate volunteer subject. Journal of Psychology, 76, 23–26. Sherman, S. R. (1967). Demand characteristics in an experiment on attitude change. Sociometry, 30, 246–260. Sherwood, J. J., and Nataupsky, M. (1968). Predicting the conclusions of negro–white intelligence research from biographical characteristics of the investigator. Journal of Personality and Social Psychology, 8, 53–58. Shor, R. E., and Orne, E. C. (1963). Norms on the Harvard Group Scale of Hypnotic Susceptibility, Form A. International Journal of Clinical and Experimental Hypnosis, 11, 39–47. Short, R. R., and Oskamp, S. (1965). Lack of suggestion effects on perceptual isolation (sensory deprivation) phenomena. Journal of Nervous and Mental Disease, 141, 190–194. Shuttleworth, F. K. (1940). Sampling errors involved in incomplete returns to mail questionnaires. Psychological Bulletin, 31, 437 (Abstract). Siegman, A. (1956). Responses to a personality questionnaire by volunteers and nonvolunteers to a Kinsey interview. Journal of Abnormal and Social Psychology, 52, 280–281. Siess, T. F. (1973). Personality correlates of volunteers’ experiment preferences. Canadian Journal of Behavioural Science, 5, 253–263. Sigall, H., Aronson, E., and Van Hoose, T. (1970). The cooperative subject: Myth or reality? Journal of Experimental Social Psychology, 6, 1–10. Sigall, H., and Page, R. (1971). Current stereotypes: A little fading, a little faking. Journal of Personality and Social Psychology, 18, 247–255. 
Sigall, H., and Page, R. (1972). Reducing attenuation in the expression of interpersonal affect via the bogus pipeline. Sociometry, 35, 629–642.
Silverman, I. (1964). Note on the relationship of self-esteem to subject self-selection. Perceptual and Motor Skills, 19, 769–770. Silverman, I. (1965). Motives underlying the behavior of the subject in the psychological experiment. Paper read at American Psychological Association, Chicago, September. Silverman, I. (1968). Role-related behavior of subjects in laboratory studies of attitude change. Journal of Personality and Social Psychology, 8, 343–348. Silverman, I. (1970). The psychological subject in the land of make-believe. Contemporary Psychology, 15, 718–721. Silverman, I., and Kleinman, D. (1967). A response deviance interpretation of the effects of experimentally induced frustration on prejudice. Journal of Experimental Research in Personality, 2, 150–153. Silverman, I., and Margulis, S. (1973). Experiment title as a source of sampling bias in commonly used ‘‘subject-pool’’ procedures. Canadian Psychologist, 14, 197–201. Silverman, I., and Shulman, A. D. (1969). Effects of hunger on responses to demand characteristics in the measurement of persuasion. Psychonomic Science, 15, 201–202. Silverman, I., and Shulman, A. D. (1970). A conceptual model of artifact in attitude change studies. Sociometry, 33, 97–107. Silverman, I., Shulman, A. D., and Wiesenthal, D. L. (1970). Effects of deceiving and debriefing psychological subjects on performance in later experiments. Journal of Personality and Social Psychology, 14, 203–212. Silverman, I., Shulman, A. D., and Wiesenthal, D. L. (1972). The experimenter as a source of variance in psychological research: Modeling and sex effects. Journal of Personality and Social Psychology, 21, 219–227. Silverman, I. W. (1967). Incidence of guilt reactions in children. Journal of Personality and Social Psychology, 7, 338–340. Sirken, M. G., Pifer, J. W., and Brown, M. L. (1960). Survey procedures for supplementing mortality statistics. American Journal of Public Health, 50, 1753–1764. Smart, R. G. (1966). 
Subject selection bias in psychological research. Canadian Psychologist, 7a, 115–121. Smith, M. B. (1973). Protection of human subjects—Ethics and politics. APA Monitor, 4, 2. Smith, R. E. (1969). The other side of the coin. Contemporary Psychology, 14, 628–630. Solomon, R. L. (1949). An extension of control group design. Psychological Bulletin, 46, 137–150. Sommer, R. (1968). Hawthorne dogma. Psychological Bulletin, 70, 592–595. Speer, D. C., and Zold, A. (1971). An example of self-selection bias in follow-up research. Journal of Clinical Psychology, 27, 64–68. Spiegel, D., and Keith-Spiegel, P. (1969). Volunteering for a high-demand, low-reward project: Sex differences. Journal of Projective Techniques and Personality Assessment, 33, 513–517. Spielberger, C. D., and DeNike, L. D. (1966). Descriptive behaviorism versus cognitive theory in verbal operant conditioning. Psychological Bulletin, 73, 306–326. Stanton, F. (1939). Notes on the validity of mail questionnaire returns. Journal of Applied Psychology, 23, 95–104. Staples, F. R., and Walters, R. H. (1961). Anxiety, birth order, and susceptibility to social influence. Journal of Abnormal and Social Psychology, 62, 716–719. Star, S. A., and Hughes, H. M. (1950). Report on an educational campaign: The Cincinnati plan for the United Nations. American Journal of Sociology, 55, 389–400. Stein, K. B. (1971). Psychotherapy patients as research subjects: Problems in cooperativeness, representativeness, and generalizability. Journal of Consulting and Clinical Psychology, 37, 99–105. Steiner, I. D. (1972). The evils of research: Or what my mother didn’t tell me about the sins of academia. American Psychologist, 27, 766–768. Straits, B. C., and Wuebben, P. L. (1973). College students’ reactions to social scientific experimentation. Sociological Methods and Research, 1, 355–386. Straits, B. C., Wuebben, P. L., and Majka, T. J. (1972). Influences on subjects’ perceptions of experimental research situations. 
Sociometry, 35, 499–518. Streib, G. F. (1966). Participants and drop-outs in a longitudinal study. Journal of Gerontology, 21, 200–209. Stricker, L. J. (1967). The true deceiver. Psychological Bulletin, 68, 13–20.
Stricker, L. J., Messick, S., and Jackson, D. N. (1967). Suspicion of deception: Implications for conformity research. Journal of Personality and Social Psychology, 5, 379–389. Stricker, L. J., Messick, S., and Jackson, D. N. (1969). Evaluating deception in psychological research. Psychological Bulletin, 71, 343–351. Stricker, L. J., Messick, S., and Jackson, D. N. (1970). Conformity, anticonformity, and independence: Their dimensionality and generality. Journal of Personality and Social Psychology, 16, 494–507. Stumberg, D. (1925). A comparison of sophisticated and naive subjects by the association-reaction method. American Journal of Psychology, 36, 88–95. Suchman, E. A. (1962). An analysis of ‘‘bias’’ in survey research. Public Opinion Quarterly, 26, 102–111. Suchman, E., and McCandless, B. (1940). Who answers questionnaires? Journal of Applied Psychology, 24, 758–769. Suedfeld, P. (1964). Birth order of volunteers for sensory deprivation. Journal of Abnormal and Social Psychology, 68, 195–196. Suedfeld, P. (1968). Anticipated and experienced stress in sensory deprivation as a function of orientation and ordinal position. Journal of Social Psychology, 76, 259–263. Suedfeld, P. (1969). Sensory deprivation stress: Birth order and instructional set as interacting variables. Journal of Personality and Social Psychology, 11, 70–74. Sullivan, D. S., and Deiker, T. E. (1973). Subject–experimenter perceptions of ethical issues in human research. American Psychologist, 28, 587–591. Swingle, P. G., Ed. (1973). Social Psychology in Natural Settings: A Reader in Field Experimentation. Chicago: Aldine. Tacon, P. H. D. (1965). The effects of sex of E on obtaining Ss for psychological experiments. Canadian Psychologist, 6a, 349–352. Taffel, C. (1955). Anxiety and the conditioning of verbal behavior. Journal of Abnormal and Social Psychology, 51, 496–501. Taub, S. I., and Farrow, B. J. 
(1973). Reinforcement effects on intersubject communication: The scuttlebutt effect. Perceptual and Motor Skills, 37, 15–22. Teele, J. E. (1962). Measures of social participation. Social Problems, 10, 31–39. Teele, J. E. (1965). An appraisal of research on social participation. Sociological Quarterly, 6, 257–267. Teele, J. E. (1967). Correlates of voluntary social participation. Genetic Psychology Monographs, 76, 165–204. Thistlethwaite, D. L., and Wheeler, N. (1966). Effects of teacher and peer subcultures upon student aspirations. Journal of Educational Psychology, 57, 35–47. Thomas, E. J., and Biddle, B. J. (1966). The nature and history of role theory. In B. J. Biddle and E. J. Thomas, Eds., Role Theory: Concepts and Research. New York: Wiley. Tiffany, D. W., Cowan, J. R., and Blinn, E. (1970). Sample and personality biases of volunteer subjects. Journal of Consulting and Clinical Psychology, 35, 38–43. Toops, H. A. (1926). The returns from follow-up letters to questionnaires. Journal of Applied Psychology, 10, 92–101. Trotter, S. (1974). Strict regulations proposed for human experimentation. APA Monitor, 5, 1, 8. Tukey, J. W. (1970). Exploratory Data Analysis: Limited Preliminary Edition. Reading, Mass.: Addison-Wesley. Tune, G. S. (1968). A note on differences between cooperative and non-cooperative volunteer subjects. British Journal of Social and Clinical Psychology, 7, 229–230. Tune, G. S. (1969). A further note on the differences between cooperative and non-cooperative volunteer subjects. British Journal of Social and Clinical Psychology, 8, 183–184. Underwood, B. J., Schwenn, E., and Keppel, G. (1964). Verbal learning as related to point of time in the school term. Journal of Verbal Learning and Verbal Behavior, 3, 222–225. Valins, S. (1967). Emotionality and information concerning internal reactions. Journal of Personality and Social Psychology, 6, 458–463. Varela, J. A. (1964). A cross-cultural replication of an experiment involving birth order. 
Journal of Abnormal and Social Psychology, 69, 456–457. Verinis, J. S. (1968). The disbelieving subject. Psychological Reports, 22, 977–981. Vidmar, N., and Hackman, J. R. (1971). Interlaboratory generalizability of small group research: An experimental study. Journal of Social Psychology, 83, 129–139.
Vinacke, W. E. (1954). Deceiving experimental subjects. American Psychologist, 9, 155. Wagner, N. N. (1968). Birth order of volunteers: Cross-cultural data. Journal of Social Psychology, 74, 133–134. Wallace, D. (1954). A case for-and against—mail questionnaires. Public Opinion Quarterly, 18, 40–52. Wallace, J., and Sadalla, E. (1966). Behavioral consequences of transgression: I. The effects of social recognition. Journal of Experimental Research in Personality, 1, 187–194. Wallin, P. (1949). Volunteer subjects as a source of sampling bias. American Journal of Sociology, 54, 539–544. Walsh, J. (1973). Addiction research center: Pioneers still on the frontier. Science, 182, 1229–1231. Ward, C. D. (1964). A further examination of birth order as a selective factor among volunteer subjects. Journal of Abnormal and Social Psychology, 69, 311–313. Ward, W. D., and Sandvold, K. D. (1963). Performance expectancy as a determinant of actual performance: A partial replication. Journal of Abnormal and Social Psychology, 67, 293–295. Warren, J. R. (1966). Birth order and social behavior. Psychological Bulletin, 65, 38–49. Waters, L. K., and Kirk, W. E. (1969). Characteristics of volunteers and nonvolunteers for psychological experiments. Journal of Psychology, 73, 133–136. Webb, E. J., Campbell, D. T., Schwartz, R. D., and Sechrest, L. (1966). Unobtrusive Measures: Nonreactive Research in the Social Sciences. Chicago: Rand-McNally. Weber, S. J., and Cook, T. D. (1972). Subject effects in laboratory research: An examination of subject roles, demand characteristics, and valid inference. Psychological Bulletin, 77, 273–295. Wechsler, D. (1958). The Measurement and Appraisal of Adult Intelligence, 4th ed. Baltimore: Williams and Wilkins. Weigel, R. G., Weigel, V. M., and Hebert, J. A. (1971). Non-volunteer subjects: Temporal effects. Psychological Reports, 28, 191–192. Weiss, J. H. (1970). Birth order and physiological stress response. Child Development, 41, 461–470. Weiss, J. 
H., Wolf, A., and Wiltsey, R. G. (1963). Birth order, recruitment conditions, and preferences for participation in group versus non-group experiments. American Psychologist, 18, 356 (Abstract). Weiss, L. R. (1968). The effect of subject, experimenter and task variables on subject compliance with the experimenter’s expectation. Unpublished doctoral diss., S.U.N.Y., Buffalo. Weiss, R. F., Buchanan, W., Altstatt, L., and Lombardo, J. P. (1971). Altruism is rewarding. Science, 171, 1262–1263. Weitz, S. (1968). The subject: The other variable in experimenter bias research. Unpublished manuscript, Harvard University. Wells, B. W. P., and Schofield, C. B. S. (1972). Personality characteristics of homosexual men suffering from sexually transmitted diseases. British Journal of Venereal Diseases, 48, 75–78. Welty, G. (n.d.). The volunteer effect and the Kuhn-McPartland Twenty Statements Test. Unpublished manuscript. White, H. A., and Schumsky, D. A. (1972). Prior information and ‘‘awareness’’ in verbal conditioning. Journal of Personality and Social Psychology, 24, 162–165. White, M. A., and Duker, J. (1971). Some unprinciples of psychological research. American Psychologist, 26, 397–399. Wicker, A. W. (1968a). Overt behaviors toward the church by volunteers, follow-up volunteers, and non-volunteers in a church survey. Psychological Reports, 22, 917–920. Wicker, A. W. (1968b). Requirements for protecting privacy of human subjects: Some implications for generalization of research findings. American Psychologist, 23, 70–72. Wicker, A. W., and Bushweiler, G. (1970). Perceived fairness and pleasantness of social exchange situations: Two factorial studies of inequity. Journal of Personality and Social Psychology, 15, 63–75. Wicker, A. W., and Pomazal, R. J. (1970). The relationship between attitudes and behavior as a function of specificity of attitude object and presence of a significant person during assessment conditions. Unpublished manuscript, University of Illinois. 
Williams, J. H. (1964). Conditioning of verbalization: A review. Psychological Bulletin, 62, 383–393. Willis, R. H. (1965). Conformity, independence, and anticonformity. Human Relations, 18, 373–388. Willis, R. H., and Willis, Y. A. (1970). Role playing versus deception: An experimental comparison. Journal of Personality and Social Psychology, 16, 472–477.
Wilson, P. R., and Patterson, J. (1965). Sex differences in volunteering behavior. Psychological Reports, 16, 976. Winer, B. J. (1968). The error. Psychometrika, 33, 391–403. Wolf, A. (1967). Personal communication. Oct. 21. Wolf, A., and Weiss, J. H. (1965). Birth order, recruitment conditions, and volunteering preference. Journal of Personality and Social Psychology, 2, 269–273. Wolfensberger, W. (1967). Ethical issues in research with human subjects. Science, 155, 47–51. Wolfgang, A. (1967). Sex differences in abstract ability of volunteers and nonvolunteers for concept learning experiments. Psychological Reports, 21, 509–512. Wolfle, D. (1960). Research with human subjects. Science, 132, 989. Wrightsman, L. S. (1966). Predicting college students’ participation in required psychology experiments. American Psychologist, 21, 812–813. Wuebben, P. L. (1967). Honesty of subjects and birth order. Journal of Personality and Social Psychology, 5, 350–352. Wunderlich, R. A., and Becker, J. (1969). Obstacles to research in the social and behavioral sciences. Catholic Educational Review, 66, 722–729. Young, F. W. (1968). A FORTRAN IV program for nonmetric multidimensional scaling. L. L. Thurstone Psychometric Laboratory Monograph, No. 56. Zamansky, H. S., and Brightbill, R. F. (1965). Attitude differences of volunteers and nonvolunteers and of susceptible and nonsusceptible hypnotic subjects. International Journal of Clinical and Experimental Hypnosis, 13, 279–290. Zeigarnik, B. (1927). Das Behalten Erledigter und Unerledigter Handlungen. Psychologische Forschung, 9, 1–85. Zimmer, H. (1956). Validity of extrapolating nonresponse bias from mail questionnaire follow-ups. Journal of Applied Psychology, 40, 117–121. Zuckerman, M., Schultz, D. P., and Hopkins, T. R. (1967). Sensation-seeking and volunteering for sensory deprivation and hypnosis experiments. Journal of Consulting Psychology, 31, 358–363.
Author Index
Aas, A., 481, 652 Abeles, N., 59, 70, 73, 88, 695, 725, 733, 840 Abelson, R. P., 42, 46, 213, 219, 262–263, 281, 283 Abrahams, D., 23, 46 Ad hoc Committee on Ethical Standards in Psychological Research, 771, 840 Adair, J. G., x, 161–163, 170–172, 175, 183, 186, 189, 204, 209, 641, 650, 671–672, 674, 759, 764, 780, 782, 816, 818, 840 Adams, H. E., 61, 89, 707–708, 846 Adams, J. S., 280, 283 Adams, M., 674, 773, 840 Adams, S., 729, 840 Addington, D. W., 27, 43 Aderman, D., 714, 716, 752, 840 Adler, N. E., 170–172, 175, 177, 185, 204 Adorno, T. W., 103, 780, 840 Affleck, D. C., 300, 656 Agisim, P., 838–839, 856 Aiken, L. S., 699–701, 783, 805, 807, 809, 813–814, 821, 825, 840, 857 Alcock, W., 491, 666 Alexander, C. N., Jr., 782, 840 Allen, C. T., x Allen, J. C., 822, 846 Allen, S., 331–332, 664 Allen, V. L., 782, 841, 857 Allport, F. H., 390, 652 Allport, G. W., 149, 204, 398, 511, 624, 652 Allyn, J., 30, 33, 38, 43 Alm, R. M., 699, 841 Altstatt, L., 817, 861 Altus, W. D., 691–693, 841 Alumbaugh, R. V., 773, 841 American Psychological Association, 771, 841 Anastasiow, N. J., 765, 841 Anderson, D. F., 199, 202, 204, 643, 649 Anderson, M., 404, 652 Anderson, N. H., 26, 43, 103, 105, 108 Anthony, S., 783 Applebaum, A. S., 752, 815, 848 Archer, D., 649–650 Argyle, M., 264, 280, 283 Argyris, C., 764, 774, 841 Aristotle, 27 Arkoff, A., 778, 848
Armelagos, G. J., 141, 205 Armer, J. M., 680, 725, 729, 845 Arms, R. L., 680, 692, 714, 716, 791–793, 857 Arnold, J. D., 674, 729–731, 733, 747, 750, 852 Aronson, E., 23, 27, 37, 43, 46, 51, 88, 213, 218–219, 228–229, 244, 260, 262–263, 277–278, 283, 292, 458, 535, 625, 652, 654, 754, 764, 802, 805, 810, 817, 823, 825–826, 841, 858 Asch, S. E., 277–278, 283, 481, 652 Ascough, J. C., 785, 841 Atkinson, J., 701, 703, 841 Azrin, N. H., 321, 323, 652 Babich, F. R., 593–594, 602, 652 Back, K. W., 60, 65, 79, 90, 361, 652, 685, 687, 699, 703–705, 717, 750, 757, 826, 841–842, 848 Bacon, F., 603 Bakan, D., 300, 324, 441, 553, 652 Baker, K. H., 403–404, 411, 664 Baker, R. A., Jr., 359, 665 Bales, R. F., 335, 509, 660 Ball, R. J., 765, 841 Ball, R. S., 773, 841 Ballachey, E. L., 796, 850 Bandura, A., 837, 652 Barber, B., 310–311, 652 Barber, T. X., 154, 170, 172, 175, 188, 204, 356, 401, 652, 818, 841 Barefoot, J. C., 675, 717, 788, 841 Barker, W. J., 680, 682, 700, 714, 716, 841 Barnard, P. G., 148, 204, 345, 352, 369–370, 390, 652 Barnes, J., 762, 850 Barnette, W. L., Jr., 727, 729, 765, 841 Barrett, T. J., 837–838, 858 Barzun, J., 466, 652 Bass, B. M., 357, 653, 701–702, 746, 841 Bass, M., 491, 666 Bateson, G., 511, 652 Bateson, M. C., 194, 209, 622, 663 Bateson, N., 684, 757, 858 Bauch, M., 298, 652 Bauer, R. A., 36, 43 Baumrind, D., 773, 841 Baur, E. J., 727, 733, 737, 765, 841
Author Index Beach, F. A., 48, 88, 672, 841 Bean, J. R., 280, 284, 818, 844 Bean, W. B., x, 49, 88, 292, 299–300, 312, 316, 318, 653, 674, 841 Beauchamp, K. L., 553, 653 Beck, W. S., 317, 319, 321, 465, 653 Becker, G., 58, 89, 694, 818, 842–843 Becker, H. G., 175, 204 Becker, J., 671, 862 Becker, L. A., 23, 29, 44 Becker, W. C., 56–58, 89, 691, 697, 846 Beckman, L., 773, 841 Beecher, H. K., x, 127, 132, 136, 592, 653, 771, 841 Beez, W. V., 168, 201–202, 204, 643, 649 Bell, C. R., 49, 54, 69, 88, 671, 720, 842 Bellak, L., 359, 653 Bellamy, R. Q., 51, 88, 749, 754, 842 Belmont, L., 691, 842 Belson, W. A., 52, 55, 72, 88, 687, 690, 729, 733, 737, 761, 842 Belt, J. A., 764, 842 Bem, D. J., 42–43, 120, 136, 817, 842 Bennett, C. M., 687, 695, 704–705, 720, 725, 842 Bennett, D. H., 825, 848 Bennett, E. B., 51, 88, 754, 842 Benney, M., 332, 341, 653 Benson, L. E., 53, 761, 842 Benson, S., 62, 73, 88, 711, 727, 733, 842 Bentler, P. M., 761, 842 Berg, I. A., 357, 653 Bergen, A. v., 685–686, 690, 756, 842, 860 Berger, A., 486, 664 Berger, D., 389, 653 Bergin, A. E., 27, 43 Beringer, J., 317 Berkowitz, H., 51, 88, 366, 653, 749, 754, 842 Berkowitz, L., 280, 286, 778, 783, 822, 842, 852–853 Berkson, J., 298, 320, 545, 653 Berkun, M., 280, 283 Bernstein, A. S., 439, 653 Bernstein, L., 428, 439, 653 Berzins, J. I., 196, 204 Betz, B. J., 195–196, 204, 624, 653 Bialek, H. M., 280, 283 Bickman, L., 280, 286, 822, 842, 853 Biddle, B. J., 782, 842, 860 Biegen, D. A., 201–202, 204 Binder, A., 329, 653 Binet, A., 299, 318, 644 Bingham, W. V. D., 466, 653 Birdwhistell, R. L., 497, 653 Birney, R. C., 357, 653 Bishop, B. R., 773, 841 Bitterman, M. E., 273, 283 Black, R. W., 760, 786, 842
Blaine, H. R., 762, 858 Blake, B. F., 643, 649 Blake, R. R., 51, 88, 91, 279, 281, 283, 285, 491, 658, 749, 754, 822, 842, 856 Blane, H. T., 196, 207, 624, 660 Blankenship, A. B., 319, 653 Bless, E., 751, 846 Blinn, E., 683, 695–696, 720, 737, 860 Block, B. L., 491, 659 Block, J., 271, 283 Blondlot, A., 298, 317 Boag, T. J., 144, 206 Boardman, J., 318 Bobran, H., 778, 854 Bock, R. D., 245, 262 Bogdonoff, M. D., 750, 842 Bohr, N., 94 Boice, R., 672, 842 Boies, K. G., 825, 842 Booman, W. P., 62, 73, 88, 711, 727, 733, 842 Bootzin, R. R., 170, 172, 175, 183, 185, 204 Borden, R., 755, 758, 847 Boring, E. G., x, 4–5, 7–11, 13, 139, 204, 267, 283, 297, 306, 310–311, 315–316, 603, 653, 828, 842 Boteram, N., 701–702, 717, 847 Boucher, R. G., 52, 59, 88, 761, 842 Bowen, B., 170, 172, 204 Bowers, J. M., 27, 43 Bowers, K. S., 202 Boylin, E. R., 706, 845 Bradley, W. H., 141, 204 Bradt, K., 765, 842 Brady, J. P., 52, 56–59, 65–66, 68, 89–90, 401, 658, 689–690, 692, 697, 746, 755, 761, 788, 842, 851–852 Bragg, B. W., 782, 842 Bray, C. W., 11, 14 Bregman, A. S., 207 Brehm, J. W., 213, 215, 263, 817, 842 Brehm, J., 30, 44 Brehm, M. L., 685, 750, 841–842 Brennan, E. P., 402, 658 Brewer, R. R., 158–159, 206 Brightbill, R. F., 52, 89, 92, 761, 842, 862 Brim, O. G., 49, 91, 674, 857 Britt, S. H., 53–54, 71, 89, 725, 761, 765, 845 Britton, J. H., 683–684, 729, 739, 762, 842 Britton, J. O., 683–684, 729, 739, 762, 842 Brock, T., 105, 109 Brock, T. C., 23, 29, 44, 48, 58, 89, 283, 694, 818, 842–843 Brockhaus, H. H., 23, 45 Brody, C. L., 782, 840 Brogden, W. J., 303, 367, 552, 653 Brooks, W. D., 794, 843 Brower, D., 70, 89, 725, 785, 843
Brown, J., 124 Brown, J. M., 117, 136, 361, 653 Brown, M. L., 687, 733, 735, 739, 758, 859 Brown, R., 215, 263, 276, 284, 825, 843 Brown, W. F., 59, 70, 73, 88, 695, 725, 733, 840 Brown, W., 523 Brownstein, A. J., 801, 847 Bruehl, D. K., 782, 843 Brumbach, R., 838–839, 855 Brunswik, E., 75, 89, 282, 284, 540, 563, 653, 826–828, 843 Bryan, J., 277, 285 Bryan, J. H., 279–280, 284, 752, 757, 815, 818, 843, 849 Bubash, S., 593–594, 602, 652 Buchanan, W., 817, 861 Buckhout, R., 486, 653, 780, 843 Burchinal, L. G., 62, 89, 711, 843 Burdick, H. A., 702–703, 843 Burdick, J. A., 837–838, 843 Burke, C. J., 114–116, 136 Burnham, B., 293 Burnham, J. R., 153, 156–158, 175, 179–180, 201–202, 204, 636, 638, 649 Burns, J. L., 837, 843 Bush, R. R., 154, 207, 415, 553, 633, 650, 660, 682, 853 Bushweiler, G., 824, 861 Buss, A. H., 329, 484, 653, 655 Butler, R. N., 68, 91, 855 Cahalan, D., 320, 583, 653 Cahen, L. S., 313, 654 Cairms, R. B., 888, 843 Calder, B. J., 818, 844 Calverley, D. S., 170, 172, 204, 356, 401, 652 Campbell, D. T., x, 4, 95, 103, 106–109, 122, 136, 264, 266–267, 269–272, 275–277, 279, 282, 284–286, 292, 309, 313, 342, 401, 654, 661, 666, 770, 786, 794, 812, 822, 843–844, 861 Campbell, E. H., 105, 109 Campitell, J., 690, 703–704, 844 Canady, H. G., 343, 654 Cantril, H., 341, 587, 654, 666, 684, 727, 729, 849 Capra, P. C., 57, 89, 692, 843 Carlisle, A., 270 Carlsmith, J. M., 27, 43, 51, 88, 215, 217, 219, 263, 277–278, 283, 285, 458, 471, 535, 625, 652, 654–655, 754, 823, 825, 841 Carlson, E. R., 340, 654 Carlson, J. A., 170, 172, 175, 191, 205 Carlson, R. L., 141, 205 Carlson, R., 340, 654, 825, 843 Carmichael, C. W., 27, 44 Carota, N., 40, 46, 145, 170, 208, 292, 346, 348, 350, 356, 381, 394, 448, 450, 496, 594, 459, 662 Carr, J. E., 717, 719–720, 762, 843
Carroll, J. D., 808, 843 Carroll, S. J., Jr., 680, 727, 733, 737, 765, 846 Carroll, W. F., 699, 841 Carson, R. C., 196, 205 Carter, R. M., 643, 649 Cataldo, J. F., 117, 136 Catton, W. R., Jr., 53, 90, 762, 851 Cervin, V. B., 484, 654 Chafetz, M. E., 196, 207, 624, 660 Chang, J., 808, 843 Chapanis, A., 215, 263, 315, 654, 671, 843 Chapanis, N. P., 215, 263, 315, 654 Chaves, J. F., 170, 172, 204 Chein, I., 796, 843 Christie, R., 48, 89, 271, 284, 326, 364, 438, 599, 654, 672, 843 Christy, E. G., 481, 664 Chun, K. T., 782, 857 Chwast, J., 312, 657 Cialdini, R. B., 280, 284 Cicero, 27 Cieutat, V. J., 143, 205 Claiborn, W. L., 200, 202, 205 Clapp, W. F., 535, 625, 657 Clark, E. L., 386, 654 Clark, K. B., 400, 654 Clark, K. E. et al., 49, 62, 73, 88–89, 674, 711, 727, 729, 733, 842–844 Clausen, J. A., 51, 53, 89, 756, 758, 763, 844 Cleveland, S. E., 346, 352, 389, 663, 654, 839, 853 Clyde, D. J., 98, 109, 127, 130, 137, 403, 659 Cobb, W. J., 312–314, 319, 332, 341, 343–344, 355, 386, 400–401, 485, 539–540, 544, 561, 583–584, 587–588, 657 Cochran, W. G., 49, 78, 89, 99, 109, 299, 631, 651, 654, 672–673, 776, 844 Coe, W. C., 761, 782, 844 Coffey, H. S., 626, 654 Coffin, T. E., 56, 89, 481, 491, 654, 686, 844 Cohen, A. R., 215–216, 218, 263 Cohen, I. B., 7, 13 Cohen, J., 632, 649, 670, 682, 686, 688, 706, 789, 844 Cohen, L., 312, 661 Cohen, R. A., 25, 44 Cohen, W., 484, 656 Cohler, B. L., 712–713, 717–718, 727–729, 844 Colby, K. M., 300, 654 Cole, D. L., 491, 654 Cole, J. O., 136 Coleman, J. C., 367, 665 Collins, B., 277, 285 Collins, B. C., 217, 219, 263 Collins, M. E., 626, 655 Colombotos, J., 837–838, 851 Commack, R., xvii Conant, J. B., 7, 13
Conn, L. K., 200, 202, 205 Connelly, G. M., 319 Connors, A. M., 170, 172, 193, 205 Conrad, H., 765, 838, 849 Conrad, H. S., 553, 654 Conroy, G. I., 720, 725, 788, 844 Consumer Reports, 319, 655 Cook, P. A., 158–159, 206 Cook, S. W., 267, 280, 284, 286, 771, 844 Cook, T. D., 280–281, 284, 764, 770, 812, 815, 818, 844, 861 Cook-Marquis, P., 352, 390, 654 Cooper, E., 25, 40, 44 Cooper, J., 167, 175, 180, 205, 637–638, 649, 815, 849 Cooper, L. M., 58, 63–64, 90, 695–696, 712–714, 716, 720, 851 Cope, C. S., 675, 788, 844 Cope, R. G., 701–702, 710–713, 717, 765, 844 Cordaro, L., 152–153, 156–157, 205, 300–301, 439, 654 Cormier, W. H., 759, 764, 802, 850 Correns, C. E., (in example), 310 Corsi, P., 224 Corsini, R. J., 271, 284 Cottingham, D. R., 778, 842 Couch, A., 20, 44 Cowan, J. R., 683, 695–696, 720, 737, 860 Cox, D. E., 675, 785, 844 Cox, G. M., 99, 109 Cozby, P. C., 826, 844 Craddick, R. A., 690, 703–704, 844 Crespi, L., 794, 844 Crespi, L. P., 320, 323, 654 Criswell, J. H., 110, 136, 440, 654 Cronbach, L. J., x, 18, 44, 271, 284 Cronkhite, G. L., 27, 44 Croog, S. H., 844 Crossley, H. M., 684, 729, 733, 737, 844 Crow, L., 362, 654 Crowne, D. P., 20, 44, 59, 61, 89, 144, 200, 202, 205, 211, 222, 263, 347–349, 362, 450, 486, 654, 703–704, 780, 784, 844 Crumbaugh, J. C., 404, 654 Crutchfield, R. S., 481, 484, 654, 796, 850 Cudrin, J. M., 725, 844 Cutler, R. L., 300, 654 Dabbs, J. M., 38, 44 Dailey, J. M., 484, 654 Daily Palo Alto Times, 280, 284 Damaser, E. C., 121, 136 Damon, A., 671, 844 Darley, J. M., 51, 88, 276–277, 284, 535, 625, 652, 754, 783, 841, 851 Darroch, R. K., 824, 844 Darwin, C., vi, 311
Das, J. P., 358, 654 Davidson, H. H., 73, 90, 733, 850 Davidson, K. S., 340, 663 Davis, H., 11, 13 Davis, J. D., 388–389, 657 Davis, K. E., 690, 712–713, 727, 733, 729–730, 761–762, 776, 849–850 Davis, L., 103, 105, 109 Davis, R. C., 114–116, 136 Davis, W. E., 144, 209 de Hann, H., 40, 46 de Vries, H., (in example), 310 Deiker, T. E., 674, 751, 774, 834, 860 Delboeuf, J. L. R., 405, 592, 654 Delprato, D. J., 143, 206, 815, 850 Dember, W. N., 309, 655 Deming, W. E., 672, 844 DeNike, L. D., 442, 664, 799, 801, 859 Derbyshire, A. J., 11, 13 deRubini, E., (in example), 523 Deutsch, M., 267, 286, 626, 655 Dewey, J., 11, 13 DeWolfe, A. S., 103–105, 778, 844 Diab, L. N., 692, 725, 844 Diamant, L., 685–686, 733, 761, 776, 844 Dickson, W. J., 94, 136–137, 275, 286 Diespecker, D. D., 685, 698, 717, 719–720, 722, 846 DiFurio, D., 670 Dillman, D. A., 748, 756, 758, 845 DiMatteo, M. R., 649–650 Dinerman, H., 25, 40, 44 Dittes, J. E., 57, 80, 89, 692, 843, 845 Dohrenwend, B. P., 675, 720, 727–729, 733, 738, 845 Dohrenwend, B. S., 167, 180, 205, 637–638, 649, 675, 693, 720, 727–729, 733, 738, 750, 845 Dollard, J., 776, 845 Donald, M. N., 727, 729, 733, 737, 762, 845 Donelson, E., 28, 45 Donnay, J. M., 837–838, 845 Doob, A. N., 280, 284, 822, 839, 845 Dorcus, R. M., 626, 654 Dreger, R. M., 690, 837–838, 845 Drolette, M. E., 389, 656 Duker, J., 775, 861 Dulany, D. E., 441, 655, 799–801, 845 Dunbar, P., 687, 704, 725, 760, 853 Duncan, C. P., 103, 105, 109 Duncan, S., 191, 205, 249–250 Dunteman, G., 702, 746, 841 Durea, M. A., 360, 656 Ebbinghaus, H., 15, 404, 453, 462–463, 655 Ebert, R. K., 685, 692, 695, 710–713, 725, 733, 737–739, 845 Ecker, B. P., 839, 845
Eckland, B. K., 725, 727, 763, 765, 845 Eckler, A. R., 487, 655 Edgerton, H. A., 53–54, 71, 89, 725, 761, 765, 845 Edgeworth, F., 585 Editorial Board, Consumer Reports, 319, 655 Editorial Board, Science, 319, 322, 655 Edwards, A. L., 211, 263, 271, 285, 523, 655 Edwards, B. C., 159, 185, 192, 206 Edwards, C. N., 48, 60–62, 70–74, 89, 200, 202, 205, 364, 704–709, 712–716, 725–726, 729–730, 732–733, 737–738, 845 Efran, J. S., 706, 845 Ehrenfreund, D., 424, 655 Ehrlich, A., 672, 845 Ehrlich, J. S., 340–341, 353, 655 Einstein, A., vi, 265–266, 310 Eisenberg, L., 167, 180, 205, 637–638, 649 Eisenman, R., 692–693, 817, 845 Ekman, P., 355, 358, 655 Elashoff, J. D., 646, 650 Elinson, J., 837–838, 851 Ellis, R. A., 680, 725, 729, 845 Ellis, R. S., 18, 44 Ellson, D. G., 114–116, 136 Elms, A. C., 825, 845 Endler, N. S., 782, 846 Endo, C. M., 680, 725, 729, 845 Engram, W. C., 144 Entwisle, D. R., 101–103, 105, 109 Epps, E. G., 342–343, 658 Epstein, J., 170, 186, 189, 204 Epstein, Y. M., 756, 782, 845 Eriksen, C. W., 442, 484, 655 Escalona, S. K., 147, 205, 389, 655 Esecover, H., 69, 89, 720, 723, 748, 754, 845 Etzioni, A., 674, 845 Evans, A., 318 Evans, F. J., 110–111, 121, 132–133, 136–137, 822, 826, 845 Evans, F. T., 131 Evans, M. C., 656 Evans, R., 199, 202, 208 Ewing, R., 37, 44 Exline, R. V., 334, 655 Eysenck, H. J., 553, 655, 672, 845 Farrow, B. J., 814–815, 823, 860 Fechner, G., vi, 405 Feinberg, H. K., 698, 712–714, 716–717, 755, 788, 848 Feinstein, A. R., 299, 311, 655 Feinstein, S. H., 194, 207, 622, 661 Feldman, H., 280, 285 Feldman, J. J., 312–314, 319, 332, 341, 343–344, 355, 386, 400–401, 485, 539–540, 544, 561, 583–584, 587–588, 657
Feldman, R. E., 280, 282, 285 Feldman, R. S., 818, 845 Feldstein, S., 693, 750, 845 Felice, A., 601, 655 Fell, H. B., 311, 655 Fenton, D. P., 759, 818, 840 Ferber, R., 471, 655, 674, 846 Ferguson, D. C., 329, 655 Ferree, M. M., 756, 837–838, 846 Ferriss, A. L., 748, 758, 846 Feshbach, S., 778, 849 Festinger, L., 12–13, 23, 30, 33, 38, 43–44, 46, 110, 120, 136, 215, 263, 277, 285, 471–472, 625, 655, 792, 846 Filer, R. M., 312, 655 Fillenbaum, S., 28, 44, 805, 818, 846 Fine, B. J., 484, 655 Fink, H., 592, 656 Fink, R., 684, 729, 733, 737, 844 Firetto, A., 144, 209 Fisch, R., 701–702, 717, 847 Fischer, E. H., 687, 692, 694, 729, 738, 846 Fisher, R. A., 10, 299, 306, 311, 318, 563, 655 Fisher, S., 27, 44, 136, 707, 846 Fiske, D. W., 272, 284 Fitch, G., 292 Flanagan, J. C., 200, 205 Flick, G. L., 143, 205 Flowers, C. E., 201–202 Foa, U. G., 817, 846 Fode, K. L., 139–140, 151, 157, 167–172, 190, 192, 205, 208, 292, 304, 320, 392, 412, 414–415, 423, 442, 447, 454, 466, 468, 472, 480–481, 483, 496, 524, 526, 597, 655, 662–663 Ford, R. N., 51, 53, 89, 674, 756, 758, 763, 844, 846 Forgione, A., 170, 172, 204 Fortune, R., 313 Foster, R. J., 61, 63, 89, 481, 656, 707, 709, 846 Foster, W. S., 523, 656 Fowler, R. G., 280, 285 Frager, R. D., 193, 209 Francis, R. D., 685, 698, 717, 719–720, 722, 846 Frank, J., 402, 656 Frank, J. O., 275, 286 Frankfurt, L., xvii Franzen, R., 53, 72, 74, 89, 282, 285, 727, 729, 739, 761, 846 Fraser, C., 828, 848 Fraser, S. C., 281, 285, 672, 753, 817, 846 Freedman, J. L., 23, 27, 30–31, 35, 38–39, 44, 46, 281, 285, 817, 825, 846 Freiberg, A. D., 656 French, J. R. P., 48, 89, 285, 361, 491, 661, 663, 674, 846 Frenkel-Brunswik, E., 103, 780, 840 Frey, A. H., 56–58, 89, 691, 697, 725, 846
Frey, R., 818, 844 Fried, S. B., 822, 846 Friedman, C. J., 139–140, 208, 292, 468, 472, 496, 662 Friedman, H., 635, 650, 706, 846 Friedman, N., 193, 205, 208, 243, 263, 292, 333, 335, 339, 348, 441, 507–509, 578, 597, 656, 662 Friedman, P., 404, 656 Friesen, W. V., 355, 358, 655 Fromm-Reichmann, F., 404 Fruchter, B., 495, 656 Frye, R., 702, 746, 841 Frye, R. L., 61, 89, 707–708, 846 Fuhrer, M., 481, 659 Funkenstein, D. H., 389, 656 Gaito, J., 292, 553, 662 Galileo, 11 Gall, M., 143, 205 Gannon, M. J., 680, 727, 733, 737, 765, 846 Gantt, W. H., 326, 656 Gardner, M., 141, 205 Garfield, S. L., 300, 656 Garrett, H. E., 321, 656 Gaudet, H., 53, 73–74, 89, 727, 733, 761, 846 Gebhard, P. H., 776, 850 Geldard, F. A., 194, 622, 656 Gelfand, D. M., 484, 656, 822, 846 Geller, S. H., 782, 846 George, W. H., 306, 319, 656 Gergen, K. J., 773, 846 Gerjuoy, L. R., 484, 653 Getter, H., 158–159, 175, 205 Giesen, M., 755, 758, 847 Gilliland, A. R., 103, 105, 109 Gillispie, C. C., 306, 656 Gilmore, J. B., 849 Gitter, A. G., 680, 798, 802, 804, 817, 857 Glaser, E. M., 626, 654 Glass, D. C., 817, 847 Glinski, B. C., 782, 846 Glinski, R. J., 782, 846 Glixman, A. F., 143, 205 Glock, C. Y., 267, 285 Glucksberg, S., 357, 656 Goffard, S. J., 57, 63, 65, 69, 71–72, 91, 692, 701, 703, 712–713, 716–718, 720, 722, 725, 733, 854 Goldberg, S., 484, 656 Goldblatt, R. A., 144, 205 Goldfried, M. R., 553, 656 Goldiamond, I., 321, 323, 652 Goldstein, A. P., 402, 624, 626, 656–657 Goldstein, J. H., 799, 801, 846 Goldwater, B., 778 Goodstadt, B. E., 799, 801–802, 804, 815, 817, 846–847, 957 Goranson, R. E., 357, 491, 656
Gordon, L. V., 360, 656 Gore, P. M., 281, 285 Gosnell, H. F., 279, 285 Gosset, W. S., 9, 13 Gough, H. G., 18, 44 Governale, C. N., 103–105, 778, 844 Grabitz-Gniech, G., 818, 847 Graff, H. W., 466, 652 Graham, S. R., 387, 656 Gravitz, H. L., 643, 650 Gray, D., 334, 655 Green, D. R., 52, 89, 760, 784, 803, 847 Greenberg, A., 762, 847 Greenberg, M. S., 824, 847 Greenblatt, M., 402, 656 Greene, E. B., 79, 89, 847 Greenfield, P. M., 40, 46, 145, 170, 208, 292, 346, 348, 350, 356, 381, 394, 448, 450, 496, 594, 459, 662 Greening, T. C., 626, 654 Greenspoon, J., 799, 801, 847 Greenwald, H., 27, 44 Griffith, R., 458, 466, 656 Gross, A. E., 277, 280, 284–285, 822, 845 Grothe, M., 170, 208, 292, 352, 353, 379, 392, 477, 480, 502–503, 576, 587, 662–663 Gruenberg, B. C., 405, 656 Grunebaum, H. H., 712–713, 717–718, 727–729, 844 Guilford, J. P., 432, 656 Gumenik, W. E., 704, 710, 817, 848 Gumpper, D. C., 822, 846 Gunne, L. M., 402 Gustafson, L. A., 114–116, 136, 782, 818, 847 Gustav, A., 750, 847 Guthrie, E. R., 400, 656 Haas, H., 592, 656, 672, 750, 847 Hackman, J. R., 784, 860 Haefner, D. P., 778, 847 Hain, J. D., 281, 283 Halas, E. S., 292, 301, 662 Haley, J., 511, 652 Haley, S., 292, 337, 354, 375, 392, 535, 588 Hall, C. M., 140, 208 Hall, R., 51, 55, 92, 687, 754, 858 Hammel, E. A., 553, 660 Hammond, K. R., 282, 285, 563, 656, 828, 847 Hammonds, A. D., 143, 209, 341–342, 665 Hancock, J. W., 746, 752, 847 Handelman, L., xi Handfinger, B. M., 817, 847 Haner, C. F., 484, 656 Hankin, E. H., 9, 13 Hanley, C., 305, 657 Hansen, M. H., 672, 847 Hanson, N. R., 298, 657
Hanson, R. H., 401, 657 Harari, C., 312, 657 Harlem Youth Opportunities Unlimited, Inc., 400, 657 Harmatz, M. G., 329, 663 Harrington, G. M., 153, 157–158, 175, 206 Harris, N., 583, 657 Harrison, N. W., 837–838, 853 Hart, C. W., 312–314, 319, 332, 341, 343–344, 355, 386, 400–401, 485, 539–540, 544, 561, 583–584, 587–588, 657 Harter, S., 280, 286 Härtfelder, G., 592, 656 Hartmann, D. P., 822, 846 Hartmann, G. W., 279, 285, 847 Hartry, A., 153, 157–158, 175, 187, 205 Hartsough, D. M., 202–204 Harvey, O. J., 27, 44, 535, 625, 657 Harvey, S. M., 401, 657 Hastorf, A. H., 30, 44 Hathaway, S. R., 18, 45 Havighurst, C. C., 771, 847 Hawthorne, J. W., 643, 650 Hayes, A. S., 194, 209, 622, 663 Hayes, D. P., 48, 59, 80–81, 89, 695–696, 847 Hebert, J. A., 725, 861 Heckhausen, H., 701–702, 717, 847 Hefferline, R. F., 536, 657 Heilizer, F., 63, 65, 89, 484, 657, 712–713, 717–718, 847 Heine, R. W., 402, 657 Heinzi, R., 484, 654 Heisenberg, W., 93–94, 109 Heiserman, M. S., 200, 202, 205 Helbig, I., 292 Heller, K., 371, 388–389, 402, 657 Helmreich, R. L., 217, 219, 263 Helson, H., 281, 285, 451, 462, 531, 657 Hempel, G. G., 265, 285 Henchy, T., 817, 822, 842, 847 Hendrick, C., 755, 758, 847 Hergenhahn, B. R., 170, 172, 191, 205 Hermann, H. T., 720, 722, 762, 850 Herskovits, M. J., 286 Heslin, R., 643, 649 Hewgill, M. A., 27, 46 Hicks, J. M., 38, 46, 102, 105, 109, 794, 847 Hicks, L. H., 771, 844 Hicks, W. M., 309 Higbee, K. L., 672, 775, 788, 847 Hilgard, E. R., 48, 52, 55, 59, 68, 89, 420, 671, 687, 729, 761, 842, 847–848 Hill, C. T., 680, 683, 729, 738, 762, 848 Hill, R. E., Jr., 687, 695, 704–705, 720, 725, 842 Himelstein, P., 56, 65, 76, 90, 680, 717, 848 Hinkle, D. N., 392 Holland, C., 110, 158–159, 205
Holmes, D. S., 752, 815, 825, 848 Holmes, E., 224 Holmes, J. G., 848 Holper, H. M., 680, 798, 817, 857 Holtz, W., 321, 323, 652 Holz, R. 83 Homans, G. C., 491, 657 Homme, L. E., 430, 657 Honigfeld, G., 132, 137, 403, 657 Honorton, C., 141, 206 Hood, E., 318 Hood, T. C., 48, 55, 60, 65, 76, 79, 90, 685, 687, 699, 703–705, 717, 757, 826, 841, 848 Hopkins, T. R., 57–58, 64–65, 76, 92, 692, 714, 716–719, 862 Horn, A., 765, 838, 849 Horn, C. H., 170, 172, 175, 206 Horowitz, I. A., 704, 710, 777–779, 787, 817, 824, 834, 848 Horst, L., 161–162, 170, 193, 205 Houde, R. W., 127, 137 Hovland, C. I., 22–24, 26–27, 29, 36–37, 39, 40, 43–46, 103, 105, 109, 213, 263, 275, 285, 340, 451, 462, 481, 491, 657, 664, 794, 848 Howard, F. H., 544, 660 Howard, K., 486, 664 Howe, E. S., 50, 55–56, 64–65, 90, 684, 714, 716–717, 719, 747, 750, 848 Hughes, H. M. 794, 859 Hume, D., 264 Humphrey, H. H., 778 Hunt, R. G., 484, 656 Hunt, W. A., 818, 855 Hurn, M., 298, 320, 545, 653 Hurwitz, S., 159–161, 206 Hurwitz, W. N., 487, 655, 672, 847 Hyman, H. H., x, 4, 49, 90, 312–314, 319, 332, 341, 343–344, 355, 386, 400–401, 485, 539–540, 544, 561, 566–567, 583–584, 587–588, 657, 664, 674, 776, 848 Ingraham, L. H., 153, 157, 206 Innes, J. M., 828, 848 Insko, C. A., 27, 45, 280, 284, 778, 826, 848 Insko, V. M., 778, 848 International Journal of Attitude and Opinion Research, 319, 665 Irwin, J. V., 23, 45 Iscoe, I., 59, 70, 73, 88, 695, 725, 733, 840 Ismir, A. A., 486, 657 Ison, J. R., 152–153, 156–158, 175, 205, 300–301, 439, 654 Jackson, C. W., Jr., 52, 90, 747, 760, 782, 848 Jackson, D. D., 511, 652 Jackson, D. N., 40, 46, 110, 137, 814, 816, 822–823, 860
Jackson, J. A., 782, 848 Jacobson, A. L., 593–594, 602, 652 Jacobson, L., 197, 202, 208, 627, 644–646, 650 Jaeger, M. E., 698, 712–714, 716–717, 755, 783, 788, 848 Jahn, M. E., 317, 657 Jahoda, M., 267, 286 James, R. L., 57, 60, 66, 71, 76, 690, 706, 719, 726, 851 James, W., 466, 657 Janis, I. L., 23–24, 26, 38, 44–45, 285, 340, 481, 484, 657, 778, 825, 848–849 Jastrow, J., 399, 657 Jellison, J. M., 23, 46 Jenkins, V., 159–161, 167, 170, 172, 196, 206 Jenness, A., 481, 657 Jensen, A. R., 644–645, 650 Jessor, S., 312 Joel, W., 657 Johns, J. H., 818, 849 Johnson, C. A., 139–140, 208, 292, 468, 472, 496, 662 Johnson, H. J., 58, 63–64, 90, 695–696, 712–714, 716, 720, 851 Johnson, H. M., 12, 14, 438, 523, 657 Johnson, L. B., 778 Johnson, M. L., 299–300, 657 Johnson, N., 293 Johnson, R. W., 159–160, 175, 183, 186, 206, 641, 643, 650, 674, 817, 849 Johnson, W. E., Jr., 690, 837–838, 845 Jones, E. E., 36, 45, 368, 370, 441, 657, 824–825, 849 Jones, F. P., 603, 658 Jones, H. E., 765, 838, 849 Jones, R. A., 815, 849 Jones, R. H., 312, 658 Jones, S., 778, 851 Jordon, N., 315, 658 Jourard, S. M., 144, 206, 699, 826, 849 Joyner, R. C., 484, 654 Juhasz, J. B., 782, 849 Jung, A. F., 285 Jung, J., x, 672, 849 Kaats, G. R., 712–713, 733, 761, 776, 849 Kaess, W., 64, 91, 714, 716–717, 720, 725, 733, 762, 849, 856 Kagan, J., 148, 196, 207, 210, 337, 340, 658 Kalischer, O., 523 Kamenetsky, J., 40, 46 Kammerer, P., 318–319, 322 Kanfer, F. H., 366–367, 488–489, 658, 799, 849 Kant, I., 266 Kantor, W., 309 Karas, S. C., 366–367, 488–489, 658 Katz, D., 25, 46, 684, 727, 729, 796, 849
Katz, I., 342–343, 400, 658 Katz, J., x, 674, 771, 849 Katz, R., 333–334, 339, 348, 507–508, 658 Kauffmann, D. R., 782, 849 Kaufman, R. S., 304, 664 Kavanau, J. L., 48, 90, 672, 849 Kaye, D., 38, 45 Kazdin, A. E., vii, xi, 752, 757, 818, 849 Keen, R., 330, 366, 664 Kegeles, S. S., 778, 849 Keith-Spiegel, P., 687, 701–702, 707–708, 710–711, 725, 859 Kelley, E. L., 782, 848 Kelley, H. H., 23–24, 44, 285, 323, 588, 658 Kelley, T. L., 755, 765, 849 Kellogg, W. N., 194–195, 206, 326, 622, 658 Kelly, E. L., 18, 45 Kelly, G. A., 624, 658 Kelman, H. C., x, 22, 39–42, 45, 119, 137, 276–277, 285, 471, 610, 658, 771, 774, 805, 825, 850 Kelvin, P., 814, 850 Kendall, M. G., 299, 666 Keniston, K., 20, 44 Kennedy, G. L., 511, 661 Kennedy, J. J., 158–159, 175, 185, 192, 206, 759, 764, 802, 850 Kennedy, J. L., 139, 206, 303–304, 404, 523, 658 Kepler, J., 5 Keppel, G., 725, 860 Kerlinger, F. N., 773, 850 Kern, R. P., 280, 283 Kerr, W. A., 275, 280, 285 Keshock, J. D., 643–644, 650 Kety, S. S., 315, 658 Kidder, L., 276, 285 Kiesler, C. A., 30, 45, 39 Kiesler, S. B., 30, 45, 39 Kimble, G. A., 528, 658, 771, 844 Kincaid, M., 9, 14 King, A. F., 72–73, 90, 733, 735, 850 King, B. T., 25, 45 King, D. J., 101–103, 105, 109, 674, 850 King, S. H., 389, 656 King, S., 685, 700, 711, 856 Kinnebrook, (in example), 139, 297 Kinsey, A. C., 776–777, 850 Kintz, B. L., 143, 206, 815, 850 Kirby, M. W., 690, 727, 729–730, 762, 850 Kirchner, W. K., 765, 850 Kirk, W. E., 690, 714, 716, 729, 861 Kirschner, P., 38, 45 Kish, G. B., 720, 722, 762, 850 Kish, L., 775, 850 Kivlin, J. E., 695–696, 712–713, 727, 729, 850 Klaus, D. J., 430, 657 Klebanoff, S. G., 359, 661
Klein, G. S., 449, 658 Kleinman, D., 818, 859 Kleinmuntz, B., 206 Kleinsasser, L. D., 159, 206 Klinger, E., 148, 206, 818, 850 Kloot, W. v. d., 685–686, 690, 756, 860 Knight, G. W., 782, 840 Knights, R. M., 330, 332, 366, 664–665 Knutson, C., 139, 173, 187, 195–196, 207, 622 Koenigsberg, R. A., 820, 850 Koestler, A., 309, 311, 315, 658 Kohn, P., 40, 46, 145, 170, 208, 292, 346, 348, 350, 356, 381, 394, 448, 450, 496, 594, 459, 662 Koivumaki, J. H., 649–650 Kolstoe, R., 321 Kornblith, C. L., 301, 664 Kothandapani, V., 796, 850 Kowal, B., 359, 665 Kramer, E., 402, 658 Krasner, L., 148, 206, 357, 367, 442, 449, 526, 658, 799, 850 Krech, D., 796, 850 Kroger, R. O., 117, 137, 782, 850 Krovetz, M. L., 280, 284, 818, 844 Kruglanski, A. W., 787, 850 Kruglov, L. P., 73, 90, 733, 850 Krugman, A. D., 98, 109, 127, 130, 137, 403, 659 Kruskal, J. B., 808, 850 Kubie, L. S., 300, 658 Kuethe, J. L., 484, 655, 658 Kuhn, T. S., 264, 266, 285 Kunce, J. T., 675, 788, 844 Kurland, D., 193, 205, 208, 292, 333, 508–509, 656, 662 Kutner, B., 282, 285 La Piere, R. T., 282, 285 Labouvie, G. V., 837–838, 858 Laming, D. R. J., 675, 762, 788, 794, 798, 850 Lana, R. E., x, xvii, 4, 7, 9, 20, 24, 29, 45, 98, 100–105, 107, 109, 127, 266–267, 672, 794, 798, 850–851, 855 Landes, J., 55, 68, 89, 687, 848 Lane, F. W., 299, 658 Langley, M., 292 Larrabee, L. L., 159, 175, 206 Larson, R. F., 53, 90, 762, 851 Lasagna, L., 69, 78, 90, 720, 723, 851 Laszlo, J. P., 140, 170, 206, 305, 357–358, 417–418, 491 Latané, B., 276–277, 284, 783, 851 Lawson, F., 729, 738, 762, 783, 851 Lawson, R., 157, 208, 292, 320, 429, 662 Lazarsfeld, P. F., 53, 72, 74, 89, 275, 285, 727, 729, 739, 761, 846 Lefkowitz, M. M., 279, 285, 491, 658 Lehman, E. C., Jr., 695, 762, 851
Leik, R. K., 782, 851 Leipold, W. D., 57, 60, 66, 71, 76, 690, 706, 719, 726, 851 Lentz, T. F., 18, 45 Leser, U., (in example), 299, 318 Leslie, L. L., 671, 851 Lester, D., 671–672, 851 Leventhal, J., 778, 851 Levin, S. M., 444, 658 Levinson, D. J., 103, 780, 840 Levitt, E. E., 48, 51–52, 56–59, 61, 65–66, 68, 71, 89–90, 300, 401, 592, 658, 689–690, 692, 697, 701, 703, 708–709, 717–718, 720, 746, 755, 757, 761, 788, 842, 851–852 Levitt, S. D., x Levy, B. H., 281, 286 Levy, L. H., 403, 658, 782, 823, 851 Lewin, K., 781, 851 Lewis, O., 313 Liddell, H. S., 326, 658 Lighthall, F. F., 340, 663 Lince, D. L., 357, 656 Lindquist, E. F., 613, 659 Lindzey, G., 325, 404, 659 Lipsher, D. H., 837, 652 List, J. A., x Lister, J., 311 Locke, H. J., 49, 54, 90, 674, 762, 851 Loewenstein, R., 837–838, 851 Lombardo, J. P., 817, 861 London, P., 48, 55, 58, 63–64, 90, 481, 659, 671, 685, 695–696, 712–714, 716, 720, 815, 843, 851–852 Loney, J., 685–686, 765, 852 Long, L., 725, 733, 762, 849 Lord, E., 359, 659 Lorge, I., 18, 45 Lowe, F. E., 683, 733, 852 Lubin, A., 27, 44, 57–59, 61, 65, 71, 90, 100, 109, 554, 659, 692, 697, 701, 703, 708–709, 717–718, 720, 852 Lubin, B., 51–52, 56, 66, 68, 89–90, 689–690, 746, 755, 757, 761, 788, 842, 851 Luborsky, L., 665 Luchins, A. S., 83, 90, 791, 852 Lück, H. E., 158–159, 167–168, 209 Luft, J., 359, 364, 659 Lumina, A. R., 117, 137, 782, 855 Lumsdaine, A. A., 26, 29, 36, 40, 44–45, 103, 105, 109, 794, 848 Lundberg, S., 59, 80–81, 89, 695–696, 847 Lurie, M. H., 11, 13 Lyerly, S. B., 98, 109, 127, 130, 137, 403, 659 Lyons, J., 826, 852 Maas, I., 838–839, 852 Macaulay, J., 783, 852 Maccoby, E. E., 386, 659
Maccoby, N., 38, 44, 386, 659 MacDonald, A. P., Jr., 674, 680, 683–684, 687, 690, 692, 694–696, 704–705, 707–708, 710, 714, 716, 729, 738–739, 745, 749, 755, 852 MacDougall, C. D., 318, 659 Mackenzie, D., 318, 687, 733, 735, 745, 748, 757, 816, 852 MacKinnon, D. W., 400, 659 Magath, T. B., 298, 320, 545, 653 Mahalanobis, P. C., 320, 561, 659 Mahl, G. F., 575, 659 Maier, N. R. F., 321, 364, 659 Majka, T. J., 672, 756, 805, 815, 823–824, 859 Malitz, S., 69, 89, 720, 723, 748, 754, 845 Malmo, R. B., 144, 206 Mandell, W., 22, 40, 44 Mann, L., 280, 286, 822, 825, 849, 852 Mann, M. J., 680, 725, 852 Marchak, M. P., 794, 854 Marcia, J., 164, 170, 172, 175, 185, 189, 206, 348, 419–420, 480, 541, 659 Marcuse, F. L., 50, 56, 58, 61–62, 66, 69–70, 76, 90, 675, 684, 686, 695–696, 698, 707–709, 711, 717–718, 759, 852 Margolis, R., 83 Margulis, S., 671, 859 Marine, E. L., 365, 659 Marks, E. S., 401, 657 Marks, J. B., 626, 654 Marks, M. R., 460, 659 Marlatt, G. A., 675, 852 Marlowe, D., 20, 44, 59, 61, 89, 144, 205, 211, 222, 263, 292, 347–349, 362, 450, 486, 654, 703–704, 780, 784, 844 Marmer, R. S., 48–49, 73, 85–86, 90, 674, 733, 852 Marolla, F. A., 691, 842 Marquis, P. C., 818, 852 Martin, C. E., 776, 850 Martin, D. C., 674, 729–731, 733, 747, 750, 852 Martin, R. M., 50, 56, 58, 61–62, 66, 69–70, 76, 90, 675, 684, 686, 695–696, 698, 707–709, 711, 717–718, 759, 852 Marwit, S. J., 148, 164–166, 185, 189, 206, 419–420, 643, 650, 659 Maskelyne, (in example), 139, 297 Masling, J., 148, 164, 175, 189, 206, 332, 343, 359–360, 371, 419, 527, 601, 659 Maslow, A. H., 48, 62, 67, 90, 712–713, 720, 722, 776, 852 Matarazzo, J. D., 357, 371, 387–389, 444–445, 484, 659 Matthews, M. W., 858 Matthysse, S. W., 53, 71, 74, 91, 725, 738, 852 Maul, R. C., 671, 852 Mausner, B., 491, 659 Maxwell, M. L., 643–644, 650 May, R. B., 553, 653
May, W. T., 674, 852 May, W. W., 773, 852 Mayer, C. S., 680, 729, 733, 735, 754, 765, 853 Mayo, C. C., 643, 650 Mayo, E., 94 McCandless, B., 52, 92, 727, 761, 860 McClelland, D. C., 340, 510, 659, 701, 853 McClung, T., 27, 46 McConnell, D., 329, 653 McConnell, J. A., 707, 709, 853 McConnell, R. A., 390, 663 McConochie, R. M., 680, 692, 714, 716, 791–793, 857 McCormick, T. C., 683, 733, 852 McCorquodale, K., 292, 325 McCulloch, J. W., 680, 717–720, 722, 733, 855 McDavid, J. W., 76, 91, 704, 853 McDonagh, E. C., 729, 738, 853 McDougall, W., 8, 14 McFall, R. M., 163, 170, 172, 175, 206, 463, 659 McGinnies, E., 28, 45 McGovern, G., 777–778 McGuigan, F. J., 148, 206, 346, 361, 554, 563, 599, 659 McGuire, W. J., x, 4, 10, 15, 24–26, 29–30, 33–35, 37–39, 45, 106, 204, 213, 219, 262–263, 268, 276–278 669, 672, 770–771, 778, 780, 812, 818–819, 825, 844, 853 McLaughlin, R. J., 837–838, 853 McLean, R. S., 206 McNair, D. M., 707, 846 McNemar, Q., 48, 75, 87, 91, 542, 553, 659, 671, 775, 829, 853 McPeake, J. D., 170, 172, 204 McReynolds, W. T., 782, 853 McTeer, W., 357, 659 Mead, M., 313 Meadow, A., 484, 656 Meehl, P. E., 18, 45 Meichenbaum, D. H., 202 Meier, P., 727–729, 853 Meltzer, L., 59, 80–81, 89, 695–696, 847 Menapace, R. H., 798, 851 Mendel, G., 299, 310–311, 316, 318 Mendelsohn, G. A., 143, 205 Menges, R. J., 823, 853 Merritt, C. B., 280, 285 Merton, R. K., 149, 207, 398, 401, 660 Messick, S., 20, 40, 45–46, 110, 137, 814, 816, 822–823, 860 Mettee, D. R., 143, 206, 815, 850 Meyer, G., 838, 856 Meyers, J. K., 760, 853 Mezei, L., 280, 286 Michelson, A., (in example), 309 Miles, C. C., 18, 45
Milgram, S., 41, 45, 277–278, 280, 285–286, 815, 822, 853 Mill, J. S., 7, 9–10, 14, 269, 603 Miller, A. G., 825, 853 Miller, B. A., 839, 853 Miller, D. C., 309–310 Miller, F. D., 756, 837–838, 846 Miller, G. A., 207 Miller, G. R., 27, 46 Miller, J. C., 281, 283 Miller, J. G., 523, 660 Miller, K. A., 638, 650 Miller, N., 27, 46, 277, 281, 286 Miller, P. E., 837, 652 Miller, S. E., 49, 91, 674, 853 Millman, S., 33, 35, 37, 45 Mills, J., 23, 37, 46, 472, 660 Mills, T. M., 110, 137, 441, 660 Milmoe, S., 196, 207, 624, 660, 765, 853 Minard, J., 330, 351, 355, 663 Minor, M. W., 175, 207, 244–245, 263, 690, 817, 853 Mintz, N., 146, 207, 374, 660 Mitchell, W., Jr., 756, 761, 853 Modell, W., 127, 137 Moffat, M. C., 170, 172, 175, 191, 207 Moll, A., 401, 439, 523, 660 Moore, B. V., 466, 653 Moore, J. C., Jr., 818, 857 Moore, R. K., 55, 68, 89, 687, 848 Mooren, R. L., 680, 725, 727, 729, 755, 765, 857 Moos, R. H., 825, 853 Morley, E., (in example), 309 Morris, J. N., 311 Morris, J. R., 674, 720, 725, 788, 844, 852 Morrow, W. R., 316, 660 Moss, H. J., 337, 340, 658 Mosteller, F., 49, 78, 89, 149, 154, 185, 207, 292, 310, 344, 415, 548, 553, 565, 633, 650, 660, 672, 682, 746, 776, 844, 853 Mousley, N. B., 765, 850 Mouton, J. S., 51, 88, 279, 281, 283, 285, 491, 658, 749, 754, 842 Müeller, G. E., 9, 14 Müeller, W., 161, 185, 207 Müller, H., (in example), 299 Müller, J., 10–11 Mulry, R. C., 48, 153, 158–159, 205, 208, 220, 292, 346–348, 350, 352–353, 378–380, 392, 477, 480–481, 488, 496–497, 502–503, 576, 587, 593, 618, 626, 660, 662–663, 687, 704, 725, 760, 853 Munger, M. P., 481, 652 Munn, N. L., 428, 660 Murashima, F., 27, 45 Murphy, D. B., 57, 63, 65, 69, 71–72, 91, 692, 695–696, 701, 703, 712–714, 716–718, 720, 722, 725, 733, 854
Murphy, G., 315, 660 Murray, D. C., 103, 105, 109 Murray, E. J., 755, 758, 847 Murray, H. A., 312, 660 Myers, R. A., 371, 388–389, 657 Myers, T. I., 57, 63, 65, 69, 71–72, 91, 692, 695–696, 701, 703, 712–714, 716–718, 720, 722, 725, 733, 854 Nace, E., 110 Nahemow, L., 48 Naroll, F., 544, 660 Naroll, R., 273, 286, 313–314, 544, 559, 587, 660 Nataupsky, M., 828, 858 Nebergall, R. E., 37, 46 Nelson, H. F., 680, 729, 748, 855 Nelson, T. M., 674, 857 Neulinger, J., 701, 703, 725–726, 854 Neurath, O., ix–x Newberry, B. H., 823, 854 Newcomb, T. M., 213, 219, 262, 292, 488, 660 Newman, M., 56, 61, 73–74, 76, 91, 680, 707–708, 710, 733, 854 Newton, I., (in examples), 11, 265–266, 297 Nichols, M., 170, 172, 185, 207 Nicholson, W., 270 Neisser, U., 110, 119 Niles, P., 778, 851, 854 Nixon, R. M., 777–778 Noltingk, B. E., 318, 660 Norman, D. A., 207 Norman, R. D., 53–54, 71, 89, 91, 355, 660, 725, 755–756, 761, 765, 845, 854 Norman, W. I., 20, 46 Norris, R. C., 471 Nosanchuk, T. A., 794, 854 Nothern, J. C., 680, 727, 733, 737, 765, 846 Nottingham, J. A., 748, 854 Novey, M. S., 196, 207 Novick, S., 293 Nowlis, V., 225, 263 Nunnally, J., 778, 854 O’Brien, R. B., 103, 105, 109 O’Connell, D. C., 441, 655 O’Connell, D. N., 110 O’Connor, E. F., 23, 46 O’Hara, J. W., 481, 652 O’Leary, V. E., 825, 854 O’Neill, E., 785 Oakes, W., 675, 780, 854 Odom, R. D., 330, 528, 665 Ohler, F. D., 854 Olsen, G. P., 685, 692, 704–705, 757, 765–766, 854 Ora, J. P., Jr., 48, 52, 56, 58, 68, 74, 76, 91, 671, 698, 714, 716, 720, 738, 759, 854
Orlans, H., 49, 91, 674, 854 Orleans, S., 279, 286 Orne, E. C., 110, 761, 858 Orne, M. T., x, 4, 12–13, 29, 50, 83, 110–112, 114–117, 119, 121, 124, 133, 136–137, 257, 259, 263, 268, 276, 292, 344, 362, 402, 440–441, 452, 559, 594, 660, 672, 680, 700, 750, 781–782, 794, 805, 807, 811, 814–815, 818, 823, 825–826, 845, 847, 854 Orr, T. B., 403, 658 Osborn, M. M., 27, 43 Osgood, C. E., 26, 46 Oskamp, S., 675, 858 Ostrom, T. M., 824, 855 Pace, C. R., 53, 72, 91, 727, 729, 855 Page, B., 822, 846 Page, E. B., 279, 286 Page, J. S., 643, 650 Page, M. M., 117, 137, 782, 823, 855 Page, R., 817, 824, 858 Palmer, L. R., 317–318, 660 Pan, J-S., 67, 91, 727, 733, 855 Papageorgis, D., 30, 33–34, 37, 45–46 Pareis, E. N., 357, 444–445, 659 Parlee, M. B., 839, 855 Parsons, O. A., 332, 665 Parsons, T., 335, 509, 660 Parten, M., 671, 855 Parzen, T., ix Pascal, B., 7, 14, 603 Paskewitz, D. A., 110, 823, 854 Pasteur, L., 311 Pastore, N., 828, 855 Patchen, M., 25, 46 Patterson, J., 55–57, 76, 92, 684, 692, 862 Paul, G. L., 128, 137 Pauling, F. J., xvii, 794, 798, 855 Pavlos, A. J., 675, 855 Pavlov, I., (in example), 405 Payne, S. L., 729, 848 Pearson, K., 305–306, 555, 660 Peel, W. C., Jr., 175, 207 Pellegrini, R. J., 773, 855 Penner, D., 292 Pepinsky, H. B., 292 Perier, F., (in example), 603 Perlin, A. E., 68, 91 Perlin, S., 68–69, 720, 723, 748, 855 Perlman, D., 680, 682, 700, 714, 716, 841 Perry, C. W., 110 Perryman, R. E., 764, 842 Persinger, G. W., 139, 167–168, 170–173, 175, 185, 187, 195–196, 207, 208, 292, 346, 348, 353, 378–379, 392, 442, 447, 454, 473, 477, 479–481, 488, 496–497, 502–503, 526, 576, 587, 593, 618, 622, 626, 660, 662–663
Persons, C. E., 143, 206, 815, 850 Petrie, H. G., 286 Pflugrath, G. W., 167–168, 207 Pflugrath, J., 485, 660 Pfungst, O., x, 12, 14, 149–151, 193, 207, 326, 364, 405, 414, 416, 438, 491, 502, 523, 535–536, 660 Philip, A. E., 680, 717–720, 722, 733, 855 Phillips, W. M., Jr., 758, 762, 855 Piers, E.V., 106, 109 Pifer, J. W., 687, 733, 735, 739, 758, 859 Pillard, R. C., 707, 846 Pilzecker, A., 9, 14 Piper, G. W., 30, 44 Pitt, C. C. V., 200, 202, 207 Planck, M., 93, 310 Plateau, J. A. F., 405 Platten, J. H., Jr., 319 Plotsky, J., 693, 750, 845 Pokorny, A. D., 839, 853 Polanyi, M., 141, 207, 264, 286, 309–310, 535, 660–661 Politz, A., 838–839, 855 Pollard, J. C., 52, 90, 747, 760, 848 Pollin, W., 68–69, 91, 720, 723, 748, 855 Pomazal, R. J., 760, 861 Pomeroy, W. B., 588, 661, 776, 850 Poor, D., 48, 53, 55, 57, 59–60, 62–63, 67–68, 70, 73–74, 76, 91, 685, 692, 695, 704–706, 710, 714, 716, 720, 725, 729, 733, 738, 756, 827, 855 Popper, K. R., 264–265, 286 Pratt, R. W., Jr., 680, 729, 733, 735, 754, 765, 853 Price, D. O., 749, 758, 855 Prince, A. I., 355, 524, 661 Prothro, E. T., 692, 725, 844 Pucel, D. J., 680, 729, 748, 827, 855 Quarterly Journal Studies of Alcohol, Editorial Staff, 402, 661 Quay, H. C., 818, 849, 855 Quine, W. V., 264, 286 Raffetto, A. M., 167–168, 175, 182, 207, 782, 855 Raible, M., 404, 665 Ramsay, R. W., 837, 855 Rankin, R. E., 277, 286, 342, 661 Rapp, D. W., 312, 661 Rasmussen, J., 299, 306 Rasmussen, J. E., 753, 857 Ratner, S., 153 Raven, B. H., 491, 661 Ravitz, L. J., 195, 207, 622, 661 Ray, A. A., 280, 284 Ray, D., 292 Raymond, B., 685, 700, 711, 856
Razran, G., 405, 661 Redfield, R., 313 Reece, M. M., 359, 362 Reed, M., 292 Reed, W. P., 319 Rees, M. B., 271, 284 Reese, M. M., 359, 362, 661 Regula, C. R., 228, 263 Reid, S., 712–713, 761, 856 Reif, F., 466, 661 Reisman, S. R., 818, 844 Remington, R. E., 785, 856 Resnick, J. H., 773–774, 799–800, 810, 817, 856 Reuss, C. F., 59, 69, 74, 91, 695–696, 725, 727, 729, 739, 856 Rhine, J. B., 404, 519–520, 661, 664 Rice, C. E., 194, 207, 622, 661 Rice, S. A., 400, 661 Richards, T. W., 69, 91, 856 Richart, R. H., 674, 729–731, 733, 747, 750, 852 Richter, C. P., 48, 91, 672, 856 Rickels, K., 110, 136 Rider, P. R., 315, 661 Riecken, H. W., x, 54, 91, 12–13, 113, 119, 137, 207, 211, 263, 292, 344, 374, 376, 441, 447–448, 559, 661, 764, 856 Riecker, A., 14 Riegel, K. F., 838, 856 Riegel, R. M., 838, 856 Riesman, D., 332, 340–341, 353, 653, 655 Riggs, M. M., 64, 91, 714, 716–717, 720, 856 Ring, K., 246, 263, 323, 588, 658, 825, 856 Ringuette, E. L., 511, 661 Robert, J., 167, 180, 205, 637–638, 649 Roberts, M. R., 761, 842 Robins, L. N., 72, 91, 727, 729, 856 Robinson, D., 343, 661 Robinson, E. J., 778, 857 Robinson, J., 312, 661 Robinson, J. M., 342–343, 658 Robinson, R. A., 838–839, 856 Rodnick, E. H., 359, 661 Roe, A., 305, 465–466, 661 Roethlisberger, F. J., 94, 136–137, 275, 286 Rogers, P. L., 649–650 Rohde, S., 343, 661 Rokeach, M., 41, 47, 49, 91, 278, 280, 286, 305, 358, 491, 661, 657, 674, 856 Rollins, M., 761, 856 Rorer, L. G., 271, 286 Rose, C. L., 749, 856 Rosen, E., 52, 55, 61–62, 64–65, 70, 72–74, 91, 675, 683, 687, 698, 710, 712–714, 716–717, 720, 722, 725, 729, 732–733, 738–739, 759, 788, 856 Rosenbaum, M. E., 51, 91, 754, 758, 822, 856 Rosenberg, M., 67–68
Rosenberg, M. J., x, 4, 10, 27, 42, 46, 50, 106, 110–111, 137, 191, 211–213, 216–219, 262–263, 268, 276, 292, 344, 441, 449, 576, 598, 619, 661, 672, 750, 764, 805, 811, 817, 856 Rosenblatt, P. C., 38, 46, 277, 286 Rosenblum, A. L., 729, 738, 853 Rosenhan, D., 48, 90–91, 421, 611, 613, 661, 671, 815, 852, 856 Rosenthal, R., v-vii, xvii, 4, 10, 14, 36, 40, 46, 48, 53, 56–57, 63, 81–85, 106, 110, 113, 137–149, 151–158, 166–173, 175, 180–181, 184–193, 195–197, 199–200, 202–210, 243–245, 249–250, 259–260, 263, 268, 271, 275, 286, 301, 320, 327, 333, 346, 348, 350, 353, 356, 364, 367, 378–379, 381, 386, 392, 394, 406–407, 412, 423, 429, 442, 447–448, 450, 454, 459, 466, 468, 472–473, 477, 480–481, 488, 496–497, 502–503, 508–509, 524, 526, 540, 553, 576, 587, 593–594, 597, 601, 618, 624, 626, 630–634, 641, 644–646, 648–650, 660–663, 671, 680, 692, 714, 716, 753, 755–756, 788–793, 815, 828, 856–857 Rosenzweig, S., x, 765, 781, 818, 822–823, 856–857 Rosnow, R. L., v-vii, xvii, 4, 10, 36, 56–57, 63, 81–85, 110, 104–105, 109, 271, 632, 650, 671, 680, 685, 692, 699–701, 714, 716, 778, 780, 782–783, 788–795, 797–798, 799, 801–802, 804–805, 807, 809, 813–814, 817–821, 825, 840, 846, 851, 856–857 Ross, J. A., 794, 857 Ross, R. R., 202 Ross, S., 98, 109, 127, 130, 137, 403, 659, 753, 857 Rostand, J., 298, 306, 663 Roth, J. A., 141, 208 Rothchild, B. H., 824, 848 Rothney, J. W. M., 680, 725, 727, 729, 755, 765, 857 Rothwell, P. M., x Rotter, J. B., 281, 312, 624, 663 Rousseau, J. J., 10 Rowland, L. W., 12, 14 Rozeboom, W. W., 552, 663 Rozelle, R. M., 822, 845 Rubin, D. B., 646, 650 Rubin, Z., 670, 680, 683, 685–686, 729, 738, 753, 756, 762, 818, 848, 857 Rubinstein, E., 753, 857 Ruebhausen, O. M., 49, 91, 674, 857 Ruebush, B. K., 340, 663 Russell, B., 439, 663 Russell, G., xvii Sacks, E. L., 365, 367, 663 Sadalla, E., 752, 861 Sagatun, I., 840
Saiyadain, M., 27, 45 Sakoda, J. M., 62, 67, 90, 720, 722, 776, 852 Salmon, W., 265, 286 Saloway, J., 83 Saltzman, I. J., 114–116, 136 Salzinger, K., 799, 857 Sampson, E. E., 361, 535–536, 625, 663 Sanders, R., 346, 352, 389, 663 Sandvold, K. D., 816, 861 Sanford, R. N., 103, 309, 663, 780, 840 Sapolsky, A., 361, 663 Sarason, I. G., 114, 208, 329–331, 345–346, 351–352, 354–355, 369, 484, 626, 654, 663, 666, 777, 857 Sarason, S. B., 300, 340, 663 Sarbin, T. R., 7, 782, 849, 857 Saslow, G., 357, 371, 387–389, 444–445, 484, 659 Sasson, R., 674, 857 Sattler, J. M., 148, 209 Saul, L. J., 11, 13 Saunders, F., 388–389, 657 Schachter, B. S., 816, 840 Schachter, S., 12–13, 51, 55, 57–58, 91–92, 350, 663, 687, 699, 754, 824, 857–858 Schackner, R. A., 144, 205 Schaie, K. W., 837–838, 858 Schappe, R. H., 143, 206, 671, 815, 850, 858 Schatz, J. S., 190, 209 Scheffé, H., 100, 109 Scheibe, K. E., 124, 137, 782, 818, 845, 854 Scheier, I. H., 65, 92, 717–718, 858 Schill, T. R., 139–140, 208, 292, 304, 468, 472, 496, 662, 666 Schmeidler, G., 390, 663 Schmeidler, G. R., 693, 750, 845 Schofield, C. B. S., 671, 861 Schofield, J. W., 753–754, 837, 858 Schoggen, P. H., 771, 844 Schopler, J., 684, 757, 858 Schrödinger, E., 93 Schubert, D. S. P., 48, 55–56, 58, 62–65, 67–68, 76, 92, 680, 683, 695–696, 698, 711–714, 716–717, 720, 858 Schuette, D., 334, 655 Schultz, D. P., xvii, 48, 55–58, 64–65, 68, 76, 92, 663, 672, 684, 692, 714, 716–720, 722, 788, 825, 858, 862 Schultz, S., xvii Schulze, G., 575, 659 Schumpert, J., 760, 786, 842 Schumsky, D. A., 782, 861 Schwartz, R. D., 107, 109, 271, 279, 282, 286, 822, 861 Schwartz, T., 773–774, 799–800, 810, 817, 856 Schwenn, E., 725, 860 Schwirian, K. P., 762, 858 Science, 319, 322, 655
Scott, C., 737, 761, 858 Sears, D. O., 23, 30–31, 35, 38–39, 46 Seaver, W. B., Jr., 643, 650 Sebeok, T. A., 194, 209, 622, 663 Sechrest, L., 107, 109, 271, 282, 286, 822, 861 Seeman, J., 643, 773, 858 Segall, M. H., 286 Seidman, E., 196, 204 Selltiz, C., 267, 286 Semmelweis, I., 311 Seyfried, B. A., 755, 758, 847 Shack, J. R., 680, 699, 720, 858 Shames, M. L., 161, 170, 209 Shapiro, A. K., 403, 592, 663 Shapiro, A. P., 299, 317, 664 Shapiro, J. L., 142 Sharp, H., 27, 46 Shaver, J. P., 141, 209 Sheatsley, P. B., 49, 90, 674, 776, 848 Sheehan, P. W., 121, 137, 826, 854 Sheffield, F. D., 26, 27, 29, 36, 40, 44, 103, 105, 109, 304, 664, 794, 848 Sheridan, K., 680, 699, 720, 858 Sherif, C. W., 37, 46 Sherif, M., 26–27, 37, 44, 46, 451, 462, 664 Sherman, S. R., 782, 858 Sherwood, J. J., 828, 858 Shils, E. A., 335, 660 Shinkman, P. G., 301, 664 Shor, R. E., 121, 136, 190, 209, 400, 664, 761, 858 Short, R. R., 675, 858 Shulman, A. D., 764, 781–782, 815, 817–818, 859 Shurley, J. T., 332, 665 Shuttleworth, F. K., 53, 92, 765, 858 Sibley, L. B., 535–536, 625, 663 Siegman, A., 53, 55–56, 62, 65, 67, 74, 92, 684, 712–713, 717–718, 720, 761, 776, 858 Siegman, C. R., 271, 284 Siess, T. F., 684, 714, 716, 722, 858 Sigall, H., 228–229, 244, 260, 263, 764, 802, 805, 810, 817, 824–825, 849, 858 Silver, M. J., 154, 188, 204 Silverman, I., x, 28, 46, 68, 76, 92, 117, 137, 163, 175, 183, 209, 228, 263, 315, 664, 671, 720, 724, 764, 781–782, 784, 803, 815–818, 837, 859 Silverman, I. W., 752, 859 Silverman, L., 117, 136 Silverman, L. H., 271, 286 Silverstein, S. J., 756, 782, 845 Simmons, W. L., 481, 664 Singer, R., 778, 851 Sipprelle, C. N., 675, 785, 841, 844 Sirken, M. G., 687, 733, 735, 739, 758, 859 Sjoholm, N. A., 329, 653 Skinner, B. F., 10, 14 Skolnick, J. H., 279, 286
Slack, C., 114 Slatin, G. T., 782, 846 Smart, R. G., 75, 92, 553, 664, 672, 775, 788, 859 Smiltens, G. J., 83, 171–172, 209 Smith, A. A., 144, 206 Smith, E. E., 361, 664 Smith, E. R., 756, 837–838, 846 Smith, H. L., 314, 664 Smith, H. W., 674, 852 Smith, M. B., 771, 773, 844, 859 Smith, P., 794, 857 Smith, R. E., 672, 777, 857, 859 Smith, S., 57, 63, 65, 69, 71–72, 91, 692, 695–696, 701, 703, 712–713, 716–718, 720, 722, 725, 733, 854 Snedecor, G. W., 580, 631, 651, 664 Snow, C. P., 318, 664 Snow, R. E., 646, 650 Sobol, M. G., 103, 105, 109 Solar, D., 782, 843 Solomon, R. L., 29, 46, 97–98, 101–102, 103, 105, 109, 603, 664, 794–795, 859 Sommer, R., 281, 286, 777, 859 Spaner, F. E., 102, 105, 109, 794, 847 Speer, D. C., 765, 838, 859 Speisman, J. C., 825, 853 Spence, J. M., 484, 654 Spence, K. W., 314, 664 Spiegel, D., 687, 701–702, 707–708, 710–711, 725, 859 Spielberger, C. D., 442, 486, 664, 799, 801, 859 Spires, A. M., 361, 664 Stanley, J. C., 103, 106, 109, 122, 136, 267, 270, 284, 794, 812, 843 Stanton, F., 52, 92, 403–404, 411, 664, 761, 859 Staples, F. R., 50, 92, 750, 859 Star, S. A., 312, 332, 341, 653, 664, 794, 859 Stare, F., 124 Stein, K. B., 687, 692–693, 720, 727–729, 733, 737–738, 859 Stein, M. I., 701, 703, 725–726, 854 Steiner, I. D., 674, 859 Steinmetz, H. C., 18, 46 Stember, C. H., 312–314, 319, 332, 341, 343–344, 355, 386, 400–401, 485, 539–540, 544, 561, 583–584, 587–588, 566–567, 657, 664 Stephens, J. M., 309, 664 Sterling, T. D., 553, 664 Stevens, S. S., 11, 14, 405, 664 Stevenson, H. W., 148, 209, 330–332, 366, 528, 664–665 Stewart, D. Y., 837–838, 843 Stollak, G. E., 278, 286 Stongman, K. T., 785, 856 Stotland, E., 25, 46, 796, 849 Straits, B. C., 671–672, 756, 805, 811, 815, 823–824, 859
Stratton, G. M., 523, 665 Strauss, M. E., 164–165, 171, 185, 209–210 Streib, G. F., 837–838, 859 Streiner, I. D., 824, 844 Stricker, G., 148, 210 Stricker, L. J., 40, 46, 110, 137, 814, 816, 822–823, 859–860 Strickland, L. H., 848 Strometz, D. B., x Strutt, J. W., (Lord Rayleigh), 310 Strupp, H. H., 300, 665 Stumberg, D., 815, 860 Stumpf, C., 150 Suchman, E. A., 729, 860 Suchman, E., 52, 92, 727, 761, 860 Suedfeld, P., 48, 57, 92, 692–693, 738, 756, 782, 845, 860 Sullivan, D. F., 484, 655 Sullivan, D. S., 674, 751, 774, 834, 860 Sullivan, H. S., 300, 665 Suls, J. M., 685, 795, 797–799, 801–802, 804, 817, 846, 857 Summers, G. F., 143, 209, 341–342, 665 Sutcliffe, J. P., 137 Swingle, P. G., 822, 860 Sylva, K., 293 Symons, R. T., 346, 665 Symposium: Survey on problems of interviewer cheating, 319, 665 Tacon, P. H. D., 690, 756, 860 Taffel, C., 346, 484, 665, 773, 860 Tamulonis, V., 320, 583, 653 Tannenbaum, P. H., 26, 46, 213, 219, 262 Tart, C. T., 190, 209, 420, 525, 588, 665 Taub, S. I., 814–815, 823, 860 Tavitian, M., 670 Taylor, J. A., 346, 665 Taylor, K. F., 822, 852 Teele, J. E., 685, 711, 727–729, 738, 844, 860 Terman, L., 18, 45, 264 Terwillinger, R. F., 778, 849 Test, M. A., 279–280, 284 Thackray, R. I., 823, 854 Theye, F., 148, 209 Thibaut, J. W., 368, 370, 657 Thistlethwaite, D. L., 40, 46, 683, 725, 860 Thomas, E. J., 782, 842, 860 Thomson, W., (Lord Kelvin), 310 Thorndike, E. L., 9, 14, 645, 651 Tiffany, D. W., 683, 695–696, 720, 737, 860 Timaeus, E., 158–159, 161–162, 167–168, 175, 185, 207, 209 Titchener, E. B., 8 Todd, K. L., 643, 651 Tolman, E. C., 523, 624, 665 Tooley, J., 48
Toops, H. A., 725, 755, 765, 860 Toppino, T., 14, 783 Tori, C., 782, 853 Toulmin, S., 264, 266, 286 Towbin, A. P., 144, 209 Trattner, J. H., 170, 172–173, 196, 209 Trescott, P. H., 319 Troffer, S. A., 190, 209, 420, 525, 588, 665 Trosman, H., 402, 657 Trotter, S., 674, 860 Trumbull, R., 753, 857 Tryon, R. C., 430 Tuddenham, R. D., 441, 665 Tukey, J. W., 49, 78, 89, 149, 207, 315, 665, 672, 687, 776, 844, 860 Tune, G. S., 685, 698, 720, 725, 729, 733, 860 Turner, G. C., 376, 665 Turner, J., 27, 43, 316–317, 665 Twain, M., (in example), 211 Udow, A. B., 665 Uhlenhuth, E. H., 136 Ullman, L. P., 148, 206 Ulrich, R., 321, 323, 652 Underwood, B. J., 725, 860 Uno, Y., 170–171, 175, 193–194, 209 Uphoff, H. F., 139, 206, 303–304, 658 Valins, S., 690, 720, 724, 842, 860 Valles, J., 839, 853 Van Hoose, T., 228–229, 244, 260, 263, 764, 802, 805, 810, 817, 858 Varela, J. A., 57, 92, 692, 860 Vaughan, G. M., 143, 209 Vaughn, C. L., 656 Velikovsky, 141 Verinis, J. S., 860 Verner, H. W., 320, 583, 653 Vernon, P. E., 511, 652 Veroff, J., 441, 665 Verplanck, W. S., 321, 323, 665 Vidmar, N., 784, 860 Vidulich, R., 702, 746, 841 Vikan-Kline, L., 139–140, 167–168, 170, 208, 292, 304, 346, 348, 353, 378–379, 392, 442, 447, 454, 466, 468, 371, 388–389, 472–473, 477, 480–481, 488–489, 491, 496–497, 502–503, 507, 526, 535, 576, 587, 593, 618, 626, 657, 662–663, 665 Vinacke, W. E., 610, 665, 771, 861 von Felsinger, J. M., 69, 78, 90, 720, 723, 851 von Osten, W., 149–150, 405 von Tschermak, E., (in example), 310 Wagner, N. N., 143, 210, 692, 861 Waite, R. R., 340, 663 Walder, P., 822, 846
Wales, H. G., 471, 655 Walker, P., 158–159, 205 Walker, R. E., 144, 209 Wallace, D., 675, 727, 729, 762, 788, 861 Wallace, J., 752, 861 Wallach, M. S., 300, 665 Wallin, P., 53, 55, 62, 73–74, 92, 367, 665, 687, 711–713, 727, 733, 738, 755, 827, 861 Wallington, S. A., 751, 846 Walsh, J., 674, 861 Walster, E., 23, 46 Walters, C., 332, 665 Walters, G. C., 553, 656 Walters, R. H., 50, 92, 750, 859 Waly, P., 342–343, 658 Wambach, H., 702, 746, 841 Ward, C. D., 57, 76, 92, 691–692, 861 Ward, W. D., 816, 861 Ware, J. R., 359, 665 Warner, L., 404, 665 Warren, J. R., 57, 92, 691, 861 Wartenberg-Ekren, U., 158–159, 175, 209, 360, 665 Waters, L. K., 690, 714, 716, 729, 861 Watson, D. J., 299, 654 Watts, W. A., 35, 47 Weakland, J. H., 511, 652 Webb, E. J., 107, 109, 271, 282, 286, 822, 861 Weber, S. J., 764, 815, 861 Wechsler, D., 676, 861 Weick, K. E., 171, 175, 186, 209, 292, 315, 522, 527–528, 535, 665 Weigel, R. G., 725, 861 Weigel, V. M., 725, 861 Weil, H. G., 782, 840 Weiss, J. H., 55, 58, 92, 148, 210, 684, 692, 694, 758, 861–862 Weiss, J. L., 712–713, 717–718, 727–729, 844 Weiss, L. R., 139, 161–162, 209, 680, 691, 861 Weiss, R. F., 817, 861 Weiss, W., 39, 44, 491, 657 Weissman, H. N., 698, 712–714, 716–717, 755, 788, 848 Weitz, S., 748, 861 Weitzenhoffer, A. M. 55, 68, 89, 420, 665, 687, 848 Welch, F., 760, 786, 842 Wellons, K. W., 643, 651 Wells, B. W. P., 671, 861 Wells, M. G., 672, 775, 788, 847 Welty, G. A., 699, 841, 861 Wenk, E. A., 143, 209 Wernicke, K., 523 Wessler, R. L., 158–159, 161–163, 171–173, 175, 209–210 Wever, E. G., 11, 14 Wheeler, D. N., 680, 729, 748, 855
Wheeler, N., 683, 725, 860 White, C. R., 139–140, 171–172, 208, 210, 292, 304, 392, 450, 468, 472, 496, 662, 666 White, G. M., 815, 856 White, H. A., 782, 861 White, M. A., 775, 861 White, R., 404, 652 Whitehead, T. N., 94 Whitman, R. M., 449, 666 Whitman, R. N., 359, 362, 661 Whitney, E. R., 484, 656 Whittaker, J. O., 27, 47 Whyte, W. F., 399, 666 Wicker, A. W., 48–49, 56–57, 69–70, 72, 92, 674, 680, 685, 690–691, 700, 725, 729, 760, 762, 765, 824, 861 Wiens, A. N., 371, 387–389, 659 Wiesenthal, D. L., 815, 817–818, 859 Wilkens, B., 69, 89, 720, 723, 748, 754, 845 Wilkins, C., 282, 285 Willard, S., 680, 683, 729, 738, 762, 848 Williams, F., 341, 666 Williams, J. A., 666 Williams, J. H., 799, 861 Williams, L. P., 317 Willis, R. H., 813, 824, 861 Willis, Y. A., 824, 861 Wilson, A. B., 400, 666 Wilson, E. B., 264, 286, 298, 315, 317, 458, 524, 592 Wilson, E. C., 53, 73–74, 89, 727, 733, 761, 846 Wilson, P. R., 55–57, 76, 92, 684, 692, 862 Wiltsey, R. G., 58, 92, 692, 694, 861 Winch, W. H., 9, 14 Winder, C. L., 484, 656 Winer, B. J., 614, 666, 770, 862 Winer, D., 687, 692, 694, 729, 738, 846 Winkel, G. H., 345–346, 666 Winstead, J. C., 159, 185, 192, 206 Wirth, L., 300, 465, 666 Wishner, J., 110, 137 Wittenbaugh, J. A., 717, 719–720, 762, 843 Wolf, A., 48, 55, 58, 92, 684–685, 692, 694, 758, 861–862 Wolf, I., 196, 207, 624, 660 Wolf, I. S., 299, 318, 553, 666 Wolf, S., 311, 666 Wolf, T. H., 592, 666 Wolfensberger, W., 49, 92, 674, 862 Wolfgang, A., 70, 92, 685, 725, 862
Wolfle, D., 49, 92, 610, 666, 674, 862 Wolins, L., 305, 322, 553, 666 Womack, W. M., 143, 210 Wood, F. G., 315, 666 Woods, P. J., 554, 666 Woodworth, R. S., 9, 14 Woolf, D. J., 317, 657 Woolsey, S. H., 146, 171, 192, 210, 712–713, 717–718, 727–729, 844 Wooster, H., 312, 666 Wright, P. H., 30, 47 Wrightsman, L. S., 68, 92, 690, 720, 862 Wuebben, P. L., 671–672, 756, 805, 811, 814–815, 823–824, 859, 862 Wunderlich, R. A., 671, 862 Wundt, W., 3, 8, 14, 289 Wuster, C. R., 491, 666 Wyatt, D. F., 401, 666 Yagi, K., 280, 283 Yando, R. M., 148, 210 Yarom, N., 643, 651 Yarrow, P. R., 282, 285 Young, F. W., 808, 862 Young, R. K., 346, 666 Ypma, E., 292 Yule, G. U., 298–299, 306, 666 Zamansky, H. S., 52, 89, 92, 761, 842, 862 Zax, M., 148, 210 Zegers, R. A., 175, 210 Zeigarnik, B., 784, 862 Zeisel, H., 103, 105, 109, 674, 846 Zelditch, M., 335, 666 Zemack, R., 41, 47 Zillig, M., 309, 666 Zimbardo, P. G., 27, 47, 672, 753, 846 Zimmer, H., 72–73, 92, 727, 729, 733, 735, 737, 862 Zimmerman, T. F., 674, 729–731, 733, 747, 750, 852 Zirkle, C., 311, 319, 405, 666 Zirkle, G. A., 300, 666 Znaniecki, F., 310, 666 Zoble, E. J., 161–162, 175, 190, 192, 210 Zold, A., 765, 838, 859 Zucker, L. G., 782, 840 Zuckerman, M., 48, 51–52, 57–59, 61, 64–65, 71, 76, 90, 92, 697, 701, 703, 708–709, 714, 716–719, 720, 757, 851–852, 862 Zuro, J., 292
Subject Index
ACE measure, 69, 726 Achievement need, volunteer’s, 701–703, 837 Acquaintanceship with subject, experimenter’s, 488–489 Age, experimenter’s, 340–341 Age, volunteer’s, 73–74, 733–736, 838 Altered replication, 214–220, 255 Altruism, volunteer’s, 700–701 Ambiguity/non-ambiguity conditions, 240–242 American Physical Society, 309 American Psychological Association, xi Animal learning, 151–158, 174–175, 630–631, 636, 640 Animal studies, 149–153, 179–180, 405–406, 423–439, 528–529, 636–637 Anthropologists’ interpretations, 313–314 Anxiety, experimenter’s, 345–347 Anxiety, subject’s, 482–486 Anxiety, volunteer’s, 65–66, 717–719, 838 APA ethical guidelines, 771–774, 834 Approval need, 59–60, 703–707, 780, 837 subject’s, 486–487 volunteer’s, 59–60, 703–707, 780, 837 Arousal-seeking, volunteer’s, 63–65, 714–717 Artifact (defined), 10–12 Artifact-influence model, 812–814, 836 and demand characteristics, 814–815 and receptibility, 815–816 and subjects’ behavior, 819–821 Artifact’s stages in life, 16–21, 770–771 Artifacts, typology of, 268–272 Audience Research Institute, 386 Auditory channels, 511–517, 522–525 Auditory cues (in communication), 191–192 Aufforderungscharakter, 781 Authoritarianism, experimenter’s, 352–353 Authoritarianism, volunteer’s, 61–62, 709–712, 837 Barron’s Independence of Judgment Scale, 708 Behavior, experimenter’s (correcting), 577–578 Behavior, experimenter’s, inferring from, 579–581 and qualitative bases, 581–582 and quantitative bases, 579–580
Behavior, experimenter’s (observation of), 502–508, 572–573 and ‘‘mechanical’’ observers, 575–576 methods for, 573–576 and molar variables, 510–518 and molecular variables, 508–510 and reduction of bias, 576–577 and representative observers, 574–575 subjects’, 573–574 and subjects’ ratings, 496–502 Behavioral attitude, experimenter’s, 354 Bernreuter Personality Inventory, 698, 708 Beta measure, 726 Bias, learning to, 533 Biases, typology of, 268–272 Biosocial attributes, 326–329, 344 Biosocial effects, 141–143, 326–329 Birth order, experimenter’s, 350–351 Birth order, volunteer’s, 57–58, 691–695 Blindness (in experiment), 562, 592–601 Boston University, x–xi Bureau of Applied Social Research, 267 California Ethnocentrism (E) Scale, 62, 103 California F Scale, 709–710 California Psychological Inventory, 58–59, 71, 723 California Psychological Inventory Good Impression Scale, 705 California Psychological Inventory Scale of Sociability, 696 Cancer diagnosis (in example), 299–300 Capability, subject’s, 818–819 Carlisle, A., (in example), 270 Cattell Award, xi Cattell Scales, 64, 68 Cheating, see Intentional error Christie-Budnitsky measure, 60, 703, 705 Clever Hans, 149–151, 405–406, 528–529 Cognitive dissonance studies, 215–216, 792–794 Collaborative disagreement, 568–569 Commitment, subject’s, 754–755 Communication, channels of, 189–192, 511–517, 523–525 Communication of expectancies, 519–523 and restriction of, 524–525 and signal specification, 528–536 and time of, 525–528
Communication, persuasive, 777–778, 789–791, 795, 797–799 Compliance study, 802–804 Computational errors, 140, 306, 475–476 Conformity, volunteer’s, 61, 707–709 Confounded experimental treatment, 268–272 and background interactions, 269–270 and interaction effects, 269–279 and main effects, 268–269 Constancy of conditions, 8 Content-filtered speech study, 195–196 Control experiment, 8–9 See also Expectancy controls Control group, 9–10 and placebo, 267–268 and synchronous pre-test-posttest design, 267 Control (in science), 7–10, 13 Controlling artifacts, 272–283 and disguised experiments, 275–283 and heteromethod replication, 274–275 and plausible rival hypotheses, 272–274 Conventionality, volunteer’s, 62–63, 712–713, 838 Couch-Keniston Scale, 708 Cuing, 220–231 Data analysis errors, correcting, 642–644 Debriefing, 255, 256 Deception, 823–824 Defense (as suspiciousness mechanism), 34–35 Demand characteristics, 780–782, 814–815 and drug research (in examples), 126–133 and experimenter bias, 113–114 and ‘‘good subject’’, 111–112 and human subjects research, 135–136 and hypnosis (in examples), 133–134 and non-experiment, 119–121 and placebo effects, 128–133 and post-experimental inquiry, 117–119 and pre-inquiry, 119–121 and quasi controls, 122–124, 129 and sensory deprivation study, 123–126 and simulators, 121–122 and subject’s attitude (toward experimenter), 110–112 Depression Scale, 65 DeRubini, E., (in example), 523 Designs (experimental) double-blind designs, 562, 592–601 four-group, 98–100 four-group control, 101–102 pretest-posttest, 99–100, 102 three-group control, 100 two-by-two factorial, 100 Differential Emphasis score, 250 Disagreement between scientists, 568–569
Disguised experiments, 275–283 and artifacts, 282–283 and classic studies, 279–280 and content restrictions, 277 and employment, 280 and ethical issues, 277–279 and public places, 280–281 and sample solicitation, 281–282 Dominance, experimenter’s, 183, 353–354, 506–507 Double-blind designs, 562, 592–601 Drug (pharmacological) research, (in examples), 126–133 Dual observation, 825–827 Early data returns, 453–454, 562–563 and delayed action effect, 462 effects of, 454–462 and effects of mood, 461 and expectancy, 461–462 and order effects, 456–457 and sex differences effects, 456 and treatment effects, 455–456 Education level, volunteers’, 53, 71, 727–728, 838 Edwards Personal Preference Schedule (EPPS), 59, 61, 71, 697, 708 Effect sizes, 632–634, 640 Effects biosocial, 141–143 early data returns, 453–462 intentional, 141 modeling, 147–148, 385–396 observer, 139–140, 297–307 placebo, 128–133 psychosocial, 143–144 situational, 144–147 unintended, 138–140 Einstellung hypothesis, 37–38 Errors computational, 140, 306, 475–476 data analysis, 642–644 nonrandom, 770–771 recording, 139, 303–305 ESP effect, 519–520 Ethical issues, 770–775 and APA guidelines, 771–774 and informed consent, 772–773 Evaluation apprehension, 211–214, 254–262, 819–820 and altered replication, 214–220, 255 and cognitive dissonance study, 215–216 and communications, 246– 248, 255 and cuing, 220–231 and debriefing, 255–256 and experimenter expectancy effect, 243–254 and manipulated arousal, 220–231 and systematic bias, 215, 254, 260–262 variables influencing, 231–243
Everyday life, expectancy effects in, 399–400, 624–627, 630–631, 640, 643 Expectancy controls, 603–604, 621, 636–638 for cheating, 641–642 and combining methods, 617–618 and cost versus utility, 618–619 implementation of, 609–617 and partial controls, 607–609 and treatment effects, 604–607 Expectancy controls, implementation of, 609 and ethical considerations, 609–610 and experimenter assignment, 612–614 and experimenter expectancy, 610–612 and subject expectancy, 615–617 Expectancy effects, 173–177, 243–254, 397–400 as artifacts, 186–189 in clinical psychology, 401–403 and communication channels, 189–192, 522–525 and education, 201–204 in everyday life, 399–400, 624–627, 630–631, 640, 643 in experimental psychology, 403–407 intentional communication of, 519–523 and interpersonal learning, 192–194, 630–633 magnitude of, 177–181 mediation of, 186–194 and operant conditioning, 189 and person perception studies, 169–173, 630–631, 643 and survey research, 400–401 and teachers, 197–201, 627–629, 635, 644–648 Expectancy, intentional, communication of, 519–525 Expectations, conflicting, 450–452 Expectations, subject’s, 753–754, 804–810 Experience, experimenter’s, 145, 148–156, 367–370, 587–588 Experiences, experimenter’s, 370–382, 570 Experimental setting, 28–31, 146 and forewarning effects, 30–31 and pretests, 29–30 and revealing content, 28–29 Experimental treatment, confounded aspects of, 268–272 Experimental variables, 10–13 Experimenter attributes, 326–329, 344–364 Experimenter behavior, see Behavior, experimenter’s Experimenter bias, 113–114, 544, 584–585 Experimenter biosocial attributes, 326–329, 344 Experimenter blindness to subject condition, 592 and absent experimenters, 601 and double-blind studies, 593–596 and human interaction, 599–601 and minimized contact, 598–602 Experimenter effects and animal learning, 151–158, 423–439 assessment of, 543–544, 550–551
and biased response magnitude vs. biased inference, 548–550 and operating characteristics, 546–547 Experimenter effects, assessment of, and experimenter bias, 544 and experimenter consistency, 545–546 Experimenter effects, generality of, 539–540 and experimenters, 539–542 and situations, 542–543 and subjects, 542 Experimenter expectancy, 243–254, 372–374, 584–585, 610–612, 636–638 and animal subjects, 423–439 avoiding, 184–185 and communication, 246–248, 255, 519–536 controls for, 603–604, 621 and human subjects, 411–421, 477–478, 615–617 and speech analysis study, 249–252 Experimenter modeling, see Modeling effects, experimenter Experimenter sampling, see Sampling, experimenter Experimenter selection, 583–584 and professionals, 589–591 and sample experiments, 584–586 and training, 586–588 Experimenter’s psychosocial attributes, 345–364 Experimenters, and laboratory characteristics, 492–494 and subject’s anxiety, 482–486 and subject’s need for approval, 486–487 and subject’s sex, 477–482 Experimenter-subject feedback effects, 382–384 Experimenter-subject sex differences, 330–340 Extraversion (of volunteers), 697–699, 837 Eysenck Neuroticism measure, 723 Eysenck Personality Inventory, 698 F Scale, 61–62, 103 Fear of task, subject’s, 749–751, 778–779 Fear of Tuberculosis Questionnaire, 103 Fechner, G., (in example), 405 Feeling states, subject’s, 751–753 Finger tapping studies, 224–225 Four factor theory, 648 Four-group design, 98–99, 100–102 Fraternity opinion study, 789–791 Friendliness, experimenter’s, 142–143, 506–507 Galvanic skin response (GSR), 114–116 Gatekeeper/non-gatekeeper conditions, 234–237 Gedanken experiment, 93–94, 114 Geographic variables (for volunteers), 74–75 Good subject effect, 111–112 Good subject study, 799–802 Guilford Scale, 64
Guilford STDCR measure, 723 Guilford-Zimmerman Scale, 58 Guilford-Zimmerman Temperament survey, 697, 723 Harvard University, x–xi, 542 Hawthorne studies, 94–95 Hayes, Meltzer, and Lundberg study, 80–81 Heart disease study (as example), 311 Henmon-Nelson measure, 726 Heron Inventory, 698 Heron Neuroticism measure, 723 Heteromethod replication, 274–275 Hipp chronoscope, 8 Holtzman inkblots, 164, 174–175 Horse study, 149–151, 405–406, 528–529 Hostility, experimenter’s, 351–352 Human subjects research, 110–112, 135–136, 158–173, 411–421, 477–478 and inkblot tests, 164–166, 630–631, 640 and laboratory interviews, 167–169, 174–175, 630–631, 640, 643 learning and ability, 158–161, 174–175 and person perception, 169–173, 411–416, 630–631, 643 and psychosocial judgments, 161–163, 630–631, 640, 643 and reaction time, 163–164, 174–175, 630–631, 640 Hypnosis research (in examples), 133–135, 781 Independent variable, 11–12 INDSCALE NDS procedure, 808–810 Inference, logic of, 264–268, 283 Inference (threats to), typology of, 268–272 Inferential validity, 821–827 Informed consent, 772–773 Inkblot tests, 164–166, 630–631, 640, 643 Intelligence, experimenter’s, 353 Intelligence, volunteer’s, 69–71, 724–727, 838 Intentional effects, 141 and communication channels, 522–525 and communication of expectancies, 519–523 Intentional error, 317–325 and behavioral sciences, 319–322 and biological sciences, 318–319 control of, 322–325 and physical sciences, 317–318 Interaction effects, 269–270 Interaction effects (with population characteristics), 270–271 Interest in experiment, 762–764 Interest in task, subject’s, 759–764 Interpersonal experimenter effects, 630–631 and external validity, 644–648 Interpreter effects, 140–141, 308–309 in anthropology, 313–314 in behavioral sciences, 311–314
  in biological sciences, 310–311
  control of, 315–316
  in experimental psychology, 314–315
  in physical sciences, 309–310
IPAT Trait Anxiety Scale, 66, 103
IQ tests, 69, 70, 198–200
Jackson’s Personality Research Form, 700
Janis and Field measure, 68
Juice study (as example), 147–148
Kalischer, O., (in example), 523
Kinsey interviews, 62, 67, 74–75, 79, 776–777
Kruskal-Wallis one-way anova, 218
Laboratory characteristics, 374–376, 492–494
Laboratory interviews, 167–169, 174–175, 630–631, 640, 643
Learning and ability (human), 158–161, 174–175, 640
Lie detector study, 114–116
Lister, J., (in example), 311
London tramway conductors (in example), 311
Lord Kelvin, (Thomson, W.), 310
Lord Rayleigh, (Strutt, J. W.), 310
Lykken Emotionality measure, 723
MacArthur Foundation, xi
Michelson-Morley experiment, 309
Main effects, 268–269, 271–272
Mann-Whitney Rank Sum Test, 252–253
Mann-Whitney U., 126
Marital status (of volunteers), 736–737, 838
Marlowe-Crowne Social Desirability Scale, 59–60, 222–223, 231–233, 248, 347–349, 704–705
Marmer study, 85–86
Maslow Self-Esteem measure, 723
Maze learning, 423–429, 431–437
Message perception (as suspiciousness mechanism), 35–36
Mexican village (in example), 313
Middletown NP Index, 723
Mill Hill Vocabulary measure, 726
Minnesota Multiphasic Personality Inventory (MMPI), 18–20, 58, 63–65, 67, 76, 698
Minnesota Teacher Attitude Inventory, 103
Minsk-Pinsk joke, 16
Modeling effects, experimenter’s, 147–148, 385–396
  in clinical psychology, 387–390
  and laboratory experiments, 390–396
  in survey research, 385–387
Mood Adjective Check List, 225–227
Morris, J. N., (in example), 311
Motivation, subject’s, 816–818
Motivational elements, 465–471, 835–836
Motives, experimenter’s, 465–466, 470–471
  and individual subjects, 466–468
  and subjects in groups, 468–470
National Institute of Mental Health, xi, 93
National Science Foundation, xi, 48, 138, 264, 541
Need for approval, 231–233
  experimenter’s, 347–349, 486–487
  subject’s, 486–487
Newman Club, 107
Newman-Keuls technique, 808
Newton, I., (in examples), 11, 265–266, 297
Newtonian mechanics (in example), 93
Nicholson, W., (in example), 270
Non-experiment, 119–121
Nonrandom errors, 770–771
Nonreactive measures, 107–108
Nonverbal communication, 648–649
Nonvolunteers, 81–83, 87–88, 677–678, 830
  and opinion change, 81–83
  and persuasive communication, 81–85
  and pseudovolunteering, 689–691
No-shows, 689–691
N-rays (in example), 298
Number of studies question, 156
Nurse-Patient Relationship Sort, 103
Observation. See also Behavior, experimenter’s
Observation, dual, 825–827
Observer effects, 139–140, 297–303
  and agricultural studies, 299
  and behavioral sciences, 300–305
  and biological sciences, 298–300
  and computational errors, 140, 306
  control of, 306–307
  and physical sciences, 297–298
  and planaria studies, 300–303
  and recording errors, 139, 303–306
Ohio State University, 218, 542
Omnibus Personality Inventory, 698
Operant learning, 429–436
Order effects, 104–105
Orientation-Inventory measure, 723
Oxford University Press, xi
Paranoia Scale (MMPI) measure, 723
Participants, see Subjects
Pasteur, L., (in example), 311
Pavlov, I., (in example), 405
Person perception studies, 630–631, 640, 643
  photo-viewing, 169–175, 220–222, 247–250
Placebo control group, 267–268
Placebo effects, 128–133
Planaria studies, 300–303
Plateau, J. A. F., (in example), 405
Plausible rival hypotheses, 266–267, 272–274
Population characteristics, 270–271
Post-experimental inquiry, 117–119
Pre-inquiry, 119–121
Pretest-posttest design, 99–100, 102, 267
Pretests, 95–96
  cautions for using, 106–108
  and opinions, 103–106
Pretest sensitization effect, 93–94, 105
  minimizing, 107–108
Pretest-treatment interaction effects, 102–104
Principal investigators, 176–177
Prior manipulation, 95–98
  controls for, 96–97
  and placebo, 99
Profile of Nonverbal Sensitivity (PONS), 649
Pseudovolunteering, 689–691
Psychasthenia Scale, 65
Psychopathology, volunteer’s, 66–69, 723–724, 838
Psychophysical judgments, 161–163, 174–175, 630–631, 640
Psychosocial attributes, experimenter’s, 345–364
Psychosocial effects, 143–144
Psychosocial judgments, 161–163
Puzzle experiments, 8
Pygmalion experiment, 644–648
Quasi controls, 122–124, 129–131, 823–824
Race, experimenter’s, 341–343
Race, subject’s, 143
Random assignment, 97–98
Rat studies, 151–153, 179–180, 423–439, 636–637
Raven’s Matrices, 726
Reaction time studies, 163–164, 174–175, 630–631, 640
Recording errors, 139, 303–305
Recruiter characteristics, 756–757
Recruiting subjects, 678–681
Red Cross Beginner Swimmer Test, 201
Religion, experimenter’s, 343–344
Religion, subject’s, 73–74, 144, 737–739, 838
Remote Associates Task, 751
Replication, anecdotal, 558–560
Replication assessment, 555–556
  and generality index, 557–558
  and replication index, 556–557
Replication shortage, 552–554
  and correlated replicators, 554–555
  and statistical inference, 552–553
Research design, 827–828
Research findings,
  and ethics, 770–775
  and motivational elements, 835–836
  and robustness, 775–777, 834–835
  and validity, 780–786
Response bias artifact, 16–21
Role play, 824–825
Rorschach test, 58, 69, 164–166, 174–175, 697
Rose Bowl (in example), 218
Rosenberg Self-Esteem measure, 723
Rosnow and Rosenthal studies, 81–85
Rotter Incomplete Sentence Blank, 709
Sampling, experimenter, 561–563
  and cancellation of biases, 565–567
  and correcting for expectancy effects, 570–571
  and expectancies determined after, 569–570
  and expectancies known before, 567–569
  and expectancies unknown, 564–567
  and homogeneity of results, 567
  and increasing generalizability, 563–564
  and population characteristics, 564–565
  and subdividing experiments, 562–563
Sarason Test Anxiety Scale, 248
SAT measure, 726
Schoolchildren studies, 312–313, 627–629
Self-disclosure, volunteer’s, 699–700
Self-reports, 765
Semmelweis, I., (in example), 311
Sensation Seeking Scale, 64
Sensory deprivation study, 123–126
Sex differences, 142–143, 181–183
  experimenter’s, 328–330
  of experimenters and subjects, 477–482, 680–689, 837
  of volunteers, 55–56, 680–689, 837
Shipley Hartford measure, 70, 726
Shostrom Personal measure, 723
Siegman Self-Esteem measure, 723
Silverman Self-Esteem measure, 723
Simulators, 121–122
Situational factors, 87, 144–147, 745, 766–768, 832–833
  acquaintanceship with recruiter, 755–756
  experimental scene, 146
  experimenter experiences, 370–382
  experimenter’s experiences, 145, 367–370
  laboratory characteristics, 374–376
  material incentives, 745–749
  principal investigator, 146–147
  recruiter characteristics, 756–757
  subject behavior, 145–146
  subject’s commitment, 754–755
  subject’s evaluation, 764–766
  subject’s expectations, 753–754
  subject’s fear of task, 749–751
  subject’s feeling states, 751–753
  subject’s interest in topic, 759–764
  task importance, 757–759
Skinner-Box learning, 152–153, 429–437
Sociability, volunteer’s, 58–59, 695–697, 837
Social class, volunteer’s, 71–73
Social Participation Scale (MMPI), 696
Social psychology of the experiment, 259–262
Social psychology of unintentional influence, 620–624
Social Reinforcement Scale, 704
Society of Experimental Psychologists, 12–13
Solomon four-group control design, 103
Source attractiveness (as suspiciousness mechanism), 36–37
Speech analysis study, 249–252
Speed of light (in example), 309–310
Spencer Foundation, xi
Stanford Binet test, 264
Statistical inference, 552–554
Status, experimenter’s, 355–358, 489–491
Stem-and-leaf display, 687–689, 729–730, 734
Stevenson marble-dropping task, 160–161
Subjects,
  and acquaintanceship with experimenter, 488–489
  and acquaintanceship with recruiter, 755–756
  and anxiety, 482–486
  and behavior, 145–146, 782–784
  and capability, 818–819
  and commitment, 754–755
  and evaluation, 764–766
  and expectations, 753–754, 804–810
  and fear of task, 749–751, 778–779
  and feeling states, 751–753
  and interest in task, 759–764
  and material incentives, 745–749
  and motivation, 816–818
  and need for approval, 486–487
  and perception of experimenter, 110–112, 472–475
  and race, 143
  and recruiter characteristics, 756–757
  and recruitment, 678–681
  and self-reports, 765
  sex of, 477–482
  and sex differences, 142–143, 477–482, 680–689, 837
  and task importance, 757–759
Suspicion-arousing factors,
  conclusion drawing, 24–28
  extremity of message, 26–27
  message style and content, 24–31
  opposition arguments, 25–26
  presentation style, 27–28
Suspiciousness artifact, 15–16, 31–40
Suspiciousness artifact mechanisms, 34–38
Suspiciousness, and deception, 40–43
Suspiciousness (of experimenter’s intent), 21–22
  and experimental setting, 28–31
  and individual differences, 39–40
  and source presentation, 22–24
  and suspicion-arousing factors, 24–31
  and temporal considerations, 38–39
Systematic bias, 215, 228–230, 254, 260–262
Task importance, 757–759
Taylor Manifest Anxiety Scale, 65–66, 76, 143, 146, 346
Teacher expectancy study, 197–201, 627–629, 635, 644–648
Temple University, x–xi
Tests of General Ability, 200
Thaddeus Bolton endowment, xi
Thematic Apperception Test (TAT), 69
Three-group control design, 100
Town of origin, volunteer’s, 739–740
Two-by-two factorial, 100
Type I and Type II errors, 80, 108, 553, 794–799, 810
Unintended effects, 138–149
  biosocial, 141–143
  interpreter, 140–141
  modeling, 147–148
  observer, 139–140
  psychosocial, 143–144
  situational effects, 144–147
Unintended influence, 194–197, 620–624
  social psychology of, 620–624
  and subject type, 196–197
University of California, Riverside, x
University of Chicago MESA, 95 computer program, 245
University of North Dakota, x, 292, 542
Unobtrusive measures, 107–108, 822–823
Validity, threats to, 780–786
  and demand characteristics, 780–782
  and role of subject, 782–784
  and subject motivation, 784–786
Verbal conditioning study, 799–802
Vexirfehler, 10
Vexirversuche, 8
Visual channel, 511–517, 522–525
Visual cues, (in communication), 192–193
Voluntarism variable, 787
Volunteer bias, 672–675, 768–769, 833
Volunteer characteristics, 76–77, 87–88, 740–744, 830–831
  achievement need, 701–703, 837
  age, 73–74, 733–736, 838
  altruism, 700–701
  anxiety, 65–66, 717–719, 838
  approval need, 59–60, 703–707, 784–785, 837
  arousal-seeking, 63–65, 714–717
  authoritarianism, 61–62, 709–712, 837
  birth order, 57–58, 691–695
  conformity, 61, 707–709
  conventionality, 62–63, 712–713, 838
  education level, 53, 71, 727–728, 838
  extraversion, 697–699, 837
  geographic variables, 74–75
  intelligence, 69–71, 724–727, 838
  marital status, 736–737, 838
  psychopathology, 66–69, 723–724, 838
  religion, 73–74, 737–739, 838
  self-disclosure, 699–700
  sex differences, 55–56, 680–689, 837
  size of town of origin, 739–740
  sociability, 58–59, 695–697, 837
  social class, 71–73, 728–733, 838
Volunteer population, 75–77
Volunteer situations, 75–76
Volunteering, 50, 54–55
  incentives to, 50–52
  phenomenology of, 54–55
  and reliability, 675–676
  and survey research, 673
Volunteers
  and acquaintanceship with recruiter, 755–756
  and bias, 672–675
  and cognitive dissonance study, 85–86
  college student, 48, 75, 86, 671–672, 775, 788–789
  and commitment, 754–755
  and evaluation, 764–766
  and expectations, 753–754, 804–810
  and experimental outcomes, 80
  and fear of task, 749–751
  and feeling states, 751–753
  and interest in topic, 759–764
  investigator’s influence on, 53
  and involvement, 52
  and material incentives, 745–749
  and motivational differences, 777–780
  and nonvolunteers, 677–678, 789–790
  and recruiter characteristics, 756–757
  recruiting, 678–680
  representativeness of, 78–80
  and situational variables, 87, 144–147, 745–768
  and social biases, 53
  and task importance, 757–759
Warmth, experimenter’s, 359–364
Water sample bias (in example), 271–272
Wechsler Intelligence Scale for Children (WISC), 159–160
Wernicke, K., (in example), 523
Western Electric Company (in example), 94–95
Wever-Bray effect, 11
Within-subjects design, 159
Worms (planaria) studies, 152–153
Yale campus study (in example), 215–217