Statistical and Methodological Myths and Urban Legends
Statistical and Methodological Myths and Urban Legends Doctrine, Verity and Fable in the Organizational and Social Sciences
Edited by
Charles E. Lance and Robert J. Vandenberg
New York London
Routledge Taylor & Francis Group 270 Madison Avenue New York, NY 10016
Routledge Taylor & Francis Group 27 Church Road Hove, East Sussex BN3 2FA
© 2009 by Taylor & Francis Group, LLC
Routledge is an imprint of Taylor & Francis Group, an Informa business

Printed in the United States of America on acid-free paper
10 9 8 7 6 5 4 3 2 1

International Standard Book Number-13: 978-0-8058-6238-6 (Softcover) 978-0-8058-6237-9 (Hardcover)

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.

Library of Congress Cataloging-in-Publication Data
Statistical and methodological myths and urban legends : doctrine, verity and fable in the organizational and social sciences / [edited by] Charles E. Lance & Robert J. Vandenberg.
p. cm.
Includes bibliographical references.
ISBN 978-0-8058-6237-9 (hardcover) -- ISBN 978-0-8058-6238-6 (pbk.)
1. Organization--Research--Methodology. 2. Organization--Research--Statistical methods. 3. Social sciences--Statistical methods. 4. Social sciences--Research--Statistical methods. I. Lance, Charles E., 1954- II. Vandenberg, Robert J.
HD30.4.S727 2009
300.72--dc22 2008019657

Visit the Taylor & Francis Web site at http://www.taylorandfrancis.com and the Routledge Web site at http://www.routledge.com
To my parents who, although they had little formal education of their own, always encouraged me to pursue mine vigorously.
Charles E. Lance

To Carole, Drew, Kaity, and Jackson for being the highest priorities in my life.
Robert J. Vandenberg
Contents

Preface
About the Editors
Acknowledgments
Introduction
Part 1 Statistical Issues

1 Missing Data Techniques and Low Response Rates: The Role of Systematic Nonresponse Parameters
Daniel A. Newman
    Organization of the Chapter
    Levels, Problems, and Mechanisms of Missing Data
    Three Levels of Missing Data
    Two Problems Caused by Missing Data (External Validity and Statistical Power)
    Missingness Mechanisms (MCAR, MAR, and MNAR)
    Missing Data Treatments
    A Fundamental Principle of Missing Data Analysis
    Missing Data Techniques (Listwise and Pairwise Deletion, ML, and MI)
    Systematic Nonresponse Parameters ($d_{miss}$ and $f^2_{miss}$)
    Theory of Survey Nonresponse
    Missing Data Legends
    Legend #1: “Low Response Rates Invalidate Results”
    Legend #2: “When in Doubt, Use Listwise or Pairwise Deletion”
    Applications
    Longitudinal Modeling
    Within-Group Agreement Estimation
    Meta-analysis
    Social Network Analysis
    Moderated Regression
    Conclusions
    Future Research on $d_{miss}$ and $f^2_{miss}$
    Missing Data Techniques
    References
    Appendix
    Derivation of Response Rate Bias for the Correlation (Used to Generate Figure 1.1c)
2 The Partial Revival of a Dead Horse? Comparing Classical Test Theory and Item Response Theory
Michael J. Zickar and Alison A. Broadfoot
    Basic Statement of the Two Theories
    Classical Test Theory
    Item Response Theory
    Criticisms and Limitations of CTT
    Lack of Population Invariance
    Person and Item Parameters on Different Scales
    Correlations Between Item Parameters
    Reliability as a Monolithic Concept
    Criticisms and Limitations of IRT
    Large Sample Sizes
    Strong Assumptions
    Complicated Programs
    Times to Use CTT
    Small Sample Sizes
    Multidimensional Data?
    CTT Supports Other Methodologies
    Times to Use IRT
    Focus on Particular Range of Construct
    Conduct Goodness-of-Fit Studies
    IRT Supports Many Psychometric Tools
    Conclusions
    References
3 Four Common Misconceptions in Exploratory Factor Analysis
Deborah L. Bandalos and Meggen R. Boehm-Kaufman
    The Choice Between Component and Common Factor Analysis Is Inconsequential
    The Component Versus Common Factor Debate: Methodological Arguments
    The Component Versus Common Factor Debate: Philosophical Arguments
    Differences in Results From Component and Common Factor Analysis
    Orthogonal Rotation Results in Better Simple Structure Than Oblique Rotation
    Oblique or Orthogonal Rotation?
    Do Orthogonal Rotations Result in Better Simple Structure?
    The Minimum Sample Size Needed for Factor Analysis Is… (Insert Your Favorite Guideline)
    New Sample Size Guidelines
    The “Eigenvalues Greater Than One” Rule Is the Best Way of Choosing the Number of Factors
    Discussion
    References
4 Dr. StrangeLOVE, or: How I Learned to Stop Worrying and Love Omitted Variables
Adam W. Meade, Tara S. Behrend, and Charles E. Lance
    Theoretical and Mathematical Definition of the Omitted Variables Problem
    Violated Assumptions
    More Complex Models
    Path Coefficient Bias Versus Significance Testing
    Minimizing the Risk of LOVE
    Experimental Control
    More Inclusive Models
    Use Previous Research to Justify Assumptions
    Consideration of Research Purpose
    References
5 The Truth(s) on Testing for Mediation in the Social and Organizational Sciences
James M. LeBreton, Jane Wu, and Mark N. Bing
    Baron and Kenny’s (1986) Four-Step Test of Mediation
    Condition/Step 1
    Condition/Step 2
    Condition/Step 3
    Condition/Step 4
    The Urban Legend: Baron and Kenny’s Four-Step Test Is an Optimal and Sufficient Test for Mediation Hypotheses
    The Kernel of Truth About the Urban Legends
    Debunking the Legends
    Legend 1: A Test of a Mediation Hypothesis Should Consist of the Four Steps Articulated by Baron and Kenny (1986)
    Legend 2: Baron and Kenny’s (1986) Four-Step Procedure Is the Optimal Test of Mediation Hypotheses
    Legend 3: Fulfilling the Conditions Articulated in the Baron and Kenny (1986) Four-Step Test Is Sufficient for Drawing Conclusions About Mediated Relationships
    Suggestions for Testing Mediation Hypotheses
    Structural Equation Modeling (SEM) as an Analytic Framework
    Summary of Tests of Mediation
    A Heuristic Framework for Classifying Mediation Models
    Summary
    Conclusion
    Author Note
    References
6 Seven Deadly Myths of Testing Moderation in Organizational Research
Jeffrey R. Edwards
    The Seven Myths
    Myth 1: Product Terms Create Multicollinearity Problems
    Myth 2: Coefficients on First-Order Terms Are Meaningless
    Myth 3: Measurement Error Poses Little Concern When First-Order Terms Are Reliable
    Myth 4: Product Terms Should Be Tested Hierarchically
    Myth 5: Curvilinearity Can Be Disregarded When Testing Moderation
    Myth 6: Product Terms Can Be Treated as Causal Variables
    Myth 7: Testing Moderation in Structural Equation Modeling Is Impractical
    Myths Beyond Moderation
    Conclusion
    References
7 Alternative Model Specifications in Structural Equation Modeling: Facts, Fictions, and Truth
Robert J. Vandenberg and Darrin M. Grelle
    The Core of the Issue
    AMS Strategies
    Equivalent Models
    Nested Models
    Nonnested Alternative Models
    Summary
    AMS in Practice
    Summary
    References
8 On the Practice of Allowing Correlated Residuals Among Indicators in Structural Equation Models
Ronald S. Landis, Bryan D. Edwards, and Jose M. Cortina
    Unraveling the Urban Legend
    Extent of the Problem
    Origins
    A Brief Review of Structural Equation Modeling
    Indicator Residuals
    Model Fit
    An Example
    Why Correlated IRs Improve Fit
    Problems With Correlated Residuals
    Recommendations
    Summary and Conclusions
    References
Part 2 Methodological Issues

9 Qualitative Research: The Redheaded Stepchild in Organizational and Social Science Research?
Lillian T. Eby, Carrie S. Hurst, and Marcus M. Butts
    Definitional Issues
    Philosophical Differences in Qualitative and Quantitative Research
    Quantitative and Qualitative Conceptualizations of Validity
    Caveats and Assumptions
    Beliefs Associated With Qualitative Research
    Belief #1: Qualitative Research Does Not Utilize the Scientific Method
    Belief #2: Qualitative Research Lacks Methodological Rigor
    Belief #3: Qualitative Research Contributes Little to the Advancement of Knowledge
    Evaluating the Beliefs Associated With Qualitative Research
    Evaluation of Belief #1: Qualitative Research Does Not Utilize the Scientific Method
    Evaluation of Belief #2: Qualitative Research Is Methodologically Weak
    Evaluation of Belief #2a: Qualitative Research Has Weak Internal Validity
    Evaluation of Belief #2b: Qualitative Research Has Weak Construct Validity
    Evaluation of Belief #2c: Qualitative Research Has Weak External Validity
    Evaluation of Belief #3: Qualitative Research Contributes Little to the Advancement of Knowledge
    The Future of Qualitative Research in the Social and Organizational Sciences
    Concluding Thoughts
    Author Note
    References
10 Do Samples Really Matter That Much?
Scott Highhouse and Jennifer Z. Gillespie
    Kernel of Truth
    Background
    History of the Concern
    The Research Base
    Why Do Samples Seem to Matter So Much?
    People Confuse Random Sampling With Random Assignment
    People Focus on the Wrong Things
    People Rely on Superficial Similarities
    Concluding Thoughts
    Author Note
    References
11 Sample Size Rules of Thumb: Evaluating Three Common Practices
Herman Aguinis and Erika E. Harden
    Determine Whether Sample Size Is Appropriate by Conducting a Power Analysis Using Cohen’s Definitions of Small, Medium, and Large Effect Size
    Discussion
    Increase the A Priori Type I Error Rate to .10 Because of Your Small Sample Size
    Discussion
    Sample Size Should Include at Least 5 Observations per Estimated Parameter in Covariance Structure Analyses
    Discussion
    Discussion
    Author Note
    References
12 When Small Effect Sizes Tell a Big Story, and When Large Effect Sizes Don’t
Jose M. Cortina and Ronald S. Landis
    Effect Size Defined
    The Urban Legend
    The Kernel of Truth
    Quine and Ontological Relativism
    Contextualization
    Inauspicious Designs
    Phenomena With Obscured Consequences
    Phenomena That Challenge Fundamental Assumptions
    The Flip Side: Trivial “Large” Effects
    Conclusion
    References
13 So Why Ask Me? Are Self-Report Data Really That Bad?
David Chan
    The Urban Legend of Self-Report Data and Its Historical Roots
    Problem #1: Construct Validity of Self-Report Data
    Problem #2: Interpreting the Correlations in Self-Report Data
    Problem #3: Social Desirability Responding in Self-Report Data
    Problem #4: Value of Data Collected From Non-Self-Report Measures
    Conclusion and Moving Forward
    References
14 If It Ain’t Trait It Must Be Method: (Mis)application of the Multitrait-Multimethod Design in Organizational Research
Charles E. Lance, Lisa E. Baranik, Abby R. Lau, and Elizabeth A. Scharlau
    Background
    Literature Review
    Range of Traits Studied
    Range of Methods Studied
    Not All “Measurement Methods” Are Created Equal
    The Case of Multisource Performance Appraisal
    The Case of AC Construct Validity
    Other Cases
    So, Are Any “Method” Facets Really Method Facets?
    Discriminating Method From Substance, or “If It Looks Like a Method and Quacks Like a Method…”
    References
15 Chopped Liver? OK. Chopped Data? Not OK.
Marcus M. Butts and Thomas W. H. Ng
    Urban Legends Regarding Chopped Data
    Urban Legends Associated With the Occurrence of Chopped Data
    Urban Legends Associated With Chopped Data Techniques
    Urban Legends Associated With Chopped Data Justifications
    Literature Review
    Chopped Data Through the Years
    Prevalence of Chopped Data
    The Occurrence of Chopped Data Over Time
    Chopped Data Across Disciplines
    Types of Chopped Data Approaches
    Evaluating Justifications for Using Chopped Data
    Insufficient or Faulty Justifications (Myths)
    Legitimate Justifications (Truths)
    Advantages of, Disadvantages of, and Recommendations for Using Chopped Data
    (Perceived) Advantages of Chopping Data
    Disadvantages of Chopping Data
    Recommendations When Faced With Chopping Data
    Conclusion
    References
Subject Index
Author Index
Preface
The objective of this book is to provide an up-to-date review of commonly undertaken methodological and statistical practices that are sustained, in part, upon sound rationale and justification and, in part, upon unfounded lore. The practices themselves are not necessarily intrinsically faulty. Rather, it is often the reasoning or rationalization used to justify the practices that is questionable. All too frequently we hear from authors whose manuscripts were rejected because an editor or reviewer invoked some questionable methodological or statistical criterion. We also hear authors state that they used “such-and-such” criteria, implying that by doing so their research is therefore methodologically sound. In reality, though, the application of such criteria may be largely myth. Some examples of these “methodological urban legends,” as we refer to them in this book, are characterized by the following manuscript critiques: (a) “you didn’t test for any alternative models”; (b) “your within group correlation was less than .70”; (c) “your self-report measures suffer from common method bias”; (d) “your test for mediation failed because your X and Y were not significantly correlated”; (e) “you have an unmeasured variables issue”; (f) “there is no point in interpreting your main effects when their product is statistically significant”; (g) “you cannot meaningfully interpret the product term because it suffers from multicollinearity”; (h) “your item-to-subject ratios are too low”; (i) “you can’t generalize these findings to the real world”; (j) “your fit indices are too low”; or (k) “your effect sizes are too low.” Historically, there is a kernel of truth to most of these legends, but in many cases that truth has been long forgotten, ignored, or embellished beyond recognition. This book examines several such legends.

Each chapter is organized to address: (a) What is the legend that “we (almost) all know to be true”; (b) What is the “kernel of truth” to each legend; (c) What are the myths that have developed around this kernel of truth; and (d) What should the state of the practice be? This book meets an important need for the accumulation and integration of these methodological and statistical practices. We foresee this being a popular book not only in statistical and methods research seminars, but also as a reference book for researchers in the organizational and social sciences.
About the Editors
Charles E. Lance is a Professor of Industrial and Organizational Psychology at the University of Georgia. His work in the areas of performance measurement, assessment center validity, research methods, and structural equation modeling has appeared in such journals as Psychological Methods, Organizational Research Methods (ORM), Journal of Applied Psychology, Organizational Behavior and Human Decision Processes, Journal of Management, and Multivariate Behavioral Research. His 2000 ORM article with Robert J. Vandenberg on measurement invariance is the most often cited article in ORM’s history and won the 2005 Research Methods Division’s Robert McDonald Advancement of Organizational Research Methodology Award. His 2006 ORM article on the origin and evolution of four statistical cutoff criteria won the Research Methods Division of the Academy of Management Best Paper of the Year Award. Also, his 2008 article “Why Assessment Centers (ACs) Do Not Work the Way They’re Supposed to” was one of the two inaugural focal articles in Industrial and Organizational Psychology: An Exchange of Perspectives on Science and Practice. Dr. Lance is also co-editor of Performance Measurement: Current Perspectives and Future Challenges (with Wink Bennett and Dave Woehr). Dr. Lance is a Fellow of the Society for Industrial and Organizational Psychology (SIOP) and the American Psychological Association, former President of the Atlanta Society for Applied Psychology, is a member of the Society for Organizational Behavior and is a licensed psychologist in the State of Georgia. He is currently Associate Editor of ORM, and on the editorial boards of Personnel Psychology, Human Performance, and Group & Organization Management.
Robert J. Vandenberg is a Professor of Management in the Terry College of Business at the University of Georgia. His primary substantive research focuses are on organizational commitment, and high involvement work processes. His methodological research stream includes measurement invariance, latent growth modeling, and multilevel structural equation modeling. His articles on these topics have appeared in the Journal of Applied Psychology, Journal of Management, Journal of Organizational Behavior, Human Resource Management, Organization Sciences, Group and Organization Management, Journal of Managerial Psychology, Organizational Behavior and Human Decision Processes, and Organizational Research Methods. Since 1999, both his substantive and methodological work has been integral to three funded grants totaling $4 million from the Centers for Disease Control, and the National Institute of Occupational Safety and Health. His measurement invariance article coauthored with Charles E. Lance received the 2005 Robert McDonald Award for the Best Published Article to Advance Research Methods given by the Research Methods Division of the Academy of Management. He has served on the editorial boards of the British Journal of Management, Journal of Applied Psychology, Journal of Management, Organizational Behavior and Human Decision Processes, and Organizational Research Methods. He is currently the editor of Organizational Research Methods. He is past division chair of the Research Methods Division of the Academy of Management. In addition, he is a fellow of the American Psychological Association, the Society for Industrial and Organizational Psychology, and the Southern Management Association. He is also a fellow in the Center for the Advancement of Research Methods and Analysis at Virginia Commonwealth University in which he conducts annual short courses in advanced structural equation modeling techniques.
Acknowledgments
Many people and institutions supported us in this endeavor. First and foremost, we thank the contributing authors. Simply stated, there wouldn’t be a book without their respective contributions. Each and every contributing author was a professional to the core in working with us, and within the deadlines we imposed. Second, we couldn’t have had a more supportive senior editor in Anne C. Duffy of Psychology Press in the Taylor & Francis Group. From first presenting her the book prospectus through the production process, Anne was continually available to assist us and displayed the utmost patience with us as the book was developed. We would also like to thank the many others in the Taylor & Francis Group who remain behind the scenes but play an important role in supporting these efforts such as marketing, production, and distribution. Finally, we thank the reviewers for their positive comments and feedback.

Charles E. Lance’s work on this book was supported in part by: (a) National Institute on Drug Abuse (NIDA: Grant No. R01 DA01946001A1, Lillian Eby, P.I.); (b) National Institute on Aging (NIA: Grant No. AG15321, Gail Williamson, P.I.); and (c) National Institutes of Health, National Cancer Institute (NIH: Grant No. 5R03CA11747002, Lindsay Della, P.I.), and Robert J. Vandenberg’s work on the book was supported in part by the US Centers for Disease Control and Prevention (CDC: Grant No. 1 RO1 DP000111-01, Rodney Dishman, P.I.). However, this book’s contents are solely the responsibility of the editors and authors, and do not necessarily represent the official views of NIDA, NIA, NIH, or CDC. Finally, we would like to thank the University of Georgia and our respective departments and colleges for their support.
Introduction
Charles E. Lance and Robert J. Vandenberg
Almost everyone in the organizational and social sciences can recite a number of research-related “truisms” that we learned in our graduate training, while conducting research, in our experience publishing, while reviewing grant proposals, and so on. For example, nearly everyone could probably recite (a) some rule of thumb as to what constitutes an acceptably large factor loading, (b) how many subjects it takes to conduct a/n XXX (regression, factor, item analysis; pick one), and (c) good reasons why samples with low response rates (e.g., 15%–30%) cannot be trusted. These truisms have been referred to as “received doctrines” (Barrett, 1972, p. 1) and “statistical and methodological myths and urban legends” (Vandenberg, 2006, p. 194). Beliefs in such “urban legends” (ULs) seem to be based, in part, on some kernel of truth(s) that can often be identified in relevant literature and, in part, on myth that has developed around their application and invocation. The purpose of this book is to provide a set of up-to-date reviews of the origin, development, pervasiveness, and present status of several of these ULs. These ULs reinforce a number of methodological and statistical beliefs and practices that are based, in part, on sound rationale and justification and, in part, on unfounded lore. The beliefs and practices themselves are not necessarily intrinsically faulty, but the rationale for them often is questionable. The chapters in this book examine several such beliefs and practices, illustrated anecdotally by the following statements:

• “What do you mean I shouldn’t allow my residuals to correlate? I’ve seen at least a dozen articles where they did this!”
• “That’s absurd! Every aspect of my model is solidly anchored to theory. Why do I have to specify an alternative model?”
• “Rats! My results are statistically significant, but my effect sizes are so small that they’ll never get past the reviewers.”
• “Wait, you can’t interpret the first-order terms in the presence of a significant interaction!”
• “Qualitative research is just barely science.”
• “How am I ever going to justify a response rate of only 32%?”
• “Everybody else does it, so I’m just going to do a median split on this variable and look at High versus Low group differences.”
• “The reviewers of this journal are going to reject this manuscript outright because we used self-report data; let’s send it to a lower-tiered journal.”
• “My advisor told me to never use a student sample and to always use samples of real working people.”
• “I know I need a pretty big sample to do this analysis, but just how big?”
• “Just follow the Baron and Kenny (1986) steps to test mediation; you can’t go wrong.”
• “Don’t worry, there are a lot of publicized examples showing that it’s okay to use multiple sources as your method facet in your multitrait-multimethod study.”
• “Classical test theory is so outdated. We need to rerun these analyses using IRT.”
• “Why did I conduct a principal components analysis with Varimax rotation? Well, it’s the default in SPSS, so it must be optimal.”
Each of these statements represents a chapter in this volume. We asked contributing authors to address the following points regarding statements such as these in each chapter: (a) What is the legend that “we (almost) all know to be true”? (b) What is the “kernel of truth” to each legend? (c) What are the myths that have developed around this kernel of truth? and (d) What should the state of the practice be? As editors, we sought to work with the authors to reveal the truth, the lore, and the recommended best practice associated with their own legend. In the end, our goal was to provide researchers with a set of guidelines for sounder research practice.

This book has a long history. Over a decade ago, Vandenberg became increasingly perplexed and frustrated by some comments he was receiving during the manuscript review process. Most editorial comments were appropriate and meaningful, critical but constructive. However, some very frustrating comments were of the form “you have a missing variables problem,” “your $r_{wg}$ is too low,” “your sample-to-item ratios are inadequate,” and many, many more. What was most disconcerting about these comments emerged when the cited source supporting the comment was consulted, either by reading the source or by actually asking the alleged source directly, “Did you ever say such a thing?” Vandenberg found that very often the comments were gross distortions of what had actually been written, or that the alleged source personally responded, “I never said such a thing” or “This is a misrepresentation of what I meant by this.” Not one to let things go, Vandenberg consulted with many of his colleagues about this state of affairs over the next 5 years. One outcome of these conversations was the realization that this frustration was shared by many others, a revelation and confirmation that “it’s not just I that gets these kinds of comments.” A second outcome of these conversations was some initial understanding of the origins of the UL beliefs. Colleagues repeatedly reported that “I saw this stated in another review and just thought I would use it here too,” “my advisor/professor/respected colleague explained this and its rationale to me,” “I was taught this in my research methods class,” and so on (see Vandenberg, 2006). A third outcome was that the list of apparent “ULs in use” grew and grew from just a few to dozens. As colleagues became more aware of the perpetuation of the UL belief phenomenon, they shared them at conferences and other venues (“hey, I have a new one for you…”). It was around this time that Vandenberg began to use the label “statistical myths and methodological urban legends” to characterize these beliefs. In late 2003 Vandenberg’s colleagues then encouraged him to organize a symposium for the 2004 Academy of Management Conference in New Orleans. The panelists (some of whom provided chapters for this volume) each presented a statistical myth and urban legend, and their presentations were more or less organized around the questions posed above (the kernel of truth, the myth, and present status).

The symposium was originally submitted as a regular symposium for one division of the Academy, but unbeknownst to Vandenberg and panelists, it was eventually accepted as an “All Academy” symposium, meaning that it was deemed of interest to all members. On the day of the symposium, the room we were assigned was a ballroom with approximately 100 chairs set up, taking up about one fourth of the floor space. By the time the symposium started, audience members were dragging in chairs from other ballrooms and there was standing room only in the back. We stopped counting after 400 attendees. In short, our topic resonated well with a very large audience. A subset of the papers from the symposium appeared later as a special feature topic for Organizational Research Methods (ORM, Vandenberg, 2006). One of these (Lance, Butts, & Michels, 2006) was awarded Sage Publication’s Best Paper of the Year Award, and another (Spector, 2006) is one of the top 20 most often read papers in ORM’s history. As of February 28, 2008, the three articles in this series (James, Mulaik, & Brett, 2006; Lance et al., 2006; Spector, 2006) had already been cited 51 times in the PsycINFO database in the 22 months since their publication. Researchers are paying attention.

Why were the symposium and the papers so popular? We hope it is because reviewers, editors, researchers, authors, and graduate students truly want to understand better where these ULs come from, if they’re true, and whether they’re worth perpetuating. A kernel of truth seems to support each of the ULs that the chapters in this book discuss, but some amount of lore seems to accompany each one as well; we have tried to ensure that each chapter sorts out what is what. In some cases the origin of the UL can be traced and in some cases not, but in every case the chapter’s authors offer recommendations for research best practices. As such, the goal of this book is to discuss some of the more widely circulated ULs and to turn that legend into sound research practice. We hope that the chapters in this book influence your research in a positive way.

References

Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Barrett, G. V. (1972). Research models of the future for industrial and organizational psychology. Personnel Psychology, 25, 1–17.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational Research Methods, 9, 233–244.
Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202–220.
Spector, P. E. (2006). Method variance in organizational research: Truth or urban legend? Organizational Research Methods, 9, 221–232.
Vandenberg, R. J. (2006). Introduction. Organizational Research Methods, 9, 194–201.
Part 1 Statistical Issues
1 Missing Data Techniques and Low Response Rates: The Role of Systematic Nonresponse Parameters
Daniel A. Newman
This chapter attempts to debunk two popular misconceptions (or legends) about missing data: Legend #1, low response rates will necessarily invalidate study results; and Legend #2, listwise and pairwise deletion are adequate default techniques, compared with state-of-the-art (maximum likelihood) missing data techniques. After reviewing general missingness mechanisms (i.e., MCAR, MAR, MNAR), the relevance of response rates and missing data techniques is shown to depend critically on the magnitude of two systematic nonresponse parameters (or SNPs: labeled $d_{miss}$ and $f^2_{miss}$). Response rates impact external validity only when these SNPs are large. Listwise and pairwise deletions are appropriate only when these SNPs are very small. I emphasize (a) the need to explicitly identify and empirically estimate SNPs, (b) the connection of SNPs to the theoretical model (and specific constructs) being studied, (c) the use of SNPs in sensitivity analysis to determine bias due to response rates, and (d) the use of SNPs to establish inferiority of listwise and pairwise deletion to maximum likelihood and multiple imputation approaches. Finally, key applications of missing data techniques are discussed, including longitudinal modeling, within-group agreement estimation, meta-analytic corrections, social network analysis, and moderated regression.
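The claim that response rates matter only when nonresponse is systematic can be previewed with a small simulation. The sketch below is an illustration under stated assumptions, not the chapter's own model: the selection rule (a response propensity equal to `dependence` times the score plus noise), the function `respondent_mean`, and all parameter values are hypothetical, and `dependence` is only a rough stand-in for the idea of systematic nonresponse, not the chapter's $d_{miss}$ or $f^2_{miss}$.

```python
import random
import statistics


def respondent_mean(scores, response_rate, dependence, rng):
    """Mean score among respondents when only a fraction of people respond.

    Hypothetical selection model (an assumption of this sketch): each person
    has a response propensity = dependence * score + random noise, and the
    top `response_rate` fraction on propensity return the survey.
    dependence = 0 means nonresponse is unrelated to the score.
    """
    propensity = [dependence * s + rng.gauss(0, 1) for s in scores]
    n_respond = max(1, round(response_rate * len(scores)))
    # Indices of people sorted from most to least likely to respond.
    ranked = sorted(range(len(scores)), key=propensity.__getitem__, reverse=True)
    return statistics.fmean([scores[i] for i in ranked[:n_respond]])


rng = random.Random(2009)  # fixed seed so the sketch is repeatable
population = [rng.gauss(0, 1) for _ in range(20000)]
full_mean = statistics.fmean(population)

# Bias = respondent mean minus the mean of everyone who was surveyed.
bias = {}
for rate in (0.8, 0.3):
    for label, dependence in (("random", 0.0), ("systematic", 1.0)):
        observed = respondent_mean(population, rate, dependence, rng)
        bias[(rate, label)] = observed - full_mean

for (rate, label), value in bias.items():
    print(f"response rate {rate:.0%}, {label} nonresponse: bias = {value:+.3f}")
```

In this toy setup, lowering the response rate from 80% to 30% should leave the respondent mean essentially unbiased when nonresponse is random, and should enlarge the bias only when nonresponse depends on the score being measured.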
Organization of the Chapter

The material that follows is organized into six sections. First, I distinguish three levels of missing data (item level, scale level, and survey level), two problems caused by missing data (bias and low statistical power), and three mechanisms of missing data (MCAR, MAR, and MNAR). Second, I present a fundamental principle of missing data analysis (“use all the available information”) and review four missing data techniques (listwise deletion, pairwise deletion, maximum likelihood, and multiple imputation) in light of this fundamental principle. Third, I introduce two systematic nonresponse parameters (SNPs: dmiss and f²miss) and illustrate how response rate bias depends entirely on the interaction between SNPs and response rates, rather than on response rates alone. Fourth, I present a theoretical model of survey nonresponse, highlighting how SNPs and response rate bias vary with the substantive constructs being studied. Fifth, I use the aforementioned information to redress two popular legends about missing data. Sixth, I review several prominent data-analytic scenarios for which the choice of missing data technique is likely to make a big difference in one’s results.

Levels, Problems, and Mechanisms of Missing Data

Missing data is defined herein as a statistical difficulty (i.e., a partially incomplete data matrix) resulting from the decision by one or more sampled individuals to not respond to a survey or survey item. The term survey nonresponse refers to the same phenomenon, at the level of the individual nonrespondent. Missing data is a problem from the perspective of the data analyst, whereas survey nonresponse is an individual decision made by the potential survey participant. Although nonresponse decisions may vary in how intentional they are (e.g., forgetting about the survey vs. discarding the survey deliberately), the above definition of survey nonresponse assumes that a potential respondent saw the survey invitation and made a de facto choice whether to complete the measures.
Three Levels of Missing Data

The missing data concept subsumes three levels of nonresponse: (a) item-level nonresponse (i.e., leaving a few items blank), (b) scale-level nonresponse (i.e., omitting answers for an entire scale or entire construct), and (c) unit- or survey-level nonresponse (i.e., failure by an individual to return the entire survey). The response rate, which is the ratio of the total number of completed surveys to the number of solicited surveys, is an aggregate index of survey-level nonresponse.

Two Problems Caused by Missing Data (External Validity and Statistical Power)

There are two primary problems that can be caused by low response rates. The first problem is poor external validity (i.e., response rate bias), which in this case means that the results obtained from a subsample of individuals who filled out the survey may not be identical to results that would have been obtained under 100% response rates. In other words, a respondents-based estimate (e.g., respondents-based correlation: rresp) can sometimes be a biased (over- or underestimated) representation of the complete-data estimate (e.g., complete-data correlation: rcomplete). The second problem caused by missing data is low statistical power, which means that, even when there is a true nonzero effect in the population, the sample of respondents is too small to yield a statistically significant result (i.e., Type II error of inference). I clarify that power is a function of the sample size, and not a direct function of response rate. For example, attempting to sample 1,000 employees and getting a 15% response rate yields more statistical power (N = 150) than attempting to sample 200 employees and getting a 60% response rate (N = 120). After controlling for sample size, response rates have negligible effects on power.

Missingness Mechanisms (MCAR, MAR, and MNAR)

Data can be missing randomly or systematically (nonrandomly). Rubin (1976) developed a typology that has been used to describe three distinct missing data mechanisms (see Little & Rubin, 1987):
MCAR (missing completely at random): the probability that a variable value is missing does not depend on the observed data values or on the missing data values. The missingness pattern results from a completely random process, such as flipping a coin or rolling a die.
MAR (missing at random): the probability that a variable value is missing partly depends on other data that are observed in the data set but does not depend on any of the values that are missing.
MNAR (missing not at random): the probability that a variable value is missing depends on the missing data values themselves.
Of the three missingness mechanisms, only MCAR would be considered “random” in the usual sense, whereas MAR and MNAR would be considered “systematic” missingness (note the unusual label, missing at random [MAR], to describe a particular type of systematic missingness). For a helpful example of the MAR and MNAR mechanisms, consider two variables X and Y, where some of the data on variable Y are missing (Schafer & Graham, 2002). Missing data would be MAR if the probability of missingness on Y is related to the observed values of X but unrelated to the values of Y after X is controlled (i.e., one can predict whether Y is missing based on the observed values of X). The data would be MNAR if the probability of missingness on Y is related to the values of Y itself (i.e., related to the missing values of Y). Note that in practice, it is usually considered impossible to determine whether missing data are MNAR, because this would require a comparison of the observed Y values to the missing Y values, and the researcher does not have access to the missing Y values.

Why do missing data mechanisms matter? Missing data mechanisms determine the nature and magnitude of missing data bias and imprecision (see Table 1.1). In general, systematic missingness will lead to greater bias in parameter estimates (e.g., correlations and regression weights) than will completely random missingness. That is, MCAR is harmless in that it does not bias the means, standard deviations, and estimated relationships between variables. Systematic missingness (MAR or MNAR), on the other hand, will often bias parameter estimates.
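The three mechanisms can be made concrete with a small simulation. The sketch below is only an illustration under assumed values (a standard-normal X, Y = 0.5X plus noise, and missingness probabilities of .5, .8, and .2 chosen arbitrarily); the function name `simulate` and all parameters are hypothetical and do not come from the chapter. It compares the complete-data mean of Y with the mean among cases whose Y was observed.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(mechanism, n=50_000, seed=1):
    """Return (complete-data mean of Y, mean of Y among observed cases).

    All numeric settings here are illustrative assumptions.
    """
    rng = random.Random(seed)
    y_all, y_obs = [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = 0.5 * x + rng.gauss(0, 1)          # Y is correlated with X
        if mechanism == "MCAR":
            p_missing = 0.5                     # ignores both X and Y
        elif mechanism == "MAR":
            p_missing = 0.8 if x < 0 else 0.2   # depends on observed X only
        elif mechanism == "MNAR":
            p_missing = 0.8 if y < 0 else 0.2   # depends on Y itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        y_all.append(y)
        if rng.random() >= p_missing:           # Y is observed for this case
            y_obs.append(y)
    return mean(y_all), mean(y_obs)


for mech in ("MCAR", "MAR", "MNAR"):
    full, observed = simulate(mech)
    print(f"{mech}: complete-data mean = {full:+.3f}, observed-case mean = {observed:+.3f}")
```

Under these assumptions the observed-case mean should track the complete-data mean only in the MCAR condition; in the MAR and MNAR conditions the cases with low Y are underrepresented, so a mean computed from observed cases alone drifts upward.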
Table 1.1 Parameter Bias and Statistical Power Problems of Common Missing Data Techniques

Missing Data Technique | MCAR | MAR | MNAR
Listwise deletion | Unbiased, low power | Biased, low power | Biased, low power
Pairwise deletion | Unbiased, inaccurate power | Biased, inaccurate power | Biased, inaccurate power
Maximum likelihood | Unbiased, accurate power | Unbiased, accurate power | Biased, accurate power
Multiple imputation | Unbiased, accurate power | Unbiased, accurate power | Biased, accurate power

Note. Columns are missingness mechanisms. Recommended techniques (boldface in the original table) are maximum likelihood and multiple imputation.
Missing Data Treatments

A Fundamental Principle of Missing Data Analysis

Across missing data conditions, the best data-analytic methods for dealing with missing data follow a simple yet fundamental principle: use all of the available data. This principle characterizes all of the recommended missing data techniques shown in Table 1.2. However, the principle is not found in many of the more commonly applied missing data techniques, such as listwise and pairwise deletion. In general, item-level nonresponse can be redressed through meanitem imputation (Roth, Switzer, & Switzer, 1999), meaning that a researcher can average across the subset of scale items with available responses to calculate a scale score. This approach works especially well when scale items are essentially parallel. Unfortunately, there is a relatively common practice of setting an arbitrary threshold number of items that must be completed in order to calculate a scale score (e.g., if 4 or more items from an 8-item scale are complete, then those items can be averaged into a scale score; otherwise, set the respondent’s scale score to “missing”). Setting such an arbitrary threshold violates the fundamental principle of missing data analysis, because it throws away real data from the few items that were completed. Dropping an entire scale from analysis simply because some of its items were omitted will typically produce worse biases, in comparison
Table 1.2 Three Levels of Missing Data and Their Corresponding Missing Data Techniques

Level of Missing Data | Recommended Missing Data Technique | Favorable Condition for Technique
Item-level | Use meanitem imputation. | Essentially parallel items
Scale-level | Use maximum likelihood (ML) or multiple imputation (MI). | Probability of missingness is correlated with observed variables (i.e., MAR mechanism)
Survey-level (i.e., person-level, as reflected in overall response rate) | Use systematic nonresponse parameters (dmiss and f²miss). | Data are available from previous studies that compare respondents to nonrespondents on the constructs of interest (i.e., local dmiss and f²miss can be estimated)
to assuming that the few completed items appropriately reflect the scale score. Next, scale-level nonresponse can be treated through maximum likelihood or multiple imputation techniques (ML and MI techniques; Dempster, Laird, & Rubin, 1977; Enders, 2001; Schafer, 1997), in which a researcher estimates the parameters of interest (e.g., correlations, regression weights) using a likelihood function (or alternatively using a Bayesian sampling distribution) based on observed data from all of the measured variables. (ML and MI will be discussed in more detail later.) In other words, if a respondent omits an entire scale, then using ML or MI techniques to recover the parameter estimates will typically produce less bias than using ad hoc techniques, such as listwise deletion, pairwise deletion, and single imputation (Newman, 2003). ML and MI techniques work especially well when missing data are systematically missing according to the common MAR mechanism. Finally, survey-level nonresponse, in which the entire survey is not returned, can be addressed using nonlocal meta-analytic estimates that describe respondent-nonrespondent differences on the constructs of interest. These respondent-nonrespondent differences are captured by two SNPs, labeled dmiss and f²miss. The use of SNPs to address survey-level missingness (i.e., low response rates) is a primary focus of this chapter. SNPs are particularly useful for addressing
the response rate issue, because some of the more-developed missing data approaches (e.g., ML and MI) are not currently capable of addressing survey-level (i.e., person-level) nonresponse, in which the data set contains absolutely no data on the nonrespondents. For handling survey-level nonresponse (i.e., low response rates), SNP methods reflect an attempt to use all of the available data (including nonlocal data on respondent-nonrespondent differences).

Missing Data Techniques (Listwise and Pairwise Deletion, ML, and MI)

Table 1.1 summarizes relationships between the missingness mechanisms (MCAR, MAR, MNAR) and parameter estimation bias. As seen in Table 1.1, the problems attributable to different mechanisms of missingness (i.e., missing data bias and low statistical power) depend on the missing data technique that is used. Four missing data techniques are covered here: listwise deletion, pairwise deletion, maximum likelihood (ML), and multiple imputation (MI). Listwise deletion involves analyzing data exclusively from individuals who provide complete data for all of the variables surveyed (i.e., partial respondents’ data are discarded). Pairwise deletion involves estimating correlations between two variables (X and Y) using all of the respondents who reported data for both X and Y (i.e., ignoring data from respondents who did not report on both X and Y). ML and MI approaches both involve estimating the relevant parameters (e.g., correlations, regression weights) by using all of the available data on all of the variables from all of the respondents, regardless of partial data incompleteness. For example, ML and MI techniques estimate the correlation between two variables (X and Y) while accounting for the linear dependencies of X’s and Y’s missingness on the observed values of X, Y, Z, Q, and all other variables in the observed data set (see Enders, 2001, for a lengthier description of ML and MI techniques).
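The contrast between listwise deletion and a likelihood-based estimate can be sketched for the simplest possible case: two bivariate-normal variables where X is always observed and Y is missing as a function of X (a MAR pattern). The code below is a minimal illustration, not the FIML or EM routines that statistical packages provide. It uses the textbook factored-likelihood result for this special case (X moments from all cases, the regression of Y on X from complete cases), and the true correlation of .5, the sample size, and the rule "Y is observed only when X < 0" are assumptions chosen for the example. The function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def compare_estimates(rho=0.5, n=20_000, seed=7):
    """Return (listwise r, ML r) for simulated MAR data; settings are assumptions."""
    rng = random.Random(seed)
    x_all, x_cc, y_cc = [], [], []
    for _ in range(n):
        x = rng.gauss(0, 1)
        y = rho * x + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1)
        x_all.append(x)
        if x < 0:  # MAR: Y is observed only when the always-observed X is below 0
            x_cc.append(x)
            y_cc.append(y)

    # Listwise deletion: correlation among complete cases only.
    r_listwise = pearson_r(x_cc, y_cc)

    # Factored-likelihood ML: regression of Y on X from complete cases,
    # combined with the variance of X estimated from all cases.
    m = len(x_cc)
    mx_cc, my_cc = sum(x_cc) / m, sum(y_cc) / m
    sxx_cc = sum((x - mx_cc) ** 2 for x in x_cc)
    slope = sum((x - mx_cc) * (y - my_cc) for x, y in zip(x_cc, y_cc)) / sxx_cc
    resid_var = sum(
        (y - my_cc - slope * (x - mx_cc)) ** 2 for x, y in zip(x_cc, y_cc)
    ) / m
    mx_all = sum(x_all) / n
    var_x = sum((x - mx_all) ** 2 for x in x_all) / n
    var_y = slope ** 2 * var_x + resid_var
    r_ml = slope * math.sqrt(var_x) / math.sqrt(var_y)
    return r_listwise, r_ml


r_listwise, r_ml = compare_estimates()
print(f"listwise r = {r_listwise:.3f}, ML r = {r_ml:.3f}, true r = 0.500")
```

In this setup listwise deletion analyzes a range-restricted subsample and should understate the correlation, while the ML estimate, which also uses the X values of the incomplete cases, should land near the generating value.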
As seen in Table 1.1, listwise and pairwise deletion are unbiased only when data are MCAR, whereas ML and MI techniques are unbiased under both MCAR and MAR conditions. This is why ML and MI approaches have been advocated as generally superior to listwise and pairwise deletion (Graham, Cumsille, & Elek-Fiske, 2003; Little & Rubin, 2002; Schafer & Graham, 2002). ML and MI techniques (e.g., FIML, EM algorithm, and multiple imputation; now available
in most statistical packages) perform well under MAR because they use all the available data to estimate parameters, whereas ad hoc techniques (e.g., listwise deletion) discard or ignore some of the available data. As for statistical power, I note that missing data reduce power regardless of the missingness mechanism. Some missing data techniques, however, are far worse than others when it comes to power. Listwise deletion typically will be far less powerful than other missing data techniques (Table 1.1), because listwise deletion discards all data from partial respondents, thereby greatly reducing sample size. Pairwise deletion, in contrast, suffers from its inability to account for the differential sample sizes across correlation estimates (Marsh, 1998). Although some correlations are based on more data than others (i.e., some correlations have more power than others), pairwise deletion uses a single sample size to estimate all the standard errors, providing overestimates of power for some parameters and underestimates for others (Newman, 2003). This problem is avoided under full information maximum likelihood (FIML) and MI approaches, which use the more appropriate standard errors for each estimate (and therefore give accurate estimates of statistical power). Finally, there are currently few if any available missing data techniques that perform well under the common scenario of MNAR missingness (see Collins, Schafer, & Kam, 2002; Newman, 2003). This is the context within which SNPs are introduced, as a way of characterizing respondent-nonrespondent differences, which can be used to better understand and deal with response rate bias (resulting from the MNAR mechanism).

Systematic Nonresponse Parameters (dmiss and f²miss)
In this chapter, I propose a way to index the nature and magnitude of missingness mechanisms. It is suggested that, for any given variable that a researcher is interested in studying, SNPs can be estimated that characterize the differences between respondents and nonrespondents on the constructs of interest. Two such nonresponse parameters are the focus here: dmiss and f²miss.

The parameter dmiss is defined as the standardized respondent-nonrespondent mean difference on a variable
[i.e., $d_{miss} = (\bar{X}_{non} - \bar{X}_{resp}) / s_{pooled}$] (Newman & Sin, in press). In other words, if individuals with low job satisfaction are less likely to respond to a job satisfaction survey (Rogelberg, Conway, Sederburg, Spitzmuller, Aziz, & Knight, 2003), then dmiss will be negative. A nonzero dmiss suggests that missing data on a job satisfaction survey are missing systematically (MNAR), whereas dmiss = 0 suggests that the missingness mechanism is completely random (MCAR). Also, when dmiss is large and negative, paying attention to the respondents only will lead to an upward bias in estimates of mean job satisfaction, where the bias increases in magnitude as response rates drop. So the SNP dmiss is a useful way of describing the extent to which missingness is systematic (not random) for a particular variable, and it also determines the extent to which a parameter estimate (in this case, the mean) is biased by low response rates. The relationships among dmiss, response rate, and missing data bias in estimated means are illustrated in Figure 1.1a. In Figure 1.1a, we see that, when dmiss is negative, the respondent-based mean is an overestimate of the complete-data mean. Further, this positive bias increases as response rates fall (e.g., at dmiss = –.4, the mean is overestimated by 11.8% when the response rate is 10%). Importantly, when dmiss = 0, there is no missing data bias in the mean, regardless of the response rate. That is, low response rates threaten external validity (i.e., lead to missing data bias) only to the extent the SNP (dmiss) is large. Next, Figure 1.1b shows how the relationship between bias and response rate for the standard deviation (SD) also depends entirely on dmiss. There is a negative response rate bias in SD (i.e., an underestimation of SD) that increases nonlinearly as response rates drop.
At dmiss = –.4 and response rate = 10%, the SD is underestimated by 15.3% (see Newman & Sin, in press, for derivation of formulae that produced Figures 1.1a and 1.1b). A second systematic nonresponse parameter, f²miss, is defined as the standardized respondent-nonrespondent difference in the relationship between two variables, X and Y. This parameter can be thought of as an effect size for a categorical moderator of response status (see Appendix for derivation). When “f²miss(+)” is large, it means the correlation between X and Y among nonrespondents is larger than the XY correlation for respondents (and when “f²miss(−)” is large, the
Figure 1.1 (a) Response rate bias in the mean. (b) Response rate bias in the standard deviation. (c) Response rate biases in the correlation. Note. Mean bias evaluated at X̄resp = 4; correlation bias at rresp = .3; dmiss_x = dmiss_y = −.4.
[Panels plot % bias in mean, % bias in SD, and % bias in correlation against response rate (%). Panels a and b show curves for d = 0, −0.2, −0.4, −0.6, and −0.8; panel c shows curves for f²(−) = 0.008, f²(−) = 0.004, f² = 0, f²(+) = 0.004, and f²(+) = 0.008.]
nonrespondent correlation is smaller than the respondent correlation). In Figure 1.1c, we see that at f²miss(+) = .004, dmiss_x = dmiss_y = −.4, and response rate = 10%, the XY correlation is underestimated by 41.6% due to missing data.
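Why the bias in the mean vanishes at dmiss = 0 and grows as the response rate falls can be seen from a short piece of algebra. The block below is a sketch only: it writes the complete-data mean as a weighted average of respondent and nonrespondent means and substitutes the definition of dmiss. The symbol RR (response rate as a proportion) and the choice to express percent bias relative to the complete-data mean are assumptions introduced here, and the formulae that actually produced Figure 1.1 are those of Newman and Sin (in press), which may be parameterized differently and need not give the same numerical values.

```latex
\begin{aligned}
\bar{X}_{complete} &= RR\,\bar{X}_{resp} + (1 - RR)\,\bar{X}_{non}
  && \text{(weighted average of the two groups)} \\
                   &= \bar{X}_{resp} + (1 - RR)\,d_{miss}\,s_{pooled}
  && \text{(since } \bar{X}_{non} = \bar{X}_{resp} + d_{miss}\,s_{pooled}\text{)} \\[4pt]
\%\,\text{bias in mean} &= 100 \times \frac{\bar{X}_{resp} - \bar{X}_{complete}}{\bar{X}_{complete}}
  = 100 \times \frac{-(1 - RR)\,d_{miss}\,s_{pooled}}{\bar{X}_{resp} + (1 - RR)\,d_{miss}\,s_{pooled}}
\end{aligned}
```

Read this way, the bias is zero whenever dmiss = 0 or RR = 1, is positive when dmiss is negative, and grows in magnitude as RR shrinks, which matches the qualitative pattern described for Figure 1.1a.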
As can also be observed in Figure 1.1 (panels a, b, and c), there is no magical response rate below which an observed mean, standard deviation, or correlation becomes automatically invalid. Further, for a given, arbitrary amount of “tolerable” bias (say 10%), the corresponding response rate that produces this amount of bias depends entirely on the SNPs (dmiss and f²miss).

To help the reader in gauging the representativeness of the range of values presented in Figure 1.1, we summarize empirical estimates of SNPs (dmiss and f²miss) as found in previous studies of nonrespondents (see Table 1.3). The estimates in Table 1.3 are taken from nonrespondent studies that employed two types of designs: (a) follow-up studies that tracked down nonrespondents after they were observed to not respond (e.g., Rogelberg et al., 2003), and (b) studies based on self-reported response behavior to past surveys and intentions to respond to future surveys (Rogelberg et al., 2000). As shown in Table 1.3, estimates of dmiss that are based on respondent self-reported intentions toward future survey responding (as well as self-reported retrospective histories of responding) offer large overestimates of dmiss when compared to the dmiss values obtained from observing actual response behavior (e.g., for the construct “satisfaction with management”: dmiss = –.59 for self-reported survey response, but dmiss = –.15 for actual, observed response behavior). The largest dmiss estimate for actual response behavior involved the construct of “procedural justice” (dmiss = –.44), suggesting that employees are much less likely to respond to a survey solicited by a company they believe has treated them unfairly. The important message of Table 1.3 is that systematic nonresponse parameters (dmiss and f²miss) vary depending on the psychological constructs that are being studied.
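The earlier point that the response rate associated with a given “tolerable” bias depends on the SNP can be sketched numerically for the mean. The code below assumes that the complete-data mean equals the respondent mean plus (1 − response rate) × dmiss × s_pooled, which follows from treating the complete sample as a mix of respondents and nonrespondents; it further assumes a respondent mean of 4 and s_pooled = 1. The function names and defaults are hypothetical, and because the chapter's figures rest on the Newman and Sin (in press) formulae, the numbers printed here are illustrative and are not expected to reproduce Figure 1.1.

```python
def mean_bias_pct(d_miss, response_rate, mean_resp=4.0, s_pooled=1.0):
    """Percent bias of the respondent mean relative to the complete-data mean.

    Assumes complete mean = mean_resp + (1 - response_rate) * d_miss * s_pooled.
    """
    mean_complete = mean_resp + (1 - response_rate) * d_miss * s_pooled
    return 100 * (mean_resp - mean_complete) / mean_complete


def response_rate_at_bias(d_miss, tolerable_pct, mean_resp=4.0, s_pooled=1.0):
    """Response rate (proportion) at which |bias| reaches tolerable_pct.

    Returns None when that much bias is never reached at any response rate.
    """
    if d_miss == 0:
        return None  # no bias in the mean at any response rate
    # A negative d_miss inflates the respondent mean (positive bias), and vice versa.
    t = tolerable_pct / 100 if d_miss < 0 else -tolerable_pct / 100
    rate = 1 + mean_resp * t / ((1 + t) * d_miss * s_pooled)
    return rate if rate > 0 else None


for d in (0.0, -0.2, -0.4, -0.8):
    threshold = response_rate_at_bias(d, tolerable_pct=10)
    label = "never reached" if threshold is None else f"{100 * threshold:.1f}%"
    print(f"d_miss = {d:+.1f}: 10% bias in the mean is reached at response rate {label}")
```

The design choice worth noting is that the threshold is solved for rather than fixed in advance: under these assumptions a small dmiss never produces 10% bias even at very low response rates, whereas a large dmiss does so at response rates that would usually be considered respectable.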
Theory of Survey Nonresponse

Although survey nonresponse is often thought of as a methodological problem, it can also be connected to substantive, theoretical concepts. The individual decision to respond (or not respond) to a survey is a behavioral construct, which results from underlying attitudes, motives, dispositions, and norms. As with research on absenteeism (Martocchio & Harrison, 1993), studies of nonresponse behavior face the difficulty of modeling what individuals are not doing, rather than what they are actually doing. Rogelberg et al. (2000) described
Table 1.3 Empirical Estimates of Systematic Nonresponse Parameters (dmiss and f²miss)

Construct | dmiss (a) | dmiss (b) | f²miss (a): Satisfaction (with Management) | f²miss (a): Turnover Intentions | f²miss (a): Agreeable
Organizational commitment | — | –.59 (183) | — | — | —
Job satisfaction | — | –.62 (182) | — | — | —
Satisfaction (work) | — | –.68 (183) | — | — | —
Satisfaction (pay) | — | –.13 (183) | — | — | —
Satisfaction (promotion) | — | –.24 (183) | — | — | —
Satisfaction (management/supervision) | –.15 (399) | –.59 (180) | 0 | — | —
Turnover intentions | .13 (399) | .60 (181) | .0028(+) | 0 | —
Agreeableness | –.35 (399) | — | .0027(–) | .0042(–) | 0
Conscientiousness | –.38 (399) | — | .0014(+) | .0074(–) | .0096(+)
Procedural justice | –.44 (608) | — | — | — | —
Perceived organizational support | –.13 (608) | — | — | — | —

Note. All estimates uncorrected. Corresponding sample sizes (N) in parentheses.
(a) Based on actual response behavior (Rogelberg et al., 2001; Spitzmuller et al., 2006); estimates compare respondents to pooled active-intentional and passive-unintentional nonrespondents.
(b) Based on self-rated response intentions and retrospective response reports only (Rogelberg et al., 2000).
response to at-work surveys as an organizational citizenship behavior, and research consistent with this idea shows that nonrespondents have lower average job satisfaction, organizational commitment, conscientiousness, agreeableness, and intentions to remain with the company (see Table 1.3). In developing a Theoretical Model of Survey Nonresponse, I focus on predictors at multiple levels of analysis. That is, individual nonresponse behavior may theoretically result from individual attributes (e.g., dissatisfaction), group attributes (e.g., group trust and
Figure 1.2 Theoretical model of survey nonresponse. Note. Dotted lines represent negative relationships. Light gray boxes are Theory of Planned Behavior constructs. Dark gray boxes are methodological choices under the researcher’s control.
[Model elements shown in the figure: Risk Perception (Anonymity & Sensitivity of Information); Reciprocity Norms; Incentives: Social & Economic; Organizational & Cultural Norms; Invitation Content: Personal, Polite, Advance Notice, Explains Purpose; Attitude Toward Surveying Entity; Length of Survey; Available Time; Perceived Response Norms & Obligations; Attitude Toward Responding; Perceived Control/Capability to Respond; Follow-up Reminders; Response Intentions; Survey Nonresponse; Conscientious Personality.]
support), and organizational and cultural attributes (e.g., company norms for survey participation, or Dillman’s [1978] cultural norms of willingness to do a small favor for a stranger who asks you to fill out a survey). According to the Theory of Planned Behavior (Ajzen, 1988), a behavior such as survey nonresponse will be predicted by (a) favorable or unfavorable attitudes toward responding to the survey at hand, (b) subjective norms reflecting whether important referent others would likely respond to the survey, and (c) perceived confidence in one’s capability to respond to the survey. These three antecedents (attitudes, norms, and perceived control) influence survey response behavior through a causal mechanism of survey response intentions (see Figure 1.2; cf. Rogelberg et al., 2000). Onto this Theory of Planned Behavior model for survey nonresponse, I have overlain several antecedents and moderating conditions, including some proactive steps a researcher can take to increase response rates (see Figure 1.2). Past research has highlighted several design features that help in securing higher response rates (see dark gray boxes in Figure 1.2; largely consistent with Dillman, 1978; Fox, Crask, & Kim, 1988; Roth & BeVier, 1998; Yammarino, Skinner, & Childers, 1991; Yu & Cooper, 1983). This research shows survey response rates are higher when participants are given advance notice, the survey is personalized,
follow-up reminders are sent, and monetary incentives are offered. However, not all these techniques are equally effective. Below, I briefly summarize distinctions among techniques and speculate on their theoretical mechanisms. In Roth and BeVier’s (1998) integrative meta-analysis, response rates were most strongly affected by survey invitation factors (i.e., advance notice, more personalized [nonmailed] survey distribution, and distribution within one’s own company [rather than across many companies]). Follow-up reminders (e.g., postcards) had a smaller but still important unique effect on response rates. I conjecture that follow-up survey reminders offer additional opportunities for response intentions to be converted into actual response behavior (Figure 1.2). That is, follow-up reminders do not directly act to generate response intentions; rather, they simply provide more chances to manifest these intentions. (The importance of distinguishing response intentions from actual response behavior is illustrated in the first two columns of Table 1.3.) Contrary to popular belief, survey length had only a meager effect on response rates (Roth & BeVier, 1998). I explain this by suggesting that survey length is moderated by individual differences in available time to complete surveys (Figure 1.2). Also, survey length may have a nonlinear association with response intentions, such that potential respondents lose interest after about 4 pages (Yammarino et al., 1991), although the exact threshold for length is unknown. Monetary incentives for survey participation have their basis in exchange theory (Foa & Foa, 1980). Contrary to previous research (Yammarino et al., 1991), Roth and BeVier (1998) showed that monetary incentives may have virtually no effect on response rates to organizational surveys.
I suggest that monetary incentives rely on reciprocity norms (Gouldner, 1960) in order to change response intentions (Figure 1.2) and thus may not uniformly result in more responses. Finally, norms for survey response can be made more salient when participants are placed at risk, due to sensitive content of the survey questions or perceived lack of confidentiality. Roth and BeVier (1998) showed that when anonymity is compromised, survey response rates actually increase substantially (probably due to fear of reprisal for nonparticipation). Despite the fact that compromising anonymity increases response rates, doing so violates research ethics and should therefore be staunchly avoided—survey response must be voluntary.
Why is a Theoretical Model of Survey Nonresponse (Figure 1.2) important for choosing a missing data strategy or, for that matter, for determining whether a given study’s response rate is “too low”? The answer is straightforward: Figure 1.2 gives rise to the SNPs (dmiss and f²miss). Stated differently, nonresponse behavior is related to many social and psychological variables. For example, the Figure 1.2 box labeled “attitude toward the surveying entity” includes such concepts as organizational commitment and procedural justice, which have been shown to differ between respondents and nonrespondents (Table 1.3). The reason missing data can bias results of research studies is that the concepts being studied are related to individual survey response decisions. If we assume that a single cutoff response rate (e.g., below 20%) applies to all studies, regardless of the constructs being studied, then we have ignored Figure 1.2 and assumed nonresponse is related equally to all constructs. But, as shown in Table 1.3 and Figure 1.1, SNPs (a) vary across constructs being studied and (b) directly determine the extent of nonresponse bias. The above facts are useful in debunking two popular missing data legends, as explained below.

Missing Data Legends

Legend #1: “Low Response Rates Invalidate Results”

As with most legends, the above statement contains a kernel of truth: As response rates decrease, results calculated from respondents only will (a) increasingly suffer from Type II error (low power) and (b) increasingly threaten bias in estimated means, standard deviations, and correlations, conditional upon the systematic missingness mechanism. The first myth associated with this kernel of truth is that it is possible to define heuristic response rates (e.g., 20%) below which results automatically fail to generalize.
A related, false belief is that all nonresponse is the same; that is, results from a study with a 40% response rate are more valid than results from a study with a 15% response rate (without explicitly considering the constructs and magnitude of substantive missingness mechanisms [dmiss and f²miss]). To debunk this legend, I note that low response rates create no bias when data are MCAR. Likewise, low response rates often create only modest biases when data are missing systematically (MNAR).
Further, these biases depend entirely on the SNPs (dmiss and f²miss; see Figure 1.1). Finally, the issue of low statistical power is really an issue of respondent sample size (N) and not a response rate issue per se. As such, power-based criticisms of low-response-rate studies should focus on sample size and not on the response rate itself. A third, related myth is that response rates are a methodological issue only and are unrelated to the theory being tested. In fact, the response rate problem is an explicit function of the SNPs (dmiss and f²miss) that correspond to the specific constructs being studied. Studies on topics like conscientiousness and procedural justice perceptions will be far more affected by response rates, in comparison to studies on satisfaction and turnover intentions (Table 1.3). Nonresponse is a behavioral indicator of one or more latent constructs, and these constructs can be substantive forces in empirical models, to varying degrees.
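The claim that power tracks respondent sample size rather than response rate can be checked with the chapter's earlier example (1,000 invitations at a 15% response rate versus 200 invitations at a 60% response rate). The sketch below uses the standard Fisher z approximation for the power of a two-tailed test of a zero correlation; the population correlation of .20 and alpha of .05 are assumptions picked for illustration, and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_power_r(rho, n, alpha=0.05):
    """Approximate two-tailed power for testing H0: rho = 0 (Fisher z approximation)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic under the assumed population rho.
    shift = abs(math.atanh(rho)) * math.sqrt(n - 3)
    return NormalDist().cdf(shift - z_crit) + NormalDist().cdf(-shift - z_crit)


# Assumed population correlation, used only for illustration.
ASSUMED_RHO = 0.20

for invited, rate in ((1000, 0.15), (200, 0.60)):
    n_respondents = round(invited * rate)
    power = approx_power_r(ASSUMED_RHO, n_respondents)
    print(
        f"invited = {invited}, response rate = {rate:.0%}, "
        f"N = {n_respondents}, approximate power = {power:.2f}"
    )
```

Because the response rate enters only through the number of respondents, the study with the lower response rate but larger N comes out with the higher approximate power under these assumptions.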
What Should the State of Practice Be?

Rather than relying on the above legend to parse studies into “inadequate” versus “adequate” categories based on their response rates, there may be another, more graduated and empirical, approach. The first step in understanding response rate bias is to identify SNPs germane to the model being tested in a particular study (i.e., dmiss and f²miss for each construct or pair of constructs). Empirical estimates of these nonresponse parameters can be sought in the extant literature, especially from studies using follow-up designs that solicit information from initial nonrespondents (see Rogelberg et al., 2003, for a review of such designs). Ultimately, researchers can meta-analyze dmiss and f²miss across many primary follow-up studies, in order to more precisely estimate the local respondent-nonrespondent differences. With basic information about dmiss and f²miss, the researcher can then conduct a sensitivity analysis to determine the response rate at which inferences break down, given the data set at hand and the SNPs identified.

Take the following example. In a single-sample empirical study, we want to test whether the effect of conscientiousness on turnover intentions is mediated by job satisfaction. The mediation model is conscientiousness (C) → satisfaction (S) → turnover intentions (T). Let the respondent-based correlation matrix be rCS = .20 (Judge, Heller, & Mount, 2002), rCT = –.14 (Zimmerman, 2006), and rST = –.48 (Tett & Meyer, 1993). Assume the number of respondents for this sample is N = 200, but the response rate is only 10%. Our objective is to calculate a Sobel (1982) test for the indirect effect of conscientiousness on turnover, via satisfaction, i.e.,

$$\text{Sobel } z = \frac{\beta_{CS}\,\beta_{ST}}{\sqrt{\beta_{CS}^{2}\,SE_{\beta_{ST}}^{2} + \beta_{ST}^{2}\,SE_{\beta_{CS}}^{2}}}.$$

(Note that $\beta_{CS} = r_{CS}$, $SE_{\beta_{CS}} = \sqrt{(1 - r_{CS}^{2})/(N - 2)}$, $\beta_{ST} = (r_{ST} - r_{CT}\,r_{CS})/(1 - r_{CS}^{2})$, and $SE_{\beta_{ST}} = \sqrt{(1 - R^{2})/[(N - 3)(1 - r_{CS}^{2})]}$.) After running the Sobel test on this sample, we find that Sobel z = 1.97 (p < .05), indicating a statistically significant indirect effect of conscientiousness on turnover intentions, mediated by satisfaction.

Figure 1.3 Response rate bias in indirect effect ($\beta_{CS}\beta_{ST}$).
[Plot of % bias for the indirect effect against response rate (%).]

Now, suppose a reviewer of the above study offers the following criticism: “With a response rate of only 10%, your observed positive result could very likely be due to missing data bias.” Such critical claims are commonplace but are founded on particular assumptions about the underlying pattern of nonresponse parameters, dmiss and f²miss. That is, low response rates can lead to either overestimation or underestimation of the mediated effect, depending on dmiss and f²miss. The corresponding empirical estimates of dmiss for this mediation analysis example can be found in the first column of Table 1.3, and the needed f²miss parameter estimates can be found in columns 3 and 4 of Table 1.3. Using the above formulae and the formula for $\hat{r}_{xy,complete}$ from the Appendix, we get Figure 1.3. What Figure 1.3 shows is that, given the available empirical evidence for dmiss and f²miss involving the constructs of conscientiousness, satisfaction, and turnover intentions (Table 1.3), at 10% response rates the indirect effect $\beta_{CS}\beta_{ST}$ is likely to be underestimated by 34.9%. If the response rate had been higher, then the observed effect size would have been
larger (not smaller) due to response rate bias, and N would have also been larger. Therefore, Sobel z would have been much larger (not smaller) at higher response rates.

At this point, a caveat is in order: the $d_{miss}$ and $f^2_{miss}$ estimates found in Table 1.3 are too tentative as yet to support a universal call for response rate corrections. Rather, I recommend a more limited use of SNPs, as follows. When a critic proposes, in the absence of supportive data, that an observed sample effect is positively biased due to low response rates, prior empirical estimates of respondent-nonrespondent differences should be brought to bear on the question. If prior $d_{miss}$ and $f^2_{miss}$ estimates suggest that the observed effect is unbiased or downwardly biased by nonresponse (see example above), then the low response rate is no longer a legitimate criticism of the study’s conclusions. To restate, under the MNAR mechanism (i.e., when $d_{miss}$ or $f^2_{miss}$ is nonzero), the appropriate analytic strategy is to conduct a sensitivity analysis to see whether the obtained result can be explained away by known systematic nonresponse biases (see Table 1.2). This strategy follows the fundamental principle of missing data analysis: Use all of the available data (including nonlocal data on respondent-nonrespondent differences).

Legend #2: “When in Doubt, Use Listwise or Pairwise Deletion”

This belief also contains a (very small) kernel of truth: Listwise and pairwise deletion are unbiased techniques, but only when data are missing completely at random (MCAR; Table 1.1). The first myth associated with this kernel of truth is simply, “If one does not know the systematic missingness mechanism, it is OK to assume missingness is completely random.” This myth equates ignorance of systematic biases with absence of systematic biases.
The myth is debunked by Table 1.3, which shows that commonly studied psychological constructs (e.g., attitudes, personality) are subject to sizable respondent-nonrespondent differences.

A second and related myth is, “Missing data techniques that have been most used in the past are the best ones to use in the future.” This myth equates the familiarity/popularity of a technique with the accuracy/robustness of the technique. This (flawed) line of thinking is consistent with a Darwinian model of research methods (only the strongest methods survive over time). Perhaps a truer model of research methods is the
convenience model (only the easiest methods survive). Also, there is a tendency for students and professors to learn which methods are appropriate through imitation of what appears in scholarly journals. (Top journal articles in psychology and management still typically employ listwise and pairwise deletion.) Although this imitation strategy can sometimes enable helpful diffusion of methodological innovations, it also stymies progress by reinforcing the dominant methodological paradigm. There is a further technological element of resistance to methodological change, as revealed by the lack of availability of modern missing data techniques in popular statistical software packages (e.g., for many years lagging the development of ML and MI approaches, SPSS software offered only listwise and pairwise deletion options).

A third myth surrounding Legend 2 is that ML and MI approaches are based on shaky assumptions, compared with listwise and pairwise deletion. Although it is true that the ML approach was derived under the assumption of multivariate normality, listwise and pairwise deletion are ad hoc approaches, with no strong statistical basis at all. Departures from multivariate normality do not harm ML estimates as much as they harm estimates from ad hoc approaches (Gold & Bentler, 2000), and corrections are being developed to help the ML approaches become even more robust to nonnormality (see Gold, Bentler, & Kim, 2003). When it comes to comparing ML estimates against listwise and pairwise deletion, it is the deletion techniques that are founded on shaky assumptions (i.e., the MCAR assumption; Table 1.1).

What Should the State of Practice Be?

Researchers and editors should begin by understanding that, short of achieving 100% response rates (which may be unethical), one must choose a missing data technique. Listwise and pairwise deletion are no safer or more natural than ML and MI techniques.
Whether one uses listwise, pairwise, or ML techniques, the choice must be based on weighing the pros and cons of each technique. When weighing the pros and cons, ML and MI techniques are always as good as (under MCAR), and usually better than (under MAR), listwise and pairwise deletion, on the criteria of obtaining unbiased parameter estimates and accurate standard errors (Newman, 2003). When results from an ML or MI missing data technique differ from results obtained through an ad hoc procedure (e.g., listwise
deletion, pairwise deletion, mean imputation), then the burden of proof should be placed on the ad hoc technique, not the state-of-the-art technique. That is, ML and MI techniques were designed to provide superior parameter and standard error estimates under a wider range of conditions than listwise and pairwise deletion can handle (summarized in Table 1.1). A biased approach (e.g., listwise or pairwise deletion) should not be used to “double-check” the accuracy of a less-biased approach (ML or MI). Further, maximum likelihood (EM algorithm, FIML) and multiple imputation approaches can now be variously implemented in SAS, SPSS, LISREL, Mplus, and other popular software packages. The number of good excuses for using listwise and pairwise deletion is quickly shrinking.

Applications

Longitudinal Modeling

When sampling the same individuals across time points, a large portion of the missing data comes from attrition, or dropouts. Interestingly, dropouts are usually MAR (i.e., a dropout’s missing scores on X and Y at Time 2 are correlated with her/his observed scores on X and Y at Time 1). The propensity for MAR mechanisms in longitudinal designs gives ML and MI approaches a major advantage over ad hoc techniques (Table 1.1; Newman, 2003). Longitudinal designs are also sensitive to compounded missingness. If the response rate is 60% at each wave of measurement, the compounded response rate is $\text{Response Rate}_{compounded} = (.60)^{W} = 21.6\%$, where W = 3 waves (Newman, 2004). Also, when the response rate rises over consecutive waves (e.g., 40% response rate for first wave, then 80% response rates in subsequent waves), missing data can create a regression-to-the-mean phenomenon, resulting in upward bias in estimated slopes of growth models (Newman, 2004). For longitudinal studies, it is important to continually attempt to sample those who dropped out from earlier waves.
Finally, longitudinal designs hold a special role in the study of SNPs ($d_{miss}$ and $f^2_{miss}$), because they enable the estimation of respondent-nonrespondent differences (see Rogelberg et al., 2003). That is, one way to estimate $d_{miss}$ and $f^2_{miss}$ is to compare Time 2 respondents versus Time 2 nonrespondents, based on their responses from Time 1.
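The two computations described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the chapter's own procedure: the function names are hypothetical, the pooled-standard-deviation form of $d_{miss}$ and its sign convention (nonrespondents minus respondents) are assumptions, and the Time 1 scores are invented.

```python
import math
import statistics


def compounded_response_rate(per_wave_rate, waves):
    """Share of the original sample expected to respond at every wave,
    assuming response at each wave is independent of the other waves."""
    return per_wave_rate ** waves


def d_miss(time1_nonrespondents, time1_respondents):
    """Standardized mean difference on Time 1 scores between people who did
    not respond at Time 2 and people who did.

    Assumed form: nonrespondent mean minus respondent mean, divided by the
    pooled within-group standard deviation.
    """
    n_non, n_resp = len(time1_nonrespondents), len(time1_respondents)
    var_non = statistics.variance(time1_nonrespondents)
    var_resp = statistics.variance(time1_respondents)
    pooled_sd = math.sqrt(
        ((n_non - 1) * var_non + (n_resp - 1) * var_resp) / (n_non + n_resp - 2)
    )
    mean_gap = statistics.mean(time1_nonrespondents) - statistics.mean(time1_respondents)
    return mean_gap / pooled_sd


# Three waves at 60% each leave roughly 21.6% of the sample with complete data.
print(round(compounded_response_rate(0.60, 3), 3))

# Invented Time 1 satisfaction scores, split by Time 2 response status.
dropouts = [2.0, 3.0, 4.0]
stayers = [3.0, 4.0, 5.0]
print(d_miss(dropouts, stayers))
```

A negative value here would mean that Time 2 nonrespondents scored lower at Time 1 than Time 2 respondents did.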
Within-Group Agreement Estimation

Missing data (MNAR in particular) can lead to overestimation of agreement among members of a group. If an agreement index is used to assess whether group-level aggregation is justified (e.g., $r_{WG(J)}$; James, Demaree, & Wolf, 1984), then missing data can lead to a false conclusion that aggregation is justified, when in fact it is not (Newman & Sin, in press). Further, when group agreement represents a substantive construct (Chan, 1998; e.g., climate strength; Schneider, Salvaggio, & Subirats, 2002), missing data can bias tests of whether agreement predicts other, group-level outcomes. Specifically, tests of dispersion hypotheses are prone to bias whenever there is between-groups variability in response rates (Newman & Sin, in press). One way to address these problems is to conduct a sensitivity analysis, assessing whether response rates and levels of $d_{miss}$ shown in Table 1.3 would lead to large enough changes in estimates that the conclusions of one’s study will change. Such sensitivity analyses are reviewed by Newman and Sin (in press).

Meta-analysis

Meta-analyses suffer mainly from two types of missing data problems: (a) unreported artifact information (e.g., scale reliabilities) and (b) publication bias. For missing reliability estimates, Hunter and Schmidt (2004) recommend using artifact distributions based on reported reliability estimates. One important question is, “Are the unreported reliability estimates missing completely at random (MCAR), or are low reliability estimates less likely to be reported than high reliability estimates (MNAR)?” In the latter case, corrections based on observed reliability estimates will lead to overestimation of reliability and therefore undercorrection of the primary study effects.
Another common practice is mean imputation from the reported reliabilities (e.g., Harrison, Newman, & Roth, 2006), although substituting a mean for the missing values will artificially reduce the variance of the artifact distribution. Given the above discussion, it seems that a better approach to correcting for unreported artifacts (which are probably MNAR) would involve incorporating SNPs into artifact distributions (e.g., based on a $d_{miss}$ parameter comparing reported versus unreported reliability estimates).
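The variance-shrinking effect of mean imputation is easy to see with a toy artifact distribution. The sketch below uses invented reliability values and a hypothetical helper name; it illustrates the arithmetic only and is not a procedure taken from Hunter and Schmidt (2004).

```python
import statistics


def mean_impute(reported, n_unreported):
    """Return the artifact distribution after filling each unreported value
    with the mean of the reported values."""
    fill = statistics.mean(reported)
    return list(reported) + [fill] * n_unreported


# Invented reliabilities from five studies; five more studies report none.
reported = [0.70, 0.75, 0.80, 0.85, 0.90]
imputed = mean_impute(reported, n_unreported=5)

var_reported = statistics.pvariance(reported)
var_imputed = statistics.pvariance(imputed)

# The imputed values sit at the mean, so they add cases without adding
# spread: the mean stays put while the variance shrinks.
print(statistics.mean(reported), statistics.mean(imputed))
print(var_reported, var_imputed)
```

In this toy case half of the values are imputed, so the variance of the artifact distribution is cut roughly in half. If the unreported reliabilities are in fact systematically lower (MNAR), the imputed mean is also too high, which this sketch does not model.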
Publication bias, another missing data problem in meta-analysis, is a particular form of MNAR missingness, wherein smaller effects are less likely to be published and thus more likely to be missing from the meta-analytic database (see Lipsey & Wilson, 1993). Methods conceptually similar to the SNP approach advocated in the current chapter have been recommended, in order to estimate what the meta-analytic effect size would have been in the absence of publication bias (Duval & Tweedie, 2000; Vevea & Woods, 2005).

Social Network Analysis

Several types of social network analyses (e.g., calculating connectedness, indirect friendships, etc., across the entire network) can be extremely sensitive to missing data (see Burt, 1987). In general, social network studies are held to a high standard of data completeness, with journal reviewers regularly requiring response rates of 90% or higher. Costenbader and Valente (2003) and Borgatti, Carley, and Krackhardt (2006) have offered early demonstrations that missing data influence individual network centrality scores in a predictable fashion. However, these analyses only simulate the MCAR pattern, which is potentially problematic because network data missingness is likely systematic, not random (i.e., missingness is associated with the strength of ties and with demographic factors; Burt, 1987). One reasonable strategy for reducing the negative impact of the missing data on network analyses is to impute respondent-to-nonrespondent ties in place of missing nonrespondent ties (Stork & Richards, 1992). In other words, if person A (a respondent) nominates person B (a nonrespondent) as a friend, then we can assume that person B would have nominated person A as a friend (i.e., friendship symmetry assumption). Consider an example network analysis of 100 individuals, of whom only 70 respond to the network survey (individual response rate = 70%).
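For the 100-person network just introduced, the tie-level bookkeeping can be sketched as follows. The function name is hypothetical, and the sketch assumes, as a simplification, that every ordered pair of members (self-pairs included) counts as one potential tie.

```python
def tie_level_response_rates(n_members, n_respondents):
    """Return (listwise_rate, symmetry_rate) for a whole-network survey.

    listwise_rate: share of potential ties observed when only
        respondent-to-respondent ties are kept.
    symmetry_rate: share observed when every tie involving at least one
        respondent is kept (friendship symmetry assumption), so only
        nonrespondent-to-nonrespondent ties remain missing.
    """
    total_ties = n_members * n_members
    n_nonrespondents = n_members - n_respondents
    listwise_rate = (n_respondents * n_respondents) / total_ties
    symmetry_rate = (total_ties - n_nonrespondents * n_nonrespondents) / total_ties
    return listwise_rate, symmetry_rate


listwise, symmetric = tie_level_response_rates(n_members=100, n_respondents=70)
print(f"{listwise:.0%} -> {symmetric:.0%}")
```

Because the missing share under the symmetry assumption is the square of the individual nonresponse rate, the gain from keeping respondent-reported ties is largest when individual response rates are moderate rather than very low.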
At the network-tie level, there are 100 × 100 = 10,000 potential network ties (e.g., friendships vs. nonfriendships) that could be reported. Getting data from only 70% of the network members results in a network-tie-level response rate of (70 × 70)/10,000 = 49%. Using the strategy advocated above (assuming friendship symmetry) would increase the response rate from 49% up to [10,000 – (30 × 30)]/10,000 = 91%! That is, by using all the available data (i.e., by not listwise deleting nonrespondents), we observe a
dramatic improvement in the dyadic tie-level response rate. Another approach to modeling respondent and nonrespondent ties, which also uses all available data, is exponential random graph modeling (Robins, Pattison, & Woolcock, 2004).

Moderated Regression

When conducting tests for statistical interaction effects (i.e., testing whether the relationship between X and Y depends on a third variable, M), listwise deletion increases Type II errors of inference (i.e., failures to detect true effects). Pairwise deletion, on the other hand, leads to elevated Type I error (i.e., concluding there is a moderator effect, when in fact there is not; Dawson & Newman, 2006). ML and MI should be the preferred missing data techniques for testing moderator hypotheses.

Conclusions

This chapter offers three contributions. First, it identifies two SNPs ($d_{miss}$ and $f^2_{miss}$) that capture the differences between respondents and nonrespondents. Second, it illustrates how response-rate biases in the mean, standard deviation, and correlation depend on an interaction of these SNPs with the response rate. Third, it points out that Type II error (low power) is a function of number of respondents and not the response rate per se. These contributions together demonstrate that low response rates (e.g., below 20%) need not invalidate study results. Rather, the robustness of results to low response rates is an empirical question, driven by $d_{miss}$ and $f^2_{miss}$.

In theory, survey response is part of a social exchange, wherein the respondent contributes a limited amount of time and effort in exchange for inducements of satisfaction, perceived organizational support, trust, and the promise of anonymity (Figure 1.2). As such, any psychological variable that is related to the nonresponse decision (especially attitudes and personality) will demonstrate a nonzero $d_{miss}$ parameter estimate.

How are $d_{miss}$ and $f^2_{miss}$ parameters estimated? Schafer and Graham (2002) note that it is very difficult to determine whether missing data are missing-not-at-random (MNAR), because this requires actually
collecting data from the nonrespondents. Rogelberg et al. (2003) suggest four strategies for gathering data from nonrespondents (e.g., follow-up designs). Using these designs, Rogelberg and colleagues (2000, 2003) show that there exist mean differences between respondents and nonrespondents in terms of job satisfaction, organizational commitment, conscientiousness, and agreeableness ($d_{miss}$ estimates vary from –.1 to –.6, suggesting that nonrespondents are less satisfied and less conscientious than respondents, on average; Table 1.3). It is the precise sizes of these $d_{miss}$ estimates that determine bias due to low response rates.

Researchers should not rely on a heuristic response rate (e.g., below 20%) to automatically invalidate results. Rather, it should be acknowledged that “response rate bias” is an explicit, interactive function of response rate with $d_{miss}$ and $f^2_{miss}$, for the constructs at hand. When $d_{miss}$ and $f^2_{miss}$ are nil, there is no response rate bias. By the same token, when $d_{miss}$ and $f^2_{miss}$ are large, results can be rendered invalid even at higher response rates (e.g., 50%). Response rate bias is not merely a function of response rate; SNPs also play a fundamental role (Figure 1.1). To answer the question, “Is my response rate high enough to support the conclusions of my study?” it will be useful to conduct a sensitivity analysis, using representative SNPs (Table 1.3) and formulae found in Newman and Sin (in press) and the Appendix of this chapter.
Future Research on $d_{miss}$ and $f^2_{miss}$

At present, relatively little is known about the magnitudes of SNPs (i.e., $d_{miss}$ and $f^2_{miss}$) for many psychological constructs. As such, our confidence in the biasing effects of low response rates will grow as more follow-up studies are conducted, and mean respondent-nonrespondent differences are cataloged (through meta-analyses of $d_{miss}$ and $f^2_{miss}$) for a variety of well-known psychological constructs (e.g., Big Five personality traits, affectivity, self-esteem, cognitive ability, job satisfaction, job performance). It would further be useful to investigate actions that can be taken to potentially alter the sizes of these SNPs. For instance, sending out survey reminders may result in more responses from passive nonrespondents (i.e., those who have response intentions but just have not responded yet) but may do little to attract responses from active nonrespondents (i.e., those who deliberately choose not to respond;
Rogelberg et al., 2003; Spitzmuller et al., 2006). Thus, sending out survey reminders may increase response rates, while simultaneously increasing $d_{miss}$. The diagrams in Figure 1.1 assumed $d_{miss}$ was orthogonal to the response rate, which may or may not hold up under empirical scrutiny.

Missing Data Techniques

A final advantage of considering SNPs is that these parameters indicate the extent to which popular missing data techniques (listwise and pairwise deletion) will result in biased estimates (low external validity). Specifically, listwise and pairwise deletion are appropriate only under MCAR (i.e., where $d_{miss}$ = 0 and $f^2_{miss}$ = 0). As such, the inferiority of listwise and pairwise deletion can be empirically demonstrated by looking at the SNPs. Because missing data are very rarely MCAR (Table 1.3), it can be expected that listwise and pairwise deletion strategies will routinely create nonresponse bias.

What should be done when $d_{miss}$ ≠ 0 and/or $f^2_{miss}$ ≠ 0? Low response rates (i.e., survey-level nonresponse) create an MNAR pattern whenever $d_{miss}$ ≠ 0 or $f^2_{miss}$ ≠ 0. This MNAR missingness cannot be well addressed through listwise, pairwise, ML, or MI techniques (Table 1.1; see Collins et al., 2001). To deal with low response rates, then, the most appropriate (least biased) missing data treatment will be a sensitivity analysis based on SNPs (see Table 1.2).

References

Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing moderating effects of categorical variables using multiple regression: A 30-year review. Journal of Applied Psychology, 90, 94–107.
Ajzen, I. (1988). Attitudes, personality, and behavior. Homewood, IL: Dorsey Press.
Borgatti, S. P., Carley, K. M., & Krackhardt, D. (2006). On the robustness of centrality measures under conditions of imperfect data. Social Networks, 28, 124–136.
Burt, R. S. (1987). A note on missing network data in the General Social Survey. Social Networks, 9, 63–73.
Chan, D. (1998). Functional relations among constructs in the same content domain at different levels of analysis: A typology of composition models. Journal of Applied Psychology, 83, 234–246.
Collins, L. M., Schafer, J. L., & Kam, C. M. (2001). A comparison of inclusive and restrictive strategies in modern missing data procedures. Psychological Methods, 6, 330–351.
Costenbader, E., & Valente, T. W. (2003). The stability of centrality measures when networks are sampled. Social Networks, 25, 283–307.
Dawson, J. F., & Newman, D. A. (2006, May). Pairwise deletion problems with moderated multiple regression. In D. A. Newman (Chair), Testing interaction effects: Problems and procedures. Symposium presented at the SIOP Annual Convention, Dallas, TX.
Dempster, A. P., Laird, N. H., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39, 1–38.
Dillman, D. A. (1978). Mail and telephone surveys: The total design method. New York: Wiley.
Duval, S., & Tweedie, R. (2000). Trim and fill: A simple funnel plot based method of testing and adjusting for publication bias in meta-analysis. Biometrics, 56, 276–284.
Enders, C. K. (2001). A primer on maximum likelihood algorithms for use with missing data. Structural Equation Modeling, 8, 128–141.
Foa, E. B., & Foa, U. G. (1980). Resource theory: Interpersonal behavior as exchange. In K. Gergen, M. S. Greenberg, & R. Willis (Eds.), Social exchange: Advances in theory and research (pp. 77–94). New York: Plenum Press.
Fox, R. J., Crask, M. R., & Kim, J. (1988). Mail survey response rate: A meta-analysis of selected techniques for inducing response. Public Opinion Quarterly, 52, 467–491.
Gold, M. S., & Bentler, P. M. (2000). Treatments of missing data: A Monte Carlo comparison of RBHDI, iterative stochastic regression imputation, and expectation-maximization. Structural Equation Modeling, 7, 319–355.
Gold, M. S., Bentler, P. M., & Kim, K. H. (2003). A comparison of maximum-likelihood and asymptotically distribution-free methods of treating incomplete nonnormal data. Structural Equation Modeling, 10, 47–79.
Gouldner, A. W. (1960). The norm of reciprocity: A preliminary statement. American Sociological Review, 25, 161–178.
Graham, J. W., Cumsille, P. E., & Elek-Fiske, E. (2003). Methods for handling missing data. In J. A. Schinka & W. F. Velicer (Eds.), Research methods in psychology (pp. 87–114). Vol. 2 of Handbook of psychology (I. B. Weiner, Editor in Chief). New York: Wiley.
Harrison, D. A., Newman, D. A., & Roth, P. L. (2006). How important are job attitudes? Meta-analytic comparisons of integrative behavioral outcomes and time sequences. Academy of Management Journal, 49, 305–325.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis: Correcting error and bias in research findings (2nd ed.). Newbury Park: Sage.
James, L. R., Demaree, R. G., & Wolf, G. (1984). Estimating within-group interrater reliability with and without response bias. Journal of Applied Psychology, 69, 85–98.
Judge, T. A., Heller, D., & Mount, M. K. (2002). Five-factor model of personality and job satisfaction: A meta-analysis. Journal of Applied Psychology, 87, 530–541.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatment: Confirmation from meta-analysis. American Psychologist, 48, 1181–1209.
Little, R. J. A., & Rubin, D. B. (1987). Statistical analysis with missing data. New York: Wiley.
Little, R. J. A., & Rubin, D. B. (2002). Statistical analysis with missing data (2nd ed.). New York: Wiley.
Marsh, H. W. (1998). Pairwise deletion for missing data in structural equation models: Nonpositive definite matrices, parameter estimates, goodness of fit, and adjusted sample sizes. Structural Equation Modeling, 5, 22–36.
Martocchio, J. J., & Harrison, D. A. (1993). To be there or not to be there? Questions, theories and methods in absenteeism research. In K. Rowland & G. Ferris (Eds.), Research in personnel and human resources management (Vol. 11, pp. 259–329). Greenwich, CT: JAI Press.
Newman, D. A. (2003). Longitudinal modeling with randomly and systematically missing data: A simulation of ad hoc, maximum likelihood, and multiple imputation techniques. Organizational Research Methods, 6, 328–362.
Newman, D. A. (2004, April). Missing data in longitudinal designs: Enhancing imputation with auxiliary variables. In D. A. Newman & J. L. Farr (Cochairs), Assumptions and conventions in data analysis. Symposium presented at the SIOP Annual Convention, Chicago, IL.
Newman, D. A., & Sin, H. P. (2009). How do missing data bias estimates of within-group agreement? Sensitivity of $SD_{WG}$, $CV_{WG}$, $r_{WG(J)}$, $r^{*}_{WG(J)}$, and ICC to systematic nonresponse. Organizational Research Methods.
Ostroff, C. (1993). Comparing correlations based on individual-level and aggregated data. Journal of Applied Psychology, 78, 569–582.
Robins, G., Pattison, P., & Woolcock, J. (2004). Missing data in networks: Exponential random graph (p*) models for networks with nonrespondents. Social Networks, 26, 257–283.
Robinson, W. S. (1950). Ecological correlations and the behaviour of individuals. American Sociological Review, 15, 351–357.
Rogelberg, S. G., Conway, J. M., Sederburg, M. E., Spitzmuller, C., Aziz, S., & Knight, W. E. (2003). Profiling active and passive nonrespondents to an organizational survey. Journal of Applied Psychology, 88, 1104–1114.
Rogelberg, S. G., Luong, A., Sederburg, M. E., & Cristol, D. S. (2000). Employee attitude surveys: Examining the attitudes of noncompliant employees. Journal of Applied Psychology, 85, 284–293.
Roth, P. L., & BeVier, C. A. (1998). Response rates in HRM/OB survey research: Norms and correlates, 1990–1994. Journal of Management, 24, 97–117.
Roth, P. L., Switzer, F. S., & Switzer, D. M. (1999). Missing data in multiple item scales: A Monte Carlo analysis of missing data techniques. Organizational Research Methods, 2, 211–232.
Rubin, D. B. (1976). Inference and missing data. Biometrika, 63, 581–592.
Schafer, J. L. (1997). Analysis of incomplete multivariate data. New York: Chapman & Hall.
Schafer, J. L., & Graham, J. W. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147–177.
Schneider, B., Salvaggio, A. N., & Subirats, M. (2002). Climate strength: A new direction for climate research. Journal of Applied Psychology, 87, 220–229.
Spitzmuller, C., Glenn, D. M., Barr, C. D., Rogelberg, S. G., & Daniel, P. (2006). “If you treat me right, I reciprocate”: Examining the role of exchange in survey response. Journal of Organizational Behavior, 27, 19–35.
Stork, D., & Richards, W. D. (1992). Nonrespondents in communication network studies: Problems and possibilities. Group & Organization Management, 17, 193–209.
Tett, R. P., & Meyer, J. P. (1993). Job satisfaction, organizational commitment, turnover intention, and turnover: Path analyses based on meta-analytic findings. Personnel Psychology, 46, 259–293.
Vevea, J. L., & Woods, C. M. (2005). Publication bias in research synthesis: Sensitivity analysis using a priori weight functions. Psychological Methods, 10, 428–443.
Yammarino, F. J., Skinner, S. J., & Childers, T. L. (1991). Understanding mail survey response behavior. Public Opinion Quarterly, 55, 613–629.
Yu, J., & Cooper, H. (1983). A quantitative review of research design effects on response rates to questionnaires. Journal of Marketing Research, 20, 36–44.
Zimmerman, R. D. (2006). Understanding the impact of personality traits on individuals’ turnover decisions. Unpublished doctoral dissertation, University of Iowa.
Appendix

Derivation of Response Rate Bias for the Correlation (Used to Generate Figure 1.1c)

Beginning with Aguinis, Beaty, Boik, and Pierce’s (2005, p. 105) modified $f^2$, I derive the following:

$$
f^{2} = \frac{pN r_{non}^{2} a^{2} s_{y_{resp}}^{2} + (1-p)N r_{resp}^{2} s_{y_{resp}}^{2} - \dfrac{\left[pN r_{non}\, a\, s_{y_{resp}}\, b\, s_{x_{resp}} + (1-p)N r_{resp}\, s_{y_{resp}}\, s_{x_{resp}}\right]^{2}}{pN b^{2} s_{x_{resp}}^{2} + (1-p)N s_{x_{resp}}^{2}}}{pN a^{2} s_{y_{resp}}^{2}\left(1 - r_{non}^{2}\right) + (1-p)N s_{y_{resp}}^{2}\left(1 - r_{resp}^{2}\right)},
$$

where N is the total number of surveys distributed (at response rate = 100%), $p = n_{non}/N$ (i.e., nonresponse rate), $(1 - p) = n_{resp}/N$ (i.e., response rate), $a = s_{y_{non}}/s_{y_{resp}}$ and $b = s_{x_{non}}/s_{x_{resp}}$ (i.e., standard deviation ratios for y and x, modeling variance heterogeneity), and $n_{resp}$ approximates $(n_{resp} - 1)$ and $(n_{resp} - 2)$. Rearranging and then solving for $r_{non}$ via the Quadratic Formula yields the following equation:

$$
r_{non} = \frac{p\,b\,r_{resp} \pm \sqrt{p^{2}b^{2}r_{resp}^{2} - p\left[1 + f^{2}\left(1 + \dfrac{pb^{2}}{1-p}\right)\right]\left\{pb^{2}r_{resp}^{2} - f^{2}\left(1 + \dfrac{pb^{2}}{1-p}\right)\left[pa^{2} + (1-p)\left(1 - r_{resp}^{2}\right)\right]\right\}}}{p\,a\left[1 + f^{2}\left(1 + \dfrac{pb^{2}}{1-p}\right)\right]}
$$

The presence of “±” in the Quadratic Formula suggests that $r_{non}$ can be either larger or smaller than $r_{resp}$, for a given level of $f^2$. As such, the new notation $f^2_{miss(-)}$ means that the nonrespondent correlation ($r_{non}$) is smaller than the respondent correlation ($r_{resp}$), whereas $f^2_{miss(+)}$ means that $r_{non}$ is larger than $r_{resp}$. Finally, the complete-data individual-level correlation (at 100% response rates) can be estimated as

$$\hat{r}_{xy_{complete}} = r_{group}\,\eta_x\,\eta_y + r_{pooled}\sqrt{\left(1 - \eta_x^{2}\right)\left(1 - \eta_y^{2}\right)}$$

(see Ostroff, 1993; Robinson, 1950). Substituting alternative expressions for $r_{group}$, $\eta_x$, $\eta_y$, and $r_{pooled}$, the above equation expands to
2d d _ y dmiss _ y dmiss _ x rˆxycomplete = 2 miss _ x miss 2 (dmiss _ x + dmiss _ y ) 2 1 + dmiss _ x p(1 − p) 2 1 + dmiss _ y p(1 − p)
2 2 dmiss dmiss _y 2 2 _x 1 − + prnon ( 1 ) 1 + − p r − resp 4 1 + d 4 1 + d p ( 1 − p ) p ( 1 − p ) miss _ y miss x _
.
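The appendix quantities lend themselves to a numerical check. The Python sketch below is an illustration under stated assumptions, not the author's code: the function names are hypothetical; N and the respondent standard deviations are dropped from $f^2$ because they cancel; and the complete-data correlation is computed directly as a two-group mixture, with respondents scaled to mean 0 and SD 1 and the nonrespondent mean shifts `d_x` and `d_y` expressed in respondent SD units (a scaling that may differ from the chapter's definition of $d_{miss}$). It is not a transcription of the expanded expression above.

```python
import math


def modified_f2(r_non, r_resp, p, a=1.0, b=1.0):
    """Modified f^2 for a respondent/nonrespondent slope difference.

    p is the nonresponse rate; a and b are the nonrespondent-to-respondent
    SD ratios for y and x.
    """
    q = 1.0 - p
    explained = p * a**2 * r_non**2 + q * r_resp**2
    common_slope = (p * a * b * r_non + q * r_resp) ** 2 / (p * b**2 + q)
    residual = p * a**2 * (1.0 - r_non**2) + q * (1.0 - r_resp**2)
    return (explained - common_slope) / residual


def r_non_from_f2(f2, r_resp, p, a=1.0, b=1.0):
    """Solve the quadratic in r_non; returns the (smaller, larger) roots."""
    q = 1.0 - p
    k = 1.0 + f2 * (1.0 + p * b**2 / q)
    inner = p * b**2 * r_resp**2 - f2 * (1.0 + p * b**2 / q) * (
        p * a**2 + q * (1.0 - r_resp**2)
    )
    disc = (p * b * r_resp) ** 2 - p * k * inner
    if disc < -1e-12:
        raise ValueError("no real solution for these inputs")
    root = math.sqrt(max(disc, 0.0))  # clamp tiny negative rounding error
    denom = p * a * k
    return (p * b * r_resp - root) / denom, (p * b * r_resp + root) / denom


def complete_data_r(r_non, r_resp, p, d_x, d_y, a=1.0, b=1.0):
    """Full-sample correlation from the two-group mixture moments."""
    q = 1.0 - p
    mean_x, mean_y = p * d_x, p * d_y
    var_x = p * (b**2 + (d_x - mean_x) ** 2) + q * (1.0 + mean_x**2)
    var_y = p * (a**2 + (d_y - mean_y) ** 2) + q * (1.0 + mean_y**2)
    cov = p * (r_non * a * b + (d_x - mean_x) * (d_y - mean_y)) + q * (
        r_resp + mean_x * mean_y
    )
    return cov / math.sqrt(var_x * var_y)


def decomposed_r(r_non, r_resp, p, d_x, d_y, a=1.0, b=1.0):
    """Same quantity via r_group*eta_x*eta_y + r_pooled*sqrt((1-eta_x^2)(1-eta_y^2))."""
    q = 1.0 - p
    within_x, within_y = p * b**2 + q, p * a**2 + q
    between_x, between_y = p * q * d_x**2, p * q * d_y**2
    eta_x = math.sqrt(between_x / (between_x + within_x))
    eta_y = math.sqrt(between_y / (between_y + within_y))
    r_group = math.copysign(1.0, d_x * d_y)  # two groups: means correlate +1 or -1
    r_pooled = (p * r_non * a * b + q * r_resp) / math.sqrt(within_x * within_y)
    return r_group * eta_x * eta_y + r_pooled * math.sqrt(
        (1.0 - eta_x**2) * (1.0 - eta_y**2)
    )


f2 = modified_f2(r_non=0.10, r_resp=0.30, p=0.50)
print(r_non_from_f2(f2, r_resp=0.30, p=0.50))  # one root is r_non = 0.10 up to rounding

# p = 0.90 is a 10% response rate; d_x and d_y are illustrative values only.
print(complete_data_r(0.10, 0.30, p=0.90, d_x=-0.4, d_y=-0.3))
print(decomposed_r(0.10, 0.30, p=0.90, d_x=-0.4, d_y=-0.3))
```

The last two printed values should agree, since the mixture moments and the group/pooled decomposition are two routes to the same correlation. Sweeping `p` from 0.01 to 0.90 with fixed SNP values is one way to trace how the complete-data correlation departs from the respondent correlation as the response rate falls.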
2
The Partial Revival of a Dead Horse? Comparing Classical Test Theory and Item Response Theory
Michael J. Zickar and Alison A. Broadfoot
Advances in psychometric theory over the last 30 years have introduced many new tools and techniques to researchers interested in measuring psychological constructs. The revolution of item response theory (IRT) has raised questions about the relevance of its predecessor, classical test theory (CTT). In fact, some writers have suggested that CTT has been made obsolete by its successor. For example, Rojas Tejada and Lozano Rojas (2005) discussed how recent research has been used to “displace the CTT in favour of the use of Item Response Theory–based models” (p. 370), and Harvey and Hammer (1999) predicted that “IRT-based methods . . . will largely replace CTT-based methods over the coming years” (p. 354). Samejima, in critiquing CTT, describes its “fatal deficiency [italics added],” which relates to how CTT models measurement precision (Samejima, 1977, p. 196). Borsboom argues that “few, if any, researchers in psychology conceive of psychological constructs in a way that would justify the use of classical test theory as an appropriate measurement model” (Borsboom, 2005, p. 47). We have heard people dismiss CTT as irrelevant and antiquated, more worthy of history books than contemporary psychometric classes. Often these same individuals treat IRT as a panacea for all psychometric woes. In short, CTT is treated as an old racehorse that is nice to have around, though everyone is expecting it to perish soon. According to this argument, IRT is the new steed that has won a few races and is expected to abolish its predecessor’s triumphs. We believe that this urban legend is just plain myth and
that CTT still has uses in modern psychometrics. Having said that, like most urban legends, there is a kernel of truth to the reported obsolescence of CTT. In some cases and applications, CTT has been supplanted by IRT. In this chapter, we will sort out fact from fiction and provide a psychometric road map for people trying to navigate this confusing literature.

We will also debate the relative merits of both theories. In short, our belief is that both theories are useful and that any calls for the demise of CTT are shortsighted and premature. As we will outline throughout this chapter, there are many situations in which CTT may be sufficient or preferred and there are other situations in which IRT will be necessary. We will begin by reviewing briefly the assumptions and basic principles of each theory. Next we will highlight specific criticisms and limitations of both theories. Finally, we highlight scenarios and situations in which it would be preferable (or necessary) to use one theory over the other.

Basic Statement of the Two Theories

Classical Test Theory

CTT can be best understood by investigating its models and related concepts. The general classical test model is based on a simple equation:
\[ X_{ij} = T_i + E_{ij} \qquad (2.1) \]
where an observed test score, Xij, for individual i and testing time j, is a function of two unknowns: a true score (Ti) plus an error score (Eij). For this basic classical test model, the basic assumptions are as follows:
1. True scores and error scores are uncorrelated, i.e., $r_{TE} = 0$.
2. The average error score for each examinee across replications and in the population of examinees is zero.
3. Error scores on parallel tests are uncorrelated.

(See Lord & Novick, 1968; Allen & Yen, 1979, for more discussion of CTT assumptions.)
The work of Lord and Novick (1968) is the definitive statement of classical test theory, although it is densely written and difficult to understand. The work of Allen and Yen (1979) is a more readable statement of classical test theory that is more accessible to non-testing experts.
The Partial Revival of a Dead Horse?
There have been many different attempts to explain the concept of a true score. The most succinct explanation is that a true score is equivalent to the expected value of the observed score for an individual on a particular test. As such, true scores are defined by both the person and the scale. True scores are not a property inherent only to the person himself or herself; therefore, an individual does not have only one true score for all intelligence tests but has a different true score for each intelligence test. The false notion of a true score that exists independent of a test has been called the platonic notion of a CTT true score (see Thorndike, 1964). Although invoking the expected value definition might lessen the scope and generalizability of the true score, it avoids ontological difficulties associated with the platonic version of true score (see Borsboom & Mellenbergh, 2002).

Another fundamental concept in CTT is the concept of reliability, denoted by $r_{xx'}$. In CTT, reliability is operationalized as the proportion of observed score variance that is due to true score variance, or
\[ r_{xx'} = \frac{\sigma_T^2}{\sigma_X^2} \qquad (2.2) \]
where $\sigma_T^2$ is the true score variance and $\sigma_X^2$ is the observed score variance. Given that true scores are unknown, a variety of techniques have been developed to parse observed score variance into estimated true score variance and error score variance, thus allowing reliabilities to be computed. These methods include test-retest, split-half, alternate forms, and internal consistency methods of reliability estimation. Each of these methods makes different assumptions about the nature of error scores and, thus, can provide different estimates of reliability for a particular test. Although the concept of reliability is simple, its operationalization is complex and worthy of study beyond what we could cover in this chapter (see Allen & Yen, 1979; Nunnally, 1978). Reliability is an important part of CTT because it provides a measure of precision for the tests. The standard error of measurement (SEoM) is a function of reliability,
\[ \mathrm{SEoM} = \sigma_X \sqrt{1 - r_{xx'}} \qquad (2.3) \]
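As a rough numerical illustration of Equations 2.1 to 2.3, the sketch below simulates a hypothetical population in which true scores and errors are drawn independently. The means, the standard deviations (8 and 6), the sample size, and the example observed score of 55 are made-up values chosen only for illustration, and the 95% band assumes normally distributed errors.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: all numeric values here are illustrative assumptions.
n_people = 100_000
true_scores = rng.normal(loc=50.0, scale=8.0, size=n_people)  # T_i
errors = rng.normal(loc=0.0, scale=6.0, size=n_people)        # E_ij, independent of T_i
observed = true_scores + errors                               # Equation 2.1

# Equation 2.2: share of observed-score variance due to true-score variance.
reliability = true_scores.var() / observed.var()

# Equation 2.3: standard error of measurement.
seom = observed.std() * np.sqrt(1.0 - reliability)

# Rough 95% band around one observed score, assuming normal errors.
observed_score = 55.0
band = (observed_score - 1.96 * seom, observed_score + 1.96 * seom)

print(f"reliability = {reliability:.3f}, SEoM = {seom:.2f}")
print(f"95% band around {observed_score}: ({band[0]:.1f}, {band[1]:.1f})")
```

With these made-up values the population reliability is 64/100 = .64 and the population SEoM is 10 × √.36 = 6, so the simulated estimates should land close to those figures. In practice true scores are unknown, which is why reliability has to be estimated by the methods listed above.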
The SEoM is the average amount of error expected in a particular test score. SEoMs can be used to compute confidence intervals around observed scores to detail the plausible range of values that a person’s true score could be given their observed score, and SEoMs can be used to determine whether two test scores are significantly different from each other (see Dudek, 1979; Harvill, 1991). In general, when reliabilities increase, standard errors decrease, which means that test users can be more confident about precision of individual test scores.

Although CTT focuses on the scale score as the unit of analysis, there are several statistics that have been used in CTT to assess item functioning. Item difficulty can be characterized as the proportion of test takers who affirm the item (e.g., correctly answer the item with ability items or agree with the item for personality items) given dichotomously (i.e., two options such as right or wrong) scored items, or the item mean for items that are polytomously (i.e., more than two options) scored. Item discrimination describes how well an item does at differentiating among test takers who have different levels of the trait being measured by the scale. Item discrimination can be calculated by correlating the score on a particular item with the total score on a scale (generally removing the focal score from the total score to avoid confounding the discrimination index). If this number is positive (different rules of thumb are given to signify items with acceptable discriminations, though .30 or greater seems to be a common heuristic), the item is said to discriminate well among test takers of differing ability. If this number is low, the item is presumed to not discriminate well.
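The two CTT item statistics just described can be sketched in a few lines. The function name `item_analysis` and the small response matrix are hypothetical, invented for illustration; the discrimination index is the corrected item-total correlation, with the focal item removed from the total.

```python
import numpy as np

def item_analysis(responses):
    """Return (difficulty, discrimination) for a people-by-items 0/1 matrix."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)
    # Difficulty: proportion of test takers who affirm each item.
    difficulty = responses.mean(axis=0)
    discrimination = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        rest = total - responses[:, j]  # total score without the focal item
        # Note: corrcoef yields nan if an item (or the rest score) has no variance.
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]
    return difficulty, discrimination

# Hypothetical data: 6 test takers, 4 dichotomous items.
# The last item behaves like one that should have been reverse-coded.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]
difficulty, discrimination = item_analysis(responses)
print("difficulty:", np.round(difficulty, 2))
print("corrected item-total r:", np.round(discrimination, 2))
```

In this toy matrix the fourth item correlates negatively with the rest of the scale, which is the pattern that would prompt a check of its scoring key.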
If the item-total correlation is negative, that is often an indication that an item should have been reverse-coded, that another response option more positively relates to the overall test score (and therefore is more likely to be the right answer), or that the item just does not work in the intended manner.

Item Response Theory

IRT focuses on measuring a latent construct that is believed to underlie the responses to a given test. Test takers are characterized by the latent trait theta (θ), which is the ability or trait measured by the scale. A common model under IRT has two primary assumptions: that the test or measure is unidimensional, meaning that it measures only
Figure 2.1 Sample 3PL Item Response Function (a = 1.613, b = 0.560, c = 0.127; x-axis: Ability, from –3 to 3; y-axis: Probability, from 0 to 1.0).
one latent trait, and that local independence exists. Local independence means that items within a scale are related to each other solely because of θ; if θ were partialed out, there would be no correlation between items. IRT models can vary by the following characteristics: dichotomous or polytomous item response options; response categories with meaningful order (i.e., Likert scaling) or response option categories with no meaningful order (i.e., nominal models); models that require tests to evaluate only one ability (unidimensional) or models that allow tests to evaluate multiple abilities (multidimensional); and models that differ in the functional form relating theta to the response option (e.g., some models allow for unfolding shapes whereas others are forced to be logistic).

Two thorough yet readable IRT texts include the work of Embretson and Reise (2000) and Hambleton, Swaminathan, and Rogers (1991).

One of the cornerstones of IRT is the item response function (IRF), which relates theta to the expected probability of affirming an item (see Figure 2.1). The shape of the IRF is determined by item parameters, which are determined by the model chosen by the researcher (Figure 2.1 is the three-parameter logistic model). Different IRT models have different item parameters, although common parameters include the following concepts: discrimination, difficulty, and pseudo-guessing. Item difficulty relates to the location on the θ continuum where the item is most discriminating. Items with
low difficulty will be endorsed by nearly all respondents, even those with low θs, whereas items with high difficulty will be endorsed only by respondents with large positive θs. The item in Figure 2.1 has a difficulty parameter (b) of .56, indicating that it is a moderately difficult item given that θ is a standard normal variable (hence, an item with a difficulty value of zero would be of average difficulty). IRT item discrimination has the same goal as under CTT, to characterize the capacity of an item to differentiate between respondents with different levels of the underlying trait. In IRT, the discrimination parameter (a) relates to the slope of the IRF at its inflection point, which is at or near the item difficulty. Items with high discrimination have steep IRFs and can be used to make fine distinctions between people of different θ levels, whereas items with low discrimination tend to have flat IRFs and, therefore, cannot be used to make fine distinctions between individuals. For the item in Figure 2.1, the item is extremely discriminating between people who are below average compared to those who are above average. For example, someone with a θ = –1 would be expected to get the item correct with a probability around .15, whereas someone with a θ = +1 would be expected to get the item correct with a probability around .80. As will be commented on later, this item would not be very discriminating between individuals who are high in θ (e.g., 2.0) versus those extremely high in θ (e.g., 3.0). Finally, the pseudo-guessing parameter relates to the probability that an individual with an extremely low θ will answer an item correctly. This parameter is often necessary because even though extremely low θ respondents may not know the correct answer to a multiple-choice test, these respondents will be able to correctly guess the item with a probability that is 1 divided by the number of options.
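To make the roles of the three item parameters concrete, the sketch below evaluates the standard three-parameter logistic IRF at the parameter values shown in Figure 2.1 (a = 1.613, b = 0.560, c = 0.127). The function name `irf_3pl` is hypothetical, and the scaling constant D = 1.7 is an assumption: the chapter does not state which scaling was used, but this choice gives probabilities close to the approximate values quoted in the text.

```python
import math

def irf_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model.

    D = 1.7 is an assumed scaling constant (normal-ogive metric);
    set D=1.0 for the plain logistic metric.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

# Parameter values as labeled in Figure 2.1.
a, b, c = 1.613, 0.560, 0.127

p_low = irf_3pl(-1.0, a, b, c)    # respondent one SD below the mean
p_high = irf_3pl(1.0, a, b, c)    # respondent one SD above the mean
p_at_b = irf_3pl(b, a, b, c)      # halfway between c and 1 by construction

print(f"P(theta=-1) = {p_low:.3f}")
print(f"P(theta=+1) = {p_high:.3f}")
print(f"P(theta=b)  = {p_at_b:.3f}")
```

Under these assumptions the two probabilities come out at roughly .14 and .80, in line with the values read off the figure. For very low θ the function approaches c rather than zero, which is the floor that the pseudo-guessing parameter provides.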
The pseudo-guessing parameter is most often needed with ability items and in other situations where people would be motivated to guess or fake. For the item in Figure 2.1, the pseudo-guessing parameter (c) is .127, indicating that individuals with extremely low θ will still have about a 13% chance of getting the item correct.

Another important concept in IRT is information. Think of a situation when a test taker sits down to complete a measure but has not yet responded to any items. At that point, we have no information about the individual’s θ. If we were forced to guess what the respondent’s ability would be, our best guess would be the population mean (generally zero). Once our mystery respondent starts answering
items, however, we start to gather information that helps us better estimate his or her θ. Information is a quantification of the amount of uncertainty that is removed by considering item responses. Some items will provide lots of information; other items will provide very little information. Information is a function of an item’s discrimination and difficulty (and the pseudo-guessing parameter if used) as well as the respondent’s θ. All else equal, items that have high discrimination and have a difficulty parameter close to the respondent’s θ will provide relatively high information, whereas items that have low discrimination and have a difficulty parameter far away from the respondent’s θ will provide relatively low information. As will be discussed later, the implications of item information are enormous in that it quantifies measurement precision as a function of items, and, more specifically, the amount of information provided by an item varies as a function of θ. In fact, item information can be plotted as a function of θ, which results in an item information function. See Figure 2.2 for the item information function corresponding to the item response function from Figure 2.1. The height of the information function relates to the discrimination at a certain level of theta. Also, the peak of the information function is usually located at or close to the difficulty parameter for that particular item. As can be seen in Figure 2.2, the peak of the information function is located near .56, which was the difficulty value for that item. In addition, it could be noted that this item provides little or no information at the extreme ranges of θ. A test information function is derived when one sums up the item
Figure 2.2 Sample Information Function (item information curve; x-axis: Scale Score, from –3 to 3; y-axis: Information, from 0 to 4).
information functions for a test or measure. The test information function shows for each level of theta how well the test is able to accurately estimate theta. This function, as will be described later, will be extremely important in evaluating psychological scales.

One strong and very important property of IRT models is that item parameters are invariant across populations (assuming that the model fits the data in the population). Therefore, no matter what sample from the population takes the test, the item parameter estimates will generally be the same (i.e., have the same difficulty, discrimination, and pseudo-guessing parameter estimates). It should be noted, however, that item and person parameter estimates are not necessarily invariant across populations. In addition, theta estimates are invariant across measurement instruments, assuming that the same underlying trait is being measured. Unlike CTT true scores, theta estimates on two different tests measuring the same construct should be equivalent within sampling error. These properties of population and test invariance are important when considering the relative advantages and disadvantages of CTT and IRT.

Criticisms and Limitations of CTT

Lack of Population Invariance

A serious limitation of CTT is that its statistics and parameters are sample dependent and test dependent. The true scores (person parameters) are test dependent and item difficulty and item discrimination (item parameters) are sample dependent. Therefore, depending on the sample taken from the population and the test created to measure a specific construct, the attributes of the specific sample and test will affect the person and item parameters. For example, if a test contains more difficult items, this will affect respondents’ true scores such that their true scores will be relatively low. However, when the test is relatively easy, respondents’ true scores will be higher.
In this case, it would not be possible to compare true scores across tests without doing some elaborate equating studies that would account for differences in test properties. A linear transformation is also generally needed to place the parameters on the same scale.
Also, in CTT, item statistics are dependent on the sample that is used to estimate those statistics. When samples have a large range of abilities, item discrimination will be higher, but when samples have a small range of abilities, item discrimination will be lower. In addition, the item difficulty parameter estimate depends on the general ability of the sample completing the item. For example, the same algebra item may appear to be difficult for fourth graders yet easy for tenth graders. Because of IRT item parameter invariance, these problems are lessened. These pitfalls of CTT can complicate analyses and cause problems with the interpretation of results.

Person and Item Parameters on Different Scales

Another limitation of CTT is that the person and item parameters are not on the same scale, whereas IRT’s person and item difficulty parameters are on the same scale. In IRT, the difficulty item parameter is described on the latent trait (theta) scale. This means that if an item is of average difficulty (i.e., the difficulty will equal 0 on a normal z-theta scale) and if a person responding to that item has an average ability level (their theta will equal 0 on a normal z-theta scale), this person will have a 50% chance of getting this item correct (assuming that there is no pseudo-guessing parameter). Having both person and item parameters on the same scale provides test developers and administrators some distinct advantages. One such advantage is the use of adaptive testing techniques such as computer adaptive testing (CAT; Reise & Henson, 2000). CAT usually starts with an item of average difficulty. How individuals respond to this first item will determine their initial estimate of their level of ability (theta) on the measure.
For example, those respondents that answer this first item correctly will be estimated to have a higher than average ability (theta) and will then be given a more difficult item to better refine the estimate of their ability level (i.e., is their ability around the mean or well above the mean?). Notice that the difficulty of the item, and how a person responds to that item, helps the test administrator infer what that person’s ability (theta) is. Many iterations will occur until some criterion is met (Zickar, Overton, Taylor, & Harms, 1999). This criterion could be that the standard error of the measurement for the examinee’s ability is below some number, ensuring that the ability estimate is accurate, or that a specific number of items are
administered, ensuring that all examinees receive the same number of items to increase perceptions of fairness. With the former criterion and even with the latter, test length can be substantially shorter and have less measurement error compared to standard test administration procedures. With CTT, CAT would be much more difficult because the item and person parameters are not on the same scale, among other things. In CTT, there is no direct relationship between an item’s difficulty and a person’s ability, making adaptive testing a much more difficult process.

Criterion-referenced tests can also benefit from having both item and person parameters on the same scale (Bock, 1997). Criterion-referenced tests are tests that require a person to have a certain level of ability, on the topic of interest, to pass or to be considered a master of that topic. Criterion-referenced tests are common for professional licensure and can be contrasted with norm-referenced tests where test scores are made meaningful by making comparisons to others’ test scores. For criterion-referenced tests, administrators and experts can identify the level of ability needed within different content areas to be considered a master of those areas. When that level of ability is identified, items can be selected that have item difficulty estimates that are close to this predetermined ability level (i.e., cutoff). Items that have difficulties near and at the necessary level of ability will provide the most information at their difficulty levels and will therefore be able to make fine-grain distinctions between examinees at the critical ability level (Zickar, Overton, Taylor, & Harms, 1999). This improves confidence in the ability estimates obtained from the exam, because the measurement error will be reduced. In addition, fewer items will be needed, as items that have difficulties that are lower than or higher than the critical ability level are unnecessary for this type of test.
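The item-selection logic for a criterion-referenced test can be sketched by ranking items on the information they provide at the cutoff. The sketch uses the standard 3PL item information formula; the item bank, the item names, the cutoff of θ = 1.0, and the scaling constant D = 1.7 are all hypothetical values chosen for illustration.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the chapter

def p_3pl(theta, a, b, c=0.0):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def info_3pl(theta, a, b, c=0.0):
    """Item information at theta under the 3PL model."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

# Hypothetical item bank: (name, a, b, c).
bank = [
    ("easy", 1.2, -1.5, 0.0),
    ("near_cut", 1.2, 0.9, 0.0),
    ("hard", 1.2, 2.5, 0.0),
    ("weak_near_cut", 0.4, 1.0, 0.0),
]

cutoff = 1.0  # hypothetical mastery cutoff on the theta scale

# Rank items by how much information each provides at the cutoff.
ranked = sorted(bank, key=lambda item: info_3pl(cutoff, *item[1:]), reverse=True)
for name, a, b, c in ranked:
    print(f"{name:14s} information at cutoff = {info_3pl(cutoff, a, b, c):.3f}")

# Test information at the cutoff is the sum of the item information values.
test_info = sum(info_3pl(cutoff, a, b, c) for _, a, b, c in bank)
print(f"test information at cutoff = {test_info:.3f}")
```

In this made-up bank the discriminating item located near the cutoff ranks first, while the very easy item contributes almost nothing at that point on the scale. The poorly discriminating item shows that closeness to the cutoff is not enough on its own.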
Correlations Between Item Parameters

The parameters in CTT are often confounded with each other. Given that the discrimination index is based on a correlation, it is sensitive to the item base rate (which is directly related to the item difficulty statistic). If there is very little variance in item responding, the correlation between the item score and the total score must be attenuated. Although item discrimination and item difficulty should be
theoretically uncorrelated with each other, in CTT the two are often dependent on each other. This interdependency, however, may not be a problem if items with extreme base rates (either extremely high or extremely low) are eliminated from scales. Many scale development guidelines, in fact, advocate eliminating items with extreme base rates, thus reducing this concern.

Reliability as a Monolithic Concept

Another criticism of CTT is that each test is assigned a reliability coefficient that estimates the measurement precision of the whole test; measurement precision is assumed to be a uniform value across the range of the trait. This assumption is clearly false for many tests. In fact, one set of authors called this assumption “hardly credible” (Rojas Tejada & Lozano Rojas, 2005, p. 370) and Samejima called this assumption a “fatal deficiency” (1977, p. 196). As an example of how this assumption is false, a test of basic arithmetic may provide reasonably high discrimination at the lower ends of mathematics ability, though such a test would not be able to differentiate between above-average and average students in college algebra classes (all of whom would presumably ace the arithmetic test). It is possible to examine a test’s discriminating power across the range of traits by examining the IRT-based information functions. Most people who use CTT assign a single value for a scale’s measurement precision, typically using coefficient alpha and possibly the SEoM. The SEoM typically assumes that the measurement precision of the test is the same throughout the range of the trait being measured. There have been attempts to get beyond this limitation. Conditional standard errors can be computed using an expansion of traditional CTT called the binomial error model (see Feldt, 1984). These conditional standard errors, however, have their own limitations (Kolen, Hanson, & Brennan, 1992).
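As a hedged illustration of what a conditional standard error looks like, the sketch below uses one commonly cited binomial-error form, usually attributed to Lord, in which the standard error for a raw score x on an n-item dichotomous test is sqrt(x(n − x)/(n − 1)). Whether this is the exact variant discussed by Feldt (1984) is an assumption, and the 40-item test length is a made-up value.

```python
import math

def conditional_sem(raw_score, n_items):
    """Binomial-error conditional SEM for a number-right score.

    Assumes dichotomous items and the simple binomial error model;
    this is an illustrative form, not necessarily the one used by
    the authors cited in the text.
    """
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))

n_items = 40  # hypothetical test length
for raw in (20, 30, 38, 40):
    print(f"raw score {raw:2d}: conditional SEM = {conditional_sem(raw, n_items):.2f}")
```

Unlike the single SEoM value, this quantity changes with the score: it is largest for mid-range scores and shrinks to zero at a perfect score, which hints at why conditional standard errors of this kind carry limitations of their own.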
In conclusion, CTT has many limitations that can cause problems in terms of interpreting item and person statistics and in using the theory as a framework for psychometric tools such as adaptive testing. These limitations help fuel the urban legend that classical test theory should be pronounced dead. Those who claim that CTT should be proclaimed dead, however, often forget that the major alternative, IRT, also has significant limitations.
Criticisms and Limitations of IRT

Just as CTT has its own limitations, researchers have noted severe limitations of IRT that make the use of its methods difficult, impossible, or impractical in certain scenarios. These limitations include the need for large sample sizes, strong assumptions of unidimensionality, and difficulty running programs. We will review these limitations and evaluate the consequences of each of them.

Large Sample Sizes

IRT models are more complex than CTT models in that they have more parameters to estimate. This means that all else equal, to measure IRT parameters with equal precision as their CTT counterparts, sample sizes will need to be larger with IRT. There is no set rule or heuristic on the sample size needed to run various IRT models, though more complex IRT models require larger sample sizes. For the simplest IRT model, the Rasch model (a simple IRT model that has only a difficulty parameter to estimate for each item), the number of parameters to estimate equals k + n, where k equals the number of items and n equals the number of respondents. With the 3PL model, the number of parameters to estimate equals 3k + n. In addition, not all parameters are able to be estimated equally well. The pseudo-guessing (c) parameter within the 3PL model depends on having a large number of respondents at the lower end of the θ continuum. Although there are no set rules for sample sizes, most IRT studies rely on sample sizes over 200, with most studies that use polytomous IRT models requiring even larger sample sizes. The one exception is the Rasch model, which, because of its simplicity, has often relied on sample sizes smaller than 200. In cases where researchers are limited to small sample sizes, either because of practical constraints or because they are studying rare phenomena, classical test theory–based approaches might be the only viable option.
It should be noted that advances in estimation have greatly increased the efficiency of IRT estimation. Marginal maximum likelihood estimation has reduced the number of cases needed for accurate estimation. Readers who stumble across articles that used previous methods of estimation (e.g., joint maximum likelihood) should ignore the discussions of sample size requirements.
Strong Assumptions

IRT has often been called a strong test theory in that the assumptions behind it are relatively difficult to satisfy. The main assumption behind IRT is local independence. As stated before, local independence means that once θ has been controlled for, there should be no relationship between items. For unidimensional IRT, this means that tests should be unidimensional, in that once someone’s score on the underlying θ dimension is known, there should be no other information that can be used to help predict whether they answer the item correctly. Strict unidimensionality is more of a mythological concept than an attainable reality for most psychological constructs. It would be rare for items on a psychological test to measure variance due solely to the underlying construct. For example, with reading comprehension items, there may be nuisance factors related to some of the underlying content used in passages of items, although good item-writing procedures work to minimize that variance. With personality items, the challenge of writing strictly unidimensional items is even more futile given the multiple determinations of personality and the inherent correlations between most personality constructs.

In general, it might be better to think of unidimensionality as a continuous (i.e., a matter of degrees) concept as opposed to a categorical one (i.e., either one has it or not). Monte Carlo simulation research has shown that strict levels of unidimensionality are not necessary for IRT models to recover item and person parameters with high levels of accuracy. Reckase (1979) found that as long as the first factor explained 20% of the scale variance and there was not a dominant second factor, an IRT model worked well (see further work by Harrison, 1986, and Kirisci, Hsu, & Yu, 2001). The concept of “sufficient unidimensionality” has been coined to represent unidimensionality that is less than perfect though still acceptable for using IRT models.
Unfortunately with some types of data (e.g., biographical data and situational judgment tests), even sufficient levels of unidimensionality may not be possible. In these cases, multidimensional IRT models may be needed or perhaps subscales could be made that are more unidimensional.
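A quick screen in the spirit of the 20% guideline can be sketched by looking at the eigenvalues of the inter-item correlation matrix. This is only a rough proxy: using principal-component eigenvalues of Pearson correlations is an assumption and may differ from the factor-analytic procedure in the studies cited above. The function name `first_factor_share`, the simulated data, and the loading of 0.7 are hypothetical.

```python
import numpy as np

def first_factor_share(scores):
    """Return (share of variance for the first eigenvalue, second/first ratio).

    scores: people-by-items array. Eigenvalues come from the Pearson
    inter-item correlation matrix, used here as a rough proxy for factors.
    """
    corr = np.corrcoef(np.asarray(scores, dtype=float), rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)  # returned in ascending order
    share = eigenvalues[-1] / eigenvalues.sum()
    ratio = eigenvalues[-2] / eigenvalues[-1]
    return share, ratio

# Simulated, hypothetical data: 10 items that all load 0.7 on one trait.
rng = np.random.default_rng(1)
n_people, n_items = 5000, 10
theta = rng.normal(size=(n_people, 1))
noise = rng.normal(size=(n_people, n_items))
scores = 0.7 * theta + np.sqrt(1 - 0.7 ** 2) * noise

share, ratio = first_factor_share(scores)
print(f"first eigenvalue share = {share:.2f}, second/first ratio = {ratio:.2f}")
```

For these simulated items the first eigenvalue accounts for a little over half of the total variance and the second is small by comparison, so the data would pass this informal screen. Real item data, especially dichotomous items, may call for other correlation types or a formal factor analysis.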
Complicated Programs

Another criticism that is becoming less potent over the years is that IRT estimation programs are difficult to run. In recent years, the “friendliness” of IRT software has increased dramatically, with most of the programs using window-based interfaces that allow for pointing-and-clicking to set up execution program parameters. One limitation that still exists is that all the programs that we are familiar with still do not interface well with the commonly used statistical packages SPSS and SAS. Unfortunately, many of the new users of statistics are taught the point-and-click techniques of SPSS and SAS with little instruction regarding how to develop code for those statistics, let alone the underlying assumptions and decisions that need to go into a particular statistical analysis. For these naïve consumers, learning an additional program needed to do the psychometric analysis is a disincentive. Given that most components of CTT-based analyses can be done in the SPSS and SAS frameworks, there often is a strong disincentive for researchers more interested in substantive issues as opposed to psychometric issues to learn new programs. In those cases, we would hope that the substantive researchers could partner with others who can do the psychometric “heavy lifting.”

Times to Use CTT

Although the urban legend is that CTT is dead, we believe there are many scenarios in which it would be preferable to use. Most of these reasons fall into two categories: limitations in data that might preclude IRT, and practical considerations that might make CTT preferable.

Small Sample Sizes

With regard to data limitations, as noted before, classical test models require less data than IRT models; thus, if one is limited to a small sample size, the only real option would be to conduct an item analysis using CTT methods and determine the reliability of the test using such methods.
There is good news for people who are stuck with small amounts of data in that, for many purposes, the decisions that are made with CTT-based item analyses are often similar to what
would be made using the more sophisticated IRT approach. For example, there is often a high correspondence between CTT and IRT item discrimination estimates (see Ellis & Mead, 2002; Fan, 1998; MacDonald & Paunonen, 2002). The same finding often occurs with item difficulty statistics. In general, items that have low discrimination or extreme difficulty will most likely be identified as such using either method. If the goal is a cursory item analysis to identify poorly functioning items, CTT would be sufficient. There would be some exceptions to this; for example, if there is significant guessing on items, the convergence of IRT and CTT item statistics would be lower than if there is no guessing.

In addition, IRT-based estimates of ability, θ, are generally highly correlated with raw scale scores computed simply by adding up item scores. For example, MacDonald and Paunonen (2002), in a Monte Carlo simulation, found the correlation between θ and number right to be above .97 (in all conditions they studied), suggesting that similar decisions would be made based on IRT estimates of ability and raw scores (see also Lambert, Nelson, Brewer, & Burchinal, 2006). If the goal is simply to compute individual scores, there often is little need to use IRT estimates. Again, there would be some exceptions. For example, the convergence between IRT and CTT ability estimates could be smaller with scales with few items compared to scales with many items. In addition, the convergence between IRT and CTT-based estimates of ability may be much lower in the extreme ranges of ability (e.g., O’Connor, 2004). In general, however, there would be no need to use IRT for scoring alone.

Multidimensional Data?

Some people have suggested that it would be preferable to use CTT methods in the presence of multidimensional data. IRT models often fail to converge with high levels of multidimensional data and so in some cases, people may view CTT as a preferable alternative.
But with high levels of multidimensionality, the meaning of the true score is confounded; using CTT may be possible, but the results may be uninterpretable. Multidimensionality is not a legitimate reason to use CTT. With low levels of multidimensionality, at least one study has shown that IRT methods are preferable. Sinar and Zickar (2002)
showed that IRT methods were able to ignore the presence of deviant items in the context of a large number of items that measure the primary factor. Their simulation was modeled after a common scale development scenario in which a large number of items are combined with a small number of items that measure similar but distinct constructs. With IRT, these deviant items were given extremely low discriminations and hence were given small weight when computing trait scores. With CTT, however, all items are typically weighted the same. Therefore, deviant items are given the same weight as good items with the result that trait scores are more influenced by irrelevant constructs with CTT. Although some people have speculated that CTT models may be preferable with multidimensionality, we do not believe that is the case. With small amounts of multidimensionality, IRT methods are preferable. With large amounts of multidimensionality, preferable approaches would be to break down the multidimensional tests into unidimensional subtests or to model the dimensionality directly using multidimensional IRT (see Reckase, 1997).

CTT Supports Other Methodologies

In addition, CTT supports other statistical applications that are not readily available through IRT and these applications can in fact be helpful in the justification of using IRT. Such statistical applications include factor analysis and structural equation modeling (SEM). Factor analysis, both exploratory factor analysis (EFA) and confirmatory factor analysis (CFA), is based on the CTT measurement foundation (X = T + E). SEM uses CFAs to assess relationships among variables that have been disattenuated for measurement error (Lance & Vandenberg, 2002). In addition, SEM can be used to assess the feasibility of complicated models, composed of many interrelationships, which can include the assessment of mediation and moderation (Cheung, 2007; Bollen, 1989).
By using SEM, researchers can assess the veracity of proposed models and theories at the true score level (Bollen, 1989). When using these advanced statistical techniques, CTT is still relevant. Although there have been efforts to link these advanced psychometric techniques to IRT (see McDonald, 1999), CTT is still the foundation for these techniques.
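The idea of disattenuating a relationship for measurement error has a simple classical analogue, Spearman's correction for attenuation, which follows directly from the X = T + E model. The sketch below is only an illustration under that assumption; the function name and the example values are hypothetical, and SEM programs estimate disattenuated relationships through latent variable models rather than through this formula.

```python
import math


def disattenuate(r_xy, rel_x, rel_y):
    """Estimate the true-score correlation from an observed correlation.

    r_xy  : observed correlation between scales X and Y
    rel_x : reliability of X (e.g., coefficient alpha), in (0, 1]
    rel_y : reliability of Y, in (0, 1]
    """
    if not (0.0 < rel_x <= 1.0 and 0.0 < rel_y <= 1.0):
        raise ValueError("reliabilities must lie in (0, 1]")
    # Classical correction for attenuation: r_true = r_xy / sqrt(rel_x * rel_y)
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical example: an observed correlation of .30 between two scales
# with reliabilities of .80 and .70.
r_true = disattenuate(0.30, 0.80, 0.70)
print(round(r_true, 3))
```

With perfectly reliable measures the correction leaves the observed correlation unchanged, which is one quick sanity check on the function.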
The Partial Revival of a Dead Horse?
Times to Use IRT

There are many scenarios when IRT would be preferable to CTT. In general, IRT should be used if test developers have specific hypotheses and need to concentrate measurement precision at a certain range of the latent trait, if researchers want to model the process used by respondents when answering test items, and if researchers want to take advantage of many of the psychometric tools that have flourished because of IRT. In each of these cases, consistent with the urban legend, CTT may prove to be of little use.

Focus on Particular Range of Construct

First, there are many times when researchers do not just want to develop the most reliable test possible but want to maximize measurement precision at a specific region of the latent trait. For example, if a test were being used as an early screening device in a sequential selection system, it may be important to maximize measurement precision at the moderately low level of the trait. At this point in the selection system, it may not be important to differentiate between the top candidates; it would be more important to differentiate between the candidates who cannot succeed and all the rest of the candidates. In professional certification tests, it would be important to maximize precision at the point that differentiates between those who would be acceptable doctors, accountants, teachers, or engineers and those who should not be allowed to practice in their profession (see Cizek, 2001). Using IRT, it would be possible to choose items that maximize information at the range of the trait that is of concern.

Conduct Goodness-of-Fit Studies

One of the differences between CTT and IRT is that with the latter, one can compute strict goodness-of-fit analyses to see if a particular IRT model fits the data. These analyses can be useful for determining the appropriateness of a particular IRT model and should be required before interpreting the meaning of IRT parameter estimates.
There are several different approaches to determining model fit. One approach compares nested models and is based on an
overall likelihood function (see Orlando & Thissen, 2000). For example, this approach could test whether a model that allows discrimination parameters to vary across items fits better than a model that constrains all discrimination parameters to be equal. This approach, however, allows one to determine only relative fit. It may be the case that one model fits better than the other even though both models fit the data poorly in an absolute sense. Another approach to model fit creates expected probabilities of item responding based on the estimated model and compares those probabilities to the probabilities observed in the actual data. Based on the observed and expected probabilities, it is possible to compute a chi-square statistic (see Drasgow, Levine, Tsien, Williams, & Mead, 1995). In addition to chi-square statistics, there are graphical methods of determining fit that plot observed data against IRFs so that one can determine where in the θ continuum misfit occurs. These fit analyses provide opportunities for IRT modelers to choose between various models and to make judgments about the accuracy of models in capturing some aspect of the response process. Just as in structural equation modeling (SEM), where there has been a proliferation of fit indexes, obsession with goodness-of-fit can be unproductive. As with most goodness-of-fit indexes, the IRT fit statistics are sensitive to sample size, in that with a large enough sample size, trivial instances of misfit will result in statistical significance on the fit indexes. SEM researchers have coped with the difference between statistical significance and practical significance in misfit by coming up with a variety of indexes that are less sensitive to sample size. In IRT, there has been relatively less attention to issues of fit, and so researchers have not yet learned how to differentiate between misfit that needs to be addressed and that which can be tolerated.
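The observed-versus-expected approach and its sensitivity to sample size can be sketched in a few lines. The example below assumes a two-parameter logistic (2PL) item, five hypothetical θ groups of equal size, and an observed proportion correct that exceeds the model-implied probability by a constant .02 in every group; all names and numbers are illustrative and are not taken from the studies cited above. The critical value of 11.07 (chi-square, 5 degrees of freedom, α = .05) is used only as a rough benchmark, because the proper degrees of freedom depend on how many parameters were estimated.

```python
import math


def irf_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def chi_square_fit(thetas, observed_props, group_n, a, b):
    """Pearson chi-square comparing observed and expected proportions correct.

    Each theta group contributes n * (O - E)^2 / (E * (1 - E)), which equals
    the usual sum over the "correct" and "incorrect" cells of that group.
    """
    chi2 = 0.0
    for theta, p_obs in zip(thetas, observed_props):
        p_exp = irf_2pl(theta, a, b)
        chi2 += group_n * (p_obs - p_exp) ** 2 / (p_exp * (1.0 - p_exp))
    return chi2


thetas = [-2.0, -1.0, 0.0, 1.0, 2.0]
# Hypothetical data: the same small misfit (.02) at every trait level.
observed = [irf_2pl(t, 1.0, 0.0) + 0.02 for t in thetas]

chi2_small = chi_square_fit(thetas, observed, group_n=100, a=1.0, b=0.0)
chi2_large = chi_square_fit(thetas, observed, group_n=10000, a=1.0, b=0.0)

CRITICAL_DF5 = 11.07  # illustrative benchmark only
print(chi2_small > CRITICAL_DF5, chi2_large > CRITICAL_DF5)
```

The amount of misfit is identical in the two runs; only the group size differs, so the statistic grows in direct proportion to the sample size.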
Although some researchers may view the need to evaluate goodness-of-fit as another psychometric hassle, we believe goodness-of-fit evaluation provides an opportunity for IRT researchers. We believe in the logic of falsifiability (à la Popperian scientific logic) and believe that it is a strength that can provide insight into respondent behavior. In a simple case, testing the differences in fit between the 2PL and the 3PL can provide insight into whether guessing is prevalent within a particular sample. IRT models have been used to develop insights into how respondents fake personality tests (Zickar & Robie,
1999), among other areas. It is hard to imagine how CTT models could provide such insight.

IRT Supports Many Psychometric Tools

Finally, one of the biggest reasons to use IRT models is that many of the most advanced psychometric tools and applications depend on IRT. Applications such as differential item functioning (DIF), appropriateness measurement, and computerized adaptive testing (CAT) have proven valuable to researchers and have helped applied psychologists provide better services. Each of these tools has had a precursor based on CTT. For example, item bias (e.g., DIF) analyses can be conducted using Mantel-Haenszel procedures, which rely on a contingency-table (chi-square) framework; in fact, the Mantel-Haenszel procedure is still used by those who do not wish to use IRT or by those stuck with small sample sizes. However, DIF analyses based on IRT provide many advantages in that they allow researchers to search for certain types of hypothesized item bias and are able to identify types of item bias that the Mantel-Haenszel procedure cannot detect (Hambleton & Rogers, 1989). With appropriateness measurement, researchers aim to identify respondents who are responding to test items in an idiosyncratic manner that sets them apart from other respondents. Non-IRT-based approaches to appropriateness measurement compare respondent data to item difficulty statistics and look to see if individuals have a problematic pattern (e.g., get difficult items correct but miss easy items; Harnisch & Linn, 1981). IRT-based appropriateness indices look for individuals who deviate from the IRT model (see Levine & Rubin, 1979). The IRT-based models incorporate more information and provide more flexibility compared to the CTT-based methods. CAT deserves special mention in that it has flourished with IRT’s popularization. Although it is technically possible to do some adaptive testing without IRT, such testing is awkward and inefficient.
For example, it is possible to administer a small test and then, based on that small test, route individuals to different exams based on their performance on the short test; people who score poorly on the initial test would receive easier exams than those who scored better on the initial test. True adaptive tests, however, are much more efficient in that items are chosen to provide maximal information given the responses to all previous
items in the exam. The detailed, model-based approach of IRT allows this to be done in that statistics can be computed to derive specific predictions about how individuals are likely to respond to individual items. If one plans to be in the business of adaptive testing, it is necessary to learn IRT.

Conclusions

IRT has many advantages compared to CTT that are important to advancing the psychometric quality of our instruments. IRT models are theory-based, allow for testing of specific hypotheses, and facilitate advanced psychometric tools. In an ideal world filled with unlimited sample sizes, perfectly unidimensional scales, psychometrically savvy researchers and reviewers, plus computer programs that can read one’s mind to guide analyses, there might be good reason to relegate CTT methods to the same shelf in the library that hosts once important ideas such as the flat earth, Galen’s theory of personality based on bodily fluids, and the notion that there might be somebody who can secretly turn lead into gold. Of course, we do not live in a psychometrically perfect world; pragmatic considerations often trump psychometric concerns. Many others have recognized that IRT is not a panacea for all psychometric ills and that CTT still has a place in the psychometric toolkit. For example, one group of researchers noted: “We want to emphasize that analyses based on traditional classical test theory-based psychometric procedures remain valuable and informative” (Casillas, Schulz, Robbins, Santos, & Lee, 2006, p. 486). Just because a particular device is more powerful does not mean that that device is always preferable to less powerful devices. Although electron microscopes provide much more detailed magnification compared with optical microscopes, the use of the former is not always warranted. For example, if one were looking for a malignant growth in a specific area from a liver biopsy, an electron microscope would be vastly superior to an optical microscope.
If, however, one were examining that same biopsy but did not have a clue what one was looking for, an optical microscope might be a better first step. Given the generally vague goals of many item analyses (e.g., just choose the best items), the added precision of IRT may not help researchers. In addition, as described in this chapter, for many circumstances, CTT and IRT
will provide similar answers. Moreover, there are extremely useful applications available within the CTT framework that are not available within the IRT framework, and some of these applications can be used to justify the application of many IRT models. In short, we believe IRT is a powerful psychometric theory that all researchers should learn. In many situations, however, CTT should be the applied theory of choice. The test theory that some have labeled a “dead horse” has some more races left!

References

Allen, M. J., & Yen, W. (1979). Introduction to measurement theory. Monterey, CA: Brooks/Cole.
Bock, R. D. (1997). A brief history of item response theory. Educational Measurement: Issues and Practice, 16, 21–32.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Borsboom, D. (2005). Measuring the mind: Conceptual issues in contemporary psychometrics. New York: Cambridge University Press.
Borsboom, D., & Mellenbergh, G. J. (2002). True scores, latent variables, and constructs: A comment on Schmidt and Hunter. Intelligence, 30, 505–514.
Casillas, A., Schulz, E. M., Robbins, S. B., Santos, P. J., & Lee, R. M. (2006). Exploring the meaning of motivation across cultures: IRT analyses of the Goal Instability Scale. Journal of Career Assessment, 14, 472–489.
Cheung, M. W. L. (2007). Comparison of approaches to constructing confidence intervals for mediating effects using structural equation models. Structural Equation Modeling, 14, 227–246.
Cizek, G. J. (Ed.). (2001). Setting performance standards: Concepts, methods, and perspectives. Mahwah, NJ: Erlbaum.
Drasgow, F., Levine, M. V., Tsien, S., Williams, B., & Mead, A. D. (1995). Fitting polytomous item response theory models to multiple-choice tests. Applied Psychological Measurement, 19, 143–165.
Dudek, F. J. (1979). The continuing misinterpretation of the standard error of measurement. Psychological Bulletin, 86, 335–337.
Ellis, B. B., & Mead, A. D. (2002). Item analysis: Theory and practice using classical and modern test theory. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 324–343). Malden, MA: Blackwell.
Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Mahwah, NJ: Erlbaum.
Fan, X. (1998). Item response theory and classical test theory: An empirical comparison of their item/person statistics. Educational and Psychological Measurement, 58, 357–381.
Feldt, L. S. (1984). Some relationships between the binomial error model and classical test theory. Educational and Psychological Measurement, 44, 883–891.
Hambleton, R. K., & Rogers, H. J. (1989). Detecting potentially biased test items: Comparison of IRT area and Mantel-Haenszel methods. Applied Measurement in Education, 2, 313–334.
Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.
Harnisch, D. L., & Linn, R. L. (1981). Analysis of item response patterns: Questionable test data and dissimilar curriculum practices. Journal of Educational Measurement, 18, 133–146.
Harrison, D. A. (1986). Robustness of IRT parameter estimation to violations of the unidimensionality assumption. Journal of Educational Statistics, 11, 91–115.
Harvey, R. J., & Hammer, A. L. (1999). Item response theory. The Counseling Psychologist, 27, 353–383.
Harvill, L. M. (1991). Standard error of measurement. Educational Measurement: Issues and Practice, 10, 33–41.
Kirisci, L., Hsu, T., & Yu, L. (2001). Robustness of item parameter estimation programs to assumptions of unidimensionality and normality. Applied Psychological Measurement, 25, 146–162.
Kolen, M. J., Hanson, B. A., & Brennan, R. L. (1992). Conditional standard errors of measurement for scale scores. Journal of Educational Measurement, 29, 285–307.
Lambert, R. G., Nelson, L., Brewer, D., & Burchinal, M. (2006). Measurement issues and psychometric methods in developmental research. Monographs of the Society for Research in Child Development, 71, 24–41.
Lance, C. E., & Vandenberg, R. J. (2002). Confirmatory factor analysis. In F. Drasgow & N. Schmitt (Eds.), Measuring and analyzing behavior in organizations: Advances in measurement and data analysis (pp. 221–254). San Francisco: Jossey-Bass.
Levine, M. V., & Rubin, D. B. (1979). Measuring the appropriateness of multiple-choice test scores. Journal of Educational Statistics, 4, 269–290.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
MacDonald, P., & Paunonen, S. V. (2002). A Monte Carlo comparison of item and person statistics based on item response theory versus classical test theory. Educational and Psychological Measurement, 62, 921–943.
McDonald, R. P. (1999). Test theory. Mahwah, NJ: Erlbaum.
Nunnally, J. C. (1978). Psychometric theory. New York: McGraw Hill.
O’Connor, D. P. (2004). Comparison of two psychometric scaling methods for ratings of acute musculoskeletal pain. Pain, 110, 488–494.
Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response theory models. Applied Psychological Measurement, 24, 50–64.
Reckase, M. D. (1979). Unifactor latent trait models applied to multifactor tests: Results and implications. Journal of Educational Statistics, 4, 207–230.
Reckase, M. D. (1997). The past and future of multidimensional item response theory. Applied Psychological Measurement, 21, 25–36.
Reise, S. P., & Henson, J. M. (2000). Computerization and adaptive administration of the NEO PI-R. Assessment, 7, 347–364.
Rojas Tejada, A. J., & Lozano Rojas, O. M. (2005). Application of an IRT polytomous model for measuring health related quality of life. Social Indicators Research, 74, 369–394.
Samejima, F. (1977). Weakly parallel tests in latent trait theory with some criticisms of classical test theory. Psychometrika, 42, 193–198.
Sinar, E. F., & Zickar, M. J. (2002). Evaluating the robustness of graded response model and classical test theory parameter estimates to deviant items. Applied Psychological Measurement, 26, 181–191.
Thorndike, R. L. (1964). Reliability. In Proceedings of the 1963 Invitational Conference on Testing Problems (pp. 23–32). Princeton, NJ: Educational Testing Service.
Zickar, M. J., Overton, R. C., Taylor, L. R., & Harms, H. J. (1999). Developing an adaptive test to hire computer programmers. In F. Drasgow & J. Olson-Buchanan (Eds.), Innovations in computerized assessment. Hillsdale, NJ: Erlbaum.
Zickar, M. J., & Robie, C. (1999). Modeling faking good on personality items: An item-level analysis. Journal of Applied Psychology, 84, 551–563.
3
Four Common Misconceptions in Exploratory Factor Analysis
Deborah L. Bandalos and Meggen R. Boehm-Kaufman
Although we have no data to support this claim, our experience suggests that exploratory factor analysis may be second only to structural equation modeling in the types and numbers of questionable practices conducted in its name. In this chapter we focus on the use of exploratory, rather than confirmatory, factor analysis. Although the distinction between exploratory and confirmatory analysis is somewhat murky, we operationalize this by simply stating that by exploratory factor analysis (EFA) we mean the class of factor analytic procedures available through such commonly available packages as SAS and SPSS. We reserve the term confirmatory factor analysis (CFA) for the procedures available through structural equation modeling programs. CFA is discussed in chapter 7 of this volume. Throughout this chapter, we use the term factor analysis in a general sense to include both component and common factor analysis. For situations in which we wish to make a distinction between these two methods, we use the terms common factor and component analysis.

In recent reviews of exploratory factor analysis applications, researchers have described the state of the art as “routinely quite poor” (Fabrigar et al., 1999, p. 295), leading to “potentially misleading factor analytic results” (Preacher & MacCallum, 2003, p. 14). In this chapter we will discuss four misconceptions we feel are commonly observed in applied studies:

• The choice between component and common factor extraction procedures is inconsequential.
• Orthogonal rotation results in better simple structure than oblique rotation.
• The minimum sample size needed for factor analysis is… (insert your favorite guideline).
• The “Eigenvalues Greater Than One” rule is the best way of choosing the number of factors.
These misconceptions can lead to misleading results in applied factor analytic research. Fortunately, the solutions are usually straightforward, often involving simply clicking on a different option in the computer package being used.

The Choice Between Component and Common Factor Analysis Is Inconsequential

Although we have found no published studies in which the author(s) have flatly stated that the choice between component and common factor analysis is inconsequential, we nevertheless feel that researchers, as well as members of editorial boards, are either unaware of the distinction or feel it is unimportant. As evidence of this we offer the large number of published applications in which the author(s) either do not report whether a component or common factor analysis was used, or use an analysis that is not compatible with the purposes of the study. For example, Fabrigar et al. (1999) surveyed applications in the Journal of Personality and Social Psychology and the Journal of Applied Psychology and found that in 22% of the applications in the former and 26% in the latter the authors did not feel it was necessary to report which of the two methods of extraction was used, perhaps preferring not to waste valuable print space on such a trivial detail. Similarly, Russell (2002) noted in a review of applications in the Personality and Social Psychology Bulletin that authors of 26% of the factor analytic studies made no mention of the method of extraction. Conway and Huffcutt (2003) surveyed three journals in the area of organizational research (Organizational Behavior and Human Decision Processes, Journal of Applied Psychology, and Personnel Psychology) from 1985 to 1999 and found that component analysis was used in about 40% of the 371 studies found, whereas the method of extraction was not reported in 28% of the studies reviewed. Finally, Henson and Roberts (2006), in a review of factor analysis applications
in four journals that routinely publish psychometric studies, and might therefore be expected to maintain a higher reporting standard (Educational and Psychological Measurement, Journal of Educational Psychology, Personality and Individual Differences, and Psychological Assessment), nevertheless found that 13% failed to report the extraction method. In preparation for this chapter, we reviewed articles in the Journal of Applied Psychology, Journal of Educational Psychology, and the Journal of Personality and Social Psychology for the years 1980, 1990, and 2000. Our own review was consistent with previous work. For the three journals reviewed we found that, of the articles reporting factor analytic results, approximately 22% failed to report the specific extraction method used. Of those studies reporting the method of extraction, 53% reported using component analysis as either the sole method of extraction or in concert with some form of common factor analysis.

Another cause for concern is the number of studies in which the author(s) used an analysis that was incompatible with the stated goals of the study. In component analysis, a set of variables is transformed into a smaller set of linear composites known as components. Thus, component analysis is essentially a method for data reduction. As an example, a researcher may want to predict performance from scores on a large number of aptitude and achievement tests. However, because the test scores are known to be intercorrelated, the researcher may wish to “boil them down” into a smaller set of composite variables, or components. The components could then be used as predictors in place of the variables, thereby avoiding potential collinearity problems. Common factor analysis, on the other hand, is concerned with uncovering the latent constructs underlying the variables, in an attempt to better understand the nature of such constructs (Fabrigar et al., 1999; Worthington & Whittaker, 2006).
For example, instead of creating linear composites of the test scores, as in the previous example, the researcher may want to identify the dimensions underlying the scores in order to better understand the constructs driving their intercorrelations. Such a goal would call for a common factor analysis, in which an attempt would be made to understand and name the underlying factors. Given this difference in purpose, it is dismaying to find that component analysis is often used for situations in which common factor analysis would be more appropriate. For example, Fabrigar et al. (1999) pointed out that component analysis was used in
approximately half of the applications they reviewed, even though the goals of these studies were better suited to the use of common factor analysis. Similarly, Conway and Huffcutt (2003) reported that, although reducing the number of variables was the stated goal in only one of the studies they reviewed, 40% used component analysis. In our own review, we found that of the studies in which the method of extraction was reported, component analysis was used in 53%, even though reducing the number of variables was not the purpose of these analyses.

Despite the confusion in the applied literature, the distinction between component and common factor analysis is really quite simple: in component analysis all of the variance among the variables is analyzed, whereas in common factor analysis only the shared variance is factored. This is accomplished by factoring the entire correlation (or covariance) matrix in component analysis, or by replacing the diagonal elements of the matrix with estimates of the shared variance (known as communalities) in common factor analysis. This difference can be seen in the equations for the two procedures. The equation for common factor analysis contains an error, or uniqueness, term:
\[
X_{iv} = w_{v1}F_{1i} + w_{v2}F_{2i} + \cdots + w_{vf}F_{fi} + w_{vu}U_{iv} \tag{3.1}
\]
where $X_{iv}$ is the score of person $i$ on variable $v$, $w_{vf}$ is the weight of variable $v$ on common factor $f$, $F_{fi}$ is the score on factor $f$ of person $i$, $w_{vu}$ is the weight of variable $v$ on the error, or unique, factor, and $U_{iv}$ is the score of person $i$ on the unique factor for variable $v$. The uniqueness is actually composed of two parts: the specific variance and the unreliable variance. Specific variance is reliable variance that is particular to the variable and not shared with the common factors, whereas unreliable variance is due to random measurement error. In contrast, the equation for component analysis is
\[
X_{iv} = w_{v1}F_{1i} + w_{v2}F_{2i} + \cdots + w_{vf}F_{fi} \tag{3.2}
\]
in which the terms are defined as before and which, as can be seen, contains no uniqueness term. Thus, the two procedures are based on different models. In component analysis, all of the variance, including
that which is not shared with any other variables, is analyzed, while in common factor analysis, only the shared variance is analyzed. As Widaman (1993) explains, the differences in purpose between the two methods arise from this difference in model formulation. In a component analysis, the purpose is to reduce the dimensionality of the data by creating a weighted composite of the observed variables, error and all. If, however, the goal of analysis is to model the covariation among the observed variables as being due to one or more latent constructs, then the unique variance should be minimized. This goal is accomplished through the use of common factor analysis. Thus, as noted previously, the two methods differ in purpose, with component analysis typically being recommended for reducing a large number of variables to a smaller, more manageable number of components, and common factor analysis being better suited for identifying the latent constructs underlying the variables (Conway & Huffcutt, 2003; Fabrigar et al., 1999; Worthington & Whittaker, 2006).

Although it may be the case that some researchers and article reviewers are simply not aware of the difference between component and common factor analysis, there are at least three other reasons that researchers may choose between them in a somewhat arbitrary manner. One is that (principal) component analysis is the default method in both SPSS and SAS, and some researchers not familiar with the differences may assume that the default method is “best” in some sense. Second, methodologists themselves cannot seem to agree on which method should be used, and this choice continues to be hotly debated on both methodological and philosophical grounds. The primary methodological arguments against common factor analysis are the indeterminacy of factor scores and the occurrence of improper estimates.
Philosophical arguments center around the defensibility of viewing component analysis as a latent variable, rather than simply a data reduction, method. Finally, the third reason that researchers may understandably be confused about the differences between component and factor analysis is that the two methods can and do yield very similar results under certain conditions.

The first point in the preceding paragraph does not require further elaboration; it is simply a fact that principal component analysis is the default method of extraction in both SPSS and SAS. However, brief discussions of the latter two points may be useful in elucidating the differences between the two methods. In the sections that follow, we will present both the methodological and philosophical
arguments in the component versus common factor analysis debate, and end with a discussion of the actual differences in results obtained from the two methods.

The Component Versus Common Factor Debate: Methodological Arguments

As noted in the previous paragraphs, methodologists disagree markedly on whether component or common factor analysis is preferable. For example, in a frequently cited article, Velicer and Jackson (1990) state that “the major conclusion of this article is that there is little basis to prefer either component analysis or factor analysis.” This view is echoed by Wilkinson (1985), writing in the manual for the SYSTAT statistical package and quoted by Borgatta, Kercher, and Stull (1986), that “principal components and common factor solutions for real data rarely differ enough to matter” (p. 264). Conversely, Widaman (1993) states that “it seems that the prudent researcher should rarely, if ever, opt for a component analysis of empirical data if his/her goal were to interpret the patterns of observed covariation among variables as arising from latent variables or factors” (p. 308). Overall, the relative advantages of component and common factor analysis have been the subject of intense debate among methodologists, to the extent that a special issue of the journal Multivariate Behavioral Research (volume 25[1], 1990) was devoted to this topic.

The two methodological issues most commonly cited by proponents of component analysis as shortcomings of common factor analysis are factor score indeterminacy and the occurrence of Heywood cases. The arguments surrounding factor score indeterminacy fomented a spate of articles in a special issue of the journal Multivariate Behavioral Research (volume 31[4], 1996). Acito and Anderson (1986) provide a clear explanation of this issue, which can be summarized as follows. Calculation of factor scores is based on equations relating the observed variables to the factors, as shown in Equation 3.1.
For a set of v variables, there are v such equations. However, note that in common factor analysis scores on f common factors as well as on v unique factors must be estimated. This results in a total of f + v unknowns that must be estimated from only v equations, a problem that is analogous to trying to solve for both x and y in an equation
such as x + y = 10. The solution is indeterminate not because there are no values for x and y that will satisfy the equation, but because there are too many such values. In the case of factor scores, the problem is not that there is no set of factor scores that can be obtained from the variable scores; it is that there are many such sets of factor scores. Note that a similar problem does not exist for the component model, because the unique factor scores are not estimated. In the component case, the v equations can be solved for the f component scores uniquely, assuming a full component solution (i.e., the number of components retained is equal to the number of variables) is obtained.

The factor indeterminacy debate is not over the existence of the indeterminacy problem; this is acknowledged by both camps. Instead, the discussion centers around the degree to which such indeterminacy is a problem. Defenders of common factor analysis (Gorsuch, 1997; McArdle, 1990) acknowledge that the factor scores obtained from this method are necessarily indeterminate but argue that this is not a compelling reason to abandon the method because applied researchers are rarely interested in saving and using factor scores. They further argue that if factor scores are of interest, methods of confirmatory factor analysis (CFA) can be used to obtain scores that do not suffer from the problem of indeterminacy (Gorsuch, 1997).

Proponents of component analysis have also noted that common factor analysis is prone to Heywood cases. These are negative estimates of the uniquenesses in common factor analysis. As a side note, because SPSS and SAS do not print out estimates of the uniquenesses, evidence of Heywood cases can be inferred from factor pattern loadings greater than one. Of course, such estimates do not occur for component analysis because uniquenesses are not estimated.
Not surprisingly, advocates of common factor analysis argue that, although negative estimates of uniquenesses do occur, they are not necessarily problematic. For example, Gorsuch (1997) stated that Heywood cases occur only when iterated communalities are used and recommends that communalities be iterated only two to three times to avoid this problem. Fabrigar et al. (1999) expressed the view that Heywood cases should not necessarily be seen as problematic in that they “often indicate that a misspecified model has been fit to the data or that the data violate assumptions of the common factor model” (p. 276) and can therefore be seen as having diagnostic value.
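The check for Heywood cases described above can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes an orthogonal solution, so that each communality is the row sum of squared pattern loadings, and both the helper name `heywood_rows` and the loading values are invented for the example rather than taken from any study cited here.

```python
import numpy as np

def heywood_rows(loadings):
    """Return uniquenesses and the indices of variables with negative uniqueness.

    Assumes orthogonal factors, so communality = row sum of squared loadings.
    """
    loadings = np.asarray(loadings, dtype=float)
    communalities = (loadings ** 2).sum(axis=1)
    uniquenesses = 1.0 - communalities
    flagged = np.where(uniquenesses < 0)[0]
    return uniquenesses, flagged

# Hypothetical pattern matrix (4 variables, 2 factors); values are invented.
pattern = np.array([
    [0.70, 0.10],
    [1.05, 0.05],   # a loading above one implies a negative uniqueness
    [0.20, 0.60],
    [0.10, 0.80],
])

uniq, flagged = heywood_rows(pattern)
print("uniquenesses:", np.round(uniq, 3))
print("variables with negative uniqueness:", flagged)
```

Only the second variable, whose loading exceeds one, is flagged, which mirrors the rule of thumb for reading output that does not print uniquenesses.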
Deborah L. Bandalos and Meggen R. Boehm-Kaufman
The Component Versus Common Factor Debate: Philosophical Arguments
We turn now to a brief discussion of the philosophical differences between component and common factor analysis. Expositions of these views have been provided in articles by Mulaik (1987) and Haig (2005), as well as in the previously mentioned 1996 issue of Multivariate Behavioral Research (volume 31[4]). One aspect of these philosophical differences is that, while common factor analysis is a latent variable method, component analysis is not. This view is exemplified in the following statement by Bentler and Kano (1990):
Does one believe, in a particular application, that each variable to be analyzed is generated in part by a random error variate, and that these error variates are mutually uncorrelated? If so, the dimensionality of any meaningful model is greater than the dimensionality of the measured variables, and hence by definition a latent variable model, here, factor analysis, is called for. (p. 67)
Haig (2005) provides a philosophical framework for this view based on abductive inference, in which new information is generated by reasoning from factual premises to explanatory conclusions. In other words, “abduction consists in studying the facts and devising a theory to explain them” (Peirce, cited in Haig, 2005, p. 305). According to Haig, component models cannot be viewed as latent variable models:
An abductive interpretation of EFA reinforces the view that it is best regarded as a latent variable method, thus distancing it from the data reduction method of principal components analysis. From this, it obviously follows that EFA should always be used in preference to PC analysis when the underlying common causal structure of a domain is being investigated. (p. 321)
However, Maraun (1996) argues against this distinction between component and common factor analysis, maintaining that common factors are no more latent than components, because the only term differentiating the two models is an error term. As Maraun puts it, “the only feature of a latent common factor that goes beyond what is ‘known’ through the manifest variates is arbitrary” (pp. 535–536). Although we have presented methodological and philosophical arguments separately for ease of discussion, they are necessarily intertwined. For example, views of the reasoning underlying
the common factor analysis model put forth by Mulaik (1987) and Haig (2005) also provide a defense of the previously discussed factor score indeterminacy problem. Mulaik argues that inferences such as the identification of factors from patterns of correlations cannot be made “uniquely and unambiguously” from variable scores without making prior assumptions (p. 299). In Mulaik’s view, these prior assumptions might take the form of restrictions put on the factor structure or on the loadings themselves. Furthermore, results based on such inferences must be subjected to further testing on additional data: “In other words, if induction is to have any kind of empirical merit, it must be seen as a hypothesis-generating method and not as a method that produces unambiguous, incorrigible results” (p. 299). This view is amplified by Haig (2005), who explains that
using EFA to facilitate judgments about the initial plausibility of hypotheses will still leave the domains being investigated in a state of considerable theoretical underdetermination. It should also be stressed that the resulting plurality of competing theories is entirely to be expected, and should not be thought of as an undesirable consequence of employing EFA. (p. 320)
Maraun (1996), of course, disagrees, stating that “conceptual issues” (i.e., the existence of common factors) are categorically different from “considerations relevant to empirical investigation” (i.e., the generation and testing of alternative hypotheses and competing theories suggested by Mulaik, 1987, and Haig, 2005), and that arguments used to determine the former cannot make use of the empirical methods of the latter, but must stand on their own, without outside empirical aid.
Differences in Results From Component and Common Factor Analysis
From these philosophical heights we descend to the more practical matter of whether there is any actual difference in the results that would be obtained, or decisions that would be made, on the basis of a component rather than a common factor analysis. Again, methodologists differ predictably on this issue. Proponents of component analysis argue that any differences are trivial and possibly the result of extracting too many factors (Velicer & Jackson, 1990, p. 10).
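Whether the two approaches give different loadings can be checked directly on a population correlation matrix. The sketch below is only an illustration under assumptions chosen for simplicity, not a design from any study cited in this chapter: it assumes a single common factor with equal loadings (the values .8 and .4 and the numbers of variables are picked for illustration), builds the implied correlation matrix, and computes the loadings on the first principal component. The function name is hypothetical.

```python
import numpy as np

def first_component_loadings(factor_loading, n_vars):
    """Loadings on the first principal component of a one-factor population
    correlation matrix in which every variable has the same factor loading."""
    lam = np.full(n_vars, factor_loading)
    # Common factor model: off-diagonal correlations are lam_i * lam_j,
    # and the diagonal is one (communality plus uniqueness).
    R = np.outer(lam, lam)
    np.fill_diagonal(R, 1.0)
    eigvals, eigvecs = np.linalg.eigh(R)          # eigenvalues in ascending order
    # Component loading = eigenvector element * sqrt(eigenvalue).
    return np.abs(eigvecs[:, -1]) * np.sqrt(eigvals[-1])

# (factor loading, number of variables): illustrative conditions only.
for lam, p in [(0.8, 5), (0.4, 5), (0.4, 50)]:
    comp = first_component_loadings(lam, p)
    print(f"factor loading {lam}, {p} variables: "
          f"component loading {comp.mean():.3f}")
```

In this simplified setup the component loading always exceeds the factor loading unless the uniqueness is zero, and the gap narrows as either the loading or the number of variables increases.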
However, studies by Widaman (1990, 1993) and Bentler and Kano (1990) have shown that the analyses can produce very different results when the communalities and/or the number of variables per factor are low. These two conditions interact, such that if the average pattern loading is at least .8, three to seven variables per factor will suffice to yield estimates from the two methods that are very close. However, with average loadings of .4, 20 to 50 or even more variables per factor would be needed to yield similar estimates. These results are not surprising given that the difference between the component and common factor models is that the latter model contains a uniqueness term, while the former does not. It follows that conditions in which the uniquenesses of the variables are minimized will lead to greater similarity between the two methods, with higher communalities and larger numbers of variables representing two such conditions. More specifically, Schneeweiss (1997) has shown analytically that the results of component and common factor analysis will be similar when the unique variances are small relative to the factor loadings, or when the differences of the uniquenesses across variables are small relative to the loadings. When this is not the case, however, Widaman (1993) has demonstrated that, even with relatively small proportions of uniqueness in the variables, component analysis results in overestimates of the population factor pattern loadings. Also, for models with correlated factors, component analysis was found to yield underestimates of the population factor correlations. More generally, unless variable uniquenesses are actually zero, component analysis would be expected to yield estimates of pattern loadings that are overestimates of the corresponding population loadings.
Summary:
The legend: Component and common factor analysis provide results that are sufficiently similar that it should not matter much which one is used.
The kernel of truth: Component and common factor analysis do yield very similar results if the variable communalities are high (averaging .8 or higher) and the number of variables per factor is large.
The myth: Component and common factor analysis are conceptually equivalent.
The follow-up: Although methodologists still disagree about which model is most appropriate, component analysis and common factor analysis have different goals and are based on different philosophies. The choice between them should therefore be made on the basis of one’s purpose in conducting the analysis. If data reduction is the goal, component analysis should
be used. If one is interested in describing the variables in terms of a smaller number of dimensions that underlie them, one should use common factor analysis. Admittedly, these two purposes are easily confused. Preacher and MacCallum (2003) point out that saying one wants to describe variables using as few underlying dimensions as possible may sound like data reduction, and therefore component analysis. However, they go on to say that if one also wants to account for the correlations among the variables and to give the dimensions substantive interpretations, this goes beyond the domain of component analysis.
Orthogonal Rotation Results in Better Simple Structure Than Oblique Rotation
It is common knowledge among factor analysts that factor rotation generally results in solutions that are easier to interpret than unrotated solutions. Rotated solutions come in two basic forms: those yielding uncorrelated or orthogonal factors (or components) and those in which the resulting factors or components are correlated (known as oblique rotations). The goal of both types of rotation is to obtain results that are more interpretable and “cleaner.” The latter term is usually defined in terms of simple structure, the principles of which were originally advanced by Thurstone (1947), and include (a) the existence of several large loadings and a relatively greater number of variables with very small (ideally, zero) loadings for each factor, (b) different patterns of loadings across factors, and (c) small numbers of cross-loadings. In most applications, it is the third principle on which interest is typically concentrated. However, there appears to be some confusion among applied researchers regarding whether an orthogonal or oblique rotation is “best,” both in general and for the specific purpose of obtaining good simple structure.
Oblique or Orthogonal Rotation?
Because the difference between orthogonal and oblique rotations is simply that the latter yields correlated factors (components) while the factors or components obtained from the former will be uncorrelated, it stands to reason that the choice between the two methods
should be based on whether the factors/components are expected to correlate. For situations in which there is no information available on the expected level of correlation, methodologists are fairly consistent in recommending oblique over orthogonal rotations (although see Tinsley & Tinsley, 1987, p. 421). This is because an oblique solution will “default” to an orthogonal solution if the factors really are uncorrelated, but will allow for the factors to be correlated if this is necessary to fit the structure of the variables. For example, Comrey and Lee (1992) stated that “given the advantages of oblique rotation over orthogonal rotation, we see little justification for using orthogonal rotation as a general approach to achieving solutions with simple structure” (p. 283). Preacher and MacCallum (2003) went even further, stating that “it is almost always safer to assume that there is not perfect independence, and to use oblique rotation instead of orthogonal rotation” (p. 26). However, recent reviews of the literature have found that orthogonal rotation was the method of choice in 41% (Conway & Huffcutt, 2003) to 55% (Henson & Roberts, 2006) of factor analytic applications. Of the 72 factor analytic applications we reviewed, orthogonal rotation was used in 36, and most of these provided no rationale for doing so.
Do Orthogonal Rotations Result in Better Simple Structure?
These results raise the question of why orthogonal rotations are so popular. One possible reason is that researchers feel that orthogonal rotations will result in “cleaner” solutions with better simple structure. For example, Hill and Petty (1995) justify their use of orthogonal rotation by stating that “the varimax procedure was used in this study to minimize the number of loadings on a factor, thus simplifying its structure and making it more interpretable” (p. 63).
Comrey and Lee (1992) discuss this misconception, stating “It is sometimes thought that this retention of statistically independent factors ‘cleans up’ and clarifies solutions, making them easier to interpret. Unfortunately, this intuition is exactly the opposite of what the methodological literature suggests” (p. 287). The idea that orthogonal rotations result in better simple structure persists despite clear advice to the contrary in the methodological literature. Comrey and Lee (1992), for example, explicitly state that “orthogonal rotations are likely to produce solutions with poorer
Figure 3.1 Model with two correlated factors (path diagram: variables 1 to 3 under Factor One, variables 4 to 6 under Factor Two).
simple structure when clusters of variables are less than 90 degrees from one another…” (p. 282). Similarly, Russell (2002) states that “[orthogonal rotations] often do not lead to simple structure due to underlying correlations between the factors” (p. 1637). Finally, Nunnally and Bernstein (1994) make the same point, although somewhat obliquely (pun intended), in their statement that “[o]blique factors thus generally represent the salient variables better than orthogonal factors” (pp. 498–499). To understand why orthogonal rotations may actually result in more cross-loadings than oblique rotations, we refer to Figure 3.1, which depicts a two-factor model with three variables loading on each factor. The curved double-headed arrow between factors one and two indicates that the two factors are correlated, whereas the straight, single-headed arrows from the factors to the variables represent the factor loadings. In factor analysis and related techniques such as structural equation modeling, diagrams such as that in Figure 3.1 are used to show
how the original correlations among the variables can be reproduced from the factor analysis or structural equation model. This is done by “tracing” the paths from one variable to another. For example, the reproduced correlation between variables 1 and 2 would be obtained by tracing backwards from variable 1 to factor one and then forwards from factor one to variable 2. Let us assume that variables 3 and 4 are correlated. If factors one and two are correlated, we can trace backwards from variable 3 to factor one, through the curved double-headed arrow, and from factor two to variable 4 in order to reproduce the variable 3/variable 4 correlation. However, if the two factors are orthogonal, the only way to account for the variable 3/variable 4 correlation is to insert a cross-loading from factor two to variable 3 (as represented by the dashed line), or alternatively, from factor one to variable 4. Of course, if variables 1 through 3 are uncorrelated with variables 4 through 6, no cross-loadings would be necessary and it would be appropriate to model the two factors as being orthogonal. However, it is only in this situation, where variables on one factor are not correlated with variables on other factors, that an orthogonal rotation can yield a solution with no cross-loadings. In any other case, it will produce cross-loadings, and the more highly correlated the variables are across factors, the larger the cross-loadings that will be produced.
Summary:
The legend: Orthogonal rotations produce better simple structure.
The kernel of truth: Orthogonal rotation does produce solutions that may be simpler in terms of interpretability.
The myth: Use of orthogonal rotation when the factors are actually correlated will “clean up” the factor structure.
The follow-up: Unless you have good reason to believe that factors will be uncorrelated, use an oblique rotation. If the factors are really uncorrelated, an oblique rotation will yield an orthogonal solution anyway.
If you feel compelled to obtain an orthogonal solution, go ahead, but also obtain an oblique solution. If the correlations among the factors are nonnegligible, the results from the oblique solution are probably the best representation.
The Minimum Sample Size Needed for Factor Analysis Is… (Insert Your Favorite Guideline)
One question that is sure to be asked by anyone planning a quantitative research study is “How large a sample will I need?” In
factor analytic research, many rules of thumb have been suggested to answer this question. These rules fall into two categories: (a) those that specify an absolute value for minimum N, and (b) those that specify values for the minimum sample size to number of variables (N:p) ratio. However, recent studies of these guidelines by Velicer and Fava (1998), MacCallum et al. (1999), and Hogarty et al. (2005) have all reached the same conclusion, which is that there is no absolute minimum N or N:p ratio. In the words of MacCallum et al.:
We suggest that previous recommendations regarding the issue of sample size in factor analysis have been based on a misconception. That misconception is that the minimum level of N (or the minimum N:p ratio) to achieve stability and recovery of population factors is invariant across studies. We show that such a view is incorrect and that the necessary N is in fact highly dependent on several specific aspects of a given study. (p. 86)
Why, then, is the belief in an absolute value of N or N:p ratio so widespread? Velicer and Fava (1998) suggest that recommendations for a minimum sample size probably stem from knowledge of the sampling variability of correlation coefficients, which provide accurate estimates of their population counterparts when N reaches 100–200. Because both pattern and structure coefficients in factor analysis are based on correlations, it seems reasonable to assume that they would behave similarly. But do they? Gorsuch (1983) is a commonly cited reference for such rules of thumb. He discusses the standard errors of factor loadings (both pattern and structure) and, based on an early study by Cliff and Hamburger (1967), concludes that “As the simple structure becomes more poorly defined…the standard error increases” (p. 209). He goes on to recommend that, because Cliff and Hamburger found that standard errors for structure loadings were about 150–200% larger than those for correlation coefficients, researchers could obtain a “rough check” on the significance of loadings by doubling the standard error for the corresponding correlation coefficient. For example, with an N of 100, a correlation of around .2 is statistically significant (p < .05), so a structure loading of approximately .4 should be detectable at this sample size. (In a later study, Cudeck and O’Dell (1994) found that standard errors for loadings depend on the method of rotation, number of factors, and degree of correlation among factors, among other things.) However, as early
as 1983, Gorsuch provided the following caveat: “These figures are for problems with small to moderate size correlation matrices, and may be too conservative for problems with many variables” (p. 209). Turning to the N:p ratio, Velicer and Fava (1998) suggest that these guidelines may have their origin in the well-known “shrinkage” concept of multiple regression, which specifies that the degree to which a regression solution will cross-validate is a function of the ratio of the number of predictor variables to the number of subjects. Here again, Gorsuch (1983) is one of the commonly cited sources of such a recommendation. If we look at what Gorsuch actually said, however, the term recommendation seems too strong. His discussion of this issue is in the context of the use of statistical tests in deciding on the number of factors. After reviewing several criticisms of these tests, Gorsuch concludes by stating that, “[f]or these reasons, psychometrically oriented factor analysts prefer to have a large number of subjects.…A large number is usually defined as five or ten times the number of variables but not less than several hundred” (p. 148). Even those who choose to interpret this as a ringing endorsement of the five or ten subjects per variable “rule” should note that Gorsuch, on the basis of more recent research, stated in a 1997 article that
the sample size was in former times given as a function of the number of items (e.g., 10 cases for every item). This was a recommendation proposed largely out of ignorance rather than theory or research. (p. 541)
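The arithmetic behind the “rough check” on loadings discussed earlier in this section (a correlation of about .2 being significant at N = 100, and roughly double that value for a structure loading) can be reproduced approximately. The sketch below is an illustration only: it assumes the Fisher z approximation for the critical correlation, which is a choice made here and not necessarily the calculation used by the sources cited, and the function name is hypothetical.

```python
from math import sqrt, tanh
from statistics import NormalDist

def critical_r(n, alpha=0.05):
    """Smallest |r| significant at two-tailed level alpha for sample size n.

    Uses the Fisher z approximation (an assumption of this sketch):
    z(r) is roughly normal with standard error 1 / sqrt(n - 3).
    """
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return tanh(z_crit / sqrt(n - 3))

r_crit = critical_r(100)
# Doubling the cutoff, following the rough check described in the text.
loading_cutoff = 2 * r_crit
print(f"critical r at N = 100: {r_crit:.3f}")
print(f"doubled cutoff for a structure loading: {loading_cutoff:.3f}")
```

Under this approximation the two values come out close to the .2 and .4 figures quoted for N = 100.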
Gorsuch goes on to argue that the sample size needed is a function of “the stability of a correlation coefficient,” and that larger samples are needed if correlations are low. This argument is based on the fact that small correlations are less stable than large correlations.
New Sample Size Guidelines
If the old rules of thumb for determining sample sizes are not accurate, what should be used instead? The good news is that recent studies in this area have proposed new guidelines; the bad news is that these are more complicated than the old rules of thumb. Specifically, recent studies of the sample size issue based on simulated data (Hogarty et al., 2005; MacCallum et al., 1999, 2001; Velicer & Fava, 1998) have found that, although recovery of population factor loadings does improve with increased sample size, results also improve with
increases in (a) communality levels and (b) the number of variables per factor. These studies have also found that sample size, communality levels, and the number of variables interact in their effects on recovery of population loadings. We consider the effects of these characteristics in more detail in the following paragraphs. The positive effects of high communalities are due to the fact that they are functions of the factor loadings, which in turn are functions of the variable correlations. When we recall that large correlations are known to be more stable than low correlations, the positive effect of high communalities on recovery of population loadings makes sense. That the number of variables per factor should positively influence factor recovery is perhaps less obvious, however. What about the “shrinkage” effect? With more variables there are more quantities to estimate, so we need a larger sample size, right? Not necessarily, according to the results of simulation studies. In all four of the studies cited, factor recovery improved as the number of variables per factor increased. As Velicer and Fava (1998) state in a summary of their results:
Rules that related sample size to the number of observed variables were also demonstrated to be incorrect. In fact, for the low-variables conditions considered here, the opposite was true. Increasing p [the number of variables] when the number of factors remains constant will actually improve pattern reproduction for the same sample size. (p. 244)
Considering the number of variables in the context of a sampling issue may help in understanding these results. Just as we need adequate samples of people to approximate population quantities related to characteristics of such people, we need adequate samples of variables to approximate the population quantities related to the variables. Sampling too few variables can result in the same types of instability in estimating variable-related properties as can sampling too few people when estimating population parameters. (This discussion assumes, of course, that the variables are all good measures of their respective factors in the sense of having relatively high loadings on the designated factor and loadings close to zero on other factors.)
Perhaps the most important aspect of these four simulation studies relates to the interactive effects of sample size, communality level, and the number of variables per factor. As Velicer and Fava (1998) put it, “strength on one of these variables could compensate for a weakness on another” (p. 243). The strongest compensatory effect
appears to be that of communality level on sample size. For example, MacCallum et al. (1999) found that with communalities of approximately .7, good recovery of population factors required a sample size of only 100, with three to four variables per factor. At this level of communality, increasing the number of variables per factor had little effect. With lower communalities, larger samples of both people and variables were necessary to obtain good recovery. With communalities lower than .5, six or seven variables per factor and a sample size “well over 100” would be required to obtain the same level of recovery. Finally, with communalities of less than .5 and three or four variables per factor, sample sizes of at least 300 are needed. One final aspect of these studies (Hogarty et al., 2005; MacCallum et al., 1999, 2001; Velicer & Fava, 1998) should be mentioned. Those who still feel that a larger sample size should be used when there are more variables in the analysis will be gratified to learn that this was found to be the case when more variables corresponded to more factors. In other words, given the same variable to factor ratio, a larger sample size was needed to obtain good recovery when there were more factors (and thus more variables) in the analysis. For example, with seven rather than three factors, each measured by three to four variables, MacCallum et al. (1999) found that samples of well over 500 were needed to obtain good recovery in the low communality (< .5) condition. In general, the following statement by Hogarty et al. (2005) provides a good summary of the results regarding the number of factors:
Overdetermination of factors was also shown to improve the factor analysis solution. We found, however, in comparing results over different numbers of factors and levels of overdetermination, that samples with fewer factors by far yielded the more stable factor solutions. (p. 224)
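The compensatory pattern described above (sample size, communality, and variables per factor each helping recovery) can be explored with a small simulation. The following is a minimal sketch under simplifying assumptions made here, not a reproduction of the cited simulation designs: it assumes a single common factor with equal loadings, estimates loadings by a basic iterated principal axis routine written for this example, and summarizes recovery by the root-mean-square error of the loadings. All function names and condition values are hypothetical.

```python
import numpy as np

def one_factor_paf(R, n_iter=25):
    """Iterated principal axis factoring for a single common factor."""
    # Start from squared multiple correlations as communality estimates.
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        reduced = R.copy()
        np.fill_diagonal(reduced, h2)
        vals, vecs = np.linalg.eigh(reduced)            # ascending eigenvalues
        loadings = vecs[:, -1] * np.sqrt(max(vals[-1], 0.0))
        h2 = np.clip(loadings ** 2, 0.0, 1.0)           # keep communalities in [0, 1]
    return loadings if loadings.sum() >= 0 else -loadings

def mean_rmse(communality, n_vars, n_obs, n_reps=200, seed=0):
    """Average RMSE of estimated loadings over simulated samples."""
    rng = np.random.default_rng(seed)
    true = np.full(n_vars, np.sqrt(communality))
    errors = []
    for _ in range(n_reps):
        factor = rng.standard_normal((n_obs, 1))
        unique = rng.standard_normal((n_obs, n_vars)) * np.sqrt(1.0 - communality)
        data = factor * true + unique                   # one-factor model
        est = one_factor_paf(np.corrcoef(data, rowvar=False))
        errors.append(np.sqrt(np.mean((est - true) ** 2)))
    return float(np.mean(errors))

# Illustrative conditions: (communality, variables, sample size).
baseline = mean_rmse(0.3, 4, 100)
higher_communality = mean_rmse(0.7, 4, 100)
larger_sample = mean_rmse(0.3, 4, 400)
more_variables = mean_rmse(0.3, 12, 100)
for label, value in [("baseline", baseline),
                     ("higher communality", higher_communality),
                     ("larger sample", larger_sample),
                     ("more variables", more_variables)]:
    print(f"{label:>20}: RMSE = {value:.3f}")
```

In this simplified setting each of the three changes lowers the error relative to the baseline, which is the direction of effect reported in the simulation studies; the specific magnitudes depend on the assumptions made here.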
Summary:
The legend: The sample size needed for factor analysis increases with the number of variables to be analyzed.
The kernel of truth: The sample size needed does increase with the number of factors. The simulation studies cited previously found that with low levels of communality and three to four variables per factor, a sample size of at least 300 was needed if there were three factors, but a sample size of at least 500 was necessary if there were seven factors.
The myth: For a given number of factors, larger numbers of variables require larger sample sizes.
The follow-up: Sample your variables
carefully. Choosing variables with high communalities will pay off in lower sample size requirements.
The “Eigenvalues Greater Than One” Rule Is the Best Way of Choosing the Number of Factors
We all know this rule. SPSS implements it as the default method for choosing the number of factors or components, and it is the default method in SAS for determining the number of components, so it must be right. Right? Wrong! In fact, one of the few things on which factor analysts seem to agree is that this criterion, variously known as “K1,” the “Kaiser rule,” or the “Kaiser-Guttman rule,” is one of the least reliable options among those available. In their summary of results from a simulation study comparing five methods for determining the number of components to retain, Zwick and Velicer (1986) flatly stated that “we cannot recommend the K1 rule for PCA [principal component analysis]” (p. 439). And, in case anyone is left wondering, Velicer, Eaton, and Fava (2000) conducted a follow-up study in which K1 was included as a basis of comparison for the accuracy of other methods. They concluded that “the eigenvalue greater than one rule was extremely inaccurate and was the most variable of all the methods. Continued use of this method is not recommended” (p. 68). Neither does Cortina (2002) mince words in his evaluation that this criterion is “clearly inferior to the alternatives” (p. 350). Given all of this negative press, it may surprise some to find that the K1 criterion was the most commonly used single procedure in Fabrigar et al.’s (1999) recent review of applications, and was used as the sole method of determining the number of factors in 16–19% of the articles reviewed. Similarly, Conway and Huffcutt (2003) found that approximately 15% of the studies they surveyed used K1 as the only criterion for determining the number of factors. Russell (2002) found that a whopping 52% used K1 (either alone or combined with other criteria).
And Henson and Roberts (2006), in their review of articles in psychometrically oriented journals, found that 57% relied on K1. It should also be noted that the actual numbers may be higher, because 38–41% of the studies reviewed by Fabrigar et al., 38% of those reviewed by Conway and Huffcutt (2003), and 55% of those found by Russell did not even report what criteria were used to make this important decision. In our own review of the literature, we
found that 35% used only the K1 criterion, 17% used K1 along with the scree plot, and 32% failed to report any criterion for their choice of the number of factors. Guttman (1954) is often credited with originating the K1 criterion. However, what he actually did was to derive three methods for estimating the lower bound for the rank, or dimensionality, of a population correlation matrix. One of these was that the minimum dimension of a correlation matrix with unities on the diagonal was greater than or equal to the number of eigenvalues that are at least one. Three things should be noted. First, the K1 rule applies to component analysis, not to common factor analysis; so applications to common factor analysis are, strictly speaking, inappropriate. Second, Guttman did not suggest K1 as a method of determining the number of components that should be extracted, but rather as determining the number of components that could be extracted (Gorsuch, 1983; Preacher & MacCallum, 2003; Velicer et al., 2000). It is the researcher’s job to determine the difference. Finally, Guttman’s derivations are based on population data. As noted by Nunnally and Bernstein (1994), the first few eigenvalues in a sample correlation matrix are typically larger than their population counterparts, resulting in extraction of too many components in samples when this rule is used. Kaiser (1960) provided another rationale for the K1 criterion, stating that components with eigenvalues less than one would have negative “Kuder-Richardson” or internal consistency reliability. Researchers sometimes overinterpret this statement as indicating that components meeting the K1 criterion will be reliable. But Kaiser only claimed that such components would not have negative reliability, which is a far cry from what most researchers would consider acceptable reliability.
In any case, Cliff (1988) appears to have debunked Kaiser’s claim, stating that “the conclusion made by Kaiser (1960) is erroneous: There is no direct relation between the size of an eigenvalue and the reliability of the corresponding composite” (p. 277). So if not K1, what should be used? Methodologists recommend using several different methods for determining the number of factors or components to retain. In the ideal scenario, these methods will agree on the optimum number. More often, however, different methods will suggest different numbers of factors/components. In situations such as these, the researcher should obtain the solutions
suggested by the different methods and decide among these on the basis of interpretability, evidence of overfactoring, and theoretical considerations. As stated by Worthington and Whittaker (2006), “In the end, researchers should retain a factor only if they can interpret it in a meaningful way no matter how solid the evidence for its retention based on the empirical criteria” (p. 822). Among the recommended “empirical criteria” are the scree plot, parallel analysis (PA), and the minimum average partial (MAP) procedure. Researchers are probably familiar with the scree plot, in which the eigenvalues are plotted and the number of factors or components is determined by the point at which the plotted values level off. This method was found to perform fairly well in the study by Zwick and Velicer (1986), although not as well as the PA and MAP procedures. The latter two methods may be less familiar to researchers, and are therefore described in some detail in the following paragraphs. The parallel analysis (PA) procedure was introduced by Horn (1965) and is available for both component and common factor analysis. Zwick and Velicer (1986), in a study comparing five methods for determining the number of factors to retain (K1, the scree test, the Bartlett chi-square test, the minimum average partial procedure, and PA), found PA to be the most accurate method across all conditions studied. The idea behind PA is that the components extracted should have eigenvalues greater than those from a random data matrix of the same dimensions. To determine this, a set of random data correlation matrices is created and their eigenvalues are computed. The eigenvalues from the matrix to be factored are compared to those from the random data, and only factors or components with eigenvalues greater than those from the random data are retained.
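The PA steps just described can be sketched as follows for the component case. This is a minimal illustration, not the macros mentioned in this chapter: it assumes normally distributed random data, compares each observed eigenvalue with the mean of the corresponding random-data eigenvalues (a high percentile is a common alternative), and stops at the first component that fails the comparison. The function names and the simulated two-factor data set (six variables per factor, loadings of .7, N = 300) are hypothetical choices made for the example.

```python
import numpy as np

def sorted_eigenvalues(data):
    """Eigenvalues of the correlation matrix of `data`, largest first."""
    return np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]

def parallel_analysis(data, n_sims=200, seed=0):
    """Number of components whose eigenvalues exceed random-data eigenvalues."""
    rng = np.random.default_rng(seed)
    observed = sorted_eigenvalues(data)
    # Eigenvalues from random normal data sets of the same dimensions.
    random_eigs = np.array([
        sorted_eigenvalues(rng.standard_normal(data.shape))
        for _ in range(n_sims)
    ])
    reference = random_eigs.mean(axis=0)     # assumption: mean as the benchmark
    exceeds = observed > reference
    # Retain components up to (not including) the first that fails.
    n_retain = len(exceeds) if exceeds.all() else int(np.argmin(exceeds))
    return n_retain, observed, reference

# Simulated example data: two uncorrelated factors, six variables each.
rng = np.random.default_rng(42)
n_obs, per_factor, loading = 300, 6, 0.7
pattern = np.kron(np.eye(2), np.full((per_factor, 1), loading))   # 12 x 2
factors = rng.standard_normal((n_obs, 2))
unique = rng.standard_normal((n_obs, 2 * per_factor)) * np.sqrt(1 - loading ** 2)
data = factors @ pattern.T + unique

n_retain, observed, reference = parallel_analysis(data)
print("components retained by PA:", n_retain)

# For contrast: the eigenvalues-greater-than-one count on pure noise,
# where no components exist in the population.
noise = rng.standard_normal((n_obs, 2 * per_factor))
k1_on_noise = int((sorted_eigenvalues(noise) > 1).sum())
print("components K1 would keep from pure noise:", k1_on_noise)
```

Because the eigenvalues of a sample correlation matrix sum to the number of variables and are never all exactly one, the K1 count on pure noise is always at least one, which illustrates why sample eigenvalues need a random-data benchmark rather than a fixed cutoff.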
Although the PA procedures are not currently available in the standard SPSS and SAS packages, O’Connor (2000) has helpfully provided macros for both SAS and SPSS (as well as MATLAB) that will implement PA for both component and factor analysis (available at http://flash.lakeheadu.ca/~boconno2/nfactors.html). Thompson and Daniel (1996) and Hayton, Allen, and Scarpello (2004) have also provided SPSS code for PA.

Another procedure that has performed well in simulation studies (Velicer et al., 2000; Zwick & Velicer, 1982, 1986) is the minimum average partial, or MAP, procedure (Velicer, 1976). It should be noted,
however, that this method is only appropriate for component, and not common factor, analysis. The method proceeds as follows. As each component is extracted, a partial correlation matrix (partialing out that component) is computed and the average squared off-diagonal element of the partialed matrix is obtained. The number of components retained is determined by the point at which the average partial correlation is at a minimum. The idea is that the components successively remove the common variance from the matrix, until all that is left is unique variance, defined as variance shared between only two variables. The average partial correlation will decrease as the common variance is removed, until the point is reached at which no common variance remains. At this point only components based on unique variance will be extracted (i.e., a component that has a high correlation with only one variable and low correlations with the others), and the average partial correlation will begin to increase. Thus, this method should indicate the point at which the components being extracted change from reflecting variance common to several variables to variance common to pairs of variables. The MAP procedure has been found to perform nearly as accurately as PA in simulation studies (Velicer et al., 2000; Zwick & Velicer, 1986). Unfortunately, as with PA, implementation is a problem because MAP is not implemented in the factor analysis procedures for either SPSS or SAS. Again, however, macros developed by O’Connor (2000) come to the rescue by providing a vehicle for implementing MAP in both SPSS and SAS.

A final point should be made regarding the empirical studies of factor retention criteria (Velicer et al., 2000; Zwick & Velicer, 1982, 1986). These studies examined only orthogonal factors (Velicer, personal communication, October 9, 2007), so their accuracy for situations in which factors are correlated is not entirely clear.
Because orthogonal factors are more clearly separable than oblique factors, it is probably safe to assume that determining the number of factors will be more difficult in the latter case. Thus, we might expect factor retention criteria to perform more poorly with correlated factors. In fact, a recent study of the parallel analysis procedure supports this supposition (Cho, Li, & Bandalos, 2006). However, it is not clear whether the superiority of the PA and MAP procedures would be maintained in situations with correlated factors, and more study is needed in this area.
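The MAP steps described earlier (partial out the first m components, then average the squared off-diagonal partial correlations) can be sketched as follows. This is a simplified illustration, not Velicer’s (1976) or O’Connor’s (2000) code: the function name `velicer_map` is hypothetical, and the sketch stops at m = p − 2 components to avoid dividing by near-zero residual variances, which is an assumption of this example.

```python
import numpy as np

def velicer_map(corr):
    """Illustrative minimum average partial (MAP) sketch.

    corr: (p, p) correlation matrix.
    Returns (n_components, avg_sq), where avg_sq[m] is the average
    squared partial correlation after removing m components.
    """
    p = corr.shape[0]
    eigval, eigvec = np.linalg.eigh(corr)
    order = np.argsort(eigval)[::-1]              # largest eigenvalue first
    eigval, eigvec = eigval[order], eigvec[:, order]
    loadings = eigvec * np.sqrt(np.clip(eigval, 0.0, None))

    off_diag = ~np.eye(p, dtype=bool)
    avg_sq = [np.mean(corr[off_diag] ** 2)]       # m = 0: nothing partialed out

    for m in range(1, p - 1):                     # m = 1, ..., p - 2 (assumption)
        a = loadings[:, :m]
        resid = corr - a @ a.T                    # covariance left after m components
        sd = np.sqrt(np.diag(resid))
        partial = resid / np.outer(sd, sd)        # rescale to partial correlations
        avg_sq.append(np.mean(partial[off_diag] ** 2))

    avg_sq = np.array(avg_sq)
    return int(np.argmin(avg_sq)), avg_sq
```

The returned index is the number of components at which the average squared partial correlation is smallest; plotting `avg_sq` against m shows the decrease and subsequent increase described in the text.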
Summary:
The legend: K1 is an accurate method of determining the number of factors to extract.
The kernel of truth: The number of eigenvalues greater than one does represent a theoretical lower bound for the number of components (but not common factors) that can (but not necessarily should) be extracted in the population.
The myth: K1 is an accurate method for estimating the number of common factors or components that should be retained in sample data.
The follow-up: Although the default criterion in both SPSS and SAS, K1 has consistently been found to be inaccurate, and review articles are unanimous in recommending against its use. Use the scree plot in conjunction with PA and, if conducting a component analysis, MAP instead of K1. Also keep in mind that methodologists recommend using several criteria in combination, and stress interpretability and theoretical rationale as the ultimate criteria (Cortina, 2002).

Discussion

In this chapter we have reviewed four common misconceptions regarding the use of factor analysis, tracing them, when possible, to their origins. In some cases, this trail has led us to the inner workings of computer programs, which appear to have appropriated the decision-making process from its rightful place within the brain of the researcher. We suspect this may be the case for decisions regarding the choice between component and common factor analysis, as the former method is both the most commonly used method and the default method in popular computer packages such as SPSS and SAS. Although it is true that with high communalities and a sufficient number of variables per factor (component), the two procedures will yield very similar results, a recent review of factor analysis applications by Henson and Roberts (2006) suggests that these conditions are not often met in practice.
More importantly, component and common factor analysis have different purposes, rationales, and philosophical underpinnings, and these should guide the choice between them. Another factor analytic decision that seems to have been hijacked by commonly used computer packages is the choice of the number of factors. Although the K1 criterion is easy to program into a computer package, and deceptively easy for researchers to use, it has not shown itself to be accurate in any empirical study despite the
many chances it has been given. In fairness to K1, it was never really intended to be used as a criterion for determining “the” number of factors at all, and in our opinion it should not be forced to take on this role. Instead we recommend that researchers use multiple criteria including PA, MAP, and the scree plot, looking for convergence among these methods. The final decision, however, should be based on judgments of interpretability and consistency of the factors with sound theory.

With regard to the sample size issue, when we consider the complexity of factor analytic studies it should not be surprising that one sample size (or even one sample size to number of variables ratio) does not fit all. Unreliable variables contribute to instability in much the same way as do small samples, so it makes intuitive sense that the level of communality of the variables should play a role in our choice of N. This is good news for researchers analyzing carefully chosen and highly reliable sets of variables, but should send a cautionary message to those in the early stages of a factor analytic program of research in which the variables may not be as well developed, such as in new scale or measurement development.

With regard to rotational methods, some researchers may be dismayed to learn that orthogonal rotations do not necessarily yield the best simple structure. In fact, this will only be the case for situations in which the factors actually are orthogonal, or close to it. If factors are suspected, or known, to be correlated, an oblique rotation should be the method of choice as it is more likely to yield a simple structure and will also provide a better representation of the relationships among the variables. Happily, however, this is one of the rare situations in which researchers can have it all.
If the factors are truly orthogonal, an oblique rotation will yield factors with correlations close to zero, and the researcher can then, if s/he so desires, rerun the analysis using an orthogonal rotation. A final point concerns reporting practices. Reviews of the literature (Conway & Huffcutt, 2003; Fabrigar et al., 1999; Henson & Roberts, 2006; Russell, 2002) have consistently found that researchers routinely fail to report essential information about the factor analysis. Benson and Nasser (1998) provide a detailed list of information that should be included in the description of any factor analytic study. In particular, reporting the method of extraction and of rotation, as well as a description of how the number of factors was determined, along with justifications of these should be considered mandatory.
Researchers and reviewers of articles are urged to consult the Benson and Nasser article for more information, and to strive for more complete reporting practices in the area of factor analysis. We hope that our brief review of these issues will be helpful to those conducting and/or reviewing factor analytic research, or planning to do so in the future. Although in some cases the procedures we suggest will require a little extra effort on the part of the researcher, the payoff should be better quality research and more replicable results.

References

Acito, F., & Anderson, R. D. (1986). A simulation study of factor score indeterminacy. Journal of Marketing Research, 23, 111–118.
Benson, J., & Nasser, F. (1998). On the use of factor analysis as a research tool. Journal of Vocational Research, 23, 13–23.
Bentler, P. M., & Kano, Y. (1990). On the equivalence of factors and components. Multivariate Behavioral Research, 25, 67–74.
Borgatta, E. R., Kercher, K., & Stull, D. E. (1986). A cautionary note on the use of principal component analysis. Sociological Methods and Research, 15, 160–168.
Cho, S. J., Li, F., & Bandalos, D. L. (April, 2006). Accuracy of the parallel analysis procedure in exploratory factor analysis of polychoric correlations. San Francisco, CA: National Council on Measurement in Education.
Cliff, N. (1988). The eigenvalues-greater-than-one rule and the reliability of components. Psychological Bulletin, 103(2), 276–279.
Cliff, N., & Hamburger, C. D. (1967). The study of sampling errors in factor analysis by means of artificial experiments. Psychological Bulletin, 68, 430–445.
Comrey, A. L., & Lee, H. B. (1992). First course in factor analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum.
Conway, J. M., & Huffcutt, A. I. (2003). A review and evaluation of exploratory factor analysis practices in organizational research. Organizational Research Methods, 6, 147–168.
Cortina, J. M. (2002). Big things have small beginnings: An assortment of “minor” methodological misunderstandings. Journal of Management, 28, 339–362.
Fabrigar, L. R., Wegener, D. T., MacCallum, R. C., & Strahan, E. J. (1999). Evaluating the use of exploratory factor analysis in psychological research. Psychological Methods, 4, 272–299.
Gorsuch, R. L. (1983). Factor analysis (2nd ed.). Hillsdale, NJ: Erlbaum.
Gorsuch, R. L. (1997). Exploratory factor analysis: Its role in item analysis. Journal of Personality Assessment, 68(3), 532–560.
Guttman, L. (1954). Some necessary conditions for common factor analysis. Psychometrika, 19, 149–161.
Haig, B. D. (2005). Exploratory factor analysis, theory generation, and scientific method. Multivariate Behavioral Research, 40, 303–329.
Hayton, J. C., Allen, D. G., & Scarpello, V. (2004). Factor retention decisions in exploratory factor analysis: A tutorial on parallel analysis. Organizational Research Methods, 7, 191–205.
Henson, R. K., & Roberts, J. K. (2006). Use of exploratory factor analysis in published research. Educational and Psychological Measurement, 66, 393–416.
Hill, R. B., & Petty, G. C. (1995). A new look at selected employability skills: A factor analysis of the Occupation Work Ethic. Journal of Vocational Education Research, 20(4), 59–73.
Hogarty, K. Y., Hines, C. V., Kromrey, J. D., Ferron, J. M., & Mumford, K. R. (2005). The quality of factor solutions in exploratory factor analysis: The influence of sample size, communality, and overdetermination. Educational and Psychological Measurement, 65, 202–226.
Horn, J. L. (1965). A rationale and test for the number of factors in factor analysis. Psychometrika, 30, 179–185.
Kaiser, H. F. (1960). The application of electronic computers to factor analysis. Educational and Psychological Measurement, 20, 141–151.
MacCallum, R. C., Widaman, K. F., Preacher, K. J., & Hong, S. (2001). Sample size in factor analysis: The role of model error. Multivariate Behavioral Research, 36, 611–637.
MacCallum, R. C., Widaman, K. F., Zhang, S., & Hong, S. (1999). Sample size in factor analysis. Psychological Methods, 4, 84–99.
Maraun, M. D. (1996). Metaphor taken as math: Indeterminacy in the factor analysis model. Multivariate Behavioral Research, 31(4), 517–538.
McArdle, J. J. (1990). Principles versus principals of structural factor analysis. Multivariate Behavioral Research, 25(1), 81–88.
Mulaik, S. A. (1987). A brief history of the philosophical foundations of exploratory factor analysis. Multivariate Behavioral Research, 22, 267–305.
Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.
O’Connor, B. P. (2000). SPSS and SAS programs for determining the number of components using parallel analysis and Velicer’s MAP test. Behavior Research Methods, Instruments, & Computers, 32, 396–402.
Preacher, K. J., & MacCallum, R. C. (2003). Repairing Tom Swift’s electric factor analysis machine. Understanding Statistics, 2, 13–43.
Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in Personality and Social Psychology Bulletin. Personality and Social Psychology Bulletin, 28, 1629–1646.
Schneeweiss, H. (1997). Factors and principal components in the near spherical case. Multivariate Behavioral Research, 32(4), 375–401.
Thompson, B., & Daniel, L. G. (1996). Factor analytic evidence for the construct validity of scores: A historical overview and some guidelines. Educational and Psychological Measurement, 56, 197–208.
Tinsley, H. E. A., & Tinsley, D. J. (1987). Uses of factor analysis in counseling psychology research. Journal of Counseling Psychology, 34, 414–424.
Thurstone, L. L. (1947). Multiple-factor analysis: A development and expansion of the vectors of mind. Chicago: University of Chicago Press.
Velicer, W. F. (1976). Determining the number of components from the matrix of partial correlations. Psychometrika, 41, 321–327.
Velicer, W. F., Eaton, C. A., & Fava, J. L. (2000). Construct explication through factor or component analysis: A review and evaluation of alternative procedures for determining the number of factors or components. In R. D. Goffin & E. Helmes (Eds.), Problems and solutions in human assessment (pp. 41–71). Boston: Kluwer.
Velicer, W. F., & Fava, J. L. (1998). Effects of variable and subject sampling on factor pattern recovery. Psychological Methods, 3, 231–251.
Velicer, W. F., & Jackson, D. N. (1990). Component analysis versus common factor analysis: Some issues in selecting an appropriate procedure. Multivariate Behavioral Research, 25, 1–28.
Widaman, K. F. (1990). Bias in pattern loadings represented by common factor analysis and component analysis. Multivariate Behavioral Research, 25, 89–96.
Widaman, K. F. (1993). Common factor analysis versus principal components analysis: Differential bias in representing model parameters? Multivariate Behavioral Research, 28, 263–311.
Worthington, R. L., & Whittaker, T. A. (2006). Scale development research: A content analysis and recommendations for best practice. The Counseling Psychologist, 34, 806–838.
Zwick, W. R., & Velicer, W. F. (1982). Factors influencing four rules for determining the number of components to retain. Multivariate Behavioral Research, 17, 253–269.
Zwick, W. R., & Velicer, W. F. (1986). Comparison of five rules for determining the number of components to retain. Psychological Bulletin, 99, 432–442.
4 Dr. StrangeLOVE, or How I Learned to Stop Worrying and Love Omitted Variables
Adam W. Meade, Tara S. Behrend, and Charles E. Lance
A well-known problem in path analysis and structural equation modeling (SEM) is that even the largest and most comprehensive models cannot contain all of the causes of models’ endogenous variables. This violation of one of the underlying assumptions of path analysis and SEM gives rise to a commonly held belief that failure to include all relevant causes of endogenous variables may invalidate study results in path analysis and SEM. This problem has been referred to variously as the unmeasured variables problem (Duncan, 1975; James, 1980), the omitted variables problem (James, 1980; Kenny, 1979; Sackett, Laczo, & Lippe, 2003), left out variables error (LOVE; Mauro, 1990), a lack of perfect isolation (i.e., pseudo-isolation; Bollen, 1989), and lack of self-containment (James, Mulaik, & Brett, 1982). It has also been discussed as a particular type of model specification error (Hanushek & Jackson, 1977; Kenny, 1979). The omitted variables problem arises when the assumption that all relevant variables that influence the dependent (endogenous) variables are included in the model is violated. However, in the social sciences, this assumption is rarely, if ever, fulfilled.

Although there is no shortage of scholarly discussion and writing related to omitted variables, it is less clear how often this issue arises in substantive academic and applied research. This is because discussion of omitted variables usually takes place “behind the scenes,” for example during the manuscript review process. In response to a post to the RMNET message board on June 11, 2007, several authors
indicated that omitted variable discussions have arisen during the review process. In one example, an anonymous reviewer commented on a paper related to sources of work absenteeism:

However, omitted variables that are tied to absenteeism still remain a concern as family size, number of children, and being single head of household are also related to race/ethnicity. The issue is not that perceived value of diversity and children, etc. are related (as the authors contend), it is that race is correlated with both reports of value of diversity and number of children etc., and then with absenteeism. Hence, absenteeism is potentially being driven by factors other than what the author(s) allege. Simply acknowledging the lack of critical data (pages 26 & 27) does not eliminate the concern that major confounds were not adequately controlled. (S. Tonidandel, personal communication, June 12, 2007).
This comment is undoubtedly typical of those researchers regularly encounter. In order to provide some index of the extent to which researchers consider omitted variable issues in their work, we conducted a cited reference search. Specifically, we used the Social Science Citation Index to identify works that cited two seminal papers on omitted variables, James (1980) and Mauro (1990), on the assumption that authors dealing with omitted variables issues in their research would be likely to cite these works. A total of 63 sources were found that cited these studies. We then coded each of these sources into one of four categories based on the context in which they discussed omitted variables. Of the 63 sources, 12 actually took steps to assess risk from omitted variables or acted to minimize the impact of omitted variables in some way (e.g., including relevant variables not of central focus to the model [Prussia, Kinicki, & Bracker, 1993], testing alternative models with and without potential additional determinant variables [Colquitt, LePine, & Noe, 2000; Prussia & Kinicki, 1996]). An additional 21 articles cited James (1980) or Mauro (1990) when discussing the potential biasing effect of omitted variables but did not attempt to account for such variables in any way. Twenty-six sources cited these works as part of a methodological review of path analysis or SEM. Finally, four sources mentioned the potential of omitted variables as a limitation of previous research in order to help justify their current study.

In sum, it seems that reviewers and others critically evaluating organizational research are aware of the omitted variables issue and voice concerns over LOVE, perhaps even in contexts in which there is minimal risk of omitted variables compromising research
conclusions. On the other hand, authors seem to address omitted variables in a meaningful way less frequently than would be desired. This is not surprising given that authors may not want to call attention to methodological issues that could question the validity of their study conclusions. However, there are some instances in which omitted variables do pose a considerable threat to the conclusions of path analysis and SEM. In order to provide a better understanding of when omitted variables may or may not jeopardize the validity of path analysis and SEM, this chapter has three goals: (a) review the relevant assumptions in path analysis and SEM and present a mathematical explanation of the omitted variables problem, (b) discuss the conditions under which omitted variables are likely to be problematic and those under which the effects of omitted variables are negligible, and (c) provide recommendations for minimizing the risk of LOVE.

Theoretical and Mathematical Definition of the Omitted Variables Problem

Conceptually, the problems that may be caused by omitted variables are not difficult to understand. When researchers specify path or structural equation models in order to evaluate a theory, path coefficients are estimated based on the correlations among the measured variables in the model and the pattern of structural relations specified. If an endogenous (dependent) variable is affected by a variable that is unmeasured, and the unmeasured variable correlates to a moderate degree with other causal determinants in the model, the effects of the unmeasured variable can be incorrectly attributed to the measured causal determinants in the model. While the effect of the omitted variable could serve to decrease the magnitude of the path coefficient of the measured variable (i.e., a suppressor effect), it is more often assumed that the effect would cause a positive bias in the path coefficient of the measured variable.
This positive bias could also result in the determination that a determinant has a statistically significant effect on an endogenous variable, when such a finding would not have been the case if the unmeasured variable had been included in the path model. This error is referred to as LOVE.

The omitted variables problem is perhaps best understood by first looking at the basic mathematics supporting path modeling. In
order to clearly demonstrate this issue, we outline a series of progressively more complex path models based on standardized variables (i.e., β will be used as the symbol for path coefficients and regression weights). These models may then be generalized to the case of latent variables in SEM as the underlying conceptual issues are the same. The simplest linear causal model includes one exogenous variable (X) and a single endogenous variable (Y). Assuming that both are expressed in standard score form, the relationship between them can be expressed as
$$Y = \beta_{yx} X + d \tag{4.1}$$
where βyx is the standardized regression coefficient, and d is a disturbance term composed of (a) random shocks, (b) nonsystematic measurement error, (c) unmeasured relevant causes, and (d) unmeasured nonrelevant causes (James et al., 1982). Random shocks can be thought of as unstable causal influences, measurement error refers to nonsystematic error, and unmeasured causes are omitted variables (see James et al., 1982). Whether or not a cause is relevant depends on the nature of its relationship with other variables in the model and is illustrated below. Figure 4.1 illustrates the path model for the case of a single causal exogenous variable and a single endogenous variable. In Figure 4.1a, the disturbance term (d) consists exclusively of random shocks (RS), measurement error (ME), and unmeasured nonrelevant causes (NRC). For this model, the expected relationship between X and Y is given by the equation
$$E(XY) = \beta_{yx} E(XX) + E(Xd) \tag{4.2}$$

For Figure 4.1a, E(XY) reduces to βyx, as E(XX) = 1.0 for standardized variables and E(Xd) = 0 because the expected relationship between each of the three components of d (random shocks, measurement error, nonrelevant causes) and X equals zero. In this case rxy is an unbiased estimate of the causal parameter βyx. In Figure 4.1b, however, an additional component is present in the disturbance term, an omitted relevant cause (O). As before, the expected relationship between the random shocks, measurement error, and nonrelevant causes and X equals zero. However, the expected relationship between X and d equals rxoβyo, as there is an indirect relationship between X and d due to the omitted variable that is present in d.

[Figure 4.1 Path model for one exogenous and one endogenous variable. (a) X → Y with path βyx and disturbance d (= RS + ME + NRC). (b) X → Y with path βyx and disturbance d (= RS + ME + NRC + O), with X and the omitted cause correlated rxo.]

An important concept to highlight is that the relevance of an omitted determinant of the endogenous variable is based entirely on the omitted variable’s relationship with other variables in the model. That is, if the omitted causal variable correlates with other determinants of Y, the omitted variable is by definition a relevant omitted variable. Conversely, if the omitted variable does not correlate with other determinants of Y, it is by definition a nonrelevant cause of Y.

Consider now the case of a path model in which one of two exogenous variables is erroneously omitted from the path model (O in Figure 4.2). Assume further that O correlates significantly with both X and Y. In this case, the measured correlation between X and Y reflects not only the direct effect of X on Y, but also the indirect effect of X on Y via the shared correlation both variables have with O. In other words, the observed correlation is determined by the equation

$$r_{xy} = \beta_{yx} + r_{xo}\beta_{yo} \tag{4.3}$$

[Figure 4.2 Path model for two exogenous variables (one omitted). (a) X1 and X2, correlated rx1x2, each with a path to Y (βyx1, βyx2), plus disturbance d. (b) X and the omitted variable O, correlated rxo, each with a path to Y (βyx, βyo), plus disturbance d.]
However, because O is omitted from the path model, the (naively) estimated path between X and Y (βyx) will be equal to rxy, though rxy is actually determined by the effect of both βyx and rxoβyo. As a result, rxy as an estimate of βyx will be biased by an amount equal to rxoβyo. The effect of rxo is obvious. If X were not correlated with O, then rxy would not be affected by O and rxy would be an unbiased estimate of βyx. In this case, O is a nonrelevant omitted cause of Y. That is, its omission from the path equation has minimal effect on the estimated path coefficient of the included exogenous variables or on their associated tests of statistical significance. Conversely, if X were nontrivially correlated with O, rxy would differ from βyx by an amount equal to rxoβyo so that rxy would be a biased estimate of βyx. This bias can affect tests of statistical significance and lead to erroneous conclusions regarding the model. In this case, O is a relevant omitted cause of Y.

Although the potential biasing effect of rxo on βyx is obvious, the effect of βyo is less transparent. The equation for the path coefficient βyo is
$$\beta_{yo} = \frac{r_{yo} - r_{yx} r_{xo}}{1 - r_{xo}^{2}} \tag{4.4}$$
so that in order for βyo to have a biasing effect on rxy, which could be taken as the estimate of βyx, the correlation between X and O must be nonzero. If the correlation between X and O is nontrivially positive, bias in βyx will be greater when the correlation between Y and O is large and the correlation between Y and X is small. In order to provide some context for illustration, Table 4.1 includes several hypothetical values for rxy, rxo, and ryo. Note that no values of rxo = 0 are presented because there is no bias in rxy as an estimate of βyx when there is no correlation between the exogenous variable and the omitted variable (i.e., O is a nonrelevant cause of Y). As can be seen in Table 4.1, bias is greatest when the correlation between X and Y is somewhat low (.20) yet the omitted variable correlates highly with both X and Y. This is the classic third variable problem (e.g., the spurious correlation between ice cream sales and drowning deaths) and a primary reason that correlation cannot be interpreted as causation. In this case, much of the effect attributed to the relationship between X and Y is actually due to their mutual correlation with and/or dependence on O.
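As a worked sketch, Equations 4.3 and 4.4 can both be recovered from the normal equations of the true two-determinant model. The symbol d′ is shorthand introduced here (it is not the chapter’s notation) for the part of the disturbance that remains once O is modeled; it is assumed to be uncorrelated with X and O, and all variables are standardized.

```latex
\begin{aligned}
Y &= \beta_{yx} X + \beta_{yo} O + d' \\
r_{xy} &= E(XY) = \beta_{yx} + \beta_{yo}\, r_{xo} \\
r_{yo} &= E(OY) = \beta_{yx}\, r_{xo} + \beta_{yo} \\
r_{yo} - r_{xy}\, r_{xo} &= \beta_{yo}\left(1 - r_{xo}^{2}\right)
\;\Longrightarrow\;
\beta_{yo} = \frac{r_{yo} - r_{xy}\, r_{xo}}{1 - r_{xo}^{2}}
\end{aligned}
```

The second line is Equation 4.3; substituting it into the third line and solving for βyo gives Equation 4.4. Interchanging the roles of X and O yields the corresponding expression for the true βyx.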
Table 4.1 Biasing Effects of an Omitted Variable in a Two-Determinant Model

β̂yx = rxy    rxo     ryo     βyo      βyx      Bias
0.00         0.2     0.00     0.00     0.00     0.00
0.00         0.2     0.20     0.21    –0.04     0.04
0.00         0.2     0.60     0.63    –0.13     0.13
0.2          0.2     0.00    –0.04     0.21    –0.01
0.2          0.2     0.20     0.17     0.17     0.03
0.2          0.2     0.60     0.58     0.08     0.12
0.6          0.2     0.00    –0.13     0.63    –0.03
0.6          0.2     0.20     0.08     0.58     0.02
0.6          0.2     0.60     0.50     0.50     0.10
0.00         0.6     0.00     0.00     0.00     0.00
0.00         0.6     0.20     0.31    –0.19     0.19
0.00         0.6     0.60     0.94    –0.56     0.56
0.2          0.6     0.00    –0.19     0.31    –0.11
0.2          0.6     0.20     0.13     0.13     0.08
0.2          0.6     0.60     0.75    –0.25     0.45
0.6          0.6     0.00    –0.56     0.94    –0.34
0.6          0.6     0.20    –0.25     0.75    –0.15
0.6          0.6     0.60     0.38     0.38     0.23

Note. Bias is the estimated path coefficient (β̂yx = rxy) minus the true path coefficient βyx. This value is equal to rxoβyo. Conditions in which rxo = 0 are not displayed, as there is no bias under these conditions.
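The entries of Table 4.1 follow directly from Equation 4.4 and its counterpart for βyx. The short Python sketch below recomputes them; the function name `two_determinant_bias` and the printing layout are illustrative choices made here, not part of the chapter.

```python
def two_determinant_bias(r_xy, r_xo, r_yo):
    """Return (beta_yo, beta_yx, bias) for one observed and one omitted cause.

    All inputs are correlations among standardized variables.
    """
    denom = 1.0 - r_xo ** 2
    beta_yo = (r_yo - r_xy * r_xo) / denom   # Equation 4.4
    beta_yx = (r_xy - r_yo * r_xo) / denom   # same formula with X and O interchanged
    bias = r_xy - beta_yx                    # algebraically equal to r_xo * beta_yo
    return beta_yo, beta_yx, bias


if __name__ == "__main__":
    # Same grid of correlations as Table 4.1.
    for r_xo in (0.2, 0.6):
        for r_xy in (0.0, 0.2, 0.6):
            for r_yo in (0.0, 0.2, 0.6):
                b_yo, b_yx, bias = two_determinant_bias(r_xy, r_xo, r_yo)
                print(f"{r_xy:5.2f} {r_xo:5.2f} {r_yo:5.2f} "
                      f"{b_yo:6.2f} {b_yx:6.2f} {bias:6.2f}")
```

For example, the condition rxy = .20, rxo = .60, ryo = .60 gives βyo = 0.75, βyx = –0.25, and a bias of 0.45, matching the corresponding row of the table.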
Note that when the correlation between the endogenous variable (Y) and the omitted variable (O) is close to zero, βyo can take on negative values. When βyo is negative and rxo is positive, rxy (which is used to estimate βyx but is mathematically equal to βyx + rxoβyo) will actually be smaller than βyx. In this case, the omission of O causes an underestimate of the path coefficient between X and Y and variable O is said to have a suppressor effect such that its inclusion in the model serves to increase the estimated path coefficient between X and Y. Examples of such negative bias are present in Table 4.1. Suppressor effects are most readily manifested when the omitted variable has a very low correlation with
the endogenous variable but a moderate or large correlation with the exogenous variable in question. In such cases, the true path coefficient for the observed exogenous variable is considerably larger than the zero-order correlation between the exogenous variable and endogenous variable that is used as an estimate of the path coefficient. In sum, several important points result from the discussion of a model with one observed determinant (X) and one omitted determinant (O) of a single endogenous variable:
1. rxy will be a biased estimate of βyx to the extent that there exist omitted relevant causes of Y.
2. This bias will be upward (i.e., rxy > βyx) to the extent that rxoβyo > 0.
3. By extension, both rxo and βyo must be nonzero for bias to occur. If either rxo ≈ 0 (O is unrelated to X and thus is a nonrelevant cause) or βyo ≈ 0 (there is no unique effect of O on Y; it is not a determinant of Y), no bias occurs.
4. If one of the terms, rxo or βyo, is negative and the other is positive, a suppression situation occurs (i.e., rxy < βyx).
5. If rxo and βyo are both negative, there will be upward bias in the estimation of βyx from rxy.
Violated Assumptions

An omitted relevant variable represents a violation of the assumption of self-containment in causal modeling (James et al., 1982; Simon, 1977) and is but one type of model misspecification. We cannot isolate an endogenous variable from all potential causal explanatory variables in the social sciences. Instead, we replace the assumption of isolation with one of pseudo-isolation by assuming that the disturbance term, variance in the endogenous variable not accounted for by its modeled causes, is uncorrelated with exogenous variables (Bollen, 1989), or with endogenous variables that precede the variable in question in the causal path (Duncan, 1975; James, 1980). This can be seen by again examining Figure 4.2b. In Figure 4.2b, the disturbance term, d, would now include the effect of the standardized omitted variable (βyo). Clearly, the self-containment assumption is violated, as X will correlate with the disturbance term by a magnitude of rxoβyo.
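A small simulation can make the violated assumption concrete. The sketch below generates standardized X and O with a chosen correlation, builds Y from both, and then treats O as part of the disturbance. The function name `simulate_omitted`, the sample size, the seed, and the parameter values in the usage note are hypothetical choices for illustration only.

```python
import numpy as np

def simulate_omitted(beta_yx, beta_yo, r_xo, n=200_000, seed=0):
    """Simulate Y = beta_yx*X + beta_yo*O + e and omit O from the model.

    Returns (r_xy, cov_xd): the naive estimate of beta_yx and the
    covariance between X and the disturbance d = beta_yo*O + e.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(n)
    # O is standardized and correlates r_xo with X.
    o = r_xo * x + np.sqrt(1.0 - r_xo ** 2) * rng.standard_normal(n)
    # Residual variance chosen so that Y has unit variance.
    explained = beta_yx ** 2 + beta_yo ** 2 + 2.0 * beta_yx * beta_yo * r_xo
    e = np.sqrt(1.0 - explained) * rng.standard_normal(n)

    d = beta_yo * o + e        # disturbance when O is left out of the model
    y = beta_yx * x + d

    r_xy = np.corrcoef(x, y)[0, 1]
    cov_xd = np.cov(x, d)[0, 1]
    return r_xy, cov_xd
```

With, say, βyx = .30, βyo = .50, and rxo = .40, the naive estimate should land near .30 + (.40)(.50) = .50 and the covariance between X and d near .20, up to sampling error, which is the pattern Equation 4.3 predicts.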
Dr. StrangeLOVE
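[Figure 4.3: path diagram. X and O are correlated (rxo); X → M (βmx); O → M (βmo); X → Y (βyx); M → Y (βym); O → Y (βyo); d is the disturbance term of Y.]
Figure 4.3 Partially mediated path model with omitted variable.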
More Complex Models

Although the effects of the omitted variable are clearly visible in a model with two exogenous variables, things rapidly become more complex when more variables are added to the model. Figure 4.3 depicts a path model in which a mediator (M) partially mediates the relationships of an exogenous variable, X, and an omitted relevant causal variable, O, with the endogenous variable (Y). The path model for M is identical to that of a model with two exogenous variables. As in the previous example, if O is omitted, then the expected path coefficients and potential for bias are identical to those of a path coefficient with two determinants. There are three causes of Y, yet one of these is omitted. The true population path equation for this model is

$$Y = \beta_{yx} X + \beta_{ym} M + \beta_{yo} O + d \tag{4.5}$$
and the path coefficient βyx in the true model is given as
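$$\beta_{yx} = \frac{r_{yx}\left(1 - r_{mo}^{2}\right) + r_{ym}\left(r_{xo} r_{mo} - r_{xm}\right) + r_{yo}\left(r_{xm} r_{mo} - r_{xo}\right)}{1 + 2 r_{xm} r_{mo} r_{xo} - r_{xo}^{2} - r_{mo}^{2} - r_{xm}^{2}} \tag{4.6}$$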
More complicated models are obviously possible as well, though algebraic expressions for the path coefficients rapidly become unwieldy. In the current example, if variable O were omitted, the estimated path coefficient for the direct effect of X on Y would be
Adam W. Meade, Tara S. Behrend, and Charles E. Lance
that of a two-determinant model, in which the effect of the omitted variable is ignored:
$$\hat{\beta}_{yx} = \frac{r_{yx} - r_{ym} r_{xm}}{1 - r_{xm}^{2}} \tag{4.7}$$
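Equations 4.6 and 4.7 can be evaluated directly from a set of correlations. The Python sketch below is a hedged illustration, not the authors' simulation code: the function names are hypothetical, and the input correlations are taken from one condition reported in Table 4.2 (ryo = .60, rxo = .60, rmo = .00).

```python
def beta_yx_true(r_yx, r_ym, r_yo, r_xm, r_xo, r_mo):
    """Equation 4.6: path coefficient of X with X, M, and O all in the model."""
    numerator = (r_yx * (1 - r_mo**2)
                 + r_ym * (r_xo * r_mo - r_xm)
                 + r_yo * (r_xm * r_mo - r_xo))
    denominator = (1 + 2 * r_xm * r_mo * r_xo
                   - r_xo**2 - r_mo**2 - r_xm**2)
    return numerator / denominator


def beta_yx_estimated(r_yx, r_ym, r_xm):
    """Equation 4.7: path coefficient of X when O is omitted."""
    return (r_yx - r_ym * r_xm) / (1 - r_xm**2)


# One condition from Table 4.2: r_yx=.30, r_ym=.20, r_yo=.60, r_xm=.30, r_xo=.60, r_mo=.00
true_coef = beta_yx_true(0.30, 0.20, 0.60, 0.30, 0.60, 0.00)
est_coef = beta_yx_estimated(0.30, 0.20, 0.30)
bias = est_coef - true_coef  # bias defined as estimated minus true coefficient
print(round(true_coef, 2), round(est_coef, 2), round(bias, 2))  # prints: -0.22 0.26 0.48
```

For other conditions, values computed this way may differ from the printed table in the last digit because of rounding.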
In order to further illustrate the effects of an omitted variable in this model, data were simulated for several levels of correlation between variables O and Y. Table 4.2 contains the level of bias observed in the path coefficient of X for different levels of correlation between the omitted variable and the other causal variables in the model. Readily apparent from Table 4.2 is that the magnitude of bias is not large in any of the conditions when the correlation between O and Y is .20. Results are more mixed for those conditions in which the correlation between the omitted variable and Y is .60. In these conditions, the magnitude of the bias in the path coefficient of X can be large, but only when the correlation between X and O is also quite large. Also, the magnitude of the bias is mitigated somewhat by the correlation between the omitted variable and M, though the bias is still sizable. Note the values presented in Table 4.2 that represent the case in which there is a relatively small correlation between X and Y, and large correlations between O and both Y and X. Under these circumstances, bias can be sizable.

We set the correlations in Tables 4.1 and 4.2 to arbitrary values in order to demonstrate their effects, but in practice correlation coefficients may not plausibly vary independently of one another (Mauro, 1990). In other words, a situation in which two variables correlate very highly, and one of those two correlates highly with a third variable while the other correlates negatively with the third variable, is mathematically improbable. The patterns of correlations that result in the most bias are those in which there is a very low correlation between the measured determinants and the endogenous variable, and high correlations both between the measured determinants and the omitted variables and between the omitted and endogenous variables (refer to Tables 4.1 and 4.2).
While such patterns of correlations are mathematically possible, they may be unlikely in some domains of study given what is known from previous research. To summarize, omitted variables can introduce bias in estimated path coefficients and this bias may be positive or negative in
Table 4.2 Biasing Effects of an Omitted Variable in a Three-Determinant Model

ryx    rym    ryo    rxm    rxo    rmo    βyx     β̂yx    Bias
0.30   0.20   0.20   0.30   0.00   0.00    0.26   0.26    0.00
0.30   0.20   0.20   0.30   0.00   0.20    0.28   0.26   -0.02
0.30   0.20   0.20   0.30   0.00   0.60    0.31   0.26   -0.05
0.30   0.20   0.20   0.30   0.20   0.00    0.23   0.26    0.03
0.30   0.20   0.20   0.30   0.20   0.20    0.24   0.26    0.02
0.30   0.20   0.20   0.30   0.20   0.60    0.26   0.26    0.00
0.30   0.20   0.20   0.30   0.60   0.00    0.22   0.26    0.04
0.30   0.20   0.20   0.30   0.60   0.20    0.25   0.26    0.02
0.30   0.20   0.20   0.30   0.60   0.60    0.30   0.26   -0.04
0.30   0.20   0.60   0.30   0.00   0.00    0.26   0.26    0.00
0.30   0.20   0.60   0.30   0.00   0.20    0.30   0.26   -0.04
0.30   0.20   0.60   0.30   0.00   0.60    0.44   0.26   -0.18
0.30   0.20   0.60   0.30   0.20   0.00    0.14   0.26    0.12
0.30   0.20   0.60   0.30   0.20   0.20    0.18   0.26    0.08
0.30   0.20   0.60   0.30   0.20   0.60    0.25   0.26    0.01
0.30   0.20   0.60   0.30   0.60   0.00   -0.22   0.26    0.48
0.30   0.20   0.60   0.30   0.60   0.20   -0.12   0.26    0.38
0.30   0.20   0.60   0.30   0.60   0.60   -0.12   0.26    0.38

Note. βyx represents the true path coefficient of the exogenous variable X in the completely specified model. β̂yx represents the estimated path coefficient of X in the omitted variable model. Bias is the difference between these two.
direction. The issue, then, is under what conditions it is possible for an omitted variable to bias path coefficients. Below is a summary for a model with one observed exogenous variable and one relevant omitted variable:

• If O is uncorrelated with the exogenous variable, rxy is an unbiased estimator of byx and the omitted variable has no effect.
• If the variance in Y accounted for by O is completely redundant with the variables in the model, its unique effect (βyo) will be near zero and it will have little biasing effect.
• If O is uncorrelated with the endogenous variable but strongly correlated with the exogenous variable, rxy may underestimate byx (i.e., a suppressor effect).
Thus, there are three conditions that must be present in order for an omitted variable to cause positive bias in estimated path coefficients; that variable must (a) correlate at a nonzero level with other determinants of Y, (b) not be completely redundant with other variables included in the path model, and (c) correlate with the endogenous variable. If (a) and (b) are true, but (c) is not, the omitted variable may serve to artificially deflate the estimate of the path coefficient of the variables included in the model. In sum, the potential for LOVE is greatest when the omitted variable correlates highly with the outcome variable and moderately with other determinants in the model.

Path Coefficient Bias Versus Significance Testing

It is important to make a distinction between the biasing effect of omitted variables on the magnitude of path coefficients and the effect of omitted variables on the significance tests of those path coefficients. Generally speaking, in theory building via path analysis and SEM, there are two important outcomes of interest to the researcher: the magnitudes of the estimates of the path coefficients themselves and associated significance tests. Often in early stages of research, the primary outcome of interest in path analyses is the significance test associated with the path coefficient. In other words, the answer to the question “does the variable have a unique effect on the outcome?” would seem more important than the question “what is the precise magnitude of the unique effect of the variable on the outcome?” If early forays into model testing with a given set of variables indicate that the effect of a determinant on an endogenous variable is nonsignificant, future researchers are less likely to include this variable as a measured cause than if the variable had shown a significant effect on the outcome.
In this context, the magnitude of the path coefficient per se is less important than the decision as to the presence or absence of an effect of X on Y. If there does appear to be an effect (i.e., the test is significant), then future use and, importantly, replication of this effect is much more likely. While the rough magnitude of the effect
is undoubtedly important, small bias in the path coefficients would likely be of little concern so long as the conclusion of the significance test is not affected at this stage of investigation.

The second outcome of path analysis is the magnitude of the path coefficients themselves. Estimates of path coefficients are important in that standardized coefficients are one index of the unique variance in the endogenous variable accounted for by the determinant. Additionally, unstandardized coefficients can be compared over time, and cumulative evidence can be collected such that the relative effect of a determinant on an outcome can be estimated. As research cumulates over time, the precision of estimated paths becomes important to future meta-analysts such that an accurate estimate of the effect of a determinant on an endogenous variable can be calculated. Thus, even though precise estimates of effects may not be of primary interest to a researcher in early stages of research on a topic, these estimates take on additional importance over time as research accumulates and meta-analyses are conducted.

Recall that if the omitted variable does not correlate with the endogenous variable but correlates with other variables in the model, it may act as a suppressor variable. This was shown in Tables 4.1 and 4.2, where the exclusion of an omitted variable resulted in negative bias of the estimated path coefficient. That is, its inclusion in the model could serve to increase the estimated path coefficients of the observed variables. In regard to significance testing, omitted variables that do not correlate with the endogenous variable are potentially problematic in that they may result in Type II errors (i.e., failure to detect an effect that truly exists).
However, reviewer criticisms of a lack of comprehensive path models typically center more on the potential upward biasing effects of omitted variables and associated Type I error (i.e., wrongly identifying an effect that does not exist). The focus on Type I errors is understandable as such errors may translate to immediate implications for practice and use of a determinant variable, whereas Type II errors are less likely to be published and likely will be rectified in future studies. If Type II error is seen as less problematic than Type I error, the requirement of a significant correlation between the omitted variable and the outcome may be added to the list of conditions that must be met before the possibility of an omitted variable becomes a concern in path models. Omitted variables that do not correlate with the outcome cannot cause
upward bias in path coefficient estimates, which is typically the focus of LOVE concerns.

Minimizing the Risk of LOVE

There are specific conditions under which omitted variables can be problematic, and it is true that no matter how comprehensive a path model, there are always omitted relevant variables in organizational research. We have also illustrated that there can be substantial bias under some conditions; thus, there is a kernel of truth relating to LOVE in organizational research. To this extent, educating researchers on the ways to minimize the risk of omitted variable problems is of paramount importance. There are several ways in which organizational researchers can minimize the risk of omitted variables biasing path coefficients, discussed below.

Experimental Control

First, one could incorporate design characteristics that minimize the correlation between measured exogenous variables and omitted variables. Random assignment of participants is extremely successful in controlling for a wide range of known or unknown omitted individual difference variables. As we have emphasized, there can be no possible biasing effect of an omitted variable if that variable does not correlate with the observed variables in the path model (given sufficient sample size). As such, random assignment is highly effective for controlling for almost any individual difference variable in a path model. Although random assignment may not be possible in many instances of organizational research, there are some cases in which it may be employed. For example, participants may be randomly assigned to different types of training courses, reward systems, equipment and other environmental factors, or organizational interventions for which the effectiveness may be evaluated.
In more mathematical terms, recall that in the case of one exogenous variable (X) and one omitted variable (O), the estimated effect of X on the endogenous variable (Y) is the zero-order correlation between X and Y. However, the true effect of X on Y should be given as Equation 4.8:
$$\beta_{yx} = \frac{r_{xy} - r_{yo} r_{xo}}{1 - r_{xo}^{2}} \tag{4.8}$$
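The role of random assignment in Equation 4.8 can be illustrated with a small simulation. The Python sketch below is hypothetical throughout: the seed, the sample size, and the self-selection mechanism are assumptions chosen for illustration, and none of the variable names come from the chapter. It compares the correlation between a treatment indicator X and an unmeasured individual difference O under random assignment and under self-selection.

```python
import random


def pearson_r(a, b):
    """Pearson correlation computed from deviation scores."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / (s_aa * s_bb) ** 0.5


rng = random.Random(2009)  # fixed seed so the sketch is reproducible
n = 20000                  # assumed (large) sample size
omitted = [rng.gauss(0, 1) for _ in range(n)]  # unmeasured variable O

# Random assignment: treatment X is decided by a coin flip, ignoring O.
x_random = [rng.choice([0, 1]) for _ in range(n)]
# Self-selection: people high on O are more likely to end up treated.
x_selected = [1 if o + rng.gauss(0, 1) > 0 else 0 for o in omitted]

r_xo_random = pearson_r(x_random, omitted)
r_xo_selected = pearson_r(x_selected, omitted)
print(f"r_xo under random assignment: {r_xo_random:.3f}")
print(f"r_xo under self-selection:    {r_xo_selected:.3f}")
```

With rxo close to zero, the ryo·rxo term in the numerator and the rxo² term in the denominator of Equation 4.8 both become negligible, whereas under self-selection neither does.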
When random assignment is used, the correlation between X and O will be near zero (with sufficient sample size). Thus, Equation 4.8 reduces to rxy and there is no bias.

More Inclusive Models

Second, researchers should include as many known causes of the endogenous variable as is practically possible in the path model. The potential for bias in path coefficient estimates caused by omitted variables is much greater when they serve as unique causal agents of the endogenous variable. Recall that for a two-determinant model with one determinant omitted, the bias present is equal to rxoβyo. By incorporating more determinants of the outcome, the unique effects of omitted variables may be reduced as βyo approaches zero. Note, however, that there is a paradoxical side effect of including more variables. That is, each additional determinant that is included in the model is also prone to LOVE and is subject to the assumption of model self-containment.

Use Previous Research to Justify Assumptions

Researchers may also use what is already known from past research to demonstrate that omitted variables are not likely to be problematic. For example, when estimating the effects of ability determinants of job performance, one could legitimately leave out entire classes of other performance determinants such as personality and motivation, because these are likely to be uncorrelated with ability determinants and therefore are nonrelevant causes (Ackerman & Heggestad, 1997; Sackett, Gruys, & Ellingson, 1998; Salgado, Viswesvaran, & Ones, 2001; Schmidt & Hunter, 1998; see also Lance & James, 1999). On the other hand, if both verbal and quantitative aptitude were thought to be causes of employee job performance, it is unlikely that the omission of similar types of tests (e.g., mechanical ability) would
produce a strong biasing effect on path coefficients of those tests in the model, as mechanical ability is exceedingly likely to have a large correlation with (i.e., be redundant with) the measured ability test variables. As such, the plausibility of bias due to omitting mechanical ability tests is very low as again βyo will be closer to zero. Put differently, in many instances nonrelevant causes can largely be ignored because they are either (a) not related to measured causes or (b) largely redundant with relevant causes that are already measured. To this extent, prior research on correlates of both the outcome and other determinants can provide guidance on what variables are essential to include in the model and which may be safely omitted.

Consideration of Research Purpose

If the goal is to provide a precise estimate of path coefficients, or to compare the relative variance accounted for by different determinants, omitted variables are considerably more problematic than if the goal is to test the statistical significance of the effect of a determinant on an outcome. Examining again the simple two-determinant case, influence due to omitted variables can result in bias in the estimated path coefficient (rxy) with respect to its true value (Equation 4.8). However, with large sample sizes, even sizable bias in estimated path coefficients is less likely to change decisions drawn from the statistical significance test associated with those coefficients. With large sample sizes, power is such that even small estimated effects tend to be statistically significant.

In sum, omitted variables are a fact of life in organizational research and they can be problematic.
Researchers should be particularly vigilant in cases in which (a) there are a large number of determinants of the outcome variable, (b) the study in question includes only a small subset of those determinants, (c) it is likely that the omitted variables have moderate or large correlations with the measured determinants, and (d) it is likely that the omitted variables would account for unique variance in the outcome variables. However, the notion that omitted variables are always problematic is a myth as the threat to the inferences that we tend to draw may not be as serious as some have believed.
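The point about sample size and significance decisions can be made concrete with the usual t statistic for a correlation, t = r·sqrt((n − 2)/(1 − r²)). The Python sketch below is illustrative only: the "true" and "biased" values, the sample sizes, and the use of 1.96 as a large-sample two-tailed cutoff are all assumptions made for this example, and treating the true coefficient as if it were a zero-order correlation is a simplification.

```python
import math


def t_for_correlation(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r**2))


CRITICAL = 1.96  # assumed two-tailed .05 cutoff (large-sample approximation)
true_effect, biased_estimate = 0.10, 0.15  # hypothetical values

for n in (100, 300, 1000):
    t_true = t_for_correlation(true_effect, n)
    t_biased = t_for_correlation(biased_estimate, n)
    same_decision = (abs(t_true) > CRITICAL) == (abs(t_biased) > CRITICAL)
    print(f"n={n}: t(true)={t_true:.2f}, t(biased)={t_biased:.2f}, "
          f"same decision: {same_decision}")
```

In this sketch the bias changes the significance decision only at the intermediate sample size; with the largest sample both values are significant, so the decision about the presence of an effect is unaffected even though the magnitude remains biased.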
References

Ackerman, P. L., & Heggestad, E. D. (1997). Intelligence, personality, and interests: Evidence for overlapping traits. Psychological Bulletin, 121, 219–245.
Bollen, K. A. (1989). Structural equations with latent variables. Oxford, England: John Wiley and Sons.
Colquitt, J. A., LePine, J. A., & Noe, R. A. (2000). Toward an integrative theory of training motivation: A meta-analytic path analysis of 20 years of research. Journal of Applied Psychology, 85, 678–707.
Duncan, O. D. (1975). Introduction to structural equation models. New York: Academic Press.
Hanushek, E. A., & Jackson, J. E. (1977). Statistical methods for social scientists. San Diego, CA: Academic Press.
James, L. R. (1980). The unmeasured variables problem in path analysis. Journal of Applied Psychology, 65, 415–421.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models and data. Beverly Hills, CA: Sage.
Kenny, D. A. (1979). Correlation and causality. New York: Wiley-Interscience.
Lance, C. E., & James, L. R. (1999). ν2: A proportional variance-accounted-for index for some cross-level and person-situation research designs. Organizational Research Methods, 2, 395–418.
Mauro, R. (1990). Understanding L.O.V.E. (left out variables error): A method for estimating the effects of omitted variables. Psychological Bulletin, 108, 314–332.
Prussia, G. E., & Kinicki, A. J. (1996). A motivational investigation of group effectiveness using social-cognitive theory. Journal of Applied Psychology, 81, 187–198.
Prussia, G. E., Kinicki, A. J., & Bracker, J. S. (1993). Psychological and behavioral consequences of job loss: A covariance structure analysis using Weiner’s (1985) attribution model. Journal of Applied Psychology, 78, 382–394.
Sackett, P. R., Gruys, M. L., & Ellingson, J. E. (1998). Ability-personality interactions when predicting job performance. Journal of Applied Psychology, 83, 545–556.
Sackett, P. R., Laczo, R. M., & Lippe, Z. P. (2003). Differential prediction and the use of multiple predictors: The omitted variables problem. Journal of Applied Psychology, 88, 1046–1056.
Salgado, J. F., Viswesvaran, C., & Ones, D. S. (2001). Predictors used for personnel selection: An overview of constructs, methods and techniques. In D. S. Ones et al. (Eds.), Handbook of industrial, work and organizational psychology, Vol. 1: Personnel psychology (pp. 165–199). London: Sage Publications.
Schmidt, F. L., & Hunter, J. E. (1998). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 85 years of research findings. Psychological Bulletin, 124, 262–274.
Simon, H. A. (1977). Models of discovery: And other topics in the methods of science. Dordrecht, Holland: D. Reidel.
5 The Truth(s) on Testing for Mediation in the Social and Organizational Sciences

James M. LeBreton, Jane Wu, and Mark N. Bing
One of the principal goals of scientific inquiry is the elucidation of relationships among constructs, such that strong causal inferences may be drawn (Platt, 1964). The realization of this goal involves the use of scientific concepts and methods for the construction and testing of causal systems. In the social sciences, the most basic causal system consists of two unobserved or latent psychological constructs, two observed variables measuring and thus linked to those constructs, a proposition defining the construct-construct linkage, a hypothesis defining the variable-variable linkage, and a statement of the boundary conditions delimiting the circumstances under which our causal system is expected to hold (Bacharach, 1989). Figure 5.1 presents this system, in which the unidirectional arrows linking constructs, measures, and measures to constructs are assumed to be lawful causal relationships. Specifically, changes in Constructs X and Y are assumed to cause changes in their respective measures, changes in Construct X are proposed to cause changes in Construct Y, and confirmation of this proposition is hypothesized to result in a verifiable statistical relationship between the measures of X and Y.

Footnote: To be consistent with the extant literature, we use circles to denote latent psychological constructs, squares to denote the manifest variables measuring these constructs, double-headed arrows to denote correlational relationships, and single-headed arrows to denote causal or directional relationships.

This basic causal system may be thought of as a primary theoretical system (Bacharach, 1989), a principal nomological network
[Figure 5.1: diagram. Construct X → Construct Y (causal proposition); Construct X → Measure of X; Construct Y → Measure of Y; Measure of X → Measure of Y (causal hypothesis); the system is enclosed by its boundary conditions.]
Figure 5.1 Basic causal system.
(Cronbach & Meehl, 1955), or a basic construct validation framework (Binning & Barrett, 1989). This basic causal structure is often extended to include multiple constructs and multiple manifest indicators (i.e., measures) of each construct. Furthermore, in the social sciences, it is acknowledged that manifest indicators are imprecise representations of the latent constructs, and thus contain some degree of measurement error (Lord & Novick, 1968).

Instrumental to the accumulation of scientific knowledge is the process of articulating causal propositions that link constructs, and more specifically, causal hypotheses that link manifest variables that measure those constructs. One particularly popular and useful causal hypothesis is the mediation hypothesis.

Footnote: Exogenous variables have no specified causal antecedents, whereas endogenous variables are specified as being caused by other variables in the causal model.

Complete, perfect, or full mediation occurs when the effect of an antecedent variable (X) on a consequent variable (Y) is transmitted via an intermediate mediator variable (M). Figure 5.2D portrays
[Figure 5.2: five path diagrams. (A) Measure X → Measure M, path a. (B) Measure M → Measure Y, path b. (C) Measure X → Measure Y, path c. (D) Measure X → Measure M → Measure Y, paths a and b. (E) Measure X → Measure M (path a), Measure M → Measure Y (path b’), and Measure X → Measure Y (path c’).]
Figure 5.2 Inferences involved in tests of mediation.
these structural relationships at the level of observed variables (i.e., measured constructs). Here, X has a direct effect on M (i.e., X→M), M has a direct effect on Y (i.e., M→Y), but X only exerts an indirect effect on Y via its influence on M (i.e., X→M→Y). In contrast, partial
mediation occurs when the effect of an antecedent exogenous variable (X) on an endogenous consequent (Y) is transmitted both directly and via an intermediate mediator variable (M; see Figure 5.2E). With partial mediation X simultaneously influences Y directly (i.e., X→Y) and indirectly (i.e., X→M→Y).

Although a variety of different statistical procedures exist for testing mediation hypotheses, the most popular technique is the four-step procedure described by Baron and Kenny (1986). This is arguably one of the most influential and important articles ever published in the social sciences. It has had tremendous substantive impact in a number of disciplines by virtue of how it has been used to draw inferences concerning the tenability of mediation hypotheses. It has had tremendous methodological impact by virtue of the dozens of subsequent papers seeking to understand the optimal techniques for testing mediation hypotheses. It is hard to envision what tests of mediation would look like had the Baron and Kenny test not been introduced. The thesis of this chapter is that while their article was a catalyst for progress in the social and organizational sciences, the four-step test (like any statistical procedure) is not without its limitations. Our concern and criticism is not with the four-step test introduced in 1986. Rather, our concern is with the unquestioned faith in this test held by so many researchers in the social and organizational sciences. Below we (a) review the four-step test recommended by Baron and Kenny, (b) describe three statistical urban legends involving this technique and the evidence that has given rise to these legends over the last 20 years, (c) analyze each of the legends and provide evidence documenting our concerns with the use of the four-step test as the primary mechanism for drawing inferences of mediation, and (d) offer recommendations for researchers interested in testing mediation hypotheses in the future.
Baron and Kenny’s (1986) Four-Step Test of Mediation

According to Baron and Kenny (1986), a variable acts as a mediator when four conditions have been met using a four-step procedure involving a series of regression analyses.
Condition/Step 1

Variation in the antecedent variable (X) must be significantly related to variation in the consequent variable (Y). This condition is typically tested by regressing Y onto X. From this point forward, we will assume the variables are expressed in deviation score form (y and x, respectively). The equation corresponding to Condition 1 is given as

$$y = b_{yx} x + e_{1} \tag{5.1}$$
where byx = c in Figure 5.2C, and e1 corresponds to a disturbance term (which is typically assumed to be independently and identically distributed with a mean of zero and a constant variance). Condition 1 is confirmed if the unstandardized regression coefficient, byx, is statistically significant (or equivalently, if the correlation between Y and X, ryx, is statistically significant).

Condition/Step 2

Variation in the antecedent variable (X) must be significantly related to variation in the hypothesized mediator variable (M). This condition is typically tested by regressing M onto X:

$$m = b_{mx} x + e_{2} \tag{5.2}$$
where bmx = a in Figure 5.2A, and e2 corresponds to the disturbance term. Condition 2 is confirmed if bmx is statistically significant (or equivalently, if rmx is statistically significant).

Condition/Step 3

Variation in the hypothesized mediator variable (M) must be significantly related to variation in the consequent variable (Y) after controlling for the effects of the antecedent variable (X). This condition is typically tested by regressing Y onto X and M simultaneously:

$$y = b_{yx.m} x + b_{ym.x} m + e_{3} \tag{5.3}$$
where byx.m = c’ and bym.x = b’ in Figure 5.2E, and e3 is the disturbance term. Of critical import, bym.x corresponds to the effect of the mediator on the consequent after controlling for the effects of the antecedent. Condition 3 is confirmed if bym.x is statistically significant.

Condition/Step 4

The previously significant relationship between the antecedent variable (X) and the consequent variable (Y) is no longer significant after controlling for the effects of the hypothesized mediator variable (M). This condition is typically tested using the same regression procedure described in Step 3, but now the focus is on byx.m. Note that byx.m corresponds to the effect of the antecedent on the consequent after controlling for the effects of the mediator. Condition 4 is confirmed if byx.m is statistically nonsignificant, with “the strongest demonstration of mediation” occurring when this coefficient is zero (Baron & Kenny, 1986, p. 1176).

Evidence consistent with full or perfect mediation is established when all four conditions are satisfied and byx.m is zero. In contrast, evidence consistent with partial mediation is established when the first three conditions are satisfied, but the fourth condition is not satisfied. In essence, the Baron and Kenny (1986) test for mediation hinges on establishing a reduction in the magnitude of the effect of the antecedent variable (X) on the consequent variable (Y) by comparing byx to byx.m. If the effect of X on Y reduces to zero in Step 4, “strong evidence for a single, dominant mediator” (p. 1176) is claimed. However, when the effect in Step 4 is reduced such that byx.m is less than byx but byx.m is still greater than zero, the claim is for prima facie evidence for “the operation of multiple mediating factors” (p. 1176).
Given the complexity of most causal systems, Baron and Kenny suggested that “a more realistic goal may be to seek mediators that significantly decrease [byx.m in comparison to byx] rather than eliminating the relationship between the independent and dependent variables altogether” (p. 1176). In addition to testing for full and partial mediation, they suggested that the indirect effect of X on Y could be estimated as
$$\text{Indirect Effect} = b_{mx}\, b_{ym.x} \tag{5.4}$$
where bmx = a in Figure 5.2A and bym.x = b’ in Figure 5.2E; this indirect effect may also be tested for statistical significance (MacKinnon, Fairchild, & Fritz, 2007; MacKinnon, Lockwood, Hoffman, West, & Sheets, 2002; Shrout & Bolger, 2002; Sobel, 1982). Collectively, Equations 5.1–5.4 comprise a set of equations that have served as the basis for most tests of mediation in the social and organizational sciences. We will refer to this set of equations as the “Set 1” equations to distinguish them from a second set of equations introduced later in this chapter. Although Baron and Kenny (1986) noted that the Set 1 equations may be tested using sophisticated structural equation modeling (SEM) analyses, these equations are most typically tested using ordinary least-squares (OLS) regression analyses. We argue here that (a) several urban legends have formed around the use of the Baron and Kenny four-step approach to testing mediation, (b) this four-step approach has become sacrosanct statistical doctrine in the social and organizational sciences, (c) these urban legends are being perpetuated at a pandemic rate, while (d) the limitations associated with the four-step approach are largely ignored or discounted.

The Urban Legend: Baron and Kenny’s Four-Step Test Is an Optimal and Sufficient Test for Mediation Hypotheses

This statistical urban legend may be decomposed into three component statements:

Legend 1: A test of a mediation hypothesis should consist of the four steps articulated by Baron and Kenny (1986).
Legend 2: Their four-step procedure is the optimal test of mediation hypotheses.
Legend 3: Fulfilling the conditions articulated in their four-step test is sufficient for drawing conclusions about mediated relationships.
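The Set 1 equations (5.1–5.4) referred to above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Baron and Kenny's procedure in full: the data are invented, the function and variable names are hypothetical, and only the OLS coefficients are computed. The significance tests that each condition requires are deliberately left out.

```python
def center(values):
    """Convert raw scores to deviation scores."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def sp(a, b):
    """Sum of cross-products of two deviation-score lists."""
    return sum(u * v for u, v in zip(a, b))


def set1_coefficients(x_raw, m_raw, y_raw):
    """OLS coefficients for Equations 5.1-5.3 and the indirect effect (5.4)."""
    x, m, y = center(x_raw), center(m_raw), center(y_raw)
    sxx, smm, sxm = sp(x, x), sp(m, m), sp(x, m)
    sxy, smy = sp(x, y), sp(m, y)
    b_yx = sxy / sxx                        # Eq. 5.1: path c
    b_mx = sxm / sxx                        # Eq. 5.2: path a
    det = sxx * smm - sxm**2                # determinant of the normal equations
    b_yx_m = (sxy * smm - smy * sxm) / det  # Eq. 5.3: path c'
    b_ym_x = (smy * sxx - sxy * sxm) / det  # Eq. 5.3: path b'
    return {"c": b_yx, "a": b_mx, "c_prime": b_yx_m, "b_prime": b_ym_x,
            "indirect": b_mx * b_ym_x}      # Eq. 5.4


# Invented scores for eight hypothetical cases.
x_scores = [1, 2, 3, 4, 5, 6, 7, 8]
m_scores = [2, 1, 4, 3, 6, 5, 8, 7]
y_scores = [1, 3, 2, 5, 4, 7, 6, 9]

result = set1_coefficients(x_scores, m_scores, y_scores)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

When all three regressions are fit by OLS to the same cases, the total effect decomposes exactly as c = c’ + a·b’, so the reduction from byx to byx.m examined in Step 4 equals the indirect effect of Equation 5.4.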
The Kernel of Truth About the Urban Legends

Like many urban legends, there is undoubtedly some kernel of truth to each of these statements. An analysis of the epistemological evidence for these statements yields three forms of evidential support: technical evidence, evidence of orthodoxy in quantitative training,
and evidence for the ubiquity of application. The technical evidence giving rise to Legend 1 is fairly straightforward: Baron and Kenny (1986) did, as a matter of fact, articulate a four-step test of mediation. Thus, one could argue that there is nothing untrue about the first component statement. Furthermore, given the relative simplicity of the four-step test, it is typically implemented without error. The indirect evidence giving rise to Legends 2 and 3 is furnished by an examination of the orthodoxy in quantitative training and the ubiquity of application of the four-step test. We argue that the omnipresence of the four-step test of mediation in the social and organizational sciences furnishes indirect evidence for researchers’ beliefs in the superiority of this approach in testing for mediated relationships vis-à-vis other statistical approaches. That is, why would so many researchers apply the four-step test (and train their students to apply the four-step test) if they did not believe it was an optimal and sufficient test of mediation?

The Baron and Kenny four-step test has been applied in virtually every branch of social science, including social psychology (e.g., Brown & Smart, 1991), industrial and organizational psychology/organizational behavior (e.g., Skarlicki & Latham, 1996), industrial relations (e.g., Kim, 1999), marketing (e.g., Gurhan-Canli & Maheswaran, 2000), strategic management (e.g., Gong, Shenkar, Luo, & Nyaw, 2007), accounting (e.g., Nelson & Tayler, 2007), clinical psychology (e.g., Kerig, 1989), personality psychology (e.g., Conrad, 2006), developmental psychology (e.g., Eaton & Yu, 1989), cognitive psychology (e.g., Gilstrap & Papierno, 2004), nursing (e.g., Welch & Austin, 2001), education (e.g., Osborne, 2001), and communication (e.g., Reinhart, Marshall, Feeley, & Tutzauer, 2007). As of December 1, 2007, Baron and Kenny’s (1986) article had been cited roughly 9,000 times according to the Web of Science database!
In 2006 alone this article was frequently cited in many of the leading social and organizational science journals including Journal of Applied Psychology (22 times), Academy of Management Journal (8 times), Organizational Behavior and Human Decision Processes (8 times), Journal of Personality and Social Psychology (29 times), and Personality and Social Psychology Bulletin (27 times). How did the four-step test become so popular? Several explanations are possible:
The Truth(s) on Testing for Mediation
1. Baron and Kenny’s (1986) paper was one of the earliest to formally address issues of mediation and moderation (a primacy effect).
2. The article was published in one of the most prestigious and highly cited journals in psychology (a source credibility effect).
3. The straightforward, “cookbook” nature of the four-step test makes it very easy to understand and implement. Indeed, most upper-level undergraduate students likely have the requisite quantitative sophistication to conduct a basic mediation test. Research suggests that individuals almost automatically accept and believe things that they also understand (Gilbert, 1991), a comprehension = acceptance effect.
4. Research also has shown that statements which have been repeated receive higher ratings of truth or are judged true with a higher degree of probability compared to statements which are novel (Hasher, Goldstein, & Toppino, 1977; Schwartz, 1982), a “truth” effect.
5. Finally, research also has documented that greater exposure to a stimulus increases our liking of that stimulus (Zajonc, 1968), a mere exposure effect.
Collectively these points proffer a reasoned explanation for social and organizational researchers’ overwhelming preference for, and unconditional embracement of, the four-step test. It was one of the first formal treatments of mediation published in a very prestigious journal. It is easy to understand and implement. Increased exposure leads to increased belief in and increased liking for the test, all of which has led to increased use of the test. In summary, the urban legend states that the four-step test as conceived by Baron and Kenny (1986) is the optimal and sufficient technique for establishing mediation. Evidence to support this legend comes directly from the original article which articulated four conditions or steps that should be satisfied prior to concluding that one had found a mediated effect and indirectly from the popularity of the technique. Below, we attempt to debunk the urban legend surrounding the four-step test of mediation by demonstrating the specious nature of each statement.
Debunking the Legends

Legend 1: A Test of a Mediation Hypothesis Should Consist of the Four Steps Articulated by Baron and Kenny (1986)

Several limitations associated with the four-step test have been identified. We describe five of these here. First, recall that Condition 1 requires a significant bivariate relationship between the exogenous antecedent (X) and the endogenous consequent (Y). This is problematic because it blurs the distinction between population parameters and sample statistics. Specifically, the bivariate relationship between X and Y (assessed via the correlation rYX or the regression coefficient byx) must be nonzero in the population if the effects of X on Y are completely mediated by M (Mathieu & Taylor, 2006). Consequently, establishing a significant bivariate relationship in one’s sample is conditional on sample size (assuming that the full mediation model is correct). For example, assume that in the population ρXM = .30 and ρMY = .30; thus, in the case of a full mediation model, ρXY = .09 in the population (i.e., the relationship between X and Y is simply the product of the paths linking X → M and M → Y). Assuming that N = 100 and that the sample correlations are rXM = .30 and rMY = .30, then both correlations would be significant at p < .05. However, the sample correlation rXY = .09 would not be significant. In fact, it would take a sample of N = 475 for this correlation to be significant at p < .05. Consequently, strict adherence to this rule may preclude tests for mediation when a full mediation model is the true model in the population. Adherence to this rule may be especially problematic for group and organizational researchers who often deal with relatively small sample sizes.
For example, Barrick, Stewart, Neubert, and Mount (1998) were not able to proceed with two separate tests of whether social cohesion (M) mediated the relationship between team agreeableness (X; measured as either the team’s mean level or the team’s minimum level of agreeableness) and team viability (Y) because the bivariate correlations between agreeableness and viability were not statistically significant (rYX = .16 and .20, for mean level and minimum level, respectively). Although correlations of this magnitude are often significant in the organizational literature, Barrick et al.’s data had been aggregated to the team level, leaving an N of 51. However, when we examine the pattern of relationships among agreeableness, cohesion, and viability we see that they are completely consistent with the hypothesized mediation model. Specifically, rMX = .32 and .38 for the mean and minimum levels of group agreeableness, respectively, and rYM = .40. Using these values it is possible to calculate the indirect effect (i.e., reproduced correlation) of X on Y as .13 for the mean level of agreeableness (i.e., .32*.40) and .15 for the minimum level of agreeableness (i.e., .38*.40). These values are very close to the original bivariate values of .16 and .20, respectively. In sum, these researchers (a) had sound theoretical reasons for expecting mediation, (b) collected a rich and difficult-to-obtain field data set, (c) found evidence supporting mediation for other personality traits (i.e., extraversion and emotional stability), and (d) were not able to proceed with a test of mediation because of strict adherence to the Baron and Kenny technique; however, (e) the patterns of relationships among their variables were consistent with mediation. Other researchers examining team processes have obtained similar patterns of results. For example, Mathieu, Heffner, Goodwin, Salas, and Cannon-Bowers (2000) concluded that a mediation effect was not present because rYX was nonsignificant (with N = 56 teams) but that an indirect effect was present (because rMX and rYM were both significant).

Second, and on a related point, Condition 1 becomes even more problematic when one’s mediation hypothesis takes the form of a more complex chain model where the effect of X on Y is carried through multiple mediators (e.g., X→M1→M2→M3→Y). In the previous example only two path coefficients were multiplied to obtain the indirect effect of X on Y; in chain models, additional path coefficients are multiplied into this product to account for the multiple mediating variables.
Thus, as the influence of the exogenous antecedent (X) becomes more distal from the endogenous consequent (Y), the likelihood of detecting a significant bivariate relationship between X and Y decreases (James, Mulaik, & Brett, 2006; Shrout & Bolger, 2002). We believe that, in many instances, the lack of a “significant” bivariate relationship between X and Y may be due to underpowered designs, not necessarily poor theory (cf. Fritz & MacKinnon, 2007). In addition, suppression and/or interaction effects could attenuate the magnitude of the bivariate relationship (MacKinnon, Krull, & Lockwood, 2000; Mathieu & Taylor, 2006). For all of these reasons, establishing the bivariate relationship between X and Y in a given sample may be problematic.
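The sample-size arithmetic behind these first two limitations can be checked with a short sketch. It is a minimal illustration rather than the procedure used in the chapter: it assumes standardized variables, a full mediation chain in which the implied X–Y correlation is the product of the path correlations, and the Fisher z approximation for a two-tailed test of a correlation at the .05 level. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def implied_correlation(paths):
    """Implied X-Y correlation under full mediation: the product of the chain's paths."""
    return math.prod(paths)


def is_significant(r, n, alpha=0.05):
    """Two-tailed test of H0: rho = 0 via the Fisher z approximation (an assumption)."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return abs(math.atanh(r)) * math.sqrt(n - 3) > z_crit


def min_n_for_significance(r, alpha=0.05):
    """Smallest N at which a sample correlation equal to r would reach significance."""
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Solve atanh(|r|) * sqrt(N - 3) > z_crit for N.
    bound = (z_crit / math.atanh(abs(r))) ** 2 + 3
    return math.floor(bound) + 1


r_xy = implied_correlation([0.30, 0.30])   # X -> M -> Y
print(round(r_xy, 2))                      # 0.09
print(is_significant(0.30, 100))           # True: each path is detectable at N = 100
print(is_significant(r_xy, 100))           # False: the implied X-Y correlation is not
print(min_n_for_significance(r_xy))        # 475

# A longer chain (X -> M1 -> M2 -> Y) shrinks the implied correlation further,
# so the sample size needed to satisfy Condition 1 grows.
r_chain = implied_correlation([0.30, 0.30, 0.30])
print(min_n_for_significance(r_chain) > min_n_for_significance(r_xy))  # True
```

Under these assumptions the sketch reproduces the N = 475 figure from the example above, and it shows how quickly the required sample grows once a third path is multiplied into the product.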
Third, Condition 4 requires byx.m to be zero. Provided the first three steps are satisfied, when the effect in Condition 4 drops to zero, then one is to conclude support for a full mediation model. When the effect in Condition 4 does not drop to zero, then one is to conclude that there is evidence to support a partial mediation model. As James et al. (2006) noted, “There is no opportunity to fail Step 4 because the significance/nonsignificance of byx.m determines whether the partial or complete mediation model is adopted to explain the results” (p. 239). This is problematic because it releases the researcher from making an a priori hypothesis concerning full vs. partial mediation. In essence, what started out as a confirmatory test of a causal hypothesis descends into exploratory data mining with no mechanism to compare model fit between full vs. partial mediation.

Fourth, another concern with Condition 4 involves the potential for conclusions regarding mediation to be affected by researcher confirmatory biases (Bing, Davison, LeBreton, & LeBreton, 2002). Basically, the Baron and Kenny test requires researchers to determine if the regression weight in Condition 4 (Path c’ in Figure 5.2E) is significantly reduced in comparison to the regression weight in Condition 1 (Path c in Figure 5.2C). Although a t-test of statistical significance is available for Path c’, determining whether the relationship is fully or partially mediated may require judgment on the part of the researcher. For example, a researcher might argue that a nonsignificant but nonzero Path c’ is suggestive of partial mediation, as the X→Y relationship could be considered meaningful yet not statistically significant (especially with small sample sizes). Conversely, a researcher might argue that a nonsignificant relationship indicates full mediation, as the confidence interval around the Path c’ includes zero. This becomes even more of a concern when one takes into account the starting value of Path c.
For example, a drop of .10 may be a “substantial reduction” in one situation (e.g., c = .15, c’ = .05) but not in another situation (e.g., c = .75, c’ = .65). Thus, to some degree, conclusions about what constitutes a meaningful reduction from c to c’ may be influenced by researcher judgment (Bing et al., 2002). This may become especially problematic when researchers conduct a priori tests under conditions of accepted paradigmatic theoretical bounds (Kuhn, 1996). In such instances, researchers may have formed a priori hypotheses regarding the relationships that may subjectively influence interpretation of the results (Kunda, 1990; Nisbett & Ross, 1980; Fiske, 1995).
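One way to see why the starting value of Path c matters is to express the same absolute drop as a proportion of c. The proportional reduction used here is only an illustrative yardstick; it is not part of the Baron and Kenny procedure, and no particular cutoff is implied.

```latex
\[
\frac{c - c'}{c} = \frac{.15 - .05}{.15} \approx .67
\qquad\text{versus}\qquad
\frac{c - c'}{c} = \frac{.75 - .65}{.75} \approx .13
\]
```

An identical drop of .10 therefore removes about two thirds of the original effect in the first situation but only about 13% of it in the second, which is why a judgment about a “substantial reduction” cannot rest on the absolute difference alone.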
Finally, theoretical and mathematical concerns over Condition 3 have been raised by James et al. (2006). They argued that, consistent with the laws of scientific parsimony (e.g., Occam’s razor), the baseline mediation model should be that of full mediation. However, the four-step test is predicated on the more complex, saturated, and better fitting partial mediation model. While the differences in baseline models have a trivial impact on tests of partial mediation, more pronounced differences emerge in tests of full mediation. Specifically, James et al. (2006) questioned the use of Equation 5.3 (under the previously described Condition/Step 3) to test the effects of M on Y. If full mediation is the a priori hypothesis, then the more appropriate estimate is given by
$$y = b_{ym}\, m + e_4 \tag{5.5}$$
where bym = b in Figure 5.2B. The critical difference between Equation 5.3 and Equation 5.5 is that in the latter equation the regression coefficient linking the mediator to the consequent does not control for the effects of X. This is because in instances where full mediation is the a priori hypothesized model, there are no reasons to include the effects of X on Y when testing for the effects of M on Y. In addition, establishing a significant bivariate relationship between X and Y under conditions of full mediation may prove problematic without extremely large sample sizes, and there is no need to include X in Equation 5.3 under conditions of full mediation. This misspecification of Equation 5.3 when testing full mediation also complicates the estimation of the indirect effect using Equation 5.4 as suggested by Baron and Kenny (1986). Instead, when full mediation is hypothesized the indirect effect should be estimated as
$$\text{Indirect Effect} = b_{ym}\, b_{mx} \tag{5.6}$$
where bym = b in Figure 5.2B and bmx = a in Figure 5.2A. Thus, our final concern involves the correct specification of the baseline model when testing hypotheses involving full mediation. The correct equations are those that derive from Figure 5.2A (X→M) and Figure 5.2B (M→Y), which may be graphically combined to form Figure 5.2D (X→M→Y). In contrast, Baron and Kenny believed that in tests of full mediation, the correct equations derive from Figure 5.2A (X→M), Figure 5.2B (M→Y), Figure 5.2C (X→Y), and Figure 5.2E
(X→M→Y, and X→Y). If one tests for full mediation, then confirming Equation 5.1 is potentially very misleading unless data are collected on large sample sizes. Also, if testing for full mediation, then the use of Equation 5.3 is incorrect. Instead, Equation 5.2 should be used in conjunction with Equation 5.5 in tests of full mediation. Collectively, we recommend researchers testing hypotheses about full mediation rely on Equations 5.2, 5.5, and 5.6, which we will refer to as the “Set 2” equations in order to distinguish them from the original Baron and Kenny Set 1 equations (i.e., Equations 5.1–5.4).

In all fairness, Kenny and his colleagues have recognized at least some of these problems and have indicated that confirming Steps 1 and 4 may not be necessary (Kenny, Kashy, & Bolger, 1998). However, after nearly 10 years, this retraction has gone largely unnoticed by users of the Baron and Kenny four-step test. For example, in 2006 the Baron and Kenny (1986) article was cited 29 times in the Journal of Personality and Social Psychology, but the Kenny et al. (1998) chapter was cited only twice. Similarly, in 2006, the Baron and Kenny (1986) article was cited 22 times in the Journal of Applied Psychology, but the Kenny et al. (1998) chapter was cited only three times. Thus, the original article continues to be the predominant one used to define and justify mediation via the four-step test, even after one of the original authors revised and retracted portions of the four-step test.

In summary, Step 1 is potentially very misleading for tests of full mediation and is unlikely to be satisfied empirically with complex models (e.g., chain models) or with small to modest sample sizes. Step 4 transitions an otherwise confirmatory test to an exploratory analysis. Finally, there are serious concerns over the appropriateness of Step 3 when testing a hypothesis of full mediation.
Given these concerns, it is not surprising that the four-step technique has not fared well when compared to other analytic approaches. These approaches are discussed next.

Legend 2: Baron and Kenny’s (1986) Four-Step Procedure Is the Optimal Test of Mediation Hypotheses

Judging from the popularity of the Baron and Kenny procedure, one is tempted to conclude that it is the best test of mediation available. For if not, why then would so many researchers (across so many disciplines) rely so heavily on it for over 20 years? Recently, researchers
have compared the four-step test to over a dozen alternative strategies in an attempt to understand the relative efficacy of this test for identifying mediated relationships. Below we briefly review the performance of the four-step test vis-à-vis these alternative strategies. In so doing, we conclude that the four-step test is far from being the optimal test of mediation hypotheses.

In the most comprehensive comparison, MacKinnon et al. (2002) compared 14 different tests of mediation. Using simulations, these authors examined the Type I error rates and statistical power for these tests. Three general categories of tests were identified. The Baron and Kenny (1986) four-step test was classified under the first category, the causal steps approach. The second category focused on the difference of coefficients. The basis for the tests involving the difference in coefficients resides in the belief that it is necessary to establish a significant reduction in the direct effect of X on Y (Path c in Figure 5.2C) when the mediator variable is included in the regression equation (Path c’ in Figure 5.2E; MacKinnon et al., 2002, 2007). The third category focused on evaluating the product of the effects (a*b’). The tests involving the product are simply tests of the indirect effect of X on Y (obtained using Equation 5.4, not Equation 5.6; MacKinnon et al., 2002, 2007).

Although no universal, clear-cut “winner” emerged among the 14 tests of mediation, clear-cut losers did emerge: techniques relying on testing a set of causal steps, such as Baron and Kenny’s (1986) four-step test. The authors concluded that the four-step test had “Type I error rates that are too low in all the simulation conditions and have very low power, unless the effect or sample size is large” (MacKinnon et al., 2002, p. 96). For example, the four-step test only had a power of .52 to detect a medium effect with a sample size of 200 and only had a power of .11 to detect a small effect with a sample size of 1,000.
The interested reader is directed to the original article for a detailed discussion of when the remaining techniques were deemed most useful; however, the overwhelming conclusion of this article was that the Baron and Kenny (1986) four-step test was not the optimal strategy for detecting mediated relationships. Readers are also encouraged to read the recent paper by Fritz and MacKinnon (2007), which furnishes power comparisons for several popular tests of mediation.
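The logic of such power comparisons can be illustrated with a small Monte Carlo sketch. This is not a reproduction of the MacKinnon et al. (2002) simulations, and it is not expected to reproduce their power values. It assumes data generated under full mediation with unit-variance normal errors, uses a normal critical value in place of the exact t distribution, treats the causal steps test as requiring significant c, a, and b’ paths (Step 4 is omitted), and compares it with a simple joint test that requires only a and b’ to be significant. All function names and parameter values are hypothetical.

```python
import math
import random
from statistics import NormalDist


def cross(u, v):
    """Sum of centered cross-products of two equal-length lists."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def simple_t(x, y):
    """t statistic for the slope when y is regressed on x alone."""
    n = len(x)
    sxx, sxy, syy = cross(x, x), cross(x, y), cross(y, y)
    slope = sxy / sxx
    rss = syy - slope * sxy
    return slope / math.sqrt(rss / (n - 2) / sxx)


def partial_t_for_m(x, m, y):
    """t statistic for the mediator when y is regressed on x and m together."""
    n = len(x)
    sxx, smm, sxm = cross(x, x), cross(m, m), cross(x, m)
    sxy, smy, syy = cross(x, y), cross(m, y), cross(y, y)
    det = sxx * smm - sxm ** 2
    b_x = (sxy * smm - smy * sxm) / det
    b_m = (smy * sxx - sxy * sxm) / det
    rss = syy - b_x * sxy - b_m * smy
    return b_m / math.sqrt(rss / (n - 3) * sxx / det)


def estimate_power(a, b, n, reps=500, alpha=0.05, seed=1):
    """Return (causal-steps power, joint-test power) from simulated samples."""
    rng = random.Random(seed)
    crit = NormalDist().inv_cdf(1 - alpha / 2)   # normal approximation to t
    steps_hits = joint_hits = 0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        m = [a * xi + rng.gauss(0, 1) for xi in x]
        y = [b * mi + rng.gauss(0, 1) for mi in m]   # full mediation: no direct path
        sig_c = abs(simple_t(x, y)) > crit
        sig_a = abs(simple_t(x, m)) > crit
        sig_b = abs(partial_t_for_m(x, m, y)) > crit
        joint_hits += sig_a and sig_b
        steps_hits += sig_c and sig_a and sig_b
    return steps_hits / reps, joint_hits / reps


steps_power, joint_power = estimate_power(a=0.39, b=0.39, n=100, reps=300)
print(f"causal steps: {steps_power:.2f}, joint test: {joint_power:.2f}")
```

Because the causal steps rule in this sketch adds the Condition 1 requirement to the joint test, it can never detect mediation more often than the joint test on the same samples; the simulation only shows how large the gap is for a given effect size and N.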
Legend 3: Fulfilling the Conditions Articulated in the Baron and Kenny (1986) Four-Step Test Is Sufficient for Drawing Conclusions About Mediated Relationships

As noted earlier, a mediation hypothesis is a specific form of a causal hypothesis. Consequently, when attempting to draw causal inferences one should ensure that theory, data, and methods satisfy the requisite conditions for such inferences (James, Mulaik, & Brett, 1982; Mathieu & Taylor, 2006). Although Baron and Kenny (1986) never stated that their four conditions represented the necessary and sufficient conditions for causal inference, over time researchers have grown ever emboldened in drawing such inferences from the application of Baron and Kenny’s four-step test (especially organizational researchers using single-source, cross-sectional data). Rather than draw attention to specific articles or authors who, in our opinion, may have inappropriately drawn causal inferences using the four-step test of mediation, we prefer to recapitulate the primary conditions that must be satisfied in order to draw causal inferences (using any statistical procedure) and highlight those conditions which we feel are most typically violated when researchers attempt to draw causal inferences of mediation. Due to space limitations we provide this summary and analysis in Table 5.1. Although other researchers may identify additional conditions or use different labels for some conditions, most would agree that the conditions presented in Table 5.1 represent the primary or essential conditions that must be satisfied when drawing causal inferences (cf. James et al., 1982; Mathieu & Taylor, 2006). These conditions are relevant to a discussion of mediation because mediation represents one form of causal inference. In essence, stating that “changes in X explain changes in M which explain changes in Y” represents a chain of causal inferences.
Thus, establishing the conditions for causal inference is a requisite step when conducting a formal test of mediation. Table 5.1 presents a brief appraisal of the extent to which researchers in the organizational sciences are adequately addressing these conditions in tests of mediation. In our opinion, researchers have done a reasonable job satisfying some conditions (e.g., theoretical rationale for causal hypotheses) whereas some conditions are more problematic. Condition 5 (self-contained model and functional equations) is particularly challenging for researchers because
Table 5.1 Analysis of Conditions for Drawing Causal (Mediation) Inferences

Condition 1: Formal Statement of Theory in Terms of a Structural Model
Analysis: Appears to be reasonably satisfied.

Condition 2: Theoretical Rationale for Causal Hypotheses
Analysis: Appears to be reasonably satisfied.

Condition 3: Specification of Causal Order
Analysis: Some problems, especially when mediation is tested using single-source, cross-sectional, nonexperimental designs and data. Greater effort needs to be placed on articulating competing hypotheses testing various models of temporal precedence.

Condition 4: Specification of Causal Direction
Analysis: Some minor problems. Researchers are encouraged to consider the plausibility of more complex mediation models including nonrecursive models and cyclically recursive models.

Condition 5: Self-Contained Functional Equations
Analysis: Most problematic. Researchers will never be able to fully obviate the unmeasured variables problem. However, greater care must be taken to include the most relevant variables in tests of mediation, lest our parameter estimates (and significance tests) become overly biased.

Condition 6: Specification of Boundaries
Analysis: Appears to be reasonably satisfied. However, researchers are encouraged to always be examining potential moderators of their mediation hypotheses.

Condition 7: Stability of the Structural Model
Analysis: Potentially problematic. A growing body of research indicates that a number of variables appear to have greater fluctuation and variability than previously thought (i.e., they lack an equilibrium-type condition), and the relationships these variables have to other variables in a mediation chain may also be in flux (i.e., the relationships are not stationary). Researchers should heed this work and consider the extent to which prior theory and data support the stability of both their constructs and their constructs’ linkages to other constructs.

Condition 8: Operationalization of the Variables
Analysis: Appears to be reasonably satisfied.

Condition 9: Empirical Support for Functional Equations
Analysis: Highly problematic when full mediation model is tested via the Baron and Kenny four-step test. Reasonably satisfied when conducting tests of partial mediation.

Condition 10: Fit Between Structural Model and Data
Analysis: Problematic when full mediation model is tested via Baron and Kenny four-step test. Reasonably satisfied when conducting tests of partial mediation; however, the simple three-variable partial mediation model is fully saturated, and thus lacks degrees of freedom to test model fit in SEM without imposing additional constraints.
it is impossible to include all relevant variables in many causal models (James, 1980). Consequently, it is important to try to include the most relevant variables so that the degree of model misspecification and the resulting parameter bias will be minimized. Readers interested in a more exhaustive discussion of the conditions for causal inference are directed to Cliff (1983), James et al. (1982), and Mathieu and Taylor (2006). Assuming that the conditions for drawing inferences of mediation are met, the question still remains as to how to go about estimating those relationships. Thus, we conclude with recommendations for researchers interested in testing mediation hypotheses.

Suggestions for Testing Mediation Hypotheses

Structural Equation Modeling (SEM) as an Analytic Framework

In general, we recommend that researchers frame mediation hypotheses as causal hypotheses and invoke strong confirmatory analytic techniques such as SEM to test these hypotheses. Hoyle (1995) defined SEM as “a comprehensive statistical approach to testing hypotheses about relations among observed and latent variables” (p. 1). He continued by noting that SEM and regression frameworks share a number of important similarities (see pp. 13–14): (a) both are derived from linear statistical models, with regression representing a special instance of SEM, (b) the statistical tests furnished by both techniques are valid only when certain assumptions are met, (c) neither SEM
nor regression provides definitive tests of causality; they can only confirm or disconfirm the viability of a particular model, as causality is established by meeting the various criteria presented in Table 5.1, and (d) making “adjustments” to hypotheses after viewing one’s data increases the probability that one’s results will be sample specific. Although these techniques have much in common, Hoyle (1995) noted several important differences (see pp. 14–15): (a) most regression software only allows one to specify the direct effects of antecedent variables on a single consequent variable; in contrast, SEM provides no default model and has relatively few limitations on the number and form of relationships that can be specified, (b) SEM permits researchers to test relationships among manifest variables, latent variables, or both; in contrast, regression frameworks are limited to testing relationships among manifest variables, and (c) like regression, SEM permits researchers to test the significance of individual parameter estimates, but unlike regression it also permits researchers to assess the overall goodness-of-fit between their data and their model. Byrne (1998) further noted that SEM permits researchers to test an entire set of equations in a single, simultaneous analysis to determine the level of fit. In contrast, regression is limited to testing individual equations in isolation from the remaining equations (and provides no overall index of model fit). We would like to state at the outset that in the simple, three-variable mediation model (using manifest variables) described earlier we would expect to see relatively few differences between regression and SEM. However, as models increase in complexity (e.g., chain models, parallel mediator models, multiple outcome models, nonrecursive models) we would expect to see more differences between regression and SEM.
Taken together, this leads to the conclusion that “the SEM approach is a more comprehensive and flexible approach to research design and data analysis than any other single statistical model in standard use by social and behavioral scientists. Although there are research hypotheses that can be efficiently and completely tested by [regression] methods, the SEM approach provides a means of testing more complex and specific hypotheses than can be tested by those methods” (Hoyle, 1995, p. 15). It is because SEM offers a number of advantages over regression that we see it as the generally preferred analytic strategy. For example, James and Brett (1984) argued that if mediation models were to be conceptualized as causal models, then strong
confirmatory analytic techniques such as SEM should be used. Others have noted that the Baron and Kenny test based on the Set 1 equations is not easily extended to situations containing multiple mediators and their approach is not able to individually assess the effect of each mediator (cf. James et al., 2006; MacKinnon et al., 2002; Shrout & Bolger, 2002). In contrast, SEM is better suited for testing models containing multiple mediators. Related to this issue, Shrout and Bolger (2002) discussed proximal and distal mediation processes in terms of temporality, such that an antecedent variable and a consequent variable occur within a certain temporal window. When an antecedent variable and a consequent variable are proximally mediated they are temporally close to one another and the opportunity for multiple mediators to be operating is limited. In contrast, when the relationship between an antecedent variable and a consequent variable is distally mediated, there is an increased temporal opportunity for multiple mediators to be influencing the relationship. Compared to the traditional four-step test using regression, SEM is better equipped to test such distally mediated chain models. However, we should remind the reader that failure to model each link in the mediation chain using equations derived from the Set 1 or Set 2 equations represents a form of model misspecification that could yield biased parameter estimates and erroneous conclusions regarding mediation, irrespective of whether an SEM or regression framework is adopted.

Bing et al. (2002) illustrated several of the advantages of SEM by comparing a priori nested mediation models involving multiple consequent variables. They compared the results obtained using the Baron and Kenny four-step test assessed via traditional regression analysis with those obtained using SEM. All analyses were conducted at the manifest variable level.
They noted that using a traditional regression framework to test for mediation fails to allow for a simultaneous test of models containing multiple consequent variables. Instead, separate regression analyses are needed for each consequent, which could result in an elevated Type I error rate. In addition, by conducting separate regressions for each outcome variable, the four-step test ignores the correlations among the outcome variables. Such an approach is analogous to running separate ANOVAs rather than an omnibus MANOVA when one has multiple, correlated consequent variables. Furthermore, when competing models are identified a priori, and are nested within one another, the SEM
technique can provide a chi-square goodness-of-fit test to determine which model has a better fit to the observed data. One useful set of nested models involves full vs. partial mediation models. Using an empirical example containing a mediation model with multiple, correlated consequent variables, they showed that the results obtained using SEM differed from those obtained using the traditional four-step test. Finally, they showed the chi-square difference test for comparing nested models provided a more objective index of whether the competing models of full vs. partial mediation had better fit to the data.

Our recommendation to test for mediation using SEM is not novel: Baron and Kenny (1986) themselves lauded the benefits of confirmatory techniques, as have others (James et al., 1982, 2006; Mathieu & Taylor, 2006; Medsker, Williams, & Holahan, 1994; Williams, Edwards, & Vandenberg, 2003); however, the majority of researchers still use a regression-based four-step test of mediation. For example, of the 29 papers published in 2006 in the Journal of Personality and Social Psychology using the Baron and Kenny test, 26 used traditional regression analysis while none used SEM (the remaining used some variation on ANOVA). Similarly, of the eight papers published in the Academy of Management Journal in 2006 that referenced the four-step test, only two used SEM to test their hypotheses. One explanation for this continued reliance on regression-based approaches is that researchers lack a framework for integrating analytic techniques with their mediation models. Below we present such a framework, but first recapitulate how we believe researchers should proceed to test mediation hypotheses.

Summary of Tests of Mediation

Theory should always guide whether a full or partial mediation model is hypothesized a priori. We begin with a discussion of tests for the full mediation hypothesis and then proceed to a discussion of the more complex/saturated partial mediation hypothesis.
[Footnote: We remind the reader that when multiple models are tested using a single sample, it is critical that these models be specified a priori. If such models are tested in a post hoc exploratory manner, then it is necessary to obtain a cross-validation sample to confirm conclusions about the optimally fitted model (James et al., 1982).]
At this stage,
James M. LeBreton, Jane Wu, and Mark N. Bing
we assume manifest variables in a simple three-variable recursive mediation model. When full mediation is hypothesized, Equations 5.2 and 5.5 should be used to estimate the sign, magnitude, and significance of the structural parameters bmx and bym. If the direction of these effects is consistent with a priori theory and the parameters are statistically nonzero, then one has prima facie empirical support for a full mediation model. If SEM is used, then one also may examine the overall model fit to determine if the hypothesized model is consistent with one’s data. As noted earlier, when full mediation is the correct model, the bivariate relationship between X and Y is nonzero in the population. Thus, one could also test the statistical significance of byx obtained from Equation 5.1 (and thus the Set 2 equations would be modified to include 1, 2, 5, and 6); however, we remind the reader that even modest relationships will be nonsignificant without relatively large sample sizes. Thus, due to the low power associated with this test, the lack of a significant bivariate relationship between X and Y should not be taken as conclusive evidence that a full mediation model should be rejected.
When partial mediation is hypothesized, Equations 5.1–5.3 may be used to estimate byx, bmx, byx.m, and bym.x and to determine if the observed values are statistically different from zero. If the direction of these effects is consistent with a priori theory and the parameters are nonzero, then one has prima facie support for a partial mediation model. Again, if SEM is the analytic framework, one may also review model fit statistics to assess the degree of fit between one’s theory and one’s data. Indirect effects may be estimated using Equation 5.4 when partial mediation is empirically supported.
In sum, Set 2 equations should be used for tests of full mediation. Set 1 equations should be used for tests of partial mediation.
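The regression-based estimates described above can be sketched in Python. This is only an illustration on simulated data with hypothetical effect sizes. It assumes that Equations 5.1–5.5 take their usual forms (5.1: Y on X; 5.2: M on X; 5.3: Y on X and M; 5.4: the indirect effect as the product bmx × bym.x; 5.5: Y on M), which this part of the chapter does not restate. The helper `ols` and all variable names are introduced here for illustration.

```python
import numpy as np

def ols(y, *predictors):
    """OLS with an intercept; returns the slopes and their standard errors."""
    design = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - design.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(design.T @ design)))
    return coef[1:], se[1:]  # drop the intercept

# Simulated data (hypothetical effect sizes) with partial mediation built in.
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y = 0.4 * m + 0.2 * x + rng.normal(size=n)

# Set 1 (partial mediation), assumed forms of Equations 5.1-5.4.
(b_yx,), (se_yx,) = ols(y, x)                         # 5.1: Y on X
(b_mx,), (se_mx,) = ols(m, x)                         # 5.2: M on X
(b_yx_m, b_ym_x), (se_yx_m, se_ym_x) = ols(y, x, m)   # 5.3: Y on X and M
indirect = b_mx * b_ym_x                              # 5.4: indirect effect

# Set 2 (full mediation), assumed forms of Equations 5.2 and 5.5.
(b_ym,), (se_ym,) = ols(y, m)                         # 5.5: Y on M

print("Set 1:", b_yx, b_mx, b_yx_m, b_ym_x, "indirect:", indirect)
print("Set 2:", b_mx, b_ym)
print("t ratios:", b_mx / se_mx, b_ym / se_ym, b_yx_m / se_yx_m)
```

Each equation is estimated separately, so this is a limited information (OLS) analysis. In OLS the total effect decomposes exactly as byx = byx.m + bmx × bym.x, which offers a quick check on the arithmetic.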
These recommendations follow logically from our discussion of Legend 1 and are straightforward for simple, three-variable recursive mediation models (Figures 5.2D and 5.2E). However, how should mediation hypotheses be tested as models grow more complex? Indeed, if researchers make honest attempts to satisfy Condition 5 (see Table 5.1) they will likely seek out additional antecedent, mediator, and consequent variables. Thus, these more elaborate models will contain multiple Xs, Ms, and Ys. Such models may also contain nonrecursive or cyclically recursive relationships. Further complicating mediation models, some researchers are interested in testing
The Truth(s) on Testing for Mediation
mediation hypotheses at the level of latent constructs vs. the manifest measures of those constructs (or both levels).
In addition to model complexity, mediation models may also differ in terms of the extent to which prior research supports the proposed linkages among the variables. At one end of a continuum we could envision a set of models we will label conventional mediation models. Such models largely involve replicating previously established linkages with only minor additions or modifications. At the other end of the continuum we could envision models that we will call speculative mediation models. Such models are developed with substantially less prior theoretical and empirical support. Although both types of models involve a confirmatory test of an a priori causal inference (i.e., full or partial mediation), the former are based on greater theory and research compared to the latter.
So, where to begin? How should a model containing multiple antecedents, mediators, and consequences be tested? Should the degree of model complexity affect the approach adopted for testing mediation hypotheses? Should new mediation models based on limited theoretical and empirical support be treated differently than conventional mediation models which are based on substantially greater support? Like many questions in the social sciences, the answer is “It depends.” Below we articulate a heuristic framework for classifying mediation models and derive initial guidelines for testing mediation models consistent with this framework.
A Heuristic Framework for Classifying Mediation Models
It is important to recognize that the types of estimators used to generate the parameter estimates play a critical role in testing mediation hypotheses. We can distinguish limited information from full information parameter estimation techniques. Limited information techniques estimate the parameters for each equation separately.
Hence, they are based on limited information from the covariance matrix containing the Xs, Ms, and Ys. One of the most common limited information estimation techniques is OLS. Thus, a test of full mediation using a limited information estimator such as OLS would involve calculating parameter estimates separately for Equations 5.2 and 5.5 (i.e., two separate OLS analyses, one for each equation). Most regression-based implementations of the Baron and Kenny four-step test involve a limited information estimator such as OLS. In contrast, full information techniques estimate model parameters simultaneously for all equations. Hence, they are based on the full information contained in the covariance matrix. One of the most common full information estimation techniques is full information maximum likelihood (FIML). A researcher employing a full information estimator such as FIML would conduct a simultaneous analysis that, in a single step, would generate parameter estimates for Equations 5.2 and 5.5. Most basic software packages (e.g., SPSS) furnish limited information estimators, whereas advanced SEM software (e.g., LISREL) can furnish either limited information or full information estimators. The distinction between full and limited information estimation techniques will be critical as we proceed with our recommendations.
Figure 5.3 presents a framework for classifying mediation models developed by crossing model complexity with prior model support. The four cells presented in this figure represent ideal prototypes; in reality, model complexity and prior model support are not dichotomous variables but may assume a wide range of values.

                    Prior Empirical Support
Model Complexity    Limited Support               Substantial Support
Low                 Simple Speculative Models     Simple Conventional Models
High                Complex Speculative Models    Complex Conventional Models

Figure 5.3 Heuristic framework for classifying mediation models.

Cell 1: Simple Speculative Mediation Models
Simple speculative mediation models are so described because they contain few variables and the relationships among the variables are relatively
uncomplicated. In addition, these models, while based on a priori theory (see Condition 2, Table 5.1), may not have substantial empirical data supporting the hypothesized linkages. Because these are simple models, mediation is examined at the level of observed variables or latent constructs, but typically not both. Because these models are based on limited empirical support, we believe that in many instances the use of limited information techniques for calculating parameter estimates may be preferred. This recommendation stems from the realization that full information techniques could be problematic because even slight misspecifications in the model often result in biased parameter estimates (Lance, Cornwell, & Mulaik, 1988). That is, because full information techniques simultaneously estimate all parameters, model misspecifications can have ripple effects on the accuracy of all parameter estimates. This is not true of limited information estimators; with these, the bias is confined to the location of the model misspecification.
An example of a simple speculative model is presented by Muraven and Baumeister (2000). These authors suggested that depletion mediates the relationship between self-control and performance. Because there is only one mediating variable, the relationship among the variables represented one of the simplest mediation models. However, this was a relatively new mediation hypothesis based on limited and somewhat conflicting empirical support. Thus, because the proposed relationship may be considered more speculative in nature, we would recommend the use of limited information estimators such as OLS. The test of mediation would proceed by first specifying a partial or full mediation model (i.e., Set 1 or Set 2 equations) and then estimating the appropriate parameters using basic statistical software (e.g., SPSS) or more advanced software (e.g., LISREL using limited information estimators).
[Footnote: One may question whether it is meaningful to discuss tests of mediation in “speculative models,” as tests of mediation imply tests of causal hypotheses (James & Brett, 1984). Nevertheless, we discuss mediation tests in the context of what we refer to as speculative mediation models because it is not an uncommon application of mediation tests in the literature.]
Cell 2: Simple Conventional Mediation Models
Simple conventional mediation models also contain few variables, and these variables have relatively uncomplicated relationships to one another. However, these models have substantially stronger prior empirical
support for the linkages in the mediation chain. As with their speculative cousins, we classify models as simple conventional models if mediation hypotheses are being examined at either the level of observed variables or the level of latent constructs, but not simultaneously at both levels. Given the stronger empirical support, concerns over model misspecification are minimized, and thus we expect little difference between full and limited information techniques, especially when certain distributional assumptions are met.
An example of a simple conventional model is given by Locke and Latham (1990a) in their Goal Setting Theory. According to Goal Setting Theory, self-efficacy mediates the relationship between a goal and performance. This theory also contains a single primary mediating variable and has been widely tested and supported by a multitude of researchers; it can therefore be considered a simple conventional model. Thus, researchers seeking to test such a model could safely rely on either full or limited information techniques; however, using full information estimators in SEM software would provide the added advantage of model fit statistics for a single, simultaneous test of all equations.
Cell 3: Complex Speculative Mediation Models
Complex speculative mediation models are arguably the most problematic. Such models have less empirical support for the hypothesized linkages in the mediation chain. Furthermore, these models contain larger numbers of variables (Xs, Ms, and/or Ys), often having more complex relationships with one another. In addition, such models may contain both manifest and latent variables. In such instances, we strongly recommend against the use of full information estimation. Instead, we would encourage researchers to respecify the model using only manifest variables and estimate parameters using limited information techniques.
Nevertheless, problematical situations may arise when complex models have features that lend themselves particularly well to full information estimation (e.g., nonrecursive relationships); however, even in these situations we recommend that researchers consider alternative limited information techniques (e.g., 2-stage least squares; James & Singh, 1978; Lance et al., 1988). If one decides to use full information estimation, then a large sample is strongly recommended in order to obtain highly stable parameter estimates and robust standard errors. It is also suggested that the complex speculative model be cross-validated on one or more
additional samples to demonstrate that the observed relationships are replicable and not spurious. Overall, we believe that mediation analyses are a form of confirmatory analyses requiring strong a priori theoretical support. However, it is often the case that researchers are interested in undertaking a more “exploratory” examination of “potential” mediators or mediated relationships (or methodologists are asked to advise how to undertake such analyses as part of a larger multidisciplinary study). Such analyses are especially likely when complex speculative models are being hypothesized. In such instances we recommend the following:
1. In any situation where “model tweaking” is undertaken (i.e., models are revised based on empirical results), researchers should obtain one or more cross-validation samples in order to “confirm” the viability of their revised model. Thus, when more speculative models are tested, we expect (on the average) that more model tweaking will take place. Consequently, cross-validation samples become more essential.
2. When complex models are being tested in a quasi-confirmatory manner (e.g., a large set of mediators have been proposed and are being tested, but the mediators are based on somewhat limited prior research; cf. Lance, Woehr, & Fisicaro, 1991), disturbance term regressions (DTRs; Lance, 1986) could be employed to test the logical consistency of the structural model. These tests are also discussed under the rubric of omitted parameters tests (James et al., 1982). In reality DTRs could be used in any test of mediation; however, such tests may be particularly helpful in situations where large numbers of Xs or Ms are being included in the mediation model.
The DTR approach is illustrated assuming a simple three-variable recursive full mediation hypothesis. Step 1 of the analysis involves obtaining the residuals by regressing Y on M (Equation 5.5). These residuals (denoted e4 in Equation 5.5 but below as d, for disturbance) are then regressed on one or more antecedent variables:
$$d = b_{dx}X + e_5 \qquad (5.7)$$
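The two DTR steps can be sketched in Python as follows. This is a minimal illustration on simulated data with hypothetical effect sizes, not an implementation taken from Lance (1986). The helper names are introduced here, and the second-stage standard error is the ordinary OLS one, which treats the residuals d as if they were observed data.

```python
import numpy as np

def simple_ols(y, x):
    """Regress y on x with an intercept; return slope, its SE, and residuals."""
    design = np.column_stack([np.ones(len(y)), x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    se_slope = np.sqrt(sigma2 / np.sum((x - x.mean()) ** 2))
    return coef[1], se_slope, resid

def dtr_test(x, m, y):
    """Step 1: residuals d from Y on M. Step 2: regress d on X (Equation 5.7)."""
    _, _, d = simple_ols(y, m)
    b_dx, se_dx, _ = simple_ols(d, x)
    return b_dx, b_dx / se_dx  # estimate and its t ratio

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)

# Data consistent with full mediation: X affects Y only through M.
y_full = 0.4 * m + rng.normal(size=n)
# Data with an added direct effect of X on Y (full mediation misspecified).
y_partial = 0.4 * m + 0.5 * x + rng.normal(size=n)

b_dx_full, t_full = dtr_test(x, m, y_full)
b_dx_partial, t_partial = dtr_test(x, m, y_partial)
print("full mediation data:    b_dx =", b_dx_full, "t =", t_full)
print("direct effect present:  b_dx =", b_dx_partial, "t =", t_partial)
```

In the second data set the disturbance from the Y-on-M regression still carries information about X, so bdx departs clearly from zero and the full mediation specification would be questioned.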
If the full mediation hypothesis is correct, then the effect of X on Y should be fully transmitted via M. As such, bmx is statistically significant and bdx should not be different from zero. If bdx is not zero, then the a priori model is misspecified, and perhaps a partial mediation
model is more appropriate (and/or unmeasured variables must be included in order to correctly specify the model). When researchers test speculative complex models containing multiple Xs and multiple Ms, the DTR approach enables them to simultaneously test the mediation hypothesis for large “sets” of variables. Such an approach lacks the precision of a traditional SEM analysis, but it may be more appropriate than strict confirmatory techniques in the early stages of theory building and testing.
An example of a complex speculative model is provided by Vandewalle, Brown, Cron, and Slocum (1999). These authors proposed that the learning goal orientation and sales performance relationship is mediated by goal setting, effort, and planning. Because the hypothesized mediators between the predictor and criterion are based on relatively little prior support, this model can be considered a complex speculative model. Consequently, we would encourage researchers with similar models to rely on limited information estimators. A regression-based implementation of the DTR approach described above may be particularly well suited for such a model.
Cell 4: Complex Conventional Mediation Models
Finally, complex conventional mediation models are, by definition, more complicated. However, unlike their speculative cousins these models are based on substantially greater empirical evidence. Consequently, the use of full information estimators should be less problematic. However, in general, as models become more intricate, there are greater opportunities for structural misspecifications. Thus, as one moves towards more complex models greater caution should be used in employing full information techniques, even with conventional models.
An example of a complex conventional model is Carver and Scheier’s (1985) Control Theory, wherein a negative feedback loop consisting of an input function, comparator, output function, and the impact on the environment is further influenced by a reference value as well as a disturbance or interruption term. This model is clearly more complex in terms of multiple mediating paths, feedback loops, and so on. However, substantial research has been conducted on their basic model. Thus, concerns over model misspecification are minimized and researchers are probably advised to take advantage of full information estimators.
Summary
Figure 5.3 contains our heuristic model. Our framework lacks the concreteness or simplicity of the Baron and Kenny (1986) four-step test. However, an alternative interpretation is that the simplicity and concreteness of the Baron and Kenny test absolves researchers from making tough decisions and using sound professional judgment about the optimal ways to test mediation hypotheses and estimate the necessary parameters. We do not make “one size fits all” recommendations concerning tests for mediation, largely because situations vary and researchers must take into account the complexity of their model and the extent to which the hypothesized linkages in their model are based on limited vs. extensive empirical research. However, we do encourage researchers to consider if SEM may be appropriate for testing mediation hypotheses. This approach has several advantages including the following:
• Offers flexibility to test models containing multiple antecedent variables
• Offers flexibility to test models containing multiple mediator variables (parallel or chain mediation)
• Offers flexibility to test models containing multiple consequent variables
• Offers flexibility to test models containing recursive and nonrecursive relationships among the variables
• Offers flexibility to test models containing manifest or latent variables (or both)
• Offers flexibility to calculate parameters using full or limited information estimators
• Provides tests of both individual parameters and overall model fit
• Permits a comparative chi-square goodness-of-fit test for testing multiple, nested models
• Can accommodate tests of multilevel mediation
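The comparative chi-square test in the list above can be illustrated for the simplest case, a three-variable manifest model. The sketch below rests on assumptions that are not taken from the chapter: the partial mediation model is saturated (zero degrees of freedom), so the chi-square difference between the nested models equals the chi-square of the full mediation model with one degree of freedom; the maximum likelihood fit function is F = ln|Σ| + tr(SΣ⁻¹) − ln|S| − p; and the statistic is scaled by N − 1, although some programs use N. The data and the function name are hypothetical.

```python
import numpy as np

def full_mediation_chi_square(x, m, y):
    """ML chi-square (df = 1) for the full mediation model X -> M -> Y."""
    data = np.column_stack([x, m, y])
    n, p = data.shape
    S = np.cov(data, rowvar=False)  # sample covariance matrix
    # For this recursive model the ML path estimates coincide with the
    # equation-by-equation estimates: b_mx = s_xm / s_xx, b_ym = s_my / s_mm.
    b_mx = S[0, 1] / S[0, 0]
    b_ym = S[1, 2] / S[1, 1]
    # Model-implied covariance matrix: identical to S except that
    # cov(X, Y) is constrained to b_mx * b_ym * var(X).
    sigma = S.copy()
    sigma[0, 2] = sigma[2, 0] = b_mx * b_ym * S[0, 0]
    _, logdet_sigma = np.linalg.slogdet(sigma)
    _, logdet_s = np.linalg.slogdet(S)
    f_ml = logdet_sigma + np.trace(S @ np.linalg.inv(sigma)) - logdet_s - p
    return (n - 1) * f_ml

rng = np.random.default_rng(3)
n = 500
x = rng.normal(size=n)
m = 0.5 * x + rng.normal(size=n)
y_full = 0.4 * m + rng.normal(size=n)                # no direct effect
y_partial = 0.4 * m + 0.3 * x + rng.normal(size=n)   # direct effect present

chi2_full = full_mediation_chi_square(x, m, y_full)
chi2_partial = full_mediation_chi_square(x, m, y_partial)
critical_value = 3.84  # chi-square, 1 df, alpha = .05
print("data without direct effect:", chi2_full)
print("data with direct effect:   ", chi2_partial)
```

A chi-square difference above the critical value favors the less restrictive partial mediation model; a value below it means the more parsimonious full mediation model cannot be rejected. With latent variables or larger models the fit function has to be minimized numerically, which is what SEM software does.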
We would also like to note that Figure 5.3 contains prototypes. In reality a myriad of models exists varying along the dimensions of complexity and prior research support. For example, the Theory of Planned Behavior (Ajzen, 1991) could be considered a “mid-range” theory. Briefly, this theory hypothesizes that one’s intention to exhibit a behavior mediates the relationship between one’s attitude toward the behavior and the behavior itself. In addition, subjective
norms and perceived behavioral control are related to the intention to exhibit a behavior. Perceived behavioral control is also linked to actual behavior. This theory is relatively conventional in nature because it has been tested a number of times (Conner & Armitage, 1998). However, we would label it a mid-range model because it contains a somewhat complicated set of relationships among the variables, but only one primary mediator, which is the behavioral intention. Models derived from this theory could reasonably be tested using either limited or full information estimators. If a researcher is extending the model in a more speculative direction, then limited information techniques are most appropriate because they are less susceptible to specification errors. If a researcher is simply applying the model to a new behavioral domain, and thus testing the model’s generalizability after prior research has supported it in other settings, then use of full information techniques is acceptable because specification error is less likely under these conditions.
Conclusion
The purpose of this chapter was to review and critique the pioneering work of Baron and Kenny (1986), specifically with reference to how it has been applied in the social and organizational sciences. Their article provided a procedure for testing mediation hypotheses and brought greater focus onto theoretical and empirical issues concerning inferences of mediation. Although the Baron and Kenny approach has made important contributions to testing mediation hypotheses, over the last 20 years consensus has emerged that (a) their four-step test has conceptual and mathematical limitations, (b) their test, while popular and simple to implement, is often not the optimal test for mediation hypotheses, and (c) many researchers have relied too heavily on the Baron and Kenny four-step test as justification for drawing causal inferences of mediation.
Author Note
The authors would like to thank Martin Edwards, Chuck Lance, John Mathieu, and Bob Vandenberg for their insightful suggestions, comments, and constructive criticisms on earlier versions of this
manuscript. We would also like to thank Larry James for several thought-provoking discussions over the years involving mediation, causal inference, and structural equation modeling. This acknowledgment does not imply that these individuals necessarily agree with all of the points presented herein.
References
Ajzen, I. (1991). The theory of planned behavior. Organizational Behavior and Human Decision Processes, 50, 179–211.
Bacharach, S. B. (1989). Organizational theories: Some criteria for evaluation. Academy of Management Review, 14, 496–515.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Barrick, M. R., Stewart, G. L., Neubert, M. J., & Mount, M. K. (1998). Relating member ability and personality to work-team processes and team effectiveness. Journal of Applied Psychology, 83, 377–391.
Bing, M. N., Davison, H. K., LeBreton, D. L., & LeBreton, J. M. (2002). Issues and improvements in tests of mediation. Society for Industrial and Organizational Psychology, 17th Annual Conference, Toronto, Ontario, Canada.
Binning, J. F., & Barrett, G. V. (1989). Validity of personnel decisions: A conceptual analysis of the inferential and evidential bases. Journal of Applied Psychology, 74, 478–494.
Brown, J. D., & Smart, S. A. (1991). The self and social conduct: Linking self-representations to prosocial behavior. Journal of Personality and Social Psychology, 60, 368–375.
Byrne, B. M. (1998). Structural equation modeling with LISREL, PRELIS, and SIMPLIS: Basic concepts, applications, and programming. Mahwah, NJ: Lawrence Erlbaum.
Carver, C. S., & Scheier, M. F. (1985). A control-systems approach to the self-regulation of action. In J. Kuhl & J. Beckman (Eds.), Action control: From cognition to behavior (pp. 237–265). New York: Springer-Verlag.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Conner, M., & Armitage, C. J. (1998). Extending the theory of planned behavior: A review and avenues for further research. Journal of Applied Social Psychology, 28, 1429–1464.
Conrad, M. A. (2006). Aptitude is not enough: How personality and behavior predict academic performance. Journal of Research in Personality, 40, 339–346.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Eaton, W. O., & Yu, A. P. (1989). Are sex differences in child motor activity level a function of sex differences in maturational status? Child Development, 60, 1005–1011.
Fiske, S. T. (1995). From the still small voice of discontent to the Supreme Court: How I learned to stop worrying and love social cognition. In P. E. Shrout & S. T. Fiske (Eds.), Personality research, methods, and theory: A festschrift honoring Donald W. Fiske (pp. 221–239). Hillsdale, NJ: Lawrence Erlbaum Associates.
Fritz, M. S., & MacKinnon, D. P. (2007). Required sample size to detect the mediated effect. Psychological Science, 18, 233–239.
Gilbert, D. T. (1991). How mental systems believe. American Psychologist, 46, 107–119.
Gilstrap, L., & Papierno, P. B. (2004). Is the cart pushing the horse? The effects of child characteristics on children’s and adults’ interview behaviors. Applied Cognitive Psychology, 18, 1059–1078.
Gong, Y., Shenkar, O., Luo, Y., & Nyaw, M. (2007). Do multiple parents help or hinder international joint venture performance? The mediating roles of contract completeness and partner cooperation. Strategic Management Journal, 28, 1021–1034.
Gurhan-Canli, Z., & Maheswaran, D. (2000). Cultural variations in country of origin effects. Journal of Marketing Research, 37, 309–317.
Hasher, L., Goldstein, D., & Toppino, T. (1977). Frequency and the conference of referential validity. Journal of Verbal Learning and Verbal Behavior, 16, 107–112.
Hoyle, R. H. (1995). The structural equation modeling approach: Basic concepts and fundamental issues. In R. H. Hoyle (Ed.), Structural equation modeling: Concepts, issues, and applications (pp. 1–15). Thousand Oaks, CA: Sage.
James, L. R. (1980). The unmeasured variables problem in path-analysis. Journal of Applied Psychology, 65, 415–421.
James, L. R., & Brett, J. M. (1984). Mediators, moderators, and tests for mediation. Journal of Applied Psychology, 69, 307–321.
James, L. R., Mulaik, S. A., & Brett, J. M. (1982). Causal analysis: Assumptions, models, and data. Beverly Hills, CA: Sage.
James, L. R., Mulaik, S. A., & Brett, J. M. (2006). A tale of two methods. Organizational Research Methods, 9, 233–244.
James, L. R., & Singh, B. K. (1978). Introduction to logic, assumptions, and basic analytic procedures of 2-stage least-squares. Psychological Bulletin, 85, 1104–1122.
Judd, C. M., & Kenny, D. A. (1981). Process analysis: Estimating mediation in treatment evaluations. Evaluation Review, 5, 602–619.
Kenny, D. A., Kashy, D. A., & Bolger, N. (1998). Data analysis in social psychology. In D. T. Gilbert, S. T. Fiske, & G. Lindzey (Eds.), The handbook of social psychology (4th ed., pp. 233–265). Burr Ridge, IL: McGraw-Hill.
Kerig, P. K. (1998). Moderators and mediators of the effects of interparental conflict on children’s adjustment. Journal of Abnormal Child Psychology, 26, 199–212.
Kim, D. (1999). Determinants of the survival of gainsharing programs. Industrial and Labor Relations Review, 53, 21–42.
Kuhn, D. (1996). Is good thinking scientific thinking? In D. R. Olson & N. Torrance (Eds.), Modes of thought: Explorations in culture and cognition (pp. 261–281). New York: Cambridge University Press.
Kunda, Z. (1990). The case for motivated reasoning. Psychological Bulletin, 108, 480–498.
Lance, C. E. (1986). Disturbance term regression test procedures for recursive and nonrecursive models: Solution from intercorrelation matrices. Multivariate Behavioral Research, 21, 429–439.
Lance, C. E., Cornwell, J. M., & Mulaik, S. A. (1988). Limited information parameter estimates for latent or mixed manifest and latent variable models. Multivariate Behavioral Research, 23, 171–187.
Lance, C. E., Woehr, D. J., & Fisicaro, S. A. (1991). Cognitive categorization processes in performance evaluation: Confirmatory tests of two models. Journal of Organizational Behavior, 12, 1–20.
Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ: Prentice Hall.
Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. Reading, MA: Addison-Wesley.
MacKinnon, D. P., Fairchild, A. J., & Fritz, M. S. (2007). Mediation analysis. Annual Review of Psychology, 58, 593–614.
MacKinnon, D. P., Krull, J. L., & Lockwood, C. M. (2000). Equivalence of the mediation, confounding, and suppression effect. Prevention Science, 1, 173–181.
MacKinnon, D. P., Lockwood, C. M., Hoffman, J. M., West, S. G., & Sheets, V. (2002). A comparison of methods to test mediation and other intervening variable effects. Psychological Methods, 7, 83–104.
Mathieu, J. E., Heffner, T. S., Goodwin, G. F., Salas, E., & Cannon-Bowers, J. A. (2000). The influence of shared mental models on team process and performance. Journal of Applied Psychology, 85, 273–283.
Mathieu, J. E., & Taylor, S. R. (2006). Clarifying conditions and decision points for mediational type inferences in organizational behavior. Journal of Organizational Behavior, 27, 1031–1056.
Medsker, G. J., Williams, L. J., & Holahan, P. J. (1994). A review of current practices for evaluating causal models in organizational behavior and human resource management research. Journal of Management, 20, 439–464.
Muraven, M., & Baumeister, R. F. (2000). Self-regulation and depletion of limited resources: Does self-control resemble a muscle? Psychological Bulletin, 126, 247–259.
Nelson, M. W., & Tayler, W. B. (2007). Information pursuit in financial statement analysis: Effects of choice, effort, and reconciliation. The Accounting Review, 82, 731–758.
Nisbett, R., & Ross, L. (1980). Human inference: Strategies and shortcomings of social judgment. Englewood Cliffs, NJ: Prentice Hall.
Osborne, J. W. (2001). Testing stereotype threat: Does anxiety explain race and sex differences in achievement? Contemporary Educational Psychology, 26, 291–310.
Platt, J. R. (1964). Strong inference. Science, 146, 347–353.
Reinhart, A. M., Marshall, H. M., Feeley, T. H., & Tutzauer, F. (2007). The persuasive effects of message framing in organ donation: The mediating role of psychological reactance. Communication Monographs, 74, 229–255.
Schwartz, M. (1982). Repetition and rated true value of statements. American Journal of Psychology, 95, 393–407.
Shrout, P. E., & Bolger, N. (2002). Mediation in experimental and nonexperimental studies: New procedures and recommendations. Psychological Methods, 7, 422–445.
Skarlicki, D. P., & Latham, G. P. (1996). Increasing citizenship behavior within a labor union: A test of organizational justice theory. Journal of Applied Psychology, 81, 161–169.
Sobel, M. E. (1982). Asymptotic confidence intervals for indirect effects in structural models. In S. Leinhardt (Ed.), Sociological methodology (pp. 290–312). San Francisco: Jossey-Bass.
Vandewalle, D., Brown, S. P., Cron, W. L., & Slocum, J. W., Jr. (1999). The influence of goal orientation and self-regulation tactics on sales performance: A longitudinal field test. Journal of Applied Psychology, 84, 249–259.
Welch, J. L., & Austin, J. K. (2001). Stressors, coping and depression in haemodialysis patients. Journal of Advanced Nursing, 33, 200–207.
Williams, L. J., Edwards, J. R., & Vandenberg, R. J. (2003). Recent advances in causal modeling methods for organizational management research. Journal of Management, 29, 903–936.
Zajonc, R. B. (1968). Attitudinal effects of mere exposure. Journal of Personality and Social Psychology, 9, 1–27.
6 Seven Deadly Myths of Testing Moderation in Organizational Research
Jeffrey R. Edwards
Moderation is central to research in the organizational and social sciences. Moderation occurs when the relationship between an independent variable and dependent variable depends on the level of a third variable, usually called a moderator variable (Aiken & West, 1991; Cohen, 1978). Moderation is involved in research demonstrating that the effects of motivation on job performance are stronger among employees with high abilities (Locke & Latham, 1990), the effects of distributive justice on employee reactions are greater when procedural justice is low (Brockner & Wiesenfeld, 1996), and the effects of job demands on illness are weaker when employees have control in their work environment (Karasek, 1979; Karasek & Theorell, 1990).
[Footnote: Throughout this chapter, the term independent variable is synonymous with predictor variable; the term dependent variable is equivalent to criterion variable, outcome variable, and response variable; and the term moderator variable is the same as conditioning variable. In the language of path analysis and structural equation modeling, a dependent variable can be called an endogenous variable, and independent and moderator variables can be either exogenous or endogenous variables, depending on whether they are caused by other variables within a larger model (for the cases examined in this chapter, the independent and moderator variables are treated as exogenous variables).]
Procedures for testing moderation have generated considerable confusion. This confusion is organized here in terms of seven deadly myths. From a research standpoint, these myths are deadly because they lead researchers to make unwise choices, waste time and effort, and draw conclusions that are misleading or incorrect. This chapter describes these seven myths, discusses their basis and origins, and
144
Jeffrey R. Edwards
attempts to dispel them. My goal is not to identify researchers who subscribe to moderation mythology or point to streams of research where the myths run rampant. Rather, my intent is to give researchers good reasons to reject the myths, setting them aside as we pursue answers to important research questions that involve moderation. The Seven Myths Moderation is usually tested with analysis of variance or multiple regression (Cohen, Cohen, West, & Aiken, 2003; Pedhazur, 1997). Because analysis of variance is a special case of multiple regression (Cohen, 1968), tests of moderation using both approaches are susceptible to essentially the same myths. For simplicity, the present discussion is framed in terms of multiple regression, in which moderation is tested using equations of the following form:
Y = b0 + b1X + b2Z + b3XZ + e.
(6.1)
In Equation 6.1, Y is the dependent variable, X is the independent variable, and Z is the moderator variable. The product XZ captures the interaction between X and Z such that, when X and Z are controlled, the coefficient on XZ (i.e., b3) represents the change in the effect of X on Y for a unit change in Z (Aiken & West, 1991; Cohen, 1978). The interpretation of b3 is symmetric, such that it also indicates the change in the effect of Z on Y for a unit change in X. When Z is framed as the moderator variable, it is customary to view b3 as the change in the effect of X across levels of Z, which is the perspective adopted here.

Myth 1: Product Terms Create Multicollinearity Problems

Researchers often express the concern that Equation 6.1 is prone to multicollinearity (Morris, Sherman, & Mansfield, 1986). In general, multicollinearity decreases the stability of regression coefficient estimates and weakens the unique contribution of each predictor to the explained variance in the outcome (Belsley, 1991; Mansfield & Helms, 1982). Equation 6.1 might seem particularly susceptible to multicollinearity because X and Z can be highly correlated with the product term XZ. Drawing from Bohrnstedt and Goldberger (1969), when X and Z are normally distributed, the correlation between X and XZ can be written as

\[
r_{X,XZ} = \frac{E(Z)V(X) + E(X)C(X,Z)}{\sqrt{V(X)\left[E(X)^2 V(Z) + E(Z)^2 V(X) + 2E(X)E(Z)C(X,Z) + V(X)V(Z) + C(X,Z)^2\right]}} \tag{6.2}
\]
where E(), V(), and C() are expected value, variance, and covariance operators, respectively. Applying Equation 6.2 to representative values of X and Z demonstrates that rX,XZ can be high. For instance, if X and Z are measured on scales ranging from 1 to 5 and produce means of 3, unit variances, and a correlation of .50, rX,XZ would equal .85. Even if X and Z were uncorrelated, rX,XZ would drop only modestly to .69. For these illustrative values, identical correlations would be obtained for rZ,XZ, the correlation between Z and XZ.

Concerns over the correlations of XZ with X and Z have prompted various corrective measures. For instance, Morris et al. (1986) suggested using principal components regression in which X, Z, and XZ in Equation 6.1 are replaced with weighted linear composites that exhaust the information in these variables and are uncorrelated with one another. More often, researchers center X and Z at their means, which usually reduces their correlation with XZ (Cronbach, 1987; Jaccard, Wan, & Turrisi, 1990). This practice capitalizes on the fact that the correlation of XZ with X and Z depends on the means of X and Z. Returning to Equation 6.2, if X and Z are mean-centered, such that E(X) = E(Z) = 0, the numerator of Equation 6.2 equals zero, which in turn means that rX,XZ equals zero. Replacing X and Z with one another in Equation 6.2 further shows that when E(X) = E(Z) = 0, rZ,XZ also equals zero. When the normality assumption underlying Equation 6.2 is relaxed, rX,XZ and rZ,XZ do not necessarily equal zero when X and Z are mean-centered, but other values can be derived to center X and Z such that rX,XZ and rZ,XZ both equal zero (Bohrnstedt & Goldberger, 1969; Smith & Sasaki, 1979).
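The illustrative values given for Equation 6.2 can be reproduced with a few lines of Python. The following is a minimal sketch rather than part of the original analysis; the function name and argument names are hypothetical, and the formula assumes bivariate normality, as Equation 6.2 does.

```python
import math


def corr_x_with_product(mean_x, mean_z, var_x, var_z, cov_xz):
    """Correlation between X and XZ under bivariate normality (Equation 6.2)."""
    numerator = mean_z * var_x + mean_x * cov_xz
    # Variance of the product XZ (the bracketed term in Equation 6.2).
    var_product = (
        mean_x**2 * var_z
        + mean_z**2 * var_x
        + 2 * mean_x * mean_z * cov_xz
        + var_x * var_z
        + cov_xz**2
    )
    return numerator / math.sqrt(var_x * var_product)


# Means of 3, unit variances, correlation of .50 (covariance = .50).
raw_correlated = corr_x_with_product(3, 3, 1, 1, 0.5)
# Same means and variances, but X and Z uncorrelated.
raw_uncorrelated = corr_x_with_product(3, 3, 1, 1, 0.0)
# Mean-centered X and Z: the numerator vanishes.
centered = corr_x_with_product(0, 0, 1, 1, 0.5)

print(round(raw_correlated, 2), round(raw_uncorrelated, 2), round(centered, 2))
# prints: 0.85 0.69 0.0
```

The three values match the .85, .69, and zero correlations described above, which shows that the drop to zero under mean-centering comes entirely from the means in the numerator.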
Although mean-centering usually reduces the correlation of first-order terms with their products and squares, it has no meaningful effect on the estimation or interpretation of regression equations that contain these terms (Cohen, 1978; Cronbach, 1987; Dunlap & Kemery, 1987; Kromrey & Foster-Johnson, 1998). When Equation 6.1 is estimated, evidence for moderation is obtained from testing b3, the coefficient on XZ. This test is equivalent to the test of the difference in R² yielded by Equation 6.1 and an equation that drops XZ. The R² values for these two equations are insensitive to additive transformations of X and Z (Arnold & Evans, 1979; Cohen, 1978; Dunlap & Kemery, 1987), of which mean-centering is a special case. Hence, the test of b3 in Equation 6.1 is unaffected by mean-centering, and the “problem” of multicollinearity seemingly indicated by rX,XZ and rZ,XZ is more apparent than real. The only real source of multicollinearity in Equation 6.1 involves rX,Z, the correlation between X and Z, and this correlation is not affected by mean-centering. Unlike b3, tests of b1 and b2 change when X and Z are rescaled, but these changes are not symptoms of multicollinearity or any other analytical anomaly. Rather, they reflect systematic relationships between b1 and b2 and the scaling of X and Z, a topic to which we now turn.

Myth 2: Coefficients on First-Order Terms Are Meaningless

The interpretation of the coefficients on X and Z when XZ is included in the equation has been a source of confusion. This confusion emanates from the fact that, with XZ in the equation, the coefficients on X and Z are scale-dependent, such that adding or subtracting a constant to X changes the coefficient on Z, and vice versa (Cohen, 1978). These effects can be demonstrated using a version of Equation 6.1 that uses X* = X + c and Z* = Z + d as predictors, where c and d are arbitrary constants:
Y = b0* + b1*X* + b2*Z* + b3*X*Z* + e.
(6.3)
The asterisks in Equation 6.3 distinguish the coefficients on X*, Z*, and X*Z* from those on X, Z, and XZ in Equation 6.1. Substituting X* = X + c and Z* = Z + d into Equation 6.3 yields

Y = b0* + b1*(X + c) + b2*(Z + d) + b3*(X + c)(Z + d) + e
  = b0* + b1*X + cb1* + b2*Z + db2* + b3*XZ + db3*X + cb3*Z + cdb3* + e
  = (b0* + cb1* + db2* + cdb3*) + (b1* + db3*)X + (b2* + cb3*)Z + b3*XZ + e.
(6.4)
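The last line of Equation 6.4 can be checked numerically by fitting the same model to raw and shifted predictors and comparing the two sets of estimates. The sketch below uses simulated data with arbitrary, hypothetical population values, and it uses mean-centering as one possible choice of c and d; the helper name `fit_moderated` is likewise an assumption made for illustration.

```python
import numpy as np


def fit_moderated(x, z, y):
    """OLS estimates (b0, b1, b2, b3) for Y = b0 + b1*X + b2*Z + b3*XZ + e."""
    design = np.column_stack([np.ones_like(x), x, z, x * z])
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coefs


rng = np.random.default_rng(0)
n = 500
x = rng.normal(3.0, 1.0, n)
z = rng.normal(3.0, 1.0, n)
y = 1.0 + 0.5 * x + 0.3 * z + 0.4 * x * z + rng.normal(0.0, 1.0, n)

# X* = X + c and Z* = Z + d; here c and d are the negatives of the sample means.
c, d = -x.mean(), -z.mean()

b0, b1, b2, b3 = fit_moderated(x, z, y)          # Equation 6.1
s0, s1, s2, s3 = fit_moderated(x + c, z + d, y)  # Equation 6.3 (starred)

# Term-by-term relations implied by the last line of Equation 6.4.
check_b0 = np.isclose(b0, s0 + c * s1 + d * s2 + c * d * s3)
check_b1 = np.isclose(b1, s1 + d * s3)
check_b2 = np.isclose(b2, s2 + c * s3)
check_b3 = np.isclose(b3, s3)

print(check_b0, check_b1, check_b2, check_b3)
```

All four checks hold to numerical precision because the raw and shifted predictors span the same column space, so the two fits are the same model written in different coordinates.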
Equation 6.4 expresses the coefficients on X, Z, and XZ in terms of the coefficients from Equation 6.3 and the constants c and d. Comparing Equations 6.1 and 6.4 shows that b0 = b0* + cb1* + db2* + cdb3*, b1 = b1* + db3*, b2 = b2* + cb3*, and b3 = b3*. Solving these expressions for b0*, b1*, b2*, and b3* and substituting the results into Equation 6.3 reveals how rescaling X and Z changes the coefficients produced by Equation 6.1 (Arnold & Evans, 1979; Cohen, 1978):

Y = (b0 – cb1 – db2 + cdb3) + (b1 – db3)X* + (b2 – cb3)Z* + b3X*Z* + e.
(6.5)

Equation 6.5 shows that, when X and Z are rescaled by adding c and d, respectively, b1 is reduced by db3, b2 is reduced by cb3, and b3 is unaffected. In addition, the intercept b0 is reduced by cb1 and db2 and increased by cdb3, although these changes do not alter the form of the interaction indicated by Equation 6.5.

The effects of rescaling X and Z have caused concern because, in organizational and social research, measures of X and Z are usually at the interval level. For such measures, adding or subtracting an arbitrary constant is a permissible transformation from a statistical standpoint. However, this transformation changes the magnitude, sign, and significance of the coefficients on X and Z. Consequently, some researchers have declared the coefficients on X and Z “arbitrary nonsense” (Cohen, 1978, p. 861) and assert that attempts to test or interpret these coefficients are “useless” (Allison, 1977, p. 148).

Although the effects of rescaling X and Z are undeniable, these effects do not render the coefficients on X and Z arbitrary or useless. Rather, rescaling X and Z changes their coefficients in systematic ways that can facilitate interpretation. This point is rooted in the principle that, when a regression equation contains X, Z, and XZ, the coefficient on X is the slope of Y on X when Z equals zero (Aiken & West, 1991).
This principle is seen by rewriting Equation 6.1 in terms of simple slopes (Aiken & West, 1991; Jaccard et al., 1990):
Y = (b0 + b2Z) + (b1 + b3Z)X + e.
(6.6)
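Equation 6.6 translates directly into a small helper that returns the slope of Y on X at any chosen level of Z. The coefficients and the moderator mean and standard deviation below are hypothetical values picked only to illustrate the calculation.

```python
def simple_slope(b1, b3, z):
    """Slope of Y on X at a given level of Z: the compound term (b1 + b3*Z)."""
    return b1 + b3 * z


# Hypothetical estimates from Equation 6.1 and hypothetical moderator statistics.
b1, b3 = 0.20, 0.15
z_mean, z_sd = 3.0, 1.0

slopes = {
    "Z = 0": simple_slope(b1, b3, 0.0),
    "one SD below mean": simple_slope(b1, b3, z_mean - z_sd),
    "mean": simple_slope(b1, b3, z_mean),
    "one SD above mean": simple_slope(b1, b3, z_mean + z_sd),
}

for label, slope in slopes.items():
    print(f"{label}: {slope:.2f}")
```

With these values the slope at Z = 0 is simply b1, whereas the slopes at the mean of Z and one standard deviation on either side of it describe points that actually occur in the data.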
When Z = 0, the term b3Z = 0, and the compound coefficient on X reduces to b1. This principle is symmetric, such that b2 is the slope of Y on Z when X equals zero. This principle also applies to Equation 6.3, such that b1* is the slope of Y on X* when Z* = 0, and b2* is the slope of Y on Z* when X* = 0. Hence, whether rescaling X and Z yields useful coefficients depends on whether X* = 0 and Z* = 0 are meaningful values. If rescaling X and Z shifts their distribution such that X* = 0 and Z* = 0 fall outside the data, then b1* and b2* are not meaningful because they estimate slopes at points that do not exist in the data. On the other hand, if rescaling locates X* = 0 and Z* = 0 within the bounds of the data, then b1* and b2* can be interpreted accordingly. For instance, if c and d are the negative of the means of X and Z, respectively, such that X* and Z* are mean-centered versions of X and Z, then b1* is the slope of Y on X at the mean of Z, and b2* is the slope of Y on Z at the mean of X. X and Z can be rescaled using other values, such as the negative of scores representing one standard deviation above and below the means of X and Z, which can help clarify the form of the interaction captured by XZ (Aiken & West, 1991; Jaccard et al., 1990). Alternately, Equation 6.1 can be estimated, and Equation 6.5 can be used to calculate coefficients that would be obtained for different rescalings of X and Z. Hence, the coefficients on X and Z are meaningful and useful when X = 0 and Z = 0 are within the range of the data, and these coefficients are needed to compute simple slopes that clarify the form of the moderating effect (Aiken & West, 1991; Jaccard et al., 1990).

Myth 3: Measurement Error Poses Little Concern When First-Order Terms Are Reliable

Researchers tend to underemphasize the effects of measurement error on estimates obtained from Equation 6.1. This tendency is manifested in two ways. First, concerns over measurement error typically hinge on whether reliability estimates meet some conventional threshold, such as the usual .70 criterion for Cronbach’s alpha (Lance, Butts, & Michels, 2006). If the threshold is met, analyses proceed as if the effects of measurement error can be disregarded.
Second, reliability estimates are usually reported for X and Z but not the product term XZ. This practice implies that if X and Z exhibit adequate reliabilities, the reliability of XZ is likewise adequate. Because the reliability of XZ is not computed, its effects on the estimation of Equation 6.1 can easily escape the attention of researchers.

The effects of measurement error on tests of moderation deserve greater attention than they are usually accorded. Methodological work has shown that measurement error drastically reduces the power to detect moderator effects (Aiken & West, 1991; Arnold, 1982; Busemeyer & Jones, 1983; Dunlap & Kemery, 1988; Jaccard & Wan, 1995; Lubinski & Humphreys, 1990), and this problem does not disappear when the reliabilities of X and Z exceed .70. Moreover, the reliability of the product term XZ can be quite low even when the reliabilities of X and Z might be considered adequate. Drawing from Bohrnstedt and Marwell (1978), when X and Z follow a bivariate normal distribution, the reliability of XZ can be expressed as

\[
\rho_{XZ} = \frac{E(X)^2 V(Z)\rho_Z + E(Z)^2 V(X)\rho_X + 2E(X)E(Z)C(X,Z) + C(X,Z)^2 + V(X)V(Z)\rho_X\rho_Z}{E(X)^2 V(Z) + E(Z)^2 V(X) + 2E(X)E(Z)C(X,Z) + C(X,Z)^2 + V(X)V(Z)} \tag{6.7}
\]
where ρXZ is the reliability of XZ and ρX and ρZ are the reliabilities of X and Z, respectively. To illustrate how ρXZ relates to ρX and ρZ, consider the case in which X and Z are standardized, such that Equation 6.7 simplifies to

\[
\rho_{XZ} = \frac{r_{XZ}^2 + \rho_X \rho_Z}{r_{XZ}^2 + 1} \tag{6.8}
\]
where rXZ is the correlation between X and Z. Using Equation 6.8, if X and Z are uncorrelated and have reliabilities of .70, the reliability of XZ equals .49. If the correlation between X and Z is .25, the reliability of XZ equals .52, and if the correlation increases further to .50, the reliability of XZ becomes .59. As these examples show, even when the reliabilities of X and Z meet conventional standards, the reliability of XZ can fail to reach those very same standards.

It should be noted that the reliability of XZ is scale-dependent, such that adding a constant to X or Z will change the reliability of XZ (Bohrnstedt & Marwell, 1978). These scaling effects operate through E(X) and E(Z), which appear in the numerator and denominator of Equation 6.7. Because ρXZ depends on the scales of X and Z, Bohrnstedt and Marwell (1978) cautioned against estimating ρXZ when X and Z are measured on interval scales, noting that the estimated value of ρXZ is as arbitrary as the origins of X and Z. However, when discussing the reliability of XZ, Lubinski and Humphreys (1990) surmised that the scaling effects of X and Z on ρXZ can be disregarded when the test of XZ controls for X and Z, because the increment in R² explained by XZ is unaffected by the scaling of X and Z (Arnold & Evans, 1979; Cohen, 1978; Dunlap & Kemery, 1987). It follows that the reduction in statistical power created by measurement error in X, Z, and XZ is unaffected by the scaling of X and Z. Perhaps the reliability of the partialed XZ product is not affected by rescaling X and Z, even though the reliability of the XZ product itself is scale-dependent. This issue could be clarified by deriving the reliability of the partialed XZ product, using the work of Bohrnstedt and Marwell (1978) as a starting point.

Myth 4: Product Terms Should Be Tested Hierarchically

Studies of moderation often test the interaction term XZ hierarchically, first estimating an equation using only X and Z as predictors and then estimating an equation that adds XZ, as in Equation 6.1. The difference in R² between these two equations is then tested using the following F-ratio or its equivalent (e.g., Pedhazur, 1997):

\[
F = \frac{R^2_{X,Z,XZ} - R^2_{X,Z}}{\left(1 - R^2_{X,Z,XZ}\right)/(N - 4)} \tag{6.9}
\]
where R²X,Z is the R² from the equation using X and Z as predictors, R²X,Z,XZ is the R² from the equation that adds XZ to X and Z as predictors, and N is the sample size. The F-ratio given in Equation 6.9 has 1 numerator degree of freedom and N – 4 denominator degrees of freedom. A statistically significant F-ratio is taken as evidence of moderation. This hierarchical approach to testing moderation is firmly rooted in the literature, as evidenced by methodological discussions that refer to moderated regression as “hierarchical” (Busemeyer & Jones, 1983; Cortina, 1993; Jaccard et al., 1990; Lubinski & Humphreys, 1990) and present separate regression equations with and without the XZ product term (Arnold, 1982; Arnold & Evans, 1979; Cortina, 1993; Dunlap & Kemery, 1988; Jaccard et al., 1990; Lubinski & Humphreys, 1990; MacCallum & Mar, 1995; Morris et al., 1986; Zedeck, 1971).

The hierarchical approach to testing moderation has two drawbacks. First, when a moderating effect is captured by a single product term, such as XZ in Equation 6.1, hierarchical analysis is unnecessary because the F-ratio in Equation 6.9 will give the same result as the t test of the coefficient on XZ (Cohen, 1978; Jaccard et al., 1990; Kromrey & Foster-Johnson, 1998; McClelland & Judd, 1993). If the increment in R² explained by the moderating effect is of interest, it can be computed by squaring the t-statistic to obtain the corresponding F-statistic and multiplying this quantity by the denominator of Equation 6.9. When a moderating effect involves more than one product term, as in ANOVA designs with factors that have more than two levels, it might be convenient to test the effect and compute the increment in R² using the hierarchical approach, although the same result is given by procedures that test simultaneous constraints on regression coefficients, such as the GLM procedure of SPSS using the LMATRIX subcommand.

A second drawback of the hierarchical approach is that it can generate interpretations of the coefficients on X and Z that are misleading. In practice, researchers who use the hierarchical approach often interpret the coefficients on X and Z at the first step, before XZ has been added to the equation. These interpretations are unconditional, such that the effect of X on Y is treated as constant across levels of Z, and likewise, the effect of Z on Y is viewed as constant across levels of X. However, if the coefficient on XZ is significant in the second step, then the effects of X and Z on Y are both conditional, such that the effect of each variable depends on the level of the other variable. The conditional effect of X is shown by Equation 6.6, in which the coefficient on X is the compound term (b1 + b3Z). Rewriting Equation 6.6 to show the conditional effect of Z yields the compound coefficient (b2 + b3X) on Z. Hence, when the second step indicates that moderation exists, the coefficients on X and Z in the first step should be disregarded because, by definition, moderation means that the effects of X and Z on Y are not each represented by a single value, but by a range of values that vary across levels of the other variable.
This variation is not captured by the coefficients on X and Z from the first step, and reporting these coefficients invites their interpretation, which is unwarranted when the second step gives support for moderation.

Myth 5: Curvilinearity Can Be Disregarded When Testing Moderation

Studies of moderation rarely examine the squared terms X² and Z² along with the product term XZ. Disregarding X² and Z² might seem justified for various reasons. For instance, if an interaction between X and Z is predicted on theoretical grounds, then testing X² and Z² would go beyond what was predicted and might be frowned upon as atheoretical (Shepperd, 1991). In a similar vein, researchers simply might not consider curvilinear effects as often as moderating effects (Cortina, 1993; Ganzach, 1997). This possibility is consistent with a PsycINFO search of articles published since 1980 in the Academy of Management Journal, the Journal of Applied Psychology, and Organizational Behavior and Human Decision Processes. Of these articles, 232 mentioned the terms moderation or moderated in the title or abstract, whereas only 34 mentioned the terms curvilinear or quadratic. Researchers might also avoid testing X² and Z² along with XZ due to interpretational difficulties. Methodological discussions of moderation and curvilinearity usually treat XZ, X², and Z² as separate terms, each with its own interpretation (Cortina, 1993; Ganzach, 1997, 1998; Lubinski & Humphreys, 1990; MacCallum & Mar, 1995). Any difficulties involved in the interpretation of XZ are likely to be compounded when X² and Z² are added to the picture.

As a general rule, researchers investigating moderation hypotheses should consider testing X² and Z² along with XZ (Cortina, 1993; Ganzach, 1997; Lubinski & Humphreys, 1990; MacCallum & Mar, 1995). Doing so helps establish that the coefficient on XZ taken as evidence for moderation does not spuriously reflect curvilinearity associated with X², Z², or both. Results for XZ can be misleading because, when X and Z are correlated, XZ is usually correlated with X² and Z² (Cortina, 1993; Ganzach, 1997; Lubinski & Humphreys, 1990; MacCallum & Mar, 1995). Drawing from Bohrnstedt and Goldberger (1969), if X and Z are normally distributed with zero means, the correlation between XZ and X² is

\[
r_{XZ,X^2} = \frac{2V(X)C(X,Z)}{\sqrt{2V(X)^2\left[V(X)V(Z) + C(X,Z)^2\right]}} \tag{6.10}
\]
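Equation 6.10 can be evaluated directly and compared against a simulated correlation. The sketch below is illustrative only: it assumes zero-mean bivariate normal X and Z with unit variances and a correlation of .50, and the function name is hypothetical.

```python
import math

import numpy as np


def corr_product_with_square(var_x, var_z, cov_xz):
    """Correlation between XZ and X^2 for zero-mean normal X and Z (Equation 6.10)."""
    numerator = 2 * var_x * cov_xz
    denominator = math.sqrt(2 * var_x**2 * (var_x * var_z + cov_xz**2))
    return numerator / denominator


# Analytic values for unit variances: uncorrelated versus correlated predictors.
analytic_zero = corr_product_with_square(1.0, 1.0, 0.0)
analytic = corr_product_with_square(1.0, 1.0, 0.5)

# Monte Carlo check using simulated zero-mean bivariate normal scores.
rng = np.random.default_rng(1)
cov = [[1.0, 0.5], [0.5, 1.0]]
x, z = rng.multivariate_normal([0.0, 0.0], cov, size=200_000).T
simulated = np.corrcoef(x * z, x**2)[0, 1]

print(round(analytic_zero, 3), round(analytic, 3), round(simulated, 3))
```

Under these assumptions the analytic correlation is zero when X and Z are uncorrelated and roughly .63 when they correlate .50, and the simulated value should land close to the analytic one.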
Inserting representative values of V(X), V(Z), and C(X,Z) into Equation 6.10 shows how the correlation between X² and XZ is influenced by the association between X and Z. For instance, when X and Z are uncorrelated, C(X,Z) = 0, and XZ and X² are also uncorrelated. As the correlation between X and Z increases, the correlation between XZ and X² likewise increases. Shifting the means of X and Z from zero alters the correlation between XZ and X², but these changes do not affect tests of XZ and X² when X and Z are controlled. These principles also apply to the correlation between XZ and Z², which can be computed by reversing the positions of X and Z in Equation 6.10.

Simulation work has examined the effects of controlling for X² and Z² along with X and Z when testing XZ. When X² and Z² are uncorrelated with XZ, controlling for X² and Z² guards against inferring support for moderation that actually reflects curvilinearity, with a reduction in statistical power limited to the degrees of freedom consumed by X² and Z² (Cortina, 1993). When X² and Z² are correlated with XZ, the effect size of XZ is reduced when X² and Z² are controlled, because X² and Z² account for a portion of the variance that would be explained by XZ (Ganzach, 1998). This reduction in effect size increases the risk of Type II error for the test of XZ (Kromrey & Foster-Johnson, 1999). On the other hand, when X² and Z² are not controlled, the risk of Type I error for testing XZ can increase, given that moderation can be inferred when curvilinearity is actually responsible for the variance explained by XZ. The relative risks of Type I and Type II errors for tests of XZ also depend on the signs of the coefficients on XZ, X², and Z² (Ganzach, 1997). On balance, the benefits of controlling for X² and Z² seem to outweigh the costs (Cortina, 1993; Ganzach, 1997, 1998; Lubinski & Humphreys, 1990; MacCallum & Mar, 1995). Naturally, if curvilinear effects for X² and Z² are not predicted a priori as hypotheses that compete with the moderating effect of XZ, results for X² and Z² should be considered tentative, pending cross-validation (Kromrey & Foster-Johnson, 1999; Shepperd, 1991).

Examining X² and Z² might also be reasonable from a conceptual standpoint.
Strictly speaking, most theories in the organizational and social sciences predict relationships that are monotonic rather than linear (Busemeyer & Jones, 1983; Ganzach, 1998). Hypotheses of the form “if X increases, Y will increase” do not stipulate that the relationship between X and Y is linear, but instead make the more modest claim that higher values of X are associated with higher values of Y. In some instances, a monotonic relationship such as this might be better conceived as curvilinear rather than linear, as illustrated by the diminishing effects of income on happiness (Eckersley, 2000). Alternately, if a linear relationship is hypothesized, analyzing curvilinear terms verifies that the relationship was not, in fact, curvilinear, yielding a stronger test of the hypothesis. Tests of curvilinearity involving X² and Z² can benefit from including XZ, given that the correlations among XZ, X², and Z² can generate misleading evidence for curvilinearity as well as moderation (Ganzach, 1997). The correlation between X² and Z² can also be derived from Bohrnstedt and Goldberger (1969), and again this correlation is a function of the correlation between X and Z.

Finally, difficulties associated with interpreting XZ, X², and Z² separately can be addressed by interpreting these terms jointly along with X and Z. This task can be approached by drawing from the logic used to interpret simple slopes in moderated regression analysis (Aiken & West, 1991). This logic is illustrated by Equation 6.6, which rearranges terms in Equation 6.1 to show the effect of X on Y at various levels of Z. This logic can be extended to an equation that includes X, Z, XZ, X², and Z², as given below:
Y = b0 + b1X + b2Z + b3XZ + b4X² + b5Z² + e.
(6.11)
Rewriting Equation 6.11 to show the relationship between X and Y at various levels of Z yields
Y = (b0 + b2Z + b5Z²) + (b1 + b3Z)X + b4X² + e.
(6.12)
Equation 6.12 is a quadratic function relating X to Y that depends on the level of Z. The curvature of the function, indicated by b4, remains constant across levels of Z, whereas the intercept and the coefficient on X are influenced by Z, as shown by the terms (b0 + b2Z + b5Z²) and (b1 + b3Z), respectively. For a quadratic function such as Equation 6.12, the coefficient on X is the slope of the function at the point X = 0, as can be seen by taking the derivative of Y with respect to X:
dY/dX = b1 + b3Z + 2b4X.
(6.13)
Equation 6.13 gives the instantaneous slope of Y on X. When X = 0, Equation 6.13 reduces to b1 + b3Z, which is the coefficient on X in Equation 6.12. Results for Equation 6.12 can be interpreted as follows. If b4 is positive, the function relating X to Y is curved upward, as shown in the top three panels of Figure 6.1. If (b1 + b3Z) is negative, the curve is negatively sloped at X = 0, which means that the minimum of the curve is shifted to the right, as in Figure 6.1a. If (b1 + b3Z) is positive, the curve is positively sloped at X = 0, and the minimum of the curve is shifted to the left, as in Figure 6.1c. If (b1 + b3Z) equals zero, the curve is flat at X = 0, and the minimum of the curve is centered at X = 0, as in Figure 6.1b. In contrast, if b4 is negative, the function relating X to Y is curved downward, as in the bottom three panels of Figure 6.1. If (b1 + b3Z) is negative, the curve is again negatively sloped at X = 0, which now means that the maximum of the curve is shifted to the left, as in Figure 6.1d. If (b1 + b3Z) is positive, the curve is again positively sloped at X = 0, which means the maximum of the curve is shifted to the right, as in Figure 6.1f. Finally, if (b1 + b3Z) equals zero, the curve is again flat at X = 0, which indicates that the maximum of the curve is centered at X = 0, as in Figure 6.1e.

[Figure 6.1 appears here: six panels plotting Y against X. Top row: a. b4 > 0, b1 + b3Z < 0; b. b4 > 0, b1 + b3Z = 0; c. b4 > 0, b1 + b3Z > 0. Bottom row: d. b4 < 0, b1 + b3Z < 0; e. b4 < 0, b1 + b3Z = 0; f. b4 < 0, b1 + b3Z > 0.]

Figure 6.1 Quadratic functions relating X to Y for different values of b4 and b1 + b3Z.

The foregoing discussion leads to the following interpretation of the coefficients on XZ, X², and Z² in Equation 6.11. Specifically, the coefficient on XZ is part of the compound term that indicates the slope of the function relating X to Y at the point X = 0, and the coefficient on X² represents the curvature of the function. Together, the coefficients on X² and XZ capture the curvature and horizontal location, respectively, of the function relating X to Y. The coefficient on Z² is part of the intercept and indicates whether the effect of Z on the intercept varies across levels of Z. Hence, the coefficient on Z² should be considered if the vertical position of the function relating X to Y is relevant from a conceptual standpoint. Additional guidelines for interpreting curvilinear relationships between X and Y are provided by Aiken and West (1991), and these guidelines can be applied to the curvilinear function in Equation 6.12.
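The horizontal shifts described for Figure 6.1 can be made concrete by setting Equation 6.13 to zero and solving for X, which gives the turning point X = –(b1 + b3Z)/(2b4). This expression is a direct consequence of Equation 6.13 rather than a formula stated in the text, and the coefficients below are hypothetical values chosen so that three levels of Z reproduce the pattern of panels a, b, and c.

```python
def slope_at_zero(b1, b3, z):
    """Slope of the quadratic at X = 0, i.e., the compound term (b1 + b3*Z)."""
    return b1 + b3 * z


def turning_point(b1, b3, b4, z):
    """Value of X where dY/dX = b1 + b3*Z + 2*b4*X equals zero (b4 must be nonzero)."""
    return -(b1 + b3 * z) / (2 * b4)


# Hypothetical coefficients with upward curvature (b4 > 0).
b1, b3, b4 = 0.0, 1.0, 0.5

for z in (-1.0, 0.0, 1.0):
    slope = slope_at_zero(b1, b3, z)
    x_min = turning_point(b1, b3, b4, z)
    print(f"Z = {z:+.0f}: slope at X = 0 is {slope:+.1f}, minimum at X = {x_min:+.1f}")
```

With these values, a negative slope at X = 0 places the minimum to the right of zero, a zero slope centers it at zero, and a positive slope places it to the left, matching the verbal description of the top row of Figure 6.1. Reversing the sign of b4 turns each minimum into a maximum and reverses the direction of the shift.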
An alternative approach to interpreting Equation 6.11 frames the relationship between X, Z, and Y as a three-dimensional surface, of which the functions in Figure 6.1 are cross sections at selected levels of Z (Edwards, 2002; Edwards & Parry, 1993). This approach is useful when the joint effects of X and Z on Y are framed in terms of fit, similarity, or agreement.

Myth 6: Product Terms Can Be Treated as Causal Variables

In studies of moderation, the product term XZ is sometimes treated as a causal variable. This practice is common in studies that examine whether the moderating effect captured by XZ is mediated by some other variable, or what has been termed mediated moderation (Baron & Kenny, 1986). Mediated moderation is frequently examined using a version of the causal steps procedure that assesses the change in the coefficient on XZ that results when a mediator variable is added to Equation 6.1 (Baron & Kenny, 1986; Muller, Judd, & Yzerbyt, 2005). The first step of the procedure is to test b3 in Equation 6.1 to verify that a moderating effect exists. In the second step, the following equation is estimated:
M = b0 + b1X + b2Z + b3XZ + e
(6.14)
where M is the mediating variable through which the moderating effect is hypothesized to be transmitted. The second step requires that b3 is significant, meaning that XZ is related to M. The third and fourth steps involve the following equation:
Y = b0 + b1X + b2Z + b3XZ + b4M + e.
(6.15)
Equation 6.15 adds M to Equation 6.1 as an additional predictor of Y. The third step is to test b4 to verify that M is related to Y, such that the mediator is related to the outcome. Finally, in the fourth step, b3 is examined to determine whether it is smaller in Equation 6.15 than in Equation 6.1. If b3 in Equation 6.15 is reduced to the point that it is not significant, then it is concluded that M fully mediates the effect of XZ on Y. If b3 is smaller but remains significant, then M is viewed as a partial mediator of the effect of XZ on Y.

The treatment of XZ as a causal variable is misguided, because XZ has no causal potency in its own right. Rather, XZ is merely a mathematical device that captures the extent to which the effect of X on Y varies across levels of Z (or, equivalently, whether the effect of Z on Y varies across levels of X). It is X and Z, the variables that constitute XZ, that are capable of influencing Y in a causal sense. The product XZ does not represent some unique entity that exists separately from X and Z, and therefore it cannot exert an effect on Y beyond that generated by X and Z. The role of XZ in causal models relating X, M, and Z to Y can be examined by expressing causal paths in terms of simple slopes that show how the paths vary across levels of Z (Edwards & Lambert, 2007). In this manner, the variables with causal potency are properly depicted, and the manner in which product terms such as XZ alter the relationships among these variables is clarified.
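One way to see what it means to express paths as simple slopes is to compute, from Equations 6.14 and 6.15, the path from X to M and the path from X to Y at chosen levels of Z. The sketch below is a simplified illustration in that spirit and should not be read as the full procedure of Edwards and Lambert (2007). The data are simulated with hypothetical population values, the coefficients of Equation 6.14 are relabeled a0 through a3 to keep them distinct from those of Equation 6.15, and the helper names are assumptions.

```python
import numpy as np


def ols(predictors, outcome):
    """OLS coefficients with an intercept prepended to the predictor columns."""
    design = np.column_stack([np.ones(len(outcome))] + list(predictors))
    coefs, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coefs


rng = np.random.default_rng(2)
n = 2000
x = rng.normal(size=n)
z = rng.normal(size=n)
# Hypothetical population model: Z moderates the path from X to M.
m = 0.5 * x + 0.2 * z + 0.3 * x * z + rng.normal(size=n)
y = 0.1 * x + 0.1 * z + 0.6 * m + rng.normal(size=n)

a0, a1, a2, a3 = ols([x, z, x * z], m)         # Equation 6.14 (relabeled a0..a3)
b0, b1, b2, b3, b4 = ols([x, z, x * z, m], y)  # Equation 6.15


def indirect_effect(z_value):
    """Effect of X on Y through M at a given Z: (a1 + a3*Z) * b4."""
    return (a1 + a3 * z_value) * b4


def direct_effect(z_value):
    """Effect of X on Y not transmitted through M at a given Z: b1 + b3*Z."""
    return b1 + b3 * z_value


for z_value in (-1.0, 0.0, 1.0):
    print(z_value, round(indirect_effect(z_value), 2), round(direct_effect(z_value), 2))
```

Framed this way, X and M remain the variables that carry the effects, and XZ enters only through the way it changes the slopes (a1 + a3Z) and (b1 + b3Z) across levels of Z.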
Myth 7: Testing Moderation in Structural Equation Modeling Is Impractical

In organizational and social research, studies of moderation rely almost exclusively on ANOVA or regression analysis. These methods rest on the assumption that the variables involved in the analysis are measured without error (Berry, 1993; Pedhazur, 1997). When this assumption is violated, results are affected in various ways. In particular, measurement error in the dependent variable biases R² estimates downward. Measurement error in the independent variables can bias coefficient estimates upward or downward, depending on the pattern and degree of measurement error. The biasing effects of measurement error can alter the substantive conclusions drawn from analyses of moderation and further aggravate the problems of reduced statistical power discussed earlier.

The effects of measurement error can be addressed using structural equation modeling with latent variables (Bollen, 1989; Kline, 2004). Although structural equation modeling usually involves linear relationships among latent variables, methods have been developed to estimate moderating effects (Cortina, Chen, & Dunlap, 2001; Jaccard & Wan, 1995; Jöreskog & Yang, 1996; Kenny & Judd, 1984; Klein & Moosbrugger, 2000; Li, Harmer, Duncan, Duncan, Acock, & Boles, 1998; Ping, 1996; Schumacker & Marcoulides, 1998). These methods have been under development for over two decades, yet they are rarely applied in studies of moderation in the organizational and social sciences. When the methods are acknowledged, they are usually set aside as too complex or impractical due to limitations of available estimation procedures. For instance, structural equation modeling typically relies on maximum likelihood estimation, which incorporates the assumption that observed variables follow a multivariate normal distribution (Bollen, 1989).
This assumption is violated when product terms are analyzed, because a product term is not normally distributed even when the variables that constitute the product are normally distributed. Products of observed variables are used in most methods for analyzing moderation in structural equation modeling (Cortina et al., 2001; Kenny & Judd, 1984; Li et al., 1998; Marsh, Wen, & Hau, 2004), which means that these methods violate a key assumption of maximum likelihood estimation. For these and other reasons, the use of structural equation modeling in tests of moderation is all but nonexistent.
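The claim that a product of normal variables is not itself normal can be checked with a short simulation. The sketch below assumes the simplest case, two independent standard normal variables, for which the product has a theoretical excess kurtosis of 6 (a normal variable has 0). The sample size and seed are arbitrary choices for illustration.

```python
import numpy as np

def skew_and_excess_kurtosis(v):
    """Sample skewness and excess kurtosis (both are 0 for a normal variable)."""
    d = v - v.mean()
    s2 = np.mean(d**2)
    skew = np.mean(d**3) / s2**1.5
    excess_kurt = np.mean(d**4) / s2**2 - 3.0
    return skew, excess_kurt

rng = np.random.default_rng(1)
n = 200_000
x = rng.normal(size=n)   # normally distributed
z = rng.normal(size=n)   # normally distributed, independent of x
xz = x * z               # product term

skew_x, kurt_x = skew_and_excess_kurtosis(x)
skew_xz, kurt_xz = skew_and_excess_kurtosis(xz)
print(f"X : skew = {skew_x:.2f}, excess kurtosis = {kurt_x:.2f}")
print(f"XZ: skew = {skew_xz:.2f}, excess kurtosis = {kurt_xz:.2f}")
```

The product is symmetric here but far more peaked and heavy-tailed than either component. When X and Z are correlated, the product is typically skewed as well.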
When compared to ANOVA or regression analysis, methods for analyzing moderation using structural equation modeling are undeniably more complex. However, these methods have become increasingly accessible in recent years, as a result of tutorials that demystify the methods (Cortina et al., 2001) and the availability of syntax for applying the methods in published work (Cortina et al., 2001; Li et al., 1998; Schumacker & Marcoulides, 1998). In addition, analytical developments have addressed estimation issues associated with moderated structural equation models. For instance, violations of multivariate normality can be addressed using methods that adjust standard errors and chi-square estimates (Chou, Bentler, & Satorra, 1991) or by applying the bootstrap, which can be used to generate sampling distributions for parameter estimates without assuming normality (Bollen & Stine, 1992; Efron & Tibshirani, 1993; Mooney & Duval, 1993). These methods have proven effective in structural equation modeling (Curran, West, & Finch, 1996; Nevitt & Hancock, 2001) and show promise in models that include moderation (Yang-Wallentin & Jöreskog, 2001). Hence, methodological advancements in moderated structural equation modeling are ongoing, and researchers interested in testing moderation should incorporate these advancements into their work. The benefits of applying these methods will be worth the effort, given the detrimental effects of measurement error on tests of moderation.

Myths Beyond Moderation

The seven myths described here focused on studies of moderation. However, these myths apply to other analytical procedures that involve transformations of independent variables. For instance, studies that examine curvilinear relationships between X and Y typically use powers of X, such as X², X³, and so forth, to represent curvilinearity (Cohen, 1978).
Many of the myths associated with tests of moderation using the product term XZ apply to tests of curvilinearity using X², X³, and higher powers of X, given that raising X to a power is equivalent to using product terms created by multiplying X by itself. Moreover, just as tests of moderation can benefit from incorporating curvilinearity, tests of curvilinearity can be sharpened by incorporating moderation (Ganzach, 1997; Lubinski & Humphreys, 1990). In addition, studies that examine fit, similarity, and agreement often
use the absolute or squared difference between two variables as a predictor, sometimes along with the two variables that constitute the difference. Such studies have fallen victim to the myths discussed here, and the problems generated by the myths are compounded by those associated with difference scores (Edwards, 1994; Johns, 1981), which have spawned myths of their own (Edwards, 2001). Thus, the myths discussed here extend beyond studies of moderation, and researchers would be well advised to guard against these myths in studies of curvilinearity, fit, similarity, agreement, and other phenomena that involve transformations of independent variables.

Conclusion

The seven myths summarized here are prevalent in organizational and social research that examines moderation. Although some of the myths might have a kernel of truth, each myth has the capacity to lead researchers astray, prompting them to invest unnecessary effort and draw conclusions that are misleading or incorrect. We hope that raising awareness of these myths will help researchers avoid them in their own work and point them out when they surface in the work of students and colleagues. Doing so will help increase the quality of research that involves moderation, leading to better answers to important theoretical and substantive questions that we collectively pursue.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Allison, P. D. (1977). Testing for interaction in multiple regression. American Journal of Sociology, 83, 144–153.
Arnold, H. J. (1982). Moderator variables: A clarification of conceptual, analytic, and psychometric issues. Organizational Behavior and Human Performance, 29, 143–174.
Arnold, H. J., & Evans, M. G. (1979). Testing multiplicative models does not require ratio scales. Organizational Behavior and Human Performance, 24, 41–59.
Baron, R. M., & Kenny, D. A. (1986). The moderator-mediator variable distinction in social psychological research: Conceptual, strategic, and statistical considerations. Journal of Personality and Social Psychology, 51, 1173–1182.
Belsley, D. A. (1991). Conditioning diagnostics: Collinearity and weak data in regression. New York: Wiley.
Berry, W. D. (1993). Understanding regression assumptions. Newbury Park, CA: Sage.
Bohrnstedt, G. W., & Goldberger, A. S. (1969). On the exact covariance of products of random variables. Journal of the American Statistical Association, 64, 1439–1442.
Bohrnstedt, G. W., & Marwell, G. (1978). The reliability of products of two random variables. In K. F. Schuessler (Ed.), Sociological methodology (pp. 254–273). San Francisco: Jossey-Bass.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A., & Stine, R. A. (1992). Bootstrapping goodness-of-fit measures in structural equation models. Sociological Methods & Research, 21, 205–229.
Brockner, J., & Wiesenfeld, B. M. (1996). An integrative framework for explaining reactions to decisions: Interactive effects of outcomes and procedures. Psychological Bulletin, 120, 189–208.
Busemeyer, J. R., & Jones, L. E. (1983). Analysis of multiplicative combination rules when the causal variables are measured with error. Psychological Bulletin, 93, 549–562.
Chou, C. P., Bentler, P. M., & Satorra, A. (1991). Scaled test statistics and robust standard errors for non-normal data in covariance structure analysis: A Monte Carlo study. British Journal of Mathematical and Statistical Psychology, 44, 347–357.
Cohen, J. (1968). Multiple regression as a general data-analytic system. Psychological Bulletin, 70, 426–443.
Cohen, J. (1978). Partialed products are interactions: Partialed powers are curve components. Psychological Bulletin, 85, 858–866.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Cortina, J. M. (1993). Interaction, nonlinearity, and multicollinearity: Implications for multiple regression. Journal of Management, 19, 915–922.
Cortina, J. M., Chen, G., & Dunlap, W. P. (2001). Testing interaction effects in LISREL: Examination and illustration of available procedures. Organizational Research Methods, 4, 324–360.
Cronbach, L. J. (1987). Statistical tests for moderator variables: Flaws in analyses recently proposed. Psychological Bulletin, 102, 414–417.
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1, 16–29.
Dunlap, W. P., & Kemery, E. R. (1987). Failure to detect moderating effects: Is multicollinearity the problem? Psychological Bulletin, 102, 418–420.
Dunlap, W. P., & Kemery, E. R. (1988). Effects of predictor intercorrelations and reliabilities on moderated multiple regression. Organizational Behavior and Human Decision Processes, 41, 248–258.
Eckersley, R. (2000). The mixed blessings of material progress: Diminishing returns in the pursuit of happiness. Journal of Happiness Studies, 1, 267–292.
Edwards, J. R. (1994). The study of congruence in organizational behavior research: Critique and a proposed alternative. Organizational Behavior and Human Decision Processes, 58, 51–100 (erratum, 58, 323–325).
Edwards, J. R. (2001). Ten difference score myths. Organizational Research Methods, 4, 264–286.
Edwards, J. R. (2002). Alternatives to difference scores: Polynomial regression analysis and response surface methodology. In F. Drasgow & N. W. Schmitt (Eds.), Advances in measurement and data analysis (pp. 350–400). San Francisco: Jossey-Bass.
Edwards, J. R., & Lambert, L. S. (2007). Methods for integrating moderation and mediation: A general analytical framework using moderated path analysis. Psychological Methods, 12, 1–22.
Edwards, J. R., & Parry, M. E. (1993). On the use of polynomial regression equations as an alternative to difference scores in organizational research. Academy of Management Journal, 36, 1577–1613.
Efron, B., & Tibshirani, R. (1993). An introduction to the bootstrap. New York: Chapman & Hall.
Ganzach, Y. (1997). Misleading interaction and curvilinear terms. Psychological Methods, 2, 235–247.
Ganzach, Y. (1998). Nonlinearity, multicollinearity, and the probability of Type II error in detecting interaction. Journal of Management, 24, 615–622.
Jaccard, J., & Wan, C. K. (1995). Measurement error in the analysis of interaction effects between continuous predictors using multiple regression: Multiple indicator and structural equation approaches. Psychological Bulletin, 117, 348–357.
Jaccard, J., Wan, C. K., & Turrisi, R. (1990). The detection and interpretation of interaction effects between continuous variables in multiple regression. Multivariate Behavioral Research, 25, 467–478.
Johns, G. (1981). Difference score measures of organizational behavior variables: A critique. Organizational Behavior and Human Performance, 27, 443–463.
Jöreskog, K. G., & Yang, F. (1996). Nonlinear structural equation models: The Kenny-Judd model with interaction effects. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling (pp. 57–88). Hillsdale, NJ: Erlbaum.
Karasek, R. A., Jr. (1979). Job demands, job decision latitude, and mental strain: Implications for job redesign. Administrative Science Quarterly, 24, 285–308.
Karasek, R. A., & Theorell, T. (1990). Healthy work: Stress, productivity, and the reconstruction of working life. New York: Basic Books.
Kenny, D. A., & Judd, C. M. (1984). Estimating the nonlinear and interactive effects of latent variables. Psychological Bulletin, 96, 201–210.
Klein, A., & Moosbrugger, H. (2000). Maximum likelihood estimation of latent interaction effects with the LMS method. Psychometrika, 65, 457–474.
Kline, R. B. (2004). Principles and practice of structural equation modeling (2nd ed.). New York: Guilford Press.
Kromrey, J. D., & Foster-Johnson, L. (1998). Mean centering in moderated multiple regression: Much ado about nothing. Educational and Psychological Measurement, 58, 42–68.
Kromrey, J. D., & Foster-Johnson, L. (1999). Statistically differentiating between interaction and nonlinearity in multiple regression analysis: A Monte Carlo investigation of a recommended strategy. Educational and Psychological Measurement, 59, 392–413.
Lance, C. E., Butts, M. M., & Michels, L. C. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202–220.
Li, F., Harmer, P., Duncan, T. E., Duncan, S. C., Acock, A., & Boles, S. (1998). Approaches to testing interaction effects using structural equation modeling methodology. Multivariate Behavioral Research, 33, 1–39.
Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ: Prentice Hall.
Lubinski, D., & Humphreys, L. G. (1990). Assessing spurious “moderator effects”: Illustrated substantively with the hypothesized (“synergistic”) relation between spatial and mathematical ability. Psychological Bulletin, 107, 385–393.
MacCallum, R. C., & Mar, C. M. (1995). Distinguishing between moderator and quadratic effects in multiple regression. Psychological Bulletin, 118, 405–421.
Mansfield, E. R., & Helms, B. P. (1982). Detecting multicollinearity. The American Statistician, 36, 158–160.
Marsh, H. W., Wen, Z., & Hau, K.-T. (2004). Structural equation models of latent interactions: Evaluation of alternative estimation strategies and indicator construction. Psychological Methods, 9, 275–300.
McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114, 376–390.
Mooney, C. Z., & Duval, R. D. (1993). Bootstrapping: A nonparametric approach to statistical inference. Newbury Park, CA: Sage.
Morris, J. H., Sherman, J. D., & Mansfield, E. R. (1986). Failures to detect moderating effects with ordinary least squares-moderated multiple regression: Some reasons and a remedy. Psychological Bulletin, 99, 282–288.
Muller, D., Judd, C. M., & Yzerbyt, V. Y. (2005). When moderation is mediated and mediation is moderated. Journal of Personality and Social Psychology, 89, 852–863.
Nevitt, J., & Hancock, G. R. (2001). Performance of bootstrapping approaches to model test statistics and parameter standard error estimation in structural equation modeling. Structural Equation Modeling, 8, 353–377.
Pedhazur, E. J. (1997). Multiple regression in behavioral research (3rd ed.). New York: Holt.
Ping, R. A., Jr. (1996). Latent variable interaction and quadratic effect estimation: A two-step technique using structural equation analysis. Psychological Bulletin, 119, 166–175.
Schumacker, R. E., & Marcoulides, G. A. (Eds.). (1998). Interaction and nonlinear effects in structural equation modeling. Hillsdale, NJ: Erlbaum.
Shepperd, J. A. (1991). Cautions in assessing spurious “moderator effects.” Psychological Bulletin, 110, 315–317.
Smith, K. W., & Sasaki, M. S. (1979). Decreasing multicollinearity: A method for models with multiplicative functions. Sociological Methods and Research, 8, 296–313.
Yang-Wallentin, F., & Jöreskog, K. G. (2001). Robust standard errors and chi-squares for interaction models. In G. A. Marcoulides & R. E. Schumacker (Eds.), New developments and techniques in structural equation modeling (pp. 159–171). Mahwah, NJ: Erlbaum.
Zedeck, S. (1971). Problems with the use of “moderator” variables. Psychological Bulletin, 76, 295–310.
7
Alternative Model Specifications in Structural Equation Modeling: Facts, Fictions, and Truth
Robert J. Vandenberg and Darrin M. Grelle
The goal of the current chapter is to examine alternative model specification (AMS) practices as applied in covariance structure modeling (CSM). CSM is our general term referring to tests of confirmatory factor analysis (CFA) and/or structural equation models (SEM). Namely, the concern of this chapter is with the practices per se underlying AMS and, in particular, with the fact that AMS is seldom undertaken despite long-standing and overwhelming advice to do so. At the risk of oversimplifying the issue, the concern is best illustrated by examining the extreme views on the issue. At one extreme, AMS advocates claim that it should be practiced doctrine and a regular feature in each and every CSM application (Greenwald, Leippe, Pratkanis, & Baumgardner, 1986; MacCallum & Austin, 2000; Reichardt, 2002). From this perspective, the presumption is that very little AMS is undertaken when it should be, and as a consequence, the validity of the results from the CSM is questionable. This consequence is due to researchers specifying only one model when alternative models using the same variables exist that are equivalent to or better than the focal model (Boomsma, 2000; MacCallum, Wegener, Uchino, & Fabrigar, 1993; Williams, Bozdogan, & Aiman-Smith, 1996). At the other extreme of the AMS issue are the researchers who thoroughly undertake their theoretical “homework” and, as such, can anchor the paths between latent variables and/or items to underlying constructs solidly to conceptual arguments and frameworks. From their perspective, there is simply
no conceptual rationale to present an alternative specification to the focal model. The reality is that a kernel of truth exists at both extremes of the AMS topic. However, as with many topics in this book, a fog has settled between those extremes. What caused the fog or when it occurred is moot. The consequences of it, though, are not. One consequence is journal editors and reviewers using “not specifying an alternative model” as a primary excuse to reject a manuscript for publication. The other is authors failing to consider a competing AMS when it is obvious that they should have. What tangible evidence do we offer to support these consequences? None, other than the first author’s 25 years of experience (a) trying to publish one-model manuscripts; (b) receiving said editorial comments; (c) dishing out these comments himself; (d) reading nearly 500 manuscripts as a reviewer; and (e) critiquing numerous student papers, dissertations, and theses. How does the first author know from his experiences that the comments from the editors and reviewers are wrong or inaccurate when using “having not conducted AMS” as a primary reason for not accepting a manuscript? He doesn’t, and that is not the point. It’s the lack of elaboration underlying the editor’s and/or reviewers’ comments that is most revealing. In nearly 100% of cases, no attempt is made to state what form of AMS should be undertaken (e.g., equivalent, nested, or nonnested). Most importantly, if the reviewer or editor knew of an alternative model specification and the conceptual reasons supporting it, this should be made known to the authors. Doing so would set up an interesting competing model test pitting one conceptualization against another.
However, because there has typically been no such elaboration, it makes us highly suspicious that the excuse is frequently invoked because it is convenient, and that in reality the editor and/or reviewers do not fully understand themselves what is truly meant by AMS. Similarly, are we implying that the original focal model presented by a researcher is somehow inaccurate because s/he did not present an AMS? Again, the answer is no, and again, that’s not the issue. The focal model may very well be accurate, but as above, it is the researcher’s failure to elaborate that is most telling. Little to nothing is presented in the manuscript to convince those evaluating it that AMS is not required and why. Further, as will be presented later in the chapter, our review of manuscripts in which authors claim an alternative model was specified indicates that there is a methodological
reason driving the specification; that is, the alternative model is typically one in which the fit of the theoretical factor structure is pitted against an alternative structure (e.g., single factor) to strengthen claims of measurement validity. Another common methodologically driven use is to test mediation hypotheses (e.g., models with vs. without the mediating paths). Although these uses are important, they are not the form and type of AMS envisioned by advocates of this topic, which is basically one of pitting competing conceptualizations against one another. Our whole point is that there seems to be a poor understanding of AMS among researchers in the organizational sciences as well as those evaluating the output of the researchers’ efforts.

Is this an issue with which those in the organizational sciences should concern themselves? Our answer obviously is yes. Further, a recent review of CSM practices within the strategy research literature supports this assertion (Henley, Shook, & Peterson, 2006). At the core of our case is that understanding and engaging in AMS practices permits application of Popper’s (1959, 1972) disconfirmation strategy to CSM, a strategy that is sorely lacking in CSM applications at this point. To build our case, the next section is a review of this strategy. This is merely backdrop to justify the importance of AMS. Following this section, we review the three primary forms of AMS: (a) equivalent models, (b) nested models, and (c) nonnested models. Given that thorough technical reviews of these forms are available (equivalent models: MacCallum et al., 1993; nested models: Anderson & Gerbing, 1988; nonnested models: Oczkowski, 2002), our review is descriptive, not technical. This section is followed, in turn, by our review of articles using CSM in the major organizational sciences journals with a focus on those claiming to use AMS and whether the AMS fits within one or more of the three forms presented in the previous section.
The final section of the chapter presents our recommendations.

The Core of the Issue

In 2006 the first author was preparing for a visit to the University of Melbourne in Australia where he was asked to address a number of topics, one of which was methodological concerns in the organizational sciences. Not wanting to state only his opinion, he solicited opinions from a number of quantitative and qualitative
methods experts in the field. Within the context of CSM, one comment was particularly germane to the current chapter and is reproduced by permission (J. R. Edwards, personal communication, May 2006):

It is the confirmation bias, whereby we reinforce research that develops common-sense hypotheses, seeks confirming evidence (and obtains it with astounding regularity), and claims that progress has been made. We rarely attempt or encourage research that truly puts a theory at risk, as Popper (1959) encouraged us to do. As a result, our field has become crowded with increasingly minor variations on themes, and we periodically move to other topics not because we have reached definitive answers or rejected theories, but because we become bored or distracted by something more fashionable. In short, the “I’m OK, you’re OK” approach to research we use has produced rows of stifled crops that are too rarely thinned.
This comment embodies the core of the issue underlying AMS as envisioned by us. To understand this, we dissected the message into three parts corresponding to (a) the first sentence regarding confirmation bias; (b) the sentences regarding putting theories at risk; and (c) the comment about stifled crops rarely being thinned.

At the heart of the confirmation bias within CSM is making an inappropriate inference regarding one’s target model based upon the goodness-of-fit chi-square test and other model fit indices (Vandenberg, 2006). Strong technical treatments of the issue have been provided by others (Boomsma, 2000; Hershberger, 2006; McCoach, Black, & O’Connell, 2007; MacCallum et al., 1993; Tomarken & Waller, 2003; Williams et al., 1996). In brief, confirmation bias is the belief among researchers that a favorable chi-square goodness-of-fit test (supporting the null hypothesis that the model-implied variance-covariance matrix equals the observed variance-covariance matrix) and strong descriptive fit indices (e.g., Tucker-Lewis index) permit them to accept their target model. In reality, however, not rejecting the null hypothesis does not mean accepting the target model; it only means we fail to reject it (Vandenberg, 2006). This is succinctly summarized by McCoach et al. (2007, p. 464), who state the following:

In SEM, it is impossible to confirm a model. Although we may fail to confirm a model, we can never actually establish its veracity (Cliff, 1983). Statistical tests and descriptive fit indices can never prove that a model is correct (Tomarken & Waller, 2003). Rather, they suggest that the discrepancy between the observed variance covariance and the model-implied variance covariance matrix is relatively small. Therefore, one can reasonably conclude that the model “provides an acceptable description of the data examined” (Biddle & Marlin, 1987, p. 9), in the sense that the covariance matrix implied by the specified model sufficiently reproduces the actual covariance matrix. Moreover, “when the data do not disconfirm a model, there are many other models that are not disconfirmed either” (Cliff, 1983, p. 117), given the number of untested models that are statistically equivalent to the specified model. Therefore, in the best case scenario, when we achieve good fit, we can conclude our model “is one plausible representation of the underlying structure from a larger pool of plausible models” (Tomarken & Waller, 2003, p. 580).
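To make concrete what the chi-square test compares, the following sketch computes the usual maximum likelihood discrepancy, F_ML = ln|Σ| + tr(SΣ⁻¹) − ln|S| − p, between an observed covariance matrix S and a model-implied matrix Σ, and scales it into a test statistic. The covariance values and sample size are hypothetical, and the (N − 1) multiplier is one common convention; some programs use N.

```python
import numpy as np

def f_ml(S, Sigma):
    """ML discrepancy between sample covariance S and model-implied Sigma."""
    p = S.shape[0]
    _, logdet_sigma = np.linalg.slogdet(Sigma)
    _, logdet_s = np.linalg.slogdet(S)
    return logdet_sigma + np.trace(S @ np.linalg.inv(Sigma)) - logdet_s - p

def chi_square_statistic(S, Sigma, n_obs):
    """Test statistic T = (N - 1) * F_ML; some programs multiply by N instead."""
    return (n_obs - 1) * f_ml(S, Sigma)

# Hypothetical 3 x 3 sample covariance matrix, for illustration only.
S = np.array([[1.00, 0.50, 0.30],
              [0.50, 1.00, 0.40],
              [0.30, 0.40, 1.00]])

# A deliberately poor model: all three variables uncorrelated.
Sigma_indep = np.diag(np.diag(S))

T_perfect = chi_square_statistic(S, S, n_obs=200)            # Sigma reproduces S exactly
T_indep = chi_square_statistic(S, Sigma_indep, n_obs=200)    # Sigma ignores all covariances
print(f"T when Sigma equals S: {T_perfect:.4f}")
print(f"T for the independence model: {T_indep:.2f}")
```

A small statistic says only that Σ is close to S. Any other model that implies the same Σ would produce exactly the same value, which is why a favorable test cannot single out the target model.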
Key for current purposes is the “larger pool of plausible models.” We will elaborate upon the following in greater detail in subsequent sections, but “larger pool” is key for the following three reasons. First, AMS is a strategy designed to systematically approach the pool of plausible models in an a priori manner. Second, despite perhaps knowing and understanding AMS strategies, our review of the research literature presented in a later section indicates that organizational science researchers still largely ignore AMS, at least in the manner advocated in this chapter. There is still mostly a tone of “confirmation bias” used in stating inferences from the CSM applications. This brings us to the third reason “larger pool of plausible models” is key. Specifically, we firmly believe that missing historically in CSM applications (at least in the organizational sciences) is disconfirmation. Disconfirmation is based on the premise that a theory can never be proven, but only disproven, which led Popper (1966) to state, “No particular theory may ever be regarded as absolutely certain” (p. 360). Based on the latter, he advocated constantly putting the focal theory at risk to disprove it by systematically engaging in a set of studies where new variables are introduced, alternative parameters are specified, or anything that conceptually represents an alternative explanation to the focal theory is examined. If the focal theory consistently emerges from these studies as the strongest or most successful in terms of explaining the focal processes, then greater confidence in its validity emerges. What does this have to do with CSM? CSM applications, particularly SEM, embody researchers’ theories representing how a set of variables work together to explain some process (e.g., the process of turnover, performance, adjusting to work).
Yet, as noted previously, the most common practice is evaluating the focal model in isolation against a set of fit benchmarks; however, according to some, “relatively little information of
scientific value is gained by evaluating models against arbitrary benchmarks” (Preacher, 2006, pp. 229–230). Rather, the greatest scientific value emerges when at least two models are specified representing competing conceptualizations, and one emerges the strongest especially over several replications (Lakatos, 1970; McCoach et al., 2007; Meehl, 1990; Preacher, 2006). Our point simply stated is that disconfirmation has not been, and continues not to be, an integral aspect of the thinking underlying CSM applications even though most researchers know there is a “larger pool of plausible models” underlying their data. Given this and the tendency to interpret model fit as confirming the target model, we can fully understand the sentiment underlying statements that, as researchers, we have “produced rows of stifled crops that are too rarely thinned” within the organizational sciences. AMS from our viewpoint, therefore, is a viable means to address this shortcoming and, particularly, to put our theories at risk. We turn now to a brief overview of the three AMS strategies.

AMS Strategies

The literature on alternative model specifications (AMS) and underlying best practices dates back over two decades and has been written about extensively (e.g., Cliff, 1983; Stelzl, 1986; Lee & Hershberger, 1990). Reviews have been completed in abnormal psychology (Tomarken & Waller, 2003), personality and social psychology (Breckler, 1990), and recently in the application of structural equation modeling (SEM) to addressing strategic management research questions (Henley et al., 2006). Given these reviews, our presentation is brief and descriptive rather than technical. There are three basic AMS strategies: (a) equivalent models, (b) nested models, and (c) nonnested models.

Equivalent Models

Models are equivalent if they have identical fit to the data (Breckler, 1990; Raykov & Marcoulides, 2001).
Specifically, for any sample covariance matrix, S, two models (A and B) are considered equivalent when the reproduced covariance matrices generated by both
models are equal (Σ_A = Σ_B). Because fit indices are a function of the implied covariance matrices, the two models will have identical fit. Identical fit is a necessary result of model equivalence, but two models can have identical fit by chance and not be equivalent models. Therefore, identical fit alone should not be considered proof of model equivalence (MacCallum et al., 1993). It is important to note here that, though the fit indices for the model as a whole will be identical, individual parameter estimates may differ (Breckler, 1990). This fact will become useful in our discussion below of choosing the optimal model.

MacCallum et al. (1993) calculated that for a model with a saturated block of six latent variables, there were 33,925 mathematically equivalent models. Mathematically is stressed to convey that a large proportion of the 33,925 models will not be theoretically plausible. However, given such a large number of equivalent models, there is a very high probability that out of the total set, a subset exists that is just as conceptually plausible as the target model (Lee & Hershberger, 1990; Raykov & Penev, 1999; Stelzl, 1986). This is in large part the reason why Lee and Hershberger (1990) advocated 18 years ago dropping the term best-fitting model and instead using the term optimal model, recognizing that the selected model (assuming one emerges) has equal fit to a number of others. Hence, it cannot be best-fitting, just optimal based on other criteria.

Examples of equivalent path (i.e., regression) and latent variable models appear in Figures 7.1 and 7.2, respectively. If tested, each set of models would yield identical fit indices. The conceptual theme underlying the examples is primarily organizational behavior/applied psychology in nature. Model 1a reflects the standard premise that as one’s commitment declines, turnover intention will increase, and as such, the individual engages in job search activities (Vandenberg & Nelson, 1999).
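The definition of equivalence given above can be checked numerically for a small case. The sketch below is an illustration under assumptions, not a reconstruction of Figure 7.1: it fits two three-variable path models to a hypothetical covariance matrix, a chain (commitment to turnover intention to job search) and an assumed reordering in which turnover intention predicts both of the other variables. Paths are estimated by least squares from the covariance matrix, which for recursive models of this kind should coincide with the maximum likelihood solution.

```python
import numpy as np

def implied_cov(B, Psi):
    """Implied covariance of a recursive path model y = B y + e with Cov(e) = Psi."""
    inv = np.linalg.inv(np.eye(B.shape[0]) - B)
    return inv @ Psi @ inv.T

def fit_path_model(S, paths, exogenous):
    """Least-squares fit where every effect has exactly one cause.

    paths is a list of (cause, effect) index pairs; exogenous is the index
    of the single variable with no cause.
    """
    k = S.shape[0]
    B = np.zeros((k, k))
    Psi = np.zeros((k, k))
    Psi[exogenous, exogenous] = S[exogenous, exogenous]
    for cause, effect in paths:
        slope = S[cause, effect] / S[cause, cause]
        B[effect, cause] = slope
        Psi[effect, effect] = S[effect, effect] - slope**2 * S[cause, cause]
    return implied_cov(B, Psi)

# Hypothetical sample covariance matrix; variable order: OC, TI, JSB.
OC, TI, JSB = 0, 1, 2
S = np.array([[ 1.00, -0.45, -0.20],
              [-0.45,  1.00,  0.50],
              [-0.20,  0.50,  1.00]])

# Chain: OC -> TI -> JSB (the ordering described for Model 1a).
Sigma_chain = fit_path_model(S, [(OC, TI), (TI, JSB)], exogenous=OC)
# Fork: TI -> OC and TI -> JSB (an assumed reordering of the same variables).
Sigma_fork = fit_path_model(S, [(TI, OC), (TI, JSB)], exogenous=TI)

print(np.round(Sigma_chain, 3))
print(np.allclose(Sigma_chain, Sigma_fork))
```

Both specifications reproduce every element of S except the covariance between commitment and job search, and they miss it by the same amount. No fit index computed from the implied matrix can therefore distinguish them, even though they tell different causal stories.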
Model 1b implies that as individuals’ intentions to quit rise, they are likely to undertake a job search and experience a decreased attachment to the organization. The conceptual premises underlying Models 2a and 2b of Figure 7.2 are the same as the path models in Figure 7.1. Although our conceptual premises underlying the models in Figures 7.1 and 7.2 may be “stretches,” anyone familiar with the commitment and turnover research literatures will recognize that they are not so far-fetched as to be implausible. Indeed, we are very confident that with a thorough research literature search we could have
Figure 7.1 Equivalent path models. [Path diagrams not reproduced. Panels: Model 1a and Model 1b. Both relate Organizational Commitment, Turnover Intention, and Job Search Behaviors, with residual terms labeled RTI and RJSB.]
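The equivalence of the two path models in Figure 7.1 can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes Model 1a to be the chain commitment → turnover intention → job search and Model 1b to have turnover intention predicting both commitment and job search, as the text describes them; the path values and variances are invented; and `implied_cov` is a hypothetical helper, not a routine from any SEM package. It builds the implied covariance matrix Σ = (I − B)⁻¹ Ψ (I − B)⁻ᵀ for each model and shows that ΣA = ΣB even though the path linking commitment and turnover intention takes a different value in each model.

```python
# Illustrative sketch only: invented parameter values, hypothetical helper names.
# Variable order throughout: [OC, TI, JSB]
# (organizational commitment, turnover intention, job search behaviors).

def matmul(x, y):
    """Plain-Python matrix product."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(m):
    return [list(row) for row in zip(*m)]

def diag(values):
    n = len(values)
    return [[values[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def implied_cov(B, Psi):
    """Sigma = (I - B)^-1 Psi (I - B)^-T for a recursive path model.

    B[i][j] is the path from variable j to variable i. With no feedback
    loops B is nilpotent, so (I - B)^-1 = I + B + B^2 + ... is a finite sum.
    """
    n = len(B)
    total = [[float(i == j) for j in range(n)] for i in range(n)]
    power = [row[:] for row in total]
    for _ in range(n - 1):
        power = matmul(power, B)
        total = [[total[i][j] + power[i][j] for j in range(n)] for i in range(n)]
    return matmul(matmul(total, Psi), transpose(total))

# Model 1a (assumed chain): OC -> TI -> JSB, with made-up values.
a, b = -0.5, 0.6
B_a = [[0.0, 0.0, 0.0],
       [a,   0.0, 0.0],
       [0.0, b,   0.0]]
Psi_a = diag([2.0, 1.0, 0.8])   # var(OC), residual var(TI), residual var(JSB)
sigma_a = implied_cov(B_a, Psi_a)

# Model 1b (assumed fork): TI -> OC and TI -> JSB.
# Its parameters are chosen to reproduce Model 1a's implied moments.
w_ti = sigma_a[1][1]
c = sigma_a[0][1] / w_ti        # slope of OC on TI
d = sigma_a[2][1] / w_ti        # slope of JSB on TI
B_b = [[0.0, c,   0.0],
       [0.0, 0.0, 0.0],
       [0.0, d,   0.0]]
Psi_b = diag([sigma_a[0][0] - c * c * w_ti, w_ti, sigma_a[2][2] - d * d * w_ti])
sigma_b = implied_cov(B_b, Psi_b)

max_diff = max(abs(sigma_a[i][j] - sigma_b[i][j])
               for i in range(3) for j in range(3))
print("largest difference between implied covariance matrices:", max_diff)
print("OC -> TI path in Model 1a:", a, "| TI -> OC path in Model 1b:", c)
```

Because the two implied matrices coincide, any fit index computed from them must coincide as well, while the individual path values need not.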
supported each or some close variant of each. It is this very fact that is the primary point of the equivalent model issue. Namely, although the models are statistically equivalent, their theoretical implications differ markedly from one another. Further, because of those marked differences, it is readily apparent how identifying equivalent models and pitting them against each other is closer to the spirit underlying Popper’s (1966) notion of disconfirmation, and how strategically approaching the issue may reduce researchers’ confirmation biases. Specifically, if a researcher has an a priori target model developed, it is possible to identify all equivalent models in the planning stages of the study (Stelzl, 1986; Lee & Hershberger, 1990). That is, rational and empirical methodologies exist for generating equivalent alternative models. Researchers have the option of using programs like TETRAD (Scheines, Spirtes, Glymour, & Meek, 1994) to automatically generate a number of equivalent models in a relatively short period of time. Although this is possible, it is certainly not what we
Figure 7.2 Equivalent latent variable models. [Path diagrams not reproduced. Panels: Model 2a and Model 2b. Latent variables and indicators: Organizational Commitment (X1 to X3, errors e1 to e3), Turnover Intention (X4 to X6, errors e4 to e6), and Job Search Behaviors (X7 to X9, errors e7 to e9).]
are advocating here because it ignores the heart of the AMS issue: theory. We are certainly not advocating the need to specify all possible equivalent models. We are advocating the need to consider that there is quite possibly a subset of theoretically defensible equivalent models to the target model and that it would be best to consider these prior to data collection (see Tomarken & Waller, 2003, p. 583). We will return to this point after completing our review of the three AMS strategies because it is actually a need that exists within each.
After identifying all theoretically plausible equivalent models, the next step is to collect the data used for model evaluation and hypothesis testing. The following assumes that acceptable model fit
is observed (see Hu & Bentler, 1999). Recall, though, that the exact same fit will be observed for the target and subset of equivalent models. At issue, then, is determining the optimal model (Lee & Hershberger, 1990). As noted by Breckler (1990), equivalent models can differ in the values estimated for individual parameters. Thus, even though all equivalent models will have equal overall fit, they may be distinguishable on the basis of how many pathways are statistically significant. James, Mulaik, and Brett (1982) referred to this as Condition 9 tests (see p. 59) when presenting their 10 conditions for causality. Namely, if the functional relations and equations underlying the paths in one of the equivalent models are statistically not significant, then that model may be considered disconfirmed.
The following quote embodies the primary point of this section: “Without adequate consideration of alternative equivalent models, support for one model from a class of equivalent models is suspect at best and potentially groundless and misleading” (MacCallum et al., 1993, p. 196). A case in point is the Henley et al. (2006) review of 10 years of strategic management literature in which CSM was used. They reported that almost no journal articles mentioned the existence of equivalent models, but in reality a substantial number had theoretically plausible equivalent models that went unidentified and untested. Again, the key term is theoretically plausible, which, as seen shortly, is a key term within the other strategies as well.
Nested Models
Nested models are ones in which the parameters of one model are a subset of the other. Alternative models can be nested within the target model, or the target model can be nested within the alternative model. As with equivalent models, there are potentially a very large number of plausible nested models. Following are two examples.
Returning to Model 2a of Figure 7.2, assume that in addition to the conceptual foundation supporting it, there was also a segment of the research literature supporting a direct path from organizational commitment to job search behaviors. One may test both models in a critical study context. The question being addressed is, “does the model with fewer parameters (Model 2a of Figure 7.2), and thus, larger degrees of freedom, reproduce the sample covariance matrix just as well as the model with more parameters?” If it does (as reflected in a statistically nonsignificant chi-square difference test), then it is selected over the other model due to its parsimony (Kaplan, 2000). If the simpler model results in a worsening of fit, however, then one rejects it for the model with more parameters. Assuming the latter case, the presumption is that in all subsequent models in which these three latent variables appear, the path from organizational commitment to job search behaviors should be estimated. However, within the context of disconfirmation, we recommend continuous examination of these alternatives (path in vs. path out). If the path holds across these other tests, we gain greater and greater confidence in its validity (Popper, 1966).
The second example is a CSM application in which a measurement model is examined first, followed by a structural model imposing paths among the latent variables, which is the most common CSM approach in the organizational sciences. Although most researchers understand conceptually that their model of interest is a composite of measurement and structural components, they often overlook the fact that the final fit of the model may be decomposed into independent additive noncentrality chi-squares, one for the measurement model and the other for the structural model (McDonald & Ho, 2002; Steiger, Shapiro, & Browne, 1985; Tomarken & Waller, 2003). That is, the structural model is nested within the measurement model (Anderson & Gerbing, 1988). This realization is extremely important. To quote Tomarken and Waller (2003, p. 587): “As McDonald and Ho (2002) observed, it is often the case that the measurement component of latent variable models fits well and contributes a high proportion of the total degrees of freedom (i.e., the total number of restrictions imposed).
In such cases, the result is often a well-fitting composite model that masks a poorly fitting structural component.” To illustrate, assume we examined the measurement model of Model 2a in Figure 7.2 before imposing the structural paths in Model 2a. Model 2a would be nested, therefore, in its measurement model. Assuming a sample size of 150, two fictitious scenarios are illustrated in Table 7.1. We decided in both scenarios to be realistic relative to the bulk of published CSM articles and assume that the chi-square value of the composite model (the path model) was statistically significant, and therefore, the quality of fit is evaluated using other indices. The most frequent practice in published CSM articles is to first interpret the fit of the measurement model and make statements regarding the
Table 7.1 Fit of Composite, Measurement, and Structural Models for Two Scenarios

Scenario  Model        χ2     df  p      RMSEA
1         Composite    37.65  25  0.05   0.06
1         Measurement  36.42  24  0.05   0.06
1         Structural    1.23   1  0.27   0.04
2         Composite    40.65  25  0.025  0.06
2         Measurement  36.42  24  0.05   0.06
2         Structural    4.23   1  0.04   0.15
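The structural rows of Table 7.1 follow arithmetically from the composite and measurement rows. As a minimal sketch, the code below recomputes the chi-square difference, its p value, and the RMSEA for both scenarios with N = 150. It assumes the common point-estimate formula RMSEA = sqrt(max(χ² − df, 0) / (df(N − 1))); the chapter does not state which formula was used, and some programs divide by N instead. The helper names `chi2_sf`, `rmsea`, and `decompose` are illustrative only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df.

    Uses the closed-form series for the regularized upper incomplete gamma
    function, so no third-party library is needed.
    """
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # even df: exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # odd df: erfc(sqrt(x/2)) + exp(-x/2) * sum_{k=1}^{(df-1)/2} (x/2)^(k-1/2) / Gamma(k+1/2)
    total = sum(half ** (k - 0.5) / math.gamma(k + 0.5)
                for k in range(1, (df - 1) // 2 + 1))
    return math.erfc(math.sqrt(half)) + math.exp(-half) * total

def rmsea(chi2, df, n):
    """Point estimate of RMSEA (assumed N - 1 version of the formula)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

def decompose(composite, measurement, n):
    """Isolate the structural component as composite minus measurement.

    Each model is a (chi-square, df) pair; the measurement model is the
    less restricted model in which the composite model is nested.
    """
    chi2_diff = composite[0] - measurement[0]
    df_diff = composite[1] - measurement[1]
    return {
        "chi2_diff": chi2_diff,
        "df_diff": df_diff,
        "p": chi2_sf(chi2_diff, df_diff),
        "rmsea": rmsea(chi2_diff, df_diff, n),
    }

N = 150  # sample size assumed in the chapter's fictitious scenarios
scenario_1 = decompose((37.65, 25), (36.42, 24), N)
scenario_2 = decompose((40.65, 25), (36.42, 24), N)

for label, result in (("Scenario 1", scenario_1), ("Scenario 2", scenario_2)):
    print(label, {key: round(value, 3) for key, value in result.items()})
```

Rounded to two decimals, the recomputed structural values agree with the table: a chi-square difference of 1.23 (p = 0.27, RMSEA = 0.04) in Scenario 1 and 4.23 (p = 0.04, RMSEA = 0.15) in Scenario 2.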
validity of the measures. This is followed by an interpretation of the path model’s fit (what is called the composite model in Table 7.1) and statements regarding whether hypotheses were supported. At issue here is that the language adopted by authors makes it appear as if these are independent interpretations when in reality they are highly interdependent (McDonald & Ho, 2002; Steiger et al., 1985; Tomarken & Waller, 2003). If we were to apply this standard practice to Scenarios 1 and 2 in Table 7.1, the conclusions would be that the measurement models possessed strong fit as did the composite models. However, if we take into consideration the nested nature of the models, and decompose them by separating out the contribution of the path (structural) model, a different interpretation is warranted. The decomposition is achieved by assessing the chi-square difference between the composite and measurement models and calculating the root mean square error of approximation (RMSEA) for each model. As seen in Scenario 1 of Table 7.1, when the composite model (the path model of theoretical interest) possesses a chi-square goodness of fit that is significant but just at the p < .05 level, the chi-square difference representing the contribution of the structural model is statistically nonsignificant, and its RMSEA of .04 is below the .06 benchmark representing strong model fit. Thus, the conclusion here is that the restrictions constituting the paths are meaningful and interpretable. Scenario 2 of Table 7.1, though, supports a different conclusion regarding the contribution of the structural model when the chi-square of the composite model corresponds to a p < .025 level. Both the chi-square difference test and RMSEA indicate that the path restrictions resulted in a worsening
of fit relative to the measurement model. Therefore, the well-fitting nature of the composite model in Scenario 2 is due solely to the measurement model. Again, the major point here is recognizing that what has come to be accepted doctrine in terms of practices (i.e., interpreting the measurement and structural models independently) is not wholly appropriate. Most importantly, by modifying practices to examine the relative contribution of the structural model, researchers are truly putting their theory at risk given that the structural paths represent for the most part the conceptual foundation.
We would like to emphasize, though, that the procedure outlined above is not without controversy. The most recent iteration of the controversy is reflected in the Mulaik and Millsap (2000) and Hayduk and Glaser (2000) articles as well as other articles in that volume and issue (Vol. 7, Issue 1, Structural Equation Modeling). An earlier iteration of the controversy was represented through Anderson and Gerbing (1992) versus Fornell and Yi (1992).
As it was with equivalent models, the overarching point of this section is to encourage researchers to take into consideration nested AMS prior to data collection. Further, theory should be the sole driving force for stating the alternative model(s). Thus, once more, we are not advocating that all alternative models should be examined, only those models that are as theoretically defensible as the model of interest. This should not be interpreted as implying we are now “backpedaling” and support those who claim that their focal model is the most defensible and, thus, there are no AMS. This would indeed be a false interpretation, as we firmly believe that there are viable nested alternatives in the vast majority of CSM applications. As stated before, the primary goal of this chapter is to encourage the adoption of a disconfirming mind-set, one that is sorely lacking in CSM applications.
In that vein, AMS is the most viable means of doing so.
Nonnested Alternative Models
Nonnested alternative models are ones whose observed variance-covariance matrices, while overlapping, are not identical. Thus, the vector of model parameters in each model attempts to replicate a different sample matrix. In contrast, nested
models work within the same sample variance-covariance matrix. Using Model 2a of Figure 7.2 once more, in addition to its conceptual foundation, assume a segment of the research literature strongly supports an alternative model in which job search is irrelevant to the turnover process. It is not simply a matter of testing the statistical significance of the path from turnover intention to job search; rather, job search is truly conceptually irrelevant and should not be in the model at all. Thus, the goal is to test each model and compare them, but the second model will have fewer observed scores due to the removal of the measure underlying job search. As another example, assume there is an alternative conceptual framework claiming that organizational commitment is irrelevant but environmental search constraints (e.g., too many ties to the community) are relevant, and thus, one wishes to test two models: one with commitment in and another with commitment out but with environmental search constraints in its place. Yet another example would be including a third model with both commitment and search constraints. In all three examples, the sample variance-covariance matrices would be overlapping yet different. The major point here is that a theoretically justified alternative model to the theoretically justified target model may exist that does not have the same set or number of latent variables.
As seen shortly in the next section, nonnested models appear very infrequently in publications using CSM within the organizational sciences. One reason for this may be a lack of understanding of how one selects the “best” model. Unlike nested AMS, one may not evaluate the relative merits of nonnested models through a chi-square difference test or differences in other fit indices. The best-case scenario is one in which the target model meets or exceeds all of the benchmarks denoting excellent fit while the nonnested alternative fails to meet those benchmarks.
Thus, we fail to reject H0 (Σ = Στ) in the former model but reject H0 for the latter model. What happens, though, if the fit of the models is strong in all cases? In these cases, one needs to use the Akaike Information Criterion (AIC; Akaike, 1973) and Bayesian Information Criterion (BIC; Schwarz, 1978). The model with the smallest AIC and BIC values is considered the optimal choice among the alternatives. Most SEM programs provide the AIC and BIC. The AIC and BIC are indices computed from the model’s likelihood that reward goodness of fit and penalize lack of parsimony (Burnham & Anderson, 2004). Both indices
have different weaknesses under varying sample sizes and numbers of parameters (see Kuha, 2004, for a review). Burnham and Anderson (2004) have deduced that though the BIC tends to outperform the AIC in Monte Carlo research designs, it is because of the differences in the theoretical derivations of each index. Most of the Monte Carlo studies are designed so that a “true” model exists and is in the set of models being evaluated. This favors the BIC because it was developed according to the philosophy that a true model exists. Burnham and Anderson (2004) further note that “true” models may exist within the nonsocial (e.g., hard) sciences but are not characteristic of the social sciences. In contrast to the BIC, the AIC assumes that a “best-fitting approximation” is among the set of competing models. Burnham and Anderson (2004), therefore, recommend the use of the AIC in the social sciences despite the fact that it tends to select overfitted models and requires larger sample sizes. They also recommend using the sample-size-adjusted AIC. In summary, the nature of the questions being asked, the sample size available, and the complexity of the models under review should guide the researcher’s decision to use the AIC versus the BIC.
The AIC and BIC are commonly used in model selection, but other methods are being developed that have potential as model selection criteria. Raykov (2001) describes using a bootstrap method to create confidence intervals around the RMSEA for each model under scrutiny. If the confidence interval generated around each model includes zero, then both models will be considered as having approximately the same fit per degree of freedom. The extent to which one or both models’ confidence intervals do not include zero helps to determine which model is optimal.
Summary
A common thread among the three AMS strategies is the use of theory in deriving the alternatives.
Indeed, it is an absolute must if the intent is to create research scenarios using CSM approaches whereby theory is being put at risk; that is, we use AMS to truly attempt to disconfirm the focal theory. We will admit that invoking the use-of-theory theme is seemingly passé because it is invoked in nearly every article, book, chapter, and talk on CSM applications. Within the current context, however, stating this theme is anything but passé.
Namely, the last outcome we desire from reading this section is the development of a checklist mentality. That is, the reader believes that in their application of CSM the relevant questions to ask are, “Do we have any (a) equivalent models and how many; (b) nested models and how many; and (c) nonnested models and how many?” These questions from our perspective are absolutely irrelevant and would undermine the goal of this chapter. We fear that by asking these questions the researchers’ intent is to remain firmly within the boundaries of their target frameworks and to seek out simple “tweaks” (e.g., removing arrows, adding arrows). Although doing so may technically create an alternative, it is not being done in the spirit of truly challenging the theory. Indeed, we fear that doing so will further create the rows upon rows of stifled crops that are so rarely thinned.
Our desired outcome is to encourage a thorough examination of the focal research literature prior to data collection; that is, the conceptual rationale for each model, regardless of whether it is equivalent, nested, or nonnested, is stated in advance to avoid any temptation toward post hoc theorizing. The relevant question to ask during this review is, “What may be done to put the target framework at risk?” That is, one is purposely seeking evidence that may disconfirm the focal theory. As alluded to through the review of AMS strategies, there may be conceptual and/or empirical evidence countering the importance of a latent variable in the model or indicating that a particular relationship may be meaningless in the presence of another latent variable. There may also be evidence suggesting that a particular measure used a hundred times to operationalize a particular construct is actually inappropriate and that a more valid operationalization of the construct exists through an alternative measure. In short, the major point is to identify the AMS prior to data collection.
Only when this has been satisfied does it become appropriate to ask, “How many AMS exist and what is their form?”
A distinct advantage of approaching AMS prior to data collection is that it forces researchers to carefully plan the design of the study to accommodate not only the focal model but also the alternative model or models (MacCallum et al., 1993). For example, an obvious design consideration is including valid operationalizations of all constructs whether they are specific to the target model, alternative model, or both. Another design consideration is realizing that perhaps not all of the identified alternative models may be examined in a single study. Thus, one may need to undertake a programmatic
research stream that systematically evaluates the alternatives. An added advantage of this approach is dealing in some small way with the ubiquitous unmeasured variables issue (James et al., 1982) characterizing most CSM applications. Namely, if in one data collection, logistics or other constraints prevent the inclusion of a variable identified as part of an alternative model but conceptually that variable is known to be exogenous to two latent variables that are included in the current data collection, one may permit the disturbance terms of the two latent variables to correlate in order to control for the unmeasured variable. Our primary point is that identifying the alternative model provides the means to theoretically justify that correlation and to do so in advance. Other design considerations may include switching from a cross-sectional design to a longitudinal one when the competing models include contradictory statements regarding the causal priority among variables.
It is beyond the scope of this chapter to entertain all design considerations. As a guiding framework for thinking of these issues, though, we strongly recommend reading Donald B. Rubin’s manuscripts, particularly his 1974, 1978, 1980, and 1986 publications. His underlying premise is to systematically approach study design by carefully considering the unit, the treatment, and the outcomes, even in the case of field research. This is particularly germane to the current topic given that the models represent theoretically competing views, and the goal is to have a “winner.” Thus, the design must fairly represent those competing frameworks.
AMS in Practice
As noted earlier, most of what was stated in previous pages simply reiterates recommendations long stated by others (MacCallum & Austin, 2000; MacCallum et al., 1993; Tomarken & Waller, 2003; Williams et al., 1996). Given its history, therefore, we were curious whether AMS was routinely applied in the organizational science research literature.
Henley et al. (2006) recently asked a similar question within the context of CSM applications specific only to strategy research. Their results indicated that AMS was seldom undertaken. The current review included all studies using CSM from 1996 to 2006 in the 13 journals we felt represented the micro- to macro-perspective underlying the organizational sciences. Our list included
the Journal of Applied Psychology, Personnel Psychology, Educational and Psychological Measurement, Journal of Management, Academy of Management Journal, Journal of Organizational Behavior, Organizational Behavior and Human Decision Processes, Strategic Management Journal, Organizational Dynamics, Human Relations, Group and Organization Management, Journal of Occupational Psychology, and Journal of Occupational and Organizational Psychology. We did not include methods journals, as our goal was to evaluate the practices of researchers typically engaged in hypothesis-testing research.
We read each article to first determine whether AMS was used and whether that use was at the measurement model stage, the structural model stage, or both. Most importantly, we evaluated to what end the AMS was used; that is, was it conducted in the spirit embodied in this chapter (putting theory at risk) or did it have some other purpose? Other purposes were methodological in nature. For example, pitting an AMS in which a mediating path is specified against an AMS that does not have that path is not putting theory at risk as embodied here. Embodied is emphasized to note that certainly supporting or not supporting mediation has theoretical implications, but this test is not of the variety here where one is truly putting one’s theory at risk. The researcher is simply following prescribed methodological steps to test for mediation, and even when mediation is not supported, typically the core of theory underlying the model remains intact. Similarly, the tests underlying the nested model sequence for evaluating measurement invariance fall technically under AMS. However, again, the goal is a methodological one. These results are presented in Tables 7.2 and 7.3.
From column 2 in both Tables 7.2 and 7.3, one expected finding is the increasing frequency of CSM use from 1996 to 2006.
Further, an examination of the third (Tested Alternative Models) and fourth (Percent) columns of both tables indicates that a respectable number of studies are at least reporting that AMS was undertaken. Thus, from this perspective, it appears that AMS may occur quite routinely in the organizational sciences. However, the numbers in those columns are quite misleading. First, it was our original intent to construct the tables with a fifth column titled “Equivalent” (before the Nested and Nonnested columns); that is, to have columns representing the frequency with which a theoretically a priori determined equivalent alternative model was compared to a target model. However, there was no need
Table 7.2 Number of Published Articles in 12 Journals Using Confirmatory Factor Analysis That Tested Alternative Models

Year   CFA  Tested Alternative Models  Percent  Nested    Nonnested  Both
1996   20   13 (5)                     65.0%    13 (5)    0          0
1997   26   17 (13)                    65.4%    16 (12)   1 (1)      0
1998   20   13 (8)                     65.0%    13 (8)    0          0
1999   37   21 (3)                     56.8%    17 (1)    2 (1)      2 (1)
2000   43   31 (8)                     72.1%    26 (4)    3 (2)      2 (2)
2001   34   23 (6)                     67.6%    22 (5)    0          1 (1)
2002   37   21 (7)                     56.8%    21 (7)    0          0
2003   30   17 (4)                     56.7%    16 (3)    1 (1)      1 (1)
2004   66   45 (12)                    68.2%    42 (11)   3 (1)      0
2005   81   47 (8)                     58.0%    44 (5)    1 (1)      2 (2)
2006   69   49 (15)                    71.0%    49 (15)   0          0
Total  463  297 (89)                   64.1%    278 (75)  11 (7)     8 (7)

Note. CFA = total number of studies in that year utilizing confirmatory factor analysis; Tested Alternative Models = value outside parentheses is the number from CFA claiming test of alternative models, and value inside parentheses is number of studies doing so in a “disconfirming” manner; Percent = percentage of CFA studies claiming use of alternative model test; Nested and Nonnested = breakdown of those in the Tested Alternative Models column claiming a nested or nonnested strategy, with values in parentheses representing those following a “disconfirming” strategy; Both = number of studies from the Tested Alternative Models column that employed both a nested and a nonnested strategy.
for the columns given that only one study specifically acknowledged that an equivalent model was tested (Carless, 1998). This is particularly troublesome in light of the recent Henley et al. (2006) findings. Using Lee and Hershberger’s (1990) method of calculating possible equivalent models, Henley et al. (2006) determined that of 79 studies using CSM, 59 (75%) had at least one theoretically viable equivalent model. Given that strategy research is one of the core organizational science disciplines, we can safely assume that a similar percentage represents the number of viable equivalent models within the other disciplines of the organizational sciences. Our conclusion is that equivalent models are simply not considered by organizational science researchers when they should be.
Table 7.3 Number of Published Articles in 12 Journals Using Structural Equation Modeling That Tested Alternative Models

Year   SEM  Tested Alternative Models  Percent  Nested    Nonnested  Both
1996   7    3 (3)                      42.9%    3 (3)     0          0
1997   10   5 (4)                      50.0%    5 (4)     0          0
1998   12   8 (2)                      66.7%    5 (1)     3 (1)      0
1999   21   14 (5)                     66.7%    12 (3)    1 (1)      1 (1)
2000   14   8 (1)                      57.1%    6 (1)     2 (0)      0
2001   14   10 (6)                     71.4%    7 (4)     0          3 (2)
2002   13   8 (4)                      61.5%    8 (4)     0          0
2003   15   9 (6)                      60.0%    7 (4)     0          2 (2)
2004   22   15 (13)                    68.2%    14 (12)   1 (1)      0
2005   40   34 (18)                    85.0%    32 (16)   0          2 (2)
2006   24   16 (10)                    66.7%    15 (9)    1 (1)      0
Total  192  130 (72)                   67.7%    114 (61)  8 (4)      8 (7)

Note. SEM = total number of studies in that year utilizing structural equation modeling; Tested Alternative Models = value outside parentheses is the number from SEM claiming test of alternative models, and value inside parentheses is number of studies doing so in a “disconfirming” manner; Percent = percentage of SEM studies claiming use of alternative model test; Nested and Nonnested = breakdown of those in the Tested Alternative Models column claiming a nested or nonnested strategy, with values in parentheses representing those following a “disconfirming” strategy; Both = number of studies from the Tested Alternative Models column that employed both a nested and a nonnested strategy.
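The summary percentages drawn from Tables 7.2 and 7.3 are simple ratios of the Total rows. As a quick check, and only as an illustrative sketch (the dictionary layout and the helper `pct` are assumptions, not part of the original analysis), they can be recomputed as follows.

```python
# Total rows of Table 7.2 (CFA) and Table 7.3 (SEM), copied from the tables.
# "tested" = studies claiming an alternative-model test;
# "disconfirming" = the parenthesized subset of "tested";
# "nonnested" and "both" = values outside parentheses in the last two columns.
totals = {
    "CFA": {"studies": 463, "tested": 297, "disconfirming": 89,
            "nonnested": 11, "both": 8},
    "SEM": {"studies": 192, "tested": 130, "disconfirming": 72,
            "nonnested": 8, "both": 8},
}

def pct(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)

# Share of CSM studies that report any alternative-model test.
share_tested = {name: pct(row["tested"], row["studies"])
                for name, row in totals.items()}

# Share of those tests carried out in a "disconfirming" manner.
share_disconfirming = {name: pct(row["disconfirming"], row["tested"])
                       for name, row in totals.items()}

# Studies that tested at least one nonnested alternative, across both tables.
nonnested_any = sum(row["nonnested"] + row["both"] for row in totals.values())

print(share_tested)
print(share_disconfirming)
print(nonnested_any)
```

The first dictionary reproduces the Percent entries in the Total rows, and the other two quantities correspond to the disconfirming shares and the nonnested count discussed in the text.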
Findings in Tables 7.2 and 7.3 support the idea that nested models are the alternative model of choice among researchers relative to nonnested models. Looking at the 2001 row, for example, one sees that of the 23 studies claiming the application of AMS in a CFA context (Table 7.2), 22 (96%) used a nested model approach. Similarly, from the same year in Table 7.3, 7 out of 10 (70%) studies claiming the use of AMS in SEM applications were of the nested model variety.
Of importance to the current chapter are the values in parentheses next to the frequency of nested and nonnested AMS. These values represent the number of studies embodying the idea of truly competing one theoretically specified model against at least one other theoretically specified alternative model. Although some years are better than others (i.e., higher frequency of AMS are truly pitting one
theoretical specification against the other), only 30% (89/297) of the total number of nested AMS in CFA applications and 55% (72/130) of them in SEM applications were of the “competing theory” or disconfirming variety across the total number of years reviewed by us.
Exemplars of undertaking a disconfirming strategy within CFA applications from Table 7.2 are Cordes, Dougherty, and Blum (1997); Hwee and Aryee (2002); and Yukl, Chavez, and Seifert (2005). In these studies, the researchers undertook an extensive review of the relevant literature to derive multiple plausible factor models to explain the item covariance in the measures they were evaluating. They described why each model was plausible and then used at least one sample to lend empirical support to one of the models. Though one superior model did not necessarily emerge in all cases, these studies stress the importance of testing alternative models. The majority of studies claiming nested AMS within CFA were specifying alternative models, but they were of the methodological variety, that is, tests of measurement invariance or specifications of different factors merely to demonstrate discriminant validity. An example of the former is Wang and Russell (2005), and an example of the latter is Tierney, Farmer, and Graen (1999). Though alternative models are being tested, this is not in the spirit of challenging theory.
With respect to nested model AMS practices within SEM contexts (Table 7.3), strong examples of studies undertaking a disconfirming approach include Kinicki, Prussia, Wu, and McKee-Ryan (2004); Claessens, Van Eerde, Rutte, and Roe (2004); and Lim and Qing (2006). These studies are excellent examples of AMS in structural equation modeling because each proposed strong theoretical justification for the inclusion of mediators or moderators before testing them empirically.
These studies use years of theory development to select valid models to compare rather than including or excluding paths between variables in an exploratory and atheoretical manner. The vast majority of claims of AMS in Table 7.3 were using nested strategies to undertake tests of mediation (e.g., Friedman, Anderson, Brett, Olekalns, Goates, & Lisco, 2004; Eddleston, Veiga, & Powell, 2006). Although the support or lack of it for mediation certainly has theoretical implications, the specification of the AMS is done to follow prescribed methodological steps.
Across Tables 7.2 and 7.3, only 35 studies (adding the values outside the parentheses in the sixth and seventh columns) tested alternative nonnested models. One positive aspect to this, however, is
that the frequency of cases doing so from a disconfirming perspective was proportionally higher than was the case for nested AMS. A very troubling characteristic, though, of the studies is that very few examined optimal model fit per the strategy explained earlier in the chapter (e.g., AIC, BIC). Most compared each model loosely by comparing fit indices generally reported in CSM studies (e.g., Flora, Finkel, & Forshee, 2003) or doing nested comparisons with a single baseline model and selecting the model with the smallest chi-square difference. Further, many did not provide any details on how the best model was chosen at all. These findings illustrate the lack of best practices in use for the testing of nonnested alternative models.

Summary

From our readings of the articles, there were other aspects to CSM practices that were less germane to the chapter but troubling nonetheless and, thus, important to highlight. For example, in about half of the studies that specifically cite the Anderson and Gerbing (1988) method of testing measurement and structural models, fewer than half of the steps outlined by Anderson and Gerbing are actually followed. Further, we found that the methods recommended to compare alternative models to the target, whether nested or nonnested, were inconsistently followed, incorrect, incomplete, ignored altogether, or completely left out of the publication. Thus, what is our summary conclusion from our findings? It is that we have failed, and continue to fail for the most part, within the organizational sciences to make disconfirmation an integral aspect of CSM applications. Additionally, it was very common across the studies in our review for researchers to use interpretative language implying that they received strong support for their focal theoretical model (confirmation bias). AMS from our perspective is the most viable avenue to engage in disconfirmation and, as such, to perhaps avoid affirming the consequent.
Hence, AMS should become an integral aspect of all CSM applications. In closing, while the focus over the last several pages has been clearly on the researcher/author using CSM, we would like to return momentarily to those responsible for evaluating manuscripts. Do not use the “failed to test an alternative model” benchmark to reject or severely discount a study without understanding thoroughly the
Alternative Model Specifications in Structural Equation Modeling
implications of the statement. Stating it by itself without elaboration is editorial irresponsibility from our perspective. If it is going to be used, be prepared at a minimum to follow that statement with suggestions as to where the researchers may have overlooked some critical literature that supports a viable alternative model. In any event, we hope this chapter serves to illuminate the AMS issue and its overall importance in CSM applications.

References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In S. Kotz & N. L. Johnson (Eds.), Breakthroughs in statistics (pp. 599–624). New York: Springer.
Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin, 103, 411–423.
Anderson, J. C., & Gerbing, D. W. (1992). Assumptions and comparative strengths of the two-step approach. Sociological Methods and Research, 20, 321–333.
Biddle, B. J., & Marlin, M. M. (1987). Causality, confirmation, credulity, and structural equation modeling. Child Development, 58, 4–17.
Boomsma, A. (2000). Reporting analyses of covariance structures. Structural Equation Modeling, 7, 461–483.
Breckler, S. J. (1990). Applications of covariance structure modeling in psychology: Cause for concern? Psychological Bulletin, 107, 260–273.
Burnham, K. P., & Anderson, D. R. (2004). Multimodel inference: Understanding AIC and BIC in model selection. Sociological Methods and Research, 33, 261–304.
Carless, S. A. (1998). Assessing the discriminant validity of transformational leader behaviour as measured by the MLQ. Journal of Occupational & Organizational Psychology, 71, 353–358.
Claessens, B. J. C., Van Eerde, W., Rutte, C. G., & Roe, R. A. (2004). Planning behavior and perceived control of time at work. Journal of Organizational Behavior, 25, 937–950.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Cordes, C. L., Dougherty, T. W., & Blum, M. (1997). Patterns of burnout among managers and professionals: A comparison of models. Journal of Organizational Behavior, 18, 685–701.
Eddleston, K. A., Veiga, J. F., & Powell, G. N. (2006). Explaining sex differences in managerial career satisfier preferences: The role of gender self-schema. Journal of Applied Psychology, 91, 437–445.
Flora, D. B., Finkel, E. J., & Forshee, V. A. (2003). Higher order factor structure of a self-control test: Evidence from confirmatory factor analysis with polychoric correlations. Educational & Psychological Measurement, 63, 112–127.
Fornell, C., & Yi, Y.-J. (1992). Assumptions of the two-step approach to latent variable modeling. Sociological Methods and Research, 20, 291–320.
Friedman, R., Anderson, C., Brett, J., Olekalns, M., Goates, N., & Lisco, C. C. (2004). The positive and negative effects of anger on dispute resolution: Evidence from electronically mediated disputes. Journal of Applied Psychology, 89, 369–376.
Greenwald, A. G., Pratkanis, A. R., Leippe, M. R., & Baumgardner, M. H. (1986). Under what conditions does theory obstruct progress? Psychological Review, 93, 216–229.
Hayduk, L. A., & Glaser, D. N. (2000). Jiving the four-step, waltzing around factor analysis, and other serious fun. Structural Equation Modeling, 7, 1–35.
Henley, A. B., Shook, C. L., & Peterson, M. (2006). The presence of equivalent models in strategic management research using structural equation modeling: Assessing and addressing the problem. Organizational Research Methods, 9, 516–539.
Hershberger, S. L. (2006). The problem of equivalent structural models. In G. R. Hancock & R. O. Mueller (Eds.), Structural equation modeling: A second course. Greenwich, CT: Information Age Publishing.
Hu, L., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6, 1–55.
Hwee, H. T., & Aryee, S. (2002). Antecedents and outcomes of union loyalty: A constructive replication and an extension. Journal of Applied Psychology, 87, 715–722.
James, L. R., Mulaik, S. A., & Brett, J. A. (1982). Conditions for confirmatory analysis and causal inference. Beverly Hills, CA: Sage.
Kaplan, D. (1990). Evaluating and modifying structural equation models: A review and recommendation. Multivariate Behavioral Research, 25, 137–155.
Kinicki, A. J., Prussia, G. E., Wu, B., & McKee-Ryan, F. M. (2004). A covariance structure analysis of employee’s response to performance feedback. Journal of Applied Psychology, 89, 1057–1069.
Kuha, J. (2004). AIC and BIC: Comparisons of assumptions and performance. Sociological Methods & Research, 33, 188–229.
Lakatos, I. (1970). Falsification and the methodology of scientific research programmes. In I. Lakatos & A. Musgrave (Eds.), Criticism and the growth of knowledge (pp. 91–196). Cambridge, England: Cambridge University Press.
Lee, S., & Hershberger, S. (1990). A simple rule for generating equivalent models in covariance structure modeling. Multivariate Behavioral Research, 25, 313–334.
Lim, V. K. G., & Qing, S. S. (2006). Does parental job insecurity matter? Money anxiety, money motives and work motivation. Journal of Applied Psychology, 91, 1078–1087.
MacCallum, R. C., & Austin, J. T. (2000). Applications of structural equation modeling in psychological research. Annual Review of Psychology, 51, 201–226.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114, 185–199.
McCoach, D. B., Black, A. C., & O’Connell, A. A. (2007). Errors of inference in structural equation modeling. Psychology in the Schools, 44, 461–470.
McDonald, R. P., & Ho, M.-H. R. (2002). Principles and practice in reporting structural equation analyses. Psychological Methods, 7, 64–82.
Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108–141.
Mulaik, S. A., & Millsap, R. E. (2000). Doing the four-step right. Structural Equation Modeling, 7, 36–74.
Oczkowski, E. (2002). Discriminating between measurement scales using nonnested tests and 2SLS: Monte Carlo evidence. Structural Equation Modeling, 9, 103–125.
Popper, K. R. (1959). The propensity interpretation of probability. The British Journal for the Philosophy of Science, 10, 25–42.
Popper, K. R. (1966). The open society and its enemies. London: Routledge.
Popper, K. R. (1972). Objective knowledge: An evolutionary approach. New York: Oxford University Press.
Preacher, K. J. (2006). Quantifying parsimony in structural equation modeling. Multivariate Behavioral Research, 41, 227–259.
Raykov, T. (2001). Approximate confidence interval for difference in fit of structural equation models. Structural Equation Modeling, 8, 458–469.
Raykov, T., & Marcoulides, G. A. (2001). Can there be infinitely many models equivalent to a given structural equation model? Structural Equation Modeling, 8, 142–149.
Raykov, T., & Penev, S. (1999). On structural equation model equivalence. Multivariate Behavioral Research, 34, 199–244.
Reichardt, C. S. (2002). The priority of just-identified recursive models. Psychological Methods, 7, 307–315.
Rubin, D. B. (1974). Estimating causal effects in treatments using randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688–701.
Rubin, D. B. (1978). Bayesian inference for causal effects. The Annals of Statistics, 6, 34–58.
Rubin, D. B. (1980). Comment on “Randomization analysis of experimental data: The Fisher randomization test” by D. Basu. Journal of the American Statistical Association, 75, 591–593.
Rubin, D. B. (1986). Statistics and causal inferences: Which ifs have causal answers. Journal of the American Statistical Association, 81, 961–962.
Scheines, R., Spirtes, P., Glymour, C., & Meek, C. (1994). TETRAD II: Tools for discovery. Hillsdale, NJ: Erlbaum.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461–464.
Steiger, J. H., Shapiro, A., & Browne, M. W. (1985). On the multivariate asymptotic distribution of sequential chi-square statistics. Psychometrika, 50, 253–264.
Stelzl, I. (1986). Changing a causal hypothesis without changing the fit: Some rules for generating equivalent path models. Multivariate Behavioral Research, 21, 309–331.
Tierney, P., Farmer, S. M., & Graen, G. B. (1999). An examination of leadership and employee creativity: The relevance of traits and relationships. Personnel Psychology, 52, 591–620.
Tomarken, A. J., & Waller, N. G. (2003). Potential problems with “well fitting” models. Journal of Abnormal Psychology, 112, 578–598.
Vandenberg, R. J. (2006). Statistical and methodological myths and urban legends: Where pray tell did they get this idea? Organizational Research Methods, 9, 194–201.
Vandenberg, R. J., & Nelson, J. B. (1999). Examining the functionality of turnover intentions: A pretest-posttest control group design. Human Relations, 52, 1313–1336.
Wang, M., & Russell, S. S. (2005). Measurement equivalence of the job descriptive index across Chinese and American workers: Results from confirmatory factor analysis and item response theory. Educational & Psychological Measurement, 65, 709–732.
Williams, L. J., Bozdogan, H., & Aiman-Smith, L. (1996). Inference problems with equivalent models. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 279–314). Mahwah, NJ: Lawrence Erlbaum.
Yukl, G., Chavez, C., & Seifert, C. F. (2005). Assessing the construct validity and utility of two new influence tactics. Journal of Organizational Behavior, 26, 705–725.
8 On the Practice of Allowing Correlated Residuals Among Indicators in Structural Equation Models
Ronald S. Landis, Bryan D. Edwards, and Jose M. Cortina
Imagine a situation in which an innocent researcher wishes to explain variance in some critical criterion variable. For expedience, he uses only a single predictor in his validation study. After data collection, the researcher observes a rather unimpressive correlation (e.g., r_xy = .10). The researcher then explains in his Discussion section that his model is perfect because, if he had measured all relevant variables, his “model” would have explained all the variance in the criterion. Absurd, you say! Ridiculous! J’accuse! How can one argue for the integrity of a model based on unmeasured variables and/or unexpected relationships?

Despite the lunacy of the preceding example, a similar practice occurs with some frequency in applications of structural equation modeling (SEM). Specifically, the practice of allowing for correlated residuals among indicators in SEM is, in many cases, tantamount to capitalizing on “what could have been” and serves as the focus of the current chapter.

SEM provides the tools to simultaneously test both measurement and structural relationships (Maruyama, 1998). In order to accurately test models, researchers must fully articulate the expected underlying relationships. Specific parameters and model fit statistics are calculated based on the comparison between the hypothesized (predicted) model and the underlying model that produced the observed data. A model “fits” to the extent that the covariance matrix reproduced from the hypothesized relationships matches
the observed covariance matrix. In addition to fit statistics and estimates of model parameters, SEM program output typically includes standardized residuals, modification indices, and expected changes in parameter values, which can be used to identify areas of poor fit. The information provided from a SEM analysis is often used as the basis for specification searches to achieve greater fit of the initial model (Kelloway, 1998; Long, 1983; Maruyama, 1998; Marcoulides & Drezner, 2001). A specification search is the process by which empirical data are used to modify an initial model to improve fit (Long, 1983; MacCallum, 1986). Indeed, there have been explicit calls for researchers to consider empirically equivalent or competing models to rule out alternative explanations (e.g., MacCallum, Wegener, Uchino, & Fabrigar, 1993; Vandenberg & Grelle, 2009). For situations in which modification indices suggest improved model fit through allowing residuals among indicators to correlate, however, guidance has been less clear. Indeed, allowing correlations between indicator residuals (IRs) based on significant modification indices is perhaps the least theoretically defensible practice in model modification primarily due to capitalization on chance of sample-specific characteristics that are not representative of the population. Thus, it must be recognized that conducting specification searches in post hoc model modifications is an exploratory, data-driven process and should not be used for confirmatory hypothesis testing.

The complexity of SEM means that researchers, possibly overwhelmed by the apparent sophistication of the underlying mathematics and/or computer program interfaces, may not fully appreciate the implications of the decisions they make. Simply out of a desire to obtain reasonable model fit, they may apply decision criteria in a self-serving fashion.
Thus, it is not surprising that some confusion about acceptable SEM practices may exist among researchers, even those facile with the technique. The urban legend that serves as the focus of the current chapter is the apparent belief that it is a reasonable practice to allow IRs to be correlated in covariance structure models in order to obtain better model fit. Our contention is that the estimation of correlations between IRs in SEM is only appropriate in a very restricted subset of circumstances and should not be applied in most analyses.

[Footnote: Specification searches are also used for model simplification, but the focus of the present chapter is on model fit improvement.]
To facilitate discussion of the practice of allowing correlated IRs, the current chapter is organized as follows. First, we begin with an overview of this urban legend including the extent of the practice in current literature. We then present a brief review of SEM and discuss why allowing for correlated IRs is generally inappropriate. Finally, we describe those limited situations in which researchers might be justified in allowing for correlated IRs as well as recommended alternatives.

Unraveling the Urban Legend

In order to explore this urban legend, the following section addresses two specific questions. To what extent do researchers actually engage in this practice? Where might this legend have its origins?

Extent of the Problem

A quick review of several journals that publish organizational research provides some information regarding the extent to which authors reported allowing correlated IRs in applications of SEM. Specifically, all articles published between January 2002 and July 2007 in Personnel Psychology, Journal of Applied Psychology, and Journal of Management were reviewed for applications of SEM. During the time period, 58 empirical articles were identified as using SEM to test measurement models, structural models, or both. Of those articles, 5 specifically indicated that they estimated covariances or correlations between at least two IRs and another 2 articles did not provide enough information to definitively suggest they did not allow such relationships in the model. Thus, between 9% and 12% of these articles (5 to 7 of the 58) allowed for correlated IRs. In 2 of the aforementioned 7 articles identified in the preceding review, correlated IRs were specified a priori and the remaining 5 articles allowed for correlated IRs after an initial model test.
This is important in unraveling the urban legend because those studies that allowed for correlated IRs in initial model testing were longitudinal and the IRs that were allowed to covary were the same indicators at different time periods. Alternatively, those studies that allowed for correlated IRs in a specification search were cross-sectional in nature and involved different indicators. Unfortunately, based on
our review of the literature, the practice of allowing for correlated IRs in modified models is the practice that appears to be most prevalent and, as we discuss in the following section, least defensible.

Several other aspects of this review deserve additional comment with respect to the practice under consideration. First, some of the published research using SEM did not provide enough information to fully determine the model tested. Although degrees of freedom can be used to determine the exact number of estimated parameters, in some cases substantially different alternative models can produce the same degrees of freedom. Unless authors specifically stated they allowed for correlated IRs, we gave the benefit of the doubt and assumed they did not. Thus, if our percentages are inaccurate, they should be underestimates of the degree to which researchers engage in the practice. Second, the review of published research completely misses those articles submitted for publication that were ultimately not accepted. In our own experiences as reviewers and in informal conversations with other reviewers and editors, we believe the extent to which the practice is employed is noticeably higher than 9–12%. In support of this assertion, Cole, Ciesla, and Steiger (2007) reported that between 26.6% and 31.9% of articles using SEM in five top-tier journals published by the American Psychological Association allowed for correlated IRs in some form.

Origins

Although a single source of the urban legend regarding the appropriateness of allowing correlated IRs is elusive, a review of the relevant literature reveals a pattern that may explain why some researchers inappropriately engage in the practice.
Nearly all published articles that have addressed the subject of allowing correlated IRs from an inspection of modification indices provide warnings regarding the atheoretical nature of this practice but also provide suggestions for modifying models nonetheless (e.g., Cliff, 1983; Costner & Schoenberg, 1973; Gerbing & Anderson, 1984; Kaplan, 1989, 1990a; Long, 1983; MacCallum, 1986; MacCallum, Roznowski, & Necowitz, 1992; Reddy, 1992; Saris, Satorra, & Sörbom, 1987; Saris & Stronkhorst, 1984; Tomarken & Waller, 2003). Thus, the practice of allowing post hoc model modifications may have originated from researchers receiving mixed messages from articles that warn against the
practice but then indicate how it can be done along with providing the tools to do it.

No matter the definitive origins for allowing correlated IRs, we believe the practice persists for several reasons. First, extensive time, effort, and money often go into data collection and researchers are loath to abandon their data even if analyses do not support the hypothesized model. In fact, Sörbom (1989, p. 384) conjectured that “rather than accept this fact and leave it at that, it makes more sense to modify the model so as to fit the data better.” In addition, model-fitting programs readily provide modification indices, expected parameter change (EPC) statistics, residuals, and other information that makes it easy to conduct post hoc specification searches. Thus, continued practice of allowing correlated IRs may result from the combination of a motivation to salvage data that do not support the original model, readily available information that can lead to a better-fitting model, and recommendations from professionals on how to properly conduct specification searches.

A Brief Review of Structural Equation Modeling

A simple model is presented in Figure 8.1 that serves as a foundation upon which to consider correlated IRs. In short, the primary model illustrated by this figure is that two antecedent variables (A and B) are predicted to cause a mediator variable (Y) that, in turn, causes another variable (Z) with each variable measured by 4 indicators. The relationships illustrated in Figure 8.1 reflect hypotheses linking the underlying constructs of interest (i.e., structural relationships) as well as the measurement models that depict the extent to which a given set of measures reflect an underlying latent construct. Using these hypothesized relationships and the observed variance-covariance matrix, parameter estimates are generated for each freely estimated relationship in the model.
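As a rough illustration of how hypothesized relationships yield a reproduced covariance matrix, the following Python sketch computes the covariances implied by a single-factor measurement model and subtracts them from an observed matrix. Everything in it is an assumption made for illustration: the function names, the loadings, the residual variances, and the observed matrix are hypothetical and do not come from Figure 8.1 or from any data set discussed in this chapter.

```python
# Sketch only: model-implied covariances for a one-factor measurement model.
# All names and numbers are hypothetical and chosen purely for illustration.

def implied_covariance(loadings, factor_variance, residual_variances):
    """Return Sigma = lambda * phi * lambda' + Theta, with Theta diagonal
    (that is, with all residual covariances fixed at zero)."""
    k = len(loadings)
    sigma = [[loadings[i] * factor_variance * loadings[j] for j in range(k)]
             for i in range(k)]
    for i in range(k):
        sigma[i][i] += residual_variances[i]
    return sigma


def residual_matrix(observed, implied):
    """Element-wise difference between observed and implied covariances."""
    k = len(observed)
    return [[observed[i][j] - implied[i][j] for j in range(k)]
            for i in range(k)]


loadings = [0.8, 0.7, 0.6, 0.5]            # hypothetical factor loadings
residual_variances = [0.36, 0.51, 0.64, 0.75]
sigma = implied_covariance(loadings, 1.0, residual_variances)

# Hypothetical observed matrix: equal to the implied matrix except that the
# first and third indicators covary more than the single factor can explain.
observed = [row[:] for row in sigma]
observed[0][2] += 0.15
observed[2][0] += 0.15

resid = residual_matrix(observed, sigma)
for row in resid:
    print([round(value, 2) for value in row])
```

In this sketch the only nonzero off-diagonal residual is the one between the first and third indicators, which is the kind of localized misfit that a specification search would tend to flag.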
For each indicator in the measurement model, two parameter estimates are derived: variance associated with the target latent variable (factor loading) and all other unique sources (residual variance). In Figure 8.1, the lines with single-headed arrows that point from a factor to an indicator represent the hypothesized relationship that variance in the observed measure is “caused” by the underlying latent variable (e.g., A → a1). The single-headed arrows that point from the
[Figure 8.1 A typical structural equation model. Latent variables A and B (indicators a1–a4 and b1–b4, residuals δa1–δa4 and δb1–δb4) predict Y (indicators y1–y4, residuals εy1–εy4), which in turn predicts Z (indicators z1–z4, residuals εz1–εz4).]
measure residual to the indicator represent all remaining sources of unique variance in the indicator not accounted for by the target latent variable (e.g., δa1 → a1). The residual variance terms associated with each exogenous indicator are labeled as theta-deltas (δ) and with each endogenous indicator as theta-epsilons (ε).

Indicator Residuals

Researchers often focus first on developing strong measures of relevant latent variables (i.e., the measurement model). Indeed, some researchers recommend that authors begin all SEM with formal evaluation and revision of the measurement model(s) (Anderson & Gerbing, 1988; Lance, Cornwell, & Mulaik, 1988). Only then, perhaps with measurement parameters specified, should one consider estimating the structural model (Lance et al., 1988). In other words, one should have a clear understanding of the “causes” of observed variance in the relevant indicators. Theoretically, observed variance in the manifest indicators may be partitioned into three components (Maruyama, 1998): true score common variance, true score unique variance, and error variance. True score common variance in a given indicator of a latent variable is the shared variability with other indicators of the same latent variable. The common variance represents the underlying latent variable that the indicators were hypothesized to measure. To the extent that an indicator has a large degree of true score common variance, the resulting factor loading of this indicator on an underlying factor should be large and statistically significant. Further, one would not expect indicators to have substantial loadings on other latent variables in the model (Anderson & Gerbing, 1988). If this latter situation occurs, a particular indicator would have an ambiguous role in the model and should probably be eliminated (Anderson & Gerbing, 1988). Additional parameters to be estimated in evaluating measurement relationships are those associated with the residuals (i.e., uniquenesses).
[Footnote: Although a nontrivial number of SEM programs are available and each uses somewhat unique nomenclature, we have chosen to use LISREL terminology so as to simplify the discussion.]

These residuals represent variance that is unique to each of
the manifest indicators and, theoretically, are defined by the other two sources of variance (Maruyama, 1998). True score unique variance is systematic variability associated with an indicator but uncorrelated with variance of other indicators. True score unique variance may represent the effects of any other latent variable. For example, a situational judgment test designed to measure interpersonal skills may also measure the latent variable “judgment” such that variance in scores associated with the judgment variable would be systematic variance and not shared by different measures of interpersonal skills. Finally, error variance is the unsystematic (random) variability associated with an indicator. Thus, for each indicator in a measurement model, there is a residual term (a.k.a. uniqueness) that includes the influence of all factors other than the target latent variable (i.e., both unique, systematic variance and unsystematic variance). Specifically, the theta-delta matrix contains residuals for indicators of exogenous variables and the theta-epsilon matrix contains residuals for indicators of endogenous variables. Both of these are square, symmetric matrices that contain the variances of IR terms along the diagonal and the covariances between IR terms in the off-diagonal cells. Typically, the covariances are fixed at zero (Byrne, 1994; Kelloway, 1998), because these variances are conceptually unique to each indicator and should share no variance with the uniquenesses of other indicators. Obviously, the practice of allowing for correlated IRs directly contradicts this assumption.

Model Fit

Overall model fit may be evaluated through a number of indices. Some of the more common fit indices are the chi-square (χ²) statistic, root mean square error of approximation (RMSEA; Steiger, 1990), Tucker–Lewis index (TLI; Tucker & Lewis, 1973), and comparative fit index (CFI; Bentler, 1990).
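The indices just listed can all be computed from chi-square values, degrees of freedom, and sample size. The Python sketch below uses commonly cited formulas for RMSEA, CFI, and TLI. The function names and the input values are hypothetical, and software packages differ in details (for example, some use N rather than N − 1 in the RMSEA denominator), so the results should not be expected to match any particular program's output exactly.

```python
import math

# Sketch only: commonly cited formulas for three fit indices.
# Subscript m = hypothesized model, b = baseline (null) model.


def rmsea(chi2, df, n):
    """Root mean square error of approximation (N - 1 convention)."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_m, df_m, chi2_b, df_b):
    """Comparative fit index based on noncentrality (chi-square minus df)."""
    d_m = max(chi2_m - df_m, 0.0)
    d_b = max(chi2_b - df_b, 0.0)
    denom = max(d_m, d_b)
    return 1.0 if denom == 0 else 1.0 - d_m / denom


def tli(chi2_m, df_m, chi2_b, df_b):
    """Tucker-Lewis index based on chi-square/df ratios.
    Undefined when the baseline ratio equals exactly 1."""
    ratio_b = chi2_b / df_b
    ratio_m = chi2_m / df_m
    return (ratio_b - ratio_m) / (ratio_b - 1.0)


# Hypothetical inputs, not taken from any study discussed in this chapter.
rmsea_value = rmsea(chi2=180.0, df=100, n=401)
cfi_value = cfi(chi2_m=180.0, df_m=100, chi2_b=1300.0, df_b=120)
tli_value = tli(chi2_m=180.0, df_m=100, chi2_b=1300.0, df_b=120)
print(round(rmsea_value, 3), round(cfi_value, 3), round(tli_value, 3))
```

Because each index depends on the model chi-square, any modification that lowers the chi-square, including a freed residual covariance, will tend to move all three indices in the direction of better apparent fit.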
All of these indices are driven, either directly or indirectly, through comparison of the observed variance/covariance matrix to the reproduced variance/covariance matrix (Bollen, 1989). Smaller differences between the observed and reproduced matrices (i.e., smaller values in the residual matrix) indicate better model fit (Maruyama, 1998). In addition to overall tests of model fit, SEM programs provide information that can be used to modify hypothesized models that
generate poor fit indices. These indices can be used to conduct the “specification searches” described earlier. Although such searches can be conducted using the Wald test in order to identify unnecessary paths, the inclusion of unnecessary paths does not seriously compromise the level of most fit indices (the exceptions being the so-called parsimony fit indices). Instead, fit indices are compromised by the failure to include paths that would have received a substantial weight had they been included (Mulaik, James, Van Alstine, Bennett, Lind, & Stilwell, 1989). The Lagrange multiplier (LM) test (referred to in LISREL as modification indices) provides information regarding whether model fit could be significantly improved through freeing of a previously fixed-to-zero model parameter (Loehlin, 2004). In other words, the LM test allows for comparison of models with varying degrees of restrictiveness (i.e., fewer estimated parameters) through estimation of the initial model (Bollen, 1989) and can be applied either univariately or multivariately (Tabachnick & Fidell, 2007). Importantly, the LM test identifies potential modifications based exclusively on statistical, as opposed to substantive, criteria.

Related to the LM test is the expected parameter change (EPC) statistic proposed by Saris et al. (1987). Rather than estimating what the expected decrease in overall chi-square is as a function of estimating a previously fixed-to-zero parameter, the EPC provides an approximated value for the estimated parameter itself (Saris et al., 1987). Mathematically, the EPC is determined by the modification index and the first-order derivatives of the fitting function evaluated at the fixed parameter. A specification search might, therefore, begin with consideration of modification indices and/or EPC statistics (Kaplan, 1990b). Based on these indices, modifications can be made to the model, and the model retested.
For instance, a large modification index and large EPC might suggest freeing the associated parameter. The problem is that the instant one makes modifications, the research shifts from model testing to a data-driven exploratory model-building approach (Jöreskog, 1993). Although it may be appropriate to use data to guide specification searches and model modification, the data used to modify the model should not then be used as evidence to support the model. Instead, additional data must be collected to validate the modified theoretical model. Indeed, MacCallum (1986) described an extreme view in which specification searches should not be used at all
in SEM because doing so is “an admission that the initial model was not well conceived and/or that the research was not well conducted” (p. 109). MacCallum (1986) further explained that at the very least data-driven model modifications “cannot be statistically tested with any validity, that their goodness of fit and substantive meaning must be evaluated with caution, and that their validity and replicability are open to question” (p. 109). An Example In order to illustrate a typical use of specification searches within SEM, consider a test of Ajzen and Fishbein’s (1980) theory of reasoned action to examine the effect of employee attitudes on job performance. ἀ e researcher hypothesizes that behavioral intentions will mediate the relationship between attitudes and subjective norms on behavior (task performance). ἀ us, the researcher develops measures of employee attitudes toward task-related behaviors, subjective norms of the behaviors, behavioral intentions, and task performance according to suggested practices (e.g., Ajzen, 2002). Next, the researcher administers these measures to a sample of 500 employees at a large manufacturing firm. ἀ e data are entered and analyzed and results indicate that all the structural parameters are significant. ἀ e initial, hypothesized model is illustrated in Figure 8.2 along with the initial structural estimates. Despite the statistically significant structural paths, the overall model exhibits only a marginal fit to the data (e.g., χ2(98) = 126.58, p > .05, RMSEA = .10 [90% CI = .08 to .12], TLI = .90, and CFI = .91). Of note, there was a statistically significant EPC (standardized expected parameter change = .58) for the residual correlation between δb1 and δb3, two indicators of the subjective norms construct. 
In order to improve model fit, the researcher freely estimates this residual correlation and achieves a significantly better model fit (e.g., χ²(97) = 95.76, p > .05, RMSEA = .05 [90% CI = .03 to .07], TLI = .98, and CFI = .98). Importantly, this modification also results in new structural parameter estimates. These revised parameter estimates are indicated in parentheses in Figure 8.2. The fit statistics and parameter estimates for this example were based on a real data set but were altered slightly for illustrative purposes. For simplicity, only those parameters that serve illustrative purposes are reported.
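The chi-square values reported in this example can be checked by hand. The sketch below is only an illustration, not the authors' procedure: the helper name `chi2_sf` is hypothetical, and it relies on the standard recurrence for the regularized upper incomplete gamma function, which is exact for integer degrees of freedom. It computes the tail probability for each model and for the 1-df chi-square difference between the nested models.

```python
import math


def chi2_sf(stat: float, df: int) -> float:
    """Upper-tail probability P(X >= stat) for a chi-square variable with integer df >= 1."""
    if stat <= 0:
        return 1.0
    x = stat / 2.0
    # Starting value of the regularized upper incomplete gamma function Q(a, x):
    # Q(1, x) = exp(-x) for even df, Q(1/2, x) = erfc(sqrt(x)) for odd df.
    if df % 2 == 0:
        a, q = 1.0, math.exp(-x)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))
    # Recurrence: Q(a + 1, x) = Q(a, x) + x**a * exp(-x) / Gamma(a + 1).
    while a < df / 2.0:
        q += math.exp(a * math.log(x) - x - math.lgamma(a + 1.0))
        a += 1.0
    return q


# Chi-square values and degrees of freedom reported in the example.
p_initial = chi2_sf(126.58, 98)
p_modified = chi2_sf(95.76, 97)
# Nested comparison: one freed residual correlation, so the difference has 1 df.
p_difference = chi2_sf(126.58 - 95.76, 98 - 97)

print(f"initial model:        p = {p_initial:.3f}")
print(f"modified model:       p = {p_modified:.3f}")
print(f"chi-square difference: p = {p_difference:.2e}")
```

With these inputs the initial model's chi-square is significant at the .05 level, the modified model's is not, and the 1-df difference far exceeds the 3.84 critical value. That is the pattern one would expect when a single freed parameter absorbs most of the misfit.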
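[Figure 8.2 A structural and measurement model based on the theory of reasoned action. Diagram not reproduced. Latent variables: Attitudes (indicators a1–a4, residuals δa1–δa4), Norms (indicators b1–b4, residuals δb1–δb4), Intentions (indicators y1–y4, residuals εy1–εy4), and Behavior (indicators z1–z4, residuals εz1–εz4). Values shown in the diagram: residual correlation between δb1 and δb3 = .58; structural estimates .25 (.28), .17 (.10), .63 (.63), and .34 (.27), with revised estimates in parentheses.]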
Although all the revised parameters were significant, the relationship between subjective norms and behavioral intentions was lower in the modified model than in the initial model, and the correlation between behavioral attitudes and subjective norms was also lower in the modified model. Reddy (1992) showed that when correlated residuals are not modeled, structural parameter estimates are consistently overestimated. The data in the present example support this. Of additional note and equal importance, there is no compelling substantive explanation for why two indicators of the same construct should have correlated residual terms.

Why Correlated IRs Improve Fit

How correlated IRs improve model fit is important to understanding why allowing correlated IRs is problematic. As stated earlier, the indicator residual terms in latent variable modeling contain both random and systematic error. That is, measures may be systematically influenced by extraneous variables in addition to the target latent variable. To the degree that two residuals correlate, there is evidence of a cause of both variables to which the residuals are attached that is not specified in the model. The influence of this causal variable does not disappear. Instead, its influence goes into the residuals, and because it exists in both, the residuals themselves are correlated. Allowing these residuals to be freely estimated improves fit because it captures the influence of this omitted cause, though it does so without ever specifying what the cause is (Fornell, 1983). Allowing IRs to correlate in this instance assumes that the variables in the model have not accounted for all of the observed covariation, whether because of repeated measurements (e.g., in longitudinal research), sampling error, or omitted variables (e.g., method variance, multidimensional constructs, or higher-order constructs).
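The mechanism described above, in which an omitted common cause ends up in the residuals, can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the loading of 0.7, the omitted-cause effect of 0.5, and the noise standard deviation of 0.5 are hypothetical values chosen for illustration, and the residuals are computed from the true latent scores, which are available in a simulation but never in real data.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate(n=20000, seed=1):
    """Return residual correlations (r12, r13) for three indicators of one construct.

    Items 1 and 2 share an omitted cause; item 3 does not.
    """
    rng = random.Random(seed)
    loading = 0.7  # hypothetical loading on the modeled latent variable
    gamma = 0.5    # hypothetical effect of the omitted cause on items 1 and 2
    noise_sd = 0.5 # hypothetical random-error standard deviation
    res1, res2, res3 = [], [], []
    for _ in range(n):
        xi = rng.gauss(0, 1)     # modeled latent variable
        omega = rng.gauss(0, 1)  # omitted cause, absent from the model
        x1 = loading * xi + gamma * omega + rng.gauss(0, noise_sd)
        x2 = loading * xi + gamma * omega + rng.gauss(0, noise_sd)
        x3 = loading * xi + rng.gauss(0, noise_sd)
        # Residual = part of each indicator not explained by the modeled latent variable.
        res1.append(x1 - loading * xi)
        res2.append(x2 - loading * xi)
        res3.append(x3 - loading * xi)
    return pearson(res1, res2), pearson(res1, res3)


r12, r13 = simulate()
print(f"residual correlation, items sharing the omitted cause: {r12:.2f}")
print(f"residual correlation, no shared omitted cause:          {r13:.2f}")
```

Under these assumptions the expected residual correlation for items 1 and 2 is gamma² / (gamma² + noise_sd²) = 0.5, whereas for items 1 and 3 it is zero. Freeing the first correlation would improve fit, but nothing in that estimate identifies what the omitted cause is.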
Assume that a researcher wants to investigate changes in job satisfaction by comparing employee satisfaction before and after an organizational intervention. Given that the researcher is likely to administer the same job satisfaction survey at both measurement occasions, it might be expected that the IRs of the satisfaction items would correlate across time. The freeing of IR correlations in such a situation is perhaps less problematic than in any other situation because the freeing of residual correlations for measures distinguished only by time implies nothing more than that the model, like all models, fails to specify all causes of the variable in question. Indeed, the researcher can theorize a priori which residuals should be correlated. Thus, it is not likely that the researcher is capitalizing on chance because the correlated residuals are theory-driven and not data-driven based on post hoc model modifications. In addition, given that measurement was conducted on two occasions, the residual variance can be partitioned into random and systematic error. In contrast, residual variance cannot be partitioned in a cross-sectional design unless some measures of the systematic error were collected (e.g., the omitted variable), in which case there would be no need for post hoc modification because the source of correlated residuals is included in the model.

Although correlated IRs among identical measures separated by time are defensible, our own previously mentioned cursory review of the literature suggested that the more common application of correlated IRs is in post hoc modification based on computer output. Cortina (2002) discussed two reasons that could explain correlated IRs discovered in a post hoc analysis: sampling error and omitted variables. Cortina suggested that if the only reason for correlated residual terms were sampling error, then these correlations should be fixed at zero. Given the complexity of SEM, the modification indices will almost always reveal ways to obtain better fit. The risk involved with this data-driven approach is that the modifications may result from chance characteristics in that particular sample and will not generalize to the population (Cliff, 1983; MacCallum et al., 1992).
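A rough calculation shows why modification indices will almost always suggest something to free. The sketch below is a simplified illustration, not a result from the chapter: it assumes that each candidate residual covariance is tested at the .05 level and that the tests are independent, which real modification indices are not. The function names are hypothetical.

```python
def num_residual_covariances(num_indicators: int) -> int:
    """Number of distinct indicator pairs, i.e., candidate residual covariances."""
    return num_indicators * (num_indicators - 1) // 2


def prob_at_least_one_false_positive(num_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of num_tests independent null tests is significant."""
    return 1.0 - (1.0 - alpha) ** num_tests


for k in (6, 16):
    pairs = num_residual_covariances(k)
    prob = prob_at_least_one_false_positive(pairs)
    print(f"{k} indicators: {pairs} candidate residual covariances, "
          f"P(at least one spurious flag) = {prob:.2f}")
```

Even for a 6-item scale there are 15 candidate residual covariances, so under these simplifying assumptions the chance of at least one spuriously significant index already exceeds one half when every population residual covariance is zero.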
More problematic is the issue of correlated residuals due to omitted variables, because allowing residual correlations does not recover the omitted variable, which is indicative of a design flaw or inaccurate theory. As an example, consider a case in which a test is constructed to measure overall job satisfaction. The 6-item scale includes the items from Table 8.1. Table 8.2 contains hypothetical modification indices associated with freeing parameters in the theta-delta matrix. As shown in Table 8.2, the modification indices for the interrelationships among items 4, 5, and 6 are quite high and statistically significant. Thus, allowing correlations between the IR terms attached to items 4, 5, and 6 would substantially improve model fit.

Table 8.1 Sample Items From a Measure of Job Satisfaction
1. Everything else equal, my job is better than most.
2. I like doing the things I do at work.
3. My job is enjoyable.
4. I enjoy my coworkers.
5. My supervisor is unfair to me.
6. The benefits we receive are as good as most other organizations offer.
Note. Some of these items were adapted from a measure of job satisfaction developed by Spector (1985).

Table 8.2 Hypothetical Values for the Modification Indices for Residuals for the 6-Item Measure of Job Satisfaction
Item #    1       2       3       4       5
1         -
2         0.64    -
3         1.83    1.21    -
4         0.55    0.60    0.88    -
5         0.40    0.10    0.28    4.91    -
6         0.96    0.08    0.49    8.29    11.57

But are these modifications defensible? The answer, despite urban legends and/or common practice, is no. This hypothetical scale was designed to measure overall job satisfaction, which implies that most of the shared variance should be explained by a single latent variable. The residuals may be correlated for any number of reasons. One reason for the correlated residuals might be that items 1, 2, and 3 measure satisfaction with the job in general whereas items 4, 5, and 6 refer to more specific aspects of the job. These specific aspects may be correlated with one another, and these correlations, coupled with the fact that the specific aspects themselves are omitted from the model, would produce correlations among the residuals for items 4–6. For example, justice perceptions may explain correlated residuals between items 5 and 6 because these items refer to fairness of supervisor and benefits. On the other hand, social exchange theory might explain the correlated residuals between items 4 and 5 because they refer to satisfaction with coworkers and supervisors that might result from a social exchange process. The point is, there may be several “theoretical” explanations for the correlated residuals, but the underlying cause can never be known without including a measure of the putative cause. Allowing the researcher to free IR correlations amounts to rewarding him/her for a bad research design and/or theory.

Problems With Correlated Residuals

Not surprisingly, some researchers in the previously described review of the literature (e.g., Henderson, Berry, & Matic, 2007; Salanova, Agut, & Peiró, 2005) who allowed IRs to covary did so on the basis of modification indices to improve model fit rather than a priori expectations. Post hoc modifications of this type are not new to the literature. For example, Fornell (1983) and Breckler (1990) cited several studies that freed linkages between IRs without recognizing the transition from theory-driven research to data-driven research (e.g., Bagozzi, 1981; Bearden & Mason, 1980; Newcomb, Huba, & Bentler, 1986; Reilly, 1982). There are several problems associated with allowing residuals to correlate based on post hoc indices of model fit. One problem is that empirical analyses do not allow one to determine which explanation applies in any given instance. The most critical issue is that respecification based on modification indices results in capitalization on chance (MacCallum et al., 1992). As a result, the modified model may reflect idiosyncrasies in the sample data and may not hold up under cross-validation because the correlated residuals are zero in the population. As soon as models are changed using information from modification indices, the process stops being confirmatory (theoretically driven) and becomes, to some degree, exploratory (data driven). Cliff (1983) pointed this out and cautioned that given the complexity of covariance structure models and the nature of correlational data, particular modifications are likely due to idiosyncrasies in the observed data.
As a result, models constructed around these indices would not likely generalize to the population. Indeed, research shows that model modifications based on specification searches rarely uncover the correct model (e.g., MacCallum et al., 1992; Lance et al., 1988). As noted by MacCallum et al. (1992), modifications based on sample-specific characteristics are problematic because the population value is zero and the results will not generalize to other samples or the population. They go on to point out that several factors are likely to influence the extent to which chance is a likely explanation.

First, the smaller the sample size, the more likely results are influenced by sampling error. For instance, MacCallum et al. (1992) provided evidence for the instability of post hoc modifications in two large datasets to support this point, even in samples as large as 300–400 participants.

Second, the number of modifications made is likely to play a role. A large number of modifications increases the likelihood of capitalization on chance relative to a few modifications that are made early in the sequential ordering (Green, Thompson, & Poirier, 2001). Relatedly, post hoc model modifications in SEM are conducted without statistical protection from familywise error (Steiger, 1990). Some recent work (e.g., Green, Thompson, & Babyak, 1998; Green, Thompson, & Poirier, 2001; Hancock, 1999) may help address this particular problem. Specifically, Green et al. (2001) provide an adjusted Bonferroni method and Hancock (1999) provides a Scheffé-type test to control for Type I error in post hoc specification searches. Although familywise corrections for Type I error help to statistically control for misspecifications due to sampling error, they do not directly address the problem of correlated residuals. Notably, if residuals are correlated due to sampling error, they should be fixed to zero (Cortina, 2002), and if they are correlated due to an omitted variable, no amount of protection will recover the omitted variable.

Third, the interpretability of modifications is affected by sequence. For instance, Bollen (1990) pointed out that the pattern of freed parameters or the order in which parameters are freed can affect the modification indices and expected parameter changes of the remaining fixed parameters.
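To make the familywise-error point concrete, the sketch below applies a plain Bonferroni correction to the hypothetical modification indices from Table 8.2. This is a simpler stand-in for illustration only: it is not the adjusted Bonferroni method of Green et al. (2001) or the Scheffé-type test of Hancock (1999). It assumes each modification index is referred to a chi-square distribution with 1 df, for which the tail probability equals erfc(sqrt(MI / 2)).

```python
import math

# Hypothetical modification indices from Table 8.2, keyed by item pair.
MOD_INDICES = {
    (1, 2): 0.64, (1, 3): 1.83, (2, 3): 1.21,
    (1, 4): 0.55, (2, 4): 0.60, (3, 4): 0.88,
    (1, 5): 0.40, (2, 5): 0.10, (3, 5): 0.28, (4, 5): 4.91,
    (1, 6): 0.96, (2, 6): 0.08, (3, 6): 0.49, (4, 6): 8.29, (5, 6): 11.57,
}


def p_value_1df(mod_index: float) -> float:
    """Tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(mod_index / 2.0))


def flagged_pairs(mod_indices, alpha=0.05, bonferroni=False):
    """Item pairs whose modification index is significant at the chosen level."""
    threshold = alpha / len(mod_indices) if bonferroni else alpha
    return sorted(pair for pair, mi in mod_indices.items()
                  if p_value_1df(mi) < threshold)


unadjusted = flagged_pairs(MOD_INDICES)
adjusted = flagged_pairs(MOD_INDICES, bonferroni=True)
print("significant at .05, unadjusted:", unadjusted)
print("significant at .05 / 15 (Bonferroni):", adjusted)
```

Without adjustment, all three pairs among items 4, 5, and 6 are flagged. With the per-test level set to .05 / 15, only the pair of items 5 and 6 remains. As the text stresses, such protection addresses sampling error only; it says nothing about which omitted variable, if any, produced the remaining residual covariance.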
Thus, changes in one part of the model can affect the other parameters in unintended or unknown ways. In addition, some combinations of freed parameters may be closer to the population model than models produced using the stepwise procedure for post hoc modification most often recommended (Bollen, 1990). The most consistent recommendation is that modifications that cannot be substantively or theoretically justified are to be avoided (e.g., Cortina, 2002; MacCallum, 1986; MacCallum et al., 1992).

Importantly, correlated IRs based on post hoc modification may actually mask the underlying structure of the data. For instance, Gerbing and Anderson (1984) provided illustrative examples that demonstrated that the addition of a single correlated IR term in one model resulted in fit that was nearly identical to another model demonstrating a second-order factor structure in the same data. Gerbing and Anderson (1984) posited that the acceptance of the first-order model with correlated IRs is inappropriate because the desired latent variable (the second-order factor) was not operationalized, and the two-factor (first-order) model is not representative of the population model. More generally, if the initial model is misspecified, modification indices will not necessarily lead researchers to the population model.

Recommendations

In almost all instances of unexpected IR covariance, there is no theoretically defensible reason for allowing IRs to correlate based on post hoc modifications. Thus, the best solution in these instances is to form a hypothesis about the reason for the correlated IRs and to collect new data that test this hypothesis (e.g., Hayduk, 1990). If correlated IRs are due to sampling error, then they will likely not be present in a cross-validation sample and they should be ignored for hypothesis testing. If the residual covariances are due to an omitted variable, then it is imperative to identify the missing variable(s), collect data from a second sample, and test the hypothesis that the omitted variable accounted for the correlated IR. If a researcher has evidence to indicate that a potential unmeasured variable is responsible for poorer than expected model fit, the appropriate solution is not to allow for correlations between IR terms. Although reasonable post hoc explanations can be constructed in many such situations, performing such modifications is tantamount to rewarding poor scale construction and/or model development. In addition, if a researcher can provide a strong justification for allowing correlated IRs, it is reasonable to ponder why the parameter was not represented in the original model (MacCallum et al., 1992).
Concurring with Cortina (2002), we also suggest that the practice of allowing correlations between IRs in certain situations may still proceed cautiously and only when a strong a priori reason exists for doing so. For example, in the case of longitudinal data with identical measures across time periods, it is impossible to avoid the expectation that residuals attached to identical indicators separated only by time will correlate. It is also reasonable to allow for correlated IRs associated with indicators that share components. For example, in the case where a researcher computes cross-product terms to model interactions, residuals between some of these terms would certainly be related insofar as they share the same indicators. In either case, however, the researcher knows going into model testing that such relationships are likely to exist and can model them from the outset, thus rendering the use of post hoc modifications to IRs moot. This recommendation obviously puts a premium on model development and research design and does not reward researchers who, for reasons of either ignorance or expedience, build models based on modification indices.

Despite the continued warnings about specification searches in general and allowing correlated IRs in particular, researchers will still be motivated to allow correlated IRs based on modification indices. Thus, we present two additional recommendations that might salvage measurement models and/or theory testing. These recommendations include (a) elimination of problematic items and (b) estimation of the structural model only (i.e., path analysis). The solution to eliminate problematic items is based on Anderson and Gerbing’s (1988) recommendation that the first step in latent variable modeling is to create strong, unidimensional measurement models. Only after the researchers have cleaned up the measurement models would they proceed to estimate structural parameters. Assuming there are enough items to do so, omitting items with correlated residuals would eliminate the need to allow correlated residuals in a post hoc specification search simply to improve model fit. Returning to the earlier job satisfaction example, items 4, 5, and 6 might be omitted to improve model fit because the first three items are sufficient indicators of overall job satisfaction. Our second suggestion is to omit the measurement model altogether and simply create scale scores and estimate a path analysis with manifest variables.
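Forming scale scores for a path analysis presupposes that each measure is internally consistent. One common check is coefficient alpha; the sketch below is a minimal illustration under stated assumptions. The function name, the choice of alpha as the index, and the six respondents' scores on three items are all hypothetical and are not taken from the chapter.

```python
def sample_variance(values):
    """Unbiased sample variance (divides by n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows):
    """Coefficient alpha for a respondents-by-items table of scores."""
    k = len(rows[0])                    # number of items
    items = list(zip(*rows))            # transpose to one tuple of scores per item
    totals = [sum(row) for row in rows] # scale score per respondent
    item_var_sum = sum(sample_variance(col) for col in items)
    return (k / (k - 1)) * (1.0 - item_var_sum / sample_variance(totals))


# Hypothetical responses: six respondents, three general satisfaction items.
responses = [
    [4, 5, 4],
    [3, 3, 4],
    [5, 5, 5],
    [2, 3, 2],
    [4, 4, 5],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
scale_scores = [sum(row) / len(row) for row in responses]  # mean of items per respondent
print(f"coefficient alpha = {alpha:.2f}")
print("scale scores:", [round(s, 2) for s in scale_scores])
```

The scale scores, rather than the individual items, would then enter the path model as manifest variables. A high alpha does not establish unidimensionality, so this check complements rather than replaces the measurement-model work described above.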
This suggestion, however, is predicated on the assumption that the psychometric characteristics for each measure are sufficient to warrant calculating scale scores for the variables (e.g., sufficient internal consistency). Both suggestions apply to hypothesis testing involving well-established constructs in which the structural parameters are of primary interest because both suggestions mask problems with the measurement models. Finally, these suggestions are viable only to the extent that the correlated IRs are the result of sampling error. Of course, none of these suggestions is ideal, but the advantage to each is the preservation of the theory-driven process of hypothesis testing. If misspecification is due to omitted variables, only a new study that includes all relevant variables in the theoretical model will address the problem.

We should note that cross-validation in a second sample of a model developed through exploratory post hoc model modification is not necessarily the correct course of action. Cross-validation does not guarantee that the correct model has been uncovered, because countless different models may fit the data and the majority of these will be incorrect (e.g., Lee & Hershberger, 1990; MacCallum et al., 1993). The focus should always be on developing and testing strong theory to exclude equivalent models on logical grounds (Jöreskog, 1993).

Summary and Conclusions

We have offered several potential sources for the urban legend that allowing for correlated IRs is an appropriate practice in the use of SEM. Though there is no clear, definitive source for this practice, several sources that encouraged the use of stepwise procedures for model testing may have led researchers to believe that post hoc modification is acceptable. Thus, it may be that articles that urge caution about the dangers of model modification, coupled with detailed explanations of how to conduct specification searches, send a mixed message to researchers in the face of seeming ambiguity. There may be a motivation (and justification) to modify the model in a post hoc fashion to obtain good fit to the data (Cole, Ciesla, & Steiger, 2007). Indeed, it is reasonable to assume that many researchers allow correlated IRs because these do not threaten core features of their hypothesized models. However, as Tomarken and Waller (2003) pointed out, this reasoning is flawed because of the possibility of omitted variables and the influence of correlated IRs on other parameters in the model.
We have also demonstrated that the uncritical use of such methods can lead to untenable conclusions regarding model fit and have argued that such practices should be discouraged and abandoned. Importantly, there are situations in which correlated IRs may be appropriately estimated (e.g., longitudinal studies in which the same indicator is used in repeated measurements). Outside of these unique situations, however, the practice of allowing for correlated IRs should cease and desist.
References

Ajzen, I. (2002). Constructing a TpB questionnaire: Conceptual and methodological considerations. Retrieved August 8, 2007, from http://www-unix.oit.umass.edu/~aizen/pdf/tpb.measurement.pdf
Ajzen, I., & Fishbein, M. (1980). Understanding attitudes and predicting social behavior. Englewood Cliffs, NJ: Prentice-Hall.
Anderson, J. C., & Gerbing, D. W. (1988). Structural equation modeling in practice: A review and recommended two-step approach. Psychological Bulletin, 103, 411–423.
Bagozzi, R. P. (1981). Attitudes, intentions, and behavior: A test of some key hypotheses. Journal of Personality and Social Psychology, 41, 607–627.
Bearden, W. O., & Mason, B. J. (1980). Determinants of physician and pharmacist support of generic drugs. Journal of Consumer Research, 7, 121–130.
Bollen, K. A. (1989). Structural equations with latent variables. New York: Wiley.
Bollen, K. A. (1990). A comment on model evaluation and modification. Multivariate Behavioral Research, 25, 181–185.
Breckler, S. J. (1990). Overall fit in covariance structure models: Two types of sample size effects. Psychological Bulletin, 107, 256–259.
Byrne, B. M. (1994). Structural equation modeling with EQS and EQS/Windows: Basic concepts, applications, and programming. Thousand Oaks, CA: Sage.
Cliff, N. (1983). Some cautions concerning the application of causal modeling methods. Multivariate Behavioral Research, 18, 115–126.
Cole, D. A., Ciesla, J. A., & Steiger, J. H. (2007). The insidious effects of failing to include design-driven correlated residuals in latent-variable covariance structure analysis. Psychological Methods, 12, 381–398.
Cortina, J. M. (2002). Big things have small beginnings: An assortment of “minor” methodological misunderstandings. Journal of Management, 28, 339–362.
Costner, H. L., & Schoenberg, R. (1973). Diagnosing indicator ills in multiple indicator models. In A. S. Goldberger & O. D. Duncan (Eds.), Structural equation models in the social sciences (pp. 167–199). New York: Seminar Press.
Fornell, C. (1983). Issues in the application of covariance structure analysis: A comment. Journal of Consumer Research, 9, 443–448.
Gerbing, D. W., & Anderson, J. C. (1984). On the meaning of within-factor correlated measurement errors. Journal of Consumer Research, 11, 572–580.
Green, S. B., Thompson, M. S., & Babyak, M. A. (1998). A Monte Carlo investigation of methods for controlling Type I errors with specification searches in structural equation modeling. Multivariate Behavioral Research, 33, 365–384.
Green, S. B., Thompson, M. S., & Poirier, J. (2001). An adjusted Bonferroni method for elimination of parameters in specification addition searches. Structural Equation Modeling, 8, 18–39.
Hancock, G. R. (1999). A sequential Scheffé-type respecification procedure for controlling Type I error in exploratory structural equation model modification. Structural Equation Modeling, 6, 158–168.
Hayduk, L. A. (1990). Should model modifications be oriented toward improving data fit or encouraging creative and analytical thinking? Multivariate Behavioral Research, 25, 193–196.
Henderson, N. D., Berry, M. W., & Matic, T. (2007). Field measures of strength and fitness predict firefighter performance on physically demanding tasks. Personnel Psychology, 60, 431–473.
Jöreskog, K. G. (1993). Testing structural equation models. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models. Newbury Park, CA: Sage.
Kaplan, D. (1989). Model modification in covariance structure analysis: Application of the expected parameter change statistic. Multivariate Behavioral Research, 24, 41–57.
Kaplan, D. (1990a). Evaluating and modifying covariance structure models: A review and recommendation. Multivariate Behavioral Research, 25, 137–155.
Kaplan, D. (1990b). A rejoinder on evaluating and modifying covariance structure models. Multivariate Behavioral Research, 25, 197–204.
Kelloway, E. K. (1998). Using LISREL for structural equation modeling. Thousand Oaks, CA: Sage.
Lance, C. E., Cornwell, J. M., & Mulaik, S. A. (1988). Limited information parameter estimates for latent or mixed manifest and latent variable models. Multivariate Behavioral Research, 23, 171–187.
Lee, S., & Hershberger, S. (1990). A simple rule for generating equivalent models in covariance structure modeling. Multivariate Behavioral Research, 25, 313–334.
Loehlin, J. C. (2004). Latent variable models: An introduction to factor, path, and structural equation analysis (4th ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Long, J. S. (1983). Covariance structure models: An introduction to LISREL. Beverly Hills, CA: Sage.
MacCallum, R. (1986). Specification searches in covariance structure modeling. Psychological Bulletin, 100, 107–120.
MacCallum, R., Roznowski, M., & Necowitz, L. B. (1992). Model modifications in covariance structure analysis: The problem of capitalization on chance. Psychological Bulletin, 111, 490–504.
MacCallum, R. C., Wegener, D. T., Uchino, B. N., & Fabrigar, L. R. (1993). The problem of equivalent models in applications of covariance structure analysis. Psychological Bulletin, 114, 185–199.
Marcoulides, G. A., & Drezner, Z. (2001). Specification searches in structural equation modeling with a genetic algorithm. In G. A. Marcoulides & R. E. Schumacker (Eds.), New developments and techniques in structural equation modeling (pp. 247–268). Mahwah, NJ: Lawrence Erlbaum Associates.
Maruyama, G. M. (1998). Basics of structural equation modeling. Thousand Oaks, CA: Sage.
Mulaik, S. A., James, L. R., Van Alstine, J., Bennett, N., Lind, S., & Stilwell, C. D. (1989). Evaluation of goodness-of-fit indices for structural equation models. Psychological Bulletin, 105, 430–445.
Newcomb, M. D., Huba, G. J., & Bentler, P. M. (1986). Determinants of sexual and dating behaviors among adolescents. Journal of Personality and Social Psychology, 50, 56–66.
Reddy, S. K. (1992). Effects of ignoring correlated measurement error in structural equation models. Educational and Psychological Measurement, 52, 549–570.
Reilly, M. D. (1982). Working wives and convenience consumption. Journal of Consumer Research, 8, 407–418.
Salanova, M., Agut, S., & Peiró, J. M. (2005). Linking organizational resources and work engagement to employee performance and customer loyalty: The mediation of service climate. Journal of Applied Psychology, 90, 1217–1227.
Saris, W. E., Satorra, A., & Sörbom, D. (1987). The detection and correction of specification errors in structural equation models. In C. C. Clogg (Ed.), Sociological methodology (pp. 105–129). San Francisco: Jossey-Bass.
Saris, W. E., & Stronkhorst, L. H. (1984). Causal modeling in nonexperimental research: An introduction to the LISREL approach. Amsterdam: Sociometric Research Foundation.
Sörbom, D. (1989). Model modification. Psychometrika, 54, 371–384.
Spector, P. E. (1985). Measurement of human service staff satisfaction: Development of the job satisfaction survey. American Journal of Community Psychology, 13, 693–713.
Steiger, J. H. (1990). Structural model evaluation and modification: An interval estimation approach. Multivariate Behavioral Research, 25, 173–180.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston: Allyn & Bacon.
Tomarken, A. J., & Waller, N. G. (2003). Potential problems with “well fitting” models. Journal of Abnormal Psychology, 112, 578–598.
Tucker, L. R., & Lewis, C. (1973). The reliability coefficient for maximum likelihood factor analysis. Psychometrika, 38, 1–10.
Vandenberg, R. J., & Grelle, D. M. (2009). Alternative model specifications in structural equation modeling: Facts, fictions, and truth. In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 165–191). New York: Routledge/Psychology Press.
Part 2 Methodological Issues
9
Qualitative Research: The Redheaded Stepchild in Organizational and Social Science Research?
Lillian T. Eby, Carrie S. Hurst, and Marcus M. Butts
There’s no such thing as qualitative data. Everything is either 1 or 0. (Fred Kerlinger)
All research ultimately has a qualitative grounding. (Donald Campbell, as cited in Miles & Huberman, 1994, p. 40)
One of the fiercest methodological debates in the organizational and social sciences involves the relative merit of qualitative versus quantitative research. Proponents of qualitative research make strong claims about the strengths of their approach, including greater ecological validity, richer and more descriptive accounts of real-world events, and greater ability to uncover processes and mechanisms in natural settings (Kidd, 2002; Lee, Mitchell, & Sablynski, 1999; Maxwell, 2004; Miles, 1979; Ratner, 1997; Van Maanen, 1979). Those in the quantitative research camp tout the advantages of their approach, discussing strengths such as precision of measurement, experimental control, and generalizability (Aluko, 2006; Cook & Campbell, 1979). This debate is neither new nor restricted to the organizational and social sciences. Other fields of inquiry, such as sociology and anthropology, have waged similar debates. For an in-depth discussion of the division between qualitative and quantitative traditions in the social sciences, the interested reader is referred to Cook and Reichardt (1978) or Lincoln and Guba (1985).
Notwithstanding this debate, the social and organizational sciences are dominated by quantitative research (Patton, 1991; Van Maanen, 1979). Qualitative methods are rarely offered in graduate research methods curricula (Cassell & Symon, 1994), and qualitative research does not frequently appear in mainstream, high-impact social science journals. For instance, Kidd (2002) found that the overall publication rate for qualitative research was 1% in 15 American Psychological Association (APA) journals and 33% of these journals had never published a single qualitative study. Likewise, in a review of 454 articles published in 10 APA journals, Munley et al. (2002) found that the vast majority (98%) were quantitative. Although some scholars are cautiously optimistic about the incorporation of qualitative research into the social and organizational sciences (e.g., Lee, 1999), others are less so (e.g., Cassell & Symon, 1994). Irrespective of these differing views, qualitative researchers are in agreement that their work is misunderstood, underappreciated, and devalued by quantitative researchers (e.g., Jick, 1979; Luthans & Davis, 1982; Maxwell, 2004; Miles & Huberman, 1994; Van Maanen, 1979).

In an effort to remove the “redheaded stepchild” stigma associated with qualitative research, the present chapter pursues two specific objectives. The first objective is to identify and clearly state the deeply rooted and pervasive beliefs that quantitative researchers hold about qualitative research. For example, one belief is that qualitative approaches to research do not utilize the scientific method. The second objective is to critically examine each identified belief to uncover both the kernels of truth and the misconceptions (myths).
To do so, we utilize existing scholarship and the results of an original review of 241 purely qualitative and mixed-method (qualitative and quantitative) articles published in the top 9 journals in the fields of applied psychology, management, and social psychology, based on the 2004 Journal of Citation Reports. Our ultimate goal is to educate quantitative researchers on the actual (rather than assumed) characteristics of qualitative research that appears in high-impact mainstream journals. Our chapter first defines qualitative research and discusses its philosophical underpinnings. ἀ is segues into a parallel discussion of quantitative research, because the qualitative-quantitative schism stems in large part from fundamental philosophical differences between the two approaches. ἀ en, we outline the beliefs associated with qualitative research. ἀ is discussion is followed by a review
Qualitative Research
of each belief using data from our original review of the literature where appropriate. The chapter closes with a discussion of the future of qualitative research in the social and organizational sciences.
Definitional Issues
Qualitative research has a rich history in a wide range of disciplines including interpretive sociology, anthropology, human geography, history, education, women’s studies, and to some extent psychology (Locke & Golden-Biddle, 2002; Mason, 1996). There are many approaches to qualitative research such as ethnography, case study, and action research (Locke & Golden-Biddle, 2002; Miles & Huberman, 1994). Moreover, qualitative researchers employ a wide range of data collection techniques including focus groups, textual analysis, interviews, and participant observation (see Bachiochi & Weiner, 2002; Lee, 1999; Miles & Huberman, 1994). The substantial variability in research approaches and data collection techniques makes it difficult to precisely define qualitative research. Nonetheless, it may best be viewed as “an umbrella term covering an array of interpretive techniques which seek to describe, decode, translate, and otherwise come to terms with the meaning, not the frequency, of certain more or less naturally occurring phenomena in the social world” (Van Maanen, 1979, p. 520). Qualitative research has several defining features, which include (a) investigating phenomena in their natural setting, (b) collecting and analyzing either written or spoken text or observing behavior, (c) explicitly considering the context in which a phenomenon exists, (d) accepting the subjectivity inherent in understanding research participants’ perspectives, (e) studying ordinary behavior, and (f) imposing less structure or a priori classifications on data and demonstrating more interest in idiographic description and emergent themes (Cassell & Symon, 1994; Locke & Golden-Biddle, 2002; Luthans & Davis, 1982; Van Maanen, 1979).
These defining features indicate that there is much more than a “numbers–no numbers” distinction between qualitative and quantitative research (Bachiochi & Weiner, 2002, p. 162). In fact, just like quantitative research, qualitative studies typically use data reduction and interpretation techniques. These techniques might include clustering observations into higher-order categories, factoring textual information to
Lillian T. Eby, Carrie S. Hurst, and Marcus M. Butts
identify common word elements, and writing summaries. The qualitative researcher may also use frequency counts, comparisons and contrasts of categorized data, and visual inspection of the data to identify relationships between broad categories of variables (Miles & Huberman, 1994). Furthermore, some qualitative research uses standard statistical techniques such as the chi-square or t test to examine a priori or post hoc relationships among study variables. Moreover, numerous computer programs are available for testing propositions based on qualitative data (Miles & Huberman, 1994).
Philosophical Differences in Qualitative and Quantitative Research
Qualitative research can be best understood by examining its ontological and epistemological assumptions. Ontology is the branch of philosophy that focuses on what exists in the world around us. The ontological stance of qualitative research is that reality (what exists) is in the mind. This means that the generation of knowledge depends on figuring out what is in the mind of research participants (Heppner, Kivlighan, & Wampold, 1999; Patton, 1991). Epistemology is the branch of philosophy that is concerned with the origin of knowledge or how we know what we know. Outlining the epistemological foundations of qualitative research is no simple task given the diversity of perspectives within the qualitative research community. Moreover, it is beyond the scope of this chapter to discuss the different philosophical stances adopted by qualitative researchers. The interested reader is referred to Locke and Golden-Biddle (2002) for a historical overview and Heppner et al. (1999) for a comprehensive review of the topic. For the purpose of this chapter we discuss the philosophical position held by many qualitative researchers, particularly those within the organizational and social sciences.
Although this certainly simplifies the discussion of qualitative researchers’ philosophical orientation to research, it demonstrates the fundamental differences between qualitative and quantitative approaches and sets the stage for a more detailed discussion of the beliefs that quantitative researchers hold about qualitative research. Generally speaking, qualitative researchers take a constructivist epistemological stance and believe that the origin of knowledge is the individual. This means that an individual’s subjective interpretation of a situation, event, or experience represents knowledge. The qualitative researcher is an interpreter and the research participant is an active participant in the creation of knowledge. The goal of qualitative research is in-depth description and a rich, contextually embedded understanding from the perspective of the individual (Heppner et al., 1999; Kidd, 2002; Patton, 1991). Therefore, rather than examine the “average individual’s” perception or experience as quantitative research does, qualitative research explores individuals’ idiosyncratic experiences and views this variability as meaningful, rather than as a source of error variance. The philosophical approach associated with qualitative research stands in sharp contrast to the positivistic orientation held by quantitative researchers (Kidd, 2002; Kuhn, 1962). From the positivistic ontological position, reality (what exists) consists of physical objects and processes. The epistemological assumption of positivists is that the origin of knowledge (how we know what we know) is through objective reality. Within the positivistic tradition the primary methodology is experimentation to isolate cause-and-effect relationships, and the goal of science is objectivity, prediction, and replication (Cook & Campbell, 1979; Kidd, 2002; Lee et al., 1999).
Quantitative and Qualitative Conceptualizations of Validity
Both quantitative and qualitative approaches to research discuss the importance of validity (e.g., compare Cook & Campbell, 1979, to Maxwell, 1992). Each approach is concerned about the credibility or accuracy of the data collected, the transferability or generalizability of the findings, and the extent to which the relationships among various constructs provide a complete and accurate representation of some phenomenon. Furthermore, both approaches hold that eliminating alternative explanations is an important goal of science.
Quantitative researchers attempt to do so by using experimentation and control variables in correlation-based models. Qualitative researchers try to rule out spurious relations by discussing research findings with knowledgeable but detached colleagues, carefully considering rival explanations for the findings, and considering new displays of the data that provide a clean look at third variables and their potential effect on the data (Miles & Huberman, 1994).
Note. There are important philosophical differences between positivistic and postpositivistic orientations (Heppner et al., 1999). However, because postpositivism adheres to the same basic epistemological tenets as positivism, we refer only to the positivist perspective as a point of comparison for qualitative research.
However, qualitative and quantitative conceptualizations of validity differ owing to the philosophical differences just discussed. Quantitative researchers aim to discover general laws of human behavior, which can then be applied to other situations and persons. This assumes that there is some objective reality that can be measured, quantified, and reduced to its component parts (Patton, 1991). Because quantitative approaches grew out of the physical sciences, there is an emphasis on using methods that allow for quantification of data and are (presumably) free from researcher bias (Heppner et al., 1999). Thus, positivistic notions of validity focus on precision of measurement, demonstrated through reliability evidence and traditional approaches to the establishment of content, construct, and criterion-related validity (Cook & Campbell, 1979). In contrast, qualitative researchers do not view validity in terms of specific procedures and methods. Rather, validity refers to whether the account of a phenomenon reflects the participants’ lived experience. Moreover, the researcher does not impose his or her meaning on these experiences by artificially constraining the participant’s account through standardized questions or other procedures designed to increase the precision of measurement (Maxwell, 1992). For qualitative researchers, the authenticity of the data collected, and the depth of understanding that results, trumps the discovery of general laws and precise measurement (Guba & Lincoln, 1989; Maxwell, 1992). Moreover, qualitative research does not operate under the assumption that there is one objective account or one correct answer.
Rather, the same event or situation may be interpreted differently from individual to individual, and this variability is not a source of error but rather is meaningful in its own right (Maxwell, 1992). Moreover, qualitative researchers are often critical of the quantitative researcher’s emphasis on discovering general laws of human nature and fitting facts into existing laws. By focusing on patterns of regularity, quantitative research does not emphasize the underlying processes linking various phenomena, the individual meaning ascribed to events and situations, or the context in which natural events occur (Maxwell, 2004). It is against these conceptualizations of validity that the differences between these two research camps fall into sharp relief.
Caveats and Assumptions
As noted previously, there is a wide range of approaches and techniques associated with qualitative inquiry, each with its own unique intellectual history (Locke & Golden-Biddle, 2002; Morgan & Smircich, 1980; Patton, 1991). As such, our discussion of qualitative research is necessarily general. We also recognize that the debate among scientists and philosophers about the appropriate way to describe, understand, explain, and predict the world around us is over 400 years old. We do not expect that the present chapter will resolve this deep-seated and hotly contested debate. Rather, our goal is to drill down to the underlying beliefs associated with qualitative research and identify both the kernels of truth and the myths associated with these beliefs. Finally, the identified beliefs are held by quantitative researchers and reflect positivistic notions of good science. Although generally agreed upon by quantitative researchers, this perspective is neither the only way to characterize good science nor a standard that is adopted by qualitative researchers. Some scholars reject the application of positivistic evaluative standards to qualitative research (e.g., Kidd, 2002; Kvale, 1996; Maxwell, 2004), whereas others argue that positivistic notions of methodological rigor can and should be applied to qualitative research (e.g., Bachiochi & Weiner, 2002; Lee, 1999; Yin, 1994). Because the aim of the present chapter is to identify the beliefs held by quantitative researchers, we frame the identified beliefs using positivistic standards of methodological rigor.
Beliefs Associated With Qualitative Research
Belief #1: Qualitative Research Does Not Utilize the Scientific Method
The scientific method dominates the social and organizational sciences. It is rooted in the natural science model of inquiry and adopts a specific research approach (Heppner et al., 1999).
Broadly speaking, the steps associated with the scientific method include (a) observation and description of some phenomenon, (b) the formulation and statement of a hypothesis or set of hypotheses about that phenomenon, and (c) hypothesis testing whereby the hypothesis of interest
is empirically examined (Aguinis, 1993). Essential to the scientific method is the collection of data through observation and experimentation to generate knowledge (http://www.m-w.com/dictionary). The belief that qualitative research does not utilize the scientific method is probably due in part to the fact that the scientific method is less frequently discussed outright in qualitative research (Van Maanen, 1979). Also, qualitative research often takes an inductive approach whereby hypotheses are not identified a priori but rather emerge from the findings (e.g., Glaser & Strauss, 1967).
Belief #2: Qualitative Research Lacks Methodological Rigor
Closely related to the first belief is the belief that qualitative research lacks methodological rigor (Lee et al., 1999; Maxwell, 1992, 2004). Methodological soundness is the hallmark of positivism and is typically viewed in terms of the extent to which research findings approximate the true relationships among the variables of interest (Cook & Campbell, 1979). The positivistic concept of validity captures the essence of methodological rigor and includes the following types of evidence: internal validity, construct validity, and external validity (Cook & Campbell, 1979; McGrath, 1982; Scandura & Williams, 2000). Because the concept of validity is central to how qualitative research is viewed by the quantitative community (Maxwell, 2002), specific beliefs related to each of the three types of validity evidence are discussed below.
Belief #2a: Qualitative Research Lacks Internal Validity
Internal validity refers to the extent to which there is a cause-and-effect relationship between two or more variables, as well as the degree to which alternative explanations for observed effects can be ruled out (Cook & Campbell, 1979; Sackett & Larson, 1990). The oft-cited criticism that we just cannot trust the findings from qualitative research because of researcher bias reflects this belief (Kidd, 2002; Lee et al., 1999).
Researcher bias is believed to come into play in two primary ways. First, the qualitative researcher is not a passive, objective observer and documenter of facts (Aluko, 2006; Kidd, 2002). Rather, he or she is an active participant in the research process and is heavily involved in the interpretation of events and experiences as
recalled from the participant’s perspective (Kidd, 2002; Van Maanen, 1979). Because of this researcher involvement, positivists often view qualitative research as fatally flawed (Ratner, 1997). Second, unlike its quantitative counterpart, qualitative research lacks widely agreed upon and codified standards for data collection, data reduction, and data analysis. This lack of consensus reinforces the belief that qualitative research lacks objectivity and is fraught with researcher bias (Lee et al., 1999; Miles, 1979). In addition, ruling out alternative explanations is most easily accomplished when one or more variables can be manipulated to ascertain the effect on another variable (Cook & Campbell, 1979). Because qualitative research does not typically employ designs that lend themselves to this, it is assumed to lack internal validity (Aguinis, 1994).
Belief #2b: Qualitative Research Lacks Construct Validity
Construct validity deals with how accurately the constructs of interest are measured (Cook & Campbell, 1979; Stone-Romero, Weaver, & Glenar, 1994). In the positivistic tradition, construct validity is demonstrated through convergent validity, discriminant validity, nomological validation, and the use of previously validated measures (Cronbach & Meehl, 1955; Scandura & Williams, 2000). Qualitative research is viewed as deficient here because precision of measurement is not a high priority (Cassell & Symon, 1994; Lee et al., 1999). In addition, the types of measures used by qualitative researchers (e.g., field notes, textual passages) often cannot be subjected to traditional psychometric evaluation. However, construct validity can also be inferred from the extent to which multiple data sources, methods of data collection, and researchers are involved in the overall research plan (Jick, 1979; McGrath, 1982; Scandura & Williams, 2000).
Such triangulation enhances construct validity because it provides a more holistic assessment of the phenomenon under study and reduces mono-method bias (Campbell & Fiske, 1959; Jick, 1979). In addition, convergence across data sources, methods, and researchers increases one’s confidence in research findings (Jick, 1979; McGrath, 1982). Perhaps because quantitative research in the social and organizational sciences does not perform well in terms of triangulation (Scandura & Williams, 2000), this aspect of construct validity is less often discussed as a limitation of qualitative research.
Belief #2c: Qualitative Research Lacks External Validity
External validity is typically discussed as the extent to which one can infer that the covariation found among two or more variables generalizes across persons, settings, and times (Cook & Campbell, 1979; Sackett & Larson, 1990). This definition assumes that external validity is demonstrated vis-à-vis statistical inference to a larger population, which requires a large, representative sample. Qualitative research is believed to be deficient here because it uses smaller samples, places less (or no) emphasis on representative sampling, and has little or no concern about making probabilistic inferences from a sample to a larger population (Campbell & Stanley, 1963; Larsson, 1993; Luthans & Davis, 1982; Yin, 1994). External validity is also associated with the research setting (McGrath, 1982). Naturalistic settings have greater ecological validity (Lee et al., 1999), meaning that external validity is stronger in field-based settings compared to laboratory settings (McGrath, 1982).
Belief #3: Qualitative Research Contributes Little to the Advancement of Knowledge
Given the previously discussed beliefs, it should not come as a surprise that qualitative researchers often report feeling like second-class citizens (Kidd, 2002; Luthans & Davis, 1982). To this point, Reiss (1979) has this to say about the use of qualitative research in psychology: “The more ‘journalistic’ social science becomes the easier it is for its opponents to dismiss it as non-scientific. This leads to social science being seen as trivial in its results and dangerous in its techniques, making it ‘simultaneously impotent and threatening’” (p. 82). When qualitative approaches are used in the social and organizational sciences, they often are a supplement to quantitative research rather than a stand-alone methodology (Munley et al., 2002). Interviews with chief editors of 10 APA journals confirm this perspective.
Kidd (2002) concluded that editors, when asked about the value of qualitative research, were open to publishing qualitative research “as part of a larger research program that includes quantitative analysis” (p. 133, emphasis added). Moreover, in the Kidd study, only 1 of the 15 mission statements explicitly stated that both qualitative and quantitative methods were appropriate for the
journal. This suggests that qualitative research is viewed by quantitative researchers as less valuable for the advancement of knowledge.
Evaluating the Beliefs Associated With Qualitative Research
In this section each of the aforementioned beliefs is examined in light of published commentary on qualitative research and an original review of the three highest-impact journals from the fields of applied psychology, management, and social psychology. These journals are listed in Table 9.1 (in rank order) according to the 2004 Journal Citation Reports impact ratings. Two journals were excluded and replaced by the next-highest-impact journal. One journal (Academy of Management Review) publishes only reviews and the other (Counseling Psychologist) focuses on topics that are generally outside the domain of industrial/organizational psychology and human resource management/organizational behavior. We then performed a Boolean search for articles published between 1990 and 2005 using the PsycINFO and Business Source Premier databases. The Boolean search included the broad term qualitative as well as terms associated with specific approaches to qualitative research (e.g., ethnography, phenomenology) and specific qualitative data collection/data analysis techniques (e.g., content analysis, participant observation). Two trained coders used a set of agreed-upon definitions to code each article on the study characteristics identified below. Most study characteristics were coded as being either present or absent, with the exception of those noted in Tables 9.2 and 9.3. Coders first read each identified article to determine if it met the criteria of a pure qualitative (used only qualitative methods) or mixed-method (combination of qualitative and quantitative methods) study. Only the 241 studies that met these criteria were retained for further examination. Of these 241 studies, 106 (44.0%) were pure qualitative studies and 135 (56.0%) were mixed-method studies.
Note. A detailed description of the search strategy used to identify articles is available from the first author upon request.
The total number of articles published in each journal from 1990 to 2005 was also recorded and used to determine the overall publication base rate of pure qualitative and mixed-method studies by journal. Consistent with previous research (Kidd, 2002; Munley et al., 2002), the overall publication rate of qualitative and
Table 9.1 Publication Base Rate in Top Three Journals by Discipline (1990–2005)

Journal (total articles)                                    Only Pure Qualitative^a   Only Mixed-Method^a   All Articles^a
Applied Psychology
  1. Journal of Applied Psychology (1,460)                  12 (<1%)                  4 (<1%)               16 (1%)
  2. Human Resource Management (444)                        5 (1%)                    12 (3%)               17 (4%)
  3. Journal of Experimental Psychology: Applied^b (234)    1 (<1%)                   2 (1%)                3 (1%)
Management
  1. Administrative Science Quarterly (355)                 28 (8%)                   17 (5%)               45 (13%)
  2. Management and Information Science Quarterly (375)     20 (5%)                   8 (2%)                28 (7%)
  3. Academy of Management Journal (953)                    18 (2%)                   7 (1%)                25 (3%)
Social
  1. Journal of Personality and Social Psychology (2,773)   4 (<1%)                   42 (2%)               46 (2%)
  2. Personality and Social Psychology Bulletin (1,626)     9 (1%)                    24 (2%)               33 (2%)
  3. Journal of Personality (613)                           9 (1%)                    19 (3%)               28 (5%)
Totals                                                      106 (1%)                  135 (2%)              241 (3%)

Note. A total of 8,833 articles were published in all journals from 1990 to 2005. The number in parentheses following each journal title indicates the total number of articles published during the time period reviewed.
^a Only Pure Qualitative = articles using only qualitative methods (n = 106); Only Mixed-Method = articles using qualitative and quantitative methods (n = 135); All Articles = includes both pure qualitative and mixed-method articles (n = 241).
^b The inaugural issue of Journal of Experimental Psychology: Applied was published in 1995.
mixed-method studies was very low (1–3%; see Table 9.1). As shown in Table 9.1, Administrative Science Quarterly had the highest publication rate of pure qualitative (8%) and mixed-method (5%) studies, whereas several journals published no pure qualitative studies during the review time period.
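The base rates in Table 9.1 are simple proportions of each journal's total article count. As a minimal sketch (not the authors' procedure, which is not described in detail here), the following recomputes them from the counts reported in the table. The names `TABLE_9_1`, `base_rate`, and `summarize` are hypothetical and introduced only for illustration.

```python
# Counts transcribed from Table 9.1:
# (journal, total articles 1990-2005, pure qualitative, mixed-method)
TABLE_9_1 = [
    ("Journal of Applied Psychology", 1460, 12, 4),
    ("Human Resource Management", 444, 5, 12),
    ("Journal of Experimental Psychology: Applied", 234, 1, 2),
    ("Administrative Science Quarterly", 355, 28, 17),
    ("Management and Information Science Quarterly", 375, 20, 8),
    ("Academy of Management Journal", 953, 18, 7),
    ("Journal of Personality and Social Psychology", 2773, 4, 42),
    ("Personality and Social Psychology Bulletin", 1626, 9, 24),
    ("Journal of Personality", 613, 9, 19),
]


def base_rate(count, total):
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


def summarize(rows):
    """Sum the article counts across journals (hypothetical helper)."""
    return {
        "total_articles": sum(total for _, total, _, _ in rows),
        "pure_qualitative": sum(pure for _, _, pure, _ in rows),
        "mixed_method": sum(mixed for _, _, _, mixed in rows),
    }


if __name__ == "__main__":
    totals = summarize(TABLE_9_1)
    for journal, total, pure, mixed in TABLE_9_1:
        # Per-journal rates for pure, mixed-method, and combined articles.
        print(
            f"{journal}: pure {base_rate(pure, total):.1f}%, "
            f"mixed {base_rate(mixed, total):.1f}%, "
            f"all {base_rate(pure + mixed, total):.1f}%"
        )
    print(totals)
```

Summing the rows reproduces the totals given in the table note, which is a useful check that the flattened table was read correctly.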
Table 9.2 Indicators of the Scientific Method

Indicator                       Only Pure Qualitative^a   Only Mixed-Method^a   All Articles^a
Observation
  No                            0 (0.0%)                  0 (0.0%)              0 (0.0%)
  Yes                           106 (100%)                135 (100%)            241 (100%)
Description
  No                            0 (0.0%)                  0 (0.0%)              0 (0.0%)
  Yes                           106 (100%)                135 (100%)            241 (100%)
Hypothesis formulation
  No                            84 (79.2%)                42 (31.1%)            126 (52.3%)
  Yes                           22 (20.8%)                93 (68.9%)            115 (47.7%)
Hypotheses testing^b
  No                            7 (31.8%)                 3 (3.2%)              10 (8.7%)
  Percentages only              4 (18.2%)                 11 (11.8%)            15 (13.0%)
  Statistical tests             11 (50.0%)                79 (84.9%)            90 (78.3%)

^a Only Pure Qualitative = articles using only qualitative methods (n = 106); Only Mixed-Method = articles using qualitative and quantitative methods (n = 135); All Articles = includes both pure qualitative and mixed-method articles (n = 241).
^b Percentages refer to the subsample of n = 115 articles that formulated hypotheses.
The study characteristics coded as indicators of the scientific method and various aspects of methodological rigor were drawn from several sources (Casper, Eby, Bordeaux, Lockwood, & Lambert, 2007; Cook & Campbell, 1979; McGrath, 1982; Mitchell, 1985; Sackett & Larson, 2002; Scandura & Williams, 2000). Indicators of the scientific method included observation, description, hypothesis formulation, and hypothesis testing. Articles coded as observation examined the phenomenon of interest in its natural setting. Description refers to whether the article provided an in-depth explanation of the phenomenon under study. Hypothesis formulation refers to whether researchers stated a priori hypotheses. Hypothesis testing applies only to those articles that stated hypotheses and indicates whether the researchers tested the proposed hypotheses in some manner.
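Coded present/absent counts of this kind lend themselves to the same chi-square test that the chapter notes some qualitative studies use. The sketch below is an illustration only, not an analysis reported by the authors: it applies a Pearson chi-square (without continuity correction) to the hypothesis-formulation counts in Table 9.2, comparing pure qualitative with mixed-method articles. The function name `chi_square_2x2` is hypothetical; the critical value 3.841 is the standard chi-square cutoff for alpha = .05 with 1 degree of freedom.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column must have a nonzero total")
    # Shortcut formula for a 2x2 table: n(ad - bc)^2 / (r1 * r2 * c1 * c2).
    return n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)


# Counts from Table 9.2, hypothesis formulation.
# Rows: pure qualitative, mixed-method. Columns: hypotheses stated (yes, no).
PURE_YES, PURE_NO = 22, 84
MIXED_YES, MIXED_NO = 93, 42

CRITICAL_05_DF1 = 3.841  # chi-square critical value, alpha = .05, df = 1

chi2 = chi_square_2x2(PURE_YES, PURE_NO, MIXED_YES, MIXED_NO)
differs = chi2 > CRITICAL_05_DF1  # True if the two groups differ at alpha = .05

if __name__ == "__main__":
    print(f"chi-square(1) = {chi2:.2f}; exceeds critical value: {differs}")
```

The shortcut formula is used so the example needs no statistics library; a library routine would be the more usual choice in practice.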
Table 9.3 Indicators of Validity Evidence

Indicator                       Only Pure Qualitative^a   Only Mixed-Method^a   All Articles^a
Internal validity
Participant verification
  No                            90 (84.9%)                130 (96.3%)           220 (91.3%)
  Yes                           16 (15.1%)                5 (3.7%)              21 (8.7%)
Reliability assessment
  No                            75 (70.8%)                28 (20.7%)            103 (42.7%)
  Yes                           25 (23.6%)                39 (28.9%)            64 (26.6%)
  Some variables                6 (5.7%)                  68 (50.4%)            74 (30.7%)
Manipulation of variable(s)
  No                            98 (92.5%)                95 (70.4%)            193 (80.1%)
  Yes                           8 (7.5%)                  40 (29.6%)            48 (19.9%)
Time horizon
  Cross-sectional               36 (34.0%)                68 (50.4%)            104 (43.2%)
  Longitudinal                  70 (66.0%)                67 (49.6%)            137 (56.8%)
Construct validity
Validity information
  No                            102 (96.2%)               95 (70.4%)            197 (81.7%)
  Yes                           1 (0.9%)                  10 (7.4%)             11 (4.6%)
  Mixed                         3 (2.8%)                  30 (22.2%)            33 (13.7%)
Triangulation of data source
  No                            48 (45.3%)                89 (65.9%)            137 (56.8%)
  Yes                           58 (54.7%)                46 (34.1%)            104 (43.2%)
Triangulation of methods
  No                            50 (47.2%)                38 (28.1%)            88 (36.5%)
  Yes                           56 (52.8%)                97 (71.9%)            153 (63.5%)
Triangulation of researchers
  No                            64 (60.4%)                56 (41.5%)            120 (49.8%)
  Yes                           42 (39.6%)                79 (58.5%)            121 (50.2%)
External validity
Type of sample
  Convenience                   83 (78.3%)                116 (85.9%)           199 (82.6%)
  Nonconvenience                23 (21.7%)                16 (11.9%)            39 (16.2%)
  Mixed                         0 (0.0%)                  3 (2.2%)              3 (1.2%)
Sample size
  Mean (SD)                     93.8 (92.8)               290.6 (366.7)         227.7 (320.1)
Setting
  Lab                           12 (11.7%)                52 (38.8%)            64 (27.0%)
  Field                         91 (88.3%)                82 (61.2%)            173 (73.0%)

^a Only Pure Qualitative = articles using only qualitative methods (n = 106); Only Mixed-Method = articles using qualitative and quantitative methods (n = 135); All Articles = includes both pure qualitative and mixed-method articles (n = 241).
Based on positivistic standards of methodological rigor, articles were coded on a number of internal validity indices, including participant verification, reliability assessment, manipulation of variable(s), and time horizon. Articles using participant verification were those that discussed findings with participants for verification. Reliability indicated whether the study reported agreement or reliability among raters or observers. Manipulation of variables refers to whether one or more variables were experimentally manipulated to examine the effect on another variable. Time horizon refers to whether the study design was cross-sectional or longitudinal. Other indicators of methodological rigor are construct validity and external validity. Construct validity was indicated by whether or not validity information was provided for measured constructs as well as three indicators of triangulation. Triangulation of data sources refers to whether the article used multiple data sources in order to examine the constructs from different perspectives. Triangulation of methods refers to whether the researchers utilized more than one data collection method in order to measure the variables of interest. Triangulation of researchers occurred when the article explicitly described the active involvement of multiple researchers in data gathering, coding, or interpretation. In other words, studies with multiple authors were not automatically classified as having triangulation of researchers.
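Agreement among raters, the basis of the reliability indicator above, can be quantified in several ways. As a minimal sketch, the following computes simple percent agreement and Cohen's kappa for two coders. The chapter does not say which index the reviewed studies or the present coders used, so the choice of kappa is an assumption, and the codes in `coder_1` and `coder_2` are hypothetical values invented purely for illustration.

```python
from collections import Counter


def percent_agreement(codes_a, codes_b):
    """Proportion of items on which the two coders assigned the same code."""
    if len(codes_a) != len(codes_b) or not codes_a:
        raise ValueError("both coders must rate the same nonempty set of items")
    matches = sum(x == y for x, y in zip(codes_a, codes_b))
    return matches / len(codes_a)


def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(codes_a)
    p_observed = percent_agreement(codes_a, codes_b)
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    # Chance agreement: product of each coder's marginal proportions per category.
    categories = set(freq_a) | set(freq_b)
    p_expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both coders used a single identical category throughout
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical present/absent codes for ten articles (not data from the review).
coder_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "no", "yes"]
coder_2 = ["yes", "no", "no", "no", "yes", "no", "yes", "yes", "no", "yes"]

if __name__ == "__main__":
    print(f"percent agreement = {percent_agreement(coder_1, coder_2):.2f}")
    print(f"Cohen's kappa = {cohens_kappa(coder_1, coder_2):.2f}")
```

Kappa is shown alongside raw agreement because two coders who both code most items "no" can agree often by chance alone, which raw agreement does not reveal.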
Several indicators of external validity were also coded. The type of sample was coded as one of the following: convenience, nonconvenience (purposive, stratified, or random samples), or mixed (combination of two or more sampling strategies). Sample size refers to the number of individuals who participated in the study. Finally, setting was coded as either laboratory or field-based. Four studies were not coded in this category because the category was not applicable (e.g., content analysis of published studies).
Evaluation of Belief #1: Qualitative Research Does Not Utilize the Scientific Method
As discussed previously, the scientific method involves the collection and interpretation of data, although it does not require the use of any particular type of data (Aguinis, 1994). Notwithstanding the widely held belief that qualitative research does not use the scientific method, logic and empiricism are central to qualitative research (Giorgi, 1997; Maxwell, 2004; Van Maanen, 1979; Yin, 1994). Just like quantitative approaches, qualitative research uses data (e.g., field observations) to induce meaning, draw inferences, and build theories (Miles & Huberman, 1994). Moreover, qualitative research often quantifies data to examine patterns and themes that emerge in data (Mason, 1996). We now turn to the results of our review examining the four steps in the scientific method: observation, description, hypothesis formulation, and hypothesis testing.
Observation and Description. A hallmark of qualitative research is careful observation and “thick description” of phenomena in their natural context (Jick, 1979, p. 609). An example of this is Druskat and Wheeler’s (2003) qualitative study of a Fortune 500 manufacturing organization that developed a theory of how external leader behaviors and strategies unfold over time to influence the success of self-managed work teams.
The qualitative researcher identifies symbols that have meaning to the research participant and looks for patterns of responses that are associated with these symbols (Van Maanen, 1979). As such, it is not surprising that 100% of the pure qualitative studies and 100% of the mixed-method studies reviewed were characterized by both observation and description (see Table 9.2).
Hypothesis Formulation and Hypothesis Testing. Our results confirmed that, like quantitative research, qualitative research was sometimes used to confirm or disconfirm hypotheses (Bachiochi & Weiner, 2002). Recall that in classifying individual studies as qualitative research we used a wide range of terms including the broad term qualitative as well as terms referring to specific data collection methods (e.g., participant observation, case study) and data reduction techniques (e.g., content analysis) associated with qualitative research. As shown in Table 9.2, 68.9% of the mixed-method and 20.8% of the purely qualitative studies proposed a priori hypotheses. For instance, Ruderman, Ohlott, Panzer, and King (2002) tested hypotheses about the relationships between multiple life roles, well-being, and managerial skills in a mixed-method study of managerial women. Of the 115 combined studies that proposed a priori hypotheses, 78.3% provided standard statistical tests of the proposed hypotheses (e.g., t test, chi-square). As an illustration, Sternberg, Lamb, Orbach, Esplin, and Mitchell (2001) collected interview data using one of two methods to ascertain the quality of information obtained via the two methods. The open-ended responses to the interviews were coded to determine the amount of information solicited, and hypotheses were tested using analysis of variance. An additional 13.0% simply “eyeballed” percentages to infer support or lack of support for hypothesized effects. An example here is McEvoy’s (1997) study of the effects of outdoor management education on trainee outcomes. To evaluate the hypothesis that the outdoor experience would increase trainees’ behaviors in a manner that was consistent with the organization’s vision, anecdotes from trainees were examined, and some support was found for this hypothesis in the trainees’ verbal accounts. The remaining 8.7% proposed but did not test stated hypotheses (see Table 9.2).
It is worth noting that only 50.0% of the pure qualitative studies that proposed hypotheses subjected these hypotheses to statistical tests.

Summary of Belief #1

The belief is that qualitative research does not follow the scientific method. The kernel of truth is that hypothesis formulation and testing were by no means universally utilized, particularly in pure qualitative studies. The myth is the general assumption that the scientific method is not used in qualitative research. Two essential steps in the scientific method, observation and description, were always present in the qualitative and mixed-method studies reviewed. Moreover, some of the studies we reviewed proposed and/or tested a priori hypotheses.

Evaluation of Belief #2: Qualitative Research Is Methodologically Weak

The second set of beliefs concerns methodological rigor. Qualitative research is generally believed to be less rigorous than quantitative research (Maxwell, 2004). Before evaluating the specific beliefs associated with methodological rigor, it is important to note that “validity is not a commodity that can be purchased with techniques. … rather validity is like integrity, character, and quality, to be assessed relative to purposes and circumstances” (Brinberg & McGrath, 1985, p. 13). This means that there is nothing inherent in qualitative research that precludes the establishment of validity evidence.

Evaluation of Belief #2a: Qualitative Research Has Weak Internal Validity

Positivistic definitions of internal validity are similar to the qualitative notion of descriptive validity (Maxwell, 2002). Both refer to the accuracy of the data collected and the absence of bias or distortion. However, the manner in which internal validity is demonstrated varies across quantitative and qualitative research, owing to different epistemological belief systems (cf. Lee, 1999). Qualitative researchers generally reject the notion of one “objective account” or one “correct answer” with respect to a phenomenon under study (Maxwell, 2002, p. 283). For qualitative researchers, internal validity is reflected in the extent to which a researcher’s interpretation of some phenomenon is consistent with the research participant’s lived experience. As such, qualitative researchers infer internal validity by verifying their interpretation of a phenomenon with research participants or by double-checking their own interpretation against another researcher’s interpretation of the same phenomenon (Marshall & Rossman, 1995; Maxwell, 1992).
As shown in Table 9.3, 15.1% of the pure qualitative and 3.7% of the mixed-method studies used participant verification. Moreover, 26.6% of the combined studies tried to
ensure internal validity by having more than one rater or observer and calculating some sort of reliability or agreement index. Another way to demonstrate internal validity is to experimentally manipulate one or more variables to examine cause-and-effect relationships (Cook & Campbell, 1979; McGrath, 1982). Because qualitative studies investigate phenomena in the natural environment, it is not surprising that only 7.5% of the pure qualitative studies manipulated variables to assess causal relationships (see Table 9.3). Mixed-method studies were more likely to manipulate variables, with 29.6% of studies falling into this category. Another approach to establishing cause-and-effect relationships is to utilize a longitudinal research design. As shown in Table 9.3, a full 56.8% of the combined studies used longitudinal data and 66.0% of the pure qualitative studies adopted this approach.

Summary of Belief #2a

The belief is that qualitative research lacks internal validity. The kernel of truth is that qualitative research does not routinely take explicit methodological steps to assess or remedy researcher bias and rarely manipulates variables to assess cause-and-effect relationships. The myth is that qualitative researchers are not concerned about researcher bias and do little to try to deal with the concern. Qualitative researchers are worried about researcher bias (Hammersley & Atkinson, 1983; Maxwell, 1992). Although ensuring that such bias does not influence the research process is difficult, methodological steps such as participant verification are sometimes used to help reduce the potential for researcher bias. Moreover, the frequent use of longitudinal designs runs counter to the myth that qualitative research lacks internal validity, because one can draw causal inferences from properly designed and executed longitudinal research.
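The agreement index mentioned above, calculated when more than one rater codes the same material, can be illustrated with a short sketch. The reviewed studies are not described as using any particular index, so Cohen's kappa is chosen here only as one common option. The function name `cohens_kappa`, the code labels, and the ten example ratings are all hypothetical and are not data from the review.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning nominal codes to the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must code the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same code by both raters.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal proportions, summed over codes.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    p_chance = sum(freq_a[code] * freq_b[code] for code in freq_a) / (n * n)
    if p_chance == 1.0:
        # Both raters used a single identical code throughout; kappa is undefined,
        # so this sketch treats it as perfect agreement (an assumption).
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical codes assigned by two raters to ten interview excerpts.
rater_1 = ["support", "support", "conflict", "neutral", "support",
           "conflict", "neutral", "support", "conflict", "support"]
rater_2 = ["support", "support", "conflict", "support", "support",
           "conflict", "neutral", "neutral", "conflict", "support"]

percent_agreement = sum(a == b for a, b in zip(rater_1, rater_2)) / len(rater_1)
kappa = cohens_kappa(rater_1, rater_2)
print(f"percent agreement = {percent_agreement:.2f}, kappa = {kappa:.2f}")
```

In this made-up example the raters agree on 8 of 10 excerpts, yet kappa is lower (about 0.68) because some of that agreement would be expected by chance given how often each rater used each code. This is one reason a chance-corrected index is often reported alongside raw percent agreement.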
Evaluation of Belief #2b: Qualitative Research Has Weak Construct Validity

Construct validity is demonstrated in several ways and can be used to evaluate qualitative research (Bachiochi & Weiner, 2002; Lee, 1999). In qualitative research, this type of validity evidence focuses on how adequately the researcher’s account serves as a way to explain, describe, or interpret some phenomenon (Maxwell, 2002) rather than the psychometric properties of specific measures. Therefore, it
is not surprising that for the combined sample only 4.6% provided validity evidence for all study measures and only an additional 13.7% provided validity evidence for some measures (see Table 9.3). The percentages are far lower for the pure qualitative studies (<1% and 2.8%, respectively). Another indication of construct validity is triangulation, and as illustrated in Table 9.3 favorable results were found here. Approximately half of the combined studies used multiple data sources (43.2%), multiple methods (63.5%), and multiple researchers (50.2%). Some differences were noted between pure qualitative and mixed-method studies; pure qualitative studies reported greater data source triangulation but less method and researcher triangulation than did mixed-method studies (see Table 9.3).

Summary of Belief #2b

The belief is that qualitative research lacks construct validity. The kernel of truth is that qualitative research does not often provide the form of validity evidence found in quantitative research such as factor analysis and the use of previously validated measures. This is because there is less concern about precision of measurement with qualitative research and more concern about capturing the breadth and depth of individuals’ experiences through more open-ended research methods such as interviews and participant observation. The myth is that qualitative research is completely lacking in construct validity. Qualitative research demonstrates construct validity in other ways, such as through the triangulation of data sources, methods, and researchers.

Evaluation of Belief #2c: Qualitative Research Has Weak External Validity

Qualitative research is generally not designed with an end goal of generalizing to a larger population of people, times, and settings (Maxwell, 1992).
For qualitative researchers, generalizability exists when a theory that is developed in a specific context and with particular individuals has utility in understanding other individuals in a similar (or even different) context (Maxwell, 1992; Yin, 1994). Thus, replication in another situation (Maxwell, 2002; Yin, 1994) or a “reasoned judgment” that the results from one study (along with its particular context) can be applied to another study (along with its particular context) is the litmus test of generalizability (Lee, 1999,
p. 158). The distinction here is between internal generalizability (generalizing to those not directly observed who are members of the particular social system under study) and external generalizability (generalizing to other social systems not directly under study; Maxwell, 1992). In qualitative research far greater emphasis is placed on internal generalizability. Even though the emphasis is on internal generalizability, the sampling strategy is important in qualitative research. As shown in Table 9.3, 21.7% of the pure qualitative studies and 14.1% of the mixed-method studies used a strategy other than convenience sampling. Sample sizes were modest and varied considerably across pure qualitative (average n = 93.8, SD = 92.8) and mixed-method (average n = 290.6, SD = 366.7) studies. In terms of the research setting, the vast majority (88.3%) of pure qualitative studies were conducted in field settings (see Table 9.3). For mixed-method studies, 61.2% were conducted in field settings.

Summary of Belief #2c

The belief is that qualitative research has weak external validity. The kernel of truth is that the heavy reliance on convenience sampling and the use of small to moderate sample sizes threaten positivistic notions of external validity. The myth is that qualitative research is completely lacking in external validity. The primary reliance on studying phenomena in naturalistic field settings enhances external validity.

Evaluation of Belief #3: Qualitative Research Contributes Little to the Advancement of Knowledge

Belief #3 must be evaluated in light of the accuracy of the previously discussed beliefs. If it is believed that qualitative research does not follow the scientific method and lacks internal validity, construct validity, and/or external validity, it follows that there would also be the belief that qualitative research does not make a substantial contribution to the advancement of science.
Given the overall lack of support for many of the beliefs previously outlined, it appears as if belief #3 is yet another myth about qualitative research. Well-conceptualized and competently executed qualitative research clearly contributes to the advancement of knowledge. In fact, many of the seminal theories in the social and organizational sciences emerged
from qualitative research. This body of literature includes theories of group processes and group development (e.g., Gersick, 1988), leadership styles (Lewin, Lippitt, & White, 1939), career development (e.g., Kram, 1985; Levinson et al., 1978), social control and social interaction patterns (Baker, 1993; Festinger, Schachter, & Back, 1950), loyalty and commitment (Adler & Adler, 1988), and many other broad areas of inquiry (see Van Maanen, 1998). Qualitative research can also play an important role in understanding contemporary social and organizational issues. Organizational phenomena are increasingly complex (Lee et al., 1999), and because of this, relatively simple cause-and-effect models may have limited utility in moving the field forward (Maxwell, 2004). Because of its careful consideration of the context and in-depth examination of intact social systems, qualitative research is well suited to understand complex organizational, social, and psychological phenomena (Lee et al., 1999).

The Future of Qualitative Research in the Social and Organizational Sciences

In the present chapter numerous beliefs associated with qualitative research were outlined and a literature review was conducted to examine each belief in detail. Although each belief had kernels of truth, many myths were identified. This was the case even when the methodological standards associated with quantitative research were imposed on qualitative research. This is important to note because quantitative conceptualizations of good science are not wholly consistent with those of qualitative research. For instance, positivistic notions of construct validity focus on the degree to which the measure of some construct truly captures the essence of that construct (Cook & Campbell, 1979). This is typically demonstrated through factor analysis and convergent/discriminant validity evidence.
In contrast, qualitative researchers are more concerned about whether they fully and accurately understand the meaning of an individual participant’s lived experience. Obviously, this is more difficult to quantify using statistical techniques. So, by examining qualitative research through the lens of quantitative research (which was necessary to do since the beliefs originate from the quantitative perspective), we are imposing standards that some scholars argue do not even apply to qualitative research. This means
that the proverbial deck was stacked against qualitative research from the start of this review. Even so, we at least partially debunked many of the myths associated with qualitative research. In fact, comparing our findings to several recent methodological reviews of mainly quantitative research highlights the strengths of qualitative research when viewed alongside quantitative research. In the present review we found greater use of observation and description, all three types of triangulation, longitudinal designs, and nonconvenience samples compared to several reviews of mostly quantitative studies (Aldag & Stearns, 1988; Casper et al., 2007; Mitchell, 1985; Scandura & Williams, 2000). Qualitative research was also more likely to study phenomena in their natural environment compared to management research in general (Scandura & Williams, 2000) and was comparable to work-family research in particular (Casper et al., 2007). Interestingly, we also found that the qualitative studies reviewed were about as likely to manipulate variables in an attempt to demonstrate causality as were quantitative work-family (Casper et al., 2007) and management research in general (Scandura & Williams, 2000). Notwithstanding the considerable strengths of qualitative inquiry, the road map for conducting high-quality qualitative research is not always clear-cut and easy to navigate (Lee et al., 1999; Miles, 1979). Luckily, there are many excellent sources that outline best practices in qualitative research. For the scholar trained in the traditional quantitative paradigm (like ourselves), we recommend Bachiochi and Weiner (2002), Lee et al. (1999), and Lee (1999) for a general discussion on conducting high-quality qualitative research. For a more technically oriented, comprehensive treatment of the nuts and bolts of qualitative data analysis, Miles and Huberman’s (1994) sourcebook is also highly recommended. 
Concluding Thoughts

Qualitative research is essential for uncovering the meaning of individuals’ experiences, understanding the context in which individuals and social aggregates operate, and generating theories (Bachiochi & Weiner, 2002; Kidd, 2002; Lee et al., 1999; Van Maanen, 1979). However, qualitative research and quantitative research are often pitted against one another. Although different in focus, emphasis, and
form, both approaches are striving toward a common objective: the advancement of scientific knowledge (Van Maanen, 1979). Qualitative inquiry is an essential step in the process of initial discovery, just as quantitative research is necessary to confirm or disconfirm specific relationships among variables nested within a broader system (Bachiochi & Weiner, 2002; Jick, 1979; Munley et al., 2002; Van Maanen, 1979). Neither approach, when used to the exclusion of the other, makes for good science. To this point, we are reminded of McGrath’s (1982) sage observation: “No strategy, design, or method when used alone is worth a damn…multiple approaches are required” (p. 101). The value of integrating qualitative research into the study of organizational and social phenomena has been noted time and time again, yet the adoption of qualitative approaches has proceeded at a snail’s pace. We hope that by identifying and debunking some of the methodological beliefs associated with qualitative research, quantitative researchers might better appreciate what qualitative research has to offer to the social and organizational sciences.

Author Note

A previous version of this chapter was presented at the 2007 conference of the Society for Industrial and Organizational Psychology in New York, New York. This research was supported in part by a grant from the National Institutes of Health (R01DA019460-02) awarded to Lillian T. Eby. The opinions expressed herein are those of the authors and not the granting agency.

References

Adler, P. A., & Adler, P. (1988). Intense loyalty in organizations: A case study of college athletics. Administrative Science Quarterly, 33, 410–417.
Aguinis, H. (1993). Action research and scientific method: Presumed discrepancies and actual similarities. Journal of Applied Behavioral Science, 29, 416–431.
Aldag, R. J., & Stearns, T. M. (1988). Issues in research methodology. Journal of Management, 14, 253–275.
Aluko, F. S. (2006). Social science research: A critique of quantitative and qualitative methods and proposal for eclectic approach. IFE Psychologia, 14, 198–210.
Bachiochi, P. D., & Weiner, S. P. (2002). Qualitative data collection and analysis. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 161–183). Oxford: Blackwell Publishing.
Baker, J. R. (1993). Tightening the iron cage: Concertive control in self-managing teams. Administrative Science Quarterly, 38, 408–437.
Brinberg, D., & McGrath, J. E. (1985). Validity and the research process. Newbury Park, CA: Sage.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Campbell, D. T., & Stanley, J. (1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbook of research on teaching (pp. 171–246). Chicago, IL: Rand McNally.
Casper, W., Eby, L. T., Bordeaux, C., Lockwood, A., & Burnett, D. (2007). A review of research methods in IO/OB work-family research. Journal of Applied Psychology, 92, 28–43.
Cassell, C., & Symon, G. (1994). Qualitative research in work contexts. In C. Cassell & G. Symon (Eds.), Qualitative methods in organizational research: A practical guide (pp. 1–13). London: Sage.
Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Skokie, IL: Rand McNally.
Cook, T. D., & Reichardt, C. S. (1978). Qualitative and quantitative methods in evaluation research. Beverly Hills, CA: Sage.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
Druskat, V. U., & Wheeler, J. V. (2003). Managing from the boundary: The effective leadership of self-managing work teams. Academy of Management Journal, 46, 435–457.
Festinger, L. U., Schachter, S., & Back, K. (1950). Social pressures in informal groups: A study of human factors in housing. Oxford: Harper.
Gersick, C. J. G. (1988). Time and transition in work teams: Toward a new model of group development. Academy of Management Journal, 31, 9–41.
Giorgi, A. (1997). The theory, practice, and evaluation of the phenomenological method as qualitative research. Journal of Phenomenological Psychology, 28, 235–260.
Glaser, B., & Strauss, A. (1967). The discovery of grounded theory. Chicago, IL: Aldine.
Guba, E. G., & Lincoln, Y. S. (1989). Fourth generation evaluation. Newbury Park, CA: Sage.
Hammersley, M., & Atkinson, P. (1983). Ethnography: Principles in practice. London: Tavistock Institute.
Heppner, P. P., Kivlighan, D. M., & Wampold, B. E. (1999). Research design in counseling. Belmont, CA: Brooks/Cole.
Jick, T. D. (1979). Mixing qualitative and quantitative methods: Triangulation in action. Administrative Science Quarterly, 24, 602–611.
Kidd, S. A. (2002). The role of qualitative research in psychological journals. Psychological Methods, 7, 126–138.
Kram, K. E. (1985). Mentoring at work. Glenview, IL: Scott, Foresman, and Company.
Kuhn, T. S. (1962). The structure of scientific revolutions. Chicago: University of Chicago Press.
Kvale, S. (1996). InterViews: An introduction to qualitative research interviewing. Thousand Oaks, CA: Sage.
Larsson, R. (1993). Case survey methodology: Quantitative analysis of patterns across case studies. Academy of Management Journal, 36, 1515–1546.
Lee, T. W. (1999). Using qualitative methods in organizational research. Thousand Oaks, CA: Sage.
Lee, T. W., Mitchell, T. R., & Sablynski, C. J. (1999). Qualitative research in organizational and vocational psychology, 1979–1999. Journal of Vocational Behavior, 55, 161–187.
Levinson, D. J., Darrow, D., Levinson, M., Klein, E. B., & McKee, B. (1978). Seasons of a man’s life. New York: Academic Press.
Lewin, K., Lippitt, R., & White, R. K. (1939). Patterns of aggressive behavior in experimentally created “social climates.” Journal of Social Psychology, 10, 271–299.
Lincoln, Y. S., & Guba, E. G. (1985). Naturalistic inquiry. Beverly Hills, CA: Sage.
Locke, K., & Golden-Biddle, K. (2002). An introduction to qualitative research: Its potential for industrial and organizational psychology. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 99–118). Oxford: Blackwell Publishing.
Luthans, F., & Davis, T. V. (1982). An idiographic approach to organizational behavior research: The use of single case experimental designs and direct measures. Academy of Management Review, 7, 380–391.
Marshall, C., & Rossman, G. B. (1995). Designing qualitative research (2nd ed.). Thousand Oaks, CA: Sage.
Mason, J. (1996). Qualitative researching. London: Sage.
Maxwell, J. A. (1992). Understanding and validity in qualitative research. The Harvard Educational Review, 62, 279–299.
Maxwell, J. A. (2004). Causal explanation, qualitative research, and scientific inquiry in education. Educational Researcher, 33(2), 3–11.
McEvoy, G. M. (1997). Organizational change and outdoor management education. Human Resource Management, 36, 235–250.
McGrath, J. E. (1982). Dilemmatics: The study of research choices and dilemmas. In J. E. McGrath (Ed.), Judgment calls in research (pp. 69–102). Beverly Hills, CA: Sage.
Miles, M. B. (1979). Qualitative data as an attractive nuisance: The problem of analysis. Administrative Science Quarterly, 24, 590–601.
Miles, M., & Huberman, A. M. (1994). Qualitative data analysis. Thousand Oaks, CA: Sage.
Mitchell, T. R. (1985). An evaluation of the validity of correlational research conducted in organizational settings. Academy of Management Review, 10, 192–205.
Morgan, G., & Smircich, L. (1980). The case for qualitative research. Academy of Management Review, 5, 491–500.
Munley, P. H., Anderson, M. Z., Briggs, D., Derives, M. R., Foresheet, W. J., & Whiner, E. A. (2002). Methodological diversity of research published in selected psychological journals in 1999. Psychological Reports, 91, 411–420.
Patton, M. J. (1991). Qualitative research on college students: Philosophical and methodological comparisons with the quantitative approach. Journal of College Student Development, 32, 389–396.
Ratner, C. (1997). Cultural psychology and qualitative methodology. New York: Plenum Press.
Reiss, A. J. (1979). Governmental regulation of scientific inquiry: Some paradoxical consequences. In C. B. Klockars & F. W. O’Connor (Eds.), Deviance and decency (pp. 61–95). Beverly Hills, CA: Sage.
Ruderman, M. N., Ohlott, P. J., Panzer, K., & King, S. N. (2002). Benefits of multiple roles for managerial women. Academy of Management Journal, 45, 369–386.
Sackett, P. R., & Larson, J. R., Jr. (1990). Research strategies and tactics in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (pp. 419–489). Palo Alto, CA: Consulting Psychologists Press.
Scandura, T. A., & Williams, E. A. (2000). Research methodology in management: Current practices, trends, and implications for future research. Academy of Management Journal, 43, 1248–1264.
Sternberg, K. J., Lamb, M. E., Orbach, Y., Esplin, P. W., & Mitchell, S. (2001). Use of a structured investigative protocol enhances young children’s responses to free-recall prompts in the course of forensic interviews. Journal of Applied Psychology, 86, 997–1005.
Stone-Romero, E. F., Weaver, A. E., & Glenar, J. L. (1995). Trends in research design and data analysis strategies in organizational research. Journal of Management, 21, 141–157.
Van Maanen, J. (1979). Reclaiming qualitative methods for organizational research: A preface. Administrative Science Quarterly, 24, 520–526.
Van Maanen, J. (1998). Qualitative studies of organizations. Thousand Oaks, CA: Sage.
Yin, R. K. (1994). Case study research: Design and methods. Newbury Park, CA: Sage.
10 Do Samples Really Matter That Much? Scott Highhouse and Jennifer Z. Gillespie
It is a good morning exercise for a research scientist to discard a pet hypothesis every day before breakfast. It keeps him young. —Konrad Lorenz (1903–1989)
People often behave in ways that are inconsistent with what they know to be true. For instance, consumers purchase Advil at twice the price of generic anti-inflammatories, even though both contain 100% ibuprofen. People will pay more to participate in a lottery that has 10 winning tickets and 90 losing tickets than in one that has 1 winning ticket and 9 losing tickets, even though they realize that both lotteries have equal probabilities of success (Kirkpatrick & Epstein, 1992). In other instances, people privately cling to beliefs that they acknowledge are refuted by the evidence. For example, even though probability theory has shown that the “hot hand” in basketball is a myth (Gilovich, Vallone, & Tversky, 1985), people who know better still wait for their favorite team’s best shooter to get into a groove. Even though most psychologists know that holistic assessment is beyond human cognitive capabilities, they continue to hire their colleagues this way.

And so it is with the issue of sample generalizability in applied research. Even though there is almost no empirical evidence to support the claim that the nature of the research sample matters much in making inferences about behavior in organizations, educated, intelligent people stubbornly cling to the belief that samples matter. Reviewers and editors commonly assert that students should not be used to study workplace phenomena, as though such a declaration requires no further explanation. One of our colleagues asserted in
a thesis defense that a study using a sample of student athletes will not yield inferences that can be applied to either other college students or athletes in general!

We examined the frequency with which authors feel compelled to (or are forced to) confess that the nature of their sample is a limitation of their research. Specifically, we had a graduate student randomly sample 55 empirical articles published in the Journal of Applied Psychology between 1996 and 2006 and record the frequency of various limitations mentioned in the discussion sections. The results are presented in Figure 10.1. Not surprisingly, problems with the research design were far and away the most often mentioned limitations, appearing in 65% of the empirical articles. However, the nature of the research sample was noted as a limitation nearly as often as the generalizability of the research results, and equally as often as construct validity. In nearly 30% of the articles, the authors mentioned that the findings might not apply to samples other than the one used in their studies. These were not concerns about generalizing a sample statistic to the population but concerns over whether the inferences about behavior could be applied to people in other settings.

In this chapter we examine the history of the sample generalizability debate and consider opposing arguments and empirical findings. We conclude that it is rare in applied behavioral science for the nature of the sample to be an important consideration for generalizability. We therefore speculate about possible reasons for the stubborn adherence to the idea that samples matter. Specifically, we suggest that people often confuse random sampling with random assignment. We also suggest that people erroneously focus on the generalizability of effects, when they should be focusing on the generalizability of theoretical inferences.
Finally, we suggest that people rely on a simple heuristic in determining the degree to which findings from one sample are applicable to another sample.

Note. We excluded articles that were not related to behavior in organizations (e.g., eyewitness testimony).

Kernel of Truth

Although our goal in this chapter is to debunk the myth that the characteristics of the sample are critical for generalizing inferences
[Figure 10.1 Percentage of articles in Journal of Applied Psychology (1996–2006) in which each issue appears as a limitation. Bars: Research Design, 65%; Generalizability, 35%; Nature of Sample, 29%; Construct Validity, 29%; Method Variance, 25%; Sample Size, 20%; Other, 15%.]
from a study, we concede that the sample matters when there is a specific, well-defined population of interest. Kardes (1996) used the term idiothetic research to describe research that is done for a specific organization or occupation. For example, a researcher who is specifically interested in studying the work-family conflict experienced by police officers could not sample from any population other than the population of police. Farber (1952) similarly noted that the goal of some research is to describe a phenomenon or population. To understand the job and life attitudes of male executives, for example, Judge, Boudreau, and Bretz (1994) acquired a representative sample of executives from the database of a large executive search firm. The sample needed to be representative for them to draw descriptive conclusions about this target population.

The meaning and effort that participants assign to the study situation is also critical for generalizing inferences from the results (Berkowitz & Donnerstein, 1982). For example, if study participants are unable to understand the significance or substance of a variable because they lack the relevant knowledge base or experiential history that someone in the targeted setting might have, the experimenter may not be able to make valid inferences from that participant’s behavior. Likewise, if participants are carelessly responding to a survey or experimental manipulation, one may not draw meaningful inferences from their responses. The assignment of meaning on the part of study participants is likely affected by various elements of a given study, including actors, behaviors, items, and context (Ilgen, 1986; Runkel & McGrath, 1972). Motivation can be influenced by the level of involvement and experience with the stimuli and the rewards for participation. It would make no sense to ask unemployed college sophomores to imagine what factors might lead them to take an overseas assignment.
It is important to recognize, however, that assuring that study participants assign appropriate meaning and effort to a given situation is not as simple as sampling employed adults who work in a for-profit organization. Theory may be used to understand, explain, and assess the meaning that the people in a study assign to a given situation and to determine the features of a given study that are relevant to the interpretation of its results.
Background History of the Concern

The debate over whether the nature of the research participant matters in generalizing research findings seems to have begun in 1946, when McNemar observed “the existing science of human behavior is largely the science of the behavior of sophomores” (p. 333). McNemar’s criticisms were primarily directed toward researchers who study public opinion by administering attitude surveys to college students. In a published reply, Conrad (1946) countered that a homogeneous sample of students avoids the problems with sampling a less-literate general population. Conrad cited research showing that students do not differ from the general population in their national morale. Farber (1952) provided a more forceful defense of student samples, noting that students are only inappropriate for studying applied phenomena at a descriptive level. When one is doing research at the conceptual level, according to Farber, the college student is indispensable to psychological research.

Rosenthal (1965) raised the issue of whether psychology can even claim to be studying the behavior of sophomores. That is, Rosenthal observed that the “volunteer” in most psychological experiments is unlikely to be representative of the general student population. Note that Rosenthal was not raising the issue of the generalizability of inferences from student samples; his concern was with the degree to which student volunteers were an unbiased sample from the student population. Oakes (1972) noted, however, that any research population is likely to be atypical on some dimensions.

In the organizational literature, Gordon, Slade, and Schmitt (1986) published a highly influential analysis of 32 studies in which students and nonstudents participated in experiments investigating the same variables.
The authors observed numerous instances in which at least one finding in a study differed by sample and concluded that findings from studies using students will not necessarily agree with findings from studies using “organizational” samples. In a critique of Gordon et al.’s methodology, Dobbins, Lane, and Steiner (1988) noted instances of study misclassification and pointed to confounds caused by differences in stimuli meaningfulness for the two groups. Dobbins et al. were essentially arguing that Gordon and his colleagues used a biased sample of studies investigating sample differences.
Scott Highhouse and Jennifer Z. Gillespie
Table 10.1 Bibliography of Articles Defending Convenience Samples

Defenses of Convenience Samples | Sources (a)
Efficiency. Convenience samples are more cost-effective and more responsive than field samples. | Farber, 1952; Kardes, 1996
Homogeneity. Less noise or extraneous variation associated with homogeneous (convenience) samples. | Berkowitz & Donnerstein, 1982; Conrad, 1946; Greenberg, 1987; Lynch, 1982
Humanity. “Real-people” samples are a myth; people are people. | Campbell, 1986; Locke, 1986; Stone-Romero, 2002
Generalizability. Field samples are no more representative, and perhaps even less representative, of typical organizational members. | Dipboye & Flanagan, 1979; Oakes, 1972
Adequacy. Theories generalize; samples don’t. Any sample encompassed by the theory is appropriate. | Calder et al., 1981; Chow, 1997; Dobbins et al., 1988; Farber, 1952; Greenberg, 1987; Highhouse, in press; Ilgen, 1986; Kardes, 1996; Mook, 1983

(a) At least elements of these defenses are contained in these sources. Some of the authors may not agree with the defense in its entirety.
It is really impossible to summarize the different arguments for why samples matter, because critics seem to assume that the criticism itself requires no explanation. In other words, many believe that it is inherently obvious that samples should look exactly like the populations to which a study’s findings are to generalize. In this argument, the burden of proof has been on the defense. Below is a brief summary of defenses of convenience or nonorganizational samples. Table 10.1 provides a bibliography of articles that make each defense.
• Efficiency. Research conducted with convenience samples is much more efficient and cost-effective than research conducted with field samples. Farber (1952) noted that the use of college students is not a “lazy dodge, but rather an intelligent and efficient choice” (p. 102). Convenience samples provide a captive audience and a response rate that would be impossible with other samples.
• Homogeneity. Homogeneous (convenience) samples have less noise or extraneous variation, which is an advantage for hypothesis testing. Berkowitz and Donnerstein (1982) noted that any homogeneous group, whether in the field or in the lab, avoids extraneous variance that can hinder the researcher’s ability to isolate effects. Moreover, comparing homogeneous groups for differences can isolate boundary conditions and actually enhance generalizability.
• Humanity. The perceived relevance of dispositional differences in samples is exaggerated. Campbell (1986) speculated, with tongue in cheek, “Perhaps college students really are people…why their disguise fools many observers into thinking otherwise is not clear” (p. 276). The similarities between convenience and field samples are greater than the differences, and any differences are usually unrelated to the research questions.
• Generalizability. Field samples are no more representative of organizational members than are convenience samples. Dipboye and Flanagan (1979) found that most organizational samples are from a narrowly defined group of male, professional or managerial employees in profit-driven organizations. Most field samples come from one organization with its own unique culture and customs, which may actually hinder generalizability.
• Adequacy. As most applied research is aimed at identifying general principles that can be applied across organizational settings, we are interested in understanding and controlling causal mechanisms. If undergraduate students, military personnel, or secondary school teachers are among the people covered by the theory, then any one of these is an appropriate sample on which to test the theory.
The Research Base
The samples-matter folks were reinvigorated by a recent meta-analysis of meta-analyses, concluding that effect sizes often differed “substantially” (and sometimes directionally) between student and nonstudent samples (Peterson, 2001). A closer inspection of the findings, however, suggests a more sanguine picture. First, Peterson reported a .75 correlation between effect sizes found for student samples and nonstudent samples. (This bivariate correlation of .75 was calculated using average effect sizes computed for both student and nonstudent samples within research study; see Peterson, 2001, Table 3.) Despite this substantial correlation, Peterson noted “its magnitude suggests there is less than a one-to-one correspondence between the respective effect sizes” (p. 455). Should we expect a one-to-one correspondence? An examination of Peterson’s report reveals that the student/nonstudent distinction is confounded by a lab/field distinction. Peterson acknowledged that the setting of the studies could not be used as a moderator because many meta-analyses did not provide this information. This presents serious problems in interpreting Peterson’s effect size data, because lab studies are almost always characterized by experimentation and field studies are almost always characterized by passive observation. Comparing effect sizes that result from experimental manipulation with effect sizes resulting from passive observation makes little sense when one considers that a hallmark of good experimentation is strong manipulation (Highhouse, in press). An effect that results from an appropriately strong manipulation conducted in a highly controlled setting should not be compared for equivalence with an effect that arises from passive observation in a highly noisy field setting. Peterson’s findings probably say more about the relation between lab and field than they do about students and nonstudents.
One might conclude from the above discussion that because effect sizes that emerge from lab studies are highly dependent on how big the experimenter’s hammer is, these should not be accorded as much importance as effect sizes observed in the field. First, this is a criticism of research setting (lab vs. field), not of research sample. Second, effects observed in the field are almost always limited by measurement problems and lack of control over confounds. It really makes no sense to compare effect sizes across the two research settings. Indeed, the only sensible comparison is whether the phenomenon observed in the lab is also observed in the field.
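The .75 correlation between student and nonstudent effect sizes can also be restated as a probability of agreement. The sketch below assumes the conversion usually attributed to Dunlap (1994), CL = arcsin(r)/π + .5, which under bivariate normality gives the chance that a case above the mean on one variable is also above the mean on the other. The function name is hypothetical and chosen only for this illustration.

```python
import math


def common_language_effect_size(r: float) -> float:
    """Convert a correlation r into a common-language effect size.

    Assumes bivariate normality and the arcsine conversion
    CL = asin(r) / pi + 0.5 (assumed here to match Dunlap, 1994).
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie in [-1, 1]")
    return math.asin(r) / math.pi + 0.5


if __name__ == "__main__":
    # Correlation between student and nonstudent effect sizes (Peterson, 2001).
    r = 0.75
    cl = common_language_effect_size(r)
    print(f"r = {r:.2f} corresponds to CL = {cl:.2f}")
```

Under these assumptions a correlation of .75 corresponds to a value of roughly .77, and a correlation of zero corresponds to .50, that is, agreement at chance level.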
Peterson’s (2001) meta-analysis indicated that there was only one clear-cut case in which effect sizes differed in direction for students and nonstudents. Even in this one instance, Peterson (2001) acknowledged that the authors of the original meta-analysis cautioned against making much of the observed difference in directionality. Peterson’s data nonetheless showed that findings from student and nonstudent samples will agree nearly 80% of the time. (This estimate is based on a conversion of the .75 correlation to the Common-Language Effect Size statistic; see Dunlap, 1994.)
Over 20 years ago, Locke (1986) gathered a group of scholars to dispense with the notion that lab studies with students lacked generalizability. Scholars from representative areas within the field of industrial and organizational psychology concluded, based on the existing research record, that there was no basis for dismissing laboratory research as lacking in generalizability. Campbell (1986) summarized the findings, noting that “the message is clear: the data do not support the belief that lab studies produce different results than field studies.” Subsequent meta-analyses have reaffirmed the conclusions reached by these researchers, namely that research setting and sample are not important moderators in organizational research (e.g., Eagly, Karau, & Makhijani, 1995; Kluger & DeNisi, 1996; Kubeck, Delp, Haslett, & McDaniel, 1996; Sagie, 1994). Anderson, Lindsay, and Bushman (1999) examined meta-analyses in all areas of psychology that coded for research setting (i.e., lab versus field), and found that effects for a variable in the lab correlated strongly with effects for the same variable in the field (r = .73).
Why Do Samples Seem to Matter So Much?
Arguing that generalizability is not contingent upon the nature of the research sample places one in the uncomfortable position of having to prove the null. It is conceivable that the evidence will one day show definitively that results from studies using convenience samples do not allow inferences to be drawn about behavior in organizations. Until that time, we must conclude that the evidence does not support such an assertion and that the reasons against the assertion are more plentiful and compelling than the reasons for it. So why, therefore, do people continue to believe that samples matter so much to generalizability? In the remaining portion of this chapter, we speculate about why people find so persuasive the notion that samples matter.
People Confuse Random Sampling With Random Assignment
Random sampling procedures are often seen as a solution to the generalizability problem.
With regard to human subjects, this involves selecting people by chance from a clearly designated population. The chief reason for conducting random samples is to eliminate bias or ensure that some members of the population are not systematically overrepresented. This type of sampling results in a match on all attributes, such that the mean and variance of the sample will mirror the population (Shadish, Cook, & Campbell, 2002). With representative sampling, the population is divided into strata and a random sample is taken from each stratum. For example, television executives rely on the Nielsen ratings to determine the number of unique viewers or households tuned to a television program in a particular time period during a week. The Nielsen company (i.e., Nielsen Media Ratings) uses a representative sampling method to generate a sample of households that will generalize to the U.S. population of television viewers. In response to advertiser concern that a specific segment of the target population (i.e., 18- to 24-year-old viewers) is not well represented by the Nielsen ratings, the Nielsen company tracks the television viewing habits of people who are members of a randomly selected “Nielsen family” but are living away from home on a university campus. Note that the Nielsen company uses its sample data to draw statistical conclusions about its target population (i.e., people who watch television). With random sampling, the company can tout the statistical generalizability of its sample data to the target population. For example, according to this week’s Nielsen ratings, more people in America watch Wheel of Fortune than Jeopardy.
True random sampling is rarely achieved in applied behavioral research (Shadish et al., 2002). Research participation is usually voluntary and, in the case of field research, usually restricted to one specific organization or industry. Researchers using students rarely randomly select names from the school directory but are content to study the students who are successfully recruited from introductory psychology or business courses. Even if they did randomly select from the directory, specifying the relevant population is hopelessly tricky.
Should the sample have the same mean and variance (on all attributes) as the population of students at the same university? What about students at universities in other states? What about students who attend universities 2 years from now? The same problem occurs for people sampling workers in organizations. Certainly, they do not presume to be obtaining a truly representative sample of all workers. Despite this inability to randomly sample, useful generalizations are commonly made from unrepresentative samples (Fisher, 1955). Fisher’s emphasis on sample size and random assignment over random sampling from a defined population has been a source of controversy for some behavioral scientists (cf. Gigerenzer, 2006).
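The distinction between random sampling and random assignment can be made concrete with a small simulation. The sketch below is illustrative only: the population, the trait, the effect size of 0.5, and all names are hypothetical, and the treatment effect is assumed not to interact with the trait on which the sample is unrepresentative.

```python
import random
import statistics


def simulate_convenience_experiment(seed=0, n_population=100_000,
                                    n_sample=2_000, true_effect=0.5):
    """Randomly assign a deliberately unrepresentative sample to conditions."""
    rng = random.Random(seed)

    # Hypothetical population: one standardized background trait per person.
    population = [rng.gauss(0.0, 1.0) for _ in range(n_population)]

    # Convenience sample: NOT a random sample. Only the lowest-scoring
    # people on the trait are recruited.
    convenience_sample = sorted(population)[:n_sample]

    # Random assignment: shuffle, then split into treatment and control.
    rng.shuffle(convenience_sample)
    half = n_sample // 2
    treated = convenience_sample[:half]
    control = convenience_sample[half:]

    # Outcome = trait + treatment effect (treated only) + noise.
    # Assumption: the effect does not depend on the trait.
    treated_outcomes = [b + true_effect + rng.gauss(0.0, 1.0) for b in treated]
    control_outcomes = [b + rng.gauss(0.0, 1.0) for b in control]

    return {
        "population_mean": statistics.fmean(population),
        "sample_mean": statistics.fmean(convenience_sample),
        "estimated_effect": (statistics.fmean(treated_outcomes)
                             - statistics.fmean(control_outcomes)),
    }


if __name__ == "__main__":
    for name, value in simulate_convenience_experiment().items():
        print(f"{name}: {value:.3f}")
```

In this sketch the sample mean on the trait is far from the population mean, so the sample would be useless for describing the population, yet the difference between conditions stays close to the assumed effect. If the effect did interact with the trait, the estimate would no longer transfer, which is the situation the chapter flags as relevant.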
People Focus on the Wrong Things
Whereas demographers and television executives are interested primarily in statistical generalizability, behavioral science researchers are typically interested in theoretical generalizability. Theoretical generalizability is about explaining behavior. Thus, as it relates to sample, theoretical generalizability is about a presumed causal relationship and the extent to which it can be expected to hold across populations (Shadish et al., 2002) and/or across people (Sackett & Larson, 1990). To illustrate the concept of theoretical generalizability, imagine a group of researchers with a theory about why some shows are more popular than others. For example, what is it about Wheel of Fortune that makes so many people want to watch it? One theory might be that people like to solve puzzles. Another might be that people enjoy seeing others compete for prizes. The researchers might test these theories by surveying a sample of television viewers using measures of attitudes toward puzzles and prizes. Another approach would be to randomly assign shows that differ in degree of emphasis on puzzles and prizes to a sample of television viewers. It is not necessary that these samples are representative of the population of all television viewers. It is only necessary that the sample does not systematically differ from the population in a way that would plausibly interact with the constructs of interest.
As another example, Michael Birnbaum found that nearly 70% of the undergraduates that he studied consistently made decisions that violated a principle of coherence in decision making called stochastic dominance: If slot machine A always pays off at least as much as B, and sometimes more, one should always choose to play it (see Birnbaum & Martin, 2003). Birnbaum’s findings were not happenstance. He had in fact predicted this violation based on his theory of the context.
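The dominance principle can be checked mechanically for any pair of gambles. The sketch below uses the standard first-order definition (gamble A dominates B if A's cumulative probability of paying x or less never exceeds B's and is strictly lower somewhere); the two gambles and all names are hypothetical values chosen for illustration, not data taken from the studies cited.

```python
from fractions import Fraction


def cdf(gamble, x):
    """Probability that the gamble pays x or less."""
    return sum(prob for payoff, prob in gamble if payoff <= x)


def dominates(a, b):
    """True if gamble a first-order stochastically dominates gamble b."""
    payoffs = sorted({p for p, _ in a} | {p for p, _ in b})
    # At every payoff level, a must be no more likely than b to pay that
    # little, and strictly less likely at one level or more.
    gaps = [cdf(b, x) - cdf(a, x) for x in payoffs]
    return all(g >= 0 for g in gaps) and any(g > 0 for g in gaps)


# Hypothetical gambles as (payoff in dollars, probability) pairs.
gamble_a = [(12, Fraction(5, 100)), (14, Fraction(5, 100)), (96, Fraction(90, 100))]
gamble_b = [(12, Fraction(10, 100)), (90, Fraction(5, 100)), (96, Fraction(85, 100))]

if __name__ == "__main__":
    print("A dominates B:", dominates(gamble_a, gamble_b))
    print("B dominates A:", dominates(gamble_b, gamble_a))
```

Exact fractions are used so that the comparison of cumulative probabilities is not affected by floating-point rounding. Gamble B can look attractive because it has two large payoffs, even though A is the dominant choice.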
This was a theory of how humans in general respond to contextual features of a gamble. Students, being human, were a useful sample for testing the theory. Economists, however, are an especially difficult group to persuade. College students who elect to take psychology courses, they say, are not a random sample of the population; they have graduated high school, they do not have college degrees, and they are often predominately female and homogeneous in age, race, religion, and social class. Birnbaum needed to show that his findings were not sample specific, so he decided to gather a group of participants who should be least likely to violate the principle of stochastic dominance. Birnbaum gathered a group of scholars who were highly trained in decision making and statistics and who were highly motivated to look rational. As you might have guessed by now, even the most highly educated scholars consistently violated the stochastic dominance principle, and the same basic psychological processes were implied by each stratum of the sample.
The point of the above story is that behavioral science researchers, unlike researchers in some other fields, do not rely on true population sampling of either participants or situations as an argument for generalizability. As such, they must rely on appropriate substantive theory, rather than on statistical theory of random sampling. It is the underlying theoretical principles that give us the muscle to generalize across samples. An excellent example of how studies that lack surface similarity can have considerable applied relevance is the work of Daniel Kahneman and Amos Tversky on behavioral economics (Kahneman & Tversky, 1979; Tversky & Kahneman, 1974). The field of economics was widely considered a nonexperimental science, relying on observation of real-world economies rather than controlled laboratory experiments. Kahneman and Tversky, however, tested theoretical propositions about how people make uncertain financial decisions using a series of small, “artificial” experiments with students as research participants. The work had enough real-world relevance to garner Kahneman the 2002 Nobel Prize in Economics. (Amos Tversky was deceased in 2002 and therefore not eligible to receive the Nobel Prize.) Organizational researchers, like economists, are a difficult bunch to convince that samples do not matter, at least not as much as they think they do.
The degree to which a sample matches the population of interest does not affect one’s ability to detect a relation between variables of theoretical significance, as long as that sample is unbiased on factors relevant to the research question (e.g., Calder, Phillips, & Tybout, 1981; Farber, 1952; Kruglanski, 1975). In fact, Farber (1952) went so far as to argue that convenience samples, like the fruit fly for the geneticist or the vacuum for the physicist, allow one to better isolate the effect of interest. Greenberg (1987) made a similar argument, suggesting that homogeneous samples avoid the problem of extraneous variance to the behavior under study (see also Berkowitz & Donnerstein, 1982; Lynch, 1982).
People Rely on Superficial Similarities
To judge the quality of a study on the basis of superficial characteristics, such as the apparent similarity between research participants and one’s prototype of an organizational member, is a classic example of overusing the representativeness heuristic (see Gilovich, Griffin, & Kahneman, 2002). Representativeness is a reflexive tendency to assess the similarity of things along superficial dimensions and to organize them on the basis of the implicit rule that “like goes with like” (Gilovich, 1991). According to Gilovich,
We expect instances to look like the categories of which they are members; thus, we expect someone who is a librarian to resemble the prototypical librarian. We expect effects to look like their causes; thus, we are more likely to attribute a case of heartburn to spicy rather than bland food, and we are more inclined to see jagged handwriting as a sign of a tense rather than a relaxed personality. (p. 18)
Like any heuristic, representativeness operates at an automatic level of information processing. It involves judging things without thinking hard about them. Accordingly, there is a certain silliness to the argument that researchers need to use samples that generalize to “people” in organizations. Who are these people that they talk about? Are they talking about adults? Certainly not all workers are over 30. Are they talking about employees in general? Most students hold (or have held) jobs. Are they talking about full-time workers? There is nothing to suggest that full-time work changes one’s psychological makeup. Are they talking about managers? Does the manager of the local grocery store count? How about the manager at the nearest Burger King? If we can generalize from the Burger King manager sample, why can’t we generalize from a student sample? Perhaps the issue is experience. There are certainly some compelling arguments for when experience is relevant to some research questions (e.g., Lord & Hall, 2005). But this hand is often overplayed. Research on expertise in prediction shows that training helps, but
additional experience does not, for predictions made by clinicians, social workers, judges, parole boards, auditors, surgeons, radiologists, and admission committees (Camerer & Johnson, 1991; Grove & Meehl, 1996). Sherden (1998) has argued persuasively that business forecasters are doing little more than making wild guesses, and Pulakos, Schmitt, Whitney, and Smith (1996) concluded that differences in interviewer validity were due entirely to sampling error. Nevertheless, using anyone other than experienced professionals to study business forecasting or employment interviewing would likely be considered ridiculous. We believe that the quality of organizational research is often judged automatically using heuristics like representativeness (see Kardes, 1996, for a similar argument about consumer research). For example, one may find less compelling a study on leader-member exchange if its sample is coaches and athletes on a sports team. Similarly, one would not expect a study on behavior in organizations to take place in a university laboratory. And, one would not expect the behavior of students to resemble that of people in organizations. The pertinent question, however, should be whether there are characteristics of the sample that interact with the constructs under investigation. As Campbell (1986) noted, the real issue is whether the specific work experiences of the research participants influence the phenomena being studied in a way that confounds the results of the study.
Concluding Thoughts
The question of whether and how much samples matter has been a source of controversy throughout the history of behavioral science research and has been especially contentious in applied areas such as consumer, political, and organizational psychology. One could argue that, with regard to organizational research, the samples-matter camp is winning the fight.
Evidence for this is the fact that outlets that once published studies utilizing student samples, such as Academy of Management Journal, Journal of Management, and Journal of Organizational Behavior, rarely do anymore. Indeed, until recently, Journal of Occupational and Organizational Psychology directed researchers using student samples to send their manuscripts elsewhere.
We believe that consumers, producers, and gatekeepers of organizational research need to consider the possible consequences of requiring researchers to utilize only “real people” in their studies. Experimenters are unlikely to move their experiments to the field, given the extraordinary constraints to doing field experiments. Moreover, organizational members are unlikely to have incentives to participate in laboratory research, given the dearth of federal funding for organizational research, along with corporations’ disinclination to invest in behavioral science research and development. Organizational research would eventually be characterized by an exclusive focus on passive observation, despite the fact that experiments are the only way to test many key propositions about behavior in organizations. When we read and evaluate research, we must abandon our simple heuristics (e.g., convenience samples are bad) and replace them with the following questions:
• Did the research question contain a specific and well-defined population of interest? Sometimes, researchers are interested in describing and predicting the behavior of a well-defined population. For example, a study might be focused on the reactions of obese job applicants to requests for photographs or on the reactions of nurses to changes in health policies. It may not make sense to have people imagine themselves in these roles.
• Is there a characteristic of the convenience sample that may interact with the variables of interest in the study? Answering this one requires considerable thought. Do not just assume that participant experience is a prerequisite for understanding the relation between two or more variables. As just one example, we know a lot about the factors that affect the quality of high-stakes negotiations by studying inexperienced negotiators haggling over low-stakes outcomes.
• Is participant motivation relevant to this study? Sometimes it is necessary for people in a sample to experience the role pressures or incentives inherent in a particular setting for the researcher to draw valid inferences from that sample’s behavior. If so, was the convenience sample motivated in other ways? For instance, an interviewee’s motivation for employment might be substituted with motivation to win a dollar prize for interviewing well. Alternatively, was the convenience sample motivated for other reasons? For instance, the interviewee might be motivated to appear competent before an audience of peers.
• If the researcher is testing a theory, does the theory apply to this sample? In most instances, it is the theories that generalize, not the settings, the samples, or even the effects. For example, a theory about occupational satisfaction and commitment might apply to nurses, coaches, priests, or professional skateboarders. Any one of these samples is appropriate for testing the theory.
Careful thought is required when evaluating the degree to which a researcher’s inferences, based on the behavior of the people in her study, generalize to the behavior of others. As the first author has noted with regard to research setting, “When we get caught up in the distinctiveness of the setting, it implies that we are testing effects in settings, rather than testing theories that should apply to multiple (especially organizational) settings” (Highhouse, in press). A similar argument applies to samples. That is, when we get caught up in the superficial similarities between our research participants and “real” organizational members, we get distracted from what is really important.
Author Note
We are grateful to Maggie Brooks, Dalia Diab, and the editors for their contributions to this chapter.
References
Anderson, C. A., Lindsay, J. J., & Bushman, B. J. (1999). Research in the psychological laboratory: Truth or triviality? Current Directions in Psychological Science, 8, 3–9.
Berkowitz, L., & Donnerstein, E. (1982). External validity is more than skin deep: Some answers to criticisms of laboratory experiments. American Psychologist, 35, 463–464.
Birnbaum, M. H., & Martin, T. (2003). Generalization across people, procedures, and predictions: Violations of stochastic dominance and coalescing. In S. L. Schneider & J. Shanteau (Eds.), Emerging perspectives on judgment and decision research. Cambridge, England: Cambridge University Press.
Calder, B. J., Phillips, L. W., & Tybout, A. M. (1981). Designing research for application. Journal of Consumer Research, 8, 197–207.
Camerer, C. F., & Johnson, E. J. (1991). The process-performance paradox in expert judgment: How can experts know so much and predict so badly? In K. A. Ericsson & J. Smith (Eds.), Towards a general theory of expertise: Prospects and limits (pp. 195–217). New York: Cambridge Press.
Campbell, J. P. (1986). Labs, fields, and straw issues. In E. A. Locke (Ed.), Generalizing from laboratory to field settings (pp. 269–279). Lexington, MA: Heath.
Conrad, H. S. (1946). Some principles of attitude-measurement: A reply to “Opinion attitude methodology.” Psychological Bulletin, 43, 570–589.
Dipboye, R. L., & Flanagan, M. F. (1979). Research settings in industrial and organizational psychology: Are findings in the field more generalizable than in the laboratory? American Psychologist, 34, 141–150.
Dobbins, G. H., Lane, I. M., & Steiner, D. D. (1988). A note on the role of laboratory methodologies in applied behavioural research: Don’t throw out the baby with the bath water. Journal of Organizational Behavior, 9, 281–286.
Dunlap, W. P. (1994). Generalizing the common language effect size indicator to bivariate normal correlations. Psychological Bulletin, 116, 509–511.
Eagly, A. H., Karau, S. J., & Makhijani, M. G. (1995). Gender and the effectiveness of leaders: A meta-analysis. Psychological Bulletin, 117, 125–145.
Farber, M. L. (1952). The college student as laboratory animal. American Psychologist, 7, 102.
Fisher, R. A. (1955). Statistical methods and scientific induction. Journal of the Royal Statistical Society, Series B, 17, 69–78.
Gigerenzer, G. (2006). What’s in a sample? A manual for building cognitive theories. In K. Fiedler & P. Juslin (Eds.), Information sampling and adaptive cognition (pp. 239–260). New York: Cambridge University Press.
Gilovich, T. (1991). How we know what isn’t so: The fallibility of human reason in everyday life. New York: The Free Press.
Gilovich, T., Griffin, D., & Kahneman, D. (2002). Heuristics and biases: The psychology of intuitive judgment. Cambridge, England: Cambridge University Press.
Gilovich, T., Vallone, R., & Tversky, A. (1985). The hot hand in basketball: On the misperception of random sequences. Cognitive Psychology, 17, 295–314.
Gordon, M. E., Slade, L. A., & Schmitt, N. (1986). The “science of the sophomore” revisited: From conjecture to empiricism. Academy of Management Review, 11, 191–207.
Greenberg, J. (1987). The college sophomore as guinea pig: Setting the record straight. Academy of Management Review, 12, 157–159.
Grove, W. M., & Meehl, P. E. (1996). Comparative efficiency of informal (subjective, impressionistic) and formal (mechanical, algorithmic) prediction procedures: The clinical-statistical controversy. Psychology, Public Policy, and Law, 2, 293–323.
Highhouse, S. (in press). Designing experiments that generalize. Organizational Research Methods.
Ilgen, D. R. (1986). Laboratory research: A question of when, not if. In E. A. Locke (Ed.), Generalizing from laboratory to field settings (pp. 257–267). Lexington, MA: Heath.
Judge, T. J., Boudreau, J. W., & Bretz, R. D. (1994). Job and life attitudes of male executives. Journal of Applied Psychology, 79, 762–782.
Kahneman, D., & Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.
Kardes, F. R. (1996). In defense of experimental consumer psychology. Journal of Consumer Psychology, 5, 279–296.
Kirkpatrick, L. A., & Epstein, S. (1992). Cognitive-experiential self-theory and subjective probability: Further evidence for two conceptual systems. Journal of Personality and Social Psychology, 63, 534–544.
Kluger, A. N., & DeNisi, A. (1996). The effects of feedback interventions on performance: Historical review, meta-analysis and a preliminary feedback intervention theory. Psychological Bulletin, 119, 254–284.
Kruglanski, A. W. (1975). The two meanings of external invalidity. Human Relations, 66, 373–382.
Kubeck, J. E., Delp, N. D., Haslett, T. K., & McDaniel, M. A. (1996). Does job-related training performance decline with age? Psychology and Aging, 11, 92–107.
Locke, E. A. (1986). Generalizing from laboratory to field: Ecological validity or abstraction of essential elements? In E. A. Locke (Ed.), Generalizing from laboratory to field settings (pp. 257–267). Lexington, MA: Heath.
Lord, R. G., & Hall, R. J. (2005). Identity, deep structure and the development of leadership skill. The Leadership Quarterly, 16, 591–615.
Lynch, J. G. (1982). On the external validity of experiments in consumer research. Journal of Consumer Research, 10, 109–111.
McNemar, Q. (1946). Opinion-attitude methodology. Psychological Bulletin, 43, 289–374.
Mook, D. (1983). In defense of external invalidity. American Psychologist, 38, 379–387.
Oakes, W. (1972). External validity and the use of real people as subjects. American Psychologist, 27, 959–962.
Peterson, R. A. (2001). On the use of college students in social science research: Insights from a second-order meta-analysis. Journal of Consumer Research, 28, 450–461.
Pulakos, E. D., Schmitt, N., Whitney, D., & Smith, M. (1996). Individual differences in interviewer ratings: The impact of standardization, consensus discussion, and sampling error on the validity of a structured interview. Personnel Psychology, 49, 85–102.
Rosenthal, R. (1965). The volunteer subject. Human Relations, 18, 389–406.
Runkel, P. J., & McGrath, J. E. (1972). Research on human behavior: A systematic guide to method. New York: Holt, Rinehart, & Winston.
Sackett, P. R., & Larson, J. R., Jr. (1990). Research strategies and tactics in industrial and organizational psychology. In M. D. Dunnette & L. M. Hough (Eds.), Handbook of industrial and organizational psychology (2nd ed., Vol. 1, pp. 419–489). Palo Alto, CA: Consulting Psychologists Press.
Sagie, A. (1994). Participative decision making and performance: A moderator analysis. Journal of Applied Behavioral Science, 30, 227–246.
Shadish, W., Cook, T. D., & Campbell, D. (2002). Experimental and quasiexperimental design. Boston: Houghton Mifflin Company.
Sherden, W. A. (1998). The fortune sellers: The big business of buying and selling predictions. New York: John Wiley.
Stone-Romero, E. F. (2002). The relative validity and usefulness of various empirical research designs. In S. G. Rogelberg (Ed.), Handbook of research methods in industrial and organizational psychology (pp. 77–98). Malden, MA: Blackwell.
Tversky, A., & Kahneman, D. (1974). Judgment under uncertainty: Heuristics & biases. Science, 185, 1124–1131.
11 Sample Size Rules of Thumb
Evaluating Three Common Practices
Herman Aguinis and Erika E. Harden
This chapter provides a description and critical analysis of three rules of thumb related to sample size that are commonly used by researchers in the organizational and social sciences. Thus, similar to the chapter by Vandenberg and Grelle (2008), our chapter does not address faulty assumptions or improper citations that can be traced back to an original source and have risen to the category of “statistical and methodological myths and urban legends.” Instead, we provide a critical analysis of these rules of thumb that we hope will be useful to researchers in their own work as well as to journal reviewers who evaluate the work of others. We also hope that by discussing these rules of thumb critically we will prevent them from becoming statistical and methodological myths and urban legends in the future.
Our chapter is about inferences regarding estimated relationships between variables and latent constructs or between observed indicators and latent constructs. Thus, our chapter addresses rules of thumb about sample size related to internal, construct, and statistical conclusion validity but does not address issues of external validity (i.e., what sample size is needed to be able to generalize results across populations).
We wanted to minimize the impact of our subjective opinion on the process of identifying any existing rules of thumb. So, rather than discussing what we think are some of the existing rules of thumb that researchers use, we adopted an inductive approach for identifying any existing rules. Specifically, we conducted an in-depth review of
the Method, Results, and Discussion sections for each of approximately 1,260 articles published between 2000 and 2006 in the following journals:
• Academy of Management Journal
• Administrative Science Quarterly
• Journal of Applied Psychology
• Personnel Psychology
• Strategic Management Journal
We selected the above journals because they arguably publish some of the most methodologically sophisticated and rigorous empirical research in the field of management. If rules of thumb that may not be appropriate, or are used inappropriately, are invoked frequently by researchers publishing in these journals, it is likely that these rules are used by researchers publishing in many other journals as well. Our inductive study consisted of searching for statements and justifications that authors used that involved sample size. We found 102 articles (i.e., about 8.2% of all articles included in our literature review) that included a statement in which authors explained how they chose the sample size they had, described consequences of their particular sample size, or explained or justified a result in relationship to their sample size. We identified the following commonly invoked rules of thumb related to sample size:
1. Determine whether sample size is appropriate by conducting a power analysis using Cohen’s definitions of small, medium, and large effect size.
2. Increase the a priori Type I error rate to .10 because of a small sample size.
3. Sample size should include at least 5 observations per estimated parameter in covariance structure analyses.
Next, we critically analyzed each of these three rules of thumb by answering the following questions: Where did these rules come from? What did the attributed sources really say about them? How much merit do these rules really have? Should we continue using these rules of thumb or should we abandon them altogether?
Determine Whether Sample Size Is Appropriate by Conducting a Power Analysis Using Cohen’s Definitions of Small, Medium, and Large Effect Size
A crucial step in designing a study is determining sample size because N is one of the key determinants of statistical power. Statistical power is the probability of detecting an effect that exists in the population. The greater the sample size, the greater the statistical power. Statistical power is 1 – β, where β is the Type II error rate (i.e., the probability of not detecting an existing effect). In addition to sample size, power is affected by the size of the effect in the population (i.e., the greater the effect, the greater the power), and by the Type I error rate (i.e., α), which is the probability of falsely concluding that an effect exists. Note that, holding sample size and effect size constant, the Type I and Type II error rates are inversely related. In order to conduct a power analysis to determine what sample size is sufficient to detect an effect, or whether the sample size in hand is sufficient to detect an effect, there is a need to choose a targeted effect size (Aguinis, Boik, & Pierce, 2001). Our review uncovered that a common rule of thumb in conducting a power analysis is to use Cohen’s (1988) definitions of small, medium, and large effect size. For instance, Raver and Gelfand (2005) conducted a power analysis using Cohen’s values and concluded that “[a] power analysis indicated that the power to detect a medium effect with an alpha level of .05 was 46 percent, and the power to detect a large effect was 86 percent (Cohen, 1988)” (p. 394). Similarly, Morgeson and Campion (2002) also used Cohen’s definitions and noted that “[s]tatistical power to detect a significant R² in the regression analysis was 35% for a small effect (R² = .0196, p < .05) and 99% for a medium effect (R² = .13, p < .05; Cohen, 1988)” (p. 598).
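The dependence of power on N, effect size, and α described above can be illustrated with a short calculation. The following is only a minimal sketch, not a procedure taken from any of the cited sources: it assumes a two-sided test of a single correlation and uses the Fisher z normal approximation, and the function name `correlation_power` and the sample sizes in the loop are hypothetical choices made for illustration.

```python
from math import atanh, sqrt
from statistics import NormalDist


def correlation_power(r, n, alpha=0.05):
    """Approximate two-sided power to detect a population correlation r with n cases.

    Assumption: Fisher z normal approximation (illustrative, not exact).
    """
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher z approximation")
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)  # critical value set by the Type I error rate
    shift = atanh(r) * sqrt(n - 3)              # expected test statistic when the effect exists
    # Probability of landing in either rejection region when the effect exists
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Hypothetical sample sizes, crossed with Cohen's (1988) conventional values for r
for n in (50, 100, 200):
    for r in (0.10, 0.30, 0.50):
        power = correlation_power(r, n)
        print(f"N={n:3d}  r={r:.2f}  power={power:.2f}  beta={1 - power:.2f}")
```

Holding α constant, the printed power rises with both N and r, which is why the same N can look adequate or inadequate depending on the effect size that is targeted.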
A perusal of articles published recently in some of the major journals in the organizational and social sciences reveals many additional examples. Consider Kim, Hoskisson, and Wan (2004), who noted that “the precise estimates of effect sizes are generally difficult to obtain, which is a major obstacle to implementing power analysis. Following Lane and colleagues (1998), we rely on general approximations of small, medium, and large effect size as suggested by Cohen (1992)” (p. 625). Likewise, Brews and Tucci (2004) argued that “[o]ur large sample size alleviates concerns about statistical power (Schwenk & Dalton, 1991; Ferguson & Ketchen, 1999). We have adequate power to detect small, medium, and large effects” (p. 437). Finally, Brown (2001) also used
Cohen’s definitions in his power analysis and stated that “this study is limited by a relatively small sample size and modest reliabilities of some measures. Although the power to detect moderate effects (r = .30) at the .05 alpha level with this sample is .78, the power to detect small effects (r = .10) is only .14 (Cohen & Cohen, 1983)” (p. 292). What did Cohen really recommend about the procedures to select a targeted effect size in conducting a power analysis to assess whether one’s sample size is sufficiently large? Did he recommend that researchers use specific values for small, medium, and large effects? Did these values remain consistent over time? How did he come up with these values? Let’s consider the cited sources. Cohen (1992) noted that “researchers find specifying ES [effect size] the most difficult part of power analysis” (p. 156). To address this issue, Cohen, Cohen, West, and Aiken (2003; based largely on Cohen & Cohen, 1983, pp. 59–60) outlined the following three strategies for identifying an appropriate effect size in power analysis:
1. To the extent that studies that have been carried out by the current investigator or others are closely similar to the present investigation, the ESs found in these studies reflect the magnitude that can be expected.
2. In some research areas an investigator may posit some minimum population effect that would have either practical or theoretical significance.
3. A third strategy in deciding what ES values to use in determining the power of a study is to use certain suggested conventional definitions of small, medium, and large effects… This option should be looked upon as the default option only if the earlier noted strategies are not feasible. (p. 52)
Consider the history behind the conventional definitions of small, medium, and large effect, which should be used only if the other strategies are not feasible. As described by Aguinis, Beaty, Boik, and Pierce (2005), Cohen’s first published description of specific magnitudes for effects appeared in his 1962 Journal of Abnormal and Social Psychology article. In this article, Cohen reported results of a review and content analysis of articles published in the 1960 volume of this same journal. In the Method section of his article, when describing the effect sizes he used for his power analysis, Cohen (1962, p. 147) noted that “the level of average population proportion at which the power of the test was computed was the average of the sample proportions
found” and “the sample values were used to approximate the level of population correlation of the test.” For the correlation coefficient, Cohen defined .40 as medium because this seems to have been close to the average observed value he found in his review. Then, he chose the value of .20 as small and .60 as large. In other words, Cohen’s definitions of small, medium, and large effect sizes are based in part on observed values as reported in the articles published in the 1960 volume of Journal of Abnormal and Social Psychology, and in part on his own subjective opinion. Later, Cohen (1988) lowered these values to .10 (small), .30 (medium), and .50 (large) because the originally defined values seemed a bit too high. Given the history behind the conventional values for small, medium, and large effects, it is not surprising that Cohen (1992) himself acknowledged that these definitions “were made subjectively” (p. 156).
In sum, numerous researchers conduct a power analysis to determine whether a study’s sample size is sufficiently large to detect an effect using Cohen’s conventional definitions of effect sizes. A critical analysis of this practice in light of the sources invoked to support its use leads to the following conclusions. First, Cohen mentioned that using his admittedly conventional values is only one of three procedures for identifying a targeted effect size to be used in a power analysis to assess whether a study’s sample size is sufficiently large (Cohen & Cohen, 1983, pp. 59–60). In fact, this strategy should be used only as a last resort and only if the other two preferred strategies are not feasible. However, many researchers seem to focus on this procedure to the exclusion of the other two. Second, Cohen himself noted that his values for small, medium, and large effects are subjective.
In fact, he changed the values for small, medium, and large effects over time with no apparent reason but his subjective opinion that these values should be modified downward.
Discussion
Statistical power is the probability of detecting an effect that indeed exists in the population. Sample size is one of the key factors that affect statistical power. If statistical power is not sufficient, one risks the possibility of erroneously concluding that there is no effect in the population. Thus, when an effect is not found, journal reviewers usually request that a power analysis be conducted to assess whether
a study’s sample size was sufficiently large. At that point, a researcher must make a decision about what targeted effect size to use because choosing a large effect may lead to the conclusion that a particular N was sufficiently large, but this same N may not be sufficiently large to detect a smaller effect. In short, a particular sample size may be seen as adequate or not depending on the targeted effect size used in the power analysis. Although Cohen suggested three strategies for identifying the effect size to be used in a power analysis, most researchers use the effects that Cohen labeled small, medium, and large. Per Cohen’s own admission, these values are largely subjective. As our review indicates, they were initially derived from a very narrow literature review of articles published in the 1960 volume of the Journal of Abnormal and Social Psychology. However, using these values is a pervasive practice, perhaps because it is more convenient to do so as compared to using the other two preferred strategies for identifying targeted effect sizes (i.e., an effect size derived from previous literature or an effect size that is scientifically or practically significant). The two preferred strategies for identifying a targeted effect size used in a power analysis point to the need to take into account the specific research context and domain in question and to not rely on broad-based conventions. For example, Cohen (1988) wrote that, for the f² effect size, .02 is a “small effect.” However, Aguinis et al. (2005) conducted a 30-year review of all articles in Academy of Management Journal, Journal of Applied Psychology, and Personnel Psychology that used moderated regression to test hypotheses about categorical moderator variables and found that the median effect size is f² = .002 (i.e., 10 times smaller than what Cohen labeled as a small effect).
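How strongly the verdict on a sample size depends on the targeted effect size can be sketched numerically. The example below is an illustration under stated assumptions rather than a recommended procedure: it uses a two-sided test of a correlation, the Fisher z normal approximation, and a target power of .80, and the function name `required_n` is hypothetical. The three targeted values are Cohen’s (1988) conventional values for r mentioned earlier.

```python
from math import atanh, ceil
from statistics import NormalDist


def required_n(r, alpha=0.05, power=0.80):
    """Approximate N needed to detect a population correlation r (two-sided test).

    Assumption: Fisher z normal approximation; exact methods can differ slightly.
    """
    if not 0 < abs(r) < 1:
        raise ValueError("r must be nonzero and strictly between -1 and 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # set by the Type I error rate
    z_power = std_normal.inv_cdf(power)          # set by the desired power (1 - beta)
    return ceil(((z_alpha + z_power) / atanh(r)) ** 2 + 3)


# Cohen's (1988) conventional values for r; a context-specific target could be substituted
for label, r in (("small", 0.10), ("medium", 0.30), ("large", 0.50)):
    print(f"{label:6s} r={r:.2f}  approximate N needed: {required_n(r)}")
```

Because the required N grows roughly with the inverse square of the targeted effect, a target taken from prior studies in the same domain can change the conclusion about sample adequacy far more than any other input.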
Cohen (1988) himself recommended that context be taken into account in choosing a targeted effect size in a power analysis when he wrote that effect sizes are relative not only to each other but also “to the area of behavioral science or even more particularly to the specific content and research methods being employed in any given investigation” (p. 25). Finally, also related to the importance of placing a particular effect within its context, it is generally not appropriate to equate Cohen’s “small” (which requires a large N to be detected) effect with “unimportant effect” and Cohen’s “large” effect (which requires a smaller N to be detected) with “important effect.” In some contexts, what seems to be a small effect can actually have important
consequences. For example, Martell, Lane, and Emrich (1996) found that an effect size of 1% regarding male-female differences in performance appraisal scores led to only 35% of the highest-level positions being filled by women. Accordingly, Martell et al. (1996) concluded that “relatively small sex bias effects in performance ratings led to substantially lower promotion rates for women, resulting in proportionately fewer women than men at the top levels of the organization” (p. 158). Aguinis (2004) and Aguinis et al. (2005) described several additional illustrations of how, in some contexts, effects that are labeled as “small” based on Cohen’s definitions actually have very significant consequences for both theory and practice.
Summary:
The rule of thumb: Researchers determine the appropriateness of a particular sample size by conducting a power analysis using Cohen’s definitions of small, medium, and large effect size.
The kernel of truth: The use of Cohen’s small, medium, and large effect size is only one of three methods that he recommended, and the least preferred of the three, to determine sample size via a power analysis.
The inappropriate application of the rule of thumb: The definitions of small, medium, and large effect size are believed to have been determined objectively and can be used regardless of research context and domain.
The follow-up: Future research is needed to understand the size of minimally meaningful targeted effect sizes in various research contexts and research domains.
Increase the A Priori Type I Error Rate to .10 Because of Your Small Sample Size
Recall that statistical power is 1 – β, β is the Type II error rate, and β is inversely related to α (i.e., Type I error rate). In the presence of what is seen as a small N, many authors decide to increase the a priori α from the usual .01 and .05 values to .10 or even .20 to decrease β and increase statistical power. Our review revealed that this is a fairly common rule of thumb.
For example, Brown (2003) noted that “[g]iven that the sample was now relatively small (i.e., 41 teams), an α level of .10 was used for all hypothesis testing following the recommendations of Kervin (1992)” (p. 951). Likewise, Garg, Walters, and Priem (2003) argued that “our sample size is not overly large; it is appropriate to use a less conservative criterion for statistical significance (Sauley & Bedeian, 1989; Skipper, Guenther, & Nass, 1967). We
therefore selected .1, a priori, as the appropriate level of significance for testing our hypotheses” (p. 734). As another illustration, Boland, Singh, Salipante, Aram, Fay, and Kanawattanachai (2001) increased their a priori α to .20 using the justification that their sample was small. Specifically, they stated that “[t]he small sample required that we balance Type I and Type II error rates in statistical testing. At a traditional 95 percent confidence level, the power is only .20 (Cohen, 1977), given an average cell size of 12. Stevens (1996: 172) recommended a more ‘lenient’ alpha level as a way to improve power. We chose an 80 percent confidence level to ensure at least a power of 0.50. Thus, we set the Type I error rate at 20 percent” (p. 399).
As shown by the above illustrations, relaxing the a priori α level to .10 or even .20 is a methodological practice often implemented when a study includes a small sample. Increasing the α level increases statistical power and the chances of detecting an existing effect. However, is this practice really justified by the cited sources? In other words, do the cited sources actually suggest increasing alpha to the specific value of .10 or even .20? Why not .15? Or .40, for that matter? Let’s consider the evidence. Sauley and Bedeian (1989) is often invoked as a source in support of the increase of α to .10. In discussing research studies with small samples, Sauley and Bedeian noted that
when either sample size or anticipated effect size are small, a researcher should typically select a less conservative level of significance (e.g., .10 vs. .05). (p. 340)
However, these authors also noted that there is no right or wrong level of significance. Blind adherence to the .05 level of significance as the crucial value for differentiating publishable from unpublishable research cannot be justified. As Skipper et al. (1967) suggest, the selection of a significance level by a researcher should be treated as one more research parameter. Rather than being set at a priori levels of .05, .01, or whatever, the appropriateness of specific level of significance should be based upon considerations such as… sample size, effect size, measurement error, practical consequences of rejecting the null hypothesis, coherence of the underlying theory, degree of experimental control, and robustness. (p. 339)
Kervin (1992), which is another source used in support of an increase of α to .10, noted that
[s]ince sampling error is larger with smaller samples, you may want to be more lenient (larger alpha) with smaller samples, other matters being equal, in order to avoid low research power. (p. 557)
Finally, in another one of the sources cited in support of an increase in the a priori α to .10, Stevens (1996) argued that when one has a small sample, it might be prudent to abandon the traditional α levels of .01 or .05 to a more liberal α level to improve power sharply. Of course, one does not get something for nothing. We are taking a greater risk of rejecting falsely, but that increased risk is more than balanced by the increase in power. (p. 137)
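The trade-off these sources describe, a higher risk of rejecting falsely in exchange for more power, can be made concrete with a small calculation. This is a hedged sketch only: it assumes a two-sided comparison of two group means under a normal approximation (which tends to overstate power somewhat in very small samples), and the standardized difference of 0.5, the 20 cases per group, and the function name `two_group_power` are hypothetical values chosen for illustration, not figures from the studies quoted above.

```python
from math import sqrt
from statistics import NormalDist


def two_group_power(d, n_per_group, alpha):
    """Approximate two-sided power for comparing two group means.

    Assumptions: equal group sizes, standardized mean difference d,
    normal approximation to the test statistic.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)  # expected test statistic when the effect exists
    return (1 - std_normal.cdf(z_crit - shift)) + std_normal.cdf(-z_crit - shift)


# Hypothetical small-sample scenario: d = 0.5 with 20 cases per group
for alpha in (0.01, 0.05, 0.10, 0.20):
    power = two_group_power(d=0.5, n_per_group=20, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={power:.2f}  beta={1 - power:.2f}")
```

Each step up in α buys additional power, but the size of the gain depends on the effect size and N, so no single relaxed value such as .10 or .20 follows from the calculation itself.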
In sum, the recommendation that we increase our a priori α level to .10 is fairly common in the literature as a means to increase statistical power in the presence of a small sample size. However, a careful examination of this recommendation in light of the sources used to support this practice leads to the following conclusions. First, the practice of increasing the a priori α is reasonable and leads to increased statistical power. Second, however, the practice of increasing α to the specific value of .10 or even .20 is subject to the criticism that these values are arbitrary, much like the values of .05 and .01 are also arbitrary. Moreover, without taking into account the research context (e.g., negative consequences of incorrectly concluding there is an effect as a consequence of a Type I error), the practice of increasing the α level to an arbitrarily selected greater value may be as detrimental as, or even more detrimental than, having a small sample size, insufficient statistical power, and erroneously concluding that there is no effect.
Discussion
In the organizational and social sciences, researchers usually adopt the conventional .05 and .01 values for the a priori α (i.e., probability of erroneously concluding that there is an effect). As noted above, many authors choose to increase α to .10 or .20. However, this choice is seldom justified and no discussion is usually provided regarding the trade-offs involved (i.e., increase in the probability of committing a Type I error). If one wishes to increase power by increasing α, one should make an informed decision about the specific
trade-off between Type I and Type II errors rather than choosing an arbitrarily larger value for α. Murphy and Myors (1998) suggested a useful way to weigh the pros and cons of increasing the Type I error rate for a specific research situation. The appropriate balance between Type I and Type II error rates can be achieved by using a preset Type I error rate that takes into account the Desired Relative Seriousness (DRS) of making a Type I versus a Type II error. Because Type II error = 1 – power, this strategy is also useful for choosing an appropriate Type I error in relation to statistical power. Instead of increasing α to an arbitrary value, researchers can make a more informed decision regarding the specific value to give to α.
Consider the following situation described by Aguinis (2004, pp. 86–87). A researcher is interested in testing the hypothesis that the effectiveness of a training program for unemployed individuals varies by region such that the training program is more effective in regions where the unemployment rate is higher than 6%. Assume this researcher decides that the probability of making a Type II error (i.e., β, incorrectly concluding that unemployment rate in a region is not a moderator) should not be greater than .15. The researcher also decides that the seriousness of making a Type I error (i.e., incorrectly concluding that percentage of unemployment in a region is a moderator) is twice as serious as making a Type II error (i.e., DRS = 2). Assume the researcher makes the decision that DRS = 2 because a Type I error means that different versions of the training program would be needlessly developed for various regions and this would represent a waste of the limited resources available. The desired preset Type I error can be computed as follows (Murphy & Myors, 1998):
\[
\alpha_{\text{desired}} = \left[ \frac{p(H_1)\,\beta}{1 - p(H_1)} \right] \left( \frac{1}{DRS} \right) \tag{11.1}
\]
where p(H1) is the estimated probability that the alternative hypothesis is true (i.e., there is a moderating effect), β is the Type II error rate, and DRS is a judgment of the seriousness of a Type I error vis-à-vis the seriousness of a Type II error. For this example, assume that based on a strong theory-based rationale and previous experience with similar training programs,
the researcher estimates that the probability that the moderator hypothesis is correct is p(H1) = .6. Solving Equation 11.1 yields
\[
\alpha_{\text{desired}} = \left[ \frac{(.6)(.15)}{1 - .6} \right] \left( \frac{1}{2} \right) = .1125 \approx .11.
\]
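The computation in Equation 11.1 is simple enough to script, which makes it easy to see how the result moves when the inputs change. The snippet below is a minimal sketch of that arithmetic; the function name `desired_alpha` and its argument names are hypothetical labels introduced here, and the inputs are the values from the training-program example.

```python
def desired_alpha(p_h1, beta, drs):
    """Preset Type I error rate from Equation 11.1.

    p_h1: estimated probability that the alternative hypothesis is true
    beta: acceptable Type II error rate
    drs:  desired relative seriousness of a Type I versus a Type II error
    """
    if not 0 < p_h1 < 1:
        raise ValueError("p_h1 must be strictly between 0 and 1")
    if not 0 < beta < 1:
        raise ValueError("beta must be strictly between 0 and 1")
    if drs <= 0:
        raise ValueError("drs must be positive")
    return (p_h1 * beta) / (1 - p_h1) * (1 / drs)


# Values from the training-program example: p(H1) = .6, beta = .15, DRS = 2
alpha = desired_alpha(p_h1=0.6, beta=0.15, drs=2)
print(f"desired alpha = {alpha:.4f}")
```

Treating a Type I error as more serious (a larger DRS) lowers the resulting α, whereas a stronger prior belief in the alternative hypothesis raises it.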
Thus, in this particular example, using a nominal Type I error rate of .11 would yield the desired level of balance between Type I and Type II statistical errors. Implementing this procedure for choosing the specific a priori Type I error rate provides a more informed and better justification than using any arbitrary value such as .10 or .20 without carefully considering the trade-offs and consequences of this choice. Also, implementing this more informed strategy for selecting an a priori α is less likely to raise concerns among journal editors and reviewers as compared to selecting any arbitrary value.
Summary:
The rule of thumb: When faced with a small sample, researchers increase the a priori Type I error rate to .10 or even .20 as a means to increase statistical power.
The kernel of truth: Increasing Type I error will increase statistical power (i.e., probability of detecting existing effects).
The inappropriate application of the rule of thumb: Increasing Type I error rate to .10, .20, or any other arbitrarily selected value is assumed to be beneficial regardless of research context and research domain.
The follow-up: Future research is needed to understand the trade-offs involved in making Type I in relation to Type II errors in various research contexts and research domains.
Sample Size Should Include at Least 5 Observations per Estimated Parameter in Covariance Structure Analyses
It seems to be common knowledge that a factor analysis should include 5 observations per estimated parameter. This 5:1 ratio seems to be a common recommendation and is followed not only in the context of factor analysis but also in testing the fit of a measurement model before testing a substantive structural model in structural equation modeling, path analysis, and other types of analyses based on covariance structures (e.g., Pierce, Aguinis, & Adams, 2000; Pierce, Broberg, McClure, & Aguinis, 2004). The 5:1 ratio rule is also
used by authors in referring to structural models, not just measurement models. Bentler’s work is a source often cited in support of the 5:1 ratio rule of thumb. For example, Kinicki, Prussia, Wu, and McKee-Ryan (2004) stated that “Bentler (1990) recommends a minimum of five cases for each estimated parameter in structural models” (p. 1061). Likewise, Epitropaki and Martin (2004) noted that their results should be interpreted with caution because “the minimum 5:1 cases per parameter (Bentler, 1995) is still not met in those six groups” (p. 304). Additionally, Takeuchi, Yun, and Tesluk (2002) cited Bentler and Chou (1987) when stating that “[i]t is recommended that in SEM, the ratio of respondents to parameters estimated should be at least 5:1” (p. 660). Finally, as an additional illustration, Sturman and Short (2000) noted that “although strict guidelines for minimum sample sizes do not exist (Anderson & Gerbing, 1988), our sample of 416 exceeds the minimum of 200 recommended by Boomsma (1982), and our sample size to parameter ratios of at least 8:1 exceed the suggested minimum of 5:1 for reliable maximum likelihood estimation (Bentler, 1985)” (p. 685).
What is the origin of the 5:1 ratio rule? Did Bentler (1985) really say that we need 5 observations per parameter estimated in a covariance structure analysis to obtain trustworthy estimates? Let’s consider the evidence. In a frequently cited source used to invoke this rule of thumb, Bentler (1985) noted the following:
An over-simplified guideline regarding the trustworthiness of solutions and parameter estimates might be the following. The ratio of sample size to number of free parameters to be estimated may be able to go as low as 5:1 under normal elliptical theory. Although there is little experience on which to base a recommendation, a ratio of at least 10:1 may be more appropriate for arbitrary distributions. (p. 3)
These ratios need to be larger to obtain trustworthy z-tests on the significance of parameters, and still larger to yield correct model evaluation chi-square probabilities. (p. 3)
Two years later, Bentler and Chou (1987, p. 90) identified “large” sample size as one of the statistical requirements of structural equation modeling because “the statistical theory is based on ‘asymptotic’ theory, that is, the theory that describes the behavior of statistics as the sample size becomes arbitrarily large (goes to infinity). In practice, samples
can be small to moderate in size, and the question arises whether large sample statistical theory is appropriate in such situations.” Bentler and Chou provided a virtually verbatim “oversimplified guideline” from Bentler (1985) to serve as a rule of thumb regarding the ratio of number of observations per parameters estimated in a model:
The ratio of sample size to number of free parameters may be able to go as low as 5:1 under normal and elliptical theory, especially when there are many indicators of latent variables and the associated factor loadings are large. Although there is even less experience on which to base a recommendation, a ratio of at least 10:1 may be more appropriate for arbitrary distributions. These ratios need to be larger to obtain trustworthy z-tests on the significance of parameters, and still larger to yield correct model evaluation chi-square probabilities. (p. 91)
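To see what these ratios imply for a given model, one can count the free parameters and divide the sample size by that count. The sketch below is illustrative only and rests on assumptions that are not part of the quoted guideline: a confirmatory factor model in which each indicator loads on exactly one factor, factor variances are fixed to 1, there are no correlated errors, and no mean structure is estimated. The function names and the example model (12 indicators, 3 factors, N = 200) are hypothetical.

```python
def cfa_free_parameters(n_indicators, n_factors):
    """Count free parameters in a simple-structure CFA (assumptions in the lead-in)."""
    loadings = n_indicators                               # one loading per indicator
    unique_variances = n_indicators                       # one error variance per indicator
    factor_covariances = n_factors * (n_factors - 1) // 2 # all factors allowed to covary
    return loadings + unique_variances + factor_covariances


def ratio_check(n_cases, n_parameters):
    """Compare the cases-per-parameter ratio with the 5:1 and 10:1 values in the quote."""
    ratio = n_cases / n_parameters
    return {
        "ratio": ratio,
        "meets_5_to_1": ratio >= 5,    # lower bound under normal and elliptical theory
        "meets_10_to_1": ratio >= 10,  # suggested for arbitrary distributions
    }


# Hypothetical model: 12 indicators, 3 correlated factors, 200 respondents
n_params = cfa_free_parameters(n_indicators=12, n_factors=3)
print(n_params, ratio_check(n_cases=200, n_parameters=n_params))
```

A sample that clears the 5:1 lower bound can still fall short of the 10:1 value, and, per the quoted passage, of the larger ratios needed for trustworthy significance tests and chi-square probabilities.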
In sum, having an appropriate number of observations per estimated parameter in a factor analysis, as in any covariance structure analysis, is obviously an important issue. Not having a sufficient number of observations will lead to unstable and untrustworthy parameter estimates. However, a closer examination of the 5:1 ratio as described in the cited sources leads to the following conclusions. First, this is a lower-bound value and an oversimplified rule of thumb and not necessarily a desirable value. Invoking the 5:1 rule of thumb in support of the conclusion that a particular sample size is ideal is misleading. Second, this ratio applies to situations in which multivariate normality has been observed, which is an unusual situation in the organizational and social sciences. In fact, when multivariate normality is not present, a ratio of at least 10 observations per estimated parameter is recommended for obtaining trustworthy estimates of parameters. Moreover, an even larger number of observations is required to obtain trustworthy estimates of the statistical significance of parameters.
Discussion
Researchers seem to focus on what is an “oversimplified” guideline of 5 observations per parameter. Moreover, this guideline applies only to situations in which the data follow a multivariate normal distribution, which is not typical in the organizational and social sciences. This oversimplified guideline of 5 observations per estimated parameter should be seen as a lower-bound value and not
Herman Aguinis and Erika E. Harden
necessarily a desirable value, particularly when the multivariate normality assumption is violated. Invoking the 5:1 rule of thumb to claim that a particular study has the ideal sample size and follows best practices is misleading.

Summary:
The rule of thumb: Sample size should be such that there are at least 5 observations per estimated parameter in a factor analysis and other covariance structure analyses.
The kernel of truth: This oversimplified guideline seems appropriate in the presence of multivariate normality.
The inappropriate application of the rule of thumb: The 5:1 ratio is believed to be an ideal and best-practice research scenario.
The follow-up: Future research is needed to understand the appropriateness of the 5:1 ratio in the presence of multivariate normality and for various degrees of model complexity.

Discussion

In this chapter, we have discussed three rules of thumb related to sample size that, based on a review of articles published from 2000 to 2006 in some of the most prestigious journals in management, are invoked quite commonly. Table 11.1 summarizes each rule of thumb, the kernel of truth, the inappropriate application of each rule of thumb, and the research needed regarding each of these rules of thumb. Why are these rules of thumb used? We can only speculate on the reasons, but we suspect that some authors may invoke these rules of thumb as a preemptive strike to counter a potential criticism from a reviewer when results do not turn out as predicted (e.g., there is lack of support for a hypothesized effect). Others may invoke these rules as a response to a criticism from a reviewer (e.g., “your sample size is not sufficient for a covariance structure analysis,” “your small sample size led to insufficient statistical power to detect population effects”) or even at the direction of a reviewer or a journal editor (e.g., “given your small sample size, you must conduct a power analysis using Cohen’s definitions of effect size”).
Regardless of the reason for invoking these rules, we emphasize that our focus is on a critical analysis of these rules and not on specific authors who have used them. It is not our intention to point fingers and blame specific authors. In fact, we are ourselves guilty of using some of the rules of thumb we critically analyzed in this chapter (e.g., Aguinis & Stone-Romero, 1997, used Cohen’s definitions of small, medium, and large effect sizes).
Sample Size Rules of Thumb
Table 11.1 Critical Analysis Summary for the Three Rules of Thumb Related to Sample Size

The rule of thumb
(1) We should determine the appropriateness of N by conducting a power analysis using Cohen’s definitions of small, medium, and large effect size.
(2) When faced with a small N, we should increase the a priori Type I error rate to .10 or even .20 as a means to increase statistical power.
(3) N should include at least 5 observations per estimated parameter in a factor analysis and other covariance structure analyses.

The kernel of truth
(1) The use of Cohen’s small, medium, and large effect sizes is only one of three methods (but the least preferred) that can be used to determine the appropriateness of N via a power analysis.
(2) The increase of Type I error will increase statistical power (i.e., probability of detecting existing effects).
(3) This oversimplified guideline seems appropriate in the presence of multivariate normality.

The inappropriate application of the rule of thumb
(1) The definitions of small, medium, and large effect size are believed to have been determined objectively and can be used regardless of research context and domain.
(2) The increase of Type I error rate to .10, .20, or any other arbitrarily selected value is assumed to be beneficial regardless of research context and research domain.
(3) The 5:1 ratio is assumed to be an ideal and best-practice research scenario.

Research needed
(1) What is the size of minimally meaningful targeted effect sizes in various research contexts and research domains?
(2) What are the trade-offs involved in making Type I in relation to Type II errors in various research contexts and research domains?
(3) What is the appropriateness of the 5:1 ratio in the presence of multivariate normality and for various degrees of model complexity?
The first question we discussed is, Should we determine sample size by conducting a power analysis using Cohen’s conventional definitions of small, medium, and large effect sizes? The answer to this question is no. First, Cohen’s values are, by his own admission, largely subjective and may not be relevant in many research domains in the organizational and social sciences. Second, one should take context into account in choosing a targeted effect size for a power analysis. In many situations, what is commonly labeled as a small effect can have great significance for science and practice. Finally, rather than using Cohen’s definitions, there are two preferred strategies for identifying a targeted effect size in a power analysis: (a) derive it from previous literature or (b) choose an effect size that will have significant implications for theory and practice. Unfortunately, a power analysis based on Cohen’s definitions of effect size is often used as a rationalization for concluding that a specific sample size is sufficiently large. In many cases, this argument is used inappropriately to avoid facing the inconvenient fact that a particular study’s sample size is not sufficiently large to detect effect sizes that are practically or scientifically significant.

The second question we addressed is, When one has a small sample, is it advisable to increase the a priori Type I error rate to .10 or .20 to increase statistical power? The answer to this question is “it depends.” If the increased α value is chosen arbitrarily, then the answer is no. However, if the increased value is chosen after a careful examination of the trade-offs involved between Type I and Type II error, then the answer is yes. Overall, an increase in the a priori Type I error rate is justified if the resulting value is chosen via an informed balancing of the trade-offs involved.
Increasing the Type I error rate to a value chosen through an informed decision is also likely to be more readily accepted by journal editors and reviewers than choosing an arbitrarily larger value (e.g., .10 or .20). Unfortunately, arbitrarily increasing the a priori Type I error rate to .10 or .20 is often used as a rationalization for ignoring the result that the hypothesized effect is not statistically significant at the more traditional .05 or .01 levels. In many cases, as when Cohen’s definitions of effect size are used, this argument is used inappropriately to avoid facing the inconvenient fact that a particular study’s sample size is not sufficiently large to detect effect sizes that are practically or scientifically significant.
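The mechanical side of the first two questions can be sketched in a few lines of Python: power depends on the targeted effect size, the sample size, and the a priori Type I error rate. The sketch below is only an illustration under stated assumptions. It uses the Fisher z approximation for a two-sided test of a zero correlation, which is our choice and not a method prescribed in this chapter, and the function names, the targeted effect sizes, and N = 100 are hypothetical placeholders.

```python
from math import atanh, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal distribution


def power_correlation(rho: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided test of H0: rho = 0.

    Uses the Fisher z approximation, in which atanh(r) is treated as normal
    with standard error 1 / sqrt(n - 3). This is an approximation, not an
    exact power calculation.
    """
    if n <= 3:
        raise ValueError("n must exceed 3 for the Fisher z approximation")
    z_crit = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = atanh(rho) * sqrt(n - 3)  # expected value of the test statistic
    upper = 1 - _STD_NORMAL.cdf(z_crit - shift)  # reject in the upper tail
    lower = _STD_NORMAL.cdf(-z_crit - shift)     # reject in the lower tail
    return upper + lower


def required_n(rho: float, target_power: float = 0.80, alpha: float = 0.05) -> int:
    """Smallest n whose approximate power reaches target_power."""
    if rho == 0:
        raise ValueError("a zero effect can never reach power above alpha")
    n = 4
    while power_correlation(rho, n, alpha) < target_power:
        n += 1
    return n


if __name__ == "__main__":
    # Power rises with alpha for a fixed (hypothetical) effect and sample.
    for alpha in (0.01, 0.05, 0.10, 0.20):
        print(alpha, round(power_correlation(rho=0.20, n=100, alpha=alpha), 3))
    # Required N is very sensitive to the targeted effect size.
    for rho in (0.10, 0.20, 0.30):
        print(rho, required_n(rho))
```

The code shows only that a larger α buys power and that the targeted effect size drives the required N. Whether a given α or targeted effect size is defensible still depends on the trade-offs and research context discussed above, which no calculation can supply.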
The final question we discussed is, Is it true that a sample size that includes 5 observations per estimated parameter in a covariance structure analysis leads to trustworthy estimates? The answer to this question is “it depends.” In most situations in the organizational and social sciences in which the data do not follow a multivariate normal distribution, at least 10 observations per estimated parameter are needed. On the other hand, 5 observations per estimated parameter may suffice when the data are multivariate normal (which is not a frequent situation). Nevertheless, this is an oversimplified rule and a lower-bound value for the number of observations. Thus, researchers should not invoke the 5:1 rule of thumb to support a statement that the sample size is ideal. Unfortunately, the 5:1 rule of thumb is often used as a rationalization for using a sample size that may be too small. In many cases, this argument is used inappropriately to avoid facing the inconvenient fact that a particular study’s sample size is not sufficiently large, resulting in large standard errors and difficulties in replicating the findings in future studies.

In closing, the phrase rule of thumb has many purported origins. One of them is that the phrase originates from some of the many ways that thumbs have been used to draw inferences regarding the alignment or distance of an object by holding the thumb in one’s eye-line, the temperature of brews of beer, or the estimated inch from the joint to the nail. We hope our critical analysis of the three rules of thumb regarding sample size will improve the way organizational and social scientists draw inferences from their own research.

Author Note

An abbreviated version of this manuscript was presented at the annual conference of the Society for Industrial and Organizational Psychology, New York, New York, April 2007.
We thank Bob Vandenberg, Chuck Lance, Gilad Chen, Hank Sims, Larry James, and the Management & Organization doctoral students at the Robert H. Smith School of Business (University of Maryland) for constructive feedback on earlier versions of this manuscript. This research was conducted, in part, while Herman Aguinis was on sabbatical leave from the University of Colorado Denver and holding visiting appointments at the University of Salamanca (Spain) and University of Puerto Rico.
Correspondence and requests for reprints should be addressed to Herman Aguinis, Mehalchin Term Professor of Management, The Business School, University of Colorado, Campus Box 165, P.O. Box 173364, Denver, CO 80217–3364, http://carbon.cudenver.edu/~haguinis

References

Aguinis, H. (2004). Regression analysis for categorical moderators. New York: Guilford.
Aguinis, H., Beaty, J. C., Boik, R. J., & Pierce, C. A. (2005). Effect size and power in assessing moderating effects of categorical variables using multiple regression: A 30-year review. Journal of Applied Psychology, 90, 94–107.
Aguinis, H., Boik, R. J., & Pierce, C. A. (2001). A generalized solution for approximating the power to detect effects of categorical moderator variables using multiple regression. Organizational Research Methods, 4, 291–323.
Aguinis, H., & Stone-Romero, E. F. (1997). Methodological artifacts in moderated multiple regression and their effects on statistical power. Journal of Applied Psychology, 82, 192–206.
Bentler, P. M. (1985). Theory and implementation of EQS: A structural equations program. Los Angeles, CA: BMDP Statistical Software.
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107, 238–246.
Bentler, P. M. (1995). EQS structural equations program manual. Encino, CA: Multivariate Software.
Bentler, P. M., & Chou, C. H. (1987). Practical issues in structural modeling. Sociological Methods & Research, 16, 78–117.
Boland, R. J., Singh, J., Salipante, P., Aram, J. D., Fay, S. Y., & Kanawattanachai, P. (2001). Knowledge representations and knowledge transfer. Academy of Management Journal, 44, 393–417.
Brews, P. J., & Tucci, C. L. (2004). Exploring the structural effects of internetworking. Strategic Management Journal, 25, 429–451.
Brown, K. G. (2001). Using computers to deliver training: Which employees learn and why? Personnel Psychology, 54, 271–296.
Brown, T. C. (2003). The effect of verbal self-guidance training on collective efficacy and team performance. Personnel Psychology, 56, 935–964.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20, 37–46.
Cohen, J. (1962). The statistical power of abnormal-social psychological research: A review. Journal of Abnormal and Social Psychology, 65, 145–153.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. New York: Academic Press.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112, 155–159.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Erlbaum.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Epitropaki, O., & Martin, R. (2004). Implicit leadership theories in applied settings: Factor structure, generalizability and stability over time. Journal of Applied Psychology, 89, 293–310.
Garg, V. K., Walters, B. A., & Priem, R. L. (2003). Chief executive scanning emphases, environmental dynamism, and manufacturing firm performance. Strategic Management Journal, 24, 725–744.
Kervin, J. B. (1992). Methods for business research. New York: Harper Collins.
Kim, H., Hoskisson, R. E., & Wan, W. P. (2004). Power dependence, diversification strategy, and performance in keiretsu member firms. Strategic Management Journal, 25, 613–636.
Kinicki, A. J., Prussia, G. E., Wu, J., & McKee-Ryan, F. M. (2004). Employee response to performance feedback: A covariance structure analysis using Ilgen, Fisher, and Taylor’s (1979) model. Journal of Applied Psychology, 89, 1057–1069.
Lane, P. J., Cannella, A. A., & Lubatkin, M. H. (1998). Agency problems as antecedents to unrelated mergers and diversification: Amihud and Lev reconsidered. Strategic Management Journal, 19, 555–578.
Martel, R. F., Lane, D. M., & Emrich, C. (1996). Male-female differences: A computer simulation. American Psychologist, 51, 157–158.
Morgeson, F. P., & Campion, M. A. (2002). Minimizing tradeoffs when redesigning work: Evidence from a longitudinal quasi-experiment. Personnel Psychology, 55, 589–612.
Murphy, K. R., & Myors, B. (1998). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Mahwah, NJ: Lawrence Erlbaum.
Pierce, C. A., Aguinis, H., & Adams, S. K. R. (2000). Effects of a dissolved workplace romance and rater characteristics on responses to a sexual harassment accusation. Academy of Management Journal, 43, 869–880.
Pierce, C. A., Broberg, B. J., McClure, J. R., & Aguinis, H. (2004). Responding to sexual harassment complaints: Effects of a dissolved workplace romance on decision-making standards. Organizational Behavior and Human Decision Processes, 95, 66–82.
Raver, J. L., & Gelfand, M. J. (2005). Beyond the individual victim: Linking sexual harassment, team processes, and team performance. Academy of Management Journal, 48, 387–400.
Sauley, K. S., & Bedeian, A. G. (1989). .05: A case of the tail wagging the distribution. Journal of Management, 15, 335–344.
Stevens, J. (1996). Applied multivariate statistics for the social sciences (3rd ed.). Mahwah, NJ: Erlbaum.
Sturman, M. C., & Short, J. C. (2000). Lump-sum bonus satisfaction: Testing the construct validity of a new pay satisfaction dimension. Personnel Psychology, 53, 673–700.
Takeuchi, R., Yun, S., & Tesluk, P. E. (2002). An examination of crossover and spillover effects of spousal and expatriate cross-cultural adjustment on expatriate outcomes. Journal of Applied Psychology, 87, 655–666.
Vandenberg, R. J., & Grelle, D. M. (2009). Alternate model specifications in structural equation modeling: Facts, fictions, and truth. In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 165–191). New York: Routledge/Psychology Press.
12
When Small Effect Sizes Tell a Big Story, and When Large Effect Sizes Don’t
Jose M. Cortina and Ronald S. Landis
Now if objective reference is so inaccessible to observation, who is to say on empirical grounds that belief in objects of one or another description is right or wrong? —W. V. Quine, Word and Object, 1960
Dissatisfaction over the limitations of significance testing and over mischaracterization of these limitations has generated a great deal of research and commentary over the past 15 years (e.g., Cohen, 1994; Cortina & Dunlap, 1997; Schmidt, 1998). In addition, the American Psychological Association (APA) created a task force whose charge was to consider whether and how APA policy should be modified to reflect the state of the art regarding the reporting of empirical results. Among the recommendations of the task force was that indices of effect size be regularly reported in empirical research published in APA journals and that authors’ conclusions be influenced by them. The primary advantage of the adoption of this recommendation, in our opinion, has been that authors of large sample studies are no longer able to claim unfettered support for hypotheses simply due to p < .05. Instead, conclusions must be tempered with recognition that small effects, though perhaps not attributable to chance, were trivially small. More broadly, the task force report draws into sharp relief the distinction between statistical interpretations of observed effects (p values) and substantive (psychological) meaning of those effects. A possible unintended consequence of drawing attention to effect sizes
is that these values could also be interpreted using a rather narrow set of decision rules and criteria (an oft-cited criticism of significance testing). Importantly, authors interpreting and presenting results of studies through effect sizes must also attend to both the statistical and substantive meaning of the observed results. That is, blind application of significance testing (i.e., reflexive interpretation) without regard to wider implications of observed results is to be avoided. Unfortunately, categorization of effect sizes using terms such as small or large can similarly result in reflexive interpretation of results and could, potentially, serve to restrict the use of subtle manipulations, to discourage the study of more distal causal relationships, and to retard development of alternative conceptualizations of traditional constructs, as these may be associated with smaller effect sizes.

For many, if not most, topics in the social and organizational sciences, small effect size values justify language that is, at best, lukewarm and, at worst, dismissive. For example, if two self-report Likert items written to measure the same construct correlate at .10, it makes little difference whether this correlation is statistically significant. The magnitude of the correlation is too small to conclude that these items behave as intended, and this should be reflected in the language used to describe the results of the study. It is our goal, however, to point out that for some studies, small effects represent impressive support for the phenomenon in question. We argue that, in certain situations, “small” effect size values justify strongly supportive language. Reflexive dismissal of small effect sizes by researchers reflects an urban (and for that matter, rural) legend that small effect size values always mean the same thing and justify labels such as weak effect or trivial effect.

The present chapter is organized as follows.
First, we offer a brief review of the meaning of effect size. Second, we describe the urban legend that is the focus of this chapter as well as the kernel of truth that drives it. Third, we describe relevant lessons from W. V. Quine, the foremost authority on the language of science. Fourth, we use these lessons to uncover situations in which small effect size values justify strong conclusions. Finally, we apply these lessons to the opposite situation in which large effect sizes fail to justify strong conclusions.
Effect Size Defined

The term effect size is generally applied to any standardized index that represents the magnitude of a relationship between variables. The term standardized refers to the fact that effect size indices are independent of scales of measurement and are therefore comparable across studies. Although an unstandardized magnitude index is certainly possible (e.g., a difference between means, a covariance), the term effect size in the current discussion refers to standardized indices only (e.g., d, $r^2$, $\eta^2$). We also acknowledge that we use the term effect somewhat lightly. Although an effect size value can certainly be generated from the results of experimental designs in which causal inferences are common, we also apply the term to indices generated from quasi-experimental and observational designs. The most common effect size index for designs with categorical predictor variables is d. Although d can take many forms, the most common is

$$d = \frac{\bar{X}_1 - \bar{X}_2}{s_p},$$

where the numerator contains the difference between group means and the denominator contains the pooled standard deviation, which is most easily conceptualized as the square root of the sample-size-weighted average within-group variance. Division by the pooled standard deviation serves to remove the scale of measurement, yielding a standardized index. Thus, d represents the mean difference in standard deviation units and can be computed for a wide variety of ANOVA-like designs (Cortina & Nouri, 2000). Other common effect size indices for use primarily with categorical independent variables are $\eta^2$ and $\omega^2$. These indices cast effect size in terms of percentage of variance accounted for rather than in terms of mean differences. Two additional effect size indices, typically reported in nonexperimental and quasi-experimental research, are $r^2$ (for bivariate relationships) and $R^2$ (for multivariate relationships). Of course, the square of the correlation coefficient provides the percentage of variance in one variable accounted for by another. In the multivariate case, $R^2$ gives the percentage of variance in a criterion variable accounted for by an optimally weighted combination of predictor variables.
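The definition of d given above can be sketched in Python as follows. This is a minimal illustration, not a general-purpose routine: the function names and the two small data sets are hypothetical, the within-group variances are weighted by their degrees of freedom (n - 1), which is the usual convention, and the percentile helper additionally assumes normal distributions with equal variances.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pooled_sd(group1, group2):
    """Square root of the weighted average within-group variance.

    Each sample variance is weighted by its degrees of freedom (n - 1).
    """
    n1, n2 = len(group1), len(group2)
    weighted = (n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)
    return sqrt(weighted / (n1 + n2 - 2))


def cohens_d(group1, group2):
    """Mean difference expressed in pooled standard deviation units."""
    return (mean(group1) - mean(group2)) / pooled_sd(group1, group2)


def percentile_of_upper_median(d):
    """Proportion of the lower distribution falling below the upper median.

    Assumes (for illustration) two normal distributions with equal
    variances whose means differ by d standard deviations.
    """
    return NormalDist().cdf(d)


if __name__ == "__main__":
    treatment = [6, 7, 8, 9, 10]  # hypothetical scores
    control = [4, 5, 6, 7, 8]     # hypothetical scores
    print(round(cohens_d(treatment, control), 3))
    for d in (0.2, 0.5, 1.0):
        print(d, round(percentile_of_upper_median(d), 2))
```

Because d is expressed in standard deviation units, rescaling both groups by the same constant (for example, converting a 5-point scale to a 100-point scale) leaves the value unchanged.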
Like the other indices mentioned above, $r^2$ and $R^2$ reflect magnitude of effects independent of the original scales of measurement. For all effect size indices, larger absolute values denote stronger effects. Thus, if a particular experiment on a particular effect using a particular methodology yielded d = 1.0, this result would suggest a stronger effect than if the same experiment had yielded d = .5. As we will argue later, this is not to say that d (or $r^2$) values are so easily compared across studies.

The Urban Legend

The legend that we consider here is that one can craft a sentence about the importance of a phenomenon based only on the absolute magnitude of an effect size index. For example, if we apply Cohen’s (1988) famous criteria for effect size, we use the term small in the interpretation of a d value of .20 and the term large in the interpretation of a d value of .8, and so on. A large effect size warrants superlative language (e.g., strong support), whereas a small effect size requires modest or even pejorative language (e.g., weak support or little support). Although Cohen made clear that these categorizations were subjective, they are seldom evaluated critically. In explaining the concept of effect size, many commonly used statistical/methods texts (e.g., Gravetter & Wallnau, 2007; Howell, 1999) describe Cohen’s criteria and either offer them uncritically or endorse them entirely. Gravetter and Wallnau, for example, in describing a d value of 1.79, state, “Using the criteria established to evaluate Cohen’s d…, this value indicates a very large treatment effect” (p. 314). Others describe effect size without addressing the language issue at all (e.g., Hunter & Schmidt, 2004). Anecdotal evidence for the existence of this legend also exists.
Consider the following quotes about power analysis compiled by the authors of a recent paper (Aguinis & Harden, in press):

A power analysis indicated that the power to detect a medium effect with an alpha level of .05 was 46 percent, and the power to detect a large effect was 86 percent (Cohen, 1988). (Raver & Gelfand, 2005, p. 394)

Statistical power to detect a significant $R^2$ in the regression analysis was 35% for a small effect ($R^2$ = .0196, p < .05) and 99% for a medium effect ($R^2$ = .13, p < .05; Cohen, 1988). (Morgeson & Campion, 2002, p. 598)
Although the power to detect moderate effects (r = .30) at the .05 alpha level with this sample is .78, the power to detect small effects (r = .10) is only .14. (Brown, 2001, p. 292)
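The $R^2$ values in the second quote (.0196 and .13) appear to be Cohen’s (1988) conventional $f^2$ benchmarks for multiple regression (.02 and .15) re-expressed as $R^2$ through the relation $f^2 = R^2 / (1 - R^2)$. The short Python sketch below makes that conversion explicit; the function and constant names are hypothetical, and the reading of the quote is our inference rather than something the quoted authors state.

```python
# Cohen's (1988) conventional f-squared benchmarks for multiple regression.
COHEN_F2 = {"small": 0.02, "medium": 0.15, "large": 0.35}


def r_squared_from_f_squared(f2: float) -> float:
    """Convert f-squared to R-squared: R^2 = f^2 / (1 + f^2)."""
    return f2 / (1 + f2)


def f_squared_from_r_squared(r2: float) -> float:
    """Convert R-squared to f-squared: f^2 = R^2 / (1 - R^2)."""
    return r2 / (1 - r2)


if __name__ == "__main__":
    for label, f2 in COHEN_F2.items():
        print(label, round(r_squared_from_f_squared(f2), 4))
```

Seen this way, the targeted effect sizes in the quote are the conventional labels in another metric, not values derived from the research domain.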
In none of these quotes are the small and large labels questioned. They are merely offered uncritically, and this is hardly exceptional. From our own work as reviewers and editors, we observe that effect size values of a certain magnitude are never questioned, whereas effect size values of other, smaller magnitudes are often questioned simply on the basis of their absolute values. As we later argue, there are many examples of phenomena and/or methodologies such that nearly any nonzero effect size value that is generated justifies the use of superlatives. Likewise, there are many examples of phenomena and/or methodologies that must generate enormous effect size values in order to justify even modest language. Regarding the former, we will describe methods, phenomena, and/or combinations of the two for which strong language should be attached to what would otherwise be considered small effect size values. In particular, we discuss effect sizes from intentionally inauspicious designs, effect sizes for cumulative phenomena, and effect sizes that undermine fundamental assumptions. Regarding the latter, we discuss designs that are (or should be) expected to inflate effect size values, thus requiring one to temper the conclusions drawn from “large” effect size values.

The Kernel of Truth

The kernel of truth in the present case is, in fact, more than a kernel. It is almost the entire cob. Almost. If it looks like a duck and quacks like a duck, then yes, it is probably a duck. Unless it isn’t. A small effect size value usually represents a trivial level of support for a given phenomenon. A large effect size value usually represents substantial support. d values can be interpreted in terms of degree of overlap of distributions. For example, a d value of .2 (the upper bound of Cohen’s “small” effect size) suggests that the 50th percentile score of the upper distribution would be at the 58th percentile of the lower distribution.
Said another way, there is a tremendous amount of overlap between the two distributions. A d value of 1.0, by contrast, suggests that the 50th percentile score of the upper distribution is at the 84th percentile of the lower. Said another way,
the middle of the upper distribution is in the tail of the lower. It is for these reasons that a d value of .2 is seldom anything to write home about, whereas a d value of 1.0 is usually quite compelling. Our primary worry in writing this chapter is that researchers might use our qualified defense of small effect sizes to rationalize exaggerated statements about trivial effects. To be sure, our point is not that small effect sizes are often given unaccountably short shrift or that the conclusions drawn from large effect sizes are often exaggerated. Rather, our point is that small effect sizes suggest strong empirical support for a given phenomenon if and only if certain conditions are met and that large effect sizes, in certain circumstances, represent less empirical support than one might imagine. Cohen would have been (and probably was) the first to acknowledge that his labels were subjective and that effect sizes must be interpreted in context. In his early (e.g., 1962) and later (e.g., 1994) work, he recognized that interpretive language must reflect the conditions and circumstances under which the data were collected as reflected in his preference for range-based rather than point-based hypotheses. It is unclear, however, that modern researchers appreciate the importance of context when interpreting effect size. Before describing the role played by context, however, we will attempt to frame the issue using the terms of noted epistemologist and expert on the language of science, W. V. Quine.

Quine and Ontological Relativism

Willard Van Orman Quine was perhaps the foremost authority on the language of science. Of particular interest in the current discussion is his work from the 1950s and 1960s, of which the quote that began the chapter is representative.
We repeat Quine’s question, “Who is to say on empirical grounds that belief in objects of one or another description is right or wrong?” To paraphrase for our present purposes, “Who is to decide which language is appropriate in response to an effect size of a given magnitude?” If we were to offer “$r_{xy} = .10$” as a stimulus to 100 social and organizational scientists and then ask for a sentence of interpretation, we would likely receive statements such as the following:
a. X and Y are weakly related.
b. X and Y are largely independent.
c. X explains 1% of the variance of Y.
d. 99% of the variance in Y is independent of X.
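The last two statements in the list above rest on arithmetic alone, because squaring the correlation gives the proportion of variance in one variable accounted for by the other. As a worked check:

```latex
\[
r_{xy}^2 = (.10)^2 = .01 \;(1\%), \qquad 1 - r_{xy}^2 = 1 - .01 = .99 \;(99\%).
\]
```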
Ideally, we would be able to characterize the sentence of interpretation as an “observation sentence.” An observation sentence, explains Quine (1968), is a sentence to which all members of a community (e.g., the social/organizational scientific community) would agree given a certain stimulus. Options c and d above represent observation sentences in response to the stimulus “$r_{xy} = .10$,” as everyone who is a member of the community, “people who understand the correlation coefficient,” knows that its square gives the percentage of variance in one variable that is accounted for by the other. Options c and d are in direct causal proximity to the stimulus in question and do not depend on the experience or stored information of the target. They each reflect an analytic truth (as opposed to a synthetic truth) because each is true by virtue of the meanings of words alone. What we usually desire in our discussion sections, however, is an observation sentence that is true not merely through the meanings of words but also through the implications of those words. Statements vary in the degree to which they follow directly from a given stimulus. Consider another quote from Quine (1960):

Commonly, a stimulation will trigger our verdict on a statement only because the statement is a strand in the verbal network of some elaborate theory, other strands of which are more directly conditioned to that stimulation. Most of our statements respond thus to reverberations across the fabric of intralinguistic associations [italics added].
The statements that follow directly from data, such as the statement “X explains 1% of the variance of Y” from the stimulus “$r_{xy} = .10$,” certainly make a contribution to understanding, but they are somewhat prosaic for the fact that they are little more than synonyms. We typically desire, in addition, statements that extend from intralinguistic associations. It isn’t enough to say that X explains a certain amount of the variance in Y. We also want a “reduction form” as described by Carnap (1936) that gives implications instead of mere equivalences. Unfortunately, the reduction form toward which we often drift is the class. Consider still another quote from Quine (1969):
No wonder that in mathematics the murky intensionality of attributes tends to give way to the limpid extensionality of class; and likewise in other sciences, roughly in proportion to the rigor and austerity of their systematization. (p. 21)
Likewise indeed; rather than evaluate empirical findings based on the attribute-based statement “X explains 1% of the variance of Y,” we seek an artificial categorization for the result “$r_{xy} = .10$” such as “X and Y are weakly related.” Only by placing $r_{xy} = .10$ in the “weakly related” category or class do we feel that we have made sense of our results. And we are not alone in attaching importance to categorization of statistical indices in such a manner. Consider, for example, an analogous situation in the athletic world. Professional athletes are identified by statistics more so than possibly any other occupational group. We can review the number of home runs a baseball player hit in his career, the number of points a professional basketball player scored during her time in the WNBA, the number of major golf tournaments won, and the list goes on and on. In many cases, though, the discussion of a player’s importance and impact on a sport is categorized as “one of the greatest ever” or “Hall of Fame caliber.”

This isn’t as unfortunate as the previous paragraph suggests. Language is inherently categorical and, therefore, intrasubjective, to borrow a phrase from Kant. If we wish to attach language to our empirical findings (and we clearly do), then we must deal with the “limpid extensionality of class.” Such classifications are the mechanisms through which we translate our findings. Although we might aspire to translation of our findings into strictly observational and logico-mathematical terms, we cannot do so because our findings (e.g., $r_{xy} = .10$) are never the sole allowable consequences of a given phenomenon (Quine, 1968). A small observed correlation ($r_{xy} = .10$) might have occurred because the phenomenon in question is trivial in magnitude, but it might also have occurred for other reasons that we discuss later. Classification is inevitable. The question isn’t whether or not one should classify but rather how broadly one can make classifications.
This issue arises in Quine’s Word and Object (1960) and to some extent his Natural Kinds (1970). Using Quine’s terms, then, we state our theses as follows:
1. Classification is an unavoidable by-product of language. No matter how sophisticated our methods, we must classify in order to communicate.
2. Classes such as “weakly related,” “strongly related,” and the like are too broad to apply to a decontextualized stimulus such as rxy = .10. No such class should represent an “observation sentence” because anyone who understands the different contexts that might have produced the stimulus, rxy = .10, would be unwilling to endorse a sentence that so classified the stimulus itself.
3. Only when we add context to the stimulus can statements such as “X and Y are weakly related” begin to approach the status of observation sentences.
It is the decontextualization of the stimulus that creates the myth to which our chapter is directed. Contextualization causes the myth to become visible for what it is. In the sections that follow, we consider the forms that relevant contextualization might take.

Contextualization

As mentioned earlier, different combinations of phenomena and methods preclude reflexive interpretation of effect sizes. Although we discuss these combinations separately, their interconnectedness is obvious. By way of example, consider the Milgram (1963) obedience studies. Although the most famous of these studies had no independent variable, the sentence “The Milgram findings were shocking” probably qualifies as an observation sentence in spite of its extensionality. Milgram found that many participants continued to punish the confederate even after it became clear that the punishment was excessive. Even if, however, he had found that a small percentage of his subjects continued to punish, his results would still have been characterized as shocking. In other words, even if he had generated a “small” effect size value, any nonzero value would have surprised most readers.

There were two reasons for this. The first was that the design was inauspicious (Prentice & Miller, 1992). If one were interested in studying obedience to punishment orders, one might study intact superior-subordinate pairs responding to willfully incorrect behavior by a confederate with punishment that is only mildly excessive.
Because the Milgram studies involved strangers, unintentional mistakes, and wildly excessive punishment, the design was extremely inauspicious with regard to generating obedience. But it did generate obedience, and the fact that it did so at any level in spite of the design makes the phenomenon in question all the more impressive. The second reason was that Milgram’s findings turned a fundamental assumption on its ear. At that time in history, the world liked to believe that Nazi soldiers complied with the brutal orders that they received because of the extreme circumstances, because they were German, or both. We would never behave in such a way. Milgram’s work suggested otherwise, and although he found a large amount of compliance, any compliance would have been sufficient to call into question a fundamental assumption about people.

In the Milgram example, the context that produced the results justified a strong statement of interpretation. Both the methodology (inauspicious design) and phenomenon of interest (obedience to authority) suggest that the results not be interpreted through application of a traditional decision criterion (e.g., no effect can be characterized as shockingly large unless it exceeds a particular numerical magnitude). Other examples in the literature also speak to this issue. We now turn our attention to those situations that are likely to produce effect size values that do not translate into typical statements of interpretation.

Inauspicious Designs

One way to demonstrate the importance of a phenomenon is to show that the phenomenon can be detected even in the least auspicious of circumstances. In a dramatically underreferenced methods paper, Prentice and Miller (1992) described several such circumstances. Consider first the minimal group studies by Tajfel and colleagues (e.g., Tajfel, Billig, Bundy, & Flament, 1971).
The purpose of these studies was to test the limits of the well-established finding that people tend to prefer members of their own group over members of other groups. In order to demonstrate the strength of this tendency, the authors showed that this preference exists even when group membership is trivial in nature. In Tajfel et al. (1971), boys were shown pictures of sets of dots and asked to estimate the number of dots in the picture. They were then told either that they tend to overestimate or that they tend to underestimate the number of dots. In a subsequent game,
“overestimators” tended to allocate more points to other “overestimators” than to “underestimators.” Although the effect sizes were not large by traditional standards (i.e., d less than .20), the fact that subjects showed preference for such a trivial grouping suggests that the ethnocentrism effect in question is very powerful indeed.

As Prentice and Miller (1992) pointed out, Locksley, Ortiz, and Hepburn (1980) took this method one step further by using explicitly random group assignment. That is, subjects were randomly assigned to one group or another, and they knew that assignment was random. Even in this most inauspicious of designs, subjects showed a small preference for members of their own group. The d values in their various studies hovered around .20, but given the weakness of the manipulation, a d value of zero would not have been surprising. The fact that d was greater than zero speaks volumes about the importance of the phenomenon in question.

Prentice and Miller (1992) also described several other examples of subtle manipulations. Specifically, they referenced research on the relationship between exposure and liking. This research has shown that participants prefer melodies to which they have been exposed even when exposure was so slight that they were unable to recall having heard the melodies afterward (Wilson, 1979). Also, Isen and Levin’s (1972) research on the effects of mood has shown that even a mood induction as mild as the offering of a cookie leads people to be more helpful to others than if the cookie was not offered.

Another recent example of a subtle manipulation from the organizational research literature is Levine, Higgins, and Choi’s (2000) application of the classic Sherif (1936) paradigm to induce a promotion or prevention focus. It is a well-established principle of decision making that humans tend to be risk-tolerant in “gain” situations and risk-averse in “loss” situations. Levine et al.
(2000) were interested in the degree to which gain/loss priming would lead people to gravitate toward high-risk vs. low-risk task strategies. These authors placed subjects into groups and guaranteed the subjects $3 but also provided an opportunity for $6. In the promotion focus condition, subjects were given $3 and were then told that they could earn an additional $3 if they performed at an 80% level on a memory task. In the prevention focus condition, subjects were given $6 and were then told that they would lose $3 if more than 20% of their responses in the memory task were incorrect. Note that the expected values are identical between the two conditions; only
the phrasing differs. And yet, the members of groups in the promotion focus condition tended to converge over time toward high-risk strategies whereas the members of groups in the prevention focus condition tended to converge toward low-risk strategies. Although the amount of convergence was small in magnitude (variance in strategies dropped from one trial to the next by approximately .02 units for both conditions), the fact that this manipulation had any impact at all suggests that an externally imposed focus can have a tremendous impact on problem solving.

In all of these examples, effect sizes were small by conventional standards. The conclusions to be drawn, however, are not that the phenomena in question are trivial. Instead, the appropriate conclusions are that the phenomena are so powerful that they can be detected in spite of subtle manipulations.

Reflexive interpretation of effect sizes is also problematic in designs that are inauspicious because of the creation, sometimes intentional, of an overwhelming situation. An example comes from research on peer pressure showing that people will endorse statements simply because they have seen others do so. Asch (1951) demonstrated the strength of this tendency by showing that some people will endorse a statement that is patently false (e.g., that a line that is clearly 1 foot long is only 6 inches long) simply because they have seen several others make the same judgment. The overwhelming situation in this case is created by the obvious discrepancy between the actual length of the line and the length that the participant is led to endorse. Suppose that, whereas none of the participants in a control condition endorse the statement that is patently false, 2% of participants who have seen others endorse the false statement also do so. This difference represents an effect size that falls into Cohen’s “small” category.
And yet, if a nonzero percentage of participants will endorse a statement that is patently false under these conditions, then humans must be enormously swayed by peer pressure in situations that allow for more judgment. Prentice and Miller (1992) discussed related examples from research on the effect of physical attractiveness on jury decisions and the effect of social structure on suicide rates. All of these examples have in common an overwhelming situation that creates something akin to an inauspicious design. If the phenomenon in question can still be detected in such a situation, then it must have a profound effect indeed, and its role in milder situations must be substantial.
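The hypothetical comparison above (0% of control participants versus 2% of exposed participants endorsing the false statement) can be placed on Cohen's scale with a short calculation. The sketch below assumes Cohen's h, the difference between arcsine-transformed proportions, as the index; the chapter does not say which index the authors had in mind, and the function names are illustrative only.

```python
import math

def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h: difference between arcsine-transformed proportions."""
    phi1 = 2.0 * math.asin(math.sqrt(p1))
    phi2 = 2.0 * math.asin(math.sqrt(p2))
    return phi1 - phi2

def cohen_label(h: float) -> str:
    """Cohen's conventional benchmarks for h: 0.2 small, 0.5 medium, 0.8 large."""
    size = abs(h)
    if size < 0.2:
        return "below small"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

# Assumed values taken from the hypothetical in the text: 2% versus 0%.
h = cohens_h(0.02, 0.00)
print(round(h, 3), cohen_label(h))
```

The label returned here is exactly the kind of decontextualized classification the chapter warns against: the number says "small," while the design says that any nonzero value is remarkable.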
An expectation of large, or even moderate, effect sizes is misguided. Instead, an expectation of zero effect is reasonable, and any departure from zero represents substantial evidence for the importance of the phenomenon in question.

Phenomena With Obscured Consequences

There are many examples of phenomena whose effects are small but of great import nonetheless. Aspirin consumption, for example, explains less than 1/10 of 1% of the variance in heart attack occurrence, but because it is a (nearly) zero-cost intervention, any nonzero relationship has important implications.

Other effects are small if observed in a snapshot but have enormous cumulative consequences. Consider Abelson’s (1985) example of the relationship between skill and probability of getting on base among professional baseball players. If we operationalize skill as previous success and then use it to predict whether a batter gets on base, we find that skill explains less than 1% of the variance in getting on base in a given at bat (d = approx. .15). How can this be, given that some batters are paid 100,000% more than others? The answer is that the effect of skill on getting on base is cumulative. In order to see the effect of skill, one must look at several hundred at bats. Given that this is a cumulative phenomenon, any nonzero relationship between skill and getting on base at the level of the individual at bat is important.

Something similar happens when we study individual choices. The relationship between a stable individual difference such as conscientiousness and whether one makes a particular choice (e.g., turning off the light before leaving the room) must be very small indeed. First, the causal connection between conscientiousness and turning off the light is likely mediated by a series of variables. Second, there are many factors other than conscientiousness that might influence this choice.
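The cumulation argument in Abelson's batting example can be made concrete with a little arithmetic. The sketch below assumes five hypothetical batters with true on-base probabilities between .25 and .33 (illustrative values, not Abelson's data), gives each the same number of independent at bats, and uses the law of total variance to ask how much of the variance in observed on-base rate is due to skill.

```python
from statistics import fmean, pvariance

def skill_variance_share(on_base_probs, n_at_bats):
    """Share of variance in observed on-base rate (over n_at_bats) due to skill.

    Law of total variance: Var(rate) = Var(p) + E[p(1 - p)] / n.
    """
    between = pvariance(on_base_probs)  # variance of true skill across batters
    noise = fmean([p * (1 - p) for p in on_base_probs])  # binomial variance per at bat
    return between / (between + noise / n_at_bats)

# Hypothetical true on-base probabilities, assumed for illustration only.
batters = [0.25, 0.27, 0.29, 0.31, 0.33]

for n in (1, 50, 500):
    print(n, round(skill_variance_share(batters, n), 3))
```

Under these assumed values, skill accounts for roughly 0.4% of the variance in a single at bat but roughly two thirds of the variance in on-base rate across 500 at bats. The underlying skill differences are the same in both cases; only the level of aggregation changes.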
If instead we were to cumulate the effects of conscientiousness across a large number of individual decisions, we would see that conscientiousness has a tremendous influence on behavior as a whole.

This is analogous to the relationship between social structure and suicide rates (Durkheim, 1951). The causal connection between any societal-level variable and an individual choice is very distant, and there are a great many factors that drive the choice to commit suicide. Thus, any nonzero relationship between these variables suggests that social structure exercises a powerful influence.

Consider the relationship between governmental policies and procedures and individual choices in behavior. For example, drunk driving has staggering costs (both financial and in human life) for individuals, communities, and society at large. Many states have adopted aggressive anti-drunk-driving campaigns aimed at reducing the instances in which individuals drive while impaired and, ultimately, the number of fatal crashes stemming from these occurrences. Given the plethora of factors that influence a particular individual’s decision to drink and drive on a specific occasion, the overall impact of such campaigns may be quite modest (or even small). An evaluation of such a program in the state of Tennessee in 1995 revealed an overall reduction in fatal crashes involving a drunk driver of 9 per month. Though this is a rather low number, the programs may be quite efficacious considering the levels through which the program must filter. In each of these cases, the distance of the causal relationship and the complexity of the outcome create a situation in which any nonzero effect augurs well for the influence of the causal variable.

Phenomena That Challenge Fundamental Assumptions

As was mentioned previously, Milgram’s work on obedience called into question fundamental assumptions about human responses to authority. Consider two other examples. Judge and his colleagues (Judge et al., 2005; Ilies et al., 2006) attempted to predict alternative workplace criteria such as citizenship behavior and deviant behavior from attitudes and personality. Though the prediction of such criterion measures from attitudes and personality is hardly novel, the thrust of this work was that within-person variability in citizenship and deviance could be predicted from within-person variability in job attitudes.
Citizenship, deviance, and most other outcomes have been conceptualized almost exclusively as between-subject variables. Even in those studies from which multiple measurements are obtained, the variability in these multiple measurements is traditionally treated as random error around a single true score. But what if there is no single citizenship score for a given person? What if there is no single deviance score?
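The contrast drawn here, between treating repeated measurements as noise around a single true score and treating them as meaningful within-person variability, comes down to how total variance is partitioned. The sketch below is a purely descriptive decomposition for balanced data, using made-up daily ratings for three hypothetical employees; it is not the multilevel analysis used in the studies cited, and all names and numbers are assumptions.

```python
from statistics import fmean, pvariance

def within_person_share(daily_scores):
    """Fraction of total variance that lies within persons (balanced data).

    With equal numbers of days per person, total variance equals the variance
    of person means plus the average within-person variance.
    """
    person_means = [fmean(days) for days in daily_scores.values()]
    between = pvariance(person_means)
    within = fmean([pvariance(days) for days in daily_scores.values()])
    return within / (between + within)

# Hypothetical daily citizenship ratings: three employees, four days each.
scores = {
    "A": [4, 5, 4, 5],
    "B": [2, 3, 3, 2],
    "C": [3, 4, 2, 3],
}

print(round(within_person_share(scores), 3))
```

If every person had a single stable score, the within-person share would be zero apart from measurement error. A share well above zero is what raises the question of whether a single score per person is the right description.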
Judge and his colleagues hypothesized that meaningful within-person variability in outcomes exists, that it can be explained by within-person variability in job attitudes, and that these relationships are moderated by stable individual difference variables. For example, Ilies et al. (2006) hypothesized that citizenship performance varies for each person from one day to the next. That is, a given employee can be a very good citizen on one day and a very bad one on the next. Next, they hypothesized that within-person variability in citizenship can be explained by within-person variations in positive affect. Finally, they hypothesized that agreeableness would moderate this relationship such that, whereas people who are high in agreeableness require no additional motivation to engage in citizenship (e.g., positive affect), people who are low in agreeableness will engage in citizenship only if those additional motivators are present.

The authors used experience sampling in which participants responded to a series of questions on a daily basis. The first step in the data analysis was to determine if there was in fact substantial within-person variability in citizenship. The authors found that nearly 30% of the total variability in citizenship was within-person, a number that would be labeled “substantial” by any measure. But suppose that this number had been 5%. We would argue that this would still represent a substantial percentage of variance simply because it is utterly at odds with previous conceptualizations of citizenship. In personnel selection research, we study citizenship with an eye toward hiring “good citizens.” In performance appraisal research, we study citizenship with an eye toward rewarding good citizens and punishing/firing bad citizens. But what if today’s good citizen is, to some degree, tomorrow’s bad citizen? We might ask the same question with regard to the deviance variable investigated in Judge et al. (2005).
What if today’s Boy Scout is tomorrow’s saboteur? The results of Judge and his colleagues suggest that this is quite possible. By extension, these results call into question almost all of the previous research on the topic. For this reason, a large effect is unnecessary for us to attach import to the phenomenon in question.

Consider another citizenship example. With few exceptions, previous research on the prediction of citizenship and related variables (e.g., contextual performance, prosocial organizational behavior) has focused on dispositional and attitudinal variables as predictors. The unspoken assumption is that anyone can engage in citizenship, but not all choose to do so. On the other hand, models of job performance invariably contain knowledge and skills as fundamental predictors. Is there really no knowledge or skill that is relevant for citizenship performance? Dudley and Cortina (in press) discovered a large number of knowledge and skill variables that are relevant for citizenship and that contribute to prediction over and above relevant dispositional variables and cognitive ability. Citizenship facets such as helping, cooperating, and courtesy can be predicted by knowledge and skill variables such as emotion perception skill, emotion management skill, knowledge of target, knowledge of helping/cooperating strategies, and knowledge of organizational courtesy norms. Although Dudley and Cortina (in preparation) found that knowledge and skill variables demonstrated substantial predictability, generally outperforming dispositional variables by a considerable margin, large effect sizes were not necessary because any nonzero effect would serve to disrupt a fundamental assumption.

If citizenship does have a knowledge/skill component, then the implications of previous research might be called into question. Suppose, for example, the desire to help coworkers translates into “helping performance” only if one has the knowledge and skill that is necessary to be helpful. Or suppose instead that certain dispositions lead one to acquire the skills necessary to helping. In both cases, the role of dispositions cannot be identified until knowledge and skills are included, and most previous conclusions on the subject would require substantial revision.

In many of the preceding examples, effect sizes were substantial, but they needn’t have been. The implications of the theoretical portions of these papers are so far-reaching that the data need only fail to disconfirm. In other words, an effect that would normally be considered small is sufficient reason to pursue the possibility that previous work on the topic has been misguided in some way.
The Flip Side: Trivial “Large” Effects

The focus of this chapter thus far has been on underappreciated “small” effect sizes. The principle that we have tried to illustrate is that effect sizes cannot be interpreted in isolation. Rather, they can only be interpreted within the context that generated them. Failure
to appreciate context can lead to incorrect conclusions about magnitude of effect. This is true not only of effect sizes that are small in magnitude but also of effect sizes that are large in magnitude. Just as an artificially minimal manipulation might be used to demonstrate the pervasiveness of an effect, so might an artificially maximal manipulation be used to exaggerate an effect.

Consider the extreme groups design. In such designs, subjects are assigned only to the extremes of the independent variable in question. Or, perhaps more commonly, subjects are only included if they reside in one or the other extreme of a distribution. In either case, the variance of the independent variable is exaggerated. Although this does not necessarily affect unstandardized values (e.g., the unstandardized regression weight is unaffected), it does affect standardized values.

Suppose, for example, that we wish to study the effects of smoking on cardiovascular health. An ecologically valid approach might involve a comparison of nonsmokers who live in typical conditions to those who smoke two packs of filtered cigarettes per day and who also live in typical conditions. An extreme groups approach, by contrast, might involve a comparison of people who have smoked 5 packs of unfiltered cigarettes per day since age 15 to nonsmokers who have spent their lives in hermetically sealed bubbles. The latter approach will almost certainly generate a larger effect size, but it would be a mistake to conclude from this large effect size that smoking has a dramatic effect on cardiovascular disease.

The reason has to do with the word smoking and with the fixed effect nature of these comparisons. Presumably, the research was conducted in an effort to determine the health effects of typical cigarette consumption. When we use the word smoking, we aren’t referring to 100 unfiltered cigarettes per day, and when we refer to nonsmoking, we aren’t referring to pristine living conditions.
Clearly, 100 unfiltered cigarettes are worse, however defined, than 40 filtered cigarettes. Equally clearly, the comparison of smokers to nonsmokers does not include people who live in an atypically purified environment. The sentence “Smoking has a ____ effect on cardiovascular disease” has a limited number of meanings. It implies either a comparison of typical smokers to typical nonsmokers or a per-unit rate of change in cardiovascular disease. As pointed out by Cortina and DeShon (1998), the use of an extreme groups design does not influence the unstandardized regression weight. However, it does
influence mean differences, standardized weights, and standardized indices of covariation. Because the extreme groups design involves a comparison that is inconsistent with the language that is typically used in Introduction and Discussion sections, such language should not be used to describe the results of an extreme groups design.

Suppose that we fill in the blank in the above sentence so that it reads, “Smoking has a large effect on cardiovascular disease.” Suppose further that our extreme groups design had produced a d value of 1.0. Although the stimulus “d = 1.0” is often sufficient for us to label the sentence “X has a large effect on Y” as an observation sentence (i.e., all members of the relevant community would endorse the verdict), this is not the case here. The context that produced the d value of 1.0 was such that the use of the term smoking is inappropriate because the comparison included in the extreme groups design is different from the comparison implied by the conclusion sentence. If, instead, an ecologically valid observational design had been used, then the stimulus “d = 1.0” would cause all informed members of the relevant community to endorse the verdict “Smoking has a large effect on cardiovascular disease,” thus making it an observation sentence. The only sentence that would serve as an observation sentence for the stimulus “d = 1.0” from the extreme groups design is “The level of cardiovascular disease for people who smoke 5 packs of unfiltered cigarettes per day since they were 15 is one standard deviation higher than the level of disease for nonsmokers exposed to no airborne toxins.” Of course, this sentence is an analytic truth as described earlier and is less useful for Discussion sections.

This problem with extreme groups designs is a bit more pernicious in the context of interactions simply because it is harder to spot.
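The main-effect claim made above, that selecting extreme groups leaves the unstandardized regression weight essentially alone while inflating standardized indices, can be checked with a small simulation. The sketch below assumes a linear relation with a true slope of 0.3, a standard normal predictor, and a cutoff of 1.5 standard deviations for the "extreme" sample; all of these values, and the helper name, are illustrative assumptions rather than anything taken from the studies cited.

```python
import random
from statistics import fmean

def slope_and_r(xs, ys):
    """Return the unstandardized OLS slope and the Pearson correlation."""
    mx, my = fmean(xs), fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sxx, sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
xs = [rng.gauss(0, 1) for _ in range(20000)]
ys = [0.3 * x + rng.gauss(0, 1) for x in xs]  # assumed true slope: 0.3

# Full (observational) sample versus an extreme-groups subsample.
full_slope, full_r = slope_and_r(xs, ys)
extreme = [(x, y) for x, y in zip(xs, ys) if abs(x) > 1.5]
ext_slope, ext_r = slope_and_r([x for x, _ in extreme], [y for _, y in extreme])

print("full sample:    slope", round(full_slope, 2), "r", round(full_r, 2))
print("extreme groups: slope", round(ext_slope, 2), "r", round(ext_r, 2))
```

Both samples should recover a slope close to the assumed 0.3, whereas the correlation in the extreme-groups subsample is noticeably larger because the predictor variance has been artificially stretched. The per-unit statement survives the design; the standardized one does not.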
McClelland and Judd (1993) showed that the primary determinant of power in the test for interaction is the residual variance of the product and that the primary determinant of the residual variance of the product is the variances of the components of the product relative to their means. Thus, an extreme groups design, with many extreme values and no moderate values, will generate more power than will an observational design, with many moderate values and few extreme values. The trade-off, as described in Cortina and DeShon (1998), is that one can no longer draw the typical conclusions from the effect size that is generated by the interaction. Just as the conclusion “Smoking has a large effect on cardiovascular disease” is inappropriate from d = 1.0 in an extreme groups design, so is the conclusion “The effect
of smoking on cardiovascular disease varies considerably with the amount of cardiovascular exercise” inappropriate if the extreme groups smoking design described above were applied only to determined couch potatoes and triathlon junkies.

This is not to say that the extreme groups design has no place. In fact, the extreme groups design is an excellent place to start. If an extreme groups design fails to generate an acceptable level of effect, then there is little point in pursuing a more ecologically valid approach. Rather, our point is that research cannot end with the extreme groups design because it does not speak directly to the language that must be contained in Introduction and Discussion sections.

Just as an artificially large predictor variance complicates the interpretation of large effect size values, so does an artificially weak (or strong) situation. Lab research on job applicant behavior often falls into this category. The strength of a real hiring situation is so great that it can muffle the effects of individual differences. In the absence of actual consequences, such as is often the case in lab research on hiring, individual differences may have consequences that they would not ordinarily have. Without the characteristics of a typical hiring situation, we cannot draw conclusions such as “Conscientiousness has a strong relationship with interview performance” from r = .50. As with the extreme groups design, we must contextualize our conclusions. In doing so, however, we stray from the original intent of the study.

Conclusion

The legend that is the focus of this chapter is that effect size values of a certain magnitude necessarily justify certain general conclusions. In Quine’s terms, the myth is that general statements such as “X and Y are strongly related” are observation sentences vis-à-vis particular effect size values irrespective of the contexts in which the values were generated.
The kernel of truth is that, all else being equal, larger effect sizes justify stronger language. On the other hand, labels such as strong and weak are comparative by nature. In science, all else is rarely equal when such comparisons are made. In evaluating research, we must look beyond conventional rules regarding the language that can be attached to particular effect sizes and, in so doing, avoid the very mistakes for which significance testing has been criticized (namely, blind obedience to a dichotomous decision rule based on p < .05).
Many have rightly objected to the knee-jerk use of strong and weak to describe a result that is statistically significant. Unfortunately, many of these same people have no particular objection to the knee-jerk use of strong and weak to describe an effect size of a given magnitude. Effect sizes, like significance tests, are used to inform the language that we use to communicate our results. If one blindly attaches adjectives on the basis of p < .05, then one makes mistakes. If, on the other hand, one takes into account sample size, then one can glean the magnitude of effects, thereby reducing mistakes. If, in addition, one considers the measures involved, the nature of the manipulation, and the nature of the phenomenon in question, then one is likely to choose appropriate language. Likewise, if one blindly attaches adjectives on the basis of, say, Cohen’s d, then one makes mistakes. If, on the other hand, one takes into account sample size, then one can evaluate chance as an explanation for departure of the observed effect size from a comparison value, thus reducing mistakes. If, in addition, one considers the measures involved, the nature of the manipulation, and the nature of the phenomenon in question, then one is likely to choose appropriate language.

Of course, all things being equal, larger effect sizes are desirable. In fact, almost all researchers would say “Super Effect Size Me” if asked. Rarely, however, are all things equal. Researchers study varied phenomena using a plethora of paradigms and methodologies. Our purpose in this chapter is neither to promote cynicism regarding large effect sizes nor to discourage cynicism regarding small ones. Instead, we hope to convince the reader that reflexive interpretation of effect sizes creates more problems than it solves. To quote a recent report from the Institute for Mixed Metaphors: “Though the glint off the diamond in the rough can be hard to spot, all that glitters is not gold.”

References

Abelson, R. P.
(1985). A variance explanation paradox: When a little is a lot. Psychological Bulletin, 97, 128–132.
Aguinis, H., & Harden, E. E. (2009). Sample size rules of thumb: Evaluating three common practices. In C. E. Lance & R. J. Vandenberg (Eds.), Statistical and methodological myths and urban legends: Doctrine, verity and fable in the organizational and social sciences (pp. 267–286). New York: Routledge/Psychology Press.
Asch, S. (1951). Effects of group pressure upon the modification and distortion of judgments. In H. Guetzkow (Ed.), Groups, leadership, and men (pp. 177–190). Pittsburgh, PA: Carnegie Press.
Brown, K. G. (2001). Using computers to deliver training: Which employees learn and why? Personnel Psychology, 54, 271–296.
Carnap, R. (1936). Testability and meaning. Philosophy of Science, 3, 419–447.
Cohen, J. (1962). The statistical power of abnormal-social psychological research. Journal of Abnormal and Social Psychology, 65, 143–153.
Cohen, J. (1977). Statistical power analysis for the behavioral sciences. San Diego, CA: Academic Press.
Cohen, J. (1994). The Earth is round (p < .05). American Psychologist, 49, 997–1003.
Cortina, J. M., & DeShon, R. P. (1998). Determining relative importance of predictors with the observational design. Journal of Applied Psychology, 83, 798–804.
Cortina, J. M., & Dunlap, W. P. (1997). On the logic and purpose of significance testing. Psychological Methods, 2, 161–172.
Cortina, J. M., & Nouri, H. (2000). Effect size for ANOVA designs. Newbury Park, CA: Sage.
Dudley, N., & Cortina, J. M. (in press). Knowledge and skills that facilitate citizenship performance. Journal of Applied Psychology.
Dudley, N., & Cortina, J. M. (in preparation). Knowledge and skills that predict helping.
Durkheim, E. (1951). Suicide (J. Spaulding & G. Simpson, Trans.). Glencoe, IL: Free Press. (Original work published 1897)
Gravetter, F. G., & Wallnau, L. B. (2007). Statistics for the behavioral sciences (7th ed.). Belmont, CA: Thompson.
Howell, D. C. (1999). Fundamental statistics for the behavioral sciences (4th ed.). Pacific Grove, CA: Brooks-Cole.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis (2nd ed.). Thousand Oaks, CA: Sage.
Ilies, R., Scott, B. A., & Judge, T. A. (2005). The interactive effects of personal traits and experienced states on intraindividual patterns of citizenship behavior. Academy of Management Journal, 48, 561–575.
Isen, A.
M., & Levin, P. F. (1972). ἀe effect of feeling good on helping: Cookies and kindness. Journal of Personality and Social Psychology, 21, 384–388. Judge, T. A., LePine, J. A., & Rich, B. L. (2006). Loving yourself abundantly: Relationship of the narcissistic personality to self and other perceptions of workplace deviance, leadership, and task and contextual performance. Journal of Applied Psychology, 91, 762–775.
308
Jose M. Cortina and Ronald S. Landis
Levine, J. M., Higgins, E. T., & Choi, H. S. (2000). Development of strategic norms in groups. Organizational Behavior and Human Decision Processes, 82, 88–101. Locksley, A., Ortiz, V., & Hepburn, C. (1980). Social categorization and discriminatory behavior: Extinguishing the minimal intergroup discrimination effect. Journal of Personality and Social Psychology, 39, 773–783. McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114, 376–390. Milgram, S. (1963). Behavioral study of obedience. Journal of Abnormal and Social Psychology, 67, 371–378. Morgeson, F. P., & Campion, M. A. (2002). Minimizing tradeoffs when redesigning work: Evidence from a longitudinal quasi-experiment. Personnel Psychology, 55, 589–612. Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112, 160–162. Quine, W. V. (1960). Word and object. Cambridge: MIT Press. Quine, W. V. (1968). Epistemology naturalized. Paper presented at the Fourteenth International Congress of Philosophy, Vienna. Quine, W. V. (1969). Ontological relativity and other essays. New York: Columbia University Press. Quine, W. V. (1970). Natural kinds. In N. Rescher et al. (Eds.), Essays in honor of Carl G. Hempel: A tribute on the occasion of his sixty-fifth birthday (pp. 5–23). Dordrecht: Reidel. Raver, J. L., & Gelfand, M. J. (2005). Beyond the individual victim: Linking sexual harassment, team processes, and team performance. Academy of Management Journal, 48, 387–400. Sauley, K. S., & Bedeian, A. G. (1989). 05: A case of the tail wagging the distribution. Journal of Management, 15, 335–344. Sherif, M. (1936). The psychology of group norms. New York: Harper & Row. Tajfel, H., Billig, M., Bundy, R., & Flament, C. (1971). Social categorization and intergroup behavior. European Journal of Social Psychology, 1, 149–178. Wilson, W. (1979). 
Feeling more than we can know: Exposure effects without learning. Journal of Personality and Social Psychology, 37, 811–821.
13
So Why Ask Me? Are Self-Report Data Really That Bad?
David Chan
The use of self-report data is widespread across diverse fields of empirical research such as organizational behavior, personality and individual differences, social psychology, and mental health. Despite this prevalence, many researchers believe that self-report data face severe threats to validity that weaken the substantive inferences drawn from them. Any researcher familiar with the journal review process could probably testify that one of the most common methodological criticisms of manuscripts under review concerns alleged problems with the use of self-report data. These criticisms are often evident in the comments by reviewers and editors. Even authors themselves tend to accept the alleged problems of self-report data, as indicated by the limitations they acknowledge in the Discussion sections of their manuscripts.
The term self-report data is often used to refer to data obtained using paper-and-pencil questionnaires or surveys containing items that ask respondents to report something about themselves and that are completed by the respondents themselves. However, the types of self-reported variables (or questions/items) vary widely and include demographic variables, personality traits, values, beliefs, attitudes, affect, and behaviors. It seems that critics of self-report data have not sufficiently considered the diversity of the conceptualization and measurement of self-report variables. For example, recall errors are not applicable when respondents provide self-report data on demographic variables such as their sex and age, but these errors may affect the accuracy of their self-report data on other variables such as frequency of information seeking or performing some typical behaviors. Even within the same type of self-report variables (e.g., personality), differences in the specific content of items may invoke differential susceptibility to the same threat to validity. For example, it is possible that behaviors reflecting neuroticism may be more likely to be perceived as maladaptive and socially undesirable than behaviors reflecting less value-laden traits such as extraversion. The diversity of self-report data suggests that the consistent criticisms and general dismissal of self-report data may have exaggerated the problem and that such data, although often imperfect, are not inherently flawed. That is, self-report data are not really that bad and do not deserve their negative reputation in journal publications and the journal review process (Spector, 2006).
The purpose of this chapter is to address the “urban legend” of the problem of self-report data and the way forward in dealing with the legend. The organization of the chapter is as follows. I will begin by stating the urban legend of self-report data and tracing the historical roots of the legend. Next, I will identify four alleged problems of self-report data and describe the truths and myths associated with each problem. I will reinforce the truths, debunk the myths, and suggest ways of dealing with the legend for the purpose of advancing research.
The Urban Legend of Self-Report Data and Its Historical Roots
The urban legend of self-report data is the widespread belief that self-report data have little validity (and hence value) because of two related assumptions, namely (a) they are inherently flawed as measures of the intended constructs and (b) they are unable to provide accurate parameter estimates of interconstruct relationships.
The practical corollary of this widespread belief is the automatic dismissal of manuscripts by many reviewers based almost exclusively on the fact that self-report data were used in the study as the empirical basis for substantive inferences. Moreover, many authors themselves subscribe to the urban legend, such that they make unnecessary apologies for their studies and even incorrect research design decisions in an attempt to mitigate purported validity problems arising from their use of self-report data.
When discussing validity issues of self-report data, it is useful to distinguish two types of validity concerns associated with the abovementioned two assumptions in the urban legend. The first type of validity concern focuses on the measurement of a single self-reported variable, and it is often discussed as an issue of construct validity. The second type of validity concern focuses on the interpretation of the substantive relationship inferred from an observed association (e.g., correlation) linking two or more self-reported variables, and it is often discussed as an issue of “common method variance” (also known as monomethod bias) in self-report data.
There is a long history of arguments casting doubt on the construct validity of self-report data. Since the 1940s, there have been consistent concerns with issues such as interviewer biases, wording of questions, order of questions, social desirability responding, and various types of response sets (e.g., acquiescence, central tendency) as sources of random or systematic measurement error that adversely affect the construct validity of self-report survey responses (e.g., Couch & Keniston, 1960; Cronbach, 1946; Nisbett & Wilson, 1977). Despite the largely atheoretical research on these questionnaire design influences, researchers appeared to have come to the conclusion that a large variety of sources of measurement error related to questionnaire design would affect the construct validity of questionnaire responses and that not much could be done about it (Sudman, Bradburn, & Schwarz, 1996). The constant caution about the presence of various potential measurement errors in self-report data collected from questionnaires, and against acceptance of such data as veridical, as evident in widely used textbooks on psychometrics and psychological testing (e.g., Nunnally, 1978), only serves to reinforce the bad reputation of self-report data and to produce cohort after cohort of social scientists who help perpetuate the urban legend of construct validity problems of self-report data.
The discussion of common method variance in self-report data may be traced back to Campbell and Fiske (1959), who described common method variance as variance attributable to the measurement method rather than to the constructs of interest intended to be assessed by the measures. The basic idea is that the relationship between constructs measured using the same method (e.g., self-reports) may be biased due to shared variance attributable to the same method effect. This shared variance (i.e., common method variance) is a systematic artifact of the measurement method and hence a form of systematic measurement error that biases the estimation of the true interconstruct relationship.
Campbell and Fiske’s (1959) description of the problem of common method variance laid the foundation for the widespread criticism against interpreting the correlation between two self-report measures as an accurate estimate of the relationship between the constructs represented by the two measures. Specifically, because of common method variance, the observed correlation is allegedly an artificial inflation with respect to the true magnitude of the interconstruct relationship. In the last half century, there have been consistent attempts to identify and deal with the various method effects, mostly through statistical analyses that control, isolate, or estimate the impact of the method effects (for a review, see Podsakoff, MacKenzie, Lee, & Podsakoff, 2003). Although attempts by different researchers have produced conflicting conclusions about the impact of method effects on correlations between self-report measures (e.g., Bagozzi & Yi, 1990; Crampton & Wagner, 1994; Spector, 1987; Williams, Cote, & Buckley, 1989), it seems that journal editors, reviewers, and even authors themselves have elevated the problem of common method variance to the status of gospel truth. The consequence is that unless there is clear empirical evidence that method effects have been isolated and separated from the estimation of the interconstruct relationships, studies primarily based on self-report measures are often dismissed as inadequate due to allegedly inflated estimates.
In organizational research, the bias against self-report measures was evident and further reinforced in Campbell’s (1982) remarks as outgoing editor of the Journal of Applied Psychology, where he stated that “if there is no evident construct validity for the questionnaire measure or no variables that are measured independently of the questionnaire, I am biased against the study and believe that it contributes very little. Many people share this bias” (p. 692).
This quotation aptly described the zeitgeist in terms of the doubts about the construct validity of self-report data in general and the problem of common method variance in particular. Numerous studies published after Campbell’s remarks in 1982 found trivial or no impact of method effects in self-report data and showed no evidence of broad effects of common method variance. Authors of these studies cautioned against exaggerating the problem of monomethod studies (e.g., Chan, 2001; Crampton & Wagner, 1994; Lindell & Whitney, 2001; Ones, Viswesvaran, & Reiss, 1996; Schmitt, Pulakos, Nason, & Whitney, 1996; Spector, 1987, 1994). However, many researchers continued to assume that common method variance is a ubiquitous problem of self-report data (Podsakoff et al., 2003; Spector, 1994), and this widespread assumption probably in part explains Sternberg’s (1992) finding that Campbell and Fiske (1959) is one of the most widely cited articles in the history of psychology. Comments such as those from Campbell (1982) and the widespread views of researchers on the problem add to the bad reputation of self-report data and help perpetuate the urban legend of common method variance problems of self-report data.
In the following sections, I identify four commonly alleged problems of self-report data. For each problem, I will isolate and describe its kernel of truth, and then explicate and debunk the myths associated with the problem, thereby providing a point of departure for dealing with the legend to advance research involving self-report data.
Problem #1: Construct Validity of Self-Report Data
The most basic of the problems of self-report data has to do with the fundamental question of its construct validity, and firm believers in the urban legend are convinced that self-report data are inherently flawed as measures of the intended constructs. To examine this problem, we conceptualize the issue using the general linear model represented by Equation 13.1,
$$X = \lambda_{T}\,TC + \lambda_{S}\,SB + \varepsilon \tag{13.1}$$
where X refers to the observed variance of test scores, TC refers to the true construct variance associated with the intended test factor, SB refers to the systematic bias (systematic error variance) associated with the self-report method factor, λT refers to the factor loading (regression weight) associated with the true construct, λS refers to the factor loading (regression weight) associated with the method factor, and ε refers to the random error variance of measurement of the test scores. Note that the self-report method factor refers to one or more contaminating constructs that are extraneous and irrelevant to the intended test construct and evoked by the use of the self-report method of data collection.
To illustrate, consider the various human cognitive and affective processes involved in interpreting and responding to questionnaire items, such as language comprehension, information processing, and motivational mechanisms. Responses to self-report measures are, theoretically, susceptible to systematic influences from various types of extraneous variables associated with these processes. These extraneous variables, if their influences in fact exist, are unintended constructs being assessed by the self-report measure, and they therefore result in systematic measurement error of contamination (hence increasing λS) and reduce the construct validity of the self-report measure. For example, if the questions were worded such that their comprehension required a relatively high level of language ability, and if motivational processes such as ego-enhancement and ego-protection were operating, then data from a self-report measure of self-efficacy are likely to reflect, in addition to true individual differences in self-efficacy (TC) and random measurement error (ε), contaminating constructs (SB) such as individual differences in language ability and self-deception tendency.
According to the legend, the pervasive problem of self-report data is that λT is unacceptably low when it should be high, whereas λS is unacceptably high when it should be zero or near zero, thereby resulting in a lack of construct validity of the self-report data. The kernel of truth is that λS may be nonzero and that the construct validity of the self-report data decreases as λS increases. However, as a parameter estimate, the value of λS is clearly dependent on the relationship between X and SB. In other words, the magnitude of λS depends on the specific self-report measure and its context of use, in terms of the number and nature of the contaminating constructs present and the degree to which these irrelevant constructs were in fact evoked by the self-report method.
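The dependence of construct validity on λS under Equation 13.1 can be illustrated with a minimal sketch. It assumes, beyond what the equation itself states, that TC, SB, and ε are mutually uncorrelated and that TC and SB are standardized to unit variance; the function name `variance_shares` and all numeric loadings are hypothetical and chosen only for illustration.

```python
def variance_shares(lambda_t, lambda_s, error_var):
    """Split Var(X) for X = lambda_t*TC + lambda_s*SB + e into shares.

    Illustrative assumptions (not stated in the chapter): TC, SB, and e
    are mutually uncorrelated, and TC and SB each have unit variance.
    """
    construct_var = lambda_t ** 2   # variance due to the intended construct
    method_var = lambda_s ** 2      # variance due to the self-report method factor
    total_var = construct_var + method_var + error_var
    return {
        "construct": construct_var / total_var,
        "method": method_var / total_var,
        "error": error_var / total_var,
    }


# Hypothetical loadings: lambda_t held constant, contamination lambda_s rising.
for lambda_s in (0.0, 0.3, 0.6):
    shares = variance_shares(lambda_t=0.8, lambda_s=lambda_s, error_var=0.36)
    print(lambda_s, {name: round(value, 2) for name, value in shares.items()})
```

Under these assumptions the construct share of observed variance falls as λS rises, but how far it falls depends entirely on the values supplied, which is the sense in which the problem is specific to a measure and its context rather than inherent to self-report.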
The failure to appreciate the specificity of self-report measurement, together with an uncritical assumption that λS is always high and λT is always low, accounts for the myth that self-report data are inherently flawed and low in construct validity. The widespread belief in a ubiquitous presence of SB and high values of λS may have evolved because of the ease with which we can imagine the variety of systematic measurement errors that may exist for self-report responses. This widespread belief, and hence the broad dismissal of the construct validity of self-report measures, is a myth for various reasons. First, the fact that systematic measurement errors may exist for self-report measures does not imply that all or even some of these errors will always exist, or exist to a serious extent, for all self-report measures. Second, for each possible contaminating construct (i.e., possible systematic error), the plausibility of its influence, or the extent to which the self-report data are susceptible to its influence, is clearly dependent on the specific self-report measure under investigation. It is implausible for self-deception tendency (Paulhus, 1984, 1986) to be a source of systematic measurement error causing contamination in self-reports of demographic variables such as sex, age, and education level. On the other hand, it is plausible for self-deception tendency to contaminate self-reported personality traits, but the extent of susceptibility to contamination is likely to vary across the self-report measures of different traits. For example, it appears that self-deception is less likely to influence responses on a measure of extraversion than on measures of agreeableness or openness to experience. Third, across many research domains, there are many primary studies and reviews demonstrating reasonable criterion-related validities of self-report predictor measures where the criteria are not self-report measures and there are theoretical reasons to expect a predictor-criterion construct relationship. For example, self-reported Big Five personality traits predicted supervisory ratings of job performance (Barrick & Mount, 1991); self-reported proactive personality predicted objective measures of job performance, entrepreneurial behaviors, and career success (Becherer & Maurer, 1999; Crant, 1995; Seibert, Crant, & Kraimer, 1999); self-reported goal orientations predicted training performance and sales performance (Brett & VandeWalle, 1999; VandeWalle, Brown, Cron, & Slocum, 1999); self-reported fairness perceptions predicted organizational citizenship behaviors (Aryee & Chay, 2001; Moorman, Blakely, & Niehoff, 1998); and the list goes on.
Fourth, within almost any major research domain, there are numerous well-established self-report measures of diverse constructs that have obtained construct validity evidence through both convergent and discriminant validation. For example, convergent and discriminant validity evidence exists for self-report measures of diverse constructs such as Big Five personality traits (Costa & McCrae, 1992; Digman, 1990), proactive personality (Bateman & Crant, 1993), affectivity disposition (Watson, 1988; Watson & Clarke, 1984), self-efficacy (Bandura, 1997), goal orientations (Button, Mathieu, & Zajac, 1996; VandeWalle, 1997), perceived organizational support (Eisenberger, Huntington, Hutchinson, & Sowa, 1986; Shore & Tetrick, 1991), job satisfaction (Agho, Price, & Mueller, 1992), organizational commitment (Mowday, Steers, & Porter, 1979), and life satisfaction (Diener, Emmons, Larsen, & Griffin, 1985).
Problem #2: Interpreting the Correlations in Self-Report Data
As noted above, the second assumption in the widespread belief that self-report data have little value is that they are unable to provide accurate parameter estimates of interconstruct relationships. This is primarily the issue of the common method variance problem. To examine this issue, consider first the general linear model represented by Equation 13.2,
$$r_{xy} = \lambda_{Tx}\lambda_{Ty}\,\Phi_{TxTy} + \lambda_{Sx}\lambda_{Sy}\,\Phi_{SxSy} \tag{13.2}$$
where rxy refers to the observed correlation between two measures, ΦTxTy refers to the true correlation between the two intended test constructs, ΦSxSy refers to the true correlation between the two method factors associated with the two measures respectively, and λ refers to the factor loading (regression weight) associated with the respective true construct or method factors. The common method variance problem refers to the case in which both measures use the same self-report method (so that ΦSxSy = 1.00), and therefore the observed correlation, rxy, is thought to equal the sum of the true correlation and the product of the factor loadings (λSxλSy) associated with the self-report method factor. The widespread belief here is that since λSx and λSy are unacceptably high (i.e., Problem #1), the observed correlation is an unacceptably inflated estimate (i.e., an overestimate and hence inaccurate) of the magnitude of the interconstruct relationship represented by ΦTxTy. This widespread belief translates to typical reviewer comments such as “The self-report nature of the measures may account for the high correlations among the variables,” “The high correlations among the variables is not surprising given that the variables were all self-report data,” “The high correlations among the self-report measures in this study are clearly problematic,” and “A fatal flaw in this study is the problem of common method variance as evident in the high correlations among the self-report measures.”
The widespread adherence to this myth of inflation of correlation is somewhat surprising given what we all know about random and systematic measurement errors. First, like other types of measures, self-report measures contain random measurement error and therefore do not have perfect reliability. The lack of perfect reliability implies that the construct factor loadings λTx and λTy are less than 1.00. If there were no method factor effects (i.e., λSxλSy = 0), then Equation 13.2 reduces to rxy = λTxλTy ΦTxTy, and the unreliability of the measures will result in an observed correlation that is an underestimate of the magnitude of the true correlation between the two test constructs; that is, rxy is less than ΦTxTy when λTxλTy is less than 1.00. To the extent that there is attenuation of correlation due to unreliability of measures, the observed correlations among self-report measures are artificially deflated estimates (as opposed to being accurate or inflated estimates). This is the well-established psychometric basis for applying the formula that corrects observed correlations for attenuation due to unreliability of measures in order to obtain a more accurate estimate of the true construct relationships. Hence, even if some kind of monomethod effects are in fact operating (i.e., λSxλSy ≠ 0) in ways that will artificially inflate the observed correlations (with respect to the true correlations), the observed correlations among the self-report measures are the net result of both inflation due to monomethod effects and deflation due to unreliability of measures.
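The correction mentioned above is not written out in the chapter. In its classical form, the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes that classical form, assumes no method effects (λSxλSy = 0), and treats each squared construct loading as the reliability of its measure; the function and parameter names are hypothetical.

```python
import math


def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Classical disattenuation: r_xy / sqrt(rel_x * rel_y).

    Assumes no common method effects, so the only distortion in r_xy
    is attenuation from unreliability (an assumption for this sketch).
    """
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)


# Hypothetical case: true correlation .50, both loadings .80 (reliability .64),
# so Equation 13.2 without method effects gives an observed r of .64 * .50 = .32.
observed = 0.64 * 0.50
print(round(correct_for_attenuation(observed, 0.64, 0.64), 2))
```

When method effects are present, applying this correction alone would not recover the true correlation, because the observed value then also contains the inflation term.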
From Equation 13.2, it is clear that whether the observed correlation between two self-report measures is an overestimate, underestimate, or accurate estimate of the true association between the two constructs represented by the measures depends on the relative magnitude of the two artificial effects (inflation from method effects versus deflation from attenuation due to unreliability), which act in opposing directions. In other words, in the correlation of two measures using the same self-report method (i.e., ΦSxSy = 1.00), whether rxy is higher than, lower than, or equal to ΦTxTy depends on the relative magnitudes of λTxλTy and λSxλSy. To illustrate, consider a hypothetical example of a true correlation (ΦTxTy) of .50 between two constructs, with three different scenarios in which the common method factor effects are constant across scenarios such that λSxλSy is .25, but the reliability of measures varies across the three scenarios such that λTxλTy is .64, .49, or .36, respectively. In this example, applying Equation 13.2, the observed correlation (rxy) is .57, .50, or .43, respectively, for the three scenarios. That is, even when the same magnitude of common method effect (e.g., λSxλSy = .25) is present across scenarios, the observed correlation may be higher than (i.e., .57), equal to (i.e., .50), or lower than (i.e., .43) the true construct correlation (i.e., .50) due to varying degrees of reliability of measures. Of course, it is poor measurement if the observed correlation happens to be similar or equal in magnitude to the true correlation simply because it is a net result of high random measurement error and high systematic measurement error. But that is not the point. The point is that it is a myth to take as a fact that the correlations among self-report measures are always inflated estimates of the true interconstruct relationships.
Inflation of the observed correlation due to common method variance is certainly a potential problem of self-report data; this is the kernel of truth. However, this section has shown that inflation of the observed correlation is a possibility and not a necessity. Moreover, the impact of common method variance on the magnitude and interpretation of the observed correlation is not always clear because other counteracting effects exist. In recent years, several scholars have suggested that the problem of common method variance is probably exaggerated (Chan, 2001; Crampton & Wagner, 1994; Spector, 1994; Williams & Anderson, 1994), argued that the notion of common method variance is often poorly defined without a specification of the measurement issues involved, and highlighted the need for a theory of method effects and measurement errors when discussing the notion (Chan, 1998; Schmitt, 1994; Spector, 2006). Despite the efforts of these scholars, researchers and reviewers continue to speak of common method variance as if it is a clear, pervasive, and inherent problem in self-report data and a logical cause as well as a logical implication of high correlations among self-report measures; this is the myth.
In short, the myth may be summarized as treating high correlations among self-report measures as necessary and sufficient for a common method variance problem that allegedly renders any interpretation of a substantive interconstruct relationship unjustified and flawed. In symbolic logic terms, this myth may be construed as committing a bidirectional equivalence fallacy that wrongly equates high correlations with common method variance, so that the presence of one is assumed to logically imply the presence of the other.
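The three hypothetical scenarios used earlier in this section to illustrate Equation 13.2 can be recomputed with a short sketch. The function name and its argument names are illustrative; the inputs are the values given in the example (true correlation .50, λSxλSy = .25, ΦSxSy = 1.00, and λTxλTy of .64, .49, or .36).

```python
def observed_correlation(true_r, construct_product, method_product, method_r=1.0):
    """Equation 13.2: r_xy = (lTx*lTy) * Phi_TxTy + (lSx*lSy) * Phi_SxSy.

    construct_product is lTx*lTy and method_product is lSx*lSy;
    method_r defaults to 1.00, the same-method case described in the text.
    """
    return construct_product * true_r + method_product * method_r


true_r = 0.50          # Phi_TxTy in the example
method_product = 0.25  # lSx*lSy, held constant across scenarios

for construct_product in (0.64, 0.49, 0.36):
    r_xy = observed_correlation(true_r, construct_product, method_product)
    if round(r_xy, 2) > true_r:
        verdict = "inflated"
    elif round(r_xy, 2) < true_r:
        verdict = "deflated"
    else:
        verdict = "about equal"
    print(f"lTx*lTy = {construct_product:.2f}: r_xy = {r_xy:.3f} ({verdict})")
```

The middle scenario evaluates to .495, which is reported as .50 after rounding to two decimals. The same method effect therefore yields an overestimate, a near-accurate estimate, or an underestimate depending only on the reliability of the measures.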
Problem #3: Social Desirability Responding in Self-Report Data
Social desirability or impression management may be defined as the “tendency for an individual to present him or herself, in test-taking situations, in a way that makes the person look positive with regard to culturally derived norms and standards” (Ganster, Hennessey, & Luthans, 1983, p. 322). Social desirability responding is probably the most often cited criticism of self-report data. Given that this criticism is ubiquitous in the review process and even among authors themselves (evident in the study limitations that they list in the discussion section of their articles), it is important to examine the problem and understand its truths and myths.
The problem of social desirability responding in self-report data involves the two types of validity concerns mentioned above. Specifically, social desirability responding purportedly threatens the validity of self-report data in two ways. First, it allegedly decreases the construct validity of a single self-report measure by being the source of artificial method variance; that is, social desirability responding is the unintended extraneous construct making up the SB term in Equation 13.1. Second, it allegedly confounds the interpretation of the correlation between two self-report measures by being the source of artificial covariance (i.e., common method variance) between the substantive constructs assessed by the self-report measures; that is, social desirability responding affects (purportedly inflates) rxy, and the effects are represented by the λSxλSy term in Equation 13.2. The problem of social desirability responding is multifaceted because, by allegedly being both a source of artificial variance and a source of artificial covariance, it shares all the features of Problem #1 (construct validity) and Problem #2 (interpreting the correlations in self-report data).
The multifaceted nature of the myth of social desirability responding in self-report data is characterized by the widespread belief among researchers that (a) a substantial proportion of variance in the responses on a self-report measure is artificial variance because it is attributable to social desirability, which is distinct from and uncorrelated with the intended construct represented by the measure; (b) a substantial proportion of the covariance in responses on two self-report measures is artificial covariance because it is attributable to social desirability, which is distinct from and uncorrelated with the intended constructs represented by the two measures; and (c) social desirability responding is pervasive in self-report measurement and not much can be done to remove, reduce, or address it.
It is true that some types of self-report measures, particularly those consisting of items with content that is transparent (obvious to the respondent) with regard to the intended construct and used in contexts of high-stakes testing (e.g., applicant testing in personnel selection), are susceptible to social desirability responding in the sense of impression management or faking good. This kernel of truth has been transformed into a myth because of five mistaken assumptions: the need for social desirability responding or motivation to fake is assumed to apply similarly and to a substantial extent to all constructs assessed by self-report measures; fakeability is assumed to imply actual faking; respondent motivations in high-stakes testing contexts are assumed to be operative in all contexts in which self-report measures are used; the occurrence of social desirability is assumed to always lead to common method variance and therefore to inflate the correlation between the two self-report measures being examined; and susceptibility to social desirability responding is assumed to imply that social desirability responding is nonmalleable. The following sections debunk the myth by showing that these assumptions are indeed false.
Not all constructs assessed by self-report measures are equally susceptible to social desirability responding. There is little or no reason to manage impressions or fake on most self-reported demographic variables such as sex, age, and tenure when completing a self-report measure. There is also evidence that self-report measures are less susceptible to social desirability responding when the accuracy of the item responses is verifiable (Becker & Colquitt, 1992; Cascio, 1975).
In addition, the content of some personality, attitudinal, or workplace perception constructs are less likely to be susceptible to social desirability responding given the absence of any clearly desirable norm or standard with respect to the direction of the responses. For example, respondents should be less motivated to manage impression on a self-report measure of extraversion than on a self-report measure of neuroticism, and the need for social desirability should be less relevant to a self-report measure of information-seeking behaviors than to a self-report measure of organizational commitment. In short, some constructs are susceptible to social desirability responding, and this is a kernel of truth that
contributed to the origin of the legend on social desirability of self-report data. Susceptibility to social desirability responding, however, has been mistakenly assumed to apply similarly and to a substantial extent to all constructs assessed by self-report measures, and this mistaken assumption has contributed to the myth associated with the problem of social desirability responding in self-report data.

The distinction between fakeability of self-report measures and actual faking on self-report measures is best illustrated in two streams of faking research in personnel selection. The first stream of faking research examines fakeability of self-report measures, and its typical research design is a true experiment in which responses to a self-report measure are obtained under an honest instruction (control) condition versus a fake-good instruction (experimental) condition. A comparison of the self-report responses between these two conditions provides an indication of the maximum limit of the extent to which scores on the measure can be inflated (i.e., made more favorable) by a conscious attempt to fake good or appear socially desirable (Viswesvaran & Ones, 1999). Several of these experimental studies have shown that many self-report measures are readily fakeable (e.g., Cowles, Darling, & Skanes, 1992; Martin, Bowen, & Hunt, 2002), and a meta-analysis (Ones, Viswesvaran, & Korbin, 1995) suggested that respondents can increase their scores by nearly one half of a standard deviation by faking good. However, fakeability of a self-report measure as demonstrated in experimental studies in laboratory settings does not imply that actual faking to the same extent will in fact occur when the measure is used in field studies or naturalistic settings.
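To make the fakeability comparison concrete, the following is a minimal sketch of how a standardized mean difference between a fake-good condition and an honest condition might be computed. The function name `cohens_d` and the scores are hypothetical illustrations, not data from any study cited here. The pooled-standard-deviation form of d is an assumption; individual studies and meta-analyses may use other variants.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(fake_good, honest):
    """Standardized mean difference (fake-good minus honest) using a pooled SD.

    Assumes two independent groups; `variance` is the sample variance (n - 1).
    """
    n1, n2 = len(fake_good), len(honest)
    pooled_var = (
        (n1 - 1) * variance(fake_good) + (n2 - 1) * variance(honest)
    ) / (n1 + n2 - 2)
    return (mean(fake_good) - mean(honest)) / sqrt(pooled_var)


# Hypothetical scale scores, made up for illustration only.
honest_scores = [3.0, 3.5, 4.0, 4.5, 5.0]
fake_good_scores = [3.5, 4.0, 4.5, 5.0, 5.5]
d = cohens_d(fake_good_scores, honest_scores)
```

A d obtained this way under instructed faking describes what respondents can do when told to fake, which is not necessarily what applicants actually do in the field.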
The second type of study represents the other stream of faking research in personnel selection, in which scores on self-report measures are compared between actual applicants and incumbents or examinees who have little or no reason to fake good (e.g., Hough, Eaton, Dunnette, Kamp, & McCloy, 1990; Hough & Schneider, 1996; Rosse, Stechner, Levin, & Miller, 1998). This stream of research attempts to determine the extent of actual faking in real-life applications of self-report measures. Using a large army sample, Hough et al. (1990) found similar scores between actual applicants and a group of examinees who had no motivation to distort responses. The authors also found that validities of the personality measures examined remained stable regardless of possible distortion by examinees in either unusually positive or negative directions. Hough et al.’s findings on null effects of social desirability may represent the most
optimistic situation of these real-life applications. The meta-analysis by Ones et al. (1994) found variable but mostly small relationships between social desirability and personality as well as several job-relevant performance criteria. Other studies have found evidence of faking on self-reported measures of personality when they were used in actual selection settings and showed that faking attenuated the personality measures’ criterion-related validities (Kluger, Reilly, & Russell, 1991; Rosse et al., 1998; Schmit & Ryan, 1992). The general conclusion from this second stream of faking research is that actual faking may occur in real-life applications but the extent of actual faking may vary considerably and, when it occurs, is much less than the effect size obtained in fakeability studies in laboratory settings. Edens and Arthur’s (2000) meta-analysis found that the faking effect size obtained in naturalistic settings (d = .30) is substantially smaller than that obtained in laboratory settings where respondents were asked to fake good (d = .73). Hence, several researchers have suggested that the faking effect size obtained in fakeability studies represents the maximum limit or worst-case scenario on actual faking in real-life situations in naturalistic settings (e.g., Graham, McDaniel, Douglas, & Snell, 2002; Rosse et al., 1998). In short, many self-report measures are fakeable. Fakeability, however, has been mistakenly assumed to necessarily imply actual faking, and this fallacious implication has contributed to the myth associated with the problem of social desirability responding in self-report data.

Faking is likely to occur in high-stakes testing contexts such as personnel selection settings in which the respondent has strong motivations to present a socially desirable impression. However, similar respondent motivations (and hence faking) may not be operative in all contexts in which self-report measures are used.
For example, in many research contexts in which self-report measures of personality, attitudes, or perceptions were administered to undergraduates or job incumbents, the stakes involved in the testing are relatively low, even when responses were not anonymous, and findings from many of these studies have shown that faking or impression management has trivial or no impact on the responses to these self-report measures (e.g., Chan, 2001, 2004; Moorman & Podsakoff, 1992). In short, it is true that respondent motivation to fake is likely to be high in high-stakes testing contexts. Faking motivation, however, has been mistakenly assumed to be necessarily operative in all contexts in which self-report measures are used, and this overgeneralization of
respondent motivations across testing contexts has contributed to the myth associated with the problem of social desirability responding in self-report data.

There is another widespread belief that a substantial proportion of the covariance in responses on two self-report measures is artificial covariance because it is attributable to social desirability. According to this belief, social desirability responding inflates rxy in Equation 13.2 and the effects are represented by the λSxλSy term in the equation. This myth is rooted in the mistaken, although often implicit, assumption that when social desirability responding occurs in self-report measurement, common method variance necessarily occurs and it acts to inflate the correlation between the two self-report measures. The error in this assumption is surprisingly simple. Common method variance due to social desirability responding (and hence inflation of correlation) between two self-report measures occurs only when social desirability directly causes systematic measurement errors in both of the two self-report measures being correlated, that is, only when λSx and λSy are nonzero (i.e., λSxλSy ≠ 0). If social desirability causes systematic measurement error in only one but not both of the two measures, then, all other things being equal, the true magnitude of the relationship between the two constructs represented by the two self-report measures will be attenuated (i.e., artificially deflated). There is nothing mathematically mysterious about this suppressor effect.
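The direction of both effects can be checked with a small calculation. The sketch below is an illustration under stated assumptions, not a reproduction of Equation 13.2: it assumes a simple linear model in which each measure is a weighted sum of its intended construct, a single shared social desirability factor, and independent random error, with all factors having unit variance and the social desirability factor uncorrelated with the intended constructs. The function name, parameter names, and numerical values are hypothetical.

```python
from math import sqrt


def observed_r(phi, b_tx, b_ty, b_sx, b_sy, err_x, err_y):
    """Model-implied correlation between measures x and y.

    Assumed model (an illustration only):
        x = b_tx * Tx + b_sx * S + e_x
        y = b_ty * Ty + b_sy * S + e_y
    Tx, Ty, and S have unit variance, corr(Tx, Ty) = phi, S is uncorrelated
    with Tx and Ty, and e_x, e_y are independent errors with variances
    err_x and err_y.
    """
    cov_xy = b_tx * b_ty * phi + b_sx * b_sy
    var_x = b_tx ** 2 + b_sx ** 2 + err_x
    var_y = b_ty ** 2 + b_sy ** 2 + err_y
    return cov_xy / sqrt(var_x * var_y)


# Hypothetical parameter values chosen only to show the direction of each effect.
common = dict(phi=0.3, b_tx=1.0, b_ty=1.0, err_x=0.5, err_y=0.5)
r_clean = observed_r(b_sx=0.0, b_sy=0.0, **common)    # no social desirability
r_both = observed_r(b_sx=0.8, b_sy=0.8, **common)     # both measures contaminated
r_x_only = observed_r(b_sx=0.8, b_sy=0.0, **common)   # only x contaminated
```

Under these assumed values the uncontaminated correlation is 0.20 (already below 0.30 because of random error). It rises to about 0.44 when both measures are contaminated and falls to about 0.17 when only x is contaminated.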
Put simply, a systematic measurement error (due to social desirability responding) has occurred in one self-report measure x (i.e., in Equation 13.1, λS ≠ 0) but not in the other self-report measure y. The error has introduced construct-irrelevant variance (which is uncorrelated with measure y) into the affected measure x, which in turn attenuates the observed correlation between measures x and y relative to the true correlation between the intended constructs represented by the two measures.

Finally, there is the widespread belief that not much can be done about social desirability responding, which is related to the mistaken belief that social desirability responding is pervasive and a validity threat inherent in self-report data. As discussed above, it is erroneous to assume that susceptibility to social desirability responding necessarily implies that social desirability responding is pervasive. Contrary to the belief of pervasiveness of social desirability responding in self-report data, the research database is full of examples of empirical studies, even of actual applicant testing in personnel
selection contexts, that have demonstrated little or no faking on self-reported measures of personality and that, even if faking existed, it did not reduce the criterion-related validity of the measures (e.g., Cunningham, Wong, & Barbee, 1994; Ellingson, Smith, & Sackett, 2001; Hough, 1998; Hough et al., 1990; Ones & Viswesvaran, 1998; Ones et al., 1996). It is also erroneous to assume that susceptibility to social desirability responding and its actual existence in various situations necessarily imply that nothing can be done to remove, reduce, or address social desirability effects on validity. Social desirability may be minimized or even removed at the measure construction stage by careful selection of item content and careful item writing that avoid value-laden content or language that invokes the need or desire to present a positive impression with respect to the culturally derived norms and standards. In addition, at the measure administration stage, social desirability responding may be reduced or removed by decreasing the extent of evaluation apprehension through anonymous testing and instructions that emphasize the nonevaluative characteristic of the items and the absence of right or wrong answers to the items. Regardless of whether social desirability is assumed to be present, removed, or minimized in the items of the self-report measure, there are numerous data analytic techniques to statistically control or estimate the effects of social desirability (a variable that has been independently measured in the study) on construct validity and criterion-related validity of the self-report measure.
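One analytic control of this kind compares the zero-order correlation between two measures with the partial correlation obtained after removing a separately measured social desirability variable. The following is a minimal sketch using the standard first-order partial correlation formula; the function name, variable names, and correlation values are hypothetical and chosen only for illustration.

```python
from math import sqrt


def partial_r(r_xy, r_xs, r_ys):
    """First-order partial correlation of x and y, controlling for s."""
    return (r_xy - r_xs * r_ys) / sqrt((1 - r_xs ** 2) * (1 - r_ys ** 2))


# Hypothetical correlations, for illustration only.
zero_order = 0.50
# Social desirability (s) correlates with both measures.
controlled = partial_r(r_xy=zero_order, r_xs=0.40, r_ys=0.40)
# Social desirability correlates with x only.
unaffected = partial_r(r_xy=zero_order, r_xs=0.40, r_ys=0.0)
```

With these made-up values, partialling lowers the correlation (to about 0.40) when social desirability is related to both measures, but raises it (to about 0.55) when social desirability is related to only one of them, so the direction of the adjustment should not be presumed in advance.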
Examples of such techniques include comparison of mean scores on the self-report measure obtained before versus after controlling for the social desirability variable; comparison of the zero-order correlation between the two self-report measures (or between the self-report measure and some non-self-report criterion measure) and the corresponding partial correlation after controlling for the social desirability variable; regression analysis that controls for the predictive effect of the social desirability variable; and varieties of latent variable analyses that model both the method effects of social desirability on the self-report measures and the impact of social desirability on the parameter estimates of the associations between the constructs represented by the self-report measures (for review, see Podsakoff et al., 2003).

To summarize, the extant empirical research literature shows that social desirability responding and its effects are not as ubiquitous as is widely believed. Moreover, the extent of susceptibility to social desirability responding, the extent of actual social desirability
responding, and the effects of social desirability responding on criterion-related validity or correlations between self-report measures are each dependent on many factors including the nature of the construct being assessed by the self-report measure, the construction and administration of the measure (e.g., item content, item wording, test instructions, anonymity), respondent motivation, and the context of use. The failure to recognize these contingencies directly contributed to the myth of social desirability responding in self-report data.

Problem #4: Value of Data Collected From Non-Self-Report Measures

Finally, a problem that is in danger of approaching, if it has not already reached, mythical status concerns the purported superior value of data collected from non-self-report measures vis-à-vis self-report data. The myth surrounding this problem is reflected in reviewer comments such as “This study is problematic as the findings were based on self-report data as opposed to data collected from other sources” and “Unless convergent validity is obtained from other sources of data such as supervisory or peer ratings, the current interpretation in this study which is based on self-report data is problematic,” as well as author comments in the discussion section of their manuscripts such as “A limitation of this study is the use of self-report measures and future research using other sources of data such as supervisory or peer ratings is needed” and “Future research needs to go beyond the subjective nature of self-report data to use other-report measures or objective indicators of the focal constructs to replicate the present findings and test its generalizability.” Undoubtedly, there are situations in which data from non-self-report measures may provide useful convergent validity evidence for making inferences from self-report data.
The mythical status of the problem results when there is a widespread belief that non-self-report measures such as other-reports (e.g., supervisory ratings, peer ratings) and objective indicators (e.g., number of meetings attended and frequency of absenteeism) are somewhat superior to self-report measures because they provide more valid data in terms of representing the constructs of interest (i.e., the belief that λS in Equation 13.1 is zero or substantially lower when X is a non-self-report measure as opposed to a self-report measure). The practical implication of this mythical belief is
that (a) it is always better to use non-self-report measures than self-report measures to measure the same intended constructs and (b) we can be more confident of the validity of a self-report measure if the scores converge with the scores on the corresponding non-self-report measures, that is, if the self-report measure and the corresponding non-self-report measure are highly correlated.

It is true that there are situations in which an appropriate non-self-report measure of the same intended construct may be selected or developed to provide a more valid assessment than the corresponding self-report measure. These situations typically involve constructs that are highly susceptible to impression management or self-deception and readily observable by others or adequately captured by objective indicators (i.e., situations of TC in Equation 13.1 where [a] λT is low and λS is high when X is a self-report measure that is likely to be loaded with SB and [b] λT is high and λS is low when X is a non-self-report measure that is a valid indicator of TC). These constructs are more likely to be assessed with high validity by non-self-report measures than self-report measures. For example, we can readily think of situations where a supervisory rating measure may be a more valid measure of job performance than a self-report measure, a peer rating measure may be a more valid measure of likability than a self-report measure, and an objective record of number of accidents or safety violations may be a more valid measure of worker safety behavior than a self-report measure.

The problem of the myth of superior validity of non-self-report measures is most obvious when assessing constructs that are inherently perceptual in nature. For example, the use of self-report measures is not only justifiable but probably necessary when assessing constructs that are self-referential respondent perceptions such as job satisfaction, mood, perceived organizational support, and fairness perceptions.
For these self-perception constructs, even if other (i.e., non-self-report) forms of measures are available, it is difficult to argue for a superior validity of these non-self-report measures given the self-experiential nature of the respondent perception constructs. Of course, the self-report measures of these respondent perception constructs may be susceptible to social desirability responding and other validity threat problems but, as mentioned above in the discussion of Problem #3, there exist various strategies in construction and administration of the measure, as well as statistical techniques in data analysis, that provide ways to remove, reduce, or estimate the
extent of these measurement problems. In short, to find out about the perception of an individual, it is probably best to ask the individual about his or her perception rather than infer it indirectly from what others observe about the individual’s behaviors. For self-referential respondent perception constructs, the dependence on other reports is problematic for at least three reasons. First, the individual’s perceptions may not translate into observable behaviors. Second, even if perceptions were translated into behaviors, others may not have the opportunity to observe these relevant behaviors. Third, valid measurement by other reports requires the reporter to accurately infer the individual’s specific perception and the specific value on that perception from the observation of the individual’s behavior. In short, in the assessment of respondent perception constructs, it is not true that non-self-report measures are inherently superior in validity when compared to self-report measures. On the contrary, in such assessment, one could justifiably argue that non-self-report measures are very often inferior in validity when compared to self-report measures.

In addition to respondent workplace perceptions, there are many constructs that are not adequately assessed by observable behaviors or objective indicators due to the weak link between these constructs and specific observable behaviors or objective indicators. For example, constructs such as beliefs about human nature may not translate into specific observable behaviors or objective indicators. Consider the construct of belief in malleability of intelligence. Conceptually, we do not have clear theory-driven predictions about an individual’s specific behaviors toward others from the individual’s belief in the malleability of intelligence.
Finally, although the use of non-self-report measures may remove response set problems, the issue of alternative sources of reports on certain constructs also raises difficult interpretation problems associated with potential divergent perspectives on the focal construct from different rating constituencies such as self, peers, and supervisors. This may be illustrated by organizational citizenship behavior (OCB) constructs (Vandenberg, Lance, & Taylor, 2004). For example, the same objective employee behavior, such as taking initiative to revise work procedures to accomplish a task, may be interpreted by peers as civic virtue but by supervisors as insubordination. In this situation, there are discrepancies in the OCB ratings across data sources and it is unclear which one of the two non-self-report
measures (i.e., supervisory reports versus peer reports), if any, provides a “more valid” assessment of the OCB constructs of interest. Moreover, rather than being more or less valid, it may be that the various sources of data (self-report measures and the different types of non-self-report measures) are in fact measuring distinct constructs or distinct dimensions of a multidimensional construct. Hence, it is possible that these different measures are similarly valid but do not correlate highly because different constructs are in fact being measured. In these difficult conceptualization and measurement situations, more conceptual work is needed to articulate a comprehensive theory of the focal constructs that includes different perspectives of rating constituencies to provide the framework for empirical comparisons of ratings across self, peers, and supervisors.

There are two other facets in the debunking of the myth of the superior value of non-self-report measures. First, non-self-report measures may also result in artificially inflated correlations. This inflation is most aptly illustrated in situations of common method variance due to similar effects of the same method factor. For example, applying Equation 13.2, a supervisory report of personality (x) may strongly predict the supervisory performance ratings (y) because both supervisor-reported personality and supervisory ratings of performance were reflecting impression management of the rated individuals (i.e., λSx and λSy are high due to the same SB method factor) rather than their true personality or true performance (i.e., λTx and λTy are low). That is, rxy is an inflated estimate of ΦTxTy. In this example of common method variance, which is a case of predictor-related criterion bias, the use of non-self-report measures led to construct validity problems as well as accuracy problems in the estimation of criterion-related validity.
Second, non-self-report measures may also result in artificially deflated correlations, for reasons similar to those that apply to self-report measures. Specifically, if a contaminating (unintended) construct causes systematic measurement error in only one but not both of the two non-self-report measures, then a suppressor effect occurs, and, all other things being equal, the true magnitude of the relationship between the two constructs represented by the two non-self-report measures will be attenuated (i.e., artificially deflated). For example, if a systematic measurement error occurred in the supervisory rating of performance (e.g., due to leniency error) but not the supervisor-reported measure of personality, then the systematic measurement
error has introduced construct-irrelevant variance in the supervisory measure of performance, which in turn serves to attenuate the observed correlation involving the affected measure (i.e., correlation between supervisory performance ratings and supervisory ratings of personality).

Both artificial inflation of correlation due to predictor-related criterion bias and artificial deflation of correlation due to suppressor effect may also occur when a self-report measure is correlated with the corresponding non-self-report measure. Consider the correlation between a self-report measure of conscientiousness and a supervisor-report measure of conscientiousness. If both measures were affected by impression management of the rated individuals, then artificial inflation of correlation would occur. On the other hand, if impression management of the rated individual affected the self-report measure but not the supervisor-report measure, then a suppressor effect and hence artificial deflation of correlation would occur. The implication here is that the observed correlation between a self-report measure and the corresponding non-self-report measure is not necessarily a good indication of the validity of the self-report measure because the correlation may be artificially inflated or deflated.

To summarize, there are good reasons to consider the widespread belief that non-self-report measures are inherently superior to self-report measures as a myth. Equations 13.1 and 13.2 may apply to non-self-report measures as well as self-report measures, and the values of the parameter estimates λT, λS, λTxλTy, and λSxλSy in the equations are dependent on the interrelationships linking the specific measures X and Y, the intended constructs Tx and Ty, and the method factors Sx and Sy in question.
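The conscientiousness example above can be illustrated with a small simulation. This is a sketch under assumed conditions, not an analysis from the chapter: each report is generated as the true trait score plus a weighted impression-management score plus random error, with all components drawn as independent standard normal variables. The function names, weights, sample size, and seed are hypothetical.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation computed from raw scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def simulate_convergence(bias_self, bias_supervisor, n=20000, seed=13):
    """Simulate self- and supervisor-reports of one trait and return their r.

    Assumed data-generating model (hypothetical): each report equals the true
    trait score plus a weighted impression-management score plus random error.
    """
    rng = random.Random(seed)
    self_report, supervisor_report = [], []
    for _ in range(n):
        trait = rng.gauss(0, 1)
        impression = rng.gauss(0, 1)
        self_report.append(trait + bias_self * impression + rng.gauss(0, 1))
        supervisor_report.append(
            trait + bias_supervisor * impression + rng.gauss(0, 1)
        )
    return pearson_r(self_report, supervisor_report)


# Model-implied population values: 0.50, about 0.67, and about 0.41.
r_neither = simulate_convergence(0.0, 0.0)
r_both = simulate_convergence(1.0, 1.0)
r_self_only = simulate_convergence(1.0, 0.0)
```

In the two biased conditions the self-report measure is generated by the same equation, so its relationship to the true trait is identical, yet its convergence with the supervisor report differs markedly. Under these assumptions, the size of the convergent correlation says little about the validity of the self-report measure itself.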
In fact, there are situations (e.g., when assessing self-experiential respondent perception constructs) in which it is better to use self-report measures than non-self-report measures to measure the same intended constructs. It is also not true that we can always be more confident of the validity of a self-report measure if the scores converge with the scores on the corresponding non-self-report measures. Sometimes, the self-report measure may be construct valid but it does not correlate with the corresponding non-self-report measure because the two measures are in fact measuring distinct constructs. In addition, the self-report measure and the corresponding non-self-report measure may be highly correlated because of artificial correlation due to common method variance problems or the two measures may be weakly correlated because
of artificial deflation due to suppressor effect caused by systematic measurement error in one but not the other measure. In short, the decision to use non-self-report measures and the evaluation of their validity vis-à-vis the validity of the corresponding self-report measures should be done in the usual assessment context of determining the psychometric validity of measurements incorporating concepts of systematic measurement error, artificial inflation and deflation of correlations, and construct validation.

Conclusion and Moving Forward

This chapter has discussed in detail four allegedly pervasive problems of self-report data. Despite the conceptual and empirical arguments against the myths associated with the problems, including the various publications showing that many of the alleged problems associated with these myths are overstatements or exaggerations, it appears from reviewer comments and author comments in journal publications that the current status of these myths is that they continue to be perpetuated by graduate school training and the review process. It is certainly important to be knowledgeable about the various measurement errors that may occur in self-report data and the limitations of self-report measures. But this knowledge per se may be misleading regarding the evaluation of self-report data if one does not realize that many of these errors may also apply to non-self-report measures, is not cognizant of the various myths of broad criticism of self-report measurement, and does not go beyond enumerating the list of potential errors to consider the specific use of self-report measures in particular research contexts. It is true that self-report data may suffer from any or all of the four alleged problems described in this chapter. It is also true that we often do not really know the extent to which each of the four problems is indeed present in studies using self-report data.
Thus, authors, reviewers, and editors do need to deal with these problems. However, there is no strong evidence to lead us to conclude that self-report data are inherently flawed or that their use will always impede our ability to meaningfully interpret correlations or other parameter estimates obtained from the data. On the contrary, there are situations in which the use of self-report data appears to be appropriate and perhaps sometimes most appropriate. Unfortunately,
when evaluating studies using self-report data, firm believers in the urban legend will activate their schemas of the four problems but their assessment will exaggerate the study problems to fit what they believe to be true. An implication of the arguments in this chapter is that reviewers and editors need to be more open to the real possibility that one or more of the four alleged problems of self-report data described above is in fact not a major or even relevant issue in the specific study under review. Hence, reviewers and editors need to stop automatically dismissing studies using self-report data. There is a need for a change of mind-set among reviewers (and even editors) and authors in their approach to self-report data. We should no longer take as default mode the position that self-report data are inherently full of serious or even fatal problems that automatically lead to fallacious inferences and hence require a rejection decision on manuscripts, but should appreciate the pros and cons of self-report data and how these may apply to the study in question, similar to how we would evaluate any other types of measurement method and data source.

To conclude, when discussing self-report data, it is important to consider both the truths and myths associated with each of the four purportedly pervasive problems of self-report data. We need to critically evaluate the claims of the disadvantages (or advantages) of self-report data, and to do so would require us to be explicit about the intended and unintended constructs represented by the self-report measures, as well as the substantive content of the specific self-report items under investigation. It is also important to explicate the possible cognitive, affective, or motivational mechanisms underlying the response process in which the respondent provided the self-report data.
Explicating the conceptualization of self-report data, the nature of the self-report measurement, the substantive self-reported variables, and the mechanisms underlying the response process will provide a systematic conceptual framework for understanding, using, and evaluating self-report data. By identifying the various problems associated with the urban legend of self-report data in terms of the kernels of truth (which likely represented the origins of the legend) and the myths that have developed, this chapter has, hopefully, provided a conceptual basis or at least a point of departure for future researchers to develop such conceptual frameworks.
References

Agho, A. O., Price, J. L., & Mueller, C. W. (1992). Discriminant validity of measures of job satisfaction, positive affectivity and negative affectivity. Journal of Occupational and Organizational Psychology, 65, 185–196.
Aryee, S., & Chay, Y. W. (2001). Workplace justice, citizenship behavior, and turnover intentions in a union context: Examining the mediating role of perceived organizational support and union instrumentality. Journal of Applied Psychology, 86, 154–160.
Bagozzi, R. P., & Yi, Y. (1990). Assessing method variance in multitrait-multimethod matrices: The case of self-reported affect and perceptions at work. Journal of Applied Psychology, 75, 547–560.
Bandura, A. (1997). Self-efficacy: The exercise of control. New York: Freeman.
Barrick, M. R., & Mount, M. K. (1991). The Big Five personality dimensions and job performance: A meta-analysis. Personnel Psychology, 44, 1–26.
Bateman, T. S., & Crant, J. M. (1993). The proactive component of organizational behavior: A measure and correlates. Journal of Organizational Behavior, 14, 103–118.
Becherer, R. C., & Maurer, J. G. (1999). The proactive personality disposition and entrepreneurial behavior among small company presidents. Journal of Small Business Management, 38, 28–36.
Becker, T. E., & Colquitt, A. L. (1992). Potential versus actual faking of a biodata form: An analysis along several dimensions of item type. Personnel Psychology, 45, 389–406.
Brett, J. F., & VandeWalle, D. (1999). Goal orientation and goal content as predictors of performance in a training program. Journal of Applied Psychology, 84, 863–873.
Button, S. B., Mathieu, J. E., & Zajac, D. M. (1996). Goal orientation in organizational research: A conceptual and empirical foundation. Organizational Behavior and Human Decision Processes, 67, 26–48.
Campbell, D. T., & Fiske, D. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Cascio, W. F. (1975). Accuracy of verifiable biographical information blank responses. Journal of Applied Psychology, 60, 767–769.
Chan, D. (2001). Method effects of positive affectivity, negative affectivity, and impression management in self-reports of work attitudes. Human Performance, 14, 77–96.
Chan, D. (2004). Individual differences in tolerance for contradiction. Human Performance, 17, 297–325.
Costa, P. T., Jr., & McCrae, R. R. (1992). NEO-PI-R: Professional manual. Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI). Odessa, FL: Psychological Assessment Resources.
Couch, A. S., & Keniston, K. (1960). Yeasayers and naysayers: Agreeing response set as a personality variable. Journal of Abnormal and Social Psychology, 60, 151–174.
Cowles, M., Darling, M., & Skanes, A. (1992). Some characteristics of the simulated self. Personality and Individual Differences, 13, 501–510.
Crampton, S. M., & Wagner, J. A., III. (1994). Percept-percept inflation in micro-organizational research: An investigation of prevalence and effect. Journal of Applied Psychology, 79, 67–76.
Crant, J. M. (1995). The proactive personality scale and objective job performance among real estate agents. Journal of Applied Psychology, 80, 532–537.
Cronbach, L. J. (1946). Response sets and test design. Educational and Psychological Measurement, 6, 475–494.
Cunningham, M. R., Wong, D. T., & Barbee, A. P. (1994). Self-presentation dynamics on overt integrity tests: Experimental studies of the Reid report. Journal of Applied Psychology, 79, 643–658.
Diener, E., Emmons, R., Larsen, J., & Griffin, S. (1985). The Satisfaction With Life Scale. Journal of Personality Assessment, 49, 71–75.
Digman, J. M. (1990). Personality structure: Emergence of the five-factor model. Annual Review of Psychology, 41, 417–440.
Edens, P. S., & Arthur, W., Jr. (2000). A meta-analysis investigating the susceptibility of self-report inventories to distortion. Paper presented at the 15th Annual Conference of the Society for Industrial and Organizational Psychology, New Orleans, LA.
Eisenberger, R., Huntington, R., Hutchinson, S., & Sowa, D. (1986). Perceived organizational support. Journal of Applied Psychology, 71, 500–507.
Ellingson, J. E., Smith, D. B., & Sackett, P. R. (2001). Investigating the influence of social desirability on personality factor structure. Journal of Applied Psychology, 86, 122–133.
Ganster, D. C., Hennessey, H. W., & Luthans, F. (1983). Social desirability response effects: Three alternative models. Academy of Management Journal, 26, 321–331.
Graham, K. E., McDaniel, M. A., Douglas, E. F., & Snell, A. F. (2002). Biodata validity decay and score inflation with faking: Do item attributes explain variance across items? Journal of Business and Psychology, 16, 573–592.
Hough, L. M. (1998). Effects of intentional distortion in personality measurement and evaluation of suggested palliatives. Human Performance, 11, 209–244.
David Chan
Hough, L. M., Eaton, N. K., Dunnette, M. D., Kamp, J. D., & McCloy, R. A. (1990). Criterion-related validities of personality constructs and the effect of response distortion on those validities. Journal of Applied Psychology, 75, 581–595.
Hough, L. M., & Schneider, R. J. (1996). Personality traits, taxonomies, and applications in organizations. In K. R. Murphy (Ed.), Individual differences and behavior in organizations (pp. 31–88). San Francisco: Jossey-Bass.
Kluger, A. N., Reilly, R. R., & Russell, C. J. (1991). Faking biodata tests: Are option-keyed instruments more resistant? Journal of Applied Psychology, 76, 889–896.
Lindell, M. K., & Whitney, D. J. (2001). Accounting for common method variance in cross-sectional research designs. Journal of Applied Psychology, 86, 114–121.
Martin, B. A., Bowen, C. C., & Hunt, S. T. (2002). How effective are people at faking on personality questionnaires? Personality and Individual Differences, 32, 247–256.
Moorman, R. H., Blakely, G. L., & Niehoff, B. P. (1998). Treating employees fairly and organizational citizenship behaviors: Sorting the effects of job satisfaction, organizational commitment, and procedural justice. Employee Responsibilities and Rights Journal, 6, 209–225.
Moorman, R. H., & Podsakoff, P. M. (1992). A meta-analytic review and empirical test of the potential confounding effects of social desirability response sets in organizational behavior research. Journal of Occupational and Organizational Psychology, 65, 131–149.
Mowday, R. T., Steers, R. M., & Porter, L. W. (1979). The measurement of organizational commitment. Journal of Vocational Behavior, 14, 224–247.
Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84, 231–259.
Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.
Ones, D. S., Viswesvaran, C., & Korbin, W. (1995, May). Meta-analyses of fakability estimates: Between-subjects versus within-subjects designs. In F. L. Schmidt (Chair), Response distortion and social desirability in personality testing and personnel selection. Symposium presented at the 10th Annual Conference of the Society for Industrial and Organizational Psychology, Orlando, FL.
Ones, D. S., Viswesvaran, C., & Reiss, A. D. (1996). Role of social desirability in personality testing for personnel selection: The red herring. Journal of Applied Psychology, 81, 660–679.
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of Personality and Social Psychology, 46, 598–609.
Paulhus, D. L. (1986). Self-deception and impression management in test responses. In A. Angleitner & J. S. Wiggins (Eds.), Personality measurement via questionnaires: Current issues in theory and measurement (pp. 143–165). Berlin: Springer-Verlag.
Podsakoff, P. M., MacKenzie, S. B., Lee, J. Y., & Podsakoff, N. P. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88, 879–903.
Rosse, J. G., Stechner, M. D., Levin, R. A., & Miller, J. L. (1998). The impact of response distortion on preemployment personality testing and hiring decisions. Journal of Applied Psychology, 83, 634–644.
Schmit, M. J., & Ryan, A. M. (1992). Test-taking dispositions: A missing link? Journal of Applied Psychology, 77, 629–637.
Schmitt, N. (1994). Method bias: The importance of theory and measurement. Journal of Organizational Behavior, 15, 393–398.
Schmitt, N., Pulakos, E. D., Nason, E., & Whitney, D. J. (1996). Likability and similarity as potential sources of predictor-related criterion bias in validation research. Organizational Behavior and Human Decision Processes, 68, 272–286.
Seibert, S. E., Crant, J. M., & Kraimer, M. L. (1999). Proactive personality and career success. Journal of Applied Psychology, 84, 416–427.
Shore, L. M., & Tetrick, L. E. (1991). A construct validity study of the Survey of Perceived Organizational Support. Journal of Applied Psychology, 76, 637–643.
Spector, P. E. (1987). Method variance as an artifact in self-reported affect and perceptions at work: Myth or significant problem. Journal of Applied Psychology, 72, 438–443.
Spector, P. E. (1994). Using self-report questionnaires in OB research: A comment on the use of a controversial method. Journal of Organizational Behavior, 15, 385–392.
Spector, P. E. (2006). Method variance in organizational research: Truth or urban legend? Organizational Research Methods, 9, 221–232.
Sternberg, R. J. (1992). Psychological Bulletin’s top 10 “hit parade.” Psychological Bulletin, 112, 387–388.
Sudman, S., Bradburn, N. M., & Schwarz, N. (1996). Thinking about answers: The application of cognitive processes to survey methodology. New York: John Wiley & Sons.
VandeWalle, D. (1997). Development and validation of a work domain goal orientation instrument. Educational and Psychological Measurement, 8, 995–1015.
VandeWalle, D., Brown, S. P., Cron, W. L., & Slocum, J. W., Jr. (1999). The influence of goal orientation and self-regulation tactics on sales performance: A longitudinal field test. Journal of Applied Psychology, 84, 249–259.
Watson, D. (1988). Intraindividual and interindividual analyses of positive and negative affect: Their relation to health complaints, perceived stress, and daily activities. Journal of Personality and Social Psychology, 54, 1020–1030.
Watson, D., & Clark, L. A. (1984). Negative affectivity: The disposition to experience aversive emotional states. Psychological Bulletin, 96, 465–490.
Williams, L. J., Cote, J. A., & Buckley, M. R. (1989). Lack of method variance in self-reported affect and perceptions at work: Reality or artifact? Journal of Applied Psychology, 74, 462–468.
14 If It Ain’t Trait It Must Be Method: (Mis)application of the Multitrait-Multimethod Design in Organizational Research
Charles E. Lance, Lisa E. Baranik, Abby R. Lau, and Elizabeth A. Scharlau
Campbell and Fiske’s (1959) landmark paper on the multitrait-multimethod (MTMM) matrix is one of the most highly cited articles in all of psychology (Fiske & Campbell, 1992). As of February 2007, it had been cited 4,338 times in the Web of Science database alone, in diverse fields such as political science (Funke, 2005), marketing (Kim & Lee, 1997), leisure studies (Glancy & Little, 1995), medicine (Bernard, Cohen, McClellan, & MacLaren, 2004), sociology (Bollen & Paxton, 1998), biology (Ittenbach, Buison, Stallings, & Zemel, 2006), law (Rogers, Sewell, Ustad, Reinhardt, & Edwards, 1995), education (Krehan, 2001), and sports sciences (Cresswell & Eklund, 2006), as well as in various subdisciplines of psychology (e.g., social, personality, industrial-organizational). One of the reasons for the widespread adoption of the MTMM methodology is that the establishment of convergent and discriminant validity by its use has been seen as one of the cornerstones for the documentation of a measure’s construct validity (Benson, 1998; Messick, 1995), which is itself a unifying construct for the organization of validation evidence (Cronbach & Meehl, 1955; Landy, 1986). Thus, the history of the MTMM methodology is one of widespread adoption and endorsement as an important tool in construct validation efforts. However, through its use at least one “urban legend” has arisen that we address in this chapter. This legend is that if one crosses two measurement facets, one of which constitutes the substantive constructs of interest (i.e., the Traits), then the other measurement facet constitutes, de facto, the Method
measurement facet. As we explain later, this default assumption has had some very unfortunate consequences in at least two bodies of literature in industrial and organizational psychology.
Background
Construction of an MTMM matrix requires the (at least partial) crossing of at least two different traits (the trait facet, e.g., different personality constructs) with at least two different measurement methods (the method facet, e.g., different question types such as true-false, forced-choice, agree-disagree) so that each individual measure (referred to as a trait-method unit [TMU]) represents a unique combination of a particular trait as measured by a particular method (e.g., neuroticism as measured by a forced-choice inventory, perceived leader effectiveness as measured by Likert-type items). A stylized example of such a matrix that fully crosses three different traits with three different measurement methods is shown in Table 14.1. Campbell and Fiske’s (1959) criteria for convergent validity require that the monotrait-heteromethod (MTHM) correlations are large, especially relative to the heterotrait-heteromethod (HTHM) correlations and possibly compared to the heterotrait-monomethod (HTMM) correlations, as independent efforts to measure the same trait should be expected to converge in their operationalization of the trait in question. Discriminant validity is evidenced by (a) HTHM correlations that are low relative to the MTHM and HTMM correlations and (b) heterotrait correlations that evidence the same patterns of interrelatedness throughout the MTMM matrix. Finally, measurement method bias (also referred to as method effects or monomethod bias) is evidenced by HTMM correlations that are large relative to MTHM and HTHM correlations, as measurement of two traits by a common method may inflate the correlation between the measures as compared to the correlation between two traits obtained by different measurement methods (Spector, 2006).
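The three kinds of correlation just described can be told apart mechanically from the trait and method labels of the two measures involved. The following is a minimal sketch of that bookkeeping, not a procedure from Campbell and Fiske (1959): the function names `cell_type` and `summarize` are hypothetical, and the correlation values are made up solely so that the pattern is easy to see.

```python
from statistics import mean

def cell_type(trait_a, method_a, trait_b, method_b):
    """Classify the correlation between two distinct trait-method units."""
    same_trait = trait_a == trait_b
    same_method = method_a == method_b
    if same_trait and same_method:
        raise ValueError("the two measures are the same trait-method unit")
    if same_trait:
        return "MTHM"   # monotrait-heteromethod: bears on convergent validity
    if same_method:
        return "HTMM"   # heterotrait-monomethod: may reflect a shared method
    return "HTHM"       # heterotrait-heteromethod: baseline comparison

def summarize(units, corr):
    """Average the correlations of each type.

    units: list of (trait, method) labels, one per measure.
    corr:  dict mapping index pairs (i, j) with i > j to a correlation.
    """
    groups = {"MTHM": [], "HTMM": [], "HTHM": []}
    for (i, j), r in corr.items():
        groups[cell_type(*units[i], *units[j])].append(r)
    return {kind: mean(vals) for kind, vals in groups.items()}

# Hypothetical layout: three traits fully crossed with three methods.
units = [(t, m) for m in ("M1", "M2", "M3") for t in ("T1", "T2", "T3")]
# Made-up values (an assumption for illustration, not data from any study).
made_up = {"MTHM": 0.60, "HTMM": 0.40, "HTHM": 0.20}
corr = {(i, j): made_up[cell_type(*units[i], *units[j])]
        for i in range(len(units)) for j in range(i)}
summary = summarize(units, corr)
```

With nine measures the lower triangle holds 36 correlations: 9 MTHM, 9 HTMM, and 18 HTHM. Comparing the three averages in `summary` is only a rough summary of the Campbell and Fiske comparisons, which are made informally and cell by cell.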
[Footnote: Some methods for analyzing MTMM data require larger matrices for identification purposes (see, e.g., Lance, Noble, & Scullen, 2002).]

Table 14.1 Hypothetical MTMM Matrix of Correlations Between Three Traits as Measured by Each of Three Methods

                         Method 1             Method 2             Method 3
                      T1    T2    T3       T1    T2    T3       T1    T2    T3
Method 1:  Trait 1
           Trait 2    HTMM
           Trait 3    HTMM  HTMM
Method 2:  Trait 1    MTHM  HTHM  HTHM
           Trait 2    HTHM  MTHM  HTHM     HTMM
           Trait 3    HTHM  HTHM  MTHM     HTMM  HTMM
Method 3:  Trait 1    MTHM  HTHM  HTHM     MTHM  HTHM  HTHM
           Trait 2    HTHM  MTHM  HTHM     HTHM  MTHM  HTHM     HTMM
           Trait 3    HTHM  HTHM  MTHM     HTHM  HTHM  MTHM     HTMM  HTMM

Note. HTMM = heterotrait-monomethod correlation; MTHM = monotrait-heteromethod correlation; HTHM = heterotrait-heteromethod correlation.

Campbell and Fiske’s (1959) criteria for convergent and discriminant validity and method effects are now recognized as being rather subjective and a number of more objective, quantitative approaches to the analysis of MTMM matrices have been proposed, including analysis of variance (e.g., Boruch, Larkin, Wolins, & McKinney, 1970; Kavanagh, MacKinney, & Wolins, 1971), path analysis (Avison, 1978; Schmitt, 1978) and multiple regression (e.g., Lehman, 1988). Today, however, a confirmatory factor analysis (CFA) model of some form is usually the analytic method of choice for MTMM data (Millsap, 1995; Wothke, 1996). A common implementation of the CFA model for MTMM data, and one that Lance et al. (2002) argued is the most faithful to Campbell and Fiske’s (1959) original conception of the MTMM matrix, is often referred to as the correlated trait-correlated method (CTCM) model. According to the CTCM model, each TMU reflects the influences of (a) the Trait it is intended to represent (T_i), (b) the Method used to obtain the measure (M_j), and (c) nonsystematic measurement error and specific factors (δ_{ij}):
TMU_{ij} = \lambda_{T_{ij}} T_i + \lambda_{M_{ij}} M_j + \delta_{ij}    (14.1)
where the λs are the factor loadings, or the coefficients for the regression of the TMUs on the T_i and M_j factors. In the CTCM CFA model, convergent validity is evidenced by the strength of the TMUs’ loadings on their respective trait factors, discriminant validity is evidenced by the magnitude of the correlations among the trait factors (lower correlations indicate more distinct Traits), and method bias is evidenced by the TMUs’ loadings on their respective method factors (see Figure 14.1).

[Figure 14.1 CTCM model for MTMM data. Path diagram with factors Method 1, Method 2, Trait 1, and Trait 2; indicators T1M1, T2M1, T1M2, and T2M2; and a uniqueness term δ for each indicator.]

[Footnote: Widaman (1985) presented a taxonomy of CFA models for MTMM data that includes a number of models that are special cases of (nested within) the CTCM model that can be used to conduct formal statistical tests of convergent and discriminant validity and the presence of method effects.]

Several other models for MTMM data have also been developed, such as Campbell and O’Connell’s (1967) and Browne’s (1984) multiplicative models, Marsh’s (1989) correlated uniqueness model, and Eid’s (2000) CTCM-1 model, but the most appropriate model for MTMM data is not a focal issue here, as this is being debated elsewhere (e.g., Eid, 2000; Lance et al., 2002; Lance, Woehr, & Mead, 2007). Rather, our concern is with the particular operationalizations of measurement methods in MTMM designs. In particular, we argue that some choices for method facets in applications of the MTMM design actually represent relevant substantive effects on the TMUs studied and not mere alternative methods of measurement and that interpretation of these substantive effects as method bias has had unfortunate consequences for construct validity inferences in the bodies of literature in which these misinterpretations have occurred. In the remainder of this chapter we review literature that has applied Campbell and Fiske’s (1959) MTMM methodology in order to (a) describe the range of traits and methods that have been studied, (b) identify applications that we argue may have misapplied the MTMM methodology to study alleged method effects that are in fact substantively much more meaningful and theoretically interesting than mere method effects, (c) highlight other applications that have aptly applied the MTMM methodology to assess method effects on measures, and (d) discuss a prototype multidimensional measurement framework within which researchers might better frame their choices of measurement facets in MTMM-related designs. As such, the urban legend addressed in this chapter is that if two measurement facets are crossed in what appears at least nominally to be an MTMM design, the facet that is not the trait facet must therefore be a method facet. The kernel of truth is that the MTMM design has been useful in isolating and assessing measurement method effects in a number of domains. The myth that we aim to debunk here is that the measurement facet that is not the trait facet is, by default, a method facet that represents a form of systematic bias.

[Footnote: We use the term bias to refer to systematic construct-irrelevant influences on observed measures as opposed to nonsystematic measurement error that leads to unreliability in measures.]

Our resolution is to suggest that researchers take a broader and more informed perspective on
designing multifaceted measurement studies, such as that supported by Cattell’s (1966) data box and a related prototype multidimensional measurement framework that we discuss later in this chapter.
Literature Review
We conducted a citation search of Campbell and Fiske (1959) using the Web of Science database. The search resulted in 4,338 articles from a variety of scientific disciplines. We limited our review to articles that (a) contained multitrait-multimethod or multi-trait multi-method in the title, abstract, or keywords, (b) were published in English, (c) were available online, (d) presented an MTMM matrix containing at least two traits and two methods, and (e) were published between January 2000 and June 2006. Theoretical articles and empirical articles not using the MTMM methodology were excluded. Three meta-analyses were also excluded because they were redundant with other articles. Overall, 69 articles met our inclusion criteria. We categorized each article based on discipline, the nature of the traits studied, and the measurement methods used.
[Footnote: A complete listing of all the studies and a summary of the various disciplines represented in our review are available from the second author.]

Range of Traits Studied
The majority of the trait constructs reported (see Table 14.2) were psychological in nature, including (a) personality-related variables such as the Big-5 personality dimensions (e.g., Lim & Ployhart, 2006) and Holland’s vocational interest types (e.g., Yang, Lance, & Hui, 2006), (b) job-related behaviors such as job performance dimensions (e.g., Scullen, Mount, & Judge, 2003) and candidate performance in assessment centers (e.g., Lievens & Conway, 2001), (c) mental health indicators such as family dynamics (e.g., Cook & Kenny, 2006) and exhaustion, anxiety, and depression (Cresswell & Eklund, 2006), and (d) social psychological constructs such as self-efficacy beliefs (e.g., Bong & Hocevar, 2002), influence tactics (Blickle, 2003), and various aspects of self-concept (Marsh, Ellis, Parada, Richards, & Heubeck, 2005). However, studies also examined traits such as medical and health-related indicators (e.g., French, Marteau, Senior, & Weinman, 2005; Ittenbach et al., 2006) and organizational characteristics (e.g., organization environment; Harris, 2004; Ketokivi & Schroder, 2004).

Table 14.2 Trait Factors Investigated in Studies Reviewed

Trait Factor                            Number (%) of Studies
Personality                             14 (20%)
Job-related behaviors                   14 (20%)
Mental health indicators                14 (20%)
Social psychology constructs            14 (20%)
Behavioral/social health indicators      7 (10%)
Group organizational characteristics     3 (4%)
Other                                    4 (6%)

Range of Methods Studied
The most often reported measurement methods included (a) different raters or rater groups such as peer, supervisor, and self-ratings of job performance (e.g., Scullen et al., 2003) or guardian, teacher, and self-ratings of student behaviors (e.g., Cole, 2006) and (b) different tests of the same construct such as different published measures of the Big-5 personality dimensions (e.g., Lim & Ployhart, 2006) or self-concept (Bresnahan, Levine, Shearman, & Lee, 2005; see Table 14.3). A number of studies used what we refer to as “mixed methods,” such as subjective ratings versus objective physiological indicators of health outcomes (e.g., Bourke, McColl, Shaw, & Gibson, 2004) or qualitatively different assessment center exercises (e.g., Lievens, 2001). Still others reported using alternative questionnaire formats (e.g., Yang et al., 2006), positively versus negatively worded items (e.g., DiStefano & Motl, 2006), and multiple occasions (e.g., Marsh et al., 2005). Thus, the range of measurement methods reported was nearly as broad as the range of traits.

Table 14.3 Method Factors Investigated in Studies Reviewed

Method Factor                Number (%) of Studies
Different raters             21 (30%)
Test forms                   20 (29%)
Mixed                        15 (22%)
Scale formats                 7 (10%)
Positive/negative wording     2 (3%)
Occasions                     2 (3%)
Other                         2 (3%)

Not All “Measurement Methods” Are Created Equal
Stevens (1946) defined measurement as “the assignment of numerals to objects or events according to rules” (p. 677), and, at least nominally, each of the methods listed in Table 14.3 satisfies this definition. However, we argue that some “methods” may represent (much) more than mere alternative procedures for assigning numbers to objects or events. Specifically, some alleged “methods” may instead reflect more substantively interesting and theoretically relevant effects. We focus on two of these: (a) different rater sources and (b) assessment center (AC) exercises as alleged measurement methods. The literatures on multisource performance appraisal and AC construct validity overlap little, except that (a) both have relied fairly extensively on the MTMM methodology, and (b) much of these literatures are based implicitly on what Lance, Baxter, and Mahan (2006) referred to as a normative accuracy model. As an extension of classical test theory, a normative accuracy model states generically that
X = T + SB + E    (14.2)
where some observed score (X) is presumed to reflect its true score counterpart (T), some systematic measurement bias (SB), and nonsystematic measurement error (E). Note the apparent isomorphism between Equations 14.1 and 14.2 where traits correspond to true score, methods correspond to systematic bias, and both equations contain measurement error. We suggest that this apparent link between the normative accuracy model in Equation 14.2 and the basic MTMM
model in Equation 14.1 has served, historically, to align trait facets with true score variance and method facets with systematic biases in measures. The multisource performance appraisal and AC construct validity literatures illustrate how this presumed linkage may have been misleading.
The Case of Multisource Performance Appraisal
The multisource performance appraisal literature has long equated rater or rater source effects with undesirable bias and error in ratings that are to be minimized. For example, Guilford’s (1954), Kenny and Berman’s (1980), King, Hunter, and Schmidt’s (1980), and Wherry and Bartlett’s (1982) mathematical models of ratings all specify rater source bias factors as part of their theories. More recently, the equation of rater or rater source effects with undesirable, biasing method effects to be minimized has continued within the MTMM framework. Example attributions of rater (source) effects as representing method bias include Conway’s assertions that “different trait-same rater correlations share a common method (i.e., the same rater)” (Conway, 1996, p. 143), and “[r]esearchers have often defined method variance in the Multitrait-Multirater (MTMR) sense.… In the MTMR framework, method variance is the systematic dimension-rating variance specific to a particular source” (Conway, 1998, p. 29). Another example of the “rater (source) effect equals rater bias” attribution is from Mount, Judge, Scullen, Sytsma, and Hezlett (1998), who wrote that “[s]tudies that have examined performance rating data using multitrait-multimethod matrices (MTMM) or multitrait-multirater (MTMR) matrices usually focus on the proportion of variance in performance ratings that is attributable to traits and that which is attributable to the methods or raters” (p. 559) and that these studies have “documented the ubiquitous phenomenon of method effects in performance ratings” (p.
568; see also, Becker & Cote, 1994; Conway & Huffcutt, 1997; Doty & Glick, 1998; and Podsakoff, MacKenzie, Podsakoff, & Lee, 2003, for similar attributions). As such, many researchers in the multisource rating literature have equated rater (source) effects with unwanted systematic rater bias and have used the MTMM model under a normative accuracy paradigm to estimate proportions of variance in ratings that are attributable to (a) traits (Eq. 14.1, i.e., performance dimensions) as true scores
(Eq. 14.2) and (b) methods (Eq. 14.1, i.e., rating source) as systematic bias (Eq. 14.2), and substantial rating source effects, interpreted as rater bias, are routinely found (e.g., Conway, 1996; Mount et al., 1998). From a normative accuracy perspective, these findings from analyses of multitrait-multirater data suggest that multisource ratings are routinely contaminated by rater bias as a form of common method effects. However, some performance rating researchers have offered alternative perspectives on these ubiquitous rater (source) effects and have proposed that these rater source effects may not represent mere measurement method bias but instead may have more interesting substantive interpretations. For example, Borman (1974) suggested that the MTMM approach to studying rater source effects on ratings “may be ignoring or incorrectly interpreting valid differences in perceptions between organizational levels” (p. 107) such that “if raters from different levels have different relationships with ratees and see different occurrences of ratee behavior due to each level’s unique vantage point, a criterion of agreement in ratings between these two groups seems forced at best” (p. 106). Relatedly, Lance, Teachout, and Donnelly (1992) suggested an alternative interpretation of rater source effects that Lance et al. (2006) referred to as an “ecological perspective” on multisource ratings. Under the rationale that different raters or groups may well be privy to different aspects of ratee behavior and thus have different perspectives on ratee performance, this interpretation views rater source effects as representing the raters’ own unique but valid overall perspective on ratee behavior (see also Bozeman, 1997; London & Smither, 1995; Tornow, 1993; Zedeck, Imparato, Krausz, & Oleno, 1974). Lance et al.
(1992, 2006) provided competitive tests between the normative accuracy and ecological perspectives in CFAs of MTMR data by extending the basic CFA design to estimate correlations between dimension and rater source factors with additional performance-related variables (experience, training proficiency, cognitive ability) outside the core CFA. The rationale for this extended analysis was that if (a) rater source factors represented performance-irrelevant method bias, they ought not to correlate with performance-related external variables (the normative accuracy perspective), but if (b) rater source factors instead represented performance-relevant substantive constructs, they ought to correlate with performance-related external variables (the ecological perspective). In both studies, results strongly supported the ecological perspective, indicating that rater source factors should be more properly interpreted as representing substantive performance-related constructs, not method bias factors.

[Footnote: Hoffman and Woehr (2007) provided a substantive replication of these findings.]

What implications do these findings have for the multisource rating literature? First, they suggest that the default assumption that rater sources represent mere alternative measurement methods led to a widespread misinterpretation of rater or rater source effects as representing method bias rather than raters’ unique but perhaps valid perspectives on ratee performance. This long-standing misinterpretation had the unfortunate consequence of leading researchers to believe that multisource ratings reflected substantial proportions of unwanted contaminating measurement method bias. Second, these findings imply that decades of work toward “improving” performance rating technologies, such as redesigned rating formats and rater training aimed at increasing interrater agreement, may have been misguided, as rater disagreement is to be expected. Third, these findings support one of the key beliefs underlying multisource performance feedback efforts, namely that raters occupying different organizational positions relative to the ratee bring complementary perspectives to the assessment of ratee performance (London & Smither, 1995; Tornow, 1993). Finally, benchmark reliability estimates from reliability generalization studies (e.g., Viswesvaran, Ones, & Schmidt, 1996) based on interrater reliability estimates may have (a) (severely) underestimated the reliability of performance ratings under the assumption that ratings obtained from different sources are interchangeable except for nonsystematic measurement error and, as a result, (b) motivated overcorrection for unreliability of ratings in meta-analyses that have used these benchmark estimates (e.g., Kuncel, Hezlett, & Ones, 2001).
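The logic of the competitive test described above can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the CFA procedure used in the studies cited: ratings are generated in the form of Equation 14.1 with two dimensions and two rater sources, all loadings and the sample size are arbitrary illustrative values, and a simple source-contrast score (hypothetical, introduced here) stands in for the latent rater source factor.

```python
import math
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(rho_source_external, n=5000, seed=1):
    """Correlate a source-contrast score with an external variable.

    rho_source_external: assumed correlation between the source-1 factor
    and the external variable (0.0 mimics performance-irrelevant bias).
    """
    rng = random.Random(seed)
    lam_t, lam_m, sd_e = 0.6, 0.5, 0.6   # illustrative values only
    contrast, external = [], []
    for _ in range(n):
        y = rng.gauss(0, 1)                          # external variable
        t1, t2 = rng.gauss(0, 1), rng.gauss(0, 1)    # two dimensions
        # Source-1 factor may share variance with y; source-2 does not.
        s1 = (rho_source_external * y
              + math.sqrt(1 - rho_source_external ** 2) * rng.gauss(0, 1))
        s2 = rng.gauss(0, 1)
        # Equation 14.1 form: rating = trait part + source part + error.
        x = {(t, s): lam_t * tv + lam_m * sv + rng.gauss(0, sd_e)
             for t, tv in (("t1", t1), ("t2", t2))
             for s, sv in (("s1", s1), ("s2", s2))}
        # Averaging over dimensions and differencing the two sources
        # cancels the trait part and leaves source effects plus error.
        c = ((x["t1", "s1"] + x["t2", "s1"]) / 2
             - (x["t1", "s2"] + x["t2", "s2"]) / 2)
        contrast.append(c)
        external.append(y)
    return pearson(contrast, external)
```

Calling `simulate(0.0)` corresponds to the normative accuracy expectation and should give a correlation near zero, whereas `simulate(0.5)` corresponds to the ecological expectation and should give a clearly positive correlation. The contrast score is a crude observable proxy; the studies discussed in the text estimated correlations between latent factors and external variables within the CFA itself.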
As such, the misattribution of rating source effects as representing mere measurement method biases on multisource ratings has had widespread unfortunate consequences for this area of research.
The Case of AC Construct Validity
Frequently, ACs are designed such that assessors provide ratings of candidates’ performance at the completion of each exercise (so-called postexercise dimension ratings or PEDRs; Lance, 2008). It is
also often the case that many of the same dimensions are assessed across different exercises so that the pattern of PEDRs resembles an MTMM matrix (i.e., dimensions are at least partially crossed with exercises). Traditional wisdom holds that candidate performance in ACs is assessed with respect to the dimensions that are defined for relevant exercises. However, Sackett and Dreher’s (1982) exploratory factor analysis of PEDRs resulted in factors that represented the exercises in which behavior was assessed and not the dimensions that were defined for the AC. These “troubling empirical findings” (Sackett & Dreher, p. 401) have now been replicated in over 35 additional studies (see Lance, Lambert, Gewin, Lievens, & Conway, 2004). Why are these findings troubling? Because exercise effects were interpreted early on in this stream of research as representing unwanted measurement method effects: “Examining post exercise dimension ratings in a multitrait-multimethod matrix context (dimensions = traits; exercises = methods) typically reveals considerably higher correlations among dimension ratings made in the same exercise than among the various ratings of a given dimension across exercises” (Sackett, 1987, p. 19). This view persisted for 25 years until very recently as evidenced by statements to the effect that “considerable proportions of the exercise variance may be regarded as sources of measurement method bias” (Lievens & Conway, 2001, p. 1211), and “[t]his robust finding of method variance (referred to as the exercise effect) has led many researchers to question the construct validity of assessment ratings” (Schleicher, Day, Mayes, & Riggio, 2002, p. 735). That is, the AC construct validity literature has used the MTMM methodology within a normative accuracy framework to assess the proportions of variance in AC ratings attributable to (a) traits (Eq. 14.1, i.e., AC rating dimensions) as representing true scores (Eq. 14.2) and (b) methods (Eq.
14.1, i.e., AC exercises) as representing systematic bias (Eq. 14.2). Lance, Newbolt, Gatewood, Foster, French, and Smith (2000) argued for an alternative interpretation of AC exercise effects as representing true cross-situational specificity in AC performance. Lance et al. (2000) and Lance, Foster, Gentry, and Thoresen (2004) used an analytic approach similar to Lance et al.’s (1992, 2006) to test this idea. Specifically, they used CFA to analyze PEDRs in a quasi-MTMM framework and argued that if (a) exercise factors represented performance-irrelevant method bias they ought not to correlate with performance-related external variables (the normative
accuracy perspective), but if (b) exercise factors instead represented cross-situational specificity in actual AC performance, they ought to correlate with performance-related external variables (cognitive ability, job knowledge, job performance). Five separate studies reported by Lance et al. (2000, 2004) strongly supported the latter interpretation, indicating that exercise factors should be more properly interpreted as cross-situational specificity in actual AC performance, not method bias. What implications do these findings have for the AC construct validity literature? First, they suggest that the assumption that AC exercises represent mere alternative measurement methods led researchers to misinterpret exercise effects as unwanted, contaminating measurement method biases. Just as in the multisource rating literature, this assumption led researchers to believe that ACs were substantially contaminated by method bias. Second, the attribution that ACs were not construct valid because they reflected substantial exercise (i.e., measurement method) effects and not dimension effects stimulated over 25 years of research to “fix” ACs that in fact were not broken in the first place. Rather, exercise effects are now better understood as representing actual cross-situational specificity in candidate performance and not measurement method bias (Lance, 2008; Lance et al., 2000, 2004). These recent findings also have implications for redesigning ACs around critical job tasks rather than around trait-like dimensions (e.g., Jackson, Stillman, & Atkins, 2005). As such, although it took 25 years of research to realize it, it is now clear that AC exercise effects represent much more than mere alternative methods of measurement.
Other Cases
Are there other cases where alleged measurement methods are more than mere alternative methods for assigning numbers to observations? We think so.
One of these cases is found in research examining positive versus negative item-wording effects as method effects. These effects have been found repeatedly with Rosenberg’s (1965) Self-Esteem Scale (RSES), for example. Although the RSES was written to be a unidimensional measure of self-esteem, Carmines and Zeller (1979) found that the five positively worded RSES items loaded highly on one factor, whereas the five negatively worded items loaded highly
on a second factor. Because these two factors did not correlate differentially with other theoretically related constructs, it was thought that these two factors likely represented a method artifact. However, more recently researchers have questioned this conclusion (Tomás & Oliver, 1999). For example, Motl and DiStefano (2002) and Horan, DiStefano, and Motl (2003) showed that the item-wording effects associated with the RSES were invariant over time and were related to negative wording effects on other attitudinal measures, and DiStefano and Motl (2006) and Quilty, Oakman, and Risko (2006) have shown that these wording effects are correlated with a number of personality traits. Thus, recent research suggests that these item-wording effects may also reflect more interesting, substantively meaningful phenomena rather than mere alternative measurement methods. A final example is measurement occasions as “methods.” Although occasions are not used as frequently as other “methods” (however, see Hernández & González-Romá, 2002; Marsh et al., 2005), we suspect that using occasions as methods may confound “method” with true state-related aspects of, and cross-situational specificity in, the constructs being studied. Fortunately, a number of well-developed state-trait-occasion-related CFA designs have already been formulated to address these possibilities (see, e.g., Cole, Martin, & Steiger, 2005; Eid & Langeheine, 1999; Kenny & Zautra, 2001; Schermelleh-Engel, Keith, Moosbrugger, & Hodapp, 2004).
So, Are Any “Method” Facets Really Method Facets?
Yes, we think so. And probably quite a large number of them. If we define variations in measurement methods as alternative approaches to assigning numbers to observations to represent individuals’ standing on latent constructs independent of substantive content related to other latent constructs, then we think many good examples can be found in the MTMM literature.
As one example, a number of studies have investigated the amount of method variance interjected as a function of various questionnaire scale anchor formats, for example, semantic differential versus Likert versus ratio scaling (e.g., Auken, Barry, & Bagozzi, 2006; see also Baltes, Bauer, Bajdo, & Parker, 2002; Lance & Sloan, 1993). Other studies have investigated alternative item formats such as multiple-choice versus essay questions (e.g., Pitoniak, Sireci, & Luecht, 2002; see also Yang et al., 2006). Still
other examples of alternative measurement methods include other-report versus coded videotaped interactions as alternative methods for measuring parent-adolescent interactional styles (Janssens, DeBruyn, Manders, & Scholte, 2005); air-displacement plethysmography, anthropometry, and X-ray absorptiometry measures of children’s body composition (Ittenbach et al., 2006); and pharmacy refill records, blood serum concentrations, and self-report measures of patients’ adherence to immunosuppressant therapy regimens (e.g., Chisolm, Lance, Williamson, & Mulloy, 2004). And there are probably many other examples of alternative approaches to construct measurement that are just that: alternative measurement methods.
Discriminating Method From Substance, or “If It Looks Like a Method and Quacks Like a Method…”
So how is one to determine whether some measurement facet other than the one that consists of one’s focal constructs is simply a measurement facet or something more? This may not always be an easy question to answer, and we consider it from two perspectives. From one perspective, we return to a definition of measurement method that we proposed earlier: If a particular measurement facet truly represents alternative approaches to assigning numbers to observations to represent individuals’ standing on latent constructs independent of substantive content related to other latent constructs, the facet might be reasonably viewed as representing alternative measurement methods. However, if alternatives on the facet in question could be argued as representing variations in some substantive, theoretical constructs, then the facet may well represent something (much) more than simply variations in measurement method. But answering this question may not always be straightforward. As we pointed out earlier, alternative rating sources and AC exercises were once thought of as mere methods for obtaining scores relating to job performance and AC dimensions, respectively. We now know better.
As such, answers to this question will necessarily depend on the current state (and maturity) of substantive knowledge in the relevant research domain. The other perspective we take on whether some measurement facet represents variations in measurement method or something more builds upon Cattell’s (1946, 1966) work on multidimensional
data relational systems in the context of the measurement of personality. Cattell’s (1946) “Covariation Chart” made the multidimensional nature of the structure of personality data explicit in formulating a three-dimensional Persons × Occasions × Tests array, which he showed could be partitioned and transposed into various subsets and configurations for purposes of data collection, analysis, and interpretation. From his Covariation Chart, Cattell (1946) clarified relationships among more typical R-mode factor analysis (where correlations among tests are calculated across persons and factored) versus Q-mode (correlations among people are calculated across tests and then factored, a form of cluster analysis), P-mode (correlations among tests are calculated across occasions and then factored, an idiographic application), and other applications of factor analysis. Cattell (1966) extended this idea to a more general relational system for multidimensional data that he called a Basic Data Relation Matrix or, more simply, The Data Box. This 10-dimensional Data Box consisted of five primary dimensions consisting of (a) persons (or organisms), (b) focal stimuli, (c) environmental background variables, (d) response patterns, and (e) observers, plus temporal variants of each of these. One intended purpose of The Data Box was to serve as a larger conceptual context within which any particular research study’s data array could be framed. It is in the spirit of Cattell’s conceptions of the multidimensional nature of data arrays that we urge researchers to consider their data structures within the (potentially much) larger context within which their particular data structure is a subset. What would constitute the dimensions of some theoretically complete data array for organizational researchers?
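Before turning to that question, Cattell's three modes can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the toy Persons × Occasions × Tests scores are invented, and the helper names (`pearson`, `r_mode`, `q_mode`, `p_mode`) are hypothetical labels of ours, not Cattell's notation. Each function computes only the correlation matrix that would then be factored; the factoring step itself is omitted.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def corr_matrix(variables):
    """Correlate every pair of variables (each variable is a list of scores)."""
    return [[pearson(a, b) for b in variables] for a in variables]

# Hypothetical toy data box indexed as box[person][occasion][test]:
# 4 persons, 4 occasions, 3 tests. The numbers are invented for illustration.
box = [
    [[3, 5, 2], [4, 6, 2], [5, 6, 3], [4, 7, 4]],  # person 0
    [[6, 4, 5], [5, 5, 6], [7, 4, 6], [6, 3, 7]],  # person 1
    [[2, 8, 4], [3, 7, 5], [2, 9, 3], [4, 8, 5]],  # person 2
    [[7, 2, 6], [8, 3, 7], [6, 2, 8], [7, 1, 6]],  # person 3
]

def r_mode(box, occasion):
    """R-mode: correlations among tests, computed across persons (one occasion)."""
    n_tests = len(box[0][0])
    tests = [[person[occasion][t] for person in box] for t in range(n_tests)]
    return corr_matrix(tests)

def q_mode(box, occasion):
    """Q-mode: correlations among persons, computed across tests (one occasion)."""
    persons = [list(person[occasion]) for person in box]
    return corr_matrix(persons)

def p_mode(box, person):
    """P-mode: correlations among tests, computed across occasions (one person)."""
    n_tests = len(box[0][0])
    tests = [[occ[t] for occ in box[person]] for t in range(n_tests)]
    return corr_matrix(tests)

R = r_mode(box, occasion=0)  # 3 x 3 matrix (tests by tests)
Q = q_mode(box, occasion=0)  # 4 x 4 matrix (persons by persons)
P = p_mode(box, person=0)    # 3 x 3 matrix (tests by tests, within one person)
```

The three functions slice the same array and differ only in which dimension supplies the variables and which supplies the replications, which is the sense in which one array can be "partitioned and transposed" into different analyses.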
A useful prototype multidimensional measurement system might consist of the following dimensions: (a) persons (or groups of persons, or collectivities of groups of persons who may be the object of study); (b) focal constructs that constitute the relevant characteristics of the entities studied; (c) occasions, or temporal replications of measurement; (d) different situations in which measurement may occur; (e) observers or recorders of entities’ behavior; and (f) response modalities/formats. In fact, all of the studies’ designs reviewed in this chapter can be located as a three-dimensional subset space within this larger system. For example, the multisource rating studies reviewed here calculated correlations across (a) persons (ratees) in designs that crossed (b) focal constructs (performance dimensions) and (c)
observers (raters or rater groups). As a second example, the AC construct validity studies reviewed here calculated correlations across (a) persons (candidates) in designs that (at least partially) crossed (b) focal constructs (AC dimensions) and (c) measurement situations (AC exercises). Finally, studies that investigated scale format effects have typically calculated correlations across (a) persons in designs that cross (b) focal constructs with (c) alternative response modalities. We make no claim as to the exhaustiveness of this prototype measurement system’s dimensions and encourage others to build upon and modify it, yet it is useful in classifying the studies reviewed in this chapter. The points we wish to make here are that (a) each of the MTMM-related studies cited here can be located as a three-dimensional subset space within the larger prototype multidimensional measurement space proposed, and (b) the design dimension that is not the focal construct dimension is not necessarily, by default, a measurement method dimension. The urban legend that “if it isn’t trait, it must be method” is an egregious oversimplification of multidimensional measurement design whose intentional or unintentional invocation has had some very unfortunate consequences in some areas of organizational research. We urge researchers to take a broader perspective in locating their measurement designs in a theoretical multidimensional measurement space such as the prototype system outlined here.
References
Auken, S. V., Barry, T. E., & Bagozzi, R. P. (2006). A cross-country construct validation of cognitive age. Journal of the Academy of Marketing Science, 34, 439–455.
Avison, W. R. (1978). Auxiliary theory and multitrait-multimethod validation: A review of two approaches. Applied Psychological Measurement, 2, 431–447.
Baltes, B. B., Bauer, C. C., Bajdo, L. M., & Parker, C. P. (2002). The use of multitrait-multimethod data for detecting nonlinear relationships: The case of psychological climate and job satisfaction. Journal of Business and Psychology, 17, 3–17.
Becker, T. E., & Cote, J. A. (1994). Additive and multiplicative method effects in applied psychological research: An empirical assessment of three models. Journal of Management, 20, 625–641.
Benson, J. (1998). Developing a strong program of construct validation: A test anxiety example. Educational Measurement: Issues and Practice, 17, 10–22.
Bernard, R. S., Cohen, L. L., McClellan, C. B., & MacLaren, J. E. (2004). Pediatric procedural approach-avoidance coping and distress: A multitrait-multimethod study. Journal of Pediatric Psychology, 29, 131–141.
Blickle, G. (2003). Convergence of agents’ and targets’ reports on intraorganizational influence attempts. European Journal of Psychological Assessment, 19, 40–53.
Bollen, K. A., & Paxton, P. (1998). Detection and determinants of bias in subjective measures. American Sociological Review, 63, 465–478.
Bong, M., & Hocevar, D. (2002). Measuring self-efficacy: Multitrait-multimethod comparison of scaling procedures. Applied Measurement in Education, 15, 143–171.
Borman, W. C. (1974). The rating of individuals in organizations: An alternative approach. Organizational Behavior and Human Performance, 12, 105–124.
Boruch, R. F., Larkin, J. D., Wolins, L., & McKinney, A. C. (1970). Alternative method of analysis: Multitrait-multimethod data. Educational and Psychological Measurement, 30, 833–854.
Bourke, S. C., McColl, E., Shaw, P. J., & Gibson, G. J. (2004). Validation of quality of life instruments in ALS. Amyotrophic Lateral Sclerosis and Other Motor Neuron Disorders, 5, 55–60.
Bozeman, D. (1997). Interrater agreement in multi-source performance appraisal: A commentary. Journal of Organizational Behavior, 18, 313–316.
Bresnahan, M. J., Levine, T. R., Shearman, S. M., & Lee, S. Y. (2005). A multimethod multitrait validity assessment of self-construal in Japan, Korea, and the United States. Human Communication Research, 31, 33–59.
Browne, M. W. (1984). The decomposition of multitrait-multimethod matrices. British Journal of Mathematical and Statistical Psychology, 37, 1–21.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56, 81–105.
Campbell, D. T., & O’Connell, E. J. (1967). Method factors in multitrait-multimethod matrices: Multiplicative rather than additive? Multivariate Behavioral Research, 2, 409–426.
Carmines, E. G., & Zeller, R. A. (1979). Reliability and validity assessment. Beverly Hills, CA: Sage.
Cattell, R. B. (1946). Description and measurement of personality. Yonkers-on-Hudson, NY: World Book.
Cattell, R. B. (1966). The data box: Its ordering of total resources in terms of possible relational systems. In R. B. Cattell (Ed.), Handbook of multivariate experimental psychology (pp. 67–128). Chicago: Rand McNally.
Chisolm, M. A., Lance, C. E., Williamson, G. M., & Mulloy, L. L. (2004). Development and validation of the immunosuppressant adherence instrument (ITAS). Patient Education and Counseling, 59, 13–20.
Cole, D. A. (2006). Coping with longitudinal data in research on developmental psychopathology. International Journal of Behavioral Development, 30, 20–25.
Cole, D. A., Martin, N. C., & Steiger, J. H. (2005). Empirical and conceptual problems with longitudinal trait-state models: Introducing a trait-state-occasion model. Psychological Methods, 10, 3–20.
Conway, J. M. (1996). Analysis and design of multitrait-multirater performance appraisal studies. Journal of Management, 22, 139–162.
Conway, J. M. (1998). Understanding method variance in multitrait-multirater performance appraisal matrices: Examples using general impressions and interpersonal affect as method factors. Human Performance, 11, 29–55.
Conway, J. M., & Huffcutt, A. I. (1997). Psychometric properties of multisource performance ratings: A meta-analysis of subordinate, supervisor, peer, and self-ratings. Human Performance, 10, 331–360.
Cook, W. L., & Kenny, D. A. (2006). Examining the validity of self-report assessments of family functioning: A question of the level of analysis. Journal of Family Psychology, 20, 209–216.
Cresswell, S. L., & Eklund, R. C. (2006). The convergent and discriminant validity of burnout measures in sport: A multitrait-multimethod analysis. Journal of Sports Sciences, 2006, 209–220.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52, 281–302.
DiStefano, C., & Motl, R. W. (2006). Further investigating method effects associated with negatively worded items on self-report surveys. Structural Equation Modeling, 13, 440–464.
Doty, D. H., & Glick, W. H. (1998). Common method bias: Does common methods variance really bias results? Organizational Research Methods, 1, 374–406.
Eid, M. (2000). A multitrait-multimethod model with minimal assumptions. Psychometrika, 65, 241–261.
Eid, M., & Langeheine, R. (1999). The measurement of consistency and occasion specificity with latent class models: A new model and its application to the measurement of affect. Psychological Methods, 4, 100–116.
Fiske, D. W., & Campbell, D. T. (1992). Citations do not solve problems. Psychological Bulletin, 112, 393–395.
French, D. P., Marteau, T. M., Senior, V., & Weinman, J. (2005). How valid are measures of beliefs about the causes of illness? The example of myocardial infarction. Psychology and Health, 20, 615–635.
Funke, F. (2005). The dimensionality of right-wing authoritarianism: Lessons from the dilemma between theory and measurement. Political Psychology, 26, 195–218.
Glancy, M., & Little, S. L. (1995). Studying the social aspects of leisure: Development of the multiple-method field investigation model (MMFI). Journal of Leisure Research, 27, 305–325.
Guilford, J. P. (1954). Psychometric methods. New York: McGraw-Hill.
Harris, R. D. (2004). Organizational task environments: An evaluation of convergent and discriminant validity. Journal of Management Studies, 41, 857–882.
Hernández, A., & González-Romá, V. (2002). Analysis of multitrait-multioccasion data: Additive versus multiplicative models. Multivariate Behavioral Research, 37, 59–87.
Hoffman, B. J., & Woehr, D. J. (2007). Disentangling the meaning of multisource feedback: An examination of the nomological network surrounding source and dimension factors. Manuscript submitted for publication.
Horan, P. M., DiStefano, C., & Motl, R. W. (2003). Wording effects in self-esteem scales: Methodological artifact or response style? Structural Equation Modeling, 10, 444–455.
Ittenbach, F. R., Buison, A. M., Stallings, V. A., & Zemel, B. S. (2006). Statistical validation of air-displacement plethysmography for body composition assessment in children. Annals of Human Biology, 33, 187–201.
Jackson, D. J. R., Stillman, J. A., & Atkins, S. G. (2005). Rating tasks versus dimensions in assessment centers: A psychometric comparison. Human Performance, 18, 213–241.
Janssens, J. M. A. M., DeBruyn, E. E. J., Manders, W. A., & Scholte, R. H. J. (2005). The multitrait-multimethod approach in family assessment. European Journal of Psychological Assessment, 21, 232–239.
Kavanagh, M. J., MacKinney, A. C., & Wolins, L. (1971). Issues in managerial performance: Multitrait-multimethod analyses of ratings. Psychological Bulletin, 75, 34–49.
Kenny, D. A., & Berman, J. S. (1980). Statistical approaches to the correction of correlational bias. Psychological Bulletin, 88, 288–295.
Kenny, D. A., & Zautra, A. (2001). Trait-state models for longitudinal data. In L. M. Collins & A. G. Sayer (Eds.), New methods for the analysis of change (pp. 243–263). Washington, DC: American Psychological Association.
Ketokivi, M. A., & Schroeder, R. G. (2004). Perceptual measures of performance: Fact or fiction? Journal of Operations Management, 22, 247–264.
Kim, C., & Lee, H. (1997). Development of family triadic measures for children’s purchase influence. Journal of Marketing Research, 34, 307–321.
King, L. M., Hunter, J. E., & Schmidt, F. L. (1980). Halo in a multidimensional forced-choice performance evaluation scale. Journal of Applied Psychology, 65, 507–516.
Krehan, K. D. (2001). An investigation of the validity of scores on locally developed performance measures in a school assessment program. Educational and Psychological Measurement, 61, 841–848.
Kuncel, N. R., Hezlett, S. A., & Ones, D. S. (2001). A comprehensive meta-analysis of the predictive validity of the graduate record examinations: Implications for graduate selection and performance. Psychological Bulletin, 127, 162–181.
Lance, C. E. (2008). Why assessment centers do not work the way they are supposed to. Industrial and Organizational Psychology, 1, 84–97.
Lance, C. E., Baxter, D., & Mahan, R. P. (2006). Multi-source performance measurement: A reconceptualization. In W. Bennett, C. E. Lance, & D. J. Woehr (Eds.), Performance measurement: Current perspectives and future challenges (pp. 49–76). Mahwah, NJ: Erlbaum.
Lance, C. E., Foster, M. R., Gentry, W. A., & Thoresen, J. D. (2004). Assessor cognitive processes in an operational assessment center. Journal of Applied Psychology, 89, 22–35.
Lance, C. E., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimension and exercise variance components in assessment center post-exercise dimension ratings. Journal of Applied Psychology, 89, 377–385.
Lance, C. E., Newbolt, W. H., Gatewood, R. D., Foster, M. R., French, N., & Smith, D. E. (2000). Assessment center exercise factors represent cross-situational specificity, not method bias. Human Performance, 13, 323–353.
Lance, C. E., Noble, C. L., & Scullen, S. E. (2002). A critique of the correlated trait–correlated method (CTCM) and correlated uniqueness (CU) models for multitrait-multimethod (MTMM) data. Psychological Methods, 7, 228–244.
Lance, C. E., & Sloan, C. E. (1993). Relationships between overall and life facet satisfaction: A multitrait-multimethod (MTMM) study. Social Indicators Research, 30, 1–15.
Lance, C. E., Teachout, M. S., & Donnelly, T. M. (1992). Specification of the criterion construct space: An application of hierarchical confirmatory factor analysis. Journal of Applied Psychology, 77, 437–452.
Lance, C. E., Woehr, D. J., & Meade, A. W. (2007). Case study: A Monte Carlo investigation of assessment center construct validity models. Organizational Research Methods, 10, 449–462.
Landy, F. L. (1986). Stamp collecting versus science: Validation as hypothesis testing. American Psychologist, 41, 1183–1192.
Lehmann, D. R. (1988). An alternative procedure for assessing convergent and discriminant validity. Applied Psychological Measurement, 12, 411–423.
Lievens, F. (2001). Assessors and use of assessment centre dimensions: A fresh look at a troubling issue. Journal of Organizational Behavior, 22, 203–221.
Lievens, F., & Conway, J. M. (2001). Dimension and exercise variance in assessment center scores: A large-scale evaluation of multitrait-multimethod studies. Journal of Applied Psychology, 86, 1202–1222.
Lim, B. C., & Ployhart, R. E. (2006). Assessing the convergent and discriminant validity of Goldberg’s international personality item pool: A multitrait-multimethod examination. Organizational Research Methods, 9, 29–54.
London, M., & Smither, J. W. (1995). Can multi-source feedback change perceptions of goal accomplishment, self-evaluations, and performance-related outcomes? Theory-based applications and directions for research. Personnel Psychology, 48, 803–839.
Marsh, H. W. (1989). Confirmatory factor analysis of multitrait-multimethod data: Many problems and a few solutions. Applied Psychological Measurement, 13, 335–361.
Marsh, H. W., Ellis, L. A., Parada, R. H., Richards, G., & Heubeck, B. G. (2005). A short version of the Self Description Questionnaire II: Operationalizing criteria for short-form evaluation with new applications of confirmatory factor analysis. Psychological Assessment, 17, 81–102.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons’ responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741–749.
Millsap, R. E. (1995). The statistical analysis of method effects in multitrait-multimethod data: A review. In P. E. Shrout & S. T. Fiske (Eds.), Personality research, methods, and theory: A festschrift honoring Donald W. Fiske (pp. 93–109). Hillsdale, NJ: Erlbaum.
Motl, R. W., & DiStefano, C. (2002). Longitudinal invariance of self-esteem and method effects associated with negatively worded items. Structural Equation Modeling, 9, 562–578.
Mount, M. K., Judge, T. A., Scullen, S. E., Sytsma, M. R., & Hezlett, S. A. (1998). Trait, rater, and level effects in 360-degree performance ratings. Personnel Psychology, 51, 557–576.
Pitoniak, M. J., Sireci, S. G., & Luecht, R. M. (2002). A multitrait-multimethod validity investigation of scores from a professional licensure examination. Educational and Psychological Measurement, 62, 498–516.
Podsakoff, P. M., MacKenzie, S. B., Podsakoff, N. P., & Lee, J. (2003). Common method biases in behavioral research: A critical review of the literature and recommended remedies. Journal of Applied Psychology, 88, 879–903.
Quilty, L. C., Oakman, J. M., & Risko, E. (2006). Correlates of the Rosenberg Self-Esteem Scale method effects. Structural Equation Modeling, 13, 99–117.
Rogers, R., Sewell, K. W., Ustad, K., Reinhardt, V., & Edwards, W. (1995). The referral decision scale with mentally disordered inmates: A preliminary study of convergent and discriminant validity. Law and Human Behavior, 19, 481–492.
Rosenberg, M. (1965). Society and the adolescent self-image. Princeton, NJ: Princeton University Press.
Sackett, P. R. (1987). Assessment centers and content validity: Some neglected issues. Personnel Psychology, 40, 13–25.
Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling findings. Journal of Applied Psychology, 67, 401–410.
Schermelleh-Engel, K., Keith, N., Moosbrugger, H., & Hodapp, V. (2004). Decomposing person and occasion-specific effects: An extension of latent state-trait theory to hierarchical models. Psychological Methods, 9, 198–219.
Schleicher, D. J., Day, D. V., Mayes, B. T., & Riggio, R. E. (2002). A new frame for frame-of-reference training: Enhancing the construct validity of assessment centers. Journal of Applied Psychology, 87, 735–746.
Schmitt, N. (1978). Path analysis of multitrait-multimethod matrices. Applied Psychological Measurement, 2, 157–173.
Scullen, S. E., Mount, M. K., & Judge, T. A. (2003). Evidence of the construct validity of developmental ratings of managerial performance. Journal of Applied Psychology, 88, 50–66.
Spector, P. E. (2006). Method variance in organizational research: Truth or urban legend? Organizational Research Methods, 9, 221–232.
Stevens, S. S. (1946). On the theory of scales for measurement. Science, 103, 677–680.
Tomás, J. M., & Oliver, A. (1999). Rosenberg’s self-esteem scale: Two factors or method effects. Structural Equation Modeling, 6, 84–98.
Tornow, W. W. (1993). Perceptions or reality: Is multi-perspective measurement a means or an end? Human Resource Management, 32, 221–229.
Viswesvaran, C., Ones, D. S., & Schmidt, F. L. (1996). Comparative analysis of the reliability of job performance ratings. Journal of Applied Psychology, 81, 557–574.
Wherry, R. J., & Bartlett, C. J. (1982). The control of bias in ratings: A theory of rating. Personnel Psychology, 35, 521–551.
Widaman, K. F. (1985). Hierarchically nested covariance structure models for multitrait-multimethod data. Applied Psychological Measurement, 9, 1–26.
Wothke, W. (1996). Models for multitrait-multimethod matrix analysis. In G. A. Marcoulides & R. E. Schumacker (Eds.), Advanced structural equation modeling: Issues and techniques (pp. 7–56). Mahwah, NJ: Erlbaum.
Yang, W., Lance, C. E., & Hui, H. C. (2006). Psychometric properties of the Chinese Self-Directed Search (1994 edition). Journal of Vocational Behavior, 68, 560–576.
Zedeck, S., Imparto, N., Krausz, M., & Oleno, T. (1974). Development of behaviorally anchored rating scales as a function of organizational level. Journal of Applied Psychology, 59, 249–252.
15 Chopped Liver? OK. Chopped Data? Not OK. Marcus M. Butts and Thomas W. H. Ng
Crude classifications and false generalizations are the curse of organized life. —George Bernard Shaw
Over the years, there have been numerous articles in the organizational and social sciences discussing various types of chopped data (i.e., continuous data that are partitioned into far fewer categories for data analytic purposes). These have included discussions on the practice of dichotomizing (or polytomizing) independent and dependent variables (Cohen, 1983; MacCallum, Zhang, Preacher, & Rucker, 2002), moderator variables (Bissonnette, Ickes, Bernstein, & Knowles, 1990), and the use of groups at the two extremes of a scale (Preacher, Rucker, MacCallum, & Nicewander, 2005). Although these discussions have primarily focused on the practice of using chopped data to perform analyses of data obtained from experimental designs (e.g., ANOVA), it has also been argued that applied research, where correlation and multiple regression are more popular, is not immune to this imprudent practice (Irwin & McClelland, 2001; McClelland & Judd, 1993). The general conclusion of the majority of these deliberations on the topic is that the practice of chopping or carving up continuous data for any purpose is at its worst an inexcusable flaw that clouds the interpretation of empirical results, and at its best a generally undesirable methodology that should be rightfully justified. Along with these continued admonitions against using chopped data, urban
legends about chopped data have emerged in the organizational and social sciences. For example, due to the abundance of articles criticizing the practice, there currently exists an urban legend that the occurrence of chopped data is minimal and has dissipated over time. Similarly, in the presumably rare case where researchers do use chopped data, it is assumed that tenable justifications are provided for such practices. Other urban legends have also been perpetuated, such as (a) the belief that chopped data are less of a problem in the disciplines of management and applied psychology than in social psychology and (b) the assumption that a median split of independent variables is the most widely adopted method of chopping data. We discuss each of these urban legends in greater depth below. The primary objective of this chapter is to examine the use of chopped data in the organizational and social sciences literature in order to evaluate the perpetuated urban legends and underlying myths and/or kernels of truth associated with this practice. We begin this chapter by discussing in more detail the urban legends that have been perpetuated with respect to chopping data. The veracity and false assumptions of these urban legends are then examined and discussed using illustrations and empirical evidence from a detailed literature review we conducted. Finally, we conclude by summarizing both advantages and disadvantages of using chopped data and providing practical recommendations for researchers dealing with the decision of whether to chop up continuous data.
Urban Legends Regarding Chopped Data
As we mentioned in our introduction, the act of chopping up continuous data for subsequent analytic purposes is an unwise practice. The many methodological and statistical problems with such an approach have been repeatedly voiced (e.g., Cohen, 1983; MacCallum et al., 2002). However, authors have also occasionally defended its application (e.g., Arnold, 1982; Baumeister, 1990).
Although we have no intention of attempting to resolve the philosophical disagreements between those in favor and those opposed to using chopped data, the mere existence of two dissenting camps suggests there are perpetuated misconceptions about the phenomenon itself. Therefore, we now turn to the urban legends associated with using chopped data.
Urban Legends Associated With the Occurrence of Chopped Data
Perhaps the biggest urban legend about chopped data is the assumption that everyone understands the problems underlying such an approach. A number of articles have provided mathematical and statistical evidence of the problems created by using chopped data (e.g., Maxwell & Delaney, 1993; Stone-Romero & Anderson, 1994). Thus, it should follow that if researchers are aware of the disadvantages of using chopped data and regard the practice as poor science, it should not occur with much frequency in articles published in high-quality journals. Providing results to the contrary, MacCallum et al. (2002) found that from 1998 to 2000 over 11% of the articles surveyed in social and clinical psychology journals contained dichotomization of continuous data. This evidence belies the urban legend that the use of chopped data is an anomaly of little concern to researchers today. Furthermore, it suggests that perhaps chopping data is a problem not only in social and clinical psychology but also in other disciplines within the organizational and social sciences such as management and applied psychology. Closely related to the urban legend that chopping data is not a problem in high-quality journals is the assumption that the phenomenon has become less prevalent over time. The (faulty) reasoning is that with the publication of more and more articles disparaging the use of chopped data, the occurrence of chopped data should diminish over the years. To illustrate, two of the most cited articles arguing against using chopped data are Maxwell and Delaney (1993) and MacCallum et al. (2002). According to the Social Sciences Citation Index, each article has been cited over 120 times. This fact substantiates the urban legend that since the publication of these two seminal articles, understanding of the problems associated with employing chopped data has increased.
Therefore, it follows that there is a widely held supposition that the practice of using chopped data occurs much less often now than it did before the existence of such widely circulated publications.
Many authors have pointed to researchers’ preference for using ANOVA as one of the origins of the chopped data problem (Humphreys, 1978; MacCallum et al., 2002). ANOVA becomes problematic when researchers manipulate data to fit that framework even though the data are in reality best handled as continuous variables using regression/correlation
Marcus M. Butts and Thomas W. H. Ng
methods. As suggested by others (e.g., Aiken, West, Sechrest, & Reno, 1990), an underlying cause of this problem is the somewhat limited statistical training in many graduate psychology programs, which often focuses primarily on ANOVA. However, specifically within the field of psychology, it is often acknowledged that applied psychology has some of the most strenuous statistical training and, on average, has more statistical rigor than social psychology. Furthermore, the disciplines of applied psychology and management rely heavily on field studies rather than experimental manipulation to conduct scientific research. Because ANOVA is less applicable to management and applied psychology methodological designs, it is assumed that the problem of using chopped data is more widespread in social psychology than in these other two disciplines within the organizational and social sciences. This reasoning leads to the urban legend that the practice of using chopped data is of little consequence in management and applied psychology journals relative to social psychology journals.
Urban Legends Associated With Chopped Data Techniques
Approaches to carving up continuous data have taken many forms. One is the dichotomization of a variable (or variables) in the form of a median split, whereby a sample is split at the sample median of a scale in order to define high and low groups on the variable(s) of interest. Other dichotomization strategies include mean splits and the use of cutoff points. Another type of data chopping is the polytomization of variables to form three or more groupings of data by splitting the sample into three chunks of data (low, middle, and high) using sample distributions, a tertiary split, or by collapsing scale values into more coarse classifications. In a special case of polytomization, called the extreme groups approach, the sample is split into three groups (using one of the aforementioned polytomization methods), and then the middle group is discarded.
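The chopping techniques just described can be sketched in a few lines of Python. This is a minimal illustration, not a procedure taken from any of the studies reviewed: the function names, the sample scores, and the handling of ties (scores equal to the cut point go to the low group) are all assumptions made for the example.

```python
import statistics

def median_split(scores):
    """Label each score 'high' if it exceeds the sample median, else 'low'."""
    cut = statistics.median(scores)
    return ["high" if s > cut else "low" for s in scores]

def mean_split(scores):
    """Same idea, but the cut point is the sample mean."""
    cut = statistics.mean(scores)
    return ["high" if s > cut else "low" for s in scores]

def cutoff_split(scores, cutoff):
    """Dichotomize at a researcher-chosen cutoff (assumed: >= cutoff is 'high')."""
    return ["high" if s >= cutoff else "low" for s in scores]

def tertiary_split(scores):
    """Form low, middle, and high groups from the sample tertiles."""
    lower, upper = statistics.quantiles(scores, n=3)
    return ["low" if s <= lower else "high" if s > upper else "middle"
            for s in scores]

def extreme_groups(scores):
    """Tertiary split, then discard the middle group entirely."""
    labels = tertiary_split(scores)
    return [(s, g) for s, g in zip(scores, labels) if g != "middle"]

# Hypothetical scale scores, used only to exercise the functions.
sample = [12, 15, 17, 20, 22, 25, 28, 31, 40]
print(median_split(sample))
print(tertiary_split(sample))
print(extreme_groups(sample))
```

Whatever the technique, the output is a handful of labels where there used to be a full range of scores, and the extreme groups variant also returns fewer cases than it was given.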
Although these aforementioned approaches may differ slightly, the end result is generally the same: continuous data are reduced to a coarser measurement scale (e.g., 1 versus 2) or coarser classification categories (e.g., a 5-point scale forced into 3 categories). Based on our review of the literature on the problems arising from carving up continuous data (e.g., Bissonnette et al., 1990; Cohen, 1983; Vargha, Rudas, Delaney, & Maxwell, 1996), the most
frequently discussed of the aforementioned chopped data approaches is performing a median split on independent variables. Although recently there has been a push to condemn using an extreme groups approach on continuous data (cf. Preacher et al., 2005), artificial dichotomization of independent variables via median split is still the most criticized approach (Maxwell & Delaney, 1993). Because it is the most ridiculed approach, an urban legend has been perpetuated that median split of continuous independent variables occurs most often in practice compared to other approaches. Illustrating the existence of this urban legend, Owen and Froman (2005) said, “Perhaps the most common form of data carving is the median split” (p. 497).
Urban Legends Associated With Chopped Data Justifications
Sometimes when researchers chop data, they attempt to justify their actions by explaining why using chopped data was an appropriate strategy. As an example, Sprott, Spangenberg, and Fisher (2003) defended their use of chopped data by saying, “The summated scale exhibited high internal reliability (α = .90) and statistical characteristics (M = 47.1, SD = 13.44) similar to those reported by Cialdini, Trost, and Newsom (1995). Therefore, a median split on this scale was used to assign research participants to groups” (p. 428). However, as discussed by MacCallum et al. (2002), the majority of justifications given for using chopped data are obviously faulty, or they are seemingly good explanations that in reality are based on myths and flawed reasoning. In the aforementioned quote, the sound psychometric properties of the scale are used to justify adopting a median split approach. However, the authors failed to mention that scale reliability typically decreases when continuous variables are dichotomized (Cohen, 1983). Thus, their use of a median split approach does not appear justified based solely on the high reliability of the scale.
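The cost of a median split can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions rather than a reanalysis of any study cited here: it draws hypothetical bivariate normal data with a true correlation of .50 (the sample size, seed, and helper names are arbitrary choices), dichotomizes the predictor at its median, and compares the two observed correlations. For bivariate normal data the correlation is expected to shrink to roughly 80% of its original size.

```python
import math
import random
import statistics

def pearson(x, y):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def simulate(rho=0.5, n=20000, seed=1):
    """Return (r with continuous X, r with median-split X) for simulated data."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Y is built so that its population correlation with X equals rho.
    y = [rho * a + math.sqrt(1 - rho ** 2) * rng.gauss(0, 1) for a in x]
    cut = statistics.median(x)
    x_split = [1.0 if a > cut else 0.0 for a in x]
    return pearson(x, y), pearson(x_split, y)

r_continuous, r_split = simulate()
print(f"continuous X:   r = {r_continuous:.3f}")
print(f"median-split X: r = {r_split:.3f}")
print(f"ratio:          {r_split / r_continuous:.3f}")
```

A scale with excellent reliability therefore offers no protection: the information is discarded at the moment of the split, regardless of how well the original scores were measured.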
Other justifications for using chopped data have included (a) citing previous studies adopting the same chopped data approach (e.g., Langens & Stucke, 2005), (b) explaining that chopped data are necessary for the chosen statistical analyses (e.g., Von Hippel, Von Hippel, Conway, Preacher, Schooler, & Radvansky, 2005), (c) violation of normality in the data (e.g., Tazelaar, Van Lange, & Ouwerkerk, 2004), (d) ease of visual interpretation (e.g., Stanton, Kirk, Cameron, & Danoff-Burg, 2000), and (e) conceptual appropriateness for using
chopped data (e.g., Eisenhardt & Schoonhoven, 1990). Any of these justifications for using chopped data could be viewed as urban legends, and it remains to be seen which justifications are poor excuses based on myth and which justifications are legitimately warranted.
To summarize, in this chapter we evaluate the following urban legends about the actual practice of using chopped data: (a) Chopping data is a low-base-rate research practice in high-quality publications, (b) the frequency of published chopped data has decreased over time, (c) chopped data is less of a problem in disciplines such as management and applied psychology than in social psychology, and (d) median split of independent variables is the most prevalent means of chopping data. Furthermore, we make an attempt to extricate myth from truth associated with a variety of common rationales (i.e., urban legends) given for using chopped data by exposing faulty justifications and praising legitimate defenses found in the literature. To evaluate these urban legends, we performed a literature review of targeted high-quality journals over a 15-year time span. We describe details of our literature review in the following section.
Literature Review
We selected six journals representing the fields of management, applied psychology, and social psychology by choosing two journals per field that primarily publish empirical work and were ranked in the top four according to their Journal Citation Reports (2004) impact ratings. The selected journals by field included Administrative Science Quarterly and Academy of Management Journal (management); Personnel Psychology and Journal of Applied Psychology (applied psychology); and Journal of Personality and Journal of Personality and Social Psychology (social psychology). For each journal, we performed a Boolean search on the full text of articles published in 1990, 1995, 2000, and 2005 using the PsycARTICLES and Business Source Premier databases.
The time span included for the literature review was chosen to assess publication trends over time (i.e., 15 years) and also evaluate activity since the appearance of Maxwell and Delaney’s (1993) and MacCallum et al.’s (2002) seminal articles
on dichotomization of variables. The Boolean search included a list of common terms used to describe the practice of chopping up continuous data for analytical purposes (e.g., median split, dichotomiz*, tertiary split). Although we admit a Boolean search may be viewed as an incomplete literature search methodology, our purpose was to provide an illustrative, rather than exhaustive, review of empirical articles within the organizational and social sciences. That is, our aim was to examine the occurrence of data chopping across time and disciplines while collecting exemplary illustrations of good/bad justifications for data chopping.
We coded the following aspects of each article that employed chopped data: type of variable(s) chopped, method of chopping, and justification provided. Variable(s) chopped indicated whether the continuous variable that was “chopped” was a control variable, an independent variable, a dependent variable, or both a control or independent variable and a dependent variable. This information was coded at the article level; thus multiple variable splits were not accounted for in our tabulation totals. Method of chopping was recorded as median split, mean split, cutoff point, tertiary split, or extreme groups approach. Justification provided for chopping was simply coded as “yes” or “no” based on whether the authors gave any rationale for chopping their data. For those articles that provided a justification, each justification was qualitatively evaluated by agreement of both authors to determine its merit as a legitimate or faulty rationale for chopping data.
Footnote. In addition to the years 1990, 1995, 2000, and 2005, the authors also performed a separate literature review (using the same journals listed) for the years 2003–2005. Because the publication trends were very similar from 2003 to 2005, we felt it would be more informative to just include results in 5-year time frames. However, we sometimes use examples from our literature search results from 2003 and 2004 for illustrative purposes.
Footnote. The complete list of Boolean search terms is available from the first author upon request.
Footnote. A complete list of journal articles that were coded as chopped data articles is available from the first author upon request.
Chopped Data Through the Years
The results of our literature search are provided in Table 15.1 and Table 15.2. These results, along with examples given to illustrate
Table 15.1 Journal Results From Literature Search for Each Discipline by Year
Discipline | Journal | 1990 No. of Articles | 1990 Chopped Data Articles | 1995 No. of Articles | 1995 Chopped Data Articles | 2000 No. of Articles | 2000 Chopped Data Articles | 2005 No. of Articles | 2005 Chopped Data Articles
Management | ASQ | 13 | 3 (23%) | 15 | 1 (7%) | 19 | 0 (0%) | 11 | 0 (0%)
Management | AMJ | 38 | 1 (3%) | 69 | 3 (4%) | 68 | 6 (9%) | 55 | 1 (2%)
Applied Psychology | JAP | 78 | 3 (4%) | 50 | 3 (6%) | 71 | 3 (4%) | 71 | 2 (3%)
Applied Psychology | PPSYCH | 26 | 1 (4%) | 21 | 1 (5%) | 19 | 1 (5%) | 16 | 0 (0%)
Social Psychology | JPSP | 190 | 19 (10%) | 162 | 17 (10%) | 139 | 16 (12%) | 115 | 13 (11%)
Social Psychology | JP | 19 | 0 (0%) | 23 | 0 (0%) | 30 | 0 (0%) | 45 | 6 (13%)
Total | | 364 | 27 (7%) | 340 | 25 (7%) | 346 | 26 (8%) | 313 | 22 (7%)
Note. ASQ = Administrative Science Quarterly; AMJ = Academy of Management Journal; JAP = Journal of Applied Psychology; PPSYCH = Personnel Psychology; JPSP = Journal of Personality and Social Psychology; JP = Journal of Personality. Numbers in parentheses refer to percentage of articles that used chopped data.
Table 15.2 Summary Statistics for Literature Review of Articles Using Chopped Data
 | 1990 | 1995 | 2000 | 2005 | Total
No. of chopped data articles | 27 | 25 | 26 | 22 | 100
Variable split
Control variable | 0 (0%) | 1 (4%) | 0 (0%) | 1 (5%) | 2 (2%)
Independent variable (IV) | 27 (100%) | 22 (88%) | 24 (92%) | 19 (85%) | 92 (92%)
Dependent variable (DV) | 0 (0%) | 2 (8%) | 2 (8%) | 1 (5%) | 5 (5%)
Control/IV and DV | 0 (0%) | 0 (0%) | 0 (0%) | 1 (5%) | 1 (1%)
Type of split
Median split | 22 (81%) | 13 (52%) | 19 (73%) | 18 (82%) | 72 (72%)
Mean split | 1 (4%) | 0 (0%) | 0 (0%) | 0 (0%) | 1 (1%)
Cutoff point | 2 (7%) | 4 (16%) | 2 (8%) | 3 (13%) | 11 (11%)
Tertiary split | 1 (4%) | 4 (16%) | 3 (11%) | 1 (5%) | 9 (9%)
Extreme groups approach | 1 (4%) | 4 (16%) | 2 (8%) | 0 (0%) | 7 (7%)
Split justification
Yes | 3 (11%) | 7 (28%) | 13 (50%) | 11 (50%) | 34 (34%)
No | 24 (89%) | 18 (72%) | 13 (50%) | 11 (50%) | 66 (66%)
our findings, are discussed in the context of the urban legends highlighted in the previous section.
Prevalence of Chopped Data
The first assumption regarding chopped data we examined was the extent to which the phenomenon is of little concern in high-quality publications. As shown in Table 15.1, the use of chopped data occurred in approximately 7% of all journal articles examined in 1990, 1995, 2000, and 2005. Because this is a relatively low percentage, there seems to be some truth to the urban legend that chopped data is of little concern in the organizational and social sciences. However, such an inference has at least one caveat. Namely, although we only included quantitative articles in our total number of articles, we did not exclude quantitative articles that employed naturally dichotomized/polytomized data from the outset or were completely experimental in nature (i.e., the researcher had no opportunity to dichotomize continuous variables). Therefore, if only quantitative articles that collected continuous data were considered, the percentage of chopped data articles would likely increase. Even with these qualifications, it seems unlikely that our results would go up more than a few percentage points. Supporting this notion, MacCallum et al. (2002) found that the practice of using chopped data occurred only 11.5% of the time in their review of social and clinical psychology journals, and that was after removing journals such as Journal of Applied Psychology (which we included) because they contained relatively few uses of chopped data. Although the percentage of chopped data articles found in our literature review may seem relatively low, in order to fully understand the gravity of our results one must consider the problematic nature of chopped data.
Many researchers have contended that carving up continuous variables is never justified (e.g., Cohen, 1983), and at the very least, they are firmly opposed to the practice (e.g., McClelland, 2003). Although it is certainly less deplorable, some may view using chopped data as egregious an act as committing academic plagiarism. As such, if plagiarism was found in 7% of top-tier journals, it would likely be considered a problem needing immediate attention. Supporting this idea, Brief (2004) was dismayed at the occurrence of just six cases of potential plagiarism during his 3-year term as editor of Academy of Management Review. Although admittedly it does not have the same degree of negative stigma as plagiarism, the point is that even a small occurrence of chopped data (i.e., 7%) is a concern when one considers the extensive literature devoted to severely criticizing this practice.
To summarize, we found some empirical evidence supporting the urban legend that chopped data is a low-base-rate occurrence in high-quality journals. However, we made the argument that even a small percentage of articles that employ chopped data is a problem because the phenomenon is viewed so negatively. Therefore, our results could be viewed as evidence suggesting the urban legend that chopped data is of little concern in the organizational and social sciences is based on myth.
Footnote. The total number of articles per journal represents articles with at least one quantitative data set or study. Thus, we excluded qualitative studies, meta-analyses, methodological studies, and literature reviews from our total number of articles per journal. However, we did not distinguish between studies adopting experimental manipulation or utilizing natural dichotomies and those studies that originally employed continuous data.
The Occurrence of Chopped Data Over Time
The next urban legend we evaluated was the extent to which the practice of using chopped data has declined over the years. Our results (see Table 15.1) show that across the journals examined, chopped data occurred in approximately 7–8% of the articles for each time period. Furthermore, with a few exceptions in different years, the rate of published articles using chopped data stayed relatively consistent over time for each journal. These results firmly suggest there is little truth to the urban legend that use of chopped data has become less of a problem over time.
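The stability of the overall rate can be checked directly from the Total row of Table 15.1. The short sketch below simply recomputes the yearly percentages from the published counts; the variable names are ours and nothing beyond the table values is assumed.

```python
# Counts transcribed from the Total row of Table 15.1:
# year -> (chopped data articles, total quantitative articles)
totals_by_year = {
    1990: (27, 364),
    1995: (25, 340),
    2000: (26, 346),
    2005: (22, 313),
}

# Percentage of articles using chopped data in each sampled year.
rates_by_year = {
    year: 100 * chopped / total
    for year, (chopped, total) in totals_by_year.items()
}

for year, rate in rates_by_year.items():
    print(f"{year}: {rate:.1f}% of articles used chopped data")

spread = max(rates_by_year.values()) - min(rates_by_year.values())
print(f"Largest difference between any two years: {spread:.1f} percentage points")
```

The yearly rates differ by well under one percentage point, which is the pattern the text describes as relatively consistent over time.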
On the surface, it seems that frequently cited articles denouncing the practice (MacCallum et al., 2002; Maxwell & Delaney, 1993) have helped little to discourage researchers from chopping up continuous data. However, as we discuss in more detail with the justifications for using chopped data, these seminal articles may have helped to make researchers more aware of the need to supplement ANOVA results with multiple regression results. Illustrating this point, in 2000 and 2005 (versus 1990 and 1995), authors were twice as likely (8 articles vs. 4 articles,
respectively) to supplement their ANOVA results using dichotomization of originally continuous variables with identical analyses applying multiple regression using continuous variables. Thus, although it is a myth that the practice of using chopped data has declined over time, the truth is that the practice of relying on chopped data alone has decreased. One possible explanation for this finding is that reviewers are becoming more critical of studies using chopped data and are requiring authors to supplement their findings with analyses using continuous data.
Chopped Data Across Disciplines
As shown in Table 15.1, we also tabulated our results by discipline (i.e., management, applied psychology, and social psychology) in order to ascertain the degree to which there are differences across the three areas in publication rate of articles that use chopped data. Our results provide support for the belief that chopping up continuous data is less of a problem in the fields of management and applied psychology relative to social psychology. Looking at the year 2005 (see Table 15.1), chopped data articles appeared in only 0–3% of management and applied psychology publications, whereas in social psychology 11–13% of publications used a chopped data approach. Data in Table 15.1 show that top-tier journals in applied psychology are least likely to publish chopped data articles, followed closely by management; social psychology published chopped data far more frequently. In general, carving up continuous variables occurred most often using an ANOVA (or similar) factorial design, no matter the journal outlet for publication. Therefore, it is likely that the frequent use of ANOVA (or the like) for statistical analyses in social psychology is the impetus of the urban legend that chopped data is less of a problem in the disciplines of management and applied psychology, where an ANOVA framework is seldom utilized.
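The 2005 comparison across disciplines can likewise be reproduced from Table 15.1. In the sketch below the journal counts come from the table, whereas pooling the two journals within each discipline is our own simplification for illustration; the table itself reports percentages only journal by journal.

```python
# 2005 counts from Table 15.1: journal -> (chopped data articles, total articles)
counts_2005 = {
    "management": {"ASQ": (0, 11), "AMJ": (1, 55)},
    "applied psychology": {"JAP": (2, 71), "PPSYCH": (0, 16)},
    "social psychology": {"JPSP": (13, 115), "JP": (6, 45)},
}

def pooled_rate(journals):
    """Percentage of chopped data articles across all journals in a discipline."""
    chopped = sum(c for c, _ in journals.values())
    total = sum(n for _, n in journals.values())
    return 100 * chopped / total

rates_2005 = {name: pooled_rate(journals) for name, journals in counts_2005.items()}

for name, rate in rates_2005.items():
    print(f"{name}: {rate:.1f}%")
```

Pooled this way, both management and applied psychology stay below 3%, while social psychology sits near 12%, consistent with the journal-level ranges quoted above.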
Types of Chopped Data Approaches
Owen and Froman (2005) speculated that a median split of continuous independent variables is the most ubiquitous form of data chopping. As shown in Table 15.2, our results provide strong empirical support for this belief. One or more independent variables were split 92% of the time a data-chopping approach was adopted, and a median split was used in 72% of total cases. There are a couple of plausible reasons why our results showed that independent variables were split most often. First, two of the most adopted statistical procedures, ANOVA and multiple regression, more easily lend themselves to dichotomization of independent variables (including moderator variables) rather than dependent variables. Second, in-depth information on control variables is seldom provided in journal publications. Typically, authors just describe what variables were used as controls but do not state if the variables were artificially dichotomized.
Although median splits were most common, the use of cutoff points (11%) was in fact higher than the extreme groups approach (7%). This finding is important because there is actually a more extensive body of work examining the problems of dichotomizing using extreme groups (cf. Preacher et al., 2005) than there is investigating troubles caused by dichotomizing using cutoff points (for an exception see Royston, Altman, & Sauerbrei, 2006). This perhaps occurs because, as we found in our results, one of the limiting characteristics of cutoff points is that they are often arbitrarily determined by the author(s) rather than in accord with validation studies. As an example, Aulakh and Kotabe (2000) described their dichotomization cutoff procedure by saying, “For a firm to be categorized as having either a developed country or a developing country focus, 75 percent or more of its sales had to be in one of the groups” (p. 352). It may be difficult to make broad generalizations about the calamities of dichotomization via the cutoff point approach because cutoff points are often inconsistently determined depending on the variable of interest as well as the researchers’ preference (Royston et al., 2006).
To summarize, we found substantial support for the urban legend that a median split of independent variables is the most popular form of using chopped data. However, our results also revealed that the attention the literature devotes to the extreme groups approach may be relatively overstated given its low occurrence compared to less frequently discussed types of data-chopping approaches (i.e., cutoff points and tertiary splits). Thus, both some truth and some myth are being perpetuated about types of chopped data approaches.
Evaluating Justifications for Using Chopped Data
One of our intentions in this chapter was to survey the organizational and social sciences literature to examine the legitimacy of justifications given when adopting a chopped data approach. To achieve this goal, we inspected all of the justifications given for using chopped data provided by our literature search. Drawing on those justifications, we highlighted often cited justifications that were either overwhelmingly faulty (myth) or legitimate (truth) rationales for using such an approach. We also made an effort to focus on justifications that are primarily of interest to researchers in the organizational and social sciences and have not been extensively discussed elsewhere (for an excellent discussion of other justifications for dichotomization, see MacCallum et al., 2002).
Before addressing the specific justifications researchers gave for using chopped data, we wanted to provide some general empirical findings from our literature search. As shown in Table 15.2, only 34% of articles that used chopped data provided any type of justification for their approach. Therefore, similar to MacCallum et al.’s (2002) results that found justification in just 20% of cases, chopped data appears to most often occur without any type of explicit justification. What is perhaps more interesting are findings from our literature review suggesting that justifications for using chopped data have increased over the years. As shown in Table 15.2, only 11% of articles provided justification in 1990, but 50% of articles gave some type of justification in both 2000 and 2005. This finding leads us to believe that although it is primarily a myth that researchers no longer chop up continuous data, perhaps the truth is that they are getting better at providing a defense for their actions. We now look at the specific justifications given for using chopped data to elaborate on the myths and/or truths associated with each one.
Insufficient or Faulty Justifications (Myths)
Precedence in the Literature. We found that one of the most often cited reasons for adopting a chopped data approach was that there was precedence for doing so. For example, Langens and Stucke (2005) justified their use of a median split procedure on scores for activity
inhibition by stating, “This procedure followed the general approach of dealing with scores of activity inhibition, which have frequently been dichotomized (e.g., McClelland 1979, 1985; McClelland, Floor, Davidson, & Saron, 1980)” (p. 55). We consider this defense not much better than none at all. The literature on the problems of using chopped data has grown sizably over the past couple of decades. It is understandable that before this literature appeared, many researchers used chopped data because they were unaware of the inherent statistical problems associated with data chopping. However, this is no longer a legitimate excuse. Furthermore, as has been pointed out with regard to other statistical urban legends (e.g., Lance, Butts, & Michels, 2006), citing previous research as sole justification often has a dangerous snowball effect that causes myths to become widely held beliefs.
Most Appropriate Form of Statistical Analysis. At times, authors would defend their use of chopped data by arguing it was the most appropriate way to accommodate the desired statistical analyses. For example, in order to conduct a multigroup structural equation analysis, Eisenberg, Fabes, Guthrie, and Reiser (2000) explained, “High and low negative emotionality groups were constructed on the basis of a median split on that variable at T1” (p. 150). Von Hippel et al. (2005) also exemplified this type of justification with regard to their use of a median split approach and subsequent correlation analysis to test for moderation by saying, “Despite the fact that moderated regression is commonly used to test for such differences between relationships, this is a misuse of the technique, as moderated regression is properly suited for assessment of interactions of form rather than interactions of degree (Arnold, 1982)” (p. 28).
What is especially interesting about this illustration is its use of Arnold (1982) to defend against applying moderated regression while overlooking subsequent articles by Stone and Hollenbeck (1984, 1989) that refuted Arnold (1982) and provided mathematical and empirical evidence that only moderated regression is needed to test moderating effects, no matter if they are of the degree or form variety.
Violation of Normality. The last major group of faulty reasons for chopping up data that we came across centered on the necessity caused by some type of nonnormality in the data such as a bimodal distribution or extreme skewness. In these instances, the researcher(s) would typically perform a median split on the variable in question in
an attempt to allocate an equal number of cases to each category. As an example, Tazelaar et al. (2004) defended dichotomizing by saying, “The primary reason was that the data for trust revealed a rather substantial violation of normality in the distribution, with the scores for trust exhibiting a bimodal distribution in which two clusters of peak values were observed” (p. 851). Although on the surface this reasoning may seem sound, and it was even cited by MacCallum et al. (2002) as a rare occurrence when dichotomization is justified, recent work by Irwin and McClelland (2003) provided mathematical evidence to dispel that false belief. Specifically, the authors showed that dichotomization of variables using a median split never improved the expected squared correlation between variables, no matter the shape or skewness of the data.
Legitimate Justifications (Truths)
Solely Illustrative Purposes. One of the most common defenses for using chopped data we found was that it provides for ease of visual or conceptual understanding. In these cases, regression was usually performed in addition to the analyses using chopped data, and those results were typically provided in a footnote with an explanation such as the one given by Leippe, Eisenstadt, Rauch, and Seib (2004): “An alternative approach to analyzing the effects of…would be to treat NC as a continuous variable in a multiple regression analysis. Doing so with the present data does not change the results and their interpretation” (p. 534). We wish to point out that this justification for using chopped data was not given in lieu of using continuous variables; rather, chopped data results were supplemented with results utilizing continuous variables (with similar/same effects).
Although we believe this justification is an acceptable rationale accompanying chopped data, MacCallum et al.’s (2002) comments on the practice were not as favorable: “No real interests are served if researchers use methods known to be inappropriate and problematic in the belief that the target audience will better understand analyses and results” (p. 33). Therefore, the degree of truth or myth underlying this justification may differ depending on whose opinion is requested.
Conceptual Appropriateness. A rationale sometimes given by researchers to justify using chopped data was that the variable of interest is most appropriately viewed as a dichotomy. MacCallum et al. (2002) found the same rationale in their literature search; however, they condemned the justification except in the rare case that dichotomies are determined using taxometric methods (e.g., cluster analysis, latent class analysis) based on the specific study’s data. Although we fundamentally agree with MacCallum and colleagues’ argument that conceptual dichotomization must be empirically substantiated, we also support the notion that chopped data is occasionally justifiable by other empirical means (provided there is sound theory to support the split). Providing a superlative example from the literature, Eisenhardt and Schoonhoven (1990) gave the following explanation for dichotomizing:
Maximum variance was not as important as was conceptual consistency with the hypotheses. Categorical variables were appropriate, for several theoretical reasons…the stages were conceptualized in terms of thresholds, not continuous variables...We also verified the appropriateness of the market-stage operationalization empirically…The categorical operationalization of market stage provides a better empirical fit with the data than these continuous variables, as shown below. (pp. 513–514)
In summary, authors today are more likely to provide justification for using chopped data than they were in the past. However, the majority of justifications found in our literature review were insufficient or inappropriate arguments for using chopped data. One explanation for this finding is that there are very few legitimate reasons for using chopped data. Although we viewed it as a legitimate reason, even using chopped data for illustrative purposes is questionable unless continuous data are also analyzed to confirm results. Thus, except for the rare case where dichotomization is conceptually appropriate, using chopped data almost always occurs without legitimate justification.
Advantages of, Disadvantages of, and Recommendations for Using Chopped Data
Up to this point, we have examined urban legends about chopped data and justifications for its use. In doing so, we have attempted to extricate myth from truth regarding occurrence of the phenomenon. However, we have done little to address why using chopped
data is such a bad strategy to employ. There is a well-established literature devoted to demonstrating disadvantages and debunking perceived advantages of chopping up continuous data (e.g., Bissonnette et al., 1990; Cohen, 1983; Maxwell & Delaney, 1993). Although our purpose is not to discuss the various intricacies of chopped data approaches, in the following section we summarize some of the purported advantages (myths) and actual disadvantages (truths) of using chopped data in order to stymie the continued proliferation of urban legends regarding the practice. We then conclude with some practical recommendations for researchers faced with the decision to chop up continuous data.
(Perceived) Advantages of Chopping Data
The main, and perhaps only, advantage of using chopped data is that it simplifies presentation of findings and produces meaningful results that are easily comprehended by a wide audience (Farrington & Loeber, 2000). As an example, the reported finding from Vecchio (1990) that leader IQ has a stronger relationship with group performance for directive leaders (M + 1 SD) than for nondirective leaders (M – 1 SD) is easier to comprehend than trying to understand that level of “leader directiveness” moderated the relationship between leader IQ and group performance. Fortunately, the majority of academic audiences targeted by journals in the organizational and social sciences are well trained enough in statistics to understand either presentation of findings (i.e., chopped vs. continuous data). Furthermore, this advantage alone does not outweigh the methodological disadvantages caused by using chopped data.
Disadvantages of Chopping Data
The primary problem with using chopped data is the loss of information resulting from classifying subjects/cases that fall near the categorization point (e.g., median, mean, cutoff point). Borrowing from an illustration provided by MacCallum et al.
(2002), Figure 15.1 shows how subjects might be classified using a median split of a single continuous variable with four subjects (A, B, C, and D) represented along the x-axis. As shown in Figure 15.1, although subjects B and C may have only a 1- or 2-point difference in their scale scores, after dichotomization they are considered “different” from one another (i.e., “high” versus “low”). Conversely, although there are larger differences between A and B than there are between B and C, these two former subjects are now considered identical for measurement purposes. In essence, dichotomization has resulted in loss of information between subjects within each group while artificially exaggerating differences between some subjects across groups, thereby obscuring distinctions that were previously discernible. Furthermore, due to measurement error of observed scores, subjects who scored close to the median (i.e., B and C) may have been misclassified in the low or high group.

Figure 15.1 Dichotomization of a continuous variable using a median split approach. Panel (a): subjects A, B, C, and D on continuous X, from low to high. Panel (b): the same subjects on dichotomized X, low versus high.

Analogous problems can also be illustrated with the extreme groups approach by using six subjects along the x-axis, where the “middle” group is discarded (see Figure 15.2). However, in this approach information is literally lost for subjects who scored close to the cutoff point and are designated as the middle group. This discarded information also makes the extreme groups approach susceptible to reduced ecological validity because individuals representing the middle group, which is often the majority, are omitted. As a result, observed effects may be above or below average effects found in the general population.

Figure 15.2 Dichotomization of a continuous variable using an extreme groups approach. Panel (a): subjects A through F on continuous X, from low to high. Panel (b): subjects A, B, E, and F on dichotomized X, low versus high.
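The misclassification argument above can be made concrete with a small simulation. The sketch below is illustrative only and is not taken from the chapter: it assumes normally distributed true scores, an arbitrarily chosen reliability of .80, and hypothetical names (`simulate_median_split`, `near_cut`, `far_from_cut`). It compares each case's group under a median split of the true scores with its group under a median split of the error-laden observed scores.

```python
import random
import statistics


def simulate_median_split(n=20000, reliability=0.8, seed=1):
    """Return (overall, near_cut, far_from_cut) misclassification rates.

    True scores are standard normal; the error SD is chosen so that
    var(true) / var(observed) equals the assumed reliability.
    """
    rng = random.Random(seed)
    err_sd = ((1 - reliability) / reliability) ** 0.5
    true = [rng.gauss(0, 1) for _ in range(n)]
    observed = [t + rng.gauss(0, err_sd) for t in true]

    true_cut = statistics.median(true)
    obs_cut = statistics.median(observed)

    # A case is misclassified when its observed group differs from its true group.
    wrong = [(t > true_cut) != (o > obs_cut) for t, o in zip(true, observed)]
    near = [w for w, t in zip(wrong, true) if abs(t - true_cut) < 0.25]
    far = [w for w, t in zip(wrong, true) if abs(t - true_cut) > 1.0]

    def rate(flags):
        return sum(flags) / len(flags)

    return rate(wrong), rate(near), rate(far)


overall, near_cut, far_from_cut = simulate_median_split()
print(f"overall: {overall:.3f}  near cut: {near_cut:.3f}  far from cut: {far_from_cut:.3f}")
```

Under these assumptions, misclassification should be concentrated among cases whose true scores lie close to the split point, which is the situation of subjects B and C in Figure 15.1. Changing `reliability` shows how sensitive the grouping is to measurement error.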
Chopped Liver? OK. Chopped Data? Not OK.
Another problem that arises from chopping up continuous data is the lack of methodological consistency caused by the approach. Often the exact point used to dichotomize subjects is sample-specific, based on the distribution of scores. Because the distribution of scores differs across samples, the exact point used to split continuous variable(s) likely differs across samples (Kowalski, 1995; Sedney, 1981). This sample-specific dichotomization point thereby lessens the ability to replicate similar results in other samples. Similarly, the idiosyncratic nature by which researchers choose where to draw lines between groups also contributes to methodological inconsistency across samples.

Another argument against chopping up continuous variables is that it alters the power of statistical tests (Aiken & West, 1991; Cohen, 1983). Previous studies have found that performing a median split on one variable resulted in an effect size (i.e., squared point biserial correlation, or variance accounted for) between two variables that was on average .63–.64 of what it would have been if both variables were kept continuous (Bissonnette et al., 1990; Cohen, 1983; Cohen & Cohen, 1983; Humphreys & Fleishman, 1974). Furthermore, the point biserial correlation continued to decrease if dichotomization of the independent variable occurred at points other than the median (e.g., 60:30 split). Similar problems have also been empirically demonstrated using a cutoff point dichotomization approach (e.g., Faraggi & Simon, 1996). Adding to the problematic nature of chopped data, artificial dichotomization of multiple independent variables may also lead to cases of spurious statistical significance and overestimation/underestimation of effect sizes (Maxwell & Delaney, 1993; Vargha et al., 1996). Recently, these negative consequences have also been empirically supported for the extreme groups approach (Preacher et al., 2005).
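The attenuation described above can be checked with a short simulation. This is a rough sketch under stated assumptions, not a reproduction of any cited study: the data are bivariate normal, the population correlation of .50 is arbitrary, and the names (`attenuation_ratio`, `low_share`) are hypothetical. For a median split of one normally distributed variable, theory gives a correlation ratio of sqrt(2/π), roughly .80, which corresponds to roughly .64 in variance-accounted-for terms; splits away from the median should shrink the ratio further.

```python
import random
import statistics


def pearson(x, y):
    """Plain Pearson correlation, written out to avoid version-specific helpers."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def attenuation_ratio(rho=0.5, n=50_000, low_share=0.5, seed=7):
    """Ratio of the split-based correlation to the continuous correlation.

    low_share is the proportion of cases assigned to the "low" group;
    0.5 corresponds to a median split.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [rho * a + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for a in x]

    # Sample-specific cut point: the empirical quantile of x.
    cut = sorted(x)[int(low_share * n)]
    x_split = [1.0 if a >= cut else 0.0 for a in x]

    return pearson(x_split, y) / pearson(x, y)


ratio_median = attenuation_ratio(low_share=0.5)
ratio_uneven = attenuation_ratio(low_share=0.8)
print(f"median split: r ratio = {ratio_median:.3f}, r^2 ratio = {ratio_median ** 2:.3f}")
print(f"80:20 split:  r ratio = {ratio_uneven:.3f}, r^2 ratio = {ratio_uneven ** 2:.3f}")
```

Because the cut point is computed from the sample itself, rerunning with a different `seed` also illustrates the sample-specific nature of the split point noted earlier in this section.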
Thus, research findings have shown that chopped data approaches run the full gamut of statistical problems, including not only underestimation of effect sizes but also overestimation of effect sizes.

The use of chopped data is also undesirable from a psychometric standpoint, primarily because it lowers measurement reliability. Cohen (1983) mentioned this drawback of using chopped data, and MacCallum et al. (2002) provided empirical evidence using simulated data with a median split approach. Preacher et al. (2005) came to the same conclusion regarding scale reliability under an extreme groups approach. Interestingly, although this disadvantage has been clearly articulated in the methodological literature (and to our knowledge
uncontested), there is evidence to suggest that researchers have overlooked or misinterpreted these findings. As an example, Langens and Stucke (2005) justified adopting a median split approach by saying, “To account for low internal consistency, we employed dichotomous activity inhibition as a predictor of thought and emotion in the present studies” (pp. 72–73). However, as we have just explained, dichotomizing actually decreases rather than increases reliability. Therefore, it is evident that this disadvantage itself has sometimes been ignored by researchers in favor of a more pleasing, but mythical, explanation to support the act of chopping up continuous data.

Recommendations When Faced With Chopping Data

The use of chopped data causes many methodological problems and has few legitimate justifications. Thus, when tempted to use chopped data, what should researchers do? We offer our recommendations below.

One of the most beneficial ways researchers can shield against the tendency to use chopped data is to devote a significant amount of time to planning their research design and accompanying statistical methods. For example, instead of artificially carving up a continuous dependent variable to parallel an experimentally manipulated (i.e., dichotomous) independent variable in order to use ANOVA, a researcher would be better served if he or she first considers the advantages and disadvantages of a research design that uses continuous data for all variables of interest (i.e., no experimental manipulation). Just as others have suggested that researchers should consider the limitations of their statistical techniques and computer programs (e.g., Milligan & McFillen, 1984), we recommend a thoroughly prepared research plan that maximizes the use of continuous data for statistical purposes whenever possible.

It is also our recommendation that if researchers must dichotomize, it is imperative that they provide a supported rationale.
As our literature review results have shown, few studies provide justification for using chopped data, and all too often the approach is adopted just because it was used in the past. Part of the onus for seeing that this recommendation comes to fruition lies upon journal editors and reviewers as gatekeepers of the publication process. However, authors can also do their part by making it a necessity to include legitimate justification when continuous data are artificially dichotomized for
statistical purposes. Taking this recommendation one step further, the most commendable of articles would not only provide rationale when using chopped data but also substantiate results using their previously carved-up continuous variables, provided as a supplement in the text of the article. It is only by providing readers with such thorough explanation and alternative results that the detrimental inclination to arbitrarily use chopped data will be thwarted in future research endeavors and publications.

Conclusion

As we have demonstrated through our literature review of urban legends on the practice, the use of chopped data still occurs in the organizational and social sciences. Furthermore, in most cases the practice is completely unjustified. Thus, the topic warrants continued attention. Researchers must be vigilant to guard against the temptation to chop up continuous data. To this end, we should follow the sage advice of McClelland (2003): “So, resist the temptation to split. Leave your continuous variables continuous” (p. 2). It is our hope that by demystifying the urban legends and justifications associated with chopped data, this chapter will help researchers in the organizational and social sciences to deal appropriately with instances where the decision to chop data arises.

References

Aiken, L. S., & West, S. G. (1991). Multiple regression: Testing and interpreting interactions. Newbury Park, CA: Sage.
Aiken, L. S., West, S. G., Sechrest, L., & Reno, R. R. (1990). Graduate training in statistics, methodology, and measurement in psychology. American Psychologist, 45, 721–734.
Arnold, H. J. (1982). Moderator variables: A clarification of conceptual, analytic, and psychometric issues. Organizational Behavior and Human Performance, 29, 143–174.
Aulakh, P. S., & Kotabe, M. (2000). Export strategies and performance of firms from emerging economies: Evidence from Brazil, Chile, and Mexico. Academy of Management Journal, 43, 342–361.
Baumeister, R. F. (1990). Item variances and median splits: Some encouraging and reassuring findings. Journal of Personality, 58, 589–594.
Bissonnette, V., Ickes, W., Bernstein, I., & Knowles, E. (1990). Personality moderating variables: A warning about statistical artifact and a comparison of analytical techniques. Journal of Personality, 58, 567–587.
Brief, A. P. (2004). Editor’s comments: What I don’t like about my job. Academy of Management Review, 29, 339–340.
Cialdini, R. B., Trost, M. R., & Newsom, J. T. (1995). Preference for consistency: The development of a valid measure and the discovery of surprising behavioral implications. Journal of Personality and Social Psychology, 69, 328–338.
Cohen, J. (1983). The cost of dichotomization. Applied Psychological Measurement, 7, 249–253.
Cohen, J., & Cohen, P. (1983). Applied multiple regression/correlation analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum Associates.
Eisenberg, N., Fabes, R. A., Guthrie, I. K., & Reiser, M. (2000). Dispositional emotionality and regulation: Their role in predicting quality of social functioning. Journal of Personality and Social Psychology, 78, 136–157.
Eisenhardt, K. M., & Schoonhoven, C. B. (1990). Organizational growth: Linking founding team, strategy, environment, and growth among U.S. semiconductor ventures, 1978–1988. Administrative Science Quarterly, 35, 504–529.
Faraggi, D., & Simon, R. (1996). A simulation study of cross-validation for selecting an optimal cutpoint in univariable survival analysis. Statistics in Medicine, 15, 2203–2213.
Farrington, D. P., & Loeber, R. (2000). Some benefits of dichotomization in psychiatric and criminological research. Criminal Behaviour and Mental Health, 10, 100–122.
Humphreys, L. G. (1978). Research on individual differences requires correlational analysis, not ANOVA. Intelligence, 2, 1–5.
Humphreys, L. G., & Fleishman, A. (1974). Pseudo-orthogonal and other analyses of variance designs involving individual-difference variables. Journal of Educational Psychology, 66, 464–472.
Irwin, J. R., & McClelland, G. H. (2001). Misleading heuristics and moderated multiple regression models. Journal of Marketing Research, 38, 100–109.
Irwin, J. R., & McClelland, G. H. (2003). Negative consequences of dichotomizing continuous predictor variables. Journal of Marketing Research, 40, 366–371.
Kowalski, R. M. (1995). Teaching moderated multiple regression for the analysis of mixed factorial designs. Teaching of Psychology, 22, 197–198.
Lance, C. E., Butts, M. M., & Michels, L. (2006). The sources of four commonly reported cutoff criteria: What did they really say? Organizational Research Methods, 9, 202–220.
Langens, T. A., & Stucke, T. S. (2005). Stress and mood: The moderating role of activity inhibition. Journal of Personality, 73, 47–78.
Leippe, M. R., Eisenstadt, D., Rauch, S. M., & Seib, H. M. (2004). Timing of eyewitness expert testimony, jurors’ need for cognition, and case strength as determinants of trial verdicts. Journal of Applied Psychology, 89, 524–541.
MacCallum, R. C., Zhang, S., Preacher, K. J., & Rucker, D. D. (2002). On the practice of dichotomization of quantitative variables. Psychological Methods, 7, 19–40.
Maxwell, S. E., & Delaney, H. D. (1993). Bivariate median splits and spurious statistical significance. Psychological Bulletin, 113, 181–190.
McClelland, D. C. (1979). Inhibited power motivation and high blood pressure in men. Journal of Abnormal Psychology, 88, 182–190.
McClelland, D. C. (1985). Human motivation. Glenview, IL: Scott, Foresman.
McClelland, D. C., Floor, E., Davidson, R. J., & Saron, C. (1980). Stressed power motivation, sympathetic activation, immune function, and illness. Journal of Human Stress, 6, 11–19.
McClelland, G. H. (2003). Dichotomizing continuous variables: A bad idea. Retrieved January 15, 2007, from http://core.ecu.edu/psyc/wuenschk/StatHelp/Dichot-Not.doc
McClelland, G. H., & Judd, C. M. (1993). Statistical difficulties of detecting interactions and moderator effects. Psychological Bulletin, 114, 376–390.
Milligan, G. W., & McFillen, J. M. (1984). Statistical conclusion validity in experimental designs used in business research. Journal of Business Research, 12, 437–462.
Owen, S. V., & Froman, R. D. (2005). Why carve up your continuous data? Research in Nursing and Health, 28, 496–503.
Preacher, K. J., Rucker, D. D., MacCallum, R. C., & Nicewander, W. A. (2005). Use of the extreme groups approach: A critical reexamination and new recommendations. Psychological Methods, 10, 178–192.
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: A bad idea. Statistics in Medicine, 25, 127–141.
Sedney, M. A. (1981). Comments on median split procedures for scoring androgyny measures. Sex Roles, 7, 217–222.
Sprott, D. E., Spangenberg, E. R., & Fisher, R. (2003). The importance of normative beliefs to the self-prophecy effect. Journal of Applied Psychology, 88, 423–431.
Stanton, A. L., Kirk, S. B., Cameron, C. L., & Danoff-Burg, S. (2000). Coping through emotional approach: Scale construction and validation. Journal of Personality and Social Psychology, 78, 1150–1169.
Stone, E. F., & Hollenbeck, J. R. (1984). Some issues associated with the use of moderated regression. Organizational Behavior and Human Performance, 34, 195–213.
Stone, E. F., & Hollenbeck, J. R. (1989). Clarifying some controversial issues surrounding statistical procedures for detecting moderator variables: Empirical evidence and related matters. Journal of Applied Psychology, 74, 3–10.
Stone-Romero, E. F., & Anderson, L. E. (1994). Relative power of moderated multiple regression and the comparison of subgroup correlation coefficients for detecting moderating effects. Journal of Applied Psychology, 79, 354–359.
Tazelaar, M. J. A., Van Lange, P. A. M., & Ouwerkerk, J. W. (2004). How to cope with “noise” in social dilemmas: The benefits of communication. Journal of Personality and Social Psychology, 87, 845–859.
Vargha, A., Rudas, T., Delaney, H. D., & Maxwell, S. E. (1996). Dichotomization, partial correlation, and conditional independence. Journal of Educational and Behavioral Statistics, 21, 264–282.
Vecchio, R. P. (1990). Theoretical and empirical examination of cognitive resource theory. Journal of Applied Psychology, 75, 141–147.
Von Hippel, W., Von Hippel, C., Conway, L., Preacher, K. J., Schooler, J. W., & Radvansky, G. A. (2005). Coping with stereotype threat: Denial as an impression management strategy. Journal of Personality and Social Psychology, 89, 22–35.
Subject Index
Numbers with an f indicate figures; those with a t indicate tables.
A Abductive inference, 68 Absenteeism, 17, 90, 325 Academy of Management Conference (New Orleans, 2004), 3–4 Academy of Management Journal, 114, 127, 152, 182, 260, 268, 272, 366 Academy of Management Review, 229 Ad hoc techniques, 12, 25–26 Administrative Science Quarterly, 230, 268, 366 Aggregation, 27 Agreeableness, 18, 18t Akaike Information Criterion (AIC), 178–179 Alternative model specification (AMS), 165–167 applied in research literature, 181–186, 183t, 184t CFA and, applied in, 184–185 equivalent models and, 170–174, 172f, 173f nested models and, 174–177 nonnested alternative models and, 177–179 summary of, 179–181 underlying issue in, core of, 167–170 understanding and engaging in, 165–167 American Psychological Association (APA), 220, 228–229, 287 AMS, see Alternative model specification (AMS) ANOVA chopped data problem and, 363–364, 382 Mantel-Haenszel procedure and, 55 moderation studies and, 158
multiple regression analyses and, 371, 372, 373 separate regression analyses and, 126 Anthropometry, 351 Appropriateness measurement, 55 Artifact distributions, 27 Artificial covariance, 319, 323 Assessment center (AC) construct validity, 347–349 Attitudes, 19, 202
B Basic Data Relation Matrix, 352 Bayesian Information Criterion (BIC), 178–179 Bayesian sampling distribution, 12 Behavioral intentions, 202, 204 Bias; see also Parameter estimation bias confirmation, 168–169 measurement method, 338, 346, 347, 348, 349 in missing data, results of research studies and, 21 monomethod, 311, 338 parameter, in missingness mechanisms, 11t, 13, 14 path coefficient, 100–102 path coefficient vs. significance testing, 100–102 predictor-related criterion, in non-selfreport, 328, 329 publication, 27, 28 response rate (see Response rate bias) Binomial error model, 47 Business Source Premier, 229, 366
C Causal inferences, 122, 123–124t Causal modeling, 92, 96, 124, 125–126 Causal system, basic, 107–108, 108f Causal variables, 93, 97, 98, 156–157
CFA, see Confirmatory factor analysis (CFA) Chi-square difference test, 127, 175–178, 186 goodness-of-fit test, 127, 135, 168, 175–177, 176t statistic, 54, 200, 201 Chopped data, 361–362, 377–378; see also Chopped data, urban legends regarding across disciplines, 372 advantages of using, 378 ANOVA and, 363–364, 382 approaches of, 372–374 conceptual appropriateness and, 377 cutoff points used in, 364, 367, 373, 380, 381 dichotomization of variables and, 364, 367, 376, 379t, 380t disadvantages of using, 377–382 extreme groups approach to, 364–365 mean splits used in, 364, 367 median split approach to (see Median split) occurrence of, over time, 371–372 prevalence of, 370–371 statistical analysis and, most appropriate form of, 375 techniques of, 364–365 tertiary splits used in, 364, 367, 369, 373 violation of normality and, 375–376 when faced with chopping data, 382–383 Chopped data, urban legends regarding, 362 justifications, 365–366 insufficient or faulty (myths), 374–375 legitimate (truths), 376 literature review and, 366–367 literature search and, results of, 367–370, 368–369t occurrence of chopped data, 363–364 Citizenship performance, 301–302 Classical test theory (CTT), 38–40 computer adaptive testing in, 45–46 criterion-referenced tests in, 46 factor analysis in, 52 vs. item response theory, 37–38, 53 multidimensionality in, 51–52 person and item parameters on different scales in, 45–46 population invariance in, lack of, 44–45 reliability concept and, 39–40, 47
sample size needed for, 50–51 SEM and, 52 true scores of, 39–40, 44, 51, 52 platonic notion of, 39 Cohen’s definitions of small, medium, and large effect size, 269–273, 282 Common factor analysis; see also Exploratory factor analysis (EFA) vs. component factor analysis (see Component vs. common factor analysis) equation, uniqueness term contained in, 64 parallel analysis procedure and, 81 Common method variance in correlated IRs, 204 exercise effect and, 348 as function of various questionnaire scale anchor formats, 350 in multitrait-multirater sense, 345 in non-self-report data, 328–330 in self-report data, urban legend of, 311–313, 316, 318–320, 323 basic idea of, 311 Communalities, 64, 67, 70, 77–79 Complete mediation, 108–109, 118 Complex conventional mediation models, 134 Complex speculative mediation models, 132–134 Component analysis; see also Exploratory factor analysis (EFA) vs. common factor analysis (see Component vs. common factor analysis) equation, 64 parallel analysis (PA) procedure and, 81 Component vs. common factor analysis; see also Exploratory factor analysis (EFA) distinction between, 62–66 methodological arguments in, 66–67 philosophical arguments in, 68–69 result differences in, 69–71 summary of, 70–71 Computer adaptive testing (CAT), 45–46, 55 Condition 9 tests, 174 Confirmation bias, 168–169 Confirmatory factor analysis (CFA), 52, 67, 165; see also Covariance structure modeling (CSM)
AMS applied in, 184–185 in MTMM matrix, 340, 346, 348–349 Conscientiousness, 18, 22–23, 299, 329 nonrespondents’ level of, 18, 18t Construct validity, 227 assessment center and, in MTMM matrix, 347–349 in qualitative research, lack of, 227 evaluation of, 237–238 indicators of, 232t summary of, 238 of self-report data, 313–316 Contaminating constructs, 313, 314 Contextualizations, 295–296 inauspicious designs and, 296–299 phenomena that challenge fundamental assumptions and, 300–302 phenomena with obscured consequences and, 299–300 Control Theory, 134 Convenience model, 24–25 Convenience sampling defense of, 252–253, 252t in qualitative research, 239 Conventional mediation models complex, 134 simple, 131–132 Convergent validity, 227, 315, 325, 337, 338–339, 340n Correlated factors in IRs (see Indicator residuals (IRs)) in minimum average partial, 82 in non-self-reported data, 328–329 in orthogonal rotations, 70, 71, 73f, 82 in self-reported data, 316–318 Correlated trait-correlated method (CTCM) model, 340–341, 341f Correlated uniqueness model, 340 Counseling Psychologist, 229 Covariance, artificial, 319, 323 Covariance structure analyses, 277–280, 283 Covariance structure modeling (CSM), 165; see also Alternative model specification (AMS) applied in research literature, 181–186, 183t, 184t chi-square goodness of fit and, 175–176, 176t confirmation bias in, 168–169 disconfirmation in, 167, 169–170, 172, 175, 186 Covariation Chart, 352 Criterion-referenced tests, 46
Cross-loadings, 71, 73, 74 CTT, see Classical test theory (CTT) Curvilinearity, 151–156 Cutoff points, 364, 367, 373, 380, 381
D Darwinian model of research methods, 24 The Data Box, 352 Deletion techniques, 13–14 Dependent variables, 91, 143n, 144, 158, 367, 382 Desired Relative Seriousness (DRS), 276–277 Dichotomization of variables, 364, 367, 376, 379t, 380t Differential item functioning (DIF), 55 Disconfirmation, 167, 169–170, 172, 175, 186 Discriminant validity, 227, 240, 315, 337, 338–339, 340n Dispersion hypotheses, 27 Disturbance term (d), 92–93, 96, 110, 112, 181 Disturbance term regressions (DTRs), 133–134 dSNP 14–15, 17, 18t, 30–31 d value, 289–290, 291–292
E Educational and Psychological Measurement, 63, 182 EFA, see Exploratory factor analysis (EFA) Effect size, 287–288; see also Sample generalizability debate; Sample size; Sample size rules of thumb citizenship performance, 301–302 contextualizations, 295–296 inauspicious designs and, 296–299 phenomena that challenge fundamental assumptions and, 300–302 phenomena with obscured consequences and, 299–300 defined, 289–290 extreme groups design and, 303–305 gain/loss priming and, 297–298 indices, 289–292 kernel of truth in, 291–292 large, 302–305 ontological relativism and, 292–295 reflexive interpretation of, 298 urban legend of, 290–291
within-person variability and, 300–301 Eigenvalues greater than one rule, 79–83 EM algorithm, 13, 26 Endogenous variables correlated residuals and, 200 LOVE and, 89, 91–93, 93f, 95–98, 100–103 in testing moderation, 143n Epistemology, 222 Equivalent models, 170–174, 172f, 173f Exercise effect, 348 Exogenous variables correlated residuals and, 200 LOVE and, 89, 91–93, 93f, 95–98, 100–103 in testing moderation, 143n Exploratory factor analysis (EFA), 52, 61–63; see also Common factor analysis; Component factor analysis component vs. common factor analysis (see Component vs. common factor analysis) K1 rule vs. eigenvalues greater than one rule and, 79–83 orthogonal rotation vs. oblique rotations and, 71–74 reproducing correlations among variables in, 74 R-mode, 352 sample size guidelines for, 76–79 sample size needed for, 74–76 Exponential random graph modeling, 29 External validity, 228 indicators of, 232–233t, 234 low response rates and, 9, 15 missing data and, 9 qualitative research and, lacking in, 228 weak, qualitative research and, 238–239 Extreme groups approach, 303–305, 364–365
F , 15–17, 18t, 30–31 Factor loadings in factor analysis, 70, 73, 75, 76–77 in indicator residuals, 199 in ratio of sample size to number of free parameters, 279 in self-report data, 313, 316, 317 Factor retention criteria, 79–83 Factor score indeterminacy, 66–67 Fakeability vs. actual faking, 320–323
First-order coefficients, 146–148 Follow-up survey reminders, 20 Four-step test of mediation, 110; see also Four-step test of mediation, urban legend surrounding condition 1, 111 condition 2, 111 condition 3, 111–112 condition 4, 112–113 for drawing causal inferences, 122–124 limitations of, 116–120 popularity of, reasons for, 114–115 Four-step test of mediation, urban legend surrounding kernel of truth about, 113–115 legend 1: test of mediation hypothesis should consist of four-steps, 116–120 legend 2: four-step procedures optimal test of mediation hypotheses, 120–121 legend 3: drawing causal inferences, as sufficient for, 122–124 Full information maximum likelihood (FIML), 13, 14, 26, 130 Full mediation, 108–109, 116, 118, 119–120, 127
G Gain/loss priming, 297–298 Generalizability; see also Sample generalizability debate internal/external, 239 litmus test of, 238–239 theoretical, 257–259 Goal Setting Theory, 132 Goodness-of-fit studies, 53–55 Goodness-of-fit test, 127, 135, 168, 175–177, 176t Group and Organization Management, 182
H Heterotrait-heteromethod (HTHM), 338, 339t Heterotrait-monomethod (HTMM), 338, 339t Heywood cases, 66, 67 Human Relations, 182 Hypothesis formulation, 231, 235 Hypothesis testing, 231, 235
I Impression management in non-self-report measures, 326, 328, 329 in social desirability responding, 319, 320, 322 Inauspicious designs, 296–299 Independent variables in chopped data, 362, 365–367, 372–373, 381, 382 in effect size, 289, 295, 303 in testing moderation, 143n, 144, 158, 159, 160 Indicator residuals (IRs), 193–195 allowing, origin of, 196–197 model fit and, 200–202 example of (theory of reasoned action), 202–204, 203f improved, 204–207, 206t observed variance in, causes of, 199–200 problems with, 195–196, 207–209 recommendations for, 209–211 SEM and, 197–199 Internal consistency reliability, 80 Internal validity, 226–227 IRs, see Indicator residuals (IRs) IRT, see Item response theory (IRT) Item difficulty appropriateness measurement and, 55 characterized, 40 item discrimination and, 46–47, 51 parameter estimate and, 44–45 Item discrimination, 40, 42, 46–47, 51 Item information, 42–43, 43f Item-level nonresponse, 9, 11–12 Item response function (IRF), 41, 42, 54 Item response theory (IRT), 40–44 vs. classical test theory (CTT), 37–38, 53 criticisms and limitations of assumptions behind, 49 in complicated IRT estimation programs, 50 multidimensionality and, 51–52 sample size needed for, 48 unidimensionality and, 48, 49 goodness-of-fit studies and, 53–55 information concept of, 42–43, 43f item discrimination and, 40, 42 item parameter estimates and, 44 item response function and, 41, 42, 54 models, 41 pseudo-guessing parameter and, 42
psychometric tools supported by, 55–56 range of construct in, focusing on, 53 test information function and, 43–44 Item-wording effects, 349–350
J Job satisfaction conscientiousness, effect on turnover intentions and, 22–23 IR terms and, correlations between, 205, 206t, 210 measure of, sample items from, 205 nonrespondents’ level of, 15, 18, 18t superior validity of non-self-report measures, myth of, 326 validity evidence and, convergent and discriminant, 315–316 Journal of Abnormal and Social Psychology, 270, 271, 272 Journal of Applied Psychology, 62, 63, 114, 120, 152, 182, 248, 249f, 268, 272, 312, 366 Journal of Educational Psychology, 63 Journal of Management, 182, 260 Journal of Occupational and Organizational Psychology, 182, 260 Journal of Occupational Psychology, 182 Journal of Organizational Behavior, 182, 260 Journal of Personality, 366 Journal of Personality and Social Psychology, 62, 63, 114, 120, 127, 366
K K1 rule, 79–83 Kernel of truth(s), defined, 1 Kuder-Richardson formula, 80
L Lagrange multiplier (LM) test, 201 Large effect size, 302–305 Latent trait theta (θ) defined, 40 estimates, 44 in IRT, 40–45 Latent variable modeling, 204, 210 Left out variables error (LOVE), 89 assumptions, using previous research to justify, 103–104
definition of, theoretical and mathematical, 91–104 (see also Path modeling) discussions and reviews on, 89–91 endogenous variables and, 89, 91–93, 93f, 95–98, 100–103 experimental control and, 102–103 inclusive models and, more, 103 path coefficient bias vs. significance testing and, 100–102 research purpose and, consideration of, 104 risk of, minimizing, 102 Likert scaling, 41, 288, 338, 350 Listwise deletion, 12, 13–14, 29 Literature review on chopped data, 366–367 on MTMM matrix, 342 methods studied, range of, 343–344, 344t traits studied, range of, 342–343, 343t Litmus test of generalizability, 238–239 LMATRIX subcommand, 151 Local independence, 41, 49 Longitudinal modeling, 26 LOVE, see Left out variables error (LOVE)
M Manipulation of variables, 233 MANOVA, 126 Mantel-Haenszel procedure, 55 MAR (missing at random), 10–14, 11t, 25, 26 Maximum likelihood (ML), 12, 13–14, 26 MCAR (missing completely at random), 10–11, 11t, 13, 15, 27, 28 Mean-centering, 145–146 Mean imputation, 11, 26, 27 Mean splits, 364, 367 Measurement, defined, 344 Measurement error (ME), 92, 148–150 Measurement method bias, 338, 346, 347, 348, 349 Measurement methods, 344–345 alternative, 350–351 anthropometry, 351 assessment center construct validity, 347–349 Basic Data Relation Matrix (The Data Box), 352 Covariation Chart, 352
measurement occasions as “methods,” 350 method from substance, discriminating, 351–353 multisource performance appraisal, 345–347 other-reports, 325, 327, 351 positive versus negative item-wording effects as method effects, 349–350 postexercise dimension ratings, 347–349 prototype multidimensional measurement system, 352–353 Rosenberg’s Self-Esteem Scale, 349–350 Median split dichotomization of variables using, 366, 378–379, 379f explained, 364–365 justifications for using, 365, 374–375, 382 as most ubiquitous form of data chopping, 369t, 372–373 point biserial correlation resulting from, 381 violation of normality and, 375–376 Mediated moderation, 157 Mediation hypothesis, see Mediation testing Mediation modeling; see also Path modeling cell 1: simple speculative mediation models, 130–131 cell 2: simple conventional mediation models, 131–132 cell 3: complex speculative mediation models, 132–134 cell 4: complex conventional mediation models, 134 summary of, 135–136 Mediation testing; see also Four-step test of mediation causal system and, basic, 107–108, 108f inferences involved in, 108–110, 109f, 122, 123–124t SEM framework for, 89, 124–127, 135 summary of, 127–129 Meta-analysis, 12, 20, 27–28 Methodological issues in chopped data, 380–381, 382 factor score indeterminacy, 66–67 Heywood cases, 66, 67 qualitative vs. quantitative research, 219, 222–224, 240–241
Methodological rigor, 225, 226, 231, 233, 236 Methodological soundness, 226 Minimum average partial (MAP) procedure, 81–82 Missing data; see also Missing data, urban legends in analysis, fundamental principle of, 11–13, 24 bias results of research studies and, 21 defined, 8 external validity and, 9 levels of, 9, 12 longitudinal modeling and, 26 mechanisms (see Missingness mechanisms) meta-analysis, 20, 27–28 moderated regression and, 29 social network analysis and, 28–29 statistical power and, 9 techniques (see Missing data techniques) within-group agreement estimation, 27 Missing data techniques, 7, 30 levels of missing data and, 11–13, 12t item-level nonresponse, 11–12 scale-level nonresponse, 12 survey-level nonresponse, 12–13 listwise and pairwise deletion, ML, and MI, 13–14 Missing data, urban legends in legend 1: low response rates invalidate results, 21–24 legend 2: when in doubt, use listwise or pairwise deletion, 24–26 Missingness mechanisms, 7 function/importance of, 10–11 MAR (missing at random), 10–14, 11t, 25, 26 MCAR (missing completely at random), 10–11, 11t, 13, 15, 27, 28 MNAR (missing not at random), 10–11, 11t, 14, 15, 27, 28 parameter bias/statistical power problems of, 11t, 13, 14 random missingness and, 9–10, 15, 24, 27, 28 systematic missingness and, 9–10, 15, 21, 24, 28 MNAR (missing not at random), 10–11, 11t, 14, 15, 27, 28 Moderated regression, 29 Moderation; see also Moderation testing, seven myths of
defined, 151 mediated, 157 myths beyond, 159–160 (see also Moderation testing, seven myths of) Moderation testing, seven myths of, 143–144 myth 1: product terms create multicollinearity problems, 144–146 myth 2: coefficients on first-order terms are meaningless, 146–148 myth 3: measurement error poses little concern when first-order terms are reliable, 148–150 myth 4: product terms should be tested hierarchically, 150–151 myth 5: curvilinearity can be disregarded when testing moderation, 151–156 myth 6: product terms can be treated as causal variables, 156–157 myth 7: testing moderation in structural equation modeling is impractical, 158–159 Moderator variables, 143–144, 272, 361, 373 Monomethod bias, 311, 338, See Monotrait-heteromethod (MTHM), 338, 339t Monte Carlo studies, 49, 51, 179 MTMM, see Multitrait-multimethod (MTMM) matrix Multidimensionality, 51–52 Multiple imputation (MI), 12, 13–14, 26 Multiple regression as ANOVA result supplement, 371–372, 373 MTMM matrix analysis and, 340 shrinkage concept of, 76 tests of moderation and, 144 Multiplicative models, 340 Multitrait-multimethod (MTMM) matrix, 337–338 background of, 338–342, 339t CFA in, 340, 346, 348–349 convergent validity and, 315, 325, 337, 338–339, 340n correlated trait-correlated method model and, 340–341, 341f correlated uniqueness model and, 340 discriminant validity and, 315, 337, 338–339, 340n heterotrait-heteromethod and, 338, 339t heterotrait-monomethod and, 338, 339t
literature review on, 342 methods studied, range of, 343–344, 344t traits studied, range of, 342–343, 343t measurement facet and, 337–338, 341 (see also Measurement methods) monotrait-heteromethod and, 338, 339t multiplicative models and, 340 multitrait-multirater and, 345, 346 rater source effects and, 346–347 trait-method unit and, 338, 340 Multitrait-multirater (MTMR), 345, 346 Multivariate Behavioral Research, 66, 68
N Natural Kinds (Quine), 294 Nested models, 174–177 Nonnested alternative models, 177–179 Nonrelevant cause (NRC), 92, 93 Nonresponse behavior, 17–18, 21 Non-self-report, 325–330 artificial deflation of correlations due to suppressor effect and, 328–329 artificial inflation of correlations due to predictor-related criterion bias and, 328, 329 impression management in, 326, 328, 329 malleability of intelligence and, belief in, 327 organizational citizenship behavior constructs and, 327–328 self-referential respondent perception constructs and, 327 Norms courtesy, 302 cultural, 19, 319, 324 reciprocity, 20 subjective, 19, 135–136, 202, 204 N:p ratio, 75–76
O Oblique rotations, 71–74, 84 Observed correlation, 316–318, 323, 329 Ω² value, 289 Omitted relevant cause, 92–96 Ontological relativism, 292–295 Ontology, 222 Ordinary least-squares (OLS) regression analyses, 113
Organizational Behavior and Human Decision Processes, 62, 114, 152, 182 Organizational citizenship behavior (OCB) constructs, 327–328 Organizational commitment equivalent latent variable models and, 173f equivalent path models and, 172f nested models and, 174, 175 nonnested alternative models and, 178 nonrespondents’ level of, 18, 18t social desirability responding and, 320 validity evidence and, convergent and discriminant, 315–316 Organizational Dynamics, 182 Organizational Research Methods (ORM), 3–4 Organizational support, perceived, 29, 315, 326 Orthogonal rotations, 71–74, 84 Other-reports, 325, 327, 351
P Pairwise deletion, 12, 13–14, 29 Parallel analysis (PA), 81 Parameter estimation bias; see also Response rate bias biased, 126, 131 item difficulty and, 44–45 statistical problems of missing data techniques, 11t systematic missingness and, 10, 15 unbiased, 25 using ML or MI techniques to recover, 12, 25 Parsimony fit indices, 201 Partial correlation matrix, 82 Partial mediation, 109–110, 112, 118–119, 127 Path analysis and correlated residuals, 210 and LOVE, 89, 91, 100, 101 and MTMM matrices, 340 and sample size, 277 Path coefficient bias, 100–102 Path modeling complex, 97–100, 97f, 99t disturbance term in, 92–93, 96 inclusive models and, more, 103 measurement error in, 92 nonrelevant cause and, 93 omitted relevant cause in, 92–96
for one exogenous and one endogenous variable, 92–93, 93f random shocks in, 92 self-containment assumption and, violation of, 96–97 standardized regression coefficient in, 92–94, 111 suppressor effect and, 95–96 for two exogenous variables (one omitted), 93–96, 93t, 95t unmeasured nonrelevant causes in, 92 violated assumptions and, 96 Pattern loadings, 67, 70 Perceived control, 19 Perfect mediation, 108–109, 112 Performance appraisal, multisource, 345–347 Personality and Individual Differences, 63 Personality and Social Psychology Bulletin, 62, 114 Personnel Psychology, 62, 182, 268, 272, 366 Phenomena applied, students studying, 251, 260 naturally occurring, qualitative research and, 221, 231, 234, 237, 239, 240, 241 with obscured consequences, 299–300 quantitative research and, 224 that challenge fundamental assumptions, 300–302 workplace, 247 Platonic notion of a CTT true score, 39 Point biserial correlation, 381 Polytomization, 303–305, 364–365 Poor fit indices, 201 Population invariance, 44–45 Positive versus negative item-wording effects, 349–350 Postexercise dimension ratings (PEDRs), 347–349 Power analysis, 269–273, 282 Power, defined, 9 Predictor-related criterion bias, 328, 329 Principal component analysis (PCA), 65, 79 Priori type I error rate, 273–277, 282 Procedural justice, 17, 21, 22, 143 Pseudo-guessing parameter, 42, 43, 44, 45 Pseudo-isolation, 89, 96 PsycARTICLES, 366 Psychological Assessment, 63 Psychometric tools, 55–56 PsycINFO, 4, 229 Publication bias, 27, 28
Q Q-mode, 352 Quadratic Formula, 35 Quadratic function, 154, 155f Qualitative research, 219–221; see also Qualitative research, beliefs associated with caveats and assumptions in, 225 definitional issues in, 221–222 future of, in social and organizational sciences, 240–241 vs. quantitative (see Quantitative research vs. qualitative) Qualitative research, beliefs associated with belief 1: qualitative research does not utilize scientific method, 225–226 evaluation of, 234 hypothesis formulation/testing and, 235 indicators of, 231t observation and description and, 234 summary of, 235–236 belief 2: qualitative research lacks methodological rigor, 226 evaluation of, 236 belief 2a: qualitative research lacks internal validity, 226–227 evaluation of, 236–237 indicators of, 232t summary of, 237 belief 2b: qualitative research lacks construct validity, 227 evaluation of, 237–238 indicators of, 232t summary of, 238 belief 2c: qualitative research lacks external validity, 228 evaluation of, 238–239 indicators of, 232–233t summary of, 239 belief 3: qualitative research contributes little to the advancement of knowledge, 228–229 evaluation of, 239–240 evaluating through published commentary, 229–234, 230t Quantitative research vs. qualitative debate over, 219–220 future of, 240–241 philosophical differences in, 222–223
publication rates and, 220 validity conceptualizations of, 223–224
R R²/r² value, 289–290 Random assignment, 255–256 Random measurement error, 64, 314, 317, 318 Random missingness, 9–10, 15, 24, 27, 28 Random sampling, 255–256 Random shocks (RS), 92 Rasch model, 48 Ratio scaling, 350 Reflexive interpretation, 298 Regression moderated, 29 multiple (see Multiple regression) Regression analyses four-step test of mediation and, 110, 126, 127 moderation testing and, 154, 158, 159 ordinary least-squares and, 113 social desirability and, controlling predictive effect of, 324 Regression coefficients, 151 Regression-to-the-mean phenomenon, 26 Reliability concept, 39–40, 47 Reliability estimates, 27, 148, 347 Representativeness, 259–260 Research methods convenience model, 24–25 Darwinian model, 24 Response behavior, 17–20 nonresponse, 17–18, 21 Response rate, 19–20; see also Missing data techniques bias (see Response rate bias) defined, 9 follow-up reminders and, 20 invitation factors and, 20 issue in SNPs and, addressing, 12–13 length of survey and, 20 monetary incentives and, 20 urban legend: low response rates invalidate results, 21–24 Response rate bias for the correlation, 15–16, 16f, 35–36 in indirect effect, 22, 23, 24 in the mean, 15, 16f in MNAR missingness, 14 standard deviation of, 15, 16f, 35–36 Response status, 15
R-mode factor analysis, 352 Root mean square error of approximation (RMSEA), 176–177, 179 Rosenberg’s Self-Esteem Scale (RSES), 349–350
S Sample generalizability debate, 247–248 concern over, history of, 251–253 convenience sampling and, defense of, 252–253, 252t ideopathic research and, 250 importance of samples and, 255 random sampling vs. random assignment and, 255–256 representativeness and, 259–260 research base and, 253–255 theoretical generalizability and, 257–259 Sample size; see also Effect size; Sample size rules of thumb communalities and, 77–78 for factor analysis, 74–79 guidelines, 76–79 N:p ratio and, 75–76 origins of, 283 overdetermination of factors and, 78 Sample size rules of thumb, 267–268 Cohen’s definitions of small, medium, and large effect size and, 269–273, 282 critical analysis summary for, 281t 5 observations per estimated parameter included in covariance structure analyses and, 277–280, 283 increasing α priori type I error rate to .10 and, 273–277, 282 SAS factor analytic procedures available through, 61 IRT estimation and, 50 K1 rule and, 79 listwise and pairwise deletion options in, 26 MAP and, 82 parallel analysis procedure and, 81 principal component analysis as default method in, 65 uniqueness estimates and, 67 Scale-level nonresponse, 9, 12 Scale score, 11–12, 40, 51, 210 Scientific method used in qualitative research, 225–226
evaluation of, 234 hypothesis formulation/testing and, 235 indicators of, 231t observation and description and, 234 summary of, 235–236 Score variance, 39, 345 Scree plot, 81 Self-deception tendency, 314, 315, 326 Self-report data, 309–310 artificial covariance in, 319, 323 common method variance in, 311–313, 316, 318–320, 323 contaminating constructs in, 313, 314 defined, 309 problem 1: construct validity of self-report data, 313–316 problem 2: interpreting correlations in self-report data, 316–318 problem 3: social desirability responding in self-report data, 319–325 problem 4: value of data collected from non-self-report measures, 325–330 random measurement error in, 314, 317, 318 self-deception tendency and, 314, 315, 326 urban legend of, historical roots of, 310–313 validity concerns in, 311 variables, differences in, 309–310 Semantic differential scaling, 350 Sensitivity analysis, 7, 22, 24, 27, 30, 31 Shared variance, 64–65, 206, 311 Shrinkage concept, 76 Significance testing, 100–102 Simple conventional mediation models, 131–132 Simple speculative mediation models, 130–131 Simple structure, orthogonal vs. oblique rotations and, 62, 71–74 Sobel test, 22–23, 24 Social desirability responding artificial covariance and, 323 fakeability vs. actual faking and, 320–323 impression management and, 319, 320, 322 kernel of truth contributing to legend of, 320–321
myth of, multifaceted nature of, 319–320 pervasiveness of, 323–324 problem of, 319 Social network analysis, 28–29 Social Science Citation Index, 90, 363 Speculative mediation models complex, 132–134 simple, 130–131 SPSS factor analytic procedures available through, 61 GLM procedure of, using LMATRIX subcommand, 151 IRT estimation and, 50 K1 rule and, 79 listwise and pairwise deletion options in, 25, 26 MAP and, 82 parallel analysis procedure and, 81 principal component analysis as default method in, 65 uniqueness estimates and, 67 Standard deviation (SD), 15, 16f, 35–36 Standard error of measurement, 39–40, 47 Standardized regression coefficient (β), 92–94, 111 Statistical power low response rate and, 9 missingness mechanisms and, 11t, 13, 14 Stochastic dominance, 257 Strategic Management Journal, 182, 268 Structural equation modeling (SEM), 89–91; see also Alternative model specification (AMS) advantages of, 126–127 confirmation bias in, 168–169 in CTT, 52 disconfirmation in, 169–170 IRs and, 199–200 LOVE and, 89 in mediation testing, 89, 124–127, 135 model fit and, 200–202 reproducing correlations among variables and, 74 specification searches within, example of, 202–204, 203f testing moderation in, 158–159 typical model of, 198f Suppressor effect, 95–96, 328–329 Survey invitation factors, 20 Survey nonresponse, 9, 17; see also Systematic nonresponse parameters (SNPs)
defined, 8 levels of, 9 in missing data techniques, 11–13 predicting, 19 theoretical model of, 8, 18–19, 19f, 21 Theory of Planned Behavior model for, 19, 19f Survey response anonymity, 20 behavior, 19, 19f defined, 8 SYSTAT statistical package, 66 Systematic missingness, 9–10, 15, 21, 24, 28 Systematic nonresponse parameters (SNPs), 7 dmiss, 14–15, 17, 18t, 30–31 empirical estimates of, 17, 18t , 15–17, 18t, 30–31 future research on, 30–31 longitudinal modeling and, 26 in meta-analysis, 27–28 response rate issue in, addressing, 12–13 sensitivity analysis used in, 7, 22, 24, 27, 30, 31
T Tertiary splits, 364, 367, 369, 373 Test information function, 43–44 TETRAD, 172 Theoretical Model of Survey Nonresponse, 8, 18–19, 19f, 21 Theory of Planned Behavior, 19, 135–136 Theory of reasoned action, 202–204, 203f Theta, see Latent trait theta (θ) Third variable problem, 94 Three-parameter logistic model, 41f Tracing, 74 Trait-method unit (TMU), 338, 340 True correlation, 316–318, 323, 329 True score(s) common variance, 199 of CTT, 39–40, 44, 51, 52 platonic notion of, 39 unique variance, 199–200 Truisms, 1 Turnover intentions effect of conscientiousness on, 22–23 equivalent latent variable models and, 173f equivalent path models and, 171, 172f nonnested alternative models and, 178
nonrespondents’ level of, 18t Type I error in correlated residuals, 208 Desired Relative Seriousness of, 276–277 increasing α priori type I error rate to .10 and, 273–277, 281, 282 omitted variables and, 101 pairwise deletion and, 29 in power analysis, 269 regression analyses and, 121, 126 testing moderation and, 153 Type II trade-offs and, 276–277, 281, 282 Type II error Desired Relative Seriousness of, 276–277 listwise deletion and, 29 omitted variables and, 101 in power analysis, 269 testing moderation and, 153 Type I trade-offs and, 276–277, 281, 282
U Unidimensionality, 48, 49 sufficient, 49 Unit-level nonresponse, 9 Unmeasured variables problem, see Left out variables error (LOVE) Urban legends (ULs) belief phenomenon, 3 defined, 1
V Validity evidence, convergent and discriminant, 240, 315–316 Variables causal, 93, 97, 98, 156–157 Dependent, 91, 143n, 144, 158, 367, 382 Dichotomization of, 364, 367, 376, 379t, 380t Endogenous (see endogenous variables) Exogenous (see exogenous variables) Independent (see independent variables) Manipulation of, 233 Moderator, 143–144, 272, 361, 373 Omitted (see Left out variables error (LOVE)) Polytomization of, 303–305, 364–365 Third variable problem and, 94
Tracing, reproducing correlations among by, 74 Variance, common method, see Common method variance Variance score, 39, 345
W Web of Science, 114 Within-group agreement estimation, 27 Within-person variability, 300–301 Word and Object (Quine), 294
Author Index
A Abelson, R. P., 299 Acito, F., 66 Ackerman, P. L., 103 Acock, A., 158 Adams, S. K. R., 277 Adler, P. A., 240 Agho, A. O., 316 Aguinis, H., 31, 35, 226, 227, 234, 267–284, 290 Agut, S., 207 Aiken, L. A., 143, 144, 147, 148, 149, 154, 156, 270, 364, 381 Aiman-Smith, L., 165 Ajzen, I., 19, 202 Ajzen, K., 135 Akaike, H., 178 Aldag, R. J., 241 Allen, D. G., 81 Allen, M. J., 38, 39 Allison, P. D., 147 Altman, D. G., 373 Aluko, F. S., 219, 226 Anderson, C. A., 185, 255 Anderson, D. R., 178–179 Anderson, J. C., 167, 175, 177, 186, 196, 199, 208, 209, 210 Anderson, L. E., 363 Anderson, M. Z., 220, 228, 229, 242 Anderson, R. D., 66 Aram, J. D., 274 Armitage, C. J., 136 Arnold, H. J., 146, 147, 149–150, 362, 375 Arthur, W. Jr., 322 Aryee, S., 185, 315 Asch, S., 298 Atkinson, P., 237 Atkins, S. G., 349 Auken, S. V., 350 Aulakh, P. S., 373 Austin, J. K., 114 Austin, J. T., 165, 181
Avison, W. R., 340 Aziz, S., 15
B Babyak, M. A., 208 Bacharach, S. B., 107 Bachiochi, P. D., 221, 225, 235, 237, 241, 242 Back, K., 240 Bagozzi, R. P., 207, 312, 350 Bajdo, L. M., 350 Baker, J. R., 240 Baltes, B. B., 350 Bandalos, D. L., 61–85 Bandura, A., 315 Barbee, A. P., 324 Baron, R. M., 2 Barr, C. D., 18, 31 Barrett, G. V., 1 Barrick, M. R., 116, 315 Barry, T. E., 350 Bartlett, C. J., 81 Bateman, T. S., 315 Bauer, C. C., 350 Baumeister, R. F., 140 Baumgardner, M. H., 165 Baxter, D., 344 Bearden, W. O., 207 Beaty, J. C., 35, 270 Becherer, R. C., 315 Becker, T. E., 320, 345 Bedeian, A. G., 273, 274 Belsley, D. A., 144 Bennett, N., 201 Benson, J., 84–85, 337 Bentler, P. M., 25, 68, 70, 159, 174, 200, 207, 278–279 Berkowitz, L., 250, 252, 253, 259 Berman, J. S., 345 Bernard, R. S., 337 Bernstein, I. H., 73, 80, 361 Berry, M. W., 207 Berry, W. D., 158
BeVier, C. A., 19–20 Biddle, B. J., 169 Billig, M., 296 Bing, M. N., 107–136 Binning, J. F., 108 Birnbaum, M. H., 257–258 Bissonnette, V., 361, 364, 378, 381 Black, A. C., 168 Blakely, G. L., 315 Blickle, G., 342 Blum, M., 185 Bock, R. D., 46 Bohrnstedt, G. W., 145, 149, 150, 152 Boik, R. J., 35, 269, 270 Boland, R. J., 274 Boles, S., 158 Bolger, N., 113, 117, 120, 126 Bollen, K. A., 52, 89, 96, 158, 159, 200, 201, 208, 337 Bong, M., 342 Boomsma, A., 165, 168, 278 Bordeaux, C., 231 Borgatta, E. R., 66 Borgatti, S. P., 28 Borman, W. C., 346 Borsboom, D., 37, 39 Boruch, R. F., 340 Boudreau, J. W., 250 Bourke, S. C., 343 Bowen, C. C., 321 Bozdogan, H., 165 Bozeman, D., 346 Bracker, J. S., 90 Bradburn, N. M., 311 Breckler, S. J., 170, 171, 174, 207 Brennan, R. L., 47 Bresnahan, M. J., 343 Brett, J., 185 Brett, J. A., 174 Brett, J. F., 315 Brett, J. M., 4, 89, 117, 122, 125, 131 Bretz, R. D., 250 Brewer, D., 51 Brews, P. J., 269 Brief, A. P., 371 Briggs, D., 228, 229, 242 Brinberg, D., 236 Broberg, B. J., 277 Brockner, J., 143 Browne, M. W., 175 Brown, J. D., 114 Brown, K. G., 269 Brown, S. P., 134
Brown, T. C., 273 Buckley, M. R., 312 Buison, A. M., 337 Bundy, R., 296 Burchinal, M., 51 Burnett, D., 231 Burnham, K. P., 178–179 Burt, R. S., 28 Busemeyer, J. R., 149, 150, 153 Bushman, B. J., 255 Button, S. B., 315 Butts, M. M., 4 Byrne, B. M., 125, 200
C Calder, B. J., 252, 258 Camerer, C. F., 260 Cameron, C. L., 365 Campbell, D., 256 Campbell, D. T., 219, 223, 224, 226, 227, 228, 231, 237, 240, 342 Campbell, J. P., 252, 253, 255 Campion, M. A., 269 Cannon-Bowers, J. A., 117 Carless, S. A., 183 Carley, K. M., 28 Carmines, E. G., 349 Carnap, R., 293 Carver, C. S., 134 Cascio, W. F., 320 Casillas, A., 56 Casper, W., 231, 241 Cassell, C., 220, 221, 227 Cattell, R. B., 352 Chan, D., 27, 309–331 Chavez, C., 185 Chay, Y. W., 315 Chen, G., 158, 283 Cheung, M. W. L., 52 Childers, T. L., 19 Chisolm, M. A., 351 Choi, H. S., 297 Cho, S. J., 82 Chou, C. H., 278, 279 Chou, C. P., 159 Cialdini, R. B., 365 Ciesla, J. A., 196, 211 Cizek, G. J., 53 Claessens, B. J. C., 185 Clark, L. A., 315 Cliff, N., 75, 80, 124, 169, 170, 196, 205, 207 Cohen, J., 143–147, 150, 159, 269–272, 274
Cohen, L. L., 337 Cohen, P., 144, 270, 271 Cole, D. A., 196, 211, 343, 350 Collins, L. M., 14, 31 Colquitt, A. L., 320 Colquitt, J. A., 90 Comrey, A. L., 72 Conner, M., 136 Conrad, H. S., 251–252 Conrad, M. A., 114 Conway, J. M., 15, 62, 64, 65, 72, 79, 84, 345, 346, 348 Conway, L., 365 Cook, D. T., 219, 223, 224, 226, 227, 228, 231, 237, 240 Cook, T. D., 219 Cook, W. L., 342 Cooper, H., 19 Cordes, C. L., 185 Cornwell, J. M., 131, 199 Cortina, J. M., 79, 83, 150, 152, 153, 158, 159, 193–211, 287–306 Costa, P. T., Jr., 315 Costenbader, E., 28 Costner, H. L., 196 Cote, J. A., 312, 345 Couch, A. S., 311 Cowles, M., 321 Crampton, S. M., 312, 318 Crant, J. M., 315 Crask, M. R., 19 Cresswell, S. L., 337, 342 Cristol, D. S., 15, 17, 18, 19, 22, 26, 30, 31 Cronbach, L. J., 145, 227, 311, 337 Cron, W. L., 134, 315 Cumsille, P. E., 13 Cunningham, M. R., 324 Curran, P. J., 159
D Daniel, L. G., 81 Daniel, P., 18, 31 Danoff-Burg, S., 365 Darling, M., 321 Darrow, D., 240 Davidson, R. J., 375 Davison, H. K., 118 Davis, T. V., 220, 221, 228 Dawson, J. F., 29 Day, D. V., 348 DeBruyn, E. E. J., 351
Delaney, H. D., 363, 364–365, 366, 371, 378, 381 Delp, N. D., 255 Demaree, R. G., 27 Dempster, A. P., 12 DeNisi, A., 255 Derives, M. R., 228, 229, 242 DeShon, R. P., 303, 304 Diener, E., 316 Digman, J. M., 315 Dillman, D. A., 19 Dipboye, R. L., 252, 253 DiStefano, C., 343, 350 Dobbins, G. H., 251, 252 Donnelly, T. M., 346 Donnerstein, E., 250, 252, 253, 259 Doty, D. H., 345 Dougherty, T. W., 185 Douglas, E. F., 322 Drasgow, F., 54 Dreher, G. F., 348 Drezner, Z., 194 Druskat, V. U., 234 Dudek, F. J., 40 Dudley, N., 302 Duncan, O. D., 89, 96 Duncan, S. C., 158 Duncan, T. E., 158 Dunlap, W. P., 145–146, 149, 150, 158, 287 Dunnette, M. D., 321 Durkheim, E., 299 Duvall, S., 28 Duval, R. D., 159
E Eagly, A. H., 255 Eaton, C. A., 79 Eaton, N. K., 321 Eaton, W. O., 114 Eby, L. T., 219–242 Eckersley, R., 154 Eddleston, K. A., 185 Edens, P. S., 322 Edwards, J. R., 127, 143–160 Edwards, W., 337 Efron, B., 159 Eid, M., 340, 350 Eisenberger, R., 315 Eisenberg, N., 375 Eisenhardt, K. M., 366, 377 Eisenstadt, D., 376 Eklund, R. C., 337, 342
Elek-Fiske, E., 13 Ellingson, J. E., 103, 324 Ellis, B. B., 51 Ellis, L. A., 342 Embretson, S. E., 41 Emmons, R., 316 Emrich, C., 273 Enders, C. K., 12, 13 Epitropaki, O., 278 Epstein, S., 247 Esplin, P. W., 235 Evans, M. G., 146, 147, 150
F Fabes, R. A., 375 Fabrigar, L. R., 61, 62, 63, 65, 67, 79, 84, 165, 194 Fairchild, A. J., 113 Fan, X., 51 Faraggi, D., 381 Farber, M. L., 250, 251, 252, 258 Farmer, S. M., 185 Farrington, D. P., 378 Fava, J. L., 75, 76, 77, 78, 79 Fay, S. Y., 274 Feeley, T. H., 114 Feldt, L. S., 47 Ferron, J. M., 75, 76, 78 Festinger, L. U., 240 Fidell, L. S., 201 Finch, J. F., 159 Finkel, E. J., 186 Fishbein, M., 202 Fisher, R. A., 256, 365 Fisicaro, S. A., 133 Fiske, D. W., 227, 311, 313, 337, 342 Fiske, S. T., 118 Flament, C., 296 Flanagan, M. F., 252, 253 Fleishman, A., 381 Floor, E., 375 Flora, D. B., 186 Foa, E. B., 20 Foa, U. G., 20 Fornell, C., 177, 204, 207 Forshee, V. A., 186 Foster-Johnson, L., 145, 151 Foster, M. R., 348 Fox, R. J., 19 French, D. P., 343 French, N., 348 Friedman, R., 185 Fritz, M. S., 113, 117, 121
Froman, R. D., 365, 372 Funke, F., 337
G Ganster, D. C., 319 Ganzach, Y., 152–154, 159 Garg, V. K., 273 Gatewood, R. D., 348 Gelfand, M. J., 269, 290 Gentry, W. A., 348 Gerbing, D. W., 167, 175, 177, 186, 196, 199, 208, 209, 278 Gersick, C. J. G., 240 Gewin, A. G., 348 Gibson, G. J., 343 Gigerenzer, G., 256 Gilbert, D. T., 115 Gilovich, T., 247, 259 Gilstrap, L., 114 Giorgi, A., 234 Glancy, M., 337 Glaser, B., 226 Glaser, D. N., 177 Glenar, J. L., 227 Glenn, D. M., 18, 31 Glick, W. H., 345 Glymour, C., 172 Goates, N., 185 Goldberger, A. S., 145, 152, 153 Golden-Biddle, K., 221, 222, 225 Gold, M. S., 25 Goldstein, D., 115 Gong, Y., 114 González-Romá, V., 350 Goodwin, G. F., 117 Gordon, M. E., 251 Gorsuch, R. L., 67, 75, 76, 80 Gouldner, A. W., 20 Graen, G. B., 185 Graham, J. W., 10, 13, 29 Graham, K. E., 322 Gravetter, F. G., 290 Greenberg, J., 252, 259 Green, S. B., 208 Greenwald, A. G., 165 Grelle, D. M., 165–187, 194, 267 Griffin, D., 259 Griffin, S., 316 Grove, W. M., 260 Gruys, M. L., 103 Guba, E. G., 219, 224 Guilford, J. P., 345 Gurhan-Canli, Z., 114
Guthrie, I. K., 375 Guttman, L., 79, 80
H Haig, B. D., 68–69 Hall, R. J., 259 Hambleton, R. K., 55 Hamburger, C. D., 75 Hammer, A. L., 37 Hammersley, M., 237 Hancock, G. R., 159, 208 Hanson, B. A., 47 Hanushek, E. A., 89 Harden, E. E., 267–284, 290 Harmer, P., 158 Harms, H. J., 45, 46 Harnisch, D. L., 55 Harrison, D. A., 17, 27, 49 Harris, R. D., 343 Harvey, R. J., 37 Harvill, L. M., 40 Hasher, L., 115 Haslett, T. K., 255 Hau, K.-T., 158 Hayduk, L. A., 177, 209 Hayton, J. C., 81 Heffner, T. S., 117 Heggestad, E. D., 103 Heller, D., 22 Helms, B. P., 144 Henderson, N. D., 207 Henley, A. B., 167, 170, 174, 181, 183 Hennessey, H. W., 319 Henson, J. M., 45 Henson, R. K., 62, 72, 79, 83, 84 Hepburn, C., 297 Heppner, P. P., 222, 223, 224, 225 Hernández, A., 350 Hershberger, S. L., 168, 170, 171, 172, 174, 183, 211 Heubeck, B. G., 342 Hezlett, S. A., 345, 347 Higgins, E. T., 297 Highhouse, S., 247–262 Hill, R. B., 72 Hines, C. V., 75, 76, 78 Hocevar, D., 342 Hodapp, V., 350 Hoffman, B. J., 346 Hoffman, J. M., 113 Hogarty, K. Y., 75, 76, 78 Holahan, P. J., 127 Hollenbeck, J. R., 375
Ho, M.-H. R., 175, 176 Hong, S., 75, 76, 78 Horan, P. M., 350 Horn, J. L., 81 Hoskisson, R. E., 269 Hough, L. M., 321, 324 Howell, D. C., 290 Hoyle, R. H., 124, 125 Hsu, T., 49 Huba, G. J., 207 Huberman, A. M., 219–222, 224, 234 Huffcutt, A. I., 62, 64, 65, 72, 79, 84, 345 Hui, H. C., 342 Hu, L., 174 Humphreys, L. G., 149, 150, 152, 153, 159, 363, 381 Hunter, J. E., 27, 103, 290, 345 Huntington, R., 315 Hunt, S. T., 321 Hutchinson, S., 315 Hwee, H. T., 185
I Ickes, W., 361 Ilgen, D. R., 250, 252 Ilies, R., 300, 301 Imparto, N., 346 Irwin, J. R., 361, 376 Isen, A. M., 297 Ittenbach, F. R., 337, 343, 351
J Jaccard, J., 145, 147–150, 158 Jackson, D. J. R., 349 Jackson, D. N., 66, 69 Jackson, J. E., 89 James, L. R., 4, 27, 89, 90, 92, 96, 103, 107–137, 174, 181, 201, 283 Janssens, J. M. A. M., 351 Jick, T. D., 220, 227, 234, 242 Johns, G., 160 Johnson, E. J., 153, 260 Jones, L. E., 149, 150, 153 Jöreskog, K. G., 158, 159, 201, 211 Judd, C. M., 151, 157, 158, 304, 361 Judge, T. A., 22, 33, 250, 300, 301, 342, 345
K Kahneman, D., 258, 259 Kaiser, H. F., 79–80
Kam, C. M., 14 Kamp, J. D., 321 Kanawattanachai, P., 274 Kano, Y., 68, 70 Kaplan, D., 175, 196, 201 Karasek, R. A., 143 Karasek, R. A., Jr., 143 Karau, S. J., 255 Kardes, F. R., 250, 252, 260 Kashy, D. A., 120 Kavanagh, M. J., 340 Keith, N., 350 Kelloway, E. K., 194, 200 Kemery, E. R., 145, 146, 149, 150 Keniston, K., 311 Kenny, D. A., 89, 110, 112, 113–122, 124, 126, 127, 129, 135, 136, 157, 158, 342, 345, 350 Kercher, K., 66 Kerig, P. K., 114 Kervin, J. B., 273, 274 Ketokivi, M. A., 343 Kidd, S. A., 219, 220, 223, 225–229, 241 Kim, C., 337 Kim, D., 114 Kim, H., 269 Kim, J., 19 Kim, K. H., 25 King, L. M., 345 King, S. N., 235 Kinicki, A. J., 90, 185, 278 Kirisci, L., 49 Kirkpatrick, L. A., 247 Kirk, S. B., 365 Kivlighan, D. M., 222 Klein, A., 158 Klein, E. B., 240 Kline, R. B., 158 Kluger, A. N., 255, 322 Knight, W. E., 15 Knowles, E., 361 Kolen, M. J., 47 Korbin, W., 321 Kotabe, M., 373 Kowalski, R. M., 381 Krackhardt, D., 28 Kraimer, M. L., 315 Kram, K. E., 240 Krausz, M., 346 Krehan, K. D., 337 Kromrey, J. D., 145, 151, 153 Kruglanski, A. W., 258 Krull, J. L., 117 Kubeck, J. E., 255
Kuha, J., 179 Kuhn, D., 118 Kuhn, T. S., 223 Kuncel, N. R., 347 Kunda, Z., 118 Kvale, S., 225
L Laczo, R. M., 89 Laird, N. H., 12 Lakatos, I., 170 Lambert, L. S., 157, 231 Lambert, R. G., 51 Lambert, T. A., 348 Lamb, M. E., 235 Lance, C. E., 1–4, 52, 89–104, 131, 132, 133, 148, 199, 207, 283, 337–353, 375 Landy, F. L., 337 Lane, D. M., 273 Lane, I. M., 251 Langeheine, R., 350 Langens, T. A., 364, 374, 382 Larkin, J. D., 340 Larsen, J., 316 Larson, J. R., Jr., 226, 228, 231, 257 Larsson, R., 228 Latham, G. P., 114, 132, 143 LeBreton, D. L., 118 LeBreton, J. M., 107–137 Lee, H., 337 Lee, J. Y., 312, 345 Lee, M. B., 72 Lee, R. M., 56 Lee, S. Y., 170, 171, 172, 174, 183, 211, 345 Lee, T. W., 219, 220, 221, 223, 225, 226, 227, 228, 236, 237, 238, 240, 241 Lehmann, D. R., 340 Leippe, M. R., 165, 376 LePine, J. A., 90 Levine, J. M., 297 Levin, P. F., 297 Levin, R. A., 321 Levine, M. V., 54, 55 Levine, T. R., 343 Levinson, D. J., 240 Levinson, M., 240 Lewin, K., 240 Lewis, C., 168, 200 Lievens, F., 342, 343, 348 Li, F., 82, 158, 159 Lim, B. C., 342, 343 Lim, V. K. G., 185 Lincoln, Y. S., 219, 224
Lindell, M. K., 312 Lind, S., 201 Lindsay, J. J., 255 Linn, R. L., 55 Lippe, Z. P., 89 Lippitt, R., 240 Lipsey, M. W., 28 Lisco, C. C., 185 Little, R. J. A., 9, 13 Little, S. L., 337 Locke, E. A., 132, 143, 252, 254 Locke, K., 221–222, 225 Locksley, A., 297 Lockwood, A., 231 Lockwood, C. M., 113, 117 Loeber, R., 378 Loehlin, J. C., 201 London, M., 346, 347 Long, J. S., 194, 196 Lord, F. M., 38 Lord, R. G., 259 Lozano Rojas, O. M., 37, 47 Lubinski, D., 149–150, 152, 153, 159 Luecht, R. M., 350 Luong, A., 17–19, 22, 26, 30, 31 Luo, Y., 114 Luthans, F., 220, 221, 228, 319 Lynch, J. G., 252, 259
M MacCallum, R. C., 61, 71, 72, 75, 76, 78, 80, 150, 152, 153, 165, 167, 168, 171, 174, 180, 181, 194, 196, 201, 202, 205, 207, 208, 209, 211, 361–363, 365, 366, 370, 371, 374, 376, 377, 378, 381 MacDonald, P., 51 MacKenzie, S. B., 312, 345 MacKinney, A. C., 340 MacKinnon, D. P., 113, 117, 121, 126 MacLaren, J. E., 337 Mahan, R. P., 344 Maheswaran, D., 114 Makhijani, M. G., 255 Manders, W. A., 351 Mansfield, E. R., 144 Maraun, M. D., 68, 69 Mar, C. M., 150, 152, 153 Marcoulides, G. A., 158, 159, 163, 170, 194 Marlin, M. M., 169 Marshall, C., 236 Marshall, H. M., 114 Marsh, H. W., 14, 158, 163, 342, 343, 350
Marteau, T. M., 343 Martel, R. F., 273 Martin, B. A., 321 Martin, N. C., 350 Martin, R., 278 Martin, T., 136, 257 Martocchio, J. J., 17 Maruyama, G. M., 193, 194, 199, 200 Marwell, G., 149, 150 Mason, B. J., 207 Mason, J., 221, 234 Mathieu, J. E., 116, 117, 122, 124, 127, 136, 315 Matic, T., 207 Maurer, J. G., 315 Mauro, R., 89, 90, 98 Maxwell, J. A., 219, 220, 223, 224, 225, 226, 234, 236, 237, 238, 239, 240 Maxwell, S. E., 363, 364, 365, 366, 371, 378, 381 Mayes, B. T., 348 McArdle, J. J., 67 McClellan, C. B., 337 McClelland, D. C., 375 McClelland, G. H., 151, 304, 361, 371, 375 McCloy, R. A., 321 McClure, J. R., 227 McCoach, D. B., 168, 170 McColl, E., 343 McCrae, R. R., 315 McDaniel, M. A., 255, 322 McDonald, R. P., 52, 175, 176 McEvoy, G. M., 235 McFillen, J. M., 382 McGrath, J. E., 226, 227, 228, 231, 236, 237, 242, 250 McKee, B., 240 McKee-Ryan, F. M., 185, 278 McKinney, A. C., 340 McNemar, Q., 251 Mead, A. D., 51, 54, 340 Meade, A. W., 89–104 Medsker, G. J., 127 Meehl, P. E., 108, 170, 227, 260, 337 Meek, C., 172 Mellenbergh, G. J., 39 Messick, S., 337 Meyer, J. P., 22 Michels, L. C., 4, 148, 375 Miles, M. B., 219, 220, 221, 222, 224, 227, 234, 241 Milgram, S., 295–296 Miller, D. T., 295, 297, 298 Miller, J. L., 321
Milligan, G. W., 382 Millsap, R. E., 177, 340 Mitchell, S., 235 Mitchell, T. R., 219, 231, 241 Mook, D., 252 Mooney, C. Z., 159 Moorman, R. H., 315, 322 Moosbrugger, H., 158, 350 Morgan, G., 225 Morgeson, F. P., 269, 290 Morris, J. H., 144, 145, 150 Motl, R. W., 343, 350 Mount, M. K., 22, 116, 315, 342, 345, 346 Mowday, R. T., 316 Mueller, C. W., 188, 316 Mulaik, S. A., 4, 68, 69, 89, 117, 122, 131, 174, 177, 199, 201 Muller, D., 157 Mulloy, L. L., 351 Mumford, K. R., 75, 76, 78 Munley, P. H., 220, 228, 229, 242 Muraven, M., 131 Murphy, K. R., 276 Myors, B., 276
N Nason, E., 312 Nasser, F., 84, 85 Necowitz, L. B., 196 Nelson, J. B., 171 Nelson, L., 51 Nelson, M. W., 114 Neubert, M. J., 116 Nevitt, J., 159 Newbolt, W. H., 348 Newcomb, M. D., 207 Newman, D. A., 7–31 Newsom, J. T., 365 Nicewander, W. A., 361 Niehoff, B. P., 315 Nisbett, R. E., 118, 311 Noble, C. L., 338 Noe, R. A., 90 Nouri, H., 289 Novick, M. R., 38, 108 Nunnally, J. C., 39, 73, 80, 311 Nyaw, M., 114
O Oakes, W., 251, 252 Oakman, J. M., 350 O’Connell, A. A., 168
O’Connell, E. J., 340 O’Connor, B. P., 81, 82 O’Connor, D. P., 51 Oczkowski, E., 167 Ohlott, P. J., 235 Olekalns, M., 185 Oleno, T., 346 Oliver, A., 350 Ones, D. S., 103, 312, 321–322, 324, 347 Orbach, Y., 235 Orlando, M., 54 Ortiz, V., 297 Osborne, J. W., 114 Ostroff, C., 35 Ouwerkerk, J. W., 365 Overton, R. C., 45, 46 Owen, S. V., 365, 372
P Panzer, K., 235 Papierno, P. B., 114 Parada, R. H., 342 Parker, C. P., 350 Parry, M. E., 156 Pattison, P., 29 Patton, M. J., 220, 222, 223, 224, 225 Paulhus, D. L., 315 Paunonen, S. V., 51 Paxton, P., 337 Pedhazur, E. J., 144, 150, 158 Peiró, J. M., 207 Penev, S., 171 Peterson, M., 167 Peterson, R. A., 253, 254 Petty, G. C., 72 Phillips, L. W., 258 Pierce, C. A., 31, 269, 270, 277 Ping, R. A., Jr., 158 Pitoniak, M. J., 350 Platt, J. R., 107 Ployhart, R. E., 342, 343 Podsakoff, N. P., 312–313, 324 Podsakoff, P. M., 322, 345 Poirier, J., 208 Popper, K. R., 168, 169, 175 Porter, L. W., 316 Powell, G. N., 185 Pratkanis, A. R., 165 Preacher, K. J., 61, 71, 72, 80, 170, 361, 365, 373, 381 Prentice, D. A., 295, 296, 297, 298 Price, J. L., 316 Priem, R. L., 273
Prussia, G. E., 90, 185, 278 Pulakos, E. D., 260, 312
Q Qing, S. S., 185 Quilty, L. C., 350 Quine, W. V., 287, 288, 292, 293, 294
R Radvansky, G. A., 365 Ratner, C., 219, 227 Rauch, S. M., 376 Raver, J. L., 269, 290 Raykov, T., 170, 171, 179 Reckase, M. D., 49, 52 Reddy, S. K., 196, 204 Reichardt, C. S., 165, 219 Reilly, M. D., 207 Reilly, R. R., 322 Reinhardt, V., 337 Reinhart, A. M., 114 Reiser, M., 375 Reise, S. P., 45 Reiss, A. D., 312 Reiss, A. J., 228 Reno, R. R., 364 Richards, G., 342 Richards, W. D., 28 Rich, B. L., 300, 301 Riggio, R. E., 348 Risko, E., 350 Robbins, S. B., 56 Roberts, J. K., 62, 72, 79, 83, 84 Robie, C., 54 Robins, G., 29 Robinson, W. S., 35 Roe, R. A., 185 Rogelberg, S. G., 15, 17–18, 19, 22, 26, 30–31 Rogers, H. J., 55 Rogers, R., 337 Rojas Tejada, A. J., 37, 47 Rosenberg, M., 359 Rosenthal, R., 251 Rosse, J. G., 321, 322 Ross, L., 118 Rossman, G. B., 236 Roth, P. L., 11, 19, 20, 27 Royston, P., 373 Roznowski, M., 196 Rubin, D. B., 9, 12, 13, 55 Rucker, D. D., 361 Rudas, T., 364
Ruderman, M. N., 235 Runkel, P. J., 250 Russell, C. J., 322 Russell, D. W., 63, 73, 79, 84 Russell, S. S., 185 Rutte, C. G., 185 Ryan, A. M., 322
S Sablynski, C. J., 219 Sackett, P. R., 89, 103, 226, 228, 231, 257, 324, 348 Sagie, A., 255 Salanova, M., 207 Salas, E., 117 Salgado, J. F., 103 Salipante, P., 274 Salvaggio, A. N., 27 Samejima, F., 37, 47 Santos, P. J., 56 Saris, W. E., 196, 201 Saron, C., 375 Sasaki, M. S., 145 Satorra, A., 159, 196 Sauerbrei, W., 373 Sauley, K. S., 273, 274 Scandura, T. A., 226, 227, 231, 241 Scarpello, V., 81 Schachter, S., 240 Schafer, J. L., 10, 12, 13, 14 Scheier, M. F., 134 Scheines, R., 172 Schermelleh-Engel, K., 350 Schleicher, D. J., 348 Schmidt, F. L., 27, 103, 287, 290, 347 Schmit, M. J., 322 Schmitt, N., 251, 260, 312, 318, 340 Schneeweiss, H., 70 Schneider, B., 27 Schneider, R. J., 321 Schoenberg, R., 196 Scholte, R. H. J., 351 Schooler, J. W., 365 Schoonhoven, C. B., 366, 377 Schroeder, R. G., 343 Schulz, E. M., 56 Schumacker, R. E., 158, 159 Schwartz, M., 115, 178 Schwarz, G., 178 Schwarz, N., 311 Scott, B. A., 300, 301 Scullen, S. E., 338, 342, 343, 345 Sechrest, L., 364
Sederburg, M. E., 15 Sedney, M. A., 381 Seibert, S. E., 315 Seib, H. M., 376 Seifert, C. F., 185 Senior, V., 343 Sewell, K. W., 337 Shadish, W., 256, 257 Shapiro, A., 175 Shaw, George Bernard, 361 Shaw, P. J., 343 Shearman, S. M., 343 Sheets, V., 113 Shenkar, O., 114 Shepperd, J. A., 152, 153 Sherden, W. A., 260 Sherif, M., 297 Sherman, J. D., 144 Shook, C. L., 167 Shore, L. M., 316 Short, J. C., 278 Shrout, P. E., 113, 117, 126, 138 Simon, H. A., 96 Simon, R., 381 Sinar, E. F., 51 Singh, B. K., 132 Singh, J., 274 Sin, H. P., 15, 27, 30 Sireci, S. G., 350 Skanes, A., 321 Skarlicki, D. P., 114 Skinner, S. J., 19 Slade, L. A., 251 Sloan, C. E., 350 Slocum, J. W., Jr., 134, 315 Smart, S. A., 114 Smircich, L., 225 Smith, D. B., 324 Smith, D. E., 348 Smither, J. W., 346, 347 Smith, K. W., 145 Smith, M., 260, 263 Snell, A. F., 322 Sobel, M. E., 22, 23, 24, 113 Sörbom, D., 196–197 Sowa, D., 315 Spangenberg, E. R., 365 Spector, P. E., 4, 206, 310, 312, 313, 318, 338 Spirtes, P., 172 Spitzmuller, C., 15, 18, 31 Sprott, D. E., 365 Stallings, V. A., 337 Stanley, J., 228 Stanton, A. L., 365
Stearns, T. M., 241 Stechner, M. D., 321 Steers, R. M., 316 Steiger, J. H., 175, 176, 196, 200, 208, 211, 350 Steiner, D. D., 251 Stelzl, I., 170, 171, 172 Sternberg, K. J., 235 Sternberg, R. J., 313 Stevens, J., 274, 275 Stevens, S. S., 344 Stewart, G. L., 116 Stillman, J. A., 349 Stilwell, C. D., 201 Stine, R. A., 159 Stone, E. F., 375 Stone-Romero, E. F., 227, 252, 263, 280 Stork, D., 28 Strahan, E. J., 61, 62, 63, 65, 67, 79, 84 Strauss, A., 226 Stronkhorst, L. H., 196 Stucke, T. S., 365, 374, 382 Stull, D. E., 66 Sturman, M. C., 278 Subirats, M., 27 Sudman, S., 311 Swaminathan, H., 41 Switzer, D. M., 11 Switzer, F. S., 11 Symon, G., 220, 221, 227 Sytsma, M. R., 345
T Tabachnick, B. G., 201 Tajfel, H., 296 Takeuchi, R., 278 Tayler, W. B., 114 Taylor, L. R., 45, 46 Taylor, S. R., 116, 117, 122, 124, 127 Tazelaar, M. J. A., 365, 376 Teachout, M. S., 346 Tesluk, P. E., 278 Tetrick, L. E., 316 Tett, R. P., 22 Theorell, T., 143 Therney, P., 185 Thissen, D., 54 Thompson, B., 81 Thompson, M. S., 208 Thoresen, J. D., 348 Thorndike, R. L., 39 Thurstone, L. L., 71 Tibshirani, R., 159
Tinsley, D. J., 72 Tinsley, H. E. A., 72 Tomarken, A. J., 168–169, 170, 173, 175, 176, 181, 196, 211 Tomás, J. M., 350 Toppino, T., 115 Tornow, W. W., 346, 347 Trost, M. R., 365 Tsien, S., 54 Tucci, C. L., 269 Tucker, L. R., 168, 200 Turrisi, R., 145 Tutzauer, F., 114 Tversky, A., 247, 258 Tweedie, R., 28 Tybout, A. M., 258
U Uchino, B. N., 165, 194 Ustad, K., 337
V Valente, T. W., 28 Vallone, R., 247 Van Alstine, J., 201 Vandenberg, R. J., 1–4, 52, 127, 136, 141, 165–187, 194, 267, 283, 327 VandeWalle, D., 134, 315 Van Eerde, W., 185 Van Lange, P. A. M., 365 Van Maanen, J., 219, 220, 221, 226, 227, 234, 240, 241–242 Vargha, A., 364, 381 Vecchio, R. P., 378 Veiga, J. F., 185 Velicer, W. F., 66, 69, 75–82 Vevea, J. L., 28 Viswesvaran, C., 103, 312, 321, 324, 347 Von Hippel, C., 365, 375 Von Hippel, W., 365, 375
W Wagner, J. A., III, 312, 318 Waller, N. G., 168–170, 173, 175, 176, 181, 196, 211 Wallnau, L. B., 290 Walters, B. A., 273 Wampold, B. E., 222 Wan, C. K., 145, 149, 158 Wang, M., 185
Wan, W. P., 269 Watson, D., 315 Weaver, A. E., 227 Wegener, D. T., 165, 194 Weiner, S. P., 221, 225, 235, 237, 241, 242 Weinman, J., 343 Welch, J. L., 114 Wen, Z., 158 West, S. G., 113, 143, 144, 147, 148, 149, 154, 156, 159, 270, 364, 381 Wheeler, J. V., 234 Wherry, R. J., 345 Whiner, E. A., 220, 228, 229, 242 White, R. K., 240 Whitney, D. J., 260, 312 Whittaker, T. A., 63, 65, 81 Widaman, K. F., 65, 66, 70, 340 Wiesenfeld, B. M., 143 Williams, B., 54 Williams, E. A., 226, 227, 231, 241 Williams, L. J., 127, 165, 168, 181, 312, 318 Williamson, G. M., 351 Wilson, D. B., 28 Wilson, T. D., 311 Wilson, W., 297 Woehr, D. J., 133, 340, 346 Wolf, G., 27 Wolins, L., 340 Wong, D. T., 324 Woods, C. M., 28 Woolcock, J., 29 Worthington, R. L., 63, 65, 81 Wothke, W., 340 Wu, B., 185 Wu, J., 107–137, 278
Y Yammarino, F. J., 19, 20 Yang, F., 158 Yang, W., 342, 343, 350 Yang-Wallentin, F., 159 Yen, W., 38, 39 Yin, R. K., 225, 228, 234, 238 Yi, Y.-J., 177, 312 Yu, A. P., 114 Yu, J., 19 Yukl, G., 185 Yu, L., 49 Yun, S., 278 Yzerbyt, V. Y., 157
Z Zajac, D. M., 315 Zajonc, R. B., 115 Zautra, A., 350 Zedeck, S., 150, 346
Zeller, R. A., 349 Zemel, B. S., 337 Zhang, S., 361 Zickar, M. J., 37–57 Zimmerman, R. D., 22 Zwick, W. R., 79, 81, 82