Effect Sizes for Research: A Broad Practical Approach
Robert J. Grissom
John J. Kim
San Francisco State University
2005
LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS
Mahwah, New Jersey    London
Copyright © 2005 by Lawrence Erlbaum Associates, Inc. All rights reserved. No part of this book may be reproduced in any form, by photostat, microform, retrieval system, or any other means, without prior written permission of the publisher.
Lawrence Erlbaum Associates, Inc., Publishers
10 Industrial Avenue
Mahwah, New Jersey 07430
Cover design by Sean Trane Sciarrone
Library of Congress Cataloging-in-Publication Data
Grissom, Robert J.
Effect sizes for research : a broad practical approach / Robert J. Grissom, John J. Kim.
p. cm.
Includes bibliographical references and index.
ISBN 0-8058-5014-7 (alk. paper)
1. Analysis of variance. 2. Effect sizes (Statistics) 3. Experimental design. I. Kim, John J. II. Title.
QA279.F75 2005
519.5'38-dc22
2004053284 CIP
Books published by Lawrence Erlbaum Associates are printed on acid-free paper, and their bindings are chosen for strength and durability.
Printed in the United States of America
10 9 8 7 6 5 4 3 2 1
This book is dedicated to those scholars, amply cited herein, who during the past three decades have worked diligently to develop and promote the use of effect sizes and robust statistical methods and to those who have constructively criticized such procedures.
Contents
Preface
1 Introduction
    Review of Simple Cases of Null-Hypothesis Significance Testing
    Statistically Signifying and Practical Significance
    Definition of Effect Size
    Controversy About Null-Hypothesis Significance Testing
    The Purpose of This Book and the Need for a Broad Approach
    Power Analysis
    Meta-Analysis
    Assumptions of Test Statistics and Effect Sizes
    Violation of Assumptions in Real Data
    Exploring the Data for a Possible Effect of a Treatment on Variability
    Worked Examples of Measures of Variability
    Questions
2 Confidence Intervals for Comparing the Averages of Two Groups
    Introduction
    Confidence Intervals for $\mu_a - \mu_b$: Independent Groups
    Worked Example for Independent Groups
    Further Discussions and Methods
    Solutions to Violations of Assumptions: Welch's Approximate Method
    Worked Example of the Welch Method
    Yuen's Confidence Interval for the Difference Between Two Trimmed Means
    Other Methods for Independent Groups
    Dependent Groups
    Questions
3 The Standardized Difference Between Means
    Unfamiliar and Incomparable Scales
    Standardized Difference Between Means: Assuming Normality and a Control Group
    Equal or Unequal Variances
    Tentative Recommendations
    Additional Standardized-Difference Effect Sizes When There Are Outliers
    Technical Note 3.1: A Nonparametric Estimator of Standardized-Difference Effect Sizes
    Confidence Intervals for a Standardized-Difference Effect Size
    Confidence Intervals Using Noncentral Distributions
    The Counternull Effect Size
    Dependent Groups
    Questions
4 Correlational Effect Sizes for Comparing Two Groups
    The Point-Biserial Correlation
    Example of $r_{pb}$
    Confidence Intervals and Null-Counternull Intervals for $r_{pop}$
    Assumptions of $r$ and $r_{pb}$
    Unequal Sample Sizes
    Unreliability
    Restricted Range
    Small, Medium, and Large Effect Size Values
    Binomial Effect Size Display
    Limitations of the BESD
    The Coefficient of Determination
    Questions
5 Effect Size Measures That Go Beyond Comparing Two Centers
    The Probability of Superiority: Independent Groups
    Example of the PS
    A Related Measure of Effect Size
    Assumptions
    The Common Language Effect Size Statistic
    Technical Note 5.1: The PS and its Estimators
    Introduction to Overlap
    The Dominance Measure
    Cohen's U3
    Relationships Among Measures of Effect Size
    Application to Cultural Effect Size
    Technical Note 5.2: Estimating Effect Sizes Throughout a Distribution
    Hedges-Friedman Method
    Shift-Function Method
    Other Graphical Estimators of Effect Sizes
    Dependent Groups
    Questions
6 Effect Sizes for One-Way ANOVA Designs
    Introduction
    ANOVA Results for This Chapter
    A Standardized-Difference Measure of Overall Effect Size
    A Standardized Overall Effect Size Using All Means
    Strength of Association
    Eta Squared ($\eta^2$)
    Epsilon Squared ($\varepsilon^2$) and Omega Squared ($\omega^2$)
    Strength of Association for Specific Comparisons
    Evaluation of Criticisms of Estimators of Strength of Association
    Standardized-Difference Effect Sizes for Two of k Means at a Time
    Worked Examples
    Statistical Significance, Confidence Intervals, and Robustness
    Within-Groups Designs and Further Reading
    Questions
7 Effect Sizes for Factorial Designs
    Introduction
    Strength of Association: Proportion of Variance Explained
    Partial $\hat{\omega}^2$
    Comparing Values of $\hat{\omega}^2$
    Ratios of Estimates of Effect Size
    Designs and Results for This Chapter
    Manipulated Factors Only
    Manipulated Targeted Factor and Intrinsic Peripheral Factor
    Illustrative Worked Examples
    Comparisons of Levels of a Manipulated Factor at One Level of a Peripheral Factor
    Targeted Classificatory Factor and Extrinsic Peripheral Factor
    Classificatory Factors Only
    Statistical Inference and Further Reading
    Within-Groups Factorial Designs
    Additional Designs and Measures
    Limitations and Recommendations
    Questions
8 Effect Sizes for Categorical Variables
    Background Review
    Chi-Square Test and Phi
    Null-Counternull Interval for Phi$_{pop}$
    The Difference Between Two Proportions
    Approximate Confidence Interval for $P_1 - P_2$
    Relative Risk and the Number Needed to Treat
    The Odds Ratio
    Construction of Confidence Intervals for $OR_{pop}$
    Tables Larger Than 2 × 2
    Odds Ratios for Large r × c Tables
    Multiway Tables
    Recommendations
    Questions
9 Effect Sizes for Ordinal Categorical Variables
    Introduction
    The Point-Biserial r Applied to Ordinal Categorical Data
    Confidence Interval and Null-Counternull Interval for $r_{pop}$
    Limitations of $r_{pb}$ for Ordinal Categorical Data
    The Probability of Superiority Applied to Ordinal Data
    Worked Example of Estimating the PS From Ordinal Data
    The Dominance Measure and Somers' D
    Worked Example of the ds
    Generalized Odds Ratio
    Cumulative Odds Ratio
    The Phi Coefficient
    A Caution
    References for Further Discussion of Ordinal Categorical Methods
    Questions
References
Author Index
Subject Index
Preface
Emphasis on effect sizes is a rapidly rising tide as over 20 journals in various fields of research now require that authors of research reports provide estimates of effect size. For certain kinds of applied research it is no longer considered acceptable only to report that results were statistically significant. Statistically significant results indicate that a researcher has discovered evidence of a real difference between parameters or a real association between variables, but it is one of unknown size. Especially in applied research such statements often need to be supplemented with estimates of how different the average results for studied groups are or how strong the association between variables is. Those who apply research results often need to know more, for example, than that one therapy, one teaching method, one marketing campaign, or one medication appears to be better than another; they often need evidence of how much better it is (i.e., the estimated effect size). Chapter 1 provides a more detailed definition of effect size, discussion of those circumstances in which estimation of effect sizes is especially important, and discussion of why a variety of measures of effect sizes is needed.
The purpose of this book is to inform a broad readership (broad with respect to fields of research and extent of knowledge of general statistics) about a variety of measures and estimators of effect sizes for research, their proper applications and interpretations, and their limitations. There are several excellent books on the topic of effect sizes, but these books generally treat the topic in a different context and for a purpose that is different from that of this book. Some books discuss effect sizes in the context of preresearch analysis of statistical power for determining needed sample sizes for the planned research. This is not the purpose of this book, which focuses on analyzing postresearch results in terms of the size of the obtained effects. Some books discuss effect sizes in the context of meta-analysis, the quantitative synthesizing of results from an earlier set of underlying individual research studies. This also is not the purpose of this book, which focuses on the analysis of data from an individual piece of research (called primary research). Books on meta-analysis are also concerned with methods for approximating estimates of effect size indirectly from reported test statistics because raw data from the underlying primary research are rarely available to meta-analysts. However, this book is concerned with direct estimation of effect sizes by primary researchers, who can estimate effect sizes directly because they, unlike meta-analysts, have access to the raw data.
The book is subtitled A Broad Practical Approach in part because it deals with a broad variety of kinds of effect sizes for different types of variables, designs, circumstances, and purposes. Our approach encompasses detailed discussions of standardized differences between means (chaps. 3, 6, and 7), some of the correlational measures (chap. 4), strength of association (chaps. 6 and 7), confidence intervals (chap. 2 and thereafter), other common methods, and less-known measures such as stochastic superiority (chaps. 5 and 9). The book is broad also because, in the interest of fairness and completeness, we respectfully cite alternative viewpoints for cases in which experts disagree about the appropriate measure of effect size. Consistent with the modern trend toward more use of robust statistical methods, the book also pays much attention to the statistical assumptions of methods. Also consistent with the broad approach, there are more than 300 references.
Software for those calculations that would be laborious by hand is cited.
The level and content of this book make it appropriate for use as a supplement for graduate courses in statistics in such fields as psychology, education, the social sciences, business, management, and medicine. The book is also appropriate for use as the text for a special-topics seminar or independent-reading course in those fields. In addition, because of its broad content and extensive references the book is intended to be a valuable source for professional researchers, graduate students who are analyzing data for a master's or doctoral thesis, or advanced undergraduates. Readers are expected to have knowledge of parametric statistics through factorial analysis of variance and some knowledge of chi-square analysis of contingency tables. Some knowledge of nonparametric analysis in the case of two independent groups (i.e., the Mann-Whitney U test or Wilcoxon $W_m$ test) would be helpful, but not essential. Although the book is not introductory with regard to statistics in general, we assume that many readers have little or no prior knowledge of measures of effect size and their estimation.
We typically use standard notation. However, where we believe that it helps understanding, we adopt notation that is more memorable and consistent with the concept that underlies the notation. Also, to assist readers who have only the minimum background in statistics, we define some basic statistical terms with which other readers will likely already be familiar. We request the forbearance of these more knowledgeable readers in this regard. Although readability was a major goal, so too was avoiding oversimplifying.
To restrain the length of the book we do not discuss multivariate cases (some references are provided); we do not present equations or discussions for all measures of effect size that are known to us. We present equations, worked examples, and discussions for estimators of many measures and provide references for others that are sufficiently discussed elsewhere. Our discussions of the presented measures are also intended to provide a basis for understanding and for the appropriate use of those other measures that are presented in the sources that we cite.
Criteria for deciding whether to include a particular measure of effect size included its conceptual and computational accessibility, both of which relate to the likelihood that the measure will find its way into common practice, which was another important criterion. However, we admit to some personal preferences and perhaps even fascination with some measures. Therefore, at times we violate our own criteria for inclusion. A few exotic measures are included.
Readers should be able to find in this book many kinds of effect sizes that they can knowledgeably apply to many of their data sets. We attempt to enhance the practicality of the book by the use of worked examples involving mostly real data, for which the book provides calculations of estimates of effect sizes that had not previously been made by the original researchers.
ACKNOWLEDGMENTS
We are grateful for many insightful recommendations made by the reviewers: Scott Maxwell, the University of Notre Dame; Allen Huffcutt, Bradley University; Shlomo S. Sawilowsky, Wayne State University; and Timothy Urdan, Santa Clara University. Failure to implement any of their recommendations correctly is our fault. We thank Ted Steiner for clarifying a solution for a problem with the relative risk as an effect size. We also thank Julie A. Gorecki for providing data, and for her assistance with wordprocessing and graphics. The authors gratefully acknowledge the generous, prompt, and very professional assistance of our editor, Debra Riegert, and our production editor, Sarah Wahlert.
Chapter 1
Introduction
REVIEW OF SIMPLE CASES OF NULL-HYPOTHESIS SIGNIFICANCE TESTING
Much applied research begins with a research hypothesis that states that there is a relationship between two variables or a difference between two parameters, often means. (In later chapters we consider research involving more than two variables.) One typical form of the research hypothesis is that there is a nonzero correlation between the two variables in the population. Often one variable is a categorical independent variable involving group membership (called a grouping variable), such as male-female or Treatment a versus Treatment b, and the other variable is a continuous dependent variable, such as score on an attitude scale or on a test of mental health or achievement. In this case of a grouping variable there are two customary forms of research hypotheses. The hypothesis might again be correlational, positing a nonzero point-biserial correlation between group membership and the dependent variable, as is discussed in chapter 4. More often in this case of a grouping variable the research hypothesis posits that there is a difference between means in the two populations.
Readers who are familiar with the general linear model should recognize the relationship between hypotheses that involve either correlation or the difference between means. However, the two kinds of hypotheses are not identical, and some researchers may prefer one or the other form of hypothesis. Although a researcher may prefer one approach, some readers of a research report may prefer the other. Therefore, researchers should consider reporting results from both approaches.
The usual statistical analysis of the results from the kinds of research at hand involves testing a null hypothesis ($H_0$) which conflicts with the research hypothesis either by positing that the correlation between the two variables is zero in the population or by positing that there is no difference between the means of the two populations. The t statistic is usually used to test the $H_0$ against the research hypothesis. The significance level (p) that is attained by a test statistic such as t represents the probability that a result at least as extreme as the obtained result would occur if the $H_0$ were true. It is very important for applied researchers to recognize that this attained p value primarily indicates the strength of the evidence that the $H_0$ is wrong, but the p value does not by itself indicate sufficiently how wrong the $H_0$ is.
Observe in Equation 1.1 that, for t, the part of the formula that is usually of greatest interest in applied research is the overall numerator, the difference between means (a value that is a major component of a common estimator of effect size). However, Equation 1.1 reveals that whether t is large enough to attain statistical significance is not merely a function of how large this numerator is, but it depends on how large this numerator is relative to the overall denominator. Equation 1.1 and the nature of division reveal that for any given difference between means an increase in sample sizes will increase the absolute value of t and, thus, decrease the magnitude of p. Therefore, a statistically significant t may indicate a large difference between means or perhaps a less important small difference that has been elevated to the status of statistical significance because the researcher had the resources to use relatively large samples. Larger sample sizes also make it more likely that t will attain statistical significance by increasing degrees of freedom for the t test.
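The dependence of t on sample size can be illustrated with a minimal sketch. It assumes the separate-variances form of the independent-groups t statistic (difference between sample means divided by the square root of the sum of each variance over its sample size). The summary statistics are hypothetical, the function names are ours, and the p values use a large-sample normal approximation that is rough for the smallest samples, so they are only indicative.

```python
import math
from statistics import NormalDist


def independent_groups_t(mean_a, mean_b, var_a, var_b, n_a, n_b):
    """Assumed form: t = (mean_a - mean_b) / sqrt(var_a / n_a + var_b / n_b)."""
    standard_error = math.sqrt(var_a / n_a + var_b / n_b)
    return (mean_a - mean_b) / standard_error


def approx_two_tailed_p(t):
    """Normal approximation to the two-tailed p value (an assumption made
    to keep the sketch dependency-free; it is not the exact t distribution)."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Hypothetical data: the mean difference (2 units) and both variances (100)
# are held fixed; only the per-group sample size changes.
for n in (10, 50, 250, 1250):
    t = independent_groups_t(52.0, 50.0, 100.0, 100.0, n, n)
    p = approx_two_tailed_p(t)
    print(f"n per group = {n:5d}   t = {t:6.3f}   approx. p = {p:.4f}")
```

With these made-up numbers the same 2-unit difference yields a t of roughly 0.45 at 10 per group and 5.0 at 1250 per group, so the attained p value shrinks although the difference between means has not changed.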
Large sample sizes are to be encouraged because they are more likely to be representative of populations, are more likely to produce replicable results, increase statistical power, and also perhaps increase robustness to violation of statistical assumptions.
The lesson here is that the result of a t test, or a result using another test statistic, that indicates by, say, p < .05 that one treatment is statistically significantly better than another, or that the treatment variable is statistically significantly related to the outcome variable, does not sufficiently indicate how much better the superior treatment is or how strongly the variables are related. The degree of superiority and strength of relationship are matters of effect size. (Attaining statistical significance depends on effect size, sample sizes, variances, choice of one-tailed or two-tailed testing, the adopted significance level, and the degree to which assumptions are satisfied.)
In applied research, particularly when one of the treatments is actually a control or placebo condition, it is very important to estimate how much better a statistically significantly better treatment is. It is not enough to know merely that there is evidence (e.g., p < .05), or even stronger evidence (say, p < .01), that there is some unknown degree of difference in mean performance of the two groups. If the difference between two population means is not 0, it can be anywhere from nearly 0 to far from 0. If two treatments are not equally effective, the better of the two can be anywhere from slightly better to very much better than the other.
For an example involving the t test, suppose that a researcher were to compare the mean weights of two groups of overweight diabetic people who have undergone random assignment to either weight reduction Program a or Program b. Typically, the difference in mean postprogram
weights would be tested using the t test of a $H_0$ that posits that there is no difference in mean weights, $\mu_a$ and $\mu_b$, of populations who undertake Program a or b ($H_0$: $\mu_a - \mu_b = 0$). The independent-groups t statistic in this case is
$$t = \frac{\bar{Y}_a - \bar{Y}_b}{\left[ \dfrac{s_a^2}{n_a} + \dfrac{s_b^2}{n_b} \right]^{1/2}} \qquad (1.1)$$
where Y values, s2 values, and ns are where iT and ns are sample means, variances, and and sizes, respectively. Again, if the great enough enough (positive or or negative) negative) the value of t is great to the extreme range range of values values that that are improbable improbable to to occur if H Ho to place t in the 0 were true, the H0 and the researcher researcher will will reject reject Ho and conclude that it it is plausible that populations. that there there is a difference difference between between the the mean mean weights weights in the the populations. Consider a possible limitation limitation of the the aforementioned interpretation interpretationof of the statistically significant significant result. What the the researcher researcher has apparently discovered discovered is that that there is evidence that that the the difference difference between between mean weights weights in the the populations populations is not not zero. Such information information may may be of use, especially if the the overall costs costs of the the two two treatments treatments are the the same, but but it it would would often often be more more informative informative to to have have an an estimate estimate of what what the amount of difference difference is than merely learning learning that that there there is evidence of of what (i.e., what it it is not not (i. e., not not 0). (In this this example we would would recommend concon structing structing a confidence interval for the the mean mean difference difference in population weights, but but we defer defer the the topic of confidence confidence intervals intervals to to chap. 2.) STATISTICALLY SIGNIFYING AND PRACTICAL SIGNIFICANCE SIGNIACANCE
"statistically significant" can be misleading synThe phrase phrase "statistically misleading because syn onyms of significant significant in the English language, the English language, but but not not in the the language languageof of statistics, important and statistics, are are important and large, and and we have havejust just observed with with the the t
test, and could illustrate with other statistics such as F and χ², that a statistically significant result may not be a large or important result. "Statistically significant" is best thought of as meaning "statistically signifying." A statistically significant result is signifying that the result is sufficient, according to the researcher's adopted standard of required evidence against H0 (e.g., p < .05), to justify rejecting H0. Examples of statistically significant results that would not be practically significant would include statistically significant loss of weight or blood pressure that is too small to be medically significant and statistically significant lowering of scores on a test of depression that is insufficient to be reflected in the clients' behaviors or self-reports of well-being (clinical insignificance). Another example would be a statistically significant difference between schoolgirls and schoolboys that is not large enough to justify a change in educational practice (educational insignificance). In chapter 5 we report actual statistically significant differences between cultural groups that may be too small to support stereotypes or to incorporate into training for diplomats, results that may be called culturally insignificant. Note that the quality of a subjective judgment about the practical significance of a result is enhanced by expertise in the area of research. Although effect size, the definition of which we discuss in the next section, is not synonymous with practical significance, knowledge of a result's effect size can inform a judgment about practical significance.

DEFINITION OF EFFECT SIZE

We assume the case of the typical null hypothesis that implies that there is no effect or no relationship between variables, for example, a null hypothesis that states that there is no difference between means of populations or that the correlation between variables in the population is zero. Whereas a test of statistical significance provides the quantified strength of evidence (attained p level) that a null hypothesis is wrong, an effect size (ES) measures the degree to which such a null hypothesis is wrong. Because of its pervasive use and usefulness, we use the name effect size for all such measures that are discussed in this book. Many effect size measures involve some form of correlation (chap. 4) or its square (chaps. 4, 6, and 7), some form of standardized difference between means (chaps. 3, 6, and 7), or the degree of overlap of distributions (chap. 5), but many measures that are discussed do not fit into these categories. Again, we use the label effect size for measures of the degree to which results differ from what is implied for them by a typical null hypothesis.

Often the relationship between the numerical value of a test statistic (TS) and an estimator of ES is ES_EST = TS / f(N), where f(N) is some function of total sample size, such as degrees of freedom. Specific forms of this equation are available for many test statistics, including t, F, and χ², so that reported test statistics can be approximately converted to indirect estimates of effect size by a reader of a research report without access to the raw data that would be required to estimate an effect size directly. However, researchers who work with their own raw data (primary researchers), unlike researchers who work with sets of previously reported test statistics (meta-analysts), can estimate effect sizes directly, as can readers of this book, so they do not need to use an approximate conversion formula.

CONTROVERSY ABOUT NULL-HYPOTHESIS SIGNIFICANCE TESTING
Statisticians have long urged researchers to report effect sizes, but researchers were slow to respond. Fisher (1925) was an early, but not the first, advocate of such estimation. It can even be argued that readers of a report of applied research that involves control or placebo groups, or that involves treatments whose costs are different, have a right to see estimates of effect sizes. Some might even argue that not reporting such estimates in an understandable manner to those who might apply the results of research in such cases (e.g., educators, health officials, managers of trainee programs, clinicians, and governmental officials) may be like withholding evidence. Increasingly editors of journals that publish research are recommending, or requiring, the reporting of estimates of effect sizes. For example, the American Psychological Association recommends, and the Journal of Educational and Psychological Measurement and at least 22 other journals as of the time of this writing require, the reporting of such estimates. We observe later in this book that estimates of effect size are also being used in court cases involving, for example, alleged discriminatory hiring practices (chap. 4) and alleged harm from pharmaceuticals or other chemicals (chap. 8).

There is a range of professional opinions regarding when estimates of effect sizes should be reported.
On the one hand is the view that null-hypothesis significance testing is meaningless because no null hypothesis can be literally true. For example, according to this view no two or more population means can be exactly equal when carried to many decimal places. Therefore, from this point of view that no effect size can be exactly zero, the task of a researcher is to estimate the size of this "obviously" nonzero effect. The opposite opinion is that significance testing is paramount and that effect sizes are to be reported only when results are found to be statistically significant. For discussions relating to this debate, consult Fan (2001), Hedges and Olkin (1985), Hunter and Schmidt (2004), Knapp (2003), Knapp and Sawilowsky (2001), Levin and Robinson (2003), Onwuegbuzie and Levin (2003), Roberts and Henson (2003), Robinson and Levin (1997), Rosenthal, Rosnow, and Rubin (2000), Sawilowsky (2003), and Sawilowsky and Yoon (2002).

As we discuss in chapters 3 and 6, many estimators of effect size tend to overestimate effect sizes in the population (called positive or upward bias).
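Upward bias of this kind can be illustrated with a small simulation. The sketch below is not taken from this book's later chapters; it assumes, for illustration only, that the estimator is the difference between two sample means divided by the pooled standard deviation, that both populations are normal with equal variances, and that each simulated study has only five participants per group. The function names, the seed, and the chosen values are all hypothetical.

```python
import random
import statistics


def standardized_difference(a, b):
    """Difference between sample means divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / pooled_var ** 0.5


def mean_estimate(true_effect, n_per_group, n_studies, seed=1):
    """Average the estimate over many simulated small studies (illustrative helper)."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_studies):
        # Both populations have SD = 1, so the population effect size is true_effect.
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        estimates.append(standardized_difference(treated, control))
    return statistics.mean(estimates)


average = mean_estimate(true_effect=0.5, n_per_group=5, n_studies=20000)
print(f"population value 0.50, average estimate {average:.3f}")
```

Under these assumptions the average of the estimates should come out somewhat above the population value of 0.50, and the gap should shrink as n_per_group is increased.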
The major question in the debate is whether or not this upward bias of estimators of effect size is large enough so that the reporting of a bias-based estimate of effect size will seriously inflate the overall estimate of effect size in a field of study when the null hypothesis is true (i.e., actually zero effect in the population) and results are statistically insignificant. Those who are not concerned about such bias urge the reporting of all effect sizes, significant or not significant, to improve the accuracy of meta-analyses. Their reasoning is that such reporting will avoid the problem of meta-analyses inflating overall estimates of effect size that would result from not including the smaller effect sizes that arise from primary studies whose results did not attain statistical significance.

Some are of the opinion that effect sizes are more important in applied research, in which one might be interested in whether or not the effect size is estimated to be large enough to be of practical use. In contrast, in theoretical research one might only be interested in whether results support a theory's prediction, say, for example, that Mean a will be greater than Mean b. For references and further discussion of this controversy, consult Harlow, Mulaik, and Steiger (1997), Markus (2001), Nickerson (2000), N. Schmidt (1996), and the many responses to Krueger (2001) in the Comments section of the January 2002 issue of the American Psychologist (vol. 57, pp. 65-71). Consult Jones and Tukey (2000) for a reformulation of null-hypothesis testing that attempts to accommodate both sides in the dispute. For a review of the history of measures of effect size refer to Huberty (2002).

THE PURPOSE OF THIS BOOK AND THE NEED FOR A BROAD APPROACH

It is not necessary for this book to discuss the controversy about null-hypothesis significance testing further because the purpose of this book is merely to inform readers about a variety of measures of effect size and their proper applications and limitations. One reason that a variety of effect size measures is needed is that different kinds of measures are appropriate depending on whether variables are scaled categorically, ordinally, or continuously (and also sometimes depending on certain characteristics of the sampling method and the research design and purpose that are discussed where pertinent in later chapters).
The results from a given study often lend themselves to more than one measure of effect size. These different measures can sometimes provide very different, even conflicting, perspectives on the results (Gigerenzer & Edwards, 2003). Consumers of the results of research, including editors of journals, those in the news media who convey results to the public, and patients who are giving supposedly informed consent to treatment, often need to be made aware of the results in terms of alternative measures of effect sizes to guard against the possibility that biased or unwitting researchers have used a measure that makes a treatment appear to be more effective than another measure would. Some of the topics in chapter 8 exemplify this issue particularly well. Also, alternative measures should be considered when the statistical assumptions of traditional measures are not satisfied.

Data sets can have their own personalities, that is, their individual complex characteristics. For example, traditionally researchers have focused on the effects of independent variables on just one characteristic of distributions of data, their centers, such as their means or medians, representing the effect on the typical (average) participant. However, a treatment can also have an effect on aspects of a distribution other than its center, such as its tails. Treatment can have an effect on the center of a distribution and/or the variability around that center. For example, consider a treatment that increases the scores of some experimental group participants and decreases the scores of others in that group, a Treatment × Subject interaction. The result is that the variability of the experimental group's distribution will be larger or smaller (maybe greatly so) than the variability of the control or comparison group's distribution. Whether there is an increase or decrease in variability of the experimental group's distribution depends on whether it is the higher or lower performing participants who are improved or worsened by the treatment. In such cases the centers of the two distributions may be nearly the same, whereas the treatment in fact has had an effect on the tails of a distribution. It is quite likely that a treatment will have an effect on both the center and the variability of a distribution because it is common to find that distributions that have higher means than other distributions also have the greater variabilities.

As demonstrated in later examples in this book, by applying a variety of appropriate estimates of measures of effect size to the same set of data researchers and readers of their reports can gain a broader perspective on the effects of an independent variable.
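The Treatment × Subject interaction described above can be sketched with simulated scores. The example is hypothetical and is not drawn from any study cited in this book: it assumes normally distributed baseline scores (mean 50, standard deviation 10) and two invented treatments, one that helps the higher performing participants and harms the lower performing ones, and one that does the reverse. The variable names and the 8-point shift are arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(7)

# Hypothetical scores that the experimental group would have had without treatment.
baseline = [rng.gauss(50.0, 10.0) for _ in range(2000)]
center = statistics.median(baseline)

# Invented treatment A: raises higher performers, lowers lower performers.
spread_out = [x + 8.0 if x > center else x - 8.0 for x in baseline]

# Invented treatment B: lowers higher performers, raises lower performers.
pulled_in = [x - 8.0 if x > center else x + 8.0 for x in baseline]

for label, scores in [("untreated", baseline),
                      ("treatment A", spread_out),
                      ("treatment B", pulled_in)]:
    print(f"{label:12s} mean = {statistics.mean(scores):6.2f}  "
          f"SD = {statistics.stdev(scores):5.2f}")
```

Because half of the participants gain and half lose the same amount, the three means are essentially identical, so a measure of effect size that is based only on the centers would suggest no effect. The standard deviations, however, differ markedly, larger under the first invented treatment and smaller under the second.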
In some later examples we observe that examination of estimates of different kinds of measures of effect size can greatly alter one's interpretation of results and their importance. [Also refer to Levin and Robinson (1999) in this regard.]

Note that as of the time of this writing the editors of journals that recommend or require the reporting of an estimate of effect size do not specify the use of any particular kind of effect size. Note also that any appropriate estimate of effect size that a researcher has calculated must be reported to guard against a biased interpretation of the results. However, we acknowledge, as shown from time to time in this book, that there can be disagreement among experts about the appropriate measure of effect size for certain kinds of data (Levin & Robinson, 1999; also consult Hogarty & Kromrey, 2001).

There are several excellent books that treat the topic of effect size. Although our book frequently cites this body of work, these books generally treat the topic in a different context and for a purpose that is different from the purpose of this book, as we briefly discuss in the next two sections of this chapter. Note also that our book does not discuss effect sizes for single-case designs. For discussions of competing approaches for such designs, consult Campbell (2004) and the references therein.
POWER ANALYSIS
Some books consider effect sizes in the context of preresearch power analysis for determining needed sample sizes for the planned research (Cohen, 1988; Kraemer & Thiemann, 1987; Lipsey, 1990; K. R. Murphy & Myors, 2003). The power of a statistical test is defined as the probability that use of the test will lead to rejection of a false H0. Because statistical power increases as effect size increases, estimating the likely effect size, or deciding the minimum effect size that the researcher is interested in having the proposed research detect, is important for researchers who are planning research. Taking into account power-determining factors such as the projected effect size, the researcher's adopted alpha level, likely variances, and available sample sizes, books on power analysis are very useful for the planning of research. The report by Wilkinson and the American Psychological Association's Task Force on Statistical Inference (1999; referred to hereafter as APA's Task Force) urges researchers to report effect sizes to facilitate future power analyses in a researcher's field of interest.

META-ANALYSIS

Several books cover estimation of effect sizes in the context of meta-analysis. Meta-analytic methods are procedures for quantitatively summarizing the numerical results from a set of research studies in a specific area of research. Meta in this context means "beyond" or "more comprehensive." Synonyms for meta-analyzing such sets of results include integrating, combining, synthesizing, cumulating, or quantitatively reviewing them. Each individual study in the set of meta-analyzed studies is called primary research. Among other procedures, a common form of meta-analysis includes averaging the (weighted) estimates of effect size from each of the underlying primary studies. The Wilkinson and the APA's Task Force (1999) report also urges primary researchers to report effect sizes for the purpose of comparing them with earlier reported effect sizes and to facilitate any later meta-analysis of such results. Meta-analyses that use previously reported effect sizes that had been directly calculated by primary researchers on their raw data will be more accurate than those that are based on effect sizes that had to be retrospectively estimated by meta-analysts using approximately accurate formulas to convert the primary studies' reported test statistics to estimates of effect size. Meta-analysts typically do not have access to raw data.

For an early example of a meta-analysis and the customary rationale for such a meta-analysis, consider a set of separate primary studies in which the dependent variable is some measure of mental health and the independent variable is membership in either a treated group or a control group (Smith & Glass, 1977). Such individual studies of these variables yield varying estimates of the same kind of effect size measure. Most of these studies yield a moderate value for estimated effect size (i.e., therapy usually seems to help, at least moderately), some yield a high or low positive value (i.e., therapy seems to help very much or very little), and a very small number of studies yield a negative value for the effect size (indicating possible harm from the therapy). No one piece of primary research is necessarily definitive in its findings.
The varying results are not surprising because of sampling variability and possibly relevant factors that vary among the individual studies, factors such as the nature of the therapy, diagnostic and demographic characteristics of the participants across the studies, kind of test of mental health, and characteristics of the therapists. A common kind of meta-analysis attempts to extract from the individual studies information about variables, called moderator variables, that account for the varying estimates of effect size. For example, if the effect of a treatment were different in a population of women and a population of men, gender would be said to be a moderator variable.

Effect sizes are often not reported in older articles or in articles that are published in journals that do not require such reporting. Therefore, as we previously mentioned, books on meta-analytic methods include formulas for approximately converting the results of statistical tests that primary researchers typically do report, such as the value of t or F, into individual estimates of effect size that a meta-analyst can then average.

Because the focus of this book is on direct estimation of effect sizes from the raw data of a primary research study, we include only occasional discussion of meta-analysis when it is pertinent to primary research. Also, it is beyond the scope of this book to discuss limitations of meta-analysis or alternative rationales for it. Moreover, this book has no need to present formulas for approximately converting statistical results, such as t or F, into indirect estimates of effect size. Again, primary researchers can directly calculate estimates of effect size from their raw data using the formulas in this book.
Because the archiving of raw data is still rare in most areas of research, meta-analyses, and applied science in general, would benefit if primary researchers, where appropriate, would routinely report estimates of effect size such as those covered in this book.

There are several approaches to meta-analytic methods. Books that cover these methods include those by Cooper (1989), Cooper and Hedges (1994), Glass, McGaw, and Smith (1981), Hedges and Olkin (1985), Hunter and Schmidt (2004), Lipsey and Wilson (2001), and Rosenthal (1991b). The approach of Hunter and Schmidt (2004) is distinguished by its purpose of attempting to estimate effect sizes from which the influences of artifacts (errors) have been removed. Such artifacts include sampling error, restricted range, and unreliability or imperfect construct validity of the independent and dependent variables. Some of these topics are discussed in chapter 4. Cohn and Becker (2003) discussed the manner in which meta-analysis increases the statistical power of a test of a null hypothesis regarding an effect size and shortens a confidence interval for it by reducing the standard error of a weighted average effect size. Confidence intervals are discussed in chapter 2 and thereafter throughout this book where they are applicable.
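The idea of averaging weighted estimates of effect size, and of the reduced standard error of such an average, can be sketched as follows. The weighting scheme shown, in which each study's estimate is weighted by the reciprocal of its sampling variance, is one widely used choice and is an assumption of this sketch; the text above does not say which weights a given meta-analytic approach uses. The five estimates and their variances are invented for illustration and do not come from Smith and Glass (1977) or from any other cited study.

```python
def weighted_average_effect(estimates, variances):
    """Inverse-variance weighted average of effect size estimates and its standard error.

    Assumes independent studies that all estimate the same population effect size.
    """
    weights = [1.0 / v for v in variances]
    total_weight = sum(weights)
    average = sum(w * e for w, e in zip(weights, estimates)) / total_weight
    standard_error = (1.0 / total_weight) ** 0.5
    return average, standard_error


# Hypothetical estimates from five primary studies and their sampling variances.
estimates = [0.62, 0.35, 0.48, -0.05, 0.51]
variances = [0.09, 0.04, 0.06, 0.12, 0.05]

average, standard_error = weighted_average_effect(estimates, variances)
smallest_single_se = min(v ** 0.5 for v in variances)

print(f"weighted average effect size = {average:.3f}")
print(f"standard error of the average = {standard_error:.3f}")
print(f"smallest standard error of any single study = {smallest_single_se:.3f}")
```

Under these assumptions the standard error of the weighted average is smaller than the standard error of even the most precise single study, which is the mechanism by which combining studies can increase power and shorten a confidence interval.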
ASSUMPTIONS OF TEST STATISTICS AND EFFECT SIZES
When statisticians create a new test statistic or measure of effect size, they often do so for populations that have certain characteristics. For the t test, F test, and some common examples of effect sizes, two of these assumed characteristics, called assumptions, are that the populations from which the samples are drawn are normally distributed and have equal variances. The latter assumption is called homogeneity of variance or homoscedasticity (from Greek words for same and scatter). When data actually come from populations with unequal variances this violation of the assumption is called heterogeneity of variance or heteroscedasticity. (Because normality and homoscedasticity are the assumptions that are more likely to be violated, we do not yet discuss the usually critically important assumption that scores are independent.) Throughout this book we will observe how violation of assumptions can affect estimation and interpretation of effect sizes, and we will discuss some alternative methods that accommodate such violations.

Often a researcher asserts that an effect size that involves the degree of difference between two means is significantly different from zero because significance was attained when comparing the two means by a t test (or an F test with 1 degree of freedom in the numerator). However, nonnormality and heteroscedasticity can result in the shape of the actual sampling distribution of the test statistic departing sufficiently from the theoretical sampling distribution of t or F so that, unbeknownst to the researcher, the actual p value for the result is not the same as the observed p value in a table or printout. For example, an observed p < .05 may actually represent a true p > .05, an inflation of Type I error. Also, violation of assumptions can result in lowered statistical power.

For references and further discussions of the consequences of and solutions to violation of assumptions on t testing and F testing, consult Grissom (2000), Hunter and Schmidt (2004), Keselman, Cribbie, and Wilcox (2002), Sawilowsky (2002), Wilcox (1996, 1997), and Wilcox and Keselman (2003a). Huberty's (2002) article on the history of effect sizes noted that heteroscedasticity is common but has been given insufficient attention in discussions of effect sizes. We attempt to redress this shortcoming. The fact that nonnormality and heteroscedasticity can affect estimation and interpretation of effect sizes is of concern in this book because real data often exhibit such characteristics, as is documented in the next section.
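The inflation of Type I error under heteroscedasticity can be illustrated by simulation. The sketch below is hypothetical and rests on stated assumptions: both populations are normal with equal means (so the null hypothesis is true), the samples have unequal sizes (10 and 20), the test is the conventional t test that pools the two sample variances, and the tabled two-tailed .05 critical value of t for 28 degrees of freedom is taken to be 2.048. The function names, the seed, and the standard deviations are arbitrary choices for illustration.

```python
import random
import statistics

T_CRITICAL_DF28 = 2.048  # two-tailed .05 critical value of t, 28 degrees of freedom


def pooled_t(a, b):
    """Conventional two-sample t statistic that assumes equal population variances."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    standard_error = (pooled_var * (1.0 / na + 1.0 / nb)) ** 0.5
    return (statistics.mean(a) - statistics.mean(b)) / standard_error


def rejection_rate(sd_small_group, sd_large_group, n_studies=5000, seed=3):
    """Proportion of simulated studies that reject a TRUE null hypothesis."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_studies):
        small = [rng.gauss(0.0, sd_small_group) for _ in range(10)]
        large = [rng.gauss(0.0, sd_large_group) for _ in range(20)]
        if abs(pooled_t(small, large)) > T_CRITICAL_DF28:
            rejections += 1
    return rejections / n_studies


equal_variances = rejection_rate(1.0, 1.0)
unequal_variances = rejection_rate(4.0, 1.0)
print(f"rejection rate, equal variances:   {equal_variances:.3f}")
print(f"rejection rate, unequal variances: {unequal_variances:.3f}")
```

When the variances are equal the rejection rate should be close to the nominal .05. When the smaller sample comes from the population with the larger variance, as in the second call, the rejection rate should be noticeably higher than .05, so a result reported as p < .05 would be reached more often than the nominal rate implies.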
VIOLATION OF ASSUMPTIONS IN REAL DATA

Unfortunately, violations of assumptions are common in real data, and they often appear in combination. Micceri (1989) presented many examples of nonnormal data, reporting that only about 3% of data in educational and psychological research have the appearance of near symmetry and light tails as in a normal distribution. Wilcox (1996) illustrated how two distributions can appear to be normal and appear to have very similar variances when in fact they have very different variances, even a ratio of variances greater than 10:1. Refer to Wilcox (2001) for a brief history of normality and departures from it.

In a review of the literature Grissom (2000) noted that there are theoretical reasons to expect and empirical results to document heteroscedasticity throughout various areas of research. When raw data that are amounts or counts have some zeros (e.g., the number of alcoholic drinks consumed by some patients during an alcoholism rehabilitation program), group means and variances are often positively related (Emerson & Stoto, 1983; Fleiss, 1986; Mueller, 1949; Norusis, 1995; Raudenbush & Bryk, 1987; Sawilowsky & Blair, 1992; Snedecor & Cochran, 1989). Therefore, distributions for samples with larger means often have larger variances than those for samples with smaller means, resulting in the possibility of heteroscedasticity. (Again, homoscedasticity and heteroscedasticity are characteristics of populations, not samples.) These characteristics may not be accurately reflected by comparison of variances of samples taken from those populations because the sampling variability of variances is high. Refer to Sawilowsky (2002) for a discussion of the implications of the relationship between means and variances, including citations of an opposing view. Also, sample distributions with greater positive skew tend to have the larger means and variances, again suggesting possible heteroscedasticity. Positive skew roughly means that a distribution is not symmetrically shaped because its right tail is more extensive than its left tail. Examples include distributions of data from studies of difference thresholds (sensitivity to a change in a stimulus), reaction time, latency of response, time to complete a task, income, length of hospital stay, and galvanic skin response (emotional palm sweating).
Wilcox and Keselman (2003a) discussed skew and nonnormality in general. For tests of symmetry versus skew, consult Keselman, Wilcox, Othman, and Fradette (2002), Othman, Keselman, Wilcox, Fradette, and Padmanabhan (2002), and Perry and Stoline (2002).

There are reasons for expecting heteroscedasticity in data from research on the efficacy of a treatment. First, a treatment may be more beneficial for some participants than for others, or it can be harmful for others. If this variability of responsiveness to treatment differs from Treatment Group a to Treatment Group b because of the natures of the treatments that are being compared, heteroscedasticity may result. For example, Lambert and Bergin (1994) found that there is deterioration in some patients, usually more so in treated groups than in control groups. Mohr (1995) cited negative outcomes from therapy for some adults with psychosis. Also, some therapies may increase violence in certain kinds of offenders (Rice, 1997).

Second, suppose that the dependent variable does not sufficiently cover the range of the underlying variable that it is supposed to be measuring (the latent variable). For example, a paper-and-pencil test of depression might not be covering the full range of depression that can actually occur in depressives. In this case a ceiling or floor effect can produce a greater reduction of variabilities within those groups whose treatments most greatly decrease or increase their scores.

A ceiling effect occurs when the highest score obtainable on a dependent variable does not represent the highest possible standing with respect to the latent variable. For example, a classroom test is supposed to measure the latent variable of students' knowledge, but if the test is too
easy, a student who scores 100% may not have as much knowledge of the material as is possible and another student who scores 100% may have even greater knowledge that the easy test does not enable that student to demonstrate. A floor effect occurs when the lowest score obtainable on a dependent variable does not represent the lowest possible standing with respect to the latent variable. For example, a particular screening test for a memory disorder may be so difficult for the participants that among those senile patients who score 0 on the test there may be some who actually have even a poorer memory than the others who scored 0, but they cannot exhibit their poorer memory because scores below 0 are not possible.

Heteroscedasticity can also result from outliers, which are defined (roughly for now) as extremely atypically high or low scores. Outliers may merely reflect recording errors or another kind of research error, but they are common and should be reported as possibly reflecting an important effect of a treatment on a minority of participants or as an indication of an important characteristic of a small minority of the participants. Precise definitions and rules for detecting outliers vary (Brant, 1990; Davies & Gather, 1993; Staudte & Sheather, 1990). Wilcox (2001, 2003) discussed a simple method for detecting outliers and also provided an S-PLUS software function for such detection (Wilcox, 2003). This method is based on the median absolute deviation (MAD). The MAD is defined and discussed as one of the alternative measures of variability in the last two sections of this chapter. Wilcox and Keselman (2003a) further discussed detection and treatment of outliers and their effect on statistical power. For additional discussion of outliers consult Barnett and Lewis (1994) and Jacoby (1997).

Researchers should reflect on the possible reasons for any outliers and about what, if anything, to do about them in the analysis of their data. No single definition or rule for dealing with outliers may be applicable to all data. If one has access to a program of data entry that cross checks and reports inconsistent entries, one can protect against outliers that merely reflect erroneous entry of data (and non-outlying erroneous entries as well) by entering all data in two files to be cross checked. Again, we are concerned about outliers here because of the possibility that they may result in heteroscedasticity that may make the use of certain measures of effect size problematic.

Evidence supports the theoretical expectation that heteroscedasticity may be common. Wilcox (1987) found that ratios of largest to smallest sample variances, called maximum sample variance ratios (VRs), exceeding 16 are not uncommon, and there are reports of sample VRs above 400 (Lix, Cribbie, & Keselman, 1996) and above 566 (Keselman et al., 1998).
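As a minimal sketch of how a maximum sample VR could be computed, the following Python function divides the largest unbiased sample variance by the smallest. The function name `max_variance_ratio` and the three small groups of scores are hypothetical and are used only for illustration; they do not come from any of the studies cited here.

```python
from statistics import variance  # unbiased sample variance (n - 1 denominator)


def max_variance_ratio(*groups):
    """Return the maximum sample VR: largest sample variance / smallest."""
    variances = [variance(g) for g in groups]
    if min(variances) == 0:
        raise ValueError("a group has zero variance; the ratio is undefined")
    return max(variances) / min(variances)


# Hypothetical scores for three groups (illustration only, not real data).
control = [4, 5, 5, 6, 5]
therapy_1 = [2, 5, 8, 3, 7]
therapy_2 = [5, 5, 6, 4, 5]

vr = max_variance_ratio(control, therapy_1, therapy_2)
print(round(vr, 2))  # 13.0 (variances are 0.5, 6.5, and 0.5)
```

With samples this small, a sample VR of this size says little about the population VR, because the sampling variability of variances is high.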
When a researcher assumes homoscedasticity, it is equivalently assumed that the population VR = 1. Maximum population VRs of up to 12 are considered to be realistic according to Tomarken and Serlin (1986). Because of the great sampling variability of variances, one can expect to find some sample VRs
that greatly exceed the population VRs, especially when sample sizes are small. However, in a study of gender differences using ns > 100, a sample VR was approximately 18,000 (Pedersen, Miller, Putcha-Bhagavatula, & Yang, 2002). In psychotherapeutic outcome research with children and adolescents, variances have been found to be statistically significantly different in treatment and control groups (Weisz, Weiss, Han, Granger, & Morton, 1995). In research on treatment of phobia, when comparing a systematic desensitization therapy group to an implosive therapy group and a control group, Hekmat (1973) found sample VRs over 12 and nearly 29, respectively, on the Behavior Avoidance Test. Research reports in a single issue of the Journal of Consulting and Clinical Psychology contained sample VR values of 3.24, 4.00 (several), 6.48, 6.67, 7.32, 7.84, 25.00, and 281.79 (Grissom, 2000). The last VR involved skewed distributions of the number of drinks per day under two different treatments for depression in alcoholics (Brown, Evans, Miller, Burgess, & Mueller, 1997). When comparing a control group and two panic-therapy groups for number of posttest panic attacks, sample VRs of 8.02 and 6.65 were found for control $s^2$/treated $s^2$ and Therapy 1 $s^2$/Therapy 2 $s^2$, respectively (Feske & Goldstein, 1997).

Statistical tests and measures of effect size are ideally used to compare randomly formed groups to attempt to control confounding variables, but they are often necessarily used to compare preexisting, not randomly formable, groups such as female and male participants. Groups that are formed by random assignment are expected to represent, by virtue of truly random assignment, populations with equal variances prior to treatment. However, preexisting groups often seem to represent populations with different variances. For example, volunteers and risk takers are less variable than comparison groups on measures of sensation seeking (Watson, 1985). Boys are more variable than girls on many mental tests (Feingold, 1992). Purging bulimics are less variable than nonpurging bulimics in mean percentage of deviation of their weight from normal weight (Gross, 1985, cited in Howell, 1997). Two kinds of closed-head injury patients have significantly different variances with respect to five measures of verbal learning (Wilde, Boake, & Sherer, 1995). Other cases of heteroscedasticity should be expected. For example, in research using self-reporting of anxiety-arousing stimuli to study perceptual defense (if it exists), perceptually defensive participants should be expected to produce more variable reports of the stimuli that they had seen than would less perceptually defensive participants.

Because treatment may affect the variabilities as well as the centers of distributions, and because changes in variances can be of as much practical significance as are changes in means, researchers should think of variances not just with regard to whether their data satisfy the assumption of homoscedasticity but as informative aspects of treatment effect. For example, Skinner (1958) predicted that programmed instruction, contrasted with traditional instruction, would result in lower variances in achievement scores. Similarly, in research on the outcome of therapy
more support would be given for the efficacy of a therapy if it were found that the therapy not only results in a "healthier" mean on a test of mental health, but also in less variability on the test when contrasted with a control group or alternative therapy group. Also, a remedial program that is intended to raise all participants' competence levels to a minimally acceptable level could be considered to be a failure or a limited success if it brought the group mean up to that level but also greatly increased variability by lowering the performance of some participants. For example, a remedial program increased mean scholastic performance but also increased variability (Bryk, 1977; Raudenbush & Bryk, 1987). Keppel (1991) presented additional examples of treatments affecting variances. Finally, Bryk and Raudenbush (1988) presented methods in clinical outcome research for identifying the patient characteristics that result in heteroscedasticity and for separately estimating treatment effects for the identified types of patients.

EXPLORING THE DATA FOR A POSSIBLE EFFECT OF A TREATMENT ON VARIABILITY

Because treatment often has an effect on variability and because this book presents a broad approach to estimating the effects of treatments, it behooves us to consider the topic of exploring the data for a possible effect of treatment on variability. Also, as we soon observe, there sometimes are limitations to the use of the standard deviation as a measure of variability, and many common measures of effect size involve a standard deviation in their denominators. Therefore, in this section we also consider the use of alternative measures of variability.

An obvious approach to determining whether a treatment has had an effect on variability would be to apply one of the common tests of homoscedasticity to determine if there is a statistically significant difference between the variances of the two samples. However, this approach is problematic because the traditional tests of homoscedasticity often produce inaccurate p values when sample sizes are small (e.g., n < 11 for each sample) or unequal or when distributions are not normal (De Carlo, 1997; Weston & Hopkins, 1998). These traditional tests of homoscedasticity are reported to have low statistical power even when distributions are normal (Wilcox, Charlin, & Thompson, 1986). However, Wilcox (2003) provided an S-PLUS software function for a bootstrap method for comparing two variances, a method that appears to produce accurate p values and acceptable power. The basic bootstrap method is briefly described in the penultimate section of chapter 2. For references and more details about the traditional tests of homoscedasticity, refer to Grissom (2000).

Note that it is common, and facilitated by major statistical software packages, to test for homoscedasticity and then conduct a conventional t test (that assumes homoscedasticity) if the difference in variances is
not statistically significant. (The same sequential method is also common prior to conducting a conventional F test in the case of two or more means.) If the difference in variances is significant, the researcher forgoes the traditional t test for the Welch t test that does not assume homoscedasticity, as discussed in chapter 2. However, this sequential procedure is problematic, and not only because of the possibility of inaccurate p levels and low power for the test of homoscedasticity. Sawilowsky (2002) discussed and demonstrated how this sequential procedure increases the rate of Type I error. For further discussion of this problem, consult Serlin (2002) and Zimmerman (1996). As Serlin (2002) noted, such inflation of Type I error can also result from the use of a test of symmetry to decide if a subsequent comparison of groups is to be made using a normality-assuming parametric test (e.g., the t test) or a nonparametric test (e.g., the Mann-Whitney U test or equivalent Wilcoxon test, as discussed in chap. 5).

Although traditional inferential methods may often not be powerful enough to detect heteroscedasticity or yield accurate p values, researchers should at least report $s^2$ for each sample for informally comparing sample variabilities, and perhaps report other alternative measures of the samples' variabilities, to which we now turn our attention. These measures of variability are less sensitive to outliers and skew than are the traditional variance and standard deviation, and they can provide better measures of the typical deviation from average scores under those conditions. (We note in chap. 3 that these alternative measures of variability can also be of use in estimating an effect size.) We are not aware of professional groups or journal editors who are recommending or requiring such measures.
However, these measures are receiving increasing attention in articles on new statistical methodology, attention that can be a prelude to such an editorial recommendation or requirement, and this book attempts to be forward looking.

Recall that the variance of a sample, $s^2$, is a kind of average of squared deviations of raw scores from the mean;

\[ s^2 = \frac{\sum (Y_i - \bar{Y})^2}{n} \qquad (1.2) \]

or, when the variance of a sample is used as an unbiased estimator of a population variance,

\[ s^2 = \frac{\sum (Y_i - \bar{Y})^2}{n - 1} \qquad (1.3) \]

Note in this equation that one or a few extremely outlying low or extremely outlying high scores can have a great effect on the variance. An
outlying score contributes (adds) 1 to the denominator while contributing a large amount to the numerator because of its large squared deviation from the mean, whereas each moderate score contributes 1 to the denominator while contributing only a moderate amount to the numerator. A statistic or a parameter is said to be nonresistant if it only takes one or a few outliers to have a relatively large effect on it. Thus, the variance and standard deviation are nonresistant. Therefore, although presenting the sample variances or standard deviations for comparison across groups can be of use in a research report, researchers should consider also presenting an alternative measure of variability that is more resistant to outliers than the variance or standard deviation are. Note also that the median is a more outlier-resistant measure of a distribution's center than is the arithmetic mean because the median, as the middle-ranked score, is influenced not by the magnitude of the scores above or below it, but by the ranking of scores. The mean of raw scores, as we noted is the case for the variance, has a numerator that can be greatly influenced by each extreme score, whereas each extreme score only adds 1 to the denominator.

The range is not a very useful measure of variability because it is extremely nonresistant. The range, by definition, is only sensitive to the most extremely high score and the most extremely low score, so the magnitude of either one of these scores can have a great effect on the range. However, researchers should report the lowest and highest score within each group because it can be informative to compare the lowest scores across the groups and to compare the highest scores across the groups.

Among the measures of variability within a sample that are more resistant to outliers than are the variance, standard deviation, and range, we consider the Winsorized variance, the median absolute deviation, and the interquartile range. The reporting of one or more of these measures for each sample should be considered for an informal exploration of a possible effect of an independent variable on variability. However, again we note that if groups have not been randomly formed, a posttreatment difference in variabilities of the samples might not necessarily be attributable, or entirely attributable, to an effect of treatment. Although the measures of variability that we consider here are not new to statisticians, they are only recently becoming widely known to researchers through the writings, frequently cited here, of Rand R. Wilcox.
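The nonresistance of the mean, variance, and range, and the resistance of the median, can be illustrated with a short Python sketch. The two lists of scores are hypothetical (they differ only in that the highest score is replaced by an extreme outlier), and the helper name `describe` is likewise only an illustrative choice.

```python
from statistics import mean, median, variance

# Hypothetical scores (illustration only); the second list replaces the
# highest score with an extreme outlier.
scores = [1, 2, 2, 3, 3, 3, 4, 4, 5]
with_outlier = [1, 2, 2, 3, 3, 3, 4, 4, 50]


def describe(sample):
    """Return center and variability summaries for one sample."""
    return {
        "mean": mean(sample),
        "median": median(sample),
        "variance": variance(sample),  # unbiased, n - 1 denominator
        "range": max(sample) - min(sample),
    }


before = describe(scores)
after = describe(with_outlier)
for name in before:
    print(f"{name:>8}: {before[name]:.2f} -> {after[name]:.2f}")
```

In this sketch a single changed score moves the mean from 3 to 8, the variance from 1.5 to 249, and the range from 4 to 49, whereas the median stays at 3.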
Winsorized Variance

The steps that follow for calculating a Winsorized variance (named for the statistician Charles Winsor) are clarified by the worked example in the next section. To calculate the Winsorized variance of a sample:

1. Order the scores in the sample from smallest to largest.
2. Remove the most extreme cn of the lowest scores and remove the same cn of the most extreme of the highest scores of that sample, where c is a proportion (often .2) and n is the total sample size. If cn is not an integer round it down to the nearest integer.
3. Call the lowest remaining score $Y_L$ and the highest remaining score $Y_H$.
4. Replace the removed lowest scores with cn repetitions of $Y_L$, and replace the removed highest scores with cn repetitions of $Y_H$, so that the total size of this reconstituted sample returns to its original size.
5. Calculate the usual unbiased $s^2$ (as defined by Equation 1.3) on the reconstituted sample to produce the Winsorized variance, $s^2_w$.

Depending on various factors, the amount of Winsorizing (i.e., removing and replacing) that is typically recommended is c = .10, .20, or .25. The greater the value of c that is used, the more the researcher is focusing on the variability of the more central subset of data. For example, when c = .20, more than 20% of the scores would have to be outliers before the Winsorized variance would be influenced by outliers. Wilcox (1996, 2003) provided further discussion, references, and an S-PLUS software function (Wilcox, 2003) for calculating a Winsorized variance. However, of the alternatives to the nonresistant $s^2$ that we discuss here, we believe that $s^2_w$ may perhaps be the most grudgingly adopted by researchers for two reasons. First, many researchers may balk at the uncertainty regarding the choice of a value for c. Second, although Winsorizing is actually a decades-old procedure that has been used and recommended by quite respectable statisticians, the procedure may seem to some researchers (excluding the present authors) to be "hocus pocus." For similar reasons some instructors may refrain from teaching this method to students because of concern that it may encourage them to devise their own less justifiable methods for altering data. For a method that is perhaps less psychologically and pedagogically problematic we turn now to the MAD.

MAD

The MAD for a sample is calculated as follows:

1. Order the sample's scores from the lowest to the highest.
2. Find the median score, Mdn. If there is an even number of scores in a sample there will be two middle-ranked scores tied for the median. In this case calculate Mdn as the midpoint (arithmetic mean) of these two scores.
3. For each score in the sample find its absolute deviation from the sample's median by successively subtracting Mdn from each $Y_i$ score, ignoring whether each such difference is positive or negative, to produce the set of deviations $|Y_1 - Mdn|, \ldots, |Y_n - Mdn|$.
4. Order the absolute deviations, $|Y_i - Mdn|$, from the lowest to the highest, to produce a series of increasing (signless) numbers.
5. Obtain the MAD by finding the median of these absolute deviations.
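The two step lists above (for the Winsorized variance and for the MAD) can be sketched in Python as follows. This is only an illustrative sketch, not the S-PLUS functions of Wilcox (2003); the function names `winsorized_variance` and `median_absolute_deviation` are hypothetical, and the small tolerance added before rounding cn down is an assumption made here to guard against floating-point error. The ten scores are those analyzed in the worked examples later in this chapter.

```python
import math
from statistics import median, variance


def winsorized_variance(scores, c=0.2):
    """Winsorized variance, following the five steps listed in the text."""
    ordered = sorted(scores)                   # Step 1: order the scores
    n = len(ordered)
    # Step 2: cn, rounded down (tiny tolerance guards against float error).
    k = math.floor(c * n + 1e-9)
    if 2 * k >= n:
        raise ValueError("c is too large for this sample size")
    kept = ordered[k:n - k]                    # scores remaining after removal
    y_low, y_high = kept[0], kept[-1]          # Step 3: Y_L and Y_H
    # Step 4: put back k copies of Y_L and k copies of Y_H.
    reconstituted = [y_low] * k + kept + [y_high] * k
    return variance(reconstituted)             # Step 5: unbiased s^2 (Eq. 1.3)


def median_absolute_deviation(scores):
    """MAD, following the five steps listed in the text."""
    mdn = median(scores)                                # Steps 1-2
    abs_devs = sorted(abs(y - mdn) for y in scores)     # Steps 3-4
    return median(abs_devs)                             # Step 5


# Scores from the chapter's worked examples of measures of variability.
data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 7]
print(round(variance(data), 2))              # 3.34
print(round(winsorized_variance(data), 3))   # 0.767
print(median_absolute_deviation(data))       # 1.0
```

The printed values agree with the worked examples: $s^2$ = 3.34, $s^2_w$ = .767 with c = .20, and MAD = 1.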
Note that the MAD is conceptually more similar to the traditional s than to $s^2$ because the latter involves squaring deviation scores, whereas
the MAD does not square deviations. Under normality the MAD = .6745s. Wilcox (2003) provided an S-PLUS software function for calculating the MAD. Manual calculation is demonstrated in the next section.

The final measure of variability that is discussed here is the interquartile range, which is based on quantiles. A quantile is roughly defined here as a score that is equal to or greater than a specified proportion of the scores in a distribution. Common examples of quantiles are quartiles, which divide the data into successive fourths of the data: .25, .50, .75, and 1.00. The second quartile, $Q_2$ (.50 quantile), is the overall Mdn of the scores in the distribution; that is, the score that has .50 of the scores ranked below it. The first quartile, $Q_1$ (.25 quantile), is the median of the scores that rank below the overall Mdn; that is, the score that outranks 25% of the scores. The third quartile, $Q_3$ (.75 quantile), is the median of the scores that rank above the overall Mdn; that is, the score that outranks 75% of the scores. The more variable a distribution is, the greater the difference there should be between the scores at $Q_3$ and $Q_1$, at least with respect to variability of the middle bulk of the data. A measure of such variability is the interquartile range, $R_{iq}$, which is defined as follows:

\[ R_{iq} = Q_3 - Q_1 \qquad (1.4) \]

For normal distributions the approximate relationship between the ordinary s and $R_{iq}$ is s = .75$R_{iq}$. For an introduction to quantiles, consult Hoaglin, Mosteller, and Tukey (1985); for a technical discussion, refer to Serfling (1980). When using statistical software packages researchers should try to ascertain how the software is defining quantiles because only a rough definition has been given here for our purposes and definitions vary. To pursue this topic refer to Hyndman and Fan (1996) and the discussion and references in Wilcox (2003), who also provided a simple example of a manual calculation using a method that gives evidence of being the best for determining the interquartile range.

There are additional measures that are more resistant to outliers than are $s^2$ and s, but discussion of these would be beyond the scope of this book. For example, for technical reasons a measure called the fourth spread, which is superficially similar to $R_{iq}$, might be superior to $R_{iq}$ (Brant, 1990; Wilcox, 1996). Also, in chapter 3 we mention a somewhat exotic, but apparently very commendable, resistant alternative measure that can be used to make inferences about differences between two populations' variabilities. Note that what we call a measure of variability in this book is also called a measure of a distribution's scale and that what we call a distribution's center is often called its location.
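The rough quartile definitions given earlier in this section, and Equation 1.4, can be sketched in Python as follows. The function names `quartiles` and `interquartile_range` are hypothetical, and the rule used here for an odd number of scores (leaving the middle-ranked score out of both halves) is an assumption; as noted above, definitions of quantiles vary across software packages, so other software may return somewhat different values.

```python
from statistics import median


def quartiles(scores):
    """Return (Q1, Q2, Q3) using the 'median of each half' definition."""
    ordered = sorted(scores)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two scores are needed")
    lower = ordered[: n // 2]        # scores ranked below the overall median
    upper = ordered[(n + 1) // 2:]   # scores ranked above the overall median
    return median(lower), median(ordered), median(upper)


def interquartile_range(scores):
    """R_iq = Q3 - Q1 (Equation 1.4)."""
    q1, _, q3 = quartiles(scores)
    return q3 - q1


# Scores from the chapter's worked examples of measures of variability.
data = [1, 1, 1, 1, 2, 2, 2, 3, 3, 7]
print(quartiles(data))            # (1, 2.0, 3)
print(interquartile_range(data))  # 2
```

For these ten scores the sketch gives $R_{iq}$ = 2, the value reported for the same data in the worked examples.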
Graphical methods for exploring differences between distributions in addition to differences between their means are cited in chapter 5. One such graphic depiction of data that is relevant to the present discussion and that researchers are urged to present for each sample is a
boxplot. Statistical software packages may vary in the details of the boxplots that they present (Frigge, Hoaglin, & Iglewicz, 1989), but generally included are the range, median, first and third quartiles so that the interquartile range can be calculated, and outliers that can also give an indication of skew. Major statistical software packages can produce two or more boxplots in the same figure for direct comparison. Consult Wilcox (2003) for further discussion and Carling (2000) for improvements in boxplots. Trenkler (2002) provided software for a more detailed comparison of two or more boxplots. For a general method for detecting outliers using boxplots refer to Schwertman, Owens, and Adnan (2004).
WORKED EXAMPLES OF MEASURES OF VARIABILITY

Consider the following real data that represent partial data from research on mothers of schizophrenic children (research that will be discussed in detail where needed in chap. 3): 1, 1, 1, 1, 2, 2, 2, 3, 3, and 7. The possible scores ranged from 0 to 10. Note in Fig. 1.1 that the data are positively skewed. Standard software output, or simple inspection of the data, yields for the median of the raw scores, Mdn = 2. As expected, because positive skew pulls the very nonresistant mean to a value that is greater than the median, $\bar{Y} > Mdn$ in the present case; specifically, $\bar{Y} = 2.3$. Note that although 9 of the 10 scores range from 1 to 3, the outlying score, 7, causes the range to equal 6. Software output also yields for the unbiased estimate of population variance for these data $s^2 = 3.34$. Although the present small set of data might not be ideal for justifying the application of the alternative measures of variability, it serves to demonstrate the calculation of the Winsorized variance and the MAD. Several statistical software packages calculate $R_{iq}$. For this example, $R_{iq} = 2$.

FIG. 1.1. Skewed data (n = 10). Horizontal axis: Scores.

Step 1 for calculating the Winsorized variance ($s_w^2$), ordering the scores from the lowest to the highest, has already been done. For Step 2 we use c = .20, so cn = .2(10) = 2. Therefore, we remove the two lowest scores and the two highest scores, which leaves 6 of the original 10 scores remaining. Applying Step 3, $Y_L = 1$ and $Y_H = 3$. Applying Step 4, we replace the two lowest removed scores with two repetitions of $Y_L = 1$, and we replace the two highest removed scores with two repetitions of $Y_H = 3$, so that the reconstituted sample of n = 10 is 1, 1, 1, 1, 2, 2, 2, 3, 3, and 3. Although Steps 1 through 4 have not changed the left side of the distribution, the reconstituted data clearly are more symmetrical than before because of the removal and replacement of the outlying score, 7. For Step 5 we use any statistical software to calculate, for the reconstituted data, the unbiased $s^2$ of Equation 1.3 to find that $s_w^2 = .767$. (For those who need the refresher, an example of a manual calculation of $s^2$ using a raw-score computational version of Equation 1.3 can be found in the section entitled Only Classificatory Factors in chap. 7.) Observe that, because of removal and replacement of the outlier ($Y_i = 7$), as we should expect, $s_w^2 < s^2$; that is, .767 < 3.34. Also, the mean of the reconstituted data, $\bar{Y}_w = 1.9$, is closer to the median, Mdn = 2, than was the original mean, $\bar{Y} = 2.3$. The range had been 6 but it is now 2, which well describes the reconstituted data in which every score is between 1 and 3, inclusive.

To calculate the MAD for the original data, we proceed to Step 3 of that method because Step 1, ordering the scores from the lowest to the highest, was previously done, and for Step 2 we have already found that Mdn = 2. For Step 3 we now find that the absolute deviation between each original score and the median is |1 - 2| = 1, |1 - 2| = 1, |1 - 2| = 1, |1 - 2| = 1, |2 - 2| = 0, |2 - 2| = 0, |2 - 2| = 0, |3 - 2| = 1, |3 - 2| = 1, and |7 - 2| = 5. For Step 4 we order these absolute deviations from the lowest to the highest: 0, 0, 0, 1, 1, 1, 1, 1, 1, and 5. For Step 5 we find by inspection that the median of these absolute deviations is 1; that is, the MAD = 1.

With regard to the usual intention that the standard deviation measure within what distance from the mean the typical below-average and typical above-average scores lie, observe the following facts about the data.
Nine of the 10 original scores ($Y_i = 7$ being the exception) are within approximately 1 point of the mean ($\bar{Y} = 2.3$) but the standard deviation of these skewed data is $s = (s^2)^{1/2} = 1.83$, a value that is nearly twice as large as the typical distance (deviation) of the scores from the mean. In contrast the Winsorized standard deviation, which is $s_w = (s_w^2)^{1/2} = .876$, is close to the typical deviation of approximately 1 point for the Winsorized data and for the original data. Finally, note that the MAD too is more representative of the typical amount of deviation from the original mean than the standard deviation is; MAD = 1. Of course, the demonstration of the methods in this section with a single small set of data does not constitute mathematical proof or even strong empirical evidence of their merits. Interested readers should refer to Wilcox (1996, 1997, 2003) and the references therein. In the boxplots in Fig. 1.2 for the current data, the asterisk indicates the outlier, the middle horizontal line within each box indicates the median, the black diamond within each box indicates the mean, and the lines that form the bottom and top of each box indicate the first and third quartiles respectively. Note that because of the idiosyncratic nature of the current data set (many repeated values) the interquartile range for the Winsorized data (2) happens to be equal to the range of the Winsorized data (2).
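The calculations in this worked example can be reproduced with a few lines of code. The following is a minimal Python sketch, not the software that the authors used. The function names `winsorize` and `mad` are hypothetical helpers introduced here, and truncating cn to a whole number when it is not an integer is an assumption that the text does not address.

```python
from statistics import mean, median, variance

scores = [1, 1, 1, 1, 2, 2, 2, 3, 3, 7]  # data from the worked example


def winsorize(data, c=0.20):
    """Return the reconstituted (Winsorized) sample, Steps 1 through 4."""
    ordered = sorted(data)                       # Step 1: order the scores
    # Step 2: cn scores per tail; truncating to a whole number is an assumption.
    k = int(c * len(ordered) + 1e-9)
    if k == 0:
        return ordered
    y_low, y_high = ordered[k], ordered[-k - 1]  # Step 3: Y_L and Y_H
    # Step 4: replace the removed scores with repetitions of Y_L and Y_H.
    return [y_low] * k + ordered[k:-k] + [y_high] * k


def mad(data):
    """Median absolute deviation from the median (no rescaling constant)."""
    mdn = median(data)
    return median([abs(y - mdn) for y in data])


winsorized = winsorize(scores)
print("mean, median:", mean(scores), median(scores))
print("unbiased variance:", round(variance(scores), 2))
print("Winsorized sample:", winsorized)
print("Winsorized variance:", round(variance(winsorized), 3))  # Step 5
print("MAD:", mad(scores))
```

Run on the 10 scores above, the sketch should reproduce the values reported in the text ($s^2 = 3.34$, $s_w^2 = .767$, and MAD = 1). The interquartile range is left out on purpose because software packages define quartiles in different ways and need not agree on its value for so small a sample.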
FIG. 1.2. Boxplots of original and Winsorized data. Panels: Original Data, Winsorized Data.

QUESTIONS

1. List six factors that influence the statistical significance of t.
2. What is the meaning of statistical significance, and what do the authors mean by statistically signifying?
3. Give an example, not from the text, of a statistically significant result that might not be practically significant.
4. Define effect size in general terms.
5. In what circumstances would the reporting of effect sizes be most useful?
6. What is the major issue in the debate regarding the reporting of effect sizes when results do not attain statistical significance?
7. Why should a researcher consider reporting more than one kind of effect size for a set of data?
8. What is often the relationship between a treatment's effect on means and variances?
9. Define power analysis.
10. Define meta-analysis.
11. What are two assumptions of the t and F tests on means?
12. Define heteroscedasticity.
13. Is heteroscedasticity a practical concern for data analysts, or is it merely of theoretical interest?
14. Discuss two reasons to expect heteroscedasticity.
15. Contrast a ceiling effect and a floor effect, providing an example of each that is not in the text.
16. Define outliers and provide two possible causes of them.
17. Discuss whether the use of preexisting groups or randomly formed groups impacts the possibility of heteroscedasticity differently.
18. Discuss the usefulness of tests of homoscedasticity in general.
19. Why is it problematic to precede a test of two means, or a test of more than two means, with a test of homoscedasticity?
20. What effect can one or a few outliers have on the variance?
21. Define nonresistance.
22. How resistant to outliers is the variance? In general terms, compare its resistance to that of four other measures of variability.
23. Define MAD.
24. Provide rough definitions of quantile and quartile.
25. Define median.
26. Define interquartile range.
27. Which characteristics of data do boxplots usually provide?
Chapter 2
Confidence Intervals for Comparing the Averages of Two Groups
INTRODUCTION
Although the topics of this chapter are generally not considered to be examples of effect sizes, and they might not be expected by some readers to be found in a book on effect sizes, this section and chapter 3 demonstrate that there are connections between confidence intervals and effect sizes, both of which can provide useful perspectives on the data. The confidence intervals that are discussed in this chapter and the effect sizes in chapter 3 all provide information that relates to the amount of difference between two populations' averages, including means. When the dependent variable is a commonly understood variable that is scaled in familiar units, such as weight in research that compares two weight-reduction programs, a confidence interval and an effect size can provide useful and complementary information about the results. Note that some authors do consider confidence intervals to be estimates of effect size (Fidler, Thomason, Cumming, Finch, & Leeman, 2004).

Additional examples of familiar scales include ounces of alcohol consumed, milligrams of drugs consumed, and counts of such things as family size, cigarettes smoked, acts of misbehavior (defined), days absent, days abstinent, dollars earned or spent, length of hospital stay, and relapses. Confidence intervals in terms of such familiar measures are readily understood because such measures are widely encountered outside of a specialist's research setting. Bond, Wiitala, and Richard (2003) argued for the use of simple differences between means as effect sizes for meta-analysis when the dependent variable is measured on a familiar scale, and they presented a method for doing so.

Confidence intervals can be informative when sample sizes are large enough to cause a relatively small difference to attain statistical significance. On the other hand, when samples are of their usual small or moderate size, resulting in large sampling error, apparently inconsistent results in the literature may be revealed later by the use of a confidence interval for each study to be more consistent than traditional analyses originally seemed to indicate. Such confidence intervals might well show substantial overlap, as was discussed and illustrated by Hunter and Schmidt (2004). Also, by being introduced to the concept of confidence intervals in this chapter those readers who are not very familiar with the topic should better understand the topics of confidence intervals for effect sizes that are presented in chapter 3 and thereafter. Unfortunately, the topics of this chapter are often not covered in statistics textbooks.

Wilkinson and the American Psychological Association's Task Force on Statistical Inference (1999) called for the greater use of confidence intervals, effect sizes, and confidence intervals for effect sizes; the fifth edition of the Publication Manual of the American Psychological Association (American Psychological Association, 2001) strongly recommended the use of confidence intervals. Many additional endorsements of confidence intervals can be cited, including those in Borenstein (1994), Cohen (1994), Fidler et al. (2004), Harlow et al. (1997), Hunter and Schmidt (2004), and Kirk (1996, 2001). Confidence intervals are frequently reported in medical research. Nonetheless, as we note later in this chapter and thereafter, with regard to confidence intervals in general or specific kinds of confidence intervals, the method can have limitations and interpretive problems. Its merits notwithstanding, we do not assert that the method is always the method of choice.

CONFIDENCE INTERVALS FOR $\mu_a - \mu_b$: INDEPENDENT GROUPS

Especially for an applied researcher in areas in which studies use the same familiar dependent variable, the practically most important part of the formula for the t statistic that tests the usual null hypothesis about two population means is the numerator, $\bar{Y}_a - \bar{Y}_b$. Using $\bar{Y}_a - \bar{Y}_b$ to
estimate the size of the difference between $\mu_a$ and $\mu_b$ can provide a very informative kind of result, especially when the dependent variable is measured by a commonly understood variable such as weight. Recall the example from chapter 1 in which we were interested in comparing the mean weights of diabetic participants in weight-reduction Programs a and b. It would be of great practical interest in this case to gain information about the difference in mean population weights. The procedure for constructing a confidence interval uses the data from Groups a and b to estimate a range of values that is likely to contain the value of $\mu_a - \mu_b$ within it, with a specifiable degree of confidence in this estimate. For example, a confidence interval for the difference in weight gain between two populations of anorectic girls who are represented by two samples who have received either Treatment a or Treatment b might lead to a reported result such as: "One can be approximately 95% confident that the interval between 10 pounds and 20 pounds contains the difference in mean gain of weight in the two populations."

Theoretically, although any given population of scores has a constant mean, equal-sized random samples from a population have varying means (sampling variability). Therefore, $\bar{Y}_a$ and $\bar{Y}_b$ might each be either overestimating or underestimating their respective population means. Thus, $\bar{Y}_a - \bar{Y}_b$ may well be larger or smaller than $\mu_a - \mu_b$. In other words, there is a margin of error when using $\bar{Y}_a - \bar{Y}_b$ to estimate $\mu_a - \mu_b$. If there is such a margin of error it may be positive, $(\bar{Y}_a - \bar{Y}_b) > (\mu_a - \mu_b)$, or negative, $(\bar{Y}_a - \bar{Y}_b) < (\mu_a - \mu_b)$. The larger the sample sizes and the less variable the populations of raw scores, the smaller the absolute value of the margin of error will be. That is, as is reflected in Equation 2.1, the margin of error is a function of the standard error. In this case the standard error is the standard deviation of the distribution of differences between two populations' sample means.

Another factor that influences the amount of margin of error is the level of confidence that one wants to have in one's estimate of a range of values that is likely to contain $\mu_a - \mu_b$. Although it might seem counterintuitive to some readers at first, we soon observe that the more confident one wants to be in this estimate, the greater the margin of error will have to be.
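Sampling variability and its dependence on sample size can be illustrated by simulation. The sketch below is only an illustration under stated assumptions: the population means, the common standard deviation, the normal shape of the populations, and the function name `simulate_differences` are all hypothetical choices made here, not values from the text.

```python
import random
from statistics import mean, stdev

# Hypothetical population parameters (assumptions for illustration only).
MU_A, MU_B, SIGMA = 150.0, 140.0, 20.0


def simulate_differences(n, reps=2000, seed=1):
    """Draw `reps` pairs of independent samples of size n from two normal
    populations and return the difference between sample means per pair."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    diffs = []
    for _ in range(reps):
        sample_a = [rng.gauss(MU_A, SIGMA) for _ in range(n)]
        sample_b = [rng.gauss(MU_B, SIGMA) for _ in range(n)]
        diffs.append(sum(sample_a) / n - sum(sample_b) / n)
    return diffs


for n in (10, 40):
    diffs = simulate_differences(n)
    # Standard error of the difference for equal n and a common sigma.
    theoretical_se = (SIGMA ** 2 * (1 / n + 1 / n)) ** 0.5
    print(f"n = {n}: mean difference = {mean(diffs):.2f}, "
          f"SD of differences = {stdev(diffs):.2f}, "
          f"theoretical SE = {theoretical_se:.2f}")
```

Under these assumptions the simulated differences scatter around the constant population difference of 10, sometimes above it and sometimes below it, and their standard deviation shrinks as n grows, which is the sense in which larger samples reduce the margin of error.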
For a very simple example, it is safe to say that we can be 100% confident that the difference in mean annual incomes of the population of high-school dropouts and the population of college graduates would be found within the interval between \$0 and \$1,000,000, but our 100% confidence in this estimate is of no benefit because it involves an unacceptably large margin of error (an insufficiently informative result). The actual difference between these two population means of annual income is obviously not near \$0 or near \$1,000,000. (For the purpose of this section, we used mean income as a dependent variable in our example despite the fact that income data are usually skewed and are typified by medians instead of means.)

A procedure that greatly decreases the margin of error without excessively reducing our level of confidence in the truth of our result would be useful. The tradition is to adopt what is called the 95% (or .95) confidence level that leads to an estimate of a range of values that has a .95 probability of containing the value of $\mu_a - \mu_b$. When expressed as a decimal value (e.g., .95) the confidence level of an accurately calculated confidence interval is also called the probability coverage of a confidence interval. To the extent that a method for constructing a confidence interval is inaccurate, the actual probability coverage will depart from what it was intended to be and what it appears to be (e.g., depart from the nominal .95). Although 95% confidence may seem to some readers to be only slightly less confidence than 100% confidence, such a procedure typically results in a very much narrower, more informative interval than in our example that compared incomes.

For simplicity, the first procedure that we discuss assumes normality, homoscedasticity, and independent groups. The procedure is easily generalized to confidence levels other than the 95% level. First, we consider an additional assumption of random sampling and consider further the assumption of independent groups.
In nonexperimental research we typically have to accept violation of the assumption of random sampling. Some finesse this problem by concluding that research results apply to theoretical populations from which our samples would have constituted a random sample. It can be argued that such a conclusion can be justified if the samples that were used seem to be reasonably representative of the kinds of people to whom we want to generalize the results. In the case of experimental research random assignment to treatments satisfies the assumption (in terms of the statistical validity of the results, if not necessarily in terms of the external validity of the results). We have more to say about the possible influence of sampling method on confidence intervals later.

Independent groups can be roughly defined for our purposes as groups within which no individual's score on the dependent variable is related to or predictable from the scores of any individual in another group. Groups are independent if the probability that an individual in a group will produce a certain score remains the same regardless of what score is produced by an individual in another group. Research with dependent groups requires methods for construction of confidence intervals that are different from methods used for research with independent groups, as we discuss in the last section of this chapter.
Assuming for simplicity for now that the assumptions of normality, homoscedasticity, and independence have been satisfied and that the usual (central) t distribution is applicable, it can be shown that for constructing a confidence interval for $\mu_a - \mu_b$ the margin of error (ME) is given by

$$ME = t^{*}\left[s_p^{2}\left(\frac{1}{n_a} + \frac{1}{n_b}\right)\right]^{1/2} \tag{2.1}$$
The part of Equation 2.1 after $t^{*}$ is the standard error of the difference between two sample means. In addition to its role in confidence intervals, the standard error is used to indicate the precision with which a statistic is estimating a parameter; the smaller the standard error the greater the precision.

When Equation 2.1 is used to construct a 95% confidence interval, $t^{*}$ is the absolute value of t that a table of critical values of t indicates is required to attain statistical significance at the .05 two-tailed level (or .025 one-tailed level) in a t test. For the 95% or any other level of confidence, $s_p^2$ is the pooled estimate of the assumed common variance of the two populations, $\sigma^2$. Use for the degrees-of-freedom (df) row of the t table, $df = n_a + n_b - 2$. Because for now we are assuming homoscedasticity, the best estimate of $\sigma^2$ is obtained by pooling the data from the two samples to calculate the usual weighted average of the two samples' estimates of $\sigma^2$ to produce (weighting by sample sizes via the separate samples' dfs):

$$s_p^{2} = \frac{(n_a - 1)s_a^{2} + (n_b - 1)s_b^{2}}{n_a + n_b - 2} \tag{2.2}$$
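Equations 2.1 and 2.2 can be combined in a short computation. The following Python sketch is offered only as an illustration: the two samples are hypothetical weights invented here, the function names `pooled_variance` and `margin_of_error` are introduced for this example, and the critical values are those that a standard two-tailed t table lists for df = 18.

```python
from statistics import variance

# Two-tailed critical values of t for df = 18, copied from a standard t table.
T_CRITICAL_DF18 = {0.90: 1.734, 0.95: 2.101, 0.99: 2.878}

# Hypothetical weights in pounds (invented data), 10 per group so df = 18.
group_a = [182, 175, 190, 168, 177, 185, 172, 180, 188, 173]
group_b = [170, 165, 178, 160, 169, 174, 162, 171, 176, 165]


def pooled_variance(sample_a, sample_b):
    """Equation 2.2: df-weighted average of the two unbiased variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    numerator = (n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)
    return numerator / (n_a + n_b - 2)


def margin_of_error(sample_a, sample_b, t_star):
    """Equation 2.1: critical t times the standard error of the difference."""
    n_a, n_b = len(sample_a), len(sample_b)
    standard_error = (pooled_variance(sample_a, sample_b)
                      * (1 / n_a + 1 / n_b)) ** 0.5
    return t_star * standard_error


print("pooled variance:", round(pooled_variance(group_a, group_b), 2))
for level, t_star in sorted(T_CRITICAL_DF18.items()):
    me = margin_of_error(group_a, group_b, t_star)
    print(f"confidence level {level:.2f}: ME = {me:.2f}")
```

Because the standard error is the same at every confidence level for a given set of data, the margin of error in this sketch grows only through $t^{*}$, so the more confidence is demanded, the larger the margin of error becomes. Samples of a different size would require the critical values from a different df row of the table.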
Because approximately 95% of the time when such confidence intervals are constructed, in the current case, the value of $\bar{Y}_a - \bar{Y}_b$ might be overestimating or underestimating $\mu_a - \mu_b$ by no more than $ME_{.95}$, one can say that approximately 95% of the time the following interval of values will contain the value of $\mu_a - \mu_b$:
$$(\bar{Y}_a - \bar{Y}_b) \pm ME_{.95} \tag{2.3}$$

The value $(\bar{Y}_a - \bar{Y}_b) - ME_{.95}$ is called the lower limit of the 95% confidence interval, and the value $(\bar{Y}_a - \bar{Y}_b) + ME_{.95}$ is called the upper limit of the 95% confidence interval. A confidence interval is (for our purpose) the interval of values between the lower limit and the upper limit. We often use CI for confidence interval, and to denote the 95% CI we use .95 CI or $CI_{.95}$.

Although confidence intervals for the difference between two averages are not effect sizes, they can provide (but not always) useful information about the magnitude of the results. For example, in our case of comparing two weight-reduction programs for diabetics, suppose that the lower and upper limits of the confidence interval for $\mu_a - \mu_b$, after 1 year in one or the other program were 1 lb and 2 lb, respectively. A between-program difference in mean population weights (a constant, but an unknown one) that we are 95% confident would be found in the interval between 1 and 2 lb after 1 year in the programs would seem to indicate that there is likely little practical difference in the effectiveness of the two programs, one of which seems to be only negligibly better than the other at most. On the other hand, if the lower and upper limits were found to be, say, 20 and 30 lb, then one would be fairly confident that one has evidence (not proof) that the more effective program is substantially better.

Note in the two examples of outcomes that neither the interval from 1 to 2 nor from 20 to 30 contains the value 0 within it. It can be shown in the present case that if the 95% confidence interval does not contain the value 0 the results imply that a two-tailed t test of $H_0\colon \mu_a - \mu_b = 0$ would have produced a statistically significant t at the .05 significance level. If the interval does contain the value 0, say, for example, limits of -10 and +10, we would conclude that the difference between $\bar{Y}_a$ and $\bar{Y}_b$ is not significant at the two-tailed .05 level of significance. In general, if we were to adopt a significance level alpha, if the $(1 - \alpha)$ confidence interval for the difference between two populations' means does not contain zero, the confidence interval is equivalent to having found a statistically significant difference between $\bar{Y}_a$ and $\bar{Y}_b$ at the alpha significance level. Therefore, such a confidence interval not only tells us what a t test of statistical significance would have told us, but the confidence interval can also provide possibly important additional information, especially if the dependent variable measure is a familiar one, such as weight.

Some have interpreted the relationship between the results from significance testing and construction of confidence intervals to mean that significance testing is not needed. Refer to Frick (1995) for a rebuttal. Also consult Knapp and Sawilowsky (2001). In chapter 8 we discuss an example of the difference between two populations' proportions, an example in which there is not a simple relationship between the two approaches to analyzing data. Another such example is the case of a single population proportion. Consult Knapp (2002) and Reichardt and Gollob (1997) for an argument justifying the use of confidence intervals in some cases and tests of statistical significance in other cases. Also refer to Dixon and Massey (1983) for a discussion of some technical differences between the two approaches. Note that apparent confidence levels may be overestimating true confidence levels when confidence intervals are only constructed contingent on first obtaining statistical significance. Consult Meeks and D'Agostino (1983) and Serlin (2002) to address this problem.

To construct a confidence interval other than the .95 CI, in general the $(1 - \alpha)$ CI, the value of $t^{*}$ that is used in Equation 2.1 is the absolute value of t that a t table indicates is required for two-tailed statistical significance at the alpha significance level (the same t as for $\alpha/2$, one-tailed). For example, for a .90 CI use the critical t required at $\alpha = .10$ two-tailed or $\alpha = .05$ one-tailed, and for a .99 CI use $\alpha = .01$ two-tailed or $\alpha = .005$ one-tailed. However, one would likely find that a .99 CI results in a very wide, less informative interval, as was suggested in our example of income comparisons. For a given set of data, the lower the confidence level, the narrower the interval. Indeed, it has been suggested that when a statistically significant difference between means is inferred by observing a .95 CI that does not include 0 it might be proper to report an unusually narrow interval by reporting a .80 or even .70 CI together with the traditional .95 CI (Vaske, Gliner, & Morgan, 2002). Consult Onwuegbuzie and Levin (2003) for a contrary view, and also refer to Kempthorne and Folks (1971). Note in this regard that criterion probability levels in the field of statistical inference are not always conventionally .95 (or the related .05). For example, statistical power levels of .95 are typically unattainable, and power = .80 is considered by some to be an acceptable convention for minimum acceptable power (Cohen, 1988).

Recall that the 95% CI is also called the .95 CI. Such a confidence interval is often mistakenly interpreted to mean that there is a .95 probability that $\mu_a - \mu_b$ will be one of the values within the calculated interval, as if $\mu_a - \mu_b$ were a variable. However, $\mu_a - \mu_b$ is actually a constant in any specific pair of populations (an unknown constant), and it is each confidence limit that is actually a variable. Theoretically, because of sampling variability, duplicating a specific example of research by repeatedly randomly sampling equal-sized samples from two populations will produce varying values of $\bar{Y}_a - \bar{Y}_b$, whereas the actual value of $\mu_a - \mu_b$ remains constant for the specific pair of populations that are being repeatedly compared via their sample means. In other words, although a researcher actually typically samples Populations a and b only once each, varying results are possible for $\bar{Y}_a$, $\bar{Y}_b$, and, thereby, $\bar{Y}_a - \bar{Y}_b$ in any one instance of research. Similarly, sample variances from a population would vary from instance to instance of research, so the margin of error is also a variable. Therefore, instead of saying that there is, say, a .95
probability that �a - � is is a value within the calculated calculated interval, one probability is a .95 .95 probability probability that the calculated interval interval will should say that there is
quite intentionally used the fu fucontain the value of �a - �.. Note that we quite ture tense (i.e., "will") "will") in the previous sentence sentence because the probability probability relates to what might happen if we proceed to construct a confidence in interval, and assumptions assumptions are satisfied; the probability probability does not not relate to what what has happened after after the interval has been constructed. Once an in interval has already been constructed it must simply be the case case that the = = 11)) or does not not in in= 0). For example, if the the actual actual difference between �a (ua and and u�bb were, Ib, when when constructing constructing were, say, exactly 10 lb, difference .95 CI there would be be a .95 .95 probability probability that the calculated interval will will a .95 contain the value 10. Stated theoretically, if this specific research were contain indefinitely large number number of times, approaching approaching infinity, infinity, repeated an indefinitely the percentage percentage of times that the calculated .95 CIs CIs would contain the value 10 would would approach approach 95% 95% (if assumptions assumptions are satisfied). satisfied). It is in this value sense that the reader should interpret any statement statement that is is made about about from construction construction of confidence intervals in worked exam examthe results from ples in this book. analogy of the proper in inThe game of horseshoe tossing provides an analogy terpretation of confidence intervals intervals.. 
In this analogy the targeted spike terpretation fixed in the ground represents a constant constant parameter (e.g., the difference difference fixed between two two populations' means) means),, the left left and right sides of the tossed between who horseshoe represent the limits of the interval, and an expert player who surround the spike with the horseshoe in 95% 95% of the tosses repre reprecan surround who has actually actually attained attained a .95 CI. CI.What Whatvaries variesin inthe the sents a researcher who is not not the location of the spike, spike, but but whether whether or not not the sample of tosses is surrounds it. For For a listing of the common and the pre pretossed horseshoe surrounds varying definitions of confidence intervals, refer refer to Fidler and and cise varying Thompson 1). Thompson (200 (2001).
interval interval includes includes �a - �b (i.e., probability probability of inclusion probability of inclusion clude �a - � (i.e., (i.e. probability
=
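The long-run interpretation described above can be illustrated with a small simulation. The following is only a sketch under stated assumptions: both populations are taken to be normal with a common standard deviation of 8 (an arbitrary illustrative value), the true difference is fixed at 10, the sample sizes are 29 and 26, and t* = 2.006 is used as the two-tailed .05 critical value for df = 53. The function names are hypothetical.

```python
import math
import random


def mean(xs):
    return sum(xs) / len(xs)


def sample_variance(xs):
    """Unbiased sample variance (divisor n - 1)."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def pooled_ci(sample_a, sample_b, t_crit):
    """Limits of the pooled-variance t interval for the mean difference."""
    n_a, n_b = len(sample_a), len(sample_b)
    pooled = ((n_a - 1) * sample_variance(sample_a)
              + (n_b - 1) * sample_variance(sample_b)) / (n_a + n_b - 2)
    margin = t_crit * math.sqrt(pooled * (1 / n_a + 1 / n_b))
    diff = mean(sample_a) - mean(sample_b)
    return diff - margin, diff + margin


def coverage(true_diff=10.0, sigma=8.0, n_a=29, n_b=26,
             t_crit=2.006, replications=4000, seed=1):
    """Proportion of replicated intervals that contain the fixed true_diff."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        # The parameter (true_diff) never changes; only the samples,
        # and therefore the interval limits, vary across replications.
        a = [rng.gauss(true_diff, sigma) for _ in range(n_a)]
        b = [rng.gauss(0.0, sigma) for _ in range(n_b)]
        lower, upper = pooled_ci(a, b, t_crit)
        if lower <= true_diff <= upper:
            hits += 1
    return hits / replications


print(f"Proportion of intervals containing 10: {coverage():.3f}")
```

Under these assumptions the printed proportion should be close to .95. As in the horseshoe analogy, the value 10 stays put while the interval limits move from replication to replication.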
WORKED EXAMPLE FOR INDEPENDENT GROUPS
The following example illustrates the aforementioned method for constructing a .95 CI for μa − μb, assuming normality and homoscedasticity. In an unpublished study (Everitt, cited in raw data published by Hand, Daly, Lunn, McConway, & Ostrowski, 1994) that compared Treatments a and b for young girls with anorexia nervosa (self-starvation) and used weight as the dependent variable, we find that the data yield the following statistics: Ȳa = 85.697, Ȳb = 81.108, s²a = 69.755, and s²b = 22.508. The sample sizes were na = 29 and nb = 26, so df = 29 + 26 − 2 = 53. Many statistical software packages will construct a confidence interval for the present case, but we illustrate a manual calculation to facilitate understanding the present procedure and those to come. A problem with a manual calculation with the current set of data is that the t tables in statistics textbooks do not provide the needed t* value for Equation 2.1 when df = 53. Therefore, using a t table that provides critical values of t for df = 50 and df = 55 (Snedecor & Cochran, 1989), we linearly interpolate three fifths of the way between 50 and 55 to estimate the critical value of t at df = 53; t* = 2.006. (A more precise method of interpolation is available, but it would result in little if any difference in t in this case because there is not even very much difference in critical values of t at df = 50 and df = 55.)

Now applying the required values, which were just reported, to Equation 2.2 we find that
$$s_p^2 = \frac{(29 - 1)\,69.755 + (26 - 1)\,22.508}{29 + 26 - 2} = 47.469.$$

Applying the needed values now to Equation 2.1 we find that

$$ME_{.95} = 2.006\left[47.469\left(\frac{1}{29} + \frac{1}{26}\right)\right]^{1/2} = 3.733.$$

Therefore, the limits of the .95 CI given by Equation 2.3 are

$$CI_{.95}: (85.697 - 81.108) \pm 3.733.$$
The interval is thus bounded by the lower limit of 4.589 − 3.733 = .856 lb and the upper limit of 4.589 + 3.733 = 8.322 lb. The difference between the two sample means, 4.589 lb, is called a point estimate of the difference between μa and μb. The interval from .856 to 8.322 does not include the value 0, so this confidence interval also informs us that the difference between Ȳa and Ȳb is statistically significant at the two-tailed .05 level. We conclude that there is statistically significantly greater weight in the girls who underwent Treatment a compared to the girls who underwent Treatment b. We are also 95% confident that the interval between .856 lb and 8.322 lb contains the difference in weight between the two treatment populations. Note that the sample sizes are not equal (na = 29, nb = 26), which is not necessarily problematic. However, if the smaller size of Sample b resulted from participants dropping out for a reason that was related to the degree of effectiveness of a treatment (nonrandom attrition), then the confidence interval and a test of significance would be invalid. The point estimate and the limits of the interval are depicted in Fig. 2.1.

Again, the practical significance of the just-noted result would be a matter for expert opinion in the field of study (medical opinion in this case), not a matter of statistical opinion. Similarly, suppose that the current confidence limits, .856 and 8.322, had resulted not from two treatments for anorexia nervosa but from two programs intended to raise the IQs of children who are about average in IQ. The practical significance of such limits (rounded to 1 and 8 IQ points) would be a matter about which educators or developmental psychologists should opine. In different fields of research the same numerical results may well have different degrees of practical significance.

Observe that there was approximately a 3:1 ratio of s²a to s²b in our example (69.755/22.508 = 3.1). This ratio suggests possible heteroscedasticity, although it could also be plausibly attributable to sampling variability of variances, which can be great. However, we do not conduct a test of homoscedasticity because of the likely low power of such a test (Grissom, 2000; Wilcox, 1996). The possibility of heteroscedasticity suggests that one of the more robust methods that are discussed in later sections of this chapter may be more appropriate for the data at hand.
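The manual calculation in this worked example can be retraced in a few lines. This is a minimal sketch that uses only the summary statistics reported in the text. The two tabled critical values (2.009 at df = 50 and 2.004 at df = 55) are assumed here as typical three-decimal t-table entries and are not quoted from the source; the variable names are illustrative.

```python
import math

# Summary statistics reported in the worked example.
mean_a, mean_b = 85.697, 81.108
var_a, var_b = 69.755, 22.508
n_a, n_b = 29, 26
df = n_a + n_b - 2  # 53

# Assumed table entries for the two-tailed .05 critical t (not from the source).
t_50, t_55 = 2.009, 2.004
# Linear interpolation three fifths of the way from df = 50 to df = 55.
t_star = round(t_50 + (df - 50) / (55 - 50) * (t_55 - t_50), 3)

# Pooled variance (Equation 2.2) and margin of error (Equation 2.1).
pooled_var = ((n_a - 1) * var_a + (n_b - 1) * var_b) / df
margin = t_star * math.sqrt(pooled_var * (1 / n_a + 1 / n_b))

# Limits of the .95 CI (Equation 2.3).
point_estimate = mean_a - mean_b
lower, upper = point_estimate - margin, point_estimate + margin

print(f"t* = {t_star:.3f}, pooled variance = {pooled_var:.3f}")
print(f"ME = {margin:.3f}, CI = [{lower:.3f}, {upper:.3f}]")
print(f"variance ratio = {var_a / var_b:.1f}")
```

The printed values should agree, to three decimals, with those obtained by hand in the text.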
FURTHER DISCUSSIONS AND METHODS
For further discussions of computer-intensive methods for constructing more accurate confidence intervals when assumptions are satisfied (including construction of confidence intervals using noncentral distributions), consult Altman, Machin, Bryant, and Gardner (2000), Bird (2002), Cumming and Finch (2001), Smithson (2001, 2003), and Tryon (2001); and, for a brief introduction, see chapter 3. Refer to Fidler and Thompson (2001) for an illustration of the use of SPSS software for the construction of a confidence interval for μa − μb. For negative or moderated views of confidence intervals we cite, for the sake of fairness and completeness, Feinstein (1998), Frick (1995), Parker (1995), and Knapp and Sawilowsky (2001). Smithson (2003), whose book is favorably disposed toward confidence intervals, also discussed their limitations.

[Figure 2.1: number line with ticks at −10, 0, and 10, labeled "Limits for Difference in Mean Weights (lbs.)", marking the lower limit (.856), the point estimate (4.589), and the upper limit (8.322).]

FIG. 2.1. Limits for the 95% confidence interval for the difference in mean weights of anorectic girls who had been given either Treatment a or Treatment b.

Note that this chapter only considers the case in which the sampling distribution of the estimator (e.g., the difference between two sample means) on which a confidence interval is based is symmetrically distributed. In such cases the resulting confidence interval is said to be symmetric because the value that is subtracted from the estimate to find the lower limit of the interval is the same as the value that is added to the estimate to find the upper limit. (We call this value the margin of error.) Therefore, in such cases the upper and lower limits are equidistant from the estimate. However, when the sampling distribution of the estimator might be skewed (e.g., a proportion in a sample, such as the proportion of treated patients whose health improves), it is possible to construct an asymmetric confidence interval. An asymmetric confidence interval is one in which the value that is subtracted from the estimate to find the lower limit is not the same as the value that is added to the estimate to find the upper limit. In such cases the limits are calculated separately as values that cut off approximately α/2 of the area of the sampling distribution. Thus, with regard to an asymmetric confidence interval, the lower limit is calculated as the value that has α/2 of the area of the sampling distribution below it, and the upper limit is calculated as the value that has α/2 of the area beyond it, with no requirement that these two values be equidistant from the estimate. This topic is discussed further where relevant in later chapters.

Finally, recall that most experimental research involves randomly assigning participants who have not been first randomly sampled from a large defined population to form a pool of participants. Instead, the participant pool is a local subpopulation of readily accessible prospective participants, such as college students who are available to an academic researcher. Such sampling, called convenience sampling, may result in a t-based confidence interval that is wider than it could be. For elaboration and an alternative approach, refer to Lunneborg (2001). Much of the remainder of this chapter is informed by Wilcox's (1996, 1997) research and expert reviews of confidence intervals under violation of assumptions, supplemented with more recent findings by Wilcox and others.

SOLUTIONS TO VIOLATIONS OF ASSUMPTIONS: WELCH'S APPROXIMATE METHOD
Even if only the assumption of homoscedasticity is violated, the use of the aforementioned t-based procedure can produce a misleading confidence level, unless perhaps na = nb > 15 (Ramsey, 1980; Wilcox, 1997). In the case of heteroscedasticity, if na ≠ nb, the actual confidence level can be lower than the nominal one. For example, a supposed .95 CI may in fact be an interval that has less than a .95 probability of containing the value of μa − μb.

The least a researcher should do about possible heteroscedasticity when constructing a confidence interval for μa − μb under the assumption of normality would be to use samples as close to equal size as is possible and consisting of at least 15 participants each. However, a long known, but little used, often more accurate approximate procedure for constructing a confidence interval for μa − μb in this case is Welch's (1938) approximate solution. This method is also known as the Satterthwaite (1946) procedure and is related to the work of Aspin (1949) as well. The Welch procedure accommodates heteroscedasticity in two ways. First, Equation 2.1 is modified so as to use estimates of σ²a and σ²b (population variances that are believed to be different) from s²a and s²b separately instead of pooling s²a and s²b to estimate a common population variance. Second, the equation for degrees of freedom is also modified to take into account the inequality of σ²a and σ²b. Thus, using the .95 CI again for our example, in this method

$$ME_w = t^*\left(\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}\right)^{1/2}, \qquad (2.4)$$
where w stands for Welch and, as before, t* is the absolute value of the t statistic that a t table indicates is required to attain significance at the two-tailed .05 level. The heteroscedasticity-adjusted degrees of freedom for the Welch procedure, dfw, are given by:

$$df_w = \frac{\left(\dfrac{s_a^2}{n_a} + \dfrac{s_b^2}{n_b}\right)^2}{\dfrac{\left(s_a^2/n_a\right)^2}{n_a - 1} + \dfrac{\left(s_b^2/n_b\right)^2}{n_b - 1}}. \qquad (2.5)$$
To find t* enter a t table at the degrees of freedom row that results from Equation 2.5. You may have to interpolate between the two degrees of freedom values in the table that are closest to your calculated value, as was demonstrated earlier, or you can round your obtained degrees of freedom value down or up to the nearest value in the table. (Note that rounding down degrees of freedom can result in a larger loss of statistical power for a t test than one might expect; Sawilowsky & Markman, 2002.) The limits of the confidence interval are then found using expression 2.3, with MEw replacing ME.

The Welch method often results in a smaller margin of error than the usual, previously demonstrated, method that pools the two values of s² and leaves degrees of freedom unadjusted. A smaller margin of error would result in a narrower, more informative, confidence interval. However, although the Welch method appears to counter heteroscedasticity well enough, it may not provide accurate confidence levels when at least one of the population distributions is not normal, especially (but not exclusively) when na ≠ nb. According to the review by Wilcox (1996), the Welch method may be at its worst when two populations are skewed differently and sample sizes are small and unequal. Bonett and Price (2002) confirmed that the method is problematic when sample sizes are small and the two population distributions are grossly and very differently nonnormal. Again, at the very least researchers should try to use samples that are as large and as close to equal in size as is possible. Using equal or nearly equal-sized samples ranging from n = 10 to n = 30 each might result in sufficiently accurate confidence levels under a variety of types of nonnormality, but a researcher cannot be certain if the kind and degree of nonnormality in a given set of data represents an exception to this conclusion (Bonett & Price, 2002). Researchers should also consider using one of the robust methods to deal simultaneously with heteroscedasticity and nonnormality when constructing a confidence interval to compare the centers of two groups. (Such methods are discussed in the two sections after the next section.)

The prevalence of disappointingly wide confidence intervals may be partly responsible for their infrequent use in the past. The application of more robust methods may result in narrower confidence intervals that inspire researchers to report confidence intervals routinely, where appropriate, as many methodologists have urged. Of course, a decision about reporting a confidence interval must be made a priori, and it should not be based on how pleasing its width is to the researcher.

Note that if na > 60 and nb > 60 one may construct a satisfactory confidence interval under heteroscedasticity simply by using a table of the standardized normal curve, instead of a t table, to find the appropriate z value instead of a t* value to insert into Equation 2.4, thus eliminating the use of Equation 2.5 (Moses, 1986). In the case of a .95 CI, z = 1.96. In the general case of na > 60 and nb > 60, for a confidence interval at the (1 − α) level of confidence use in place of t* in Equation 2.4 the positive value of z that the table indicates has 1 − (α/2) of the area of the normal curve below it, or α/2 of the area of the normal curve above it.
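The large-sample shortcut just described (a z value in place of t* in Equation 2.4) can be sketched with Python's standard library, whose `statistics.NormalDist` class provides the inverse of the standard normal cumulative distribution. The function names and the summary statistics in the demonstration are hypothetical, chosen only so that both sample sizes exceed 60.

```python
import math
from statistics import NormalDist


def z_critical(confidence):
    """Positive z with 1 - alpha/2 of the standard normal area below it."""
    alpha = 1.0 - confidence
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)


def large_sample_margin(var_a, n_a, var_b, n_b, confidence=0.95):
    """Equation 2.4 with z substituted for t* (intended for n_a, n_b > 60)."""
    return z_critical(confidence) * math.sqrt(var_a / n_a + var_b / n_b)


print(f"z for a .95 CI: {z_critical(0.95):.2f}")
print(f"z for a .99 CI: {z_critical(0.99):.3f}")

# Hypothetical summary statistics (not from the text).
me = large_sample_margin(var_a=70.0, n_a=80, var_b=25.0, n_b=75)
print(f"Margin of error with hypothetical data: {me:.3f}")
```

Because no degrees of freedom enter the calculation, Equation 2.5 is not needed in this case.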
WORKED EXAMPLE OF THE WELCH METHOD
The following worked example of the Welch method constructs a .95 CI for μa − μb from the same data on weight gain in anorexia nervosa as in the previous example.

We first find dfw so that we can determine the value of t* to use in Equation 2.4 to obtain MEw. Applying the previously stated values to Equation 2.5 we find that

$$df_w = \frac{\left(\dfrac{69.755}{29} + \dfrac{22.508}{26}\right)^2}{\dfrac{(69.755/29)^2}{29 - 1} + \dfrac{(22.508/26)^2}{26 - 1}} = 45.221.$$

Rounded to the nearest integer, dfw = 45. Most t tables in statistics textbooks do not include df = 45, but in the t table in Snedecor and Cochran (1989) we find that for df = 45 the critical value of t = 2.014. Applying this value of t* and the other previously stated required values to Equation 2.4 yields

$$ME_w = 2.014\left(\frac{69.755}{29} + \frac{22.508}{26}\right)^{1/2} = 3.643.$$

Therefore, the limits of the .95 CI are, using expression 2.3 with ME replaced by MEw,

$$CI_{.95}: (85.697 - 81.108) \pm 3.643; \qquad CI_{.95}: 4.589 \pm 3.643.$$

We previously found that the point estimate of μa − μb is 4.589 lb for the present data. Now using the Welch method we find that the margin of error associated with this point estimate is not ±3.733 lb, as before, but ±3.643 lb.

Observe that, as is often the case, |MEw| < |ME| (3.643 < 3.733), but the Welch-based interval, bounded by 4.589 − 3.643 = .946 and 4.589 + 3.643 = 8.232, is only slightly narrower than the previously constructed interval from .856 to 8.322. Provided that a researcher has used the more nearly accurate method, it is good to narrow the confidence interval without lowering the confidence level. As before, the interval does not contain the value 0, so this interval implies that Ȳa is statistically significantly greater than Ȳb at the .05 level, two-tailed.

Recall that the Welch method may not yield accurate confidence intervals when na ≠ nb. However, in our example the sample sizes are not very unequal and not very small. In the next section we consider a method that addresses the problems of heteroscedasticity and skew at the same time.
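The Welch calculation for these data can be checked with a short script. This is a sketch only: the function names are illustrative, and the critical value 2.014 for df = 45 is taken from the text as given rather than computed.

```python
import math


def welch_df(var_a, n_a, var_b, n_b):
    """Heteroscedasticity-adjusted degrees of freedom (Equation 2.5)."""
    q_a, q_b = var_a / n_a, var_b / n_b
    return (q_a + q_b) ** 2 / (q_a ** 2 / (n_a - 1) + q_b ** 2 / (n_b - 1))


def welch_margin(var_a, n_a, var_b, n_b, t_star):
    """Margin of error with unpooled variances (Equation 2.4)."""
    return t_star * math.sqrt(var_a / n_a + var_b / n_b)


# Summary statistics from the worked example.
mean_a, mean_b = 85.697, 81.108
var_a, var_b = 69.755, 22.508
n_a, n_b = 29, 26

df_w = welch_df(var_a, n_a, var_b, n_b)
t_star = 2.014  # tabled two-tailed .05 critical value at df = 45, per the text
me_w = welch_margin(var_a, n_a, var_b, n_b, t_star)
diff = mean_a - mean_b
lower, upper = diff - me_w, diff + me_w

print(f"df_w = {df_w:.3f} (rounds to {round(df_w)})")
print(f"ME_w = {me_w:.3f}, CI = [{lower:.3f}, {upper:.3f}]")
```

Note that the adjusted degrees of freedom (about 45) fall below the unadjusted df = 53, which raises the critical t slightly (2.014 versus 2.006), yet the margin of error still shrinks because the unpooled standard error is smaller for these data.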
YUEN'S CONFIDENCE INTERVAL FOR THE DIFFERENCE BETWEEN TWO TRIMMED MEANS
Yuen's (1974) method constructs a confidence interval for μta − μtb, in which each μt is a trimmed population mean. A trimmed mean of a sample (Ȳt) is the usual arithmetic mean calculated after removing (trimming) the c lowest and the same c highest scores, without replacing them. (The μt is defined and discussed further in the final paragraph of this section.) Choice of the optimum amount of trimming depends on several factors, the detailed discussion of which would be beyond the scope of this book. Consult Wilcox (1996, 1997, 2001, 2003), Wilcox and Keselman (2002a), and Sawilowsky (2002) for detailed discussions and references on this subject.

The reader is alerted that trimming has been recommended and is being increasingly studied by respected statistical methodologists, but the practice is not common. Many researchers and instructors of statistics may be leery of any method that alters or discards data. This issue is discussed at greater length at the end of this section.
The optimum amount of trimming may range from 0% to slightly over 25%. The greater the number of outliers, the greater the justification might be for, say, 25% trimming. Small samples may also justify 25% trimming (Rosenberger & Gasko, 1983). For a discussion of trimming less than 20%, refer to Keselman, Wilcox et al. (2002). Also consult Sawilowsky (2002). If population distributions were normal, which is not the assumption of this section, then one would use the usual arithmetic means, which is equivalent to 0% trimming. Note that if one trimmed all but the middle-ranked score, the trimmed mean would be the same as the median. Thus, a trimmed mean is conceptually and numerically between the traditional arithmetic mean (0% trimming) and the median (maximum trimming). If one or more outliers are causing the departure from normality, then trimming can eliminate the outlier(s) and bring the focus to the middle group of scores.
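The position of the trimmed mean between the arithmetic mean and the median can be seen in a small sketch. The ten scores below are hypothetical (they include one deliberate outlier), the function names are illustrative, and the number of scores trimmed from each tail is assumed to be the trimming proportion times n, rounded down to a whole number.

```python
import math
import statistics


def trim_count(n, proportion):
    """Number of scores removed from EACH tail: proportion * n, rounded down."""
    # round() guards against floating-point noise such as 0.2 * 15 = 3.0000000000000004
    return math.floor(round(proportion * n, 9))


def trimmed_mean(scores, proportion=0.2):
    """Mean of the scores that remain after trimming c from each tail."""
    ordered = sorted(scores)
    c = trim_count(len(ordered), proportion)
    remaining = ordered[c:len(ordered) - c]
    if not remaining:
        raise ValueError("trimming proportion removes every score")
    return sum(remaining) / len(remaining)


# Hypothetical scores with one high outlier (43).
scores = [2, 4, 5, 5, 6, 7, 8, 9, 11, 43]

print(f"0% trimming (ordinary mean): {trimmed_mean(scores, 0.0):.3f}")
print(f"20% trimmed mean:            {trimmed_mean(scores, 0.2):.3f}")
print(f"40% trimmed mean:            {trimmed_mean(scores, 0.4):.3f}")
print(f"median:                      {statistics.median(scores):.3f}")
```

With these scores the outlier pulls the ordinary mean well above the bulk of the data, the 20% trimmed mean lies between the mean and the median, and the heaviest trimming that still leaves scores (here 40%, which keeps only the two middle scores) reproduces the median.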
Because Because it it may may sometimes sometimes be be optimum optimum or or close close to to optimum, optimum, 20% 20% (.2) (.2) trimming trimming is is the the method method that that we we demonstrate. demonstrate. In In this this case case c = = .2 .2n for n for each sample. sample. If If ec is is not whole number, number, then then round round ec down to the neareach not aa whole down to the near est whole whole number. number. For For example, if n = 29, .2n = .2(29) .2(29) = 55.8, and, est example, if . 8, and, scores, nr, rounding rounding down, down, cc = = 55.. The The number number of of remaining remaining scores, nr, in in the the group group is equal to to n -- 2c. In the previous example example of of the the anorectic anorectic sample that is equal 2e. In the previous sample that received Treatment Treatment a, a, nr nr = nnaa -- 2e 2c = = 29 - 2(5) 2(5) = =19. Forthe thesample samplethat that received 29 1 9 . For received b, cc = 2nbb = .2, which received Treatment Treatment b, = ..2n = .2(26) .2(26) = = 55.2, which rounds rounds to to 55.. For For this this group, nr 2e = 26 group, nr = nb nb -- 2c = 26 - 2(5) 2(5) = 16. 16. The in the each The first first step step in the Yuen Yuen method method is is to to arrange arrange the the scores scores for for each group group separately separately in in order. order. Then, Then, for for each each group group separately, separately, eliminate eliminate the the ec = = .2n most most extreme extreme low low scores scores and and the the same same number number (for (for that that particular particular group) of procedure does require group) of the the most most extreme extreme high high scores. scores. The The procedure does not not require that nnaa = = nb. nb. If If na na = nb, it it may may or or may may not not turn turn out out that that aa different different number number that i= nb, =
=
=
=
=
=
CONFIDENCE INTERVALS INTERVALS CONFIDENCE
37
�
of scores scores is is trimmed trimmed from from Groups Groups a and and b, depending depending on on the the results results of of of rounding the the values values of of c. c. Next, Next, calculate calculate the the trimmed trimmed mean, mean, Y for each each rounding Vt,t/ for group by by applying applying the the usual usual formula formula for for the the arithmetic arithmetic mean mean using using the the group remaining sample size, n , in the denominator; Y = (IY) / n . Continuing remaining sample size, nr,r in the denominator; Vtt = (Ly) / nr•r Continuing in this this section with the the previous previous data data on on weight weight gain in anorexia, anorexia, we we re rein section with gain in move the the five five highest highest and and five lowest scores scores from from each each sample sample to to find find the the move five lowest trimmed means means of of the the remaining remaining scores, scores, V Ytata = = 85 85.294andy = 881.031. trimmed .294 and Vtbtb = 1 .03 1 . The next next step step is is to to calculate calculate the the numerator numerator of of the the Winsorized Winsorized vari variThe ance, SSw' S5w, for for each each group group by by applying through 55 for for calculating calculating ance, applying Steps Steps 33 through Winsorized variance variance that that were were presented presented in in the the last last section section of of chapter chapter aa Winsorized (Steps 11 and and 22 of of that that procedure will already already have have been been completed completed by by 11.. (Steps procedure will this stage stage of the method. 
Applying Steps 3 through 5 to calculate the numerator of a Winsorized variance we replace, in each sample, the trimmed five lowest original scores with five repetitions of the lowest remaining score, and we replace the five trimmed highest original scores with five repetitions of the highest remaining score. Because $s^2 = SS/(n - 1)$, $SS = s^2(n - 1)$. Using any software for descriptive statistics we find that for the reconstituted samples (original remaining scores plus the scores that replaced the trimmed scores) $s^2_{wa} = 30.206$ and $s^2_{wb} = 12.718$. Therefore, $SS_{wa} = 30.206(29 - 1) = 845.768$ and $SS_{wb} = 12.718(26 - 1) = 317.950$.

Next, we calculate a needed statistic, $w_y$, separately for each group, to find $w_{ya}$ and $w_{yb}$. Each sample's $w_y$ is found separately by calculating:

$$w_y = \frac{SS_w}{n_r(n_r - 1)}. \tag{2.6}$$

The $ME_y$ ($y$ stands for Yuen) for the confidence interval for $\mu_{ta} - \mu_{tb}$ is

$$ME_y = t^*(w_{ya} + w_{yb})^{1/2}, \tag{2.7}$$

and the confidence limits for $\mu_{ta} - \mu_{tb}$ become

$$(\bar{Y}_{ta} - \bar{Y}_{tb}) \pm ME_y. \tag{2.8}$$

The degrees of freedom to be used to find the tabled value of $t^*$ in Yuen's procedure, $df_y$, is

$$df_y = \frac{(w_{ya} + w_{yb})^2}{\dfrac{w_{ya}^2}{n_{ra} - 1} + \dfrac{w_{yb}^2}{n_{rb} - 1}}. \tag{2.9}$$
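The trimming and Winsorizing steps that feed Equation 2.6 can be sketched in code. The following is a minimal illustration, not the authors' software: the function name `yuen_pieces` and the ten scores are hypothetical (the raw anorexia scores are not reproduced in this section), and the sketch assumes that the Winsorized sum of squares is taken about the mean of the reconstituted sample.

```python
import math


def yuen_pieces(scores, trim=0.2):
    """Return (trimmed mean, Winsorized SS, w_y, n_r) for one group.

    Hypothetical helper: trims c = floor(trim * n) scores from each tail,
    then Winsorizes to obtain the numerator of the Winsorized variance.
    """
    y = sorted(scores)                # arrange the scores in order
    n = len(y)
    c = math.floor(trim * n)          # number trimmed from each tail (rounded down)
    n_r = n - 2 * c                   # number of remaining scores
    kept = y[c:n - c]
    trimmed_mean = sum(kept) / n_r

    # Winsorize: each trimmed score is replaced by the nearest remaining score.
    reconstituted = [kept[0]] * c + kept + [kept[-1]] * c
    wins_mean = sum(reconstituted) / n
    ss_w = sum((v - wins_mean) ** 2 for v in reconstituted)

    w_y = ss_w / (n_r * (n_r - 1))    # Equation 2.6
    return trimmed_mean, ss_w, w_y, n_r


# Hypothetical scores with one extreme high value.
example = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
trimmed_mean, ss_w, w_y, n_r = yuen_pieces(example)
print(trimmed_mean, ss_w, n_r)   # 5.5 42.5 6
```

With these hypothetical scores the outlier of 100 has no influence on the trimmed mean, and it enters the Winsorized sum of squares only as a repetition of the highest remaining score (8).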
Applying the previously reported required values to Equation 2.6 we find that
$$w_{ya} = \frac{845.768}{19(19 - 1)} = 2.473,$$

and

$$w_{yb} = \frac{317.950}{16(16 - 1)} = 1.325.$$
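As a check on the arithmetic, the two values can be reproduced from the summary statistics reported in the text. This is a small sketch; the helper name `w_from_summary` is an assumption introduced for illustration.

```python
import math


def w_from_summary(n, s2_w, trim=0.2):
    """Compute w_y (Equation 2.6) from a group's n and Winsorized variance."""
    c = math.floor(trim * n)        # scores trimmed from each tail
    n_r = n - 2 * c                 # remaining scores
    ss_w = s2_w * (n - 1)           # SS = s^2 (n - 1)
    return ss_w / (n_r * (n_r - 1))


w_ya = w_from_summary(29, 30.206)   # Treatment a: n = 29, Winsorized variance 30.206
w_yb = w_from_summary(26, 12.718)   # Treatment b: n = 26, Winsorized variance 12.718
print(round(w_ya, 3), round(w_yb, 3))   # 2.473 1.325
```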
Now applying the required values to Equation 2.9 we find that

$$df_y = \frac{(2.473 + 1.325)^2}{\dfrac{2.473^2}{19 - 1} + \dfrac{1.325^2}{16 - 1}} = 32.$$
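The same calculation can be written as a short function. This is a sketch of Equation 2.9 only; the name `yuen_df` is hypothetical. The unrounded result is about 31.6, which rounds to the 32 used in the text.

```python
def yuen_df(w_ya, n_ra, w_yb, n_rb):
    """Degrees of freedom for Yuen's procedure (Equation 2.9)."""
    numerator = (w_ya + w_yb) ** 2
    denominator = w_ya ** 2 / (n_ra - 1) + w_yb ** 2 / (n_rb - 1)
    return numerator / denominator


df_y = yuen_df(2.473, 19, 1.325, 16)
print(round(df_y))   # 32
```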
Most $t$ tables in statistics textbooks will provide critical values of $t$ for $df = 30$ and $df = 40$, but not for degrees of freedom between these values. However, using the $t$ table in Snedecor and Cochran (1989), we find rows for $df = 30$ and $df = 35$. Because $df = 32$ is two fifths of the distance between $df = 30$ and $df = 35$, we linearly interpolate two fifths of the way between the $t$ values at $df = 30$ and $df = 35$ to estimate that the critical value of $t^*$ is approximately 2.037. (More accurate interpolation is possible but would likely make a negligible, if any, difference in our final results.)

Now applying the obtained required values to Equation 2.7 we find that for the .95 CI

$$ME_y = 2.037(2.473 + 1.325)^{1/2} = 3.970.$$
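The interpolation and the margin of error can be sketched as follows. The tabled values 2.042 ($df = 30$) and 2.030 ($df = 35$) are the standard two-tailed .05 critical values of $t$; they are not quoted in the text and are supplied here as an assumption, and the function name `interpolate_t` is hypothetical.

```python
import math


def interpolate_t(df, df_lo, t_lo, df_hi, t_hi):
    """Linearly interpolate a critical t value between two tabled rows."""
    fraction = (df - df_lo) / (df_hi - df_lo)   # 2/5 when df = 32 lies between 30 and 35
    return t_lo + fraction * (t_hi - t_lo)


t_star = round(interpolate_t(32, 30, 2.042, 35, 2.030), 3)
me_y = t_star * math.sqrt(2.473 + 1.325)        # Equation 2.7
print(t_star, round(me_y, 3))   # 2.037 3.97
```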
Finally, applying the required values to expression 2.8 we find that the .95 CI is bounded by the limits $(85.294 - 81.031 = 4.263) \pm 3.970$. Thus, the point estimate of $\mu_{ta} - \mu_{tb}$ is 4.263 lb, and the .95 CI ranges from $4.263 - 3.970 = .293$ lb to $4.263 + 3.970 = 8.233$ lb. Although the Yuen method usually results in narrower confidence intervals than the Welch method (Wilcox, 1996), such is not the case with regard to these data. The Yuen-based interval from .293 lb to 8.233 lb is wider than the previously calculated Welch-based interval from .946 lb to 8.232 lb (and also wider than the confidence interval that was constructed using the traditional $t$-based method). However, it is possible that the use of an alternative to $s_w$ in the Yuen procedure may narrow the interval (Bunner & Sawilowsky, 2002).

Note that all three of the methods that were applied to the data on anorexia lead to the same general conclusions. All three methods resulted in confidence intervals that did not contain the value 0, so we can conclude that the mean (or trimmed mean) weight of girls in Sample a is statistically significantly greater than the mean (or trimmed mean)
weight of girls in Sample b at the two-tailed .05 level. Also, all three methods yielded a lower limit of mean (or trimmed mean) weight difference that is under 1 lb and an upper limit of mean (or trimmed mean) weight difference that is slightly over 8 lb. Again, a conclusion about the clinical significance of such results would be for specialists in the field of anorexia nervosa to decide.

Note in Equations 2.7 and 2.9 that the Yuen method is a hybrid procedure of countering nonnormality by trimming and countering heteroscedasticity by using the Welch method of adjusting degrees of freedom and treating sample variabilities separately instead of pooling them. Wilcox (1997) aptly called the Yuen method the Yuen-Welch method (although the names of statisticians Aspin and Satterthwaite could be added to Welch) and provided S-PLUS software functions for constructing a .95 CI using this method.
Wilcox (1996) also provided Minitab macros for constructing the interval at the .95 or other levels of confidence. Reed (2003) provided executable FORTRAN code for Yuen's method, and Keselman, Othman, Wilcox, and Fradette (2004) are further developing Yuen's method.

Note that although the Yuen method has been known since 1974, was made accessible by Wilcox through his 1996 and 1997 books and software, and appears often to be superior to the traditional $t$ procedure and the Welch procedure for constructing a confidence interval, the Yuen method is not widely used. Its lack of use may be largely attributable to a lack of awareness because it is absent from nearly all textbooks of statistics. Also, historically researchers have been slow to adopt new statistical methods and slow to forego popular methods that ultimately are found by methodologists to be problematic.
Moreover, as was mentioned earlier, there may also be discomfort on the part of many researchers about trimming data in general and about lack of certainty regarding the optimum amount of trimming to be done for any particular set of data.

However, there may be an irony here. It could be argued that some researchers may accept the use of medians, which amounts, in effect, to the maximum amount of trimming (trimming all but the middle-ranked or two middle-ranked scores) but would be leery of the more modest amount of trimming (20%) that was discussed in this section. Also, as Wilcox (2001) pointed out, trimming is common in certain kinds of judging in athletic competition, such as removing the highest and lowest ratings before calculating the mean of the judges' ratings of a figure-skating performance.
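The point that the median is the limiting case of trimming can be illustrated with a brief sketch. The ratings are hypothetical, and the function `trimmed_mean_c`, which takes the number of scores to trim from each tail rather than a proportion, is introduced here only for illustration.

```python
import statistics


def trimmed_mean_c(scores, c):
    """Mean of the scores left after removing the c lowest and c highest."""
    y = sorted(scores)
    kept = y[c:len(y) - c]
    return sum(kept) / len(kept)


# Hypothetical judges' ratings with one extreme high rating.
ratings = [5.1, 5.6, 5.7, 5.8, 5.9, 6.0, 9.9]

untrimmed = trimmed_mean_c(ratings, 0)            # ordinary mean, pulled up by 9.9
judged = trimmed_mean_c(ratings, 1)               # drop the highest and lowest rating
max_c = (len(ratings) - 1) // 2                   # leaves one (odd n) or two (even n) scores
maximal = trimmed_mean_c(ratings, max_c)          # maximum trimming

print(abs(maximal - statistics.median(ratings)) < 1e-12)   # True
```

Trimming one rating from each tail already removes the influence of the extreme rating, and maximum trimming reproduces the median.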
Although by using the Yuen method one is not constructing a confidence interval for the traditional $\mu_a - \mu_b$, but for the less familiar $\mu_{ta} - \mu_{tb}$, the researcher who is interested in constructing a confidence interval for the difference between the outcomes for the average (typical) members of Population a and Population b should recognize that, when there is skew, $\mu_t$ may better represent the score of the typical person in a population
than would a skew-distorted traditional $\mu$. Refer to Staudte and Sheather (1990) for a precise definition of $\mu_t$. For our purpose we define, say, a 20% $\mu_t$ as the mean of those scores in the population that fall between the .20 and .80 quantiles of that population. Note also that the Yuen method, when used to test the significance of the difference between two trimmed sample means, may provide good control of Type I error. However, the relative statistical power (efficiency) of the Yuen method versus the traditional $t$-test method that uses the usual means and variances may depend greatly on the degree of skew (Cribbie & Keselman, 2003a). For a negative view of trimming, refer to Bonett and Price (2002).

OTHER METHODS FOR INDEPENDENT GROUPS

Wilcox (1996) provided discussion and a Minitab macro for constructing a .95 CI for the difference between two populations' medians to counter nonnormality, but this method is not discussed here because it may often not provide as good a solution to violations of assumptions as the Yuen method. However, there are other promising methods for constructing a confidence interval for the difference between two populations' centers. One such method is based on Harrell and Davis' (1982) improved method for estimating population medians.
The sample median is a biased estimator of the population median (although, for even slightly nonnormal population distributions, a sample's median can provide a more accurate estimate of the mean of the population than does the mean of that sample; Wilcox, 2003). The Harrell-Davis estimator of a population's median appears to be a less biased estimator, and appears to have less sampling variability than does the ordinary sample median.

The use of the Harrell-Davis estimator to construct a confidence interval is too complicated to be done manually, is not widely available in software, and is not demonstrated here. However, Wilcox (1996) again provided discussion, references, and a Minitab macro for this method for constructing the confidence interval. Wilcox (2003) also provided discussion and an S-PLUS software function for a simple method for constructing a confidence interval for the difference between two populations' medians that is based on a method by McKean and Schrader (1984). An alternative computationally simple procedure for constructing a confidence interval for the difference between two medians that modifies the McKean-Schrader method and uses manual calculation is available (Bonett & Price, 2002). Unlike the Welch method, the Bonett-Price method seems to produce fairly accurate confidence levels when sample sizes are small even under extreme nonnormality. Bonett and Price
(2002) extended the method to the construction of confidence intervals for the difference between two medians at a time from multiple groups (simultaneous confidence intervals) in one-way and factorial designs. Although there are several more methods for constructing a confidence interval for the difference between two populations' centers (Wilcox, 1996,
1997, 2003; Lunneborg, 2001), only one more, the one-step M-estimator method, is mentioned here because it is among the methods that appear to be often (but not always) better than the traditional method.

The one-step M-estimator method is based on a refinement of the trimming procedure. (The letter M stands for maximum likelihood.) There are two related issues when calculating trimmed means. We have already discussed the first issue, choosing how much trimming to do. Second, as we have also discussed, traditional trimming trims equally on both sides of a distribution. However, in the case of skew, traditional trimming results in trimming as many scores on the side of the distribution opposite to the skew, where trimming is not needed or less needed, as on the skewed side of the distribution, where trimming is needed or needed more. A measure of location (center of a distribution) whose value is minimally changed by outliers is called a resistant measure of location. M estimators of location are resistant measures that can be based on determining how much, if any, trimming should be done separately for each side of a distribution (Hampel, Ronchetti, Rousseeuw, & Stahel, 1986; Staudte & Sheather, 1990).

The arithmetic mean gives equal weight to all scores (no trimming) when averaging them. However, when calculating a trimmed mean traditional trimming in effect gives no weight to the trimmed scores and equal weight to each of the remaining scores and the scores that have replaced the trimmed scores.
Using M estimators is less drastic than using trimmed means because M estimators can weight scores with weights other than 0 (discarding) or 1 (keeping and treating equally). They calculate location by giving progressively more weight to the scores closer to the center of the distribution. Different M estimators use different weighting schemes (Hoaglin, Mosteller, & Tukey, 1983).

The simplest M-estimation procedure is called one-step M estimation. Constructing a confidence interval using M estimation is too complicated and laborious to do manually. Again, fortunately a Minitab macro (Wilcox, 1996) and an S-PLUS software function (Wilcox, 1997, 2003) are available for constructing a confidence interval for the difference between the locations of two populations using one-step M estimation. Note that when heteroscedasticity is caused by skew, using traditionally trimmed means may be better than using one-step M estimators (Bickel & Lehmann, 1975), but both methods may yield inaccurate confidence levels when sample sizes are below 20.

In general, because of the possibility of excessively inaccurate confidence levels, the original methods using one-step M estimators are not recommended when both sample sizes are below 20. However, a modified version of such estimators may prove to be applicable to small samples (Wilcox, 2003, with an S-PLUS software function). Accessible introductions to M estimation can be found in Wilcox (1996, 2001, 2003) and Wilcox and Keselman (2003a). Note that when a population's distribution is not normal, not only a sample's median, but also the sample's M estimator, modified one-step M estimator, and 20% trimmed
mean can provide a more accurate estimate of the mean of the population than can the mean of that sample (Wilcox, 2003).

There are ongoing attempts to improve methods that are robust in the presence of nonnormality and heteroscedasticity. For example, research continues on the optimum amount of trimming (Sawilowsky, 2002) and on combining, in sequence, a test of symmetry followed by trimming, transforming to eliminate skew, and bootstrapping (Keselman, Wilcox et al., 2002). With regard to the construction of confidence intervals for the difference between two populations' centers, the goal is to develop methods that are more accurate under a wider range of circumstances, such as small sample sizes, than the methods that have been discussed in this and the preceding sections. One such robust method is the percentile $t$ bootstrap method applied to one-step M estimators (Keselman, Wilcox, & Lix, 2003; Wilcox, 2001, 2002, 2003).

There are various bootstrapping methods, to which we provide only a brief conceptual introduction. A bootstrap sample can be obtained by randomly sampling $k$ scores one at a time, with replacement, from the originally obtained sample of scores. Numerous such bootstrap samples are obtained. A targeted statistic of interest (e.g., the mean) is calculated for each bootstrap sample. Then a sampling distribution of all of these bootstrap-based values of the targeted statistic is generated.
This sampling distribution is intended to approximate more accurately the actual sampling distribution of the targeted statistic when assumptions are not satisfied, as contrasted with its supposed theoretical distribution (e.g., the normal or $t$ distributions) when assumptions are satisfied.

The goal of bootstrapping in the present context is to base the construction of confidence intervals and significance testing on a bootstrap-based sampling distribution that more accurately approximates the actual sampling distribution of the statistic than does the traditional supposed sampling distribution. Recall that what we called the margin of error is a function of the standard error of the relevant sampling distribution. Bootstrapping provides an empirical estimate of this standard error that can be used in place of what its theoretical value would be if assumptions were satisfied.

Wilcox (2001, 2002, 2003) provided specialized software for bootstrapping to construct confidence intervals. In the case of confidence intervals for the difference between two populations' locations (e.g., means, trimmed means, and medians), bootstrap samples are taken from the two original samples from the two populations. Refer to Wilcox (2003) for detailed descriptions of the applications of various bootstrap methods to attempt to improve the Welch method, Yuen method, and the median-comparison method for constructing such confidence intervals.
Researchers' acceptance of such relatively new bootstrap methods will depend in part on the methods' demonstrated abilities to produce accurate confidence levels.
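The conceptual steps of bootstrapping can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the percentile $t$ method or any of the specialized software cited above: the two samples are hypothetical, each bootstrap sample is assumed to have the same size as its original sample, the targeted statistic is the difference between two means, and the function name `bootstrap_se_diff` is introduced only for this example.

```python
import random
import statistics


def bootstrap_se_diff(sample_a, sample_b, n_boot=2000, seed=0):
    """Bootstrap the difference between two means.

    Returns the empirical standard error (the standard deviation of the
    bootstrap-based differences) and the list of those differences.
    """
    rng = random.Random(seed)      # seeded so that the sketch is reproducible
    diffs = []
    for _ in range(n_boot):
        # Resample each original sample with replacement.
        boot_a = rng.choices(sample_a, k=len(sample_a))
        boot_b = rng.choices(sample_b, k=len(sample_b))
        diffs.append(statistics.mean(boot_a) - statistics.mean(boot_b))
    return statistics.stdev(diffs), diffs


# Hypothetical scores from two independent groups.
group_a = [3.0, 5.0, 8.0, 2.0, 7.0, 6.0]
group_b = [1.0, 4.0, 2.0, 5.0, 3.0]
se_boot, boot_diffs = bootstrap_se_diff(group_a, group_b)
```

The list of bootstrap-based differences plays the role of the empirical sampling distribution described above, and its standard deviation is the empirical estimate of the standard error.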
Note that bootstrapping is intended to deal with violations of statistical assumptions. Bootstrapping cannot rectify flaws in the design of research, such as the use of original samples that are not representative of the intended populations of interest. For criticisms of bootstrap methods for constructing confidence intervals, refer to Gleser (1996). Consult Shaffer (2002) for a strategy for constructing confidence intervals that is based on a reformulation of the null hypothesis. The noncentrality approach to constructing confidence intervals is discussed in the next chapter where it becomes appropriate.

More than a cursory discussion of bootstrap methods would be beyond the scope of this book. For nontechnical general discussions of bootstrap methods, consult Diaconis and Efron (1983), Thompson (1993, 1999), and Wilcox and Keselman (2003a). For book-length introductory treatments refer to Chernick (1999) and Lunneborg (1999). For more advanced book-length treatments consult Davison and Hinkley (1997), Efron and Tibshirani (1993), and Sprent (1998).

This book only discusses confidence intervals that have a lower and an upper limit (two-sided confidence intervals). However, there are one-sided confidence intervals that involve only a lower or only an upper limit. For example, a researcher may be interested in acquiring evidence that a parameter, such as the difference between two populations' means, exceeds a certain minimum value. In such a case the lower limit for, say, a one-sided .95 CI is found by calculating the lower limit of a two-sided .90 CI. Consult Smithson (2003) for further discussion.

DEPENDENT GROUPS
Construction of confidence intervals when using dependent groups requires modification of methods that are applicable to independent groups. Dependent-groups designs include repeated-measures (within-groups and pretest-posttest) and matched-groups designs. It is well known that interpreting results from a pretest-posttest design can be problematic, especially if the design does not involve a control or other comparison group and random assignment to each group. (Consult Hunter & Schmidt, 2004, for a favorable view of the pretest-posttest design.) Also, the customary counterbalancing in repeated-measures designs does not protect against the possibility that a lingering effect of Treatment a when Treatment b is next applied may not be the same as the lingering effect of Treatment b when Treatment a is next applied (asymmetrical transfer of effect). We now use real data from a pretest-posttest design to illustrate construction of a confidence interval for dependent groups.
Table Table 22.1 depicts the the weights weights (Ib) of117 anorecticgirls girlsbefore beforeand and after treatment treatment (Everitt, (Everitt, cited cited in in raw raw data data presented Hand et et aI., al., 1994). after presented in in Hand 1 994) . Assuming normality, normality, we we construct construct aa ..95 CI for for the the mean difference Assuming 95 CI mean difference between posttreatment pretreatment scores scores in between posttreatment and and pretreatment in the the popUlation, population,
TABLE 2.1
Differences Between Anorectics' Weights (in lb) Posttreatment and Pretreatment

Participant   Posttreatment   Pretreatment   Difference (D)
 1             95.2            83.8            11.4
 2             94.3            83.3            11.0
 3             91.5            86.0             5.5
 4             91.9            82.5             9.4
 5            100.3            86.7            13.6
 6             76.7            79.6            -2.9
 7             76.8            76.9            -0.1
 8            101.6            94.2             7.4
 9             94.9            73.4            21.5
10             75.2            80.5            -5.3
11             77.8            81.6            -3.8
12             95.5            82.1            13.4
13             90.7            77.6            13.1
14             92.5            83.5             9.0
15             93.8            89.9             3.9
16             91.7            86.0             5.7
17             98.0            87.3            10.7

Note. Adapted from data of Brian S. Everitt, from A handbook of small data sets, by D. J. Hand, F. Daly, A. D. Lunn, K. J. McConway, and E. Ostrowski, 1994, London: Chapman and Hall. Adapted with permission of Brian S. Everitt.
$\mu_a - \mu_b$. We begin by defining a difference score, $D_i = Y_a - Y_b$, where $Y_a$ and $Y_b$ are the scores (weights in this example) of the same participants under Condition a (posttreatment weight) and Condition b (pretreatment weight), respectively. Thus, for Participant 1 in Table 2.1 we find that $D_1 = 95.2 - 83.8 = 11.4$. Because $\bar{Y}_a - \bar{Y}_b$ estimates $\mu_a - \mu_b$, and because it can easily be shown that the mean of such a set of $D$ values, $\bar{D} = (\sum D)/n$, is also equal to $\bar{Y}_a - \bar{Y}_b$, $\bar{D}$ too estimates $\mu_a - \mu_b$. Therefore, because $\bar{D}$ is a point estimate of $\mu_a - \mu_b$, a confidence interval for $\mu_a - \mu_b$ can be constructed around the value of $\bar{D}$.

Recall from expression 2.3 that the limits of the confidence intervals that are discussed in this chapter are given by the point estimate plus or minus the margin of error (ME). In the case of two dependent groups the limits are thus:
$$\bar{D} \pm ME_{dep}, \tag{2.10}$$

where dep stands for dependent, and

$$ME_{dep} = t^{*}\,\frac{s_D}{n^{1/2}}. \tag{2.11}$$
The symbol $s_D$ in Equation 2.11 represents the standard deviation of the values of $D$, and $s_D / n^{1/2}$ is the standard error of the mean of the values of $D$. The $s_D$ is calculated as an unbiased estimate of $\sigma_D$ in the population, so first $n - 1$ is used in the denominator of $s_D$, whereas $n$ is then used in the denominator of the standard error of the mean, as shown in Equation 2.11, so as not to correct twice for the bias. The $n$ that is used in Equation 2.11 is the number of paired observations (i.e., the number of $D$ values).

For a .95 CI we need to find the value of $t^{*}$ required for statistical significance at the two-tailed .05 level. The degrees of freedom for $t^{*}$ for the case of dependent groups is given by $n - 1$. In the current example $df = n - 1 = 17 - 1 = 16$. A row for $df = 16$ can be found in the $t$ table in most statistics textbooks. The critical value of $t^{*}$ will be 2.120. Using any statistical software to calculate the needed sample statistics one finds that $\bar{D} = 7.265$ and $s_D = 7.157$. Applying the required values to Equation 2.11 we find that:
$$ME_{dep} = 2.120\left(\frac{7.157}{17^{1/2}}\right) = 3.680.$$
Finally, applying the required values to expression 2.10 we find that the lower and upper limits of the .95 CI for $\mu_a - \mu_b$ are $\bar{D} \pm ME_{dep}$: $7.265 - 3.680 = 3.6$ lb and $7.265 + 3.680 = 10.9$ lb, respectively. We are approximately 95% confident (again, assuming normality) that the interval between 3.6 and 10.9 lb would contain the mean weight gain in the population. Because this interval does not contain the value 0, the gain in weight can be considered to be statistically significant at the .05 level, two tailed. Note again that in general we cannot nearly definitively attribute statistically significant gains or losses from pretreatment to posttreatment to the effect of a treatment unless there is random assignment to the treatment and to a control or comparison group. In our example of treatments for anorexia, a control group was included by the researcher, but it would be beyond the scope of this chapter to discuss further analysis of these data (e.g., analysis of covariance).

For dependent data in which the distribution of the $D$ values is skewed there is a Minitab macro (Wilcox, 1996) and S-PLUS software functions (Wilcox, 1997) for constructing an approximate confidence interval for the difference between two quantiles. (Quantiles were defined in chap. 1 of this book and are discussed further in chap. 5.) In the case of two dependent groups Wilcox (1997) also discussed and provided S-PLUS functions for constructing a confidence interval for the difference between two trimmed means, two medians, and other measures of two distributions' locations. However, in the case of a confidence interval for the difference between means of dependent groups, for much real data skew might not greatly distort confidence levels.
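The dependent-groups interval for the anorexia data can be rechecked with a few lines of Python. The sketch below is only an illustration: the function name `paired_ci` is ours, the data are the 17 pairs of weights from Table 2.1, and the critical value 2.120 is simply copied from the $t$ table as in the text rather than computed.

```python
import math
import statistics

# Weights (lb) from Table 2.1, listed in participant order.
post = [95.2, 94.3, 91.5, 91.9, 100.3, 76.7, 76.8, 101.6, 94.9,
        75.2, 77.8, 95.5, 90.7, 92.5, 93.8, 91.7, 98.0]
pre = [83.8, 83.3, 86.0, 82.5, 86.7, 79.6, 76.9, 94.2, 73.4,
       80.5, 81.6, 82.1, 77.6, 83.5, 89.9, 86.0, 87.3]


def paired_ci(a, b, t_star):
    """Return (mean D, s_D, ME, lower limit, upper limit) for paired scores.

    a, b   -- scores of the same participants under Conditions a and b
    t_star -- tabled two-tailed critical t for df = n - 1 (supplied by the user)
    """
    d = [x - y for x, y in zip(a, b)]       # difference scores D
    n = len(d)                              # number of paired observations
    d_bar = statistics.mean(d)              # point estimate of mu_a - mu_b
    s_d = statistics.stdev(d)               # uses n - 1 in the denominator
    me = t_star * s_d / math.sqrt(n)        # Equation 2.11
    return d_bar, s_d, me, d_bar - me, d_bar + me


T_STAR_DF16 = 2.120  # assumption: value read from a t table, as in the text
d_bar, s_d, me, lower, upper = paired_ci(post, pre, T_STAR_DF16)
print(f"mean D = {d_bar:.3f}, s_D = {s_d:.3f}, ME = {me:.3f}")
print(f".95 CI: {lower:.1f} to {upper:.1f} lb")
```

Passing the critical value in as an argument keeps the sketch free of third-party libraries; a reader who prefers to compute it would need a statistics package that provides $t$ quantiles.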
QUESTIONS
1. In what circumstance is a confidence interval most useful?
2. List three examples of familiar scales that are not listed in the text.
3. Provide a valid definition of confidence interval.
4. Define lower and upper confidence limits.
5. Define confidence level.
6. What is a common misinterpretation of, say, a 95% confidence interval?
7. To what does .95 refer in a 95% confidence interval?
8. In the concept of confidence intervals what is constant and what is a random variable?
9. Define margin of error in the context of confidence intervals.
10. List three factors that influence the magnitude of the margin of error, and what effect does each factor have?
11. Define probability coverage.
12. What is the trade-off between using a 95% or a 99% confidence interval?
13. Define independent groups.
14. What assumption is being made when a pooled estimate is made of population variance?
15. For tests and confidence intervals involving the difference between two means, what is the relationship between the confidence level and a significance level?
16. For all parameters is there always a simple relationship between the confidence level and a significance level?
17. Which factors influence the width of a confidence interval, and in what way does each factor influence this width?
18. In what specific ways is the game of horseshoe tossing analogous to the construction of confidence intervals?
19. Discuss the relationship between the practical significance of results in applied research and the magnitudes of the lower and upper limits of a confidence interval.
20. Define and briefly discuss the purpose of asymmetric confidence intervals.
21. Contrast random sampling and convenience sampling.
22. What is the purpose of Welch's approximate method for constructing confidence intervals, and when might a researcher consider using it?
23. What are two differences between the Welch method and the traditional method for constructing confidence intervals?
24. What is the effect of skew on the Welch method?
25. Define trimming and discuss its purpose.
26. What factors might influence the optimal amount of trimming?
27. What is the purpose of Yuen's method, and in what ways is it a hybrid method?
28. What is the irony if a researcher would never consider using trimmed means but would consider using medians?
29. Define a bootstrap sample.
30. In the context of this chapter, what is the general purpose of bootstrapping?
31. Define and state the purpose of one-sided confidence intervals.
32. List versions of dependent-groups designs.
33. Define difference scores and describe the role that they play in the construction of confidence intervals in the case of two dependent groups.
Chapter 3

The Standardized Difference Between Means
UNFAMILIAR AND INCOMPARABLE SCALES
A confidence interval for the difference between two populations' centers can be especially informative when the dependent variable is measured on the same familiar scale across the studies in an area. However, often dependent variables are abstract and are measured indirectly using relatively unfamiliar measures. For example, consider research that compares Treatments a and b for depression, a variable that is more abstract and more problematic to measure directly than would be the case with the familiar dependent variable measures that were listed in chapter 2. Although depression is very real to the person who suffers from it, there is no single, direct way for a researcher to define and measure it as one could do for the familiar scales. There are many tests of depression available to and used by researchers, as is true for many other variables outside of the physical and biomedical sciences. (In such cases the presumed underlying variable is called the latent variable, and the test that is believed to be measuring this dependent variable validly is called the measure of the dependent variable.) For example, suppose that confidence limits of 5 and 10 points mean difference in Beck Depression Inventory (BDI) scores between depressed groups that were given Treatment a or b are reported. Such a finding would be less familiar and less informative (except perhaps to specialists) than would be a report of confidence limits of 5 and 10 lb difference in mean weights in our earlier example of research on weight gain from treatments for anorexia.

Furthermore, suppose that a researcher conducted a study that compared the efficacy of Treatments a and b for depression and that another researcher conducted another study that also compared these two treatments. Suppose also that the first researcher used the BDI as the dependent variable measure, whereas the second researcher used, for a conceptual replication of the first study, a different measure, say, the MMPI Depression scale (MMPI-D). It would seem to be difficult to compare precisely or combine the results of these two studies because the two scales of depression are not the same. One would not know the relationship between the numerical scores on the two measures. An interval of, say, 5 to 10 points with regard to the difference between means on one measure of depression would not necessarily represent the same degree of difference in underlying depression as would an interval of 5 to 10 points with respect to another measure of depression.

We need a measure of effect size that places different dependent variable measures on the same scale so that results from studies that use different measures can be compared or combined. One such measure of effect size is the standardized difference between means, a frequently used measure to which we now turn our attention.

STANDARDIZED DIFFERENCE BETWEEN MEANS: ASSUMING NORMALITY AND A CONTROL GROUP

A standardized difference between means is like a $z$ score, $z = (Y - \bar{Y})/s$, that standardizes a difference in the sense that it divides it by a standard deviation. A $z$ score indicates how many standard deviations above or below $\bar{Y}$ a raw score is, and it can indicate more. For example, assuming a normal distribution of raw scores so that $z$ too will be normally distributed, inspecting a table of the standardized normal curve in any introductory statistics textbook one finds that approximately 84% of $z$ scores fall below $z = +1.00$ (also inspect Fig. 3.1 that is displayed and discussed later). Therefore, a $z$ score can provide a very informative result, such as indicating that a score at $z = +1.00$ is outscoring approximately 84% of the other scores. Recall that in a normal curve approximately 34% of the scores lie between $z = 0$ and $z = +1.00$, approximately 14% of the scores lie between $z = +1.00$ and $z = +2.00$, and approximately 2% of the scores exceed $z = +2.00$. Of course, because of symmetry these same percentages apply if one substitutes minus signs for the plus signs in the previous sentence. Thus, under normality, a score at $z = +1.00$ is exceeding approximately 2% + 14% + 34% + 34% = 84% of the scores.

Using $z$ scores, or $z$-like measures, one can also compare results obtained from different scales, results that would not be comparable if one used raw scores. For example, one cannot directly compare a student's grade point average (GPA) with that student's Scholastic Aptitude Test (SAT) scores; they are on very different scales. The range of GPA scores is usually from 0 to 4.00, whereas it is safe to assume that anyone reading this book and who has taken the SATs scored well above 4 on them. In this example it would be meaningless to say that such a person's score on the SAT was higher than his or her GPA. Similarly, it would be meaningless to conclude that most people are heavier than they are tall when one finds that most people have more pounds of weight than they have inches of height. However, one can meaningfully compare the otherwise incomparable by using $z$ scores instead of
raw scores in such examples. If one's $z$ score on GPA was higher than one's $z$ score on the SAT, relative to the same comparison group, then in fact that person did perform better on GPA than on the SAT (an overachiever). By using $z$-like measures of effect size researchers can compare or meta-analyze results from studies that use different dependent variable measures of the same underlying variable.

Assuming normality, one can obtain the same kind of information from a $z$-like measure of effect size as one can obtain from a $z$ score. Suppose that one divides the difference between a treated group's mean, $\bar{Y}_e$ (e stands for a treated or experimental group), and a control group's mean, $\bar{Y}_c$, by the standard deviation of the control group's scores, $s_c$. One then has for one possible estimator of an effect size a standardized difference between means,

$$d = \frac{\bar{Y}_e - \bar{Y}_c}{s_c}. \tag{3.1}$$

Equation 3.1 estimates the parameter:

$$\Delta = \frac{\mu_e - \mu_c}{\sigma_c}. \tag{3.2}$$
The symbol $\Delta$ is the uppercase Greek letter D, and it stands for difference. The $d$ version of the effect-size estimator in Equation 3.1 is attributable to Gene V. Glass (e.g., Glass et al., 1981). In Equation 3.1 the standard deviation has $n - 1$ in its denominator.

Similar to a $z$ score, $d$ estimates how many $\sigma_c$ units above or below $\mu_c$ the value of $\mu_e$ is. Again if, say, $d = +1.00$, one is estimating that the average (mean) scoring members of a treated population score one $\sigma_c$ unit above the scores of the average scoring members of the control population. Also, if normality is assumed in this example, the average-scoring members of the treated population are estimated to be outscoring approximately 84% of the members of the control population. If, say, $d = -1.00$, one would estimate that the average-scoring members of the treated population are outscoring only approximately 16% of the members of the control population.

Of course, numerical results other than $d = +1.00$ or $-1.00$ are likely to occur, including results with decimal values, and they are similarly interpretable from a table of the normal curve if one assumes normality. Figure 3.1 illustrates the example of $\Delta = +1.00$. To use Fig. 3.1 to reflect on the implication of values of $d$ that have led to an estimate other than $\Delta = +1.00$ the reader can imagine shifting the distribution of the treated population's scores to the right or to the left so that $\mu_e$ falls elsewhere on the control group's distribution.
FIG. 3.1. Assuming normality, when $\Delta = +1.00$ the mean score in the treated population will exceed approximately 84% of the scores in the control population. (Horizontal axis: $z$ scores from -4 to 4.)
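The normal-curve reading of $d$ can be sketched in Python as follows. This is a minimal illustration, not part of the original analysis: the function names are ours, the two small samples are hypothetical numbers invented only to exercise the code, and the percentage interpretation holds only under the normality assumption discussed above.

```python
from statistics import NormalDist, mean, stdev


def glass_d(treated, control):
    """Equation 3.1: mean difference divided by the control group's s.

    stdev() uses n - 1 in its denominator, as the text specifies.
    """
    return (mean(treated) - mean(control)) / stdev(control)


def percent_of_controls_outscored(d):
    """Under normality, the percentage of the control population that
    scores below the mean of the treated population."""
    return 100 * NormalDist().cdf(d)


# Hypothetical scores, invented for illustration only.
control = [10.0, 12.0, 14.0, 16.0, 18.0]
treated = [13.0, 15.0, 17.0, 19.0, 21.0]

d = glass_d(treated, control)
print(f"d = {d:.2f}")
for value in (+1.00, -1.00):
    pct = percent_of_controls_outscored(value)
    print(f"d = {value:+.2f}: outscores about {pct:.0f}% of controls")
```

The standard normal cumulative distribution function plays the role of the table of the normal curve, so values of $d$ with decimals are handled in the same way as $d = \pm 1.00$.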
Recall that the mean of $z$ scores is always 0, so the mean of the $z$ scores of the control population is equal to 0. The mean of the $z$ scores of the experimental (treated) population is also equal to 0 with regard to the distribution of $z$ scores of its own population. However, in the example depicted in Fig. 3.1 the mean of the raw scores of the experimental population corresponds to $z = +1.00$ with regard to the distribution of $z$ scores of the control population.

The $d$ estimator of $\Delta$ has been widely used since it was popularized in the 1970s (e.g., Smith & Glass, 1977). Grissom (1996) provided many references for examples in research on psychotherapy outcome, and he also provided examples of averaged values of $d$ from many meta-analyses on the efficacy of psychotherapy. These examples illustrate the use of $d$ and $\Delta$, so we consider them briefly here. Note first that when comparing various therapy groups (with the same disorder) with control groups, one should not expect to obtain the same values of $d$ from study to study because (in addition to sampling variability) therapies of varying efficacy should produce different values of $d$. Note also that sometimes the measure of the dependent variable is one in which a higher score is a "healthier" score (e.g., the measure of positive parent-child relationships in the example that will soon be discussed), but more often the clinical measure is one in which a lower score is a healthier score (e.g., a measure of depression). However, Equations 3.1 and 3.2 can be rewritten with the mean of Group c preceding the mean of Group e, so that the sign of $d$ or $\Delta$, but not their magnitudes, would change. Altering Equation 3.1 in this way when needed assures that when the treated group (Group e) has a better (healthier) outcome than the control group the value of $d$ will be positive, and when the control group has a better outcome $d$ will be negative. Altering Equation 3.1 in this way Grissom (1996) estimated that the median value of $d$ was +0.75, with values of $d$ ranging from -0.35 to +2.47. Therefore, on the whole, therapies appear to be efficacious (median $d = +0.75$), with some therapies in some
circumstances extremely so ($d = +2.47$), and very few seeming to be harmful (the rare negative values of $d$, e.g., -0.35).

There is a minority opinion that psychotherapies have no specific benefits, only a placebo effect wherein any improvement in people's mental health is merely attributable to their expectation of therapeutic success, a kind of self-healing by a self-fulfilling prophecy. To explore this point of view Grissom (1996) averaged $d$ values that compared treated participants with placebo (phony or minimum treatment) participants and then averaged $d$ values that compared placebo and control (no treatment) groups. (Note here the adaptability of Equations 3.1 and 3.2. One can use these equations to compare any two groups that undergo any two different conditions. The conditions do not have to be strictly treatment vs. control.) When comparing the treatment group with the placebo group the median value of $d$ was +0.58, suggesting that therapy provides more than a mere expectation of improvement (placebo effect). However, when comparing placebo with control (placebo group replaces treatment group in Equation 3.1) the median value of $d$ was +0.44. Together these results suggest that there are placebo effects but that there is more to the efficacy of therapy than just such placebo effects.
This conclusion is not necessarily definitive but these applications of standardized-difference estimators of effect size have been more informative than would have been the case if only null-hypothesis significance testing had been undertaken to compare therapy, control, and placebo. However, we are not disparaging significance testing. Often in this book there are examples of the complementary use of significance testing and effect sizes, and there are discussions of situations in which a researcher's focus might be either on significance testing or on effect sizes.

With regard to the adaptability of Equations 3.1 and 3.2, in research without a treated group (e.g., women's scores compared to men's scores) the experimental group's representation in the equations is replaced by any kind of group whose performance one wants to evaluate with regard to the distribution of some baseline comparison group. Therefore, more general forms of Equations 3.1 and 3.2 are, respectively,
$$d = \frac{\bar{Y}_a - \bar{Y}_b}{s_b}, \tag{3.3}$$

and

$$\Delta = \frac{\mu_a - \mu_b}{\sigma_b}. \tag{3.4}$$
For a real example of such an application of $d$, we use a study in which the healthy parent-child relationship scores of mothers of disturbed
(schizophrenic) children (Mother Group a) were compared to those of mothers of normal children (Mother Group b), who served as the control or comparison group (Werner, Stabenau, & Pollin, 1970). In this example $\bar{Y}_a = 2.10$, $\bar{Y}_b = 3.55$, and $s_b = 1.88$. Therefore, $d = (2.10 - 3.55)/1.88 = -0.77$. Thus, the mothers of the disturbed children scored on average about three quarters of a standard deviation unit below the mean of the comparison mothers' scores. Assuming normality for now, inspecting a table of the normal curve one finds that at $z = -0.77$ we can estimate that the average-scoring mothers of the disturbed children would be outscored by approximately 78% of the comparison mothers. Also, a two-tailed $t$ test yielded a statistically significant difference between the two means at the $p < .05$ level.

The results are consistent with three possible interpretations: (a) disturbance in parents genetically and/or experientially causes disturbance in their children, (b) disturbance in children causes disturbance in their parents, or (c) some combination of the first two interpretations can explain the results. We have assumed for simplicity in this example that the measure of the underlying dependent variable was valid and that the assumptions of normality and homoscedasticity were satisfied. Reliability of the measure of the dependent variable in the present example is not likely to be among the highest.
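The arithmetic of the mothers example can be verified from the reported summary statistics alone. The short Python sketch below uses a helper name of our own choosing (`d_from_summary`) and, like the text, assumes normality when converting $d$ into a percentage.

```python
from statistics import NormalDist


def d_from_summary(mean_a, mean_b, s_b):
    """Equation 3.3: (mean of Group a - mean of Group b) / s of baseline Group b."""
    return (mean_a - mean_b) / s_b


# Summary statistics reported for Werner, Stabenau, and Pollin (1970).
d = d_from_summary(2.10, 3.55, 1.88)

# Percentage of the comparison (Group b) distribution lying above the
# Group a mean, assuming normality.
pct_above = 100 * (1 - NormalDist().cdf(d))

print(f"d = {d:.2f}")
print(f"about {pct_above:.0f}% of comparison mothers outscore the Group a mean")
```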
Hunter and Schmidt (2004) (2004) dis discussed unreliability and provided software for the correction of a difference between means for unreliability unreliability in the dependstandardized difference depend other artifacts). artifacts). Unreliability is ent variable (as well as correcting for other 4.. discussed in chapter 4 EQUAL OR UNEQUAL VARIANCES EQUAL OR UNEQUAL VARIANCES
If the two populations that are being compared are assumed to have equal variances, then it is also assumed that $\sigma_a = \sigma_b = \sigma$, the common population standard deviation. In this case a better estimate of the denominator of a standardized difference between population means can be made if one pools the data from both samples to estimate the common $\sigma$ instead of using $s_b$ that is based on the data of only one sample. The pooled estimator, $s_p$, is based on a larger total sample ($N = n_a + n_b$) and is a less biased and less variable estimator of $\sigma$ than $s_b$ would be. To calculate $s_p$ take the square root of the value of $s_p^2$ that is obtained from a printout or from Equation 2.2 in chapter 2 that uses $n - 1$ in the denominator of each of the variances that is being pooled. The estimator of effect size in this case is:

$$g = \frac{\bar{Y}_a - \bar{Y}_b}{s_p}, \tag{3.5}$$
which is known as Hedges' $g$ (Hedges & Olkin, 1985), and estimates

$$g_{pop} = \frac{\mu_a - \mu_b}{\sigma}. \tag{3.6}$$
We always use the $g$ notation when using the standard deviation that uses $n - 1$ in the denominator of each of the variances that is being pooled. Use of the $g$ notation helps to avoid confusing Glass' $d$ (no pooling) with the estimators Cohen's $d_s$ (pooling using $n - 1$; this is the same as $g$, so $d_s$ is not used again in this book) or Cohen's $d$ (pooling using $n$ instead of $n - 1$). Throughout this book we distinguish between situations in which Hedges' $g$ or Glass' $d$ might be preferred.

When $\sigma_a = \sigma_b = \sigma$, $g_{pop} = \Delta$. However, in this case it will still be very unlikely that $s_a = s_b = s_p$ due to sampling variability of sample standard deviations, the more so the smaller the sample sizes, so it will be very unlikely that $d = g$. Similarly, differing estimates of $\Delta$ will likely result from using $s_a$ instead of $s_b$ in the denominator of the estimator even when $\sigma_a = \sigma_b$ because sampling variability can cause $s_a$ to differ from $s_b$. Research reports should clearly state which effect-size parameter is being estimated and which $s$ has been used in the denominator of the estimator.

Both Hedges' $g$ and Glass' $d$ have some positive bias (i.e., tending to overestimate their respective parameters), the more so the smaller the sample sizes and the larger the effect size in the population. Although $g$ is less biased than $d$, its bias can be reduced by using Hedges' approximately unbiased adjusted $g$, $g_{adj}$:

$$g_{adj} = g\left[1 - \frac{3}{4\,df - 1}\right], \tag{3.7}$$
where $df = n_a + n_b - 2$ (Hedges, 1981, 1982; Hedges & Olkin, 1985). Glass' $d$ can also have its bias reduced by substituting $d$ for $g$ in Equation 3.7 and using $df = n_c - 1$, where $n_c$ is the $n$ for the sample whose $s$ is used in the denominator. The two adjusted estimators are seldom used because bias and bias reduction have traditionally been believed to be slight unless sample sizes are very small (Kraemer, 1983). Hunter and Schmidt (2004) demonstrated why they consider the bias to be negligible when sample sizes are greater than 20. [These authors also provided formulas for adjusting the point-biserial correlation for its slight bias.] However, as was discussed in the section entitled Controversy About Null-Hypothesis Significance Testing in chapter 1, regarding the debate about whether effect sizes should be reported when results are statistically insignificant, some believe that the bias is sufficient to cause concern. Consult the references that we provided in that section of chapter 1 for discussion of this issue, and also refer to Barnette and McLean (1999) for their results on the relationship between sample size and effect size.

Recall that if population means differ it is also likely that population standard deviations differ. This heteroscedasticity can cause problems.
First, because $\sigma_a \neq \sigma_b$, $(\mu_a - \mu_b)/\sigma_b \neq (\mu_a - \mu_b)/\sigma_a$. In this case the $\Delta$ parameter that is being estimated using one of the samples as the control or baseline group that provides the estimate of the standardizer will not be the same as the $\Delta$ that would be estimated if we used the other sample as the baseline group that provides the estimate of the standardizer. Also, the formulas provided by Hedges and Olkin (1985) for constructing confidence intervals for $g_{pop}$ assume homoscedasticity. Hogarty and Kromrey (2001) demonstrated the influence of heteroscedasticity and nonnormality on Cohen's d and Hedges' g. To counter heteroscedasticity Cohen (1988) suggested using for $\sigma$ the square root of the mean of $\sigma_a^2$ and $\sigma_b^2$, estimated by

$$s' = \sqrt{\frac{s_a^2 + s_b^2}{2}} \tag{3.8}$$
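The estimators discussed up to this point can be collected in one short sketch. It is only an illustration under stated assumptions: the function name and the dictionary keys are hypothetical, the pooled standard deviation uses the usual n - 1 weighting, and the two samples are made-up numbers.

```python
from math import sqrt
from statistics import mean, stdev


def standardized_differences(a, b):
    """Return several standardized-difference estimates for samples a and b.

    All names here are illustrative; they are not the book's notation.
    """
    na, nb = len(a), len(b)
    diff = mean(a) - mean(b)
    sa, sb = stdev(a), stdev(b)  # each uses n - 1 in its denominator
    df = na + nb - 2
    # Pooled standard deviation (assumes homoscedasticity).
    sp = sqrt(((na - 1) * sa**2 + (nb - 1) * sb**2) / df)
    g = diff / sp
    return {
        "d_using_sa": diff / sa,                 # sample a as baseline
        "d_using_sb": diff / sb,                 # sample b as baseline
        "g": g,                                  # pooled standardizer
        "g_adj": g * (1 - 3 / (4 * df - 1)),     # Equation 3.7
        "d_prime": diff / sqrt((sa**2 + sb**2) / 2),  # Equation 3.8 standardizer
    }


# Made-up scores, for illustration only.
group_a = [12, 15, 11, 18, 14]
group_b = [10, 9, 13, 8, 10]
for name, value in standardized_differences(group_a, group_b).items():
    print(f"{name}: {value:.3f}")
```

With equal sample sizes the pooled variance equals the mean of the two variances, so g and the estimate based on s' coincide; they differ only when the sample sizes differ.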
Researchers who use $s'$ (our notation, not Cohen's) as the estimator of $\sigma$ instead of the previously discussed estimators of $\sigma_a$ or $\sigma_b$ should recognize that they are estimating the $\sigma$ of a hypothetical population whose $\sigma$ is between $\sigma_a$ and $\sigma_b$. In this case, therefore, such researchers are estimating a $\Delta$ in a hypothetical population, an effect size that we label here $\Delta'$. The burden would be on the researcher to interpret the results in terms of the hypothetical population to which this effect size relates. Researchers should also recognize that Cohen (1988) introduced $\Delta'$ originally for the purpose of conducting a power analysis for estimating the approximate needed sample sizes prior to beginning research. This purpose is different from the present purpose of using an effect size to analyze results of completed research. Huynh (1989) suggested methods for decreasing the bias and instability (variability) of Cohen's estimator of $\Delta'$ under heteroscedasticity.

TENTATIVE RECOMMENDATIONS

When homoscedasticity is assumed the best estimator of the common $\sigma$ is $s_p$, resulting in the g or $g_{adj}$ estimator of effect size. If homoscedasticity is not assumed use the s of whichever sample is the reasonable baseline comparison group. For example, use the s of the control or placebo group, or, if a new treatment is being compared with a standard treatment, use the s of the sample that is receiving the standard treatment. It may sometimes be informative to calculate and report two estimates of $\Delta$, one based on $s_a$ and one based on $s_b$.
For example, in studies that compare genders one can estimate a $\Delta_F$ to estimate where the mean female score stands in relation to the population distribution of males' scores, and one can estimate a $\Delta_M$ to estimate where the mean male score stands in relation to the population distribution of females' scores. A modest additional suggestion is to use ns > 10 and ones that are as close to equal as possible (Huynh, 1989; Kraemer, 1983).

One should be cautious about generalizing our suggestion to estimate two types of effect sizes on the same data because of a valid concern that stems from significance testing. In significance testing one should not conduct more than one statistical test on the same set of data unless one compensates for the capitalizing on chance that results from such multiple testing. Capitalizing on chance is a cumulation of Type I error that results from inappropriately providing more than one opportunity to obtain statistical significance within the same data set. Thus, a researcher has a greater chance of at least once attaining, say, the p < .05 level of statistical significance in a data set if conducting two tests of significance on those data than if conducting one test on those data. The chance probability of at least one of two such tests attaining the p < .05 significance level is greater than .05, just as the probability of a basketball player making one basket in either of two attempts is greater than the probability of making a basket in one attempt. The well-known and simplest (but not always optimum) solution would be to conduct the separate tests at a more stringent adopted level of significance (Bonferroni-Dunn adjustment); say, conducting each of two tests at the p < .025 level.
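The arithmetic behind this concern can be sketched briefly. The sketch assumes, purely for illustration, that the tests are independent and that every null hypothesis is true; two tests on the same data are generally not independent, so the first figure is only indicative. The function names are hypothetical.

```python
def familywise_error(alpha, k):
    """Chance of at least one Type I error in k tests.

    Assumes the k tests are independent and all null hypotheses are true.
    """
    return 1 - (1 - alpha) ** k


def bonferroni_alpha(alpha, k):
    """Per-test significance level under the Bonferroni-Dunn adjustment."""
    return alpha / k


# Two tests, each at the .05 level: the overall chance exceeds .05.
print(round(familywise_error(0.05, 2), 4))      # 0.0975

# Two tests, each at the adjusted .025 level: the overall chance is below .05.
per_test = bonferroni_alpha(0.05, 2)
print(per_test)                                 # 0.025
print(round(familywise_error(per_test, 2), 6))  # 0.049375
```

The adjustment keeps the overall Type I error rate at or below the nominal level whether or not the tests are independent, which is why it is simple but not always optimum.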
Effect-size methodology is barely out of its infancy, and perhaps one can be flexible until some widely accepted practices develop about applying more than one estimator of effect sizes to the same data set (but not flexible about inflating Type I error). Indeed, as discussed later in this book, different kinds of measures of effect sizes can provide different informative perspectives on the same data set, so there will be examples in which we apply not two but several different kinds of measures of effect sizes to the same data set.

Although there are data sets for which we illustrate application of two or more measures of effect size for our pedagogical purpose, an author of a research report might choose to calculate and report only an estimator that the author can justify as being most appropriate. Nonetheless, again we state that if more than one estimator is calculated the researcher should report all such calculated estimators. It would be unacceptable to report only the effect size of which the magnitude is most supportive of the case that the researcher is trying to make. Refer to Hogarty and Kromrey (2001) for further discussion. Note again that at the time of this writing editors of journals that recommend or require the reporting of effect sizes do not specify which kinds of effect sizes are to be reported. The important point is that at least one appropriate estimate of effect size should be reported whenever such reporting would be informative.
In areas of research in which the measure of the dependent variable is a common test that has been normed on a vast sample, such as has been done for many major clinical and educational tests, there is another solution to heteroscedasticity. (A normed test is one whose distribution's shape, mean, and standard deviation have already been determined by applying the test to, e.g., many thousands of people [the normative group]. For example, there are norms for the scales of the MMPI personality inventory and for various IQ and academic admissions tests, such as the SAT and Graduate Record Examination.) In this case, for an estimator of $\Delta$ one can divide $\bar{Y}_e - \bar{Y}_n$ by $s_n$, where n stands for the normative group (Kendall & Grove, 1988; Kendall, Marrs-Garcia, Nath, & Sheldrick, 1999). The use of such a constant $s_n$ by all researchers who are working in the same field of research decreases uncertainty about the value of $\Delta$. This is so because when not using the common $s_n$ different researchers will find greatly varying values of d, even if their values of $\bar{Y}_a - \bar{Y}_b$ do not differ very much, simply due to the varying values of s from study to study.

For an example of the method, suppose that for a normative group of babies $\bar{Y}_n = 100$ and $s_n = 15$ on a test of their developmental quotient, a test whose population of scores is normally distributed. Suppose further that a special diet or treatment that is given to an experimental group of babies results in their $\bar{Y}_e = 110$. In this case we estimate that $\Delta = (110 - 100)/15 = +0.67$, with the average-scoring treated babies scoring 0.67 standard deviation units above the average of the normative babies. Inspection of a table of the normal curve indicates that a z of +0.67 is a result that outscores approximately 75% of the normative babies.
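The babies example can be checked with a few lines of code in place of a printed normal-curve table. This is a minimal sketch that assumes, as the example does, a normal distribution of normative scores; the function name is hypothetical.

```python
from statistics import NormalDist


def normed_delta(mean_experimental, mean_norm, sd_norm):
    """Estimate of Delta that uses the normative group's standard deviation."""
    return (mean_experimental - mean_norm) / sd_norm


delta = normed_delta(110, 100, 15)
# Proportion of a normal normative population scoring below the treated mean.
proportion_outscored = NormalDist().cdf(delta)

print(round(delta, 2))                 # 0.67
print(round(proportion_outscored, 2))  # 0.75
```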
When a comparison population's distribution is not normal the interpretation of a d or a g in terms of estimating the percentile standing of the average-scoring members of a group with respect to the normal distribution of the baseline group's scores would not be valid. Also, because standard deviations can be very sensitive to a distribution's shape, as was compellingly illustrated by Wilcox and Muska (1999), nonnormality can greatly influence the value of a $\Delta$, $g_{pop}$, or their estimators. In chapter 5 we discuss measures of effect size (the probability of superiority and related measures) that do not assume homoscedasticity or normality.

Finally, a treatment may have importantly different effects on different dependent variables. For example, in multiply addicted persons a treatment for addiction may have a different effect on one addiction than on another. Therefore, we should not generalize about the magnitude and sign of an effect size from one dependent variable to a supposedly related dependent variable. For example, it would be very important to know if a treatment that apparently successfully targeted alcoholism resulted in an increase in smoking.
ADDITIONAL STANDARDIZED-DIFFERENCE EFFECT SIZES WHEN THERE ARE OUTLIERS

The previous section was entitled Tentative Recommendations because other types of estimators have been proposed for use when there are outliers that can influence the means and standard deviations. One simple suggestion for a somewhat outlier-resistant estimator is to trim the highest and lowest score from each group, replace $\bar{Y}_a - \bar{Y}_b$ with $Mdn_a - Mdn_b$, and use as the standardizer, in place of the standard deviation, the range of the trimmed data or some other measure of variability that is more outlier resistant than is the standard deviation (Hedges & Olkin, 1985). One possible such alternative to the standard deviation is the median absolute deviation from the median (MAD). Another alternative standardizer is $.75R_{iq}$, as proposed by Laird and Mosteller (1990) to provide some resistance to outliers while using a denominator that approximates the standard deviation. Both the MAD and $R_{iq}$ were introduced in chapter 1, from which recall that $.75R_{iq}$ approximates s when there is normality. Note that, as Wilcox (1996) pointed out, using one of the relatively outlier-resistant measures of variability instead of the standard deviation does not assure us that the variabilities of the two populations will be equal when their means are not equal. Also, although at the current stage of development of methodology for effect sizes it is appropriate in this book to present a great variety of measures, eventually the field should settle on the use of a reduced number of appropriate measures. A more consistent use of measures of effect size by primary researchers would facilitate the comparison of results from study to study.
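A rough sketch of two of these outlier-resistant estimators follows. Several choices in it are assumptions made for illustration and are not prescribed by the sources cited above: the standardizer is taken from the baseline group, the MAD is left unscaled, the quartiles use the "inclusive" method of Python's `statistics.quantiles`, and the trimmed-range standardizer is not implemented. The function names and data are hypothetical.

```python
from statistics import median, quantiles


def mad(scores):
    """Median absolute deviation from the median (unscaled)."""
    m = median(scores)
    return median([abs(y - m) for y in scores])


def interquartile_range(scores):
    """Interquartile range; the quartile convention is an assumption."""
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return q3 - q1


def resistant_effect_sizes(treated, baseline):
    """Median difference standardized by two outlier-resistant measures."""
    diff = median(treated) - median(baseline)
    return {
        "median_diff_over_mad": diff / mad(baseline),
        "median_diff_over_.75iqr": diff / (0.75 * interquartile_range(baseline)),
    }


# Made-up data; the treated group contains one extreme outlier (120).
baseline = [1, 2, 3, 4, 5, 6, 7, 8, 9]
treated = [4, 5, 6, 7, 8, 9, 10, 11, 120]
print(resistant_effect_sizes(treated, baseline))
```

Because both the numerator and the standardizers are based on medians and quartiles, the single extreme score of 120 has no influence on either estimate in this example.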
Nonetheless, we briefly turn now to some additional alternatives.
TECHNICAL NOTE 3.1: A NONPARAMETRIC ESTIMATOR OF STANDARDIZED-DIFFERENCE EFFECT SIZES

For nonparametric estimation of standardized-difference effect sizes for pretest-posttest designs consult Kraemer and Andrews (1982) and Hedges and Olkin (1985). Hedges and Olkin (1984, 1985) also provided a nonparametric estimator of a standardized-difference effect size that does not require pretest data or assume homoscedasticity. This method estimates a $\Delta^*$ using $d_c^*$ (our notation, not Hedges' and Olkin's), defined as

$$d_c^* = \Phi^{-1}(p_c) \tag{3.9}$$

where $\Phi^{-1}$ is the inverse of the standard normal cumulative distribution function and $p_c$ represents the proportion of control group scores that are below $Mdn_a$. Under normality $d_c^*$ estimates $\Delta^* = (\mu_e - \mu_c)/\sigma_c$. We do not demonstrate this method here because the sampling distribution of $d_c^*$ is not known, so methods for significance testing and for constructing a confidence interval for $\Delta^*$ are not known.
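Although the method is not demonstrated in the text, the point estimate itself is simple to compute. The following is a minimal sketch under two assumptions: the median in question is that of the other (treated) group, and the proportion lies strictly between 0 and 1, because the inverse normal function is undefined at those extremes. The function name and data are hypothetical.

```python
from statistics import NormalDist, median


def nonparametric_d(treated, control):
    """Inverse-normal transform of the proportion of control scores that
    fall below the treated group's median."""
    cutoff = median(treated)
    p_c = sum(y < cutoff for y in control) / len(control)
    if not 0 < p_c < 1:
        raise ValueError("proportion must be strictly between 0 and 1")
    return NormalDist().inv_cdf(p_c)


# Made-up data: 8 of the 10 control scores fall below the treated median (8.5).
control = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
treated = [7, 8, 9, 10]
print(round(nonparametric_d(treated, control), 2))  # 0.84
```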
Recall that ideally a statistic should be resistant to outliers, as is the MAD, and have relatively low sampling variability to increase power and to narrow confidence intervals. Recall also from the previous section that alternatives to the standard deviation, such as the MAD, may provide better denominators for standardized-difference estimators of effect size than s does when there are outliers. However, the biweight standard deviation, $s_{bw}$ (Goldberg & Iglewicz, 1992; Lax, 1985), appears to be superior to the MAD as a measure of variability. Therefore, a more outlier-resistant alternative estimator of a standardized-difference effect size might be

$$d_{bw} = \frac{\bar{Y}_e - \bar{Y}_c}{s_{bwc}} \tag{3.10}$$
where $s_{bwc}$ is the square root of the biweight midvariance, $s_{bw}^2$, of the control group or other baseline comparison group. Lax (1985) found the biweight midvariance to be the most outlier resistant and most stable (least sampling variability) of any of the very many measures of variability that were studied. Manual calculation of $s_{bw}^2$ is laborious (Wilcox, 1996, 1997, 2003). First, calculate for each score in the control group $Z_i = (Y_i - Mdn_c)/9MAD$. Next, set $a_i = 1$ if $|Z_i| < 1$ and set $a_i = 0$ if $|Z_i| \geq 1$. Then, find

$$s_{bw}^2 = \frac{n_c \sum a_i (Y_i - Mdn_c)^2 (1 - Z_i^2)^4}{\left[\sum a_i (1 - Z_i^2)(1 - 5Z_i^2)\right]^2} \tag{3.11}$$
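The steps above can be sketched in a few lines, which makes the laborious manual calculation unnecessary for small illustrations. This sketch assumes the standard form of the biweight midvariance and an unscaled MAD; the function name and data are hypothetical, and the sketch is not a substitute for the cited Minitab or S-PLUS routines.

```python
from math import sqrt
from statistics import median


def biweight_midvariance(scores):
    """Biweight midvariance of one group's scores.

    Scores with |Z| >= 1 receive weight 0 and so do not contribute.
    """
    n = len(scores)
    mdn = median(scores)
    mad = median([abs(y - mdn) for y in scores])
    if mad == 0:
        raise ValueError("MAD is 0, so Z is undefined")
    numerator = 0.0
    denominator = 0.0
    for y in scores:
        z = (y - mdn) / (9 * mad)
        if abs(z) < 1:  # a_i = 1; otherwise a_i = 0 and the score is skipped
            numerator += (y - mdn) ** 2 * (1 - z**2) ** 4
            denominator += (1 - z**2) * (1 - 5 * z**2)
    return n * numerator / denominator**2


# Made-up control scores with one extreme outlier.
control = [1, 2, 3, 4, 5, 6, 7, 8, 9, 1000]
s_bw = sqrt(biweight_midvariance(control))  # biweight standard deviation
print(s_bw)
```

In this example the score of 1000 has |Z| far above 1 and is given zero weight, so the result reflects the spread of the other nine scores rather than the outlier.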
Minitab macros (Wilcox, 1996) and S-PLUS software functions (Wilcox, 1997, 2003) are available for calculating $s_{bw}^2$, for testing the significance of the difference between two groups' values of $s_{bw}^2$ (with apparently good power and good control of Type I error), and for constructing an accurate confidence interval for this difference.

CONFIDENCE INTERVALS FOR A STANDARDIZED-DIFFERENCE EFFECT SIZE
Of course, the smaller the sample size, the greater the variability of the sampling distribution of an estimator. Thus, the smaller the sample size, the more likely it is that there will be a large discrepancy between a value of d or g and the true value of the effect size that they are estimating. (Consult Bradley, Smith, & Stoica, 2002, and Begg, 1994, for discussions of consequences of this fact.) Therefore, a confidence interval for a standardized-difference effect size can be very informative.

More accurate, but more complex methods that we prefer for constructing confidence intervals for a standardized-difference effect size are discussed later. First, a simple approximate method for manual calculation is demonstrated. This method becomes less accurate to the extent that the assumptions of homoscedasticity and normality are not met, the smaller the sample sizes (say, $n_a < 10$ and $n_b < 10$), and the more that $\Delta$ departs from 0. Note that because we are assuming homoscedasticity $\Delta = g_{pop}$, so the confidence interval that we give for $\Delta$ applies to $g_{pop}$.
An approximate 95% CI for $\Delta$ is given by

$$d \pm z_{.025}\, s_d \tag{3.12}$$

where $z_{.025}$ is the positive value of z that has 2.5% of the area of the normal curve beyond it, namely, z = +1.96, and $s_d$ is the estimated standard deviation of the theoretical sampling distribution of d. To calculate $s_d$, following Hedges and Olkin (1985), take the square root of

$$s_d^2 = \frac{n_a + n_b}{n_a n_b} + \frac{d^2}{2(n_a + n_b)} \tag{3.13}$$
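Expressions 3.12 and 3.13 can be sketched as a small function. The function name and its `confidence` parameter are hypothetical conveniences; the critical z value is obtained from the standard library rather than from a table, so it is 1.95996 rather than the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist


def approximate_ci(d, n_a, n_b, confidence=0.95):
    """Approximate confidence limits for Delta (Expressions 3.12 and 3.13).

    Assumes homoscedasticity and normality, as the approximate method does.
    """
    variance_d = (n_a + n_b) / (n_a * n_b) + d**2 / (2 * (n_a + n_b))
    s_d = sqrt(variance_d)
    # Positive z with (1 - confidence) / 2 of the normal curve beyond it.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return d - z * s_d, d + z * s_d


lower, upper = approximate_ci(0.70, 20, 20)
print(round(lower, 2), round(upper, 2))  # 0.06 1.34
```

Passing a lower `confidence` value, or larger sample sizes, narrows the interval, which makes the function convenient for exploring the trade-offs between confidence level, sample size, and interval width.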
For example, suppose that one wants to construct a 95% CI for $\Delta$ when d = +0.70, $n_a = n_b = 20$, and we are not adjusting d for bias because bias is likely very slight when each n = 20. In this case $s_d^2 = [(20 + 20)/(20 \times 20)] + [.70^2/[2(20 + 20)]] = 0.106$, and $s_d = (0.106)^{1/2} = 0.326$. Therefore, the limits of the .95 CI are 0.70 ± 1.96(0.326). The lower limit for this confidence interval is 0.06 and the upper limit is 1.34 (a disappointingly wide interval). Thus, we estimate that the interval from 0.06 to 1.34 would contain the value of $\Delta$ approximately 95% of the time.

Recall from chapter 1 that there are opposing views regarding the relevance of null-hypothesis significance testing. Therefore, authors (and readers) of a research report would have varying reactions to the fact that the confidence interval from 0.06 to 1.34 does not contain 0, a result that also provides evidence at the two-tailed .05 level of significance that $\Delta$ does not equal 0. This statistically significant result would be an important perspective on the data for someone who is interested in evidence regarding a theory that predicts a difference between the two groups. This significance-testing perspective would also be important if the research were comparing two treatments of equal overall cost, so the main issue would then be which, if either, of the two treatments is more effective.

On the other hand, suppose that there are two competing treatments and that the prior literature includes an estimate of effect size when comparing one of those treatments to a control condition.
Suppose further that the present research is estimating effect size when the other competing treatment is being compared to the same control condition that was used in the prior study. In this case the interest would be in the magnitudes of the currently obtained value of d and of the confidence limits and in comparing the present results with the prior results as evidence regarding the competition between the two treatments.

Many meta-analyses include all available relevant estimates of effect size, including those that did not attain statistical significance in the underlying primary studies. Recall in this regard that we previously cited the results by Sawilowsky and Yoon (2002) that provided evidence of inflation of Type I error when such nonsignificant estimates are used in a meta-analysis. Recall also the finding (Meeks & D'Agostino, 1983), cited in chapter 2, that if one only constructs a confidence interval contingent on obtaining a statistically significant result, the apparent (nominal) confidence level will be greater than the true confidence level (liberal probability coverage). Perhaps a justifiable procedure for a study in which the researcher wants to report a confidence interval would be to construct a confidence interval first and then address the presence or absence of 0 in the interval from the perspective of significance testing. Nonetheless, again some believe that researchers should either conduct a test of significance or construct a confidence interval, depending on the purpose of the research (Knapp, 2002). Note in this regard that in chapter 8 we encounter a situation (the difference between two proportions) in which a test of significance and a confidence interval might produce inconsistent results.

A solution has been proposed for the issue of significance testing versus construction of confidence intervals. This solution involves a null hypothesis that posits not a single value (usually 0), as is customary for a parameter such as $\Delta$, but a range of values that would be of equal interest to the researcher, values called good-enough values. In this case the confidence limits are not based on the use of a distribution of a test statistic that would be used to test a traditional null hypothesis (e.g., the t or normal z distribution) as is done in this book. Instead, the relevant distribution is based on a test statistic that would be used to test a range null hypothesis. A good-enough confidence interval addresses the issue of whether an effect is large enough to be of interest. These confidence intervals can also provide evidence regarding a theory that an effect will be at least a specified size. For further discussion and references refer to the review by Serlin (2002). Steiger (2004) discussed construction of confidence intervals that are related to this approach to significance testing. The "good-enough" approach is reasonable for instances of applied research in which the researcher has a credible rationale for determining what degree of difference between two groups would be the minimum that would be of interest. The approach is also briefly mentioned in chapter 8 in the section entitled "The Difference Between Two Proportions," where the work of Fleiss, Levin, and Paik (2003) is cited.

Returning to our example, note that the confidence interval is not as informative as one would want it to be because the interval ranges from a value that would be considered to be a very small effect size (0.06) to a value that would be considered to be a large effect size (1.34). We would like to have obtained a narrower confidence interval. Recall from chapter 2 that to attempt to narrow a confidence interval some have suggested that we consider adopting a level of confidence lower than .95.
The reader can try this as an exercise by constructing a $(1 - \alpha)$ CI, where $\alpha > .05$, to narrow the confidence interval by paying the price of having the confidence level below .95. In this case the only element in expression 3.12 that changes is that $z_{.025}$ is replaced by $z_{\alpha/2}$. This $z_{\alpha/2}$ arises, of course, because the middle $100(1 - \alpha)\%$ of the normal curve has one half of the remaining area of the curve above it; that is, it has $100(\alpha/2)\%$ of the area above it. Note, however, that a .95 CI is traditional and that the editors and manuscript reviewers of some journals, and some professors who are supervising student research, may be uncomfortable with a result reported with less than 95% confidence.

Our current example with sample sizes of 20 each would generally be considered adequate for most experiments. Nonetheless, as a further exercise in narrowing confidence intervals (before the research is begun) by increasing sample sizes while maintaining 95% confidence, we change our example by now supposing that we had originally used $n_a = n_b = 50$ instead of 20 and that d = +0.70 again. Using $n_a = n_b = 50$ in Equation 3.13 and then taking the square root of the obtained $s_d^2$ one finds that now $s_d = 0.206$. The limits for the 95% CI for $\Delta$ then become 0.70 ± 1.96(0.206), yielding lower and upper limits of 0.30 and 1.10, respectively. This is still not a very narrow confidence interval, but it is narrower than the original confidence interval that was constructed using smaller sample sizes.

When assumptions are satisfied, for a more accurate method for constructing a confidence interval for $\Delta$ using SPSS or other software refer to Fidler and Thompson (2001) and Smithson (2001, 2003). Some rationale for this method is discussed in the next section on noncentral distributions.
Additional software for constructing confidence intervals, combining them, and better understanding their meaning is Cumming and Finch's (2001) Exploratory Software for Confidence Intervals (ESCI). For an example of output from ESCI inspect our Fig. 3.2 that will be discussed shortly. ESCI runs under Excel and can, as of the time of this writing, be downloaded from http://www.latrobe.edu.au/psy/esci. This site also has useful links.

Satisfactorily narrow confidence intervals may often require impractically large sample sizes, so that a single study often cannot yield a definitive result. However, using software such as ESCI, combining a set of confidence intervals from related studies (i.e., the same variation of the independent variable and same dependent variable) may home in on a more accurate estimate of an effect size (Cumming & Finch, 2001; Wilkinson & APA Task Force, 1999). In this case of related studies the Results section of the report of a later study can include a single figure that depicts a confidence interval from its study together with the confidence intervals from all of the previous studies. Such a figure places our results in a broader context and can greatly facilitate interpretation of these results as integrated with the previous results. ESCI can produce such a figure, as is illustrated by Thompson (2002) and by our Fig. 3.2. Such a figure turns a primary study into a more informative meta-analysis.
For further discussion of confidence intervals for standardized-difference effect sizes, consult Cumming and Finch (2001), Hedges and Olkin (1985), and Thompson (2002). Hedges and Olkin (1985) provided nomographs (charts) for approximate confidence limits for $g_{pop}$ when $0 < g \leq 1.5$ and $n_a = n_b = 2$ to 10. Refer to Smithson (2003) for definitions and discussions of confidence intervals that are called exact, uniformly most accurate, and unbiased.

[Figure 3.2. Horizontal axis: Standardised effect size, d. Plotted intervals: Past; Past research, pooled; Current study; Current, pooled.]

FIG. 3.2. The 95% confidence intervals, produced by ESCI, for placebo versus drug for depression. From A Meta-Analysis of the Effectiveness of Antidepressants Compared to Placebo by J. A. Gorecki, 2002, unpublished master's thesis, San Francisco State University, San Francisco. (British spelling per original figure.)

Figure 3.2 depicts 95% CIs for $\Delta$ that were produced by the ESCI's option called MA (Meta-Analytic) Thinking. The g values, calculated on real data (Gorecki, 2002), were defined as $g = (\bar{Y}_{placebo} - \bar{Y}_{drug})/s_p$ in studies of depression. (Note that what we label "g" in this book the ESCI software currently labels "d".) The figure is intended only to illustrate an ESCI result because there were actually 11 prior studies to be compared with the latest study, but ESCI permitted depiction of confidence intervals for up to 10 prior studies, a pooled (averaged) confidence interval for those studies, a confidence interval from the current primary researcher's latest study, and a confidence interval based on a final pooling of the 10 prior studies and the latest study. The pooled confidence intervals represent a kind of meta-analysis undertaken by a primary researcher whose study has predecessors.

CONFIDENCE INTERVALS USING NONCENTRAL DISTRIBUTIONS
The t distribution that is used to test the usual null hypothesis (that the difference between the means of two populations is 0) is centered symmetrically about the value 0 because the initial presumption in research that uses hypothesis testing is that H0 is true. Such a t distribution that is centered symmetrically about 0 is called a central t distribution. (Not all central distributions are symmetrical.) However, when we construct a confidence interval for μ_a − μ_b or for Δ there is no null hypothesis being tested at that time, so the relevant sampling distribution is a t distribution that may not be centered at 0 and may not be symmetrical. Such a t distribution is called a noncentral t distribution.
The noncentral t distribution differs more from the central t distribution with respect to its center and degree of skew the more μ_a − μ_b or Δ depart from 0 and the smaller the sample sizes (or, precisely, the degrees of freedom). Therefore, if assumptions are satisfied, the more μ_a − μ_b or Δ depart from 0, and the smaller the sample sizes, the more improvement there will be in the accuracy of confidence intervals that are based on the noncentral t distribution instead of the central t distribution. Thus, ESCI and much of the other modern software for the construction of such confidence intervals is based on the noncentral t distribution.
It would not be possible to table useful representative values of t from the noncentral t distribution because its shape depends not only on degrees of freedom but also on the value of a parameter, called the noncentrality parameter, that is related to Δ. Therefore, for example, for a given degrees of freedom the value of t that would have 2.5% of the area of the distribution beyond it will typically not be the same in the central t distribution as in a noncentral t distribution if H0 is false. Also, constructing a confidence interval for μ_a − μ_b or for Δ using a noncentral t distribution requires a procedure in which the lower limit and the upper limit of the interval have to be estimated separately because they are not equidistant from the sample value (i.e., Ȳ_a − Ȳ_b or the standardized difference) in this case. Thus, such a confidence interval is not necessarily a symmetrical one bounded by the point estimate plus or minus a margin of error. The procedure is an iterative (repetitive) one in which successive approximations of each confidence limit are made until a value is found that has .025 (in the case of a 95% CI) of the noncentral t distribution beyond it. Therefore, software is required for the otherwise prohibitively laborious construction of confidence intervals using noncentral distributions.
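The iterative search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the algorithm used by ESCI or any other package: the function names are hypothetical, the noncentral t cumulative probability is obtained by simple numerical integration, each limit is found by bisection, and d is assumed to be standardized by the pooled standard deviation of two independent groups, so that t = d / sqrt(1/n_a + 1/n_b) with n_a + n_b − 2 degrees of freedom.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative probability."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def nct_cdf(t, df, delta, steps=2000):
    """P(T <= t) for a noncentral t variable with noncentrality delta.

    Uses T = (Z + delta) / (W / sqrt(df)), where Z is standard normal and W
    follows a chi distribution with df degrees of freedom, and integrates
    over W with Simpson's rule (an illustrative choice, not a library method).
    """
    root_df = math.sqrt(df)
    lo = max(1e-12, root_df - 8.0)   # the chi density is negligible outside
    hi = root_df + 8.0               # roughly sqrt(df) +/- 8
    log_c = (1.0 - df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0)

    def integrand(w):
        log_density = log_c + (df - 1.0) * math.log(w) - 0.5 * w * w
        return math.exp(log_density) * normal_cdf(t * w / root_df - delta)

    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * integrand(lo + i * h)
    return total * h / 3.0


def solve_delta(t_obs, df, target, tol=1e-8):
    """Bisection for the delta at which nct_cdf(t_obs, df, delta) == target.

    For a fixed t the cumulative probability decreases as delta increases.
    The +/- 40 bracket is an assumption that suits ordinary cases only.
    """
    low, high = t_obs - 40.0, t_obs + 40.0
    while high - low > tol:
        mid = 0.5 * (low + high)
        if nct_cdf(t_obs, df, mid) > target:
            low = mid    # probability still too large, so delta must grow
        else:
            high = mid
    return 0.5 * (low + high)


def ci_for_delta(d, n_a, n_b, conf=0.95):
    """Noncentral-t confidence limits for the population standardized difference."""
    scale = math.sqrt(1.0 / n_a + 1.0 / n_b)   # d = t * scale
    t_obs = d / scale
    df = n_a + n_b - 2
    alpha = 1.0 - conf
    # Lower limit: alpha/2 of the distribution lies above the observed t.
    delta_lower = solve_delta(t_obs, df, 1.0 - alpha / 2.0)
    # Upper limit: alpha/2 of the distribution lies below the observed t.
    delta_upper = solve_delta(t_obs, df, alpha / 2.0)
    return delta_lower * scale, delta_upper * scale


if __name__ == "__main__":
    lower, upper = ci_for_delta(0.70, 20, 20)
    print(f"95% CI for the population value: {lower:.3f} to {upper:.3f}")
```

The two limits are solved separately, which is why the resulting interval need not be symmetrical about d. In practice a tested statistical library would normally be preferred to this hand-rolled integration.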
For detailed discussions of the construction of confidence intervals that are based on noncentral distributions, consult Cumming and Finch (2001), Smithson (2001, 2003), Steiger and Fouladi (1997), Thompson (2002), and the references therein. Smithson's (2001) procedure uses SPSS scripts. Additional applicable statistical packages include SAS and STATISTICA. Note that literature on the noncentral t distribution often uses the symbol Δ, which we use to represent the standardized difference between two population means, to represent instead the noncentrality parameter for the noncentral t distribution. The noncentrality parameter is a function of how far μ_a − μ_b is from 0, as is Δ as we use it. Note again that the noncentrality approach to the construction of confidence intervals assumes normality and homoscedasticity, whereas the bootstrap approach that was discussed in chapter 2 does not.
THE COUNTERNULL EFFECT SIZE
Recall that a typical null hypothesis about μ_a and μ_b is that μ_a − μ_b = 0. This H0 implies another; namely, H0: Δ = 0. In traditional significance testing if the obtained t is not far enough away from 0, one decides not to reject H0, and, by implication, one concludes that the t test result provides insufficient evidence that Δ is other than 0. However, some consider such reasoning to be incomplete. For example, suppose that the sample d is above 0 but insufficiently so to attain statistical significance. This result can be explained, as is traditional, by the population Δ actually being 0, whereas the sample d happened by chance (sampling variability) to overestimate Δ in this instance of research. However, an equally plausible explanation of the result is that Δ is actually above 0, and more above 0 than d is, so d happened by chance to underestimate Δ here. Therefore, according to this reasoning, a value of d that is beyond 0 (above or below 0) by a certain amount is providing just as much evidence that Δ = 2d as it is providing evidence that Δ = 0 because d is no closer to 0 (1 d distance away from 0) than d is to 2d (1 d distance away from 2d). For example, if d is, say, +0.60, this result is just as consistent with Δ = +1.20 as with Δ = 0 because +0.60 is just as close to +1.20 as it is to 0. The sample d is just as likely to be underestimating Δ by a certain amount as it is to be overestimating Δ by that amount (except for some positive bias as is discussed next).
In the just-given example, assuming that a t test results in t and, by implication, d being statistically insignificantly different from 0, it would be as justifiable to conclude that d is insignificantly different from +1.20 as it would be to conclude that d is insignificantly different from 0. We must note, however, that the reasoning in this section is only approximately true because of the bias that standardized-difference estimators have toward overestimating effect size. The reasoning is more accurate when larger sample sizes or a bias-adjusted estimator are used, as previously discussed.
This reasoning leads to a measure of effect size called the counternull value of an effect size (Rosenthal et al., 2000; Rosenthal & Rubin, 1994). Here, we simply call this measure the counternull effect size, ES_cn. In the case of standardized-difference effect sizes, and in the case of some (but not all) other kinds of effect sizes that we will discuss later in this book, if one is, by implication of t testing, testing H0: ES_pop = 0, then

$$ES_{cn} = 2ES. \tag{3.14}$$
When null-hypothesizing a value of ES_pop other than 0, the more general formula is

$$ES_{cn} = 2ES - ES_{null}, \tag{3.15}$$
where ES_null is the null-hypothesized value of ES_pop. See Rosenthal et al. (2000) for an example of the use of Equation 3.15. In our example, in which the estimate of effect size (i.e., d) = +0.60, application of Equation 3.14 yields the estimate ES_cn = 2(+0.60) = +1.20. Therefore, the null-counternull interval ranges from 0 to +1.20. In other words, the results are approximately as consistent with Δ = +1.20 as they are with Δ = 0.
For situations in which construction of a confidence interval for an effect size would be informative but not practicable, a researcher might consider reporting instead the ES_null and ES_cn as limits of a null-counternull interval. In our example, the lower limit of the null-counternull interval is 0 and the estimated upper limit is +1.20.
Note that Equations 3.14 and 3.15 are applicable only to estimators that have a symmetrical sampling distribution, such as d. For equations for application to estimators that have asymmetrical distributions, such as the correlation coefficient r (discussed in the next chapter), refer to Rosenthal et al. (2000), who also discussed a kind of confidence level for a null-counternull interval.
To understand such a confidence level (perhaps better called a likelihood level), recall the example in which n_a = n_b = 20 and d = +0.70. In that example the estimated ES = d = +0.70 and, assuming the usual ES_null = 0 in this hypothetical example, using Equation 3.14 ES_cn = 2ES = 2(+0.70) = +1.40.
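Equations 3.14 and 3.15 are simple enough to check by hand, but a short Python sketch may make the arithmetic of the two worked examples concrete. The function names are hypothetical, and the sketch applies only to estimators with a symmetrical sampling distribution, such as d.

```python
def counternull(es, es_null=0.0):
    """Counternull effect size: ES_cn = 2*ES - ES_null (Equation 3.15).

    With es_null left at 0 this reduces to ES_cn = 2*ES (Equation 3.14).
    """
    return 2.0 * es - es_null


def null_counternull_interval(es, es_null=0.0):
    """Return the null and counternull values ordered as (lower, upper)."""
    cn = counternull(es, es_null)
    return (min(es_null, cn), max(es_null, cn))


if __name__ == "__main__":
    for d in (0.60, 0.70):   # the two sample values used in the text
        lower, upper = null_counternull_interval(d)
        print(f"d = {d:+.2f}: null-counternull interval {lower:+.2f} to {upper:+.2f}")
```

Ordering the limits with min and max is a design choice of this sketch so that a negative d yields an interval running from the counternull value up to the null value.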
Suppose further that the two-tailed p level for the obtained t in this example had been found to be, say, p = .04. Recall also that a t test conducted at the two-tailed alpha level is associated with a confidence interval for the difference between the two involved population means, a confidence interval in which one is approximately 100(1 − α)% confident. Similarly, in our example one can be approximately 100(1 − p)% = 100(1 − .04)% = 96% confident in the null-counternull interval ranging from 0 to +1.40. Note that the confidence level for a confidence interval is based on a fixed probability (1 − α) that is set by the researcher, typically .95, whereas the confidence level for a null-counternull interval is based on a result-determined probability, the p level attained by a test statistic such as t. A null-counternull interval can provide information that is only somewhat conceptually similar to and not likely numerically the same as the information that is provided by a confidence interval. Both intervals bracket the obtained estimate of effect size, but, unlike the lower limit of a confidence interval, when ES_null = 0, the lower limit of the null-counternull interval will always be 0. Confidence intervals and null-counternull intervals cannot be directly compared or combined.
We previously suggested that researchers might consider constructing a null-counternull interval in situations in which construction of a confidence interval is not practicable. However, some researchers who are conducting studies in which their focus is not on significance testing might be inclined to avoid the null-counternull approach because, like significance testing, this approach focuses on the value 0, although, unlike significance testing, it also focuses on a value at some distance from 0 (the counternull value). More information about a variety of kinds of ES_cn can be found in Rosenthal and Rubin (1994), Rosenthal et al. (2000), and in later chapters in this book.

DEPENDENT GROUPS
Equations 3.3, 3.4, 3.5, and 3.6 are also applicable to dependent-group designs. In the case of a pretest-posttest design the means in the numerators of these four equations become the pretest and posttest means (e.g., Ȳ_pre and Ȳ_post when using Equations 3.3 or 3.5). In this case the standardizer (standard deviation) in Equation 3.3 can be s_pre or s_post (less common). The pooled standard deviation, s_p, can also be used to produce instead the g of Equation 3.5. Because n_a = n_b, s_p is merely the square root of the mean of s²_pre and s²_post; s_p = [(s²_pre + s²_post)/2]^(1/2).
The choice of a standardizer for estimation of a standardized-difference effect size must be based on the nature of the population of scores to which one wants to generalize the results in the sample. Therefore, in the case of a pretest-posttest design some have argued that the standardizer for an estimator of Δ should not be based on a standard deviation of raw scores as in the previous paragraph, but instead it should be the standard deviation of the difference scores (e.g., the standard deviation, s_D, of the data in column D in Table 2.1 of chapter 2). Their argument is that in this design one should be interested in generalizing to the mean posttreatment-pretreatment differences in individuals relative to the population of such difference scores. However, each standardizer has its purpose.
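To see how much the choice of standardizer can matter, the following Python sketch computes the same mean change under the three standardizers just described. It is an illustration only: the function name, the dictionary keys, and the scores are hypothetical, and taking the numerator as posttest mean minus pretest mean is an assumption about direction, not a restatement of Equations 3.3 or 3.5.

```python
import math
import statistics


def pretest_posttest_effect_sizes(pre, post):
    """Standardized mean change under three different standardizers.

    pre and post hold the paired scores of the same participants, in the
    same order. Sample standard deviations use n - 1 in the denominator.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post must contain paired scores")
    mean_change = statistics.mean(post) - statistics.mean(pre)
    s_pre = statistics.stdev(pre)
    s_post = statistics.stdev(post)
    # With equal n, the pooled s is the square root of the mean variance.
    s_pooled = math.sqrt((s_pre ** 2 + s_post ** 2) / 2.0)
    # Standard deviation of the individual difference scores.
    s_diff = statistics.stdev([b - a for a, b in zip(pre, post)])
    return {
        "standardized_by_s_pre": mean_change / s_pre,
        "standardized_by_s_pooled": mean_change / s_pooled,
        "standardized_by_s_diff": mean_change / s_diff,
    }


if __name__ == "__main__":
    pre_scores = [10, 12, 14, 16, 18]    # made-up pretest scores
    post_scores = [13, 14, 18, 19, 21]   # made-up posttest scores
    for name, value in pretest_posttest_effect_sizes(pre_scores, post_scores).items():
        print(f"{name}: {value:.3f}")
```

When pretest and posttest scores are highly correlated, the difference scores vary little, so the estimate standardized by s_D can be far larger than the estimates standardized by raw-score standard deviations.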
For example, in areas of research that consist of a mix of between-group and within-group studies of the same independent variable, greater comparability with results from between-group studies can be attained when a within-group study uses a standardizer that is based on the s of the raw scores. Consult the references that were cited by Morris and DeShon (2002) for discussions supporting either the standard deviation of raw scores or the standard deviation of the posttreatment-pretreatment difference scores as the standardizer. Note that in the pretest-posttest design complications arise if one constructs a confidence interval for a Δ whose standardizer is based on σ (based on the pretest σ or based on pooling) instead of σ_D. But if one uses σ_D as the standardizer, then the methods that we previously discussed for independent groups can be used to construct an exact confidence interval (Cumming & Finch, 2001). Again, "exact" assumes that the usual assumptions are satisfied.
Consult Algina and Keselman (2003) for a method for constructing an approximate confidence interval in the case of dependent groups with equal or unequal variances. Their method appears to provide satisfactorily accurate confidence levels under the conditions they simulated for the true values of Δ and for the strengths of correlation between the two populations of scores. For a nominal .95 confidence level their slightly conservative method resulted in actual confidence levels that ranged from .951 to .972. The degree of correlation between the two populations of scores seemed to have little effect on the accuracy of the actual confidence levels; as the true value of Δ increased, the actual confidence levels became slightly more conservative.
Specifically, as the true values of Δ ranged from 0 to 1.6, the actual confidence levels ranged from .951 to .971, values that are extremely close or satisfactorily close to the nominal confidence level of .95 in the simulations. The method can be undertaken using any software package that provides noncentrality parameters for noncentral t distributions, such as SAS (the SAS function TNONCT), that Algina and Keselman (2003) recommended as being particularly useful for this purpose. Consult Wilcox (2003) for other approaches to effect sizes when comparing two dependent groups. In chapter 6 we discuss construction of confidence intervals for standardized-difference effect sizes when one is focusing on two of the multiple groups in a one-way between-groups or within-groups analysis of variance (ANOVA) design.

QUESTIONS
1. In what circumstance might a standardized difference between means be more informative than a simple difference between means?
2. Define latent variable.
3. Assuming normality, interpret d = +1.00 when it is obtained by using Equation 3.1, and explain the interpretation.
4. If population variances are equal, what are two advantages of pooling sample variances to estimate the common population variance?
5. Distinguish among Glass' d, Cohen's d, and Hedges' g.
6. Why is it unlikely that Glass' d will equal Hedges' g even if population variances are equal?
7. What is the direction of bias of Hedges' g and Glass' d, what two factors influence this bias, and in what ways do these two factors influence the bias?
8. Why is Hedges' bias-adjusted version of g seldom used by researchers?
9. In what ways does heteroscedasticity cause problems for the use of standardized differences between means?
10. Which effect size is recommended when homoscedasticity is assumed, and why?
11. Discuss two approaches to estimating effect size that should be considered when homoscedasticity is not assumed.
12. Should all calculated estimates of effect size be reported by the researcher, and why?
13. What might be a solution to the problem of estimating effect size in the face of heteroscedasticity in areas of research that use a normed test for the measure of the dependent variable, and why is this so?
14. Why is nonnormality problematic for the usual interpretation of d or g?
15. In what way might a large effect size for a treatment for an addiction be too optimistically interpreted?
16. Describe two alternative standardized-difference estimators of effect size when there are outliers.
17. In what research context is the magnitude of the effect size of greatest interest?
18. Which part of expression 3.12 changes if one adopts a confidence level other than .95, and why?
19. Identify two ways in which a plan for data analysis can narrow the eventual confidence interval.
20. In a Results section, of what benefit is the presentation of a figure that contains current and past confidence intervals involving the same levels of an independent variable and the same dependent variable?
21. Contrast the central t distribution and a noncentral t distribution.
22. Which two factors influence the difference between the central and noncentral t distributions, and in what ways?
23. Define counternull effect size and null-counternull interval.
24. What is the rationale for a counternull effect size?
25. When might a researcher consider using a null-counternull interval?
26. Contrast a null-counternull interval and a confidence interval.
27. How can Equations 3.3, 3.4, 3.5, and 3.6 be applied to data from dependent groups?
Chapter 4
Correlational Effect Sizes for Comparing Two Groups
THE POINT-BISERIAL CORRELATION
When X and Y are continuous variables the familiar Pearson correlation coefficient, r, provides an obvious estimator of effect size in terms of the size (magnitude of r) and direction (sign of r) of a linear relationship between X and Y. However, thus far in this book, although the Y variable has been continuous the independent variable (X) has been a dichotomous variable such as membership in Group a or Group b. Although computational formulas and software for r obviously require both X and Y to be quantitative variables, calculating an r between a truly dichotomous categorical X variable and a quantitative Y variable does not present a problem. By a truly dichotomous variable we mean a naturally dichotomous (or nearly so) variable, such as gender, or an independent variable that is created by assigning participants into two different treatment groups to conduct an experiment. We are not referring to the problematic procedure of creating a dichotomous variable by arbitrarily dichotomizing originally continuous scores into two groups, say, those above the median versus those below the median. When an originally continuous variable is dichotomized it will nearly always correlate lower with another variable than if it had not been dichotomized (Hunter & Schmidt, 2004). Similarly, as Hunter and Schmidt (2004) discussed, when a continuous variable has been dichotomized it cannot attain the usual maximum absolute value of correlation with a continuous variable, |1|.
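As a minimal sketch of why a truly dichotomous X presents no computational problem, the Python code below codes group membership numerically and applies the ordinary Pearson formula. The function names, the default codes 1 and 2, and the scores are all hypothetical choices made for illustration.

```python
import math


def pearson_r(x, y):
    """Ordinary Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def point_biserial(scores_a, scores_b, code_a=1, code_b=2):
    """r between coded group membership (X) and the dependent variable (Y)."""
    x = [code_a] * len(scores_a) + [code_b] * len(scores_b)
    y = list(scores_a) + list(scores_b)
    return pearson_r(x, y)


if __name__ == "__main__":
    group_a = [3, 4, 5, 6]   # made-up scores for Group a
    group_b = [5, 6, 7, 8]   # made-up scores for Group b
    print(point_biserial(group_a, group_b))          # Group b coded higher
    print(point_biserial(group_a, group_b, 0, 1))    # different codes, same order
    print(point_biserial(group_a, group_b, 2, 1))    # order of codes reversed
```

Changing the two code values leaves the magnitude of r unchanged; reversing which group receives the higher code changes only the sign.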
The procedure for calculating an r between a dichotomous variable and a quantitative variable is simply to code membership in Group a or Group b numerically. For example, membership in Group a can be coded as 1, and membership in Group b can be coded as 2. Thus, in a data file each member of Group a would be represented by entering a 1 in the X column and each member of Group b would be represented by entering a 2 in the X column. As usual, each participant's score on the dependent variable measure is entered in the Y column of the data file. The magnitude of r will remain the same regardless of which two numbers are chosen for the coding. The only aspect of the coding that the researcher must keep in mind when interpreting the obtained sample r is which group was assigned the higher number. If r is found to be positive, then the sample that had been assigned the higher number on X (e.g., 2, instead of 1) tended to score higher than the other sample on the Y variable. If r is negative, then the sample that had been assigned the higher number on X tended to score lower than the other sample on the Y variable.
The correlation between a dichotomous variable and a continuous variable is called a point-biserial correlation, r_pb in the sample, a commonly used estimator of effect size in the two-group case. When using r_pb, one does not have to look for statistical software that includes r_pb. One simply uses any software for the usual r and enters the numerical codes in the X column according to each participant's group membership. Refer to Levy (1967) for an alternative measure of effect size that is based on r_pb.

EXAMPLE OF r_pb
To illustrate the use of r_pb we again use the research that was discussed in chapter 3 in which the healthy parent-child relationship scores of mothers with normal children (Group b) were compared with those from mothers of disturbed children (Group a). In that example, d = −.77, indicating that, in the samples, mothers of normal children tended to outscore mothers of disturbed children by about .77 of a standard deviation unit. If we now code the mothers of the disturbed children with X = 1 and code the mothers of the normal children with X = 2, we find, using any statistical software that calculates r, that r_pb = .40. This result indicates that the sample that was coded 2 (normals) tended to outscore the sample that was coded 1 (disturbeds), a finding that d already indicated in its own way. An r of magnitude .4 would be considered to be moderately large in comparison to typical values of r in behavioral research, as we discuss later in this chapter. Thus, finding that d = −.77 and r_pb = .40 suggest in their own ways that there is a moderately strong relationship between the independent and dependent variables in this example.
Software that calculates an r (r_pb in this case) will typically also test H0: r_pop = 0 and provide a p level for that test. There is an equation that relates r_pb to t (Equation 4.3), and the p level attained by r when conducting a two-tailed test of H0: r_pop = 0 will be the same as the p level attained by t when conducting a two-tailed test of H0: μ_a − μ_b = 0. Therefore, we already know r_pb = .40 is statistically significantly different from 0 because we found in chapter 3 that the sample means for the two kinds of mothers were significantly different in a t test.
Values of r and r_pb are negatively biased (i.e., they tend to underestimate the correlation in the population, r_pop), usually slightly so. Bias is greater the closer r_pop is to ±.5 and for small samples, but this is not of great concern if total sample size is, say, greater than 15 (Hedges & Olkin, 1985). When sample size is greater than 20 bias might be less than rounding error (Hunter & Schmidt, 2004). Exact values of an unbiased estimator as a function of r (or r_pb) can be found in Table 1 in Hedges and Olkin (1985), who also provided the following equation for an approximately unbiased estimator,

$$r_{approx} = r + \frac{r(1 - r^{2})}{2(N - 3)}. \tag{4.1}$$
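To give a feel for the size of the adjustment in Equation 4.1, here is a brief Python sketch. The function name is hypothetical, and the total sample size of 40 in the usage example is a made-up value chosen for illustration, not the N of the study discussed above.

```python
def approx_unbiased_r(r, n_total):
    """Approximately unbiased estimate of r_pop (Equation 4.1).

    r is the sample correlation (r or r_pb) and n_total is the total
    sample size N, which must exceed 3 for the formula to be defined.
    """
    if n_total <= 3:
        raise ValueError("n_total must be greater than 3")
    return r + r * (1.0 - r ** 2) / (2.0 * (n_total - 3))


if __name__ == "__main__":
    r_sample = 0.40
    n_hypothetical = 40   # assumed total N, for illustration only
    adjusted = approx_unbiased_r(r_sample, n_hypothetical)
    print(f"r = {r_sample:.2f}, adjusted = {adjusted:.4f}")
```

Because the adjustment has the same sign as r, it moves the estimate slightly away from 0, which offsets the tendency of r to underestimate r_pop.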
Other versions of Equation 4.1 are available (e.g., Hunter & Schmidt, 2004), but a correction is rarely used because the bias is generally negligible.

CONFIDENCE INTERVALS AND NULL-COUNTERNULL INTERVALS FOR r_pop
Construction of a confidence interval for r_pop can be complex, and there may be no entirely satisfactory method. (When r_pop ≠ 0, the sampling distribution of r is not normal.) For details consult Hedges and Olkin (1985) and Wilcox (1996, 1997, 2003). Smithson (2003) presented a method for constructing an approximate confidence interval, noting that the approximation is less accurate the greater the absolute size of the correlation in the population and the smaller the sample size. Similarly, Wilcox (2003) presented an S-PLUS software function for a modified bootstrap method for a .95 CI that appears to have fairly accurate probability coverage (i.e., actual confidence level close to .95) provided that the absolute value of r in the population is not extremely large, say, below .8 (but not 0). Such values for r_pop would be the case in most correlational research in the behavioral sciences. (A basic bootstrap method was briefly introduced in chap. 2.) Wilcox's (2003) method seems to perform well when assumptions are violated, even with sample sizes as small as 20. These assumptions are discussed in the next section. [In the case of correcting correlation for attenuation attributable to unreliability (discussed later in this chapter in the section entitled "Unreliability") a confidence interval should first be constructed using the uncorrected r. Next, the limits of this confidence interval should be corrected by dividing by the square root of the reliability coefficient of the X variable, or dividing by the product of the square roots of the reliability coefficients of the X and Y variables, as shown later in Equation 4.5 and immediately thereafter. Hunter and Schmidt (2004) provided extensive discussion of this topic.]
A null-counternull interval (discussed in chap. 3) can also be constructed for r_pop. If the null hypothesis is the usual H0: r_pop = 0, the null
value of such an interval is 0. Rosenthal et al. (2000) showed that the counternull value of an r, denoted r_cn here, is given by

$$ r_{cn} = \frac{2r}{\sqrt{1 + 3r^{2}}} \tag{4.2} $$

In the present example, r_pb = .40, so r_cn = 2(.40)/[1 + 3(.40^2)]^(1/2) = .66. Therefore, the interval runs from 0 to .66. Thus, the results would provide about as much support for the proposition that r_pop is .66 as they would for the proposition that r_pop = 0. Perhaps a null-counternull interval for r_pop would be most relevant for researchers who focus on the null-hypothesized value of 0 for r_pop. The counternull value brings attention also to an equally plausible value.
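The arithmetic of Equation 4.2 can be checked with a short script. The following is only a minimal sketch in Python; the function name `counternull_r` is our own illustrative choice, and the sketch assumes the usual null value of 0.

```python
import math


def counternull_r(r: float) -> float:
    """Counternull value of a correlation when the null value is 0 (Equation 4.2)."""
    return 2.0 * r / math.sqrt(1.0 + 3.0 * r ** 2)


# Worked example from the text: r_pb = .40
r_pb = 0.40
r_cn = counternull_r(r_pb)
print(f"null-counternull interval: 0 to {r_cn:.2f}")
```

With r_pb = .40 the upper end of the interval rounds to .66, in agreement with the hand calculation.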

ASSUMPTIONS OF r AND r_pb

In the case of r_pb there are three distributions of Y to consider: the distribution of Y for Group a, the distribution of Y for Group b, and the overall distribution of Y for the combined data for the two groups. The first two distributions are called the conditional distributions of Y (conditional on whether one is considering the distribution of Y values at X = a or at X = b) and the overall distribution of Y is called the marginal distribution of Y. The three distributions are depicted in Fig. 4.1.

Recall that the ordinary t test assumes homoscedasticity. The Welch version of the t test counters heteroscedasticity somewhat by using the df_w of Equation 2.5 instead of df = n_a + n_b - 2 and by using s_a^2 and s_b^2 separately in the denominator of t instead of pooling these two variances. Therefore, if software is using the ordinary t test to test H0: r_pop = 0, the software is assuming homoscedasticity; that is, it is assuming equal variances of the populations' conditional distributions of Y. If there is heteroscedasticity the denominator of t (standard error of the difference between two means) will be incorrect, possibly resulting in lower statistical power and less accurate confidence intervals (Wilcox, 2003). Also, if the ordinary t test is used and the printout p and the actual (unknown) p are below .05, this result might not in fact be signaling a nonzero correlation, but instead it merely might be signaling heteroscedasticity. Heteroscedasticity is actually another kind of dependency between X and Y, a dependency between the variability of Y and the value of X.

If the software's printout for r does not indicate the statistical significance of r and does not include the value of t that corresponds to the obtained r_pb, convert r_pb to t using

$$ t = r_{pb} \left[ \frac{N - 2}{1 - r_{pb}^{2}} \right]^{1/2} \tag{4.3} $$

[Figure 4.1: scatterplot of Y against X for Groups a and b, with the marginal frequency distribution of Y.]
FIG. 4.1. Unequal variabilities of the Y scores in Groups a and b result in skew in the marginal frequency distribution of Y.

participants.. Then use a t table, at the where N is the total number of participants df = =N N-2 2row, ascertain the statistical statistical significance significance of this value of row, to ascertain of df which, again, will also be the statistical statistical significance significance of rpbb '. If the ta tat, which, not have a row row for for the df required, interpolate fo for/the signifi signifible does not dfrequired, cance level as was previously shown. shown. underlying r. r. However, However, al alBivariate normality is an assumption underlying though skew in the the opposite direction for variables X and and Y lowers the though maximum value of r (1. (J. B. Carroll, 11961; Cohen, Cohen et aI., al., 2003), maximum B. Carroll, 9 6 1 ; Cohen, nonnormality itself does not not necessarily cause a problem for r,r, so we re renonnormality fer the interested reader to Glass brief discus discusfer Glass and Hopkins Hopkins ((1996) 1 996) for a brief bivariate normality. normality. Indeed, Indeed, when using rrpb pb the sion of the criteria for bivariate X variable cannot cannot be normally distributed. distributed. The distribudichotomous X -
CORRELATIONAL EFFECT EFFECT SIZES
�
75
tion of the variable in this case is merely 1 s and the X variable merely a stack stack of, of, say, Is and a stack stack However, outliers (even one outlier) and distributions with of, say, 2s. However, thicker thicker tails (heavy-tailed (heavy-tailed distributions) than those of the normal normal curve affect r, r, and and rpop interval (Wilcox, 2003).. The can affect 0 and a confidence interval for it (Wilcox, 2003) P samplee sizes sizes might might help in this situation situation somewhat somewhat but but not not use of large samp the overall dis disunder all conditions. Even slight changes in the shape of the tribution of Y in the population tribution population can greatly alter the value of rrpop pop 2003). (Wilcox, 11997, 99 7, 2003 ). Note that one should of should distinguish between a difference difference in the shapes of distributions for the underlying construct construct (e.g., true ap apthe conditional distributions titude titude or a personality personality factor) in the two two populations populations and a difference difference in two conditional distributions distributions for the measure of that construct construct (e.g., the two test of aptitude or on a test of a personality personality factor) in the two two scores on a test samples (Cohen, (Cohen, Cohen et aI. al.,, 2003 2003).) . In the the case of different different distribu distributional shapes for the construct construct in the two popUlations, populations, the resulting re retional 0 is not duction 1 1 for rpop duction in an an upper limit limit below 1111 not a problem; it is a phenomenon. However, However, there is a problem underestimatnatural phenomenon. r blem of rpbb underestimat ing rpop en the two r if the two two sample distributions distributions differ differ in shape w when two pop populations ulations do not not or if the the two two sample distributions distributions differ differ more in shape distributions do. 
than the two population distributions do.

In reports of research that uses r or r_pb authors should include scatterplots and cautionary remarks about the possible effects of outliers, heavy tails, and heteroscedasticity. In the case of r_pb a scatterplot may well suggest heteroscedasticity with respect to the two conditional distributions of Y and skew of the marginal distribution of Y or neither heteroscedasticity nor skew because such skew and heteroscedasticity are often associated (Cohen et al., 2003). Such is the case in the example in Fig. 4.1. In the case of r a scatterplot that suggests skew in the marginal distribution of X and/or Y may well also suggest curvilinearity, heteroscedasticity, and nonnormal conditional distributions (McNemar, 1962). Recall that r reflects only a linear component of a relationship between two variables in a sample. Curvilinearity reduces the absolute value of r. For detailed discussions of such matters in a broader context (regression diagnostics), consult Belsley, Kuh, and Welsch (1980), Cook and Weisberg (1982), and Fox (1999).
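The conversion of r_pb to t in Equation 4.3, given earlier in this section, can be sketched in a few lines of Python. This is only an illustration: the function name `t_from_r` and the sample size N = 40 are hypothetical choices of ours, and, like the ordinary t test itself, the conversion presupposes homoscedasticity.

```python
import math
from typing import Tuple


def t_from_r(r: float, n_total: int) -> Tuple[float, int]:
    """Convert a correlation to t with df = N - 2 (Equation 4.3)."""
    if n_total < 3:
        raise ValueError("N must be at least 3 so that df = N - 2 is positive")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and +1")
    df = n_total - 2
    t = r * math.sqrt(df / (1.0 - r ** 2))
    return t, df


# Hypothetical illustration: r_pb = .40 with N = 40 participants in total
t, df = t_from_r(0.40, 40)
print(f"t = {t:.2f}, df = {df}")
```

The resulting t is then referred to the df = N - 2 row of a t table, as described above.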
Refer to Wilcox (2001, 2003) for additional discussions of assumptions underlying the use of r_pop, shortcomings of nonparametric measures of correlation (Spearman's rho and Kendall's tau), and for alternatives to r_pop for measuring the relationship between two variables. Wilcox (2003) also provided S-PLUS software functions for detecting outliers and for calculating robust alternative measures of correlation. Note that outliers are not always problematic. An X_i and Y_i pair of scores in which X_i and Y_i are equally outlying may not be importantly influencing the value of r. For example, a person who is 7 ft tall (outlier) and weighs 275 lb (outlier in the same direction) will not likely influence the sample value of the correlation between height and weight as much as a person who is an outlier with respect to just one of the variables. The most influential case for an otherwise positive r would be one in which a person is an outlier in the opposite direction with respect to X and Y, influencing r downward. Note finally that the use of r and r_pb does not assume equality of the variances of the X and Y variables.

UNEQUAL SAMPLE SIZES

If n_a ≠ n_b in experimental research the value of r_pb might be attenuated (reduced), causing an underestimation of r_pop. The degree of such attenuation of r_pb increases the more disproportional n_a and n_b are and the larger the actual value of r_pop is (McNemar, 1962). Note that when sample sizes are unequal, there is an increased chance that X and Y might be skewed in the opposite direction because unequal sample sizes in the case of the point-biserial r amount to skew in the X variable. As noted in the previous section, skew of X and Y in the opposite direction lowers the absolute value of r. One can calculate an attenuation-corrected r_pb, which we denote as r_c, using

$$ r_{c} = \frac{a\, r_{pb}}{\left[ (a^{2} - 1)\, r_{pb}^{2} + 1 \right]^{1/2}} \tag{4.4} $$

where a = [.25/(pq)]^(1/2), and p and q are the proportions of total sample size in each group (Hunter & Schmidt, 2004). For example, if N = 100, n_a = 60, and n_b = 40, then p = 60/100 = .6 and q = 40/100 = .4. Of course, it does not matter which of n_a or n_b is associated with p or q because pq = qp. Note that in experimental research in which the sample sizes are unequal, different researchers who are studying the same X variable and same Y variable may obtain different values of uncorrected r_pb partly due to different values of n_a/n_b from study to study. Therefore, values of r_pb should not be compared or meta-analyzed in such cases unless the values of r_pb have been corrected using Equation 4.4. Refer to Hunter and Schmidt (2004) for further discussion.
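The correction for unequal sample sizes can be sketched as follows. This sketch assumes the Hunter and Schmidt (2004) form of Equation 4.4, r_c = a r_pb / [(a^2 - 1) r_pb^2 + 1]^(1/2), with a = [.25/(pq)]^(1/2) as defined above; the function name and the value r_pb = .40 are hypothetical illustrations, whereas n_a = 60 and n_b = 40 come from the example in the text.

```python
import math


def correct_rpb_for_unequal_n(r_pb: float, n_a: int, n_b: int) -> float:
    """Attenuation-corrected point-biserial r for unequal group sizes (Equation 4.4)."""
    n_total = n_a + n_b
    p = n_a / n_total              # proportion of the total sample in Group a
    q = n_b / n_total              # proportion of the total sample in Group b
    a = math.sqrt(0.25 / (p * q))  # a = 1 when the two groups are equal in size
    return a * r_pb / math.sqrt((a ** 2 - 1.0) * r_pb ** 2 + 1.0)


# Sample sizes from the text's example; r_pb = .40 is a hypothetical value
r_c = correct_rpb_for_unequal_n(0.40, 60, 40)
print(f"corrected r_pb = {r_c:.3f}")
```

With a 60/40 split the correction is small; it grows as the split becomes more disproportional, and it leaves r_pb unchanged when n_a = n_b.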

UNRELIABILITY

Unreliability is another factor that can attenuate r_pb and standardized-difference estimators of effect sizes. Roughly, for our purpose here, unreliability means the extent to which a score is reflecting measurement error, something other than the true value of what is being measured. Measurement error causes scores from measurement instance to measurement instance to be inconsistent, unrepeatable, or unreliable in the sense that one cannot rely on getting consistent observed scores for an individual from the test or measure even when the magnitude of the underlying attribute that is being measured has not changed.

One common way to estimate the reliability of a test is to administer the test to a sample (the larger the better) and then readminister the same test to the same sample within a short enough period of time so that there is little opportunity for the sample's true scores to change. In this case measurement error would be reflected by inconsistency in the observed scores. If one calls the scores from the first administration of the test Y1 values and calls the scores from the second administration of the test Y2 values, and then one calculates the r between these Y1 values and Y2 values, one will have an estimate of test reliability. Such a procedure is called test-retest reliability and the resulting r is called a reliability coefficient, denoted r_yy. Because r ranges from -1 to +1, ideally one would want r_yy to be as close to +1 as possible, indicating perfect reliability. Unfortunately, some psychological and behavioral science tests (and perhaps some medical tests) have only modest values of r_yy. For example, the least reliable of the tests of personality may have r_yy values that are approximately equal to .3 or .4. At the other extreme, we expect a modern digital scale to measure weight with r_yy close to 1.

Because r_pb, as an r, is intended to estimate the covariation of X and Y, that is, the extent to which true variation in Y is related to true variation in X, unreliability results in an attenuation of r.
The r is attenuated because the measurement error that underlies the unreliability of Y adds variability to Y (increases s_y), but this additional variability is an unsystematic variability that is not related to variation in the X variable. (Recall from introductory statistics that r is a mean of products of z scores, r = Σ[z_x z_y]/N, and that a z score has s in its denominator.) Because the t statistic and standardized-difference estimators of effect sizes have s values in their denominators and unreliability increases s, unreliability reduces the value of t and a standardized-difference estimator of effect size.

Although simple physical measurements can be made very reliably, when using measures of more abstract dependent variables, such as personality variables, unreliability should be of some concern to the researcher. In such cases the researcher should conduct a search of the literature to choose the most reliable alternative measure that may be available for the dependent variable and for the type of participants at hand. If the researcher conducts in-house test-retest reliability research to estimate r_yy prior to using a particular measure of the dependent variable in the main research, or learns of the measure's r_yy from a search of the literature, the value of r_yy should be included in the research report.
The value of r_yy is not only relevant to the magnitude of the reported effect size, but it is also relevant to the statistical power of the test of significance that was used to make an inference about the effect size, especially if the result was statistically insignificant. Unreliability can reduce the power of a statistical test sufficiently to cause a Type II error.

Information about the reliability (and validity) of many published tests can be found in the regularly updated book called the Mental Measurements Yearbook (as of the time of this writing, The Fifteenth Mental Measurements Yearbook; Plake, Impara, & Spies, 2003). An index of the tests and measurements that have been reviewed there can currently be found at http://www.unl.edu/buros/indexbimm.html. Wilkinson and the American Psychological Association's Task Force on Statistical Inference (1999) noted that an assessment of reliability is required to interpret estimates of effect size. A confidence interval for r_yy in the population can be constructed as was discussed in this chapter for any r_pop.

Note that it may be the case that reliability is greater when using as participants a group of people with certain demographics than when using another group of people with different demographics. For example, r_yy may be different when using men or women, young or old, or college students or nonstudents. The most relevant r_yy that a researcher should seek in the literature is an r_yy that has been obtained when using participants who are as similar as possible to the participants in the pending research. If a relevant r_yy cannot be found in a search of the literature, a researcher who is using a measure of questionable reliability should consider conducting a reliability study on an appropriate sample prior to the main research.
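The test-retest procedure described in this section (correlating the Y1 scores with the Y2 scores) can be illustrated with simulated data. The sketch below is not an analysis of real test scores: the function names, the sample size, and the standard deviations are hypothetical, and the simulation assumes that each observed score is an unchanging true score plus independent normally distributed measurement error.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_test_retest(n, true_sd, error_sd, seed=0):
    """Estimate r_yy from two simulated administrations of the same test."""
    rng = random.Random(seed)
    true_scores = [rng.gauss(0.0, true_sd) for _ in range(n)]
    # Same true scores at both administrations; only the error differs
    y1 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    y2 = [t + rng.gauss(0.0, error_sd) for t in true_scores]
    return pearson_r(y1, y2)


# Hypothetical values: error SD is half the SD of the true scores
r_yy = simulate_test_retest(n=5000, true_sd=1.0, error_sd=0.5)
print(f"estimated test-retest reliability: {r_yy:.2f}")
```

Under these assumptions the estimate should fall near true variance divided by total variance, here 1/(1 + .25) = .80, and increasing `error_sd` lowers it.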

The reliabilities of the scores across studies of the same underlying outcome variable may vary either because of relevant differences between the participants across studies or because of the use of different measures of the outcome variable. Therefore, one should not compare effect sizes without considering the possible influence of such differential reliability.

Researchers should also be interested in the reliability with which the X variable is being measured or administered because the previous discussion about the reliability of the Y variable also applies to the X variable. Even in the case of r_pb, in which X has only two values, membership in Group a or Group b, unreliability of the X variable can occur, along with its attenuating effects. For example, consider the case of research with preexisting groups, such as the comparison of the mothers of schizophrenic children and normal children that we undertook in chapter 3 and earlier in this chapter. In such cases r_pb and d would be attenuated to the extent that the diagnosis of schizophrenia was made unreliably. In experimental research the reliability of the administration of the dichotomous X variable is maintained to the extent that all members of Group a are in fact treated in the same way (the "a" way) and all members of Group b are treated in the same way (the "b" way) as planned and that all members of a group understand and follow their instructions in the same way.

In some areas of research it may be more difficult to administer treatment reliably than in other areas of research. For example, in experiments that compare Psychotherapy a to Psychotherapy b, although with any degree of care on the part of the clinical researcher all members of a particular therapy group will very likely receive at least the same general kind of therapy, for a variety of reasons it may not be possible to
treat all members of a particular therapy group in exactly the same way in all details for every moment of the course of therapy. Psychotherapy can be a complex and dynamic process involving two interacting people, the therapist and the patient, not a static exactly repeatable procedure in which each patient in a group is readily spoon-fed the therapy in exactly the same way. The same kind of problem may arise in research that compares two methods of teaching. The extent to which a treatment is administered according to the research plan, and therefore administered consistently across all of the members of a particular treatment group, is called treatment integrity. To maintain treatment integrity, for some behavior therapies there are detailed manuals for the consistent administration of those particular therapies.

When treatment integrity has not been at the highest level, the values of the estimator of effect size, the value of t, and the power of the t test may have been seriously reduced by such unreliability. In such cases a researcher should comment about the level of treatment integrity in the research report. A more general name for treatment integrity in experimental research is experimental control. Of course, researchers should control all extraneous variables to maximize the extent to which variation of the independent variable itself is responsible for variation of the values of the dependent variable from Group a to Group b. To the extent that extraneous variables are not controlled, they will inflate s values with unsystematic variability, resulting in the previously discussed consequences for t testing and estimation of effect sizes.

There is an equation for correcting for the attenuation in r, r_pb, or other estimator of effect size that has been caused by unreliable measurement of the dependent variable (Hunter & Schmidt, 2004; Schmidt & Hunter, 1996). The equation for correcting for attenuation results in an estimate of an adjusted effect size that would be expected to occur if Y could be measured perfectly reliably. In general an estimator of effect size that is adjusted for unreliability of the scores on the dependent variable, denoted here ES_adj, is given by

$$ ES_{adj} = \frac{ES}{r_{yy}^{1/2}} \tag{4.5} $$

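Equation 4.5 amounts to a single division, which the following Python sketch makes explicit. The function name and the numerical values (an effect size of .40 and r_yy = .64) are hypothetical; the optional `r_xx` argument reflects the (r_xx r_yy)^(1/2) denominator that the text describes for nonexperimental studies.

```python
import math
from typing import Optional


def adjust_for_unreliability(es: float, r_yy: float,
                             r_xx: Optional[float] = None) -> float:
    """Adjust an effect size estimate for unreliability (Equation 4.5)."""
    if not 0.0 < r_yy <= 1.0:
        raise ValueError("r_yy must be greater than 0 and no greater than 1")
    if r_xx is not None and not 0.0 < r_xx <= 1.0:
        raise ValueError("r_xx must be greater than 0 and no greater than 1")
    # Denominator is r_yy^(1/2), or (r_xx * r_yy)^(1/2) when r_xx is supplied
    product = r_yy if r_xx is None else r_xx * r_yy
    return es / math.sqrt(product)


# Hypothetical illustration: observed effect size .40, reliability r_yy = .64
es_adj = adjust_for_unreliability(0.40, 0.64)
print(f"adjusted effect size = {es_adj:.2f}")
```

Because the denominator cannot exceed 1, the adjusted estimate is never smaller in absolute value than the original, and the two coincide only when reliability is perfect.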
In the case of nonexperimental studies, an adjustment for unreliability of the X variable can be made by substituting r_xx for r_yy in Equation 4.5, or (r_xx r_yy)^(1/2) can be used instead for the denominator to adjust for both kinds of unreliability at once. For the more complicated case of adjusting estimators of effect size for unreliability of the X variable in experimental studies and for other discussion, refer to Hunter and Schmidt (1994, 2004). For additional discussion of correction of effect sizes for unreliability, consult Baugh (2002a, 2002b). If a confidence interval is to be constructed for the population value of a reliability coefficient, then Equation 4.5 can be applied separately to the lower and to the upper limit of the effect size that is to be adjusted. In this case the adjusted and the original lower and upper limits should be reported.

The adjustment for unreliability is rarely used, apparently for one or more reasons other than the fact that, unfortunately, interest in psychometrics as part of undergraduate and graduate curricula is decreasing. (Psychometrics is, defined minimally here, the study of methods for constructing tests, scales, and measurements in general and assessing their reliability and validity.) The first possible reason for not making the adjustment is simply that r_yy may not be known in the literature and the researcher does not want to delay the research by preceding it by one's own in-house reliability check. Second, some researchers use variables whose scores are known to be, or are believed to be, generally very reliable. Third, some researchers may be satisfied merely to have their results attain statistical significance, believing that unreliability was not a problem if it was not extreme enough to have caused a statistically insignificant result. Note, however, that even if results do attain statistical significance, reliability may still have been low enough to result in a substantial underestimation of effect size for the underlying dependent variable in the population. Fourth, some researchers
might be concerned that their estimates of effect size will be less accurate to the extent that their estimation of reliability is inaccurate. We have not included the possibility that some researchers might be forgoing the correction for unreliability because they believe that underestimation of effect size is acceptable and only overestimation is unacceptable. Refer to Hunter and Schmidt (2004) for a contrary opinion. The reader is encouraged to reflect on the merits of all of these reasons for not calculating and reporting a corrected estimate of effect size.

Finally, there is a philosophical objection to the adjustment on the part of some researchers who believe that it is not worthwhile to calculate an estimate of an effect size that is only theoretically possible in an ideal world in which the actually unreliable measure of the dependent variable could be measured perfectly reliably, an ideal that is not currently realized for the measures of their dependent variables. Hunter and Schmidt (2004) represent the opposing view with regard to correcting for unreliability and other artifacts. To accommodate both sides in this controversy we recommend that researchers consider reporting adjusted estimates of effect size and the original unadjusted estimates.
In this regard researchers should recognize that some readers of their reports might be more, or less, interested in the reporting of corrected estimates of effect sizes than the researchers are.

In the preceding discussion we did not mention the fact that correcting for unreliability increases sampling variability of an effect size. However, the greater the reliability of a measure, the less the increase in sampling variability that will result from the correction for unreliability. Therefore, one should still strive to use the most reliable measures even when planning to use the correction for unreliability. Consult Hunter and Schmidt (2004) for an elaboration of this issue and a discussion of correcting estimates of effect size for unreliability when the estimates are to be combined in a meta-analysis. Hunter and Schmidt (2004) provided a very extensive and authoritative treatment of the attenuating effects of artifacts such as unreliability, and correcting for them. Additional artifacts include sampling error, imperfect construct validity of the independent and dependent variables, computational and other errors, extraneous factors introduced by aspects of a study's procedures, and restricted range. It would be far beyond the scope of this book to discuss this list of topics. It will have to suffice for us to discuss only the artifact of restricted range, to which we turn in the next section. Refer to Schmidt, Le, and Ilies (2003) for discussion of a broader type of reliability coefficient (the coefficient of equivalence and stability) that estimates measurement error from an additional source beyond those that the test-retest reliability coefficient reflects. For further discussions of unreliability refer to Onwuegbuzie and Levin (2003) and the references therein. At the time of this writing, Windows-based commercial software is available, called "Hunter-Schmidt Meta-Analysis Programs Package," for calculating artifact-adjusted estimates of correlations and standardized differences between means.
These (2004), but but they also include pro prometa-analysis by Hunter and Schmidt (2004), differences grams for correcting individual correlations and standardized differences primary researchers. Currently the package can be between means for primary from frank-schmidt@uiowa. [email protected], [email protected], or ordered from edu, [email protected], [email protected]. Hunter and Schmidt (2004) discussed discussed other soft [email protected]. ware for similar purposes. RESTRICTED RANGE
Another possible attenuator of rpb is called restricted (or truncated) range, which usually means using samples whose extent of variation on the independent variable is less than the extent of variation of that variable in the population to which the results are to be generalized. An example of restricted range would be research in which patients generally receive up to, say, 26 weeks of a certain therapy in the "real world" of clinical practice, but a researcher studying the effect of duration of therapy compares a control group (0 weeks) to a treated group that is intentionally given, say, 16 weeks of that therapy. Another example would be drug research involving a drug for which the usual prescribed doses in clinical practice range from, say, 250 mg to 600 mg, but a researcher compares groups that are intentionally prescribed, say, either 300 mg or 500 mg. An example of the effect of restricted range is the lower r between SAT scores and GPAs at universities with the most demanding admissions standards (restricting most admissions to those ranging from high to very high SATs), compared to the r between SATs and GPAs at less restrictive universities (accepting students across a wider range of SAT scores).

The examples thus far are examples of direct range restriction because the researcher knows in advance that the range of the independent variable is restricted. Instances in which this range is restricted because the available participants merely happen to be, instead of being selected to be, less variable than the population are examples of indirect range restriction. Hunter and Schmidt (2004) discussed methods for correcting for direct and indirect range restriction. However, as should be the case under a fixed-effect approach, when generalizations of results are confined to populations of whom the samples are representative in their range of the independent variable, instead of more general populations, no such correction need be made. Consult Chen and Popovich (2002), Cohen, Cohen, et al. (2003), and Hunter and Schmidt (2004) for further discussion of restricted range and how to correct for it, and consult Aguinis and Whitehead (1997) and Callender and Osburn (1980) for related discussions. Many additional references can be found in Chan and Chan (2004). Note that restricted range in the measure of the dependent variable can occur if would-be high scoring or would-be low scoring participants drop out of the research before their data are obtained. Figure 4.2 depicts the great lowering of the value of r (compared to rpop) in samples in which X varies much less than it does in the population. Hunter and Schmidt (2004) provided a statistical correction for the case in which restriction of the range of the dependent variable is not accompanied by restriction of the range of the independent variable.
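The lowering of r under restricted range can be illustrated with a small simulation. The following Python sketch is only an illustration under assumptions that are not in the text: a bivariate normal population with a hypothetical rpop of .60, a sample of 20,000 cases, and arbitrary cut points at -0.5 and +0.5 standard deviations to define low, moderate, and high X. The function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_range_restriction(rho=0.6, n=20000, seed=1):
    """Return r for the full sample and for three restricted ranges of X.

    rho, n, seed, and the cut points below are illustrative assumptions.
    """
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        # Y shares a linear component with X; the rest is independent noise.
        y = rho * x + math.sqrt(1.0 - rho ** 2) * rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)

    def r_within(low, high):
        # Keep only the cases whose X falls inside [low, high).
        pairs = [(x, y) for x, y in zip(xs, ys) if low <= x < high]
        return pearson_r([p[0] for p in pairs], [p[1] for p in pairs])

    return {
        "full": pearson_r(xs, ys),
        "low X": r_within(-math.inf, -0.5),
        "moderate X": r_within(-0.5, 0.5),
        "high X": r_within(0.5, math.inf),
    }


results = simulate_range_restriction()
for label, r in results.items():
    print(f"{label:>10}: r = {r:.2f}")
```

Under these assumptions each restricted-range r should come out clearly below the full-range r, although not necessarily as close to 0 as in the more extreme case drawn in Figure 4.2.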
Although typically not the case, sometimes restricted range can cause an increase in r. Refer to Wilcox (2001) for an example involving a curvilinear relationship in which restricted range causes an increase in the magnitude of r and a change in its sign when the restricted range results from the removal of outliers. Suppose also, for example, that a relationship between two variables is curvilinear in the population and the sample is one in which the range of X is restricted. In this case the magnitude and sign of r in the sample can depend on whether the range is restricted to low, moderate, or high values of X. This case is depicted in Fig. 4.3. Recall again that r reflects only a linear component of a relationship between two variables.

In the case of standardized-difference estimators of effect size, not letting Treatment a and Treatment b differ as much in the research as they do or might do in real-world application of these two treatments is also a restriction of range that lowers the value of the estimator. In experimental research the extent of difference between or among the treatments is called the strength of manipulation. A weaker manipulation of the independent variable in the research setting than occurs in the world of practice would be a case of restricted range.

Restricted range is not only likely to lower the value of any kind of estimator of effect size, it can also lower the value of test statistics, such as t, thereby lowering statistical power.
Therefore, in applied areas researchers should use ranges of the independent variable that are as similar as possible to those that would be found in the population to which the results are to be generalized. Note in this regard that it is also possible, but we warn against it, for an applied researcher to use an excessive range of the independent variable, a range that increases the value of the estimate of effect size and increases statistical power, but at a price of being unrealistic (externally invalid) in comparison to the range of the independent variable that would be used in practice.

[Figure 4.2 omitted: scatterplot of Y against X, with the X axis divided into Low X, Moderate X, and High X ranges; r ≈ 0 within each range.]
FIG. 4.2. A case in which the overall correlation between X and Y (rpop) is much higher than it would be estimated to be if the range of X in the sample were restricted to only low values, only moderate values, or only high values.

[Figure 4.3 omitted: curvilinear scatterplot of Y against X; r is + in the Low X range, r ≈ 0 in the Moderate X range, and r is - in the High X range.]
FIG. 4.3. If the overall relationship between X and Y is curvilinear in the population, restricting the range of X in the sample to only low, only moderate, or only high values can influence the size and sign of r in the sample.

Consider clinical research involving a disease for which there is at least one somewhat effective treatment and for which it is known that without treatment there is not a spontaneous remission of the disease. Because using no treatment is already known in this case to be worse than using the current treatment, conducting research on this disease by comparing a control group (no treatment) with a group that is given a new proposed treatment results in a wide range of the independent variable and might yield a relatively large estimate of effect size and high statistical power but at a price of being unrealistic as well as unethical. The more realistic and ethical research on treating this disease would compare a group of patients that is given the current best treatment and a group that is given the new proposed treatment. Similarly, in educational research one obviously would not conduct research that compares the performance of children who are taught a basic subject in a new way to the performance of a control group of children who are not taught the subject at all. Consult Abelson (1995) for a discussion of the causal efficacy ratio as an effect size that is relative to the cause size (i.e., an effect size that is relative to the strength of the manipulation). Also refer to Tryon's (2001) discussion of such an effect size. Chan and Chan (2004) discussed the results of Monte Carlo simulations of a bootstrap method for estimating the standard error and constructing a confidence interval for a correlation coefficient that has been corrected for range restriction.
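The curvilinear case described in this section, in which the sign and size of r depend on which part of the range of X is sampled, can be sketched numerically. The example below is a minimal illustration with made-up, noise-free data (an inverted-U curve that peaks at X = 5); the curve, the cut points, and the function names are assumptions chosen only for clarity.

```python
import math


def pearson_r(xs, ys):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical inverted-U relationship: Y rises, peaks at X = 5, then falls.
xs = [i / 10 for i in range(0, 101)]      # X from 0.0 to 10.0 in steps of 0.1
ys = [-(x - 5.0) ** 2 for x in xs]


def r_in_range(low, high):
    """r computed only from cases with low <= X <= high."""
    pairs = [(x, y) for x, y in zip(xs, ys) if low <= x <= high]
    return pearson_r([p[0] for p in pairs], [p[1] for p in pairs])


r_all = r_in_range(0.0, 10.0)       # whole range: linear component cancels out
r_low = r_in_range(0.0, 3.0)        # rising limb only
r_moderate = r_in_range(3.5, 6.5)   # symmetric region around the peak
r_high = r_in_range(7.0, 10.0)      # falling limb only

for label, r in [("all X", r_all), ("low X", r_low),
                 ("moderate X", r_moderate), ("high X", r_high)]:
    print(f"{label:>10}: r = {r:+.3f}")
```

With these made-up data r is strongly positive for low X, essentially 0 for moderate X, and strongly negative for high X, even though the underlying relationship is the same throughout.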
SMALL, MEDIUM, AND LARGE EFFECT SIZE VALUES

Because it is a type of z score, there is no theoretical limit to the magnitude of a standardized-difference effect size, and theoretically rpop can range from -1 to +1. However, some readers of this book may want to have a better sense of the different magnitudes of estimates of effect sizes that have been reported so that they can be better able to place newly encountered estimates in context. In behavioral, psychological, and educational research, standardized-difference estimates are rarely more extreme than (ignoring sign) 2.00, and rpb estimates are rarely beyond .70, with both kinds of estimates typically being very much less extreme than these values.

Categorizing values of estimates of effect size as small, medium, or large is necessarily somewhat arbitrary. Such categories are, as Cohen (1988) pointed out, very relative terms: relative, for example, to such factors as the particular area of research and to its degree of experimental control of extraneous variables and the reliabilities of the scores on its measures of dependent variables. For example, an effect size of a certain magnitude may be relatively large if it occurs in some area of research in social psychology, whereas that same value may not be relatively large if it occurs in some possibly more controlled area of research such as neuropsychology. Also, even in the same field of study two observers of a given value of effect size may rate that value differently.
With appropriate tentativeness and a disclaimer Cohen (1988) offered admittedly rough criteria for small, medium, and large effect sizes, and examples within each category. (We ignore the sign of the effect sizes, which is not relevant here.) We also relate Cohen's criteria to the distribution of standardized-difference estimates of effect sizes that were found by Lipsey and Wilson (1993) in psychological, behavioral, and educational research and to the findings reported by Grissom (1996) on psychotherapy research and cited previously in chapter 3.

Cohen (1988) categorized as small Δ ≤ .20 and rpop ≤ .10, with regard to the point-biserial correlation. Cohen's examples of sample values of effect size that fall into this category include (a) the slight superiority of mean IQ in nontwins compared to twins, (b) the slightly greater mean height of 16-year-old girls compared to 15-year-old girls, and (c) some differences between women and men on some scales of the Wechsler Adult Intelligence Test. Lipsey and Wilson (1993) found that the lowest 25% of the distribution of psychological, behavioral, and educational examples of standardized-difference estimators of effect size were on the order of d ≤ .30, which is equivalent to rpb ≤ .15 and somewhat supports Cohen's criteria.
Cohen (1988) categorized as medium Δ = .5 and rpop = .243. Consistent with these criteria for a medium effect size, Lipsey and Wilson (1993) found that the median d = .5. This criterion is also consistent with typical effect sizes in counseling psychology (Haase, Waechter, & Solomon, 1982) and in social psychology (Cooper & Findley, 1982). Cohen's approximate examples include the greater mean height of 18-year-old women compared to 14-year-old girls and the greater mean IQ of clerical compared to semi-skilled workers and of professional compared to managerial workers. Recall from chapter 3 that Grissom (1996) found a median d = .44 when comparing placebo groups to control groups in psychotherapy research and a median d = .58 when comparing treated groups to placebo groups, which are roughly equivalent to rpb = .22 and .27, respectively.

Cohen (1988) categorized as large Δ ≥ .8 and rpop ≥ .371. His examples include the greater mean height of 18-year-old women compared to 13-year-old girls and a higher mean IQ of holders of PhD degrees compared to first-year college students. Somewhat consistent with Cohen's criteria, Lipsey and Wilson (1993) found that the top 25% of values of d were d ≥ .67, roughly corresponding to rpb ≥ .32. Recall again from chapter 3 that Grissom (1996) found that the most efficacious therapy produced a (very rare) median d = 2.47, roughly corresponding to rpb = .78.

Note that Cohen's (1988) lower bound for a medium Δ (i.e., .5) is equidistant from his upper bound for a small effect size (Δ = .2) and his lower bound for a large effect size (Δ = .8). Consult Cohen (1988) and Lipsey and Wilson (1993) for further discussion. Rosenthal et al. (2000), Wilson and Lipsey (2001), and chapter 5 of this book have tables that show the corresponding values of various kinds of measures of effect size (Δ, rpop, and others that are discussed in chap. 5).

Note that the designations of small, medium, and large effect sizes do not necessarily correspond to the degree of practical significance of an effect. As we previously noted, judgment about the practical significance of an effect depends on the context of the research, and the expertise and values of the person who is judging the practical significance. For example, finding a small lowering of death rate from a new therapy for a widespread and likely fatal disease would be of greater practical significance than finding a large improvement in cure rate for a new drug for athlete's foot. The practical significance of an effect is considered further in the next section. Refer to Glass et al. (1981) for opposition to the designations small, medium, and large.

Sometimes the practical significance of an effect can be measured tangibly. For example, Breaugh (2003) reviewed cases of utility analyses in which estimates were made of the amount of money that employers could save by subjecting job applicants to realistic job previews (RJPs). Although the correlation between the independent variable of RJP versus no RJP and the dependent variable of employee turnover was very small, r = .09, employers could judge the practical significance of the results by evaluating the amount of money that utility analysis estimated would be saved by the small reduction of employee turnover that was associated with the RJP program. Also, Breaugh (2003) cited an example from Martell, Lane, and Emrich (1996) in which small but consistent bias effects in ratings of the performance of female employees can result over the years in a large number of women unfairly denied promotion. Consult Prentice and Miller (1992) for additional examples of apparently small effect sizes that can be of practical importance.

When interpreting an estimate of effect size one should also consider the factors, discussed earlier in this chapter, that can affect the magnitude and statistical significance of such an estimate. Also, one should not be prematurely impressed with a reported nonzero r or d when there is no rational explanation for the supposed relationship and the result has not been replicated by other studies. Of course, such findings, especially when based on small samples, may be just chance findings. For example, nonzero correlations have been reported over a certain period of time between stock market values and both which football conference wins the Super Bowl of United States football and the amount of butter produced in Bangladesh (both nonsense correlations?). There are likely many thousands of values of r calculated annually throughout the world. Even if it were literally true that all rpop = 0, at the p < .05 significance level approximately 5% of these thousands of r values will falsely lead to a conclusion that rpop ≠ 0.
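The pairs of Δ and rpop values quoted in this section (.2 and .10, .5 and .243, .8 and .371) can be checked with the standard conversion between a standardized difference and a point-biserial correlation for two groups of equal size, r = d / sqrt(d^2 + 4). That this particular conversion underlies the quoted values is our assumption; the sketch below only shows that it reproduces them, and the function name is hypothetical.

```python
import math


def d_to_r_pb(d):
    """Convert a standardized mean difference to a point-biserial r.

    Assumes two groups of equal size (an assumption made for this sketch).
    """
    return d / math.sqrt(d ** 2 + 4.0)


# Values of d (or delta) mentioned in this section.
for d in (0.20, 0.30, 0.50, 0.67, 0.80, 2.47):
    print(f"d = {d:.2f}  ->  r_pb = {d_to_r_pb(d):.3f}")
```

Rounded appropriately, the results agree with the values cited for Cohen's criteria and for the Lipsey and Wilson (1993) quartiles; with unequal group sizes the conversion would differ.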
BINOMIAL EFFECT SIZE DISPLAY

Rosenthal and Rubin (1982) presented a table to aid in the interpretation of any kind of r, including the point-biserial r. The table is called the binomial effect size display, BESD, and was intended especially to illustrate the possibly great practical importance of a supposedly small value for any type of r. The BESD, as we soon discuss, is not itself an estimator of effect size but is intended instead to be a hypothetical illustration of what can be inferred about effect size from the size of any r. The BESD has become a popular tool among researchers. We discuss its limitations in the next section.

The BESD develops from the fact that r can also be applied to data in which both the X and Y variables are dichotomies. In the case of dichotomous X and Y variables the name for r in a sample is the phi coefficient, φ. For example, X could be Treatment a versus Treatment b, and Y could be the categories: participant better after treatment and participant not better after treatment. One codes X values, say, 1 for Treatment a and, say, 2 for Treatment b, as one would if one were calculating rpb, but now for calculating phi Y is also coded numerically, say, 1 for better and, say, 2 for not better. Phi is simply the r between the X variable's set of 1s and 2s and the Y variable's set of 1s and 2s. Although software may not seem to indicate that it can calculate phi, when calculating the usual r for data in a file with two values, such as 1 and 2, in the X column and two values, such as 1 and 2, in the Y column, the software is in fact calculating phi. (In chap. 8 we discuss another context for phi as an estimator of effect size and another way to calculate it.)
By supposing that na = nb and by treating a value of an r for the moment as if it had been a value of phi, we now observe that one can construct a hypothetical table (the BESD) that illustrates another kind of interpretation or implication of the value of an r. For example, suppose that an obtained value of an r is a modest .20. Although the r is based on a continuous Y variable, to obtain a different perspective on this result the BESD pretends for the moment that X and Y had both been dichotomous variables and that the r = .20 had, therefore, instead been a φ = .20. Table 4.1 depicts what results would look like if φ = .20 and, for example, na = nb = 100. An r equal to .20 might not seem to some to represent an effect size that might be of great practical importance. However, observe in Table 4.1 that if the r of .20 had instead been a φ = .20 (the basis of Table 4.1), such results would have indicated that 20% more participants improve under Treatment a than improve under Treatment b (i.e., 60% - 40% = 20%).

TABLE 4.1
A BESD

                       Participant Better   Participant Not Better
                       (Y = 1)              (Y = 2)                  Totals
Treatment a (X = 1)    60                   40                       100
Treatment b (X = 2)    40                   60                       100

We observe in Table 4.1 that 60 out of a total of 100 participants (60%) in Treatment a are classified as being better after treatment than they had been before treatment, and 40 out of a total of 100 participants in Treatment b are classified as being better after treatment than they had been before treatment. These percentages are called the success percentages for the two treatments. The result appears now, in terms of the BESD-produced success percentages, to be more impressive. For example, if many thousands or millions of patients in actual clinical practice were going to be given Treatment a instead of the old Treatment b because of the results in Table 4.1 (assuming for the moment that the sample phi of .20 is reflecting a population phi of .20; Hsu, 2004), then we would be improving the health of an additional 20% of many thousands or millions of people beyond the number that would have been improved by the use of Treatment b. The more serious the type of illness, the greater would be the medical and social significance of the present numerical result (assuming also that Treatment a were not prohibitively expensive or risky). The most extreme example would be the case of any fairly common and possibly fatal disease of which 20% more of hundreds of thousands or millions of patients worldwide would be cured by using Treatment a instead of Treatment b. Again, such results are more impressive than an r = .20 would seem to indicate at first glance. However, as Rosenthal et al. (2000) pointed out, the 20% increase in success percentage for Treatment a versus Treatment b does not apply directly to the original raw data because the BESD table is hypothetical. (This disclaimer leads to a criticism of the BESD that is discussed in the next section.) The BESD is simply a hypothetical way to interpret an r (or rpb) by addressing the following question: What if both X and Y had been dichotomous variables, and, therefore, the r had been a phi coefficient, and the resulting 2 x 2 table had uniform margin totals (explained later), what would the increase in success percentage have been by using Treatment a instead of Treatment b? Note that in many instances of research the original data will already have arisen from a 2 x 2 table, but not always one that satisfies the specific criteria for a BESD table, which is discussed in the next section.

In general for any r, to find the success percentage (better) for the treatment coded X = 1, use 100[.50 + (r/2)]%. Because the two percentages in a row of a BESD must add to 100%, the failure percentage for the row X = 1 is, of course, 100% minus the row's success percentage. The success percentage for the row X = 2 is given by 100[.50 - (r/2)]%, and its failure percentage is 100% minus the success percentage for that row. In Table 4.1, r = .20, so the success percentage for Treatment a is 100[.50 + (.20/2)]% = 60%, and its failure percentage is 100% - 60% = 40%. The greater the value of r, the greater the difference will be between the success percentages of the two treatments. Specifically, the difference between these two success percentages will be given by (100r)%. Therefore, even before constructing the BESD one knows that when r = .20 the difference in success percentages will be [100(.20)]% = 20% if the original data are recast into an appropriate BESD. Note that one can also construct a BESD, and estimate the difference in success percentages for the counternull value of r by starting with Equation 4.2 and then proceeding as has just been described in this section. We discuss other approaches to effect sizes for a 2 x 2 table in chapter 8.
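The arithmetic for building a BESD from any r can be collected in a few lines. The sketch below follows the formulas given in this section, 100[.50 + (r/2)]% and 100[.50 - (r/2)]%; the function name besd and the dictionary layout are illustrative choices, not part of the method itself.

```python
def besd(r):
    """Return the BESD success and failure percentages implied by r.

    Row "a" is the treatment coded X = 1, row "b" the treatment coded X = 2.
    """
    success_a = 100.0 * (0.50 + r / 2.0)
    success_b = 100.0 * (0.50 - r / 2.0)
    return {
        "a": {"better": success_a, "not better": 100.0 - success_a},
        "b": {"better": success_b, "not better": 100.0 - success_b},
    }


table = besd(0.20)
difference = table["a"]["better"] - table["b"]["better"]

for row, cells in table.items():
    print(f"Treatment {row}: better = {cells['better']:.0f}%, "
          f"not better = {cells['not better']:.0f}%")
print(f"Difference in success percentages = {difference:.0f}%")
```

For r = .20 this reproduces the entries of Table 4.1, and the difference between the two success percentages equals (100r)%.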
LIMITATIONS OF THE BESD

There are limitations of the BESD and its resulting estimation of the difference between the success percentages of two treatments. First, the difference in the success percentages from the BESD is only equal to r if the overall success percentage = overall failure percentage = 50% and if the two groups are of the same size (Strahan, 1991). The result is a table that is said to have uniform margins. Observe that Table 4.1 satisfies these criteria because the two samples are of the same size, the overall (marginal) success percentage equals (60 + 40)/200 = 50%, and the overall (marginal) failure percentage equals (40 + 60)/200 = 50%. Note, however, that we are aware of an opinion that this first criticism is actually not a limitation but merely part of the definition of a BESD. Refer to Hsu (2004) for an argument that such an opinion is problematic in many cases.

When the criteria for a BESD are satisfied the resulting difference in success percentages is relevant to the hypothetical population whose data are represented by the BESD. However, are the results relevant to the population that gave rise to the original real data that were recast into the BESD table or relevant to any real population (Crow, 1991; Hsu, 2004; McGraw, 1991)? The population for which the BESD-generated difference in success percentages in a table such as Table 4.1 is most relevant is a population in which each half received either Treatment a or Treatment b and half improved and half did not. Again, this limitation may be considered by some to be merely an inherent aspect of the definition of the BESD.

Cases in which the original data are available are the cases that are most relevant to this book because this book is addressed to those who produce data (primary researchers). It makes more sense in such cases of original 2 x 2 tables to compare the success rates for the two treatments based on the actual data instead of the hypothetical BESD. For example, in such cases one may use the relative risk or other effect sizes for data in a 2 x 2 table that are discussed in chapter 8. The measure that we call the probability of superiority in chapters 5 and 9 is also applicable.
call the probability of of superiority superiority in chapters 5 and Also, suppose suppose that the success percentage percentage and failure percentage percentage in the population to which the sample results are to be generalized are not real population 50%. In this case a BESD-based difference difference between success each equal to 50%. toward overestimating overestimating the difpercentages in the sample will be biased toward dif ference between success percentages for the two treatments treatments in that & Schumacker, 1997). population (Hsu, (Hsu, 2004; Preece, 11983; 98 3 ; Thompson Thompson & 1 997). Additional problems can arise when the original de original measure ooff the dependent variable variable is continuous instead instead of dichotomous. rependent dichotomous . In this case re searchers often split scores of each of the two two samples at at the overall median equal-sized overall median of the scores to form equal-sized overall successful and failing categories, thereby thereby satisfying the criteria for the hypothetical hypothetical BESD ta table. However, However, defining success success or failure in terms of scoring below or ble. above the overall median score often may 98 3 ; may not not be realistic (Preece, 11983; Thompson & & Schumacker, Schumacker, 11997). not every treated de de99 7) . For example, not Thompson who scores below the the median on a test of depression can be con conpressive who sidered to be cured or a success. success. Similarly, it has been reported that a school school district had such teachers such great difficulty in filling its quota of teachers much below the median that it even hired teachers who had scored very much hiring test. In that case, hiring on a hiring case, scoring well below the median on a hiring actually resulted in a "success" for some applicants (getting hired). 
test actually Furthermore, recall that dichotomizing dichotomizing a continuous continuous variable is also Furthermore, unwise because it can decrease statistical statistical power. Hunter Hunter and Schmidt unwise attenuation of a correlation (2004) discussed correcting for the attenuation coefficient that occurs occurs when when a continuous variable variable is is dichotomized. dichotomized. coefficient For a response to some of the the criticisms of the BESD method method refer refer to to Refer to Hsu (2004) for an an extensive critique of the the Rosenthal ((1991a). 1 99 1 a) . Refer
CORRELATIONAL EFFECT SIZES CORRElATIONAL
�
91 9 1
BESD. Common Common measures measures of of effect effect size size for for data data that that naturally, naturally, not not hy hyBESD. pothetically, fall fall into into 2 X x 2 tables tables (relative (relative risk, odds ratio, ratio, and and the the dif difrisk, odds pothetically, ference ference between between two two proportions) proportions) are are discussed discussed in in chapter chapter 88 of of this this book. Rosenthal (2000) for for further further discussion discussion of of the the BESD BESD and and book. Consult Consult Rosenthal the three three measures of effect effect size size that that were were just Refer to to Levy Levy the measures of just mentioned. mentioned. Refer for another another interpretation interpretation of of phi. phi. ((1967) 1 96 7) for THE COEFFICIENT OF OF DETERMINATION The square of of the the sample sample correlation coefficient, r2 r2 (or which is The square correlation coefficient, (or r2\pb), which is fy used called the the sample sample coe coefficient of determination, has has been been wide widely used as as an an called fficient of estimator of of r2�op which is is called called the the popuLation population coefficient coefficient of of determina determinaestimator pop , which tion. There are are several several phrases phrases that that are are typically typically used used (accurately (accurately or or in intion. There accurately, depending depending on on the the context) context) to to define define or or interpret interpret aa coefficient accurately, coefficient of determination. determination. 
The usual interpretation is that r²_pop indicates the proportion of the variance of the dependent variable (i.e., the proportion of σ²_Y) that is predictable from, explained by, shared by, related to, associated with, or determined by variation of the independent variable. (However, the applicability of one or more of these descriptions depends on which of its variety of uses r is being applied to, e.g., measuring reliability or estimating the size of an experimental effect, and on models of the X and Y variables; Beatty, 2002; Ozer, 1985.)

It can be shown mathematically that, under certain conditions and assumptions (Ozer, 1985) but not others, r²_pop is the ratio of (a) the part of the variance of the scores on the dependent variable that is related to variation of the independent variable (explained variance) and (b) the total variance of the scores (related and not related to the independent variable).
For the first of the two most extreme examples, if r_pop = 0, r²_pop = 0 and none of the variation of the scores is explained by variation of the independent variable. On the other hand, if r_pop = 1, r²_pop = 1 and all of the variation of the scores is related to the variation of the independent variable. In other words, when the coefficient of determination is 0, by knowing the values of the independent variable one knows 0% of what one needs to know to predict the scores on the measure of the dependent variable, but when this coefficient is 1, one knows 100% of what one needs to know to predict the scores. In this latter case all of the points in the scatterplot that relates variables X and Y fall on the straight line of best fit through the points (a regression line or prediction line of perfect fit in this case) so that there is no variation of Y at a given value of X, rendering Y values perfectly predictable from knowledge of X.
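The variance-ratio reading of the coefficient of determination can be checked numerically. The following is a minimal sketch, not taken from the text: the helper names (`pearson_r`, `explained_variance_ratio`) and the five data pairs are hypothetical, and the equality shown holds here because the prediction line is the ordinary least-squares straight line fitted to the same sample.

```python
from statistics import fmean


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def explained_variance_ratio(x, y):
    """Variation of the predicted scores divided by variation of the observed scores,
    using the least-squares straight line that predicts y from x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    predicted = [my + slope * (a - mx) for a in x]
    explained = sum((p - my) ** 2 for p in predicted)
    total = sum((b - my) ** 2 for b in y)
    return explained / total


# Hypothetical illustrative data (not from the book).
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r_squared = pearson_r(x, y) ** 2
ratio = explained_variance_ratio(x, y)
print(f"r^2 = {r_squared:.2f}, explained/total = {ratio:.2f}")
```

With these made-up scores the two quantities agree, and replacing `y` by an exact linear function of `x` drives both to 1, the perfect-fit case described above.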
For the approximate median r_pb found in behavioral and educational research, r_pb = .24, r²_pb = .24² ≈ .06; therefore, typically independent variables in these areas of research on average are estimated to explain about 6% of the variance of scores on the measures of dependent variables. (Note that in this chapter wherever we restrict our use of the coefficient of determination to the case of the squared point-biserial correlation, r²_pb, we do not have to distinguish between a linear and a curvilinear relationship between the two-valued X variable and the continuous Y variable. In the case of the relationship between two continuous variables r² only estimates the proportion of variance in Y that is explained by its linear relationship with X.) Consult Smithson (2001) for a discussion of a method for constructing a confidence interval for r²_pop. Refer to Ozer (1985) and Beatty (2002) for discussions of the circumstances in which the absolute value of r itself (not r²) may be an appropriate estimator of a kind of coefficient of determination. For more discussion and references on this topic see the section on epsilon squared and omega squared in chapter 6.

Note that the words determined and explained can be misleading to some when used in the context of nonexperimental research. To speak of the independent variable determining variation of the dependent variable in the context of nonexperimental research might imply to some a causal connection between variation of the independent variable and the magnitudes of the scores. In this nonexperimental case a correlation coefficient is reflecting covariation between X and Y, not causality of the magnitudes of the scores.
In this case if, for example, the coefficient of determination in the sample is equal to .49, it is estimated that 49% of the variance of the scores (not their magnitudes) is explained by variation in the X variable. Accounting for the degree of variation of scores is not the same as accounting for the magnitudes of the scores.

Only in research in which participants have been randomly assigned to treatments (experiments) and, therefore, there has been control of extraneous variables can we reasonably speak of variation (manipulation) of the independent variable causing or determining the scores. Therefore, in nonexperimental research perhaps one should consider forgoing the use of the word determination and instead speak of r² as the proportion of variance of the scores that is associated with or related to variation of the independent variable. However, it has been argued that squaring r to obtain a coefficient of determination is not appropriate in the case of experimental research and that r itself is the appropriate estimator of an effect size in the experimental case. Again consult Ozer (1985) and Beatty (2002) for this argument. Of course, a reader of a research report can readily calculate r² if only r is reported or calculate r (at least its magnitude if not its sign in all cases) if only r² is reported.

We will consider three reasons why the use of r² has fallen out of favor recently in some quarters.
First, squaring the typically small or moderate values of r (i.e., r typically closer to 0 than to 1) that are found in psychological, behavioral, and educational research results in yet smaller numerical values of r², such as the typical r² = .06 compared to the underlying r_pb = .24 itself. Some have argued that such small values for an estimator can lead to the underestimation of the practical importance of the effect size. However, this is a less compelling reason for discarding r² when the readership of a report of research has sufficient familiarity with statistics and when the author of the report has provided the readers with discussion of the implications and limitations of the r² and also provided them with other perspectives on the data. In addition, the typically low or moderate values of r² can often be very informative in some contexts. For example, some reports of research make very much of, say, an obtained r = .7. In model-testing research the accompanying r² = .49 informs us that the X variable is estimated to explain less than half (49%) of the variance in the Y variable. Such a result alerts us to the need to search for additional X variables (multiple correlation) to explain a greater percentage of the variance of Y.

Breaugh (2003) reviewed an example from a newspaper article in which the independent variable was which of two hospitals conducted coronary bypass surgery and the dependent variable was surviving versus not surviving the surgery. In this example it was found that r = .07, so the coefficient of determination was .0049. Therefore, because choice of hospital only related to less than one half of 1% (.0049) of the variance in the survivability variable, one might conclude that choosing between the two hospitals would be of little effect and of little practical importance. However, looking at the data from another perspective, one learns that the mortality rate for the surgery at one of the hospitals was 1.40%, whereas the mortality rate at the other hospital was 3.60%, a mortality rate that is 2.57 times greater.
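The contrast between the tiny coefficient of determination and the large ratio of mortality rates in the hospital example can be illustrated with a short calculation. This is only a sketch under a stated assumption: the text does not report the number of patients at each hospital, so the function below (a hypothetical helper, `phi_equal_groups`) assumes equal group sizes, in which case phi for a 2 x 2 table reduces to the difference between the two proportions divided by twice the square root of p̄(1 − p̄), where p̄ is the mean of the two proportions. It is not a reconstruction of Breaugh's (2003) own computation.

```python
def phi_equal_groups(p1, p2):
    """Phi coefficient for a 2 x 2 table whose two groups are ASSUMED to be of
    equal size; p1 and p2 are the outcome proportions in the two groups."""
    p_bar = (p1 + p2) / 2
    return (p2 - p1) / (2 * (p_bar * (1 - p_bar)) ** 0.5)


# Mortality rates quoted in the text.
mortality_low = 0.0140
mortality_high = 0.0360

phi = phi_equal_groups(mortality_low, mortality_high)
rate_ratio = mortality_high / mortality_low

print(f"phi (equal n assumed) = {phi:.3f}")
print(f"phi squared           = {phi ** 2:.4f}")
print(f"ratio of rates        = {rate_ratio:.2f}")
```

Under the equal-size assumption the correlation comes out close to the reported .07 and its square remains below one half of 1%, while the same two rates differ by a factor of about 2.57.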
Again we observe that it can be very instructive to analyze a set of data from different perspectives. (In chap. 8 we discuss other effect sizes for such data.)

Recall that Rosenthal and Rubin (1982) intended the previously discussed BESD to rectify the perceived problem of undervaluation of a correlational effect size. In the BESD example we discussed a way (however problematic) to look at an r_pb of .20 that increased the apparent practical importance of the finding (Table 4.1). On the other hand, r² in that example is (.20)² = .04, indicating that only 4% of the variance of the dependent variable is related to varying treatment from Treatment a to Treatment b. If 4% of the variability in the dependent variable is determined by variation of the independent variable, then 100% - 4% = 96% of the variability of the dependent variable is not determined by variation of the independent variable. (Thus, 1 - r² is called the coefficient of nondetermination.) Even Cohen's (1988) so-called large effect size of r_pop ≥ .371 results in r² = .138, less than 14% of the variance of the dependent variable being associated with variation of the independent variable when the effect size has attained Cohen's minimum standard for large.

Breaugh (2003) provided additional examples of the underestimation of the practical importance of an effect size that can be caused by incautious or incomplete interpretation of the coefficient of determination. In the 1960s the use of personality variables to predict employee performance began to fall out of favor because the resulting coefficients of determination were generally only about .05. Breaugh (2003) also noted that in early court cases, which involved challenged hiring practices, judges and expert witnesses may have underestimated the relationship between various hiring criteria and job performance based on low values of the coefficient of determination. (In a special issue on sexual harassment the journal Psychology, Public Policy, and Law, Wiener & Gutek, 1997, cited many examples of the use of effect sizes in courts.)

More recently it has been recognized that modest values of the coefficient of determination can be of practical significance. In this regard, Breaugh (2003) cited a 1997 health campaign urging pregnant women not to smoke. This campaign was based on a coefficient of determination equal to about .01 when correlating smoking versus not smoking with newborns' birth weights. Also, consider the correlation between scores on a personnel-selection test and performance on the job (a validity coefficient). A typical validity coefficient of r = .4 results in a coefficient of determination of only .16. However, a validity coefficient of .4 means that for each 1-standard-deviation-unit increase in mean test score that an employer sets as a minimum criterion for hiring, there is an estimated .40 standard deviation unit increase in job performance. Hunter and Schmidt (2004) noted that such an increase can be of substantial economic value to an employer.
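The arithmetic behind the values quoted above is easy to tabulate. The sketch below simply squares the correlations mentioned in this section and reports the coefficient of nondetermination alongside; the function name `determination` and the labels are illustrative choices, not terms from the text.

```python
def determination(r):
    """Return (r squared, 1 - r squared) for a correlation r."""
    r_squared = r ** 2
    return r_squared, 1 - r_squared


# Correlations quoted in this section of the text.
examples = [
    ("BESD example (r_pb)", 0.20),
    ("Cohen's minimum 'large' r_pop", 0.371),
    ("typical validity coefficient", 0.40),
]

for label, r in examples:
    r2, non = determination(r)
    print(f"{label}: r = {r:.3f}, r^2 = {r2:.3f}, 1 - r^2 = {non:.3f}")
```

Reading across a row shows how a correlation that looks moderate on its own scale corresponds to a much smaller proportion of associated variance.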
The fact that each 1-standard-deviation-unit increase in the mean value of X results in an estimated r standard-deviation-unit increase in Y (e.g., an increase of .4 standard deviation units of Y when r = .4) can be explained by recourse to the z score form of the equation for a prediction line: z_Y' = r z_X, where z_Y' is the predicted z score on the Y variable. Recall that z scores are deviation scores in standard deviation units. Therefore, one observes in the equation that the value of r determines the number of standard deviation units of Y the value of Y is predicted to increase for individuals for each standard deviation unit increase in their scores on X (i.e., r is the multiplier).

A second reason for the decreasing use of r² as an estimator of an effect size in some quarters is that, unlike r or r_pb, it is directionless; it cannot be negative. For example, if in gender research men had been assigned the lower of the two numerical codes (e.g., X = 1), when r_pb is positive one knows that men produced the lower mean score on the dependent variable, and when r_pb is negative one knows that men produced the higher mean. However, of course, the square of a positive r and the square of a negative r of the same magnitude are the same value.
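Both points in the preceding paragraphs (r as the multiplier in the z score prediction line, and the loss of sign when r is squared) can be seen in a small sketch. The data and helper names (`z_scores`, `standardized_slope`) are hypothetical, and population standard deviations are used for standardizing as a simplifying assumption; using sample standard deviations for both variables would give the same slope.

```python
from statistics import fmean, pstdev


def z_scores(values):
    """Convert raw scores to deviation scores in standard deviation units."""
    m, s = fmean(values), pstdev(values)
    return [(v - m) / s for v in values]


def standardized_slope(x, y):
    """Least-squares slope for predicting z_Y from z_X; this equals Pearson's r."""
    zx, zy = z_scores(x), z_scores(y)
    return sum(a * b for a, b in zip(zx, zy)) / sum(a * a for a in zx)


# Hypothetical illustrative data (not from the book).
x = [1, 2, 3, 4, 5]
y_up = [2, 1, 4, 3, 5]      # tends to rise with x
y_down = [5, 3, 4, 1, 2]    # tends to fall with x

r_pos = standardized_slope(x, y_up)
r_neg = standardized_slope(x, y_down)

# Predicted z score on Y for someone one standard deviation above the mean of X.
predicted_z = r_pos * 1.0

print(f"r_pos = {r_pos:.2f}, r_neg = {r_neg:.2f}")
print(f"predicted z_Y at z_X = 1: {predicted_z:.2f}")
print(f"squares: {r_pos ** 2:.2f} and {r_neg ** 2:.2f}")
```

The two correlations have opposite signs but identical squares, which is why r² alone cannot tell a reader which group scored higher.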
Therefore, meta-analysts cannot meaningfully average the values of r²_pb from a set of studies in which some yielded negative and some yielded positive values of r_pb. Primary researchers who report r² should always report it together with r or r_pb, both of which can be averaged by meta-analysts. Refer to Hunter and Schmidt (2004) for further discussion.

A third reason for the current disfavor of r² among some researchers is the availability of alternative kinds of measures of effect size that did not exist or were not widely known when r² became popular many decades ago. Those who advocate the use of more robust methods than Pearson's correlation coefficient to measure the relationship between variables (e.g., Wilcox, 2003) would also argue that another reason to avoid the use of the coefficient of determination is that its magnitude can be affected by the previously discussed conditions that can influence the correlation coefficient, such as curvilinearity (not relevant to r_pb) and skew.

Finally, regarding the typically small values of r² outside of the physical sciences, human behavior is multiply determined; that is, there are many genetic and experiential differences among people. Therefore, preexisting genetic and experiential differences among individuals likely often determine much of the variability in the dependent variables that are used in behavioral science and in other "people sciences," often leaving little opportunity for a researcher's single independent variable to contribute a relatively large proportion of the total variability. Consult O'Grady (1982) and Ahadi and Diener (1989) for further discussion. Of course, in more informative factorial designs (see chap. 7) one can vary multiple independent variables to estimate their combined and individual relationships with the scores on the dependent variable. Also, unless one is very unwise in one's choice of independent variables, the multiple correlation, R, between a set of independent variables and a dependent variable will be greater than any of the separate values of r, and the resulting multiple coefficient of determination, R², will be greater than any of the separate values of r².
The current edition of a widely used classic book on multiple correlation, a topic which is not discussed further in this book, is by Cohen, Cohen, West, and Aiken (2002). In this book we discuss other measures of the proportion of explained variance in chapters 6 and 7. Consult Hunter and Schmidt (2004) for an unfavorable view of the coefficient of determination.
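The remark above about the multiple coefficient of determination can be illustrated for the simplest case of two independent variables. The sketch below relies on the standard two-predictor identity R² = (r_y1² + r_y2² - 2 r_y1 r_y2 r_12) / (1 - r_12²), which is not given in the text, and on simulated data whose generating coefficients (0.3 and 0.3) and seed are arbitrary assumptions. In a sample, R² computed this way can never be smaller than either separate r² and is usually larger.

```python
import random
from statistics import fmean


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def multiple_r_squared_two_predictors(y, x1, x2):
    """R squared for predicting y from x1 and x2, from the three pairwise correlations."""
    r_y1 = pearson_r(y, x1)
    r_y2 = pearson_r(y, x2)
    r_12 = pearson_r(x1, x2)
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)


# Simulated, purely illustrative data (arbitrary seed and coefficients).
random.seed(1)
n = 200
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
y = [0.3 * a + 0.3 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

r2_x1 = pearson_r(y, x1) ** 2
r2_x2 = pearson_r(y, x2) ** 2
big_r2 = multiple_r_squared_two_predictors(y, x1, x2)

print(f"r^2 with x1 alone: {r2_x1:.3f}")
print(f"r^2 with x2 alone: {r2_x2:.3f}")
print(f"multiple R^2:      {big_r2:.3f}")
```

Each independent variable on its own accounts for only part of the variance that the pair accounts for jointly, which is the sense in which a single independent variable has limited opportunity to explain multiply determined behavior.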
QUESTIONS
1. Define a truly dichotomous variable.
2. State two possible consequences of dichotomizing a continuous variable.
3. Describe the procedure for setting up a calculation of the r between a qualitative dichotomous variable and a continuous variable.
4. Define point-biserial r, and what is its interpretation in the sample when it is negative and when it is positive?
5. What is the relationship between a two-tailed test of the null hypothesis that states that the point-biserial r in the population is 0 and a two-tailed test of the null hypothesis that states that the two population means are equal?
6. What is the direction of bias of the sample r and point-biserial r, and which two factors influence the magnitude of this bias and in what way does each exert its influence?
7. What would be the focus of researchers who would be interested in a null-counternull interval for r in the population?
8. To which possible value of a parameter such as a population r does a counternull value bring one's attention?
9. Name and describe three distributions that are relevant in the case of a point-biserial r.
10. State three possible consequences, if there is heteroscedasticity, of using software that assumes homoscedasticity when testing the null hypothesis that the population r equals 0.
11. In what circumstance might skew be especially problematic for r, and in what way?
12. Considering the possibility of a difference in the direction of skew in the distributions of the Y variable in Samples a and b, what difference in one's response to Question 11 would there be if the difference in skew also occurs in the two populations?
13. What is the effect of curvilinearity on r?
14. Describe a circumstance (other than sample size) in which an outlier of a given degree of extremeness would have greater influence on the value of r than that same outlier would have in another circumstance.
15. How does the possible reduction of the value of a point-biserial r by an unequal sample size relate to Question 11?
16. Why might it be problematic to compare point-biserial correlations from different experiments that used unequal sample sizes, and what can resolve this problem?
17. Define test-retest unreliability, and what is its effect on a correlation coefficient and on statistical power?
18. What is the relevance of possible differences in the reliabilities of different measures of the dependent variable for comparisons of effect sizes across studies?
19. How can unreliability of the independent variable come about?
20. Define and discuss treatment integrity.
21. List six reasons why the adjustment for unreliability is rarely used.
22. Discuss what the text calls a "philosophical objection" that some researchers have regarding the use of an adjustment for unreliability.
23. Define restricted range, and state how it typically (not always) influences r.
24. How can restricted range occur in a dependent variable?
25. Describe how restricted range might result in an increase in r.
26. What is the usual effect of restricted range on statistical power?
27. What is meant by strength of manipulation, and what is its effect on effect size?
28. What is the justification for and the possible problem with distinguishing between small, medium, and large effect sizes?
29. Provide a possible example, not from the text, of a large effect size that would not be of great practical significance.
30. Why should one not be overly impressed with a reported large effect size of which there has not yet been an attempt at replication?
31. Define a binomial effect size display.
32. How does one find the difference between the two success percentages in a BESD?
33. Discuss three possible limitations of the BESD.
34. Define coefficient of determination.
35. How might the word determination be misinterpreted in the label coefficient of determination?
36. Describe and discuss three reasons for the reduced use of the coefficient of determination in recent years.
37. Discuss why it should not be surprising that coefficients of determination are typically not very large in research involving human behavior (ignoring the issue of squaring for the purpose of this question).
Chapter 5

Effect Size Measures That Go Beyond Comparing Two Centers
THE PROBABILITY OF SUPERIORITY: INDEPENDENT GROUPS

Consider estimating an effect size that would reflect what would happen if one were able to take each score from Population a and compare it to each score from Population b, one at a time, to see which of the two scores is larger, repeating such comparisons until every score from Population a had been compared to every score from Population b. If most of the time in these pairings of a score from Population a and a score from Population b the score from Population a is the higher of the two, this would indicate a tendency for superior performance in Population a, and vice versa, if most of the time the higher score in the pair is the one from Population b. The result of such a method for comparing two populations is a measure of effect size that does not involve comparing the centers of the two distributions, such as means or medians. This effect size is defined as the probability that a randomly sampled member of Population a will have a score (Y_a) that is higher than the score (Y_b) attained by a randomly sampled member of Population b. This definition will become much clearer in the examples that follow.

The expression for the current effect size is Pr(Y_a > Y_b), where Pr stands for probability. This Pr(Y_a > Y_b) measure has no widely used name, although names have been given to its estimators (Grissom, 1994a, 1994b, 1996; Grissom & Kim, 2001; McGraw & Wong, 1992). In the just-cited references Grissom named an estimator of Pr(Y_a > Y_b) the probability of superiority (PS). In this book we will instead use probability of superiority to label Pr(Y_a > Y_b) itself (not an estimator of it), so that we now define it as follows:

PS = Pr(Y_a > Y_b).    (5.1)
The PS measures the stochastic (i.e., probabilistic) superiority of one group's scores over another group's scores. Because the PS is a probability and probabilities range from 0 to 1, the PS ranges from 0 to 1. Therefore, the two most extreme results when comparing Populations a and b would be (a) PS = 0, in which every member of Population a is outscored by every member of Population b; and (b) PS = 1, in which every member of Population a outscores every member of Population b. The least extreme result (no effect of group membership one way or the other) would result in PS = .5, in which members of Populations a and b outscore each other equally often.

A proportion in a sample estimates a probability in a population. For example, if one counts, say, 52 heads results in a sample of 100 (random) tosses of a coin, the proportion of heads in that sample's results is 52/100 = .52, and the estimate of the probability of heads for a population of random tosses of that specific coin would be .52. Similarly, the PS can be estimated from the proportion of times that the n_a participants in Sample a outscore the n_b participants in Sample b in head-to-head comparisons of scores within all possible pairings of the score of a member of one sample with the score of a member of the other sample. The total number of possible such comparisons is given by the product of the two sample sizes, n_a n_b. Therefore, if, say, n_a = n_b = 10 (but sample sizes do not have to be equal), and in 70 of the n_a n_b = 100 comparisons the score from the member of Sample a is greater than the score from the member of Sample b, then the estimate of PS is 70/100 = .70.

For a more detailed but simple example, suppose that Sample a has three members, Persons A, B, and C; and Sample b has three members, Persons D, E, and F. The n_a n_b = 3 × 3 = 9 pairings to observe who has the higher score would be A versus D, A versus E, A versus F, B versus D, B versus E, B versus F, C versus D, C versus E, and C versus F. Suppose that in five of these nine pairings of scores the scores of Persons A, B, and C (Sample a) are greater than the scores of Persons D, E, and F (Sample b), and in the other four pairings Sample b wins. In this example the estimate of PS is 5/9 = .56. Of course, in actual research one would not want to base the estimate on such small samples.

The estimate of PS will be greater than .5 when members of Sample a outscore members of Sample b in more than one half of the pairings, and the estimate will be less than .5 when members of Sample a are outscored by members of Sample b in more than one half of the pairings. When there are ties the simplest solution is to allocate one half of the ties to each group. (There are other methods for handling ties; see Brunner & Munzel, 2000; Fay, 2003; Pratt & Gibbons, 1981; Randles, 2001; Rayner & Best, 2001; Sparks, 1967.) Thus, in this example if members of Sample a had outscored members of Sample b not five but four times in the nine pairings, with one tie, one half of the tie would be awarded as a superior outcome to each sample. Therefore, there would be 4.5 superior outcomes for each sample in the nine pairings of its members with the members of the other sample, and the estimate of PS would, therefore, be 4.5/9 = .5. A measure that is related to the PS but ignores ties (Cliff, 1993) is considered later in this chapter (in Equation 5.5).
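The counting procedure just described can be sketched in a few lines of Python. This is only a minimal illustration, not a substitute for statistical software: the function name `ps_estimate` and the scores assigned to the two three-person samples are hypothetical, chosen so that Sample a wins five of the nine pairings in the first case and wins four with one tie in the second. Ties are handled by the simple half-credit rule described above.

```python
from itertools import product


def ps_estimate(sample_a, sample_b):
    """Proportion of all (a, b) pairings won by Sample a.

    Each tie contributes one half of a win, which is the simple
    tie-handling rule described in the text.
    """
    wins = 0.0
    for y_a, y_b in product(sample_a, sample_b):
        if y_a > y_b:
            wins += 1.0
        elif y_a == y_b:
            wins += 0.5
    return wins / (len(sample_a) * len(sample_b))


# Hypothetical scores for Persons A, B, C (Sample a) and D, E, F (Sample b).
sample_a = [3, 5, 7]
sample_b = [2, 4, 8]          # Sample a wins 5 of the 9 pairings
sample_b_tied = [2, 5, 8]     # Sample a wins 4 pairings and ties 1

print(round(ps_estimate(sample_a, sample_b), 2))
print(ps_estimate(sample_a, sample_b_tied))
```

The double loop visits every one of the n_a n_b pairings, so it is practical only for small or moderate samples.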
The number of times that the scores from one specified sample are higher than the scores from the other sample with which they are paired (i.e., the numerator of the sample proportion that is used to estimate the PS) is called the U statistic (Mann & Whitney, 1947). Recalling that the total number of possible comparisons (pairings) is n_a n_b and using p_{a>b} to denote the sample proportion that estimates the PS, we can now define:

\[ p_{a>b} = \frac{U}{n_a n_b} \tag{5.2} \]
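Equation 5.2 can be checked numerically. The sketch below, which assumes untied scores, obtains U for Sample a in two ways: by counting wins directly, and from the rank sum W of Sample a in the pooled ranking through the standard no-ties identity U = W − n(n + 1)/2, where n is the size of the sample whose ranks were summed. The function name `rank_sum_to_u` and the data are hypothetical.

```python
def rank_sum_to_u(sample_a, sample_b):
    """Return (W_a, U_a, p_a_gt_b) for two lists of untied scores.

    W_a is the sum of Sample a's ranks in the pooled data, U_a is the
    number of pairings won by Sample a, and p_a_gt_b = U_a / (n_a * n_b).
    """
    pooled = sorted(sample_a + sample_b)
    if len(set(pooled)) != len(pooled):
        raise ValueError("this sketch assumes that there are no tied scores")
    rank = {score: position + 1 for position, score in enumerate(pooled)}
    n_a, n_b = len(sample_a), len(sample_b)
    w_a = sum(rank[y] for y in sample_a)
    u_a = w_a - n_a * (n_a + 1) // 2
    return w_a, u_a, u_a / (n_a * n_b)


# Hypothetical scores; the direct count of wins serves as a cross-check.
sample_a = [3, 5, 7]
sample_b = [2, 4, 8]
w_a, u_a, p_a_gt_b = rank_sum_to_u(sample_a, sample_b)
direct_wins = sum(y_a > y_b for y_a in sample_a for y_b in sample_b)
print(w_a, u_a, direct_wins, round(p_a_gt_b, 2))
```

When software reports a rank sum rather than U, it is worth confirming which sample's ranks were summed before applying the identity.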
In other words, in Equation 5.2 the numerator is the number of wins for a specified sample and the denominator is the number of opportunities to win in head-to-head comparisons of each of its members' scores with each of the scores of the other sample's members. The value of U can be calculated manually, but it can be laborious to do so except for very small samples. Although currently major statistical software packages do not calculate p_{a>b}, many do calculate the Mann-Whitney U statistic or the equivalent W_m statistic. If the value of U is obtained through the use of software, one then divides this outputted U by n_a n_b to find the estimator, p_{a>b}. If software provides the equivalent Wilcoxon (1945) W_m rank-sum statistic instead of the U statistic, if there are no ties, find U by calculating U = W_m − [n_s(n_s + 1)]/2, where n_s is the smaller sample size or, if sample sizes are equal, the size of one sample.

Note that Equation 5.2 satisfies the general formula, which was presented in chapter 1, for the relationship between an estimate of effect size (ES_EST) and a test statistic (TS): ES_EST = TS/[f(N)]. In the case of Equation 5.2, ES_EST = p_{a>b}, TS = U, and f(N) = n_a n_b.

Researchers who focus on means and assume normality and homoscedasticity might prefer to use the t test to compare the means and use a standardized-difference effect size. Researchers who do not assume normality and who are interested in a measure of the extent to which the scores in one group are stochastically superior to those in another group will prefer to use the PS or a similar measure. Under homoscedasticity (in this case, equal variability of the overall ranks of the scores in each group) one may use the original Mann-Whitney U test to test H_0: PS = .5 against H_alt: PS ≠ .5. However, the ordinary U test that is usually provided by software is not robust against heteroscedasticity (Delaney & Vargha, 2002; B. P. Murphy, 1976; Pratt, 1964; Zimmerman & Zumbo, 1993). Further discussion of homoscedasticity and discussion of a researcher's choice between comparing means and using the PS is found in the forthcoming section on assumptions.

Consult Wilcox (1996, 1997), Vargha and Delaney (2000), and Delaney and Vargha (2002) for extensive discussions of robust methods for testing H_0: PS = .5. Wilcox (1996) presented a Minitab macro and S-PLUS software functions (Wilcox, 1997) for constructing a confidence interval for the PS based on Fligner and Policello's (1981) heteroscedasticity-adjusted U statistic, U′, and on a method for constructing a confidence interval by Mee (1990) that appears to be fairly accurate. The Fligner-Policello U′ test can be further improved by making a Welch-like adjustment to the degrees of freedom (cf. Delaney & Vargha, 2002). Refer to Vargha and Delaney (2000) for critiques of alternative methods for constructing a confidence interval for the PS, equations for manual calculation, and extension of the PS to comparisons of multiple groups.

Also refer to Brunner and Puri (2001) for extensions of the PS to multiple groups and to factorial designs. (Factorial designs are discussed in chap. 7 of this book.) Brunner and Munzel (2000) presented a further robust method that can be used to test the null hypothesis that PS = .5 and to provide an estimate of the PS and construct a confidence interval for it. This method is applicable when there are ties, heteroscedasticity, or both. Wilcox (2003) provided an accessible discussion of the Brunner-Munzel method and S-PLUS software functions for the calculations in the current case of only two groups and for extension to the case in which groups are taken two at a time from multiple groups. (Wilcox called the PS p or P, and Vargha and Delaney called it A.)
EXAMPLE OF THE PS

Recall from chapters 3 and 4 the example in which the scores of the mothers of schizophrenic children (Sample a) were compared to those of the mothers of normal children (Sample b). We observed from two different perspectives in those chapters that there is a moderately strong relationship between type of mother and the score on a measure of healthy parent-child relationship, as was indicated by the results d = −.77 and r_pb = .40. We now estimate the PS for the data of this example. Because n_a = n_b = 20 in this example, n_a n_b = 20 × 20 = 400. Four hundred is too many pairings for manually calculating U conveniently and with confidence that the calculation will be error free. Therefore, we used software (many kinds of statistical software can do this) to find that U = 103. We can then calculate p_{a>b} = U/n_a n_b = 103/400 = .26. We thus estimate that in the populations there is only a .26 probability that a randomly sampled mother of a schizophrenic child will outscore a randomly sampled mother of a normal child.

Under the assumption of homoscedasticity one can test H_0: PS = .5 using the ordinary U test or equivalent W_m test, one of which is often provided by statistical software packages. Because software reveals a statistically significant U at p < .05 for these data, one can conclude in this case that PS ≠ .5. Specifically, assuming homoscedasticity for the current example, we conclude that the population of schizophrenics' mothers is inferior (as defined by the PS) in its scoring when compared to the population of the normals' mothers (i.e., PS < .5). A researcher
who does not assume homoscedasticity should choose to use one of the alternative methods that can be found in the sources that were cited in the previous section.

Note that we reported p < .05 for our result instead of reporting a specific value for p. There are two reasons why we did this. First, different statistics packages might output different results for the U test (Bergmann, Ludbrook, & Spooren, 2000). Second, we are not confident in specific outputted p values beyond the .05 level for the sample sizes in this example. We provide further discussion of these two issues in the remainder of this section.

As sample sizes increase, the sampling distributions of values of U or W_m approach the normal curve. Therefore, some software that includes the Mann-Whitney U test or the equivalent Wilcoxon W_m test, or some researchers who do the calculations for the test manually, may be basing the critical values needed for statistical significance on what is called a large-sample approximation of these critical values. Because some textbooks do not have tables of critical values for these two statistics or may have tables that lack critical values for the particular sample sizes or for the alpha levels of interest in a particular instance of research, recourse to the widely available table of the normal curve would be very convenient. Unfortunately, the literature is inconsistent in its recommendations about how large samples should be before the convenient normal curve provides a satisfactory approximation to the sampling distributions of these statistics. However, computer simulations by Fahoome (2002) indicated that, if sample sizes are equal, each n = 15 is a satisfactory minimum when testing at the .05 alpha level and each n = 29 is a satisfactory minimum when testing at the .01 level. Also, Fay (2002) provided Fortran 90 programs for use by researchers who need exact critical values for W_m for a wide range of sample sizes and for a wide range of alpha levels.

If sample sizes are sufficient for use of the normal curve for an approximate test and, assuming homoscedasticity, if there are no ties, then one may test the null hypothesis that PS = .5 by using Equation 5.3 to convert U to z:
\[ z = \frac{U - \dfrac{n_a n_b}{2}}{\left[\dfrac{n_a n_b (n_a + n_b + 1)}{12}\right]^{1/2}} \tag{5.3} \]

Reject the null hypothesis at two-tailed level α if the value of |z| exceeds z_{α/2} in a table of the normal curve. Applying the values from the example of the two groups of mothers, we find that |z| = |103 − (20 × 20)/2| / [(20)(20)(20 + 20 + 1)/12]^{1/2} = 2.624. Inspecting a table of the normal curve we find that |z| = 2.624 is a statistically significant result at the p < .05 level, two-tailed. If there are ties, replace the denominator in Equation 5.3 with the tie-adjusted standard error of U, which can be obtained from Equation 9.5 in chapter 9.
A RELATED MEASURE OF EFFECT SIZE

Because the maximum probability or proportion equals 1, the probabilities or proportions of occurrences of all of the possible outcomes of an event must sum to 1. For example, the probability that a toss of a coin will produce either a head or a tail equals 1/2 + 1/2 = 1. Therefore, if there are no ties or ties are allocated equally, then p_{a>b} + p_{a<b} = 1, and p_{a<b} = 1 − p_{a>b}. A related estimator of effect size is the ratio p_{a<b}/p_{a>b}, or the inverse of this ratio. When there is no relationship between the independent variable of membership in either Sample a or Sample b and the overall ranking of the scores on the measure of the dependent variable, p_{a>b} = p_{a<b} = .5 and p_{a<b}/p_{a>b} = .5/.5 = 1. The greater the relationship between the independent variable and the overall ranking of the scores in the samples, the more this ratio moves above 1 when Sample b generally has the higher scoring member (more wins in the head-to-head comparisons) or away from 1 toward 0 when Sample a generally has the higher scoring member.

This ratio estimates an answer to the following question about the two populations. For all pairings of a member of Population a with a member of Population b, how many times more pairings would there be in which a member of Population a scores lower than in which a member of Population b scores lower? When using instead the ratio p_{a>b}/p_{a<b} as an estimator (the inverse of the previous ratio), replace the word lower with the word higher in the preceding question. The two versions of the ratio are related to estimators of a generalized odds ratio, about which there is more discussion in chapter 9.

For an example consider again the data involving the two samples of mothers. Recall that in this example p_{a>b} = .26, so now we find that p_{a<b} = 1 − p_{a>b} = 1 − .26 = .74, and p_{a<b}/p_{a>b} = .74/.26 = 2.8. We are thus estimating that in the populations there would be 2.8 times more pairings in which the schizophrenics' mothers are outscored by the normals' mothers than in which the schizophrenics' mothers outscore the normals' mothers.
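As a small check on the arithmetic of this ratio, the following sketch computes p_{a<b}/p_{a>b} from p_{a>b}, assuming that there are no ties or that ties have been allocated equally. The function name `outscored_ratio` is hypothetical.

```python
def outscored_ratio(p_a_gt_b):
    """Return p_{a<b} / p_{a>b}, where p_{a<b} = 1 - p_{a>b}.

    Assumes no ties, or ties allocated equally to the two samples.
    The ratio is undefined when Sample a never wins (p_{a>b} = 0).
    """
    if not 0 < p_a_gt_b <= 1:
        raise ValueError("p_a_gt_b must be greater than 0 and at most 1")
    return (1 - p_a_gt_b) / p_a_gt_b


print(round(outscored_ratio(0.26), 1))   # the mothers example, .74/.26
print(outscored_ratio(0.5))              # no relationship: ratio of 1
```

The inverse ratio, p_{a>b}/p_{a<b}, is simply `1 / outscored_ratio(p_a_gt_b)` whenever p_{a>b} is less than 1.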
ASSUMPTIONS

The original purpose of the U test was to test if scores in one population are stochastically larger (i.e., likely to be larger) than scores in another population, assuming that both populations have the same, but not necessarily normal, shape (Mann & Whitney, 1947). (This test was later observed to be equivalent to an earlier test by Wilcoxon, 1945, the W_m rank-sum test.) In other words, the purpose of the U test was to test if the score at the ith percentile of Population a is larger than the score at
that same ith percentile in Population b. A percentile that is frequently of interest is the 50th percentile, which is the median (Mdn) in general and also the mean in the case of symmetrical distributions. When using the U test to test H_0: Mdn_a = Mdn_b against the alternative H_alt: Mdn_a ≠ Mdn_b, one is in effect assuming that if treatment or group membership has an effect it will be to add (or subtract) a certain constant number of points, say, k points, to each score in a group's distribution. Adding a constant k to each score in a group shifts its distribution to the right by k points without changing its shape. This concept of an additive effect of treatment is called a shift model, in which a treatment merely always adds (or always subtracts) a constant number of points to what the score of each participant in a group would have been if each of those participants had been in the other group. The independent-groups t test for the difference between two means also assumes a shift model.

The shift model may often not be the most realistic model of the effect of treatments (or group membership) in behavioral, psychological, educational, or medical research. It seems reasonable to assume instead that treatment may often have a varying effect on the individuals in a treated group. In this case a treatment may perhaps increase the scores of all participants by varying amounts, decrease scores of all participants by varying amounts, or increase the scores of some while decreasing the scores of others by varying amounts. In such cases scores are "pulled" to the right and/or to the left by varying amounts. The well-known name for the varying effect of treatment on different individuals is Treatment × Subject interaction. Hunter and Schmidt (2004) provided an extensive discussion of the implications of Treatment × Subject interaction in the context of independent- and dependent-groups designs.

As Delaney and Vargha (2002) pointed out with examples, the shift model, with its usual resulting comparison of means (or medians), is appropriate when one is interested in information about which treatment produces the lower or higher average score, but the PS is appropriate when one is interested in information about which treatment is likely to help the greater number of people. For example, a therapist may be more interested in the latter, whereas a medical insurance company may well be more interested in information about which treatment on average results in the lower cost. Delaney and Vargha (2002) also provided an example, necessarily involving skewed data, in which a sample that has superiority over another sample in terms of the PS actually has a mean that is lower than the mean of the inferior group.

In the case of the U test that uses the actual probability distribution of U instead of the normal approximation, heteroscedasticity can influence the result because the critical values of the test were derived assuming equal shapes of the two populations, an assumption that might be violated by heteroscedasticity. When using the standard normal approximation for the U test, heteroscedasticity might result in an incorrect
estimation of the standard error of U (the denominator of Equation 5.3). This problem for the normal approximation can cause an increase in the rate of Type I error if there is a negative relationship between the variances of the populations and the sample sizes. If instead there is a positive relationship between the variances and the sample sizes, heteroscedasticity might cause a decrease in the power of the test. For further discussion consult Delaney and Vargha (2002), Vargha and Delaney (2000), and Wilcox (1996, 2001, 2003).

Finally, regarding precedence for the underlying ideas that have been presented thus far in this chapter, because the Wilcoxon (1945) W_m test and the Mann-Whitney U test (Mann & Whitney, 1947) are known to be equivalent, the U test is often called the Wilcoxon-Mann-Whitney U test. Nonetheless, the basic ideas can perhaps be traced back to at least 1914 (Kruskal, 1957). Note also that there are other versions of this test (Bergmann, Ludbrook, & Spooren, 2000).

THE COMMON LANGUAGE EFFECT SIZE STATISTIC

The estimates of PS from various studies can be combined in a meta-analysis (Colditz, Miller, & Mosteller, 1988; Mosteller & Chalmers, 1992). However, because raw scores are not typically available to meta-analysts, they cannot calculate values of the estimator p_{a>b} using Equation 5.2. Fortunately, the PS can also be estimated from sample means and variances, assuming normality and homoscedasticity, using a statistic that McGraw and Wong (1992) called the common language effect size statistic, symbolized CL. The CL is based on a z score, Z_CL, where
\[ Z_{CL} = \frac{\bar{Y}_a - \bar{Y}_b}{\left(s_a^2 + s_b^2\right)^{1/2}} \tag{5.4} \]

The proportion of the area under the normal curve that is below Z_CL is the CL statistic that estimates the PS from a study. For example, if a study's Z_CL = +1.00 or −1.00, inspection of a table of the normal curve reveals that the PS would be estimated to be .84 or .16, respectively. For the example that compares the two groups of mothers, using Equation 5.4 and the means and variances that were presented for this study in chapter 3, Z_CL = (2.10 − 3.55)/(2.41 + 3.52)^{1/2} = −.60. Inspecting a table of the normal curve we find that approximately .27 of the area of the normal curve is below z = −.60, so our estimate of PS when the schizophrenics' mothers are Group a is .27. Note that this estimate of .27 for the PS using the CL is close to the estimate .26 that we previously obtained when using p_{a>b}. Refer to Grissom and Kim (2001) for comparisons of the values of the p_{a>b} estimates and the CL estimates applied to sets of real data and for the results of some computer simulations on the effect of heteroscedasticity on the two estimators. For further results of
computer simulations of the robustness of various methods for testing H_0: PS = .5, consult Vargha and Delaney (2000) and Delaney and Vargha (2002). Refer to Dunlap (1999) for software to calculate the CL.
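The CL calculation for the mothers example can be sketched as follows. The function names `normal_cdf` and `common_language_es` are illustrative and are not taken from the software cited above; the cumulative normal area is computed from the standard library's complementary error function rather than read from a table. Because the sketch does not round Z_CL to −.60 before finding the area, its result is slightly above the tabled .27.

```python
import math


def normal_cdf(z):
    """Proportion of the area of the standard normal curve below z."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


def common_language_es(mean_a, mean_b, var_a, var_b):
    """Return (Z_CL, CL) from two sample means and two sample variances.

    Assumes normality and homoscedasticity, as the CL statistic does.
    """
    z_cl = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    return z_cl, normal_cdf(z_cl)


# Means and variances from the example of the two groups of mothers.
z_cl, cl = common_language_es(2.10, 3.55, 2.41, 3.52)
print(round(z_cl, 2), round(cl, 3))

# Z_CL of +1.00 and -1.00 correspond to CL estimates near .84 and .16.
print(round(normal_cdf(1.0), 2), round(normal_cdf(-1.0), 2))
```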
TECHNICAL NOTE 5.1: THE PS AND ITS ESTIMATORS

The PS measures the tendency of scores from Group a to outrank the scores from Group b across all pairings of the scores of the members of each group. Therefore, the PS is an ordinal measure of effect size, reflecting not the absolute magnitudes of the paired scores but the rank order of these paired scores. Although, outside of the physical sciences, one often treats scores as if they were on an interval scale, many of the measures of dependent variables are likely monotonically, but not necessarily linearly, related to the latent variables that they are measuring. In other words, the scores presumably increase and decrease along with the latent variables (i.e., they have the same rank order as the latent variables) but not necessarily to the same degree. Monotonic transformations of the data leave the ordinally oriented PS invariant. Therefore, different measures of the same dependent variable should leave the PS invariant. If a researcher is interested in the tendency of the scores in one group to outrank the scores in another group over all pairings of the two, then use of the PS is reasonable.

Theoretically, p_{a>b} is a consistent and unbiased estimator of the PS, and it has the smallest sampling variance of any unbiased estimator of the PS. (A consistent estimator is one that converges randomly toward the parameter that it is estimating as sample sizes approach infinity.) Also, using p_{a>b} to test H_0: PS = .5 against H_alt: PS ≠ .5, or against a one-tailed alternative, is a consistent test in the sense that the power of such a test approaches 1 as sample sizes approach infinity.

Some readers may question the statement that the CL assumes homoscedasticity because the variance of (Y_a − Y_b) is σ_a² + σ_b² regardless of the values of σ_a² and σ_b². However, it can be shown that the CL strictly estimates the PS only under normality and homoscedasticity and that it is not quite an unbiased estimator of the PS unless it is adjusted (Pratt & Gibbons, 1981). McGraw and Wong (1992), who named the CL, were correct in assuming homoscedasticity. For more discussions of the PS and its estimators consult Lehmann (1975), Laird and Mosteller (1990), Pratt and Gibbons (1981), and Vargha and Delaney (2000). Note that in these sources you will typically find the parameter symbolized in a manner similar to Pr(Y_a > Y_b) with no name attached to it.
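The invariance of the PS under monotonic transformation, noted earlier in this technical note, can be illustrated with a short sketch. The data and the helper `ps_estimate` are hypothetical; the logarithm serves as one example of a strictly increasing transformation. The estimate p_{a>b} is the same before and after the transformation, whereas the difference between the two sample means is not.

```python
import math
from itertools import product
from statistics import mean


def ps_estimate(sample_a, sample_b):
    """Proportion of pairings won by Sample a, with half credit for ties."""
    wins = sum(
        1.0 if y_a > y_b else 0.5 if y_a == y_b else 0.0
        for y_a, y_b in product(sample_a, sample_b)
    )
    return wins / (len(sample_a) * len(sample_b))


# Hypothetical positive scores, with one extreme score in Sample a.
sample_a = [1.0, 4.0, 9.0, 100.0]
sample_b = [2.0, 3.0, 5.0, 25.0]
log_a = [math.log(y) for y in sample_a]
log_b = [math.log(y) for y in sample_b]

p_raw = ps_estimate(sample_a, sample_b)
p_log = ps_estimate(log_a, log_b)
diff_raw = mean(sample_a) - mean(sample_b)
diff_log = mean(log_a) - mean(log_b)

print(p_raw, p_log)                            # identical estimates of the PS
print(round(diff_raw, 2), round(diff_log, 2))  # mean differences are not
```

Any other strictly increasing transformation (e.g., a square root of nonnegative scores) would leave the rank order of the pooled scores, and therefore p_{a>b}, unchanged in the same way.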
INTRODUCTION TO OVERLAP

Measures of effect size can be related to the relative positions of the distributions of Populations a and b. When there is no effect, Δ = 0, r_pop = 0, and PS = .5. In this case, if assumptions are satisfied, Distribution a and Distribution b completely overlap. When there is a maximum effect, Δ is at its maximum negative or positive value for the data, r_pop = +1 or -1, and PS = 0 or 1 depending on whether it is Population b or Population a, respectively, that is superior in all of the comparisons within the paired scores. In this case of maximum effect there is no overlap of the two distributions; even the lowest score in the higher scoring group is higher than the highest score in the lower scoring group. Intermediate values of effect size result in less extreme amounts of overlap than in the two previous cases. Recall the example in chapter 3 in which Fig. 3.1 depicted the mean of the treated population's distribution shifting 1 σ_y unit to the right of the mean of the control population's distribution when Δ = +1.
THE DOMINANCE MEASURE
Cliff (1993) discussed a variation on the PS concept that avoids dealing with ties by considering only those pairings in which Y_a > Y_b or Y_b > Y_a. We call this measure the dominance measure of effect size (DM) here because Cliff (1993) called its estimator the dominance statistic, which we denote by ds. This measure is defined as

DM = Pr(Y_a > Y_b) - Pr(Y_b > Y_a),    (5.5)

and its estimator, ds, is given by

ds = p_{a>b} - p_{b>a}.    (5.6)

Here the p values are, as before, given by U/(n_a n_b) for each group, except for including in each group's U only the number of wins in the n_a n_b pairings of scores from Groups a and b, with no allocation of any ties. For example, suppose that n_a = n_b = 10, and of the 10 x 10 = 100 pairings Group a has the higher of the two paired scores 50 times, Group b has the higher score 40 times, and there are 10 ties within the paired scores. In this case, p_{a>b} = 50/100 = .5, p_{b>a} = 40/100 = .4; therefore, the estimate of the DM is .5 - .4 = +.1, suggesting a slight superiority of Group a. Because, as probabilities, both Pr values can range from 0 to 1, DM ranges from 0 - 1 = -1 to 1 - 0 = +1. When DM = -1 the populations' distributions do not overlap, with all of the scores from Group a being below all of the scores from Group b, and vice versa when DM = +1. For values of the DM between the two extremes of -1 and +1, there is intermediate overlap. When there is an equal number of wins for Groups a and b in their pairings, p_{a>b} = p_{b>a} = .5 and the estimate of the DM is .5 - .5 = 0. In this case there is no effect and complete overlap.
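The counting behind the dominance statistic can be sketched in a few lines. This is an illustrative sketch only: the function name `dominance_statistic` and the data are hypothetical, and it does not reproduce Cliff's (1993) inferential methods (significance tests or confidence intervals).

```python
from itertools import product


def dominance_statistic(group_a, group_b):
    """Return (p_a_gt_b, p_b_gt_a, ds) over all n_a * n_b pairings.

    Ties are counted as a win for neither group, as described in the text.
    """
    pairs = list(product(group_a, group_b))
    wins_a = sum(1 for ya, yb in pairs if ya > yb)
    wins_b = sum(1 for ya, yb in pairs if yb > ya)
    p_a_gt_b = wins_a / len(pairs)
    p_b_gt_a = wins_b / len(pairs)
    return p_a_gt_b, p_b_gt_a, p_a_gt_b - p_b_gt_a


# Hypothetical scores, chosen only so the counts are easy to verify by hand.
group_a = [3, 5, 7]
group_b = [2, 5, 6]
p_a, p_b, ds = dominance_statistic(group_a, group_b)
print(p_a, p_b, ds)
```

Of the 3 x 3 = 9 pairings here, Group a wins 5, Group b wins 3, and 1 is tied, so ds = 5/9 - 3/9 = 2/9, or about +.22.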
Refer to Cliff (1993) for discussions of significance testing and construction of confidence intervals for the DM for the independent-groups and the dependent-groups cases, and for software to undertake the calculations. Also refer to Vargha and Delaney (2000) for further discussion. Wilcox (2003) provided S-PLUS software functions for Cliff's (1996) robust method for constructing a confidence interval for the DM for the case of only two groups and for the case of groups taken two at a time from multiple groups. Preliminary findings by Wilcox (2003) indicated that Cliff's (1993) method provides good control of Type I error even when there are many tied values, a situation that may be problematic for competing methods. Many ties are likely when there are relatively few possible values for the dependent variable, such as is the case for rating-scale data as discussed in chapter 9. An example of the DM is presented in chapter 9 along with more discussion.

COHEN'S U3
If assumptions of normality and homoscedasticity are satisfied and if populations are of equal size (as they always are in experimental research), one can estimate the percentage of nonoverlap of the distributions of Populations a and b. One of the methods uses as an estimate of nonoverlap the percentage of the members of the higher scoring sample who score above the median (which is the same as the mean when normality is satisfied) of the lower scoring sample. We observed with regard to Fig. 3.1 of chapter 3 that when Δ = +1, the mean of the higher scoring population lies 1 σ_y unit above the mean of the lower scoring population. Because, under normality, 50% of the scores are at or below the mean and approximately 34% of the scores lie between the mean and 1 σ_y unit above the mean (i.e., z = +1), when Δ = +1 we infer that approximately 50% + 34% = 84% of the scores of the superior group exceed the median of the comparison group. Cohen (1988) denoted this percentage as a measure of effect size, U3, to contrast it with his related measures, U1 and U2, which we do not discuss here.

When there is no effect we have observed that Δ = 0, r_pop = 0, and the PS = .5, and now we note that U3 = 50%. In this case 50% of the scores from Population a are at or above the median of the scores from Population b, but, of course, so too are 50% of the scores from Population b at or above its median; there is complete overlap (0% nonoverlap). As Δ increases above 0, U3 approaches 100%. For example, if Δ = +3.4, then U3 > 99.95%, with nearly all of the scores from Population a being above the median of Population b.

In research that is intended to improve scores compared to a control, placebo, or standard-treatment group, a case of successful treatment is sometimes defined (but not always justifiably so) as any score that exceeds the median of the comparison group. Then, the percentage of the scores from the treated group that exceed the median score of the comparison group is called the success percentage of the treatment. When assumptions are satisfied the success percentage is, by definition, U3. For further discussions consult Lipsey (2000) and Lipsey and Wilson (2001). For a more complex but robust approach to an overlap measure of effect size that does not assume normality or homoscedasticity, refer to Hess, Olejnik, and Huberty (2001).

RELATIONSHIPS AMONG MEASURES OF EFFECT SIZE

Although Cohen's (1988) use of the letter U is apparently merely coincidental to the Mann-Whitney U statistic, when assumptions are met, there is a relationship between U3 and the PS. Indeed, many of the measures of effect size that are discussed in this book are related when assumptions are met. Numerous approximately equivalent values among many measures can be found by combining the information that is in tables presented by Rosenthal et al. (2000, pp. 16-21), Lipsey and Wilson (2001, p. 153), Cohen (1988, p. 22), and Grissom (1994a, p. 315). Table 5.1 presents an abbreviated set of approximate relationships among measures of effect size. The values in Table 5.1 are more accurate the more nearly normality, homoscedasticity, and equality of sample sizes are satisfied, and the larger the sample sizes.

In chapter 4 we discussed Cohen's (1988) admittedly rough criteria for small, medium, and large effect sizes in terms of values of Δ and values of r_pop. Due to the relationships among many measures of effect size, we can now also apply Cohen's criteria to the PS and U3. Categorized as small effect sizes (Δ ≤ .20, r_pop ≤ .10) would be PS ≤ .56 and U3 ≤ 57.9%. Medium values (Δ = .50, r_pop = .243) would be PS = .638 and U3 = 69.1%. Large values (Δ ≥ .8, r_pop ≥ .371) would be PS ≥ .714 and U3 ≥ 78.8%.

TABLE 5.1
Approximate Relationships Among Some Measures of Effect Size

  Δ      r_pop     PS      U3 (%)
  0      .000     .500      50.0
  .1     .050     .528      54.0
  .2     .100     .556      57.9
  .3     .148     .584      61.8
  .4     .196     .611      65.5
  .5     .243     .638      69.1
  .6     .287     .664      72.6
  .7     .330     .690      75.8
  .8     .371     .714      78.8
  .9     .410     .738      81.6
 1.0     .447     .760      84.1
 1.5     .600     .856      93.3
 2.0     .707     .921      97.7
 2.5     .781     .962      99.4
 3.0     .832     .983      99.9
 3.4     .862     .992    >99.95
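The approximate equivalences in Table 5.1 can be checked with a short sketch. The chapter does not state the conversion formulas, so the ones used below are an assumption: they are the usual normal-theory relations for two normal, homoscedastic, equal-sized populations, namely r_pop = Δ / sqrt(Δ² + 4), PS = Φ(Δ / sqrt(2)), and U3 = 100Φ(Δ), where Φ is the standard normal cumulative distribution function. The function name `effect_size_equivalents` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def effect_size_equivalents(delta):
    """Return (r_pop, PS, U3 in percent) implied by a standardized difference.

    Assumes normality, homoscedasticity, and equal population sizes.
    """
    phi = NormalDist().cdf               # standard normal CDF
    r_pop = delta / sqrt(delta ** 2 + 4)
    ps = phi(delta / sqrt(2))
    u3 = 100 * phi(delta)
    return r_pop, ps, u3


# Reprint a few rows for comparison with the tabled values.
for delta in (0.2, 0.5, 0.8, 1.0):
    r_pop, ps, u3 = effect_size_equivalents(delta)
    print(f"{delta:.1f}  {r_pop:.3f}  {ps:.3f}  {u3:.1f}")
```

Under these assumptions the rows for Δ = .5 and Δ = .8 agree with Cohen's medium and large benchmarks as restated above (PS = .638, U3 = 69.1% and PS = .714, U3 = 78.8%).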
APPLICATION TO CULTURAL EFFECT SIZE

Three of the measures of effect size that have been discussed thus far in this book have been applied to the comparison of two cultures (Matsumoto, Grissom, & Dinnel, 2001). Among many other differences between participants in the United States (n_US = 182) and in Japan (n_JP = 161) that had been reported in a previous study (Kleinknecht, Dinnel, Kleinknecht, Hiruma, & Hirada, 1997), the Japanese had statistically significantly higher mean scores than the US participants on a scale of Embarrassability, t(341) = 4.33, p < .001; a scale of Social Anxiety, t(341) = 2.96, p < .01; and a scale of Social Interaction Anxiety, t(341) = 3.713, p < .001. To demonstrate that statistically significant differences, or even so-called "highly" statistically significant differences, do not necessarily translate to very large, or even large, effects of culture (cultural effect size), Matsumoto et al. (2001) estimated a standardized-difference effect size (Hedges' g_pop of chap. 3), r_pop, and the PS for these results. The PS was estimated by p_{a>b} using Equation 5.2. Table 5.2 displays the results. Values of U3 are not included in Table 5.2 because U3 assumes populations of equal size, a condition that is not met by the United States and Japan.

TABLE 5.2
Cultural Effect Size Estimates When Comparing the United States and Japan

Scale                         Mean (US)   Mean (JP)   p level     g     r_pb   p_{a>b}
Embarrassability               108.80      112.27     < .001    -.16    .08     .46
Social anxiety                  83.65       93.50     < .01     -.34    .17     .41
Social interaction anxiety      26.36       31.50     < .001    -.41    .20     .38

Note. Adapted from "Do between-culture differences really mean that people are different? A look at some measures of cultural effect size," by D. Matsumoto, R. J. Grissom, and D. L. Dinnel, 2001, Journal of Cross-Cultural Psychology, 32 (No. 4), 478-490, p. 486. Copyright © 2001 by Sage Publications. Adapted with permission of Sage Publications.

Observe that the values in the last column are all below .5, suggesting that the members of Group a (USA) would tend to be outscored by the members of Group b (Japan) in paired comparisons of members of the two groups. Recall that when the PS is based on Pr(Y_a > Y_b) instead of the equally applicable Pr(Y_b > Y_a), if the members of Group a tend to be outscored by the members of Group b, then the PS gets smaller as the effect gets larger. Thus, the greater the effect, the more the value of this PS departs upward from .5 when Group a is superior and downward from .5 when Group b is superior.

Observe in Table 5.2 that, although the estimates of effect size for the two anxiety scales are between Cohen's (1988) criteria for small and medium effect sizes, the large sample sizes (182 and 161) have elevated the cultural mean differences to what some would call highly or very highly statistically significant differences on the basis of the impressively small p values. Moreover, although the cultural difference for Embarrassability might be considered by some to be highly statistically significant, the effect sizes are only in the category of small effects. Thus, it is possible for a cultural (or gender) stereotype that is based on a statistically significant difference actually to translate to a small effect of culture (or gender). Even a somewhat valid (statistically) stereotype may actually not apply to a large percentage of the stereotyped group and, therefore, may not be of much practical use, such as in the training of diplomats. Worse, of course, some stereotypes can do much personal and social harm.
TECHNICAL NOTE 5.2: ESTIMATING EFFECT SIZES THROUGHOUT A DISTRIBUTION

Traditional measures of effect size might be insufficiently informative or even misleading when there is heteroscedasticity, nonhomomerity, or both. Nonhomomerity means inequality of shapes of the distributions. For example, suppose that a treatment causes some participants to score higher and some to score lower than they would have scored if they had been in the comparison group. In this case the treated group's variability will increase or decrease depending on whether it was the higher or lower scoring participants whose scores were increased or decreased by the treatment. However, although variability has been changed by the treatment in this example, the two groups' means and/or medians might remain nearly the same (which is possible but much less likely than the example that is presented in the next paragraph). In this case, if we estimate an effect size with the difference between the two sample means or Mdn_a - Mdn_b in the numerator, the estimate might be a value that is not far from zero although the treatment may have had a moderate or large effect on the tails even if there is not much of an effect on the center of the treated group's distribution. The effect on variability may have resulted from the treatment having "pulled" tails outward or having "pushed" tails inward.

In another case, the treatment may have an effect throughout a distribution, changing both the center and the tails of the treated group's distribution. In fact, it is common for the group with the higher mean also to have the greater variability. In this case, if we now consider a combined distribution that contains all of the scores of the treated and comparison groups, the proportions of the treated group's scores among the overall high scores and among the overall low scores can be different from what would be implied by an estimate of Δ or U3. Hedges and Nowell (1995) provided a specific example. In this example, if Δ = +.3, distributions are normal, and the variance of the treated population's scores is only 15% greater than the variance of the comparison population's scores, one would find approximately 2.5 times more treated participants' scores than comparison participants' scores in the top 5% of the combined distribution. For more discussion and examples consult Feingold (1992, 1995) and O'Brien (1988). Note that the kinds of results that have just been discussed can occur even under homoscedasticity if there is nonhomomerity. To deal with the possibility of treatment effects that are not restricted to the centers of distributions, other measures of effect size have been proposed, such as the measures that are briefly introduced in the next two sections.

Hedges-Friedman Method

Informative methods have been proposed for measuring effect size at places along a distribution in addition to its center. Such methods are necessarily more complex than the usual methods, so they have not been widely used. For example, Hedges and Friedman (1993), assuming normality, recommended the use of a standardized-difference effect size, Δ_α, at a portion of a tail beyond a fixed value, Y_α, in a distribution of the combined scores from Populations a and b. The subscript alpha indicates that Y_α is the score at the 100α percentile point of the combined distribution, and the value of alpha is chosen by the researcher according to which portion of the combined distribution is of interest. For example, if α = .25, then Y_α is the score that has 100(.25)% = 25% of the scores above it. One can then define

Δ_α = (μ_αa - μ_αb) / σ_α,    (5.7)

where μ_αa and μ_αb are the means of just those scores from Populations a and b, respectively, that are higher than Y_α, and σ_α is the standard deviation of those scores in the combined distribution that are higher than Y_α. Again, the value of Y_α is selected by the researcher as the score in the combined distribution that has c% of the scores above it. Computations of the estimates of the various Δ_α are repeated for those values of c that are of interest to the researcher. Extensive computational details can be found in the appendix of Hedges and Friedman (1993).
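A plain sample analogue of this tail effect size can be sketched as follows: find the cutoff that leaves a chosen proportion of the combined scores above it, then divide the difference between the two groups' means above the cutoff by the standard deviation of all combined scores above the cutoff. This is an illustrative assumption only. It is not the estimator developed in the appendix of Hedges and Friedman (1993), which relies on normal theory, and the function name `tail_effect_size`, the use of the sample (n - 1) standard deviation, and the data are all hypothetical.

```python
from statistics import mean, stdev


def tail_effect_size(group_a, group_b, alpha):
    """Naive sample analogue of a standardized difference in the upper tail.

    alpha is the proportion of the combined scores that lie above the cutoff.
    """
    combined = sorted(list(group_a) + list(group_b))
    n_above = int(alpha * len(combined))      # scores to leave above the cutoff
    if not 0 < n_above < len(combined):
        raise ValueError("alpha leaves no usable tail for these sample sizes")
    cutoff = combined[-n_above - 1]           # highest score not in the tail
    tail_a = [y for y in group_a if y > cutoff]
    tail_b = [y for y in group_b if y > cutoff]
    if not tail_a or not tail_b:
        raise ValueError("each group needs at least one score above the cutoff")
    # Sample SD of the combined tail scores (an assumed choice of estimator).
    return (mean(tail_a) - mean(tail_b)) / stdev(tail_a + tail_b)


# Hypothetical scores for illustration.
group_a = [4, 6, 8, 10, 12, 14]
group_b = [3, 5, 7, 9, 11, 13]
estimate = tail_effect_size(group_a, group_b, alpha=0.5)
print(round(estimate, 3))
```

With α = .5 the cutoff in these data is 8, the tail means are 12 and 11, and the combined tail scores have a sample variance of 3.5, so the estimate is 1 / sqrt(3.5), or about .53. Repeating the call for several values of alpha mirrors the repeated computations described above.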
Shift-Function Method

Doksum (1977) presented a graphical method for comparing two groups not only at the centers of their distributions but, more informatively, at various quantiles. Recall from chapter 1 that a quantile can be roughly defined as a score that is equal to or greater than a specified proportion of the scores in a distribution. Recall also that the median is at the .50 quantile, which, if one divides a distribution into successive fourths (called quartiles), can also be said to be at the second quartile. If one divides a distribution into successive tenths, the quantiles are called deciles. The median is at the fifth decile.

Doksum's (1977) method involves a series of shift functions, each shift function indicating how far the comparison sample's scores have to be moved (shifted) to reach the scores of the treated sample at a quantile of interest to the researcher. The method results in a graph of shift functions. In such a graph quantiles of the comparison sample's scores at their various qth quantile values, Y_qc, are plotted against the differences between the values of Y_qc and Y_qt, which is the score of a treated participant at the treated sample's qth quantile. (A subscripted letter c refers to the comparison group and a subscripted letter t refers to the treated group.) Each shift function in this graph is thus given by Y_qt - Y_qc. The graph of shift functions describes whether a treatment becomes more or less effective as one observes along the comparison sample's distribution from its lower scoring to its higher scoring members.

For more detailed discussions consult Doksum (1977) and Wilcox (1995, 1996, 1997, 2003). Wilcox (1996) provided a Minitab macro for estimating shift functions and another for constructing a confidence interval for the difference between the two populations' deciles at any of the deciles throughout the comparison group's distribution. Wilcox (1997) also provided S-PLUS software functions for making robust inferences about shift functions and for constructing robust simultaneous confidence intervals for them. With regard to simultaneous confidence intervals, the confidence level, say .95, refers to one's level of confidence in the full set of intervals taken together, not separately. Thus, a 95% simultaneous confidence interval means that it is estimated that 95% of the time all of the involved intervals would contain the actual difference between the two populations' deciles.
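The points of a shift-function graph can be sketched by computing matching deciles in each sample and differencing them. The sketch below is only an illustration: it uses the default quantile rule of Python's `statistics.quantiles`, which is an assumption and not necessarily the quantile estimator used by Doksum (1977) or in Wilcox's software, and it produces no confidence intervals. The function name `decile_shifts` and the data are hypothetical.

```python
from statistics import quantiles


def decile_shifts(comparison, treated):
    """Return (Y_qc, Y_qt - Y_qc) pairs at the nine deciles.

    Y_qc goes on the horizontal axis and the shift on the vertical axis.
    """
    q_comparison = quantiles(comparison, n=10)   # nine cut points
    q_treated = quantiles(treated, n=10)
    return [(yc, yt - yc) for yc, yt in zip(q_comparison, q_treated)]


# Hypothetical data: every treated score is the matching comparison score + 2,
# so a pure shift model holds and the shift is constant across deciles.
comparison = list(range(1, 21))
treated = [y + 2 for y in comparison]
shifts = decile_shifts(comparison, treated)
for y_qc, shift in shifts:
    print(y_qc, shift)
```

A flat set of shifts, as in these constructed data, is what a simple shift model implies. Shifts that grow or shrink across deciles would suggest that the treatment becomes more or less effective along the comparison distribution.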
Other Graphical Estimators of Effect Sizes

It would be beyond the scope of this book to provide detailed discussions of additional graphical methods for estimating effect sizes at various points along a distribution. Such methods include the Wilk and Gnanadesikan (1968) percentile comparison graph and the Tukey sum-difference graph (Cleveland, 1985, 1988). The percentile comparison graph plots percentiles from one group's distribution against the same percentiles from the other group's distribution. Cleveland (1985) demonstrated the use of the percentile comparison graph for the cases of equal and unequal sample sizes. (When sample sizes are equal one only need plot the ordered raw scores from one group against the ordered raw scores from the other group.) A linear relationship between the two sets of percentiles or ordered raw scores would be consistent with the shift model that we previously discussed, and this would thus help justify the use of effect sizes that compare means (or medians). On the other hand, a nonlinear relationship would further justify the use of what we called the probability of superiority (PS). Consult Cleveland (1985) for discussion of how the Tukey sum-difference graph can shed further light on the appropriateness of the shift model.

Darlington (1973) presented an ordinal dominance curve for depicting the ordinal relationship between two sets of data, a graph that is similar to the percentile comparison graph. The proportion of the total area under the ordinal dominance curve corresponds to an estimate of the PS. This estimate can readily be made by inspection of the ordinal dominance curve as described by Darlington (1973), who also demonstrated other uses of the curve for comparing two groups.

The simplest example of graphic comparison of distributions is the depiction of two or more boxplots within the same figure for easy comparison. As mentioned in chapter 1, statistical software packages that produce such comparisons include Minitab, SAS, SPSS, STATA, and SYSTAT. However, simplicity sometimes comes at a price because more complex methods can be more informative. Trenkler (2002) presented a more complex boxplot method (quantile-boxplot) for comparing two or more distributions. Discussion of other complex methods can be found in Silverman (1986) and Izenman (1991).

DEPENDENT GROUPS
probability of superiority, PS, PS, as previously defined defined and and estimated in The probability not applicable to the dependent-groups design. design. In this case this chapter is not effect size that we label PS PS dep one can instead define and estimate a similar effect dep; ( 5 . 8)
individual under Condition b and and Y the where Y Yiib Yia ia is the b is the score of an individual matched) individual under Condition score of that same (or a related or matched) a. We We use the repeated-measures (i.e., same individual) case for the re remainder of this section. mainder The PS . 8 is the probability probability that within a PSdedepp as defined defined in Equation 55.8 randomly sampled pair of dependent scores (e.g., two two scores from from the randomly same participant participant under two different different conditions), the score obtained un unCondition b will be greater greater than the score obtained der Condition obtained under Condition a. Note the difference difference between the previously presented definition of the PS and and the the definition of the the PSde PSd p .". In the the case of the the PSde PS deplZ one is estimating estimating PS an effect effect size that would arise if, for each member of the sampled popu popuan member'ss score under under Condition Condition b to that lation, one could compare compare a member' member's score under under Condition a to observe which which is greater. greater. same member's estimating PSde PSde by making, for each par parTo be concrete, one begins estimating ticipant in the sample, such comparisons �ass comparing Jane Jones'
EFFECT SIZE SIZE MEASURES EFFECT MEASURES
.rlII/IJ=
1 15 115
score under Condition Condition b with Jane Jones' score under Condition a. The estimate PSdep is the the proportion comproportion of all such within-participant com estimate of PSde participant' s score under under Condition b is greater parisons which a participant's parisons in which than that participant' participant'ss score score under Condition a. Ties are ignored in this than method. For example, if there are are nn = 1100 whom 60 60 00 participants of whom score higher under Condition b than es than they do under Condition Condition a, the esPS, p is Ppde = 60/100 60/100 = = .60. .60. In the the example that follows follows we timate of PSde depp = define as a wm win for Condition b each instance in which a participant define scores higher under Condition b than under Condition a. We We use the letter w w for the total number of such wins for Condition b throughout throughout letter Therefore, the n comparisons. Therefore,
Pdep
=
win .
(5.9)
An example should make estimation of $PS_{dep}$ very clear. Recall the data of Table 2.1 in chapter 2 in which the weights of $n = 17$ anorectic girls are shown posttreatment ($Y_{ib}$) and pretreatment ($Y_{ia}$). Observe in Table 2.1 that 13 of the 17 girls weighed more posttreatment than they did pretreatment, so the number of wins for posttreatment weight is $w = 13$. (The four exceptions to weight gain were Participants 6, 7, 10, and 11; there were no tied posttreatment and pretreatment weights.) Therefore, $p_{dep} = w/n = 13/17 = .76$. We thus estimate that for a randomly sampled member of a population of anorectic girls, of whom these 17 girls would be representative, there is a .76 probability of weight gain from pretreatment to posttreatment. Causal attribution of the weight gain to the effect of the specific treatment is subject to the limitations of the pretest-posttest design that were discussed in the last section of chapter 2.

Manual calculation of a confidence interval for $PS_{dep}$ is easiest in the extreme cases in which $w = 0$, $1$, or $n - 1$ (Wilcox, 1997). Somewhat more laborious manual calculation is also possible for all other values of $w$ by following the steps provided by Wilcox (1997) for Pratt's (1968) method. Wilcox (1997), who called $PS_{dep}$ simply $p$, also provided an S-PLUS software function for computing a confidence interval for $PS_{dep}$ for any value of $w$.

Hand (1992) discussed circumstances in which the $PS$ may not be the best measure of the probability that a certain treatment will be better than another treatment for a future treated individual and how the $PS_{dep}$ can be ideal for this purpose. Refer to Vargha and Delaney (2000) for further discussion of application of the $PS$ to the case of two dependent groups, and consult Brunner and Puri (2001) for extension to multiple groups and factorial designs. Note again that Hand (1992) and others do not use our $PS$ and $PS_{dep}$ notation. Authors vary in their notation for these probabilities.
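The point estimate in the anorexia example can be sketched in a few lines of Python. This is only an illustration of the ratio $w/n$ under the assumption of no tied pairs; the function name `ps_dep_estimate` is hypothetical and is not taken from Wilcox's software.

```python
def ps_dep_estimate(wins: int, n_pairs: int) -> float:
    """Estimate PS_dep as the proportion of pairs won by one condition.

    Assumes there are no tied pairs, as in the example in the text.
    """
    if n_pairs <= 0:
        raise ValueError("n_pairs must be positive")
    if not 0 <= wins <= n_pairs:
        raise ValueError("wins must lie between 0 and n_pairs")
    return wins / n_pairs


# Values from the text: 13 of the 17 girls weighed more posttreatment.
p_dep = ps_dep_estimate(wins=13, n_pairs=17)
print(f"p_dep = {p_dep:.2f}")
```

The sketch gives only the point estimate; it does not attempt the confidence interval methods cited above.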
QUESTIONS

1. Define the probability of superiority for independent groups.
2. Interpret $PS = 0$, $PS = .5$, and $PS = 1$.
3. What is the meaning of the numerator in Equation 5.2, and what is the meaning of the denominator there?
4. What is the focus of researchers who prefer to use a $t$ test and to estimate a standardized difference between means, and what is the focus of researchers who prefer to use the $U$ test and estimate a $PS$?
5. What is the nature of a large-sample approximation for the $U$ test?
6. What was the original purpose of the $U$ test?
7. What is a shift model, and why might this model be unrealistic in many cases of behavioral research?
8. When might a shift model be more appropriate, and when might the $PS$ be more appropriate?
9. What is the effect of heteroscedasticity on the $U$ test and on the usual normal approximation for the $U$ test?
10. What is the common language effect size statistic?
11. What is a major implication of the existence of a monotonic, but not necessarily linear, relationship between a measure of a dependent variable and a latent variable that it is measuring in behavioral science?
12. Identify two assumptions of the common language effect size statistic.
13. If assumptions are satisfied, describe the extent of overlap between the two distributions when $PS = 0$, $PS = .5$, and $PS = 1$.
14. Define and discuss the purpose of the dominance measure of effect size.
15. Define Cohen's $U_3$, and list three requirements for its appropriate use.
16. Discuss the relationship between $U_3$ and the success percentage.
17. Describe ways in which traditional measures of effect size can be misleading when there is inequality of the variances or shapes of distributions for the two groups.
18. Define the probability of superiority in the case of dependent groups, and describe the procedure for estimating it.
Chapter 6
Effect Sizes for One-Way ANOVA Designs
INTRODUCTION
The discussions in this and in the next chapter assume the fixed-effects model, in which the two or more levels of the independent variable that are being compared are all of the possible variations of the independent variable (e.g., female and male), or have been specifically chosen by the researcher to represent only those variations to which the results are to be generalized. For example, if ethnicity were the independent variable and there were, say, a white group and two specifically chosen nonwhite groups, the fixed-effects model is operative and the results should not be generalized to any nonwhite group that was not represented in the research. Methods for dependent groups are discussed in the last section of this chapter.

Note that the ANOVA $F$ test assumes normality and homoscedasticity and that its statistical power and the accuracy of its obtained $p$ levels can be reduced by violation of these assumptions. Consult Grissom (2000) and Wilcox (2003) for further discussion. Wilcox (2003) provided S-PLUS software functions for robust alternatives to the traditional ANOVA $F$ test for both the independent- and the dependent-groups cases. Wilcox and Keselman (2003a) further discussed robust ANOVA methods and software packages (SAS, S-PLUS, and R) for implementing them. We address the assumptions throughout this chapter.

ANOVA RESULTS FOR THIS CHAPTER

For worked examples of the estimators of effect sizes that are presented in this chapter, we use ANOVA results from an unpublished study in which the levels of the independent variable were five methods of presentation of material to be learned and the dependent variable was the recall scores for that material (Wright, 1946; cited in McNemar, 1962). This study preceded the time when it was common for researchers to estimate effect size to complement an ANOVA. Nonstatistical details about this research do not concern us here. What one needs to know for the calculations in this chapter is presented in Table 6.1.

A STANDARDIZED-DIFFERENCE MEASURE OF OVERALL EFFECT SIZE

The simplest measure of the overall effect size is given by
$$g_{mmpop} = \frac{\mu_{\max} - \mu_{\min}}{\sigma}, \tag{6.1}$$
where $\mu_{\max}$ and $\mu_{\min}$ represent the highest and the lowest population means from the sampled populations, respectively, and $\sigma$ is the assumed common standard deviation within the populations, which is estimated by $MS_w^{1/2}$, where $MS_w$ is obtained from the software output for the $F$ test or calculated using a variation of the formula for pooling separate variances,
$$MS_w = \frac{(n_1 - 1)s_1^2 + \cdots + (n_k - 1)s_k^2}{N - k}. \tag{6.2}$$
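As a rough illustration of Equation 6.2, the following Python sketch pools the group variances using the sample sizes and standard deviations reported in Table 6.1. The function name `pooled_ms_within` is hypothetical. Because the tabled standard deviations are rounded to two decimals, the result should be expected to agree with the tabled $MS_w = 9.53$ only approximately.

```python
def pooled_ms_within(ns, sds):
    """Pool k sample variances into MS_w (Equation 6.2).

    Each variance is weighted by its degrees of freedom, n_i - 1,
    and the sum is divided by N - k.
    """
    if len(ns) != len(sds) or len(ns) < 2:
        raise ValueError("need matching sample sizes and SDs for at least two groups")
    numerator = sum((n - 1) * s ** 2 for n, s in zip(ns, sds))
    df_within = sum(ns) - len(ns)  # N - k
    return numerator / df_within


# Sample sizes and standard deviations from Table 6.1 (rounded values).
ns = [16, 16, 16, 16, 16]
sds = [2.25, 2.79, 3.82, 2.98, 3.36]
ms_w = pooled_ms_within(ns, sds)
print(f"MS_w is approximately {ms_w:.2f}")
```

In practice the value of $MS_w$ from the software output is preferable, because it is computed from the unrounded data.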
The estimator of the effect size that is given by Equation 6.1 is

$$g_{mm} = \frac{\bar{Y}_{\max} - \bar{Y}_{\min}}{MS_w^{1/2}}. \tag{6.3}$$
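A minimal Python sketch of Equation 6.3 follows, using the five sample means and the $MS_w$ value reported in Table 6.1. The function name `g_mm` is chosen here only for illustration.

```python
import math


def g_mm(means, ms_within):
    """Range of the sample means in pooled standard deviation units (Equation 6.3)."""
    if ms_within <= 0:
        raise ValueError("ms_within must be positive")
    return (max(means) - min(means)) / math.sqrt(ms_within)


# Sample means and MS_w from Table 6.1.
means = [3.56, 6.38, 9.12, 10.75, 13.44]
estimate = g_mm(means, ms_within=9.53)
print(f"g_mm = {estimate:.2f}")
```

Only the two extreme means affect the result; the three middle means are ignored by this measure.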
(For a reminder of the distinction between $g$ [pooling] and $d$ [no pooling] estimators of standardized differences between means see the section Equal or Unequal Variances in chap. 3.) Applying the values from the current set of ANOVA results in Table 6.1 to Equation 6.3, one finds that $g_{mm} = (13.44 - 3.56)/9.53^{1/2} = 3.20$. Thus, the highest and lowest population means are estimated to be 3.20 standard deviation units apart, if the standard deviation is assumed to be the same for each population that is represented in the study. Note that it is not always true that when the overall $F$ is statistically significant a test of $\bar{Y}_{\max} - \bar{Y}_{\min}$ will also yield statistical significance. Discussions of testing the statistical significance of $\bar{Y}_{\max} - \bar{Y}_{\min}$ and testing differences within other pairs of means among the $k$ means are presented in a later section, Statistical Significance, Confidence Intervals, and Robustness. Note that the measure $g_{mmpop}$ should only be estimated in data analysis if the researcher can justify a genuine interest in it as a measure of overall effect size. The motivation for its use should not be the presentation of the obviously highest value of a $g$ possible. Not surprisingly for standardized-difference estimators of effect size, $g_{mm}$ tends to overestimate $g_{mmpop}$. This measure, and many others, can also be used to estimate needed sample size when planning research (Cohen, 1988; Maxwell & Delaney, 2004).

TABLE 6.1
Information Needed for the Calculations in Chapter 6

                                          Group 1    Group 2    Group 3    Group 4    Group 5    Totals
                                          (n = 16)   (n = 16)   (n = 16)   (n = 16)   (n = 16)   (N = 80)
Sample mean ($\bar{Y}_i$)                 3.56       6.38       9.12       10.75      13.44      $\bar{Y}_{all} = 8.65$
Sample standard deviation ($s_i$)         2.25       2.79       3.82       2.98       3.36

Notes. $k = 5$; $SS_b = 937.82$, $SS_w = 714.38$, $SS_{tot} = 1{,}652.20$; $MS_b = 234.46$, $MS_w = 9.53$; $F(4, 75) = 24.60$, $p < .001$.
Note. The data are from "Spacing of practice in verbal learning and the maturation hypothesis," by S. T. Wright, 1946, unpublished master's thesis, Stanford University, Stanford, CA. Adapted with permission of S. T. Wright, now Suzanne Scott.

A STANDARDIZED OVERALL EFFECT SIZE USING ALL MEANS

The $g_{mmpop}$ and $g_{mm}$ of Equations 6.1 and 6.3 ignore all of the means except the two most extreme means. There is a measure of overall effect size in a one-way ANOVA that uses all of the means.
This effect size, which assumes homoscedasticity, is Cohen's (1988) $f$, a measure of a kind of standardized average effect in the population across all of the levels of the independent variable. Cohen's $f$ is given by

$$f = \frac{\sigma_\mu}{\sigma}, \tag{6.4}$$

where $\sigma_\mu$ is the standard deviation of all of the means of the populations that are represented by the samples (based on the deviation of each mean from the mean of all of the means, as in Equation 6.6), and $\sigma$ is the common (assumed) standard deviation within the populations. An estimator of $f$ is given by

$$\hat{f} = \frac{s_{\bar{Y}}}{MS_w^{1/2}}, \tag{6.5}$$

where $s_{\bar{Y}}$ is the standard deviation of the set of all of the $\bar{Y}$ values from $\bar{Y}_1$ to $\bar{Y}_k$. Thus, for equal sample sizes,

$$s_{\bar{Y}} = \left[\frac{\sum (\bar{Y}_i - \bar{Y}_{all})^2}{k - 1}\right]^{1/2}, \tag{6.6}$$
where, as previously defined, $\bar{Y}_{all}$ is the mean of all sample means. In Equation 6.6 each $\bar{Y}_i - \bar{Y}_{all}$ reflects the effect of the $i$th level of the independent variable, so $s_{\bar{Y}}$ reflects a kind of average effect in the sample data across the levels of the independent variable. Therefore, $\hat{f}$ estimates the standardized average effect. Again, $MS_w$ can be found in software output from the overall ANOVA $F$ test or calculated using Equation 6.2. Refer to Cohen (1988) for the case of unequal sample sizes. Applying the results from the recall study to Equation 6.6 we find that

$$s_{\bar{Y}} = \left[\frac{(3.56 - 8.65)^2 + (6.38 - 8.65)^2 + (9.12 - 8.65)^2 + (10.75 - 8.65)^2 + (13.44 - 8.65)^2}{5 - 1}\right]^{1/2} = 3.828.$$
Therefore, using Equation 6.5, $\hat{f} = 3.83/9.53^{1/2} = 1.24$. The average effect across the samples is 1.24 standard deviation units. Although they have the same denominator, $g_{mm}$ should be expected to be greater than $\hat{f}$ because of the difference between their numerators. The numerator of $g_{mm}$ is the range of the means, whereas the numerator of $\hat{f}$ is the standard deviation of that same set of means, an obviously smaller number. In fact $g_{mm}$ is often two to four times larger than $\hat{f}$ (Cohen, 1988). Consistent with this typical result, for our data on recall $g_{mm}$ is more than 2.5 times greater than $\hat{f}$; $g_{mm}/\hat{f} = 3.20/1.24 = 2.58$.

Note that the estimator in Equation 6.5 is positively (i.e., upwardly) biased because the sample means in the numerator are likely to vary more than do the population means. An unbiased estimator of $f$ is

$$\hat{f}_{unbiased} = \left[\frac{k - 1}{N}(F - 1)\right]^{1/2}. \tag{6.7}$$
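Equations 6.5 through 6.7 can be sketched in Python as follows, using the means, $MS_w$, $F$, $k$, and $N$ reported in Table 6.1. The function names are hypothetical, equal sample sizes are assumed as in Equation 6.6, and raising an error when $F < 1$ is an assumption made here for the sketch, since the text does not say how that case should be handled.

```python
import math


def s_ybar(means):
    """Standard deviation of the k sample means, with k - 1 in the denominator (Equation 6.6)."""
    k = len(means)
    if k < 2:
        raise ValueError("need at least two means")
    mean_of_means = sum(means) / k
    return math.sqrt(sum((m - mean_of_means) ** 2 for m in means) / (k - 1))


def f_hat(means, ms_within):
    """Estimator of Cohen's f for equal sample sizes (Equation 6.5)."""
    return s_ybar(means) / math.sqrt(ms_within)


def f_unbiased(f_ratio, k, n_total):
    """Estimator of f based on the ANOVA F ratio (Equation 6.7)."""
    if f_ratio < 1:
        # Assumption for this sketch: the square root is undefined here,
        # and the text does not specify a convention for this case.
        raise ValueError("F below 1 gives a negative value under the square root")
    return math.sqrt((k - 1) / n_total * (f_ratio - 1))


# Values from Table 6.1.
means = [3.56, 6.38, 9.12, 10.75, 13.44]
print(f"s_ybar     = {s_ybar(means):.2f}")
print(f"f_hat      = {f_hat(means, 9.53):.2f}")
print(f"f_unbiased = {f_unbiased(24.60, k=5, n_total=80):.2f}")
```

The second estimate comes out below the first, in line with the upward bias of Equation 6.5 noted above.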
Refer to Maxwell and Delaney (2004) for further discussion. Applying Equation 6.7 to the data in Table 6.1 yields

$$\hat{f}_{unbiased} = \left[\frac{5 - 1}{80}(24.60 - 1)\right]^{1/2} = 1.09.$$

Note that this estimate for $f$ is lower than the one produced by Equation 6.5, as it should be. Consult Steiger (2004) for additional treatment of measures of overall standardized effect size in ANOVA.

STRENGTH OF ASSOCIATION
Recall from the section The Coefficient of Determination in chapter 4 that in the two-group case, $r^2_{pb}$ has traditionally been used to estimate the proportion of the total variance in the dependent variable that is associated with variation in the independent variable. Somewhat similar estimators of effect size have traditionally been used for one-way ANOVA designs in which $k > 2$. These estimators are intended to reflect strength of association on a scale ranging from 0 (no association) to 1 (maximum association).
ETA SQUARED ($\eta^2$)

A parameter that measures the proportion of the variance in the population that is accounted for by variation in the treatment is $\eta^2$. A traditional but especially problematic estimator of the strength-of-association parameter, $\eta^2$, is $\hat{\eta}^2$;

$$\hat{\eta}^2 = \frac{SS_b}{SS_{tot}}. \tag{6.8}$$
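Equation 6.8 is a simple ratio, as the following Python sketch shows for the sums of squares in Table 6.1. The function name `eta_squared` is hypothetical, and the resulting value is computed here for illustration; it is not a figure reported in the text.

```python
def eta_squared(ss_between: float, ss_total: float) -> float:
    """Sample eta squared: between-groups SS as a proportion of total SS (Equation 6.8)."""
    if ss_total <= 0:
        raise ValueError("ss_total must be positive")
    if not 0 <= ss_between <= ss_total:
        raise ValueError("ss_between must lie between 0 and ss_total")
    return ss_between / ss_total


# SS_b and SS_tot from Table 6.1.
estimate = eta_squared(ss_between=937.82, ss_total=1652.20)
print(f"eta squared estimate = {estimate:.3f}")
```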
The numerator of Equation 6.8 reflects variability that is attributable to variation in the independent variable and the denominator reflects total variability. The original name for $\eta$ itself was the correlation ratio, but this name has since come to be used by some also for $\eta^2$. When the independent variable is quantitative $\eta$ represents the correlation between the independent variable and the dependent variable, but, unlike $r_{pop}$, $\eta$ reflects a curvilinear as well as a linear relationship in that case. When there are two groups $\eta$ has the same absolute size as $r_{pop}$. Also, the previously discussed Cohen's (1988) $f$ is related to $\eta^2$; $f = [\eta^2/(1 - \eta^2)]^{1/2}$.

A major flaw of $\hat{\eta}^2$ as an estimator of strength of association is that it is positively biased; that is, it tends to overestimate $\eta^2$. This estimator tends to overestimate because its numerator, $SS_b$, is inflated by some error variability. Bias is less for larger sample sizes and for larger values of $\eta^2$. In the next section we discuss ways to reduce the positive bias in estimating $\eta^2$. For further discussion of such bias consult P. Snyder and Lawson (1993) and Maxwell and Delaney (2004).

EPSILON SQUARED ($\varepsilon^2$) AND OMEGA SQUARED ($\omega^2$)
A somewhat less biased alternative estimator of $\eta^2$ is $\hat{\varepsilon}^2$, and a more nearly unbiased estimator is $\hat{\omega}^2$; consult Keselman (1975). The equations are (Ezekiel, 1930):

$$\hat{\varepsilon}^2 = \frac{SS_b - (k - 1)MS_w}{SS_{tot}}, \tag{6.9}$$

and Hays' (1994)

$$\hat{\omega}^2 = \frac{SS_b - (k - 1)MS_w}{SS_{tot} + MS_w}. \tag{6.10}$$
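Equations 6.9 and 6.10 can be sketched in Python using the sums of squares and $MS_w$ from Table 6.1. The function names are hypothetical, and the sketch returns the raw value of each estimator without any adjustment.

```python
def epsilon_squared(ss_between, ss_total, ms_within, k):
    """Epsilon squared estimator (Equation 6.9)."""
    return (ss_between - (k - 1) * ms_within) / ss_total


def omega_squared(ss_between, ss_total, ms_within, k):
    """Omega squared estimator (Equation 6.10)."""
    return (ss_between - (k - 1) * ms_within) / (ss_total + ms_within)


# Values from Table 6.1.
ss_b, ss_tot, ms_w, k = 937.82, 1652.20, 9.53, 5
eps2 = epsilon_squared(ss_b, ss_tot, ms_w, k)
om2 = omega_squared(ss_b, ss_tot, ms_w, k)
print(f"epsilon squared estimate = {eps2:.3f}")
print(f"omega squared estimate   = {om2:.3f}")
```

Both functions share the same numerator; omega squared differs only in the larger denominator, so it cannot exceed epsilon squared when the numerator is positive.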
We assume equal sample sizes and homoscedasticity. Software output for the ANOVA $F$ test might include $\hat{\varepsilon}^2$ and/or $\hat{\omega}^2$. However, manual calculation is easy (demonstrated later) because the $SS$ and $MS_w$ values are available from output even if these estimators are not. Comparing the numerators of Equations 6.9 and 6.10 with the numerator of Equation 6.8 for $\hat{\eta}^2$, observe that Equations 6.9 and 6.10 attempt to compensate for the fact that $\hat{\eta}^2$ tends to overestimate $\eta^2$ by reducing the numerator of the estimators by $(k - 1)MS_w$. Equation 6.10 goes even further in attempting to reduce the overestimation by also adding $MS_w$ to the denominator. The $\hat{\omega}^2$ estimator is now more widely used than is $\hat{\varepsilon}^2$.

A statistically significant overall $F$ can be taken as evidence that $\hat{\omega}^2$ is significantly greater than 0. However, confidence intervals are especially important here because of the high sampling variability of the estimators (Maxwell, Camp, & Arvey, 1981). For example, R. M. Carroll and Nordholm (1975) found great sampling variability even when $N = 90$ and $k = 3$. Of course, high sampling variability results in estimates often being much above or much below the effect size that is being estimated. For rough purposes approximate confidence limits for $\eta^2$ based on $\hat{\omega}^2$ can be obtained using graphs (called nomographs) that can be found in Abu Libdeh (1984). Refer to Venables (1975) for an advanced discussion. Assuming normality and, especially, homoscedasticity, the use of noncentral distributions is appropriate for constructing such confidence intervals. Therefore, as was discussed in the section on noncentral distributions in chapter 3, software is required for their construction, so no example of manual calculation is presented here. Refer to Fidler and Thompson (2001) for a demonstration of the use of SPSS to construct a confidence interval for $\eta^2$ that is based on a noncentral distribution.
Also consult Smithson (2003) and Steiger (2004) for further discussion of such confidence intervals. At the time of this writing Michael Smithson provides SPSS, SAS, S-PLUS, and R scripts for computing confidence intervals. These scripts can be accessed at http://www.anu.edu.au/psychology/staff/mike/Index.html. STATISTICA can also produce such confidence intervals.

Note that as a measure of a proportion (of total variance of the dependent variable that is associated with variation of the independent variable) the value of $\eta^2$ cannot be below 0, but inspection of Equations 6.9 and 6.10 reveals that the values of the estimators $\hat{\varepsilon}^2$ and $\hat{\omega}^2$ can themselves be below 0. Hays (1994), who had earlier introduced $\hat{\omega}^2$, recommended that when the value of this estimator is below 0 the value should be reported as 0.
However, some meta-analysts are concerned that replacing negative estimates with zeros might cause an additional positive bias in an estimate that is based on averaging estimates in a meta-analysis. Similarly, Fidler and Thompson (2001) argued that any obtained negative value should be reported as such instead of converting it to 0 so that the full width of a confidence interval can be reported. Of course, when a negative value is reported, a reader of a research report has an opportunity to interpret it as 0 if one so chooses. Consult Susskind and Howland (1980) and Vaughan and Corballis (1969) for further discussions of this issue.

For an example of $\hat{\omega}^2$ we apply the results from the recall study (Table 6.1) to Equation 6.10 to find that

$$\hat{\omega}^2 = \frac{937.82 - (5 - 1)9.53}{1{,}652.20 + 9.53} = .54.$$
Therefore, we estimate that 54% of the variability of the recall scores is attributable to varying the method of presentation of the material that is to be learned. This estimation is subject to the limitations that are discussed later in this chapter in the section entitled Evaluation of Criticisms of Estimators of Strength of Association. For discussions of application of $\hat{\omega}^2$ to analysis of covariance and to multivariate designs refer to Olejnik and Algina (2000).

STRENGTH OF ASSOCIATION FOR SPECIFIC COMPARISONS
Estimation of the strength of association within just two of the $k$ groups at a time may be called estimation of a specific, focused, or simple-effects strength of association. Such estimation provides more detailed information than do the previously discussed estimators of overall strength of association. To make such a focused estimate one can use

$$\hat{\omega}^2_{comp} = \frac{SS_{comp} - MS_w}{SS_{tot} + MS_w}, \tag{6.11}$$
where the subscript comp represents a comparison (between two groups). The symbol $SS_{contrast}$ is sometimes used instead of $SS_{comp}$. (Often comparison refers to two means, whereas contrast refers to more than two means, as is shown in the next paragraph.) Observe the similarity between Equations 6.10 and 6.11. In Equation 6.11 $SS_{comp}$ replaces the $SS_b$ of Equation 6.10, and the $(k - 1)$ of Equation 6.10 is now $2 - 1 = 1$ in Equation 6.11 because one is now involving only two groups. To find $SS_{comp}$ in the present case of making a simple comparison involving two of the $k$ means, $\bar{Y}_i$ and $\bar{Y}_j$, use

$$SS_{comp} = \frac{(\bar{Y}_i - \bar{Y}_j)^2}{\dfrac{1}{n_i} + \dfrac{1}{n_j}}. \tag{6.12}$$
Consult Olejnik and Algina (2000) for a more general formulation and a worked example of Equation 6.12 that involves the case of a complex comparison (often simply called a contrast), such as comparing the mean of a control group with the overall mean of two or more combined treatment groups.

In the research on recall two of the five group means were $\bar{Y}_i = 10.75$ and $\bar{Y}_j = 6.38$. Using these two means for an example and using that study's ANOVA results that are presented in Table 6.1, we apply Equation 6.12 to find that in this example $SS_{comp} = (10.75 - 6.38)^2/(1/16 + 1/16) = 152.78$. Now applying Equation 6.11, we find that $\hat{\omega}^2_{comp} = (152.78 - 9.53)/(1{,}652.20 + 9.53) = .09$. Therefore, we estimate (subject to the limitations that are discussed in the next section) that 9% of the variability of the recall scores is attributable to whether presentation method $i$ or presentation method $j$ is used for learning the material that is to be recalled. Consult Keppel (1991), Maxwell et al. (1981), Olejnik and Algina (2000), and Vaughan and Corballis (1969) for further discussions of estimating strength of association for specific comparisons.
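The two-step calculation in this section (Equation 6.12 followed by Equation 6.11) can be sketched in Python as follows. The function names are hypothetical, and the sketch covers only the simple two-group comparison, not the complex contrasts treated by Olejnik and Algina (2000).

```python
def ss_comparison(mean_i, mean_j, n_i, n_j):
    """Sum of squares for a simple comparison of two group means (Equation 6.12)."""
    if n_i <= 0 or n_j <= 0:
        raise ValueError("sample sizes must be positive")
    return (mean_i - mean_j) ** 2 / (1 / n_i + 1 / n_j)


def omega_squared_comparison(ss_comp, ms_within, ss_total):
    """Omega squared estimator for a two-group comparison (Equation 6.11)."""
    return (ss_comp - ms_within) / (ss_total + ms_within)


# Two of the five groups from the recall study (Table 6.1).
ss_comp = ss_comparison(mean_i=10.75, mean_j=6.38, n_i=16, n_j=16)
om2_comp = omega_squared_comparison(ss_comp, ms_within=9.53, ss_total=1652.20)
print(f"SS_comp = {ss_comp:.2f}")
print(f"omega squared (comparison) = {om2_comp:.2f}")
```

Note that $MS_w$ and $SS_{tot}$ still come from the full five-group analysis, even though only two means enter the numerator.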
EVALUATION OF CRITICISMS OF ESTIMATORS OF STRENGTH OF ASSOCIATION
estimators �2 n ,, f.e2, , and and illw2 are all called estimators estimators of strength asThe estimators strength of as sociation, variance accounted accounted for, proportion of variance explained sociation, variance for, or proportion (POV). . We will call such estimators estimators POV POV estimators estimators in the the remainder of (POV) of = 22 these POV estimators are similar, but but not identi identithis chapter. (When k = POVestimators cal, to rr2!pbb ', the sample coefficient coefficient of determination determination that was discussed in chap. 4.) Such estimators estimators and the 112 n2 that they they estimate estimate share some of the 2 2 criticisms of rr \ or that have appeared in the literature and that pb and rr �pop discussedd in chapter 4. We We very briefly were discusse briefly review and evaluate these criticisms and evaluate evaluate some others. Note that we repeatedly state in this this book that no effect effect size or estimator estimator is without one or more limitations. limitations. Furthermore, some of the limitations limitations of 11n22 and its estimators apestimators are also ap Ll and plicable to measures of the standardized difference difference between means, A ggpop poP', and and their their estimators. Also, Also, some of the the limitations limitations are more of a problem for meta-analysis than for the underlying underlying primary primary research that is the focus of this book. (For (For an argument argument that these estimators, estimators, unlike 2 rr, , do not not actually Murray Dosser, 1987.) actually estimate POVpop 0 ', consult consult and Dosser, 1 987.) P First, recall from the the sectio sectioni;. The Coefficient off Determination Determination in Coefficient o chapter 4 that effect squaring values that would chapter effect sizes that involve involve squaring would oth otherwise be below 11 yield values that are often closer to 0 hu0 than 11 in the hu man sciences. 
A consequence of this that is sometimes pointed out in the literature is a possible undervaluing of the importance of the result. A statistically inexperienced reader of a research report or summary, one who is familiar with little more than the 0% to 100% scale of percentages, will not likely be familiar with the range of typical values of estimates of a standardized-difference effect size or POV effect size.
Therefore, if, say, the estimate from an obtained d is that $\Delta$ = .5, a value that is approximately equivalent to $\omega^2$ = .05, such a statistically inexperienced reader will likely be more impressed by the effect of the independent variable if an estimated $\Delta$ = .5 is reported than if an estimated POV = .05 is reported. (Note that the magnitudes of an estimate of POV and an estimate of $\Delta$ depend in part on sample size; consult Barnette & McLean, 2002, and Onwuegbuzie & Levin, 2003.) The just-noted criticism of the POV approach to effect size is less applicable the more statistical knowledge that the intended readership of a research report has and the more the author of a report does to disabuse readers of incorrect interpretation of the results. Indeed, the more warnings about this limitation that appear in articles and books, the less susceptibility there will be to such undervaluing. On the other hand, a low value for an estimate of POV can be informative in alerting us to the need to (a) search for additional independent variables that might contribute to determining values of the dependent variable and/or (b) improve control of extraneous variables that contribute to error variability in the research and thereby lower an estimate of the POV.

Second, also recall from chapter 4 and from earlier in this chapter that measures of effect size (but not necessarily their estimates) that involve squaring are directionless; they cannot be negative, rendering them typically useless for averaging in meta-analysis.
The inappropriateness of averaging estimates of POV across studies can be readily seen by recognizing that the same value for the estimate would be obtained in two studies if all of the values of the terms in Equations 6.8, 6.9, or 6.10 were the same in both studies even if the rank order of the k means were opposite in these studies. An example of this situation would be one in which the most, intermediate, and least effective treatments in Study 1 were Treatments a, b, and c, respectively, whereas the ranking of effectiveness in Study 2 was Treatments c, b, and a, respectively. The two POV estimates would be the same although the two studies produced opposite results. This is more a problem for a meta-analyst than for a primary researcher. However, this limitation reminds one again that research reports should include means for all samples, rendering it easier to interpret results in the context of the results from other related studies.

Third, a criticism that is sometimes raised is easy to accommodate. Namely, unlike a typical standardized-difference effect size for k = 2, the most commonly used POV effect size for k > 2 designs (estimated by Equations 6.8, 6.9, or 6.10) is global; that is, it provides information about the overall association between the independent and dependent variables but it does not provide information about specific comparisons within the k levels of the independent variable. This limitation can be avoided by applying the less commonly used Equation 6.11 to two samples at a time from the k samples.

Fourth, an additional criticism is related to the first criticism. Recall from the section The Coefficient of Determination in chapter 4 that human behavior (e.g., the dependent variable) is multiply determined; that is, it is influenced by a variety of genetic and background experiential variables (both kinds being extraneous variables in much research). Therefore, it is usually unreasonable to expect that any single independent variable is going to contribute a very large proportion of what determines variability of the dependent variable (Ahadi & Diener, 1989; O'Grady, 1982). Again, more statistically experienced consumers of research reports will take multiple determination and typical sizes of estimates of POVs into account when interpreting an estimate of a POV. However, again, those readers of reports who are inexperienced in statistics might merely note that an estimated POV is not very far above 0% and often mistakenly conclude that the effect, therefore, must be of little practical importance. In fact a small-appearing estimate of POV might actually be important and might also be typical of the effect of independent variables in the human sciences. Again, a report of research can deal with this possible problem by tailoring the Discussion section to the level of statistical knowledge of the readership.

Fifth, the literature includes another criticism of the POV measure that is applied under the fixed-effects model; namely, its magnitude depends on which of the possible levels of the independent variable are selected by the researcher for the study. For example, including an extreme level, such as a no-treatment control group (a strong manipulation), can increase the estimate. Note, however, that standardized-difference effect sizes are similarly dependent on the range of difference between the two levels of the independent variable that are being compared, because this difference influences the magnitude of the numerator of the measure or its estimator.
For example, one is likely to obtain a larger value of an estimate of a POV or standardized-difference effect size if one compares a high dose of a drug with a zero dose than if one compares two intermediate doses. This criticism can be countered if the researcher chooses the levels of the independent variable sensibly and limits the interpretation of the results only to those levels, as is required under the fixed-effects model. In applied research a researcher's "sensible" choice of levels of the independent variable would be those that are comparable to the levels that are currently used or ones that are likely to be adopted in practice. Note too by inspecting the numerators of Equations 6.9 and 6.10 that an estimate of overall POV is also affected by the number of levels of the independent variable, k; consult P. Snyder and Lawson (1993) and Barnette and McLean (2002).
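The second criticism above (that POV measures are directionless) can be illustrated with a minimal Python sketch. It assumes the usual sample form of $\hat{\eta}^2$, SS_between / SS_total, which may differ in detail from the chapter's equations; the scores and the helper name `eta_squared` are hypothetical and are not taken from the book.

```python
from statistics import mean

def eta_squared(groups):
    """Sample eta squared, SS_between / SS_total, for a one-way layout."""
    all_scores = [y for g in groups for y in g]
    grand_mean = mean(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ss_total = sum((y - grand_mean) ** 2 for y in all_scores)
    return ss_between / ss_total

# Hypothetical scores (illustration only, not the book's data).
low, middle, high = [1, 2, 3], [4, 5, 6], [7, 8, 9]

study_1 = {"a": high, "b": middle, "c": low}   # Treatment a most effective
study_2 = {"a": low, "b": middle, "c": high}   # Treatment c most effective

pov_1 = eta_squared(list(study_1.values()))
pov_2 = eta_squared(list(study_2.values()))
print(pov_1, pov_2)  # identical although the rank orders are opposite
```

Because the two estimates coincide while the ordering of the treatment means is reversed, averaging such values across studies would hide the disagreement; the sample means themselves are needed to see it.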
As is the case for other kinds of effect sizes, estimates of POV will be reduced by unreliable measurement of the dependent variable or by unreliable measurement, unreliable recording, or unreliable manipulation of the independent variable, all of which were discussed in chapter 4. The estimate of the POV can be no greater than, and likely often much less than, the product of $r_{xx}$ and $r_{yy}$, which are the reliability coefficients (chap. 4) of the independent variable and dependent variable, respectively. In many cases the reliability of the independent variable will not be known. However, if we assume that for a manipulated independent variable $r_{xx}$ = 1, or nearly so, then the estimate of the POV will have an upper limit at or slightly below the value of $r_{yy}$. The lower the reliabilities, the greater the contribution of error variance to the total variance of the data and, therefore, the lower the proportion of total variance of the data that is associated with variation of the independent variable. (Observe that the denominators of Equations 6.9, 6.10, and 6.11 become greater the greater the error variability.) Also, as was previously stated, estimators of POV assume homoscedasticity, and they can especially overestimate POV when there is heteroscedasticity and unequal sample sizes (R. M. Carroll & Nordholm, 1975). This is reason enough to be cautious about comparing estimates of POV from studies with different sample sizes (Murray & Dosser, 1987).

Finally, analysis of data occurs in a context of design characteristics that can influence the results (Wilson & Lipsey, 2001). Therefore, when interpreting results and when comparing them with those from other studies, one should be cognizant of the research design and context that gave rise to those results. As P. Snyder and Lawson (1993) cautiously noted, a researcher should not simply report that an independent variable accounted for an estimated P% of the variance of the measure of the dependent variable. Instead, a researcher should report, subject to the other limitations that have been discussed, that it is estimated that P% of the variance of the measure of the dependent variable is accounted for when n of the kind of participants who were used are assigned to each of the k levels of the independent variable that were used. Refer to Onwuegbuzie and Levin (2003) and the many references therein for further discussions of the influence of numerous characteristics of research designs on effect sizes. Olejnik and Algina (2003) discussed generalized POV measures that are applicable to a variety of designs.
There is extensive literature on estimating a POV. Good starting points for a search of this literature are articles by Fern and Monroe (1996), O'Grady (1982), Olejnik and Algina (2000), Richardson (1996), P. Snyder and Lawson (1993), Vaughan and Corballis (1969), and the other articles that have been cited in this chapter. Also consult the references that are footnoted by Keppel (1991). In the next section we consider standardized-difference measures of effect size that focus on comparisons between two groups at a time from the set of k groups. This is an informative approach that also addresses the third criticism of measures of POV that was already discussed. Hunter and Schmidt (2004) discussed their objection to POV measures.
STANDARDIZED-DIFFERENCE EFFECT SIZES FOR TWO OF k MEANS AT A TIME
When an estimator of a standardized-difference effect size involves the mean of a control, placebo, or standard-treatment comparison group ($\bar{Y}_c$) and the mean of any one of the other groups ($\bar{Y}_i$), and homoscedasticity is not assumed, it is sensible to use the standard deviation of such a comparison group, $s_c$, for standardizing the mean difference to obtain

$$d_{comp} = \frac{\bar{Y}_i - \bar{Y}_c}{s_c}. \tag{6.13}$$
Alternatively, if one assumes homoscedasticity of the two populations whose samples are involved in the comparison, the pooled standard deviation from these two samples, $s_p$, may be used instead to find

$$g_p = \frac{\bar{Y}_i - \bar{Y}_j}{s_p}, \tag{6.14}$$
where j can represent a control or any other kind of group. If one assumes homoscedasticity of all of the k populations, the best standard deviation by which to divide the difference between any two of the means, including $\bar{Y}_i - \bar{Y}_c$, is the standard deviation that is based on pooling the within-group variances of all k groups, $MS_w^{1/2}$, producing

$$g_{msw} = \frac{\bar{Y}_i - \bar{Y}_j}{MS_w^{1/2}}. \tag{6.15}$$

Again, to find $MS_w^{1/2}$ take the square root of the value of $MS_w$ that is found in the ANOVA software output or take the square root of the $MS_w$ that has been calculated from Equation 6.2. As is discussed in the next section, Worked Examples, each of the estimators in Equations 6.13, 6.14, and 6.15 has a somewhat different interpretation.

A problem may occur when applying Equation 6.14 to more than one of the possible pairs of the k means. To some extent differences among the two or more values of $g_p$ may arise merely from varying values of $s_i$ from comparison to comparison, even if the same $\bar{Y}_j$, say, $\bar{Y}_c$, is used for each $g_p$. Even when there is homoscedasticity (a characteristic of populations, not samples), sampling variability of values of $s^2$ can cause great variation in the different $s^2$ values that contribute to the pooling of an $s_i^2$ and an $s_j^2$ for each $g_p$. Such sampling variability should be taken into account when interpreting differences among the values of $g_p$. For further discussion of limitations of d (and g) types of estimators of effect sizes, see the last section of chapter 7.

WORKED EXAMPLES
We use the results in Table 6.1 from the research on recall to demonstrate calculation of all of the estimators that were presented in the previous section. For calculation using Equations 6.13 and 6.14 we use $\bar{Y}_2$ = 6.38 for $\bar{Y}_i$ and $\bar{Y}_5$ = 13.44 for $\bar{Y}_c$ and $\bar{Y}_j$. Therefore, $s_c$ is $s_5$ = 3.36 and $s_p$ is based on pooling the variances of Samples 2 and 5, in which $s_2^2 = 2.79^2 = 7.78$ and $s_5^2 = 3.36^2 = 11.29$. Values of $s^2$ for each sample can be obtained from software output or from an equation for manual calculation: $s^2 = [\sum Y^2 - n(\bar{Y}^2)] / (n - 1)$. (Calculation of $s^2$ using this equation is demonstrated in chap. 7 in the Classificatory Factors Only section.) We previously reported that $MS_w$ = 9.53 for the current data on recall. One pools the variances $s_2^2$ = 7.78 and $s_5^2$ = 11.29 to find $s_p^2$ using Equation 6.16:

$$s_p^2 = \frac{(n_i - 1)s_i^2 + (n_j - 1)s_j^2}{n_i + n_j - 2}. \tag{6.16}$$
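The pooling step of Equation 6.16 and the three estimators of Equations 6.13, 6.14, and 6.15 can be sketched in Python as a check on the arithmetic. This is a minimal sketch, not the authors' software; the function names are hypothetical, and the inputs are the values the text reports for Samples 2 and 5, with n = 16 per sample as used in the text's calculation.

```python
from math import sqrt

def pooled_variance(n_i, var_i, n_j, var_j):
    """Equation 6.16: pooled variance of two samples."""
    return ((n_i - 1) * var_i + (n_j - 1) * var_j) / (n_i + n_j - 2)

def d_comp(mean_i, mean_c, sd_c):
    """Equation 6.13: standardize by the comparison group's SD."""
    return (mean_i - mean_c) / sd_c

def g_p(mean_i, mean_j, sd_pooled):
    """Equation 6.14: standardize by the two-sample pooled SD."""
    return (mean_i - mean_j) / sd_pooled

def g_msw(mean_i, mean_j, ms_within):
    """Equation 6.15: standardize by the square root of MS_w."""
    return (mean_i - mean_j) / sqrt(ms_within)

# Values reported in the text (Sample 2 is group i; Sample 5 is c and j).
mean_2, mean_5 = 6.38, 13.44
var_2, var_5 = 7.78, 11.29
sd_5 = 3.36
ms_w = 9.53

var_p = pooled_variance(16, var_2, 16, var_5)
sd_p = sqrt(var_p)

print(round(var_p, 3), round(sd_p, 2))
print(round(d_comp(mean_2, mean_5, sd_5), 2))
print(round(g_p(mean_2, mean_5, sd_p), 2))
print(round(g_msw(mean_2, mean_5, ms_w), 2))
```

Note that carrying the unrounded pooled standard deviation through the division can shift $g_p$ in the second decimal place relative to a hand calculation that first rounds $s_p$ to 3.09; passing `3.09` as `sd_pooled` reproduces the hand-calculated value.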
Using Equation 6.16, $s_p^2$ = [(16 - 1)7.78 + (16 - 1)11.29] / (16 + 16 - 2) = 9.54 and $s_p = 9.54^{1/2} = 3.09$. Applying the needed previously noted values to Equations 6.13, 6.14, and 6.15 one finds that $d_{comp}$ = (6.38 - 13.44) / 3.36 = -2.10, $g_p$ = (6.38 - 13.44) / 3.09 = -2.28, and $g_{msw}$ = (6.38 - 13.44) / $9.53^{1/2}$ = -2.29. From the value of $d_{comp}$ we estimate that, with regard to the comparison population's distribution and standard deviation, the mean of Population i is 2.10 standard deviation units below the mean of the comparison population. From the value of $g_p$ we estimate that, with regard to the distribution of Population j and a common standard deviation for Populations i and j, the mean of Population i is 2.28 standard deviation units below the mean of Population j.
Finally, from the value of $g_{msw}$ we estimate that, with regard to the distribution of Population j and a common standard deviation for all five of the involved populations, the mean of Population i is 2.29 standard deviation units below the mean of Population j.

If one assumes normality for the two compared populations, one can interpret the results in terms of an estimation of what percentage of the members of one population score higher or lower than the average-scoring members of the other population. (Refer to the second section of chap. 3 for a refresher on this topic.) A researcher should decide a priori which pair or pairs of means are of interest and then choose among Equations 6.13, 6.14, and 6.15 based on whether homoscedasticity is to be assumed. Any estimator that is calculated must then be reported.

STATISTICAL SIGNIFICANCE, CONFIDENCE INTERVALS, AND ROBUSTNESS
Before considering standardized differences between means we discuss methods for unstandardized differences between means. Recall from the opening sections of chapters 2 and 3 that inferences about unstandardized differences between means can be especially informative when the dependent variable is scaled in familiar units such as weight lost or gained, ounces of alcohol or number of cigarettes consumed, days abstinent or absent, or dollars spent. Bond et al. (2003) argued for routine use of unstandardized differences in such cases and demonstrated a method for their use in meta-analysis.

Tests of the statistical significance of all of the $\bar{Y}_i - \bar{Y}_j$ pairings, including $\bar{Y}_{max} - \bar{Y}_{min}$, and construction of confidence intervals that are based on all of these differences (simultaneous confidence intervals) are often conducted using John Tukey's honestly statistically different (HSD) test of pairwise comparisons. This method is widely available in software packages. Note that when using some methods of pairwise comparisons, such as Tukey's HSD method, it is customary but perhaps unwise in terms of loss of statistical power to have conducted a previous omnibus (overall) F test. Tukey's method is a substitute for, not a follow-up to, an omnibus test. (The well-known Scheffé method, which is not discussed here, and Dayton's, 2003, method, which is discussed here, are exceptions.) Consult Bernhardson (1975) and Wilcox (2003) for elaboration of this issue of problematic prior omnibus F testing.

Additionally, the results of an omnibus F test may not be consistent with those of Tukey's HSD test. The omnibus F may be significant even when none of the pairwise comparisons is significant and vice versa. The researcher's initial research hypothesis or hypotheses should determine whether to use an omnibus F test and omnibus estimator of effect size or pairwise comparisons of means and their related specific (focused) effect sizes.
The procedures procedures for for the the Tukey Tukey method method for for such such pairwise pairwise significance significance The testing and construction of confidence confidence intervals, intervals, including testing and construction of including modificamodifica tions for for unequal unequal sample and heteroscedasticity, heteroscedasticity, are are explained explained in in de detions sample size size and 4 ). The tail in in M Maxwell and Delaney Delaney (200 (2004). The Tukey Tukey method method that that we we discuss axwell and discuss tail here, which which is is also also known known as as the the wholly wholly significantly different (WSD) (WSD) here, significantly different method, is is included included in in some some major major software software packages. packages. Note that the the method, Note that Tukey method method that to this section is is not not the the same and not not in inTukey that is is relevant relevant to this section same and terchangeable with with aa method that is is known known as the Tukey-b Tukey-b method. method. terchangeable method that as the In their their simulation simulation study study of of the the robustness robustness of of several several methods methods for for In making pairwise pairwise comparisons comparisons under under various various conditions conditions of of violation violation of making of assumptions, assumptions, Cribbie Cribbie and and Keselman Keselman (2003a) (2003a) found found that that the the Tukey Tukey method can be be outperformed by aa hybrid hybrid method method in in terms of control control of method can outperformed by terms of of Type II error error and and power, power, at at least least under under the the conditions conditions that that were were studied. studied. Type The hybrid hybrid method, method, which which came came to to be be known known as as the the REGWQ REGWQ. 
proce proceThe dure, is based on modification modification and and remodification remodification of of the the once-popular once-popular dure, is based on Newman-Keuls method. Consult Consult Cribbie Newman-Keuls method. Cribbie and and Keselman Keselman (2003a) (2003a) for for discussion the REGWQmethod. discussion and and references references regarding regarding the the history history of ofthe REGWCtmethod. Cribbie and and Keselman Keselman (2003a) (2003a) found applying the the REGWQ REGWQ. Cribbie found that that applying method to to the Welch (1938) version of of the statistic controlled controlled Type Type I method the Welch ( 1 938) version the t statistic error well. When When there there was was moderate skew power power was was higher when error well. moderate skew higher when higher using using the the original original Welch Welch t, but but when when skew skew was was great great power power was was higher max -
when using the Yuen (1974) version of the Welch t that uses trimmed means and Winsorized variances, a version that was discussed in chapter 2 of this book.

Note that computer simulations (known as Monte Carlo studies) of the robustness of a statistical procedure cannot examine all possible conditions of violations of assumptions. Therefore, where there is no mathematical theory to inform about the robustness of a procedure, the best that a simulation study can do is to simulate a reasonable variety of conditions under which a statistical procedure might be applied by a researcher. Among the variables and combinations of variables that a good simulation, such as those by Cribbie and Keselman (2003a) and others, simulates are k, N, variation of n across samples, extent of heteroscedasticity, pairings of unequal values of n and unequal values of $\sigma^2$, pattern of means of the involved populations, shapes of the distributions in the populations, and, in the case of pairwise comparisons, whether and what kind of preceding omnibus test has been applied. Refer to Sawilowsky (2003) for additional criteria for an appropriate Monte Carlo simulation.

In the case of planned comparisons between each mean and the mean of a baseline group (i.e., a control, placebo, or standard-treatment group as in the numerator of Equation 6.13), the Dunnett many-one method may be used for significance testing and construction of simultaneous confidence intervals for all of the values of $\mu_i - \mu_c$. The procedure, which assumes homoscedasticity, can be found in Maxwell and Delaney (2004). Note that the Dunnett many-one method is not the same and is not for the same purpose as the Dunnett T3 method.

In applied research one might be interested in pairwise comparisons of the mean of the best-performing group (not known a priori) with each of the other groups. The Dunnett many-one method for planned comparisons is not applicable to this case. However, Hsu's (1996) modification of the many-one method is applicable, assuming homoscedasticity, for testing each such difference and constructing a confidence interval for each one. Refer to Maxwell and Delaney (2004) for additional detailed discussion.

Wilcox (2003) provided discussions and S-PLUS software functions for a variety of newer robust methods that compete with the Tukey method, including comparing pairwise medians instead of means. For another approach that is based on comparing medians, refer to Bonett and Price (2002). For a fundamentally different approach that is based on a reformulation of the traditional null hypothesis, refer to Shaffer (2002).

Note that if a researcher is interested in the relative magnitudes of all of the means of the populations that are represented in the design, methods of pairwise comparisons, such as Tukey's method, can produce intransitive (i.e., contradictory) results. For example, suppose that there are three groups, so that one can test $H_0$: $\mu_1 = \mu_2$, $H_0$: $\mu_2 = \mu_3$, and $H_0$: $\mu_1 = \mu_3$. Unfortunately, it is possible for a method of pairwise
comparisons to produce intransitive results, such as seeming to indicate that $\mu_1 = \mu_2$, $\mu_2 = \mu_3$, and, contradictorily, $\mu_1 > \mu_3$. Of course, such a suggested pattern of means cannot be true. Dayton (2003) provided a method for making inferences about the true pattern of the means of the involved populations. The method is intended to be applicable to the case of homoscedasticity or the case in which the pattern of the magnitudes of the variances in the populations is the same as the pattern of the means in the populations, which is likely common. The required sample sizes at a given effect size to attain a given level of statistical power for detecting the true pattern of these means depend on the nature of this pattern (see Cribbie & Keselman, 2003b; consult Table 3 in Dayton, 2003). For some patterns the required sample sizes are very large, but they are not as large as would be required for tests such as Tukey's HSD test to detect the true pattern. Construction of confidence intervals is not possible within Dayton's (2003) procedure. Also, this method, unlike Tukey's, should be preceded by an omnibus test (a protected procedure) to improve its accuracy in detecting the pattern of means (Cribbie & Keselman, 2003b).

Dayton's method generally appears to be robust to nonnormality (Cribbie & Keselman, 2003a; Dayton, 2003) and to heteroscedasticity in most cases (Cribbie & Keselman, 2003b).
Although Cribbie and Keselman (2003a) concluded from their simulations that Dayton's method generally provided good control of Type I error and good power, more simulation research may be needed before a definitive conclusion can be reached regarding the overall performance of the method under heteroscedasticity. Dayton (2003) too was cautious about his method in this regard. Computations can be implemented using Microsoft Excel alone or together with special software. Refer to Dayton (2003) for details.

Maxwell (2004) discussed the often ignored power implications of methods of multiple comparisons, distinguishing among power for a specific comparison, any-pair power to detect at least one pairwise difference, and all-pairs power to detect all true differences within pairs of means. Low specific-comparison power can result in inconsistent results across studies even when any-pair power is adequate. Maxwell's (2004) results indicate that extremely large sample sizes, large multi-center studies, or meta-analyses might be required to deal with this problem. He made numerous other recommendations, including the increased use of confidence intervals.

We turn our attention now from unstandardized to standardized differences between means. Approximate confidence intervals for $\Delta$, the standardized difference between population means, can be obtained, assuming homoscedasticity, by dividing the lower and upper limits of the confidence interval for each pairwise difference between population means by $MS_w^{1/2}$.
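The division described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `approx_standardized_ci` and the numerical inputs are hypothetical and do not come from any study cited here, and the interval for the raw mean difference is assumed to have been obtained already (e.g., from a Tukey procedure).

```python
import math


def approx_standardized_ci(lower, upper, ms_within):
    """Rescale a CI for a raw mean difference into an approximate CI for Delta.

    lower, upper: limits of the confidence interval for mu_i - mu_j.
    ms_within:    MS_w from the ANOVA (homoscedasticity is assumed).
    """
    if ms_within <= 0:
        raise ValueError("ms_within must be positive")
    standardizer = math.sqrt(ms_within)  # MS_w ** (1/2)
    return lower / standardizer, upper / standardizer


# Hypothetical illustration (values are not from the text):
# a confidence interval of (2.0, 10.0) for mu_1 - mu_2 and MS_w = 16.0.
ci_low, ci_high = approx_standardized_ci(2.0, 10.0, 16.0)
print(ci_low, ci_high)
```

Because the standardizer is itself an estimate, the rescaled interval is approximate, which is why the exact noncentral methods are cited alongside it.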
Refer to Steiger (1999) for software for constructing exact confidence intervals for a standardized-difference effect size (as estimated by $g_p$ of our Equation 6.14) that arises from planned contrasts
when sample sizes are equal. The exact confidence intervals use noncentral distributions. Steiger and Fouladi (1997) and Smithson (2003) illustrated the method. Consult Steiger (2004) for further discussion. Refer to Bird (2002) for a discussion of the likely differences in widths of exact and approximate confidence intervals.

Bird (2002) also presented an approximate method for construction of a confidence interval for $\Delta$ that is based on the usual (i.e., central) t distribution. (Assuming normality, an exact confidence interval would require the use of the noncentral t distribution that was discussed in chap. 3.) This method also assumes homoscedasticity by using the square root of $MS_w$ from the ANOVA results to standardize the difference between the two means of interest. This method appears generally to provide fairly close approximation to the nominal confidence level (e.g., 95%). A simulation study indicated that the actual confidence level (called the probability coverage) departs downward somewhat from the nominal level the greater the true value of $\Delta$, and even more so the smaller the number of groups. For example, when there were only two groups in the entire design and $\Delta = 0$, the probability coverage for a 95% confidence interval was indeed found to be .950, but when $\Delta = 1.6$, the probability coverage was actually .911. However, the latter coverage probability improved from .911 to .929 when there was a total of four groups in the design (Algina & Keselman, 2003). Also consult Algina and Keselman (2003) for a method and a SAS/IML program for constructing an exact confidence interval for $\Delta$, assuming homoscedasticity. This method provides the option to pool all variances in the design or just the variances of the two groups that are involved in the effect size. At the time of this writing Kevin Bird and his colleagues offer free software for constructing approximate confidence intervals for standardized or unstandardized contrasts, planned or unplanned, from between-groups or within-groups designs. This software is available at http://www.psy.unsw.edu.au/research/PSY.htm.

Refer to Keselman, Cribbie, and Wilcox (2002) for a method of paired comparisons of trimmed means that controls Type I error when sample sizes are unequal and there is nonnormality and heteroscedasticity. (Trimmed means were discussed in chap. 1 of this book.) Also, as was discussed for the k = 2 case in chapter 3, there are additional, rarely used estimators of standardized-difference effect sizes that may be more resistant to heteroscedasticity than the estimators that have been discussed in this section. These estimators involve alternatives to the use of mean differences in the numerators and alternatives to standard deviations in the denominators. For further discussions refer to the sections Tentative Recommendations and Additional Standardized-Difference Effect Sizes When There are Outliers in chapter 3 and to Grissom and Kim (2001).

For discussions of effect sizes for multivariate designs, consult Olejnik and Algina (2000) and Smithson (2003). For a measure of effect size for two or more groups in a randomized longitudinal design, refer
to Maxwell (1998). Rosenthal et al. (2000) provided an alternative treatment of effect sizes in terms of correlational contrasts for one-way and factorial designs. Maxwell and Delaney (2004) also discussed this topic. Timm (2004) proposed the ubiquitous effect size index as an alternative to correlational effect sizes for exploratory experiments. Timm's (2004) method is applicable to omnibus F tests or tests on contrasts. This method assumes homoscedasticity and reduces to Hedges' g in the case of two equal-sized groups.

For the case in which two groups at a time from multiple groups are compared, Wilcox (2003) discussed and provided S-PLUS software functions for estimation of what we called in chapter 5 the PS (probability of superiority) and the DM (dominance measure). Consult Vargha and Delaney (2000) and Brunner and Puri (2001) for extensions of what we called the PS to multiple-group and factorial designs.

WITHIN-GROUPS DESIGNS AND FURTHER READING
Recall from the Dependent Groups section in chapter 3 that the choice of a standardizer for an effect size depends on the nature of the population to which one intends to generalize the results. The choice of standardizer in the estimator must be consistent with the nature of the variability within this targeted population. However, in this regard, primary researchers, who directly estimate effect sizes from raw data, unlike meta-analysts, do not have to be concerned about the inflation of estimates of standardized-difference effect sizes. Such inflation by a meta-analyst would be attributable to the use of invalid formulas, instead of the valid formulas in the case of within-groups designs, for converting values of t or F to an estimate of effect size (cf. Cortina & Nouri, 2000).
For one-way ANOVA within-groups designs (e.g., repeated-measures designs), primary researchers can use the same equations that were presented in this chapter for the independent-groups design to calculate standardized-difference effect-size indicators. If a repeated-measures design has involved a pretest, the pretest mean ($\bar{Y}_{pre}$) can be one of any two compared means, and the standard deviation may be $s_{pre}$, $s_p$, or $MS_w^{1/2}$. The latter two standardizers assume homoscedasticity with regard either to the two compared groups or to all of the groups, respectively. (Statistical software may not automatically generate $s_{pre}$, $s_p$, or $MS_w^{1/2}$ when it computes a within-groups ANOVA. However, such software does allow one to compute the variance of data within a single condition of the design. Therefore, one may use the variances so generated to calculate $MS_w^{1/2}$ and $s_p$ using Equations 6.2 and 6.16, respectively.)
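As a rough sketch of the parenthetical advice above, the following Python code builds the two pooled standardizers from per-condition variances. Equations 6.2 and 6.16 are not reproduced in this section, so the code assumes they are the usual degrees-of-freedom-weighted pooling formulas; the function names and the data are hypothetical and serve only as illustration.

```python
import math
from statistics import variance


def ms_within_sqrt(conditions):
    """Square root of MS_w: pool the sample variances of all conditions,
    weighting each by its degrees of freedom (assumed form of Equation 6.2)."""
    numerator = sum((len(c) - 1) * variance(c) for c in conditions)
    denominator = sum(len(c) - 1 for c in conditions)
    return math.sqrt(numerator / denominator)


def pooled_sd(cond_a, cond_b):
    """s_p: the same pooling restricted to the two compared conditions
    (assumed form of Equation 6.16)."""
    return ms_within_sqrt([cond_a, cond_b])


# Hypothetical scores for four participants under three conditions.
pre = [10, 12, 14, 16]
post1 = [11, 13, 15, 17]
post2 = [12, 16, 20, 24]

s_pre = math.sqrt(variance(pre))             # pretest standardizer
s_p = pooled_sd(pre, post1)                  # two-condition standardizer
ms_w_root = ms_within_sqrt([pre, post1, post2])  # all-condition standardizer
print(s_pre, s_p, ms_w_root)
```

In this made-up data set the third condition is much more variable than the other two, so the all-condition standardizer is noticeably larger than $s_p$, which illustrates why the choice among the three standardizers depends on whether homoscedasticity is plausible.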
Consult Algina and Keselman (2003) for a method and a SAS/IML program for constructing an approximate confidence interval for a standardized-difference effect size under homoscedasticity or heteroscedasticity in a within-groups design.
For methods for making all pairwise comparisons (planned or unplanned) of the $\bar{Y}_i - \bar{Y}_j$ differences for dependent data, including the construction of simultaneous confidence intervals, refer to Maxwell and Delaney's (2004) and Wilcox's (2003) discussions of a Bonferroni method (historically, Bonferroni-Dunn method). Also consult Algina and Keselman (2003) for a method for constructing confidence intervals for pairwise comparisons.

The Tukey method (HSD or WSD) is not recommended for significance testing and the construction of confidence intervals in the case of dependent data because, as is not the case for the Bonferroni-Dunn method, the Tukey method might not maintain family-wise error rate (e.g., $\alpha_{FW} < .05$) unless the sphericity assumption is satisfied. (Of the several ways to define sphericity, the simplest considers the variance of the difference scores, $Y_i - Y_j$, with respect to compared levels i and j. Sphericity is satisfied when the population variances of such difference scores are the same for all such pairs of levels.) Because tests of sphericity may not have sufficient power to detect its absence, it is best to use methods that do not assume sphericity (e.g., it is better to use a multivariate than a univariate approach to designs with dependent groups). Refer to Maxwell and Delaney (2004) for further discussion of sphericity.

Finally, with regard to unstandardized differences, Wilcox (2003) discussed and provided S-PLUS software functions for newer robust methods for pairwise comparisons in the case of dependent groups.
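The difference-score definition of sphericity given above can be made concrete with a short Python sketch. The function name and the scores are hypothetical. Note also that the code computes sample variances, whereas sphericity is a statement about population variances, so unequal sample values are at most suggestive.

```python
from itertools import combinations
from statistics import variance


def difference_score_variances(scores):
    """Sample variance of the difference scores for every pair of levels.

    scores: one row per participant, one column per treatment level.
    Returns a dict keyed by (i, j) column-index pairs.
    """
    k = len(scores[0])
    result = {}
    for i, j in combinations(range(k), 2):
        diffs = [row[i] - row[j] for row in scores]
        result[(i, j)] = variance(diffs)
    return result


# Hypothetical data: four participants measured at three levels.
scores = [
    [1, 2, 4],
    [2, 4, 5],
    [3, 5, 9],
    [4, 5, 8],
]

for pair, var in difference_score_variances(scores).items():
    print(pair, round(var, 3))
```

Markedly different values across the pairs would be a warning sign, although, as noted above, formal tests of sphericity may lack power, so methods that do not assume it are preferable.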
Refer to Wilcox and Keselman (2002b) for simulations of the effectiveness of bootstrap methods to deal with the problem of controlling Type I error when outliers are either simply removed or formally trimmed when conducting all pairwise comparisons of the locations (e.g., trimmed means) of one specific group and each of the other groups (another many-one procedure) in the case of dependent data. Dayton's (2003) method and suggested software (both discussed in the previous section) for detecting the pattern of relationships among the means of the populations are also applicable to the case of dependent groups. Wilcox and Keselman (2003b) discussed the application of modified one-step M-estimators (that were discussed in our chap. 4) to one-way repeated-measures ANOVA. We turn our attention now to POV.

To estimate overall POV in a one-way ANOVA design with dependent groups, one can use (Dodd & Schultz, 1973; also refer to Olejnik & Algina, 2000)
$$\hat{\omega}^2 = \frac{(k - 1)(MS_{effect} - MS_{t \times s})}{SS_{tot} + MS_{sub}} \tag{6.17}$$
where k is the number of treatment levels, $MS_{effect}$ is the mean square for the main effect of treatment, $MS_{t \times s}$ is the mean square for Treatment × Subject interaction, $SS_{tot}$ is the total sum of squares, and $MS_{sub}$ is the
mean square for subjects. If software does not produce this estimate directly, calculation is done manually by obtaining the needed values from output. The approach underlying Equation 6.17 treats a one-way dependent-groups ANOVA as if it were a two-way design in which the main factor is Treatment, the other factor is Subjects, and the error term is the mean square for interaction.

With regard to areas of research in which the same independent variable is studied in between-groups and within-groups designs, it has been argued that a partial POV should be estimated instead of the usual POV for a within-groups design. The purpose is to render a POV from a within-groups design comparable to one from a between-groups design by eliminating subject variability from total variability. Keppel (1991) provided the relevant formulas. For a contrary view refer to Maxwell and Delaney (2004). Partial POV is discussed in chapter 7.
We demonstrate the application of Equation 6.17 using data that preceded the introduction of $\omega^2$ as a measure of POV. The dependent variable was visual acuity, and the treatments were three distances at which the target was viewed by four participants (Walker, 1947; cited in McNemar, 1962). Substantive details of the research and possible alternative analyses of the data do not concern us here. The values that are required for Equation 6.17 (directly or indirectly) are k = 3, $SS_{effect}$ = 1,095.50, $SS_{t \times s}$ = 596.50, n = 4, $SS_{tot}$ = 2,290.25, and $SS_{sub}$ = 598.25. The value of F is significant at p < .05. Dividing the values of SS by the appropriate degrees of freedom to obtain the required values of MS, we find that
$$\hat{\omega}^2 = \frac{(3 - 1)\left[\dfrac{1{,}095.50}{3 - 1} - \dfrac{596.50}{(3 - 1)(4 - 1)}\right]}{2{,}290.25 + \dfrac{598.25}{4 - 1}} = .36.$$
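The arithmetic of this worked example can be checked with a short Python sketch. It reads Equation 6.17 as $(k - 1)(MS_{effect} - MS_{t \times s}) / (SS_{tot} + MS_{sub})$ and derives each mean square from its sum of squares and degrees of freedom; the function name is hypothetical, and the input values are the ones reported above.

```python
def omega_squared_dependent(k, n, ss_effect, ss_txs, ss_tot, ss_sub):
    """Estimate overall POV for a one-way dependent-groups ANOVA.

    k: number of treatment levels; n: number of participants.
    Mean squares are derived from the sums of squares:
      MS_effect = SS_effect / (k - 1)
      MS_txs    = SS_txs / ((k - 1)(n - 1))
      MS_sub    = SS_sub / (n - 1)
    """
    ms_effect = ss_effect / (k - 1)
    ms_txs = ss_txs / ((k - 1) * (n - 1))
    ms_sub = ss_sub / (n - 1)
    return (k - 1) * (ms_effect - ms_txs) / (ss_tot + ms_sub)


# Values from the visual-acuity example in the text.
estimate = omega_squared_dependent(
    k=3, n=4,
    ss_effect=1095.50, ss_txs=596.50,
    ss_tot=2290.25, ss_sub=598.25,
)
print(round(estimate, 2))  # 0.36
```

The three component sums of squares add up to the reported total (1,095.50 + 596.50 + 598.25 = 2,290.25), which is a useful consistency check before applying the formula to output from other software.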
Therefore, we estimate that the distance at which the target is viewed accounts for 36% of the variance in acuity scores, under the conditions in which the research was conducted. Note that the effect had to be relatively strong for the F to have attained significance at p < .05 with only four participants.

Rosenthal et al. (2000) presented an alternative treatment of effect sizes for one-way and factorial dependent-group designs in terms of correlational types of contrasts (also discussed by Maxwell & Delaney, 2004). For further discussions of various topics of this chapter, consult Olejnik and Algina (2000), Cortina and Nouri (2000), and Hedges and Olkin (1985). Hunter and Schmidt (2004) presented a strong case for the use of dependent-groups designs. For extension of what we call the PS measure to dependent multiple groups in one-way and factorial designs consult Vargha and Delaney (2000) and Brunner and Puri (2001). We consider effect sizes for factorial designs in the next chapter.
QUESTIONS
1. Define the fixed effects model, and state to whom one may generalize the results from such a design.
2. Name two assumptions, other than independence, of the F test in an ANOVA, and state two possible consequences of violating these assumptions.
3. Why bother to test for the significance of the difference between the greatest and the smallest of the sample means if the overall F test is significant?
4. Define Cohen's f conceptually, stating why it is an effect size.
5. In which direction is the estimator in Equation 6.5 biased, and why?
6. Why is the sample eta squared a problematic estimator of the eta-squared parameter?
7. How do the sample epsilon squared and omega squared attempt to reduce biased estimation, and which is less biased?
8. How does one determine if omega squared is statistically significantly greater than 0?
9. Why are confidence intervals especially important for estimators of POV?
10. Discuss the rationale for reporting negative estimates of POV instead of reporting them as 0.
11. List those limitations or criticisms of POV and its estimators that might apply and those that would not apply to standardized differences between means.
12. How does unreliability of scores affect estimates of POV?
13. What would be more accurate wording in a Results section of a research report than merely stating that the independent variable accounted for an estimated P% of the variance of the dependent variable?
14. Name three choices for the standardizer of a standardized difference between two of k means, and state when each choice would be appropriate.
15. Why should one be cautious when interpreting differences among the values of the estimator in Equation 6.14?
16. In which circumstance are statistical inferences and confidence intervals about unstandardized differences between two means especially informative?
17. Why might it be unwise to precede Tukey's HSD method with an omnibus F test?
18. Are the results of an omnibus F test and applications of the Tukey HSD test always consistent? Explain.
19. Describe in general terms the nature of a good Monte Carlo study of the robustness of a statistical test.
20. What is the purpose of the Dunnett many-one method?
21. Give an example of intransitive results in multiple comparisons of three means that differs from the one in the text.
22. What assumption is being made if one constructs a confidence interval for the standardized difference between two of k population means by dividing the upper and lower limits of the confidence interval for the unstandardized pairwise difference by $MS_w^{1/2}$? Answer using more than one word.
23. List the numbers of the formulas in this chapter for between-groups standardized mean differences that would also be applicable to within-groups designs.
Chapter 7
Effect Sizes for Factorial Designs
INTRODUCTION
In this chapter we discuss a variety of estimators of standardized-difference and strength-of-association effect sizes for factorial designs with fixed-effects factors. Prior to the section on within-groups designs the discussion and examples involve only between-groups factors. The discussions of estimators of standardized differences are much influenced by the seminal work of Cortina and Nouri (2000) and Olejnik and Algina (2000). We add some alternative approaches and some of our own perspectives, focusing on assumptions, the nature of the populations to which results are to be generalized, and pairwise comparisons.
We call the factor with respect to which one estimates an effect size the targeted factor. We call any other factor in the design a peripheral factor. If later in the analysis of the same set of data a researcher estimates an effect size with respect to a factor that had previously been a peripheral factor, the roles and labels for this factor and the previously targeted factor are reversed. A peripheral factor is also called an off factor (Cortina & Nouri, 2000; Maxwell & Delaney, 2004).
The appropriate procedures for estimating an effect size from a factorial design depend in part on whether targeted and peripheral factors are extrinsic or intrinsic. Extrinsic factors are factors that do not ordinarily or naturally vary in the population to which the results are to be generalized. Extrinsic factors are often manipulated factors that are treatment variables imposed on the participants. Intrinsic factors are those that do naturally vary in the population to which the results are to be generalized, factors such as gender, ethnicity, or occupational or educational level.
Intrinsic factors are typically classificatory factors (which is the label that we use in this chapter), which are also called subject factors, grouping factors, stratified factors, organismic factors, or individual-difference factors. Refer to Maxwell and Delaney (2004) for further discussion of the distinction between extrinsic and intrinsic factors.
An extremely important consideration when choosing a method for estimating a standardized difference or POV (proportion of variance explained) effect size with regard to a targeted factor is whether the peripheral factor is extrinsic or intrinsic. As we soon explain, when a peripheral factor is extrinsic one generally will want to choose a method in which variability in the data that is attributable to that factor is held constant (making no contribution to the standardizer or to the total variability that is to be accounted for). When a peripheral factor is intrinsic one generally will want to choose a method in which variability that is attributable to the peripheral factor is permitted to contribute to the magnitude of the standardizer or to the total variance for some estimated proportion of which the targeted factor accounts.
In practical terms, when deciding the role that a peripheral factor is to play the researcher must consider the nature of the population with respect to which the estimate of effect size is to be made. If a peripheral factor does not typically vary in the population that is of interest (usually a variable that is manipulated in research but not in nature), one will choose a method that ignores variability in the data that is attributable to the peripheral factor. If a peripheral factor does typically vary in the population that is of interest (often a classificatory factor such as gender) one will choose a method in which variability that is attributable to such a peripheral factor is permitted to contribute to the standardizer or to total variability. We address this issue as we discuss each of the estimators of effect size in this chapter. For further discussions of the role of what we call a peripheral factor, consult Cortina and Nouri (2000), Gillett (2003), Glass et al. (1981), Maxwell and Delaney (2004), and Olejnik and Algina (2000).
STRENGTH OF ASSOCIATION: PROPORTION OF VARIANCE EXPLAINED
Estimation of $\eta^2_{pop}$ (the eta squared POV of chap. 6) is more complicated with regard to factorial designs than is the case with the one-way design, even in the simplest case, which we assume here, in which all sample sizes are equal. In general, for the effect of some factor or the effect of interaction, when the peripheral factors are intrinsic one can estimate POV using

$$\hat{\omega}^2 = \frac{SS_{effect} - (df_{effect})(MS_w)}{SS_{tot} + MS_w} \qquad (7.1)$$
All MS, SS, and df values are available or can be calculated from the ANOVA F-test software output. With regard to the main effect of Factor A, $SS_{effect} = SS_A$ and $df_{effect} = a - 1$, in which a is the number of levels of Factor A. With regard to the main effect of Factor B, substitute B and b (i.e., the number of levels of Factor B) for A and a in the previous sentence, and so forth for the main effect of any other factor in the design.
With regard to the interaction effect in a two-way design, $SS_{effect} = SS_{AB}$ and $df_{effect} = (a - 1)(b - 1)$. In the later Illustrative Worked Examples section we apply Equation 7.1 using an example that is integrated with examples of estimating standardized differences between means, $\Delta$ and $g_{pop}$, for the same data. References for estimation of effect sizes for designs that are not covered in this chapter, including higher-order designs, are provided later in the Additional Designs and Measures section.
Equation 7.1 provides an estimate of the proportion of the total variance of the measure of the dependent variable that is accounted for by an independent variable (or by interaction as another example in the factorial case), as does Equation 6.9 in chapter 6 for the one-way design. However, in this case of factorial designs there can be more sources of variance than in the one-way case because of the contributions made to total variance by one or more additional factors and interactions.
An effect of, say, Factor A might yield a different value of $\hat{\omega}^2$ if it is researched in the context of a one-way design instead of a factorial design, in which there will be more sources of variance that account for the total variance. As stated previously, estimates of effect size must be interpreted in the context of the design whose results produced the estimate. A method that was intended to render estimates of POV from factorial designs comparable to those from one-way designs is discussed in the next section.
For a method for correcting for overestimation of POV by omega squared in nested designs, in which each level of a factor is combined with only one level of another factor, consult Wampold and Serlin (2000). Also refer to a series of articles that debate the matter (Crits-Christoph, Tu, & Gallop, 2003; Serlin, Wampold, & Levin, 2003; Siemer & Joormann, 2003a, 2003b).
The designs in this chapter are not nested but crossed designs, in which each level of a factor is combined with each level of another factor, as shown later in Tables 7.1 through 7.5.
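As a rough illustration of Equation 7.1, the following Python sketch computes the estimate of overall POV for each effect from the SS and df values in the ANOVA output of Table 7.4. The function name `omega_squared` and the variable names are hypothetical labels chosen for this sketch, not part of any statistical package, and equal cell sizes are assumed as in the text.

```python
def omega_squared(ss_effect, df_effect, ss_total, ms_within):
    """Estimate overall POV for one effect (Equation 7.1)."""
    return (ss_effect - df_effect * ms_within) / (ss_total + ms_within)


# SS and df values copied from the ANOVA output in Table 7.4.
anova = {
    "A": {"ss": 10.0, "df": 1},
    "B": {"ss": 0.0, "df": 1},
    "AxB": {"ss": 0.4, "df": 1},
}
ss_within, df_within = 29.6, 36

ms_within = ss_within / df_within
# Total SS is the sum of all effect SS values plus the within-groups SS.
ss_total = sum(row["ss"] for row in anova.values()) + ss_within

omega2 = {
    name: omega_squared(row["ss"], row["df"], ss_total, ms_within)
    for name, row in anova.items()
}

for name, value in omega2.items():
    print(f"estimated omega squared for {name}: {value:.3f}")
```

With these values the estimate for Factor A is about .22, whereas the estimates for Factor B and for the interaction are slightly negative, which happens whenever the F for an effect is less than 1.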
PARTIAL $\hat{\omega}^2$
An alternative conceptualization of estimation of POV from factorial designs modifies Equation 7.1 so as to attempt to eliminate the contribution that any extrinsic peripheral factor may make to total variance. The resulting measure is called a partial POV. Whereas a POV measures the strength of an effect relative to the total variability from error and from all effects, a partial POV measures the strength of an effect relative to variability that is not attributable to other effects. The excluded effects are those that would not be present if the levels of the targeted factor had been researched in a one-way design. For this purpose partial eta squared and partial omega squared, the latter being less biased (chap. 6), have been traditionally used. Partial omega squared, $\hat{\omega}^2_{partial}$, for any effect is given by

$$\hat{\omega}^2_{partial} = \frac{SS_{effect} - (df_{effect})(MS_w)}{SS_{effect} + (N - df_{effect})(MS_w)} \qquad (7.2)$$
where N is the total sample size and calculation is again a matter of simple arithmetic because all of the needed values are available from the output from the ANOVA F test. Again we defer introducing a worked example until the later section Illustrative Worked Examples so that discussion of the example can be integrated with discussion of worked examples of other estimators of effect size for the same set of data.
A research report must make clear whether a reported estimate of POV is based on the overall $\hat{\omega}^2$ or $\hat{\omega}^2_{partial}$. Unfortunately, because values of estimates of overall POV and of partial POV can be very different, the more so the more complex a design, some textbooks and some software may be unclear or incorrect about which of the two estimators it is discussing or outputting (Levine & Hullett, 2002). One serious consequence of such confusion would be misleading meta-analyses in which the meta-analysts are unknowingly integrating sets of two different kinds of estimates of POV. If a report of primary research provides a formula for the estimate of POV, readers should examine the denominator to observe if Equation 7.1 or 7.2 is being used.
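The arithmetic of Equation 7.2 can be sketched in Python as follows, again using the values for Factor A from Table 7.4 and, for contrast, the overall estimate from Equation 7.1. The function name `partial_omega_squared` is a hypothetical label for this sketch, and N = 40 is an assumption that follows from the four cells of 10 scores each in Table 7.3.

```python
def partial_omega_squared(ss_effect, df_effect, ms_within, n_total):
    """Estimate partial POV for one effect (Equation 7.2)."""
    numerator = ss_effect - df_effect * ms_within
    denominator = ss_effect + (n_total - df_effect) * ms_within
    return numerator / denominator


# Values from Table 7.4; n_total assumes 4 cells x 10 scores (Table 7.3).
ss_a, df_a = 10.0, 1
ms_within = 29.6 / 36
n_total = 40
ss_total = 10.0 + 0.0 + 0.4 + 29.6

partial_a = partial_omega_squared(ss_a, df_a, ms_within, n_total)
# Overall estimate (Equation 7.1), shown only for comparison.
overall_a = (ss_a - df_a * ms_within) / (ss_total + ms_within)

print(f"partial estimate for A: {partial_a:.3f}")
print(f"overall estimate for A: {overall_a:.3f}")
```

For these data the two estimates for Factor A are close (about .218 and .225) because the other effects contribute very little variability. In designs with stronger peripheral effects the two denominators, and therefore the two estimates, can differ considerably.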
COMPARING VALUES OF $\hat{\omega}^2$
A researcher who wants to interpret the relative values of the two or more estimates of POV (or partial POV) for the various effects in a factorial study should proceed with caution or consider giving up the idea. First, it is not necessarily true that a value of $\hat{\omega}^2_{effect}$ whose corresponding [...]
[...] milligrams of the drug? On the other hand, if the levels of each of the two compared factors represent standard levels of these factors in clinical practice, it might be more justifiable to compare the two estimates of POV or of $\Delta$. Furthermore, because estimates of POV have great sampling variability, even if two strengths of manipulation were comparable it would be difficult to generalize about the difference between two values of POV merely by comparing the two estimates.
Note that the great sampling variability of estimates of POV argues for the use of confidence intervals for POVs. Also, as Olejnik and Algina (2000) pointed out, one should not compare estimates of partial POV for two factors in the same study because, as can be observed in Equation 7.2, the denominator of an estimate of partial POV can have a different value for each factor (different sources of variability). For a similar reason one should not ordinarily compare estimates of POV for the effect of a given factor from two studies that do not use the same peripheral factors and the same levels of these peripheral factors.
Ronis (1981) discussed ways to render manipulations in studies comparable. The Ronis (1981) method for comparing values of estimated POV applies only to factorial designs with two levels per factor. Fowler (1987) provided a method that can be applied to larger designs and is applicable to within-groups as well as between-groups designs, but it is very complicated. Cohen (1973) recommended that one use partial POVs if one wants to compare the estimates of POVs that are obtained by different studies of the same number of levels of the same targeted factor when that factor has been combined with peripheral factors that differ across the studies.
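The point about differing denominators can be made concrete with a short Python sketch that evaluates only the denominator of Equation 7.2 for each effect in Table 7.4, which appears later in this chapter. The variable names are hypothetical, and N = 40 is assumed from the four cells of 10 scores in Table 7.3.

```python
# (SS_effect, df_effect) pairs from the ANOVA output in Table 7.4.
effects = {"A": (10.0, 1), "B": (0.0, 1), "AxB": (0.4, 1)}
ms_within = 29.6 / 36
n_total = 40  # assumed: 4 cells x 10 scores each

# Denominator of Equation 7.2: SS_effect + (N - df_effect) * MS_w.
denominators = {
    name: ss + (n_total - df) * ms_within
    for name, (ss, df) in effects.items()
}

for name, value in denominators.items():
    print(f"denominator of the partial estimate for {name}: {value:.3f}")
```

Because each effect's own SS enters its denominator, the three partial estimates are expressed relative to three different amounts of variability (about 42.07, 32.07, and 32.47 here), which is one reason that comparing them directly is not recommended.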
For further discussions consult Cohen (1973), Keppel (1991), Keren and Lewis (1979), Levine and Hullett (2002), Maxwell and Delaney (2004), Maxwell et al. (1981), Olejnik and Algina (2000), and Susskind and Howland (1980). Different research designs can produce greatly varying estimates of POV, rendering comparisons or meta-analyses of estimates of POV problematic if they do not take the different research features into account. Because such comparisons or meta-analyses across different designs might be misleading, Olejnik and Algina (2003) provided dozens of formulas for estimated generalized eta squared and omega squared that are intended to provide comparable estimators across a great variety of research designs. Similarly, Gillett (2003) provided formulas for rendering standardized-difference estimators of effect size from factorial designs comparable to those from single-factor designs.
RATIOS OF ESTIMATES OF EFFECT SIZE
One should also be very cautious about deciding on the relative importance of two factors by inspecting the ratio of their estimates of effect size. The ratio of values of $\hat{\omega}^2$ for two factors can be very different from the ratio of two estimates of standardized-difference effect size for these two factors. Therefore, these two kinds of estimators can provide different perspectives on the relative effect sizes of the two factors. Maxwell et al. (1981) provided an example in which the ratio of the two $\hat{\omega}^2$ values is approximately 4:1, which might lead some to conclude that one factor is roughly four times more important than the other factor. Such an interpretation would fail to take into account the relative strengths of manipulation of the two factors and the likely great sampling variabilities of the estimates, as were previously discussed. Moreover, in this example those authors found that the ratio of two standardized-difference estimates for those same two factors is not approximately 4:1 but instead 2:1, providing a quantitatively, if not qualitatively, somewhat different perspective on the relative importance of the two factors. We soon turn our attention to effect sizes involving standardized differences in factorial designs.
DESIGNS AND RESULTS FOR THIS CHAPTER
Tables 7.1 through 7.5 illustrate the designs and results that are discussed in the remainder of this chapter.
The meaning of the superscript asterisks in the notation for the column variances in Table 7.3 is explained where these variances are relevant in the later section Manipulated Targeted Factor and Intrinsic Peripheral Factor.
TABLE 7.1
A 2 × 2 Design With Two Extrinsic Factors

                        Factor A
Factor B      Therapy 1          Therapy 2
Drug          Cell 1             Cell 2
No drug       Cell 3             Cell 4
              $\bar{Y}_{1.}$     $\bar{Y}_{2.}$
TABLE 7.2
A 3 × 2 Design With an Extrinsic and an Intrinsic Factor

                           Factor A
Factor B      Treatment 1    Treatment 2    Treatment 3
Female        Cell 1         Cell 2         Cell 3
Male          Cell 4         Cell 5         Cell 6
TABLE 7.3
Hypothetical Results From a 2 × 2 Design With One Extrinsic and One Intrinsic Factor

                                    Factor A
Factor B    Treatment 1                      Treatment 2
Female      1, 1, 1, 1, 2, 2, 2, 2, 3, 4     2, 2, 3, 3, 3, 3, 3, 4, 4, 4     $\bar{Y}_{.1} = 2.5$
            $\bar{Y}_{11} = 1.9$             $\bar{Y}_{21} = 3.1$             $s^2_{.1} = 1.105$
            $s^2_{11} = .989$                $s^2_{21} = .544$
Male        1, 1, 2, 2, 2, 2, 2, 2, 3, 4     1, 2, 2, 3, 3, 3, 3, 4, 4, 4     $\bar{Y}_{.2} = 2.5$
            $\bar{Y}_{12} = 2.1$             $\bar{Y}_{22} = 2.9$             $s^2_{.2} = 1.000$
            $s^2_{12} = .767$                $s^2_{22} = .989$
            $\bar{Y}_{1.} = 2.0$             $\bar{Y}_{2.} = 3.0$
            $s^{*2}_{1.} = .842$             $s^{*2}_{2.} = .737$

TABLE 7.4
ANOVA Output for the Data in Table 7.3

Source       SS        df    MS        F        p
A            10.000     1    10.000    12.16    .002
B             0.000     1     0.000     0.00    1.000
A × B         0.400     1     0.400     0.49    .497
Within       29.600    36     0.822
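
The entries of Table 7.4 can be checked against the raw scores of Table 7.3. The following is only a minimal sketch of the usual sums-of-squares partition for a balanced two-way between-groups design; the function name `balanced_two_way_ss` and the dictionary layout are illustrative assumptions, not part of any ANOVA package.

```python
from statistics import mean

# Raw scores from Table 7.3, keyed by (treatment, gender).
cells = {
    (1, "F"): [1, 1, 1, 1, 2, 2, 2, 2, 3, 4],
    (2, "F"): [2, 2, 3, 3, 3, 3, 3, 4, 4, 4],
    (1, "M"): [1, 1, 2, 2, 2, 2, 2, 2, 3, 4],
    (2, "M"): [1, 2, 2, 3, 3, 3, 3, 4, 4, 4],
}


def balanced_two_way_ss(cells):
    """Sums of squares for a balanced two-way design (hypothetical helper)."""
    a_levels = sorted({a for a, _ in cells})
    b_levels = sorted({b for _, b in cells})
    n = len(cells[(a_levels[0], b_levels[0])])  # common cell size (balance assumed)
    grand = mean([y for ys in cells.values() for y in ys])
    # Marginal means of each factor, collapsing over the other factor.
    a_mean = {a: mean([y for b in b_levels for y in cells[(a, b)]]) for a in a_levels}
    b_mean = {b: mean([y for a in a_levels for y in cells[(a, b)]]) for b in b_levels}
    ss_a = n * len(b_levels) * sum((m - grand) ** 2 for m in a_mean.values())
    ss_b = n * len(a_levels) * sum((m - grand) ** 2 for m in b_mean.values())
    ss_cells = n * sum((mean(ys) - grand) ** 2 for ys in cells.values())
    ss_within = sum((y - mean(ys)) ** 2 for ys in cells.values() for y in ys)
    df_within = sum(len(ys) - 1 for ys in cells.values())
    return {"A": ss_a, "B": ss_b, "AxB": ss_cells - ss_a - ss_b,
            "Within": ss_within, "df_within": df_within}


ss = balanced_two_way_ss(cells)
ms_within = ss["Within"] / ss["df_within"]
f_a = ss["A"] / ms_within      # each effect has 1 df here, so MS = SS
f_ab = ss["AxB"] / ms_within
print(ss, round(ms_within, 3), round(f_a, 2), round(f_ab, 2))
```

Because each effect in this 2 × 2 design has one degree of freedom, each effect's mean square equals its sum of squares, which is why the sketch divides the sums of squares directly by the within-cell mean square.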

TABLE 7.5
A 3 × 2 × 2 Design With One Extrinsic and Two Intrinsic Factors

                                      Factor A
Factor B                 Treatment 1    Treatment 2    Treatment 3
Female    White          Cell 1         Cell 2         Cell 3
          Non-white      Cell 4         Cell 5         Cell 6
Male      White          Cell 7         Cell 8         Cell 9
          Non-white      Cell 10        Cell 11        Cell 12

MANIPULATED FACTORS ONLY
The appropriate procedure for calculating an estimate of a standardized-difference effect size that involves two means at a time in a factorial design depends in part on whether the targeted factor is manipulated or classificatory and whether the peripheral factor is extrinsic or intrinsic. To focus on the main ideas we first and mostly consider the two-way design. Suppose that each factor is a manipulated factor, as in Table 7.1, so that the peripheral factor is extrinsic. Suppose further that we want to compare Psychotherapies 1 and 2 overall, so the numerator of the standardized difference is $\bar{Y}_{1.} - \bar{Y}_{2.}$, where 1 and 2 represent columns 1 and 2 and the dot reminds us that we are considering column 1 (or 2) period, over all rows, not just a part of column 1 or a part of column 2 in combination with any particular row (i.e., not a cell of the table). These two means are thus column marginal means. Therefore, in this example Factor A is the targeted factor and Factor B is the peripheral factor.

As in chapters 3 and 6, in this chapter we use d to denote estimators whose standardizers are based on taking the square root of the variance of one group, and we use g to denote estimators whose standardizers are based on taking the square root of two or more pooled variances (i.e., the pooling-based standardizers $s_p$ and $MS_w^{1/2}$ of chap. 6). (Note that we have placed the subscript for the column factor ahead of the subscript for the row factor, whereas the more common notation in factorial ANOVA places the subscript for a row ahead of the subscript for a column. However, in this chapter we are beginning with the case in which the targeted factor is a column factor and we want to be consistent with the notation used by two of the major sources on effect sizes to which we refer readers. These sources place the subscript for the targeted factor first.)

Recall that the choice of a standardizer by which to divide the difference between means to calculate a d or g from a factorial design depends on one's conception of the population to which one wants to generalize the results. Suppose that one wants to generalize the results to a population that does not naturally vary with respect to the peripheral factor. Such will often (not always) be the case when each factor is a manipulated factor. In this case the peripheral factor would not contribute to variability in the measure of the dependent variable in the population, so one should not let it contribute to the magnitude of the standardizer that is used to calculate the estimate of effect size. Such additional variability in sample data from a peripheral factor that is assumed not to vary in the population would lower the value of the estimate of effect size by inflating the standardizer in its denominator. There are options for choice of standardizer in this case. (Note that in the case of clinical problems for which the psychotherapy and the drug therapy at hand are sometimes combined in practice, Table 7.1 and this example may not provide an example of a peripheral factor that does not vary in the population of interest to the researcher. If so, the discussion and methods in this section would not apply.)

First, suppose that both factors in a two-way design have a combined control group (e.g., cell 3 in Table 7.1 if Therapy 1 were actually No Therapy) and that homoscedasticity of the variances across the margins of the peripheral factor is not assumed. By homoscedasticity across the margins of the peripheral factor we mean equality of the variances of the populations that are represented by each level of the peripheral factor over all of the levels of the targeted factor. In Table 7.1 the example of such homoscedasticity would be equality of variances of a population that receives the drug (represented by the combined participants in cells 1 and 2) and a population that does not receive the drug (represented by the combined participants in cells 3 and 4); that is, homoscedasticity of the population row margin variances in the present case. (Although the factors in Table 7.3 do not represent the current example of estimation of effect size, $s^2_{.1}$ and $s^2_{.2}$ in that table exemplify row marginal sample variances that estimate the population variances to whose homoscedasticity we are now referring.) In this section we do not interrupt the development of the discussion by demonstrating estimation of effect size when such peripheral-factor marginal homoscedasticity is assumed because the method is the same as the one that we demonstrate later using Equation 7.20 in the section Within-Groups Factorial Designs.

When such peripheral-factor marginal homoscedasticity is not assumed one may want to use the standard deviation of the group that is a control group with respect to both factors as the standardizer, $s_c$. For example, in Table 7.1 if Therapy 1 were in fact No Therapy (control or placebo), one may want to use the standard deviation of cell 3 (a No-Therapy No-Drug cell in this case) as the standardizer. This method would also be applicable if, instead of a control-group cell, the design included a cell that represented a standard (in practice) combination of a level of Factor A and a level of Factor B, a standard-treatment comparison group. In either of these cases we label the estimate of effect size $d_{comp}$ (comp for comparison group) and use

$$ d_{comp} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{s_c} \tag{7.3} $$

If instead one assumes homoscedasticity for all of the populations that are represented by all of the cells in the design, an option for a standardizer in this case would be to use the pooled standard deviation, $MS_w^{1/2}$, of all of the groups, resulting in the estimator

$$ g_{msw} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{MS_w^{1/2}} \tag{7.4} $$
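
As a rough illustration of Equations 7.3 and 7.4, the sketch below computes both estimators for a 2 × 2 layout like that of Table 7.1. The scores are invented solely for this illustration (they are not data from any study or from this chapter), Therapy 1 is assumed to be a No-Therapy control so that cell 3 is the comparison cell, and the function names `d_comp` and `g_msw` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

# Invented scores for illustration only, arranged as in Table 7.1.
cell_1 = [3, 4, 5, 4, 4]  # Therapy 1 (assumed No Therapy), Drug
cell_2 = [5, 6, 7, 6, 6]  # Therapy 2, Drug
cell_3 = [1, 3, 5, 3, 3]  # Therapy 1 (assumed No Therapy), No drug: control cell
cell_4 = [4, 5, 6, 5, 5]  # Therapy 2, No drug


def d_comp(column_1, column_2, comparison_cell):
    """Equation 7.3: marginal mean difference over the SD of one comparison cell."""
    return (mean(column_1) - mean(column_2)) / sqrt(variance(comparison_cell))


def g_msw(column_1, column_2, all_cells):
    """Equation 7.4: marginal mean difference over the square root of MS_w."""
    ss_within = sum((len(c) - 1) * variance(c) for c in all_cells)
    df_within = sum(len(c) - 1 for c in all_cells)
    return (mean(column_1) - mean(column_2)) / sqrt(ss_within / df_within)


therapy_1 = cell_1 + cell_3  # column 1 collapsed over the peripheral factor
therapy_2 = cell_2 + cell_4  # column 2 collapsed over the peripheral factor

d = d_comp(therapy_1, therapy_2, cell_3)
g = g_msw(therapy_1, therapy_2, [cell_1, cell_2, cell_3, cell_4])
print(round(d, 3), round(g, 3))
```

In these invented scores the control cell is more variable than the other cells, so the single-cell standardizer is larger than the pooled one and the d estimate is smaller in magnitude than the g estimate; with real data the ordering could go either way.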

Note that the method of Equation 7.4 does not deflate g by inflating the standardizer because $MS_w$ is based on pooling the within-cell SS values, and within each cell no factor in the design is varying, including the peripheral factor. Therefore, the peripheral factor is not contributing to the magnitude of the standardizer, just as we are assuming in this section that it does not contribute to variability in the population of interest. Equations 7.3 and 7.4, with appropriately changed subscripts, can also be used to compare the means of two levels of the manipulated factor that had previously been designated as a peripheral factor, but thereafter becomes a newly targeted factor, using the same reasoning as before. In this case the previously targeted factor now becomes the peripheral factor. Worked examples using Equations 7.3 and 7.4 are presented later in the section Within-Groups Factorial Designs, where their application is also appropriate.

MANIPULATED TARGETED FACTOR AND INTRINSIC PERIPHERAL FACTOR

Suppose now the case of Table 7.2, in which there is a manipulated factor and an intrinsic classificatory factor, and that one wants to calculate an estimate of effect size for two levels of the manipulated factor. Unlike the previous case of Table 7.1, in this case the intrinsic factor, Gender, does vary in the population so one might now want to let the part of variability in the measure of the dependent variable that is attributable to the intrinsic factor also contribute to the standardizer. First, in this case there is an option for choice of a standardizer if there is a control condition (or standard-treatment comparison condition), say, Treatment 1 in Table 7.2 if Treatment 1 were actually No Treatment. In this case one might want to use for the standardizer the overall s of the control groups across the levels of the peripheral factor (overall s of cells 1 and 4 combined). In our example in which Treatment 1 of Table 7.2 is the control condition, this standardizer would be $s_{1.}$, the marginal s of column 1. Using this method one is collapsing (combining) the levels of the peripheral factor so that for the moment the design is equivalent to a one-way design in which the targeted treatment factor is the only factor. This method does not assume homoscedasticity because it does not pool variances (i.e., cells 1 and 4 are being considered to represent one group). Also, because the standardizer is based on a now-combined group of women and men, it reflects any gender-based variability of the measure of the dependent variable in the population. This method yields as an estimate of effect size

$$ d_{comp} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{s_{1.}} \tag{7.5} $$
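
The collapsing that Equation 7.5 describes can be sketched with the raw scores of Table 7.3, treating Treatment 1 as the control column. This is only an illustration under that assumption; the variable names are ours.

```python
from math import sqrt
from statistics import mean, variance

# Raw scores from Table 7.3; Treatment 1 is assumed to be the control level.
control_female = [1, 1, 1, 1, 2, 2, 2, 2, 3, 4]
control_male = [1, 1, 2, 2, 2, 2, 2, 2, 3, 4]
treated_female = [2, 2, 3, 3, 3, 3, 3, 4, 4, 4]
treated_male = [1, 2, 2, 3, 3, 3, 3, 4, 4, 4]

# Collapse over Gender: each column is treated as one combined group.
control = control_female + control_male
treated = treated_female + treated_male

s_control = sqrt(variance(control))  # marginal s of the control column
d = (mean(control) - mean(treated)) / s_control
print(round(s_control, 3), round(d, 2))
```

Because the standardizer is the standard deviation of the combined control column rather than a pooled within-cell value, any difference between the female and male control means is part of it.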

When there are more than two levels of the targeted manipulated factor, as is the case in Table 7.2, the numerical subscripts in Equation 7.5 vary depending on which column represents the control (or standard-treatment) condition and which column contains the groups (level) with which it is being compared.

If the targeted factor is represented by the rows instead of the columns of a table, the dots in Equation 7.5 precede the numerical subscripts (e.g., $s_{1.}$ becomes $s_{.1}$) and row replaces column in the previous discussion. A definitional equation, Equation 7.15, for $s_{.1}$ (the row case) is provided later in the section Classificatory Factors Only using notation that is not yet needed. Computational formulas and some worked examples are also provided there.

Regardless of whether there is a control or standard-treatment level, there is an alternative, more complex standardizer for the present design and purpose. This standardizer, which assumes homoscedasticity of all of the populations that are involved in the estimate, was introduced by Nouri and Greenberg (1995) and also presented by Cortina and Nouri (2000).
The method involves a special kind of pooling from cells that is consistent with the goal of this section to let variability in the measure of the dependent variable that is attributable to the peripheral factor contribute to the magnitude of the standardizer. One first calculates, separately for each of the two variances that are later going to be entered into a modified version of the formula for pooling,

$$ s^{*2}_{t.} = \frac{\sum_p (n_{tp} - 1)\, s^2_{tp} + \sum_p n_{tp} \left( \bar{Y}_{tp} - \bar{Y}_{t.} \right)^2}{n_{t.} - 1} \tag{7.6} $$

where t stands for targeted, p stands for peripheral, tp stands for a cell at the tth level of the targeted factor and the pth level of the peripheral factor, and t. stands for a level of the targeted factor over the levels of the peripheral factor, which is at a margin of a table (e.g., the margin of column 1 or column 2 in Table 7.2). The asterisk indicates a special kind of variance that has had variability that is attributable to the peripheral factor "added back" to it. The summation in Equation 7.6 is undertaken over the levels of the peripheral factor, there being two such levels in the case of Table 7.2. Observe that Equation 7.6 begins before the plus sign as if it were going to be the usual formula for pooling variances, but the expression in the numerator after the plus sign adds the now appropriate portion of variability that is attributable to the peripheral factor. Therefore, we denote the resulting standardizer that is presented in Equation 7.7 by $s_{msw+}$.
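
Equation 7.6 can be sketched as a small function of cell summary statistics. The form used here (within-cell sums of squares plus the "added back" between-cell term, divided by the marginal n minus 1) is the one implied by the worked arithmetic later in this chapter; the function name `added_back_variance` is an illustrative assumption. The sketch also compares the result with the ordinary sample variance of the same scores combined into one group, using column 1 of Table 7.3.

```python
from statistics import mean, variance


def added_back_variance(cell_ns, cell_means, cell_vars):
    """Equation 7.6 for one level of the targeted factor (hypothetical helper)."""
    n_total = sum(cell_ns)
    marginal_mean = sum(n * m for n, m in zip(cell_ns, cell_means)) / n_total
    # Part before the plus sign: the usual numerator for pooling variances.
    within = sum((n - 1) * v for n, v in zip(cell_ns, cell_vars))
    # Part after the plus sign: variability attributable to the peripheral factor.
    between = sum(n * (m - marginal_mean) ** 2 for n, m in zip(cell_ns, cell_means))
    return (within + between) / (n_total - 1)


# Column 1 (Treatment 1) of Table 7.3: the female and male cells.
female = [1, 1, 1, 1, 2, 2, 2, 2, 3, 4]
male = [1, 1, 2, 2, 2, 2, 2, 2, 3, 4]
column_cells = [female, male]

from_summaries = added_back_variance(
    [len(c) for c in column_cells],
    [mean(c) for c in column_cells],
    [variance(c) for c in column_cells],
)
from_raw_scores = variance(female + male)  # variance of the combined group
print(round(from_summaries, 3), round(from_raw_scores, 3))
```

The summary-statistic form is convenient when only cell sizes, means, and variances are reported, whereas the raw-score form requires the original data.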

Equation 7.6 yields the overall variance of all participants who were subjected to a level within the targeted factor that is of interest to the researcher for the purpose of estimating an effect size that involves that level. This variance is the variance of all such participants as if they were combined into just one larger group at that level of the targeted factor, ignoring the subgroupings that are based on the peripheral factor. This variance serves the purpose of being comparable to the variance that would be obtained if that level of the targeted factor had been studied in a one-way design. The estimate of effect size that will result from this approach will thus be comparable to an estimate that would arise from such a one-way design.

Again, Equation 7.6 is calculated twice, once each for the two compared levels of the targeted factor, to find the two special variances (in this example, $s^{*2}_{1.}$ and $s^{*2}_{2.}$) to enter into the pooling formula, Equation 7.7, for the standardizer,

$$ s_{msw+} = \left[ \frac{(n_{1.} - 1)\, s^{*2}_{1.} + (n_{2.} - 1)\, s^{*2}_{2.}}{n_{1.} + n_{2.} - 2} \right]^{1/2} \tag{7.7} $$

The resulting estimator is then given by

$$ g_{msw+} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{s_{msw+}} \tag{7.8} $$

As previously stated, the numerical subscripts and the sequence of the numerical and dot parts of a subscript depend, respectively, on (a) which two levels of a multi-leveled targeted factor are being compared (e.g., columns 1 and 2, 1 and 3, or 2 and 3), and (b) whether the targeted factor is represented by the rows or columns of the table. Refer to Olejnik and Algina (2000) for another approach for this case. Having developed the reasoning behind various approaches and equations we now turn to worked examples.

ILLUSTRATIVE WORKED EXAMPLES

Table 7.3 depicts hypothetical data in a simplified version of Table 7.2 in which there are now only two levels of the targeted manipulated factor that is represented by the columns. The cells' raw scores are degrees of respondents' endorsement of an attitudinal statement with respect to a 4-point rating scale ranging from strongly disagree to strongly agree. The treatments represent alternative wording for the attitudinal statement. It is supposed that 20 women and 20 men were randomly assigned, 10 each to each treatment. The table includes cell and marginal values of $\bar{Y}$ and $s^2$. Because the contrived data are presented only for the purpose of illustrating calculations, we assume homoscedasticity as needed.

Before we begin estimating effect sizes some comments are in order about the example at hand. First, although rating-scale items are typically used in combination with other such items on the same topic to form summated rating scales, our example that uses just one rating-scale item is nonetheless relevant because a rating-scale item is sometimes used alone to address a specific question (Penfield, 2003). Second, although some (e.g., Cliff, 1993) have recommended using ordinal methods (such as what we call the PS in our chaps. 5 and 9) to analyze data from rating scales, many researchers still use parametric methods involving means for the data from such scales, and parametric methods are still being developed for them (Penfield, 2003). However, violation of the assumption of normality in the case of data from rating scales may be especially problematic the fewer the number of categories, the smaller the sample size, and the more extreme the mean rating in the population (Penfield, 2003). Nonetheless, we use hypothetical data from a rating scale here because they provide a simple example for calculations.

Suppose first that Treatment 1 in Table 7.3 is a control (or standard-treatment) level and, as was discussed in the previous section, one wants to estimate an effect size from a standardized $\bar{Y}_{1.} - \bar{Y}_{2.}$ that would be comparable to an estimate that would arise from a design in which treatment were the only factor. In this case one can standardize using the overall s of the control group in Table 7.3, $(s^{*2}_{1.})^{1/2} = (.842)^{1/2} = .918$.
The overall s of a column is the s of a group consisting of all of the participants in all of the cells of that column (with column data treated as if it were one set of data). As should be clear from our previous explanation of the variance that Equation 7.6 yields, this overall s is not the square root of the mean of the variances of cells 1 and 3 in the current example that involves Table 7.3. That is, this overall s is not $[(.989 + .767)/2]^{1/2}$. Applying Equation 7.3, which is applicable in this case, we find that $d_{comp} = (2.0 - 3.0)/.918 = -1.09$. Therefore, we estimate that, with respect to the control population's distribution and $\sigma$, the mean of the control population is 1.09 standard deviations below the mean of the population that receives Treatment 2.

Next, regardless of the existence of a control level, if we assume homoscedasticity of all of the involved populations we can use the special pooling method of Equations 7.6 and 7.7 as the first steps toward standardizing the difference between $\bar{Y}_{1.}$ and $\bar{Y}_{2.}$ in Table 7.3 with the method of Equation 7.8. First we apply the results of column 1 to Equation 7.6 to find

$$ s^{*2}_{1.} = \frac{(10 - 1)(.989) + (10 - 1)(.767) + 10(1.9 - 2.0)^2 + 10(2.1 - 2.0)^2}{20 - 1} = .842. $$

We then apply the results of column 2 to Equation 7.6 to find

$$ s^{*2}_{2.} = \frac{(10 - 1)(.544) + (10 - 1)(.989) + 10(3.1 - 3.0)^2 + 10(2.9 - 3.0)^2}{20 - 1} = .737. $$

Applying the two preceding results to Equation 7.7 we find the standardizer

$$ s_{msw+} = \left[ \frac{(20 - 1)(.842) + (20 - 1)(.737)}{20 + 20 - 2} \right]^{1/2} = .889. $$
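
The arithmetic of Equation 7.7, and the standardized difference of Equation 7.8 that it leads to, can be checked from the column summary values of Table 7.3. This is a minimal sketch; the function names `s_msw_plus` and `g_msw_plus` are hypothetical, and the inputs are the rounded values .842 and .737.

```python
from math import sqrt


def s_msw_plus(n1, var1, n2, var2):
    """Equation 7.7: pool the two Equation 7.6 variances and take the square root."""
    return sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))


def g_msw_plus(mean1, mean2, n1, var1, n2, var2):
    """Equation 7.8: marginal mean difference over the Equation 7.7 standardizer."""
    return (mean1 - mean2) / s_msw_plus(n1, var1, n2, var2)


# Column margins of Table 7.3: n = 20 per column, s*^2 = .842 and .737.
standardizer = s_msw_plus(20, 0.842, 20, 0.737)
g = g_msw_plus(2.0, 3.0, 20, 0.842, 20, 0.737)
print(round(standardizer, 3), round(g, 3))
```

Carrying the unrounded standardizer through the division gives a ratio of about -1.125, whereas dividing by the rounded value .889 gives -1.12; the difference is a matter of rounding only.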
Applying the the difference the two Applying difference between between the two targeted targeted column column means, means, 2.0 and and the the standardizer, .889, to to Equation we find find that that 2.0 and 3.0, 3 .0, and standardizer, .889, Equation 77.8 . 8 we = (2.0 3.0) / .889 = -1.12. We, therefore, estimate that the = (2.0 1 . 1 2 . We, therefore, estimate that the / . 8 89 = 3 .0) gmsw+ msw+ mean of the population population that that receives receives Treatment Treatment 11 is is 11.12 standard demean of the . 1 2 standard de viations below below the the mean mean of of the the population population that that receives receives 1reatment Treatment 2, viations 2, where the the standard standard deviation assumed to to be be aa value value common common to to the the where deviation is is assumed involved populations. populations. Observe Observe that that the the result, gmsw+ = -1 -1.12, is close to involved result, g . 1 2, is close to msw + = the previous result, dcom dcompp = = -1 -1.09. similarity of of results results is is attribut attribut.09. Such Such similarity the previous result, able to to the the fact fact that that the the sample sample variances variances in in the the cells cells happen happen not not to to be be as able as different in in the the case case of of the the contrived contrived data data of of Table Table 7.3 7.3 as as they they might might different well be in the the case of real real data. data. well be in case of The output output from from any any ANaYA ANOVA software software (like (like that that presented presented in in Table Table The 7.4) provides provides needed needed information information to to proceed proceed with with some some additional additional in in7.4) terpretation and and estimation estimation of of effect effect sizes sizes for for the the data data in in Table Table 7.3. terpretation 7.3. 
Output did not provide the total SS directly, so we find from Table 7.4 that $SS_{tot} = SS_A + SS_B + SS_{AB} + SS_w = 10 + 0 + .400 + 29.600 = 40.000$. Observe in Table 7.3 that the marginal means of the Female and Male rows happen to be equal (both 2.5), so obviously d or g = 0 in such a case regardless of which standardizer is used. For the targeted Treatment factor, A, observe in Table 7.4 that F = 12.16 and p = .002, so we have evidence of a statistically significant difference between the marginal means (a main effect) of Treatments 1 and 2.

Before we estimate a POV and a partial POV for Treatment Factor A from the results in Table 7.4 the reader is encouraged to reflect on the extent of difference one might expect between these two estimates in this case of hypothetical data in which $SS_B = 0$ and $MS_{AB}$ is unusually small. Now applying the ANOVA results to Equation 7.1 we find that $\omega^2 = [10 - 1(.822)] / (40 + .822) = .22$. Applying the output results to Equation 7.2 we find that $\omega^2_{partial} = [10 - 1(.822)] / [10 + (40 - 1).822] = .22$. Recalling the discussion of the difference between $\omega^2$ and $\omega^2_{partial}$ early in this chapter one should expect the two estimates to be very similar in the case of the hypothetical data of Table 7.3 because Factor B happened to contribute no variability to these data (output $SS_B = 0$). Interaction (statistically insignificant in this example) contributed just enough variability to the data (output $SS_{AB} = .400$) to cause a very slight difference in the magnitudes of the two kinds of estimates of a POV, but rounding to two decimal places renders $\omega^2$ equal to $\omega^2_{partial}$ in this example. We conclude, subject to the previously discussed limitations of measures of POV, that the Treatment factor is estimated to account for 22% of the variance in the scores under the specific research conditions.
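The two POV estimates can be reproduced from the sums of squares that were just quoted. The following Python sketch mirrors the arithmetic shown above for Equations 7.1 and 7.2; the function names are hypothetical, and the within-cells degrees of freedom (36) is our assumption, taken as the total sample size of 40 minus the four cells of the 2 x 2 design.

```python
def omega_squared(ss_effect, df_effect, ms_within, ss_total):
    """Estimate of POV for an effect, following the arithmetic shown for Equation 7.1."""
    return (ss_effect - df_effect * ms_within) / (ss_total + ms_within)

def partial_omega_squared(ss_effect, df_effect, ms_within, n_total):
    """Partial POV estimate, following the arithmetic shown for Equation 7.2."""
    return (ss_effect - df_effect * ms_within) / (
        ss_effect + (n_total - df_effect) * ms_within
    )

# Sums of squares quoted from Table 7.4.
ss_a, ss_b, ss_ab, ss_w = 10.0, 0.0, 0.400, 29.600
ss_total = ss_a + ss_b + ss_ab + ss_w

n_total = 40
df_a = 1     # two treatment levels
df_w = 36    # assumption: N minus the number of cells (40 - 4)

ms_w = ss_w / df_w
f_ratio = (ss_a / df_a) / ms_w

w2 = omega_squared(ss_a, df_a, ms_w, ss_total)
w2_partial = partial_omega_squared(ss_a, df_a, ms_w, n_total)

print(round(ms_w, 3), round(f_ratio, 2), round(w2, 3), round(w2_partial, 3))
```

Printing the estimates to three decimal places makes visible the very slight difference between them that disappears when they are rounded to two places.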
COMPARISONS OF LEVELS OF A MANIPULATED FACTOR AT ONE LEVEL OF A PERIPHERAL FACTOR
Suppose now that one wants a standardized comparison of two levels of a manipulated factor at one level of a peripheral factor at a time. For example, with regard to Table 7.2, suppose that one wants to compare Treatment 1 and Treatment 2 separately only for women or only for men. Thus, one would be interested in an estimate of effect size involving two values of $\bar{Y}_{tp}$ (i.e., two cells, such as cells 1 and 2) where, again, t stands for a level of the targeted factor and p stands for a level of the peripheral factor. Such separate comparisons are especially appropriate if there is an interaction between the targeted manipulated and peripheral factors. (Again, there may really be an interaction regardless of the result of a possibly low-powered F test for interaction.)

If, say, one wants to standardize the difference between the means of cells 1 and 2 in Table 7.2 ($\bar{Y}_{11}$ and $\bar{Y}_{21}$, respectively) and Treatment 1 is a control level or standard-treatment comparison level, then one can standardize the mean difference using the standard deviation of cell 1, $s_{cell}$, if one is not assuming homoscedasticity. In this case the estimator is
$$d_{level} = \frac{\bar{Y}_{11} - \bar{Y}_{21}}{s_{cell}}. \tag{7.9}$$

Of course, the subscripts for the two values of $\bar{Y}$ in Equation 7.9 change depending on which two levels of the targeted manipulated factor are involved in the comparison and at which level of the peripheral factor the comparison takes place. (Recall that in the Manipulated Factors Only section we explained why we adopted notation in which the subscript for a column precedes the subscript for a row.)

For a numerical example for this case suppose that one wants to make a standardized comparison of the means of Treatments 1 and 2 in Table 7.3 separately for men and women, and suppose further that Treatment 1 is a control level. We demonstrate the method by applying Equation 7.9 to the results in the Male row of Table 7.3. In this case the numerator of Equation 7.9 becomes $(\bar{Y}_{12} - \bar{Y}_{22})$, and $d_{level} = (2.1 - 2.9) / (.767)^{1/2} = -.91$. (Table 7.3 shows that the sample variance in the cell for Male-Treatment 1 is .767.) Therefore, with respect to the Male-Control population's distribution and $\sigma$, it is estimated that the mean of the Male-Treatment 2 population is approximately .91 of a standard deviation above the mean of the Male-Control population. If these are the kinds of populations that the researcher is seeking to address, then the method of Equation 7.9 is an appropriate one. Again, the method of estimation that is chosen must be consistent with the kind of effect-size parameter that the researcher wants to estimate and the assumptions (e.g., homoscedasticity) that are made about the involved populations.

There are alternatives to the aforementioned procedure. If one assumes homoscedasticity with regard to the two populations whose sample (cell) means are being compared, one can calculate the standardizer by pooling the two involved values of $s^2_{cell}$. For an example involving cells 1 and 2 of Table 7.3 the standardizer, $s_{pcells}$, is given now by a version of the general Equation 6.14 in chapter 6 for pooling two variances,

$$s_{pcells} = \left[\frac{(n_{11} - 1)s^2_{11} + (n_{21} - 1)s^2_{21}}{n_{11} + n_{21} - 2}\right]^{1/2}. \tag{7.10}$$

Equation 7.10 results in the estimator

$$g_{level} = \frac{\bar{Y}_{11} - \bar{Y}_{21}}{s_{pcells}}. \tag{7.11}$$
Now applying the results in the Female row of Table 7.3 to Equation 7.10 we find

$$s_{pcells} = \left[\frac{(10 - 1).989 + (10 - 1).544}{10 + 10 - 2}\right]^{1/2} = .875.$$
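As a numerical check on the cell-level standardizers, the following Python sketch evaluates Equation 7.9 for the Male row and Equations 7.10 and 7.11 for the Female row. It is only an illustration under our own naming (all variable names are hypothetical); the cell means, cell variances, and cell size of 10 are the values quoted from Table 7.3 in the surrounding text.

```python
import math

n = 10  # scores per cell in Table 7.3

# Female row: Treatment 1 and Treatment 2 cells.
mean_f_t1, mean_f_t2 = 1.9, 3.1
var_f_t1, var_f_t2 = 0.989, 0.544

# Male row: Treatment 1 (control) and Treatment 2 cells.
mean_m_t1, mean_m_t2 = 2.1, 2.9
var_m_t1 = 0.767

# Equation 7.9: the control cell's standard deviation is the standardizer,
# so homoscedasticity is not assumed.
d_level_male = (mean_m_t1 - mean_m_t2) / math.sqrt(var_m_t1)

# Equation 7.10: pool the two cell variances (homoscedasticity assumed).
s_pcells = math.sqrt(
    ((n - 1) * var_f_t1 + (n - 1) * var_f_t2) / (n + n - 2)
)

# Equation 7.11: standardize the Female-row difference by the pooled value.
g_level_female = (mean_f_t1 - mean_f_t2) / s_pcells

print(round(d_level_male, 2), round(s_pcells, 3), round(g_level_female, 2))
```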
Therefore, Equation 7.11 yields $g_{level} = (1.9 - 3.1) / .875 = -1.37$. We estimate that, with regard to the Female-Treatment 2 population's distribution and $\sigma$, which is assumed to be the same as the Female-Treatment 1 population's $\sigma$, the mean of the latter population is 1.37$\sigma$ units below the mean of the former population.

If there is only one manipulated factor, as is the case in Table 7.3, and if one assumes homoscedasticity with regard to all of the populations that are represented by the cells in the design, one can use $MS_w^{1/2}$ as the standardizer for our purpose. The resulting estimator when comparing $\bar{Y}_{11} - \bar{Y}_{21}$ is then given by

$$g_{levelmsw} = \frac{\bar{Y}_{11} - \bar{Y}_{21}}{MS_w^{1/2}}. \tag{7.12}$$

Using $MS_w$ from the ANOVA output that was reported in Table 7.4 of the previous section, applying the results in the Female row of Table 7.3 to Equation 7.12 yields $g_{levelmsw} = (1.9 - 3.1) / (.822)^{1/2} = -1.32$. We estimate that the mean of the Female-Treatment 1 population is 1.32$\sigma$ units lower than the mean of the Female-Treatment 2 population, where $\sigma$ is assumed to be common for all of the populations that are represented in the design.

Note that under homoscedasticity of all represented populations $MS_w^{1/2}$ provides a better estimate of the common $\sigma$ within all of these populations than does $s_{pcells}$, resulting in a g that is a better estimator of $g_{pop}$. However, there is greater risk that the assumption of homoscedasticity is wrong, or more seriously wrong, when one assumes that four or more populations that involve combined levels of manipulated and classificatory factors are homoscedastic (as could be the case in Tables 7.2 or 7.5) than when one assumes that two populations at the same level of a classificatory factor are homoscedastic. Note also that the method of Equation 7.12 is not applicable when there is more than one manipulated factor. Olejnik and Algina (2000) provided discussion of this somewhat more complicated case.

TARGETED CLASSIFICATORY FACTOR AND EXTRINSIC PERIPHERAL FACTOR

Suppose now that one wants to standardize a comparison between two levels of an intrinsic factor (a classificatory factor here) when there are one or more extrinsic peripheral factors and there are no or any number of additional intrinsic factors. When gender is the targeted classificatory factor, Tables 7.2, 7.3, and 7.5 illustrate the simplest of such designs.
We will consider the cases that are represented by Tables 7.2 and 7.3, in which the numerator of the estimator is $(\bar{Y}_{1.} - \bar{Y}_{2.})$, which is the difference between the marginal means of the Female row and Male row in our example. The difference between these two means is 0 in the case of Table 7.3, so we focus on the calculation of an appropriate standardizer.

Suppose further that one wants to examine the mean for one gender in relation to the mean and distribution of scores of the other gender. For example, suppose that one wants to calculate by how many standard deviation units the marginal sample mean of the males ($\bar{Y}_{2.}$) is below or above the marginal sample mean of the females ($\bar{Y}_{1.}$), where the standard deviation unit is the s for the distribution of scores for the females. If the peripheral factor (treatment in this case) does not ordinarily vary in the population it is an extrinsic factor. In this case one would not want to use for the standardizer the square root of the variance of the row for the females, $s^2_{1.}$ in Tables 7.2 or 7.3, which would reflect variability that is attributable to treatment. Instead one can standardize using the square root of the variance obtained from pooling the variances of all of the cells for the females (cells 1, 2, and 3 in Table 7.2 or cells 1 and 2 in Table 7.3) to find $s_{pcells}$, where again p stands for pooled. Within these cells treatment does not vary so variance within a cell is not influenced by variation in the manipulated peripheral factor. One can use the following version of the pooling formula to pool the cell variances.

$$s_{pcells} = \left[\frac{\sum (n_{cell} - 1)s^2_{cell}}{\sum (n_{cell} - 1)}\right]^{1/2} \tag{7.13}$$

The summation in Equation 7.13 is conducted over the levels of the peripheral factor. If an example involves a table such as Table 7.2, the summation would be over cells 1, 2, and 3. The resulting estimator is

$$g_{class} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{s_{pcells}}, \tag{7.14}$$

where again class stands for classificatory. This method assumes homoscedasticity of the populations whose samples' cell variances are being pooled.

For simplicity we again use the data of Table 7.3 for the case and purpose that we have been discussing, and we suppose that one wants to standardize the difference between the marginal means of the rows for the females and males in such a table. In such a case of a 2 x 2 table, Equation 7.13 reduces to Equation 7.10 and yields, just as we found when Equation 7.10 was applied to the data of Table 7.3,

$$s_{pcells} = \left[\frac{(10 - 1).989 + (10 - 1).544}{(10 - 1) + (10 - 1)}\right]^{1/2} = .875.$$
In this case of a two-way design, if one assumes homoscedasticity of all of the populations that are represented by the cells in the table (as was previously discussed, this is a riskier assumption than the previous one) one can use $MS_w^{1/2}$ as the standardizer to find an estimator for a comparison of the marginal means of two levels of a classificatory factor. We would label such an estimator $g_{classmsw}$. If there are one or more classificatory factors in addition to the targeted classificatory factor, as in Table 7.5, refer to Olejnik and Algina (2000) for a modification of the standardizer.
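The pooling of Equation 7.13 extends naturally to any number of cells, which a short function makes explicit. The following Python sketch is only illustrative; the function name `pooled_cell_sd` and the variable names are hypothetical. The Female-row cell variances, the marginal row means of 2.5, and $SS_w$ = 29.600 are the values quoted in the text, whereas the within-cells degrees of freedom of 36 is our assumption (total N of 40 minus four cells).

```python
import math

def pooled_cell_sd(cells):
    """Pool (n, variance) pairs over the cells at one level of the
    targeted factor, in the manner of Equation 7.13."""
    numerator = sum((n - 1) * variance for n, variance in cells)
    denominator = sum(n - 1 for n, _ in cells)
    return math.sqrt(numerator / denominator)

# Female row of Table 7.3: two cells of 10 scores each.
female_cells = [(10, 0.989), (10, 0.544)]
s_pcells = pooled_cell_sd(female_cells)

# Equation 7.14 with the marginal row means (both 2.5 in Table 7.3).
g_class_value = (2.5 - 2.5) / s_pcells

# Alternative standardizer under homoscedasticity of all cells.
ss_w, df_w = 29.600, 36  # df_w is an assumption: 40 - 4
root_ms_w = math.sqrt(ss_w / df_w)

# Equation 7.12 applied to the Female row, using the same standardizer.
g_level_msw = (1.9 - 3.1) / root_ms_w

print(round(s_pcells, 3), g_class_value, round(root_ms_w, 3), round(g_level_msw, 2))
```

With a table such as Table 7.2 one would simply pass three (n, variance) pairs, one for each cell in the targeted row.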
CLASSIFICATORY FACTORS ONLY

Suppose now that the column factor in Table 7.2 were not treatment (manipulated) but ethnicity, so that the design now consisted only of classificatory factors, gender and ethnicity. Suppose also that one wants to standardize the overall mean difference between females and males (gender targeted, ethnicity peripheral, for the moment), that is, the difference between the means of the rows for females and males, $\bar{Y}_{1.} - \bar{Y}_{2.}$, in the now revised Table 7.2. Again, there are alternative standardizers for this purpose.
Consider first the case in which one wants to calculate by how many standard deviation units the marginal mean for the males ($\bar{Y}_{2.}$) is below or above the females' marginal mean ($\bar{Y}_{1.}$), with regard to the overall distribution of the females' scores. In this case, unlike the case in the previous section, in many instances the peripheral factor, ethnicity, does naturally vary in the population that is of interest (an intrinsic factor). Therefore, one should now want the standardizer to reflect variability that is attributable to ethnicity. Thus, for our purpose the overall s of all of the females' scores can be used for the standardizer. In the case of the modified version of Table 7.2, this standardizer is the square root of the marginal variance of row 1, $(s^2_{1.})^{1/2}$. This standardizer is defined by (but not yet conveniently calculated by) an equation that is based on deviation scores,

$$s_{t.} = \left[\frac{\sum (Y_{itp} - \bar{Y}_{t.})^2}{n_{t.} - 1}\right]^{1/2}, \tag{7.15}$$

where, in this example, $Y_{itp}$ is an ith raw score in a cell of the row (or column in other examples) whose marginal s is to be the standardizer, the summation is over all such raw scores in this row, t is the level of the targeted factor on which the standardizer is based (female level here), and p is a level of the peripheral factor; p = 1, 2, and 3 in Table 7.2. The resulting estimator is then

$$d_{class} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{s_{1.}}. \tag{7.16}$$
The method that underlies Equation 7.16 does not assume homoscedasticity with regard to the two populations that are being compared. However, because the subpopulations (i.e., the ethnic subpopulations in this example) may have unequal variances, a more accurate estimation of the overall population's standard deviation that the standardizer is estimating may be had if the proportions of the participants in each subsample correspond to their proportions in the overall population. For example, if ethnic Subpopulation a constitutes, say, 13% of the population, then ideally 13% of the participants should be from Ethnic Group a. If the subpopulations also differ in their means (often the case when variances differ), then choosing subsample sizes to match the proportions in the subpopulations will also make the mean of each of the two targeted levels that are being compared (e.g., male and female) a more accurate estimate of the mean of its population. A subsample should not have more or less influence on the standard deviation or mean of the overall sample than it has in the population. Thus, appropriate sampling will improve the numerator and denominator of Equation 7.16 as estimators.

An easy way to calculate the standardizer that is defined by Equation 7.15 for this case would be to use any statistical software to create a data file consisting of all of the $n_{t.}$ (i.e., $n_{1.}$ in our example) raw scores as if all of the scores in the row that produces the standardizer constituted a single group. One would then compute the s for this group of scores. This $s_{t.}$ ($s_{1.}$ in our example) should derive from the square root of the unbiased $s^2$ (i.e., using n - 1, not n, in the denominator). This $s_{t.}$ can also be calculated from another formula for s; in the present case $s_{1.} = \left[\left(\sum Y_i^2 - n_{1.}\bar{Y}_{1.}^2\right) / (n_{1.} - 1)\right]^{1/2}$.

For simplicity we use the data of Table 7.3 to demonstrate the calculation of $s_{1.}$, pretending now, to fit our case, that the columns there represent a peripheral classificatory factor, such as ethnicity, instead of a treatment factor.
First, from the kind of data file that was just described for all of the scores in the standardizer's row, software output yielded $s^2_{1.} = 1.105$, so $s_{1.} = 1.105^{1/2} = 1.051$. Using the alternative formula from the previous paragraph we confirm that

$$s_{1.} = \left[\frac{\left(1^2 + 1^2 + 1^2 + 1^2 + 2^2 + 2^2 + 2^2 + 2^2 + 3^2 + 4^2 + 2^2 + 2^2 + 3^2 + 3^2 + 3^2 + 3^2 + 3^2 + 4^2 + 4^2 + 4^2\right) - 20(2.5^2)}{20 - 1}\right]^{1/2} = 1.051.$$
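Both routes to $s_{1.}$ can be confirmed directly from the raw scores. The following Python sketch treats the 20 scores whose squares are summed above as a single group, as the text recommends; the variable names are hypothetical, and `statistics.variance` from the Python standard library is used because it computes the unbiased variance (n - 1 in the denominator).

```python
import math
import statistics

# The 20 Female-row scores, in the order in which their squares are summed above.
female_row = [1, 1, 1, 1, 2, 2, 2, 2, 3, 4,
              2, 2, 3, 3, 3, 3, 3, 4, 4, 4]

n = len(female_row)
row_mean = sum(female_row) / n

# Route 1: unbiased variance of the row treated as one group.
var_row = statistics.variance(female_row)
s_row = math.sqrt(var_row)

# Route 2: the computational formula based on the sum of squared raw scores.
sum_of_squares = sum(y * y for y in female_row)
var_row_alt = (sum_of_squares - n * row_mean ** 2) / (n - 1)
s_row_alt = math.sqrt(var_row_alt)

print(row_mean, sum_of_squares, round(var_row, 3), round(s_row, 3), round(s_row_alt, 3))
```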
Note that $d_{class}$ of Equation 7.16 is comparable to a d that would arise from a one-way design in which the targeted classificatory factor were the only independent variable in the design. To illustrate another standardizer that would accomplish this purpose, we again use the example of gender as the targeted factor. In our present modified version of Table 7.2, in which ethnicity is a peripheral column factor replacing the treatment factor, one can base one's standardizer on the pooled row margin variances, $s^2_{1.}$ and $s^2_{2.}$, each one of which reflects variability attributable to ethnicity as the population would. We pool using Equation 7.17 (shown next) that is another version of the general formula for pooling two variances. We denote the resulting standardizer $s_{classp}$, where again p denotes pooled. This method assumes homoscedasticity of the populations that are represented by the two compared levels of the targeted factor. Again, as was discussed regarding the method that underlies Equation 7.16, ideally the proportions of the participants in each subsample (e.g., proportions of ethnic groups) should be equal to their proportions in the population. The current standardizer is

$$s_{classp} = \left[\frac{(n_{1.} - 1)s^2_{1.} + (n_{2.} - 1)s^2_{2.}}{n_{1.} + n_{2.} - 2}\right]^{1/2}. \tag{7.17}$$

The resulting estimator is then given by

$$g^*_{classp} = \frac{\bar{Y}_{1.} - \bar{Y}_{2.}}{s_{classp}}. \tag{7.18}$$
The asterisk is applied to the g of Equation 7.18 to distinguish it from the g of Equation 7.14. Continuing to use the modified version of Table 7.3 in which the column factor is now a peripheral classificatory factor instead of a treatment factor, we already know from the preceding calculation that $s^2_{1.} = 1.105$. After creating a data file for the data of row 2 (Male row) as was previously described for the data of row 1, we find that software output yields $s^2_{2.} = 1.000$. Therefore, using Equation 7.17 for the standardizer we find that

$$s_{classp} = \left[\frac{(20 - 1)1.105 + (20 - 1)1.000}{20 + 20 - 2}\right]^{1/2} = 1.026.$$
There is an alternative method for calculating $s_{classp}$ that is applicable when there are two or more levels of the targeted classificatory factor and all cell sample sizes are equal. In this case one can use output from ANOVA software to calculate, for entry into Equation 7.18,

$$s_{classp} = \left[\frac{SS_{tot} - SS_{tc}}{N - k_{tc}}\right]^{1/2}, \tag{7.19}$$
where $SS_{tc}$ is the SS for the targeted classificatory factor, N is the total sample size, and $k_{tc}$ is the number of levels of the targeted classificatory factor (Olejnik & Algina, 2000). Observe in the numerator of Equation 7.19 that variability that is attributable to the targeted factor is subtracted from total variability leaving only variability that is attributable to the peripheral factor, which, as was previously discussed, is appropriate in the case considered here.

In the ANOVA summarizing Table 7.4, for our example that uses the data in the revised Table 7.3, $SS_{tc}$ is $SS_B$, N = 40, and $k_{tc}$ = 2. In Table 7.4 we observe for the data of Table 7.3 that, by summing all SS values, $SS_{tot}$ = 40.000, and $SS_B$ = 0.000. Applying Equation 7.19 we thus find that $s_{classp} = [(40.000 - 0.000) / (40 - 2)]^{1/2} = 1.026$. This value agrees with the previous value for $s_{classp}$ that was calculated from data files for separate rows instead of from ANOVA output.
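The agreement between the two routes to $s_{classp}$ can be verified numerically. The following Python sketch implements Equation 7.17 (pooling the two row-margin variances) and Equation 7.19 (working from ANOVA sums of squares); the function names are hypothetical, and the inputs are the values quoted in the text.

```python
import math

def s_classp_from_rows(n1, var1, n2, var2):
    """Equation 7.17: pool the two row-margin variances."""
    return math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))

def s_classp_from_anova(ss_total, ss_targeted, n_total, k_targeted):
    """Equation 7.19: remove the targeted factor's SS from the total SS
    (stated in the text to apply when all cell sample sizes are equal)."""
    return math.sqrt((ss_total - ss_targeted) / (n_total - k_targeted))

# Row-margin variances reported in the text (each already rounded to three decimals).
via_rows = s_classp_from_rows(20, 1.105, 20, 1.000)

# ANOVA quantities from Table 7.4; gender (Factor B) is the targeted factor.
via_anova = s_classp_from_anova(40.000, 0.000, 40, 2)

print(round(via_rows, 3), round(via_anova, 3))
```

The two unrounded results differ only beyond the fourth decimal place, which is attributable to the rounding of the row variances that enter Equation 7.17.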
We do not proceed to calculate an estimate of effect size that is based on the standardized difference between the row marginal means for the data of the modified Table 7.3 because $\bar{Y}_{1.} - \bar{Y}_{2.} = 0$ in that table. However, the method, and also the interpretation when the mean difference is not 0, should be clear from the previous worked examples and discussions. Again, when selecting from a variety of possible standardizers for an estimator, one should make a choice that is based on one's decision regarding which version of the effect-size parameter the sample d or g is to be estimating. As we have observed, each standardizer and its resulting d or g has a somewhat different purpose and/or underlying assumption about homoscedasticity.
STATISTICAL INFERENCE AND FURTHER READING
Smithson (2001) discussed the use of SPSS to construct an exact confidence interval for $\eta^2$, whole or partial, and for a related effect size that is proportional to Cohen's (1988) f, which we discussed in chapter 6. Fidler and Thompson (2001) further illustrated application of Smithson's (2001) method to an a x b design. Smithson (2003) demonstrated the construction of confidence intervals for partial $\eta^2$ and related measures. Also refer to Steiger (2004). STATISTICA can also be used to construct an exact confidence interval for $\eta^2$ for the factorial design at hand. Estimation of POV in complex designs was discussed by Dodd and Schultz (1973), Dwyer (1974), and Vaughan and Corballis (1969). Olejnik and Algina (2000) discussed estimation of POV in designs with covariates and split-plot designs (both also discussed by Maxwell & Delaney, 2004) and in multivariate designs.

Bird (2002) discussed methods, under the assumptions of normality and homoscedasticity, for constructing individual and simultaneous confidence intervals for standardized differences between means and the implementation of these methods using readily available software. At the time of this writing Kevin Bird and his colleagues provide free software for constructing approximate confidence intervals for standardized and unstandardized contrasts, planned or unplanned, for factorial designs with a between-groups and a within-groups factor. Analyses of more complex factorial designs are possible, but in such cases construction of simultaneous confidence intervals is more difficult. This software is available at http://www.psy.unsw.edu.au/reasearch/PSY.htm. Steiger and Fouladi (1997) discussed the construction of exact confidence intervals.
Note that in the case of ordinal data, such as those from rating scales, a different approach may have to be developed for the construction of confidence intervals for the difference between two means (Penfield, 2003).
As mentioned in chapter 6, an approximate confidence interval for a standardized difference between means can be constructed by dividing the limits that are obtained for the unstandardized difference by $MS_w^{1/2}$. This method assumes homoscedasticity. Under heteroscedasticity it would be problematic to define the population to which such a confidence interval would apply. Also, recall from our earlier discussion that when $MS_w^{1/2}$ is the standardizer in a factorial design one is not permitting variability that is attributable to a peripheral factor to contribute to the standardizer. Therefore, the use of $MS_w^{1/2}$ would not be appropriate if the peripheral factor is a classificatory one that varies in the population that is of interest.
In the already noted case in which a classificatory peripheral factor varies in the population (intrinsic factor), Maxwell and Delaney (2004) recommended that the standardizer be obtained by calculating the square root of the variance that results from adding the $SS$ values from all sources other than the targeted manipulated factor and then dividing by the degrees of freedom that are associated with these included sources. For example, suppose that one wants a standardizer for the difference between the two treatment means in Table 7.3 (marginal column means) and that the variability that is attributable to the peripheral factor of gender is to contribute to the standardizer. Using all of the values of $SS$ and of $df$ for the data in Table 7.3 that are presented in Table 7.4, except for those for the targeted factor of treatment (Factor A), the standardizer is given by $[(SS_B + SS_{AB} + SS_w) / (df_B + df_{AB} + df_w)]^{1/2} = [(.000 + .400 + 29.600) / (1 + 1 + 36)]^{1/2} = .889$.
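As a check on the arithmetic, the following minimal Python sketch reproduces the standardizer from the sums of squares and degrees of freedom quoted in the text for Table 7.4. The function name `pooled_standardizer` and the dictionary layout are illustrative assumptions, not part of the original presentation.

```python
import math

def pooled_standardizer(sources):
    """Square root of (sum of SS) / (sum of df) over the included sources.

    `sources` maps a source name to an (SS, df) pair. The targeted factor
    is simply left out of the mapping.
    """
    ss_total = sum(ss for ss, _ in sources.values())
    df_total = sum(df for _, df in sources.values())
    return math.sqrt(ss_total / df_total)

# (SS, df) values quoted in the text; targeted Factor A is excluded.
included = {
    "B": (0.000, 1),
    "AB": (0.400, 1),
    "within": (29.600, 36),
}

standardizer = pooled_standardizer(included)
print(f"{standardizer:.3f}")  # 0.889
```

Leaving the targeted factor out of the mapping mirrors the verbal rule: every source except the targeted manipulated factor contributes to the pooled variance.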
As discussed in chapters 2 and 3, when the dependent variable is measured in familiar units, analysis of data in terms of "raw" (i.e., unstandardized) differences between means can be very informative and readily interpreted (Bond et al., 2003). Of course it is routine to conduct tests of significance and construct simultaneous confidence intervals involving comparisons within pairs of means whose differences are not standardized (Bird, 2002; Maxwell & Delaney, 2004). The latter coauthors discussed methods for the homoscedastic or heteroscedastic cases. The procedure that is generally known as the Bonferroni method (more appropriately, the Bonferroni-Dunn method) can be used to make planned pairwise comparisons. (However, unless there is only a small number of comparisons, one might be concerned about the loss of statistical power for each comparison.) Alternatively, the Tukey HSD method (which is the same as WSD but not the same as Tukey-b) is applicable. Wilcox (2003) discussed and provided S-PLUS software functions for less known robust methods for pairwise comparisons (more generally, linear contrasts) and construction of simultaneous confidence intervals involving the pairs of means of interest.
Abelson and Prentice (1997) and Olejnik and Algina (2000) presented methods for calculating an estimator of effect size for interaction.
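To illustrate why the Bonferroni-Dunn method loses power as the number of comparisons grows, here is a small Python sketch of the per-comparison alpha. The familywise alpha of .05 and the choice to compare all pairs of the four cells of a 2 x 2 design are assumptions made only for illustration; a researcher's planned set of comparisons would usually be smaller.

```python
from itertools import combinations

def bonferroni_alpha(familywise_alpha, n_comparisons):
    """Per-comparison alpha that keeps the familywise rate at or below the target."""
    return familywise_alpha / n_comparisons

# Hypothetical labels for the four cells of a 2 x 2 design.
cells = ["cell 1", "cell 2", "cell 3", "cell 4"]
pairs = list(combinations(cells, 2))  # every pairwise comparison

per_comparison_alpha = bonferroni_alpha(0.05, len(pairs))
print(len(pairs), round(per_comparison_alpha, 4))  # 6 0.0083
```

Each of the six comparisons would be tested at roughly .0083 instead of .05, which is the source of the power concern noted in the parenthetical remark.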
Maxwell and Delaney (2004) discussed methods for testing the statistical significance of the differences among the cell means that are involved in a factor that might or might not be interacting with another factor. Such cellwise comparisons test for simple effects. A comparison of marginal means (testing main effects) when there is interaction merely provides an overall (i.e., an average) comparison of levels of the targeted factor. Such a comparison, or estimation of an effect size that is based on such a comparison, can be misleading because when there is an interaction a difference between targeted marginal means does not reflect a constant difference between cell means at levels of the targeted factor at each level of a peripheral factor. For example, the difference between the column marginal means in Table 7.3 is $2.0 - 3.0 = -1.0$. However, the difference between mean scores under Treatments 1 and 2 for females is not $-1.0$ but $1.9 - 3.1 = -1.2$, and the difference between mean scores under Treatments 1 and 2 for males is also not $-1.0$ but $2.1 - 2.9 = -.8$. The difference between the column marginal means is the mean of these two differences: $[(-1.2) + (-.8)] / 2 = -1.0$. If the interaction had been statistically significant for the data of Table 7.3 one could infer that the difference between $-1.2$ and $-.8$ was thereby statistically significant. Note, however, that an interaction implies a statistically significant difference between simple effects, but the fact that a simple effect is found to be statistically significant while another simple effect involving the same targeted factor is not statistically significant does not imply an interaction.
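The relation between simple effects and the marginal difference can be sketched in a few lines of Python using the four cell means quoted in the text for Table 7.3. The variable names are illustrative, and the simple averaging assumes equal cell sizes, which is consistent with the averaging used in the text.

```python
# Cell means quoted in the text for Table 7.3.
cell_means = {
    ("female", "treatment 1"): 1.9,
    ("female", "treatment 2"): 3.1,
    ("male", "treatment 1"): 2.1,
    ("male", "treatment 2"): 2.9,
}

def simple_effect(gender):
    """Treatment 1 minus Treatment 2 at one level of the peripheral factor."""
    return cell_means[(gender, "treatment 1")] - cell_means[(gender, "treatment 2")]

female_effect = simple_effect("female")
male_effect = simple_effect("male")

# With equal cell sizes, the marginal difference is the mean of the simple effects.
marginal_difference = (female_effect + male_effect) / 2

# The interaction contrast is the difference between the two simple effects.
interaction_contrast = female_effect - male_effect

print(round(female_effect, 1), round(male_effect, 1))
print(round(marginal_difference, 1), round(interaction_contrast, 1))
```

The marginal difference of -1.0 conceals the fact that the two simple effects (-1.2 and -.8) differ by -.4; a test of interaction addresses that last quantity, not either simple effect on its own.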
For example, suppose that in Table 7.3 the difference in females' mean scores under Treatments 1 and 2 (i.e., $1.9 - 3.1 = -1.2$) were statistically significant but that the difference in males' mean scores under Treatments 1 and 2 (i.e., $2.1 - 2.9 = -.8$) were not statistically significant. Such a result would not necessarily indicate an interaction. Estimation of standardized-difference effect sizes for the kind of cellwise comparisons at hand was discussed in the section Comparisons of Levels of a Manipulated Factor at One Level of a Peripheral Factor. Aside from the statistical issues, in research that has theoretical implications explaining an interaction would be of great importance. Note in this regard that whether main effects, simple effects, and/or interactions are found to be statistically significant might depend on the researcher's choice of measure. Two measures might seem to be representing the same underlying construct when, in fact, they might be measuring somewhat different constructs. For further discussion and debate on this and related issues refer to Sawilowsky and Fahoome (2003).
Maxwell and Delaney (2004) and the references therein provided detailed discussions of the issue of interaction, including alternative approaches, confidence intervals for the standardized and unstandardized differences between the cell means, and a measure of strength of association for interaction contrasts. Timm's (2004) ubiquitous study effect size index, which, as we mentioned in chapter 6, assumes homoscedasticity, is applicable to $F$ tests and tests of contrasts in exploratory studies that use factorial designs. Brunner and Puri (2001) extended the application of what we call the PS measure of effect size (discussed in our chap. 5) to factorial designs.
WITHIN-GROUPS FACTORIAL DESIGNS
In the case of factorial designs with only within-groups factors primary researchers can usually conceptualize and estimate a standardized difference between means using the same reasoning and the same methods that were presented using Equations 7.3 and 7.4 in the earlier Manipulated Factors Only section. Note that there is not literally a $MS_w$ in designs with only within-groups factors, but it is valid here to apply Equation 7.4 as if the data had come from a between-subjects design. There is variability within each cell of a within-groups design, as there is within each cell of a between-subjects design, and the subject variables that underlie population variability will be reflected by this variability in both types of designs (cf. Olejnik & Algina, 2000). (In the section Within-Groups Designs and Further Reading in chapter 6, we presented instructions for using statistical software packages to calculate standardizers in the case of one-way within-groups designs. Those instructions are also applicable to the denominators of Equations 7.3 and 7.4.) Typically a within-groups factor will be a manipulated rather than a classificatory one because researchers often subject the same participant to different levels of treatment at different times but typically cannot vary the classification of a person (e.g., gender or ethnicity). (Exceptions in which a within-groups factor might be considered to be classificatory would include research that collects data before and after a participant-initiated change of political affiliation, religion, or gender.)
In the case of within-groups factorial designs, variability that is attributable to the peripheral manipulated factor should not contribute to the variability that is reflected by the standardizer if the peripheral manipulated factor does not vary in the population of interest, as it typically does not. For an example, suppose now that in Table 7.3 Treatment 1 and Treatment 2 were the absence and presence, respectively, of a new drug for Alzheimer's disease, drug A, with Factor A being a within-groups factor. Suppose also that in Table 7.3 Factor B were not gender but instead the absence (row 1) or presence (row 2) of a very different new kind of drug for Alzheimer's disease, drug B, with Factor B also being a within-groups factor. The data in Table 7.3 might represent the patients' scores on a short test of memory or the number of symptoms remaining after treatment with one or the other drug, a combination of the two drugs, or no drug. Because of our purpose here we do not discuss methodological issues (other than supposing counterbalancing) in this hypothetical research, but instead we proceed directly to demonstrating alternative estimators of a standardized difference between means for the case of within-groups factorial designs.
Using Factor A for the targeted factor and supposing now that cell 1 represents a control or standard-treatment comparison group (a control or placebo condition in this example of the revised factors in Table 7.3), we first apply Equation 7.3 to the data to find that $d_{comp} = (2.00 - 3.00) / .989^{1/2} = -1.01$. If we assume homoscedasticity of all four populations of scores that are represented in the design (cells 1 through 4), and recalling from Table 7.4 that we found that $MS_w = .822$ for the data of Table 7.3, we can alternatively apply Equation 7.4 to find that $g_{msw} = (2.00 - 3.00) / .822^{1/2} = -1.10$.
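The two estimates just reported differ only in the variance placed under the square root. The following Python sketch, with an illustrative helper name `standardized_difference`, recomputes both from the means and variances quoted in the text.

```python
import math

def standardized_difference(mean_1, mean_2, variance):
    """Difference between two means divided by the square root of a chosen variance."""
    return (mean_1 - mean_2) / math.sqrt(variance)

# Standardizer based on the comparison-condition variance (.989), as in Equation 7.3.
d_comp = standardized_difference(2.00, 3.00, 0.989)

# Standardizer based on MS_w (.822), as in Equation 7.4.
g_msw = standardized_difference(2.00, 3.00, 0.822)

print(f"{d_comp:.2f} {g_msw:.2f}")  # -1.01 -1.10
```

Because the numerator is the same in both cases, any disagreement between the two estimates reflects only the assumption made about the population variances.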
Finally, if we now assume homoscedasticity with regard to the marginal variances of peripheral Factor B, we can standardize using the square root of the pooled variances in the margins of rows 1 and 2; $s_1^2 = 1.105$ and $s_2^2 = 1.000$. Because the sample sizes for the two rows are the same, the pooled variance is merely the mean of the two variances; $s_{prm}^2 = (1.105 + 1.000) / 2 = 1.053$, where $p$, as before, denotes pooled and $rm$ denotes repeated measures. The standardizer is then $s_{prm} = 1.053^{1/2} = 1.026$. The estimator for our purpose is given by
$g_{prm} = (\bar{Y}_1 - \bar{Y}_2) / s_{prm}$ (7.20)
For the data at hand $g_{prm} = (2.00 - 3.00) / 1.026 = -.97$. Note that the results from applying Equations 7.3, 7.4, and 7.20 are not very different in the artificial case of the data of Table 7.3 because the variances in that table are not as different as they are likely to be in the case of real data. Again, the choice of standardizer is based on the assumptions that the researcher makes about the variances of the involved populations. The interpretation of the estimates in terms of population parameters and distributions should be clear from the earlier discussions.
For further discussions of Equations 7.3 and 7.4 and of the basis of Equation 7.20, review the earlier Manipulated Factors Only section. Olejnik and Algina (2000) provided discussions, and more worked examples of estimation of standardized effect sizes for within-groups factorial designs. Maxwell and Delaney (2004) discussed construction of confidence intervals for the difference between marginal means and for the difference between cell means within the framework of a multivariate approach to two-way within-groups designs. Bird (2002) provided an example of the use of SPSS to construct simultaneous confidence intervals for standardized effect sizes, assuming homoscedasticity, from a design with one within-groups factor and one between-groups factor (split-plot design). Approximate individual and simultaneous confidence intervals for such a design can be constructed, assuming homoscedasticity, using the currently downloadable free software, PSY, from Kevin Bird and his colleagues. This software and its web site were cited in the previous section. Consult Wilcox (2003) for discussions and S-PLUS functions for less known robust methods for pairwise comparisons for two-way within-groups designs. Brunner and Puri (2001) discussed extension of what we call the PS measure to within-groups factorial designs.
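The pooled repeated-measures standardizer of Equation 7.20 can be checked with a short Python sketch. The helper name `pooled_variance_equal_n` is an illustrative assumption, and the simple averaging of variances is valid only because the two rows have equal sample sizes, as stated in the text.

```python
import math

def pooled_variance_equal_n(variances):
    """With equal sample sizes, the pooled variance is the mean of the variances."""
    return sum(variances) / len(variances)

# Marginal variances of rows 1 and 2 of Table 7.3, as quoted in the text.
row_variances = [1.105, 1.000]

s2_prm = pooled_variance_equal_n(row_variances)  # 1.0525, reported as 1.053 in the text
s_prm = math.sqrt(s2_prm)
g_prm = (2.00 - 3.00) / s_prm

print(f"{s_prm:.3f} {g_prm:.2f}")  # 1.026 -0.97
```

Carrying the unrounded pooled variance (1.0525) through the calculation gives the same standardizer and estimate, to the reported precision, as the rounded value used in the text.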
Maxwell and Delaney (2004) presented one of the various formulas that attempt to estimate POV for the main effect of the targeted factor in a within-groups factorial design. Their version of such a formula, a partial omega squared, renders the estimate comparable to what it would have been if the targeted factor had been manipulated in a one-way between-groups design. Research reports should be clear about which of the available conceptually different equations has been used to estimate the POV for a targeted factor in a within-groups factorial design so that the authors of the report, their readers, or later meta-analysts do not unwittingly compare or combine estimates of incomparable measures. For example, Maxwell and Delaney's (2004) formula partials out all effects except for the main effect of subjects, whereas other possible approaches might partial out all effects, including the main effect of subjects, or partial out no effects (an estimation of POV, not partial POV). Refer to Maxwell and Delaney (2004) and Olejnik and Algina (2003) for further discussions, and refer to this chapter's earlier section on partial omega squared for a brief refresher on partial POV. Earlier discussions were provided by Dodd and Schultz (1973), Olejnik and Algina (2000), and Susskind and Howland (1980).
The reader is referred to Maxwell and Delaney (2004) for detailed discussions of assumptions and of analyses of marginal means and interactions in the case of within-groups factorial designs.
With regard to split-plot designs these authors again provided detailed discussion of those topics, the construction of confidence intervals for the variety of contrasts that are possible, and equations for estimation of partial omega squared for each kind of factor and for interaction. As was discussed in the previous paragraph, these equations for estimation of partial omega squared have a different conceptual basis and form from those that might be found elsewhere (cf. Olejnik & Algina, 2000). As we have previously mentioned, Hunter and Schmidt (2004) provided a strong endorsement of the use of within-groups designs.
Note that researchers often apply parametric statistical methods such as ANOVA to data that arise from rating scales by assigning ordered numerical values to the ordered categories. For example, the successive values 1, 2, 3, 4 (or 4, 3, 2, 1) might be assigned respectively to the categories agree strongly, agree, disagree, and disagree strongly.
Therefore, many researchers would be inclined in such cases to apply the same methods that were applied in this section to the data of Table 7.3. However, the application of parametric methods (e.g., the use of means) to data from ordinal scales such as rating scales is controversial. Although such methods may not be problematic in terms of rates of Type I error, there may be more powerful methods, such as those that are discussed in chapter 9. Also, some do not consider the mean of a rating scale to be a meaningful statistic (but consult Penfield, 2003). We defer to the section Limitations of $r_{pb}$ for Ordinal Categorical Data in chapter 9 for discussion of the matter of parametric analysis of ordinal data such as those arising from rating scales.
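The numerical coding of rating-scale categories described above can be sketched in Python as follows. The responses are hypothetical and invented purely for illustration, and the sketch takes no position on whether the resulting mean is a meaningful statistic, which is the controversy noted in the text.

```python
from statistics import mean

categories = ["agree strongly", "agree", "disagree", "disagree strongly"]

# Forward coding assigns 1, 2, 3, 4; reverse coding assigns 4, 3, 2, 1.
forward = {label: value for value, label in enumerate(categories, start=1)}
reverse = {label: len(categories) + 1 - value for label, value in forward.items()}

# Hypothetical responses from five participants (not data from the book).
responses = ["agree", "agree strongly", "disagree", "agree", "disagree strongly"]

forward_mean = mean(forward[r] for r in responses)
reverse_mean = mean(reverse[r] for r in responses)

print(forward_mean, reverse_mean)
```

The two codings are mirror images of each other (the two means sum to 5 here), so a difference between two group means would change only in sign under reverse coding. The coding still treats the distances between adjacent categories as equal, which is an assumption that ordinal data do not guarantee.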
ADDITIONAL DESIGNS AND MEASURES
There are methods for calculating estimators of standardized mean differences available for various additional ANOVA designs. Discussions of these methods would be beyond the scope of this book, but the basic concepts and worked examples that have been presented here should prepare the reader to understand such methods, which are presented elsewhere. Cortina and Nouri (2000) and Olejnik and Algina (2000) discussed methods for $a \times b$, $a \times b \times c$, and analysis of covariance designs. The latter authors discussed methods related to split-plot designs (mix of between-groups and within-groups factors); also consult the previously cited article by Gillett (2003). Wilcox (2003) discussed and provided S-PLUS functions for robust linear contrasts for two-way split-plot designs. Kline (2004) discussed many of the topics of the current chapter.
For discussions of estimation of POV for designs with random factors or mixed random and fixed factors, consult Vaughan and Corballis (1969), Dodd and Schultz (1973), Olejnik and Algina (2000), and Maxwell and Delaney (2004). The latter authors also discussed estimation of POV and tests and construction of confidence intervals for differences between marginal means in the case of nested designs. For an alternative correlational approach to effect sizes for between-groups and within-groups factorial designs consult Rosenthal et al. (2000). Their approach was also discussed by Maxwell and Delaney (2004).
LIMITATIONS AND RECOMMENDATIONS
We observed in this chapter that there can be more than one way to conceptualize and estimate an effect size even when faced with a given targeted factor and a given mix of manipulated and/or classificatory factors. Furthermore, there might be additional valid approaches, not discussed here, to choosing a method for designs that were discussed here. Moreover, sometimes in the literature there is outright disagreement about the appropriate method for a given purpose. There may be disagreement about how to estimate $\Delta$, how to estimate POV, and about whether $\Delta$ or POV is the more useful measure for a given set of data or for any set of data. Work on some of these topics is ongoing and more research is needed. Researchers should think carefully about the purpose of their research and of the nature of the populations of interest, as have been discussed in this book and in the references therein, before deciding on an appropriate measure and estimator.
Because varying methods can result in apparently conflicting results of estimation of effect sizes in the literature it is imperative that researchers make clear in their reports which method they have used. If this is done their readers and those who review the literature will not be unwittingly comparing or combining (i.e., meta-analysts) conceptually and computationally incomparable estimates of effect size. Authors of research reports should also consider reporting not just one kind of estimate of effect size but two or more defensible alternatively conceptualized estimates to provide themselves and their readers with alternative perspectives on the results. (We are aware of a dissenting opinion that holds that providing alternative estimators may only serve to confuse some readers of research reports.)
Because methodological and design features can contribute nearly as much to the magnitude of an estimate of effect size as does a targeted factor (Gillett, 2003; Olejnik & Algina, 2003; Wilson & Lipsey, 2001), researchers should be explicit in their reports' Method sections and have at least a brief comment in their Discussion sections about every characteristic of their study that could possibly influence the effect size. In their analysis of the effect of psychological, behavioral, and educational treatments Wilson and Lipsey (2001) estimated, as a first approximation, that the type of research design (randomized vs. nonrandomized, between-groups vs. within-groups) and choice of concrete measure of an abstract underlying dependent variable were the methodological features that correlated highest with estimates of effect size, but many other methodological features also correlated with these estimates. For example, as we observed in this chapter, estimates of effect size involving two levels of a targeted factor can vary depending on the nature of the peripheral factor (extrinsic or intrinsic). For further discussions of factors to which measures of effect size are sensitive, consult Onwuegbuzie and Levin (2003) and the references therein.
Onwuegbuzie and example of the the influence of design design features, features, we are aware For another example of Experi of aa thesis in in which Experiment Experiment 1\ was was a between-groups study, Experiment 2 was a conceptual conceptual replication replication of that study study using using a within-groups within the two versions of the study were both design, and the results within statistically significant, significant, but but in the opposite direction. direction. Such a conflicting statistically result from a between-groups and a within-groups study study is is not an iso isoand Delaney Delaney (2004) lated case. Consult Grice ((1966), 1 966), and Maxwell and (2004) and examples and and discussion. the references therein for further examples effect size can vary depending depending on the extent of vari variAlso, estimates of effect ability of the the participants participants. . For For example, for a given pair pair of levels of a fac facability tor and a given dependent dependent variable, effect effect sizes might might be different different for a tor population of college students and the possibly more variable general population population. Therefore, one should be cautious about comparing effect effect from populations populations that might might have sizes across studies that used samples from differing variabilities on the dependent dependent variable. Refer Refer to Onwuegbuzie differing Onwuegbuzie and Levin (2003) for further discussion. discussion. Again, Again, by being explicit about relevant methodological characteristics characteristics of their research, all possibly relevant authors of reports can facilitate facilitate interpretation interpretation of results and and facilitate authors relationthe work of meta-analysts who can systematically study the relation ships between such methodological variables ((moderator moderator variables) and the magnitudes of estimates estimates of effect effect size across studies. studies. 
the magnitudes QUESTIONS
1. Distinguish between what the text calls a targeted factor and a peripheral factor.
2. Distinguish between an extrinsic factor and an intrinsic factor.
3. How does the distinction between extrinsic and intrinsic factors influence the procedure one adopts for estimating an effect size?
4. Are intrinsic factors always classificatory factors? Explain.
5. Why is estimation of the POV more complicated in the case of factorial designs than in the case of one-way designs?
6. What is the purpose of a partial POV?
7. Discuss why it is problematic to compare two values of an estimated POV based on the relative sizes of their values, of the values of their associated Fs, or of the values of significance levels attained by their Fs.
8. Why is it problematic to compare two estimates of partial POV for two factors in the same study?
9. Which two conditions should ordinarily be met if one wants to compare estimates of a POV for the same factor from different factorial studies?
10. Why is it problematic to interpret the relative importance of two factors by inspecting the ratio of their estimated POVs?
11. How do the nature of the targeted factor and the nature of the peripheral factor influence the choice of a procedure for estimating a standardized effect size?
12. How do the nature of one's assumption about homoscedasticity and the presence of a control group or standard-treatment comparison group influence one's choice of a standardizer?
13. What assumption underlies the use of Equation 7.6, and in simplest terms what is the nature of the variance that it produces?
14. Briefly describe three procedures for estimating a standardized difference between means at two levels of a manipulated factor at a given level of a peripheral factor, and how does one choose one procedure from these three?
15. Briefly describe how one estimates a standardized difference between means at two levels of an intrinsic factor when there is one or more extrinsic peripheral factors.
16. Discuss one procedure for estimating a standardized overall difference between means of a classificatory factor when the peripheral factor is intrinsic.
17. How might a difference between the proportions of various demographic subgroups in a sample and the proportions of those subgroups in the population influence the estimate of a standardized difference between means?
18. When would it be inappropriate to use the square root of $MS_w$ as a standardizer even when homoscedasticity of all involved populations is assumed?
19. Briefly describe the relationship between an interaction and simple effects.
20. What effect might one's choice of a measure for the dependent variable (when there are alternative measures) have on the results of the various significance tests and estimates of effect size that emerge from a factorial ANOVA?
21. Why are Equations 7.3 and 7.4 typically applicable to within-groups designs?
22. What is the rationale for the use of Equation 7.20 in the case of within-group designs?
23. Discuss the roles that methodological and design features might play in the magnitude of an estimated effect size.
24. Considering the issues raised by Question 23, what information should be provided in the Method section of a research report?
Chapter 8
Effect Sizes for Categorical Variables
BACKGROUND REVIEW
Readers who are very familiar with categorical variables, contingency tables, the chi-square ($\chi^2$) test of association, and related terminology might want to proceed directly to the last three paragraphs of this section. This chapter does not involve the chi-square test of goodness-of-fit.
Often in the behavioral and social sciences the two or more variables that are being related are categorical. An unordered categorical variable is also called a nominal or qualitative variable because its variations (categories) are names for qualities (characteristics). An experimental example of type of treatment as an unordered categorical variable is random assignment of participants to Treatments a, b, .... In this example the categorical independent variable is the type (category) of treatment. Common classificatory examples of categorical independent variables include gender: male and female and political affiliation: Democrat, Republican, or other. Note that the ordering of the categories in these examples is arbitrary, not meaningful. The categories in these examples could just as well have been considered in any other order. (In the next chapter we discuss only categorical variables that do represent a natural ordering, such as agree strongly, agree, disagree, and disagree strongly.) Note also that in such examples lumping minority political parties, minority religious groups, or minority ethnic groups, et cetera, into a catch-all "other" category is not intended to slight those groups; it would be purely a statistical consideration.
Additional named categories (involving minority groups) of the independent variables in such examples may be used. However, no category should be used that is likely to be attained by no or few members of the samples. This problem is likely to occur if the researcher includes a category that represents a small minority of the population and the sampling method or size is inappropriate for sampling that minority sufficiently. Inferences from estimators of effect size may be impossible or problematic when there are too few
participants in one or more of the categories. If the researcher wants to include minority groups an appropriate sampling method or size should be used to obtain sufficient numbers of members of these groups.
When a categorical variable has only two possible values it is called a dichotomous or binomial variable. When more than two values are possible the variable is called multinomial.
When each of the variables in the research is categorical the data are usually presented in a table such as Table 8.1. In the simplest case only two variables are being studied, one variable being represented by the rows and the other by the columns of the table. In this case the table is called a two-way table. The general designation of a two-way table is r × c table, in which × means "by," and r and c stand for rows and columns, respectively. For a specific r × c table the letters r and c are replaced by the number of rows and the number of columns in that table, respectively; these numbers also correspond to the number of categories that the row and column variables have. In the simplest case the row variable has only two categories and the column variable has only two categories, resulting in the common 2 × 2 table that is also called a fourfold table because the table contains four cells. Two-way or multiway (i.e., more than two variables) tables are also called cross-classification tables or contingency tables. The cells of the cross-classification tables classify (categorize) each participant across two or more variables. Within each cell of the table is the number of participants that fall into the row category and the column category that the cell represents. Such data are called cell counts or cell frequencies.
The general purpose of a contingency table is to analyze the table's data to determine if there is a contingency (i.e., association or independence) between the variables. In a common example one might want to determine if participants' falling into the client better or client not better categories is contingent on which treatment category they were in. (Although client better vs. client not better is an example of an ordered categorical variable, the difference between ordered and unordered dependent variables is not important for us in the case of dichotomous dependent variables until chap. 9.) Note that in this example there is an
TABLE 8.1
Frequencies of Outcomes After Treatment

                         Symptoms
Therapy          Remain        Gone          Totals
Psychotherapy    f11 = 14      f12 = 22      36
Drug Therapy     f21 = 22      f22 = 10      32
Totals           36            32            68
independent variable (the type of treatment given) and a dependent variable (the outcome of better or not better), although we also consider examples in which the categorical variables need not be classifiable as independent variables or dependent variables. For example, in research that relates religious affiliation and political affiliation the researcher need not designate an independent variable and a dependent variable, although the researcher may have a theory of the relationship which does specify that, say, religious affiliation is the independent variable and political affiliation is the dependent variable. The total count for each row across the columns is placed at the right margin of the table, and the total count for each column across the rows is placed at the bottom margin of the table. The row totals and the column totals are each called marginal totals.
Table 8.1 is a 2 × 2 contingency table that is based on actual data. The clinical details are not relevant to our discussion of estimating an effect size for such data, but they would be very relevant to the researcher's interpretation and generalization of the results. For the purpose of the next section we assume for now that the data in Table 8.1 represent the fourfold categorizations of 68 former pain patients whose files had been sampled from a clinic that had provided either psychotherapy or drug therapy for a certain kind of pain. Such a method of research is called a naturalistic or cross-sectional study. In this method the researcher decides only the total number of participants to be sampled, not the row or column totals. These latter totals emerge naturally when the total sample is categorized. Naturalistic sampling is common in survey research.
In Table 8.1 the letter f stands for frequency of occurrence in a cell, and the pair of subscripts for each cell stand for the row and column, respectively, that the cell frequency represents. For example, $f_{21}$ stands for the frequency with which participants are found in the cell representing the crossing of the second row and the first column, namely 22 of the 32 patients who received drug therapy. (Note that we are returning to standard notation in this chapter because for our present purposes we no longer have a reason for the atypical sequencing of column and row subscripts that we adopted and explained in chap. 7.)
The examples in this chapter involve independent samples. Refer to Fleiss, Levin, and Paik (2003) for discussion of the case of experiments that use matched samples. In that case participants are matched with respect to one or more attributes that are known to be, or are believed to be, related to the outcome variable. Each participant within each matched pair of individuals (or within each matched group of individuals in cases in which there are more than two treatments) is randomly assigned to one of the treatments. (Fleiss et al., 2003, discussed as correlated binary data the case of repeated measurements in longitudinal studies, in which each participant is categorized twice or more over time.) Also consult Fleiss et al. (2003) for discussion of cases in which there are missing data or in which some participants have been
misclassified into the categories. The latter problem is related to the problem of unreliability of measurement that was discussed in chapter 4. Fleiss et al. (2003) also discussed measurement of interrater agreement in order to obtain an upper limit for the reliability of the categorizations.
CHI-SQUARE TEST AND Phi
Note first that the statistical and effect-size procedures that are presented in this chapter for 2 × 2 tables are applied here only to originally discrete (i.e., truly or originally dichotomous) variables, not dichotomized variables. These procedures are problematic when the row or the column variable has been dichotomized by the researcher, say, into better versus not better categories from an originally continuous variable. For example, suppose that two therapies are to be compared for their effect on anxiety. Suppose further that two categories of anxiety are formed by the researcher categorizing patients as high or low anxiety using scores above or below the median (or some other arbitrary cutpoint) respectively, on a continuous scale of anxiety. Such dichotomizing might render the procedures in this chapter invalid because the results might depend not only on the relative effectiveness of the two therapies, as they should, but also on the arbitrary cutpoint the researcher decided to use to lump everyone below the cutpoint together as low anxiety and to lump everyone above the cutpoint together as high anxiety. If some other arbitrary cutpoints had been used, such as the lowest 25% of scores on the continuous anxiety test (low anxiety) and the highest 25% of scores (high anxiety), the results from statistical tests and estimation of effect size might differ from those arising from the equally arbitrary use of the median as the cutpoint. (However, refer to Sánchez-Meca, Marín-Martínez, & Chacón-Moscoso, 2003, for cases in which the choice of cutpoint seemed generally to have little influence on the biases and sampling variabilities of estimators of effect size.) When the dependent variable is a continuous variable, methods that have been presented earlier throughout this book are more appropriate than dichotomizing.
The most common test of the statistical significance of the association between the row and column variables in a table such as Table 8.1
is the $\chi^2$ test of association. In general the degrees of freedom for this test are given by df = (r - 1)(c - 1), which in the case of a 2 × 2 table yields df = (2 - 1)(2 - 1) = 1. However, whereas the $\chi^2$ test addresses the issue of whether or not there is an association, the emphasis in this book is on estimating the strength of this association with an appropriate estimator of effect size.
As we previously noted with regard to the t statistic, the magnitude of $\chi^2$ does not necessarily indicate the strength of the association between the row and column variables. The numerical value of the $\chi^2$ statistic depends not only on the strength of association but also on the total sample size. Thus, if in a contingency table the pattern of the cell data were to remain the same (the same strength of association) but the sample size increased, $\chi^2$ would increase.
What is needed is a measure of the strength of the association between the row and column variables that is not affected, or less affected, by total sample size. One common such measure of effect size for a 2 × 2 table is the population correlation coefficient, $r_{pop}$. An $r_{pop}$ arising from a 2 × 2 table is called a population phi coefficient, $phi_{pop}$ in this book, estimated by the sample phi. (In the statistical literature what we denote in this book $phi_{pop}$ is usually denoted $\phi$ and the estimator phi is usually denoted $\hat{\phi}$. Although it is easier to conceive of $phi_{pop}$ as simply the special case of $r_{pop}$ when both X and Y are dichotomous, note first that $\chi^2$ can be considered to be a sum of squared effects;
\[ \chi^2 = \sum \frac{(f_o - f_e)^2}{f_e} \]
frequencies and and ex, where f0 and fe are the observed frequencies
respectively,in inaacell, cell,and andthe thesummation summationisisover overall all pected frequencies, respectively, four cells. Therefore, phi o can can be considered considered to to be a a kind kind of of average efef four fect, the square root of an an average average of the the squared squared effects. effects. For For formal formal ex exdiscussionconsult Hays, 1994, pression of this parameter and further discussion 1 994, and Liebetrau, Liebetrau, 11983. not be surprising that phipop is a kind of and 983. It should not of the mean of products products of Zz scores; average because rpop too is a mean, the
\[ r_{pop} = \frac{\sum z_X z_Y}{N}.) \]
To calculate phi for a 2 × 2 table one can use the procedure that was outlined for this purpose in the section in chapter 4 on the binomial effect size display (BESD). However, phi can be calculated more simply using
\[ phi = \left( \frac{\chi^2}{N} \right)^{1/2}, \tag{8.1} \]
where N is the total sample size. (Observe in Equation 8.1 how phi, as an estimator of effect size, compensates for the influence of sample size on $\chi^2$ by dividing $\chi^2$ by N.) For the purpose of applying phi to data from naturalistic sampling, one calculates an unadjusted $\chi^2$ using
\[ \chi^2 = \frac{N (f_{11} f_{22} - f_{12} f_{21})^2}{n_{r1}\, n_{r2}\, n_{c1}\, n_{c2}}, \tag{8.2} \]
where $n_{r1}$, $n_{r2}$, $n_{c1}$, and $n_{c2}$ represent the number of participants in row 1, row 2, column 1, and column 2, respectively. (Note that we are adopting the recommendation of Fleiss et al., 2003, that the numerator of $\chi^2$ not be adjusted when calculating phi.)
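As a rough check on these two steps, the following minimal Python sketch computes an unadjusted chi-square for a fourfold table and then the unsigned phi for the cell frequencies of Table 8.1. It assumes the standard unadjusted fourfold formula, N(f11 f22 - f12 f21)^2 divided by the product of the four marginal totals; the function names are our own illustrative choices, not part of any statistical package. The second call doubles every cell to illustrate the earlier point that chi-square grows with sample size while phi does not.

```python
from math import sqrt


def unadjusted_chi_square(f11, f12, f21, f22):
    """Unadjusted chi-square for a 2 x 2 table (no continuity correction)."""
    n_r1, n_r2 = f11 + f12, f21 + f22  # row totals
    n_c1, n_c2 = f11 + f21, f12 + f22  # column totals
    n = n_r1 + n_r2                    # total sample size N
    return n * (f11 * f22 - f12 * f21) ** 2 / (n_r1 * n_r2 * n_c1 * n_c2)


def phi_from_chi_square(chi_square, n):
    """Unsigned sample phi: square root of chi-square divided by N."""
    return sqrt(chi_square / n)


# Cell frequencies f11, f12, f21, f22 from Table 8.1
cells = (14, 22, 22, 10)
chi_sq = unadjusted_chi_square(*cells)
phi = phi_from_chi_square(chi_sq, sum(cells))
print(round(chi_sq, 2), round(phi, 2))      # 6.06 0.3

# Same pattern of cell data with every frequency doubled
doubled = tuple(2 * f for f in cells)
chi_sq_2 = unadjusted_chi_square(*doubled)
phi_2 = phi_from_chi_square(chi_sq_2, sum(doubled))
print(round(chi_sq_2, 2), round(phi_2, 2))  # 12.13 0.3
```

Doubling the cell frequencies doubles chi-square but leaves phi unchanged, which is the sense in which dividing by N compensates for sample size.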
For the data of Table 8.1, software and manual calculation yielded $\chi^2$ = 6.06 (for which p = .013), so phi = $(6.06/68)^{1/2}$ = .30, a value that may be considered to be statistically significantly different from 0 at p = .013. Note that different software and different textbooks often use equations for $\chi^2$ that are different from Equation 8.2. Some superficially different looking equations for $\chi^2$ are actually functionally equivalent ones that yield identical results (e.g., our Equation 8.2 versus Equation 6.3 in Fleiss et al., 2003). Another difference between equations is a matter of adjusting or not adjusting the numerator of $\chi^2$ for the fact that its continuous theoretical distribution (used to obtain the significance level) is not perfectly represented by its actual discrete empirical sampling distribution. Again, to calculate phi the unadjusted $\chi^2$ is used as in Equation 8.2.
As an $r_{pop}$, $phi_{pop}$ theoretically ranges from -1 to +1. If phi is calculated as an r using the method in the section on the BESD in chapter 4, the calculation will yield a signed value for any nonzero r, but, if we use Equation 8.1, which produces a square root, it may not be immediately clear whether phi is positive or negative. However, the sign of phi is a trivial result of the order in which the two columns or the two rows are arranged. For example, if Table 8.1 had drug therapy and its results in the first row and psychotherapy and its results in the second row, the sign of r, but not its size, would change. To interpret our obtained phi of .30 (+ or -?) note first that symptom gone is the better of the two outcome categories. Observe also that 22/36 = .61 of the total psychotherapy patients attained this good outcome, whereas 10/32 = .31 of the total patients in drug therapy attained it. Therefore, one now has the proper interpretation of the obtained phi. Because $\chi^2$ and, by implication, phi are statistically significant and a greater proportion of the psychotherapy patients than the drug patients are found in the better outcome category, one can conclude that psychotherapy is statistically significantly better than drug therapy in the particular clinical example of the data in Table 8.1.
Because one now has the proper interpretation of the results, the question of the sign of phi is unimportant. However, using the reasoning of chapter 4 regarding the sign of the point-biserial r, the reader should be able to see now that r = phi is negative for the data in Table 8.1 using the usual kind of coding of the X and Y variables. If we were to code, say, row 1 as X = 1, row 2 as X = 2, column 1 as Y = 1, and column 2 as Y = 2, phi is negative because there is a tendency for those in the lower category of X (i.e., row 1) to be in the higher category of Y (i.e., column 2) and for those in the higher category of X (i.e., row 2) to be in the lower category of Y (i.e., column 1). This pattern of results defines a negative relationship between variables.
Unfortunately, the value of phi is not only influenced by the strength of association between the row and column variables, as it should be, but also by variation in the margin totals, as we discuss next, which can
be detrimental to $phi_{pop}$ as a measure of effect size. Therefore, its use is recommended only in naturalistic research, wherein the researcher has chosen only the total sample size, not the row or column sample sizes, so that any variation between the two column totals or between the two row totals is natural rather than being based on the researcher's arbitrary choices of sample sizes. A phi arising from another study of the same two dichotomous variables but using a sampling method other than naturalistic sampling would not be comparable to a phi based on naturalistic sampling. Therefore, a meta-analyst should not simply average values of phi that arise from studies that used different sampling methods. Also, phi can only attain the extreme values of -1 or +1 (perfect correlations) when both variables are truly dichotomous and when the proportion of the total participants found in one or the other of the row margins is the same as the proportion of the total participants who are found in one or the other of the column margins.
The requirement about the equality of a row proportion and a column proportion to maintain the possibility of phi = +1 or -1 as an extreme limit is related to the problem of reduction of r by unequal skew of an X variable and a Y variable that was discussed in the Assumptions of r and $r_{pb}$ section in chapter 4. In naturalistic sampling a reduction of the absolute upper limit for phi due to the failure of a row proportion to equal a column proportion might merely be reflecting a natural phenomenon in the two populations instead of reflecting the researcher's arbitrary choice of the two sample sizes. Consult the treatments of phi in J. B. Carroll (1961), Cohen et al. (2002), and Haddock, Rindskopf, and Shadish (1998) for further discussions. J. B. Carroll (1961) provided an equation for the exact limits for phi, called $phi_{max}$, but he cautioned against the temptation to use phi/$phi_{max}$ as a kind of corrected phi. For our example, in Table 8.1 the proportions of the total 68 participants that are found in row 1, row 2, column 1, and column 2 are 36/68 = .53, 32/68 = .47, 36/68 = .53, and 32/68 = .47, respectively. Note that the row and column marginal distributions in Table 8.1 happen to satisfy the proportionality criterion for a 2 × 2 table in which the absolute upper limit of phi is 1, although satisfying this criterion is not necessary in the case of naturalistic sampling. SPSS is among the statistical packages that calculate phi.
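The point made earlier in this section about the sign of phi can be checked with a short Python sketch. It assumes the coding described in the text (row 1 as X = 1, row 2 as X = 2, column 1 as Y = 1, column 2 as Y = 2), expands the cell frequencies of Table 8.1 into one (X, Y) pair per participant, and computes an ordinary Pearson r. The helper names are hypothetical and written only for this illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Ordinary Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


def expand(table):
    """One (X, Y) pair per participant: row i is X = i + 1, column j is Y = j + 1."""
    xs, ys = [], []
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            xs += [i + 1] * count
            ys += [j + 1] * count
    return xs, ys


# Table 8.1: rows are psychotherapy and drug therapy; columns are remain and gone
table_8_1 = [[14, 22], [22, 10]]
signed_phi = pearson_r(*expand(table_8_1))

# Same data with the two rows listed in the opposite order
swapped_phi = pearson_r(*expand(table_8_1[::-1]))

print(round(signed_phi, 2), round(swapped_phi, 2))  # -0.3 0.3
```

Reversing the order of the rows changes only the sign of the result, not its size, which is why the sign is interpreted from the pattern of proportions rather than from the coefficient itself.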
NULL-COUNTERNULL INTERVAL FOR phi_pop

Construction of an accurate confidence interval for phi_pop can be complex, and there may be no entirely satisfactory method, especially for the sample sizes that are common in behavioral research and the more phi_pop departs from 0. Refer to Fleiss et al. (2003) for discussion of a method for constructing an approximate confidence interval for phi_pop. Instead of constructing a confidence interval for phi_pop we construct a null-counternull interval for phi_pop, which, as previously stated, is an
r_pop, using Equation 4.2 from chapter 4 to find the counternull value. We assume that the null-hypothesized value of phi_pop is 0, so the null value of the interval is 0. Applying Equation 4.2 to the data in Table 8.1, $2\,\text{phi} / (1 + 3\,\text{phi}^2)^{1/2} = 2(-.30) / [1 + 3(-.30)^2]^{1/2} = -.53$. Therefore, the limits of the null-counternull interval for the data at hand are 0 and -.53. The result of phi = -.30 thus provides as much support for the null hypothesis that phi_pop = 0 as it would provide for a hypothesis that phi_pop = -.53 (a relatively large correlation).
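The counternull arithmetic above can be reproduced with a minimal sketch. It assumes the form of Equation 4.2 that is applied in the preceding paragraph (null value of 0); the function name is hypothetical.

```python
import math

def counternull_r(r):
    """Counternull value of a correlation when the null value is 0,
    using the form of Equation 4.2 applied in the text."""
    return 2 * r / math.sqrt(1 + 3 * r ** 2)

phi = -0.30                          # sample phi for Table 8.1
counternull = counternull_r(phi)     # other limit of the interval
interval = (0.0, counternull)        # null value and counternull value
print(round(counternull, 2))
```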
THE DIFFERENCE BETWEEN TWO PROPORTIONS
One important purpose of an effect size is to convey, if possible, the meaning of research results in the most understandable form for persons who have little or no knowledge of statistics, such as clients, patients, patients' caregivers, and some educational, governmental, or health-insurance officials. For this purpose perhaps the simplest estimate of the association between the variables in a 2 x 2 table is the difference between two proportions, which estimates the difference between the probabilities of a given outcome in two independent populations. Unlike phi_pop, which requires naturalistic sampling, this measure of effect size requires either random assignment to one of the two treatment samples (an experimental study) or purposive sampling. In purposive sampling for two groups, the researcher samples a predetermined N participants, $n_1$ of whom are those who have a certain characteristic and $n_2$ of whom have an alternative characteristic (e.g., males and females or past treatment with either Drug a or Drug b). The prospective and retrospective versions of purposive sampling are discussed later where needed in the Relative Risk and Number Needed to Treat section.

For an example, we again use the instructive data in Table 8.1, but now we assume that the participants had been randomly assigned to their treatment groups. Note that the sample sizes differ in Table 8.1 (36 and 32). Although one would typically expect equal sample sizes when assignment is random, random assignment does not strictly require equal sample sizes.
In fact, all that is required for random assignment is that the total participants be randomly assigned to conditions, and not that sample sizes be equal. However, if the unequal sample sizes are attributable to attrition of participants, statistical inferences and estimation of effect sizes would be problematic unless the attrition were random.

The first step is to choose one of the two outcome categories to serve as what we call the target category or target outcome. From Table 8.1, one might use Symptoms Gone as the target category; we observe later that it does not matter which category of outcome is chosen for this purpose. The next step is to calculate the proportion of the total participants in Sample 1 (Treatment 1) who have that target outcome and the proportion of the total participants in Sample 2 (Treatment 2) who have that target outcome. In our example, .61 of the psychotherapy patients and
.31 of the drug therapy patients became free of their symptoms. One then finds the difference between these two proportions; in our case, .61 - .31 = .30. This sample result estimates that the probability that a member of the population that receives psychotherapy will be relieved of symptoms is .61 and the probability that a member of the population that receives drug therapy will be relieved of symptoms is .31. An even simpler interpretation is that the results estimate that, for every 100 members of the population of those who are given psychotherapy for the symptoms at hand, 30 (i.e., 61 - 31 = 30) more patients will be relieved of these symptoms than would have been relieved of them had they been given the drug therapy instead.

We continue to use column 2 of Table 8.1 (Symptoms Gone) as our target category. Now call the proportion of the total participants in row 1 who fall into column 2 $p_1$, and call the proportion of the total participants in row 2 who fall into column 2 $p_2$. Therefore, our previously found proportions are $p_1$ = .61 and $p_2$ = .31. Note that the absolute difference between the two proportions is the same as the absolute value of phi for the 2 x 2 table, both being equal to .30 for the data in Table 8.1. Recall from the section on the BESD in chapter 4 that, with regard to a table such as Table 8.1, $p_1$ and $p_2$ might be called the success proportions, and their difference will be equal to phi when the marginal totals in the table are uniform. (However, uniform marginal totals are unlikely under random assignment or naturalistic sampling.) Recall also that, as is often the case and as was illustrated in the section on the BESD, different kinds of measures of effect size can provide different perspectives on data. A phi = .30 might not seem to be very impressive to some, and the corresponding coefficient of determination of $r^2 = \text{phi}^2 = .30^2 = .09$ might seem to be even less impressive. However, a success proportion of .61 for one therapy that is nearly double the success proportion of .31 for another therapy seems to be very impressive. The difference between success proportions is commonly called the risk difference. Further discussions of the risk difference can be found in Rosenthal (2000) and Rosenthal et al. (2000).

Because the method in our example involved random assignment to treatments instead of naturalistic sampling, there are more appropriate approaches for estimating an effect size than an approach that is
based on $\chi^2$ and phi (Fleiss et al., 2003; Wilcox, 1996). The recommended kind of approach is to focus directly on the difference between two proportions. Recall that a proportion, p, in a sample estimates a probability, P, in a population. The simplest and traditional approach is to test $H_0$: $P_1 = P_2$ against $H_{alt}$: $P_1 \neq P_2$ two-tailed, where $P_1$ and $P_2$ are estimated by $p_1$ and $p_2$, respectively. In general $P_i$ is the probability that a member of the population who has been assigned the treatment in row i will have the target outcome, and $P_j$ is the probability that a member of the population that has been assigned the treatment in row j will have the target outcome.
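The success proportions and the risk difference discussed above can be recomputed with a brief sketch. The cell counts are an assumption, reconstructed from the totals and proportions quoted for Table 8.1, and the helper name is hypothetical. The sketch also illustrates that choosing the other column as the target category only reverses the sign of the difference.

```python
# Cell counts for Table 8.1, reconstructed (an assumption) from the text.
# Rows: psychotherapy, drug therapy. Columns: column 1, column 2 (Symptoms Gone).
table = [[14, 22],
         [22, 10]]

def target_proportions(t, target_col):
    """Proportion of each row's participants who fall in the target column."""
    return [row[target_col] / sum(row) for row in t]

# Column 2 (index 1) as the target category.
p1, p2 = target_proportions(table, target_col=1)
risk_difference = p1 - p2

# Column 1 (index 0) as the target category instead.
q1, q2 = target_proportions(table, target_col=0)
other_difference = q1 - q2

# "Per 100" reading used in the text: 61 - 31 = 30.
per_100 = round(100 * p1) - round(100 * p2)

print(round(p1, 2), round(p2, 2), round(risk_difference, 2), per_100)
```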
A researcher might choose to use the category that is represented by column 1 as the target category instead of using the category that is represented by column 2. The choice is of no statistical consequence because the same significance level will be attained when the difference between two proportions is based on column 1 as when it is based on column 2. Of course, finding that, say, the success rate (proportion) for Therapy i is statistically significantly higher than the success rate for Therapy j is equivalent to finding that the failure rate for Therapy i is statistically significantly lower than the failure rate for Therapy j. In Table 8.1 the failure outcome is represented in column 1.

There are competing methods for testing $H_0$: $P_1 = P_2$. Refer to Agresti (2002) and Fleiss et al. (2003) for very informative background discussion. Also consult Chan (1998), Chuang-Stein (2001), Martin Andres and Herranz Tejedor (2004), and Rohmel and Mansmann (1999). Wilcox (1996) provided a Minitab macro for a method that was recommended as best by Storer and Kim (1990). (The Storer-Kim method has been modified by Skipka (2003) to attain slightly greater power.) A major controversy is whether such tests should be conditional or unconditional, which is a matter of the extent to which fixed margins in the contingency table determine the sampling distribution of the test statistic.
For example, if each sample is a random sample from one or the other of two populations, and the samples are represented in the rows, then only the row margins are fixed and unconditional tests are applicable. Further discussion of the controversy is beyond the scope of this book, so we refer the reader to Agresti (2002). Manual calculation is also possible for the Storer-Kim method, but it is laborious. Therefore, we demonstrate a simpler traditional but less accurate method. The method is an example of what is called a large-sample, approximate, or asymptotic method because its accuracy increases as sample sizes $n_1$ and $n_2$ (e.g., the two row totals in Table 8.1) increase. We provide criteria for a large sample at the end of this section. After defining one additional concept we provide a detailed illustration of the method.

The mean proportion, $\bar{p}$, is the proportion of all participants (for both samples) that are found in the target category. In Table 8.1, in which column 2 represents the target category,

$$\bar{p} = \frac{f_{12} + f_{22}}{N}, \tag{8.3}$$

where N is the total sample size ($n_1 + n_2$). For Table 8.1, $\bar{p}$ = (22 + 10) / 68 = .47, a value that one needs for the test of the current $H_0$. The mean proportion can also be called the pooled estimate of P, the overall population proportion of those who would be found in the target category. Because one initially assumes that $H_0$ is true, one assumes that $P_1 = P_2 = P$ and that, therefore, the best estimate of P is obtained by pooling (averaging) $p_1$ and $p_2$ as in Equation 8.3.
Recall that to convert a statistic to a z (i.e., a standardized value) one divides the difference between that statistic and its mean by the standard deviation of that statistic. The statistic of interest here is $p_1 - p_2$, and the mean of this statistic upon repeated sampling of it, assuming as we are for now that $H_0$ is true, is 0. The standard deviation of the sampling distribution of values of $p_1 - p_2$, again assuming that $H_0$ is true, is shown in the denominator of Equation 8.4.

$$z_{p_1 - p_2} = \frac{p_1 - p_2 - 0}{\left[\dfrac{\bar{p}(1 - \bar{p})}{n_1} + \dfrac{\bar{p}(1 - \bar{p})}{n_2}\right]^{1/2}} \tag{8.4}$$
(We retained the value 0 in Equation 8.4 to make clear that the equation represents a kind of z, but we soon discuss a reason for replacing 0 with a correcting value.) The larger the sample sizes the closer the distribution of $z_{p_1 - p_2}$ will approximate the normal curve. Using the previous calculations of $p_1$, $p_2$, and $\bar{p}$, the application of Equation 8.4 to the data in Table 8.1 yields

$$z_{p_1 - p_2} = \frac{.61 - .31 - 0}{\left[\dfrac{.47(1 - .47)}{36} + \dfrac{.47(1 - .47)}{32}\right]^{1/2}} = 2.47.$$
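The calculation in Equation 8.4 can be sketched in a few lines. This is only an illustration under stated assumptions: the function names are hypothetical, the inputs are the rounded proportions used in the text (.61, .31, and .47), and the optional `adjust` argument applies the replacement of 0 by .5(1/n1 + 1/n2) that is described in the paragraph that follows. Working from unrounded proportions would give a slightly different z.

```python
import math

def pooled_z(p1, n1, p2, n2, p_bar, adjust=False):
    """z for H0: P1 = P2, using the pooled standard deviation of Equation 8.4."""
    se = math.sqrt(p_bar * (1 - p_bar) / n1 + p_bar * (1 - p_bar) / n2)
    # With adjust=True the 0 in the numerator is replaced by .5(1/n1 + 1/n2);
    # this form assumes p1 > p2, as in the worked example.
    correction = 0.5 * (1 / n1 + 1 / n2) if adjust else 0.0
    z = (p1 - p2 - correction) / se
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))  # area in both normal tails
    return z, p_two_tailed

def large_sample_ok(n1, n2, p_bar, minimum=5):
    """Rule of thumb: n1*p, n1*(1 - p), n2*p, and n2*(1 - p) all reach the minimum."""
    checks = [n1 * p_bar, n1 * (1 - p_bar), n2 * p_bar, n2 * (1 - p_bar)]
    return all(value >= minimum for value in checks)

z, p_value = pooled_z(0.61, 36, 0.31, 32, p_bar=0.47)
z_adj, p_value_adj = pooled_z(0.61, 36, 0.31, 32, p_bar=0.47, adjust=True)
print(round(z, 2), round(z_adj, 2), large_sample_ok(36, 32, 0.47))
```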
Referring z = 2.47 to a table of the normal curve, one finds that this z and, therefore, $p_1 - p_2$ are statistically significantly different from 0 at an obtained significance level beyond .0136. Note that there is an adjustment of Equation 8.4 whereby 0 in the numerator is replaced by $.5(1/n_1 + 1/n_2)$ to produce a better approximation to the normal curve (Fleiss et al., 2003). Replacing 0 with this value in this example yields z = 2.23, a value that is statistically significant at an obtained significance level beyond .0258. We recommend use of this adjustment for the z test at hand. As a general rule the z test that we demonstrated may be used when all of the following are ≥ 5: $n_1\bar{p}$, $n_1(1 - \bar{p})$, $n_2\bar{p}$, and $n_2(1 - \bar{p})$. For the data in Table 8.1, $n_1\bar{p}$ = 36(.47) = 16.92, $n_1(1 - \bar{p})$ = 36(1 - .47) = 19.08, $n_2\bar{p}$ = 32(.47) = 15.04, and $n_2(1 - \bar{p})$ = 32(1 - .47) = 16.96, all values greatly exceeding the criterion minimum of 5.

Refer to Fleiss et al.
(2003) for a discussion of comparison of proportions from more than two independent samples. Recall from the discussion of multiple comparisons of means in the section on statistical significance in chapter 6 that the methods (e.g., the Tukey HSD method) may result in contradictory evidence about the pairwise differences among the means (intransitivity). The same problem of intransitive results can occur when making pairwise comparisons from
three or more proportions. For example, suppose that a third therapy were represented by a third row added to Table 8.1 (Therapy 3), so that one would now be interested in the proportion of patients whose symptoms are gone after Therapy 1, 2, or 3, that is, $P_1$, $P_2$, and $P_3$. Suppose further that one tested $H_0$: $P_1 = P_2$, $H_0$: $P_1 = P_3$, and $H_0$: $P_2 = P_3$ simply by applying the current method in this section (or some traditional competing method) three times. Even if we control for experimentwise error by using the Bonferroni-Dunn adjustment, say, by adopting the .05/3 = .0167 alpha level for each of the three tests, a problem of possible intransitivity remains.

An example of one of the possible sets of intransitive results from the three tests would be results that suggest the following contradictory relationships: $P_1 = P_2$, $P_2 = P_3$, and $P_1 > P_3$. Of course, such a pattern of values cannot be true in the three populations.
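The possibility of intransitive results can be illustrated with a small sketch. The counts below are purely hypothetical (they are not data from this book), and the function name is illustrative. Each pair of groups is compared with the large-sample z test of this section at the Bonferroni-Dunn alpha level of .05/3.

```python
import math
from itertools import combinations

def z_test_p_value(x_a, n_a, x_b, n_b):
    """Two-tailed p level for H0: P_a = P_b, using the pooled large-sample z test."""
    p_a, p_b = x_a / n_a, x_b / n_b
    p_bar = (x_a + x_b) / (n_a + n_b)
    se = math.sqrt(p_bar * (1 - p_bar) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Hypothetical (invented) counts: (number with target outcome, sample size).
groups = {"Therapy 1": (70, 100), "Therapy 2": (60, 100), "Therapy 3": (50, 100)}
alpha_per_test = 0.05 / 3  # Bonferroni-Dunn adjustment for three tests

rejected = {}
for (name_a, (x_a, n_a)), (name_b, (x_b, n_b)) in combinations(groups.items(), 2):
    p_level = z_test_p_value(x_a, n_a, x_b, n_b)
    rejected[(name_a, name_b)] = p_level < alpha_per_test

for pair, is_rejected in rejected.items():
    print(pair, "reject H0" if is_rejected else "retain H0")
```

With these invented counts only the comparison of Therapy 1 with Therapy 3 leads to rejection, which mirrors the contradictory pattern described above.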
A method for detecting the pattern of relationships among more than two proportions in independent populations has been proposed by Dayton (2003). The method is similar to Dayton's (2003) method that was discussed in chapter 6. The method can be implemented using Microsoft Excel with or without additional software programs. For details, consult Dayton (2003), who does not recommend his method for researchers who are interested in pairwise comparisons more than the overall pattern of the sizes of the proportions.

Fleiss et al. (2003) discussed the comparison of two proportions in the case of experiments that are called noninferiority trials, which seek evidence that a treatment is not worse than another treatment by a specified amount. These authors also discussed the comparison of proportions in the case of experiments that are called equivalence trials, which seek evidence that a treatment is neither better nor worse than another treatment by a specified amount.
This method is best used when the researcher can make an informed decision about what minimal difference between the two proportions can be reasonably judged to be of no practical importance in a particular instance of research. This issue of selecting a minimally important difference was discussed further by Steiger (2004) in the context of continuous dependent variables. Steiger (2004) described the construction of a confidence interval (exact, if ANOVA assumptions are satisfied) for the purpose of observing whether it contains the selected minimal difference. StatXact software provides exact tests of equivalence, inferiority, and superiority when comparing two proportions in the independent- and dependent-groups cases. Although by definition an exact test provides an exact rate of Type I error, it is possible that an approximate method will be more powerful. An ideal method would yield very accurate p levels while providing very high power (Skipka, 2003).
A repeated-measures version of the kind of experimental research that was discussed in this section is the crossover design. In this counterbalanced design each participant receives each of the two Treatments a and b, one at a time, in either the sequence ab for a randomly chosen one half of
the participants or the sequence ba for the other half of the participants. The rows of a 2 x 2 table can then be labeled ab and ba, and the columns can be labeled a Better and b Better. Refer to Fleiss et al. (2003) for a discussion of the comparison of the proportion of times that Treatment a is better and the proportion of times that Treatment b is better.

APPROXIMATE CONFIDENCE INTERVAL FOR $P_1 - P_2$
Again, for our purpose we demonstrate the simplest method for constructing a confidence interval for the difference between proportions (probabilities) in two independent populations, and then we provide references for more accurate but more complex methods. As is the case for approximate methods, the accuracy of the following large-sample method increases with increasing sample sizes. In general the simplest $(1 - \alpha)$ CI for $P_1 - P_2$ can be approximated by

$$CI_{P_1 - P_2}: (p_1 - p_2) \pm ME, \tag{8.5}$$

where ME is the margin of error in using $p_1 - p_2$ to estimate $P_1 - P_2$,
$$ME = z^{*} s_{p_1 - p_2}, \tag{8.6}$$

where $z^{*}$ is the positive value of z that has $\alpha/2$ of the area of the normal curve beyond it, and $s_{p_1 - p_2}$ is the approximate standard deviation of the sampling distribution of the difference between $p_1$ and $p_2$. If one seeks the usual .95 CI (i.e., $\alpha/2$ = .05/2 = .025), then one will recall or observe in a table of the normal curve that $z^{*}$ = +1.96.

Because we have already found evidence that $P_1 \neq P_2$, for the confidence interval we do not use the same equation for $s_{p_1 - p_2}$ that was used in the denominator of Equation 8.4 when we tested $H_0$: $P_1 = P_2$. For the confidence interval we no longer pool $p_1$ and $p_2$ to estimate the previously supposed common value of $P_1 = P_2 = P$ that we assumed before we rejected $H_0$. Instead, we now estimate the different $P_1$ and $P_2$ values separately using $p_1$ and $p_2$ in the equation for $s_{p_1 - p_2}$,

$$s_{p_1 - p_2} = \left[\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}\right]^{1/2}. \tag{8.7}$$
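Expression 8.5 and Equations 8.6 and 8.7 can be combined in a short sketch. The function name and argument names are hypothetical, the inputs are the rounded proportions for Table 8.1 (.61 and .31), and the critical value is taken from Python's standard-library normal distribution rather than from a printed table.

```python
import math
from statistics import NormalDist

def diff_proportions_ci(p1, n1, p2, n2, confidence=0.95):
    """Large-sample CI for P1 - P2 using the unpooled standard deviation."""
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)                # about 1.96 for a .95 CI
    s_diff = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)  # Equation 8.7
    margin = z_star * s_diff                                     # Equation 8.6
    diff = p1 - p2
    return diff - margin, diff + margin, margin                  # Expression 8.5

lower, upper, margin = diff_proportions_ci(0.61, 36, 0.31, 32)
print(round(margin, 2), round(lower, 2), round(upper, 2))
```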
(One pools $p_1$ and $p_2$ for the significance test because one is then assuming the truth of $H_0$, but there is no such assumption when constructing a confidence interval.)

For the data in Table 8.1, $p_1 - p_2$ = .61 - .31 = .30, $z^{*}$ = +1.96 because we are seeking a .95 CI, $1 - p_1$ = 1 - .61 = .39, $n_1$ = 36, $1 - p_2$ = 1 - .31 = .69, and $n_2$ = 32. Therefore, applying Equation 8.6,
the ME that we subtract from and add to $p_1 - p_2$ is equal to $1.96[.61(1 - .61)/36 + .31(1 - .31)/32]^{1/2}$ = .23. The limits of the confidence interval are thus .30 ± .23. Therefore, we are approximately 95% confident that the interval from .30 - .23 = .07 to .30 + .23 = .53 contains the difference between $P_1$ and $P_2$. Unfortunately, as is often the case, the interval is rather wide. Nonetheless, the interval does not contain the value 0, a finding that is consistent with the result from testing $H_0$: $P_1 = P_2$. Note, however, that sometimes the result of a test of statistical significance at a specific alpha ($\alpha$) level and the $(1 - \alpha)$ CI for $P_1 - P_2$ do not produce consistent results. Refer to Fleiss et al. (2003) for discussion and references regarding such inconsistent results.

Efforts to construct a more accurate confidence interval for $P_1 - P_2$ have been ongoing for decades. Hauck and Anderson (1986) compared competing methods and found that the simple method used in Expression 8.5 and Equation 8.6 can result in an interval that, as wide as it can often be, actually tends to be inaccurately narrow. They recommended a correction for this method. Beal (1987) also compared competing methods and recommended and described a method for which Wilcox (1996) described manual calculation and provided a Minitab macro. Wilcox (2003) also provided an S-PLUS software function for constructing the confidence interval. Refer to Smithson (2003) for another large-sample method for constructing an approximate confidence interval for $P_1 - P_2$. StatXact software constructs an exact confidence interval for the independent- and dependent-groups cases.
Also refer to the discussion and references in Agresti (2002) for both independent-groups and dependent-groups cases. Newcombe (1998) compared eleven methods, and Martin Andres and Herranz Tejedor (2003, 2004) discussed exact and approximate methods. Hou, Chiang, and Tai (2003) proposed, and justified by simulation studies, a method for construction of simultaneous confidence intervals in the case of multinomial proportions (i.e., the case of more than two possible categorical outcomes). Fleiss et al. (2003) and Cohen (1988) discussed and presented tables for estimating needed sample sizes for detecting a specified difference between $P_1$ and $P_2$.

Note that it would not be valid to construct a null-counternull interval for $P_1 - P_2$ using the methods for constructing such an interval that were appropriate earlier in this book because the distribution of $p_1 - p_2$ is not symmetrical. Also, consult Rosenthal (2000) for a modification of this measure.
Recall that many, including Rosenthal (2000), called the difference between two proportions the risk difference, the reason for which is explained in the next section.

RELATIVE RISK AND THE NUMBER NEEDED TO TREAT

Suppose that the data in Table 8.1 had arisen from research in which participants had been randomly assigned to Therapy 1 or Therapy 2, a
supposition that is in fact true in the case of these data. In this case an effect size measure that is generally called the relative risk is applicable. We now turn to the development of this measure. A certain difference between P1 and P2 may have more practical importance when the estimated P values are both close to 0 or 1 than when they are both close to .5. For example, suppose that P1 = .010 and P2 = .001 or that P1 = .500 and P2 = .491. In both cases P1 - P2 = .009, but in the first case P1 is 10 times greater than P2 (P1/P2 = .010/.001 = 10), and in the second case P1 is only 1.018 times greater than P2 (P1/P2 = .500/.491 = 1.018). Thus, the ratio of the two probabilities can be very informative. For 2 × 2 tables the ratio of the two probabilities is the RR (which also is called rate ratio or risk ratio). The estimate of RR, rr, is calculated using the two sample proportions:

$$rr = \frac{p_1}{p_2} \qquad (8.8)$$

As before, p1 and p2 represent the proportion of those participants in Samples 1 and 2, respectively, who fall into the target category, which again can be represented either by column 1 or column 2 in a table such as Table 8.1. For Table 8.1, if column 1 represents the target category then rr1 = (14/36)/(22/32) = .57, and if column 2 represents the target category then rr2 = (22/36)/(10/32) = 1.96. In the latter case there is an estimated nearly 2 to 1 greater probability of therapeutic success for psychotherapy than for drug therapy for the clinical problem at hand. (Because, as previously discussed, a given difference between P1 and P2 has different meanings at different values of P1 and P2, RR may be a more useful effect size for meta-analysts than P1 - P2; Fleiss, 1994.) The name relative risk relates to medical research, in which the target category is classification of people as having a disease versus the other category of not having the disease. One sample has a presumed risk factor for the disease (e.g., smokers), and the other sample does not have this risk factor.
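
The two estimates just described can be checked with a few lines of code. The following is only a minimal sketch of Equation 8.8: the function name `relative_risk` is a hypothetical helper introduced for illustration, and the counts are the Table 8.1 frequencies as they are quoted in the text (14 and 22 of 36 in row 1, 22 and 10 of 32 in row 2).

```python
def relative_risk(target_1, n_1, target_2, n_2):
    """Sample relative risk rr = p1 / p2 for two independent samples.

    target_k is the number of participants in sample k who fall into the
    target category, and n_k is the size of sample k.
    """
    p1 = target_1 / n_1
    p2 = target_2 / n_2
    return p1 / p2


# Column 1 treated as the target category.
rr_col1 = relative_risk(14, 36, 22, 32)
# Column 2 treated as the target category.
rr_col2 = relative_risk(22, 36, 10, 32)

print(round(rr_col1, 2), round(rr_col2, 2))  # prints: 0.57 1.96
```

The sketch makes visible that the same table yields two different ratios depending on which column is treated as the target category, which is why the choice should be stated when rr is reported.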
However, because it seems strange to use the label relative risk when applying the ratio to a column, such as column 2 in Table 8.1, which represents a successful outcome of therapy, in such cases one can simply refer to RR and rr as success rate ratios, or as the ratio of two independent probabilities or the ratio of two independent proportions, respectively. For discussions of methods for constructing a confidence interval for the ratio of two probabilities, consult Bedrick (1987), Gart and Nam (1988), and Santner and Snell (1980). Refer to Smithson (2003) for a large-sample method for constructing an approximate confidence interval for the RR. A large-sample approximate confidence interval can be constructed for RR using the method that is demonstrated for ORpop in the section after the next section. StatXact software constructs an exact confidence interval for the RR. Consult Agresti (2002) for further discussion. As we reiterate throughout this book, all measures of effect size have some limitations. Refer to Fleiss (1994) for a discussion of limitations of RR for research and meta-analysis. One of the limitations of the RR is that its different values depending on one's choice of placement of the two groups in the numerator and denominator can lead to different impressions of the result. The problem arises because, as a ratio of two proportions, the RR or rr can range from 0 to 1 if the group with the smaller proportion (lower risk) happens to be represented in the numerator, but they can range from 1 to ∞ if the group with the smaller proportion is represented in the denominator. The problem can be partially resolved by reporting the logarithm (common or natural) of rr as an estimate of the logarithm of RR. When the smaller proportion is in the numerator log rr can range from 0 to -∞, whereas when the larger proportion is in the numerator log rr can range from 0 to +∞. The actual raw proportions should always be reported no matter how the rr is reported.
The value of the relative risk also varies depending on which of the two outcome categories it is based. For example, consider the case involving the two hospitals that provided coronary bypass surgery, an example that was discussed in the Coefficient of Determination section in chapter 4. We observed previously that the estimated RR, based on the mortality percentages for the two hospitals, was 3.60%/1.40% = 2.57. On the other hand, looking at the survivability percentages for the two hospitals, Breaugh (2003) noted that if one reverses the choice of which hospital's percentages are to appear in the numerator of the ratio, the success rate ratio for these data can be calculated as (100% - 1.40%) / (100% - 3.60%) = 1.02, a result that conveys a much smaller apparent effect of choice of hospital than does the risk ratio of 2.57. This example provides a compelling reason to present the results both ways.
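
The hospital example, and the earlier point about logarithms, can be illustrated numerically. This is a sketch under stated assumptions: the variable names are hypothetical, the two mortality proportions are the ones quoted above, and the natural logarithm is used although the text allows common logarithms as well.

```python
import math

# Mortality proportions for the two hospitals, as quoted in the text.
mortality_a = 0.0360
mortality_b = 0.0140

# Ratio based on the "bad" outcome (death).
risk_ratio = mortality_a / mortality_b
# Ratio based on the "good" outcome (survival), with the hospitals reversed.
success_ratio = (1 - mortality_b) / (1 - mortality_a)

print(round(risk_ratio, 2), round(success_ratio, 2))  # prints: 2.57 1.02

# Swapping numerator and denominator turns a ratio into its reciprocal,
# which lives on a different range (0 to 1 versus 1 to infinity).
# On the log scale the two orderings differ only in sign.
log_rr = math.log(risk_ratio)
log_rr_reversed = math.log(mortality_b / mortality_a)
print(round(log_rr, 3), round(log_rr_reversed, 3))
```

The sign symmetry of the two logarithms is what makes the log scale less sensitive to the arbitrary choice of which group goes in the numerator.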
Rosenthal (2000) also presented an example in which the RR can provide a misleading account of the results, and he presented a modification of the RR, based on the BESD, to correct the problem. Gigerenzer and Edwards (2003) discussed other measures that might be used when RR might be misunderstood by patients or even by health professionals. The RR is applicable to data that arise from research that uses random assignment or from naturalistic or prospective research, but not from retrospective research. We previously defined naturalistic research. In prospective research the researcher selects n1 participants who have a suspected risk factor (e.g., children whose parents have abused drugs) and n2 participants who do not have the suspected risk factor. The two samples are tracked to determine the number from each sample who do and do not develop the target outcome (e.g., abuse drugs themselves). From the definition it should be clear why prospective research is also called cohort, forward-going, or follow-up research.
On the other hand, in retrospective research (also called case-control research) the researcher selects n1 participants who already exhibit the target outcome (the cases) and n2 participants who do not exhibit the target outcome (the controls). The two samples are checked to see how
many in each sample had or did not have the suspected risk factor. Refer to Fleiss et al. (2003) for discussions of a variety of sources of error that are possible in retrospective research and for methods to control or adjust for such errors. A related measure of effect size in 2 × 2 tables in which one group is a treated group and the other is a control or otherwise treated group is the number needed to treat, NNT. The NNT can be defined informally as the number of people that would have to be given the treatment (instead of no treatment or the other treatment) per each such person who would be expected to benefit from it. The more effective a treatment is, relative to the control or competing treatment, the smaller the positive value of NNT, with NNT = 1 being the best result for a treatment. (Values between -1 and +1, exclusive, are problematic.) When NNT = 1 every person who is subjected to the targeted treatment would be expected to benefit.
Formally, in the case of comparing a treated group and a control group, the NNT parameter is defined as the reciprocal of the difference between the probability that a control participant will show no benefit (e.g., symptoms remain) and the probability that a treated person will show no benefit. This measure will be illustrated by pretending (for our present purpose of comparing a control group and a treated group) that row 2 of Table 8.1 represented a control group. The required probabilities are estimated by the relevant proportions in the table. The estimate of the NNT parameter for the data in the now slightly revised Table 8.1 is given by the reciprocal of the difference between the proportion of participants in the control group whose symptoms remain (22/32 = .6875) and the proportion of the participants in the treated group whose symptoms remain (14/36 = .3889). The difference between these two proportions is .6875 - .3889 = .2986. Thus, NNTest = 1/.2986 = 3.35. Rounding to the nearest integer, we use NNTest = 3. We therefore estimate that we would need to treat approximately three people for each person who will benefit. For the case in which these results arise from the data of Table 8.1 as is (i.e., comparing two therapies) we would estimate that for every three patients treated with psychotherapy instead of drug therapy one person will become free of symptoms who would not have otherwise become free of symptoms. Note that the NNT measure can also be used in other areas such as education or organizational psychology (e.g., evaluating the costs-benefits of a remedial program for students or a training program for employees, in which, in both kinds of research, participants will be classified as attaining or not attaining mastery of a targeted skill). The NNT effect size can be informative regarding the practical significance of results.
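
The arithmetic of the NNT estimate can be traced step by step in code. What follows is a minimal sketch of the definition given above, not a general-purpose routine: `nnt_estimate` is a hypothetical helper name, and the counts are those of the revised Table 8.1 as quoted in the text.

```python
def nnt_estimate(no_benefit_control, n_control, no_benefit_treated, n_treated):
    """Estimate NNT as 1 / (p_control - p_treated).

    Each p is the sample proportion of participants who show no benefit
    (e.g., whose symptoms remain) in the control or the treated group.
    """
    p_control = no_benefit_control / n_control
    p_treated = no_benefit_treated / n_treated
    difference = p_control - p_treated
    if difference == 0:
        # No estimated treatment effect: the reciprocal is undefined (infinite).
        raise ValueError("equal proportions: the NNT estimate is infinite")
    return 1 / difference


# Control group: 22 of 32 with symptoms remaining.
# Treated group: 14 of 36 with symptoms remaining.
nnt = nnt_estimate(22, 32, 14, 36)
print(round(nnt, 2), round(nnt))  # prints: 3.35 3
```

The explicit check for a zero difference reflects the fact that an estimate of no effect corresponds to an infinite NNT rather than to NNT = 0.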
Considering the estimated in the the context context of ofthe the cost cost and risks of aa treatment treatment and and the seriousness of of the the illness, illness, or or the and risks of the seriousness the seriousserious ness of the skill, can aid in the decision ness of of the the lack lack of of mastery mastery of the skill, can aid in the decision about about whether should be be adopted. whether aa treatment treatment should adopted. For For example, example, one one would would not not want to adopt expensive somewhat risky treatment want to adopt aa moderately moderately expensive somewhat risky treatment when when unless the disease were the NNTesestt is the NNT is relatively relatively large large unless the disease were sufficiently sufficiently serious. serious.
The values of NNT that might seem to be useful for such decision-making are the upper and lower limits of a confidence interval for NNT. Detailed discussions of the complex topics of significance testing for NNTest and confidence intervals for NNT are beyond the scope of this book. We will merely make some brief comments. First, if one were testing a traditional null hypothesis based on a hypothesis that the treatment has no effect, then one would be attempting a problematic test of H0: NNT = ∞ or a problematic indirect test of significance by examining a confidence interval to observe if the interval contains the value ∞. (Less problematic would be constructing a confidence interval merely for providing some information about the precision of the estimate of NNT.) One approach to confidence intervals involves first constructing a confidence interval for the difference between the two populations' proportions (probabilities) that are involved in the definition of NNT, using one of the methods that were discussed in the previous section and by Fleiss, Levin, and Paik (2003). Then the reciprocals of the confidence limits satisfy the definition of the NNT and thus provide the limits for the NNT.
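
The reciprocal-limits approach can be sketched as follows. This is an illustration only, resting on an assumption that should be stated loudly: the interval for the difference between proportions is computed here with the familiar simple large-sample (Wald) formula, chosen for brevity and not taken from the chapter's own expressions, and the text itself warns that simple intervals of this kind tend to be too narrow. The function name `nnt_interval_from_wald` is hypothetical, and any of the better methods cited above could be substituted for the first step.

```python
import math


def nnt_interval_from_wald(x_control, n_control, x_treated, n_treated, z=1.96):
    """Approximate limits for NNT from a Wald interval for p_control - p_treated.

    x_control and x_treated are the counts of participants showing no benefit.
    ASSUMPTION: the simple large-sample (Wald) interval is used for the
    difference; it is a stand-in for whichever method the analyst prefers.
    """
    p_c = x_control / n_control
    p_t = x_treated / n_treated
    diff = p_c - p_t
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    lower, upper = diff - z * se, diff + z * se
    if lower <= 0 <= upper:
        # The reciprocal of an interval containing 0 is not a single interval.
        raise ValueError("interval for the difference contains 0")
    # Taking reciprocals reverses the order of the limits.
    return 1 / upper, 1 / lower


# Revised Table 8.1: 22 of 32 controls and 14 of 36 treated with symptoms remaining.
nnt_lower, nnt_upper = nnt_interval_from_wald(22, 32, 14, 36)
print(round(nnt_lower, 1), round(nnt_upper, 1))
```

The guard against an interval that contains 0 mirrors the difficulty noted above: when no effect is plausible, the corresponding NNT value is ∞, and the reciprocal limits no longer bound a single connected range.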
There is another approach that is recommended by Schulzer and Mancini (1996), and these authors also discuss NNT in the context of treatments that harm some patients (the number needed to harm, NNH; Mancini and Schulzer, 1999). We are concerned, unless sample sizes are very large, that the various methods for constructing the confidence intervals might lead to greatly varying results, and, therefore, lead to inconsistent recommendations for practitioners. However, one must recognize that the NNT is a relatively new measure derived for 2 × 2 tables, and such tables have a long history of development of competing methods of analysis. Note that some medical researchers attempt to resolve the problem of significance testing for the NNTest by applying a χ² test of association or a t test (numerically coding the outcome categories) to the 2 × 2 table. Regarding the χ² test, a better approach might be to test the significance of the difference between the two proportions as we previously discussed. Regarding the application of a t test to the data of a 2 × 2 table such as Table 8.1, issues arise concerning the facts that in such a case the dependent variable has only two values and is ordinal instead of continuous.
For a discussion of the debate about these latter issues consult the section entitled "Limitations of rpb for Ordinal Categorical Data" in chapter 9. Note also that, because the NNT varies with baseline risk, a point estimate and confidence limits for the NNT as estimated from prior research are most useful for a practitioner whose clients or patients are very similar to those who participated in the research from which the NNT was estimated. The baseline risk is estimated from the proportion of control participants who are classified as having the "bad" event (e.g., 22/32 = .6875 in our presently revised Table 8.1). The lower the baseline risk the lower the justification might be for implementing the
=
1188 88
�
CHAPTER 8 CHAPTER
treatment, depending again on the seriousness of the illness and the overall costs of treatment. For an extensive discussion of the NNT and related measures refer to Sackett et al. (2000). Also consult Laupacis, Sackett, and Roberts (1988) and the many discussions of this and related topics that can be found in the online British Medical Journal (http://bmj.bmjjournals.com).

THE ODDS RATIO

The final effect size for a 2 × 2 table that we discuss here is the odds ratio, which is a measure of how many times greater the odds are that a member of a certain population will fall into a certain category than the odds are that a member of another population will fall into that category. This effect size is applicable to research that uses random assignment, naturalistic research, prospective research, and retrospective research (Fleiss, 1994; Fleiss et al., 2003). Unlike the phi coefficient, the possible range of values of an odds ratio is not limited by the marginal distributions of the contingency table. Because we leave it to the interested reader to apply, as exercises, the methods of this and the next section to the data in Table 8.1, we illustrate this effect size with the naturalistic example in Table 8.2. A sample odds ratio provides an estimate of the ratio of (a) the odds that participants of a certain kind (e.g., women) attain a certain category (e.g., voting Democrat instead of voting Republican) and (b) those same odds for participants of another kind (e.g., men). An odds ratio can be calculated for any pair of categories of a variable (e.g., gender) that is being related to another pair of categories of another variable (e.g., political preference). (For a formal definition of ORpop, consider the common case in which categorization with respect to one of the two variables might be said to precede categorization with respect to the other variable. For example,
type of therapy precedes the symptoms-status outcome in Table 8.1 and being male or female precedes agreeing or disagreeing in Table 8.2. Now label a targeted outcome Category T (e.g., agree), the alternative outcome category being labeled not T. Then label a temporally preceding category (e.g., man) pc. Where P stands for probability, a measure of the odds that T will occur conditional on pc occurring is given by Odds_pc = P(T | pc) / P(not T | pc). Similarly, the odds that T will occur conditional on category pc not occurring (e.g., woman) is given by Odds_notpc = P(T | not pc) / P(not T | not pc). The ratio of these two odds in the population is the odds ratio, ORpop = Odds_pc / Odds_notpc.)

TABLE 8.2
Gender Difference in Attitude Toward a Controversial Statement

            Agree        Disagree
Men         f11 = 10     f12 = 13
Women       f21 = 1      f22 = 23

In Table 8.2, as in Table 8.1, the cell values f11, f12, f21, and f22 represent the counts (frequencies) of participants in the first row and first column, first row and second column, second row and first column, and second row and second column, respectively. We use the category that is represented by column 1 as the target category. The sample odds that a participant who is in row 1 will be in column 1 instead of column 2 are given by f11/f12, which are approximately 10/13 = .77, in the case of Table 8.2. In a study that is comparing two kinds of participants who are represented by the two rows in this example, one can evaluate these odds in relation to similarly calculated odds for participants who are in the second row. The odds that a participant in row 2 will be in column 1 instead of column 2 are given by f21/f22, which are approximately 1/23 = .04, in the case of Table 8.2. The ratio of the two sample odds, denoted OR, is given by (f11/f12) / (f21/f22), which, because (a/b) / (c/d) = (ad) / (bc), is equivalent to

$$OR = \frac{f_{11} f_{22}}{f_{12} f_{21}} \qquad (8.9)$$

Note in Equation 8.9 that each cell frequency is being multiplied by the cell frequency that is diagonally across from it in a table such as Table 8.2. For this reason an odds ratio is also called a cross-products ratio. (Note also that odds are not the same as probabilities. We observed with regard to Table 8.2 that the odds that a man will be in the agree category are given by 10/13 = .77. However, the probability that a man will be in the category agree is estimated by the proportion 10/23 = .43, where 23 is the total number of men in the sample.) Table 8.2 depicts actual data, but the example should be considered to be hypothetical because the column labels, row labels, and the title have been changed to suit the purpose of this section. A very important aspect of these data emerges if we relate the odds that a man will agree instead of disagree to the odds that a woman will agree instead of disagree with a controversial test statement that was presented to all participants by the researcher. Applying Equation 8.9, we find that OR = 10(23) / 13(1) = 17.69. We just found that the odds that a man will agree with the controversial statement are estimated to be nearly 18 times greater than the odds that a woman will agree with it.
However, out of context this result can be somewhat misleading or incomplete, because if one inspects Table 8.2, which the researcher would be obliged to include in a research report, one also observes that in fact in the samples a majority of men (13 of 23) as well as a (larger) majority of women (23 of 24) disagree with the statement.
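
The distinction between odds, probabilities, and the odds ratio in this example can be laid out in a short computation. This is a sketch for illustration, with hypothetical variable names, using the four cell frequencies of Table 8.2.

```python
# Cell frequencies from Table 8.2.
f11, f12 = 10, 13  # men: agree, disagree
f21, f22 = 1, 23   # women: agree, disagree

# Sample odds of agreeing instead of disagreeing, within each row.
odds_men = f11 / f12
odds_women = f21 / f22

# Sample odds ratio as a cross-products ratio (Equation 8.9).
odds_ratio = (f11 * f22) / (f12 * f21)

# Odds are not probabilities: compare with the row proportions.
prob_agree_men = f11 / (f11 + f12)
prob_disagree_men = f12 / (f11 + f12)
prob_disagree_women = f22 / (f21 + f22)

print(round(odds_men, 2), round(odds_women, 2))  # prints: 0.77 0.04
print(round(odds_ratio, 2))                      # prints: 17.69
print(round(prob_agree_men, 2))                  # prints: 0.43
print(round(prob_disagree_men, 2), round(prob_disagree_women, 2))
```

Printing the disagreement proportions next to the odds ratio shows the point of the caution above: a large OR coexists here with majorities in both rows falling in the same category.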
Both the sample OR and the parameter ORpop range from zero to infinity, attaining either of these extreme values when one of the cell frequencies is zero. When there is no association between the row and column variables, ORpop = 1. A zero cell frequency in the population (called a structural zero) would be unlikely because in most research in the behavioral and social sciences it would be unlikely that a researcher would include a variable into one of whose categories no member of the population falls. However, observe in the real data in Table 8.2 that we came very close to having a zero in sample cell21, in which f21 = 1. In research in which ORpop would not likely be zero or infinity, a value of zero or infinity for the sample OR would be unwelcome. When an empty cell in sample data does not reflect a zero population frequency for that cell, a solution for this problem of a mere sampling zero is required.
One One of the solutions would increase one's adding an the possible possible solutions would be be to to increase one's chance chance of of adding an en entry or increasing total sample size size by try or entries entries to to the the empty empty cell cell by by increasing total sample by aa fixed fixed number. just the the sample number. Another Another solution, solution, which which is is common, common, is is to to ad adjust sample OR OR to by adding to the the frequency of each to ORaadjd"� by adding aa very very small small constant constant to frequency of each cell, cell, not not just the empty in the the literature just to to the empty one. one. Recommended Recommended such such constants constants in literature -8 have been as small as as 11CT 0-8 and and as large as 5. have been as small as large as ..5. Refer to to Agresti's Agresti's ((1990, discussions of of the the problem problem of of the the Refer 1 9 90, 2002) discussions empty frequency is zero, adding constant, such such empty cell. cell. Even Even when when no no cell cell frequency is zero, adding aa constant, as 5 , has been recommended estimator of If a as ..5, has been recommended to to improve improve OR OR as as an an estimator of ORpo .. If a constant has has been been added added to to each each cell, cell, the the researcher should report report hnavav constant researcher should ing done so and report OR the adjusted to each each OR and and the adjusted OR, OR, OR ORad 5 to and report adj... Adding ..5 . 2 changes 7. 6 9 to 2 . 1 9, w hich isisstill cell cell in in Table Table 88.2 changes OR ORfrom from 117.69 to 112.19, which stillimpres impressively sively large. large. 
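The adjustment just described can be checked with a few lines of Python. This is only an illustrative sketch: the function name `odds_ratio` and its parameter `k` are our own, and the cell frequencies are those of Table 8.2 as they appear in the worked calculations of this chapter (men: 10 agree, 13 disagree; women: 1 agree, 23 disagree).

```python
# Cell frequencies from Table 8.2 (rows: men, women; columns: agree, disagree).
f11, f12 = 10, 13   # men
f21, f22 = 1, 23    # women


def odds_ratio(a, b, c, d, k=0.0):
    """Sample odds ratio (a*d)/(b*c) after adding the constant k to every cell."""
    return ((a + k) * (d + k)) / ((b + k) * (c + k))


or_raw = odds_ratio(f11, f12, f21, f22)          # unadjusted OR
or_adj = odds_ratio(f11, f12, f21, f22, k=0.5)   # OR_adj with .5 added to each cell

print(round(or_raw, 2), round(or_adj, 2))  # 17.69 12.19
```

Because the constant is a parameter, the same function can be used to see how sensitive the estimate is to the choice of constant, which is one reason for reporting both OR and ORadj.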
Note that adding a constant to each cell can sometimes actually cause OR to provide an inaccurate estimate of ORpop and lower the power of a test of statistical significance of OR, which provides even more reason to report results with both OR and ORadj. Consult Agresti (2002) for discussions of adjustment methods that are less arbitrary than adding constants to cells.
Again, no measure of effect size is without limitations. Refer to Rosenthal (2000) for an illustration of results for which the odds ratio can be misleading and for his suggested modification (based on the BESD) of the odds ratio to correct the problem. For example, as was previously discussed with regard to rr and RR, the possible range of values for OR and ORpop is 0 to 1 or 1 to ∞ depending on which group is represented in the numerator or denominator. Again, the results can be presented both ways, or the result can be transformed to logarithms as before. For a review of criticisms and suggested modifications of odds ratios consult Fleiss et al. (2003), and for further discussions consult Agresti (2002), Fleiss (1994), Haddock et al. (1998), and the book on odds ratios by Rudas (1998).
The null hypothesis H0: ORpop = 1 can be tested approximately against the alternative hypothesis Halt: ORpop ≠ 1 using the common corrected χ² test of association (i.e., subtracting .5 in the numerator before squaring). This method becomes more accurate as the expected frequencies in each cell become larger. The method should not be used when any such expected frequency is below 5. The test statistic is

$$\chi^2 = \sum \frac{\left(\left|f_{rc} - n_r n_c / N\right| - .5\right)^2}{n_r n_c / N} \tag{8.10}$$
where the summation is over the four cells of the table, $f_{rc}$ is the observed frequency in a cell, and $n_r$ and $n_c$ are the total frequency for the particular row and the total frequency for the particular column that a given cell is in, respectively. The value $n_r n_c / N$ is the expected frequency for a given cell under the null hypothesis. (Note that, unlike the case in which χ² is used to calculate phi, the numerator-adjusted Equation 8.10 should be used here.)
Applying the data in Table 8.2 to Equation 8.10 for manual calculation one finds that χ² = [|10 − (23 × 11/47)| − .5]²/(23 × 11/47) + [|13 − (23 × 36/47)| − .5]²/(23 × 36/47) + [|1 − (24 × 11/47)| − .5]²/(24 × 11/47) + [|23 − (24 × 36/47)| − .5]²/(24 × 36/47) = 8.05. A table of critical values for χ², which can be found in any textbook of introductory statistics, reveals that when df = (r − 1)(c − 1) = (2 − 1)(2 − 1) = 1, the value 8.05 is statistically significant beyond the .005 level. One has sufficient evidence that ORpop does not equal 1. Refer to Fleiss et al. (2003) for detailed discussion of approximate and exact p values for this case. There are various methods for constructing a confidence interval for ORpop, to which we turn in the next section.
CONSTRUCTION OF CONFIDENCE INTERVALS FOR ORpop
An approximate confidence interval for ORpop that is based on the normal distribution can be constructed indirectly. The larger the sample, the better the approximation. Again we present the simplest method for our purpose, and then we cite references for more accurate but more complex methods. First, a confidence interval for the natural logarithm of ORpop, ln ORpop, is constructed because as sample size increases a quicker approximation to a normal distribution is attained by the sampling distribution of ln OR than by the sampling distribution of OR. Then, the antilogarithms of the limits of this interval provide the limits of the confidence interval for ORpop itself. Adding the constant .5 to each cell frequency might reduce the bias in estimating ln ORpop, so we use this adjustment. The limits of the (1 − α) CI for ln ORpop are approximated by

$$\ln OR_{adj} \pm z_{\alpha/2}\, S_{\ln OR} \tag{8.11}$$
where $z_{\alpha/2}$ is the value of z (the standard normal deviate) beyond which lies the upper α/2 proportion of the area under the normal curve and $S_{\ln OR}$ is the standard deviation (the standard error) of the sampling distribution of ln OR:

$$S_{\ln OR} = \left[\frac{1}{f_{11} + .5} + \frac{1}{f_{12} + .5} + \frac{1}{f_{21} + .5} + \frac{1}{f_{22} + .5}\right]^{1/2} \tag{8.12}$$
Although the accuracy of this method is problematic for the relatively small sample sizes that are typical of behavioral research, compared to, say, epidemiological research (i.e., disease-incidence research), for illustrative purposes we will apply the method to the data at hand. Using the data in Table 8.2,

$$S_{\ln OR} = \left[\frac{1}{10 + .5} + \frac{1}{13 + .5} + \frac{1}{1 + .5} + \frac{1}{23 + .5}\right]^{1/2} = .937.$$
If seeking the usual (1 − α) = (1 − .05) = .95 CI, one uses $z_{\alpha/2}$ = 1.96 because a total of .05 (i.e., .025 + .025) of the area of the normal curve lies in the tails beyond z = ±1.96. Therefore, the .95 confidence limits for ln ORpop, based on ORadj = 12.19 from the previous section, are ln(12.19) ± 1.96(.937), which are .664 and 4.337. The antilogarithms of .664 and 4.337 yield, as the .95 confidence limits for ORpop itself, 1.94 and 76.48.
Surprisingly, considering the vastness of the interval that we constructed for ORpop (1.94 to 76.48), the method that has been presented here will likely lead to a confidence interval that is too liberal; that is, it is narrower than the actual interval. For a better approximation a more complex traditional method is available (Cornfield, 1956; Gart & Thomas, 1972). Additional discussions of approximate and exact confidence intervals for ORpop can be found in Fleiss et al. (2003) and the references therein. Also consult Agresti (1990, 2002). SAS Version 9 and StatXact construct a confidence interval for ORpop in which the 1 − α confidence level (e.g., .95) is exact. The latter package includes software for both the independent- and dependent-groups cases.
If the null hypothesis that is being tested originally is H0: ORpop = 1 (i.e., no association), this is equivalent to testing H0: ln ORpop = 0. Therefore, because the distribution of ln OR is symmetrical, one can also construct a null-counternull interval indirectly for ORpop using Equation 3.15 in chapter 3 by starting with such an interval for ln ORpop. Recall from chapter 3 that the null value of the interval is the null-hypothesized value of the effect size (ES), which in this logarithmic case is 0, and the counternull value is 2ES, which is 2 ln(12.19) = 2(2.5) = 5. Taking antilogarithms of 0 and 5, the null-counternull interval for ORpop itself ranges from 1 to 148.41, again a disappointingly wide interval. Readers might be concerned about a null-counternull interval as wide ranging as 1 to 148.41. In this regard note that it is intrinsic to the null-counternull interval to grow wider the larger the obtained estimate of the effect size because its starting point is always the null-hypothesized value of ES (usually the extreme value of ES that indicates no association), and its endpoint, in the case of symmetrical sampling distributions, is twice the obtained value of the estimate of ES. Also, unlike a confidence interval, a null-counternull interval cannot be made narrower by increasing sample size.
Null-counternull intervals are simple to construct when estimators of an effect size are symmetrically distributed. In this book we used null-counternull intervals for cases in which there is no completely satisfactory method for constructing a confidence interval or for cases in which the methods for constructing confidence intervals are complex and presented well in references that we cite. However, the original intended uses of null-counternull intervals were to demonstrate that (a) a statistically significant attained p level does not necessarily imply a large effect, and (b) a statistically insignificant ES might provide as much evidence that ESpop = 2ES as it provides for the null hypothesis that ESpop = 0 (Rosenthal et al., 2000). Using the data in Table 8.2 we found in the previous section that χ² is statistically significant and that the estimate of effect size is moderately large, ORadj = 12.19, estimating that the odds that a man will agree with the researcher's presented controversial statement are more than 12 times greater than the odds that a woman will agree with that statement. When possible, elaboration of results that involve a large estimate of effect size might be better undertaken by constructing a confidence interval for the effect size than by constructing a null-counternull interval for it that is likely to be very wide in the case of a large estimated effect size.
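As a check on the arithmetic in this section, the following Python sketch computes the standard error of ln OR, the approximate .95 confidence limits for ORpop, and the null-counternull interval from the Table 8.2 frequencies. The variable names are our own. Because the sketch carries full precision instead of the rounded intermediate values used in the text (e.g., 2.5 for ln 12.19), its results agree with the text only to within rounding (an upper confidence limit near 76.5 and a counternull value near 148.5).

```python
import math

# Table 8.2 frequencies (men: agree, disagree; women: agree, disagree).
f11, f12, f21, f22 = 10, 13, 1, 23
k = 0.5      # constant added to each cell
z = 1.96     # z value for a .95 confidence level

or_adj = ((f11 + k) * (f22 + k)) / ((f12 + k) * (f21 + k))
ln_or = math.log(or_adj)

# Standard error of ln OR from the adjusted cell frequencies.
se_ln_or = math.sqrt(sum(1 / (f + k) for f in (f11, f12, f21, f22)))

# Confidence limits on the log scale, then antilogarithms for ORpop itself.
ln_lower, ln_upper = ln_or - z * se_ln_or, ln_or + z * se_ln_or
ci_lower, ci_upper = math.exp(ln_lower), math.exp(ln_upper)

# Null-counternull interval: null value ln OR = 0, counternull value 2 ln OR.
null_value = math.exp(0.0)
counternull_value = math.exp(2 * ln_or)

print(round(se_ln_or, 3))
print(round(ci_lower, 2), round(ci_upper, 2))
print(null_value, round(counternull_value, 2))
```

Note that the counternull value on the original scale is simply the square of ORadj, which makes plain why this interval must widen as the estimated effect size grows.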
TABLES LARGER THAN 2 × 2
It would be beyond the scope of this book to present a detailed discussion of measures of effect size for r × c tables that are larger than 2 × 2, which we call large r × c tables, or for tables that involve more than two categorical variables (multiway tables; e.g., Table 8.3). For example, if Table 8.2 had an additional column for the no opinion category it would be an example of a large r × c table (specifically, a 2 × 3 table). It will suffice to discuss two common methods, make some general comments, and provide references for detailed treatment of the possible methods.
One may begin analysis of data in a large r × c table with the usual χ² test of association between the row and column variables with df = (r − 1)(c − 1). The traditional measures of the overall strength of association between the row and column variables, when sampling has been naturalistic, are the contingency coefficient (CCpop) and Cramer's Vpop, which are estimated by

$$CC = \left[\frac{\chi^2}{\chi^2 + N}\right]^{1/2} \tag{8.13}$$

and

$$V = \left[\frac{\chi^2}{N \min(r - 1,\, c - 1)}\right]^{1/2} \tag{8.14}$$

where min(r − 1, c − 1) means the smaller of r − 1 and c − 1. Cramer's Vpop ranges from 0 (no association) to 1 (maximum association). However, the upper limits of the CC and CCpop are less than 1; and unless r = c, V can equal 1 even when there is less than a maximum association between the row and column variables in the population. Refer to Siegel and Castellan (1988) for further discussion of this limitation of V. Observe that for 2 × c (or r × 2) tables min(r − 1, c − 1) = 1; therefore, V = [χ²/N(1)]^(1/2) in this case, which is the phi coefficient. (As noted with regard to phipop in the section Chi-Square Test and Phi, Vpop is a kind of average effect, the square root of the mean of the squared standardized effects. For formal expressions for the parameters CCpop and Vpop and further discussions, consult Hays, 1994, Liebetrau, 1983, and Smithson, 2003.) A value for V is provided by SPSS.

TABLE 8.3
An Example of a Multiway Table

                        Democrat    Republican    Other
White       Female
            Male
Nonwhite    Female
            Male

Two or more values of the CC should not be compared or averaged unless they arise from tables with the same number of rows and the same number of columns. Also, two or more values of V should not be compared or averaged unless they arise from tables with the same min(r, c). Refer to Smithson (2003) for methods for constructing a confidence interval for CCpop and Vpop using computing routines for some major software packages. StatXact and SPSS Exact calculate exact contingency coefficients.
The CCpop and Cramer's Vpop, as measures of the overall association between the two variables, are not as informative as are finer-grained indices of strength of association in a large r × c table. Also, it has been difficult for statisticians to develop a very satisfactory single index of the overall association. Refer to Agresti (1990, 2002) for discussions of several such indices for large r × c tables and of methods for partitioning such tables into smaller tables for more detailed χ² analyses.
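To make the two estimators in Equations 8.13 and 8.14 concrete, the following Python sketch computes the usual (unadjusted) χ², the contingency coefficient, and Cramer's V for a 2 × 3 table. The frequencies are hypothetical, invented here only for illustration (Table 8.2 with an imagined no opinion column), and the function names are our own.

```python
import math


def pearson_chi2(table):
    """Usual (unadjusted) chi-square for an r x c table; returns (chi2, N)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, f in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (f - expected) ** 2 / expected
    return chi2, n


def contingency_coefficient(chi2, n):
    """Equation 8.13: CC = [chi2 / (chi2 + N)] ** (1/2)."""
    return math.sqrt(chi2 / (chi2 + n))


def cramers_v(chi2, n, r, c):
    """Equation 8.14: V = [chi2 / (N * min(r - 1, c - 1))] ** (1/2)."""
    return math.sqrt(chi2 / (n * min(r - 1, c - 1)))


# Hypothetical 2 x 3 table: rows men, women; columns agree, disagree, no opinion.
table = [[10, 13, 7],
         [1, 23, 6]]

chi2, n = pearson_chi2(table)
cc = contingency_coefficient(chi2, n)
v = cramers_v(chi2, n, r=len(table), c=len(table[0]))
print(round(chi2, 2), round(cc, 3), round(v, 3))
```

Because this table has only two rows, min(r − 1, c − 1) = 1 and V reduces to [χ²/N]^(1/2), as observed in the text.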
There are methods that attempt to pinpoint the source or sources of association in the subparts of a large r × c table. For example, consider that in any r × c table each sample has some proportion, p, of its members in the target category, where p ranges from 0 to 1. If, say, r represents the number of samples, there are r such proportions. In the case of research that uses random assignment an unadjusted χ² test (i.e., no constant subtracted in the numerator) with df = r − 1 can be used to test for the statistical significance of the differences, overall, of these r proportions. The method was demonstrated by Fleiss et al. (2003). However, if this χ² is statistically significant, it does not necessarily mean that each sample proportion is statistically significantly different from each other sample proportion. Therefore, the next task is to determine which proportions are statistically significantly different from which other proportions.
There are various methods for this task. References were provided by Fleiss et al. (2003), who demonstrated a method for an r × 2 table. This method involves dividing all of the r samples into two groups of samples, the $r_a$ group and the $r_b$ group. The overall proportion of all of the members of the $r_a$ group who fall into the target category is then compared to the overall proportion of all of the members of the $r_b$ group who fall into the target category. Another χ² test with df = 1 is used for this purpose. Two additional χ² tests are then conducted to determine if there is a statistically significant difference among the $r_a$ group of proportions (df = $r_a$ − 1) and among the $r_b$ group of proportions (df = $r_b$ − 1). Note that the equations for the degrees of freedom for the last three χ² tests assume that division of the total set of samples into the specific $r_a$ and $r_b$ groups of samples had been planned before the data had been collected. If the division into the two specific groups had not been planned before the data had been collected but was instead based on inspection of the data, the procedure that was just outlined is invalid and must be modified because it capitalizes on chance (a concept that was discussed in the section Tentative Recommendations in chap. 3). A simple solution to the problem is to use df = r − 1 for each of the last three χ² tests instead of the previously stated df = 1, df = $r_a$ − 1, and df = $r_b$ − 1, respectively.
Simple descriptive aids to interpretation of the sample results in large r × c tables are tables or bar graphs showing the percentage of each kind of participant who fall into a target category of interest. For example, if Table 8.2 had a third no opinion column it would be informative to see a table or bar graph that depicts the percentages of men and women who fall into the categories of agree, disagree, and no opinion. Gliner, Morgan, and Harmon (2002) provided a specific example.
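The partitioning procedure outlined above can be sketched in Python as follows. This is a tentative illustration only. The text does not give the formulas, so the sketch assumes a common additive form in which every component χ² uses the pooled p(1 − p) from all r samples in its denominator, so that the three components sum to the overall χ²; readers should check this assumption against Fleiss et al. (2003) before relying on it. The data and the function name `homogeneity_chi2` are hypothetical.

```python
def homogeneity_chi2(hits, sizes, pooled_pq=None):
    """Unadjusted chi-square for differences among several sample proportions.

    hits[i] of the sizes[i] members of sample i fall in the target category.
    pooled_pq lets a sub-test reuse p(1 - p) from all samples combined
    (an assumption of this sketch, not a formula given in the text).
    """
    n = sum(sizes)
    p_bar = sum(hits) / n
    if pooled_pq is None:
        pooled_pq = p_bar * (1 - p_bar)
    return sum(m * (x / m - p_bar) ** 2 for x, m in zip(hits, sizes)) / pooled_pq


# Hypothetical data: r = 4 samples of 50 participants each.
hits = [12, 15, 30, 34]
sizes = [50, 50, 50, 50]

p_all = sum(hits) / sum(sizes)
pq = p_all * (1 - p_all)
chi2_overall = homogeneity_chi2(hits, sizes)            # df = r - 1 = 3

# Planned division: samples 1-2 form the a group, samples 3-4 the b group.
a, b = slice(0, 2), slice(2, 4)
n_a, n_b = sum(sizes[a]), sum(sizes[b])
p_a, p_b = sum(hits[a]) / n_a, sum(hits[b]) / n_b

chi2_between = (n_a * n_b / (n_a + n_b)) * (p_a - p_b) ** 2 / pq    # df = 1
chi2_within_a = homogeneity_chi2(hits[a], sizes[a], pooled_pq=pq)   # df = r_a - 1
chi2_within_b = homogeneity_chi2(hits[b], sizes[b], pooled_pq=pq)   # df = r_b - 1

print(round(chi2_overall, 2))
print(round(chi2_between, 2), round(chi2_within_a, 2), round(chi2_within_b, 2))
```

If the division into groups was chosen after inspecting the data, the same three statistics would each be referred to the χ² distribution with df = r − 1, as the text recommends.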
ODDS RATIOS FOR LARGE r × c TABLES
Recall that an odds ratio applies to a 2 × 2 table. However, a researcher should not divide a large r × c table, step by step, into all possible 2 × 2 subtables to calculate a separate OR for each of these subtables. In this invalid method each cell would be involved in more than one 2 × 2 subtable and in more than one OR, resulting in much redundant information. The number of theoretically possible 2 × 2 subtables is [r(r − 1)/2][c(c − 1)/2]. However, Agresti (1990, 2002) provided a demonstration of the fact that using only cells in adjacent rows and adjacent columns results in a minimum number of ORs that serve as a sufficient descriptive fine-grained analysis of the association between the row and column variables for the sample data.
Goodman (1964, 1969) presented a method for constructing simultaneous confidence intervals for a full set of population ORs, but this method is too conservative for our interest in comparing only a nonredundant set of population ORs. (Simultaneous confidence intervals were defined in the section Shift-Function Method in chap. 5.) For a demonstration of a simpler procedure that produces narrower confidence intervals, refer to Wickens (1989). Consult Rudas (1998) for further discussion of odds ratios for r × c tables in general.
MULTIWAY TABLES
Recall that contingency tables that relate more than two categorical variables, each of which consists of two or more categories, are called multiway tables. An example would be a table that relates the independent variables ethnicity and gender to the dependent variable political affiliation, although the variables do not have to be designated as independent variables or dependent variables. Table 8.3, a 2 × 2 × 3 table, illustrates this hypothetical example.
It would be beyond the scope of this book to encapsulate the literature on effect sizes for multiway tables. Refer to the book by Wickens (1989) for an overview from the perspective of research in the social sciences. Those who consult that book should note that Wickens (1989) called some measures of effect size association coefficients. Rudas (1998) discussed odds ratios for tables in which there are two categories for each of more than two variables (called 2^k tables).
RECOMMENDATIONS
When a researcher has undertaken naturalistic sampling, in which only the total number of participants has been chosen, and then the participants are classified with respect to two truly dichotomous variables in a 2 × 2 table, appropriate measures of effect size are the phi coefficient (taking into consideration its further limitations regarding meta-analysis), relative risk, and the odds ratio. In a study in which the researcher has randomly assigned the participants into two treatment groups to be classified in a 2 × 2 table, appropriate measures of effect size are the difference between two population probabilities (proportions), relative risk, and the odds ratio. The difference between two probabilities can also be used in the cases of prospective and retrospective sampling, and relative risk is also applicable to prospective sampling. These recommendations are summarized in Table 8.4.
Because very different perspectives on the results can be provided by the different measures of effect size, a research report should include the values for estimates of the various appropriate measures. A research report should also include any contingency table on which a reported estimate of an effect size is based. Providing a contingency table is especially important to enable readers of the report to calculate other estimates of effect sizes if the researcher has not presented estimates for each of the appropriate measures and to enable readers to check the symmetry of the row and column marginal distributions. Consult Fleiss (1994) and Haddock et al. (1998) for further discussions. For two-way tables that are larger than 2 × 2, if sampling has been naturalistic, researchers can report Cramer's V with a cautionary remark about its limitations. A recommended approach when such a table has resulted from a study that used random assignment is to apply the method in Fleiss et al. (2003) for comparing multiple proportions.
There are many methods for analyzing data in contingency tables that are beyond the scope of discussion here.
StatXact and LogXact are specialized statistical packages for such analyses. In chapter 9 we apply the measure that we called the probability of superiority (chap. 5) to contingency tables in which the two or more outcome categories have a meaningful order (e.g., participants categorized as worse, unchanged, or better after treatment or responses consisting of agree strongly, agree, disagree, and disagree strongly).
A problem of comparability of effect sizes arises when a meta-analyst encounters a combination of studies that used a continuous dependent variable measure, from which estimates of standardized-difference effect sizes can be calculated, and studies that dichotomized the same dependent variable measure and presented the data in a 2 × 2 table. However, there are methods for estimating a standardized-difference effect size from data in a 2 × 2 table so that results from the two kinds of studies can be combined in a meta-analysis (Sanchez-Meca et al., 2003). Alternatively, for this problem a meta-analyst might estimate the probability of superiority (PS) for continuous dependent variable measures and, as we demonstrate in the next chapter, apply the PS to the data in the 2 × 2 tables. The use of the PS in meta-analyses was discussed by Laird and Mosteller (1990) and Mosteller and Chalmers (1992).

TABLE 8.4
Effect Sizes for 2 × 2 Tables

                                  Appropriate Effect Sizes
Method of Categorization    phipop    P1 − P2    RR     ORpop
Naturalistic                Yes       No         Yes    Yes
Random Assignment           No        Yes        Yes    Yes
Prospective                 No        Yes        Yes    Yes
Retrospective               No        Yes        No     Yes

QUESTIONS
1. Name two synonyms for unordered categorical variables.
2. Distinguish between a nominal variable and an ordinal categorical variable, providing an example of each that is not in the text.
3. Define cross-classification table.
4. Why does a contingency table have that name?
5. Define naturalistic sampling and state one other name for it.
6. Misclassification is related to what common problem of measurement?
7. Why might the application of the methods for 2 × 2 tables in this chapter be problematic if applied to dichotomized variables instead of originally dichotomous variables?
8. Why is chi-square not an example of an effect size?
9. In which way is a phi coefficient a special case of the common Pearson r?
10. How does phi, as an effect size, compensate for the influence of sample size on chi-square?
11. How does one interpret a positive or negative value for phi in terms of the relationship between the two rows and the two columns?
12. If rows 1 and 2 are switched, or if columns 1 and 2 are switched, what would be the effect on a nonzero value of phi?
13. Why is phi only applicable to data that arise from naturalistic sampling?
14. For which kinds of sampling or assignment of participants is the difference between two proportions an appropriate effect size?
15. What do proportions in a representative sample estimate in a population?
16. When the difference between two proportions is transformed into a z, what influences how closely the distribution of such z values approximates the normal curve?
17. Provide a general kind of example of intransitive results when making pairwise comparisons from among k > 2 proportions. (A general answer stated symbolically suffices.)
18. What influences the accuracy of the normal-approximation procedure for constructing a confidence interval for the difference between two probabilities?
19. Explain why the interpretation of a given difference between two probabilities depends on whether both probabilities are close to 1 or 0, or both are close to .5.
20. Define relative risk, and explain when it is most useful.
21. What might be a better name than relative risk when this measure is applied to a category that represents a successful outcome?
22. Discuss a limitation of relative risk as an effect size.
23. For which kinds of categorizing or assignment of participants is relative risk applicable?
24. Define prospective and retrospective research.
25. Define odds ratio in general terms.
26. Define odds ratio formally.
27. To which kinds of categorization or assignment of participants is an odds ratio applicable?
28. Calculate and interpret an odds ratio for the data in Table 8.1.
29. Construct and interpret a confidence interval for the population odds ratio for the data in Table 8.1.
30. Why is an empty cell problematic for a sample odds ratio?
31. How does one test the null hypothesis that the population odds ratio is equal to 1 against the alternate hypothesis that it is not equal to 1?
32. Construct a null-counternull interval for the population odds ratio for the data in Table 8.2.
33. In which circumstance would it not be surprising that a null-counternull interval is very wide?
34. Name two common measures of the overall association between row and column variables for tables larger than 2 × 2.
35. For which kind of sampling are the two measures in Question 34 applicable?
36. Two or more values of the CC should only be compared or averaged for tables that have what in common?
37. Two or more values of V should only be compared or averaged for tables that have what in common?
38. Why should a research report always present a contingency table on whose data an estimate of effect size is reported?
39. Define the NNT and discuss its meaning.
40. Discuss the problem of testing the significance of an estimate of NNT.
Chapter 9

Effect Sizes for Ordinal Categorical Variables
INTRODUCTION

Often one of the two categorical variables that are being related is an ordinal categorical variable, a set of categories that, unlike a nominal variable, has a meaningful order. Examples of ordinal categorical variables include the set of rating-scale categories Worse, Unimproved, Moderately Improved, and Much Improved; the set of attitudinal scale categories Strongly Agree, Agree, Disagree, and Strongly Disagree; the set of categories Applicant Accepted, Applicant on Waiting List, Applicant Rejected; and the scale from Introversion to Extroversion. The technical name for such ordinal categorical variables is ordered polytomy. The focus of this chapter is on some relatively simple methods for estimating an effect size in tables with two rows that represent two groups and three or more columns that represent ordinal categorical outcomes (2 × c tables). (The methods also apply to the case of two ordinal categorical outcomes. However, with fewer categories, the number of tied outcomes between the groups is more likely to increase, a matter that is discussed later in this chapter.) Table 9.1 provides an example with real data in which participants were randomly assigned to one or another treatment. Of course, the roles of the rows and columns can be reversed, so the methods also apply to comparable r × 2 tables. The clinical details do not concern us here, but we do observe that the Improved column reveals that neither Therapy 1 nor Therapy 2 appears to have been very successful. However, this result is perhaps less surprising when we note that the results were based on a 4-year follow-up study after therapy and the presenting problem (marital problems) was likely deteriorating just prior to the start of therapy. The data are from D. K. Snyder, Wills, and Grady-Fletcher (1991).

TABLE 9.1
Ordinal Categorical Outcomes of Two Psychotherapies

               1        2            3
               Worse    No Change    Improved    Total
Therapy 1      3        22           4           29
Therapy 2      12       13           1           26

Note. The data are from "Long-term effectiveness of behavioral versus insight-oriented marital therapy: A four-year follow-up study," by D. K. Snyder, R. M. Wills, and A. Grady-Fletcher, 1991, Journal of Consulting and Clinical Psychology, 59, p. 140. Copyright © 1999 by the American Psychological Association. Adapted with permission.

Gliner et al. (2002) provided reminders of two important points about the use of ordinal categorical scales. First, the number of categories to be used should be the greatest number of categories into which the participants can be reliably placed. Second, if the data are originally continuous it is generally not appropriate (due to a likely decrease in statistical power) to slice the continuous scores into ordinal categories. Note also that one should be very cautious about comparing effect sizes across studies that involve attitudinal scales. Such effect sizes can vary if there are differences in the number of items, number of categories of response, or the proportion of positively and negatively worded items across studies. Refer to Onwuegbuzie and Levin (2003) for further discussion.

The statistical significance of the association between the row and column variables, as well as the effect size that is used to measure the strength of that association, might vary depending on who is doing the categorizing. For example, there may not be high interobserver reliability in the categorization done by a patient, a close relative of the patient, or a professional observer of the patient. Therefore, a researcher should be appropriately cautious in interpreting the results. (Refer to Davidson, Rasmussen, Hackett, & Pitrosky, 2002, for an example of comparing effect sizes for patient-rated and observer-rated scales in generalized anxiety disorder.) A related concern has been raised about the use of a researcher's rating of the status of patients after treatment with a drug, even under double-blind conditions, in cases in which the researcher has a monetary relationship with the drug company. This and other possibly drug-favoring methodologies (Antonuccio, Danton, & McClanahan, 2003) might inflate the estimate of effect size.

Before discussing estimation of effect sizes for such data we briefly consider the related problem of testing the statistical significance of the association between the row and column variables. Suppose that the researcher's hypothesis is that one specified treatment is better than the other, that is, a specified ordering of the efficacies of the two treatments.
Such a research hypothesis leads to a one-tailed test. Alternatively, suppose that the researcher's hypothesis is that one treatment or the other (unspecified) is better, a prediction that there will be an unspecified ordering of the efficacies of the two treatments. This latter hypothesis leads to a two-tailed test. One or the other of these two ordinal hypotheses provides the alternative to the usual H_0 that posits no association between the row and column variables. An ordinal hypothesis is a hypothesis that predicts not only a difference between the two treatments in the distribution of their scores in the outcome categories (columns in this example) but a superior outcome for one (specified or unspecified) of the two treatments. These two typical ordinal researchers' hypotheses are of interest in this chapter.

A χ² test is inappropriate to test the null hypotheses at hand because the value of χ² is insensitive to the ordinal nature of ordinal categorical variables. In this ordinal case a χ² test can only validly test a not very useful "nonordinal" researcher's hypothesis that the two groups are in some way not distributed the same in the various outcome categories (Grissom, 1994b). Also, recall from chapter 8 that the magnitude of χ² is not an estimator of effect size because it is very sensitive to sample size, not just to the strength of association between the variables. (The Kolmogorov-Smirnov two-sample test would be a better choice than the χ² test for testing H_0 against a researcher's hypothesis of superiority of one treatment over another, but this test also has unacceptable shortcomings for this purpose; Grissom, 1994b.) Although there are other, more complex, approaches to data analysis for a 2 × c contingency table with an ordinal categorical outcome variable, in this chapter we consider those that involve relatively simple measures of effect size: the point-biserial correlation (perhaps the most problematic in this case), the probability of superiority, the dominance measure, the generalized odds ratio, and the cumulative odds ratio.
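The claim that χ² ignores the order of the outcome categories can be checked numerically. The following is only a minimal sketch: the function name pearson_chi_square is hypothetical, the statistic is the ordinary Pearson chi-square computed from observed and expected counts, and the counts are those of Table 9.1 with the columns once in their natural order and once scrambled.

```python
def pearson_chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under no association between rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2


# Table 9.1: rows are Therapy 1 and Therapy 2;
# columns are Worse, No Change, Improved.
ordered = [[3, 22, 4], [12, 13, 1]]
# The same counts with the columns rearranged as No Change, Improved, Worse.
scrambled = [[22, 4, 3], [13, 1, 12]]

print(pearson_chi_square(ordered))
print(pearson_chi_square(scrambled))
```

Both calls give the same value (up to floating-point rounding), because each cell contributes the same term regardless of where its column sits. The statistic therefore cannot distinguish an ordered pattern of outcomes from an arbitrary rearrangement of the same columns.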
THE POINT-BISERIAL r APPLIED TO ORDINAL CATEGORICAL DATA

Although we soon observe that there are limitations to this method (as is often true regarding measures of effect size), one might calculate a point-biserial correlation, r_pb (see chap. 4), as perhaps the simplest estimate of an effect size for the case at hand. First, the c column category labels are replaced by ordered numerical values, such as 1, 2, ..., c. For the column categories in Table 9.1 one might use 1, 2, and 3, and call these the scores on a Y variable. Next, the labels for the row categories are replaced with numerical values, say, 1 and 2, and these are called the scores on an X variable. One then uses any statistical software to calculate the correlation coefficient, r, for the now numerical X and Y variables.

Software output yielded r_pb = -.397 for the data in Table 9.1. When sample sizes are unequal, as they are in Table 9.1, one can correct for the attenuation of r that results from such inequality by using Equation 4.4 from chapter 4 for r_c, where c denotes corrected. Because sample sizes are reasonably large and not very different for the data in Table 9.1, we are not surprised to find that the correction makes little difference in this case; r_c = -.398. The correlation is moderately large using Cohen's (1988) criteria for the relative sizes of correlations that were critiqued in chapter 4. Output also indicates that r_pb = -.397 is statistically significantly different from 0 at the p < .002 level, two-tailed. Note that, because Therapy 1 was scored X = 1 and better outcomes received higher Y scores, the negative correlation indicates that Therapy 1 is better than Therapy 2. One can now conclude, subject to the limitations that are discussed later, that Therapy 1 has a statistically significant and moderately strong superiority over Therapy 2.
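The scoring procedure described above can be sketched directly from the cell frequencies, without entering each observation. This is a minimal illustration rather than the software the authors used: the function name point_biserial_from_counts is hypothetical, and it simply computes the ordinary Pearson r with each (X, Y) score pair weighted by its cell count.

```python
import math


def point_biserial_from_counts(table, row_scores=(1, 2), col_scores=None):
    """Pearson r between row scores (X) and column scores (Y) for a table of counts."""
    if col_scores is None:
        # Default to equal-interval scores 1, 2, ..., c for the c columns.
        col_scores = tuple(range(1, len(table[0]) + 1))
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, row in zip(row_scores, table):
        for y, count in zip(col_scores, row):
            n += count
            sum_x += count * x
            sum_y += count * y
            sum_xx += count * x * x
            sum_yy += count * y * y
            sum_xy += count * x * y
    ss_x = sum_xx - sum_x ** 2 / n        # sum of squares for X
    ss_y = sum_yy - sum_y ** 2 / n        # sum of squares for Y
    sp_xy = sum_xy - sum_x * sum_y / n    # sum of cross-products
    return sp_xy / math.sqrt(ss_x * ss_y)


# Table 9.1: rows are Therapy 1 (X = 1) and Therapy 2 (X = 2);
# columns are Worse (Y = 1), No Change (Y = 2), Improved (Y = 3).
table_9_1 = [[3, 22, 4], [12, 13, 1]]
r_pb = point_biserial_from_counts(table_9_1)
print(round(r_pb, 3))
```

With the scores used in the text, the printed value agrees with the reported r_pb = -.397. The sketch does not apply the Equation 4.4 correction for unequal sample sizes and does not compute a p level.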
CONFIDENCE INTERVAL AND NULL-COUNTERNULL INTERVAL FOR r_pop

Recall from chapter 4 that construction of an accurate confidence interval for r_pop can be complex and that there may be no entirely satisfactory method. Therefore, researchers who report a confidence interval for r_pop should also include such a cautionary comment in their research reports. For more details consult Hedges and Olkin (1985), Smithson (2003), and Wilcox (1996, 1997, 2003). Refer to the section on confidence intervals and null-counternull intervals in chapter 4 for a brief discussion of the improved methods for construction of a confidence interval for r_pop by Smithson (2003) and Wilcox (2003). As an alternative to a confidence interval one might be inclined to construct instead the simple null-counternull interval for r_pop using Equation 4.2 from chapter 4. However, as pointed out in chapter 8, a null-counternull interval for an effect size is less useful (very wide) when the estimate of the effect size is already known to be large and statistically significant, which is the case for the data in Table 9.1.

Although, from chapter 8, we know in advance that the null-counternull interval will be wide when the null hypothesis is H_0: r_pop = 0 and the obtained estimate of effect size is large, we proceed to use Equation 4.2 to construct a null-counternull interval for r_pop for these data as an exercise. Because our null-hypothesized value of r_pop is 0, the lower limit of the interval (null value) is 0. We apply the obtained r = -.397 to Equation 4.2, r_cn = 2r / (1 + 3r²)^(1/2), to find that the upper limit (counternull value) is 2(-.397) / [1 + 3(-.397)²]^(1/2) = -.654. Therefore, the interval runs from 0 to -.654.
LIMITATIONS OF r_pb FOR ORDINAL CATEGORICAL DATA

For general discussion of limitations of r_pb refer to the section Assumptions of r and r_pb in chapter 4. The limitations may be especially troublesome in cases, such as the present one, in which there are very few values of the X and Y variables (two and three values, respectively). These data cause concerns such as the possibly inaccurate obtained p levels for the t test that is used to test for the statistical significance of r_pb. However, in this ordinal example there might be some favorable circumstances that
possibly reduce the risk in using r_pb. First, sample sizes are reasonably large. Second, the obtained p level is well beyond the customary minimum criterion of .05. Also, some studies have indicated that statistical power and accurate p levels can be maintained for the t test even when the Y variable is dichotomous (resulting in a 2 × 2 table) if sample sizes are greater than 20 each, as they are in our example (D'Agostino, 1971; Lunney, 1970). A dichotomy is a much coarser grouping of categorical outcome than the polytomy of tables such as Table 9.1.

Regarding the t test of the statistical significance of r_pb, it has been reported that even when sample sizes are as small as five the p levels for the t test can be accurate when there are at least three ordinal categories (Bevan, Denton, & Meyers, 1974). Also, Nanna and Sawilowsky (1998) showed that the t test can be robust with respect to Type I error and can maintain power when applied to data from rating scales, but Maxwell and Delaney (1985) showed that, under heteroscedasticity and equality of means of populations, parametric methods applied to ordinal data might result in misleading conclusions. (However, in experimental research it might not be common to find that treatments change variances without changing means.) For references to many articles whose conclusions favor one or the other side of this longstanding controversy about the use of parametric methods for ordinal data, consult Nanna (2002) and Maxwell and Delaney (2004). Regarding the prospects for future development of a satisfactory method for constructing a confidence interval for the difference between the mean ratings of two groups, refer to Penfield (2003).

One might also be concerned about the arbitrary nature of our equal-interval scoring of the columns (1, 2, and 3) because other sets of three increasing numbers could have been used. Snedecor and Cochran (1989) and Moses (1986) reported that moderate differences among ordered, but not necessarily equally spaced, numerical scores that replace ordinal categories do not result in important differences in the value of t. However, Delaney and Vargha (2002) provided contrary results in which there was a statistically significant difference between the means for two treatments for problem drinking when the increasing levels of alcohol consumption were ordinally numerically scaled with equal spacing as 1 (abstinence), 2 (2 to 6 drinks per week), 3 (between 7 and 140 drinks per week), and 4 (more than 140 drinks per week), but there was not a statistically significant difference when the same four levels of drinking were scaled with slightly unequal spacing such as 0, 2, 3, 4. Consult Agresti (2002) for similar results that indicated the spacing is important and for further discussion of the choice of scores for the categories. For dependent variables for which there is no obvious choice of score spacing, such as the dependent variable in Table 9.1, Agresti (2002) acknowledged that equal spacing of scores is often a reasonable choice.

If, unbeknownst to the researcher, a continuous latent variable happens to underlie the scale, one would want the spacing of scores to be
consistent with the differences between the underlying values. Agresti (2002) recommended the use of sensitivity analysis in which the results from two or three sensible scoring schemes are compared. One would hope that the results would not be very different. In any event, the results from each of the scoring schemes should be presented.

Some researchers will remain concerned about the validity of r_pb and the accuracy of the p levels of the t test under the following combination of circumstances: Sample sizes are small, there are as few as three ordinal categories, there is possible skew or skew in different directions for the two groups, and there is possible heteroscedasticity. Because the lowest and/or highest extremes of the ordinal categories may not be as extreme as the actual most extreme standings of the participants with regard to the construct that underlies the rating scale, skew or differential skew may result. For example, suppose that there are respondents in one group who disagree extremely strongly with a presented attitudinal statement and respondents in the other group who agree extremely strongly with it. If the scale does not include these very extreme categories, the responses of the two groups will "bunch up" with those in the less extreme strongly disagree or strongly agree categories, respectively (which are floor and ceiling effects, as discussed in chap. 1). The consequence will be skew in different directions for the two groups as well as a restricted range of the dependent variable. Recall from chapter 4 that differential skew and restricted range of the measure of the dependent variable can be problematic for r_pb.

Note that the issue of the Pearson r_pop as a measure of only the linear component of a relationship between X and Y is not relevant here because the two values of the X variable do not represent a dichotomized continuous variable that might have a nonlinear relationship with the Y variable. Instead, the two values of the X variable represent a true dichotomy such as Therapy 1 and Therapy 2 or male and female.

Finally, Cliff (1993, 1996) argued that there is rarely empirical justification for treating the numbers that are assigned to ordinal categories as having other than ordinal properties. We now turn to a less problematic effect size for ordinal categorical data, a measure for which the categories need only be ordered and the issue of the spacing of numerical scores is irrelevant.
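The sensitivity analysis that Agresti (2002) recommended, as described earlier in this section, might be sketched as follows for the data in Table 9.1. The two unequally spaced scoring schemes below are illustrative assumptions chosen only for this sketch, not schemes taken from the text, and the function name r_pb_for_scores is hypothetical.

```python
import math


def r_pb_for_scores(table, col_scores, row_scores=(1, 2)):
    """Pearson r between row scores (X) and column scores (Y), weighted by cell counts."""
    n = sum_x = sum_y = sum_xx = sum_yy = sum_xy = 0
    for x, row in zip(row_scores, table):
        for y, count in zip(col_scores, row):
            n += count
            sum_x += count * x
            sum_y += count * y
            sum_xx += count * x * x
            sum_yy += count * y * y
            sum_xy += count * x * y
    ss_x = sum_xx - sum_x ** 2 / n
    ss_y = sum_yy - sum_y ** 2 / n
    sp_xy = sum_xy - sum_x * sum_y / n
    return sp_xy / math.sqrt(ss_x * ss_y)


# Table 9.1: Therapy 1 and Therapy 2 by Worse, No Change, Improved.
table_9_1 = [[3, 22, 4], [12, 13, 1]]

# Hypothetical scoring schemes for the three ordered columns.
schemes = {
    "equal spacing (1, 2, 3)": (1, 2, 3),
    "wider gap at the top (1, 2, 4)": (1, 2, 4),
    "wider gap at the bottom (1, 3, 4)": (1, 3, 4),
}

results = {label: r_pb_for_scores(table_9_1, scores)
           for label, scores in schemes.items()}
for label, r in results.items():
    print(f"{label}: r_pb = {r:.3f}")
```

Following the recommendation in the text, one would report the value of r_pb from every scheme, not only the one that looks most favorable. Note that schemes differing only by a linear rescaling, such as (0, 1, 2) and (1, 2, 3), necessarily give the same r, so only changes in relative spacing are informative.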
THE PROBABILITY OF SUPERIORITY APPLIED TO ORDINAL DATA

The part of the following material that is background information was explained in more detail in chapter 5, where the effect size called the probability of superiority was introduced in the context of a continuous Y variable. Recall that the probability of superiority, PS, was defined as the probability that a randomly sampled member of Population a will have a score (Y_a) that is higher than the score attained by a randomly sampled member of Population b (Y_b). Symbolically, PS = Pr(Y_a > Y_b). In
the case of Table 9.1, a represents Therapy 1 and b represents Therapy 2, so we now call these therapies Therapy a and Therapy b. The PS is estimated by p_(a>b), which is the proportion of times that members of Sample a have a better outcome than members of Sample b when the outcome of each member of Sample a is compared to the outcome of each member of Sample b, one by one. In Table 9.1 we consider the outcome of No Change (Y = 2) to be better than the outcome Worse (Y = 1) and the outcome Improved (Y = 3) to be better than the outcome No Change. The number of times that the outcome for a member of Sample a is better than the outcome for the compared member of Sample b in all of these head-to-head comparisons is called the U statistic. (We soon consider the handling of tied scores.) The total number of such head-to-head comparisons is given by the product of the two sample sizes, n_a and n_b. Therefore, an estimate of the PS is given by Equation 5.2 from chapter 5, p_(a>b) = U/(n_a n_b). The PS and its p_(a>b) estimator are not sensitive to the magnitudes of the scores that are being compared two at a time, but they are sensitive to which of the two scores is higher (better), that is, an ordering of the two scores. Therefore, the PS and p_(a>b) are applicable to 2 × c tables in which the c categories are ordinal categorical. Note that numerous ties are likely when comparing two scores at a time when outcomes are categorical, even more so the smaller the effect size and the fewer the categories; consult Fay (2003). Therefore, we pay particular attention to ties in the following sections.
PS
(Y
(Y
U
Pa b U/nanb•
PS
:>
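The head-to-head comparison just described can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `head_to_head` is hypothetical, and the cell frequencies are those of Table 9.1 as they can be read from the worked example later in this chapter (3, 22, 4 for Therapy a and 12, 13, 1 for Therapy b).

```python
from itertools import product

# Ordinal codes used in the text: 1 = Worse, 2 = No Change, 3 = Improved.
# Cell frequencies are assumed from the chapter's worked example (Table 9.1).
sample_a = [1] * 3 + [2] * 22 + [3] * 4    # Therapy a, n_a = 29
sample_b = [1] * 12 + [2] * 13 + [3] * 1   # Therapy b, n_b = 26


def head_to_head(a_scores, b_scores):
    """Compare every a score with every b score, one pair at a time.

    Returns (a wins, ties, b wins). Only the ordering of the two scores
    matters, not the size of the difference between them.
    """
    a_wins = ties = b_wins = 0
    for y_a, y_b in product(a_scores, b_scores):
        if y_a > y_b:
            a_wins += 1
        elif y_a == y_b:
            ties += 1
        else:
            b_wins += 1
    return a_wins, ties, b_wins


a_wins, ties, b_wins = head_to_head(sample_a, sample_b)
n_pairs = len(sample_a) * len(sample_b)   # n_a * n_b comparisons in all
print(a_wins, ties, b_wins, n_pairs)
```

Because so many of the pairings are ties, the way in which ties are handled has a large influence on the estimate, which is why the following sections treat ties explicitly.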
WORKED EXAMPLE OF ESTIMATING THE PS FROM ORDINAL DATA
Before discussing the use of software for the present task we describe manual calculation. (Although a standard statistical package might provide at least intermediate values for the calculations, we describe manual calculation here because it should provide readers with a better understanding of the concept of the PS when applied to ordinal categorical data. Also, manual calculation requires only cell frequencies, whereas calculation using standard software might require more laborious entry of each observation.) We estimate PS = Pr(Y_a > Y_b) using S_a to denote the number of times that a member of Sample a has an outcome that is superior to the outcome for the compared member of Sample b. We use T to denote the number of times that the two outcomes are tied. A tie occurs whenever the two participants who are being compared have outcomes that are in the same outcome category (same column of Table 9.1). The number of ties arising from each column of the table is the product of the two cell frequencies in the column. Using the simple tie-handling method that was recommended by Moses, Emerson, and Hosseini (1984) and also adopted by Delaney and Vargha (2002) we allocate ties equally to each group by counting each tie as one half of a win assigned to each of the two samples. (Consult Brunner & Munzel, 2000; Fay, 2003; Pratt & Gibbons, 1981; Randles, 2001; Rayner & Best, 2001; and Sparks, 1967, for further discussions of ties.) Therefore,

$$U = S_a + .5T. \tag{9.1}$$
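Equation 9.1 needs only the cell frequencies of the 2 x c table. The following minimal sketch assumes that the columns are ordered from the worst to the best outcome; the function name `ps_from_table` is hypothetical, and the frequencies are those of Table 9.1 as they appear in this worked example.

```python
def ps_from_table(row_a, row_b):
    """Estimate the PS from a 2 x c table of cell frequencies.

    Columns are assumed to run from the worst to the best outcome.
    Ties are allocated equally, as in Equation 9.1: U = S_a + .5T.
    """
    # S_a: each a cell beats every b cell in a lower (worse) column.
    s_a = sum(freq_a * sum(row_b[:j]) for j, freq_a in enumerate(row_a))
    # T: ties per column are the product of the two cell frequencies.
    ties = sum(freq_a * freq_b for freq_a, freq_b in zip(row_a, row_b))
    u = s_a + 0.5 * ties
    n_pairs = sum(row_a) * sum(row_b)
    return s_a, ties, u, u / n_pairs


# Worse, No Change, Improved (frequencies assumed from Table 9.1)
row_a = [3, 22, 4]    # Therapy a
row_b = [12, 13, 1]   # Therapy b
s_a, ties, u, ps_hat = ps_from_table(row_a, row_b)
print(s_a, ties, u, round(ps_hat, 3))
```

Working from cell frequencies rather than from individual observations mirrors the manual procedure and avoids entering each participant's score separately.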
Calculating S_a by beginning with the last column (Improved) of Table 9.1, observe that the outcomes of the four patients in the first row (now called Therapy a) are superior to those of 13 + 12 = 25 of the patients in row 2 (now called Therapy b). Therefore, thus far 4(13 + 12) = 100 pairings of patients have been found in which Therapy a had the superior outcome. Similarly, moving now to the middle column (No Change) of the table observe that the outcomes of 22 of the patients in Therapy a are superior to those of 12 of the patients in Therapy b. This latter result adds 22 x 12 = 264 to the previous subtotal of 100 pairings within which patients in Therapy a had the superior outcome. Therefore, S_a = 100 + 264 = 364. The number of ties arising from columns 1, 2, and 3 is 3 x 12 = 36, 22 x 13 = 286, and 4 x 1 = 4, respectively, so T = 36 + 286 + 4 = 326.
Thus, U = S_a + .5T = 364 + .5(326) = 527. The number of head-to-head comparisons in which a patient in Therapy a had a better outcome than a patient in Therapy b, when one allocates ties equally, is 527. There were n_a n_b = 29 x 26 = 754 total comparisons made. Therefore, the proportion of times that a patient in Therapy a had an outcome that was superior to the outcome of a compared patient in Therapy b, p_{a>b} (with equal allocation of ties), is 527/754 = .699. We thus estimate that there is nearly a .7 probability that a randomly sampled patient from a population that receives Therapy a will outperform a randomly sampled patient from a population that receives Therapy b.

If type of therapy has no effect on outcome, PS = .5. Before citing methods that might be more robust we discuss traditional methods for testing PS = .5. As discussed in chapter 5, one might test H0: PS = .5 against Halt: PS ≠ .5 using the Mann-Whitney U test (perhaps more appropriately called, in terms of historical precedence, the Wilcoxon-Mann-Whitney test). However, as discussed in the section Assumptions in chapter 5, heteroscedasticity can result in a loss of power or inaccurate p levels and inaccurate confidence intervals for the PS (cf. Delaney & Vargha, 2000; Wilcox, 1996, 2001, 2003).

Only a minority of textbooks of statistics have a table of critical values of U for various combinations of sample sizes, n_a and n_b. Also, books that do include such a table (or a table for the equivalent statistic, W_m, that is discussed shortly) may not include the same sample sizes that were used by the researcher. Therefore, we now consider the use of software to conduct a U test.

Programs of statistical software packages can be used to conduct a U test from ordinal categorical data if a data file is created in which the ordinal categories are replaced by a set of any increasing positive numbers, as we already did for the columns in Table 9.1. Available software may instead provide an equivalent test using Wilcoxon's W_m statistic.
Software may also be using an approximating normal distribution instead of the exact distribution of the W_m statistic and use as the standard deviation of this distribution (the standard error) a standard deviation that has not been adjusted for ties. (We adjust for ties later in this section.) For the data in Table 9.1 such software yields W_m = 878, p = .0114, two-tailed, using a normal approximation in which the standard error is not adjusted for ties, so the reported p level is not as accurate as it could be although the reported value of W_m is correct. To derive an estimate of PS from this output W_m is transformed to U using, as in chapter 5,

$$U = W_m - \frac{n_s(n_s + 1)}{2}, \tag{9.2}$$

where n_s is the smaller of the two sample sizes or simply n if n_a = n_b. Applying the data in Table 9.1 to Equation 9.2, the obtained W_m = 878 transforms to U = 878 - [26(26 + 1)]/2 = 527, which is the same value for U that we previously obtained using manual calculation.

When sample sizes are larger than those in a table of critical values of U a manually-calculated U test is often conducted using a normal approximation. (Unlike tables of critical values of U or of W_m, tables of the normal curve appear in all books on general statistics.) For ordinal categorical data there is an old three-part rule of thumb (possible modification of which we suggest later) that has been used to justify use of the version of the W_m test or U test that uses the normal approximation. The rule consists of (Part 1) n_a ≥ 10, (Part 2) n_b ≥ 10, and (Part 3) no column total frequency > .5N, where N = n_a + n_b (Emerson & Moses, 1985; Moses et al., 1984). According to this rule, if all of these three criteria are satisfied the following transformation of U to z is made, and the obtained z is referred to a table of the normal curve to see if it is at least as extreme as the critical value that is required for the adopted significance level (e.g., z = ±1.96 for the .05 level, two-tailed):

$$z_u = \frac{U - n_a n_b / 2}{s_u}, \tag{9.3}$$

where s_u is the standard deviation of the distribution of U (standard error) and

$$s_u = \left[ \frac{n_a n_b (n_a + n_b + 1)}{12} \right]^{1/2}. \tag{9.4}$$
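As a rough check on Equations 9.2 through 9.4, the following sketch converts the reported W_m to U and then applies the normal approximation without any adjustment for ties. The function names are hypothetical, and the two-tailed p level is obtained from the standard normal distribution by way of the complementary error function rather than from a printed table.

```python
import math


def u_from_wm(w_m, n_s):
    """Equation 9.2: transform W_m to U; n_s is the smaller sample size."""
    return w_m - n_s * (n_s + 1) / 2


def z_unadjusted(u, n_a, n_b):
    """Equations 9.3 and 9.4: z for U with a standard error not adjusted for ties."""
    s_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - n_a * n_b / 2) / s_u
    return s_u, z


def two_tailed_p(z):
    """Two-tailed p level under the standard normal curve."""
    return math.erfc(abs(z) / math.sqrt(2))


# Values from the text for Table 9.1: W_m = 878, n_a = 29, n_b = 26.
u = u_from_wm(878, 26)
s_u, z = z_unadjusted(u, 29, 26)
p = two_tailed_p(z)
print(u, round(s_u, 3), round(z, 3), round(p, 4))
```

Under these assumptions the sketch returns U = 527 and a two-tailed p level close to the .0114 that the text attributes to software that does not adjust the standard error for ties.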
With regard to the minimum sample sizes that might justify use of the normal approximation, recall from chapter 5 that Fahoome (2002) found that the minimum equal sample sizes that would justify the use of the normal approximation for the W_m test (equivalent to the U test), in terms of adequately controlling Type I error, were 15 for tests at the .05 level and 29 for tests at the .01 level. Therefore, until there is further evidence about minimum sample sizes for the case of using the normal approximation to test PS = .5 with ordinal categorical data, perhaps a better rule of thumb would be to substitute Fahoome's (2002) minimum sample sizes for those in Parts 1 and 2 in the previously described old rule.

A more accurate significance level can be attained by adjusting s_u for ties. Such an adjustment might be especially beneficial if any column total contains more than one half of the total participants. This condition violates the criterion for Part 3 that was previously listed for justifying use of a normal approximation. (Because some software might not make this adjustment we demonstrate the manual adjustment.) Observe that column 2 of Table 9.1 contains 22 + 13 = 35 of the 29 + 26 = 55 total patients. Because 35/55 = .64, which is greater than the criterion maximum of .5, we use the adjusted s_u, denoted s_adj, in the denominator of z_u for a more accurate test,

$$s_{adj} = s_u \left[ 1 - \frac{\sum (f_i^3 - f_i)}{N^3 - N} \right]^{1/2}, \tag{9.5}$$

where f_i is a column total frequency.

Beginning our calculation with Equation 9.4 for s_u we find for the data in Table 9.1 that s_u = [29(26)(29 + 26 + 1)/12]^{1/2} = 59.318. Next, we calculate f_i^3 - f_i for each of the columns 1, 2, and 3 in that order. These results are 15^3 - 15 = 3,360, 35^3 - 35 = 42,840, and 5^3 - 5 = 120, respectively. Summing these last three values yields 3,360 + 42,840 + 120 = 46,320. Placing 46,320 into Equation 9.5 we have s_adj = 59.318[1 - 46,320/(55^3 - 55)]^{1/2} = 50.385. From Equation 9.3 with s_adj replacing s_u, we now have z_u = [527 - .5(29)(26)]/50.385 = 2.98. Inspection of a table of the normal curve reveals that a z that is equal to 2.98 is statistically significant beyond the .0028 level, two-tailed. There is thus support for a researcher's hypothesis that one of the therapies is better than the other, and we soon find that Therapy a is the better one.

Observe first that adjusting s_u for ties results now in a different obtained significance level from the value of .0114 that was previously obtained, although both levels represent significance at p < .02. Because our estimate of Pr(Y_a > Y_b), p_{a>b}, was .699, which is a value greater than the null-hypothesized value of .5, the therapy for which there is this just-reported statistically significant evidence of superiority is Therapy a.
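The tie adjustment of Equation 9.5 can be sketched as follows. The function name `tie_adjusted_z` is hypothetical; the column totals (15, 35, and 5) are those used in the manual calculation above, and the p level is again taken from the standard normal distribution instead of a printed table.

```python
import math


def tie_adjusted_z(u, n_a, n_b, column_totals):
    """Normal-approximation z for U with the standard error adjusted for ties."""
    n = n_a + n_b
    # Equation 9.4: unadjusted standard error of U.
    s_u = math.sqrt(n_a * n_b * (n + 1) / 12)
    # Sum of (f_i^3 - f_i) over the column total frequencies.
    tie_term = sum(f ** 3 - f for f in column_totals)
    # Equation 9.5: standard error adjusted for ties.
    s_adj = s_u * math.sqrt(1 - tie_term / (n ** 3 - n))
    # Equation 9.3 with s_adj replacing s_u.
    z = (u - n_a * n_b / 2) / s_adj
    return tie_term, s_adj, z


tie_term, s_adj, z_adj = tie_adjusted_z(527, 29, 26, [15, 35, 5])
p_adj = math.erfc(abs(z_adj) / math.sqrt(2))   # two-tailed p level
print(tie_term, round(s_adj, 3), round(z_adj, 2), round(p_adj, 4))
```

The adjustment shrinks the standard error, so the adjusted z is larger in absolute value than the unadjusted z whenever ties are present.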
Because U is statistically significant beyond the .0028 (approximately) two-tailed level, p_{a>b} = .699 is statistically significantly greater than .5 beyond the .0028 two-tailed level.

When both n_a and n_b > 10 the presence of many ties, as is true for the data in Table 9.1, has been reported to result generally in the approximate p level being within 50% of the exact p level (Emerson & Moses, 1985; but also consult Fay, 2003). In this example the obtained p level, .0028, is so far from the usual criterion of .05 that perhaps one need not be very concerned about the exact p level attained by the results. However, especially when more than half of the participants fall in one outcome column and the approximate obtained p level is close to .05, a researcher might prefer to report an exact obtained p level as is discussed in the paragraph after the next one.

Note that the PS (and the DM and OR_g of the next three sections) is applicable to tables that have as few as two ordinal outcome categories, although more ties are likely when there are only two outcome categories. Tables 4.1 (chap. 4) and 8.1 (chap. 8) provide examples because Participant Better versus Participant Not Better after treatment in Table 4.1 and Symptoms Remain versus Symptoms Gone after treatment in Table 8.1 represent in each case an ordering of outcomes. One outcome is not just different from the other, as would be the case for a nominal scale, but in each example one outcome can be considered to be superior to its alternative outcome. As an exercise the reader might apply the results of Table 8.1 to Equation 9.1 to verify, with regard to the superiority of Psychotherapy to Drug Therapy in that example, that the PS is estimated to be .649.
An exact p level for U and, therefore, for testing H0: PS = .5 against Halt: PS ≠ .5, can be obtained using the statistical software packages StatXact, SPSS Exact, or SAS Version 9. (Refer to Posch, 2002, for a study of the power of exact [StatXact] versions of the W_m test and competing tests applied to data from 2 x c tables.) Recall from chapter 5 that Fay (2002) provided a Fortran 90 program to produce exact critical values for the W_m test over a wider range of sample sizes and alpha levels than can generally be found in published tables. For further discussions of the PS and U test in general review the sections The Probability of Superiority: Independent Groups and Assumptions in chapter 5.

Consult Delaney and Vargha (2002) for discussion of robust methods for the current case of ordinal categorical dependent variables. However, such methods might inflate Type I error under some conditions of skew. Delaney and Vargha (2002) demonstrated that these methods might not perform well when extreme skew is combined with one or both sample sizes being at or below 10. Sample sizes between 20 and 30 might be satisfactory. Wilcox (2003) provided an S-PLUS function for the Brunner and Munzel (2000) method for testing H0: PS = .5 and for constructing a confidence interval for the PS under conditions of heteroscedasticity, ties, or both.
Recall that from time to time in this book we recommended that researchers consider estimating and reporting more than one kind of effect size for a given set of data to gain different perspectives on the results. However, we also acknowledged a contrary opinion that holds that such reporting of estimates of multiple measures might only serve to confuse some readers. The example of estimation of the point-biserial r_pop and the PS for data such as those in Table 9.1 is of interest in this regard. The estimate of the former was -.398 and the estimate of the latter was .699. A researcher who reports both of these values would be obliged not only to discuss the limitations of the point-biserial correlation in the case of ordinal data but also to make clear to readers the different meanings, but consistent message, of the two reported estimates of effect size. Both results support the superiority of Therapy a.

The values -.398 and .699 for the two estimates both constitute estimates of moderately large effect sizes by Cohen's (1988) criteria that were discussed in chapters 4 and 5. Also, referring to the columns for r_pop, the PS, and Cohen's (1988) U3 measure of overlap in Table 5.1 of chapter 5, observe that these two values for estimates of r_pop and the PS both correspond to a value of U3 that indicates that approximately three fourths of the members of the better performing group have outcomes that are above the median outcome of the poorer performing group. (Note that it is of no concern when interpreting the results or examining the rows closest to r_pop = .398 in Table 5.1 that r_pb was negative and the estimate of the PS was positive. Because it is a proportion the estimate of PS cannot be negative, and a value over .5 indicates superiority for Group a. A negative value for r_pb similarly indicates that Group 1 [same as Group a] tends to score higher than Group 2. The sign of r_pb depends on which sample's data are arbitrarily placed in row 1 or row 2, as discussed in chap. 4.) Note that those who do not find the median to be meaningful in the case of ordinal data with few categories and many ties would not want to apply U3 in such cases.

Note that in the case of ordinal categorical data, due to the limited number of possible outcomes (categories) there is no opportunity for the most extreme outcomes to be shifted up or down by a treatment to a more extreme value (for which there is no outcome category). The result would be a bunching of tallies in the existing most extreme category (skew; cf. Fay, 2003), obscuring the degree of shift in the underlying variable. Such bunching can cause an underestimation of the PS, because this bunching can increase ties in an existing extreme category when in fact some of these ties actually represent superior outcomes for members of one group regarding the underlying variable. Skew resulting from such bunching can also cause r_pb to underestimate r_pop, as was discussed in the section Assumptions of r and r_pb in chapter 4.
Again, such problems can be reduced by the use of the maximum number of categories into which participants can reliably be placed and by the use of either the Brunner and Munzel (2000) tie-handling method or Cliff's (1996) method that is discussed in the next section.
THE DOMINANCE MEASURE AND SOMERS' D

Recall from the section The Dominance Measure in chapter 5 that Cliff (1993, 1996) discussed an effect size that is a variation on the PS concept that avoids allocating ties, a measure that we called the dominance measure and defined in Equation 5.5 as DM = Pr(Y_a > Y_b) - Pr(Y_b > Y_a). Cliff (1993, 1996) called the estimator of this effect size the dominance statistic, which we defined in Equation 5.6 as ds = p_{a>b} - p_{b>a}. When calculating the ds each p value is given by the sample's value of U/n_a n_b with no allocation of ties, so each U is now given only by the S part of Equation 9.1. The denominator of each p value is still given by n_a n_b. Note again that many ties are likely in the case of ordinal categorical data, which is especially true with fewer ordinal outcomes. The application of the DM and the ds will be made clear in the worked example in the next section.

Recall from chapter 5 that the ds and DM range from -1 to +1. When every member of Sample b has a better outcome than every member of Sample a, ds = -1. When every member of Sample a has an outcome that is better than the outcome of every member of Sample b, ds = +1. When there is an equal number of superior outcomes for each sample in the head-to-head pairings, ds = 0. When ds = -1 or +1 there is no overlap between the two samples' distributions in the 2 x c table, and when ds = 0 there is complete overlap in the two samples' distributions.
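The limits of the ds can be illustrated with a small sketch. The function name `dominance_statistic` and the three tables of frequencies are hypothetical and were chosen only to produce the boundary cases; columns are assumed to run from the worst to the best outcome.

```python
def dominance_statistic(row_a, row_b):
    """ds = p(a>b) - p(b>a) for a 2 x c table, with no allocation of ties."""
    n_pairs = sum(row_a) * sum(row_b)
    # Strict wins: a cell beats every opposing cell in a lower column.
    a_wins = sum(f_a * sum(row_b[:j]) for j, f_a in enumerate(row_a))
    b_wins = sum(f_b * sum(row_a[:j]) for j, f_b in enumerate(row_b))
    return (a_wins - b_wins) / n_pairs


# Hypothetical frequencies for Worse, No Change, Improved.
ds_a_dominates = dominance_statistic([0, 0, 10], [10, 0, 0])   # no overlap
ds_b_dominates = dominance_statistic([10, 0, 0], [0, 0, 10])   # no overlap
ds_balanced = dominance_statistic([5, 0, 5], [5, 0, 5])        # equal wins
print(ds_a_dominates, ds_b_dominates, ds_balanced)
```

In the balanced table half of the pairings are ties, yet the ds is 0 because the remaining pairings are split evenly between the two samples.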
However, because estimators of the PS and the DM are sensitive to which outcome is better in each pairing, but not sensitive to how good the better outcome is, reporting an estimate of these two effect sizes is not very informative for ordinal categorical data unless the 2 x c (or r x 2) table is also presented. For example, with regard to a table with the column categories of Table 9.1 (but not the data therein), if p_{a>b} = 1 or ds = +1 (both indicating the most extreme possible superiority of Therapy a over Therapy b) the result could mean that (a) all members of Sample b were in the Worse column whereas all members of Sample a were in the No Change column, Improved column, or in either the No Change or Improved columns; or (b) all members of Sample b were in the No Change column, whereas all members of Sample a were in the Improved column. Readers of a research report would want to know which of these four meaningfully different results underlying p_{a>b} = 1 or ds = +1 had occurred. Similarly, when p_{a>b} = .5 or ds = 0 (both indicating no superiority for either therapy), among other possible patterns of frequencies in the table the result could mean that all participants were in the Worse column, all were in the No Change column, or all were in the Improved column.
One would certainly want to know whether such a p_{a>b} or ds were indicating that both therapies were always possibly harmful (Worse column), always ineffective (No Change column), always effective (Improved column), or that there were some other pattern in the table.

Refer to Cliff (1993, 1996) for a discussion of significance testing for the ds and construction of confidence intervals for the DM for the independent-groups and the dependent-groups cases and for software to undertake the calculations. Wilcox (2003) provided an S-PLUS software function for Cliff's (1993, 1996) method and, as noted in the discussion of the DM in chapter 5, (Wilcox, 2003) reported tentative findings that this method controls Type I error well even when there are many ties. Consult Simonoff, Hochberg, and Reiser (1986), Vargha and Delaney (2000), and Delaney and Vargha (2002) for further discussions. The ds is also known as the version of Somers' D statistic (Agresti, 2002; Somers, 1962) that is applied to 2 x c tables with ordinal outcomes (Cliff, 1996). An exact p level for the statistical significance of Somers' D is provided by StatXact and SPSS Exact.

WORKED EXAMPLE OF THE ds

Calculating the ds with the data in Table 9.1 by starting with column 3, and not allocating ties, we note that (as already found in the previous section) Therapy a had S_a = 364 superior outcomes in the 29 x 26 = 754 head-to-head comparisons. Therefore, p_{a>b} = 364/754 = .4828. Starting again with column 3 we now find that 1 patient in Therapy b had a better outcome than 22 + 3 patients in Therapy a, so thus far there are 1(22 + 3) = 25 pairs of patients within which Therapy b had the superior outcome.
Moving now to column 2 we find that 13 patients in Therapy b had an outcome that was superior to the outcome of 3 patients in Therapy a, adding 13 x 3 = 39 to the previous subtotal of 25 superior outcomes for Therapy b. Therefore, p_{b>a} = (25 + 39)/754 = .0849. Thus, ds = p_{a>b} - p_{b>a} = .4828 - .0849 = .398, another indication, now on a scale from -1 to +1, of the degree of superiority of Therapy a over Therapy b. Observe that one can check our calculation of 25 + 39 = 64 superior outcomes for Therapy b by noting that there were a total of 754 comparisons, resulting in 364 superior outcomes (S_a) for Therapy a and T = 326 ties; so there must be 754 - 364 - 326 = 64 comparisons in which Therapy b had the superior outcome. Note that it is a coincidence that the absolute values of the ds and the previously reported corrected r_pb (i.e., r_c), for the data in Table 9.1 are the same, |.398|. The ds and r_pb actually describe somewhat different characteristics of the data.
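The worked example of the ds can be retraced with the following sketch, which again assumes the Table 9.1 frequencies that appear in the text (columns ordered Worse, No Change, Improved). The helper name `strict_wins` is hypothetical.

```python
# Frequencies for Worse, No Change, Improved (assumed from Table 9.1).
row_a = [3, 22, 4]    # Therapy a
row_b = [12, 13, 1]   # Therapy b


def strict_wins(winner_row, loser_row):
    """Count pairings in which the first sample has the strictly better outcome."""
    return sum(f * sum(loser_row[:j]) for j, f in enumerate(winner_row))


n_pairs = sum(row_a) * sum(row_b)
s_a = strict_wins(row_a, row_b)                     # superior outcomes for a
s_b = strict_wins(row_b, row_a)                     # superior outcomes for b
ties = sum(f_a * f_b for f_a, f_b in zip(row_a, row_b))

p_ab = s_a / n_pairs
p_ba = s_b / n_pairs
ds = p_ab - p_ba
print(s_a, s_b, ties, round(p_ab, 4), round(p_ba, 4), round(ds, 3))
```

The counts also reproduce the check described in the text: superior outcomes for Therapy a, superior outcomes for Therapy b, and ties together account for all n_a n_b comparisons.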
a
Pa >b - Pr, > a
a>b
Pr, > a
=
rb rb p
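The counting in the worked example can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' own procedure: the function names are hypothetical, and the cell counts (3, 22, 4 for Therapy a and 12, 13, 1 for Therapy b, with columns ordered from worst to best outcome) are inferred from the totals quoted in the text (29 and 26 patients, 364 superior outcomes for Therapy a, 326 ties) rather than copied from Table 9.1.

```python
def pairwise_counts(row_a, row_b):
    """Count head-to-head pairs won by group a, won by group b, and tied.

    row_a and row_b hold the cell frequencies of a 2 x c table, with the
    columns ordered from the worst to the best outcome category.
    """
    wins_a = wins_b = ties = 0
    for i, n_a in enumerate(row_a):
        for j, n_b in enumerate(row_b):
            pairs = n_a * n_b
            if i > j:        # member of group a is in the better category
                wins_a += pairs
            elif i < j:      # member of group b is in the better category
                wins_b += pairs
            else:            # same category: a tie, not allocated to either
                ties += pairs
    return wins_a, wins_b, ties


def dominance_statistic(row_a, row_b):
    """ds = p(a > b) - p(b > a), using all n_a * n_b pairs as denominator."""
    wins_a, wins_b, _ = pairwise_counts(row_a, row_b)
    total_pairs = sum(row_a) * sum(row_b)
    return wins_a / total_pairs - wins_b / total_pairs


# Assumed cell counts, inferred from the totals quoted in the text.
therapy_a = [3, 22, 4]
therapy_b = [12, 13, 1]

wins_a, wins_b, ties = pairwise_counts(therapy_a, therapy_b)
ds = dominance_statistic(therapy_a, therapy_b)
print(wins_a, wins_b, ties)   # 364 64 326
print(round(ds, 3))           # 0.398
```

With these assumed counts the sketch reproduces the 364 wins, 64 losses, and 326 ties of the worked example, which is the same check the text makes by subtraction.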
GENERALIZED ODDS RATIO

Recall from the A Related Effect Size section in chapter 5 the discussion of an estimator of an effect size that results from the ratio of the two p values, the generalized odds ratio. We now apply the generalized odds ratio to the data in Table 9.1 by using the same definitions of $p_{a>b} = U_a / n_a n_b$ and $p_{b>a} = U_b / n_a n_b$ that were used in the previous two sections; that is, we ignore ties in calculating the two U values but we use all $n_a n_b$ = 26 × 29 = 754 possible comparisons for the two denominators. Therefore, the generalized odds ratio estimate, $OR_g$, is given by

$$OR_g = \frac{p_{a>b}}{p_{b>a}}. \qquad (9.6)$$
From the values that were calculated in the previous section we now find that $OR_g = p_{a>b} / p_{b>a} = .4828/.0849 = 5.69$. For these data the $OR_g$ provides the informative estimate that in the population there are 5.69 times more pairings in which patients in Therapy a have a better outcome than patients in Therapy b than pairings in which patients in Therapy b have a better outcome than patients in Therapy a. The estimated parameter, $OR_{g\,pop} = \Pr(Y_a > Y_b) / \Pr(Y_b > Y_a)$, measures how many times more pairings there are in which a member of Population a has an outcome that is better than the outcome for a member of Population b than vice versa. For more discussion of generalized odds ratios consult Agresti (1984).
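As a small check on this arithmetic, the following Python sketch computes the generalized odds ratio from the win counts found in the worked example of the ds (364 pairings won by Therapy a and 64 won by Therapy b out of 754). The function name is hypothetical. Because both proportions share the denominator $n_a n_b$, the ratio of proportions equals the ratio of the win counts.

```python
def generalized_odds_ratio(wins_a, wins_b):
    """Ratio of pairings won by group a to pairings won by group b.

    Ties are ignored, as in the text; the shared denominator n_a * n_b
    cancels, so only the two win counts are needed.
    """
    if wins_b == 0:
        raise ValueError("OR_g is undefined when group b wins no pairings")
    return wins_a / wins_b


n_pairs = 29 * 26                 # 754 head-to-head comparisons
p_a_over_b = 364 / n_pairs        # about .4828
p_b_over_a = 64 / n_pairs         # about .0849
or_g = p_a_over_b / p_b_over_a

print(round(or_g, 2))                              # 5.69
print(round(generalized_odds_ratio(364, 64), 2))   # 5.69
```

Working from the unrounded counts gives 364/64 = 5.6875, which agrees to two decimals with the text's value obtained from the rounded proportions.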
CUMULATIVE ODDS RATIO
Suppose that in a $2 \times c$ table with ordinal categories, such as Table 9.2, one is interested in comparing the two groups with respect to their attaining at least some ordinal category. For example, with regard to the ordinal categories of the rating scale (Strongly Agree, Agree, Disagree, and Strongly Disagree), suppose that one wants to compare the college women and college men with regard to their attaining at least the Agree category. Attaining at least the Agree category means attaining the Strongly Agree category or the Agree category instead of the Strongly Disagree category or the Disagree category. Therefore, one's focus would be on the now combined Strongly Agree and Agree categories versus the now combined Strongly Disagree and Disagree categories. Thus, Table 9.2 is temporarily collapsed (reduced) to a 2 × 2 table for this purpose, rendering the odds ratio (OR) effect size for 2 × 2 tables of chapter 8 applicable to the analysis of the collapsed data.
TABLE 9.2
Gender Comparison With Regard to an Attitude Scale

           Strongly Agree    Agree    Disagree    Strongly Disagree
Women            62            18         2               0
Men              30            12         7               1

A population OR that is based on combined categories is called a population cumulative odds ratio (population $OR_{cum}$). This effect size is a measure of how many times greater the odds are that a member of a certain group will fall into a certain set of categories (e.g., Agree and Strongly Agree) than the odds that a member of another group will fall into that set of categories. In our example we are calculating the ratio of (a) the odds that a woman Agrees or Strongly Agrees with the statement (instead of Disagreeing or Strongly Disagreeing with it) and (b) the odds that a man Agrees or Strongly Agrees with that statement (instead of
Disagreeing or Strongly Disagreeing with it). The choice of which of the two or more categories to combine in a $2 \times c$ ordinal categorical table should be made before the data are collected. Table 9.2 presents an example of an original complete table (before collapsing it into Table 9.3) using actual data, but the labels of the response categories have been changed somewhat. The nonstatistical details of the research do not concern us here. Collapsing Table 9.2 by combining columns 1 and 2 and by combining columns 3 and 4 produces Table 9.3.

TABLE 9.3
Collapsed Version of Table 9.2

           Agree or Strongly Agree    Disagree or Strongly Disagree
Women         $f_{11}$ = 80                $f_{12}$ = 2
Men           $f_{21}$ = 42                $f_{22}$ = 8

One finds $OR_{cum}$ by applying Equation 8.7 from chapter 8 to Table 9.3, $OR = f_{11} f_{22} / f_{12} f_{21}$. Observe in Table 9.3 that $f_{11}$ = 62 + 18 = 80, $f_{22}$ = 7 + 1 = 8, $f_{12}$ = 2 + 0 = 2, and $f_{21}$ = 30 + 12 = 42. As in chapter 8 we adjust each f value by adding .5 to it to improve the sample $OR_{cum}$ as an estimator of $OR_{pop}$. We then use in Equation 8.7 $f_{11}$ = 80.5, $f_{22}$ = 8.5, $f_{12}$ = 2.5, and $f_{21}$ = 42.5. Therefore, the adjusted $OR_{cum}$ is $OR_{adj}$ = 80.5(8.5)/[2.5(42.5)] = 6.44. We have just found from the sample $OR_{adj}$ that the odds that a woman will Agree or Strongly Agree with the statement are estimated to be more than six times greater than the odds that a man will do so. However, to avoid exaggerating the gender difference that was found by $OR_{adj}$ in these data, it is also important to note in Table 9.3 that a great majority of the men Agree or Strongly Agree with the statement (42/50 = 84%) but an even greater majority of the women Agree or Strongly Agree with it (80/82 = 97.6%).

Any of the measures of effect size that were discussed previously in this chapter are applicable to the data in Table 9.2, subject to the previously discussed limitations. With regard to Table 9.3, if the two population odds are equal, the population $OR_{pop}$ = 1. Recall from chapter 8 that a test of $H_0$: $OR_{pop} = 1$ versus $H_{alt}$: $OR_{pop} \neq 1$ can be conducted using the usual $\chi^2$ test of association. If $\chi^2$ is significant at a certain p level, then $OR_{adj}$ is statistically significantly different from 1 at the same p level. The data of Table 9.3 yield $\chi^2$ = 11.85, p < .001. Readers might obtain somewhat varying, but still statistically significant, results because textbooks and different software might use somewhat different equations for $\chi^2$. Some use one or another of the equations that are modified for application to tables that have one or more cells with relatively small frequencies. Although Table 9.3 is such a table, we did not use a modified
equation because we assumed that many readers would not be accessing a modified equation. (We discuss a situation in which researchers should not use a modified equation in the next section.) As stated in chapter 8, when cell counts are small the distribution of the supposed $\chi^2$ statistic less accurately approximates the actual $\chi^2$ distribution. Authors are inconsistent in their criteria for small cell counts. Refer to Agresti (1990, 2002) for further discussion and references for the complex problem of adjusting $\chi^2$ for small cell counts. Fisher's exact test can be conducted using StatXact, SPSS Exact, or SAS Version 9. Refer to chapter 8 for discussion and worked examples of construction of a confidence interval or a null-counternull interval for $OR_{pop}$.
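The collapsing of Table 9.2 and the adjusted cumulative odds ratio described in this section can be sketched as follows. This is a minimal illustration, assuming the cell counts of Table 9.2 and the add-.5 adjustment described in the text; the function names and the `cut` argument are hypothetical and are not notation from the book.

```python
def collapse(row, cut):
    """Collapse one row of a 2 x c table into two cells.

    The first `cut` columns are summed into one cell and the remaining
    columns into the other.
    """
    return sum(row[:cut]), sum(row[cut:])


def adjusted_odds_ratio(f11, f12, f21, f22, adjustment=0.5):
    """Odds ratio f11*f22 / (f12*f21) after adding `adjustment` to each cell."""
    a, b = f11 + adjustment, f12 + adjustment
    c, d = f21 + adjustment, f22 + adjustment
    return (a * d) / (b * c)


# Table 9.2: Strongly Agree, Agree, Disagree, Strongly Disagree
women = [62, 18, 2, 0]
men = [30, 12, 7, 1]

# Combine columns 1-2 and columns 3-4, as in Table 9.3.
f11, f12 = collapse(women, cut=2)
f21, f22 = collapse(men, cut=2)
print(f11, f12, f21, f22)                      # 80 2 42 8

or_adj = adjusted_odds_ratio(f11, f12, f21, f22)
print(round(or_adj, 2))                        # 6.44

# Percentages that put the odds ratio in context.
print(round(100 * f11 / sum(women), 1))        # 97.6
print(round(100 * f21 / sum(men), 1))          # 84.0
```

Printing the two percentages next to the odds ratio follows the text's advice not to let an odds ratio above six overstate a difference between 97.6% and 84% agreement.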
THE PHI COEFFICIENT

Recall from chapter 4 that $phi_{pop}$ is the correlation between two dichotomous variables, such as the two variables in Table 9.3. For a final estimated effect size for the data in Table 9.3 we apply Equation 8.1 from chapter 8 to find that phi = $(\chi^2/N)^{1/2}$ = $(11.85/132)^{1/2}$ = .300, which reflects a medium strength of relationship (between gender and attitude in our example) according to Cohen's (1988) criteria. Note again that, following the recommendation of Fleiss et al. (2003), when using $\chi^2$ to calculate phi one should use the standard equation for $\chi^2$ instead of a version of the equation that is adjusted for small samples. Also note again that the applications of $\chi^2$ as a test statistic and phi as an estimator of effect size for the data of Table 9.3 assume naturalistic sampling, as described in chapter 8. We applied $\chi^2$ and phi to these data as a worked example under the assumption of naturalistic sampling although we cannot be certain that the author of this attitudinal research actually used naturalistic sampling. The reason for our uncertainty is that there is a large difference between the number of women (82) and the number of men (50) sampled, which would seem to be an unlikely difference when sampling from some populations. A plausible explanation for the preponderance of women would be that perhaps the research was conducted on students in a college course that attracts many more women than men. In this case we would be concerned about the nature of the population to which these results would generalize.

A CAUTION
Some researchers and statisticians will be concerned about conducting two tests of significance for estimates of effect size on the same set of data. For example, there might be concern about proper interpretation of the p level attained by a $\chi^2$ test of the significance of $OR_{cum}$ for the data of a collapsed table, such as Table 9.3, after one has already conducted a test of significance of association between the rows and columns of the original complete table, such as Table 9.2, having used any one of the methods that were previously discussed in this chapter. The simplest solution would be to compensate for giving oneself two chances to obtain a significant result from the same data by adopting a more conservative alpha level than the usual .05 level for the test of $OR_{cum}$ and for the statistical test of the estimate of effect size that is first applied to the full table. For example, one might use a Bonferroni-Dunn approach by adopting the .025 alpha level for each of the two tests. If there are going to be two tests of significance, the conservative alpha levels to be adopted for these two forthcoming tests should be chosen before data collection. Note, however, that in our case the obtained p level is so small, p < .001, that perhaps we need not worry about this otherwise important issue this time.

REFERENCES FOR FURTHER DISCUSSION OF ORDINAL CATEGORICAL METHODS

For further discussions of ordinal categorical methods that lead to estimation of effect sizes consult Agresti (1984, 1989, 1990, 2002), Cliff (1996), Fleiss et al. (2003), Gibbons (1985, 1993), Hildebrand, Laing, and Rosenthal (1977), Liebetrau (1983), Liu (1998), Moses et al. (1984), Randles (2001), and Wickens (1989).
Consult Vargha and Delaney (2000) and Brunner and Puri (2001) for application of what we call the PS for ordinal categorical outcomes to the cases of between-groups and within-groups one-way and factorial designs.

QUESTIONS
1. Define ordinal categorical variable, and provide an example that is not in the text.
2. What is a technical name for an ordered categorical variable?
3. State two criteria for one's choice of the number of ordinal categories to be used for the dependent variable.
4. Why should one be cautious about comparing effect sizes across studies that involve attitudinal scales?
5. What do the authors mean by an ordinal hypothesis?
6. Why is a $\chi^2$ test inappropriate to test an ordinal hypothesis?
7. Describe the procedure for calculating a point-biserial r for a $2 \times c$ table in which the dependent variable is ordinal categorical.
8. What adjustment should be made to the procedure in Question 7 in the case of unequal sample sizes?
9. How does one interpret a negative and a positive point-biserial r in tables such as Table 9.1?
10. Discuss a possible limitation of the use of the point-biserial r for ordinal categorical data.
11. Discuss the problem of choosing a scale of scores to replace ordinal categories.
12. Provide, discussing your reasoning, your own choice of a possible sensible numerical scale for the example of treatment for alcoholism in the text in which the four outcome categories were abstinence, 2 to 6 drinks per week, 7 to 140 drinks per week, and more than 140 drinks per week.
13. Describe how skew in opposite directions might occur in the case of $2 \times c$ tables involving attitude scales, and why might this differential skew be problematic for the point-biserial r?
14. Define the probability of superiority.
15. What adjustment might improve the accuracy of an obtained significance level for a U test based on the normal approximation?
16. What is the estimate of the PS if there are 610 wins for therapy in head-to-head comparisons of the outcomes of 33 participants who received therapy and outcomes of 33 participants who received a placebo?
17. When is it valid, but why would it still be problematic, to apply the PS to 2 × 2 tables?
18. Apply Equation 9.1 to the data in Table 8.1 of chapter 8.
19. Why is the negative value for the point-biserial correlation applied to Table 9.1 not inconsistent with the estimate of PS = .699 for those data?
20. Why might the use of too few categories cause the PS to be underestimated?
21. Define dominance measure.
22. Why might the estimation of the PS or the DM not be very informative to a reader of a research report unless the underlying contingency table is also presented?
23. What is another name for dominance statistic?
24. Define generalized odds ratio.
25. Interpret a generalized odds ratio that is equal to 5.
26. Define cumulative odds ratio.
27. Interpret a cumulative odds ratio that is equal to 2.
28. Under which circumstance is the phi coefficient applicable to a 2 × 2 table that results from collapsing a larger table?
29. Calculate phi for the data in Table 9.3.
30. What is the problem, and what is the simplest solution to the problem, that arises from conducting a test of significance for an original table and then conducting another test of significance for a collapsed 2 × 2 version of that table?
References
Abelson, R. P. (1995). Statistics as principled argument. Mahwah, NJ: Lawrence Erlbaum Associates.
Abelson, R. P., & Prentice, D. A. (1997). Contrast tests of interaction hypotheses. Psychological Methods, 2, 315-328.
Abu Libdeh, O. (1984). Strength of association in the simple general linear model: A comparative study of Hays' omega-squared. Unpublished doctoral dissertation, The University of Chicago, Chicago.
Agresti, A. (1984). Analysis of ordinal categorical data. New York: Wiley.
Agresti, A. (1989). Tutorial on modeling ordered categorical response data. Psychological Bulletin, 105, 290-301.
Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Agresti, A. (2002). Categorical data analysis (2nd ed.). Hoboken, NJ: Wiley.
Ahadi, S., & Diener, E. (1989). Multiple determinants and effect size. Journal of Personality and Social Psychology, 56, 398-406.
Algina, J., & Keselman, H. J. (2003). Approximate confidence intervals for effect sizes. Educational and Psychological Measurement, 63, 537-553.
Altman, D. G., Machin, D., Bryant, T. N., & Gardner, M. J. (2000). Statistics with confidence: Confidence intervals and statistical guidelines (2nd ed.). London: British Medical Journal Books.
American Psychological Association. (2001). Publication manual of the American Psychological Association (5th ed.). Washington, DC: Author.
Antonuccio, D. O., Danton, W. G., & McClanahan, T. M. (2003). Psychology in the prescription era: Building a firewall between marketing and science. American Psychologist, 58, 1028-1043.
Aspin, A. (1949). Tables for use in comparisons whose accuracy involves two variances separately estimated. Biometrika, 36, 290-293.
Auguinis, H., & Whitehead, R. (1997). Sampling variance in the correlation coefficient under indirect range restriction: Implications for validity generalization. Journal of Applied Psychology, 82, 528-538.
Barnett, V., & Lewis, T. (1994). Outliers in statistical data (3rd ed.). Chichester, England: Wiley.
Barnette, J. J., & McLean, J. E. (1999, November). Empirically based criteria for determining meaningful effect size. Paper presented at the annual meeting of the Mid-South Educational Research Association, Point Clear, AL.
Barnette, J. J., & McLean, J. E. (2002, April). Shedding light on the eta-squared and omega-squared relationships with the standardized effect size. Paper presented at the annual meeting of the American Educational Research Association, New Orleans, LA.
Baugh, F. (2002a). Correcting effect sizes for score reliability: A reminder that measurement and substantive issues are linked inextricably. Educational and Psychological Measurement, 62, 254-263.
Baugh, F. (2002b). Correcting effect sizes for score reliability. In B. Thompson (Ed.), Score reliability: Contemporary thinking on reliability issues (pp. 31-41). Thousand Oaks, CA: Sage.
Beal, S. L. (1987). Asymptotic confidence intervals for the difference between two binomial parameters for use with small samples. Biometrics, 43, 941-950.
Beatty, M. J. (2002). Do we know a vector from a scalar? Why measures of association (not their squares) are appropriate indices of effect. Human Communication Research, 28, 605-611.
Bedrick, E. J. (1987). A family of confidence intervals for the ratio of two binomial proportions. Biometrics, 43, 993-998.
Begg, C. B. (1994). Publication bias. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 399-499). New York: Russell Sage Foundation.
Belsley, D. A., Kuh, E., & Welsch, R. E. (1980). Regression diagnostics: Identifying influential data and sources of collinearity. New York: Wiley.
Bergmann, R., Ludbrook, J., & Spooren, W. P. J. M. (2000). Different outcomes of the Wilcoxon-Mann-Whitney test from different statistics packages. The American Statistician, 54, 72-77.
Bernhardson, C. S. (1975). Type I error rates when multiple comparison procedures follow a significant F test of ANOVA. Biometrics, 31, 229-232.
Bevan, M. F., Denton, J. Q., & Meyers, J. L. (1974). The robustness of the F test to violations of continuity and form of treatment population. British Journal of Mathematical and Statistical Psychology, 27, 199-204.
Bickel, P. J., & Lehmann, E. L. (1975). Descriptive statistics for nonparametric models: II. Location. Annals of Statistics, 3, 1045-1069.
Bird, K. D. (2002). Confidence intervals for effect sizes in analysis of variance. Educational and Psychological Measurement, 62, 197-226.
Bond, C. F., Wiitala, W. L., & Richard, F. D. (2003). Meta-analysis of raw differences. Psychological Methods, 8, 406-418.
Bonett, D. G., & Price, R. M. (2002). Statistical inference for a linear function of medians: Confidence intervals, hypothesis testing, and sample size requirements. Psychological Methods, 7, 370-383.
Borenstein, M. (1994). The case for confidence intervals in controlled clinical trials. Controlled Clinical Trials, 15, 411-428.
Bradeley, M. T., Smith, D., & Stoica, G. (2002). A Monte-Carlo estimation of effect size distortion due to significance testing. Perceptual and Motor Skills, 95, 837-842.
Brant, R. (1990). Comparing classical and resistant outlier rules. Journal of the American Statistical Association, 85, 1083-1090.
Breaugh, J. A. (2003). Effect size estimation: Factors to consider and mistakes to avoid. Journal of Management, 29, 79-97.
Brown, R. A., Evans, D. M., Miller, I. W., Burgess, E. S., & Mueller, T. I. (1997). Cognitive-behavioral treatment for depression in alcoholism. Journal of Consulting and Clinical Psychology, 65, 715-726.
Brunner, E., & Munzel, U. (2000). The nonparametric Behrens-Fisher problem: Asymptotic theory and small-sample approximation. Biometrical Journal, 42, 17-25.
& Puri, M. M. 1. L. (200 (2001). Nonparametric methods in factorial factorial designs. Brunner, E., E., & 1 ) . Nonparametric methods in 1-52. Statistical Papers, 42, 1-52. A. SS.. ((1977). to cast away away stones, a Bryk, A. 1 9 7 7 ) . Evaluating program impact: A time to gather stones stones together. together. New New Directions for for Program Evaluation, Evaluation, I1,, time to gather 32-58. 32-5 8. Bryk, A A.. SS.,. , & & Raudenbush, SS.. W W. ((1988). of variance in experi experi1 9 8 8 ) . Heterogeneity of mental studies: interpretations. . Psychological mental studies : A challenge to conventional conventional interpretations Bulletin, 04, 396-404. Bulletin, 1104, J.,. , & & Sawilwosky, S. (2002). to Sw Sw in the the bracketed interval interval S. (2002 ) . Alternatives to Bunner, J of the the trimmed mean. Journal ooff Modern Modern Applied Applied Statistical Methods, Methods, I1,, of 176-181. 1 76-1 8 1 . Callender, JJ.. c., C, & & Osburn, H H.. G G.. ((1980). 1 980). Development and test of a new model for validity validity generalization. Journal ooff Applied Applied Psychology, Psychology, 65 65,, 543-5 543-558. for generalization. Journal 58. (2004). offfour effect effect sizes for for Single-sub single-subCampbell, J. M. (2004 ) . Statistical comparison o ject designs. Behavior Modification, Modification, 28, 234-246. K. (2000). Resistant Resistant outlier rules and and the the non-Gaussian case. case. Compu ComputaCarling, K. ta tional Statistics Statistics and Data Analysis, Analysis, 33, 33, 249-2 249-258. 58. B. ((1961). off the data, oorr how correlation coef coefCarroll, J. B. 1 96 1 ) . The nature o how tto o choose a correlation 347-372. ficient. Psychometrika, 26, 34 7-3 72. R. M M.,. , & & Nordholm, Nordholm, 1. L. A. ((1975). characteristics of of Kelley' Kelley'ss £e2 1 9 7 5 ) . Sampling characteristics Carroll, R. 2
and Hays' cO d)22. . Educational Educational and andPsychological PsychologicalMeasurement, Measurement, 35, 35,541-554. 541-554. and
Chan, I. S. F. (1998). Exact tests of equivalence and efficacy with a non-zero lower bound for comparative studies. Statistics in Medicine, 17, 1403-1413.
Chan, W., & Chan, W.-L. (2004). Bootstrap standard error and confidence intervals for the correlation corrected for range restriction: A simulation study. Psychological Methods, 9, 369-385.
Chen, P. Y., & Popovich, P. M. (2002). Correlation: Parametric and nonparametric measures. Thousand Oaks, CA: Sage.
Chernick, M. R. (1999). Bootstrap methods: A practitioner's guide. New York: Wiley.
Chuang-Stein, C. (2001). Testing for superiority or inferiority after concluding equivalence? Drug Information Journal, 35, 141-143.
Cleveland, W. S. (1985). The elements of graphing data. Monterey, CA: Wadsworth.
Cleveland, W. S. (Ed.). (1988). The collected works of John W. Tukey: Vol. V. Graphics. New York: Chapman and Hall.
Cliff, N. (1993). Dominance statistics: Ordinal analyses to answer ordinal questions. Psychological Bulletin, 114, 494-509.
Cliff, N. (1996). Ordinal methods for behavioral data analysis. Mahwah, NJ: Lawrence Erlbaum Associates.
Cohen, J. (1973). Eta-squared and partial eta-squared in fixed factor ANOVA designs. Educational and Psychological Measurement, 33, 107-112.
Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). New York: Academic Press.
Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences (3rd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
Cohn, L. D., & Becker, B. J. (2003). How meta-analysis increases statistical power. Psychological Methods, 8, 243-253.
Colditz, G. A., Miller, J. N., & Mosteller, F. (1988). Measuring gain in the evaluation of medical technology: The probability of a better outcome. International Journal of Technology Assessment in Health Care, 4, 637-642.
Cook, R. D., & Weisberg, S. (1982). Residuals and influence in regression. New York: Chapman and Hall.
Cooper, H. M. (1989). Integrating research: A guide for literature reviews. Thousand Oaks, CA: Sage.
Cooper, H., & Findley, M. (1982). Expected effect sizes: Estimates for statistical power analysis in social psychology. Personality and Social Psychology Bulletin, 8, 168-173.
Cooper, H., & Hedges, L. V. (Eds.). (1994). The handbook of research synthesis. New York: Russell Sage Foundation.
Cornfield, J. (1956). A statistical problem arising from retrospective studies. In J. Neyman (Ed.), Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability (Vol. 4, pp. 135-148). Berkeley: University of California Press.
Cortina, J. M., & Nouri, H. (2000). Effect sizes for ANOVA designs. Thousand Oaks, CA: Sage.
Cribbie, R. A., & Keselman, H. J. (2003a). The effects of nonnormality on parametric, nonparametric, and model comparison approaches to pairwise comparisons. Educational and Psychological Measurement, 63, 615-635.
Cribbie, R. A., & Keselman, H. J. (2003b). Pairwise multiple comparisons: A model testing approach versus stepwise procedures. British Journal of Mathematical and Statistical Psychology, 56, 167-182.
Crits-Christoph, P., Tu, X., & Gallop, R. (2003). Therapists as fixed versus random effects - some statistical and conceptual issues: A comment on Siemer and Joormann. Psychological Methods, 8, 518-523.
Crow, E. L. (1991). Response to Rosenthal's comment "How are we doing in soft psychology?" American Psychologist, 46, 1083.
Cumming, G., & Finch, S. (2001). A primer on the understanding, use, and calculation of confidence intervals that are based on central and noncentral distributions. Educational and Psychological Measurement, 61, 532-574.
D'Agostino, R. B. (1971). A second look at analysis of variance on dichotomous data. Journal of Educational Measurement, 8, 327-333.
Darlington, M. L. (1973). Comparing two groups by simple graphs. Psychological Bulletin, 79, 110-116.
Davidson, J. R. T., Rasmussen, J., Hackett, D., & Pitrosky, B. (2002). Effect size comparisons of patient and observer rated scales in generalized anxiety disorder, using the venlafaxine ER dataset. European Neuropsychopharmacology, 12, 346-347.
Davies, L., & Gather, U. (1993). The identification of multiple outliers. Journal of the American Statistical Association, 88, 782-792.
Davison, A. C., & Hinkley, D. V. (1997). Bootstrap methods and their applications. Cambridge, England: Cambridge University Press.
Dayton, C. M. (2003). Information criteria for pairwise comparisons. Psychological Methods, 8, 61-71.
DeCarlo, L. T. (1997). On the meaning and use of kurtosis. Psychological Methods, 2, 292-307.
Delaney, H. D., & Vargha, A. (2002). Comparing several robust tests of stochastic equality with ordinally scaled variables and small to moderate sized samples. Psychological Methods, 7, 485-503.
Diaconis, P., & Efron, B. (1983). Computer-intensive methods in statistics. Scientific American, 248(5), 116-130.
Dixon, W. J., & Massey, F. J. (1983). Introduction to statistical analysis (4th ed.). New York: McGraw-Hill.
Dodd, D. H., & Schultz, R. F. (1973). Computational procedures for estimating magnitude of effect for some analysis of variance designs. Psychological Bulletin, 79, 391-395.
Doksum, K. A. (1977). Some graphical methods in statistics. A review and some extensions. Statistica Neerlandica, 31, 53-68.
Dunlap, W. P. (1999). A program to compute McGraw and Wong's common language effect size indicator. Behavior Research Methods, Instruments, & Computers, 31, 706-709.
Dwyer, J. H. (1974). Analysis of variance and the magnitude of effects: A general approach. Psychological Bulletin, 81, 731-737.
Efron, B., & Tibshirani, R. J. (1993). An introduction to the bootstrap. New York: Chapman and Hall.
Emerson, J. D., & Moses, L. E. (1985). A note on the Wilcoxon-Mann-Whitney test for 2 x k ordered tables. Biometrics, 41, 303-309.
Emerson, J. D., & Stoto, M. A. (1983). Transforming data. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 97-127). New York: Wiley.
Ezekiel, M. (1930). Methods of correlational analysis. New York: Wiley.
Fahoome, G. (2002). Twenty nonparametric statistics and their large-sample approximations. Journal of Modern Applied Statistical Methods, 2, 248-268.
Fan, X. (2001). Statistical significance and effect size in education research: Two sides of a coin. Journal of Educational Research, 94, 275-282.
Fay, B. R. (2002). JMASM4: Critical values for four nonparametric and/or distribution-free tests of location for two independent samples. Journal of Modern Applied Statistical Methods, 2, 489-517.
Fay, B. R. (2003). A Monte Carlo computer study of the power properties of six distribution-free and/or nonparametric statistical tests under various methods of resolving tied ranks when applied to normal and nonnormal data distributions. Unpublished doctoral dissertation, Wayne State University, Detroit, MI.
Feingold, A. (1992). Sex differences in variability in intellectual abilities: A new look at an old controversy. Review of Educational Research, 62, 61-84.
Feingold, A. (1995). The additive effects of differences in central tendency and variability are important in comparisons between groups. American Psychologist, 50, 5-13.
Feinstein, A. R. (1998). P-values and confidence intervals: Two sides of the same unsatisfactory coin. Journal of Clinical Epidemiology, 61, 355-360.
Fern, E. F., & Monroe, K. B. (1996). Effect-size estimates: Issues and problems in interpretation. Journal of Consumer Research, 23, 89-105.
Feske, U., & Goldstein, A. J. (1997). Eye movement desensitization and reprocessing treatment for panic disorder: A controlled outcome and partial dismantling study. Journal of Consulting and Clinical Psychology, 65, 1026-1035.
Fidler, F., Thomason, N., Cumming, G., Finch, S., & Leeman, J. (2004). Editors can lead researchers to confidence intervals, but can't make them think. Psychological Science, 15, 119-126.
Fidler, F., & Thompson, B. (2001). Computing correct confidence intervals for ANOVA fixed- and random-effects effect sizes. Educational and Psychological Measurement, 61, 575-604.
Fisher, R. A. (1925). Statistical methods for research workers. London: Oliver & Boyd.
Fleiss, J. L. (1986). The design and analysis of clinical experiments. New York: Wiley.
Fleiss, J. L. (1994). Measures of effect size for categorical data. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 245-260). New York: Russell Sage Foundation.
Fleiss, J. L., Levin, B., & Paik, M. C. (2003). Statistical methods for rates and proportions (3rd ed.). New York: Wiley.
Fligner, M. A., & Policello, G. E., II. (1981). Robust rank procedures for the Behrens-Fisher problem. Journal of the American Statistical Association, 76, 162-168.
Fowler, R. L. (1987). A general method for comparing effect magnitudes in ANOVA designs. Educational and Psychological Measurement, 47, 361-367.
Fox, J. (1999). Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.
Frick, R. W. (1995). A problem with confidence intervals. American Psychologist, 50, 1102-1103.
Frigge, M., Hoaglin, D. C., & Iglewicz, B. (1989). Some implementations of the boxplot. The American Statistician, 43, 50-54.
Gart, J. J., & Nam, J. (1988). Approximate interval estimation of the ratio of binomial parameters: A review and corrections of skewness. Biometrics, 44, 323-338.
Gart, J. J., & Thomas, D. G. (1972). Numerical results on approximate confidence limits for the odds ratio. Journal of the Royal Statistical Society (Series B), 34, 441-447.
Gibbons, J. D. (1985). Nonparametric statistical inference (2nd ed.). New York: Dekker.
Gibbons, J. D. (1993). Nonparametric measures of association. Thousand Oaks, CA: Sage.
Gigerenzer, G., & Edwards, A. (2003, September 27). Simple tools for understanding risk: From innumeracy to insight [Electronic version]. British Medical Journal, 327, 741-744. Retrieved March 10, 2004, from http://bmj.bmjjournals.com/cgi/content/full/327/7417/741
Gillett, R. (2003). The metric comparability of meta-analytic effect size measures. Psychological Methods, 8, 419-433.
Glass, G. V., & Hopkins, K. D. (1996). Statistical methods in psychology and education (3rd ed.). Boston: Allyn & Bacon.
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Thousand Oaks, CA: Sage.
Gleser, L. J. (1996). Comment on "Bootstrap Confidence Intervals." Statistical Science, 11, 219-221.
Gliner, J. A., Morgan, G. A., & Harmon, R. J. (2002). The chi-square test and accompanying effect size indices. Journal of the American Academy of Child and Adolescent Psychology, 41, 1510-1512.
Goldberg, K. M., & Iglewicz, B. (1992). Bivariate extensions of the boxplot. Technometrics, 34, 307-320.
Goodman, L. A. (1964). Simultaneous confidence limits for cross-product ratios in contingency tables. Journal of the Royal Statistical Society (Series B), 26, 86-102.
Goodman, L. A. (1969). How to ransack social mobility tables and other kinds of cross-classification tables. American Journal of Sociology, 75, 1-40.
Gorecki, J. A. (2002). A meta-analysis of the effectiveness of antidepressants compared to placebo. Unpublished master's thesis, San Francisco State University, San Francisco.
Grice, G. R. (1966). Dependence of empirical laws upon the source of experimental variation. Psychological Bulletin, 66, 488-499.
Grissom, R. J. (1994a). Probability of the superior outcome of one treatment over another. Journal of Applied Psychology, 79, 314-316.
Grissom, R. J. (1994b). Statistical analysis of ordinal categorical status after therapies. Journal of Consulting and Clinical Psychology, 62, 281-284.
Grissom, R. J. (1996). The magical number .7 ± .2: Meta-meta analysis of the probability of superior outcome in comparisons involving therapy, placebo, and control. Journal of Consulting and Clinical Psychology, 64, 973-982.
Grissom, R. J. (2000). Heterogeneity of variance in clinical data. Journal of Consulting and Clinical Psychology, 68, 155-165.
Grissom, R. J., & Kim, J. J. (2001). Review of assumptions and problems in the appropriate conceptualization of effect size. Psychological Methods, 6, 135-146.
Gross, J. S. (1985). Weight modification and eating disorders in adolescent boys and girls. Unpublished doctoral dissertation, University of Vermont, Burlington.
Haase, R. F., Waechter, D. M., & Solomon, G. S. (1982). How significant is a significant difference? Average effect size of research in counseling psychology. Journal of Counseling Psychology, 29, 58-65.
Haddock, C. K., Rindskopf, D., & Shadish, W. R. (1998). Using odds ratios as effect sizes for meta-analysis of dichotomous data: A primer on methods and issues. Psychological Methods, 3, 339-353.
Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., & Stahel, W. A. (1986). Robust statistics: The approach based on influence functions. New York: Wiley.
Hand, D. J. (1992). On comparing two treatments. The American Statistician, 46, 190-192.
Hand, D. J., Daly, F., Lunn, A. D., McConway, K. J., & Ostrowski, E. (1994). A handbook of small data sets. London: Chapman and Hall.
Harlow, L. L., Mulaik, S. A., & Steiger, J. H. (Eds.). (1997). What if there were no significance tests? Mahwah, NJ: Lawrence Erlbaum Associates.
Harrell, F. E., & Davis, D. E. (1982). A new distribution-free quantile estimator. Biometrika, 69, 635-640.
Hauck, W. W., & Anderson, S. (1986). A comparison of large-sample confidence interval methods for the difference of two binomial probabilities. The American Statistician, 40, 318-322.
Hays, W. L. (1994). Statistics for psychologists (5th ed.). Fort Worth, TX: Harcourt Brace.
Hedges, L. V. (1981). Distribution theory for Glass's estimator of effect size and related estimators. Journal of Educational Statistics, 6, 107-128.
Hedges, L. V. (1982). Estimation of effect size from a series of independent experiments. Psychological Bulletin, 92, 490-499.
Hedges, L. V., & Friedman, L. (1993). Gender differences in variability in intellectual abilities: A re-analysis of Feingold's results. Review of Educational Research, 63, 94-105.
Hedges, L. V., & Nowell, A. (1995). Sex differences in mental test scores, variability, and numbers of high-scoring individuals. Science, 269, 41-45.
Hedges, L. V., & Olkin, I. (1984). Nonparametric estimators of effect size in meta-analysis. Psychological Bulletin, 96, 573-580.
Hedges, L. V., & Olkin, I. (1985). Statistical methods for meta-analysis. San Diego, CA: Academic Press.
Hekmat, H. (1973). Systematic versus semantic desensitization and implosive therapy: A comparative study. Journal of Consulting and Clinical Psychology, 4, 202-209.
Hess, B., Olejnik, S., & Huberty, C. J. (2001). The efficacy of two improvement over chance effect sizes for two-group univariate comparisons under variance heterogeneity and nonnormality. Educational and Psychological Measurement, 61, 909-936.
Hildebrand, D. K., Laing, J. D., & Rosenthal, H. (1977). Analysis of ordinal data. Thousand Oaks, CA: Sage.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1983). Understanding robust and exploratory data analysis. New York: Wiley.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (Eds.). (1985). Exploring data tables, trends, and shapes. New York: Wiley.
Hogarty, K. Y., & Kromrey, J. D. (2001, April). We've been reporting some effect sizes: Can you guess what they mean? Paper presented at the annual meeting of the American Educational Research Association, Seattle, WA.
Hou, C.-D., Chiang, J., & Tai, J. J. (2003). A family of simultaneous confidence intervals for multinomial proportions. Computational Statistics and Data Analysis, 43, 29-45.
Howell, D. C. (1997). Statistical methods for psychology (4th ed.). Boston: Duxbury.
Hsu, J. C. (1996). Multiple comparisons: Theory and methods. New York: Chapman and Hall.
Hsu, L. M. (2004). Biases of success rate differences shown in binomial effect size displays. Psychological Methods, 9, 183-197.
Huberty, C. J. (2002). A history of effect size indices. Educational and Psychological Measurement, 62, 227-240.
Hunter, J. E., & Schmidt, F. L. (1994). Correcting for sources of artifactual variance across studies. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 323-338). New York: Russell Sage Foundation.
Hunter, J. E., & Schmidt, F. L. (2004). Methods of meta-analysis (2nd ed.). Thousand Oaks, CA: Sage.
Huynh, C. L. (1989). A unified approach to the estimation of effect size in meta-analysis. Paper presented at the Annual Meeting of the American Educational Research Association, San Francisco, CA. (ERIC Document Reproduction Service Clearinghouse No. ED306248)
Hyndman, R. J., & Fan, Y. (1996). Sample quantiles in statistical packages. The American Statistician, 50, 361-365.
Izenman, A. J. (1991). Recent developments in nonparametric density estimation. Journal of the American Statistical Association, 86, 205-224.
Jacoby, W. G. (1997). Statistical graphics for univariate and bivariate data. Thousand Oaks, CA: Sage.
Jones, L. V., & Tukey, J. W. (2000). A sensible formulation of the significance test. Psychological Methods, 5, 411-414.
Kempthorne, O., & Folks, L. (1971). Probability, statistics, and data analysis. Ames, IA: Iowa State University Press.
Kendall, P. C., & Grove, W. M. (1988). Normative comparisons in therapy outcome. Behavioral Assessment, 10, 147-158.
Kendall, P. C., Marrs-Garcia, A., Nath, S. R., & Sheldrick, R. C. (1999). Normative comparisons for the evaluation of clinical significance. Journal of Consulting and Clinical Psychology, 67, 285-299.
Keppel, G. (1991). Design and analysis: A researcher's handbook (3rd ed.). Englewood Cliffs, NJ: Prentice-Hall.
Keren, G., & Lewis, C. (1979). Partial omega squared for ANOVA designs. Educational and Psychological Measurement, 39, 119-128.
Keselman, H. (1975). A Monte Carlo investigation of three estimates of treatment magnitude: Epsilon squared, eta squared, and omega squared. Canadian Psychological Review, 16, 44-48.
Keselman, H. J., Cribbie, R. A., & Wilcox, R. R. (2002). Pairwise multiple comparison tests when data are nonnormal. Educational and Psychological Measurement, 62, 420-434.
Keselman, H. J., Huberty, C. J., Lix, L. M., Olejnik, S., Cribbie, R. A., Donahue, B., Kowalchuk, R. K., Lowman, R. L., Petoskey, M. D., Keselman, J. C., & Levin, J. R. (1998). Statistical practices of educational researchers: An analysis of their ANOVA, MANOVA and ANCOVA analyses. Review of Educational Research, 68, 350-386.
Keselman, H. J., Othman, A. R., Wilcox, R. R., & Fradette, K. (2004). The new and improved two-sample t test. Psychological Science, 15, 47-51.
Keselman, H. J., Wilcox, R. R., & Lix, L. M. (2003). A generally robust approach to hypothesis testing in independent and correlated groups designs. Psychophysiology, 40, 586-596.
Keselman, H. J., Wilcox, R. R., Othman, A. R., & Fradette, K. (2002). Trimming, transforming statistics, and bootstrapping: Circumventing the biasing effects of heteroscedasticity and nonnormality. Journal of Modern Applied Statistical Methods, 2, 288-309.
Kirk, R. E. (1996). Practical significance: A concept whose time has come. Educational and Psychological Measurement, 56, 746-759.
Kirk, R. E. (2001). Promoting good statistical practices: Some suggestions. Educational and Psychological Measurement, 61, 213-218.
Kleinknecht, R. A., Dinnel, D. L., Kleinknecht, E. E., Hiruma, N., & Hirada, N. (1997). Cultural factors in social anxiety: A comparison of social phobia symptoms and taijin kyofusho. Journal of Anxiety Disorders, 11, 157-177.
Kline, R. B. (2004). Beyond significance testing: Reforming data analysis methods in behavioral research. Washington, DC: American Psychological Association.
Knapp, T. R. (2002). Some reflections on significance testing. Journal of Modern Applied Statistical Methods, 1, 240-242.
Knapp, T. R. (2003). Was Monte Carlo necessary? Journal of Modern Applied Statistical Methods, 2, 237-241.
Knapp, T. R., & Sawilowsky, S. S. (2001). Constructive criticisms of methodological and editorial practices. The Journal of Experimental Education, 70, 65-69.
Kraemer, H. C. (1983). Theory of estimation and testing of effect sizes: Use in meta-analysis. Journal of Educational Statistics, 8, 93-101.
Kraemer, H. C., & Andrews, G. (1982). A non-parametric technique for meta-analysis effect size calculation. Psychological Bulletin, 91, 404-412.
Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Thousand Oaks, CA: Sage.
Krueger, J. (2001). Null hypothesis significance testing: On the survival of a flawed method. American Psychologist, 56, 16-26.
Kruskal, W. H. (1957). Historical notes on the Wilcoxon unpaired two-sample test. Journal of the American Statistical Association, 52, 356-360.
Laird, N. M., & Mosteller, F. (1990). Some statistical methods for combining experimental results. International Journal of Technology Assessment in Health Care, 6, 5-30.
Lambert, M. J., & Bergin, A. E. (1994). The effectiveness of psychotherapy. In A. E. Bergin & S. L. Garfield (Eds.), Handbook of psychotherapy and behavior change (4th ed., pp. 143-189). New York: Wiley.
Laupacis, A., Sackett, D. L., & Roberts, R. S. (1988). An assessment of clinically useful measures of the consequences of treatment. New England Journal of Medicine, 318, 1728-1733.
Lax, D. A. (1985). Robust estimators of scale: Finite sample performance in long-tailed symmetric distributions. Journal of the American Statistical Association, 80, 736-741.
Lehmann, E. L. (1975). Nonparametrics: Statistical methods based on ranks. San Francisco: Holden-Day.
Levin, J. R., & Robinson, D. H. (1999). Further reflections on hypothesis testing and editorial policy for primary research journals. Educational Psychological Review, 11, 143-155.
Levin, J. R., & Robinson, D. H. (2003). The trouble with interpreting statistically nonsignificant effect sizes in single-study investigations. Journal of Modern Applied Statistical Methods, 2, 231-236.
Levine, T. R., & Hullett, C. R. (2002). Eta squared, partial eta squared, and misreporting of effect size in communication research. Human Communication Research, 28, 612-625.
Levy, P. (1967). Substantive significance of significant differences between two groups. Psychological Bulletin, 67, 37-40.
Liebetrau, A. M. (1983). Measures of association. Thousand Oaks, CA: Sage.
Lipsey, M. W. (1990). Design sensitivity: Statistical power for experimental research. Thousand Oaks, CA: Sage.
Lipsey, M. W. (2000). Statistical conclusion validity for intervention research. In L. Bickman (Ed.), Validity and social experimentation (pp. 101-120). Thousand Oaks, CA: Sage.
Lipsey, M. W., & Wilson, D. B. (1993). The efficacy of psychological, educational, and behavioral treatments: Confirmation from meta-analysis. American Psychologist, 48, 1181-1209.
Lipsey, M. W., & Wilson, D. B. (2001). Practical meta-analysis. Thousand Oaks, CA: Sage.
Liu, Q. (1998). An order-directed score test for trend in ordered 2 × k tables. Biometrics, 54, 1147-1154.
Lix, L. M., Cribbie, R., & Keselman, H. J. (1996, June). The analysis of completely randomized univariate designs. Paper presented at the annual meeting of the Psychometric Society, Banff, Alberta, Canada.
Lunneborg, C. E. (1999). Data analysis by resampling: Concepts and applications. Pacific Grove, CA: Duxbury.
Lunneborg, C. E. (2001). Random assignment of available cases: Bootstrap standard errors and confidence intervals. Psychological Methods, 6, 402-412.
Lunney, G. H. (1970). Using analysis of variance with a dichotomous dependent variable: An empirical study. Journal of Educational Measurement, 7, 263-269.
Mancini, G. B. J., & Schulzer, M. (1999). Reporting risks and benefits of therapy by the use of the concepts of unqualified success and unmitigated failure. Circulation, 99, 377-383.
Mann, H. B., & Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18, 50-60.
Markus, K. A. (2001). The converse inequality argument against tests of statistical significance. Psychological Methods, 6, 147-160.
Martell, R. F., Lane, D. M., & Emrich, C. (1996). Male-female differences: A computer simulation. American Psychologist, 51, 157-158.
Martín Andrés, A., & Herranz Tejedor, I. (2003). Unconditional confidence interval for the difference between two proportions. Biometrical Journal, 45, 426-436.
Martín Andrés, A., & Herranz Tejedor, I. (2004). Exact unconditional non-classical tests on the difference of two proportions. Computational Statistics and Data Analysis, 45, 373-388.
Matsumoto, D., Grissom, R. J., & Dinnel, D. L. (2001). Do between-culture differences really mean that people are different? A look at some measures of cultural effect size. Journal of Cross-Cultural Psychology, 32, 478-490.
Maxwell, S. E. (1998). Longitudinal designs in randomized group comparisons: When will intermediate observations increase statistical power? Psychological Methods, 3, 275-290.
Maxwell, S. E. (2004). The persistence of underpowered studies in psychological research: Causes, consequences, and remedies. Psychological Methods, 9, 147-163.
Maxwell, S. E., Camp, C. C., & Arvey, R. D. (1981). Measures of strength of association: A comparative examination. Journal of Applied Psychology, 66, 525-534.
Maxwell, S. E., & Delaney, H. D. (1985). Measurement and statistics: An examination of construct validity. Psychological Bulletin, 97, 85-93.
Maxwell, S. E., & Delaney, H. D. (2004). Designing experiments and analyzing data: A model comparison perspective (2nd ed.). Mahwah, NJ: Lawrence Erlbaum Associates.
McGraw, K. O. (1991). Problems with the BESD: A comment on Rosenthal's "How are we doing in soft psychology?" American Psychologist, 46, 1084-1086.
McGraw, K. O., & Wong, S. P. (1992). A common language effect size statistic. Psychological Bulletin, 111, 361-365.
McKean, J. W., & Schrader, R. M. (1984). A comparison of methods for studentizing the sample median. Communications in Statistics-Simulation and Computation, 13, 751-773.
McNemar, Q. (1962). Psychological statistics (3rd ed.). New York: Wiley.
Mee, R. W. (1990). Confidence intervals for probabilities and tolerance regions based on a generalization of the Mann-Whitney statistic. Journal of the American Statistical Association, 85, 793-800.
Meeks, S. L., & D'Agostino, R. B. (1983). A note on the use of confidence limits following rejection of a null hypothesis. The American Statistician, 37, 134-136.
Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 105, 156-166.
Mohr, D. C. (1995). Negative outcomes in psychotherapy: A critical review. Clinical Psychology: Science and Practice, 2, 1-27.
Morris, S. B., & DeShon, R. P. (2002). Combining effect size estimates in meta-analysis with repeated measures and independent-groups designs. Psychological Methods, 7, 105-125.
Moses, L. E. (1986). Think and explain with statistics. Reading, MA: Addison-Wesley.
Moses, L. E., Emerson, J. D., & Hosseini, H. (1984). Analyzing data from ordered categories. New England Journal of Medicine, 311, 442-448.
Mosteller, F., & Chalmers, T. C. (1992). Some progress and problems in meta-analysis of clinical trials. Statistical Science, 7, 227-236.
Mueller, C. G. (1949). Numerical transformations in the analysis of experimental data. Psychological Bulletin, 46, 198-223.
Murphy, B. P. (1976). Comparison of some two sample means tests by simulation. Communications in Statistics-Simulation and Computation, B5(1), 23-32.
Murphy, K. R., & Myors, B. (2003). Statistical power analysis: A simple and general model for traditional and modern hypothesis tests. Mahwah, NJ: Lawrence Erlbaum Associates.
Murray, L. W., & Dosser, D. A. (1987). How significant is a significant difference? Problems with the measurement of magnitude of effect. Journal of Counseling Psychology, 34, 68-72.
Nanna, M. J. (2002). Hoteling's T² vs. the rank transformation with real Likert data. Journal of Modern Applied Statistical Methods, 1, 83-99.
Nanna, M. J., & Sawilowsky, S. S. (1998). Analysis of Likert scale data in disability and medical rehabilitation research. Psychological Methods, 3, 55-56.
Newcombe, R. G. (1998). Interval estimation for the difference between independent proportions: Comparison of eleven methods. Statistics in Medicine, 17, 873-890.
Nickerson, R. S. (2000). Null hypothesis significance testing: A review of an old and continuing controversy. Psychological Methods, 5, 241-301.
Norusis, M. J. (1995). SPSS 6.1 guide to data analysis. Englewood Cliffs, NJ: Prentice Hall.
Nouri, H., & Greenberg, R. H. (1985). Meta-analytic procedures for estimation of effect sizes in experiments using complex analysis of variance. Journal of Management, 21, 801-812.
O'Brien, P. C. (1988). Comparing two samples: Extensions of the t, rank-sum, and log-rank tests. Journal of the American Statistical Association, 83, 52-61.
O'Grady, K. E. (1982). Measures of explained variance: Cautions and limitations. Psychological Bulletin, 92, 766-777.
Olejnik, S., & Algina, J. (2000). Measures of effect size for comparative studies: Applications, interpretations, and limitations. Contemporary Educational Psychology, 25, 241-286.
Olejnik, S., & Algina, J. (2003). Generalized eta and omega squared statistics: Measures of effect size for some common research designs. Psychological Methods, 8, 434-437.
Onwuegbuzie, A. J., & Levin, J. R. (2003). Without supporting evidence, where would measures of substantive importance lead? Journal of Modern Applied Statistical Methods, 2, 133-151.
Othman, A. R., Keselman, H. J., Wilcox, R. R., Fradette, K., & Padmanabhan, A. R. (2002). A test of symmetry. Journal of Modern Applied Statistical Methods, 1, 310-315.
Ozer, D. J. (1985). Correlation and the coefficient of determination. Psychological Bulletin, 97, 307-315.
Parker, S. (1995). The "difference of means" may not be the "effect size." American Psychologist, 50, 1101-1102.
Pedersen, W. C., Miller, L. C., Putcha-Bhagavatula, A. D., & Yang, Y. (2002). Evolved sex differences in sexual strategies: The long and the short of it. Psychological Science, 13, 157-161.
Penfield, R. D. (2003). A score method of constructing asymmetric confidence intervals for the mean of a rating scale item. Psychological Methods, 8, 149-163.
Perry, K. T., & Stoline, M. R. (2002). A comparison of the D'Agostino Su test to the triples test for testing of symmetry versus asymmetry as a preliminary test to testing the equality of means. Journal of Modern Applied Statistical Methods, 1, 316-325.
Plake, B. S., Impara, J. C., & Spies, R. A. (Eds.). (2003). The fifteenth mental measurements yearbook. Lincoln, NB: Buros Institute.
Posch, M. (2002). Asymptotic and exact tests in 2 × c ordered categorical contingency tables. Journal of Modern Applied Statistical Methods, 1, 167-175.
Pratt, J. W. (1964). Obustness (sic) of some procedures for the two-sample location problem. Journal of the American Statistical Association, 59, 665-680.
Pratt, J. W. (1968). A normal approximation for binomial, F, beta, and other common related tail probabilities, II. Journal of the American Statistical Association, 63, 1457-1483.
Pratt, J. W., & Gibbons, J. D. (1981). Concepts of nonparametric theory. New York: Springer-Verlag.
Preece, P. F. W. (1983). A measure of experimental effect size based on success rates. Educational and Psychological Measurement, 43, 763-766.
Prentice, D. A., & Miller, D. T. (1992). When small effects are impressive. Psychological Bulletin, 112, 160-164.
Ramsey, P. H. (1980). Exact Type I error rates for robustness of Student's t test with unequal variance. Journal of Educational Statistics, 5, 337-350.
Randles, R. H. (2001). On neutral responses (zeros) in the sign test and ties in the Wilcoxon-Mann-Whitney test. The American Statistician, 55, 96-101.
Raudenbush, S. W., & Bryk, A. S. (1987). Examining correlates of diversity. Journal of Educational Statistics, 12, 241-269.
Rayner, J. C. W., & Best, D. J. (2001). A contingency table approach to nonparametric testing. Boca Raton, FL: Chapman & Hall/CRC.
Reed III, J. F. (2003). Solutions to the Behrens-Fisher problem. Computer Methods and Programs in Biomedicine, 70, 259-263.
Reichardt, C. S., & Gollob, H. F. (1997). When confidence intervals should be used instead of statistical tests, and vice versa. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 259-284). Mahwah, NJ: Lawrence Erlbaum Associates.
Rice, M. E. (1997). Violent offender research and implications for the criminal justice system. American Psychologist, 52, 414-423.
Richardson, J. T. E. (1996). Measures of effect size. Behavior Research Methods, Instruments & Computers, 28, 12-22.
Roberts, J. K., & Henson, R. K. (2003). Not all effects are created equal: A rejoinder to Sawilowsky. Journal of Modern Applied Statistical Methods, 2, 226-230.
Robinson, D. H., & Levin, J. R. (1997). Reflections on statistical and substantive significance, with a slice of replication. Educational Researcher, 26, 21-26.
Rohmel, J., & Mansmann, U. (1999). Unconditional non-asymptotic one-sided tests for independent binomial proportions when the interest lies in showing non-inferiority and/or superiority. Biometrical Journal, 41, 149-170.
Ronis, D. L. (1981). Comparing the magnitude of effects in ANOVA designs. Educational and Psychological Measurement, 41, 993-1000.
Rosenberger, J. L., & Gasko, M. (1983). Comparing location estimators: Trimmed means, medians, and trimeans. In D. C. Hoaglin, F. Mosteller, & J. W. Tukey (Eds.), Understanding robust and exploratory data analysis (pp. 297-336). New York: Wiley.
Rosenthal, R. (1991a). Effect sizes: Pearson's correlation, its display via the BESD, and alternative indices. American Psychologist, 46, 1086-1087.
Rosenthal, R. (1991b). Meta-analytic procedures for social research. Thousand Oaks, CA: Sage Press.
Rosenthal, R. (2000). Effect sizes in behavioral and biomedical research. In L. Bickman (Ed.), Validity and social experimentation (pp. 121-139). Thousand Oaks, CA: Sage.
Rosenthal, R., Rosnow, R. L., & Rubin, D. B. (2000). Contrasts and effect sizes for behavioral research. Cambridge, England: Cambridge University Press.
Rosenthal, R., & Rubin, D. B. (1982). A simple general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-169.
Rosenthal, R., & Rubin, D. B. (1994). The counternull value of an effect size: A new statistic. Psychological Science, 5, 329-334.
Rudas, T. (1998). Odds ratios in the analysis of contingency tables. Thousand Oaks, CA: Sage.
Sackett, D. L., Strauss, S. E., Richardson, W. S., Rosenberg, W., & Haynes, R. B. (2000). Evidence based medicine: How to practice and teach EBM (2nd ed.). Edinburgh: Churchill Livingstone.
Sánchez-Meca, J., Marín-Martínez, F., & Chacón-Moscoso, S. (2003). Effect-size indices for dichotomized outcomes in meta-analysis. Psychological Methods, 8, 448-467.
Santner, T. J., & Snell, M. K. (1980). Small-sample confidence intervals for p1 - p2 and p1/p2 in 2 × 2 contingency tables. Journal of the American Statistical Association, 75, 386-394.
Satterthwaite, F. E. (1946). An approximate distribution of estimates of variance components. Biometrics Bulletin, 2, 110-114.
Sawilowsky, S. S. (2002). A measure of relative efficiency for location of a single sample. Journal of Modern Applied Statistical Methods, 1, 52-60.
Sawilowsky, S. S. (2003). Trivials: The birth, sale, and final production of meta-analysis. Journal of Modern Applied Statistical Methods, 2, 242-246.
Sawilowsky, S. S., & Blair, R. C. (1992). A more realistic look at the robustness and type II error properties of the t test to departures from population normality. Psychological Bulletin, 111, 352-360.
Sawilowsky, S. S., & Fahoome, G. (2003). Statistics through Monte Carlo simulation with Fortran. Rochester Hills, MI: Journal of Modern Applied Statistical Methods.
Sawilowsky, S. S., & Markman, B. S. (2002). Using the t test with uncommon sample sizes. Journal of Modern Applied Statistical Methods, 1, 145-146.
Sawilowsky, S. S., & Yoon, J. S. (2002). The trouble with trivials (p > .05). Journal of Modern Applied Statistical Methods, 1, 143-144.
Schmidt, F. L., & Hunter, J. E. (1996). Measurement error in psychological research: Lessons from 26 research scenarios. Psychological Methods, 1, 199-223.
Schmidt, F. L., Le, H., & Ilies, R. (2003). Beyond alpha: An empirical examination of the effects of different sources of measurement error on reliability estimates for measures of individual differences constructs. Psychological Methods, 8, 206-224.
Schmidt, N. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for training of researchers. Psychological Methods, 1, 115-129.
Schulzer, M., & Mancini, G. B. J. (1996). 'Unqualified success' and 'unmitigated failure': Number-needed-to-treat related concepts for assessing treatment efficacy in the presence of treatment induced adverse effects. International Journal of Epidemiology, 25, 704-712.
Schwertman, N. C., Owens, M. A., & Adnan, R. (2004). A simple more general boxplot method for identifying outliers. Computational Statistics and Data Analysis, 47, 165-174.
Serfling, R. J. (1980). Approximation theorems of mathematical statistics. New York: Wiley.
Serlin, R. C. (2002). Constructive criticism. Journal of Modern Applied Statistical Methods, 1, 202-227.
Serlin, R. C., Wampold, B. E., & Levin, J. R. (2003). Should providers of treatment be regarded as a random factor? If it ain't broke, don't fix it: A comment on Siemer and Joormann (2003). Psychological Methods, 8, 524-534.
Shaffer, J. P. (2002). Multiplicity, directional (Type III) errors, and the null hypothesis. Psychological Methods, 7, 356-369.
Siegel, S., & Castellan, N. J. (1988). Nonparametric statistics for the behavioral sciences (2nd ed.). New York: McGraw-Hill.
Siemer, M., & Joormann, J. (2003a). Power and measures of effect size in analysis of variance with fixed versus random nested factors. Psychological Methods, 8, 497-517.
Siemer, M., & Joormann, J. (2003b). Assumptions and consequences of treating providers in therapy studies as fixed versus random effects: Reply to Crits-Christoph, Tu, and Gallop (2003) and Serlin, Wampold, and Levin (2003). Psychological Methods, 8, 535-544.
Silverman, B. W. (1986). Density estimation for statistics and data analysis. New York: Chapman and Hall.
Simonoff, J. S., Hochberg, Y., & Reiser, B. (1986). Alternative estimation procedures for Pr(X < Y) in categorical data. Biometrics, 42, 895-907.
Skinner, B. F. (1958). Teaching machines. Science, 128, 969-977.
Skipka, G. (2003). The likelihood ratio test for order restricted hypotheses in non-inferiority trials. Unpublished doctoral dissertation. Gottingen University, Gottingen, Germany.
Smith, M. L., & Glass, G. V. (1977). Meta-analysis of psychotherapy outcome studies. American Psychologist, 32, 752-760.
Smithson, M. (2001). Correct confidence intervals for various regression effect sizes and parameters. Educational and Psychological Measurement, 61, 605-632.
Smithson, M. (2003). Confidence intervals. Thousand Oaks, CA: Sage.
Snedecor, G. W., & Cochran, W. G. (1989). Statistical methods (8th ed.). Ames, IA: Iowa State University Press.
Snyder, D. K., Wills, R. M., & Grady-Fletcher, A. (1991). Long-term effectiveness of behavioral vs. insight-oriented marital therapy: A 4-year follow-up study. Journal of Consulting and Clinical Psychology, 59, 138-141.
Snyder, P., & Lawson, S. (1993). Evaluating results using corrected and uncorrected effect size estimates. Journal of Experimental Education, 61, 334-349.
Somers, R. H. (1962). A new asymmetrical measure of association for ordinal variables. American Sociological Review, 27, 799-811.
Sparks, J. N. (1967). Effects of inapplicability of the continuity condition upon the probability distributions of selected statistics and their implications for research in education (Final Rep. No. RIE SYN71840). University Park: Pennsylvania State University.
Sprent, P. (1998). Data driven statistical methods. London: Chapman and Hall.
Staudte, R. G., & Sheather, S. J. (1990). Robust estimation and testing. New York: Wiley.
Steiger, J. H. (1999). STATISTICA power analysis. Tulsa, OK: StatSoft.
Steiger, J. H. (2004). Beyond the F test: Effect size confidence intervals and tests of close fit in the analysis of variance and contrast analysis. Psychological Methods, 9, 164-182.
Steiger, J. H., & Fouladi, R. T. (1997). Noncentrality interval estimation and the evaluation of statistical methods. In L. L. Harlow, S. A. Mulaik, & J. H. Steiger (Eds.), What if there were no significance tests? (pp. 221-257). Mahwah, NJ: Lawrence Erlbaum Associates.
Storer, B. E., & Kim, C. (1990). Exact properties of some exact test statistics for comparing two binomial proportions. Journal of the American Statistical Association, 85, 146-155.
Strahan, R. F. (1991). Remarks on the binomial effect size display. American Psychologist, 46, 1083-1084.
Susskind, E. C., & Howland, E. W. (1980). Measuring effect magnitude in repeated measures ANOVA designs: Implications for gerontological research. Journal of Gerontology, 35, 867-876.
Thompson, B. (1993). The use of statistical significance tests in research: Bootstrap and other alternatives. Journal of Experimental Education, 61, 361-377.
Thompson, B. (1999, April). Common methodology mistakes in educational research, revisited, along with a primer on both effect sizes and the bootstrap. Paper presented at the annual meeting of the American Educational Research Association, Montreal, Canada.
Thompson, B. (2002). What future quantitative social science research could look like: Confidence intervals for effect sizes. Educational Researcher, 31, 25-32.
Thompson, K. N., & Schumacker, R. E. (1997). An evaluation of Rosenthal and Rubin's binomial effect size display. Journal of Educational and Behavioral Statistics, 22, 109-117.
Timm, N. H. (2004). Estimating effect sizes in exploratory experimental studies when using a linear model. The American Statistician, 58, 213-217.
Tomarken, A. J., & Serlin, R. C. (1986). Comparison of ANOVA alternatives under variance heterogeneity and specific noncentrality structures. Psychological Bulletin, 99, 90-99.
Trenkler, D. (2002). Quantile-boxplots. Communications in Statistics-Simulation and Computation, 31, 1-12.
Tryon, W. W. (2001). Evaluating statistical difference, equivalence, and indeterminacy using inferential confidence intervals: An integrated alternative method of conducting null hypothesis statistical tests. Psychological Methods, 6, 371-386.
Vargha, A., & Delaney, H. D. (2000). A critique and improvement of the CL common language effect size statistics of McGraw and Wong. Journal of Educational and Behavioral Statistics, 25, 101-132.
Vaske, J. J., Gliner, J. A., & Morgan, G. A. (2002). Communicating judgments about practical significance: Effect size, confidence intervals and odds ratios. Human Dimensions of Wildlife, 7, 287-300.
Vaughan, G. M., & Corballis, M. C. (1969). Beyond tests of significance: Estimating strength of effects in selected ANOVA designs. Psychological Bulletin, 72, 204-213.
Venables, W. (1975). Calculation of confidence intervals for noncentrality parameters. Journal of the Royal Statistical Society (Series B), 37, 406-412.
Walker, E. L. (1947). Factors in vernier acuity and distance discrimination. Unpublished doctoral dissertation, Stanford University, Stanford, CA.
Wampold, B. E., & Serlin, R. C. (2000). The consequences of ignoring a nested factor on measures of effect size in analysis of variance. Psychological Methods, 5, 425-433.
Watson, J. S. (1985). Volunteer and risk-taking groups are more homogeneous on measures of sensation seeking than control groups. Perceptual and Motor Skills, 61, 471-475.
Weisz, J. R., Weiss, B., Han, S. S., Granger, D. A., & Morton, T. (1995). Effects of psychotherapy with children and adolescents revisited: A meta-analysis of treatment outcome studies. Psychological Bulletin, 117, 450-468.
Welch, B. L. (1938). The significance of the difference between two means when the population variances are unequal. Biometrika, 29, 350-362.
Werner, M., Stabenau, J. B., & Pollin, W. (1970). TAT method for the differentiation of families of schizophrenics, delinquents, and normals. Journal of Abnormal Psychology, 75, 139-145.
Weston, T., & Hopkins, K. D. (1998). Testing for homogeneity of variance: An evaluation of current practice. Unpublished manuscript, University of Colorado at Boulder.
Wickens, T. D. (1989). Multiway contingency tables analysis for the social sciences. Mahwah, NJ: Lawrence Erlbaum Associates.
Wiener, R. L., & Gutek, B. (1997). Sexual harassment [Special issue]. Psychology, Public Policy, and Law, 5(3).
Wilcox, R. R. (1987). New designs in analysis of variance. Annual Review of Psychology, 38, 29-60.
Wilcox, R. R. (1995). Comparing two independent groups via multiple quantiles. The Statistician, 44, 91-99.
Wilcox, R. R. (1996). Statistics for the social sciences. San Diego, CA: Academic Press.
Wilcox, R. R. (1997). Introduction to robust estimation and hypothesis testing. San Diego, CA: Academic Press.
Wilcox, R. R. (2001). Fundamentals of modern statistical methods: Substantially improving power and accuracy. New York: Springer-Verlag.
Wilcox, R. R. (2002). Multiple comparisons among dependent groups based on a modified one-step M-estimator. Biometrical Journal, 44, 466-477.
Wilcox, R. R. (2003). Applying contemporary statistical techniques. San Diego, CA: Academic Press.
Wilcox, R. R., Charlin, V. L., & Thompson, K. L. (1986). New Monte Carlo results on the robustness of the ANOVA F, W, and F* statistics. Communications in Statistics-Simulation and Computation, 15, 933-943.
Wilcox, R. R., & Keselman, H. J. (2002a). Power analysis when comparing trimmed means. Journal of Modern Applied Statistical Methods, 1, 24-31.
Wilcox, R. R., & Keselman, H. J. (2002b). Within groups multiple comparisons based on robust measures of location. Journal of Modern Applied Statistical Methods, 1, 281-287.
Wilcox, R. R., & Keselman, H. J. (2003a). Modern robust data analysis: Measures of central tendency. Psychological Methods, 8, 254-274.
Wilcox, R. R., & Keselman, H. J. (2003b). Repeated measures one-way ANOVA based on a modified one-step M-estimator. British Journal of Mathematical and Statistical Psychology, 56, 1-13.
Wilcox, R. R., & Muska, J. (1999). Measuring effect size: A non-parametric analog of ω². British Journal of Mathematical and Statistical Psychology, 52, 93-110.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics, 1, 80-83.
Wilde, M. C., Boake, C., & Sherer, M. (1995). Do recognition-free recall discrepancies detect retrieval deficits in closed-head injury? An exploratory analysis with the California Verbal Learning Test. Journal of Clinical and Experimental Neuropsychology, 17, 849-855.
Wilk, M. B., & Gnanadesikan, R. (1968). Probability plotting methods for the analysis of data. Biometrika, 55, 1-17.
Wilkinson, L., & APA Task Force on Statistical Inference. (1999). Statistical methods in psychology journals: Guidelines and explanations. American Psychologist, 54, 594-604.
Wilson, D. B., & Lipsey, M. W. (2001). The role of method in treatment effectiveness: Evidence from meta-analysis. Psychological Methods, 6, 413-429.
Wright, S. T. (1946). Spacing of practice in verbal learning and the maturation hypothesis. Unpublished master's thesis, Stanford University, Stanford, CA.
Yuen, K. K. (1974). The two sample trimmed t for unequal population variances. Biometrika, 61, 165-170.
Zimmerman, D. M. (1996). Some properties of preliminary tests of equality of variances in the two-sample location problem. The Journal of General Psychology, 123, 217-231.
Zimmerman, D. M., & Zumbo, B. D. (1993). Rank transformations and the power of the Student t test and Welch t' test for non-normal populations with unequal variances. Canadian Journal of Experimental Psychology, 47, 523-539.
Author Index
A
Abelson, R. P., 84, 161, 219
Abu Libdeh, O., 122, 219
Adnan, R., 19, 233
Agresti, A., 179, 183, 184, 190, 192, 194, 196, 204, 205, 213, 214, 216, 217, 219
Ahadi, S., 95, 126, 219
Aiken, L. S., 74, 75, 82, 95, 176, 221
Algina, J., 68, 123, 124, 127, 133, 134, 135, 136, 139, 140, 143, 150, 155, 156, 159, 160, 161, 163, 164, 165, 166, 167, 219, 230
Altman, D. G., 31, 219
Anderson, S., 183, 225
Andrews, G., 58, 227
Antonuccio, D. O., 201, 219
Arvey, R. D., 122, 124, 143, 144, 229
Aspin, A. A., 33, 219
Auguinis, H., 82, 219
B
Barnett, V., 12, 219
Barnette, J. J., 54, 125, 126, 219
Baugh, F., 79, 220
Beal, S. L., 183, 220
Beatty, M. J., 91, 92, 220
Becker, B. J., 9, 221
Bedrick, E. J., 184, 220
Begg, C. B., 59, 220
Belsley, D. A., 75, 220
Bergin, A. E., 11, 228
Bergmann, R., 102, 105, 220
Bernhardson, C. S., 130, 220
Best, D. J., 99, 206, 231
Bevan, M. F., 204, 220
Bickel, P. J., 41, 220
Bird, K. D., 31, 133, 160, 161, 164, 220
Blair, R. C., 11, 232
Boake, C., 13, 236
Bond, C. F., 23, 130, 161, 220
Bonett, D. G., 34, 40, 131, 220
Borenstein, M., 24, 220
Bradeley, M. T., 59, 220
Brant, R., 12, 18, 220
Breaugh, J. A., 86, 93, 94, 185, 220
Brown, R. A., 13, 220
Brunner, E., 99, 101, 115, 134, 136, 162, 164, 206, 210, 211, 217, 220, 221
Bryant, T. N., 31, 219
Bryk, A. S., 11, 14, 221, 231
Bunner, J., 38, 221
Burgess, E. S., 13, 220
C
Callendar, J. C., 82, 221
Camp, C. C., 122, 124, 143, 144, 229
Campbell, J. M., 7, 221
Carling, K., 19, 221
Carroll, J. B., 74, 176, 221
Carroll, R. M., 122, 127, 142, 221
Castellan, N. J., 194, 233
Chacon-Moscoso, S., 173, 198, 232
Chalmers, T. C., 105, 198, 229
Chan, I. S. F., 179, 221
Chan, W., 82, 84, 221
Chan, W.-L., 82, 84, 221
Charlin, V. L., 14, 235
Chen, P. Y., 82, 221
Chernick, M. R., 43, 221
Chiang, J., 183, 226
Chuang-Stein, C., 179, 221
Cleveland, W. S., 113, 114, 221
Cliff, N., 99, 107, 108, 151, 205, 211, 212, 213, 217, 221
Cochran, W. G., 11, 30, 35, 38, 204, 233
Cohen, J., 7, 24, 28, 55, 74, 75, 82, 85, 86, 93, 95, 108, 109, 111, 119, 120, 121, 143, 160, 176, 183, 203, 211, 216, 221
Cohen, P., 74, 75, 82, 95, 176, 221
Cohn, L. D., 9, 221
Colditz, G. A., 105, 222
Cook, R. D., 75, 222
Cooper, H. M., 9, 86, 222
Corballis, M. C., 123, 124, 127, 160, 166, 234
Cornfield, J., 192, 222
Cortina, J. M., 134, 136, 139, 140, 149, 166, 222
Cribbie, R. A., 10, 12, 40, 130, 131, 132, 133, 222, 227, 228
Crits-Christoph, P., 141, 222
Crow, E. L., 90, 222
Cumming, G., 23, 24, 31, 62, 65, 68, 222, 223
D
D'Agostino, R. B., 28, 61, 204, 222, 229
Daly, F., 29, 43, 44, 225
Danton, W. G., 201, 219
Darlington, M. L., 114, 222
Davidson, J. R. T., 201, 222
Davies, L., 12, 222
Davis, D. E., 40, 225
Davison, A. C., 43, 222
Dayton, C. M., 130, 132, 135, 181, 222
De Carlo, L. T., 14, 222
Delaney, H. D., 100, 101, 104, 105, 106, 108, 115, 119, 120, 121, 130, 131, 134, 135, 136, 139, 140, 143, 160, 161, 162, 164, 165, 166, 167, 204, 206, 207, 210, 213, 217, 222, 229, 234
Denton, J. Q., 204, 220
DeShon, R. P., 68, 229
Diaconis, P., 43, 223
Diener, E., 95, 126, 219
Dinnel, D. L., 110, 227, 229
Dixon, W. J., 28, 223
Dodd, D. H., 135, 160, 165, 166, 223
Doksum, K. A., 112, 113, 223
Donahue, B., 12, 227
Dosser, D. A., 124, 127, 230
Dunlap, W. P., 106, 223
Dwyer, J. H., 160, 223
E
Edwards, A., 6, 185, 224
Efron, B., 43, 223
Emerson, J. D., 11, 206, 208, 210, 217, 223, 229
Emrich, C., 86, 229
Evans, D. M., 13, 220
Ezekiel, M., 121, 223
F
Fahoome, G., 102, 162, 208, 209, 223, 232
Fan, X., 5, 18, 223, 226
Fay, B. R., 99, 102, 206, 210, 211, 223
Feingold, A., 13, 112, 223
Feinstein, A. R., 32, 223
Fern, E. F., 127, 223
Feske, U., 13, 223
Fidler, F., 23, 24, 29, 31, 62, 122, 123, 160, 223
Finch, S., 23, 24, 31, 62, 65, 68, 222, 223
Findley, M., 86, 222
Fisher, R. A., 4, 224
Fleiss, J. L., 11, 61, 172, 173, 174, 175, 176, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 190, 191, 192, 195, 197, 216, 217, 224
Fligner, M. A., 101, 224
Folks, L., 28, 226
Fouladi, R. T., 65, 133, 160, 234
Fowler, R. L., 143, 224
Fox, J., 75, 224
Fradette, K., 11, 36, 39, 42, 227, 230
Frick, R. W., 28, 32, 224
Friedman, L., 112, 225
Frigge, M., 19, 224
G
Gallop, R., 141, 222
Gardner, M. J., 31, 219
Gart, J. J., 184, 192, 224
Gasko, M., 36, 231
Gather, U., 12, 222
Gibbons, J. D., 99, 106, 206, 217, 224, 231
Gigerenzer, G., 6, 185, 224
Gillett, R., 140, 143, 166, 167, 224
Glass, G. V., 8, 9, 50, 51, 74, 86, 140, 224, 233
Gleser, L. J., 43, 224
Gliner, J. A., 28, 195, 200, 224, 234
Gnanadesikan, R., 113, 236
Goldberg, K. M., 59, 224
Goldstein, A. J., 13, 223
Gollob, H. E., 28, 231
Goodman, L. A., 196, 224
Gorecki, J. A., 63, 225
Grady-Fletcher, A., 200, 201, 233
Granger, D. A., 13, 235
Greenberg, R. H., 149, 230
Grice, G. R., 167, 225
Grissom, R. J., 10, 13, 14, 31, 51, 52, 85, 86, 98, 105, 109, 110, 117, 133, 202, 225, 229
Gross, J. S., 13, 225
Grove, W. M., 57, 226
Gutek, B., 94, 235
H
Haase, R. F., 85, 225
Hackett, D., 201, 222
Haddock, C. K., 176, 190, 197, 225
Hampel, F. R., 41, 225
Han, S. S., 13, 235
Hand, D. J., 29, 43, 44, 115, 225
Harlow, L. L., 6, 24, 225
Harmon, R. J., 195, 200, 224
Harrell, F. E., 40, 225
Hauck, W. W., 183, 225
Haynes, R. B., 188, 232
Hays, W. L., 121, 122, 174, 194, 225
Hedges, L. V., 5, 9, 53, 54, 55, 58, 60, 62, 72, 112, 136, 203, 222, 225, 226
Hekmat, H., 13, 226
Henson, R. K., 5, 231
Herranz Tejedor, I., 179, 183, 229
Hess, B., 109, 226
Hildebrand, D. K., 217, 226
Hinkley, D. V., 43, 222
Hirada, N., 110, 227
Hiruma, N., 110, 227
Hoaglin, D. C., 18, 19, 41, 224, 226
Hochberg, Y., 213, 233
Hogarty, K. Y., 7, 55, 56, 226
Hopkins, K. D., 14, 74, 224, 235
Hosseini, H., 206, 208, 217, 229
Hou, C.-D., 183, 226
Howell, D. C., 13, 226
Howland, E. W., 123, 143, 165, 234
Hsu, J. C., 131, 226
Hsu, L. M., 88, 90, 226
Huberty, C. J., 6, 10, 12, 109, 226, 227
Hullett, C. R., 142, 143, 228
Hunter, J. E., 5, 9, 10, 24, 43, 53, 54, 70, 72, 76, 79, 80, 81, 82, 90, 94, 95, 104, 127, 136, 165, 226, 232
Huynh, C. L., 55, 56, 226
Hyndman, R. J., 18, 226
I
Iglewicz, B., 19, 59, 224
Ilies, R., 81, 232
Impara, J. C., 78, 231
Izenman, A. J., 114, 226
J
Jacoby, W. G., 12, 226
Jones, L. V., 6, 226
Joormann, J., 141, 233
K
Kempthorne, O., 28, 226
Kendall, P. C., 57, 226
Keppel, G., 14, 124, 127, 136, 143, 227
Keren, G., 143, 227
Keselman, H. J., 10, 11, 12, 36, 39, 40, 41, 42, 43, 68, 117, 121, 130, 131, 132, 133, 134, 135, 219, 222, 227, 228, 230, 235
Keselman, J. C., 12, 227
Kim, C., 179, 234
Kim, J. J., 98, 105, 133, 225
Kirk, R. E., 24, 227
Kleinknecht, E. E., 110, 227
Kleinknecht, R. A., 110, 227
Kline, R. B., 166, 227
Knapp, T. R., 5, 28, 32, 61, 227
Komrey, J. D., 7, 55, 56, 226
Kowalchuk, R. K., 12, 227
Kraemer, H. C., 7, 54, 56, 58, 227
Krueger, J., 6, 227
Kruskal, W. H., 105, 227
Kuh, E., 75, 220
L
Laing, J. D., 217, 226
Laird, N. M., 58, 106, 198, 227
Lambert, M. J., 11, 228
Lane, D. M., 86, 229
Laupacis, A., 188, 228
Lawson, S., 121, 126, 127, 233
Lax, D. A., 59, 228
Le, H., 81, 232
Leeman, J., 23, 24, 223
Lehmann, E. L., 41, 106, 220, 228
Levin, B., 61, 172, 173, 174, 175, 176, 178, 179, 180, 181, 182, 183, 186, 187, 188, 190, 191, 192, 195, 197, 216, 217, 224
Levin, J. R., 5, 7, 12, 28, 81, 125, 127, 141, 167, 201, 227, 228, 230, 231, 233
Levine, T. R., 142, 143, 228
Levy, P., 71, 91, 228
Lewis, C., 143, 227
Lewis, T., 12, 219
Liebetrau, A. M., 174, 194, 217, 228
Lipsey, M. W., 7, 9, 85, 86, 108, 109, 127, 167, 228, 236
Liu, Q., 217, 228
Lix, L. M., 12, 42, 227, 228
Lowman, R. L., 12, 227
Ludbrook, J., 102, 105, 220
Lunn, A. D., 29, 43, 44, 225
Lunneborg, C. E., 32, 41, 43, 228
Lunney, G. H., 204, 228
M
Machin, D., 31, 219
Mancini, G. B. J., 187, 228, 232
Mann, H. B., 100, 103, 105, 228
Mansmann, U., 179, 231
Marin-Martinez, F., 173, 198, 232
Markman, B. S., 33, 232
Markus, K. A., 6, 228
Marss-Garcia, A., 57, 226
Martell, R. F., 86, 229
Martin Andres, A., 179, 183, 229
Massey, F. J., 28, 223
Matsumoto, D., 110, 229
Maxwell, S. E., 119, 120, 121, 122, 124, 130, 131, 132, 134, 135, 136, 139, 140, 143, 144, 160, 161, 162, 164, 165, 166, 167, 204, 229
McClanahan, T. M., 201, 219
McConway, K. J., 29, 43, 44, 225
McGaw, B., 9, 50, 86, 140, 224
McGraw, K. O., 90, 98, 105, 106, 229
McKean, J. W., 40, 229
McLean, J. E., 54, 125, 126, 219
McNemar, Q., 75, 76, 117, 136, 229
Mee, R. W., 101, 229
Meeks, S. L., 28, 61, 229
Meyers, J. L., 204, 220
Micceri, T., 10, 229
Miller, D. T., 87, 231
Miller, I. W., 13, 220
Miller, J. N., 105, 222
Miller, L. C., 13, 230
Mohr, D. C., 11, 229
Monroe, K. B., 127, 223
Morgan, G. A., 28, 195, 200, 224, 234
Morris, S. B., 68, 229
Morton, T., 13, 235
Moses, L. E., 34, 204, 206, 208, 210, 217, 223, 229
Mosteller, F., 18, 41, 58, 105, 106, 198, 222, 226, 227, 229
Mueller, C. G., 11, 230
Mueller, T. I., 13, 220
Mulaik, S. A., 6, 24, 225
Munzel, U., 99, 101, 206, 210, 211, 220
Murphy, B. P., 100, 230
Murphy, K. R., 7, 230
Murray, L. W., 124, 127, 230
Muska, J., 57, 236
Myors, B., 7, 230
N
Nam, J., 184, 224
Nanna, M. J., 204, 230
Nath, S. R., 57, 226
Newcombe, R. G., 183, 230
Nickerson, R. S., 6, 230
Nordholm, L. A., 122, 127, 142, 221
Norusis, M. J., 11, 230
Nouri, H., 134, 136, 139, 140, 149, 166, 222, 230
Nowell, A., 112, 225
O
O'Brien, P. C., 112, 230
O'Grady, K. E., 95, 126, 127, 230
Olejnik, S., 12, 109, 123, 124, 127, 133, 135, 136, 139, 140, 143, 150, 155, 156, 159, 160, 161, 163, 164, 165, 166, 167, 226, 227, 230
Olkin, I., 5, 9, 53, 54, 55, 58, 60, 62, 72, 136, 203, 225, 226
Onwuegbuzie, A. J., 5, 28, 81, 125, 127, 167, 201, 230
Osburn, H. G., 82, 221
Ostrowski, E., 29, 43, 44, 225
Othman, A. R., 11, 36, 39, 42, 227, 230
Owens, M. A., 19, 233
Ozer, D. J., 91, 92, 230
P
Padmanabhan, A. R., 11, 230
Paik, M. C., 61, 172, 173, 174, 175, 176, 178, 179, 180, 181, 182, 183, 186, 187, 188, 190, 191, 192, 195, 197, 216, 217, 224
Parker, S., 32, 230
Pedersen, W. C., 13, 230
Penfield, R. D., 151, 160, 165, 204, 230
Perry, K. T., 11, 230
Petoskey, M. D., 12, 227
Pitrosky, B., 201, 222
Plake, B. S., 78, 231
Policello II, G. E., 101, 224
Pollin, W., 53, 235
Popovich, P. M., 82, 221
Posch, M., 210, 231
Pratt, J. W., 99, 100, 106, 115, 206, 231
Preece, P. F. W., 90, 231
Prentice, D. A., 87, 161, 219, 231
Price, R. M., 34, 40, 131, 220
Puri, M. L., 101, 115, 134, 136, 162, 164, 217, 221
Putcha-Bhagavatula, A. D., 13, 230
R
Ramsey, P. H., 32, 231
Randles, R. H., 99, 206, 217, 231
Rasmussen, J., 201, 222
Raudenbush, S. W., 11, 14, 221, 231
Rayner, J. C. W., 99, 206, 231
Reed III, J. F., 39, 231
Reichardt, C. S., 28, 231
Reiser, B., 213, 233
Rice, M. E., 11, 231
Richard, F. D., 23, 130, 161, 220
Richardson, J. T. E., 127, 231
Richardson, W. S., 188, 232
Rindskopf, D., 176, 190, 197, 225
Roberts, J. K., 5, 231
Roberts, R. S., 188, 228
Robinson, D. H., 5, 7, 228, 231
Rohmel, J., 179, 231 R6hmel, J., 1 79 , 2 31 Ronchetti, E E.. M M., 41, 225 ., 4 1 , 225 Ronchetti, D. L L., 143, 231 Ronis, D. ., 1 43, 2 31 W,, 1188, Rosenberg, W. 8 8 , 232 Rosenberger, JJ.. L L., 231 . , 36, 2 31 H., 217, Rosenthal, H ., 2 1 7, 226 5,, 9 9,, 6 66, 67, 73, Rosenthal, R., R., 5 6, 6 7, 7 3 , 86, 87, 93, 8 7, 89, 90, 91, 9 3 , 109, 1 09 , 134, 136, 166, 178, 1 34, 1 36, 1 66 , 1 7 8, 1 183, 83, 185, 190, 193, 232 1 85 , 1 90 , 1 93 , 2231, 3 1 , 232 R.. L., L., 55,, 666, 67, 73, 86, 89, 6, 6 7, 7 3, 8 6, 8 9, Rosnow, R 109, 136, 1 09 , 1134, 34, 1 3 6 , 1166, 6 6, 1 178, 78 , 193, 1 9 3 , 232 P. J., J., 441, 225 Rousseeuw, P. 1 , 225 D.. B., 66, 67, 73, 87, B., 5, 6 6, 6 7, 7 3 , 86, 8 7, Rubin, D 89, 93, 1 09 , 1 34, 1 3 6 , 1166, 66 , 109, 134, 136, 178, 193, 1 78, 1 9 3 , 232 Rudas, T. 90, 1196, 96 , 2 32 T.,, 1190, 232
s
5
L., 1188, 228,232 232 Sackett, D. D. L., 8 8 , 228, Sanchez-Meca, J., 173, 198, 232 J., 1 73 , 1 98, 2 32 Santner, T. T. JJ.,. , 1184, 232 Santner, 84, 232 F. EE.,. , 333, 232 Satterthwaite, F. 3 , 232 S.. S., 55,, 110, 11, 15, 28, Sawilowsky, S 0, 1 1, 1 5, 2 8, 33, 36, 38, 32, 3 3, 3 6, 3 8 , 42, 60, 131, 131, 162, 204, 22 221, 230, 1 62 , 204, 1 , 2227, 2 7 , 230, 232 F. LL.,. , 55,, 99,, 110, 24, 443, 53, 0, 24, 3, 5 3, Schmidt, E 54, 7 0 , 72, 76, 7 9 , 80, 8 1, 70, 79, 81, 95, 82, 90, 94, 9 5 , 1104, 04, 1127, 27, 136, 165, 1 36, 1 65 , 226, 226, 232 N.,. , 66,, 2232 Schmidt, N 32 R. M M.,. , 440, 229 Schrader, R. 0, 2 29 R.. F., 135, 160, 165, Schultz, R F. , 1 35 , 1 60 , 1 65 , 1166, 66 , 223 2 23 Schulzer, M 8 7, 228, 32 M.,. , 1187, 228,2232 Schumacker, R R.. E., E., 90, 2234 34 Schwertman, N. N. c C,. , 119, 233 Schwertman, 9, 2 33 Serfling, R. JJ.,. , 118, 233 S erfling, R. 8, 2 33 R. c., C., 112, 15, 28, 61, Serlin, R. 2, 1 5, 2 8, 6 1 , 141, 141, 233, 234,235 235 2 3 3 , 234, W. R., R., 1176, 190, 197, 225 Shadish, W. 76, 1 90, 1 9 7, 225 Shaffer, J. J. P.P,, 443, 131, 233 Shaffer, 3, 1 31, 2 33 Sheather, S. J.,. , 112, 233 S. J 2, 40, 41, 41, 2 33 R. c C.,. , 557, 226 Sheldrick, R. 7, 226 13, 236 Sherer, M., 1 3, 2 36 Siegal, S., 194, 233 S., 1 94, 2 33 M.,. , 1141, 233 Siemer, M 41, 2 33
Silverman, B. W., 114, 233
Simonoff, J. S., 213, 233
Skinner, B. F., 13, 233
Skipka, G., 179, 181, 233
Smith, D., 59, 220
Smith, M. L., 8, 9, 50, 51, 86, 140, 224, 233
Smithson, M., 31, 32, 43, 62, 63, 65, 72, 92, 122, 133, 160, 183, 184, 194, 203, 233
Snedecor, G. W., 11, 30, 35, 38, 204, 233
Snell, M. K., 184, 232
Snyder, D. K., 200, 201, 233
Snyder, P., 121, 126, 127, 233
Solomon, G. S., 85, 225
Somers, R. H., 213, 233
Sparks, J. N., 99, 206, 233
Spies, R. A., 78, 231
Spooren, W. P. J. M., 102, 105, 220
Sprent, P., 43, 233
Stabenau, J. B., 53, 235
Stahel, W. A., 41, 225
Staudte, R. G., 12, 40, 41, 233
Steiger, J. H., 6, 24, 61, 65, 120, 122, 132, 133, 160, 181, 225, 234
Stoica, G., 59, 220
Stoline, M. R., 11, 230
Storer, B. E., 179, 234
Stoto, M. A., 11, 223
Strahan, R. F., 89, 234
Strauss, S. E., 188, 232
Susskind, E. C., 123, 143, 165, 234
T
Tai, J. J., 183, 226
Thiemann, S., 7, 227
Thomas, D. G., 192, 224
Thomason, N., 23, 24, 223
Thompson, B., 29, 31, 43, 62, 65, 122, 123, 160, 223, 234
Thompson, K. L., 14, 235
Thompson, K. N., 90, 234
Tibshirani, R. J., 43, 223
Timm, N. H., 134, 162, 234
Tomarken, A. J., 12, 234
Trenkler, D., 19, 114, 234
Tryon, W. W., 31, 84, 234
Tu, X., 141, 222
Tukey, J. W., 6, 18, 41, 226
V
Vargha, A., 100, 101, 104, 105, 106, 108, 115, 134, 136, 204, 206, 207, 210, 213, 217, 222, 234
Vaske, J. J., 28, 234
Vaughan, G. M., 123, 124, 127, 160, 166, 234
Venables, W., 122, 234
W
Waechter, D. M., 85, 225
Walker, E. L., 136, 235
Wampold, B. E., 141, 233, 235
Watson, J. S., 13, 235
Weisberg, S., 75, 222
Weiss, B., 13, 235
Weisz, J. R., 13, 235
Welch, B. L., 33, 130, 235
Welsch, R. E., 75, 220
Werner, M., 53, 235
West, S. G., 74, 75, 82, 95, 176, 221
Weston, T., 14, 235
Whitehead, R., 82, 219
Whitney, D. R., 100, 103, 105, 228
Wickens, T. D., 196, 217, 235
Wiener, R. L., 94, 235
Wiitala, W. L., 23, 130, 161, 220
Wilcox, R. R., 10, 11, 12, 14, 17, 18, 19, 21, 31, 32, 34, 36, 38, 39, 40, 41, 42, 43, 45, 46, 57, 58, 59, 68, 72, 73, 75, 82, 94, 100, 101, 105, 108, 113, 115, 117, 130, 131, 133, 134, 135, 161, 164, 166, 178, 179, 183, 203, 207, 210, 212, 227, 230, 235, 236
Wilcoxon, F., 100, 103, 105, 236
Wilde, M. C., 13, 236
Wilk, M. B., 113, 236
Wilkinson, L., 8, 24, 62, 78, 236
Wills, R. M., 200, 201, 233
Wilson, D. B., 9, 85, 86, 108, 109, 127, 167, 228, 236
Wong, S. P., 98, 105, 106, 229
Wright, S. T., 117, 118, 236
Y
Yang, Y., 13, 230
Yoon, J. S., 5, 60, 232
Yuen, K. K., 36, 131, 236
Z
Zimmerman, D. M., 15, 100, 236
Zumbo, B. D., 100, 236
Subject Index
A
American Psychological Association (APA), 5
  Publication Manual of the American Psychological Association, 24
  APA Task Force on Statistical Inference, 8, 24, 62, 78, 236
American Psychologist, 6
Assumptions,
  effect sizes, 9-10
  F test, 10
  t test, 10
  violations in real data, 10-14
Attitudinal scales, see also Rating scale data
  problems in comparing effect sizes, 201
B
Biased estimators of effect size, 6
Binomial Effect Size Display (BESD), 87-91
  and dichotomizing, 90
  limitations, 89-91
  and median split, 90
  and phi coefficient, 87-88
  and success percentages, 88-90
  and uniform margins, 89-90
Binomial variables, 171
Biweight midvariance, 59
Biweight standard deviation,
  as standardizer, 59
Bonett-Price method, 40
Bonferroni-Dunn adjustment, 56
  in factorial ANOVA, 161
  within-group multiple comparisons, 135
Bootstrapping, 42-43
Boxplots, 18-19, 114
  quantile-boxplot, 114
British Medical Journal, 188
C
Capitalizing on chance, 56
Case-control research, 185
Categorical variables, 170-171
Causal efficacy ratio, 84
Cause size, 84
Chi-square test, 170, 173-175
  adjusted and unadjusted, 174-175, 190-191
Classificatory factors, 139
Coefficient of determination, 91-95
  curvilinearity and skew, 94-95
  interpretation, 91-95
  multiple, 95
  reasons for disfavor, 92-95
  underestimation of practical significance, 93-94
Coefficient of nondetermination, 93
Cohen's ds, 54
Cohen's f, 120
  unbiased estimator, 122
Cohen's U3, 108-109
Cohort research, 185-186
Common Language Effect Size Statistic, 105-106
  assumptions, 105-106
  compared to U-based estimator, 105
Confidence intervals,
  asymmetric, 32
  bootstrapping vs. noncentral distributions, 65
  for contingency coefficient, 194
  Cramer's Vpop, 194
  difference between dependent means, 43-46
  difference between independent means, 24-31
  difference between independent medians, 40
  difference between two proportions, 182-183
  disappointing width, 34
  dominance measure, 211-213
  and effect sizes, 23
  eta squared, whole or partial, 160
  good-enough, 61
  horseshoe tossing analogy, 29
  interpretations, correct and incorrect, 28-29
  lower and upper limits, 27
  and M estimation, 41
  multiple comparisons, unstandardized, 130-132
  vs. null-counternull intervals, 193
  odds ratio, 191-193, 196
    contingency tables larger than 2 x 2, 196
  one-sided, 43
  phi coefficient, 176-177
  point-biserial rpop for ordinal categorical data, 203
  population g, charts, 63
  population r, 72
  relative risk, 184
  vs. significance testing for effect sizes, 60-62
  simultaneous, in one-way ANOVA, 130
  standardized differences, dependent groups, 68
  standardized differences, independent groups, 59-65
    meta-analysis of, 63-64
    using noncentral distributions, 64-65
    width, 61-62
  and statistical significance, 27-28, 60-62
  Welch's method, 32-35
  Yuen's method, 36-40
Confidence level, 25
Consistent estimator, 106
Consistent test, 106
Contingency coefficients,
  for naturalistic study, 193-194, 196-197
  problems averaging or comparing, 194
Contingency tables, 171-217
  larger than 2 x 2, 193-196
    multiple comparisons of proportions, 195
    multiway, 196
    partitioning, 194-195
  odds ratios, 195-196
  2 x 2, 171-193
Convenience sampling, 32
Correlated binary data, 172
Correlation,
  attenuation by restricted range, 81-84
  bivariate normality, 74-75
  conditional distributions, 73-76
  correction for attenuation from dichotomizing, 90
  curvilinearity, 75, 84
  heteroscedasticity, 73-75
  influential cases, 75-76
  marginal distribution, 75
  nonsense, 87
  outliers, 75-76
  point-biserial, see Point-biserial
Correlation coefficient,
  assumptions, 73-76
  bias and bias reduction, 71-72
  confidence intervals, 72
  correcting for artifacts, 81
  null-counternull intervals, 72-73
  success percentage difference, 88-89
Correlational hypothesis, 1
Counternull effect size, 65-67
Cramer's Vpop,
  confidence intervals for, 194
  and phi, 194
  problems averaging or comparing, 194
Cross-classification, 171
Crossover design,
  and difference between proportions, 181-182
Cross-sectional study, 172
Cultural effect size, 110-111
Cumulative odds ratio, 214-215
D
d, 50-53
  and Δ, 50-53
  interpretation under normality, 50-51
Data sets' "personalities", 6-7
Degrees of freedom, 33, 37
  in Welch's method, 33
  in Yuen's method, 37
Dependent variables, abstract, 48
Dichotomous variables, 171
  vs. dichotomized variables, 70, 90, 173
Dominance statistic and dominance measure, 107-108, 211-213
  statistical significance, 212-213
Dunnett's many-one method, 131
E
Effect sizes, 2
  applied vs. theoretical research, 2, 5-6
  categorical variables, 170-198
  comparing for attitudinal scales across studies problematic, 201
  not comparing centers of two groups, 98-115
  comparing from continuous and categorical data, 197-198
  defined, 4
  different for different dependent variables, 57
  estimation, biased, 6
  factorial ANOVA, 139-167
    classificatory factor targeted,
      extrinsic peripheral factor, 155-156
      peripheral factor classificatory, 156-160
    limitations, 166-167
    manipulated factor targeted, 146-150
      intrinsic peripheral factor, 148-150, 153-155, 161
    methodological influences, 167
  distribution-wide, 111-114
  graphic, 112-114
  longitudinal design, 133-134
  more than one on the same data, 56
  one-way ANOVA, 117-136
  ordinal categorical dependent variables, 200-217
    different for different raters, 201
  and p level, 111
  patient ratings by doctors and drug-favoring methodologies, 201
  practical significance for employers, 86-87, 94, 186
  probability of superiority, 98-115, 205-211
  relationships among different measures, 109-110
  small, medium, and large, 85-87, 109-111
Empty cell, 190
Epsilon squared,
  in one-way ANOVA, 121-123
Equivalence trials, 181
ESCI, see Statistical software
Eta squared,
  ANOVA,
    factorial, 140
    one-way, 121
  and Cohen's f, 121
  comparable for different designs, 143
  estimator positively biased, 121
  and r, 121
Excel, see Statistical software
Exploratory Software for Confidence Intervals, see Statistical software
Extrinsic factor, 139
F
F test,
  assumptions and robustness, 117
  homoscedasticity and normality, 117
Fixed-effects model, 117
Fligner-Policello U' statistic, 100-101
FORTRAN, see Statistical software
G
g, 53-56
Generalized odds ratio, 213-214
Glass' d, 54
Good-enough values, 61
Grouping variable, 1
H
Harrell-Davis method, 40
Hedges' g, 53-56
Hedges-Friedman method, 112
Heterogeneity and homogeneity of variance, 9-10
Heteroscedasticity, 10
Homoscedasticity, 10
  low power of tests for, 31
Hsu's many-best method, 131
Hunter-Schmidt Meta-Analysis Programs Package, see Statistical software
I
Independent groups,
  defined, 26
Interaction, 141, 161-162
Interquartile range, 18
  use in standardizing a difference, 58
Intransitivity, 131-132
Intrinsic factor, 139, 148-150
J
Journal editors,
  effect size reporting, 56
Journal of Consulting and Clinical Psychology, 13
Journal of Educational and Psychological Measurement, 5
L
Large-sample approximation, 102-103
Latent variable, 48
Location of a distribution, 18
LogXact, see Statistical software
Longitudinal design,
  effect size for, 133-134
M
M estimators, 41
MAD (median absolute deviation), 17-18, 20-21
  as standardizer, 58
Mann-Whitney U statistic, see U statistic
Margin of error,
  difference between dependent means, 45
  difference between independent means, 25-27, 32
  for Welch's method, 33-34
  for Yuen's method, 37
McKean-Schrader method, 40
Means, and variances related, 111
Measurement error, 76-81
Median absolute deviation, see MAD
Median, vs. mean, 19
Mental Measurements Yearbook, see Reliability coefficients
Meta-analysis, 8-9
  correcting or not correcting effect sizes for unreliability, 80-81
  data from categorical and continuous variables, 197-198
  and nonsignificant effect-size estimates, 60-61
  point-biserial r when ns are unequal, 76
  problematic for POV in factorial ANOVA, 143
  strength of association, 122-123
Microsoft Excel, see Statistical software
Minitab, see Statistical software
Monotonicity,
  relationship between dependent and latent variables, 106
Monte Carlo simulations, criteria, 131
Multiple coefficient of determination, 95
Multiple comparisons in one-way ANOVA,
  all-pairs power, 132
  any-pair power, 132
  Dayton's method, 131-132
  intransitivity, 131-132
  many-best method, 131
  many-one method, 131
  pattern of means, 131-132
  dependent means, 135
  power, 132
  protected procedure, 132
  REGWQ method, 130-132
  robust methods, 131
  robustness, 129-132
  specific-comparisons power, 132
  trimmed means, 132
  unstandardized, 129-132
  within-groups
    bootstrapping, 135
    robust methods, 135
    trimming, 135
Multiple comparisons of proportions in large contingency tables, 194
Multiple determination of human behavior, 125-126
Multiway contingency tables, 196
N
Naturalistic study, 172, 175-176, 193, 196-197
Nested designs,
  POV overestimation in, 141
Nominal variables, 170
Noncentral t distribution, 64
Noncentrality parameter, 64
Nonhomomerity, 111
Noninferiority trials, 181
Nonnormality,
  interpreting a standardized difference, 57
  and standard deviation, 57
Normal approximation,
  rules of thumb for application to U and Wm, 208-209
Normal curve, 49-51
Normed test and choice of standardizer, 56-57
Null hypothesis, 1
Null-counternull intervals for population r, 72-73
Number Needed to Harm, 187
Number Needed to Treat (NNT), 186-188
  confidence intervals, 187
  significance testing, 187
O
Odds ratios, 188-193, 195-197
  confidence intervals, 191-193, 196
  cumulative, 214-216
  generalized, 213-214
  larger contingency tables, 195-196
  limitations, 190
  null-counternull intervals, 192-193
  statistical significance, 190-191
Off factor, 139
Omega squared,
  comparable for different designs, 143
  factorial ANOVA, 140-141
    difficulty in comparing values, 142-143
    interaction, 141, 165
    partial, 141-142, 152
    ratios of two values, 143-144
  one-way ANOVA, 121-123
    and standardized difference, 125
  ratios vs. standardized difference ratios, 143-144
Ordered polytomy, 200
Ordinal categorical dependent variables, 200-217
  cell counts small, 215-216
  chi-square test insensitive to order, 201-202
  heteroscedasticity, 205
  interobserver reliability, 201
  Kolmogorov-Smirnov test not optimal, 202
  number of categories, 200-201
  parametric methods, 203-205
  reliability, 200-201
  restricted range, 205
  robust methods, 210
  sensitivity analysis for spacing of "scores", 205
  skew, 205
  sliced from continuous variables, 201
  spacing of conversions to "scores", 204-205
  statistical significance, 201-202
  tests of significance on multiple estimates of effect size, 216-217
  ties, 200, 207-208
Ordinal dominance curve, 114
  relation to the probability of superiority, 113-114
Ordinal hypotheses, 201-202
Outliers, 12
  and standardized-difference effect sizes, 57-59
Overlap, as measure of effect size, 106-107
P
p level, 1-2
Partitioning large contingency tables, 194-195
  multiple comparisons of proportions, 194
Percentile comparison graph, 113-114
  relation to the probability of superiority, 113-114
Peripheral factor, 139, 148-150
Phi coefficient, 87-88, 173-177
  collapsed 2 x c ordinal categorical tables, 216
  comparing from different studies, 176
  margin totals influence, 175-176
  maximum value, 176
  naturalistic studies only, 175-176, 196-197
  null-counternull interval, 176-177
Phimax, 176
Planned comparisons in one-way ANOVA, 131
Point-biserial r, 1
  assumptions, 73-76
  attenuation from restricted range, 81-84
  bias and bias reduction, 71-72
  calculation, 70-71
  conversion to t, 73
  difference between population shapes vs. difference between sample shapes, 75
  meta-analysis when ns are unequal, 76
  ordinal categorical data, 70-76, 202-205, 210-211
    limitations, 203-205
  scatterplots, 74-75
  statistical significance, 71
  unequal ns correction, 76
Point-biserial rpop,
  null-counternull interval, 203
  ordinal categorical data confidence interval, 203
Point estimate,
  difference between two independent means, 30
Power analysis, 7-8
Power in multiple comparisons in one-way ANOVA, 129-135
Practical significance, 3-4, 31, 86-87
Probability coverage, 25
Probability of Superiority (PS), 98-106, 109-111, 114-115
  assumptions, 103-105
  confidence intervals, 100-101
  dependent groups, 114-115
  estimators characteristics, 106
  exact test of H0: PS = .5, 210
  homoscedasticity and normality, 100-101, 104
  ordinal data, 205-211
    robust methods, 210
    ties, 206-207, 209-210
    in 2 x 2 tables, 210
  ordinal dominance curve, relation to, 114
  related measure using ratio of PSs, 103
  robustness, 100, 106
  vs. standardized difference, choosing, 100, 104
Proportion of variance explained (POV), see also Strength of association
  comparable for different designs, 143
  factorial ANOVA, 140-144, 152, 160, 164-165
    confidence intervals, 160
    meta-analysis possibly misleading, 143
    partial, 141-142, 164-165
  one-way ANOVA, 124-127
  sampling variability great, 143
Proportions
  difference between two, 177-183, 196-197
    repeated measures, 181-182
  multiple comparisons of, 180-181
  testing the difference between, conditional or unconditional, 179
Prospective research, 185
PS, see Probability of Superiority
PSY, see Statistical software
Psychology, Public Policy, and Law, 94
Psychometrics, 80
Purposive sampling, 177, 196-197
Q
Quantile, 18
  quantile-boxplot, 114
Quartile, 18
R
R, see Statistical software
Range, 16
  restricted, 81-84
    direct and indirect, 81-82
    reduction of effect-size estimates, 82
    reduction of statistical power, 82
Rating scale data, see also Attitudinal scales
  issues, 150-151, 160, 165
Relative risk, 183-186, 196-197
  confidence intervals, 184
  limitations, 185
  not for retrospective research, 185
Reliability coefficients, 77
  confidence interval, 79-80
  differences causing differences in effect-size estimates, 78
  experimental control, 78-79
  independent variables, 78-79
  Mental Measurements Yearbook, 77-78
  test-retest, 77
  treatment integrity, 78-79
  variation with demographics possible, 78
Research hypothesis, 1
Resistance and nonresistance, 16-19
Resistant measure of location, 41
Retrospective research, 185-186
Risk difference, 178, 183-184
S
S-PLUS, see Statistical software
Sampling,
  convenience, 32
  naturalistic, see Naturalistic study
  prospective, 185
  retrospective, 185-186
SAS, see Statistical software
Scale of a distribution, 18
Scales,
  familiar and unfamiliar, 23, 48
Sensitivity analysis,
  spacing of "scores" converted from ordinal categories, 204-205
Shift-function method, 112-113
Shift model,
  relation to percentile comparison graph, 113-114
  treatment by subject interaction, 104
Significance,
  practical, 31
  testing, controversy, 4-6
    vs. confidence intervals for effect sizes, 60-62
    review of, 1-3
Small cell counts,
  ordinal categorical 2 x c tables, 215-216
Small, medium, and large effect sizes, 109-111
Software, see Statistical software
Somer's D, 213
Sphericity, 135
SPSS, see Statistical software
Standard deviation,
  sampling variability, 54
Standard error of mean difference, 25-26
Standardized difference between means, 48-69
  bias and bias reduction, 54
  control group and normality, 49-53
  dependent groups, 67-68
    standardizer choice, 67-68
  heteroscedasticity, 54-56
  pooling vs. not pooling, 53-56
  standardizer from normative group, 56-57
  tentative recommendations, 55-57
  variances equal or unequal, 53-56
Standardized difference between medians, 57-58
Standardized difference effect sizes,
  confidence intervals, 59-65
    one-way ANOVA, 132-133
  nonparametric, 58
  outliers, 57-58
  overall in one-way ANOVA, 118-120
  specific comparisons in one-way ANOVA, 127-128, 132-134
  within-group ANOVA, 134-135
    confidence intervals, 134-135
Standardized differences,
  confidence intervals in factorial designs, 160-161
  within-group factorial designs, 163-165
    confidence intervals, 164
Standardizers,
  outlier-resistant, 58-59
Stata, see Statistical software
STATISTICA, see Statistical software
Statistical software,
  ESCI (Exploratory Software for Confidence Intervals), 62, 63, 64
  FORTRAN, 39, 102, 210
  Hunter-Schmidt Meta-Analysis Programs Package, 81
  LogXact, 197
  Microsoft Excel, 62, 181
  Minitab, 39, 40, 41, 45, 59, 100, 113, 114, 183
  PSY, 133, 160, 164
  R, 117, 122
  S-PLUS, 12, 14, 17, 18, 39, 40, 41, 45, 46, 59, 72, 75, 100, 101, 108, 113, 115, 117, 122, 131, 134, 135, 164, 166, 183, 210, 212
  SAS, 65, 68, 114, 117, 122
  SAS Version 9, 192, 210, 216
  SAS/IML, 133, 134
  SPSS, 32, 62, 65, 114, 122, 160, 164, 176, 194
  SPSS Exact, 194, 210, 213, 216
  Stata, 114
  STATISTICA, 65, 122, 160, 234
  StatXact, 181, 183, 192, 194, 197, 210, 213, 216
  SYSTAT, 114
Statistical significance,
  and sample size, 1-2
Statistically signifying, 3-4
StatXact, see Statistical software
Stochastic superiority, 98, 103, see also Probability of Superiority
Strength of association, see also Proportion of variance explained (POV)
  confidence intervals, 122
  criticisms evaluated, 124-127
  estimates below 0, 122-123
  factorial ANOVA, 140-141, 160
    confidence intervals, 160
  levels of the independent variable, 126
  meta-analysis, 122-123
  overall in one-way ANOVA, 120-123
  sampling variability of estimates high, 122
  specific comparisons, 123-124
  statistical significance, 122
  unreliability, effect of, 126-127
Strength of manipulation, 82
Structural zero, 190
Success percentages, 88-90, 108
Success rate ratio, 184
SYSTAT, see Statistical software
T
t,
  statistic, 1
  converted from point-biserial r, 73
  table, interpolation, 30, 33
Targeted factor, 139
Tied values, 99, 107-108
Treatment,
  effect on variabilities and centers, 13-21
  effects throughout a distribution, 111-114
  exploring data for effect on variability, 14-21
Treatment by subject interaction,
  and shift model, 104
Trimmed means, 36-37, 39-40
Trimming,
  researchers possibly leery of, 39
  and skew, 39-40
Truncated range, 81-84
Tukey sum-difference graph, 114
Tukey's HSD method, 130-132
  in factorial ANOVA, 161
U
U, see also Probability of Superiority (PS)
  statistic, 100
  test,
    approximate normal approximation, 102-105
    not robust against heteroscedasticity, 100, 104-105
Ubiquitous effect-size index, 134, 162
Unreliability, see also Reliability
  attenuation of effect-size estimates, 76-81
  correcting for or not correcting for, 79
    in meta-analysis, 80-81
  correction rarely used, 80
  reduction of statistical power, 77
Unstandardized differences, see also Multiple comparisons
  factorial designs,
    confidence intervals, 161
    robust, 161
V
Validity coefficients, 94
Variance, 15-16
  pooling in ANOVA, 118
  Winsorized, 16-17, 20-21
W
Wilcoxon (rank-sum statistic) Wm, 100, 102, 103, 105, see also U statistic
  approximate test, 207-210
  normal approximation minimum sample sizes, 208-209
  ordinal categorical data, 207-210
Within-group one-way ANOVA,
  multivariate approach, 135
  one-step M estimation, 135
  overall strength of association, 135-136
  POV vs. partial POV, 136
  probability of superiority, 136
Within-groups vs. between-groups designs,
  possibly conflicting results, 167
Z
z-like measures, 49-51
z score, 49-51
Zero, structural, 190