Statistics in Psychology: An Historical Perspective

Statistics in Psychology An Historical Perspective Second Edition This page intentionally left blank Statistics in ...

Author: Michael Cowles

103 downloads 2340 Views 15MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Statistics in Psychology An Historical Perspective Second Edition

This page intentionally left blank

Statistics in Psychology An Historical Perspective SecondEdition

Michael Cowles

York University, Toronto

2001

LAWRENCE ERLBAUM ASSOCIATES, PUBLISHERS Mahwah,NewJersey London

Copyright © 2001by Lawrence Erlbaum Associates, Inc. All rights reserved.No part of this bookmay bereproduced in any form, by photostat, microform, retrieval system, or any other means, without the prior written permissionof the publisher. Lawrence Erlbaum Associates, Inc., Publishers 10 IndustrialAvenue Mahwah,NJ 07430

Cover designby Kathryn Houghtaling Lacey

Library of Congress Cataloging-in-Publication Data Cowles, Michael, 1936-Statisticsin psychology: anhistorical perspective / Michael Cowles.-2nd ed. p. cm. Includes bibliographical references (p.) andindex. ISBN 0-8058-3509-1(c: alk. paper)-ISBN 0-8058-3510-5(p: alk. paper) 1. Psychology—Statisticalmethods—History. 2. Social sciences— Statisical methods—History.I. Title. BF39.C67 2000 150'.7'27~dc21

00-035369

The final camera copyfor this workwaspreparedby theauthor,and thereforethe publisher takesno responsibilityfor consistencyor correctnessof typographical style. However, this arrangement helps to make publicationof this kind of scholarshippossible. Books publishedby Lawrence Erlbaum Associates areprintedon acid-freepaper,andtheir bindingsarechosenfor strengthanddurability. Printedin the United Statesof America 1 0 9 8 7 6 5 4 3 21

Contents

Preface Acknowledgments 1

ix xi

The Development of Statistics

1

Evolution, Biometrics,and Eugenics The Definition of Statistics 6 Probability 7 The Normal Distribution 12 Biometrics 14 Statistical Criticism 18 2

Science,Psychology, and Statistics Determinism 21 Probabilisticand Deterministic Models Scienceand Induction 27 Inference 31 Statisticsin Psychology 33

3

1

Measurement In Respectof Measurement SomeFundamentals 39 Error in Measurement 44

21 26

36 36

Vi

CONTENTS

4

The Organization of Data

47

The Early Inventories 47 Political Arithmetic 48 Vital Statistics 50 Graphical Methods 53 5

Probability

56

The Early Beginnings 56 The Beginnings 57 The Meaning of Probability 60 Formal Probability Theory 66 6

Distributions The Binomial Distribution The Poisson Distribution The Normal Distribution

7

68 68 70 72

Practical Inference

77

Inverse Probability and the Foundationsof Inference FisherianInference 81 Bayesorp< 05? . 83 8

77

Sampling and Estimation

85

Randomnessand Random Numbers 85 Combining Observations 88 Samplingin Theory and in Practice 95 The Theory of Estimation 98 The Battle for Randomization 101 9

Sampling Distributions The Chi-SquareDistribution The t Distribution 114 The F Distribution 121 The Central Limit Theorem

105 705 124

CONTENTS

10 Comparisons, Correlations, and Predictions

VJJ

127

Comparing Measurements 727 Galton'sDiscovery of Regression 129 Galton' s Measureof Co-relation 138 The Coefficient of Correlation 141 Correlation - Controversies and Character 146 11 Factor Analysis Factors 154 The Beginnings 156 Rewriting the Beginnings The Practitioners 164

154

162

12 The Design of Experiments

171

The Problemof Control / 71 Methodsof Inquiry 1 73 The Concept of Statistical Control 176 The Linear Model 181 The Design of Experiments 752 13 Assessing Differencesand Having Confidence

186

FisherianStatistics 186 The Analysis of Variance 187 Multiple Comparison Procedures 795 ConfidenceIntervals and SignificanceTests 199 A Note on 'One-Tail' and Two-Tail' Tests 203 14 Treatments and Effects: The Rise of ANOVA The Beginnings 205 The Experimental Texts 205 The Journalsand thePapers 207 The Statistical Texts 272 Expected Means Squares 273

205

Viii

CONTENTS

15 The Statistical Hotpot Times of Change 216 Neyman andPearson 217 Statisticsand Invective 224 Fisher versus Neyman and Pearson Practical Statistics 233

216

228

References

236

Author Index

254

Subject Index

258

Preface

In this secondedition I have made some corrections to errors in algebraic expressions that I missedin the first edition and Ihavebriefly expandedon some sections of the original where I thought such expansion would make the narrative cleareror more useful. The main changeis the inclusion of two new chapters;one onfactor analysisand one on therise of the use ofANOVA in psychologicalresearch.I am still of the opinion that factor analysis deserves its own historical account,but I am persuaded that the audiencefor such a work would be limited werethe early mathematical contortions to be fully explored. I have triedto providea brief non-mathematical background to its arrival on the statistical scene. I realized thatmy accountof ANOVA in the first edition did not dojustice to the story of its adoptionby psychology,and largely due to myre-readingof the work of Sandy Lovie(of the University of Liverpool, England)and Pat Lovie (of Keele University, England), who always writepapersthat 1 wish 1 had written, decidedto try again.I hope thatthe Lovies will not be toodisappointed by my attempt to summarize their sterling contributions to the history of both factor analysisand ANOVA. As before, any errors and misinterpretationsare my responsibility alone.I would welcome correspondence that points to alternative views. I would like to give special thanksto the reviewersof the first edition for their kind commentsand all those who have helpedto bring aboutthe revival of this work. In particular Professor Niels Waller of Vanderbilt University must be acknowledgedfor his insistentand encouraging remarks.I hope thatI have ix

X

PREFACE

deserved them.My colleaguesand manyof my students at York University, Toronto, have been very supportive. Those students, both undergraduate and graduate,who have expressed their appreciation for my inclusion of some historical backgroundin my classeson statistics and method have givenme enormous satisfaction. This relatively short account is mainly for them, and I hopeit will encourage some of them to explore someof these mattersfurther. There seemsto be a slow realization among statistical consumers in psychology that there is moreto theenterprise than null hypothesis significance testing,and other controversies to exerciseus. It isstill my firm belief that just a little more mathematical sophistication and just a little more historical knowledge woulddo agreatdeal for the way wecarry on ourresearch business in psychology. The editors and production peopleat Lawrence Erlbaum, ever sharp and efficient, get onwith the job andbring their expertiseandsensible advice to the project and I very much appreciate their efforts. My wife has sacrificeda great dealof her time and given me considerable help with the final stagesof this revisionand she and my family, even yet,put up with it all. Mere thanksare not sufficient. Michael Cowles

Acknowledgments

I wish to expressmy appreciationto a numberof individuals, institutions,and publishers for granting permissionto reproduce material that appearsin this book: Excerpts from Fisher Box, J. (c) 1978,R. A. Fisher the life of a scientist,from Scheffe, H. (1959) Theanalysisof variance,and from Lovie, A. D & Lovie, P. Charles Spearman, Cyril Burt, and theoriginsof factor analysis. Journalof the History of the Behavioral Sciences,29, 308-321.Reprintedby permissionof John Wiley& Sons Inc.,the copyright holders. Excerpts from a numberof papersby R.A. Fisher. Reprinted by permissionof Professor J.H. Bennett on behalf of the copyright holders,the University of Adelaide. Excerpts, figures andtables by permissionof Hafner Press,a division of Macmillan Publishing Companyfrom Statistical methods for research workers by R.A. Fisher. Copyright(c) 1970 by Universityof Adelaide. Excerpts from volumes of Biometrika. Reprintedby permission of the Biometrika Trusteesand from Oxford University Press. Excerpts from MacKenzie, D.A. (1981). Statisticsin Britain 1865-1930. Reprintedby permissionof the Edinburgh University Press.

XI

Xii

ACKNOWLEDGMENTS

Two plates from Galton, F. (1885a). Regression towards mediocrity in hereditarystature.Journal of theAnthropological Instituteof Great Britain and Ireland, 15, 246-263.Reprintedby permissionof the Royal Anthropological Institute of GreatBritain andIreland. ProfessorW. H. Kruskal, Professor F. Mosteller,and theInternational Statistical Institute for permissionto reprint a quotationfrom Kruskal,W., & Mosteller, F. (1979). Representative sampling, IV: The historyof the conceptin statistics, 1895-1939.International StatisticalReview,47, 169-195. Excerptsfrom Hogben,L. (1957). Statistical theory. London: Allen andUnwin; Russell,B. (1931). Thescientific outlook. London:Allen and Unwin; Russell, B. (1946). Historyof western philosophyand itsconnection withpolitical and social circumstancesfrom theearliest timesto thepresentday. London: Allen and Unwin; von Mises, R. (1957). Probability, statisticsand truth. (Second revised English Edition preparedby Hilda Geiringer) London: Allenand Unwin. Reprintedby permissionof Unwin Hyman,the copyright holders. Excerpts from Clark, R. W. (1971). Einstein,the life and times. New York: World. Reprintedby permissionof Peters, Fraser and Dunlop, Literary Agents. Dr D. C. Yalden-Thomsonfor permissionto reprint a passagefrom Hume,D. (1748). An enquiry concerning human understanding. (In D. C. YaldenThomson (Ed.). (1951). Hume, Theory of Knowledge. Edinburgh: Thomas Nelson). Excerpts from various volumesof the Journal of the American Statistical Association. Reprinted by permissionof the Boardof Directorsof theAmerican Statistical Association. An excerpt reprintedfrom Probability, statistics,and data analysisby O. Kempthorneand L. Folks, (c) 1971 Iowa StateUniversity Press, Ames, Iowa 50010. Excerptsfrom De Moivre, A. (1756).Thedoctrine of chances:or, A methodof calculating the probabilities of eventsin play. (3rd ed.),London:A. Millar. Reprintedfrom the editionpublishedby theChelseaPublishingCo.,New York, (c) 1967with permissionand Kolmogorov,A. N. (1956). Foundationsof the theory of probability. (N. Morrison, Trans.). Reprinted by permissionof the Chelsea Publishing Co.

ACKNOWLEDGMENTS

Xiii

An excerptreprinted with permission of Macmillan Publishing Companyfrom An introduction to the study of experimental medicineby C. Bernard (H. C. Greene,Trans.),(c) 1927 (Original work published in 1865)and from Science and human behaviorby B. F. Skinner (c) 1953 by Macmillan Publishing Company, renewed 1981 by B. F. Skinner. Excerpts from Galton, F. (1908). Memoriesof my life. Reprintedby permission of Methuenand Co. An excerptfrom Chang, W-C. (1976). Sampling theories andsamplingpractice. In D. B. Owen (Ed.), On thehistory of statisticsandprobability (pp.299~315). Reprintedby permissionof Marcel Dekker, Inc.New York. An excerpt from Jacobs,J. (1885). Reviewof Ebbinghaus's Ueberdas Gedachtnis. Mind, 10, 454-459 and from Hacking, I. (1971). Jacques Bernoulli's Art of Conjecturing. British Journalfor the Philosophyof Science, 22,209-229.Reprintedby permissionof Oxford University Pressand Professor Ian Hacking. Excerpts from Lovie, A. D. (1979). The analysisof variancein experimental psychology: 1934-1945. British Journal of Mathematical and Statistical Psychology,32, 151-178and Yule, G. U. (1921). Reviewof W. Brown and G. H. Thomson, The essentialsof mental measurement. British Journal of Psychology,2, 100-107and Thomson,G. H. (1939)The factorial analysisof humanability. I. The present position and theproblems confronting us. British Journal of Psychology,30, 105-108. Reprintedby permissionof the British Psychological Society. Excerptsfrom Edgeworth,F. Y. (1887). Observations and statistics:An essay on the theory of errors of observationand the firstprinciples of statistics. Transactions of the Cambridge Philosophical Society,14, 138—169 and Neyman,J., & Pearson,E. S.(1933b). The testingof statistical hypotheses in relation to probabilitiesa priori. Proceedingsof the Cambridge Philosophical Society, 29, 492-510.Reprintedby permissionof the Cambridge University Press. Excerpts from Cochrane,W. G. (1980). Fisherand theanalysisof variance.In S. E. Fienberg,& D. V. Hinckley (Eds.).,R. A. Fisher: An Appreciation (pp. 17-34)and from Reid,C. (1982). Neyman -fromlife. Reprintedby permission of Springer-Verlag,New York.

XJV

ACKNOWLEDGMENTS

Excerpts and plates from Philosophical Transactionsof the Royal Societyand Proceedingsof the Royal Societyof London. Reprintedby permission of the Royal Society. Excerpts from LaplaceP. S. de(1820).A philosophical essayon probabilities. F. W. Truscott, & F. L. Emory, Trans.). Reprinted by permissionof Dover Publications,New York. Excerpts from Galton, F. (1889). Natural inheritance, Thomson, W. (Lord Kelvin). (1891). Popular lecturesand addresses,and Todhunter,I. (1865). A history of the mathematical theoryof probability from thetime of Pascal to that of Laplace. Reprintedby permissionof Macmillan and Co., London. Excerpts from various volumesof the Journal of the Royal Statistical Society, reprinted by permissionof the Royal Statistical Society. An excerptfrom Boring, E. G. (1957). Whenis humanbehavior predetermined? The Scientific Monthly, 84, 189-196.Reprintedby permissionof the American Association for the Advancementof Science. Data obtainedfrom Rutherford, E., & Geiger, H. (1910). The probability variations in the distribution of a particles. Philosophical Magazine,20, 698-707. Material used by permissionof Taylor and Francis, Publishers, London. Excerptsfrom various volumesof Nature reprintedby permissionof Macmillan Magazines Ltd. An excerpt from Forrest,D. W. (1974). Francis Galton:Thelife and work of a Victorian genius.London: Elek. Reprinted by permissionof Grafton Books, a division of the Collins PublishingGroup. Excerpts from Fisher,R. A. (1935, 1966,8th ed.). The design of experiments. Edinburgh: OliverandBoyd. Reprintedby permissionof the Longman Group, UK, Limited. Excerptsfrom Pearson,E. S. (1966). The Neyman-Pearson story:1926-34. Historical sidelights on an episode in Anglo-Polish collaboration.In F. N. David (Ed.). Festschrift for J. Neyman. London:Wiley. Reprinted by permissionof JohnWiley and Sons Ltd., Chichester.

ACKNOWLEDGMENTS

XV

An excerpt from Beloff, J. (1993) Parapsychology:a concise history. London: Athlone Press.Reprinted with permission. Excerpts from Thomson,G. H. (1946) Thefactorial analysisof human ability. London: Universityof London Press. Reprinted by permissionof The Athlone Press. An excerpt from Spearman,C. (1927) The abilities of man. New York: Macmillan Company. Reprintedby permissionof Simon & Schuster, the copyright holders. An excerpt from Kelley, T. L. (1928) Crossroadsin the mind of man: a study of differentiable mental abilities. Reprinted with the permissionof Stanford University Press. An excerpt from Gould, S. J. (1981) The mismeasureof man. Reprinted with the permissionof W. W. Norton & Company. An excerptfrom Boring, E. G. (1957) Whenis human behavior predetermined? Scientific Monthly, 84,189-196.And from Wilson,E. B. (1929) Reviewof The abilities of man. Science,67, 244-248.Reprintedwith permission from the American Associationfor the Advancementof Science. An excerpt from Sidman, M. (1960/1988) Tacticsof scientific research: Evaluating experimental data inpsychology.New York: Basic Books. Boston: Authors Cooperative (reprinted). Reprinted with the permissionof Dr Sidman. An excerpt from Carroll, J. B. (1953) An analytical solutionfor approximating simple structurein factor analysis. Psychometrika, 18, 23-38. Reprinted with the permissionof Psychometrika. Excerptsfrom Garrett H. E. & Zubin, J. (1943) The analysis of variancein psychological research.Psychological Bulletin, 40, 233-267and from Grant, D. A. (1944) On "The analysisof variance in psychological research." Psychological Bulletin, 41, 158-166. Reprintedwith the permission of the American Psychological Association. An excerpt from Wilson, E. B. (1929) Reviewof Crossroadsin the mind of man. Journal of General Psychology, 2, 153-169. Reprinted with the permissionof the Helen Dwight Reid Educational Foundation. Published by Heldref Publications, 1319 18th St. N.W. Washington, D.C.20036-1802.

This page intentionally left blank

1

The Development of Statistics

EVOLUTION, BIOMETRICS, AND EUGENICS The central concernof the life sciencesis thestudy of variation. To what extent does this individualor group of individualsdiffer from another? Whatare the reasonsfor the variability? Can thevariability be controlled or manipulated? Do the similarities that exist spring from a common root? Whatare theeffects of the variation on the life of the organisms?Theseare questionsaskedby biologists and psychologists alike. The life-science disciplines are definedby the different emphases placed on observed variation,by the natureof the particular variablesof interest,and by the ways in which the different variables contributeto the life and behaviorof the subject matter. Change and diversity in nature reston an organizing principle,the formulationof which hasbeen saidto be thesingle mostinfluential scientific achievementof the 19th century:the theory of evolutionby meansof natural selection.The explicationof the theory is attributed, rightly,to Charles Darwin (1809-1882).His book The Origin of Specieswas published in 1859, but a numberof other scientistshadwritten on theprinciple, in whole or in part, and thesemenwereacknowledgedby Darwin in later editionsof his work. Natural selectionis possible because there is variation in living matter. The struggle for survival within and acrossspeciesthen ruthlessly favorsthe individuals that possessa combination of traits and characters, behavioral and physical,that allows themto copewith the total environment, exist, survive, and reproduce. Not all sourcesof variability are biological. Many organisms to agreateror 1

2

1. THE DEVELOPMENT OF STATISTICS

lesser extent reshape their environment, their experience, and therefore their behavior through learning.In human beings this reshapingof the environment has reachedits most sophisticatedform in what hascome to becalled cultural evolution. A fundamental feature of the human condition,of human nature,is our ability to processa very great deal of information. Human beings have originality and creative powers that continually expand the boundariesof knowledge. And, perhaps most important of all, our language skills, verbal and written, allow for the accumulationof knowledge and itstransmission from generationto generation. The rich diversity of human civilization stemsfrom cultural, aswell as genetic, diversity. Curiosity about diversityand variability leadsto attemptsto classify and to measure.The orderingof diversity and theassessment of variation have spurred the developmentof measurementin the biological and social sciences,and the applicationof statisticsis onestrategyfor handlingthe numericaldataobtained. As science has progressed,it has become increasingly concerned with quantification as a means of describing events.It is felt that precise and economical descriptions of eventsand therelationships among them are best achievedby measurement. Measurement is the link between mathematics and science,and theapparent(at anyrate to mathematicians!) clarityand order of mathematicsfoster the scientist's urgeto measure.The central importanceof measurementwasvigorously expoundedby Francis Galton(1822-1911):"Until the phenomenaof any branchof knowledge have been submitted to measurementand numberit cannot assume the statusand dignity of a Science." Thesewords formed partof the letterhead of the Departmentof Applied Statistics of University College, London,an institution that received much intellectual and financial supportfrom Galton. And it is with Galton,who first formulated the method of correlation, thatthe common statisticalprocedures of modern social science began. The nature of variation and thenature of inheritancein organisms were much-discussedand much-confused topicsin the second halfof the 19th century. Galtonwasconcernedto makethe studyof heredity mathematical and to bring order into the chaos. Francis Galton was Charles Darwin's cousin. Galton's mother was the daughterof Erasmus Darwin (1731-1802) by his secondwife, and Darwin's father wasErasmus'sson by hisfirst. Darwin, who was 13yearsGalton'ssenior, had returned homefrom a 5-year voyageas thenaturaliston board H.M.S. Beagle(an Admiralty expeditionary ship)in 1836 and by 1838 had conceived of the principle of natural selectionto accountfor someof the observationshe hadmadeon theexpedition.The careersandpersonalitiesof GaltonandDarwin were quite different. Darwin painstakingly marshaled evidence and singlemindedly buttressedhis theory, but remaineddiffident about it, apparently

EVOLUTION, BIOMETRICS AND EUGENICS

3

uncertain of its acceptance.In fact, it was only the inevitability of the announcementof the independent discovery of the principle by Alfred Russell Wallace (1823~1913) that forced Darwin to publish, some20 yearsafter he had formed the idea. Gallon,on theother hand, though a staid andformal Victorian, was notwithout vanity, enjoyingthe fame and recognition broughtto him by his many publicationson abewildering varietyof topics. The steady streamof lectures, papersand books continued unabated from 1850 until shortly before his death. The notion of correlated variationwas discussedby the new biologists. Darwin observesin The Origin of Species: Many laws regulate variation, some few of which can bedimly seen,...I will here only alludeto what may becalled correlated variation. Important changes in the embryo or larva will probably entail changes in the mature animal. . . Breeders believe that long limbs arealmost always accompanied by an elongated head .. .catswhich areentirely whiteandhave blue eyesaregenerally deaf... it appears that white sheep and pigs are injured by certain plants whilst dark-coloured individuals escape ... (Darwin, 1859/1958,p. 34)

Of course,at this time,the hereditary mechanism was unknown, and, partly in an attempt to elucidate it, Galton began,in the mid-1870s,to breed sweet peas.1 The resultsof his studyof thesizeof sweetpeaseeds overtwo generations were publishedin 1877. Whena fixed size of parentseedwas compared with the mean sizeof theoffspring seeds, Galton observed thetendency thathecalled then reversionand later regressionto the mean. The meanoffspring size is not as extremeas theparental size. Large parent seeds of a particular size produce seeds that have a mean size thatis larger than average, but not aslarge as the parentsize. The offspring of small parentseedsof a fixedsize havea mean size that is smaller than average but nowthis mean sizeis not assmall asthat of the fixed parent size. This phenomenon is discussed laterin more detail.For the moment,suffice it to say that it is an arithmetical artifact arisingfrom the fact that offspring sizesdo not match parental sizes absolutely uniformly. In other words, the correlation is imperfect. Galton misinterpreted this statistical phenomenon as areal trend towarda reductionin population variability. Paradoxically, however, it led to theformation of the Biometric Schoolof heredity and thus encouraged the development of a great many statistical methods. 1

Mendel had already carriedout his work with ediblepeasand thus begunthe scienceof genetics. The resultsof his work were publishedin a rather obscure journal in 1866 and thewider scientific world remained obliviousof them until 1900.

4

1. THE DEVELOPMENT OF STATISTICS

Over the next several years Galton collected data on inherited human characteristicsby the simple expedientof offering cash prizes for family records. From these data hearrivedat theregression linesfor hereditary stature. Figures showing these lines are shownin Chapter10. A common themein Galton's work, and later that of Karl Pearson (1857-1936),was aparticular social philosophy. Ronald Fisher (1890-1962) also subscribedto it, although, it must be admitted, it wasnot, assuch,a direct influence on hiswork. Thesethreemen are thefoundersof what are nowcalled classical statisticsand allwere eugenists. They believed that the most relevant and important variablesin humanaffairs are inherited. One'sancestorsrather than one'senvironmental experiences are theoverriding determinants of intellectual capacityand personalityas well as physical attributes. Human wellbeing, human personality, indeed human society, could therefore, they argued, be improvedby encouragingthe most ableto have more children than the least able. MacKenzie (1981)andCowan(1972,1977)have argued that much of the early work in statisticsand thecontroversies that arose among biologists and statisticians reflectthe commitmentof the foundersof biometry, Pearson being the leader,to theeugenics movement. In 1884, Galtonfinanced andoperatedan anthropometric laboratory at the International Health Exhibition. For a chargeof threepence, members of the public were measured. Visual and auditory acuity, weight, height, limb span, strength,and anumberof other variables were recorded. Over 9,000 data sets were obtained, and, at thecloseof theexhibition,the equipmentwastransferred to the South Kensington Museum where data collection continued. Francis Galton was anavid measurer. Karl Pearson(1930) relates that Galton's first forays intothe problem of correlation involved ranking techniques, although he wasaware that ranking methods couldbe cumbersome.How could onecomparedifferent measuresof anthropometric variables?In a flash of illumination, Galton realized that characteristics measured on scales basedon their own variability (we would now say standard score units) could be directly compared. This inspiration is certainly one of themost importantin the early yearsof statistics.He recalls the occasionin Memoriesof my Life, publishedin 1908: As these linesare being written,the circumstances under which I first clearly graspedthe important generalisation that the laws of hereditywere solely concerned with deviations expressed in statistical unitsare vividly recalled to my memory. It was in thegroundsof Naworth Castle, where an invitation had been givento ramblefreely. A temporary shower drove me toseekrefuge in a reddish recessin the rock by the side of the pathway. Therethe idea flashed

EVOLUTION, BIOMETRICS AND EUGENICS

5

acrossme and Iforgot everythingelsefor amomentin mygreatdelight.(Galton, 1908, p. 300)2 This incident apparently took place in 1888,and beforethe year wasout, Co-relationsand Their MeasurementChiefly From Anthropometric Datahad been presented to the Royal Society.In this paper Galton defines co-relation: "Two variable organsare said to be co-relatedwhen the variation of one is accompaniedon theaverageby more or less variationof the other, and in the samedirection" (Gallon, 1888,p. 135). The last five words of the quotation indicate thatthe notion of negative correlationhad notthen been conceived, butthis briefbut important paper shows that Galtonfully understoodthe importanceof his statistical approach. Shortly thereafter, mathematicians entered the picture with encouragement from some, but by no meansall, biologists. Much of the basic mathematics of correlation had,in fact, already been developedby the time of Gallon's paper,but theutility of the procedure itself in this contexlhad apparently eluded everyone. It was Karl Pearson, Gallon's disciple andbiographer,who, in 1896,set theconcepton asound mathematical foundation and presented statistics with the solutionto the problem of representing covariationby meansof a numerical index,the coefficient of correlation. From thesebeginnings springthe whole corpusof present-daystatistical techniques. George Udny Yule (1871~1951),aninfluential statisticianwho was not a eugenist,and Pearson himself elaborated the conceptsof multiple and partial correlation.The general psychologyof individual differencesand researchinto the structureof human abilitiesand intelligence relied heavilyon correlationaltechniques. The firsl third of the20th centurysaw theintroduction of factor analysis throughthe work of Charles Spearman (1863-1945),Sir Godfrey Thomson (1881-1955),Sir Cyril Burt (1883-1971),and Louis L. Thurstone(1887-1955). A further prolific and fundamentallyimportant streamof development arises from the work of Sir Ronald Fisher.The techniqueof analysisof variancewas developed directlyfrom themethodof intra-class correlation- anindex of the extentto which measurements in thesame category or family arerelated, relative to other categoriesor families. 2

Karl Pearson (1914 -1930)in thevolume published in 1924, suggested that this spot deserves a commemorative plaque. Unfortunately, it looks asthoughthe inspiration canneverbe somarked, for Kenna (1973), investigating the episode, reports that: "In the groundsof Naworth Castle there arenot anyrocks,reddishor otherwise, which could provide arecess,..."(p. 229),and hesuggests that the location of the incident might have been Corby Castle.

6

1. THE DEVELOPMENT OF STATISTICS

Fisher studied mathematicsat Cambridgebut also pursued interests in biology andgenetics.In 1913he spentthe summer workingon afarm in Canada. He worked for a while with a City investment company andthen found himself declaredunfit for military service because of his extremely poor eyesight.He turned to school-teachingfor which he had notalent and which he hated. In 1919 he had theopportunity of a postat University Collegewith Karl Pearson, then headof the Departmentof Applied Statistics,but choseinsteadto develop a statistical laboratoryat theRothamsted Experimental Station near Harpenden in England, wherehedeveloped experimental methods for agricultural research. Over the next several years, relations between Pearson and Fisher became increasingly strained. They clashed on a variety of issues. Someof their disagreements helped, and some hindered,the developmentof statistics.Had they been collaboratorsand friends, rather than adversaries and enemies, statisticsmight havehad aquite different history. In 1933 Fisher became Galton Professorof Eugenicsat University Collegeand in 1943 movedto Cambridge, wherehe wasProfessorof Genetics. Analysisof variance, whichhas hadsuch far-reachingeffects on experimentationin the behavioral sciences, was developed through attempts to tackle problems posed at Rothamsted. It may befairly said thatthe majority of textson methodologyand statistics in the social sciences are theoffspring (diversityandselection notwithstanding!) of Fisher's books, Statistical Methodsfor ResearchWorkers3 first publishedin 1925(a), and TheDesignof Experimentsfirst publishedin 1935(a). In succeeding chapters thesestatistical conceptsareexaminedin more detail and their development elaborated, but first the use of theterm statisticsis explored a little further. THE DEFINITION OF STATISTICS In an everyday sense when we think of statisticswe think of facts and figures, of numerical descriptionsof political andeconomic states (from which the word is derived), and ofinventories of the variousaspectsof our social organization. The history of statistical proceduresin this sensegoesback to the beginnings of human civilization. When trade and commerce began, when governments imposed taxes,numerical records were kept.The counting of people, goods, andchattelswasregularlycarriedout in theRoman Empire,theDomesday Book attempted to describethe state of England for the Norman conquerors,and government agenciesthe world over expenda great deal of money and 3

Maurice Kendall (1963) says of this work, "It is not aneasy book. Somebody once said that no student should attempt to read it unlesshe hadread it before" (p. 2).

PROBABILITY

7

energyin collectingand tabulating suchinformation in the present day. Statistics are usedto describeand summarize,in numerical terms,a wide varietyof situations. But thereis anothermore recently-developed activity subsumed under the term statistics:the practiceof not only collectingandcollating numerical facts, but also the processof reasoningabout them. Going beyond the data,making inferencesand drawing conclusions with greateror lesserdegreesof certainty in an orderly and consistentfashion is the aim ofmodern applied statistics.In this sense statistical reasoning did not begin until fairly late in the 17th century and then onlyin a quite limited way. The sophisticated models now employed, backedby theoretical formulations that are often complex,are allless than100 years old. Westergaard (1932) points to the confusionsthat sometimes arise becausethe word statisticsis usedto signify both collectionsof measurements and reasoning about them, and that in former times it referred merelyto descriptionsof statesin both numericaland non-numerical terms. In adoptingthe statisticalinferential strategythe experimentalistin the life sciencesis acceptingthe intrinsic variabilityof the subject matter.In recognizing a rangeof possibilities,the scientist comesfour-squareagainstthe problem of deciding whetheror not theparticular set of observationshe or she has collected can reasonablybe expectedto reflect the characteristicsof the total range. This is the problem of parameter estimation, the task of estimating population values (parameters) from a considerationof the measurements made on a particular population subset - the sample statistics.A second taskfor inferential statisticsis hypothesis testing, the processof judging whetheror not a particular statistical outcome is likely or unlikely to be due tochance. The statistical inferential strategy depends on aknowledgeof probabilities. This aspectof statisticshasgrown out of three activities that,at first glance, appearto bequite different but in fact have somecloselinks. They areactuarial prediction, gambling,and error assessment. Each addresses the problemsof making decisions,evaluating outcomes, and testing predictionsin the face of uncertainty,and eachhascontributedto thedevelopmentof probability theory. PROBABILITY Statistical operations areoften thoughtof aspractical applications of previously developed probability theory. The fact is, however, that almost all our presentday statistical techniques have arisen from attemptsto answerreal-life problems of prediction and error assessment, and theoretical developments have not always paralleled technical accomplishments. Box (1984) has reviewed the scientific context of a rangeof statistical advances and shown thatthe fundamental methodsevolved from the work of practisingscientists.

8

1. THE DEVELOPMENT OF STATISTICS

John Graunt, a London haberdasher, born in 1620, is credited withthe first attemptto predict andexplain a numberof social phenomena from a consideration of actuarialtables.He compiled histablesfrom Bills of Mortality, the parish accountsof deaths that were regularly, if somewhat crudely, recorded from the beginningof the 17th century. Graunt recognizesthat the question mightbe asked:"To what purpose tends all this laborious buzzling,andgroping?To know, 1. thenumberof the People? 2. How many Males,andFemales?3. Howmany Married,andsingle?"(Graunt, 1662/1975,p. 77), and says:"To this 1might answerin general by saying, that those, who cannot apprehend the reasonof these Enquiries, areunfit to trouble themselvesto askthem." (p. 77). Graunt reassured readers of this quite remarkable work: The Lunaticksarealso but few, viz. \ 58 in229250though I fear manymorethan are set down in ourBills ... So that, this Casualty being so uncertain,1 shall not force my self to make any inference fromthe numbers, and proportions we finde in ourBills concerning it: onely I dareensureany man atthis present,well in his Wits, for one in thethousand, that he shall not die aLunatick in Bedlam,within thesesevenyears,becauseI finde not aboveone inabout one thousandfive hundredhavedone so. (pp. 35-36)

Here is an inference basedon numerical dataand couchedin terms not so very far removedfrom thosein reportsin the modern literature. Graunt's work wasimmediately recognizedasbeingof great importance,and theKing himself (CharlesII) supportedhis electionto the recently incorporated Royal Society. A few years earlierthe seedsof modern probability theory were being sown in France.4 At this time gamblingwas apopular habitin fashionable society and a range of gamesof chancewas being played. For experienced players the oddsapplicableto various situations must have been appreciated, but no formal methods for calculating the chances of various outcomeswere available. Antoine Gombauld,the Chevalierde Mere, a "man-about-town" and gambler with a scientific and mathematical turnof mind, consultedhis friend, Blaise Pascal (1623-1662),a philosopher, scientist,and mathematician, hoping that 4 But note that thereare hints of probability conceptsin mathematics going back at leastas far as the12th centuryand that Girolamo Cardano wrote Liber de Ludo Aleae,(The Book on Games of Chance)a century beforeit was publishedin 1663 (see Ore, 1953). There is also no doubt that quite early in human civilization, therewas anappreciationof long-run relative frequencies, randomness,anddegreesof likelihood in gaming,andsome quiteformal conceptsare to befound in GreekandRoman writings.

PROBABILITY

9

he would be able to resolve questionson calculation of expected(probable) frequency of gains and losses,as well as on thefair division of the stakesin games that were interrupted. Consideration of these questions led to correspondence between Pascal and hisfellow mathematician Pierre Fermat (1601 -1665). 5 No doubt their advice aided de Mere's game. More significantly, it was from this exchange that some of the foundationsof probability theoryand combinatorial algebra were laid. Christian Huygens(1629-1695)published,in 1657, a tract On Reasoning With Gamesof Dice (1657/1970), which waspartly basedon thePascal-Fermat correspondence, and in 1713, Jacques Bernoulli's (1654-1705)book The Art of Conjecture developeda theory of gamesof chance. Pascalhad connectedthe studyof probabilitywith the arithmetic triangle (Fig. 1.1), for which he discoverednew properties, although the triangle was known in China at least five hundred years earlier. Proofs of the triangle's properties were obtained by mathematical induction or reasoningbyrecurrence.

FIG. 1.1

5

Pascal's Triangle

Poisson(1781-1840),writing of this episodein 1837 says,"A problem concerning games of chance, proposed by a man of theworld to an austere Jansenist, was theorigin of the calculus of probabilities" (quotedby Struik, 1954,p. 145). De Mere was certainly "a man of theworld" and Pascaldid become austere andreligious, but at thetime of de Mere'squestionsPascalwas in his so-called "worldlyperiod"(1652-1654).I am indebtedto my father-in-law, the late Professor F.T.H. Fletcher, for many insights intothe life of Pascal.

10

1. THE DEVELOPMENT OF STATISTICS

Pascal'striangle, as it isknown in the West,is a tabulationof the binomial coefficients that may beobtainedfrom the expansionof (P + Q)n where P = Q = I The expansionwas developedby Sir Isaac Newton(1642-1727),and, independently, by the Scottish mathematician, James Gregory (1638-1675). Gregory discoveredthe rule about 1670. Newton communicated it to theRoyal Society in 1676, although later that year heexplained thathe had firstformulated it in 1664 whilehe was aCambridge undergraduate. The example shownin Fig. 1.1 demonstrates that the expansionof (~ + | )4 generates,in the numerators of the expression,the numbersin the fifth row of Pascal'striangle. The terms of this expression also give us the fiveexpected frequencies of outcome(0, 1, 2,3, or 4heads)or improbabilitieswhena fair coin is tossedfour times. Simple experiment will demonstrate that the actual outcomesin the"real world"of coin tossing closely approximate the distribution thathas been calculatedfrom a mathematicalabstraction. During the 18th centurythe theory of probability attractedthe interest of many brilliant minds. Among themwas a friend and admirer of Newton, Abraham De Moivre (1667-1754).De Moivre, a French Huguenot,was internedin 1685after the revocationby LouisXIV of the Edict of Nantes,an edict which hadguaranteed tolerationto French Protestants. He wasreleasedin 1688, fled to England, and spent the remainderof his life in London. De Moivre published what mightbedescribedas agambler's manual, entitled TheDoctrine of Chancesor a Method of Calculating the Probabilities of Eventsin Play. In the second editionof this work,publishedin 1738,and in arevised third edition published posthumouslyin 1756, De Moivre (1756/1967) demonstrated a method, whichhe had firstdevisedin 1733,of approximatingthe sum of avery large numberof binomial terms when n in (P +Q)" is very large(animmensely laborious computationfrom the basic expansion). It may beappreciated thatas n grows larger,the numberof terms in the expansionalsogrows larger. The graph of the distribution beginsto resemble a smooth curve (Fig. 1.2), a bell-shaped symmetrical distributionthat held great interest in mathematical terms but little practical utility outsideof gaming. It is safeto saythat no other theoretical mathematical abstraction has had such an important influence on psychology and the social sciencesas that bell-shaped curve now commonly knownby thename that Karl Pearson decided on- thenormal distribution-a\thougl\he was not the first to use the term. Pierre Laplace(1749-1827)independently derived the function andbrought together much of the earlier workon probabilityin Theorie Analytiquedes Probabilites, publishedin 1812.It was hiswork, aswell ascontributionsby many others, that interpretedthe curve as the Lawof Error andshowed thatit could be applied to variableresults obtainedin multiple observations.One of the firstapplications of the distribution outsideof gaming was in the assessmentof errors in

FIG. 1.2

The Binomial Distribution for N = 12 and the Normal Distribution

11

12

1. THE DEVELOPMENT OF STATISTICS

astronomicalobservations.Later theutility of the "law" in errorassessment was extendedto land surveyingand evento range estimation problems in artillery fire. Indeed, between 1800 and 1820 the foundationsof the theory of error distribution were laid. Carl Friedrich Gauss (1777-1855),perhapsthe greatest mathematician of all time, also made important contributions to work in this area. He was a consultantto thegovernmentsof Hanoverand ofDenmark when they undertook geodeticsurveys. The function that helpedto rationalize the combination of observationsis sometimes calledthe Laplace-Gaussian distribution. Following the work of LaplaceandGauss,the developmentof mathematical probability theory slowed somewhat and not agreat dealof progresswas made until the present century.But it was during the 19th century, throughthe developmentof life insurance companies and throughthe growth of statistical approachesin the social and biological sciences, thatthe applications of probability theory burgeoned. Augustus De Morgan (1806-1871),for example, attempted to reduce the constructsof probability to straightforward rulesof thumb. His work An Essayon Probabilities and on Their Application to Life Contingenciesand Insurance Offices, publishedin 1838, is full of practical advice and iscommentedon by Walker (1929). THE NORMAL DISTRIBUTION The normal distributionwas sonamed because many biological variables when measuredin large groupsof individuals,andplotted asfrequency distributions, do show close approximations to the curve. It is partly for this reason thatthe mathematicsof thedistribution areusedin data assessment in the social sciences and inbiology. The responsibility,aswell as thecredit, for this extensionof the use ofcalculations designed to estimate erroror gambling expectancies into the examinationof human characteristics rests with Lambert Adolphe Quetelet (1796-1874),a Belgian astronomer. In 1835Queteletdescribedhisconceptof theaverageman- / 'homme moyen. L'homme moyenis Nature's ideal,an ideal that corresponds with a middle, measured value.But Nature makeserrors,and in, as itwere,missingthe target, producesthe variability observedin human traitsandphysicalcharacters.More importantly, the extent and frequency of these errorsoften conformto the law of frequencyof error-thenormal distribution. JohnVenn(1834-1933),the English logician, objected to the use of the word error in this context:"When Nature presents uswith a group of objectsof every kind, it is using rathera bold metaphorto speak in this case alsoof a law of error" (Venn, 1888,p. 42), but theanalogy wasattractive to some. Quetelet examined the distributionof the measurements of the chest girths

THE NORMAL DISTRIBUTION

13

of 5,738 Scottish soldiers, these data having been extracted from the 13th volumeof theEdinburgh MedicalJournal. There is nodoubt thatthe measurements closely approximate to a normal curve.In another attemptto exemplify the law, Quetelet examined the heightsof 100,000 French conscripts. Here he noticed a discrepancy between observed and predicted values: The official documents would make it appearthat, of the 100,000men, 28,620 are of less height than5 feet 2 inches: calculation gives only 26,345. Is it not a fair presumption, thatthe 2,275 men who constitutethe difference of these numbershave beenfraudulently rejected? We canreadily understand that it is an easy matterto reduceone'sheighta half-inch,or an inch, whenso greatan interest is at stakeasthat of being rejected. (Quetelet, 1835/1849, p. 98)

Whether or not theallegation stated here - that short (butnot too short) Frenchmen havestoopedso low as toavoid military service- is true is no longer an issue. A more important pointis notedby Boring (1920): While admittingthe dependenceof the law onexperience, Quetelet proceeds in numerouscasesto analyze experience by meansof it. Such a double-edged sword is a peculiarly effective weapon, and it is nowonder that subsequent investigators were tempted to use it inspite of the necessary rules of scientific warfare. (Boring, 1920,p. 11)

The use of thenormal curvein statisticsis not, however, based solely on the fact that it can beusedto describethe frequencydistributionof many observed characteristics.It has a much morefundamental significancein inferential statistics,as will be seen,and thedistribution and itsproperties appear in many partsof this book. Galton first became awareof the distribution from his friend William Spottiswoode,who in 1862 became Secretary of the Royal Geographical Society, but it was thework of Quetelet that greatly impressed him. Many of the data setshe collected approximated to the law and heseemed,on occasion, to be almost mystically impressed with it. I know of scarcely anythingso apt toimpressthe imaginationas thewonderful form of cosmic order expressed by the "Law of Frequencyof Error." The law would havebeenpersonifiedby theGreeksanddeified, if they hadknown of it. It reigns with serenityand in complete self-effacement amidst the wildest confusion. The hugerthe mob and thegreaterthe apparent anarchy, the more perfect is its sway. It is the supremelaw of Unreason. Whenever a large sample of chaotic elementsare taken in handand marshalledin the order of their magnitude,an unsuspectedandmost beautiful form of regularityprovesto have been latentall along. (Galton, 1889,p. 66)

14

1. THE DEVELOPMENT OF STATISTICS

This rathertheologicalattitude towardthe distribution echoesDe Moivre, who, overa century before, proclaimed in TheDoctrine of Chances: Altho' chance produces irregularities, still the Oddswill be infinitely great, that in the processof Time, those irregularities will bear no proportion to the recurrencyof that Order which naturally results from ORIGINAL DESIGN... Such Laws, aswell as theoriginal DesignandPurposeof their Establishment, must all be fromwithout... if we blind not ourselves with metaphysical dust, we shall beled, by ashortandobvious way,to theacknowledgement of the great MAKER andGOVENOUR of all; Himself all-wise, all-powerfulandgood.(De Moivre, 1756/1967p. 251-252)

The ready acceptance of thenormal distributionas a law ofnature encouraged its wide applicationand also produced consternation when exceptions were observed. Quetelet himself admitted the possibility of the existenceof asymmetric distributions,andGalton was attimes less lyrical,for critics had objected to the use of the distribution,not as apractical toolto beused with caution where it seemedappropriate,but as asort of divine rule: It hasbeenobjectedto someof my former work,especiallyin Hereditary Genius, that I pushedthe application of the Law of Frequencyof Error somewhattoo far. I may have doneso, ratherby incautious phrases than in reality; ... I am satisfiedto claim theNormal Law is afair averagerepresentation of theobserved Curves during nine-tenths of their course;...(Galton, 1889,p. 56)6

BIOMETRICS In 1890, WalterF. R. Weldon (1860-1906)was appointedto the Chair of Zoology at University College, London.He wasgreatly impressedandmuch influenced by Gallon's Natural Inheritance.Not only did the book showhim how thefrequencyof the deviationsfrom a "type" might bemeasured,it opened up for him, and forotherzoologists,a hostof biometric problems.In two papers published in 1890 and 1892, Weldon showed that various measurements on shrimps mightbe assessed usingthe normal distribution.He also demonstrated interrelationships (correlations) between two variableswithin theindividuals. But the critical factorin Weldon's contributionto thedevelopmentof statistics was hisprofessorial appointment, for this broughthim into contact with Karl Pearson, then Professor of Applied MathematicsandMechanics,a post Pearson had held since 1884. Weldonwas attempting to remedy his weaknessin 6

work!

Note thatthis quotation and theprevious one from Galton are 10pagesapart in the same

BIOMETRICS

15

mathematicsso that he could extendhis research,and heapproached Pearson for help. His enthusiasmfor the biometric approach drew Pearson away from more orthodox work. A second importantlink was with Galton,who hadreviewed Weldon'sfirst paperon variation in shrimps. Galton supportedand encouragedthe work of thesetwo youngermen until his death, and, under the terms of his will, left £45,000 to endow a Chair of Eugenicsat the University of London, together with the wish thatthe post mightbe offered first to Karl Pearson.The offer was madeand accepted. In 1904, Galtonhad offered the University of London £500to establishthe study of national eugenics.Pearsonwas amemberof the Committee thatthe University set up, and the outcomewas adecisionto appointtheGallon Research Fellow at what was to benamedthe Eugenics RecordOffice. This becamethe Galton Laboratory for National Eugenicsin 1906, and yet more financial assistancewas providedby Gallon. Pearson,still Professorof Applied Mathematics,was itsDirector aswell asHeadof theBiometrics Laboratory. This latter received muchof its funding over many yearsfrom grantsfrom the Worshipful Companyof Drapers, whichfirst gave moneyto theUniversity in 1903. Pearson's appointment to the Gallon Chair brought applied statistics,biomelrics, and eugenics together under his direction at University College.II cannot howeverbe claimedabsolutelythat the day-to-day workof these units was drivenby acommon theme. Applied statistics andbiometrics were primarily concernedwith the developmentand applicationof statistical techniques to a variety of problems,including anthropometric investigations; the Eugenics Laboratory collected extensive family pedigreesand examined actuarial death rates. Of course Pearson coordinated all the work, and there was interchange and exchange among the staff that workedwith him, but Magnello (1998,1999) hasargued that therewas not asingle unifying purposein Pearson'sresearch. Others, notably MacKenzie (1981), Kevles (1985), and Porter (1986), have promotedthe view that eugenics was thedriving force behindPearson'sstatistical endeavors. Pearsonwas not aformal memberof the eugenics movementHe did not join the Eugenics Education Society, and apparentlyhe tried to keep the two laboratories administratively separate, maintaining separate financial accounts, for example,but it has to berecognized thathis personal viewsof the human condition and itsfuture included the conviction that eugenics was ofcritical importance.There was anobviousand persistent intermingling of statistical resultsandeugenicsin his pronouncements. For example,in his Huxley Lecture in 1903 (publishedin Biometrika in 1903and 1904),on topicsthat were clearly biometric,havingto do with his researcheson relationships between moral and

16

1. THE DEVELOPMENT OF STATISTICS

intellectual variables,he ended witha plea, if not a rallying cry, for eugenics: The mentally better stockin the nationis not reproducing itselfat thesame rateas it did of old; the less able,and theless energetic, aremore fertile thanthe better stocks. ... The only remedy,if one bepossibleat all, is to alter the relative fertility of the good and the badstocksin the community.. . . intelligence can beaided and be trained,but notrainingor educationcancreateit. You must breedit, that is thebroad result for statecraft whichflows from the equality in inheritanceof the psychicaland the physical characters in man. (Pearson, 1904a, pp. 179-180).

Pearson'scontributionwasmonumental,for in less than8 years, between 1893 and 1901, he published over30 paperson statisticalmethods. The first was written as aresultof Weldon's discovery that the distributionof one set of measurementsof the characteristicsof crabs, collectedat thezoological station at Naplesin 1892,was "double-humped." The distributionwas reducedto the sum of two normal curves. Pearson (1894)proceededto investigatethe general problem of fitting observed distributions to theoretical curves. This work was to lead directlyto theformulationof the x2 test of "goodnessof fit" in 1900,one of the most important developments in the history of statistics. Weldon approachedthe problem of discrepancies between theory and observationin a much more empirical way, tossing coins anddice and comparing the outcomes withthe binomial model. These data helped to produce another lineof development. In a letter to Galton, writtenin 1894, Weldon asksfor a commenton the results of 7,000tossingsof 12 dice collectedfor him by aclerk at University College: A day or two agoPearsonwantedsomerecordsof the kind in a hurry, in order to illustrate a lecture,and Igavehim therecord of the clerk's7000tosses... on examination he rejects them, because he thinks the deviation from the theoretically most probable resultis so great as to make the record intrinsically incredible, (quotedby E. S.Pearson,1965, p. 11)

This incidentset off agood dealof correspondence between Karl Pearson, F.Y. Edgeworth (1845-1926),an economist and statistician, and Weldon, the details of which are now only of minor importance. But,as Karl Pearson remarked, "Probabilitiesare very slippery things" (quoted by E. S. Pearson, 1965, p. 14), and the search for criteria by which to assessthe differences between observed andtheoretical frequencies, andwhetheror notthey couldbe reasonably attributed to chance sampling fluctuations, began. Statistical research rapidly expanded into careful examinationof distributions other than the normal curve and eventually intothe propertiesof sampling distributions,

BIOMETRICS

17

particularly through the seminal work of Ronald Fisher. In developinghis researchinto the propertiesof the probability distributions of statistics, Fisher investigated the basisof hypothesis testingand thefoundations of all the well-known testsof statistical significance. Fisher's assertion that p = .05 (1 in 20) is the probability thatis convenientfor judging whetheror not a deviation is to beconsidered significant (i.e. unlikely to be due tochance), has profoundly affected research in the social sciences,although it should be noted thathe was not the originatorof the convention (Cowles& Davis, 1982a). Of course,the developmentof statistical methodsdoesnot endhere,nor have all the threads been drawn together. Discussion of the important contribution of W. S. Gosset("Student," 1876-1937)to small sample workand therefinements introduced into hypothesis testing by Karl Pearson'sson, Egon S. Pearson(1895-1980)and Jerzy Neyman(1899-1981)will be found in later chapters, whenthe earlier details have been elaborated. Biometrics and Genetics The early years of the biometric school were surrounded by controversy. Pearsonand Weldon heldfast to the view that evolution took placeby the continuous selectionsof variations that were favorable to organismsin their environment.The rediscoveryof Mendel's workin 1900 supportedthe concept that heredity depends on self-reproducing particles (what we now call genes), and that inherited variationis discontinuousand saltatory. The source of the developmentof higher types was occasional genetic jumps or mutations. Curiously enough, thiswas theview of evolution that Galtonhad supported. His misinterpretationof the purely statistical phenomenon of regressionled him to thenotion thata distinctionhad to bemade between variations from the mean that regressand what he called"sports"(a breeder'sterm for an animalor plant variety thatappearsapparently spontaneously) that will not. A championof the position that mutations were of critical importancein the evolutionary processwas William Bateson(1861-1926)and aprolongedand bitter argument withthe biometricians ensued. The Evolution Committeeof the Royal Society broke down over the dispute. Biometrikawasfoundedby Pearson and Weldon, with Galton'sfinancial support, in 1900, after the Royal Society had allowed Batesonto publish a detailed criticismof a paper submittedby Pearson beforethe paper itselfhad been issued. Britain's important scientific journal, Nature, tookthe biometricians' sideand would not print letters from Bateson. Pearson replied to Bateson'scriticisms in Biometrika but refusedto accept Bateson'srejoinders, whereupon Bateson hadthem privately printedby the CambridgeUniversity Pressin the format of Biometrika\ At the British Association meetingin Cambridge in 1904, Bateson, then

18

1. THE DEVELOPMENT OF STATISTICS

Presidentof theZoological Section, took theopportunityto deliverabitter attack on the biometric school. Dramatically waving aloft the published volumesof Biometrika, he pronounced them worthless and hedescribed Pearson's correlation tables as: "aProcrusteanbedinto whichthebiometricianfits hisunanalysed data." (quotedby Julian Huxley,1949). It is even said that Pearson and Batesonrefusedto shake hands at Weldon's funeral. Nevertheless,after Weldon's deaththe controversy cooled. Pearson's work became more concerned with the theory of statistics, althoughthe influenceof his eugenic philosophy wasstill in evidence,and by1910, when Bateson becameDirector of theJohn Innes Horticultural Institute, the argumenthaddied. However, some statistical aspects of this contentious debate predated the evolution dispute,andechoesof them - indeed, marked reverberations from them - arestill around today, although of course MendelianandDarwinian thinking arecompletely reconciled. STATISTICAL CRITICISM Statisticshas been calledthe "scienceof averages,"and this definition is not meant in a kindly way. The great physiologist Claude Bernard (1813-1878) maintained thatthe use ofaveragesin physiology couldnot be countenanced: becausethetruerelationsof phenomenadisappearin the average;whendealing with complex and variable experiments,we must study their various circumstances,andthen presentour most perfect experiment as atype, which, however, still standsfor true facts. ... averagesmust thereforebe rejected, because they confuse while aiming to unify, and distort while aiming to simplify. (Bernard,1865/1927,p. 135)

Now it is, of course, true that lumping measurements together may notgive us anything more than a pictureof the lumping together,and theaverage value may not beanything likeany oneindividual measurement at all, but Bernard's ideal type fails to acknowledgethe reality of individual differences.A rather memorable example of a very real confusionis given by Bernard(1865/1927): A startling instanceof this kindwas inventedby aphysiologistwho took urinefrom a railway stationurinal where people of all nationspassed,and whobelievedthat he could thus presentan analysisof average European urine! (pp. 134-135).

A less memorable, but just astelling, exampleis that of the social psychologist who solemnly reports "mean social class." Pearson(1906) notes that: One of theblows to Weldon, which resultedfrom his biometric viewof life

STATISTICAL CRITICISM

19

was that his biological friends could not appreciatehis newenthusiasms. They could not understandhow the Museum "specimen"was in thefuture to be replacedby the "sample"of 500 to 1000 individuals,(p. 37)

The view is still not wholly appreciated. Many psychologists subscribe to the position thatthe most pressing problems of the discipline,andcertainly the ones of most practical interest, are problemsof individual behavior. A major criticism of the effect of the use of thestatistical approachin psychological researchis the failure to differentiate adequately between general propositions that apply to most, if not all, membersof a particular groupand statistical propositions that applyto some aggregatedmeasureof the membersof the group. The latter approach discounts the exceptionsto thestatisticalaggregate, which not only may be themost interestingbut may, on occasion, constitute a large proportionof the group. Controversy abounds in the field of measurement, probability, and statistics, and the methods employedare open to criticism, revision, and downright rejection. On theother hand, measurement and statistics playa leading rolein psychological research, and thegreatest danger seemsto lie in a nonawareness of the limitations of the statistical approach and thebasesof their development, as well as the use of techniques, assisted by thehigh-speed computer, as recipes for datamanipulation. Miller (1963) observedof Fisher, "Few psychologists have educated us as rapidly, or have influencedour work as pervasively,as didthis fervent, clearheaded statistician."(p. 157). Hogben (1957) certainly agreesthat Fisherhasbeen enormouslyinfluential but heobjectsto Fisher's confidence in his own intuitions: This intrepid beliefin what hedisarmingly calls common sense...has ledFisher ... to advancea battery of conceptsfor thesemantic credentials of which neither he nor hisdisciplesoffer anyjustification enrapport with the generallyaccepted tenetsof the classical theoryof probability. (Hogben, 1957,p. 504)

Hogben also expresses a thoughtoften shared by natural scientists when they review psychologicalresearch,that: Acceptability of a statistically significant result of an experiment on animal behaviourin contradistinctionto aresult whichtheinvestigatorcanrepeat before a critical audience naturally promotes a high outputof publication. Hencethe argumentthat the techniques workhas atempting appealto young biologists. (Hogben, 1957,p. 27)

Experimental psychologists may well agree thatthe tightly controlledexperiment is the apotheosisof classicalscientific method,but they are not so

20

1. THE DEVELOPMENT OF STATISTICS

arrogant as tosuppose that their subject matter will necessarily submit to this form of analysis,and they turn, almost inevitably, to statistical,as opposedto experimental, control. This is not amuddle-headed notion, but it does present dangersif it is accepted without caution. A balanced,but notuncritical, viewof the utility of statisticscan bearrived at from a considerationof the forces that shaped the disciplineand anexamination of its development. Whether or not this is an assertion that anyone, let alonethe authorof this book,canjustify remainsto be seen. Yet thereare Writers, of a Classindeed very different from that of JamesBernoulli, who insinuateas if theDoctrine of Probabilities could haveno placein any serious Enquiry; and that Studiesof this kind, trivial and easy as they be, rather disqualify a man for reasoning on any other subject. Let the Reader chuse.(De Moivre, 1756/1967,p. 254)

2

Science, Psychology, and Statistics

DETERMINISM It is a popular notion thatif psychologyis to beconsidereda science, thenit most certainly is not an exact science.The propositions of psychology are considered to be inexact becauseno psychologist on earth would venturea statement suchas this: "All stable extravertswill, when asked,volunteer to 1 participate in psychological experiments." The propositions of the natural sciencesare consideredto beexact because all physicists wouldbe preparedto attest (with some few cautionary qualifications) that, "fireburns" or, more pretentiously, that"e = mc2." In short, it is felt that the order in the universe, which nearly everyone (though for different reasons)is sure mustbe there,has been more obviously demonstrated by the natural rather than the social scientists. Order in the universe implies determinism, a most useful and amost vexing term, for it brings thosewho wonder about such things into contact with the philosophicalunderpinningsof the rather everyday concept of causality.No one hasstatedthe situation more clearly than Laplace in his Essai: Present eventsare connectedwith preceding onesby a tie basedupon the evident principle that a thing cannot occur without a causewhich producesit. This axiom known by the name of the principle of sufficient reason, extends even to actions which are consideredindifferent; the freest will is unable withouta determinative motive to give them birth;... We ought thento regardthe present stateof the universeas theeffect of its anterior 1 Not wishing to make pronouncements on theprobabilistic natureof the work of others,the writer is makingan oblique referenceto work in which he and acolleague (Cowles& Davis, 1987)found that there is an 80%chance that stable extraverts will volunteerto participatein psychological research.

21

22

2. SCIENCE, PSYCHOLOGY, AND STATISTICS stateand as thecauseof the onethat is to follow. Given for oneinstant an intelligence which could comprehendall the forces by which nature is animated and the respectivesituationsof thebeingswho composeit- an intelligence sufficiently vast to submit these data to analysis - it would embrace in the same formulathe movementsof the greatestbodies of the universeandthoseof the lightest atom;for it nothingwould beuncertainand thefuture, as thepast,would bepresentto its eyes. (Laplace, 1820/1951,pp. 3-4)

The assumptionof determinismis simply the apparently reasonable notion that eventsare caused.Sciencediscovers regularitiesin nature, formulates descriptionsof these regularities,and provides explanations, that is to say, discoverscauses.Knowledge of the past enablesthe future to be predicted. This is thepopular viewof science,and it isalsoa sort of working ruleof thumb for those engaged in the scientific enterprise. Determinism is thecentral feature of the developmentof modern science up to thefirst quarterof the 20th century. The successesof the natural sciences, particularly the successof Newtonian mechanics, urged andinfluenced someof thegiantsof psychology, particularly in North America, to adopt a similar mechanistic approach to the study of behavior. The rise of behaviorism promoted the view that psychology could be ascience, "like othersciences."Formulae couldbedevised that would allow behaviorto bepredicted,and atechnology couldbe achieved that would enable environmentalconditionsto be somanipulatedthat behavior couldbe controlled. The variability in living things was to bebrought under experimental control, a program that leads quite naturally to thenotionof thestimulus control of behavior. It follows thatconceptssuchaswill or choiceor freedomof action could be rejectedby behavioral science. In 1913, JohnB. Watson (1878-1958)publisheda paper that became the behaviorists' manifesto.It begins, "Psychologyas thebehaviorist viewsit is a purely objective experimental branch of naturalscience.Its theoretical goalis the predictionand control of behavior" (Watson, 1913, p. 158). Oddly enough, this pronouncement coincided with work that began to questionthe assumptionof determinismin physics,the undoubted leader of the natural sciences.In 1913, a laboratory experimentin Cambridge, England, provided spectroscopic proofof what is known as theRutherford-Bohr model of the atom. Ernest Rutherford (later Lord Rutherford, 1871-1937)had proposed thatthe atom was like a miniaturesolar system with electrons orbiting a central nucleus. Niels Bohr (1885-1962),a Danish physicist, explained that the electrons moved from oneorbit to another, emitting or absorbing energy asthey movedtoward or away from the nucleus.Thejumping of anelectronfrom orbit to orbit appearedto be unpredictable.The totality of exchanges could only be predictedin a statistical, probabilistic fashion.That giantof modern physicists,

DETERMINISM

23

Albert Einstein(1879-1955),whose workhad helpedto start the revolutionin physics,was loath to abandonthe conceptof a completely causal universe and indeed neverdid entirely abandonit. In the 1920s, Einstein made the statement that has often been paraphrased as, "God doesnot play dice withthe world." Nevertheless,he recognizedthe problem. In a lecture givenin 1928, Einstein said: Today faith in unbroken causalityis threatenedprecisely by thosewhosepath it had illumined as their chief and unrestricted leaderat the front, namely by the representativesof physics... All natural lawsaretherefore claimedto be, "in principle," of a statisticalvariety and ourimperfect observationpracticesalone havecheatedus into a belief in strict causality, (quotedby Clark, 1971,pp. 347-348)

But Einstein never really accepted this proposition, believing to the endthat indeterminacywas to beequated with ignorance. Einstein may be right in subscribing ultimatelyto the inflexibility of Laplace's all-seeing demon, but another approachto indeterminacywas advanced by Werner Heisenberg (1902-1981)a German physicist who, in 1927, formulatedhis famous uncertainty principle. He examinednot merelythe practical limits of measurement but the theoretical limits,and showed thatthe act ofobservationof the position and velocity of a subatomic particle interfered with it so as toinevitably produce errors in the measurementof one or theother. This assertion hasbeen takento mean that, ultimately,the forces in our universeare random and therefore indeterminate.Bertrand Russell (1931) disagrees: Spaceand time were inventedby the Greeks, and served their purpose admirably until the present century. Einstein replaced them by akind of centaur whichhe called "space-time,"and this did well enoughfor a coupleof decades,but modern quantum mechanicshas made it evident thata more fundamental reconstruction is necessary. The Principle of Indeterminacyis merely an illustrationof this necessity,not of the failure of physical lawsto determinethe courseof nature, (pp. 108-109)

The important pointto be awareof is that Heisenberg's principle refers to the observerand the act ofobservationand notdirectly to the phenomena that are being observed. This implies that the phenomena have an existence outside their observationand description,a contention that,by itself, occupies philosophers. Nowhereis the demonstration that technique and method shape the way in which we conceptualize phenomena more apparent than in the physicsof light. The progressof eventsin physics thatled to theview that light was both wave and particle, a view that Einstein's workhad promoted, beganto dismay him when it was used to suggest that physics would have to abandon strict

24

2. SCIENCE, PSYCHOLOGY,AND STATISTICS

continuity and causality. Bohr responded to Einstein's dismay: You, the man whointroducedthe ideaof light asparticles! If you are soconcerned with the situation in physicsin which the natureof light allows for a dual interpretation, thenask theGerman government to ban the use of photoelectric cellsif you think that light is waves, or the use ofdiffraction gratings if light is corpuscular, (quotedby Clark, 1971,p.253)

The parallels in experimental psychologyare obvious. That eminent historian of the discipline, Edwin Boring, describes a colloquium at Harvard when his colleague, William McDougall,who:"believed in freedom for the human mind - in at least a little residue of freedom- believed in it andhoped for as much as hecould savefrom the inroads of scientific determinism,"and he, a determinist, achieved,for Boring, an understanding: McDougall's freedom was my variance. McDougall hoped that variance would always be found in specifyingthe laws of behavior,for there freedom might still persist. I hoped then- less wise thanI think I am now (it was 31years ago)- that science would keep pressing variance towards as zero alimit. At any rate this general fact emergesfrom this example: freedom, when you believeit is operating, always residesin an areaof ignorance. If there is a known law, you do nothave freedom. (Boring, 1957,p. 190)

Boring was really unshakablein his belief in determinism,and that most influential of psychologists,B. F.Skinner, agrees with the necessityof assuming order in nature: It is aworking assumption which must be adoptedat thevery start. Wecannotapply the methods of science to a subject matter whichis assumedto move about capriciously. Sciencenot only describes,it predicts. It dealsnot only with the past but with the future...If we are to use themethodsof sciencein the field of human affairs, we must assume that behavior is lawful and determined. (Skinner,1953, p. 6)

Carl Rogersis among thosewho have adoptedas afundamental positionthe view that individualsare responsible, free,and spontaneous. Rogers believes that, "the individual choosesto fulfill himself by playing a responsible and voluntary partin bringing aboutthedestined events of the world" (Rogers,1962, quotedby Walker, 1970,p. 13). The use of theword destinedin this assertion somewhat spoils the impact of whatmosttaketo be theindividualistic andhumanisticapproachthat is espoused by Rogers. Indeed,he hasmaintained thatthe conceptsof scientific determinism and personal choicecanpeacefully coexistin the way inwhich the particle

DETERMINISM

25

and wave theoriesof light coexist. The theoriesaretrue but incompatible. Thesefew words do little more than suggest the problems thatare facedby the philosopherof science whenhe or shetacklesthe conceptof methodin both the naturaland thesocial sciences. What basic assumptions can wemake? So often the argumentspresentedby humanistic psychologists have strong moral or even theological undertones, whereas those offering the determinist's view point to theregularities that existin nature- even human nature - andaver that without such regularities, behavior would be unpredictable. Clearly, beings whose behaviorwas completely unpredictable would have been unable to achievethe degreeof technologicalandsocial cooperation that marks thehuman species. Indeed, individuals of all philosophical persuasions tend to agree that someone whose behavior is generally not predictable needs some sort of treatment.On the other hand,the notion of moral responsibility implies freedom. If I am to bepraisedfor my good worksandblamedfor my sins,a statement that "nature is merely unfolding as it should" is unlikely to be acceptedas a defensefor the latter, and unlikely to be advancedby me as areasonfor the former. One way out of theimpasse,it is suggested,is to reject strict "100 percent" determinismand to accept statistical determinism. "Freedom"then becomes partof the error termin statistical manipulations. Grimbaum (1952) considersthe arguments against both strict determinism and statistical determinism, arguments based on the complexityof human behavior,the conceptof moral choice, and theassignmentof responsibility,assertions that individuals are uniqueand that therefore their actions are notgeneralizablein the scientific sense,and that human beingsvia their goal-seeking behavior themselves determinethe future. He concludes,"Since the important arguments against determinismwhichwe have considered arewithout foundation,the psychologist neednot bedeterredin his questand canconfidently use thecausal hypothesis as a principle, undauntedby the caveat of the philosophical indeterminist" (Grunbaum,1952,p. 676). Feigl (1959) insists that freedom must not beconfused withthe absenceof causality, and causal determination must not be confused with coercionor compulsion or constraint. "To be free means thatthe chooseror agent is an essentiallink in the chainof causal events and that no extraneous compulsion - be itphysical, biological,or psychological- forces him to act in adirection incompatiblewith his basic desiresor intentions" (Feigl, 1959,p. 116). To some extent,and many modern thinkers would say to alarge extent (and Feigl agreeswith this), philosophical perplexities can beclarified, if not entirely resolved,by examiningthe meaningof the terms employedin the debate rather than arguing about reality. Two further points mightbe made. The first is that variabilityanduncertainty in observationsin the naturalas well as thesocial sciences require a statistical

26

2. SCIENCE, PSYCHOLOGY, AND STATISTICS

approachin order to reveal broad regularities, and this appliesto experimentation andobservationnow, whatever philosophical stanceis adopted.If the very idea of regularity is rejected, then systematic approaches to the study of the human conditionareirrelevant. The secondis that,at the end of thediscussion, most will smile withDr Samuel Johnson when he said,"I know thatI have free will and there'san end onit."

PROBABILISTIC AND DETERMINISTIC MODELS The natural sciencesset great storeby mathematical models.For example,in the physicsof massand motion, potential energy is given by PE = mgh, where m is mass,g is accelerationdue togravity, and h isheight. Newton'ssecond law of motion states thatF = ma, where F is force, m is mass, and a is acceleration. These mathematical functions or modelsmay betermed deterministic models because, giventhe valueson theright-hand sideof the equation,the construct on the left-hand sideis completely determined.Any variability that mightbe observedin, for example, Forcefor a measured mass and agiven acceleration is dueonly to measurement error. Increasing the precisionof measurement using accurate instrumentation and/or superior technique will reducethe error in Fto 2 very small margins indeed. Some psychologists, notably, Clark Hull in learning theoryand Raymond B. Cattell m personality theoryand measurement,approachedtheir work with the aim (one mightsay thedream!) of producing parallel models for psychological constructs.Econometricians similarlysearchfor models thatwill describe the market with precision and reliability. For social and biological scientists thereis apersistentandenduring challenge that makes their disciplines both fascinatingandfrustrating.That challengeis thesearchfor meansto assess and understandthe variability thatis inherentin living systemsand societies. In univariate statistical analysis we are encompassing those procedures wherethereisjust onemeasuredor dependent variable (the variable that appears on the left-hand sideof the equals signin the function shown next)and one or more independentor predictor variables: those that are seenon theright-hand 2

For thoseof you who have been exposed to physics, it must be admitted that although Newton's laws were more or less unchallenged for 200yearsthemodelshegaveus are notdefinitions but assumptionswithin the Newtonian system, as thegreat Austrian physicist, Ernst Mach, pointed out. Einstein's theoryof relativity showed that theyare not universally true. Nevertheless, the distinction betweenthe models of the natural and thesocial and biological sciencesis, for the moment,a useful one.

SCIENCEAND INDUCTION

27

side of the equation.The function is known as thegeneral linear model.

The independent variables are those thatare chosen and/or manipulated by the investigator. Theyare assumedto be related to or have an effect on the dependent variable.In this modelthe independent variables, x\, x2, * 3j xn , are assumedto be measured without error, po , PI, Pa, PS, P« , are unknown parameters. Bo is aconstant equivalent to the interceptin the simple linear model,and the others are theweightsof the independent variables affecting, or being used to estimate,a given observationy.Finally, e is therandom errorassociatedwith the particular combinationof circumstances: those chosen variables, treatments, individual difference factors, and so on. Butthese modelsare probabilistic models. Theyare based on samplesof observationsfrom perhapsa variety of populationsand they may betested underthe hypothesisof chance. Even when they pass the test of significance, like other statistical outcomes they are notnecessarily completely reliable. The dependent variable can, at best,be only partly determinedby such models,and other samplesfrom other populations, perhaps using differently defined independent variables, may, and sometimes do, give us different conclusions. SCIENCE AND INDUCTION It is common to tracethe Western intellectual tradition to two fountainheads, Greek philosophy, particularly Aristotelian philosophy, and Judaic/Christian theology. Science began with the Greeks in the sense that theyset out its commonlyacceptedground rules. Science proceeds systematically. It gives us a knowledgeof nature thatis public and demonstrable and, most importantly, open to correction. Science provides explanations that are rational and,in principle, testable, rather than mystical or symbolic or theological. The cold rationality that this implieshasbeen tempered by theJudaic/Christian notionof the compassionate human being as acreature constructed in the imageof God, and, by the belief thatthe universeis the creationof God and assuch deserves the attention of the beings thatinhabit it. It is these streams of thought that give us thedebate about intellectual values and scientific responsibilityand sustain the view that science cannot be metaphysically neutral nor value free. But it is nottrue to saythat thesetraditionshavecontinuouslyguided Western thought. Christianity took some time to become established. Greek philosophy and science disappeared under the pragmatic technologists of Rome. Judaism, weakened by the loss of its home and thepersecutionof its adherents, took refuge in the refinement and interpretation of its ancient doctrines. When

28

2. SCIENCE, PSYCHOLOGY, AND STATISTICS

Christianity did becomethe intellectual shelterfor the thinkers of Europe,it embracedthe view thatGod revealed whatHe wishedto revealandthat nature everywherewas symbolicof a general moral truth known only to Him. These ideas, formalized by the first great Christian philosopher,St. Augustine (354-430),persistedup to, andbeyond,theRenaissance, andsimplistic versions may be seenin the expostulationsof fundamentalistpreachers today.For the best part of 1,000 yearsscientific thinking was not animportant part of intellectual advance. Aristotle's rationalismwasrevivedby "the schoolmen,"of whomthe greatest was Thomas Aquinas(1225?-1274),but the influence of science and the shapingof the modernintellectual world beganin the 17th century. The modern world,so far asmental outlookis concerned,beginsin the seventeenth century. No Italian of the Renaissance would havebeenunintelligible to Plato or Aristotle; Luther would have horrified ThomasAquinas, but would not have been difficult for him to understand. Withthe seventeenth century it is different: Plato and Aristotle, Aquinasand Occam,could not have made head or tail of Newton. (Russell, 1946,p. 512)

This is not to deny the work of earlier scholars. Leonardo da Vinci (1452 -1519)vigorously propounded the importanceof experienceand observation, and enthusiastically wroteof causalityand the "certainty" of mathematics. Francis Bacon(1561-1626)is a particular exampleof one whoexpressedthe importanceof systemand methodin the gainingof new knowledge,and his contribution, althoughoften underestimated, is of great interestto psychologists. Hearnshaw (1987) gives an accountof Bacon that shows how his ideascan be seenin the foundationand progressof experimentaland inductive psychology. He notes that: "Bacon himself made few detailed contributions to general psychology as such,he sawmore clearly than anyone of his time the need for, and thepotentialitiesof, a psychologyfounded on empirical data,and capable of being appliedto 'the relief of man'sestate'"(Hearnshaw, 1987, p. 55). A turning point for modern science arrives with the work of Copernicus (1473-1543). His account of the heliocentric theoryof our planetary system was publishedin theyearof his deathand hadlittle impactuntil the 17th century. The Copernican theory involved no newfacts, nor did it contributeto mathematical simplicity. As Ginzburg (1936) notes, Copernicus reviewed theexisting facts and came up with a simpler physical hypothesis than that of Ptolemaic theory which stated thatthe earthwas thecenterof theuniverse: The fact that PTOLEMY and his successorswere led to make an affirmation in violence to the facts as then known shows that their acceptanceof the belief in the

SCIENCE AND INDUCTION

29

immobility of the earth at thecentre of the universewas not theresult of incomplete knowledge but rather the resultof a positive prejudice emanating from non-scientific considerations. (Ginzburg, 1936, p. 308)

This statementis onethat scientists, when they are wearing their scientists' hats, would supportand elaborate upon,but anexaminationof the stateof the psychological sciences could not fully sustainit. Nowhereis our knowledge more incomplete than in the studyof the human condition,andnowhereare our interpretations more open to prejudiceand ideology.The study of differences between the sexes, the nature-nurture issuein the examinationof human personality, intelligence,and aptitude,the sociobiological debateon the interpretation of the way in which societiesare organized, all are marked by undertonesof ethicsand ideology thatthe scientific purist wouldsee asoutside the notion of an autonomous science. Nor is this a complete list. Scienceis not somuch about factsas about interpretationsof observation, and interpretationsas well as observationsare guided and moldedby preconceptions. Ask someoneto "observe"and he or she will ask what it is that is to be observed. To suggest thatthe manipulationand analysisof numerical data, in the senseof the actual methods employed, canalso be guidedby preconceptions seems very odd. Nevertheless, the developmentof statisticswasheavily influencedby theideological stance of its developers.The strengthof statistical analysisis that its latter-day usersdo nothaveto subscribeto theparticular views of the pioneersin order to appreciateits utility and apply it successfully. A common viewis that science proceeds through the processof induction. Put simply, thisis theview thatthe future will resemblethe past. The occurrence of an eventA will lead us to expectan eventB if past experiencehas shownB alwaysfollowing A The general principle thatB follows A is quickly accepted. Reasoning from particular casesto general principlesis seen as the very foundation of science. The conceptsof causality and inference come togetherin the process of induction. The great Scottish philosopher David Hume (1711-1776) threw down a challenge thatstill occupiesthe attentionof philosophers: As to past experience, it can beallowed to give direct and certain informationof those precise objects only, and that precise periodof time, which fell under its cognizance:but why this experience should be extendedto future times,and toother objects, whichfor aughtwe know, may beonly in appearance similar; this is themain question on which I would insist. Thesetwo propositions are farfrom being the same, / have found thatsuchan object has always beenattended withsuchan effect and I foresee that other objects, which are,in appearance, similar, willbe attended withsimilar effects. I shall allow, if you please,that the oneproposition may justly be inferred fromthe other: I know,

30

2. SCIENCE, PSYCHOLOGY, AND STATISTICS

in fact, that it always is inferred. But if you insist thatthe inferenceis madeby a chain of reasoning,I desireyou to produce that reasoning. (Hume, 1748/1951, pp. 33-34)

The arguments against Hume's assertion that it is merely the frequent conjunctionor sequencingof two events that leads us to abelief thatonecauses the other have been presented in many formsand this account cannot examine them all. The most obvious counteris that Hume'sown assertion invokes causality. Contiguityin time and spacecausesus to assume causality. Popper (1962) notes that the ideaof repetition basedon similarity as thebasisof a belief in causality presents difficulties. Situationsarenever exactlythe same.Similar situationsareinterpretedasrepetitionsfrom a particular pointof view - and that point of view is a system of "expectations,anticipations, assumptions,or interests"(Popper, 1962,p. 45). In psychological matters there is the additional factor of'volition. I wish to pick up my pen andwrite. A chain of nervousand muscularand cognitive processes ensuesand I dowrite. The fact that human beings can and docontrol their future actions leadsto a situation wherea general denialof causality flies in the face of common sense. Such a denial invites mockery. Hume's argument that experience does not justify prediction is more difficult to counter. The courseof natural eventsis not wholly predictable,and the history of the scientific enterpriseis littered with the ruins of theories and explanations that subsequent experience showed to bewanting. Hume's skepticism, if it was accepted, would lead to a situation where nothing could be learnedfrom experienceand observation.The history of humanaffairs would refute this, but theargument undoubtedly leads to acautious approach. Predicting the future becomesa probabilistic exerciseand scienceis no longer ableto claim to be the way tocertainty andtruth. Using the probability calculusas an aid toprediction is onething; usingit to assessthe value of a particular theoryis another. Popper(1962) regards statements about theories having a high degreeof probability as misconceptions. Theoriescan beinvokedto explain various phenomena andgoodtheories arethose that stand up to severe test. But, Popper argues, corroboration cannot be equated with mathematical probability: All theories,including the best,havethe sameprobability, namelyzero. That an appealto probabilityis incapableof solving the riddle of experienceis a conclusionfirst reachedlong ago byDavid Hume... Experience doesnot consist in the mechanical accumulation of observations. Experienceis creative. It is the result of free, bold and creative interpretations, controlled by severe criticismand severe tests. (Popper, 1962, pp. 192-193)

INFERENCE

31

INFERENCE Induction is, andwill continue to be, alarge problemfor philosophical discussion. Inference can benarrowed down. Although the termsare sometimes used in the same sense and with the same meaning, it is useful to reservethe term inference for the makingof explicit statements about the propertiesof a wider universe thatare based on a much narrowerset of observations. Statistical inference is precisely that,and the discussion just presented leads to the argumentthat all inference is probabilistic and therefore all inferential statementsare statistical. Statistical inference is a way ofreasoningthat presents itself as amathematical solutionto the problemof induction. The searchfor rulesof inferencefrom the time of Bernoulli and Bayesto that of Neymanand Pearsonhasprovided the spur for the developmentof mathematical probability theory.It has been argued thatthe establishingof a set ofrecipesfor data manipulation has led to a situation where researchersin the social sciences"allow statistics to do the thinking for them." It has beenfurther argued that psychological questions that do not lend themselvesto the collectionand manipulationof quantitative data are neglectedor ignored. These criticisms are not to betaken lightly, but they can beanswered.In the first place, statistical inference is only a part of formal psychological investigation.An equally important component is experimental design.It is the lastingcontributionof Ronald Fisher,a mathematical statistician and achampionof the practical researcher,that showedus how theformulation of intelligent questionsin systematic frameworks would produce datathat, with the helpof statistics, could provide intelligent answers. In the second place,the social sciences have repeatedly come up with techniques that have enabled qualitativedatato be quantified. In experimental psychologytwo broad strategies have been adopted for coping with variability. The experimental analytic approach sets out boldly to contain or to standardizeas many of the sourcesof variability as possible. In the micro-universeof the Skinner box, shaping andobservingthe rat'sbehavior dependon a knowledgeof the antecedentand present conditions under which a particular pieceof behaviormay beobserved.The second approach is that of statistical inference. Experimental psychologists control (in the senseof standardizeor equalize) those variables that they cancontrol, measure what they wish to measure witha degreeof precision,assume that noncontrolled factors operate randomly,and hope that statistical methods will teaseout the"effects" from the "error." Whateverthe strategy, experimentalists will agreethat the knowledge they obtain is approximate. It has also been generally assumed that this approximate science is an interim science. Probabilityis part of scientific

32

2. SCIENCE, PSYCHOLOGY, AND STATISTICS

method but not part of knowledge. Some writers have rejected this view. Reichenbach(1938), for example, soughtto devise a formal probability logic in which judgmentsof the truth or falsity of propositionsis replaced by the notion of weight. Probability belongs to a classof events. Weight refers to a single event,and asingle eventcan belongto many classes: Supposea manforty years old hastuberculosis;. . . Shall we consider . . . the frequency of death withinthe classof men forty yearsold, or within the classof tubercular people?... We take the narrowest classfor which we have reliable statistics... we should take the classof tubercularmen of forty ... thenarrower the classthe betterthe determinationof weight...a cautious physicianwill even placethe man inquestion within a narrower classby making an X-ray; he will then use as theweight of the case,the probability of death belongingto a condition of the kind observedon the film. (Reichenbach, 1938,pp. 316-317)

This is a frequentist viewof probability, and it is theview that is implicit in statistical inference. Reichenbach's thesis shouldan have appealfor experimental psychologists, although it is not widely known. It reflects, in formal terms, the way in which psychological knowledge is reported in the journals and textbooks, although whether or not thewriters and researchers recognize this may be debated.The weight of a given propositionis relativeto the stateof our knowledge,andstatements about particularindividualsandparticular behaviors are prone to error. It is not that we aretotally ignorant,but that manyof our classesare toobroadto allow for substantialweightto beplacedon theevidence. Popper (1959) takes issue with Reichenbach's attempts to extendthe relative frequency view of probability to include inductive probability. Popper, with Hume, maintains thata theory of induction is impossible: We shall haveto get accustomedto the idea thatwe must not look upon scienceas a "body of knowledge",but rather as asystem of hypotheses; thatis to say, as a systemof guessesor anticipations... of which we arenever justifiedin saying that we know that theyare"true" or "more or lesscertain"or even"probable". (Popper, 1959, p. 317)

Now Popper admits only that a systemis scientific when it can betestedby experience. Scientific statements aretestedby attemptsto refuteor falsify them. Theories that withstand severe tests are corroboratedby thetests,but they are not proven, nor arethey even made more probable. It is difficult to gainsay Popper's logicandHume's skepticism. They arefood for philosophical thought, but scientistswho perhaps occasionally worry about such things put willthem aside,if only becauseworking scientistsarepractical people.The principle of induction is the principle of science,and thefact that Popperand Hume can

STATISTICS IN PSYCHOLOGY

33

shout from the philosophical sidelines that the "official" rules of the gameare irrational andthat the "real" rulesof the gameare notfully appreciated,will not stop the game from being played. Statistical inference may bedefinedas the use of methodsbasedon therules of chance to draw conclusionsfrom quantitativedata. It may be directly compared with exercises where numbered tickets aredrawn from a bag oftickets with a view to making statements about the compositionof the bag, or wherea die or a coin is tossedwith a view to making statements about its fairness. Supposea bagcontains tickets numbered 1,2,3,4, and 5.Each numeral appears on the same,very large, numberof tickets. Now suppose that25 tickets are drawnat randomand with replacementfrom the bag and the sum of the numbers is calculated.The obtainedsum could be as low as 25 and as high as 125, but the expected value of the sumwill be 75,because each of the numerals should occur on one fifth of thedraws or thereabouts.The sumshouldbe 5(1 +2 + 3 + 4 + 5) = 75. Inpractice, a given drawwill have a sumthat departsfrom this value by an amount aboveor below it that can bedescribedas chance error. The likely sizeof this error is givenby astatistic calledthe standard error, which is readily computedfrom the formula a/Vw , whereCT is thestandarddeviation of the numbersin the bag. Leaving aside,for the moment,the problem of estimatingCTwhen,as isusual,the contentsof the bag areunknown,all classical statistical inferential procedures stem from this sort of exercise.The real variability in the bag isgiven by the standard deviation, and thechance variability in the sumsof the numbers drawnis given by the standard error. STATISTICS IN PSYCHOLOGY The use ofquantitative methodsin the study of mental processes begins with Gustav Fechner(1801-1887)who sethimself the problem of examiningthe relationship between stimulusandsensation.In 1860he publishedElementeder Psychophysik,in which he describeshis invention of a psychophysicallaw that describesthe relationship between mind and body. He developed methods of measuringsensationbasedon mathematicalandstatisticalconsiderations,methods that have theirechoesin present-day experimental psychology. Fechner madeuse of thenormal law in hisdevelopmentof themethodof constant stimuli, applying it in the Gaussian sense as a way ofdealingwith error anduncontrolled variation. Fechner'sbasic assumptions and theconclusionshe drew from his experimental investigations have been shown to befaulty. Stevens'work in the 1950s and thelater developments of signal detection theory have overtaken the work of the 19th century psychophysicists, but therevolutionary natureof Fechner's methods profoundly influenced experimental psychology. Boring (1950)

34

2. SCIENCE, PSYCHOLOGY, AND STATISTICS

devotesa whole chapterof his book to thework of Fechner. Investigationsof mental inheritanceand mental testing began at about the same time with Galton,who took the normal law of error from Quetelet and madeit the centerpieceof his research.The error distributionof physics became a descriptionof thedistributionof valuesaboutavalue thatwas"mostprobable." Galton laidthe foundationsof the methodof correlation thatwasrefinedby Karl Pearson, work that is examinedin more detail laterin this volume. At the turn of the century, Charles Spearman (1863-1945) usedthe methodto define mentalabilitiesasfactors. When two apparentlydifferent abilitiesare shownto be correlated, Spearman took this as evidencefor the existenceof a general factor G, a factor of generalintelligence, and factors that were specific to the different abilities. Correlational methods in psychology were dominant for almost the whole of the first half of this centuryand thetechniquesof factor analysiswere honed during this period. Chapter 11 provides a review of its development,but it is worth noting here that Dodd (1928) reviewed the considerable literature that hadaccumulated over the 23years since Spearman's original work, and Wolfle (1940) pushed this labor further. Wolfle quotes Louis Thurstoneon what he takesto be themost importantuse offactor analysis: Factoranalysisis useful especiallyin thosedomains wherebasicandfruitful concepts areessentiallylacking andwhere crucial experiments have beendifficult to conceive. ... They enableus to make onlythe crudest first map of a newdomain. But if we have scientific intuitionand sufficient ingenuity,the rough factorialmap of a new domain will enableus toproceedbeyondthe factorial stageto themore direct forms of psychologicalexperimentationin the laboratory. (Thurstone, 1940, pp. 189-190)

The interesting point about this statementis that it clearly sees factor analysis as amethodof data exploration rather than an experimental method.As Lovie (1983) points out, Spearman's approach was that of an experimenter using correlational techniques to confirm his hypothesis,but from 1940on, that view of the methodsof factor analysishas notprevailed. Of course,the beginningsof general descriptive techniques crept into the psychological literature over the same period. Means and probable errorsare commonly reported, and correlation coefficients are also accompaniedby estimatesof their probable error.And it was around 1940 that psychologists started to become awareof the work of R. A. Fisherand toadopt analysisof variance as thetool of experimental work.It can beargued thatthe progression of events thatled to Fisherian statistics also led to a division in empirical psychology,a split between correlationaland experimental psychology. Cronbach (1957) chose to discussthe "two disciplines"in his APApresidential address. He notes thatin the beginning:

STATISTICS IN PSYCHOLOGY

35

All experimentalprocedureswere tests,all testswere experiments... .the statistical comparison of treatments appearedonly around 1900. . . Inference replaced estimation: the mean and its probable error gave way to the critical ratio. The standardizedconditions and thestandardizedinstruments remained,but the focus shifted to thesinglemanipulatedvariable,andlater, following Fisher,to multivariate manipulation. (Cronbach, 1957, p. 674)

Although there have been signs that the two disciplinescan work together, the basic situationhas notchangedmuch over 30 years. Individual differences are error variance to the experimenter;it is the between-groupsor treatment variance thatis of interest. Differential psychologists look for variations and relationships among variables within treatment conditions. Indeed, variation in the situation here leads to error. It may befairly claimed that these fundamentaldifferencesin approach have had themost profoundeffect on psychology.And it may befurther claimed that the sophisticationand successof the methodsof analysis thatare usedby the two camps have helped to formalizethe divisions. Correlationand ANOVA have led to multiple regression analysis and MANOVA, and yet themethods are based on the same model- the general linear model. Unfortunately, statistical consumers are frequently unawareof the fundamentals, frightened away by the mathematics,or, bored and frustratedby the argumentson the rationale of the probability calculus, they avoid investigation of the general structureof the methods. When these problems have been overcome, the face of psychologymay change.

3

Measurement

IN RESPECT OF MEASUREMENT In the late 19th centurythe eminent scientistWilliam Thomson, LordKelvin (1824-1907),remarked: I often say that whenyou canmeasurewhat you arespeakingabout,and expressit in numbers,you know something about it; but whenyou cannot measure it, when you cannot expressit in numbers, your knowledge is of ameagreandunsatisfactory kind: it may be thebeginningof knowledge,but youhave scarcely,in your thoughts, advancedto the stageof science whatever the matter mightbe. (William Thomson, Lord Kelvin, 1891,p. 80)

This expressionof the paramount importance of measurementis part of our scientific tradition. Many versionsof the same sentiment,for example, thatof Gallon, notedin chapter1, andthat of S. S.Stevens(1906-1973),whose work is discussed laterin this chapter,are frequently noted with approval. Clearly, measurement bestows scientific respectability,a stateof affairs that does scant justice to the work of people like Harvey, Darwin, Pasteur, Freud, and James, who, it will be noted, if they are to belabeled"scientists,"are biological or behavioralscientists.The natureof the dataand thecomplexityof the systems studied by thesemen arequite different in quality from the relatively simple systems that were the domainof the physical scientist. This is not, of course,to deny the difficulty of the conceptualand experimental questions of modern physics,but thefact remains that,in this field, problemscan often be dealt with in controlled isolation. It is perhaps comfortingto observe that,in the early years, therewas a 36

IN RESPECT OF MEASUREMENT

37

skepticism about the introductionof mathematics intothe social sciences. Kendall (1968) quotesa writer in the Saturday Reviewof November11, 1871, who stated: If we saythat G representsthe confidence of Liberals in Mr Gladstoneand D the confidenceof Conservativesin Mr Disraeli and x, y thenumberof thoseparties;and infer that Mr Gladstone'stenure of office dependsupon someequation involving dG/dx, dD/dy, we have merely wrappedup a plain statementin a mysterious collection of letters. (Kendall, 1968,p. 271)

And for GeorgeUdny Yule, the most level-headed of the early statisticians: Measurementdoesnot necessarily mean progress.Failing the possibility of measuring that whichyou desire,the lust for measurement may, for example, merely result in your measuringsomethingelse- andperhapsforgetting thedifference-or in your ignoring somethings becausethey cannotbe measured.(Yule, 1921,pp. 106-107)

To equate science with measurement is a mistake. Scienceis about systematic and controlled observationsand the attempt to verify or falsify those observations. And if the prescriptionof science demanded that observations must be quantifiable, thenthe natural as well as thesocial sciences would be severely retarded.The doubts aboutthe absoluteutility of quantitative description expressedso long ago could well be ponderedon by today's practitioners of experimental psychology. Nevertheless, the fact of the matteris that the early years of the young disciplineof psychology show, with some notable exceptions, a longing for quantification and, thereby, acceptance.In 1885 Joseph Jacobs reviewed Ebbinghaus's famous work on memory, Ueber das Gedachtnis. He notes"If sciencebe measurement it mustbe confessed that psychology is in a badway" (Jacobs, 1885, p. 454). Jacobs praises Ebbinghaus's painstaking investigations and hiscareful reporting of his measurements: May we hope to see the daywhen schoolregisterswill record that suchand such a lad possesses 36 British Association unitsof memory power or when we shall be able to calculate how long a mind of 17 "macaulays"will take to learn Book ii of Paradise Lost"? If this be visionary, we may atleast hopefor much of interest and practical utility in the comparison of the varying powersof different minds which can now atlast be laid down to scale.(Jacobs,1885, p. 459)

The enthusiasmof the mental measurers of the first half of the 20th century reflects the same dream,and even today,the smile that Jacobs' words might bring to the facesof hardened test constructors and users contains a little of the old yearning.The urgeto quantify our observationsand toimpose sophisticated statistical manipulations on them is a very powerful one in thesocial sciences.

38

3.

MEASUREMENT

It is of critical importance to remember that sloppy and shoddy measurement cannot be forgiven or forgotten by presenting dazzling tables of figures, clean and finely-drawngraphs,or by statistical legerdemain. Yule (1921), reviewing BrownandThomson's bookon mental measurement, commentson the problem in remarks thatare notuntypical of the misgivings expressed,on occasion,by statisticiansandmathematicians when they seetheir methodsin action: Measurement.O dear! Isn'tit almost an insult to the word to term someof these numerical data measurements? They are of thenatureof estimates, mostof them, and outrageouslybad estimatesoften at that. And it should alwaysbe the aim of theexperimenternot to revel in statistical methods (whenhe does reveland notswear)but steadily to diminish, by continual improvement of his experimental methods,the necessityfor their use and the influence they haveon his conclusions.(Yule, 1921,pp. 105-106)

The general tenorof this criticism is still valid, but the combination of experimentaldesignandstatistical method introduced a little later by Sir Ronald Fisherprovidedthe hope, if not the complete reality,of statisticalasopposedto strict experimental control. The modern statisticalapproach more readily recognizes the intrinsic variability in living matter and its associated systems. Furthermore, Yule's remarks were made before it becameclearthat all actsof observation contain irreducible uncertainty (as notedin chap.2). Nearly 40 years after Yule's review, Kendall (1959) gently reminded us of the importanceof precisionin observationandthat statistical procedures cannot replaceit. In an acuteand amusing parody,he tells the story of Hiawatha. It is a tragic tale. Hiawatha,a "mighty hunter,"was anabysmal marksman, although he didhavethe advantageof having majoredin applied statistics. Partly relying on his comrades'ignoranceof thesubject,heattemptedto show thathis patently awful performancein a shooting contestwas notsignificantly different from that of his fellows. Still, theytook away his bow andarrows: In a corner of the forest Dwells alonemy Hiawatha Permanentlycogitating On the normal law of error. Wondering in idle moments Whether an increasedprecision Might perhapsbe rather better Even at therisk of bias If therebyone, now andthen, could Registeruponthe target. (Kendall, 1959,p. 24)

SOME FUNDAMENTALS

39

SOME FUNDAMENTALS Measurementis the applicationof mathematicsto events. We usenumbersto designateobjectsandeventsand therelationships that obtain between them. On occasion the objects are quite realand therelationships immediately comprehensible; dining-room tables,for example, and their dimensions, weights, surfaceareas,and so on. Atother times,we may bedealing with intangibles such as intelligence,or leadership,or self-esteem.In thesecasesour measurements are descriptionsof behavior that,we assume, reflectsthe underlying construct. But the critical concernis the hope that measurement will provide us with preciseandeconomical descriptions of eventsin a manner thatis readily communicatedto others. Whateverone'sview of mathematics with regard to its complexities and difficulty, it is generallyregardedas adiscipline thatis clear, orderly, and rational. The scientist attemptsto add clarity, order, and rationality to the world aboutus by using measurement. Measurementhasbeena fundamental feature of human civilizationfrom its very beginnings. Division of labor, trade,andbarterareaspectsof our condition that separateus from the huntersand gathererswho were our forebears. Trade and commerce mean that accounting practices are institutedand the"worth" of a job or anartifact has to belabeledanddescribed. When groups of individuals agreedthat a sheep couldfetch three decent-sized spears and a couple of cooking-pots,the species made a quantum leap intoa world of measurement. Counting, makinga tally, representsthe simplestform of measurement. Simple though it is, it requires thatwe have devisedan orderly and determinate number system. Developmentof early societies, like the developmentof children, must have included the mastery of signsand symbolsfor differences and sameness and, particularly, for onenessand twoness. Most primitive languages at least have words for "one,""two," and"many,"andmodern languages, including English, have extra wordsfor one and two(single, sole, lone, couple, pair, and soon). Trade,commerce,and taxation encouraged the developmentof more complex number systems that required more symbols. The simple tallyrecordedby a mark on a slatemay bemademore comprehensibleby altering the mark at convenient groupings,for example, at every five units. This system, still followed by some primitive tribes,aswell as bypsychologists when constructing, by hand, frequency tables from large amountsof data, corresponds with a readily availableandportable countingaid - the fingers of one hand. It is likely that the familiar decimal system developed because the human handstogether have10 digits. However, vigesimal systems, based on 20, are known, and languageonceagain recognizes the utility of 20 with the word score in Englishand quatre-vingt for 80 in French. Contraryto generalbelief, the decimal systemis not theeasiestto use arithmetically, and it isunlikely that

40

3.

MEASUREMENT

decimal schemeswill replaceall other counting systems. Eggs and cakeswill continueto besold in dozens rather than tens, and thehours aboutthe clock will still be 12. This is because12 hasmore integral fractional parts than 10; that is, you candivide 12 by more numbersand getwhole numbersand notfractions (which everyonefinds hard)thanyou can 10.Viewed in this way,the now-abandoned British monetary system of 12penniesto the shilling and 20shillingsto the pound doesnot seemso odd orirrational. Systems of number notationand counting havea base or radix. Base5 (quinary), 10 (decimal), 12 (duodecimal),and 20(vigesimal) have been mentioned, but any baseis theoretically possible.For many scientific purposes, binary (base2) is used because this system lies at theheart of the operationsof the electronic computer.Its two symbols,0 and 1, canreadilybe reproducedin the off and on modesof electrical circuitry. Octal (base 8) andhexadecimal (base 16) will also be familiar to computer users.The base of a number system correspondsto the number of symbols thatit needs to express a number, provided thatthe systemis aplace system. A decimal number,say 304, means 3 X 100 plus 0 X 10 plus 4 X 1. The symbol for zero signifies an empty place. The invention of zero, the earliest undoubted occurrence of which is in India over 1,000 yearsago but which was independently usedby the Mayas of Yucatan, marksan important step forwardin mathematical notation and arithmetical operation.The ancient Babylonians,who developeda highly advanced mathematics some4,000years ago, had asystem witha baseof 60 (abasewith many integral fractions) that did not have a zero. Their scriptsdid not distinguish between, say, 125 and 7,205, and which one ismeantoften has to beinferred from the context. The absenceof a zero in Roman numerals may explainwhy Romeis not remembered for its mathematicians,and therelative sophisticationof Greek mathematics leads some historians to believe that zeromay have been invented in the Greek world and thence transmitted to India. Scalesof Measurement Using numbersto count events,to order events,and toexpresstherelationship between events,is the essenceof measurement. These activities have to be carried out accordingto some prescribed rule. S. S.Stevens (1951) in his classic piece on mathematicsandmeasurementdefinesthe latter as"the assignmentof numeralsto objects or events accordingto rules" (p. 1). This definitionhasbeen criticizedon thereasonable grounds that it apparently doesnot exclude rulesthat do nothelp us to beinformative,nor rulesthatensure that the samenumeralsare always assignedto the same events under the same conditions. Ellis (1968)haspointedout that some such rule as,"Assign the first numberthat comes into your head to eachof the objectson thetable in turn,"

SOME FUNDAMENTALS

41

mustbe excluded fromthe definitionof measurement if it is to be determinative and informative. Moreover, Ellis notes that a rule of measurement must allow for different numerals,or rangesof numerals,to beassignedto different things, or to the same things under different conditions. Rules such as "Assign the number3 to everything" are degenerate rules. Measurement must be madeon a scale,and weonly havea scale whenwe havea nondegenerate, informative, determinative rule. For themoment,the historical narrativewill be setasidein order to delineate and comment on the matter. Stevens distinguished four kinds of scales of measurement,and hebelieves thatall practical common scales fall into one or other of his categories.Thesecategoriesare worth examiningandtheir utility in the scientific enterprise considered. The Nominal Scale The nominal scale,as such, doesnot measure quantities.It measures identity and difference. It is often said thatthe first stagein a systematic empirical scienceis the stageof classification. Likeis grouped with like. Events having characteristicsin commonareexamined together.The ancient Greeks classified the constitutionof nature into earth, air, fire, and water. Animal, vegetable, or mineral areconvenient groupings. Mendeleev's (1834-1907)periodictable of the elementsin chemistry,and plantand animal species classification in biology (the Systema Naturae of the great botanist Carlvon Linne, knownasLinnaeus [1707-1778]), and the many typologies that existin psychology are further examples. Numbers can,of course,be usedto label eventsor categoriesof events. Street numbers,or house numbers, or numberson football shirts"belong"to particular events,but there is, for example,no quantitative significance between player number10 andplayer number4, on ahockey team,in arithmetical terms. Player number 10 is not 2.5times player number 4. Such arithmetical rules cannot be applied to the classificatory exercise. However, it is frequentlythe casethat a tally, a count, will follow the constructionof a taxonomy. Clearly, classificationsform a large partof the data of psychology. People may be labeled Conservative, Liberal, Democrat, Republican, Socialist, and so on, on thevariableof "political affiliation," or urban, suburban, rural, on the variable of "location of residence,"and wecould thinkof dozens,if not scores, of others. The Ordinal Scale The essential relationship that characterizes the ordinal scale is greater than (symbolized>) or less than (symbolized <). These scales of measurement have

42

3.

MEASUREMENT

proved to be very useful in dealing with psychological variables. It might, for example,be comparatively easyto state that, according to some specifiedset of criteria, Bill is more neurotic than Zoe, who ismore neurotic than John, or that Mary is more musical than Sue, who ismore musical than Jane, and so on,even though we are not inpossessionof a precise measuring instrument that will determineby how muchthe individualsdiffer on ourcriteria. It follows thatthe numbersthat we assignon theordinal scale represent only an order. In ordering a group of 10 individuals accordingto judgmentsin terms of a particular set ofcriteria for leadership,we maydesignatethe onewith the most leadership ability"1" and gothrough to the onewith the least and rank this person"10," or we maystartwith the onewith leastability andrank him or her "1," progressingto the onewith most whomwe rank "10." Either methodof ordering is permissibleand both give preciselythe samesort of information, provided thatthe onedoing the ordering adheres to the system being used and communicatesthe rule to others. The Interval and Ratio Scales When the gapsbetweenequal points,in the numericalsense,on themeasurement scalearetruly quantitativelyequal, thenthe scaleis calledan interval scale. The differencebetween130 cm and 137 cm exactly is the sameas thedifference between 137 cm and 144 cm.When this sortof scale startsat atrue zero point, as in thedistanceor length scale just mentioned, Stevens designates them ratio scales.For example,an individual who is 180 cmtall is twice the heightof the child of 90 cm,who, in turn, is twice the lengthof the baby of 45 cm. Butwhen the scalehas anarbitrary zero point- for example,the FahrenheitandCelsius temperature scales - theseratio operationsare notlegitimate. It is neither meaningful nor correct to saythat a temperatureof 30 isthree timesas hot as a temperatureof 10. This ratio doesnot represent three times any temperaturerelated characteristic. Perhaps a more readily appreciated example is that of calendar time. Althoughit is true to saythat 60 years is twice 30 years,it is meaninglessto saythat 2000 A.D.will be twice 1000 A.D.in any description of "age." In 1000 A.D. the famous Parthenon in Athenswas nottwice as old as it was in 500A.D. Why not? Becauseit was built in about 440 B.C. It follows from these examples that arithmetical operations on a ratio scaleare conductedon the scale values themselves, but on theinterval scale withits arbitrary zero such operations are conductedon theinterval values. The higher-order intervaland ratio scales haveall the properties of the lower-order nominaland ordinal scalesand may be so applied. Psychologists, however, strive where possible for interval measurement,for it has the appearanceat least of greater precision. This desire sometimes leads to conceptualdifficulties and statistical misunderstandings. It is easyto see that

SOME FUNDAMENTALS

43

the 3 cm bywhich a line A of 12 cmdiffers from a line B of 9 cm is thesame as the 3 cm bywhich B exceedsthe 6-cm-long lineC. We areagreedon how we will define a centimeter, and measurementof length in these unitsis comparatively straightforward. But what of this statement? "Alberthas anIntelligence Quotient (IQ)of 120, Billy one of 110, and Colin of 100." The meaning of the 10-point differencesbetweentheseindividuals,even whenthe measuring instrument (the test)usedis specified,is not soeasyto see,nor is itpossibleto saywith complete confidencethat the gap inintelligence between Albert and Billy is preciselythe same amountas the gapbetweenBilly and Colin. Suppose that Question 1 on our test, say, of numerical ability, asksthe respondentto add 123 to456; Question 2, to divide 432 by 144; and Question3 to add thesquareof 123 to the square rootof 144. Few would argue that these demands exhibit the same degreeof difficulty, and, by the same token,few would agreeon theweighting that mightbe assignedto themarksfor each question. These weightings would, to some extent, reflectthe differing conceptsof numerical ability. How much more numerically ableis the individual who hasgraspedthe conceptof square root than the one whounderstands nothing beyond addition and subtraction? Although these examples could, quite justifiably, be describedas exaggerated, they aregiven to show how difficult, indeed, impossible, it is to constructa true interval scalefor a test of intelligence. Leaving asidefor the momentthe fact thatno perfectly reliable instrument existsfor the measurementof IQ, we also know that there is no clear agreement on the definition of IQ. In other words, statements about the magnitudeof the differencesin length betweenlines A, B, and C, or in IQbetween Albert,Billy, and Colin, dependon theexistenceof a standardunit of measurementand on a constantunit of measurement,one that is the same overall pointson the scale. In the case of the measurementof length these conditions exist, but in the measurementof IQ they do not. Do our arithmetical manipulations of psychometric testscales,as though they were true interval scales, suggest that psychologists arepreparedto ignore the imperfectionsin their scalesfor the sake of computational convenience? Certainly the situation just described emphasizes the responsibility of the measurerto report on how thescaleshe or sheemploysare used and to keep their limitationsin mind. Many, perhaps most, discussions of scalesof measurement reveal a serious misconception that also arises from Stevens' examination. It is that the level of measurement, thatis, the specific measurement scale used in a particular investigation, governsthe applicationof various statistical procedures when the data cometo beanalyzed. Briefly, the notion is abroad thatan interval scaleof measurementis required for the adhibition of parametric test:the t-ratio, F-ratio, Pearson's correlation coefficient, and so on, all ofwhich are discussed

44

3.

MEASUREMENT

later in this work. Gaito (1980)is one of anumberof writers who have triedto exposethe fallacy, noting thatit is basedon aconfusion between measurement theory and statistical theory. Gaito reiterates some of the criticisms madeby Lord (1953), who,in a witty andbrilliant discussion, tells the story of a professor who computedthe meansand standard deviations of test scores behind locked doors because he hadtaughthis students that such ordinal scores should not be added. Matters came to ahead whenthe professor,in a story that shouldbe read in the original, and notparaphrased, discovers that even the numberson football jerseysbehaveas if they were"real," that is, interval scale, numbers. In fact, "the numbersdon't remember where they came from" (p. 751). It is important to realize that the preceding remarksdo not detract from Stevens' sterling contribution to the developmentof coherent concepts regarding scalesin measurement theory. The difficulties arisewhentheseareregarded asprescriptionsfor the choiceand use of a wide varietyof statistical techniques. ERROR IN MEASUREMENT All acts of measurement involve practical difficulties that are lumped together as thesourceof measurement error. From the statisticalstandpoint, measurement error increasesthe variability in data sets, decreasing the precisionof our summariesand inferences. It follows that scientists strivefor accuracyby continually refining measuring techniques and instruments.The oddfact is that this strategyproceedseven thoughwe know that absolutely precise measurement is impossible. It is clear that a meter rule mustbe madefrom rigid material and that measuringtapesmustnot beelastic. Without thisthe instruments lack reliability and self-consistency and, put simply, separateand independent measurements of the same eventare unlikely to agree with each other. Only very rarely do paper-and-penciltests,or two forms of the sametest(say, a test of personality factors), give exactlythe same results when given to an individual or group on two occasions.In a sense these tests are like elastic measuring tapes,and the discrepancy betweenthe outcomes is an index of their reliability. Perfect reliability is indexed1. Poor reliability wouldbe indexed0.3 or less,and good reliability 0.8 or more. In acceptingthis latter figure as good reliability, the psychologistis acceptingthe inevitable errorin measurement, being prepared to do sobecausehe or shefeels that a quantitative description better serves to grasptheconcepts under investigation, to communicate ideas about the concepts more efficiently, and todescribethe relationships between these concepts and others more clearly. Reliability is notjust a function of the quality of the instrument being used. In 1796, Maskelyne, Astronomer Royal and Directorof the Royal Observatory at Greenwich, near London, England, dismissed Kinnebrook, an assistant,

ERROR IN MEASUREMENT

45

because Kinnebrook's timing of stellar transitsdiffered from his, sometimesby as muchas 1sec.1 The accuracyof such measurements is, of course, crucialto the calculationof the distanceof the star from Earth. Some20 years later this incident suggestedto Bessel, the astronomerat Konigsberg, that such errors were the result of individual differencesin observersand led towork on the personal equation.The astronomers believed that thesediscrepancies were due to physiological variabilityand early experimentson reaction time reflect this view. But as scientific psychologyhas grown, the reaction-time variablehas been usedto study choiceand discrimination, predisposition, attitudes, and the dynamicsof motivation. These events mark the beginningsof the psychology of individual differences.For theimmediate argument, however, they illustrate the importanceof both inter-observerand intra-observer reliability. The circumstancesof an exercisein measurement must remain constant for eachseparateobservation.If we were, say, measuring the heightsof individuals in a group, we would ensure that each of them stoodup straight with heels together, chinin, chest out, headerect,no slouching,no tiptoeing, and so on. Quite simply, heighthas to bedefined,and ourmeasurements aremadein accord with that definition. When a concept is definedin terms of the operations, manipulations, and measurements that aremadein referring to it, wehavean operational definition. Intelligence might be defined as the score obtained on a particular test; a character trait like generosity might bedefinedas theproportionof one'sincome that one gives away. These sorts of definition serveto increasethe precisionof communication.They most certainly do notimply immediate agreement among scientists aboutthe nature of the phenomena thus defined. Very few people would maintain thatthe precedingdefinitions of intelligenceandgenerosityare entirely satisfactory. Operationalism is closely alliedto the philosophyof the Vienna Circle, a group, formedin the late 1920s, that aimed to clarify the logic of thescientific enterprise.The membersof the Vienna Circle, mainly scientists and mathematicians, cameto beknown aslogical positivists. They hopedto rid thelanguage of scienceof all ambiguityand toconfinethe businessof scienceto the testing of hypotheses aboutthe observable world.In this philosophy, meaningis equated with verifiability,and constructs thatare notaccessibleto observation are meaningless. Psychologists will recognize this approach as closely akinto the doctrineof behaviorism.The view that theoretical constructs can begrasped via the measurement operations that describe them is an interesting one,for it challengesthe contention that there can be nomeasurement without theory and that operations cannot be describedin nontheoretical terms. 1

Kinnebrookwas employedat theobservatoryfrom May 1794to February 1796.For some more detailsof the incidentand what happened to him, seeRowe (1983).

46

3.

MEASUREMENT

The logical positivist viewis obviously attractiveto the working scientist becauseit seemsto beeminently"hard-nosed."The doctrineof operationalism demandsan analysisof the natureof measurementand itscontributionto the descriptionof facts. To statethat scienceis concernedwith the verification of facts is to imply that therewill be agreement among observers. It follows that new methodsand newobservational tools,as well as observer disagreement, will changethe facts. The fact or reality of a tomato is quite differentfor the artist, the gourmet cook,and the botanist,and the botanical view changed dramatically whenit became possible to view the cellular structureof the fruit through a high-power microscope. In the broad view, this argument includes not only the idea of levels of agreementandverification, but alsothe validity of toolsandmeasurements.Do they, in fact, measure what they were designed to measure?And what is meant by "designedto measure" must have some basis in a conceptual disposition if not a full-blown theory. We must not only consider levelsof measurement,but also the limits of measurement.In chapter 2 it wasnoted that observinga system disturbsthe system, affects the act of measurement,and thus provokesan irreducible uncertaintyin all scientific descriptions. This proposition is embodiedin the Uncertainty Principle of Heisenberg:the neareronetries to obtain an accurate measurementof eitherthe momentumor the positionof a subatomic particle, the less certain becomes the measurement of the other. Ultimatelytheprinciple appliesto all actsof measurement. The world of psychological measurement is beset with system-disturbing features. Experimenter-participant interactions and thearousingand motivating propertiesof the setting, be it the laboratoryor the natural environment, contributeto variancein the data. These factors have been variously described as experimentereffect, demand characteristics, and, more generally, as the social psychologyof thepsychological experiment. They areexaminedat length by Rosenthaland Rosnow (1969)and Miller (1972). Despitethe difficulties, logical and practical,of the procedures, scientists will continueto measure.The behavioral scientist is trying to pin down, withas muchrigor aspossible,the variability of thepropertiesof living matter that arise as aresultof the interactionof environmentalandgenetic factors.As mentioned in chapter1, Quetelet(1835/1849)described these in termsof what we now call the normal distribution, usingthe analogyof Nature's"errors." Although Queteletand Galtonand othersmay becriticized for the promotion of thedistributionas a"law of nature,"their work recognizedthe irreducible variancein living matter. This bringsout theessenceof the statistical approach. There can be noabsolute accuracyin measurementbut only a. judgment of accuracy in terms of the inherent variationwithin and between individuals. Statisticsare thetools for assessingthe propertiesof these random fluctuations.

4 The Organization of Data

THE EARLY INVENTORIES The countingof peopleandlivestock, hearthsandhomes,andgoodsand chattels is an exercise thathas along past.The ancient civilizationsof Babylon and Egypt conducted censuses for thepurposesof taxationand theraisingof armies, perhaps 3,000 years before the birth of Christ,andindeedthebirthplaceof Christ himself was determined partlyby the fact that Maryand Joseph traveledto Bethlehemin order to beregisteredfor a Roman tax. Censuses were carried out fairly regularlyby the Romans,but after the fall of Rome many centuries passed before they became part of the routineof government. It can beargued that these exercisesare notreally importantin statistical historybecausethey were not used for the purposesof making comparisonsor for drawing inferences in the modern sense (that is, by usingthe probability calculus),but they are early examplesof the descriptionsof States. When inferences were drawn they were informal.The utility of such descriptions was recognizedby Aristotle, who prepared accounts, which were largely non-numerical,of 158states, listing details of their methodsof administration, judicial systems, customs, and so on.Although almostall of these descriptions are lost, they clearly were intended to be used for comparative purposes and compiledaspart of Aristotle's theoryof the State. Such systematic descriptions became partof our intellectual traditionin Europe, particularlyin the German States, duringthe 17th and 18th centuries. Staatenkunde, the comparative descriptionof states, madefor the purposesof throwing light on their organization, their power, and their weaknesses, became an important discipline promoted especiallyby Gottfried Achenwall (1719-1772),who wasProfessorat the Universityof Gottingen. Hans Anchersen (1700-1765),a Danish historian, introducedtables intohis work, and although these early tables were largely

47

48

4. THE ORGANIZATION OF DATA

non-numerical, theyform a bridge betweenthe early comparative descriptions and Political Arithmetic, wherewe see thefirst inferences basedon vital statistics (see Westergaard, 1932, for an accountof these developments). Someearly statisticalaccounts were, until quite recently,thought to have been made solely for the purposesof taxation. Notable among these is Domesday Book, orderedin 1085by William the Conqueror,"to placethegovernment of the conqueredon awritten basis" (Finn, 1973,p. 1).This massive survey, which was completed in less than2 years, is not just a tax roll, but an administrative record made to assist in government. Moreover, Galbraith (1961) notes thatthe Domesday Inquest made no effort to preservethe overwhelming massof original returns: "Instead,the practical geniusof theNorman king preservedin a 'fair copy' what waslittle more thanan abstractof the total returns" (Galbraith, 1961, p. 2), sothat Domesday Book is perhapsone of the earliest examplesof a summaryof a large amountof data. POLITICAL ARITHMETIC The student in statistics who developsan interest in their history will be astonishedif he or shevisitsa good libraryandsearchesout titles on thehistory of statisticsor old bookswith the word statisticsor statistical in the title. Very soon,for example,TheHistory of Statistics,publishedin 1918to commemorate the 75th anniversaryof TheAmerican Statistical Association, and edited by Koren, will be discovered. But thereare nomeansor variancesor correlation coefficients to befound here. Thisvolume contains memoirs from contributors from 15 countries detailingthe developmentof the collection and collation of economic and vital statistics,for all kinds of motivesand using all kinds of methods, withinthe various jurisdictions. Such works are notunusual,andthey relate to a variety of endeavors concerned with the use ofnumerical data both for comparativeand inferential purposes. They reflect the layman's viewof statistics in the present day. Political arithmetic may be dated from the publication by John Graunt (1620-1674),in 1662, of Natural and Political Observations Mentioned in a Following Index and Made Upon the Bills of Mortality, althoughthe name political arithmetic was apparentlythe inventionof Graunt's friend William Petty (1623-1687).It hasbeen reported that Petty was infact the authorof the Observations,but this is not so. Petty's Political Arithmetickwas written in about 1672but was notpublisheduntil 1690. Petty's workwas of interestto governmentand surely had ruffled some feathers.Petty'sson, dedicatinghis father'sbook to theking, writes: He was allowed by all, to be theInventorof this Method of Instruction; wherethe perplexed and intricate waysof the World are explain'd by a very mean peiceof

POLITICAL ARITHMETIC

49

Science;and had not the Doctrins of this EssayoffendedFrance,they had long since seen the light, and hadfound Followers, as well as improvements before this time, to the advantageperhapsof Mankind. (Lord Shelborne, Dedication, in Petty, 1690)

That the work had theimprimatur of authority is given in its frontispiece: Let this Book called PoliticalArithmetick,which waslong sinceWrit by Sir William Petty deceased,be Printed. Given at the Court at Whitehall the 7th dayof Novemb. 1690. Nottingham.

It wasKarl Pearson's (1978) view, however, that Petty owed more to Graunt than Grauntto Petty. Grauntwas awell-to-do haberdasher who wasinfluential enoughto be able to securea professorshipat Gresham College, London, for Petty, a man who had a variety of posts and careers. Graunt was acultured, self-educatedman and wasfriends with scientists, artists, andbusinessmen.The publicationof the Observationsled to CharlesII himself supportinghis election to the Royal Societyin 1662. Graunt's workwas the firstattemptto interpret social behaviorand biological trendsfrom countsof the rather crude figures of births anddeaths reportedin Londonfrom 1604to 1661- theBills of Mortality. A continuing themein Graunt's workis the regularityof statistical summaries. He apparently acceptedimplicitly the notion that whena statistical ratiowas not maintained then some new influence mustbe present, that some additional factor is thereto bediscovered.For example,he tried to show thatthenumber 1 of deaths from the plague had been under-reported and by examiningthe number of christeningsreasonedthat the decreasein population of London becauseof the plague wouldbe recoveredin 2 years.He attemptedto estimate the populationdistribution by sex and age and constructeda mortality table. Essentially, Graunt's work, andthat of Petty, emphasizes the use ofmathematics to impose orderon data and,by extension,on society itself. Petty was afounding memberof the Royal Society, whichwas grantedits charter in 1662. Althoughhe had been a supporter of Cromwell and the Commonwealth,he wasknightedin 1661 by CharlesII. He was not somuch concerned withthe vital statistics studiedby Graunt. Rather,in his early work he suggested ways in which the expensesof the state couldbe curtailedand the 1

The weekly accounting of burialswasbegunto allay public fearsaboutthe plague. However, in epidemic years when anxiety was at itshighest,it appears that casesof theplague were concealed becausemembersof families with the illness were kept together, both the sick and thewell.

50

4. THE ORGANIZATION OF DATA

revenue from taxes increased,and a continuing themein his work is the estimationof wealth. His Political Arithmetick (Petty, 1690)comparesEngland, France,and Holland in terms of territory, trade, shipping, housing, and so on. Hisinterest was in practical political mattersand money, topics that reflected his professional careeraswell as hisphilosophy. In the context of the times, Grauntand Petty's workmay be seen as a demonstrationof their belief thatquantification provided knowledge that was free from controversy and conflict. Buck (1977) discusses theseefforts in the political climateof the times. The 17th century was,to say theleast,a difficult time for the people of England. The Civil War, the executionof the King, the turmoil of theCommonwealth, religious conflict, the Restoration,the weakness of the monarchsof the Stuart line,the Glorious Revolution that swept James II from thethrone- all contributedto civil strife andpolitical unease.And yet the post-Restoration years saw thefounding of theRoyal Society,the establishment of the Royal Observatoryat Greenwich,and thesetting of the stage for the triumphs of Newtonian sciencein the 18th century. It was also the era of Restoration literature. Both John Evelyn (the diaristwho gavethe Society its nameand itsmotto) and thepoet Dryden were members of the Royal Society. Buck (1977) and others have argued that the philosophyunderlyingthe contributions of Petty and Grauntdiffers from that of the 18th centuryin that it does not acceptthe existenceof natural orderin society. Indeedit could not, because the political mechanisms necessary for the maintenanceof a relatively stable social systemhad notbeen established. The practicaleffectsof the work thatwasbegunby Petty has itscounterpart in the present-day agencies of the state that collect data for all mannerof purposes. In this accountof the developmentof statistics, however, this path will be left for now, in orderto returnto theenterprise that sprang from Graunt's mortality tables. Actuarial science hasmoreto dowith our perspective because it involves the early studyof probability and risk and the use ofthese mathematics for inference. VITAL

STATISTICS

Actuarial sciencehasbeendefined as theapplicationof probability to insurance, particularly to life insurance.But to apply the rulesof probability theory there haveto bedata,andthese dataareprovidedby mortality tables.Life expectancy tablesgo back as far as the 4th century, when they were used under Roman law to estimatethe value of annuities,but the major impetusto the systematic examinationof the mathematicsof tontinesand annuitiesis generally placedin 17th century Holland. JohnDe Witt (1625-1672),the Grand Pensionary of Holland and West Friesland, applied probability principles to the calculationof

VITAL STATISTICS

51

annuities.2 Dawson (1901/1914)statesthat De Witt must be consideredto be the founderof actuarial science, although his work was notrediscovered until 1852. More prominentis the work of De Witt's contemporaries, Christiaan Huygens(1629-1695)andJohn Hudde(1628-1704).Graunt's work influenced that of Huygens and hisyounger brother,who debatedthe relative meritsof estimatingthe mean durationof life and theprobability of survival or death at a given age. Theseattempts to bring rationality and order intothe general areaof life insurancewasaided considerably by thework of Edmund Halley(1656-1742), the English astronomer andmathematicianandpatronof Newton. Hepublished An Estimate of the Mortality of Mankind, Drawn From Curious Tables of the Births and Funerals at the City of Breslaw; With an Attempt to Ascertain the Price of Annuitieson Lives,in Philosophical Transactionsin 1693. He was a brilliant man, and hisclear expositionof the problems is remarkableon two counts. The first is that it was producedin orderto fulfill a promiseto the Royal Society to contribute materialfor the Transactions, whichhad resumed publication after a break of several years, and was atopic that was out of the mainstreamof his work. The secondis that the early life offices in Englanddid not use thematerial, the common opinion being that insurance was largely a game of chance. The general realization that an understandingof probabilities helpedin the actual determinationof gains and lossesin both gamesof chance and life insurancewas not tocome until almostthe middleof the 18th century. The necessityof reliablesystemsfor the determinationof premiumsfor all kinds of insurance became more and more importantin the rise of 18th century trade and commerce. This accountof the beginningsof actuarial scienceis by nomeans complete, but the brief description highlightsthe utility of data organizationand probability for the making of inferencesand thebusinessof practical prediction.As Karl Pearson(1978)noted in his lectures, there were two main linesof descent from Graunt;the probability-mathematicians andactuaries,and the18th century political arithmeticians.Of the probability-mathematicians perhaps the most entertainingandenterprisingwas aphysician, John Arbuthnot (1667-1735).He was able,he wasintellectual,he was awit, and he can be credited withthe first use of anabstract mathematical proposition, the binomial theorem,to test the probability of an observed distribution, namely, the proportion of male and femalebirths. In fact, he appearsto have beenthe first personto assessobserved data,chance,and alternative hypotheses using a statistical test. Arbuthnot had, in 1692, publisheda translationof Huygens workon probability with additions of his own. In this book he observes that probability may be applied to 2

Tontines were once a popular formof annuity. Subscribers paid into ajoint fund, and theincome they received increasedfor the survivorsas themembers died off.

52

4. THE ORGANIZATION OF DATA

Graunt's suggestionin the Observations thatthe greater numberof male than female births was a matter of divine providenceand not chance. In An Argument for Divine Providence,Taken From the Constant Regularity Observ'd in theBirths of Both Sexes, published in 1710, Arbuthnott (sometimes, as he didhere,he spelledhis namewith two t's) usesthe binomial expansionto argue thatthe observed distributionof maleand female christenings (which he equates with births) departs so much from equalityas to beimpossible, "from whenceit follows that it is Art, not Chance, that governs" (p. 189). An excess of malesis preventedby: the wise Oeconomyof Nature;and tojudge of the wisdom of the Contrivance,we must observethat the external Accidentsto which Malesaresubject (who mustseek their Food with danger)do makea greathavockof them, andthat this lossexceeds far that of the other Sex, occasioned by the Diseasesincident to it, as Experience convincesus. To repair that Loss, provident Nature,by the Disposalof its wise Creator,brings forth more Males than Females; and that in almost constantproportion. (Arbuthnott, 1710, p. 188)

The resulting near-equal proportions of the sexes ensures that every male may havea femaleof suitable age,andArbuthnotconcludes,as didGraunt, that: Polygamy is contrary to the Law ofNatureandJustice,and to thePropagationof the Human Race;for where Malesand Femalesare inequal number,if one Man takes Twenty Wives,NineteenMen must livein Celibacy, whichis repugnantto the Design of Nature; nor is itprobable that Twenty Women will be sowell impregnatedby one Man as byTwenty. (Arbuthnott, 1710, p. 189)

which are asingenious statements of what we nowcall alternative hypotheses as onecould find anywhere. The most outstandingof the probability-mathematicians of the era was Abraham De Moivre (1667-1754).He was born at Vitry in Champagne,a Protestantand the son of poor a surgeon. From an early age, although an able scholar of the humanities,he wasinterestedin mathematics. Afterthe repeal of the Edict of Nantesin 1685, he wasinternedand wasfaced withthe choice of renouncinghis religion or going into exile. He chosethe latter and in 1688 went to England. By chance he met Isaac Newtonand read the famous Principia, whichhadbeen published in 1687.He masteredthe newinfinitesimal calculus and passed intothe circles of the great mathematicians, Bernoulli, Halley, Leibnitz, andNewton. Despitethe efforts of his friends, he wasunable to obtain a secure academic position and supported himselfas a peripatetic teacher of mathematics, "and later in life sitting daily in Slaughter'sCoffee House in Long Acre, at the beck and call of gamblers,who paid him a small sum for calculating odds,and of underwritersand annuity brokerswho wished their values reckoned" (K.Pearson, 1978, p. 143).

GRAPHICAL METHODS

53

De Moivre first publishedhis treatiseon annuitiesin 1725. Essentiallyand without going intothe details,and someof the defects,of his procedures,De Moivre used the summationof series to compute compound interests and annuity values.The crucial factor was inapplyingthe relatively straightforward mathematicsto a mortality table. He examined Halley's Breslau table and concludedthat the numberof individuals who survivedto a given age from a total startingout at anearlier agecould be expressedas decreasing terms in an arithmetic series. In short, De Moivre combined probabilities and theinterest factor to compute annuity values in a manner that pointed the way tomodern actuarial science. Thomas Simpson (1710-1761)has been describedas De Moivre's younger rival. He is certainly rememberedin the history of life insuranceas aninnovator. He appearsinitially to have arguedwith DeMoivre that mortality tables shouldbe used as they were found and not fitted to mathematicalrules. Simpsonis dismissedby Karl Pearson (1978) as aplagiarist who "set out to boil down De Moivre's workandsell the resultat alower price" (p. 176), whereas Dawson (1901/1914) describes his book as "an attempt to popularizethe science, nevera popular movement among those who hope to profit by keepingit exclusive"(p. 102). De Moivre was undoubtedly upset by Simpson's reproduction of his ideas, but Pearson'sview of Simpson's character is not sharedby many historiansof mathematics,who have describedhim as aself-taught genius. The applicationof mathematicalrulesto tablesof mortality are theearliest examples of the use oftheoretical abstractions for practical purposes. They are attemptsto organizeand make senseof data. The developmentof, and rationalesbehind, modem statistical summaries are of immediate interest. GRAPHICAL METHODS As knowledge increases amongst mankind and transactions multiply,it becomes more and more desirableto abbreviate and facilitate the modes of conveying information from one person to anotherand from one individual to the many. I confessI was long anxiousto find out whetherI was actuallythe first whoapplied the principles of geometry to matters of finance as it hadlong before been applied to chronology withgreat success.I am now satisfied upondue inquiry, that I was the first: As the eye is thebestjudge of proportion being ableto estimate it with more accuracy thanany other of our organs,it follows, that wherever relative quantities are in question, a gradual increaseor decreaseof any ... value is to bestated,this mode of representingit is peculiarly applicable. That I havesucceededin proposingand putting in practicea new anduseful mode of statingaccounts... and asmuch informationmay beobtainedin five minutes as would require whole days to imprint on thememory, in a lasting manner,by a table of figures. (Playfair,180 la, pp. ix-x)

54

4. THE ORGANIZATION OF DATA

These quotations aretakenfrom Play fair's Commercialand Political Atlas, the third edition of which was publishedin 1801. They providean unequivocal statementof his view, and theview of many historians, that he was theinventor of statistical graphical methods, methods that hetermed lineal arithmetic. Also in 1801, Playfair's Statistical Breviary; Shewing, on a Principle Entirely New, the Resourcesof Every Stateand Kingdom in Europe, appeared.We see, in handsome hand-colored, copper-plateengravings, lineand bargraphs in the Atlas,andthese together with circle graphs and piechartsin the Breviary. They illustrate revenues, importsand exports, population distributions, taxes, and other economicdata,and they are accompaniedby constant reminders of the utility of the graphical methodfor the examinationof fiscal and trade matters. William Playfair (1759-1823)was theyoungerbrotherof John Playfair,the mathematician.He had avariety of occupations, beginning as anapprentice mechanicalengineeranddraftsmanandlater as awriter andratherunsuccessful businessman.In his examination of the history of graphical methods, Funkhouser (1937) notes that the French translationof the Atlas was very well receivedon thecontinent;in fact, much more attention was paid to it there than in England.He also suggests that this might accountfor the greater interest shown in graphical workby the French overthe next century. As a method of summarizingdata,we must givefull credit to Playfair for the developmentof the graphical method,but thereare examplesof the use of graphs that predate his work by many centuries. Funkhouser (1937) mentions the use ofcoordinatesby Egyptian surveyors and that latitudesandlongitudes were used by the geographersand cartographersof ancient Greece. Nicole Oresme(c. 1323-1382)developedthe essentialsof analytic geometry, which is ReneDescartes'(1596-1650)greatest contribution to mathematics, although it is of course true that graphs could have been used, and infact occasionally were used, withoutany formal knowledgeof the fact that the curveof a graph can be representedby a mathematicalfunction. Funkhouser(1936) reportson a10thcentury graphof planetary orbitsas afunction of time, andover the years there havebeen other examples of the method, perhaps the most obviousof which is musical notation. Despite Playfair's ingenuity and hisattemptsat thepromotion of the technique, the use ofgraphswas surprisingly slowto develop. We find Jevonsin 1874 describingthe productionof "best fit" curvesby eye andsuggestingthe procuring or preparation of "paper divided into equal rectangular spaces, a convenientsize for the spaces being one-tenth of an inch square"(p. 493), and not until its llth edition, publishedin 1910, did Encyclopaedia Britannica devote an entry to graphical methods.Funkhouser(1937) suggests,and the suggestionis eminently plausible, thatthe collection of public statisticswas affectedby anongoing controversyas towhether verbal descriptions of political and economic states were superior to numericaltables. If statisticaltables were

GRAPHICAL METHODS

55

regarded with suspicion, then graphs would have been greeted with even more opprobrium.In Franceand Germanythe use ofgraphswas openly criticizedby some statisticians. In the social sciencesthe personwho most helpedto promotethe use of the graph as a tool in statistical analysiswas Quetelet,who has already been mentioned.It hasbeen noted that the useof theNormal Curvefor thedescription of humancharacteristics begins with his work. The law of error and Quetelet's adaptationof it is of great importance.

5 Probability

A proposition that wouldfind favor with most of us isthat the businessof life is concernedwith attemptsto domore or lessthe right thing at more or lessthe right time. It follows thata great dealof our thinking, in the senseof deliberating, about our condition is bound up with the problem of making decisionsin the face of uncertainty.A traditional viewof sciencein general,on theother hand, is that it searchesfor conclusions thatare to betaken to be true, that the conclusionsshould represent anabsolute certainty. If one of itsconclusionsfails, then the failure is put down to ignorance,or poor methodology,or primitive techniques- thetruth is still thoughtto be outthere somewhere. In chapter2, some of the difficulties that can arise in trying to sustain this positionwere discussed. Probability theory provides uswith a meansof reaching answers and conclusions when,as is thecasein anything but the most trivial of circumstances,the evidenceis incompleteor, indeed,can never be complete. Put simply, whatis thebestbet, whatare theoddson ourwinning,and howconfident can we bethat we areright? The derivationof theoriesand systems thatmay be applied to assist us in answering these questions continues to present intellectual challenges. THE EARLY BEGINNINGS The early beginningsare very early indeed. David (1962) tells us that many archeological digs have produced quantities of astragali, animal heel bones, with the four flatter sides marked in various ways. These bones may have been used as counting aids,as children's toys,or as theforerunnersof dice in ancient games.The astragaluswas certainly usedin board gamesin Egypt about 3,500 years beforethe birth of Christ. David suggests that gaming may have been developedfrom game-playingand reports thatit is saidto have been introduced 56

THE BEGINNINGS

57

in Greecejust 300 years before Christ. Shealso suggests that gaming may have emergedfrom the wager,and thewager from divination and theinterrogation of oracles with originsin religiousritual. We know that gamingwas acommon pastimein both Greeceand Romefrom at least 100years before Christ. David notes that Claudius (10 B.C.- 54 A.D.) wrote a book on how to win atdice, which, regrettably, has not survived. David also commentson a remark in Cicero'sDe Divinatione,Book I: When the four dice producethe venus-throw [the outcome when four dice, or astragali,are tossedand thefaces shownare alldifferent] you maytalk of accident: but supposeyou made a hundred castsand thevenus-throw appeared a hundred times; couldyou call that accidental? (David, 1962, p. 24)

This may be one of the earliest statements of the circumstances in which we might acceptor reject the null hypothesis. Cicero quite clearlyhad some grasp of the conceptsof randomnessand chanceand that rare events do occur in the long run. A variety of explanationsmay beadvancedfor the fact that these early insightsdid not leadto amathematical calculus of probabilities.Greek philosophy, in the works of Platoand Aristotle, searchedfor order and regularityin the phenomenaof the universe,and later, as Christianity spread, the Church promoted the idea of an omnipotentGod who wasunceasingly awareof, and responsible for,the slightest perturbation in naturalaffairs. Human beings were doomed to be ignorantof the successionof natural eventsand could achieve salvation onlyby submissionto the will of an Almighty. A more mundane explanationmay beprovidedby the fact that early arithmetic,at anyrate in what we would now call Westernculture, was hamperedby the absenceof a logical and efficient system of number notation. Whateverthe reasons,and many others have been suggested (see, e.g., Acree, 1978; Hacking,we 1975), know that about 1,600 years elapsed before a foundation for probability theorywas laid.

THE BEGINNINGS The reader is referredto David's interestingbook (1962)for a review of the contributionsthat emergedin Europeduring the first millennium and beyond. In the 17th century, Gerolamo Cardano's book Liber de Ludo Aleae [The Book on Gamesof Chance] was published(in 1663, some87 yearsafter the death of its author). A translationof this work by S. H. Gould is to be found in Ore (1953). The treatise is full of practical adviceon both oddsand personality, noting that personswho arerenownedfor wisdom,or old anddignified by civil honoror priesthood, should not play, but that for boys, young men, and soldiers, it is lessof a reproach. Cardarno also cautions that doctors and lawyers playat a disadvantage,for

58

5.

PROBABILITY

"if they win, they seemto begamblers,and if they lose,perhaps theymay be takento be asunskilful in their own art as ingaming. Men of these professions incur the same judgmentif they wish to practice music" (Ore, 1953, pp. 187-188). It cannot be claimed that Cardano produced a theory with this book. However,it can bestated thathe understoodthe importanceof therelationship between theoretical reasoning and eventsin the "real world" of gaming. The birth of themathematical theoryof probability, thatis, "when the infant really seesthe light of day" (Weaver, 1963/1977, p. 31), took placein 1654.In that year, Antoine Gombauld, Chevalier de Mere (1607-1684),posed the questionsto Blaise Pascal(1623-1662)that initiatedtheendeavor.The two may have met in thesummerof 1652 and soon become friends, although there is some doubt about both the time and thecircumstancesof their meeting. David, in her book, deals rather harshly with de Mere. Shedismisseshim ashaving"a second-rate intelligence" and indicates thathis writings do not show evidence of mathematical ability, "being literary pieces of not very high calibre" (David, 1962,p. 85). De Mere was not amathematician, although there is no doubt that he held a good opinionof his abilitiesin a numberof fields. What he appears to have beenwasFrance's leading arbiter of good tasteandmanners,a man who studiedthe ways of polite societywith care anddedication.1 His writings were soughtafter,and hisopinions respected, by thosewho wishedto maintain good standingin the drawing roomsof Paris (seeSt. Cyres, 1909,pp. 136-158,for an accountof de Mere). Pascal appears to have beena willing pupil, although toward the end of1654heunderwenthis second conversion to thereligious life. The notion that, astonishingly, the calculusof probabilitieswas theoffspring of a dilettanteand anascetic, which Poisson's remark (footnoted in chap. 1) has been takento imply, is a notquite accuratereflection. Every accountof the history of the concept of probability showshow importantthe searchfor waysof assessing the oddsin various gaming situations has been in shapingits development. Thereseemsto be little doubt but that Pascalhadspent sessions at thegaming tables,and itseems unlikely that various situationshad notbeen discussed with de Mere. What movedthe matter into the realm of mathematicaldiscoursewas thecorrespondence between Pascal and hisfellow mathematician Pierre Fermat, sparked by specific questions that had been raisedby de Mere. An old gamblinggameinvolvesthe "house"offering a gambler even money that he or shewill throw at least 1 six in 4throws of a die. In fact, the odds are slightly favorableto the house. De Mere's problem concerned the questionof the odds in a situation wherethe bet wasthat a player would throw at least1 1

That David's judgmenton themattermay be inerror is likely. Michea (1938) maintains that Descartes'bonsensandPascal'scoeurcan beequated withde Mere'sbongout, and no onecould find that Descartesand Pascalhad second-rate intelligences.

THE BEGINNINGS

59

double six in 24 throws of 2 dice. Herethe odds are,in fact, slightly against the house, even though 24 is to 36 as 4 is to 6. De Mere hadsolved this problem but used the outcome to argue to Pascal thatit showed that arithmeticwas self-contradictory!2 A second problem, which de Mere was unableto solve, was the"problem of points," which concerns,for example,how thestakesor the "pot" shouldbe divided amongthe players, accordingto thecurrent scores of the participants,if a gameis interrupted. Thisis analtogether moredifficult, questionand theexchanges between Pascal andFermat that took place over the summerof 1654 devote much space to it. The letters3 show the beginningof the applicationsof combinatorial algebraand of thearithmetic triangle. In 1655, Christiaan Huygens (1629-1695),a Dutch mathematician, visited Paris and heard of the Pascal-Fermat correspondence. There was enormous interest in themathematicsof chance but a seeming reluctance to announce discoveries and reveal the answers to questions under discussion, perhaps becauseof the material valueof the findings to gamblers. Huygens returned to Holland and worked out the answers for himself. His short accountDe Ratiociniis in Ludo Aleae [Calculating in Gamesof Chance]was publishedin 1657 by Fransvan Schooten(1615-1660),who wasProfessorof mathematics at Leyden. This book,the first on theprobability calculus,was animportant influence on James Bernoulliand DeMoivre. James (also referred to asJacquesor Jakob) Bernoulli(1654-1705)was one of an extendedfamily of mathematicianswho made many valuable contributions. It is his Ars Conjectandi [The Art of Conjecture},publishedin 1713after editing by his nephew Nicholas, that is of most interest.The first part of the book is a reproductionof Huygens' workwith a commentary.The secondand third parts deal withthe mathematicsof permutationsandcombinationsand its application to games of chance. In the fourth part of the book we find the theorem thatBernoulli called his "golden theorem,"the theorem named "The Law of Large Numbers"by Poissonin 1837. In an excellentand most readable commentary, Newman (1956) declares the theoremto be "of cardinal significance to the theory of probability," for it "was the first attempt to deduce statistical measures from individual probabilities" (vol. 3, p. 1448). Hacking (1971) hasexaminedthe whole work in some considerable depth. Bernoulli likely did not anticipateall of the debatesand confusions and 2

The probability of not getting a six in asingle throw of a die is 5/6 and the probability of no sixes in 4 throws is (5/6)4. The probability of at least 1 six in 4throws is 1 - (5/6)4, which is .51775, whichis favorableto the house. The probability of not getting 2 sixes in a single throw of 2 dice is 35/36 and theprobability of the event of at least oncegetting 2 sixesin 24 throws of 2 dice is 1- (35/36)24, which is .49140, whichis slightly unfavorableto the house. 3

Translationsof the Pascal-Fermat correspondence are to befound in David's (1962) bookand in Smith (1929).

60

5.

PROBABILITY

misconceptions, both popular and academic, thathis theorem would generate, but that it was asourceof intellectual puzzlement may betakenfrom the fact that its author states that he meditatedon it for 20 years.In its simplest form, the theorem states that if an eventhas aprobabilityp, of occurringon asingle trial, and ntrials are to bemade, thentheproportion of occurrencesof the event over the ntrials is alsop.. As n increases,the probability thatthe proportionwill differ from p by less thana given amount,no matterhow small, also increases. It also becomes increasingly unlikely thatthenumberof occurrencesof theevent will differ from pn by less thana fixed amount, however large. In a seriesof n tossesof a "fair" coin the probability of heads occurring close to 50% of the time (0.5« occurrences) increases as n increases,and there is a much greater probability of a difference of, say, 10 headsfrom what wouldbe expectedin 1,000tossesof the coin thanin 100tossesof the coin. Kneale (1949) notes that it is often supposed that the theoremis: A mysterious law of nature which guarantees that in a sufficiently large numberof trials a probability will be "realizedas afrequency." A misunderstandingof Bernoulli's theoremis responsiblefor one of thecommonest fallacies in the estimation of probabilities, the fallacy of the maturity of the chances.When a coin hascomedown headstwice in succession,gamblerssometimes say that it is more likely to come down tails next time because"by the law of averages"(whatever thatmay mean)the proportion of tails mustbe broughtto right sometime. (Kneale, 1949, pp. 139-140)

The fact is, of course, thatthe theoremis no moreand noless thana formal mathematical proposition and not astatement about realities and, asNewman (1956)puts it," neither capable of validating 'facts' nor of being invalidatedby them" (vol. 3, p. 1450). Now this statement accurately represents the situation, but it mustnot bepassedby without comment.If reality departs severely from the law, then clearlywe do notdoubt reality,and nor do wedoubtthe law. What we might do, however,is to doubt someof our initial assumptions about the observed situation.If red wereto turn up 100timesin a row at theroulette table, it might be prudentto bet on red on the 101spin st because this sequence departs so muchfrom expectation thatwe might conclude thatthe wheel is "stuck" on red. In fact, we might usethis observationas support for an assertion thatthe wheel is not fair. This example moves usawayfrom probability theoryandinto the realmof practical statistical inference, aninference about reality that is based on probabilities indicatedby anabstract mathematical proposition. THE MEANING OF PROBABILITY A problem has been raised.The difficulty inherentin any attempt to apply Bernoulli's theoremis that we assume thatwe know the prior probability^ of

THE MEANING OF PROBABILITY

61

a given eventon asingle trial- that, for example,we know thatthe value AFP for the appearanceof heads whena coin is tossedis ½ If we doknow p, then predictionscan bemade, and, givena seriesof observations, inferences may be drawn from them. The problem is to statethe grounds for setting p at a particular value,and this raisesthe questionof what it is that we mean by probability. The issuewas always there,but it was notreally confronteduntil the 19th century. The title of this sectionis taken from ErnestNagel'spaper presented at the annualmeetingof the American Statistical Association in 1935 and printedin its journal in 1936.Nagel'sclearandpenetrating account does not, as headmits, resolve the difficult problems raisedin attempts to examine the concept, problems thatare still the source of sometimes quite heated debate among mathematiciansandphilosophers. What it doesis presentthe alternative views and commenton theconclusions that adoption of one orother of the positions entails.4 Nagel first examinesthe methodological principles that can bebrought to bear on the question; thenhe delineatesthe contexts in which ideasof probability may befound; and,finally, he examinesthe three interpretations of probability that are extant. What follows draws heavily on Nagel'sreview. Appealsto probabilitiesarefound in everyday discussion, applied statistics and measurement, physical andbiological theories,the comparisonof competing theories,and in themathematical probability calculus. The three important interpretationsof the concept are theclassical; the notion of probability as "rational belief; and thelong-run relativefrequencydefinition. Laplace (1749-1827)and DeMorgan (1806-1871)were the chief exponentsof the classical view that regards probability as, in DeMorgan'swords, "a stateof mind with respectto anassertion"(De Morgan, 1838).The strength of our belief aboutany given propositionis itsprobability. Thisis what is meant in everyday discourse when we say(though perhaps not every day!) that"It is probable that gaming emerged from game-playing." In evaluating competing theories this viewis also evident,as whenit is said by some,for example, that, "Biologically based trait theories of personalityaremore probably correct than social learningtheories."This interpretationis not what is meant when statistical assertions about the weather,the probabilityof precipitation,for example, or the outcomeof an atomic disintegration are made. John Maynard Keynes (later Lord Keynes, 1883-1946)interpreted probability asrational belief, and it hasbeenassertedthat thisis themost appropriate view for most applicationsof the term. Evidence about the moral characterof witnesses would lead to statements about the probability of the truth of one person'stestimony rather than another's or, inexaminingtheresultsof evidence that attemptedto show, for example, thata social learning theory interpretation Acree(1978) givesanextensive review that, for thepresent writer,is sometimesdifficult to follow.

62

5.

PROBABILITY

of achievement motivationis more probable thanone basedon trait theory. These statements rest on rational insights intotherelationship between evidence and conclusion,not on statistical frequencies, because we do nothave such information. It is Nagel's view,and theview of others, that boththe classical and theKeynesian views violate an important methodological principle - that of verifiability: On Keynes'view a degreeof probability is assignableto a single proposition with respectto given evidence.But what verifiable consequences can bedrawnfrom the statement that withrespectto the evidencethe proposition thaton the next throw with a given pair of dice a 7 will appear,has aprobability of 1/6? For on theview that it is significant to predicatea probability to the single instance thereis nothing to be verified or refuted. (Nagel, 1936,p. 18)

The conclusionis that the views we have described "cannot be regardedas a satisfactory analysis of probability propositionsin the sciences which claim to abideby this canon" (Nagel, 1936, p. 18). The third interpretationof probability is the statistical notionof long-run relative frequency. Bolzano (1781-1841), Venn (1834-1923), Cournot (1801-1877),Peirce(1839-1914),and, more recently, von Mises(1883-1953) and Fisher (1890-1962)are among the scientists and mathematicianswho supported this view.In a simple example,if you aretold that your probability of survival and recovery aftera particular surgical procedure is 80%, then this means thatof 100patientswho have undergone this operation, 80 have recovered. The majority of working scientistsin psychologywho rely on statistical manipulationsin the examinationof data implicitly or explicitly subscribeto this meaning. Whether or not it isappropriateto the endeavorand whetheror not theconsequences of its acceptanceare fully appreciatedis a questionto be considered.It is important to note, and this is not meant to be facetious, that following the operationyou areeither aliveor you aredead. In other words, the probability statementis about a relationship between events, or rather propositions concerning the occurrenceof events,not abouta single event.In this respectthe frequentists have no quarrelwith the Keynesians. The most common logical objection (in fact, it is a class of objections)to the frequency view concerns the vague term long-run. Clearly, in order to establisha frequency ratio the run has tostop somewhere, and yetprobability values are defined as thelimit of an infinite sequence. Nagel's answer to the objection, which von Mises (1957) also deals with at some length,is that empirically obtained ratioscan beusedas hypothesesfor the true valuesof the infinite series,andthesehypothesescan betested.Further, probability values might be arrived at by using some other theory than that obtained from a statistical series, for example, using theoretical mechanics to predict the

THE MEANING OF PROBABILITY

63

probability of a fair coin turningup heads. Againthe predicted probabilitycan be testedempirically. Nagel also argues that Keynesian examples about the credibility of witnessescouldbeexaminedin frequencyterms;for instance, "the relative frequency with whicha regular church-goer tells lies on important occasionsis a small number considerably less than ½" (Nagel, 1936,p. 23). It is obviously the casethat thereis more thanone conceptionof probability and that each one hasvalue in its particular context. Nagel concludes that the frequency view is satisfactoryfor "every-daydiscourse, applied statistics and measurement,and within many branchesof the theoretical sciences"(p. 26). It has been noted that psychological scientists have acceptedthis view. Nevertheless,the fact that probabilityhas to doboth with frequencies and with degreesof belief is the aleatory and epistemological duality that is clearly and cogently discussedby Hacking (1975). We blur the issueas wecompute our statistics and speakof the confidencewe have in our results. The fact thatthe answer to the question "Who, in practice, cares?"is "Probably very few," is basedon an admittedly informal frequency analysis,but it is one inwhich we can believe! Probability and the Foundations of Error Estimation The practical spurto the developmentof probability theory was simply the attemptto formulate rules that,if followed assiduously,would leadto a gambler making a profit over the long run, although,of course, it was notpossibleto asserthow long the runmight haveto be for profit to be assured. Probability was therefore boundup with the notion of expectancies,and the 150years that followed the end of the17th century saw its constructs being appliedto expectancies in life insurance and the sale of annuities. Johnde Witt (1625-1672)andJohn Hudde(1628-1704),who calculated annuities basedon mortality statisticsand whose workwas mentioned earlier, corresponded and consulted with Huygens and drew heavily on his writings. An important figure in thehistory of probability, althoughthe extent of his contribution was not fully recognized until long, long after his death, was Abraham De Moivre (1667-1754). His book The Doctrine of Chances:or A Method of Calculating the Probabilities of Events in Play was first published in 1718, but it is thesecond (1738)and third (1756) editionsof the work that are ofmost interest here.De Moivre'sA Treatiseof Annuities on Lives,editions of which were publishedin 1724, 1743,and 1750, appliesThe Doctrine of Chancesto the "valuation of annuities." The Treatise is to be found bound together withthe 1756 editionof the Doctrine. Here the problems of the gaming roomsand the questionsof actuarial prediction in the mathematicsof expectanciesare linked. A third strandwas not to emergefor over 50 yearswith investigationson theestimation of error.

64

5.

PROBABILITY

This latteris associated most closely with the work of Laplace(1749-1827)and Gauss(1777-1855)and thederivation of "The Law of Frequencyof Error," or the normal curve as it came to be known. And yet we nowknow that this fundamentalfunction had,in fact, been demonstrated by De Moivre,who first published it in 1733 (the Approximatioad Summam Terminorum Binomii (a+b)n in Seriem Expansi),andEnglish translations of the work are to befound in the last two editions of The Doctrine of Chances.The Approximatio is examined by Todhunter (1820-1884), whose monumental Historyof the Mathematical Theory of Probability From the Time of Pascal to That of Laplace, publishedin 1865, still standsas one of themost comprehensive and one of the dullest bookson probability. Todhunter devotes considerable space in his book to De Moivre's work, but, according to Karl Pearson (1926),he "misses entirelythe epoch-making character of the 'Approximatio' as well as its enlargementin the 'Doctrine.' He does not say: Hereis the original of Stirling's Theorem, here is the first appearanceof the normal curve, hereDe Moivre anticipated Laplace as the latter anticipated Gauss" (Pearson, 1926, p. 552). It is true that Todhunter does not state any of theabove precisely,and it is thereforenow often reported that Karl Pearson discovered the importanceof the Approximatio in 1922 while he waspreparing for his lecture courseon the history of statistics thathe gave at University College London between 1921 "and certainly continuing up till 1929" (E. S.Pearson& Kendall, 1970,p. 479). Karl Pearson published an article on thematterin 1924 (K.Pearson 1924a) and some detailsare providedby Daw and E. S.Pearson (1972/1979). Quite how much credit we cangive Pearsonfor the revelationand howjustified he was in his judgment that "Todhunter fails almostentirely to catchthe drift of scientific evolution." (Pearson, 1926, p. 552) can be debated.It is certainlythe casethat one is notcarried alongby the excitementof the progressionof discoveryas oneploughsone'sway through Todhunter's book, but hedoes say,in a section wherehe refersto thepagesof thethird editionof the Doctrine, which givethe Approximatio and show an important result: Thus we have seen that the principal contributionsto our subject fromDe Moivre are hisinvestigations respecting theDuration of Play, his Theoryof Recurring Series, and his extension of the value of Bernoulli's Theoremby the aid of Stirling's Theorem. Our obligations to De Moivre would have been still greater if he had not concealedthe demonstrationsof the important results whichwe have noticed . .;. but it will not bedoubted thatthe Theory of Probability owesmore to him than to any other mathematician, with the sole exceptionof Laplace.(Todhunter, 1865/1965, pp. 192-193)

It is also worth noting thatDe Moivre's contributionwas remarkedon by the American historianof psychology, Edwin Boring, in 1920, before Pearson,

THE MEANING OF PROBABILITY

65

and without so much fuss. In a footnote referringto the"normal law of error" he says, "The so-called'Gaussian'curve. The mathematical propadeutics for this function were preparedaslong ago as thebeginningof the 18th century(De Moivre, 1718 [sic]). SeeTodhunter. .." (Boring, 1920,p. 8). A more detailed discussion of the Approximatio and thepropertiesof the normal distributionis given in the next chapter.The roles of Laplaceand Gauss in the derivation and applicationsof probability theory are of very great importance. Laplace brought together a great dealof work that had been published separately as memoirs in his great work Theorie Analytiquedes Probabilites, first publishedin 1812, with further editions appearingin 1814 and 1820. Todhunter(1865/1965)devotesalmost one quarterof his book to Laplace's work. From the standpointof statisticsasthey developedin the social sciences,it is perhaps only necessary to mentiontwo of his contributions:the derivation of the "Law of Error," and the notion of inverse probability. Laplace's theorem, the proof of which, in modern terms,may befound in David (1949), and which had been anticipatedby De Moivre, showsthe relationship between the binomial seriesand the normal curve.Put very simply, and in modern parlance,as n in(a+b)n approachesinfinity, the shapeof the discrete binomial distribution approaches the continuous bell-shaped normal curve. It is therefore possibleto expressthe sum of anumberof binomial probabilities by meansof the area of the normal curve,and indeedthe value of this sum derived from the familiar tables of proportions of area underthe normal distribution is quite closeto the exact value even for n of only 10. Thedetails of this procedureare to befound in manyof theelementary manuals of statistics. Laplace's work givesa numberof applicationsof probability theoryto practicalquestions,but it is thebell-shaped curve as it isappliedto the estimation of error thatis of interest here. Both Gauss andLaplace investigated thequestion independentlyof each other. Errors are inescapable, even though they may not alwaysbe of critical importance,in all observations that involve measurement. Despite the most sophisticatedof instrumentsand the most skilled users, measurements of, say, distances between stars in our galaxyat aparticular time, or points on the surfaceof the earth, will, when repeated,not always produce exactly the same result.The assumptionis made thatthe values thatwe require are definite and,to all intentsandpurposes, unchanging - that is to say, a true value exists. Variationsfrom the true valueare errors. Laplace assumed that every instanceof anerror arose because of the operationof a numberof sources of error, eachone of which may affect the outcomeone way or theother. The argument proceeds to maintain thatthe mathematical abstraction that we now call the normal law representsthe distributionof the resultant errors. Gauss'sapproachis essentially the same as Laplace's but has amore unashamedlypractical flavor. He assumed that there was anequal probability of errors of over-estimationand under-estimationand showed,by the method

66

5.

PROBABILITY

of leastsquares,thatit is thearithmetic meanof the distributionof measurements that best reflectsthe true valueof the measurementwe wish to make. The distribution of measurements,the variation in which represents error on these assumptions,is preciselythat of Laplace'stheorem,or thenormal curve,or, as it is sometimes termed, the Gaussian distribution. FORMAL PROBABILITY THEORY From the 17th to the20th century, mathematical approaches to probability have been algebraicaland geometrical. The work of Pascal and Fermat led to combinatorialalgebra.The invention of thecalculusled toformal examinations of theoretical probability distributions. Gaussian methods are essentially geometrical. Alwaysthe probability calculusfound itself in difficulties over the absenceof a formal definition of probability that was universally accepted. Modern probability theory,but not, it mustbe noted, statistics,hasescaped this difficulty by developingan arithmetical axiomatic model. The renowned German mathematician David Hilbert (1862-1943),Professorof Mathematicsat Gottingen, presenteda remarkable paper to the International Congressof Mathematics meetingin Parisin 1900. In it he presented mathematics with a series of no less than23 problems that,he believed, wouldbe theones that 5 mathematics should address during the 20th century. Among themwas acall for a theory of probability basedon axiomatic foundations.At that time, various attemptswere being madeto develop a theory of arithmetic basedon a small number of fundamentalpostulatesor axioms. These axioms have no basis in what is called the real world. Questions such as "What do they mean?"are excluded, ambiguityis avoided.The axioms themselves do notdependon any assumptions.For example,one of thebest-knownof themathematical logicians, Guiseppe Peano (1858-1932),basedhis arithmeticon postulates such as"Zero is a number." Thisis not theplaceto attemptto discussthe difficulties that have been uncoveredin the axiomatic approachor in set theory, with which it is closely allied. Suffice it to saythat modern mathematical probability theory is largely basedon the work of Andrei Kolmogorov (1903-1987),a Russian mathematician whose book Foundations of the Theory of Probability was first publishedin 1933. Set theory is most closelylinked with the work of Georg Cantor (1845-1918),born in St. Petersburgof Danish parentsbut who lived most of his life in Germany. Set theory grew out of a new mathematical conceptionof infinity. The theory hasprofound philosophicalaswell as mathematical implicationsbut its basis is easy to understand.It is a mathematical 5 In the actual talk Hilbert onlyhad time to deal with 10 of theproblems,but entire manuscriptin English, can befound in the Bulletin of the American Mathematical Society (1902).

FORMAL PROBABILITY THEORY

67

system that deals with defined collections, lists, or classesof objectsor elements: The theory of probability,as amathematical discipline, can andshouldbe developed from axioms in exactly the sameway asGeometryand Algebra. This means that after we havedefined the elementsto be studiedand their basic relations, and have statedthe axiomsby which these relations are to begoverned,all further propositions must be based exclusivelyon these axioms, independent of the usual concrete meaningof these elements andtheir relations ... ... theconceptof afield of probabilities is definedas asystemof setswhich satisfies certain conditions. Whatthe elementsof this setrepresentis of no importancein the purely mathematical development of the theory of probability. (Kolmogorov, 1933/1956,p. 1)

The system cannotbe fully delineated here, but this chapterwill endwith an illustration of its mathematical elegance in arriving at thewell-known rulesof mathematical probability theory. This account follows Kolmogorov closely, althoughit does not use histerminologyand symbols precisely. E is a collection of elements called elementary events. / is a set ofsubsets of E. Theelementsof f arerandom events. The axiomsare: 1. / is a field ofsets. 2. / containsthe set E. 3. To eachset A in / isassigneda non-negative realnumberp(A).This is called the probability of eventA. 4. p(E}=\. 5. If A and Bhave no elementin common, thenp(A + B)= p(A) + p(B) In set theory,A has acomplementarysetA'.In the languageof random events, A' meansthe non-occurrenceof A. To saythat A is impossibleis to write A = 0, and to saythat A must occuris to write A = E. BecauseA + A' = E and from the 4th and 5thaxioms, p(A)+ p(A') = 1 and p(A) = \-p(A') and becauseE' = 0, p(E') = 0. Axiom 5 is theaddition rule. lfp(A) > 0, then p(B \ A) - ^ , * , which is theconditional probabilityof

P(A)

the event B under conditionA. From this followsthe general formula known as themultiplication rule, p(AB) = p(A)p(B A), and as weshow in chapter7, this leadsto Bayes'theorem. Note thatp(B\ A) > Q,p(E \A) = 1, and p{(B + C)\A}=p(B\A)+p(C\A). For readersof a mathematical inclinationwho wish to see more of the developmentof this approach(it doesget alittle more difficult), Kolmogorov's book is conciseand elegant.

6 Distributions

When Grauntand Halley and Queteletmade their inferences, they made them on thebasisof their examinationof'frequency distributions. Tables, charts, and graphs- nomatter how theinformationis displayed- all can beusedto show a listing of data, or classificationsof data, and their associated frequencies. These are frequency distributions. By extension, such depictions of the frequencyof occurrenceof observationscan beusedto assessthe expectationof particular values,or classesof values, occurringin the future. Real frequency distributionscanthenbe usedasprobability distributions.In general, however, theprobability distributions thatarefamiliar to theusersof statistical techniques are theoretical distributions, abstractions basedon a mathematical rule, that match, or approximate, distributions of eventsin the real world. When bodies of data are described,it is the graph and thechart that are used. But the theoretical distributions of statisticsandprobability theoryaredescribedby the mathematical rules or functions that define the relationships between data,both real and hypothetical,andtheir expectedfrequenciesor probabilities. Over the last 300 yearsor so, thecharacteristicsof a great many theoretical distributions,all of which have beenfound to have some practical utility in one situationor another, have been examined. The following discussionis limited to threedistributions thatare familiar to usersof basicstatisticsin psychology. An accountof somefundamentalsampling distributionsis given later. THE BINOMIAL DISTRIBUTION In the years 1665-1666,when Isaac Newtonwas 23 and had just earnedhis degree, his Cambridge college (Trinity)was closed becauseof the plague. Newton went hometo Woolsthorpein Lincolnshireand began,in peaceand leisure,a scientificrevolution.These weretheyearsin which Newton developed

68

THE BINOMIAL DISTRIBUTION

69

someof the most fundamentaland importantof his ideas: universal gravitation, the compositionof light, the theory of fluxions (the calculus),and thebinomial theorem. The binomial coefficientsfor integral powershad been knownfor many centuries, but fractional powers werenot considereduntil the work of John Wallis (1616-1703),Savilian Professorof Geometry at Oxford, and themost influential of Newton's immediate English predecessors. However, expansions of expressions such as (x -x2 )1/2 were achievedby Newton earlyin 1665. He announcedhis discoveryof the binomial theoremin 1 676 inletterswritten to the Secretaryof the Royal Society, although he never formally publishedit nor did he provide a proof. Newton proceededfrom earlier work of Wallis, who publishedthe theorem, with creditto Newton, in 1685. The problem was to find the areaunder the curve with ordinates (x - x2 )" Whenn is zerothe first two terms arex - 1 (x3 \ andwhennis1 they arex - 1 (x3 \ Newton, usingthe methodof interpolation employedsomuchby Wallis, reasoned that when n was l

\* /2 the corresponding terms should be, ;c - -r— . He arrived at theseries

and then discovered that the same result couldbe obtained by deriving, and subsequently integrating

2

l/1

the binomial expansionof (1 - jc) . Theinterestingandimportant pointto be noted is that Newton's discoverywas notmade by consideringthe binomial coefficients of Pascal'strianglebut by examiningthe analysisof infinite series, a discovery of much greatergeneralityand mathematical significance. Figure 6.1 showsthe binomial distributionfor n = l. Newton'sdiscoveryof the calculus in his "golden years"at Woolsthorpe establishes him as its originator, but it was Gottfried Wilhelm Leibniz (1646-1716),the German philosopher and mathematician,who haspriority of publication, and it is pretty well established that the discoveries were independent. However,a bitter quarrel developed over the claims to priority of discovery and allegations were made that, on avisit to London in 1673, Leibniz could have seenthe manuscriptof Newton's De Analysi Aequationes Numero Terminorum Infmitas, which, though writtenin 1669, was notpublisheduntil 1711. Abraham De Moivre was among those appointed by the Royal Society in 1712to report on thedispute. De Moivre made extensive use of themethod in his own work, and it was hisApproximatio, first printed and circulated to somefriends in 1733, that links the binomial to what we nowcall the normal

70

6. DISTRIBUTIONS

FIG. 6.1

The Binomial Distribution for N = 7

distribution. The Approximatio is included in the second (1738) and third (1756) editionsof the Doctrine. It should be mentioned thata Scottish mathematician, James Gregory (1638-1675),working at the time (1664-1668)in Italy, derivedthe binomial expansionand produced important work on themathematicsof infinite series, discoveredquite independentlyof Newton. THE POISSON DISTRIBUTION Before the structure of the normal distribution is examined, the work of Simeon-Denis Poisson (1781-1840)on a useful special caseof the binomial will be described.The Ecole Polytechnique was foundedin Parisin 1794. It was the model for many later technical schools, and its methods inspiredthe productionof many student texts in mathematicsandengineering whichare the forerunnersof present-day textbooks. Among the brilliant mathematiciansof the Ecole duringthe earlier yearsof the 19th centurywas Poisson.His nameis a familiar label in equationsandconstantsin calculus, mechanics, andelectricity. He waspassionately devoted to mathematics and toteaching,andpublished over 400 works. AmongthesewasRecherchesur laProbabilite desJugements in 1837. This contains thePoisson Distribution, sometimes called Poisson's law n of large numbers.It wasnoted earlier thatas n in (P + Q) increases,the binomial distribution tendsto thenormal distribution. Poisson considered the case where

THE POISSON DISTRIBUTION

71

as nincreases towardinfinity, P decreases towardzero,and nPremainsconstant. The resulting distributionhas aremarkable application. Data collectedby insurance companies on relatively rare accidents,say, people trapping their fingers in bathroom doors, indicates that the probability of this event happeningto any oneindividual is very low, in fact near zero. However, a certain numberof such accidents(X) is reported everyyear,and the number of these accidents varies from year to year. Overa number of years a statistical regularityis apparent,a regularity thatcan bedescribedby Poisson's distribution. If we set X at k, aninteger, then

where A, is any positive number,e is theconstant2.7183...,and k! isfactorial k. Although the distribution is not commonlyto be found in the basic statistics testsin psychology,it is usedin the social sciencesand it does havea surprising rangeof applications. It hasbeen usedto fit distributionsin, for example, quality control (defectsper numberof units produced), numbers of patients suffering from certain specificdiseases,earthquakes, wrong-number telephone connections, the daily numberof hits by flying bombs in London during WorldWar II, misprintsin books, and many others.

FIG. 6.2

The Poisson Distribution applied to Alpha Emissions (Rutherford & Geiger, 1910)

72

6. DISTRIBUTIONS

Poisson attempted to extendthe possibleutility of probability theory, for he applied it to testimonyand tolegal decisions. These applications received much criticism but Poisson greatly valued them. Poisson formally discussedthe conceptsof a random quantityandcumulative distribution functions,and these are significant theoretical contributions. But his nameandwork in probability doesnot occupy much space in the literature, perhaps because he wasovershadowed by famous contemporaries such as Laplaceand Gauss. Sheynin (1978) hasgiven us acomprehensive review of his work in the area. An exampleof a Poisson distributionis given in Figure 6.2. THE NORMAL DISTRIBUTION The binomialandPoisson distributions stand apart from the normaldistribution because theyare applied to discrete frequency data. The invention of the calculus provided mathematics with a tool that allowedfor the assessmentof probabilities in continuous distributions.The first demonstrationof integral approximation,to thelimiting caseof the binomial expansion,wasgivenby De Moivre. In the Approximatio, De Moivre beginsby acknowledgingthe work of James Bernoulli: Altho" the Solution of Problemsof Chanceoften requires that several Termsof the Binomial (a + b)n [this is modern notation]be added together, nevertheless in very high Powersthe thing appearsso laborious,and of sogreatdifficulty, that few people have undertaken that Task; for besides Jamesand Nicholas Bernoulli, two great Mathematicians,I know of no body thathas attemptedit; in which, tho' they have shown very great skill,and havethe praise whichis due totheir Industry,yet some thingswerefarther required;for what they havedoneis not somuch an Approximation as thedetermining very wide limits, within which they demonstrated that the Sum of Termswas contained. (De Moivre, 1756/1967,3rd Ed., p. 243)

De Moivre proceedsto showhow hearrivedat theexpressionof the ratio of the middle termto the sum of all thetermsin the expansionof (1 +1)" whenn is a very high power. His answerwas2/BVn, where"B representsthe Number of which theHyperbolic Logarithmis 1 - 1/12+ 1/360-1/1260+ 1/1080, &c." He acknowledgesthe help of James Stirlingwho hadfound that "B did denote the Square-rootof the Circumferenceof a Circle whose Radius is Unity, sothat if that Circumferencebe called c, the Ratio of the middle Termto the Sum of all the Terms will be expressedby 2/V(«c)" (De Moivre, 1756,3rd Ed., p. 244). De Moivre had thus obtained(in modern notation)the expression , for large n, whereY 0 is themiddle term.

THE NORMAL DISTRIBUTION

73

He also givesthe logarithmof theratio of the middle termto anyterm distant from it by aninterval /as (w + / - l^)log \m + l- 1 j + (m - I + V2)log \m - I + 1} - 2wlogw + log ((m +l)/m , wherem = '/2« andconcludes,in the first of nine corollaries numbered 1 through6 and 8through 10,7 having been omitted from the numbering, that,"if m or ½n be aQuantity infinitely great, thenthe Logarithm of the Ratio, whicha Term distantfrom the these middleby the Interval 1, has to themiddle Term, is -2///n" (p. 245). This is merely the _ 2/n expression that,Y/ = Y 0 e~21 , for large n. The second corollary obtains the "Sum of the Terms intercepted between the Middle, andthat whole distance from i t . . . denotedby /", in modern terms,the sum of Yo+ YI + Y 2 + Y 3 + . .. + Y/, as

which is the expansionof the integral

When / is expressedas S^fn , and Sinterpretedas !/2, the sum becomes

... which convergesso fast, thatby help of no more than sevenor eight Terms,the Sum required may becarriedto six or seven placesof Decimals: Now that Sumwill be found to be 0.427812,independently fromthe common Multiplicator2/Vt, and therefore to the Tabular Logarithm of 0.427182,which is 9.6312529,adding the Logarithm of 2At, viz. 9.9019400,the sumwill be 19.5331929,to which answersthe number 0.341344. (De Moivre, 1756, 3rd Ed., p. 245)

This familiar final figure is thearea underthe curveof the normal distribution betweenthe mean (whichis, of course, alsothe middle value)and anordinate one standard deviation from the mean. In the third corollaryDe Moivre says: And therefore, if it was possible to take an infinite number of Experiments, the Probability that an Event whichhas anequal numberof Chancesto happenor fail, shall neitherappearmore frequently than'/2 n+] /2"Jn times, nor more rarely than '/2 n - !/2 Vntimes,will be expressedby thedoubleSum of thenumberexhibitedin the secondCorollary, that is, by 0.682688,and consequentlythe Probability of

74

6. DISTRIBUTIONS thecontrary...will be0.317312,thosetwo Probabilities together compleating Unity, which is the measureof Certainty. (De Moivre, 1756,3rd Ed., p. 246)1

½Vw is what todaywe call the standard deviation.De Moivre did not name it but he did, in Corollary 6, saythat Vw "will be as itwere the Modulus by which we are toregulateour Estimation"(De Moivre, 1756,3rd Ed., p. 248). In fact what De Moivre doesis to expandthe exponentialand to integrate from 0 to SG. In Corollary 6 De Moivre notes thatif / is interpreted as Vw rather than V2 \H , then the series doesnot convergeso fast and that moreand more terms would be required for a reasonableapproximation as / becomes a greater proportion of V«, ... for which reason1makeuse inthis Caseof the Artifice of Mechanic Quadratures, first invented by Sir Isaac Newton...;it consistsin determiningthe Area of a Curve nearly, from knowinga certain numberof its OrdinatesA, B, C, D, E, F, &c.placed at equalIntervals,(De Moivre, 1756, 3rd Ed., p. 247)

He usesjust 4 ordinatesfor his quadratureand finds, in effect, that the area between±2o or '/2/7 ± Vw is 0.95428,and thatthe areain whatwe now call the tails is 0.04572. The true valueis a little less than thisbut it is, nevertheless, familiar. Theseresults can beextendedto the expansionof (a + b]n and where a and b are notequal. If the Probabilitiesof happeningand failing be in anygiven Ratioof inequality,the Problemsrelatingto the Sum of theTermsof the Binomial (a + b)n will be solved with the same facilityas those in which the Probabilitiesof happeningand failing are in aRatio of Equality. (De Moivre, 1756/1967,3rd Ed., p. 250)

In Corollary 9, De Moivre in effect, and in modern terms, introduces ^(npq), the expressionwe usetoday for the standard deviation of the normal approximationto thebinomial distribution. The sum andsubstance oftheApproximatiois that it gives,for the first time, the function that was rediscovered much later, the function that dominates so-called classical statistical inference - thenormal distribution - which in modern terminologyis given by the density function

1

The value for the proportionof areabetween±la is 0.6826894,so that De Moivre was out by one unit in the sixth decimalplace.

THE NORMAL DISTRIBUTION

75

The normal distributionis shownin Figure 6.3. De Moivre's philosophical positionis revealed in the sections headed "Remark I" in the 1738 editionof the Doctrine and anadditional,and much longer, "RemarkII" in 1756. De Moivre sets his work in the philosophical contextof an ordered determinate universe. His notionof Original Design (see the quotationin chapter1) is anotion that persisted at least downto Quetelet. A powerful deity revealsthe grand design through statistical averages andstable statisticalratios. Chance produces irregularities. As Pearson remarked: There is much valuein the idea of the ultimate laws being statistical laws, though why the fluctuations shouldbe attributedto a Lucretian 'Chance',I cannot say. It is not anexactly dignified conceptionof the Deity to supposehim occupied solely with first momentsand neglecting secondand higher moments!(Pearson,1978, p. 160)

and elsewhere: The causeswhich led De Moivre to his "Approximatio" or Bayes to his theorem were more theologicaland sociological than mathematical, and until onerecognizes that the post-Newtonian English mathematicians were more influenced by Newton's theology thanby hismathematics,thehistory of sciencein theeighteenth centuryin particular thatof the scientistswho were membersof the Royal Society- must remain obscure. (Pearson, 1926,p. 552)

FIG. 6.3

The Normal Distribution

76

6. DISTRIBUTIONS

It is interesting that thisis preciselythe sort of analysis thathasbeen brought to bear on thework of Galton and Karl Pearson himself, save that it is the philosophy of eugenics thatinfluenced their work, rather than Christian theology. In Remark II, De Moivre takesup Arbuthnot's argumentfor the ratio of male to female births, whichwas discussedin Chapter4, defendingthe argument againstthe criticisms thathad been advanced by Nicholas Bernoulli,who had noted thata chance distributionof the actual male/female birth ratio would be found if the hypothesized ratio (i.e., the ratio underwhat we would now call the null hypothesis)had been takento be 18:17 rather than 1:1. But De Moivre insists: This Ratio oncediscovered, and manifestly servingto a wise purpose,we conclude the Ratio itself, or if you will the Form of the Die, to be anEffect of Intelligence and Design. As if we were shewna number of Dice, each with18 white and 17black faces, whichis Mr. Bernoulli's supposition, we shouldnot doubt but that thoseDice had beenmadeby someArtist; and that their form was notowing to Chance,but was adaptedto theparticularpurposehe had inView. (De Moivre, 1756/1967,3rd Ed., p. 253)

With the greatestrespect to De Moivre, this was clearly not Arbuthnot's argument, and De Moivre's view that he might have saidit is somewhat specious. LikeQuetelet'suse of thenormal curve many years later, De Moivre's view is a prejudgmentand all findings must be made to fit it. Karl Pearson (1978)makes essentiallythe same point,but it is apparentlya point thathe did not recognize in his own work.

7 Practical Inference

Someof the philosophical questions surrounding induction and inference were dealt within chapter2. Herethe foundationsof practical inference are considered. INVERSE PROBABILITY AND THE FOUNDATIONS OF INFERENCE The first exercisesin statistical inference arose from a considerationof statistical summaries such asthosefound in mortalitytables.The developmentof theories of inferencefrom the standpointof the implicationsof mathematical theory can be dated from the work of Thomas Bayes(1702-1761),an English Nonconformist clergyman whose ministry was inTunbridge Wells. Bayes was recognized as avery good mathematician, although he published very little,and he was electedto theRoyal Societyin 1742 (see Barnard, 1958, for a biographical note). Curiously enough,the paperfor which he isrememberedwas commu1 nicatedto theRoyal Societyby hisfriend Richard Price more than2 yearsafter his death,and thecelebrated forms of the theorem that bear his name, although they follow from the essay,do not actually appearin the work. An Essay Towards Solving a Problem in the Doctrine of Chances(Bayes,1763) is still the subject of much discussionand controversy bothas to itscontentsand implications and as to howmuch of its import was contributedby its editor, Richard Price. The problem that Bayes addressed is statedby Pricein the letter 1 Price was also a Unitarian Church minister. His church stillstandsin Newington Greenin north London and is theoldest Nonconformist church building still being so usedin London. Next doorand abutting the church is a licensed betting shop.It has been noted that Bayesian analysis, resting, as it does,on thenotion of conditional probability, is akin to gambling.

77

78

7. PRACTICAL INFERENCE

accompanyinghis submissionof the essay: Mr De Moivre . . . has . . after . Bernoulli,and to agreater degree of exactness,given rules to find the probability thereis, that if a very great numberof trials be made concerning any event, the proportion of the numberof times it will happen,to the numberof times it will fail in those trials,should differ less thanby small assigned limits from the proportion of the probability of its happeningto theprobability ofits failing in one single trial. But I know of no personwho hasshewnhow to deduce the solution of the converse problemto this; namely, "the number of times an unknown event has happenedand failed being given,to find the chance thatthe probability ofits happening should lie somewhere between any twonamed degrees of probability." (Bayes, 1763,pp. 372-373)

A demonstrationof some simple rules of mathematical probability, using a frequency model, will help to derive and illustrate Bayes' Theorem. The probability of drawinga redcard from a standard deckof playing cardsis 26/52 or/?(R) = 1/2. The probability of drawing a picture cardfrom the deck is 12/52 orp(P) = 3/13. The probability of drawing a redpicture cardis 6/52 or/?(R & P) = 3/26. Whatis theprobability of drawing eithera redcard or a picture card? The answeris, of course, 32/52 orp(Ror P) =8/13. Note that: p(R or P)=p(R) +p(?) -p(R & P), [8/13 = (1/2 + 3/13) - 3/26]. This is known as theaddition rule. Suppose thatyou draw a card from the deck, but you are notallowed to see it. What is the probability that it is a redpicture card? We cancalculatethe answerto be6/52. Now suppose that you aretold thatit is a redcard. Whatnow is the probability of it being a picture card? The probability of drawing a red card is 26/52. We also can figure out that if the card drawn is a redcard then the probability of it also beinga picture cardis 6/26. In fact, p(R & P) = /7(R)[p(P|R)]» [6/52 = 26/52(6/26)] The term /?(P|R) symbolizesthe conditional probability of P, that is, the probability of P given that R has occurred and the expression denotesthe multiplication rule. Notethat/?(R& P) is thesameas/?(P& R), andthat this is equal to p(P)\p(R\P)], or 6/52 = 12/52(6/12). From this fact we seethat:

which is the simplest formof Bayes' Theorem.In current terminology,the left-

INVERSE PROBABILITY AND THE FOUNDATIONS OF INFERENCE

79

hand sideof this equationis termed theposterior probability, the first term on the right-hand sidethe prior probability, and theratio p(P\R)/p(R.), the likelihood ratio. From the addition rulewe canalso show thatthe probability of a redcard is the sum of theprobabilitiesof a redpicture cardand a rednumber card minus the probability of a picture number card (which doesnot exist!), or:

p(R) =p(R & P) + p(R & N) -p(? & N). But p(P & N) - 0 becausethe events picture card and number card are mutually exclusive. So Bayes' Theoremmay bewritten:

The fact that this formulais arithmeticallycorrect may be checked by substitutingthe valueswe canobtain from the known distributionof cards in the standard deck. However, this does not demonstratethe allegedutility of the theoremfor statistical inference.For that we must turnto another example after substitutingD (for Data) in place of R, and H,(for Hypothesis1) in place of P, and H2 (for Hypothesis2) in place of N, where H, and H2 are twomutually exclusiveand exhaustive hypotheses. We have:

Bayes' Theorem apparently provides a meansof assessingthe probability of a hypothesis giventhe data (or outcome), whichis the inverseof assessing the probability of an outcome givena hypothesis(or rule), and of coursewe may envisage more than two hypotheses.In the original essay, Bayes demonstrated his construct usingas anexamplethe probability of balls comingto rest on one or other of the parts of a plane table. Laplace, who in 1774 arrivedat essentiallythe same resultas Bayes,but provideda more generalized analysis, used the problem of the probability of drawing a white ball from an urn containingunknown proportionsof black and white balls given thata sampling of a particular ratiohasbeen drawn. Traditionally, writers on Bayes make heavy use of "urn problems,"and tradition is followed here witha simple example. Phillips (1973) givesa comprehensive account of Bayesian procedures and showshow they can beappliedto data appraisal in the social sciences. Suppose we arepresented withtwo identical urnsW and B. Wcontains70 white and 30

80

7. PRACTICAL INFERENCE

black balls,and Bcontains40 white and 60black balls. Fromone of theurns we areallowed to draw 10 balls, and wefind that 6 of them are white and 4 of them are black. Is the urnthat we have chosen more likely to be W ormore likely to be B? Presumably most of us wouldopt for W. Whatis the probability that it was W? /?(Hi|D) is theprobability of W giventhe data,and/?(H2|D)is the probability of B given the data.Now theprobability of drawing a white ball from W is 7/10 and from B it is 4/10. The probability of the data givenH, [p(D|Hi)] is (0.7)6(0.3)4, and p(D|H2) is (0.4)6(0.6)4. Now we apply Bayes' Theorem:

In order to completethe calculation,we haveto have valuesfor p(H,) and /?(H2) - the prior probabilities. And here is thecontroversy. Theobjectivist view is that theseprobabilitiesare unknownand unknowable because the state of the urnthat we have chosenis fixed. There are nogroundsfor saying that we have chosenW or B. The personalist wouldsay that we canalways state some degreeof belief. If we are completely ignorant about which urn was chosen,then bothp(H,) andp(H2) can beexpressedas 0.5. This is a formal statementof Laplace's Principleof Insufficient Reason,or Bayes' Postulate.If we put thesevalues in our equation,p(Hi|D) works out to be 0.64, which matches common sense. This value could then be usedto revisep(H])and/?(H2) and further datacollected.For itsproponents,the strengthof Bayesianmethods lies in the claim that they provideformal proceduresfor revising many kindsof opinion in the light of new data in a direct andstraightforward way, quite unlike the inferential proceduresof Fisherian statistics. The most important point that has to beaccepted,however, is the justification for the assignmentof prior probabilities. Althoughin this simple example there might seem to be little difficulty, in more complex situations, where probabilities, based on relative frequency2, hunches,or onopinion, cannotbe assigned precisely or readily, the picture is far less clear. Furthermore, the principleof the equal distribution of ignorancehasitself come undera great dealof philosophicalattack. "It is rather generally believed thathe [Bayes] did not publish becausehe distrustedhis postulate,and thoughthis scholium defective.If so he wascorrect" (Hacking, 1965, p. 201). 2

Presumablyit could be argued thatour urn with its unknown ratioof black andwhite balls is one of aninfinite distributionof urns with differing make-ups.

FISHERIAN INFERENCE FISHERIAN

81

INFERENCE

When the justification for the probabilistic basisof inferencein the senseof revising opinionon thebasis of data was thoughtof at all, it was theBayesian approach that held sway until this century. Its foundations had come under attack primarilyon the grounds that probability in any practical senseof the word must be basedon relative frequencyof observationsand not ondegrees of belief. Venn (1888)was perhapsthe most insistent spokesman for this view. Sir Ronald Fisher(1890-1962),certainlythe most influential statisticianof all time, set out toreplaceit. In 1930he publisheda paper that supposedly set out his notion of fiducial probability, claiming, "Inverse probability has, I believe, survived so long in spiteof its unsatisfactorybasis, because its critics have until recent timesput forward nothingto replaceit as arational theoryof learningby experience" (Fisher, 1930, p. 531). There is no doubt thatFisher'smethods,and thecontributionsof Neyman and Pearson that have been grafted on to them, have provided us with a set of inferential procedures. There seems to beconsiderable doubt as towhetherhe provided us with a coherent non-Bayesian theory. Fisher himself asserted that the conceptof the likelihood functionwasfundamentalto his newapproachand distinguishedit from Bayesian probability. Kendall (1963) in his obituary of Fisher says this: It appearsto me that, at this point [1922],his ideas werenot very well thought out. Certainly his exposition of them was obscure. But, in retrospect,it becomes plain that he wasthinking of a probability function f(x,Q) in two different ways: as the probability distributionof* for given0, and asgiving some sortof permissible range of 0 for anobservedx. To this latterhe gave the thenameof 'fiducial probability distribution' (later to be known as afiducial distribution)and in doing so began a long train of confusion;for it is not aprobability distribution to anyonewho rejects Bayes's approach, and indeed, may not be adistribution of anything. Fisher neverthelessmanipulatedit as if it were, and thereafter maintainedan attitude of rather contemptuous surprise towards anyone who wasperverse enoughto fail in understandinghis argument.... The position on both sideshasbeen restatedad nauseam, without much attempt at reconciliationor, as Ithink, withoutan explicit recognitionof the real point, which is that a man's attitude towards inference, like his attitude towards religion,is determined by his emotional make-up,not by reasonor mathematics. (Kendall, 1963, p. 4) Richard von Mises (1957)is baffled also: Fisher introducesthe term [likelihood] in order to denote something different from probability. As hefails to give a definition for either word,i.e.,he does not indicate how the value of either is to bedeterminedin a given case,we canonly try to derive

82

7. PRACTICAL INFERENCE the intended meaningby consideringthe context in which he usesthesewords. I do not understandthe many beautifulwordsusedby Fisherand hisfollowers in supportof the likelihood theory. The main argument, namely, thatp[the probability of the hypothesis]is not a variable but an "unknown constant,"does not mean anything to me.(von Mises 1957, pp. 157-158)

It is tempting to leaveit at that, but some attemptat capturing the flavor of Fisher's position must be made. Clearly,if one hasknowledgeof adistribution of probabilities of events, then that knowledge can beused to establish the probability of an event that has not yetbeen observed,for example, the probability thatthe next roll of two dice will producea 7 (p = .167). Whatof the situation whereaneventhasbeen observed - theroll didproducea 7 - can we sayanything abouttheplausibilityof anevent withp =. 167havingoccurred? This is a decidedlyodd sort of question because the eventhasindeed occurred! Before the draw from a lottery with 1,000,000 ticketsis made,the probability of my winning is .000001.It will be difficult to convince me after the draw, as I clutch the winning ticket, that whathashappenedis impossible,or even very, very unlikely to have occurredby chance,and if you continueto insist I shall merely keepon showingyou my ticket. Fisher, in 1930, putsthe situationin this way: There are two different measuresof rational belief appropriateto different cases. Knowing the populationwe canexpressour incomplete knowledgeof, or expectation of, the sample in terms of probability; knowing the samplewe can expressour incomplete knowledgeof the populationin terms of likelihood. (Fisher, 1930,p. 532)

Likelihood thenis a numerical measureof rational beliefdifferent from probability. Whetheror not thelogic of the situationis understood,all usersof statistical methodswill recognize the reasoningas crucial to both estimation and hypothesis testing.The method of maximum likelihood that Fisher propounds (althoughit had been put forward by Daniel Bernoulliin 1777; see Kendall, 1961) justifiesthe choice of population parameter.The method simply says thatthe best estimateof the parameteris the value that maximizesthe probability of the observations,or that whathasoccurredis themost likely thing that could haveoccurred.A numberof writers, for example, Hogben(1957), have stated that this assumption cannot bejustified without an appealto inverse probability, so that Fisherdid not succeedin detaching inference from Bayes' Theorem.Nevertheless,the basicnotion is that we canfind the likelihood of a particular population parameter,say thecorrelationcoefficient R, by defining it as avalue thatis proportionalto the probability that from a population with that valuewe have obtaineda sample withan observed valuer.

BAYES OR p < .05?

83

There is no doubt that Fisher's argument will continueto be controversial and that many attemptsto resolve the ambiguitieswill be made. Dempster (1964) is among thosewho have enteredthe fray, and Hacking's (1965) rationalehasresultedin one well-known statistician (Bartlett, 1966) proposing that the resulting theorybe renamed "the Fisher-Hacking theory of fiducial inference."

BAYES OR p < .05? Criticism of Fisherian methods arrived almost as soon as they beganto be adopted.In more recent years, criticism of null hypothesis significance testing (NHST) appearsto have becomean 'area' in its own right. Berkson (1938) probably led theway, whenhe noted that: if the number of observationsis extremely large- for instanceon theorder of 200,000- thechi-squareP will besmall beyondanyusual levelof significance ... For we mayassume thatit is practically certain thatany seriesof real observations doesnot follow a normal curve with absolute exactitude in all respects. (p. 526)

Berkson was expressingin essence what many, many other writers in the following 60 years have observed: that when HQ is expressedas anexact null hypothesis (zero difference or no relationship) then very small deviations from this (dare one sayit?) pedantic viewwill be declared significant] Someof the more recently-expressed doubts have been brought together by Henkel and Morrison (1970),but thepapers they collected together were most certainly not the last word. Indeed, Hunter (1997) called for a ban onNHST, and reportedthat no lessa body thanthe American Psychological Association has a"committee looking into whether the use of thesignificancetest should be discouraged"(p. 6)! The main criticisms, endlessly repeated, are easily listed. NHST does not offer any way oftestingthe alternativeor research hypothesis; the null hypothesis is usually falseand when differencesor relationships are trivial, large samples will lead to its rejection; the method discourages replication and encourages one-shot research; the inferential model dependson assumptions about hypothetical populations and data that cannot be verified; and there are more. Someof the criticismsarevalid andneedto befaced more carefully, even more boldly than they are, while some seem to be'strawmen' set up to beblown down. For example,the fact that we only reject HQ and do nottest H^ is not particularly satisfying,but the notion that that inference is faulty becausein some way it rests on characteristicsof populationsnot observedor not yet observedis merely statinga fact of inference! Moreover, mostof the criticism

84

7. PRACTICAL INFERENCE

would be diluted, if not eliminated,if greater attention to parametersof a test other thanalpha level were considered, that is to say, effect size, sample size, and power. Even Fisher's most supportive and ardent colleague, Frank Yates (1964) said: The mostcommonly occurringweaknessin the application of Fisherianmethodsis, I think, undue emphasis on testsof significanceandfailureto recognize thatin many typesof experimentalwork estimatesof thetreatmenteffects, togetherwith estimates of the errorsto which theyaresubjectare thequantitiesof primary interest.To some extent Fisher himselfis to blame for this. Thusin The Design of Experimentshe wrote: "Every experimentmay besaid to exist onlyin order to give the factsa chance of disproving the null hypothesis."(p. 320)

Yates goeson to saythat the null hypothesis,as usually expressed,is "certainly untrue"andthat: suchexperiments[variety andfertilizer trials] are infact undertaken with the different purposeof assessing the magnitudeof the effects... [Fisher]did n o t . .. sufficiently emphasisethe point in his writings. Scientistswere thusencouragedto expectfinal and definitive answers...some of them, indeed, came to regardthe achievementof a significant resultas an end initself, (p. 320)

Although a numberof modificationsof, or alternativesto, NHST have been suggested (see, for example,confidenceintervals, discussed in chapter13, p.. 199) by far themost popularof the suggested replacements (rather than salvage operationsfor the Fisherian model)is Bayesian analysis. The claim is usually made that Bayesian methods areconcernedwith the alternative hypothesis, that they encourage replication, that, in fact, they reflect more clearly the traditional 'scientific method.' Curiously enough, it is also often claimed that despitethe so-called subjectivityof Bayesian priors,a Bayesian analysis will arrive at the same conclusion as a'classical' analysis. This leaves thejourneyman psychological researcher with nothing but thetired protest that nothing hasbeenoffered as arewardfor changingfrom the familiar routines!

8 Sampling

and Estimation

RANDOMNESS AND RANDOM NUMBERS In common parlance,to choose"at random" is to choose without bias, to make the actof choosingonewithout purpose even though the eventual outcomemay be usedfor a decision.The emphasis here is on the actrather thanthe outcome. It is certainly not true to saythat ordinaryfolk accept that random choice means absenceof design in the outcome. Politicians, examining the polls, have been known to remark thatthe result was "in thecards."Primitive notions of fate, demons, guardian angels, and modern appealsto the "will of God" often lie behind the drawing of lots and thetossingof coins to "decide" if Mary loves John,or to take one courseof action rather than another. It is absenceof design in the mannerof choosing thatis important.The point mustnot belabored,but everyday notionsof chanceare still construed, evenby the most sophisticated, in ways thatare not too farremovedfrom the random elementin divination and sortilege practicedby theoraclesandpriestswho were consultedby our remote, andnot-so-remote, ancestors. And, asalready noted, perhaps one of thereasons for the delay in the developmentof the probability calculusarose from a reluctanceto attemptto "second-guess" the gods. The concepts of randomnessand probability are, therefore, inextricably intertwined in statistics.The difficulties inherentin defining probability, which were discussed earlier, are once more presented in an examinationof randomness. It is commonly thought that everyone knows what randomness is. The great statistician Jerzy Neyman (1894-1981)states, "The method of random sampling consists,as it is known, in taking at random elementsfrom the population whichit is intendedto study. The elements compose a sample which is then studied" (Neyman, 1934, p. 567). It is unlikely that thisdefinition would begiven a high gradeby most teachers of statistics. Laterin this paper and,this inadequatedefinition notwithstanding, 85

86

8. SAMPLING AND ESTIMATION

it is apaperof centralandenormous importance, Neyman does note that random samplingmeans that each element in the population must have an equal chance of being chosen,but it is not uncommonfor writers on the methodto ignore both this simple directiveandinstructionsas to themeansof its implementation. Of course,the use of thewords "equalchance"in the definition bringsus back to what we mean by chanceandprobability. For most purposeswe fall back on the notion of long-run relative frequency, rather than leaving these constructs as undefined ideas that make intuitive sense. Practical tests of randomness rest on the examinationof the distributionof eventsin long series.It is alsothe case, as Kendall and Babington Smith (1938) point out, that random sampling proceduresmay follow a purposive process.For example,the numberTI is not a random number,but its calculation generates a seriesof digits that may be random. These authors are amongthe earliestto set out thebasesof random sampling from a straightforward practical viewpoint, and they were writinga mere 60 years ago.The conceptof a formal random sampleis a modern one; concerns aboutits placeand importancein scientific inferenceandsignificance testing paralleledthe developmentof methods of experimental designand technical approaches to theappraisalof data. Nevertheless, informal notionsof randomness that imply lack of partiality and"choiceby chance"go back many centuries. Stigler (1977) researched the procedure knownas "the trial of the Pyx," a samplinginspection procedure that has beenin existenceat the Royal Mint in London for almost eight centuries. Over a period of time, a coin wouldbe taken daily from the batch thathad been mintedand placedin a box called the Pyx. At intervals, sometimes separated by as much as 3 or 4years, the Pyx was openedand thecontents checked andassayedin order to ensure thatthe coinage met thespecifications laid downby the Crown. Stigler quotes Oresme on the procedurefollowed in 1280: When the Master of the Mint has brought the pence, coined, blanched and made ready,to the placeof trial, e.g. the Mint, he must put them all at once on the counter which is covered withcanvas. Then, whenthe pence have been well turned over and thoroughly mixedby the handsof the Master of the Mint and theChanger,let the Changertakea handful in themiddle of theheap,moving roundnine or tentimes in one direction or the other, until he hastaken six pounds. He must then distribute thesetwo or three times into four heaps, so that theyarewell mixed. (Stigler, 1977, p. 495)

The Masterof the Mint was alloweda marginof error calledthe remedyand had to make goodany deficit that was discovered. Although mathematical statistics playedno part in these tests,and if they had they would have been

RANDOMNESSAND RANDOM NUMBERS

87

1 more precise, the procedure itself mirrors modern practice.

The trial of the Pyxeven in the Middle Ages consistedof a sample being drawn, a null hypothesis (the standard) to betested,a two-sided alternative,and atest statistic and a critical region (the total weightof the coins and theremedy). The problem even carried with itselfa loss function which was easily interpretablein economic terms. (Stigler, 1977,p.499)

Random selectionsmay bemadefrom real, existent universes, for example, the population of Ontario, or from hypothetical universes,for example, individuals who,over a period of time, have takena particular drugas part of a clinical trial. In the latter case,to talk of "selection"in any real senseof the word is stretching credulity,but we do use the results basedon theindividuals actually examinedto make inferences about thepotential population thatmay begiven the drug. In the same way, the samples of convenience thatare used in psychological research are hardly ever selected randomly in the formal sense. Undergraduate student volunteers are notlabeledas automatically constituting random samples,but they are often assumedto be unbiased withrespectto the dependent variables of interest,anassumption that hasproduced much criticism. This latter statement emphasizes the fact that a sampling method, whatever it is, relatesto the universe under study and theparticular dependent variable or variablesof interest. A questionnaire asking about attitudes to healthandfitness given to membersof the audienceat, say,a symphony concertmay well be generalizableto thepopulationof the city, but thesame instrument given to the annual conventionof a weight watchers' club would not. These statements seem to be soobviousand yet,as weshall see,overlooking possible sources of bias either by accidentor design has led tosome expensive fiascos. Unfortunately, the method of sampling can never be assuredly independent of the variable under study. Kendalland Babington-Smith (1938) note that "The assumption of independence must therefore be made with moreor less confidenceon a priori grounds. It is part of the hypothesison which our ultimate expressionof opinion is based"(p. 152). Kendall and Babington Smith commenton the use of"random" digits in random sampling,and it isworth examining these applications because most of the present-day statistical cookbooks at least pay lip-serviceto the procedures by including tablesof random numbersand some instructionas to their use. Individual units in the populationare numberedin some convenient way, and 1 Stigler notesthat the remedywas specified on a perpound basisandthat all the coin weightswere combined. This,togetherwith the central limit theorem, almost guarantees that the Master would not exceedthe remedy.

88

8. SAMPLING AND ESTIMATION

then numbers, taken from the tables,arematchedwith the individualsto select the sample. Thisproceduremay result in a sampleof individuals numbered1, 2, 3,4, 5,6, 7, 8,9, or 2,4,6,8, 10, 12,14,16, 18,20, groupings that mayinvite comment because they follow immediately recognizable orders but that nevertheless couldbe generatedby a random selection.The fact that a random selection producesa grouping that looksto bebiasedor non-representative led to a great dealof debatein the 1920sand 1930s. The sequence 1,4,2,7,9, has the appearanceof being randombut the sequence1, 4, 2, 7, 9, 1, 4, 2, 7, 9, has not. Finite sequencesof random numbersare therefore only locally random. Even the famous tablesof one million random digits produced by the RAND Corporation (1965)can only, strictly speaking,be regardedas locally random, for it may bethat the sequenceof onemillion wasaboutto repeat itself. Random sequencesmay also be "patchy." Yule (1938b),for example,after examining Tippett's tables, which hadbeen constructed by taking numbersat random from census reports, gained the impression that they were rather patchy and proceededto apply furtherteststhat gavesome supportto hisview. Tippett (1925) usedhis tables (not then published) to examine experimentally his work on the distribution of the range. Simpletestsfor local randomnessareoutlined by Kendalland BabingtonSmith, tests that Tippett's tables hadpassed,andalthough much more extensive appraisalscan bemade todayby using computers, these prescriptions illustrate the sort of criteria that shouldbe applied. Each digit should appear approximatelythe same number of times; no digit should tendto befollowed by another digit; there are certain expectationsin blocks of digits with regard to the occurrenceof three, four, or five digits thatare all thesame; thereare certain expectations regarding thegapsin the sequence between digits that are thesame. Testsof this sortdo notexhaustthe many thatcan beapplied. Whitney(1984) hasrecently noted that "It hasbeen said that more time hasbeen spent generating and testing random numbers than using them" (p. 129). COMBINING OBSERVATIONS The notion of using a measureof the averageas anadequateand convenient summary or description of a numberof data is an acceptedpart of everyday discourse.We speakof average incomes and average prices, of average speeds and averagegas consumption,of averagemen andaverage women.We are referring to some middling, nonextreme value that we take as a fair representation of our observations.The easy use of theterm doesnot reflect the logical problems associated with the justification for the use ofparticular measuresof the average.The term itself, we find from the Oxford English Dictionary, refers, among other things, to notions of sharing laboror risk, so

COMBINING OBSERVATIONS

89

old forms of the word referto work doneby tenantsfor a feudal superioror to shared risks among merchants for lossesat sea. Modern conceptions of the word include the notion of a sharingor eveningout over a rangeof disparate values. A large seriesof observationsor measurements of the same phenomenon producesa distributionof values. Giventhe assumption thata single true value does,in fact, exist,the presenceof different valuesin the series shows that there are errors. The assumption that over-estimations are aslikely as under-estimations would provide supportfor the use of themiddle valueof the series,the median,asrepresentingthetrue value.Theassumption that thevaluewe observe most frequently is likely the true valuejustifies the use of themode,and the "evening out" of the different valuesis seenin the use of thearithmetic mean. It is this latter measure that is now thestatisticof choice when observations are combined,andthereare anumberof strandsin our history that have contributed to thejustification for its use.The employmentof the arithmetic meanand the use of the lawof error have sound mathematical underpinnings in the Principle of Least Squares,which will be considered later. First, however, a somewhat critical look at the use of themeanis in order. The Arithmetic Mean In the 1755 Philosophical Transactions, and in arevision in Miscellaneous Tracts of 1757, Thomas Simpson argues the casefor the arithmetic meanin An Attempt to Show the Advantage arisingby Taking the Mean Of a Number of Observations in Practical Astronomy.Theseare valuable contributions that discuss,for the first time, measurement errors in the context of probabilityand point the way toward the idea of a law of facility of error. Simpson (1755) complainsthat "somepersons,of considerable note, have been of opinion,and even publickly maintained, that onesingle observation, taken with duecare,was as muchto berelied uponas theMean of a great number" (pp.82-83). In the revision, Simpson (1757) statesas axioms that positiveand negative errors are equally probableand that thereare assignablelimits within which errors can betaken to fall. He also diagramsthe law of error as anisosceles triangleandshows thatthe meanis nearerto thetrue value thana single random observation.The claim of the arithmetic meanto be thebest representation of a large bodyof data is often justified by appealto the principle of least squares and the law oferror. This is high theory,and inappealingto it there is a danger of overlookingthe logic of the use of themean. Simply,asJohn Venn (1891), who wasmentionedin chapter 1, puts it: Why do we resortto averagesat all? How can asingle introductionof our own, andthat a fictitious one, possiblytakethe

90

8. SAMPLING AND ESTIMATION place of the many values which were actually given to us? And theanswersurely is, that it can notpossibly do so; the onething cannot takethe place of the other for purposesin general,but only for this or that specific purpose, (pp. 429-430)

This seemingly obvious statement is onethat hasfrequently been ignoredin statisticsin the social sciences. Venn points out thedifferent kindsof averages that can beusedfor different purposesandnotes cases where the use of anysort of averageis misleading. Edgeworth (1887) hadprovidedanattemptto examine the mathematical justifications for the different averagesand Venn and many others referto this treatment. Edgeworth's paper also examines the validity of the least squaresprinciple in this context. Venn illustrates his argument with some straightforward examples. If two people reckonedthe distanceof Cambridge from London to be 50 and 60miles, in the absenceof any information that would leadus to suspect eitherof the measures,one would guess that55 miles was theprobabledistance.However,if oneperson said that someone they knew lived in Oxford and another thatthe individual lived in Cambridge,the most probable location would not be atsome placein between.In the latter case, in the absenceof any other information, one would havea chanceat arriving at the truth by choosingat random. Edgeworth's paperson the best mean represent some of his most useful work. A particular serviceis rendered by his distinction between realor objective meansand fictitious or subjective means. The former arise whenwe use thearithmetic meanas thetrue valueunderlying a groupof measurements that are subjectto error; the latter is a descriptionof a set. The mean of observations is a cause,as it were the source from which diverging errorsemanate.The meanof statistics is adescription,a representativeof the group, that quantity which,if we must in practice put onequantityfor many, minimisesthe error unavoidably attending such practice. Observationsare different copiesof one original; statisticsare different originals affording one 'genericportrait.' (Edgeworth, 1887,p. 139)

This formal distinctionis clear. However,becausethe mathematicsof the analysisof errors and themanipulationsof modern statistics rest on the same principles, the logic of inference is sometimes clouded.It is Quetelet who brought the mathematicsof error estimationin physical measurement into the assessmentof thedispersionof humancharacteristics. Clearly, something of the sort had been done before in the examinationof mortality tablesfor insurance purposes,but we seeQuetelet makinga direct statement that only partially recognizes Edgeworth's later distinction: Everything occursthen as though there existeda type of man, from whichall other men differed moreor less. Nature hasplaced beforeour eyesliving examples of

COMBINING OBSERVATIONS

91

what theory showsus. Each peoplepresentsits mean, and thedifferent variations from this meanwhich may becalculateda priori. (Quetelet, 1835/1849,p. 96)

We notedin chapter1 that it was Quetelet'sview that the averagevalue of mentalandmoral, aswell asphysical, characteristics represented the idealtype, I'homme moyen (the average man) for which Nature was constantly striving. Quetelet'sown pronouncements were somewhat inconsistent in that he sometimes promotedthe view of the "average being"as auniversal biological type and at other times suggested that the average differed across groupsand circumstances. Nevertheless, his methods werethe forerunnersof work that attemptsto establish"norms" in biology, anthropology,and thepsychologyof individual differences.One of therequirementsof a "good" psychometric test is that it be accompaniedby norms. These norms provide standards for comparisonsacrossindividual test scores and, just asQuetelet's characterization of the average as the ideal type aroused opposition and controversy, so the establishmentof national normsand sub-group normsand racial norms produces heated debate today. The Principle of Least Squares In its best-known form, this famous principle states that the sum ofsquared differencesof observationsfrom the meanof those observations is aminimum; that is to say, it is smaller thanthe sum ofsquareddifferences from any other reference point. Legendre(1752-1833)announcedthe Principle of Least Squaresin 1805, but in Theoria Motus Corporum Coelestium, published in 1809, Gauss (1777-1855)discussingthe method, refersto his work on it in 1795 (whenhe was a 17-year-old student preparing for his university studies). This claimto priority upset Legendre and led tosome bitter dispute. Today the methodis most frequently associated with Gauss, who isoften identified as thegreatest mathematician of all time. That the method veryquickly becamea topic of much commentaryand discussionmay bedemonstratedby the fact that in the 70 or so years followingLegendre'spublication no lessthan 193 authors produced 72 books, 23 parts of books,and 313memoirs relatingto it. Merriman (1877) providesa list of these titles together with some notes. Faced with such sources, to say nothing of the derivations, papers, commentaries, and monographs published in the last 110years,the following represents perhaps one of adozen ways of commentingon its origins. Using the meanto combinea set ofindependent observations was atechniquethat had been usedin the 17th century. Gauss later examined the problem of selecting from a numberof possible waysof combining datathe one that producedthe least possible uncertainty about the "true value." Gauss noted

92

8. SAMPLING AND ESTIMATION

that in thecombinationof astronomical observations themathematical treatment dependedon themethodof combination.He approachedthe problemfrom the standpoint thatthe method should lead to the cancellationof random errorsof measurementandthat, astherewas noreasonto preferonemethod over another, the arithmetic mean should be chosen.Having appreciated thatthe "best,"in the senseof "most probable,"value couldnot beknown unlessthe distribution of errorswasknown,heturnedto anexaminationof thedistribution, which gave the meanas themost probable value. Gauss's approach wasbasedon practical considerations,andbecausethe procedureshe examineddid produce workable solutionsin astronomicalandgeodetic observations, themethodwas vindicated. In fact, the principle of least squares, asGauss himself noted, can beconsidered independentlyof the calculusof probabilities. If we haven observationsX^X2 X3 ^ . . .Xn , from apopulation witha mean u,, what is the least squares estimate of u? It is thevalueof (i that minimizes,

NOW :

whereX is themeanof the n observa-

tions.

Clearly the right-hand sideof the last expressionis at aminimum when, X- u,, which demonstrates the principle. This easily obtained result provides a rationalefor estimatingthe population meanfrom the sample mean that is intuitively sensible.The law oferror enters the picture whenwe considerthe arithmetic meanas themost probable result. In this casewe find thatthe law is infact givenby thenormal distribution, often referredto as theLaplace-Gaussian distribution. It hasbeen statedon more than one occasionthat Laplaceassumedthe normal law in arriving at themean to provide whathe describedas themost advantageous combination of observations. Laplacecertainly consideredthe casewhen positiveandnegativeerrors are equally likely,andthesecasesrest on theerror law being whatwe nowcall "normal" in form. The threadsof the argumentare notalways easyto disentangle, but one of thebetter accounts,for thosewho arepreparedto grapple with a little mathematics,wasgiven by Glaisher as long ago as 1872. The crucial point is that of the rationalefor the two fundamentalconstructs of statistics,

COMBINING OBSERVATIONS

93

the mean,X,and the variance, I,(X- X) /n. Theessential factis easily seen. Given thata distribution of observationsis normal in form, and given thatwe know the mean and thevarianceof the distribution, thenthe distribution is completely and uniquely specified.All its propertiescan beascertained given this information. In the context of statisticsin the social sciences, both the normal law and the least squares principleare best understoodin the context of the linear model. The model encompasses these constructs and brings togetherin a formal sense the mathematicsof the combinationof observations developed for use inerror estimationand mathematical statistics as exemplifiedby analysisof variance and regression analysis. Representation and Bias The earliest examples of the use ofsamplesin social statisticswe have seenin the work of Graunt, Petty, Halley, and theearly actuaries. These samples were neither randomnor representative,and mistaken inferences were plentiful. In any event, it was not until much later, when attempts were made to collect information on populations, inferential exercises repeated, and the results compared, that critical examinations of thetechniques employed could be made. Stephan (1948) reports that Sir Frederick Morton Eden estimated thepopulation of Great Britainin 1800to beabout 9,000,000. This estimate, which wasbased on the numberof births and theaverage number of inhabitantsin eachhouse(a number that was obtained by sampling),was confirmedby the first British censusin 1801. Earlier attemptsto estimate populations had been madein France,and Laplacein 1802 madean attemptto do sothat followeda scheme he had devised and publishedin 1786, a scheme that included a probability measureof the precisionof the estimate. Specifically, Laplace averred that the odds were 1,161 to 1that the error wouldnot reach halfa million. Westergaard (1932) provides some more details of these exercises. Elsewhere in Europeand in the United Statesthe 19th centurysaw various censuses conducted, aswell as attemptsto estimatethe size of the populationfrom samples.In the United States the Constitution providedfor a census every10 years in order to determine Congressional representation, but a Bill introduced intothe British Parliamentin 1753to establishan annual census was defeatedin the Houseof Lords. It appears thatthe probability mathematicians and the rising group of political arithmeticians never joined forces in the 19th century. The latter favored the institutionof complete censuses and, generally speaking, not were mathematicians,and theformer were scientists who hadmany other problems

94

8. SAMPLING AND ESTIMATION

to test their mettle. Almost100years passed before scientific sampling procedures wereproperly investigated. Stephan (1948) lists four areas where modern sampling methods could have been used to advantage:agricultural crop and livestock estimates,economic statistics, social surveys and health surveys,and public opinion polls.The last will be considered herein a little more detail because it is in this areathat the accuracy of forecastsis so often and soquickly assessedand breakdownsin sampling procedures detected with the benefit of hindsight. The Raleigh Starin Raleigh, North Carolina, conducted "straw votes" as early as 1824, covering political meetings andtrying to discoverthe "senseof the people." By the turn of the century, many newspapers in the United States were regularly conducting opinion polls, a common method being merely to invite membersof the public to clip out aballot in the paperand tomail it in. The same basic procedure was followed by all thepublications. Thenthe largecirculation magazines, notably Literary Digest, began to mail out ballotsto very large numbersof people, sometimesas many as 11,000,000. In 1916 this publicationcorrectly predictedthe electionof Wilson to thepresidencyand from then until 1936 enjoyed consistent and much admired success.Its predictions were very accurate.For example,in 1932its estimateof the popular votefor each candidatein the presidential election came to within 1.4% of the actual outcome. In 1936 came disaster.For years Literary Digesthadconducted polls on all kinds of issues, mailingout millions of ballots,at considerable expense, to telephone subscribers and automobile owners.In 1936 the magazine predictedthat AlfredM. Landon wouldwin thepresidencyon thebasisof the return of over 2,300,000replies from over 10,000,000 mailed ballots. The record shows that Franklin D. Rooseveltwon thepresidency withone of thelargest majorities in American presidential history. The reasonsfor this disastrous mistake are noweasyto see. Priorto 1936, preferencefor the twomajor political parties in the United Stateswas notrelatedto level of income.In that yearit seems thatit was. The telephone subscribers and carowners (andin 1936 these were the rather moreaffluent) who hadreceivedthe Literary Digest ballots were, in the main, Republicans. In 1937the magazine ceased publication. Crum (1931) and Robinson (1932, 1937) gave commentaries on someof these early polls. Fortune magazine fared no better in 1948, underestimating the vote for the Democratsand Harry Trumanby close to 12%and, of course,failing to pick the winner. In 1936, with a much, much smaller sample than that of Literary Digest (less than 5,000),it had forecast Roosevelt's vote to within 1%. The explanation for its failure in 1948 restswith the swing of both decidedand undecided voters between the Septemberpoll and theNovember electionand failure to correct for a geographic sampling bias. Parten(1966) has some

SAMPLING IN THEORY AND PRACTICE

95

commentary on these and other polls. One of the most successful polling organizations, the American Institute of Public Opinion, headedby George Gallup, beganits work in 1933. But the Gallup poll also predicteda win for Dewey over Trumanin 1948,and many people have seen the Life photograph of a victorious president holding the Chicago Daily Tribune withits famous Type I error headline, "Dewey defeats Truman." The result producedone of thefirst claims thatthe polls influencedthe outcome, inducing complacency in the Republicansand asmall turnoutof their supporters, leading to the defeat of the Republican candidate. Today there is much controversy overthe conductingand the use ofpolls. Theywill survive becausein general theyare correct much more often than not. Their success is due to thedevelopmentof refined sampling techniques. SAMPLING IN THEORY AND IN PRACTICE Chang (1976) givesa quite thorough reviewof inferential processesand sampling theory,and it is worth sketchingin some of its developmentin the context of survey sampling. However, from the standpointof statistics and sampling in psychology, thereis no doubt but that the rationaleof sampling proceduresfor hypothesis testing rather than parameter estimation is of greater import. The two arerelated. The early political arithmeticians held to theview that statistical ratios - for example, malesto females, average number of children perfamily, and so on were approximately constant, and, as aresult, proceededto draw inferences from figures collectedin a single townor parish to whole regionsand even countries.The early 19th centurysaw theintroductionof the law of error by Gaussand Laplace and anawareness that variability was animportant consideration in the assessmentof data. Populations are nowdefinedby twoparameters, the meanand thevariance.One of theearliest attemptsto put ameasureof precision ontoa sampling exercisewas that of Laplacein 1802.The samplehe used was not random, althoughhe appearsto have assumed that it was. Communes distributed across France were chosen to balanceout climatic differences. Those having mayors known for their"zeal and intelligence" were also selectedso that the data wouldbe themost precise. Laplace also assumed that birth ratewas homogeneousacrossthe French population, exactly the sort of unwarranted assumption that was madeby theearly political arithmeticians. Nevertheless, Laplace estimated the population total from his figures and, appealing to the central limit theorem (whichhe had discussedin 1783), approximatedthe distribution of estimationerrorsto thenormal curve. Survey samplingfor all kinds of social investigations owes a great dealto Sir Arthur Lyon Bowley(1869-1957). In his time he was recognized as a

96

8. SAMPLING AND ESTIMATION

pioneerin the definition of sampling techniques, and hismethodsand assumptions werethe subjectof much debate.In the event, someof his approaches were found to bedefective,but hiswork focussed attention on theproblem. In 1926 he summarizedthe theory of sampling,and ashort paperof 1936 outlines the application of samplingto economicand social problems.He servedon numerous official committees that investigated the economic stateof the nation,social effects of unemploymentand so on, andworked directlyon many surveys. Maunder (1972)has written a memoir that pointsup Bowley's contributions, contributions that were somewhat overshadowed by thework of his contemporaries Pearson, Fisher, andNeyman.He was acalm and courteous man, enormously concerned with social issues, and heoccupied,at theLondon School of Economics,the first Chair devotedto statisticsin the social sciences. Bowley (1936) definesthe sampling problem simply: We are hereconcerned. . . with the investigationof the numerical structureof an actual andlimited universe,or "population" whichis thebetter wordfor our purpose. Our problems are quite definitely to infer the population from the sample. The problem is strictly analogousto that of estimatingthe proportion of the various colours of balls in a limited urn on thebasisof one ormore trial draws, (pp.474-475)

In the early yearsof this century, Bowley began to examine boththe practice and theory of survey sampling. His work helpedto highlight the utility of probability samplingof oneform or another. Systematic selection was adopted and advocatedby A. N. Kiaer(1838-1919), Director of theNorwegian Bureau of Statistics,in his examinationsof census data,but themajority of influential statisticians, represented by the InternationalStatisticalInstitute, rejected sampling, pressingfor complete enumeration. It took almost30 yearsfor theutility and benefits of the methodsto be appreciated. Seng (1951) and Kruskal and Mosteller (1979) give accounts of this most interesting periodin statistical history. The latter authors givea translationand paraphraseof the remarks of Georg von Mayer, Professorat the University of Munich, on Kiaer's workon the representative method, which was presentedat ameetingof the Institutein Berne in 1895: I regardas most dangerousthe point of view found in his work. I understand that representative samples can have some value,but it is a value restrictedto terrain already illuminated by full coverage. One cannot replaceby calculation the real observationof facts. A sample provides statistics for the units actuallyobserved,but not true statisticsfor the entire terrain. It is especially dangerousto proposerepresentative sampling in the midst of an assembly of statisticians. Perhaps for legislative or administrativegoals sampling may haveuses- but onemust never forget that it cannot replacea complete survey. It is necessaryto addthat thereis among us these daysa current in the minds of

SAMPLING IN THEORY AND PRACTICE

97

mathematicians that would, in many ways, haveus calculate rather thanobserve. We must remainfirm and say: no calculations when observations can bemade, (von Mayer, quotedby Kruskal & Mosteller, 1979,pp. 174-175)

Oddly enough, Kiaer's work is not mathematicalin the sense that modern methodsof parameter estimation aremathematical.At the time those methods were not fully delineatednor understood.Kiaer aimed, by a variety of techniques,to producea miniatureof the population, although he noted as early as 1899 the necessityfor the investigationof both the practical and theoretical aspectsof his methods. At a meetingin 1901(a report of which waspublished in 1903), Kiaer returnedto the theme and it was in adiscussion of his contribution that L. von Bortkiewicz suggested that the "calculus of probabilities" couldbe usedto test the efficacy of sampling. By establishinghow much of a difference between sample and population could be obtainedaccidentally and checking whetheror not anobserveddifference lay outside those limits, the representativeness of the sample couldbe decidedon. Bortkiwiecz did not, apparently, formulate all the necessary tests, and othershad employed this method, but he seemsto have beenthe first to draw the attention of practicing statisticiansto thepossibilities. In 1903, Kiaer must have thought that the sampling argument was won, for a subcommitteeof the International StatisticalInstituteproposeda resolutionat the Berlin meeting: The Committee, considering that the correct application of the representative method, in a certain numberof cases,can furnish exact and detailed observations from which the resultscan begeneralized,within certain limits, recommendsits use, provided thatin thepublicationof the resultsthe conditions under which the selection of the observation unitsis madeare completely specified.The question willbe kept on the agenda,sothat a report may bepresentedin the next sessionon the application of the method in practiceand on thevalueof the results arrivedat. (quotedby Seng, 1951, p. 230)

What is more, a discussantat the meeting,the French statistician Lucien March, returnedto the ideas thathad been put forward by Bortkiewicz and outlined someof the basicsof probability sampling (see Kruskal & Mosteller, 1979, for a short summaryof this presentation).The wayahead seemed clear. In fact the question was,for all intentsand purposes, shelved for more than 20 years,and it was notuntil 1925,at theRome sessionof the Institute, thatthe advantagesof the sampling method were fully recognized. Thiswas in nosmall way due to thetheoretical workof Bowley. Bowley had suggestedin his Presidential address to the Economic Scienceand Statistical Sectionof the British Associationas early as 1906 thata systematic approach to the problem

98

8. SAMPLING AND ESTIMATION

of sampling would bearfruit: In general,two linesof analysisarepossible:we may find anempiricalformula (with Professor Karl Pearson) which fits this classof observations [Bowleyis referringto data thatmay not benormally distributed],and byevaluatingthe constants determine an appropriate curve of frequency,andhence allotthe chancesof possible differences betweenour observationand theunknown true value; or we mayaccept Professor Edgeworth's analysis of thecauseswhich would producehis generalisedlaw of great numbers,and determinea priori or by experiment whether this universal law may be expectedor is to befound in the casein question. (Bowley, 1906, pp. 549-550)

Edgeworth's methodis basedon the Central Limit Theorem,and Bowley explains its utility clearly and simply: If quantitiesare distributed accordingto almost any curve of frequency,... the averageof successive groups o f. . .these conformto anormal curve (the more and more closelyas n isincreased) whose standard deviation diminishes in inverse ratio to thenumberin each sample...If we canapply this method...,we areableto give not only a numerical average, but areasoned estimate for the real physical quantity of which the averageis a local or temporary instance. (Bowley, 1906, p. 550)

The procedure demands random sampling: The chancesare thesamefor all the itemsof the groups to besampled,and the way they are takenis absolutely independent of their magnitude. It is frequently impossibleto covera whole areaas thecensusdoes,...but it is not necessary.We canobtain as good resultsas wepleaseby sampling,and very often quite small samplesare enough;the only difficulty is to ensure that every person or thing has thesame chanceof inclusion in the investigation. (Bowley, 1906,pp. 551-553)

THE THEORY OF ESTIMATION Over the next 20 years, Bowleyand his associates completed a number of surveys, and his theoretical researches produced The Measurement of the Precision Attainedin Samplingin 1926. This paper formed part of the report of the International Statistical Institute, which recommended and drew attention to the methods of random selectionand purposive selection:"A number of groupsof units areselected which together yield nearly the same characteristics as the totality" (p. 2). The report does not directly addressthe method of stratified sampling, even though the techniquehad been in general use. This procedure received close attention from Neymanin his paperof 1934. Bowley had attemptedto present a theory of purposive samplingin his 1926 report.

THE THEORY OF ESTIMATION

99

A distinctive featureof this method, according to Bowley, was that it was acaseof cluster sampling.It wasassumed that the quantity under investigation was correlated with a numberof characters,called controls,and that the regressionof the cluster meansof thequantityon thoseof each controlwaslinear. Clusters were to be selected in such a waythat averageof each control computed from the chosen clusters should (approximately) equalits population mean.It was hoped that,due to theassumed correlations between controls and thequantity under investigation, the above method of selection would result in arepresentative sample with respectto thequantity under investigation. (Chang, 1976, pp. 305-306)

Unfortunately,a practical testof the method (Gini, 1928) proved unsatisfactory and Neyman's analysis concluded that it was not aconsistentnor an efficient procedure. As Neyman pointed out, the problemof samplingis theproblemof estimation. The first forays intothe establishment of atheory hadbeen madeby Fisher (1921a, 1922b, 1925b),but themannerof samplinghadreceived little attention from him. The methodof maximum likelihood, which rested entirelyon the propertiesof the distribution of observations, gave the mostefficient estimate. Any appealto thepropertiesof theapriori distribution- theBayesian approach - wasrejectedby Fisher. Neyman attempted to clarify the situation: We are interestedin characteristicsof a certain population, say, n , . . . it hasbeen usually assumed that the accurate solution of sucha problem requires the knowledge of probabilitiesa priori attachedto different admissible hypotheses concerning the valuesof the collective characters [the parameters] of the populationn. (Neyman, 1934, p. 561)

He then turnsto Bowley's work,notingthatwhenthepopulationn isknown, then questions about the sort of samples thatit could producecan beanswered from "the safe groundof classical theoryof probability"(p. 561). The second question involvesthe determination, when we know the sample,of the probabilities a posteriori to beascribedto hypotheses concerning the populations. Bowley's conclusions arebased: on some quite arbitrary hypotheses concerning the probabilities a priori, and Professor Bowley accompanies his results withthe following remark: "It is to be emphasized thatthe inference thus formulated is based on assumptions thatare difficult to verify andwhich are notapplicablein all cases."(Neyman, 1934,p. 562)

Neymanthen suggests that Fisher's approach (that involving the notion of fiducial probability, although Neyman does not use theterm) "removesthe difficulties involved in the lack of knowledgeof the apriori probability law"

100

8. SAMPLING AND ESTIMATION

(p. 562). He further suggests that these approaches have been misunderstood, due, he thinks,to Fisher's condensed form of explanationanddifficult method of attacking the problem: The form of the solution consistsin determining certain intervals which I propose to call confidenceintervals..., in which we mayassumearecontainedthe values of the estimated charactersof the population, the probability of the error in a statementof this sort being equal to or less than1 - e ,where e is anynumber0 < e < 1,chosenin advance.The numbers I call the confidence coefficient. (Neyman, 1934, p. 562)

Neyman's commentson Fisher's abilityto explain his view produced,in the discussionof his paper, the first (mild) reaction from Fisher. Subsequent reactionsto Neyman's workandthat of his collaborator, Egon Pearson, became increasingly vitriolic. The report reads: Dr Fisher thoughtDr Neyman mustbe mistaken in thinking the term fiducial probability [Neyman had used the term "confidence coefficient"]had led to any misunderstanding;he had notcome uponany signsof it in the literature. WhenDr Neyman said"it really cannot be distinguished fromthe ordinary concept of probability," Dr Fisher agreedwith him ... Hequalified it from the first with the word fiducial... Dr Neyman qualified it with the word confidence. The meaning was evidently the same,and he did notwish to deny that confidence could be used adjectivally. They wereall too familiar with it, as Professor Bowleyhad reminded them, in the phrase"confidence trick." (discussion on Dr Neyman'spaper, 1934, p. 617)

From the standpointof the familiar statistical procedures found in our texts, this paperis importantfor its treatmentof confidence intervalsand itsemphasis of the importanceof random sampling.It extended estimation from so-called point estimation,the use of asample value to infer a populationvalue,to interval estimation, whichassesses the probability of a range of values. Neyman demonstratesthe use of theMarkov2 method for deriving the best linear Andrei Andreyevich Markov(or Markoff, 1856-1922)is best knownfor his studiesof the probabilities of linked chainsof events.Markov chains have been used in a variety of social and biological studiesin the last 30 or 40years.But Markov made many contributionsto probability theory. If we havea random variableX, then regardless of its distribution,for any positive numberc (i.e. c > 0), theprobability, Px(X > cu), thatthe random variable X is greater thanc timesits expected valueux = u doesnot exceed lie. Thatis, Px(X > c^) < \lc. This is knownas theMarkov Inequality. Markov was astudentof Pafnuti LTchebycheff(sometimes spelledChebychev or Chebichev,1821-1894),who formulated the Tchebycheff Inequality which states that, Px(X< n - da or X > + da) = tfwhereX is a random variable with expected value u andvariance a2, and d> 0. This resultwas independently arrived at by theFrench mathematician I.J. Bienayme (1796-1876).These inequalities areimportant in the developmentof the descriptionsof the propertiesof probability distributions.

THE BATTLE FOR RANDOMIZATION

101

unbiased estimators.It also contains other important ideas, in particular, a discussion of the methods of stratified samplingand appropriate statistical modelsfor it. Neyman's paper marks a new era inboth the methodandtheory of sampling, although,at the time, it was its treatment of the problem of estimation that received the most attention.In a senseit complementedand supplementedthe work of Ronald Fisher thatwas going on atRothamsted,but it became evident that Fisher did not quitesee itthat way. THE BATTLE FOR RANDOMIZATION There is no doubt thatthe requirement that samples should be randomly drawn wasthoughtof by thesurvey-makersas aprotection against selection bias. And there is also no doubt that when sample size is large it affords such protection, but not, it must be stressed,a guarantee. In agricultural research,it had long been recognized that reduction of experimentalerror was ofcritical importance.At Rothamstedtwo methods were available: repeating the experiments over many years, and multiplying the number of plots on a field. Mercer and Hall (1911) discussthe problem in considerable detail andgive suggestions for arrangingthe plots sothat theymay be "scattered."This was theapproach thatwasabandoned, although not immediately, when Fisher started his important workat the Station. Eventually,for Fisherand hiscoworkersthe argumentfor randomizationhad aquite different motive from that of trying to obtain a representative sample, one that is crucial for an appreciationof the use ofstatisticsin psychology. Fisher, although a brilliant mathematician,was apractical statistician,and hisapproachto statistics can only be understood through his work on thedesignof experimentsand the analysis of the resultant data.The core of Fisher's argument rests on the contention thatthe valueof an experiment depends on thevalid estimationof error, an argument that everyone would agree with. But how was theestimate to be made? In nearly all systematic arrangements of replicated plotscareis takento put theunlike plots as close togetheras possible, and thelike plots consequentlyas far apart as possible,thus introducinga flagrant violationof the conditions upon whicha valid estimate is possible. One way ofmaking sure thata valid estimateof error will be obtainedis to arrange the plots deliberatelyat random. The estimateof error is valid, because,if we imagine a large numberof different results obtainedby different random arrangements,the ratio of the real to the estimatederror, calculated afreshfor each of these arrangements, will be actually distributed in the theoretical distributionby which the significance of the result is tested. Whereasif a group of arrangementsis chosen such that the real errorsin this group are on thewhole lessthan thoseappropriateto random arrangements,it has

102

8. SAMPLING AND ESTIMATION

now beendemonstrated that the errors, asestimated, will,in sucha group, be higher than is usualin random arrangements, andthat, in consequence, within such a group, the test of significanceis vitiated. (Fisher,1926b,pp. 506-507)

The contribution of Fisher thatis overwhelmingly importantis the development of the t and ztestsand thegeneral technique of analysis of variance.The essenceof these proceduresis that they provide estimates of error in the observationsand theapplicationof testsof statistical significance.These methods were not immediately recognizedas being useful for larger-scale sample surveys,and it waspartly the work of Neyman (mentioned earlier) and others in the mid-1930s that, ironically, introduced them to this area. Argumentsabout randomized versus systematic designs began in the mid1920s. Mostly they revolved around the issueof what to do when therewas an unwanted assignmentof treatmentsto experimental units, thatis, when the assignmenthad apattern thatthe researcher either knew or suspected might confoundthe treatments. Fisher argued very strongly against the use ofsystematic designs,on thebasisof theory, but hisargumentwas notwholly consistent. Some had suggested that if a random design produced a pattern thenit should be discardedandanother random assignment drawn. Of course,thesubjectivity introducedby this sortof procedureis precisely thatof the deliberate balancing of the design. And how many draws mightone beallowed? Most experimenterson carrying out a random assignment of plots will be shocked to find out how farfrom equallythe plots distribute themselves... if the experimenter rejects the arrangement arrived at bychanceasaltogether "too bad,"or in other ways "cooks" the arrangementto suit his preconceived ideas,he will either (and most probably)increasethe standarderror as estimatedfrom the yields; or, if his luck or his judgment is bad, he will increase the real errors while diminishing his estimate of them. (Fisher, 1926b,pp. 509-510)

But even Fisher never quite escapedthe difficulty. Savage (1962) talked with him: "What would you do," I had asked, "if, drawinga Latin Squareat random for an experiment, you happenedto draw a Knut Vik square?"Sir Ronald said he thought he would draw againandthat, ideally, a theory explicitly excluding regularsquares should be developed. (Savage, 1962, p. 88)

Studentsandteachers cursing their way through statistical theory and practice should takesome comfortfrom the inconsistencies expressed by the master. The debate reached its height in argument between "Student" and Fisher. "Student"consistently advocated systematic arrangements. In a letter to Egon Pearson (Pearson, 1939a) written shortly before his death in 1937, "Student"

THE BATTLE FOR RANDOMIZATION

103

commentson work by Tedin (1931) thathad examinedthe outcomes when sytematic, as opposed to random, Latin squares were used in experimental designs. The "Knight's move" Latin squarehe prefers aboveall others:"It is interesting as anillustration of what actually happens when we depart from artificial randomisation:1would Knight's move every time!" (quotedby E. S. Pearson, 1939a, p. 248).3 Over the previous yeara seriesof papers, letters,and lettersof rebuttalhad come forth from "Student"andFisher. "Student"was adamantto the end,and Fisher reiteratedhis claim that valid error estimates cannot be computedin arranged designs andthat in such casesthe test of significanceis made ineffective. Picard (1980) describes anddiscussesthe argumentandalso examinesthe contributionsof Pearsonand Yates,who hadsucceeded Fisher at Rothamsted. Others enteredthe debate.Jeffreys (1939)is puzzled: Reading "Student's"paper [of 1937] and Fisher's Designof Experiments I find myself in almost complete agreement with both; and Ishould therefore have expected them to agreewith each other. But it seemsto me that "Student"is wrong in regarding Fisheras anadvocateof extremerandomness,andpossibly Fisherhas notsufficiently emphasizedthe amount of systemin his methods. (Jeffreys, 1939, p. 5)

Jeffreys makesthe point thatomitting to take accountof relevant information makes avoidable errors: The bestprocedureis to designthe work so as todetermineit [the error] as accurately as possibleand not toleave it to chance whetherit can bedeterminedat a l l . . .. The hypothesis is a considered proposition.. . The argument is inductive and not deductive; it is not dealt with by consideringan estimable error thathas nothing to do with it. (Jeffreys, 1939,p. 5)

As ever, Jeffreys' argumentis a paragonof logic, and it notes that Fisher's advice to balanceor eliminatethe larger systematiceffects as accuratelyas possible and randomizethe rest "sumsup thesituation very well"(p. 7). This is the prescription thatthe designof experimentsfollows today. E. S. Pearson (1938b) attempted to expandon and toclarify "Student's" stand,but heclearly understoodthe view of the opposition. Nevertheless, he concluded that balanced layouts could give some slight advantage. An illustration of "Two Knight's moves"would be DEABC B C D EA E A B CD C D E AB ABODE

104

8. SAMPLING AND ESTIMATION

Yates (1939),in a lengthy paper also goes over the wholeof "Student's"views on thematter,but hisconclusion supports the essenceof Fisher's views: The conclusionis reached thatin caseswhere Latin square designs can beused,and in many caseswhere randomized blocks have to beemployed,the gain in accuracy with systematic arrangements is not likely to be sufficiently great to outweigh the disadvantages to which systematic designs are subject. On the other hand, systematic arrangements may in certain casesgive decidedly greateraccuracy than randomized blocks, but it appears thatin suchcasesthe use of the modern devicesof confounding, quasi-factorial designs, or split-plot Latin squares whichare much more satisfactory statistically, are likely to give a similar gain in accuracy. (Yates, 1939, p. 464)

This bringsus to theapproachesof the present day. The realization that sampling was importantin psychologicalresearch,and that its techniqueshad been much neglected, waspresentedto the disciplineby McNemar (1940a).In an extensive discussion, he pointsout situations thatare still with us: One wonders, for instance, how many psychometric test scores for policeman, firemen, truck driversetal.have been interpreted by theclinician in termsof college sophomorenorms. In psychological research we aremorefrequently interestedin makingan inference regarding the likenessor difference of two differentially defined universes, such as two racial groups,or an experimentalvs. acontrol group. The writer venturesthe guessthat at least 90% of theresearchin psychology involves such comparisons. It is not only necessaryto considerthe problemof samplingin the caseof experimental and control groups,but alsoconvenientfrom the viewpointof both good experimentation and sound statisticsto do so. (McNemar, 1940a,p. 335)

This paper, which, regrettably, is not among those most widely cited in the psychological literature, should be required readingfor all those embarking on researchin any aspect of the social sciences.Its closing remarks contain a prediction thathasbeen most certainly fulfilled andwhose contentis dealt with shortly: The applicability in psychologyof certain of ProfessorR. A. Fisher's designs should be examined. Eventually, the analysisof variancewill come intouse inpsychological research.(McNemar, 1940a,p. 363)

For themoment,no more needsto besaid.

9 Sampling Distributions Large setsof elementary events are commonly called populationsor universes in statistics,but the settheory term sample space is perhapsmore descriptive. The term population distribution refers to the distributionof the valuesof the possible observationsin the sample space. Although the characteristicsor parameters of the population(e.g., the mean,\i, or the standard deviation,a) are of both practicaland theoretical interest, these values are rarely, if ever, known precisely. Estimatesof the values are obtained from corresponding sample values,the statistics. Clearly,for a sample of a given size drawn randomly froma samplespace,a distributionof valuesof a particular summary statistic exists. This simple statement defines a sampling distribution.In statistical practice it is the propertiesof these distributions that guides our inferences about propertiesof populationsof actualor potential observations.In chapter 6 thebinomial, the Poisson,and thenormal distributionswere discussed.Now that samplinghas been examinedin some detail, three other distributions and the statistical tests associated with them are reviewed. THE CHI SQUARE DISTRIBUTION The developmentof the j^ (chi-square) testof "goodness-of-fit" represents one of the most important breakthroughs in the history of statistics, certainlyas important as thedevelopmentof the mathematical foundations of regression. The fact that both creationsare attributableto the work of one man,1 Karl MacKenzie (1981) givesa brief accountof Arthur Black (1851-1893),a tragic figure who,on his death, left a lengthy, and now lost, manuscript, Algebraof Animal Evolution,which was sent to Karl Pearson. "Pearsonstarted to read it, but realized immediately thatit discussed topics very similar to thosehe wasworking on, anddecidednot toreadit himself but to sendit to Francis Galtonfor his advice" (p. 99).Of great interest is that buried among Black's notebooks, which have survived, is a derivation of the chi-square approximation to the multinomial distribution.

105

106

9. SAMPLING DISTRIBUTIONS

Fig. 9.1 Chi- Square for 4 and 10 Degrees of Freedom Pearson,is impressiveattestationto hisrole in thediscipline. Thereare anumber of routesby whichthetestcan beapproached,but thepath thathasbeen followed thus far is continued here.This path leads directlyto the work of Pearsonand Fisher, who did not make use, and,it seems, werein general unaware,of the earlier work on goodness-of-fitby mathematiciansin Europe. Before looking at thedevelopmentof the test of goodness-of-fitthe structureof the chi-square distribution itself is worth examining. Figure9.1 showstwo chi-square distributions. Givena normally distributed population of scoresY with a meanu, and a variance a2, suppose that samples of size n = 1 aredrawnfrom the distribution and each scoreis convertedto its corresponding standard score 2. 2

z - (7-fj,)/a and x = (Y-\\) /G definesthechi-square distribution with one degreeof freedom. If samples of n = 2 aredrawn, thenx,2 is given by 2

-,

2

T

(Y] - n) /a + (F2 - u,)/a". In fact if n independent measures aretaken ran2 2 domly from a distribution with mean\i, and variancea , % is defined as the sum of the squaredstandardizedscores:

CHI-SQUARE

107

This is the essenceof the distribution2 that Pearson used in his formulationof the test of goodness-of-fit. Why wassucha test so necessary? Gamesof chanceand observational errorsin astronomyand surveying were subject to the random processesthat led scientistsin the 18th and early 19th centuriesto the examinationof error distributions.The quest was fora sound mathematical basisfor the exerciseof estimating true values. Simpson introduced a triangular distributionand Daniel Bernoulli in 1777 suggesteda semicircular one.In the absenceof empirical data, these distributions, established on apriori grounds, were somewhat arbitrarily regarded as having no more and noless claimto accuracyand utility. But the 19th centurysaw the normal law established. It had powerful credentials because of the fame of its two main developers, Gauss andLaplace. Startingfrom the assumption thatthe arithmetic mean represented the true value, Gauss showed that the error distribution was normal. Laplace, startingfrom the view that every individual observation arisesfrom a very great numberof independently acting random factors (the essence of the central limit theorem), cameto the same result. Gauss's proofof the methodof least squares further establishedthe importance of the normal distribution,and when, in 1801,he usedthe initial data collected 3 from observationson a newplanet, Ceres, to accurately predict where it would be observed laterin the year, these procedures, as well as Gauss'sreputation, were firmly establishedin astronomy. In astronomyaswell as inbiology and social science,the Laplace-Gaussian distribution wasindeed law,it was indeed regarded asnormal. This prescription led to Quetelet'suse of it as a"double-edged sword" (see chapter 1) and led to many astronomers using it as areasonto reject observations that were considered to be doubtful, for example Peirce (1852). Quetelet's procedure for establishingthe "fit" of the normal curvewas thesame as that of the early astronomers.The tabulation,and later the graphing,of observedand expected frequenciesled to their being comparedby nothing more than visual inspection (see, e.g., Airy, 1861). 2

A history of the mathematicsof the x2 distribution would includethe developmentof the gamma function by the French mathematician 1. J. Bienayme(1796-1876),who, in the 1850s,found a statistic that is equivalentto the Pearson X in the contextof least squares theory. Pearson was apparentlynot awareof his work; nor were F. R. Helmertand E.Abbe, who,in the 1860sand 1870s, also arrived at the X

distributionfor the sum ofsquaresof independentnormal variates. Long after the test had become

commonly used,von Mises (1919)linked Bienayme's workto the PearsonX2. Details of this aspectof the test and distribution's historyare given by Lancaster (1966), Sheynin (1966), and Chang (1973). 3

Cereswas the firstdiscovered "planetoid" in the asteroid belt. Gauss also determined the orbit of Pallas, another planetoid.

108

9. SAMPLING DISTRIBUTIONS

There was some dissent. Egon Pearson (1965) notes: As a reactionto this view among astronomers I rememberhow SirArthur Eddington in his Cambridgelectures about 1920 on theCombination of Observationsused to quote the remark that"to say that errors must obey the normal law means taking away the right of thefree-born Englishmanto makeany error he darn well pleases!" p. 6)

Karl Pearson's first statistical paper (1894) was on theproblemof interpreting a bimodal distributionas two normal distributions, the problem thathad arisen as aresult of Weldon's discovery that the distributionof the relative frontal breadthsof his sampleof Naples crabswas adouble-humped curve. This paper introduced the methodof momentsas ameansof fitting a theoretical curve to a set ofobservations.As Egon Pearson (1965) states it: The question"doesa Normal curvefit this distributionandwhat does this mean if it doesnot?" was clearly prominent in their discussions.There were three obvious alternatives; (a) The discrepancybetweentheory and observationis no more than mightbe expectedto arise in random sampling. (b) The dataareheterogeneous, composedof two or more Normal distributions. (c) The dataare homogeneous,but there is real asymmetryin the distributionof the variable measured. The conclusion(c) mayhave been hard to accept, such was theprestige surrounding the Normal law. (p. 9) 4 It appearsfrom Yule's lecture notes that Karl Pearson probably was employing a procedure that used the ratio of anabsolute deviation from expectation to its standard errorto examinethe recordof 7,000 (actually 7,006) tosses of 12 dice madeby Mr Hull, a clerk at University College. Thiswas therecord (see chapter 1) that Weldon said Karl Pearson had rejectedas "intrinsically incredible." Yule's notes also contain an empirical measureof goodnessof fit that Egon Pearson saysmay be setdown roughly as R = I\O - 7]/ir, where \O - 7] is the absolute valueof the difference between the observed and theoreticalfrequencyand T thetotal theoreticalfrequency,thoughit shouldbe mentionedthat the actual notesdo not containthe formula in this form. This expression mean absolute error was in useduring the latter yearsof the 19th century,and Bowley usedit in the first edition of his textbook in 1902. Karl Pearson's second statistical paper (1895) on asymmetricalfrequency curves occupied the attentionof thebiologists,but thequestionof thebiological meaningof skewed distributionswas not onethat, at the time, was in the

These notesare reproducedin Biometrikas Miscellanea(Yule, 1938a).

CHI-SQUARE

109

forefront of Pearson'sthoughts. Interestingly enough, Pearson did not use the mean absolute error as atest of fit in any of hiswork. His preoccupationwas with the developmentof a theory of correlation,and it was inthis context that he solved the goodness-of-fit problem. The 1895 paperand two supplements that followed in 1901 and 1916b introduceda comprehensive system of frequency curves that pointed a way tosampling distributions that are central to the use ofstatistical tests,but it was a waythat Pearson himself did not fully develop. Pearson's(1900a) seminal paper begins, "The object of this paper is to investigatea criterion of theprobability on anytheoryof an observed system of errors, and to apply it to the determination of goodnessof fit in the case of frequency curves'" (p. 157). Pearson takesa systemof deviationsx} , x2, . . ., xn from the meansof n variables with standard deviations a,, c r 2 , . . ., an andwith correlations r, 2, r13, an 2 7*23 •> • • •, '"n-i.n d derivesx as "the equationto a generalized 'ellipsoid,'all over the surfaceof which the frequencyof the systemof errorsor deviations jc,, x2,..., xn is constant" (p. 157). It was Pearson's derivation of the multivariate normal distribution that 2 formed the basisof the x test. LargeJCj's represent large discrepancies between theory and observationand inturn would give large values of x,2. But x.2 can be madeto becomea test statisticby examiningthe probabilityof a systemof errors occurring witha frequency asgreat or greater thanthe observed system. Pearsonhadalready obtained an expressionfor the multivariatenormal surface, and hehere givesthe probability of n errors whenthe ellipsoid, mentioned earlier is squeezedto becomea sphere,which is thegeometric representation of zero correlationin the multivariate space,and showshow onearrivesat the probability for a given valueof x2. When we compare Pearson's mathematics with Hamilton Dickson'sexaminationof Gallon's elliptic contours, seenin the scatterplotof two correlated variables while Galton was waiting for a train, we see how farmathematical statistics had advancedin just 15 years. Pearsonconsidersan (n +l)-fold grouping with observed frequencies, mi', m2', w3', . . . , wn', wn+ /, andtheoretical frequencies known a priori, m\, m2, w 3 , . . ., wn, mn +,. 2w = Iw' = N, is thetotal frequency,and e = m' -m is the error. The total errorZe (i.e., e} + e2 + e3 +... + en+] ) is zero. The degrees of freedom,as they are nowknown,follow; "Hence onlyn of the n + 1errors are variables;the n + 1th isdetermined when the first n areknown,and inusing 2 formula (ii) [Pearson'sbasicx. formula] we treat onlyof n variables" (Pearson, 1900a,pp.160-161). Starting withthe standard deviation for the random variationof error and the correlation between random errors, Pearson auses complex trigonometric

110

9. SAMPLING DISTRIBUTIONS

transformationto arrive at aresult"of very great simplicity":

Chi-square (thestatistic)is theweighted sum ofsquared deviations. Pearson (1904b, 1911) extended the use of thetest to contingency tables and to the twosamplecaseand in1916(a) presented a somewhat simpler derivation than the onegiven in 1900a,a derivation that acknowledges the work of Soper (1913). The first 20 years of this century brought increasing recognition that the test was of thegreatest importance, starting with Edgeworth , who wrote to Pearson, "I have to thank you for your splendid methodof testing my mathematical curvesof frequency. That x? of yours is one of themost beautiful of the instruments thatyou have addedto theCalculus" (quotedby Kendall, 1968, p. 262). And even Fisher,who by that timehad becomea fierce critic: "The testof goodnessof fit was devisedby Pearson,to whose labours principallywe now owe it, that the test may readily be appliedto a great varietyof questionsof frequencydistribution" (Fisher, 1922b,p. 597). In the next chapterthe argument that arose between Pearson andYule on the assumptionof continuous variation underlying the categoriesin contingency tables is discussed. When Fisher's modifications and correctionsto Pearson's theory were accepted,it was Yule who helped to spread the word on the interpretationof the newidea of degreesof freedom. The goodness-of-fit testis readily applied whenthe expected frequencies based on some hypothesisare known. For example, hypothesizing that the expected distributionis normal witha particular meanand standard deviation enablesthe expected frequency of any valueto be quickly calculated. Today 2 X is perhaps moreoften appliedto contingency tables, where the expected valuesare computedfrom the observed frequencies. This now routine procedure forms the basisof one of themost bitter disputes in statisticsin the 1920s.In 1916 Pearson examined the questionof partial contingency. The fixed n in agoodness-of-fit test imposes a constrainton the frequenciesin thecategories;only k - 1 categories are free to vary. Pearson realizedthat in the caseof contingencytables additional linear constraints were placed on the group frequencies,but heargued that these constraints did not allow for a reductionin the numberof variablesin the casewherethe theoretical distribution wasestimatedfrom the observed frequencies. Other questions had also been raised. Raymond Pearl(1879-1940),an American researcherwho was at the Biometric Laboratoryin the mid-191Os, pointedout some problemsin the

CHI-SQUARE

111

applicationof X2 in 1912, noting that some hypothetical data hepresents clearly show an excellent "fit" between observed and expected frequency but that the 2 value of x was infinite! Pearson,of course, replied,but Pearl was unmoved. I have earlier pointed out other objectionsto the x2 test ... I have never thought it necessaryto make any rejoinderto Pearson'scharacteristically bitter replyto my criticism, nor do Iyet. The x2 test leadsto this absurdity. (Pearl, 1917, p. 145)

Pearl repeatsthe argument that essentially notes that in caseswhere thereare small expected frequencies in the cells of thetable,the valueof X 2 can begrossly inflated. Karl Pearson (1917),in a reproof that illustrateshis disdain for those he believed had betrayedthe Biometric school, respondsto Pearl: "Pearl . . . provides a striking illustration of how the capable biologist needs a long continuedtraining in the logic of mathematics before he ventures intothe field of probability" (p. 429). Pearl had in fact raised quite legitimate questions about the application of 2 the X test, but in the context of Mendelian theory,to which Pearsonwas steadfastly opposed.A close readingof Pearl's papers perhaps reveals he that hadnot followed all Pearson's mathematics, but thequestionshe hadraised were germane. Pearson's response is theapotheosisof mathematical arrogance that, on occasion, frightens biologists and social scientists today: Shortly Dr Pearl'smethodis entirelyfallacious, as anytrained mathematician would have informedDr Pearlhad hesought advice before publication. It is most regrettable that such extensions of biometric theory should be lightly published, without any duesenseof responsibility,not solely in biologicalbut in psychological journals. It can only bring biometry into contempt as ascienceif, professinga mathematical foundation, it yet showsin its manifestations most inadequate mathematical reasoning. (Pearson, 1917, p. 432)

Social scientists beware! In 1916 Ronald Fisher, then a schoolmaster, raised his standardand made his first foray into whatwas to becomea war. The following correspondence is to be found in E. S. Pearson (1968): Dear Professor Pearson, There is an article by Miss Kirstine Smithin the current issueof Biometrika which, I think, ought not to pass without comment.I enclosea short note uponit. Miss Kirstine Smith proposesto use theminimum valueof x2 as acriterion to determinethe best formof the frequency curve;... It shouldbe observed thatx can only be determined whenthe material is grouped into arrays,and that its value

112

9. SAMPLING DISTRIBUTIONS

dependsupon the manner in which it is grouped. Thereis ... something exceedingly arbitrary in a criterion which depends entirely upon the mannerin which the datahappensto begrouped, (pp. 454-455)

Pearson replied: Dear Mr Fisher, I am afraid that I don't agreewith your criticism of FrokenK. Smith (sheis a pupil of Thiele'sand one of themost brilliant of the younger Danish statisticians).. . . your argumentthat x2 varies withthegrouping is of course well known... Whatwe have to determine, however,is with given grouping which method gives the lowest

X 2. (p. 455)

Pearson asks for a defenseof Fisher's argument before considering its publication. Fisher's expanded criticism received short shrift. After thanking Fisher for a copy of his paper on Mendelian inheritance (Fisher, 1918), he hopesto 5 find time for it, but pointsout thathe is"not a believerin cumulative Mendelian factors as being the solution of the heredity puzzle"(p. 456). He then rejects Fisher's paper. Also I fear thatI do not agreewith your criticismof Dr Kirstine Smith's paperand under present pressure of circumstances must keep the little space I have in Biometrika free from controversy which can only waste what power I have for publishing original work.(p. 456)

Egon Pearson thinks that we canaccepthis father's reasons for these rejections; the pressureof war work, the suspensionof manyof his projects,the fact that he wasover 60 yearsold with much workunfinished, and hismemoryof the strain that had been placedon Weldonin the rowwith Bateson.But if he really was trying to shun controversyby refusing to publish controversial views, then he most certainlydid not succeed. Egon Pearson's defense is entirely understandablebut far toocharitable. Pearson hadcensored Fisher's work before and appearsto have been trying to establishhis authority overthe youngerman and over statistics. Evenhis offer to Fisher of an appointmentat the Galton Laboratory mightbe viewed in this light. FisherBox (1978) notes thatthese experiencesinfluencedFisherin his refusal of the offer, in the summerof 1919, recognizing that, "nothing would betaughtor publishedat theGalton laboratory 5

Here Pearsonis dissembling. Fisher's paper wasoriginally submittedto theRoyal Society and, althoughit was notrejected outright,the Chairmanof the Sectional Committee for Zoology "had beenin communication withthe communicatorof the paper,who proposedto withdraw it." Karl Pearsonwas one of the tworeferees. Norton and Pearson (1976) describe the event and publish the referees' reports.

CHI-SQUARE

113

without the approvalof Pearson"(p. 82). The last strawfor Fisher camein 1920 whenhe senta paperon theprobable error of the correlationcoefficient to Biometrika: Dear Mr Fisher, Only in passingthrough Town todaydid I find your communicationof August 3rd. I am very sorry for the delay in answering ... As therehas beena delay of three weeks already, and as Ifear if I could givefull attentionto your paper, whichI cannotat the present time,I shouldbe unlikely to publish it in its presentform ... I would preferyou published elsewhere ... I am regretfully compelledto exclude all that I think erroneouson my own judgment, becauseI cannotafford controversy,(quoted by E .S.Pearson,1968, p. 453)

Fishernever again submitted a paperto Biometrika,and in 1922(a) tackled the X2 problem: This short paper withall its juvenile inadequacies, yet did somethingto break the ice. Any readerwho feels exasperated by itstentativeandpiecemeal character should remember thatit had to find its way topublicationpast critics who,in the first place, could not believe thatPearson'swork stood in needof correction, and who, if this had to beadmitted, were sure that they themselves had corrected it. (Fisher's commentin Bennett, 1971,p. 336)

Fishernotes thathe is notcriticizing the generaladequacyof the X2 test but that he intendsto show that: the valueof n' with which the table shouldbe enteredis not nowequalto thenumber of cells but to onemore thanthe number of degreesof freedom in the distribution. Thus for a contingency tableof r rows and ccolumnswe should taken' = (c- \)(r - 1) + 1insteadof n' = cr. This modificationoften makesa very great difference to the probability (P) that a given value of x2 should have been obtained by chance. (Fisher, 1922a,p.88)

It shouldbe noted that Pearson entered the tablesusing n' = v + 1, where «' is the numberof variables(i.e., categories)and v iswhatwe now call degrees of freedom. The modern tables areenteredusing v = (c - 1 )(r - 1). The use of «' to denote sample size and n todenote degrees of freedom, even thoughin many writings n wasalso usedfor samplesize, sometimes leads to frustrating readingin these early papers. It is clear that Pearson did not recognise thatin all caseslinear restrictions imposed upon the frequenciesof the sampled population,by our methodsof reconstructing that population, have exactly the sameeffect upon the distributionof x as have restrictions placed upon the cell contentsof the sample. (Fisher, 1922a, p. 92)

114

9. SAMPLING DISTRIBUTIONS

Pearsonwasangeredandcontemptuouslyrepliedin thepagesof Biometrika. He reiteratesthe fundamentalsof his 1900 paperandthen says: The processof substitutingsampleconstantsfor sampled populationconstantsdoes not mean thatwe select out of possible samples of size n, those which have precisely the same valuesof the constantsas theindividual sample under discussion. ... In using the constantsof the given sampleto replace the constants of the sampled population,we in nowise restrictthe original hypothesisof free random samples tied down onlyby their definite size.We certainly do not byusing sample constants reduce in any way therandom sampling degrees of freedom. The abovere-descriptionof what seemto me very elementaryconsiderations would be unnecessaryhad not arecent writerin theJournal of the Royal Statistical Society appearedto have wholly ignored them ... thewriter hasdoneno service to the scienceof statisticsby giving it broad-cast circulation in the pagesof the Journal of the Royal Statistical Society. (K.Pearson, 1922, p. 187)

And on and on,neverreferring to Fisherby namebut only as "mycritic" or "the writer in theJournalof theRoyal StatisticalSociety,'" until the final assault when he accuses Fisher of a disregardfor the natureof the probable error: I trust my critic will pardon me for comparinghim with Don Quixote tilting at the windmill; he must eitherdestroyhimself, or thewhole theoryof probable errors,for they are invariablybasedon using sample values for thoseof the sampled population unknown to us. Forexample hereis an argumentfor Don Quixote of the simplest nature ... (K. Pearson, 1922, p. 191)

The editorsof the Journal of the Royal Statistical Society turned tail and ran, refusing to publish Fisher'srejoinder. Fisher vigorously protested,but to no avail, and he resignedfrom the Society. There wereother questions about Pearson's paper that he dealtwith in Bowley'sjournal Economicain 1923and in the Journal of the Royal Statistical Society (Fisher, 1924a),but the endcame in 1926 when, using data tables that had beenpublishedin Biometrika, Fisher calculated the actual average value of x2 which he had proved earlier should theoretically be unity and which Pearsonstill maintained should be 3. Inevery case the averagewas close to unity, in no casenearto 3 . . .. There was noreply. (Fisher Box, 1978, p. 88,commenting on Fisher, 1926a)

THE t DISTRIBUTION W. S. Gosset was a remarkable man,not the least becausehe managedto maintain reasonably cordial relations with both Pearsonand Fisher,and at the same time.Nor did heavoid disagreeing with themon various statistical issues. He wasborn in 1876 at Canterburyin England,and from 1895to 1899 was

THE t DISTRIBUTION

115

at New College, Oxford, wherehetook a degreein chemistryand mathematics. In 1899 Cossetbecamean employeeof Arthur Guinness,Son andCompany Ltd., the famous manufacturers of stout. He was one of the first of the scientists, trained either at Oxford or Cambridge,that the firm hadbegun hiring(E. S. Pearson, 1939a).His correspondencewith Pearson, Fisher, and others shows him to have beena witty and generousman with a tendencyto downplay his role in the developmentof statistics. Comparedwith the giants of his day he published very little,but his contributionis of critical importance. As Fisher puts it in his Statistical Methodsfor Research Workers: The studyof the exact distributionsof statistics commences in 1908 with"Student's" paper The Probable Error of the Mean. Oncethe true natureof the problem was indicated, a large numberof sampling problems were within reach of mathematical solution. (Fisher, 1925/1970, p. 23)

The brewery had apolicy on publishingby itsemployees that obliged Gosset to publishhis work underthe nomdeplume "Student." In essence,the problem that "Student"tackled was thedevelopmentof a statistical test that could be applied to small samples.The nature of the process of brewing, with its variability in temperatureand ingredients, means that it is not possibleto take large samples over a long run. In a letter to Fisherin 1915,in which he thanks Fisher for the Biometrika paper that begins the mathematical solutionto the small sample problem,"Student"says: The agricultural (and indeed almost any) Experiments naturally requireda solution of the mean/S.D. problem,and the Experimental Brewerywhich concerns such things as theconnection between analysis of malt or hops, and thebehaviourof the beer,andwhich takesa day toeachunit of theexperiment, thuslimiting the numbers, demandedan answerto such questions as, "If with a small numberof casesI get a value r, what is the probability that thereis really a positive correlationof greater value than (say) .25?" (quoted by E. S.Pearson, 1968,p. 447)

Egon Pearson (1939a) notes that in his first few years at Guinness, Gosset was making use ofAiry's Theory of Errors (1861), Lupton's Notes on Observations (1898),and Merriman'sTheMethod of Least Squares (1884).In 1904 he presenteda report to his firm that stated clearly the utility of the application of statisticsto thework of the breweryand pointedup theparticular difficulties that mightbe encountered.The report concludes: We have beenmetwith the difficulty that noneof our books mentions the odds, which areconveniently accepted asbeingsufficient to establishany conclusion,and itmight be of assistanceto us toconsult some mathematical physicist on thematter, (quoted by Pearson,1939a,p.215)

116

9. SAMPLING DISTRIBUTIONS

A meeting was in fact arranged with Pearson, and this took placein the summerof 1905. Not all of Gosset'sproblems were solved, but a supplement to his report and asecond reportin late August 1905 produced many changes in the statisticsusedin the brewery. The standard deviation replaced the mean error, andPearson's correlation coefficient becamean almost routine procedure in examining relationships among the many factors involved with brewing. But one featureof the work concerned Cosset: "Correlation coefficients areusually calculatedfrom large numbersof cases,in fact I havefound only one paperin Biometrika of which the casesare as few innumberas thoseat which I have been working lately" (quoted by Pearson, 1939a, p. 217). Gosset expressed doubt about the reliability of the probable error formula for the correlationcoefficient whenit was appliedto small samples.He went to London in September 1906to spenda year at the Biometric Laboratory.His first paper, published in 1907, derives Poisson's limit of the binomial distribution and applies it to the error in sampling when yeast cells are countedin a haemacytometer.But his most important work during that year was thepreparation of his twopaperson theprobable errorof the meanand of thecorrelation coefficient, both of which werepublishedin 1908. The usual methodof determining thatthe meanof the population lies withina given distanceof the meanof the sample,is to assumea normal distribution about the mean of the sample with a standarddeviation equalto 5/Vn, where s is the standard deviation of the sample,and to use the tables of the probability integral. But, as wedecreasethe value of the numberof experiments,the value of the standard deviationfound from the sampleof experiments becomes itself subject to increasing error,until judgements reached in this way become altogether misleading. ("Student," 1908a,pp. 1-2)

"Student"setsout what the paper intendsto do: I. The equationis determinedof the curve which represents the frequency distribution of standard deviations of samples drawnfrom a normal population. II. Thereis shown to be nokind of correlation betweenthe meanand thestandard deviation of sucha sample. III. The equationis determinedof the curve representing the frequency distribution of a quantity z, which is obtained by dividing the distance between the mean of the sampleand themeanof the populationby the standard deviationof the sample. IV. The curve foundin I. is discussed. V. The curve found in III. is discussed. VI. The two curvesare compared with some actual distributions. VII. Tables of the curves foundin III. are given for samplesof different size. VIII and IX. Thetablesareexplainedand some instances aregiven of their use. X. Conclusions. ("Student," 1908, p. 2)

THE t DISTRIBUTION

117

"Student"did not provide a proof for the distribution of z. Indeed,he first examined this distribution, andthat of s, byactually drawing samples (of size4) from measurements, made on 3,000 criminals, takenfrom data usedin a paper by Macdonell (1901).The frequencydistributionsfor s and zwere thus directly obtained,and themathematical work came later. There hasbeen comment over the yearson thefact that "Student's"mathematical approachwas incomplete, but this shouldnot detractfrom his achievements. Welch (1958)maintainsthat: The final verdict of mathematical statisticians will, I believe,be that they have lasting value. They havethe rare quality of showing us how anexceptional man wasable to make mathematicalprogresswithout paying too much regard to the rules. He fortified what he knew with some tentative guessing, but this was backed by subsequent careful testing of his results, (pp.785-786)

In fact "Student"had given to future generationsof scientists,in particular social and biological scientists,a new andpowerful distribution.The z test, which was to becomethe t test,6 led the way for allkindsof significance tests, and indeed influenced Fisher as hedeveloped that mostuseful of tools, the analysis of variance. It is also the case thatthe 1908 paperon theprobable error of the mean ("Student," 1908a) clearly distinguishedbetween whatwe nowcall sample statisticsand population parameters,a distinction that is absolutely critical in modern-day statistical reasoning. The overwhelming influence of the biometriciansof Gower Streetcan perhaps partly account for the fact that "Student's" workwas ignored. In 1939 Fisher said that"Student's"work was receivedwith "weighty apathy," and,as late as 1922, Gosset, writing to Fisher,and sendinga copy of his tables, said, "You are theonly man that's everlikely to usethem!" (quotedby Fisher Box, 1981, p. 63). Pearson was, to say theleast, suspiciousof work using small samples.It was theassumptionof normality of the samplingdistributionthatled to problems,but thebiometricians never used small samples,and"only naughty brewers taken so small that the difference is not of theorder of the probable error!" (Pearson writingto Gosset, September 1912, quoted by Pearson, 1939a, p. 218). Some of "Student's"ideashad been anticipated by Edgeworthas early as 1883 and,as Welch (1958) notes,one might speculateas to what Gosset's reaction would have been had hebeen awareof this work. Gosset'spaperon theprobable errorof thecorrelation coefficient("Student," 1908b) dealt withthe distributionof the r values obtained when sampling from 6

Eisenhart (1970) concludes that the shift from z to / was due to Fisherandthat Gosset chose / for the newstatistic. In their correspondence Gosset used t for his owncalculationsand x forthose of Fisher.

118

9. SAMPLING DISTRIBUTIONS

a populationin which the two variables were uncorrelated, that is, R = O7. This endeavorwasagain basedon empirical sampling distributions constructed from the Macdonell dataand amathematical curve fitted afterwards. With characteristic flair, he says that he attempted to fit a Pearson curveto the "no correlation" distributionand came up with a Type II curve. "Working from 2

0

2

(n-4)/2

y= yoC\. -x ) for samplesof 4, I guessedthe formula y = yQ(\ -x ) and proceededto calculatethe moments" ("Student,"1908b, p.306). He concludeshis paperby hoping thathis work "may serveasillustrations for the successful solver of theproblem" (p. 310). And indeedit did, for Fisher's paperof 1915 showed that "Student" hadguessed correctly. Fisher had beenin correspondence with Gosset in 1912,sendinghim a proof that appealedto «-dimensional spaceof the frequency distributionof z. Gosset wantedPearson to publishit. Dear Pearson, I am enclosing a letter which givesa proof of my formulae for the frequency distribution of z(=xls),... Would you mind lookingat it for me; Idon't feel at home in morethan threedimensions evenif I could understandit otherwise. It seemedto methat if it's all right perhapsyou might like to put theproof in a note. It's so nice and mathematical thatit might appealto some people, (quoted by E. S. Pearson,1968,p. 446)

In fact the proof was notpublished then,but again approachingthe mathematics througha geometrical representation, Fisher derived the samplingdistribution of the correlation coefficient,andthis, together withthe derivation of the 2distribution, was publishedin the 1915 paper.The sampling distribution of r was,of course,of interestto Pearsonand hiscolleagues; afterall, r was Pearson'sstatistic. Moreover, the distribution followsa remarkable systemof curves, witha variety of shapes that depart greatly from the normal, depending on n and thevalueof the true unknown correlation coefficient R, or, as it is now generally known,p. Pearsonwasanxiousto translate theory into numbers, and the computationof the distributionof r wascommencedand publishedas the "co-operativestudy" of Soper, Young, Cave,Lee,and K. Pearson in 1917. Although Pearsonhad said thathe would send Fisher the proofs of the paper (the letteris quotedin E. S.Pearson, 1965), there is, apparently,no record that he in fact received them.E. S.Pearson (1965) suggests that Fisher did notknow "until late in the day" of the criticism of his particular approachin a sectionof the study, "On theDeterminationof the 'Most Likely' Valueof the Correlation 7

R wasGosset's symbol for the population correlationcoefficient, p (the Greek letter rho) appearsto have beenfirst usedfor this valueby Soper(1913). The important pointis thata different symbol was usedfor sample statistic and populationparameter.

THE t DISTRIBUTION

119

in the Sampled Population." Egon Pearson argues his thatfather's criticism, for presumablythe elder Pearsonhad taken a major role in the writing of the "co-operative study," was a misunderstanding based on Fisher's failure to adequately define what he meantby "likelihood" in 1912and thefailureto make clear that it was not basedon the Bayesian principleof inverse probability. Fisher Box (1978) says, "their criticism was asunexpectedas it wasunjust,and it gavean impressionof something less than scrupulous regardfor a new and therefore vulnerable reputation"(p. 79). Fisher's response, published in Metron in 1921 (a),was thepaper, mentioned earlier, that Pearson summarily rejected because he could not afford controversy. In this paper Fisher makes use of the r =tanh z transformation, a transformation thathad been introduced,a little tentatively,in the 1915 paper. Its immense utilityin transformingthe complex systemof distributionsof r to a simple functionof z, which is almost normally distributed, made the laborious work of the co-operative study redundant. In a paper publishedin 1919, Fisher examinesthe dataon resemblancein twins thathad been studiedby Edward L. Thorndike in 1905. Thisis the earliest exampleof the applicationof Fisherian statistics to psychological data, for the traits examined were both mental and physical. Fisher lookedat the questionof whether there wereany differences between the resemblancesof twins in different traits and here uses the z transformation: When the resemblances have been expressed in terms of the new variable, a correlation table may be constructedby picking out every pair of resemblances betweenthe same twinsin different traits. The valuesare nowcentered symmetrically about a meanat 1.28, and thecorrelationis found to be-.016 ± .048, negative but quite insignificant.The result entirelycorroboratesTHORNDIKE'S conclusions as to thespecializationof resemblance. (Fisher, 1919, p. 493)

These manipulations and thedevelopmentof the same general approach to the distribution of the intraclass correlation coefficient in the 1921 paperare important for the fundamentalsof analysis of variance. From the point of view of the present-day student of statistics, Fisher's (1925a) paperon the applicationsof the t distributionis undoubtedlythe most comprehensible. Here we see, in familiar notationand most clearly stated,the utility of the t distribution,a proof of its "exactitude ... for normal samples" (p. 92), and theformulae for testing the significanceof a difference between meansand thesignificanceof regression coefficients. Finally the probability integral withwhich we areconcernedis of valuein calculating the probability integralof awider classof distributionswhich is relatedto "Student's" distribution in the same manneras that of X2 is relatedto the normal distribution.

120

9. SAMPLING DISTRIBUTIONS

This wider classof distributions appears(i) in the study of intraclass correlations (ii) in the comparisonof estimatesof the variance,or of thestandard deviation from normal samples(iii) in testingthe goodnessof fit of regressionlines (iv) in testing the significance of a multiple correlation,or (v) of acorrelation ratio. (Fisher,1925a, pp. 102-103)

These monumental achievements were realized in less than10 years after Gosset's mixture of mathematicalconjecture,intuition, and thepractical necessities of his work led the way to thedistribution. t From 1912, Gossetand Fisherhad beenin correspondence (although there were some lengthy gaps), but they did not actually meetuntil September 1922, when Gosset visited Fisher at Harpenden.Fisher Box (1981) describes their relationshipandreproduces excerpts from their letters. At the end ofWorld War I, in 1918, Gossetdid not even knowhow Fisherwas employed,and when he learnedthat Fisherhad beena schoolmasterfor the durationof the war and was looking for a job wrote, "I hear that Russell [the head of the Rothamsted Experimental station] intends to get astatistician soon, when hegetsthe money I think, and it might be worth while to keep your ears open to news from Harpenden" (quoted by Fisher Box, 1981, p. 62). In 1922, work beganon thecomputationof a new setof t tables using values of t = z\n - 1 rather thanz, and thetables were entered with the appropriate degreesof freedom rather than n. Fisher and Gosset both worked on the new tables,and after delays,and fits and starts,and thecheckingof errors,the tables

Fig. 9.2 The Normal Distribution and the t Distribution for df = 10 and df = 4

THE F DISTRIBUTION

121

were publishedby "Student" in Metron in 1925. Figure9.2 compares two? distributions to the normal distribution. Fisher's "Applications" paper, mentioned earlier, appeared in the same volume, but it had,in fact, been written quite early in 1924. At that time Fisherwas completinghis 1925 bookand needed tables. Gosset wanted to offer the tablesto Pearsonandexpressed doubts about the copyright, because the first tableshadbeen published in Biometrika. Fisher went aheadandcomputedall thetableshimself,a task he completedlater in the year (Fisher Box, 1981). Fisher sentthe "Applications" paperto Gossetin July 1924. He says that the note is: larger thanI had intended,and tomake it at all complete shouldbe larger still, but I shall not have timeto makeit so, as I amsailingfor Canadaon the25th, and will not be back till September, (quoted by Fisher Box, 1981,p. 66)

The visit to Canadawas made to presenta paper (Fisher, 1924b) at the International Congress of Mathematics, meeting that year in Toronto. This paper discussesthe interrelationshipsof X 2,z, and /. Fisherwasonly 35 years old,and yet the foundationsof his enormousinfluence on statistics werenow securely laid. THE F DISTRIBUTION The Toronto paperdid not, in fact, appearuntil almost4 yearsafter the meeting, by which time the first edition of Statistical Methodsfor Research Workers (1925) had been published.The first use of ananalysisof variance technique was reported earlier (Fisher& Mackenzie, 1923),but this paper was not encounteredby many outsidethe area of agricultural research.It is possible that if the mathematicsof the Toronto paperhadbeen includedin Fisher's book then much of the difficulty that its first readershad with it would have been reducedor eliminated,a point that Fisher acknowledged. After a general introductionon error curvesand goodness-of-fit, Fisher examinesthe x,2 statisticand briefly discusseshis (correct) approachto degrees of freedom. He then pointsout that if a numberof quantitiesx1,,...,* „ are distributed independentlyin the normal distribution with unit standard deviation, then x.2 = Lx is distributedas "the Pearsonian measure of goodnessof 2 fit." In fact, Fisher uses S(x ) to denotethe latter expression,but here, and in what follows on the commentaryon this paper,the more familiar modern notation is employed. Fisher refers to "Student's"work on theerror curveof the standard deviationof a small sample drawn from a normal distributionand 2 shows its relation to . -

122

9. SAMPLING DISTRIBUTIONS

wheren - 1 is thenumberof degreesof freedom (one less than the numberin 2 2 the sample)and s is thebest estimatefrom the sampleof the true variancecr. 2 2 For the generalz distribution, Fisherfirst points out that s, and s2 are misleading estimatesof a, and 2a when sample sizes, drawn from normal distributions,are small: The only exacttreatmentis to eliminate the unknown quantitiesa, and cr 2 from the distribution by replacing the distribution of s by that of log s, and soderiving the distribution of log st/s2. Whereasthe samplingerrorsin s, areproportional to Oj the sampling errorsof log s, depend only uponthe size of the sample from which 5, was calculated. (Fisher, 1924b, p. 808)

then z will be distributed aboutlog a, /a2 asmode, in a distribution which depends wholly upon the integersnl and «2. Knowing this distributionwe cantell at once if an observed value of z is or is notconsistent withany hypothetical valueof the ratio CT, /o2. (Fisher, 1924b,p. 808)

The casesfor infinite and unit degreesof freedom are then considered.In the latter casethe "Student" distributions are generated. In discussingthe accuracyto beascribedto the mean of a small sample,"Student" took the revolutionary step of allowing for the random sampling variationof his estimate of the standard error.If the standard error were known with accuracy the deviation of an observed valuefrom expectation (sayzero),divided by the standard error would be distributed normally with unit standard deviation; but if for the accuratestandarddeviation we substitutean estimatebasedon n - 1 degreesof freedom we have

consequentlythe distribution of t is given by putting n, [degreesof freedom] = 1 and substitutingz = V* log t2. (Fisher, 1924b,pp. 808-809)

THE F DISTRIBUTION

123

For the caseof an infinite numberof degreesof freedom,the tdistribution becomesthe normal distribution. In the finalsectionsof thepaperthe r - tanh z transformationis appliedto the intraclass correlation,an analysisof variance summary table is shown,and z=loge s} /s2 is thevalue thatmay beusedto testthe significanceof the intraclass correlation. These basic methods arealso shownto lead to testsof significance for multiple correlationand r\ (eta),the correlation ratio. The transition from z to F was notFisher's work.In 1934, GeorgeW. Snedecor,in the first of anumberof texts designedto make the techniqueof analysisof variance intelligibleto a wider audience,defined F as theratio of the larger mean squareto the smaller mean square taken from the summary table, using the formula z = '/2loge F. Figure 9.3 shows two F distributions. Snedecorwas Professorof Mathematicsand Director of the Statistical Laboratory at Iowa State Collegein the United States. This institution was largely responsiblefor promotingthe valueof themodern statistical techniques in North America. Fisher himself avoidedthe use of thesymbol F becauseit was notusedby P. C. Mahalonobis,an Indian statisticianwho hadvisited Fisherat Rothamsted in 1926,and who hadtabulatedthe valuesof the variance ratioin 1932,andthus established priority. Snedecor did not know of the work of Mahalonobis,a clear-thinking and enthusiastic workerwho becamethe first Director of the Indian Statistical Institute in 1932. Todayit is still occasionallythe case thatthe ratio is referredto as"Snedecor'sF " in honorof both SnedecorandFisher.

Fig 9.3 The F Distributions for 1,5 and 8,8 Degrees of Freedom

124

9. SAMPLING DISTRIBUTIONS

THE CENTRAL LIMIT THEOREM The transformationof numerical datato statistics and theassessmentof the probability of the outcome beinga chance occurrence, or the assessmentof a probable rangeof values within whichthe outcome will fall, is a reasonable general descriptionof the statistical inferential strategy.The mathematical foundationsof theseprocedures rest on aremarkable theorem, or rathera set of theorems, thataregrouped togetheras theCentral Limit Theorem. Much of the work that led to this theoremhas already been mentioned but it is appropriate to summarizeit here. In the most general terms,the problem is to discover the probability distribution of the cumulativeeffect of many independently acting, and very small, randomeffects. The centrallimit theorem brings together the mathematical propositions that show that the required distributionis thenormal distribution. In other words,a summaryof a subsetof random observations^,^,.. . ,Xn, saytheir sum'LX=X] + X2 +... + Xn or their mean,(Ud/n has asampling distribution that approaches the shapeof the normal distribution. Figure9.4 is an illustration of a sampling distributionof means. This chapter has described distributionsof other statistical summaries, but it will have been noted that, in one way oranother, those distributions have links with the normal distribution.

Fig. 9.4 A Sampling Distribution of Means (n = 6). 500 draws from the numbers 1 through 10.

THE CENTRAL LIMIT THEOREM

125

The history of the developmentof the theoremhasbeen brought together in a useful and eminently readable little book by Adams (1974).The story begins with the definition of probability as long-run relative frequency, the ratio of the numberof ways an eventcanoccurto thetotal numberof possible events, given that the eventsare independentand equallylikely. The most important contribution of James Bernoulli'sArs Conjectandi is the first limit theorem of probability theory. The logic of Bernouilli's approachhasbeenthe subjectof some debate,and weknow that he himself wrestled withit. Hacking (1971) states: No one writes dispassionatelyaboutBernoulli. He hasbeenfathered with the first subjective conceptof probability, and with a completely objectiveconcept of probability as relative frequency determined by trials on a chance set-up. He has beenthought to favour an inductive theoryof probability akinto Carnap's.Yet he is said to anticipateNeyman'sconfidence interval technique of statistical inference, which is quite opposedto inductive probabilities.In fact Bernoulli was, likeso many of us, attractedto many of these seemingly incompatible ideas, and he wasunsure where to rest his case. He left his book unpublished, (pp. 209-210)

But Bernoulli's mathematics are not inquestion. Whenthe probability^ of an eventE is unknownand asequenceof n trials is observedand theproportion of occurrencesof £ is£nthen Bernoulli maintainedthat an "instinct of nature" causesus to use n£ as anestimateof p. Bernoulli's Law of Large Numbers shows thatfor any small assigned amount 8, |p - En \ < e increasesto 1 as « increasesindefinitely: It is concluded thatin 25,550trials it is more thanone thousand timesmore likely that the r/t [the ratio of what Bernoulli calls "fertile" events to thetotal, that is, the event of interest here called £] will fall between 31/50and 29/50 than thatr/t will fall outside these limits. (Bernoulli, 1713/1966, p. 65) The aboveresults holdfor known p. If/? is 3/5, thenwe can bemorally certain that ... thedeviation ... will be less than 1/50.But Bernoulli's problemis the inverse of this. When/?is unknown, can hisanalysis tell whenwe can bemorally certain that someestimateof p isright? Thatis theproblemof statistical inference. (Hacking, 1971, p. 222)

The determination of/?was ofcoursethe problem tackledby De Moivre. He usedthe integrale~*2 to approximateit, but asAdams (1974)and others have noted, thereis no direct evidence thatimplies that De Moivre thoughtof what is now called the normal distribution as a probability distribution as such. Simpson introduced the notion of a probability distributionof observations,andothers, notably Joseph Lagrange (1736-1813)andDaniel Bernoulli

126

9. SAMPLING DISTRIBUTIONS

(1700-1782),who wasJames's nephew, elaborated laws of error.Thelate 18th centurysaw theculminationof this workin the development of the normallaw of frequencyof error and itsapplicationto astronomical observations by Gauss. It was Laplace's memoir of 1810 that introduced the central limit theorem,but the nub of hisdiscoverywas describedin nonmathematical language in his famous Essai publishedas the introduction to the third edition of Theorie Analytique desProbabilitesin 1820: The general problem consists in determiningthe probabilities thatthe valuesof one or several linear functions of the errors of a very great numberof observationsare contained withinany limits. The law of thepossibility of the errors of observations introduces intothe expressionsof these probabilities a constant, whose value seems to requirethe knowledgeof this law, whichis almost always unknown. Happily this constantcan bedeterminedfrom the observations. Thereoften existsin the observationsmany sourcesof errors:...The analysis which I have used leads easily, whatever the numberof the sourcesof error may be, to thesystemof factors which givesthe most advantageous results, or thosein which the same erroris less probable than in any other system. I ought to make herean important remark. The small uncertainty thatthe observations, when they are not numerous, leavein regard to the values of the constants...rendersa little uncertainthe probabilities determined by analysis. But it almost alwayssuffices to know if the probability, thatthe errors of the results obtained arecomprised within narrow limits, approaches closely to unity; andwhen it is not, it sufficesto know up to what pointthe observations should be multiplied, in orderto obtainaprobability such thatno reasonable doubt remains... The analytic formulae of probabilitiessatisfy perfectlythis requirement; . . . They are likewise indispensablein solving a great numberof problems in the natural and moral sciences. (Laplace, 1820, pp. 192-195in thetranslationby Truscott&Emory, 1951)

Theseareelegantandclear remarksby a geniuswho succeededin makingthe laborsof many years intelligible to a wide readership. Adams(1974) givesa brief accountof the final formal developmentof the abstract Central Limit Theorem.The Russian mathematician Alexander Lyapunoff (1857-1918), a pupil of Tchebycheff, provided a rigorous mathematicalproof of the theorem.His attentionwas drawnto theproblem when he waspreparing lecturesfor a course in probability theory,and his approachwas atriumph. His methodsand insights led,in the 1920s,to many valuablecontributionsand even morepowerful theoremsin probability mathematics.

10

Comparisons, Correlations and Predictions

COMPARING MEASUREMENTS Sincethe time of De Moivre, the variables that have been examined by workers in the field of probability have expressed measurements asmultiplesof a variety of basic units that reflectthe dispersionof the range of possiblescores.Today the chosen unitsare units of standard deviation,and thescoresobtained are called standard scoresor z scores. Karl Pearson usedthe term standard deviation andgave it the symbol a (the lower caseGreek letter sigma)in 1894, but the unit was known (althoughnot in itspresent-day form)to De Moivre. It correspondsto that point on theabscissaof a normal distribution such that an ordinateerectedfrom it would cut thecurveat itspoint of inflection, or, in simple terms, the point wherethe curvatureof the function changesfrom concaveto convex. Sir GeorgeAiry (1801-1892)namedaV2 themodulus (although this term had been used,in passing,for VK by De Moivre as early as 1733)and describeda variety of other possible measures, including theprobableerror, in 1875.' This latterwas theunit chosenby Gallon (whose workis discussed later), although he objected stronglyto the name: It is astonishing that mathematicians, who are themost preciseand perspicaciousof men, havenot long since revolted against this cumbrous, slip-shod, and misleading phrase Moreoverthe term Probable Erroris absurd when applied to the subjects

1 The probable erroris definedas onehalf of the quantity that encompasses the middle 50% of a normal distributionof measurements.It is equivalentto what is sometimes called the semi-interquartile range (that portionof the distribution betweenthe first quartileor the twenty-fifth percentile,and the third quartileor the seventy-fifth percentile, dividedby two). The probable erroris 0.67449times the standard deviation. Nowadays, everyone follows Pearson (1894), who wrote, "I have alwaysfound it more convenientto work with the standard deviation than with the probable erroror the modulus,in terms of which the error function is usually tabulated" (p.88).

127

128

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

now in hand, suchas Stature, Eye-colour, Artistic Faculty,or Disease.I shall therefore usuallyspeakof Prob. Deviation. (Galton, 1889, p. 58)

This objection reflects Galton's determination, and that of his followers,to avoid the use of theconcept of error in describingthe variation of human characteristics. It also foreshadowsthe well-nigh complete replacement of probable error with standard deviation, and law of frequency oferror with normal distribution, developments that reflect philosophical dispositions rather than mathematical advance. Figure 10.1 shows the relationship between standard scoresandprobable error. Perhapsa word or two of elaborationand examplewill illustratethe utility of measurements made in units of variability. It may seem triteto make the statement that individual measurements and quantitative descriptions are made for the purposeof making comparisons. Someone who pays $1,000for a suit hasbought an expensive suit,aswell ashaving paida greatdeal more thanthe individual who haspickedup acheapoutfit for only $50.The labels"expensive" and "cheap"areapplied because the suit-buyingpopulation carries around with it some notion of the average priceof suitsand some notionof the rangeof

FIG. 10.1

TheNormal Distribution - Standard Scores and the Probable Error

GALTON'S DISCOVERY OF REGRESSION

129

prices of suits. This obviousfact would be made even more obvious by the reaction to an announcement that someone had just purchaseda new car for $1,000. Whatis a lot ofmoney for a suit suddenly becomes almost trifling for a brand new car. Again this judgment depends on aknowledge, whichmay not be at allprecise,of the average price,and theprice range,of automobiles. One can own anexpensive suitand acheapcar and have paid the same absolute amount for both, althoughit must be admitted that sucha juxtaposition of purchasesis unlikely! The point is that these examples illustrate the fundamental objective of standard scores,the comparisonof measurements. If the mean priceof cars is $9,000 and thestandard deviationis $2,500 then our $1,000car has an equivalentz scoreof ($1,000 - $9,000)/$2,500or-3.20. If suit prices havea meanof $350and astandard deviation of $150, thenthe $1,000.00suit has anequivalentz scoreof ($1,000 - $350)/$150or +4.33. We havea very cheapcar and avery, very expensive suit.It might be added that, at thetime of writing, thesefigures were entirely hypothetical. These simple manipulations are well-known to usersof statistics, and,of course, when they are appliedin conjunctionwith the probability distributions of the measurements, they enable us to obtainthe probability of occurrenceof particular scoresor particular rangesof scores. Theyare intuitively sensible. A more challenging problem is that of the comparisonof setsof pairs of scores and of determininga quantitative description of the relationship between them. It is a problem thatwas solvedduring the secondhalf of the 19th century. GALTON'S DISCOVERY OF "REGRESSION" Sir Francis Gallon (his workwas recognizedby the award of a knighthood in 1909) has been describedas aVictorian genius. If we follow Edison's oftquoted definition2 of this condition then therecan be noquarrel with the designation,but Galtonwas a man ofgreat flair as well asgreat energyand his ideas and innovations were many and varied. In a long life he produced over 300 publications, including17 books. But,in one of those odd happeningsin the history of human invention,it was Galton's discoveryand subsequent misinterpretation of a statistical artifact that marks the beginningof the technique of correlationas we nowknow it. 1860 was theyear of the famous debateon Darwin's work between Thomas Huxley (1825-1895)and Bishop Samuel Wilberforce (1805-1873). This encounter, which heldthe attentionof the nation, took placeat themeetingof the British Association (forthe Advancementof Science)held thatyearat Oxford. Galton attendedthe meeting,and althoughwe do notknow what rolehe had at it, we doknow thathe later becamean ardent supporter of his cousin's theories. In 1932, Thomas Alva Edison said that "Genius is onepercent inspiration and ninety-nineper cent perspiration."

130

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

The Origin of Species,he said, "madea marked epochon my development,as it did in that of human thoughtgenerally"(Galton, 1908, p. 287). Cowan (1977) notes that, although Galton had made similar assertions before, the impact of The Origin must have been retrospective, describing his initial reactionto thebook as,"pedestrianin the extreme." Shealso asserts that "Galton never really understood the argumentfor evolutionby natural selection, nor was heinterestedin the problem of the creation of new species"(Cowan, 1977, p. 165). In fact, it is apparentthat Galtonwas quite selectivein using his cousin's work to supporthis ownview of the mechanismsof heredity. TheOxford debatewas notjust a debateabout evolution. MacKenzie (1981) describes Darwin's book as the"basicwork of Victorian scientific naturalism" (p. 54), the notion of the world, and thehuman species and itsworks, aspart of rational scientific nature, needing no recourseto thesupernaturalto explainthe mysteriesof existence. Naturalismhas itsorigins in the rise of sciencein the 17th and 18th centuries,and itsopponents expressed their concern, because of its attack, implicitandovert,on traditional authority.As one19th century writer notes,for example: Wider speculationsas to morality inevitably occuras soon as thevision of God becomesfaint; whenthe Almighty retires behind second causes,insteadof being felt as animmediate presence, and hisexistencebecomesthe subject of logical proof. (Stephen,1876, Vol II, p. 2).

The return to nature espousedby writers suchas Jean Jacques Rousseau (1712-1778)and hisfollowers would haverid the world of kings and priests and aristocrats whose authority rested on tradition and instinct rather than reason,and thus, they insisted, have brought about a simple"natural" stateof society. MacKenzie (1981), citing Turner (1978), adds a very practical noteto this philosophy.The battlewas notjust about intellectual abstractions but about who should have authorityand control and who should enjoy the material advantages that flow from the possession of that authority. Scientific naturalism was theweaponof the middle classin its strugglefor powerandauthority based on intellect andmerit andprofessional elitism,and not onpatronageor nobility or religious affiliation. These ideas,and the newbiology, certainly turned Galton away from religion, as well as providing him with an abiding interestin heredity. Forrest (1974) suggests that Gallon'sfascination for,and work on, heredity coincided with the realization thathis ownmarriage wouldbe infertile. This alsomay have been a factor in the mental breakdown that he suffered in 1866. "Another possibleprecipitating factorwas theloss of his religious faith which left him with no compensatory philosophy until his programmefor theeugenic improvement of mankind becamea future article of faith" (Forrest,1974, p. 85).

GALTON'S DISCOVERY OF REGRESSION

131

During the years 1866to 1869 Galtonwas ingenerally poor health, but he collectedthe material for,andwrote, one of hismost famous books, Hereditary Genius, whichwas published in 1869. In this book and in English Men of Science, published in 1874, Galton expounds and expands uponhis view that ability, talent, intellectual power,and accompanying eminence are innately rather than environmentally determined. The corollary of this view was,of course, that agencies of social control shouldbe established that would encourage the"best stock" to have children,and todiscourage those whom Galton describedas havingthe smallest quantities of "civic worth" from breeding. It has been mentioned that Galton was acollector of measurements to a degree that wasalmostcompulsive,and it iscertainthat he was nothappy withthe fact that he had had to use qualitative rather than quantitative data in supportof his argumentsin the two books just cited. MacKenzie (1981) maintains that: the needsof eugenicsin large part determinedthe content of Galton's statistical theory....If the immediateproblemsof eugenicsresearchwereto besolved,a new theory of statistics,different from that of the previouslydominanterror theoristshad to be constructed.(MacKenzie, 1981, p. 52)

Galton embarkedon acomparative study of the sizeandweightof sweetpea 3 seeds overtwo generations, but, as helater remarked,"It was anthropological evidence thatI desired, caringonly for the seedsas meansof throwing lighton heredityin man. I tried in vain for a long and weary timeto obtainit in sufficient abundance" (Galton, 1885a, p. 247). Galton began by weighing,and measuringthe diametersof, thousandsof sweet peaseeds. He computedthe meanand theprobable errorof the weights of theseseedsand madeup packetsof 10seeds, each of the seeds being exactly the same weight.The smallest packet contained seeds weighing the mean minus three timesthe probable error,the nextthe meanminustwice the probable error, and so on up topackets containing the largest seeds weighing the mean plus three timesthe probable error. Sets of the seven packets were sent to friends acrossthe length and breadthof Britain with detailed instructions on how they were to beplantedand nurtured. There were two crop failures, but theproduce of seven harvests provided Galton with the datafor a Royal Institution lecture, "Typical Laws of Heredity," given in 1877. Complete data for Gallon's experiment are notavailable,but heobserved whathe statedto be asimplelaw that connected parentand offspring seeds.The offspring of each of the parental In Memoriesof My Life (1908), Galton says that he determinedon experimenting with sweet peas in 1885 and that the suggestionhad come to him from Sir Joseph Hooker (the botanist, 1817-1911)and Darwin. But Darwin had died in 1882 and theexperiments must have been suggested and were begun in 1875. Assuming that this is notjust a typographical error, perhaps Galton was recalling the dateof his important paperon regressionin hereditary stature.

132

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

weight categorieshad weights that were what we would now call normally distributed, and,the probable error(we would now calculate the standard deviation)was thesame. However, the mean weightof each groupof offspring was not asextremeas theparental weight. Large parent seeds produced larger than average seeds, but mean offspring weight was not aslarge as parental weight. At the other extremesmall parental seeds produced, on average, smaller offspring seeds,but themeanof the offspring was found not to be assmall as that of the parents. This phenomenon Galton termed reversion. Seeddiameters, Galton noted, are directly proportionalto their weight,and show the same effect: By family variability is meantthe departureof the childrenof the sameor similarly descendedfamilies, from the ideal mean typeof all of them. Reversionis the tendency of that ideal mean filial type to departfrom the parent type, "reverting" towards whatmay beroughly and perhapsfairly describedas theaverageancestral type. If family variability had beenthe only processin simple descent that affected the characteristicsof a sample,the dispersionof the race fromits mean ideal type would indefinitely increase withthe numberof generations;but reversion checks this increase,andbrings it to a standstill. (Galton, 1877, p. 291)

In the 1877 paper, Galton gives a measureof reversion,\v\uchhe symbolized r, and arrivesat anumberof the basic properties of what we nowcall regression. Some years passed before Galton returned to thetopic, but during those years he devised waysof obtainingthe anthropometric data he wanted. He offered prizes for the most detailed accounts of family historiesof physicalandmental characteristics, character and temperament, occupations and illnesses, height and appearance,and so on, and in1884 he opened,at his own expense,an anthropometric laboratoryat theInternational Health Exhibition. For a small sum of money, members of the public were admittedto the laboratory. In return the visitors receiveda record of their various physical dimensions, measures of strength, sensory acuities, breathing capacity, color discrimination, and judgmentsof length. 9,337 persons were measured, of whom 4,726 were adult males and 1,657 adult females. At the end of the exhibition, Galton obtaineda site for the laboratoryat the South Kensington Museum,where data continued to becollectedfor closeto 8 more years. These data formed partof a numberof papers. They were, of course,not entirely free from errorsdue toapparatus failure, the circumstancesof the data recording, andother factors thatarefamiliar enoughto experimentalistsof the present day. Some artifacts might have been introduced in other ways. Galton (1884) commented, "Hardlyany trouble occurredwith the visitors, thoughon some few occasions rough persons entered the laboratorywho were apparentlynot altogethersober"(p. 206). In 1885, Gallon's Presidential Address to theAnthropological Sectionof

FIG 10.2

Plate from Galton's 1885a paper (Journal of the Anthropological Institute).

FIG. 10.3

Plate from Galton's 1885a paper (Journal of the Anthropological Institute).

GALTON'S DISCOVERY OF REGRESSION

135

the British Association (1885b), meeting that year in Aberdeen, Scotland, discussedthe phenomenonof what he nowtermed regression toward mediocrity in human hereditary stature.An extended paper (1885a) in the Journal of the Anthropological Institute gives illustrations, which are reproducedhere. First it may benoted that Galton used a measureof parental heights that he termed the height of the "mid-parent." He multiplied the mother's heightby 1.08 andtook the meanof the resulting valueand thefather's heightto produce 4 the mid-parent value. He found that a deviation from mediocrity of one unit of height in the parentswas accompaniedby a deviation,on average,of about only two-thirds of a unit in the children (Fig. 10.2). This outcome paralleled what he hadobservedin the sweetpea data. When the frequenciesof the (adult) children's measurements were entered into a matrix against mid-parent heights, the data being"smoothed"by computing the meansof four adjacent cells, Galton noticed that values of the same frequency fell on a line that constitutedan ellipse. Indeed,the data produceda series of ellipsesall centeredon themeanof the measurements. Straight lines drawn from this centerto pointson theellipse that were maximally distant (the points of contact of thehorizontalandvertical tangents- thelinesYN and XM in Fig. 10.3) producethe regressionlines,ON and OM, and theslopesof these lines give the regression values of \ andj . The elliptic contours, which Galton said he noticed whenhe waspondering on his datawhile waiting for a train, are nothing more thanthe contour lines that are producedfrom the horizontal sectionsof the frequency surface generated by two normal distributions (Fig. 10.4). The time had nowcome for some serious mathematics. All the formulae for Conic Sections having long since goneout of my head,I went on my return to London to the Royal Institutionto read themup. Professor, now Sir James,Dewar, camein, and probably noticingsigns of despair on my face, asked me what I was about;then said,"Why do you bother over this?My brother-in-law, J. Hamilton Dickson of Peterhouse loves problems and wants new ones. Sendit to him." I did so, under the form of a problem in mechanics,and hemost cordially helped me by working it out, as proposed,on thebasis of the usuallyacceptedand generally justifiable Gaussian Law of Error. (Galton, 1908,pp. 302-303) I may bepermitted to saythat I never felt sucha glow of loyalty and respecttowards the sovereigntyand magnificent swayof mathematical analysis aswhen his answer Galton (1885a) maintains that this factor "differs a very little from the factors employedby other anthropologists, who, moreover, differ a trifle between themselves; anyhow it suitsmy data better than 1.07 or 1.09" (p. 247). Galton also maintained (and checked in his data) "that marriage selection takes little or noaccountof shortnessor tallness...we maythereforeregardthe marriedfolk ascouples picked out of the general populationat haphazard" (1885a,pp. 250-251) - a statement thatis not only implausibleto anyonewho hascasually observed married couples but isalsonot borneout byreasonably careful investigation.

FIG. 10.4

136

Frequency Surfaces and Ellipses (From Yule & Kendall, 14th Ed., 1950)

GALTON'S DISCOVERY OF REGRESSION

137

reached me, confirming, by purely mathematical reasoning,my various and laboriousstatistical conclusions than I had daredto hope, for the original data ran somewhatroughly, and 1 had tosmooth them with tender caution. (Galton, 1885b, p. 509)

Now Galton was certainly not a mathematical ignoramus, but the fact that one ofstatistics' founding fathers sought help for the analysisof his momentous discovery may be ofsome small comfortto studentsof the social scienceswho sometimesfind mathematics such a trial. Another breakthroughwas to come, and again, it was to culminatein a mathematical analysis, this time by Karl Pearson, and the developmentof the familiar formula for the correlation coefficient. In 1886, a paperon "Family Likenessin Stature,"publishedin the Proceedings of the Royal Society, presents Hamilton Dickson's contribution, as well as data collectedby Galton from family records. Of passing interestis his use of the symbol w for "the ratio of regression" (short-lived, as it turns out)as he details correlations between pairs of relatives. Gallon's work had led him to amethod of describing the relationship between parentsand offspring and between other relatives on a particular characteristicby usingthe regression slope.Now heapplied himselfto the task of quantifying the relationship between different characteristics,the sort of data collectedat theanthropometric laboratory. It dawnedon him in aflash of insight that if each characteristic was measuredon ascale basedon its ownvariability (in other words,in what we now call standard scores), then the regression coefficient could be applied to these data. It was noted in chapter 1 that the location of this illumination was perhapsnot the place that Galton recalled in his memoirs (written whenhe was in hiseighties)so that the commemorative tablet that Pearson said the discovery deserved will haveto be sited carefully. Before examining someof the consequencesof Galton's inspiration,the phenomenonof regressionis worth a further look. Arithmeticallyit is real enough, but,as Forrest (1974) puts it: It is not that the offspring have been forced towards mediocrity by the pressureof their mediocreremote ancestry,but aconsequenceof a less than perfect correlation betweenthe parentsand their offspring. By restrictinghis analysisto the offspring of a selectedparentageand attemptingto understand their deviations from the mean Galton failsto account for the deviationof all offspring. ... Galton's conclusionis that regressionis perpetual and that the only way in which evolutionarychangecan occur is through the occurrenceof sports, (p. 206)5 5 In this contextthe term "sport" refersto ananimal or a plant that differs strikingly from its species type. In modern parlance we would speakof "mutations." It is ironic thatthe Mendelians,led byWilliam Bateson(1861-1926),used Galton's work in supportof their argument thatevolution was discontinuous and saltatory, and that the biometricians.led by Pearsonand Weldon,who held fast to the notion that continuousvariationwas thefountainheadof evolution, took inspiration from the same source.

138

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

Wallis and Roberts (1956) present a delightful accountof the fallacy and give examplesof it in connection with companyprofits, family incomes, mid-term and finalgrades,and salesandpolitical campaigns.As they say: Take any set ofdata,arrange themin groups according to some characteristic, and then for eachgroup computethe averageof some secondcharacteristic. Thenthe variability of the second characteristic will usually appearto be less than thatof the first characteristic.( p. 263)

In statisticsthe term regressionnow means prediction,a point of confusion for many students unfamiliar with its history. Yule and Kendall (1950) observe: The term "regression"is not aparticularly happyone from the etymological point of view, but it is sofirmly embeddedin statistical literature that we makeno attempt to replace it by an expression whichwould more suitably expressits essential properties,(p. 213)

The word is now part of the statistical arsenal, and it servesto remind those of us who areinvolvedin its application of an important episodein the history of the discipline. GALTON'S MEASURE OF CO-RELATION On December 20th, 1888, Galton's paper, Co-relations and Their Measurement,Chiefly from Anthropometric Data,was read beforethe Royal Societyof London. It begins: "Co-relation or correlation of structure" is a phrase much used in biology, and not least in that branchof it which refersto heredity,and theideais even morefrequently present thanthe phrase;but 1 am notawareof anyprevious attempt to define it clearly, to traceits modeof action in detail, or to show how to measureits degree. (Galton, 1888a,p. 135)

He goeson to statethat the co-relation between two variable organs must be due, in part, to common causes, that if variation was wholly due to common causes then co-relation would be perfect,and if variation "were in no respect due tocommon causes, the co-relationwould be nil" (p. 135). His aim thenis to showhow this co-relationmay beexpressedas asimple number,and heuses as anillustrationthe relationship between the left cubit (the distance between the elbow of the bent left arm and the tip of themiddle finger) and stature, althoughhe presents tables showingthe relationshipsbetweena varietyof other physical measurements.His data are drawn from measurements made on 350 adult males at the anthropometric laboratory, and then, as now, there were missingdata. "The exact number of 350 is notpreserved throughout, as injury

GALTON'S MEASURE OF CO-RELATION

139

to some limbor other reducedthe available numberby 1, 2, or 3 indifferent cases"(Galton, 1888a,p. 137). After tabulatingthe data in order of magnitude, Galton noted the valuesat the first, second,andthird quartiles.One half the value obtainedby subtracting the value at thethird from the valueat thesecond quartile gives him Q, which, he notes,is the probable errorof any single measurein the series,and thevalue at the second quartileis the median. For staturehe obtaineda medianof 67.2 inchesand a Q of1.75, and for theleft cubit, a medianof 18.05 inchesand a Q of 0.56. It shouldbe noted that although Galton calculated the median,he refers to it as being practically the mean value,"becausethe series run with fair symmetry" (p. 137). Galton clearly recognized that these manipulations did not demand thatthe original unitsof measurement be thesame: It will be understood that the Q valueis auniversalunit applicableto themost varied measurements, such asbreathing capacity, strength, memory, keenness of eyesight, and enables themto be compared together on equal terms notwithstanding their intrinsic diversity. (Galton, 1888a,p. 137)

Perhaps in an unconscious anticipation of the inevitable universalityof the metric system,he also recordshis data on physical dimensions in centimeters. Figure 10.5 reproduces the data (TableIII in Galton's paper)from which the closenessof the co-relation between stature and cubit was calculated. A graph was plotted of stature, measured in deviationsfrom M s in units of Qs, against the mean of the correspondingleft cubits, again measured as deviations, this timefrom M c in units of Qc (columnA against columnB in the table). Betweenthe same axes,left cubit was plotted as adeviation from M c measuredin units of Qc againstthe mean of corresponding statures measured as deviationsfrom M s in units of Qs (columnsC and D in thetable). A line is then drawnto represent "the general run" of the plotted points: It is here seento be astraight line, and it wassimilarly found to be straight in every other figure drawnfrom the different pairsof co-related variables that I haveas yet tried. But the inclinationof the line to the vertical differs considerablyin different cases. In the presentone theinclination is such that a deviationof 1 on thepart of the subject [the ordinate values], whether it be statureor cubit, is accompaniedby a mean deviationon thepart of the relative [the valuesof the abscissa], whether it be cubit or stature,of 0.8. This decimalfraction is consequentlythe measureof the closenessof the co-relation. (Galton, 1888a, p. 140)

Galton also calculates the predicted values from the regression line.He takes what he terms the "smoothed"(i.e., readfrom the regression line) value for a given deviationmeasurein unitsof Qc or Qs, multipliesit by Qc or Qs, andadds the result to the meanMc or My For example, +1.30(0.56) + 18.05 = 18.8. In modern terms,he computesz' (s) + X~ X. It isilluminating to recomputeand

FIG. 10.5

Galton's Data 1888. Proceedings of the Royal Society.

THE COEFFICIENT OF CORRELATION

141

to replot Galton'sdataand tofollow his line of statistical reasoning. Finally, Galton returnsto his original symbol r, to representdegree of co-relation, thesymbol whichwe usetoday, andredefines/=\ 1 - r 2 as"the Q value of the distribution of any systemof x values,as x} , x2, x3, &c., round the mean of all of them, which we may call X " (p. 144), which mirrorsour modern-day calculationof the standard errorof estimate,and which Galton had obtainedin 1877. This short paperis not acomplete account of themeasurementof correlation. Galton shows thathe has not yetarrived at the notion of negative correlation nor of multiple correlation,but his concluding sentences show just how far he did go: Lety = the deviationof the subject, whichever of the two variablesmay betaken in that capacity;and let;c,, ;c2, x3, &c., be thecorresponding deviations of the relative, and let the meanof thesebe X. Then we find: (1) that y = rX for all valuesof y; (2) that r is the samewhicheverof the two variablesis taken for the subject;(3) that r is always less than1; (4) that r measuresthe closenessof co-relation. (Gallon,1888a, p. 145)

Chronologically, Gallon'sfinal contributionto regressionandheredity is his book Natural Inheritance, published in 1889. This bookwascompleted several months beforethe 1888 paperon correlationandcontains noneof its important findings. It was, however,an influential book that repeatsa great dealof Galton's earlier workon the statisticsof heredity,the sweetpea experiments, the data on stature,the recordsof family faculties, and so on. It wasenthusiastically received by Walter Weldon(1860-1906),who wasthen a Fellow of St John's College, Cambridge, and University Lecturer in Invertebrate Morphology, and it pointed him toward quantitative solutions to problems in species variation thathad been occupyinghis attention. This booklinked Galton with Weldon, and Weldon with Pearson,and then Pearsonwith Galton, a concatenation that beganthe biometric movement. THE COEFFICIENT OF CORRELATION In June 1890, Weldon was electeda Fellow of the Royal Society,and later that year became Jodrell Professor of Zoology at University College, London.In March of the same yearthe Royal Societyhad receivedthe first of his biometric papers that describes the distribution of variationsin a numberof measurements made on shrimps. A Marine Biological Laboratoryhad been constructedat Plymouth2 yearsearlier, and since that time Weldon had spent partof the year there collecting-measurements of thephysical dimensions of these creatures and their organs.The Royal Society paperhad been sentto Galton for review, and with his help the statistical analyseshad been reworked. This marked the

142

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

beginningof Weldon'sfriendshipwith Gallon and aconcentrationon biometric work that lastedfor the rest of his life. In fact, the paper doesnot present the resultsof a correlational analysis, although apparently one hadbeen carried out. I haveattemptedto apply to theorgansmeasuredthe testof correlation givenby Mr. Galton ... and theresult seemsto show thatthe degreeof correlation betweentwo organsis constantin all the racesexamined;Mr. Galton has,in a letter to myself, predicted thisresult A result of this kind is, however,so important to the general theory of heredity, thatI prefer to postponea discussionof it until a larger body of evidencehas been collected. (Weldon, 1890, p. 453)

The year 1892saw these analyses published.Weldon beginshis paperby summarizingGalton's methods.He then describesthe measurements that he made, presents extensive tables of his calculations,the degreeof correlalion between the pairs of organs, and the probable errorof the distributions (Qvl -r 2). The actualcalculation of the degreeof correlation departs somewhal from Gallon's method,although,because Gallon was inclose touch wilt Weldon, we canassume thatit had hisblessing.It is quite straightforwardand is here quotedin full: (1.). . . let all thoseindividuals be chosenin which a certain organ,A, differs from its averagesizeby a fixed amount,Y; then, in these individuals,let thedeviationsof a secondorgan,B, from its averagebe measured.The variousindividualswill exhibit deviationsof B equal tox} ,x2,x3,..., whose meanmay becalled xm. Theratio jcm/Y will be constantfor all valuesof Y. In the sameway, supposethoseindividuals are chosenin which the organB has a constant deviation, X; then,in theseindividuals,ym. themean deviation of the organ A, will have the sameratio to X, whatevermay be thevalueof X. (2.) The ratios * m/Y and ym/X are connectedby an interesting relation. Let Qa representthe probable errorof distribution of the organA about its average,and Qb that of theorgan B; then -

a constant. So that by taking a fixed deviationof either organ, expressed in terms of its probable error,and byexpressingthe mean associated deviation of the second organ in terms of its probable error,a ratio may bedetermined,whosevalue becomes± 1 when a changein either organ involves an equal changein the other, and 0when the two organsarequite independent. This constant, therefore, measures the "degreeof correlation"betweenthe twoorgans.(Weldon, 1892,p. 3)

In 1893, more extensive calculations were reported on data collectedfrom two large (eachof 1,000adult females)samplesof crabs,one from the Bay of

THE COEFFICIENT OF CORRELATION

143

Naples and the other from Plymouth Sound.In this work, Weldon (1893) computesthe mean,the meanerror, and themodulus. His greater mathematical sophistication in this work is evident. He states: The probableerror is given below, instead of the'meanerror, becauseit is the constant which has thesmallest numerical value of any ingeneral use. This property renders the probable error more convenient than either the meanerror, the modulus,or the error of mean squares,in the determinationof the degreeof correlation whichwill be described below. (Weldon, 1893, pp. 322-323)

Weldon found thatin the Naples specimens the distributionof the "frontal breadth"producedwhat he terms an "asymmetricalresult." This finding he hoped might arise from the presencein the sampleof two racesof individuals. He notesthat Karl Pearsonhad tested this supposition and found that it was likely. Pearson (1906)statesin his obituary of Weldon thatit was this problem that led to his(Pearson's)first paper in the Mathematical Contributionsto the Theory of Evolution series, receivedby the Royal Societyin 1893. Weldon definedr and attemptedto nameit for Galton: a measureof the degree to which abnormalityin one organ is accompaniedby abnormality in a second. It becomes±1 when a changein one organ involvesan equal changein the other, and 0when the two organsare quite independent.The importance of this constantin all attemptsto deal with the problems of animal variation was first pointedout by Mr. Galton ... theconstant...may fitly be known as "Galton'sfunction." (Weldon, 1893,p. 325)

The statisticsof heredityand amutual interestin the plansfor the reform of the Universityof London (whichare describedby Pearsonin Weldon's obituary) drew Pearson and Weldon together,and they were close friendsand colleagues until Weldon's untimely death. Weldon's primary concern was to make his discipline, particularlyas it related to evolution, a more rigorous scienceby introducing statistical methods. He realized thathis own mathematical abilities were limited,and hetried, unsuccessfully,to interest Cambridge mathematical colleagues in his endeavor.His appointmentto the University College Chair broughthim into contactwith Pearson, but,in the meantime,he attempted to remedy his deficienciesby an extensive studyof mathematical probability. Pearson (1906) writes: Of this the writer feels sure, that his earliest contributionsto biometry werethe direct resultsof Weldon's suggestions and would never have been carried out without his inspiration and enthusiasm. Both were drawn independentlyby Galton's Natural Inheritance to these problems, (p. 20)

Pearson'smotivationwas quitedifferent from that of Weldon. MacKenzie

144

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

(1981) hasprovided us with a fascinating accountof Pearson's background, philosophy, and political outlook.. He wasallied with the Fabian socialists 6; he held strong viewson women's rights;he wasconvincedof the necessityof adopting rational scientific approachesto a range of social issues;and his advocacy of the interestsof the professional middle-class sustained his promotion of eugenics. His originality, his real transformation rather than re-ordering of knowledge, is to be found in his work in statisticalbiology, wherehe took Galton'sinsights and made out of them a newscience. It was thework of his maturity - he startedit only in his mid-thirties- and in it can befound theflowering of mostof themajor concernsof his youth. (MacKenzie, 1981,pp. 87-88) It can beclearly seen that Pearson was notmerely providinga mathematicalapparatus for othersto use ... Pearson'spoint wasessentiallya political one:the viability, and indeedsuperiorityto capitalism,of a socialiststatewith eugenically-plannedreproduction. The quantitative statistical formof his argument providedhim with convincing rhetorical resources.(MacKenzie, 1981,p. 91)

MacKenzie is at painsto point out that Pearsondid not consciouslyset out to found a professional middle-class ideology. His analysis confines itselfto the view that herewas a match of beliefs and social interests that fostered Pearson'sunique contribution,and that this sortof sociological approachmay be usedto assessthe work of such exceptional individuals. Pearsondid not seek to becomethe leader of a movement; indeed,the compromises necessary for such an aspiration would have been anathema to what he saw as the role of the scientist and theintellectual. However one assesses the operationof the forces that molded Pearson's work, it is clear that,from a purely technical standpoint, the introduction,at this juncture, of an able, professional mathematician to the field of statistical 7 methodsbroughtabouta rapid advanceand agreatly elevated sophistication. The third paper in the Mathematical Contributions series was read beforethe Royal Society in November 1895.It is an extensive paper which dealswith, among other things, the general theoryof correlation. It containsa numberof historical misinterpretations and inaccuracies, that Pearson (1920) later 6

The Fabian Society (the Webbs andGeorge Bernard Shaw were leading members) its took name from Fabius,the Roman Emperorwho adopteda strategyof defenseand harrassmentin Rome's war with Hannibaland avoided directconfrontations. The Fabiansadvocated gradual advance and reform of society, rather than revolution. 7

Pearsonplaced Third Wranglerin the Mathematical Triposat Cambridgein 1879. The "Wranglers" were the mathematics students at Cambridgewho obtainedFirst Class Honors- the oneswho most successfully"wrangle" with math problems. This method of classification was abandonedin the early yearsof this century.

THE COEFFICIENT OF CORRELATION

145

attemptedto rectify, but for today's usersof statistical methods it is of crucial importance,for it presentsthe familiar deviation score formula for the coefficient of correlation. It might be mentioned here that the latter termhad been introducedfor r by F. Y. Edgeworth (1845-1926)in an impossibly difficult-to-follow paper published in 1892. Edgeworthwas Drummond Professor of Political Economyat Oxford from 1891 untilhis retirementin 1922,and it is ofsomeinterestto note that he hadtried to attract Pearsonto mathematical economics, but without success.Pearson (1920)saysthat Edgeworthwas also recruitedto correlation by Galton's Natural Inheritance,and he remainedin close touch withthe biometricians over many years. Pearson's important paper (published in the Philosophical Transactionsof the Royal Societyin 1896) warrants close examination. He begins withan introductionthat statesthe advantagesandlimitations of thestatistical approach, pointing out that it cannot give us precise information about relationships between individualsand that nothingbut meansand averagesand probabilities with regard to large classescan bedealt with. On theotherhand, the mathematical theorywill be of assistanceto the medicalman by answering,inter alia, in its discussionof regressionthe problemas to theaverage effect upon the offspring of given degreesof morbid variationin the parents.It may enable the physician, in many cases,to state a belief basedon a high degreeof probability, if it offers no ground for dogma in individual cases.(Pearson,1896, p. 255)

Pearsongoeson to define the mean, median, and mode, the normal probability distribution, correlation,and regression,as well as various termsemployed in selection and heredity. Next comesa historical section,whichis examined laterin this chapter. Section4 of Pearson'spaper examinesthe "special caseof two correlatedorgans." He derives whathe terms the "wellknown Galtonian formof the frequency for two correlated variables,"and says that r is the "GALTON function or coefficient of correlation" (p. 264). However, he is notsatisfied thatthe methods usedby Galton and Weldon give practically the best methodof determiningr, and hegoes on to show by what we would now call the maximum likelihood method that S(xy)/(nG\ a2) is the bestvalue (todaywe replaceS by I forsummation).This expressionis familiar to every beginning student in statistics. As Pearson (1896) puts it, "This value presentsno practical difficulty in calculation,and thereforewe shall adoptit" 8 (p. 265). It is now well-known thatwe have done precisely that. Pearson (1896,p. 265) notes "thatS(xy) correspondsto theproduct-moment of dynamics,as S ( x ) to the momentof inertia." This is why r is often referred to as the"product-momentcoefficient of correlation."

146

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

There follows the derivation of the standard deviation of the coefficient of 2 2 correlation,(1 - r )/Vn( 1 - r ), which Pearson translates to theprobable error, 0.674506(1- r 2)/\n( 1- r 2). These statistics arethen usedtorework Weldon's shrimp and crab data, and the results show that Weldon was mistaken in assuming constancy of correlationin local racesof the same species. Pearson's exhaustive re-examination of Galton's dataon family stature also shows that someof the earlier conclusions were in error. The detailsof these analyses are now of narrow historical interest only, but Pearson's general approach is amodel for all experimentersand usersof statistics. He emphasizesthe importanceof samplesizein reducingthe probable error,mentionsthe importanceof precision in measurement, cautions against general conclusions from biased samples, and treats his findings with admirablescientific caution. Pearson introduces V= (o/w) 100 , thecoefficient of variation, as a way of comparing variation,andshows that "the significance of themutual regressions o f . . . two organsare as thesquaresof their coefficients of variation" (p. 277). It should alsobe noted that,in this paper, Pearson pushes much closer to the solution of problems associated with multiple correlation. This almost completes the accountof a historical perspective of the developmentof the Pearsoncoefficient of correlation as it iswidely knownand used. However,the record demands that Pearson's account be lookedat inmore detail and the interpretation of measuresof associationasthey were seenby others, most notably GeorgeUdny Yule (1871-1951),who made valuable contributions to thetopic, beexamined.

CORRELATION - CONTROVERSIES ANDCHARACTER Pearson's (1896) paper includes a sectionon the history of the mathematical foundationsof correlation. He says thatthe fundamentaltheorems were "exhaustivelydiscussed"by Bravaisin 1846. Indeed,he attributesto Bravaisthe invention of the GALTONfunction while admitting that "a single symbolis not used for it" (p. 261). He also statesthat S(xy)/(na\ a2) "is the value givenby Bravais, but he doesnot show thatit is the best" (p. 265). In examiningthe general theoremof a multiple correlationsurface, Pearson refersto it as Edgeworth's Theorem. Twenty-five years later he repudiates these statements and attemptsto set hisrecord straight.He avers that: They have been accepted by later writers, notablyMr Yule in his manualof statistics, who writes (p. 188): "Bravais introduced the product-sum,but not asingle symbol for a coefficient of correlation. Sir Francis Galton developed the practical method, determining his coefficient (Galton'sfunction as it wastermed at first) graphically. Edgeworth developed the theoretical sidefurther and Pearson introduced the product-sum formula."

CORRELATION - CONTROVERSIES AND CHARACTER

147

Now I regretto saythat nearly the whole of the abovestatementsare hopelessly incorrect.(Pearson,1920, p. 28)

Now clearly, it is just not thecase "that nearlythe whole of the above statementsare hopelessly incorrect."In fact, what Pearson is trying to do is to emphasizethe importanceof his own andGalton's contributionand toshift the blame for the maintenanceof historical inaccuracies to Yule, with whom he had come to have some serious disagreement. The natureof this disagreement is of interest and can beexaminedfrom at least three perspectives. The first is that of eugenics,the secondis that of the personalitiesof the antagonists,and the third is that of thefundamental utility of, and theassumptions underlying, measuresof association. However, before these matters are examined, some brief commenton the contributionsof earlier scholarsto the mathematicsof correlation is in order. The final yearsof the 18th centuryand the first 20yearsof the 19th was the period in which the theoreticalfoundations of the mathematicsof errors of observation were laid. Laplace (1749-1827)and Gauss(1777-1855)are the best-knownof the mathematicianswho derivedthe law of frequencyof error anddescribedits application,but anumberof other writers also made significant contributions (see Walker, 1929, for an accountof these developments). These scholarsall examinedthe questionof the probabilityof thejoint occurrenceof two errors,but "None of them conceivedof this as amatter which could have application outsidethe fields of astronomy, physics, andgeodesyor gambling" (Walker, 1929,p. 94). In fact, put simply, these workers were interested solely in the mathematics associatedwith the probabilityof the simultaneousoccurrenceof two errorsin, say,the measurement of the positionof a pointin a planeor in three dimensions. They were clearlynot looking for a measureof a possible relationship between the errors and certainly not consideringthe notion of organic relationships among directly measured variables. Indeed, astronomers andsurveyors sought to make their basic measurements independent. Having said this,it is apparent that the mathematicalformulationsthat were produced by these earlier scientists are strikingly similarto those deduced by Galtonand Hamilton Dickson. Auguste Bravais(1811-1863),who had careers as a naval officer, an astronomerand physicist, perhaps came closest to anticipatingthe correlation coefficient; indeed, he even usesthe term correlation in his paper of 1846. Bravais derivedthe formula for the frequency surfaceof the bivariate normal distribution and showed thatit was aseriesof concentricellipses,as didGalton and Hamilton Dickson40 years later. Pearson's acknowledgment of the work of Bravaisled to thecorrelationcoefficient sometimes being called the BravaisPearson coefficient. When Pearson came, in 1920, to revisehis estimationof Bravais' role,he

148

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

describeshis ownearly investigationsof correlationandmentionshis lectures on the topic to research students at University College. He says: I was far tooexcited to stop to investigate properly whatother peoplehad done. I wanted to reachnew resultsandapply them. AccordinglyI did not examine carefully either Bravais or Edgeworth,andwhen I cameto put mylecturenoteson correlation into written form, probablyaskedsomeonewho attendedthe lecturesto examinethe papersand saywhat was in them. Only when I now comeback to the papersof Bravaisand Edgeworthdo I realisenot only that I did grave injusticeto others,but mademostmisleadingstatements which havebeenspreadbroadcastby thetext-book writers. (Pearson,1920, p. 29)

The "Theory of Normal Correlation"was one of thetopics dealt withby Pearson whenhe startedhis lectureson the'Theoryof Statistics"at University College in the 1894-1895session, "givingtwo hoursa week to a small but enthusiastic class of two students- Miss Alice Lee, Demonstrator in Physicsat the Bedford College,andmyself (Yule, 1897a,p. 457). One ofthese enthusiastic students, Yule, published his famous textbook,An Introduction to the Theoryof Statistics,in 1911,a book thatby 1920 was in its fifth edition. Laterin the 1920 paper, Pearson again deals curtly with Yule. He is discussingthe utility of a theory of correlation thatis not dependenton the assumptionsof the bivariate normaldistribution andsays,"As early as 1897 Mr G. U. Yule, thenmy assistant, made an attemptin this direction" (Pearson, 1920, P- 45). In fact, Yule (1897b)in a paperin theJournal of the Royal Statistical Society, had derived least squares solutions to the correlation of two, three, and four variables. This methodis a compelling demonstration of appropriatenessof Pearson'sformula for r under the least squares criterion. But Pearsonis not impressed,or at leastin 1920 he is notimpressed: Are we notmakinga fetish of themethodof leastsquaresasothersmadea fetish of the normal distribution? ... It is by no means clear therefore that Mr Yule's generalisationindicatesthe real line of future advance.(Pearson,1920, p. 45)

This, to say theleast, cold approachto Yule's work (and thereare many other examplesof Pearson's overt invective) wascertainlynot evident overthe 10 years that span the turn of the 19thto the 20th centuries. During what Yule describedas "the old days" he spent several holidays with Pearson, and even when their personal relationship had soured,Yule states thatin nonintellectual matters Pearsonremained courteousand friendly (Yule, 1936), althoughone cannotbut help feel that herewe havethe wordsof anessentially kindand gentle man writing an obituary noticeof one of thefathersof his chosen discipline. What was thenatureof this disagreement that so aroused Pearson's wrath? In 1900, Yule developed a measureof associationfor nominal variables,the

CORRELATION - CONTROVERSIES AND CHARACTER

149

frequenciesof which are entered intothe cells of a contingencytable. Yule presentsvery simple criteriafor such measures of association, namely, that they shouldbe zero when thereis norelationship (i.e.,the variablesareindependent), +1 when there is complete dependence or association,and -1 when thereis a complete negative relationship. The illustrative example chosenis that of a matrix formed from cells labeledas follows: Survived A Died

VaccinatedB AB aB

Unvaccinated A0 ap

and Yule deviseda measure,Q (named,it appears,for Quetelet), that satisfies the statedcriteria, and isgiven by,

Yule's paper had been "received"by the Royal Society in October 1899, and "read"in Decemberof the same year.It was describedby Pearson (1900b), in a paper that examines the same problem,as "Mr Yule's valuable memoir" (p. 1). In his paper, Pearson undertakes to investigate "the theoryof the whole subject" (p. 1) andarrives at his measureof associationfor two-by-two contingency table frequencies, an index that he called the tetrachoric coefficient of correlation. He examines other possible measure's of association, including Yule's Q, but heconsiders themto bemerely approximations to the tetrachoric coefficient. The crucial differencebetween Yule's approach andthat of Pearson is that Yule's criteriafor ameasureof associationareempiricalandarithmetical, whereasthe fundamental assumption for Pearsonwasthat the attributes whose frequencieswere countedin fact arosefrom an underlying continuous bivariate normal distribution. The details of Pearson's method are somewhat complex and are notexamined here. There were a numberof developmentsfrom this work, including Pearson's derivation, in 1904,of the mean square contingency and thecontingencycoefficient. All these measures demanded the assumption of anunderlying continuous distribution, even thoughthe variables,asthey were considered, were categorical. Battle commenced in late1905 when Yule criticized Pearson'sassumptionsin a paper readto the Royal Society (Yule, 1906). Biometrika'sreaders were soon to see, "Replyto Certain Criticismsof Mr G. U. Yule" (Pearson, 1907) and, after Yule's discussionof his indices appearedin the first edition of his textbook, were treatedto David Heron's exhortation, "The Danger of Certain Formulae Suggested asSubstitutesfor the Correlation Coefficient" (Heron,1911). These were stirring statistical times, marked by swingeing attacksas thebiometricians defended their position:

150

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

If Mr Yule's viewsare accepted,irreparable damage will be done to the growth of modern statistical theory... we shall termMr Yule's latest methodof approaching the problemof relationshipof attributesthe methodof pseudo-ranks...w e . .. reply to certain criticisms, not to saycharges,Mr Yule hasmade againstthe work of one or both of us. (Pearson & Heron, 1913,pp. 159-160)

Articulate sniping and mathematicalbombardment werethe methodsof attack usedby the biometricians. Yulewas more temperatebut nevertheless quite firm in his views: All those who have diedof small-poxare all equally dead:no one ofthem is more deador lessdeadthan another,and thedeadare quite distinct fromthe survivors. The introductionof needlessand unverifiable hypotheses does not appearto me to be adesirableproceedingin scientific work. (Yule, 1912,pp. 611-612)

Yule, and his great friend Major Greenwood, poked private fun at the opposition.Partsof a fantasy sentto Yule by Greenwoodin November 1913 arereproduced here: Extractsfrom The Times, April 1925 G. Udny Yule, who hadbeen convictedof high treasonon the 7thult, was executed this morning on a scaffold outside GowerSt. Station. A short but painful scene occurred on thescaffold. As the rope was being adjusted, the criminal made some observation, imperfectly heard in the press enclosure, the only audible words being "the normal coefficientis —." Yule was immediately seizedby theImperial guard and gagged. Up to thetime of going to pressthe warrantfor the apprehensionof Greenwoodhad not been executed,but the police have what they regard to be animportant clue. During the usual morning service at St. Paul's Cathedral, which was well attended, the carlovingian creed was, in accordancewith an imperial rescript, chantedby the choir. Whenthe solemn words,"I believein one holy and absolutecoefficient of four-fold correlation" were uttereda shabbily dressedman near the North door shouted"balls." Amid a sceneof indescribable excitement, the vergers armed with severalvolumes of Biometrika made their way to thespot.(Greenwood,quotedby MacKenzie, 1981,pp. 176-177)

The logical positionsof the two sides werequite different. For Pearsonit was absolutely necessary to preservethe link with interval-level measurement wherethe mathematicsof correlationhad beenfully specified. Mr Yule ... doesnot stop to discuss whether his attributesarereally continuousor discrete,or hide under discrete terminology true continuous variates. We seeunder such class indices as"death"or "recovery","employment"or "non-employment"of mother, only measures of continuousvariates... (p. 162) The fog in Mr Yule's mind is well illustrated by his table...(p. 226)

CORRELATION - CONTROVERSIES AND CHARACTER

151

Mr Yule is juggling with class-namesas if they representedreal entities,and his statisticsonly a form of symbolic logic. No knowledgeof a practical kindevercame out of theselogical theories,(p. 301) (Pearsonand Heron, 1913)

Yule is attackedon almost everyone ofthis paper's157 pages,but for him the issue was quite straightforward. Techniques of correlation were nothing more and nothing less than descriptions of dependencein nominal data. If different techniques gavedifferent answers,a point that Pearsonand his followers frequently raised, thenso be it. Themean, median,and mode give different answersto questions about the central tendencyof a distribution,but each has itsutility. Of coursethe controversywas never resolved.The controversycan never be resolved,for there is no absolute,"right" answer. Each camp started with certain basic assumptions, and areconciliationof their viewswas notpossible unlessone orboth sides wereto have abrogated those assumptions, or unlessit could have been shown with scientific certainty thatonly oneside's assumptions were viable.For Yule it was asituation thathe accepted.In his obituary notice of Pearsonhe says, "Timewill settlethe questionin duecourse"(p. 84). In this he waswrong, because there is nolongera question.The pragmatic practitioners of statistics in the presentday arelargely unaware that there ever even was a question. The disagreementhighlightsthe personalitydifferencesof the protagonists. Pearson couldbe describedas adifficult man. He held very strong viewson a variety of subjects,and he wasalways readyto takeup his pen and write scathing attackson those whomhe perceivedto be misguidedor misinformed. He was not ungenerous,and hedevotedimmenseamountsof time and energy to the work of his studentsand fellow researchers, but it is likely that tactwas not one of his strong pointsand anysort of compromisewould be seenas defeat. In 1939, Yule commentedon Pearson's polemics, noting that, in 1914, Pearson said that "Writers rarely . . . understandthe almost religious hatred which arisesin the true man of science whenhe sees error propagated in high places"(p. 221). Surely neitherin the best typeof religion nor in thebest typeof science should hatred enter in at all. ... In onerespectonly has scientific controversy perforce improved since the seventeenth century.If A disagrees with B's arguments, dislikeshis personality and isannoyedby the cock of his hat, he can nolonger, failingall else, resort to abuseof £'s latinity. (Yule, 1939, p. 221)

Yule had respect and affection for "K.P." even though he haddistanced himself from the biometriciansof Gower Street (where the Biometric Laboratory was located). He clearly disliked the way in which Pearsonand his followers closed ranksandpreparedfor combatat thesniff of criticism. In some

152

10. COMPARISONS, CORRELATIONS AND PREDICTIONS

respectswe canunderstand Pearson's attitudes. Perhaps he didfeel isolatedand beset. Weldon's death in 1906 and Gallon's in 1911 affectedhim greatly He was thecolossusof his field, and yetOxfordhadtwice(in 1897and 1899) turned down his applicationsfor chairs,and in 1901he applied, again unsuccessfully, for the chair of Natural Philosophyat Edinburgh. He felt the pressureof monumentalamountsof work and longed for greater scopeto carry out his research.This was indeed to come with grantsfrom the Drapers' Company, which supportedthe Biometric Laboratory,and Galton's bequest, which made him Professorof Eugenicsat University Collegein 1911.But more controversy, and more bitter battles, this time with Fisher, were just over the horizon. Once more,we areindebtedto MacKenzie,who soably puts togetherthe eugenicaspectsof the controversy.It is his view that this providesa much more adequate explanation than one derivedfrom the examinationof a personality clash. It is certainly unreasonable to deny thatthis was animportant factor. Yule appearsto have been largely apolitical. He came from a family of professionaladministratorsand civil servants,and hehimself worked for the War Office and for theMinistry of Food during World War I, work for which he receivedthe C.B.E. (Commanderof the British Empire) in 1918.He was not a eugenist,and his correspondencewith Greenwood shows that his attitude toward the eugenics movement was far from favorable.He was anactive memberof the Royal Statistical Society, a body that awardedhim its highest honor, the Guy Medal in gold, in 1911. The Society attracted members who were interestedin an ameliorativeand environmentalapproachto social issues - debatesonvaccination werea continuingfascination.Although Major Greenwood, Yule's closefriend, was at first anenthusiasticmemberof thebiometric school, his career,as astatisticianin the field of public healthand preventive medicine, drewhim toward the realization that povertyand squalor were powerful factors in the status and condition of the lower classes,a view that hardly reflected eugenic philosophy. For Pearson, eugenics and heredity shapedhis approachto the questionof correlation,and thenotion of continuousvariation was of critical importance. His notion of correlation,as afunction allowing direct prediction from one variable to another,is shown to have its roots in the task that correlationwas supposedto perform in evolutionary and eugenic prediction.It was notadequate simplyto know that offspring characteristicswere dependenton ancestral characteristics: this dependencehad to bemeasuredin sucha way as toallow the prediction of the effects of natural selection,or of conscious intervention in reproduction. To move in the direction indicated here, from prediction to potential control over evolutionary processes,required powerful and accurate predictive tools: mere statements of dependencewould be inadequate. (MacKenzie, 1981, p. 169)

MacKenzie's sociohistorical analysis is both compellingand provocative,

CORRELATION - CONTROVERSIES AND CHARACTER

1 53

but at leasttwo points, bothof which are recognizedby him, needto bemade. The first is that this viewof Pearson's motivations is contrary to his earlier expressedviews of the positivistic natureof science (Pearson,1892), and second, the controversy mightbeplacedin thecontextof atightly knit academic group defendingits position. To the first MacKenzie says that practical considerations outweighed philosophical ones, and yet it isclear that practical demands did not lead the biometriciansto form or tojoin political groups that might have made their aspirations reality. The second viewis mundaneand realistic. The discipline of psychologyhasseena numberof controversiesin its short history. Notable among themis the connectionist (stimulus-response) versus cognitive argument in the field of learning theory.The phenomenological,the psychodynamic, the social-learning, and thetrait theorists have arraigned themselves against each otherin a variety of combinationsin personalityresearch.Arguments aboutthe continuity-discontinuityof animalsand humankindare exemplified perhaps by the controversy over language acquisition in the higher primates. All thesedebatesare familiar enoughto today'sstudentsof psychology, and it is notinconceivableto view the correlation debateas part of the system of academic "rows" that will remain as long as there is freedom for intellectual controversyand scientific discourse. But the Yule-Pearson debateand its implications are not among those discussedin university classrooms across the globe. For amultitudeof reasons, not the least of which was thehorror of the negative eugenics espoused by the German Nazis,and thedemandsof the growing discipline of psychology for quantitative techniques that would help it deal with its subject matter, veryfew if any of today'spractitionersand researchersin the social sciences think of biometrics when theythink of statistics. Theybusily get onwith their analyses and, if they give thanksto Pearsonand Gallon at all, they remember them for their statistical insights rather than for their eugenic philosophy.

11 Factor Analysis

FACTORS A featureof the idealscientific method thatis always hailedasmost admirable is parsimony.To reducea massof factsto a single underlying factoror evento just a few conceptsand constructsis anongoing scientific concern. In psychology, the statisticaltechnique knownas,factor analysisis most oftenassociated with the attemptto comprehendthe structureof intelligenceand thedimensions of personality. The intellectual struggles that accompany discussion of these mattersare avery longway from being over. Of all the statistical techniques discussed in this book, factor analysis may be justly claimedas psychology's own.As Lawley and Maxwell (1971)and others have noted, some of the early controversies that swirled around the methods employed arose from arguments about psychological, rather than mathematical, matters. Indeed, Lawley andMaxwell suggest that the controversies "discouragedthe interest shownby mathematiciansin the theoretical problems involved."(p. 1). Infact, the psychological quarrels stem from amuch older philosophical debate that is most often tracedto Francis Bacon,and his assertion that"the mere orderly arrangement of data would makethe right hypothesisobvious"(Russell, 1961,p. 529). Russellgoeson: The thing thatis achievedby thetheoretical organization of scienceis thecollection of all subordinateinductions into a few that arevery comprehensive- perhapsonly one. Such comprehensive inductions are confirmedby so many instances that it is thought legitimateto accept,as regards them,an inductionby simple enumeration. This situation is profoundly unsatisfactory,(p. 530)

To bejust a little more concrete,in psychology: We recognizethe importanceof mentality in our lives and wish to characterizeit, in 154

FACTORS

155

part sothat we canmakethe divisionsanddistinctionsamongpeoplethatour cultural and political systemsdictate. We therefore givethe word "intelligence" to this wondrously complex and multifaceted set of human capabilities.This shorthand symbol is then reified and intelligence achievesits dubiousstatusas aunitary thing. (Gould, 1981, p. 24)

To besomewhat trite,it is clear that among employed persons there is ahigh correlationbetweenthe sizeof a monthly paycheckandannualsalary,and it is easy to seethat both variablesare ameasureof income - the "cause"of the correlation is clear andobvious. It is therefore temptingnot only to look for the underlyingbasesof observed interrelationships among variables but to endow suchfindings with a substance that implies a causal entity, despite the protestations of thosewho assertthat the statistical associations do not, by themselves, provide evidencefor its existence. The refined searchfor these statistical distillations is factor analysis,and it is important to distinguish betweenthe mathematicalbasesof the methodsand theinterpretationof the result of their application. Now the correlation between monthly paycheck and annual salarywill not be perfect. Interest payments, profits, gifts, even author's royalties, will make r less than+1, but presumablyno onewould arguewith the proposition thatif one requires a reasonable measure of income and material well-being, either variable is adequateand the other thereby redundant. Galton (1888a), in commentingon Alphonse Bertillon's work thatwas designedto provide an anthropometric indexof identification, in particularthe identification of criminals, pointsto thenecessityof estimatingthe degreeof interdependence among the variables employed,as well as theimportanceof precisionin measurement and of not making measurement classifications too wide. There is little or nothing to begainedfrom including in the measurements a numberof variables that are highly correlated, when one would suffice. A somewhatdifferent line of reasoningis to derivefrom the intercorrelations of themeasured variables a set ofcomponents that areuncorrelated.Thenumber of components thus derived would be thesameas thenumberof variables.The componentsareorderedsothat the initial ones accountfor moreof the variance thanthe later ones thatare listed in the outcome. Thisis themethodof principal component analysis,for which Pearson (1901) is usually cited as theoriginal inspiration and Hotelling (1933, 1935)as thedeveloperand refiner, although there is no indication thatHotelling was influenced by Pearson's work.In fact, Edgeworth (1892, 1893)had suggesteda schemefor generatinga function containing uncorrelated terms that was derivedfrom correlated measurements. Macdonell (1901)was also an early workerin the field of "criminal anthropometry." He acknowledgesin his paper that Pearson haspointedout to him a methodof arriving at ideal characteristics that "would be given if we calculated

156

11. FACTOR ANALYSIS

the seven [there were seven measures] directions of uncorrelated variables, that is, theprincipal axesof the correlation 'ellipsoid'"(p. 209). The method,as Lawley and Maxwell (1971) have emphasized, although it has links with factor analysis,is not to betaken as avariety of factor analysis. The latter is amethod that attempts to distill downor concentratethe covariances in a set ofvariablesto a much smaller number of factors. Indeed, none other than Godfrey Thomson(1881-1955),a leading British factor theoristfrom the 1920sto the 1940s, maintained in a letterto D. F.Vincentof Britain's National Institute of Industrial Psychology that"he did not regard Pearson's 'lines of closestfit as anythingto do with factor analysis.'" (noted by Hearnshaw, 1979, p. 176). The reasonfor the ongoing association of Pearson's name with the birth of factor analysisis to be found in the role playedby another leading British psychologist,Sir Cyril Burt (1883-1971)in the writing and therewriting of the history of factor analysis,an episode examined later in this chapter. In essence,the method of principal componentsmay be visualizedby imagining thevariables measured - thetests- aspointsin space.Teststhatare correlatedwill be close togetherin clusters,andtests thatare notrelatedwill be further away fromthe cluster. Axesor vectorsmay beprojected into thisspace through the clusters in a fashion that allowsfor as much of the variance as possibleto be accounted for. Geometrically this projection can take placein only three dimensions, but, algebraically an w-dimensionalspacecan beconceptualized. THE BEGINNINGS In 1904 Charles Spearman (1863-1945)publishedtwo papers (1904a, 1904b) in the same volumeof the American Journalof Psychology, thenthe leading English language journal in psychology.The first ofthese,a general accountof correlation methods, their strengths and weaknesses,and thesourcesof error that may beintroduced that might "dilate"or "constrict" the results, included criticism of Pearson'swork. In his Huxley lecture (sponsored by the Anthropological Instituteof Great Britain and Ireland) of 1903, Pearson reported on a great amountof data that had been collectedby teachersin a numberof schools.The physical variables measured included health, hair and eyecolor, hair curliness, athletic power, head length, breadth,and height, and a"cephalic index."The psychological characteristics were assertiveness, vivacity, popularity, introspection, conscientiousness, temper, ability,and handwriting.The average correlations (omitting athletic power) reported between the measuresof brother, sister,and brothersister pairsare extraordinarily similar, ranging from .51 to .54. We areforced, 1 think literally forced,to thegeneral conclusion that the physicaland

THE BEGINNINGS

157

psychical charactersin men areinherited withinbroadlines in the samemannerand with the sameintensity. (Pearson,1904a,p. 156).

Spearmanfelt that the measurements taken could have been affected by "systematic deviations," "attenuation" produced by the suggestion thatthe teacher'sjudgments werenot infallible, and by lack of independencein the judgements,finally maintaining that: When we further consider that each of these physicaland mental characteristics will have quitea different amountof such error(in the former this being probably quite insignificant) it is difficult to avoid the conclusion thatthe remarkable coincidences announced between physical and mental hereditycan hardly be more than mere accidental coincidence. (Spearman, 1904a,p. 98).

But Pearson's statistical procedures were not underfire, for later he says: If this work of Pearsonhas thus been singled out for criticism, it is certainly fromno desire to undervalueit. ... My present objectis only to guard against premature conclusionsand topoint out the urgent needof still further improvingthe existing methodicsof correlational work. (Spearman, 1904a,p.99)

Pearsonwas notpleased. When the addresswas reprintedin Biometrika in 1904he includedan addendumreferringto Spearman's criticism that added fuel to the fire: The formula inventedby Mr Spearmanfor his so-called "dilation" is clearly wrong ... not only are hisformulae, especiallyfor probable errorserroneous,but hequite misunderstandsand misuses partial correlation coefficients, (p. 160).

It might be noted that neither man assumed that the correlations, whatever they were, mightbe due toanything other than heredity, and that Spearman largely objected to the data and possible deficienciesin its collection and Pearson largely objected to Spearman's statistics. The exchange ensured that the opponents would never even adequately agree on the location of the battlefield, let alone resolvetheir differences. Spearmanwas to become, successively, Reader andheadof a psychological laboratory at University College, London,in 1907, Grote Professor of the Philosophyof Mind and Logic in 1911, and Professorof Psychology in 1928 until his retirement in 1931. He was therefore a colleagueof Pearson'sin the same collegein the same university during almost the whole of the latter's tenureof the Chair of Eugenics. Their interests and their methodshad a great deal in common, but they never collaborated and they disliked each other. They clashed on the matter of Spearman's rank-difference correlation technique, and that and Spearman's new criticism of Pearsonand Pearson's reaction to it beganthe hostilities that

158

11. FACTOR ANALYSIS

continuedfor many years,culminating in Pearson's waspish, unsigned review (1927) of Spearman's(1927) book TheAbilities of Man. Pearson dismisses the book as,"distinctly written for the layman"(p. 181)andclaims that, "what ProfessorSpearman considers proofs are notproofs . . . With the failure of ChapterX, that is, 'Proof thatG and Sexist,' the very backbone disappears from the body of Prof. Spearman's work" (p. 183). Nowhere in this book does Pearson's name appear! In his 1904 paper, Spearman acknowledges that Pearson had given the name the method of "product moments" to the calculationof the correlation coefficient, but,in a footnote in his book, Spearman (1927), commenting again on correlation measures, remarks that: easily foremostis theprocedure whichwassubstantially givenin a beautiful memoir of Bravais and which is now called thatof "product moments." Such procedures have been further improved by Galton who inventedthe device - since adopted everywhere- of representingall gradesof interdependence by asingle number,the "coefficient," which rangesfrom unity for perfect correlation down to zero for entire absenceof it. (p. 56)

So much for Pearson! The second paper of 1904hadproduced Spearman's initial conclusion that: The . . . observed facts indicate that all branches of intellectual activity havein commononefundamentalfunction (or group offunctions) , whereasthe remaining or specific elementsof the activity seemin every caseto bewholly different from that in all the others. (Spearman, 1904b, p. 284)

This is Spearman'sfirst general statementof the two-factor theory of intelligence,a theory thathe was tovigorously expoundanddefend. Centralto this early workwas thenotionof a hierarchyof intelligences.In examining data drawnfrom the measurement of childrenfrom a "High Class Preparatory School for Boys" on a variety of tests,Spearman starts by adjustingthe correlation coefficientsto eliminateirrelevantinfluencesanderrorsusingpartial correlation techniquesand then orderingthe coefficients from highest to lowest. He observeda steady decrease in the valuesfrom left to right andfrom top tobottom in the resulting table. The samplewas remarkablysmall (only 33), but Spearman (1904b) confidently and boldly proclaims thathis methodhasdemonstrated the existenceof "GeneralIntelligence,"which,by 1914,he waslabelingg. Hemadeno attempt at this timeto analyzethe natureof his factor, but henotes: An important practical consequence of this universalUnity of the Intellectual Function, the various actual formsof mental activity constitutea stableinter-

THE BEGINNINGS Classics Classics

French

English

Math

Discrim

Music

0.83

0.78

0.70

0.66

0.63

0.67

0.67

0.65

0.57

0.64

0.54

0.51

0.45

0.51

French

0.83

English

0.78

0.67

Math

0.70

0.67

0.64

Discrim

0.66

0.65

0.54

0.45

Music

0.63

0.57

0.51

0.51

159

0.40 0.40

FIG 11.1 Spearman's hierarchical ordering of correlations connectedHierarchy accordingto thedifferent degreesof intellective saturation,(p. 284)

This work may bejustly claimed to include the first factor analysis. The intercorrelationsare shown in Figure 11.1. The hierarchical orderof the correlationsin the matrix is shown by the tendencyfor the coefficients in a pair of columnsto bear the same ratioto one another throughout that column. Of course, the techniqueof correlation that began with Galton (1888b) is central to all forms of factor analysis, but, more particularly, Spearman's work relied on the concept of partial correlation, which allowsus to examinethe relationships between, for example,two variables whena third is held constant. The method gave Spearman the meansof making mathematical the notion that the correlation of a specific test witha general factorwas common to all tests of intellectual functioning.The remaining portionof the variance, errorexcepted, is specificand uniqueto the test that measures the variable. Thisis the essenceof what cameto be known as thetwo-factor theory. The partial correlationof variables1 and 2when a third, 3, is held constant is given bv

So that— = — and therefore, r aj ri,d

160

11. FACTOR ANALYSIS

If there are twovariablesa and b and g is constant a and theonly causeof the correlation (rab) betweena and b is g,thenrah_g would be zero and hence

If a is set toequal b then This latter represents the variancein the variablea that is accountedfor by g andleadsus to thecommunalitiesin a matrix of correlations. If now we were to take four variablesa, b, c, and d, and to considerthe correlation of a and bwith c and d,then

The left-hand sideof this latter equationis Spearman's famous tetrad difference, and hisproof of it is givenin an appendixto his 1 921book. The term tetrad difference first appearsin the mid 1920s (see e.g., Spearman & Holzinger, 1925). So, amatrix of correlations such as

generatesthetetrad differenceracr M - r hcr ail and, without beingtoo mathematical, this is the value of the minor determinantof order two. Whenall these minor determinantsare zero,the matrix is of rank one. Moreover,the correlations can beexplainedby onegeneral factor. Wolfle (1940) has noted thatconfusion about whatthe tetrad difference meant or implied persisted over a good many years. It was sometimes thought that whena matrix of correlations satisfied the tetrad equation, every individual measurementof every variable could onlybe divided into two independent parts. Spearman insisted that what he maintainedwas that this division could

THE BEGINNINGS

161

be made,not that it was theonly possible division. Nevertheless, pronouncements aboutthe tetrad equation by Spearman werefrequently followed by assertions such as,"The onepart hasbeen calledthe 'general factor'and denoted by the letter g . . . Thesecond parthasbeen calledthe 'specific factor'and denotedby the letter s" (Spearman, 1927, p. 75). Moreover, Spearman'sfrequent referencesto "mental energy" gavethe impression thatg was "real," even though, especially when challenged, he denied that it was anything more thana useful, mathematical, explanatory construction. Spearman made ambiguous statements g, about but it seems that he was convinced thatthe two-factor theorywas paramountand the tetrad difference formed an important partof his argument.Not only that, Spearman was satisfied that Garnett's earlier "proof (1920)of the two-factor theoryhad effectively dealt with any criticism: There is another importantlimitation to the division of the variables into factors.It is that the division into generaland specific factorsall mutually independentcan be effected in one wayonly; in other words, it is unique.(Spearman,1927, p. vii)

In 1933 BrownandStephenson attempted a comprehensive test of the theory using an initial battery of 22 testson asampleof 300 boys aged between 10 and 10'/2 years. But they "purified"the batteryby dropping tests that for one reason or another did not fit and found support for the two-factor theory. Wolfle's (1940) reviewof factor analysis saysof this work: "Whetherone credits the attempt with success or not, all that it provesis this; if one removesall tetrad differenceswhich do not satisfy the criterion, the remaining onesdo satisfy it" (p. 9). Pearson and Moul (1927), in a lengthy paper, attempta dissection of Spearman'smathematics,in particular examiningthe sampling distributionof the tetrads and whether or not it can bejustified as being closeto the normal distribution.They conclude: "the claim of Professor Spearman to have effected a Copernican revolution in psychology seems at present premature" (p. 291). However, theydo state that even though they believe the theory of general and specific factorsis "too narrow a structureto form a frame for the great variety of mental abilities,"they believe thatit shouldandcould be testedmore adequately. Spearman's annoyance with Pearson's view of his work was not sotroubling as thepronouncements on it madeby theAmerican mathematician E. B.Wilson (1879-1964),Professorof Vital Statisticsat Harvard's Schoolof Public Health. The episodehas been meticulously researched and acomprehensive account given by Lovie and Lovie (1995). Spearman met Wilson at Harvard duringa visit to the United States latein 1927. Wilson had read The Abilities of Man

162

11. FACTOR ANALYSIS

and hadformedthe opinion thatits author's mathematics were wanting and had tried to explainto himthat it waspossibleto "getthe tetraddifferencesto vanish, one andall, in so many ways thatone might suspect thatthe resolution intoa general factorwas notunique" (quotedby Lovie & Lovie, 1995,p. 241). Wilson subsequently reviewed Spearman's book for Science.It is generally a friendly review. Wilson describesthe book as "animportantwork," written "clearly, spiritedly, suggestively,in places even provocatively," and hiswords concentrateon the mathematical appendix.Wilson admits thathis review is "lop-sided," but hemaintains that Spearman has missed someof the logical implicationsof the mathematics,and inparticular thathis solutionsare indeterminate. But he leavesa loopholefor Spearman: Do gx, g ,. .. whether determinedor undeterminable represent the intelligence of x, y,.. . ? Theauthor advancesa deal of argumentand ofstatisticsto show that they do. This is for psychologists,not for me toassess (Wilson, 1928,p. 246)

Wilson did not want to destroy the basic philosophyor to undermine completely the thrust of Spearman's work. Later in 1929, Wilson publisheda more detailed mathematical critique (1929b) and acorrespondence between him and Spearman ensued, lasting at least until 1933.The Lovies have analyzed these exchangesand show that a solution, or, as they term it, "an uneasy compromise" was reached. A "socially negotiated" solutionsaw Spearman accepting Wilson's critique, Garnett to modifying his earlier "proof of the two-factor theory,and Wilsonoffering ways in which the problems mightbe at least modified,if not overcome. Morespecifically, the idea of a partial indeterminacyin g was suggested. REWRITING THE BEGINNINGS In 1976, the world of psychology,and British psychologyin particular, was shaken by allegations publishedin the Sunday Timesby Oliver Gillie, to the effect that Sir Cyril Burt (1883-1971), Spearman's successor as Professorof Psychology at University College,had faked a large partof the data for his researchon twins and, among other labels in a seriesof articles, called Burt"a plagiarist of long standing." Burtwas aprominentfigure in psychologyin the United Kingdom and hiswork on the heritability of intelligence bolsteredby his dataon itsrelationship among twins reared apart, was notonly widely cited, but influenced educational policy. Leslie Hearnshaw, Professor Emeritus of Psychology at the University of Liverpool, was,at that time, researching his biography (publishedin 1979) of Burt, and when it appearedhe had notonly examinedthe apparentfraudulent natureof Burt's databut healso reviewedthe contributions of Burt to thetechniqueof factor analysis.

REWRITING THE BEGINNINGS

163

The "Burt Scandal"led to adebateby the British Psychological Society, the publication of a "balancesheet"(Beloff, 1980), and anumberof attemptsto rehabilitatehim (Fletcher, 1991; Jensen, 1992; Joynson, 1989). It must be said that Hearnshaw's view that Burt re-wrote history in an attemptto place himself rather than Spearman as thefounder of factor analysishas notbeen viewedas seriously as the allegations of fraud. Nevertheless, insofaras Hearnshaw's words have been picked over in theattemptsto atleast downplay,if not discredit, his findings, and insofar as therecord is importantin the history and development of factor analysis, theyare worth examininghere. Once more the credit must go to theLevies (1993),who have provideda "side-by-side"comparison of Spearman'sprepublication notesand commentson Burt's work and the relevant sectionsof the paper itself. The publicationof Burt's (1909) paperon "general intelligence"was preceded by some important correspondence with Spearman,who saw thepaper as offering support for his two-factor theory.In fact, Spearman re-worked the paper, and large sectionsof Spearman's notes on it were used substantially verbatim by Burt. As the Lovies point out, thiswas notblatant plagiarismon Burt's part, even though the latter's acknowledgement of Spearman's role was incomplete: but was akind of reciprocatedself-interest aboutthe Spearman-Burtaxis at this time which allowed Burt to copy from Spearman,and Spearman to view this with comparativeequanimitybecausethe article provided suchstrongsupportfor the two factor theory, (p. 315)

However,of more importas far as ourhistory is concernedis Hearnshaw's contention that Burt,in attempting to displace Spearman, used subtle and sometimes not-so-subtle commentary to placethe originsof factor analysis with Pearson's(1901) articleon "principle axes"and tosuggest thathe (Burt) knew of thesemethods beforehis exchangeswith Spearman. Indeed, Hearnshaw (1979) reportson Burt's correspondence with D. F. Vincent (noted earlier)in which Burt claimed thathe hadlearnedof Pearson's work when Pearson visited Oxford and that it was then thathe andMcDougall (Burt's mentor) became interestedin the techniques. After Spearman died, Burt's papers repeatedly emphasizethe priority of Pearsonand Burt's rolein the elaborationof methods that became knownasfactor analysis. Hearnshaw notes that Burt did not mention Pearson's 1901 work until 1947and implies that it was thepublication of Thurstone's book (1947) Multiple Factor Analysis and its final chapter on "The Principal Axes" that alerted Burt.It should alsobe noted that Wolfle's (1940) review, although citing11 of Burt's papersof the late 1930s, doesnot mention Pearsonat all. It will be some years beforethe "Burt scandal" becomes no more thanan

164

11. FACTOR ANALYSIS

historical footnote. Burt'smajor work The Factors of the Mind remainsa significant and important contribution,not only to the debate aboutthe nature of intelligence,but alsoto theapplicationof mathematicalmethodsto thetesting of theory. Eventhe most vehement critics of Burt would likely agreeand, with Hearnshaw, might say, "It is lamentable thathe should have blottedhis record by the delinquenciesof his later years." In particular, Burt's writings moved factor theorizing awayfrom the Spearman contention that g was invariant no matter what typesof tests were used - in other words, thatdifferent batteriesof tests should produce the same estimates of g. The idea of a numberof group factors thatmay beidentified from sets of teststhat had similar, althoughnot identical, content,- for example, verbal factors, numeracy factors, spatial factors and so on - was central in thedevelopmentof Burt's writings. Overlap among the tests within each factordid not suggest thatthe g values were identical. Again, Spearmandid not dismiss these approaches out of hand, but it is plain thathe alwaysfavored the two-factor model. THE PRACTITIONERS Although the distinction may besomewhatsimplistic, it is possibleto separate the theoreticians from the applied practitionersin these early yearsof factor analysis. Lovie (1983) discusses the essentially philosophical andexperimental drive underlyingSpearman's work, aswell asthat of Thomsonand Thurstone. Spearman's early paper (1904a) begins with an introduction that discusses the "signs of weakness"in experimental psychology;he avers that "Wundt's disciples havefailed to carry forward the work in all the positive spiritof their master,"and heannouncesa: "correlationalpsychology,"for the purposeof positively determiningall psychical tendencies,and in particular those which connect togetherthe so-called"mental tests"with psychical activitiesof greatergenerality and interest, (p. 205)

The work culminatedin a book that set out hisfundamental, if not yet watertight and complete, conceptual framework. Burt's work with its continuing premiseof theheritabilityof intelligencealso showsthe needto seek support for a particular constructionof mental life. On the other hand,a numberof contemporary researchers, whose contributions to factor analysis were also impressive, were fundamentally applied workers. Two, KelleyandThomson, werein fact Professorsof Education,and a third, Louis Thurstone,was agiant in the field of scaling techniquesand testing. Their contributions were marked by the task of developing testsof abilities and skills, measuresof individual differences. Truman L. Kelley (1884-1961),Professorof Educationand Psychologyat Stanford University,

THE PRACTITIONERS

165

published his Crossroads in the Mind of Man in 1928. In his preface he welcomes Spearman'swork, but in his first chapter, "Boundariesof Mental Life, " he makesit clear that his work indicates that several traits are included in Spearman'sg, and heemphasises these group factors throughout his book: Mental life doesnot operatein a plain but in anetworkof canals. Though each canal may have indefinite limits in length and depth, it doesnot in width; though each mentaltrait may grow andbecome moreandmore subtle,it doesnot lose its character and discretenessfrom other traits,(p. 23)

Lovie and Lovie (1995) reporton correspondence between Wilson and H. W. Holmes, Deanof the School of Educationat Harvard, after Wilsonhad read andreviewed (1929a) Kelley' s book "If Spearmanis ondangerous ground, Kelley is sitting on a volcano." Wilson's (1929a) reviewof Kelley again concentrates on the mathematics and again raises,in even more acute fashion, the problem of indeterminacy.It is not an unkind review,but it turns Kelley's statement (quoted above) on its head: Mental life in respect of its resolutioninto specific and general factorsdoes not operate in a network of canalsbut in acontinuouslydistributed hyperplane... no trait has anyseparateor discrete existence, for eachmay bereplacedby proper linear combinationsof the others proceeding by infinitesimal gradationsand it isonly the complex of all that has reality. Insteadof writing of crossroadsin the mind of man one shouldspeakof man's trackless mental forest or tundraor jungle - mathematically mind you, accordingto theanalysisoffered. The canalsor crossroads have been put in without any indication,so far as I cansee, that they have been put in anyother way than we put in roads acrossour midwestern plains, namely to meet the convenienceor to suit the fancy of the pioneers,(p. 164)

Wilson, a highly competent mathematician, had obviouslyhad a struggle with Kelley's mathematics.He concludes: The fact of the matter is that the authorafter an excellent introductory discussion of the elementsof thetheoryof the resolutioninto generaland specific factors... gives up the general problem entirely, throws up his handsso to speak,and proceedsto develop special methods of examininghis variablesto seewhere there seem to be specific bonds between them. (p. 160)

Kelley, perhapsa little bemused, perhaps a little rueful, and almost certainly grateful that Wilson's reviewhad not been more scathing, responds: "The mathematical toolis sharpand I may nick, or may already have nicked myself with it. At any rate I do still enjoy the fun of whittling and of fondling such clean-cut chipsas Wilson has letfall" (p. 172). Sir Godfrey Thomsonwas Professorof Education at the University of

166

11. FACTOR ANALYSIS

Edinburgh from 1925 to 1951. He had little or no training in psychology; indeed,his PhD was inphysics.After graduate studies in Strasbourg,he returned hometo Newcastle, England, where he wasobligedto takeup apostin education to fulfill requirements that were attached to the grants he hadreceived as a student.His work at Edinburghwas almost wholly concerned withthe development of mental testsand therefining of his approachto factor analysis. In 1939 his book TheFactorial Analysisof Human Ability was published,a book that establishedhis reputation and set him firmlyoutsidethe Spearmancamp. A later edition (1946) expanded his views and introducedto a wider audience the work of Louis Thurstone(1887-1955).Thomson's nameis not aswellknown as Spearman's,nor is hiswork much cited. Thisis probably because he did not develop a psychological theoryof intelligence, althoughhe did offer explanationsfor his findings. Hecan, however,be regardedasSpearman's main British rival, and heparts companywith him on three main grounds:first, that Spearman'sanalysishad not andcould not conclusively demonstrate the "existence"of g; second, that evenif the existenceof g was admitted, it was misleading and dangerousto reify it so that it becamea crucial psychological entity; and third, that Spearman's hierarchical model had been overtakenby multiple factor models that were at once more sophisticated and had more explanatory power. The Spearmanschool of experimenters, however, tend always to explain as much as possibleby onecentral factor.... Thereareinnumerable other ways of explaining these same correlations ... And the final decision between them has to bemade on some othergrounds.The decision may bepsychological.... Or thedecisionmay bemadeon theground that we should be parsimoniousin our inventionof "factors," andthat whereonegeneral and onegroupfactor will servewe shouldnot inventfive group factors..(Thomson, 1946, p. 14-15)

Thomsonwas notsaying that Spearman's view doesnot make sense but that it is not inevitably the only view. Spearmanhad agreed thatg was presentin all the factors of the mind, but in different amountsor "weights," and that in some factorsit was small enoughto beneglected. Thomson welcomed these views, "provided thatg is interpreted as a mathematical entity only,and judgementis suspendedas towhetherit is anything more thanthat" (p. 240). And his verdict on two-factor theory?After commenting thatthe method of two factors was an analytical devicefor indicating their presence, that advancesin method had been made thatled to multiple factor analysis,and further "It was Professor Thurstone of Chicagowho sawthatonesolutionto the problem couldbe reachedby ageneralizationof Spearman's ideaof zerotetrad differences."(p. 20).

THE PRACTITIONERS

167

Thomson himselfhadsuggesteda "sampling theory"to replaceSpearman's approach: The alternativetheory to explain the zero tetrad differencesis that eachtest calls upon a sampleof bonds whichthe mind can form, andthat someof thesebondsare common to two testsand causetheir correlation,(p. 45)

Thomsondid notgive more thana very general viewof whatthe bonds might be, although it seems that they were fundamentally neurophysiological and, from a psychological standpoint, akin to the connectionsin the connectionist views of learning first advancedby Thorndike: What the "bonds"of the mind are, we do notknow. But they are fairly certainly associatedwith the neuronesor nerve cellsof our brains...Thinking is accompanied by the excitationof theseneuronesin patterns.The simplestpatternsare instinctive, more complex onesacquired. Intelligenceis possibly associatedwith the number andcomplexity of thepatterns whichthe braincan (orcould) make. (Thomson, 1946, p. 51)

And, lastly, to complete this short commentary on Thomson's efforts,his demonstrationof geometrical methodsfor the illustration of factor theory contrasts markedly with both Spearman and Burt's work and ismuch morein tune with the "rotation" methodsof later contributors, notably Thurstone. In 1939, as part of a General Meetingof the British Psychological Society, Burt, Spearman, Thomson (1939a), and Stephenson each gave papers on the factorial analysisof humanability. These papers were published in the British Journal of Psychology, together with a summingup by Thomson (1939b). Thomson pointsout that neither Thurstone nor any of hisfollowers were present to defend their viewsand gives a very brief and favorable summaryof them. He even says, citing Professor Dirac at a meetingof the Royal Society of Edinburgh: When a mathematical physicist finds a mathematical solutionor theorem whichis particularly beautiful... he canhave considerable confidence that it will prove to correspondto something realin physical nature. Something of the samefaith seems to lie behind Prof.Thurstone'strust in the coincidenceof "Simple Structure"in the matrix of factor loadings with psychological significance in the factors thus defined. (Thomson,1939b,p. 105)

He also noted that Burt had "expressedthe hope thatthe 'tetraddifference'of the four symposiasts would be found to vanish!" (p. 108). Louis L. Thurstone produced his first work on "the multiple factor problem" in 1931. The Vectorsof the Mind appearedin 1935and what he himself defined

168

11. FACTOR ANALYSIS

as adevelopmentand expansionof this work, Multiple Factor Analysis,in 1947. Reprintsof the book werestill being published longafter his death.The following schememight give somefeel for Thurstone'sapproach.We looked earlier at the matrix of correlations that producea tetrad difference.Now consider:

This is a minor determinantof order three.Now a newtetrad can beformed:

If this new tetrad is zero then the determinant "vanishes." This procedure can becarried out with determinantsof any order and if we come to a stage whenall theminors"vanish"the "rank" of thecorrelation matrixwill be reduced accordingly. Thurstonefound that testscould be analyzed intoas many common factors as thereduced rankof the correlation matrix.The present account follows that of Thomson (1946, chap. 2), who gives a numerical example.The rank of the matrix of all the correlationsmay bereducedby inserting valuesin the diagonalof thematrix - thecommunalities.Thesevaluesmay bethought of asself-correlations,in whichcasetheyare all 1. Butthey may alsobe regarded asthat part of the variancein the variable thatis due to thecommon factors,in which casethey haveto be estimatedin some fashion. They might even be merely guessed.Thurstone chose values that made the tetrads zero. If these methods seemnot to be entirely satisfactory now, they were certainly not welcomed by psychologistsin the 1920s, 1930sand 1940s who were tryingto cometo grips with the newmathematical approaches of the factor theoristsand their views of human intelligence. Thurstone's centroid method bears some similarities to principal components analysis. The word centroid refersto a multivariate average,and Thurstone's "first centroid"may bethoughtof as an averageof all thetestsin a battery. Centralto theoutcomeof the analysisis the production of'factor loadings - values that express the relationshipof thetests to the presumed underlying factors.The ways in which these loadingsare arrived at aremany and various,and Thurstone's centroid technique was just one early approach.However, it was anapproach thatwas relatively easyto apply and (with some mathematics) relatively easy to understand.How the factors wereinterpretedpsychologically was,and is, largely a matter for the researcher.Thus tests that were related that involved arithmeticor numerical

THE PRACTITIONERS

169

reasoningor number relationships might be labeled numerical,or numeracy. Thurstone maintained that complete interpretation of factors involvedthe techniqueof rotation. Loadingsin a principal factors matrix reflect common factor variances in test scores.The matrices are arbitrary for they can be manipulatedto show different axes - different "frames of reference." We can only make scientific senseof the outcomes whenthe axes are rotated to pass through loadings that show an assumed factor that represents a "psychological reality." Such rotations,which may produce orthogonal (uncorrelated) axes or oblique (correlated)axes,were first carriedout by amixture of mathematics, intuition, and inspiration and have becomea standard partof modern factor analysis. It is perhaps worth noting that Thurstone's early training was in engineeringand that he taught geometryfor a time at theUniversity of Minnesota. The configurational interpretations are evidently distastefulto Burt, for he doesnot havea singlediagramin his text. Perhapsthis is indicative of individual differences in imagery types which leadsto differencesin methods and interpretation among scientists. (Thurstone, 1947, p. ix)

Thurstone'swork eventually produceda set ofseven primary mental abilities, verbal comprehension, word fluency,numerical, spatial, memory, perceptual speed,and reasoning. Thiskind of scheme became widely popular among psychologistsin the United States, although many were disconcerted by the fact that the numberof factors tendedto grow. J. P.Guilford (1897-1987)devised a theoretical model that postulated 120 factors (1967),and by 1971 the claim was made that almost100 of them had been identified. An ongoing problemfor the early researcherswas theevident subjective element in all the methods. Mathematicians stepped into the picture (Wilson was one of theearly ones),often at therequestof the psychologistsor educationists who were not generally mathematical sophisticates (Thomson, not trained as apsychologist,was anexception),but they were remarkably quick learners. The search for analytical methodsfor assessing "simple structure" began. Carroll (1953)was amongthe earliestwho tackledthe problem: A criticism of current practicein multiple factor analysisis that the transformation of the initial factor matrix Fto arotated "simple stucture" matrix ^must apparently be accompaniedby methods which allow considerable scopefor subjectivejudgement, (p. 23)

Carroll's papergoes on to describea mathematical method that avoids subjectivedecisions. Modern factor analysis had begun. From then on agreat variety of methodswasdeveloped,and it is noaccident that such solutions were developed coincidentallywith the rise of the use of thehigh-speed digital

170

11. FACTOR ANALYSIS

computer,which removed the computational labor fromthe more complex procedures. Harman (1913-1976), building on earlier work (Holzinger [1892-1954]& Harman, 1941), published Modern Factor Analysis in 1960, with a secondand athird (1976) edition,and thebook is a classicin the field. In researchon thestructureof human personality,the giants in the areawere Raymond Cattell(1905-1998)and Hans Eysenck(1916-1997).Cattell had beena studentof Cyril Burt, and in hisearly work had introduced the concept of fluid intelligence thatis akin to g, abroad, biologicallybasedconstruct of general mental abilityandcrystallized intelligence that depends on learningand environmental experience. Later, Cattell's work concentrated on theidentification andmeasurementof the factorsof humanpersonality (1965a),and his16PF (16 personality factors) testis widely used. His Handbook of Multivariate Experimental Psychology (1965b) is a compendiumof the multivariate approach witha numberof eminent contributors. Cattell himself contributed 6 of the 27chaptersand co-authored another. Cattell's research output, over a very long career,was massive,as wasthat of Eysenck.The latter's method, which he termed criterion analysis,was a use of factor analysisthatattemptedto confirm the existenceof previously hypothesized factors. Three dimensions were eventually identified by Eysenckand his work helpedto set off one ofpsychology's ongoing arguments about just how many personality factors there really were. This is not theplaceto review this debate,but it does,once again, place factor analysis at thecenterof sometimes quite heated discussion about the "reality" of factors and theextent to which they reflect the theoreticalpreconceptionsof the investigators."What comes out is no more than whatgoesin," is the cry of thecritics. Whatis absolutely clear from this situation is that it is necessaryto validate the factors by experimental investigation that stands aside from the methods used to identify the factors and avoid capitalizingon semantic similarities among the tests that were originally employed.It has to beacknowledged,of course, that both Cattell and Eysenck (1947; Eysenck & Eysenck, 1985)and their many collaborators have triedto dojust that. Recent years have seen an increasein the use offactor-analytictechniques in a variety of disciplinesand settings, even apparently in the analysisof the performanceof racehorses!In psychology,the sometimes heated discussion of the dangersof the reification of factors, their treatment aspsychological entities, as well as arguments over,for example, justhow many personality dimensions are necessaryand/or sufficient to define variationsin human temperament, continue. At bottom, a large partof these problems concerns the use of the technique eitheras aconfirmatory toolfor theory or as asearchfor new structure in our data.Unless thesetwo quitedifferent viewsare recognizedat thestart of discussionthe debatewill take on anexasperatingfutility that stiflesall progress.

12 The Design

of Experiments

THE PROBLEM OF CONTROL When Ronald Fisher accepted the post of statisticianat Rothamsted Experimental Station in 1919, the tasksthat facedhim were to make whathe could of a large quantityof existing datafrom ongoing long-term agricultural studies (one had begunin 1843!) and to try toimprovethe effectivenessof future field trials. Fisher later describedthe first of these tasksas "raking over the muck heap"; the secondhe approached withgreatvigor and enthusiasm, layingas he did so the foundationsof modern experimental design and statistical analysis. The essential problemis the problem of control. For the chemist in the laboratory it is relatively easyto standardizeand manipulatethe conditionsof a specific chemical reaction. Social scientists, biologists, and agriculturalresearchers have to contend withthe fact that their experimental material (people, animals, plants)is subjectto irregular variation that arises as aresultof complex interactions of genetic factorsand environmental conditions.These many variations, unknownand uncertain, makeit very difficult to be confident that observeddifferencesin experimental observations are due to the manipulations of the experimenter rather than to chance variation.The challenge of the psychological sciences is thesensitivity of behaviorand experienceto amultiplicity of factors. But in many respectsthe challengehas notbeen answered becausethe unexplained variationin our observationsis generallyregardedas a nuisanceor asirrelevant. It is useful to distinguish between experimental control and the controlled experiment.The former is the behaviorist's ideal,the state where some consistent behavior can be set offand/or terminatedby manipulating precisely specified variables. On the other hand,the controlled experiment describes a procedurein which the effect of the manipulationof the independent variable or variables is, as it were, checked against observations undertaken in the 171

172

12. THE DESIGN OF EXPERIMENTS

absenceof the manipulation. Thisis themethod thatis employedand supported by the followersof the Fisherian tradition.The uncontrolled variables that affect the observationsareassumedto operatein a random fashion, changing individual behaviorin all kinds of ways sothat whenthe dataareaveraged their effects arecanceledout, allowingthe effect of themanipulated variableto beseen.The assumption of randomnessin the influence of uncontrolled variablesis, of course,not onethat is always easyto justify, and therelegation of important influenceson variability to error may leadto erroneous inferences anddisastrous conclusions. Malaria is a disease thathas been knownand feared for centuries. It decimatedthe Roman Empirein its final years,it wasquite widespreadin Britain during the 17th century,and indeedit was still found there in the fencountryin the 19th century. The names malaria, marsh fever, andpaludism all reflect the view that the causeof the diseasewas thebreathingof damp, noxiousair in swamp lands.The relationship between swamp lands and the incidence of malariais quite clear. The relationship between swamp lands and thepresence of mosquitosis also clear. But it was notuntil the turn of the century thatit was realized thatthe mosquitowas responsiblefor the transmissionof the malarial parasite,and only 20 years earlier,in 1880, was theparasite actually observed. In 1879, Sir Patrick Manson(1844-1922),a physician whose work played a role in the discovery of the malarial cycle, presenteda paper in which he suggestedthat the diseaseelephantiasiswas transmitted throughinsect bites. The paperwasreceived with scornanddisbelief. The evidencefor the life cycle of the malarial parasitein mosquitosandhuman beingsand itsbeing established as the causeof the illness camein a numberof ways - not theleast of which was thehealthy survival, throughout the malarial season,of three of Manson's assistantsliving in a mosquito-proofhut in themiddleof the Roman Campagna (Guthrie, 1946,pp. 357-358).This episodeis an interesting exampleof the control of a concomitantor correlated biasor effect that was thedirect causeof the observations. In psychological studies, some of the earliest work that used true experimental designsis that of Thorndike and Woodworth (1901)on transferof training. They used "before-after" designs, control group designs, and correlational studiesin their work. However,the nowroutine inclusion of control groupsin experimental investigations in psychology doesnot appearto have beenan acceptednecessityuntil about50 years ago.In fact, controlled experimentation in psychology moreor less coincidedwith the introductionof Fisherian statistics, and the twoquite quickly became inseparable. Of courseit would be both foolish and wrong to imply that early empirical investigations in psychology were completely lackingin rigor, and mistaken conclusions rife. The point is that it was not until the 1920s and 1930s that the "rules" of controlled

METHODS OF INQUIRY

173

experimentation were spelled out andappreciatedin the psychological sciences. The basic ruleshad been in existencefor many decades, having been codified by John StuartMill (1806-1873)in a book, first publishedin 1843, thatis usually referredto as theLogic (1843/1872/1973). These formulationshadbeen precededby an earlier British philosopher, Francis Bacon, who made recommendationsfor what he thought wouldbe sound inductive procedures. METHODS OF INQUIRY Mill proposedfour basic methodsof experimentalinquiry, and the fiveCanons, the first of which is, "If two or more instancesof the phenomenon under investigation have only one circumstancein common,the circumstancein which alone all the instances agree,is the cause (or effect) of the given phenomenon"(Mill, 1843/1872,8th ed., p. 390). If observationsa, b, and c aremade in circumstancesA, B, and C, and observationsa, d, and e incircumstances A, D, and £,then it may beconcluded that A causes a. Mill commented,"As this method proceedsby comparing different instancesto ascertainin what they agree,I have termedit the Method of Agreement"(p. 390). As Mill points out,the difficulty with this methodis the impossibility of ensuringthat A is the only antecedentof a that is commonto both instances. The second canonis the Methodof Difference. The antecedent circumstances A, B, and C arefollowed by a, b, and c.When A is absentonly b and c are observed: If an instancein which the phenomenon under investigation occurs, and aninstance in which it doesnot occur, have every circumstance in common saveone, thatone occurring only in the former; the circumstancein which alone the two instances differ, is the effect or the cause, or an indispensable partof the cause of the phenomenon,(p. 391)

This method containsthe difficulty in practiceof being unableto guarantee that it is the crucial difference that has beenfound. As part of the wayaround this difficulty, Mill introducesa joint methodin his third canon: If two or more instancesin which the phenomenon occurs have only one circumstancein common, whiletwo or more instancesin which it doesnot occur have nothing in common savethe absenceof that circumstance; the circumstancein which alonethe two setsof instancesdiffer, is the effect, or the cause,or an indispensable part of the cause,of the phenomenon,(p. 396)

In 1881 Louis Pasteur (1822-1895) conducted a famous experiment that exemplifies the methodsof agreementand difference. Some30 farm animals

174

12. THE DESIGN OF EXPERIMENTS

were injected by Pasteur witha weak cultureof anthrax virus. Later these animalsand asimilar numberof others thathad notbeenso "vaccinated"were given a fatal dose of anthrax virus. Withina few days the non-vaccinated animals were dead or dying, the vaccinated ones healthy. The conclusion that was enthusiastically drawnwas that Pasteur's vaccination procedure had produced the immunity thatwas seenin the healthy animals.The effectivenessof vaccinationis now regardedas anestablished fact.But it is necessaryto guard against incautious logic. The healthof the vaccinated animals could have been due to some other fortuitous circumstance. Because it is known that some animals infected with anthrax do recover,an experimental group composed of theseresistantanimals could have resulted in a spurious conclusion.It should be noted that Pasteur himself recognized as thisapossibility. Mill's Method of Residues proclaims that having identified by the methods of agreementand differences that certain observed phenomena are theeffects of certain antecedent conditions, the phenomena that remain are due to the circumstancesthat remain. "Subductfrom any phenomenon such part as is known by previous inductions to be theeffect of certain antecedents, and the residue of the phenomenonis the effect of the remaining antecedents" (1843/1872,8th ed., p. 398). Mill here uses a very modern argument for the use of themethod in providing evidencefor the debateon racial and gender differences: Thosewho assert,what no one hasshown any real groundfor believing, that there is in onehuman individual,onesex, or oneraceof mankindover another,aninherent and inexplicable superiorityin mental faculties, could only substantiate their proposition by subtracting fromthe differencesof intellect which we in fact see,all that can be traced by known laws either to the ascertaineddifferences of physical organization,or to thedifferences which have existed in the outward circumstances in which the subjectsof thecomparison have hitherto been placed. What thesecauses might fail to accountfor, would constitutea residual phenomenon, which and which alone wouldbe evidence of an ulterior original distinction,and themeasureof its amount. But the assertorsof such supposed differences have not provided themselveswith thesenecessarylogical conditionsin the establishmentof their doctrine, (p. 429)

The final method and the fifthcanon is the Method of Concomitant Variations: "Whatever phenomenon varies in any manner whenever another phenomenon variesin some particular manner, is eithera causeor an effect of that phenomenon,or is connected withit through some factof causation"(p. 401). This methodis essentially thatof the correlational study, the observationof covariation: Let us supposethe questionto be,what influencethe moon exertson thesurfaceof

METHODS OF INQUIRY

175

the earth. We cannottry an experimentin the absenceof the moon, so as toobserve what terrestrial phenomenon her annihilation would put an end to; but whenwe find that all the variations in the positionsof the moon are followed by corresponding variationsin the time and placeof high water,the place always being either the part of the earth which is nearestto, or that whichis most remote from,the moon, we have ample evidence that the moonis, wholly or partially, thecause which determines the tides, (p. 400)

Mill maintained that these methods were in fact the rulesfor inductive logic, that they were both methods of discoveryandmethodsof proof. His critics, then and now, argued against a logic of induction (see chapter2), but it isclear that experimentalistswill agreewith Mill that his methods constitute the meansby which they gather experimental evidence for their views of nature. The general structureof all the experimental designs that are employedin the psychological sciences may beseenin Mill's methods.The applicationand withholding of experimental treatments acrossgroups reflectthe methodsof agreementand differences. The useof placebosand thesystematic attempts to eliminate sourcesof error makeup the methodof residues,and themethodof concomitant variationis, asalready noted,a complete descriptionof the correlational study. It is worth mentioning thatMill attemptedto deal with the difficulties presentedby the correlationalstudy and, in doing so, outlinedthe basics of multiple regressionanalysis,the mathematicsof which werenot to come for many years: Suppose, then, that when A changesin quantity, a also changesin quantity,and in such a manner thatwe cantracethe numericalrelationwhich the changesof the one bear to such changes of the otheras take placewithin the limits of our observation. We maythen safely conclude that the samenumericalrelationwill hold beyondthose limits. (Mill, 1843/1872,8th ed., p. 403)

Mill elaborateson this propositionand goeson to discussthe casewherea is not wholly the effect of A but nevertheless varies with it: It is probably a mathematicalfunction not of A alone,but of A andsomething else: its changes,for example,may besuchas would occurif part of it remained constant, or varied on some other principle, and the remainder variedin some numerical relation to the variationsof A. (p. 403)

Mill's Logic is his principal work, and it may befairly castas thebook that first describes botha justification for and the methodologyof, the social sciences.Throughouthis works, the influence of the philosopherswho were, in fact, the early social scientists,is evident. David Hartley (1705-1757) publishedhis Observationson Man in 1749. This bookis a psychology rather

176

12. THE DESIGN OF EXPERIMENTS

than a philosophy (Hartley, 1749/1966).It systematically describes associationism in a psychological contextand is the firsttext that deals with physiological psychology. James Mill (1773-1836),John Stuart's father, much admired Hartleyand hismajor work becameone of themain source books that Mill the elder introducedto his sonwhenhe startedhis formal educationat the age of 3(when he learned Ancient Greek), although he did not get toformal logic until he was 12.Another important earlyinfluence was Jeremy Bentham (1748-1832),a reformer who preachedthe doctrine of Utilitarianism, the essential featureof which is the notion that the several and joint effects of pleasure andpain governall our thoughtsandactions. LaterMill rejected strict Benthamismandquestionedthe work of a famousandinfluential contemporary, AugusteComte (1798-1857),whose work marksthe foundation of positivism and of sociology. The influenceof the ideasof thesethinkerson early experimental psychologyis strong and clear,but they will not beexplored here.The main point to be madeis that John StuartMill was anexperimental psychologist's philosopher. More,he was themethodologist's philosopher. In a letter to a friend, he said: If there is any sciencewhich I am capableof promoting, I think it is the science of scienceitself, the scienceof investigation- of method. I once heard Maurice say ... that almostall differencesof opinion when analysed, weredifferencesof method, (quoted by Robson in his textual introductionto the Logic, p. xlix)

And it is clear that all subsequent accounts of method and experimental designcan betraced backto Mill.

THECONCEPTOF STATISTICAL CONTROL The standard designfor agricultural experiments at Rothamstedin the days before Fisher was to divide a field into a numberof plots. Each plot would receivea different treatment, say,a different manureor fertilizer or manure/fertilizer mixture. The plot that producedthe highest yield wouldbe taken to be the best, and thecorresponding treatment considered to be themost effective. Fisher, and others,realized that soilfertility is by no meansuniform acrossa large field and that this,as well as other factors,can affect the yields. In fact, the differencesin the yields could be due tomany factors other thanthe particular treatments and the highest yield might be due to some chance combinationof these factors.The essential problem is to estimatethe magnitude of these chance factors - the errors - to eliminate, for example,the differencesin soil fertility. Someof the first data thatFisher saw atRothamsted werethe recordsof

THE CONCEPT OF STATISTICAL CONTROL

177

daily rainfall and yearly yields from plots in the famous Broadbalk wheat field. Fertilizers had been appliedto these plots, using the same pattern, since 1852. Fisher used the methodof orthogonal polynomialsto obtain fits of the yields over time. In his paper (1921b) on these data published in 1921 he describes analysis of variance (ANOVA) for the first time. When the variation of any quantity (variate)is producedby theaction of two or more independentcauses,it is known that the variance produced by all the causes simultaneouslyin operation is the sum of thevaluesof the variance producedby eachcauseseparately... In Table II is shown the analysisof the total variancefor each plot, divided according as it may beascribed(i) to annual causes, (ii)to slow changesother than deterioration, (iii) to deterioration;the sixth columnshowsthe probability of larger valuesfor the variancedue toslow changes occurring fortuitously. (Fisher,192 Ib, pp. 110-111)

The method of data analysis thatFisher employedwas ingeniousand painstaking,but he realizedquickly that the data that were available suffered from deficienciesin the designof their collection. Fisherset out on a newseries of field trials. He divided a field into blocksand subdividedeach block into plots. Each plot within the block was given a different treatment,and each treatment was assignedto each plot randomly. This, as Bartlett (1965) putsit, was Fisher's "vital principle." When statisticaldataarecollected asnatural observations,the most sensible assumptions aboutthe relevant statistical model have to be inserted. In controlled experimentation, however, randomness could be introduced deliberately into the design, so that any systematic variability other than [that] due toimposed treatments could be eliminated. The second principle Fisher introduced naturally went with the first. With statistical analysisgearedto the design, all variability not ascribedto theinfluence of treatmentsdid not have to inflate the random error. With equal numbersof replicationsfor the treatments each replication could be containedin a distinct block, and only variability among plotsin the same block werea source of error - that between blocks could be removed. (Bartlett, 1965, p. 405)

The statistical analysis allowed for an even more radical break with traditional experimental methods: No aphorism is more frequently repeatedin connectionwith field trials, than thatwe must ask Nature few questions,or, ideally, one question,at a time. The writer is convinced that this viewis wholly mistaken. Nature, he suggests,will best respond to a logical and carefully thoughtout questionnaire; indeed, if we ask her asingle question, shewill often refuseto answeruntil some other topichasbeen discussed. (Fisher, 1926b,p. 511)

178

12. THE DESIGN OF EXPERIMENTS

Fisher's"carefully thoughtout questionnaire"was thefactorial design. All possible combinationsof treatments wouldbe applied with replications. For example,in the applicationof nitrogen (N), phosphate (P), andpotash(K) there would be eight possible treatment combinations: no fertilizer, N, P, K, N & P, N & K, P & K, and N & P & K. Separate compact blocks would be laid out and these combinations would be randomly appliedto plots withineachblock. This design allowsfor an estimation of the main effects of the basicfertilizers, the first-order interactions (theeffect of two fertilizers in combination), and the second-orderinteraction (theeffect of the three fertilizersin combination).The 1926(b)papersetsout Fisher'srationale for field experimentsand was,as he noted, the precursorof his book, The Design of Experiments (1935/1966), published9 years later.The paperis illustratedwith a diagram(Fig.12.1)of a "complex experiment with winteroats" that had been carriedout with a colleagueat Rothamsted (Eden & Fisher, 1927). Here 12 treatments,including absenceof treatments- the"control" plots were tested.

FIG. 12.1 Fisher's Design 1926. Journal of the Ministry of Agriculture

THE CONCEPT OF STATISTICAL CONTROL

179

Any general difference between sulphate and chloride, betweenearly and late application, or ascribable to quantity of nitrogenous manure,can bebasedon thirty-two comparisons,each of which is affected by such soil heterogeneityas existsbetweenplots in the sameblock. To make thesethreesetsof comparisons only, with the sameaccuracy,by single question methods, would require 224 plots, againstour 96; but inaddition many other comparisons can bemade with equal accuracy,for all combinationsof the factors concernedhave been explored. Most important of all, the conclusions drawn from the single-factor comparisonswill be given, by the variation of non-essential conditions, a very much wider inductive basis than could be obtained, by single question methods, without extensive repetitions of the experiment. (Fisher, 1926b, p. 512)

The algebraand thearithmeticof the analysisaredealt within thefollowing chapter.The crucial pointof this work is the combinationof statistical analysis with experimental design. Part of the stimulus for this paperwas Sir John Russell's (1926) article on field experiments,which had appearedin the same journaljust months earlier. Russell's review presents the orthodox approachto field trials and advocatedcarefully planned, systematic layouts of the experimental plots.Sir JohnRussellwas theDirectorof the Rothamsted Experimental Station, he hadhired Fisher,and he wasFisher's boss,but Fisher dismissedhis methodology.In a footnotein the 1926(b) paper, Fisher says: This principlewas employed in an experimenton theinfluence of the weatheron the effectivenessof phophatesand nitrogen alludedto by SirJohn Russell.The author must disclaimall responsibilityfor the designof this experiment, whichis, however, a good example of its class. (Fisher, 1926b,p. 506)

And as Fisher Box (1978) remarks: It is a measure of the climate of the times that Russell,an experiencedresearch scientist who . . . had had the wisdom to appoint Fisher statistician for the better analysis of the Rothamsted experiments, did not defer to theviews of his statistician when he wrote on how experiments were made. Design was, in effect, regardedas an empirical exerciseattemptedby the experimenter;it was not yet thedomain of statisticians, (p. 153)

In fact the statistical analysis, in a sense, arises from the design. Nowadays, when ANOVA is regardedasefficient and routine,the various designs that are availableandwidely usedaredictatedto us by theknowledge thatthe reporting of statistical outcomes andtheir related levels of significance is thesine qua non of scientific respectabilityandacceptabilityby the psychological establishment. Historically, the newmethodsof analysis camefirst. The confounds, defects, and confusionsof traditionaldesigns became apparent when ANOVA was used to examinethe data and so newdesigns were undertaken.

180

12. THE DESIGN OF EXPERIMENTS

Randomizationwas demandedby the logic of statistical inference. Estimatesof error and valid testsof statisticalsignificancecan only be made when the assumptions that underlie the theory of sampling distributionsare upheld. Put crudely, this means that "blind chance" should not be restricted in the assignmentof treatmentsto plots, or experimental groups. It is, however, important to note that randomization does not imply that no restrictions or structuringof the arrangementswithin a designare possible. Figure 12.2 showstwo systematic designs: (a) ablock designand (b) aLatin squaredesign,and two randomized designs of the same type,(c) and(d). The essentialdifference is that chance determines the applicationof the various treatments applied to theplots in the latter arrangements, but therestrictionsare apparent.In the randomizedblock and in therandomizedLatin square, each block containsone replicationof all thetreatments. The estimateof error is valid, because,if we imaginea large numberof different results obtainedby different random arrangements, the ratio of the real to the estimated error, calculated afreshfor each of these arrangements, will be actually distributed in the theoreticaldistribution by which the significanceof the result is tested. Whereasif a groupof arrangementsis chosen such that the real errorsin this group are on thewhole less than those appropriate to random arrangements, it has now beendemonstrated that the errors,asestimated,will, in such a group,be higher than is usualin random arrangements, andthat,in consequence, within sucha group, the test of significance is vitiated. (Fisher, 1926b, p.507)

Block 1 Block 2 Blocks Block 4 Blocks

Block 1 Block 2 Block3 Block 4 Blocks

Treatments 2 3 4 5 B C D E B C D E B C D E B C D E B C D E (a) Treatments 1 2 3 4 5 D C E A B A D B C E B A E C D E D C A B B A D E C (c)

A Standard Latin Square

1 A A A A A

A B C D

B A D C

C D B A

D C A B

(b) A Random Latin Square

FIG. 12.2Experimental Designs

D C B A

A B D C

C D A B

(d)

B A C D

THE LINEAR MODEL

181

Fisherlater examinesthe utility of the Latin square design, pointing out that it is by far themost efficient andeconomicalfor "thosesimple typesof manurial trial in which every possible comparison is of equal importance"(p. 510). In 1925 andearly 1926, Fisher enumerated the 5 x 5 and 6x6squares,and in the 1926 paperhe madean offer that undoubtedly helped to spreadthe name,and the fame,of the Rothamsted Station to many partsof the world: The Statistical Laboratoryat Rothamstedis preparedto supply these,or other typesof randomized arrangements, to intending experimenters; this procedure is consideredthe more desirable sinceit is only too probable thatnew principleswill, at their inception,be, insome detail or other, misunderstood andmisapplied;a consequence for which their originator,who has made himself responsible for explaining them, cannot be held entirely freefrom blame. (Fisher, 1926b,pp. 510-511)

THE LINEAR MODEL Fisher described ANOVAas a way of"arrangingthe arithmetic" (Fisher Box, 1978, p. 109), an interpretationwith which not a fewstudents would quarrel. However,the description does point to thefact thatthe componentsof variance are additive and that this propertyis an arithmeticalone and notpart of the calculusof probability and statisticalinferenceas such. The basic construct that marks the culmination of Fisher's workis that of specifying valuesof an unknown dependent variable, >>, in termsof a linear set of parameters,eachone of which weightsthe several independent variables jc,, jc2, # 3 , . . . , jt n, that are usedfor prediction, togetherwith an error component8 that accountsfor the randomfluctuations in y for particularfixed valuesof A:,, x2, J C3 , . . . , xn. In algebraic terms,

As we noted earlier,the random component in the modeland thefact thatit is sample-based make it a probabilistic model,and the properties of the distribution of this component, real or assumed, govern the inferences thatmay be made aboutthe unknown dependent variable. Fisher's work is the crucial link betweenclassical least squares analysis and regression analysis. As Seal (1967) notes, "The linear regression model owes so muchto Gauss that we believe it should bearhis name"(p. 1). However, thereis little reasonto suppose that this will happen. Twentyyears ago Seal found that veryfew of the standard textson regression,or the linear model, or ANOVA made more thana passing reference to Gauss, and the situation is little changed today. Some of the reasonsfor this have alreadybeen

182

12. THE DESIGN OF EXPERIMENTS

mentionedin this book.The European statisticians of the 18thand 19th centuries were concerned with vital statistics and political arithmetic,and inferenceand prediction in the modern sense were, generally speaking, a long way off in these fields. The mathematicsof Legendreand Gaussand otherson the theory of errors did not impingeon thework of the statisticians. Perhaps more strikingly, the early links between social and vital dataand error theory that were made by Laplace and Quetelet were largely ignored by Karl Pearsonand Ronald Fisher. Why, then, could not theTheory of Errors be absorbed intothe broaderconceptof statisticaltheory ... ? ... Theoriginal reasonwasPearson'spreoccupation withthe multivariate normal distributionand itsparameters.The predictive regressionequation of his pathbreaking 'regression'paper (1896) was notseento be identical in form and solution to Gauss'sTheoria Motus (1809) model.R. A. Fisher and his associates... wererediscovering manyof themathematical results of leastsquares(or error) theory,apparentlyagreeingwith Pearsonthat this theory held little interestto the statistician. (Seal, 1967, p. 2)

There mightbe other more mundane, nonmathematicalreasons. Galton and others were strongly opposed to the use of theword error in describing the variability in human characteristics, and themany treatiseson thetheory might thus have been avoided by the newsocial scientists, who were, in the main,not mathematicians. In his 1920 paperon the history of correlation, Pearsonis clearly most anxious to downplay any suggestion that Gaussian theory contributed to its development.He writes of the "innumerabletreatises" (p. 27) onleast squares, of the lengthy analysis,of his opinion that Gaussand Bravais "contributed nothingof real importanceto theproblemof correlation"(p. 82), and ofhis view that it is not clear thata least squares generalization "indicates the real line of future advance"(p. 45). The generalizationhad been introduced by Yule, who Pearsonand hisGower Street colleagues clearly saw as theenemy. Pearson regardedhimself as the father of correlation and regression insofaras the mathematics were concerned. Galton and Weldon were,of course, recognized as important figures,but they werenot mathemaliciansand posedno threatto Pearson'sauthorily. In other respects,Pearsonwas driven to try to show that his contributions were supreme andindependent. The historical recordhasbeen tracedby Seal (1967),from the fundamental work of LegendreandGaussat thebeginningof the 19th centuryto Fisher over 100 years later. THE DESIGN OF EXPERIMENTS A lady declaresthat by tasting a cup of teamade with milk she can discriminate whether the milk or the teainfusionwas firstaddedto thecup. We will considerthe

THE DESIGN OF EXPERIMENTS

183

problem of designingan experimentby meansof which thisassertioncan betested. (Fisher, 1935/1966,8th ed., p. 11)

With thesewords Fisher introduces the example that illustrated his view of the principles of experimentation. Holschuh (1980) describes it as "the somewhat artificial 'lady tasting tea' experiment" (p. 35), and indeedit is, but perhapsan American writer doesnot appreciatethe fervor of the discussionon the best methodof preparing cupsof teathat still occupiesthe British! FisherBox (1978) reports thatan informal experimentwascarriedout atRothamsted.A colleague, Dr B. Muriel Bristol, declineda cup of teafrom Fisheron thegrounds thatshe preferredone towhich milk had firstbeen added.Her insistence thatthe order in which milk and teawere poured intothe cup made a differenceled to a lightheartedtest actually being carried out. Fisher examinesthe design of such an experiment. Eight cups of tea are prepared. Fourof them havetea addedfirst and four milk. The subjectis told that thishasbeen doneand thecupsof tea arepresentedin a randomorder. The task is, of course,to divide the set ofeight intotwo setsof four accordingto the methodof preparation. Because there are 70waysof choosinga set of 4objects from 8: A subject withoutany faculty of discrimination wouldin fact divide the 8 cups correctly intotwo setsof 4 in onetrial out of 70, or,more properly, witha frequency which would approach1 in 70 more and more nearlythe more oftenthe test were repeated.. . . The odds could be made much higherby enlargingthe experiment, while if the experiment were much smaller even the greatest possiblesuccesswould give oddsso low that the result, might with considerable probability, be ascribedto chance. (Fisher, 1935/1966,8th ed., pp. 12-13)

Fishergoeson to saythat it is "usualand convenientto take 5 per cent,as a standard levelof significance,"(p. 13) and so anevent that would occurby chance oncein 70 trials is decidedlysignificant. The crucial pointfor Fisheris the act of randomization: Apart, therefore, fromthe avoidable errorof the experimenter himself introducing with his test treatments,or subsequently, other differences in treatment,the effects of which the experiment is not intendedto study, it may be said that the simple precaution of randomisation will suffice to guaranteethe validity of the test of significance, by which the result of the experiment is to be judged. (Fisher, 1935/1966,8th ed., p. 21)

This is indeedthe crucial requirement. Experimental design when variable measurementsare being made,and statistical methodsare to beusedto tease out the information from the error, demands randomization. But there are

184

12. THE DESIGN OF EXPERIMENTS

ironies herein this, Fisher's elegant account of the lady tastingtea.It has been hailed as themodel for the statisticalinferential approach: It demandsof the readerthe ability to follow a closely reasonedargument,but it will repay the effort by giving a vivid understandingof the richness, complexityand subtlety of modern experimental method. (Newman, 1956, Vol. 3, p.1458)

In fact, it usesa situation and amethod that Fisher repudiated elsewhere. More discussionof Fisher's objections to the Neyman-Pearson approach are given later. For themoment,it might be noted that Fisher's misgivings center on theequationof hypothesis testing with industrial quality control acceptance procedures where the population being sampled has anobjective reality,and that populationis repeatedly sampled. However, thetea-tasting example appears to follow this model! Kempthorne (1983) highlighted problemsand indicated the difficulties that so many havehad in understandingFisher's pronouncements. In his book Statistical Methodsand Scientific Inference,first publishedin 1956, Fisher devotes a whole chapterto "Some Misapprehensions about Tests of Significance." Herehe castigatesthe notion that 'thelevel of significance' should be determined by "repeatedsamplingfrom the same population", evidently with no clear realization that the population in questionis hypothetical" (Fisher,1956/1973,3rd ed.,pp. 81-82). He determinesto illustrate "the more generaleffects of the confusion betweenthe level of significanceappropriately assigned to a specific test, with the frequencyof occurrenceof a specifiedtypeof decision" (Fisher, 1956/1973, 3rd ed., p. 82). He states,"In fact, as amatterof principle,the infrequencywith which, in particular circumstances, decisive evidence is obtained, shouldnot be confusedwith the force, or cogencyof such evidence" (Fisher, 1956/1973,3rd ed., p. 96). Kempthorne(1983), whose perceptions of both Fisher's genius and inconsistenciesare ascogentand illuminating as onewould find anywhere, wonders if this book's lack of recognitionof randomizationarose because of Fisher's belated, but of course not admitted, recognition thatit did not mesh with "fiduciating." Kempthorne quotes a "curious" statementof Fisher's and comments, "Well,well!" A slightly expanded version is given here: Whereasin the "Theory of Games"a deliberately randomized decision (1934) may often be usefulto give an unpredictable element to thestrategy of play; and whereas plannedrandomization(1935-1966)iswidely recognizedasessentialin theselection and allocation of experimental material,it has nouseful part to play in the formation of opinion, and consequentlyin testsof significance designedto aid theformation of opinion in the Natural Sciences.(Fisher, 1956/1973,3rd ed.,p. 102)

THE DESIGN OF EXPERIMENTS

185

Kendall (1963) wishes that Fisher had never writtenthe book, saying,"If we had tosacrifice any of hiswritings, [this book] would havea strong claim to priority" (p. 6). However he did write the book, and heusedit to attack his opponents. In marshaling his arguments,he introduced inconsistencies of both logic and method that haveled to confusion in lesser mortals.Karl Pearsonand the biometricians used exactly the same tactics.In chapter15 theview is presented that it is this rather sorry state of affairs that has led to thehistorical development of statistical procedures,as they are used in psychologyand the behavioral sciences, being ignored by the texts that made them available to a wider, and undoubtedlyeager,audience.

13 AssessingDifferences and Having Confidence

FISHERIAN STATISTICS Any assessmentof the impactof Fisher's arrivalon thestatistical battlefieldhas to recognizethat his forcesdid not really seekto destroy totally Pearson's work or its raison d'etre. The controversy betweenYule and Pearson, discussed earlier, had aphilosophical,not to sayideological, basis.If, at theheightof the conflict, one or other sidehad "won," then it is likely that the techniques advocatedby thevanquishedwould have been discarded andforgotten. Fisher's war wasmore territorial. The empireof observationand correlationhad to be taken over by the manipulationsof experimenters. Although he would never have openly admitted it - indeed,hecontinuedto attack Pearson and hisworks to the very end of hislife (which came26 yearsafter the end ofPearson's)the paradigmsandprocedureshe developeddid indeed incorporate andimprove on the techniques developed at Gower Street.The chi-square controversy was not a dispute aboutthe utility of the test or its essential rationale, but a bitter disagreement over the efficiency and methodof its application. For anumber of reasons,which havebeendiscussed,Fisher's views prevailed.He wasright. In the late 1920sand 1930s Fisher was at theheight of his powers and vigorously forging ahead. Pearson, although still a man to bereckoned with, was nearly 30 years awayfrom his best work,an old manfacing retirement, rather isolatedas heattackedall thosewho were not unquestioninglyybrhim. Last, but by no means least, Fisher was thebetter mathematician. He had an intuitive flair that broughthim to solutionsof ingenuityand strength.At the sametime, he wasable to demonstrateto the community of biological and behavioral scientists, a communitythatsodesperately needed a coherent system of data management and assessment, that his approachhad enormous practical utility. 186

THE ANALYSIS OF VARIANCE

187

Pearson'swork may be characterizedas large sampleand correlational, Fisher'sassmall sampleandexperimental. Fisher's contribution easily absorbs the best of Pearsonand expandson theseminal workof "Student." Assessing the import and significance of the variation in observations across groups subject to different experimental treatmentsis the essenceof analysis of variance, and a haphazard glanceat any research journalin the field of experimental psychology attests to its impact.

THEANALYSIS OFVARIANCE The fundamental ideas of analysis of variance appearedin the paper that examined correlation among Medelian factors (Fisher, 1918). At this time, eugenic researchwas occupying Fisher's attention. Between 1915 and 1920, he published halfa dozen papers that dealt with matters relevant to this interest, an interest that continued throughout his life. The 1918 paper uses the term variance for fai 2 + a22J, ai and a2 representingtwo independent causes of variability, ana referreato the normally distributed population. We may now ascribeto the constituentcausesfractions or percentagesof the total variance which they together produce. It is desirableon the onehand that the elementary ideas at the basis of the calculus of correlations should be clearly understood,and easily expressedin ordinary language,and on theother that loose phrasesaboutthe "percentage of causation,"which obscurethe essentialdistinction betweenthe individual and thepopulation, should be carefully avoided. (Fisher, 1918, pp. 399-400)

Here we seeFisher already moving away from Pearsonian correlational methodsas such and appealingto the Gaussianadditive model. Unlike Pearson'swork, it cannot be said thatthe particular philosophy of eugenics directly governed Fisher's approach to new statistical techniques, but it is clear that Fisheralways promotedthe valueof the methodsin genetics research (see, e.g., Fisher, 1952). Whatthe newtechniques were to achievewas arecognitionof the utility of statistics in agriculture,in industry, and in the biological and behavioral sciences,to an extent that couldnot possibly have been foreseen before Fisher cameon the scene. The first published account of an experiment that used analysis of variance to assessthe data was that of Fisherand MacKenzie (1923)on The Manurial Responseof Different Potato Varieties. Two aspectsof this paperare of historical interest. At that time Fisherdid not fully understandtherules of theanalysisof variance- hisanalysisiswrong - nor therole of randomization. Secondly,although the analysis of variance is closely tied to

188

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

additive models, Fisher rejects the additive modelin his first analysisof variance, proceedingto a multiplicative modelas more reasonable. (Cochrane, 1980, p. 17)

Cochrane pointsout that randomizationwas notusedin the layout and that an attempt to minimize error usedan arrangement that placed different treatments nearoneanother. The conditions couldnot provide an unbiased estimate of error. Fisher thenproceedsto ananalysis basedon amultiplicative model: Rather surprisingly, practically all of Fisher's later workon theanalysisof variance usesthe additive model. Later papers give no indicationas to why theproduct model was dropped. Perhaps Fisher found, as 1did, that the additive model is a good approximation unless main effects are large, as well as being simplerto handle than the product model. (Cochrane, 1980, p. 21)

Fisher'sderivation of the procedureof analysis of variance and hisunderstandingof the importanceof randomizationin the planningof experimentsare fully discussedin Statistical Methodsfor Research Workers (1925/1970), first published in 1925. This work is now examinedin more detail. Over 45 yearsand 14editions, the general characterof the book did not change.The 14th edition was publishedin 1970, usingnotesleft by Fisher at the time of his death. Expansions, deletions, and elaborationsare evident over the years. Notable areFisher's increasing recognition of the work of othersand greaterattentionto thehistorical account. Fisher's concentration on his rowwith the biometriciansastime wentby is also evident.The prefaceto the last edition follows earlier onesin stating thatthe book was aproductof the research needs of Rothamsted. Further: It was clear thatthe traditional machinery inculcated by the biometrical schoolwas wholely unsuited to the needs of practical research.The futile elaboration of innumerablemeasuresof correlation, and the evasion of the real difficulties of sampling problems under cover of a contemptfor small samples, were obviously beginningto make its pretensionsridiculous. (Fisher, 1970, 14th ed., p. v)

The opening sentence of the chapteron correlationin the first edition reads: No quantity is more characteristic of modern statistical work than the correlation coefficient, and nomethodhasbeen appliedsuccessfullyto such various data as the method of correlation. (Fisher, 1925, 1sted., p. 129)

and in the 14th edition: No quantity has been more characteristic of biometrical work thanthe correlation coefficient, and nomethodhas been appliedto such various data as themethod of correlation. (Fisher, 1970, 14thed., p. 177)

THE ANALYSIS OF VARIANCE

189

This not-so-subtle change is reflectedin the divisionsin psychology thatarestill evident. The twodisciplines discussed by Cronbachin 1957 (see chapter 2) are those of the correlational and experimental psychologists. In his opening chapter, Fisher sets out thescopeand definition of statistics. He notes that theyareessentialto social studiesandthat it is becausethe methods are used there that"thesestudiesmay beraisedto therank of sciences"(p. 2). The conceptsof populationsand parameters,of variationand frequency distributions, of probability and likelihood,and of thecharacteristicsof efficient statistics are outlined very clearly.A short chapteron diagrams oughtto be required reading,for it points up how useful diagramscan be in theappraisalof data.1 The chapteron distributions deals with the normal, Poisson,andbinomial distributions.Of interest is the introductionof the formula:

(Fisher usesS for summation) for variance, noting that s is thebest estimate of a. Chapter4 deals withtestsof goodness-of-fit, independence, and homoge2 neity, givinga complete description of the applicationof the X tests,including Yates'correction for discontinuityand theprocedurefor what is now known as the Fisher Exact Test. Chapter 5 is on tests of significance,about which more is said later. Chapter6 managesto discuss,quite thoroughly,the techniquesof interclass correlation without mentioning Pearsonby name exceptto acknowledge thatthe dataof Table 31 arePearsonand Lee's. Failureto acknowledge the work of others,which was acharacteristicof both Pearsonand Fisher,and which, to some extent, arose out of both spiteand arrogance,at least partly explainsthe anonymous presentation of statistical techniques that is to befound in the modern textbooksand commentaries. And then chapter7, two-thirdsof the waythroughthe book, introduces that most importantand influential of methods- analysis of variance. Fisher describesanalysisof variance as "the separationof the varianceascribableto one group of causesfrom the variance ascribable to other groups" (Fisher, 1925/1970,14th ed., p. 213), but heexaminesthe developmentof thetechnique from a considerationof the intraclass correlation. His exampleis clear and worth describing. Measurements from n' pairs of brothersmay betreated in two ways in a correlational analysis. The brothersmay be divided into two Scatterplots,that so quickly identify the presenceof "outliers," are critical in correlational analyses.The geometrical explorationof the fundamentalsof variance analysis provides insights which cannot be matched (see, e.g., Kempthorne, 1976).

1 90

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

classes,say,the elder brotherand theyounger,and theusual interclass correlation on some measured variable may becalculated. When, on theother hand, the separationof the brothers intotwo classesis either irrelevantor impossible, then a common meanand standard deviation and anintraclass correlationmay be computed. Given pairs of measurements, x\ ,x'\ ; x2 , xr 2 ; *3 , x'3 ; . . .xn> , x'n> the following statisticsmay becomputed:

In the preceding equations, Fisher's S hasbeen replaced with S and r,used to designatethe intraclass correlation coefficient. The computationof r, is very tedious,as thenumberof classesk and thenumberof observationsin each class increases. Each pair of observationshas to beconsidered twice, (;ci , x'1) and (x'1 , x1) for example. A set of kvalues givesk (k - 1) entriesin asymmetrical table. "To obviate this difficulty Harris [1913] introducedan abbreviated method of calculation by which the value of the correlation givenby the symmetrical table may beobtained directlyfrom two distributions" (Fisher, 1925/1970, 14th ed., p. 216). In fact:

Fisher goeson to discussthe sampling errorsof the intraclass correlation and refers themto his z distribution. Figure 13.1 shows the effect of the transformationof r to z. Curves of very unequal varianceare replaced by curves of equal variance, skew curves by approximately normal curves, curves of dissimilar form by curves of similar form. (Fisher, 1925/1970,14th ed.,p. 218)

The transformationis given by:

Fisherprovides tablesof the r to ztransformation. After giving an example

FIG. 13.1

r to z transformation (from Fisher, Statistical Methods for Research Workers).

191

192

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

of the use of thetable and finding the significanceof the intraclass correlation, conclusionsmay bedrawn. Becausethe symmetrical table does not give the best estimateof the correlation,a negative biasis introduced intothe valuefor z andFisher showshow this may becorrected. Fisher then shows that intraclass correlationis anexampleof the analysisof variance:"A very great simplification is introduced into questionsinvolving intraclass correlation when we recognisethat in such casesthe correlation merely measures the relative importanceof two groupsof variation" (Fisher, 1925/1970, 14th ed., p. 223). Figure 13.2is Fisher's general summary table, showing, "in the last column, the interpretationput upon each expression in the calculationof an. intraclass correlation from a symmetricaltable" (p. 225).

FIG. 13.2

ANOVA summary table 1 (from Fisher's Statistical Methods for Research Workers

A quantity madeup of two independentlyand normally distributed parts with variancesA and B respectively,has atotal varianceof (A + B). A sample of n' values is taken from the first part and different samplesof k values from the second part added to them. Fisher notes that in the populationfrom which the values are drawn, the correlation between pairs of membersof the same family is:

and thevaluesof A and B may beestimatedfrom the set of kn'observations. The summary tableis then presented again (Fig. 13.3) andFisher pointsout that "the ratio between the sums of squares is altered in the ratio n ' : ( n ' ) ,

THE ANALYSIS OF VARIANCE

193

FIG. 13.3 ANOVA summary table 2 (from Fisher, Statistical Methods for Research Workers which precisely eliminatesthe negative bias observedin z derived by the previous method" (Fisher, 1925/1970, 14th ed., p. 227). The generalclassof significancetestsapplied hereis that of testingwhether an estimate of variance derivedfrom n1 degreesof freedom is significantly greater thana second estimate derived from n2 degrees of freedom. The significance may beassessedwithout calculatingr. The value of z may be calculatedas}/2 loge \(n' - 1)(kA + B) - n'(k- 1)B|. Fisher provides tables of the zdistribution for the 5% and 1% points. In the later editionsof the book he notes that these values were calculated from the corresponding values of the 12 variance ratio, e, andrefersto tablesof these values prepared by Mahalonobis, usingthe symbolx in 1932,andSnedecor, using the symbolF in 1934. In fact: Z =

l

/2\OgeF

"The wide use in theUnited States of Snedecor's symbolhas led to the distribution beingoften referredto as thedistributionof F " (Fisher, 1925/1970, 14th ed., p. 229). Fisher ends thechapterby giving a numberof examplesof the useof the method. It shouldbe mentioned here that the detailsof the history of therelationship of ANOVA to intraclass correlationis aneglected topicin almostall discussions of the procedure.A very useful referenceis Haggard (1958). The final two chaptersof the book discussfurther applicationsof analysis of varianceand statistical estimation. Of most interest here is Fisher's demonstration of the way inwhich the techniquecan beusedto test the linear model and the "straightness"of the regression line.The method is the link between least squares andregression analysis. Also of importanceis Fisher's discussion

194

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

of Latin square designs and theanalysisof covariancein improvingthe efficiency andprecision of experiments. Fisher Box notes thatthe book did not receive a single good review.An example, which reflected the opinionsof many,was thereviewerfor the British Medical Journal: If he feared that he waslikely to fall betweentwo stools,to producea book neither full enoughto satisfy thoseinterestedin its statistical algebranor sufficiently simple to pleasethosewho dislike algebra,we think Mr. Fisher'sfears arejustified by the result. (Anonymous, 1926, p. 815)

Yates (1951) comments on these early reviews, noting that many of them expressed dismay at thelack of formal mathematical proofs and interpretedthe work as though it was only of interestto those who were involvedin small sample work. Whatever its receptionby thereviewers,by 1950, whenthe book was in its 11th edition, about 20,000 copies had been sold. But it is fair to saythat something like10 years wentby from the dateof the original publication before Fisher's methods really startedto havean effect on the behavioral sciences. Lovie (1979) traces its impact overthe years 1934to 1945. He mentions,as doothers,that the early textbook writers contributed to its acceptance. Notable here are theworks of Snedecor, published in 1934and 1937. Lush (1972) quotes a European researcher who told him, "Whenyou see Snedecoragain, tellhim that over herewe say, 'ThankGod for Snedecor;now we canunderstand Fisher' " (Lush, 1972,p. 225). Lindquist publishedhis Statistical Analysisin Educational Researchin 1940, and this book, too,was widely used. Even then, some authorities were skeptical,to say theleast. CharlesC. Petersin an editorial for the Journal of Educational Research rather condescendingly agreesthat Fisher'sstatisticsare "suitable enough"for agricultural research: And occasionallythesetechniqueswill be useful for rough preliminary exploratory researchin other fields, including psychology and education. But if educationists and psychologists,out of somesort of inferiority complex, grab indiscriminately at them and employ them where they are unsuitable, education and psychology will suffer another slump in prestige such as they have often hitherto sufferedin consequenceof the pursuit of fads. (Peters, 1943, p. 549)

That Peters'conclusion partly reflects the situationat that time,but somewhat missesthe mark, is best evidenced by Lovie's (1979) survey, which we look at in thenext chapter. Fifty or sixty years yearson, sophisticated designs and complex analyses are commonin the literaturebut misapprehensions and misgivings are still to be found there. The recipesare enthusiastically applied but their structureis not always appreciated.

MULTIPLE COMPARISON PROCEDURES

195

If the early workers reliedon writers like Snedecorto help them with Fisherianapplications, later ones are indebtedto workers like Eisenhart (1947) for assistance withthe fundamentalsof the method. Eisenhart setsout clearly the importanceof theassumptionsof analysisof variance, their critical function in inference,and the relative consequences of their not being fulfilled. Eisenhart's significant contributionhas been pickedup andelaboratedon by subsequent writers,but hisaccount cannotbe bettered. He delineatesthe two fundamentallydistinct classesof analysisof variance - what are nowknown as thefixed andrandom effects models.The first of these is the most familiarto researchersin psychology. Herethe task is to determine the significance of differences among treatment means: "Tests of significanceof employedin connection with problems of this classare simply extensionsto small samplesof the theory of least squares developed by Gauss andothers- theextensionof thetheoryto small samples being dueprincipally to R. A. Fisher" (Eisenhart, 1947, pp. 3-4). The secondclassEisenhart describes as thetrue analysisof variance. Here the problem is one ofestimating,andinferring the existenceof, the components of variance,"ascribableto random deviation of the characteristicsof individuals of a particular generic typefrom the meanvalue of these characteristics in the 'population' of all individuals of that generictype" (Eisenhart, 1947,p. 4). The failure of the then current literature to adequately distinguish between the twomethodsis becausethe emphasishadbeenon testsof significancerather than on problemsof estimation. But it would seem that, despite the best efforts of the writers of the most insightful of the nowcurrent texts(e.g., Hays, 1963, 1973),the distinctionis still not fully appliedin contemporary research. In other words, the emphasisis clearly on theassessmentof differencesamong pairsof treatment means rather than on therelativeand absolute sizeof variances. Eisenhart's discussion of the assumptionsof the techniquesis a model for later writers. Random variation, additivity, normality of distribution, homogeneity of variance, and zero covariance among the variablesare discussedin detail and their relative importance examined. This work can beregardedas a classicof its kind. MULTIPLE COMPARISON PROCEDURES In 1972, Maurice Kendall commented on howregrettableit was that duringthe 1940s mathematicshad begun to "spoil" statistics. Nowhereis the shift in emphasisfrom practice, withits room for intuition and pragmatism,to theory and abstraction more evident than in the area of multiple comparison procedures.The rulesfor making such comparisons have been discussed ad nauseam, and they continueto bediscussed. Among the more completeand illuminating

196

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

accountsarethoseof Ryan (1959)and Petrinovichand Hardyck (1969). Davis and Gaito (1984) provide a very useful discussion of some of the historical background.In commentingon Tukey's (1949) intentionto replacetheintuitive approach (championed by "Student")with some hard, cold facts, and to provide simple and definite proceduresfor researchers, they say: [This is] symptomatic of the transition in philosophy and orientation from the early use ofstatisticsas apractical and rigorous aid for interpretingresearchresults, to a highly theoreticalsubject predicatedon theassumptionthat mathematicalreasoning was paramountin statistical work. (Davis& Gaito, 1984, p. 5)

It is also the case thatthe automaticinvoking, from the statistical packages, of any one ofhalf a dozen procedures following an Ftest hashelpedto promote the emphasison thecomparisonof treatment means in psychological research. No one would argue withthe underlying rationaleof multiple comparison procedures. Given that the task is to compare treatment means, it is evident that to carry out multiple t testsfrom scratchis inappropriate. Overthe long run it is apparent thatas larger numbers of comparisons are involved, using a procedure that assumes that the comparisonsare basedon independent paired data setswill increasethe Type I error rate considerably whenall possible comparisonsin a given set ofmeansare made. Put simply,the numberof false positiveswill increase.As Davis andGaito (1984) point out, witha set at the .05 level and H0 true, comparisons, using the t test, among10 treatment means would, in the long run, leadto thedifferencebetweenthe largestand thesmallest of them being reported assignificant some60% of thetime. One of theproblems here is that the range increases faster than the standard deviation as thesize of the sample increases.The earliest attempts to devise methods to counteract this effect come from Tippett (1925)and "Student" (1927),and later workers referred to the "studentizing" of the range,using tablesof the sampling distribution of the range/standard deviation ratio, known as the qstatistic. Newman (1939)publisheda procedure that uses this statistic to assess the significanceof multiple comparisons among treatment means. In general,the earlier writers followed Fisher, who advocatedperforming t tests following an analysis that produced an overall z that rejectedthe null hypothesis,the variance estimate being provided by the error mean square and its associated degrees of freedom. Fisher'sonly cautionary note comes in a discussionof the procedureto beadopted whenthe ztest fails to reject the null hypothesis: Much caution shouldbe used before claiming significance for special comparisons. Comparisons, whichthe experimentwasdesignedto make, may,of course,be made without hesitation. It is comparisonssuggestedsubsequently,by a scrutiny of the

MULTIPLE COMPARISON PROCEDURES

197

resultsthemselves,that are open to suspicion; for if the variants are numerous, a comparisonof thehighestwith thelowestobservedvalue,pickedout from theresults, will often appearto be significant, even from undifferentiated material.(Fisher, 1935/1966,8th ed., p. 59)

Fisher is here giving his blessing to planned comparisonsbut does not mention that these comparisons should, strictly speaking, be orthogonal or independent. Whathe doessay isthat unforeseen effectsmay betaken as guidesto future investigations. Davis and Gaito (1984)are atpainsto point out that Fisher's approach wasthat of the practical researcher and tocontrastit with the later emphasison fundamental logicand mathematics. Oddly enough,a numberof procedures, springing from somewhat different rationales,all appearedon thesceneat about the sametime. Among the best known are those of Duncan (1951, 1955), Keuls (1952), Scheffe (1953), and Tukey (1949, 1953). Ryan (1959) examines the issues. After contending that, fundamentally,the same considerations apply to both a posteriori (sometimes called post hoc) and apriori (sometimes called planned) comparisons, and drawing ananalogy with debates over one-tail andtwo-tail tests(to be discussed briefly later), Ryan definesthe problem as the control of error rate. Per comparison error ratesrefer to the probability thata given comparisonwill be wrongly judged significant.Per experiment error rates refer not toprobabilities as such,but to thefrequencyof incorrect rejectionsof the null hypothesisin an experiment, overthe long run of such experiments.Finally, the so-called experimentwise error rate is aprobability,the probability thatany oneparticular experimenthas atleast one incorrect conclusion.The various techniques that were developed have all largely concentratedon reducing,or eliminating,the effect of the latter. The exception seemsto be Duncan, who attempted to introduce a test based on the error rate per independent comparison. Ryan suggeststhat this special procedure seems unnecessary and Scheffe (1959),a brilliant mathematical statistician, is unableto understandits justification. Debateswill continue and, meanwhile, the packagesprovide us with all the methodsfor a keypressor two. For the purposesof the present discussion, the examinationof multiple comparison procedures provides a case historyfor the state of contemporary statistics. First,it is an example,to match all examples,of the questfor rules for decision makingand statistical inference that lie outsidethe structureand concept of the experiment itself. Ryan argues "that comparisons decided upon a priori from somepsychological theory should not affect the nature of the significance tests employedfor multiple comparisons" (Ryan, 1959, p. 33). Fisher believed that research should be theory drivenand that its results were always opento revision. "Multiple comparison procedures" could easily replace, with only some

198

13. ASSESSINGDIFFERENCESAND HAVING CONFIDENCE

slight modification,the subjectof ANOVA in the following quotation: The quick initial successof ANOVA in psychologycan beattributedto the unattractiveness of the then available methods of analysing large experiments, combined with the appealof Fisher'swork which seemedto match, witha remarkabledegree of exactness,the intellectual ethosof experimental psychologyof the period, with its atheoreticalandsituationalistnatureand itswish for moreextensiveexperiments. (Lovie, 1979,p. 175)

Fisher would have deplored this; indeed, he diddeploreit. Second,it reflects the emphasis,in today's work,on the avoidanceof the Type I error. Fisher would havehad mixed feelings about this.On the onehand, he rejected the notion of "errorsof the second kind"(to bediscussedin the next chapter); only rejection or acceptance of the null hypothesisentersinto his schemeof things. On theother, hewould have been dismayed - indeed, he wasdismayed- by a concentrationon automaticacceptanceor rejection of the null hypothesisas the final arbiter in assessingthe outcomeof an experiment. Criticizing the ideologiesof both Russiaand theUnited States, where he felt such technological approaches were evident, he says: How far,within such a system [Russia], personal and individual inferencesfrom observed factsare permissiblewe do notknow, but it may perhapsbe safer... to conceal rather than to advertisetheselfish andperhapshereticalaim of understanding for oneselfthe scientific situation.In the U.S. alsothe great importanceof organized technology has Ithink madeit easyto confusethe process appropriate for drawing correct conclusions with those aimed rather at, let ussay,speeding production,or saving money. (Fisher, 1955, p. 70)

Third, the multiple comparison procedure debate reflects, as hasbeen noted, the increasingly mathematical approach to applied statistics.In the United States,a Statistical Computing Center at the State Collegeof Agriculture at Ames, Iowa (now Iowa State University), becamethe first centerof its kind. It was headedby George W. Snedecor, a mathematician,who suggestedthat Fisher be invited to lecture thereduring the summer sessionof 1931. Lush (1972) reportsthat academic policyat that institutionwas such that graduate coursesin statistics were administered by the Departmentof Mathematics. At Berkeley, the mathematics department headed by Griffith C. Evans,who went there in 1934, was to beinstrumentalin making thatinstitution a world center in statistics. Fisher visited there in the late summerof 1936 but made a very poor personalimpression. Jerzy Neyman was tojoin the departmentin 1939. And, of course,Annals of Mathematical Statisticswas foundedat Universityof Michigan in 1930.On more thanoneoccasionduring thoseyearsat the end of the 1930s, Fisher contemplated moving to the United States, and one may

CONFIDENCE INTERVALS AND SIGNIFICANCE TESTS

199

wonder whathis influence would have been on statistical developments had he becomepart of that milieu. And, finally, the techniqueof multiple comparisons establishes, without a backward glance,a systemof statistics thatis based unequivocally on along-run relative frequencydefinition of probability where subjective, a priori notions at best run in parallel withtheplanning of an experiment,for they certainlydo not affect, in the statistical context,the real decisions.

CONFIDENCE INTERVALS AND SIGNIFICANCE TESTS Inside the laboratory, at the researcher's terminal, as theoutcomeof the job reveals itself,andmost certainlywithin the pagesof thejournals, success means statistical significance.A very great many reviews andcommentaries, some of which have been brought together by Henkel and Morrison (1970), deplore the concentrationon the Type I error rate,the a level, as it is known, that this implies. To a lesserextent,but gaining in strength,is thepleafor an alternative approachto the reporting of statistical outcomes, namely, the examinationof confidence intervals. And,to an even lesser extent, judging from the journals, is the challenge thatthe statistical outcomes made in the assessmentof differencesshouldbe translatedinto the reportingof strengthsof effects. Here the wheel is turning full circle, backto the appreciationof experimental results 2 in terms of a correlational analysis. Fisher himselfseemeto believe thatthe notion of statistical significance was more or less self-evident.Even in the last edition of Statistical Methods, the words null and hypothesisdo not appearin the index, and significance and testsof significance, meaningof haveone entry, which refersto thefollowing: From a limited experience,for example,of individuals of a species,... we may obtain some ideaof the infinite hypotheticalpopulationfrom which our samplehas been drawn,and so of theprobablenatureof future samples If a second sample beliesthis expectationwe infer that it is, in the languageof statistics,drawn from a second population; that the treatment...did in fact makea material difference.... Critical testsof this kind may becalled testsof significance,and when such tests are availablewe maydiscover whether a second sample is or is notsignificantly different from the first. (Fisher, 1925/1970,14th ed.,p. 41)

A few pages later, Fisher does explain the use of thetail area of the probability intervaland notes thatthe p = .05level is the "convenient" limit for judging significance. He does thisin the context of examplesof how often 2

Of interest here,however, is the increasinguse,and theincreasingpower,of regression modelsin the analysisof data; see,for example,Fox (1984).

200

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

deviationsof aparticular sizeoccur in agiven numberof trials - that twicethe standard deviation is exceeded about once in 22 trials, and so on.Thereis little wonder that researchers interpreted significance as anextensionof the proportion of outcomesin a long-run repetitive process, an interpretationto which Fisherobjected! In TheDesignof Experiments,he says: In order to assertthat a natural phenomenon is experimentally demonstrable we need, not anisolated record,but areliable methodof procedure. In relation to the test of significance, we may saythat a phenomenonis experimentally demonstrable when we know how toconductan experiment whichwill rarely fail to give us a statistically significant result. (Fisher,1935/1966,8th ed., p. 14)

Here Fisher certainly seems to be advocating "rulesof procedure," againa situation which elsewherehe condemns. Of more interestis the notion that experiments mightbe repeatedto seeif they fail to give significant results. This seemsto be avery curious procedure, for surely experiments,if they are to be repeated,arerepeatedto find support for an assertion.The problems that these statements cause arebasedon thefact thatthe null hypothesisis astatement that is the negationof the effect that the experimentis trying to demonstrate,and that it is this hypothesis that is subjectedto statistical test.The Neyman-Pearson approach (discussed in chapter15) was anattemptto overcome these problems, but it was anapproach that again Fishercondemned. It appearsthat Fisheris responsiblefor the first formal statementof the .05 level as thecriterion for judging significance,but the convention predates his work (Cowles& Davis, 1982a). Earlier statements about the improbability of statistical outcomes were made by Pearsonin his 1900(a) paper,and "Student" (1908a) judged that three times the probable errorin the normal curve would be considered significant. Wood and Stratton (1910) recommend "taking 30 to 1 as thelowest oddswhich can beacceptedas giving practical certainty that a differencein a given directionis significant"(p. 433). Fisher Box mentions that Fisher took a courseat Cambridgeon thetheory of errors from Stratton duringthe academic year1912-1913. Odds of 30 to 1representa little more than three timesthe probable error (P.E.) referredto the normal probability curve. Because the probable erroris equivalentto a little more than two-thirds of a standard deviation, three P.E.s is almost two standard deviations, and, of course, reference to any table of the "areasunderthe normal curve" shows that a z scoreof 1.96 cutsoff 5% in the two tails of the distribution. With somelittle allowancefor rounding,the .05 probability levelis seento have enjoyedacceptancesome time before Fisher's prescription. A testof significanceis atestof theprobabilityof a statistical outcome under the hypothesisof chance. In post-Fisherian analyses the probabilityis that of

CONFIDENCE INTERVALS AND SIGNIFICANCE TESTS

201

making an error in rejecting the null hypothesis,the so-called TypeI error. It is, however,not uncommon to readof the null hypothesisbeing rejectedat the 5% level of confidence', an oddinversion that endows the act ofrejection with a sort of statementof belief about the outcome. Interpretationsof this kind reflect, unfortunatelyin the worst possible way,the notionsof significancetests in the Fisherian sense, and confidence intervals introduced in the 1930sby Jerzy Neyman. Neyman (1941)statesthat the theory of confidence intervalswas establishedto give frequency interpretations of problems of estimation. The classical frequency interpretation is best understood in the context of the long-run relative frequency of the outcomesin, say, the rolling of a die. Actual relative frequenciesin a finite run aretaken to be more or less equalto the probabilities,and in the"infinite" run are equalto the probabilities. In his 1937 paper,Neyman considersa systemof random variables x1,x2, xi,. .. xn designated E, and aprobability law p(E | 0i,0 2 , . .. 01)where 61,0 2 , . . . 0i are unknown parameters.The problem is to establish: single-valuedfunctions of the x 's 0 (£) and 0(£) havingthe property that, whatever the valuesof the O's,say 0'i, 0'2,. . . 0'i, the probability of 0(£) falling short of 9'i and at thesame timeof 9(£) exceedingO'i is equal to a numbera fixed in advance so that 0 < a < 1,

It is essentialto notice thatin this problem the probability refers to the valuesof 9(£) and§(£) which, beingsingle-valuedfunctions of the.x 's arerandom variables. 9'i being a constant,the left-hand side of [the above] doesnot representthe probability of 0'i falling within somefixed limits. (Neyman, 1937,p. 379)

The values 0(£) and 0(£) representthe confidencelimits for 0'1 andspan the confidence intervalfor theconfidencecoefficient a. Caremustbetaken here not to confuse thisa with the symbol for the Type 1error rate;in fact this a is 1 minus the Type I error rate. The last sentencein the quotation from Neyman just given is very important. First,an exampleof a statementof the confidence interval in morefamiliar termsis, perhaps,in order. Supposethat measurements on a particular fairly large random sample have produced a mean of 100 and that the standarderror of this meanhas been calculatedto be 3. Theroutine methodof establishingthe upperandlower limits of the 90%confidence interval would be tocompute 100 ± 1.65(3). Whathasbeen established? The textbooks will commonly saythat the probability thatthe population mean,u, falls within this interval is 90%, whichis precisely what Neyman says is not thecase. For Neyman, the confidence limitsrepresentthe solution to the statistical

202

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

problemof estimating 0\ independentof a priori probabilities. What Neyman is saying is that, overthe long run, confidence intervals, calculated in this way, will containthe parameter90% of thetime. It is notjust being pedanticto insist that, in Neyman's terms, to saythat the oneinterval actually calculated contains the parameter90% of thetime is mistaken. Nevertheless, that is the way in which confidence intervals have sometimes come to be interpreted and used. A numberof writers have vigorously propounded the benefitsof confidence intervals asopposedto significance testing: Wheneverpossible,the basicstatisticalreportshould be in theform of a confidence interval. Briefly, a confidence intervalis a subset of the alternative hypotheses computedfrom the experimentaldatain such a way that for a selectedconfidence level a, theprobability thatthe thetrue hypothesisis includedin a set soobtained is a. Typically, an a -level confidence intervalconsistsof thosehypothesesunder which the p valuefor theexperimental outcome is larger than1 - a.... Confidence intervals are the closest we can atpresent cometo quantitative assessmentof hypothesis-probabilities. . . and arecurrently our most effectiveway to eliminate hypothesesfrom practicalconsideration- if we chooseto act asthough noneof the hypothesesnot included in a 95%confidence intervalare correct, we stand onlya 5% chanceof error. (Rozeboom,1960, p. 426)

Both Ronald Fisherand Jerzy Neyman would have been very unhappy with this advice! It does, however,reflect once againthe way inwhich researchers in the psychological sciences prescribe andpropound rules that they believe will leadto acceptanceof the findings of research. Rozeboom's paper is a thoughtful attempt to provide alternativesto the routine null hypothesis significance test and dealswith the important aspectof degreeof belief in an outcome. One final point on confidenceinterval theory:it is apparent that some early commentators(e.g., E. S.Pearson, 1939b; Welch, 1939) believed that Fisher's "fiducial theory" andNeyman's confidence interval theory were closely related. Neyman himself (1934)felt that his work was anextensionof that of Fisher. Fisher objected stronglyto the notion that therewas anything at all confusing about fiducial distributionsor probabilities and denied any relationshipto the theory of confidence intervals, which he maintained,was itself inconsistent.In 1941 Neyman attemptedto show that thereis no relationship betweenthe two theories,and herehe did notpull his punches: The presentauthor is inclined to think that the literature on thetheory of fiducial argumentwas born out of ideassimilarto those underlyingthe theory of confidence intervals. Theseideas,however, seemto have beentoo vague to crystallizeinto a mathematical theory. Instead they resulted in misconceptionsof "fiducial probability" and "fiducial distribution of a parameter"which seemto involve intrinsic inconsistencies In this light,thetheory of fiducial inferenceis simply non-existent

A NOTE ON "ONE-TAIL" AND "TWO-TAIL" TESTS

203

in the same senseas, for example, a theory of numbers definedby mutually contradictorydefinitions. (Neyman, 1941,p. 149)

To the confused onlooker, Neyman doesseem to have been clarifyingone aspectof Fisher's approach,and perhapsfor a brief momentof time therewas a hint of a rapprochement.Had it happened, thereis reason to believe that unequivocalstatementsfrom thesemen would have beenof overriding importancein subsequent applications of statistical techniques.In fact, their quarrels left the job to their interpreters. Debate and disagreement would,of course, have continued,but those who like to feel safe could have turned to the orthodoxy of the masters,a notion thatis not without its attraction. A NOTE ON "ONE-TAIL" AND "TWO-TAIL"

TESTS

In the early 1950s, mainlyin the pages of Psychological Bulletin and the Psychological Review- then, as now, immensely importantand influential journals - a debate took place on theutility anddesirabilityof one-tail versus two-tail tests(Burke, 1953; Hick, 1952; Jones, 1952,1954; Marks, 1951,1953). It had been stated that when an experimental hypothesis had a directional component- that is, notmerely thata parameteru, differed significantly from a second parameter 2,jj.but that, for example,ji, > u2 - thentheresearcherwas permitted to use thearea cut off in only one tail of the probability distribution when the test of significancewas applied. Referredto the normal distribution, this means thatthe critical value becomes 1.65 rather than 1.96. It was argued that becausemost assertions that appealed to theory were directional- for example, that spacedwas better than massed practicein learning, or that extraverts conditioned poorly whereas introverts conditioned well- the actual statistical testshould takeinto account these one-sided alternatives. Arguments againstthe use ofone-tailed testsprimarily centeredon whatthe researcher does when a very large differenceis obtained,but in theunexpected direction.The temptationto "cheat"is overwhelming!It was also argued that such data ought not to be treated withthe same reactionas a zero difference on scientific grounds: "It is to be doubted whether experimental psychology, in its present state,can afford suchlofty indifference toward experimental surprises" (Burke, 1953, p. 385). Many workers were concerned that a move toward one-tail tests represented a loosening or a lowering of conventional standards, a sort of reprehensible breaking of the rulesand pious pronouncements about scientific conservatism abound. Predictably, the debateled to attemptsto establishthe rulesfor the use of one-tail tests (Kimmel, 1957). What is found in these discussions is the implicit assumption that formally

204

13. ASSESSING DIFFERENCES AND HAVING CONFIDENCE

stated alternative hypotheses are anintegral partof statistical analysis. What is not found in these discussions is anyreferenceto thelogic of usinga probability distribution for the assessment of experimental data.Put baldly andsimply, using a one-tail test means that the researcheris using only halfthe probability distribution, and it is inconceivable that this procedure would have been acceptableto any of thefounding fathers.The debateis yet another exampleof the eagernessof practical researchers to codify the methodsof data assessment so that statistical significance has themaximum opportunityto reveal itself,but in the presenceof rules that discouraged, if not entirely eliminated, "fudging." Significance levels, or p levels,are routinely acceptedas part of the interpretationof statistical outcomes.The statistic thatis obtainedis examined with respectto a hypothesized distribution of the statistic,a distribution thatcan be completely specified. Whatis not soreadily appreciatedis the notion of an alternative model. This matter is examinedin the final chapter. For now, a summaryof the processof significancetesting givenin one of theweightierand more thoughtful texts might be helpful. 1. Specificationof ahypothesizedclassof modelsand analternativeclassof models. 2. Choice of a function of the observationsT. 3. Evaluationof the significance level, i.e.,SL =P(T> t\ where t is theobserved value of T and where the probability is calculated for the hypothesizedclass of models. In most applied writingsthe significancelevel is designatedby P, acustom which has engendereda vast amountof confusion. It is quite common to refer to the hypothesized classof models as the null hypothesisand to thealternativeclassof modelsas thealternative hypothesis.We shall omit the adjective "null" becauseit may bemisleading. (Kempthorne & Folks, 1971, pp. 314-315)

14 Treatments and Effects: The Rise of ANOVA

THE BEGINNINGS In the last chapterwe consideredthe developmentof analysis of variance, ANOVA . Herethe incorporationof this statistical technique into psychological methodologyis examinedin a little more detail. The emergenceof ANOVA as themost favored method of data appraisalin psychologyfrom the late 1930sto the 1960s represents a most interestingmix offerees. The first was thecontinuingneedfor psychologyto be"respectable" in the scientific senseand thereforeto seek out mathematical methods. The correlational techniques that were the norm in the 1920sand early 1930s were not enough, nor did they lend themselves to situations where experimental variables mightbe manipulated.The secondwas thegrowth of the commentators and textbook writerswho interpreted Fisher witha minimum needof mathematics. Froma trickle to a flood, the recipe books pouredout over a period of about 25 years and theflow continues unabated. And the third is the emergenceof explanationsof the models of ANOVA using the concept of expected mean squares. These explanations, which did not avoid mathematics, but which wereat alevel that required little more than high school math, opened the way formore sophisticated procedures to beboth understoodand applied. THE EXPERIMENTAL TEXTS It is clear, however, that the enthusiasm that wasmountingfor the newmethods is not reflectedin thetextbooksof experimentalpsychology that were published in those years. Considering four of the books that were classics in their own time shows that their authors largely avoided or ignoredthe impact of Fisher's statistics. Osgood's Method and Theory in Experimental Psychology (1953) 205

206

14. TREATMENTS AND EFFECTS: THE RISE OF ANOVA

mentions neither Fisher nor ANOVA, neither Pearsonnor correlation. Woodworth and Schlosberg's Experimental Psychology (1954) ignores Pearson by name,and theauthors state,"To go thoroughly into correlational analysis lies beyond the scopeof this book" (p. 39). The writers, however, showsome recognitionof the changing times, admitting that the "old standard 'ruleof one variable'" (p. 2)doesnot mean thatno more thanonefactor may bevaried.The experimental design must allow, however, for the effects of single variablesto be assessedand also the effect of possible interactions. Fisher's Design of Experiments is then cited, and Underwood's (1949) bookis referencedand recommendedas asourceof simple experimental designs. And that is it! Hilgard's chapterin Stevens'Handbookof ExperimentalPsychology (1951) not only givesa brief descriptionof ANOVA in the contextof the logic of the control group and factorial designbut also comments on theutility of matched groups designs, but other commentaries, explanations, and discussionsof ANOVA methodsdo not appearin the book. Stevens himself recognizes the utility of correlational techniquesbut admits that his colleague Frederick Mosteller had toconvincehim that his claim was overly conservativeand that: "rank order correlation does not apply to ordinal scales because the derivation of the formula for this correlation involves the assumption thatthe differences between successive ranks are equal,(p. 27)." An astonishing claim. Underwood's (1949) book is a little more encouraging,in that he does recognizethe importanceof statistics.His preface clearly states how important it is to deal with both methodand content.His own teachingof experimental psychology required that statisticsbe anintegral partof it: "I believethat the factual subject mattercan becomprehended readily without a statistical knowledge,but a full appreciation of experimental design problems requires some statistical thinking"(p. v). But, in general,it is fair to saythat many psychological researchers were not in tune with the statistical methods that were appearing."Statistics"seemsto havebeen seenas anecessaryevil! Indeed, thereis more thana hint of the same mindset in today'stexts.An informal surveyof introductory textbooks published in the last 10 years showsa depressingly high incidence of statistics being relegated to an appendix and of sometimesshrill claims that theycan be understood withoutrecourseto mathematics.In using statistics such as the t ratio and more comprehensive analyses such as ANOVA, the necessityof randomizationis always emphasized. Researchers in the social and educational areasof psychology realized that such a requirementwhenit cameto assigning participantsto "treatments"wasjust not possible. Levelsof ability, socioeconomic groupings, age, sex, and so oncannot be directly manipulated. When methodologists such as Campbelland Stanley (1963), authorities in the classification and comparison of experimental designs, showed that meaningful

THE JOURNALS AND THE PAPERS

207

analyses couldbe achieved through the use of"found groups"and what is often called quasi-experimental designs, the potential of ANOVA techniques widened considerably.The argument was,and is,advanced that unless the experimenter can control for all relevant variables alternative explanations for the results other thanthe influenceof the independent variable can befound- the so-called correlated biases. The skeptic might argue that, in addition to thefact that statistical appraisals are bydefinition probabilisticand therefore uncertain, direct manipulationof the independent variable does not assuredly guard against mistakenattribution of its effects. And, very importantly,the commentaries, such asthey were,on experimental methods haveto be seen in the light of the dominant force,in American psychologyat least,of behaviorism. For example, Brennan's textbook History and Systemsof Psychology(1994)devotestwo chaptersandmore than40 pages out of about 100pagesto Twentieth-Century Systems (omitting Eastern Traditions, Contemporary Trends and theThird Force Movement). This is not to be taken as acriticism of Brennan: indeed, his book has run toseveral editionsand 1 is an excellentand readable text. The point is that from Watson's (1913) paper until well on into the 1960s, experimental psychology was, for many, the experimental analysisof behavior - latterly within a Skinnerian framework. Sidman's (1960) book Tactics of Scientific Research: Evaluating Experimental Data in Psychologyis most certainlynot about statistical analysis! Scienceis presumably dedicated to stampingout ignorance,but statistical evaluation of dataagainsta baselinewhosecharacteristicsaredeterminedby unknown variables constitutesa passiveacceptanceof ignorance. Thisis a curious negation of the professedaims of science.More consistent withthoseaims is the evaluation of data by meansof experimental control,(p. 45)

In general, the experimental psychologists of this ilk eschewed statistical approaches apartfrom descriptive means,frequency counts, and straightforward assessments of variability. THE JOURNALS AND THE PAPERS Rucci and Tweney (1980) have carried out acomprehensive analysis of the use of ANOVA from 1925 to 1950, concentrating mainly, but not exclusively,on American publications. They identify the earliest research to useANOVA as a paper by Reitz (1934).The author checkedfor homogeneityof variance, gave the method for computingz, andnoted its relationshipwith r\2. An early paper 1

Perhapsone might be permittedto ask Brennanto include a chapteron the history and influence of statistics!

208

14. TREATMENTS AND EFFECTS: THE RISE OF ANOVA

that usesa factorial designis that of Baxter (1940).The author,at theUniversity of Minnesota, acknowledges the statistical help given to him byPalmer Johnson, and healso credits Crutchfield (1938) with the first applicationof a factorial designto apsychological study. Baxter states that his aim is "anattemptto show how factorial design, as discussedby Fisherand more specificallyby Yates . . . , hasbeen appliedto a study of reactiontime" (p.494). The proposed study presentedfor illustration dealswith reaction timeand examined three factors: the hand usedby theparticipant,the sensory modality (auditory or visual),and discrimination (a single stimulus, two stimuli, three stimuli). Baxter explains how the treatment combinations could be arranged and shows how partial confounding(which resultsin some possible interactions becoming untestable) is used to reduce subject fatigue. The paper is altogethera remarkably clear accountof the structureof a factorial design.And Baxter followed throughby reporting (1942)on theoutcomeof a study which used this design. Rucci and Tweney examined 6,457 papers in six American psychological journals. Theyfind that from 1935to 1952 thereis a steady risein the use of ANOVA, which is paralleledby a rise in the use of the ratio. t The rise became a fall during the war years, and therise became steeper after the war. It is suggested thatthe younger researchers, who would have become acquainted with the newtechniquesin their pre-war graduate school training, would have been eligiblefor military serviceand the"old-timers" who were left used older procedures, such as thecritical ratio. It is not thecasethatthe use ofcorrelational techniques diminished. Rucci andTweney's analysis indicates that the percentage ofarticles using such methods remained fairly steady throughout the period examined. Their conclusion is that ANOVA filled the void in experimental psychologybut did notdisplace Cronbach's (see chapter 2, p. 35)other discipline of scientific psychology. Nevertheless, the use ofANOVA was not establisheduntil quite lateand only surpassedits pre-war use in 1950. Overall, Rucciand Tweney's analysis leads, asthey themselves state,to the view that "ANOVA was incorporatedinto psychologyin logical and orderly steps"(p. 179), and their introduction avers that "It took less than15 yearsfor psychology to incorporateANOVA" (p. 166). Theydo notclaim - indeedthey specifically deny - that the introductionof these techniques constitutes a paradigm shift in the Kuhnian sense. An examinationof the use ofANOVA from 1934 to 1945 by a British historian of science, Lovie (1979), gives a somewhatdifferent view from that offered by Rucci and Tweney, who,unfortunately,do not cite this work,for it is an altogether moreinsightful appraisal, which comes to somewhatdifferent conclusions: The work demonstrates that incorporatingthe technique[ANOVA] into psychology

THE JOURNALS AND THE PAPERS

209

was along and painful process.This was duemore to the substantive implications of the method thanto the purely practicalmattersof increasedarithmetical labour and novelty of language,(p. 151)

Lovie also shows that ANOVAcan beregarded,as heputsit, as"one of the midwives of contemporary experimental psychology" (p. 152). A decidedly non-Kuhniancharacterization,but close enoughto an indicationof a paradigm shift! Threecase studies described by Lovie showthe shift, over a period from 1926 to 1932,from informal,as it were, "let-me-show-you" analysis through a mixture of informal andstatistical analysis to awholly statistical analysis. It was the design of the experiment that occupied the experimenteras far asmethod was concerned,and themeaningof the results couldbe consideredto be "a scientifically acceptableform of consensualbargaining between the authorand the readeron the basis of non-statistical common-sense statements about the data"(p. 155). When ANOVA was adopted,the old conceptualizations of the interpretations of datadied hard, and theresults broughtforth by the method were used selectivelyand asadditional supportfor outcomes thatthe old verbal methods might well have uncovered. Lovie's valuableaccountand hisexaminationof the early papers provides detailed support for the reasonswhy manyof the first applicationsof ANOVA were trivial and why more enlightened and effective useswere slowto emergein anyquantity. A seriesof papersby Carrington (1934, 1936, 1937) in the Proceedingsof the Societyfor Psychical Research report on resultsgiven by one-way ANOVA. The papers dealwith the quantitativeanalysisof trance statesand show thatthe pioneersin systematic research in parapsychology were among the earliest users of the newmethods. Indeed, Beloff (1993) makesthis interesting point: Statistics,after all, have always been a more criticalaspectof experimental parapsychology than they have been for psychophysicsor experimental psychology precisely becausethe results were more likely to be challenged.There is, indeed, someevidencethat parapsychology acted as aspurto statisticiansin developing their own discipline and inelaboratingthe concept of randomness,(p. 126)

It is worth noting that some of the most well-knownof the early rigorous experimental psychologists, for example, GardnerMurphy and William McDougall, were interestedin the paranormal,and it was thelatter who, leaving Harvard for Duke University, recruitedJ. B. Rhine, who helpedto found, and later directed, that institution's parapsychology laboratory. Of course, there were several psychological andeducational researchers who vigorously supportedthe "new" methods. Garrett andZubin (1943) observe that ANOVA had notbeen usedwidely in psychological research. These authors

210

14. TREATMENTS AND EFFECTS: THE RISE OF ANOVA

make the point that: Even as recently as 1940, the author of a textbook [Lindquist] in which Fisher's methodsare applied to problems in educationalrersearch,found it necessaryto use artificial datain severalinstancesbecauseof the lack of experimental materialin the field, (p. 233)

These authors also offer the opinion thatthe belief thatthe methods deal with small sampleshadinfluenced their acceptance - a belief, nay,a. fact, that has ledothers to point to it as areasonwhy they were welcomed! They also worried about "the language of the farm," "soil fertility, weights of pigs, effectivenessof manurial treatmentsand the like" (p. 233). Hearnshaw,the eminenthistorian of psychology, also mentions the samedifficulty as areason for his view that"Fisher'smethods were slow to percolate into psychology both in this country [Britain]and inAmerica" (1964,p. 226). Althoughit is the case that agricultural examples are cited in Fisher's books,and phrases that include the words "blocks," "plots," and "split-plots" still seema little odd to psychological researchers,it is not true to saythat the world of agriculture permeates all Fisher's discussions and explanations,as Lovie (1979) has pointed out. Indeed,the early chapterin The Design of Experimentson themathematicsof a lady tastingtea isclearly the story of anexperimentin perceptual ability, which is aboutas psychologicalas onecould get. It must be mentioned that Grant (1944) produced a detailed criticismof Garrett and Zubin's paper,in particular taking themto task for citing examples of reports where ANOVAis wrongly usedor inappropriatelyapplied. Fromthe standpointof the basisof the method, Grant states that in the Garrett andZubin piece: The impression is given . . . that the primary purposeof analysisof variance is to divide the total variance intotwo or more components (the mean squares) which are to be interpreteddirectly as therespective contributions to the total variance made by the experimental variablesand experimental error, (pp.158-159)

He goeson to saythat the purposeof ANOVA is to test the significanceof variation and, while agreeing that it is possibleto estimatethe proportional contribution of the various components, "the process of estimation mustbe clearly differentiatedfrom the test of significance"(p. 159). Theseare valid criticisms, but it must be said that suchconfusion as some commentaries may bring, reflects Fisher's (1934) own assertionin remarkson a paper givenby Wishart (1934) at a meetingof the Royal Statistical Society that ANOVA"is not a mathematical theorem, but rather a convenient methodof arrangingthe arithmetic" (p. 52). In additionthe estimationof treatmenteffects is considered

THEJOURNALSAND THE PAPERS 211

by many researchers to becritical (andseechapter7, p. 83). What is most telling about Grant's remarks, made more than 50 years ago, is that they illustratea criticismof data appraisal using statistics that persists to this day, thatis, the notion that the primary aim of the exerciseis to obtain significant resultsand that reportingeffect size is not aroutine procedure. Also of interestis Grant's inclusion of expressionsfor the expected values of the mean squares,of which more later. A detailed appraisalof the statistical contentof the leading British journal has notbeen carried out,but a review of that publication over10 year periods shows the rise in the use ofstatistical techniques of increasing varietyand sophistication.The progressionis not dissimilarfrom that observedby Rucci and Tweney. The first issueof the British Journal of Psychology appeared in 1904 underthe distinguished co-editorships of James Ward(1843-1925)and W. H. R. Rivers (1864-1922).Some papersin that volume madeuse ofsuch statisticsasweregenerally available,averages,medians, mean variation, standard deviation,and thecoefficient of variation. Statistics were there at the start. By 1929-1930,correlational techniques appeared fairly regularly in the journal. Volume30, in 1940, contained29 papersand includedthe use of 2x, the mean,the probable error,and factor analysis. There were no reports that usedANOVA and nor did any of the 14 papers published in the 1950 volume (Vol. 41). But x,2, and the tratio still found a place.Fitt and Rogers (1950),the authorsof the paper that used the t ratio for the difference betweenthe means felt it necessaryto includethe formula and cited Lindquist (1940)as thesource. The researchersof the 1950swho were beginningto useANOVA, followed the appearanceof a significant F valuewith post hoc multiple t tests. The use of multiple comparisonprocedures(chapter 13, p. 195) had not yetarrived on the data analysis scene. Six of the 36paperspublishedin 1960 included ANOVA, one ofwhich was anANOVA by ranks. Kendall's Tau, the t ratio, correlation, partial correlation, factor analysis, the Mann-Whitney U test - all were used. Statistics had indeed arrivedand ANOVA was in theforefront.. A third of the papers(17 out of 52, to bemore precise!)in the 1970 issue made use ofANOVA in the appraisalof the data. Also present were Pearson correlation coefficients,the Wilcoxon T, Tau, Spearman's rank difference correlation,the Mann-Whitney U, the phicoefficient, and the tratio. Educational researchersfigured largely in the early papers that used ANOVA, both as commentatorson the methodand inapplyingit to their own research. Of course, manyin the fieldwere quitefamiliar with the correlational techniques thathad led tofactoranalysisand itscontroversiesandmathematical complexities. There had beenmore than40 yearsof study and discussionof skill andability testingand thenatureof intelligence-the raisond'etre of factor analysis- andthis wascentral to thework of thosein the field.They wereready,

212

14. TREATMENTS AND EFFECTS: THE RISE OF ANOVA

even eager,to come to grips with methods that promised an effective way to use themulti-factor approachin experiments,an approach thatoffered the opportunity of testing competing theories more systematically than before. And it was from theeducational researcher's standpoint that Stanley (1966) offered an appraisalof the influence of Fisher's work "thirty yearslater." He makesthe interesting point that although there were experimenters with "considerable 'feel'for designing experiments, . . . perhaps someof them did not have the technical sophisticationto do the corresponding analyses."This judgmentis basedon the fact that 5 of the 21volunteered papers submitted to the Division of Educational Psychology of the American Psychological Association annual conventionin 1965 had to bereturnedto have more complex analysis carried out. This may have beendue tolack of knowledge,it may have been due to uncertainty aboutthe applicability of the techniques,but it also reflects perhapsthe lingering appealof straightforwardand even subjective analysis thatthe researchersof 20 years before favored, as wenoted earlierin the discussionof Lovie's appraisal.It is also Stanley's view that "the impact of the Fisherian revolutionin the designandanalysisof experiments came slowly to psychology" (p. 224). It surely is a matter of opinion as towhether Rucci and Tweney'scountof less than5 percentof papersin their chosen journals in 1940 to between 15 and 20 percent in 1952 is slow growth, particularly when it is observed thatthe growth of the use ofANOVA varied considerably across thejournals.2 This leadsto thepossibility thatthe editorial policyand practice of their chosen journals could have influenced Rucci and Tweney'scount.3 It is also surelythe case thatif a new techniqueis worth anythingat all, thenthe growth of its usewill show as apositively accelerated growth curve, because as more people takeit up, journal editors havea wider and deeper poolof potential refereeswho cangive knowledgeable opinions. Rucci and Tweney's remark, quoted earlier, on thelengthof time it took for ANOVA to become part of experimental psychology, has atone that suggests that they do not believe that the acceptancecan beregardedas"slow." THE STATISTICAL TEXTS The first textbooks that were written for behavioral scientists began to appear in the late 1940sandearly 1950s althoughit mustbe noted thatthe much-cited 2

Stanleyobservesthat theJournal of GeneralPsychologywas late to include ANOVA and it did not appearat all in the 1948 volume. 3 The author, many years ago, had apaper sent back with a quite kind note indicating thatthe journal generallydid notpublish "correlational studies." Younger researchers soon learn which journals are likely to bereceptiveto their approaches!

EXPECTED MEAN SQUARES

213

text by Lindquist was publishedin 1940.The impactof the first edition of this book may have beenlessened,however, by a great manyerrors, both typographic and computational. Quinn McNemar (1940b), in his review, observes that the volume,"in the opinionof thereviewerandstudentreaders,suffers from an overdoseof wordiness"(p. 746). He also notes that there are several major slips, although acknowledging the difficulty in explaining severalof the topics covered,anddeclares thatthe book "shouldbe particularlyuseful to all who are interestedin obtaining non-mathematical knowledge of the variance technique" (p. 748). Snedecor,an agricultural statistician, published a much-used textin 1937, and he isoften given creditfor making Fisher comprehensible (see chapter 13, p. 194). Rucci and Tweney place George Snedecor of Iowa State College, Harold Hotelling, an economist at Columbia, and Palmer Johnsonof the University of Minnesota,all of whom had spent time with Fisher himself, as the foundersof statistical trainingin the United States.But any informal surveyof psychologistswho were undergraduates in the 1950sand early 1960s would likely reveal thata large numberof them were rearedon thetexts of Edwards (1950) of the University of Washington,Guilford (1950) of the University of SouthernCalifornia,or McNemar (1949)of Stanford University. Guilford had publisheda text in 1942,but it was thesecondedition of 1950 and thethird of 1956 that became very popular in undergraduate teaching. In Canadaa somewhat later popular text wasFerguson's (1959) Statistical Analysis in Psychology and Education, and in the United Kingdom, Yule'sAn Introduction to the Theory of Statistics,first publishedin 1911,had a14th edition published (with Kendall) in 1950 that includes ANOVA. EXPECTED MEAN SQUARES In his well-known book,Scheffe (1959) says: The origin of the random effects models, like that of the fixed effects models, lies in astronomical problems; statisticians re-invented random-effects models long after they were introducedby astronomersand then developedmore complicatedones, (p. 221)

Scheffe citeshis own 1956(a) paper,which gives some historical background to the notion of expected mean squares - E(MS). The estimationof variance components using E(MS) was taken up by Daniels (1939)and later by Crump (1946, 1951). Eisenhart's (1947) influential paperon theassumptions underlying ANOVA models alsodiscusses them in this context. Andersonand Bancroft (1952) publishedone of theearliest texts that examines mathematical expectation - theexpected valueof a random variable"over thelong run"- andearly

214

14. TREATMENTS AND EFFECTS: THE RISE OF ANOVA

in this work theystatethe rules usedto operate with expected values. The concept is easily understoodin the context of a lottery. Supposeyou buy a $1 .00ticket for a draw in which 500 ticketsare to besold. The first and only prize is $400.00.Your "chances"of winning are 1 in 500 and of losing 499 in 500.If Y is therandom variable, When Y = -$1.00,the probabilityof Y, i.e.,p(Y) = 499/500. When Y= $399.00 (you won!),the probabilityp(Y)= 1/500. The expectationof gain overthe long run is: E(Y) = ZYp(Y) = (-1)(499/500)+ (399)(1/500)= -0.0998+ 0.798= - 0.20 If you joined in the draw repeatedlyyou would "in the long run" lose 20 cents. In ANOVA we are concernedwith the expected valuesof the various components- the mean squares - over thelong run. Although largely ignoredby the elementary texts,the slightly more advanced approachesdo treat their discussions and explanationsof ANOVA' s model from the standpointof E(MS). Cornfieldand Tukey (1956)examinethe statistical basisof E(MS), but Gaito (1960) claims that"there has been no systematic presentation of this approachin a psychological textor journal which will reach most psychologists" (p. 3), and hispaper aimsto rectify this. The approach, which Gaito rightly refers to as"illuminating," offers some mathematical insight intothe basisof ANOVA and frees researchersfrom the "recipe book" and intuitive methods that are often used. Eisenhart makes the point that these methods have been worthwhile and that readersof the books for the "non-mathematical" researcher have achieved sound and quite complex analyses inusing them. But these early commentators saw theneedto go further. Gaito'sshort paperoffers a clear statement of thegeneralE(MS) model,and his contention that this approach brings psychology a tool for tackling many statistical problemshas been borneout by thefact that now all thewell-known intermediate texts that deal with ANOVA modelsuse it. This is not theplaceto detail the algebraof E(MS) but here is the statement of the generalcasefor a two- factor experiment:

EXPECTED MEAN SQUARES

215

In thefixed effects model - often referred to asModel I - inferences can be made only aboutthe treatments that have been applied. In the random effects model the researchermay make inferencesnot only about the treatments actually appliedbut about the range of possible levels.The treatmentsapplied are arandom sampleof the range of possible treatments that could have been applied. Then inferences may bemade aboutthe effects of all levels from the sampleof factor levels. This model is often referredto asModel II. In the case where thereare two factors A and B thereare A potential levelsof A but only a levels are included in the experiment. Whena = A then factorA is a fixed factor. If the a levels includedare arandom sample of the potentialA levels then factor A is a random factor. Similarlyfor factor B there are Bpotential levels of the factor and blevels are used in the experiment. Whena = A then ^ = 1 and when B is a random effect,the levels sample size b is usually very, very much smaller thanB and | becomesvanishinglysmall, asdoes ^ where n is sample sizeand N ispopulation size. Obtaining the variance components allows us to seewhich componentsare included in the model, be the factor fixed or random, and to generate the appropriateF ratios to test the effects. For fixed effects: E(MSA ) = E(MSAB) = cre2 + noap2, and theerror termis oe2. 2 2 2 For random effects: E(MS A) =
Thereis little doubt that developmentof this approachhasenabled ANOVAto be more closely appreciated by its practitioners. Finally, it is only in recent years that there has been a full realization that mathematically the ANOVA approach and the regression approachcan be brought together. Cronbach's delineation of the "two disciplines" (1957,and seechapter2), the neglect by the textbook writersof any meaningful account of the general linear model, and theavoidanceby the introductory text authors of a discussionof expected mean squares, have all contributedto a misunderstandingof the unifying basisof statistical methods.

15 The Statistical Hotpot TIMES OF CHANGE The later yearsof the 1920s were watershed years for statistics. Karl Pearson was approachingthe end of hiscareer(he retiredin 1933),andsomeof the older statisticians werenot able to cope with the neworder. The divisions had been, and were still being, drawn.Yule retiredfrom full-time teachingat Cambridge in 1930, and, writingto Kendallafter K.P.'s deathin 1936 said,"I feel asthough the Karlovingianera hascometo anend,and thePiscatorialerawhich succeeds it is one inwhich I can play no part" (quotedby Kendall, 1952,p. 157). Yule's text had not bythen tackledthe problemsof small samples. The t testswere not discusseduntil the 11thedition of 1937, a revision that was,in fact, undertakenby Maurice Kendall. The changes that were taking place in statistical methodology led to positions being adopted that were often based more on personalityand style and loyalties than rational argument on the basic logic and utility of the approaches. Yates, who succeeded Fisher at Rothamsted in 1933, was to write, in 1951, in a commentaryon Statistical Methodsfor Research Workers'. Becauseof the importance that correlation analysis had assumedit was natural that the analysisof variance shouldbe approachedvia correlation,but tothosenot trained in the school of correlational analysis(of which I am fortunate to be able to count myself one) this undoubtedly makes this part of the book moredifficult to comprehend. (Yates, 1951,p. 24, emphasis added)

And Yates was asolid supporterof Fisherto the very end. Fisher publishedhis text in 1925, and in thesame yearhis paper on the applications of "Student's"distribution appeared. Egon Pearson and Jerzy Neymanmet that year. They were to introducenew features into statistics that Fisher would vehemently oppose to the end of hislife. 216

NEYMAN AND PEARSON

217

NEYMAN AND PEARSON It is a curious fact that what most social scientists now take to be inferential statistics is a mixture of procedures that,as presentedin many of the current 1 statistical cookbooks, would be criticized by its innovators, RonaldA. Fisher, Jerzy Neyman,andEgon Pearson. Fisher established the present-day paramount importanceof the rejection or acceptanceof the null hypothesisin the determination of a decision on astatistical outcome- thehypothesis thatthe outcome is due tochance.The newparticipants- they couldnot bedescribedaspartners - in theunion that became statistics argued for anappreciationof the probability of an alternative hypothesis. In 1925 Jerzy Neyman, on a Polish governmentfellowship and with a new PhD from Warsaw, arrivedat University College, London,to study statistics with Karl Pearson. Neyman was born in Russiain 1894and read mathematics at theUniversityof Kharkov. In 1921,he movedto the newRepublicof Poland, where he becamean assistantat theUniversity of Warsawandwhere he worked for the State Meteorological Institute. His initial months in London were difficult, partly becauseof his struggleswith English and partly becauseof misunderstandings with "K.P.," but graduallyhe struck up afriendship that led to aresearch collaboration with Egon Pearson. Neyman's professional progress and hissocial and academic relationships have been recounted in a sympathetic and revealing accountby ConstanceReid (1982). The younger Pearson has been describedby many as asomewhat diffident and introvertedman who wasvery much in the shadowof his father. Reid tells us of his feelings: Pearsonhad decided thatif he wasgoing to be astatisticianhe wasgoing to haveto break with his father's ideasand construct his own statistical philosophy. In retrospect,he describes whathe wantedto do as"bridging the gap"between "Mark I" statistics- a shorthand expression he uses for the statisticsof K.P., whichwas based on large samples obtained from natural populations- and thestatistics of Studentand Fisher, whichhad treatedsmall samples obtainedin controlled experiments- "Mark II statistics." (Reid, 1982,p. 60) Pearson (1966) himself describes "the first steps." In 1924 papersby E. C. Rhodes and by Karl Pearson (1924b) had explored the problem of choosing 1

The phrase statistical cookbook has a apejorative ring, whichis not wholly justified. There are many excellent basic texts available, and therearemany excellent cookerybooks. Both are necessaryto our well-being. The point is that conflicting recipesleadto statistical and gastronomic confusion,and thefact that such conflicts exist has,by and large, been ignored by consumersin the social sciences.

218

15. THE STATISTICAL HOTPOT

between alternative tests for the significanceof differences,tests thathad the same logical validitybut that gavedifferent levels of significance for the outcome: I setabout exploringthe multi-dimensional sample spaceand comparing what came to be termed the rejection regions associated with alternative tests. Could one find some generalprinciple or principles appealingto intuition, which would guideone in choosing betweentests? (E. S.Pearson, 1966, p. 6)

Pearsonhad ponderedthe question: what, exactly, was theinterpretationof "Student's"test? In large samples...the ratio t = (x - \i)^n/s could beregardedas thebest estimate available of the desired ratio(x- |a)Vn/a and, as such, referredto the normal probability scale. If the samplewas . . .small ... a samplewith a less divergent mean jci,might well provide a larger value of t than a secondsample witha more divergent mean,3ci, simply becauses \ in the firstsample happened through sampling fluctuations to besmaller than52 in the second. To someone brought up with the older point of view this seemedat first sight paradoxical... I realize that a reorientationof outlook mustfor me at anyrate have been necessary.It was ashift which I think K. P. was notable or never saw theneed to make. (E. S.Pearson,1966, p. 6)

In 1926 Pearson, put theproblemto "Student,"and thelatter's reply shows, once more,the fertility of his ideas.Justasthey hadaided Fisher's inspirations, his commentsnow set theyounger Pearson and later Neymanon thepath that led to theNeyman-PearsonTheory. After noting what, of course,waswidely accepted, that with large samples one isableto find the chance thata givenvalue for the mean of the sample liesat any given distancefrom the mean of the population,andthat, evenif the chanceis very small, thereis no proo/thatthe samplehas notbeen randomly drawn, he says: What it doesis to show thatif there is anyalternative hypothesis which will explain the occurrenceof the sample witha more reasonable probability, say 0.05 (suchas that it belongsto adifferent populationor that the sample wasn't random or whatever will do thetrick) you will be very much moreinclined to consider thatthe original hypothesisis not true, (quotedby Pearson,1966, p. 7, emphasis added)

Pearson recalls that during the autumnof 1926,the problemsof the specification of the classof alternative hypotheses and their definition, the rejection region in the sample space, and the twosourcesof error were discussed.At the end of theyear, Pearson was examiningthe likelihood ratio criterionas a way of approachingthe questionas towhetherthe alternative,or what Fisher later

NEYMAN AND PEARSON

219

called the null, hypothesiswas themore likely. Pearson always recognized that he was aweak mathematicianand that he needed help withthe mathematical formulation of the newideas. He recalledto Reid (1982) thathe approached Neyman, the newpost-doctoral student at Gower Street,perhapsbecausehe was "so 'fresh' to statistics"(p. 62) andbecause other possible collaborators would all have preferences for either MarkI or Mark II statistics. Neymanwas not a memberof eitherof the camps. He really knew very littleof the statistical work of the elder Pearson,nor that of "Student"or Fisher. But an immediate and close collaborationwas notpossible. Although Egon Pearson andNeyman had a good deal of social contact overthe summer of 1926 and Pearson remembered that they had touchedon the newquestions, Neyman was disenchanted withthe Biometric Laboratory.It was not thefrontier of mathematical statistics thathe hadexpected,and hedeterminedto go toParis (wherehis wife was an artstudent)and presson with his work in probabilityand mathematics. At the end of1926, correspondence began between Neyman and Pearson, and Pearson visitedhis colleaguein Paris in the spring of 1927. Reid (1982) reportsthat shegave copiesof Neyman's lettersto Pearsonto Erich Lehmann, an early studentof Neyman'sat Berkeley and anauthority on the NeymanPearson theory (Lehmann, 1959), for his comment. Lehmann's conclusion was that, at anyrate until early in 1927, Neyman "obviously didn't understand what Pearsonwas talking about"(p. 73). The first joint paper, in two parts, was published in Biometrika in 1928. Neymanhadreturnedto Polandfor theacademic year1927-1928,teaching both at Warsaw and atKrakow, and heclearly felt that he had notplayed as big a role in the first paper as hadPearson.The paper (PartI) ends with Neyman's disclaimer: N.B. I feel it necessaryto make a brief commenton theauthorshipof this paper. Its origin was amatterof close co-operation, both personal and byletter, and theground covered includedthe general ideasand theillustration of theseby samplingfrom a normal population.A part of theresults reached in commonare includedin Chapters 1, II and IV. Later I was much occupiedwith other work, and therefore unableto co-operate.The experimental work,the calculationof tables and thedevelopment of the theory of ChaptersIII and IV are dueentirely to Dr Egon S. Pearson,(p. 240)

It might be, too, that at this time Neyman wanted to distancehimselfjust a little from the maximum likelihoodcriterion, which was themain theoretical underpinningof the ideas developedin the paper. In the early correspondence with Pearson, Neyman referred to it as"your principle"and was notconvinced that it was theonly possible approach. The 1928 (PartI) paper beginsby statingthe important problemof statistical inference, that of determining whetheror not a particular sample(E) has

220

15. THE STATISTICAL HOTPOT

been randomly drawn from a population (n). HypothesisA isthat indeedit has. But I mayhave been drawn from some otherpopulationn', andthus,two sorts of error may arise. The first iswhenA is rejectedbut I wasdrawnfrom K, and the secondis whenJis acceptedbut Z hasbeen drawnfrom n': In the long run of statistical experience the frequency of the first sourceof error (or in a single instanceits probability)can becontrolledby choosingas adiscriminating contour, one outside whichthe frequency of occurrenceof samplesfrom TC is very small - say, 5 in 100 or 5 in1000. The second sourceof error is more difficult to control. ... It is not of course possibleto determinen',... b u t . .. we maydeterminea "probable" or "likely" form of it, and hence fix the contours so that in moving "inwards" acrossthem the difference between n and thepopulationfrom which it is "most likely" that S has been sampled should become less and less. This choice also implies that on moving "outwards" acrossthe contours, other hypotheses as to thepopulation sampled becomemore and more likely than HypothesisA. (Neyman & Pearson,1928, p. 177)

Hypothesis^correspondsto S having been drawn from n, andHypothesis A' to I having been drawn from n'. A ratio of the probabilitiesof A and A' is a measurefor their comparison. But: Probability is a ratio of frequenciesand this relative measure cannot be termed the ratio of theprobabilitiesof the hypotheses, unless we speakof probability a posteriori and postulate somea priori frequency distribution of sampled populations. Fisher hastherefore introduced the term likelihood,and calls this comparative measure the ratio of the likelihoodsof the twohypotheses. (Neyman & Pearson, 1928,p. 186)

The likelihood criterionis given by:

This ratio definessurfacesin a probability spacesuchthat it decreasesfrom 1 to 0 as aspecific point moves outwardand alternativesto the statistical hypothesis become more likely: One hadthen to decideat which contourH(| shouldbe regardedas nolonger tenable, that is whereshould one chooseto boundthe rejection region?To help in reaching this decision it appeared thatthe probabilityof falling into the region chosen,if H were true,was onenecessary piece of information. In taking this viewit can of course be argued thatour outlook was conditionedby current statistical practice,(E. S. Pearson, 1966,p. 10)

NEYMAN AND PEARSON

221

The paper doesconsider other approaches andnotes thatthe authorsdo not claim that the principle advancedis necessarilythe best to adopt, but it is clear that it is favored: We haveendeavouredto connectin a logical sequenceseveralof the most simple tests,and in sodoing havefound it essentialto make use of what R. A. Fisher has termed"the principle of likelihood." The processof reasoning,however, is necessarilyan individual matter, and we do not claim that the method whichhas been mosthelpful to ourselves willbe of greatestassistance to others. It would seemto be acasewhere each individual must reasonout for himself his own philosophy. (Neyman& Pearson,1928, p. 230)

There are faint echoes hereof the subjective elementin Neyman's initial reasoning, commented on byLehmannandreportedby Reid (1982)- thenotion that belief in a prior hypothesis affects the considerationof the evidence: There is a subjective elementin the theory whichexpressesitself in the choice of significance levelyou aregoing to require;but it is qualitative rather than quantitative.2 While it is not very satisfyingand rather pragmatic,I think this reflectsthe way our minds work better than the more extreme positions of either denyingany subjective element in statistics or insisting upon its complete quantification. (Lehmann, quotedby Reid, 1982,p. 73)

Neyman, in Poland in 1929, wantedto presenta joint paperat the International Statistical Institute meeting, scheduled to be held there that year.The paper that Neymanwas preparing dealt withthe Bayesian approachto the problem thathe andPearsonhad consideredand was toattemptto show thatit led to essentiallythe same solution. Egon Pearson just could not agreeto a collaboration that admitted, in the slightest,the notion of inverse probability: Pearson... pointed out to Neyman thatif they publishedthe proposedpaper,with its admissionof inverse probability, they would find themselvesin a disagreement with Fisher. . . . Many years later he explained, "The conflict between K.P. and R.A.F. left me with a curious emotional antagonism and alsofear of the latter so that it upsetme a bit to see him or to hear him talk." (Reid, 1982,p. 84)

He would not put hisname to the paper. The curious factis that neither Neymannor Pearson ever wholeheartedly subscribed to theinverse probability approach, but Neyman felt that it should be addressed. Pearson's attempt to avoid a confrontationwasdoomedto failure, just as hisfather's earlier attempts Cowlesand Davis (1982b)carriedout asimple experiment that supports the suggestion thatthe .05 level of significanceis subjectively reasonable. Sandy Lovie pointed out to theauthor thatthe same experiment,in a slightly different context,was performedby Bilodeau (1952).

222

15. THE STATISTICAL HOTPOT

to deflect Fisher'sviews had been. The years that spanned the turn of the 1920sto the 1930s werenot particularly happy onesfor the collaborators. Pearson was enduringthe agoniesof an unhappy romance and finding his relationship with his father increasingly frustrating. Neyman found his work greatly constrainedby economic and political difficulties in Poland, and the elder Pearson rejected a paper he submittedto Biometrika. But, albeitin anintermittentfashion,the collaboration continuedasthey worked toward what they described astheir "big paper." This was communicatedto the Royal Society by Karl Pearsonin Augustof 1932, readin Novemberof that year,and publishedin Philosophical Transactions in 1933. Neymanhad written to Fisher, with whom he was then on reasonably amicable terms, about the paper, and thelatter had indicated that were it to be sentto the Royal Society,he would likely be areferee: To Neyman it has always beena sourceof satisfactionand amusement thathis and Egon'sfundamentalpaperwaspresentedto the Royal Societyby Karl Pearson,who was hostile and skeptical of its contents,and favourably refereedby the formidable Fisher, who waslater to be highly critical of much of the Neyman-Pearson theory. (Reid, 1982,p. 103)

Reid reportsthat when she wrote to the librarian of the Royal Society to discoverthe nameof the second referee, shefound that therehad only beenone referee, and that he was A. C.Aitken of Edinburgh(a leading innovatorin the field of matrix algebra). In fact, two papers were published in 1933. The Royal Society paper dealt with proceduresfor the determinationof the most efficient tests of statistical hypotheses. There is no doubt thatit is one of themost influential statistical papers ever written.It transformedthe way inwhich boththe reasoning behind, and the actual applicationof, statistical tests were perceived. Forty years later, Le Cam andLehmann enthusiastically assessed it: The impact of this work has been enormous. It is, for example,hard to imagine hypothesistestingwithout the concept of power.... However, the influenceof the work goesfar beyond. ... By deriving tests as thesolutions of clearly defined optimum problems,Neyman and Pearson established a pattern for Wald's general decisiontheoryand for thewhole field of mathematicalstatisticsas it hasdeveloped since then. (Le Cam & Lehmann, 1974,p. viii)

The later paper (Neyman & Pearson, 1933), presented to the Cambridge Philosophical Society, again setsout theNeyman-Pearson rationale andproceduresfor hypothesis testing. Its statedaim is toseparate hypothesis testing from problems in estimation and toexaminethe employmentof tests independently of a priori probability laws.The authors have rejected the Bayesian approach

NEYMAN AND PEARSON

223

and employed the frequentist's viewof probability. A concept of central importance is that of thepower of a statisticaltest, and theterm is introduced here for the first time. A Type I error is that of rejectinga statistical hypothesis, H0 , when it is true. The Type II error occurs whenHQ is not rejected but some rival, alternative hypothesis is, in fact, true: If now we have chosena region w, in thesample spaceW, ascritical region to test //o, thenthe probability thatthe sample pointI defined by the set ofvariates[x , ;c2 . . . , *n] falls into w, if //Q is true, may bewritten as:

The chanceof rejectingHQ if it is true is therefore equalto 8, and w may be termed r of size e for Ho . Thesecond j type of error will be made when some alternative H.i is true, and Z falls in w= W-w. If we denoteby P (w),/Jn (w) andP(w) the chance of an error of the first kind, the chanceof an error of the secondkind and thetotal chanceof error usingw ascritical region, thenit follows that:

(Neyman& Pearson, 1933,p. 495)

The rule thenis to set thechanceof the first kind of error (the sizeor what we would now call the a level) at asmall valueandthen choosea rejection class so that the chanceof thesecond kindof error is minimized.In fact, the procedure attempts to maximizethepower of the test for a given size. The probability of rejecting the statistical hypothesis, ()H, when in fact the true hypothesisis H., that is P (w |H.), is called the power of the critical region w with respectto H.. If we nowconsiderthe probabilityP\\ (w) of type II errors when using a testT based on the critical regionw, we maydescribe

as theresultant powerof the test T. . .. It isseen that while the power of a test with regard to a given alternativeH\ is independentof the probabilitiesa priori, and is therefore known precisely assoon as H\ and w are specified, thisis not thecasewith the resultant power,which is a function of the (pi's. (Neyman& Pearson, 1933b, p. 499)

224

15. THE STATISTICAL HOTPOT

Note here thatthe cp'sare the probabilitiesa priori of the admissible alternative hypotheses. Neyman and Pearsonof coursefully recognized thatthe (p's cannot oftenbe expressedin numericalform and that the statisticianhas to considerthe sense,from a practical pointof view, in which testsare independent of probabilitiesa priori, noting: This aspectof the error problem is very evidentin a number of fields where tests must be usedin a routine manner,and errorsof judgment leadto wasteof energyor financial loss. Such is the casein sampling inspection problems in mass-production industry. (Neyman& Pearson,1933, p. 493)

This apparentlyinnocuous statement alludes to thefeaturesof the NeymanPearson theoryof hypothesis testing that, over the courseof the next few years, Fisher vigorouslyattacked,that is, (a) thenotion that hypothesistestingcan be regarded as adecision processakin to methods usedin quality control, and reducedto a set ofpracticalrules,and (b) theimplication that"repeatedsampling from thesame population"- which occursin industry whensimilar samplesare drawn from the same productionrun - determinesthe level of significance. These suggestions were anathema to Fisher. They were, however,the very features that made statistics so welcomein psychologyand thesocial sciences - that is, thepromiseof a set ofruleswith apparently respectable mathematical foundations that would allow decisions to be made on the meaning in noisy quantitative data.Few ventured intoan examinationof the logic of the methods; very few would wish to be trampledas thegiants fought in the field. These mattersare examined laterin this chapter. Perhapswe should sparea thoughtfor Sir Ronald Fisher, curmudgeon that he was. He must indeedbe constantly tossingin his grave as lecturers and professors across the world, if they rememberhim at all, referto the content of most current curricula asFisherian statistics. STATISTICS AND INVECTIVE The full force of Fisher's oppositionto the "Neyman-Pearsontheory was not immediately felt in 1933. The circumstancesof his developingfury are, however, evident.In a decision that could hardly have been less conducive to harmony in the developmentof statistics,the University of London split Karl Pearson'sdepartment whenhe retired in 1933: Statisticswas taught at no other British university,nor wasthere another professor of eugenics,chargedwith the duty and themeansof research into human heredity. If Fisherwere to teachat auniversity, it would haveto be asPearson'ssuccessor. (Fisher Box, 1978,p. 257)

STATISTICS AND INVECTIVE

225

Fisher became Galton Professor in the Departmentof Eugenics,and Egon Pearson, promotedto Reader, headedthe Departmentof Applied Statistics. There, in Gower Street, Fisher's department on the topfloor andEgon Pearson's on the floor below, the rows beganto simmer. FisherBox (1978) reports that her fatherhadascertained, before accepting the appointment, that Egon Pearson was well disposedto it. Fisher corresponded with him, apparently in the hope that they might resolve conflicts in the teachingof statistics. FisherBox (1978) learnedfrom correspondencewith Egon Pearson that Fisher hadeven suggested that, despitethe decisionto split the old department,he andPearson should reunite it. Pearson's response was to theeffect that no lecturesin the theory of statistics shouldbe given by Fisher, thatthe territories had been defined,and that each should stay within them. Early in 1934 Pearsoninvited Jerzy Neymanto join his department, temporarily, as anassistant. Neyman accepted and, in the summerof 1934, received an appointmentas lecturer. The moodsat Gower Street must have been impossible, and theintensityof the strain is difficult to imagineand reconstruct: [Karl] Pearsonwas madean honorary memberof the TeaClub, and when he joined them in the Common Room,it was observed that Fisher did him theunique honour of breaking out of conversationto step forwardandgreethim cordially. (Fisher Box, 1978, p. 260)

Or, The Common Roomwas carefully shared.Pearson'sgroup had tea at 4; and at 4:30, when theywere safely out of theway, Fisherand hisgroup troopedin. Karl Pearson had withdrawn acrossthe college quadrangle withhis young assistant Florence David. He continued to edit Biometrika; but,as far asMiss David remembers,he never againenteredhis old building. (Reid, 1982,p. 114)

It appears that, at first, Neymanand Fishergot onquite well andthat Neyman tried to bring Fisherand Egon Pearson together. His work on estimation, rather than the joint work with Pearsonon hypothesis testing, occupied his attention. His 1934 paper (discussed earlier) was, in the main, well-receivedby Fisher, and Fisher's December 1934 paper, presented to the Royal Statistical Society (Fisher, 1935),was commentedon favorably by Neyman. But any harmony disappeared at meetingsof the Industrial andAgricultural Section of the Royal Statistical Societyin 1935. Neyman presented his "Statistical Problemsin Agricultural Experimentation"in which he questionedthe efficiency of Fisher's handlingof Randomized Block and Latin Square designs, illustrating his talk with wooden models thathad been preparedfor him

226

15. THE STATISTICAL HOTPOT

at University College.3 Fisher's responseto the paper (for whichhe was supposedto give the vote of thanks) beganby expressingthe view, to put it bluntly (which he did), that he hadhoped that Neyman would be speakingon something thathe knew something about.His closing comments disparaged the Neyman-Pearson approach. Frank Yates (1935) presented a paper on factorial designsto the Royal StatisticalSocietylater that year. Neyman expresseddoubts aboutthe interpretation of interactions and main effects whenthe numberof replicationswas small. The problem hasbeen examined more recently by Traxler (1976). The details of thesecriticisms, and they had validity, did not concern Fisher: [Neyman had] assertedthat Fisher waswrong. This was anunforgivable offenseFisher was never wrongand indeedthe suggestion thathe might be wastreatedby him as adeadly assault.Anyone who did notacceptFisher's writingas theGod-given truth was atbeststupid and atworst evil. (Kempthorne, 1983,p. 483)

OscarKempthorne, quoted here, and undoubtedlyan admirer of Fisher's work, hadwhat he hasdescribedas a"partial relationship" with Fisher when he worked at Rothamstedfrom 1941 to 1946and knew Fisher'sfiery intransigence. It is understandable that Fisher Box (1978) presentsa view of these troubled times thatis more sympatheticto Fisher, couchingher commentaryin terms that do reflect an important aspectof the situation. Statisticswas becoming more mathematical,and theshift of its intellectual power baseto the United Stateswas tomake it even more mathematical. Fisher Box (1978)comments, "Fisher was aresearch scientist using mathematicalskills, Neymana mathematician applying mathematical concepts to experimentation"(p. 265). She quotes Neyman's reply to Fisher,in which he explainshis concerns about inconsistencies in the applicationof the ztest. The test is deducedfrom the sumsof two independent squares but therestricted samplingof randomized blocks and Latin squares, leads to the mutual dependenceof results: Mathematicianstendedto formulatethe argumentin these terms, that is in termsof normal theory, ignoring randomization. Nevertheless, in doing so they exhibiteda fundamental misunderstanding of Fisher'swork, for it happensto be false that the derivation of the z distribution dependson the assumptions Neyman criticized. (Fisher Box, 1978,p. 266)

Fisher Box defends this view,but it was not aview that convinced many mathematiciansand statisticians outside the Fisherian camp. 3 Reid (1982) relatesa story told by both Neymanand Pearson.One evening they returned to the departmentafter dinnerand found the models strewn about the floor. They suspected that the angryact was Fisher's.

STATISTICS AND INVECTIVE

227

In 1935, Egon Pearson was promotedto Professorand Neyman appointed Reader in statistics at University College. Fisher opposed Neyman's appointment. Neyman recalledto Reid (1982) thatat about this time Fisherhad demanded that Neyman should lecture using only Statistical Methods for Research Workers andthat when Neyman refused said,"'Well, if so, then from now on I shall opposeyou in all mycapacities.'And heenumerated- member of the Royal Societyand soforth. There were quite a few. Thenhe left. Banged the door" (Reid, 1982,p. 126). Fisher withdrew his support for Neyman's electionto the International Statistical Institute.It was this sort of intense personal animosity that led to confusion,indeeddespair,asonlookers attempted to graspthe developmentsin statistics. The protagonists introduced ambivalence and contradictionas they defendedtheir positions. Debateand discussion,in any rational sense, never took place. As if this was not enough, Neymanand Egon Pearson were beginningto draw apart. Neyman certainly did notfeel committedto University College. Karl Pearson died in April 1936, and very shortly thereafter Egon Pearson began work on a survey of his father's contribution(E. S. Pearson, 1938a). Neymanwas working on the developmentof a theory of estimation using confidence intervals.In a move that seems utterly astonishing, Pearson, who had inherited the editorship of Biometrika, after a certain amountof equivocation, rejectedthe resulting paper. Pearson thought that the paper was too long and toomathematical. It subsequently appeared in the Philosophical Transactionsof the Royal Society (Neyman, 1937). Joint work between the two friends had almost ceased. Neyman visited the United Statesin 1937 and made a very good impression.His visit to Berkeley resultedin his being offereda post therein 1938, an offer that he accepted. Neyman's moveto the University of California, the growth in the development and applicationsof statisticsat the State Agricultural Collegeat Ames, Iowa, and theimportantinfluence of the University of Michigan, where Annals of Mathematical Statistics,edited by Harry C. Carver, had been foundedin 1930, movedthe vanguardof mathematical statistics to America, whereit has remained.The tribulationsof actualand devastating warfare were on Britain's horizon. Therewas no oneleft on thestatistical battlefieldwho wantedto fight with Fisher. He wasdislikedby manybut hisclose collaborators,he wasoften avoided, and he wasmisunderstood. Fisher enjoyeda considerable reputation as ageneticist, carryingout experimentsand publishing work thathad considerable impact.His study of natural selection from a mathematical pointof view led to a reconciliation of the Mendelian and thebiometric approaches to evolution. In 1943 he acceptedthe post of Arthur Balfour Professorof Genetics at Cambridge University,an appointmentfor which Egon Pearson must have given thanks. Not that Pearson

228

15. THE STATISTICAL HOTPOT

suffered too much from the lash of Fisher's tongue,for Fisher regardedhim as a lightweight and, whenever he could, coldly ignored him. Fisherwas alwayswilling to promotethe applicationof his work to experimentationin a wide varietyof fields. He wasunableto acceptany criticism of his view of its mathematical foundations. Perhaps the most unfortunateepisode took place at the end ofKarl Pearson'slife. Taking as his reason an attack Pearson(1936) had made, in a work published very shortly after his death,on a paper writtenby an Indian statisticianR. S. Koshal (1933), Fisher (1937) undertook "to examinefrankly the statusof the Pearsonianmethods"(p. 303). Nearly 20 years later,in an author's note accompanying a republicationof "Professor Karl Pearson and theMethod of Moments," Fisher reallyexceeds the boundsof academic propriety: If peevish intoleranceof free opinion in others is a sign of senility, it is one which he haddevelopedat anearly age. Unscrupulous manipulation of factual materialis also a striking featureof the whole corpusof Pearsonian writings,and inthis matter someblamedoesseemto attachto Pearson'scontemporariesfor not exposinghis arrogantpretensions.(Fisher, 1950,p. 29.302ain Bennett, 1971)

Of course Karl Pearson was guilty of the same sortof invective. Personal criticism in the defenseof their mathematicalandstatistical stances is a feature of the style of both men. No academic writer expects to escape criticismindeed, lackof criticism generally indicates lack of interest- butthesesort of polemicscan only damagethe discipline. Ordinary,and even some extraordinary, men andwomenrun for cover. Theseconflicts may well be responsible for the rather uncritical acceptance of the statisticaltools thatwe usetoday, a point that is discussedfurther. FISHER versusNEYMAN AND PEARSON In an author's note preceding a republicationof a 1939 paper, Fisher reiterates his oppositionto the Neyman-Pearson approach, referring specifically to the Cambridge paper: The principles broughtto light [in the following paper] seemto theauthor essential to thetheory of testsof significancein general,and tohave been most unwarrantably ignored in at least onepretentious workon "Testingstatistical hypotheses."Practical experimentershave not beenseriously influenced by this work, but in mathematical departments,at atime when thesewere beginningto appreciatethe part they might play as guidesin the theoreticalaspectsof experimentation, its influence has been somewhatretrograde.(Fisher, 1950, p. 35.173ain Bennett,1971).

Fisherset out hisobjectionsto NeymanandPearson's views in 1955. Statistical

FISHER versus NEYMAN AND PEARSON

229

Methods and Scientific Induction setsout to examinethe differencesin logic in the two approaches.He acknowledges that Barnard had observed that: Neyman, thinking thathe wascorrectingand improvingmy own early work on tests of significance, as a meansto the "improvement of natural knowledge,"in fact reinterpreted themin termsof that technologicalandcommercial apparatus which is known as anacceptanceprocedure. (Fisher, 1955, p. 69)

Fisher acknowledges the importanceof acceptance procedures in industrial settings, noting that whenever he travels by air he gives thanksfor their reliability. He objects, however,to thetranslationof this modelto thephysical and biological sciences. Whereas in a factory the population thatis the product has anobjective reality, suchis not thecasefor the populationfrom which the psychologist'sor the biologist's samplehas been drawn.In the latter case,he argues thereis: a multiplicity of populationsto eachof which we canlegitimately regardour sample asbelonging;so that the phrase "repeated sampling from the same population" does not enableus to determine which population is to beusedto definethe probability level, for no one ofthem hasobjective reality,all being productsof the statistician's imagination. (Fisher, 1955, p. 71)

Fisher maintains that significance testing in experimental science depends only on thepropertiesof the uniquesample thathasbeen observed andthat this sample shouldbe compared only with other possibilities, that is to say, "to a population of samplesin all relevant respectslike that observed, neither more precisenor less precise,and which thereforewe think it appropriateto selectin specifying the precisionof the estimate" (Fisher, 1955, p. 72). Fisher was never ableto come to terms with the critical contribution of Neyman and Pearson, namely, the notion of alternative hypotheses and errors of the second kind. Once again his objectionsare rooted in what he considersto be theadoption of the misleading quality control model.He says, "The phrase 'Errors of the second kind,' although apparently only a harmless pieceof technicaljargon, is useful as indicatingthe type of mental confusionin which it was coined"(Fisher, 1955,p. 73). Fisheragreesthat the frequencyof wrongly rejectingthe null hypothesiscan be controlled but disagrees thatany specificationof the rate of errors of the second kindis possible. Nor is sucha specification necessary or helpful. The crux of his argument restson his objection to using hypothesis testingas a decision process: The fashion of speakingof a null hypothesisas "accepted when false," whenever a test of significancegives us nostrong reasonfor rejecting it, and when in fact it is

230

15. THE STATISTICAL HOTPOT

in someway imperfect, showsreal ignoranceof the researchworker'sattitude,by suggestingthat in such a casehe hascome to an irreversible decision. (Fisher, 1955, p. 73)

What really should happen when the null hypothesisis accepted?The researcher concludes that the deviationfrom truth of the working hypothesisis not sufficient to warrant modification.Or, saysFisher, perhaps, that the deviation being in the expected direction,to an extent confirms the researcher's suspicion,but thedata availableare notsufficient to demonstrateits reality. The implication is clear. Experimental science is anongoing processof evaluation and re-evaluationof evidence. Every conclusionis a provisional conclusion: Acceptanceis irreversible,whether the evidencefor it was strongor weak. It is the result of applying mechanically rules laid down in advance;no thought is given to the particular case,and thetester'sstate of mind, or his capacity for learning, is inoperative. (Fisher, 1955, pp. 73-74)

Finally, Fisher launchesan attack, from the same basic position, on Neyman's use of theterm inductive behaviorto replace the phrase inductive reasoning. It is clear that Neymanwas looking for a statistical system that would provide rules.The 1933 paper shows that Neyman andPearson believed that they wereon thesame trackasFisher: In dealing withthe problem of statistical estimation,R. A. Fisherhas shown how, under certain conditions, what may be describedas rules of behaviour can be employed which willleadto results independent of those probabilities [here Neyman is referring to probabilitiesa priori]; in this connectionhe hasdiscussedthe important conceptionof what he terms fiducial limits. (Neyman & Pearson,1933, p. 492, emphasisadded)

It appearsthat the usersof statistical methods have implicitly acceptedthe notion thatthe Neyman-Pearson approach was anatural, almostan inevitable, progressionfrom the work of Fisher.It is, however,little wonder, even leaving the personalitiesandpolemics aside, that Fisher objected to the mechanization of the scientific endeavor.It wasjust not the way helookedat science.The 1955 paper almost goes out of its way toattack Abraham Wald's (1950) book on StatisticalDecision Functions, objecting specifically to its characterizationas a book about experimental design. Neyman (1956) rebutted the criticisms of Fisher and effectively restateshis position,but it is apparent thatthe two are really arguing at cross-purposes. Responding to Neyman's rebuttal, Fisher (1957) begins,"If Professor Neyman were in the habit of learningfrom others he might profit from the quotationhe givesfrom Yates" (Fisher, 1957, p. 179). Of interest hereis Fisher'scontinuingdeterminationto defendthe concept

FISHER versus NEYMAN AND PEARSON

231

of'fiducial probability at all costs. He raisesthe concept againand again in his writings on inferenceand never admits thatit presentsdifficulties. The expressions "fiducial probability" and "fiducial argument"are Fisher's. Nobody knows just what they mean, because Fisher repudiated his most explicit,but definitely faulty, definition and ultimately replacedit with only a few examples. (Savage,1976, p. 466)

The Bayesian argument lends itself to decision analysis. Fisher objected to both. The fiducial argumenthas been characterized as anelliptical attempt to arrive at Bayesian-type posterior probabilities without invoking the Bayesiantype priors. What muddiesthe water was Fisher's insistence that fiducial probabilitiesare verifiable in the same way,for example,as theprobabilitiesin gamesof chanceare verifiable. What makesthe situation even more vexing is that the perceived relationship between Neyman's confidence intervals and Fisher's fiducial theory would makethe latter easierto grasp. But Neyman's theory of estimationby confidence sets grows out of thesame conceptual roots asNeyman-Pearson hypothesis testing. No rational compromisewas possible. The null hypothesis(it has been noted that the term is Fisher'sand that it was notusedby Neymanand Pearson)is, in Fisherianterms,the hypothesis that is tested and it implies a particular value (mostoften zero) for the population parameter. Whena statistical analysis produces a significant outcome, thisis taken to be evidence againstthe null hypothesis, evidence against the stated value for the parameter, evidence against the assertion that nothing has happened, evidence against a conclusion thatthe experimental manipulation has had no effect. The point is that a significant outcome doesnot appearto be evidencefor anything.The Neyman-Pearson position is that hypothesis testing demandsa researchhypothesisfor which we can findsupport. Suppose that the statistical hypothesisstatesthat in the population the correlation is zero, and indeedit is; then an obtained sample value of+0.8, givena reasonablen, would lead to a Type I error. If, on the other hand,the statistical hypothesis statesthat the population valueis +0.8, but again it is really zero, thenan obtained value of 0.8 will lead to a Type II error. It will immediatelybe argued thatno one would set thepopulation parameter under the null hypothesisas +0.8 unless there were evidence that it was of this order. However, this implies that experimenters take into account other evidence than that immediately to hand. No matter how formal the rules, it is plain that in real life the rules do not inevitably guide behaviorand decisions,or conclusions about outcomes. As was noted earlier,my holdinga winning lottery ticket thatwas drawn from a million tickets - anextemely improbable event - does notnecessarily leadto my rejectingthe hypothesisof chance. There are, then, difficult problemsin theassessment of the rulesof statistical

232

15. THE STATISTICAL HOTPOT

procedures.It is small wonder that they are notwidely appreciated because the founding fathers themselves were not clearon theissues,or at anyratein public, and in their writings, they werenot always clearon the issues. Somefew examples willsuffice to illustratethe point. In 1935 Karl Pearson asserted that "tests are used to ascertain whethera reasonable graduation curve has been achieved,not to assert whetherone or another hypothesis is true or false" (K. Pearson, 1935, p. 296). All he is doing here is arguing with Fisher.He himself usedhis own x,2 test for hypothesis testing(e.g.,K. Pearson, 1909). In his famous discussion of the "tea-tasting" investigation, Fisher discusses the "sensitiveness" of an experiment.He notes thatby increasingthe sizeof the experiment: we canrenderit moresensitive, meaningby this that it will allow of the detectionof a lower degreeof sensory discrimination, or, in other words,of a quantitatively smaller departurefrom the null hypothesis. Sincein every casethe experimentis capableof disproving, but never of proving this hypothesis,we may saythat the value of the experimentis increased whenever it permits the null hypothesis to be more readily disproved. The same result couldbe achieved by repeatingthe experiment,as originally designed,upon a number of different occasions.(Fisher, 1935/1966,8th ed., p. 22)

Here Fisher appears to bealluding to what Neymanand Pearson formalized as thepower of the test - a notion thathe never would acceptin their context. He is also usingthe words "proved" and "disproved" rather loosely, and, althoughhe might be forgiven, the slip flies in the faceof the routine statements about the essentially probabilistic nature of statistical outcomes. Elsewhere Fisher presentsthe view that each experiment is, as itwere, self-contained in the context of significance testing: On the whole the ideas (a) that a test of significance must be regarded as one of a seriesof similar testsapplied to a successionof similar bodiesof data,and (b)that the purposeof thetestis to discriminateor "decide"betweentwo ormore hypotheses, havegreatly obscuredtheir understanding, when taken ascontingent possibilitiesbut as elementsessentialto their logic. (Fisher, 1956/1973, 3rd ed., pp. 45-46)

To practicing researchers, the notion that testsare notusedto "decide" is incomprehensible. What elseare they for? Fisher's supporters would say that what he meansis that theyare notusedto decide finally and irreversibly,and scientistswould surelyagree. In his 1955 paper, where he objectsto theNeyman-Pearson specification of the two types of error and gives his views (mentioned earlier)of what the researcherwill conclude whena result fails to reach statistical significance,

PRACTICAL STATISTICS

233

Fishersays: These examples showhow badly the word "error" is used in describing sucha situation. Moreover, it is a fallacy, so well known as to be astandard example,to conclude froma test of significance thatthe null hypothesisis thereby established; at most it may be said to be confirmed or strengthened. (Fisher, 1955, p. 73, emphasisadded)

It might beargued thata careful readingof Fisher'svoluminousworks would make his position clear,and that these quotations are selective.The point is taken, but the point may also be made thatit is hardly surprising thatthe later interpretationsof, and commentarieson, his influential contribution contain difficulties and contradictions.Had theprotagonists been more concerned with rational debaterather than heated argument, statistics would have had aquite different history. PRACTICAL STATISTICS A striking featureof the general statistics texts produced for coursesin the psychological sciences is theanonymityof the prescriptions thatare described. A startlinglyhigh numberof incoming graduate students in the author'sdepartment were unaware thatthe F ratio was namedfor a mancalled Fisher,but a cursory glancethroughthe indexesof introductory texts reveals why. Pearson's nameis sometimes mentioned as aconvenientlabel for a correlation coefficient, distinguishingit from the rank-order methodof Spearman. Neyman andPearson Jr. arehardly ever acknowledged. "Power"is treatedwith caution. Controversial issuesarenever discussed. Gigerenzer (1987) presents an interestingdiscussionof the situation: "The confusion in the statistical texts presented to the poor frustrated students was causedin part by the attemptto sell this hybrid as thesine qua non ofscientific inference"(p. 20). Gigerenzer'sthesis addresseswhat he seesas experimentalpsychology's fight against subjectivity.Probabilisticmodelsand statistical methods provided the discipline with a mechanicalprocessthat seemedto allow for objective assessment independent of the experimenter. Parallel constructs of individual differences as error and uncertainty as ignorancefurther promoted psychology's view of itself as anobjective science. It is Gigerenzer's view that the illusion was moreor less deliberately created by the early textbooks and hasbeen perpetuated. The general neglect of alternative theories and methodsof inference, anonymous presentation, the silenceon thecontroversies,and"institutionalization,"all conspiredto provide psychology with its need for objectivity. Gigerenzer makes tellingand

234

15. THE STATISTICAL HOTPOT

important points. But a more mundane explanation can beadvanced. Surely social scientistscan beforgiven for not being mathematiciansor logicians, and those who took a peek at the literatureof the 1920s and 1930s must have been dismayed at the bitternessof the controversies.At the same time, the methods were being popularized by such peopleasSnedecor- and they seemedto bemethods that worked.It is absolutely clear that psychologists wanted to constructa researchmethodology that wouldbe acceptedby traditional science.The controversies were ignored because experimentalists in the social sciences believed that they were sifting the methodological wheatfrom the polemical chaff. The principalsin the arguments were ignored because of an eagernessto get onwith the practical job,and inthis they were supported by the master,Fisher. Moreover,the textbook writers, interpreting Fisher, can be forgiven for their lack of acknowledgmentof thefounding fathers- they were, after all, following the masterwho rarely gave creditto anybody! What is, perhaps, more contentious is the effect that all this has had on the discipline. If it is to be admitted thatthe logical foundationsof psychology's mostwidespread method of data assessment areshaky, whatare we tomakeof the "findings" of experimental psychology?Is the whole edifice of dataand theory to becompared withthebuildings in thetownsof the"Wild West" - a gaudy falsefront, and little of substance behind? This is an unreasonable conclusion, and it is not aconclusion thatis borne out by even a cursory examination of the successful predictions of behaviorand theconfident applications of psychology, in areasstretching from market researchto clinical practice, that havea utility that is indisputable.The plain fact of the matter is that psychologyis using a set oftools that leaves much to be desired. Some parts of the kit perhaps should be discarded; some of them, like blunt chisels, will let usdown and wemight be injured. But they seemto have been doing a job. Psychologyis a success. Now some wouldqualify that success.An agonizing reappraisalof the discipline by Sigmund Kochfollowed his struggleswith his editorship of Psychology: A Study of a Science (1959-1963). From his statementat the beginning of that enterprise that psychology was a"disorderly matrix" to his plea(1980)that 10yearsafter Miller's urgingin his Presidential Address to the American Psychological Association (1969) that psychology should be "given away" (in the senseof pressingfor more social relevance), Koch thought that it ought to be "taken back." Although the drift of the criticism is inescapable,to throw in the towel is, as itwere, bothunenlighteningandunproductive. There is no doubt thatthe disparate natureof its subject matter,and the sometimes conflicting pronouncements issuing from the research journals,has led to a fragmentationof the discipline, even to denials thatit is, in fact, a coherent discipline. However two themes are apparentand common to all

PRACTICAL STATISTICS

235

branchesof psychology.One is theclassical statistical method used by thevast majority of psychological researchers and theother is its history, and, more particularly, its historic questions, the mind-body dichotomy,the mechanisms of perceptionandcognition,thenature-nurtureissue,the individual and society, the mysteriesof maturationand change,and more. This bookis an attempt to promotean interestin both statisticsand history.

References Acree,M. C. (\91$).Theoriesof statistical inference in psychological research: A historico critical study. Unpublished doctoral dissertation, Clark University, Worcester, MA. Adams, W. J. (1974).Thelife and timesof the central limit theorem.New York: Kaedmon. Airy, G. B. (1861).On thealgebraicaland numerical theoryof errors of observationsand the combinationof observations(2nd ed.). London: Macmillan. Anderson, R. L., & Bancroft,T. A. (1952). Statistical theoryin research. New York: McGraw-Hill. Anonymous(1926). Review of Fisher's Statisticalmethodsfor research-workers.British MedicalJournal, 1, 578-579. Arbuthnot, J. (1692).Of the laws of chance:Or a methodof calculationof the hazardsof gaming. London: Printed by B. Motte and sold by Randall Taylor. Arbuthnott, J. (1710).An argumentfor divine providence, taken from theconstant regularity observ'din thebirths of both sexes.Philosophical Transactions of the Royal Society,27, 186-190. Barnard, G. A. (1958). Thomas Bayes - A biographical note. Biometrika, 45,293-295. Bartlett, M. S.(1965).R. A. Fisher and thelast fifty yearsof statisticalmethodology. Journal of the American Statistical Association, 60, 395-409. Bartlett, M. S. (1966).Reviewof Logic of StatisticalInference, by Ian Hacking. Biometrika, 53,631-633. Baxter, B. (1940). The application of a factorial designto a psychological problem. PsychologicalRevie\v,47,494-500. Baxter, B. (1942).A study of reactiontime using factorial design. Journal of Experimental Psychology,31,430-437 Bayes,T. (1763). An essay towards solvinga problem in the doctrine of chances. Philosophical Transactionsof the Royal Society,53, 370-418.[Reprinted witha biographical noteby G. A. Barnardin Biometrika, 1958,45,293-315.] Beloff, H. (Ed.). (1980).A balance sheet on Burt. Supplementto theBulletin of the British PsychologicalSociety,33. Beloff, J. (1993).Parapsychology:A concise history. London: Athlone Press. Bennett,J. H. (1971).Collectedpapers of R. A. Fisher. Adelaide: Universityof Adelaide. Berkson,J. (1938).Some difficultiesof interpretation encountered in the application of the chi-squaretest.Journal of the American Statistical Association, 33, 526-536. Bernard, C. (1927).An introductionto the study of experimental medicine (H. C. Greene, Trans.).New York: Macmillan. (Original work published 1865) Bernoulli, D. (1966).Part four of the Art of Conjecturing showingthe use andapplication of the preceding treatisein civil, moral and economicaffairs (Bing Sung, Trans.). Cambridge,MA: Harvard University, Departmentof Statistics,Tech. ReportNo. 2. (Original work published 1713) Bilodeau, E. A. (1952). Statistical versus intuitive confidence. American Journal of Psychology,65, 211-271.

236

REFERENCES

237

Boring, E. G.(1920). The logic of thenormallaw of error in mental measurement. American Journal of Psychology,31, 1-33. Boring, E.G. (1950).A history of experimental psychology.New York: AppletonCentury-Crofts. Boring, E. G. (1957). Whenis human behavior predetermined? Scientific Monthly, 84, 189-196. Bowley, A. L. (1902). Elementsof statistics. London:P. S.King. (6th ed. 1937) Bowley, A. L. (1906). Presidential address to the economic science and statistics sectionof the British Association for the Advancementof Science, York. Journalof the Royal Statistical Society,69, 540-558. Bowley, A. L. (1926). Measurement of the precision obtainedin sampling. Bulletinof the International Statistical Institute,22, 1-62. Bowley, A. L. (1928).F. Y.Edge-worth'scontributions to mathematical statistics. London: Royal Statistical Society. Bowley, A. L. (1936).The applicationof samplingto economicandsocial problems. Journal of the AmericanStatistical Association,31, 474-480. Box, G. E. P.(1984).The importanceof practicein the developmentof statistics. Technometrics,26, 1-8. Bravais,A. (1846).Sur lesprobability'sdeserreursde situationd'un point[On theprobability of errors in the position of a point]. Memoiresde I'Academie Royaledes Sciencesde I'lnstitut de France, 9, 255-332. Brennan,J. F. (1994). History and systemsof psychology.4th Ed.EnglewoodCliffs, NJ: Prentice-Hall. Brown, W., & Stephenson,W. (1933).A test of the theory of two factors. British Journalof Psychology,23, 352-370. Buck, P. (1977). Seventeenth-century political arithmetic:Civil strife andvital statistics. Isis, 68, 67-84. Burke, C. J. (1953). A brief noteon one-tailed tests. Psychological Bulletin, 50, 384-387. Burt, C. (1909).Experimental tests of generalintelligence.British Journal of Psychology,3, 94-177. Burt, C. (1939).The factorial analysisof humanability. III. Linesof possible reconcilement. British Journal of Psychology,30, 84-93. Burt, C. (1940). Thefactors of the mind. London:University of London Press. Campbell,D. T., & Stanley,J. C.(1963). Experimental and quasi-experimental designs for research on teaching. In N. L. Gage (Ed.), Handbookof research on teaching, (pp. 171-246.Chicago: Rand McNally. Carrington, W. (1934). The quantitativestudyof trance personalities. 1. Proceedingsof the Societyfor Psychical Research. 42, \73-240. Carrington, W. (1936). The quantitative studyof trance personalities. 2. Proceedingsof the Societyfor Psychical Research. 43, 319-361. Carrington, W. (1937).The quantitativestudy of trance personalities. 3. Proceedingsof the Societyfor Psychical Research. 44, 189-222. Carroll, J. B. (1953). An analytical solution for approximating simple structure in factor analysis. Psychometrika, 18,23-38. Cattell, R. B. (1965a).Thescientific analysisof personality. Baltimore,MD: Penguin Books. Cattell, R. B. (Ed.) (1965b). Handbookof multivariate experimental psychology. Chicago: Rand McNally.

238

REFERENCES

Chang, W.-C. (1973).A history of the chi-squaregoodness-of-fit test. Unpublished doctoral dissertation,University of Toronto,Toronto,Ontario. Chang, W.-C.(1976).Sampling theoriesandsampling practice.In D. B. Owen (Ed.),On the history of statisticsandprobability (pp. 299-315).New York: Marcel Dekker. Clark, R. W. (1971). Einstein,the life and times.New York: World. Cochrane,W. G. (1980).Fisher and the analysis of variance.In S. E.Fienberg& D. V. Hinckley (Eds.),R. A.Fisher: AnAppreciation (pp.17-34).New York: Springer-Verlag. Cowan, R. S.(1972). Francis Gallon's statistical ideas: The influenceof eugenics. Isis,63, 509-528. Cornfield, J., & Tukey, J. W. (1956). Average values of mean squares in factorials. Annals of MathematicalStatistics,27, 907-949. Cowan, R. S.(1977).Nature andnurture:the interplay of biology andpolitics in the work of FrancisGallon.In W. Coleman& C. Limoges (Eds.), Studies in thehistory of biology (Vol. 1, pp. 138-208).Baltimore, MD: Johns Hopkins University Press. Cowles, M. & Davis, C. (1982a). On the origins of the .05level of statistical significance. American Psychologist, 37, 553-558. Cowles, M., & Davis, C. (1982b). Is the .05 level subjectively reasonable? Canadian Journal of Behavioural Science, 14, 248-252. Cowles, M., & Davis, C. (1987). The subject matterof psychology: volunteers. British Journal of Social Psychology,26, 97-102. Cronbach,L. J. (1957).The twodisciplinesof scientific psychology. American Psychologist, 12, 671-684. Crum, L. S. (1931). On analytical interpretationsof straw vote samples. Journalof the American Statistical Association, 26, 243-261. Crump, S. L. (1946). The estimation of variance componentsin analysis of variance. Biometric Bulletin, 2, 7-11. Crump, S. L. (1951).Thepresent status of variance component analysis. Biometrics, 7,1-16. Crutchfield, R. S.(1938). Efficient factorial design and analysisof variance illustratedin psychological experimentation. Journal of Psychology,5, 339-346. Daniels,H. E. (1939).The estimation of components of variance. Royal Society Journal, Supplement6, 186-197. Darwin, C. (1958).Theorigin of species.New York: New American Library.(Original work published 1859) David, F. N. (1949). Probability theoryfor statistical methods. Cambridge, England: Cambridge UniversityPress. David, F. N (1962).Games, godsand gambling. London: Charles Griffin. Davis,C., & Gaito,J.(1984).Multiple comparison procedures within experimental research. CanadianPsychology,25, 1-12. Daw, R. H., & Pearson,E. S.(1979).AbrahamDe Moivre's 1733 derivationof the normal curve: A bibliographical note.In M. Kendall & R. L. Plackett (Eds.), Studies in the history of statistics and probability (Vol. II, pp. 63-66)New York: Macmillan. (Reprinted from Biometrika,59, 677-680,1972) Dawson,M. M. (1914). The developmentof insurance mathematics. InL. W. Zartman (Ed.), Yale readingsin insurance(W. H. Price,Revisor, pp. 95-119).New Haven,CT: Yale University Press. (Original work published in 1901). Dempster,A. P. (1964).On the difficulties inherentin Fisher'sfiducial argument. Journal of the American Statistical Association, 59, 56-66.

REFERENCES

239

De Moivre, A. (1967).Thedoctrine of chances:or, A methodof calculatingthe probabilities of eventsinplay (3rded.).New York: Chelsea.Includesa biographical noteon DeMoivre by Helen M. Walker. (Original work published in 1756) De Morgan,A. (1838).An essayon probabilitiesand ontheir applicationto life contingencies and insurance offices.In D. Lardner (Ed.). Cabinet Cyclopaedia, (pp. 1-306)London: Longman, Orme, Brown, Green & Longman,& John Taylor. Dodd, S. C.(1928).Thetheory of factors.Psychological Review, 35,1,211-234;II, 261-279. Duncan,D. B. (1951). A significancetest for differences between ranked treatmentsin an analysisof variance.Virginia Journal of Science,2, 171-189. Duncan,D. B. (1955).Multiple rangeandmultipleF tests.Biometrics,11, 1-42. Eden, T., & Fisher,R. A. (1927). Studiesin crop variation.IV. The experimental determination of the value of top dressings with cereals. Journal of Agricultural Science 77,548-562. Edgeworth,F. Y. (1887).Observationsand statistics:An essay on thetheory of errors of observationand the firstprinciplesof statistics. Transactions of the Cambridge Philosophical Society,14, 138-169. Edgeworth,F. Y. (1892).Correlated averages. Philosophical Magazine, 34, 190-204. Edgeworth,F. Y. (1893). Note on the calculation of correlation betweenorgans. Philosophical Magazine,36, 350-351. Edwards,A. E. (1950). Experimental design in psychological research. New York: Holt Rinehartand Winston. Eisenhart,C. (1947). The assumptions underlying the analysisof variance. Biometrics,3, 1-21. Eisenhart,C. (1970).On the transition from "Student's"z to "Student's"/. American Statistician,33, 6-10. Ellis, B. (1968). Basic conceptsof measurement. Cambridge, England: Cambridge University Press. Eysenck,H. J. (1947) Thedimensionsof personality. London: Routledge & Kegan Paul. Eysenck,H. J., & Eysenck,M. W. (1985) Personalityand individual differences: A natural scienceapproach.New York: Plenum Press. Fechner,G. (1966).Elementsof psychophysics.(H. Adler, Trans.).New York: Holt, Rinehart and Winston. (Original work published 1860) Feigl, H. (1959).Philosophical embarrassments of psychology. American Psychologist, 14, 115-128. Ferguson,G. A. (1959). Statistical analysisin psychology and education. New York: McGraw-Hill. Finn, R. W. (1973). Domesday book: A guide. Chichester, England: Phillimore. Fisher,R. A. (1915). Frequency distributionof valuesof thecorrelationcoefficient in samples from an indefinitely large population. Biometrika, 10, 507-521. Fisher, R. A. (1918). The correlation between relatives on thesuppositionof Mendelian inheritance. Transactions of the Royal Societyof Edinburgh,52, 399-433. Fisher, R. A. (1919). The genesisof twins. Genetics,4, 489-499. Fisher, R. A. (192la). On the"probable error"of a coefficient of correlation deducedfrom a small sample. Metron,1, 3-32. Fisher,R. A. (1921b). Studiesin crop variation.I. An examinationof the yield of dressed grain from Broadbalk. Journalof Agricultural Science,11, 107-135. Fisher,R. A. (1922a).On theinterpretationof X2 from contingencytablesand thecalculation of P. Journal of the Royal Statistical Society, 85, 87-94.

240

REFERENCES

Fisher, R. A. (1922b). The goodness of fit of regression formulae and thedistribution of regressioncoefficients. Journalof the Royal Statistical Society, 85, 597-612. Fisher, R. A. (1923).Statistical tests of agreement between observation and hypothesis. Economica, 3,139—147. Fisher, R. A. (1924a).The conditions under which x2 measuresthe discrepancy between observationand hypothesis. Journalof the Royal Statistical Society. 87, 442-450. Fisher, R. A. (1924b). On adistribution yieldingthe error functionsof several well known statistics. Proceedings of the International Congressof Mathematics, Toronto , 2, 805-813. Fisher,R. A. (1925a).Applications of "Student's"distribution. Metron,5, 90-104. Fisher,R. A. (1925b). Theory of statistical estimation. Proceedings of the Cambridge Philosophical Society,22, 700-725. Fisher, R. A. (1926a). Bayes' theoremand thefourfold table. Eugenics Review, 18,32-33. Fisher, R. A. (1926b).The arrangementof field experiments. Journalof the Ministry of Agriculture of Great Britain, 33, 505-513. Fisher, R. A. (1930). Inverse probability. Proceedings of the Cambridge Philosophical Society,26, 528-535. Fisher, R. A. (1934) Discussionof Dr Wishart's paper. Journalof theRoyal Statistical Society,Supplement1, 51-53. Fisher, R. A. (1935).The logic of inductive inference. Journal of theRoyal Statistical Society,98, 39-54. Fisher,R. A. (1937). Professor Karl Pearson and the method of moments. Annalsof Eugenics,7,303-318. Fisher, R. A. (1952). Statistical methods in genetics. Heredity,6, 1-12. Fisher, R. A. (1955). Statistical methodsand scientific induction. Journalof the Royal Statistical Society,17,69-78. Fisher, R. A. (1956). Statistical methods and scientific inference. New York: Hafner Press (3rd ed., 1973). Fisher, R. A. (1957). Commenton thenotesby Neyman, Bartlett,andWelch in this Journal (Vol.18, No. 2, 1956).Journal of the Royal Statistical Society, B, 19, 179. Fisher, R. A. (1966). Thedesignof experiments. (8th ed.). Edinburgh: Oliver and Boyd. (Original work published 1935) Fisher, R. A. (1970). Statistical methodsfor research workers. (14thed.). Edinburgh: Oliver and Boyd. (Original work published 1925). Fisher, R. A. (1971)."Author'sNote" preceding, Professor Karl Pearson and themethodof moments. In J. H. Bennett (Ed.)., Collected papers of R. A. Fisher (Vol. 4 p. 302). Adelaide: University of Adelaide. (Reprintedfrom Annals of Eugenics, 1939,7, 303-318) Fisher,R. A. (1971)."Author'sNote" preceding,The comparisonof sampleswith possibly unequal variances.In J. H. Bennett (Ed.)., Collected papers of R. A. Fisher (Vol. 4 pp. 190-198).Adelaide: Universityof Adelaide. (Reprintedfrom Annalsof Eugenics, 1939, 9, 174-180) Fisher, R. A., & MacKenzie,W. A. (1923). Studiesin crop variation. II. The manurial responseof different potato varieties. Journalof Agricultural Science,73, 311-320 Fisher Box,J. (1978). R. A. Fisher, the life of a scientist.New York: Wiley. FisherBox, J.(1981).Gosset,Fisher,and the tdistribution. American Statistician, 35,61-66. Fletcher, R. (1991) Science, ideology and themedia: TheCyril Burt scandal. New Brunswick, NJ: TransactionBooks.

REFERENCES

241

Forrest,D. W. (1974). Francis Gallon: The life and work of a Victorian genius. London: Elek. Fox, J. (1984).Linear statistical modelsand related methods.New York: Wiley. Funkhouser,H. G. (1936).A noteon a 10th century graph. Osiris, 1, 260-262. Funkhouser,H. G. (1937). Historical development of the graphical representation of data. Osiris, 3, 269-404. Gaito, J. (1960).Expected means squares in analysisof variance techniques. Psychological Reports, 7, 3-10. Gaito, J. (1980).Measurement scales and statistics: Resurgence of an old misconception. Psychological Bulletin,87, 564-567. Galbraith, V. H. (1961). Themaking of Domesdaybook. Oxford: University Press. Galton, F. (1877). Typical lawsof heredity. Proceedingsof the Royal Institutionof Great Britain, 8, 282-301.[Also in Nature, 15,492-495]. Galton, F. (1884).On theanthropometric laboratory at thelate International Health Exhibition. Journal of theAnthropological Instituteof Great Britain and Ireland, 14, 205-221. Galton, F. (1885a).Regression towards mediocrity in hereditary stature. Journal of the Anthropological Instituteof Great Britain and Ireland, 15, 246-263. Galton, F. (1885b).Opening address to theAnthropological Sectionof the British Association by thePresidentof the Section. Nature,32, 507-510. Galton, F. (1886).Family likenessin stature. Proceedingsof theRoyal Societyof London, 40, 42-73(includesan appendixby J. D.Hamilton Dickson) Galton, F. (1888a).Co-relationsandtheir measurement, chiefly from anthropometricdata. Proceedingsof the Royal Societyof London,45, 135-145. Galton, F. (1888b). Personal identification and description. Proceedingsof the Royal Institution, 12, 346-360. Galton, F. (1889).Natural inheritance. London: Macmillan. Galton, F. (1907)Inquiries into humanfaculty and itsdevelopment (3rd ed.). London: J. M. Dent. (Original work published 1883). Galton, F. (1908).Memoriesof my life. London: Methuen. Galton, F. (1962). Hereditary genius (second edition reprinted). London: Collins and New York: World. (Original work published 1869; 2nd ed. 1892) Galton, F. (1970).English men of science. London: Frank Cass.(Original work published 1874) Garnett, J. C. M., (1920) The single general factorin dissimilar mental measurements. British Journal of Psychology,10, 242-258. Garrett, H. E. & Zubin, J. (1943) The analysis of variance in psychological research. Psychological Bulletin,40, 233-267. Gauss,C. F.(1963). Theoria motus corporum coelestium. (Theory of theMotion of Heavenly Bodies). New York: Dover. (Original work published 1809) Gigerenzer,G. (1987).Probabilistic thinkingand the fightagainst subjectivity.In L. Krtiger, G. Gigerenzer& M. Morgan (Eds.),Theprobabilistic revolution: Ideasin the sciences (Vol. 2, pp. 7-33). Cambridge,MA: MIT Press. Gini, C. (1928).Une application de la methode representative aux materiaux du dernier recensementde la population italienne (ler decembre 1921) [An application of the representative method to material from the last censusof the Italian population (1 December 1921)]. Bulletinof the International Statistical Institute,23, 198-215. Ginzburg,B. (1936). The scientific valueof the Copemican induction. Osiris, I, 303-313. Glaisher,J. W. L. (1872). On the law offacility of errors of observationand themethodof leastsquares. Royal Astronomical Society Memoirs, 39, 75-124.

242

REFERENCES

Gould, S. J.(1981) Themismeasureof man. Markham: Penguin Books Canada. Grant, D. A. (1944) On The analysisof variancein psychological research.' Psychological Bulletin, 41, 158-166. Graunt, J. (1975).Natural andpolitical observations mentioned in afollowing index and madeupon thebills of mortality. New York: Arno Press.(Original work published 1662) Grunbaum,A. (1952). Causalityand thescienceof human behavior. American Scientist, 40, 665-676. Guilford, J. P.(1936) Psychometric methods. New York: McGraw-Hill. Guilford, J. P.(1950) Fundamental statistics in psychology and education. New York: McGraw-Hill. Guilford, J. P.(1967) The nature of human intelligence.New York: McGraw-Hill. Guthrie, S.C. (1946).A history of medicine. Philadelphia: J. B. Lippincott. Hacking, I. (1965). The logic of statistical inference. Cambridge, England: Cambridge University Press. Hacking, I. (1971).Jacques Bernoulli's Art of Conjecturing. British Journal for thePhilosophy of Science,22, 209-229. Hacking, I. (1975). Theemergenceof probability. Cambridge, England: Cambridge University Press. Haggard,E. A. (1958)Intraclass correlationand theanalysisof variance.New York: Dryden Press. Halley, E. (1693).An estimateof themortality of mankind, drawnfrom curioustablesof the births and funerals at the city of Breslaw; with an attempt to ascertain the price of annuitieson lives. Philosophical Transactions of the Royal Society,17, 596-610. Harman,H. H. (1976).Modern factor analysis. Chicago: University of ChicagoPress. Harris, J. A. (1913) On thecalculationof intra-classandinter-class coefficients of correlation from classmoments whenthe numberof possible combinations is large. Biometrika,12, 22-27. Hartley, D. (1966). Observations on man, his frame, hisduty and his expectations. Gainesville,FL: Scholars' Facsimiles andReprints. (Original work published 1749) Hays, W. L. (1963).Statistics.New York: Holt, RinehartandWinston. Hays, W. L. (1973).Statisticsfor thesocial sciences. New York: Holt, Rinehartand Winston. Hearnshaw,L. S. (1964).A short historyof British psychology.New York: Barnes& Noble. Hearnshaw,L. S. (1979).Cyril Burt, psychologist. London: Hodder and Stoughton. Hearnshaw,L. S.(1987).Theshapingof modernpsychology. London: Routledge and Kegan Paul. Heisenberg,W. (1927). Thephysical principlesof the quantum theory. Chicago: University of Chicago Press. Henkel, R. E., & Morrison, D. E. (1970). The significance test controversy. London: Butterworths. Heron, D. (1911). The dangerof certainformulae suggestedassubstitutesfor the correlation coefficient. Biometrika,8, 109-122. Hick, W. E. (1952). A note on one-tailed andtwo-tailed tests. Psychological Review, 59, 316-318. Hilbert, D. (1902) Mathematical problems. Bulletin of the American Mathematical Society, 8,437-445;478-79. Hilgard, E. (1951) Methodsand procedures in the study of learning.In S. S.Stevens (Ed.) Handbook of experimentalpsychology (pp.517-567).New York: John Wiley. Hogben, L. (1957).Statistical theory. London: Allen and Unwin.

REFERENCES

243

Holschuh,N. (1980).Randomizationanddesign:I. In S. E.Fienberg& D. V. Hinkley (Eds.), R. A. Fisher: An appreciation (pp.35-45).New York: Springer-Verlag. Holzinger, K. J., & Harman,H. H. (1941). Factor analysis:a synthesisof factorial methods. Chicago: Universityof Chicago Press. Hotelling, H. (1933).Analysisof acomplexof statistical variables into principal components. Journal of Educational Psychology, 24, 417-441,498-520. Hotelling, H. (1935). Themost predictable criterion. Journal of EducationalPsychology,26, 139-142. Hume, D. (1951). An enquiry concerning human understanding. In D.C. Yalden-Thomson (Ed.). (1951).Hume, Theory of Knowledge (pp. 3-176)Edinburgh: ThomasNelson. (Original work published 1748) Hunter, J. E.(1997).Needed:a ban on thesignificance test. Psychological Science, 8, 3-7. Huxley, J. (1949)Haifa centuryof genetics. (London) Sunday Times, 10th July. Huygens,C. (1970).On reasoningin gamesof dice. In J. Bernoulli, The art of conjecture (F. Maseres,Ed. andTrans.). New York: Redex Microprint. (Original work published 1657) Jacobs,J. (1885).Reviewof Ebbinghaus's Ueberdas Geddchtnis. Mind, 10, 454-459. Jeffreys, H. (1939).Randomandsystematic arrangements. Biometrika, 31, 1-8. Jensen,A. R. (1992). Scientificfraud or false accusations?The caseof Cyril Burt. In D. J. Miller & M. Hersen(Eds.),Research fraudin the behavioraland biomedical sciences. (pp. 97-124). New York: John Wiley& Sons. Jevons,W. (1874). Theprinciples of science. London: Macmillan. Jones,L. V. (1952).Testsof hypotheses: One-sided vs.two-sided alternatives. Psychological Bulletin, 49,43-46. Jones,L. V. (1954). A rejoinderon one-tailed tests. Psychological Bulletin, 51, 585-586. Joynson,R. B. (1989). TheBurt affair. London: Routledge. Kelley, T. L. (1923). Statistical method. New York: Macmillan. Kelley, T. L. (1928).Crossroadsin themind of man:a studyof differentiable mental abilities. Stanford CA: Stanford University Press. Kempthorne,O. (1976). The analysisof varianceand factorial design.In D. B. Owen (Ed.) On the history of statisticsand probability ( pp. 29-54.)New York: Marcel Dekker. Kempthorne,O. (1983). A reviewof R. A.Fisher: An Appreciation. Journalof the American Statistical Association,78,482-490. Kempthorne,O., & Folks, L. (1971). Probability, statistics,and data analysis. Ames: Iowa State University Press. Kendall, M. G. (1952). George UdnyYule, 1871-1951.Journal of the Royal Statistical Society, 115A, 156-161. Kendall, M. G. (1959).Hiawatha designsan experiment. American Statistician, 13, 23-24. Kendall,M. G. (1961). Daniel Bernoullionmaximumlikelihood.Biometrika,48,1-18.[This paper is followed by a translationby C. G.Allen of D. Bernoulli's The most probable choicebetween several discrepant observations and theformation therefrom of the most likely induction 1777(?)., together with a commentaryby Leonard Euler(1707-1783).] Kendall, M. G. (1963).Ronald Aylmer Fisher,1890-1962.Biometrika,50, 1-15. Kendall,M. G. (1968). Studiesin thehistoryof probability andstatistics, XIX. Francis Ysidro Edgeworth (1845-1926). Biometrika, 55, 269-275. Kendall, M. G. (1972).The history andfuture of statistics.In T. A. Bancroft(Ed.).Statistical papers in honor of George W.Snedecor. Ames: Iowa State University press. Kendall, M. G., & Babington Smith,B. (1938). Randomness andrandom sampling numbers. Journal of the Royal Statistical Society, 101, 147-166.

244

REFERENCES

Kenna, J. C.(1973). Gallon'ssolution to thecorrelation problem:A false memory? Bulletin of the British Psychological Society, 26, 229-30. Keuls, M. (1952). The use ofstudentized rangein connection withan analysisof variance.

Euphytica,!, 112-122.

Kevles, D. (1985).In the nameof eugenics.New York: Alfred A. Knopf. Kiaer, A. N. (1899).Sur lesm&hodesrepresentatives outypologies appliquees a la statistique (On representative methods or typologies appliedto statistics). Bulletinof the International Statistical Institute,11,180-185. Kiaer, A. N. (1903). Sur les m&hodes representativesou typologies (on representative methodsor typologies). Bulletinof the International Statistical Institute, 13, 66-78. Kimmel H. D. (1957). Three criteriafor the use ofone-tailed tests. Psychological Bulletin, 54,351-353. Kneale, W. (1949).Probability and induction. Oxford: Oxford UniversityPress. Koch, S. (Ed.). (1959-1963).Psychology:A study of a science(6 Vols.). New York: McGraw-Hill. Koch, S. (1980).Psychology and its human clientele: Beneficiaries or victims? In R. A. Kasschauand F. S.Kessel (Eds.) Psychology and society:In searchof symbiosis,(pp. 30-60).New York: Holt, RinehartandWinston. Kolmogorov, A. N. (1956).Foundationsof the theory of probability. (N. Morrison, Trans.) New York: Chelsea Publishing Co. (Original work publishedin 1933) Koren, J. (Ed.). (1918).Thehistory of statistics.New York: Macmillan. Koshal, R. S.(1933). Applicationof the methodof maximumlikelihood to thederivationof efficient statisticsfor fitting frequencycurves. Journalof theRoyal Statistical Society, 96,303-313. Kruskal, W., & Mosteller, F. (1979).Representative sampling, IV: The history of the concept in statistics,1895-1939.International Statistical Review, 47, 169-195. Lancaster,H. O. (1966).Forerunnersof the PearsonX2. Australian Journalof Statistics,8, 117-126. Lancaster,H. O. (1969). Thechi-squared distribution.New York: Wiley. Laplace, P. S. de(1810). Memoir sur les approximationsdes formules de tres grandes nombres,et surleur applicationauxprobabilites (Memoiron approximationsof formulae which are functions of very large numbers,and their applicationto probabilities). Memoirs de la classedessciences mathematiques et physiquesde I'institut de France, Annee 1809,353-415; suppl. 559-565. LaplaceP. S. de(1812). Theorie analytiquedesprobabilites. Paris: Courcier. LaplaceP. S. de(1951). A philosophical essayon probabilities. (F. W. Truscott & F. L. Emory, Trans.).New York: Dover. (Original work published 1820) Lawley, D .N.,& Maxwell, A. E. (1971). Factor analysis as astatistical method.2nd edition. London: Butterworth. Le Cam,L., & Lehmann,E. L. (1974). J. Neyman.On the occasionof his 80th birthday. Annals of Statistics,2, vii-xi. Legendre,A. M. (1805). Nouvelles methodespour la determinationdesorbitesdescometes [New methodsfor determiningthe orbits of comets],Paris: Courcier. Lehmann,E. L. (1959). Testingstatistical hypotheses. New York: Wiley. Lindquist, E. F. (1940). Statistical analysisin educational research. Boston: Houghton Mifflin. Lord, F. M. (1953).On thestatistical treatment of football numbers. American Psychologist, 5,750-751.

REFERENCES

245

Lovie, A. D. (1979). The analysis of variance in experimental psychology:1934-1945. British Journal of Mathematicaland Statistical Psychology, 32, 151-178. Lovie, A. D. (1983).Imagesof man inearly factoranalysis-psychological and philosophical aspects.In S. M. Bem, H. Van Rappard& W. Van Hoorn (Eds.), Studies in the history of psychology and thesocial sciences,Proceedingsof the first Europeanmeetingof CHEIRON, 1 (pp. 235-247).Leiden: Leiden University. Lovie, P., & Lovie A. D. (1993). Charles Spearman, Cyril Burt, and theorigins of factor analysis. Journalof the History of the Behavioral Sciences. 29, 308-321. Lovie, P., & Lovie A. D. (1995). The cold equations: Spearmanand Wilson on Factor indeterminacy. British Journalof Mathematical and Statistical Psychology,48, 237-253. Lupton, S. (1898).Noteson observations. London: Macmillan. Lush, J. L. (1972).Early statisticsat Iowa State College.In T. A. Bancroft (Ed.)., Statistical papers in honour of George W.Snedecor (pp.211-226).Ames: Iowa State University Press. Macdonell, W. R. (1901).On criminal anthropometryand theidentificationof criminals. Biometrika, 1, 177-227. MacKenzie,D. A. (1981).Statisticsin Britain 1865-1930. Edinburgh: Edinburgh University Press. Magnello,M. E. (1998).Karl Pearson'smathematizationof inheritance: Fromancestral heredity to Mendelian genetics(1895-1909).Annals of Science,55, 35-94. Magnello, M. E. (1999).The non-correlationof biometrics and eugenics: rival formsof laboratory work in Karl Pearson'scareer at University College London. Historyof Science,37, Part 1, 79-106;Part 2, 123-150. Marks, M. R. (1951). Two kindsof experiment distinguished in termsof statistical operations. Psychological Review, 58, 179-184. Marks, M. R. (1953). One-andtwo-tailed tests.Psychological Review, 60, 207-208. Maunder, W. F. (1972). Sir Arthur Lyon Bowley. An inaugural lecture delivered in the University of Exeter. (In M. G. Kendall& R .L. Plackett (Eds.), Studies in the history of statisticsandprobability. (1977, Vol.II, pp. 459-480).New York: Macmillan. Maxwell, A. E. (1977).Multivariate analysisin behavioural research. London: Chapman and Hall. McMullen,L. (!),& Pearson, E.S.(2). (1939). William Sealy Gosset, 1876-1937 (1). "Student"as aman; (2)."Student"as astatistician. Biometrika,30, 205-250. McNemar, Q. (1940a). Samplingin psychological research. Psychological Bulletin, 37, 331-365. McNemar, Q. (1940b). Reviewof E. F.Lindquist, Statistical analysisin educational research. Psychological Bulletin, 37, 746-748. McNemar, Q. (1949). Psychological statistics. New York: Wiley. Mercer, W. B. ,& Hall, A. D. (1911). The experimental errorof field trials.Journal of Agricultural Science,4, 107-132. Merriman, M. (1877).A list of writings relatingto themethodof leastsquares,with historical and critical notes. Transactionsof the Connecticut Academy of Arts and Sciences,4, 151-232. Merriman, M. (1884). A textbookon themethodof leastsquares (8th ed.). New York: Wiley. Miche"a, R. (1938).Les variations de laraisonauXVII e siecle; essaisur lavaleur du langage employ6en histoire litteraire [Differences in meaningin the 17th century; essay on the valueof language employed in literary history]. Revue Philosophique de la Franceet de

l'Etranger,126,m-2Ql.

246

REFERENCES

Mill, J. S.(1973).A systemof logic ratiocinativeand inductive. Beinga connected viewof the principles of evidenceand themethodsof scientific investigation (8th ed., 1872 J. M. Robson, Ed.,Toronto: University of Toronto Press.Original work published1843). Miller, A. G. (Ed.). (1972). Thesocial psychologyof the psychological experiment. New York: Free Press. Miller, G. A. (1963).RonaldA. Fisher: 1890-1962.American Journal of Psychology,76, 157-158. Miller, G. A. (1969). Psychology as a means of promoting human welfare. American Psychologist,24, 1063-1075. Nagel, E. (1936).The meaningof probability. Journalof theAmerican Statistical Association, 31, 10-30. [Reprinted with a commentaryin J. R.Newman (Ed.),Theworld of mathematics, 1956, Vol.11, pp. 1398-1414.New York: Simonand Schuster]. Newman, D. (1939). The distribution of the rangein samples from a normal population expressedin terms of an independent estimate of standard deviation. Biometrika, 31, 20-30. Newman,J. R.(Ed.). (1956). Theworld of mathematics. (Vols. 1-4). New York: Simonand Schuster. Neyman,J. (1934). On the twodifferent aspectsof the representative method: The method of stratified samplingand the method of purposive selection. Journalof the Royal Statistical Society,97, 558-625. Neyman,J. (1935).Statistical problemsin agricultural experimentation. Journal of the Royal Statistical Society, Supplement, 2, 107-154. Neyman, J. (1937). Outline of a theory of estimation basedon theclassical theoryof probability. Philosophical Transactions of the Royal Society,A, 236, 333-380. Neyman,J. (1941).Fiducial argumentand thetheory of confidence intervals. Biometrika 32, 128-150. Neyman, J. (1956).Note on anarticle by Sir Ronald Fisher. Journalof the Royal Statistical Society,B, 18,288-294. Neyman,J., & Pearson,E. S.(1928). On the use and interpretationof certaintest criteria for purposesof statistical inference. Biometrika, 20a, Part 1, 175-240,Part II, 263-294. Neyman,J., & Pearson, E. S.(1933). The testing of statistical hypothesesin relation to probabilities a priori. Proceedings of the Cambridge Philosophical Society,29, 492-510. Norton, B., & Pearson,E. S.(1976). A note on thebackgroundto, andrefereeingof, R. A. Fisher's1918 paper'On thecorrelation between relatives on thesuppositionof Mendelian inheritance.'Notesand Recordsof the Royal Society,31, 151-162. Ore, O. (1953).Cardano: Thegambling scholar. Princeton, NJ: Princeton UniversityPress. Osgood,C. E. (1953)Method and theory in experimental psychology. New York: Oxford University Press. Parten,M. (1966).Surveys, polls,and samples: Practical procedures. New York: Cooper Square Publishers. Pearl, R. (1917).The probable errorof a Mendelianclassfrequency.American Naturalist, 51, 144-156. Pearson,E. S.(1938a).Karl Pearson:An appreciationof some aspects of his life and work. Cambridge, England: Cambridge University Press. Pearson,E. S.(1938b).Someaspectsof the problem of randomization.II. An illustration of 'Student's'inquiry into the effect of balancingin agricultural experiments. Biometrika, 30,159-171.

REFERENCES

247

Pearson,E. S.(1939a).William Sealy Gosset,1876-1937(2)."Student"as a statistician. Biometrika, 30, 205-250. Pearson,E. S.(193 9b).Note on the inverse and direct methodsof estimation in R. D. Gordon'sproblem. Biometrika,31, 181-186. Pearson,E. S.(1965).Some incidentsin the early historyof biometryandstatistics, 1890-94 Biometrika, 52, 3-18. Pearson,E. S.(1966).The Neyman-Pearson story:1926-34. Historical sidelights on an episode in Anglo-Polish collaboration.In F. N. David (Ed.)., Festschrift for J. Neyman (pp. 1-23).London: Wiley. Pearson,E. S.(1968).Some early correspondence between W. S. Gosset,R. A. Fisherand Karl Pearson, with notes andcomments. Biometrika,55, 445-457. Pearson,E. S.(1970).Karl Pearson'slectureson thehistory of statisticsin the 17th and18th centuries. Appendixto E. S.Pearson& M. G. Kendall (Eds.),Studiesin the history of statistics andprobability, (pp. 479-481).London: Griffin. Pearson,K. (1892). Thegrammar of science. London: Scott. Pearson,K. (1894). Contributions to themathematical theoryof evolution. Philosophical Transactionsof the Royal Society,A, 185, 71-110. Pearson,K. (1895). Contributionsto themathematical theoryof evolution. II. Skew variations in homogeneous material. Philosophical Transactions of the Royal Society,A, 186, 343-414. Pearson,K. (1896).Mathematical contributions to thetheory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society,A, 757, 253-318. Pearson,K. (1900a).On thecriterion thata given systemof deviationsfrom the probablein the caseof a correlated systemof variablesis such thatit can bereasonably supposed to have arisenfrom random sampling. Philosophical Magazine, 50, 157-175. Pearson,K. (1900b). Mathematical contributionsto the theory of evolution. VII. On the correlation of charactersnot quantitatively measurable. Philosophical Transactions of the Royal Society,A, 795, 1–47. Pearson, K. (1901). On lines and points of closest fit to systems of points in space. Philosophical Magazine,2-6, 559-572. Pearson,K. (1903). On thelaws of inheritancein man,I. Biometrika, 2, 357-462. Pearson,K. (1904a).On thelaws of inheritancein man, II. Biometrika, 3, 131-190. Pearson,K. (1904b).Mathematical contributionsto the theory of evolution. XIII. On the theory of contingencyand itsrelation to associationand normal correlation. Draper's CompanyResearch Memoirs, Biometric Series 1. 35 pages. Pearson,K. (1906).Walter Frank Raphael Weldon, 1860-1906.Biometrika, 5, 1-52. Pearson,K. (1907).Reply to certain criticismsof Mr G. U. Yule. Biometrika,5, 470-476. Pearson,K. (1909).On the test of goodnessof fit of observationto theory in Mendelian experiments. Biometrika,9, 309-314. Pearson,K. (1911). On theprobability that two independent distributions of frequencyare really samplesfrom the same population. Biometrika, 8, 250-254. Pearson,K. (1916a). On abrief proof of the fundamentalformula for testing the goodness of fit of frequencydistributionsand of theprobable errorof "P." Philosophical Magazine, 31, 369-378. Pearson,K. (1916b). Mathematical contributionsto thetheory of evolution.XIX. Second supplementto a memoir on skew variation. Philosophical Transactions of the Royal Society,A, 216, 429-457.

248

REFERENCES

Pearson,K. (1917). The probable errorof a Mendelian classfrequency.Biometrika,11, 429-432. Pearson,K. (1920).Noteson thehistory of correlation.Biometrika, 13, 25-45. Pearson,K. (1922).On the x2 test of goodnessof fit. Biometrika, 14,186-191. Pearson,K. (1924a).Historical noteon theorigin of thenormal curveof errors.Biometrika, 16,402-404. Pearson,K. (1924b). On the difference and thedoublet testsfor ascertaining whether two samples have been drawn from the same population. Biometrika, 16,249-252. Pearson,K. (1926). Abraham de Moivre. Reply to Professor Archibald. Nature, 117, 551-552. Pearson,K. (1927). The mathematicsof intelligence. Reviewof C. Spearman,Theabilities of man, their natureand measurement. Nature, 120, 181-183. Pearson,K (1914-1930). The life, letters and labours of Francis Galton. Cambridge, England: Cambridge University Press. (Vol.1, 1914; Vol. 2, 1924; Vol. 3a, 3b,1930) Pearson,K. (1935). Statisticaltests.Nature, 136,296-297. Pearson,K. (1936)Method of momentsandmethod of maximum likelihood. Biometrika, 28, 34-59. Pearson,K. (1978). The history of statistics in the 17th and 18th centuries against the changing backgroundof intellectual,scientific and religious thought. ( E. S. Pearson, Ed.). London: CharlesGriffin. Pearson,K., & Heron, D. (1913). On theoriesof association. Biometrika, 9, 159-315. Pearson,K., & Moul, M. (1927).Themathematicsof intelligence. Biometrika,19,246-291. Peirce, B. (1852). Criterion for the rejectionof doubtful observations.TheAstronomical Journal, 2,161-163. Peters,C. C. (1943). Misusesof the Fisher statistics. Journal of EducationalResearch,36, 546-549. Petrinovich, L. F., & Hardyck, C. D. (1969).Error ratesfor multiple comparison methods: Some evidence concerning the frequency of erroneous conclusions. Psychological Bulletin, 71,43-54. Petty, W. (1690). Political arithmetick. London: Printedfor Robert Clavel and Henry Mortlock. Phillips, L. D. (1973).Bayesian statistics for social scientists. London: Nelson. Picard, R. (1980).Randomizationand design:II. In S. E. Fienberg& D. V. Hinkley (Eds.) R. A. Fisher: An appreciation (pp.46-58).New York: Springer-Verlag. Playfair, W. (1801a). Commercialandpolitical atlas. (3rd ed.). London: Printed by T. Burton. Playfair, W. (1801b).Statistical breviary; shewing on aprinciple entirely new,the resources of every stateand kingdom in Europe. London: Printedby T. Bensley for J. Wallis, Egerton, Vernorand Hood, Blackand Parry, andTibbet and Didier. Poisson,S.- D. (1837). Recherches sur laprobabilite desjugements en matiere criminelleet en matiere civile, precedee desregies generaldu calcule desprobabilites. (Research on the probability of judgmentsin criminal andcivil matters,precededby generalrules for the calculation of probabilities). Paris: Bachelier. Popper,K. R. (1959). The logic of scientific discovery. London: Hutchinson. Popper,K. R. (1962).Conjecturesand refutations: Thegrowth of scientific knowledge.New York: Basic Books. Porter,T (1986) Therise of statistical thinking. Princeton, NJ: Princeton UniversityPress.

REFERENCES

249

Quetelet,L. A. (1849).Letters addressedto H.R.H. the Grand Dukeof SaxeCoburg and Gotha on theTheory of Probabilities as applied to themoral andpolitical sciences.(O. G. Downes, Trans.). London: Charles & Edwin Leyton. Authorized facsimileof the original book producedin 1975 by Xerox University Microfilms, Ann Arbor, MI. (Original work published1835) RAND Corporation(1965). A million random digits-with 100,000 normal deviates. New York: Free Press. Reichenbach,H. (1938).Experience and prediction. An analysis of thefoundations and structure of knowledge. Chicago: University of ChicagoPress. Reid, C. (1982). Neyman-from life. New York: Springer-Verlag. Reitz, W. (1934). Statistical techniques for the study of institutional differences. Journalof ExperimentalEducation,3, 11-24. Rhodes,E.G. (1924). On theproblem whethertwo given samplescan besupposedto have been drawnfrom the same population. Biometrika, 16,239-248. Robinson, C. (1932). Straw votes.New York: Columbia UniversityPress. Robinson,C. (1937). Recent developments in thestraw-poll field. Public Opinion Quarterly, 1 (3), 45-56,and 1(4), 42-52. Rosenthal,R., & Rosnow,R. L. (Eds.). (1969).Artifact in behavioral research.New York: Academic Press. Rowe, F. B. (1983). Whatever becameof poor Kinnebrook? American Psychologist, 38, 851-852. Rozeboom,W. W. (1960). The fallacy of the null hypothesis significance test. Psychological Bulletin, 57,416-428. Rucci, A.J.,& Tweney, R.D. (1980) Analysisof varianceand the"SecondDiscipline" of scientific psychology:A historical account. Psychological Bulletin, 87, 166-184. Russell,B. (1931). Thescientific outlook. London:Allen and Unwin. Russell,B. (1961). Historyof Westernphilosophyand itsconnection with politicaland social circumstancesfrom theearliest timesto thepresent day. London: Allen and Unwin. (Original work published1946) Russell,Sir John, (1926). Field experiments: How they aremadeandwhat theyare.Journal of the Ministry of Agriculture of Great Britain, 32, 989-1001. Rutherford,E., & Geiger, H. (1910). The probability variationsin the distribution of a particles. Philosophical Magazine, 20, 698-707. Ryan,T. A. (1959).Multiple comparisonsin psychological rsearch. Psychological Bulletin, 56, 26-47. Savage,L. J. (1962). Thefoundations of statistical inference. New York: Wiley. Savage,L. J. (1976).On rereadingR. A. Fisher. Annalsof Statistics,4, 441-500. ScheffiS, H. (1953).A methodfor judging all contrastsin the analysisof variance. Biometrika, 40, 87-104. Scheff6, H. (1956a)Alternative modelsfor the analysisof variance. Annalsof Mathematical Statistics,27, 23-36. ScheffS, H. (1956b) q'mixed modelfor the analysisof variance. Annalsof Mathematical Statistics,27,251-271. Scheffe, H. (1959). Theanalysisof variance.New York: Wiley. Seal, H.L. (1967).The historical development of the Gauss linear model. Biometrika, 54, 1-24.

250

REFERENCES

Seng,Y .P. (1951). Historical surveyof the developmentof sampling theoryand practice. Journal of the Royal Statistical Society, 114,2] 4-231. Sheynin,O. B. (1966).Origin of the theory of error.Nature, 211,1003-1004. Sheynin,O. B. (1978).S. - D. Poisson'swork in probability.Archivesfor History of Exact Sciences,18,245-300. Sidman,M. (1960). Tactics of scientific research.New York: Basic Books. Simpson,T. (1.755).A letterto theRight Honourable George Earl of Macclesfield, President of the Royal Society,on the advantage arisingin taking the mean of a number of observations,in practical astronomy. Philosophical Transactions of the Royal Society, 49, 82-93. Simpson,T. (1757).On the advantage arisingfrom taking the mean of a number of observations,in practical astronomy, wherin the odds thatthe result in this way ismore exactthanfrom onesingle observationis evincedand theutility of the methodin practise clearly madeappear.In Miscellaneous tractson some curiousand very interesting subjects in mechanics, physical astronomy and speculative mathematics (pp. 64-75). London: Printedfor J. Nourse. Skinner, B. F. (1953).Scienceand human behavior.New York: Macmillan. Smith, D. E. (1929)A source bookin mathematics.New York: McGraw Hill. Smith, K. (1916). On the "best" values of the constants in frequency distributions. Biometrika, 11, 262-276. Snedecor,G. W. (1934).Calculationand interpretationof analysis of varianceand covariance. AmesIA: CollegiatePress. Snedecor,G. W. (1937). Statistical methods. Ames, IA: Collegiate Press. Soper,H. E. (1913). On the probable errorof the correlation coefficient to a second approximation.Biometrika,9, 91-115. Soper,H. E., Young,A. W., Cave,B. M., Lee, A., & Pearson,K. (1917).On thedistribution of the correlation coefficientin small samples.A co-operative study. Biometrika, 11, 328-413. Spearman,C. (1904a). General intelligence, objectively determined and measured. American Journal of Psychology,15, 201-293. Spearman,C. (1904b).The proof andmeasurementof association between two things. American Journalof Psychology,15, 72—101. Spearman,C. (1927). Theabilities of man. New York: Macmillan. Spearman,C. (1933). The factor theoryand itstroubles.Journal of EducationalPsychology, 24, 521-524. Spearman,C. (1939).The factorial analysisof humanability. II. Determinationof factors. British Journal of Psychology,30, 78-83. St. Cyres, Viscount(1909).Pascal. London: Smith, Elder and Co. Stanley,J. C.(1966).The influenceof Fisher's""The designof experiments"on educational research thirty years later. American Educational Research Journal, 3, 23-229. Stephan,F. F. (1948). Historyof the use ofmodern sampling procedures. Journal of the American Statistical Association, 43, 12-39. Stephen,L. (1876).History of English thoughtin theeighteenth century (Vols. 1 -2). London: Smith, Elderand Co. Stephenson,W. (1939).The factorial analysisof human ability.IV Abilities defined as non-fractional factors. British Journal of Psychology,30, 94-104. Stevens,S .S.(1951).Mathematics, measurement andpsychophysics.In S.S. Stevens (Ed.)., Handbook of experimentalpsychology (pp.1-49).New York: Wiley.

REFERENCES

251

Stigler, S. M. (1977).Eight centuriesof sampling inspection:the trial of the Pyx. Journal of the American Statistical Association, 72, 493-500. Struik, D. J. (1954).A concise historyof mathematics. London: Bell. "Student"(1907).On the error of countingwith a haemacytometer. Biometrika, 5,351-360. "Student"(1908a).The probable errorof a mean. Biometrika,6, 1-25. "Student"(1908b).Probable errorof a correlationcoefficient. Biometrika,6, 302-310. "Student"(1925).New tablesfor testingthe significanceof observations. Metron, 5,25-32. "Student"(1927). Errorsof routine analysis. Biometrika, 19, 151-164. Tedin, O. (1931). The influenceof systematic plot arrangements upon the estimateof error in field experiments. Journalof Agricultural Science,21, 191-208. Thomson,G. H. (1939a).The factorial analysisof human ability.I. Thepresent positionand the problems confrontingus.British Journal of Psychology,30, 71-77. Thomson,G. H.(1939b).Thefactorial analysisof human ability- agreementanddisagreement in factor analysis:a summingup. British Journal of Psychology,30, 105-108. Thomson,G. H. (1946). The factorial analysis of human ability. London: Universityof London Press.(First edition1939) Thomson,G. H. (1969).Theeducationof an Englishman. Edinburgh: Moray House Publications. Thomson,W. (Lord Kelvin). (1891). Popular lectures and addresses. London: Macmillan. Thorndike, E. L. (1905). Measurements of twins. Archivesof Philosophy,Psychology,and Scientific Methods,No. 1. NewYork: SciencePress. Thorndike, E. L., & Woodworm,R. S.(1901).The influenceof improvementin onemental function uponthe efficiency of other functions. Psychological Review, 8, 247-261. Thurstone,L. L. (1935). Thevectorsof the mind. Chicago: University of ChicagoPress. Thurstone,L. L. (1940).Current issuesin factor analysis. Psychological Bulletin, 37, 189-236. Thurstone,L. L. (1947) Multiple factor analysis. Chicago: University of ChicagoPress. Tippett, L. H. C. (1925). On theextremeindividualsand therangeof samples taken from a normal population. Biometrika, 17, 364-387. Todhunter,I. (1865). A history of the mathematical theoryof probability from thetime of Pascal to that of Laplace. Cambridgeand London: Macmillan. [Reprintedin 1965by the Chelsea Publishing Co.,New York]. Traxler, R. H. (1976).A snagin the history of factorial experiments.In D. B. Owen (Ed.), On the history of probability and statistics(pp. 283-295).New York: Marcel Dekker. Tukey, J. W. (1949). Comparingindividual meansin the analysisof variance. Biometrics, 5,99-114. Tukey, J. W. (1953). The problem of multiple comparisons. Unpublished manuscript, Princeton University, Princeton, NJ. Turner, F. M. (1978). The Victorian conflict between science and religion: A professional dimension. Isis,69, 356-376. Underwood,B. J. (1949).Experimentalpsychology.New York: Appleton-Century-Crofts. Venn, J. (1888). The logic of chance. (3rd ed.),London: Macmillan. (Original work published 1866). Venn, J. (1891).On thenatureandusesof averages. Journal of theRoyal Statistical Society, 54, 429-448. von Mises, R. (1919). Grundlagender wahrscheinlichkeitsrechnung (Principles of probability theory). Mathematische Zeitschrift, 5, 52-99.

252

REFERENCES

von Mises,R. (1957). Probability, statistics and truth. (2nd rev. Englished. preparedby H. Geiringer).London: Allen andUnwin. Wald, A. (1950).Statistical decision functions. New York: Wiley. Walker, E. L. (1970).Psychology as anatural and social science. Belmont, CA: Brooks/ Cole. Walker, H. M. (1929). Studiesin thehistory of statistical method. Baltimore, MD: Williams and Wilkins. Wallis, W. A., & Roberts,H. V. (1956).Statistics:A newapproach.New York: FreePress. Watson.J. B. (1913).Psychology as the behaviorist viewsit. Psychological Review, 20, 158-177. Weaver,W. (1977).Lady Luck: Thetheoryof probability. Harmondsworth: Penguin Books. (Original work publishedin 1963). Welch, B. L. (1939).On confidence limits and sufficiency with particular referenceto parametersof location. Annalsof Mathematical Statistics,10, 58-69. Welch, B. L. (1958)."Student"andsmall sample theory. Journal of the American Statistical Association,53, 777—788. Weldon, W. F. R. (1890). The variations occurringin certain DecapodCrustacea.-!. Crangon vulgaris. Proceedingsof the Royal Societyof London, 47, 445-453. Weldon, W.F.R. (1892).Certain correlated variations in Crangon vulgaris. Proceedings of the Royal Societyof London, 51, 2-21. Weldon, W.F.R.(1893).On certain correlated variations in Carcinus moenas. Proceedings of the Royal Societyof London, 54, 318-329. Westergaard,H. (1932). Contributionsto thehistory of statistics. London:P. S.King. Whitney, C. A. (1984, October). Generating and testing pseudorandom numbers. Byte p. 128. Wilson, E. B. (1928).Review of Theabilities of man, their natureand measurementby Charles Spearman. Science, 67, 244-248. Wilson, E. B. (1929a).Review of Crossroadsin the mind of man by T. LKelley. Journal of General Psychology, 2, 153-169. Wilson, E. B. (1929b). Comment on Professor Spearman's note. Journal of Educational Psychology,20, 217-223. Wishart, J. (1934). Statisticsin agricultural research. Journal of the Royal Statistical Society Supplement,I, 26-51. Wolfle, D. (1940).Factor analysisto 1940.Chicago:University of ChicagoPress. Wood, T. B., & Stratton,F. J. M.(1910). The interpretationof experimental results. Journal of Agricultural Science,3, 417-440. Woodworth,R. S., & Schlosberg,H. (1954) Experimental psychology. New York: Holt, Rinehart & Winston. Yates, F. (1935).Complex experimentation. Journal of the Royal Statistical SocietySupplement^,181-247. Yates, F. (1939).The comparative advantages of systematicandrandomized arrangements in the designof agricultural andbiological experiments. Biometrika, 30,440-466. Yates,F. (1951). The influence of Statistical Methodsfor Research Workers on the developmentof the scienceof statistics. Journalof theAmerican Statistical Association, 46, 19-34. Yates,F. (1964)Sir RonaldFisherand thedesignof experiments.Biometrics,20,307-321 Yule, G. U. (1897a).Notes on theteachingof the theory of statisticsat University College. Journal of the Royal Statistical Society, 60, 456-458.

REFERENCES

253

Yule, G. U. (1897b). On thetheory of correlation. Journalof theRoyal Statistical Society, 60, 812-854. Yule, G. U. (1900).On theassociationof attributesin statistics.Philosophical Transactions of the Royal Society,A, 194, 257-319. Yule, G. U. (1906).On aproperty which holds good for all groupingsof anormal distribution of frequencyfor two variables. Proceedings of the Royal Society,A, 77, 324-336. Yule, G. U. (1912). On the methodsof measuring association between two attributes. Journal of the Royal Statistical Society, 75, 579-642. Yule, G. U. (1921). Reviewof W. Brown & G.H.Thomson, The essentialsof mental measurement. British Journal of Psychology,2, 100-107. Yule, G. U. (1936). Karl Pearson, 1857-1936.Obituary Noticesof Fellows of the Royal Societyof London,2, 73-104. Yule, G. U. (1938a).Notes of Karl Pearson's lectures on Theory of Statistics. Biometrika, 30, 198-203. Yule, G. U. (1938b). A test of Tippett's random sampling numbers. Journal of theRoyal Statistical Society, 101, 167-172. Yule, G. U. (1939).[Review of] Karl Pearson:An appreciationof some aspects of his life and work. By E. S.Pearson. Cambridge: University Press, 1938. Nature,220-222. 143, Yule, G. U., & Kendall,M. G. (1950). An introductionto thetheory of statistics (14th ed.) London: CharlesGriffin. (1st ed., Yule, 1911)

Author Index

A Acree,M.C.,57,61,236 Adams, W. J., 125-126,236 Airy,G.B., 107, 115,236 Anderson,R. L., 213,236 Arbuthnot, J., 51-52,235

B Babington Smith,B., 86-88,243 Bancroft,!. A., 213,235 Barnard,G. A., 77,235 Bartlett, M. S., 83,177,235 Baxter,B., 208,236 Bayes,T., 77-78,235 Beloff,H., 163,235 Beloff, J., 209,235 Bennett, J. H., 113,235 Berkson, J., 83,236 Bernard, C., 18, 236 Bernoulli, J., 9, 52, 59,125,236 Bilodeau, E. A., 22,236 Boring, E. G., 13, 24, 64, 65,237 Bowley, A. L., 95-96,98-99,108,114, 237 Box, G. E. P, 7,237 Bravais, A., 146-147, 237 Brennan,J. F.,207, 237 Buck, P., 50, 237 Burke, C. J.,203, 237 Burt, C., 5,156, 162-164,167,237

c Campbell,D. T., 206, 237 Carrington,W., 209, 237

254

Carroll, J. B., 169, 237 Cattell, R. B., 170, 237 Cave, B.M., 118,250 Chang, W.-C.,95, 99,107,238 Clark, R. W., 23-24,238 Cochrane,W. G., 188,238 Cornfield, J.,214,238 Cowan, R.S.,4, 130,238 Cowles, M., 17,21,200,221,238 Cronbach,L. J., 34,189,208, 215,238 Crum, L. S., 94, 238 Crump, S. L. 213,238

D Daniels, H. E., 213,238 Darwin, C., 1, 3,129-131,238 David, F. N., 56, 58,65,238 Davis, C., 17, 21,196-197,200, 221, 238 Daw, R. H., 64,238 Dawson,M.M., 51,53,238 Dempster,A. P., 83,238 De Moivre, A., 10,14,20,52-53,63-65, 69, 72-76,239 De Morgan, A., 12,61,239 Dodd, S. C., 34, 239 Duncan,D. B., 197,239

E Eden,T., 178,239 Edgeworth,F. Y., 90, 98,110, 117, 145146,155,239 Edwards,A. E.,213,239 Eisenhart,C., 117, 195,213-214,239 Ellis, B., 40-41,239 Eysenck,H. J., 170,239

AUTHOR INDEX

255

F

Harman,H.H., 170,242 Harris, J. A., 190,242 Fechner,G., 33,239 Hartley, D., 176, 242 Feigl, H., 25, 239 Hays, W. L., 195,242 Ferguson,G. A., 213 Hearnshaw,L. S., 28,156, 162-164,210, Finn, R. W., 48, 239 242 Fisher,R. A.,4-6,38,62,81-82,99,102 Heisenberg,W., 23, 46, 242 110-111,113-115,118-122,171,177-Henkel, R. E., 83,199,242 181,183-184,186-200,202-203,205-Heron, D., 149-151,242,245 206,210,212-213,216,225,210,228-Hick, W. E., 203, 242 230,232-233,239, 240 Hilbert, D., 66, 242 Fisher Box,J., 112, 114, 117,119-121, Hilgard, E., 206, 242 179, 181, 183, 200,224-226,240 Hogben, L., 82, 242 Fletcher, R., 163,240 Holschuh,N., 19,183,243 Folks, L., 204, 243 Holzinger, K. J., 160, 170,243 Forrest, D. W., 130, 137,241 Hotelling, H., 155,213,243 Fox, J., 175, 199,247 Hume, D., 30,243 Funkhouser,H. G., 54, 241 Hunter, J. E., 83, 243 Huxley, J., 18,129,243 G Huygens,C., 19, 51, 59, 63, 243 Gaito, J., 44,196, 197,214,241 Galbraith, V. H., 48,241 Gallon, F., 2, 4-5,13-14,46, 128-142, 145-146,155,159,247 Garnett, J. C. M., 161,247 Garrett, H. E., 209-210,247 Gauss,C. F.,64-65,91-92, 126, 181-182,247 Geiger, H., 71, 249 Gigerenzer,G., 233, 247 Gini, C., 99,247 Ginzburg,B., 28, 247 Glaisher,J. W. L., 92,247 Gould, S.J.,155,242 Grant, D. A., 210-211,242 Graunt,J., 8,48-52,242 Griinbaum, A., 25, 242 Guilford,J. P., 169,213,242 Guthrie, S. C., 172,242 H

Hacking, I., 57, 59, 63, 80, 83,125, 242 Haggard, E. A., 193, 242 Hall, A. D., 101,245 Halley,E., 51-53,242 Hardyck, C. D., 196,245

J

Jacobs,J., 37, 243 Jeffreys, H. 103,243 Jensen,A. R., 163,243 Jevons,W., 54,243 Jones,L. V., 203, 243 Joynson,R. B., 163, 243 K

Kelley,T.L., 164-165,243 Kempthorne,O., 184, 189, 204, 226,243 Kendall, M. G., 6,37-38,64, 81-82, 8688, 110, 138, 185, 195,213,216,243 Kenna,J. C., 5, 244 Keuls,M, 197, 244 Kevles, D., 15, 244 Kiaer, A. N., 96-97, 244 Kimmel, H. D., 203, 244 Kneale, W., 60, 244 Koch, S., 234, 244 Kolmogorov, A. N., 66-67, 244 Koren, J., 48, 244 Koren, R. S.,228, 244 Kruskal, W., 96-97, 244

256

AUTHOR INDEX

L

O

Ore, O., 8,57-58,246 Lancaster,H. O., 107, 244 Laplace, P. S. de,10,22,61,64-65,79, Osgood,C. E.,205,246 92-93,95,126, 244 P Lawley,D.N., 154, 156,244 Le Cam,L., 222, 244 Parten,M., 94, 246 Lee, A., 118,189,250 Pearl,R., 111,246 Legendre,A.M.,91, 244 Pearson,E. S.,16,64,103,108,112-113, Lehmann,E. L., 219, 221-222,244 115-118,202, 217-218, 220-224, 227, Lindquist, E. F.,210-211,213, 244 230, 238, 246,247 Lord, F. M, 44, 244 Lovie, A. D., 34,161-165,194,198,208- Pearson,K., 4,16, 18, 44, 49, 51,53,64, 75-76, 102, 106,108-114,117-118, 210,212, 245 127,137,141,143-151,153,155,157Lovie, P., 161-163,165,245 158, 161,182,189,200, 206, 217, 222, Lupton,S.,115,245 232, 248 Lush, J. L., 194, 198,245 Peirce,B., 62, 107,248 Peters, C. C., 194,248 M Petrinovich,L. F., 196,248 Petty, W., 48-50,248 Macdonell,W. R., 117,155,245 MacKenzie,D. A., 4, 15, 105,130-131, Phillips, L. D., 79,248 Picard,R., 103,248 144,150,152,245 Playfair, W., 53-54, 248 MacKenzie,W. A. 121, 187,240 Poisson,S.-D., 58-59,70-72,248 Magnello,M. E., 15,245 Popper,K. R., 30, 32, 248 Marks, M. R., 203,245 Porter,!.,15,248 Maunder,W. F., 96 Maxwell, A. E., 154, 156,244 Q McMullen, L., 115,245 McNemar,Q., 104, 213,245 Quetelet,L. A. 46, 55, 91, 249 Mercer, W. B., 101,245 Merriman,M.,91,115,245 R Michea, R., 58, 245 Mill, J. S.,173-174, 245 RAND Corporation,88, 249 Miller, A. G., 19,46,245 Reichenbach,H., 32,249 Miller, G. A., 19,234,245 Reid, C., 217, 219, 221-222, 225-227, Morrison, D. E., 83,199, 242 249 Mosteller, F., 96-97,206, 244 Reitz, W., 207,249 MouLM., 161, 248 Rhodes,E.G., 217,249 Roberts,H. V., 138,252 N Robinson,C., 94,249 Rosenthal,R.,46,249 Nagel,E., 61-63,246 Rosnow,R. L.,46,249 Newman,D., 196, 246 Rowe,F. B., 45, 249 Newman,J. R., 5,59-60,184,246 Rozeboom,W. W., 202,249 Neyman,J., 85,98-100,198, Rucci, A. J., 207-208, 211-213,249 201-203,219-225, 227, 230,246 Russell,B., 23,28,154,249 Norton, B., 112,246

AUTHOR INDEX

Russell, J., 179, 249 Rutherford,E., 71,249 Ryan, T. A., 196-197,249

S Savage,L. J., 102,231,249 Scheff6,H., 197,213,249 Schlosberg,H., 206, 252 Seal,H. L., 181-182,249 Seng,Y. P., 96-97,250 Sheynin,O. B., 72, 107,250 Sidman,M, 207,250 Simpson,T., 53, 89,125,250 Skinner,B. F., 24, 250 Smith, D. E., 59, 250 Smith, K., 112,250 Snedecor,G. W., 123,193,194,213,234, 250 Soper,H.E., 110, 118,250 Spearman,C, 5, 156-164, 166,250 St. Gyres, Viscount, 58, 250 Stanley,!. C., 206,212,237 Stephan,F. F.,93-94,250 Stephen,L., 130,250 Stephenson,W., 167,250 Stevens,S. S., 36,40-44,206, 250 Stigler, S. M., 86, 87, 257 Stratton,F. J. M., 200,252 Struik, D. J., 9,257 "Student", 17, 102-103, 115-118,121, 122, 196,200,257

T

257

Traxler, R. H., 226,257 Tukey, J. W., 196-197,257 Turner, P.M., 130,257 Tweney,R. D., 207-208, 211-213,249 U

Underwood,B. J., 206, 257 V

Venn, J., 12, 62, 81,89-90,257 von Mises,R., 62,81-82, 107,257,252 W

Wald, A., 230, 252 Walker, E. L., 24,252 Walker, H. M., 12,147,252 Wallis, W. A., 138, 252 Watson.J. B.,22,207,252 Weaver,W., 58, 252 Welch, B.L., 117,202,252 Weldon, W. F. R., 14,141-143,146,252 Westergaard,H., 7, 93, 252 Whitney, C. A., 88, 252 Wilson, E. B., 161-162, 165,252 Wishart, J., 210,252 Wolfle, D., 34, 160-161,163, 252 Wood, T. B. 200, 252 Woodworth,R. S., 172, 206,252 Y

Yates,F., 84,104,189,194,216,226,252 Tedin,0.,103,257 Young, A. W., 118,250 Thomson,G. S., 5,156, 165-168,257 Yule, G. U., 5,37-38,88, 108,110, 138, Thomson,W. (Lord Kelvin), 36,257 146,148-151,213,253 Thorndike,E. L., 119, 172,257 Thurstone,L. L., 5,34,163,166-169,257 Z Tippett, L. H. C., 88,196,257 Zubin,J.,209,210,247 Todhunter,L, 64-65,257

Subject Index

A Actuarial science,51 Analysis of variance, 5, 35, 102-104, 177-181, 187-195 Anthropometric laboratory,132 Arithmetic mean,18, 89-91 Average,18, 88 B

Bayes'theorem,77-80,83, 231 Behaviorism,22 Bills of mortality, 8,49-52 Binomial distribution,10-11,64-65, 68-70 Biometrics,4, 14-18

C Causality,seealso Determinism,30 Censuses,47, 93 Central limit theorem,95, 107, 124-126 Centroid method,168 Chi square, controversy,110-114 distribution, 105-114 test, 109 Communalities,168 Confidenceintervals, 100, 199-203 Control, 20, 31,38, 171-178 Mill's methods of enquiry and, 173176 statistical, 176-181 Correlation,2, 35, 129, 138-146,174 coefficient of, 5,116, 141-146,158 controversies,146-153

258

Correlation (cont.), distribution of coefficient, 118 intraclass,5, 119,123,189-193 nominal variables, 148-151 probableerror, 117 D

Data organization,47-55 Degreesof freedom, 110-114 De Mere's problems, 9, 58-59 Determinism,21-26, 118 in psychology,33-35

E Error estimation,10, 63, 90,101-103 Estimation,theory of, 98-101 Eugenics,4, 15,130-131, 152-153,187 Evolution, 1-3, 17, 129,143 natural selection,1, 129, 130,143 Expected mean squares, 213-215 Experimentaltexts, 205-207 Experiments, designof, 31,101,171-185 tea-tasting, 182-184

F F ratio, 123, 193,233 distribution, 121-123 Factor analysis,154-170 Factor rotation,169 Free will, 24 Frequencydistributions,68

SUBJECT INDEX

G

Gaming, 8, 56-59 Generallinear model,27, 93, 181-183 Genetics,15,17-18 Gower Street, 151,225 Graphical methods,53-55 H

Heredity, 131 Hypotheses, alternative,217-224,229-233 null, 57,184,217-219 testing, 7, 222-224, 228-233 I

Induction, 27-30, 175,230 Inference,29, 31-33 early, 47-50 Fisherian, 81-83 practical, 77-84 statistical,7, 31, 33, 217 Insuranceandannuities,50-53,63 Intelligence,158-161 Inventories,6, 47-^48

L Law of error, 10, 12,89 Law of large numbers,59-60 Least squares,89, 91-93,148,182 Likelihood criterion, 82,218-220 Logical positivism,45-46 M

Mean, seearithmetic mean Measurement,2,36-38,40 error, 38, 44 definition, 40 interval and ratio scales,42 nominal scale,41, 148-150 ordinal scale,41 Multiple comparisons,195-199

259

N

Naturalism,130 Neyman-Pearson theory, 217-224 controversy with Fisher,228-233 Normal Distribution, 10, 12-14, 64-66, 72-76, 107-108, 125, 128 O

One-tail andtwo-tail tests,203-204 Operationalism,45-46

p Parameters,7, 95 Pascal'striangle, 9, 59,69 Personal equation, 45 Poisson distribution,70-72 Political arithmetic,48-50 Polling, 94-95 Population,7, 82, 85, 87,95-101,105 Power,222-224,232-233 Prediction,30 Primary mental abilities,169 Principal component analysis, 155 Probabilisticanddeterministic models, 26 Probability, 7-9, 32, 56 and error estimation,63 and weight, 32 beginningsof the concept,56-60 distributions,68-76 fiducial, 81-83,99,202,231 inverse,65, 77-80, 119,221 mathematicaltheory,66 meaningof, 60-63 subjective,61,221 Probableerror, 127,139,200 R

Randomness,85-88,172 randomnumbers,85-88 random sampling,87 randomization,94, 101-104, 177-179, 188

260

SUBJECT INDEX

Statistics (cont.), Regression,3, 17, 129-138,135 in practice,233 Representative sampling, 93,96 Rothamsted Experimental Station, 6,101, in psychology,33-35 textbooks,212-213 120,176,188 parapsychology,209 Royal Society,8, 17,49-51 vital, 8, 49-51

s

T

Sampling, /ratio, 196,218 distributions,16,105-126 distribution, 114-121 in practice, 95-98 Tetrad difference,160 random,98 Trial of the Pyx, 86 representationandbias, 93-95 Type I error,95, 196-198,201,223,231 Science,21-26, 36-37,45 Significance, 17,19,83,183-184,199- Type II error, 223, 229,231 200,218 U Standard deviation,73-75,116,127 Standard error, 33,141 Uncertainty principle,23,46 Standard scores, 4, 75,127-129,137 Statistics, V arguments,224-228 criticism of, 18-20,233-235 Variation, 1,4,132,187 definition of, 6 Fisher v. Neyman and Pearson,228Z 233 Fisherian,186-187,224 2 scores,seestandard scores in journalsandpapers,207-212