QUANTITATIVE DATA ANALYSIS
Doing Social Research to Test Ideas
DONALD J. TREIMAN
Copyright © 2009 by John Wiley & Sons, Inc. All rights reserved.
Published by Jossey-Bass
San Francisco, CA 94103, www.josseybass.com

No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, except as permitted under the United States Copyright Act, without either the prior written permission of the publisher or authorization through payment of the appropriate per-copy fee to the Copyright Clearance Center, Danvers, MA 01923, (978) 750-8400, fax (978) 646-8600. Requests to the publisher for permission should be addressed to the Permissions Department, John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ, or online at www.wiley.com/go/permissions.

Readers should be aware that Internet Web sites offered as citations or sources for further information may have changed or disappeared between the time this was written and when it is read.

Limit of Liability/Disclaimer of Warranty: While the publisher and author have used their best efforts in preparing this book, they make no representations or warranties with respect to the accuracy or completeness of the contents of this book and specifically disclaim any implied warranties of merchantability or fitness for a particular purpose. No warranty may be created or extended by sales representatives or written sales materials. The advice and strategies contained herein may not be suitable for your situation. You should consult with a professional where appropriate. Neither the publisher nor author shall be liable for any loss of profit or any other commercial damages, including but not limited to special, incidental, consequential, or other damages.

Jossey-Bass books and products are available through most bookstores. To contact Jossey-Bass directly call our Customer Care Department within the United States at (800) 956-7739, outside the United States at (317) 572-3986, or via fax at (317) 572-4002.

Jossey-Bass also publishes its books in a variety of electronic formats. Some content that appears in print may not be available in electronic books.

Library of Congress Cataloging-in-Publication Data
Treiman, Donald J.
Quantitative data analysis : doing social research to test ideas / Donald J. Treiman.
p. cm.
1. Social sciences--Research--Statistical methods. 2. Sociology--Research--Statistical methods. 3. Sociology--Statistical methods--Computer programs. 4. Social sciences--Statistical methods--Computer programs. 5. Stata. I. Title.
HA29.T675 2008
300.72--dc22
Printed in the United States of America
FIRST EDITION
PB Printing 10 9 8 7 6 5 4 3 2 1
20080131:v
CONTENTS

Tables, Figures, Exhibits, and Boxes
Preface
The Author
Introduction

1 CROSS-TABULATIONS
  What This Chapter Is About
  Introduction to the Book via a Concrete Example
  Cross-Tabulations
  What This Chapter Has Shown

2 MORE ON TABLES
  What This Chapter Is About
  The Logic of Elaboration
  Suppressor Variables
  Additive and Interaction Effects
  Direct Standardization
  A Final Note on Statistical Controls Versus Experiments
  What This Chapter Has Shown

3 STILL MORE ON TABLES
  What This Chapter Is About
  Reorganizing Tables to Extract New Information
  When to Percentage a Table "Backwards"
  Cross-Tabulations in Which the Dependent Variable Is Represented by a Mean
  Index of Dissimilarity
  Writing About Cross-Tabulations
  What This Chapter Has Shown

4 ON THE MANIPULATION OF DATA BY COMPUTER
  What This Chapter Is About
  Introduction
  How Data Files Are Organized
  Transforming Data
  What This Chapter Has Shown
  Appendix 4.A Doing Analysis Using Stata
    Tips on Doing Analysis Using Stata
    Some Particularly Useful Stata 10.0 Commands

5 INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
  What This Chapter Is About
  Introduction
  Quantifying the Size of a Relationship: Regression Analysis
  Assessing the Strength of a Relationship: Correlation Analysis
  The Relationship Between Correlation and Regression Coefficients
  Factors Affecting the Size of Correlation (and Regression) Coefficients
  Correlation Ratios
  What This Chapter Has Shown

6 INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)
  What This Chapter Is About
  Introduction
  A Worked Example: The Determinants of Literacy in China
  Dummy Variables
  A Strategy for Comparisons Across Groups
  A Bayesian Alternative for Comparing Models
  Independent Validation
  What This Chapter Has Shown

7 MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS
  What This Chapter Is About
  Nonlinear Transformations
  Testing the Equality of Coefficients
  Trend Analysis: Testing the Assumption of Linearity
  Linear Splines
  Expressing Coefficients as Deviations from the Grand Mean (Multiple Classification Analysis)
  Other Ways of Representing Dummy Variables
  Decomposing the Difference Between Two Means
  What This Chapter Has Shown

8 MULTIPLE IMPUTATION OF MISSING DATA
  What This Chapter Is About
  Introduction
  A Worked Example: The Effect of Cultural Capital on Educational Attainment in Russia
  What This Chapter Has Shown

9 SAMPLE DESIGN AND SURVEY ESTIMATION
  What This Chapter Is About
  Survey Samples
  Conclusion
  What This Chapter Has Shown

10 REGRESSION DIAGNOSTICS
  What This Chapter Is About
  Introduction
  A Worked Example: Societal Differences in Status Attainment
  Robust Regression
  Bootstrapping and Standard Errors
  What This Chapter Has Shown

11 SCALE CONSTRUCTION
  What This Chapter Is About
  Introduction
  Validity
  Reliability
  Scale Construction
  Errors-in-Variables Regression
  What This Chapter Has Shown

12 LOG-LINEAR ANALYSIS
  What This Chapter Is About
  Introduction
  Choosing a Preferred Model
  Parsimonious Models
  A Bibliographic Note
  What This Chapter Has Shown
  Appendix 12.A Derivation of the Effect Parameters
  Appendix 12.B Introduction to Maximum Likelihood Estimation
    Mean of a Normal Distribution
    Log-Linear Parameters

13 BINOMIAL LOGISTIC REGRESSION
  What This Chapter Is About
  Introduction
  Relation to Log-Linear Analysis
  A Worked Logistic Regression Example: Predicting Prevalence of Armed Threats
  A Second Worked Example: Schooling Progression Ratios in Japan
  A Third Worked Example (Discrete-Time Hazard-Rate Models): Age at First Marriage
  A Fourth Worked Example (Case-Control Models): Who Was Appointed to a Nomenklatura Position in Russia?
  What This Chapter Has Shown
  Appendix 13.A Some Algebra for Logs and Exponents
  Appendix 13.B Introduction to Probit Analysis

14 MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION
  What This Chapter Is About
  Multinomial Logit Analysis
  Ordinal Logistic Regression
  Tobit Regression (and Allied Procedures) for Censored Dependent Variables
  Other Models for the Analysis of Limited Dependent Variables
  What This Chapter Has Shown

15 IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING
  What This Chapter Is About
  Introduction
  Fixed Effects Models for Continuous Variables
  Random Effects Models for Continuous Variables
  A Worked Example: The Determinants of Income in China
  Fixed Effects Models for Binary Outcomes
  A Bibliographic Note
  What This Chapter Has Shown

16 FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
  What This Chapter Is About
  Research Design Issues
  The Importance of Probability Sampling
  A Final Note: Good Professional Practice
  What This Chapter Has Shown

Appendix A: Data Descriptions and Download Locations for the Data Used in This Book
Appendix B: Survey Estimation with the General Social Survey
References
Index
TABLES
1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.
1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.
1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.
1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.
1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.
1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).
2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
2.7. Percentage Accepting the Scientific View of Evolution by Age.
2.8. Percentage Distribution of Educational Attainment by Religion.
2.9. Percentage Distribution of Age by Religion.
2.10. Joint Probability Distribution of Education and Age.
2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Sex (Percentage Bases in Parentheses).
2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
2.14. Mean Number of Chinese Characters Known (Out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in Distribution of Education; N = 6,081).
3.1. Frequency Distribution of Acceptance of Abortion by Religion and Education, U.S. Adults, 1965 (N = 1,368).
3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.4. Means and Standard Deviations of Income in 1979 by Education and Gender, U.S. Adults, 1980.
3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
3.6. Percentage Distribution Over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).
5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.
6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).
6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.
6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).
7.1. Demonstration That Inclusion of a Linear Term Does Not Affect Predicted Values.
7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).
7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086).
7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p Values in Parentheses).
7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).
7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F = 12.48; p < .0000).
7.8. Means, Standard Deviations, and Correlations for Variables Included in a Model of Educational Attainment for U.S. Adults, 1990 to 2004, by Race (Blacks Above the Diagonal, Non-Blacks Below).
7.9. Coefficients of a Model of Educational Attainment, for Blacks and Non-Blacks, U.S. Adults, 1990 to 2004.
7.10. Decomposition of the Difference in the Mean Years of School Completed by Non-Blacks and Blacks, U.S. Adults, 1990 to 2004.
8.1. Descriptive Statistics for the Variables Used in the Analysis, Russian Adults Age Twenty-Two to Sixty-Nine in 1993 (N = 4,685).
8.2. Comparison of Coefficients for a Model of Educational Attainment Estimated from a Casewise-Deleted Data Set [C] (N = 2,661) and from a Multiply Imputed Data Set [M] (N = 4,685), Russian Adults Age Twenty-Two to Sixty-Nine in 1993.
9.1. Portion of a Table of Random Numbers.
9.2. ... of the Total Population Residing in Each of the Ten Largest Cities in California, 1990.
9.3. Design Effects for Selected Statistics, Samples of 3,000 with Clustering (50 Counties as Primary Sampling Units, 2 Villages or Neighborhoods per County, and 30 Adults Age 20 to 69 per Village or Neighborhood), With and Without Stratification, by Level of Education.
9.4. Determinants of the Number of Chinese Characters Correctly Identified on a 10-Item Test, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).
9.5. Coefficients for Models of the Determinants of Income, U.S. Adult Women, 1994, Under Various Design Assumptions (N = 1,015).
9.6. Coefficients of a Model of Educational Attainment, U.S. Adults, 1990 to 2004 (N = 15,932).
10.1. Coefficients for Models of the Determinants of the Strength of the Occupation-Education Connection in Eighteen Nations.
11.1. Values of Cronbach's Alpha for Multiple-Item Scales with Various Combinations of the Number of Items and the Average Correlation Among Items.
11.2. Factor Loadings for Abortion Acceptance Items Before Rotation.
11.3. Abortion Factor Loadings After Varimax Rotation.
11.4. Means, Standard Deviations, and Correlations Among Variables Included in Models of the Acceptance of Legal Abortion, U.S. Adults, 1984 (N = 1,459).
11.5. Coefficients of Two Models Predicting Acceptance of Abortion, U.S. Adults, 1984.
11.6. Mean Score on the ISEI by Level of Education, Chinese Males Age Twenty to Sixty-Nine, 1996.
11.7. Coefficients of a Model of the Determinants of Political Conservatism Estimated by Conventional OLS and Errors-in-Variables Regression, U.S. Adults, 1984 (N = 1,294).
12.1. Frequency Distribution of Program by Sex in a Graduate Course.
12.2. Frequency Distribution of Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies.
12.3. Models of the Relationship Between Technology, Political Integration, and Level of Stratification in Ninety-Two Societies.
12.4. Percentage Distribution of Expected Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies (Expected Frequencies from Model 7 Are Percentaged).
12.5. Frequency Distribution of Whether "A Communist Should Be Allowed to Speak in Your Community" by Schooling, Region, and Age, U.S. Adults, 1977 (N = 1,478).
12.6. Goodness-of-Fit Statistics for Log-Linear Models of the Associations Among Whether a Communist Should Be Allowed to Speak in Your Community, Age, Region, and Education, U.S. Adults, 1977.
12.7. Expected Percentage (from Model 8) Agreeing That "A Communist Should Be Allowed to Speak in Your Community" by Education, Age, and Region, U.S. Adults, 1977.
12.8. Frequency Distribution of Voting by Race, Education, and Voluntary Association Membership.
12.9. Frequency Distribution of Occupation by Father's Occupation, Chinese Adults, 1996.
12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9.
12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School, 1996.
13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
13.2. Goodness-of-Fit Statistics for Various Models Predicting the Prevalence of Armed Threat to U.S. Adults, 1973 to 1994.
13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
13.5. Effect Parameters for Model 3 of Table 13.4.
13.6. Odds Ratios for a Model Predicting the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables.
13.7. Coefficients for a Model of Determinants of Nomenklatura Membership, Russia, 1988.
13.B.1. Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3).
14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p Values in Italic.)
14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998 (N = 2,443).
14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.
14.4. Effect Parameters for a Generalized Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.
14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.
14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults, 2000.
14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Are Significant at .001 or Beyond.)
15.1. Socioeconomic Characteristics of Chinese Adults by Size of Place of Residence, 1996.
15.2. Comparison of OLS and FE Estimates for a Model of the Determinants of Family Income, Chinese RMB, 1996 (N = 5,342).
15.3. Comparison of OLS and FE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)
FIGURES
2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.
2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.
2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.
2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
4.1. An IBM Punch Card.
5.1. Scatter Plot of Years of Schooling by Father's Years of Schooling (Hypothetical Data, N = 10).
5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling.
5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined.
5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) Perfect Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation (a Parabola Symmetrical to the X-Axis).
5.5. The Effect of a Single Deviant Case (High Leverage Point).
5.6. Truncating Distributions Reduces Correlations.
5.7. The Effect of Aggregation on Correlations.
6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802). (Note: the female line does not extend beyond 16 because there are no females in the sample with post-graduate education.)
6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).
7.1. The Relationship Between 2003 Income and Age, U.S. Adults Age Twenty to Sixty-Four in 2004 (N = 1,573).
7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7; N = 1,459).
7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7).
7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464).
7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample).
7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5).
7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). Predicted Values from a Linear Spline with a Knot at 1947.
7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086).
7.10. Figure 7.9 Rescaled to Show the Entire Range of the Y-Axis.
10.1. Four Scatter Plots with Identical Lines.
10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).
10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).
10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989).
10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D.
10.7. Added-Variable Plots for Treiman and Yip's Equation 7.
10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7.
10.10. Objective Functions for Three M Estimators: (a) OLS Objective Function, (b) Huber Objective Function, and (c) Bi-Square Objective Function.
10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries.
11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556).
13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994.
13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994.
13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994.
13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994.
13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients.
14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552).
14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258).
16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability).
16.2. Blau and Duncan's Basic Model of the Process of Stratification.
EXHIBITS
4.1. Illustration of How Data Files Are Organized.
4.2. A Codebook Corresponding to Exhibit 4.1.
BOXES
Open-Ended Questions
Samuel A. Stouffer
Technical Points on Table 1.1
Technical Points on Table 1.2
Technical Points on Table 1.3
Technical Points on Table 1.4
Technical Points on Table 1.5
Technical Points on Table 1.6
Paul Lazarsfeld
Hans Zeisel
Stata -do- Files and -log- Files
Direct Standardization in Earlier Survey Research
The Weakness of Matching and a Useful Fix
Technical Points on Table 3.3
Substantive Points on Table 3.3
A Historical Note on Social Science Computer Packages
Herman Hollerith
The Way Things Were
Treating Missing Values as If They Were Not
People Generally Like to Respond to (Well-Designed and Well-Administered) Surveys
Why Use the "Least Squares" Criterion to Determine the Best-Fitting Line?
Karl Pearson
A Useful Computational Formula for r
A "Real Data" Example of the Effect of Truncating the Distribution
A Useful Computational Formula for η²
Multicollinearity
Reminder Regarding the Variance of Dichotomous Variables
A Formula for Computing R² from Correlations
Adjusted R²
Always Present Descriptive Statistics
Technical Point on Table 6.2
Why You Should Include the Entire Sample in Your Analysis
Getting p-values via Stata
Using Stata to Compare the Goodness-of-Fit of Regression Models
R. A. (Ronald Aylmer) Fisher
How to Test the Significance of the Difference Between Two Coefficients
Alternative Ways to Estimate BIC
Why the Relationship Between Income and Age Is Curvilinear
A Trick to Reduce Collinearity
In Some Years of the GSS, Only a Subset of Respondents Was Asked Certain Questions
An Alternative Specification of Spline Functions
Why Black versus Non-Black Is Better Than White versus Non-White for Social Analysis in the United States
A Comment on Credit in Science
Why Pairwise Deletion Should Be Avoided
Technical Details on the Variables
Telephone Surveys
Mail Surveys
Web Surveys
Philip M. Hauser
A Superior Sampling Procedure
Sources of Nonresponse
Leslie Kish
How the Chinese Stratified Sample Used in the Design Experiments Was Constructed
Weighting Data in Stata
Limitations of the Stata 10.0 Survey Estimation Procedure
An Alternative to Survey Estimation
How to Downweight Sample Size in Stata
Ways to Assess Reliability
Why the SAT and GRE Tests Include Several Hundred Items
Transforming Variables so That "High" Has a Consistent Meaning
Constructing Scales from Incomplete Information
In Log-Linear Analysis "Interaction" Simply Means "Association"
L² Defined
Other Software for Estimating Log-Linear Models
Maximum Likelihood Estimation
Probit Analysis
Technical Point on Table 13.1
Limitations of Wald Tests
Smoothing Distributions
Estimating Generalized Ordered Logit Models with Stata
James Tobin
Panel Surveys in the Public Domain
Otis Dudley Duncan
Sewell Wright
Ask a Foreigner to Do It
George Peter Murdock
In the United States, Publicly Funded Studies Must Be Made Available to the Research Community
An "Available from Author" Archive
PREFACE
This is a book about how to conduct theoretically informed quantitative social research: that is, social research to test ideas. It derives from a course for graduate students in sociology and other social sciences and social science-based professional schools (public health, education, social welfare, urban planning, and so on) that I have been teaching at UCLA for some thirty years. The course has evolved as quantitative methods in the social sciences have advanced; early versions of the course were based on the first half of this book (through Chapter Seven), with additional materials added over the years. Interestingly, I have been able to retain the same format (a twenty-week course with one three-hour lecture per week and a weekly exercise, culminating in a term paper written during the last four weeks of the course) from the outset, which is, I suppose, a tribute to the increasing level of preparation and quantitative competence of graduate students in the social sciences. The book owes much to lively class discussions over the years, of often subtle and complex methodological points.

By the end of the book, you should know how to make substantive sense of a body of quantitative data. That is, you should be well prepared to produce publishable papers in your field, as well as first-rate dissertation chapters. Of course, there is always more to learn. In the final chapter (Chapter Sixteen), I discuss advanced topics that go beyond what can be covered in a first course in data analysis.

The focus is on the analysis of data from representative samples of well-defined populations, although some exceptions are considered. The populations can consist of almost anything (people, formal organizations, societies, occupations, pottery shards, or whatever); the analytic issues are essentially the same. Data collection procedures are mentioned only in passing. There simply is not enough space in an already lengthy book to do justice to both data analysis and data collection. Thus, you will need to look elsewhere for systematic instruction on data-collection procedures.
A strong case can be made that you should do this after rather than before a course on data analysis because the main problem in designing a data collection effort is deciding what to collect, which means you first need to know how you will conduct your analysis. An alternative method of learning about the practical details of data collection is to become an apprentice (unpaid, if necessary) to someone who is about to conduct a survey and insist that you get to participate in it step-by-step even when your presence is a nuisance.

This book covers a variety of techniques, including tabular analysis, log-linear models for tabular data, regression analysis in its various forms, regression diagnostics and robust regression, ways to cope with missing data, logistic regression, factor-based and other techniques for scale construction, and fixed- and random-effects models as a way to make causal inferences. But this is not a statistics book; the emphasis is on using these procedures to draw substantive conclusions about how the social world works. Accordingly, the book is designed for a course to be taken after a first-year graduate statistics course in the social sciences. Although there are many equations in the book, this is because it is necessary to understand how statistical procedures work to use them intelligently. Because the emphasis is on applications, there are many worked examples, often adapted from my own research. In addition to data from sample surveys I have conducted, I also rely heavily on the General Social Survey, an omnibus survey designed for use by the research community and also for teaching. Appendix A describes the main data sets used for the substantive examples and provides information on how to obtain them; they are all available without cost.

The only prerequisites for successful use of this book are a prior graduate-level social science statistics course, a willingness to think carefully and work hard, and the ability to do high school algebra, either remembered or relearned. With only a handful of exceptions (references at one or two points to calculus and to matrix algebra), no mathematics beyond high school algebra is used. If your high school algebra is rusty, you can find good reviews in Helen Walker, Mathematics Essential for Elementary Statistics, and W. L. Bashaw, Mathematics for Statistics. These books have been around forever. Although more recent equivalents probably exist, school algebra has not changed, so it hardly matters. Copies of these books are readily available at amazon.com, and probably many other places as well.

The statistical software package used in this book is Stata (release 10). Downloadable command files (-do- files in Stata's terminology), files of results (-log- files), and ancillary computer files used in the computations are available at www.josseybass.com/go/quantitativedataanalysis. Often the details underlying particular computations are only found in the downloadable -do- and -log- files, so be sure to download and study them carefully. These files will be updated as new releases of Stata become available.

I use Stata in my teaching and in this book because it has very rapidly become the statistical package of choice in leading sociology and economics departments. This is not accidental. Stata is a fast and efficient package that includes most of the statistical procedures of interest to social scientists, and new commands are being added at a rapid pace. Although many statistical packages are available, the three leading contenders currently are Stata, SPSS, and SAS. As software, Stata is clearly superior to SPSS: it is faster, more accurate, and includes a wider range of applications. SAS, although very powerful, is not nearly as intuitive as Stata and is more difficult to learn (and to teach). Nonetheless, this book can be readily used in conjunction with either SPSS or SAS, simply by translating the syntax of the Stata -do- files. (I have done something like this, exploiting Allison's excellent, but SAS-based, exposition of fixed- and random-effects models [Allison 2005] by writing the corresponding Stata code.)
FOR INSTRUCTORS

Some notes on how I have used these materials in teaching may be helpful to you as you design your own course. As noted previously, the course on which this book is based runs for two quarters (twenty weeks). I have offered one three-hour lecture per week and have assigned an exercise every week. When I first taught the course, I read these exercises myself, but as enrollments have increased, I have enjoyed the services of a T.A. (chosen from among students who had done well in the course in previous years), who assists students with the [...] of computing and statistics and also reads and comments on the exercises. In recent years, I have offered seventeen lectures and have assigned exercises for all but the [...], with the final month of the course devoted to producing two drafts of a term paper. [...] I read the first drafts and write comments, in an attempt to emulate the journal submission process. Thus, in my course, everyone gets a "revise and resubmit" response. I encourage students to develop their term papers in the course of doing the exercises and to complete their drafts in the two weeks after the last exercise is due.

The initial exercises are designed to lead students in a guided way through the mechanics of analysis, and some of the later exercises do this as well. But the exercises increasingly take a free form: "carry out an analysis like that presented in the book." Illustrative answers are provided for those exercises that involve definitive answers, that is, those that are similar to statistics problem sets.

The course syllabus, weekly exercises, and illustrative answers to those exercises for which I have written illustrative answers are available for downloading from www.josseybass.com/go/quantitativedataanalysis.
ACKNOWLEDGMENTS

As I noted earlier, this book has been developed in interaction with many cohorts of graduate students at UCLA who have wrestled with each of the chapters included here and have revealed troubles in the exposition, sometimes by way of explicit comments and sometimes via displays of confusion. The book would not exist without them, as I never imagined myself writing a textbook, and so I owe them great thanks. One in particular, Pamela Stoddard, literally caused the book to be published in its current form by mentioning in the course of a chance airplane conversation with Andrew Pasternack, a Jossey-Bass acquisitions editor, that her professor was thinking of publishing the chapters he used as a course text. Andy contacted me, and the rest is history.

The course on which this book is based first came into being through collaboration with my colleague Jonathan Kelley, when he was a visiting professor at UCLA in the [...]. The first exercise is borrowed from him, and the general thrust of the course, especially the first half, owes much to him.

My colleague, Bill Mason, recently retired from the UCLA Sociology and Statistics departments, has been my statistical guru for many years. Often I have turned to him for insights into difficult statistical issues. And much that I have learned about topics that were not part of the curriculum when I was a graduate student has been from sitting in on [...] statistics courses offered by Bill. Another colleague, Rob Mare, has been helpful in much the same way. My new colleague, Jennie Brand, who took over my quantitative data analysis course in the fall of 2008, has read the entire manuscript and has offered [...] helpful suggestions. Finally, the book has benefited greatly from very careful reading by a group of about 100 Chinese students, to whom I gave a special version of the course in an intensive summer session at Beijing University in July 2008. They caught many errors that had gone unnoticed and raised often subtle points that resulted in the reworking of selected portions of the text.

My understanding of research design and statistical issues, especially concerning causality and threats to causal inference, has benefited greatly from the weekly seminar of the California Center for Population Research, which brings together sociologists, economists, and other social scientists to listen to, and comment on, presentations of work in progress, mainly by visitors from other campuses. The lively and wide-ranging discussion has been something of a floating tutorial, a realization of what I have imagined academic life could and should be like.

Finally, my wife, Judith Herschman, has displayed endless patience, only occasionally asking, "When are you going to finally publish your methods book?"
THE AUTHOR

Donald J. Treiman is distinguished professor of sociology at the University of California at Los Angeles (UCLA) and was until recently director of UCLA's California Center for Population Research. He has a BA from Reed College (1962) and an MA and PhD from the University of Chicago (1967). As a graduate student at Chicago, he spent most of his [...] at the National Opinion Research Center (NORC), where he gained valuable training and experience in survey research. He then taught at the University of Wisconsin, where he decided that he really was a social demographer at heart, and made the Center for Demography and Ecology his intellectual home. From Wisconsin, he moved to Columbia University and then, in 1975, to UCLA, where he has been ever since, albeit with temporary sojourns elsewhere, as staff director of a study committee at the National Academy of Sciences/National Research Council (1978-1981) and fellowship years at the U.S. Bureau of the Census (1987-1988), the Center for Advanced Study in the Behavioral Sciences (1992-1993), and the Netherlands Institute for Advanced Study in the Humanities and Social Sciences (1996-1997).

Professor Treiman started his career as a student of social stratification and status attainment, particularly from a cross-national perspective, and this has remained a continuing interest. He and his Dutch colleague, Harry Ganzeboom, have been engaged in a long-term cross-national project to analyze variations in the status attainment process across nations throughout the world over the course of the twentieth century. To date, they have compiled an archive of more than 300 sample surveys from more than 50 nations, ranging through the last half of the century. In addition to his comparative project, Professor Treiman has conducted large-scale national probability sample surveys in South Africa (1991-1994), Eastern Europe (1993-1994), and China (1996), all concerned with various aspects of social inequality.

His current research has moved in a more demographic direction. He has a national probability sample survey currently in progress in China, which focuses on the determinants, dynamics, and consequences of internal migration.
INTRODUCTION

It is not uncommon for statistics courses taken by graduate students in the social sciences to be treated essentially as mathematics courses, with substantial emphasis on derivations and proofs. Even when empirical examples are used, which they frequently are because [...]
OVERVIEW OF CHAPTERS

This book begins at the beginning, with the most basic approach to analyzing nonexperimental data: percentage tables. Chapters One through Three describe the logic of cross-tabulations and provide many technical details on how to produce attractive (because they are clear and easy to read) tables. The two central ideas in these chapters are deciding in which direction to percentage tables and understanding statistical controls. It turns out that the first of these is difficult for some students, much more difficult than coping with complex mathematical formulas, which we do later in the book. Thus, even if you think you already know all there is to know about percentage tables, I encourage you to pay careful attention to these chapters. Doing so will pay great dividends.

Chapter Four is an introduction to computing. In this chapter, I show how data are organized for analysis by computer and how analysis is conducted using statistical software. I also provide hints for using Stata, the statistical package used in this book. However, the chapter is written in such a way that it also can serve as an introduction to other statistical packages, such as SPSS and SAS.

Chapters Five through Seven consider ordinary least squares correlation and regression, the workhorse of statistical analysis in the social sciences. These procedures provide a way of quantifying the relationship between some quantitative outcome and its determinants: for example, how much of a difference in income should we expect for people who differ by a given number of years in their level of schooling, holding constant other confounding factors? They also provide a way of assessing how good our prediction is: for example, how much of the variability in income can be attributed to differences in education, gender, race, and so on. Chapter Five focuses on two-variable correlation and regression to get the logic straight and to consider some common errors in interpretation of correlation and regression statistics. Chapter Six considers multiple regression, which is used when there are several predictors of a particular outcome, and introduces the idea of "dummy" or dichotomous variables, which require special treatment. Making use of dummy variables and "interaction terms," I offer a strategy for assessing whether social processes differ across population groups, a frequent question in the social sciences. Chapter Seven offers a variety of tricks that permit relatively refined hypotheses to be tested within a regression framework.

Most data sets analyzed by social scientists are plagued by missing data: information on particular variables that is missing for specific individuals. Chapter Eight reviews ways to cope with missing data, culminating in a demonstration of how to do multiple imputation of missing data, the current state-of-the-art approach.

Chapter Nine takes up the issue of sampling and its implications for statistical analysis.
Whereas the previous chapters assumed simple random sampling, most general population samples are actually complex, multistage samples. Correctly analyzing data from such samples requires that we take account of the "clustering" of observations when we compute standard errors. This chapter introduces survey estimation procedures, which do this.

There are many pitfalls to regression that can trap the unwary. As noted, these are first discussed (briefly) in Chapter Five. Chapter Ten gives them a fuller treatment, through the introduction of what are known as regression diagnostics. These procedures provide protection against the possibility of making false inferences from regression results.

Chapter Eleven shows why and how to construct multiple-item scales, focusing principally on factor-based scaling but also introducing effect-proportional scaling. Often we want to study concepts for which no one item in a questionnaire provides an adequate measure, for example, "level of living," "liberalism," "type-A personality," and "depression." Summary measures, or scales, based on several items usually provide variables that are both more reliable and more valid than single items. This chapter shows how to create such scales and how to use them.

Chapters Twelve through Fourteen provide techniques for considering limited dependent variables. Ordinary least squares regression is designed to handle outcome variables that can be treated as continuous, such as income, years of schooling, and so on. But many outcome variables of interest to social scientists are dichotomous (for example, whether people vote, marry, have been victimized by crime, and so on), and others are polytomous (political affiliation in a multiparty system, occupational category, type of [...] attended, and so on). Log-linear analysis and logistic regression are techniques for dealing with limited dependent variables. Chapter Twelve considers log-linear analysis, a technique for making rigorous inferences about the relationships among a set of [...] variables, that is, inferences about the degree and pattern of associations among cross-tabulated variables. In this sense, log-linear analysis provides a way of doing statistical inference about the kinds of tables we consider in Chapters One through Three. Chapter Thirteen introduces binary logistic regression, an appropriate technique for analyzing dichotomous outcomes, and then shows how to use this technique to handle [...] kinds of cases: progression ratios, where what is being studied are the factors affecting whether people move through a series of steps, say from one level of schooling to the next; discrete-time hazard rate models, where what is being studied is the likelihood that an event (say, first marriage) occurs at a point in time (say, a given age); and case-control models, which provide a way of studying the likelihood of rare events such as contracting diseases, gaining elite occupations, and so on. Chapter Fourteen shows how to study still other limited dependent variables: unordered polytomous variables, such as type of place of residence, via multinomial logistic regression; ordinal outcomes, in which the order of categories is known but not the distance between categories, such as some attitude scales (are you "very happy," "somewhat happy," or "not too happy"), via ordinal logistic regression; and "censored" variables, where the range of a scale is truncated, for example, an income variable with the top category "$100,000 per year or more," via tobit regression.

When using nonexperimental data, it frequently is difficult to definitively establish that one variable causes another because they both could depend on still a third variable, often unmeasured. Chapter Fifteen provides a class of techniques, known as fixed-effects and random-effects models, for dealing with such problems when one has suitable data: either panel data, in which data are available for the same individuals at more than one point in time, or clustered data, in which observations are available for more than one individual in a family, school, community, or other unit. When appropriate data are available, this is a very powerful approach.

The final chapter (Chapter Sixteen) considers techniques that are beyond what I have been able to cover in this book and beyond what usually can be covered in a first-year graduate course in quantitative data analysis. Many of these techniques, now widely used in economics, are ways of coping with various versions of the endogeneity problem, the possibility that unmeasured variables affect both predictors and outcomes, resulting in biased estimates. Fixed- and random-effects models provide one way of dealing with such problems, but many other techniques are available, which are reviewed in Chapter Sixteen. I also briefly introduce structural equation modeling, a technique for dealing with complex social processes in which an outcome variable is a predictor variable for another outcome. For example, in status attainment analysis, we would want to study how the social status of parents affects education, how parental status and education affect the status of the first job, and so on. The brief introduction to these advanced techniques is intended to provide guidance in pursuing more advanced training in quantitative analysis. I conclude the chapter with advice on good research practice: hints on how to improve the quality of your work and how to save time and energy in the bargain.
CHAPTER ONE

CROSS-TABULATIONS

WHAT THIS CHAPTER IS ABOUT

In this chapter, we start with an introduction to the elements of quantitative analysis, the [...] to be covered in this book. Then we deal with the most basic of all quantitative analyses, cross-tabulations or percentage tables. (Strictly speaking, not all percentage tables are cross-tabulations because we can percentage univariate distributions. But the emphasis of this chapter will be on how to percentage tables involving the simultaneous tabulation of two or more variables.) Although the procedures are basic, they are not trivial. There are clear principles for deciding how to percentage cross-tabulations. We will cover these principles and also their exceptions. In the course of doing this, we will consider the logic of causal argument. Then we will consider other ways, besides percentage tables, of summarizing univariate and multivariate distributions of data, as well as ways of assessing the relative size of associations between pairs of variables, controlling for or holding constant other variables. Take this chapter seriously, even if you have encountered percentage tables before and think you know a lot about them. In my experience, getting right the logic of how to percentage a table proves to be very difficult for many students, much more difficult than seemingly fancier procedures, such as multiple regression.

You will notice that many of the examples in the first three chapters are quite old, drawn from studies conducted as far back as the 1960s. This is because at that time tabular analysis was the "state of the art," the technique used in most of the articles published in the leading journals. Thus, by going back to the older research literature, I have been able to find particularly clear applications of tabular procedures.
INTRODUCTION TO THE BOOK VIA A CONCRETE EXAMPLE

In 1967, Gary Marx published an article in the American Sociological Review titled "Religion: opiate or inspiration of civil rights militancy among Negroes?" (Marx 1967a; see also Marx 1967b). The title expressed two competing ideas about how religiosity among Blacks might have affected their militancy regarding civil rights. One possibility was that religious people would be less militant than nonreligious people because religion gave them an other-worldly rather than this-worldly orientation, and established religious institutions have generally had a stake in the status quo and hence a conservative orientation. The other possibility was that they would be more militant because the Black churches were a major locus of civil rights militancy, and religion is an important source of universal humanistic values. Of course, a third possibility was that there would be no connection between religiosity and militancy.

Suppose that we want to decide which of these ideas is correct. How can we do this? One way, which is the focus of our interest here, would be to ask a probability sample of Blacks how religious they are and how militant with respect to civil rights they are, and then to cross-tabulate the answers to determine the relative likelihood, or probability, that religious and nonreligious people say they are militant. If religious people are less likely to give militant responses than are nonreligious people, the evidence would support the first possibility; if religious people are more likely to give militant responses, the evidence would favor the second possibility; and if there is no difference in the relative likelihoods of religious and nonreligious people giving militant responses, the evidence would favor the third possibility. Of course, evidence favoring an idea does not definitely prove it. I will say more about this later.

This seemingly simple example contains all of the elements that we will be dealing with in this book and that a researcher needs to take account of to arrive at a meaningful and believable answer to any research question. Let us consider the elements one by one.

First, the idea: is religion an opiate or inspiration of civil rights militancy? Without an idea, the manipulation of data is pointless. As you will see repeatedly, the nature of the idea a researcher wants to test will dictate the kind of data chosen and the manipulations performed. Without an idea, it is impossible to decide what to do, and the researcher will be tempted to try to do everything and be at a loss to choose from among the various things he or she has done. Ideas to be tested are generally called hypotheses; they also will be referred to here and in what follows as theories. A theory need not be either grandiose or abstract to be labeled as such. Any idea about what causes what, or why and how two variables are associated, is a theory.

Second is the information, or data, needed to test the idea or hypothesis (or theory). In this book, we will be concerned with data drawn from probability samples of populations. A population is any definable collection of things. Mostly we will be concerned with populations of people, such as the population of the United States. But social scientists are also interested in populations of organizations, cities, occupations, and so on. A probability sample is a subset of the population selected in such a way that the probability that a given individual in the population will be included in the sample is known. Only by using a probability sample is it possible to make inferences from the characteristics of the sample to the characteristics of the population from which the sample is drawn.
Thus, if we observe a given result in a probability sample, we can infer within a specified range what the likely result will be in the population.

The sample used by Marx is actually quite complex, consisting of a probability sample of [...] Blacks living in metropolitan areas outside the South, plus four special samples, probability samples of Blacks living in Chicago, New York, Atlanta, and Birmingham. The total number of respondents from the non-Southern urban sample plus the four special samples is 1,119, and Marx treats the combined sample as representative of urban Blacks in the United States. This is not, in fact, entirely legitimate. Later we will explore how to weight complex samples to make them truly representative of the populations from which they are drawn. Evaluation of the sample used in an analysis is an important part of the data analyst's task. But for now, we will go along with Marx in treating his sample as a probability sample of U.S. urban Blacks.

When our ideas are about the behavior or attitudes of people, a standard way of collecting data is to ask a probability sample chosen from an appropriate population to tell us about their behavior and attitudes by answering a set of specific questions. That is, we survey the sample by asking each individual in the sample a set of questions and recording the responses. In most sample surveys, the possible responses are preselected, and the person being surveyed, the respondent, is asked to choose the best response from a list (but see the boxed comment on open-ended questions). For example, one of the questions Marx asked was

What would you say about the civil rights demonstrations over the last few years -- that they have helped Negroes a great deal, helped a little, hurt a little, or hurt a great deal?

Helped a great deal    1
Helped a little        2
Hurt a little          3
Hurt a great deal      4
Don't know             5
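The numbers printed beside each response are what a data file actually stores. As a minimal sketch (written in Python rather than the Stata used in this book), the following shows how answers to the question above might be held as numeric codes and summarized as a percentage distribution. The ten answers are invented for illustration, and the nonresponse code 9 is an assumption made for this sketch, not a value taken from Marx's study.

```python
from collections import Counter

# Response categories and codes as printed in the question above.
CODEBOOK = {
    1: "Helped a great deal",
    2: "Helped a little",
    3: "Hurt a little",
    4: "Hurt a great deal",
    5: "Don't know",
}

# Hypothetical code reserved for nonresponse (an assumption for this sketch).
MISSING = 9

# Invented answers from ten respondents, stored as codes rather than text.
responses = [2, 1, 1, 5, 3, 2, 9, 1, 4, 2]


def frequency_table(codes, codebook, missing=MISSING):
    """Count each code and percentage it on the number of valid answers."""
    counts = Counter(codes)
    valid_n = sum(n for code, n in counts.items() if code != missing)
    rows = []
    for code, label in codebook.items():
        n = counts.get(code, 0)
        pct = 100.0 * n / valid_n if valid_n else 0.0
        rows.append((code, label, n, round(pct, 1)))
    return rows, counts.get(missing, 0)


table, n_missing = frequency_table(responses, CODEBOOK)
for code, label, n, pct in table:
    print(f"{code}  {label:<20} {n:>3} {pct:>6.1f}%")
print(f"Nonresponse (code {MISSING}): {n_missing}")
```

Percentaging on valid answers only is one common choice; whether nonresponses belong in the base is itself an analytic decision.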
OPEN-ENDED QUESTIONS

Occasionally, questions [...] listed on a questionnaire or when the researcher doesn't have a very good idea of what the possible responses will be. Open-ended questions must be coded, that is, converted into a standard set of response categories, as an editing operation in the course of data preparation. This is very time-consuming and expensive and is avoided whenever possible. But some items must be asked in an open-ended format. Both in the decennial census and in many contemporary surveys in the United States, for example, a series of three open-ended questions typically is asked to elicit information necessary to classify respondents according to standard detailed (three-digit) classifications of occupation and industry.
Each response, or response category, has a number associated with it, known as a code. The codes are what are actually recorded when the data are prepared for analysis [...] to manipulate [...].

[...] respondents will refuse to answer a question or, in a self-administered questionnaire, will choose more than one response. Sometimes an interviewer will forget to record a response or will [...] ambiguous way. For these reasons, a code is usually designated to indicate nonresponses or uncodable responses. For example, [...] might be assigned to nonresponses to the preceding question when the data are being prepared for analysis (this topic is discussed a bit later). How to deal with nonresponses, or missing data, is one of the perennial problems of the survey analyst; we will devote a great deal of attention to this question.

The term variable refers to each set of response categories and the associated codes. A machine-readable data set ([...] disks, CD-ROMs, [...]) [...] codes for each individual in the sample corresponding to the response categories for the variables included in the data set. Suppose, for example, the earlier question on whether civil rights demonstrations have helped Negroes is [...] in a survey. Suppose, also, that the first respondent in the sample [...] "helped a little." The data set would then include a "2" in [...] for the first individual. To know exactly what is included in a data set, and where in the data set it is located, a codebook is prepared and used as [...].

[...] necessary to carry out the sort of analysis dealt with in this book [...] a data set, a codebook for the data set, and documentation that describes the sample. We will not be concerned with problems of data collection or the preparation of a machine-readable data set, except in [...] require full treatment in their own right, and we will not have time [...].
[...] and collectively exhaustive categories. Religious affiliation is an example of such a variable. For example, we might have the following response categories and codes:

Protestant    1
Catholic      2
Jewish        3
Other         4
None          5
No answer     9

Note that no order is implied among the responses: no response is "better" or "higher" than any other. The variable simply provides a way of classifying people into religious groups. Note, further, that every individual in the survey has a code, even those [...] question. This is accomplished by including a residual category [...]. In well-designed variables, categories are always mutually exclusive and collectively exhaustive, that is, [...] one and only one code. (In Chapter [...] we will [...] of coding missing data.)

Ordinal variables [...] property: they can be arranged in an order along some dimension: quantity, value, or level. [...]
[...] dimension is helpfulness to Negroes. [...] encounter in surveys. [...] "don't know" [...] or an implicit "no answer" [...] not self-evidently ordered with respect to [...]. [...] a plausible argument [...] that a "don't know" response is [...] a neutral rather than either a positive or a negative response. To treat [...] the variable by assigning code [...]. [...] are so varied, including simple error, failure to [...]. Hence, there is no way to predict how the nonrespondents [...] responded had they done so. Therefore, it probably would be wisest to [...] missing data.

An important feature of ordinal variables is that they include no information about the distance between categories. For example, we do not know whether the difference between the judgment that civil rights demonstrations "hurt a little" and the judgment that they "helped a little" is greater or smaller than the difference between the judgment that they "helped a little" and that they "helped a great deal." For this reason, some [...] social researchers [...] a variable and use only the order property. [...] taken here. [...] we will mainly consider two kinds of statistics, those appropriate for nominal [...] and those appropriate for interval [...]. [...] statistics [...] for example, that error is normally distributed [...] mathematically tractable than [...] used than parametric statistics; moreover, [...] statistics are much [...] alternatives for [...] thing and little consensus among researchers about which ordinal statistic to use. Third, many ordinal statistics involve implicit assumptions that are just as restrictive as the assumptions underlying parametric statistics. For example, it can be shown that Spearman's rank order correlation (an ordinal statistic) is identical to the product-moment (Pearson) correlation (the conventional parametric correlation coefficient) when interval or ratio variables are converted to ranks. In effect, then, the Spearman rank order correlation assumes an equal distance between each category rather than making no assumptions about the distance between categories. [...] by using ordinal statistics. [...] discussions can be found in Davis 1971, and Hildebrand and others 1977.

Interval variables and ratio variables are similar in that the distance between categories is meaningful. Not only can we say that one category is higher than another (on some dimension) but also how much higher. Such variables legitimately can be manipulated with standard arithmetic operations: addition, subtraction, multiplication, and division.
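The claim above that Spearman's rank order correlation is simply the Pearson correlation computed on ranks can be checked numerically. The sketch below is written in Python rather than the Stata used in this book; the schooling and income figures are invented, and the function names are illustrative only. It compares the Pearson correlation of the ranks with the familiar difference-based Spearman formula, which is exact only when there are no ties.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_from_differences(x, y):
    """Spearman's rho as 1 - 6*sum(d^2) / (n(n^2 - 1)); assumes no ties."""
    rank_x, rank_y = average_ranks(x), average_ranks(y)
    n = len(x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented data: years of schooling and income (in thousands) for six people.
schooling = [8, 10, 12, 13, 16, 18]
income = [14, 12, 21, 30, 29, 55]

rho_on_ranks = pearson(average_ranks(schooling), average_ranks(income))
rho_formula = spearman_from_differences(schooling, income)
print(f"Pearson on ranks:  {rho_on_ranks:.4f}")
print(f"Spearman formula:  {rho_formula:.4f}")
print(f"Pearson on values: {pearson(schooling, income):.4f}")
```

Because the first two results agree, replacing values with the ranks 1, 2, 3, and so on amounts to treating adjacent ranks as equally spaced, which is the implicit assumption noted above.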
SAMUEL A. STOUFFER (1900-1960) was an early leader in the development of survey research. He was born in Sac City, Iowa, and earned a B.A. from Morningside College; earned an M.A. in literature at Harvard; served three years as an editor of the Sac City [...] of his father; and then began graduate studies in sociology at the University of Chicago, completing his Ph.D. in 1930. While at Chicago he [...] William F. Ogburn, who introduced him to statistics [...] initial hostility to the subject. He studied statistical methods and mathematics [...] at Chicago [...] University of London, where he worked with Karl Pearson, among others (see the biographical sketch of Pearson in Chapter Five). Stouffer held academic appointments in statistics and sociology at Wisconsin, Chicago, and Harvard. He was a skilled research administrator, leading a number of major [...] Social Science Research Council project to evaluate the influence of the [...]; during World War II, a study of soldiers for the Defense Department, which resulted in the classic publication The American Soldier; [...] a study of the anticommunist hysteria of the McCarthy era, funded by the Ford Foundation's Fund for the Republic, which resulted in Communism, Conformity, and Civil Liberties (1955). When he died rather unexpectedly at age sixty after a brief illness, he was in the process of developing for the Population Council [...] developing nations. He also played an important role in developing statistical standards in the U.S. Bureau of the Budget. A hallmark of Stouffer's work was his commitment to using empirical data and quantitative methods to [...] ideas about social processes, which makes it fitting that a posthumous collection of his papers is titled Social Research to Test Ideas (1962).
In your opinion, is the government in Washington pushing integration too slow, too fast, or about right? (Too slow.)
Negroes who want to work hard can get ahead just as easily as anyone else. (Disagree.)
Negroes should spend more time praying and less time demonstrating. (Disagree.)
To tell the truth I would be afraid to take part in civil rights demonstrations. (Disagree.)
Would you like to see more demonstrations or less demonstrations? (More.)
A restaurant owner should not have to serve Negroes if he doesn't want to. (Disagree.)
An owner of property should not have to sell to Negroes if he doesn't want to. (Disagree.)
In Chapter Eleven we will devote considerable attention to this topic.

The third element in any quantitative analysis is the model, the way we organize and manipulate data to assess our idea or hypothesis. The model has two components: the choice of statistical procedure and the assumptions about how the variables in our analysis are related. In the present case, we assess our hypotheses with cross-tabulations of militancy by religiosity, with the introduction of successive control variables, which are discussed below. Our hypothesis is that a higher percentage of the nonreligious than of the religious will be militant or, because we have competing hypotheses, that a lower percentage of the nonreligious will be militant. Later in the book we will deal with statistical models that are more sophisticated, mostly variants of the general linear model, but the logic remains unchanged. How we actually carry out cross-tabulation analysis is the topic of the next section.

CROSS-TABULATIONS
We want to determine whether religious Blacks are more likely (or less likely) than nonreligious Blacks to be militant. Perhaps the most straightforward approach is to cross-tabulate militancy by religiosity, that is, to count the frequency of persons with each combination of religiosity and militancy. Doing so yields the following joint frequency distribution (Table 1.1).
TABLE 1.1. Joint Frequency Distribution of Militancy by Religiosity Among Urban Negroes in the U.S., 1964.

Religiosity              Militant   Nonmilitant   Total
Very religious                 61           169     230
Somewhat religious            160           372     532
Not very religious             87           108     195
Not at all religious           25            11      36
Total                         333           660     993

Source: Adapted from Marx (1967a, Table 6).
TECHNICAL POINTS ON TABLE 1.1
1) The total number of cases (individuals) in the table is given in the lower-right cell (or position). Note that this is fewer than the number of cases in the sample. The difference is due to missing data; that is, some respondents did not answer all the questions needed to construct the religiosity and militancy scales. Later, we will deal extensively with missing data problems. For the present, we ignore the missing data and treat the sample as if it consists of 993 cases.
2) The entries in the interior of the table give the joint frequency distribution. The names of the variables and the response categories are given in the table stubs.
3) Always check the internal consistency of your tables by adding up the entries in each row and confirming that they correspond to the column marginal, for example, 61 + 169 = 230, and so on; adding up the entries of each column and confirming that they correspond to the row marginal, for example, 61 + 160 + 87 + 25 = 333; and adding up the row marginals and the column marginals and confirming that the sum of each corresponds to the table total. It is easy to introduce errors, especially when copying tables, and it is far better to find them yourself than to have your readers find them for you.
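The consistency checks just described are easy to automate. The following is a minimal sketch in Python, which the original text does not use; the dictionary layout and the function name `check_marginals` are illustrative assumptions, and the frequencies are those of Table 1.1.

```python
# Frequencies from Table 1.1. The data layout and function name are
# illustrative assumptions, not something prescribed by the text.
TABLE_1_1 = {
    "Very religious":       {"Militant": 61,  "Nonmilitant": 169, "Total": 230},
    "Somewhat religious":   {"Militant": 160, "Nonmilitant": 372, "Total": 532},
    "Not very religious":   {"Militant": 87,  "Nonmilitant": 108, "Total": 195},
    "Not at all religious": {"Militant": 25,  "Nonmilitant": 11,  "Total": 36},
}
PRINTED_TOTALS = {"Militant": 333, "Nonmilitant": 660, "Total": 993}


def check_marginals(rows, totals):
    """Return a list of inconsistencies; an empty list means the table adds up."""
    problems = []
    # Interior cells of each row must sum to that row's printed total.
    for label, row in rows.items():
        if row["Militant"] + row["Nonmilitant"] != row["Total"]:
            problems.append(f"row '{label}' does not sum to its total")
    # Entries of each column must sum to the printed figure in the Total row.
    for column, printed in totals.items():
        if sum(row[column] for row in rows.values()) != printed:
            problems.append(f"column '{column}' does not sum to {printed}")
    # The marginal totals must themselves sum to the table total.
    if totals["Militant"] + totals["Nonmilitant"] != totals["Total"]:
        problems.append("marginal totals do not sum to the table total")
    return problems


problems = check_marginals(TABLE_1_1, PRINTED_TOTALS)
print(problems if problems else "Table 1.1 is internally consistent")
```

Returning a list of problems rather than stopping at the first one is a design choice: a single miscopied cell typically breaks both a row sum and a column sum, and seeing both messages points directly at the offending cell.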
From this table, can we decide whether religiosity favors or inhibits militancy? Not easily. To do so, we would need to determine the relative probability that people at each level of religiosity are militant. If the probability of militancy increases with religiosity, we would conclude that religiosity promotes militancy; if the probability of militancy decreases with religiosity, we would conclude that religiosity inhibits militancy. Expressing these probabilities as percentages yields Table 1.2.
TABLE 1.2. Percent Militant by Religiosity Among Urban Negroes in the U.S., 1964.

Very Religious    Somewhat Religious    Not Very Religious    Not at All Religious
TECHNICAL POINTS ON TABLE 1.2
1) Always include the percentage totals (the row of 100%s). Although this may seem redundant and a waste of space, it makes it immediately clear to the reader in which direction you have percentaged the table. When the percentage totals are omitted, the reader may have to add up several rows or columns to figure it out. Using percentage signs on the top row of numbers and again on the Total row also clearly indicates to the reader that this is a percentage table.
2) Whole percentages are precise enough. There is no point in being more precise in the presentation of data than the accuracy of the data warrants. Moreover, fractions of percentages are usually uninteresting. It is hard to imagine anyone wanting to know that 37.44 percent of women and 41.87 percent of men do something; it is sufficient to note that 37 percent of women and 42 percent of men do it. Incidentally, a convenient rounding rule is to round exact halves to the even number. Thus, 37.50 becomes 38, but 36.50 becomes 36. Of course, 36.51 becomes 37 and 37.49 also becomes 37. You only want to report more than whole percentages if you have a distribution with many categories and are concerned about rounding error.
3) Always include the number of cases on which the percentages are based (that is, the denominator for the percentages). This enables the reader to reconstruct the entire table of frequencies (within the limits of rounding error) and hence to reorganize the data into a different form. Note that Table 1.2 contains all of the information that Table 1.1 contains because you can reconstruct Table 1.1 from Table 1.2: 27 percent of 230 is 62.1, which rounds to 62 (within rounding error of 61), and so on. Customarily, percentage bases are placed in parentheses to clearly identify them and to help them stand out from the remainder of the table.
4) Sometimes it is useful to include a Total column, as I have done here, and sometimes not. The choice should be based on substantive considerations. In the present case, about one-third of the total sample is militant (as defined by Marx); hence, the marginal distribution for the dependent variable is reported here. Recall from page 7 that "militants" are those who gave militant responses to at least six of the eight items in the militancy scale. We now see that about one-third of the sample did so. Obviously, if we defined as militant all those who gave at least five militant responses, the percentage militant would be higher.
5) No convention dictates that tables must be arranged so that the percentages run down, that is, so that each column totals to 100 percent. In Table 1.2, the categories of the dependent variable form the rows, and the categories of the independent variable form the columns. If it is more convenient to reverse this, so that the categories of the independent variable form the rows, this is perfectly acceptable. The only caveat is that within each category of the independent variable, the percentage distribution across the categories of the dependent variable must total to 100 percent. Thus, if the categories of the dependent variable form the columns, the table should be percentaged across each row.
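The arithmetic behind these points can be sketched in a few lines of Python (an illustration added here, not part of the original text). The frequencies are those of Table 1.1; the function names are assumptions made for the example. Python's built-in `round` happens to implement the round-half-to-even rule described in point 2.

```python
# (militant, total) frequencies for each religiosity category, from Table 1.1.
FREQUENCIES = {
    "Very religious": (61, 230),
    "Somewhat religious": (160, 532),
    "Not very religious": (87, 195),
    "Not at all religious": (25, 36),
}


def percent_militant(militant, base):
    """Whole-number percentage of the base that is militant.

    Python's round() sends exact halves to the nearest even integer,
    which is the rounding rule recommended in the text.
    """
    return round(100 * militant / base)


def reconstruct_frequency(percent, base):
    """Recover an approximate cell frequency from a percentage and its base."""
    return round(percent * base / 100)


# Percentage each column on its own base, reporting the base in parentheses.
for label, (militant, base) in FREQUENCIES.items():
    print(f"{label:22s} {percent_militant(militant, base):3d}%  ({base})")

# The rounding examples from point 2.
print(round(37.50), round(36.50), round(36.51), round(37.49))

# The reconstruction example from point 3: 27 percent of 230.
print(reconstruct_frequency(27, 230))
```

Because each category of the independent variable is percentaged on its own base, every figure is a conditional probability (times 100) of being militant given a level of religiosity. Percentages recomputed this way from reconstructed frequencies can differ by a point from the figures printed in a published table.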
The Direction to Percentage the Table
Note that the direction in which this table is percentaged is not at all arbitrary but rather is determined by the nature of the hypothesis being tested. The question being addressed is whether religiosity promotes or hinders militancy. In this formulation, religiosity is presumed to influence, cause, or determine militancy, not the other way around. (One could imagine a hypothesis that assumed the opposite: we might suppose that militants would tend to lose interest in religion as their civil rights involvement consumed their passions. But that is not the idea being tested here.) The variable being determined, influenced, or caused is known as the dependent variable, and the variables that are doing the causing, determining, or influencing are known as independent, or predictor, variables. The choice of causal order is always a matter of theory and cannot be determined from the data. The choice of causal order then dictates the way the table is constructed. Tables should almost always (an exception will be presented later) be constructed to express the conditional probability of being in each of the categories of the dependent variable given that an individual is in a particular category of the independent variable(s). (Do not let the fact that the table is expressed in percentages and the rule is expressed in probabilities confuse
you; percentages are simply probabilities multiplied by 100.)

Control Variables
Thus far, we have determined that the more religious people are, the less likely they are to be militant. It does not necessarily follow, however, that religiosity causes people to be less militant. The association might arise because both religiosity and militancy depend on some third factor, such as education. How can we test this possibility? First, we need to determine whether education does in fact reduce religiosity, by creating a percentage table of religiosity by education (Table 1.3). The table shows that the proportion "very religious" decreases with education and the proportion "not at all religious" increases with education. Second, we need to determine whether education increases militancy by creating a percentage table of militancy by education (Table 1.4).
... percent of those with high school education, and fully 53 percent of those with college education, are militant. Another way of putting this is to say that a positive association exists between education and militancy: as education increases, the probability of militancy increases.
TABLE 1.3. Percentage Distribution of Religiosity by Educational Attainment, Urban Negroes in the U.S., 1964.

Columns (Educational Attainment): Grammar School; High School; College
Rows (Religiosity): Very religious; Somewhat religious; Not very religious; Not at all religious; Total

Source: Adapted from Marx (1967a, Table 6).
TABLE 1.4. Percent Militant by Educational Attainment, Urban Negroes in the U.S., 1964.

Columns (Educational Attainment): Grammar School; High School; College
Rows (Militancy): Militant; Nonmilitant; Total
TECHNICAL POINTS ON TABLE 1.3
1) Sometimes your percentages will not total to exactly 100 percent due to rounding error. Deviations of one percentage point (99 to 101) are acceptable. Larger deviations probably indicate computational error and should be carefully checked.
2) Note how the title is constructed. It states what the table is (a percentage distribution), which variables are included (the convention is to list the dependent variable first), what the sample is (urban Negroes in the U.S.), and the date of data collection (1964). The table should always contain sufficient information to enable one to read it without referring to the text. Thus, the title and variable headings should be clear and complete; if there is insufficient space to do this, it should be done in footnotes to the table.
3) In the interpretation of percentage distributions, comparing the extreme categories and ignoring the middle categories is usually sufficient. Thus, we noted that the proportion "very religious" decreases with education, and the proportion "not at all religious" increases with education. Similar assertions about how the middle categories ("somewhat religious" and "not very religious") vary with education are awkward because they may draw from or contribute to categories on either side. For example, the percentage "not very religious" among those with a college education might be larger if either the percentage "somewhat religious" or the percentage "not at all religious" were smaller. But one shift would indicate a more religious college-educated population, and the other shift would indicate a less religious college-educated population. Hence, the "not very religious" row cannot be interpreted alone, and usually little is said about the interior rows of a table. On the other hand, it is important to present the data so that the reader can see that you have not masked important details and to allow the reader to reorganize the table by collapsing categories (discussed later).
4) In dealing with scaled variables, such as religiosity, you should not make much of the relative size of the percentages within each distribution; that is, comparisons should be made across the categories of the independent variable, not across the categories of the dependent variable. In the present case, it is legitimate to note that those with a grammar school education are more likely to be very religious than are those who are better educated, but it is not legitimate to assert that more than half those with a grammar school education are somewhat religious. The reason for this is that the scale is only an ordinal scale; the categories do not carry an absolute value. How religious is "very religious"? All we know is that it is more religious than "somewhat religious." In consequence, it is easy to change the distribution simply by combining categories. Suppose, for example, we summed the top two rows and called the resulting category "religious." In this case, 88 percent of those with grammar school education would be shown as "religious." Consider how this would change the assertions we would make about this sample if we took the category labels seriously.
TECHNICAL POINTS ON TABLE 1.4
1) When you are presenting several tables involving the same data, always check the consistency of your tables by comparing numbers across the tables wherever possible. For example, the number of cases in Table 1.4 should be identical to that in Table 1.3.
Because educated urban Blacks are both less likely to be religious and more likely to be militant than are their less educated counterparts, it is possible that the observed association between religiosity and (non)militancy is determined entirely by their mutual dependence on education and that there is no connection between militancy and religiosity among people who are equally well educated. If this proves true, we would say that education explains the association between religiosity and militancy and that the association is spurious because it does not arise from a causal connection between the variables.

To test this possibility, we study the relation between militancy and religiosity within categories of education by creating a three-variable cross-tabulation of militancy by religiosity by education. Such a table can be set up in two different ways. The first is shown in Table 1.5, and the second in Table 1.6.
TABLE 1.5. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964.

               Grammar School       High School          College
Militancy       V     S     N       V     S     N       V     S     N
Militant       17%   22%   32%     34%   32%   47%     38%   48%   68%

Source: Adapted from Marx (1967a, Table 6).
Note: V = very religious; S = somewhat religious; N = not very religious or not at all religious.
TECHNICAL POINTS ON TABLE 1.5
1) In this sort of table, education is the control variable. The table is set up to show the relationship between militancy and religiosity within categories of education, that is, "controlling for education," or (synonymously) "holding education constant," or "net of education." The control variable should always be put on the outside of the tabulation so that it changes most slowly. This format facilitates reading the table because it puts the numbers being compared in adjacent columns. (Sometimes we want to study the relationship of each of two independent variables to a dependent variable, in each case controlling for the other. In such cases, we still make only one table and construct it in whatever way makes it easiest to read. If our dependent variable is dichotomous or can be treated as dichotomous, we set up the table in the format of Table 1.6.)
2) Note that the "not very religious" and "not at all religious" categories were combined. This is often referred to as collapsing categories. Collapsing is usually done when there would be too few cases to produce reliable results for some categories. In the present case, as we know from Table 1.1 or 1.2, there are thirty-six people who are not at all religious. Dividing them on the basis of educational attainment would produce too few cases in each group to permit reliable estimates of the percent militant. Hence, they were combined with the adjacent group, "not very religious." An additional reason for collapsing categories is that too much detail makes it difficult for the reader to take in the table; it helps to reduce the number of categories presented. On the other hand, if categories of the independent variable differ in terms of their distribution on the dependent variable, combining the categories will mask important distinctions. A fine balance must be struck between clarity and precision, which is why constructing tables is an art.
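Collapsing categories is a purely mechanical operation on the frequencies, as the following sketch suggests. It is an illustration in Python rather than anything from the original text: the `collapse` helper and the label given to the merged row are assumptions, and the counts are the two-variable frequencies of Table 1.1 (the three-variable frequencies are not reproduced here).

```python
# (militant, nonmilitant) frequencies from Table 1.1.
TABLE_1_1 = {
    "Very religious": (61, 169),
    "Somewhat religious": (160, 372),
    "Not very religious": (87, 108),
    "Not at all religious": (25, 11),
}


def collapse(table, labels_to_merge, new_label):
    """Merge the named rows of a frequency table into a single row.

    The merged row takes the position of the first merged label, so the
    ordering of the remaining categories is preserved.
    """
    merged = [0, 0]
    result = {}
    for label, (militant, nonmilitant) in table.items():
        if label in labels_to_merge:
            merged[0] += militant
            merged[1] += nonmilitant
            result.setdefault(new_label, None)  # reserve the row's position
        else:
            result[label] = (militant, nonmilitant)
    result[new_label] = tuple(merged)
    return result


collapsed = collapse(
    TABLE_1_1,
    {"Not very religious", "Not at all religious"},
    "Not very or not at all religious",
)

# Always collapse the frequencies first and percentage afterward;
# averaging the two original percentages would ignore their unequal bases.
for label, (militant, nonmilitant) in collapsed.items():
    base = militant + nonmilitant
    print(f"{label:34s} {round(100 * militant / base):3d}%  ({base})")
```

Note that the merged category's percentage sits much closer to that of the larger group ("not very religious," with 195 cases) than to that of the smaller one (36 cases), which is exactly the masking of distinctions the text warns about.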
From Table 1.5, we see that religiosity continues to inhibit militancy even when education is controlled, although the differences in percent militant among religiosity categories tend to be smaller than in Table 1.2, where education is not controlled. (In the next chapter, we will discuss a procedure for calculating the size of the reduction in an association resulting from the introduction of a control variable.) Among those with grammar school education, 17 percent of the very religious and 32 percent of the not religious are militant; the corresponding percentages for those with high school education are 34 and 47, and for those with college education are 38 and 68. Thus we conclude that education does not completely account for the inverse association between religiosity and militancy.
At this point, we have to decide whether to continue the search for additional explanatory variables. Our decision usually will be based on a combination of substantive and technical considerations. If we have grounds for believing that some other factor might account both for religiosity and militancy, net of education, we probably would want to control for that factor as well. Note, however, that the power of additional factors to explain the association between two original variables (here religiosity and militancy) will depend on their association with previously introduced control variables. To the extent that additional variables are highly correlated with variables already introduced, they will have little impact on the association. This is an extremely important point that will recur in the context of multiple regression analysis. Be sure you understand it thoroughly.

Consider age. What relation would you expect age to have to religiosity and to militancy?
Pause to Think About This

Religiosity is likely positively associated with age (that is, older people tend to be more religious), and militancy is inversely associated with age (younger people tend to be more militant). Hence, we might expect the association between religiosity and militancy to be a spurious function of age. That is, within age categories, there may be no association between religiosity and militancy.

What, however, of the relation between age and education? In fact, from knowledge about the secular trend in education among Blacks, we would expect younger Blacks to be substantially better educated than older Blacks. To the extent this is true, age and education are likely to have similar effects on the association between religiosity and militancy. Hence, introducing age as a control variable in addition to education is not likely to reduce the association between religiosity and militancy by much, relative to the effect of education alone.

Apart from theoretical and logical considerations (is a variable theoretically relevant, and is it going to add anything to the explanation?), there is a straightforward technical reason for limiting the number of variables included in a single cross-tabulation: we quickly run out of cases. Most sample surveys include a few hundred to a few thousand cases. We already have seen that a three-variable cross-tabulation required that we collapse two of the religiosity categories. A four-variable cross-tabulation of the same data is likely to yield so many small percentage bases as to make the results extremely unreliable. The difficulty in studying more than about three variables at a time in a cross-tabulation provides a strong motivation to use some form of regression analysis instead. A substantial fraction of the chapters to follow will be devoted to the elaboration of regression-based procedures.
Table 1.5 also enables us to assess the effect of education on militancy, controlling for religiosity, by comparing corresponding columns in each of the three panels. Thus, we note that, among those who are very religious, 17 percent of the grammar school educated are militant compared to 34 percent of the high school educated and 38 percent of the college educated; among those who are somewhat religious, the corresponding percentages are 22, 32, and 48; and among those who are not religious, they are 32, 47, and 68. Hence, we conclude that, at any given level of religiosity, the better educated are more militant.
TABLE 1.6. Percent Militant by Religiosity and Educational Attainment, Urban Negroes in the U.S., 1964 (Three-Dimensional Format).

                                        Educational Attainment
Religiosity                         Grammar School   High School   College
Very religious                           17%             34%          38%
Somewhat religious                       22%             32%          48%
Not very or not at all religious         32%             47%          68%

Source: Table 1.5.
TECHNICAL POINTS ON TABLE 1.6
1) Each pair of entries gives the percentage of people who have a trait and the percentage base, or denominator, of the ratio from which the percentage was computed. Thus, the entry in the upper-left corner indicates that 17 percent of the 108 very religious grammar-school educated people in the sample are militant. From this table, we can reconstruct any of the preceding five tables (but with the two least religious categories collapsed into one), within the limits of rounding error. Try to do this to confirm that you understand the relationships among these tables.
This requires a fairly tedious comparison, however, skipping around the table to locate the appropriate cells. When the dependent variable is dichotomous, that is, has only two response categories, a much more succinct table format is possible and is preferred. Table 1.6 contains exactly the same information as Table 1.5, but the information is arranged in a more succinct way. Tables like Table 1.6 are known as three-dimensional tables. Compare Tables 1.5 and 1.6. You will see that they contain exactly the same information; all the additional numbers in Table 1.5 are redundant. Moreover, Table 1.6 is much easier to read because we can see the effect of religiosity on militancy, holding constant education, simply by reading down the columns, and the effect of education on militancy, holding constant religiosity, simply by reading across the rows.
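Reading "down the columns" and "across the rows" amounts to taking percentage differences along one dimension while the other is held fixed. The sketch below (an added Python illustration; the function names and the short row labels are assumptions) uses the percent-militant figures quoted in the text and compares only the extreme categories, as recommended earlier.

```python
# Percent militant by religiosity (rows) and education (columns),
# using the figures quoted in the discussion of Tables 1.5 and 1.6.
PERCENT_MILITANT = {
    "Very religious":     {"Grammar school": 17, "High school": 34, "College": 38},
    "Somewhat religious": {"Grammar school": 22, "High school": 32, "College": 48},
    "Not religious":      {"Grammar school": 32, "High school": 47, "College": 68},
}
EDUCATION_LEVELS = ["Grammar school", "High school", "College"]


def religiosity_effect(table, education):
    """Read down one column: not religious minus very religious."""
    return table["Not religious"][education] - table["Very religious"][education]


def education_effect(table, religiosity):
    """Read across one row: college minus grammar school."""
    row = table[religiosity]
    return row["College"] - row["Grammar school"]


# Effect of religiosity on militancy, holding education constant.
for education in EDUCATION_LEVELS:
    print(f"{education:15s} {religiosity_effect(PERCENT_MILITANT, education):+d} points")

# Effect of education on militancy, holding religiosity constant.
for religiosity in PERCENT_MILITANT:
    print(f"{religiosity:20s} {education_effect(PERCENT_MILITANT, religiosity):+d} points")
```

Every difference is positive: within each level of education the nonreligious are more militant, and within each level of religiosity the college educated are more militant, which is the pair of conclusions drawn in the text.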
WHAT THIS CHAPTER HAS SHOWN
In this chapter, we have seen an initial idea formulated into a research problem, an appropriate sample chosen, a survey conducted, and a set of variables created and combined into scales to represent the concepts of interest to the researcher. We then considered how to construct a percentage table that shows the relationship between two variables, with special attention to determining in which direction to percentage tables using the concept of conditional probability distributions: the probability distribution over categories of the dependent variable computed separately for each category of the independent variable(s). This is the most difficult concept in the chapter and one you should make sure you completely understand. The other important concept you need to understand fully is the idea of statistical controls, also known as controlling for or holding constant confounding variables, to determine whether relationships hold within categories of the control variable(s). Finally, we considered various technical issues regarding the construction and presentation of tables. The aim of the game is to construct attractive, easy-to-read tables.

In the next chapter, we continue our discussion of cross-tabulations, considering various ways of analyzing tables with more than two variables and, more generally, the logic of multivariate analysis.
CHAPTER TWO
MORE ON TABLES

WHAT THIS CHAPTER IS ABOUT
In this chapter we expand our understanding of how to deal with cross-tabulations, both substantively and technically. First we continue our consideration of the logic of elaboration, that is, the introduction of additional variables to an analysis; second, we consider a special situation known as a suppressor effect, when the influences of two independent variables offset each other; third, we consider how variables combine to produce particular effects, drawing a distinction between additive and interaction effects; fourth, we see how to assess the effect of a single independent variable in a multivariate percentage table while controlling for the effects of the other independent variables via direct standardization; and finally we consider the distinction between experiments and statistical controls.
THE LOGIC OF ELABORATION
In traditional treatments of survey research methods (for example, Lazarsfeld 1955; Zeisel 1985), it was customary to make a distinction between two situations in which a third variable completely or partially accounts for the association between two other variables: spurious associations and associations that can be accounted for by an intervening variable or variables. The distinction between the two is that when a control variable (Z) is temporally or causally prior to an independent variable (X) and dependent variable (Y), and when the control variable completely or partly explains the association between the independent and dependent variable, we infer that there is no causal connection or only a weak causal connection between the independent and dependent variables. However, when the control variable intervenes temporally or causally between the independent and dependent variables, we would not claim that there is no causal relationship between the independent and dependent variables but rather that the intervening variable explains, or helps explain, how the independent variable exerts its effect on the dependent variable. In the previous chapter we considered spurious associations. In this chapter we revisit spurious associations and also consider the effect of intervening variables.
Spurious Association
Consider the three variables, X, Y, and Z. Suppose that you had observed an association between X and Y and suspected that it might be completely explained by the dependence of both X and Y on Z. (For a substantive example, recall the hypothesis in the previous chapter that the negative relation between religiosity and militancy was due to the dependence of both on education: Blacks with more education were both less religious and more militant.) Such a hypothesis might be diagrammed as shown in Figure 2.1.

Causal diagrams of this sort are used for purposes of explication throughout the book. They are extensively used in path analysis, which is a way of representing and algebraically manipulating structural equation models that was widely used in the 1970s but is less frequently encountered now (see additional discussion of structural equation models and path analysis in Chapter Sixteen). My use of such models is purely heuristic. Nonetheless, I use them in such a way as to be conceptually complete. Hence, the paths from x to X (pXx) and from y to Y (pYy) indicate that other factors besides Z influence X and Y.

Now, if the association between X and Y within categories of Z were very small or nonexistent, we would regard the association between X and Y as entirely explained by their mutual dependence on Z. However, this generally does not happen; recall, for example, that the negative association between religiosity and militancy did not disappear when education was held constant. We ordinarily do not restrict ourselves to an all-or-nothing hypothesis of spuriousness, except in the exceptional case where we have a very strong theory requiring that a particular relation be completely spurious; rather, we ask what the association is between X and Y controlling for Z (and what the association is between Z and Y controlling for X). The logic of our analysis can be diagrammed as shown in Figure 2.2. To state the same point differently, rather than assuming that the causal connection between X and Y is zero and determining whether our assumption is correct, we estimate the relation between X and Y holding constant Z and determine its size, which, of course, may be zero, in which case Figure 2.1 and Figure 2.2 are identical.
FIGURE 2.1. The Observed Association Between X and Y Is Entirely Spurious and Goes to Zero When Z Is Controlled.

FIGURE 2.2. The Observed Association Between X and Y Is Partly Spurious: the Effect of X on Y Is Reduced When Z Is Controlled (Z Affects X and Both Z and X Affect Y).
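To make the idea of an entirely spurious association concrete, here is a small Python sketch with counts that are wholly hypothetical, invented so that the arithmetic is easy to follow; they are not data from any study, and the function names are likewise assumptions. Z is related to both X and Y, but within each category of Z, X makes no difference to Y.

```python
# HYPOTHETICAL counts, invented for illustration only.
# Keys are (category of Z, category of X); values are
# (number of cases with Y present, total number of cases).
COUNTS = {
    ("low Z", "low X"): (80, 400),
    ("low Z", "high X"): (20, 100),
    ("high Z", "low X"): (60, 100),
    ("high Z", "high X"): (240, 400),
}


def percent_y(cells):
    """Percent with Y present, pooling the given (yes, total) cells."""
    yes = sum(cell[0] for cell in cells)
    base = sum(cell[1] for cell in cells)
    return 100 * yes / base


def zero_order_difference(counts):
    """Percentage difference in Y between high X and low X, ignoring Z."""
    high = percent_y([v for (_, x), v in counts.items() if x == "high X"])
    low = percent_y([v for (_, x), v in counts.items() if x == "low X"])
    return high - low


def partial_differences(counts):
    """The same percentage difference computed within each category of Z."""
    return {
        z: percent_y([counts[(z, "high X")]]) - percent_y([counts[(z, "low X")]])
        for z in sorted({z for z, _ in counts})
    }


print(zero_order_difference(COUNTS))  # difference when Z is ignored
print(partial_differences(COUNTS))    # differences when Z is held constant
```

In these invented data the zero-order difference is 24 percentage points, yet the difference is zero within each category of Z, the pattern depicted in Figure 2.1. With real data the partial differences would rarely be exactly zero, which is why the text recommends estimating their size rather than testing an all-or-nothing hypothesis.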
Intervening Variables
Now let us consider the intervening variable case. Suppose we think two variables, X and Y, are associated only because X causes Z and Z causes Y. An example might be the relation between a father's occupation, son's education, and son's income. Suppose we expect the two-variable association between X and Y (sometimes called the zero-order association, short for zero-order partial association, that is, no partial association) to be positive, but think that this is due entirely to the fact that the father's occupational status influences the son's education and that the son's education influences the son's income; we think there is no direct influence of the father's occupational status on the son's income, only the indirect influence through the son's education. This sort of claim can be diagrammed as shown in Figure 2.3.

But, as before, unless we have a very strong theory that depends on there being no direct connection between X and Y, we probably would inspect the data to determine the influence of X on Y, holding constant the intervening variable Z, and would also determine the influence of Z on Y, holding constant the antecedent variable X. This can be diagrammed as shown in Figure 2.4.

If the net, or partial, association between X and Y proves to be zero, we would conclude that a chain model of the kind described in Figure 2.3 describes the data. Otherwise, we would simply assess the strength and nature of both associations, between X and Y and between Z and Y (and, for completeness, the zero-order association between X and Z).

Notice the similarity between Figure 2.2 and Figure 2.4. With respect to the ultimate dependent variable, Y, the two models are identical. The only difference has to do with the specification that Z causes X or that X causes Z. There is still another possibility: X and Z cause Y, but no claim is made regarding the causal relation between X and Z. This can be diagrammed as shown in Figure 2.5.
FIGURE 2.3. The Observed Association Between X and Y Is Entirely Explained by the Intervening Variable Z and Goes to Zero When Z Is Controlled.

FIGURE 2.4. The Observed Association Between X and Y Is Partly Explained by the Intervening Variable Z: the Effect of X on Y Is Reduced When Z Is Controlled (X Affects Z, and Both X and Z Affect Y).
FIGURE 2.5. Both X and Z Affect Y, but There Is No Assumption Regarding the Causal Ordering of X and Z.

In almost all of the analyses we undertake (including cross-tabulations of the kind we are concerned with at present, multivariate models in an ordinary least squares regression framework, and log-linear and logistic analogs to regression for categorical dependent variables), the models, or theories, represented by Figures 2.2, 2.4, and 2.5 will be empirically indistinguishable with respect to the dependent variable, Y. The distinction among them thus must rest with the ideas of the researcher, not with the data. From the standpoint of data manipulation, all three models require assessing the net effect of each of two variables on a third variable, that is, the effect of each independent variable holding constant the other independent variable. Obviously, the same ideas can be generalized to situations involving more than three variables.
SUPPRESSOR VARIABLES
One final idea needs to be discussed here, the notion of suppressor variables. Thus far, we have dealt with situations in which we suspected that an observed association between two variables was due to the effect of a third, either as an antecedent or an intervening variable. Situations can arise, however, in which there appears to be no association between two variables when, in fact, there is a causal connection. This happens when some other variable is related to the two variables in such a way that it suppresses the observed zero-order association: specifically, when one independent variable has opposite effects on another independent variable and on the dependent variable, and the two independent variables have opposite effects on the dependent variable. Such situations can be diagrammed as shown in Figure 2.6.
For example, suppose you are interested in the relations among education, income, and fertility. On theoretical grounds, you might expect the following: education will have a positive effect on income; holding constant income, education will have a negative effect on fertility (the idea being that educated people want to do more for their children and regard children as more expensive than do poorly educated people; hence at any given level of income, they have fewer children); holding constant education, the higher the income, the higher the number of children (the idea being that children are generally regarded as desirable so that at any given level of the perceived cost of children, those with more to spend, that is, with higher income, will have more children). These relationships are represented in Figure 2.6, where X = level of education, Z = income, and Y = number of children.
FIGURE 2.6. The Size of the Zero-Order Association Between X and Y (and Between Z and Y) Is Suppressed When the Effects of X on Z and Y Have Opposite Sign, and the Effects of X and Z on Y Have Opposite Sign.
The interesting thing about this diagram is that it implies that the gross, or zero-order, relationships between X and Y and between Z and Y will be smaller than the net, or first-order partial, relationships and might even be zero, depending on the relative size of the associations among the three variables. To see how this happens, consider the relationship between education and fertility. We have posited that educated people tend to have more income, and at any given level of education, higher-income people tend to have more children. Hence, so far, the relation between education and fertility would be expected to be positive. But we also have posited that at any given income level, better-educated people tend to have fewer children. So we have a positive causal path and a negative causal path at work at once, and the effect of each is to offset or suppress the other effect, so that the zero-order relationship between education and fertility is reduced.
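The offsetting of a positive and a negative path can be illustrated with a small numerical sketch. It assumes a linear path model with standardized variables, in which a zero-order association equals the direct path plus the path running through the third variable; this framing and the three coefficient values are hypothetical and are not taken from the text.

```python
# Hypothetical standardized path coefficients (illustrative only, not from the text).
p_zx = 0.5   # education (X) -> income (Z): positive
p_yz = 0.4   # income (Z) -> number of children (Y), holding X constant: positive
p_yx = -0.2  # education (X) -> number of children (Y), holding Z constant: negative


def zero_order_xy(p_yx, p_zx, p_yz):
    """Zero-order X-Y association: direct path plus the indirect path through Z."""
    return p_yx + p_zx * p_yz


def zero_order_zy(p_yz, p_zx, p_yx):
    """Zero-order Z-Y association: direct path plus the part shared through X."""
    return p_yz + p_zx * p_yx


r_xy = zero_order_xy(p_yx, p_zx, p_yz)
r_zy = zero_order_zy(p_yz, p_zx, p_yx)

# Both zero-order associations are smaller in size than the net effects.
print(f"net effect of X on Y: {p_yx:+.2f}, zero-order: {r_xy:+.2f}")
print(f"net effect of Z on Y: {p_yz:+.2f}, zero-order: {r_zy:+.2f}")
```

With these made-up values the negative direct path and the positive indirect path cancel, so the zero-order association between X and Y vanishes even though both net effects are nonzero.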
ADDITIVE AND INTERACTION EFFECTS
We now consider interaction effects, situations in which the effect of one variable on another is contingent on the value of a third variable. To see this clearly, consider Table 2.1. This table shows that in 1965 educational attainment had no effect on acceptance of abortion among Catholics but that among Protestants, the greater the education, the greater the percentage accepting abortion. Thus, Catholics and Protestants with 8th grade education or less were about equally likely to believe that legal abortion should be permitted under specified circumstances but, among those with more education, Protestants were substantially more likely to accept abortion than were Catholics. Among the college educated, the difference between the religious groups is fully twenty points: about 31 percent of Catholics and 51 percent of Protestants believed that abortion should be permitted.
This kind of result is called an interaction effect. Religion and educational attainment interact to produce a result different from what each would produce alone. That is, the relationship between education and acceptance of abortion differs for Catholics and Protestants, and the relationship between religion and acceptance of abortion differs by education. Situations in which the relationship between two variables depends on the value of a third, as it does here, are known as interactions. In the older survey analysis literature (for example, Lazarsfeld 1955; Zeisel 1985), interactions are sometimes called specifications. Religion specifies the relationship between education and beliefs about abortion: acceptance of abortion increases with education among Protestants but not among Catholics.
TABLE 2.1. Percentage Who Believe Legal Abortions Should Be Possible Under Specified Circumstances, by Religion and Education, U.S. 1965 (N = 1,368; Cell Frequencies in Parentheses).
Educational Attainment: 8th Grade or Less | Some High School | High School Graduate | Some College or More
Religion
29% (287)
(250)
Protestant
36%
43%
51%
(225)
Source: Rossi (1966).
Note: Non-Christians omitted.
Where we do not have interaction effects we have additive effects (or no effects). Suppose, for example, that instead of the numbers shown in Table 2.1, we had the numbers in Table 2.2. What would this show? We could say two things: (1) The effect of religion on acceptance of abortion is the same at all levels of education. That is, the difference between the percentage of Catholics and Protestants who would permit abortion is 10 percent at each level of education. (2) The effect of education on acceptance of abortion is the same for Catholics and Protestants. For example, the difference between the percentage who would permit abortion among those with some high school education and those who are high school graduates is 10 percent for both Catholics and Protestants. Similarly, the difference between the percentage who would permit abortion among those with a high school education and among those with at least some college is 20 percent for both Catholics and Protestants.
TABLE 2.2. Percentage Accepting Abortion by Religion and Education (Hypothetical Data).
              8th Grade or Less   Some High School   High School Graduate   Some College or More
Catholic            30%                 35%                 45%                    65%
Protestant          40%                 45%                 55%                    75%
The reason this table is additive is that the effects of each variable add together to produce the final result. It is as if the probability of any individual in the sample accepting abortion is at least .3 (so we could add .3 to every cell in the table); the probability of a Protestant accepting abortion is .1 greater than the probability of a Catholic accepting abortion (so we could add .1 to all the cells containing Protestants); the probability of someone with some high school accepting abortion is .05 greater than the probability of someone with an 8th grade education doing so (so we would add .05 to the cells for those with some high school); the probability of those who are high school graduates accepting abortion is .15 greater than the probability of those with an 8th grade education doing so (so we would add .15 to the cells for those with high school degrees); and the probability of those with some college accepting abortion is .35 higher than the probability of those with an 8th grade education doing so (so we would add .35 to the cells for those with at least some college). This would produce the results we see in Table 2.2 (after we convert proportions to percentages by multiplying each number by 100).
By contrast, it is not possible to add up the effect of each variable in a table containing interactions because the effect of each variable depends on the value of the other independent variable or variables. Many relationships of interest to social scientists involve interactions, especially with gender and to some extent with race; but it is also true that many relationships are additive. Adequate theoretical work has not yet been done to allow us to specify very well in advance which relationships we would expect to be additive and which relationships we would expect to involve interactions. Later you will see more sophisticated ways to distinguish additive effects from interactions and to deal with various kinds of interactions via log-linear analysis and regression analysis.
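The additive construction described above can be sketched in a few lines. The base value and the increments come from the paragraph; the function names and the dictionary layout are illustrative assumptions.

```python
# Base probability and increments as stated in the text.
BASE = 0.30
RELIGION_EFFECT = {"Catholic": 0.00, "Protestant": 0.10}
EDUCATION_EFFECT = {
    "8th grade or less": 0.00,
    "Some high school": 0.05,
    "High school graduate": 0.15,
    "Some college or more": 0.35,
}


def additive_table(base, row_effects, col_effects):
    """Build a table of percentages by adding the row and column effects to a base."""
    return {
        row: {col: round(100 * (base + r_eff + c_eff))
              for col, c_eff in col_effects.items()}
        for row, r_eff in row_effects.items()
    }


def is_additive(table):
    """True if the gap between the two rows is the same in every column."""
    first, second = list(table)
    gaps = {table[second][col] - table[first][col] for col in table[first]}
    return len(gaps) == 1


table = additive_table(BASE, RELIGION_EFFECT, EDUCATION_EFFECT)
for religion, row in table.items():
    print(religion, row)
print("additive:", is_additive(table))
```

In a table with an interaction, the gap between the rows would differ from column to column and the check would fail.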
DIRECT STANDARDIZATION
Often we want to assess the relationship between two variables controlling for additional variables. Although we have seen how to assess partial relationships (that is, relationships between two variables within categories of one or several control variables), it would be helpful to have a way of constructing a single table that shows the average relationship between two variables net of, that is, controlling for, the effects of other variables. Direct standardization provides a way of doing this. Note that this technique has other names, for example, covariate adjustment. However, the technique is most widely used in demographic research, so I use the term by which it is known in demography, direct standardization. It is important to understand that, even though the same term is used, this procedure has no relationship to standardizing variables to create a common metric. We will consider this subject in Chapter Five.
Example 1: Religiosity by Militancy Among U.S. Urban Blacks
The procedure is most easily explained in the context of a concrete example. Thus we revisit the analysis shown in Tables 1.2 through 1.6 of Chapter One (slightly modified). Recall that we were interested in whether the relationship between militancy and religiosity among Blacks in the United States could be explained by the fact that better-educated Blacks tend to be both less religious and more militant. Because education does not completely explain the association between militancy and religiosity, it would be useful to have a way of showing the association remaining after the effect of education has been removed. We can do this by getting an adjusted percentage militant for each religiosity category, which we do by computing a weighted average of the percent militant across education categories within each religiosity category but with the weights taken from the overall frequency distribution of education in the sample. (Alternatively, because they are mathematically identical, we can compute the weighted sum, using as weights the proportion of cases in each category.) By doing this, we construct a hypothetical table showing what the relationship between religiosity and militancy would be if all religiosity groups had the same distribution of education. It is in this precise sense that we can say we are showing the association between religiosity and militancy net of the effect of education. As noted earlier, this procedure is known as direct standardization or covariate adjustment. Note that the weights need not be constructed from the overall distribution in the table. Any other set of weights could be applied as well.
For example, if we wanted to assess the association between religiosity and militancy on the assumption that Blacks had the same distribution of education as Whites, we would treat Whites as the standard population and use the White distribution across educational categories (derived from some external source) as the weights. We will see two examples of this strategy a bit later in the chapter.
Now let us construct a militancy-by-religiosity table adjusted, or standardized, for education, to see how the procedure works. We do this from the data in Table 1.6. First, we derive the standard distribution, the overall distribution of education. Because there are 993 cases in the table (= 108 + ... + 49), and there are 353 (= 108 + 201 + 44) people with a grammar school education, the proportion with a grammar school education is .356 (= 353/993). Similarly, the proportion with a high school education is .508, and the proportion with a college education is .137. These are our weights. Then to get the adjusted, or standardized, percent militant among the very religious, we take the weighted sum of the percent militant across the three education groups that subdivide the "very religious" category (that is, the figures in the top row of the table): 17%*.356 + 34%*.508 + 38%*.137 = 29%. To get the adjusted percent militant among the "somewhat religious," we apply the same weights to the percentages in the second row in the table: 22%*.356 + 32%*.508 + 48%*.137 = 31%. Finally, to get the adjusted percent militant among the "not very or not at all religious," we do the same for the third row of the table, which yields 45 percent. We can then compare these percentages to the corresponding percentages for the zero-order relationship between religiosity and militancy (that is, not controlling for education). The comparison is shown in Table 2.3. (The Stata -do- file used to carry out the computations, using the command -dstdize-, and the -log- file that shows the results, are available as downloadable files from the publisher, Jossey-Bass/Wiley (www.josseybass.com/go/quantitativedataanalysis), as are similar files for the remaining worked examples in the chapter. Because we have not yet begun computing, it probably is best to note the availability of this material and return to it later unless you are already familiar with Stata.)

TABLE 2.3. Percent Militant by Religiosity, and Percent Militant by Religiosity Adjusting (Standardizing) for Religiosity Differences in Educational Attainment, Urban Negroes in the U.S., 1964 (N = 993).
Percent Militant
Percent Militant Adjusted for Education
Percentage spread
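The weighted sums for the first two religiosity categories can be reproduced with a short sketch. The weights and cell percentages are the ones quoted in the text; the function and variable names are illustrative, and the third row is omitted because its cell percentages are not quoted here. This is a plain weighted sum, not a reproduction of Stata's -dstdize- command.

```python
# Standard distribution of education: grammar school, high school, college.
WEIGHTS = [0.356, 0.508, 0.137]

# Percent militant within each education group, as quoted in the text.
PERCENT_MILITANT = {
    "Very religious": [17, 34, 38],
    "Somewhat religious": [22, 32, 48],
}


def standardized_rate(rates, weights):
    """Weighted sum of category-specific rates using a common standard distribution."""
    if len(rates) != len(weights):
        raise ValueError("rates and weights must have the same length")
    return sum(rate * weight for rate, weight in zip(rates, weights))


adjusted = {
    group: standardized_rate(rates, WEIGHTS)
    for group, rates in PERCENT_MILITANT.items()
}

for group, value in adjusted.items():
    print(f"{group}: {value:.1f} (rounds to {round(value)})")
```

Because every religiosity category is given the same education weights, differences among the adjusted percentages cannot be due to differences in educational composition.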
STATA -do- FILES AND -log- FILES
In Stata, -do- files contain commands, and -log- files record the results of executing -do- files. As you will see in Chapter Four, the management of data analysis is complex and is much facilitated by the creation of -do- files, which are efficient and also provide a permanent record of what you have done to produce each tabulation or coefficient. Anyone who has tried to replicate an analysis performed several years or even several months earlier will appreciate the value of having an exact record of the computations used to generate each result.
When presenting data of this sort, it is sometimes useful to compare the range in the percentage positive (in this case, the percent militant) across categories of the independent variable, with and without controls. In Table 2.3 we see that the difference in the percent militant between the least and most religious categories is twenty-one points whereas, when education is controlled, the difference is reduced to sixteen points, a 24 percent reduction (= 1 - 16/21). In some sense, then, we can say that education "explains" about a quarter of the relationship between religiosity and militancy. We need to be cautious about making computations of this sort and employ them only when they are helpful in making the analysis clear. For example, it does not make much sense to compute a "spread" or "range" in the percentages if the relationship between religiosity and militancy is not monotonic (that is, if the percentage militant does not increase, or at least not decrease, as religiosity declines).
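The spread comparison can be sketched as follows, using the two spreads and the three adjusted percentages quoted in the text. The helper names are illustrative assumptions, and the monotonicity check simply encodes the caution stated above.

```python
def proportional_reduction(spread_without_controls, spread_with_controls):
    """Share of the zero-order spread that disappears once the control is applied."""
    return 1 - spread_with_controls / spread_without_controls


def is_monotonic_nondecreasing(percentages):
    """True if the percentages never fall as we move across ordered categories."""
    return all(a <= b for a, b in zip(percentages, percentages[1:]))


# Spreads in percent militant between least and most religious (from the text).
reduction = proportional_reduction(21, 16)

# Adjusted percent militant, ordered from most to least religious (from the text).
adjusted_percent_militant = [29, 31, 45]

print(f"reduction in spread: {reduction:.0%}")
print("monotonic:", is_monotonic_nondecreasing(adjusted_percent_militant))
```

A spread is worth reporting only when the monotonicity check passes; otherwise the difference between the end categories misrepresents the pattern.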
DIRECT STANDARDIZATION IN EARLIER [...]
[...] to a "weighted net percentage difference" or "weighted net percentage spread." The really useful part of the procedure is the computation of adjusted, or standardized, rates. The subsequent computation of percentage differences or percentage spreads is only sometimes useful, as a way of summarizing the effect of control variables.
Example 2: Belief That Humans Evolved from Animals (Direct Standardization with Two or More Control Variables)
Sometimes we want to adjust, or standardize, our data by more than one control variable at a time, to get a summary of the effect of some variable on another when two or more other variables are held constant. Consider, for example, acceptance of the scientific theory of evolution. In 1993, 1994, and 2000, the N
TABLE 2.4. Percentage Distribution of Beliefs Regarding the Scientific View of Evolution (U.S. Adults, 1993, 1994, and 2000).
ly not true
Perhaps the recent increase in the proportion of the population that adheres to fundamentalist religious beliefs, especially fundamentalist Protestant views in which the Bible is taken as literally true, accounts for this outcome. To see whether this is so, in Table 2.5 I cross-tabulated acceptance of the scientific view of evolution (measured by endorsement of the statement that descent from other animals is "definitely true") by religious denomination, making a distinction between "fundamentalist" and "denominational" Protestants. (For want of better information, I simply dichotomized Protestant denominations on the basis of the proportion of their members in the sample who believe that "The Bible is the actual word of God and is to be taken literally, word for word." Denominations for which at least 50 percent of respondents gave this response ("other" Protestants and all Baptists except members of the "American Baptist Church in the U.S.A." and "Baptists, don't know which") were coded as fundamentalist; all other Protestant denominations were coded as Denominational Protestants.) Unfortunately, the same distinction, between religious subgroups with and without a literal belief in the holy scriptures of their faith, cannot be made for non-Protestants given the way the data were originally coded in the GSS.
Although there are substantial differences among religious groups in their acceptance of the scientific view of evolution, the fundamentalist-denominational split among Protestants does not seem to be central to the explanation because there is only a 4 percent difference between the two groups. Interestingly, non-Christians appear to be much more willing than Christians to accept an evolutionary perspective, and Catholics appear to be more willing than Protestants to do so. Given these patterns, it could well be that the observed religious differences are, at least in part, spurious.
In particular, educational differences among religious groups (Jews are particularly well educated and fundamentalist Protestants are particularly poorly
TABLE 2.5. Percentage Accepting the Scientific View of Evolution by Religious Denomination (N = 3,663).
Percentage Accepting the Evolution of Humans from Animals as "Definitely True"
11.8
Catholics 17.8 (858)
Other Christians 5.6 (18)
Denominational Protestants (1,222)
Other religion 23.6 (123)
No religion 32.5 (391)
educated) might partly account for religious differences in acceptance of the scientific view. Similarly, age differences among religious groups (the young are particularly likely to reject religion) might provide part of the explanation as well.
To consider these possibilities, we need to determine, first, whether acceptance of the scientific explanation of human evolution varies by age and education and, if so, whether religious groups differ with respect to their age and education. Tables 2.6 and 2.7 provide the necessary information regarding the first question, and Tables 2.8 and 2.9 provide the corresponding information regarding the second question.
Unsurprisingly, endorsement of the statement that humans evolved from other animals is "definitely true" increases sharply with education, as we see in Table 2.6, ranging from [...] percent of those with no more than a high school education to 36 percent of those with post-graduate education. It is also true that younger people are more likely to endorse the scientific explanation of evolution than are older respondents (see Table 2.7): 18 percent of those under age fifty, compared to 7 percent of those seventy and over, say that it is "definitely true" that humans evolved from other animals.
As expected, Table 2.8 shows that Jews are by far the best-educated religious group, followed by other non-Christian groups, and that Fundamentalist Protestants and Other Christians are the least well educated. Also as expected, Table 2.9 shows that those without religion tend to be young. However, members of "other" religious groups also tend, disproportionately, to be young, perhaps because they are mainly immigrants.
TABLE 2.6. Percentage Accepting the Scientific View of Evolution by Level of Education.
Percentage
Some college 11.9

TABLE 2.7. Percentage Accepting the Scientific View of Evolution by Age.
Percentage
50-69 13.5 (889)
These results suggest that differences among religious groups with respect to age and education might, indeed, explain part of the observed difference in acceptance of the scientific view of evolution.
To see to what extent age and educational differences among religious groups account for religious group differences in acceptance of evolution, we can directly standardize the religion/evolution-beliefs relationship for education and age. We do this by determining the joint distribution of the entire sample with respect to age and education and then use the proportion in each age-by-education category as weights with which to compute, separately for each religious group, the weighted average of the age-by-education-specific percentages accepting a scientific view of evolution. By doing this, we treat each religious group as if it had exactly the same joint distribution with respect to age and education as did the entire sample. This procedure thus adjusts the percentage of each religious group that endorses the scientific perspective on evolution to remove the effect of religious group differences in the joint distribution of age and education.
TABLE 2.8. Percentage Distribution of Educational Attainment by Religion.
Columns: High School or Less | Some College | College Graduate | Post-Graduate | Total | N
7.1
Denominational Protestants 47.6 26.5 15.2 10.6
Other Christians 61.1 16.7 16.7 5.6 100.0 (18)
Jews 15.7 21.7 31.3 31.3 100.0 (83)
Other religion 36.6 25.2 19.5 18.7 100.0 (123)
Total 47.6 25.6 15.3 11.6 100.1 (3,663)

TABLE 2.9. Percentage Distribution of Age by Religion.
Columns: 18-49 | 50-69 | 70+ | Total | N
Protestants 59.8 25.0 15.1 99.9
Catholics
Other Christians 88.9 11.1 0.0 99.9 (18)
Other religion 83.7 13.8 2.4 99.9 (123)
11.0 100.1
Jews
To get the necessary weights, we simply cross-tabulate age by education and express the number of people in each cell of the table as a proportion of the total. These proportions are shown in Table 2.10.
We then tabulate the percentage accepting the scientific position on evolution by religion, age, and education. These percentages are shown in Table 2.11. Note that many of these percentages are based on very few cases. This means that they are not very precise, in the sense that they are subject to large sampling variability. We could collapse the education and age categories still further, but that would ignore substantial within-category heterogeneity. As always, there is a balance to be struck between sampling precision and substantive sensibility or, in terminology we will adopt later, between reliability and validity. In the present case, I might have been better advised to take a more conservative approach, especially because the cell-specific percentages bounce around a lot (exactly as we would expect given the large degree of sampling variability), which makes the differences in the resulting standardized percentages somewhat less clear cut. On the other hand, the weights are very small for the cells based on few cases, which minimizes their contribution to the overall percentages.
Finally, to get the adjusted, or standardized, coefficients, we sum the weighted percentages, where the weights are the proportions in Table 2.10. For example, the adjusted, or directly standardized, percentage of Fundamentalist Protestants who accept the evolutionary viewpoint as "definitely true" is, within rounding error,
9.7 = 5.7*.274 + 3.8*.184 + 15.7*.110 + 29.3*.080 + 4.9*.126 + 3.3*.056 + 25.0*.032 + 40.9*.029 + 3.3*.076 + 7.7*.015 + 10.0*.012 + 16.7*.007
The remaining standardized percentages are derived in the same way. They are shown in Table 2.12, with the observed percentages repeated from Table 2.5 to make comparisons easier.

TABLE 2.10. Joint Probability Distribution of Education and Age.
                         18-49    50-69    70+
High school or less       .274     .126    .076
Some college              .184     .056    .015
College graduate          .110     .032    .012
Post-graduate             .080     .029    .007
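The standardization over two control variables can be sketched with the weights and the Fundamentalist Protestant cell percentages that appear in the formula in the text. The dictionary layout, category labels, and function name are illustrative assumptions; the cell order (education within age group) follows the order of the terms in that formula.

```python
AGES = ["18-49", "50-69", "70+"]
EDUCATION = ["High school or less", "Some college", "College graduate", "Post-graduate"]

# Joint age-by-education proportions in the full sample (the weights).
weight_values = [0.274, 0.184, 0.110, 0.080,
                 0.126, 0.056, 0.032, 0.029,
                 0.076, 0.015, 0.012, 0.007]

# Percent of Fundamentalist Protestants answering "definitely true", cell by cell.
rate_values = [5.7, 3.8, 15.7, 29.3,
               4.9, 3.3, 25.0, 40.9,
               3.3, 7.7, 10.0, 16.7]

cells = [(age, edu) for age in AGES for edu in EDUCATION]
weights = dict(zip(cells, weight_values))
rates = dict(zip(cells, rate_values))


def directly_standardize(rates, weights):
    """Weighted sum of cell-specific rates over the joint standard distribution."""
    return sum(rates[cell] * weights[cell] for cell in weights)


standardized = directly_standardize(rates, weights)
print(f"standardized percentage: {standardized:.1f}")
print(f"sum of weights: {sum(weights.values()):.3f}")
```

The weights sum to slightly more than 1 because they are rounded to three decimals, which is why the result agrees with the text only within rounding error.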
TABLE 2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Education (Percentage Bases in Parentheses).
Fundamentalist Protestants
Denominational Protestants
Catholic
Other Christians
Jewish
Other
None
I0.01 (2)
128.61 (14)
11. 5 (26)
29.3 (82)
Age 18-49
Somecollege
3.8 (183)
10.' !
(208)
20.1 (159)
22.7 (22)
[0.0] (4)
6.8
20.o
121.41
(6)
(r4)
3.3 (60)
9.0 (78)
(441
(1)
collegegraduare
25.o Q4)
21.3 (47)
20.0 (2s)
(0i
t2o.Ol (s)
I25.Ol (4)
Post-graduate education
40.9 (22\
20.0
32.O (25)
(0)
[37.5] (8)
[2s.0]
Somecollege
t0.01
12s.ol (4)
(4\
l4t.7) (12)
166.71 (12)
(Continued)
TABLE 2.11. Percentage Accepting the Scientific View of Evolution by Religion, Age, and Education (Percentage Bases in Parentheses). (Continued)
Fundamentalist Protestants
Denominational Protestants
Other Christians
Age 70 or more
l7.71 (13)
[0.0] (1)
[100.0] (2)
Note: Percentages based on fewer than twenty cases (shown in square brackets) should be interpreted with caution.
TABLE 2.12. Observed Proportion Accepting the Scientific View of Evolution, and Proportion Standardized for Education and Age.
Percentage Accepting Scientific View of Evolution as "Definitely True"
Percentage Accepting Scientific View, Standardized by Age and Education
Fundamentalist Protestants
Denominational Protestants
(968)
12.2
(1,222) (858)
Catholics
Other Christians
N
2.1
-9,r5
(18) (83) (123)
No religion
As you can see, despite the association between religious group affiliation and, respectively, age and education, and the association of age and education with acceptance of an evolutionary account of human origins, standardizing for these variables has relatively little impact on religious group differences in acceptance of the claim that humans evolved from other animals. The one exception is the non-religious, whose support for a scientific view of evolution appears to be due, in part, to their relatively young age. Despite minor shifts in the expected direction for Fundamentalist Protestants and for Jews and those with other religions, the dominant pattern is one of religious group differences in acceptance of an evolutionary view of the origins of mankind that are not a simple reflection of religious differences in age and education but presumably reflect, instead, the theological differences that distinguish religious categories.
Example 3: Occupational Status by Race in South Africa
Now let us consider another example: the extent to which racial differences in occupational attainment in South Africa can be explained by racial differences in education (the data are from the Survey of Economic Opportunity and Achievement in South Africa, conducted in the early 1990s [Treiman, Lewin, and Lu 2006]; the Stata -do- and -log- files for the worked example are available as downloadable files; for information on the data set and how to obtain it, see Appendix A). From the left-hand panel of Table 2.13, it is evident that there are strong differences in occupational attainment by race. Non-Whites, especially Blacks, are substantially less likely to be managerial, professional, or technical workers than are Whites and are substantially more likely to be semiskilled or unskilled manual workers. Moreover, Blacks are far more likely to be unemployed than are members of any other group. It is also well known that substantial racial differences exist in educational attainment in South Africa, with Whites by far the best educated, followed, in order, by Asians (who in South Africa are mainly descendants of people brought as indentured workers from the Indian subcontinent), Coloureds (mixed-race persons), and Blacks (these are the racial categories conventionally used in South Africa); and also that in South Africa, as elsewhere, occupational attainment depends to a considerable degree on educational attainment (Treiman, McKeever, and Fodor 1996). Under these circumstances, we might suspect that racial differences in occupational attainment can be largely explained by racial differences in educational attainment. Indeed, this is what Treiman, McKeever, and Fodor (1996) found using the International Socioeconomic Status Index (ISEI) (Ganzeboom, de Graaf, and Treiman 1992; Ganzeboom and Treiman 1996) as an index of occupational attainment. However, it also is possible that access to certain types of occupations, such as professional and technical positions, depends heavily upon education, whereas access to others, such as managerial positions, may be denied on the basis of race to those who are educationally qualified.
To determine to what extent, and for which occupation categories, racial differences in access can be explained by racial differences in education, I adjusted (directly standardized) the relationship between race and occupational status by education. Here I used the White distribution of education, computed from the weighted data, as the standard distribution to determine what occupational distributions for each of the non-White groups might be expected were they able to upgrade their levels of educational attainment so that they had the same distributions across schooling levels as did Whites.
The results are shown in panel B of Table 2.13. They are quite instructive. Bringing the other racial groups to the White distribution of education (and assuming that doing so would not affect the relationship between education and occupational attainment within each group), racial differences in the likelihood of being a professional would entirely disappear. Indeed, Blacks would be slightly more likely than members of the other groups to become professionals. By contrast, the percentage of each race group in the managerial category would remain essentially unchanged, suggesting that it is not education but rather norms about who is permitted to supervise whom that account for the racial disparity in this category. The remaining large changes apply to only one or two of the three non-White groups: Asians would not be very substantially affected except for a reduction in the proportion semiskilled; Coloureds would increase the proportion in technical jobs and reduce the proportion in semiskilled and unskilled manual jobs and farm labor; and Blacks would increase the proportion in clerical jobs and reduce the proportion in all manual categories.
If all four racial groups had the same educational distribution as Whites, the dissimilarity (measured by Δ; see Chapter Three) between the occupational distributions of Whites and Asians would be reduced by about 30 percent (from 29.2 to 20.5) as would the dissimilarity in the occupational distributions of Whites and Coloureds (from 37.9 to 26.5), whereas the dissimilarity in the occupational distributions of Whites
TABLE 2.13. Percentage Distribution of Occupational Groups by Race, South African Males Age 20-69, Early 1990s (Percentages Shown Without Controls and Also Directly Standardized for Racial Differences in Educational Attainment; N = 4,004).
WithoutControls
Adjusted for Education Black
White
Asian
Coloured
Black
13.7
7.0
3.3
13.2
13.6
11.9
16.5
7.2
13.4
5.8
7.2
13.0
9.1
8.5
3.5
2.2
9.4
2.5
2_6
14.4
18.6
18.8
19.3
8.2
13.0
9.7
6.2
16.6
20.1
100.0
99.1
99.9
100.0
100.0
20.5
26.5
46.1
Dissimilarity from Whites (Δ)
100.2
99.9
100.0
29.2
37.9
52.4
Notes: The standard population is the educational distribution of the White male population computed from the survey data weighted to conform to census distributions of region by urban versus rural residence. Because coding of occupation data to detailed occupations had not been completed when this table was prepared, I have included a category "occupation unknown." Δ = 1/2 the sum of the absolute values of the differences between the percentage of Whites and the percentage of the other racial group in each occupation category. See Chapter Three for further exposition of this index.
and Blacks would be reduced only about 12 percent (from 52.4 to 46.1). The substantial remaining dissimilarity in the occupational distributions of the four race groups net of education suggests that Treiman, McKeever, and Fodor's (1996) conclusion that education largely explains occupational status differences between race groups in South Africa does not tell the whole story.
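The percentage reductions quoted above follow directly from the Δ values before and after adjustment. The Python sketch below first shows the index itself on a made-up pair of three-category distributions (hypothetical numbers), and then reproduces the reductions from the Δ values given in the text. The function names are illustrative only.

```python
def index_of_dissimilarity(pcts_a, pcts_b):
    """Half the sum of absolute differences between two percentage distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(pcts_a, pcts_b))


def percent_reduction(before, after):
    """Reduction in dissimilarity, as a percentage of the unadjusted value."""
    return 100.0 * (before - after) / before


# Hypothetical three-category distributions, for illustration only.
example_delta = index_of_dissimilarity([50.0, 30.0, 20.0], [30.0, 30.0, 40.0])

# Dissimilarity from Whites (unadjusted, adjusted for education), as quoted in the text.
deltas = {"Asian": (29.2, 20.5), "Coloured": (37.9, 26.5), "Black": (52.4, 46.1)}
reductions = {group: round(percent_reduction(before, after))
              for group, (before, after) in deltas.items()}

print(example_delta)  # 20.0
print(reductions)     # {'Asian': 30, 'Coloured': 30, 'Black': 12}
```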
Example 4: Level of Literacy by Urban Versus Rural Residence in China

Now consider a final example: the relationship among education, urban residence, and degree of literacy in the People's Republic of China. In a 1996 national sample of the adult population (Treiman, Walder, and Li 1996), respondents were asked to identify ten Chinese characters (see Appendix A regarding the properties of this data set and how to obtain access to it). The number of correct identifications is interpreted as indicating the degree of literacy (Treiman 2007a). Obviously, literacy would be expected to increase with education. Moreover, I would expect the urban population to score better on the character recognition task just because urban respondents tend to get more schooling than do rural respondents. The question of interest here is whether educational differences between the rural and urban population entirely explain the observed mean difference in the number of characters correctly identified, which is 2.8 (as shown in Table 2.14).

To determine this, I adjusted (directly standardized) the urban and rural means by assuming that both populations have the same distribution of education: the distribution for the entire adult population of China, computed from the weighted data. Note that in this example it is not percentages that are standardized but rather means. The procedure is identical in both cases, although if this is done by computer (using Stata), a special adjustment needs to be made to the data to overcome a limitation in the Stata command: the requirement that the numerators of the "rates" to be standardized (what Stata calls -charvar-) be integers. To see how to do this, consult the Stata -do- and -log- files for this example, which are included in the set of downloadable files for this chapter.

TABLE 2.14. Mean Number of Chinese Characters Known (out of 10), for Urban and Rural Residents Age 20-69, China 1996 (Means Shown Without Controls and Also Directly Standardized for Urban-Rural Differences in the Distribution of Education; N = 6,081).
                   Without Controls   Adjusted for Education   N
Rural residents          2.0                  2.4              (3,002)
The standard population is the entire population of China age 20-69, computed from the survey data weighted to reflect differential sampling rates for the rural and urban populations and to correct for variations in household size. Nine cases for which data on education were missing were omitted.
The results are quite straightforward and require little comment. When education is standardized, the urban-rural gap in the mean number of characters correctly identified is reduced from 2.8 to 1.6. Thus, about 43 percent (= 1 - 1.6/2.8) of the urban-rural difference in vocabulary knowledge is explained by rural-urban differences in the level of educational attainment.

Although the four examples presented here all standardize for education, this is purely coincidental. Many other uses of direct standardization are imaginable. For example, it probably would be possible to explain higher crime rates among early twentieth-century immigrants to the United States than among natives simply by standardizing for age and sex. Immigrants were disproportionately young males, and young males are known to have higher crime rates than any other age-sex combination.
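Standardizing means works the same way as standardizing percentages: each group's education-specific means are weighted by a common education distribution. The Python sketch below is only an illustration of that logic. The education levels, the means by level, and the standard shares are hypothetical (they are not taken from Table 2.14); only the two gaps, 2.8 and 1.6, come from the text.

```python
def standardized_mean(means_by_level, standard_shares):
    """Weighted average of education-specific means, using the standard
    population's education shares as weights."""
    return sum(standard_shares[level] * means_by_level[level]
               for level in standard_shares)


def share_explained(gap_unadjusted, gap_adjusted):
    """Proportion of the unadjusted gap that disappears after adjustment."""
    return 1.0 - gap_adjusted / gap_unadjusted


# Hypothetical mean characters recognized, by education level.
urban_means = {"primary": 2.0, "secondary": 5.0, "tertiary": 8.0}
rural_means = {"primary": 1.0, "secondary": 4.0, "tertiary": 7.0}
standard_shares = {"primary": 0.5, "secondary": 0.4, "tertiary": 0.1}  # made up

adjusted_gap = (standardized_mean(urban_means, standard_shares)
                - standardized_mean(rural_means, standard_shares))

# The gaps reported in the text: 2.8 without controls, 1.6 adjusted for education.
explained = share_explained(2.8, 1.6)
print(round(100 * explained))  # 43
```

This sketch sidesteps the integer restriction mentioned for the Stata command because it works with the means directly rather than with numerators and denominators of rates.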
A FINAL NOTE ON STATISTICAL CONTROLS VERSUS EXPERIMENTS

In describing the logic of cross-tabulations, I have been describing the logic of nonexperimental data analysis in general. True experiments are relatively uncommon in social research, although they are widely used in psychological research and increasingly in microeconomics (for a very nice example of the latter, see Thomas and others [2004]). A true experiment is a situation in which the objects of the experiment are randomly divided into two or more groups, and one group is exposed to some treatment while the other group is not, or several groups are exposed to different treatments. If the groups then differ in some outcome variable, the differences can be attributed to the differences in treatments. In such cases we can unambiguously establish that the treatment caused the difference in outcomes (although we may not know the exact mechanism involved). (Of course, this claim holds only when differences between the experimental and control groups are not inadvertently introduced by the investigators as a consequence of design flaws or of failure to rigorously adhere to the randomized trial design. For a classic discussion of such problems, see Campbell and Stanley [1966] or a shorter version by Campbell [1957] that contains the core of the Campbell and Stanley material.)

When experiments are undertaken in fields such as chemistry, sampling is not ordinarily a consideration because it can be confidently assumed that any batch of a chemical will behave like any other batch of the same chemical; only when things go wrong do chemists tend to question that assumption. In the social and behavioral and many of the biological sciences, by contrast, it cannot be assumed that one subject is just like another subject. Hence, in experiments in these fields, subjects are randomly assigned to treatment groups. In this way, it becomes possible to assess whether group differences in outcomes are larger than would be likely to occur by chance because of sampling variability. If so, we can say, subject only to the uncertainty of statistical inference, that the difference in treatments caused the difference in outcomes.

In the social sciences, random assignment of subjects to treatment groups is often (in fact, usually) impossible for several reasons. First, both ethical and practical considerations limit the kind of experimentation that can be done on human subjects. For example, it would be neither ethical nor practically possible to determine whether one sort of schooling was pedagogically superior to another by randomly assigning children to different schools and several years later determining their level of educational achievement. In addition, many phenomena of interest to social scientists are simply not experimentally manipulable, even in principle. The propensity for in-group solidarity to increase in wartime, for example, is not something that can be experimentally confirmed, nor can the proposition that social stratification is more pronounced in sedentary agricultural societies than in hunter-gatherer societies.

Occasionally, "natural experiments" can be analyzed. Natural experiments are situations in which different individuals are exposed to different circumstances, and it can be reasonably assumed that the circumstance to which individuals are exposed is essentially random. A very nice example of such an analysis is the test by Almond (2006) of the "fetal origins hypothesis." He showed convincingly that individuals in utero during the few months in which the 1918 flu pandemic was raging suffered reduced educational attainment, increased rates of physical disability, and lower income in midlife relative to those in utero in the few months preceding and following the epidemic. Because there is no basis for expecting the exact month of conception to be correlated with vulnerability to the flu virus, the conditions of a natural experiment were well satisfied in this elegant analysis. Natural experiments have become increasingly popular in economics as the limitations of various statistical fixes to correct for "sample selection bias" have become more evident. We will return to this issue in the final chapter. (For additional examples of natural experiments that are well worth reading, see Campbell and Ross [1968], Berelson [1979], Sloan and others [1988], and the papers cited in Chapter Sixteen.)

Given the limited possibilities for experimentation in the social sciences, we resort to a variety of statistical controls of the sort discussed here and later. These procedures share a common logic: they are all designed to hold constant some variable or variables so that the net effect of a given variable on a given outcome can be assessed.
THE WEAKNESS OF MATCHING AND A USEFUL FIX

Sometimes survey analysts attempt to simulate random assignment by matching comparison groups on some set of variables. In its original form, this practice was inherently unsatisfactory. When attempting to match on all potentially relevant factors, it is difficult to avoid running out of cases. Moreover, no matter how many variables are used in the match, it is always possible that the experimental and control groups differ on some nonmatched factor that is correlated with the experimental outcome. However, combining matching with statistical controls can be a useful strategy, especially when the adequacy of the match is summarized via a "propensity score" (Rosenbaum and Rubin 1983). For recent treatments of propensity score matching, see Smith (1997), Becker and Ichino (2002), Abadie and others (2004), Brand (2006), Brand and Halaby (2006), and Becker and Caliendo (2007). Harding (2002) is an instructive application. Propensity score matching is also discussed in Chapter Sixteen.
Compared to experiments, statistical controls have two fundamental limitations, which make it impossible to definitively prove any causal theory (although definitive disproof is possible). First, no matter how many control variables we introduce, we can never be sure that the remaining net relationship is a true causal relationship and not the spurious result of some yet-to-be-introduced variable.

Second, although we speak of holding constant some variable, or set of variables, what we usually do in practice is simply reduce the within-group variability for these variables. This is particularly obvious when we are dealing with cross-tabulations because we generally divide the sample into a small set of categories. In what sense, for example, can we be said to "hold education constant" when our categories consist of those with less than a high school education, those with some sort of high school experience, and those with some sort of college experience? Although the within-category variability in educational attainment obviously is smaller than the total variability in the sample as a whole, it is still substantial. Hence, if two other variables both depend on educational attainment, they are likely to be correlated within educational categories as gross as these, as well as across educational categories. As you will see in more detail later, using interval or ratio variables in a regression framework will not solve the problem but merely transform it. Although the within-category variability generally will be reduced, the very parsimony in the expression of relationships between variables that regression procedures permit will generally result in some distortion of the true complexities of such relationships (discontinuities, nonlinearities, and so on), only some of which can be represented succinctly.
Our only salvation is adequate theory. Because we can seldom definitively establish causal relationships by reference to data, we need to build up a body of theory that consists of a set of plausible, mutually consistent, empirically verified propositions. Although we cannot definitively prove causal relations, we can determine whether our data are consistent with our theories; if so, we can say that the proposition is tentatively empirically verified. We are in a stronger position when it comes to disproof. If our data are inconsistent with our theory, that usually is sufficient grounds for rejecting the theory, although we need to be sensitive to the possibility that there are omitted variables that would change our conclusions if they were included in our cross-tabulation or model. In short, to maintain a theory it is necessary but not sufficient that the data be as predicted by the theory. Because consistency is necessary to maintain the theory, inconsistency is sufficient to require us to reject it, provided we can be confident that we have not omitted important variables. (On the other hand, as Alfred North Whitehead is supposed to have said, never let data stand in the way of a good theory. If the theory is sufficiently strong, you might want to question the data. I will have more to say about this later, in a discussion of concepts and indicators.)
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have considered the logic of multivariate statistical analysis and its application to cross-tabulations involving three or more variables. The notion of an interaction effect (a situation in which the effect of one independent variable depends on the value or level of one or more other independent variables) was introduced. This is a very important idea in statistical analysis, and so you should be sure that you understand it thoroughly. We also considered suppressor effects, situations in which the effect of one independent variable offsets the effect of another independent variable because the two effects have opposite signs. In such situations, the failure to include both variables in the model can lead to an understatement of the true relationships between the included variable and the dependent variable. We then turned to direct standardization (sometimes called covariate adjustment), a procedure for purging a relationship of the effect of a particular variable or variables. Direct standardization can be thought of as a procedure for creating "counterfactual" or "what if" relationships: for example, what would be the relationship between religiosity and militancy if we adjusted for the fact that well educated Blacks in the 1960s tended to be both more religious and less militant than less well educated Blacks. Having discussed the logic of direct standardization, we considered several technical aspects of the procedure to see how to standardize data starting not only from tables but also from individual records; to standardize percentage distributions; and to standardize means. We concluded by considering the limitations of statistical controls in contrast to randomized experiments.

In the following chapter, we complete our initial discussion of cross-tabulation tables by considering how to extract new information from published tables; then note the one circumstance in which it makes sense to percentage a table "backwards"; touch for the first but not for the last time on how to handle missing data; consider cross-tabulation tables in which the cell entries are means; present a measure of the similarity of percentage distributions, the Index of Dissimilarity (Δ); and end with some comments about how to write about cross-tabulations.
CHAPTER THREE

STILL MORE ON TABLES

WHAT THIS CHAPTER IS ABOUT

In this chapter we wrap up our discussion of cross-tabulations for now. After spending some time learning to love the computer (a very brief time, actually) and then delving into the mysteries of regression equations, we will return to cross-tabulations and discuss procedures for making inferences about relations embodied in them via log-linear analysis.

We begin this chapter with a discussion of how to extract new information from published tables; then note the one circumstance in which it makes sense to percentage a table "backwards"; touch for the first but not for the last time on how to handle missing data; consider cross-tabulation tables in which the cell entries are means; present a measure of the similarity of percentage distributions, the Index of Dissimilarity (Δ); and end with some comments about how to write about cross-tabulations.
REORGANIZING TABLES TO EXTRACT NEW INFORMATION

Often when working with published data, reading research papers, and so on, we wish that the data had been presented differently. [...]

Collapsing Dimensions

Suppose you were interested in the gross relationship between religion and acceptance of abortion and had available only a table such as Table 2.1 in Chapter Two. How could you [...]

TABLE 3.1. [...] Acceptance of Abortion [...] Education [...]
Catholic
Protestant
Source: Table 2.1.

[...] computing the full table of frequencies is, first, that it provides a check on the accuracy
of your computations and, second, that it permits other tabulations to be constructed, for example, the zero-order relation between education and acceptance of abortion.

Although many other examples could be given, they all follow the same logic. You should get in the habit of manipulating tables to extract information from them. Not only is it a useful skill but it also gives you a better understanding of how tables are constructed.
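One way to see the logic of collapsing a dimension is to convert each cell's percentage back to a frequency, sum the frequencies over the dimension to be dropped, and re-percentage. The Python sketch below illustrates this under loudly stated assumptions: every percentage and cell count is invented for the illustration (none is taken from Table 2.1 or Table 3.1), and the function name and category labels are hypothetical.

```python
def collapse(cells, keep_index):
    """Collapse a percentage table over one dimension.

    cells maps (religion, education) -> (percent accepting, number of cases).
    keep_index says which part of the key to keep (0 = religion, 1 = education).
    """
    totals = {}
    for key, (pct, n) in cells.items():
        kept = key[keep_index]
        accepting, base = totals.get(kept, (0.0, 0))
        # Convert the percentage back to a frequency before summing.
        totals[kept] = (accepting + pct / 100.0 * n, base + n)
    return {kept: 100.0 * accepting / base
            for kept, (accepting, base) in totals.items()}


# Hypothetical three-variable table: percent accepting abortion and cell N.
cells = {
    ("Catholic", "low"): (40.0, 50),
    ("Catholic", "high"): (60.0, 150),
    ("Protestant", "low"): (50.0, 100),
    ("Protestant", "high"): (80.0, 100),
}

by_religion = collapse(cells, keep_index=0)
by_education = collapse(cells, keep_index=1)

print({k: round(v, 1) for k, v in by_religion.items()})
# {'Catholic': 55.0, 'Protestant': 65.0}
print({k: round(v, 1) for k, v in by_education.items()})
# {'low': 46.7, 'high': 68.0}
```

Summing frequencies and re-percentaging is equivalent to taking a weighted average of the cell percentages, with the cell counts as weights. Because the full set of frequencies is built along the way, the same cells yield the other zero-order table as well.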
Collapsing Categories to Represent New Concepts
Sometimes we want to view a variable in a manner entirely different from that envisioned by the original investigator, in which case we may want to reorder the categories. We already have seen one example of this, in our discussion of how to treat "no answer" in our consideration of nominal variables in Chapter One. "No answer" may be thought of as a neutral response and hence as lying between the least positive and the least negative response; or "no answer" might be thought of as not on the continuum at all, and hence best treated as missing data.

Another example can be drawn from the U.S. Congress. In the late 1970s, the New York Times, the Washington Post, and similar rags took to calling conservative Democrats "boll weevils" and liberal Republicans "gypsy moths" (fads come and go; you never hear these terms anymore). Suppose we were conducting a study of members of the U.S. House of Representatives and initially classified every member into one of the following four categories:

1. Standard Republicans
2. Gypsy moths
3. Boll weevils
4. Standard Democrats

This four-category classification can be collapsed into three distinct two-category classifications, each of which represents a different theoretical construct. If we were interested in studying party politics and wanted to know which party controlled the House, we would combine category 1 with category 2, and combine category 3 with category 4:

Standard Republicans, Gypsy moths         Republicans
Boll weevils, Standard Democrats          Democrats

If we were interested in distinguishing between liberals and conservatives, we would combine category 1 with category 3 and combine category 2 with category 4:

Standard Republicans, Boll weevils        Conservatives
Gypsy moths, Standard Democrats           Liberals

If we were interested in studying party loyalty and wanted to know what proportion of congressmen are party loyalists, we would combine category 1 with category 4 and combine category 2 with category 3:

Standard Republicans, Standard Democrats  Party loyalists
Boll weevils, Gypsy moths                 Cross-overs
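The three recodes above can be expressed as three lookup tables applied to the same detailed variable. The following Python sketch assumes a small made-up list of members; the variable and function names are hypothetical, while the category groupings follow the text.

```python
from collections import Counter

# Three different two-category recodes of the same four-category variable.
PARTY = {
    "Standard Republican": "Republican",
    "Gypsy moth": "Republican",
    "Boll weevil": "Democrat",
    "Standard Democrat": "Democrat",
}
IDEOLOGY = {
    "Standard Republican": "Conservative",
    "Boll weevil": "Conservative",
    "Gypsy moth": "Liberal",
    "Standard Democrat": "Liberal",
}
LOYALTY = {
    "Standard Republican": "Party loyalist",
    "Standard Democrat": "Party loyalist",
    "Boll weevil": "Cross-over",
    "Gypsy moth": "Cross-over",
}


def recode(values, mapping):
    """Map each detailed category to its collapsed category."""
    return [mapping[value] for value in values]


# Hypothetical members, coded in full detail.
members = ["Standard Republican", "Gypsy moth", "Boll weevil",
           "Boll weevil", "Standard Democrat"]

party_counts = Counter(recode(members, PARTY))
ideology_counts = Counter(recode(members, IDEOLOGY))
loyalty_counts = Counter(recode(members, LOYALTY))

print(dict(party_counts))     # {'Republican': 2, 'Democrat': 3}
print(dict(ideology_counts))  # {'Conservative': 3, 'Liberal': 2}
print(dict(loyalty_counts))   # {'Party loyalist': 2, 'Cross-over': 3}
```

All three recodes are possible only because the detailed four-category variable was kept; none of them could be recovered from either of the others.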
The point of all this is that nothing is sacrosanct about the way a variable is originally constructed. You can and should recode variables freely to get the best representation of the concept you are interested in studying.

A very important corollary of this point is that when you are designing or executing a data collection effort, you should always conserve as much detail as possible. In the early days of survey research, the technology of data manipulation encouraged researchers to pack as many variables as possible onto one IBM card; hence highly aggregated classifications were adopted to save space (and the tedium of manipulation). The technology has changed. Today there is, with one exception, no reason not to preserve as much detail as possible in the initial coding of your variables. (The exception is that you need to design your data-collection instrument in a way that minimizes respondent, interviewer, and coder error. For example, in a survey with data collection done by face-to-face interviews, a lengthy and complicated coding scheme for a variable is likely to increase interviewer error.) You never know when you will get a new idea that will require recoding one or several variables; and if you lack imagination, the next user of the same data may not. Every experienced survey analyst has felt great frustration on countless occasions because detail [...] was not [...]. [...] is an easy computer operation; disaggregating variables is impossible, at least without going back to the original questionnaire, and usually not then either.
WHEN TO PERCENTAGE A TABLE "BACKWARDS"

There is one exception to the rule that tables should be percentaged so that the categories of the dependent variable add to 100 percent. This is when the sample is not representative of the population "at risk" of falling into the various categories of the dependent variable. Sometimes samples are stratified on the dependent variable rather than the independent variable or variables; that is, sometimes they are chosen on the basis of their value on the dependent variable. Various hard-to-find populations are typically sampled in this way: convicted criminals, university students, political activists, cancer patients, and so on.
TABLE 3.2. Social Origins of Nobel Prize Winners (1901-1972) and Other U.S. Elites (and, for Comparison, the Occupations of Employed Males 1900-1920).
Father's Occupation
Professional
Other
Total %
. . 1q :.. i '-1q0% l 28
18
100%
15
57
28
100%
24
35
41
100%
Nobel laureates
Senators
Employed males
1900
1910
1920
Source: Adapted from Zuckerman (1977, 64).
For example, Table 3.2 shows the social origins of various American elites. In this case the table is percentaged to show the distribution of fathers' occupations for each of a number of elite groups, and also for the U.S. labor force as a whole for selected years roughly corresponding to when the fathers were in the labor force. The point of the table is, of course, to show that elites come from elite origins: much higher percentages of the members of these elites are from professional or managerial origins than would be expected if their fathers' occupations corresponded to the distribution of professionals and managers in the labor force. The table is percentaged in this direction, contrary to the usual rule that percentages express the conditional probability of some outcome given some causal or antecedent condition, because it is constructed from information obtained from samples of elites (plus some general labor force data), and therefore is not representative of the social origins of the population. It would not be sensible to use data from a representative sample of the population to study the likelihood that the children of professionals become Supreme Court justices, Nobel laureates, and so on, because we would virtually never find any cases with these outcomes unless we obtained data from the entire population; the outcomes are simply too rare. Thus, in such cases, we rely on response-based samples, and percentage the table so as to show the distribution of the independent variable for each response category: in the present case, the social origins of various elites compared to the general population.

CROSS-TABULATIONS IN WHICH THE DEPENDENT VARIABLE IS REPRESENTED BY A MEAN

When the dependent variable is an interval or ratio variable, it often is useful to display the means of the dependent variable within categories formed by the cross-classification of independent variables. For example, suppose you are interested in the relationship among education, gender, and earnings, perhaps because you suspect that women get smaller returns on their education than do men. Table 3.3 shows the mean annual income in 1979 for full-time workers by level of educational attainment and gender, computed from the 1980 NORC General Social Survey.
TABLE 3.3. Mean Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).
College graduate
27,227 (46)
11,789
16,288 (131)
10,324 (105)
13,536 (236)
11,135 (246)
16,654
20,415 (380)
(35)
20,512 (81)
(626)
TECHNICAL POINTS ON TABLE 3.3

1. Note that the format of this table is identical to that of Table 1.6 from Chapter One, except that percentages are presented in Table 1.6, and means are presented here. The tables are read in the same way.

2. In this table levels of educational attainment are presented in descending order. Either descending or ascending order is appropriate; the choice should depend on which makes the discussion easier.

3. Note that this table includes only 626 cases, out of a total sample of 1,468. This reflects the fact that many individuals do not work full time, particularly women, and also that information on education and income is missing for some individuals. Sometimes it is useful to catalog the missing cases, especially when there are many missing cases or when their distribution has substantive importance. In such cases, a footnote can be added to the table or an addition made at the bottom of the table, for example:

Number of cases in table                      626
No information on income                       57
No information on education                     1
No information on education and income          1
Total working full time                       685
Men not working full time                     235
Women not working full time                   549
Total in full sample                        1,469

The tabulation shows 1,469 cases when there are 1,468 cases in the sample because of rounding error. Because of errors in the execution of a "split ballot" procedure in the 1980 GSS, the data have to be weighted to be representative of the population (Davis, Smith, and Marsden 2007). We will consider weighting issues in Chapter Nine.

Even when you do not present the information shown in the tabulation, it is wise to compile it for yourself, as a check on your computations. In fact, in the course of creating the preceding accounting of missing cases, I discovered a computing error I had made that resulted in incorrect numbers in Table 3.3 (since corrected).

An alternative way to display these data, which would make the point of the table more immediately evident to the reader, is to show in the rightmost column female means as a percentage of male means, rather than the total means. Table-making is an art, and the aim of the game is to make the message as clear and easy to understand as possible.
Inspecting Table 3.3, you see that in 1980, women earned much less than equally well-educated men, although for both men and women income tended to increase as the level of education increased. The gender difference in incomes is striking: on average, women earned just over half of what men did, and the best educated women (those with post-graduate training) earned less on average than did the least educated men (those who did not complete high school). To provide an easily grasped comparison of male and female average incomes for each level of education, we can compute the ratio of female to male means. Ordinarily, these would simply be included in an additional column in the table or as a substitute for the total column.
Educational Attainment     Mean Female Income Expressed as a
                           Percentage of Male Mean Income
Post-graduate training     44
College graduate           43
Some college               68
High school graduate       63
Less than 12 years         53
Total                      55
The computations here are just the ratios multiplied by 100, which yield the female means expressed as percentages of the male means. They show that within education categories, women on average earn between two-fifths and two-thirds what men do. You might be curious whether things have changed since 1980. To find out, you can construct the same table from a more recent GSS.
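The ratio computation is simple enough to check by hand or in a few lines of code. The Python sketch below assumes that the male and female means for two rows of Table 3.3 are 16,288 and 10,324 (high school graduates) and 20,415 and 11,135 (total); this pairing of means with rows is an assumption made for the illustration, although it does reproduce the published ratios of 63 and 55. The function name is hypothetical.

```python
def female_as_pct_of_male(female_mean, male_mean):
    """Female mean income expressed as a percentage of male mean income."""
    return 100.0 * female_mean / male_mean


# (male mean, female mean); the assignment of means to rows is assumed.
means = {
    "High school graduate": (16288, 10324),
    "Total": (20415, 11135),
}

ratios = {row: round(female_as_pct_of_male(female, male))
          for row, (male, female) in means.items()}

print(ratios)  # {'High school graduate': 63, 'Total': 55}
```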
SUBSTANTIVE POINTS ON TABLE 3.3

The ratio of female to male incomes shown here (55 percent) is somewhat lower than the ratio typically estimated from census data (for example, Treiman and Hartmann 1981, 16), which is about 60 percent. The discrepancy may reflect differences in the definition of full-time workers. Most of the computations based on census (or Current Population Survey [CPS]) data define "full-time year-round" workers as those employed at least thirty-five hours in the week preceding the survey and employed at least fifty weeks in the previous year. The GSS question, by contrast, asks whether people were working in the previous week and if so, how many hours, or, if they had a job but were not working in the previous week, how many hours they usually worked. It may be that the GSS table includes a substantial number of people who did not work full time in the previous year and therefore had lower incomes than those employed full time, whereas these people would be excluded from computations based on census or CPS data. Because women tend to have more unstable employment histories than men, it is probable that those included in the GSS but not the census definition of "full time" would be mainly women, which would lower the GSS ratio relative to ratios computed from census or CPS data.

Note that there is a certain amount of slipperiness to the analysis using either the GSS or the census definitions of full-time workers: information on hours worked per week at the time of the survey is related to income computed for the previous calendar year. There is no help for this because the alternative is to ask about hours per week typically worked last year, which is bound to be highly error prone, or to ask about current salary or wage, which is also highly error prone because income is highly variable over the course of the year. The convention, which is the convention because it is thought to yield the best data, is to ask about hours worked in the past week but to ask the weeks worked and income questions with respect to the past calendar year.

Another possible reason for the discrepancy between the GSS and census estimates of the ratio of female to male incomes is that the GSS figures are subject to substantial sampling error. We will take up statistical inference in survey analysis in Chapter Nine.

The point of this note is to emphasize that whenever your results differ from those reported by others, especially those that are widely cited, it is important to attempt to account for the differences as best you can, and to eliminate candidate explanations that prove to be incorrect. Your papers should be filled with comments of this sort; they give the reader confidence that you have thought through the issues and are aware of what is going on in your data and in the literature.
ffierences from lnformation on Missing Data the t6), ro f rcn last €K5
9tn Jn9 rble
Note that the catalog of sources of missing data presented in the technical note on Table 3.3 can be combined with information in the table to get an approximate estimate of differences in labor force participation rates between men and women. The row marginal of the table tells us that there are 380 males and 246 females employed full time for whom complete information is available. From the information in the technical note, we see that there are 235 males and 549 females who are not employed full time. If we are willing to ignore the 59 people who are employed full time but for whom information is missing on education or income, we can estimate that 62 (= [380/(380 + 235)]*100) percent of the males in the sample and 31 (= [246/(246 + 549)]*100) percent of the females in the sample were employed full time during the week of the survey. Of course, because we have the data, we could get these estimates directly and would not have to ignore the 59 missing cases. But if we had only the published table and the accounting of sources of missing data, we could use them to estimate labor force participation rates, even though the table was not presented with this in mind.
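The arithmetic in this estimate can be checked with a few lines of code. The following is a minimal sketch in Python; the function name is invented for illustration, and the counts are the ones quoted in the text.

```python
def percent_full_time(full_time, not_full_time):
    """Percentage employed full time among cases whose status is known."""
    return 100 * full_time / (full_time + not_full_time)


# Counts from the row marginal of Table 3.3 and the technical note.
males = percent_full_time(380, 235)
females = percent_full_time(246, 549)

print(round(males), round(females))  # 62 31
```

Note that the 59 full-time workers with missing education or income are left out of both numerator and denominator, exactly as in the hand computation.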
Still Another Way of Presenting the Same Data
Sometimes it is useful to present standard deviations as well as means in tables such as Table 3.3. When you need to present standard deviations as well as means, a useful way to avoid overcrowding your tables is to present several panels, as in Table 3.4. The point of presenting the standard deviations is both to enable the reader to do statistical inference computations from the data in the table (the standard deviations are needed to compute confidence intervals for tests of the significance of the difference between means) and to provide substantive information. For example, it is informative to note, from the rightmost column, that the heterogeneity in income is more than three times as great for men with post-graduate training as for women with post-graduate training, a ratio that is much larger than for any of the other levels of education. This gives us a hint as to why the average income of women with post-graduate training is so low: unlike their male counterparts, some of whom get extremely high-paying jobs, these women appear to be locked into a set of jobs with a very narrow range of incomes. We could take this further by investigating the properties of such jobs, but we will not do so here.

A serious shortcoming in the comparison of means across groups is that means, unlike medians, are sensitive to outliers (extreme observations). Thus, for example, the inclusion of a few very high-income people in a sample can substantially affect the computed means. This is equally a problem when the data are coded into a set of categories with a top code for incomes higher than some value, as is the case for the income measures used in the GSS. In 1980 the top code for income was $50,000. To compute a mean, a value has to be assigned to each category. This is not much of a problem for most categories; it is conventional, and reasonably accurate, to simply assign the midpoint of the range included. For example, the bottom category, "under $1,000," would be assigned $500, and so on.
But for the top code, any decision is likely to be arbitrary. One possibility is to use a Pareto transformation to estimate the mean value of the top code (Miller 1966, 215-220), but this depends on rather strong assumptions regarding the shape of the distribution. In the analysis shown here I thus, rather arbitrarily, assigned $62,500 to the top code. Had I assigned, say, $75,000, the male-female income differences for well-educated people would have been larger, and the male standard deviations would have been larger as well.

In the case of skewed (asymmetrical) distributions, where one tail is longer than the other, of which income is perhaps the most common example, it makes more sense to compute medians for descriptive purposes, although for analytic purposes most analysts resort to a transformation of income, usually by taking the natural log of income, because medians are algebraically intractable. Table 3.5 is the equivalent of Table 3.3 except that medians are substituted for means. (If an analyst wants an analog to a standard deviation, the interquartile range is commonly used.) In this case the means and the medians yield similar interpretations, but often this is not the case.
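To see why the choice of a top-code value matters for means but not for medians, consider the following minimal sketch in Python. The income categories and counts are invented for illustration (they are not the GSS categories); only the $62,500 and $75,000 top-code values come from the text.

```python
from statistics import mean, median

# Hypothetical grouped income data: (lower bound, upper bound, count).
# The last category is open-ended (top-coded), so its upper bound is None.
CATEGORIES = [
    (0, 1_000, 5),
    (1_000, 10_000, 40),
    (10_000, 25_000, 35),
    (25_000, 50_000, 15),
    (50_000, None, 5),
]


def assigned_values(categories, top_code_value):
    """Assign each case the midpoint of its category, or the chosen top-code value."""
    values = []
    for lower, upper, count in categories:
        value = top_code_value if upper is None else (lower + upper) / 2
        values.extend([value] * count)
    return values


low = assigned_values(CATEGORIES, 62_500)
high = assigned_values(CATEGORIES, 75_000)

# The mean shifts with the arbitrary top-code value; the median does not.
print(mean(low), mean(high))
print(median(low), median(high))
```

In this invented example only 5 percent of the cases are top-coded, yet the mean moves by several hundred dollars while the median stays where it was.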
TABLE 3.5. Median Annual Income in 1979 Among Those Working Full Time in 1980, by Education and Gender, U.S. Adults (Category Frequencies Shown in Parentheses).

                        Male            Female          Total
...
College graduate        23,750 (46)     11,250 (35)     18,750 (81)
...
High school graduate    16,250 (131)     9,000 (105)    11,250 (236)
...
Total                   16,250 (380)    11,250 (246)    13,750 (626)
INDEX OF DISSIMILARITY
Thus far we have studied the association between two or more variables by comparing percentages, means, or medians across categories of the independent variable or variables. As we have noted already, there are situations in which this strategy does not yield particularly informative results. In particular, when there are large numbers of categories in a distribution, comparing the conditional percentages in any one category ignores most of the information in the table.

Suppose you are interested in knowing whether the labor force is more segregated by sex or by race. You might investigate this by cross-tabulating occupation by sex and race, as in Table 3.6. Visually, the table is of little help; it is not obvious whether the distributions of the two racial groups or the two gender groups are more similar. To decide this, you can compute the Index of Dissimilarity (Δ), given by
$$\Delta = \frac{\sum_{i} \lvert P_i - Q_i \rvert}{2} \tag{3.1}$$
where P_i equals the percentage of cases in the ith category of the first distribution and Q_i equals the percentage of cases in the ith category of the second distribution. This index can be interpreted as the percentage of cases in one distribution that would have to be shifted among categories to make the two distributions identical. If the two distributions are identical, Δ will of course be 0. If they are completely dissimilar, as would be, for example, the distribution of students by gender in an all-girls school and an all-boys school, Δ will be 100.

From Table 3.6 we can compute Δ for each pair of columns. For example, the Δ for White males and White females (which gives us the extent of occupational segregation by sex among Whites) is computed as 42.1 = (|15.6 - 16.4| + |14.9 - 6.8| + ...)/2. In the present case, four of the six comparisons are of interest:
Occupational segregation by sex among
    Whites                 42.1
    Blacks and others      41.3
Occupational segregation by race among
    Men                    24.3
    Women                  18.2
From these computations, we see that more than 40 percent of White women would have to change their major occupation group to make the occupational distribution of White females identical to that of White males, and similarly for Black and other women relative to Black men (note that the coefficient is symmetrical, so we could as easily discuss the extent of change required of males to make their distributions similar to those of females). By contrast, less than one-quarter of Black males would have to change major occupation groups to make the Black male distribution identical to the White male distribution, and among women, the corresponding proportion is less than one-fifth. Thus, we conclude that occupational segregation by sex is much greater than occupational segregation by race. Although it is not common to report tests of significance for Δ, it is possible to do so. (See Johnson and Farley [1985] and Ransom [2000] for discussions of the sampling distribution of Δ.)

One important limitation of the Index of Dissimilarity is that it tends to increase as the number of categories increases (Δ cannot get smaller if the categories of a distribution are disaggregated into a larger number of categories; it can only get larger or remain unchanged). Hence, comparisons of Δs are legitimate only when they are computed from distributions based on identical categories. For example, it would not be legitimate to use Δ as a measure of the degree of occupational sex segregation in different countries because occupational classifications tend to differ from country to country (unless, of course, the distributions were recoded to a standard classification, for example, the International Standard Classification of Occupations [International Labour Office 1969, 1990] or some aggregation of this classification).
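The computation of Δ is easy to express in code. The following is a minimal sketch in Python; the function name and all of the percentage distributions are invented for illustration and are not the Table 3.6 figures.

```python
def index_of_dissimilarity(p, q):
    """Half the sum of absolute differences between two percentage distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must use the same categories")
    return sum(abs(p_i - q_i) for p_i, q_i in zip(p, q)) / 2


# Completely dissimilar: an all-girls school versus an all-boys school.
print(index_of_dissimilarity([100, 0], [0, 100]))  # 100.0

# Two hypothetical three-category distributions.
fine_a = [30, 20, 50]
fine_b = [10, 40, 50]
print(index_of_dissimilarity(fine_a, fine_b))  # 20.0

# Collapsing the first two categories hides the difference entirely.
coarse_a = [fine_a[0] + fine_a[1], fine_a[2]]
coarse_b = [fine_b[0] + fine_b[1], fine_b[2]]
print(index_of_dissimilarity(coarse_a, coarse_b))  # 0.0
```

The last two results illustrate the limitation noted above: the same pair of groups yields a larger Δ under the finer classification, so Δs are comparable only when computed over identical categories.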
TABLE 3.6. Percentage Distribution over Major Occupation Groups by Race and Sex, U.S. Labor Force, 1979 (N = 96,945).

Columns: White Men, Black and Other Men, White Women, Black and Other Women. Rows: the major occupation groups (among them managers and administrators, clerical workers, other service workers, and farm laborers and supervisors), followed by N (thousands). [...]

Source: Adapted from Treiman and Hartmann (1981, 26).
WRITING ABOUT CROSS-TABULATIONS
In writing about cross-tabulations, or for that matter, quantitative relationships of any kind, the aim of the game is clarity, not elegance. You should try to say enough about what the table shows to guide the reader through it but not so much as to confuse or bore. Strive for economy of prose. Hemingway is a good model. Among quantitative social scientists, Nathan Keyfitz (who adds charm to simplicity) and Paul Lazarsfeld (a good example because his native language was German, not English, and he was said to write his drafts a half dozen times or more before being satisfied with the prose) are worth emulating. Robert Merton is not a quantitative sociologist but is a good negative role model anyway. He is excessively ornate and uses erudition to finesse sticky points. Too many social scientists are simply turgid. Howard Becker's book, Writing for Social Scientists (1986), is a wonderful primer on writing good social science, but he does not pay much attention to writing about quantitative data. However, two recent books by Jane Miller (2004, 2005) do this very well, providing much useful advice. It would be well worth your time to consult both of these texts, the first of which focuses on cross-tabulations and the second on multivariate models. The following are some specific pointers for writing about the sort of data we are concerned with here:
• Describe tables mainly in terms of their substantive implications. Cite numbers only as much as is necessary to make clear what the table shows, and then state the conclusions the numbers lead you to. The point of presenting data is to test ideas, so the data should be discussed in terms of their implications for the ideas (hypotheses) being tested. Simply citing the numbers is not sufficient. On the other hand, you need to cite enough numbers to guide the reader through the table because most readers, including most professional social scientists, are more or less illiterate when it comes to reading tables.
• Strive for simplicity. Try to state your argument and describe your conclusions in terms your ancient grandmother or your cousin the appliance salesman would understand. There is no virtue in obscurity. Obscurity and profundity are not synonyms; obscurity and confusion are, at least in this context. As our brethren in the physical sciences know, truly elegant explanations are almost always simple.
• Avoid phrases that add no meaning. For example, instead of "We now investigate what inference we can make as to whether A might be said to have an effect on B," write "Does A affect B?"
• Avoid passive constructions. "It is found that X is related to Y" tells us no more than "X is related to Y." Avoid "A scale of support for U.S. foreign policy was constructed." Who constructed it, God? Write "I constructed a scale of support for U.S. foreign policy" or "I used the University of Michigan Internationalism Scale to measure support for U.S. foreign policy."
• Avoid jargon when it does not help. Note that I did not suggest avoiding jargon altogether. Jargon, the technical terms of a particular discipline or craft, has a clear function: economy. Use jargon terms when they enable you to convey a point in a sentence that otherwise would require a paragraph. But if ordinary [...]

[...] the wall next to your word processor, for comfort:

• Don't get it right, get it written.
• Write, don't read.
• Don't let the perfect become the enemy of the possible.
• Have the courage to be simpleminded.
• Anything worth doing is worth doing superficially (with thanks to John Tukey).
• The last 10 percent of the work takes half the time. The first 10 percent of the work also takes half the time.
• You can't write the second draft until you have written the first draft.
• Write honest first drafts. (Show your friends your first drafts, not your fifth drafts passed off as first drafts. It is much more efficient to get others to tell you what is wrong, and what is right, with your prose than to try to figure it out for yourself.)
• There is no such thing as good writing, only good rewriting.
• Accept criticism gracefully, even though it feels like rape, castration, or some similar violation of your person; it happens to everybody, and everybody feels the same way.
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have seen how to extract new information from published tables. Then we noted the one circumstance in which it makes sense to percentage a table "backwards": when we analyze data derived from "response-based" samples, samples stratified on the dependent variable. We saw why it is necessary to provide information on cases in the sample but excluded from a table, and how to do this. We considered how to construct and interpret cross-tabulations in which the cell entries are means (and standard deviations). We learned how to compute the Index of Dissimilarity (Δ), a measure of the similarity of percentage distributions. And we considered how to write about cross-tabulations.

All of our work so far has been based on paper-and-pencil operations, involving at most a hand calculator. In the next chapter we enter the world of modern social research by learning how to construct cross-tabulations from data on individuals via computer software designed for statistical analysis, focusing on the statistical package Stata, which we will use in the remainder of this book.
CHAPTER FOUR
ON THE MANIPULATION OF DATA BY COMPUTER

WHAT THIS CHAPTER IS ABOUT
In this chapter we consider how to manipulate data by computer to produce cross-tabulations. The same logic of data manipulation will apply when we get to regression analysis, so this chapter serves also as an introduction to statistical analysis by computer. We consider how data files (of the kind that are of interest in this book) are organized and how to extract data from them; we consider ways to transform variables to make them represent the concepts of interest to us; and we again address the nagging problem of how to handle missing data.
INTRODUCTION
Most statistical analysis by social [...] academic users. Stata is rapidly becoming [...]

[...] It is not an easy language to teach or to learn. Fortunately, Stata [...] even with very large data sets (for example, [...] percent sample of the [...] census). It is capable of doing most of the [...] modern data analysis [...] generally simple and straightforward [...] Appendix 4.A [...] provides tips for carrying out data analysis using Stata.
[...] are available both for Unix platforms and for PCs. Of the three, Stata is the most nearly identical across platforms. There are, of course, many other statistical packages as well. Many of these packages are worth exploring, but only after you have mastered the material in this book. As you will discover, the logic of data analysis by computer is fairly standard, although the command language differs somewhat from program to program. Having once mastered the basic logic, it is easy to apply it to other data sets and other statistical package computer programs.
HOW DATA FILES ARE ORGANIZED
The standard way to think of the organization of data in a computer is to imagine a matrix in which the rows are cases and the columns (or sets of columns) are variables. Specifically, consider a data set that contains 257 variables (but 422 columns of data because some variables occupy more than one column) and 1,609 respondents (and hence 1,609 rows of data). For the sake of concreteness, think of the data set as containing information from a representative sample of the U.S. population. In such a data set the information might be organized as in Exhibit 4.1.

To manipulate these data, we need a map to the data set, which tells us where in the data set particular information is located and what the information means. Such a map is known as a codebook. In the present example, we might have a codebook something like the one shown in Exhibit 4.2.

Armed with the information in the codebook, we now know exactly what the data set contains. It consists of one record per respondent for each of 1,609 respondents. High-[...] also generally provide information about the characteristics of the [...] on which the data set is based [...]
EXHIBIT 4.2. A Codebook Corresponding to Exhibit 4.1.

Variable   Column   Variable   Variable Label and Code
Number              Name
1          1-4      IDNO       Identification number
2          5        SEX        Sex of respondent
                                 1  Male
                                 2  Female
3          6-7      AGE        Age (exact year)
                                 99  99 or older
...
257        422      POLICY     The policies of the president are
                                 1  Wonderful
                                 2  OK
                                 3  Not so hot
                                 4  God awful
                                 5  Who knows and who cares
                                 6  No answer or uncodable
[...] computer readable and are known as files. In the present example, the first four columns give the identification number for each respondent. Usually this is of little interest at the data analysis stage, but it is vitally necessary to keep track of the data and is crucial if ever we want to add additional data to the file, for example, if we have conducted another survey of the same respondents and want to merge the data from the two surveys, or if we want to supplement interview responses with information from organizational records, and so on. Column 5 gives the sex of the respondent, columns 6 and 7 give the age, and column 422 gives the response to a question about the policies of the president.

Using the response categories indicated in the codebook, we see that the first respondent is a twenty-seven-year-old male who thinks the president's policies are god awful and that the second respondent is a forty-one-year-old female who thinks the president's policies are wonderful. The third respondent is a woman for whom no information is available regarding either her age or her judgment of presidential policies. Perhaps she refused to answer these questions in the interview or gave nonsensical responses, or perhaps there was some sort of editing error that destroyed the information; in any event, it is unavailable to the data analyst. (Note that there is no "n/a" code for sex. It is rare to find a "no answer" code for sex, at least in interview surveys, because the interviewer usually records this information.) Some codebooks give the frequency distribution (the marginals) for each variable. This is a very useful practice, and if you construct a codebook, you should include the marginals (the -codebook- command in Stata accomplishes this). Their inclusion permits better initial judgments as to suitable cutting-points for variables as well as a standard against which to check your computer output for accuracy. It is very easy to make mistakes when specifying computer runs, so you should check each run for consistency with previous runs and the marginals.

Suppose we wanted to ascertain whether men and women differ in their support for presidential policies. To do this, we might cross-tabulate the presidential policy question by sex, percentaging the table so that the judgment of presidential policies is the dependent variable. Thus we have to instruct the computer where to find each variable, to do the cross-tabulation, and to percentage the table in the appropriate direction. We also have to instruct the computer what to do about the "no answer" category in the presidential policy variable.

There are two ways to specify how to locate data in a file, and computer programs differ as to whether either or both is permissible. Some programs use instructions that point to particular columns in the file, for example, "cross-tabulate column 422 by column 5." More commonly, programs require that the analyst first specify where in the data set each variable is located and then use variable names to command particular manipulations, for example, "The variable SEX is in column 5 and the variable POLICY is in column 422. Cross-tabulate POLICY by SEX." A variant of this approach is to require a map numbering the variables sequentially and specifying their location, for example:

Variable   Columns
001        1-4
002        5
003        6-7
...
257        422
"Cross-tabulate VAR257 by VAR002." In most current programs, including Stata, SAS, and SPSS, such maps are created in the course of creating system files; as part of the preparation of the file, variable names (usually restricted to eight characters, although no longer so in Stata beginning with Version 6.0), variable labels, and value labels (indicating the meaning of each response category) are attached to the file, and variables are then identified by name. In instructions to the program, which are known as commands, the analyst uses the names of the variables and need not be concerned about their location in the file. For example, the Stata command

    tab policy sex, col

instructs the computer to cross-tabulate POLICY (the row variable) by SEX (the column variable) and compute column percentages. Note that in Stata, variable names are case sensitive. Thus Stata regards sex, Sex, and SEX as three different variables. (Although in this book variable names appear in ALL CAPS, to make it easier to distinguish variable names from other words in a sentence, in my Stata command files [-do- files; see the following discussion] I always name files with lowercase names to avoid extra typing and the error that accompanies it.)
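The logic of locating variables by column and then cross-tabulating them can be sketched outside any particular statistical package. The following minimal Python sketch assumes the column locations given in the codebook (SEX in column 5, POLICY in column 422); the records, the helper names, and the decision to drop "no answer" cases are all assumptions made for illustration, not a description of what Stata does internally.

```python
from collections import Counter

# Column locations from the codebook (1-indexed, inclusive).
COLUMNS = {"idno": (1, 4), "sex": (5, 5), "policy": (422, 422)}
NO_ANSWER = 6  # "No answer or uncodable" in the POLICY codes


def make_record(idno, sex, policy):
    """Build a hypothetical 422-column record; unused columns stay blank."""
    return f"{idno:04d}{sex}".ljust(421) + str(policy)


def read_field(record, name):
    """Extract one variable from a fixed-column record as an integer."""
    start, end = COLUMNS[name]
    return int(record[start - 1:end])


def column_percentages(records):
    """Cross-tabulate POLICY (rows) by SEX (columns), percentaged by column."""
    counts = Counter()
    for record in records:
        sex = read_field(record, "sex")
        policy = read_field(record, "policy")
        if policy == NO_ANSWER:
            continue  # assumption: drop "no answer" cases from the table
        counts[(policy, sex)] += 1
    column_totals = Counter()
    for (policy, sex), n in counts.items():
        column_totals[sex] += n
    return {
        (policy, sex): 100 * n / column_totals[sex]
        for (policy, sex), n in counts.items()
    }


# Invented (sex, policy) pairs for eight respondents.
answers = [(1, 4), (2, 1), (2, 6), (1, 1), (2, 3), (1, 4), (2, 1), (1, 2)]
records = [make_record(i, sex, policy)
           for i, (sex, policy) in enumerate(answers, start=1)]
percentages = column_percentages(records)
for (policy, sex), pct in sorted(percentages.items()):
    print(f"policy={policy} sex={sex}: {pct:.1f}%")
```

Percentaging within each sex makes the judgment of presidential policies the dependent variable, as the text requires. Excluding the "no answer" category is only one possible treatment; whatever is done with such cases should be stated explicitly.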
A Digression on Card Decks and Card-Image Computer Files
Computers began to be used extensively in the social sciences in the mid-1960s but did not become ubiquitous until the 1970s. As a consequence, many data sets still of interest were created for use with pre-computer analytic technology, specifically with machinery that reads IBM punch cards (see Figure 4.1). Although the logic of data organization is similar to that used for analysis by computer, the technology dictated several important differences. Whereas there is in principle no limit to the number of variables that can be included in a single computer record (although there are limitations as to how many variables a program can handle), an IBM card contains eighty columns. Because the machinery for manipulating IBM cards could handle only one card at a time (such machines were called unit record equipment, where the record was one card in length), there was a premium on packing as many variables as possible onto a single card.
A card data set consists of one or more cards per respondent. For example, to represent all of the data contained in our illustrative 257-variable, 422-column data set would require 6 cards per respondent (= 422/80, rounded up) for each of 1,609 respondents, or 9,654 cards. The information shown for the first respondent might be represented on an IBM card as in Figure 4.1, where the response to presidential policies is contained in column 64, but otherwise the columns correspond to those in Exhibit 4.1.

FIGURE 4.1. An IBM Punch Card.

An analyst wanting to cross-tabulate responses to presidential policies by sex would pass the deck through a counter-sorter, which would physically divide the deck into two subdecks by reading the holes punched in a designated column, column 5 in this case. Cards with a "1" punch would fall into the 1 pocket in the machine, and cards with a "2" punch would fall into the 2 pocket. Each of these subdecks would then be passed through the machine a second time, and the distribution of punches in column 64 would be counted and displayed for the analyst to copy by hand onto paper. These counts would generate the bivariate frequency distribution of judgments regarding presidential policy by sex, which would then be percentaged in the usual way (using a desk calculator).

This technology had several important consequences for data organization and data analysis. First, it discouraged the use of statistical methods other than cross-tabulations because all it could do was generate the counts needed as input to statistical procedures; the algebraic manipulation still had to be carried out by hand. Second, it discouraged the retention of detailed information; there was, indeed, a great premium on squeezing the response categories into a single column if at all possible because a two-column variable was tedious to manipulate (it required much more card handling because the variable had to be sorted on the first digit, and then each of the resulting categories had to be sorted on the second digit) and produced more detail than could be used effectively in a cross-tabulation. This resulted in the use of what are known as zone punches, the locations on an IBM card above the numerical columns, which also were used for "+" and "-" (sometimes called "x" and "y" punches), and also the use of blanks (no punch) as a meaningful category. Thus, for example, it would be unlikely for age to be represented by single years in data sets designed for use with unit record technology; rather, a set of age categories would be predesignated. Third, in the interest of getting as many variables as possible on a single card (because it was impossible to include in a single tabulation variables located on different cards), some analysts resorted to putting more than one variable into a single column. Consider variables 2 and 257 in the preceding example (Exhibit 4.1).
Because there are only two possible responses to the sex item and six to the presidential policy item, they could be included in a single column simply by using punches 4-9 for the presidential policy response categories. A device on the counter-sorter machine made it possible to suppress some punches and sort on others. Columns of this kind were known as multiple-punched columns.

All of these devices for packing as much data as possible into a single IBM card played havoc when the shift to data analysis by computer occurred. Because most computer programs were designed to recode data from one set of symbols to another, the simple cases were those in which zone punches and blanks were used as meaningful categories. A much more difficult problem arose when columns were multiple-punched. Such cases usually required extensive specialized computer programming to convert them into computer-readable form.

Even after computers became widely available for social research, data sets were often initially prepared in machine-readable form on IBM cards using a keypunch machine and then read into computers and transferred to storage media such as computer tape; only relatively recently have keypunch machines been replaced by work stations that permit keying data directly into a computer file. Hence, many existing data sets, including the NORC GSS well into the 1990s, are organized as card-image records. That is, they are represented in computer storage media as a series of eighty-column records for each respondent. Typically, the first three or four columns contain the respondent identification number and column 80 contains the record number, or deck ID. This organization of data has no consequences for analysis, but it does affect the way the computer is instructed to read the data. The specific details vary depending on the program you use, but you should be aware of this alternative mode of data organization, in addition to the specification of one long record per respondent with which I began the discussion.
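A minimal Python sketch may help make the card-image layout concrete. It assumes, as described above, an identification number in columns 1-4 and a single-digit deck number in column 80 of every card; the placement of sex, age, and presidential policy on deck 1 follows the Figure 4.1 description, and the helper names and sample cards are invented for illustration.

```python
import math
from collections import defaultdict

CARD_WIDTH = 80


def cards_needed(n_columns, n_respondents, card_width=CARD_WIDTH):
    """Cards per respondent, and in total, for a data set of n_columns."""
    per_respondent = math.ceil(n_columns / card_width)
    return per_respondent, per_respondent * n_respondents


def make_card(idno, deck, fields):
    """Build a hypothetical card image: ID in columns 1-4, deck number in column 80."""
    card = [" "] * CARD_WIDTH
    card[0:4] = f"{idno:04d}"
    card[CARD_WIDTH - 1] = str(deck)
    for column, text in fields.items():  # column numbers are 1-indexed
        card[column - 1:column - 1 + len(text)] = text
    return "".join(card)


def group_cards(lines):
    """Index card images by respondent ID and then by deck number."""
    respondents = defaultdict(dict)
    for line in lines:
        card = line.ljust(CARD_WIDTH)
        respondents[int(card[0:4])][int(card[CARD_WIDTH - 1])] = card
    return respondents


def read_field(cards, deck, start, end):
    """Read columns start..end (1-indexed, inclusive) from one respondent's deck."""
    return cards[deck][start - 1:end]


print(cards_needed(422, 1609))  # (6, 9654)

lines = [
    make_card(1, 1, {5: "1", 6: "27", 64: "4"}),
    make_card(1, 2, {}),
    make_card(2, 1, {5: "2", 6: "41", 64: "1"}),
    make_card(2, 2, {}),
]
decks = group_cards(lines)
print(read_field(decks[1], 1, 6, 7), read_field(decks[1], 1, 64, 64))  # 27 4
```

With card-image data, a variable is therefore located by a pair (deck number, columns) rather than by columns alone, which is the only practical difference from the one-long-record layout.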
THE WAY THINGS WERE
Before electronic data entry terminals became available, command files also were prepared on IBM cards. The analyst wrote out his command file, and each line was then punched onto a separate IBM card (by a keypunch operator or, in the case of underfunded graduate students, by the analyst). The resulting "deck" of IBM cards was then transported to the university central computing center and either submitted to a clerk or fed directly into a card reader. Eventually the command file was executed ("the job ran"), often after a delay of several hours, and the printed output plus the box of cards were returned to the analyst. If there were errors, the entire process was repeated. This technology limited the number of computer runs to two or three per day, which made the completion of any particular analysis a very time-consuming proposition by current standards, but did at least have the salutary feature of allowing more time to think while waiting for the job to run.
TRANSFORMING DATA
As noted several times in previous chapters, data are not always initially represented in a form that is suitable to our research [...] data transformations, and each [...] procedures for accomplishing [...] Facility at transforming variables to a form that expresses theoretical concepts is an important skill of the quantitative data analyst.
Rxoding Recodingis the term usedfor changing the values of a variable to a different set of val_ rs. Recodinghasmanyuses,someof which we havealreadyseen. one is to collapsecategoriesof a variableinto a smauernumberof catesories.for :rample, whenI createdthe leftmostcolumnof rhble 2.3 from Table1.1.To se-e how this rccedure works, let us considerthis examplein detail. I startedwith a reliqiositv scale of the following categones: --omposed l. 2. L +.
Very religious Somewhatreligious Nol veryreligious Not at all religious
(For the moment,ignore the possibility of missing data.)To combine the last two cate_ gies. I simply changed,or recoded,category4 to category3, which yields a new variable: l. Very religious ?. Somewhatreligious 3. Not religious -\lthough somecomputerprogramspermit a variable to be .,written over,,_that is, to 5e replacedwith a new variable-this is very poor practice. Rather, you should createa ..s containingthe transformedvalues.The reasonfor this shouldbe obvious: 'ariable :[dr to protect againsterror and to permit you to transform a variable more than once in te samecomputer run, you should preservethe original coding of a variable as well as .u' recodedor otherwisetransformedversionsofthe variable.Typically,stadsticarpackrse !-omputerprograms operate line-byline; each line of code operareson the data in rbareverform they appearafterthe previousoperation.Hence,it ii all too easyto trans_ tr-m a variable and then inadvertently transform it again, unlessa new variable is created n rbe courseof the transfomation ,\ second use of recoding is to redefine a variable by creating a new set of ::riesoriesrepresentinga new dimension.you have seenan example of this also, in :sr discussionof property spacein ChapterThree. Recall our classification of U.S. :.]n-gressmen lnto
1. Standard Republicans
2. Gypsy moths
3. Boll weevils
4. Standard Democrats
To create a classification according to party membership, we can recode 2 to 1 and 3 to 4, yielding a new variable with values of 1 (= Republican) and 4 (= Democrat). To create a classification of congressmen as liberal or conservative, we can recode 2 to 4 and 3 to 1, again yielding a new variable with values of 1 (= conservative) and 4 (= liberal).
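The two recodes just described can be sketched in a few lines. This is a minimal illustration written in Python rather than in a statistical package; the variable names (`faction`, `party`, `ideology`) and the six example cases are hypothetical. The point it tries to show is that each recode writes to a new variable and leaves the original untouched, so the second recode starts from the original codes rather than from values that have already been transformed.

```python
# Original four-category classification (hypothetical cases).
# 1 = Standard Republicans, 2 = Gypsy moths,
# 3 = Boll weevils,         4 = Standard Democrats
faction = [1, 2, 3, 4, 2, 3]


def recode(values, mapping):
    """Return a NEW list; codes not named in the mapping are kept unchanged."""
    return [mapping.get(v, v) for v in values]


# Party membership: recode 2 to 1 and 3 to 4 (1 = Republican, 4 = Democrat).
party = recode(faction, {2: 1, 3: 4})

# Ideology: recode 2 to 4 and 3 to 1 (1 = conservative, 4 = liberal).
# Because `faction` was not overwritten, this starts from the original codes.
ideology = recode(faction, {2: 4, 3: 1})

print(party)     # [1, 1, 4, 4, 1, 4]
print(ideology)  # [1, 4, 1, 4, 4, 1]
print(faction)   # [1, 2, 3, 4, 2, 3]  (unchanged)
```

Had `party` been written over `faction`, the ideology recode would have found no 2s or 3s left to change, which is the inadvertent double transformation the text warns about.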
Note, however, that when variables are recoded to dichotomies, the convention is to code one category 1 and the other category 0 and to name the variable for the category coded 1. Thus, the party variable would be named "Republican," in which 1 and 2 in the original variable are coded 1, and 3 and 4 in the original variable are coded 0. As we will see in later chapters, the 0-1 recoding convention facilitates the use of dichotomous variables in both OLS and logistic regression.

A third use of the recode operation is to assign scale scores to the categories of a variable. For example, we might have a variable measuring educational attainment, which is initially coded as follows:

1. No schooling
2. 1-4 years of elementary school
3. 5-7 years of elementary school
4. 8 years of elementary school
5. 1-3 years of secondary school
6. 4 years of secondary school
7. 1-3 years of college
8. 4 years of college
9. 5 or more years of college
10. No information
For many purposes, it is useful to treat years of school completed as a ratio variable. By doing so, it is possible to compute the mean number of years of school completed by various subgroups of the population, to use years of school completed in regression equations, and so on. To do this, we might recode the original variable by assigning the midpoint or another estimate of the years of school completed by individuals in each category:

Original Code    Recode
      1            0
      2            2.5
      3            6
      4            8
      5           10
      6           12
      7           14
      8           16
      9           18
     10           -1
In making recodes of this sort, it is important to justify your choice of values rather than assigning them arbitrarily. For example, the decision to assign "18 years" to the category "5 or more years of college" rather than 17 years or 19 years must be justified, not simply asserted.

Note the special treatment of category 10, "no information." In carrying out the analysis, we want either to exclude this category or otherwise give it special treatment. We thus give it a special code, which we can either define as missing data (see the discussion later in the chapter) or otherwise modify. It is convenient to use negative numbers to flag categories that we are going to treat as missing data because doing so minimizes the likelihood of inadvertently treating them as substantively meaningful. (A useful alternative, available in Stata, is to use the code "." to specify missing values when we have no need to distinguish between different types of missing value, and to use the codes ".a", ".b", . . . ".z" when we want to distinguish different types of nonresponse; again, see the discussion of missing data later in the chapter.) For example, suppose we recoded the "no information" category as 99. If we subsequently decided to analyze those with at least some college education, we might instruct the computer to select all cases with years of school completed greater than or equal to 14, forgetting that category 99 means no information. This, of course, would result in the inclusion in the highest education category of those for whom education is not known along with the college educated.
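The midpoint recode and the hazard of a positive missing-data code can be made concrete with a small sketch. It is written in Python purely as an illustration of the logic, not as package syntax; the recode values come from the table above, while the eight cases in `educ_cat` are hypothetical.

```python
# Midpoint recode for the ten education categories listed in the text.
# -1 flags "no information" so that it cannot be mistaken for a real value.
YEARS = {1: 0, 2: 2.5, 3: 6, 4: 8, 5: 10, 6: 12, 7: 14, 8: 16, 9: 18, 10: -1}

educ_cat = [6, 8, 10, 3, 9, 7, 10, 5]       # hypothetical original codes
educ_yrs = [YEARS[c] for c in educ_cat]     # new variable; original is kept

# Selecting "at least some college" is safe with a negative missing code.
college = [y for y in educ_yrs if y >= 14]

# The same selection when "no information" has been coded 99 instead.
YEARS_99 = dict(YEARS)
YEARS_99[10] = 99
educ_yrs_99 = [YEARS_99[c] for c in educ_cat]
college_99 = [y for y in educ_yrs_99 if y >= 14]

print(len(college))     # 3
print(len(college_99))  # 5: the two "no information" cases slipped in
```

With the negative flag, a forgotten missing-data declaration errs toward excluding cases from a "greater than" selection; with 99 it silently adds them to the top category.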
TREATING MISSING VALUES AS IF THEY WERE NOT

Jasso reported that the frequency of sexual intercourse per month increases with a wife's age, contrary to all expectations! Alas, as discovered by Kahn and Udry (1986, 736), she failed to notice four outliers, cases erroneously coded 88 rather than 99, the specified missing data code. When these four cases are omitted, the positive effect of wife's age disappears. Kahn and Udry also omitted four other outliers, prompting a lively response from Jasso (1986) about what should be regarded as an outlier. We will return to a discussion of outliers when we consider regression diagnostics in Chapter Ten.
A final use of the recode operation is to convert data from old surveys that use zone punches and blanks into a form that permits numerical manipulation. This typically can be done by reading the data in an alphanumeric format and converting them to a floating-point decimal format.
Arithmetic Transformations

Sometimes we want to transform variables by performing arithmetic operations on them. Such transformations will be particularly important when we get to regression analysis because it is sometimes possible to represent nonlinear relationships by linear equations involving nonlinear variables. For example, it is well known that the relationship between income and age is curvilinear: income increases up to a certain age and then declines. This relationship can be represented by constructing a regression equation of the following form:

$$Y = a + b(A) + c(A^2) \tag{4.1}$$

that is, income (= Y) is taken to be a linear function of age and the square of age. To estimate this equation, we need to create a new variable, the square of age. So we simply specify

    AGESQ = AGE*AGE        (4.2)

and then regress Y on AGE and AGESQ. Most statistical packages provide extensive transformation capabilities, using either an arithmetic operator or any of a number of specialized functions, such as the square root function.
Contingency Transformations

A final way to transform variables is to use "if" specifications in your commands. "If" specifications are an alternative to recode commands and are more flexible in some ways because they make it possible to specify much more complex contingent relationships involving several variables. For example, if we wanted to distinguish those who were upwardly mobile, that is, those who had jobs that were of higher prestige than those of their fathers, this could be accomplished by specifying the following: if PRESTIGE is greater than PRESTIGE_OF_FATHER, construct a new variable, MOBILITY, and give it the value 1; otherwise, give it the value 0. Although the syntax of the computer command required to do this will vary depending on the program used, the logic is, as usual, straightforward: a new variable is created, scored 1 for those individuals who are upwardly mobile and scored 0 otherwise (where "upward mobility" is defined as having an occupation of higher prestige than the occupation of one's father).

Another kind of contingency transformation is to create a variable consisting of a count of the number of responses to a specified set of other variables that meet specific criteria. For example, we might create a scale of acceptance of abortion by counting the number of "pro choice" ("accepting") responses to a set of questions about the circumstances under which abortion should be permitted.

Contingency statements are used not only to transform variables but also to select subsamples for analysis. For example, if we were interested in studying completed fertility, we might want to restrict the analysis to women who have completed their childbearing. This is accomplished in some packages by selecting a subsample and doing all the subsequent operations only on the subsample. Other packages, such as Stata, permit the selection to be made as part of each command, although selecting a subsample for all subsequent operations is possible as well.
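The three kinds of contingency operation described in this section (an "if" specification, a count across items, and subsample selection) can be sketched as follows. The sketch is in Python for illustration; the field names mirror the text's PRESTIGE, PRESTIGE_OF_FATHER, and MOBILITY, while the abortion items, the ages, and the age cutoff of 45 are hypothetical stand-ins.

```python
# Hypothetical respondents. "abortion" holds answers to three items,
# coded 1 for an "accepting" response and 0 otherwise.
cases = [
    {"prestige": 60, "prestige_of_father": 45, "abortion": [1, 1, 0], "age": 52},
    {"prestige": 38, "prestige_of_father": 50, "abortion": [0, 0, 0], "age": 31},
    {"prestige": 47, "prestige_of_father": 47, "abortion": [1, 1, 1], "age": 48},
]

for case in cases:
    # "If" specification: 1 if upwardly mobile, 0 otherwise.
    case["mobility"] = 1 if case["prestige"] > case["prestige_of_father"] else 0
    # Count transformation: number of "accepting" (coded 1) responses.
    case["accept"] = sum(1 for r in case["abortion"] if r == 1)

# Subsample selection: a condition applied to one computation only.
# The cutoff of 45 is an arbitrary stand-in for "completed fertility."
older = [c for c in cases if c["age"] >= 45]

print([c["mobility"] for c in cases])  # [1, 0, 0]
print([c["accept"] for c in cases])    # [2, 0, 3]
print(len(older))                      # 2
```

Note that the third case, whose prestige equals the father's, is scored 0: "greater than" excludes ties, a detail worth deciding deliberately rather than by accident.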
Missing Data

Often, substantive information on certain variables is missing from a data set. The sources of missing data are nearly endless. In data sets derived from interviews with a sample of people, the information may never have been elicited from the respondent, either in error or as a matter of design (some questions are "not applicable," for example, spouse's education for the never-married; and sometimes questions are asked of random subsets of respondents to increase the length of the questionnaire without increasing the respondent burden; the GSS often does this). The respondent may have refused to answer certain questions, may have responded to some questions by claiming not to know the answer or not to have an opinion, or may have given logically inconsistent answers (for example, responding "never married" to a question on marital status but providing an answer to a question on "age at first marriage"). Interviewers may have failed to record responses or may have recorded them incorrectly. Errors may have been introduced in the process of preparing data for analysis, as when narrative responses are incorrectly assigned to code categories by coding clerks or when correctly assigned codes are incorrectly keyed in the course of data entry. Similar problems plague other sorts of data sets. Bureaucratic records are often incomplete and frequently contain inconsistent information.
PEOPLE GENERALLY LIKE TO RESPOND TO (WELL-[...]

[...] written. By and large, people are flattered that they are asked their opinions and asked to talk about themselves. There is a famous story from the lore of survey analysis about the Indianapolis Fertility Survey, one of the earliest surveys that asked explicitly about sexual behavior. One of the analysts went out with considerable trepidation to conduct a pretest of the questionnaire, not knowing how women would respond to "intimate" questions. As it happened, the interview went off without a hitch until the very end, when the interviewer got to the routine demographic questions and asked the respondent her age, at which the lady drew herself up indignantly and said, "Now you're getting personal!" The exception to the general willingness to respond is with respect to information that people fear might put them in jeopardy, such as income, which they suspect might find its way to the tax authorities.
In high-quality surveys, great pains are taken to minimize the extent of error. In the course of readying data sets for analysis, they are cleaned, that is, edited to identify and if possible correct illegal codes (codes not corresponding to valid response categories) and logically impossible combinations of codes. For example, when a respondent who claims never to have been married gives his age at first marriage, sometimes it is possible to decide which is the correct and which the incorrect response by inspecting other responses given by the same individual. When this is not possible, the respondent might be contacted and asked to resolve the inconsistency.
Of course, errors can be introduced in the editing process as well as corrected. For [...] housewives, taken on as temporary employees by the Census Bureau, "corrected" returns in which a woman's marital status [...] response to "married." [...] is not supposed to occur, but it does.

In the course of the editing process, explicit codes are assigned to the various categories of nonsubstantive responses [...] to distinguish [...] responses. [...] in the coding of nonsubstantive responses [...] to preserve [...].
Analyzing Surveys with Missing Data

Presuming that the data are coded in such a way as to preserve all relevant distinctions, the analyst still must decide which responses should be regarded as substantively interesting and which should be treated as missing, and what to do about [...]. [...] example here. Another case that arises frequently in tabular analysis [...]. Thus, if you are studying the adult population [...], your tables should refer to the entire adult population, not just to [...]. The solution in this case is simply to create a residual category, "other," and to include it in the table but not bother to discuss it. It is included [...] but is not discussed [...] heterogeneity of residual categories generally [...].

A more difficult problem arises when [...] one or more variables [...] their education
or their income. Again, one alternative is to include a "no answer" category in each row and column of the table. If there are many missing cases, this is wise. If there are only a few missing cases, the increased size of the table probably is not warranted. In this case it is sufficient simply to report how many cases are missing, in a footnote to the table.

When our variables are continuous, we must either exclude missing values from the analysis or in some way impute the values. Chapter Eight is entirely devoted to the treatment of missing data.

Most statistical package programs allow the analyst to specify which codes are to be treated as missing values (and indeed require it in the sense that any codes not specified as missing values are included in the computation whether you intend it or not). Typically, statistical package programs are not completely consistent across procedures (commands) in the way they handle missing data, so it is very important to understand exactly what each procedure does and to design your analysis accordingly. In designing any analysis, you must know how the procedure will treat each logically possible code in your data, including in particular those codes you designate as missing values; otherwise you inevitably will get into trouble.

In the example on education discussed earlier, "no information" was assigned a code of -1. When computing a mean, we ordinarily would declare -1 to be a missing value for education. In SPSS syntax, missing values are explicitly declared: "missing values educ (-1)"; in Stata, as noted previously, missing values may be excluded automatically by assigning one of several "missing value" codes, or may be explicitly excluded from a procedure by limiting the sample with an if qualifier: . . . if educ~=-1 (that is, if EDUC is not equal to -1). These statements tell the computer to omit all individuals for whom education is coded -1 (or assigned the missing value code) from the computation of the mean. Neglecting to so inform the computer results in an incorrect mean because any individuals who are coded in the data as having -1 years of schooling are included in the computation. Errors of this sort are very common, which is why it is imperative to check and recheck the logic of your commands. A useful check is to work through the logic of your computer commands line by line for specified values of your original variables to see how the computer transforms them at each step in the process. You will make some surprising discoveries.

One of the things that typically happens to novice data analysts is that they do some computation and discover that their computer printout shows no cases or a very small number of cases. Usually this turns out to be the result of a logical error in the specification of data transformations. For example, consider an income variable originally coded in a set of categories representing ranges of income, for example, 1 = under $3,000 per year, 2 = $3,000 to $4,999, and so on, but where 97, 98, and 99 are used to specify various kinds of nonresponses. If the analyst recodes the income categories to the midpoints of their ranges, for example, recodes 1 to 1,500, 2 to 4,000, and so on, but then forgets this and specifies as missing values all codes greater than or equal to 97, all the cases will be excluded because all cases for which income was reported have been recoded to values in the thousands of dollars, that is, greater than 97. If you do not think this will happen to you, wait until you try it! It happens to all of us. The trick is to catch logically similar, but more subtle, errors before you construct entire theoretical edifices upon them.
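Both errors described above are easy to reproduce. The sketch below is a Python illustration of the logic rather than package syntax; the cases are hypothetical, and the midpoint of 6,000 for income category 3 is an assumed value, since the text gives midpoints only for categories 1 and 2.

```python
# Years of schooling with -1 flagging "no information" (hypothetical cases).
educ = [12, 16, -1, 8, 14, -1]

wrong_mean = sum(educ) / len(educ)       # -1 treated as a real value
valid = [e for e in educ if e != -1]     # analogue of "if educ~=-1"
right_mean = sum(valid) / len(valid)

# Income categories recoded to midpoints; 97-99 were nonresponse codes.
income_cat = [1, 2, 97, 2, 3, 99]
MIDPOINT = {1: 1500, 2: 4000, 3: 6000}   # 6000 is a hypothetical midpoint
income = [MIDPOINT.get(c, c) for c in income_cat]

# Forgetting the recode and declaring "97 or greater" missing drops everyone.
kept = [v for v in income if v < 97]

print(wrong_mean)  # 8.0
print(right_mean)  # 12.5
print(len(kept))   # 0
```

Working through the values by hand, as the text recommends, shows the problem at once: after the recode, every reported income exceeds 97, so the missing-value rule no longer separates nonresponse from real answers.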
WHAT THIS CHAPTER HAS SHOWN

This chapter has dealt with the logic of data manipulation and the treatment of missing data. The chapter thus serves as a foundation that should make it easy to learn any statistical package program: Stata, which we will use for the remainder of the book, or any other package, such as SPSS or SAS. In the next chapter we turn to the general linear model, with a gentle introduction via a discussion of bivariate correlation and regression.
APPENDIX 4.A DOING ANALYSIS USING STATA

TIPS ON DOING ANALYSIS USING STATA

This appendix offers some simple tips that will greatly enhance the ease and efficiency with which you use Stata for analysis. In addition, the appendix lists some particularly useful commands that are easily overlooked.
Do Everything with -do- Files

You should from the outset develop the habit of carrying out all your analysis by creating command files, known in Stata parlance as -do- files. Doing so has two major advantages: it makes it easy to repeat your analysis until you get it right, and it makes it easy to document your work. Keeping a log of your analysis is not an adequate substitute (although, of course, you must create a -log- file to save your output) because a log faithfully records all of your errors and false steps, making it difficult to follow the direct path to successful execution and tedious to repeat your analysis. Here is an example of part of one of my -do- files, shown to suggest a standard format you might want to adopt. I use this set of commands at the beginning of each -do- file I create. The commands in the file are shown in Courier New type, and my comments in square brackets are shown in Times New Roman type (the standard font for the text).

    capture log close

[This command closes any -log- file (see the next command) it finds open. The -capture- prefix to a command is very useful because it instructs Stata not to stop if an "error" is encountered, which would be the case if it could not find a -log- file to close.]

    log using class.log, replace

[This command tells Stata to keep a file of commands and the results of the commands, called a "-log- file," and to replace any previous versions of the -log- file. The -replace- part of the command is crucial because otherwise when you execute the -do- file, fix an error, and try to execute it again, Stata will complain that a previous version of the -log- file exists.]

    #delimit;

[This command tells Stata to end all subsequent commands whenever a ";" is encountered. I find this the most convenient way to handle long lines. The default in Stata is to regard a carriage return (the computer command that ends a line) as the end of the command, which means that, unless the carriage return is "commented out" (see below), commands are restricted to one line. Of course, the line may be very long, extending well beyond the width of a page, but this makes your file difficult to read.]

    version 10.0;

[This command tells Stata for which version of Stata the file was created. Stata always permits old -do- files to run on more recent versions of the software, if the version is specified.]

    set more 1;

[This command tells Stata not to stop at the end of every page of the output. When executing a -do- file, you want the program to run completely without stopping. The way to inspect the output is to read the log file.]

    clear;

[This command clears any data left over from a previous attempt to execute the program or any other Stata command. Stata is good about warning you against inadvertently destroying data you have created. But the fact that Stata warns you means that you need a way to override the warning, which is what this command does.]

    program drop _all;

[This command drops any existing programs that you might have created in a previous execution of the -do- file. Failing to do this causes Stata to stop if you have included any programs in your -do- file.]

    set mem 100m;

[This command tells Stata to reserve 100 MB of memory. Space permitting, Stata reads all data into memory and does its analysis on these data, which is why it is so fast. If you specify too little memory, Stata will complain that it has no room to add variables or cases.]

    *CLASS.DO (DJT initiated 5/19/99, last revised 2/4/08);

[I always name my -do- file, and because I work often with others, indicate the author, the initial date of creation, and the last date of revision. This is very useful in identifying different versions of the same -do- file, which might exist because my coauthors and I have both revised the same file, or because I have made a revision on my office computer and have forgotten to update the version on my home computer, and so on. Note that comments are distinguished from commands by an asterisk in the first column.]

    *This -do- file creates computations for a paper on literacy in China.;

[I always include a description of the analysis the -do- file is carrying out. Because I often work on a paper over an extended period, the description is extremely helpful in jogging my memory and helping me to locate the correct file.]

    use d:\china\survey\data\china07.dta;

[This command loads the data into memory. The remainder of the -do- file then consists of commands that perform operations on the data and produce various computations.]

    log close;

[This command closes the -log- file so that it can be opened by my editor or word processor.]
The basic procedure for creating a -do- file is to (1) open a new file in an editor [...], remembering that -do- files must include the suffix ".do"; (2) insert a front end of the sort just outlined (I always copy them from a previous file to my current file to minimize typing); (3) create a set of commands to carry out the first task; (4) [...] (there surely will be an error most of the time); (5) toggle back to the editor; correct the error; [...] of how you got the results shown in the -log- file and (2) can be rerun at will, which you may want to do if, as often happens, you discover an error in the [...]. [...] you submit a paper for publication and get an invitation to "revise and resubmit," the [...] -do- file. The availability of a -do- file will also greatly [...].
Build In Extensive Checks of Your Work

It is extremely easy to make errors, of both a logical and a clerical kind, when doing computer-based data analysis. The only way to protect yourself from happily making up stories about results produced in error is to compulsively check your work. You can do this in two ways. First, check the logic of each set of data transformation commands by working through, as a pencil and paper operation, how each value of a variable being transformed is affected by each command. Second, tabulate or summarize each new variable and actually look at the output. You will be surprised how many errors you discover by taking these two simple steps!
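The second check, tabulating each new variable and looking at the result, is most informative when the new variable is cross-tabulated against the original. The following Python sketch illustrates the idea with the religiosity recode from earlier in the chapter; the nine cases and the names `religiosity` and `relig3` are hypothetical.

```python
from collections import Counter

# Hypothetical original codes and the recoded variable built from them.
religiosity = [1, 2, 3, 4, 4, 2, 3, 1, 4]
relig3 = [3 if v == 4 else v for v in religiosity]   # collapse 4 into 3

# Cross-tabulate old against new: every original code should land in
# exactly the new category you intended, and nowhere else.
crosstab = Counter(zip(religiosity, relig3))
for (old, new), n in sorted(crosstab.items()):
    print(old, "->", new, ":", n)

# The same check expressed as a comparison of observed and intended pairs.
observed_pairs = set(crosstab)
expected_pairs = {(1, 1), (2, 2), (3, 3), (4, 3)}
print(observed_pairs == expected_pairs)
```

Any pair that appears in the cross-tabulation but not in the intended set points directly at the command that went wrong.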
Document Your -do- File Exhaustively

You should make extensive notes in your -do- file about the purpose of each set of commands and the underlying logic, especially in the case of data transformations. Including comments summarizing the outcome of each set of commands makes it clear why I carry out the next step of the analysis. The -do- file then becomes a document summarizing my entire analysis. I cannot emphasize strongly enough the importance of adequate documentation. It is typical in our field to work on several problems at once and to return to a problem after months or even years. In addition, the editorial review process often takes a very long time. If you have not documented your work, you may have a great deal of trouble remembering why you have done what you have done. This is inefficient and can be highly embarrassing, as when a journal editor asks you to do some additional analysis, and you have no idea why you made particular computations, much less what the chain of reasoning was, and cannot reproduce the previous results. This happens much more often than most of us want to admit.
Include "Side" Computations in Your -do- File

This is a corollary to the point about exhaustive documentation made in the previous section. We often do "side" computations in the course of writing papers to make points or as illustrations to the text, for example, computing the ratio of two coefficients in a table we have made or computing a correlation coefficient between two variables listed in some other publication. The way to make your -do- file a comprehensive document of all your computations is to use Stata, rather than a hand calculator or spreadsheet, to do the computations, or at minimum, to include both the data and the results as comments in your -do- file. More than once, I have produced a paper with a well-documented -do- file but have failed to include side computations in the -do- file, and then have discovered months later that I had no idea how the side coefficients reported in the paper were derived.

Rerun Your -do- File as a Final Check

At the point you have completed a paper and are about to submit it for a course, for posting in an online paper series, or for publication, you should make a point of executing your -do- file in a single step and then checking every coefficient in the paper against the corresponding coefficients in your resulting log file. You likely will be startled to discover how many discrepancies there are. Because -do- files often are developed over an extended period of time and often are executed in pieces, it is extremely easy for inconsistencies to creep in. If you have a -do- file that will run from beginning to end, without interruption, and will produce every result you report in the paper, you will have met the gold standard for documentation. You also will be a happy camper months or years later when you need to make a minor change that affects many results. You will discover that the change usually can be made in a matter of minutes, although updating your tables by hand is usually a much more tedious business.
Make Active Use of the Stata Manual

The only way to become facile at any statistical program, including Stata, is to make a point of continuously improving your skills. Each time you are unsure how to carry out a task, look for a solution in the manual. You will find the improvement in facility very rewarding. After you become reasonably facile at Stata, you should then take advantage of Stata's -net- commands, which link you to the Stata user community and the most up-to-date applications. Of course, to use the -net- commands you must be connected to the Internet.
SOME PARTICULARLY USEFUL STATA 10.0 COMMANDS

Here is a list of key data manipulation and utility commands. It is to your advantage to study the descriptions of these commands in the Stata manual, in addition to reading through the User's Guide. The time you spend gaining familiarity with these commands, and with the logic of Stata procedures, will be more than repaid by improvements in the efficiency of your work. I have included few of the commands for carrying out estimation procedures because they will be introduced in later chapters.
adjust        Gets adjusted values for means and proportions.
append        Combines two data sets with identical variables but different observations. (See also -merge-.)
by            Repeats a Stata command on subsets of data.
capture       Captures return code (that is, allows Stata to continue whether the condition is true or not).
cd            Changes directory.
codebook      Produces a codebook describing the data.
collapse      Produces aggregate statistics, such as means, for subsets of data. Particularly useful for making graphs. Similar to the "aggregate" command in SPSS.
compress      Compresses variables to make a data set smaller but without altering the logical character of any variable. Useful when your data will not fit into memory.
count         Gives the number of observations satisfying specified conditions. -count- without a suffix gives the number of observations in the data set.
#delimit      Changes the delimit character.
describe      Describes the contents of a data set.
dir           Displays file names in the current directory.
display       Substitutes for a hand calculator.
do            Executes commands from a -do- file.
drop          Drops variables or observations from the file.
edit          Allows you to edit your file cell by cell. Useful in inspecting the content of your file or correcting errors in the file.
egen          Extensions to the -generate- command.
encode        Permits recoding of string variables to numeric variables.
foreach       Repeats Stata command for a list of items (variables, values, or other entities). Similar to "do repeat" in SPSS but more powerful.
forvalues     Repeats Stata command for a set of consecutive values.
generate      Creates or changes the contents of a variable.
help          Obtains online help. (See also -search- and -net search-.)
infile        Reads data into Stata. (See also -infix- and -insheet-.)
input         Inputs data from the keyboard.
inspect       Useful summary of numerical variables, particularly when you are not familiar with a data set. Reports the number of negative, zero, and positive values; the number of integers and nonintegers; the number of unique values; and the number of missing values; and produces a small histogram.
keep          Keeps variables or observations in the file, that is, drops everything not specified. Useful when you want to create a new file containing a small subset of variables.
label         Creates or modifies value and variable labels.
list          Lists values of variables.
log           Creates a log of your session.
mark          Marks observations for inclusion (a way of maintaining consistency regarding missing values throughout an analysis).
merge         Merges two data sets with corresponding observations but different variables. (See also -append-.)
net           Installs and manages user-written additions from the net.
net search    Searches the Internet for installable commands.
numlabel      Combines numerical values with labels so that both are displayed. This often is very convenient.
notes         Puts notes into data set.
order         Reorders variables in a data set.
predict       Obtains predictions after any estimation command.
preserve      Preserves data. Use this before a command that will alter the data set, such as -collapse-. Then use -restore- to restore the preserved data set.
quietly       Performs Stata command without showing intermediate steps.
recode        Recodes variables.
rename        Renames variables.
replace       Replaces values of a variable with new values if specified conditions are met.
reshape       Converts data from wide to long format and vice versa.
restore       [...]
#review       Reviews previous commands. [...]
save          Saves a data set.
search        Searches Stata documentation for [...]
sort          [...]
summarize     [...] for continuous variables [...]
table         [...]
tabulate      Produces one- and two-way [...]
update        [...]
use           [...]
version       Specifies which version of Stata [...]
xi            [...] expansion [...]
CHAPTER
INTRODUCTION TO CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)

WHAT THIS CHAPTER IS ABOUT
So far we have been dealing with procedures for analyzing categorical data. We now turn to a powerful body of techniques that can be applied when the dependent variable is an interval or ratio variable: ordinary least-squares regression and correlation analysis. In this chapter we deal with the two-variable case, where we have a dependent variable and a single independent variable, to illustrate the logic. In the following two chapters we deal with multiple regression, which is used when we want to explore the effects of several independent variables on a dependent variable, the typical case in social science research.
INTRODUCTION
Suppose we have a set of data arrayed like this:

Father's Years of Schooling    Respondent's Years of Schooling
 2                              4
12                             10
 4                              8
13                             13
 6                              9
 6                              4
 8                             13
 4                              6
 8                              6
10                             11
What can we say about the relationship between father's education and respondent's education? Not much from visual inspection of the numbers alone. However, if we plot the two variables in two-dimensional space, the relationship is revealed. When you inspect the plot (Figure 5.1), it is immediately evident that the children of highly educated fathers tend to be highly educated themselves. In this situation, we say that the father's and the respondent's education are positively correlated. Although we can see that the father's and respondent's education are positively related, we want to quantify the relationship in two ways. First, we want a way to describe the character of the relationship between the father's and respondent's years of schooling. How large a difference in the dependent variable, years of schooling, would we expect on average for a person whose father's schooling (the independent variable) differs by one unit (one year)? What level of schooling would we expect, or predict, on average for a person, given that we know how much schooling his or her father has? Second, we want a way to characterize the strength of the co-relation, or correlation, between the respondent's and father's years of schooling. Can we get a precise prediction of the respondent's level of education from the father's level of education or only an approximate one?

FIGURE 5.1. Scatter plot of years of schooling by father's years of schooling.
QUANTIFYING THE SIZE OF A RELATIONSHIP: REGRESSION ANALYSIS
The conventional and simplest way to describe the character of the relationship between two variables is to put a straight line through the points that "best" summarizes the average relationship between the two variables. Recall from school algebra that straight lines are represented by an equation of the form

$$Y = a + b(X) \tag{5.1}$$

where $a$ is the intercept (the value of $Y$ when the value of $X$ is zero) and $b$ is the slope (the change in $Y$ for each unit change in $X$). Figure 5.2 shows the coefficients $a$ and $b$ for our example involving years of education ($Y$) and father's years of education ($X$). The figure is a graphic representation of the equation:

$$\hat{E} = 3.38 + .687(E_F) \tag{5.2}$$
FIGURE 5.2. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling. (Horizontal axis: father's years of schooling; vertical axis: years of schooling.)
Here $\hat{E}$ indicates the expected number of years of school completed by people with each level of father's years of schooling ($E_F$) on the assumption that the relationship is linear, that is, that each increase in the father's education produces a given increase in the respondent's education regardless of the initial level; 3.38 is the intercept, that is, the expected years of schooling for people whose fathers had no schooling at all; and .687 is the slope, that is, the expected increase in years of schooling for each one-year increase in the father's schooling. From this equation, we would predict that those whose fathers have 10 years of schooling would have 10.25 years of schooling because 3.38 + 10 × .687 = 10.25. Similarly, we would predict that the children of university graduates would have 2.75 more years of schooling, on average, than the children of high school graduates because .687 × (16 − 12) = 2.75. Calculating the value of the dependent variable in a regression equation for given values of the independent variable is known as evaluating the equation.
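The coefficients and the two predictions just discussed can be checked from the ten observations. The following is a minimal Python sketch, not part of the original text (which relies on Stata); the function names `fit_line` and `predict` are illustrative assumptions, and the slope is computed from deviations about the means, one standard way of obtaining the least-squares line.

```python
# Ten (father's schooling, respondent's schooling) pairs from the example.
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations, and sum of squared x deviations.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(intercept, slope, x_value):
    """Evaluate the regression equation at a single value of x."""
    return intercept + slope * x_value


a, b = fit_line(father, respondent)
a_printed, b_printed = round(a, 2), round(b, 3)  # coefficients as printed in the text

print(a_printed, b_printed)                      # intercept and slope
print(predict(a_printed, b_printed, 10))         # father with 10 years of schooling
print(b_printed * (16 - 12))                     # university vs. high school graduates
```

Evaluating the equation with the rounded coefficients reproduces the figures in the text; using the unrounded coefficients shifts the first prediction slightly in the second decimal place.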
FIGURE 5.3. Least-Squares Regression Line of the Relation Between Years of Schooling and Father's Years of Schooling, Showing How the "Error of Prediction" or "Residual" Is Defined. (Horizontal axis: father's years of schooling.)
WHY USE THE "LEAST SQUARES" CRITERION TO DETERMINE THE BEST-FITTING LINE?
Note that "least squares" is not the only plausible criterion of "best fit." An intuitively more appealing criterion is to minimize the sum of the absolute deviations of observed values from expected values. Absolute values are mathematically intractable, however, whereas sums of squares have convenient algebraic properties, which is probably why the inventors of regression analysis hit upon the criterion of minimizing the sum of squared errors. The consequence is that observations with unusually large deviations from the typical pattern of association can strongly affect regression estimates; because the deviations are squared, such observations have the greatest weight. The presence of atypical observations, known in this context as high leverage points, can therefore produce quite misleading results. We will discuss this point further in the upcoming paragraphs and in Chapter Ten.

It can be shown, via algebra or calculus, that the following formulas for the slope and intercept satisfy the least squares criterion:

$$b = \frac{\operatorname{cov}(X,Y)}{\operatorname{var}(X)} = \frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum (X-\bar{X})^2} = \frac{N\sum XY - (\sum X)(\sum Y)}{N\sum X^2 - (\sum X)^2} \tag{5.3}$$

$$a = \bar{Y} - b(\bar{X}) = \frac{\sum Y}{N} - b\,\frac{\sum X}{N} \tag{5.4}$$

ASSESSING THE STRENGTH OF A RELATIONSHIP: CORRELATION ANALYSIS
Now that we have seen how regression lines are derived and how they are interpreted, we need to assess how good the prediction is. Our criterion for goodness of prediction or goodness of fit is the fraction or proportion of the variance in the dependent variable that can be attributed to variance in the independent variable. We define

$$r^2 = 1 - \frac{\sum (Y-\hat{Y})^2 / N}{\sum (Y-\bar{Y})^2 / N} \tag{5.5}$$
That is, $r^2$, which is just the square of the Pearson correlation coefficient, is equal to 1 minus the ratio of the variance around the regression line to the variance around the mean of the dependent variable. (The Pearson correlation coefficient is, of course, the correlation coefficient you have encountered in introductory statistics courses. It has the advantage of ranging from −1 to +1 depending on whether two variables move together or in opposite directions.) When the independent variable does not help predict the dependent variable, that is, when the mean of the dependent variable is the least-squares prediction of each value, the ratio is 1 and $r^2 = 0$; this case is illustrated in (a) of Figure 5.4. If the value of the dependent variable can be predicted exactly from the value of the independent variable, the ratio is 0 and $r^2 = 1$; this case is illustrated in (b) of Figure 5.4. A correlation of zero does not necessarily mean that there is no relationship between two variables, only that there is no linear relationship. In (c) of Figure 5.4, for example, it is obvious that the two variables are related, yet the least-squares line is flat and the correlation is zero. Likewise, a given correlation may be associated with very different configurations of data, so it is important to check whether a linear regression provides an adequate summary of a relationship, that is, whether a straight line represents the character of the relationship. When it fails to do so, additional variables need to be included in the model. You will see how to do this in the next chapter.

FIGURE 5.4. Least-Squares Regression Lines for Three Configurations of Data: (a) Perfect Independence, (b) Perfect Correlation, and (c) Perfect Curvilinear Correlation (a Parabola Symmetrical to the X-Axis). (Fitted lines shown in the panels include Y = 5.00 + 0(X) and Y = 5.76 + 0(X) = 5.76.)

Returning to our example about intergenerational continuity in educational attainment, we note that $r^2 = .536$, which tells us that the variance around the regression line is about half the size of the variance around the mean of the dependent variable, and therefore that about half of the variance in educational attainment is explained by the corresponding variability in father's education. As social science results go, this is a very high correlation.
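The value of $r^2$ for the schooling example can be reproduced directly from the definition as 1 minus the ratio of the variance around the regression line to the variance around the mean. The following Python sketch is an illustration only; the function name `r_squared` is an assumption introduced here.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def r_squared(x, y):
    """1 minus (variance around the regression line / variance around the mean of y)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
             / sum((xi - mean_x) ** 2 for xi in x))
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * xi for xi in x]
    var_around_line = sum((yi - pi) ** 2 for yi, pi in zip(y, predicted)) / n
    var_around_mean = sum((yi - mean_y) ** 2 for yi in y) / n
    return 1 - var_around_line / var_around_mean


r2 = r_squared(father, respondent)
print(round(r2, 3))  # 0.536
```

Because both variances are divided by the same N, the divisor cancels; the ratio of the two sums of squares gives the same answer.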
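A USEFUL COMPUTATIONAL FORMULA FOR r
Following is a useful computational formula for the correlation coefficient, $r$, which comes in handy when you have to do hand calculations:

$$r = \frac{\operatorname{cov}(X,Y)}{\sqrt{\operatorname{var}(X)\operatorname{var}(Y)}} = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$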
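As a check on hand calculations, the correlation for the schooling example can be computed from raw sums alone. This Python sketch uses the standard raw-sums form of the Pearson correlation; the function name `pearson_r_from_sums` is illustrative, not taken from the text.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def pearson_r_from_sums(x, y):
    """Pearson r computed only from N and the sums of X, Y, XY, X^2, and Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r_from_sums(father, respondent)
print(round(r, 3), round(r ** 2, 3))  # r and its square
```

For these data the intermediate sums are small integers (for example, the sum of XY is 693), which is what makes this form convenient by hand.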
THE RELATIONSHIP BETWEEN CORRELATION AND REGRESSION COEFFICIENTS
Suppose we were to standardize our variables before computing the regression of $Y$ on $X$, by, for each variable, subtracting the mean from the value of each observation and dividing by the standard deviation. Doing this produces new variables with mean = 0 and standard deviation = 1. Then we would have a regression equation of the form

$$\hat{y} = \beta(x) \tag{5.6}$$

(The convention adopted here, which is widely but not universally used, is to represent standardized variables by lowercase Latin symbols and the coefficients of standardized variables by Greek rather than Latin symbols.) There is no intercept because the regression line must necessarily pass through the mean of each variable, which for standardized variables is the (0,0) point. We interpret $\beta$ as indicating the number of standard deviations by which we would expect two observations to differ on $Y$ that differ by one standard deviation on $X$. (This follows directly from the fact that for standardized variables, the standard deviation is one. Thus, one standard deviation on $X$ is one unit on $x$; and the same for $Y$ and $y$.) It can be shown, through a simple manipulation of the algebraic computational formulas for the coefficients, that in the two-variable case, $r = \beta$.

It is also true that $r$ is invariant under linear transformations. (A linear transformation is one in which a variable is multiplied [or divided] by a constant and/or a constant is added [or subtracted]. Consider two variables, $Y$ and $Y'$, with $Y' = a + b(Y)$. In this case, $r_{XY} = r_{XY'}$.) So the correlation between standardized variables and unstandardized variables is necessarily perfect. A convenient pair of formulas for moving between $b$ and $\beta$ (which also holds for multiple regression coefficients) is

$$\beta = b\left(\frac{s_X}{s_Y}\right); \qquad b = \beta\left(\frac{s_Y}{s_X}\right) \tag{5.7}$$

$$a = \bar{Y} - b(\bar{X}) \tag{5.8}$$

where $s_X$ and $s_Y$ are the standard deviations of $X$ and $Y$, respectively.
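The relationships among the metric slope, the standardized slope, and the correlation can be verified numerically with the schooling data. This is a hedged Python sketch: the helper names are assumptions, and population standard deviations (dividing by N) are used, although the ratio of the two standard deviations is the same with N − 1.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def mean(values):
    return sum(values) / len(values)


def sd(values):
    """Population standard deviation (divides by N)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    mx, my = mean(x), mean(y)
    slope = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
             / sum((xi - mx) ** 2 for xi in x))
    return my - slope * mx, slope


a, b = fit_line(father, respondent)
s_x, s_y = sd(father), sd(respondent)

# Standardized slope obtained from the metric slope.
beta_from_b = b * s_x / s_y

# Standardized slope obtained by standardizing first and then regressing.
z_father = [(v - mean(father)) / s_x for v in father]
z_respondent = [(v - mean(respondent)) / s_y for v in respondent]
intercept_std, beta_direct = fit_line(z_father, z_respondent)

# Pearson correlation, for comparison with the standardized slope.
r = (sum((f - mean(father)) * (e - mean(respondent))
         for f, e in zip(father, respondent))
     / (len(father) * s_x * s_y))

print(round(beta_from_b, 4), round(beta_direct, 4), round(r, 4))
print(round(intercept_std, 10))  # essentially zero: no intercept after standardizing
```

All three printed coefficients agree, and converting back with the reciprocal ratio of standard deviations recovers the metric slope.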
FACTORS AFFECTING THE SIZE OF CORRELATION (AND REGRESSION) COEFFICIENTS
Now that we see how to interpret correlation and regression coefficients, we need to consider potential troubles: factors that affect the size of coefficients in ways that may lead to incorrect interpretation and false inferences by the unwary.

Outliers and Leverage Points
As noted, correlation and regression statistics are very sensitive to observations that deviate substantially from the typical pattern. This is a consequence of the least squares criterion: because "errors" (differences between observed and predicted values on the dependent variable) are squared, the larger the error, the more it will contribute to the sum of squared errors relative to its absolute size. Thus, correlation coefficients can be substantially affected by a few deviant observations, with regression slopes pulled strongly toward them, producing misleading results.

To see this, consider the following example, illustrated in Figure 5.5. Suppose that in our example about intergenerational educational transmission, the fourth case had values (13,0) (shown as a solid circle surrounded by an open circle) instead of (13,13) (shown as an open circle). That is, suppose that in the fourth case the child of a man with thirteen years of schooling had no education instead of thirteen years of schooling, perhaps because the child was mentally impaired. The alteration of just one point, from (13,13) to (13,0), dramatically changes the regression line and misrepresents the typical relationship between the father's and respondent's education, making it appear that there is no relationship at all (the regression equation for the ten points with (13,0) as the fourth value is $\hat{E} = 6.74 + .0491(E_F)$; $r^2 = .002$).

FIGURE 5.5. The Effect of a Single Deviant Case (High Leverage Point).

This example illustrates the condition under which deviant cases are influential, that is, have high "leverage." This is when points are far away from the center of the multivariate distribution. Outliers close to the center of the distribution, for example, the (8,13) point in Figure 5.5, have less influence because, although they can pull the regression line up or down, they have relatively little effect on the slope. We will consider this distinction further in Chapter Ten.

The most straightforward solution is to omit the offending case. When this is done, the regression line through the remaining nine points is very close to the regression line through ten points with (13,13). However, this generally is an undesirable practice because it creates the temptation to start "cleaning up" the data by omitting whatever cases tend to fall far from the regression surface.
Two better strategies, which will be elaborated in Chapters Seven and Ten, are (1) to think carefully about whether the outliers might have been generated by a different process from the remainder of the data and, when you suspect that possibility, to explicitly model the process; or (2) to use a robust regression procedure that downweights large outliers. Fortunately, the damage done by outliers diminishes as sample sizes increase. However, even with large samples extreme outliers can be distorting, for example, incomes in the millions of dollars. One simple way to deal with extreme values on univariate distributions is to truncate the distribution, for example, in the United States in 2006 by specifying $150,000 for incomes of $150,000 or above (this is what the GSS does; in 2006, just over 2 percent of the GSS sample had incomes this high); but this creates its own problems, as we will see next. A better way, which you will see in Chapter Fourteen, is to use interval regression (an elaboration of tobit regression) to correctly specify the category values.
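The effect of the single deviant case can be reproduced from the ten observations. The following Python sketch, with the illustrative helper `fit_summary`, fits the line three ways: with the fourth case as (13,13), with it changed to (13,0), and with it omitted.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]
respondent = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]


def fit_summary(x, y):
    """Return (intercept, slope, r_squared) for the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    return my - slope * mx, slope, sxy ** 2 / (sxx * syy)


# Original data: the fourth case is (13, 13).
a_full, b_full, r2_full = fit_summary(father, respondent)

# Deviant case: the fourth case becomes (13, 0).
respondent_deviant = list(respondent)
respondent_deviant[3] = 0
a_dev, b_dev, r2_dev = fit_summary(father, respondent_deviant)

# Fourth case omitted, leaving nine points.
father_omit = father[:3] + father[4:]
respondent_omit = respondent[:3] + respondent[4:]
a_omit, b_omit, r2_omit = fit_summary(father_omit, respondent_omit)

for label, a, b, r2 in [("with (13,13)", a_full, b_full, r2_full),
                        ("with (13,0)", a_dev, b_dev, r2_dev),
                        ("fourth case omitted", a_omit, b_omit, r2_omit)]:
    print(f"{label}: intercept={a:.2f}, slope={b:.4f}, r2={r2:.3f}")
```

The nine-point line stays close to the original line, whereas the (13,0) line is nearly flat, because the altered point lies far from the center of the father's schooling distribution.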
Truncation
Analysts are sometimes tempted to divide their study population into subgroups on the basis of values on the independent or dependent variable or on variables substantially correlated with the independent or dependent variable. For example, an analyst who suspects that income depends more heavily on education among those with nonmanual occupations than among those with manual occupations might attempt to test this hypothesis by correlating income with education separately for nonmanual and manual workers. This is a bad idea because income is correlated with occupational status; thus, dividing the population on the basis of occupational status will truncate the distribution of the dependent variable, which, all else equal, will reduce the size of the correlation. Moreover, if one subgroup, say manual workers, has a smaller variance with respect to income than does the other subgroup, say nonmanual workers (and this is likely to be true in most societies), the size of the correlation will be more substantially reduced for manual than for nonmanual workers, thus leading the analyst to mistakenly believe that the hypothesis is confirmed.

To see this, consider a highly stylized example, shown as Figure 5.6. To keep the example simple, imagine that all manual workers in the sample have less than seven years of schooling and that all nonmanual workers have more than seven years of schooling. Note that in the example, there is exactly the same income return to an additional year of education for nonmanual and manual workers. Note further that each point is an equal distance from the regression line. Now, suppose the correlation between income and education were computed separately for manual and nonmanual workers. The correlation for both groups would be smaller than the correlation computed over the total sample, and the correlation would be smaller for manual than for nonmanual workers.
This follows directly from Equation 5.5 because, from the way the example was constructed, the variance around the regression line is identical in all three cases, but the variance around the mean of the dependent variable is smaller for nonmanual workers than for the total sample and smaller for manual workers than for nonmanual workers. Although, for the sake of clarity, the example is highly stylized, the principle holds generally: when distributions are truncated the correlation tends to be reduced. This, by the way, is the main reason GRE scores are weak predictors of grades in graduate school courses: graduate departments do not admit people with low GREs, thereby truncating the distribution of GRE scores. But this does not imply that GRE scores should be ignored in the admissions process, as statistically illiterate professors argue from time to time.

FIGURE 5.6. Truncating Distributions Reduces Correlations. (Horizontal axis: years of schooling.)
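The truncation effect can be illustrated with a small constructed data set. The numbers below are hypothetical and are not the values plotted in the figure: income is set to 1,000 per year of schooling, with points alternating 1,500 above and below that common line, and workers with fewer than seven years of schooling are treated as manual. The function name `pearson_r` is likewise an assumption.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical stylized sample: no one has exactly seven years of schooling.
schooling = [1, 2, 3, 4, 5, 6, 8, 9, 10, 11, 12, 13, 14, 15, 16]
# Same income return to schooling for everyone; equal-sized deviations around it.
income = [1000 * s + (1500 if s % 2 == 0 else -1500) for s in schooling]

manual = [(s, y) for s, y in zip(schooling, income) if s < 7]
nonmanual = [(s, y) for s, y in zip(schooling, income) if s > 7]

r_total = pearson_r(schooling, income)
r_manual = pearson_r([s for s, _ in manual], [y for _, y in manual])
r_nonmanual = pearson_r([s for s, _ in nonmanual], [y for _, y in nonmanual])

print(round(r_total, 3), round(r_nonmanual, 3), round(r_manual, 3))
```

The scatter around the line is the same size in every group; only the spread of income within each group differs, and the correlations fall in the order total sample, nonmanual workers, manual workers.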
A "REAL DATA" EXAMPLE OF THE EFFECT OF TRUNCATING THE DISTRIBUTION
Analyzing the U.S. sample for the Political Action: An Eight Nation Study, 1973-1976 (Barnes and Kaase 1979) some years ago, I was puzzled to discover an extremely low correlation between education and income (less than .1, whereas in U.S. surveys the typical correlation between these two variables is on the order of .3). Further investigation revealed that the low ends of both the education and income distributions were severely truncated, presumably as a result of inadequacies in either the sampling or the field work procedures. When the data were weighted to reproduce the bivariate distribution of education and income observed in the U.S. census for 1980 (the year closest to the survey), the estimated correlation approximated that typically found in U.S. surveys.
Regression Toward the Mean
The consequences of truncation actually are worse than just suggested, because of a phenomenon known as "regression toward the mean." When two measurements are made at different points in time, for example, pretest and post-test measurements in a randomized experiment or scores on the GRE, it is typical to observe that those cases with high values on the first observation tend, on average, to have lower values on the second observation, and that those cases with low values on the first observation tend to have higher values on the second observation. That is, both the high and the low values move toward (or "regress toward") the mean. This is true even when there is no change in the true value between the two measurements. The reason for this is that observed measurements consist of two components: a true score and a component representing error in measurement of the underlying true score.

For example, consider the GRE. The observed score for each individual can be thought of as consisting of a component measuring the candidate's "true" (or underlying or constant) ability to do the kind of work measured by the test and a random component comprised of variations in the exact questions asked in that administration of the test, the candidate's level of energy and mental acuity, level of confidence (Steele 1997), and so on. It then follows that those who have high scores in any given administration of the test will disproportionately include those who have high positive random components, and those who have low scores will disproportionately include those who have low random components. But because the second component is random, those who have high random components on the first test will tend, on average, to have lower random components on the second test, and those who have low random components on the first test will tend, on average, to have higher random components on the second test. The result is that the correlation between the two tests will be less than perfect and also that the regression coefficient relating the second to the first test will be less than 1.0. This is true even if the means and standard deviations of the two tests are identical.

An important implication of this result is that a researcher who targets for special intervention a low-scoring group (those who did poorly on a practice GRE, those with low grade point averages, and so on) will be bound to conclude, incorrectly, that the intervention was successful. Of course, if that same researcher chose the high-scoring group for the same intervention, he or she would be forced to conclude that the intervention was completely unsuccessful; indeed, that it was counterproductive. All of this is a simple consequence of analyzing a nonrandom subset of the original sample.

Exactly the same phenomenon, measurement error, has the effect of lowering the correlation between separate phenomena, for example, education and income, the heights of fathers and sons, and so on. This kind of observation is what led Francis Galton, one of the founders of correlation and regression analysis, to conclude in the late nineteenth century that a natural phenomenon of intergenerational transmission was a "reversion" (or "regression") toward "mediocrity"; hence the term "regression analysis" to describe the linear prediction procedure discussed here. But what Galton failed to notice is that there is also, and for exactly the same reason, a tendency for values near the mean to move away from the mean. The result is that the variance of the predicted (but not the observed) values and the slope of the regression line are reduced in proportion to the complement of the correlation between the variables. (For a book-length treatment of this topic, see Campbell and Kenney [1999].)
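Regression toward the mean can be seen in a small simulation. The sketch below rests on assumptions that are not in the text: true scores and measurement errors are drawn from standard normal distributions with equal variances, the true score does not change between the two tests, and the seed and sample size are arbitrary. Under those assumptions the slope relating the second test to the first should be close to .5, well below 1.0.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 20000

# Observed score = unchanging true score + independent random error on each test.
true_score = [rng.gauss(0, 1) for _ in range(n)]
test1 = [t + rng.gauss(0, 1) for t in true_score]
test2 = [t + rng.gauss(0, 1) for t in true_score]


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """Least-squares slope of y on x."""
    mx, my = mean(x), mean(y)
    return (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))


b = slope(test1, test2)

# Select the top and bottom tenth of scorers on the first test.
ranked = sorted(test1)
low_cut, high_cut = ranked[n // 10], ranked[(9 * n) // 10]
top = [(x1, x2) for x1, x2 in zip(test1, test2) if x1 >= high_cut]
bottom = [(x1, x2) for x1, x2 in zip(test1, test2) if x1 <= low_cut]

top_first, top_second = mean([p[0] for p in top]), mean([p[1] for p in top])
bottom_first, bottom_second = mean([p[0] for p in bottom]), mean([p[1] for p in bottom])

print(round(b, 2))
print(round(top_first, 2), round(top_second, 2))        # high scorers fall back
print(round(bottom_first, 2), round(bottom_second, 2))  # low scorers move up
```

No intervention occurs between the two tests in this simulation, yet the low-scoring group "improves" and the high-scoring group "declines," which is exactly the trap described above.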
Aggregation
Students who have spent some time studying the behavior of populations of individuals usually conclude that we live in a stochastic world in which nothing is very strongly related to anything else. For example, in the United States, typically about 10 percent of the variance in income can be attributed to variance in education ($r \approx .3$, so $r^2 \approx .09$). Students are then puzzled when they discover that seemingly comparable correlations computed over aggregates, for example, the correlation between mean education and mean income for the detailed occupational categories used by the U.S. Bureau of the Census, tend to be far larger (in the present example, $r \approx .7$, so $r^2 \approx .49$). Why is this so? The explanation is simple. When correlations are computed over averages or other summary measures, a great deal of individual variability tends to "average out." In the extreme case, where there are only two aggregate categories, the correlation between the means for the two categories will necessarily be 1.0, as you can see in Figure 5.7 (where the large circle represents the mean height and weight for women, and the large triangle represents the mean height and weight for men); but the principle holds for more than two categories as well.

FIGURE 5.7. The Effect of Aggregation on Correlations. (Horizontal axis: height in inches.)
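The extreme two-category case is easy to verify. The heights and weights below are hypothetical values invented for illustration (they are not the data plotted in the figure), and `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(x, y):
    """Pearson correlation computed from deviations about the means."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical individual-level data: (height in inches, weight in pounds).
women = [(60, 130), (62, 110), (63, 140), (64, 120), (65, 150), (66, 125)]
men = [(67, 170), (68, 150), (70, 185), (71, 160), (72, 190), (74, 165)]

# Correlation over the twelve individuals.
individuals = women + men
r_individual = pearson_r([h for h, _ in individuals], [w for _, w in individuals])


def group_mean(group):
    """Return (mean height, mean weight) for one group."""
    return (sum(h for h, _ in group) / len(group),
            sum(w for _, w in group) / len(group))


# Correlation over the two group means only.
means = [group_mean(women), group_mean(men)]
r_means = pearson_r([h for h, _ in means], [w for _, w in means])

print(round(r_individual, 3), round(r_means, 3))
```

With only two aggregate points, any line through them fits perfectly, so the correlation over the means is 1.0 regardless of how much individuals vary within each group.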
CORRELATION RATIOS
So far we have been discussing cases where we have two interval or ratio variables. Sometimes, however, we want to assess the strength of the association between a categorical variable and an interval or ratio variable. For example, we might be interested in whether religious groups differ in their acceptance of abortion. Or we might be interested in whether ethnic groups differ in their average income. The obvious way to answer these questions is to compute the mean score on an abortion attitudes index for each religious group or the mean income for each ethnic group. But if we discover that the means differ substantially enough to be of interest, we still are left with the question of how strong the relationship is. To determine this we can compute an analog to the (squared) correlation coefficient, known as the (squared) correlation ratio, $\eta^2$ (eta squared). $\eta^2$ is defined as

$$\eta^2 = 1 - \frac{\text{Variance around the subgroup means}}{\text{Variance around the grand mean}} = 1 - \frac{\text{Within-group sum of squares}}{\text{Total sum of squares}} = 1 - \frac{\sum_j \sum_i (Y_{ij} - \bar{Y}_j)^2}{\sum_j \sum_i (Y_{ij} - \bar{Y})^2} \tag{5.9}$$

where $Y$ is the dependent variable, there are $j$ groups, and $i$ cases within each group. Thus, $\bar{Y}_j$ is the mean of $Y$ for group $j$, and $\bar{Y}$ is the grand mean of $Y$. From Equation 5.9, it is evident that if all the groups have the same mean on the dependent variable, knowing which group a case falls into explains nothing; the variance around the subgroup means equals the variance around the grand mean, and $\eta^2 = 0$. At the other extreme, if the groups differ in their means, and if all cases within each group have the same value on the dependent variable (that is, there is no within-group variance), then the ratio of the within-group sum of squares to the total sum of squares is 0, and $\eta^2 = 1$. From this we see that $\eta^2$, like $r^2$, is a proportional reduction in variance measure.

Let us explore the religion and abortion acceptance example with some actual data. In 2006 (and for most years since 1972) the GSS asked seven questions about the acceptability of abortion under various circumstances:

. . . should [it] be possible for a woman to obtain a legal abortion . . .
• if there is a strong chance of serious defect in the baby?
• if she is married and does not want any more children?
• if the woman's own health is seriously endangered by the pregnancy?
• if the family has a very low income and cannot afford any more children?
• if she became pregnant as a result of rape?
• if she is not married and does not want to marry the man?
• if the woman wants it for any reason?
From these items I constructed a scale by counting the positive responses, excluding all cases with any missing data. The scale thus ranges from 0 to 7. Table 5.1 shows the mean number of positive responses by religion. All those who specified religions other than Protestant, Catholic, or Jewish or said they had no religion were included in the "Other and None" category. From the table, it is evident that Jews and other non-Christians are much more accepting of abortion than are Christians (Protestants and Catholics). But how important is religion in accounting for acceptance of abortion? To see this, we compute $\eta^2 = .070$. (The Stata computations to create Table 5.1 and to obtain $\eta^2$ are shown in the downloadable -do- and -log- files for the chapter.)
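The GSS data are not reproduced here, so the following Python sketch computes the squared correlation ratio on small hypothetical groups of scores, following the definition as 1 minus the ratio of the within-group sum of squares to the total sum of squares. The function name `eta_squared` and all of the data values are assumptions for illustration only.

```python
def eta_squared(groups):
    """1 minus (within-group sum of squares / total sum of squares).

    `groups` is a list of lists, one list of scores per group.
    """
    all_scores = [score for group in groups for score in group]
    grand_mean = sum(all_scores) / len(all_scores)
    total_ss = sum((score - grand_mean) ** 2 for score in all_scores)
    within_ss = 0.0
    for group in groups:
        group_mean = sum(group) / len(group)
        within_ss += sum((score - group_mean) ** 2 for score in group)
    return 1 - within_ss / total_ss


# Hypothetical scale scores for three groups.
eta2 = eta_squared([[2, 3, 4], [3, 4, 5], [6, 7, 8]])

# The two extremes discussed in the text.
eta2_same_means = eta_squared([[1, 2, 3], [1, 2, 3]])  # identical group means
eta2_no_within = eta_squared([[2, 2], [5, 5]])         # no within-group variance

print(eta2, eta2_same_means, eta2_no_within)  # 0.8125 0.0 1.0
```

In the first call the within-group sum of squares is 6 and the total sum of squares is 32, so the ratio is 1 − 6/32.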
TABLE 5.1. Mean Number of Positive Responses to an Acceptance of Abortion Scale (Range: 0-7), by Religion, U.S. Adults, 2006.

Religion    Mean Number of Positive Responses    Standard Deviation
[Table body not recoverable: the surviving row labels are "Catholics" and "Other"; the surviving cell values, 2.5 and 2.2, cannot be assigned to rows or columns.]
Clearly, religious affiliation does not explain much of the variance in abortion attitudes. How can this be, given the substantial size of the mean differences? The answer is simple. Jews and "Others" differ substantially from Protestants and, especially, Catholics in their acceptance of abortion. But these groups are quite small, especially Jews. Hence, no matter how deviant they are from the overall average, they are unlikely to have much impact; when more than half of the population is included in one group, as is the case here with Protestants, a large fraction of the variance in abortion acceptance is bound to be within-group variance rather than between-group variance.

A second use of the correlation ratio is to test assumptions of linearity. We will take this up in Chapter Seven.
A USEFUL COMPUTATIONAL FORMULA FOR η²
A formula to compute $\eta^2$ by hand from frequency or percentage distributions is:

$$\eta^2 = 1 - \frac{\sum_j \left[ \sum_i f_{ij} X_i^2 - \dfrac{\left(\sum_i f_{ij} X_i\right)^2}{\sum_i f_{ij}} \right]}{\sum_j \sum_i f_{ij} X_i^2 - \dfrac{\left(\sum_j \sum_i f_{ij} X_i\right)^2}{\sum_j \sum_i f_{ij}}}$$

where there are $j$ groups and $i$ categories of the dependent variable, which in this case is designated by $X$. So $X_i$ is the score for the $i$th category (of the $j$th group, although the category scores are the same for all groups), and $f_{ij}$ is the number of cases in the $i$th category among members of the $j$th group. Notice the difference from Equation 5.9, where the $i$ refers to individuals rather than to categories of the dependent variable.
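A frequency-based computation of the squared correlation ratio can be checked against the individual-level definition. The sketch below is an assumption-laden illustration: the frequency table is hypothetical, the function names are invented here, and the frequency-based sums of squares are written in a standard raw-score form that is algebraically equivalent to summing squared deviations over individuals.

```python
def eta_squared_from_frequencies(scores, frequency_table):
    """Squared correlation ratio from a table of frequencies.

    `scores` holds the score of each category of the dependent variable;
    `frequency_table` holds one list of category frequencies per group.
    """
    within_ss = 0.0
    total_n = total_fx = total_fx2 = 0.0
    for freqs in frequency_table:
        n_j = sum(freqs)
        fx = sum(f * x for f, x in zip(freqs, scores))
        fx2 = sum(f * x ** 2 for f, x in zip(freqs, scores))
        within_ss += fx2 - fx ** 2 / n_j   # sum of squares around this group's mean
        total_n += n_j
        total_fx += fx
        total_fx2 += fx2
    total_ss = total_fx2 - total_fx ** 2 / total_n  # sum of squares around the grand mean
    return 1 - within_ss / total_ss


def eta_squared_from_individuals(groups):
    """Same quantity from individual scores, via squared deviations."""
    all_scores = [s for g in groups for s in g]
    grand_mean = sum(all_scores) / len(all_scores)
    total_ss = sum((s - grand_mean) ** 2 for s in all_scores)
    within_ss = sum(sum((s - sum(g) / len(g)) ** 2 for s in g) for g in groups)
    return 1 - within_ss / total_ss


# Hypothetical three-category scale (scores 0, 1, 2) for two groups of ten cases each.
scores = [0, 1, 2]
frequency_table = [[5, 3, 2], [1, 3, 6]]

eta2_freq = eta_squared_from_frequencies(scores, frequency_table)

# Expand the frequencies into individual scores to confirm the two routes agree.
expanded = [[x for f, x in zip(freqs, scores) for _ in range(f)]
            for freqs in frequency_table]
eta2_individual = eta_squared_from_individuals(expanded)

print(round(eta2_freq, 4), round(eta2_individual, 4))
```

Percentage distributions can be used in place of frequencies only if each group's percentages are first weighted by the group's share of the total sample; the sketch above assumes raw frequencies.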
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have considered simple (two-variable) ordinary least-squares (OLS) regression and correlation, and how the size of correlation and regression coefficients is affected by outliers, truncation, regression toward the mean, and aggregation. We have also considered the correlation ratio as a measure of the strength of association between a categorical variable and an interval or ratio variable.
CHAPTER

INTRODUCTION TO MULTIPLE CORRELATION AND REGRESSION (ORDINARY LEAST SQUARES)

WHAT THIS CHAPTER IS ABOUT
In this chapter we consider the central technique for dealing with the most typical social science problem: understanding how some outcome is affected by several determining variables that are correlated with each other. We begin with a conceptual overview of multiple correlation and regression, and then continue with a worked example to illustrate how to interpret regression coefficients. We then turn to consideration of the special properties of categorical independent variables, which can be included in multiple regression equations as a set of dichotomous ("dummy") variables, one for each category of the original variable (except that to enable estimation of the equation, one category must be represented only implicitly). In the course of our discussion of dummy variables, we develop a strategy for comparing groups that enables us to determine whether whatever social process we are investigating operates in the same way for two or more subsegments of the population (males and females, ethnic categories, and so on). We conclude with an alternative way of choosing a preferred model, the Bayesian Information Criterion (BIC).
INTRODUCTION
For most social science purposes, the two-variable regressions we encountered in the previous chapter are not very interesting, except as a baseline against which to compare models involving several independent variables. Such models are the focus of this chapter. Here we generalize the two-variable procedure to many variables. That is, we predict some (interval or ratio) dependent variable from a set of independent variables. The logic is exactly the same as in the case of two-variable regression, except that we are estimating an equation in many dimensions.

Let us first consider the case where we have two independent variables. Extending the ten-observation example from the previous chapter, suppose we think that education depends not only on the father's education but also on the number of siblings. The argument is that the more siblings one has, the less attention one receives from one's parents (all else equal), and hence, in consequence, the less well one does in school and, therefore, the less education one obtains, on average (for examples of studies of sibship-size effects in the research literature, see Downey [1995], [...], Lu [2005], and Lu and Treiman [2008]). Suppose, further, that we have information on all three variables for our sample of ten cases:

Father's Years    Respondent's Years    Number of
of Schooling      of Schooling          Siblings
 2                 4                    3
12                10                    3
 4                 8                    4
13                13                    0
 6                 9                    2
 6                 4                    5
 8                13                    3
 4                 6                    4
 8                 6                    3
10                11                    4
Note that the first two columns are simply repeated from the example in the previous chapter (see page 88). To test our hypothesis that the number of siblings negatively affects educational attainment, we would estimate an equation of the form:

$$\hat{E} = a + b(E_F) + c(S) \tag{6.1}$$
(Note that I use generic symbols, for example, X and Y, to indicate variables in equations of a general form, but mnemonic symbols, for example, $E_F$ and S, to indicate variables in equations that refer to specific concrete examples. I find it much easier to keep track of what is in my equation when I use mnemonic symbols for variables.)
Equations such as Equation 6.1 are known as multiple regression equations. In multiple regression equations the coefficients associated with each variable measure the expected difference in the dependent variable associated with a one-unit difference in the given independent variable, holding constant each of the other independent variables. So in the present case, the coefficient associated with the number of siblings tells us the expected difference in educational attainment for each additional sibling among those whose fathers have exactly the same years of education. Correspondingly, the coefficient associated with the father's education tells us the expected difference in years of education for those whose fathers differ by one year in their education but who have exactly the same number of siblings. In the three-variable case (that is, when we have only two independent variables), but not when we have more variables, we can construct a geometric representation that illustrates the sense in which we are holding constant one variable and estimating the net effect of the other.

In multiple regression, as in two-variable regression, we use the least-squares criterion to find the "best" equation. That is, we find the equation that minimizes the sum of squared errors of prediction. However, whereas in bivariate regression we think in terms of the deviation between each observed point and a line, in multiple regression the analog is the deviation between each observed point and a geometric surface in a k-dimensional space, where k = 1 + the number of independent variables. Thus, where there are two independent variables, the least-squares criterion minimizes the sum of squared deviations of each observation from a plane, as shown in Figure 6.1.
FIGURE 6.1. Three-Dimensional Representation of the Relationship Between Number of Siblings, Father's Years of Schooling, and Respondent's Years of Schooling (Hypothetical Data; N = 10).
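Metric Regression Coefficients
The coefficients associated with each independent variable are known as metric regression coefficients, or net regression coefficients (or sometimes raw regression coefficients, to distinguish them from standardized regression coefficients, about which you will learn more later). In the present case, the estimated regression equation is

$$\hat{E} = 6.26 + .564(E_F) - .640(S) \tag{6.2}$$

This equation tells us that a person who had no siblings and whose father had no education would be expected to have 6.26 years of schooling; that among those with the same number of siblings, people whose fathers' schooling differs by one year would be expected to differ in their own schooling by a bit more than half a year (precisely, .564); and that among those whose fathers have the same amount of schooling, each additional sibling reduces the expected years of schooling by nearly two-thirds of a year (precisely, .640).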
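As a check on Equation 6.2, the following minimal sketch fits the least-squares equation to the ten hypothetical cases listed earlier. It uses only the Python standard library; the function names (`cross`, `ols_two_predictors`) are illustrative choices, not anything from the text, and the closed-form solution shown applies only to the two-predictor case.

```python
# Ten hypothetical cases from the text.
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # father's years of schooling (E_F)
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # respondent's years of schooling (E)
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # number of siblings (S)


def mean(values):
    return sum(values) / len(values)


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def ols_two_predictors(y, x1, x2):
    """Least-squares intercept and slopes for y = a + b*x1 + c*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    s1y, s2y = cross(x1, y), cross(x2, y)
    det = s11 * s22 - s12 ** 2
    b = (s22 * s1y - s12 * s2y) / det
    c = (s11 * s2y - s12 * s1y) / det
    a = mean(y) - b * mean(x1) - c * mean(x2)
    return a, b, c


a, b, c = ols_two_predictors(resp, father, sibs)
print(f"intercept = {a:.2f}, father's schooling = {b:.3f}, siblings = {c:.3f}")
```

Rounded to the precision used in the text, the three values agree with the coefficients in Equation 6.2.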
Note that the coefficient associated with the father's education in Equation 6.2 is smaller than the corresponding coefficient in the two-variable equation estimated in the previous chapter. This is because the father's education and the number of siblings are correlated (in fact, r = -.503 in this example). Thus, in the two-variable equation, part of the observed effect of the father's education on the respondent's education reflects the fact that poorly educated fathers tend to have larger families. Equation 6.2 takes account of (controls for) this association and gives the effect of the father's education net of the number of siblings. The implication of this result is very important: if a variable that affects the dependent variable and is correlated with the other independent variables is left out of the equation, the coefficients of the variables that are in the equation will be biased. That is, they will overstate or understate the causal relation between the given independent variable and the dependent variable (the bias vanishes only in the limiting case in which the omitted variable is uncorrelated with the variables in the equation). This is known as specification error or omitted variable bias.
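The size of the bias can be worked out exactly for the ten hypothetical cases. The sketch below relies on a standard least-squares identity (not stated in the text): the two-variable slope for the father's education equals its net slope plus the net slope for siblings multiplied by the slope from regressing siblings on the father's education. Variable and function names are illustrative assumptions.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # E_F
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # E
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # S


def mean(values):
    return sum(values) / len(values)


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


# Two-variable (bivariate) slope of E on E_F, ignoring siblings.
b_bivariate = cross(father, resp) / cross(father, father)

# Net slopes from the three-variable equation (closed form for two predictors).
s11, s22, s12 = cross(father, father), cross(sibs, sibs), cross(father, sibs)
det = s11 * s22 - s12 ** 2
b_net = (s22 * cross(father, resp) - s12 * cross(sibs, resp)) / det
c_net = (s11 * cross(sibs, resp) - s12 * cross(father, resp)) / det

# Slope from regressing the omitted variable (S) on the included one (E_F).
d = s12 / s11

# Correlation between the two independent variables.
r_fs = s12 / (s11 * s22) ** 0.5

print(f"bivariate slope       = {b_bivariate:.3f}")
print(f"net slope             = {b_net:.3f}")
print(f"bias (c_net * d)      = {c_net * d:.3f}")
print(f"r(E_F, S)             = {r_fs:.3f}")
```

Because both `c_net` and `d` are negative here, their product is positive, which is why omitting siblings inflates the apparent effect of the father's education.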
Some analysts present a succession of increasingly complete multiple regression equations and discuss changes in the size of specific coefficients resulting from the inclusion of additional variables. This is a sensible strategy under one specific condition: when the analyst wants to consider how the effect of one variable is modified by the inclusion of another variable (or variables). That is, by a process analogous to the search for spurious or intervening relationships in tabular analysis (discussed in Chapters Two and Three), the analyst might want to investigate whether a particular relationship is wholly or partly explained by another factor. For example, it has been observed that Southerners are less tolerant of social deviants than are persons living outside the South. However, the analyst may want to assess the possibility that this relationship is entirely (or largely) spurious, arising from the fact that Southerners tend to be less well educated and less urban than others, and that education and urban residence increase tolerance. In such a case it would be appropriate to present two models, one regressing tolerance on Southern residence and a second regressing tolerance on Southern residence, education, and size-of-place, and then to discuss the reduction in the size of the coefficient associated with Southern residence that occurs when education and size-of-place are added to the equation.

However, absent specific hypotheses regarding spurious or mediating effects, there is no point in estimating successive equations (except for models involving sets of dummy variables, discussed in the next section, or variables that alter functional forms, discussed in the next chapter); rather, all relevant variables should be included in a single regression equation. However, even in this case the analyst should present a table of zero-order (two-variable) correlation coefficients between pairs of variables, plus means and standard deviations for all interval and continuous variables and percentage distributions for all categorical variables. These descriptive statistics help the reader to understand the properties of the variables being analyzed. In addition, as noted earlier, the zero-order correlations provide a baseline for assessing the size of net effects when other variables are controlled.
Testing the Significance of Individual Coefficients
It is conventional to compute and report the standard error of the coefficient of each independent variable, although, as you will soon see, standard errors have limited utility in the case of dummy variables or interaction terms. The convention is to interpret coefficients at least twice the size of their standard error as statistically significant. This convention arises from the fact that the sampling distribution of regression coefficients follows a t-distribution and that, with 60 d.f. (where the degrees of freedom are computed as N - k - 1, with k the number of independent variables), t = 2.00 defines the 95 percent confidence interval around the value b = 0. It is important to understand that the t-statistics indicate the significance of each coefficient net of the effect of all other coefficients in the model. Thus, when several highly correlated variables are included in the model, it is possible that no one of them is significantly different from zero, although as a group they are significant (see also the following boxed comment on multicollinearity).

Some analysts estimate regression models involving several independent variables, drop the variables with nonsignificant coefficients (this is known as trimming the regression equation), and reestimate the model, on the ground that to leave coefficients in the model that have nonsignificant effects biases the estimates of the other variables. However, other analysts argue that the best estimate of the dependent variable is obtained by including all possible predictors, even those for which the difference from zero cannot be established with high confidence. The latter strategy is preferable because it provides the best point estimate based on a set of variables that the analyst has an a priori basis for suspecting affect the outcome.
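The "twice the standard error" convention, and the way highly correlated predictors can undermine it, can be sketched in a few lines. The coefficient and standard error below are hypothetical values chosen for illustration, and the inflation of the standard error uses the variance inflation factor, 1/(1 - R_j²), described in the boxed comment on multicollinearity; the function names are assumptions of this sketch.

```python
import math


def t_ratio(coef, se):
    """Ratio of a coefficient to its standard error."""
    return coef / se


def is_significant(coef, se, critical=2.0):
    """Rule of thumb: significant if the coefficient is at least twice its standard error."""
    return abs(t_ratio(coef, se)) >= critical


def variance_inflation_factor(r2_j):
    """Inflation of a coefficient's error variance when predictor j has
    coefficient of determination r2_j with the other predictors."""
    return 1.0 / (1.0 - r2_j)


# Hypothetical coefficient and standard error when predictors are uncorrelated.
coef, se = 0.30, 0.10
print(is_significant(coef, se))            # ratio is about 3

# Same coefficient, but with R_j^2 = .75 the error variance is quadrupled,
# so the standard error is doubled.
inflated_se = se * math.sqrt(variance_inflation_factor(0.75))
print(is_significant(coef, inflated_se))   # ratio is about 1.5
```

The same coefficient passes the rule of thumb in the first case and fails it in the second, which is the sense in which no single member of a highly correlated set may appear significant.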
MULTICOLLINEARITY
When independent variables are highly correlated, a condition known as multicollinearity, regression coefficients tend to have large standard errors and to be rather unstable, in the sense that quite small changes in the distribution of the data can produce large changes in the size of the coefficients. As Fox (1991, 11; see also Fox 1997, 337-366) notes, the inflation in the error variance of the coefficient of an independent variable, j, due to multicollinearity is given by $1/(1 - R_j^2)$, where $R_j^2$ is the coefficient of determination (discussed later in this chapter) associated with the regression of variable j on the remaining independent variables; this is known as the variance inflation factor and can be computed in Stata by using the -estat vif- command after the -regress- command. (See Fox and Monette [1992] for a generalization to sets of independent variables, such as a set of dummy variables, or a variable and its square; see also the discussion in Chapter Seven of this book, in the section on "Nonlinear Transformations.")

Clearly, the independent variables must be quite highly correlated for multicollinearity to be an important problem. For example, if $R_j^2 = .75$, the error variance will be quadrupled and the standard error will thus be doubled. Because values of $R_j^2$ as large as .75 are quite uncommon, multicollinearity is not often a problem in the social sciences, mainly arising in situations in which alternative measures of the same underlying concept are included in a single model and most commonly when aggregated data, such as properties of occupations, cities, or nations, are analyzed. In such situations, a reasonable solution is to combine the measures into a multiple-item scale (see Chapter Eleven).

Some analysts attempt to minimize multicollinearity by employing what is known as stepwise regression, in which variables are selected into (or out of) a model one at a time, in the order that produces the greatest increment (or the smallest decrement) in the size of the R². Such methods are generally misguided, both because they are completely atheoretical and because the order in which variables are selected can be quite arbitrary, given the previously noted instability in regression coefficients when variables are highly correlated.

Standardized Coefficients
A question that naturally arises when there are multiple determinants of some dependent variable is which determinant has the greatest impact. We cannot directly compare the coefficients associated with each independent variable because they typically are expressed in different metrics. Is the consequence of a difference of one year of schooling completed by the father greater or smaller than the consequence of a difference of one sibling? Although the question can, of course, be answered (as we saw earlier, the cost of each additional sibling is somewhat greater than the gain from each year of the father's education), the answer does not tell us which variable has the stronger effect on the dependent variable because the variance in the number of siblings is much smaller than the variance in the father's years of schooling. If it is not obvious why the size of the variance matters, consider the effect of education and income on the value of the car a person drives. Suppose that for a sample of U.S. adults, we estimate such an equation and obtain the following:
$$\hat{V} = 15{,}000 + .5(I) - 500(E) \tag{6.3}$$
We would hardly want to conclude from this that the effect of education is 1,000 times as large as the effect of income, or to measure income in units of $100 and then to conclude that the effect of education is 10 times that of income. Actually, the equation indicates that a year of education reduces the (expected) value of a person's car by $500, net of income, whereas a $1,000 increment in income increases the (expected) value of a person's car by $500, net of education. In this precise sense, a year of education exactly offsets $1,000 in income. However, a more general way to compare regression coefficients is to transform them into a common metric.

The conventional way this is done is to express the relationship between the dependent and independent variables in terms of standardized variables, that is, variables transformed by subtracting the mean and dividing by the standard deviation. Because such variables all have standard deviation = 1, the regression coefficients associated with standardized variables indicate the number of standard deviations of difference on the dependent variable expected for a one standard deviation difference on the independent variable, net of the effects of all other independent variables. In the present example, the equation relating the standardized variables (that is, the standardized counterpart to Equation 6.2, with lowercase letters denoting standardized variables) is

$$\hat{e} = .601(e_F) - .260(s) \tag{6.4}$$
(Reminder: As noted in the previous chapter, there is no intercept because standardized variables all have mean = 0 and a regression surface must pass through the mean of each variable.) From inspection of the coefficients in Equation 6.4, we conclude that the father's education has a greater effect on educational attainment than does the number of siblings. It is a greater effect in the precise sense that a one standard deviation difference in the father's years of schooling implies an expected difference of .60 of a standard deviation in the respondent's years of schooling, whereas a one standard deviation difference in the number of siblings implies only a -.26 standard deviation expected difference in the respondent's years of schooling.

Note that in practice we do not ordinarily standardize the variables and recompute the regression equation but rather instruct the software to report standardized coefficients (usually in addition to metric coefficients). Because standardized coefficients often are not reported, particularly in the economics literature, we also can make use of the relation $\beta_{YX} = b_{YX}(s_X/s_Y)$, that is, the fact that the standardized coefficient relating independent variable X to dependent variable Y is equal to the metric coefficient multiplied by the ratio of the standard deviations of the independent and dependent variables, to convert metric to standardized coefficients (or vice versa). (Recall Equations 5.7 and 5.8.)

There is some controversy regarding standardized regression coefficients. The conventional wisdom in sociology and other social sciences is that they are useful for the purpose just described, to assess the relative effect size of each of a set of independent variables in determining some outcome, but that they are inappropriate for assessing the relative effect size of a given variable in different populations, precisely because the standardized coefficients will differ if the relative standard deviations differ in the populations being compared (recall the relation between the metric and the standardized regression coefficient in the two-variable case). For example, suppose we wanted to compare the effect of the number of siblings on education among Blacks and Whites in the United States. Suppose, further, that the metric regression coefficient relating years of schooling to the number of siblings is identical for Blacks and Whites, that the standard deviation of years of schooling also is identical, but that the standard deviation of number of siblings is larger for Blacks than for Whites. Under these circumstances the standardized coefficient relating the number of siblings to years of schooling would also be larger for Blacks than for Whites (this follows directly from the mathematical relation between the standardized and metric regression coefficients shown in Equations 5.7 and 5.8). Would we really want to conclude that the number of siblings has a stronger effect for Blacks than for Whites in determining the years of schooling they get if the "cost" (in terms of years of schooling) of each additional sibling is identical for Blacks and Whites? Probably not. However, there are some (for example, Hargens 1976) who argue that it would be meaningful to say that sibship size matters more for Blacks precisely because there is more variability in the sibship sizes of Blacks than of Whites.

Additional light may be shed on this point by comparing the interpretation of standardized and unstandardized coefficients within a single sample. In an analysis of educational attainment in a 1962 U.S. national representative sample, Beverly Duncan (1965, 60, 65) showed that the cost of coming from a nonintact family was very high: about a year of schooling, net of a large number of other factors. However, the standardized coefficient relating education to family intactness was relatively weak, .09, far from the largest standardized coefficient. How can these two results be reconciled? The fact is that there is nothing inconsistent about them. The metric coefficient indicates that for the relatively few persons from nonintact families, the cost was very substantial. But the standardized coefficient indicates that family intactness was not a very important determinant of variance in educational attainment, precisely because only a relatively small fraction of the sample was from nonintact families. Given the near absence of variability in the family intactness variable, it could hardly explain much of the variability in educational attainment.
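The conversion between metric and standardized coefficients, and the reason standardized coefficients are risky for cross-population comparisons, can be illustrated as follows. The first part applies the standard-deviation ratio to the rounded metric coefficients from Equation 6.2 and the ten hypothetical cases. The second part uses invented standard deviations for two populations (labeled as hypothetical in the code) that share the same metric coefficient.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # E_F
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # E
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # S


def sd(values):
    """Sample standard deviation (the n - 1 cancels in the ratio used below)."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def standardized(b_metric, sd_x, sd_y):
    """Standardized coefficient = metric coefficient * (sd of X / sd of Y)."""
    return b_metric * sd_x / sd_y


# Part 1: convert the metric coefficients of Equation 6.2.
beta_father = standardized(0.564, sd(father), sd(resp))
beta_sibs = standardized(-0.640, sd(sibs), sd(resp))
print(f"beta for father's schooling = {beta_father:.3f}")
print(f"beta for siblings           = {beta_sibs:.3f}")

# Part 2: hypothetical populations with the same metric coefficient and the
# same sd of schooling, but different sd of number of siblings.
b_same, sd_schooling = -0.64, 3.0             # hypothetical values
beta_pop_a = standardized(b_same, 1.5, sd_schooling)
beta_pop_b = standardized(b_same, 2.0, sd_schooling)
print(f"population A beta = {beta_pop_a:.3f}, population B beta = {beta_pop_b:.3f}")
```

In the second part the "cost" of a sibling is identical in both populations, yet the standardized coefficient is larger in absolute value where sibship size varies more.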
THE VARIANCE OF DICHOTOMOUS VARIABLES
As you will recall, the standard deviation of a dichotomous variable is $\sqrt{p(1 - p)}$, where p is the proportion "positive" (whichever category is so defined). Thus, the more skewed the distribution, that is, the further it departs from an even split, the smaller the standard deviation and hence the smaller the standardized coefficient. Because for dichotomous variables the size of standardized coefficients depends not only on the size of the metric coefficient but also on the proportion of the sample that is "positive" with respect to the attribute, it generally is preferable to interpret the metric rather than the standardized coefficients for such variables.
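A short sketch makes the dependence on the split concrete. It computes the standard deviation of a 0/1 variable for several proportions and then shows how a fixed metric coefficient translates into smaller standardized coefficients as the split becomes more uneven. The metric coefficient and the standard deviation of the dependent variable are hypothetical values chosen for illustration.

```python
import math


def sd_dichotomy(p):
    """Standard deviation of a 0/1 variable with proportion p scored 1."""
    return math.sqrt(p * (1.0 - p))


# Hypothetical metric coefficient (one year of schooling) and a hypothetical
# standard deviation of the dependent variable.
b_metric = 1.0
sd_y = 3.0

for p in (0.50, 0.18, 0.05):
    beta = b_metric * sd_dichotomy(p) / sd_y
    print(f"p = {p:.2f}: sd = {sd_dichotomy(p):.3f}, standardized coefficient = {beta:.3f}")
```

The metric effect is the same in every row; only the rarity of the "positive" category changes the standardized coefficient.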
Coefficient of Determination (R²)
How well does Equation 6.2 explain the variance in educational attainment? We determine this via an exact analogy to r², known as R², or the coefficient of determination, which tells us the proportion of variance in the dependent variable explained by the entire set of independent variables. Just as for r², R² = 1 - the ratio of the error variance (the variance around the regression surface) to the total variance in the dependent variable.
A FORMULA FOR COMPUTING R² FROM CORRELATIONS
A convenient formula for computing R² from a matrix of correlations and standardized regression coefficients is

$$R^2_{Y \cdot 1 \ldots k} = \sum_{i=1}^{k} r_{Yi}\,\beta_{Yi}$$

That is, R² can be computed as the sum of the products of the correlations between each of the independent variables and the dependent variable and the corresponding standardized regression coefficients.
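The formula can be checked against the ten hypothetical cases used earlier in the chapter. The sketch below computes R² two ways: as the sum of correlation-times-standardized-coefficient products, and directly as 1 minus the ratio of the error sum of squares to the total sum of squares. Function and variable names are illustrative, and the resulting R² is computed here rather than quoted from the text.

```python
import math

father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # E_F
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # E
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # S


def mean(values):
    return sum(values) / len(values)


def cross(x, y):
    """Sum of cross-products of deviations from the means."""
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def corr(x, y):
    return cross(x, y) / math.sqrt(cross(x, x) * cross(y, y))


# Metric coefficients (closed form for two predictors).
s11, s22, s12 = cross(father, father), cross(sibs, sibs), cross(father, sibs)
det = s11 * s22 - s12 ** 2
b = (s22 * cross(father, resp) - s12 * cross(sibs, resp)) / det
c = (s11 * cross(sibs, resp) - s12 * cross(father, resp)) / det
a = mean(resp) - b * mean(father) - c * mean(sibs)

# Standardized coefficients via the standard-deviation ratio.
beta_f = b * math.sqrt(s11 / cross(resp, resp))
beta_s = c * math.sqrt(s22 / cross(resp, resp))

# R-squared from correlations and standardized coefficients.
r2_formula = corr(father, resp) * beta_f + corr(sibs, resp) * beta_s

# R-squared computed directly from the prediction errors.
sse = sum((y - (a + b * f + c * s)) ** 2 for y, f, s in zip(resp, father, sibs))
r2_direct = 1.0 - sse / cross(resp, resp)

print(f"R-squared (formula) = {r2_formula:.3f}")
print(f"R-squared (direct)  = {r2_direct:.3f}")
```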
ADJUSTED R²
When the number of variables included in a model is large relative to the number of cases in the sample, the explained variance is necessarily large because the amount of information used in the explanation approaches the amount of information to be explained. To correct for this, most computer programs report the "Adjusted R²" as well as the ordinary R². The formula for Adjusted R² is

$$R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{N - 1}{N - k - 1}$$

where N is the number of cases and k is the number of independent variables. It is clear that as k approaches N, Adjusted R² gets small; indeed, it can become negative. The problem of overfitting the data only arises when samples are quite small; but in such cases the Adjusted R² should be taken seriously. However, the ordinary R² should be used in tests of the significance of the increment in R² (discussed later in the chapter, in the section "A Strategy for Comparisons Across Groups").
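To see how the adjustment behaves, the following sketch applies the usual form of the formula, 1 - (1 - R²)(N - 1)/(N - k - 1), to three situations. The value R² = .586 is the one computed from the ten hypothetical cases (an assumption of this sketch rather than a figure quoted from the text); the other inputs are hypothetical.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R-squared for n cases and k independent variables."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


# Small sample: ten cases, two independent variables.
print(round(adjusted_r2(0.586, 10, 2), 3))

# Large sample: the same R-squared with 4,802 cases is barely adjusted.
print(round(adjusted_r2(0.586, 4802, 2), 3))

# Overfitting: a weak model with many predictors relative to cases.
print(round(adjusted_r2(0.10, 10, 5), 3))
```

With ten cases the penalty is large; with several thousand cases it is negligible; and with a weak model and many predictors the adjusted value falls below zero.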
Standard Error of Estimate (Root MSE)
$$\text{s.e.e.} = \sqrt{\frac{\sum (Y - \hat{Y})^2}{N - k - 1}} \tag{6.5}$$

where N is the number of cases and k is the number of independent variables. About 95 percent of the observations can be expected to lie within 1.96 standard errors of estimate of the regression surface.

A WORKED EXAMPLE: THE DETERMINANTS OF LITERACY IN CHINA
The example concerns the determinants of literacy, measured by the number of Chinese characters correctly identified on a ten-item test, among employed Chinese adults age 20-69 surveyed in 1996 (N = 4,802). The independent variables are the respondent's years of schooling (E), the father's years of schooling ($E_F$), whether the respondent has a nonmanual occupation (N), whether the respondent is of urban origin (U), whether the respondent is male (M), and the level of "cultural capital" in the household in which the respondent was raised (C). Table 6.1 describes the variables, and Table 6.2 shows two regression models: Model 1 includes all variables except cultural capital, and Model 2 adds cultural capital.
TABLE 6.1. Means, Standard Deviations, and Correlations Among Variables Affecting Knowledge of Chinese Characters, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).

Variable                                             Mean
V: Number of characters correctly identified (a)     3.60
E: Years of schooling                                6.47
E_F: Father's years of schooling                     3.07
N: Nonmanual occupation (b)                          .177
U: Urban origin (b)                                  .180
M: Male (b)                                          .558
C: Level of cultural capital (c)                     .227

(a) Items, in increasing order of difficulty, are yiwan (ten thousand), xingming (full name), liangshi (grain), hanshu (function), diaozhuo (carve), sinue (wreak havoc or wanton massacre), chuanmiu (erroneous), qimao (octogenarian), chichu (walk slowly), and taotie (glutton).
(b) Variables N, U, and M are dichotomies, scored 1 for those in the category and scored 0 for those not in the category.
(c) This scale is the mean of standardized scores for five variables measuring the behavior of parents when the respondents were age fourteen: the number of books in the home, the presence of children's magazines in the home, the frequency with which parents read a newspaper, the frequency with which parents read serious nonfiction, and the presence of an atlas in the home. If information was missing for an item, that item was excluded from the average. The resulting scale was transformed to a 0-1 metric; that is, the lowest score is 0 and the highest score is 1.
Table 6.2 confirms the importance of years of schooling because the standardized coefficients for years of schooling in both models are far larger than the standardized coefficients for any other variable. Each additional year of schooling produces an expected increase of about .4 in the number of characters correctly identified, net of all other factors. This means, for example, that a university graduate (sixteen years of schooling) would be expected to identify about two more characters than would an otherwise similar vocational or technical school graduate (eleven years of schooling).
TABLE 6.2. Determinants of the Number of Chinese Characters Correctly Identified on a Ten-Item Test, Employed Chinese Adults Age 20-69, 1996 (Standard Errors in Parentheses).(a)

Variable                                   Model 1          Model 2
Metric regression coefficients
  E_F: Father's years of schooling         .030 (.006)      .009 (.007)
  U: Urban origin                          .255 (.053)      .177 (.054)
Standardized regression coefficients
  E_F: Father's years of schooling         .049             .015

(a) All variables are significant at or beyond the .001 level except for father's education in Model 2 (p = .195).
TECHNICAL POINT ON TABLE 6.2
Note that both models in Table 6.2 are based on exactly the same cases, the number of cases shown in Table 6.1. A common error analysts make is to present successive models based on different cases: all the cases for which complete information is available for the variables included in that model. This is ill advised because it makes it impossible to determine whether differences in the coefficients for successive models are due to the inclusion of additional variables or are due to variation in the samples. Moreover, formal comparisons of the increment in explained variance resulting from the inclusion of additional variables (presented in the next section) are not correct unless the models are based on the same cases. Stata has a command that makes it easy to ensure that all models being compared are based on the same cases.
Model 1 predicts the number of characters identified from all variables except "cultural capital." In this model all coefficients are significant at the .001 level. Net of other factors, those with nonmanual occupations score about a quarter of a point higher than manual workers, those from urban origins score about a quarter of a point higher than those from rural origins, and males score more than a third of a point higher than females. Clearly, all these effects are real but, with the exception of education, are modest in size. Interestingly, the father's years of schooling significantly increases knowledge of characters, net of all other factors, although the effect is very small (the expected difference between those with the most-educated and least-educated fathers is only about half a character; precisely, .54 = .030*18). Together, the factors in Model 1 explain more than two-thirds of the variance in vocabulary knowledge, which is a very strong relationship. Also, the standard error of estimate for Model 1, 1.25, tells us that 95 percent of the actual vocabulary scores lie within 2.45 points (±1.96*1.25) of the regression surface. It is instructive to note how large the error is. Even with a very high R², by social science standards, the cases are distributed over nearly half the range of the dependent variable. This suggests the need to exercise considerable caution in interpreting regression estimates.
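The two pieces of arithmetic in this paragraph can be reproduced directly. The sketch assumes, as the multiplication in the text implies, that fathers' schooling spans 18 years, and that the test score ranges from 0 to 10 because the test has ten items; the variable names are illustrative.

```python
# Expected difference between respondents with the most- and least-educated fathers.
b_father = 0.030          # Model 1 metric coefficient for father's schooling
father_range = 18         # assumed span of father's schooling, in years
max_difference = b_father * father_range

# 95 percent band around the regression surface implied by the s.e.e.
see = 1.25                # standard error of estimate, Model 1
half_width = 1.96 * see
share_of_range = (2 * half_width) / 10   # ten-item test, scores 0 to 10

print(f"largest expected difference due to father's schooling: {max_difference:.2f}")
print(f"95 percent band: plus or minus {half_width:.2f} points")
print(f"band width as a share of the score range: {share_of_range:.2f}")
```

The last figure is what the text means by the cases being distributed over nearly half the range of the dependent variable.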
The intercept, .579, is interpreted as the expected vocabulary score for those with a score of zero on each of the independent variables, that is, for rural origin females working at manual jobs without any schooling whose fathers had no schooling. This is not a very meaningful value. Although in China there are people who fit this description, in many nations a person with 0 scores on all variables would be beyond the range of the observed data. To achieve a meaningful intercept, it often is useful to reexpress the continuous independent variables as deviations from their mean. If this is done, the intercept is then interpretable as the expected value on the dependent variable for people who are at the mean with respect to each of the continuous variables (and, of course, have scores of 0 with respect to each dichotomous variable). In the present case such a reexpression would give us the expected vocabulary score (in this case 3.30) for rural females working at manual jobs (the 0 values on each of the dichotomous variables) who have average education and whose fathers have average education. Note that such a reexpression of the independent variables has no effect on the regression coefficients, the standard errors, the R², or the standard error of estimate. Only the intercept is affected.

Model 2 includes "cultural capital" as an additional factor. The associated coefficient indicates that, net of all other factors, people raised in households with the highest cultural capital, that is, maximally involved with reading, score almost a full point higher in their knowledge of vocabulary than do people raised in households minimally involved with reading. Although the explained variance is significantly increased (in the next section we consider how to assess the significance of the increment in R²), the increase is hardly important from a substantive point of view. What is important is that the introduction of "cultural capital" reduces the effect of father's education to nonsignificance. This makes clear the reason why, net of the respondent's own education, knowledge of vocabulary is enhanced by the father's education: households with educated fathers tend to be more involved in reading than other households. After the "cultural capital" of the household is taken into account, the father's education has no additional effect on vocabulary skill. The "cultural capital" variable also reduces the size of the "urban origin" effect, which indicates that part of the advantage of urban origins is the tendency of urban households to be more involved with reading than are otherwise similar rural households. None of the other coefficients is much affected by the introduction of cultural capital.
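The claim that reexpressing continuous variables as deviations from their means changes only the intercept can be verified on the ten hypothetical cases from the beginning of the chapter (the survey data themselves are not reproduced here). This is a minimal sketch with illustrative function names; because both predictors in this small example are centered, the new intercept equals the mean of the dependent variable.

```python
father = [2, 12, 4, 13, 6, 6, 8, 4, 8, 10]   # E_F
resp = [4, 10, 8, 13, 9, 4, 13, 6, 6, 11]    # E
sibs = [3, 3, 4, 0, 2, 5, 3, 4, 3, 4]        # S


def mean(values):
    return sum(values) / len(values)


def cross(x, y):
    mx, my = mean(x), mean(y)
    return sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))


def ols_two_predictors(y, x1, x2):
    """Least-squares intercept and slopes for y = a + b*x1 + c*x2."""
    s11, s22, s12 = cross(x1, x1), cross(x2, x2), cross(x1, x2)
    det = s11 * s22 - s12 ** 2
    b = (s22 * cross(x1, y) - s12 * cross(x2, y)) / det
    c = (s11 * cross(x2, y) - s12 * cross(x1, y)) / det
    return mean(y) - b * mean(x1) - c * mean(x2), b, c


def centered(values):
    m = mean(values)
    return [v - m for v in values]


a_raw, b_raw, c_raw = ols_two_predictors(resp, father, sibs)
a_ctr, b_ctr, c_ctr = ols_two_predictors(resp, centered(father), centered(sibs))

print(f"raw:      intercept = {a_raw:.2f}, slopes = {b_raw:.3f}, {c_raw:.3f}")
print(f"centered: intercept = {a_ctr:.2f}, slopes = {b_ctr:.3f}, {c_ctr:.3f}")
```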
Graphic Representation of Results

Sometimes, for ease of exposition, it is useful to graph the net relationship implied by the model between a given independent variable and the dependent variable. This is easy to do. The trick is to simplify the estimation equation by substituting the means, or other appropriate values, for all independent variables except the one of interest and collecting them into the constant. This yields the expected value on the dependent variable at each level of the independent variable, holding constant all other independent variables at the specified values. The same procedure can be extended to show separate graphs for each category of a categorical variable: for example, if we were interested in how the relationship between literacy scores and years of schooling implied by Model 2 differed for males and females.

For continuous variables the mean is a good choice of the value to substitute into the equation. For dichotomous variables we could substitute either the mean or some suitable value, for example, nonmanual workers from urban origins. Of course, for dichotomous variables, the mean is just the proportion that is "positive" with respect to the variable. Thus, if we substitute means for dichotomous variables, we are not evaluating the equation for any actual person (after all, one cannot be 18 percent urban or 56 percent male); rather, we are evaluating what are, in some sense, the typical circumstances of the population.

To see how the procedure works, let us evaluate the equation in two ways: for nonmanual workers from urban origins, and for the mean values of these variables. In each case, we evaluate the equation separately for males and females to create graphs that show separate lines for males and females. Considering first an equation evaluated for nonmanual workers from urban origins, we have, for females

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(1) + .177(1) + .385(0) + .866(.227) \\
&= 1.158 + .393(E)
\end{aligned}
\tag{6.6}
$$

and for males

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(1) + .177(1) + .385(1) + .866(.227) \\
&= 1.543 + .393(E)
\end{aligned}
\tag{6.7}
$$
Having arrived at a pair of bivariate equations, differing only by a constant (= .385, the coefficient associated with "male"), we can simply graph the equations. Figure 6.2 shows the graph, which makes clear the relative magnitude of the education and gender effects net of all other determinants of vocabulary knowledge in China. Clearly, education is far more important than gender, although within levels of education there is a small difference favoring males.

Now, suppose that instead of evaluating the equation for nonmanual workers from urban origins, we evaluated the equation at the means of each of the independent variables (except, of course, education and gender, because we want to display the effects of these two variables). Our equation for females is then

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(.177) + .177(.180) + .385(0) + .866(.227) \\
&= .839 + .393(E)
\end{aligned}
\tag{6.8}
$$

and for males is

$$
\begin{aligned}
\hat{V} &= a + b(E) + c(E_F) + d(N) + e(U) + f(M) + g(C) \\
&= .546 + .393(E) + .009(3.07) + .21(.177) + .177(.180) + .385(1) + .866(.227) \\
&= 1.224 + .393(E)
\end{aligned}
\tag{6.9}
$$
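The substitution used in Equations 6.6 through 6.9 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coefficients and substituted values are copied from the equations as printed, while the dictionary keys and the helper name `collapse` are labels invented here, not names from the text. Because the printed coefficients are rounded, the computed constants can differ from the printed ones in the third decimal place.

```python
# Coefficients as printed in Equations 6.6 through 6.9 (a is the intercept).
COEF = {
    "intercept": 0.546,
    "education": 0.393,
    "father_education": 0.009,
    "nonmanual": 0.21,
    "urban": 0.177,
    "male": 0.385,
    "cultural_capital": 0.866,
}


def collapse(fixed_values, focal="education"):
    """Fold every term except the focal variable into a single constant.

    fixed_values maps each non-focal variable to the value substituted for it.
    Returns (constant, slope) of the resulting bivariate equation.
    """
    constant = COEF["intercept"]
    for name, value in fixed_values.items():
        constant += COEF[name] * value
    return constant, COEF[focal]


# Father's education and cultural capital are held at their means throughout.
AT_MEANS = {"father_education": 3.07, "cultural_capital": 0.227}

lines = {}
for sex, male in (("female", 0), ("male", 1)):
    # Nonmanual workers from urban origins (Equations 6.6 and 6.7).
    lines["urban_nonmanual", sex] = collapse(
        {**AT_MEANS, "nonmanual": 1, "urban": 1, "male": male})
    # Nonmanual and urban set to their means (Equations 6.8 and 6.9).
    lines["means", sex] = collapse(
        {**AT_MEANS, "nonmanual": 0.177, "urban": 0.180, "male": male})

for key, (constant, slope) in lines.items():
    print(key, f"{constant:.3f} + {slope:.3f}(E)")
```

Each entry of `lines` is an intercept and slope that can be handed directly to any plotting routine; the male and female lines within a pair differ only by the coefficient for "male."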
Note that the only differences between Equations 6.6 through 6.7 and Equations 6.8 through 6.9 are in the intercepts and also that the difference in the intercepts between each pair of equations is identical. Thus, a graph of Equations 6.8 through 6.9 would be almost identical to Figure 6.2 except that both lines would be shifted down. For this reason I do not bother to show a graph of Equations 6.8 through 6.9. Whether to substitute means or other specific values when evaluating an equation is a matter of judgment that should be decided by the analyst's substantive concerns.

[Figure: line graph; horizontal axis: years of school completed; separate lines for males and females.]

FIGURE 6.2. Expected Number of Chinese Characters Identified (Out of Ten) by Years of Schooling and Gender, Urban Origin Chinese Adults Age 20 to 69 in 1996 with Nonmanual Occupations and with Years of Father's Schooling and Level of Cultural Capital Set at Their Means (N = 4,802).
Note: The female line does not extend beyond 16 because there are no females in the sample with post-graduate education.
DUMMY VARIABLES

Often situations arise in which we want to analyze the role of categorical variables, such as religious affiliation, marital status, or political party membership, in determining some outcome. Moreover, typically we want to combine categorical variables with interval variables, to study the effect of each controlling for the other. Thus, we need a way to include categorical variables within a regression framework.

To see how this is done, let us revisit the problem we considered in the final section of Chapter Five, on correlation ratios. Recall that we were interested in the relation between religious affiliation and acceptance of abortion, and we analyzed this by estimating the mean number of positive (accepting) responses to a seven-item scale for each of four religious groups (Protestants, Catholics, Jews, and those with other or no religion) from the 2006 General Social Survey (GSS) data. Here we explore a similar substantive problem, but this time using data from the 1974 GSS, because the results for that year are particularly clear-cut and hence more suitable for exposition of the method. (As an exercise, you might want to carry out a similar analysis using the 2006 data.) We start by converting the religious denomination variable into a set of four dichotomous variables, one for each religious group, with each variable scored 1 for persons with that religion and scored 0 otherwise. That is, we define a set of new variables (see the downloadable -do- or -log- file):

$R_1$ = 1 if the respondent is Protestant, and = 0 otherwise
$R_2$ = 1 if the respondent is Catholic, and = 0 otherwise
$R_3$ = 1 if the respondent is Jewish, and = 0 otherwise
$R_4$ = 1 if the respondent has another religion, no religion, or failed to respond, and = 0 otherwise
Variables of this kind are known as dichotomous or dummy variables. Using these variables, we can estimate a multiple regression equation of the form:

$$
\hat{A} = a + \sum_{i=2}^{4} b_i R_i = a + b_2 R_2 + b_3 R_3 + b_4 R_4
\tag{6.10}
$$
where A is the number of "pro-choice" responses, that is, positive responses to questions about the circumstances under which legal abortions should be permitted (in 1974 six such questions were asked, so the scale ranges from 0 to 6), and the $R_i$ are as specified in the preceding paragraph.

Note that it is necessary to omit one category from the regression equation to avoid a linear dependency (the situation in which any one independent variable is an exact function of other independent variables); because of the way dummy variables are constructed, with each individual scored 1 on one dummy variable and 0 on all the remaining variables in the set, knowing the value on all but one of the dummy variables allows perfect prediction of the value of the remaining dummy variable. In such situations OLS equations cannot be estimated. Any category may be omitted, but because, as we will see, the coefficients of the dummy variables included in the equation are interpreted as deviations from the value for the omitted, or reference, category, it is best to choose the reference category on substantive grounds: the category against which the analyst wants to contrast other categories. The only exception to the substantive criterion is that very small categories should not be chosen as the omitted category because doing so may create a near-linear dependency among the remaining categories, which could result in unstable numerical estimates of the coefficients.

Estimating Equation 6.10, we have

$$
\hat{A} = 3.98 - .33(R_2) + 1.61(R_3) + .88(R_4); \quad R^2 = .045
\tag{6.11}
$$
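The dummy coding and the linear dependency described above can be made concrete with a small Python sketch. It is illustrative only: the category labels, the five respondents, and the helper name `make_dummies` are hypothetical and are not taken from the GSS data or from the downloadable files.

```python
CATEGORIES = ["Protestant", "Catholic", "Jewish", "Other/None"]


def make_dummies(religion):
    """Return [R1, R2, R3, R4]: 1 for the respondent's category, 0 elsewhere."""
    return [1 if religion == category else 0 for category in CATEGORIES]


# Hypothetical respondents, used only to illustrate the coding.
respondents = ["Catholic", "Protestant", "Other/None", "Jewish", "Protestant"]
full_set = [make_dummies(r) for r in respondents]

# Every row sums to 1, so R1 + R2 + R3 + R4 reproduces the constant term
# exactly. That exact relation is the linear dependency.
row_sums = [sum(row) for row in full_set]

# Dropping the reference category (here R1, Protestant) removes the dependency.
design_rows = [[1] + row[1:] for row in full_set]  # constant, R2, R3, R4

for religion, row in zip(respondents, design_rows):
    print(religion, row)
```

With the constant included, any three of the four dummies carry all the information in the classification, which is why one category must be left out.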
Now, let us derive the predicted values for each category:

For Protestants:

$$
\hat{A} = a + b_2(0) + b_3(0) + b_4(0) = a = 3.98
\tag{6.12}
$$

For Catholics:

$$
\hat{A} = a + b_2(1) + b_3(0) + b_4(0) = a + b_2 = 3.98 - .33 = 3.65
\tag{6.13}
$$

For Jews:

$$
\hat{A} = a + b_2(0) + b_3(1) + b_4(0) = a + b_3 = 3.98 + 1.61 = 5.59
\tag{6.14}
$$

For Other and None:

$$
\hat{A} = a + b_2(0) + b_3(0) + b_4(1) = a + b_4 = 3.98 + .88 = 4.86
\tag{6.15}
$$
From Equations 6.12 through 6.15, it is evident that the intercept, a, gives the expected value for the omitted, or reference, category and that the coefficients associated with each of the dummy variables, the $b_i$, give the difference in the expected value between that category and the omitted, or reference, category. Thus, Protestants have an expected score of 3.98 on the abortion scale; that is, on average they find abortion acceptable for about four of the six conditions. Catholics, by contrast, have an expected score of 3.65, .33 less than Protestants. Jews on average endorse nearly all six items (precisely, 5.59), 1.61 more than Protestants. Finally, the remainder of the sample, the residual category "Other and None," falls midway between Protestants and Jews in their average level of acceptance of abortion.
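The arithmetic behind Equations 6.12 through 6.15 is simple enough to check by machine. The following Python sketch uses the coefficients as printed in Equation 6.11; the variable names and group labels are chosen here for illustration.

```python
# Coefficients as printed in Equation 6.11; Protestants are the reference group.
intercept = 3.98
coefficients = {"Catholic": -0.33, "Jewish": 1.61, "Other/None": 0.88}

# The reference category gets the intercept alone; every other category gets
# the intercept plus its own dummy coefficient.
expected = {"Protestant": intercept}
for group, b in coefficients.items():
    expected[group] = round(intercept + b, 2)

for group, value in expected.items():
    print(f"{group}: {value:.2f}")
```

Changing the reference category would change the intercept and every coefficient, but not the four expected values themselves.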
WHY YOU SHOULD INCLUDE THE ENTIRE SAMPLE IN YOUR ANALYSIS

Although the "Other and None" category is a catchall residual category and is thus rather uninteresting, it is desirable to include such categories in the analysis rather than restricting the analysis to those with "interpretable" religions. The reason for this is that we ordinarily want to generalize to an entire population, not a subset of the population with definable characteristics. If we omit the residual category from the analysis, our estimates of the average population characteristics are likely to be biased, and, what is worse, biased in unknown ways. [...] may also be biased. See the discussion of [...]
Note that the R² is identical to the correlation ratio, η², that we encountered in the previous chapter. Moreover, the predicted values on the abortion attitudes scale are just the means for each religious group shown in Table 5.1. This follows from the fact that, absent any additional information, the mean is the least-squares prediction of the value of an observation. Thus, the "least-squares" estimates for each religious category are just the subgroup means. So far we seem to have no more than a complicated approach to estimating subgroup means and correlation ratios.

The real value of dummy variables is when they are used in combination with other variables to test the effect of group membership on the dependent variable net of the effects of other variables and also to assess the effect of group membership on the relationship between other variables and the dependent variable (and the effect of other variables on the relationship between group membership and the dependent variable), that is, to assess interactions between the group categories and other variables. To see this, let us continue with our example. Suppose we are interested in assessing the effect of education on acceptance of abortion. Suppose that in addition we want to assess the possibility that religious groups differ in their acceptance of abortion and, moreover, in the relation between education and abortion acceptance, with Catholics tending to oppose abortion regardless of their education, Jews tending to accept abortion regardless of their education, and the remaining two groups becoming more accepting as their education increases. To test these claims, we estimate three successively more complicated regression equations:

$$
\hat{A} = a + b(E)
\tag{6.16}
$$

$$
\hat{A} = a' + b'(E) + \sum_{j=2}^{4} c_j R_j
\tag{6.17}
$$

$$
\hat{A} = a'' + b''(E) + \sum_{j=2}^{4} c'_j R_j + \sum_{j=2}^{4} d_j R_j E
\tag{6.18}
$$
The first model (Equation 6.16) posits an effect of education but no effect of religion. This model assumes that all religious groups are alike in their acceptance of abortion. The second model (Equation 6.17) posits an across-the-board, or constant, difference between religious groups in their acceptance of abortion but assumes that the relation between education and acceptance of abortion is the same for all religious groups. The third model (Equation 6.18) posits an interaction between education and religion in the acceptance of abortion, or to put it differently, assumes that the religious groups differ in the way education affects abortion acceptance. (The conventional way to represent an interaction in a regression framework is to construct a variable that is the product of the two [or more] variables among which an interaction is posited, although other nonlinear functional forms are sometimes posited.)
A STRATEGY FOR COMPARISONS ACROSS GROUPS

Our first task is to decide among the models represented by Equations 6.16 through 6.18. In this situation, where we are assessing whether groups differ with respect to some social process, we would generally prefer the most parsimonious model (except when we have a strong theoretical reason for positing differences between groups or, as noted in the section on Metric Regression Coefficients earlier in the chapter, when we suspect the possibility of omitted variable bias). That is, in general we should prefer a more complicated model only if it does a significantly better job of explaining variability in our dependent variable, acceptance of abortion. We decide on the preferred model by comparing the variance explained by each model. If a more complicated model explains a significantly larger amount of variance in the dependent variable, we accept that model; if it does not, we accept the simpler model. (This is the classical, or frequentist, approach. The next section provides an alternative approach to model assessment based on Bayesian notions, a comparison of BICs.)

We begin by comparing the first and third models. That is, we contrast a model that assumes that there are no religious differences in acceptance of abortion but only educational differences with a model that assumes that the relation between education and acceptance of abortion differs across religious groups. To assess the significance of the difference in R², we compute an F-ratio:

$$
F = \frac{(R_B^2 - R_A^2)/m}{(1 - R_B^2)/(N - k - 1)}
\tag{6.19}
$$

where $R_B^2$ is the variance explained by the larger model (here, Equation 6.18); $R_A^2$ is the variance explained by the smaller model (here, Equation 6.16); N is the number of cases; k is the total number of independent variables in the larger model; m is the difference in the number of independent variables between the larger and smaller model; the numerator degrees of freedom = m; and the denominator degrees of freedom = N − k − 1. In our numerical example we have

$$
F = \frac{(.097 - .053)/6}{(1 - .097)/(1{,}481 - 7 - 1)} = 11.96
\tag{6.20}
$$
with 6 and 1,473 degrees of freedom. To determine whether this F-ratio is significant, we find the p-value corresponding to the numerical value of the F-ratio with the specified numerator and denominator degrees of freedom. If our p-value is smaller than some critical value (.05 is conventionally used), we reject the null hypothesis (Model 1) in favor of the alternative hypothesis (Model 3). In the present case that is what we are led to do because F(6; 1,473) = 11.96, which implies p < .0001.
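The F-ratio of Equation 6.19 is easy to compute directly. The Python sketch below is one way to do so; the function name `f_ratio` and its argument names are invented here for illustration, and the inputs are the values used in Equation 6.20.

```python
def f_ratio(r2_large, r2_small, n_cases, k_large, m_extra):
    """F-ratio for the increment in R-squared (Equation 6.19).

    k_large is the number of independent variables in the larger model;
    m_extra is the number of variables it adds to the smaller model.
    Returns (F, numerator df, denominator df).
    """
    df_denominator = n_cases - k_large - 1
    numerator = (r2_large - r2_small) / m_extra
    denominator = (1 - r2_large) / df_denominator
    return numerator / denominator, m_extra, df_denominator


# Model 3 versus Model 1, with the values used in Equation 6.20.
F, df1, df2 = f_ratio(r2_large=0.097, r2_small=0.053,
                      n_cases=1481, k_large=7, m_extra=6)
print(f"F({df1}, {df2}) = {F:.2f}")
```

The function returns only the test statistic and its degrees of freedom; the corresponding p-value still has to be obtained from an F table or from statistical software.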
GETTING p-VALUES VIA STATA

Until fairly recently, p-values were found by consulting published tables of the F-distribution; this is now something of an old-fashioned approach. Stata provides a set of built-in statistical tables, including a table of probabilities associated with specific F-ratios. The probability of a given F-ratio can be computed by executing the command -display fprob(df_1, df_2, F)-, where df_1 is the numerator degrees of freedom, df_2 is the denominator degrees of freedom, and F is the calculated F-ratio.
USING STATA TO COMPARE THE GOODNESS-OF-FIT OF REGRESSION MODELS

The F-test for the increment in R² is equivalent to the Wald test that the coefficients for the set of variables that are included in the larger model but not in the smaller model are not significantly different from zero. Thus software that implements the Wald test as a post-estimation command (for example, -test- and -testparm- in Stata) can be used to carry out the F-test. It also can be shown that when a single variable is added to a regression equation, the t-ratio for the additional variable equals the square root of the F-ratio for the increment in R², and the t- and F-ratios have identical probability distributions. Thus, when two equations differ by a single variable, they can be contrasted simply by inspecting the significance of the t-ratio, which is routinely provided as part of the regression output.
Having determined that we cannot posit a single model of the relation between education and abortion acceptance for all religious groups, we next investigate whether it is necessary to posit religious group differences in the relation between education and abortion acceptance or whether there are simply across-the-board differences between the religious groups in their acceptance of abortion but a similar relation between education and abortion acceptance for all groups; that is, we ask whether both the slopes and intercepts differ across religious groups or whether only the intercepts differ. To answer this question, we contrast the R² for Model 3 (Equation 6.18) and Model 2 (Equation 6.17), estimating an F-ratio using Equation 6.19. For our current numerical example, we get

$$
F = \frac{(.097 - .089)/3}{(1 - .097)/(1{,}481 - 7 - 1)} = 4.35
\tag{6.21}
$$

with 3 and 1,473 degrees of freedom. Because F(3; 1,473) = 4.35, which implies p = .0046, we reject the null hypothesis that the relationship between education and abortion acceptance is the same for all religious groups but that the groups differ in their across-the-board acceptance of abortion, in favor of the alternative hypothesis that the relationship between education and abortion acceptance differs across religious groups. Thus, in sum, our preferred model is one that posits that religious groups differ in their abortion attitudes and that the effect of education varies by religion (and, necessarily as well, that the effect of religion varies by education).

RONALD AYLMER FISHER (1890-1962) was a British statistician with a strong interest in biology (he was a founder, with Sewall Wright, whose biosketch appears in Chapter Sixteen, and J.B.S. Haldane, of theoretical population genetics). He was responsible for major advances in experimental design, introducing the notion of random assignment of cases to different treatments and showing how to use analysis of variance, which he invented (the F-distribution is named after Fisher), to assess the contribution of each of several factors in determining an outcome, a procedure that greatly enhanced the power of experimental designs. He also invented the concept of maximum likelihood and made major contributions to statistical procedures for assessing small samples. His text Statistical Methods for Research Workers, first published in 1925, was very widely used, especially as a handbook for the design and analysis of experiments, and ran through fourteen editions, the latest published in 1970.

Fisher was born in London, the son of an art dealer and auctioneer. He was a precocious student, winning the Neeld Medal (a competitive essay in mathematics) at Harrow School at the age of sixteen. (Because of his poor eyesight, he was tutored in mathematics without the aid of paper and pen, which developed his ability to think about problems in geometrical terms, as opposed to using algebraic manipulations, and to arrive at mathematical results without setting down the intermediate steps.) He studied mathematics at Cambridge [...]. He tried to enlist in the army during World War I but was rejected because of his poor eyesight, and then spent several years teaching mathematics in secondary school. At the end of the war, he was offered a position at the Galton Laboratory but [...] instead took a position at a small agricultural experimental station (Rothamsted), where he remained until he was appointed professor of Eugenics at University College London in 1933 and then to the Balfour Chair of Genetics at Cambridge in 1943. After retiring from Cambridge, he spent the last years of his life as a senior research fellow at the Commonwealth Scientific and Industrial Research Organization in Adelaide, Australia. Fisher's important contributions to both genetics and statistics are emphasized by the remark of the well-known statistician Leonard J. Savage (1976): "I occasionally meet geneticists who ask me whether it is true that the great geneticist R. A. Fisher was also an important statistician."
TABLE 6.3. Coefficients of Models of Acceptance of Abortion, U.S. Adults, 1974 (Standard Errors Shown in Parentheses); N = 1,481.

Variable               Model 1    Model 2         Model 3         Model 3'*
R2: Catholic                      -.373 (.111)    1.059 (.455)    -.371 (.111)
R3: Jewish                        1.341 (.282)    [...]           [...]
R4: Other or None                 .747 (.184)     [...]           .702 (.187)
[...]
Intercept              [...]      [...]           [...]           [...]

*Model 3' is identical to Model 3 except that in Model 3' years of schooling (education) is expressed as a deviation from the mean years of schooling.
The conventional practice is to report the estimated coefficients for each model, not merely the preferred model. These are shown in Table 6.3. Let us see how to interpret each of these models. Model 1 is just a two-variable regression equation of the sort we encountered in the previous chapter; nothing further needs to be said here.

As we have noted, Model 2 posits the same relationship between education and abortion acceptance for all religious groups but across-the-board differences between religious groups in the level of acceptance of abortion among those who have a given level of education. What is meant by an "across-the-board" difference is clarified by writing out Equation 6.17 separately for each religious group.

For Protestants, we have

$$
\hat{A} = a + b(E)
\tag{6.22}
$$

For Catholics, we have

$$
\hat{A} = a + b(E) + c_2 = (a + c_2) + b(E)
\tag{6.23}
$$

For Jews, we have

$$
\hat{A} = a + b(E) + c_3 = (a + c_3) + b(E)
\tag{6.24}
$$

For Others, we have

$$
\hat{A} = a + b(E) + c_4 = (a + c_4) + b(E)
\tag{6.25}
$$
From Equations 6.22 through 6.25 it is evident that Model 2 (Equation 6.17) implies that the religious groups differ in their intercepts but not in the slopes relating education to abortion acceptance. If Model 2 were our preferred model, we could conclude that each year of education resulted, on average, in a .125 increase in abortion acceptance for people of all religions, so that, for example, college graduates (sixteen years of schooling) would be expected to accept abortion for one reason more than would those of the same religion with only a primary school education (eight years of schooling), because 8 × .125 = 1.0. And we would expect Jews to agree that abortion ought to be permitted for 1.3 reasons more than Protestants, on average, and Catholics to agree .4 times less on average than Protestants. In short, interpretation of the coefficients for Model 2 is straightforward, and the net effects of education and of religious group membership can be assessed separately.

However, although the size of the coefficient for each religious group can be interpreted individually, it generally is not meaningful to assess the significance of individual coefficients because each coefficient indicates the difference between the expected value for the given category and the expected value for the omitted category net of all other factors. Therefore, a significant t-ratio merely indicates that a coefficient is significantly different from the implied coefficient of zero for the omitted category, and which coefficients are shown as significant in one's computer output is entirely dependent upon the choice of omitted, or reference, category. Thus, the appropriate procedure is to assess the significance of the entire set of dummy variables representing a given categorical variable by computing an F-test of the increment in R² for models including and excluding the set of dummy variables corresponding to a single classification (or equivalently, the Wald test that the set of coefficients is jointly = 0).
HOW TO TEST THE SIGNIFICANCE OF THE DIFFERENCE BETWEEN TWO COEFFICIENTS

There may be occasions in which an analyst wants to assess the significance of the difference between two specific categories of a dummy variable classification. In this case, it is possible to make use of the formula:

$$
t = \frac{b_i - b_j}{\left(\mathrm{var}(b_i) + \mathrm{var}(b_j) - 2\,\mathrm{cov}(b_i, b_j)\right)^{1/2}}
$$

where $b_i$ and $b_j$ are the two coefficients being compared. Most statistical packages permit the estimation of the variance-covariance matrix of coefficients. Of course, in these days of high-speed computing, it probably is easier to simply reestimate the model, redefining the reference category. Stata provides an even easier way to compare coefficients, by computing a Wald test that $b_i = b_j$.
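The formula for the difference between two coefficients can be sketched as a short Python function. The function name `t_difference` is invented here. The two coefficients and their standard errors are the Model 2 values for Jews and Catholics from Table 6.3, but the covariance is a purely hypothetical number, since the text does not report the variance-covariance matrix; the resulting t-ratio is therefore illustrative, not a result from the source.

```python
import math


def t_difference(b_i, b_j, var_i, var_j, cov_ij):
    """t-ratio for the difference between two coefficients."""
    standard_error = math.sqrt(var_i + var_j - 2 * cov_ij)
    return (b_i - b_j) / standard_error


# Jewish versus Catholic coefficients in Model 2 of Table 6.3. The variances
# are the squared standard errors; the covariance (0.006) is HYPOTHETICAL,
# since the text does not report the variance-covariance matrix.
t = t_difference(b_i=1.341, b_j=-0.373,
                 var_i=0.282 ** 2, var_j=0.111 ** 2, cov_ij=0.006)
print(f"t = {t:.2f}")
```

In practice the variances and the covariance would all be read from the estimated variance-covariance matrix of the coefficients rather than typed in by hand.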
When interaction terms are involved, the requirements are even more stringent. Not only must the significance of all of the associated coefficients be assessed simultaneously, but the coefficients themselves must be interpreted together rather than individually. Consider Model 3, which includes interaction terms between education and religious group membership. It helps to write out Equation 6.18 separately for each religious group.

For Protestants, we have

$$
\hat{A} = a + b(E) = 2.18 + .155(E)
\tag{6.26}
$$

For Catholics, we have

$$
\begin{aligned}
\hat{A} &= a + b(E) + c_2 + d_2(E) \\
&= (a + c_2) + (b + d_2)E \\
&= (2.18 + 1.06) + (.155 - .121)E \\
&= 3.24 + .034(E)
\end{aligned}
\tag{6.27}
$$

For Jews, we have

$$
\begin{aligned}
\hat{A} &= a + b(E) + c_3 + d_3(E) \\
&= (a + c_3) + (b + d_3)E \\
&= (2.18 + 3.20) + (.155 - .140)E \\
&= 5.38 + .015(E)
\end{aligned}
\tag{6.28}
$$

For Others, we have

$$
\begin{aligned}
\hat{A} &= a + b(E) + c_4 + d_4(E) \\
&= (a + c_4) + (b + d_4)E \\
&= (2.18 + .53) + (.155 + .014)E \\
&= 2.71 + .169(E)
\end{aligned}
\tag{6.29}
$$

Again, it is evident from Equations 6.26 through 6.29 that Equation 6.18 allows both the slopes and the intercepts to vary between groups. The coefficients associated with the dummy variables, the $c_j$, indicate the differences in the intercept between the reference category and each of the explicitly included categories, whereas the coefficients associated with the interaction terms, the $d_j$, indicate the differences in the effect of education (the slope) between the reference category and each of the explicitly included categories.

From Equations 6.26 through 6.29 it should be clear that for equations of the form of Equation 6.18, no overall summary of the effect of a variable (in this case education or religious group membership) is possible but only the effect of each combination of education and religious group membership. Specifically, the coefficient .155 in Model 3 of Table 6.3 (and in Equation 6.26) does not refer to the overall effect of education but rather to the effect of education among Protestants; and so on for the other coefficients.

Because Equation 6.18 is a saturated model (it includes all possible interactions among the independent variables; in the present case, all possible interactions between education and the religious group categories), it is mathematically equivalent to estimating separate equations for each religious group. Equations 6.26 through 6.29 show this equivalence: the coefficients resulting from rewriting Equation 6.18 as Equations 6.26 through 6.29 and collecting terms are identical to the coefficients obtained by estimating the equation separately for each religious group (you might want to persuade yourself of this by carrying out the computation both ways). The advantage of estimating Equation 6.18 is that it permits an explicit test of the hypothesis that the groups differ, through the F-test shown as Equation 6.19.
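The collection of terms in Equations 6.26 through 6.29 can be reproduced mechanically. The following Python sketch uses the Model 3 coefficients as printed in those equations; the dictionary layout and the helper name `expected_acceptance` are illustrative choices made here, not part of the text.

```python
# Model 3 coefficients as printed in Equations 6.26 through 6.29.
a, b = 2.18, 0.155                      # intercept and slope for Protestants
c = {"Catholic": 1.06, "Jewish": 3.20, "Other/None": 0.53}       # dummy terms
d = {"Catholic": -0.121, "Jewish": -0.140, "Other/None": 0.014}  # interactions

# Reference group: no dummy or interaction term applies.
group_lines = {"Protestant": (a, b)}
for group in c:
    # The intercept shifts by c_j and the slope shifts by d_j.
    group_lines[group] = (round(a + c[group], 2), round(b + d[group], 3))


def expected_acceptance(group, years_of_schooling):
    """Expected number of items endorsed for a group at a given education."""
    intercept, slope = group_lines[group]
    return intercept + slope * years_of_schooling


for group, (intercept, slope) in group_lines.items():
    print(f"{group}: {intercept:.2f} + {slope:.3f}(E)")
```

Calling `expected_acceptance` over a range of schooling values for each group yields the points needed to draw one line per group.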
Because the conversion of Equation 6.18 to Equations 6.26 through 6.29 is a fairly tedious hand operation, especially when the number of variables is large, we often estimate equations of the form of Equation 6.18 to obtain the R² required for the F-test (or the coefficients required for the Wald test) but then estimate separate equations for each group and report them in a separate table.

From Table 6.3, and even from Equations 6.26 through 6.29, it is difficult to interpret the relationships among religion, education, and abortion attitudes. With equations of this sort, graphing the relationships is often useful. Figure 6.3, which can be constructed in the same way as Figure 6.2 (see the downloadable -do- or -log- file for details), shows the level of abortion acceptance expected from education and religious group membership. Inspecting the graph, it is evident that Jews are highly accepting of abortion regardless of their level of education, that Catholics are relatively unaccepting of abortion regardless of their level of education, and that abortion acceptance varies strongly by education for Protestants and "Others," with the poorly educated similar to Catholics and the well educated similar to Jews.
[Figure: line graph; horizontal axis: years of school completed; separate lines for Protestants, Catholics, Jews, and Other/none.]

FIGURE 6.3. Acceptance of Abortion by Education and Religious Denomination, U.S. Adults, 1974 (N = 1,481).

Reexpressing Variables as Deviations from Their Means

Apart from graphing the relations among variables, as in Figure 6.3, we can use one other device to render the coefficients in models containing interaction terms more readily interpretable: we can reexpress the continuous variables as deviations from their means, just as we saw in the earlier example predicting knowledge of Chinese characters. The advantage of doing this here is that the group effects (the main effects of the dummy variables for groups) can then be interpreted as indicating the expected differences among groups with respect to the dependent variable for persons who are at the average with respect to the interval-level independent variable or variables.

In the present context, this means reexpressing years of school completed by subtracting the sample mean from each observation. (The reexpressed coefficients are shown in Table 6.3 as Model 3'.)
The intercept then gives the expected value on the abortion scale among Protestants with average education (where the mean is computed over the entire sample, not just Protestants). The coefficients associated with each of the dummy variables then give the difference in the expected level of pro-choice sentiment between Protestants and the specified category among persons with average education. Note that the slope associated with years of schooling (which, as noted, gives the effect of schooling for Protestants) is unchanged, as are the coefficients associated with the interaction terms; only the group intercepts change. However, the interpretation of the coefficients is greatly facilitated: we see that among those with average education, Protestants endorse about 4 of the 6 abortion items on average, Jews an additional 1.5 items, "Others" an additional .7 items, and Catholics about .4 items fewer. We also see that each additional year of schooling increases the expected endorsements of Protestants by .155, and of those without religion by about the same amount because the difference in the slopes is only .014; by contrast, education matters little for Jews and Catholics because the deviations from the Protestant slope are negative and almost as large as the Protestant slope.
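How centering changes the intercept and the dummy coefficients, while leaving the slopes alone, can be shown with a little algebra in code. The Python sketch below starts from the Model 3 coefficients printed in Equations 6.26 through 6.29. One input is an assumption: the text does not state the sample mean of years of schooling, so the value 11.8 is used purely for illustration (it happens to reproduce the rounded Model 3' figures quoted in the text, but it is not a figure from the source).

```python
# Model 3 coefficients as printed in Equations 6.26 through 6.29.
a, b = 2.18, 0.155
c = {"Catholic": 1.06, "Jewish": 3.20, "Other/None": 0.53}
d = {"Catholic": -0.121, "Jewish": -0.140, "Other/None": 0.014}

# ASSUMPTION: the sample mean of years of schooling is not given in the text;
# 11.8 is an illustrative value, not a figure from the source.
MEAN_EDUCATION = 11.8

# Substituting E = (E - mean) + mean and collecting terms:
#   the intercept becomes a + b * mean
#   each dummy coefficient becomes c_j + d_j * mean
#   the slope b and the interaction coefficients d_j are unchanged
centered_intercept = a + b * MEAN_EDUCATION
centered_c = {g: c[g] + d[g] * MEAN_EDUCATION for g in c}

print(f"intercept: {centered_intercept:.2f}")
for group, value in centered_c.items():
    print(f"{group}: {value:.2f}")
```

The two parameterizations describe the same fitted lines; centering only moves the point at which the group differences are evaluated from zero years of schooling to the sample mean.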
Testing Additional Hypotheses: Constraining Coefficients to Zero or to Equality

Inspecting Figure 6.3, we might be led to infer that education has no effect on abortion attitudes for Catholics and Jews and the same effect for Protestants and "Others." How can we formally test the correctness of our inference? We can do this by estimating an equation of the form:
$$
\hat{A} = a + \sum_{i=1,3,4} b_i R_i + c(ER_1 + ER_4)
\tag{6.30}
$$
where, in this case, Catholics are the omitted category. To see how this equation represents the particular hypothesis of interest, we can again write out Equation 6.30 separately for each religious group.

For Protestants:

$$
\hat{A} = (a + b_1) + c(E)
\tag{6.31}
$$

For Catholics:

$$
\hat{A} = a
\tag{6.32}
$$

For Jews:

$$
\hat{A} = (a + b_3)
\tag{6.33}
$$

For Others:

$$
\hat{A} = (a + b_4) + c(E)
\tag{6.34}
$$
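The constrained specification of Equation 6.30 amounts to a particular choice of regressors, which the following Python sketch spells out. The function name `constrained_row` and the category labels are invented here for illustration; the function only constructs the regressors for one respondent and does not estimate anything.

```python
def constrained_row(religion, education):
    """Regressors for Equation 6.30: [constant, R1, R3, R4, E*(R1 + R4)].

    Catholics are the omitted category, so there is no R2 column, and there
    is no separate main-effect column for education.
    """
    r1 = 1 if religion == "Protestant" else 0
    r3 = 1 if religion == "Jewish" else 0
    r4 = 1 if religion == "Other/None" else 0
    return [1, r1, r3, r4, education * (r1 + r4)]


# With 12 years of schooling, the education term is nonzero only for
# Protestants and Others, who are constrained to share one slope.
for religion in ("Protestant", "Catholic", "Jewish", "Other/None"):
    print(religion, constrained_row(religion, 12))
```

Because the last column is zero for Catholics and Jews, their expected values do not depend on education, and because Protestants and Others share that single column, they share the single slope c.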
is evidentfrom inspectionof Equations6.31tbrough6.34, underthe specification _As of^ Equation 6.30 each religious group differs in the int#ept; the slope retating educa. tion to abortion acceptanceis zero for Catholics ald Jews; and tne stope ls identical frr Protestantsand "Others.,' To tesr whether this constrained specification is an adequare representationof the data, we cannot compute the incremeni in R, for Model 3 relatire to Model 3'becausethe two modelsdo nor standin a hierarchical relationshipto each other: there is no main effect for education in the constrained model. So, what to do? Fortunately,a solution is available
A BAYESIAN ALTERNATIVE FOR COMPARING MODELS

We can exploit an alternative way of contrasting models, the Bayesian Information Criterion (BIC), introduced into the sociological literature by the statistician Adrian Raftery for log-linear analysis (1986) and generalized to a variety of applications in an important article in Sociological Methodology (1995a; see also the critical comment by Gelman and Rubin [1995], the appreciative comment by Hauser [1995], and Raftery's reply to both [1995b], and also the February 1999 issue of Sociological Methods and Research, which is devoted entirely to an assessment of BIC). In a sense BIC operates on the opposite principle from classical tests of significance. It is a likelihood ratio measure that tells us which model is most likely to be true given the data (for a brief introduction to maximum likelihood estimation, see Appendix 12.B); classical inference, by contrast, tells us how likely it is that the observed data could have been generated by sampling error given that some theoretical model (the null hypothesis) is true.

BIC has three important advantages over the F-test introduced previously. First, unlike the F-ratio, BIC can be used to compare nonhierarchical models. Any two models purporting to describe the same phenomenon can be contrasted. Second, BIC builds in a correction for large samples, whereas if the sample is large enough virtually any increment in R² will be significant, no matter how small and substantively unimportant. A larger increment in R² is required to generate a particular BIC value for large samples than would be required for small samples. Thus, BIC reflects the conventional advice to choose a smaller probability value when the sample is large. Third, BIC penalizes large models. That is, if it takes the introduction of many additional variables to generate much of an increase in R², BIC is more likely than the F-test to lead us to prefer the simpler model.

There are several specific ways to calculate BIC, depending on the particular statistic being analyzed. To compare regression models, we can use Raftery's Equation 26:
$$BIC_k = N[\ln(1 - R_k^2)] + p_k[\ln(N)] \tag{6.35}$$
where R_k² is the value of R² for Model k, p_k is the number of independent variables for Model k, and N = the number of cases being analyzed. A negative value of BIC indicates that a specified model is more likely to be true than the baseline model of no association between the independent variables and the dependent variable. To compare two models, we estimate BIC for each of them and choose the model with the more negative BIC. Raftery (1995a, Table 6) gives a rule of thumb for comparing BICs: a BIC difference of 0 to 2 constitutes "weak" evidence for the superiority of one model over another; a difference of 2 to 6 constitutes "positive" evidence; a difference of 6 to 10 constitutes "strong" evidence; and a difference of >10 constitutes "very strong" evidence. However, because BIC tends to increase as the sample size increases, Raftery's rule of thumb is most appropriate for relatively small samples.

To see how BIC is used, let us compute BIC values for the three models shown in Table 6.3. For Model 1, we have
ALTERNATIVE WAYS TO ESTIMATE BIC

Even for a given statistic, there are alternative versions of BIC. I prefer Raftery's formulas because they build in a comparison to a baseline model. Thus, I have written a small -do- file, -bicreg.do-, to calculate BIC following Raftery:

    * BICREG.DO (updated version for Stata)
    * Compute BIC from saved results from regression.
    * Drop any earlier copy of bic; -capture- avoids an error if none exists.
    capture drop bic
    gen bic = e(N)*ln(1 - e(r2)) + e(df_m)*ln(e(N))
    * Note: BIC is the same for all observations.
    * List BIC for any observation.
    list bic in 1

Thus, -bicreg- can be invoked immediately following your regression command. However, Stata 10.0 now offers BIC as a post-estimation statistic. To have Stata calculate BIC without using my -do- file, invoke the command -estat ic- immediately following your regression command. The numerical value of each BIC will differ from mine, but the difference between the BICs for alternative models will be identical regardless of which version of BIC is calculated.
$$BIC_1 = 1{,}481 \times \ln(1 - .053) + 1 \times \ln(1{,}481) = -73.4 \tag{6.36}$$

For Model 2, we have

$$BIC_2 = 1{,}481 \times \ln(1 - .089) + 4 \times \ln(1{,}481) = -108.4 \tag{6.37}$$

For Model 3, we have

$$BIC_3 = 1{,}481 \times \ln(1 - .097) + 7 \times \ln(1{,}481) = -100.3 \tag{6.38}$$
From a comparison of the BICs for the three models, we are led to conclude that the data are most consistent with Model 2, which posits the same effect of education on abortion attitudes for all religious groups and an across-the-board difference (that is, a difference that holds at each level of education) in abortion acceptance for the various religious groups. From the size of the BIC differences, we conclude that the data "very strongly" favor Model 2 over Model 1 and "strongly" favor Model 2 over Model 3. Note that these results are inconsistent with the results obtained previously through a comparison of R²s via an F-test. What are we to make of this? Is there no definitive
answer? My advice is, first, go with theory. If you have a theoretical reason to prefer one model over the other, choose that one. This advice is consistent with one of Weakliem's (1999) criticisms of BIC, that BIC assumes a "unit prior." "BIC is an approximation of the Bayes factor, which involves a comparison of the posterior likelihood of models, where the posterior likelihood is simply the product of the data likelihood and the researcher's prior. The researcher then chooses that model with the greatest likelihood; that is, the model that has the highest probability of being the true model given the researcher's priors and the data" (Winship 1999a, 356). If there is no clear reason to expect a departure from the null hypothesis, a "unit prior" (which amounts to saying we have little information about the likely outcome) is appropriate. But if we have strong theoretical reasons to expect a relationship, BIC can be too conservative. In this case, classical inference would seem to be the preferred tool unless we were to modify BIC in ways that go beyond this course. We will discuss likelihoods in Chapters Twelve and Thirteen.

Absent a strong theory, go for parsimony, which is what BIC generally does. In the present case, I would be inclined to prefer Model 3 because I think there are good reasons to expect Catholics and Jews to have consistent reactions to abortion regardless of their level of education (Catholics because abortion is prohibited by the Church and Jews because, still in 1974 even if less so today, the Jewish community was socially liberal, and Jews lacking education tended to be immigrants who had the values of educated persons) and to expect Protestants and Others to be more accepting if they are better educated (because of the increasing sophistication that education brings). But if I did not have a strong, coherent explanation for the religious difference, I would then prefer Model 2.
We can, of course, also compute BIC for the constrained model derived from the data:

$$BIC_{3'} = 1{,}481 \times \ln(1 - .096) + 4 \times \ln(1{,}481) = -121.0$$

which is more negative than the BIC for any of Models 1 through 3 and thus "very strongly" suggests that, for these data, the constrained model is to be preferred.
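The BIC comparison above can be reproduced with a few lines of arithmetic. The following is a minimal sketch in Python rather than Stata, assuming the rounded R² values and the numbers of independent variables quoted in the text (N = 1,481); the function names `bic` and `evidence` are illustrative, not part of any library. Because the inputs are rounded, the results differ from the published values in the first decimal place.

```python
import math

def bic(r2, p, n):
    """BIC relative to the baseline model of no association (Equation 6.35)."""
    return n * math.log(1.0 - r2) + p * math.log(n)

def evidence(diff):
    """Raftery's rule-of-thumb label for the size of a BIC difference."""
    d = abs(diff)
    if d <= 2:
        return "weak"
    if d <= 6:
        return "positive"
    if d <= 10:
        return "strong"
    return "very strong"

N = 1481
# (rounded R-squared, number of independent variables), as quoted in the text
models = {
    "Model 1": (0.053, 1),
    "Model 2": (0.089, 4),
    "Model 3": (0.097, 7),
    "Constrained": (0.096, 4),
}

bics = {name: bic(r2, p, N) for name, (r2, p) in models.items()}
best = min(bics, key=bics.get)  # the most negative BIC is preferred

for name, value in bics.items():
    print(f"{name}: BIC = {value:.1f}")
print("Preferred:", best)
print("Model 2 vs. Model 3:", evidence(bics["Model 2"] - bics["Model 3"]))
```

With these inputs the constrained model has the most negative BIC, and the Model 2 versus Model 3 difference falls in the "strong" band, in line with the conclusions drawn in the text.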
INDEPENDENT VALIDATION

Note that I said that for these data the constrained model is to be preferred. This is because we arrived at a new preferred model based on our inspection of the data rather than from a priori theory. Thus, we are vulnerable to the possibility that we are simply capitalizing on sampling error. To arrive at a definitive preference for the constrained model, we need to show that it is the preferred model in an independent data set. If our sample size permitted, we would want to carry out all of our exploratory analysis using half of the data and then to reestimate our final model (and its competitors) using the other half of the data. The GSS provides a close approximation to this ideal because it repeats identical questions in successive surveys conducted using the same sampling procedures. Thus, it is reasonable to treat adjacent surveys as independent samples drawn from the same population, at least for phenomena not subject to short-term fluctuation. The implication of this is that we can carry out all of our exploratory analysis for one year and then use the data from the previous or subsequent year to validate our conclusions.
TABLE 6.4. Goodness-of-Fit Statistics for Alternative Models of the Relationship Among Religion, Education, and Acceptance of Abortion, U.S. Adults, 1973 (N = 1,499).

Model                       BIC           R²
Model 1                     [illegible]   [illegible]
Model 2                     -197.7        .1405
Model 3                     -191.1        [illegible]
Constrained                 [illegible]   [illegible]

Contrasts                   BIC           F             d.f.
Model 3 vs. Model 1         -41.2         14.52         6; 1,491
Model 3 vs. Model 2         [illegible]
Constrained vs. Model 2     [illegible]
Here we can exploit the GSS in just this way, reestimating the four models of pro-choice attitudes using data from the 1973 GSS. Insofar as we can assume that abortion attitudes did not change in the population between 1973 and 1974, reestimating the models using the data from 1973 constitutes an independent test of the claim that the "constrained" model is the preferred model. Table 6.4 shows BIC and R² values based on the 1973 data for all four models and contrasts between models wherever meaningful and appropriate. The outcomes are, in fact, just the same as for 1974: Model 3 is preferred to Model 1 and Model 2 by the criteria of classical statistical inference, whereas by the BIC criterion Model 2 is preferred to Model 3; and by the BIC criterion the constrained model is the most preferred. Thus, we can conclude that our preference for the constrained model, derived from inspection of the data, is valid.
WHAT THIS CHAPTER HAS SHOWN

In this chapter you have learned how to carry out multiple regression and correlation analysis and how to interpret the resulting coefficients, considering a worked example on the determinants of literacy in China. We then focused on the manipulation of dummy variables (sets of dichotomous variables that represent categorical variables), including especially interactions between dummy variables and other variables, using as a worked
CHAPTER 7

MULTIPLE REGRESSION TRICKS: TECHNIQUES FOR HANDLING SPECIAL ANALYTIC PROBLEMS

WHAT THIS CHAPTER IS ABOUT

This chapter presents various "tricks" for dealing in a multiple regression framework with specific analytic problems faced by social researchers. The Stata -do- and -log- files for all the worked examples in this chapter are available as downloadable files. Specifically, we consider nonlinear transformations of both dependent and independent variables; ways to test the equality of coefficients within an equation; how to assess the assumption of linearity in a relationship, with a trend analysis as a worked example; how to construct and interpret linear splines as a way of representing abrupt changes in slopes; alternative ways of expressing dummy variable coefficients; and a procedure for decomposing the difference between two means.
NONLINEAR TRANSFORMATIONS

Often when doing regression analysis, we have reason to suspect that the relationship between particular independent variables and the dependent variable is nonlinear. Hence, an estimate of the linear relationship between the independent and dependent variables would not properly represent the relationship in the sample under study. You have seen an example of this kind in (c) of Figure 5.4 in Chapter Five, which shows a perfect parabolic relationship between two variables but which produces a slope and correlation of zero when estimated by a linear regression equation. Fortunately, there is a simple solution to problems of this kind: you can transform one or more variables so that the dependent variable is a linear function of the independent variables. Here are several examples together with some interpretive tricks.
Curvilinear Relationships: Age and Income

In cross-sectional data income commonly increases with age up to a point in the middle of the career and then begins to fall. A reasonable way to represent this is to estimate an equation of the form

$$\hat{Y} = a + b(A) + c(A^2) \tag{7.1}$$
where Y = annual income, A = age, and A² = A*A. In the 2004 General Social Survey (GSS), the estimated values for this equation are (for people age 20 to 64 with information on personal income; N = 1,573; the open-ended upper interval, $110,000 per year or more, was recoded to $150,000; the remaining income intervals were recoded to their midpoints):

$$\hat{Y} = -49{,}139 + 3{,}777(A) - 35.95(A^2); \quad R^2 = .084 \tag{7.2}$$

which can be represented graphically, as shown in Figure 7.1.
WHY THE RELATIONSHIP BETWEEN INCOME AND AGE IS CURVILINEAR

There are several possible explanations for the curvilinearity of the relationship between income and age, of which the two major ones are the following:

• Economists argue that productivity increases with age up to a point and then falls; sociologists sometimes make similar arguments but also point out that various institutional factors, such as the greater difficulty older workers have in returning to work after layoffs, result in the same observed pattern.

• The cross-sectional observation may simply be an artifact of a cohort progression of earnings, with successive cohorts earning more at any given age than their seniors, and the earnings of all workers continuing to rise throughout the career.
FIGURE 7.1. The Relationship Between 2003 Income and Age, U.S. Adults Twenty to Sixty-Four in 2004 (N = 1,573).
The graph shows that income rises with age to a peak of about $50,000 per year and then declines. Without the graph, however, this is not evident because the coefficients in Equation 7.2 are difficult to interpret directly. It is possible, however, to rewrite the equation in a form that permits a more direct interpretation. It can be shown that in the equation

$$\hat{Y} = m + c(F - A)^2 \tag{7.3}$$
where m = a - b²/4c and F = -b/2c (with the coefficients on the right side taken from Equation 7.1), m is the maximum income, and F is the age at which the maximum income is attained. In the present case, the numerical estimates are

$$\hat{Y} = 50{,}066 - 35.95(52.53 - A)^2; \quad R^2 = .084 \tag{7.4}$$
Equations 7.2 and 7.4, of course, yield the same graph because they are equivalent expressions. But Equation 7.4 also tells us precisely what the maximum income is ($50,066) and that this peak is attained between fifty-two and fifty-three years of age (precisely, 52.53).

An equivalent transformation is possible for equations containing additional independent variables. Consider an equation of the form
$$\hat{Y} = a + b(A) + c(A^2) + d(Z) \tag{7.5}$$
where Z is some other independent variable, and the remaining variables are as before. We could then represent the relation between A and Y, net of Z, by substituting the mean of Z, Z̄, so that

$$\hat{Y} = (a + d\bar{Z}) + b(A) + c(A^2) \tag{7.6}$$

or, equivalently,

$$\hat{Y} = m + c(F - A)^2 \tag{7.7}$$

where, in this case,

$$m = (a + d\bar{Z}) - b^2/4c \tag{7.8}$$

and F is as before.
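The conversion from the quadratic form to the "peak" form is simple arithmetic. The following is a minimal Python sketch, assuming the coefficients of Equation 7.2 with the intercept read as -49,139 (the value implied by Equation 7.4); the function name `vertex_form` and its optional covariate arguments are illustrative.

```python
def vertex_form(a, b, c, d=0.0, z_mean=0.0):
    """Rewrite Y = a + b*A + c*A**2 (+ d*Z, with Z held at z_mean)
    as Y = m + c*(F - A)**2, returning (m, F)."""
    f = -b / (2.0 * c)                         # age at which the curve peaks
    m = (a + d * z_mean) - b ** 2 / (4.0 * c)  # expected value at the peak
    return m, f

# Coefficients as read from Equation 7.2 (intercept assumed to be -49,139)
a, b, c = -49139.0, 3777.0, -35.95
m, f = vertex_form(a, b, c)
print(round(m), round(f, 2))

# The two forms give identical predictions at any age
for age in (20, 40, 64):
    quadratic = a + b * age + c * age ** 2
    peak_form = m + c * (f - age) ** 2
    assert abs(quadratic - peak_form) < 1e-6
```

Passing nonzero `d` and `z_mean` gives the covariate-adjusted maximum of Equation 7.8; the age at the peak, F, does not depend on Z.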
Semilog Transformations: Income

A useful transformation when predicting income is the semilog transformation; that is, instead of predicting income, we predict the natural log of income. This has three advantages.

First, economic theories about what generates income tend to make predictions in terms of log income. Specifically, human capital theory takes income as determined by an investment process (Mincer and Polachek 1974). Hence, insofar as we take such theories seriously or are interested in testing them seriously, we probably should predict income in its log form.

Second, income tends to be distributed lognormally in the United States and other advanced industrial societies, so the log of income is distributed normally, a convenient property.

Third, and most important, when the dependent variable is in (natural) log form, the metric regression coefficients can be interpreted as indicating approximately the proportional increase in the dependent variable associated with a one-unit increase in the independent variable, for b less than about 0.2. To see this, consider the equation

$$\ln(\hat{Y}) = a + b(X) \tag{7.9}$$

Now consider two individuals who differ by one unit with respect to X: that is, X₂ = X₁ + 1. Then

$$\ln(\hat{Y}_1) = a + b(X_1) \tag{7.10}$$

and

$$\ln(\hat{Y}_2) = a + b(X_2) \tag{7.11}$$

So, subtracting,

$$\ln(\hat{Y}_2) - \ln(\hat{Y}_1) = (a - a) + b(X_2 - X_1) = b \tag{7.12}$$

But we know from the properties of logs that

$$\ln(\hat{Y}_2) - \ln(\hat{Y}_1) = \ln(\hat{Y}_2/\hat{Y}_1) \tag{7.13}$$

So we have

$$\ln(\hat{Y}_2/\hat{Y}_1) = b \tag{7.14}$$

Then, exponentiating both sides (that is, making each an exponent of e), we have

$$\hat{Y}_2/\hat{Y}_1 = e^b \tag{7.15}$$

Now let us look at the relationship of b to e^b for various values of b.
b        e^b        b         e^b
0.01     1.01       -0.01     0.99
0.05     1.05       -0.05     0.95
0.10     1.11       -0.10     0.90
0.15     1.16       -0.15     0.86
0.20     1.22       -0.20     0.82
0.30     1.35       -0.30     0.74
0.40     1.49       -0.40     0.67
0.50     1.65       -0.50     0.61
We see that for |b| less than about 0.2, b is a good approximation to the expected proportional increase in Y for a one-unit increase in X. For larger values of b, b underestimates the proportional increase in Y.

To see how to interpret such results, consider the effect of education and hours worked on ln(income), by sex, using the 2004 GSS. We estimate a model of the form

$$\ln(\hat{I}) = a + b(E) + c(H) + d(M) \tag{7.16}$$
where I = income in 2003, E = years of school completed, H = hours worked per week, and M = 1 for males and = 0 for females. (Note that although the present analysis is restricted to people with incomes, it is common to add a small constant, say 1, to the value of the dependent variable to ensure that zero values are not dropped; such transformed variables are known as "started logs" [Tukey 1977]. See the discussion of tobit analysis in Chapter 14 for an alternative way of dealing with zero values.) The estimated equation, based on 1,459 cases with complete data, is

$$\ln(\hat{I}) = 7.41 + .125(E) + .0207(H) + .335(M); \quad R^2 = .257 \tag{7.17}$$
This equation tells us that each year of schooling would be expected to increase income by about 12 percent, within gender categories, among those working an equal number of hours per week. Correspondingly, each additional hour worked per week would be expected to increase income by 2.1 percent, within gender categories, among those with equal education. Finally, among those with the same education working an equal number of hours, men would be expected to earn about 40 percent more than women. Here the coefficient understates the male advantage because e^.335 = 1.398. This reminds us that the b may only be directly interpreted as indicating the expected percentage increase for |b| < 0.2. For larger bs, we should actually calculate the exponent.

Negative coefficients have the same interpretation. For example, a coefficient of -0.05 indicates that a one-unit increase in the independent variable would be expected to yield a 5 percent decrease in the dependent variable; that is, the expected value of the dependent variable would be 95 percent as large. Also, for b < -0.2, the percent loss will be smaller than implied by the coefficient. So, again, we should compute the exponentiated value.

Note that the equation expresses a linear relationship between the independent variables and the natural log of income, not income itself. This is evident from inspection of a graph of the relationship between education and ln(income), evaluated separately for males and females at the mean number of hours worked per week by all workers, males and females combined (42.67). The relationship is, of course, linear and the expected values for the two sexes differ only by a constant, as shown in Figure 7.2.

However, when we graph the expected relationship between income and education, the relationship is curvilinear and the lines are no longer parallel (Figure 7.3).
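The gap between a semilog coefficient and the exact proportional change is easy to check numerically. The following minimal Python sketch applies the relation Y₂/Y₁ = e^b to the three coefficients of Equation 7.17; the function name `pct_change` and the dictionary labels are illustrative.

```python
import math

def pct_change(b):
    """Exact expected percentage change in Y for a one-unit increase in X
    when the dependent variable is ln(Y): 100 * (e**b - 1)."""
    return 100.0 * (math.exp(b) - 1.0)

# Coefficients from Equation 7.17
coefs = {"education (years)": 0.125, "hours per week": 0.0207, "male": 0.335}

for name, b in coefs.items():
    approx = 100.0 * b  # reading the coefficient directly as a percentage
    print(f"{name}: approximate {approx:.1f}%, exact {pct_change(b):.1f}%")
```

For the small coefficient on hours the two readings nearly coincide, whereas for the male coefficient the direct reading (33.5 percent) noticeably understates the exact figure (about 40 percent).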
FIGURE 7.2. Expected ln(Income) by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined = 42.7 (N = 1,459). [Horizontal axis: years of schooling.]
FIGURE 7.3. Expected Income by Years of School Completed, U.S. Males and Females, 2004, with Hours Worked per Week Fixed at the Mean for Both Sexes Combined (42.7). [Horizontal axis: years of schooling.]
Suppose we have an equation involving a logged dependent variable and a squared term for one of the independent variables. How can we interpret the coefficients?

$$\ln(\hat{Y}) = a + b(X) + c(X - k)^2 \tag{7.18}$$

One way to interpret this equation is to find the first derivative of ln(Y) with respect to X and to evaluate it for appropriate values of X:

$$\frac{d\ln(\hat{Y})}{dX} = b + 2cX - 2ck \tag{7.19}$$

A caution applies when a variable and its square are included in the same equation, because such variables tend to be highly correlated. To reduce the correlation between a variable and its square, analysts sometimes subtract a constant before squaring. It can be shown that X and (X - b/2)², where b is the slope of the regression of X² on X, tend to be orthogonal (see Treiman and Roos 1983).
t
;.",:ffi.ffi i;."t:,;:fl:".#:'.?T":1:,,1TXHTiilTi';? ilffi:,# tf becausethe funclio
j..l;i*;lrTtil#uitil:,, lrrr ;ffi flT:J}::1il *";:Ti,5"tr*T# t[,'y,;,Jl *, ''J!:{:ii;:r#"*::nl:#;ff T.T:,;.".",, ^ttowever.
i!t-
-
."\.)t
-
u - f o \ x t ) +. ( y .
_ t \2
(7.20t
ft rq
and l ntv \ "'\r2J
u f
6iE;="+b(Xl
o(x.l
+ ((x-
_ L\2
(7.21t
+ 1)+ c((Xt+ 1)_ 1r1z
r&
,--:::-. tnlY2l- ln(Yt)= b + 2cXt_ 2ck + c But because
rr,ry],_ inr}r = ntef,t
(1.:1,,
sidesof Equalton 7.23. we have ,2/it
=
e@+2cxr2ck+c)
Thus.Equation7.25 . sives rl increasein X. evaju;,#"i;;; ;f^:xpecred
{ -5,
in".ease
in wesetx equal giu.tth.p'opo*ion;r";;;#1"t7:l 1 so'ir-proponiona.l ro irs I /for twoindiv'o*""i?l'ilH*Tit uoo1",,l. mean onva-riable x.
f,-[
ff
( 7 ) 1t
Then,subtractingEquadon 7.20from E qtation7.22, we have
lr wc exponentiate borh
fr mr
.f for a one-w
jffi #ffi[#tr#: THgri=f"EiEE{s ";
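The closed-form ratio in Equation 7.25 can be checked against a direct comparison of two predicted values. The following minimal Python sketch does so; the coefficient values are hypothetical and chosen only for illustration, and the function names are not from any library.

```python
import math

def log_y(x, a, b, c, k):
    """Predicted ln(Y) from Equation 7.18."""
    return a + b * x + c * (x - k) ** 2

def ratio_one_unit(x1, b, c, k):
    """Expected Y2/Y1 for a one-unit increase in X from x1 (Equation 7.25)."""
    return math.exp(b + 2 * c * x1 - 2 * c * k + c)

# Hypothetical coefficients, for illustration only
a, b, c, k = 1.0, 0.08, -0.002, 40.0
x1 = 30.0

direct = math.exp(log_y(x1 + 1, a, b, c, k) - log_y(x1, a, b, c, k))
closed_form = ratio_one_unit(x1, b, c, k)
print(round(direct, 4), round(closed_form, 4))
```

Both routes give the same ratio, and the intercept drops out of the comparison, as the derivation implies.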
Mobility Effects

Suppose we want to test the Durkheimian hypothesis that extreme social mobility, either upward or downward, leads to anomie. If we are willing to consider the effect of upward and downward mobility as symmetrical, we might estimate an equation of the form

$$\hat{A} = a + b(P_F) + c(P) + d(P - P_F)^2 \tag{7.26}$$

where A = the score on an anomie scale, P_F = the prestige of the respondent's father's occupation, and P = the prestige of the respondent's occupation. (Note that this specification of the hypothesis assumes that it applies to intergenerational mobility, that occupational mobility is a good indicator of social mobility and prestige a good measure of occupational status, and that extreme mobility should be most heavily weighted, by squaring the difference. In a substantive analysis, all of these assumptions need to be justified explicitly, not merely presented without justification.) A significantly positive coefficient d indicates that anomie increases as the discrepancy between respondent's and father's occupational prestige increases, controlling for the level of both respondent's and father's prestige. Thus d indicates the effect of mobility per se, controlling for the effect of status level. It is necessary to control for status level because anomie may be related to origin or destination status entirely apart from any effect of mobility.

Of course, many other transformations of variables can be used to represent different social processes. For some examples, see Goldberger (1968, Chapter 8), Treiman (1970), and Stolzenberg (1974, 1975).
TESTING THE EQUALITYOF COEFFICIENTS
Sometimes situations arise in which we want to determine whether two coefficients within the same equation are of equal size. You have already seen an example in the previous chapter, in the discussion of Equation 6.30. Here we consider an additional example. Suppose we are interested in assessing the effect of parental education on respondent's education and, in particular, in deciding whether the mother's or the father's education has a stronger effect. The hypothesis that educational transmission through the mother is stronger than through the father arises from the observation that mothers spend more time with their children than do fathers and hence are putatively more important socializing agents. The alternative hypothesis, that the father's education has a stronger effect, derives from the claim that the father's socioeconomic characteristics largely determine the family's socioeconomic status. Because education involves opportunity costs, it may well be that those whose fathers are poorly educated will be more likely to leave school early to switch from being a drain on the family financial resources to being an economic contributor to the family.

Armed with these two competing hypotheses, we might then estimate the regression of years of school completed on father's and mother's years of school completed. From the 1980 GSS I estimated an equation of the form
$$\hat{E} = a + b(E_F) + c(E_M) \tag{7.27}$$
where E = respondent's years of schooling, E_F = father's years of schooling, and E_M = mother's years of schooling. (I chose the 1980 GSS data to illustrate how to test the significance of an apparent true difference. The 2004 data yield virtually identical coefficients for mother's and father's education. Assessing cross-temporal trends in the relative effects of parental education and the reasons for such trends might yield an interesting paper.) Estimating Equation 7.27, with N = 985, yields

$$\hat{E} = 7.87 + .209(E_F) + .269(E_M); \quad R^2 = .313 \tag{7.28}$$
This result appears to support the claim that mother's education has a somewhat stronger effect on educational attainment than does father's education. It is possible, however, that this result arises simply from sampling variability. How can we find out? The trick is to force the coefficients for mother's and father's education to be equal and then to assess whether the R² for the unconstrained equation (7.28) is significantly larger than the R² for the constrained equation. We constrain the coefficients to equality by estimating an equation of the form:

$$\hat{E} = a + b(E_S) \tag{7.29}$$

where E_S = E_F + E_M. Thus, we have

$$\hat{E} = a + b(E_S) = a + b(E_F + E_M) = a + b(E_F) + b(E_M) \tag{7.30}$$

Note that defining a variable as the sum of the years of schooling of the mother and father is equivalent, with respect to testing the hypothesis, to defining a variable as the mean of the years of schooling of the mother and father. If the mean were specified, the coefficient would simply double in size. In the present case the sum is more readily interpretable because it retains the metric of the separate measures for mother and father. Estimating the equation, we get

$$\hat{E} = 7.93 + .236(E_S); \quad R^2 = .312 \tag{7.31}$$
Next we compare the two models. First, we do an F-test of the equality of the coefficients, which is equivalent to testing the significance of the increment in R². This can be done very easily in Stata by using the -test- command. In fact, we don't even need to construct the constrained model. We simply issue the command -test paeduc=maeduc- after estimating the unconstrained model. This yields an F-value of 1.40, which is not significant (p = .236); note that because we have a two-tailed test (we have hypotheses expecting either mother or father to have greater influence), the conventional significance level required to reject the null hypothesis is .025. An alternative way of comparing the models is, of course, to compare the BICs for the two models. To do this we need to estimate the constrained model. The two BICs, estimated in the usual way, are
Unconstrained: -355.4

Constrained: -360.9
In this case both the BIC and the F-test favor the constrained model of no difference. This general strategy can be applied to a wide variety of substantive problems.
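Both comparisons can be approximated from the summary statistics alone. The following minimal Python sketch computes the F-ratio for the increment in R² and the two BICs, assuming N = 985 and R² values of .313 (unconstrained) and .312 (constrained), which are the rounded values consistent with the BICs reported above. Because the R² difference is tiny, rounding moves the results slightly away from the published F of 1.40 and BICs of -355.4 and -360.9. The function names are illustrative.

```python
import math

def f_increment(r2_full, r2_restricted, n, k_full, q):
    """F-ratio for the increment in R-squared when q constraints are relaxed;
    k_full is the number of independent variables in the unconstrained model."""
    numerator = (r2_full - r2_restricted) / q
    denominator = (1.0 - r2_full) / (n - k_full - 1)
    return numerator / denominator

def bic(r2, p, n):
    """BIC relative to the baseline model (Equation 6.35)."""
    return n * math.log(1.0 - r2) + p * math.log(n)

n = 985
r2_unconstrained, r2_constrained = 0.313, 0.312  # rounded values, assumed

f_value = f_increment(r2_unconstrained, r2_constrained, n, k_full=2, q=1)
bic_unconstrained = bic(r2_unconstrained, 2, n)
bic_constrained = bic(r2_constrained, 1, n)

print(f"F = {f_value:.2f}")
print(f"BIC unconstrained = {bic_unconstrained:.1f}")
print(f"BIC constrained   = {bic_constrained:.1f}")
```

The constrained model has the more negative BIC, by about 5.5 points, which matches the difference between the two published values.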
TREND ANALYSIS: TESTING THE ASSUMPTION OF LINEARITY

As the GSS has matured, it has become an increasingly valuable resource for the study of cross-temporal trends. Because many questions have been asked in exactly the same way since the first GSS was conducted in 1972, it is possible to pool the data from all years to study a variety of trends. Moreover, if no cross-temporal variations are detected, the data for all years can be treated as a sample of the U.S. population in the late twentieth century to generate sufficient cases to study relatively small subsets of the population.

The simplest trend model (apart from a model of no trend) is that there is a linear trend over time with respect to the outcome of interest. As a first step, it is useful to contrast such a model with a model that posits year-to-year variations in the outcome, what Sorokin many years ago (1927) described as "trendless fluctuations." To do this, we estimate two models:
$$\hat{Y} = a + bT \tag{7.32}$$

$$\hat{Y} = a' + b'T + \sum_i c_i T_i \tag{7.33}$$
where T is a linear representation of time (here, the year of the survey), and the T_i are dummy variables for each year the survey was conducted; note that two dummy variables must be omitted because the linear term uses up one degree of freedom. We then compare the two models in the usual way, via an F-test of the significance of the increment in R² and a comparison of BIC values. A convenient way to do the first in Stata is to estimate Equation 7.33 and then to test the hypothesis that all the coefficients of the dummy variables equal zero, via a Wald test using Stata's -test- command. (Note that Equation 7.33 is simply a different parameterization of an equation in which the linear term is omitted and only the dummy variables are included. The coefficients will, of course, differ. But the predicted values, R², and BIC will be identical.)

If we conclude that no simple linear trend fits the data, we might then posit either a model with a smooth curve by including a squared term for T, or a model that tries to model particular historical events by grouping years into historically meaningful groups and identifying each group (less one) via a dummy variable, or a spline model (see the section "Linear Splines" later in the chapter). Because the variance explained by Equation 7.33 is the maximum possible from any representation of time (measured in years), the R² associated with Equation 7.33 can be taken as a standard against which to assess, in substantive rather than strictly statistical terms, how close various
sociologically motivated constrained models come to fully explaining temporal variation in the dependent variable.

Although, to simplify the exposition, I have not included any variables in the model other than time, a model actually posited by a researcher typically would include a number of covariates (other independent variables) and also, perhaps, interactions between the covariates and the variables representing time. Exactly the same logic would apply to such an analysis as to the simpler analysis just described; the logic is also identical to the dummy variable approach to the assessment of group differences described in the previous chapter (although here the "groups" are years or, if warranted by the analysis, multiyear historical periods).
Predicting Variation in Gender Role Attitudes over Time: A Worked Example

Four items on attitudes regarding gender-role equality were asked in most years of the GSS between 1974 and 1998. The four variables are shown here with the percentage endorsing the pro-equality position, pooled over all years in which all four questions were asked:

• Do you agree or disagree with this statement? Women should take care of running their homes and leave running the country up to men (74 percent disagree).
• Do you approve or disapprove of a married woman earning money in business or industry if she has a husband capable of supporting her? (77 percent approve).
• If your party nominated a woman for President, would you vote for her if she were qualified for the job? (84 percent say yes).
• Tell me if you agree or disagree with this statement: Most men are better suited emotionally for politics than are most women (63 percent disagree).
To form a gender-equality scale, I simply summed the pro-equality responses for the four items, excluding all people to whom the questions were not asked and treating other nonresponses as negative values. The point of treating "don't know" and similar responses as negative values rather than excluding them is to save cases. But this would not be wise if there were not substantive grounds for doing so; in this case, it seemed reasonable to me to treat "don't know" as something other than a clear-cut endorsement of gender equality.
IN SOME YEARS OF THE GSS, ONLY A SUBSET OF RESPONDENTS WAS ASKED CERTAIN QUESTIONS

Users of the GSS need to be aware that, to increase the number of items that can be included in the GSS each year, some items are asked only of subsets of the sample. A convenient way to exclude people who were not asked the questions is to use the Stata -rmiss- option under the -egen- command to count the number of missing data responses and then to exclude people missing data on all items included in a scale. However, in the current analysis I excluded all those who lacked responses on any of the four items because some, but not all, of the questions were asked in some years.
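The scoring rule for the scale can be sketched as follows. This is Python rather than the Stata used for the analysis, and the response codes ("pro", "anti", "dk", and None for a question that was not asked) are hypothetical labels chosen for the illustration, not GSS codes.

```python
def gender_equality_scale(responses):
    """Return the 0-4 count of pro-equality answers, or None if the case is excluded.

    `responses` holds one code per item: "pro", "anti", "dk" (don't know or a
    similar nonresponse), or None when the question was not asked.
    """
    if len(responses) != 4:
        raise ValueError("expected answers to exactly four items")
    if any(r is None for r in responses):
        return None                                   # not asked some item: drop the case
    return sum(1 for r in responses if r == "pro")    # "anti" and "dk" both add 0


print(gender_equality_scale(["pro", "pro", "dk", "anti"]))
print(gender_equality_scale(["pro", None, "pro", "pro"]))
```

Treating "dk" like "anti" keeps the case in the sample, which is the point of the coding decision described above.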
Estimating equations such as Equations 7.32 and 7.33 suggests significant nonlinearities in attitudes regarding gender inequality. The increment in $R^2$ implies F = 3.54 with 11 and 21,448 d.f., which has a probability of less than 0.0001. However, the BIC for the linear trend model is more negative than the BIC for the annual variability model (the BICs are, respectively, -959 and -871), suggesting that a linear trend is more likely given the data. Because BIC and classical inference yield contradictory results, a sensible next step is to graph annual variations in the mean level of support for gender equality, to see whether there is any obvious pattern to the nonlinearity. If substantively sensible deviations from linearity are observed, the annual variation model might be accepted, or a new model, aggregating years into historically meaningful periods, might be posited (keeping in mind the dangers of modifying your hypotheses based upon inspection of the data; see the discussion of this issue at the end of Chapter Six), or a smooth curve or spline function might be fitted to the data.

Figure 7.4 shows both the linear trend line and annual variations in the mean. Inspecting the graph, it appears that deviations from linearity are neither large nor systematic. Given this, I am inclined to accept a linear trend model as the most parsimonious representation of the data, despite the F-test results. The linear trend is, in fact, quite substantial, implying an increase of 0.81 (= .0338*(1998 - 1974)) over the quarter of a century for which we have data; this is about 20 percent of the range of the scale and is about two-thirds of the standard deviation of the scale scores. Apparently, support for gender equality has been increasing modestly but steadily throughout the closing years of the twentieth century.

From a technical point of view, it may be helpful to compare the estimates implied by the two alternative ways of representing departures from linearity: Equation 7.33 and the alternative specification that does not include a linear term for year. When the linear term is included, two dummy variable categories are dropped, rather than one, because the linear term uses up one degree of freedom. However, the two procedures produce identical results, which is evident from inspection of Table 7.1.

Unfortunately, there is no simple correspondence between the coefficients in equations of the form of Equation 7.33 and deviations from the predictions of the linear equation. If you want to show annual departures from linearity, you need to construct a new variable, which is the difference between the predicted values for each year from Equation 7.32 and Equation 7.33. This is easily accomplished in Stata using the -foreach- or -forvalues- command.

FIGURE 7.4. Trend in Attitudes Regarding Gender Equality, U.S. Adults Surveyed in 1974 Through 1998 (Linear Trend and Annual Means; N = 21,464). [Plot not reproduced; x-axis: year of survey; series: linear trend, mean for year.]
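Annual departures from linearity can be computed along the following lines. The sketch is in Python rather than Stata and uses made-up data; it assumes a model with no covariates other than time, in which case the predicted values from the equation with a full set of year terms are simply the annual means. The function names are hypothetical.

```python
from collections import defaultdict


def linear_fit(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def annual_departures(years, scores):
    """Annual mean minus the linear-trend prediction, keyed by survey year."""
    a, b = linear_fit(years, scores)
    by_year = defaultdict(list)
    for year, score in zip(years, scores):
        by_year[year].append(score)
    return {year: sum(vals) / len(vals) - (a + b * year)
            for year, vals in sorted(by_year.items())}


# Made-up data: three survey years, two respondents in each.
years = [1974, 1974, 1975, 1975, 1976, 1976]
scores = [1.0, 1.0, 3.0, 3.0, 2.0, 2.0]
departures = annual_departures(years, scores)
print(departures)
```

In this toy example the middle year lies above the fitted line and the two end years lie below it; weighted by the number of cases per year, the departures sum to zero.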
LINEAR SPLINES

Sometimes we encounter situations in which we believe that the relationship between two variables changes abruptly at some point on the distribution of the independent variable, so that neither a linear nor a curvilinear representation of the relationship is adequate. For example, alcohol consumption may have no impact on health below some threshold, whereas above the threshold health declines in a linear way as alcohol consumption increases. Temporal trends also may abruptly change, as a result of policy changes, cataclysmic events such as depressions, wars, revolutions, and so on. In cases of this kind, it is useful to represent the relationships via a set of connected line segments, known as linear splines.
TABLE 7.1. […] Predicted values from the equation that includes the linear term, computed as $a + b(\text{year}) + c_{\text{year}}$:

Year    Computation                                   Predicted value
1975    -71.68578 + 0.0375872 × 1975 + 0.0403799      2.5893
1977    -71.68578 + 0.0375872 × 1977 - 0.1136418      2.5105
1998    …                                             …

A Worked Example: Trends in Educational Attainment over Time in the United States

Consider changes in the average level of education over time. Figure 7.5 presents a scatter plot relating educational attainment to year of birth, estimated from the GSS. To create this graph, I combined data from all years between 1972 and 2004 but dropped those born prior to 1900 because of the very small sample sizes. I also dropped those less than age twenty-five at the time of the survey because many people do not complete their schooling until their mid-twenties. The graph shows a 5 percent sample of cases, "jittered" in order to make it readable. Inspecting the graph, there appears to be an increase in education over time, but the shape of the increase is difficult to discern: is it linear, or is the trend better represented by some other functional form?

FIGURE 7.5. Years of School Completed by Year of Birth, U.S. Adults (Pooled Samples from the 1972 Through 2004 GSS; N = 39,324; Scatter Plot Shown for 5 Percent Sample). [Plot not reproduced; x-axis: year of birth, 1900 to 1970.]

To discover how the increase is best represented, it helps to plot the mean years of schooling for each birth year. Figure 7.6 shows such a plot, made with the same specifications as the scatter plot. Inspecting the plot, we see that average education increased in a more or less linear way for those born between 1900 and 1947 but then leveled off. Because the plot is a bit erratic, probably because of the relatively small number of cases for each birth year, it might be better to plot a three-year moving average (Figure 7.7). (See the -do- and -log- files.) Inspecting this graph, we come to the same conclusion: there is a fairly abrupt change in the trend, with those born in the first half of the twentieth century (precisely, until 1947) experiencing a fairly steady year-by-year increase in their schooling, but those born in 1947 or later experiencing no change at all. This suggests that the trend in educational attainment is appropriately represented by a linear spline with a knot at 1947, where "knot" refers to the point at which the slope changes.

This specification can be represented by an equation of the form:

$$\hat{E} = a + b_1(B_1) + b_2(B_2) \tag{7.34}$$
where $B_1$ = the year of birth for those born in 1947 or earlier and = 1947 otherwise, and $B_2$ = the year of birth - 1947 for those born after 1947 and = 0 otherwise.

FIGURE 7.6. Mean Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). [Plot not reproduced; x-axis: year of birth.]

FIGURE 7.7. Three-Year Moving Average of Years of Schooling by Year of Birth, U.S. Adults (Same Data as for Figure 7.5). [Plot not reproduced; x-axis: year of birth.]

More generally, a spline function relating $Y$ to $X$ with segments $\nu_1, \ldots, \nu_{k+1}$ and knots at $k_1, k_2, \ldots, k_k$ can be represented by

$$\hat{Y} = a + b_1(\nu_1) + b_2(\nu_2) + \cdots + b_{k+1}(\nu_{k+1}) \tag{7.35}$$

where $\nu_1 = \min(X, k_1)$, $\nu_2 = \max(\min(X - k_1, k_2 - k_1), 0)$, ..., $\nu_{k+1} = \max(X - k_k, 0)$ (see Panis [1994]; the entry for Stata's -mkspline- command [StataCorp 2007]; and Greene [2008]). Each slope coefficient is then the slope of the specified line segment. We can see this concretely by going back to our example, Equation 7.34, and evaluating the equation separately for those born in 1947 or earlier and those born after 1947. For those born in 1947 or earlier, we have

$$\hat{E} = a + b_1(B) + b_2(0) = a + b_1(B) \tag{7.36}$$

and for those born later than 1947, we have

$$\hat{E} = a + b_1(1947) + b_2(B - 1947) = (a + 1947b_1) + b_2(B - 1947) \tag{7.37}$$
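The constructed variables of Equation 7.35 can be sketched as follows. This is Python, not the -mkspline- command itself; the function names are hypothetical, and the slopes and intercept in the example are invented for illustration.

```python
def linear_spline_basis(x, knots):
    """Return [v_1, ..., v_(k+1)] for the value x, following Equation 7.35."""
    knots = list(knots)
    if not knots or knots != sorted(knots):
        raise ValueError("knots must be a non-empty ascending sequence")
    basis = [min(x, knots[0])]                        # v_1 = min(X, k_1)
    for lower, upper in zip(knots, knots[1:]):        # interior segments
        basis.append(max(min(x - lower, upper - lower), 0))
    basis.append(max(x - knots[-1], 0))               # v_(k+1) = max(X - k_k, 0)
    return basis


def predict(intercept, slopes, x, knots):
    """Predicted value a + b_1*v_1 + ... + b_(k+1)*v_(k+1)."""
    basis = linear_spline_basis(x, knots)
    if len(slopes) != len(basis):
        raise ValueError("need one slope per line segment")
    return intercept + sum(b * v for b, v in zip(slopes, basis))


# One knot at 1947, as in Equation 7.34.
print(linear_spline_basis(1940, [1947]))   # B_1 and B_2 for someone born in 1940
print(linear_spline_basis(1950, [1947]))   # B_1 and B_2 for someone born in 1950

# Invented coefficients: the year-to-year step after the knot equals the second slope.
a, b1, b2 = 0.0, 0.1, 0.01
step_after_knot = predict(a, [b1, b2], 1949, [1947]) - predict(a, [b1, b2], 1948, [1947])
```

Because the constructed variables always sum to X, each coefficient can be read directly as the slope of its own segment, which is what Equations 7.36 and 7.37 show for the one-knot case.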
Notice that the intercept in Equation 7.37 is just the expected level of education for those born in 1947 and that $b_2$ gives the slope for those born after 1947. Thus, the expected level of education for those born in 1948 is just the expected level of education for those born in 1947 plus $b_2$; for those born in 1949 it is the expected level of education for those born in 1947 plus $2b_2$; and so on.

Estimating Equation 7.34 from the pooled 1972-2004 GSS data yields the coefficients in Table 7.2. By inspecting the BICs for three models (the spline model, a linear trend model, and a model that allows the expected level of schooling to vary year by year) it is evident that the linear spline model is to be preferred. Note, however, that a comparison of $R^2$s indicates that, by the criterion of classical inference, the model positing year-by-year variation in the level of schooling fits significantly better than the spline model. I am inclined to discount this result because it has no theoretical justification, is clearly inferior by the BIC criterion, and occurs simply as a consequence of the large sample size. Thus, I accept the linear spline model as the preferred model.

The coefficients for the line segments indicate that for people born in 1947 or earlier, there is an expected increase of .086 years of schooling for each successive birth cohort. Thus, people born twelve years apart would be expected to differ on average by about a year of schooling. However, for people born in 1947 or later, there is no trend in educational attainment; the coefficient .0092 implies that it would take about a century for average schooling to increase by one year. This is a somewhat surprising result, especially because there have been substantial […] among disadvantaged minorities […] and also among women. However, as Mare (1995) notes, educationally disadvantaged proportions of the population have grown over time relative to the White majority. Disaggregation of the trend would be worthwhile but is not pursued here; it would make an interesting paper. The graph implied by the coefficients for the linear spline model is shown in Figure 7.8, together with a 2 percent random sample of observations from each cohort (reduced from 5 percent to 2 percent to make it easier to see the shape of the spline). In this figure the -jitter- feature in Stata is used to make it clear where in the graph there is the greatest density of points.

ALTERNATIVE SPECIFICATION OF SPLINE FUNCTIONS

An alternative specification represents the slope of each line segment as a deviation from the slope of the previous line segment. In this specification, a different set of new variables is constructed. Suppose there are $k$ knots, that $X$ is the original variable, and that $\nu_1, \ldots, \nu_{k+1}$ are the constructed variables. Then

$\nu_1 = X$
$\nu_2 = X - k_1$ if $X > k_1$; = 0 otherwise
...
$\nu_{k+1} = X - k_k$ if $X > k_k$; = 0 otherwise

To see this concretely, consider the present example, specifying a knot at birth year 1947 in the trend in educational attainment. We would estimate the equation $\hat{E} = a + b_1(\nu_1) + b_2(\nu_2)$, where $\nu_1$ = birth year ($X$), and $\nu_2 = X - 1947$ if $X > 1947$ and = 0 otherwise. Then for those born in 1947 or earlier,

$$\hat{E} = a + b_1(X) + b_2(0) = a + b_1(X)$$

while for those born later than 1947,

$$\hat{E} = a + b_1(X) + b_2(X - 1947)$$

Thus, for those born in 1948, the expected level of education is given by $(a + 1948b_1) + b_2$; for those born in 1949 it is $(a + 1949b_1) + 2b_2$; and so on. From this, it is evident that $b_2$ gives the deviation of the slope from the slope for the previous line segment. For useful discussions of these methods, see Smith (1979) and Gould (1993).

TABLE 7.2. Coefficients for a Linear Spline Model of Trends in Years of School Completed by Year of Birth, U.S. Adults Age 25 and Older, and Comparisons with Other Models (Pooled Data for 1972-2004, N = 39,324).

                                   b        s.e.
Slope (birth years 1900-1947)      …        …
Slope (birth years 1947-1979)      .0092    .0024
Intercept                          …        …

Model Comparisons
                                   R²       BIC
(1) …                              …        …
(2) Linear trend model             .1167    …
(3) …                              …        …

Contrast       Difference in BIC    Difference in R²    F        d.f.         p
(1) vs. (2)    -531                 .0121               545.2    1; 39,321    .0000
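The relation between the two ways of specifying the spline can be checked numerically. The following Python sketch uses the two slopes discussed in the text (.086 and .0092) with a placeholder intercept of zero, which is not an estimate from the data; the function names are hypothetical.

```python
def segment_basis_one_knot(x, knot):
    """Equation 7.34 variables: (min(x, knot), max(x - knot, 0))."""
    return min(x, knot), max(x - knot, 0)


def deviation_basis_one_knot(x, knot):
    """Alternative specification: (x, max(x - knot, 0))."""
    return x, max(x - knot, 0)


def slopes_to_deviations(slopes):
    """First slope, then each slope minus the slope of the preceding segment."""
    return [slopes[0]] + [later - earlier for earlier, later in zip(slopes, slopes[1:])]


KNOT = 1947
segment_slopes = [0.086, 0.0092]          # the two slopes discussed in the text
deviation_coefs = slopes_to_deviations(segment_slopes)

a = 0.0                                   # placeholder intercept, not an estimate
for year in (1930, 1947, 1948, 1960):
    v1, v2 = segment_basis_one_knot(year, KNOT)
    u1, u2 = deviation_basis_one_knot(year, KNOT)
    by_segment = a + segment_slopes[0] * v1 + segment_slopes[1] * v2
    by_deviation = a + deviation_coefs[0] * u1 + deviation_coefs[1] * u2
    assert abs(by_segment - by_deviation) < 1e-9   # same predicted values

# Birth cohorts needed for one additional year of schooling on each segment.
cohorts_before = 1 / segment_slopes[0]
cohorts_after = 1 / segment_slopes[1]
print(round(cohorts_before, 1))   # 11.6, that is, about twelve years
print(round(cohorts_after, 1))    # 108.7, that is, about a century
```

The second coefficient in the alternative specification is the change in slope at the knot (here .0092 - .086), while the predicted values are the same under either specification.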
FIGURE 7.8. Trend in Years of School Completed by Year of Birth, U.S. Adults (Same Data as for Figure 7.5; Scatter Plot Shown for 2 Percent Sample). Predicted Values from a Linear Spline with a Knot at 1947. [Plot not reproduced; x-axis: year of birth, 1900 to 1980.]
A Second Worked Example, with a Discontinuity: Quality of Education in China Before, During, and After the Cultural Revolution

The typical use of spline functions is to estimate equations such as the one just discussed, in which all points are connected but the slope changes at specified points ("knots"). However, there are occasions in which we may want to posit discontinuous functions. The Chinese Cultural Revolution is such a case. It can be argued that the disruption of social order at the beginning of the Cultural Revolution in 1966 was so massive that it is inappropriate to assume any continuity in trends. Deng and Treiman (1997) make just such an argument with respect to trends in educational reproduction. They argue that there was then a gradual "return to normalcy" so that changes resulting from the end of the Cultural Revolution in 1977 were not nearly as sharp and were appropriately represented by a knot in a spline function rather than a break in the trend line.

Here we consider another consequence of the Cultural Revolution, the quality of education received (the example is adapted from Treiman [2007a]). Although primary schools remained open throughout the Cultural Revolution, higher level schools were shut down for varying periods: most secondary schools were closed for two years, from 1966 to 1968, and most universities and other tertiary-level institutions were closed for six years, from 1966 to 1972. Moreover, it was widely reported that even when the schools were open, little conventional instruction was offered; rather, school hours were taken up with political meetings and political indoctrination. Rigorous academic instruction was not fully reinstituted until 1977, after the death of Mao. Under the circumstances, we might well suspect that, quite apart from deficits in the amount of schooling acquired by those who were unfortunate enough to be of school age during the Cultural Revolution period, those cohorts also experienced deficits in the quality of schooling compared to those who obtained an equal amount of schooling before or after the Cultural Revolution.

To test this hypothesis, we can exploit the ten-item character recognition test administered to a national sample of Chinese adults that was also analyzed in Chapter Six (see Table 6.2). As before, I take the number of characters correctly identified as a measure of literacy and hypothesize that, net of years of school completed, people who turned age eleven during the Cultural Revolution would be able to recognize fewer characters than people who turned eleven before or after the Cultural Revolution period. Moreover, following Deng and Treiman (1997), I posit a discontinuity in the scores at the beginning but not at the end of the period. To do this, I estimate an equation of the form:
$$\hat{Y} = a + b_1(B_1) + b_2(B_2) + b_3(B_3) + c_1(D_1) + d(E) \tag{7.38}$$

where $\hat{Y}$ is the predicted number of characters recognized and $E$ = years of school completed; $B_1$ = year of birth (last two digits) if born prior to or in 1955 and = 55 if born subsequent to 1955; $B_2$ = 0 if born prior to 1956, = year of birth - 55 if born between 1956 and 1967, inclusive, and = 67 - 55 if born subsequent to 1967; $B_3$ = 0 if born prior to or in 1967 and = year of birth - 67 for those born after 1967; and $D_1$ = 0 for those born prior to or in 1955 and = 1 for those born after 1955. Note that the difference between this and Equation 7.35 is that I include a dummy variable to distinguish those born after 1955 from those born earlier; this is what permits the line segments to be discontinuous at 1955. If I were to have posited a discontinuity at 1967 as well, the equation would be the mathematical equivalent to estimating three separate equations, for the period before, during, and after the Cultural Revolution, in each case predicting the number of characters recognized from years of schooling and year of birth. The advantage of equations such as Equation 7.38 is that they permit the specification of alternative models within a coherent framework and by so doing permit us to select between models.

Estimating this equation yields the results shown for Model 4 in Tables 7.3 and 7.4. As in the previous example, I contrast my theory-driven specification with other possibilities: that there is a simple linear trend in the data; that there are year-by-year variations; that there are knots at both the beginning and the end of the Cultural Revolution, but no discontinuities; that there are discontinuities at both the beginning and the end of the Cultural Revolution; and, for the three spline functions, that there is a curvilinear relationship between year and knowledge of characters during the Cultural Revolution period.
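The birth-year variables defined for Equation 7.38 can be constructed as in the following sketch. It is written in Python rather than Stata, and the function name is hypothetical; birth years are given as the last two digits, as in the definitions above.

```python
def cultural_revolution_terms(birth_year):
    """Return (B1, B2, B3, D1) for a two-digit birth year, as defined for Equation 7.38."""
    b1 = min(birth_year, 55)                       # trend through 1955
    b2 = min(max(birth_year - 55, 0), 67 - 55)     # trend for 1956-1967
    b3 = max(birth_year - 67, 0)                   # trend after 1967
    d1 = 1 if birth_year > 55 else 0               # break at the start of the period
    return b1, b2, b3, d1


for year in (50, 55, 56, 67, 70):
    print(year, cultural_revolution_terms(year))
```

Moving from birth year 55 to 56 changes both $B_2$ (by one) and $D_1$ (from 0 to 1), so the predicted value shifts by the 1956-1967 slope plus the coefficient on the dummy variable; that extra shift is the discontinuity.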
TABLE 7.3. Goodness-of-Fit Statistics for Models of Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling, with Various Specifications of the Effect of the Cultural Revolution (Those Affected by the Cultural Revolution Are Defined as People Turning Age 11 During the Period 1966 through 1977), Chinese Adults Age 20 to 69 in 1996 (N = 6,086). [Table body not legible.]

TABLE 7.4. Coefficients for Models 4, 5, and 7 Predicting Knowledge of Chinese Characters by Year of Birth, Controlling for Years of Schooling (p-Values in Parentheses).

                                                      Model 4          Model 5          Model 7
Years of schooling                                    .443 (.000)      .443 (.000)      .444 (.000)
Trend, 1955 or earlier (age 11 in 1965 or earlier)    0.001 (.721)     0.001 (.134)     0.001 (.749)
Trend, 1956-1967 (age 11 in 1966-1977)                …                …                -0.041 (.185)
Trend, 1968 or later (age 11 in 1978 or later)        …                …                0.028 (.012)
Discontinuity at 1955                                 -0.557 (.000)    -0.508 (.000)    -0.349 (…)
Discontinuity at 1967                                                  0.241 (.010)
Curvilinear trend, 1956-67                                                              0.0066 (.009)
Intercept                                             …                …                …
R²                                                    …                …                …
s.e.e. (root mean square error)                       1.29             1.29             1.29

[Entries that could not be assigned to cells: 0.043 (.000), 0.032 (.000), 0.041 (.000), 0.016; 0.770, 0.770, 0.771; 0.571, 0.672, 0.672.]
Comparison of the BICs suggests that three models (my hypothesized model; a model that, in addition to a discontinuity at the beginning of the Cultural Revolution, allows the trend during the Cultural Revolution period to be curvilinear; and a model positing discontinuities at both the beginning and the end of the Cultural Revolution) are about equally likely given the data, albeit with weak evidence favoring the single-knot model, and that all three are strongly to be preferred over all other models.
Again, BIC and classical inference yield contradictory results because the two alternative models fit significantly better (at the 0.01 level) than does the originally hypothesized model. Here I am in a bit of a quandary as to which model to prefer. I have already stated a basis for positing a single discontinuity, plus a knot at the end of the Cultural Revolution. However, another analyst might favor a two-discontinuity model, on the ground that the curricular reform in 1977 that restored the primacy of academic subjects was radical enough to posit a discontinuity at the end as well as at the beginning of the Cultural Revolution. A third analyst might argue that a linear specification of trends, especially in times of great social disruption, is too restrictive and that it makes more sense to posit a curvilinear effect of time during the Cultural Revolution period. In Treiman (2007a), I presented the model positing a discontinuity at 1955, a knot at 1967, and a curve between 1955 and 1967 (see Figure 7.4 in that paper). However, the truth is that there is no clear basis for preferring any one of the three, except for the evidence provided by BIC, which suggests that the originally hypothesized model is slightly more likely than the others given the data. Again, my suggestion is, go with theory. If you have a theoretical basis for one specification over the others, that is the one to feature; but, at the same time, you must be honest about the fact that alternative specifications fit nearly equally well. In fact, the optimal approach is to present all three models and invite the reader to choose among them. A warning: if you do this, you probably will have to fight with journal editors, who are always trying to get authors to reduce the length of papers, and perhaps with reviewers, who sometimes seem to want definitive conclusions even when the evidence is ambiguous.
The estimated coefficients for all three models are shown in Table 7.4. In all three models each additional year of schooling results in nearly half a point improvement in the number of characters identified. However, the coefficients associated with trends over time are relatively difficult to interpret. Again, this is an instance in which graphing the relationship helps. Figure 7.9 shows, for each of the three preferred models, the predicted number of characters recognized for people with twelve years of schooling, that is, who have completed high school. Although the three graphs appear to be quite different, they all show a decline of about half a point in the number of characters identified for those who were age eleven during the early years of the Cultural Revolution period, relative to those with the same level of schooling who turned eleven before and after the Cultural Revolution. Thus, despite the difficulty in choosing among alternative specifications, together they strongly suggest that the quality of education declined during the Cultural Revolution. People who acquired their middle school (junior high school) education during the Cultural Revolution, in effect, lost a year of schooling; that is, they displayed knowledge of vocabulary equivalent to those with one year less schooling who were educated before and after the Cultural Revolution.

Still, we should be cautious in our interpretation of Figure 7.9, where the Cultural Revolution effect appears to be quite large because of the way the data are graphed (with the y-axis ranging from 5.3 to 6.7 characters recognized). Indeed, Figure 7.10, in which the y-axis ranges from 0 to 10, suggests a rather different story: a very modest decline in the number of characters recognized. It is quite reasonable to report figures such as Figure 7.9 to make the differences among the models clear, but in such cases the responsible analyst will call the reader's attention to the range of the y-axis to avoid misinterpretation.

FIGURE 7.9. Graphs of Three Models of the Effect of the Cultural Revolution on Vocabulary Knowledge, Holding Constant Education (at Twelve Years), Chinese Adults, 1996 (N = 6,086). [Plots not reproduced; three panels: "Discontinuity at 1955, knot at 1967"; "Discontinuities at 1955 and 1967"; "[…] curve 1955-1967"; x-axis: year of birth.]

FIGURE 7.10. Figure 7.9 Rescaled to Show the Entire Range of the y-Axis. [Plots not reproduced; same three panels as Figure 7.9; x-axis: year of birth.]
EXPRESSING COEFFICIENTS AS DEVIATIONS FROM THE GRAND MEAN (MULTIPLE CLASSIFICATION ANALYSIS)

The conventional way of treating categorical independent variables is the approach presented in the previous chapter: omit one category and interpret the remaining coefficients as deviations from the expected value for the omitted category. Sometimes, especially when we have a large number of categories, it is preferable to express our coefficients as deviations from the mean of the dependent variable. We can do this by transforming the coefficients, making use of the following relationships:

$$a_{ij} = b_{ij} + q_i, \qquad q_i = -\sum_j p_{ij} b_{ij} \tag{7.39}$$
where the $a_{ij}$ are the coefficients for the jth category of the ith predictor, expressed as deviations from the mean of the dependent variable; the $b_{ij}$ are the corresponding coefficients associated with the dummy variables; the $q_i$ are adjustment coefficients that constrain the weighted sum of the coefficients associated with the categories of each predictor to zero; and the $p_{ij}$ are the proportion of total cases falling in the jth category of the ith predictor (Andrews et al. 1973, 45-47).

To see how these coefficients work, consider the relationship between religious denomination and tolerance. The analysis task includes two elements:

• To assess to what extent and in what way religious groups differ in their tolerance of the antireligious
• To assess to what extent the observed differences between religious groups can be attributed to the fact that they differ with respect to education and Southern residence, because these variables are known to affect tolerance (with the more educated and non-Southern residents more tolerant than others)
I start by estimating two regression equations in the usual way, one with only the dummy variables for religious groups and one also including education and Southern residence, using pooled data from the 2000, 2002, and 2004 GSS; I pool three years of data to increase the sample size because some religious categories are quite small, and the tolerance questions were asked of only a subset of respondents in each year. The results are shown in the left-hand panel of Table 7.5. I then reexpress these coefficients as deviations from the mean of the dependent variable using Equation 7.39. The rightmost panel of the table shows the reexpressed coefficients.

Ordinarily you would not present both sets of coefficients, but would choose one form or the other: either a dummy variable representation or a multiple-classification representation. I present them both together here so that you can see the relationships between the coefficients.
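The transformation in Equation 7.39 amounts to a few lines of arithmetic. The sketch below is in Python, with an invented three-category predictor whose coefficients and proportions are hypothetical; they are not values from the tolerance analysis.

```python
def mca_coefficients(dummy_coefs, proportions):
    """Re-express dummy coefficients as deviations from the grand mean (Equation 7.39).

    dummy_coefs: coefficient for each category, with 0.0 for the omitted category.
    proportions: share of all cases falling in each category (must sum to 1).
    """
    if set(dummy_coefs) != set(proportions):
        raise ValueError("coefficients and proportions must cover the same categories")
    if abs(sum(proportions.values()) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    q = -sum(proportions[c] * b for c, b in dummy_coefs.items())   # adjustment term
    return {c: b + q for c, b in dummy_coefs.items()}


# Hypothetical predictor with three categories; "A" is the omitted category.
b = {"A": 0.0, "B": 0.4, "C": 0.6}
p = {"A": 0.5, "B": 0.3, "C": 0.2}
a = mca_coefficients(b, p)
print({category: round(value, 3) for category, value in a.items()})

weighted_sum = sum(p[c] * a[c] for c in a)   # zero, up to rounding error
```

The reexpressed coefficients differ from the dummy coefficients only by the constant $q_i$, so differences between categories are unchanged, and their proportion-weighted sum is zero.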
TABLE 7.5. Coefficients of Models of Tolerance of Atheists, U.S. Adults, 2000 to 2004 (N = 4,299).

                 Dummy Variable Coefficients              MCA Coefficients
                 (Deviations from Omitted Category)       (Deviations from Grand Mean)
                 Model 1        Model 2                   Model 1        Model 2
Baptists         0.000          0.000                     -0.422         -0.308
Methodists       0.395          0.298                     -0.027         -0.010
…                …              …                         …              …
R²               .061           .136                      .061           .136

Note: Since p-values are not readily computed for the MCA coefficients, and are not particularly meaningful for the dummy variable coefficients because they indicate the significance of the difference from the omitted category, they are not shown here.
Note that the difference between coefficients is the same in both versions. In Model 1, for example, the difference between the tolerance scores of Methodists and Baptists is 0.395: 0.395 - 0 = -0.027 - (-0.422). Similarly, in Model 2 the difference is 0.298: 0.298 - 0 = -0.010 - (-0.308).

What do the reexpressed coefficients tell us? I think they are easier to interpret. First consider Model 1. From this model we see that Baptists are considerably less tolerant than the average respondent whereas Jews and those without religion are considerably more tolerant than average, and Lutherans and "Others" are somewhat more tolerant than average. However, these differences, especially the high tolerance of Jews, are somewhat explained by religious differences in education and region of residence because, in general, the deviations from the mean decline when these two factors are controlled.

Southern residents are somewhat below average in tolerance, net of religious affiliation and education, whereas non-Southern residents are slightly above average in tolerance. Non-Southern residents will necessarily be closer to the overall mean because there are substantially more of them, and the weighted average of the coefficients (with weights corresponding to the proportion of the sample in the category) must sum to zero.

The coefficients for the religious groups and for Southerners versus non-Southerners are sometimes referred to as "adjusted group differences," where the adjustment refers to the fact that other variables in the model are controlled.

The slope coefficient for education does not change, but the scaling of the educational variable does. In the reexpressed ("MCA") representation, education is expressed as a deviation from its mean, in this case 13.4. Finally, the intercept in the reexpressed representation is just the mean of the dependent variable, the level of tolerance.
OTHER WAYS OF REPRESENTING DUMMY VARIABLES

Three other ways of representing the effects of categorical variables sometimes can facilitate interpretation. Two of them, effect coding and contrast coding, require representing the categories of the classification in a different way from conventional dummy variable coding (see Cohen and Cohen [1975, 172-210], Hardy [1993, 64-75], and Fox [1997, 206-11]). A third, which I label sequential effects, involves manipulating the output. None of these alternative ways of expressing the effects of a categorical variable alters the contribution of the categorical variable to the explained variance; that is, the R² is unaffected. All they do is reparameterize the effects, and so the only reason for using any of them is to make interpretation of the relationships in the data clearer.

To see how these alternatives work, let us consider a new problem: the effect of occupation and education on knowledge of vocabulary in the United States, using data from the 2004 GSS. The GSS includes a ten-item vocabulary test, a detailed classification of current occupation, and a measure of years of school completed. For the purpose of this example, I have collapsed the detailed occupational classification into four categories: upper nonmanual (managers and professionals), lower nonmanual (technicians, sales occupations, and administrative support occupations), upper manual (precision production, craft, and repair occupations), and lower manual (all other categories: service occupations; agricultural occupations; and operators, fabricators, and laborers). I expect that, net of current occupation, vocabulary scores will increase with years of schooling.
More interestingly, I expect vocabulary scores to increase with occupational status; that is, that lower manual workers, upper manual workers, lower nonmanual workers, and upper nonmanual workers will have increasingly high vocabulary scores, on average, net of years of schooling. The argument is that symbolic manipulation plays an increasingly large role in work as one moves up the occupational hierarchy, so that verbal skills are much more strongly reinforced and enhanced in high-status occupations than in low-status occupations. (Of course, in a serious analysis, I would need to consider the possibility that those with better verbal skills, relative to their education, would be more likely to end up in high-status occupations.)

The conventional approach to representing these effects is to estimate an equation of the form
$$V = a + b(E) + \sum_{i=2}^{4} c_i O_i \tag{7.40}$$
where $V$ is the vocabulary score, $E$ is the years of school completed, and the $O_i$ are the occupation categories with, say, $O_1 = 1$ for lower manual workers and $= 0$ otherwise, ..., $O_4 = 1$ for upper nonmanual workers and $= 0$ otherwise. The top panel of Table 7.6 shows the "design matrix" (also known as the "model matrix") for the conventional coding of dummy variables to represent the separate effects of each occupation category; the resulting coefficients are shown in the top panel of Table 7.7. As you can see, there are no surprises here. As expected, vocabulary knowledge increases with education and also increases monotonically with occupational status. However, let us now see how to represent the effects of occupational status in other, mathematically equivalent, ways.
Effect Coding

One possibility is to highlight the effect of each occupation category by contrasting it with the unweighted average of the effects of the other categories. If we include a single categorical variable in the model, represented by a set of $k - 1$ trichotomous variables, each coded $-1$ for the omitted category, 1 for the $i$th category, and 0 otherwise (refer to the second panel of Table 7.6), the resulting regression coefficients give the difference between the mean on the dependent variable for the specific category and the unweighted average of the means for all categories. That is,

$$b_i = \bar{Y}_i - \bar{\bar{Y}} \tag{7.41}$$

where $\bar{\bar{Y}}$ is the unweighted average of the sample means for each category, averaged over all categories of the classification; $\bar{\bar{Y}}$ is also the intercept of the equation. The coefficient for the omitted category is just the negative sum of the coefficients for the $k - 1$ explicitly included categories, although in these days of high-speed computing, it usually is easier to simply change the omitted category. When other variables are included in the
Table 7.6. Design Matrices for Alternative Ways of Coding Categorical Variables (See Text for Details).
[Table body not recoverable. Column headings: Lower Manual, Upper Manual, Lower Nonmanual, Upper Nonmanual. First panel: Conventional Dummy Variable Coding.]
model, the same relationships hold except that now we have adjusted means.

As noted, the coding of categories that produces this outcome is shown in the second panel of Table 7.6. [...] is coded [...] 1 and 0 on the remaining indicator variables. [...] category and each of the other categories while minimizing the influence of the remaining categories.

Inspecting the coefficients in the second panel of Table 7.7, we see that [...] the four occupation [...]; lower manual workers have substantially lower than average scores; upper manual
Table 7.7. Coefficients for a Model of the Determinants of Vocabulary Knowledge, U.S. Adults, 1994 (N = 1,757; R² = .2445; Wald Test That Categorical Variables All Equal Zero: F(3,1752) = 12.48; p < .0000).

                                      Coefficient   Standard Error   p-Value
Conventional dummy variable coding
  [rows illegible]
Effect coding
  [label illegible]                     -0.377          0.076         0.000
  [label illegible]                      0.143          0.070         0.041
  Intercept                              2.482          0.239         0.000
  [remaining rows illegible]
Contrast coding
  C1                                    -0.529          0.106         0.000
  C2                                    -0.226          0.154         0.142
  C3                                     0.244          0.120         0.042
  Intercept                              2.482          0.239         0.000
  Education                              0.277          0.018         0.000
Sequential coefficients
  S2                                     0.226          0.154         0.142
  S3                                     0.295          0.154         0.056
  S4                                     0.243          0.119         0.042
  Intercept                              2.105          0.220         0.000
workers have somewhat lower than average scores; lower nonmanual workers have somewhat higher than average scores; and upper nonmanual workers have substantially higher than average scores. Note that the differences between occupation categories are identical (within rounding error) in the two parameterizations and that neither the effect of education nor the R² is affected. This reparameterization is likely to be most useful when the categorical variable includes a large number of response categories, no one of which is a particularly useful reference category. Note also the contrast between this parameterization and that discussed in the previous section, which shows coefficients as deviations from the weighted average of the subgroup means. Here the coefficients are deviations from the unweighted average. Both are appropriate, and each may be useful under certain circumstances.
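Because effect coding only reparameterizes the model, effect-coded coefficients can be recovered from conventional dummy coefficients by subtracting their unweighted mean. The sketch below illustrates that arithmetic under one assumption: the conventional coefficients are obtained by cumulating the sequential coefficients reported in Table 7.7 (0.226, 0.295, 0.243, with intercept 2.105), taking lower manual as the omitted category.

```python
# Convert conventional dummy coefficients into effect-coding coefficients.
# ASSUMPTION: the conventional coefficients below are implied by cumulating the
# sequential coefficients in Table 7.7, with lower manual as the omitted category.

conventional = {
    "lower manual": 0.0,
    "upper manual": 0.226,
    "lower nonmanual": 0.226 + 0.295,
    "upper nonmanual": 0.226 + 0.295 + 0.243,
}
intercept_conventional = 2.105

# Unweighted mean of the category effects (each category counts once).
unweighted_mean = sum(conventional.values()) / len(conventional)

# Effect coding: each category's deviation from the unweighted mean.
effect = {cat: coef - unweighted_mean for cat, coef in conventional.items()}

# The effect-coding intercept is the unweighted mean of the category intercepts.
intercept_effect = intercept_conventional + unweighted_mean

for cat, dev in effect.items():
    print(f"{cat:16s} {dev:+.3f}")
print(f"{'intercept':16s} {intercept_effect:.3f}")
```

Under this assumption the results agree, within rounding, with the legible effect-coding entries in Table 7.7 (-0.377, 0.143, and an intercept of 2.482), and the four deviations sum to zero, so the omitted category's coefficient is the negative sum of the other three.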
Contrast Coding

Sometimes we may want to compare the effects of subsets of variables. For example, we may want to contrast nonmanual and manual workers, and then to contrast the two nonmanual categories and the two manual categories. We can do this by constructing a set of contrasts of the subgroup means. That is, we form a set of contrasts of the form

$$C = a_1\bar{Y}_1 + a_2\bar{Y}_2 + \cdots + a_k\bar{Y}_k \tag{7.42}$$

subject to the constraints that the $a_i$ sum to zero; that $k - 1$ contrasts be formulated to represent $k$ categories; and that the codes for each pair of contrasts be linearly independent or, to put it differently, that the contrast codes be orthogonal, which condition is satisfied when, for each pair of contrasts, the sum of the products of the codes = 0.

A set of contrast codes is shown in the third panel of Table 7.6. Note that they satisfy the three constraints mentioned in the previous paragraph: three contrast variables are used to represent the four occupation categories; each row sums to zero; and the sum of the products of the codes = 0 in all cases (for example, for $C_1$ and $C_2$ we have .5*1 + .5*(-1) + (-.5)*0 + (-.5)*0 = 0; and similarly for $C_1$ and $C_3$, and for $C_2$ and $C_3$). This coding, plus a little further computation on the regression output, explained next, yields the coefficients shown in the third panel of Table 7.7.

Note that the intercept gives the unweighted mean of the category means, just as for effect coding, but the coefficients of the indicator variables have a somewhat different interpretation, which requires a little additional manipulation. Each contrast, $j$, gives the difference in the unweighted means of the means for the categories in the two groups being contrasted and is computed by

$$C_j = b_j \, \frac{n_{g1} + n_{g2}}{(n_{g1})(n_{g2})} \tag{7.43}$$

where $n_{g1}$ is the number of categories included in the first group, $n_{g2}$ is the number of categories included in the second group, and $b_j$ is the regression coefficient for the $j$th contrast-coded dummy variable. Note that the standard errors also must be multiplied by the same factor as the regression coefficients.

Inspecting the contrast coefficients shown in the third panel of Table 7.7, we see that the manual groups average about a half point below the nonmanual groups in their mean vocabulary scores, which is highly significant; that upper manual workers average about a quarter point higher than lower manual workers in their vocabulary scores, but that this difference is significant only at the .16 level, which gives us little confidence that there is a true difference between these categories; and that upper nonmanual workers likewise average about a quarter point higher than lower nonmanual workers, and that this difference is significant at the 0.04 level, which means that we can have modest confidence that there is a true difference between these categories.
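The three constraints on contrast codes, and the rescaling in Equation 7.43, can be checked numerically. The following sketch makes several assumptions that should be kept in mind: the C1 and C2 codes follow the example quoted in the text, the C3 codes (including their sign) are a guess, and the category effects are those implied by cumulating the sequential coefficients in Table 7.7.

```python
# Check the contrast-coding constraints and recover the contrasts.
# Category order: lower manual, upper manual, lower nonmanual, upper nonmanual.
# ASSUMPTION: C1 and C2 follow the codes quoted in the text; the C3 codes
# (and their sign) are hypothetical.
codes = {
    "C1": [0.5, 0.5, -0.5, -0.5],   # manual vs. nonmanual
    "C2": [1.0, -1.0, 0.0, 0.0],    # lower manual vs. upper manual
    "C3": [0.0, 0.0, -1.0, 1.0],    # upper nonmanual vs. lower nonmanual
}

# ASSUMPTION: adjusted category effects implied by the sequential panel of
# Table 7.7 (lower manual taken as zero).
effects = [0.0, 0.226, 0.521, 0.764]


def dot(u, v):
    """Sum of products of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


names = list(codes)

# Constraint: each set of codes sums to zero.
sums_to_zero = all(abs(sum(codes[n])) < 1e-12 for n in names)

# Constraint: every pair of contrasts is orthogonal.
orthogonal = all(
    abs(dot(codes[a], codes[b])) < 1e-12
    for i, a in enumerate(names)
    for b in names[i + 1:]
)

contrasts = {}
for name, code in codes.items():
    # With orthogonal, zero-sum codes, the regression coefficient for a
    # contrast variable is the projection of the category effects on its codes.
    b = dot(code, effects) / dot(code, code)
    n_g1 = sum(1 for c in code if c > 0)   # categories in the first group
    n_g2 = sum(1 for c in code if c < 0)   # categories in the second group
    contrasts[name] = b * (n_g1 + n_g2) / (n_g1 * n_g2)   # Equation 7.43

print(sums_to_zero, orthogonal)
print({k: round(v, 3) for k, v in contrasts.items()})
```

Under these assumptions the rescaling factor is 1 for C1 (two categories against two) and 2 for C2 and C3 (one category against one), and the resulting contrasts match the third panel of Table 7.7 within rounding.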
Sequential Coefficients

One additional way of presenting coefficients for categorical variables is sometimes helpful. When the underlying dimension is ordinal, or we want to treat it as ordinal, it may be useful to reexpress the coefficients as indicating the difference between each category and the preceding category. To do this is a simple matter of estimating the equation using conventional dummy variable coding but then subtracting the preceding coefficient from each coefficient. If we have $k$ categories and omit the first one, then $b_2$ remains unchanged ($b'_2 = b_2 - 0$); $b'_3 = b_3 - b_2$; and so on. The appropriate standard errors for each coefficient are then the standard errors of the difference from the preceding coefficient. Again, the standard error for $b_2$ remains unchanged, and the standard errors of the remaining variables may either be computed by hand from the variance-covariance matrix of coefficients using the denominator of the formula shown in the boxed note, "How to Test the Significance of the Difference Between Two Coefficients," in Chapter Six, or computed simply by reestimating the regression equation, successively altering the omitted category.

Here we see (in the fourth panel of Table 7.7) that the differences between adjacent occupation categories are each about one-quarter of a point on the vocabulary scale and that all but the first of the differences are significant at conventional levels. Note also that the contrasts between the first and second categories (lower and upper manual workers) and between the third and fourth categories (lower and upper nonmanual workers) are identical (within limits of rounding error) to contrasts 2 and 3 of the previous panel, which, of course, must be so because the same categories are being contrasted in both cases.
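The "by hand" route can be sketched as follows. The standard error of a difference between two coefficients is the square root of the sum of their variances minus twice their covariance. In this illustration the coefficients are those implied by the sequential panel of Table 7.7, but the variance-covariance matrix is entirely hypothetical; in practice it would be taken from the regression output.

```python
import math

# Conventional coefficients with lower manual omitted:
# upper manual, lower nonmanual, upper nonmanual.
# ASSUMPTION: implied by cumulating the sequential coefficients in Table 7.7.
coefs = [0.226, 0.521, 0.764]

# ASSUMPTION: a hypothetical variance-covariance matrix for those coefficients;
# real values come from the estimation output.
vcov = [
    [0.0237, 0.0100, 0.0095],
    [0.0100, 0.0200, 0.0090],
    [0.0095, 0.0090, 0.0122],
]

# The first sequential coefficient is unchanged; each later one is the
# coefficient minus the preceding coefficient.
sequential = [coefs[0]] + [coefs[i] - coefs[i - 1] for i in range(1, len(coefs))]

# The first standard error is unchanged; each later one is the standard error
# of a difference: sqrt(var_i + var_prev - 2 * cov_i_prev).
std_errors = [math.sqrt(vcov[0][0])]
for i in range(1, len(coefs)):
    var_diff = vcov[i][i] + vcov[i - 1][i - 1] - 2 * vcov[i][i - 1]
    std_errors.append(math.sqrt(var_diff))

for s, se in zip(sequential, std_errors):
    print(f"{s:.3f}  (SE {se:.3f})")
```

Reestimating the equation with a different omitted category gives the same numbers without any hand computation, which is why the text recommends it as the simpler route.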
DECOMPOSING THE DIFFERENCE BETWEEN TWO MEANS

A common problem in social research is to account for why two (or more) groups differ with respect to their average score or value on some variable. For example, we may observe that Blacks and non-Blacks differ with respect to their average earnings and may wonder how this comes about. In particular, we may wonder to what extent the difference arises from group differences in their "assets," the traits that enhance earnings, and to what extent it arises because the groups get different "rates of return" to their assets; that is, some groups gain a greater advantage from any level of assets than do other groups. Consider education, for example. To what extent do earnings differences between Blacks and non-Blacks arise because Blacks tend to be less well educated than non-Blacks, and to what extent do they arise because Blacks get a lower return to their education than do non-Blacks? A natural way to investigate the determinants of any outcome of interest is to regress the outcome on a set of suspected determinants and note the relative size of the coefficients associated with each independent variable. A natural extension of this approach for the purpose of comparing two groups is to compute parallel regressions for the two groups of interest, to subtract one equation from the other, and to note the size of the resulting differences.

Consider the following equations:
$$\hat{Y}_1 = a_1 + \sum_i b_{i1} X_{i1} \tag{7.44}$$

and

$$\hat{Y}_2 = a_2 + \sum_i b_{i2} X_{i2} \tag{7.45}$$

which represent some model with $I$ independent variables, estimated separately for Group 1 and Group 2. Because a least-squares regression equation with an intercept passes through the means of its variables, the group means satisfy

$$\bar{Y}_1 = a_1 + \sum_i b_{i1} \bar{X}_{i1} \tag{7.46}$$

$$\bar{Y}_2 = a_2 + \sum_i b_{i2} \bar{X}_{i2} \tag{7.47}$$

Then, taking the difference between (7.46) and (7.47), we have

$$\begin{aligned}
\bar{Y}_1 - \bar{Y}_2 &= \Big(a_1 + \sum_i b_{i1}\bar{X}_{i1}\Big) - \Big(a_2 + \sum_i b_{i2}\bar{X}_{i2}\Big) \\
&= (a_1 - a_2) + \sum_i (b_{i1} - b_{i2})\bar{X}_{i2} + \sum_i b_{i2}(\bar{X}_{i1} - \bar{X}_{i2}) + \sum_i (b_{i1} - b_{i2})(\bar{X}_{i1} - \bar{X}_{i2})
\end{aligned} \tag{7.48}$$

(You can work out the equality for yourself. It is easier if you start with the expanded equation and derive the simple difference.)

Equation 7.48 also can be written as

$$\bar{Y}_1 - \bar{Y}_2 = (a_1 - a_2) + \sum_i (b_{i1} - b_{i2})\bar{X}_{i1} + \sum_i b_{i1}(\bar{X}_{i1} - \bar{X}_{i2}) + \sum_i (b_{i1} - b_{i2})(\bar{X}_{i2} - \bar{X}_{i1}) \tag{7.49}$$

(Again, you can convince yourself of this by working out the algebra.)

Equations 7.48 and 7.49 represent alternative decompositions of the difference between two means into the difference between the intercepts, the slopes, the means on the independent variables, and the interactions between the differences in slopes and differences in means. In Equation 7.48, Group 2 is used as a standard. Hence, the effect of the difference in slopes is evaluated at the mean for Group 2, and the effect of the difference in means is evaluated with respect to the slope for Group 2. In Equation 7.49, Group 1 is taken as the standard. These equations generally yield different answers, and there usually is no obvious way to decide between them. Hence, it is a good practice to present both sets of decompositions, as I do here. Differences in interpretation associated with use of the two standards will be discussed shortly.

In both these decompositions the coefficients representing the effect of the difference in means and the interaction are unchanged when a constant is added to or subtracted from the independent variables, but the coefficients representing the effect of the difference in intercepts and the difference in the rate of return to the independent variables do depend on the scaling of the variables (Jones and Kelley 1984). For this reason, it generally is advisable to combine these two terms. Doing so yields three components. From Equation 7.48 we have

$\bar{Y}_1 - \bar{Y}_2$
Actual: Observed group difference.

$\sum_i b_{i2}(\bar{X}_{i1} - \bar{X}_{i2})$
Composition: Portion due to differences in assets.

$(a_1 - a_2) + \sum_i (b_{i1} - b_{i2})\bar{X}_{i2}$
Rate: Portion due to differences in the rates of return to assets (that is, the difference remaining if assets were equalized).

$\sum_i (b_{i1} - b_{i2})(\bar{X}_{i1} - \bar{X}_{i2})$
Interaction: Portion due to valuing the differences in assets at the Group 2 rate of return rather than the Group 1 rate of return.
Equation 7.49 can, of course, be reorganized in the same way. Note that in Equations 7.48 and 7.49, the interaction terms have the same absolute value but opposite sign, which follows from the definition of the interaction term.
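The three-component decomposition can be written as a short function. This is a sketch of the algebra in Equations 7.48 and 7.49 only, not the Stata routine mentioned later in the chapter; the function name `decompose` and all numerical inputs are hypothetical and chosen just to make the arithmetic easy to follow.

```python
def decompose(a1, b1, x1, a2, b2, x2, standard=2):
    """Split mean(Y1) - mean(Y2) into composition, rate, and interaction.

    a1, a2: intercepts; b1, b2: lists of slopes; x1, x2: lists of group means.
    standard=2 follows Equation 7.48 (Group 2 as standard);
    standard=1 follows Equation 7.49 (Group 1 as standard).
    """
    terms = list(zip(b1, x1, b2, x2))
    mean1 = a1 + sum(s1 * m1 for s1, m1, _, _ in terms)
    mean2 = a2 + sum(s2 * m2 for _, _, s2, m2 in terms)
    cross = sum((s1 - s2) * (m1 - m2) for s1, m1, s2, m2 in terms)
    if standard == 2:
        composition = sum(s2 * (m1 - m2) for s1, m1, s2, m2 in terms)
        rate = (a1 - a2) + sum((s1 - s2) * m2 for s1, m1, s2, m2 in terms)
        interaction = cross
    else:
        composition = sum(s1 * (m1 - m2) for s1, m1, s2, m2 in terms)
        rate = (a1 - a2) + sum((s1 - s2) * m1 for s1, m1, s2, m2 in terms)
        interaction = -cross
    return {
        "difference": mean1 - mean2,
        "composition": composition,
        "rate": rate,
        "interaction": interaction,
    }


# ASSUMPTION: hypothetical intercepts, slopes, and means for two groups.
group1 = dict(a1=10.0, b1=[0.30, -0.10], x1=[12.0, 3.0])
group2 = dict(a2=11.0, b2=[0.20, -0.05], x2=[10.0, 5.0])

d_group2_standard = decompose(**group1, **group2, standard=2)
d_group1_standard = decompose(**group1, **group2, standard=1)

for label, d in [("Group 2 standard", d_group2_standard),
                 ("Group 1 standard", d_group1_standard)]:
    parts = d["composition"] + d["rate"] + d["interaction"]
    print(label, {k: round(v, 3) for k, v in d.items()}, "sum:", round(parts, 3))
```

With either standard the three components add up to the same observed difference, while the composition and rate components differ and the interaction simply changes sign, which is the reason for reporting both decompositions.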
A Worked Example: Factors Affecting Racial Differences in Educational Attainment
Now let us consider a substantive problem to see what to do with these coefficients. Suppose we are interested in studying the factors affecting racial differences in the level of education completed. It is well known that on average Blacks have less education than do members of other races. Data from the GSS show that over the period 1990 to 2004, Blacks averaged about a year less in schooling than did others (12.8 years compared to 13.7 years for others). What factors might account for this difference?

To study this question, it is necessary to obtain a large enough sample of Blacks to yield reliable estimates. Although no one year of the GSS has enough Black respondents, pooling all years from 1990 to 2004, excluding 2002 (in which the race question was asked in a nonstandard way and thus the data are not comparable to those from other years), yields a sample of 2,105 Blacks with complete information on all variables. I thus pooled data from these years of the GSS, divided the sample into Blacks and non-Blacks, and estimated for each sample the regression of years of school completed on mother's years of schooling, number of siblings, and whether the respondent resided in the South at age sixteen. I chose these variables for study because they are known to affect educational attainment: mother's education is a measure of family cultural capital and is superior to father's education in the Black population due to the relatively large number of female-headed households (the higher the level of mother's education, the higher the expected level of respondent's education); the number of siblings is an indicator of the share of parental resources that can be devoted to any single child (the larger the number of siblings, the lower the expected level of educational attainment); and Southern residence at age sixteen is an indicator of inferior schooling (those who grew up in the South are expected to obtain less education than those who grew up in other parts of the country).
NON-BLACK IS BETTER THAN 3[ WHY BLACKVERSUS wHrrE vERsus NoN-wHlTEFoR SOCIALANALYSIS N lN THE UNITED
STATES
racial dfferences intheunlted states, whenstudylns
'- is reasonable Non-Blacks, of course,will to dividethe populationinto Blacksand non-Blacks. are white). Including o€ mainlyWhite (in the GSSfrom 1990to 2004 94 percentof non-Blacks 'Others"(thosewho areneitherBlacknor White-and, in fact, aremostlyAsian)wlth Whiteshas rttleeffecton the estimates but hasthe advantageof retainingthe entirepopulationratherthan arbitrarily studyingmostbut not all of the population.Includinq"others" with Blacksmakesless sense,both because"Others"aremoresimilartoWhiteswith respectto mostsocialcharacteristics and becausethey would constitutea largerfractionof the "non-White"populationthan of the "non-Black"population,thus makingthe categorylesshomoqeneous.
lf
The question at issue is to what extent the observed nearly one-year difference in the average education of Blacks and non-Blacks is due to racial differences in the average level of mother's schooling, the average number of siblings, and the probability of living in the South, and to what extent the difference is due to the lower rates of return to Blacks from having educated mothers, coming from small families, and living outside the South. I start by estimating an equation of the form

$$E = a + b(E_M) + c(S) + d(R) \tag{7.50}$$

separately for Blacks and non-Blacks, where $E$ = years of school completed; $E_M$ = years of school completed by the mother; $S$ = number of siblings; and $R$ = 1 if the respondent lived in the South at age sixteen and = 0 otherwise.
A COMMENT ON CREDIT IN SCIENCE

In Stata the decomposition is carried out using an -ado- file, -oaxaca-, which can be downloaded from the Web: type "net search oaxaca", then click the entry for oaxaca. The name of the -ado- file is a telling reflection of the sociology of science. The decomposition technique was introduced by the demographer Evelyn Kitagawa in 1955 and was elaborated in a number of ways over the years by demographers and sociologists; see the section on "Additional Reading on Decomposing the Difference Between Means" later in the chapter. However, it was only when an economist, Ronald Oaxaca (1973), used the procedure that it gained general currency among economists. It is now known as the "Oaxaca decomposition" or the "Blinder-Oaxaca decomposition," due to a somewhat clearer exposition by another economist, Alan Blinder (1973).
Table 7.8 shows the means, standard deviations, and correlations among the variables included in the equation, separately for Blacks and non-Blacks. From the table we see that Blacks come from much larger families and are much more likely to have been raised in the South, and that both respondents and their mothers average nearly a year less schooling than non-Blacks. Table 7.9 shows the regression estimates, and Table 7.10 shows the decomposition.

For Blacks in the 1990 to 2004 GSS pooled sample, the estimated values for Equation 7.50 are

$$\hat{E} = 11.09 + .220(E_M) - .071(S) - .512(R) \tag{7.51}$$

whereas for non-Blacks the estimated values are

$$\hat{E} = 10.76 + .308(E_M) - .138(S) - .488(R) \tag{7.52}$$

Table 7.9 gives the regression coefficients for the two equations, together with standard errors. This table shows that the main differences between the determinants of education for Blacks and non-Blacks are, first, that the cost of coming from a large family is substantially greater for non-Blacks than for Blacks and, second, that the advantage of mother's education is greater for non-Blacks than for Blacks. Interestingly, the effect of
Table 7.8. Means, Standard Deviations, and Correlations for Variables Included in a Model of Educational Attainment for U.S. Adults, 1990 to 2004, by Race (Blacks Above the Diagonal, Non-Blacks Below).
(E) Years of school
lF
I
a
0.350
-0.186
(E_M) Mother's years of school
0.411
(S) Number of siblings
o.232 . o )64
(R) Lived in South at 16
Mean
Standard deviation
-0.274
-0.065
0.011
13.7
11.4
3.33
3.46
-0.201
o 1)4
-0.102
2.83
(R)
2.67
10.6
4.96
3.73
..:
0.559
0.444
N = 14,985
3.45 O.49r-
I.
Table 7.9. Coefficients of a Model of Educational Attainment for Blacks and Non-Blacks, U.S. Adults, 1990 to 2004.

                          Metric Coefficients (Standard Errors)
                          Blacks              Non-Blacks
Number of siblings        -0.071 (0.016)      -0.138 (0.008)
Constant                  11.09 (0.22)        [illegible]
[remaining rows illegible]
growing up in the South differs little for the two races, which represents an important change from the past. Finally, less than a fifth of the variance in education is explained by the three variables in the model for non-Blacks and less than a sixth for Blacks, both substantially less than in the past.

The coefficients in Table 7.9 do not, however, permit a formal comparison of the determinants of educational attainment for Blacks and others. To see this we turn to Table 7.10, which gives a decomposition of the almost one-year difference in the average years of school completed by Blacks and non-Blacks. Decomposition 1, which takes Blacks as the standard, is constructed from Equation 7.48, whereas Decomposition 2, which takes non-Blacks as the standard, is constructed from Equation 7.49. In both cases non-Blacks are taken as group 1, and Blacks as group 2. So the decomposition is of the approximately one-year advantage in the average schooling of non-Blacks compared with Blacks. Both decompositions suggest that differences in assets (the fact that non-Black women have better-educated mothers, fewer siblings, and are less likely to live in the South) are more important than differences in returns to assets. But the two decompositions differ in the contribution they assign to differences in maternal education and number of siblings, both of which are more important in Decomposition 2 than in Decomposition 1. The reason for this is straightforward: when Blacks are taken as the
Table 7.10. Decomposition of the Difference in the Mean Years of School Completed [...]

                                          Decomposition 1     Decomposition 2
                                          (Black Standard)    (Non-Black Standard)
Total difference                                       0.89
Differences in assets
  Mother's education                           0.17             [illegible]
  Number of siblings                           0.11             [illegible]
  Lived in South at 16                         0.15             [illegible]
  Total due to difference in assets            0.44             [illegible]
Differences in returns to assets
  Mother's education                           0.93             [illegible]
  Number of siblings                          -0.34             [illegible]
  Lived in South at 16                         0.01             [illegible]
  Intercept                                   -0.33             [illegible]
  Total due to differences in returns          0.28              0.46
Interactions
  Mother's education                           0.07             -0.07
  Number of siblings                           0.11             -0.11
  Lived in South at 16                        -0.01              0.01
  Total due to interactions                    0.17             -0.17
standard (that is, when the Black/non-Black difference in slopes is evaluated at the Black mean), the difference in expected values is smaller in both cases than when the difference in slopes is evaluated at the non-Black mean. (To convince yourself of this, sketch a graph of the two slopes for each of the two variables.) Finally, the interaction terms have relatively little importance in this decomposition because of the offsetting effects of number of siblings and Southern residence.
Additional Reading on Decomposing the Difference Between Means

For a good sense of how to carry out more complicated decompositions and what the interpretative issues are, read the papers by Duncan (1968), Winsborough and Dickinson (1971), Kaufman (1983), Treiman and Roos (1983), Jones and Kelley (1984), Kraus (1986), Treiman and Lee (1996), and Treiman, McKeever, and Fodor (1996).
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have covered various elaborations of multiple regression procedures that give us improved ability to represent social processes and thus to test ideas about how the social world works. Specifically, we have considered nonlinear transformations of both dependent and independent variables; ways to test the equality of coefficients within an equation; how to assess the assumption of linearity in a relationship; how to construct and interpret linear splines to represent abrupt changes in slopes; alternative ways of expressing dummy variable coefficients; and a procedure for decomposing the difference between two means. Several of the worked examples have focused on trends over time, which gives us a model for how to use multiple regression procedures to study social change.

In the next chapter we return to perhaps the most vexing problem in nonexperimental social research (missing data on some but not all variables) and consider what is currently regarded as the gold standard for dealing with missing data: multiple imputation of missing values.
CHAPTER

MULTIPLE IMPUTATION OF MISSING DATA

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider issues involved in the treatment of missing data, review various methods for handling missing data, and see how to use a state-of-the-art procedure for imputing missing data to create a complete data set, the method of multiple imputation. For a very useful overview of imputation methods, including multiple imputation, see Paul and others (2008), upon which this discussion draws heavily. Other useful reviews of the literature on missing data treatments include Anderson, Basilevsky, and Hum (1983), Little (1992), Brick and Kalton (1996), and Nordholt (1998).
INTRODUCTION

Missing data is a vexing problem in social research. It is both common and difficult to manage. Most survey items include nonresponse categories: respondents do not know the answers to some questions or refuse to answer; interviewers inadvertently skip questions or record invalid codes; errors are made in keying data; and so on. Administrative data, hospital records, and other sorts of data have similar problems: invalid or missing responses to particular items. Where information is missing because it is not applicable to particular respondents (for example, age at marriage for the never married), there is no problem; the analytic sample is simply defined as those "at risk" of the event. But in the remaining cases, in which in principle there could be a response, we need special procedures to cope with missing information.

In the statistics literature on missing data (Rubin 1987; Little and Rubin 2002), a distinction is made between three conditions: missing completely at random (MCAR), the condition in which missing responses to a particular variable are independent of the values of any other variable in the explanatory model and of the true value of the variable in question; missing at random (MAR), the condition in which missingness is independent of the true value of the variable in question but not of at least some of the other variables in the explanatory model; and missing not at random (MNAR) or, alternatively, nonignorable (NI), in which missingness depends on the true value of the variable in question and, possibly, on other variables as well.
Note that these distinctions refer to net effects. Thus, for example, if the probability that data are missing on the father's education is independent of the true value of the father's education after account is taken of the respondent's education but depends on the respondent's education, the data would satisfy the MAR condition. The fact that the typology refers to net rather than gross effects is very important because otherwise it is difficult to think of variables that satisfy the MAR condition. For example, it is likely that missingness on the father's education is correlated with the true value of the father's education simply because the father's and the respondent's education are correlated, and lack of knowledge of the father's education is greater among the poorly educated.

Unfortunately, at least in cross-sectional data, there is no way to empirically determine whether missingness is independent of the true value of the variable; this must be defended on theoretical grounds. Although it is likely that missingness is seldom completely independent of the true value of the variable, there are many cases in which it is plausible to assume that it is largely independent, net of the other variables in the explanatory model. These are the cases that concern us here.

The NI condition is often discussed under the rubric of sample selection bias, the situation where the sample is selected on the basis of variables correlated with the dependent variable. This topic is beyond what can be included in this book (but see Chapter Sixteen for a brief introduction). Accessible discussions of the issues involved in sample selection bias and possible corrections can be found in Berk and Ray (1982), Berk (1983), Breen (1996), and Stoltzenberg and Relles (1997).

Next we review a number of procedures for dealing with missing data, culminating in a discussion of Bayesian multiple imputation, the current gold standard, and presentation of a worked example using this method.
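A small simulation can make the three conditions, and the gross-versus-net distinction, more concrete. Everything in the sketch below is hypothetical: the data are simulated, and the variable names, missingness probabilities, and cutoffs are invented for illustration. It compares the mean of father's education among cases that remain observed under each mechanism with the mean in the full simulated sample.

```python
import random

random.seed(12345)
n = 20000

# ASSUMPTION: simulated data in which father's education is positively
# correlated with respondent's education.
resp = [random.gauss(13, 3) for _ in range(n)]
father = [0.6 * r + random.gauss(3, 2.5) for r in resp]


def observed_mean(values, keep):
    """Mean of the values whose keep flag is True (the non-missing cases)."""
    kept = [v for v, k in zip(values, keep) if k]
    return sum(kept) / len(kept)


# MCAR: every case has the same 30 percent chance of being missing.
mcar = [random.random() > 0.30 for _ in range(n)]

# MAR: missingness on father's education depends only on the respondent's
# (fully observed) education.
mar = [random.random() > (0.50 if r < 12 else 0.10) for r in resp]

# MNAR: missingness depends on the true value of father's education itself.
mnar = [random.random() > (0.50 if f < 10 else 0.10) for f in father]

true_mean = sum(father) / n
means = {
    "MCAR": observed_mean(father, mcar),
    "MAR": observed_mean(father, mar),
    "MNAR": observed_mean(father, mnar),
}

print(f"full sample: {true_mean:.2f}")
for name, value in means.items():
    print(f"{name:5s} observed mean: {value:.2f}")
```

In this setup the MAR mechanism still shifts the observed mean of father's education, because missingness is correlated with it in the gross sense through respondent's education; the MAR condition requires only that the dependence vanish once respondent's education is taken into account.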
Casewise Deletion
The most commonly used method for dealing with missing data (which we have employed so far in this book) is simply to drop all cases with any missing data on the variables included in the analysis. If data mainly are missing completely at random, due to recording, keying, or coding errors, or omission by design (the question is asked only of a random subset of the sample), the main cost is to reduce the sample size. This is bad enough because often the reduction in sample size is quite dramatic. For example, Clark and Altman (2003) reported a study of prognosis of ovarian cancer in which missing data on 10 covariates reduced the sample size by 56 percent, from 1,189 to 518.
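As a rough illustration of what casewise deletion does to a data set, the following Python sketch drops every case with a missing value on any analysis variable and reports the share of cases lost. The function names, the variable names, and the toy records are hypothetical; only the Clark and Altman sample sizes come from the text.

```python
def casewise_delete(rows, variables):
    """Keep only the rows that have no missing (None) value on any analysis variable."""
    return [row for row in rows if all(row.get(v) is not None for v in variables)]


def percent_lost(n_before, n_after):
    """Percentage of cases dropped by the deletion."""
    return 100.0 * (n_before - n_after) / n_before


# Hypothetical toy records; None marks a missing value.
rows = [
    {"educ": 12, "father_educ": 10, "books": 3},
    {"educ": 16, "father_educ": None, "books": 5},
    {"educ": 10, "father_educ": 8, "books": None},
    {"educ": 14, "father_educ": 12, "books": 4},
]
complete = casewise_delete(rows, ["educ", "father_educ", "books"])
print(len(complete), "of", len(rows), "toy cases retained")

# Sample sizes reported by Clark and Altman (2003), as cited in the text.
print(round(percent_lost(1189, 518)), "percent of cases lost")
```

Note that a case is lost if it is missing on any one variable, which is why modest item nonresponse on several variables can compound into a large loss overall.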
WHY PAIRWISE DELETION SHOULD BE AVOIDED
Sometimes, to avoid substantial reductions in their sample size, analysts base their analysis on "pairwise-present" correlations, that is, correlations computed from all data available for each pair of variables. This is a bad idea because it can produce inconsistent, and often uninterpretable, results, especially when hierarchical models are contrasted, of the kind discussed in the section of Chapter Six on "A Strategy for Comparisons Across Groups."
However, the problem usually is much worse because data are not missing completely at random. Rather, the presence or absence of data on particular variables tends to depend on the value of other variables. For example, as noted previously, poorly educated people are less likely to know about their family histories, and hence their parents' characteristics, than are well-educated people; the refusal to answer certain kinds of questions, for example, those involving political attitudes, may vary with political party affiliation; self-employed businessmen may refuse to divulge their income for fear that the information will wind up in the hands of the tax authorities; and so on. In such cases, coefficients estimated using casewise deletion are generally biased. Thus, to simply omit missing data is to risk seriously distorting our analysis.

Case deletion (also known as listwise deletion) is also appropriate when the model is perfectly specified, and the value of the dependent variable is not affected by the missingness of data on any of the independent variables (Paul and others 2008). But perfectly specified models are virtually unknown in the social sciences. The mean imputation with dummy variables method discussed a bit later provides a test of the dependence of the dependent variable on the missingness of the independent variable or variables; but we still are left with the problem of imperfect model specification. One circumstance in which case deletion is appropriate is when a question is asked only of a random subset of a sample because then the subset is still a probability sample of the population. But even here, there usually is a heavy cost to pay in terms of reduction in sample size.
Weighted Casewise Deletion
A similar approach, which is possible where the population distribution of some variables is known or can be accurately estimated (for example, from a census or high-quality survey), is to drop cases with any missing data but then to weight (or reweight) the sample so that it reflects the population distribution with respect to known variables, such as age, sex, ethnicity, education, and geographical distribution. The U.S. Census Bureau and a number of sample survey houses do this to correct their sample surveys for differential nonresponse, but the method also has been used to correct for item nonresponse. If the substantive model is perfectly specified, the method can result in unbiased estimates, albeit with inflated standard errors. In addition, weights that depart substantially from unity also inflate the standard errors. (Stata's -pweight- option provides correct standard errors when this method is used, but the standard errors typically will be inflated relative to standard errors for unweighted data.) However, because our models are virtually never perfectly specified, the validity of the procedure depends on how closely perfect specification is approximated, which requires a judgment on the part of the analyst.
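The reweighting idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Census Bureau's procedure: the group labels, the population shares, and both function names are hypothetical, and a single known variable stands in for the several that would be used in practice.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """Weight for each group = known population share / share among retained cases."""
    n = len(sample_groups)
    counts = Counter(sample_groups)
    return {g: population_shares[g] / (counts[g] / n) for g in counts}


def weighted_mean(values, groups, weights):
    """Mean of values in which each case counts in proportion to its group weight."""
    numerator = sum(weights[g] * v for v, g in zip(values, groups))
    denominator = sum(weights[g] for g in groups)
    return numerator / denominator


# Hypothetical complete cases left after casewise deletion: the poorly
# educated group is underrepresented relative to an assumed 50/50 population.
groups = ["low", "low", "high", "high", "high", "high"]
years = [8, 10, 14, 16, 14, 16]
population_shares = {"low": 0.5, "high": 0.5}

weights = poststratification_weights(groups, population_shares)
unweighted = sum(years) / len(years)
reweighted = weighted_mean(years, groups, weights)
print(weights, unweighted, reweighted)
```

The underrepresented group receives a weight above 1 and the overrepresented group a weight below 1; as the text notes, weights far from unity come at the price of larger standard errors.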
Mean Substitution
Various methods for imputing missing data (rather than dropping cases) have been proposed. (The mean substitution methods provide a way to generate complete data with respect to the explanatory variables. In these methods, the dependent variable is not imputed; doing so would amount to artificially inflating the strength of the association by adding cases on the regression line.) Early studies often simply substituted the mean or mode of the nonmissing values, but this procedure is now regarded as entirely inadequate because doing so without further correction produces biased coefficients in regression models even under the MCAR condition (Little 1992) and also produces downwardly biased standard deviations of each distribution containing imputed data and hence downwardly biased standard errors and confidence intervals of calculated statistics.

Another approach, which has been widely used in the social sciences, is the missing indicator method: for each independent variable with substantial missing data, the mean (or some other constant) is substituted, and a dummy variable, scored 1 if a value has been substituted and scored 0 otherwise, is added to the regression equation. An advantage of the method is that it provides a test of the MCAR assumption: if any of the dummy variables has a (significantly) nonzero coefficient, the data are not MCAR. Cohen and Cohen (1975, 274), early proponents of this method, claim that it corrects for the nonrandomness of missing data. However, Jones (1996) has shown that it and related methods (for example, adding a category for missing data when a categorical variable has been converted to a set of dummy variables) produce biased estimates.
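The downward bias in the standard deviation is easy to see in a small example. The Python sketch below substitutes the observed mean for each missing value and also builds the 0/1 indicator used by the missing indicator method; the function name and the data are hypothetical and serve only to illustrate the mechanics.

```python
import statistics


def mean_substitute(values):
    """Replace missing values (None) with the observed mean.

    Returns the filled list and a 0/1 indicator marking the substituted cases.
    """
    observed = [v for v in values if v is not None]
    observed_mean = statistics.mean(observed)
    filled = [observed_mean if v is None else v for v in values]
    indicator = [1 if v is None else 0 for v in values]
    return filled, indicator


# Hypothetical years of schooling of the father; None marks a missing value.
father_educ = [4, 8, None, 12, None, 16, 10, None]
filled, missing_flag = mean_substitute(father_educ)

sd_observed = statistics.stdev([v for v in father_educ if v is not None])
sd_filled = statistics.stdev(filled)
print(round(sd_observed, 2), round(sd_filled, 2))
```

Every substituted case sits exactly at the mean, so it adds nothing to the sum of squared deviations while increasing the number of cases; the standard deviation of the filled variable is therefore necessarily smaller than that of the observed values.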
A final mean substitution method is conditional mean imputation, in which missing values are replaced by predicted values from the regression of the variable to be imputed (for the subset of cases with observations on that variable) on other variables in the data set; this is the method implemented by Stata 10.0 in its -impute- command. This method also results in (typically downwardly) biased coefficients and underestimated standard errors.

All the mean imputation methods suffer from the problem of over-fitting. Because missing data are replaced by a predicted value, the completed data set does not adequately represent the uncertainty in the process being studied, that is, the error component in the values for each individual. This is manifest in standard errors that are too small, even in
where the coefficients themselves [... the remainder of this passage, which cites Rubin and Schenker (1986), is not recoverable from the source ...]
Hotdeck Imputation
This is the method used by the U.S. Census Bureau to construct complete data public use files. The sample is divided into strata (similar to the strata used in the weighted casewise deletion and conditional mean imputation methods). Then each missing value within a stratum is replaced with a value randomly drawn (with replacement) from the observed cases within the stratum. As a result, within each stratum the distribution of values for the imputed cases is (within the limits of sampling error) identical to the distribution of values for the observed cases. When the imputation model is correctly specified (that is, when all variables correlated with the missingness of a given variable are used to impute the missing values), this method produces unbiased coefficients but biased standard errors. It also tends to perform poorly when a large fraction of cases have at least one missing value (Royston 2004, 228).
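The mechanics of a hotdeck can be sketched as follows. This is an illustrative Python version under simplifying assumptions (one variable to impute, one stratifying variable), not the Census Bureau's implementation; the function name, the field names, and the records are hypothetical.

```python
import random


def hotdeck_impute(rows, target, stratum_key, rng):
    """Fill each missing target value with a random draw (with replacement)
    from the observed values in the same stratum. Returns new rows."""
    donors = {}
    for row in rows:
        if row[target] is not None:
            donors.setdefault(row[stratum_key], []).append(row[target])

    completed = []
    for row in rows:
        new_row = dict(row)
        if new_row[target] is None:
            pool = donors.get(new_row[stratum_key])
            if pool:  # a stratum without donors is left missing
                new_row[target] = rng.choice(pool)
        completed.append(new_row)
    return completed


# Hypothetical records stratified by the respondent's own education level.
rows = [
    {"level": "low", "father_educ": 4},
    {"level": "low", "father_educ": None},
    {"level": "low", "father_educ": 8},
    {"level": "high", "father_educ": 12},
    {"level": "high", "father_educ": None},
    {"level": "high", "father_educ": 16},
]
completed = hotdeck_impute(rows, "father_educ", "level", random.Random(0))
print([r["father_educ"] for r in completed])
```

Because each imputed value is an actual observed value from the same stratum, the within-stratum distribution is preserved, but a single completed data set still treats the imputed values as if they were known, which is why the standard errors remain biased.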
Bayesian Multiple Imputation
This method, introduced by Rubin [...] imputation in practice [...] Little and [...] of the method; Schafer (1997, 1999) provides [...] more accessible introductions, as does Allison (2001). [...] see Treiman, Bielby, and Cheng (1988) [...].

The essence of multiple imputation is that for each variable with missing data [...] a complete data set, with the missing data imputed [...] values drawn at random from the predicted distribution are substituted for the missing values. Because variables with missing data may be among the predictors for another variable with missing data, the process is repeated several times [...] sets are five, but there is some [...] may be helpful.
Each of these data sets is then analyzed in the usual way, and the resulting coefficients are averaged or otherwise combined, using what have become known as "Rubin's rules" (Rubin 1987, 76). This method produces unbiased coefficients and, by taking account of the additional uncertainty created by the imputation process, unbiased standard errors. Specifically, the standard error of a coefficient based on M imputations is given by
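$$\mathrm{s.e.}(\bar{b}) = \sqrt{\frac{1}{M}\sum_{k=1}^{M} s_k^{2} + \left(1 + \frac{1}{M}\right)\left(\frac{1}{M-1}\right)\sum_{k=1}^{M}\left(b_k - \bar{b}\right)^{2}} \qquad (8.1)$$

where $b_k$ is the coefficient estimated from the $k$th imputed data set, $s_k$ is its standard error, and $\bar{b}$ is the mean of the $b_k$ over the $M$ imputations.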
That is, the squared standard error is estimated as the average of the squared standard errors from the individual imputations (the leftmost term), which captures the uncertainty in the estimate within each imputation, plus a component for the variation in the estimated coefficients across imputations, which captures the uncertainty introduced by the imputation procedure; the standard error is the square root of this sum.

For this procedure to produce proper imputations, two conditions must be satisfied: (1) that the analyst do a good job of predicting missingness, and (2) that if, in the substantive model, missingness is correlated with the outcome variable, the outcome variable be included in the imputation model.

Software that implements multiple-imputation procedures has been written for Stata by Royston (2004, 2005a, 2005b, 2007, building on earlier work by Van Buuren, Boshuizen, and Knook [1999]); to download the necessary -ado- files from within Stata (connected to the network), type -lookup ice- and click the fourth entry, "sj-7-4." (See also useful guides to using Royston's -ado- files written by Academic Technology Services of the University of California at Los Angeles [UCLA]; they will appear in response to the same -lookup- command.) Royston's software makes the process less tedious than it used to be. Nonetheless, implementing multiple imputation can add considerable complexity to your analysis. The difficult and time-consuming part of the work is to choose the predictor variables used to produce estimates of the missing values of each variable that has any missing values.
The essence of the procedure is to specify which variables to include, to create appropriate transformations (dummy variables and interactions), and to specify the relationships among the variables. These details are part of Royston's command -ice-, which should be built using the -dryrun- option so that the logic can be tested before beginning what can be a lengthy computation. The imputation is then carried out, and a data set is saved, consisting of multiple copies of the original data, each of which has complete data because the missing values are imputed. However, in each of the completed data sets, the imputed values generally will differ. This multiple-copy, or multiply imputed, data set can then be used to carry out any analysis, using the command -micombine-. This command carries out the specified estimation procedure, for example, multiple regression, using each of the imputed data sets and then combines the resulting coefficients to produce a single coefficient (usually the average of the coefficients estimated from each of the completed data sets) and a standard error that takes account of the additional uncertainty introduced by the imputation procedure (see Equation 8.1).

Typically, construction of the imputed data set is computationally intensive (in the worked example discussed next, it took about 3.5 minutes on my home computer, which has a 792 GHz processor), but analysis using the imputed data set is nearly as fast as analysis using a single data set, typically a matter of seconds. As you increase the number of imputations, the time required to create the imputed data sets increases arithmetically. As you add variables to be imputed, the time required increases at a faster rate. For example, approximately doubling the number of variables to be imputed increased the time by a factor of four. Perhaps the best way to convey what is entailed in using multiple imputation to create and analyze complete data sets is to carry out an example. This is what follows.
The -do- and -log- files that produced the example are available for downloading. These files contain, just before the imputation step, a discussion of how to specify the -ice- command.
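Before turning to the example, it may help to see the combining step in isolation. The Python sketch below applies the standard form of Rubin's rules to a coefficient estimated separately on each imputed data set: the combined coefficient is the mean of the estimates, and the combined variance is the average within-imputation variance plus (1 + 1/M) times the between-imputation variance. It is an illustration of the rule, not a rendering of the -micombine- code; the function name and the numbers are hypothetical.

```python
import math


def combine_rubin(estimates, std_errors):
    """Combine one coefficient across M imputed data sets by Rubin's rules.

    estimates:  the coefficient from each imputed data set
    std_errors: the corresponding standard errors
    Returns (combined coefficient, combined standard error).
    """
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need at least two imputations and matching lengths")

    b_bar = sum(estimates) / m
    # Average within-imputation variance (squared standard errors).
    within = sum(se ** 2 for se in std_errors) / m
    # Between-imputation variance of the coefficient.
    between = sum((b - b_bar) ** 2 for b in estimates) / (m - 1)
    total = within + (1 + 1 / m) * between
    return b_bar, math.sqrt(total)


# Hypothetical results from M = 3 imputed data sets.
b, se = combine_rubin([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
print(b, round(se, 4))
```

When the estimates agree exactly across imputations, the between-imputation term vanishes and the combined standard error reduces to the within-imputation value; the more the estimates disagree, the larger the penalty for imputation uncertainty.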
A WORKED EXAMPLE: THE EFFECT OF CULTURAL CAPITAL ON EDUCATIONAL ATTAINMENT IN RUSSIA
There is increasing evidence from many nations that the extent to which parents are engaged with the written word, measured by the number of books in the household when the respondent was growing up, is at least as important (and perhaps more important) a determinant of educational attainment as is the amount of formal schooling attained by parents (Evans and others 2005). The reasoning is straightforward: what matters about parental education is not the credentials it brings but the way it affects family life and child rearing. In households where reading is an important activity, children often learn to read at home, enjoy reading, and become good at it, all of which improves their ability to meet the demands of formal schooling. Thus, they tend to do well in school and in consequence to continue their education to advanced levels.

In this example I investigate whether books in the childhood home were important for educational attainment in Russia. I chose Russia for the worked example for two reasons. First, the number of books is probably a good indicator of family reading habits in Russia because the cost of books was very low during the Soviet period (my data pertain to adults surveyed in 1993, just after the collapse of the Soviet Union). Second, as a result of massive casualties during the Second World War, there is an unusually large amount of missing data on parental characteristics in the Russian data set, a national probability sample of 5,002 Russian adults age twenty and over (see Appendix A for details on the data and how to obtain them; see also Treiman and Szelényi 1993, Treiman 1994). The sample is restricted to those age twenty to sixty-nine to avoid understatement of educational attainment by those still in school (fewer than 1 percent of twenty-year-olds were still in school) and differential mortality and morbidity (rendering people unavailable to be interviewed) among those age seventy and older.
This reduces the sample to 4,685. In the present analysis, the data are not weighted, although weights are permissible in -ice-, because weighting introduces an additional complication in comparing results obtained from a casewise-deleted data set and a multiply imputed data set.
Creating the Substantive Model
I first specify a conventional educational attainment model:

$$E = a + b(E_P) + c(E_D) + \sum_{i=2}^{6} d_i(O_i) + e(C) + f(S) + g(M) + h(B) + i(E_P B) \qquad (8.2)$$

where $E$ is the number of years of schooling of the respondent; $E_P$ is the sum of the years of schooling of the father and the mother; $E_D$ is the difference in the number of years of schooling of the father and the mother; the $O_i$ are dummy variables corresponding to the father's occupational category when the respondent was age fourteen; $C$ is year of birth ("cohort"), which captures any effects of the secular increase in education in Russia over the course of the twentieth century; $S$ is the number of siblings, which is known to negatively affect educational attainment (Maralani 2004; Lu 2005; Lu and Treiman 2008); $M$ is scored 1 for males and 0 for females to test the possibility that the two sexes differ in their average education, something that is true in some places but not others; and, finally, $B$ is an ordinal scale measuring the number of books in the household when the respondent was age fourteen (the categories are none, 1 or 2, around 10, around 20, around 50, around 100, around 200, around 500, and 1,000 or more), and $E_P B$ is a product term that captures a possible interaction between parental education and the number of books. I would expect the size of the home library to be more important when parents are less educated, on the ground that well-educated parents are likely to provide school-relevant skills whether or not there is a family culture of reading but that this is less likely when the parents are poorly educated. That is, I would expect that parental education and parental involvement in reading are to some extent substitutes.
TECHNICAL DETAILS ON THE VARIABLES
Parental education. I specify the sum and difference of the years of schooling of each parent rather than simply including the years of schooling of each parent as a separate variable. It can be shown that the two specifications are mathematically equivalent and that either can be derived from the other. But the specification I used is more readily interpretable because it gives the overall effect of parental education plus any additional effect resulting from a difference in the educational level of the parents.

Father's occupation. The occupational categories are from the six-category version of the Erikson-Goldthorpe-Portocarero (EGP) occupational class scheme, modified by Ganzeboom and Treiman (1996).

Number of books. I explored three specifications of this variable: the ordinal scale, midpoints of the number of books indicated by each category, and the natural log of the midpoint scale. Interestingly, the ordinal scale produced the best fit, probably because the log scale excessively diminished the effect of increases in large home libraries.
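The equivalence of the two specifications of parental education can be verified with a line of algebra. In the sketch below, $E_F$ and $E_M$ (the father's and the mother's years of schooling) and $\beta_F$ and $\beta_M$ (their coefficients when entered separately) are symbols introduced here for illustration; the difference is taken as father minus mother, which is the direction implied by the discussion of the results.

```latex
% Sum-and-difference specification, with E_P = E_F + E_M and E_D = E_F - E_M:
\begin{aligned}
b\,E_P + c\,E_D &= b\,(E_F + E_M) + c\,(E_F - E_M) \\
                &= (b + c)\,E_F + (b - c)\,E_M .
\end{aligned}
% Hence the separate-parent coefficients are
%   \beta_F = b + c, \qquad \beta_M = b - c,
% and, conversely,
%   b = (\beta_F + \beta_M)/2, \qquad c = (\beta_F - \beta_M)/2 .
```

Under these assumptions, $b$ is the average of the two parents' effects and $c$ is half the difference between them, so a test that $c = 0$ amounts to a test that the father's and the mother's schooling have equal effects.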
The problem for this analysis is that many of the variables in the model have substantial fractions of missing data. Table 8.1 shows the percentage of cases missing for each variable. If I were to simply drop all cases with missing data, I would be left with only 57 percent of the sample (2,661 cases). Moreover, because it is probable that missingness is correlated with other variables in the model, I would be analyzing a nonrandom subset of the original sample, thus completely undercutting the validity of any claim that the analysis characterizes the educational attainment process in late twentieth-century Russia. Evidence that missingness is not random is to be found in a comparison of the means and standard deviations based on the complete-data (casewise deletion) sample (N = 2,661) and the corresponding statistics computed over all observations available for each variable: in the complete-data subsample, the means of the socioeconomic status variables are generally higher, and the standard deviations are generally smaller than when computations are based on all observations available for each variable. Thus, I turn to multiple imputation of the missing data to create a valid complete data set.
Table 8.1. Descriptive Statistics for the Variables Used in the Analysis, Russian Adults Age Twenty to Sixty-Nine in 1993 (N = 4,685).
[Table body not recoverable from the source: the row labels are lost. Columns: mean (all observations; casewise deletion, N = 2,661); standard deviation (all observations; casewise deletion, N = 2,661); number of nonmissing observations; percentage missing. Table note, partly legible: "... of cases for which responses are not missing."]
Creating the Imputation Model
For each of the variables with any missing data, it is necessary to specify an imputation model, that is, a model predicting values on the variable from the cases for which observations are available. Van Buuren, Boshuizen, and Knook (1999, 687) suggest that although in principle the larger the number of variables in the imputation model the better, in practice (to avoid multicollinearity and computational problems), it is best to limit the predictor set to fifteen to twenty-five variables. They propose as criteria for inclusion:

1. Include (as predictors for each variable with missing data) all variables that will be included in the substantive (complete-data) model.
2. In addition, include (as predictors for a given variable with missing data) all variables thought to affect the missingness of that variable. Such variables can be identified by examining the association between missingness and candidate variables. If the association is not zero or close to zero, include the candidate variable.
3. In addition, include (as predictors for a given variable with missing data) all variables that are strong predictors of that variable. Such variables can be identified by examining the association between the given variable and candidate variables for cases in which the given variable is observed.
4. Remove from sets (2) and (3) those variables that themselves have substantial amounts of missing data.

An intermediate step that I skip in the present exposition is to confirm that the data are not MCAR by predicting missingness from other variables in the data set; if some of the coefficients are nonzero, we have evidence that the data are not MCAR. However, there is no way of deciding empirically whether they are MAR or NI. For each variable, missingness is dichotomous, so the appropriate estimation tool is binary logistic regression. However, because we will not cover this technique until Chapter Thirteen, this part of the worked example is omitted.

In the present case, we need to impute missing data for all variables in Table 8.1 except gender and year of birth (which have no missing data).
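A crude first look at the association between missingness and a candidate predictor (criterion 2) can be had without logistic regression by comparing the candidate's mean among cases that are missing on the target variable with its mean among cases that are observed. The Python sketch below does this; it is only a descriptive screening device, not the binary logistic regression the text recommends, and the function name, field names, and records are hypothetical.

```python
import statistics


def missingness_contrast(rows, target, candidate):
    """Mean of the candidate among cases missing on the target, minus its
    mean among cases observed on the target. Cases missing on the candidate
    are ignored. Raises StatisticsError if either group is empty."""
    among_missing = [r[candidate] for r in rows
                     if r[target] is None and r[candidate] is not None]
    among_observed = [r[candidate] for r in rows
                      if r[target] is not None and r[candidate] is not None]
    return statistics.mean(among_missing) - statistics.mean(among_observed)


# Hypothetical records: the father's education tends to be missing for
# respondents who themselves have little schooling.
rows = [
    {"educ": 8, "father_educ": None},
    {"educ": 9, "father_educ": None},
    {"educ": 10, "father_educ": 7},
    {"educ": 12, "father_educ": 10},
    {"educ": 15, "father_educ": 12},
    {"educ": 16, "father_educ": None},
    {"educ": 17, "father_educ": 14},
]
contrast = missingness_contrast(rows, "father_educ", "educ")
print(contrast)
```

A contrast well away from zero suggests that the candidate is related to missingness on the target and so belongs in the imputation model for that variable; whether a given contrast is distinguishable from zero is a matter for a formal test.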
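Following the criteria of Van Buuren and his colleagues, my imputation model for the variables included in the substantive model is

$$\begin{aligned}
E &= f(E_P, E_D, \textstyle\sum O_i, C, S, M, B, E_P B) \\
E_P &= f(E, E_D, \textstyle\sum O_i, C, S, M, B) \\
E_D &= f(E, E_P, \textstyle\sum O_i, C, S, M, B) \\
O &= f(E, E_P, E_D, C, S, M, B) \\
S &= f(E, E_P, E_D, \textstyle\sum O_i, C, M, B) \\
B &= f(E, E_P, E_D, \textstyle\sum O_i, C, S, M)
\end{aligned}$$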
where the variables are those included in the substantive model defined in Equation 8.2. I do not have to restrict myself to variables included in the substantive model. Rather, following Van Buuren, Boshuizen, and Knook (1999), I might well have chosen additional variables, which predict the independent variables in the model or predict their missingness or do both; generally, this would be advisable. However, in the interest of keeping the example from becoming too complex, I settled for the prediction equations shown.

The -ice- command permits the specification of several different estimation models (OLS regression for continuous variables, and also binary, ordered, and multinomial logistic regression for categorical variables). Because we have not yet covered logistic regression, I ask you to take it on faith that they are the appropriate techniques for dealing with these kinds of variables; these techniques will be exposited in Chapters Thirteen and Fourteen. As it happens, all variables to be imputed are continuous except the father's occupation categories, which are imputed using [...].

A very useful feature of -ice- is its ability to handle "passively imputed" variables, that is, variables such as interaction terms and sets of dummy variables that are mathematical transformations of other variables, some of which include missing data. See Royston (2005a, 191-195) for a description of the procedure and the downloadable files for the chapter for a detailed discussion of how to specify the -ice- command.

Comparing Casewise Deletion and Multiple Imputation Results
Table 8.2 shows regression coefficients, standard errors, and t- and p-values for two models, one estimated using casewise deletion and the other estimated from a multiply imputed data set using Royston's command -micombine-. Although the results are not greatly different, they do lead to different substantive conclusions regarding three of the twelve variables in the model: if we accept the conventional .05 level of significance, we would conclude that all variables in the casewise-deletion model are significant, with the exception of the father's employment in a routine nonmanual job, which results in equal educational chances for offspring as does the father's managerial or professional employment. In particular, we would conclude that it is beneficial if the mother is well educated relative to the father (because, net of the average level of parental education, the more the father's education exceeds the mother's, the lower the educational attainment of offspring). We also would conclude that, as expected, those from larger families get less education, and also that males get less schooling than do females. However, none of these three coefficients is significant at the .05 level in the imputed data.

Interestingly, the size of the home library is the most important variable in the model, as indicated by the standardized coefficients, shown in the rightmost column. But, as predicted, its importance diminishes as parental education increases, as is evident from the negative coefficient for the interaction term.
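The change in conclusions can be checked by hand for the difference in parents' schooling, using the coefficients and standard errors legible in Table 8.2 (-.041 with standard error .020 under casewise deletion; -.026 with standard error .016 under multiple imputation). The Python sketch below computes the ratio of each coefficient to its standard error and a two-sided p-value from the normal distribution, which is a close approximation to the t distribution at these sample sizes; because the inputs are rounded, the results differ slightly from the published t- and p-values. The function name is hypothetical.

```python
import math


def z_and_p(coef, std_error):
    """Ratio of a coefficient to its standard error and the two-sided
    p-value under a normal approximation."""
    z = coef / std_error
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p


z_c, p_c = z_and_p(-0.041, 0.020)  # casewise-deleted data set
z_m, p_m = z_and_p(-0.026, 0.016)  # multiply imputed data set
print(round(z_c, 2), round(p_c, 3))
print(round(z_m, 2), round(p_m, 3))
```

The first p-value falls just below .05 and the second above it, so the same variable is judged significant in one analysis and not in the other, even though the two coefficients have the same sign and similar size.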
Table 8.2. Comparison of Coefficients for a Model of Educational Attainment Estimated from a Casewise-Deleted Data Set [C] (N = 2,661) and from a Multiply Imputed Data Set [M] (N = 4,685), Russian Adults Age Twenty to Sixty-Nine in 1993.

Variable                                   Coefficient      Std. Error       t                p
                                           C       M        C       M        C       M        C       M
Parents' years of school (difference)      -.041   -.026    .020    .016     -2.04   -1.61    .042    .108
Father's occupation (professionals and managers omitted)
  Agricultural manual                      -1.41   -1.42    .23     .25      -6.03   -5.72    .000    .000

[The remaining rows and the column of standardized coefficients are not recoverable from the source.]
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have considered different kinds of missing data, drawing upon a distinction between data that are "missing completely at random" (MCAR), data that are "missing at random" (MAR), and data that are "missing not at random" (MNAR or NI). We explored the properties of each of these missing data types and then considered a number of procedures for handling missing data, including listwise deletion and various types of imputation of missing values. We determined that most of the procedures reviewed produce biased coefficients in prediction models. This circumstance motivates consideration of multiple imputation procedures, in which missing data are imputed several times and the results of each imputation are combined. Multiple imputation offers the best chance of producing results that are free from bias. We then considered, through a worked example (the role of cultural capital in educational attainment in Russia), how to implement multiple imputation procedures, using software written by the British medical statistician Royston.

Thus far we have carried out statistical inference on the assumption that our data were drawn from a simple random sample. However, most surveys, such as the GSS, are not based on simple random samples but rather on complex multistage probability samples. In the next chapter we consider various sample designs and see how to get correct standard errors when we are using data based on stratified or clustered multistage probability samples or both.
SAMPLE DESIGN AND SURVEY ESTIMATION
WHAT THIS CHAPTER IS ABOUT
Thus far we have treated the issue of statistical inference as if we were analyzing simple random samples and our data conformed to the distributional properties assumed by ordinary least squares (OLS) regression. Neither condition is likely to hold in practice. Thus, now that we are comfortable with the manipulation and interpretation of regression models, it is time to expand our analytic toolkit to make correct inferences about data based on the kinds of complex samples typically used in national surveys. We also will want to consider how to identify and, if possible, correct for anomalous features of our data. As we will see, these two topics are fairly closely related.

I begin with a description of types of samples used in survey research. I then discuss the problem of statistical inference created by complex sample designs. Chapter Ten then considers various diagnostic procedures for OLS regression models and some ways of correcting problems revealed by the diagnostic procedures.
SURVEY SAMPLES
As we know from elementary statistics, to generalize from a sample to a population we need some sort of probability sample. For our purposes, there are three basic kinds of probability samples:

Simple random samples, in which every individual in the population has an equal chance of being included in the sample. (The equal-probability-of-selection condition defines a sample as random.)

Multistage probability samples. These are nothing more than complex random samples in which units are randomly sampled and then subunits of the sampled units are randomly sampled, and so on. Examples include area probability samples in which, say, cities and counties are randomly sampled, then blocks within areas, then households within blocks, then persons within households; and school samples in which, say, school districts are sampled, then schools within districts, then classrooms within schools, then pupils within classrooms.

Stratified probability samples, which are also complex random samples. In stratified samples the population is divided into strata on the basis of certain characteristics (race, sex, place of residence, and so on). A probability sample is drawn within each stratum, with the strata often sampled at different rates, for example, with Blacks sampled at a higher rate than non-Blacks to ensure that there are enough Blacks for analysis.
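To make the idea of sampling strata at different rates concrete, here is a minimal Python sketch. It draws a simple random sample within each stratum at a stratum-specific rate and attaches to each selected unit a design weight equal to the stratum size divided by the number selected, so that the weighted sample again represents the population. The function names, the stratum labels, the rates, and the population of 100 units are all hypothetical.

```python
import random


def stratified_sample(population, stratum_of, rates, rng):
    """Sample each stratum at its own rate.

    Returns a list of (unit, weight) pairs, where weight is the number of
    population units in the stratum divided by the number selected from it.
    """
    strata = {}
    for unit in population:
        strata.setdefault(stratum_of(unit), []).append(unit)

    sample = []
    for name, units in strata.items():
        k = round(rates[name] * len(units))
        if k == 0:
            continue  # rate too low to select anyone from this stratum
        weight = len(units) / k
        sample.extend((unit, weight) for unit in rng.sample(units, k))
    return sample


def stratum_of(unit):
    """Hypothetical rule: units 0-19 form the small stratum A, the rest B."""
    return "A" if unit < 20 else "B"


population = list(range(100))
sample = stratified_sample(population, stratum_of,
                           {"A": 0.5, "B": 0.1}, random.Random(1))
print(len(sample), sum(weight for _, weight in sample))
```

Oversampling the small stratum yields enough cases from it for separate analysis; the weights are then needed whenever the strata are combined to describe the population as a whole.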
Simple Random Samples
Let us first consider simple random samples. To draw a random sample requires a list of every individual in the population and a randomizing procedure to select a fraction of the individuals in the population. A typical way of drawing a random sample before the age of computers was to consult a list of random numbers. Table 9.1 shows a small portion of such a list.

Table 9.1. Portion of a Table of Random Numbers.
10480   15011   01536   02011   81647   91646
22368   46573   25595   85393   30995   89198
24130   48360   22527   97265   76393   64809
42167   93093   06243   61680   07856   16376

Suppose we wanted to draw a random sample of 10 people out of a class of 40, using a table of random numbers such as Table 9.1. We would list the 40 people in the class
sequentially, from 1 to 40. Then [...] sample without [...]

[...] is random because by virtue of the [...] an equal chance of being [...]

[...] chosen from the [...] every fourth student in [...] The properties of systematic [...]
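As a rough illustration of the two procedures this passage touches on, drawing 10 people from a numbered class of 40 and taking every fourth student from the list, the following Python sketch may help. It is not from the text: the function names, the seeds, and the use of Python's random module in place of a printed table of random numbers are all assumptions made for illustration.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n members without replacement, each member equally likely."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    return rng.sample(population, n)


def systematic_sample(population, interval, seed=None):
    """Take every `interval`-th member after a random start in the first interval."""
    rng = random.Random(seed)
    start = rng.randrange(interval)
    return population[start::interval]


classroom = list(range(1, 41))  # the 40 students, numbered 1 to 40
srs = simple_random_sample(classroom, 10, seed=1)
every_fourth = systematic_sample(classroom, 4, seed=1)

print(sorted(srs))
print(every_fourth)
```

Both calls return ten students. The systematic version needs only one random draw (the starting point), which is its practical appeal when the list is long.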
Multistage Probability Samples
Simple random sampling is practical [...] circumstances, in particular, when a complete list of the population is [...] central location. [...]
TELEPHONE SURVEYS
Telephone surveys are increasingly widely used because they are less costly than face-to-face interviews. Currently in the United States, a one-hour face-to-face interview in a national probability sample conducted by one of the leading academic survey centers costs about $300 per respondent, compared with about $130 for a telephone interview of the same length. Apart from cost, the main advantage of telephone surveys is that it is possible to access households that are very difficult to access in person, such as those in gated neighborhoods or security buildings, and also in high-crime neighborhoods where people are reluctant to answer the door and interviewers are reluctant to work. On the other hand, it is more difficult to establish rapport and to ask complex questions over the telephone, and telephone interviews must be shorter than face-to-face interviews to minimize respondent fatigue. Moreover, in an era in which few people are skilled at reading aloud, it is difficult to find competent telephone interviewers. A final difficulty with telephone surveys is that respondents are increasingly hostile toward them. Telemarketing has queered the field for legitimate survey research, especially since some telemarketing agencies claim to be doing a survey as a device to draw people into a phone conversation.
Sampling generally is easier with telephone interviews than with in-person household interviews because, in principle, random sampling can be used, by means of random digit dialing. However, procedures must be devised to screen out business telephones and to adjust for multiple-telephone households, and the proliferation of cell phones, fax machines, and caller-ID screening devices has created new difficulties. Still, since almost all households in the United States have telephones, sample bias is not a major problem. In many other countries, where telephone use is much lower, it would be a major problem.
[...] in the case of national samples of the U.S. population in which face-to-face residential interviews are conducted, that is, interviews in which the interviewer goes to the home of the respondent. First, as just noted, there is no national register of the U.S. population, which makes it impossible to draw a simple random sample of the population. Second, even were it possible to draw such a sample, it would be prohibitively expensive to travel to the homes of the selected respondents, who would almost certainly be scattered throughout the nation. Thus both sampling and fieldwork considerations lead us to devise multistage probability samples for national household surveys.
Such samples are created in stages (hence the name multistage probability samples). In the first stage, primary sampling units (cities, counties, and so on) are drawn at random, with probability proportional to size (PPS), that is, in such a way that the chance of each city being selected is proportional to the size of its population. Suppose we want to draw a national sample of two thousand people and decide to do this by choosing one hundred primary sampling units (PSUs) and conducting twenty interviews in each place. Obviously we could not simply list all the cities in the country and randomly choose some of them: this would give people in small towns a much higher chance of being included in the sample than people in large cities. For example, if we randomly chose Los Angeles,
Santa Monica or Beverly Hills (the latter two are small cities in Los Angeles County) and then randomly selected twenty people in the chosen city (assuming we had a list of all residents), any person in Santa Monica or Beverly Hills would have a much higher chance of being included in the sample than would any person in Los Angeles.
So instead we group cities into strata on the basis of their size and randomly sample cities within strata, at a rate proportional to their size. For example, we might group the largest cities into a stratum, large cities into a second stratum, medium-sized cities into a third stratum, and so on. Suppose the population of the largest group averaged two million, the population of the second group one million, the population of the third group five hundred thousand, and so on. We might then randomly choose every city in the first group, every other city in the second group, every fourth city in the third group, and so on. If we then interviewed the same number of people from each selected city, every person in the country would have an (approximately) equal chance of being included:

20/2 million = 1/100,000
20/(1 million/0.5) = 1/100,000
20/(500,000/0.25) = 1/100,000

and so on.
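The arithmetic above can be checked with a short Python sketch. The stratum labels and the function name are hypothetical; the average populations, the sampling fractions for cities, and the twenty interviews per city are those of the example. Exact fractions are used so that the equality is not blurred by rounding.

```python
from fractions import Fraction

INTERVIEWS_PER_CITY = 20

# stratum label: (average city population, fraction of cities sampled)
strata = {
    "largest": (2_000_000, Fraction(1)),
    "large": (1_000_000, Fraction(1, 2)),
    "medium": (500_000, Fraction(1, 4)),
}


def inclusion_probability(city_population, city_sampling_fraction):
    """P(city is chosen) times P(person is chosen, given the city is chosen)."""
    return city_sampling_fraction * Fraction(INTERVIEWS_PER_CITY, city_population)


probabilities = {
    label: inclusion_probability(population, fraction)
    for label, (population, fraction) in strata.items()
}

for label, probability in probabilities.items():
    print(label, probability)
```

Halving the chance that a city is chosen exactly offsets halving the city's size, which is why every stratum yields the same overall chance of inclusion.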
MAIL SURVEYS
Mail surveys are generally undesirable because they tend to [...] of the survey (through registered letters, telegrams, phone calls, and so on). Jonathan Kelley and Mariah Evans have achieved amazingly high response rates in mail surveys carried out in Australia, on the order of 65 percent, by doing extensive follow-up. They also show that nonrespondents to their surveys are essentially no different from respondents (Evans and Kelley 2004, Chapter 20). Such surveys require a sampling frame that includes addresses. This is impossible in the United States but possible in countries that have registration systems, such as Australia, where voting registration is required. Noncitizens are excluded, but the sampling frame is good otherwise.
Another disadvantage of mail surveys is that one cannot ask complex questions or questions that are contingent on responses to previous questions; respondents have difficulty following the logic of complex contingencies, known as filters. On the other hand, one can ask questions with relatively long lists of alternatives because people can handle more alternatives when they can read and refer back to them than when the items are read to them.
A final limitation of mail surveys is that they are vulnerable to being completed by committee, that is, by several members of the household consulting one another. For many topics, this poses little difficulty and may actually be advantageous, as, for example, when life histories are solicited; but where independent responses are required, this is a serious shortcoming.
WEB SURVEYS
In recent years Web-based surveys have become increasingly widely used. In some respects Web surveys are like mail surveys in that they eliminate the interviewer and require a respondent to decide to participate and to complete the survey without the benefit of persuasion by a live person, which, when practiced by a skilled interviewer, can overcome trepidation, boredom, irritation, and other impediments to completing the interview. On the other hand, for the computer literate they are easier to complete than paper-and-pencil questionnaires, at least when they are well designed. They also have the advantage over all other modes in permitting complex filters, in which questions are included or omitted depending on responses to previous questions. In both face-to-face and telephone surveys, filters are used, but they are vulnerable to interviewer error. In paper-and-pencil surveys, using filters is difficult because respondent error is likely.
With respect to sample bias, Web surveys today face the same limitations as telephone surveys did in the United States in the first half of the twentieth century: a strong socioeconomic bias in computer access and computer literacy. In addition, there is no known sampling frame of Web addresses that corresponds to a population of people. Moreover, given the current flood of spam and concerted attempts to intercept it through spam filters, efforts to secure responses from a random sample of Web addresses will likely fail. Hence the usefulness of Web-based surveys is likely to be restricted to situations in which there is a well-specified sampling frame (such as a list of members of an organization) and the ability to address survey questionnaires to named individuals with suitable appeals and inducements to respond and assiduous follow-up efforts to convert nonresponses to responses.
The problem with this method is that the strata may be quite heterogeneous. For example, suppose all cities with populations of one million or more are included in the first stratum. Then, if cities were simply chosen at random, residents of Los Angeles would have only one-third the chance of being included in the sample as residents of San Diego, since the population of Los Angeles is about three times the population of San Diego.
To avoid this problem an alternative procedure is often used: within each stratum, units are sampled PPS. To accomplish this, all the units are arrayed in order according to their size, and the total population is cumulated. Then numbers are drawn at random and units are chosen that include the randomly drawn numbers. For example, suppose we want to sample PPS five of the ten largest cities in California as PSUs, so we can interview one hundred people per PSU. (Here, because of the large variance in the size of the cities, it makes sense either to sample with replacement or to divide Los Angeles, and perhaps San Diego, into portions and treat each portion as a separate city. I have done the former.) Table 9.2 shows the population (here, according to the 1990 census), the cumulative population when cities are arrayed by size, and the percentage of the total population of the ten cities residing in each city.
TABLE 9.2. The Population Size, Cumulative Population Size, and Percentage of the Total Population Residing in Each of the Ten Largest Cities of California, 1990.

City             1990 Population   Cumulative Population   Percentage of Total Population of the 10 Cities
Los Angeles          3,485,398           3,485,398              43.2
San Diego            1,110,549           4,595,947              13.8
San Jose               782,225           5,378,172               9.7
San Francisco          723,959           6,102,131               9.0
Long Beach             429,433           6,531,564               5.3
Oakland                372,242           6,903,806               4.6
Sacramento             369,365           7,273,171               4.6
Fresno                 354,202           7,627,373               4.4
Riverside              226,505           7,853,878               2.8
Stockton               210,943           8,064,821               2.6

Now we need to choose some random numbers. Going to a convenient random number table at the back of one [...] through ninth numbers [...] arbitrarily deciding [...]:

[...]         Beyond the range, so reject
4,204,805     Choose San Diego (since 4,204,805 falls within the range 3,485,399 to 4,595,947)
1,168,953     Choose Los Angeles
[...]         Choose Los Angeles again
[...]         Choose Los Angeles again
6,574,717     Choose Oakland

Note that Los Angeles is chosen three of the five times. (Of course, since the population of Los Angeles is 43 percent of the total population of the ten largest cities in California, we would expect Los Angeles to be chosen about two out of five times on average if we repeated the sampling procedure many times.) We would thus divide Los Angeles into three equally sized sections and treat each of them as a primary sampling unit, together with San Diego and Oakland. By sampling in this way, and repeating the process for smaller units within each primary sampling unit, we ensure that every individual living in the ten cities has an approximately equal chance of being included in the sample, precisely because the chance of the city being included is exactly proportional to the size of the city.
Note that I say "approximately equal." This is because the multistage selection process introduces "lumpiness." Here, for example, each primary sampling unit represents exactly 20 percent of the population, but each city does not contain an exact multiple of 20 percent of the population. Although there always will be some lumpiness, the larger the number of sampling units at each stage, the smaller the problem becomes.
Typically a survey house will use the same primary sampling units repeatedly. For example, the National Opinion Research Center (NORC) changes its primary sampling units every ten years, when the new census data are available (these are needed to determine the population size). NORC does this because it maintains a staff of interviewers in each primary sampling unit and wants to avoid the expense of recruiting and training a new set of interviewers for each survey. The part of a sampling design that is fixed in advance and maintained over time is known as the sampling frame.
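The cumulative-population procedure can be sketched in Python as follows. This is only an illustration under stated assumptions: the populations are those of Table 9.2, the city names are matched to the 1990 census figures and should be treated as an assumption wherever the printed table is illegible, and the function name select_city is hypothetical. A city is chosen when the drawn number falls between the previous city's cumulative total plus one and its own cumulative total; a number beyond the grand total is rejected.

```python
from bisect import bisect_left
from itertools import accumulate

# Populations from Table 9.2, arrayed by size (names are an assumption
# wherever the printed table is illegible).
cities = [
    ("Los Angeles", 3_485_398),
    ("San Diego", 1_110_549),
    ("San Jose", 782_225),
    ("San Francisco", 723_959),
    ("Long Beach", 429_433),
    ("Oakland", 372_242),
    ("Sacramento", 369_365),
    ("Fresno", 354_202),
    ("Riverside", 226_505),
    ("Stockton", 210_943),
]

cumulative = list(accumulate(population for _, population in cities))


def select_city(draw):
    """Return the city whose cumulative range contains `draw`, or None if out of range."""
    if not 1 <= draw <= cumulative[-1]:
        return None  # beyond the range, so reject and draw again
    return cities[bisect_left(cumulative, draw)][0]


for number in (4_204_805, 1_168_953, 6_574_717, 9_000_000):
    print(number, select_city(number))
```

Because each city occupies a stretch of the number line equal to its population, its chance of capturing any one draw is proportional to its size, which is what PPS requires. Sampling with replacement, as in the text, simply means the same city may be returned more than once.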
PHILIP M. HAUSER (1909-1994) was a demographer who spent his entire academic career at the University of Chicago, earning his BA in 1929, his MA in 1933, and his PhD in 1938, all in sociology. He made important organizational and academic contributions to the social sciences, working at the U.S. Bureau of the Census from 1939 to 1947, serving first as Assistant Chief Statistician of the Population Census and eventually as Assistant Director (and as Acting Director from 1949 to 1950). At the Bureau he played a major role in creating the 20 percent sample long form, used in the 1940 census for the first time, as well as methods to reduce the undercount, particularly of Blacks. At Chicago he published on many topics. Perhaps most notable among his publications was a study of mortality differentials by race and class (Kitagawa and Hauser 1973). He also established the University of Chicago Population Research Center and served as its director for thirty years, training more than a hundred PhDs, many of them from developing nations. He is perhaps the only person to have served as president of three major professional associations in the social sciences: the American Sociological Association, the American Statistical Association, and the Population Association of America.
When sampling large, geographically diverse populations, the selection process typically is repeated several times, for successively smaller units. For example, in a 1996 national sample survey of China (Treiman 1998), we divided the country into urban and rural sectors. Then, within each sector, we sampled counties (or their urban equivalents) with probability proportional to size. Within each of the chosen counties, we sampled townships (or zip-code-sized districts of cities ["streets"]), with probability proportional to size. Then within each of the chosen townships we sampled villages (or neighborhoods in cities), with probability proportional to size.
Once small geographical units are identified (for example, villages in rural China or districts or neighborhoods of cities) there are four standard alternatives for choosing individuals to be interviewed:

• Random selection from a population register
• Random selection from a list of addresses (household samples) and further selection within households
• Random walk procedures (another way of selecting households)
• Quota selection
Population Register Samples
In countries that maintain registers of the population (for example, Eastern Europe and China), it is common to randomly sample individuals meeting the study criteria (usually simply those falling within some age range) directly from the population register. This is a very good method because it allows strong control from the office, that is, it makes it very difficult for the interviewers to cheat by filling out the questionnaires themselves. A simple control procedure is to ask the respondent for the exact date of birth. This information typically is in the population register but will be unknown to the interviewer. Thus interviewers cannot make up an answer to this question from their kitchen table.
There are three (related) potential disadvantages to using population registers to draw samples. First, if the register is not kept up to date, it will miss those who tend to move around a lot. Second, often people are officially registered in one place (for example, their home village) but are away working somewhere else for an extended period. Thus they will be interviewed in neither place because it usually is extremely expensive to track them down. This is a major problem in China, where 25 percent of the population of Beijing, and comparable proportions in other cities, is "floating," working in the city but registered in a village. To obtain better records for official statistics (and also, indeed mainly, to maintain tight social control of the population), the Chinese government began in 1994 to require that people residing in a place for more than three months register as "temporary residents"; nonetheless, many people fail to register. A third disadvantage to basing samples on population registers is that the registers are virtually always restricted to the de jure population rather than the de facto population. So a large resident alien population, like Germany's Gastarbeiter (guest workers), will be excluded. This can result in rather odd samples. For example, German samples typically have far too few male unskilled workers because unskilled jobs are almost always done by Gastarbeiter.

Random Samples of Households and Further Selection within Households
In the United States and other countries lacking population registers, the problem is to create a list of people to be sampled within each of the small geographic units chosen. This typically is done in three stages: by enumerating households, sampling them, and then, as part of the interviewing process, randomly choosing one person (or more) per household to be interviewed.
Households are enumerated (listed) by fieldwork staff who walk through the area, locating and recording every occupied dwelling unit. In suburban neighborhoods full of single-family houses, this is pretty easy, although one still has to be careful to include mother-in-law apartments and such. In places where people live in garages, rooms in the backs of shops, and other informal dwellings, it can be very difficult. (Contemporary urban China is such a case. For an account of challenges faced by those trying to conduct sample surveys in such environments, see Treiman, Mason, and others [2006].) It can also be difficult to get into security buildings and gated neighborhoods. This is a problem not only at the listing stage but at the interviewing stage as well.
Once the list is compiled, a random sample of dwelling units is drawn and interviewers are sent to conduct the interviews. The next problem is to randomly select one or more people within the household for interviewing. This is done by the interviewer, who lists all the residents of the household who meet the criteria and randomly selects one (or sometimes more, depending on the design of the study), using what is known as a Kish table (after the sampling statistician Leslie Kish) or similar methods (see Gaziano [2005] for a review of within-household respondent selection techniques). Suppose, for example, that the interviewer is instructed to interview a person aged eighteen to sixty-nine. The interviewer lists all household members between eighteen and sixty-nine and then chooses one by referring to a table of random numbers or using some other device such as choosing the person whose birthday is closest to the interview day.
Household samples have the advantage of capturing the de facto population, the population actually living in a place. But they have three important disadvantages. First, it is fairly easy to cheat, interviewing whomever happens to be available rather than returning to complete an interview with a person chosen but not available. Interviewers are supposed to make a specified number of attempts to complete the interview (typically three) before abandoning the attempt. I picked up cheating in a survey I did in South Africa in the early 1990s by noting that 97 percent of Blacks were interviewed on the first try, a completely unbelievable proportion. (By contrast, about 80 percent of the White, Asian, and Coloured interviews were completed on the first try.) To make it possible to discover such problems, it is a very good idea to build information on the interviewing process into the data collection, for example, by having the interviewer record the date and time of each attempt to complete an interview and the outcome of the attempt, and also by collecting information on at least the age and sex of each household member and including this information in the analytic data set, which permits the analyst to compare the distribution of completed cases to the distribution of household members. In my South Africa study I used such information to determine that men had been undersampled and was able to get the survey house to collect a supplementary sample of men.
A second disadvantage of household samples is that they are not true probability samples of the population, because people in large households have a smaller probability of being selected than do people in small households. For example, in 2002, 34 percent of U.S. households included only one adult, 54 percent had two adults, and the remaining 12 percent had three or more adults (data from the GSS). Obviously the chance of an adult in a single-adult household being included in a sample is twice as great as the chance of an adult in a two-adult household being included.
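The within-household selection step described above can be sketched in Python. This is a minimal illustration, not a Kish table: it simply picks one eligible member at random, and it also records how many members were eligible, since that count is what makes the unequal selection chances visible (one in k for a household with k eligible members). The roster, the names, and the function names are hypothetical; the age range of eighteen to sixty-nine follows the example in the text.

```python
import random


def eligible_members(roster, min_age=18, max_age=69):
    """Return the household members who meet the age criteria."""
    return [person for person in roster if min_age <= person["age"] <= max_age]


def select_respondent(roster, rng):
    """Pick one eligible member at random and report the number eligible."""
    eligible = eligible_members(roster)
    if not eligible:
        return None, 0
    return rng.choice(eligible), len(eligible)


# Hypothetical household roster recorded by the interviewer.
roster = [
    {"name": "A", "age": 72},
    {"name": "B", "age": 45},
    {"name": "C", "age": 19},
    {"name": "D", "age": 12},
]

respondent, n_eligible = select_respondent(roster, random.Random(2024))
print(respondent["name"], n_eligible)
```

Keeping the full roster (at least age and sex) alongside the selected respondent is also what allows the analyst to compare completed cases with household composition, as recommended above.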
We typically convert household samples to person samples by weighting the data by the number of eligible people in the household, normalized to retain the original sample size. This is very easy to do in Stata. For example, suppose that the target population is adults, that we wish to correct for the number of adults in the household, and that we have a count of the number of adults in each household. In Stata we simply specify [pweight = adults] or, depending on the command, [aweight = adults]. Suppose the average number of adults per household was 2.0. Then a household with four adults would get a weight of 2 whereas a household with one adult would get a weight of 0.5. Also, the mean weight would be 1 and the sum of the weights would be N, the number of cases in the sample.
As it happens, weighting the GSS to take account of differential household size makes little difference for most variables, which means that the analysis we have done in previous chapters using the GSS is for the most part not far off the mark. Still, it is important to get it right. Moreover, sometimes correcting for differential household size does matter, for example, when we consider family income. Correctly weighting for differential household size increases the estimate of family income by about 10 percent in the 2002 U.S. GSS (for the evidence, see part 1 of downloadable file "ch09.do"):

unweighted mean: $50,102
mean weighted by household size: $54,880

A third disadvantage of household samples is the increasing difficulty in securing high response rates. In Eastern Europe during the communist period it was common to complete more than 90 percent of the attempted interviews, but response rates dropped with the fall of communism. The same is true in China, where response rates once exceeded [...] percent but have been falling steadily, especially in urban areas where people increasingly live in high-rise apartment buildings with restricted access. The GSS typically gets about a 75 percent response rate, and other U.S. surveys do much worse, which creates a strong possibility that respondents will be a nonrandom subset of the target population. In the GSS, for example, men are usually undersampled relative to women because of differential nonresponse (Smith 1979). Any population estimate in which men and women differ, which is often the case with attitude items, will be biased.
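The normalization of household-size weights can be sketched in a few lines of Python. The sketch only reproduces the arithmetic described above (each weight is the number of adults divided by the mean number of adults); it makes no claim about how Stata handles weights internally. The five households and their incomes are invented for illustration and are not GSS data.

```python
def household_size_weights(adults_per_household):
    """Weight each case by its number of adults, normalized to a mean of 1."""
    mean_adults = sum(adults_per_household) / len(adults_per_household)
    return [adults / mean_adults for adults in adults_per_household]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# Hypothetical respondents: adults in the household and family income.
adults = [1, 1, 2, 2, 4]  # mean is 2.0, as in the text's example
income = [20_000, 30_000, 50_000, 60_000, 90_000]

weights = household_size_weights(adults)
unweighted = sum(income) / len(income)
weighted = weighted_mean(income, weights)

print(weights)
print(unweighted, weighted)
```

With a mean of 2.0 adults, the one-adult households get a weight of 0.5 and the four-adult household a weight of 2, and the weights sum to the number of cases. In this invented example income rises with household size, so the weighted mean exceeds the unweighted mean, the same direction as the GSS result reported above.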
A SUPERIOR SAMPLING PROCEDURE
A superior alternative to [...] and recording the age, sex, and other identifying characteristics of each resident, and then sampling directly from the list of eligible individuals. This approach radically increases costs but is far more accurate than simply listing addresses, because household sizes tend to vary substantially and because, especially in crowded neighborhoods, there often are "doors behind doors," that is, separate households that would be missed without interviewing local residents and inquiring about the presence of such households.
Two ways of improving coverage are typically used: drawing a sample somewhat larger than the target number of completed interviews, to offset nonresponses; and substitution by survey interviewers of a new case, typically from the same small area, when an interview cannot be completed. Both methods increase the number of completed interviews but do nothing to overcome biases that are due to the differential availability of potential respondents.

Random Walk Samples
Random walk samples are a variant of household samples. Within each small area, the interviewer is instructed to start at a particular location (a particular street intersection) and to proceed in a specified way, taking every nth address (or even varying the interval according to a schedule of random numbers) and turning in a specified direction at each intersection. This amounts to doing the address listing on the fly. This is not a desirable method because, in addition to the other weaknesses of household samples, it results in difficult-to-find dwelling units being overlooked even by honest interviewers. Also, cheating is even easier than with conventional household samples because enumeration, household selection, and interviewing are all done by the same person; and typically there is little or no documentation of the potential sample, only those actually interviewed. It is used because it is less expensive than population-register sampling and conventional household sampling. In the first two years of the GSS (1972 and 1974), a random walk procedure combined with a quota sample was used.

Quota Samples
A quota sample is a sample in which the interviewer is instructed to obtain information on a given number of people with specified characteristics: females under forty, females forty and over, working women, and so on. Often quota procedures are combined with multistage probability sampling: small areas are selected using multistage probability sampling methods, and then, within each small area, the interviewer is instructed to obtain interviews to fulfill specified quotas.
In general, quota samples are not a good idea, for two reasons. First, they do not meet the conditions permitting valid statistical inference: they are not a probability sample of any population. Second, they typically produce a biased sample of the population they purport to represent, overrepresenting the kind of people who tend to be available when interviewing is carried out. Still, carefully controlled quota sampling can be useful under conditions in which probability sampling is prohibitively difficult, because in such circumstances coverage of the population might actually be better.
Stratified Probability Samples
Multistage probability samples are sometimes stratified, that is, designed to treat various segments of the population as if they are separate populations. For example, an initial distinction might be made between urban and rural areas, with separate samples drawn from the urban portion and the rural portion. The main reason for stratifying a sample is to ensure that a sufficient number of cases are drawn from each stratum to permit analysis. For example, to get estimates of some phenomenon for each state in the United States it would be necessary to stratify a national sample by state because otherwise only a small number of respondents, or perhaps none at all, would be likely to be chosen from small states. A second reason for using a stratified sample design is to minimize the effect of clustering, a point discussed in more detail later in the chapter.
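A minimal Python sketch of stratification with unequal sampling rates follows. It assumes two invented strata (a large and a small state) and draws the same number of cases from each, so that the small stratum yields enough cases for separate analysis. Each sampled case carries a weight equal to the stratum population divided by the stratum sample size, which is one common way (an assumption here, not a procedure stated in the text) to restore population proportions when the strata are pooled.

```python
import random


def stratified_sample(strata, n_per_stratum, seed=None):
    """Draw up to n_per_stratum members from each stratum, with expansion weights."""
    rng = random.Random(seed)  # hypothetical seed, only for repeatability
    sample = []
    for name, members in strata.items():
        n = min(n_per_stratum, len(members))
        weight = len(members) / n  # each case stands for this many people
        for member in rng.sample(members, n):
            sample.append((name, member, weight))
    return sample


# Invented strata: a large state and a small state.
strata = {
    "large_state": [f"L{i}" for i in range(9000)],
    "small_state": [f"S{i}" for i in range(1000)],
}

sample = stratified_sample(strata, 100, seed=0)
counts = {name: sum(1 for s in sample if s[0] == name) for name in strata}
total_weight = sum(weight for _, _, weight in sample)

print(counts, total_weight)
```

A simple random sample of 200 from the pooled 10,000 would be expected to include only about 20 people from the small state; stratifying guarantees 100.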
SOURCES OF NONRESPONSE
The main reason for nonresponse is failure to start the interview because the interviewer cannot contact the target household (as sometimes happens in gated communities and high-rise apartments), because no one is home, or because the householder refuses to answer the door. For this reason high-quality survey operations often attempt to contact targeted households by mail to explain the survey and pave the way for the interviewer. Once contacted, relatively few people refuse to be interviewed (although refusals are increasing, especially in urban areas), and almost no one terminates an interview after it starts.

DESIGN EFFECTS
The fact that national sample surveys generally are based on multistage area probability samples creates a problem: standard statistical packages, which assume random sampling, tend to understate the true extent of sampling error in the data. The reason for this is that when observations are clustered (drawn from a few selected sampling points), for many variables the within-cluster variance tends to be smaller than the variance across the population as a whole. This in turn implies that the between-cluster variance (the variance of the cluster means, which gives the standard error for clustered samples) is inflated relative to the variance of the same variable computed from a simple random sample drawn from the same population. Reduced within-cluster variance, especially with respect to sociodemographic variables, is typical within the small areas that make up the third stage of multistage probability samples: areas of a few blocks tend to be more homogeneous with respect to education, age, race, and so on than the population of the entire country. The result is that when we use statistical procedures based on the assumption of simple random sampling, our computed standard errors typically are too small. What we need to do is to take account not only of the variance among individuals within a cluster but of the variance between clusters. This is what survey estimation procedures do. (For a useful introduction to such procedures, especially as implemented in Stata, see Eltinge and Sribney [1996]. However, note that Stata's survey estimation procedures have greatly expanded since that paper was published: they are now capable of handling multistage designs with more than two levels, and survey versions of many more estimation procedures are available.)
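The point can be illustrated numerically with a small Python sketch. It compares the standard error of a mean computed as if the data were a simple random sample with one computed from the variance of the cluster means. The cluster-based formula used here is a simplification that assumes equal-sized clusters selected with equal probability and ignores finite population corrections; it is not a description of what Stata's survey commands compute. The data are invented so that each cluster is internally homogeneous.

```python
import math
import statistics


def naive_se(clusters):
    """Standard error of the mean, treating all cases as a simple random sample."""
    values = [value for cluster in clusters for value in cluster]
    return math.sqrt(statistics.variance(values) / len(values))


def clustered_se(clusters):
    """Standard error of the mean based on the variance of the cluster means."""
    means = [statistics.mean(cluster) for cluster in clusters]
    return math.sqrt(statistics.variance(means) / len(means))


# Invented data: four clusters, each homogeneous relative to the whole.
clusters = [
    [10, 11, 12],
    [20, 21, 22],
    [30, 31, 32],
    [40, 41, 42],
]

print(naive_se(clusters), clustered_se(clusters))
```

Because nearly all the variation lies between clusters, the twelve observations carry little more information than four would, and the cluster-based standard error is substantially larger than the naive one.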
To illustrate what can happen to our standard errors when we take account of design effects (the fact that we have a clustered sample), I draw upon some sampling experiments conducted in the course of designing my 1996 national sample survey of China (Treiman and others 1998). Because this survey was to be conducted by sending interviewers from Beijing to each sampling point, cost was a strong incentive to minimize the number of sampling points. However, since China is a very heterogeneous country, it was possible that a highly clustered sample would produce an unacceptably high level of sampling error. To estimate the potential damage that could result from clustering, we conducted some analysis using a 1:100 sample of the 1990 Census of China.
Although we carried out several experiments, I draw upon only a subset to illustrate the potential problem of clustering: a three-stage design for a rural sample. The first stage consisted of fifty counties, chosen randomly with probability proportional to size. In the second stage two villages within each county were chosen randomly with probability proportional to size. In the third stage thirty people between ages twenty and sixty-nine were chosen at random within each village. Altogether, this design created a sample of three thousand people. We also drew a corresponding sample from the urban population. To assess whether the clustered samples produce larger sampling variability than would corresponding random samples of the same population, we computed several statistics summarizing features of the Chinese population and estimated the design effect (deff) for each statistic. Deff is the ratio of the variance calculated taking the clustered sample design into account to the estimated sampling variance from a hypothetical survey of the same size with observations collected from a simple random sample. It also can be thought of as a factor for the sample size; thus
LESLIE KISH (1910-2000) was one of the leading survey statisticians of the twentieth century, publishing the pioneering monograph Survey Sampling (1965), which became the standard for the field. He made major contributions to the development of inference procedures for complex samples and other applications. (Kish invented the deff and meff statistics.) He also helped to found the Institute for Survey Research at the University of Michigan and to design its sample. Born in Poprad, now in Slovakia, of Hungarian parentage, he came with his family to the United States in 1925. His father died shortly thereafter, so he completed his BA in mathematics in night school at City College in New York, studying while helping support his mother and siblings. He also took two years off to fight the fascists in Spain as a member of the International Brigade. After completing his BA he moved to Washington, D.C., where he worked first at the Census Bureau and then at the Department of Agriculture. He then again volunteered for military service, this time in the U.S. Army. In 1947 he moved to the University of Michigan, where, in addition to helping found the Institute for Social Research, and teaching, he completed his MS and PhD. He remained at Michigan for the rest of his life.
[...] sample without [...] error we would obtain from [...] a simple [...] sample (Kish 1965, 259). [...] The leftmost panel of Table [...]
[...] to Offset the Effect of Clustering
[...] and powerful feature of the statisti[...] is that under certain [...] we can more or less completely [...] as well. This homogeneity can be [...] sample. [...] later in this chapter [...]
TABLE 9.3. Design Effects for Selected Statistics, Samples of 3,000 with Clustering (50 Counties as Primary Sampling Units, 2 Villages or Neighborhoods per County, and 30 Adults Age 20 to 69 per Village or Neighborhood), With and Without Stratification, by Level of Education.

                                                  Without Stratification    With Stratification
Statistic                               Sample    Coefficient     Deff      Coefficient     Deff
Mean years of schooling                 Urban         8.45       13.43          8.41        [...]
                                        Rural         5.49        8.22          5.61        0.87
Mean age                                Urban        38.06        2.69         38.21        0.96
                                        Rural        38.55        1.73         38.71        0.99
Mean ISEI score                         Urban        33.35        5.68         33.44        4.87
                                        Rural        24.02        2.44         24.61        0.91
Percent with local registration         Urban        95.37        6.08         94.61        0.87
                                        Rural        [...]        2.13         99.30        0.96
Percent employed                        Urban        81.07        4.19         81.13        0.93
                                        Rural        88.97        2.99         87.47        0.93
Regression of ISEI on years of schooling
  Intercept                             Urban        12.90        4.70         10.67        0.92
                                        Rural        18.91        2.81         17.45        0.96
  b (years of schooling)                Urban         2.42        6.36          2.71        0.90
                                        Rural         0.93        1.80          1.28        0.94
Regression of ISEI on years of schooling, age, and sex
  Intercept                             Urban        16.23        [...]        13.57        0.97
                                        Rural        27.85        1.67         22.88        0.95
  b (years of schooling)                Urban         2.23        3.31          2.54        0.94
                                        Rural         0.49        1.70          0.89        0.94
  b (age)                               Urban        -0.08        2.75         -0.08        0.95
                                        Rural        -0.20        1.50         -0.14        0.97
  b (sex)                               Urban        [...]        1.43          2.81        0.99
                                        Rural         2.31        1.43          4.14        0.98

Note: This is the 1:100 sample of the 1990 Census of China, first version.
HOW THE CHINESE STRATIFIED SAMPLE USED IN THE DESIGN EXPERIMENTS WAS CONSTRUCTED Because stratified samples are just multistage probability samples, with each stratum treated as a separate sample, the procedure for creating such samples is similar to that described earlier, using as an example California cities. To create the Chinese sample, we first divided all county-level units (counties, county-level cities, and districts of large cities) into an urban and a rural sector, using data from the 1990 Chinese census. We treated these sectors as two separate populations: the population of urban China and the population of rural China. Consider the population of rural China first, which consisted of about 2,400 counties. We arrayed these counties in the order of the proportion of the adult population with at least a lower middle school education. We then divided the counties into twenty-five strata of approximately equal size so that counties totaling approximately 4 percent of the population were included in each stratum. We then chose two counties from each stratum, with probability proportional to size, picking the first one at random and the second one systematically by adding half the population of the stratum to the original number and picking the county within which the sum fell, wrapping as necessary. The remaining stages were sampled PPS in the usual way. We then created the urban sample in the same way.
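The selection of two counties within a stratum described above can be sketched in a few lines. This is a minimal Python illustration, not the procedure actually programmed for the survey: the county populations, the function name `pick_two_pps`, and the random seed are all hypothetical.

```python
import random
from itertools import accumulate

def pick_two_pps(populations, start):
    """Return the indices of two units chosen with probability proportional to size.

    `start` is a number in [0, total population). The second pick adds half
    the stratum population to `start`, wrapping past the end of the list.
    """
    total = sum(populations)
    cum = list(accumulate(populations))  # running population totals

    def unit_containing(point):
        # The selected unit is the first one whose running total exceeds the point.
        for i, upper in enumerate(cum):
            if point < upper:
                return i
        raise ValueError("point lies outside the stratum")

    first = unit_containing(start)
    second = unit_containing((start + total / 2) % total)
    return first, second

# Hypothetical stratum of five counties (adult populations in thousands).
counties = [120, 300, 80, 250, 250]

# A random starting point, as in the first pick described in the text.
start = random.Random(0).uniform(0, sum(counties))
picks = pick_two_pps(counties, start)

# A fixed starting point makes the arithmetic easy to follow:
# 900 falls in the last county; 900 + 500 wraps to 400, which falls in the second.
fixed = pick_two_pps(counties, 900)
print(picks, fixed)
```

As long as no county holds more than half the stratum population, the two picks fall in different counties, and larger counties are proportionally more likely to be chosen.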
As noted earlier in the chapter, a second reason for using stratified samples is to sample different subpopulations at different rates. We did this in the Chinese sample. Although for convenience I have presented the Chinese data as if they consisted of two separate samples (an urban sample and a rural sample), the urban-rural distinction may be thought of simply as a second stratification variable. However, because China was about 75 percent rural at the time the survey was conducted, we sampled the urban population at three times the rate at which we sampled the rural population in order to achieve urban and rural samples of the same size, which we wanted for our analysis. The same strategy was used in the 1982 and 1987 GSS to achieve a sample of the Black population of the United States large enough to sustain a separate analysis of Blacks and non-Blacks.
Weighting
When portions of the population are sampled at different rates, the sample is, of course, no longer representative of the entire population. Thus any statistics computed over the entire sample will be biased. For example, if we naively computed the mean level of education in the Chinese sample, we would overstate the true level of education in the Chinese population since the urban population, which was oversampled relative to the rural population, is much better educated. A similar naive computation using the 1982 or 1987 GSS, which oversampled Blacks, would understate the level of education in the population given the lower level of education of Blacks compared with non-Blacks. To correct for such distortions, we weight the data proportionally to the inverse of the sampling rate. For example, in the 1996 Chinese survey, which includes (approximately) three thousand rural cases and three thousand urban cases, to correct for the fact that the urban population was sampled at three times the rate of the rural population we would assign a weight, w_u, to the urban population and a weight, w_r, to the rural population, where w_r = 3w_u. Note that we would not want to simply assign a weight of 0.33 to the urban population and a weight of 1.0 to the rural population since this would result in a weighted sample size of 4,000, whereas the true sample size is 6,000. Rather, we would adjust the weights back to the original sample size by dividing the initial weight by the mean weight, 0.67. (This is, of course, just what we have done to convert household samples to person samples.) Thus we would create a new variable (weight) that has the value 0.5 for urban cases and the value 1.5 for rural cases. This yields a weighted sample size of 6,000 (which is identical to the unweighted sample size), with weighted sample sizes of 1,500 urban cases and 4,500 rural cases, which correspond to their relative population sizes. Then we can compute unbiased summary statistics for the entire population. Note, however, that this procedure overstates the reliability of rural responses (there are actually only 3,000 rural respondents, but we are treating the data as if there were 4,500) and similarly understates the reliability of urban responses.
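The arithmetic of this example can be checked with a short sketch. The Python below reproduces the weights of 0.5 and 1.5 from the sampling rates given in the text; the two mean-schooling figures are invented purely to show the direction of the bias, and all names are hypothetical.

```python
def normalize(weights):
    """Rescale weights so that their mean is 1, making weighted N equal unweighted N."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]

n_urban, n_rural = 3000, 3000

# Initial weights proportional to the inverse of the sampling rate:
# urban cases were sampled at three times the rural rate.
raw = [1 / 3] * n_urban + [1.0] * n_rural
w = normalize(raw)

w_urban, w_rural = w[0], w[-1]
weighted_n_urban = sum(w[:n_urban])
weighted_n_rural = sum(w[n_urban:])

# Invented mean years of schooling, for illustration only.
mean_urban, mean_rural = 9.0, 5.0
naive_mean = (n_urban * mean_urban + n_rural * mean_rural) / (n_urban + n_rural)
weighted_mean = (weighted_n_urban * mean_urban + weighted_n_rural * mean_rural) / len(w)

print(w_urban, w_rural, round(weighted_n_urban), round(weighted_n_rural))
print(naive_mean, round(weighted_mean, 2))
```

With these invented means the unweighted average overstates schooling by a full year, because the better-educated urban cases count for half the sample but only a quarter of the weighted sample.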
WEIGHTING DATA IN STATA Weights can be included in Stata computations [...] sample with a weight variable named WT, we would issue the following Stata command: reg y x [pweight=wt]. Stata permits several kinds of weights; see the User's Guide (StataCorp 2007) for details. In general, probability weights (pweights) are the appropriate choice for stratified probability samples, and these weights are used in Stata's survey estimation commands. However, Stata does not permit probability weights for all commands, and it requires that frequency weights be integers. I thus recommend that, in the relatively rare situations in which it is appropriate to weight data but not to do survey estimation (survey estimation procedures are discussed later in the chapter), analytic weights (aweights) be used whenever probability weights are not permitted. Stata automatically normalizes pweights and aweights to the unweighted total number of cases included in the analysis, which makes it unnecessary for the analyst to carry out this step.
Sometimes more complex weights are devised. For example, in the Chinese case we first corrected for differential household size by using the number of adults in the household as our first weight. Then we devised a weight to correct for oversampling the urban population. We then multiplied the two weights together to achieve an overall weight, which is appropriate since each weight is normed to a mean of 1.0, which is another way of saying that the sum of the weighted data is identical to the sum of the unweighted data.
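A composite weight of this kind can be sketched as follows. The six cases are invented, the names are hypothetical, and the final rescaling of the product is an assumption of this sketch (added so that the weighted and unweighted case counts agree even when the two component weights are correlated), not a step described in the text.

```python
def normalize(weights):
    """Rescale weights so that their mean is 1."""
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]

# Invented respondents: (number of adults in the household, sampled in urban sector?).
cases = [(1, True), (2, True), (3, True), (1, False), (2, False), (4, False)]

# First weight: number of adults in the household, normed to a mean of 1.
hh_w = normalize([adults for adults, _ in cases])

# Second weight: inverse of the relative sampling rate, normed to a mean of 1.
ur_w = normalize([1 / 3 if urban else 1.0 for _, urban in cases])

# Overall weight: the product of the two, rescaled (an assumption of this sketch).
overall = normalize([a * b for a, b in zip(hh_w, ur_w)])

print([round(x, 3) for x in overall], round(sum(overall), 6))
```

Within a sector, a respondent from a two-adult household carries twice the weight of a respondent from a one-adult household, and the weights still sum to the number of cases.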
As noted in the previous chapter, some survey houses construct a complex set of weights to take account of differential nonresponse. That is, they weight the data so that the distribution of key variables (geographic location, sex, age, education, and so on) in the sample conforms to the distribution in a standard population, such as the census. (This procedure is implemented in Stata 10.0 using the -poststrata()- and -postweight()- options in the -svyset- command.) This can be useful when nonresponse rates differ substantially across population groups of interest, but it also is potentially misleading since it assumes that nonrespondents are identical to respondents within the groups formed by the n-way cross-tabulation of the variables used to create the weights.
The use of weights is somewhat controversial. Some argue that you should never weight your data but rather should include in your analysis all the variables used to devise the weights. The claim is that weights sweep problems under the rug, masking effects that should be explicitly modeled. There is much to be said for this position. Certainly, urban-rural distinctions are crucial in China, and racial distinctions are crucial in the United States. Thus, generally it will be far more informative to explicitly describe the urban-rural distinction in China or race in the United States, plus appropriate interactions with other variables, in one's analysis than to weight the data and ignore these distinctions. However, from a practical point of view weighting is sometimes unavoidable, particularly in the computation of descriptive statistics. If we want to accurately estimate educational attainment in China, we do need to weight the data to reflect the oversample of the better-educated urban population; and so on. In addition, it is sometimes tedious to model nuisance effects: factors that might affect the outcome but that are not central to the substantive analysis. The effect of differential household size is such an example. Thus a case can be made for weighting the data to correct for such effects without focusing on them. Of course, the counter is that either they are unimportant, in which case weights are unnecessary, or they are important, in which case they should be modeled explicitly.
Perhaps the most important point to make about weights is that it is imperative that the analyst fully understand the weighting scheme used in the data being analyzed. Often weights are quite complex, and just as often weighting schemes are badly documented. Although it often takes a good deal of effort, full understanding of weighting schemes can save a great deal of trouble, and considerable embarrassment arising from errors in the analysis, down the road. In general, whenever you begin to use a new data set, you should try to obtain as much documentation as possible about the sample design and execution, and then, of course, read it.
Survey Estimation Using Stata
To get correct estimates of standard errors from multistage samples, we need to use estimation procedures specifically designed for such samples. Stata provides a set of survey estimation commands to estimate standard errors for many common statistics, including means, proportions, OLS regression coefficients, and logistic regression coefficients. These commands make it possible to take account of both clustering and stratification at each level of a multistage sample, albeit with some restrictions.
LIMITATIONS OF THE STATA 10.0 SURVEY ESTIMATION PROCEDURE Although Stata 10.0 is much improved over previous versions in its ability to correctly estimate standard errors and design effects for multistage samples, one important limitation remains: because of the way Stata estimates standard errors, the default for strata with only one sampling unit is to report missing standard errors. Stata 10.0 provides three alternatives, which are helpful if only an occasional stratum contains just one sampling unit, although in this case Stata recommends that the offending units be combined with others (Survey Data, 154 [StataCorp 2007]). The alternatives are inappropriate when, by design, each stratum contains only one sampling unit. (Note that in Stata's implementation "the sampling units from a given stage pose as strata for the next sampling stage" [Survey Data, 154 (StataCorp 2007)].) This is the design used by the GSS, which has one PSU per stratum, and is the design used in the 1996 Chinese survey analyzed here, in which one township was sampled per county.
The solution I adopt is to ignore the stage containing only one unit per stratum; but this understates the degree of clustering. For example, in the Chinese case two villages per county were sampled, but both were drawn from a single township; ignoring the township level results in this aspect of clustering not being taken into account. Although not optimal, this solution strikes me as better than ignoring subsidiary stages altogether, which is what Stata did before release 9.0.
To show the effect of using survey estimation procedures, I first repeat the analysis, presented in Chapter Seven, of the determinants of knowledge of Chinese characters, using survey estimation procedures. I then follow with an analysis of race differences in income among U.S. women to show how to do survey estimation for subsamples. I conclude with an analysis of race differences in education in the United States (the same example I used to discuss the decomposition of differences in means in the previous chapter) to show how to do survey estimation when combining several years of the GSS (or, by extension, other data sets). See Appendix A for descriptions of both the Chinese data and the GSS, and see Appendix B for a discussion of how to do survey estimation using the GSS.
A Worked Example: Literacy in China
What follows is a comparison of the regression estimates and standard errors derived two ways: by using survey estimation procedures, and by assuming that the data were from a simple random sample, as we did in Chapter Seven (see Table 9.4). The 1996 Chinese survey analyzed here used a design similar to the design of the sampling experiments described earlier in the chapter, except that in the sample survey we sampled one township per county and two villages per township. (See Appendix A for details on how to access the documentation for the survey, including information on the sample design [Appendix D of the documentation] and how to obtain the data.)
TABLE 9.4. Determinants of the Number of Chinese Characters Correctly Identified on a 10-Item Test, Employed Chinese Adults Age 20-69, 1996 (N = 4,802).

                          Unweighted        Weighted                 Design Based
Variable                  b      s.e.      b      s.e.      b      s.e.    Deff    Meff    Meff_w
Years of schooling      0.378   0.006    0.393   0.006    0.393   0.010    2.99    2.93    [...]
Father's education      0.002   0.006    0.009   0.007    0.009   0.007    [...]   1.45    [...]
[...]                   0.206   0.046    0.211   0.057    0.216   0.055    [...]   1.42    [...]
[...]                   0.281   0.045    0.177   0.054    0.172   0.050    0.88    1.21    [...]
[...]                   0.366   0.037    0.385   0.044    0.385   0.049    1.70    1.80    1.12
[...]                   0.759   0.101    0.866   0.118    0.872   0.129    1.53    1.64    1.14
Intercept               [...]   0.040    0.546   0.039    0.544   0.060    3.25    2.31    [...]
R²                      0.681            0.687            0.688
s.e.e.                  1.24             1.24             [...]

Note: All variables are significant at or beyond the .001 level except for father's education. For the unweighted data, p = .690. For the weighted data, p = .195, and for the design-based analysis, p = [...].
Stata requires that information regarding the properties of the data be set before specifying estimation commands. Once this is done, using the -svyset- command, estimation is carried out in the usual way, except that the survey version of the estimation command is substituted for the nonsurvey version. The specific commands used to do survey estimation for the Chinese literacy example are shown in downloadable file "ch09.do" (Part 2). See also the -log- file "ch09.log" for the output.
The Stata 10.0 survey estimation commands provide four design effect statistics: meff, the misspecification effect; deff, the classic design-effect statistic developed by Kish (1965) and discussed earlier in the chapter; and meft and deft, which are the approximate square roots of meff and deff. Of these I find the first two the most useful. These coefficients are reported in Table 9.4, which also includes three estimates of the determinants of literacy in China: regression coefficients assuming simple random sampling [...] (right panel, "Design Based"). Finally, the table shows another design statistic, which I call meff_w.
Meff
Meff is the ratio of the sampling variance (the square of the standard error) computed using the design-based estimation command to the sampling variance computed on the assumption of unweighted simple random sampling. Meff thus informs us just how badly we would err in our estimates of sampling variance were we to naively compute statistics without taking account either of clustering or of differential sampling rates, as we have done in previous chapters; for the current example, these are the computations shown in the first two columns of Table 9.4. Note that as specified by the definition of meff, in the first row meff = 2.93 = 0.010²/0.006² (or, precisely [from the downloadable -log- file], 0.0095421²/0.00557672²), where the ratio is formed by the squared standard error estimated using design-based estimation divided by the squared standard error assuming simple unweighted random sampling. Sometimes, as in this example, the underestimate of the sampling variability can be substantial; thus it would be completely inappropriate to use naive estimating procedures for the Chinese data.
It is not sufficient to weight your data but to ignore clustering and stratification, as the computations in the rightmost column of Table 9.4 demonstrate. This coefficient, meff_w, gives the ratio of the design-based sampling variance to the sampling variance estimated by weighting the data but not taking account of clustering or stratification. (Because this coefficient is not among the Stata options, having been created by me for heuristic purposes, it must be computed by hand, unless you program Stata to do it for you. See downloadable file "ch09.do," Part 2, to see how I used Stata to do the computations.) As is evident, the variance estimates can be quite different. Thus once again we see the importance of taking account of the sampling design to get correct estimates of the standard errors of our coefficients.
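The definition of meff as a ratio of squared standard errors is easy to verify numerically. The sketch below uses the standard errors for years of schooling as best they can be read from the text; the function name `variance_ratio` is hypothetical, and the same function would serve for meff_w if the weighted-only standard error were supplied as the denominator.

```python
def variance_ratio(se_design, se_other):
    """Ratio of two sampling variances, computed as the squared ratio of standard errors."""
    return (se_design / se_other) ** 2

# Standard errors rounded to three decimals, as they would appear in a table.
meff_rounded = variance_ratio(0.010, 0.006)

# Standard errors carried to more digits, as read from the log file quoted in the text.
meff_precise = variance_ratio(0.0095421, 0.00557672)

print(round(meff_rounded, 2), round(meff_precise, 2))
```

The rounded standard errors give a ratio near 2.8 and the fuller values give 2.93, which suggests that design statistics are best computed from unrounded output rather than from tabled values.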
Deff
As noted earlier in the chapter, deff is the ratio of the design-based estimate of the sampling variance of a statistic that has been collected under a complex survey design to the estimated sampling variance from a hypothetical survey of the same size with observations collected through simple random sampling. Thus meff differs from deff in that it gives the ratio of the sampling variances obtained from our data under two conditions: (1) when we use design-based estimation to account for clustering and sample weights and (2) when we ignore clustering and weights and estimate statistics appropriate for simple unweighted random samples; deff, by contrast, gives the ratio of the design-based sampling variance to the sampling variance that we would expect if we had actually carried out a survey using a simple random sample. In this sense meff is mainly of didactic value because it reveals the consequences of naive estimation.
Deff can be thought of as a variance inflator, indicating the extent to which the sampling variance is inflated because of the clustering of the observations in the sample. Because the standard error is a function of the square root of the sample size, deff also can be thought of as indicating how much larger than a simple random sample a sample based on the clustered design would have to be for both samples to yield standard errors of the same size. In the present case, despite our best efforts to stratify the sample, we still have a relatively large deff for years of school completed: 2.99. This implies that with respect to the measurement of years of schooling, our clustered sample of about 6,000 cases has the precision of a simple random sample of about 2,000 cases. Although this is a great improvement over the design effects of 8.22 and 13.43 that we obtained for the rural and urban samples in our design experiment based on the 1990 Chinese census, it still is quite large. Fortunately, none of the remaining variables in the model has design effects nearly as large (although the intercept does).
In the course of carrying out analysis of an existing survey there is, in fact, little reason to compute deff because deff provides information that is useful primarily in designing a new survey (as in the Chinese census analysis discussed earlier). Rather, for samples for which we have adequate information on the design, we simply carry out our analysis in the standard way, but using the survey estimation commands rather than commands that assume simple random samples. Unfortunately, such information often (indeed, usually) is not included in survey documentation, especially for older surveys.
In such cases there is a next-best approach. You can approximate design effects by treating your sample as somewhat smaller than it actually is, by weighting your data by 0.75 or 0.67 or 0.50 (this is easily accomplished, either by creating a weight variable = 0.75, 0.67, or 0.50, or whatever you judge the reciprocal of the design effect to be; or by multiplying any existing weight variable by your judgment of the reciprocal of the design effect). Weighting by 0.75 is tantamount to assuming that each statistic in your analysis has a design effect of 1.33 (= 1/0.75). Because, as we have seen, design effects can vary substantially, this is hardly an optimal solution, but it is superior to blithely assuming that the multistage probability samples upon which almost all survey data are based are as precise as simple random samples, which is what we do when we make no correction for design effects.
In the GSS the design effect is typically about 1.5 for attitude items and about 1.75 for sociodemographic items, which tend to vary more across clusters (Davis and Smith 1992). So we could get an approximation to the correct standard errors by weighting our sample by the reciprocal of the design effect, for example, weighting the GSS by 0.57 (= 1/1.75), to be conservative. However, in recent years the GSS has included the variable SAMPCODE, which permits the use of survey estimation procedures. The GSS uses a complex design, which, moreover, has changed somewhat every decade with the shift to a new sampling frame. Hence the correct use of survey estimation procedures for trend analysis is a somewhat cumbersome business even when, as is reasonable, years are treated as strata. However, for the analysis of a single year of the GSS, the task is somewhat easier. (See Appendix B for a discussion of how the sample design of the GSS has changed over time and the implications of the changes for how to do survey estimation using the GSS.)
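The relations among the design effect, the equivalent simple random sample size, and the downweighting factor can be laid out in a short sketch. The figures are those quoted in the text; the function names are hypothetical, and the code is plain arithmetic rather than anything Stata computes internally.

```python
import math

def effective_n(n, deff):
    """Size of the simple random sample with the same precision as a clustered sample of size n."""
    return n / deff

def downweight(deff):
    """Reciprocal of the design effect, usable as a constant weight."""
    return 1 / deff

def se_inflation(deff):
    """Factor by which a simple-random-sampling standard error must be multiplied."""
    return math.sqrt(deff)

n_eff_schooling = effective_n(6000, 2.99)   # Chinese sample, years of schooling
gss_weight = downweight(1.75)               # conservative GSS downweighting factor
implied_deff = 1 / 0.75                     # design effect implied by weighting by 0.75
gss_se_factor = se_inflation(1.75)          # standard errors grow by this factor

print(round(n_eff_schooling), round(gss_weight, 2), round(implied_deff, 2), round(gss_se_factor, 2))
```

A design effect of 1.75 thus shrinks the effective sample to 57 percent of its nominal size and widens standard errors by roughly a third.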
AN ALTERNATIVE TO SURVEY ESTIMATION If there is [...] with the -robust- and -cluster- options. This will produce standard errors that are (within rounding error) identical to the estimates produced by the survey estimation command when no strata are specified. That is, the -robust- and -cluster- options take account of clustering but not of stratification. In general, but not always, failure to take account of stratification will produce larger standard errors.
This approach may make it possible to provide a partial correction for clustering even when information on the sample design is not available in the survey documentation. Because almost all large population surveys are clustered on the basis of geography, you may be able to use geographic place identifiers as a cluster variable. In addition, you may have a data set that includes information on households and also on several individuals within a household. In such cases you can treat the household identifier as a cluster variable (in addition to any geographic identifier in the data set).
HOW TO DOWNWEIGHT SAMPLE SIZE IN STATA When [...] reflect an approximate design effect by inflating the standard errors. To accomplish this in Stata, use the [iweight] specification, which creates weights that are not renormed to the sample size. Note that using the [iweight] specification is the equivalent of using [aweights] rather than [pweights], but using [aweights] when doing model estimation is in general incorrect and typically will produce smaller standard errors than when [pweights] are used. So this clearly is a suboptimal solution. It thus is generally well worth the (often considerable) effort to determine the actual sample design and to obtain the variables necessary to implement survey estimation procedures that correctly reflect complex sampling designs, even if it means imposing upon original investigators who have gone on to other research and do not want to be bothered trying to document something they paid little attention to in the first place.
Analysis of Subpopulations: Effect of Education and Race on Income Among Women
One special feature of the design-based estimation procedure in Stata needs to be highlighted: when analysis is restricted to a subset of the data, it is inappropriate simply to exclude cases not meeting the selection criterion. The reason for this is that the sample design features pertain to the entire sample, not the subset selected for analysis. Stata correctly handles analysis of subsamples through the -subpop- option in the estimation command (see Part 3 of the downloadable files "ch09.do" and "ch09.log"). To illustrate the use of this option and to further illustrate how the use of survey estimation can change substantive conclusions, I here carry out a simple analysis, using the 1994 GSS, of the effect of education and race on income among women.
The 1994 GSS is a stratified multistage sample. The units for the first stage were 2,489 U.S. metropolitan areas and nonmetropolitan counties divided into 100 strata with one PSU per stratum selected at random with probability proportional to size; then, within PSUs, 384 second-stage units (groups of blocks) were chosen PPS, and in some instances a third-stage selection was made as well. However, the documentation for the GSS identified only the PSUs, using the variable SAMPCODE. Because, as noted, Stata does not permit the specification of one PSU per stratum, I set the PSU but not the strata, and I treated the analysis as including only one sampling stage. This procedure probably underestimates the true standard errors but is the best option given the documentation available.
Table 9.5 shows the coefficients and standard errors for three models, each estimated three ways: treating the sample as if it were a simple random sample of the population; weighting the sample to correct for differential household size; and taking account of the clustering created by the first stage of the multistage design used in the GSS. In these models education is expressed as a deviation from the mean years of schooling of women in the sample.
Considering first the contrasts among models, it is evident that adjusting for differential probabilities of selection resulting from differential household sizes has a nontrivial effect on the results. The estimation that treats the data as an unweighted simple random sample (panel I) yields a significant increment in R² for Model 3 in contrast to Model 1, leading us to conclude that the determinants of income differ for Black and non-Black women. By contrast, neither the weighted (panel II) nor survey (panel III) estimation procedure yields the same conclusion. From the weighted and design-based estimates, we would be led to accept the null hypothesis of no racial difference in returns to education for women. Here is a case where taking account of the imprecision caused by treating household samples as if they are person samples changes a substantive conclusion in an important way.
An alternative way to do this without weighting the data would be to introduce a set of dummy variables for the number of adults in the household, plus interactions between the set of dummy variables and, respectively, race and education, and perhaps three-way interactions as well. Unless the focus of the analysis is on how race and education differences vary by the number of adults in the household, this alternative strikes me as excessively complex and tedious. I think the example makes it clear why,
TABLE 9.5. Coefficients for Models of the Determinants of Income, U.S. Adult Women, 1994, Under Various Design Assumptions (N = 1,015).

Column headings: Education (b, s.e., p); [...]; Interaction; Intercept

Assuming simple random sample
  [...]      2,548   205   .000   -10   1,419   .994   1,755   [...]
Assuming weighted random sample
  [...]
Assuming weighted and clustered sample
  [...]      2,656   [...]   .000   1,251   1,772   .480   [...]
  Model 3    2,419   [...]   .000   [...]   18,001

F-tests (Model 3 versus Model 1): F, d.f.(1), d.f.(2), p
  [...]   2   1,011   [...]
  1.22   [...]   .2951
  2.14   [...]   .1212

Note: b is the net regression coefficient; s.e. is the standard error of the coefficient; p is the associated probability; d.f.(1) and d.f.(2) are the numerator and denominator degrees of freedom; and F is the value of the F-statistic for the contrast between models.
in general, we want to do survey estimation when we have the information to be able to do so.
Note also that the R²s for corresponding models in panels II and III are identical even though the standard errors differ. This follows from the definition of R² as a function of the ratio of variance around the regression surface to variance around the mean of the dependent variable. Because the point estimates are the same for panels II and III, the R²s are also the same, even though in panel III the point estimates have wider confidence intervals.
Note also that I have not shown BIC estimates. Although it is legitimate to compute BIC for simple random samples, as we did in Chapters Six and Seven, BIC is not appropriate for weighted or clustered samples. For such designs, pseudo-likelihood functions are estimated, which may be substantially different from true likelihoods and may even vary in a non-monotonic way across nested models. Thus neither likelihood ratio tests nor BIC, of which the log likelihood is a component, should be used to compare models for weighted or clustered data. Rather, Wald statistics, implemented in Stata as the -test- and -svytest- commands, should be used. (For a discussion of maximum likelihood estimation, which is used by most of the procedures we will explore in Chapters Twelve through Fifteen, see Appendix 12.B.)
Combining GSS Data Sets for Multiple Years

Previously I suggested that under some circumstances it is useful to merge several samples drawn from the same population into a single data set. In particular, if it can be assumed that a social process is consistent over time, it would be reasonable to combine GSS samples drawn in different years to increase the number of cases. I did this in the worked example in Chapter Seven, decomposing the difference between two means. Here I use the same data in a slightly modified way to study racial (non-Black versus Black) differences in educational attainment over the period 1990 through 2004. The point of the present exercise is to illustrate what is entailed in combining several data sets (see downloadable file "ch09.do," Part 4, for the Stata code).

In carrying out this analysis, I treat year as the stratum variable, on the ground that the sample for each year is fixed. I then manipulate the data a bit to create a weight variable that is consistent across years. (See the downloadable file for details on this process.) Having appropriately weighted the data, I carry out survey estimation in the usual way. The results are shown in Table 9.6.

For our present purposes both the deff and meff coefficients are instructive. The largest deff tells us that in our estimation of the coefficient for Southern origins, we have the same efficiency as a random sample of 8,754 (= 15,932/1.82). Of course, since our sample is so large (because we have combined data from eight GSS samples), we still have the equivalent of a very large sample. The meff coefficients also are large, especially for mother's years of schooling. This once again suggests that naive analysis that takes no account of weighting or of clustering can be misleading, although again the very large size of the sample protects us. Although the results are substantively interesting, I forgo further commentary on them since it would largely repeat the discussion in Chapter Seven.
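The arithmetic behind the "random sample of 8,754" can be sketched in a few lines. This is a minimal illustration, assuming only the two figures quoted in the text (15,932 cases and a deff of 1.82); the function name is hypothetical.

```python
def effective_sample_size(n_cases, deff):
    """Size of the simple random sample that would give the same precision
    as a complex sample of n_cases with design effect deff."""
    if deff <= 0:
        raise ValueError("deff must be positive")
    return n_cases / deff


n_cases = 15932       # pooled GSS cases reported in the text
largest_deff = 1.82   # design effect for Southern origins, from the text

n_eff = effective_sample_size(n_cases, largest_deff)
print(round(n_eff))   # 8754
```

A deff of 1.0 would leave the sample size unchanged; the larger the design effect, the smaller the equivalent simple random sample.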
Table 9.6. Coefficients of a Model of Educational Attainment, U.S. Adults, 1990 to 2004 (N = 15,932).
Predictor Variable
Coef.
s.e.
Meff
Mother's years of
0.288
.0i0
2.27
-0.133
.010
1.59
-0.531
.065
o.434
.331
1.31
1.75
.023
1.32
1.7A
' I.13
1.20
5_CC
'u-:er of siblings hrern rUE'5
residence,
ia:. nE<'.nothe13 Er trj school 3 a:. -jiblings
0.057
.021
l[{-Southern
-0.006
.1 5 4
1.47
10.961
.1 3 5
2.26
E€CE
lL
0.'182
CONCLUSION

The lesson to be drawn from each of the analyses in this chapter is that we are likely to underestimate the degree of sampling variability if we fail to take account of and correct for the fact that large sample surveys typically use multistage designs that result in substantial clustering of observations. Note that this is true not only of area probability samples but also of organizationally based samples, such as samples of students (often selected by first selecting schools, then classrooms, then individuals within classrooms), hospital or clinic patients, and so on. Survey estimation procedures should be used for such surveys as well.

Even when complete information on the sampling design is unavailable, which is unfortunately all too common, it sometimes is possible to approximate the design by using information on the place where the interview was conducted, since almost all surveys are clustered by place. Analysts are well advised to explore their data sets for information that will enable them to approximate the sampling design, and then to use the design-based procedures available in Stata, to avoid overstating the reliability of their findings. Understating the sampling error, and thus increasing the chance of Type I error (rejecting the null hypothesis when it is true), is the usual consequence of treating multistage samples as if they were simple random samples.

Most of the standard statistical procedures we deal with in this book have survey-based versions, and these should be used whenever they are available. For procedures not yet included in the package of survey-based estimation commands, it may be possible to approximate survey-based estimation by using [pweights] and the -cluster- option, along the lines suggested in this chapter. Also, where there is only one sampling stage and no information on any stratum variables, the -cluster- option used with non-survey estimation procedures yields results identical to the survey estimation procedures discussed in this chapter (except when a subpopulation is analyzed, in which case survey estimation procedures should be used).
WHAT THIS CHAPTER HAS SHOWN

This chapter has taken us from "textbook" analysis, in which simple random sampling is assumed, to the kinds of samples actually used in social research and the implications of such sample designs for statistical analysis. We reviewed the main types of samples, with particular focus on multistage probability samples; the idea of stratifying samples, both to reduce sampling error and to gain sufficient cases for small subgroups of the population to permit analysis; and the conditions under which it is necessary or desirable to compute weighted estimates. We then turned to survey estimation, a set of procedures for correctly estimating standard errors that take account of the design features of the sample, particularly clustering of cases. Finally, we considered how to interpret two statistics, deff and meff, that quantify the effect of departures from random samples for sampling error.
CHAPTER TEN

REGRESSION DIAGNOSTICS

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider ways of identifying, and under some circumstances correcting, troublesome features of our data that might lead to incorrect inferences. We do this by reanalyzing one of my published papers as a way of seeing how to apply and interpret various regression diagnostic tools.

Apart from not adjusting our standard error estimates to take account of complex sample designs of the kind we considered in the previous chapter, there are other ways we can be led to incorrect inferences. Even if we are completely attentive to the complexity of our sample, we may err either because we have specified the wrong model or because our sample includes anomalous observations, topics we briefly touched on in an earlier chapter. Here I give these topics extended treatment.

I treat these possibilities together both because, in at least some cases, the same set of observations may be thought of either as anomalous with respect to some posited process or as conforming to some other process that can be captured by introducing additional variables or changing the functional form of one or more predictors, and also because the same methods can be used to detect and correct both sets of problems. First we consider regression diagnostics, a set of procedures for detecting troubles. Then we consider robust regression, an approach to correcting a subset of these troubles. This discussion relies on Fox (1997, Chapters 11-12) and the discussion of various regression diagnostics in the Stata 10.0 section "Regress Postestimation" (StataCorp 2007). I recommend these sources for further study.
INTRODUCTION

To illustrate the kinds of troubles that can befall the naive analyst who is inattentive to the properties of his or her data, consider the four scatter plots shown in Figure 10.1. These plots were contrived to produce the same regression estimate (slope and intercept), the same correlation between the variables, and the same standard error of the regression coefficients. However, only plot (a) is reasonably summarized by a linear regression line. Plot (b) shows a curvilinear relationship. Plot (c) shows a linear relationship with one value that distorts what is otherwise a perfect linear relationship. Plot (d) shows a data set with variance in X and the slope relating Y to X created entirely by a single point (where X is the variable on the horizontal axis and Y is the variable on the vertical axis). Clearly there is a cautionary lesson in this: it is a very good idea to visually inspect the relationships among variables to ensure that your specified model adequately captures the true relationships in your data.

FIGURE 10.1. Four Scatter Plots with Identical Lines. Source: Anscombe 1973, 19-20.

Apart from these examples we need to be sensitive to still other ways our regression models may fail to adequately capture relationships observed in our data. In particular, important variables may be omitted from our model, as illustrated in Figure 10.2. Here it is obvious that the regression of Y on X is misleading, because the three middle observations have expected values of Y three points higher than those to the left and right. It should be evident that an equation of the form

$$Y = a + b(X) + c(Z) \tag{10.1}$$

where Z is scored 1 for the three middle observations and is scored 0 otherwise, perfectly predicts Y. Visual inspection of a scatter plot such as in Figure 10.2, or a component-plus-residual plot, discussed later in this chapter, can sometimes reveal the need for additional variables, although empirical examples are not usually so clear cut.

FIGURE 10.2. Scatter Plot of the Relationship Between X and Y and Also the Regression Line from a Model That Incorrectly Assumes a Linear Relationship Between X and Y (Hypothetical Data).

Still another potential problem is heteroscedasticity, unequal error variance around the regression surface at different predicted values, which results in inaccurate standard errors of regression coefficients. Heteroscedasticity is fairly common because in many cases the variance of observations increases with the mean. Fortunately, modest violations (the largest error variances less than ten times as large as the smallest) have little effect on the standard errors. Still, we need to check for large violations.

To detect anomalous relationships in our data in the case of simple two-variable models, we can simply plot the relationship between X and Y as in Figures 10.1 and 10.2. However, for multiple regression equations, zero-order scatter plots between each of the independent variables and the dependent variable are likely to miss important anomalies. Thus we need to exploit a set of additional procedures known collectively as regression diagnostics.

Nonetheless, it is useful to start with a very simple example, if only to illustrate how serious the problem can be in actual research situations. Suppose an analyst fails to notice that missing data are represented by very large codes (recall the earlier boxed comment "Treating Missing Values as if They Were Not"). Consider the relationship between number of siblings and years of school completed in the 1994 GSS.
For both SIBS and EDUC, code 98 = "Don't know" and code 99 = "No answer." If we naively assumed that the data were complete and correlated the two variables, we would conclude that the amount of education obtained is unrelated to the number of siblings, because r = .006. Excluding the missing data from both variables yields a more plausible estimate: r = -.246.

What can we do, apart from simply being alert and careful, to protect ourselves against such an error? The first and most obvious step is simply to make and inspect a scatter plot of the relationship between the two variables. As we have seen in Figure 10.1, such scatter plots are enormously instructive, not only in revealing gross errors such as the inclusion of missing value codes but also in indicating other anomalies in the data: curvilinear relationships, discontinuities, patterns that suggest the possibility of omitted variables, and heteroscedasticity. Figure 10.3 shows a plot of the relationship between number of siblings and years of school completed, in the 1994 GSS.

FIGURE 10.3. Years of School Completed by Number of Siblings, U.S. Adults, 1994 (N = 2,992).

This plot immediately reveals trouble and would do so just as clearly if the numerical value of the "NA" category, 99, were shown. Inspecting the plot, we see that the missing values must be omitted. Doing so results in the plot shown in Figure 10.4, with the regression line included. Note that the new plot is based on 2,975 cases, a reduction of only 17 cases, but the regression estimate differs substantially. Because so few cases are missing, we need not be concerned about imputing the missing data (recall the discussion in Chapter Eight). However, even after omitting missing cases, we still need to be concerned about the possibility that the regression estimate is unduly influenced by the relatively small number of respondents with many siblings. The following section addresses how to assess such a possibility.

FIGURE 10.4. Years of School Completed by Number of Siblings, U.S. Adults, 1994.
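The damage done by leaving missing-value codes in the data is easy to reproduce. The sketch below uses a tiny made-up data set, not the GSS, in which the relationship is perfectly negative until two cases carrying the codes 98 and 99 are added; the variable and function names are illustrative only.

```python
import math

MISSING_CODES = {98, 99}  # "Don't know" and "No answer", as in the text


def pearson_r(xs, ys):
    """Pearson correlation computed from deviation scores."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def drop_missing(xs, ys, codes=MISSING_CODES):
    """Keep only the cases with valid values on both variables."""
    pairs = [(x, y) for x, y in zip(xs, ys) if x not in codes and y not in codes]
    return [x for x, _ in pairs], [y for _, y in pairs]


# Hypothetical cases: the last two carry missing-value codes.
sibs = [0, 1, 2, 3, 4, 5, 6, 7, 99, 2]
educ = [16, 15, 14, 13, 12, 11, 10, 9, 12, 98]

r_naive = pearson_r(sibs, educ)                 # codes treated as real values
r_clean = pearson_r(*drop_missing(sibs, educ))  # codes excluded

print(f"naive r = {r_naive:.3f}, r after dropping missing codes = {r_clean:.3f}")
```

Two contaminated cases out of ten are enough to pull a perfect negative correlation most of the way toward zero, which is the same pattern the text reports for the 1994 GSS.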
A WORKED EXAMPLE: SOCIETAL DIFFERENCES IN STATUS ATTAINMENT
The failure to omit missing cases in the preceding example is a particularly blatant error, easy to note and easy to fix. Sometimes, however, errors are more subtle. Thus we need a set of procedures for detecting anomalies in our data. Regression diagnostic procedures are at present not very well systematized. There are many graphical methods and tests, often doing more or less the same thing, and considerable confusion about nomenclature (the same procedures are called by different names, and the same names are used for different procedures). I have illustrated a subset of these procedures that seem useful, favoring those that are or easily can be implemented in Stata. (For a useful exposition of regression diagnostic procedures, see Bollen and Jackman [1990].)

As a concrete example of how to carry out regression diagnostic procedures, I reanalyze an article I completed with a former graduate student, Kam-Bor Yip (Treiman and Yip 1989). In this multilevel analysis we were interested in how macrosocial characteristics affect the process of status attainment. For a very simple model predicting men's occupational status from their fathers' occupational status and their own education in eighteen nations, we hypothesized that the effect of the son's education should be stronger, and the effect of the father's occupational status weaker, in more-industrialized countries and in countries with less income inequality and less educational inequality in the father's generation.

The first step, after converting all of our data to a common metric, was to estimate the micromodel separately for each nation. The second step was to predict the size of the regression coefficients resulting from the first step, using measures of industrialization and inequality.
This sort of two-step estimation procedure, although statistically suboptimal, is conceptually clear. (For statistically optimal multilevel procedures, see Raudenbush and Bryk [2002], and for a brief introduction to multilevel analysis, see the discussion in Chapter Sixteen.) Here I reanalyze the results shown in Equation 7 of Treiman and Yip:

$$b_E = -.20(EI) - .19(II) + .31(D); \quad R^2 = .55; \quad \text{Adj. } R^2 = .46 \tag{10.2}$$

where $b_E$ is the metric regression coefficient relating occupational status to education in each microequation, EI is a measure of educational inequality, II is a measure of income inequality, and D is a measure of economic development. The coefficients estimated for Equation 10.2 are expressed in standard form. However, regression diagnostic procedures operate on metric coefficients. Here is the corresponding equation, with coefficients expressed in metric form:

$$b_E = 2.02 - .35(EI) - .32(II) + .30(D) \tag{10.3}$$
Although regression diagnostic procedures are helpful for samples of all sizes, they are particularly useful for analyses based on small numbers of observations because such samples are particularly vulnerable to the undue influence of one or a few extreme observations. Downloadable files "ch10.do" and "ch10.log" show the Stata -do- and -log- files for my reanalysis; you should study these along with the text because many details are provided only in the commentary in these files.
Preliminaries

As you should do in any reanalysis of published results, I start by trying to replicate the published figures. The Stata -log- file shows a listing of the data set, various summary statistics, and estimates for the regression equation reported in the published article. All agree with the corresponding figures in the published article. This is not always the case. A surprisingly large number of published articles contain errors: coefficients that do not correspond to estimates derived from data sets that are listed by the authors or are available from archives. Sometimes this is because the authors have dropped cases or transformed variables without informing the reader, but sometimes the authors have simply made mistakes. Given the ease of email communication, it often is possible to clear up such problems relatively painlessly and is certainly worth the effort.

The first time I tried to replicate the published equation I got an absurd minimum value for the educational inequality measure and regression estimates that disagreed with the published estimates. It turned out that the explanation was simple: the scanning operation that input the data from the published article (which was written many years ago, before I started systematically keeping logs of my work) recorded -69 rather than -0.69 for Britain, and I failed to notice this when I proofread the file. It is not worth the space, or your time, to detail my effort to detect the source of and correct the problem, but the lesson is clear: devise as many checks as possible and study them carefully at each step before proceeding.

There also is a lesson here regarding good professional practice: in your publications always describe your procedures in sufficient detail to make it possible for a competent analyst to exactly replicate your coefficients given only your paper and your original data set. Doing so is not only a matter of courtesy; it will help you discover your own errors before they are published and become vulnerable to snide correction by some graduate student looking for a quick publication. Whenever you produce a paper for publication (or even a semipublication such as a deposit on a Web page or a submission as a term paper or dissertation chapter), your last step before submission should be to rerun your complete -do- file and check every coefficient in your paper against the -do- file. You will be surprised how many errors you find!
Leverage

Having established that I can replicate the published results, I now consider whether they provide a reasonable representation of the relationships in the data. I start by considering whether any observations have particularly high leverage, where leverage refers to the difference between the value or values of the independent variable or variables for a particular observation and the mean or centroid of the values for all observations. Plot (d) in Figure 10.1 illustrates this case. The observation with a score of nineteen on the horizontal axis has high leverage. Such observations are troublesome because they may have undue influence on the regression slope; obviously, in (d) of Figure 10.1 the slope would be infinite except for the single high leverage point.

A conventional measure of leverage is the diagonal elements of the hat-matrix, which provides a scale-free measure of the distance of individual observations from the centroid. Computing the hat-matrix for the eighteen nations in our data set (search for "hat" in the downloadable file "ch10.do"), we note that India has an unusually large value, roughly four times the mean hat-value. This suggests the possibility that India is unduly influencing the regression estimates.
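For a single predictor the hat-values have a closed form, h_i = 1/N + (x_i - mean)^2 / sum_j (x_j - mean)^2, which makes the idea easy to check by hand. The sketch below applies that formula to the X values of plot (d) in Figure 10.1 (the fourth Anscombe data set, in which every X is 8 except one value of 19). The function name is illustrative, and the multiple-regression case analyzed in this chapter would need the full hat-matrix rather than this shortcut.

```python
def hat_values(xs):
    """Leverage (hat-values) for a simple regression of Y on one predictor X."""
    n = len(xs)
    mean_x = sum(xs) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return [1.0 / n + (x - mean_x) ** 2 / sxx for x in xs]


# X values of Anscombe's fourth data set: ten 8s and a single 19.
x = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

h = hat_values(x)
mean_h = sum(h) / len(h)  # always (k + 1) / N; here 2 / 11

for xi, hi in zip(x, h):
    # Report each hat-value and its size relative to the mean hat-value.
    print(f"x = {xi:2d}  hat = {hi:.3f}  ratio to mean = {hi / mean_h:.2f}")
```

The observation at X = 19 has a hat-value of 1, the maximum possible, which is the formal counterpart of the statement that this single point alone determines the slope.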
Outliers

Before acting on this possibility, however, we need to further explore the data. Our next step is to discover whether there are any extreme outliers, observations far from the regression surface. To do this we need to adjust for the fact that observations with high leverage tend to have small residuals precisely because the least-squares property pulls the regression surface toward such observations. The studentized residual (E*) provides such an adjustment by basing each residual on a regression equation estimated with the observation omitted. The studentized residual is attractive because it follows a t-distribution with N - k - 2 degrees of freedom (where N is the number of observations and k is the number of independent variables), which makes it possible to assess the statistical significance of specific residuals.

However, because we usually do not have a priori hypotheses regarding particular observations, we need to adjust our tests of significance for simultaneous inference. A simple way to do this is to make a Bonferroni adjustment by dividing our desired probability value (conventionally .025 for a two-tailed test) by the number of possible comparisons, which in this case is the number of observations. Thus the procedure for the analysis at hand is to compute studentized residuals and to identify outliers as unlikely to have arisen by chance if the p-value is less than .025/18 = .00139. As it happens, none of the outliers is statistically significant, because the largest studentized residual, for Denmark, is 3.349, with 18 - 3 - 2 = 13 d.f., which implies a p-value of .00523 (search for "estu" in the downloadable file "ch10.do"). It probably would be unwise to take such tests of significance too seriously, especially given the very small sample. Fox (1997, 280) argues that studentized residuals greater than two in absolute value are worthy of concern. This suggests that we need to further consider Denmark (E* = 3.35) and perhaps India (E* = 1.9).
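The Bonferroni arithmetic in this passage can be laid out explicitly. The sketch below uses only figures quoted in the text (18 nations, 3 predictors, a nominal probability of .025, and Denmark's p-value of .00523); the function names are hypothetical.

```python
def bonferroni_cutoff(alpha, n_comparisons):
    """Per-observation significance threshold after a Bonferroni adjustment."""
    return alpha / n_comparisons


def studentized_residual_df(n_obs, n_predictors):
    """Degrees of freedom of the t-distribution for studentized residuals."""
    return n_obs - n_predictors - 2


n_nations = 18
n_predictors = 3          # EI, II, and D
cutoff = bonferroni_cutoff(0.025, n_nations)
df = studentized_residual_df(n_nations, n_predictors)

p_denmark = 0.00523       # p-value reported in the text for E* = 3.349
significant = p_denmark < cutoff

print(f"cutoff = {cutoff:.5f}, d.f. = {df}, Denmark significant: {significant}")
```

Denmark's p-value would pass an unadjusted .025 threshold but not the adjusted one, which is why the text treats it as worth a closer look rather than as a confirmed outlier.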
Influence

Measures that take simultaneous account of both leverage and outliers are known as influence statistics. Several relatively similar measures are available; here we focus on Cook's Distance measure (Cook's D), which is a scale-free summary measure of how a set of regression coefficients changes when each observation is omitted. Taking 4/N as a cutoff point for Cook's D, we note that only India is exceptionally influential, with the United States marginally so (search for "cooksd" in the downloadable file "ch10.do").
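One common way of writing Cook's D shows how it combines the two ingredients: D_i = (E'_i^2 / (k + 1)) * (h_i / (1 - h_i)), where E'_i is the standardized residual and h_i the hat-value. The sketch below assumes that form; the residual and hat-values passed to it are hypothetical and are not the values for any nation in the data set.

```python
def cooks_d(std_residual, hat, n_predictors):
    """Cook's D from a standardized residual and a hat-value (0 <= hat < 1)."""
    if not 0 <= hat < 1:
        raise ValueError("hat-value must lie in [0, 1)")
    discrepancy = std_residual ** 2 / (n_predictors + 1)
    leverage = hat / (1 - hat)
    return discrepancy * leverage


def cooks_d_cutoff(n_obs):
    """The 4/N rule of thumb used in the text."""
    return 4.0 / n_obs


cutoff = cooks_d_cutoff(18)   # eighteen nations

# Hypothetical cases: same residual, low versus high leverage.
d_low_leverage = cooks_d(std_residual=2.0, hat=0.10, n_predictors=3)
d_high_leverage = cooks_d(std_residual=2.0, hat=0.50, n_predictors=3)

print(f"cutoff = {cutoff:.3f}")
print(f"low leverage:  D = {d_low_leverage:.3f}, flagged: {d_low_leverage > cutoff}")
print(f"high leverage: D = {d_high_leverage:.3f}, flagged: {d_high_leverage > cutoff}")
```

The same residual is flagged only when it is paired with high leverage, which is why an observation can be a large outlier without being especially influential, and vice versa.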
Plots for Assessing Influence

Despite our focus thus far on numerical summary measures, a generally more useful approach to diagnosing regression ills is to plot the relationships among various indicators. Two useful plots that combine measures of leverage and residuals are the leverage-versus-residual-squared plot (the -lvr2plot- command in Stata) and the studentized-residual-versus-hat plot weighted by Cook's D, proposed by Fox (1997, 285) and easily implemented in Stata (to see how I did this, see the downloadable file "ch10.do"). Figures 10.5 and 10.6 show these plots. The effect of squaring the residuals in Figure 10.5 is to indicate the influence of the outliers, because the regression procedure minimizes the sum of squared errors. Still, Figure 10.6 seems to do a better job of revealing the overall influence of specific observations. Clearly India stands out from the remaining observations. Denmark, however, is the largest outlier.
FIGURE 10.5. A Plot of Leverage Versus Squared Normalized Residuals for Equation 7 in Treiman and Yip (1989). Note: The horizontal and vertical lines are the means of the two variables being plotted.

FIGURE 10.6. A Plot of Leverage Versus Studentized Residuals for Treiman and Yip's Equation 7, with Circles Proportional to the Size of Cook's D. Note: The horizontal line is at the mean hat-value, and the vertical line is at zero.

Added-Variable Plots

Our next task is to try to discover any systematic relationships among the variables that might account for either the large residuals or the highly influential observations. A good way to do this is to construct added-variable plots, also known as partial-regression leverage plots or simply partial-regression plots. Such plots provide a two-dimensional analog to the kind of scatter plot with a regression line through it that we construct in the simple regression case. Added-variable plots do this by showing the relationship between two sets of residuals: (1) the residuals from a regression of the dependent variable on all variables except one and (2) the residuals from the regression of the variable omitted from (1) on the remaining independent variables.

Graph (a) in Figure 10.7, assessing the effect of educational inequality (EI), suggests that India is highly influential; it has very high educational inequality relative to its income inequality and level of industrialization, and it also displays a stronger effect of education on occupation than would be expected from its levels of income inequality and industrialization. Interestingly, the plot reveals that if India were removed or downweighted, the slope relating educational inequality to the level of occupational returns to education would increase. Denmark, by contrast, has unusually low educational inequality relative to its income inequality and industrialization, but it has a much stronger education-occupation connection than would be expected from its position on the other two variables, so the omission or downweighting of Denmark would decrease the effect of educational inequality. Graph (b), assessing the effect of income inequality (II), reveals that only Denmark is a large outlier. Otherwise, the plot is fairly unremarkable. Graph (c), assessing the effect of industrialization (D), shows the United States to be a high-leverage observation, with a very high level of industrialization relative to its level of educational and income inequality. Because the United States is below the regression line, its omission would increase the slope.

FIGURE 10.7. Added-Variable Plots for Treiman and Yip's Equation 7.

FIGURE 10.8. Residual-Versus-Fitted Plot for Treiman and Yip's Equation 7.
ptotsand Formal Testsfor patterns Residual-Versus-Fitted in the Data
A second test, Stata's -ovtest- command, assesses the possibility of omitted variables by testing whether the fit of the model is improved when the second through fourth powers of the fitted values are added to the equation. Given the small sample size, I take the p-value of .08 resulting from this test as suggesting the possibility of omitted variables.

Component-plus-residual plots are useful in revealing the functional form of relationships and, by extension, the possibility of omitted variables. Such plots differ from added-variable plots because they add back the linear component of the partial relationship between Y and X to the least-squares residuals, which may include an unmodeled nonlinear component. Figure 10.9 shows such plots for our data, using the "augmented" version available in Stata (search for "acprplot" in the downloadable file "ch10.do").

FIGURE 10.9. Augmented Component-Plus-Residual Plots for Treiman and Yip's Equation 7. Panels plot the component plus residual against educational inequality, income inequality, and economic development.

The plots in Figure 10.9 continue to show Denmark as a large outlier. But otherwise they do not appear orderly; and, with one exception, I can think of no omitted variables. The exception derives from work by Müller and Shavit (1998) that suggests that the education-occupation connection is especially strong in nations with well-developed vocational education systems and especially weak in nations with poorly developed vocational education systems. In our data Denmark, Germany, Austria, and the Netherlands have especially strong vocational education systems, and the United States, Japan, and Ireland have very weak vocational education systems. The relationship found by Müller and Shavit seems to hold in our data, with the nations with strong vocational education systems above the regression line and the nations with weak vocational education systems below the regression line. This result suggests adding the strength of the vocational education system as a predictor. To do this I add two dummy variables to distinguish the three sets of nations (strong, weak, and neither especially strong nor especially weak vocational education systems). I then reestimate Equation 7, which yields the coefficients shown in the second column of Table 10.1 (for convenience, Column 1 shows the metric coefficients from Treiman and Yip's original Equation 7, that is, those shown in Equation 10.3 of this chapter); the remaining columns show various additional estimates discussed in the following paragraphs.

The specification shown in Column 2 produces a better representation of the determinants of the strength of the education-occupation connection in the eighteen nations studied here than the original specification. The adjusted R² increases substantially and, as expected from the pattern of residuals, the coefficients for strong and weak vocational educational systems have the expected signs. (I discuss the standard errors later in this chapter.)

However, the question remains as to whether the results are still substantially driven by India and Denmark. To determine this I repeated all the diagnostic procedures discussed previously with the new equation. The Stata log contains the commands I used, but in the interest of saving space and avoiding tedium, I have not shown the resulting plots and will not discuss the results except to note that India continues to be a high leverage point and Denmark continues to be a large outlier, although the diagnostic indicators for both are somewhat less extreme than the corresponding indicators just reviewed.
Table 10.1. Coefficients for Models of the Determinants of the Strength of the Occupation-Education Connection in Eighteen Nations.
18 Observations
17 Observations (ExpandedModel)
f, ii
lf
{ ll
Original
IducationalInequality
IncomelnequaJity
lndustrialization
Model (Metric
Model (OLS
Coefficients)
Estimates)
- 0.354 (0.s32) -0.320 (0.299) 0.299 (0.27s)
Weak Voc. Ed.System
R' AdjustedR,
Expanded
l0ll
R obur
ors
Regress:r
-0.292 (0.s6s)
-0.821 (2.268)
-a.J a,-
-0.342 (0.324)
,0.321 (1.s94)
0. 3: :
tiltl
\1.29nl tl a
4.2a7 (0.275)
0.208 \1.449)
0.836 (0.410)
0.707 (0.644)
0.5E': (0.59,
*0.476 (2.518)
tO.1La
lll
I
StrongVoc. Ed.System
Intercept
iut
-0.403 (o.414)
it
I
2.021 (0.222)
1.899 (0.2s1)
1.814 (0.631)
.553
.762
.792
.457
.662
.698
0.529
0.471
o.672 /V!fer Bootstrappedstandarcl errors,denv
1.8 C: (0.5a:
.
iii:l::$,jiij!,, and ::T:19;[J,.:'il;1';]f+tlq11fiI,i",.,l,"iil,llil,"l;j,"J:li 0r76 ror corumn 1:a i7s,o2;5.-*r,;r;;:;;;, 0 313,and 0 I 74 f oj Colum n3 ; ancO l i l I , A . 2 6 7 , A . 1 8 5 , 0";ili;,illill."iiili;liflillliii;1,-, .330,0.362,and0.189forCoIumn4.
ROBUST REGRESSION

So what to do? Because we have no clear basis for modifying or omitting particular observations, nor for transforming our variables to a different functional form, we need an alternative way of handling outliers and high leverage points. One alternative is robust estimation, which does not in general discard observations but rather downweights them, giving less influence to highly idiosyncratic observations. Robust estimators are attractive because they are nearly as efficient as ordinary least-squares estimators when the error distribution is normal and are much more efficient when the errors are heavy-tailed, as is typical with high leverage points and outliers. There are, however, several robust estimators, and there are no clear-cut rules for knowing which to apply in what circumstances. The best advice is to explore your data as thoroughly as time and energy permit. (For further details on robust estimation, consult Fox [1997, 405-414; 2002], Berk [1990], and Hamilton [1992a; 1992b, 207-211].)

One class of robust estimators, known as M estimators, works by downweighting observations with large residuals. It does this by performing successive regressions, each time (after the first) downweighting each observation according to the absolute size of the residual from the previous iteration. Different M estimators are defined by how much weight they give to residuals of various sizes, which can be shown graphically as objective functions. The objective functions of three well-known M estimators are shown in (a), (b), and (c) of Figure 10.10. The OLS objective function ([a] of Figure 10.10) increases quadratically, as it must given that OLS regression minimizes the sum of squared residuals. The Huber function ([b] of Figure 10.10) gives small weight to small residuals but weights larger residuals as a linear function of their size. The bi-square objective function ([c] of Figure 10.10) gives sharply increasing weight to medium-sized residuals but then flattens out so that all large residuals have equal weight.
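The three objective functions just described can be written out directly. The sketch below uses their standard textbook forms and the conventional tuning constants (1.345 for Huber and 4.685 for the bi-square, applied to residuals already divided by a robust estimate of scale). These constants and the function names are assumptions made for illustration and are not taken from the text or from Stata's source code.

```python
def ols_objective(e):
    """OLS: the penalty grows with the square of the residual."""
    return e ** 2


def huber_objective(e, k=1.345):
    """Huber: quadratic for small residuals, linear beyond the constant k."""
    a = abs(e)
    if a <= k:
        return 0.5 * e ** 2
    return k * a - 0.5 * k ** 2


def bisquare_objective(e, k=4.685):
    """Bi-square: rises steeply, then is flat for all residuals beyond k."""
    if abs(e) <= k:
        return (k ** 2 / 6.0) * (1.0 - (1.0 - (e / k) ** 2) ** 3)
    return k ** 2 / 6.0


def huber_weight(e, k=1.345):
    """Weight used in the next iteration under the Huber function."""
    a = abs(e)
    return 1.0 if a <= k else k / a


def bisquare_weight(e, k=4.685):
    """Weight under the bi-square: falls to zero for residuals beyond k."""
    if abs(e) <= k:
        return (1.0 - (e / k) ** 2) ** 2
    return 0.0


for e in (0.5, 2.0, 6.0):
    # Compare the weight each estimator would give a residual of this size.
    print(f"e = {e}: Huber weight = {huber_weight(e):.3f}, "
          f"bi-square weight = {bisquare_weight(e):.3f}")
```

Under the Huber function a very large residual keeps a small positive weight, whereas the bi-square gives it a weight of zero; this difference is the reason the two are often used in sequence.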
Because Huber weights deal poorly with severe outliers (whereas bi-weights sometimes fail to converge or produce multiple solutions), Stata's implementation of robust regression first omits any observations with very large influence (Cook's D > 1), uses Huber weights until the solutions converge, and then uses bi-weights until the solutions again converge. Because of the way it is defined, robust regression takes account only of outliers but not of high-leverage observations with small residuals. For some problems this can be a major limitation.
Panel 2 of Table 10.1 shows (in Column 4) robust regression estimates for the elaborated model of the education-occupation connection we have been studying. There is no robust regression estimate in Panel 1 because the procedure dropped India at the outset because of its large Cook's D. Column 3 shows the corresponding OLS estimates with India omitted. Interestingly, the OLS and robust regression estimates differ very little in Panel 2, with the exception of the effect of strong vocational education, which is reduced in the robust estimate because Denmark, with its large residual, is downweighted. The agreement between different estimators does not always hold and should not be taken as an indication that robust estimation is unnecessary. However, the stability of the estimates under different estimation procedures gives us added confidence in them.
By contrast, the omission of India strongly affects the educational inequality coefficient, increasing it by more than a factor of two. The coefficient for strong vocational education is modestly reduced, and the coefficient for industrialization is even more
FIGURE 10.10. Objective Functions for Three M Estimators: (a) OLS objective function, (b) Huber objective function, and (c) bi-square objective function. (Horizontal axis in each panel: deviation score.)
modestly reduced. A reasonable conclusion is that the education-occupation connection [...] general relationship between [...] properly set India aside for separate consideration.
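The iterative downweighting described in this section can be sketched in Python for the simplest case of one predictor. This is a rough illustration under stated assumptions, not Stata's algorithm: it skips the initial Cook's D screening, it rescales residuals by the median absolute residual divided by 0.6745 (a common convention that the text does not mention), the tuning constants are the conventional 1.345 and 4.685, and the data are invented.

```python
import statistics

def wls_line(x, y, w):
    """Weighted least-squares fit of y = a + b*x; returns (a, b)."""
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

def huber_weight(u, k=1.345):
    return 1.0 if abs(u) <= k else k / abs(u)

def bisquare_weight(u, k=4.685):
    return (1.0 - (u / k) ** 2) ** 2 if abs(u) <= k else 0.0

def irls(x, y, weight_fn, start, tol=1e-8, max_iter=200):
    """Repeat weighted regressions, reweighting by the previous residuals."""
    a, b = start
    for _ in range(max_iter):
        resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
        # Robust scale estimate (assumed convention, not from the text).
        scale = statistics.median([abs(r) for r in resid]) / 0.6745
        if scale == 0.0:
            break
        weights = [weight_fn(r / scale) for r in resid]
        new_a, new_b = wls_line(x, y, weights)
        converged = abs(new_a - a) < tol and abs(new_b - b) < tol
        a, b = new_a, new_b
        if converged:
            break
    return a, b

def robust_line(x, y):
    """OLS start, then Huber weights, then bi-weights."""
    ols = wls_line(x, y, [1.0] * len(x))
    huber = irls(x, y, huber_weight, ols)
    return irls(x, y, bisquare_weight, huber)

# Invented data: y = 2 + 0.5x plus small noise, with one gross outlier at x = 5.
x = [float(v) for v in range(1, 11)]
noise = [0.1, -0.2, 0.15, -0.05, 15.0, -0.1, 0.05, -0.15, 0.1, 0.1]
y = [2.0 + 0.5 * xi + e for xi, e in zip(x, noise)]

ols_a, ols_b = wls_line(x, y, [1.0] * len(x))
rob_a, rob_b = robust_line(x, y)
print("OLS:   ", round(ols_a, 2), round(ols_b, 2))
print("Robust:", round(rob_a, 2), round(rob_b, 2))
```

In this invented example the outlier pulls the OLS intercept well away from 2, whereas the robust fit stays close to the line followed by the other nine points.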
BOOTSTRAPPING AND STANDARD ERRORS
;J""Jt""ft",f"f"YJ'1il3r,'S1t::*:i:1g*.b"*ordinaryreastsquaresandr af. no: norrnally disnibuted,ifr" airt lUut* errors enors rs isasvmntoficallv asymptotica'v nn*"r *^:^u1o:, t normar_that is,;,il;;i"#ff
H;"J:?#iftr#:ltJ# .ynil"ar,.":"p1"'r;,,],iJl , * the observadons number d and r i.tr,"*''r". l:.,:l* o'r-iJ:t#[iffi;rj ;"t :iffi#: "*r::ffffi.':#*I;T::::: **Afy;JT:1:T; ffi#ff glfl'lv:'arva'r""tl#i-i o,*,.*l oneway around thisprobrem-is;;;;'r;;;##;;::X?fiffiL, "'".i"^llTn*
. .
_* "a*jm#*1i*ffi*::",1# ;gIfi,11 Fxl:XT*TilTt ,*:tJT:;:,";::*t6:i::.tl#*,:#;:m:ff il"".::i{ff jtr**.ffi ;fiT ;lll'il"lXT'Ji,t;ilffi'#::;:."1J:;:Lffi
iff
J[*",..-
of eighteen nations is drawn does not actually exist, since we took all nations for which data were available. Thus we need to resort to an approximation.
Bootstrapping approximates resampling by taking the observed sample as a proxy for the population and repeatedly sampling, with replacement, observations from the observed sample. Thus, in our current example, we would randomly draw (with replacement) a first sample of eighteen cases from our eighteen observations, say Norway, Netherlands, India, Ireland, Austria, United States, Finland, Philippines, Denmark, Italy, Taiwan, Sweden, India, Ireland, Finland, Denmark, Denmark, Taiwan. Note that England, Germany, Hungary, Japan, Northern Ireland, and Poland do not fall into the sample; Austria, Italy, the Netherlands, Norway, the Philippines, Sweden, and the United States are included once; Finland, Ireland, India, and Taiwan are included twice; and Denmark is included three times. From this sample we would estimate our regression equation and record the coefficients. Then we would draw a second sample with replacement, a third, and so on. The result is, for each coefficient, a distribution of values equal in size to the number of samples we have drawn. From this distribution we estimate the standard error as the standard deviation of the distribution. (For further discussion of bootstrapping, see Fox [1997, 493-514], Stine [1990], Hamilton [1992a; 1992b, 313-325], and the entry for -bootstrap- in the Stata 10.0 manual.)
This method provides a good estimate of the standard error of a statistic, provided that the sample in fact represents the population from which it is drawn and that the resulting distributions are approximately normal. With very small samples with outliers and high leverage points, as in our case, there tends to be high variability from sample to sample. Thus it is wise to draw many samples to get stable estimates of the sampling distribution.
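The resampling procedure just described can be sketched in a few lines of Python. This is a minimal illustration, not the code in the book's do-file: the eighteen data points are invented, the statistic is the slope of a one-predictor OLS regression rather than the chapter's model, and the function names and seed are mine.

```python
import random
import statistics

def ols_slope(x, y):
    """Slope of the OLS regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / sxx

def bootstrap_se(x, y, stat, reps=2000, seed=12345):
    """Resample cases with replacement, recompute the statistic each time,
    and return the standard deviation of the estimates (the bootstrap
    standard error) along with the estimates themselves."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    while len(estimates) < reps:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:      # slope undefined if all x values coincide
            continue
        ys = [y[i] for i in idx]
        estimates.append(stat(xs, ys))
    return statistics.stdev(estimates), estimates

# Invented data for eighteen cases.
x = [float(v) for v in range(1, 19)]
y = [2.0 + 0.5 * xi + ((7 * i) % 11 - 5) * 0.3 for i, xi in enumerate(x)]

se, estimates = bootstrap_se(x, y, ols_slope)
print("OLS slope:", round(ols_slope(x, y), 3))
print("Bootstrap SE from", len(estimates), "resamples:", round(se, 3))
```

Rerunning with a smaller value of reps and several different seeds is a simple way to see the trial-to-trial variability that motivates drawing many samples.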
In the present case I drew 2,000 samples to estimate the standard errors in each of the columns of Table 10.1 (see the section on "Bootstrapped Standard Errors" in the downloadable file "ch10.do"). I experimented with smaller numbers of samples but got unsatisfactory variability across trials in my estimates of the standard errors. With 2,000 replications the standard errors seem to be reasonably stable but hardly normally distributed (see Figure 10.11). The outliers in these distributions derive from the random omission or multiple presence of high-leverage observations. (With seventeen observations sampled with replacement, the probability of a given country being excluded from a particular sample is 36 percent; more precisely, $0.357 = (1 - 1/N)^N = (1 - 1/17)^{17}$.)
Note that the standard errors are sometimes much larger than the corresponding asymptotic standard errors reported in the note to the table, especially those for the educational inequality measure. This result alerts us to the danger of naively accepting computed standard errors from general purpose statistical programs, especially when working with small samples. On the other hand, as noted previously, it is unclear in the present case that much should be made of the standard errors, given that our "sample" is both very small and hardly a probability sample of any population. It is reasonable to tentatively accept the estimated model, specifically the robust regression estimates for seventeen societies reported in Column 4 of Table 10.1, which have far smaller standard errors than do the corresponding OLS estimates. Nonetheless, we must note that the results are only suggestive and require confirmation with more and better data before being regarded as definitive.
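The 36 percent figure is easy to check. The Python sketch below computes the exact probability that a given case is left out of one resample of size N and confirms it by simulation; the function names, the seed, and the number of simulated resamples are my own choices.

```python
import random

def prob_excluded(n):
    """Probability that a given case is absent from one bootstrap sample:
    each of the n draws misses it with probability 1 - 1/n."""
    return (1 - 1 / n) ** n

def simulated_prob_excluded(n, reps=20000, seed=1):
    """Share of simulated resamples that omit case 0."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(reps):
        drawn = {rng.randrange(n) for _ in range(n)}
        if 0 not in drawn:
            misses += 1
    return misses / reps

print(round(prob_excluded(17), 3))   # 0.357
print(simulated_prob_excluded(17))
```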
FIGURE 10.11. Sampling Distributions of Bootstrapped Coefficients (2,000 Repetitions) for the Expanded Model, Estimated by Robust Regression on Seventeen Countries. (Panels: educational inequality, income inequality, strong vocational system, weak vocational system, intercept.)
Note: These are the bootstrapped coefficients for Column 4 of Table 10.1.
However, when we have genuine probability samples of larger populations, the standard errors and the confidence intervals they imply assume much greater importance. The calculation of appropriate confidence intervals for bootstrapped statistics is an unsettled and ongoing area of statistical research. Stata provides four different 95 percent confidence intervals based on different assumptions. There is considerable controversy as to which of these estimates provides the best coverage of the true standard error. But the weight of the evidence to date seems to support bias-corrected estimates, and this is the default in Stata.
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have seen how to check our data for anomalous observations and violations of the assumptions underlying OLS regression, how to use the information obtained to generate new hypotheses, how to use robust regression procedures to achieve estimates with smaller standard errors [...] standard errors in situations in which [...] distributed cannot be sustained. The main lesson of this chapter is that much can be learned by graphing relationships in the data. Indeed, often the best way to understand your data is to graph what you think you are observing. The results are often surprising, and usually informative.
CHAPTER 11
SCALE CONSTRUCTION
WHAT THIS CHAPTER IS ABOUT
In this chapter we see how to improve both the validity and reliability of measurement by constructing multiple-item scales. We consider three ways to construct scales: additive scaling, factor-based scaling, and effect-proportional scaling. We also consider two variants of regression analysis: errors-in-variables regression, which corrects for unreliability of measurement, and seemingly unrelated regression, which is used to compare regression equations with (some or all of) the same independent variables but different dependent variables.
INTRODUCTION
In social research we often wish to study the relationship among concepts for which we have no direct and exact measures. Examples include, from social stratification, class, status, and power; from attitude research, anomie, alienation, and authoritarianism; and from political sociology, liberalism versus conservatism. For concepts of this kind, it is difficult to imagine that any single measure of what people believe or how they behave would adequately reflect the concept. Suppose, for example, that we are interested in differentiating among members of Congress on the basis of how liberal their voting record is. We would hardly be content to measure "liberal voting" by choosing a single vote (say, support for foreign aid) and declaring those who voted for it to be liberals and those who voted against it to be conservatives. For any particular vote, factors besides "liberalism" or "conservatism" may come into play: objection to the particular language of the legislation, the need to discharge a political debt, the judgment that in tight times funds are best spent on social welfare at home, and so on. Although extraneous factors may affect any particular vote, we would still expect that on average "liberals" would be more likely to support foreign aid, domestic welfare, civil liberties, affirmative action, and such, than would "conservatives." (Of course, we might want to refine our concept, differentiating domains of liberalism or conservatism, for example, social policy, fiscal policy, and internationalism versus isolationism. But the same basic point holds: any one item tends to be a poor measure of an underlying concept because extraneous factors may affect the response to a single item.) Therefore a useful strategy for constructing operational indicators of underlying concepts is to create multiple-item scales. That is, we take the average of the responses to each of a set of items thought to reflect an underlying concept as indicating or measuring the strength of the concept. Multiple-item scales should satisfy two criteria: they should be valid, and they should be reliable.
VALIDITY
An indicator is valid if it measures what it is supposed to measure, that is, if it adequately measures the underlying concept. Unfortunately, there is in general no technical way to evaluate the validity of a scale, although, as we will see later in the chapter when we discuss factor-based scaling, we can gain confidence in the validity of a scale by determining that it has the relationships to other variables that we would expect on theoretical grounds. Assessment of validity is mainly a matter of constructing an appropriate theoretical link for the relationship between the concept and its indicator or indicators, and between the concept and other variables.
Many of the most important arguments in science are about the validity of measures. This is as true in the physical and biological sciences as in the social sciences. Indeed, I recommend a fascinating account of a search for life on Mars (Burgess 1978) that includes a vivid portrayal of the ongoing dispute between the "pro-life" camp and the "anti-life" camp as to whether particular indicators being sent back by the Mars Lander could validly be interpreted as evidence of the presence of life on Mars.
The first requirement for devising a valid measure is to be clear about what you are trying to measure. This is not as obvious as it sounds. More often than not our concepts are formulated rather vaguely. Just what do we mean by "social class," for example? If we take a Marxist approach and define class by the "relationship to the means of production," we have merely shifted the problem, because we then have to say exactly what we mean by the relationship to the means of production. If we take a Weberian approach and define class by "market position," we have exactly the same problem.
Anyone who thinks I am constructing a straw man is advised to look at the writings of Erik Olin Wright and his followers, who are to be commended for trying to do serious quantitative research within a Marxist framework (see, for example, Wright and others [...], Wright 1985, and Wright and Martin 1987). A good portion of the writings of Wright and his group is preoccupied with the validity of alternative indicators.
Even seemingly straightforward variables often have the same difficulties. Just what underlying concept are we trying to measure when we devise a scale of educational attainment: skill, knowledge, credentials, values, conformity to external demands, or still something else? In principle our theory, as represented in the specification of the concept of interest, should dictate our choice of indicators. For example, if we are interested in the gatekeeping function of education in channeling access to particular kinds of jobs, we may wish to measure educational attainment by the highest degree one obtains. If we regard schooling as enhancing cognitive skills, we may wish simply to count the number of years of schooling a person obtains.
Sometimes, of course, we are restricted to extant data and must work in the other direction, constructing an argument about what underlying concept is represented by the measure we have at hand. In either event clarity is crucial in your own mind, and on the written page as well, regarding what concepts your indicators measure. (For a brief introduction to different types of validity, see Carmines and Zeller [1979, 17-26].)
RELIABILITY
Reliability refers to consistency in measurement. Different measures of the same concept, or the same measurements repeated over time, should produce the same results. For example, if one individual scores high and another individual scores low on a measure of interracial tolerance, we would like to get the same difference between the two individuals if we used a different (but equally valid) measure of interracial tolerance; to the extent the measures yield similar results, we say that both measurements are reliable. Also, if the same respondent were asked his attitude at two points in time, we would like to get the same result (assuming he has not changed his attitude).
From this definition it is easy to see why multiple-item scales are generally more reliable than single-item scales. When responses are averaged over a set of items, each of which measures the same underlying dimension, the idiosyncratic reasons that individuals respond in particular ways to particular questions tend to get "averaged out." Of course, this is true only to the extent that each item in a scale reflects the same underlying dimension (the conceptual variable). If an item is capturing some other underlying dimension instead of, or in addition to, the one of interest to the analyst, it will undercut the reliability (and validity) of a scale. For example, suppose responses to a question about willingness to have people of a different race as neighbors reflected differences in economic anxiety, with some people rejecting potential neighbors not because of racial intolerance but for fear (rightly or wrongly) of a reduction in property values. We would not want to include such an item in a scale of racial tolerance because it would tend to make the scale less reliable, with scale scores determined to some extent by whether we happened to include a lot of people with economic anxieties or only a few such people.
An important reason for creating reliable scales is that, all else equal, unreliable scales tend to have lower correlations with other variables. This follows from the fact that unreliable scales contain a lot of "noise." We might think of scales as having a "true" component and an "error" component. The "true" component is represented by the correlation of the observed measurement with the true underlying dimension; the size of this correlation gives the reliability of the scale. The "error" component (the portion uncorrelated with the underlying dimension) reflects idiosyncratic determinants of the observed measurement. From this definition of reliability, it follows that the lower the reliability of each of two measures, the lower the correlation between their observed values relative to the true correlation between the underlying dimensions. Formally, we can estimate the "true" correlation between variables by knowing their observed correlation and the reliability of each variable. The true correlation is given by
$$\rho_{X_t Y_t} = \frac{r_{XY}}{\sqrt{r_{XX'}\, r_{YY'}}}$$ (11.1)
where $\rho_{X_t Y_t}$ is the correlation between the true scores, $r_{XY}$ is the observed correlation between X and Y, and $r_{XX'}$ and $r_{YY'}$ are the reliability coefficients for X and Y, respectively. Equation 11.1 is also referred to as a formula for correcting for attenuation caused by unreliability; $\rho_{X_t Y_t}$ is the correlation between X and Y corrected for attenuation. For example, if two scales each have a reliability of .7 and an observed correlation between them of .3, the correlation corrected for attenuation will be $.3/\sqrt{(.7)(.7)} = .43$. Clearly, correlations can be strongly affected by the reliability of the component variables.
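The correction for attenuation is simple enough to compute by hand, but a small Python helper makes it easy to see how quickly the corrected value moves as reliability falls. The function name and the range check on the reliabilities are my own additions.

```python
import math

def correct_for_attenuation(r_xy, rel_x, rel_y):
    """Observed correlation divided by the square root of the product of
    the two reliabilities (the correction for attenuation)."""
    if not (0 < rel_x <= 1 and 0 < rel_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_xy / math.sqrt(rel_x * rel_y)

# The example from the text: reliabilities of .7 and an observed r of .3.
print(round(correct_for_attenuation(0.3, 0.7, 0.7), 2))   # 0.43

# The same observed correlation under lower and higher reliability.
for rel in (0.5, 0.6, 0.8, 0.9):
    print(rel, round(correct_for_attenuation(0.3, rel, rel), 2))
```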
There are several ways to measure reliability:
• Test-retest reliability is the correlation between scores of a scale administered at two points in time.
• Alternate-forms reliability is the correlation between two different scales thought to measure the same underlying dimension.
• Internal-consistency reliability is a function of the correlation among the items in a scale. Cronbach's alpha, discussed in the following paragraphs, is an internal-consistency measure.
j:?1"."':d{i.qi:fi #:"',l,T:r:f :#*f, : -a'*#*##',#l:l roestrm aiJ;; ;#; i# ::H[XT:';"Jil#:1,r.*n*,r,.J" !". 9.* aore "ri, sTlitlxH:".T:'H,3,TJ'j;tri:x"?;H,:"il:"i'; flif":*"*$: fh extensive .t;;"."o,,.]."u'"'ll,t":,'..:":kt andothersllg72, tgTglr"t."*"oi"r'"? "."
-Q'ates
"' Lqlsr rrr uus cnaptel
this concept.
*"esdepends onrwofacto6: rhe rrabiJity. orinrernar-consisrency t *ru.nc,J"#"'r,lil; ffifffffi:,fi::::.' _*,r:1iflJ:"#j"il:l.;;fi:::Tl^lj1*1y,,
*:,:'.1":i#iffi $: l":ls:*::i*:ii::,dnTyilJH';,'ffi "
$$\alpha = \frac{N\bar{r}}{1 + \bar{r}(N - 1)}$$ (11.2)
where $N$ is the number of items and $\bar{r}$ is the average correlation among items. In Table
,lli,iu,".ug" .,:.,.,,.1'ri, ,irfr ffi Hru*,y"ilTru1i1"jj iot".it". * "i.."rl ffi :i":ilff tr-TffiT*ectiverv,de,.'s,-ir1ir,"i"iii:#:fffrTt:'*"fl 1fi ll"i:1;,f,T:*-1il,"y:,1**:g"correrationas.25,scarescom_ orat reasr rr€d _"t:?a:'Jfl :|:ff seven "'.i"r,.i,.'",1.aJii.+iiiff;.ffi,T1?l,1l.r,i;ill,li".,ll;
fi ;;3",,,'i#i: THli:."Trd#*;;,nr;trl,.".l?l;,T#*"',ffi :,T#,ffi:iil:&TJ,:l; .fr;"#ff :,*T#,ilT::;,ilH:,y#::;ffi;J""ft fNCLUDE 5EVERAL H,th#Bi+tn?."RE rEsrs number of items in makes a *.," *t,^ rJ KN clearwhyexaminations ",r ,r.f:-"I"tl:ltn"a19GRE comprise severar e".",,".ott"eu hu"" ;;"-' Lll o. 1,ffi; ;:r#::::::-:Ar
illl*;r::::;j :n;;:[l*l:lfilT:;T,",lj:f #"il:ii:i;::ff j}i:ii::j:;[iffi *:,n*l*nj*f"ilJtt#{i,"11i.,:il", :ix*] :ffi;:lT:?.::Tit:;Tl{ Ii[i:;,";"il#:;.;l:':.+T:iliT:::]H"ffi
preparation.
le the test is taken. and also,
of course,the degreeof
TABLE 11.1. Values of Cronbach's Alpha for Multiple-Item Scales with Various Combinations of the Number of Items and the Average Correlation Among Items.
Number of items    Average correlation among items
                   .09      .25
2                  .17      .40
3                  .23      .50
4                  .28      .57
5                  .33      .62
6                  .37      .67
7                  .41      .70
8                  .44      .73
9                  .47      .75
10                 .50      .77
20                 .66      .87
50                 .83      .94
100                .91      .97
200                .95      .99
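The entries in the table follow directly from the formula for alpha as a function of the number of items and the average inter-item correlation. The short Python sketch below recomputes them; the function name is mine, and the two average correlations are simply the two columns of the table.

```python
def cronbach_alpha(n_items, mean_r):
    """Alpha from the number of items and the average correlation among
    items: N * r / (1 + r * (N - 1))."""
    return n_items * mean_r / (1 + mean_r * (n_items - 1))

# Recompute the table for average correlations of .09 and .25.
for n in (2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 200):
    print(n, f"{cronbach_alpha(n, 0.09):.2f}", f"{cronbach_alpha(n, 0.25):.2f}")
```

Reading down either column shows the diminishing returns to adding items: the first few items raise alpha quickly, and later ones raise it only slightly.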
SCALE CONSTRUCTION
In this chapter we consider three strategies to create multiple-item scales: additive scaling, factor-based scaling, and effect-proportional scaling. (For a brief general introduction to scaling, see McIver and Carmines [1981]. For a recent extended treatment, see Netemeyer, Bearden, and Sharma [2003]. A classic but still useful treatment is Nunnally and Bernstein [1984].)
Additive Scaling
The simplest way to create a multiple-item scale is simply to sum or average the scores of each of the component items, which is what we have been doing up to now. Where items are dichotomous, this amounts to counting the number of positive responses. Where the items themselves constitute scales (for example, continuous variables for education or income or attitude items ranging from "strongly agree" to "strongly disagree") we ordinarily standardize the variables before averaging (by subtracting the mean and dividing by the standard deviation). If we fail to do this, the item that has the largest variance will have the greatest weight in the resulting scale. The effect of the variance on the weight is easy to see by considering what would happen if a researcher decided to make a socioeconomic status (SES) scale by combining education and income, and did so simply by summing, for each respondent, the number of years of school completed and the annual income. He may think he has an SES scale, but what he actually has is an income scale with a very slight amount of noise, because education typically ranges from zero to twenty years (and, in the United States, effectively from eight to twenty years) whereas income ranges in the tens of thousands of dollars. By dividing each variable by its standard deviation, the analyst gives each variable equal weight in determining the overall scale score. (I first came to appreciate this point many years ago in graduate school when a professor told me that he and another member of the faculty effectively controlled who passed and who failed the collectively graded PhD exams. They did so simply by using all of the hundred points in the scale the faculty had devised for scoring exams, while most of their colleagues gave failing examinations a score of fifty or so.)
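The SES example can be checked numerically. The Python sketch below uses eight invented respondents (the education and income values are hypothetical, as are the function names) to compare a raw sum with a scale built by standardizing and then averaging.

```python
import statistics

def zscores(values):
    """Standardize: subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def additive_scale(*items):
    """Average of the standardized items, one score per respondent."""
    standardized = [zscores(item) for item in items]
    return [sum(row) / len(row) for row in zip(*standardized)]

def pearson(a, b):
    """Pearson correlation computed from standardized scores."""
    za, zb = zscores(a), zscores(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)

# Hypothetical respondents: years of schooling and annual income in dollars.
education = [8, 12, 12, 16, 10, 20, 14, 12]
income = [30000, 25000, 60000, 45000, 20000, 70000, 35000, 80000]

raw_sum = [e + i for e, i in zip(education, income)]
ses = additive_scale(education, income)

print("raw sum vs. income:   ", round(pearson(raw_sum, income), 6))
print("raw sum vs. education:", round(pearson(raw_sum, education), 3))
print("scale vs. income:     ", round(pearson(ses, income), 3))
print("scale vs. education:  ", round(pearson(ses, education), 3))
```

The raw sum is for practical purposes just income, whereas the standardized scale correlates equally with its two components.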
The trouble with simple additive scales of the sort just described is that the items included may or may not reflect a single underlying dimension. A scale with a heterogeneous set of items runs the risk of being both invalid, because in addition to what the analyst thinks the scale is measuring, it is also measuring something else, and unreliable, because at least some of the items are weakly or even negatively correlated.
Factor-Based Scaling
How can we determine whether the items we propose to include in a scale reflect a single dimension? First we identify a set of candidate items that we believe measure a single underlying concept. Then we empirically investigate two questions: (1) Do the items all "hang together" as a whole, or do one or more items turn out to be empirically distinct from (in the sense of having low correlations with) the remaining items, even though we thought they reflected the same conceptual domain? If so, we must reject the offending items. (2) Does each item have approximately the same net relationship to the dependent variable of interest? If not, the deviant items should not be used because this is again evidence that they do not measure the same concept (or that they measure other concepts besides the one of interest). Assessing the second question is a simple matter of regressing the dependent variable on the set of tentatively selected components of the scale, plus additional control variables where appropriate. When the scale will be used as a dependent variable, the correlations between the component items and the independent variables should be examined. In both situations we are looking for evidence that the candidate items have relationships of approximately the same magnitude, and especially the same sign.
Education, occupational status, and income are good examples of items that are positively correlated but that tend to have quite different net effects on various dependent variables. For example, fertility is known to be negatively related to education net of income but positively related to income net of education. Similarly, various measures of tolerance tend to be positively related to education net of income but unrelated or negatively related to income net of education. For this reason the common practice of constructing scales of socioeconomic status should be avoided, and each of the component variables should be included as a separate predictor of the dependent variable of interest.
A useful procedure for deciding whether items "hang together" is to submit them to a factor analysis. Factor analysis (or more precisely, exploratory factor analysis) is a procedure for empirically determining whether a set of observed correlations can, with reasonable accuracy, be thought of as reflecting, or as generated by, a small number of hypothetical underlying factors. Factor analysis is a well-developed set of techniques, with many variations. However, this chapter is concerned not with the intricacies of factor analysis but with its use as a tool in scale construction. For our present purposes, the optimal procedure is to use principal factor analysis with iterations and a varimax rotation and then to inspect the rotated factor matrix. The varimax rotation rotates the factor matrix in such a way as to maximize the contrast between factors, which is what we want when we are trying to determine whether we can find distinctive subsets of items within a larger set of candidate items. We then choose the items that have high loadings on one factor and low loadings on the remaining factor or factors. A rule of thumb for "high" is loadings of .5 or more (which are consistent with correlations of about $.5^2 = .25$ or higher).
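Once a rotated factor matrix is in hand, the selection rule can be applied mechanically. The Python sketch below assumes the loadings have already been estimated elsewhere; the item names and loadings are invented, and the cutoff of .3 for a "low" loading is my own assumption, since the text gives a rule of thumb only for "high."

```python
def items_for_factor(loadings, factor, high=0.5, low=0.3):
    """Return the items that load 'high' (in absolute value) on the chosen
    factor and 'low' on every other factor."""
    chosen = []
    for item, row in loadings.items():
        on_target = abs(row[factor]) >= high
        off_target = all(abs(v) < low for j, v in enumerate(row) if j != factor)
        if on_target and off_target:
            chosen.append(item)
    return chosen

# Hypothetical rotated loadings on two factors.
loadings = {
    "item_a": (0.72, 0.10),
    "item_b": (0.65, -0.05),
    "item_c": (0.55, 0.48),   # loads on both factors, so it is rejected
    "item_d": (0.12, 0.70),
    "item_e": (-0.61, 0.08),  # high negative loading: reverse-score before use
}

print(items_for_factor(loadings, 0))
print(items_for_factor(loadings, 1))
```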
TRANSFORMING VARIABLES SO THAT "HIGH" HAS A CONSISTENT MEANING
In the context of factor analysis, "high" refers to the absolute value of a factor loading. We would thus regard a loading less than or equal to -.5 or greater than or equal to .5 as high. It is important to appreciate, however, that a high negative loading implies that a variable is negatively related to the underlying concept. For this reason, it is desirable to transform all variables so they conceptually run in the same direction (that is, so that a high value on the variable indicates a high level of the underlying dimension, from which it then follows that all the indicators should be positively correlated). For example, consider the GSS items SPKCOM ("Suppose this admitted Communist wanted to make a speech in your community. Should he be allowed to speak, or not?") and COLCOM ("Suppose he is teaching in a college. Should he be fired, or not?"). Clearly, a positive response to the first item and a negative response to the second item both indicate support for civil liberties. So to make the interpretation of the factor analysis less confusing, it would be desirable to reverse the scaling of the second item. This can be accomplished easily by transforming the original variable, X, into a reverse-scaled variable, X', using the relation $X' = (k + 1) - X$, where there are k response categories. Similar transformations are helpful in any kind of multivariate analysis.
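The reverse-scaling relation is a one-line transformation. In the Python sketch below the function name is mine, and the responses are assumed to be coded 1 through k.

```python
def reverse_scale(x, k):
    """Reverse-score a response coded 1..k: X' = (k + 1) - X."""
    if not 1 <= x <= k:
        raise ValueError("response must be coded 1..k")
    return (k + 1) - x

# A two-category item (1, 2) and a five-category agree/disagree item.
print([reverse_scale(v, 2) for v in (1, 2)])           # [2, 1]
print([reverse_scale(v, 5) for v in (1, 2, 3, 4, 5)])  # [5, 4, 3, 2, 1]
```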
We then choose those items that meet both criteria (high loadings on the factor and similar relationships to the dependent variable) and combine them into a single scale by standardizing them (subtracting the mean and dividing by the standard deviation) and then averaging them. These procedures typically produce scales with a mean near zero and a range from some messy negative number around -2 or -3 to some equally messy positive number. For convenience of exposition, it is useful to convert the scale to a range extending from zero to one because then the coefficient associated with the scale gives the expected (net) difference on the dependent variable between cases with the lowest and highest scores on the scale. Such a conversion is easy to accomplish, by solving two equations in two unknowns as you did in school algebra:
$$1 = a + b(\max)$$
$$0 = a + b(\min)$$ (11.3)
where "max" is the maximum value of a scale, S, in the data, and "min" is the minimum value of S in the data. This yields a and b, which you then use to transform S into a new scale, S', as follows:
$$S' = a + b(S)$$ (11.4)
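Solving the two equations gives b = 1/(max - min) and a = -min/(max - min). The Python sketch below applies that solution to a handful of invented scale scores; the function names are mine.

```python
def unit_range_coefficients(s_min, s_max):
    """Solve 1 = a + b*max and 0 = a + b*min for a and b."""
    if s_max == s_min:
        raise ValueError("scale has no variation")
    b = 1.0 / (s_max - s_min)
    a = -b * s_min
    return a, b

def rescale_to_unit_range(scores):
    """Apply S' = a + b*S so the lowest score is 0 and the highest is 1."""
    a, b = unit_range_coefficients(min(scores), max(scores))
    return [a + b * s for s in scores]

# Hypothetical standardized scale scores.
scores = [-2.7, -0.4, 0.0, 1.1, 3.1]
print([round(s, 3) for s in rescale_to_unit_range(scores)])
```

Because the transformation is linear, it changes only the units of the scale, not its correlations with other variables.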
CONSTRUCTING SCALES FROM INCOMPLETE INFORMATION
When you construct multiple-item scales, it often is useful to compute scale scores even when information on some items is missing. This reduces the number of missing cases. For example, if I am constructing a five-item scale, I might compute the average if data are present for at least three of the five items. This is easy to accomplish in Stata by using the -rowmean- command to compute the mean and the -rowmiss- command to count the number of missing items, replacing the scale score with the missing value code if the number of missing items exceeds your chosen limit (in the present example, if more than two of the five items have missing values).
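The same logic can be written out in Python for readers who want to see it step by step. This is a sketch of the rule described above, not the Stata code itself: missing values are represented by None, and the function name and the default limit of two missing items are my own choices for the five-item example.

```python
def scale_score(items, max_missing=2):
    """Mean of the non-missing items, or None (missing) when more than
    max_missing items are missing."""
    present = [v for v in items if v is not None]
    n_missing = len(items) - len(present)
    if n_missing > max_missing or not present:
        return None
    return sum(present) / len(present)

# Five-item scale: two items missing is acceptable, three is not.
print(scale_score([1, 2, None, 4, None]))      # mean of 1, 2, and 4
print(scale_score([1, None, None, None, 5]))   # None
```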
If several factors emerge from the factor analysis, we can, of course, construct several scales. Here the problem of validity looms again. Because we ordinarily start with a set of candidate items that a priori we think measure a single underlying concept, we are on the safest ground if only one factor emerges. If more than one factor emerges, we are forced to consider what concept each factor is measuring. Working from indicators to concepts carries the very real danger that our sociological imagination will get the better of us and that we will invent a concept to explain a set of correlations that reflect sampling error rather than some underlying reality. The danger is compounded if we forget that we have invented the concept to explain the data and start treating it as if it has an independent reality, that is, if we reify our concept. To be sure that we have actually discovered some underlying reality we should replicate the items and the scale in some independent data set (perhaps by using a random half of our sample to develop our scales and fit our models and then using the other random half of the sample to verify the adequacy of both scales and models). Unfortunately, this seldom is done, because we usually want larger samples no matter how large our sample is. However, the GSS provides such opportunities because an analysis developed using data from one year often can be replicated using data from the preceding or following year. I strongly encourage this kind of independent validation.
Readers familiar with factor analysis may wonder why I suggest choosing a set of candidate items, weighting them equally, and averaging them, in contrast to constructing a scale by using the factor scores as weights. The reason is that using the factor scores maximizes the association between the hypothetical underlying concept and the constructed scale in the sample. That is, it capitalizes on sampling variability. The result is that the correlations between a scale constructed in this way and other variables are likely to be substantially smaller if the same analysis is replicated using a different data set. By contrast, the factor-based scaling procedure, in which the items are equally weighted, is much less subject to cross-sample shrinkage. In this sense factor-based scales are more reliable than are scales constructed using factor scores as weights.
A WorkedExample:Religiosityand Abortion Attitudes (Again) Abortionhasbecomeanincreasinglysalientandemotionallychargedissuein recentyear-:Fundamentalist religiousgroups(andothers)opposeabortionas'1nurder"while feminis (and othert defendthe right of women to control their own bodies.Despitethe shal polarization of opinion regarding abortion, most Americans evidently support the ar.ai ability of legal abortionunder at least somecicumstances.Many peoplefind abonicr acceptablefor medical or therapeuticreasonsbut nol for reasonsof personalpreferencetr convenience. Consideringthe theologicalunderpinningof the "right to life" movemenrthat a fetus is a personand hencethat abortion is tantamountto murder_we might expe; strongly religious people to adamantly oppose abortion for personal preferencereasonr but to be lessopposedto abortionfor therapeuticreasons,when the ,rights,,of the fenrs mustbe weighedagainstthe healthandsafetyof themother.Thosewho arelessreligiou:_ by contrast,might be expectedto makelessof a distinctionbetweenthe acceptabiliq...r abortion for personalpreferenceand therapeuticreasons.If thesesuppositionsare conerwe would expectreligiosity to have a weakereffect on attitudesrlgarding therapeuui abortionthanon attitudesregardingabonionfor personalpreferencereasons. To testthis hypothesis,I usedatafrom the 1984GSS,a representative sampleof l.-l-l adultAmericans.(seedownloadabre fires"ch11.do"and"ch1Llog" for estimationdetail,. I usethe 1984surveybecauseit containsitemssuitablefor constructinga scaleofreligiority (discussed later).Specifically,I comparethe coefficientsin two regressionequations: (11,:
$$\hat{T} = a + b(F) + c(E) \tag{11.5}$$

$$\hat{P} = a' + b'(F) + c'(E) \tag{11.6}$$
where T, P, and F are, respectively, scales of the acceptability of abortion for therapeutic reasons, the acceptability of abortion for personal preference reasons, and religiosity. E is years of school completed, introduced as a control variable because it is known that acceptance of abortion increases with education and that religious fundamentalism is negatively correlated with education in the United States.

The three scales were constructed by factor analyzing items thought to represent the dimension being measured, eliminating items with low factor loadings, converting each item to standard score form, and averaging items. To facilitate interpretation of the regression coefficients, the resulting scales were then transformed so that each had a range of zero (for the lowest level of religiosity and the lowest acceptance of abortion) to one (for the highest level of religiosity and the highest acceptance of abortion). Candidate items for the scale of religious fundamentalism included the following:
1. ATTEND: How often do you attend religious services? (Range: never . . . several times a week).
2. POSTLIFE: Do you believe there is a life after death? (no, yes).
3. PRAY: About how often do you pray? (Range: never . . . several times a day).
4. RELITEN: Would you call yourself a strong [religion named by respondent in response to question on religious preference] or not a strong [preference]? (not very strong; somewhat strong [volunteered] or don't know or no answer; strong).
5. BIB: Alternative versions of this question were asked of two-thirds and one-third of the sample, respectively:
   I. Which of these statements comes closest to describing your feelings about the Bible?
      a. The Bible is the actual word of God and is to be taken literally, word for word.
      b. The Bible is the inspired word of God but not everything in it should be taken literally, word for word.
      c. The Bible is an ancient book of fables, legends, history, and moral precepts recorded by men.
   II. Here are four statements about the Bible, and I'd like you to tell me which is closest to your own view:
      a. The Bible is God's word and all it says is true.
      b. The Bible was written by men inspired by God, but it contains some human errors.
      c. The Bible is a good book because it was written by wise men, but God had nothing to do with it.
      d. The Bible was written by men who lived so long ago that it is worth little today.
*.sionsI-and rr were combined, excepr thar(
combinedwith category(c) from "".(:;H:ti.lHtJ*:ffi1t;:In versronI to a a new newvariahle *r*oro o^,^,-2 variable, NEwBIB. Before thesgn,,e-it",#;;" #JiffiT#:l#li; coneratedrvith,h;;;;;;;,.1ilo, ..ooot "noanswer', ;:?Xl:1,*:lj:..",:very rno*_ responses
r"l ri. ;il'Jfit :."#:rtff ,".:.T"T?L ".r".J. size Arter .a.; ;;;;;;-u.ll"u", i#li*r**T"::::l.l*l"y therumber ,"." "ri-;", orcases availabreil; ffi;;;yJj";fl5t :H:ffiT;llI U"latedprincipalfactonngandvarimax y4urrd roration. rurauon. A singredominant iacror emer ged'
*"t^r:1,:r,lg
wrth loadings after rotation:
ATTEND POSTLIFE PRAY RELITEN NEWBIB
which explained 86 percent of the total
.787 .573 .654 .260
Given the pattem of factor loadingg it appearsfrom simple inspection that a threo_
t;j:ffi:l1"T"1?:H:'::l+fi:,1's!di"" jili','xf ::"#ffi sca.rerhatincrudesennowzwnreiin";;ff#;l:;,;lT;%;.#;;*lt",T "".".",iab,s\a,
ll1l'. r,li. ::Til:':,]:Til:l;,'l'ff ":T,": nrylv "g4q,,,,'"
,";:lfii,ii:,.',H;3"ffi1[tnl]. ;1;111ag,r'oery,-i',ffi r"f;r= w-ith much rower r".i.i#i,r.,u*i, liJ:ilfi i".iG ffi:*:fili:Ti:t1t""s
;'J.lffi"HJTff.n:"*.."##::,r :,..]:x"qhisr,L"oi,e,.u;;::;ff
jtr1H ffiffi [r".T:#'.i,ffi ilT"3#,*:::+m1;r;*:if"f
f,t1ri.Hffif:aft Jfr li*:J:"'ilJi:''.T#:::',i"l: abortion, Tocreatescale,."u.*Uirii
. ine ,"u* it ..,i;i; ,'
';;:#:;:;,f:;#::;:
ffiffi:H:::::Xi#::
I factor analyzed the following items, each asking whether or not you think it should be possible for a pregnant woman to obtain a legal abortion:

1. ABDEFECT: If there is a strong chance of serious defect in the baby?
2. ABNOMORE: If she is married and does not want any more children?
3. ABHLTH: If the woman's own health is seriously endangered by the pregnancy?
4. ABPOOR: If the family has a very low income and cannot afford any more children?
5. ABRAPE: If she became pregnant as a result of rape?
6. ABSINGLE: If she is not married and does not want to marry the man?
7. ABANY: If the woman wants it for any reason?

In each case the possible responses were "Yes," "No," "Don't know," and "No answer." "Don't know" and "No answer" were treated as intermediate between "Yes" and "No." Although, as indicated, I expected that there are distinct responses to abortion for therapeutic and personal preference reasons, I nonetheless factor analyzed both subsets of items together to confirm empirically that the two sets of items do in fact behave distinctively. Two nontrivial factors were extracted, which together explained 96 percent of the variance in the items. Table 11.2 shows the loadings before rotation.

As is evident, all seven items load strongly on Factor 1. But some are positive and some are negative on Factor 2. The pattern of positive and negative loadings on Factor 2 suggests that these items can be subdivided into two distinct subsets. Table 11.3 shows the result of executing a varimax rotation, a rotation of the factor matrix that maximizes the distinction between factors.
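To make the idea of rotating a factor matrix concrete, here is a minimal sketch in Python. It assumes a two-factor solution and a rotation angle supplied by hand (varimax itself chooses the angle by optimizing a criterion, which is not implemented here). The loadings are hypothetical, not the values in Table 11.2, and the sign convention for the angle is an assumption noted in the code.

```python
import math

def rotate_loadings(loadings, degrees):
    """Apply an orthogonal rotation of `degrees` to a two-factor loading matrix.

    Sign convention (an assumption): each item's point is turned
    counterclockwise, which is equivalent to turning the axes clockwise.
    """
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    return [(a * c - b * s, a * s + b * c) for a, b in loadings]

def communalities(loadings):
    """Sum of squared loadings per item; unchanged by an orthogonal rotation."""
    return [a * a + b * b for a, b in loadings]

# Hypothetical unrotated loadings (Factor 1, Factor 2); NOT the Table 11.2 values.
hypothetical_loadings = [(0.80, -0.25), (0.85, -0.20), (0.70, 0.45), (0.65, 0.50)]
rotated = rotate_loadings(hypothetical_loadings, 30)

for before, after in zip(hypothetical_loadings, rotated):
    print(before, "->", tuple(round(v, 3) for v in after))
```

In this illustration the first two items end up loading mainly on the first factor and the last two mainly on the second, while each item's communality is the same before and after rotation. That is the sense in which a rotation re-expresses the same solution rather than changing it.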
TABLE 11.2. Factor Loadings for Abortion Acceptance Items Before Rotation.
Factor 1
ABNOMORE
ABPOOR
Factor 2
*.263
.8 3 1
-.183 .412
.869
- .249
TABLE 11.3. Abortion Factor Loadings After Varimax Rotation.
Factor 1
ABNOMORE
Factor 2
.880
ABPOOR
ABSINGIE
.876
.217
Inspecting these loadings you see that, as hypothesized, two factors underlie abortion attitudes. ABNOMORE, ABPOOR, ABSINGLE, and ABANY all load strongly on Factor 1 (shown in bold) and weakly on Factor 2, whereas the remaining three items load strongly on Factor 2 (shown in bold) and weakly on Factor 1. These two sets of items correspond to the a priori distinction I made between abortion for personal preference reasons (Factor 1) and abortion for therapeutic reasons (Factor 2).

Figure 11.1 demonstrates that the unrotated and rotated factor structures are simply mathematical transformations of one another and do nothing to change the relationships among the variables. The rotation merely presents the results in a form that makes them more readily interpretable. As noted previously, in the unrotated matrix (solid axes), all items load positively on Factor 1 but some items have positive loadings on Factor 2 and some items have negative loadings. After I rotate the axes 30 degrees counterclockwise (to the dashed lines), all items have positive loadings on both factors, but four (the personal preference reasons) load strongly on the first factor and weakly on the second factor while three (the "therapeutic" reasons) load weakly on the first factor and strongly on the second factor.

Given these results, two separate scales are warranted. I therefore constructed a scale of acceptance of abortion for personal preference reasons, using the four items loading strongly on Factor 1, and a scale of items for therapeutic reasons, using the three items loading strongly on Factor 2. In each case the items were converted to standard form and averaged. I computed averages if valid responses were available for at least three of the four personal preference items and at least two of the three therapeutic items. Again, the
FIGURE 11.1. Loadings of the Seven Abortion-Acceptance Items on the First Two Factors, Unrotated and Rotated 30 Degrees Counterclockwise.
scales were transformed to range from zero to one, with one indicating high acceptance of abortion.

The second criterion for scale validity is whether the component items all bear approximately the same relationship to the other variables in the analysis. Ideally, one should assess both the zero-order and net relationships between the component items and the dependent variables. Here, however, the dependent variables are the two abortion attitudes scales. Thus I assess the consistency of the relationships simply by inspecting the correlations among each of the components of all three scales plus the remaining independent variable, education. These correlations are shown in downloadable file "ch11.log." All of the components of each scale show consistency with respect to sign and gross similarity with respect to magnitude in their correlations with the remaining variables. Thus I conclude that combining these items into scales as I have done is appropriate.

Table 11.4 shows the means, standard deviations, and correlations among the three scales and years of school completed, and Table 11.5 shows the coefficients estimated for Equations 11.5 and 11.6. Not surprisingly, the mean for the scale of acceptance of therapeutic abortion is much higher than the mean for the scale of acceptance of abortion for personal preference reasons. (Because each scale is calibrated by converting the lowest score in the sample to zero and the highest score in the sample to one, comparison of the means across scales is not, strictly speaking, legitimate. However, they do indicate where the typical respondent falls relative to the most accepting and least accepting respondents on each category of abortion, and hence can be used to compare the relative acceptance of the two types of abortion.)
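The scale-building steps described above (standardize each item, average the items when enough valid responses are present, then rescale to run from zero to one) can be sketched as follows. This is only an illustration under stated assumptions: the item responses are hypothetical, missing answers are coded as `None`, and the population standard deviation is used for standardizing (the downloadable Stata files may make a different choice).

```python
from statistics import mean, pstdev

def standardize(values):
    """Convert one item to z-scores, leaving missing responses (None) missing."""
    valid = [v for v in values if v is not None]
    m, sd = mean(valid), pstdev(valid)
    return [None if v is None else (v - m) / sd for v in values]

def build_scale(items, min_valid):
    """Average standardized items per respondent, then rescale to the 0-1 range.

    `items` is a list of item columns; a respondent gets a score only if at
    least `min_valid` of the items have valid responses.
    """
    z_items = [standardize(col) for col in items]
    raw = []
    for row in zip(*z_items):
        valid = [z for z in row if z is not None]
        raw.append(mean(valid) if len(valid) >= min_valid else None)
    scored = [r for r in raw if r is not None]
    lo, hi = min(scored), max(scored)
    return [None if r is None else (r - lo) / (hi - lo) for r in raw]

# Hypothetical responses for three items and five respondents (1 = yes, 0 = no).
item_a = [1, 0, 1, None, 0]
item_b = [1, 0, None, None, 1]
item_c = [1, 0, 1, 0, 0]

scale = build_scale([item_a, item_b, item_c], min_valid=2)
print(scale)
```

The fourth respondent answered only one item and so receives no score, mirroring the rule of requiring at least two of three therapeutic items. The lowest and highest scorers in the sample define the zero and one endpoints, which is why means on different scales are not strictly comparable.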
As predicted, acceptance of abortion for reasons of personal preference is somewhat more strongly socially structured than is acceptance of abortion for therapeutic reasons. The $R^2$ for the former is .182, compared with .136 for the latter. Moreover, both of the coefficients are substantially larger for the personal preference equation than for the therapeutic equation, indicating that both education and religiosity have a greater effect on attitudes regarding abortion for personal preference reasons than regarding abortion for therapeutic reasons. However, the standardized effect of religiosity is about equally strong for both sets of abortion reasons, whereas the standardized effect of education is much stronger for personal preference abortion.
Seemingly Unrelated Regression
A formal test of whether corresponding coefficients differ significantly in the two equations is available through Zellner's seemingly unrelated regression procedure, implemented in Stata as -sureg-. This procedure simultaneously estimates models containing some or all of the same independent variables but different dependent variables. When the independent variables are identical across models, the coefficients and standard errors are identical to those from separately estimated equations, but -sureg- provides two additional kinds of information: an estimate of the correlation between residuals from each equation and a test of the significance of the difference between corresponding coefficients. In the present case, the correlation between residuals is .38, which tells us that whatever factors other than education and religiosity lead to acceptance of abortion for therapeutic reasons also tend (modestly) to lead to acceptance of abortion for personal preference reasons. The tests of the equality of corresponding coefficients reveal that, as hypothesized, the coefficients for education and religiosity are significantly larger in the personal preference equation than in the therapeutic equation. (See downloadable file "ch11.do" for details on how to implement -sureg-.)
Effect-Proportional Scaling
A special kind of scaling problem arises when we have an independent variable that has a nonlinear relationship to the dependent variable of interest. In Chapter Seven I discussed procedures for assessing whether relationships are nonlinear and for representing nonlinear relationships by changing the functional form of equations. One possibility I discussed was to represent nonlinear relationships by converting variables into sets of categories and studying the relationship between category membership and the outcome variable. In this section I describe an extension of categorical representations of variables: effect-proportional scaling, which is available in situations in which the dependent variable has a clear metric. (For an example of a research use of effect-proportional scaling, see Treiman and Terrell [1975].)

Suppose, for example, that we are interested in the relationship between educational attainment and occupational status in a nation with a multitrack school system. We might well expect that in such systems occupational attainment depends not only on the amount of schooling but on the type of schooling completed. How to represent the effect of schooling in a succinct way becomes a difficult problem in such situations. We could, of course, create and report the coefficients for a typology of type-by-extent of schooling, but this is
likely to require the presentation of many coefficients. An alternative would be to go one step further and scale educational categories in terms of their effect on occupational status. From a technical point of view, this is very simple. We estimate the relationship between occupational status (measured, say, by the International Socioeconomic Index of Occupations [ISEI] [Ganzeboom, de Graaf, and Treiman 1992; Ganzeboom and Treiman 1996]) and a set of dummy variables corresponding to our typology of type-by-extent of schooling, and then we form a new education variable in which each category in the typology is assigned its predicted occupational status.

Doing this maximizes the correlation between educational attainment and occupational status (no other scaling of education would produce a higher correlation, given the same set of categories), and, of course, the correlation is identical to the correlation ratio. Thus the interpretation of the education variable becomes "the highest level of education achieved, calibrated in terms of its average occupational status return." So long as the analyst is candid with the reader that this is what has been done, there can be no objection. The clear advantage of the procedure is that it allows educational attainment to be included succinctly in subsequent analysis and thus permits assessment of how the relationship between educational attainment and occupational status is affected by other factors, and how the relationship differs across subpopulations, for example, by ethnicity.

Here is an example of the construction and use of such a scale. (No log file is shown for this worked example because no new computing techniques are introduced.)
In the 1996 Chinese survey analyzed earlier (in Chapters Six, Seven, and Nine; see Appendix A for details on the data set and how to obtain it), education was solicited with a question that included the categories shown in Table 11.6. Although, with the exception of the last two categories, the classification appears to form an ordinal scale of increasing education, it is not evident whether the scale has a monotonic relationship to occupational status. In fact it does not, as can be seen from the means on the ISEI shown in Table 11.6. In particular, vocational and technical middle school graduates tend to achieve substantially higher occupational status than do academic upper middle school graduates who do not go on to university.

I thus created a new education variable in which each category was assigned the mean ISEI score shown in Table 11.6. (A convenient way to do this in Stata is to regress ISEI on the education categories and get predicted values from the regression. The $R^2$ associated with this regression, .372, is, of course, just the square of the correlation ratio that we encountered in Chapter Five, $\eta^2$.) This scale can then be used in other analyses. For example, we might wish to assess the dependence of occupational status on education and father's occupational status for several nations, including China, to assess national similarities and differences in the relative importance of achievement and ascription in occupational status attainment.
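The logic of effect-proportional scaling can be sketched in a few lines of Python. The data below are hypothetical (they are not the Chinese survey values), and the function names are illustrative. Each category is replaced by the mean of the outcome within that category, which is what the predicted values from a regression on category dummies would give; the sketch then checks that the squared correlation between the new scale and the outcome equals the squared correlation ratio.

```python
from statistics import mean

def effect_proportional_scale(categories, outcome):
    """Replace each category with the mean outcome observed in that category."""
    groups = {}
    for cat, y in zip(categories, outcome):
        groups.setdefault(cat, []).append(y)
    cat_means = {cat: mean(ys) for cat, ys in groups.items()}
    return [cat_means[cat] for cat in categories], cat_means

def pearson_r(x, y):
    """Ordinary product-moment correlation."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def eta_squared(categories, outcome):
    """Correlation ratio squared: between-category SS over total SS."""
    scaled, _ = effect_proportional_scale(categories, outcome)
    grand = mean(outcome)
    between = sum((s - grand) ** 2 for s in scaled)
    total = sum((y - grand) ** 2 for y in outcome)
    return between / total

# Hypothetical education categories and occupational status scores.
education = ["lower_middle", "lower_middle", "academic_upper", "academic_upper",
             "vocational", "vocational", "university", "university"]
isei = [25, 31, 30, 36, 44, 50, 62, 68]

scaled_education, category_means = effect_proportional_scale(education, isei)
r = pearson_r(scaled_education, isei)
print(category_means)
print(round(r ** 2, 4), round(eta_squared(education, isei), 4))
```

Because the new variable is a single column, it can enter later models as one term rather than as a full set of dummy variables, which is the practical payoff described above.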
ERRORS-IN-VARIABLES REGRESSION

As noted previously, unreliable measurement generally produces weaker measured effects. Thus when variables are measured with differential reliability, the multivariate structure of relationships can be substantially distorted. Because attitude variables often have low
TABLE 11.6. Mean Score on the ISEI by Level of Education, Chinese Males Age Twenty to Sixty-Nine, 1996.

Level of Education                        Mean ISEI    N
Illiterate                                             113
Can read                                  16.0
Upper middle (also specialized)           35.5         272
Specialized, including "five big"         61.0         111
                                          65.1         65
Imperial degree holder (xiucai, juren)    30.5
Other                                     39.0
Total                                     28.5         2,413
reliability, analyses including such variables often can be misleading. A way of correcting this problem, when measures of reliability are available, is to correct correlations for attenuation caused by unreliability. The Stata command -eivreg- (errors-in-variables regression) does this conveniently. The analyst supplies an estimate of the reliability of each variable, and the command makes the adjustment and carries out the regression estimation. (If no estimate is supplied, the variable is assumed to be measured with perfect reliability.)
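The idea of correcting for attenuation can be illustrated with the classical single-predictor formulas. This sketch is not what -eivreg- does internally for a multivariate model; it only shows, under the standard assumption of random measurement error, how an observed correlation or bivariate slope is scaled up once reliabilities are known. The observed correlation and slope below are hypothetical; the reliabilities .66 and .80 echo values cited in this chapter.

```python
def disattenuated_correlation(r_xy, reliability_x, reliability_y=1.0):
    """Classical correction: divide by the square root of the product of reliabilities."""
    return r_xy / (reliability_x * reliability_y) ** 0.5

def disattenuated_slope(b_yx, reliability_x):
    """Bivariate case only: error in X shrinks the slope by the reliability of X."""
    return b_yx / reliability_x

# Hypothetical observed values.
observed_r = 0.30
observed_b = 0.50

corrected_r = disattenuated_correlation(observed_r, 0.66, 0.80)
corrected_b = disattenuated_slope(observed_b, 0.66)
print(round(corrected_r, 3), round(corrected_b, 3))
```

In a multiple regression the adjustment is made to the whole covariance structure at once, so coefficients can move in either direction when predictors differ in reliability; the single-predictor case above always moves the estimate away from zero.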
To show how this procedure works and what consequences it can have, I here present an analysis of the effect of abortion attitudes and religiosity (the three scales created previously) plus race, region of residence, an interaction between race and region, and the natural log of income on political conservatism.

From the previous analysis we have the reliability of the three scales. I take the reliability of the income measure, .8, from Jencks and others (1979, Table A2.13) and assume that race and region of residence are measured without error. Table 11.7 shows the results of OLS estimation without correcting for unreliability of measurement, and errors-in-variables estimation that does correct for unreliability. Because -eivreg- does not permit correction of the standard errors for clustering and requires aweights (or fweights), I carried out both conventional and errors-in-variables regression with these specifications.
TABLE 11.7. Coefficients of a Model of the Determinants of Political Conservatism Estimated by Conventional OLS and Errors-in-Variables Regression, U.S. Adults, 1984 (N = 1,294).

                                Conventional OLS           Errors-in-Variables
                                b        s.e.     p        b        s.e.     p
Religiosity                     0.692    0.170    .000     1.066    0.257    .000
Personal preference abortion   -0.282    0.091    .002    -0.220    0.113    .051
I
.816
0.172
o.425 DI
0.063
0.079
s .e .e .
1.24
1.23
I i
The effect of adjusting for differential reliability is dramatic: the coefficient associated with religiosity increases by 54 percent. In addition, the coefficients associated with acceptance of therapeutic abortion and with income increase slightly, and the coefficient associated with acceptance of abortion for personal preference reasons decreases slightly. These results indicate clearly how the relative effects of variables in a multiple regression can be distorted if the variables are measured with differential reliability, as these are. (Recall that the reliabilities of the religiosity, therapeutic abortion, income, and personal preference abortion measures are, respectively, .66, .78, .80, and .93.)

Note that with one exception, all the coefficients are about what we would expect: political conservatism increases with religiosity, with income, and for non-Blacks, with Southern residence (although this last effect is only marginally significant) and decreases as acceptance of both kinds of abortion increases. The unexpected result is that acceptance of therapeutic abortion is a much stronger predictor of political conservatism than is acceptance of abortion for personal preference reasons. From the analysis I presented
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen why multiple-item scales are advantageous: they improve reliability of measurement. We have considered two ways to construct such scales that go beyond simple counts of the kind used in previous chapters. In this chapter we focused mainly on factor-based scaling, which provides a means of purging a scale of items that do not reflect the same underlying dimension as the remaining items or that reflect other dimensions in addition to the main dimension. We also considered effect-proportional scaling, which is useful in establishing a metric for a set of categories by scaling the items according to their effect on some criterion variable. Finally, we considered two extensions of OLS regression: errors-in-variables regression, which corrects regression coefficients for attenuation caused by unreliability (something that can alter our substantive conclusions when variables in a model are measured with differential reliability); and seemingly unrelated regression, which provides a means for comparing models with different dependent variables but (at least some of) the same independent variables.
LOG-LINEAR ANALYSIS

WHAT THIS CHAPTER IS ABOUT

Log-linear analysis is a technique for making inferences about the presence of particular relationships in cross-classification tables. The first three chapters of this book were devoted to percentage tables. In those chapters we spent considerable time on cross-tabulations, developing rules of thumb for deciding how large a difference between two percentages had to be before we were willing to take it seriously, how to detect interactions among variables, and so on. Log-linear analysis provides a way of formalizing the analysis of cross-tabulations, permitting an assessment of whether relationships observed in a cross-tabulation constructed from sample data are likely to exist in the population from which the sample is drawn and also providing a way of describing the relationships. In this chapter we first consider how to fit a log-linear model to multiway tables, to get the mechanics straight. We then move on to more parsimonious models for two-way tables, using the study of intergenerational occupational mobility as our main substantive example, although these techniques can be applied in many other contexts as well. Additional expositions of log-linear analysis can be found in Knoke and Burke (1980) and in Powers and Xie (2000, chap. 4). This chapter draws heavily on Powers and Xie.
INTRODUCTION

In one sense the model-fitting aspect of log-linear analysis is nothing more than a generalization of the $\chi^2$ (chi-square) test for the independence of two variables. Recall that in the usual (Pearson) $\chi^2$ test, the observed frequencies in each cell are contrasted with a model of perfect independence, in which the expected frequencies in each cell are simply the product of the marginal frequencies divided by the total number of cases in the table. The size of $\chi^2$ then depends on the extent to which the observed frequencies depart from the frequencies expected from the model of independence.

This approach can be generalized to more complex relationships, albeit with a change in the formula. For a bivariate frequency distribution we can write a general formula for expected cell frequencies:

$$F_{ij} = \eta\,\tau_i^X\,\tau_j^Y\,\tau_{ij}^{XY} \tag{12.1}$$

where $\eta$ (eta) is the geometric mean of the cell frequencies (the geometric mean of $n$ values is the $n$th root of their product); $\tau_i^X$ is the "effect parameter" for the $i$th category of the $X$ variable ($\tau$ is pronounced "tau"); $\tau_j^Y$ is similarly defined; and $\tau_{ij}^{XY}$ is the parameter for the "interaction" of the $i$th category of $X$ and the $j$th category of $Y$.
In Log-Linear Analysis "Interaction" Simply Means "Association"

Note that in the log-linear literature the term "interaction" is what is called "association" in the older literature on cross-tabulation tables. It is important to recognize that it is not the same as what is called an interaction in both the older tabular literature and in the literature on multiple regression. In those literatures "interaction" refers to the situation in which the relationship between two variables depends on the value of one or more other variables.
The relationship expressed by Equation 12.1 can be shown to hold when the $\tau$s are defined as functions of odds ratios (see Appendix 12.A). The odds of an observation being in a given category of a variable are just the ratio of the frequency of observations in the category to the frequency of observations not in it. Thus in a class of 20 men and 10 women, the odds of a student in the class being a man are 20/10 = 2:1 ("two to one").

Analyzing the data in Table 12.1, we see that the ratio of the odds of being a letters and science (LS) student, given that one is male, to the odds of being an LS student, given that one is female, is (9/11)/(9/1) = 1:11. So men are one-eleventh as likely to be LS students as are women (and, of course, women are eleven times as likely to be LS students as are men). Odds ratios vary around unity; if the odds of being an LS student were the same for males and females, the odds ratio would be 1.0. An odds ratio of less than 1.0 indicates, in this case, that the odds of being an LS student are smaller for males than for females whereas an odds ratio of greater than 1.0 indicates that the odds of being an LS student are greater for males than for females.
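The arithmetic in the two preceding paragraphs can be checked with exact fractions. The sketch below uses the cell counts implied by the text (9 male and 9 female LS students, 11 male and 1 female management students); the function and variable names are illustrative only.

```python
from fractions import Fraction

def odds(in_category, not_in_category):
    """Odds of being in a category: frequency in it over frequency not in it."""
    return Fraction(in_category, not_in_category)

# Odds of a student being a man in a class of 20 men and 10 women.
odds_man = odds(20, 10)

# Odds of being an LS student, separately for men and for women.
odds_ls_given_male = odds(9, 11)
odds_ls_given_female = odds(9, 1)

# Odds ratio: male odds relative to female odds.
odds_ratio = odds_ls_given_male / odds_ls_given_female
print(odds_man, odds_ls_given_male, odds_ls_given_female, odds_ratio)
```

Inverting the ratio gives the odds for women relative to men, which is why one-eleventh and eleven describe the same association from opposite sides.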
TABLE 12.1. Frequency Distribution of Program by Sex in a Graduate Course.

                       Male    Female    Total
Letters and science    9       9         18
Management             11      1         12
Total                  20      10        30
Now suppose we take the natural log of both sides of Equation 12.1. This gives us

$$\ln(F_{ij}) = \ln(\eta\,\tau_i^X\,\tau_j^Y\,\tau_{ij}^{XY}) = \ln(\eta) + \ln(\tau_i^X) + \ln(\tau_j^Y) + \ln(\tau_{ij}^{XY}) \tag{12.2}$$

which has a log-linear form (that is, the left side of the equation is a linear function of logs of the quantities on the right side of the equation); hence the term log-linear. Equation 12.2 is sometimes (for example by Leo Goodman [1972, 1043], one of the developers of the method) expressed as

$$G_{ij} = \theta + \lambda_i^X + \lambda_j^Y + \lambda_{ij}^{XY} \tag{12.3}$$

where the $\lambda$s (lambdas) are (natural) logs of the $\tau$s, $\theta$ (theta) is the log of $\eta$, and $G$ is the log of $F$. An alternative notation, used by Powers and Xie (2000, 107), is

$$\ln(F_{ij}) = \ln(\eta) + \ln(\tau_i^X) + \ln(\tau_j^Y) + \ln(\tau_{ij}^{XY}) = \mu + \mu_i^X + \mu_j^Y + \mu_{ij}^{XY} \tag{12.4}$$

where $\mu$ (mu) $= \ln(\eta)$, and so on. An even more convenient notation, which we also will use, is [XY], which implies that the model of interest includes the explicitly specified relation and all of the lower order effects. This is sometimes called the fitted marginals notation. Equations 12.1 through 12.4 can easily be generalized to more than two variables, as we will see in the following section.
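To see the multiplicative and log-linear forms side by side, here is a small numerical sketch for a 2 × 2 table. It assumes the usual sum-to-zero (effect) coding of the log parameters, which is consistent with $\eta$ being the geometric mean of the cell frequencies; the chapter's own definition of the $\tau$s in terms of odds ratios is given in Appendix 12.A and is not reproduced here. The counts are those implied by the odds computed for Table 12.1, and treating rows as program and columns as sex is an arbitrary choice.

```python
import math

# Rows: program (LS, management); columns: sex (male, female).
counts = [[9, 9], [11, 1]]
logs = [[math.log(f) for f in row] for row in counts]
n_rows, n_cols = len(logs), len(logs[0])

# Additive (log) parameters under sum-to-zero coding.
mu = sum(sum(row) for row in logs) / (n_rows * n_cols)
lam_x = [sum(row) / n_cols - mu for row in logs]
lam_y = [sum(logs[i][j] for i in range(n_rows)) / n_rows - mu for j in range(n_cols)]
lam_xy = [[logs[i][j] - mu - lam_x[i] - lam_y[j] for j in range(n_cols)]
          for i in range(n_rows)]

# Multiplicative parameters are the antilogs of the additive ones.
eta = math.exp(mu)
tau_x = [math.exp(v) for v in lam_x]
tau_y = [math.exp(v) for v in lam_y]
tau_xy = [[math.exp(v) for v in row] for row in lam_xy]

# The saturated model reproduces every observed cell frequency.
fitted = [[eta * tau_x[i] * tau_y[j] * tau_xy[i][j] for j in range(n_cols)]
          for i in range(n_rows)]
print(round(eta, 4), [[round(f, 6) for f in row] for row in fitted])
```

Under this coding the two $\tau$s for each variable are reciprocals of one another, so their product is 1.0, and the fitted frequencies equal the observed ones because every possible parameter is included.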
CHOOSING A PREFERRED MODEL

In Equation 12.1 the observed cell frequencies, $f_{ij}$, are exactly equal to the predicted frequencies, $F_{ij}$, and thus are perfectly predicted because all possible effect parameters are present in the model. Hence Equation 12.1 is known as a saturated model.
interesr. ordinariry, il,TJ:iff ,ffi;[ffiT.tff -] t orIinre
:::ilf i?nJ:l.#t*t***l*'{}*tfi*J#i** .j*#:,":",,".diiT"",ill,:&f,f""
","+*:""i:"ffi
:orlrnonis.,"""r.r.'l^iii,fi "_j::m 1li:.i,+.,i"{"r*"'J:.:,U.r;:mf to*t?j?iu l"T::,T:ffi:9;*il1ffiffl,: ,,u,.*", erm-retr l:u*nuutv 'to'.it
ltt':,T##'#i*ktT{1ffj,".,:jT;'#1ff"1:i::i iJiIil",'"H 'ffi;ilTf":ln'ffi';l*t :'":ifl :"'ffi l:ff?i:ffi :*'":f#,l";'-#ffi Model SetectionBasedon
Goodnessof Fit
approach, usins thedara rrom rabre I2.1.rlrc
.;itill*illfi -ffi1'Iffi-r*"J-rlff*t;,y:-.^rflg ;""fi 3.H*i::*::. "#i#i.i:ii!:!ilH:#;ff tr*:1.lt"ii3,i,l,,ifiT:i;;*lhnti".:m
:_; j:T::;"'"il;;#;u':ilff;, r, urown ff;;::;#,:\,iJ,:fil,1:: a,r rV ' F.l Jt I i :t i = t\ ri j )
(12.-'
where-F is, as notedpreviouslv. t
jl;i:dj:,f:lTilfl "H fj{fy"::,:"i:;;'::i,:_TH:rjt" "t
.,#,. jtil_til*;1i::1rtr. ,,;mn*+:#1fr I p;411,11i::;iill.x,",:#:f;:t*
ff,?#"ff ;:trflhnff :?T:".'*:,*;::r[iff :l;lury,.o_,r,o q r{ = 1/r{ r{ = 1/ r{ ,f{
=,{{:1/r{{
(12.6t
=11,x,r
Because for the simplest model we estimate only $\eta$, we have three remaining (residual) degrees of freedom. As expected, the fit to Table 12.1 is poor, $L^2 = 10.96$, which implies that the observed distribution would occur by chance only about 1 percent of the time if the cell sizes were equal in the population (precisely, p = .012). So we conclude that this model does not fit the data; that is, we reject the null hypothesis that the cell sizes are equal. (For details on how to estimate such models, see the worked example on anticommunist sentiment later in the chapter and also the downloadable files "ch12_1.do" and "ch12_1.log.")

To be sure, the "population" in this example is problematic because we are presumably studying the characteristics of all individuals enrolled in a particular course and so we might think of these individuals as constituting the population rather than a sample. However, we might regard the individuals enrolled in the course at any given time as a sample of all possible sets of individuals ever enrolled in the course, and hence generalize from the particular set of observations to what we might expect "for this course over the long run" or for "courses like this." This sort of use of statistical inference is in fact quite common in research practice (see the discussion of the concept of superpopulation in Chapter Sixteen).
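The fit statistic reported above can be reproduced with a short calculation. The sketch assumes the standard likelihood-ratio chi-square formula, twice the sum over cells of the observed frequency times the log of observed over expected, and uses the Table 12.1 counts implied by the text. The p-value function is the closed-form upper tail of the chi-square distribution for exactly 3 degrees of freedom; it is not a general-purpose routine.

```python
import math

def likelihood_ratio_chi2(observed, expected):
    """L2 = 2 * sum of f * ln(f / F) over cells (empty cells contribute zero)."""
    return 2 * sum(f * math.log(f / e) for f, e in zip(observed, expected) if f > 0)

def chi2_upper_tail_3df(x):
    """Upper-tail probability of a chi-square variate with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Cell counts: LS male, LS female, management male, management female.
observed = [9, 9, 11, 1]

# Simplest model: every cell has the same expected frequency.
expected = [sum(observed) / len(observed)] * len(observed)

l2 = likelihood_ratio_chi2(observed, expected)
p = chi2_upper_tail_3df(l2)
print(round(l2, 2), round(p, 3))
```

With four cells and one estimated parameter, three degrees of freedom remain, which is why the 3-d.f. tail probability is the relevant one here.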
$L^2$ Defined
tt canbe shownthatl, is minustwicethe differenc(
*:"tnim:*:Llrii**:ffi n:mN ":H:iT;?:;t::il::i:if ikelihood estimation and a definitionof the likelihood.)
Next we might test the possibility that the two variables X and Y are independent, so that the cell frequencies are simply a function of the marginal distributions. We would write this: [X][Y]. In this case we are estimating three parameters: $\eta$, $\tau^X$, and $\tau^Y$ (only one $\tau$ need be estimated for each variable because the $\tau$s for a variable are constrained to multiply to 1.0), so we have 1 degree of freedom. In this case $L^2 = 6.35$, so once again the fit is poor (p = .012, only coincidentally the same as for the previous model), indicating that there is an association between X and Y: we cannot predict the cell frequencies simply from the marginals.

To obtain a good fit in this example, it is necessary to estimate all four parameters, which uses up all degrees of freedom (and hence, as we have noted, ensures a perfect fit). We write this [XY]. Note that in this exposition we are dealing with hierarchical models, which means that every higher-order relationship implicitly contains all lower-order relationships. Hence [XY] implies [1][X][Y][XY]. We will return to this point later in the chapter.

So far we have done nothing that could not be done with the usual $\chi^2$ test for independence. However, the same procedures apply to cross-tabulations containing more than two variables, and also to polytomies as well as dichotomies. Consider Table 12.2,
TABLE 12.2. Frequency Distribution of Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies.
No Metalworking: Stateless, State
Metalworking at Least: Stateless, State
Source: Computed from Murdock and Provost (1973).
a cross-tabulation of level of stratification by level of political integration and level of technology among ninety-two societies.

In the data-dredging approach to log-linear analysis, it is common to posit an initial, or baseline, model of complete independence among the variables in the table, in the present case the model of no association between technology [T], political integration [P], and stratification [S]. We do this by fitting the model [T][P][S]. For this model, $L^2 = 84.68$, with 7 degrees of freedom; the goodness-of-fit statistics for this model, and several others, are shown in Table 12.3. Clearly, this model does not fit the data (p < .0000), but we will nonetheless make use of it momentarily.

We might next posit an association, or interaction, between level of political integration and degree of stratification and assume that neither of these variables is related to the level of technology. That is, we fit [T][PS] (Model 2 in Table 12.3). This model posits that the observed cell frequencies can be accounted for (within the limits of sampling error) by the univariate distribution of level of technology and the bivariate distribution of the degree of stratification and the level of political integration. Estimating this model yields $L^2 = 41.54$, with 5 d.f.

Although the large $L^2$ tells us that the model does not provide an accurate fit to the data (p < .0000), we might still want to know whether the prediction is improved relative to the baseline model of complete independence. To see this, we subtract one $L^2$ from the other and similarly subtract the degrees of freedom, and then we get the p-value associated with the new $L^2$ and new d.f. It also is common to show the $L^2$ for each of the subsequent models as a proportion of the $L^2$ for the baseline model, $L^2/L^2_B$; to show the index of dissimilarity, $\Delta$, between the observed frequencies and the frequencies expected under the model; and also BIC. The differences in these measures can be computed as well; the differences between Models 1 and 2 are shown in Table 12.3. All of these computations provide information on the goodness of fit of the models and the improvement in goodness of fit realized by positing successive elaborations of a model.
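The comparison of nested models described above amounts to a few subtractions and one tail probability. The sketch below uses the two $L^2$ values and degrees of freedom reported in the text; the function names are illustrative, and the p-value function is the closed-form chi-square upper tail for exactly 2 degrees of freedom.

```python
import math

def compare_nested(l2_restricted, df_restricted, l2_general, df_general):
    """Difference in L2 and in degrees of freedom between two nested models."""
    return l2_restricted - l2_general, df_restricted - df_general

def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variate with 2 degrees of freedom."""
    return math.exp(-x / 2)

# Model 1: [T][P][S] (baseline); Model 2: [T][PS].
l2_baseline, df_baseline = 84.68, 7
l2_model2, df_model2 = 41.54, 5

l2_diff, df_diff = compare_nested(l2_baseline, df_baseline, l2_model2, df_model2)
p_value = chi2_upper_tail_2df(l2_diff)
l2_ratio = l2_model2 / l2_baseline

print(round(l2_diff, 2), df_diff, p_value, round(l2_ratio, 2))
```

The ratio of about .49 is the basis for saying that positing the [PS] association removes roughly half of the baseline model's lack of fit.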
Table 12.3. Models of the Relationship Between Technology, Political Integration, and Level of Stratification in Ninety-Two Societies.

Model                 L²      d.f.   p       BIC      L²/L² (baseline)   Δ
(1) [T][P][S]         84.68   7
(2) [T][PS]           41.54   5                                          30.4
(3) [P][TS]
(4) [S][TP]           60.48   6
(5) [TP][TS]          21.88   4
(6) [TP][PS]
(7) [TS][PS]           2.94   3      .401    -10.6    .03                 5.3
(8) [TP][TS][PS]       0.60   2      .739     -8.4    .01                 2.5
(1) minus (2)         43.14   2
(7) minus (8)          2.34   1
a. The probability of L² = 43.14, with 2 d.f. It can be obtained from the complement of the probability returned by the Stata -chi2- function.
b. We reverse the sign after subtraction so that a negative BIC indicates an improvement in fit.

Because the probability of L² = 43.14, with 2 d.f., is less than .0000, we conclude that positing an association between political integration and stratification significantly improves the fit of the model. Similarly, the difference in BIC tells us that the second model is much more likely than the first, given the data (although neither is as likely as the saturated model because both BICs are positive).

We can get a quantitative estimate of the extent of improvement in the fit of a model from the two remaining sets of coefficients. From the ratio of the L²s, we see that positing an association between the degree of stratification and the level of political integration reduces the lack of fit of the model to the data by about half relative to the baseline model of complete independence among the three variables.

Finally, from the rightmost column of Table 12.3 we note that the model of complete independence misclassifies about 42 percent of the cases in the table (that is, 42 percent of the cases would have to shift categories for the expected distribution to be identical to the observed distribution; recall the discussion of the Index of Dissimilarity in Chapter Three), whereas the second model misclassifies only 30 percent of the cases.

Because model [T][PS] does not fit the data well, we would evaluate still other models, searching for the most parsimonious model that does fit well. Table 12.3 shows goodness-of-fit statistics for eight models (all the logically possible models except the saturated model and the model that assumes all cells have the same frequency). Continuing to examine the coefficients in Table 12.3, we see that Model 7, [TS][PS], fits the data quite well. This model posits that both the level of technology and the level of political integration are associated with the degree of stratification but that the level of technology and the level of political integration are unrelated, net of their association with stratification. It misclassifies only about 5 percent of the cases in the table and also reduces the baseline L² by 97 percent (= 100 × (1.0 − 0.03)).

Although Model 8, which posits that each pair of variables is associated, provides an even better fit, it might be argued that it overfits the data. The penultimate model, [TS][PS], fits nearly as well and would be my choice as the final model on the ground of parsimony, especially because the improvement in fit between Model 7 and Model 8 is not significant (2.94 − 0.60 = 2.34; p = .126).

Note that the use of tests of significance in this context is the opposite of their usual role as a decision rule for rejecting a null hypothesis; here we want to decide whether to accept a null hypothesis, that is, a model. Accordingly, we would like to minimize Type II (β) error (the probability of accepting a null hypothesis when it is false) rather than Type I (α) error (the probability of rejecting a null hypothesis when it is true). Unfortunately, there is no direct way to do this, and so we must settle for a computation of Type I error. A useful rule of thumb is to accept a model if α is greater than 0.2. However, the larger the sample size, the smaller α will tend to be, so for very large samples we may wish to accept a model even when α is quite small. As we will see momentarily, BIC offers an alternative and more satisfactory method of model selection.
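The index of dissimilarity used above to describe how many cases a model "misclassifies" is simple to compute once observed and expected cell frequencies are in hand. The following is a minimal sketch: the function name `index_of_dissimilarity` is my own, and the four-cell counts are hypothetical, not the stratification data from Table 12.2.

```python
def index_of_dissimilarity(observed, expected):
    """Percentage of cases that would have to change cells for the
    expected distribution to match the observed distribution."""
    n_obs = sum(observed)
    n_exp = sum(expected)
    # Sum of absolute differences between the two sets of cell proportions.
    gap = sum(abs(o / n_obs - e / n_exp) for o, e in zip(observed, expected))
    return 100.0 * gap / 2.0


# Hypothetical four-cell table, for illustration only.
observed = [10, 20, 30, 40]
expected = [20, 20, 20, 40]
delta = index_of_dissimilarity(observed, expected)
print(delta)
```

Dividing by two reflects the fact that every case moved out of one cell is moved into another, so each shift is counted twice in the sum of absolute differences.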
One additional coefficient is shown in Table 12.3: BIC, the Bayesian Information Criterion (Raftery 1986, 1995a, 1995b), which we first encountered in Chapter Six. Recall BIC's definition:

$$BIC = -2[\ln(B)]$$

where B is the ratio of the (unknown) probability of some model, M, being true to the (unknown) probability of the saturated model being true, given the data. For log-linear models, BIC is estimated by

$$BIC = L^2 - (\mathrm{d.f.})[\ln(N)]$$

where L² is the likelihood ratio χ² for Model M; d.f. is the residual degrees of freedom for Model M; and N is the number of cases in the table. When BIC is negative, Model M is preferred to the saturated model. When several models are compared, the model with the most negative BIC is most preferred because it has the greatest likelihood of being true given the data. Here, Model 7 is more likely than Model 8 given the data. Combining the information obtained from the L² and BIC contrasts of Models 7 and 8, we see that Model 7 is to be preferred.

The real value of BIC is in the comparison of models for very large samples because, when the sample is large, often no model (except, of course, the saturated model) fits the data by conventional standards. When that happens, BIC is of great use in helping us to choose among models. For this reason BIC has become the conventional method for assessing alternative models in log-linear analysis. An additional advantage of BIC, noted in Chapter Six, is that it can be used to compare nonnested models.

Theory-Based Model Selection

A second approach to model selection is to contrast models that represent alternative hypotheses about relationships among variables, that is, to do theory-driven rather than data-dredging model selection. For example, we might ask whether the association between the degree of stratification and the level of political integration can be explained by their mutual dependence on the level of technology. If the answer is yes, we would expect [TP][TS] to fit the data, because this model implies that the observed frequencies in the table are generated by an association between technology and political integration and an association between technology and stratification in the absence of an association between political integration and stratification. As we see in Table 12.3 (Model 5), this model does not fit the data, because L² = 21.88 with 4 d.f. (p < .000). Hence we reject the hypothesis.
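The BIC estimate for log-linear models, BIC = L² − (d.f.)·ln(N), can be reproduced from the L² and d.f. values quoted in the text. This is a small sketch rather than the author's Stata code; the function name `bic` is an assumption, and N = 92 is the number of societies in the table.

```python
import math


def bic(l2, df, n):
    """BIC estimate for a log-linear model: L2 minus d.f. times ln(N)."""
    return l2 - df * math.log(n)


N = 92  # societies in the cross-tabulation

bic5 = bic(21.88, 4, N)  # Model 5, [TP][TS]
bic7 = bic(2.94, 3, N)   # Model 7, [TS][PS]
bic8 = bic(0.60, 2, N)   # Model 8, [TP][TS][PS]

for label, value in (("Model 5", bic5), ("Model 7", bic7), ("Model 8", bic8)):
    print(label, round(value, 1))
```

Under these inputs Model 7 has the most negative BIC of the three, and the theory-based Model 5 has a positive BIC, which is consistent with rejecting it in favor of the saturated model.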
Effect Parameters

As is shown in Appendix 12.A, the parameters associated with the interaction terms in log-linear models (for example, the interaction term in Equation 12.1) can be interpreted to indicate the direction and strength of associations in cross-tabulation tables. Note, however, that the parameters for two-way interactions involving dichotomies are shown relative to geometric means of the expected frequencies. When more than two-way interactions or more than two categories are involved, the interpretation becomes more complex. Moreover, by default Stata uses a "dummy-variable" parameterization. When a dummy-variable parameterization is used, the parameters for two-way interactions give the odds ratios (or log odds) for the specified categories relative to the reference categories.

Because the effect parameters are not very straightforward, most analysts use log-linear analysis to test hypotheses about the presence or absence of particular associations (interactions) in the table but then discuss the table in terms of percentage differences, which are much more familiar to ordinary readers. This is particularly so when the software used to estimate the models shows the parameters in dummy variable form (that is, as deviations from an omitted category that is set to 0 in the log form or to 1.0 in the multiplicative form), because coefficients expressed in dummy variable form are difficult to interpret in the log-linear context.

My recommendation is that you use log-linear modeling when you want to test specific hypotheses about relationships in cross-tabulation tables, because it is an extremely powerful tool for doing this job. However, once you settle on a preferred
Table 12.4. Percentage Distribution of Expected Level of Stratification by Level of Political Integration and Level of Technology, in Ninety-Two Societies (Expected Frequencies from Model 7).
State
Metalworking at Least Stateless
State
Egalitarian
78.1
33.2
55.
Statusdistinctions only
20.5
31.2
46.2
15.7
1.4
29.6
20.7
41.7
Total
100.0
100.0
100.0
100.0
N
(30.6)
(1s.4)
{12.4)
(33.6
Two or more classes
model, I suggest you interpret either the observed distribution or the expected distribution implied by the model. The point of percentaging the expected rather than the observed frequencies is that unsystematic variability is removed; however, you should be sensitive to the possibility that deviations of observed from expected cell frequencies may reflect relationships not adequately captured by the model.

Table 12.4 shows the percentage distribution of level of stratification by level of political integration and level of technology implied by Model 7, which posits an association between the level of technology and the degree of stratification and between the level of political integration and the degree of stratification but not between the level of technology and the level of political integration. Because the model fits well, the distribution of expected percentages closely parallels what we would have found had we percentaged Table 12.2. As we see, within levels of technology, state societies tend to have more complex stratification systems than stateless societies and, within levels of political integration, societies with metalworking technology tend to have more complex stratification systems than societies lacking metalworking technology. (One limitation of this approach is that the marginal frequencies in the expected table generally do not match those in the corresponding table of observations. For a method that recovers the marginal distributions, see Kaufman and Schervish [1986].)
Another Worked Example: Anticommunist Sentiment

The optimal way to carry out log-linear analysis using Stata is to use the -glm- (generalized linear model) command, which permits the estimation of a wide variety of linear models. Indeed, as should be evident from Equation 12.2, log-linear analysis is just a specific case of the familiar linear model, in which the dependent variable is the natural log of the number of cases in a cell of a multiway cross-tabulation and the independent
Table 12.5. Frequency Distribution of Whether "A Communist Should Be Allowed to Speak in Your Community" by Schooling, Region, and Age, U.S. Adults, 1977 (N = 1,478).

                                                  Communist Speaker (C)
Age (A)           Region (R)    Schooling (S)     Allow     Not Allow
39 and younger    South         No college          72         71
                                College             55         22
                  Non-South     No college         161         92
                                College            157         25
40 and older      South         No college          65        162
                                College             23         23
                  Non-South     No college         197        214
                                College            107         32
variables are dummy variables for the categories that make up the variables included in the cross-tabulation. Although a user-written Stata ado file (Judson 1992, 1993) can be used to do hierarchical log-linear analysis, the advantage of using -glm- is twofold: it uses the linear model framework, and all of the Stata post-estimation commands are available. To show how to carry out log-linear analysis using the -glm- command, I analyze Table 10 from Knoke and Burke (1980); a comparison of my results with theirs may provide additional insight.

Suppose we are interested in the relationship between age (thirty-nine and younger versus forty and older), region of residence (South versus non-South), schooling (some college versus high school or less), and tolerance for civil liberties, as measured by a question on whether a communist should be allowed to give a speech in your community. The four-way frequency distribution of these variables, based on data from the 1977 GSS, is shown in Table 12.5.

Analysis Strategy. The first step in carrying out a log-linear analysis of Table 12.5 is to specify a baseline model. Because my interest is in the effect of age, region, and schooling on tolerance of communists, a reasonable baseline model is [C][ARS]. That is, I fit the three-variable relationship among age, region, and schooling exactly, but I assume that none of these variables is related to tolerance of communist speakers. As a second step, I posit [CA][CR][CS][ARS]. That is, I continue to fit the three-variable relationship exactly and in addition posit effects of each of the independent variables on tolerance of communist speakers ("interactions" between each of age, region, and schooling, respectively, and tolerance of communist speakers). If my second model yields a good fit, I then try to simplify the model by omitting specific two-variable interactions. If my second model does not yield a good fit, I explore more complicated models by fitting various three-variable interactions involving tolerance of communists plus pairs of the independent variables.

Implementation. To carry out the analysis using -glm- in Stata, I first read in the contents of Table 12.5 as a data set, where each cell is an observation and the variables are the response categories for each variable plus an additional variable that gives the count in each cell. I thus create a data set, call it "knoke.raw":

    1 1 1 1  72
    1 1 1 2  71
    1 1 2 1  55
    1 1 2 2  22
    1 2 1 1 161
    1 2 1 2  92
    1 2 2 1 157
    1 2 2 2  25
    2 1 1 1  65
    2 1 1 2 162
    2 1 2 1  23
    2 1 2 2  23
    2 2 1 1 197
    2 2 1 2 214
    2 2 2 1 107
    2 2 2 2  32

and then read the data into Stata with the following command:

    infile a r s c count using knoke.raw, clear
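The "one record per cell" layout described above can be mirrored outside Stata to check the data entry. The sketch below is an illustration in Python, not part of the author's workflow: the names `CELLS`, `total_cases`, and `percent_allowing` are hypothetical, and the counts and codes (1 = younger, South, no college, allow; 2 = older, non-South, college, not allow) are taken from Table 12.5 as I read it.

```python
# Each record is (age, region, schooling, allow, count), one per table cell.
CELLS = [
    (1, 1, 1, 1, 72), (1, 1, 1, 2, 71),
    (1, 1, 2, 1, 55), (1, 1, 2, 2, 22),
    (1, 2, 1, 1, 161), (1, 2, 1, 2, 92),
    (1, 2, 2, 1, 157), (1, 2, 2, 2, 25),
    (2, 1, 1, 1, 65), (2, 1, 1, 2, 162),
    (2, 1, 2, 1, 23), (2, 1, 2, 2, 23),
    (2, 2, 1, 1, 197), (2, 2, 1, 2, 214),
    (2, 2, 2, 1, 107), (2, 2, 2, 2, 32),
]


def total_cases(cells):
    """Number of respondents, as opposed to the number of cells."""
    return sum(record[-1] for record in cells)


def percent_allowing(cells, a, r, s):
    """Observed percentage allowing the speaker within one A-R-S group."""
    group = [rec for rec in cells if rec[:3] == (a, r, s)]
    allow = sum(rec[4] for rec in group if rec[3] == 1)
    return 100.0 * allow / sum(rec[4] for rec in group)


print(len(CELLS), total_cases(CELLS))
print(round(percent_allowing(CELLS, 1, 2, 2), 1))
```

The distinction between the number of records (cells) and the number of cases matters for any statistic that depends on N, such as BIC.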
Recall that the baseline model [C][ARS] is a shorthand way of representing the model [C][A][R][S][AR][AS][RS][ARS]. Thus, I need to specify each of the terms in the model. Because the Stata command to create product terms for categorical variables, -xi-, does not permit more than two-way products, I take advantage of a user-written command, -desmat- (Hendrickx 1999, 2000, 2001a, 2001b), to specify the required variables. (See the downloadable files "ch12_1.do" and "ch12_1.log" for details.) Also, because -glm- does not provide all the coefficients shown in Table 12.3 and produces an incorrect estimate of BIC (given the way I have specified the problem, -glm- counts as cases the number of cells in the table rather than the number of people in the sample), I have
written a small -do- file (for "goodness of fit"), along with a terse version, to compute these coefficients; both are included in the downloadable files for this chapter.

As with every other Stata estimation command, -glm- works with options; because it can handle many kinds of linear models, I must specify which model I want. The cell counts are treated as following a Poisson distribution, so I specify the Poisson family for the dependent variable. Specifying the Poisson family implies a log-linear model.

The first -glm- command generates the coefficients shown in the first line of Table 12.6 (Model 1). I then repeat the process for a model that allows tolerance of communist speakers to be related to age, region, and schooling but not to interactions among them; these commands generate the coefficients shown on the bottom line of Table 12.6 (Model 8). Clearly, this model fits the data well by all criteria, indeed so well as to suggest that a simpler model might fit the data. To test this, I estimate a series of simpler models, which yields the remaining coefficients in the table. Inspecting these statistics, we see that none of these models fits the data adequately. Thus, I settle on [ARS][AC][RC][SC] as my preferred model. Evidently, age, region,
Table 12.6. Goodness-of-Fit Statistics for Log-Linear Models of the Association Between Whether "A Communist Should Be Allowed to Speak in Your Community," Age, Region, and Education, U.S. Adults, 1977.
"a"f,r, Lz
d.f.
BIC
L'ILtr
A
N 1 r:;q r t
15_1 :]:;S]IAC]
o/t
1
.69
10.7
a_2
B :_:,sllRcl -c::lll(at
87.75
6
.000
44.0
.44
;- --\)lrALt'KLt
84.j2
5
.000
48.2
.42
a : ; s l l Acl [5 C]
48.69
5
.000
12.2
44.74
5
.oo0
4.2
.22
1.7
re.:lslJAcl[RC][sC] 2.92
4
.571
_26.3
01
1.5
r .:xsllRc][sc]
7.8
Table 12.7. Expected Percentage (from Model 8) Agreeing That "A Communist Should Be Allowed to Speak in Your Community" by Education, Age, and Region, U.S. Adults, 1977.
Age
Region
No College
College
::;;; : :::ri:I;::;: t.::.:..
::t*t*i (253)
:::
(182) 567 (46',)
47.8 (411)
74.3
(13s)
Note: Cell frequencies are shown in parentheses.
and education all affect attitudes regarding communist speakers. To see what these effects are, I percentage the table of frequencies predicted by this model. (To see how to do this, consult the downloadable files "ch12_1.do" and "ch12_1.log.") These percentages are shown in Table 12.7. The table clearly shows that, controlling for each of the other factors, those who are better educated, younger, and non-Southern are more likely to support the right of a communist to give a speech. In each comparison, the percentage differences always go in the same direction and are quite substantial.

The attitudes reported here are from thirty years ago, during the height of the Cold War. It would be of interest to determine whether the same pattern holds today. To do this within a log-linear framework you would need to construct a second data set, based on recent data (for example, the 2006 GSS), to append the second data set to the first, to add an additional variable (T for "time"), and then to assess whether it is necessary to posit an effect of time (or of interactions between time and any of the two-variable associations) to adequately represent the data. That is, you would estimate [ARS][AC][RC][SC], [ARS][AC][RC][SC][T], and [ARS][ACT][RCT][SCT], and perhaps some intermediate models, and compare their goodness of fit. If none of the more elaborate models produced a better fit than [ARS][AC][RC][SC], which is just Model 8 replicated for the pooled data, you would conclude that attitudes regarding the rights of communists have not changed between 1977 and 2006. If [ARS][AC][RC][SC][T] emerged as the preferred model, you would conclude that there had been an across-the-board change (presumably an increase) in support for the civil liberties of communists. If [ARS][ACT][RCT][SCT] emerged as the preferred model, you would conclude that the structure of the relationships between age, region, and education, respectively, and support for the civil rights of communists changed between
1977 and 200 ::r;tude that the wourd strir;,#;:xl;, Tffi::ffi1T:"#ffiffi,:f,Tr""ffii;Jou
Doing Log-Linear Analysis with Polytomous Variables
orcommunists, ff:i",''**r* ;rn$.#*l ri'hts ffjiifi :'"rTtr{|.il".Hffi ;#?:;'"":T"i"H1?ffi
L";;Tffi;[*i:ff;;#:1rr".'o*"u".i',rfi
::
:.*fi ,",ilir:?ffi ilHi"rHt Fir:*,##fifu #:ir;ilffi -rnc ;"-r"^n* il..'.-l;fiiliiiiiJ;'l.l#::y lllociatign if':T'ili"Jll;
race, and membership are dichotomous variables, but education is a three-category variable. This requires that we create two dummy variables: s2 (= 1 for high school graduates; = 0 otherwise) and s3 (= 1 for those with at least some college; = 0 otherwise), with those lacking high school graduation as the omitted category. Suppose we are interested in a model in which race,
educarion, and exampr"-'" aono,"u." *3:;itll,"lt;fff';""t::?.iJ$il- theprevious abour the membership re-.cr ion amongth'e;;,
and hence
.*v,nrrysrrv_i,r j-d;ffi ",|:jiilffi ',}iil.il*1fi'X"rl'":",1il11#*"# ;;;,1:,", I .:,nd
I fit this model with the -glm- command:

    glm count r s2 s3 m rs2 rs3 rm s2m s3m rs2m rs3m v vr vs2 vs3 vm, family(poisson)

where each of the compound variables is a product term, for example, rs2 = r*s2. (See the downloadable files "ch12_1.do" and "ch12_1.log" for details.)
r'o,i;,,""r,jl ".,;.,',o."o"" ffij'l*#:;:il;l"fiii;;:?" too"p'"rulv rur provides u.i"rii"i".l *"r,i"i"1,*,?J:Tffii'_'#lJil'jffi""#:i'*1. -. t9 Log-Linear Analysis with tndividual_Level Data
,r:'T,lixlf ffi-;;ffi ililtff il#"#$:i:'":1tti*i,:ilffi t; .!@r lisstanins rrom ;,;;ff;#Ti{*"*
*ljH ;:l$fff,"il1'nJ"_:y, ::,"::. ur@:nd.(Downtoadable file,,chlr_t.do,, shows ,rr"J"iarrl*""irJ"':in,r_r.,on.,,., MONIOUSMODETS r.:r we lave dealtwith _"0"r. *1|:r.r;r-i
global associarion. or absenceof global
n"_.,.".._.,,i.1,d :-;,i";,ffj;1i..?3i**1,":r arhypotheses rike totesr regarding ,n" ":1,:r",.brt.n, ,,"_,",i"'_"lt :er tables can be described ,7,.rii[ll?J,l,,llili"l5;,,::] by relatlvely simple models that generate the observed
Table 12.8. Frequency Distribution of Voting by Race, Education, and Voluntary Association Membership.
Didn't Vote
Race
n2 1. t
White
6
12
18 6C
Oneor more
24
lC
Source: Adapted from Knoke and Burke (1980, Table 3).
pattern of frequencies in the table. The development of such models to describe patterns of intergenerational occupational mobility has been a lively enterprise over the past thirty years or so, but the formal models developed in this context have applications far beyond the study of social mobility (for example, Radelet and Pierce 1985; Schwartz and Mare 2005; Roberts and Chick 2007; Domanski 2008). Still, it is convenient to illustrate these models in the context of mobility analysis. (See downloadable files "ch12_2.do" and "ch12_2.log" for details on the Stata procedures used to estimate the models in the remainder of the chapter.)

It is helpful to begin by deriving a general expression for log odds ratios. Recall Equation 12.4, which gives the natural log of expected frequencies for a two-variable
table as a function of a set of μ parameters. From Equation 12.4 we can write an expression for the log odds ratio of the expected frequencies for cells formed from any pair of rows (i and i') and columns (j and j') in a two-variable table:

$$
\begin{aligned}
\log \theta &= \log\frac{F_{ij}/F_{ij'}}{F_{i'j}/F_{i'j'}} = \log\frac{F_{ij}F_{i'j'}}{F_{ij'}F_{i'j}} \\
&= \log F_{ij} + \log F_{i'j'} - \log F_{ij'} - \log F_{i'j} \\
&= (\mu + \mu_i^{R} + \mu_j^{C} + \mu_{ij}^{RC}) + (\mu + \mu_{i'}^{R} + \mu_{j'}^{C} + \mu_{i'j'}^{RC}) \\
&\quad - (\mu + \mu_i^{R} + \mu_{j'}^{C} + \mu_{ij'}^{RC}) - (\mu + \mu_{i'}^{R} + \mu_j^{C} + \mu_{i'j}^{RC}) \\
&= \mu_{ij}^{RC} + \mu_{i'j'}^{RC} - \mu_{ij'}^{RC} - \mu_{i'j}^{RC}
\end{aligned}
\qquad (12.9)
$$

When dummy-variable coding is used, as in Stata's -glm- command, and i' and j' are the reference categories, the right side of Equation 12.9 simplifies to $\mu_{ij}^{RC}$, which makes clear that the interaction parameters represent the log odds ratios for each cell relative to the omitted categories (ordinarily the first row and first column).

Note that to uniquely identify the coefficients, it is necessary to impose constraints. Two different constraints, or "normalizations," are typically used. One is effect coding (used in Equation 12.6 and Appendix 12.A), which expresses coefficients as deviations from the grand total by requiring that the log-form coefficients for each variable sum to zero. The other constraint, dummy-variable coding, codes one category (in Stata, the first category) of each variable as zero.

In the fully saturated model there is a unique coefficient for each cell of the table except, with dummy-variable coding, the cells in the first row and first column. This model can be represented by the following design matrix (for a seven-by-seven table):

    1  1  1  1  1  1  1
    1  2  3  4  5  6  7
    1  8  9 10 11 12 13
    1 14 15 16 17 18 19
    1 20 21 22 23 24 25
    1 26 27 28 29 30 31
    1 32 33 34 35 36 37     = full_dm
Note that a design matrix is simply a variable, with one value per cell, that imposes equality constraints on some subset of cells: all cells with the same value are constrained to have equal coefficients. This design matrix specifies that all the coefficients for the first row and first column are equal; in fact, they are (implicitly) zero by virtue of the dummy-variable coding. None of the remaining coefficients is constrained to be equal. This model uses all the available information, and the observed counts in each table are fit exactly.

Note that in Stata's -glm- command, the specification

    xi: glm count i.X i.Y i.full_dm, family(poisson)
ffi lf ,]i,;f;.';::T.T:rrrT"ibutionoroccuparionbvFatherr Respondentt
Prof. cadre cler. iJes
in 1996 Ser.
117
Man.
810
Agric.
2,765
produces results identical to the usual way of specifying the saturated model:

    xi: glm count i.X*i.Y, family(poisson)

That is, -glm- creates a design matrix like that of "full_dm" when the interaction is specified.
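Because a design matrix is just one code per cell, it can be generated mechanically. The sketch below is a hypothetical helper (`design_matrix` is my own name, not a Stata or textbook routine) that reproduces the numbering convention of the seven-by-seven "full_dm": code 1 for every cell in the first row or first column, and consecutive codes, row by row, for the remaining cells. As an extension of my own, categories given the same label receive identical codes, which is one way to express an equality constraint between two categories.

```python
def design_matrix(labels):
    """Build a square design matrix from 1-based category labels.

    Cells in the reference row or column (label 1) get code 1; all other
    cells get a distinct code per pair of labels, numbered row by row.
    """
    n_labels = max(labels)
    matrix = []
    for row_label in labels:
        row = []
        for col_label in labels:
            if row_label == 1 or col_label == 1:
                row.append(1)
            else:
                row.append(1 + (row_label - 2) * (n_labels - 1) + (col_label - 1))
        matrix.append(row)
    return matrix


full_dm = design_matrix([1, 2, 3, 4, 5, 6, 7])
for row in full_dm:
    print(" ".join(f"{code:2d}" for code in row))

# Hypothetical variant: the fourth and fifth categories share a label,
# so their rows and columns receive identical codes.
collapsed = design_matrix([1, 2, 3, 4, 4, 5, 6])
```

The number of distinct codes other than 1 equals the number of interaction parameters the model estimates, so comparing that count across design matrices gives the difference in degrees of freedom.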
,,iil:,T::: yifi:r:il::;i{_j_", ffi a; :.{,i# fi :T}j,T,,: fi *o "?, ^ oo.','lnt"o',"" ; il:::"'ffi':J oio.no;^ ;"i*t:'ff o,. il::::i"t.," women ^i,,.
*.** women roincrease ro I havepooledmen increase ,f,. rhe ."ln","""#, *,0r. L?l juffi;":,fii:"jllf,iJffi:::,..::.# ,TLifll. ,reparatety. s,.iar
two-.wayrabre:tharrr,"*"."_*uuini""ffi ll,: ;'i;3#:r::,il1#:',ily?":."j:::::l':: orathrce-way ,ab,e i* -tt"!:ibrTiiy f::i:;';?f :i,THtr ii!:X":i;yffi j i.'io8i,i '*"iJ:,ffi r,i'^." ro,esr "1,::::1r:-1 i",r-, the
nrsr condi*".,.il;, ff 1 - ,#;.1ff;,ji;liir,l;a,. ""il1,J:fi ttri:L'Ttli:l il;;ffi ,fr ner o:h
ilflruffiff
,''r.,",,"iJ,.l and women {rhr ?olr,"y,oun.,"" G
nand R=;;;"oJ;l'":#ffi f :,nee .il"l.:H:nm: ;# l':"i?:.rufi ffi,rffi ,:i,ri;'f lffi*,l,,TtH1ffi
.#.iiff +*:f 11X11*"
is only marginally significant. Given the relatively large size of the sample, I am inclined to focus on the BIC rather than the p-value and conclude that the first condition is satisfied.

To test the second condition, I contrast a model (call this Model B) that omits the interaction between sex and father's occupation, that is, [SR][FR], against Model A. The substantive argument for this is that in China, where almost all women are in the labor force, we should expect no difference in the distribution of father's occupation for employed men and women. To contrast the two models, I take the difference in the L² and the difference in the degrees of freedom to get the p-value for the improvement of goodness of fit resulting from the addition of [SF] and also get the difference in BIC values. Although the fit of Model A is significantly better by classical standards (p = .019; L²(B) − L²(A) = 67.18 − 52.03 = 15.15; d.f.(B) − d.f.(A) = 42 − 36 = 6), Model B is more likely given the data (BIC(B) − BIC(A) = −285.9 − [−250.6] = −35.3). Again, I am inclined to put more weight on the BIC difference and conclude that the second condition is satisfied. Thus I am willing to pool men and women for the subsequent analysis, which effectively doubles the sample size.

Table 12.10 shows the coefficients for the saturated model (see "ch12_2.do" to see how these coefficients were computed using Stata). As we have seen, these coefficients are not readily interpreted directly. However, in the present case it may be of interest to contrast particular cells in the table. For example, we might ask about the relative chances of the child of an agricultural worker becoming an agricultural worker instead of a manual worker compared to the corresponding odds for the child of a manual worker. From Equation 12.9 it is evident that the log odds ratio can be computed as
$$\log \theta = \mu_{77}^{RC} + \mu_{66}^{RC} - \mu_{76}^{RC} - \mu_{67}^{RC} = 2.756 + 1.567 - 1.088 - .801 = 2.434 \qquad (12.10)$$

which implies that the relative odds are 11.40 (= $e^{2.434}$); that is, the children of agricultural workers are more than eleven times more likely to become agricultural workers themselves, rather than becoming manual workers, than are the children of manual workers. Similarly, the odds that the child of a professional will become a professional instead of becoming a cadre, compared to the corresponding odds for a child of a cadre, are

$$\log \theta = \mu_{11}^{RC} + \mu_{22}^{RC} - \mu_{12}^{RC} - \mu_{21}^{RC} = 0 + .627 - 0 - 0 = .627 \qquad (12.11)$$

which implies that the relative odds are 1.87 (= $e^{0.627}$). Clearly, in China (as elsewhere) the "inheritance" of farm occupations relative to inflow from the children of manual workers is much stronger than the inheritance of professional occupations relative to inflow from the children of cadres.
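The two odds ratios just discussed follow directly from Equation 12.9: add the two "same-category" interaction parameters, subtract the two "cross-category" ones, and exponentiate. The sketch below illustrates the arithmetic; `log_odds_ratio` is a hypothetical helper name, and the fourth agricultural/manual coefficient is taken as .801, the value implied by the stated total of 2.434 (an assumption on my part where the printed digit is unclear).

```python
import math


def log_odds_ratio(mu_ij, mu_i2j2, mu_ij2, mu_i2j):
    """Equation 12.9: mu_ij + mu_i'j' - mu_ij' - mu_i'j."""
    return mu_ij + mu_i2j2 - mu_ij2 - mu_i2j


# Agricultural versus manual origins and destinations (Equation 12.10).
farm = log_odds_ratio(2.756, 1.567, 1.088, 0.801)

# Professional versus cadre (Equation 12.11); cells in the first row and
# first column are zero under dummy-variable coding.
prof = log_odds_ratio(0.0, 0.627, 0.0, 0.0)

print(round(farm, 3), round(math.exp(farm), 2))
print(round(prof, 3), round(math.exp(prof), 2))
```

When one of the two categories being contrasted is the reference category, three of the four terms vanish and the log odds ratio is just the single remaining interaction parameter, as in the professional/cadre case.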
Topological, or Levels, Models

Having shown how to interpret the interaction coefficients, I next address whether the table can be simplified. In particular, given the lack of differentiation between sales and
Table 12.10. Interaction Parameters for the Saturated Model Applied to Table 12.9.
Father's Occupation When R Age 14
Prof.
Cadre
Respondent,sOccupation in 1996 Clex Sales Ser.
1.213
-0.169
0.054 -15qq1
1.489
6. Manual workers
1.595
Man.
Agric.
,0.100
-o.341
0.384
-0.058 0.607
service workers in the Chinese economy, I suspect that these two categories might reasonably be collapsed. The following design matrix constrains the interaction parameters involving these two categories to be equal:

    1  1  1  1  1  1  1
    1  2  3  4  4  5  6
    1  7  8  9  9 10 11
    1 12 13 14 14 15 16
    1 12 13 14 14 15 16
    1 17 18 19 19 20 21
    1 22 23 24 24 25 26     = ss_dm
16.06,.with 11dr-because oryrwenty-nve orrhe
ffi:nT:'#ffif;;"::ii:i.,r
d^:53),';;'";ffi ;:ni_",,,1T#il;";#:lt3i;fi *:3,:l?;"J,"_? asaseven-bv-seven rable. r,il.;;';il;e
ffiT;:$:i^lell
*"t*t;H#;l':f
subsequent anarysis
ff i:::::.:-"-:llTpyr:*'voushourdkeepinmindthis you are trying to decide .'"t
"t,Tfl E-egones ofa tabl". Th" o.o".d.tn"never
to
; J'#':"':".?1T"':::*ffi :"3'lTii: Tf,:Ti:,'#"'fi,'ff:* SliT;i:;" "tt"t
"otffi
ceilsofa tabre ashaving *#?Ji,,tix1?ill;:}f,ffj|t:iu":.:f panicurar identicar
m.*".pL.,',""ir"#;'il;?,1'r"r.j:;:ff :.T:lTH,:;,,[:n".J*r""ir-"]ir, Qnsi-lndependenceModels
*1'#::i
;rT:T;:,ftlf#j
if georr.e areabre tofreethemserves fromthesociar
u:n*::ll.* *lhlt""lxiliiT:""Ff,1,;:i*t..';i:t1:?fJ:ffi
(onthe hpothesis couuo,"a .i*_oulJ*.,-11!l-il'o."ifi [Uffi:Hffi:1""}.;fffi1ff:
{egonar ce's of thetablebut otherwrse fbrcesa'interaction parameters to be
identical:
    2 1 1 1 1 1
    1 3 1 1 1 1
    1 1 4 1 1 1
    1 1 1 5 1 1
    1 1 1 1 6 1
    1 1 1 1 1 7     = diag_dm
As we can see from the first row of the second panel of Table 12.11, this model is a huge improvement over the independence model, which is the baseline model in Table 12.11. Although it does not fit by classical standards, it is more likely than the saturated model and misclassifies about 2 percent of the cases. Still, other models might fit even better.
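The quasi-independence design matrix has a very regular structure: every diagonal cell gets its own code, and all off-diagonal cells share code 1. A minimal sketch, assuming the numbering convention of "diag_dm" (codes 2 through 7 down the diagonal of the six-by-six table); the function name `quasi_independence_design` is hypothetical.

```python
def quasi_independence_design(k):
    """k-by-k design matrix: a unique code per diagonal cell, 1 elsewhere."""
    return [[i + 2 if i == j else 1 for j in range(k)] for i in range(k)]


diag_dm = quasi_independence_design(6)
for row in diag_dm:
    print(" ".join(str(code) for code in row))
```

Since all off-diagonal cells share one code, the model fits each diagonal cell exactly while forcing independence among the cases that change categories.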
Table 12.11. Goodness-of-Fit Statistics for Alternative Models of Intergenerational Occupational Mobility in China (Six-by-Six Table).
.
B,c
L'?lLl
.000
869
1.000
': fi
000
-109
054
-:i
nj 2
.58/
' :w '
L'
d.f.
p
1080
25 20
58.8 oJ+
14
451
24
.000
249
.418
urban hukou Line!r-by-linear,
157
24
000
-45.2
145
:-g
Linear-by-linear, lSElr urbanhukou
150
23
.000
* 43.8
.139
I !i
324
16
.000
190
,300
^^n
14
Row_andcolumne{fectsll (RC)
^^n
:
6
.098
- 106
.050
,;
.020
'_:
-117
.432
' :
- 11a
.031
':
- 14
Diagonalcellsfitted exactly
.000
Quasi-iridependence
.016
-62.4
Quasi-symmetry
21.a
10
Crossinqs
)o I
16
n)t
Uniformassociation
34.5
18
.011
l)tl Llnear-oy-lrnear,
33.7
18
A1A
urbanhukou Linear-by-linear,
37.2
18
.005
-114
.034
.Linear-bylinear, t) + uroannuKou
33.7
17
n6q
- 10q
,031
Row-and-columneffectsI
10.3
10
.415
9.0
10
Row and columneffectsll (Rc)
1nq
73.4
_ ' a
Quasi-Symmetry Models

An important issue in social mobility research is whether, net of any shift in the marginals, the relative odds of upward and downward mobility between corresponding categories are symmetrical. The following design matrix specifies this model for the six-by-six table:

    1  1  1  1  1  1
    1  3  8  9 10 11
    1  8  4 12 13 14
    1  9 12  5 15 16
    1 10 13 15  6 17
    1 11 14 16 17  7     = qs_dm

As we see in Table 12.11, this model fits slightly better than the quasi-independence model by the likelihood ratio standard but not nearly so well by the BIC standard.
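A symmetric design matrix of this kind can be generated rather than typed. The sketch below is an illustration under an assumed coding convention (my reading of the matrix, not a documented routine): the first row and column are the reference cells with code 1, the remaining diagonal cells get codes 3 through k + 1, and each pair of cells (i, j) and (j, i) shares one code, numbered from k + 2 onward. The function name is hypothetical.

```python
def quasi_symmetry_design(k):
    """k-by-k design matrix in which cells (i, j) and (j, i) share a code."""
    matrix = [[1] * k for _ in range(k)]
    for i in range(1, k):
        matrix[i][i] = i + 2  # own code for each non-reference diagonal cell
    code = k + 2
    for i in range(1, k):
        for j in range(i + 1, k):
            matrix[i][j] = code  # upward move
            matrix[j][i] = code  # corresponding downward move
            code += 1
    return matrix


qs = quasi_symmetry_design(6)
for row in qs:
    print(" ".join(f"{c:2d}" for c in row))
```

For a six-by-six table this yields fifteen distinct codes other than 1 (five diagonal cells plus ten symmetric pairs), compared with twenty-five in the saturated model.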
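Crossings Models

Suppose we were to take the occupational categories in our six-by-six table as representing social classes, with boundaries that constitute barriers to mobility. Suppose further that, in an analogy to movement across physical space, it is necessary to "cross" each barrier between adjacent classes to achieve mobility between nonadjacent classes. We can represent this model (following Powers and Xie 2000, 117) as

$$F_{ij} = \eta\,\tau_i^{R}\,\tau_j^{C}\,\nu_{ij}^{RC} \qquad (12.12)$$

where

$$
\nu_{ij}^{RC} =
\begin{cases}
\prod_{u=j}^{i-1} \nu_u & \text{for } i > j \\
\prod_{u=i}^{j-1} \nu_u & \text{for } i < j \\
\delta_i & \text{for } i = j
\end{cases}
$$

This specification implies the following interaction parameters for the cells of the six-by-six table (with the diagonal cells fitted exactly):

$$
\begin{pmatrix}
\delta_1 & \nu_1 & \nu_1\nu_2 & \nu_1\nu_2\nu_3 & \nu_1\nu_2\nu_3\nu_4 & \nu_1\nu_2\nu_3\nu_4\nu_5 \\
\nu_1 & \delta_2 & \nu_2 & \nu_2\nu_3 & \nu_2\nu_3\nu_4 & \nu_2\nu_3\nu_4\nu_5 \\
\nu_1\nu_2 & \nu_2 & \delta_3 & \nu_3 & \nu_3\nu_4 & \nu_3\nu_4\nu_5 \\
\nu_1\nu_2\nu_3 & \nu_2\nu_3 & \nu_3 & \delta_4 & \nu_4 & \nu_4\nu_5 \\
\nu_1\nu_2\nu_3\nu_4 & \nu_2\nu_3\nu_4 & \nu_3\nu_4 & \nu_4 & \delta_5 & \nu_5 \\
\nu_1\nu_2\nu_3\nu_4\nu_5 & \nu_2\nu_3\nu_4\nu_5 & \nu_3\nu_4\nu_5 & \nu_4\nu_5 & \nu_5 & \delta_6
\end{pmatrix}
$$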
These parameters can be estimated by summing six design matrices, one for each crossings parameter plus one for the diagonal design matrix (diag_dm), and taking antilogs. The corresponding model that does not fit the diagonal exactly is estimated the same way except that the diagonal design matrix is omitted. Here are the five design matrices for the crossings parameters:

    cr1_dm         cr2_dm         cr3_dm         cr4_dm         cr5_dm
    0 1 1 1 1 1    0 0 1 1 1 1    0 0 0 1 1 1    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    0 0 1 1 1 1    0 0 0 1 1 1    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    0 0 0 1 1 1    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    1 1 1 0 0 0    0 0 0 0 1 1    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    1 1 1 0 0 0    1 1 1 1 0 0    0 0 0 0 0 1
    1 0 0 0 0 0    1 1 0 0 0 0    1 1 1 0 0 0    1 1 1 1 0 0    1 1 1 1 1 0
Ij*
rfr @4
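The pattern in the five crossings design matrices can be generated mechanically. The following is a minimal sketch in Python, not the author's procedure: the function names `crossing_design_matrix` and `barriers_crossed` are hypothetical, and categories are assumed to be numbered 1 through 6 in the order of the table. A cell is marked 1 in the kth matrix when origin and destination lie on opposite sides of the barrier between categories k and k + 1.

```python
def crossing_design_matrix(k, n=6):
    """0/1 matrix marking the cells that cross the barrier between
    category k and category k + 1 (categories numbered from 1)."""
    return [[1 if (i <= k < j) or (j <= k < i) else 0
             for j in range(1, n + 1)]
            for i in range(1, n + 1)]


def barriers_crossed(i, j, n=6):
    """Number of barriers between origin i and destination j,
    obtained by summing the n - 1 crossings design matrices."""
    return sum(crossing_design_matrix(k, n)[i - 1][j - 1]
               for k in range(1, n))


# Print the first design matrix in the same row format used in the text.
for row in crossing_design_matrix(1):
    print("".join(str(cell) for cell in row))
```

Summing the matrices cell by cell counts the barriers crossed, which is why the interaction for a move between nonadjacent categories is the product of the intervening crossings parameters.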
As we see in Table 12.11, the crossings model fits better than any of the other models we have reviewed so far. Interestingly, fitting the diagonal cells exactly degrades the fit slightly by the BIC standard, presumably because the tendency to remain on the diagonal is captured well by the crossings parameters, and the additional degrees of freedom used by fitting the diagonal exactly are penalized by BIC. The crossings parameters for the simpler crossings model are

    ν1      ν2      ν3      ν4      ν5
    -0.138  0.002   -0.203  -0.228  -1.033

Clearly, by far the most difficult transition (crossing) is between farm and nonfarm occupations (specifically, manual occupations); this is true everywhere, and China is no exception. Interestingly, the least difficult transition is between cadre and clerical occupations. Again, this is no particular surprise, because in China no sharp distinction is made between clerical and administrative tasks and the best and the brightest of the clerical staff are often tapped to become cadres. The known intragenerational mobility pattern may well carry over to intergenerational mobility, with clerical positions seen as reasonable starting points for the children of administrative cadres and cadre positions as attainable upward mobility goals for the children of clerical workers. Finally, this result could be due to the fact that the analysis here combines males and females; it could well be that the daughters of cadres disproportionately tend to become clerical workers.

Uniform Association Models
When the categories of a table are ordered, it is possible to estimate more parsimonious models than are available for nominal categories. The simplest such model assumes that the difference between each pair of adjacent categories is equal, so that the scale for each variable can be represented by consecutive integers. That is,
$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + \beta i j \tag{12.13}$$

where the strength of the association between the row level and the column level is measured by β. From this it follows that the log odds ratio between row categories i and i′, and column categories j and j′, is just

$$\log \theta = \beta (i - i')(j - j') \tag{12.14}$$
Table 12.11 shows goodness-of-fit statistics for the uniform association model with and without the main diagonal fitted exactly. As you can see, when the diagonal cells are not fit exactly, the uniform association model fits very badly. The reason for this is simple: people disproportionately tend to remain in the same occupational category as their fathers. This tendency is captured by fitting the diagonal cells exactly. When the diagonal cells are estimated exactly, the uniform association model fits reasonably well. It yields β = .046. From Equation 12.14 we can see that this implies, for example, that the odds that the child of a professional will become a professional rather than a farmer are more than three times the corresponding odds for the child of a farmer: .046(1 - 6)(1 - 6) = 1.150; e^1.150 = 3.158. This is a rather low odds ratio, which is consistent with the general sense that intergenerational mobility in China is easier than in most other nations (see Wu and Treiman 2007).

Linear-by-Linear Association Models
Now suppose we have more information than simply a rank order of categories, for example, socioeconomic status scores. We can then estimate a linear-by-linear association model, where the scale scores are substituted for the category indexes. That is, instead of Equation 12.13, we have

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + \beta x_i y_j \tag{12.15}$$

with the log odds ratio given by

$$\log \theta = \beta (x_i - x_{i'})(y_j - y_{j'}) \tag{12.16}$$
Estimating this model for the Chinese data, with occupation categories scored by their mean occupational status (ISEI; see Ganzeboom, De Graaf, and Treiman 1992), we achieve a model that fits marginally better than the uniform association model, by the BIC criterion. For this model, β = .000483. Thus for the same categories as in the uniform association example, we have .000483(16.2 - 63.7)(16.2 - 63.7) = 1.090; e^1.090 = 2.974. We are hereby led to a qualitatively similar conclusion: the odds that the child of a professional will become a professional rather than a farmer are about three times as great as the corresponding odds for the child of a farmer.

Note that it is possible to include more than one scaling of the categories of a table to represent different concepts. Table 12.11 shows goodness-of-fit statistics for two additional linear-by-linear models, one of which scales occupations by the proportion of incumbents who have permanent urban registration (urban hukou) and the other of which uses both the ISEI and urban registration measures. As it happens, neither fits as well as the ISEI and uniform association models. However, if we wished to assess the log odds ratio using, say, the model that includes both measures, we would simply apply Equation 12.16 to both variables and compute the sum. (For a well-known application of this kind of model, see Hout 1984.)
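The two odds-ratio calculations above (uniform association and linear-by-linear association) follow the same arithmetic, which can be sketched in a few lines of Python. This is only an illustration: the function name `association_log_odds_ratio` is hypothetical, and the inputs are the coefficients and scores quoted in the text.

```python
import math


def association_log_odds_ratio(beta, x_i, x_i2, y_j, y_j2):
    """Log odds ratio beta * (x_i - x_i') * (y_j - y_j').
    With integer category indexes as scores this is the uniform
    association form; with substantive scores it is the
    linear-by-linear form."""
    return beta * (x_i - x_i2) * (y_j - y_j2)


# Uniform association: professionals are category 1, farmers category 6.
uniform = association_log_odds_ratio(0.046, 1, 6, 1, 6)

# Linear-by-linear: ISEI scores of 16.2 and 63.7 for the same two categories.
linear = association_log_odds_ratio(0.000483, 16.2, 63.7, 16.2, 63.7)

print(round(uniform, 3), round(math.exp(uniform), 3))
print(round(linear, 3), round(math.exp(linear), 3))
```

For a model with two scalings, such as ISEI plus urban hukou, the same function would be called once per scaling and the two log odds ratios summed before exponentiating.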
Row-Effects (and Column-Effects) Models
Sometimes we are confident that one variable can be scored with an integer scale (that is, that the difference between each pair of adjacent categories is the same) but we are uncertain about how to order the other variable. In such cases we can estimate the unknown scores. In this model the expected frequencies are given by

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + j\phi_i \tag{12.17}$$

where the j index the categories of one variable and the φ_i are the estimated scale scores for the other variable. The log odds ratio is given by

$$\log \theta = (\phi_i - \phi_{i'})(j - j') \tag{12.18}$$
As an example of a situation in which these conditions might hold, consider the relationship between size of place of origin and educational attainment, for the 1996 Chinese survey we have been using. Table 12.12 shows the bivariate frequency distribution for adults not currently attending school. In constructing this table, I have collapsed education so that the categories represent approximate three-year intervals in median schooling. The size-of-place categories are from the official administrative hierarchy of China, which strongly affects the flow of resources to places. Thus, in addition to the general advantage of urban residence for educational attainment (greater exposure to the written word and such), we would expect educational attainment to be greater for places high in the administrative hierarchy because such places are the beneficiaries of more resources from the central government.

TABLE 12.12. Frequency Distribution of Educational Attainment by Size of Place of Residence at Age Fourteen, Chinese Adults Not Enrolled in School. (Columns: level of schooling, in the categories None, Lower Primary, Upper Primary, Lower Middle, Upper Middle, Tertiary, and Total. Rows: size of place. The cell frequencies did not survive extraction.)

The row-effects model fits well (BIC = -135, Δ = 2.96) although not by classical inference (p < .000). But contrary to my expectation, the estimated scores for size of place of origin suggest a non-monotonic relationship to education. The estimates are

    Village                 0.00
    Town                    0.36
    County seat             0.74
    County-level city       0.86
    Prefecture-level city   0.73
    Provincial capital      1.01
    Province-level city     0.98

According to this model, people from county-level cities (medium cities) get somewhat more education than do people from prefecture-level cities, although it would be unwise to make too much of this because the confidence intervals overlap (the 95 percent confidence interval is 0.71 to 1.01 for county-level cities and 0.63 to 0.84 for prefecture-level cities).
Column-effects models are formally identical to row-effects models, but with the role of rows and columns reversed. A column-effects model of the relationship between size of place at age fourteen and educational attainment does not fit as well as the corresponding row-effects model (BIC = -108, Δ = 2.98, and p < .000), which suggests that the assumption of equal scale differences between adjacent size-of-place categories is probably incorrect. This is hardly surprising given the deviation from equal differences in the estimated coefficients for size-of-place categories in the row-effects model and, especially, the non-monotonicity of the scores relative to my a priori ordering.

Row-and-Column-Effects Model I
Another analytic possibility is to treat both the row and column effects scores as unknown quantities to be estimated. However, in this case it is important to have the correct ordering of both the row and column categories because the results are not invariant under different orderings. For the Chinese example we have been exploring (the relationship between the size of the place of origin and educational attainment) this creates a bit of a dilemma. Is it better to reorder the size-of-place categories according to the scale scores derived from the row-effects model or to retain the a priori ordering derived from the Chinese administrative hierarchy? One possibility is
... (BIC = -152, p = .304, and Δ = 1.20, compared with ... using the a priori categories). For the row-and-column effects model with the reordered categories, the scale scores are as follows:

    Size of place                      Level of schooling
    Village                  0.00      No schooling      0.00
    Town                     ...       Lower primary     ...
    Prefecture-level city    ...       Upper primary     1.67
    County seat             -2.22      Lower middle      ...
    County-level city       -3.10      Upper middle      ...
    Province-level city     -4.00      Tertiary          4.80
    Provincial capital      -4.95
Formally, the row-and-column effects model (often called Row-and-Column-Effects Model I to distinguish it from a log-multiplicative model also proposed by Goodman [1979] and known as Row-and-Column-Effects Model II, which we will discuss in the next section), is given by

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + j\phi_i + i\varphi_j \tag{12.19}$$

with the log odds ratio given by

$$\log \theta = (\phi_i - \phi_{i'})(j - j') + (\varphi_j - \varphi_{j'})(i - i') \tag{12.20}$$

Thus, for example, from Equation 12.20 we can calculate the log odds ratio of a tertiary versus an upper primary education for a person raised in a provincial capital compared to a person raised in a village as log θ = (-4.95 - 0)(6 - 3) + (4.80 - 1.67)(7 - 1) = 3.93, which implies that the odds ratio is 50.9 (= e^3.93). That is, the odds of people obtaining a tertiary education rather than a primary education are more than fifty times as great for those living in provincial capitals as for those living in rural villages. When people from Chinese rural villages make it to university, they are overcoming stupendous odds.
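The Model I calculation just described combines two products, one for the row scores and one for the column scores. A minimal Python sketch follows, using the scores quoted in the text; the function name `rc1_log_odds_ratio` and its argument names are hypothetical.

```python
import math


def rc1_log_odds_ratio(row_i, row_i2, col_j, col_j2, i, i2, j, j2):
    """Log odds ratio for Row-and-Column-Effects Model I:
    (row score difference) * (column index difference)
    + (column score difference) * (row index difference)."""
    return (row_i - row_i2) * (j - j2) + (col_j - col_j2) * (i - i2)


# Provincial capital (row 7, score -4.95) versus village (row 1, score 0.00);
# tertiary (column 6, score 4.80) versus upper primary (column 3, score 1.67).
log_theta = rc1_log_odds_ratio(-4.95, 0.0, 4.80, 1.67, 7, 1, 6, 3)

print(round(log_theta, 2), round(math.exp(log_theta), 1))
```

Because the category indexes enter the formula directly, reordering the categories changes the result, which is the limitation taken up in the next section.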
Row-and-Column-Effects Model II (the RC or Log-Multiplicative Model)
As I noted in the previous section, a serious limitation of Row-and-Column-Effects Model I is that correct estimation of the scale scores depends on correctly ordering the categories. For this reason an alternative model proposed by Goodman (1979), Row-and-Column-Effects Model II (also called the RC model or the Log-Multiplicative model), which is invariant under any ordering of categories, and which estimates scale scores from the data, has become much more widely used. In this model the expected frequencies are calculated as

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + \phi_i \varphi_j \tag{12.21}$$

and the log odds ratios as

$$\log \theta = (\phi_i - \phi_{i'})(\varphi_j - \varphi_{j'}) \tag{12.22}$$

An alternative parameterization of Equation 12.21, which includes a term for the overall strength of association in the table (particularly useful for comparisons between groups, which I do not cover here) is

$$\log F_{ij} = \mu + \mu_i^{R} + \mu_j^{C} + \beta \phi_i \varphi_j \tag{12.23}$$

with the log odds ratios given as

$$\log \theta = \beta (\phi_i - \phi_{i'})(\varphi_j - \varphi_{j'}) \tag{12.24}$$
For the data shown in Table 12.12, estimation of Equation 12.23 yields a very good fit: p = .140 and BIC = -147.3. Interestingly, the estimated scale scores preserve the order of the row-and-column effect scores reported earlier:

    Size of place                      Level of schooling
    Village                  0.00      No schooling      0.00
    Town                     0.42      Lower primary     0.14
    Prefecture-level city    0.76      Upper primary     0.17
    County seat              0.82      Lower middle      0.50
    County-level city        0.91      Upper middle      0.80
    Province-level city      1.00      Tertiary          1.00
    Provincial capital       1.04
In China, size of place of origin appears to be very strongly associated with educational attainment, reflected in the association parameter β = 4.17. Moreover, the greatest gap is between rural villages and any urban place, with the next largest gap between towns and prefecture-level cities. Making the same comparison as for the row-and-column effects model, from Equation 12.24 we can calculate the log odds ratio of a tertiary versus an upper primary education for a person raised in a provincial capital compared to a person raised in a village as log θ = 4.17(1.04 - 0)(1.00 - 0.17) = 3.60, which implies that the odds ratio is 36.6 (= e^3.60). That is, the RC model implies that the odds of people obtaining a tertiary education versus a primary education are about thirty-seven times as great for those living in provincial capitals as for those living in rural villages. Although the odds ratio implied by this model is not as large as that implied by the row-and-column effects model (which yields an odds ratio of fifty-one), it is still extremely large.

Although in this example the scaling of size-of-place categories was reasonably close to my a priori assumptions, and the rank ordering of the education categories was exactly as I anticipated, there is nothing in the method that guarantees such a close correspondence. Because the scale scores are computed to maximize the association between the row and column variables, they provide a test of the correctness of a priori assumptions. We can see this clearly by estimating an RC model for the Chinese intergenerational occupational mobility table analyzed earlier. In contrast to the typical outcome in Western nations (Ganzeboom, Luijkx, and Treiman 1989), the resulting scale scores for China deviate very substantially from my a priori ordering of occupation categories based on their socioeconomic position (perhaps because our data include both males and females whereas most research on occupational mobility for other nations, including that carried out by Ganzeboom, Luijkx, and Treiman [1989] and also Wu and Treiman's 2007 analysis of these data, is restricted to males). The following coefficients are from a model with the diagonal blocked.
                                 Father's      Respondent's
                                 Occupation    Occupation
    Professionals                   0.00          0.00
    Cadres                        -27.68         -0.27
    Clerical workers              -13.76         -0.18
    Sales and service workers     -12.97         -0.77
    Manual workers                  2.33         -0.87
    Agricultural workers            1.00          1.00
Clearly, the children of cadres are much more likely than other offspring to move into high-status positions. By contrast, the children of professionals are hardly protected at all from downward mobility, which may reflect the rather heterogeneous character of this category; it includes village accountants and school teachers and many technical positions that do not require tertiary education. The scale scores for respondents' occupations are somewhat more orderly, revealing a sharp manual-nonmanual divide, although mobility into the professions from all sorts of origins appears easier than mobility into clerical or cadre positions. Wu and Treiman (2007) also obtain distinctive results, albeit not as extreme as these, in their male-only analysis and argue that their results reflect a distinctive Chinese institution, the residential registration system, which makes the children of rural nonagricultural workers vulnerable to downward mobility into agriculture but also creates an extreme upward mobility route into the professions for the bright children of peasants.

The contrast between these results and results from the corresponding Row-and-Column-Effects Model I is instructive:
                                 Father's      Respondent's
                                 Occupation    Occupation
    Professionals                   0.00          0.00
    Cadres                         -0.66          0.25
    Clerical workers               -0.86          0.51
    Sales and service workers      -1.15          0.92
    Manual workers                 -1.33          1.29
    Agricultural workers           -1.66          1.53
The row-and-column effects model gives orderly results, consistent with my a priori ordering of categories. Thus, an analyst might be tempted to settle for this model because by the likelihood-ratio criterion it has by far the best fit among all the models estimated in this chapter except for the RC model (see Table 12.11), although it does not have the most negative BIC. However, the row-and-column effects model is clearly incorrect, even though it is nearly as likely as the RC model by the BIC standard.

From the RC model we can calculate the relative odds that the child of a professional will become a professional rather than a farmer, compared with the corresponding odds for the child of a farmer. Because the association coefficient, β, for the Chinese mobility table is 0.0455 (another indicator of the lack of association in the table), from Equation 12.24 we have log θ = .0455(0 - 1)(0 - 1) = 0.0455, which implies that the odds ratio is 1.047 (= e^0.0455). Apparently, the odds that the children of professionals will follow their fathers' footsteps rather than going into the fields are hardly larger than the odds that the children of farmers will become professionals rather than following their fathers into the fields.

This is a very different result from what we calculated from the uniform and linear-by-linear association models, and it brings home in a dramatic way the importance of deciding on the right model before making inferences. (It is also quite different from the corresponding result from Row-and-Column-Effects Model I, which implies that the children of professionals are about twice as likely [because e^0.65 = 1.92] as the children of peasants to become professionals rather than peasants.) Nonetheless, here we might be well advised to settle for the linear-by-linear association in which mobility is a function of differences in status between occupation categories, on the ground that it has the most negative BIC.
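Both RC-model odds ratios reported in this section (the education example and the mobility example) come from Equation 12.24. The following Python sketch reproduces them from the coefficients and scale scores quoted in the text; the function name `rc2_log_odds_ratio` is hypothetical.

```python
import math


def rc2_log_odds_ratio(beta, row_i, row_i2, col_j, col_j2):
    """Log odds ratio for the RC (log-multiplicative) model:
    beta * (row score difference) * (column score difference)."""
    return beta * (row_i - row_i2) * (col_j - col_j2)


# Education example: provincial capital (1.04) versus village (0.00),
# tertiary (1.00) versus upper primary (0.17), with beta = 4.17.
education = rc2_log_odds_ratio(4.17, 1.04, 0.0, 1.00, 0.17)

# Mobility example: professionals (0) versus agricultural workers (1)
# on both origin and destination, with beta = 0.0455.
mobility = rc2_log_odds_ratio(0.0455, 0.0, 1.0, 0.0, 1.0)

print(round(education, 2), round(math.exp(education), 1))
print(round(mobility, 4), round(math.exp(mobility), 3))
```

Unlike Model I, only the estimated scores enter the calculation, so the result does not depend on how the categories happen to be ordered in the table.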
Extensions
It is quite possible to extend the parsimonious models presented here to more than two variables. The most common application is to compare two-variable tables across contexts (time periods, nations, ethnic groups, and so on), but more general extensions are also possible. Many of these procedures are discussed in the literature briefly reviewed in the following section.
A BIBLIOGRAPHIC NOTE
A number of treatments of log-linear analysis are available, ranging from those intended for social scientists with limited mathematical backgrounds to full-blown treatises in mathematical statistics. The most accessible treatments include those by Davis (19...), Knoke and Burke (1980), Gilbert (1981), and Powers ... (1979), Clogg ... (1984), Sobel, Hout, and Duncan (1985), Yamaguchi (1987), Becker and Clogg (1989), Mare (1991), Xie (1992), Erikson and Goldthorpe (1992a, 1992b), Hout and Hauser (1992), Goodman and Hout (1998), Fu (2001), Pisati (2001), and Park and Smits (2005).

The 1998 paper by Goodman and Hout is particularly valuable for analysts who wish to compare log-linear models across contexts. Goodman and Hout estimated their models using GLIM, a powerful British competitor to Stata. These models, and virtually any other log-linear or log-multiplicative model, can be estimated using lem, developed by Jeroen Vermunt at Tilburg University, the Netherlands. A subset of the models discussed by Goodman and Hout can be estimated using a Stata -ado- file by Pisati (2001); see also Yamaguchi (1987) and Xie (1992), who originally proposed versions of these models.

Applications of log-linear modeling to substantive problems other than social mobility can be located by searching Sociological Abstracts or other bibliographic databases. (I got 310 hits searching Sociological Abstracts for "loglinear" [and variants "log linear" and "log-linear"] as a key word on 24 November 2007.)

SOFTWARE FOR ESTIMATING LOG-LINEAR MODELS
GLIM is available for purchase from http://www.nag.co.uk/stats/GDGE_soft.asp. The worked examples appearing in Goodman and Hout (1998) can be downloaded as two Microsoft Office Excel 97 workbook files from the Carnegie Mellon University Statistics Department's StatLib: http://lib.stat.cmu.edu/DOS/general. These files, "mobility.xls" and "voting.xls," include the raw data, GLIM results, and graphical displays from the examples presented in the article.

Vermunt's (1997) software, lem, and the accompanying documentation (Vermunt 1997) can be downloaded free of charge. The easiest way to find the download site is to query a search engine for "home page jeroen vermunt." The documentation is very cryptic, but the software comes with many worked examples that can easily be modified.

Pisati's Stata -ado- file to estimate uniform layer-effect models can be downloaded from within Stata (connected to the Internet) by typing "net search pisati" and then clicking "sg142 from http://www.stata.com/stb/stb55."

John Hendrickx has written an -ado- file, -rc2-, that estimates the RC model (Equation 12.23). To download -rc2- from within Stata, type "net search rc2." Then click rc2 from http://fmwww.bc.edu/RePEc/bocode/r and follow the instructions. (I thank Maarten Buis for pointing me to Hendrickx's program.)
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have seen how to use log-linear analysis to test hypotheses regarding the presence or absence of associations among variables in multiway tables. These tools give us a powerful way of testing hypotheses pertaining to percentage tables. In addition, we have seen how to apply various models to parsimoniously summarize patterns of association in two-way tables, and to determine which of several alternative models fits best. Although the substantive examples for discussing parsimonious models were drawn mainly from studies of social mobility, the topic that has driven most model development, these models can be applied to a wide variety of substantive problems.
APPENDIX 12.A DERIVATION OF THE EFFECT PARAMETERS
To see what the η and τs are, consider the saturated model for a two-by-two table. Recall from Equation 12.1 that the expected frequencies in each cell of a table can be expressed as a product of τs:

$$\begin{aligned}
F_{11} &= \eta\,\tau_1^{X}\tau_1^{Y}\tau_{11}^{XY} \\
F_{12} &= \eta\,\tau_1^{X}\tau_2^{Y}\tau_{12}^{XY} \\
F_{21} &= \eta\,\tau_2^{X}\tau_1^{Y}\tau_{21}^{XY} \\
F_{22} &= \eta\,\tau_2^{X}\tau_2^{Y}\tau_{22}^{XY}
\end{aligned} \tag{12.A.1}$$

Multiplying one of the equations by the other three and simplifying (recalling the relationships among the τs shown in Equation 12.5), we have

$$\eta = (F_{11}F_{12}F_{21}F_{22})^{1/4} \tag{12.A.2}$$

Thus η is just the geometric mean of the expected cell frequencies (the geometric mean of a set of n numbers is the nth root of their product). In this sense, η is a scale factor; all it does is take account of the fact that tables have different average numbers of cases per cell.

Next we express the row effect as a function of the cell frequencies. We do this by writing the product of the two conditional odds as a function of ηs and τs and simplifying:

$$\left(\frac{F_{11}}{F_{21}}\right)\left(\frac{F_{12}}{F_{22}}\right) = \left(\frac{\eta\,\tau_1^{X}\tau_1^{Y}\tau_{11}^{XY}}{\eta\,\tau_2^{X}\tau_1^{Y}\tau_{21}^{XY}}\right)\left(\frac{\eta\,\tau_1^{X}\tau_2^{Y}\tau_{12}^{XY}}{\eta\,\tau_2^{X}\tau_2^{Y}\tau_{22}^{XY}}\right) = (\tau_1^{X})^4 \tag{12.A.3}$$

And so

$$\tau_1^{X} = \left[(F_{11}/F_{21})(F_{12}/F_{22})\right]^{1/4} = \left[(F_{11}F_{12})/(F_{21}F_{22})\right]^{1/4} \tag{12.A.4}$$

That is, we see from Equation 12.A.4 that τ is a function of the product of the conditional odds. But we can get a more readily interpretable alternative expression by multiplying both the numerator and the denominator of the second line of 12.A.4 by (F_{11}F_{12})^{1/4} and simplifying. This yields

$$\tau_1^{X} = \frac{(F_{11}F_{12})^{1/2}}{(F_{11}F_{12}F_{21}F_{22})^{1/4}} \tag{12.A.5}$$
From Equation 12.A.5 we see that τ_1^X, the effect parameter for the first row, is just the ratio of the average size of the cells in the first row to the average size of all cells in the table (where by averages I mean geometric means). Thus τs larger than one indicate that a disproportionately large share of all the cases in the table is in the first row, and τs smaller than one indicate that a disproportionately small share of all the cases in the table is in the first row. In a similar way we can derive corresponding expressions for the effect parameter associated with the second row and also the effect parameters for columns.

Finally, we can derive interpretable expressions for the interaction effect parameters. To see this, we write the expected odds ratio, (F_{11}/F_{21})/(F_{12}/F_{22}), as a function of ηs and τs and simplify as we did earlier:

$$\frac{F_{11}/F_{21}}{F_{12}/F_{22}} = \frac{(\eta\,\tau_1^{X}\tau_1^{Y}\tau_{11}^{XY})/(\eta\,\tau_2^{X}\tau_1^{Y}\tau_{21}^{XY})}{(\eta\,\tau_1^{X}\tau_2^{Y}\tau_{12}^{XY})/(\eta\,\tau_2^{X}\tau_2^{Y}\tau_{22}^{XY})} \tag{12.A.6}$$

This yields

$$(\tau_{11}^{XY})^4 = (F_{11}/F_{21})/(F_{12}/F_{22}) \tag{12.A.7}$$

$$\tau_{11}^{XY} = \left[(F_{11}/F_{21})/(F_{12}/F_{22})\right]^{1/4} \tag{12.A.8}$$

Thus τ^{XY} is a function of the ratio of the two conditional odds. Once again we can get a more readily interpretable expression by multiplying both the numerator and the denominator of the right side of Equation 12.A.8 by the geometric mean of the expected frequencies, (F_{11}F_{12}F_{21}F_{22})^{1/4}, and simplifying:

$$\tau_{11}^{XY} = \frac{(F_{11}F_{22})^{1/2}}{(F_{11}F_{12}F_{21}F_{22})^{1/4}} \tag{12.A.9}$$

From Equation 12.A.9 we see that the interaction effect parameter, τ_{11}^{XY}, is just the ratio of the average size of the two diagonal cells to the average size of all cells. If this ratio is greater than one, there is a positive association (or interaction, in log-linear terms) between X and Y. If τ is smaller than one, there is a negative association between X and Y (assuming that Category 1 is the "positive" value in each case). In a similar way we can derive expressions for the other interaction effect parameters.

These relationships can be generalized beyond the two-by-two case, but that is beyond the scope of the present discussion. Those wishing to pursue this topic should consult the sources listed in the Bibliographic Note section of this chapter.
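The geometric-mean expressions derived in this appendix can be checked numerically. The sketch below uses made-up cell frequencies (40, 10, 20, 30) purely for illustration; the function name `two_by_two_effects` is hypothetical, the column effect is obtained by analogy with the row effect, and the second-category parameters are taken to be reciprocals of the first-category ones, which is the constraint the derivation relies on.

```python
import math


def two_by_two_effects(f11, f12, f21, f22):
    """Effect parameters of the saturated model for a 2 x 2 table,
    each expressed as a ratio of geometric means."""
    eta = (f11 * f12 * f21 * f22) ** 0.25      # geometric mean of all cells
    tau_row1 = math.sqrt(f11 * f12) / eta      # first-row effect
    tau_col1 = math.sqrt(f11 * f21) / eta      # first-column effect (by analogy)
    tau_11 = math.sqrt(f11 * f22) / eta        # interaction effect for cell (1, 1)
    return eta, tau_row1, tau_col1, tau_11


# Hypothetical frequencies, not data from the chapter.
eta, tx1, ty1, txy11 = two_by_two_effects(40, 10, 20, 30)

# Rebuild every cell from the parameters; second-category effects are
# the reciprocals of the first-category effects.
rebuilt = [
    eta * tx1 * ty1 * txy11,                # F11
    eta * tx1 * (1 / ty1) * (1 / txy11),    # F12
    eta * (1 / tx1) * ty1 * (1 / txy11),    # F21
    eta * (1 / tx1) * (1 / ty1) * txy11,    # F22
]
print([round(value, 6) for value in rebuilt])
```

An interaction parameter above one, as in this made-up table, corresponds to the diagonal cells being larger on average than the table as a whole, that is, to a positive association.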
APPENDIX 12.B INTRODUCTION TO MAXIMUM LIKELIHOOD ESTIMATION
Maximum likelihood estimation is one of several methods used to obtain parameter estimates for the models presented in this and the following chapters. The principle is straightforward, although both the underlying mathematics and the computational procedures are often quite complex and go beyond what is dealt with in this book. For good introductions to the topic, see King (1989), Eliason (1993), Long (1997, 25-33, 52-61), and Powers and Xie (2000, Appendix B).

Suppose we observe a random sample of values on some variable, x_1, x_2, ..., x_N, drawn independently from a population distribution f(x_1, x_2, ..., x_N | θ) governed by an unknown parameter θ. We may then ask what is the probability of obtaining the observed sample for any given value of θ. This is the likelihood of the sample. What we want to do is to find the value for θ that maximizes the likelihood of the sample; this is the maximum likelihood estimate of θ. More generally, maximum likelihood estimation consists of procedures for finding estimates of unknown parameters that maximize the likelihood of the observed data; the resulting parameter estimates are called maximum likelihood estimates.

Maximum likelihood estimation involves two steps: determining the likelihood function, which expresses the probability of the observed data as a function of the unknown parameters; and maximizing the likelihood function. We can write a general expression for the likelihood function:

$$\Lambda(\theta) = \prod_{i=1}^{N} f(x_i; \theta) \tag{12.B.1}$$

where θ is a column vector of unknown parameters; note that there may be only one unknown parameter, in which case θ is a scalar. Equation 12.B.1 holds because the observations are assumed to be independent, which means that their joint distribution may be written as the product of the individual marginal distributions. However, because products are mathematically intractable, we convert Equation 12.B.1 to a sum by taking logs, which is permissible because the relationship between a variable and its log is monotonic. This gives the log likelihood:

$$\lambda(\theta) = \ln\left[\prod_{i=1}^{N} f(x_i; \theta)\right] = \sum_{i=1}^{N} \ln f(x_i; \theta) \tag{12.B.2}$$

We then find the values of θ (denoted θ̂) that maximize the log likelihood; because of the monotonic relationship, these also maximize the likelihood.
MEAN OF A NORMAL DISTRIBUTION
Consider a simple case. Suppose we want to find the maximum likelihood estimate of the mean of a normally distributed population with known variance σ². Because the likelihood for a single observation is

$$L(\mu, \sigma^2 \mid x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{(x_i - \mu)^2}{2\sigma^2}\right] \tag{12.B.3}$$

it follows (from Equations 12.B.1 and 12.B.2) that the log likelihood of the sample is

$$\lambda = \sum_{i=1}^{N} \ln\left[\frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)\right] = -N \ln\left(\sqrt{2\pi\sigma^2}\right) - \frac{1}{2\sigma^2}\sum_{i=1}^{N}(x_i - \mu)^2 \tag{12.B.4}$$

However, we can disregard the leftmost term on the right side of the equation because it does not depend on μ. We also can discard the 1/(2σ²) factor because σ² is assumed known. This leaves us with the kernel of the log likelihood:

$$-\sum_{i=1}^{N}(x_i - \mu)^2 \tag{12.B.5}$$

To find the value of μ that maximizes the log likelihood, Equation 12.B.5, we take the first derivative with respect to μ; the extreme value is the point at which this derivative equals zero:

$$\frac{\partial}{\partial\mu}\left[-\sum_{i=1}^{N}(x_i - \mu)^2\right] = 2\sum_{i=1}^{N}(x_i - \mu) \tag{12.B.6}$$

Setting Equation 12.B.6 to zero and solving for μ yields the maximum likelihood estimate:

$$\hat{\mu} = \frac{1}{N}\sum_{i=1}^{N} x_i \tag{12.B.7}$$

We know that this is a maximum rather than a minimum because the second derivative of Equation 12.B.5 with respect to μ is negative.
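The result that the sample mean maximizes the normal log likelihood can be checked by brute force. The sketch below evaluates the kernel of the log likelihood (the negative sum of squared deviations) over a grid of candidate means; the data values and the function name `log_likelihood_kernel` are made up for illustration, and a grid search is used only because it is transparent, not because it is how estimation software works.

```python
def log_likelihood_kernel(mu, xs):
    """Kernel of the normal log likelihood for a candidate mean mu:
    the negative sum of squared deviations from mu."""
    return -sum((x - mu) ** 2 for x in xs)


# Hypothetical sample, not data from the chapter.
xs = [2.0, 4.0, 4.5, 7.5]
sample_mean = sum(xs) / len(xs)

# Evaluate the kernel on a grid from 0 to 10 in steps of 0.001 and
# keep the candidate with the largest value.
grid = [k / 1000 for k in range(0, 10001)]
best_mu = max(grid, key=lambda mu: log_likelihood_kernel(mu, xs))

print(sample_mean, best_mu)
```

Any candidate other than the sample mean gives a smaller kernel, which is the numerical counterpart of setting the first derivative to zero.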
LOG-LINEAR PARAMETERS
Finding maximum likelihood estimates for log-linear models follows the same logic. Consider the independence model, in which the log of the expected frequencies is

$$\log F_{ij} = \mu + \mu_i^{X} + \mu_j^{Y} \tag{12.B.8}$$

For a two-by-two table, the row and column categories can be represented by two dummy variables, x_1 and x_2, with y_i denoting the observed frequency in the ith cell. The independence model can then be written as a model for counts:

$$m_i = \exp(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i}) \tag{12.B.9}$$

where m_i is the expected frequency in the ith cell. Under Poisson sampling, the kernel of the log likelihood is

$$\lambda(\beta) = \sum_{i}(y_i \log m_i - m_i) \tag{12.B.10}$$

and so we need to maximize Equation 12.B.10. Because the model is nonlinear, an iterative solution is required in which we repeatedly update the estimates of the βs using the first and second derivatives of Equation 12.B.10 with respect to β. For our purposes, it is not necessary to consider further how the estimation is actually carried out. For additional detail see Eliason (1993); Gould and Sribney (1999); and Powers and Xie (2000, Appendix B).
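For readers curious about what the iterative solution looks like, here is a minimal Newton-Raphson sketch for the Poisson independence model in a two-by-two table. It is an illustration under stated assumptions, not the algorithm of any particular package: the counts are made up, the dummy coding (0 for the first category, 1 for the second) is assumed, and the helper names `solve` and `fit_poisson` are hypothetical.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_poisson(design, y, iterations=25):
    """Maximize sum(y_i * log m_i - m_i), with m_i = exp(x_i . beta),
    by repeated Newton-Raphson updates."""
    p = len(design[0])
    beta = [math.log(sum(y) / len(y))] + [0.0] * (p - 1)
    for _ in range(iterations):
        m = [math.exp(sum(b * x for b, x in zip(beta, row))) for row in design]
        # First derivatives: sum over cells of (y_i - m_i) * x_i.
        gradient = [sum((yi - mi) * row[k] for yi, mi, row in zip(y, m, design))
                    for k in range(p)]
        # Negative second derivatives: sum over cells of m_i * x_i * x_i'.
        information = [[sum(mi * row[k] * row[l] for mi, row in zip(m, design))
                        for l in range(p)] for k in range(p)]
        step = solve(information, gradient)
        beta = [b + s for b, s in zip(beta, step)]
    return beta


# Cells in the order (1,1), (1,2), (2,1), (2,2); columns are the
# constant, the row dummy x1, and the column dummy x2.
design = [[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]]
counts = [40, 10, 20, 30]   # hypothetical observed frequencies

beta_hat = fit_poisson(design, counts)
fitted = [math.exp(sum(b * x for b, x in zip(beta_hat, row))) for row in design]
print([round(value, 4) for value in fitted])
```

For the independence model the fitted frequencies should agree with the familiar closed form, row total times column total divided by N, which gives a convenient check on the iteration.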
BINOMIAL LOGISTIC REGRESSION

WHAT THIS CHAPTER IS ABOUT
This chapter introduces binomial logistic regression, a technique for estimating models with a dichotomous dependent variable. We start by considering the relationship of binomial logistic regression to log-linear analysis and then see, by studying a worked example, how to estimate and interpret logistic regression models. We then consider three additional worked examples to expand the applications of binary logistic regression to progression and similar models, to discrete-time hazard-rate models, and to case-control designs.
INTRODUCTION
Often social scientists are confronted with the need to analyze categorical dependent variables: whether people vote, for whom they vote, their degree of agreement with a particular attitude, their choice of occupation, and so on. Although, as we have seen, regression procedures can easily handle categorical independent variables, they are not appropriate for categorical dependent variables, even dichotomies. In the case of dichotomous dependent variables, the assumptions of multiple regression, including in particular that errors of prediction are normally distributed, break down badly, often yielding seriously misleading results; moreover, predicted values often lie outside the logically possible range (zero to one). For these reasons a variety of procedures have been developed for dealing with dichotomous dependent variables, of which one of the most widely used is logit analysis or (synonymously) logistic regression, which uses maximum likelihood estimation. Logistic regression can be readily extended to handle dependent variables with more than two categories (multinomial logistic regression) and ordered sets of categories (ordered logistic regression). We will consider these two extensions in the next chapter. But we start with binomial logistic regression.
?,I Nl -
tiketihoe MAXIMUM LIKELIHOODESTIMATION Maximum
prirrefersto a framework for estimatingparametersof statisticalmodels.The "rtlmatlon the to find models'is and logisticregression .iol". which underliesestimationof log-linear (See the likelihoodof observingthe sampledata valueof the parameterthat maximizes KingI19891'Eliaso' Appendix12.Bfor a briefoverviewof maximumlikelihoodestimation; B] for accessib: Appendix LongL1gg7 , 25'33' 52-611,and Powersand Xie [2ooo' I'19931,
i n tro d u c ti o n s to th e to p | c;andGou| dandsri bney[1999]foratechni ca| di scussi onof howi: in Stata.) do likelihoodestimation
PROBIT ANALYSIS
An alternative to logistic regression, more widely used in economics than in sociology, is probit analysis. The two procedures generally yield similar results, and the choice between them is largely a matter of professional convention (see Appendix 13.B for a brief introduction to probit analysis).
Binomial logistic regression is a procedure for predicting, from a set of independent variables, the log odds that individuals will be in each of two categories of a dichotomous dependent variable. The formula for logistic regression is
\[
\ln\!\left(\frac{F_{Y=1|X}}{F_{Y=2|X}}\right) = a + \sum_{k=1}^{K} b_k X_k \tag{13.1}
\]
for K independent variables, $X_k$, where the $a$ and $b_k$ are coefficients analogous to OLS regression coefficients, and the dependent variable is the natural log of the expected odds of being in category 1 of the dependent variable rather than in category 2, conditional on the values of the independent variables. Thus, logistic regression is a specific case of the general linear model. It is also true (and can be easily shown by dividing through by N) that the log odds of the expected conditional frequency distribution of the dichotomous dependent variable equals the log of the ratio of the expected probabilities of being in each of the two categories:
\[
\ln\!\left(\frac{F_{Y=1|X}}{F_{Y=2|X}}\right) = \ln\!\left(\frac{p_{Y=1|X}}{p_{Y=2|X}}\right) = \ln\!\left(\frac{p_{Y=1|X}}{1 - p_{Y=1|X}}\right) \tag{13.2}
\]

The dependent variable (the log odds) is known as the logit. As we have just seen, logits may be expressed either in terms of frequencies or in terms of probabilities.
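The equivalence of the frequency and probability forms of the logit is easy to confirm numerically. The Python sketch below uses invented expected frequencies (30 and 120); the function names `logit` and `inverse_logit` are illustrative choices, not anything defined in the chapter.

```python
import math

def logit(p):
    """Log odds corresponding to a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))

def inverse_logit(z):
    """Probability corresponding to log odds z."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical expected frequencies in categories 1 and 2 of Y.
f1, f2 = 30.0, 120.0
n = f1 + f2
p1 = f1 / n

# Dividing both frequencies by N leaves their ratio, and hence the logit, unchanged.
logit_from_frequencies = math.log(f1 / f2)
logit_from_probabilities = logit(p1)

print(logit_from_frequencies, logit_from_probabilities, inverse_logit(logit_from_probabilities))
```

The inverse function is what is needed later to move from predicted log odds back to predicted probabilities.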
RELATION TO LOG-LINEAR ANALYSIS
The relationship of the logit specification to log-linear analysis is straightforward, as can be seen with the help of a little algebra. Consider a log-linear model in which there are three variables: Y, the dependent variable, and two independent variables, A and B. Now consider the saturated model relating these three variables. Under this model, expected cell frequencies are estimated as
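\[
\ln\!\left(F_{ijk}^{ABY}\right) = \theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_k^{Y} + \lambda_{ij}^{AB} + \lambda_{ik}^{AY} + \lambda_{jk}^{BY} + \lambda_{ijk}^{ABY} \tag{13.3}
\]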
Now, because the dependent variable, Y, is dichotomous, we can easily derive the log odds of being in category 1 (rather than in category 2) of Y from Equation 13.3 (for those rusty on school algebra, some algebraic relationships involving logarithms and exponents are given in Appendix 13.A):
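\[
\begin{aligned}
\ln\!\left(F_{ij1}^{ABY} / F_{ij2}^{ABY}\right) &= \ln\!\left(F_{ij1}^{ABY}\right) - \ln\!\left(F_{ij2}^{ABY}\right) \\
&= \left(\theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_1^{Y} + \lambda_{ij}^{AB} + \lambda_{i1}^{AY} + \lambda_{j1}^{BY} + \lambda_{ij1}^{ABY}\right) \\
&\quad - \left(\theta + \lambda_i^{A} + \lambda_j^{B} + \lambda_2^{Y} + \lambda_{ij}^{AB} + \lambda_{i2}^{AY} + \lambda_{j2}^{BY} + \lambda_{ij2}^{ABY}\right) \\
&= \left(\lambda_1^{Y} - \lambda_2^{Y}\right) + \left(\lambda_{i1}^{AY} - \lambda_{i2}^{AY}\right) + \left(\lambda_{j1}^{BY} - \lambda_{j2}^{BY}\right) + \left(\lambda_{ij1}^{ABY} - \lambda_{ij2}^{ABY}\right)
\end{aligned}
\tag{13.4}
\]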
But because the $\lambda$s must sum to zero across each dimension, $\lambda_1 = -\lambda_2$, and so on. So we have

\[
\ln\!\left(F_{ij1}^{ABY} / F_{ij2}^{ABY}\right) = 2\lambda_1^{Y} + 2\lambda_{i1}^{AY} + 2\lambda_{j1}^{BY} + 2\lambda_{ij1}^{ABY} \tag{13.5}
\]

In short, the log odds of being in one category rather than the other of the dependent variable is given by the sum of twice the coefficients involving the dependent variable, alone and in interaction with the independent variables. The same holds for nonsaturated models. Note that the coefficient $\lambda_{ij}^{AB}$, expressing the association between the independent variables, drops out.

Thus we can carry out binomial logistic regressions by estimating log-linear models and doubling the coefficients. When the log-linear coefficients are instead expressed in dummy-variable format, that is, as deviations from a reference category with an implied coefficient of zero, the differences in Equation 13.4 reduce to the coefficients themselves and no doubling is required.

Although log-linear analysis and logistic regression are mathematically equivalent, they have separate origins. Logit analysis was developed as a special case of log-linear analysis in which one (dichotomous) variable is regarded as dependent on the others. Logistic regression, by contrast, was developed by statisticians and econometricians to deal with the problems that dichotomous dependent variables pose for regression. Therefore it was developed to handle continuous independent variables. (For a good introduction to the statistical theory, see Hosmer and Lemeshow [2000]. For treatments with sociological examples, see Long [1997] and Powers and Xie [2000]. For a Stata-oriented text, see Long and Freese [2006].)
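The doubling result can be checked numerically. The following Python sketch builds an invented 2 x 2 x 2 table of frequencies, computes the sum-to-zero (effect-coded) parameters of the saturated log-linear model directly from means of the logged frequencies, and confirms that the log odds on Y in every A-by-B cell equal twice the sum of the parameters involving Y. The table and the helper names (`avg`, `lam_y`, and so on) are hypothetical.

```python
import math
from itertools import product

# Hypothetical 2 x 2 x 2 table of frequencies indexed (i, j, k) for A, B, Y.
# Index 0 stands for category 1 and index 1 for category 2.
counts = [40.0, 10.0, 30.0, 20.0, 25.0, 25.0, 10.0, 40.0]
log_f = dict(zip(product((0, 1), repeat=3), map(math.log, counts)))

def avg(i=None, j=None, k=None):
    """Mean of ln(F) over all cells matching whichever indices are given."""
    cells = [v for (a, b, y), v in log_f.items()
             if i in (None, a) and j in (None, b) and k in (None, y)]
    return sum(cells) / len(cells)

theta = avg()  # grand mean of the logged frequencies

def lam_y(k):
    return avg(k=k) - theta

def lam_ay(i, k):
    return avg(i=i, k=k) - avg(i=i) - avg(k=k) + theta

def lam_by(j, k):
    return avg(j=j, k=k) - avg(j=j) - avg(k=k) + theta

def lam_aby(i, j, k):
    return (log_f[(i, j, k)] - avg(i=i, j=j) - avg(i=i, k=k) - avg(j=j, k=k)
            + avg(i=i) + avg(j=j) + avg(k=k) - theta)

# Compare the observed log odds with twice the sum of the Y-related parameters.
max_gap = 0.0
for i, j in product((0, 1), repeat=2):
    log_odds = log_f[(i, j, 0)] - log_f[(i, j, 1)]
    doubled = 2 * (lam_y(0) + lam_ay(i, 0) + lam_by(j, 0) + lam_aby(i, j, 0))
    max_gap = max(max_gap, abs(log_odds - doubled))

print(max_gap)  # differs from zero only by floating-point error
```

Because the model is saturated, the expected frequencies equal the observed ones, which is why the parameters can be obtained from simple averages here.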
A WORKED LOGISTIC REGRESSION EXAMPLE: PREDICTING PREVALENCE OF ARMED THREATS
Suppose we are interested in what determines the likelihood that a person has ever been threatened with a gun. Moreover, suppose we are interested in ascertaining whether the prevalence of armed threats has changed over time. (Investigating this latter question also provides an occasion for demonstrating how to make comparisons over time using the GSS.) We might suspect that males are more likely to have experienced armed threats than are females. Not only has some fraction of the male population been in combat, unlike women (until very recently), but men also tend to be more likely than women to be involved in situations in which such threats arise. Second, the risk of being threatened is probably negatively correlated with socioeconomic status (SES), given residential segregation and class differences in leisure activities. For convenience, I take education as an indicator of socioeconomic status because, unlike occupational status and income, it is essentially fixed over the adult life course and is interpretable equivalently for men and women. Third, it is likely that Blacks are subject to more armed threats than are members of other racial groups, net of SES, given patterns of residential discrimination that force even middle-class Blacks to live in high-crime neighborhoods. Finally, claims about the breakdown of civility in America would suggest that the prevalence of armed threats has been increasing over time.

Data to assess these possibilities are available in the GSS. In most years from 1973 to 1994 respondents were asked, "Have you ever been threatened with a gun, or shot at?" In addition, the sex, race (White, Black, or Other), and education (years of school completed, ranging from 0 to 20) of each respondent was ascertained. I omitted, for convenience, 5,031 cases in which the gun question was not answered (mostly because in several years the question was asked only of a subsample of respondents), additional cases in which information on education was missing, and an additional 16 cases in which information on the number of adults in the household (used to construct the weight) was missing. These procedures yielded an effective sample of 19,260 people for the years 1973 through 1994. I carried out the analysis using survey estimation and treating each year as a stratum. (For estimation details, see Appendix B and the downloadable files "ch13-1.do" and "ch13-1.log".)

Table 13.1 confirms that a substantially higher percentage of males than of females and a moderately higher percentage of Blacks than members of other races have ever been threatened by a gun. It is difficult to see a consistent pattern with respect to either educational attainment or year, but it is possible that each variable suppresses the effect of the other because education has been increasing over time.
TECHNICAL POINT ON TABLE 13.1
Note that in Table 13.1 the percentages are based on weighted frequencies but the unweighted percentage bases are shown. I weighted the data to take account of differential household size, to adjust for an oversample of Blacks in 1987, and to equalize the contribution of each year (see downloadable file "ch13-1.do" for details). For descriptive statistics, it is necessary to use the weighted data to get correct estimates for the population. But it is desirable to show the unweighted N's to reveal to the reader the actual number of cases on which each computation is based.
My first task is to choose a preferred model. Table 13.2 shows goodness-of-fit statistics for five models. Model 1 is a baseline model, positing that sex, race, and education each affect the odds of being threatened by a gun. Model 2 in addition posits a linear trend in the (log) odds of being threatened, net of the effects of sex, race, and education. If the likelihood of being threatened has been increasing over time, the coefficient associated with year should be positive. Model 3 posits year-to-year variation around any linear trend in the (log) odds of being threatened. Models 1, 2, and 3 stand in a hierarchical relation to each other. Model 4 posits that the log odds of being threatened depend on sex, race, and education; that the log odds increase over time in a linear fashion; and that sex and race interact, the hypothesis being that gender differences in the likelihood of being threatened will be smaller for Blacks than for others because, owing to
TABLE 13.1. Percentage Ever Threatened by a Gun, by Selected Variables, U.S. Adults, 1973 to 1994 (N = 19,260).
[Table body not recoverable from the source. Columns: Percentage Threatened; N. Row groups include Education and Year.]
Note: Percentages are based on weighted frequencies; see the box "Technical Point on Table 13.1." Ns are unweighted frequencies.
residential discrimination, Blacks are more likely than others to live in dangerous neighborhoods, and hence Black women are particularly vulnerable to being threatened. Model 5 extends the same argument to include an interaction between race and education, positing a smaller effect of education on the odds of gun threat for Blacks than for others because of the residential vulnerability of even well-educated Blacks.

If the GSS were a simple random sample of the U.S. population, it would be possible to compare nested models by using the likelihood ratio $\chi^2$s, or $L^2$s (reported in Stata output from the -logistic- command as LR chi2); $L^2$ is defined as twice the difference between the model log likelihood and the log likelihood for a model with no independent variables. If we were analyzing a simple random sample, the $L^2$s would be distributed approximately as $\chi^2$, as would the difference between any pair of $L^2$s. In such cases we could assess whether one model fits significantly better than another by assessing the significance of the difference between two $L^2$s, with degrees of freedom calculated as the difference in the degrees of freedom associated with the two models. However, when we use weighted or clustered data, or design-based estimation procedures, what Stata estimates is actually
pseudo-$R^2 = 1 - L_1/L_0$, where $L_0$ is the log likelihood for a constant-only model (that is, a model with no independent variables), and $L_1$ is the log likelihood for the estimated model. Obviously, if the dependent variable is perfectly explained by the set of independent variables, $L_1 = 0$ and pseudo-$R^2 = 1$, and if the independent variables explain nothing, pseudo-$R^2 = 0$. Thus the pseudo-$R^2$ gives a sense of how well a model does. However, in the case of weighted or clustered data, pseudo-log likelihoods are estimated, and pseudo-log likelihoods can increase rather than decrease for more complete models, and hence the pseudo-$R^2$s can decrease, which makes little sense. More generally, when pseudo-log likelihoods are estimated, there is no simple relationship between changes in pseudo-log likelihoods and improvements in the goodness of fit, so the pseudo-$R^2$s are uninterpretable. For the same reason, BIC is inappropriate for design-based estimation because it also is based on a comparison of log likelihoods. (For random samples, BIC for logistic regression is estimated by $-L^2 + (d.f.)[\ln(N)]$. The signs are opposite those for Equation 12.8 because here the comparison of interest is not with a saturated model but with a baseline model in which predictions are based on the intercept alone.) When we have data based on complex samples as in the current case, survey estimation (implemented by the Stata command -svy: logistic-) is the best available tool, with models compared through adjusted Wald tests.
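For the simple-random-sample case only, the three fit statistics just described can be computed from two log likelihoods. The Python sketch below uses invented log likelihoods, degrees of freedom, and sample size; the function names are illustrative. As the text stresses, none of these quantities is interpretable when the log likelihoods are pseudo-log likelihoods from weighted or clustered data.

```python
import math

def likelihood_ratio_chi2(ll_model, ll_null):
    """L2: twice the difference between model and constant-only log likelihoods."""
    return 2.0 * (ll_model - ll_null)

def pseudo_r2(ll_model, ll_null):
    """Pseudo-R2 = 1 - L1/L0, with L1 the model and L0 the constant-only log likelihood."""
    return 1.0 - ll_model / ll_null

def bic_vs_baseline(l2, df, n):
    """BIC relative to the intercept-only baseline: -L2 + d.f. * ln(N)."""
    return -l2 + df * math.log(n)

# Hypothetical values for a simple random sample of 1,000 cases and a
# model with three independent variables.
ll_null, ll_model, df, n = -650.0, -600.0, 3, 1000

l2 = likelihood_ratio_chi2(ll_model, ll_null)
r2 = pseudo_r2(ll_model, ll_null)
bic = bic_vs_baseline(l2, df, n)

print(l2, round(r2, 4), round(bic, 2))
```

A negative BIC on this definition favors the estimated model over the intercept-only baseline.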
LIMITATIONS OF WALD TESTS
The appropriate way to do statistical inference for complex samples is at present an unsettled issue. As we saw in Chapter Nine, when the clustering of observations typical of multistage probability samples is ignored the standard errors of statistics may be substantially biased; they typically are underestimated but in some instances may be overestimated. But the proposed corrections have their own limitations, both theoretically and practically. In particular, Wald tests are known to have poor properties, which may produce misleading results (Gould and Sribney 1999, 7-8); and as noted, BIC is not available for weighted or clustered samples. The optimal solution may be to treat clustered samples in a multilevel context, estimating either fixed- or random-effects models (Mason 2001), which can be done in Stata using the -xt- or -gee- command; these procedures go beyond what can be covered in this book, but see Chapter Sixteen for a brief introduction to multilevel analysis. Although even now much that is published, even in leading journals, simply ignores complex sample designs and treats data as if they were generated by random sampling procedures, this is generally inappropriate and can lead to incorrect inferences. For the present, I suggest for logistic regression in its various forms that when you have data that are weighted or clustered you carry out your estimation using Stata's survey estimation commands and rely on adjusted Wald tests for model selection. Be cautious, however, in your interpretation and explore alternative specifications. Only where you have a true, unweighted, random sample should you use the -logistic- command and the likelihood ratio test (-lrtest-). Further, whenever possible, eschew weighting in favor of including the variables used to create the weights in the model.
Inspecting the Wald-test statistics in the bottom panel of Table 13.2, we see that Model 2 fits better than Model 1, but no model fits significantly better than Model 2. I thus conclude that the likelihood of armed assault depends on gender, race, and education and also changes over time in a linear way. To see the nature of these relationships, we examine the coefficients in Table 13.3. I also have included the coefficients for Model 4 in Table 13.3, even though Model 4 is only a marginally significant improvement over Model 2 (p = .092). I do this to illustrate how to deal with interaction terms in the context of logistic regression.
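TABLE 13.3. Effect Parameters for Models 2 and 4 of Table 13.2.
[Table body not recoverable from the source. Columns include Independent Variable and Standard Error, with exponentiated coefficients in the rightmost column; rows include Education and Intercept. Legible entry: Intercept, -2.9037 (standard error 0.3178).]
Note: "Black" versus non-Black. Non-Blacks consist of "Whites" and "Others."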
There are two alternative (but equivalent) ways to interpret logistic regression coefficients: to consider the additive effects of each independent variable on the log odds of the dependent variable, and to consider the multiplicative effects of each independent variable on the odds of the dependent variable. Consider first how to interpret the log-odds effects, the effects shown in Equation 13.1. We can interpret the contributions to the log odds just as we would the coefficients in an OLS regression equation: a one-unit difference in the kth independent variable results in a $b_k$ unit difference in the log odds, holding constant all other variables. Thus, for example, in Model 2 of Table 13.3, the expected log odds of males being threatened by a gun are about 1.42 greater than the log odds of females being threatened, holding constant race, education, and the year of the survey. The net log odds of having been threatened in 1994 are about 0.21 greater than in 1973 (precisely, 0.212 = 0.0101*[1994 - 1973]), all else equal.

Although the interpretation is straightforward, log odds are not very intuitively meaningful. Hence, a more appealing possibility is to interpret the antilogs of the coefficients, that is, the $e^{b_k}$: a one-unit difference in the kth independent variable multiplies the expected odds by $e^{b_k}$. To see this, start again from Equation 13.1:
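\[
\ln\!\left(\frac{F_{Y=1|X}}{F_{Y=2|X}}\right) = a + \sum_{k=1}^{K} b_k X_k \tag{13.6}
\]

Exponentiating both sides of the equation, we get

\[
\frac{F_{Y=1|X}}{F_{Y=2|X}} = e^{\,a + \sum_{k=1}^{K} b_k X_k} = e^{a} \prod_{k=1}^{K} e^{b_k X_k} \tag{13.7}
\]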
That is, the odds of being in category 1 rather than category 2 of the dependent variable are given by the product of a constant, $e^{a}$, and a set of terms, $e^{b_k X_k}$, one for each independent variable. The exponentiated coefficients, $e^{b_k}$, thus can be interpreted as contributions to odds ratios: $e^{b_k}$ gives the factor by which the expected odds are multiplied for each one-unit difference in $X_k$, holding constant all other independent variables. Thus, for Model 2 the expected odds of males being threatened by a gun are 4.15 times the odds of females being threatened, holding constant race, education, and the year of the survey. Similarly, the expected net odds of having been threatened in 1994 are about 1.24 times the corresponding odds in 1973 (because $e^{0.0101(1994 - 1973)} = e^{0.212} = 1.24$).

What does this tell us substantively? We can conclude that, net of other factors, the odds of males ever having been threatened are about four times as great as the odds for females; the expected odds of having been threatened decline slightly with increasing education (the odds of being threatened for those with at least a BA are about 14 percent less than for those with only an eighth grade education; precisely, $e^{b_E(16-8)}$, where $b_E$ is the coefficient for education), but increase modestly over time, as we have seen; and the odds of Blacks having been threatened in any given year are more than 1.5 times as great as for non-Blacks of the same sex with the same amount of education (precisely, 1.56 = $e^{0.4461}$).

Now let us consider Model 4. Note that the coefficients for education and year hardly change. Thus we can restrict ourselves to the interpretation of the coefficients for MALE and BLACK and their interaction. A convenient way to see how to interpret these coefficients is to evaluate the equation for fixed values of EDUCATION and YEAR. Let us choose 1994 and twenty years of education as our values for these variables, to assess the effect of race and sex on the probability of having ever been threatened by a gun. We thus compute a new intercept: $a' = a + b_E*20 + b_Y*94 = -2.9037 - 0.0191*20 + 0.0101*94 = -2.3363$ (where $b_E$ is the coefficient for education and $b_Y$ is the coefficient for year of survey). Then we write out the expected log odds of having experienced a gun threat (for convenience, call this G) by race and sex (where $b_M$ is the coefficient for MALE, $b_B$ is the coefficient for BLACK, and $b_{BM}$ is the coefficient for the interaction term).

For non-Black females we have

\[
G = a' = -2.3363 \tag{13.8}
\]

For Black females we have

\[
G = a' + b_B = -2.3363 + 0.5690 = -1.7673 \tag{13.9}
\]

For non-Black males we have

\[
G = a' + b_M = -2.3363 + 1.4543 = -0.8820 \tag{13.10}
\]

For Black males we have

\[
G = a' + b_B + b_M + b_{BM} = -2.3363 + 0.5690 + 1.4543 - 0.2125 = -0.5255 \tag{13.11}
\]
We see, both from the coefficients in Table 13.3 and from the sums just shown, that the expected log odds of being threatened are 1.45 larger for non-Black males than for non-Black females; that the expected log odds of being threatened are 0.57 larger for Black females than for non-Black females; but that Black males do not face the full double jeopardy of being male and Black, because their expected log odds are 0.21 less than the sum of the MALE coefficient and the BLACK coefficient. Or, to put it differently, the gender difference is greater for non-Blacks than for Blacks, and the race difference is greater for females than for males. These results are as hypothesized except that the interaction is too weak for us to have much confidence in it.

Again, the interpretation is easier if we consider the odds ratios rather than the logits. One way to do this is simply to take the antilog of the logits we just computed (the Gs). Doing this, we see that the expected odds of ever having been threatened by a gun among people with twenty years of schooling in 1994 are 0.10 for non-Black females (= $e^{-2.3363}$), 0.17 for Black females, 0.41 for non-Black males, and 0.59 for Black males. Note that the odds ratios are just what are shown in the rightmost column of Table 13.3 (within the limits of rounding error): the odds of non-Black males having been threatened are 4.3 times as large as the odds of non-Black females having been threatened (0.4140/0.0967 = 4.2813, approximately 4.2817); the odds of Black males having been threatened are about 3.5 times as large as the odds of Black females having been threatened (0.5913/0.1708 = 3.4619, approximately 3.4618 = 4.2817*0.8085); and so on. We can see this most clearly by writing out the odds just as we did for the logits.

For non-Black females we have

\[
e^{G} = e^{a'} = 0.0967 \tag{13.12}
\]

For Black females we have

\[
e^{G} = e^{a'} e^{b_B} = (0.0967)(1.7665) = 0.1708 \tag{13.13}
\]

For non-Black males we have

\[
e^{G} = e^{a'} e^{b_M} = (0.0967)(4.2817) = 0.4140 \tag{13.14}
\]

For Black males we have

\[
e^{G} = e^{a'} e^{b_B} e^{b_M} e^{b_{BM}} = (0.0967)(1.7665)(4.2817)(0.8085) = 0.5913 \tag{13.15}
\]

One other coefficient is sometimes useful: the percentage change in the odds, given by $100(e^{b} - 1)$. For example, from Model 2 in Table 13.3 we would conclude that, all else equal, the odds of Blacks having ever been threatened or shot at are 56 percent greater than the corresponding odds for non-Blacks because 100(1.56 - 1) = 56.

However, even though odds ratios are readily interpreted, expected odds are still not particularly intuitive. Thus it would be useful to convert expected odds into percentages. For example, in the present case it would be helpful to get the expected percentage of individuals in each race-by-sex group who have ever been threatened, net of education and survey year. That is, we would like to get the adjusted percentages implied by the model so that we can assess percentage differences between race and sex groups, controlling for education and year of survey. We can do this by making use of the relationship
\[
\mathrm{Pct}(Y) = 100\left[\frac{x}{x + 1}\right] \tag{13.16}
\]
where x is the odds of Y for specified values of the independent variables. Note that because the relationship between the odds and the percentages is nonlinear, we need to choose specific values of the independent variables for which we wish to make the conversion. Here I use the same values for which we evaluated Model 4; that is, I get expected percentages by race and sex among people with twenty years of education in 1994. For example, for non-Black women, we have Pct(Y) = 100*[0.0967/(0.0967 + 1)] = 8.8. The corresponding percentages are, respectively, for Black women, 14.6; for non-Black men, 29.3; and for Black men, 37.2. If we wished, we could, of course, construct an entire table of such expected percentages for various values of education and year of survey. Doing this requires a fair amount of hand calculation. However, in conjunction with their text on Stata procedures for handling limited dependent variables, Long and Freese (2006) developed a set of Stata -ado- files that automate the computation of these and other statistics for interpreting logistic regression coefficients. (Those wishing to explore these files should start with Long's Web page: http://www.indiana.edu/~jslsoc. Follow the links for Long and Freese's book.)
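The hand calculation is short enough to script. The following Python sketch starts from the expected log odds reported in the text for people with twenty years of schooling in 1994, converts them to odds and then to percentages, and also computes the percentage change in the odds for the Model 2 BLACK coefficient (0.4461). It is plain arithmetic, not a substitute for the Long and Freese routines, and the function names are illustrative.

```python
import math

def odds_to_percentage(odds):
    """Expected percentage implied by expected odds: 100 * x / (x + 1)."""
    return 100.0 * odds / (odds + 1.0)

def percent_change_in_odds(b):
    """Percentage change in the odds for a one-unit difference: 100(e^b - 1)."""
    return 100.0 * (math.exp(b) - 1.0)

# Expected log odds (G) from the text, for 20 years of schooling in 1994.
logits = {
    "non-Black females": -2.3363,
    "Black females": -1.7673,
    "non-Black males": -0.8820,
    "Black males": -0.5255,
}

odds = {name: math.exp(g) for name, g in logits.items()}
percentages = {name: odds_to_percentage(x) for name, x in odds.items()}

for name in logits:
    print(f"{name}: odds {odds[name]:.4f}, {percentages[name]:.1f} percent")

# Model 2 coefficient for BLACK quoted in the text.
black_pct_change = percent_change_in_odds(0.4461)
print(f"{black_pct_change:.0f} percent greater odds for Blacks")
```

Looping the same two functions over a grid of education and year values would produce the full table of expected percentages mentioned above.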
A SECOND WORKED EXAMPLE: SCHOOLING PROGRESSION RATIOS IN JAPAN
In the educational stratification literature an important hypothesis is that the dependence of educational attainment on the social status of one's parents decreases as education increases. This hypothesis has been operationally specified in terms of progression ratios from one level of schooling to the next (Mare 1980, 1981). That is, we can ask what affects the odds that those at any given level of education go on to the next level: that primary school graduates enter secondary school, that those who enter secondary school graduate, that secondary school graduates go on to college or university, and so on. Once we specify the problem in this way, it is evident that it is a logistic regression problem, but one of a particular kind. The distinctive feature of this sort of problem is that any individual may make several transitions. It also should be evident that the formal structure of the problem is identical to that of many nonreversible transitions; for example, in criminology, from arrest to arraignment to trial to conviction to sentencing; in medical research, the transition through various stages of a disease; and so on. We tackle problems with this sort of formal structure by pooling data for all transitions into a single data set and then analyzing not a sample of people but rather a sample of transitions.

To see how this is done, consider an analysis of trends in educational attainment in Japan, carried out by Treiman and Yamaguchi (1993). Here, to illustrate the method, I present only the portion of our analysis concerned with the transition from middle school to higher secondary school and from higher secondary school to college or university. The data set included 1,320 men who completed their education during the postwar period. Because education up to the middle school level is compulsory in Japan, all 1,320 middle school graduates were "at risk" of obtaining at least some higher secondary education. Of these, 1,056 did so and hence were "at risk" of continuing on to college or university. Pooling those at risk of making the first transition and those at risk of making the second, we have 2,376 (= 1,320 + 1,056) transition possibilities to study. For each of these cases, we create a dummy variable, SUCCESS (S), scored 1 if the transition was made and 0 otherwise. We distinguish the two transitions by a dummy variable, TRANSITION (T), scored 1 for the transition from higher secondary to tertiary education, and 0 otherwise. We then estimate a series of logistic regression equations in which the dependent variable is the natural log of the odds of successfully making a transition and the independent variables are the transition variable, measures of the status of parents, year of birth (to study trends), and various interactions among the variables. Table 13.4
TABLE 13.4. Goodness-of-Fit Statistics for Various Models of the Process of Educational Transition in Japan (Preferred Model Shown in Boldface).
[Table body not recoverable from the source. Columns: Model; L2; d.f.; BIC. Model 4 adds the interaction (Social origins)*(Year) to Model 3.]
Source: Adapted from Treiman and Yamaguchi (1993, Table 10.4).
Note: All models and contrasts are significant at the .000 level except the (4) versus (3) contrast, which is significant at the .223 level.
TABLE 13.5. Effect Parameters for Model 3 of Table 13.4.
[Table body only partly recoverable from the source. Rows include E: Parents' education; T: Transition; T*E; T*P; T*Y. Legible coefficients: T*P, -0.0180; T*Y, -0.0439.]
Source: Adapted from Treiman and Yamaguchi (1993, Table 10.5); Treiman and Yamaguchi did not report standard errors.
shows goodness-of-fit statistics for various models of the educational transition process, and Table 13.5 shows effect parameters for the preferred model. (This analysis was carried out before design-based estimation was generally available. Thus no account was taken of the clustering of the sample. In addition to the usual clustering by sampling unit typical of national surveys, transition-ratio models are clustered by person because the transitions made by any one person are hardly independent. Thus in addition to any other adjustments for clustered samples, the observations for each individual should be treated as nonindependent.)

From Table 13.4 we see that Model 3 fits best according to both the likelihood ratio test and BIC. The model posits that the effect of social origins varies across transitions and also that the odds of making the two transitions change over time (but the effect of social origins does not change over time). From the point of view of our a priori hypothesis, that the effect of social origins declines with successive transitions, the contrast between Models 2 and 1 is particularly noteworthy. Model 1 posits that the odds of moving to the next higher level of education are affected by the social status of one's parents (specifically, parents' education and father's occupational status, measured by prestige) but that the relationship is the same regardless of which transition is considered. Model 2, by contrast, posits both that the odds of making the transition depend on which transition is considered and
that the relationship between social origins and the odds of making a transition depends on which transition is considered. Model 2 represents our a priori hypothesis.

As we see, Model 2 is far more likely than Model 1 given the data, but Model 3, which also posits a temporal shift in the odds of making each transition, is still more likely. Thus we have preliminary support for our hypothesis, but we also have evidence that the transition process had changed over time. (This point is further explored in the paper but need not concern us here.)

In retrospect, the contrasts presented by Treiman and Yamaguchi are not wholly satisfactory. It would have been better to include a model intermediate between Model 1 and Model 2; that is, a model that posits a difference in the odds of making successive transitions but with the effect of social origins constrained to equality across transitions. The difficulty is that we do not know whether Model 2 is superior to Model 1 because the odds of making the transition vary across transitions, or because the effect of social origins varies across transitions, or both. The same point can be made with respect to the effect of birth year: a model intermediate between Model 2 and Model 3 would have been desirable.

Actually, all that the coefficients in Table 13.4 tell us is that a model that posits different effects of social origins for different transitions is more likely than a model that posits the same effect. To pin down our claim, we need to inspect the effect parameters, reported in Table 13.5, to be sure that they have the predicted sign.
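The pooling step that underlies this example, in which the unit of analysis becomes the transition rather than the person, can be sketched as follows. This Python fragment uses three invented people and a hypothetical coding of the highest level reached; only the variable names S (SUCCESS) and T (TRANSITION) come from the text.

```python
# Hypothetical records: highest level reached by each person, where
# 0 = middle school only, 1 = entered higher secondary, 2 = entered tertiary.
people = [
    {"id": 1, "highest": 0},
    {"id": 2, "highest": 1},
    {"id": 3, "highest": 2},
]

def to_transitions(person):
    """Return one record per transition the person was at risk of making."""
    rows = []
    # Every middle school graduate is at risk of the first transition (T = 0).
    rows.append({"id": person["id"], "T": 0, "S": int(person["highest"] >= 1)})
    # Only those who made the first transition are at risk of the second (T = 1).
    if person["highest"] >= 1:
        rows.append({"id": person["id"], "T": 1, "S": int(person["highest"] >= 2)})
    return rows

# Pool all transitions into a single data set.
transitions = [row for p in people for row in to_transitions(p)]

for row in transitions:
    print(row)
```

The person identifier is carried along on every record because, as noted above, transitions made by the same person are not independent and the estimation should treat them as clustered.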
Table 13.5 shows the effect parameters associated with the preferred model. Note that I do not report the standard errors or p-values for individual coefficients. Because all of the "main effects" in the model also appear in "interaction terms," the appropriate way to assess the effect of a single dimension is to contrast models with and without the variables representing that dimension. I have done this in Table 13.4, but only for selected contrasts rather than every possible pair of models. (Raftery discusses S-Plus software that makes it possible to choose the most-likely-model-given-the-data from among all possible models involving a given set of variables. Interested readers should consult Raftery [1995a].)

Note that the treatment of standard errors in Table 13.5 contrasts with Table 13.3, where I do show the standard errors and p-values. The difference is that Table 13.3 shows only one interaction, so the p-value associated with the interaction term indicates the significance of the difference in the fit of models including and not including the interaction term. Where a model includes both variables for which individual significance tests are meaningful and variables for which they are not because they are confounded by interactions (or other transformations such as squared terms), the usual practice is to report all significance tests and p-values. It might be preferable, however, only to report significance statistics when they are meaningful, in order to preclude incorrect interpretation.

This model suggests that the process of moving from one level of education to the next in Japan is about as we would expect it to be: the odds of making each transition vary positively with parents' education and with the status of the father's occupation. Of greater interest are the coefficients of the interaction terms T*E and T*P.
These are both negative, which indicates that, as hypothesized, in these data the effect of social origins on going on to the next level of education is weaker for the transition from higher secondary school to university than for the transition from middle school to higher secondary school. Each year
ot averageparental education increases the odds of making the first transition. f middle schoolto highersecondaryschool,by about40 percent(becauseeo34s0 : 1.+ but increasesthe oddsof the secondtransition,from secondaryschoolto universinonly about35 percent(becausee(0.3480-o.o5o3) - 1.341).Thus,for example,all elseeg the odds that a.son of a universitygraduatewill go on to higher secondaryschooi more than 1I times as great as the correspondingodds for the son of a middle sc.l graduate(because1.416(t6-e) : 1I.414).By contrast,amongthosewho managed to into higher secondaryschool, the odds that the son of a university graduate will go to university are only eight times as great as the corresponding'oddsfor the son middle schoolgraduate(because[(1.416X0.951)]o6-nr : S.O:O;.iimltarty, rhe net e of eachunit incrementin the prestigeof the father'soccupationis to increasethe oc the first transitionby aboutsix percent(becauseeooi6e = 1.059)but to increasethe
of the second transition by only 4 percent (becausestoo56eoo1s0) : 1.040). Thus, for ple, the net odds of the son of a shopkeeper (prestige score = 42) making the tral
from middle schoolto higher secondaryschoolare more than twice as jreat as the odds of the son of a factory worker (prestige score = 29) making the transition (bec: 2e): 2.107).But the 1.059(42 net oddsof a shopkeeper,s .on the transitioD secondaryto tertiary educationare only about 66 percent greater -iking than those for a f, worker's son (because1.040(42-2e) : 1.665).The effectsof year of birth and of the i action betweentransition and year of birth can be interpreted in a similar way. As a reminder, the interpretation of contributions to log odds in models inr . interactiontermsis exactlythe sameas in ordinaryleast-sqiaresregresslon(see ter Six): the appropriatecoefficientsare added.However,ai *".u* in the first example,exponentiatedcoefficients(contributionsto odds ratios) are not added rather multiplied. Thus, for example.the coefficient for parentaleducationis 0.1 for the first transition and0.Z9jj (: 0.34g0 - 0.0503)for the secondtransition. correspondingexponentiatedcoefficientsarc l-4162 for the first transitronand l.: (= 1.4162*0.9509) for the second.Ofcourse,1.3468: e0.2e11.
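The rule that log-odds coefficients add while odds ratios multiply can be checked numerically. The following is a minimal Python sketch, not the book's Stata code; it uses only the parental-education coefficients quoted above, and the variable and function names are invented for illustration.

```python
import math

# Logit coefficients quoted in the text for parental education.
b_education = 0.3480      # effect on the log odds of the first transition
b_interaction = -0.0503   # transition-by-education interaction

# On the log-odds scale the coefficients are added ...
log_odds_second = b_education + b_interaction

# ... whereas on the odds scale the exponentiated coefficients are multiplied.
or_first = math.exp(b_education)
or_second = math.exp(b_education) * math.exp(b_interaction)

# Both routes give the same odds ratio for the second transition.
assert math.isclose(or_second, math.exp(log_odds_second))


def ratio_over_span(per_unit_odds_ratio, units):
    """Odds ratio implied by a difference of `units` on the predictor."""
    return per_unit_odds_ratio ** units


# University graduate (16 years) versus middle school graduate (9 years).
first_transition_gap = ratio_over_span(1.416, 16 - 9)
second_transition_gap = ratio_over_span(1.416 * 0.951, 16 - 9)

print(round(or_first, 4), round(or_second, 4))
print(round(first_transition_gap, 3), round(second_transition_gap, 3))
```

Raising the per-year odds ratio to the power of the difference in years is the same operation used for the shopkeeper and factory-worker contrast.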
A THIRD WORKED EXAMPLE (DISCRETE-TIME HAZARD-RATE MODELS): AGE AT FIRST MARRIAGE

One of the most powerful uses of binary logistic regression procedures is to estimate discrete-time hazard-rate models, sometimes called event history models. Hazard-rate models are those for the rate at which events occur or the likelihood that an event will occur at a specified time. There is a well-developed statistical technology for estimating such models, most of which is beyond the scope of this book. However, for a particular class of these models, in which time is treated as a set of discrete values and the interest is in estimating the likelihood that an event occurs in each period of time, conventional binomial logistic regression procedures can be used once the data are appropriately arranged. Indeed, as we will see, discrete-time hazard-rate models are formally equivalent to the educational transition model we just discussed.

The basic procedure is to create a person-period data set by stacking replicates of the original data set for each period for which each individual is "at risk" of the event occurring.
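The stacking step can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's Stata setup: the function name, the field names, the starting age of 15, and the two example people are all hypothetical.

```python
def person_period_rows(person_id, current_age, age_married=None, first_age=15):
    """Expand one person into one row per year at risk of first marriage.

    Ever-married people contribute rows from `first_age` through the age at
    marriage (scored 1 in that final year, 0 before). Never-married people
    contribute rows from `first_age` through their current age, all scored 0.
    """
    last_age = age_married if age_married is not None else current_age
    rows = []
    for age in range(first_age, last_age + 1):
        rows.append({
            "id": person_id,
            "age": age,
            "married": 1 if age == age_married else 0,
        })
    return rows


# Hypothetical respondents: one married at 22, one still single at 18.
married_at_22 = person_period_rows("A", current_age=30, age_married=22)
single_at_18 = person_period_rows("B", current_age=18)

# Stacking the replicates gives the person-period data set.
person_period = married_at_22 + single_at_18
for row in person_period:
    print(row)
```

Years after the marriage, and ages a person has not yet reached, never appear in the output, which is what removes people from the risk set. The stacked rows can then be passed to any binomial logistic regression routine with `married` as the dependent variable.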
For example, suppose we are interested in estimating the likelihood that individuals marry at a specified age, say, at each year of age from 15 to 36. We can do this by creating a new data set consisting of one observation for each person for each year of age at which the person has not yet married, plus the age at which the person married if he or she did, up to and including the individual's current age. The dependent variable is a dichotomy, scored 1 if the person married at that age and scored 0 otherwise. For ever-married individuals, the dependent variable takes on the value 0 for each age, from age 15 until the year before they married, and is scored 1 for the age at which they marry. Observations representing subsequent ages are dropped from the data set, because once they marry, people are no longer "at risk" of (first) marriage. For never-married individuals, the dependent variable is scored 0 for all years, from age 15 up to their current age. Ages greater than their current age are dropped from the data set because they obviously are not at risk of marriage at ages they have not yet reached. We then analyze this data set in the usual way, estimating a binomial logistic regression equation.

At this point you may be wondering why we go to all this fuss when it would be easy to treat age at first marriage as a ratio variable and simply carry out an OLS regression with age at first marriage as the dependent variable. This might be a reasonable procedure if we had a sample of persons old enough to no longer be at risk of marriage. However, this typically is not the case, because we usually analyze representative samples of a population and thus include adults of all ages, some of whom have not yet married but will do so in the future. These cases are censored because we have stopped observing them while they are still at risk for the event. Under these circumstances OLS regression gives misleading results whereas discrete-time hazard-rate models give correct estimates of the likelihood of marrying at each age for those who are still at risk because they have reached that age without ever having married.

To illustrate the practical procedures for carrying out such an analysis, I use the 1994 GSS to estimate the likelihood of marrying for the first time as a function of age, mother's education, sex, and race (Blacks versus non-Blacks). Given marital norms in the late twentieth-century United States, we would expect the likelihood of marrying to increase with age up to the mid-twenties but then to decline, and we also would expect males to marry later than females. We would expect those from well-educated families (measured by mother's education) to marry later, in part because they [...] until they complete their education (although for [...] people marriage affects the likelihood of continuing in school). [...] Blacks to be less likely to marry than non-Blacks, both because of the socioeconomic position of Blacks and because of racial differences in norms regarding childbearing: Blacks are less likely to be coerced into marriage by their families in the case of unanticipated pregnancies.

Downloadable files "ch13_2.do" and "ch13_2.log" show the specific commands used to carry out the analysis, together with comments. Because I have extensively documented the Stata commands [...]. The one [...] feature of the Stata setup is the use of the -reshape- command to create the [...] data set, shown with results in the Stata log file. This command converts data from wide to long form; that is, in this case from a file of people to a file of person-years, where there is one observation for each person for each year at which he or she remains unmarried, plus the year of marriage. This is a very efficient way to create a data set for discrete-time hazard-rate analysis, which requires only a few lines of code. Study this command in my setup plus the relevant section of the Stata manual to be sure you understand the logic.

I began by defining the risk set as including ages 15 through 56, because (after excluding a small number of cases with missing values on the independent variables) no one in the sample married for the first time before age 15 or after age 56. I then estimated an equation of the form
$$
\ln\left(\frac{W}{1-W}\right) = a + \sum_{i=16}^{56} b_i A_i \tag{13.17}
$$
where W is the probability of first marriage, conditional on a respondent's age, and the $A_i$ are dummy variables for age at risk, with 15 the omitted category. This regression produced the expected probabilities shown in the Stata log (converted from the log odds estimated in Equation 13.17) and graphed in Figure 13.1. Inspecting the figure, we see that the right tail is not very orderly. In particular, the probability of first marriage appears to increase among those in their 40s and 50s. Inspection of the downloadable Stata -log- file makes it clear why this is so: by the time people are in their late 30s, almost everyone has married. Thus one or two marriages constitute a nonnegligible proportion of all those at risk. The graph is somewhat misleading in another way as well: all ages for which prediction is perfect because no one at risk married at that age (37, 42, 44 through 48, 50 through 55) are dropped from the equation and hence from the graph. The lesson here is that very sparse data may give misleading results. Before continuing with the analysis, I dropped all ages greater than 36.

FIGURE 13.1. Expected Probability of Marrying for the First Time by Age at Risk, U.S. Adults, 1994 (N = 1,556). (Horizontal axis: age at risk.)

I next reestimated the model, predicting the probability of marriage at discrete ages (using an equation similar to Equation 13.17) and then substituted a fourth-degree polynomial for discrete years to fit a smooth curve to age at risk. (I decided that a fourth-degree polynomial, estimated by
$$
\ln\left(\frac{W}{1-W}\right) = a + bA + cA^{2} + dA^{3} + eA^{4} \tag{13.18}
$$
powersofrisk age.)The two curves rrs requiredby testingthe significanceof successive thattheyarequitesimilar' .|e :hown asFigures13.2and 13.3.Visualcomparisonsuggests at specificages. ine'ugh a formal test of significancerevealssignificantdiscrepancies S-n I determinedthis. I had to considerwhetherto continuewith a discreteor smooth which more faithfully qresentation of ageat risk. I optedfor a discreterepresentation, which is far more of age at risk, representation Ere\ents the data,althougha smooth gr-.rmonious,alsowould havebeenreasonable. to estimatetwo additionalmodels:includingfirst tlte othervariables I thenproceeded I nTothesizedto affect age at marriage(sex,race, and mother's education)and then cr.lctions betweenthesethreevariablesand ageat risk. Wald tests,for all interactions rbr interactionswith eachof the main effects(shownin the Statalog), madeit clear -: ta rhe model including interactions is the preferred model; all tests are significant at htnd the .000 level.Thus, the likelihood of marrying at eachagevariesby sex,race, mother'seducation.Table 13 6 showscontributionsto odds ratios, which are the of the coefficientsestimatedfrom an equationof the form ,ri.-,es -
$$
\ln\left(\frac{W}{1-W}\right) = a + bE + cM + dB + \sum_{i \neq 25} e_i A_i + \sum_{i \neq 25} f_i A_i E + \sum_{i \neq 25} g_i A_i M + \sum_{i \neq 25} h_i A_i B \tag{13.19}
$$
where W is the probability of marrying given that one is at risk; E is the number of years of school completed by the respondent's mother, expressed as a deviation from the sample mean; M is scored 1 for males and 0 for females; B is scored 1 for Blacks and 0 for non-Blacks; and the $A_i$ are dummy variables for age at risk, with age 25 the reference category.

In Table 13.6 the first column, labeled "Main Effect," shows the expected odds of marrying for non-Black females whose mothers' years of schooling are at the sample mean, expressed in ratio to the effect for those age 25. The remaining three columns show interactions of age at risk with mother's education, sex, and race, except that the coefficients for these variables at age 25 are the main effects. These odds ratios can be used to make any comparison of interest. For example, among women who have never married by age 21, the odds of Blacks marrying at that age are about three-fifths the odds for non-Blacks (precisely, 0.591 = 0.190*3.108). Among 30-year-old never-married people, the odds of marrying in that year among those whose mothers are college graduates are only 10 percent higher than the odds for those of the same race and sex whose mothers are high school graduates (precisely, $1.094 = (0.918*1.114)^{4}$).

FIGURE 13.2. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Discrete-Time Model, U.S. Adults, 1994. (Horizontal axis: age at risk.)

FIGURE 13.3. Expected Probability of Marrying for the First Time by Age at Risk (Range: Fifteen to Thirty-Six), Polynomial Model, U.S. Adults, 1994. (Horizontal axis: age at risk.)

TABLE 13.6. [...] the Likelihood of Marriage from Age at Risk, Sex, Race, and Mother's Education, with Interactions Between Age at Risk and the Other Variables. (Columns: main effect; interactions with mother's education, sex [male], and race [Black]; rows: age at risk, with age 25 the reference category.)

SMOOTHING DISTRIBUTIONS

Smoothing refers to a class of techniques for making the general shape of a distribution clear by removing "noise," that is, deviations from the underlying trend that result from sampling error or idiosyncratic factors. Perhaps the simplest smoother is a moving average. A moving average is the average value of several consecutive data points. Consider the worked example in this section. A three-year moving average of the expected probability of marriage at each age would be constructed by first taking the average of the expected probabilities for ages fifteen, sixteen, and seventeen; then the average of the expected probabilities for ages sixteen, seventeen, and eighteen; and so on. At the time the age-at-first-marriage example was created, the Stata subcommand -ma- ("moving average") was available within the -egen- command. However, this subcommand is no longer documented in Stata 10 (although it still works), and has been replaced by -smooth-, which generates medians of the included points rather than means. Another smoother available in Stata is -lowess-.

Despite the usefulness of Table 13.6 for making specific contrasts, the overall pattern implied by the coefficients is difficult to discern. Again, graphs help. Figures 13.4 and 13.5 show three-year moving averages of the expected probability of first marriage by age at risk, separately for Blacks and non-Blacks. In each graph, separate lines are shown for males and females whose mothers had twelve and sixteen years of schooling (as a convenient way of visually representing the effect of mother's education). Moving averages are shown because there is a great deal of "float and bounce" for individual years, which is evident from inspection of the coefficients in Table 13.6. (See the downloadable file "ch13_2.do" for details on how the moving averages were constructed.)

Inspecting Figures 13.4 and 13.5, we see that marriage rates for Blacks differ substantially from those for non-Blacks, with Blacks much less likely than non-Blacks to marry at all. Moreover, non-Black females (especially those whose mothers have only a high school education) marry at disproportionately high rates at ages nineteen through twenty-five; non-Black males marry a bit later and with less concentration in a short period. Black marriage rates, by contrast, are spread out over a much longer period, but with an upsurge in marriage rates for males in their thirties, especially those whose mothers are high school educated. For both Blacks and non-Blacks, males tend to marry later than females, with male rates exceeding those of females beginning around age thirty. Finally, among all race-by-sex groups, those whose mothers are high school graduates are more likely to marry than are those whose mothers are college graduates.
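A three-year moving average of the kind plotted in these figures can be sketched as follows. This Python fragment is an illustration only, not the construction used in "ch13_2.do"; the function name is invented and the probabilities are made-up values. Which age each average is assigned to (for example, the middle year of the window) is left open here because the text does not specify it.

```python
def moving_average(values, window=3):
    """Return the mean of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]


# Hypothetical expected probabilities of first marriage at ages 15 through 20.
expected_probability = [0.01, 0.02, 0.06, 0.05, 0.09, 0.08]

# First entry averages ages 15-17, second averages ages 16-18, and so on.
smoothed = moving_average(expected_probability, window=3)
print([round(p, 4) for p in smoothed])
```

A series of n values yields n - 2 three-year averages, so the smoothed curve is slightly shorter than the raw one at each end.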
FIGURE 13.4. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Non-Black U.S. Adults, 1994. (Lines: females [12], males [12], females [16], males [16]; horizontal axis: age at risk.)

FIGURE 13.5. Expected Probability of Marrying for the First Time by Age at Risk, Sex, and Mother's Education (Twelve and Sixteen Years of Schooling), Black U.S. Adults, 1994. (Lines: females [12], males [12], females [16], males [16]; horizontal axis: age at risk.)

If I were preparing these results for publication, I would present only a subset of the rather large set of tables and graphs we have just marched through. The intent here, of course, is to provide alternatives for you to consider when presenting your own analyses. Examples of the application of discrete-time hazard-rate models include Astone and others (2000), Dawson (2000), Lewis and Oppenheimer (2000), and Sweeney (2002).
A FOURTH WORKED EXAMPLE (CASE-CONTROL MODELS): WHO WAS APPOINTED TO A NOMENKLATURA POSITION IN RUSSIA?

When a dependent variable is a rare event, it is inefficient to draw a representative sample of the population at risk for the event, because the sample size would have to be extremely large to obtain enough "positive" cases to analyze. This is a frequent occurrence in epidemiological research, where the events of interest are diseases, but it also occurs in the social sciences. For example, if we are interested in studying what determines who gets elected to Congress, we could hardly do this by drawing a representative sample of the population and looking for the congressmen in it. We have similar problems in studying crime victimization, homosexuality, and various other relatively uncommon phenomena. One solution to this problem is to sample on the dependent variable (that is, to draw a sample of congressmen, criminals, or homosexuals), collect information on that sample, collect corresponding information on a representative sample of the population that has not experienced the rare event (becoming congressmen, criminals, or homosexuals), combine the two samples, and model the odds of experiencing the rare event. This is known as case-control sampling in the epidemiological literature (for an excellent review of the statistical procedures involved, see Breslow [1996]).

Case-control sampling exploits the fact that odds ratios are invariant under shifts in the distribution of the data. This extremely important feature of odds ratios makes it possible to combine samples with very different distributions on the independent and dependent variables in order to model rare events. This capability is not possible with OLS regression because OLS coefficients are affected by the distributions of the variables in the model.

To see how case-control procedures work in practice, let us consider what factors affect the odds of becoming a member of the Russian political elite at the end of the Communist era. From Social Stratification in Eastern Europe after 1989 (Treiman and Szelényi 1993), we have two representative samples from Russia: a probability sample of the general population (N = 5,002) and a random sample of persons who were in nomenklatura positions as of January 1988 (N = 850). (See Appendix A for a description of the data and information on how to obtain them.) Nomenklatura positions were those that required the approval of the Central Committee of the Communist Party. They ranged from very high government officials (for example, members of the Politburo) down to heads of sensitive organizations, for example, rectors of universities, editors in chief of newspapers, and heads of large industrial enterprises.

The general population sample departs in two ways from compliance with the assumptions underlying case-control sampling, but neither deviation is important from a practical standpoint. First, it is a probability sample of the 1993 population rather than the 1988 population. However, the sampling frame is based on the 1989 census, and the sample therefore probably represents the 1988 population nearly as well as it does [...]
Before turning to interpretation of the results, we should note the one difference between case-control analysis and ordinary binomial logistic regression: in case-control analysis the intercept is not meaningful. This should be obvious from the fact that the intercept in logistic regression indicates the proportion of the sample that is "positive" with respect to the dependent variable. However, in case-control designs this proportion is fixed by the sample design, and thus the coefficient adds no information.

Inspecting the coefficients in Table 13.7, we see very large effects and few surprises. Each year of schooling increases the odds of becoming a member of the nomenklatura by more than 70 percent. Thus, all else equal, university graduates (who typically have 15 years of schooling in Russia) are more than 15 times as likely as high school graduates (with 10 years of schooling) to be appointed to nomenklatura positions (precisely, $15.32 = 1.726^{(15-10)}$). The effect of gender is astronomical: males are more than 17 times as likely as females to be appointed to nomenklatura posts. The effect of age is also extremely strong: all else equal, the odds of being appointed to a nomenklatura position increase about 14 percent per year. Thus, for example, a 50-year-old is more than 7 times as likely to secure a nomenklatura position as is a 35-year-old (precisely, $7.23 = 1.141^{(50-35)}$).

Perhaps more interesting, the effect of social origins, even among those equally well educated, is far from trivial. Coming from a family in which one's father was a member of the Communist Party improves one's chances of a nomenklatura appointment by about half, all else equal. Also, each year of father's schooling increases the odds of nomenklatura appointment by about 11 percent (this in the worker's paradise!), so that the offspring of the university-educated intelligentsia (15 years of school) are about three times as likely as the offspring of those with only a primary education to secure nomenklatura appointments, irrespective of their own educational achievement (precisely, $2.94 = 1.114^{(15-5)}$). Alone among the variables we have considered, father's occupational status has no impact on the odds of appointment to a nomenklatura post.
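Why sampling on the dependent variable leaves odds ratios intact but makes the intercept uninformative can be seen in a small numerical sketch. The Python below is an illustration under invented assumptions: the population counts, the binary "exposed" predictor, and the 1 percent sampling fraction are all hypothetical, and expected counts stand in for an actual random draw.

```python
import math


def odds_ratio(cases_exposed, controls_exposed, cases_unexposed, controls_unexposed):
    """Odds of the event among the exposed relative to the unexposed."""
    return (cases_exposed / controls_exposed) / (cases_unexposed / controls_unexposed)


# Hypothetical population in which the event is rare (all counts invented).
population = {
    "cases_exposed": 20,
    "controls_exposed": 9_980,
    "cases_unexposed": 10,
    "controls_unexposed": 19_990,
}

# Case-control design: keep every case but only 1 percent of the non-cases.
control_fraction = 0.01
case_control = dict(population)
case_control["controls_exposed"] *= control_fraction
case_control["controls_unexposed"] *= control_fraction

or_population = odds_ratio(**population)
or_case_control = odds_ratio(**case_control)

# The baseline odds (what the intercept reflects) depend on the design.
baseline_population = population["cases_unexposed"] / population["controls_unexposed"]
baseline_case_control = case_control["cases_unexposed"] / case_control["controls_unexposed"]

print(round(or_population, 3), round(or_case_control, 3))
print(round(baseline_case_control / baseline_population, 1))
```

The sampling fraction cancels out of the odds ratio but multiplies the baseline odds, which is the sense in which the intercept is fixed by the design rather than by the data.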
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate and interpret binary logistic regression models, which are widely used to model dichotomous outcomes such as whether people vote, are employed, or are members of a particular organization. We have seen that although the estimation procedures are quite different, the interpretation of the coefficients of such models is similar to that of OLS regression, except that the coefficients represent net effects of each independent variable on the log odds of an outcome.

Because log odds are not intuitive quantities, we have considered two nonlinear transformations to more readily interpretable coefficients (odds and expected probabilities) and have also seen how to graph net relationships, a form of regression standardization for logistic regression. Finally, we considered three extensions of the basic logistic regression model: education progression ratios, discrete-time hazard-rate models, and case-control models. A notable feature of logistic regression models is that they are invariant with respect to the distributions of variables in the sample, which is what makes case-control procedures legitimate in the logistic regression context but not in the OLS regression context.
APPENDIX 13.A SOME ALGEBRA FOR LOGS AND EXPONENTS

For those who have forgotten their school algebra, here are some useful relationships involving natural logarithms and antilogs (exponents):

$$
\begin{aligned}
e^{\ln(X)} &= X \\
\ln(X * Y) &= \ln(X) + \ln(Y) \\
\ln(X / Y) &= \ln(X) - \ln(Y) \\
X * Y &= e^{\ln(X)} e^{\ln(Y)} = e^{(\ln(X) + \ln(Y))} \\
e^{(X+Y)} &= e^{X} * e^{Y} \\
e^{(X-Y)} &= e^{X} / e^{Y} \\
\ln(X^{P}) &= P * \ln(X) \\
X^{P} &= [e^{\ln(X)}]^{P} = e^{P * \ln(X)}
\end{aligned}
$$

Note that $Y = \ln(X)$ and $X = e^{Y}$ are equivalent.
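These relationships are easy to confirm numerically. The short Python check below is only a sketch; the values chosen for X, Y, and P are arbitrary positive numbers picked for illustration.

```python
import math

# Arbitrary positive values chosen for illustration.
x, y, p = 2.5, 4.0, 3.0

# Each entry pairs the left-hand side of an identity with its right-hand side.
identities = {
    "exp(ln X) = X": (math.exp(math.log(x)), x),
    "ln(X*Y) = ln X + ln Y": (math.log(x * y), math.log(x) + math.log(y)),
    "ln(X/Y) = ln X - ln Y": (math.log(x / y), math.log(x) - math.log(y)),
    "X*Y = exp(ln X + ln Y)": (x * y, math.exp(math.log(x) + math.log(y))),
    "exp(X+Y) = exp(X)*exp(Y)": (math.exp(x + y), math.exp(x) * math.exp(y)),
    "exp(X-Y) = exp(X)/exp(Y)": (math.exp(x - y), math.exp(x) / math.exp(y)),
    "ln(X^P) = P*ln X": (math.log(x ** p), p * math.log(x)),
    "X^P = exp(P*ln X)": (x ** p, math.exp(p * math.log(x))),
}

all_hold = all(math.isclose(left, right) for left, right in identities.values())
for name, (left, right) in identities.items():
    print(f"{name}: {left:.6f} vs {right:.6f}")
print("all identities hold:", all_hold)
```

The comparison uses a tolerance rather than exact equality because floating-point arithmetic introduces rounding in the last few digits.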
APPENDIX 13.B INTRODUCTION TO PROBIT ANALYSIS

As noted at the beginning of this chapter, an alternative to logistic regression as a model for predicting binary responses is the probit model, which is defined as

$$
\Pr(Y = 1 \mid x) = \Phi(\beta' x) = \Phi\left(\beta_0 + \sum_{i=1}^{k} \beta_i x_i\right) \tag{13.B.1}
$$
where $\Phi$ is the standard cumulative normal distribution and there are k predictor variables. From this definition it is evident that the $\beta$s are z-scores, and the associated probabilities are determined by finding the area under the normal curve corresponding to a particular z-score. This can be done by invoking Stata's -normal- function.

Consider the example used in the chapter to illustrate the interpretation of logistic regression models: the determinants of the likelihood of being threatened by a gun or being shot at. Table 13.B.1 shows the probit coefficients corresponding to the logistic coefficients shown for Models 2 and 4 in Table 13.3. Note that the probit and logit models yield similar conclusions except that in Model 4 the interaction term is marginally significant when estimated using a logit model, and not even that when estimated by using a probit model.

TABLE 13.B.1. Effect Parameters for a Probit Analysis of Gun Threat (Corresponding to Models 2 and 4 of Table 13.3). (Columns: independent variable; b; standard error; p; marginal effect.)

Because probits are z-scores, they indicate the expected change in standard deviation units in the latent dependent variable (what StataCorp [2007, Reference I-P, 620] calls "the probit index"), resulting from a one-unit change in the associated predictor variable. However, it is a property of probits, in common with logits, that the variance of the latent variable changes as additional variables are introduced into a model. This means that it is not appropriate to compare corresponding coefficients across models to assess the effect of mediating variables, as we do with metric OLS coefficients. Rather, we must standardize the latent dependent variable by dividing by the standard deviation of its distribution. This produces Y*-standardized coefficients, which can then be directly compared across models with differing numbers of predictors [...].

Even though the Y*-standardized probits have intrinsic metrics, they typically are difficult to interpret. Thus probits often are interpreted by evaluating the expected probability of a positive outcome for a given configuration of values of the predictor variables, or by interpreting the marginal effect of a change in each predictor variable on the probability of a positive outcome.
Consider first the expected probabilities. [...] the worked example [...] implied by the logit coefficients in Model 4 [...] for a probit model [...] the probabilities using the same values of [...], for people with twelve years of schooling [...] 1994. To evaluate the probit equation for these values, we first collect the terms that are held constant into a single quantity, $a' = -1.3569$ [...]. Then we write out the expected probits by race and sex (where $b_{BM}$ is the coefficient for the interaction term) and transform them using the -normal- function:

             Non-Blacks          Blacks
Females      Φ(a')               Φ(a' + b_B)
Males        Φ(a' + b_M)         Φ(a' + b_B + b_M + b_BM)

Substituting the numerical values of these coefficients, we have

             Non-Blacks                                    Blacks
Females      Φ(-1.3569) = 0.0874                           Φ(-1.3569 + 0.2994) = Φ(-1.0575) = 0.1451
Males        Φ(-1.3569 + 0.8126) = Φ(-0.5443) = 0.2931     Φ(-1.3569 + 0.8126 + 0.2994 - 0.0806) = Φ(-0.3255) = 0.3724

Note that, multiplied by 100, these are extremely close to the percentages predicted by the logit model, which are, respectively, [...] and, for Black men, 37.2 (see the paragraph [...] Equation 13.16).
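The four probabilities can be reproduced without Stata. The following is a minimal Python sketch, assuming only the numerical values printed above; the helper `normal_cdf` is a hand-written stand-in for Stata's -normal- function, built from the error function in Python's standard library, and the dictionary keys are invented labels.

```python
import math


def normal_cdf(z):
    """Standard cumulative normal distribution, written with math.erf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# Values taken from the worked example above.
a_prime = -1.3569        # probit index for non-Black females at the chosen values
b_male = 0.8126
b_black = 0.2994
b_black_male = -0.0806   # interaction term

probits = {
    ("female", "non-Black"): a_prime,
    ("female", "Black"): a_prime + b_black,
    ("male", "non-Black"): a_prime + b_male,
    ("male", "Black"): a_prime + b_male + b_black + b_black_male,
}

# Transform each probit (a z-score) into an expected probability.
probabilities = {group: normal_cdf(z) for group, z in probits.items()}

for group, prob in probabilities.items():
    print(group, round(probits[group], 4), round(prob, 4))
```

Because the transformation is nonlinear, the interaction term cannot be read off as a difference in probabilities; it has to be carried through the cumulative normal along with the main effects.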
Now let us consider the marginal effect. We might ask how big a change in the probability we could expect for a small change in a particular independent variable. However, because the relationship between the probit index and the probability is nonlinear, the answer depends on the values of the independent variables at which we evaluate the change. Unless we have a reason for doing otherwise, evaluating the marginal effect of each variable relative to the expected value when all independent variables are set at their means would seem most reasonable, and this is the approach Stata takes for continuous variables. However, there is an exception: it makes little sense to evaluate marginal changes in dummy variables relative to their means. A better approach for dummy variables is to compute the discrete change, the difference in the expected probability for those scored 1 and 0 on the dummy variable, with all other variables (including any other dummy variables in the equation) set at their means. Thus, for example, we would want to know the expected difference in the probability of males and females having been threatened, among people who are at the mean with respect to the other variables. For continuous variables, however, we want to know the effect of a small change relative to the mean for all variables. Thus for continuous variables the marginal effect is defined as the slope of the probability function at the mean, extrapolated to a unit increase.

The marginal effects for Model 2 are shown in the rightmost column of Table 13.B.1. Note that I do not show marginal effects for Model 4. This is because when we have interaction terms, the effects of the variables included in the interaction cannot be separated. Thus when we have a model involving interactions, it is best to evaluate the probabilities for various combinations of variables, as in the logit example.

The first thing to note is the predicted probability, 0.1753, which tells us the expected probability that the average person in the data set has ever been threatened by a gun or shot at. It is reassuring that the predicted value is close to the observed value: 19.5 percent of the sample has been threatened. This gives us confidence in the correctness of the model.

Now note the marginal effect for males. Because sex is a dichotomous variable, this coefficient gives the difference in the expected probability of having ever been threatened for males and females who are at the mean with respect to the other characteristics included in the model; among such people, males are predicted to be 21 percent more likely than females to have experienced a gun threat. We also see that, at the mean, a one-year increase in schooling would be expected to reduce the probability of having been threatened by 0.0029. What would, say, a ten-year increase in schooling bring? Note that here we cannot simply extrapolate the marginal effect. For example, it is not correct to say that a ten-year increase in schooling would result in a 0.029 decrease in the expected proportion having been threatened. Rather, we need to compare the cumulative normal transformations at the mean and at the mean plus ten years:
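$$
\begin{aligned}
&\Phi\big(\beta_0 + \beta_M \bar{M} + \beta_E(\bar{E} + 10) + \beta_Y \bar{Y} + \beta_B \bar{B}\big) - \Phi\big(\beta_0 + \beta_M \bar{M} + \beta_E \bar{E} + \beta_Y \bar{Y} + \beta_B \bar{B}\big) \\
&\quad = \Phi\big(-1.710 + 0.802 * 0.451 - 0.0111 * (12.39 + 10) + 0.0062 * 84.47 + 0.259 * 0.111\big) \\
&\qquad - \Phi\big(-1.710 + 0.802 * 0.451 - 0.0111 * 12.39 + 0.0062 * 84.47 + 0.259 * 0.111\big) \\
&\quad = .1482 - .1753 = -.0272
\end{aligned}
\tag{13.B.2}
$$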
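The contrast between the discrete ten-year change and a naive extrapolation of the marginal effect can be verified directly. This Python sketch assumes only the coefficients and means printed in Equation 13.B.2; the variable names are invented, and the predictor whose mean is 84.47 is simply called `y_mean` here because only its symbol survives in the equation.

```python
import math


def normal_cdf(z):
    """Standard cumulative normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density, the slope of the cumulative distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


# Coefficients and means as printed in Equation 13.B.2.
intercept = -1.710
b_male, male_mean = 0.802, 0.451
b_educ, educ_mean = -0.0111, 12.39
b_y, y_mean = 0.0062, 84.47
b_black, black_mean = 0.259, 0.111


def probit_index(education):
    """Probit index with every predictor except education held at its mean."""
    return (intercept + b_male * male_mean + b_educ * education
            + b_y * y_mean + b_black * black_mean)


p_at_mean = normal_cdf(probit_index(educ_mean))
p_plus_ten = normal_cdf(probit_index(educ_mean + 10))
discrete_change = p_plus_ten - p_at_mean

# Marginal effect of education: slope of the probability function at the mean.
marginal_effect = normal_pdf(probit_index(educ_mean)) * b_educ
naive_extrapolation = 10 * marginal_effect

print(round(p_at_mean, 4), round(p_plus_ten, 4))
print(round(discrete_change, 4), round(naive_extrapolation, 4))
```

The two answers differ because the slope of the cumulative normal changes as the index moves away from the mean, so a one-year slope cannot be multiplied by ten.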
FIGURE 13.B.1. Probabilities Associated with Values of Probit and Logit Coefficients. (Horizontal axis: coefficient [b], from -5 to 5.)

A final point to note is that the logit and probit models have similar shapes, except that probit coefficients more quickly reach probabilities asymptotically close to zero or one than do logit coefficients, as is evident from Figure 13.B.1. For this reason, logit models are more sensitive when dealing with rare events or with predicted probabilities close to zero or one. But with this exception, the two models almost always yield similar substantive conclusions.

For further discussion of the binomial probit model, see Petersen (1985), Long (1997, 40-84), Powers and Xie (2000, Chapter 3), Long and Freese (2006), Wooldridge (2006, 583-595), and the -probit-, -probit postestimation-, -svy: probit-, and -svy: probit postestimation- entries in StataCorp (2007). For an interesting application see Manski and Wise (1983).

The Stata commands used to create the worked example for the probit model and the output are shown as the last part of downloadable files "ch13_1.do" and "ch13_1.log."
CHAPTER 14

MULTINOMIAL AND ORDINAL LOGISTIC REGRESSION AND TOBIT REGRESSION

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider models for three additional types of limited dependent variables: categorical variables with more than two categories, for which multinomial logistic regression is appropriate; ordinal variables, for which ordinal logistic regression is appropriate; and truncated, or censored, dependent variables, where observations are not observed below or above some level, for which tobit regression is appropriate. In each case we see how the model is specified and then work through an illustrative substantive analysis.
MULTINOMIAL LOGIT ANALYSIS

Sometimes we wish to analyze categorical dependent variables with more than two categories. In this case, we have available a natural extension of binomial logistic regression: multinomial logistic regression. The procedure involves simultaneously estimating a set of logistic regression equations, of the form

$$
\begin{aligned}
\ln\left[\frac{P(Y=1)}{P(Y=0)}\right] &= a_1 + \sum_k b_{1k}X_k \\
\ln\left[\frac{P(Y=2)}{P(Y=0)}\right] &= a_2 + \sum_k b_{2k}X_k \\
&\;\;\vdots \\
\ln\left[\frac{P(Y=m)}{P(Y=0)}\right] &= a_m + \sum_k b_{mk}X_k
\end{aligned}
\tag{14.1}
$$

Here, one category of the dependent variable is omitted and becomes the reference category. The estimation procedure yields, for a set of m + 1 categories of some dependent variable, m logistic regression equations, each of which predicts the log odds of a case falling into a specific category rather than into the reference category (here designated Y = 0). Note, however, that although the interpretation is similar to the binomial case, the estimation procedure is not equivalent to estimating a set of binomial logistic regression equations in which the odds of being in a particular category versus not being in that category are predicted. In general, the estimates will differ and the binomial estimates will be
incorrect. This can easily be appreciated by imagining that we are interested in what factors determine whether, in 1988 Poland, a person was a Communist Party official, a Communist Party member but not an official, or neither a member nor an official. If we estimated a binomial logistic regression predicting ordinary party membership (without office holding) and another logistic regression equation predicting party office holding, we would be in trouble with respect to the first equation because the negative category (not an ordinary party member) would include those who were neither party members nor officials and also those who were party officials. In consequence, the resulting coefficients would be misleading. For example, it is likely that a coefficient relating education to party membership would be very weak because party officials are likely to be better educated than members, whereas party members are likely to be better educated than nonmembers.

The appropriate way to handle this problem would be to estimate a multinomial logistic regression model with three categories: nonmember, ordinary member, and official. Doing so would result in two equations, one contrasting ordinary members versus nonmembers and the other contrasting officials versus nonmembers, which are
interpreted in the ordinary way. An alternative would be to do a sequential logit analysis in which first membership versus nonmembership is modeled, and then office holding versus ordinary membership is modeled for party members only. The choice between the alternatives would depend on how the process of becoming a party member or an official occurs. (See the brief discussion at the end of the chapter in the section on "Other Models.")
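As a rough illustration of how the m equations of Equation 14.1 jointly determine the probabilities of all m + 1 categories, the following Python sketch converts the log odds into probabilities. The function name `multinomial_probs` and every coefficient value are hypothetical, chosen only for illustration; they are not estimates from any data discussed in this chapter.

```python
import math

def multinomial_probs(intercepts, slopes, x):
    """Category probabilities implied by a set of multinomial logit equations.

    intercepts[i] and slopes[i] belong to category i + 1; category 0 is the
    omitted reference category, whose log odds are fixed at zero.
    """
    log_odds = [0.0]  # reference category
    for a, b in zip(intercepts, slopes):
        log_odds.append(a + sum(bk * xk for bk, xk in zip(b, x)))
    denominator = sum(math.exp(v) for v in log_odds)
    return [math.exp(v) / denominator for v in log_odds]

# Hypothetical three-category outcome with one predictor (say, years of school).
a = [-1.0, -2.5]        # assumed intercepts for categories 1 and 2
b = [[0.10], [0.25]]    # assumed slopes for categories 1 and 2
probs = multinomial_probs(a, b, [12.0])

for category, p in enumerate(probs):
    print(f"P(Y = {category}) = {p:.3f}")
```

Taking the log of the ratio of any category's probability to the reference category's probability recovers that category's equation, which is the sense in which each equation predicts log odds relative to the reference category.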
A Worked Example: Foreign-Language Competence in the Czech Republic

To see how this procedure works in practice, let us analyze the factors that account for competence in English and Russian in the Czech Republic. The data used here were collected in 1993 from a representative national probability sample of 5,496 Czechs age twenty to sixty-nine, as part of the survey Social Stratification in Eastern Europe After 1989 (Treiman and Szelényi 1993; see Appendix A for details on this survey and how to obtain the data set and documentation). Here we consider four groups:

• those who speak neither English nor Russian
• those who speak English but not Russian
• those who speak Russian but not English
• those who speak both languages
To be classed as a speaker of a language, a respondent had to report speaking the language "fairly well" or "very well"; those who reported that they speak the language "only a little" or "not at all" or who failed to answer the question were classified as nonspeakers of the language. Because the survey was conducted in Czech, everyone spoke Czech. A few may also have spoken a second language other than Russian or English, but this possibility is not analyzed here.

My expectation is that professionals and technicians would be more likely than others to speak English because English is now the international language of science, technology, and scholarship, and hence the ability to speak English is important for professional advancement. Those who were ever Communist Party members, and especially those who were government or party officials, would be more likely than others to speak Russian because Russian was necessary for political advancement in the Eastern Bloc. It is less clear whether or to what extent being a manager would affect the odds of speaking English (perhaps necessary for international business dealings) or Russian (perhaps necessary for Eastern Bloc dealings).

To identify those who potentially needed Russian for their careers, I classify respondents by their 1988 occupation and create four dummy variables for 1988 occupation, each scored 1 for those in the category and scored 0 otherwise: officials, other managers, professionals and technicians, and others. (This variable was constructed by recoding the version of ISCO 88 shown in Treiman [1994, Appendix C]. "Officials" include codes 1000 to 1166, "other managers" include codes 1200 to 1320, "professionals and technicians" include codes 2000 to 3480, and "others" include codes 4000 to 9333. Those
not reporting an occupation in 1988 were excluded from the analysis.) In addition to these variables, I include education as a control variable because it is clear that the better educated are more likely to speak foreign languages in general.

The data were weighted to adjust for differential household size and to bring the sample characteristics into conformity with population distributions (see Treiman 1994, Section I.G, for details). However, standard errors were not adjusted for clustering. In the sample design, census tracts were divided into eight strata, and households were randomly sampled within strata. Because the stratum identification is not given in the documentation, there is no alternative but to treat the sample as a simple (weighted) random sample. Given the probable lack of systematic association between the stratification of tracts and other characteristics, the lack of adjustment for stratification is likely to be of little consequence. The results are reported in Table 14.1 for the 3,945 people with a job in 1988 for whom complete information was available. (Downloadable file "ch14_1.log" shows the Stata log for the analysis, and "ch14_1.do" shows the -do- file used to obtain the results.)

Inspecting the coefficients in Table 14.1, we see that, as expected, the odds of speaking either Russian or English, or both, improved substantially with education. The odds multipliers in the second panel tell us that each additional year of schooling increased the odds of speaking Russian by 25 percent, the odds of speaking English by 36 percent, and the odds of speaking both languages by 51 percent, all in contrast to speaking neither Russian nor English. Thus, for example, net of other factors, the odds that a Czech university graduate could speak Russian but not English (in contrast to speaking neither Russian nor English) are nearly two and one half times as high as the odds that a high school graduate could do so (because $1.248^{(16-12)} = 2.43$). The odds that a university graduate could speak English but not Russian are more than three times the odds for a high school graduate (because $1.363^{(16-12)} = 3.45$), and the odds that a university graduate could speak both Russian and English are more than five times the odds for a high school graduate (because $1.508^{(16-12)} = 5.17$).
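The arithmetic behind these comparisons can be checked with a short Python sketch. It uses only the education logits quoted from Table 14.1; the function names are hypothetical, and the rounding of each multiplier to three digits before exponentiating is an assumption made to mirror how the figures in the text appear to have been obtained.

```python
import math

# Logit coefficients for years of school completed, from Table 14.1.
education_logits = {"Russian": 0.2213, "English": 0.3096, "Both": 0.4107}

def odds_multiplier(logit, digits=3):
    """Odds multiplier e^b, rounded as in the second panel of the table."""
    return round(math.exp(logit), digits)

def multiplier_over(logit, units):
    """Odds multiplier for a difference of `units` on the independent variable."""
    return odds_multiplier(logit) ** units

# University graduate (16 years) versus high school graduate (12 years).
four_year = {outcome: multiplier_over(b, 16 - 12)
             for outcome, b in education_logits.items()}

for outcome, value in four_year.items():
    per_year = odds_multiplier(education_logits[outcome])
    print(f"{outcome}: {per_year} per year, {value:.2f} over four years")
```

Because the effect of a k-unit difference is the one-unit multiplier raised to the power k, the same function serves for any pair of education levels.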
Note that we are not restricted to comparisons with the omitted reference category. By subtracting the coefficients for the log odds (or, alternatively, taking the ratio of the odds multipliers), we can compare the categories for which we have coefficients. For example, each year of school increases the odds that a Czech could speak English instead of Russian by about 9 percent (because $e^{(0.3096-0.2213)} = 1.363/1.248 = 1.092$). Hence, the odds that a university graduate could speak English and not Russian (rather than Russian and not English) are more than 40 percent greater than the corresponding odds for high school graduates (because $e^{4(0.3096-0.2213)} = (1.363/1.248)^4 = 1.42$). (Note that in contrast to our usual rule of thumb that three significant digits are sufficient, it probably is best to report four digits for the coefficients because they often are used in subsequent calculations. Too much rounding error is introduced when only three digits are reported, so that the mathematical relationships implied by the coefficients shown in downloadable file "ch14_1.log" appear no longer to hold.)

Continuing with our substantive comparisons, we note that, as expected, membership in the Communist Party increased the odds of speaking Russian and decreased the odds of speaking English but had no impact on the odds of speaking both Russian and English. All else equal, the odds that Communist Party members spoke Russian but not English are
TABLE 14.1. Effect Parameters for a Model of the Determinants of English and Russian Language Competence in the Czech Republic, 1993 (N = 3,945). (Standard Errors in Parentheses; p-values in Italic.)

Variable                             Russian                  English                  Both

Logits (b)
Years of school completed            0.2213 (.0247) .000      0.3096 (.0404) .000      0.4107 (.0429) .000
Ever a Communist Party member?       0.3020 (.1488) .042      -0.8965 (.3332) .007     0.0484 (.2778) .862
Government or CP official in 1988    1.5591 (.7169) .030      -28.2975 (.6097) .000    -28.3602 (.7039) .000
Other manager in 1988                0.9941 (.2725) .000      0.8010 (.4844) .098      0.8534 (.5330) .109
Professional in 1988                 0.9943 (.1548) .000      1.124 (.2990) .000       1.3856 (.3577) .000
Intercept                            -5.5378 (.3021) .000     -8.1541 (.5036) .000     -10.1965 (.5866) .000

Odds multipliers (e^b)
Years of school completed            1.248                    1.363                    1.508
Ever a Communist Party member?       1.353                    0.408                    1.050
Other manager in 1988                2.702                    2.228                    2.348
about a third higher than the odds that they spoke neither language, whereas the odds that Communist Party members spoke English but not Russian are only about 40 percent as great as the odds that they spoke neither language. Thus the odds that Communist Party members spoke Russian but not English are more than three times as great as the odds that they spoke English but not Russian (because $e^{(0.3020-(-0.8965))} = 1.353/0.408 = 3.316$). The same is true of service as a government or Communist Party official. Here, as expected, officials were nearly five times as likely to speak Russian (in contrast to speaking neither Russian nor English) than were those who were neither managers nor professionals or technicians (recall that the reference category is all other occupations). The odds that government officials spoke English or both Russian and English are effectively zero, which they should be because not one of the sixteen officials in the sample spoke English.

Finally, we see that being a professional or technician in 1988 roughly triples the odds of speaking Russian only or English only, and quadruples the odds of speaking both English and Russian, relative to speaking neither English nor Russian. By contrast, being a manager nearly triples the odds of speaking Russian only, relative to speaking neither. But the net effects of being a manager on the odds of speaking English or of speaking both English and Russian were both somewhat smaller than the effect of being a manager on the odds of speaking Russian. Also, the coefficients are only marginally significant, at about the 0.1 level.

Although for this example I settled on a single model in advance, model selection for multinomial logit models is carried out in exactly the same way as for binomial logit models: by taking the ratio of the difference in $L^2$ (the model $\chi^2$) to the difference in degrees of freedom for any two models, to determine whether one model fits the data significantly better than the other model (but recall that this procedure is not appropriate when robust estimation is used, that is, when the data are weighted or clustered; rather, a Wald test should be used to compare models).
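The contrasts between two non-reference categories discussed in this example can be reproduced with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the English coefficient for party membership is entered as -0.8965, the sign implied by its odds multiplier of 0.408. Results computed from the four-digit logits differ slightly in the last digit from those obtained by dividing the rounded odds multipliers.

```python
import math

def contrast_multiplier(logit_d, logit_c, units=1):
    """Odds multiplier for category d versus category c (neither is the reference)."""
    return math.exp(units * (logit_d - logit_c))

# Years of school completed: English (0.3096) versus Russian (0.2213).
english_vs_russian_per_year = contrast_multiplier(0.3096, 0.2213)
english_vs_russian_four_years = contrast_multiplier(0.3096, 0.2213, units=4)

# Ever a Communist Party member: Russian (0.3020) versus English (-0.8965).
russian_vs_english_member = contrast_multiplier(0.3020, -0.8965)

print(f"English vs. Russian, per year of school: {english_vs_russian_per_year:.3f}")
print(f"English vs. Russian, four years of school: {english_vs_russian_four_years:.2f}")
print(f"Russian vs. English, party members: {russian_vs_english_member:.2f}")
```

Subtracting logits before exponentiating is equivalent to taking the ratio of the corresponding odds multipliers, which is why either route gives the same comparison up to rounding.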
Independence of Irrelevant Alternatives

In the multinomial logit model, the relative odds of being in two categories are assumed to be independent of the other alternatives included in the model. This follows from Equation 14.1, from which we can derive the difference in log odds for two categories, d and c, as
$$
\ln\left[\frac{P(Y=d)}{P(Y=c)}\right] = (a_d - a_c) + \sum_k (b_{dk} - b_{ck})X_k \tag{14.2}
$$
Note that only the two categories being compared enter the equation. If, however, the relative odds do depend on what the alternatives are, the model produces misleading estimates. To see this clearly, consider McFadden's (1974) well-known example of transportation choice. Suppose people can travel to work by bus or by car and that half choose to go by car and half by bus. Now suppose a competing bus company establishes buses with the same routes and schedule, so we no longer have, say, only blue buses but also red buses. Presumably, the half that traveled by car would continue to do so, but the half that traveled by bus would divide equally between the red and blue buses, taking whichever bus showed up first at the bus stop. Thus the odds ratio for car versus blue-bus ridership would change from 1:1 to 2:1, violating the assumption of the model.

Now consider another example. Suppose there are two restaurants in a neighborhood, a Mexican and an Italian restaurant, and that the Mexican restaurant gets 60 percent of the total business. Then a new Chinese restaurant opens in the neighborhood and draws off 20 percent of the business of the Mexican restaurant and 20 percent of the business of the Italian restaurant. The Mexican restaurant's share of the total is now 48 percent, and the Italian restaurant's share of the total is 32 percent. Here the independence-of-irrelevant-alternatives (IIA) assumption holds because 60/40 = 48/32 = 3/2.

Because the multinomial model is misleading when the IIA assumption is violated, McFadden suggests that multinomial (and conditional) logistic regression models should be estimated only when the outcome categories "can plausibly be assumed to be distinct and weighed independently in the eyes of each decision maker" (1974, 113).

A formal test of the IIA property is available, implemented in Stata 10.0 as -suest- ("seemingly unrelated estimation," a generalization of an earlier command, -hausman-). The -suest- test compares models that do and do not include presumably irrelevant outcomes. If the resulting parameters for the restricted and unrestricted models are similar, the additional outcomes can be assumed to be irrelevant. Applying these ideas to our current example, we might ask whether the odds that people speak English are affected by including "Russian" as an alternative in the model. In this case the test strongly suggests that the IIA condition is not satisfied. Thus we might consider estimating a sequential logit model in which we successively consider two questions: whether a respondent speaks either Russian or English versus speaking neither language, and, for each of the two subsets of respondents (those speaking Russian and those speaking English), whether they speak the other language as well.

For further discussion of the IIA assumption and its consequences, see McFadden (1974), Hausman and McFadden (1984), Hoffman and Duncan (1988), Zhang and Hoffman (1993), Long (1997, 182-184), Powers and Xie (2000, 215-247), Long and Freese (2006), and the -hausman- and -suest- entries in StataCorp (2007). Additional examples of the application of multinomial logit models include AIl and Shields (1991), Haynes and Jacobs (1994), Tomaskovic-Devey and Skaggs (1999), and Breen and Jonsson (2000).
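The two examples used above to explain the IIA assumption can be restated numerically. The Python sketch below uses only the shares given in the text; the variable and function names are hypothetical, and the comparison of odds before and after a new alternative appears is an informal illustration of the assumption, not the -suest- test itself.

```python
import math

def odds(share_a, share_b):
    """Odds of choosing alternative a rather than alternative b."""
    return share_a / share_b

# Restaurants: the new Chinese restaurant draws 20 percent from each incumbent.
before = {"Mexican": 0.60, "Italian": 0.40}
draw = 0.20
after = {name: share * (1 - draw) for name, share in before.items()}
after["Chinese"] = 1 - sum(after.values())

restaurant_iia_holds = math.isclose(
    odds(before["Mexican"], before["Italian"]),
    odds(after["Mexican"], after["Italian"]),
)

# Commuting: red buses take riders only from blue buses, never from cars.
commute_before = {"car": 0.50, "blue bus": 0.50}
commute_after = {"car": 0.50, "blue bus": 0.25, "red bus": 0.25}

car_vs_blue_before = odds(commute_before["car"], commute_before["blue bus"])
car_vs_blue_after = odds(commute_after["car"], commute_after["blue bus"])
bus_iia_holds = math.isclose(car_vs_blue_before, car_vs_blue_after)

print("Restaurants, IIA holds:", restaurant_iia_holds)
print("Buses, IIA holds:", bus_iia_holds)
```

The contrast between the two cases is that the new restaurant draws proportionally from both existing alternatives, whereas the new bus line draws only from the alternative it closely resembles.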
ORDINAL LOGISTIC REGRESSION

Often in the social sciences we have ordinal dependent variables, where the response categories can be ordered on some dimension but where the distance between categories is unknown. Most attitude variables are of this sort. For example, if people are asked to say how happy they are, and the response categories include "very happy," "pretty happy," and "not too happy," there is no ambiguity in assuming that those who say they are "pretty happy" are less happy than those who say they are "very happy" and are more happy than those who say they are "not too happy." However, there is no basis for assuming that the distance between "not too happy" and "pretty happy" is the same as the distance between "pretty happy" and "very happy." Many other attitude scales have similar properties. In such cases we could predict the scale score using ordinary least-squares regression. However, to do so would be tantamount to assuming that the distance between response categories is uniform. (For a fuller discussion of this and other points, see Winship and Mare [1984].)

An alternative is to estimate an ordinal logit equation, which makes use of the ordinal property of the response categories on the dependent variable but makes no assumption at all about the relative distances between categories. The basic assumption of the ordinal logit model is that there is an unobserved continuous dependent variable, Y*, which is a linear function of a set of independent variables:

$$
Y^* = a + \sum_j b_j X_j + \mu \tag{14.3}
$$
However, what is observed is a set of ordered categories, Y = 1, ..., J, such that

$$
Y = \begin{cases}
1 & \text{if } -\infty \le Y^* < k_1 \\
2 & \text{if } k_1 \le Y^* < k_2 \\
\;\vdots & \\
J & \text{if } k_{J-1} \le Y^* < \infty
\end{cases}
\tag{14.4}
$$

where the $k_i$ are "cutting points" on the unobserved, or latent, underlying variable. Now, because we observe Y = 1 when $Y^* < k_1$, observe Y = 2 when $k_1 \le Y^* < k_2$, and so on, it follows that

$$
\Pr(Y = i \mid X) = \Pr(k_{i-1} \le Y^* < k_i \mid X) \tag{14.5}
$$

Substituting from Equation 14.3 and imposing the constraint that a = 0, which is necessary to identify the equation, we have

$$
\Pr(Y = i \mid X) = \Pr\Big(k_{i-1} \le \sum_j b_j X_j + \mu < k_i \;\Big|\; X\Big) \tag{14.6}
$$

Then, subtracting within the inequality and noting that the probability that a random variable falls between two values is the difference between the cumulative density functions evaluated at these values, we have

$$
\Pr(Y = i \mid X) = \frac{1}{1 + e^{-(k_i - \sum_j b_j X_j)}} - \frac{1}{1 + e^{-(k_{i-1} - \sum_j b_j X_j)}} \tag{14.7}
$$

That is, the expected probability that an observation will have a particular value is the difference between the probability associated with reaching the upper-bound cutting point and the probability associated with reaching the lower-bound cutting point, where these probabilities are estimated from logistic functions known as cumulative logits because they give the log odds of reaching each cutting point. (Note that for the extreme categories one of the terms of Equation 14.7 reduces to a constant, because the cumulative probability is zero at $-\infty$ and one at $\infty$.)
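As a check on Equation 14.7, the extreme categories and the sum over all categories can be written out explicitly. The notation here is an assumption introduced only for this check: $\Lambda(z) = 1/(1+e^{-z})$ denotes the logistic cumulative distribution function, the categories are taken to run from 1 to J, and $k_0 = -\infty$ and $k_J = \infty$.

$$
\begin{aligned}
\Pr(Y = 1 \mid X) &= \Lambda\Big(k_1 - \sum_j b_j X_j\Big) - 0 \\
\Pr(Y = J \mid X) &= 1 - \Lambda\Big(k_{J-1} - \sum_j b_j X_j\Big) \\
\sum_{i=1}^{J} \Pr(Y = i \mid X) &= \Lambda(\infty) - \Lambda(-\infty) = 1 - 0 = 1
\end{aligned}
$$

The intermediate terms cancel in pairs, so the category probabilities sum to one whatever the values of the cutting points and coefficients.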
A Worked Example: Political Party Identification in the United States, 1998

Consider the following substantive problem. Suppose we wish to assess what factors lead people to place themselves toward the Republican end rather than toward the Democratic end of a scale of political party identification. Here is the item and the ordinal response categories in the 1998 GSS:

Generally speaking, do you usually think of yourself as a Republican, Democrat, Independent, or what?

IF REPUBLICAN OR DEMOCRAT: Would you call yourself a strong (Republican/Democrat) or not a strong (Republican/Democrat)?

IF INDEPENDENT, NO PREFERENCE, OR OTHER: Do you think of yourself as closer to the Republican or Democratic Party?

This set of questions and responses yielded seven response categories:

• Strong Democrat
• Not strong Democrat
• Independent, near Democrats
• Independent
• Independent, near Republicans
• Not strong Republican
• Strong Republican
On the ground that the Republican Party is increasingly the party of nonurban affluent non-Black males, especially those from the South, I predict the score on a continuum underlying the listed response categories from the following variables:

• size of place (people living in large [population more than 250,000] central cities of Standard Metropolitan Statistical Areas [SMSAs], other people living in SMSAs, and people living outside SMSAs)
• income (with categories recoded to their midpoints and the open-ended upper category, $110,000 and over, recoded to $150,000)
• gender (male versus female)
• region of residence (South versus other)
• race (Black versus non-Black)
Survey estimation procedures were used to take account of clustering and household size. The Stata commands that carry out this analysis are shown in downloadable file "ch14_2.do" and the results are shown in "ch14_2.log."

A property of ordinal logistic regression (which applies also to binomial logistic regression models of the sort discussed in the previous chapter) is that the variance of the latent variable presumed to underlie the observed outcome variable changes as variables are added to the prediction equation. Thus, it is not appropriate to directly compare corresponding coefficients across models, as is commonly done with OLS models (see Chapter Six). Rather, the latent dependent variable must first be standardized. To illustrate how to carry out such a standardization and how to interpret the resulting coefficients, I estimate two models: Model 1, which omits race, and Model 2, which includes race.
First consider Model 1, shown in the left panel of Table 14.2. Here we see that all the variables have the expected signs (a positive sign means a shift toward Republican identification). However, Southern residence is not at all significant. (To assess the significance of the two "urban" coefficients, I use Stata's -test- command in the usual way. Doing so, I conclude that the urban distinctions are significant at well beyond conventional levels.) Now consider Model 2. Once Blacks are included in the model, the effect of Southern residence becomes marginally significant (at the .048 level). This is what we would expect, considering that Blacks are more likely than non-Blacks to reside in the South (53 percent of Blacks versus 33 percent of non-Blacks) and are also much more likely to identify as Democrats (63 percent of Blacks versus 30 percent of non-Blacks). When race is not included in the model, the large fraction of Southern Blacks suppresses the positive effect of Southern residence on Republican leaning. Once we control for race, this effect emerges clearly.

Converting the Logits to Y*-Standardized Form. Inspecting the coefficients, it appears that the inclusion of race in the model dramatically increases the effect of Southern residence, from .050 to .187. However, this comparison is inappropriate because the variance of the latent "Republicanism" variable changes when additional variables are included in the model. Thus, before comparing coefficients, it is necessary to standardize the coefficients. Although there are several ways to do this, a particularly appealing approach is to standardize only the latent dependent variable, so that the resulting (Y*-standardized) coefficients indicate the expected change in the standard deviation of the latent variable for a one-unit change in the independent variable. An important advantage of Y*-standardization over full standardization is that, as we saw in Chapter Six, fully standardized coefficients are not appropriate for categorical variables because for such variables they are affected by the relative size of the category as well as by the size of the metric effect.

An additional reason for standardizing the coefficients, even when we do not want to compare corresponding coefficients across models, is that the latent dependent variable
TABLE 14.2. Effect Parameters for an Ordered Logit Model of Political Party Identification, U.S. Adults, 1998.

Columns, for each of Model 1 and Model 2: b; Standard Error; p; Y*-Std. Coeff.

Black (Model 2): b = -1.414; standard error = 0.164; p = .000; Y*-std. coeff. = -.423
has no intrinsic metric, which makes the size of the unstandardized coefficients meaningless. (Recall from Equation 14.3 that the coefficients shown in Table 14.2 represent the expected change on the unobserved, or latent, dependent variable, Y*, for a one-unit change in each of the independent variables, holding constant all other independent variables.) However, because it is possible to estimate the variance of Y*, we can divide the coefficients by the standard deviation of Y* to get semistandardized, that is, Y*-standardized, coefficients, which are then interpreted as the number of standard deviations of expected difference in Y* for two individuals who differ by one unit on the given independent variable. That is,

$$
b_j^{S_{Y^*}} = \frac{b_j}{\sigma_{Y^*}} \tag{14.8}
$$

where $b_j$ is the coefficient associated with the jth variable and $\sigma_{Y^*}$ is the standard deviation of Y*. To estimate the variance of Y*, I follow Long (1997, 129):

$$
\operatorname{Var}(Y^*) = B'VB + \operatorname{Var}(\mu) \tag{14.9}
$$

where B is a vector of logits, V is the variance-covariance matrix of the independent variables, and $\operatorname{Var}(\mu)$ is $\pi^2/3$. (See the downloadable file "ch14_2.do" for how to estimate these coefficients, which are reported in the rightmost column of each panel of Table 14.2.)

Consider Model 2. As we see, net of other factors, Blacks are nearly a half standard deviation lower than non-Blacks with respect to Republican orientation. No other variable has nearly as strong an impact. In particular, although positive, the effect of Southern residence is weak, only about half as strong as the effect of gender and about a third as strong as the effect of nonmetropolitan residence. Family income also has only a modest effect. For example, two individuals would have to differ in income by about $184,000 per year to be about as far apart in Republican tendencies, net of all other factors, as are Blacks and non-Blacks who are identical in other respects (precisely, 0.423 = 0.023 × 18.39).

Getting Predicted Percentages. Another way to assess the magnitude of the effects is to evaluate the prediction equation for particular values of the independent variables. To do this we need to take account both of the coefficients associated with each of the independent variables and of the ancillary parameters, the cut points. For example, from Equation 14.7 we can estimate (from the Model 2 coefficients) the probability that a non-Black man earning $40,000 to $50,000 per year and living in a central city of an SMSA outside the South is categorized as a "strong Democrat":

$$
\Pr(Y = 1 \mid X) = \frac{1}{1 + e^{-(k_1 - [0.0764(4.5) + 0.334])}} \tag{14.10}
$$

Similarly, the probability that such a person is a "not strong Democrat" is

$$
\Pr(Y = 2 \mid X) = \frac{1}{1 + e^{-(k_2 - [0.0764(4.5) + 0.334])}} - \frac{1}{1 + e^{-(k_1 - [0.0764(4.5) + 0.334])}} \tag{14.11}
$$
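The calculation described above can be sketched in Python for all seven categories at once. This is a minimal illustration under stated assumptions, not the analysis in "ch14_2.do": the coefficients for male (0.334) and Black (-1.414) are those quoted in the text for Model 2, whereas the income coefficient (0.0764 per $10,000), the income value (4.5), and all six cut points are assumed placeholder values, and the function names are hypothetical.

```python
import math

def logistic_cdf(z):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def category_probs(cut_points, xb):
    """Probability of each ordered category, as differences of cumulative logits."""
    cumulative = [logistic_cdf(k - xb) for k in cut_points]
    bounds = [0.0] + cumulative + [1.0]   # cumulative probability is 0 at -inf, 1 at +inf
    return [bounds[i + 1] - bounds[i] for i in range(len(bounds) - 1)]

# Assumed values: six cut points separating seven categories.
cut_points = [-1.5, -0.5, 0.0, 0.8, 1.3, 2.6]

b_income, b_male, b_black = 0.0764, 0.334, -1.414   # income coefficient is an assumption
income = 4.5                                         # assumed midpoint, in $10,000s

# Men in large central cities outside the South: other dummy variables are zero.
xb_nonblack = b_income * income + b_male
xb_black = xb_nonblack + b_black

nonblack = category_probs(cut_points, xb_nonblack)
black = category_probs(cut_points, xb_black)

labels = ["Strong Democrat", "Not strong Democrat", "Independent, near Democrats",
          "Independent", "Independent, near Republicans",
          "Not strong Republican", "Strong Republican"]
for label, p_black, p_nonblack in zip(labels, black, nonblack):
    print(f"{label:32s} {p_black:.3f} {p_nonblack:.3f}")
```

Because the cut points here are placeholders, the printed distributions show only the direction of the race difference, not the magnitudes reported in the chapter.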
Although these probabilities may be computed by hand, it is easy to have Stata do the work. The trick is to use the -predict- command to get the predicted distribution for persons with the desired profile of characteristics (see the Stata -log-). Table 14.3 shows the predicted distribution of party identification for Black and non-Black males earning $40,000 to $50,000 per year and living in central cities of SMSAs. (Of course, I could equally well estimate predicted probability distributions for any combination of characteristics. Indeed, it is possible to get estimates for combinations of variables not found in the sample by creating a new data set containing these combinations. See the discussion of -predict- in StataCorp 2007.) As we see, non-Blacks are substantially more Republican than otherwise similar Blacks.

Constructing Odds Ratios. Still another way to assess the net effect of an independent variable is to compute its contribution to the ratio of the odds of being below any given value in the ordinal scale to the odds of being at or above that value. Because of the way the logits are derived, their contributions to odds ratios are constant regardless of the cutting point, and it can be shown (Long 1997, 139) that they are equal to $e^{-b_j}$ for the jth independent variable. Thus, for example, the ratio of the odds that males and females will be strong Democrats versus less-than-strong Democrats is just $e^{-0.334} = 0.72$; or, putting it more naturally, net of other factors women are about 40 percent more likely than men (precisely, 1.39 = 1/0.72) to be strong Democrats (rather than anything closer to being Republicans). Similarly, women are about 40 percent more likely than men to be any kind of Democrat (compared to Independents plus Republicans).

Comparisons to Other Estimating Procedures: -gologit2-. As we have just seen, an important constraint embedded in the -ologit- estimation procedure is what is known as the proportional odds assumption: that the explanatory variables have the same effect on the odds that the dependent variable is below any dividing point.
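The constancy of the odds ratio across cutting points can be verified numerically. In the Python sketch below, the coefficient for male (0.334) is the one used in the text; the cut points and the contribution of the other variables are hypothetical values, and the function name is an assumption made for illustration.

```python
import math

def odds_below(cut_point, xb):
    """Odds of falling below a cutting point versus at or above it."""
    p = 1.0 / (1.0 + math.exp(-(cut_point - xb)))
    return p / (1.0 - p)

b_male = 0.334
hypothetical_cut_points = [-1.5, -0.5, 0.0, 0.8, 1.3, 2.6]
xb_other = 0.5   # assumed contribution of all other variables, held constant

# Ratio of male odds to female odds at every cutting point.
ratios = [odds_below(k, xb_other + b_male) / odds_below(k, xb_other)
          for k in hypothetical_cut_points]

male_to_female = math.exp(-b_male)
female_to_male = 1 / round(male_to_female, 2)

print([round(r, 4) for r in ratios])
print(round(male_to_female, 2), round(female_to_male, 2))
```

Every entry in `ratios` equals $e^{-0.334}$ regardless of the cut point or of the value chosen for the other variables, which is the proportional odds constraint stated in numerical form.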
On the face of it, there is often little reason to assume that the odds are proportional. Why should we assume, for example, that gender has the same effect in distinguishing strong Democrats from all others and in distinguishing anyone who is Democratic-leaning from Independents and Republicans; and the same for each other independent variable? A user-written -ado- file, -gologit2- (for Generalized Ordered Logit Model), relaxes this assumption, allowing the odds to vary across cutting points. Reestimating Model 2 of Table 14.2 using -gologit2- rather than -ologit- yields the coefficients shown in Table 14.4.

As we can see, the effects of each variable differ substantially from category to category. For example, Southern residence distinguishes neither strong Democrats nor both kinds of Democrats from those who lean more toward Republicans, nor does it distinguish strong Republicans from others; but it does significantly affect the remaining distinctions. Similarly, nonmetropolitan residence matters rather more in the middle of the distribution than at either extreme. Still, the pattern of distinctions does not appear to be very systematic.
TABLE 14.3. Predicted Probability Distributions of Party Identification for Black and Non-Black Males Living in Large Central Cities of Non-Southern SMSAs and Earning $40,000 to $50,000 per Year.

Columns: Black; Non-Black. Rows: the seven party identification categories, from Strong Democrat to Strong Republican, and Total (Black 0.999; Non-Black 1.001).
In my judgment, because the gologit model is rather more complex than the ologit model, two criteria need to be satisfied to justify substituting -gologit2- for -ologit- estimates: first, that the proportional odds assumption be shown to be inadequate; and second, that the coefficients for the gologit model be interpretable. To determine whether the proportional odds assumption is inadequate, we estimate the gologit model and test the equality of corresponding coefficients for each of the cutting points. In the present case we reject the null hypothesis that the coefficients are equal ($\chi^2$ with 30 d.f. = 147; p < .0000). However, I am hard pressed to arrive at a coherent interpretation of the variations in the coefficients across cutting points. I would
ESTIMATING GENERALIZED ORDERED LOGIT MODELS

The generalized ordered logit procedure has been enhanced by Williams (2006). Williams's -ado- file, -gologit2-, can be downloaded from within Stata. Type "net search gologit2", click the first entry, and then select "Click here to install."
350
DataAnalysis: DoingSocialResearch to Testtdeas Quantitative
,:Atl;-:1,i,r1, etect Parametersfor a Generalizedordered Logit Mo@t of PoliticalParty ldentification, U.S.Adults, 1998. o
StandardError
0.4732
.000
0.391
.00c
.412
.149 .524 .000 .000
.000 0.095
.000
.003
.000 .493 .000 .193
and Tobi t R egressi on M u l ti n o m i a la n d Ord i n a lLogi sti cR egressi on
351
                                              b        Standard Error     p

[...] Democrat or Democratic-leaning Independent versus other
  Annual income (000s)                       0.0582       0.0111        .000
  Residence in SMSA, not in large center     0.469        0.135         .001
  Residence outside SMSA                     0.700        0.182         .000
  Male                                       0.239        0.096         .013
  Southern residence                         0.238        0.117         .043
  Black                                     -1.579        0.201         .000
  Intercept                                 -0.591        0.158         .000

[...] versus Republican-leaning Independent or Republican
  Annual income (000s)                       0.0941       0.0127        .000
  Residence in SMSA, not in large center     0.574        0.142         .000
  Residence outside SMSA                     0.820        0.182         .000
  Male                                       0.383        0.093         .000
  Southern residence                         0.345        0.114         .002
  Black                                     -1.724        0.225         .000
  Intercept                                 -1.696        0.111         .000
thus be inclined to settle for the conventional ordinal logit model on the grounds of parsimony.

Ordinary Least Squares as an Alternative

Finally, we could treat the dependent variable as an interval variable and estimate an ordinary least-squares equation. This amounts to assuming that the distance between any pair of adjacent categories is identical. As it turns out, in the present case the coefficients yielded by the OLS model, shown in Table 14.5, are quite similar to those yielded by the ologit model. Thus, we might be as well served by simply estimating an OLS model, which is much simpler to estimate and to interpret than is the ologit model. The difficulty is that unless we carry out the analysis both ways we really do not know whether the results will be similar in any particular instance. Thus, a reasonable strategy is, indeed, to carry out the analysis both ways and, if the results prove to be similar, to present the OLS results but to add a note indicating that you did the analysis both ways and got similar results. Of course, if the results differ enough to affect the conclusions, the ordinal logit model is to be preferred over OLS because it is less restrictive; that is, because it does not assume that the categories are equidistant.
TOBIT REGRESSION (AND ALLIED PROCEDURES) FOR CENSORED DEPENDENT VARIABLES

Often we have dependent variables that are censored in the sense that the recorded values do not represent the entire range of the true underlying variable. The classic case is that studied by the economist James Tobin (1958), hence the name tobit regression (coined by econometrician Arthur Goldberger when he described "Tobin's probit"), where a consumer good was purchased if the desire was high enough, with "desire" measured by the dollar amount spent on the good. From this definition of "desire," it is evident that the measure is "censored" at zero, because all those not making the purchase are recorded as having "zero" desire, whereas in reality some might have been close to making it and might have done so had the price been a little lower. Others might have had no desire at all and would never have made the purchase regardless of the price, and still others might have wavered in between. That is, there actually is variability in the relative desire of those recorded as having zero desire.

An underlying variable is censored in many other situations as well. The classic case is where many values are below a threshold that would lead to action; for example, the number of extramarital affairs (Fair 1978), the number of infant deaths experienced by mothers (Wood and Lovell 1992), the number of killings committed by police in different jurisdictions (Jacobs and O'Brien 1998), the number of arrests after release from prison (Witte 1980), the number of scientific publications (Stephan and Levin 1992), the number of protests in a nation (Walton and Ragin 1990), and the number of hours worked per year (Rosen 1976, Keeley and others 1978, Quester and Greene 1982). But we also can imagine other kinds of cases: attitude variables that fail to offer enough options, income coded in categories with a top code that is too low, censoring that occurs because the length of time to an event is analyzed only for those to whom the event has occurred
TABLE 14.5. Effect Parameters for an Ordinary Least-Squares Regression Model of Political Party Identification, U.S. Adults, 1998.

                                              b        Standard Error     p
  Annual income (000s)                       0.0803       .0100         .000
  Residence in SMSA, not in large center     0.411        .100          .000
  Residence outside SMSA                     0.620        .141          .000
  Male                                       0.337        .082          .000
  Southern residence                         0.212        .096          .429
  Black                                     -1.386        .146          .000
  Intercept                                  1.981        .123          .000
JAMES TOBIN (1918-2002), a Nobel Laureate in Economics (1981), is best known among social scientists for his procedure for estimating models with censored dependent variables. But his major work, for which he won the Nobel Prize, was his analysis of financial markets and their relations to consumption and investment decisions, employment, production, and prices. He made major contributions to the analysis of how households and firms actually determine the composition of their assets, developing what is known as "portfolio selection theory." The result was a description and analysis of financial markets and flows in the economy.

Tobin grew up in a liberal household in Champaign, Illinois, where his father was a journalist who worked as publicity director for the University of Illinois athletic program and his mother was a social worker. He attended the university's laboratory high school where, as he notes in his Nobel lecture, he cast the only vote for Roosevelt in a 1932 presidential straw vote. He did his undergraduate and graduate work in economics at Harvard, where he earned a PhD in 1947, his graduate studies having been interrupted, first by a job in Washington, D.C., and then by service in the Navy as an officer on a destroyer. After three years as a Harvard Junior Fellow (a very prestigious fellowship), which he used in part to study econometric developments he had missed during the war, he then spent his entire academic career at Yale.
[...] Smith, and Nord 1990). Other substantive applications include Mare and Chen; Saltzman (1987); Roncek (1992); and Treno, Alaniz, and Gruenewald (2000).
The Tobit Model

An obvious question is what to do in the case where we have censored observations, for example, observations scored zero (or some other constant) when we think there is variability in the true underlying value of the censored observations. One solution is to simply carry out an OLS regression of the entire data set. But this produces inconsistent estimates (Long 1997, 188-190). Another solution is to discard the censored cases and carry out an OLS estimation of the relationship in the noncensored data, for example, determinants of how many hours people work among those who work at least some hours. But this approach, which amounts to truncating the distribution, also produces inconsistent estimates (Long 1997, 188-190). Tobin's solution was to divide observations into two sets: uncensored and censored observations. Formally, for observed values of dependent variable Y censored at some value τ, we have

$$Y_i = \begin{cases} Y_i^* = a + \sum_k b_k X_{ik} + e_i & \text{if } Y_i^* > \tau \\ \tau_Y & \text{if } Y_i^* \le \tau \end{cases} \qquad (14.12)$$
That is, the observed value of Y is equal to the "true" value of Y, Y*, if Y* is above the value at which observations are censored and is equal to some constant value (usually, but not necessarily, the value at which observations are censored) if the true value is at or below the value at which observations are censored. For the first set, estimates are derived in the same way as in ordinary least-squares estimation. For the second set, it is possible to estimate the probability that an observation is censored, conditional on the values of the independent variables, and to use this probability to estimate the likelihood. These estimates are then combined to produce expected values for all observations, conditional on the values of the independent variables:
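$$E(Y_i \mid x_i) = \left[\Pr(\text{uncensored} \mid x_i) \times E(Y_i \mid Y_i > \tau, x_i)\right] + \left[\Pr(\text{censored} \mid x_i) \times \tau_Y\right] \qquad (14.13)$$

where $x_i = a + \sum_k b_k X_{ik}$. For an accessible exposition of the mathematics involved, see Long (1997, Chapter 7).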
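The combination of the two pieces in Equation 14.13 can be sketched numerically. The following is a minimal illustration, not the procedure Stata uses: it assumes normally distributed errors with a known standard deviation `sigma` (the usual tobit assumption, which the text does not spell out), takes the linear prediction `xb` as given, and uses the hypothetical function name `tobit_predictions`.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def tobit_predictions(xb, sigma, tau=0.0, tau_y=None):
    """Return (Pr(uncensored), E(Y | Y > tau), E(Y)) for censoring from below.

    xb    : linear prediction a + sum(b_k * X_k)  (assumed already computed)
    sigma : standard deviation of the normal error (assumption)
    tau   : censoring threshold
    tau_y : value recorded for censored cases (defaults to tau)
    """
    if tau_y is None:
        tau_y = tau
    delta = (xb - tau) / sigma
    p_uncensored = normal_cdf(delta)
    if p_uncensored == 0.0:
        # Essentially every case is censored; the truncated mean tends to tau.
        truncated_mean = tau
    else:
        # Mean of the latent variable among uncensored cases.
        truncated_mean = xb + sigma * normal_pdf(delta) / p_uncensored
    # Equation 14.13: weight the two pieces by their probabilities.
    expected = p_uncensored * truncated_mean + (1.0 - p_uncensored) * tau_y
    return p_uncensored, truncated_mean, expected


p, truncated, expected = tobit_predictions(xb=0.0, sigma=1.0)
print(p, truncated, expected)
```

With a linear prediction exactly at the threshold, half the cases are expected to be censored, so the expected observed value is half the truncated mean.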
The tobit model has been extended and generalized in a number of ways:

• to allow for right censoring and both left and right censoring (that is, at low values and high values of a distribution)
• to allow for the possibility that different observations are censored at different values (for example, income when several years of the GSS are pooled)
• to allow for situations in which an underlying continuous variable is coded into a set of categories (in many surveys income is coded this way)
• to correctly estimate effects where observations are truncated
• to deal with sample-selection problems

In the following section I provide a worked example that illustrates many of these extensions. (For estimation details, see the Stata downloadable files "ch14_3.do" and "ch14_3.log.")
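The claim made earlier in this section, that OLS is inconsistent both when censored cases are kept at the censoring value and when they are discarded, can be illustrated with a small simulation. This is only a sketch under assumed values: the intercept, slope, error standard deviation, sample size, and seed are all hypothetical choices, not taken from the GSS example.

```python
import random


def ols_slope(xs, ys):
    """Slope of a bivariate OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(12345)  # fixed seed so the run is reproducible
n = 20000
true_slope = 1.0

# Latent variable: Y* = -5 + 1.0 * X + e, so roughly half the cases fall below zero.
xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
latent = [-5.0 + true_slope * x + rng.gauss(0.0, 2.0) for x in xs]

# Censoring from below: values at or below zero are recorded as zero.
censored = [max(y, 0.0) for y in latent]

# Truncation: censored cases are discarded altogether.
kept = [(x, y) for x, y in zip(xs, latent) if y > 0.0]

slope_latent = ols_slope(xs, latent)            # what we would like to recover
slope_censored = ols_slope(xs, censored)        # OLS on the full censored data
slope_truncated = ols_slope([x for x, _ in kept], [y for _, y in kept])

print(slope_latent, slope_censored, slope_truncated)
```

In this setup both the censored-data slope and the truncated-sample slope fall well short of the slope that generated the latent variable, which is the pattern the tobit model is designed to correct.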
A Worked Example: Frequency of Sex

The 2000 GSS included the question "About how often did you have sex during the last twelve months?" The response categories and their codes are shown in Table 14.6. Clearly, these data are censored both below and above. Those who have not had sex at all in the last year include those who have never ever had sex and those who have simply been unlucky in the past year, with others in between. At the other extreme, treating "more than three times a week" as four times a week, or five times a week, may understate the prowess of newlyweds and other sexual athletes. Finally, some categories include a range, which might or might not be optimally represented by the midpoint.

TABLE 14.6. Codes for Frequency of Sex in the Past Year, U.S. Adults.
[Columns: Midpoint, Lower Bound, Upper Bound. Only two row labels are legible in the source: "2 or 3 times a month" and "2 or 3 times a week."]

To illustrate the effect of censoring, let us consider a simple model in which frequency of sex is predicted from age, gender, and marital status (currently married versus not). In fact, in this and most analyses involving age, it would be better to include a squared term. However, I do not do so just yet because including only linear terms makes the exposition easier. Table 14.7 shows the results for four estimates:

• ordinary least-squares estimates with the categories coded at their midpoints but with an arbitrary top code of 208 for "more than 3 times a week" (= 52 × 4)
• tobit estimates with censoring from below
• tobit estimates with censoring both below and above
• interval regression estimates with censoring both below and above
Comparing the coefficients in the two left columns, we see that the effect of censoring from below is severe. Failure to take proper account of such censoring results in an
TABLE 14.7. Alternative Estimates of a Model of Frequency of Sex, U.S. Adults, 2000 (N = 2,258). (Standard Errors in Parentheses; All Coefficients Significant at .001 or Beyond.)

Columns: Model 1: OLS; Model 2: Tobit, Left Censored; Model 3: Tobit, Left and Right Censored; Model 4: Interval, Left and Right Censored.
[Row labels are not recoverable from the source. Legible cell values: 119.2 (6.8); 118.4; -1.41 (0.09); -2.16 (0.12); 71.7.]
underestimate of the effect of marital status on frequency of sex by about half and a very substantial underestimation of the effects of age and of being male. Interestingly, taking account of censoring from above as well as below hardly changes the coefficients, suggesting that marital status, age, and gender have little impact on the probability of being extremely sexually active. Inspection of the probability of censorship from above confirms this supposition: even among the most sexually active group, young married men, no more than about 15 percent have sex more than three times per week. By contrast, there is great variability by marital status, sex, and especially age in the probability of never having had sex in the last year, ranging from about 3 percent of young married men to about 90 percent of elderly unmarried women.

Apart from the probabilities, three predictions are of interest: the linear prediction from the model, the censored prediction, and the truncated prediction. Graphs of these predicted values for Model 4 are shown in Figure 14.1 by age, for married women. The linear prediction is the prediction from the model, which tells us that, net of other factors, the frequency of sex per year declines by about 2.3 occasions per year of age. The graph tells us that for married women the frequency of sex declines to less than once a year at about age seventy. Although negative observed values make no sense, the linear prediction gives the values of a latent, or underlying, variable. We can think of this variable as the propensity for sex, which declines steadily with age (because, of course, we have modeled the frequency of sex as a linear function of age).

The censored prediction equals the latent prediction when the dependent variable is observed and equals the censoring value when the dependent variable is censored. (Somewhat confusingly, Stata calls censored predictions the "ystar" option, although Y* is usually taken to indicate the latent variable, as it is in Equation 14.12.) Thus in this case, we assume that 0 and 208 are true values for those in the lowest and highest categories. By construction, censored predictions must fall within the range of the uncensored observations.

FIGURE 14.1. Three Estimates of the Expected Frequency of Sex per Year, U.S. Married Women, 2000 (N = 552). [Horizontal axis: Age.]

The truncated prediction is defined only for those observations that are not censored. In this case the truncated prediction gives the predicted frequency of sex among those who had any sex at all in the last year. Note that neither the censored prediction nor the truncated prediction is linear. Thus, these predictions must be evaluated at specific levels of the independent variables. Most commonly we will be interested in the linear prediction.

Now that we see how to interpret tobit coefficients, let us extend the analysis slightly to make it more substantively plausible. I do this by adding a squared term for age and studying interactions between age, gender, and marital status. As it happens, it is not necessary to posit three-way interactions among marital status, gender, and, respectively, age and age squared; a model positing the three-way interactions does not fit significantly better than a model with the two sets of two-way interactions, between gender and, respectively, age and age squared, and between marital status and, respectively, age and age squared. The coefficients for this model are shown in downloadable file "ch14_3.log." Because they are difficult to interpret directly, I have graphed (in Figure 14.2) the relationship between age and the frequency of sex for each gender-marital status combination.

Inspecting the graph, we see that (no surprise) married people have more active sex lives than do currently unmarried people of the same age and gender, and that sexual activity declines at an increasing rate with age.
FIGURE 14.2. Expected Frequency of Sex per Year by Gender and Marital Status, U.S. Adults, 2000 (N = 2,258). [Horizontal axis: Age. Lines: currently married men; currently married women; not married men; not married women.]
Interestingly, in both marital status categories, men report more active sex lives than do women of the same age and marital status. The reason for the gender discrepancy within marital status categories is not completely clear but probably reflects a tendency for men to overreport and (or) for women to underreport their sexual activity. Note that, considering only heterosexual activity, both the average number of sexual encounters and the average number of partners must be identical for males and females. Thus there clearly is biased reporting; differential nonresponse (for example, the likelihood that women with many sexual partners, such as prostitutes, are underrepresented in the GSS); or more reported homosexual activity among men than among women.

Married men and women both average about one partner (precisely, 1.03 and .9[...]), which suggests that for both married men and married women their spouse is usually their only partner, which in turn would imply that the average number of sexual encounters should be the same for currently married men and women, adjusting for the three-year average difference in age. However, inspection of Figure 14.2 shows a difference larger than can be explained by the age gap (if the age gap were the full explanation, the lines would be parallel for married men and women, and a line segment of three years' length drawn to the left of the male line and parallel to the x-axis should just touch the female line). This suggests the possibility that either or both married men and married women distort their reports of the frequency of sexual activity in a socially desirable direction, men claiming sexual prowess and women claiming sexual modesty. The likelihood of distortion is substantially greater among the unmarried: unmarried men on average report about twice as many partners in the last year as do unmarried women (1.95 compared to .90), which, given that this discrepancy is far too large to be accounted for by differential homosexual activity, suggests the possibility that unmarried men and women distort both the number of partners and the frequency of sexual activity in the socially desirable direction.

Another possibility is that the propensity for women to be younger than their sex partners partly accounts for the gender difference in reported sexual activity among the unmarried. Adjudicating among these possibilities would require more analysis than is warranted here.
OTHER MODELS FOR THE ANALYSIS OF LIMITED DEPENDENT VARIABLES

This introduction by no means exhausts the variety of procedures available for the analysis of limited dependent variables. Stata 10.0 includes commands to carry out a number of procedures, including

• Conditional logistic regression and mixed models, where outcomes depend on features of the outcomes as well as on characteristics of the individuals. For examples, see Boskin (1974), Hoffman and Duncan (1988), White and L[...] (1998), and Yanovitzky and Cappella (2001).
• Nested logistic regression, which extends conditional logit analysis by dividing outcomes into a hierarchy of levels. For examples, see Cameron (2000), Soopramanien and Johnes (2001), and South and Baumer (2001).
• Probit regression, an alternative to logistic regression. For a brief introduction, see Appendix 13.B.
• Poisson regression, used to model counts, the number of occurrences of an event. A classic example is von Bortkiewicz's 1898 study of the number of soldiers kicked to death by horses in the Prussian army. Applications in the social sciences include Long (1990), Greenberg (1991), Rasler (1996), Chattopadhyay and others (2006), and Weitoff and others (2008). The definitive statistical treatment of Poisson regression is Cameron and Trivedi (1998).
Long (1997), Hosmer and Lemeshow (2000), and Powers and Xie (2000) provide excellent introductions to many of these procedures that, with a bit of diligence, are accessible to social scientists who have a modest statistical background. Long and Freese (2006) provide a guide to using the procedures in Stata. For a useful overview, see Gould (2000).
WHAT THIS CHAPTER HAS SHOWN

In this chapter we have seen how to estimate models for three types of limited dependent variables: ordinal variables, for which ordinal logit analysis is the appropriate method; categorical variables, for which multinomial logit analysis is the appropriate method; and censored variables (where values above or below some cutting point are not observed), for which tobit modeling is the appropriate method.
IMPROVING CAUSAL INFERENCE: FIXED EFFECTS AND RANDOM EFFECTS MODELING

WHAT THIS CHAPTER IS ABOUT

In this chapter we consider two closely related techniques for coping with omitted variable bias. Recall from Chapter Six that omitted variable bias occurs when we have failed to include in our model variables that affect the outcome and that are correlated with one or more of the predictor variables. The techniques discussed in this chapter for estimating unbiased coefficients are known as fixed effects and random effects models. These models use information on the same individuals from two or more time points or information on two or more individuals within groups (families, schools, firms, communities, or similar groups) to purge the estimating equation of all characteristics, measured or unmeasured, that are constant over time or constant within groups. The result is that the estimated effects of the characteristics we are able to measure are unbiased by unobserved time-invariant factors. For useful introductions to these techniques, see Allison (2005) and Wooldridge (2006, Chapters 13 and 14), both of which I draw on in this chapter.
INTRODUCTION

As we have seen at many points in this book, the nonexperimental methods we have been studying are vulnerable to omitted variable bias: the possibility that unmeasured factors affect both the predictor and outcome variables. In this case the coefficients we estimate through OLS or logistic regression will be incorrect. To appreciate this most fully, it is helpful to contrast the linear model approaches we have been studying with randomized experiments. In the classic randomized experiment, individuals are randomly assigned to two groups; members of the treatment group are exposed to some sort of intervention whereas members of the control group are not, and differences in one or more outcomes are measured. (This design can be generalized to include several different treatment groups, but the logic remains unchanged.) Because the treatment and control groups are, within the limits of sampling error, identical on average in their pretreatment attributes (or, to put the same point differently, because receipt of treatment is uncorrelated with pretreatment attributes of individuals), any difference in average outcomes may be assumed to be caused by the treatment.

With linear model approaches, we attempt to approximate randomized experiments by statistically controlling for as many confounding factors (that is, factors correlated with both the predictor and outcome variables) as possible. For example, if we observe that men earn more than women, we might wonder whether this is due to discrimination. However, before we accepted such a conclusion we would want to consider whether, at least in part, the pay gap is due to the fact that men are more likely to have technical training, to enter high-paying fields, to have more work experience, and to work longer hours. We would then statistically control for these variables and assess the effect of gender on earnings among people who are identical with respect to the control variables. If we still found a gender difference in pay, we might then be willing to attribute the remaining difference to discrimination. However, we would be vulnerable to the claim that we had not included other crucial factors that could result in pay differences. For example, women may not bargain as effectively as men and may therefore accept jobs at lower pay levels than men. If we omit a measure of bargaining prowess (or if we measure bargaining prowess imperfectly so that true bargaining prowess remains partly unmeasured), then any effect of prowess would be captured by the error term. However, if bargaining prowess is correlated with gender, the assumption of OLS (and other linear models) that the error is uncorrelated with the predictor variables would be violated, producing biased coefficients.

So what can we do? It turns out that if we have measurements on the same individuals for at least two points in time, we can get unbiased estimates of the effects of variables that, at least for some individuals being studied, change over time. We do this by predicting change in our outcome variable from changes in our predictor variables, which has the effect of purging from our prediction equation those factors, measured and unmeasured, that do not change over time. But there is no such thing as a free lunch. The cost of this method, known as fixed effects (FE) modeling, is twofold: (1) We are unable to estimate the "main effects" of predictors that do not vary over time for individuals, for example, sex and race (although we are able to estimate interaction effects involving such
variables and variables that do change over time, a point to which we will return in the context of our gender pay gap example, and we also are able to estimate effects that change over time for time-constant variables). (However, recent work by Bollen and Brand [2008] has shown how, with suitable assumptions about unobserved latent factors, it is possible to obtain effects of time-constant predictors within a structural equation modeling (SEM) framework, a set of techniques briefly discussed in the next chapter.) (2) When we are analyzing limited dependent variables, we usually will have a substantial reduction in sample size because in FE logistic regression, individuals who do not change over time on the outcome variable are dropped from the analysis. However, under some circumstances, and with some additional assumptions, we can recover our sample size by resorting to what is known as random effects (RE) modeling. We will consider this approach later in the chapter.
FIXED EFFECTS MODELS FOR CONTINUOUS VARIABLES

To see how FE works for continuous outcome variables, let us write a prediction equation:

$$y_{it} = \mu_t + \beta x_{it} + \gamma z_i + \alpha_i + \varepsilon_{it}; \qquad i = 1, \ldots, n; \quad t = 1, \ldots, T \qquad (15.1)$$

where $y_{it}$ is the value of the outcome variable for the ith individual at time t; $\mu_t$ is an intercept that is allowed to vary with time; $x_{it}$ is a vector of variables that vary both over individuals and, for each individual, over time; $z_i$ is a vector of variables that vary over individuals but, for each individual, not over time; $\alpha_i$ represents unmeasured differences between individuals, that is, differences not accounted for by the $z_i$, that are fixed over time; and $\varepsilon_{it}$ represents idiosyncratic factors that vary both over time and across individuals.

To simplify the discussion, assume that T = 2, although the same conclusions hold when T > 2. Now suppose we simply pooled observations from the two time points and estimated our outcome through OLS. Clearly, insofar as omitted variables are correlated with the variables in the model (as in our example involving bargaining prowess), doing this will produce biased estimates because the fundamental assumption of OLS, that the error term (which in this case is the sum $\alpha_i + \varepsilon_{it}$ because $\alpha_i$ is unobserved) is uncorrelated with the predictor variable, will be violated.
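The Fundamental FE Equation

However, suppose we write separate equations for each time period and subtract one from the other. Subtracting

$$y_{i1} = \mu_1 + \beta x_{i1} + \gamma z_i + \alpha_i + \varepsilon_{i1}$$

from

$$y_{i2} = \mu_2 + \beta x_{i2} + \gamma z_i + \alpha_i + \varepsilon_{i2} \qquad (15.2)$$

yields

$$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1}) \qquad (15.3)$$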
Notice that both $\gamma z_i$, the effect of the time-constant predictor variables, and $\alpha_i$ have dropped out of Equation 15.3, which is why equations of this sort are known as first-difference equations. Equation 15.3 is purged of all unmeasured factors that are constant over time, as well as any measured factors that are constant over time. Thus Equation 15.3 solves the omitted-variable-bias problem, assuming that there are no unmeasured factors whose effects change over time; this is a nontrivial assumption. Two further conditions are required: that $x_{i1}$ and $x_{i2}$ are not perfectly correlated (ruling out, for example, age as a candidate predictor); and that the observed predictor variables are uncorrelated with the idiosyncratic error terms, $\varepsilon_{i1}$ and $\varepsilon_{i2}$, at both time points, that is, that the observed predictor variables are strictly exogenous. Crucially, the predictor values observed at the later time point must not depend on the outcome observed at the earlier time point.
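A small simulation may help make the logic of Equation 15.3 concrete. It is only a sketch under assumed values: the data-generating process, the coefficient `beta = 2.0`, the sample size, and the seed are hypothetical, and the unmeasured individual effect `alpha` is deliberately built into the predictor so that pooled OLS is biased.

```python
import random


def ols_slope(xs, ys):
    """Slope of a bivariate OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(2009)  # fixed seed so the run is reproducible
n = 5000
beta = 2.0
mu = (0.0, 1.0)  # time-specific intercepts for time 1 and time 2

x1, x2, y1, y2 = [], [], [], []
for _ in range(n):
    alpha = rng.gauss(0.0, 1.0)        # unmeasured, constant over time
    xa = alpha + rng.gauss(0.0, 1.0)   # predictor at time 1, correlated with alpha
    xb = alpha + rng.gauss(0.0, 1.0)   # predictor at time 2, correlated with alpha
    x1.append(xa)
    x2.append(xb)
    y1.append(mu[0] + beta * xa + alpha + rng.gauss(0.0, 1.0))
    y2.append(mu[1] + beta * xb + alpha + rng.gauss(0.0, 1.0))

# Pooled OLS ignores alpha, which then sits in the error term.
pooled_slope = ols_slope(x1 + x2, y1 + y2)

# First-difference regression, as in Equation 15.3: alpha cancels out.
dx = [b - a for a, b in zip(x1, x2)]
dy = [b - a for a, b in zip(y1, y2)]
fd_slope = ols_slope(dx, dy)

print(pooled_slope, fd_slope)
```

Under these assumptions the pooled slope overstates the effect of the predictor, whereas the slope from the differenced data is close to the value used to generate the data.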
Allowing the Slopes of the Xs to Vary

Notice that in Equation 15.3 it is assumed that the effects of the predictor variables, the xs, are constant over time. This assumption can be tested by estimating a first-difference equation in which the slopes of the xs are allowed to vary. To see this, consider the following pair of equations:

$$y_{i1} = \mu_1 + \beta_1 x_{i1} + \gamma z_i + \alpha_i + \varepsilon_{i1}$$

and

$$y_{i2} = \mu_2 + \beta_2 x_{i2} + \gamma z_i + \alpha_i + \varepsilon_{i2} \qquad (15.4)$$

Here, subtracting the first equation of 15.4 from the second yields

$$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta_2 (x_{i2} - x_{i1}) + (\beta_2 - \beta_1) x_{i1} + (\varepsilon_{i2} - \varepsilon_{i1}) \qquad (15.5)$$

That is, to test the hypothesis that the slope of any of the xs differs over time, we include both the time 1 variable and the difference score. Then, if the coefficient for the time 1 variable differs significantly from zero, we conclude that the slopes are not equal and can get the value of the time 1 slope by subtracting the coefficient for the time 1 variable from the coefficient for the difference score.
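The subtraction rule just described follows from a one-line rearrangement of the slope terms. The coefficient values below are hypothetical, chosen only to illustrate the arithmetic:

$$\beta_2 x_{i2} - \beta_1 x_{i1} = \beta_2 (x_{i2} - x_{i1}) + (\beta_2 - \beta_1) x_{i1}$$

so the coefficient for the difference score estimates $\beta_2$ and the coefficient for the time 1 variable estimates $\beta_2 - \beta_1$. If, say, the estimated coefficient for the difference score were 0.50 and the estimated coefficient for the time 1 variable were 0.20, then

$$\hat{\beta}_2 = 0.50, \qquad \hat{\beta}_1 = 0.50 - 0.20 = 0.30$$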
Testing Whether the Effects of the Time-Invariant Variables Vary over Time

We also can allow the coefficients for the time-invariant variables to change over time. To see this, consider the following pair of equations:

$$y_{i1} = \mu_1 + \beta x_{i1} + \gamma_1 z_i + \alpha_i + \varepsilon_{i1}$$

and

$$y_{i2} = \mu_2 + \beta x_{i2} + \gamma_2 z_i + \alpha_i + \varepsilon_{i2} \qquad (15.6)$$

Subtracting the first equation of 15.6 from the second yields

$$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + (\gamma_2 - \gamma_1) z_i + (\varepsilon_{i2} - \varepsilon_{i1}) \qquad (15.7)$$

From Equation 15.7 we see that it is possible to assess the claim that the effects of the zs do not vary over time, by testing the significance of the coefficients associated with the z variables. Note that these coefficients do not show the effects of the zs but rather the differences in the effects of the zs between time 2 and time 1.
Interactions Between Time-Constant and Time-Varying Variables

As noted previously, we generally cannot get the effect of time-constant variables from the FE model (but see Bollen and Brand [2008]). However, we can get the effect of the interaction of the time-constant variables with the time-varying variables, the xs. To see this, consider the following pair of equations:

$$y_{i1} = \mu_1 + \beta x_{i1} + \gamma z_i + \delta x_{i1} z_i + \alpha_i + \varepsilon_{i1}$$

$$y_{i2} = \mu_2 + \beta x_{i2} + \gamma z_i + \delta x_{i2} z_i + \alpha_i + \varepsilon_{i2} \qquad (15.8)$$

Subtracting the first equation of 15.8 from the second yields

$$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + \delta z_i (x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1}) \qquad (15.9)$$
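Equation 15.9 implies that, for a dichotomous time-constant variable, the slope of the change in y on the change in x differs between the two groups even though the main effect of the time-constant variable itself drops out. The following is a minimal simulated sketch of that implication; the coefficients `beta` and `delta`, the main effect of 3.0, the sample size, and the seed are all hypothetical.

```python
import random


def ols_slope(xs, ys):
    """Slope of a bivariate OLS regression of ys on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


rng = random.Random(15)  # fixed seed so the run is reproducible
beta, delta = 1.0, 0.5   # slope when z = 0, and the extra slope when z = 1

dx = {0: [], 1: []}
dy = {0: [], 1: []}
for i in range(10000):
    z = i % 2                                   # time-constant dichotomy
    alpha = rng.gauss(0.0, 1.0) + z             # unmeasured, correlated with z
    x1 = alpha + rng.gauss(0.0, 1.0)
    x2 = alpha + rng.gauss(0.0, 1.0)
    # The main effect of z (3.0) and alpha appear at both times and cancel on differencing.
    y1 = beta * x1 + 3.0 * z + delta * x1 * z + alpha + rng.gauss(0.0, 1.0)
    y2 = 0.5 + beta * x2 + 3.0 * z + delta * x2 * z + alpha + rng.gauss(0.0, 1.0)
    dx[z].append(x2 - x1)
    dy[z].append(y2 - y1)

slope_z0 = ols_slope(dx[0], dy[0])   # estimates beta
slope_z1 = ols_slope(dx[1], dy[1])   # estimates beta + delta

print(slope_z0, slope_z1)
```

Fitting the differenced regression separately within each group is used here only to keep the sketch short; a single regression including the product of z and the difference score, as in Equation 15.9, is the form given in the text.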
For example, returning to the effect of gender on income, the FE model does not allow a direct assessment of the role of gender in creating income differences. However, it does allow us to determine, say, gender differences in the effect of changes in performance evaluation scores on changes in income. Suppose gender is coded 1 for males and 0 for females and x (now designating a single variable rather than a vector of variables) is a performance evaluation measure. Then we would have:

for females: $$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + \beta (x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1})$$

for males: $$y_{i2} - y_{i1} = (\mu_2 - \mu_1) + (\beta + \delta)(x_{i2} - x_{i1}) + (\varepsilon_{i2} - \varepsilon_{i1}) \qquad (15.10)$$

With More than Two Time Points

When we have three or more measurements per individual, which we increasingly do because a number of multiple-wave data sets are now in the public domain, there are several possibilities, of which two are simple extensions of the methods we have just discussed. Consider each of these.

First, we may analyze two waves at a time, computing first differences between successive waves. This approach has the limitation that, unless we turn to advanced methods such as generalized least squares, we cannot get a single set of coefficients for all waves combined, but rather T - 1 sets of coefficients for T waves of data. Thus the successive-waves approach is likely to be of greatest interest when the number of waves is small, say three or four.
An alternative is to pool the data over waves; then, for each variable in the data, compute the average value for each individual over waves; compute the difference between the wave-specific value and that individual's average; and use the resulting deviation scores in a conventional OLS regression. First compute

$$\bar{y}_i = \frac{1}{n_i} \sum_t y_{it}, \qquad \bar{x}_i = \frac{1}{n_i} \sum_t x_{it} \qquad (15.11)$$

where $n_i$ is the number of observations for person i. Then, for each variable, subtract the mean from the observed value:

$$y^*_{it} = y_{it} - \bar{y}_i \quad \text{and} \quad x^*_{it} = x_{it} - \bar{x}_i \qquad (15.12)$$

This yields an equation of the form

$$y^*_{it} = \sum_t \mu_t D_t + \beta x^*_{it} + \varepsilon^*_{it} \qquad (15.13)$$

where the $D_t$ are dummy variables that allow the intercept to differ by time. Notice that the $z_i$ and $\alpha_i$ that are in Equation 15.1 drop out because they are constant within individuals, so that their deviation scores are zero. Equation 15.13 can be estimated through OLS, except that [...]. The reason is that Equation 15.13 is the equivalent of including a dummy variable for each individual in the sample, with the individual effects removed by taking deviation scores instead of being represented by dummy variables. Such equations can be estimated with the -xt- commands in Stata. The various elaborations of FE models discussed earlier in this chapter also can be adapted to the analysis of more than two waves and need not be further discussed.
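The deviation-score computation can be sketched in a few lines. The example below is a toy illustration, not the estimator implemented in Stata: the panel is hypothetical, it is deliberately unbalanced (different numbers of waves per person), the outcome is constructed without error or period effects so that the true slope of 2.0 is recovered exactly, and the time dummies of Equation 15.13 are therefore omitted.

```python
def within_transform(panel):
    """Convert each person's (x, y) observations to deviations from that person's means.

    panel: dict mapping a person identifier to a list of (x, y) pairs, one per wave.
    """
    x_dev, y_dev = [], []
    for observations in panel.values():
        n_i = len(observations)
        x_bar = sum(x for x, _ in observations) / n_i
        y_bar = sum(y for _, y in observations) / n_i
        for x, y in observations:
            x_dev.append(x - x_bar)
            y_dev.append(y - y_bar)
    return x_dev, y_dev


def slope_through_origin(xs, ys):
    """OLS slope with no intercept, appropriate for mean-deviated data."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


# Hypothetical person-specific effects (the alphas) and predictor values by wave.
alphas = {"a": 10.0, "b": -4.0, "c": 0.5}
x_values = {"a": [1.0, 2.0, 4.0], "b": [3.0, 5.0], "c": [0.0, 1.0, 2.0, 6.0]}

# Outcome generated as y = 2x + alpha, with no error term, for illustration only.
panel = {
    person: [(x, 2.0 * x + alphas[person]) for x in x_values[person]]
    for person in alphas
}

x_dev, y_dev = within_transform(panel)
within_slope = slope_through_origin(x_dev, y_dev)
print(within_slope)
```

Because each person's effect is constant across that person's waves, it is subtracted out along with the person's mean, whatever its size and however strongly it is related to the predictor.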
Fixing Effects Across Individuals Rather Than over Time
So far we have discussed methods that can be applied when we have observations for the same individuals at more than one point in time. The same logic applies when we have observations for more than one individual within a larger unit, such as a family or community. Consider the relationship between education and income. The association may be due, in part, to unobserved characteristics of families that make their members successful both in school and in the job market. One way to control for such unobserved family characteristics would be to compare siblings, relating differences in their level of education to differences in their income. Ashenfelter and Krueger (1994) carried out an analysis of this kind and found that the effects of education on income were in fact slightly stronger than conventional estimates, controlling for gender, age, and race.
PANEL SURVEYS IN THE PUBLIC DOMAIN
Major U.S. panel studies of interest to social scientists include
• Panel Study of Income Dynamics (PSID): http://psidonline.isr.umich.edu
• National Longitudinal Surveys (NLS): http://www.bls.gov/nls
• Wisconsin Longitudinal Study (WLS): http://www.ssc.wisc.edu/wlsresearch
• Health and Retirement Study (HRS): http://hrsonline.isr.umich.edu
• National Longitudinal Study of Adolescent Health (Add Health): http://www.cpc.unc.edu/addhealth
Important foreign panel studies include
• China Health and Nutrition Survey (CHNS): http://www.cpc.unc.edu/china
• German Socio-Economic Panel Study (SOEP): http://www.diw.de/english/sop/index.html
• Indonesia Family Life Survey (IFLS): http://www.rand.org/labor/FLS/IFLS
• Mexican Family Life Survey (MxFLS): http://www.radix.uia.mx/ennvih/main.php?lang=en
• Mexican Health and Aging Study (MHAS): http://www.mhas.pop.upenn.edu/english/home.htm
Many additional panel surveys more or less comparable to the PSID are listed at http://psidonline.isr.umich.edu/Guide/PanelStudies.aspx.
Now consider a second example. In an analysis of Indonesian data, Frankenberg and Mason (1995) studied the effect of maternal education on behaviors conducive to children's health, including sanitation and hygiene practices such as the source and treatment of drinking water, waste disposal practices, and so on. However, in developing nations such as Indonesia, both a mother's level of education and the possibility of easily obtaining safe water or protecting against contamination from human waste tend to vary across communities, depending on their level of development. In this situation one would want to purge the association between maternal education and child health-related practices of the confounding influences of community characteristics. This is what Frankenberg and Mason did by fixing community characteristics and relating differences in health practices to differences in maternal education among women in the same communities. In this way they were able to show a causal effect of maternal education on behavior conducive to child health.
Limitations of Fixed Effects Approaches and Cautions to Keep in Mind
Like all other statistical procedures, FE approaches carry a set of assumptions and requirements. When these are violated, FE coefficients may be worse (more biased) than simply pooling data and obtaining OLS estimates. Unfortunately, often these assumptions are untestable. Here are some cautions:
• If unmeasured effects do change over time (or, in the cross-sectional application just discussed, do vary across individuals), FE estimation does not solve the bias problem. It is thus necessary to think carefully about whether the assumption of time-constant unmeasured effects is tenable. The same point holds even more strongly for family or community fixed effects: one has to assume that none of the unmeasured factors affecting the outcome varies across individuals within families or communities. This is often dubious, especially within families. To convince yourself of this, think of recent U.S. presidents and their ne'er-do-well siblings; or simply consider variations among siblings in families you know. Could such differences account for differences in the kinds of outcomes studied with family FE models? This is a crucial question, often ignored by researchers. (Of course, unmeasured effects that change over time also bias OLS coefficients. So resorting to OLS regression in such cases is no solution.)
• The predictor variables must be strictly exogenous, conditional on the unobserved variables. That is, we must assume that once we control for the unobserved variables, there is no remaining correlation between the predictor variables and the idiosyncratic errors, the $x$s and the $\varepsilon$s. One common way strict exogeneity is violated is when one or more of the predictor variables depends on the outcome variable measured at a previous point in time. For example, if we were studying how the crime rate responds to changes in the size of the police force, and the size of the police force were determined by the crime rate in the previous year, the strict exogeneity assumption would be violated.
• Relative to variability in the outcome variable, there must be sufficient variability over time in the predictor variables (or across individuals in the cross-sectional FE approach). What is sufficient? This is difficult to quantify. Still, it is obvious that predictor variables that hardly vary can have little impact on the outcome, just as in OLS analysis one cannot predict a variable from a constant and will do a poor job of trying to predict a variable from a near constant.
• A corollary of the previous point is that variables that differ only by a linear transformation over time are regarded as unchanged over time. Thus, for example, age cannot be included in an over-time FE analysis because age at time 2 is identical to age at time 1 plus a constant. It then follows that variables that differ over time by a near linear transformation create problems.
• The predictor variables must be reliably measured. As Wooldridge notes, "Differencing a poorly measured regressor reduces its variation relative to its correlation with the differenced error caused by classical measurement error, resulting in a potentially sizable bias" (2006, 475).
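The point about age can be made concrete. The sketch below, in Python with NumPy (an illustrative assumption, with made-up data and hypothetical names), shows that once each person's mean is removed, age carries exactly the same information as the wave number, so its main effect cannot be separated from the time effect.

```python
import numpy as np


def demean_by_person(values, person):
    """Return deviations of each observation from its person-specific mean."""
    values = np.asarray(values, dtype=float)
    _, idx = np.unique(person, return_inverse=True)
    means = np.bincount(idx, weights=values) / np.bincount(idx)
    return values - means[idx]


# Hypothetical two-wave panel, one year apart.
person = np.array([1, 1, 2, 2, 3, 3])
wave = np.array([1, 2, 1, 2, 1, 2])
age = np.array([30, 31, 45, 46, 22, 23])

age_dev = demean_by_person(age, person)
wave_dev = demean_by_person(wave, person)

# After demeaning, age and wave are identical columns.
collinear = bool(np.allclose(age_dev, wave_dev))
rank = int(np.linalg.matrix_rank(np.column_stack([age_dev, wave_dev])))
print(collinear, rank)
```

A design matrix containing both columns has rank 1 rather than 2, which is the numerical form of the statement that age is "regarded as unchanged over time."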
RANDOM EFFECTS MODELS FOR CONTINUOUS VARIABLES
Because FE models do not allow us to assess the size of the effects of time-invariant variables (or, in family, organizational, or community applications, variables that are invariant across individuals within units), there has been a strong incentive to find models that do yield such estimates. Among these, a frequently used approach is the random effects (RE) model. Like the FE model, the RE model can be written by starting with Equation 15.1. However, the assumptions are different. Whereas the FE model assumes that the $\alpha_i$ represent a set of fixed parameters, which are purged from the model by differencing, the RE model assumes that each $\alpha_i$ is a normally distributed random variable with a mean of zero and constant variance and that it is independent of $z_i$, $x_{it}$, and $\varepsilon_{it}$. This is a strong assumption.
Fortunately, it can be tested, using a test proposed by Hausman (1978). The strategy is to estimate corresponding FE and RE models and to compare the similarity of the coefficients using the Hausman test. If the null hypothesis of no difference is not rejected, we can conclude that the independence of the $\alpha_i$ is supported, which means that the RE model yields unbiased coefficients. Because the RE model yields estimates of the effects of the $z_i$, the RE model is to be preferred if the independence assumption is satisfied. If it is not satisfied, we must settle for the FE model and forgo estimates of the effects of the $z_i$. The Hausman test is quite restrictive and often does not support the RE model. Bollen and Brand (2008) offer a range of alternative statistics for comparing FE and RE models and also procedures for forming hybrid models. Bollen and Brand's procedures are based on structural equation modeling, which is beyond what is covered in this book but is briefly discussed in the next chapter.
How can we estimate the RE model?
The details are beyond what can be considered here, but it is possible to sketch the general approach. Because, by assumption, $\alpha_i$ is uncorrelated with the explanatory variables, the coefficients of these variables could be consistently estimated from a single cross-section. However, doing so would ignore at least half of the data (or more, for more than two time periods). Pooling the data and estimating the coefficients through OLS also would yield consistent estimates. However, neither procedure yields the correct standard errors. The reason for this is that the errors will be serially correlated over time. We can easily see this by replacing the two error terms in Equation 15.1 with a single term for the composite error:

$$u_{it} = \alpha_i + \varepsilon_{it} \tag{15.14}$$

Because $\alpha_i$ is included in the composite error for each time period, the $u_{it}$ are serially correlated over time, with the correlation given by

$$\operatorname{corr}(u_{it}, u_{is}) = \frac{\sigma^2_\alpha}{\sigma^2_\alpha + \sigma^2_\varepsilon}, \quad t \neq s \tag{15.15}$$

where $\sigma^2_\alpha = \operatorname{Var}(\alpha_i)$ and $\sigma^2_\varepsilon = \operatorname{Var}(\varepsilon_{it})$. However, it is possible to derive a generalized least-squares transformation that eliminates the serial correlation in the errors. Defining
l(4 +To""))""
rr- u
we can wrlte
+ B * (x " o -) 1 )+ (u " -\ u ' )
y,-)n=p.(1-))*p,Qr, ' ' -)r" )+
(l: ' r
t P""::l-t:;;3;"3';;;; l;Ji:X*:ffi,:f::i[:tJil sim'arity the Note ff ;;il;;erie.1]r;1e*resizeorthe=rwar':l lr*,TJ#Hx.:ilnTiH;:ffi in ttti*'ted (whichcanbedone "evenl it u^tJi on o2,o;'andi tiondepends lromthepooledi ' = iZ tun Ut tui*uted though OLS neednol concemus)' Equatlon' ' unOthecolr€ct standarderrc:' tn" time datato yield co"titt"nt "'u*uit"JJf the enor ter:: r "o"m"l"ntt u"tween FE and RE by rewriting Finallv' we can seettt" t"rut'JlJip Eouationi5.17 as u,,- \i, -- (l- \)a * e,,-)q
(1:,!
:: f ii':i::;:l:: i:; il""J#'J":'1il1i:ili:!, i11ilff ilfi HH:,'::t ffi ; ::i#;; ;i
""'*"' ",t:lllf :#,Lif":'ffiill3i,T,il'i;Li.if
ol tl approaches0. a larger frac on bv definition' the bias tncreases'
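To get a feel for how the weight behaves, the following minimal Python sketch evaluates the serial correlation of the composite errors, var_alpha / (var_alpha + var_eps), and the generalized least-squares weight, 1 - sqrt(var_eps / (var_eps + T * var_alpha)), for a few made-up variance values. The function names and the numbers are hypothetical, chosen only for illustration.

```python
import math


def composite_error_correlation(var_alpha, var_eps):
    """Correlation between composite errors for the same unit at two different times."""
    return var_alpha / (var_alpha + var_eps)


def gls_lambda(var_alpha, var_eps, n_periods):
    """Fraction of each unit's mean that the RE transformation subtracts."""
    return 1.0 - math.sqrt(var_eps / (var_eps + n_periods * var_alpha))


# Equal variances, two waves: moderate serial correlation and partial demeaning.
print(composite_error_correlation(1.0, 1.0))  # 0.5
print(gls_lambda(1.0, 1.0, 2))                # 1 - 1/sqrt(3), about 0.42

# No unit-level variance: lambda is 0 and RE collapses to pooled OLS.
print(gls_lambda(0.0, 1.0, 5))                # 0.0

# Unit-level variance dominates: lambda is near 1 and RE is close to FE.
print(gls_lambda(100.0, 1.0, 2))              # about 0.93
```

The weight also grows with the number of waves, so with long panels the RE and FE estimates tend to be close even when the unit-level variance is modest.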
A WORKED EXAMPLE: THE DETERMINANTS OF INCOME IN CHINA
To see how to estimate and interpret FE and RE models for continuous dependent variables, I consider the determinants of family income in China. In China, as elsewhere, income differs substantially across communities. Communities higher in the Chinese urban hierarchy (a seven-category classification that ranges from rural villages to province-level cities: Beijing, Chongqing, Shanghai, and Tianjin) tend to have higher average income. But they also have populations that are better endowed with both the education and the occupational status that generate income (see Table 15.1). The question thus arises as to whether the association between education and occupational status, on the one hand, and family income, on the other, simply reflects differences in community labor market conditions and other factors that affect income, for example, the tendency of those with more education to move disproportionately to [...].
To study this I use a sample of Chinese adults from the survey used in previous chapters. The sample design [...] one hundred rural villages and one hundred urban neighborhoods, with about thirty households in each. (See the appendix for details on the study design and for information on how to obtain the data.)
TABLE 15.1. Socioeconomic [...]

Size of place         [header lost]   Mean Occ.       Median Family   Median Fam. Inc.
                                      Status (ISEI)   Income(a)       per Worker(a)
Township or town      8.2             31.9            9,000           4,000
County-level city     10.4            49.8            10,680          5,000
Provincial capital    10.5            44.1            13,000          6,000
[row label lost]      7.6             29.9            7,000           2,775

N [...] 410 [...] 6,090(b)

(a) The survey was conducted during the summer of 1996. The income question was "Now from all sources, what was your family income in the past year?" During the relevant period (mid-1995 to mid-1996), the RMB was worth $0.12, with hardly any fluctuation.
(b) [...] data are missing for size of place or years of schooling. For the remaining columns, missing data on the variable are excluded.
In this analysis I predict family income (in RMB) from the education, occupational status, and age of the respondent or spouse (whichever is higher), the number of people in the household who are employed, and whether the household is engaged in various types of family enterprises. (Because in the survey used here no variable identifies the head of the household, I used the higher of the respondent's and spouse's characteristics as a proxy for the characteristics of the household head. This variable will be incorrect insofar as our respondents are other relatives of the household head, for example, adult children or siblings. In a serious analysis, I would develop a more sophisticated decision rule for deciding who is the head or how to characterize the socioeconomic status of the household. But for our present purposes, this proxy is adequate.) We would, of course, expect the education and occupational status of household members to affect household income. In addition, household income is likely to increase with age, which can be taken as a proxy for experience. I have no clear predictions regarding the effect of engaging in [...] production, agricultural sidelines, or nonagricultural sidelines, but I suspect that these aspects of entrepreneurship affect income.
[...] FE analysis [...] urban families tend to earn substantially [...] are valid only if the unobserved $\alpha_i$ are independent [...]. Can we do even better with [...] regression? However, as noted previously, [...] such as their position in the urban hierarchy [...] over time or variables that are constant across individuals within units, the $z_i$ [...] idiosyncratic errors. [...] model estimates are biased, hence we must settle [...] the standard errors are too [...] to permit much of an inference. [...]
$$\ln\left(\frac{p_{it}}{1 - p_{it}}\right) = \mu_t + \beta x_{it} + \gamma z_i + \alpha_i, \quad i = 1, \ldots, n; \; t = 1, 2 \tag{15.19}$$
where $p_{it}$ is the probability that $y_{it} = 1$ rather than 0, and the remaining terms are defined as in Equation 15.1. In addition, we need to assume that within individuals, $y_{i1}$ and $y_{i2}$ are independent. Then it follows that

$$\begin{aligned}
\Pr(y_{i1} = 0, y_{i2} = 0) &= (1 - p_{i1})(1 - p_{i2}) \\
\Pr(y_{i1} = 1, y_{i2} = 0) &= p_{i1}(1 - p_{i2}) \\
\Pr(y_{i1} = 0, y_{i2} = 1) &= (1 - p_{i1})\,p_{i2} \\
\Pr(y_{i1} = 1, y_{i2} = 1) &= p_{i1}\,p_{i2}
\end{aligned} \tag{15.20}$$

Because our goal is to estimate $\mu_t$ and $\beta$ while controlling for the time-invariant covariates, we use only variation within individuals to estimate these parameters. Thus, because individuals for whom the outcome variable $y_{it}$ does not change between time 1 and time 2 contribute no information, we drop them from the sample. We are left with the two middle rows of Equation 15.20. We take the log of the ratio of these probabilities to get an equation that "differences out" the $z$ and $\alpha$:

$$\ln\left[\frac{\Pr(y_{i1} = 0, y_{i2} = 1)}{\Pr(y_{i1} = 1, y_{i2} = 0)}\right] = \ln p_{i2} + \ln(1 - p_{i1}) - \ln p_{i1} - \ln(1 - p_{i2}) = \ln\left(\frac{p_{i2}}{1 - p_{i2}}\right) - \ln\left(\frac{p_{i1}}{1 - p_{i1}}\right) \tag{15.21}$$

Substituting the right-hand terms of Equation 15.19, we have

$$\ln\left[\frac{\Pr(y_{i1} = 0, y_{i2} = 1)}{\Pr(y_{i1} = 1, y_{i2} = 0)}\right] = (\mu_2 - \mu_1) + \beta(x_{i2} - x_{i1}) \tag{15.22}$$

Notice that the outcome variable in Equation 15.22 is the equivalent of the log odds of a "positive" outcome at time 2 for those whose outcomes differ between time 1 and time 2. Thus Equation 15.22 reduces to a conventional binomial logistic regression equation in which the predictor variables are the difference scores for the $x$s. However, because Equation 15.22 is estimated only for individuals whose outcome has changed, there is usually a large reduction in the sample size relative to the full sample. Keep this limitation in mind when interpreting FE logistic regression results.
FE estimation can be generalized to permit observations on more than two time points per person (or more than two people within a unit) by resorting to conditional maximum likelihood estimation. That is, when there are more than two observations per unit, the problem becomes a conditional logistic regression analysis. (The algebra involved is sketched in Allison [2005, 57-59]; see also the entry for the -clogit- command in StataCorp [2007] and the references cited there.)
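The two-wave result can be illustrated by simulation. The sketch below is in Python with NumPy (an assumption for illustration, not the software used in this book). It generates hypothetical data in which an unobserved, person-specific effect is correlated with the predictor, keeps only the cases whose outcome changed, and fits an ordinary logit of the time-2 outcome on the difference score. All parameter values, and the small Newton-Raphson routine, are made up for the example.

```python
import numpy as np


def fit_logit(design, outcome, n_iter=25):
    """Maximum likelihood binary logit via Newton-Raphson; returns coefficients."""
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        gradient = design.T @ (outcome - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


rng = np.random.default_rng(0)
n = 20000
true_beta, mu1, mu2 = 1.0, 0.0, 0.3   # hypothetical parameter values

alpha = rng.normal(size=n)             # unobserved person-specific effect
x1 = 0.5 * alpha + rng.normal(size=n)  # predictor is correlated with alpha
x2 = 0.5 * alpha + rng.normal(size=n)

p1 = 1.0 / (1.0 + np.exp(-(mu1 + true_beta * x1 + alpha)))
p2 = 1.0 / (1.0 + np.exp(-(mu2 + true_beta * x2 + alpha)))
y1 = rng.random(n) < p1
y2 = rng.random(n) < p2

# Keep only the "changers"; those with the same outcome at both waves drop out.
changed = y1 != y2
design = np.column_stack([np.ones(changed.sum()), (x2 - x1)[changed]])
estimate = fit_logit(design, y2[changed].astype(float))

print(changed.mean())  # share of the sample that contributes to estimation
print(estimate)        # intercept estimates mu2 - mu1; slope estimates beta
```

Although alpha is never observed and is correlated with the predictor, the slope on the difference score should be close to the true value, at the cost of discarding every case whose outcome did not change.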
RANDOM EFFECTS MODELS FOR BINARY OUTCOMES
As in the continuous case, we can estimate RE binary logistic regression models as an alternative to FE models. RE models for binary outcomes not only have the advantage of allowing estimates of the effects of variables that are constant across observations within units but also are not restricted to observations for which the outcome varies over time. However, logistic regression RE models require the same assumptions about unobserved effects as in the continuous case: that they have an expected value of zero, are normally distributed with constant variance, and are independent of both the time-varying and time-constant observed variables. As in the continuous case, the assumption of independence can be tested with a Hausman test.
A WORKED EXAMPLE WITH A BINARY OUTCOME: THE EFFECT OF MIGRATION ON SCHOOL ENROLLMENT AMONG SOUTH AFRICAN BLACKS
To illustrate how to derive and interpret FE and RE models for binary outcomes, I present a portion of the analysis Lu and Treiman (2007) carried out in their study of the effect of labor migration and remittances on children's schooling in South Africa. As a consequence of apartheid-era restrictions on residential rights, many South African Blacks were forced to live in rural "homelands" carved out of the least productive land in the nation. As a result, many people, mainly men but sometimes women, left their families behind and sought employment in "White" South Africa. In a majority of cases labor migrants sent remittances home to their families.
The question Lu and Treiman addressed was whether remittances benefited the children left behind, by improving the odds that they would enroll in school. It might be argued that the extra income provided by remittances increases the likelihood of school enrollment. But suppose parents are committed to keeping their children in school and hence decide to go out for work to make this possible. That is, suppose that the same unmeasured characteristics of families determine both the migration decision and the school enrollment decision. If this is the case, the coefficient relating remittances to school enrollment will be biased. However, an over-time FE analysis can control for this (and all other unmeasured characteristics that can be assumed to be constant over time).
Using data for the South African Black population (which constitutes 78 percent of the total population) from the September 2002 and September 2003 South African Labour Force Survey, Lu and Treiman studied changes in school enrollment between 2002 and 2003 as a function of changes in the migration-and-remittance status of the household and other time-varying household attributes (household income, the highest level of education attained by household members, the number of children in the household, whether the household was female-headed, and the year of the survey). In addition, they included the age of the child as a predictor variable. Although this variable is regarded as time constant because the time-2 value is an exact linear transformation of the time-1 value ($\text{age}_2 = \text{age}_1 + 1$), recall from Equation 15.7 that such variables can be included in an FE equation to assess the possibility that the effects differ over time; the coefficients associated with such variables give the difference in the size of the effect for time 2 relative to time 1. In the present case, we might suspect that for South African Blacks the effect of children's age on school enrollment has gone down year by year in the postapartheid period as schooling has become more readily available.
Table 15.3 shows three sets of estimates: for an FE model, for an RE model estimated on the reduced sample of those who changed enrollment status (as required by the FE model), and for an RE model estimated on the full sample rather than the reduced sample. (See downloadable files "ch15-2.do" and "ch15-2.log" for details on how the analysis was carried out.)
Consider the FE results first. They provide substantial support for the central hypothesis of the study: when labor migrants send remittances home, the odds that children enroll in school increase by 50 percent, net of other factors, relative to the odds for children in nonmigrant households (precisely, [...]). However, it is important to be clear about exactly what has been demonstrated. Recall that what we are predicting is the odds of school enrollment for those whose enrollment status changed between 2002 and 2003, which is only 20 percent of the sample of all South African Black children. Thus the finding applies to this selected subgroup only. Moreover, for this subgroup nothing else has much effect on the odds of enrollment. Interestingly, the difference between 2002 and 2003 in the effect of a child's age on school enrollment is negative, which indicates that, as hypothesized, age mattered slightly less in 2003 than in 2002.
The next step is to estimate an RE model corresponding to the FE model shown in the first two columns. Recall that RE effects are legitimately interpreted only if the unmeasured effects are independent of the measured effects and the idiosyncratic errors. To determine whether this is the case, we perform a Hausman test of the similarity of the coefficients in the FE and RE models. Because we are interested only in the similarity of the estimates of migration-remittance effects, we restrict the Hausman test to the two migration-remittance dummy variables. We cannot reject the null hypothesis ($p > 0.4$) and thus conclude that the FE and RE analyses yield similar results with respect to migration-remittance status. This allows us to interpret the RE model.
It turns out that the results are similar to those for the FE model. Of course, this is necessarily so for the two migration-remittance coefficients because otherwise the hypothesis of similarity would have been rejected by the Hausman test. For the restricted sample used for the FE analysis, the RE estimates show that the odds that children living in households receiving remittances will attend school are more than 40 percent larger than the odds for those living in households without migrants. [...] in the two models. However, the RE analysis also includes two time-constant variables: gender and urban residence. Interestingly, males appear to be more likely than females to be enrolled in 2003. But place of residence (urban versus rural) has no effect.
However, the question remains as to whether an RE model estimated for the full sample, rather than being restricted to the 20 percent of the sample whose enrollment status changed, would yield similar results. To determine this, we estimate the RE model for all South African Black children. Again, to determine whether the full-sample model is acceptable, we compare the consistency of coefficients from this model and the FE model. And again we fail to reject the null hypothesis ($p = .72$).
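The arithmetic of a Hausman comparison restricted to two coefficients can be sketched as follows, in Python with NumPy (an illustrative assumption). The statistic is the quadratic form of the difference between the FE and RE coefficient vectors, weighted by the inverse of the difference between their covariance matrices; under the null hypothesis it is distributed as chi-square with as many degrees of freedom as coefficients compared. The coefficient values and covariance matrices below are hypothetical, not those of the study, and the closed-form p-value holds only for two degrees of freedom.

```python
import math

import numpy as np


def hausman_statistic(b_fe, b_re, v_fe, v_re):
    """Quadratic form (b_fe - b_re)' [V_fe - V_re]^(-1) (b_fe - b_re)."""
    diff = np.asarray(b_fe, dtype=float) - np.asarray(b_re, dtype=float)
    v_diff = np.asarray(v_fe, dtype=float) - np.asarray(v_re, dtype=float)
    # v_diff must be positive definite for the statistic to be meaningful.
    return float(diff @ np.linalg.solve(v_diff, diff))


def chi2_upper_tail_2df(statistic):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-statistic / 2.0)


# Hypothetical estimates for two dummy variables (for illustration only).
b_fe = [0.40, 0.10]
b_re = [0.35, 0.05]
v_fe = [[0.02, 0.00], [0.00, 0.03]]
v_re = [[0.01, 0.00], [0.00, 0.02]]

h = hausman_statistic(b_fe, b_re, v_fe, v_re)
p_value = chi2_upper_tail_2df(h)
print(h, p_value)  # 0.5 and about 0.78: no evidence that FE and RE differ
```

A large p-value, as here, is what permits interpretation of the RE coefficients; a small one would send the analyst back to the FE model.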
TABLE 15.3. Comparison of FE and RE Estimates for a Model of the Effect of Migration and Remittances on South African Black Children's School Enrollment, 2002 to 2003. (N(FE) = 2,408 Children; N(full RE) = 12,043 Children.)

                            Fixed Effects     Random Effects        Random Effects
                                              (Restricted Sample)   (Full Sample)
                            b        p        b        p            b        p
[...]
Female-headed household     -0.195   .167     [...]                 [...]
Survey year (2003)          -1.74    .000     -3.67    .000         -1.54    .000
Age of child                -0.029   .000     -0.019   .000         -0.063   .000
Male                                          0.169    .050         0.210    .000
Urban residence                               0.003    .974         0.330    .000
[...]

[Cells that cannot be assigned to a row and column: -0.121 (.183); 0.022 (.798); 0.045 (.000); 0.014 (.788).]
Still again, we are led to the same qualitative conclusion regarding the effect of remittances: they increase the odds of enrollment by nearly 45 percent. However, for the full sample all other effects are significant as well, with two exceptions: households in which there are migrants but not remittances are indistinguishable from those in which there are no migrants and, net of other factors, children in female-headed households are no more and no less likely than others to be enrolled in school. Moreover, most of the effects are larger than for the other two models. The odds of school enrollment increase monotonically with the education level of the household; the effect of household income is twice as large as in the [...]; the number of children in the household has a positive effect on school enrollment; and males and urban residents are more likely to be enrolled than are females or rural residents.
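Statements such as "the odds increase by nearly 45 percent" come from exponentiating logistic regression coefficients. A minimal Python sketch of the conversion follows; the function name is hypothetical and the coefficient values are chosen for illustration rather than taken from Table 15.3.

```python
import math


def percent_change_in_odds(logit_coefficient):
    """Percentage change in the odds for a one-unit increase in the predictor."""
    return 100.0 * (math.exp(logit_coefficient) - 1.0)


# A coefficient of ln(1.5), about 0.405, corresponds to a 50 percent increase in odds.
print(percent_change_in_odds(math.log(1.5)))

# A coefficient of zero leaves the odds unchanged; a negative one lowers them.
print(percent_change_in_odds(0.0))
print(percent_change_in_odds(-0.1))
```

Note that the conversion is not symmetric: a coefficient of -0.405 reduces the odds by about a third, not by half.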
A BIBLIOGRAPHIC NOTE
[...] models can be found in Chamberlain [...], [...] (1995), and [...]. [...] Ashenfelter and Krueger (1994), Frankenberg and Mason (1995), [...] (2005), Hotz and [...], [...], Budig and England (2001), [...] (2005), [...] (2006), and Lu and Treiman (2007).
WHAT THIS CHAPTER HAS SHOWN
In this chapter we have considered two techniques, fixed effects (FE) and random effects (RE) models, for dealing with unmeasured effects that are constant over time or, in the cross-sectional application, across individuals within groups (families, communities, and so on). The assumptions underlying each method were discussed, and two worked examples were presented, one involving a continuous dependent variable and the other involving a dichotomous dependent variable, because the two cases call for different procedures. FE models for dichotomous outcomes are estimated on only a subset of the full sample because cases for which the outcome variable does not differ over time or across members of a group must be dropped from the analysis. FE models do not permit estimates of the effects of either measured or unmeasured time-constant variables, whereas RE models do. For these two reasons, when the assumptions of the RE model are met, the RE model is to be preferred.
FINAL THOUGHTS AND FUTURE DIRECTIONS: RESEARCH DESIGN AND INTERPRETATION ISSUES
WHAT THIS CHAPTER IS ABOUT
In this chapter I review various aspects of research design, some of which we have encountered in previous chapters and some of which are new. In the course of doing this, I also briefly discuss a number of advanced statistical techniques and procedures, which you will need to complete the "tool kit" required for state-of-the-art data analysis in the social sciences, and which you are now in a position to tackle, given the foundation provided by what has been covered in this book. I then comment on the importance of probability sampling and ways to think about populations. I conclude with some advice about good research practice.
RESEARCH DESIGN ISSUES
In this section I consider some issues regarding appropriate analytic designs to answer research questions using nonexperimental data.
Comparisons Are the Essence
Not infrequently term paper proposals take the following form: I want to study caregivers, and I have found a sample of caregivers to use for my analysis; or I want to evaluate a program instituted in a school, and I have a sample of students in that school. The problem with these proposals is that you cannot study a constant. If you want to know, for example, whether caregivers are particularly prone to depression, you need a sample of caregivers and noncaregivers. Similarly, if you want to know what factors lead people to migrate, you need to have a sample of both those who do and those who do not migrate. And if you want to evaluate the efficacy of a program, you need a sample of places where the program has and has not been implemented (or data before and after implementation, although there are special problems in making over-time comparisons, which we will take up later in the chapter). This is an extremely simple point but is often ignored in data collection efforts. If, for example, you have a sample of migrants or delinquents or caregivers, all you can do is study internal variations among different types of migrants or delinquents or caregivers, which presumably is not what you are really interested in.
When you have sampled only the population of interest, you are forced to rely on data external to your study to make comparisons, which often entails trying to compare not-quite-comparable data. Sometimes in such situations, researchers compare their data for a special population to patterns assumed to hold for some standard population. For example, a recent study of schooling available to migrant children in Beijing (Chen and L[...]) was based on a survey of migrant households with school-age children. From this survey the researchers calculated and reported the proportion of such children not enrolled in school. The implicit contrast (and it is only implicit) is that all nonmigrant Beijing children attend school. But there is no particular reason to presume this. Indeed, the assumptions social scientists make about their own societies often prove to be false. Thus explicitly comparative data are strongly to be preferred.
If comparisons are the essence of analysis, the obvious next question is, what sorts of comparisons are appropriate for what purposes?
Populations, Population Subgroups, and Historical Periods. A common research question in the social sciences is whether population subgroups (males versus females, [...], and so on) differ with respect to some outcome and the factors determining that outcome. We saw in Chapter Six, in the section "A Strategy for Comparisons Across Groups," how to approach this kind of analytic question. Here I briefly review the strategy. To determine whether a relationship between a set of $X$ predictors and some outcome variable $Y$ holds for all subgroups of a population or whether it differs across subgroups, we estimate three prediction equations (which may be OLS equations or whatever is appropriate for some other linear model, for example, some form of logistic regression):

$$\hat{Y} = a + \sum_{i=1}^{I} b_i X_i \tag{16.1}$$

$$\hat{Y} = a' + \sum_{i=1}^{I} b'_i X_i + \sum_{j=2}^{J} c_j G_j \tag{16.2}$$

$$\hat{Y} = a'' + \sum_{i=1}^{I} b''_i X_i + \sum_{j=2}^{J} c'_j G_j + \sum_{i=1}^{I} \sum_{j=2}^{J} d_{ij} X_i G_j \tag{16.3}$$
In Equations 16.1 through 16.3, the $X_i$ are predictor variables and the $G_j$ are population subgroups, with each subgroup (except the first) represented by a dummy variable coded 1 for those in the subgroup and coded 0 otherwise. We then contrast Model 3 (Equation 16.3) against Model 1 (Equation 16.1) to determine whether we need to posit different relationships between the $X_i$ and Y for the different groups. (We do this by assessing the significance of the increment in $R^2$ or, equivalently, assessing whether the $c_j$ and the $d_{ij}$ in Model 3 are collectively not significantly different from zero.) If Model 3 fits significantly better than Model 1, we conclude that the social process being studied differs among the groups and ask the subsidiary question: is the difference only in the intercepts, or is it in the slopes as well? (We do this by assessing the significance of the increment in $R^2$ between Model 3 and Model 2 or, equivalently, assessing whether the $d_{ij}$ in Model 3 are collectively not significantly different from zero.) Note that this strategy is appropriate only when the groups can be taken to be exogenous to the outcome under study, which holds for gender, ethnicity, and so on. When selection into the group is correlated with the outcome, net of the other predictor variables in the model, the assumption of OLS regression that the predictor variables are uncorrelated with the error is violated, and endogenous switching regression procedures, discussed later in the chapter, must be used to yield unbiased estimates of effects.
If Model 3, or Model 2, proves to be the preferred model, it is then possible to decompose the differences between groups in their average outcomes, using the procedures for decomposing differences in means discussed in Chapter Seven. Note that the decomposition procedure was discussed in Chapter Seven in the context of OLS regression. The same procedure can be used to decompose differences in logged dependent variables (see Treiman and Roos [1983, 636-640] for an example) or in log odds, albeit without quite the same intuitive appeal.
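The model contrasts described above rest on the usual F-test for the increment in R-squared between nested OLS models. The following is a minimal sketch in Python of that computation. The function name, the sample size, and the R-squared values are hypothetical, chosen only for illustration (two predictors X and three groups, so that Models 1, 2, and 3 have 2, 4, and 8 predictors respectively); they are not taken from any analysis in this book.

```python
def incremental_f(r2_full, r2_reduced, k_full, k_reduced, n):
    """F statistic for the increment in R-squared between nested OLS models.

    k_full and k_reduced count predictors, excluding the intercept;
    n is the number of observations.
    """
    df_num = k_full - k_reduced      # number of coefficients added
    df_den = n - k_full - 1          # residual degrees of freedom, full model
    if df_num <= 0 or df_den <= 0:
        raise ValueError("models must be nested, with positive degrees of freedom")
    f_stat = ((r2_full - r2_reduced) / df_num) / ((1.0 - r2_full) / df_den)
    return f_stat, df_num, df_den


# Hypothetical fit statistics (assumed values, for illustration only)
N = 1000
R2_MODEL1, K_MODEL1 = 0.25, 2   # X only
R2_MODEL2, K_MODEL2 = 0.27, 4   # X plus group dummies
R2_MODEL3, K_MODEL3 = 0.30, 8   # X, group dummies, and X-by-group interactions

# Model 3 versus Model 1: does the process differ across groups at all?
f31, df31_num, df31_den = incremental_f(R2_MODEL3, R2_MODEL1, K_MODEL3, K_MODEL1, N)

# Model 3 versus Model 2: do the slopes differ, or only the intercepts?
f32, df32_num, df32_den = incremental_f(R2_MODEL3, R2_MODEL2, K_MODEL3, K_MODEL2, N)

print(f"Model 3 vs Model 1: F({df31_num}, {df31_den}) = {f31:.2f}")
print(f"Model 3 vs Model 2: F({df32_num}, {df32_den}) = {f32:.2f}")
```

Each F statistic would then be compared with the critical value of the F distribution for the stated degrees of freedom. For logistic or other maximum likelihood models, a likelihood-ratio test would play the corresponding role.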
A variant on the strategy for assessing group differences is to start with an equation of the form
\[ \hat{Y} = a + \sum_{j=2}^{J} c_j G_j \tag{16.4} \]
Because for Equation 16.4 the predicted values, $\hat{Y}$, are simply the means of Y for each subgroup, the question being addressed by contrasting Equation 16.3 with Equation 16.4 (or Equation 16.2 with Equation 16.4) is to what extent group differences with respect to the outcome can be explained by group differences in the other predictor variables.
Exactly the same procedures can be used to make comparisons over time. For example, we might want to know whether the relationship between political attitudes (liberalism versus conservatism) and acceptance of abortion was the same in the 1970s, when Roe v. Wade was first decided, and in the 2000s, when opposition to abortion apparently became obligatory for Republican presidential candidates. In this case, time becomes the G variable and political attitudes the X variable in Equations 16.1 through 16.3. And of course, the same logic holds for comparisons of changes over time in group differences, albeit with an increase in complexity because of the need to consider three-way interactions. For example, the analysis of the interaction between education and religious denomination in determining acceptance of abortion in 1974, which I used in Chapter Six to introduce the strategy for group comparisons, could be replicated in 2006 to assess how the "abortion wars" over the last thirty-two years have affected attitudes.
Some cross-temporal comparisons are vulnerable to estimation problems stemming from the fact that data from different time periods may not be independent. This is true of aggregate measures such as the average level of schooling. The value of such a variable computed for, say, the United States in 2005 will hardly differ from the value for 2004 because both computations are based on more or less the same population. Thus the two observations are not independent. Procedures for coping with the nonindependence of observations, known as autocorrelation, and with other special features of time-series data are well developed; see the Stata manual Time Series [TS] (StataCorp 2007). Time-series procedures are widely used in economics. Another kind of data, widely used in other social science fields as well as economics, derives from panel studies, in which the same individuals are surveyed two or more times, typically several months or years apart. Data with this structure provide one means of carrying out FE and RE analysis of the kind discussed in the previous chapter. These and other techniques for dealing with the nonindependence of observations are known as XT (cross-sectional time series) models. Such models go beyond what we have been able to consider in this book. For standard introductions, consult the Stata 10.0 manual Longitudinal/Panel Data [XT] (StataCorp 2007) and texts by Sayrs (1989), Wooldridge (2002), Hsiao (2003), Baltagi (2005), and Greene (2008). The Sayrs text is quite accessible, the Greene text reasonably so, and the other three fairly formidable.
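To make the nonindependence of successive aggregate observations concrete, here is a small sketch in Python that computes the lag-1 autocorrelation of a yearly series. The schooling figures are invented for illustration (a steady rise of 0.1 year per year), and the function name is my own; nothing here reproduces a Stata time-series routine.

```python
def lag1_autocorrelation(series):
    """Lag-1 autocorrelation: the covariation of each value with its
    predecessor, relative to the total variation around the series mean."""
    n = len(series)
    mean = sum(series) / n
    dev = [y - mean for y in series]
    total = sum(d * d for d in dev)
    lagged = sum(dev[t] * dev[t - 1] for t in range(1, n))
    return lagged / total


# Hypothetical mean years of schooling for ten consecutive years
schooling = [12.0 + 0.1 * t for t in range(10)]

r1 = lag1_autocorrelation(schooling)
print(f"lag-1 autocorrelation: {r1:.2f}")
```

A value well above zero, as here, indicates that each year's figure is largely predictable from the previous year's, so ten yearly observations carry much less information than ten independent ones would.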
Both cross-sectional comparisons and cross-temporal comparisons may, of course, be extended to more than two comparisons (more than two groups or more than two time points), and groups may be subpopulations of a single nation or populations of different nations. For examples of the latter, see Erikson and Goldthorpe (1987a, 1987b).
The reason for carrying out cross-population or cross-temporal comparisons is to test some hypothesis, or hypotheses, about how populations or population subgroups differ or time periods vary. If you have a priori hypotheses, this is a reasonable strategy. However, you are vulnerable to the counterclaim that the differences you posit and observe are spurious, because they reflect differences between groups or over time that affect both independent and dependent variables. Binary comparisons are particularly vulnerable to such claims, because any number of other factors could account for the differences. Indeed, any graduate student in the social sciences can invent a post hoc explanation for any observed difference! If you do not believe me, try this simple test on your friends: invent a finding about some society or population they do not know about or, better yet, report a finding but reverse the sign or otherwise change the outcome. Then wait to see what plausible explanation they give you. I used to do this at cocktail parties when I discovered that everybody thought that my finding of essential invariance in occupational prestige hierarchies around the world (Treiman 1977) was a case of documenting the obvious. I started telling people that prestige hierarchies were quite different in, say, Russia, and got all sorts of interesting explanations about why it was obvious that this had to be so (even though it was not!). Comparisons of three groups (time points) are far more constraining, and comparisons of still more groups (time points) even more so.
As a case in point, consider historical comparisons. Nee (1989, 1996) has argued that the shift toward a market economy in China reduced the power of cadres and enhanced the power of "direct producers." The difficulty, as Walder points out in a critique (1996, 1,064), is that many things change over time:
The problem with time [as a measure] is that many other changes conceptually distinct from the spread of markets, and which may also affect the distribution of power and income, also occur through time and at different rates in different regions.
Some emerging market economies grow rapidly while others do not; state policy may provide windfall gains to grain producers in one period only; private enterprise may thrive in some regions but remain marginal in others; capital may be highly concentrated in some regions, more dispersed or absent in others. All of these processes affect the distribution of power and income; any time-dependent measure of market allocation must carefully control for them.
This difficulty, sometimes referred to as the "too many degrees of freedom" problem because there are too many plausible explanations for whatever is observed, is generic to two-case comparisons, both cross-temporal and cross-sectional. For this reason, comparisons of a small number of cases are more helpful in demonstrating similarity than in explaining differences. Sometimes it is helpful to show that a finding in one society or at one point in time also holds in a different time and place. If so, we can have more confidence that we have identified a general phenomenon and not just an idiosyncratic result.
By contrast, consider a test of the "fetal origins" hypothesis (Barker 1998) by Almond (2006). The hypothesis states that adverse events experienced by pregnant women can have long-term consequences for their offspring. Almond studied this claim by analyzing the consequences of the 1918 flu pandemic for educational attainment, occupational status, income, disability, and other outcomes measured in the 1960, 1970, and 1980 censuses. He finds strong effects, one of which is shown in Figure 16.1, male disability rates in 1980 by quarter of birth in 1918 through 1920. Because disability rates were elevated only for those who were in utero during the pandemic and returned to the trend line for those born later, we can rule out the possibility that some other, unmeasured, change coincided with the onset of the pandemic. More precisely, any alternative explanation would have to show a pattern of timing that exactly coincided with the pandemic, which in this case is not remotely plausible.
FIGURE 16.1. 1980 Male Disability by Quarter of Birth (Prevented from Work by a Physical Disability). Horizontal axis: quarter of birth. Source: Almond 2006, Figure 2.
Natural Experiments
Analyses of this kind are particularly compelling because they constitute natural experiments. As we have seen, the difficulty with almost all nonexperimental work is the possibility that we have omitted variables that affect both the outcomes and the predictor variables, thus biasing our estimates. Natural experiments reduce or eliminate omitted variable bias by focusing on natural events that plausibly can be argued to be distributed randomly in the population. Because the 1918 pandemic struck without warning in October 1918 and had largely dissipated by early 1919, it is reasonable to regard those in utero during the months of the pandemic as a treatment group and those in utero just before and just after these months as a control group. Because there are no plausible differences between these groups, except that one had the misfortune to be in utero at the time of the pandemic, it is reasonable to infer that differences in outcomes are due to exposure to the pandemic. Of course, not all pregnant mothers became infected. But we know that about one-third of childbearing-age women did become infected, which is a large enough group to reveal differences in outcomes if they exist. Almond's paper, which also exploits state-to-state variations in the severity of the pandemic, is a model of how to do analysis of this kind.
For other examples of natural experiments, albeit some more persuasive than others in how thoroughly they overcome potential omitted-variable bias, see Deng and Treiman (1997), Ansolabehere, Snyder, and Stewart (2000), Abadie and Gardeazabal (2003), Lassen (2005), Oster (2005), Treiman (2007a; see also the discussion of this example in Chapter Seven), and Lu and Treiman (2008).
Multilevel Analysis
When you have many comparison groups (many time points, many nations, and so on), it makes sense to shift from treating each group as a discrete point (by including a set of dummy variables representing groups) to scoring each group with respect to various dimensions (for example, characterizing nations by their level of economic development, the degree of urbanization, and so on). The optimal way to do this is to carry out the analysis at two or more levels. In this case, macrosocial "contexts" are defined (for example, classrooms or schools, or both, in educational studies; societies in cross-national studies; birth cohorts or historical periods in cross-temporal comparisons; and so on). Then a microequation (one representing some social process) is estimated separately for each context, and variations in the coefficients representing the microprocess are predicted from characteristics of the contexts.
For example, suppose you wish to test the hypothesis that the negative effect of the number of siblings on educational attainment is stronger where school fees as a fraction of total family incomes are higher. Here is the typical setup for such an analysis:
\[ Y_{ij} = a_j + b_j X_{ij} + e_{ij} \tag{16.5} \]
\[ a_j = \gamma_{00} + \gamma_{01} G_j + \alpha_{0j}, \qquad b_j = \gamma_{10} + \gamma_{11} G_j + \alpha_{1j} \tag{16.6} \]
where $j = 1, 2, \ldots, J$ denotes contexts, or level-2 units (school systems in the example), and $i = 1, 2, \ldots, n_j$ denotes individuals within contexts (level-1 units). The level-2 equations assert that the intercepts and slopes of the level-1 equations vary over contexts as linear functions of G (or, in the example, that the [negative] slope associated with the number of siblings is greater [in absolute value] when school fees are higher; in this example I have provided no hypothesis regarding the intercepts of the level-1 equations, though it might be plausible to expect that the level of educational attainment is lower when school fees are high). Of course, the level-1 equations may include more than one X and the level-2 equations may include more than one G.
To appreciate the substantive payoff of multilevel analysis, consider a 1989 paper by Lee and Bryk that analyzed the effect of differences in the academic organization of high schools on the distribution of achievement in standardized mathematics tests. This paper by one of the authors of the standard text on multilevel analysis (Raudenbush and Bryk [2002]) not only provides a compelling substantive example but lays out the technical issues in a very clear way. Using data from the High School and Beyond study (10,187 students in 160 high schools), Lee and Bryk showed that achievement scores tend to be highest and differentiation among students by race and socioeconomic status lowest in schools with a standardized academic curriculum required of all students, in contrast to "shopping mall" schools with a wide variety of courses and many electives. They cite this difference in the organization of the curriculum as the main reason Catholic schools tend to be more successful, a widely noted but not previously well understood phenomenon.
There are many ways to carry out a multilevel analysis, including the meta-analytic approach used by Treiman and Yip (1989) in the example presented in Chapter Ten to illustrate regression diagnostic procedures. The technical details go beyond what we have been able to consider in this book. Good introductions include a paper by DiPrete and Forristal (1994), which emphasizes substantive applications and provides a good sense of what this body of techniques can be used to do. A paper by Mason (2001) focuses on various multilevel analysis procedures as a way of coping with observations that are not independent because they are clustered within higher-level units (for example, persons in families, pupils in classrooms). Several book-length treatments also address multilevel analysis, of which Raudenbush and Bryk (2002) is the standard but is quite demanding, as is Goldstein (2003); Snijders and Bosker (1999) may be more accessible. For additional substantive examples, see Entwisle and Mason (1985) on the relationship between level of socioeconomic development and fertility; DiPrete and Grusky (1990a, 1990b, 1990c) on temporal variations in socioeconomic attainment in the United States; and Sampson, Raudenbush, and Earls (1997) on the role of neighborhood efficacy in reducing crime.
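The two-level logic of Equations 16.5 and 16.6 can be illustrated with a deliberately simplified two-step sketch in Python: estimate the level-1 slope separately within each context, then regress those slopes on the context characteristic G. This is only an illustration of the idea, not the estimator implemented in multilevel software, which fits both levels jointly and weights contexts by their precision. All data are simulated, and the names (`ols_line`, `fee_share`, `TRUE_G10`, `TRUE_G11`) and numerical values are assumptions made for the example.

```python
import random


def ols_line(xs, ys):
    """Return (intercept, slope) of the least-squares line of ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return my - slope * mx, slope


random.seed(0)

# Hypothetical level-2 characteristic G_j: school fees as a share of income
fee_share = [0.05 * j for j in range(8)]

# Assumed "true" level-2 equation for the sibling slope: b_j = -0.2 - 1.0 * G_j
TRUE_G10, TRUE_G11 = -0.2, -1.0

slopes = []
for g in fee_share:
    b_j = TRUE_G10 + TRUE_G11 * g
    siblings = [random.randint(0, 5) for _ in range(500)]
    schooling = [12.0 + b_j * x + random.gauss(0, 0.5) for x in siblings]
    # Step 1: level-1 regression within this context
    _, b_hat = ols_line(siblings, schooling)
    slopes.append(b_hat)

# Step 2: level-2 regression of the estimated slopes on G_j
g10_hat, g11_hat = ols_line(fee_share, slopes)
print(f"estimated level-2 intercept: {g10_hat:.2f}")
print(f"estimated level-2 slope:     {g11_hat:.2f}")
```

A negative level-2 slope corresponds to the hypothesis in the text: the penalty attached to each additional sibling grows as school fees take a larger share of family income.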
Endogeneity, Sample-Selection Bias, and Other Threats to Correct Causal Inference
In this section I discuss several closely related circumstances that require special treatment to avoid biased estimates. I also provide brief introductions to some of the statistical solutions not addressed in previous chapters.
Treatment Effects
Endogeneity refers to situations in which one or more of the predictor variables is correlated with unobserved variables that affect the outcome. Because the effects of unobserved variables are relegated to the error term, the coefficients of predictor variables correlated with such unobserved variables are biased. A common situation in which endogeneity problems may arise is when we wish to assess the effect of a "treatment" that itself depends on unmeasured factors that affect the outcome. For example, if in a developing nation midwives are placed in villages with the worst health outcomes, an assessment of the effect of midwives on health that fails to control for this nonrandom assignment will be downwardly biased (see Frankenberg and Thomas [2001] for an analysis of Indonesian data that uses a fixed effects [difference-in-difference] analysis, of the kind discussed in the previous chapter, to derive correct estimates). Similarly, if workers less able to command high wages (for unmeasured reasons) are more likely to join a union, OLS estimates of the effect of union membership (here regarded as the "treatment") on wages will be downwardly biased. For an interesting example of how to carry out an analysis using a treatment-effects approach, see Brand's 2006 study of the effect of job displacement on the quality of subsequent jobs.
The most obvious way to correct for endogeneity problems is to measure all the factors thought to affect the outcome. We encountered this idea in our consideration of ordinary least-squares regression in Chapter Six, when we discussed the presentation of several models, with successively more variables, to assess how newly introduced variables mediate the effects of variables already in the model. From this we see that endogeneity bias is a form of omitted-variable bias.
However, it is not always possible to measure all the potential influences on an outcome, either because we are reanalyzing already collected data or because the analyst may not be able to identify a priori all the potential influences on an outcome that are correlated with variables explicitly measured: for example, all the factors that might both lead individuals to join a union and be correlated with their capacity to command high wages as individual employees. Thus we need ways to correct for endogeneity bias (and its close cousin, sample-selection bias). We have already covered one approach, fixed effects or random effects modeling, which is possible when we have measurements for individuals at more than one point in time or measurements for different individuals within groups (for example, families or classrooms or communities). When such data are not available, several other analytic strategies may be considered, all of which go beyond what it has been possible to cover in this book. For useful discussions of what is entailed in establishing causality, see Holland (1986), together with comments by Rubin, Cox, Glymour, and Granger, and Holland's rejoinder; and Winship and Morgan (1999).
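The direction of the bias in the midwife and union examples can be made explicit with the standard omitted-variable result. The notation below is my own shorthand rather than notation used elsewhere in this chapter: X is the treatment, U is the unmeasured factor, and the true model is assumed to be linear.

```latex
% Assumed true model, with U unobserved:
%   Y = \alpha + \beta X + \gamma U + \varepsilon
% Regressing Y on X alone yields, in large samples,
\[
  \operatorname{plim}\, \hat{\beta}_{\mathrm{OLS}}
    \;=\; \beta \;+\; \gamma \,\frac{\operatorname{Cov}(X, U)}{\operatorname{Var}(X)} .
\]
% Midwife example: let U be unmeasured local conditions that worsen health,
% so \gamma < 0; midwives are sent where conditions are worst, so
% Cov(X, U) > 0. The second term is then negative, and the estimated
% effect of midwives is biased downward.
```

The bias vanishes only if the omitted factor has no effect on the outcome or is uncorrelated with the treatment, which is what random assignment guarantees and what nonexperimental designs must argue for.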
Instrumental Variables Regression
An approach to coping with endogeneity that is popular among economists is instrumental variable (IV) estimation. If a variable (Z) can be found that is uncorrelated with unobserved variables (u) that affect the outcome (Y), is correlated with the variable (X) in the model thought to be correlated with the unobserved variables, and is conditionally unrelated to the outcome variable net of the effect of both the observed and unobserved variables, Z can be used as an instrument for X to yield relatively unbiased estimates of the effect of X on Y.
For example, consider a 1990 paper by Angrist studying the effect of service in the military during the Vietnam War on lifetime earnings. The difficulty with estimating an OLS equation is that the decision to join the military might well have been correlated with unmeasured factors that affect earnings. Angrist exploited the fact that for much of the war period, a lottery system was used to determine who would be drafted into service. Although there were many exceptions, the increased probability of being drafted for those with low numbers makes the assignment of lottery numbers a kind of natural experiment: one's lottery number was correlated with the likelihood of serving but not with other factors related both to service and to subsequent income. Thus the lottery number is a good instrument for adjusting the effect of Vietnam veteran status on income.
Another situation where IV estimation may be helpful is where the causal order is ambiguous. Suppose we observe that women who work are less likely to be depressed than women who do not work. Can we conclude that employment protects against depression? Perhaps. But the causal order might go the other way: depressed women may be less likely to seek or retain employment. One way to address this problem would be to find an instrument for employment. A reasonable choice might be whether the woman's mother worked. It is known that the daughters of mothers who worked are more likely to work themselves. But there is no particular reason to believe that, net of her own employment, a woman's mother's employment affects the likelihood that the woman herself experiences depression (the example is from Ettner [2004]). Thus, mother's employment would satisfy the conditions for an instrumental variable.
A final circumstance in which IV approaches can be helpful is to estimate simultaneous equation models or, as they are sometimes called, reciprocal causation models. Wooldridge (2002, 555) provides a useful example: in a sample of cities, we might expect the murder rate to depend on the size of the police force (the more police per capita, the lower the expected per capita murder rate). But we might also expect the size of the police force to depend on the murder rate (the higher the [anticipated] murder rate, the greater the incentive to increase the size of the police force). Because we observe only the equilibrium condition, a particular murder rate and a police force of a particular size, specifying a simultaneous equation model amounts to asking the counterfactual questions: What would be the murder rate if the size of the police force were different? What would be the size of the police force if the murder rate were different? IV methods provide a way of estimating such models.
Another case that usually involves reciprocal causation and hence might be handled by IV methods (or by structural equation models of the kind discussed later) is when one attitude is thought to affect another but they are measured at the same time. Rather than assuming that somehow one causally precedes the other, it usually is more sensible to treat them as either both dependent on a third variable (seemingly unrelated regression, encountered in Chapter Eleven, can be helpful in such cases) or as having reciprocal effects.
The difficulty with IV estimation is that it often is difficult to find good instrumental variables, and poor instruments often produce results worse (more biased) than using no instruments at all. For a good introduction to IV estimation and its dangers, see Wooldridge (2006, Chapters 15 and 16). Other useful references include Baum (2006), the entry for the -ivregress- command in StataCorp (2007), and Greene (2008).
Sample-Selection Bias
Sample-selection bias arises when unmeasured factors correlated with an outcome determine whether an individual is included in the sample. For example, a woman may enter the labor force only if she can command a reasonably high wage. Thus selection into the sample (people in the labor force) is nonrandom but depends on unmeasured characteristics correlated with the outcome variable (wages). Analyzing only those with wages then results in biased estimates.
Consider another example. Many surveys in China are restricted to the de jure urban population. As Wu and Treiman (2007) have shown, in such studies estimates of intergenerational mobility are overstated because those of rural origins who obtain urban registration are not a random sample of the rural population but rather are the "best and the brightest," who have experienced long-range upward social mobility. If the entire population is included in the analysis, estimates of the extent of intergenerational mobility are much more modest.
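Returning to the instrumental variable logic described earlier in this section: with a single instrument Z and a single endogenous predictor X, the IV estimate of the effect of X on Y reduces to the ratio cov(Z, Y) / cov(Z, X). The sketch below, in Python with simulated data, contrasts that ratio with the OLS slope when an unobserved factor u affects both X and Y. Everything in it is hypothetical: the variable names, the assumed true effect of 2.0, and the strength of the confounding are chosen only to make the contrast visible, and the code is not a substitute for the -ivregress- command.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    n = len(a)
    ma = sum(a) / n
    mb = sum(b) / n
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (n - 1)


random.seed(1)
N = 20000
TRUE_EFFECT = 2.0   # assumed causal effect of x on y

z = [random.gauss(0, 1) for _ in range(N)]   # instrument: affects x, not y directly
u = [random.gauss(0, 1) for _ in range(N)]   # unobserved factor: affects x and y
x = [zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]
y = [TRUE_EFFECT * xi + 3.0 * ui + random.gauss(0, 1) for xi, ui in zip(x, u)]

# OLS slope of y on x: contaminated by the omitted factor u
ols_slope = cov(x, y) / cov(x, x)

# Simple IV estimator: uses only the variation in x that is induced by z
iv_slope = cov(z, y) / cov(z, x)

print(f"assumed true effect: {TRUE_EFFECT:.2f}")
print(f"OLS estimate:        {ols_slope:.2f}")
print(f"IV estimate:         {iv_slope:.2f}")
```

In this simulation the instrument is strong by construction. If the coefficient linking z to x were made very small, cov(Z, X) would approach zero and the IV ratio would become unstable, which is the "poor instruments" danger noted above.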
Heckman Selection Model
A standard approach to correcting for sample-selection bias (in cases where it is not possible to redefine the population as Wu and Treiman did) is to use a Heckman correction (see Heckman 1979). The procedure involves predicting (using a binary probit equation) the probability of being in the sample (or, equivalently, of having an observed outcome), calculating the expected error for each observation, and using these errors as regressors in an equation predicting the outcome of interest. See Winship and Mare (1992) for a very clear exposition of this and other models for sample-selection bias, and see Dubin and Rivers (1989) for an extension of these procedures to models with binary outcomes.
The Stata entry for the -heckman- command (StataCorp 2007) offers another very clear example and exposition of the method, using the canonical example, women's earnings. In the example, earnings (for women who have earnings) are predicted from education and age, and the probability of having earnings is predicted from marital status, the number of children at home, education, and age (and implicitly, through the inclusion of education and age, which predict the outcome, of the expected wage itself). Note that the assumption here is that marital status and the number of children at home do not affect earnings but only the probability of having earnings. We might well question this assumption because married women, and particularly women with children at home, may choose to take lower-paying jobs that more readily accommodate their dual careers as workers and mothers.
This example thus reveals a major limitation of the procedure. To yield robust results, the predictors in the selection equation should strongly affect the probability of being selected but should have no net effect on the outcome. (Heckman corrections can be made even when there are no such variables, by relying on the functional form of the equation to identify the model. However, the results are often neither robust nor substantively compelling.) Suitable variables are often difficult to find. Note the similarity to IV estimation discussed previously.
For instructive applications of corrections for sample-selection bias, see Mare and Winship's 1984 study of employment trends for young Black and White men; Hagan's studies of factors influencing the severity of punishment for convicted criminals (Peterson and Hagan 1984; Hagan and Parker 1985; Zatz and Hagan 1985); Manski and Wise's (1983) study of the determinants of graduation from college; and Hardy's (1989) study of occupational mobility in the nineteenth century based on matching data across censuses, which takes account of selection due to deaths, emigration, and name changes.
Endogenous Switching Regression
Note that the Heckman procedure also can be used to analyze endogenous treatment effects, as an alternative to IV estimation. However, a separate command, -treatreg-, is also available in Stata, in addition to -heckman-.
The problem of an endogenous treatment effect (that is, where there is a nonzero correlation between assignment to a "treatment" group and unmeasured factors affecting the outcome) can in turn be generalized to the case in which the parameters of a model linking treatments to outcomes differ across treatment groups and assignment to treatment groups is endogenous. For example, Gerber (2000) asks whether the fact that former Communist Party members do better in post-Soviet Russia than do others is due to residual social capital (the fact that connections continue to favor former party members) or rather to unmeasured factors that affect both the likelihood that people became party members during the Soviet era and their earnings in the post-Soviet period.
This kind of problem can be addressed using methods that are similar to those for treatment effects and sample-selection problems: specifically, endogenous switching regression models. Endogenous switching regression models are used in situations where one outcome is observed if a selection variable Z = 0, but a different outcome is observed if Z = 1. Using this method, Gerber concludes that the advantage enjoyed by former communists is due entirely to unmeasured characteristics associated with becoming a member of the Communist Party and that there is no lingering effect of Soviet-era social or political capital. (See also the critique by Rona-Tas and Guseva [2001] and the rejoinder by Gerber [2001].)
Good descriptions of the technique and of how to implement it can be found in Mare and Winship (1988) and Powers (1993). For additional applications see Willis and Rosen (1979); Gamoran and Mare (1989); Long (1990); Sakamoto and Chen (1991); Manski and others (1992); Tienda and Wilson (1992); Powers and Ellison (1995); Smock, Manning, and Gupta (1999); Hofmeyr and Lucas (2001); Lichter, McLaughlin, and Ribar (2002); Sousa-Poza (2004); and Prouteau and Wolff (2006).
Propensity Score Matching
Another threat to correct causal inference occurs when the predictor variable of interest occurs only rarely in the sample and is highly correlated with other independent variables. For example, what is the effect of attending an elite university on subsequent occupational status? The usual way of approaching such a question is to carry out a multiple regression of occupational status on attendance at elite versus other universities plus a set of variables controlling for family background, high school performance, and so on.
The difficulty is that attending an elite university tends to be rl highly correlatedwith the control variables that controlling for confounding factors I-afo to hold them constant,becausethere are few people with low values on the control \.mableswho attend elite universities.Apart from the conceptualproblem this createsabtl the meaning of "holding constant," there is a serious statistical problem-"unbalamed treatments"tend to inflate standarderrors (Rosenbaumand Rubin 1983,48), malJE problematicthe rejectionof the null hypothesisofno effect.To copewith this probleu. analysts sometimesresort to matching pairs of casesthat differ with respectto the \aable of interest (the "treatmenf' variable) but that are identical on a set of covariarEl However, as Srnith notes (1997, 326-327), until recently matching studies otlen hsrc been resistedon the gound that they involve "throwing away" a lot of data. Moreorer- I often is difficult to find good matchesfor more than a small number of variablesbecaret for a linear increasein the number of covariatesthere is a geometric increasein the number of matchesrequired. However, advancesin the statistical theory of matching-the seminal adicle is ! Rosenbaumand Rubin (1983)-have led to the developmentof a procedurethat replaL'E the large set of discretematchesrequired by classicalmatching procedureswith r propensity score, a scalarsummaryof the degreeof similarity betweencaseswith resFq: to a large number of covariates.The procedureinvolves predicting the treaflnent variahic
from covariates and then matching each "treatment" case with the control case that has the nearest propensity score (or sometimes with several control cases; see Morgan and Winship [2007] for a useful discussion of the technical issues involved). The resulting sample is then analyzed in one of several ways: focusing on outcome differences between matched treatment and control cases, ignoring the unmatched cases; stratifying the sample into strata with similar propensity scores and comparing outcomes within strata (for an interesting application, see Brand and Xie [2007]); or using the propensity score directly in a regression equation to get an estimate of the effect of the treatment net of the propensity to be in the "treatment" group. The essential insight is that by comparing cases that have a similar propensity to be in the treatment group, we create a quasi experiment. That is, we can think of matched cases as being, in effect, randomly assigned to either the treatment or the control group because they have the same probability of being in either group, given their covariates.

Consider the example presented by Smith (1997) in his illuminating exegesis of matching methods. He was interested in comparing the mortality rate in two types of hospitals, ordinary hospitals (N = 5,053) and "magnet" hospitals (N = 39), that is, hospitals with organizational practices that enhanced their reputations as good places to practice nursing. Contrasting an OLS analysis with a propensity matching procedure, he showed that the two methods yielded similar estimates of the difference in mortality rates in the two types of hospitals, but the latter method had far smaller standard errors, yielding a statistically significant reduction in mortality in the magnet hospitals compared to ordinary hospitals, a conclusion not yielded by the OLS analysis because of the large standard errors resulting from the unbalanced design.
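The matching step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not Smith's procedure: it assumes the propensity scores have already been estimated (for example, from a logit model of treatment on the covariates), it matches each treatment case to the single nearest control case with replacement, and all data values and function names are hypothetical.

```python
# Minimal sketch of nearest-neighbor matching on an already-estimated
# propensity score. All data values and names are hypothetical.

def match_on_propensity(cases):
    """Pair each treatment case with the control case whose propensity
    score is nearest (matching with replacement)."""
    treated = [c for c in cases if c["treated"]]
    controls = [c for c in cases if not c["treated"]]
    pairs = []
    for t in treated:
        nearest = min(controls, key=lambda c: abs(c["pscore"] - t["pscore"]))
        pairs.append((t, nearest))
    return pairs


def mean_matched_difference(pairs):
    """Average outcome difference between matched treatment and control cases."""
    diffs = [t["outcome"] - c["outcome"] for t, c in pairs]
    return sum(diffs) / len(diffs)


def naive_difference(cases):
    """Unmatched comparison: mean outcome of all treated minus all controls."""
    treated = [c["outcome"] for c in cases if c["treated"]]
    controls = [c["outcome"] for c in cases if not c["treated"]]
    return sum(treated) / len(treated) - sum(controls) / len(controls)


# Hypothetical data: two treatment cases, four control cases.
cases = [
    {"treated": True,  "pscore": 0.80, "outcome": 12.0},
    {"treated": True,  "pscore": 0.60, "outcome": 10.0},
    {"treated": False, "pscore": 0.78, "outcome": 9.0},
    {"treated": False, "pscore": 0.55, "outcome": 8.5},
    {"treated": False, "pscore": 0.10, "outcome": 3.0},
    {"treated": False, "pscore": 0.15, "outcome": 4.0},
]

pairs = match_on_propensity(cases)
matched_estimate = mean_matched_difference(pairs)
naive_estimate = naive_difference(cases)
print("matched estimate:", matched_estimate)
print("naive estimate:  ", naive_estimate)
```

In this invented example the two control cases with very low propensity scores are never used as matches, which is why the matched estimate is smaller than the naive difference in means. A real analysis would also need to check covariate balance after matching and to compute appropriate standard errors.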
There is by now a substantial literature on both the statistical theory underlying propensity score matching and practical procedures for implementing the method. Smith's 1997 paper is a good place to start and also has a useful bibliography. Becker and Ichino (2002), Abadie and others (2004), and Becker and Caliendo (2007) discuss the implementation of propensity score matching in Stata. Dehejia and Wahba (2002) and Brand and Halaby (2006) provide useful evaluations and worked examples. Harding (2002) is a particularly instructive application. For other applications, see Berk and Newton (1985), … and others (1995), Keating and others (2001), Lu and others (2001), Morgan (2001), … and Smith (2003), Lundquist (2004), and Cohen (2005). One limitation of propensity score matching is that it may not balance unobserved covariates. Thus if you suspect that unobserved covariates are unbalanced, you will need to resort to one of the methods discussed here or in the previous chapter that are specifically designed to handle such problems.
Structural Equation Models

Structural equation modeling (SEM) is a technique (or, more precisely, a set of techniques) that permits the estimation of systems of equations, often involving unmeasured latent constructs. Consider a simple example, Blau and Duncan's (1967, 170) classic model of status attainment, shown in Figure 16.2. When we think about how occupational status is transmitted from one generation to the next, it becomes evident that this is a multistep process: men whose fathers are well educated and have high-status jobs tend to achieve more schooling; those who achieve high levels of schooling tend to obtain high-status first jobs (but their social origins may also help); and those who have high-status first jobs are likely to be able to parlay them into high-status current jobs (but their education and even their social origins may continue to matter). The various "paths" by which fathers' occupational status is transmitted to their sons are shown in the figure, known as a "path diagram." The paths can be represented by a set of equations predicting each of the outcomes in turn. The relationships among the equations can be explored to yield insights regarding the relative importance of different paths linking the two variables. Moreover, under some circumstances (typically, if the size of particular coefficients is fixed, usually but not necessarily at zero, or two or more coefficients are constrained to be equal, that is, if the model is overidentified), the goodness of fit of the model can be assessed.

FIGURE 16.2. Blau and Duncan's Basic Model of the Process of Stratification. [Path diagram; recoverable node labels: Father's education, Father's occupation, Respondent's education, First job.]
Source: Blau and Duncan 1967, 170.

In the model just discussed, there is only one indicator for each variable. However, often the analyst has available repeated measures or a set of measures thought to represent a single underlying or latent construct. In such cases, it is possible to use SEMs to assess and correct for measurement error. See Bielby, Hauser, and Featherman (1977) and Hauser, Tsai, and Sewell (1983) for two early but instructive examples. Still another use of SEMs is to estimate processes involving reciprocal causation, even involving latent variables. See Duncan, Haller, and Portes (1968) for an example. (Note that the applications just cited are all very old. This is not due to a lack of recent work, for the recent literature is vast, but rather to the fact that the early work was much more explicit
statistician-sociologist Leo Goodman "the most important quantitative sociologist in the world in the latter half of the twentieth century." Duncan was responsible for introducing path analysis (a version of structural equation models) into sociology. He used path analysis as the technical apparatus to reconceptualize intergenerational social mobility as a multistep process in which status attributes (such as education, occupational status, and income) are modeled as depending not only on parental status but also on the prior statuses of individuals. Duncan also contributed importantly to our understanding of racial differences in socioeconomic attainment, spatial and racial inequalities within cities, and, late in his career, attitude measurement.

Although lacking advanced mathematical training, Duncan probably made better use of the statistical tools at his disposal than any other social scientist, through the combination of an unusual ability to think through a problem in advance and great clarity about how to represent sociological ideas in statistical models. It is striking, and telling, that because of then-extant rules governing access to Current Population Survey data, all of the tabulations and estimates in Duncan's landmark book The American Occupational Structure (Blau and Duncan 1967) were specified in advance, without the analysts having seen a single coefficient. Interestingly, Duncan himself regarded his late book, Notes on Social Measurement (1984), as his most important contribution, a judgment not widely shared by the many researchers strongly influenced by his substantive contributions.

Born in Nocona, Texas, Duncan spent most of his precollege years in Stillwater, Oklahoma, where his father, Otis Durant Duncan, also a sociologist, was a professor at Oklahoma State University. Duncan did his undergraduate work at Louisiana State University, obtained an MA at the University of Minnesota, served three years in the U.S. Army during World War II, and then completed his PhD at the University of Chicago in 1949. He taught at Pennsylvania State University, the University of Wisconsin, the University of Chicago, the University of Michigan, the University of Arizona, and the University of California at Santa Barbara. Duncan enjoyed a second career as a composer of electronic music and was famous among people who had no idea that he was a distinguished social scientist.
about the models being estimated than much of the literature that followed, after structural equation modeling became widely used. Thus for didactic purposes the early papers are more useful.)

The strategy for estimating SEMs is to exploit the fact that the posited relationships among the variables (observed and latent) imply a particular covariance structure (that is, a set of relationships among the variances and covariances of the observed variables), which is why the technique is sometimes called covariance structure modeling. Goodness of fit is assessed by comparing the covariance structure implied by the model with the covariance structure observed in the data set being analyzed.
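To make the idea of an implied covariance structure concrete, here is a small sketch for a hypothetical three-variable recursive model with standardized variables, in which X affects M, and both M and X may affect Y. The path values and the "observed" correlations are invented for illustration and are not taken from Blau and Duncan; the function names are likewise assumptions. With the direct path from X to Y fixed at zero the model is overidentified, so the implied and observed correlations can disagree.

```python
# Sketch: correlations implied by a simple standardized path model,
# compared with (hypothetical) observed correlations.

def implied_correlations(a, b, c=0.0):
    """Correlations implied by the standardized recursive model
           M = a*X + e1
           Y = b*M + c*X + e2
    where e1 and e2 are uncorrelated with the predictors in their own
    equations and with each other."""
    return {
        ("X", "M"): a,
        ("X", "Y"): c + a * b,   # direct path plus indirect path through M
        ("M", "Y"): b + a * c,   # direct path plus the part shared through X
    }


def residual_correlations(observed, implied):
    """Observed minus implied correlation for each pair of variables."""
    return {pair: observed[pair] - implied[pair] for pair in observed}


# Hypothetical observed correlations.
observed = {("X", "M"): 0.50, ("X", "Y"): 0.35, ("M", "Y"): 0.40}

# Overidentified model: the direct path from X to Y is fixed at zero.
restricted = implied_correlations(a=0.50, b=0.40)
residuals = residual_correlations(observed, restricted)

for pair, value in residuals.items():
    print(pair, round(value, 3))
```

Here the restricted model reproduces two of the three correlations but understates the correlation between X and Y, which suggests that the constraint on the direct path is inconsistent with these invented data. SEM software formalizes this comparison by choosing parameter values that minimize the discrepancy between the implied and observed covariance matrices and by reporting fit statistics based on that discrepancy.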
that enables the analyst to explore the implications of whatever model the analyst posits on a priori grounds. Thus structural equation modeling is best seen as an interpretative procedure, with the added feature that in some cases it is possible to determine whether a particular model is consistent with observed data. Used properly in this way, SEM can be a valuable tool. (The best introduction remains the 1989 text by Bollen, which, although somewhat demanding, is intended for and accessible to social scientists. See also a collection of papers on technical issues, edited by Bollen and Long [1993]; Bollen and Curran's 2006 book using SEMs to estimate latent curve models; and Bollen and Brand's 2008 paper using SEMs to estimate random and fixed effects models.)
THE IMPORTANCE OF PROBABILITY SAMPLING

To generalize from a sample to a population (which is what social scientists are almost always interested in doing, whether we admit it or not), it is necessary to sample cases from the population of interest in such a way that each individual in the population has a known, nonzero probability of being included in the sample. Only under this circumstance do the principles of statistical inference apply. Nonetheless, many studies violate this principle, drawing "convenience" or "casual" samples. Chinese social surveys are particularly egregious in this respect, often sampling a set of provinces or cities that are said to be typical of particular types of places; this is true of even high-quality surveys such as the Chinese Health and Nutrition Survey (Henderson and others 1994). The difficulty is that there is no way of knowing to what extent and in what ways the chosen places are indeed similar to the places that are not chosen but are purported to be represented by the chosen places. In sum, samples of "typical" places are no substitute for probability samples. It is well worth the extra cost, in the sampling effort and, often, in the fieldwork, to design a sample in such a way that it can be generalized to the population of interest.
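One reason known inclusion probabilities matter is that they allow each case to be weighted by the reciprocal of its probability of selection. The sketch below illustrates this with a weighted mean; the data values are hypothetical, and the function name is an assumption introduced for illustration. It shows only the point estimate, not the variance estimation that a full survey analysis requires.

```python
# Sketch: an inverse-probability-weighted mean. Each sampled case is
# weighted by 1 / (its known probability of inclusion in the sample).
# All numbers are hypothetical.

def weighted_mean(values, inclusion_probs):
    """Weighted mean with weights equal to reciprocal inclusion probabilities."""
    if any(p <= 0 or p > 1 for p in inclusion_probs):
        raise ValueError("inclusion probabilities must be in (0, 1]")
    weights = [1.0 / p for p in inclusion_probs]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


# Two cases from an oversampled group (probability 0.5 each) and one case
# from an undersampled group (probability 0.25).
values = [10.0, 10.0, 30.0]
inclusion_probs = [0.5, 0.5, 0.25]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, inclusion_probs)
print("unweighted mean:", round(unweighted, 2))
print("weighted mean:  ", weighted)
```

With a convenience sample the inclusion probabilities are unknown, so no such correction is available, which is the practical side of the argument made above.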
ASK A FOREIGNER TO DO IT

Social scientists are notoriously bad at characterizing their own societies. A case in point: in my 1996 Chinese survey, it proved impossible to do the fieldwork in one county-level urban district because of opposition from local officials. Instead of asking me to provide a substitute place from the same stratum (recall from Chapter Nine that there were twenty-five urban strata, based on the level of education in the population), my Chinese colleagues simply substituted another district from the same city that they said was very similar to the omitted district. However, it turned out that whereas the omitted district was in the eighteenth stratum, the substitute was in the twenty-third stratum, clearly a violation of the stratified sampling design.

The truth is that if you want a clear-headed characterization of a society, you should ask a foreigner to render it. This essential point was understood by the Carnegie Corporation, which commissioned the Swedish economist and sociologist Gunnar Myrdal to head a study of race relations in the United States in the 1930s. The result was the classic monograph An American Dilemma (Myrdal 1944, vi-vii).
Still another example can be found in institutionally based studies, for example, studies of hospitals, clinics, and their catchment areas, which are often used in public health research. The justification for what amount to convenience samples is that the particular hospitals or clinics being studied are representative of all similar places.

When is it legitimate to invoke the concept of a superpopulation? I suggest that when data for a population exist from which a probability sample can be drawn, convenience samples are a poor substitute and do not meet current scientific standards, claims of graduate student poverty, lack of time, and so on, notwithstanding. However, when the population is unknown and unknowable, as in the case of Murdock and Provost's ethnographic sample, use of the available data and generalization to a superpopulation are legitimate. In the case of single cross-sectional surveys based on probability samples of the population at the time of the survey, we are on firm ground in characterizing the society as it was at the time of the survey but are increasingly on shaky ground as we try to generalize over time. It can be done, but it must be justified.

Pooling Data from Multiple Surveys

Invoking the concept of a superpopulation has considerable practical use when it can be justified. A particularly compelling application is when comparable data are available over time, as in the U.S. GSS and other repeated cross-sections. If it can be shown that relationships among the variables of interest do not vary over time, data from several years may be pooled to increase the size of the sample available for analysis. This can be a particularly useful strategy when any one year yields insufficient data to sustain reliable comparisons, for example of race differences in the United States. The basic test is a variant on the strategy for group comparisons discussed earlier in this chapter and also in Chapter Six (see also the discussion of trend analysis in Chapter Seven). There are two steps. First, estimate an equation of the form
$$\hat{Y} = a + \sum_{i=1}^{I} b_i X_i + \sum_{j=2}^{J} c_j S_j + \sum_{i=1}^{I} \sum_{j=2}^{J} d_{ij} X_i S_j \tag{16.7}$$
where the X_i are predictor variables and the S_j are cross-sectional replicates of the survey (with the first omitted to avoid linear dependency). Second, test whether the c_j and the d_ij are collectively zero. If so, you can conclude that all the samples are drawn from a single population and happily proceed to pool your data. But even if there are year-to-year variations in the level of Y (significant differences among the c_j) or in the relationship of one or more of the Xs to Y (significant differences among the d_ij), you may still wish to pool the data but include the dummy variables and interaction terms necessary to capture the way the social process you are studying changes over time. This has the advantage of permitting an analysis of change and increasing statistical power for estimating the relationships that do not vary over time. (For some recent examples of the use of this strategy, see Barkan and Greenwood [2003], Chen and Guilkey [2003], Pow… [2003], Fitzgerald and Ribar [2004], Kelly and Kelly [2005], and Tavits [2005].)

An alternative use of repeated cross-sections, which has much to recommend it, is to use data from one survey to develop a preferred model, modifying the model in light of relationships unanticipated by your theory but observed in the data. Then estimate your preferred model using data from a replicated cross-section, such as a survey conducted in the preceding or following year (recall the discussion of this strategy in Chapter Seven).
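Returning to the two-step pooling test built on Equation 16.7: under ordinary least squares with homoskedastic errors, the joint test that the survey dummies and interaction terms are all zero can be carried out as an F-test comparing a restricted model (predictors only) with the full model. The sketch below computes that F statistic from the two residual sums of squares. It is an illustration under those assumptions only; the function name and all numbers are hypothetical, and with complex survey designs a design-based Wald test would normally be used instead.

```python
# Sketch: F statistic for the joint hypothesis that all survey-dummy
# coefficients and all predictor-by-survey interactions are zero.
# Assumes OLS with homoskedastic, independent errors.

def pooling_f_statistic(rss_restricted, rss_full, n_obs, n_predictors, n_surveys):
    """Return (F, numerator df, denominator df).

    rss_restricted: residual sum of squares from Y on the predictors only
    rss_full:       residual sum of squares from the full interaction model
    """
    # (J - 1) survey dummies plus I * (J - 1) interaction terms
    q = (n_surveys - 1) * (1 + n_predictors)
    # intercept + I predictors + (J - 1) dummies + I * (J - 1) interactions
    k_full = (1 + n_predictors) * n_surveys
    df_resid = n_obs - k_full
    f = ((rss_restricted - rss_full) / q) / (rss_full / df_resid)
    return f, q, df_resid


# Hypothetical values: one predictor, three surveys, 206 pooled cases.
f_stat, df_num, df_den = pooling_f_statistic(
    rss_restricted=120.0, rss_full=100.0,
    n_obs=206, n_predictors=1, n_surveys=3,
)
print("F =", f_stat, "with", df_num, "and", df_den, "degrees of freedom")
```

A large F relative to the critical value for the stated degrees of freedom would indicate that the surveys cannot be treated as draws from a single population without including at least some of the dummy and interaction terms.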
A final possibility, in cases where information is collected for more than one individual in a household (either by interviewing more than one household member or by asking a respondent about the characteristics of other household members), is to expand the sample by treating each individual for whom information is available as a separate case. However, in such cases it is necessary to take account of the fact that observations are not independent, by adjusting for clustering within households using survey estimation procedures or by adopting the kind of multilevel modeling procedures discussed by Mason (2001), cited earlier. Moreover, when information is available for a restricted set of others, for example, spouses, it is important to be attentive to the consequences, for example, by carrying out a sensitivity analysis of differences in conclusions yielded by a sample of married people and a sample of all adults (see the discussion of sensitivity analysis in the next section).
PRACTICE A FINALNOTE:GOOD PROFESSIONAL
quantitativedata analystsandhar-ebd Now that we haveconsideredvanous issuesfacing of study, I close by offering serall a uri"r introduction to advancedtechniquesworthy that make a difference bersrar things g good professional pra&ce-the t;-dt ;;; principles' availat'tsI are-simple mediocreand supenorquantltatrvedata analysis These or brilliance insight or matllerEuny--ufyrt; tft"it upplication doesnot require particular to them is sure to improve the quality of your work' i"if f".lfi v. e* ",Adon
Understand the Properties of Your Data
or data fiom an archi\t u' Whether working with data you acquiredfrorn another-analyst you ,hoold thoroughly understandhow the data were cred y";t# il;;;; "ott""t"d, attention to the sampledesigr I and also should explore thell properties Pay particular to implement it For the sc determine whether survey estrmationis possible and how were constructedand hos n .*.on, Vo. need to understandhow any weight variables investigatorsare poorly documenrcd rr" tfr"*. Of"t afteweights provided byihe original ask them how they constnrd It is enfuely appropnateto wnte to the investigators to tfllt",I d:f::iltt^5y imposition' an their weights.You should not regardthis as :"": public use is to pro\rG for available the respJnsibilities of those who make their data adequatedocumentation. distributions for ers! You also should calculate and inspect univariate frequency you( analysis This is erq pertinent to variable in the data set, or at least every variable With respectto eachvariable'ax to ao Uy u.ing Stata's -cod.ebook- command' what you know abourlic *ft",t* ,ft" observeddistribution is plausible' given y"*.# of univari'c being studied. It is surprising how informative the inspection i.o"l",i*
ilfi;;'ilffi*or*
or tablesf .* u" ThLnextstepis to createcross-tabulations
dependentvariablesandaI meansthat show the associationbetweeneachof your central
IN THE UNITED STATES, PUBLICLY FUNDED STUDIES MUST BE MADE AVAILABLE TO THE RESEARCH COMMUNITY

It is now a requirement of both the National Science Foundation (NSF) and the National Institutes of Health (NIH) that sample surveys funded by these agencies be made available for public use in a timely way. The current NIH policy reads, "NIH endorses the sharing of final research data . . . and expects and supports the timely release and sharing of final research data from NIH-supported studies for use by other researchers. 'Timely release and sharing' is defined as no later than the acceptance for publication of the main findings from the final data set" (http://grants.nih.gov/grants/policy/nihgps_2003/NIHGPS_Part7.htm#_Toc54600131, accessed December 9, 2007). The NSF policy statement is less precise but conveys the same principle: "NSF expects . . . investigators to share with other researchers, at no more than incremental cost and within a reasonable time, the data, samples, physical collections and other supporting materials created or gathered in the course of the work. It also encourages awardees to share software and inventions or otherwise act to make the innovations they embody widely useful and usable" (http://www.nsf.gov/pubs/2001/gc101/gc101rev1.pdf, accessed December 9, 2007). Providing adequate documentation is part of the requirement.
the candidate predictor variables. This too can be extremely informative, revealing both deficiencies in the data and deficiencies in your a priori assumptions.

I still recall, with some embarrassment, an incident forty-five years ago when I was a beginning graduate student at the University of Chicago. I worked as a research assistant at the National Opinion Research Center (NORC), and Peter Rossi was the director of NORC. I ran into him one evening as he was leaving the building and carrying a great stack of computer printout: cross-tabs from the study we were working on. I made some snide remark about why should we bother with cross-tabs now that we could do regressions by computer, and he gave me a withering look and said something like, "Live and learn, kid." Of course, he was completely correct. There is a lot to be learned by getting a feel for the data before rushing to estimate fancy, or even not-so-fancy, models.
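The two inspection steps described in this section (univariate frequency distributions, then tables of means relating an outcome to each candidate predictor) are easy to script in any language. The following Python sketch uses only the standard library; the variable names and data are hypothetical, and the value 99 is an invented example of an undocumented missing-data code of the kind such inspection tends to reveal.

```python
# Sketch: a univariate frequency distribution and a table of means,
# using hypothetical survey records.
from collections import Counter, defaultdict


def frequency_distribution(records, variable):
    """Map each observed value to (count, proportion of all records)."""
    counts = Counter(r[variable] for r in records)
    n = len(records)
    return {value: (count, count / n) for value, count in sorted(counts.items())}


def mean_by_group(records, outcome, group):
    """Mean of the outcome variable within each category of the grouping variable."""
    totals = defaultdict(lambda: [0.0, 0])   # group -> [sum, count]
    for r in records:
        totals[r[group]][0] += r[outcome]
        totals[r[group]][1] += 1
    return {g: s / n for g, (s, n) in sorted(totals.items())}


# Hypothetical records: years of schooling and income (in thousands).
records = [
    {"educ": 12, "income": 30.0},
    {"educ": 12, "income": 34.0},
    {"educ": 16, "income": 50.0},
    {"educ": 16, "income": 54.0},
    {"educ": 99, "income": 40.0},   # implausible value: probably a missing-data code
]

freq = frequency_distribution(records, "educ")
means = mean_by_group(records, "income", "educ")
for value, (count, share) in freq.items():
    print(value, count, round(share, 2), "mean income:", means[value])
```

Seeing a category of 99 years of schooling in the frequency table is exactly the sort of anomaly that should send you back to the documentation before estimating any model.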
Explore Alternatives to Your a Priori Hypotheses
One of the features of truly strong research papers is that the author anticipates and explores all of the alternative explanations for the observed phenomenon or relationship that a critic might propose. In nonexperimental work the search for alternative explanations often amounts to assessing the possibility of spurious association due to the failure to include variables that affect both the independent and dependent variables in the model. Thus you need to ask yourself, is there an alternative explanation for the association I observe? In particular, might some other variable be causing both the outcome I observe and the values of my predictor variables? Then, if possible, include the candidate variables in your model, or do a side analysis (even using a different data set) to investigate the association of these variables with variables already in your model.
A nice example of the use of this strategy is a paper by Miller (2007) that explores whether granting women the vote early in the twentieth century resulted in an increase in public health spending and hence a reduction in child mortality. He finds strong evidence in support of his claim. But he recognizes that before accepting the causal argument he must rule out the possibility that suffrage legislation was endogenous, that is, a product of factors that also resulted in an increase in public health spending. He thus devotes part of his paper to various "validity tests" designed to rule out the possibility of alternative explanations for his results.

Where the strategy just described is not possible because the potential confounding variables have not been measured, it may be possible to rule them out by noting that if they were operating, the predicted results would differ from what is observed. For example, in a paper analyzing the determinants of literacy in China (Treiman 2007a), I argued for the hypothesis that nonmanual work enhances literacy whereas manual work suppresses it (146). However, before accepting this conclusion, I had to rule out the possibility that age differences in literacy reflected historical changes in China that produced differences in literacy by cohort. I did this by pointing out that if the quality of education increased (decreased) over time, we would expect an increase (decrease) in literacy for both manual and nonmanual workers rather than the observed pattern. In a similar way, I ruled out the possibility that as the nonmanual sector grew, the average "quality" of both nonmanual and manual workers declined.

Another alternative, available under some circumstances, is to sweep unmeasured potential confounders out of the analysis by estimating fixed- or random-effects models as we did in the previous chapter. Still another possibility is to adjust for the effect of potential confounders by using one of the methods for coping with endogeneity and sample selection bias discussed earlier in this chapter.
Conduct Sensitivity Analysis
Another way to gain confidence in your results, and to inspire confidence in your readers, is to show that your results are robust to alternative specifications: different functional forms by which you represent relationships in a general linear model framework, different cutting points when carrying out tabular analysis, and, more generally, different ways of representing your concepts. Like consideration of potential omitted-variable bias, this sort of exploration also may require going beyond the data set being analyzed. See, for example, Treiman and Roos (1983), who assessed the adequacy of a standard proxy for labor force experience (= age minus years of schooling minus six) by comparing estimates based on two alternative measures, the proxy measure and actual labor force experience.

Your hope, of course, is that different specifications yield similar results. But even if they do not, you need to report what you find. The goal is not to "prove" a hypothesis but to discover how the world works, and sometimes this means concluding that our data are not informative because our results are not robust to alternative specifications.
ilu il
Finalrhoughtsand FutureDirections: Research Designand Interpretation tssues403
6
3fl_,- ihat erp.-er Eri ii- an tD!-rai.r [. ri.r ::rong
er iie=:
drillgenou! F :3r.rtas a
:a t.r:--ill
t :le pcssibt:,
r
Errid ionlou:"::u r Eotrn,gthat ::r ned. For er:=:r)-e,- I ar,sued:-,.a.t sorkers coDr:-/c nr:e $ hile m-r,!iir. a I hadr o n -::u r I h:storical ch=-lr: pointins our --:e r erF€.-t an in.-:.a€ dur the ob:e:,:c I a-r the nonn-:iia ters declinecnep unmea-i--J or raldom er.-:"$ s Io adjust t'u : : raith endose:
]'our reader-:.r ifferent funcriqa &amerr ork. di]:rneralll - ditl';: I omined-r an-r* et beins anahz.: the adequaci .-i r lin,s minus \ir :"r )tr t\\ o altema::.: esults.But er e: t' I is not to "pr..i3edmes this m.-.fsi ults are not rotv::i
Hout and Hauser (1992) critiqued Erikson and Goldthorpe's important comparative study of social mobility, The Constant Flux (1992b), showing that Erikson and Goldthorpe's results are not robust to changes in the model specification, in the statistical procedure used, or in the level of aggregation of their occupational classification. See also Erikson and Goldthorpe's (1992a) response. The exchange provides an illuminating example of why it is important to carry out sensitivity analysis yourself before a critic does it for you. For a striking example of a tendentious and sloppily argued analysis that was thoroughly demolished by the long knives of critics, see Herrnstein and Murray (1994) and important critiques by Heckman (1995), Fischer and others (1996), and Hauser and Huang (1997).

One useful approach is to "bracket" your results, reporting not a point estimate but a range of estimates derived under different assumptions. For example, if it is not clear to you whether, in an attitude scale, a "don't know" response should be coded "missing" or given the middle value, intermediate between a positive and a negative attitude, try it both ways and assess (and, of course, report) the results.
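The "don't know" example just given can be made concrete with a short sketch. It computes the mean of a five-point attitude item twice, once treating "don't know" as missing and once assigning it the scale midpoint, and reports both values as a bracket. The responses, the coding "DK", and the function name are all hypothetical.

```python
# Sketch: bracketing a result under two codings of "don't know" (DK).
# Responses are on a hypothetical 1-5 attitude scale; "DK" marks don't know.

def mean_attitude(responses, dont_know_as=None):
    """Mean response. If dont_know_as is None, DK responses are dropped
    (treated as missing); otherwise they are recoded to that value."""
    coded = []
    for r in responses:
        if r == "DK":
            if dont_know_as is not None:
                coded.append(dont_know_as)
        else:
            coded.append(r)
    return sum(coded) / len(coded)


responses = [5, 4, 4, "DK", 2, "DK", 5, 1]

mean_if_missing = mean_attitude(responses)                   # DK dropped
mean_if_middle = mean_attitude(responses, dont_know_as=3)    # DK = scale midpoint

bracket = (min(mean_if_missing, mean_if_middle),
           max(mean_if_missing, mean_if_middle))
print("estimate lies between", bracket[0], "and", bracket[1])
```

The same pattern extends to regression coefficients: estimate the model under each defensible coding and report the range rather than a single figure.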
Document Your Work

You should carry out all your analysis using command files (-do- files in Stata) and producing a log of your commands and results each time you execute your command file (a -log- file in Stata). Moreover, you should use extensive comments in your command files, saying for each bit of analysis what you are doing and why you are doing it. In my own work I go further, adding comments on the results.

This practice has several advantages. First, it provides a record of what you have done. The process of research production in the social sciences from initial idea to published paper often covers a period of several years. Even if you are an efficient person who does one thing at a time and thus are able to execute your analysis from start to finish in a matter of a few weeks, you then have to submit your paper to a journal, which typically will take several months to get back to you, often with a request for revision and resubmission that entails doing additional analysis. At this point you do not want to be in the embarrassing position of not remembering exactly how you carried out the computations to produce the statistics that appear in your tables and graphs and, worse still, not being able to replicate them. If you have a well-documented command file, you will be able to figure out what you have done, and why.

Moreover, you will be able to modify your analysis and create a new set of computations efficiently. Suppose, for example, that the referees suggest that you control for an additional variable. This is a trivial task if you have an existing command file. You simply add the variable to your model and execute the command file. This is far preferable to redoing an entire section of your analysis.

Finally, you will make it possible for others to replicate, or challenge, your work, by archiving your log file so that it is available on demand. You may be tempted to obscure the details of your work so that no one else can discover errors in it. But this is not how science progresses: far better to be clear (even if wrong) than vague. If you are clear about your procedures, you make it possible for others to exactly replicate what you have done and perhaps to figure out how to do it better. Remember, the aim of the game is to advance our collective understanding of social structure and process.
Of course, the gold standard for the production of research papers is that they contain all of the information necessary to exactly replicate the research. Your goal should be to document your work thoroughly enough so that if you handed your paper and your data set to a competent analyst, he or she could reproduce every number in your paper. Laudable as this goal is, however, it tends to be frustrated by journal editors who insist on shortening papers by omitting technical detail. So in addition to describing your technical procedures as clearly as possible in your paper, archiving your log file is very good professional practice.
Do a Last Check for Errors

The last thing you should do before you submit a paper for publication (or as a term paper or a dissertation chapter or post it in a working paper series) is to execute your command
AN "AVAILABLE FROM AUTHOR" ARCHIVE

Because claims in published papers that additional materials are "available from author" usually prove to be false, at least after a few months, the California Center for Population Research (CCPR) at UCLA recently implemented a mechanism by which additional materials, for example, -do- and -log- files, can be attached to papers posted in its Population Working Paper archive. Other research centers are to be encouraged to do the same.
file and then to check every single number in your paper against the corresponding numbers in your log file. You will be amazed at the number of discrepancies you find. Because producing a professional paper is typically a lengthy process, it is extremely easy for inconsistencies to creep in. Your goal should be to produce a single command file that contains all the computations required for an analysis. Even in cases where you are analyzing more than one data set, you would be well advised to incorporate all your commands into a single file. In this way, you create a single document that produces and explains all of your work. You also minimize the chance that portions of the analysis will fail to be documented or that the documentation will be lost. For the same reason, you should incorporate side computations, even hand computations, into your command file. (In Stata the -display- command accommodates this by functioning like a hand calculator.)

The standard to be emulated, at least partly, is the lab notebook conventionally kept in a chemistry lab. Lab notebooks record the conditions under which an experiment was conducted, including the temperature and humidity of the room, whether a reagent was spilled on the floor that day (together with the exact time and description of what was spilled and where), and the outcome of each procedure, whether successful or not. We need not go that far. Nothing much is gained by recording the errors we made in the process of getting our file to execute. But we should record analytic dead ends, hypotheses that did not pan out, assumptions that proved to be incorrect, and so on. You will find such commentary enormously helpful when you return to analysis after an
absence of months or years, which, as I have noted, is not an uncommon gap. Moreover, by documenting and archiving your analytic dead ends, you may help other analysts.
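The advice above about keeping a log and folding side computations into the command file can be sketched as follows. This is only an illustration under stated assumptions: the log-file name, the two figures in the arithmetic, and the dead-end note are all made up for the example.

```stata
* Keep a log so every number in the paper can be checked against output.
* The file name "analysis.log" is a hypothetical choice.
log using analysis.log, replace text

* Side computation done in the command file rather than on a hand
* calculator: the share of a made-up subsample of 383 cases in a
* made-up total of 6,473 cases, first as a proportion, then as a percent.
display 383/6473
display %5.1f 100*383/6473

* An analytic dead end recorded as a comment for later reference
* (hypothetical wording):
* Tried an education-by-cohort interaction; dropped, no improvement in fit.

log close
```

Because the arithmetic sits in the same file as the estimation commands, rerunning the file regenerates both, and the log records the result next to the commands that produced it.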
WHAT THIS CHAPTER HAS SHOWN
In this chapter I have reviewed some general points regarding good research design; have briefly introduced a number of advanced statistical techniques and procedures, which you should pursue in further course work or independent study; have emphasized the value of probability sampling; and have concluded with some advice about good research practice. On the basis of the material we have covered in this book, you are well prepared to do high-quality and rigorous analysis of sample survey and other data. But you should not stop here, because statistical methodology in the social sciences is advancing rapidly, and a first course in data analysis is no longer sufficient to master state-of-the-art techniques, many of which I reviewed in this chapter. I thus urge you to think of this book as the beginning of a career-long commitment to continually expand your tool kit, just as I have done in the more than forty years since completing my PhD. If an old dog like me can learn new tricks, so can you! Have fun!
APPENDIX A
DATA DESCRIPTIONS AND DOWNLOAD LOCATIONS FOR THE DATA USED IN THIS BOOK
This appendix describes all of the data sets used to create the worked examples in the book. A common feature of all these surveys is that they are household samples, which means that the data need to be weighted by the number of adults in the household (the reciprocal of each adult's probability of selection within the household) to convert them into person samples; see the discussion of this issue in Chapter Nine. They are all based on probability samples of households, and the details of the design are given in other sources, indicated in the references included in this appendix.
CHINA
The 1996 survey, Life Histories and Social Change in Contemporary China (Treiman, Walder, and Li 2006), was conducted by faculty and students of People's University, Beijing, with funding from the U.S. National Science Foundation (SBR-9423453), the Ford Foundation-Beijing, and the Luce Foundation. The principal investigators were Donald J. Treiman (UCLA), Andrew G. Walder (Stanford), and Qiang Li (then at People's University and now at Qinghua University, Beijing). The survey collected extensive information on respondents' socioeconomic characteristics and educational, occupational, and family histories, and also information on their spouses, parents, children, and other family members.
The survey was based on a stratified national probability sample of the population of China age twenty to sixty-nine, yielding 6,090 cases plus a special sample of 383 village leaders (village cadres). Details of the sample design are given in Treiman (1998).
The data and accompanying documentation can be downloaded from the UCLA Social Science Data Archive at http://www.sscnet.ucla.edu/issr/da. Click Catalog, Index, Asia-China, and then Life Histories and Social Change in Contemporary China, 1996.
EASTERN EUROPE
The survey of Social Stratification in Eastern Europe after 1989 (Szélenyi and Treiman 1994) consists of six general population surveys, based on probability samples of the adult populations of Bulgaria, the Czech Republic, Hungary, Poland, Russia, and Slovakia, with all the surveys conducted in 1993 except the Polish survey, which was carried out in 1994, and all surveys using an essentially identical questionnaire. Each survey sampled approximately 5,000 adults using a multistage stratified national probability sample design (except that the Polish sample was smaller, approximately 3,500 adults). Details on the survey design can be found in Treiman (1994). These surveys were funded by the U.S. National Science Foundation (SES-9111722 and SBR-9310315), the National Council for Soviet and Eastern European Research (806-29), the Dutch National Science Foundation (NWO), and various Eastern European governmental agencies. The principal investigators were Ivan Szélenyi and Donald J. Treiman, at that time both at UCLA.
The focus of the data collection was on the effect of the collapse of communism on life chances. Extensive information was collected on respondents' socioeconomic characteristics and educational, occupational, residential, and family histories, and also information on their spouses, parents, children, and other family members. In addition, a great deal of political information was collected, as well as information that permitted a contrast between 1988 and 1993.
The data and accompanying documentation can be downloaded from the UCLA Social Science Data Archive at http://www.sscnet.ucla.edu/issr/da. Click Catalog, Index, Europe-Bulgaria, and then Social Stratification in Eastern Europe after 1989: General Population Survey.
The survey of elites that was carried out in each of the six nations at the same time as the general population survey and analyzed in Chapter Thirteen is not currently available for public distribution due to the difficulty of protecting the confidentiality of respondents. The difficulty with an elite survey, of course, is that individuals are fairly readily identifiable from details of their biographies.
SOUTH AFRICA
The Survey of Socioeconomic Opportunities and Achievement in South Africa (Treiman, Moeno, and Schlemmer 1994) is a multistage national probability sample survey of all races in "greater South Africa" carried out in the early 1990s in several stages between 1991 and 1994. Greater South Africa refers to what was historically, and is currently, the South African nation; that is, it includes the "TVBC States," which at the time of data collection were nominally independent puppet states hived off by the apartheid regime (see Treiman [2007b] for a brief history). The sample consists of a general population sample of 8,714 adults and a Black elite sample of 372 adults. The principal investigators were Donald J. Treiman and two South African sociologists, Sylvia N. Moeno and Lawrence Schlemmer. See Treiman, Lewin, and Lu (2006) for details on the survey design.
Extensive information was collected on respondents' socioeconomic characteristics and educational, occupational, residential, and family histories, and also information on their spouses, parents, children, and other family members.
The data and accompanying documentation can be downloaded from the UCLA Social Science Data Archive at http://www.sscnet.ucla.edu/issr/da. Click Catalog, Index, Africa-South Africa, and then Survey of Socioeconomic Opportunities and Achievement.
GENERAL SOCIAL SURVEY
Funded by the U.S. National Science Foundation, the General Social Survey (GSS; Davis, Smith, and Marsden 2007) is a repeated cross-sectional survey, with data collected from a national multistage probability sample of U.S. adults: about 1,500 people approximately each year from 1972 through 1991 and then, beginning in 1994, about 3,000 people every other year. The principal investigators are James A. Davis and Tom W. Smith and, in recent years, Peter V. Marsden. Appendix B provides details on the sample design as it has changed over the years.
The GSS is intended to be an all-purpose survey, to permit analysis of the attitudes, behavior, and characteristics of the U.S. population by those who cannot afford the massive resources required to mount a national probability sample survey. As it has matured, it has become an increasingly valuable vehicle for the study of social change, especially in attitudes. The strategy of the GSS is to repeat a substantial portion of the questionnaire year after year to permit the analysis of changes over time but also to incorporate new questions that are responsive to changing conditions and concerns.
The data may be downloaded from the National Opinion Research Center at the University of Chicago, the producer of the GSS: http://www.norc.org/GSS+Website/ It is also possible to do data analysis using this site, without actually downloading the data. An alternative site, which also permits both data analysis and downloading, is the SDA Archive at the University of California-Berkeley: http://sda.berkeley. For information regarding how to download or purchase the documentation, see http://www.gss.norc.org.
APPENDIX B
SURVEY ESTIMATION WITH THE GENERAL SOCIAL SURVEY
The General Social Survey (GSS) uses a stratified multistage probability sample (described in Davis, Smith, and Marsden [2007, Appendix A], and the sources cited there), which means that correct estimates of standard errors require survey estimation procedures. Unfortunately, the GSS documentation is not complete, presumably to maintain confidentiality; only the primary sampling units (PSUs) are identified by the SAMPCODE variable; neither the strata nor the secondary sampling units are identified. This is unfortunate because it precludes exploiting Stata's procedures for adjusting for multistage stratified sampling. Also, information is not provided that would permit a finite population correction, although this limitation is not important, given the large number of units in the population sampled at each stage. Moreover, the sample design has changed over the years, with a new sampling frame created each decade based on the decennial census and additional major changes in 1976, from a block quota design to a full probability sample design; in 2004, with the introduction of a partial list-based sample based on the U.S. Postal Service address list, an aggressive effort to convert a subset of initial nonrespondents, and post-enumeration adjustments for differential nonresponse; and in 2006, with the introduction of a Spanish language sample. Finally, in 1982 and 1987 Blacks were oversampled. These changes complicate pooling data across years.
ANALYZING DATA FROM A SINGLE YEAR
This section describes how to analyze data from a single year of the GSS. In the following section, I offer suggestions for analyzing surveys from more than one year.
The 1972 to 1976 Block Quota Samples
In the early years of the GSS, probability sampling was carried out down to the block level. Then, within each block, interviewers traveled in a specified way around the block until they had interviewed a specified number of people with particular characteristics. In 1975 and 1976, half the interviews were conducted using the block quota method and half using full probability sampling. Comparisons of the two procedures concluded that the block quota samples under-represented fully employed males.
Although procedures for statistical inference with quota samples exist (Stephan and McCarthy 1958), and the block quota samples have design effects of about 1.5, approximately the same as for the full probability samples (Davis, Smith, and Marsden 2007), a reasonable approach is to treat the block quota samples as if they are probability samples and to use the same survey estimation procedures as for the true probability samples.
Rather than weighting the block quota samples to match distributions observed in the 1970 census, simply include gender and employment status in your analysis. For summary statistics, one possibility is to inflate the number of fully employed males by weighting and to report both the original and the inflated values, thereby bracketing the range within which the values of the statistics fall.
The 1977 to 2002 Surveys, Except 1982 and 1987
With the exception of the 1982 and 1987 surveys, which included oversamples of Blacks, the 1977-2002 surveys, all of which are full probability samples, can be treated in a standard way. You need to adjust for the fact that the GSS, like most household surveys, is a sample of households rather than people. But the eligible population consists of adults (people age eighteen and over) who are capable of responding to an interview. Because households are randomly sampled within small areas but only one randomly chosen adult per household is interviewed, adults living in households with many adults have a smaller chance of being included in the sample than do adults living in households with few adults. A reasonable way to convert the sample of households into a sample of people is to weight each respondent by the ratio of the number of adults in the household to the mean number of adults in all households in the sample. This can be done in Stata by constructing a household weight variable, HHWT:
    egen adultm = mean(adults)
    gen hhwt = adults/adultm
Then weight your data by this variable. (In fact, because Stata renorms probability weights to the original sample size, you can simply use the ADULTS variable as your weight variable, unless you are using this variable as a component in a more complex weight variable, as in the next section, in which case you should use HHWT as the component.)
Although the GSS is a multistage sample with two, and for some PSUs three, stages, the GSS only provides information on the primary sampling units (metropolitan areas and nonmetropolitan counties) and no information on strata (based on region, size of place, and race/ethnicity). This means that we can go only part way to adjusting for clustering in the GSS sample design. Here are the Stata commands that will accomplish this, using the GSS PSU variable, SAMPCODE:
    svyset sampcode [pweight=adults]
or
    svyset sampcode [pweight=hhwt]
Note that this command also adjusts for differential household size.
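As a sketch of how the steps described in this section fit together, the lines below build the household weight, inspect its mean, declare the design, and request one design-based estimate. ADULTS and SAMPCODE are the GSS variables named in the text; the outcome variable educ is an assumption chosen only for illustration, and the sketch assumes a single survey year is in memory.

```stata
* Household-size weight: adults in this household relative to the
* mean number of adults across all households in the sample.
egen adultm = mean(adults)
generate hhwt = adults/adultm

* Inspect the weight; it should average 1.0 over cases with
* nonmissing values of adults.
summarize hhwt

* Declare the PSU variable and the probability weight.
svyset sampcode [pweight=hhwt]

* Design-based estimate of a mean; "educ" is an illustrative outcome.
svy: mean educ
```

Any estimation command that supports the svy prefix can replace the last line once the design has been declared.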
The 1982 and 1987 Surveys with Oversamples of Blacks
If you are analyzing the 1982 or 1987 surveys and want to compute descriptive statistics, you need to adjust for the fact that Blacks were oversampled. To adjust for both the oversampling of Blacks and differential household size, create a new weight variable that is the product of the OVERSAMP weight variable provided by the GSS and the weight variable you constructed to correct for differential household size, that is,
    gen newwt = hhwt*oversamp
(Check that the mean of this new variable is 1.0.) Then set your data for survey analysis:
    svyset sampcode [pweight=newwt]
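The combined weight can be checked along the following lines. This is a minimal sketch: the rescaling step is an optional addition assumed here, not something the GSS documentation prescribes, and it presumes that hhwt has already been constructed and that only the year being analyzed is in memory.

```stata
* Combined weight: household-size weight times the GSS oversample weight.
generate newwt = hhwt*oversamp

* Inspect the mean; the product of two weights that each average 1.0
* need not itself average exactly 1.0.
summarize newwt

* Optional (assumed) step: rescale so the weight averages 1.0.
* r(mean) holds the mean left behind by the preceding summarize.
replace newwt = newwt/r(mean)
summarize newwt

* Declare the design with the combined weight.
svyset sampcode [pweight=newwt]
```

Rescaling a probability weight by a constant leaves point estimates and design-based standard errors unchanged; it only makes the weight easier to compare with the household weight.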
The 2004 and 2006 Surveys
new sampling procedurethat exploits IiE In the 2004 GSS,NORC introduced a radically Ly tne U S' tostat Service' which in lffr4 availability of a list of addresses For areas covered b-r dr -Jntuin"a io filoit"tt"uttuigh ?99J). covered T|percent of househorot to small areas-in essen'a p".iJ S"*i* fit,' it was possibleto go directly from PSUs s""onO innovation was an aggressiveefi-cm from the PSU to th" t"rti-y tu-piing-onit 'q' to respondents.The secondinnovaDi! to convert a random half of initiar noiuespondents data were weighted to make them represenrF necessitaFda changein the way the GS3 & had rc be weighted by twice the weight of tive of the population-tfre convertedcases I44SSadiustsboth ,ir onty t uti."." tollowed up. The variable ;;G'"ilJ;r" size' thisandfor differentialhousehold r* data' this variab9 i: nan€d ]1T:::: Note that in the originut u""'on ot th" 2004 variableappearstwrce' F i; the 1972-2006cumulativeflle' this I. ,h; ;ill;";td p't for all years; for 2004 and 2006 the !1TSSfor years 2004 and 2006 and as I4TSSALL files earlier vou a: *::i1l-:l:h" ur" iA.nti"ul. Thus, depending on which ;;; othernan (orwhatever
ffi;;:;ilil;;;
orwrsstoNEI{I4T to nn*n" twssn'-o+
to have a comparableweight variable :ir you give to your consmlcrco w"igftit*i*f"l utt tiflrt;o or poole
The FORMWT Variable
In some years (1978, 1980, and 1982-1985), some questions were asked only of a subsample of respondents. Although the intent was to administer the questions to a random subset, this was not always realized (Smith and Peterson 1986). Thus, the GSS offers a correction weight, FORMWT. My recommendation is that you not use this variable but instead do multiple imputation (see Chapter Eight) to create a complete data set that includes all respondents.
SURVEYS FROM MORE THAN ONE YEAR
If you want to pool surveys from more than one year, it is reasonable to treat YEAR as the stratum variable because the surveys from each year are independent, and YEAR is fixed. The Stata command to accomplish this is
    svyset sampcode [pweight=newwt], strata(year)
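A pooled analysis along these lines might look like the sketch below. It rests on assumptions not stated in the text: that newwt has been defined for every year in the pooled file (for example, equal to hhwt in years without an oversample), and that educ is the outcome of interest, chosen only for illustration.

```stata
* Pooled design: PSUs nested within years, with YEAR as the stratum.
svyset sampcode [pweight=newwt], strata(year)

* Review the declared design: number of strata, PSUs per stratum,
* and observations per PSU.
svydescribe

* Design-based means by survey year; "educ" is an illustrative outcome.
svy: mean educ, over(year)
```

Running svydescribe before estimating is a cheap way to confirm that each year appears as its own stratum and that no stratum is left with a single PSU.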
REFERENCES
Abadie, Alberto, and Javier Gardeazabal. 2003. The economic costs of conflict: A case study of the Basque country. American Economic Review 93(1):113-132.
Abadie, Alberto, David Drukker, Jane Leber Herr, and Guido W. Imbens. 2004. Implementing matching estimators for average treatment effects in Stata. Stata Journal 4(3):290-311.
Agresti, Alan. 2002. Categorical data analysis. 2nd ed. New York: Wiley-Interscience.
Allison, Paul D. 2001. Missing data. Sage university papers series on quantitative applications in the social sciences, 07-136. Thousand Oaks, CA: Sage.
Allison, Paul D. 2005. Fixed effects regression methods for longitudinal data using SAS. Cary, NC: SAS Institute Inc.
Almond, Douglas. 2006. Is the 1918 influenza pandemic over? Long-term effects of in utero influenza exposure in the post-1940 U.S. population. Journal of Political Economy 114(4):672-712.
Aly, Hassan Y., and Michael P. Shields. 1991. Son preference and contraception in Egypt. Economic Development and Cultural Change 39(2):353-370.
Anderson, Andy B., Alexander Basilevsky, and Derek P. J. Hum. 1983. Missing data: A review of the literature. In Handbook of survey research, ed. Peter H. Rossi, James D. Wright, and Andy B. Anderson, 415-494. New York: Academic Press.
Andrews, Frank M., James N. Morgan, John A. Sonquist, and Laura Klem. 1973. Multiple classification analysis: A report on a computer program for multiple regression using categorical predictors. 2nd ed. Ann Arbor: University of Michigan, Institute for Social Research.
Angrist, Joshua D. 1990. Lifetime earnings and the Vietnam era draft lottery: Evidence from Social Security Administrative records. American Economic Review 80(3):313-336. (See also the errata, 80[5]:1284-1286.)
Anscombe, F. J. 1973. Graphs in statistical analysis. American Statistician 27(1):17-22.
Ansolabehere, Stephen, James M. Snyder Jr., and Charles Stewart III. 2000. Old voters, new voters, and the personal vote: Using redistricting to measure the incumbency advantage. American Journal of Political Science 44(1):17-34.
Ashenfelter, Orley, and Alan Krueger. 1994. Estimates of the economic return to schooling from a new sample of twins. American Economic Review 84(5):1157-1173.
Astone, Nan Marie, Robert Schoen, Margaret Ensminger, and Kendra Rothert. 2000. School reentry in early adulthood: The case of inner-city African Americans. Sociology of Education 73(3):133-154.
Baltagi, Badi H. 2005. Econometric analysis of panel data. 3rd ed. New York: Wiley.
Barkan, Steven E., and Susan F. Greenwood. 2003. Religious attendance and subjective well-being among older Americans: Evidence from the General Social Survey. Review of Religious Research 45(2):116-129.
Barker, D.J.P. 1998. Mothers, babies, and health in later life. Edinburgh: Churchill Livingstone.
Barnes, Samuel H., and Max Kaase. 1979. Political action: An eight nation study, 1973-1976. Machine-readable data file. Samuel H. Barnes and Max Kaase [principal investigators]. Ann Arbor, MI: Inter-University Consortium for Political and Social Research [distributor].
Baum, Christopher F. 2006. An introduction to modern econometrics using Stata. College Station, TX: Stata Press.
Becker, Howard S. 1986. Writing for social scientists: How to start and finish your thesis, book, or article. Chicago: University of Chicago Press.
Becker, Mark P., and Clifford C. Clogg. 1989. Analysis of sets of two-way contingency tables using association models. Journal of the American Statistical Association 84(405):142-151.
Becker, Sascha O., and Marco Caliendo. 2007. Sensitivity analysis for average treatment effects. Stata Journal 7(1):71-83.
Becker, Sascha O., and Andrea Ichino. 2002. Estimation of average treatment effects based on propensity scores. Stata Journal 2(4):358-377.
Berelson, Bernard. 1979. Romania's 1966 anti-abortion decree: The demographic experience of the first decade. Population Studies 33(2):209-222.
Berk, Richard A. 1983. An introduction to sample selection bias in sociological data. American Sociological Review 48(3):386-398.
Berk, Richard A. 1990. A primer on robust regression. In Modern methods of data analysis, ed. John Fox and J. Scott Long, 292-324. Newbury Park, CA: Sage.
Berk, Richard A., and Phyllis J. Newton. 1985. Does arrest really deter wife battery? An effort to replicate the findings of the Minneapolis Spouse Abuse Experiment. American Sociological Review 50(2):253-262.
Berk, Richard A., and Subhash C. Ray. 1982. Selection biases in sociological data. Social Science Research 11(4):352-398.
Bielby, William T., Robert M. Hauser, and David L. Featherman. 1977. Response errors of Black and Nonblack males in models of the intergenerational transmission of socioeconomic status. American Journal of Sociology 82(6):1242-1288.
Bishop, Yvonne M. M., Stephen E. Fienberg, and Paul W. Holland. 1975. Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
Black, Dan A., and Jeffrey A. Smith. 2003. How robust is the evidence on the effects of college quality? Evidence from matching. Journal of Econometrics 121(1-2):99-124.
Blau, Peter M., and Otis Dudley Duncan. 1967. The American occupational structure. New York: Wiley.
Blinder, Alan S. 1973. Wage discrimination: Reduced form and structural estimates. Journal of Human Resources 8(4):436-455.
Bollen, Kenneth A. 1989. Structural equations with latent variables. New York: Wiley.
Bollen, Kenneth A., and Jennie E. Brand. 2008. Fixed and random effects in panel data using structural equation models. Population Working Paper PWP-CCPR-2008-003, California Center for Population Research, University of California, Los Angeles.
Bollen, Kenneth A., and Patrick J. Curran. 2006. Latent curve models: A structural equation perspective. New York: Wiley.
Bollen, Kenneth A., and Robert W. Jackman. 1990. Regression diagnostics: An expository treatment of outliers and influential cases. In Modern methods of data analysis, ed. John Fox and J. Scott Long, 257-291. Newbury Park, CA: Sage.
Bollen, Kenneth A., and J. Scott Long, eds. 1993. Testing structural equation models. Newbury Park, CA: Sage.
Boskin, Michael J. 1974. A conditional logit model of occupational choice. Journal of Political Economy 82(2, Part 1):389-398.
Brand, Jennie E. 2006. The effects of job displacement on job quality: Findings from the Wisconsin Longitudinal Study. Research in Social Stratification and Mobility 24(3):275-298.
Brand, Jennie E., and Charles Halaby. 2006. Regression and matching estimates of the effects of elite college attendance on educational and career achievement. Social Science Research 35(3):749-770.
Brand, Jennie E., and Yu Xie. 2007. Who benefits most from college? Evidence for negative selection in heterogeneous economic returns to higher education. Population Working Paper PWP-CCPR-2007-035, California Center for Population Research, University of California, Los Angeles.
Breen, Richard. 1996. Regression models: Censored, sample selected, or truncated data. Sage university papers series on quantitative applications in the social sciences, 07-111. Thousand Oaks, CA: Sage.
Breen, Richard, and Jan O. Jonsson. 2000. Analyzing educational careers: A multinomial transition model. American Sociological Review 65(5):754-772.
Breslow, Norman E. 1996. Statistics in epidemiology: The case-control study. Journal of the American Statistical Association 91(433):14-28.
Brick, J. Michael, and Graham Kalton. 1996. Handling missing data in survey research. Statistical Methods in Medical Research 5(3):215-238.
Budig, Michelle J., and Paula England. 2001. The wage penalty for motherhood. American Sociological Review 66(2):204-225.
Burgess, Eric. 1978. To the red planet. New York: Columbia University Press.
Buttenheim, Alison M. 2006a. Microfinance programs and contraceptive use: Evidence from Indonesia. Population Working Paper PWP-CCPR-2006-020, California Center for Population Research, University of California, Los Angeles.
Buttenheim, Alison M. 2006b. Flood exposure and child health in Bangladesh. Population Working Paper PWP-CCPR-2006-…, California Center for Population Research, University of California, Los Angeles.
Cameron, A. Colin, and Pravin K. Trivedi. 1998. Regression analysis of count data. Econometric Society Monographs, 30. New York: Cambridge University Press.
Cameron, Lisa. 2000. The residency decision of elderly Indonesians: A nested logit analysis. Demography 37(1):17-27.
Campbell, Cameron, and James Z. Lee. 2005. Deliberate fertility control in late imperial China: Spacing and stopping in the Qing imperial lineage. Population Working Paper PWP-CCPR-2005-041, California Center for Population Research, University of California, Los Angeles.
Campbell, Donald T. 1957. Factors relevant to the validity of experiments in social settings. Psychological Bulletin 54(4):297-312.
Campbell, Donald T., and David A. Kenny. 1999. A primer on regression artifacts. New York: Guilford Press.
Campbell, Donald T., and H. Laurence Ross. 1968. The Connecticut crackdown on speeding: Time-series data in quasi-experimental analysis. Law and Society Review 3(1):33-53.
Campbell, Donald T., and Julian C. Stanley. 1966. Experimental and quasi-experimental designs for research. Chicago: Rand McNally.
Carmines, Edward G., and Richard A. Zeller. 1979. Reliability and validity assessment. Sage university papers series on quantitative applications in the social sciences, 07-017. Beverly Hills, CA: Sage.
Chamberlain, G. 1980. Analysis of covariance with qualitative data. Review of Economic Studies 47(1):225-238.
Chattopadhyay, Arpita, Michael J. White, and Cornelius Debpuur. 2006. Migrant fertility in Ghana: Selection versus adaptation and disruption as causal mechanisms. Population Studies 60(2):189-203.
Chen, Susan, and David K. Guilkey. 2003. Determinants of contraceptive method choice in rural Tanzania between 1991 and 1999. Studies in Family Planning 34(4):263-276.
Chen, Yiu Por, and Zai Liang. 2007. Educational attainment of migrant children: The forgotten story of China's urbanization. In Education and reform in China, ed. Emily Hannum and Albert Park, 117-132. Oxford: Routledge.
Clark, T. G., and D. G. Altman. 2003. Developing a prognostic model in the presence of missing data: An ovarian cancer study. Journal of Clinical Epidemiology 56(1):28-37.
Clogg, Clifford C. 1982. Using association models in sociological research: Some examples. American Journal of Sociology 88(1):114-134.
Cohen, Gidon. 2005. Propensity score methods and the Lenin School. Journal of Interdisciplinary History 36(2):209-232.
Cohen, Jacob, and Patricia Cohen. 1975. Applied multiple regression/correlation analysis for the behavioral sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Daula, Thomas, D. Alton Smith, and Ray Nord. 1990. Inequality in the military: Fact or fiction? American Sociological Review 55(5):714-718.
Davis, James A. 1971. Elementary survey analysis. Englewood Cliffs, NJ: Prentice Hall.
Davis, James A. 1974. Hierarchical models for significance tests in multivariate contingency tables: An exegesis of Goodman's recent papers. Sociological Methodology 5:189-231.
Davis, James A., and Tom W. Smith. 1992. The NORC General Social Survey: A user's guide. Guides to major social science data bases 1. Thousand Oaks, CA: Sage.
Davis, James A., Tom W. Smith, and Peter V. Marsden. 2007. General Social Surveys, 1972-2006 cumulative file [computer file]. Principal investigator, James A. Davis; director and coprincipal investigator, Tom W. Smith; coprincipal investigator, Peter V. Marsden. Chicago: National Opinion Research Center [producer]; Storrs, CT: The Roper Center for Public Opinion Research, University of Connecticut; Ann Arbor, MI: Inter-University Consortium for Political and Social Research [distributors].
Dawson, Deborah A. 2000. The link between family history and early onset alcoholism: Earlier initiation of drinking or more rapid development of dependence? Journal of Studies on Alcohol 61(5):637-646.
Dehejia, Rajeev H., and Sadek Wahba. 2002. Propensity score-matching methods for nonexperimental causal studies. Review of Economics and Statistics 84(1):151-161.
Deng, Zhong, and Donald J. Treiman. 1997. The impact of the Cultural Revolution on trends in educational attainment in the People's Republic of China. American Journal of Sociology 103(2):391-428.
DiPrete, Thomas A., and Jerry D. Forristal. 1994. Multilevel models: Methods and substance. Annual Review of Sociology 20:331-357.
DiPrete, Thomas A., and David B. Grusky. 1990a. Recent trends in the process of stratification. Demography 27(4):617-637.
DiPrete, Thomas A., and David B. Grusky. 1990b. Structure and trend in the process of stratification for American men and women. American Journal of Sociology 96(1):107-143.
DiPrete, Thomas A., and David B. Grusky. 1990c. The multilevel analysis of trends with repeated cross-sectional data. Sociological Methodology 20:337-368.
Domański, Henryk. 2008. A new dimension of social stratification in Poland? Class membership and electoral voting in 1991-2001. European Sociological Review 24(2):169-182.
Downey, Douglas B. 1995. When bigger is not better: Family size, parental resources, and children's educational performance. American Sociological Review 60(5):746-761.
Dubin, Jeffrey, and Douglas Rivers. 1989. Selection bias in linear regression, logit and probit models. Sociological Methods and Research 18(2-3):360-390.
Duncan, Beverly. 1965. Family factors and school dropout: 1920-1960. Cooperative Research Project 2258 (U.S. Office of Education). Ann Arbor, MI: Population Studies Center.
Duncan, Otis Dudley. 1968. Inheritance of poverty or inheritance of race? In On understanding poverty: Perspectives from the social sciences, ed. Daniel Patrick Moynihan, 85-110. New York: Basic Books.
Duncan, Otis Dudley. 1975. Introduction to structural equation models. New York: Academic Press.
Duncan, Otis Dudley. 1979. How destination depends on origin in the occupational mobility table. American Journal of Sociology 84(4):793-803.
Duncan, Otis Dudley. 1984. Notes on social measurement: Historical and critical. New York: Russell Sage Foundation.
Duncan, Otis Dudley, Archibald O. Haller, and Alejandro Portes. 1968. Peer influences on aspirations: A reinterpretation. American Journal of Sociology 74(2):119-137.
Eliason, Scott R. 1993. Maximum likelihood estimation: Logic and practice. Sage university papers series on quantitative applications in the social sciences, 07-096. Newbury Park, CA: Sage.
Eltinge, John L., and William M. Sribney. 1996. svy1: Some basic concepts for design-based analysis of complex survey data. In Stata technical bulletin reprints, vol. 6, ed. H. Joseph Newton, 208-213. College Station, TX: Stata Corporation.
Entwisle, Barbara, and William M. Mason. 1985. Multilevel effects of socioeconomic development and family planning programs on children ever born. American Journal of Sociology 91(3):616-649.
Erikson, Robert, and John H. Goldthorpe. 1987a. Commonality and variation in social fluidity in industrial nations. Part I: A model for evaluating the "FJH hypothesis." European Sociological Review 3(1):54-77.
Erikson, Robert, and John H. Goldthorpe. 1987b. Commonality and variation in social fluidity in industrial nations. Part II: The model of core social fluidity applied. European Sociological Review 3(2):145-166.
Erikson, Robert, and John H. Goldthorpe. 1992a. The CASMIN project and the American dream. European Sociological Review 8(3):283-305.
Erikson, Robert, and John H. Goldthorpe. 1992b. The constant flux: A study of class mobility in industrial societies. Oxford: Clarendon.
Ettner, Susan. 2004. Methods for addressing selection bias in observational studies. Text version of a slide presentation at a National Research Service Award Trainees Research Conference. Agency for Healthcare Research and Quality, Rockville, MD. http://www.ahrq.gov/fund/training/ettnertxt.htm (accessed December 2007).
Evans, M.D.R., and Jonathan Kelley. 2004. Australian economy and society 2002: Religion, morality, and public policy in international perspective, 1984-2002. Sydney: Federation Press.
Evans, M.D.R., Jonathan Kelley, Joanna Sikora, and Donald J. Treiman. 2005. Scholarly culture and educational success in 27 nations. Revised version of a paper presented at the World Congress of Sociology, Brisbane, Australia, July 2002.
Fair, Ray C. 1978. A theory of extramarital affairs. Journal of Political Economy 86(1):45-61.
Featherman, David L., and Robert M. Hauser. 1978. Opportunity and change. New York: Academic Press.
Fienberg, Stephen E. 1980. The analysis of cross-classified categorical data. 2nd ed. Cambridge, MA: MIT Press.
Fischer, Claude S., Michael Hout, Martín Sánchez Jankowski, Samuel R. Lucas, Ann Swidler, and Kim Voss. 1996. Inequality by design: Cracking "The Bell Curve" myth. Princeton, NJ: Princeton University Press.
Fisher, Ronald A. [1925] 1970. Statistical methods for research workers. 14th ed. Edinburgh: Oliver and Boyd.
Fitzgerald, John M., and David C. Ribar. 2004. Welfare reform and female headship. Demography 41(2):189-212.
Fox, John. 1991. Regression diagnostics. Sage university papers series on quantitative applications in the social sciences, 07-079. Newbury Park, CA: Sage.
Fox, John. 1997. Applied regression analysis, linear models, and related methods. Thousand Oaks, CA: Sage.
Fox, John, and Georges Monette. 1992. Generalized collinearity diagnostics. Journal of the American Statistical Association 87(417):178-183.
Frankenberg, Elizabeth, and William M. Mason. 1995. Maternal education and health-related behaviors: A preliminary analysis of the 1993 Indonesian family life survey. Journal of Population 1(1):21-44.
Frankenberg, Elizabeth, and Duncan Thomas. 2001. Women's health and pregnancy outcomes: Do services make a difference? Demography 38(2):253-265.
Fu, Vincent Kang. 1998. sg88: Estimating generalized ordered logit models. In Stata technical bulletin reprints, vol. 8, ed. H. Joseph Newton, 160-164. College Station, TX: Stata Corporation.
Fu, Vincent Kang. 2001. Racial intermarriage pairings. Demography 38(2):147-159.
Gamoran, Adam, and Robert D. Mare. 1989. Secondary school tracking and stratification: Compensation, reinforcement, or neutrality? American Journal of Sociology 94(5):1146-1183.
Ganzeboom, Harry B. G., Paul de Graaf, and Donald J. Treiman. 1992. An international scale of occupational status. Social Science Research 21(1):1-56.
---, Ruud Luijkx, and Donald J. Treiman. 1989. Intergenerational class mobility in comparative perspective. Research in Social Stratification and Mobility 8:3-84.
---, and Donald J. Treiman. 1996. Internationally comparable measures of occupational status for the 1988 international standard classification of occupations. Social Science Research 25(3):201-239.
Gaziano, Cecilie. 2005. Comparative analysis of within-household respondent selection techniques. Public Opinion Quarterly 69(1):124-157.
Gelman, Andrew, and Donald B. Rubin. 1995. Avoiding model selection in Bayesian social research. Sociological Methodology 25:165-173.
Gerber, Theodore P. 2000. Membership benefits or selection effects? Why former Communist Party members do better in post-Soviet Russia. Social Science Research 29(1):25-50.
---. 2001. The selection theory of persisting party advantages in Russia: More evidence and implications. Social Science Research 30(4):653-671.
Geronimus, Arline T., and Sanders Korenman. 1992. The socioeconomic consequences of teen childbearing reconsidered. Quarterly Journal of Economics 107(4):1187-1214.
Gilbert, G. Nigel. 1981. Modeling society: An introduction to loglinear analysis for social researchers. London: George Allen and Unwin.
Goldberger, Arthur S. 1968. Topics in regression analysis. London: Macmillan.
Goldstein, Harvey. 2003. Multilevel statistical models. 3rd ed. London: Arnold.
Goodman, Leo A. 1972. A general model for the analysis of surveys. American Journal of Sociology 77(6):1035-1086.
---. 1978. Analyzing qualitative/categorical data: Log-linear models and latent-structure analysis. Cambridge, MA: Abt Books.
---. 1979. Simple models for the analysis of association in cross-classifications having ordered categories. Journal of the American Statistical Association 74(367):537-552.
---. 1984. The analysis of cross-classified data having ordered categories. Cambridge, MA: Harvard University Press.
---, and Michael Hout. 1998. Statistical methods and graphical displays for analyzing how the association between two qualitative variables differs among countries, among groups, or over time: A modified regression-type approach. In Sociological Methodology 1998, ed. Adrian E. Raftery, 175-230. Washington, DC: American Sociological Association. (See also the comments by Xie and Yamaguchi, and the reply.)
Gould, William W. 1993. sg19: Linear splines and piecewise linear functions. In Stata technical bulletin reprints, vol. 3, ed. Sean Becketti, 98-104. College Station, TX: Stata Corp.
---. 2000. sg124: Interpreting logistic regression in all its forms. In Stata technical bulletin reprints, vol. 9, ed. H. Joseph Newton, 257-270. College Station, TX: Stata Corp.
---, and William Sribney. 1999. Maximum likelihood estimation with Stata. College Station, TX: Stata Press.
Greenberg, David F. 1991. Modeling criminal careers. Criminology 29(1):17-46.
Greene, William H. 2008. Econometric analysis. 6th ed. Upper Saddle River, NJ: Prentice Hall.
Grusky, David B., and Robert M. Hauser. 1984. Comparative social mobility revisited: Models of convergence and divergence in 16 countries. American Sociological Review 49(1):19-38.
Hagan, John, and Patricia Parker. 1985. White-collar crime and punishment. American Sociological Review 50(3):302-316.
Halaby, Charles N. 2004. Panel models in sociological research: Theory into practice. Annual Review of Sociology 30:507-544.
Hamerle, A., and G. Ronning. 1995. Panel analysis for qualitative variables. In Handbook of statistical modeling for the social and behavioral sciences, ed. Gerhard Arminger, Clifford C. Clogg, and Michael E. Sobel, 401-451. New York: Plenum.
Hamilton, Lawrence C. 1992a. srd1: How robust is robust regression? In Stata technical bulletin reprints, vol. 1, ed. Joseph Hilbe, 169-175. Santa Monica, CA: Computing Resource Center.
---. 1992b. Regression with graphics: A second course in applied statistics. Belmont, CA: Duxbury Press.
---. 2006. Statistics with Stata (updated for version 9). Belmont, CA: Brooks/Cole.
Hanushek, Eric A., and John E. Jackson. 1977. Statistical methods for social scientists. New York: Academic Press.
Harding, David. 2002. Counterfactual models of neighborhood effects: The effect of neighborhood poverty on dropping out and teenage pregnancy. American Journal of Sociology 109(3):676-719.
Hardy, Melissa A. 1989. Estimating selection effects in occupational mobility in a 19th-century city. American Sociological Review 54(5):834-843.
---. 1993. Regression with dummy variables. Sage university papers series on quantitative applications in the social sciences, 07-093. Thousand Oaks, CA: Sage.
Hargens, Lowell L. 1976. A note on standardized coefficients as structural parameters. Sociological Methods and Research 5(2):247-256.
Hauser, Robert M. 1978. A structural model of the mobility table. Social Forces 56(3):919-953.
---. 1980. Some exploratory methods for modeling mobility tables and other cross-classified data. Sociological Methodology 11:413-458.
---. 1995. Better rules for better decisions. Sociological Methodology 25:175-183.
---, and Min-Hsiung Huang. 1997. Verbal ability and socioeconomic success: A trend analysis. Social Science Research 26(3):331-376.
---, Shu-Ling Tsai, and William H. Sewell. 1983. A model of stratification with response error in social and psychological variables. Sociology of Education 56(1):20-46.
Hausman, Jerry. 1978. Specification tests in econometrics. Econometrica 46(6):1251-1271.
---, and Daniel McFadden. 1984. Specification tests for the multinomial logit model. Econometrica 52(5):1219-1240.
Haynes, Stephen E., and David Jacobs. 1994. Macroeconomics, economic stratification, and partisanship: A longitudinal analysis of contingent shifts in political identification. American Journal of Sociology 100(1):70-103.
Heckman, James J. 1979. Sample selection bias as a specification error. Econometrica 47(1):153-161.
---. 1995. Review: Lessons from The Bell Curve. Journal of Political Economy 103(5):1091-1120.
Henderson, Gail, John Akin, Li Zhiming, Jin Shuigao, Ma Haijiang, and Ge Keyou. 1994. Equity and the utilization of health services: Report of an eight-province survey in China. Social Science and Medicine 39(5):687-699.
Herrnstein, Richard J., and Charles Murray. 1994. The bell curve: Intelligence and class structure in American life. New York: Free Press.
Hendrickx, John. 1999. dm73: Using categorical variables in Stata. In Stata technical bulletin reprints, vol. 9, ed. H. Joseph Newton, 51-59. College Station, TX: Stata Corporation.
---. 2000. dm73.1: Contrasts for categorical variables: Update. In Stata technical bulletin reprints, vol. 9, ed. H. Joseph Newton, 60-61. College Station, TX: Stata Corporation.
---. 2001a. dm73.2: Contrasts for categorical variables: Update. In Stata technical bulletin reprints, vol. 10, ed. H. Joseph Newton, 9-14. College Station, TX: Stata Corporation.
---. 2001b. dm73.3: Contrasts for categorical variables: Update. In Stata technical bulletin reprints, vol. 10, ed. H. Joseph Newton, 14-15. College Station, TX: Stata Corporation.
Hildebrand, David K., James D. Laing, and Howard Rosenthal. 1977. Analysis of ordinal data. Sage university papers series on quantitative applications in the social sciences, 07-008. Beverly Hills, CA: Sage.
Hoffman, Saul D., and Greg J. Duncan. 1988. Multinomial and conditional logit discrete-choice models in demography. Demography 25(3):415-427.
Hofmeyr, Julian F., and Robert E. Lucas. 2001. The rise in union wage premiums in South Africa. Labour 15(4):685-719.
Holland, Paul W. 1986. Statistics and causal inference (with comments by Rubin, Cox, Glymour, and Granger, and a rejoinder by Holland). Journal of the American Statistical Association 81(396):945-970.
Hosmer, David W., and Stanley Lemeshow. 2000. Applied logistic regression. 2nd ed. New York: Wiley.
Hotz, V. Joseph, Charles H. Mullin, and John Karl Scholz. 2005. Examining the effect of the earned income tax credit on the labor market participation of families on welfare. Population Working Paper, California Center for Population Research, University of California, Los Angeles.
---, and Mo Xiao. 2005. The impact of minimum quality standards on firm entry, exit and product quality: The case of the child care market. Population Working Paper PWP-CCPR-2005-063, California Center for Population Research, University of California, Los Angeles.
Hout, Michael. 1983. Mobility tables. Sage university papers series on quantitative applications in the social sciences, 07-031. Thousand Oaks, CA: Sage.
---. 1984. Status, autonomy, and training in occupational mobility. American Journal of Sociology 89(6):1379-1409.
---, and Robert M. Hauser. 1992. Symmetry and hierarchy in social mobility: A methodological analysis of the CASMIN model of class mobility. European Sociological Review 8(3):239-266.
Hsiao, Cheng. 2003. Analysis of panel data. 2nd ed. New York: Cambridge University Press.
International Labour Office. 1969. International standard classification of occupations: Revised edition 1968. Geneva: International Labour Office.
---. 1990. International standard classification of occupations (ISCO-88). Geneva: International Labour Office.
Jacobs, David, and Robert M. O'Brien. 1998. The determinants of deadly force: A structural analysis of police violence. American Journal of Sociology 103(4):837-862.
Jahoda, Marie, Paul F. Lazarsfeld, and Hans Zeisel. [In German, 1933] 1971. Marienthal: The sociography of an unemployed community. Trans. by the authors, with John Reginall and Thomas Elsaesser. Chicago: Aldine.
Jasso, Guillermina. 1985. Marital coital frequency and the passage of time: Estimating the separate effects of spouses' ages and marital duration, birth and marriage cohorts, and period influences. American Sociological Review 50(2):133-149.
---. 1986. Is it outlier deletion or is it sample truncation? Notes on science and sexuality. American Sociological Review 51(5):738-742.
Jencks, Christopher, Susan Bartlett, Mary Corcoran, James Crouse, David Eaglesfield, Gregory Jackson, Kent McClelland, Peter Mueser, Michael Olneck, Joseph Schwartz, Sherry Ward, and Jill Williams. 1979. Who gets ahead? The determinants of economic success in America. New York: Basic Books.
---, Marshall Smith, Henry Acland, Mary Jo Bane, David Cohen, Herbert Gintis, Barbara Heyns, and Stephan Michelson. 1972. Inequality: A reassessment of the effect of family and schooling in America. New York: Basic Books.
Johnson, Robert, and Reynolds Farley. 1985. On the statistical significance of the index of dissimilarity. Proceedings of the Social Statistics Section. Washington, DC: American Statistical Association.
Jones, F. Lancaster, and Jonathan Kelley. 1984. Decomposing differences between groups: A cautionary note on measuring discrimination. Sociological Methods and Research 12(3):323-343.
Jones, Michael P. 1996. Indicator and stratification methods for missing explanatory variables in multiple linear regression. Journal of the American Statistical Association 91(433):222-230.
Jöreskog, Karl G. 1970. A general method for analysis of covariance structures. Biometrika 57(2):239-251.
Judson, D. H. 1992. smv5: Performing loglinear analysis of cross-classifications. In Stata technical bulletin reprints, vol. 1, ed. Joseph Hilbe, 139-152. Santa Monica, CA: Computing Resource Center.
---. 1993. smv5.1: Loglinear analysis of cross-classifications, update. In Stata technical bulletin reprints, vol. 2, ed. Joseph Hilbe, 162-163. Santa Monica, CA: Computing Resource Center.
Kahn, Joan R., and J. Richard Udry. 1986. Marital coital frequency: Unnoticed outliers and unspecified interactions lead to erroneous conclusions. American Sociological Review 51(5):734-737.
Kaufman, Robert L. 1983. A structural decomposition of Black-White earnings differentials. American Journal of Sociology 89(3):585-611.
---, and Paul G. Schervish. 1986. Using adjusted crosstabulations to interpret log-linear relationships. American Sociological Review 51(5):717-733.
Keating, Nancy L., Jane C. Weeks, Mary Beth Landrum, Catherine Borbas, and Edward Guadagnoli. 2001. Discussion of treatment options for early-stage breast cancer: Effect of provider specialty on type of surgery and satisfaction. Medical Care 39(7):681-691.
Keeley, Michael, Philip Robins, Robert Spiegelman, and Richard West. 1978. The labor supply effects and costs of alternative negative income tax programs. Journal of Human Resources 13(1):3-36.
Kelly, Nathan J., and Jana Morgan Kelly. 2005. Religion and Latino partisanship in the United States. Political Research Quarterly 58(1):87-95.
Kim, Jae-On, and Charles W. Mueller. 1976. Standardized and unstandardized coefficients in causal analysis: An expository note. Sociological Methods and Research 4(4):423-438.
King, Gary. 1989. Unifying political methodology: The likelihood theory of statistical inference. New York: Cambridge.
Kish, Leslie. 1965. Survey sampling. New York: Wiley.
Kitagawa, Evelyn M. 1955. Components of a difference between two rates. Journal of the American Statistical Association 50(272):1168-1194.
---, and Philip M. Hauser. 1973. Differential mortality in the United States: A study in socioeconomic epidemiology. Cambridge, MA: Harvard University Press.
Knoke, David, and Peter J. Burke. 1980. Log-linear models. Sage university papers series on quantitative applications in the social sciences, 07-020. Beverly Hills, CA: Sage.
Kraus, Vered. 1986. Group differences: The issue of decomposition. Quality and Quantity 20(2-3):181-190.
Lassen, David Dreyer. 2005. The effect of information on voter turnout: Evidence from a natural experiment. American Journal of Political Science 49(1):103-118.
Lazarsfeld, Paul F. 1955. Interpretation of statistical relations as a research operation. In The language of social research: A reader in the methodology of social research, ed. Paul F. Lazarsfeld and Morris Rosenberg, 115-125. Glencoe: Free Press.
Lee, Valerie E., and Anthony S. Bryk. 1989. A multilevel model of the social distribution of high school achievement. Sociology of Education 62(3):172-192.
Lewis, Susan K., and Valerie K. Oppenheimer. 2000. Educational assortative mating across marriage markets: Non-Hispanic Whites in the United States. Demography 37(1):29-40.
Lichter, Daniel, Diane McLaughlin, and David Ribar. 2002. Economic restructuring and the retreat from marriage. Social Science Research 31(2):230-256.
Little, Roderick J. A. 1992. Regression with missing X's: A review. Journal of the American Statistical Association 87(420):1227-1238.
---, and Donald B. Rubin. 2002. Statistical analysis with missing data. 2nd ed. New York: John Wiley and Sons.
Long, J. Scott. 1990. The origins of sex differences in science. Social Forces 68(4):1297-1316.
---. 1997. Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
---, and Jeremy Freese. 2006. Regression models for categorical dependent variables using Stata. 2nd ed. College Station, TX: Stata Press.
Lu, Bo, Elaine Zanutto, Robert Hornik, and Paul R. Rosenbaum. 2001. Matching with doses in an observational study of a media campaign against drug abuse. Journal of the American Statistical Association 96(456):1245-1253.
Lu, Yao. 2005. Sibship size, family organization, and children's education in South Africa: Black-White variations. Population Working Paper PWP-CCPR-2005-045, California Center for Population Research, University of California, Los Angeles.
---, and Donald J. Treiman. 2007. The effect of labor migration and remittances on children's education among Blacks in South Africa. Population Working Paper PWP-CCPR-2007-001, California Center for Population Research, University of California, Los Angeles.
---, and Donald J. Treiman. 2008. Forthcoming. The effect of sibship size on educational attainment in China: Cohort variations. American Sociological Review 73(5).
Lundquist, Jennifer Hickes. 2004. When race makes no difference: Marriage and the military. Social Forces 83(2):731-757.
Manski, Charles F., Sarah S. McLanahan, Daniel Powers, and Gary D. Sandefur. 1992. Alternative estimates of the effects of family structure during adolescence on high school graduation. Journal of the American Statistical Association 87(417):25-37.
---, and David A. Wise. 1983. College choice in America. Cambridge, MA: Harvard University Press.
Maralani, Vida. 2004. Family size and educational attainment in Indonesia: A cohort perspective. Population Working Paper PWP-CCPR-2004-017, California Center for Population Research, University of California, Los Angeles.
Mare, Robert D. 1980. Social background and school continuation decisions. Journal of the American Statistical Association 75(370):295-305.
---. 1981. Change and stability in educational stratification. American Sociological Review 46(1):72-87.
---. 1991. Five decades of educational assortative mating. American Sociological Review 56(1):15-32.
---. 1995. Changes in educational attainment and school enrollment. In State of the union: America in the 1990s, vol. 1, Economic trends, ed. Reynolds Farley, 155-213. New York: Russell Sage.
---, and Meichu D. Chen. 1986. Further evidence on sibship size and educational stratification. American Sociological Review 51(3):403-412.
---, and Christopher Winship. 1984. The paradox of lessening racial inequality and joblessness among black youth: Enrollment, enlistment, and employment, 1964-1981. American Sociological Review 49(1):39-55.
---, and Christopher Winship. 1988. Endogenous switching regression models for the causes and effects of discrete variables. In Common problems/proper solutions: Avoiding error in quantitative research, ed. J. Scott Long, 132-160. Newbury Park, CA: Sage.
Marx, Gary T. 1967a. Religion: Opiate or inspiration of civil rights militancy among Negroes. American Sociological Review 32(1):64-72.
---. 1967b. Protest and prejudice: A study of belief in the Black community. New York: Harper and Row.
Mason, William M. 2001. Multilevel methods of statistical analysis. In International encyclopedia of the social and behavioral sciences, ed. Neil J. Smelser and Paul B. Baltes, 14,988-14,994. Amsterdam: Elsevier Science.
McFadden, Daniel. 1974. Conditional logit analysis of qualitative choice behavior. In Frontiers of econometrics, ed. Paul Zarembka, 105-142. New York: Academic Press.
McIver, John, and Edward G. Carmines. 1981. Unidimensional scaling. Sage university papers series on quantitative applications in the social sciences, 07-024. Beverly Hills, CA: Sage.
Miller, Grant. 2007. Women's suffrage, political responsiveness, and child survival in American history. Paper presented in the California Population Research workshop, California Center for Population Research, University of California, Los Angeles, 5 December.
Miller, Herman P. 1966. Income distribution in the United States. Washington, DC: U.S. Government Printing Office.
Miller, Jane E. 2004. The Chicago guide to writing about numbers. Chicago: University of Chicago Press.
---. 2005. The Chicago guide to writing about multivariate analysis. Chicago: University of Chicago Press.
Mincer, Jacob, and Solomon Polachek. 1974. Family investments in human capital: Earnings of women. Journal of Political Economy 82(suppl.):S76-S108.
Morgan, Stephen L. 2001. Counterfactuals, causal effect heterogeneity, and the Catholic school effect on learning. Sociology of Education 74(4):341-374.
---, and Christopher Winship. 2007. Counterfactuals and causal inference: Methods and principles for social research. Cambridge, MA: Cambridge University Press.
Müller, Walter, and Yossi Shavit. 1998. The institutional embeddedness of the stratification process: A comparative study of qualifications and occupations in thirteen countries. In From school to work: A comparative study of educational qualifications and occupational destinations, ed. Yossi Shavit and Walter Müller, 1-48. Oxford: Clarendon.
Murdock, George P., and Caterina Provost. 1973. Measurement of cultural complexity. Ethnology 12(4):379-392.
Myrdal, Gunnar. 1944. An American dilemma: The Negro problem and modern democracy. New York: Harper.
Nee, Victor. 1989. A theory of market transition: From redistribution to markets in state socialism. American Sociological Review 54(5):663-681.
---. 1996. The emergence of a market society: Changing mechanisms of stratification in China. American Journal of Sociology 101(4):908-949.
Netemeyer, Richard G., William O. Bearden, and Subhash Sharma. 2003. Scaling procedures: Issues and applications. Thousand Oaks, CA: Sage.
Nieuwbeerta, Paul, and Harry B. G. Ganzeboom. 1996. International social mobility and politics file: Documentation of an integrated dataset of 113 national surveys held in 16 countries, 1956-1991. Amsterdam: Steinmetz Archive.
Nobles, Jenna, and Elizabeth Frankenberg. 2006. Mothers' community participation and child health. Population Working Paper PWP-CCPR-2006-016,
*jf;.: t"'.*ryj:;ffi:*T:rlj,1i;rirJ,j:r"",
cairo-i" c*". r". i"p"iJ,ri"'i".i*Ji,
*,""
,r,
Los ", "alirornia, Methods, simuration experimenrs andpracrical exampres. rntemarional
Nunnally, Jum, and Ira H. Bernstein. 1994. Psychometric theory. 3rd ed. New York: McGraw-Hill.
Oaxaca, Ronald. 1973. Male-female wage differentials in urban labor markets. International Economic Review 14(3):693-709.
O'Muircheartaigh, Colm. 2003. There and back again: Demographic survey sampling in the 21st century. Keynote address, Federal Committee on Statistical Methodology, 2003.
Oster, Emily. 2005. Hepatitis B and the case of the missing women. Journal of Political Economy 113(6):1163-1216.
Panis, C. 1994. sg24: The piecewise linear spline transformation. In Stata technical bulletin reprints, vol. 3, ed. Sean Becketti, 146-149. College Station, TX: Stata Corp.
Park, Hyunjoon, and Jeroen Smits. 2005. Educational assortative mating in South Korea: Trends 1930-1998. Research in Social Stratification and Mobility 23:103-127.
Paul, Christopher, William M. Mason, Daniel McCaffrey, and Sarah A. Fox. 2008. Forthcoming. What should we do
usins losisti" *g."*1.";;;;ri"i" #fi'#il:t^h11:i*caseTstudv
"'H:i;,T;'iir1:t'
"ll'il"rn -,*,ate) staristicar
commentonpresenlins resultsfiom losir andprobitmodels.Amerjcan sociologicar Review ^
Peterson, Ruth D., and John Hagan. 1984. Changing conceptions of race: Towards an account of anomalous findings of sentencing research. American Sociological Review 49(1):56-70.
t***ay associationsh effectmodelsfor fie analysi"^:fOjtr-t':l:t"-1n Pisati.Maurizio 2001.sgl42: Uniform layer jo*n N"*ton' tod-tsr' colleg€slatronTX: statacolp' shta technicalbulletinreptint",uor' ro' "o ii *Uo timited dependentvariables Sociologicl Powers,Daniel A. lgg3 Encogenousswrwh"ing't"*Jt-*o
_
t":*
ffifir,I":L1t"f,l3ilriii.
ad hvpothesis rhecontact auirudes: racial an.tBlack ,.,""""ialconract
biassocialFo*:'10f 1)'ffi'rl;*,egoricat Press selectivitv cA: Acadernic sanDieSo' da* analvsis
*,*; '
Hf^"i#";1XS;itrilii,'T"[?;;;'"?'"-"i
i4 disab alEmative un'rer Estimates *'t activirv:
l8(lr:522-556' definitionr.Jo,rrnalof HumanResources in the labor market?Journald wofi Zo06 Does volunteet work pay off Frantois-Ctartes anal Lionel, Prouteau,
all::ii:f:';T;:'#]fi?31]o*** 63t6-21.
science Quajte+ risk andwives'laborsupplvbehaviorsocial
in homicide cases Law and SciIg85 Race an'l prosecutorial discretion Radelet, Michael L , and Glenn L Pierce
sociologicalReview51(1):145-146 Amedcan for cross-classincarions. moders .Jil:ff;[t;:if:]"1t:sing 25:I 11-163' Methodology researchSociological moaetsel"ctionin"soc-all Bavesian . 1995a. SociologicalMethodolog research' social iri it' 1995b.Rejoinder:Model *'**"' -. "t"*ta^ob indexessociolosicalMethodsandReseard R. 2000 samplingdisfiibtrtionsof sesresation .""1l,llt#3i;"t 28(4):454475 :. rr." r"ni,n revolotion.Americansociologll Rasler,KareD'1996.Concessions'represslon,andpoliticalprolestinihelranianfevoloti( linear models:Applications and data analli ,ttt'onv s nrvt 2002 Hierarchical ."*ilfJt:tJ:rit":li]ala metho'ls 2nd ed rhousand oaks' cA: s:c:ileges Russiaan'l ed'b!' of pastcornmunist Partymembershipin Akos, analAIya Guseva2001 Thr *;;J;ilil**;.ssion Rona-Tas, SocialScienceResearch30(4):fl 1-652 politicj Extending a comparativea@lysis of a*" f;.'iolit r_eaming 1992. w. Dennis Roncek, "o"ffi"i"n., "" ReviewST4r:l';::t;a Fansiuc f,o'"u 'm"a'* s"ciotosical behavior:Applvins tos-linearmodelsfor 2( Chick Garry and lr' M.. John Robens, 30(Ir:313-32-E socialScienceResearch beMeenof{icesin a Mexicanfesrivalsyslem -."a"i EconomeE *lo jolnt wage_hou$ detemination t"b".'r"porv Taxes'" 1976. Hafvey. Rosen, ^ frr of the prcpensitvscorein obsewationalstu'hes *o DonaldB Rubin 1983 The cental role *r#l'"];ll,t "l"oJ;.'
$ unive'"iq Deveropment' onHuman comnittee pape! .",i1"i,1""Ji!?;lilf,ilf,l3"!,]#"'i't""u*'otone'r n"rfl'ililoi"t. ^"*-.'toos"
valuessmtaJ,oumal of nussins imputation .^:)!'.':::t zoo+r,n"ltiple Joumal512l:t88-2oI'
Stata oa"t lor" imputationof missingvaluesitlpdate ice StataJoumal5(4J:527-536' of Update vJues: 2005b Multiple irnputationor mrssi'n! -. of i""' with an emPhasison interval censor4: i.poo,ioo of l1li".ing"t"illt' ruiirter opAut" Multiple 200?. -. StataJournal7(4):445-464' in survevs N::,I*:yl"o is;? Multiple imputationfor nonresponse n tln,i".ti simple random salld" f"t interval esd$atio^n^fr'om itptt"'i"" ". Schenker'1986 naJtiii" and Nathaniel (394):366-374' --"i -. 8l essociatio' trt"Lerican statisdcal itt^a il;#;;;p"""" "r in the duarlabor marker.Amedcanssip,"or"orr-".J""tnmenr sakamoro,Arrhur, andMeichu D. chen. ,rni.
t^i:'i;3?1ffi'""tont
andLaborRerarir lndustriar rheroleorPACS uot'nsonlaborissues: A multiler-* andviolentcrime: EarlslgeTNeishborhoods an'lFelton t*'T:TilgYl:T3-"]JI?; w Raudenbush' -32lf,llt# 277t riuayoi"orr".iu".ificacvScience r,",o,". o{3):44t. 500. itrtbe*' "r papers apprcations onquanrirative series university rage lll,iSi;f."$-iJ;rl'J"t;,"Tfi:t#,:rr".
,",Ji'ii]&;il
,*liL::'$'|:'#
Harr' and chapman London: dutu il:l,iJl"ifll;1""i#8,o'uttiu^'iut"
---. 1999. Multiple imputation: A primer. Statistical Methods in Medical Research 8(1):3-15.
Schwartz, Christine R., and Robert D. Mare. 2005. Trends in educational assortative marriage from 1940 to 2003. Demography 42(4):621-646.
Schenker, Nathaniel, Donald J. Treiman, and Lynn Weidman. 1993. Analysis of public use decennial census data with multiply imputed industry and occupation codes. Applied Statistics 42(3):545-556.
Sloan, John H., A. L. Kellermann, D. T. Reay, J. A. Ferris, T. Koepsell, F. P. Rivara, C. Rice, L. Gray, and J. LoGerfo. 1988. Handgun regulations, crime, assaults, and homicide: A tale of two cities. New England Journal of Medicine 319(19):1256-1262.
Smith, Herbert L. 1997. Matching with multiple controls to estimate treatment effects in observational studies. Sociological Methodology 27:325-352.
Smith, Patricia L. 1979. Splines as a useful and convenient statistical tool. American Statistician 33(2):57-62.
Smith, Tom W. 1979. Sex and the GSS: Nonresponse differences. GSS Methodological Report No. 9. Chicago: National Opinion Research Center.
---, and Bruce L. Peterson. 1986. Problems in form randomization on the General Social Surveys. GSS Methodological Report No. 36, July. Chicago: National Opinion Research Center.
Smock, Pamela J., Wendy D. Manning, and Sanjiv Gupta. 1999. The effect of marriage and divorce on women's economic well-being. American Sociological Review 64(6):794-812.
Snijders, T., and R. Bosker. 1999. Multilevel analysis. London: Sage Publications.
Sobel, Michael E., Michael Hout, and Otis Dudley Duncan. 1985. Exchange, structure, and symmetry in occupational mobility. American Journal of Sociology 91(2):359-372.
Soopramanien, Didier, and Geraint Johnes. 2001. A new look at gender effects in participation and occupation choice. Labour 15(3):415-443.
Sorokin, Pitirim A. [1927] 1959. Social and cultural mobility. Glencoe: Free Press.
Sousa-Poza, Alfonso. 2004. Is the Swiss labor market segmented? An analysis using alternative approaches. Labour 18(1):131-161.
South, Scott J., and Eric P. Baumer. 2001. Community effects on the resolution of adolescent premarital pregnancy. Journal of Family Issues 22(11):1025-1043.
StataCorp. 2007. Stata statistical software: Release 10. College Station, TX: Stata Press.
Steele, Claude. 1997. A threat in the air: How stereotypes shape intellectual identity and performance. American Psychologist 52(6):613-629.
Steiger, James H. 2001. Driving fast in reverse: The relationship between software development, theory, and education in structural equation modeling. Journal of the American Statistical Association 96(453):331-338.
Stephan, Frederick J., and Philip J. McCarthy. 1963. Sampling opinions: An analysis of survey procedure. New York: Wiley Science Editions.
Stephan, Paula E., and Sharon G. Levin. 1992. Striking the mother lode in science: The importance of age, place, and time. New York: Oxford University Press.
Stephenson, C. Bruce. 1979. Probability sampling with quotas: An experiment. Public Opinion Quarterly 43(4):477-496.
Stine, Robert. 1990. An introduction to bootstrap methods: Examples and ideas. In Modern methods of data analysis, ed. John Fox and J. Scott Long, 325-373. Newbury Park, CA: Sage.
Stolzenberg, Ross. 1974. Estimating an equation with multiplicative and additive terms. Sociological Methods and Research 2(3):313-331.
---. 1975. Education, occupation, and wage differences between White and Black men. American Journal of Sociology 81(2):299-323.
---, and Daniel A. Relles. 1997. Tools for intuition about sample selection bias and its correction. American Sociological Review 62(3):494-507.
Stone, Roslyn A., D. Scott Obrosky, Daniel E. Singer, Wishwa N. Kapoor, Michael J. Fine, and Pneumonia Patient Outcomes Research Team Investigators. 1995. Propensity score adjustment for pretreatment differences between hospitalized and ambulatory patients with community-acquired pneumonia. Proceedings of the Conference on Measuring the Effects of Medical Treatment, April. Medical Care 33(4):AS56-AS66.
Stouffer, Samuel A. 1949. The American soldier. 2 vols. Vol. 1, Adjustment during army life. Vol. 2, Combat and its aftermath. Princeton, NJ: Princeton University Press.
---. 1955. Communism, conformity, and civil liberties: A cross-section of the nation speaks its mind. Garden City, NY: Doubleday.
---. 1962. Social research to test ideas: Selected writings. New York: Free Press of Glencoe.
Sudman,Seymour.1976.Appliedsampting. NewyorktAcademicpress. sweeney'MeganM 2002 T\ro decadesof famfly change:The shifting economic foundatronsor marriage.AmerrSociological Review670):132 147. szel'nyi, Ivdn, and Donard J. Treiman. 1994-social stratificarionin Eastem Euope afrer l9g9 (computerdB. Principal investigators,rviin szer6nyiandDonardJ. Trciman.producedby a consortrumof researchgoups rn ri nationsinvolved.SocialScienceDataArchive, University of Califomia, Los Angelesldistributorl. Tavits, Margit. 2005. the developmentof stable paty support: Electoral aynu,oi"" lo post_communistE,.€EE AmericanJoumatof potiricat Science49(2):283_298. Tholas, FraDkenberg, JedFriedman,Jean_piereHabicht,NathanJones,ChristopherMcKeha P:n:an,-Elizabeth Gretel Pelto, Bondan Sikoki, Jamesp Smith, CecepSumantri, a.ra Wayan Suriastini. 2004. Causaleffe!-I :r h€alih on labor marketoutcomes:Evidencefrom a randomassignmentiron supprementatron intervention,pot* lation WorkingPaperpwp-CCpR 2004-022,Califomia Centerior population ni."r*f,, u"i"" if iJ6o[ LosAngeles. "i Tienda' Marta, and FmDrdinF wilson. 1992.Migration and the eamings of Hispanic men.Amencan sociologi.r Review57(5):661-678. Tobin,James.1958.Estimationof relationshipsfor limited dependenrvariables. Econometrica26(l ):24_36. Tomaskovic-Devey,and Sheryl Skaggs.1999.Degenderedj;bs? Organizational processesmrctgenctersegregtl* employment. Research in SocialStarificarionandMobiliry t7:13t_172. Trelnan, J. 1970. Reply ro Ceschwenderon ..Statusdiscrepancyand prejudice.,, _Don19 American Joumai r Sociology76(1):162_168. I 977. Occupationalprestigein compamtivepe$pective.Newyork: Academicpress. -. 1994.Social stmtification in EastemEuropeafter 19g9:Generalpopulation -. survey.provisional code[aa (revised),8 DecemberI 994. SociatScjenceDataArchive, University oi Catifomia, I,os Angeles. ed l99S Life historiesand so€ial changeh contemporaryChina: provisional -' codebook.social sci.o.c DataArchive, University of Califomia, Los Anseles. 
2007a.crowth and determinantsof litemcy ln Ctr;a. t eaucaton and -. reform in china, ed. Emily Hannru andAlbertPark,135_153. Oxford:Routledse. 2007b.The legacyof apartheid:racial inequalidesin lhe newsouti Atiica.In Unequalchances:Ethnicmincr.hesin westemlabourmarkets.ed.AnrhonyHeathandsin yi cheung,401-447. oxford: oxlbrd university pr.i.winiam Bielby' and Man-Tsun cheng 1988.Evaluaringa irultiple-imputation -' method for rccaribratitrs 1970-U-.S, -* Censusderailedindusrrycodesro rhe 1980standard.Socioloiicat Vi",f,"O"f"gy rS:di ioil'" William Cumbertand,Xilai Shi, ZhongdongM4 and ShaolingZhu. 1998. -, A sampledesignfor rhe ChiF:€ life history survey.ln Life historiesand socialchaagein contempoiary china: provisional codebook,ed. Dotr!{ J. Treiman,Appendix D. 1.a.Social ScienceDataAichive, Universityof CaLfomia,Los Angeles. andHeidi I. Hartrnann,eds. 19gl. Women,work, and wages:Equalpay for jobs -, oi equalvalue.Washingtc
---, and Kermit Terrell. 1975. The process of status attainment in the United States and Great Britain. American Journal of Sociology 81(3):563-583.
---, Andrew G. Walder, and Qiang Li. 2006. Life histories and social change in contemporary China [computer file]. Principal investigator, Donald J. Treiman; coprincipal investigators, Andrew G. Walder and Qiang Li. Department of Sociology, People's University, Beijing [producer]; Social Science Data Archive, University of California, Los Angeles [distributor].
---, and Kazuo Yamaguchi. 1993. Trends in educational attainment in Japan. In Persistent inequality: Changing educational attainment in thirteen countries, ed. Yossi Shavit and Hans-Peter Blossfeld, 229-250. Boulder: Westview Press.
---, and Kam-Bor Yip. 1989. Educational and occupational attainment in 21 countries. In Cross-national research in sociology, ed. Melvin L. Kohn (ASA Presidential Series), 373-394. Beverly Hills, CA: Sage.
Treno, Andrew J., Maria L. Alaniz, and Paul J. Gruenewald. 2000. The use of drinking places by gender, age and ethnic groups: An analysis of routine drinking activities. Addiction 95(4):537-551.
Tukey, John W. 1977. Exploratory data analysis. Reading, MA: Addison-Wesley.
Upton, Graham J. G. 1978. The analysis of cross-tabulated data. New York: Wiley.
Van Buuren, S., H. C. Boshuizen, and D. L. Knook. 1999. Multiple imputation of missing blood pressure covariates in survival analysis. Statistics in Medicine 18(6):681-694.
Vermunt, Jeroen K. 1997. LEM: A general program for the analysis of categorical data. Tilburg, Netherlands: Tilburg University.
Walder, Andrew G. 1996. Markets and inequality in transitional economies: Toward testable theories. American Journal of Sociology 101(4):1060-1073.
Walton, John, and Charles Ragin. 1990. Global and national sources of political protest: Third world responses to the debt crisis. American Sociological Review 55(6):876-890.
Weakliem, David L. 1999. A critique of the Bayesian Information Criterion for model selection. Sociological Methods and Research 27(2):359-397.
Weitoft, Gunilla Ringbäck, Anders Hjern, Ilija Batljan, and Bo Vinnerljung.
2008. Health and social outcomes among children in low-income families and families receiving social assistance: A Swedish national cohort study. Social Science and Medicine 66(1):14-30.
White, Michael J., and Zai Liang. 1998. The effect of immigration on the internal migration of the native-born population, 1981-1990. Population Research and Policy Review 17(2):141-166.
Williams, Richard. 2006. Generalized ordered logit/partial proportional odds models for ordinal dependent variables. Stata Journal 6(1):58-82.
Willis, Robert J., and Sherwin Rosen. 1979. Education and self-selection. Journal of Political Economy 87(suppl.):S507-S536.
Winsborough, H. H., and Peter Dickinson. 1971. Components of Negro-White income differences. In Proceedings of the American Statistical Association, Social Statistics Section, ed. Edwin G. Goldfield, 6-8. Washington, DC: American Statistical Association.
Winship, Christopher. 1999a. Editor's introduction to the special issue on the Bayesian Information Criterion. Sociological Methods and Research 27(3):355-358.
---, ed. 1999b. Special issue on the Bayesian Information Criterion. Sociological Methods and Research 27(3):355-443.
---, and Robert D. Mare. 1984. Regression models with ordinal variables. American Sociological Review 49(4):512-525.
---, and Robert D. Mare. 1992. Models for sample selection bias. Annual Review of Sociology 18:327-350.
---, and Stephen L. Morgan. 1999. The estimation of causal effects from observational data. Annual Review of Sociology 25:659-706.
Witte, A. 1980. Estimating an economic model of crime with individual data. Quarterly Journal of Economics 94(1):57-84.
Wood, Charles H., and Peggy A. Lovell. 1992. Racial inequality and child mortality in Brazil. Social Forces 70(3):703-724.
Wooldridge, Jeffrey M. 2002. Econometric analysis of cross section and panel data. Cambridge, MA: MIT Press.
---. 2006. Introductory econometrics: A modern approach. 3rd ed. Mason, OH: Thomson South-Western.
Wright, Erik Olin. 1985. Classes. London: New Left.
---, Cynthia Costello, David Hachen, and Joey Sprague. 1982. The American class structure. American Sociological Review 47(6):709-726.
---, and Bill Martin. 1987. The transformation of the American class structure, 1960-1980. American Journal of Sociology 93(1):1-29.
Wright, Sewall. 1918. On the nature of size factors. Genetics 3(4):367-374.
Wu, Xiaogang, and Donald J. Treiman. 2004. The household registration system and social stratification in China: 1955-1996. Demography 41(2):363-384.
---. 2007. Inequality and equality under Chinese socialism: The hukou system and intergenerational occupational mobility. American Journal of Sociology 113(2):415-445.
Xie, Yu. 1992. The log-multiplicative layer effect model for comparing mobility tables. American Sociological Review 57(3):380-395.
Yamaguchi, Kazuo. 1987. Models for comparing mobility tables: Toward parsimony and substance. American Sociological Review 52(4):482-494.
[...] Evidence from a multi-wave panel analysis [...]
Zatz, Marjorie S., and John Hagan. 1985. Crime, time, and punishment: An exploration of selection bias in sentencing research. Journal of Quantitative Criminology 1(1):103-126.
Zeisel, Hans. 1985. Say it with figures. 6th ed. New York: Harper and Row.
Zhang, Junsen, and Saul D. Hoffman. 1993. Discrete choice logit models: Testing the IIA property. Sociological Methods and Research 22(2):193-213.
Zuckerman, Harriet. 1977. Scientific elite: Nobel laureates in the United States. New York: Free Press.