Praise for Larry Hatcher

“The writing is exceptionally clear and easy to follow, and precise definitions are provided to avoid confusion. Examples are used to illustrate each concept, and those examples are, like everything in this book, clear and logically presented. Sample SAS output is provided for every analysis, with each part labeled and thoroughly explained so the reader understands the results.”
Sheri Bauman, Ph.D., Assistant Professor, Department of Educational Psychology, University of Arizona, Tucson

“[Larry Hatcher] once again manages to provide clear, concise, and detailed explanations of the SAS program and procedures, including appropriate examples and sample write-ups.”
Frank Pajares, Winship Distinguished Research Professor, Emory University

“The Student Guide and the Exercises books are excellent choices for use in quantitative courses in psychology and education.”
Bert W. Westbrook, Ph.D., Professor of Psychology, Alumni Distinguished Undergraduate Professor, North Carolina State University
Step-by-Step Basic Statistics Using SAS®: Student Guide

Larry Hatcher, Ph.D.
The correct bibliographic citation for this manual is as follows: Hatcher, Larry. 2003. Step-by-Step Basic Statistics Using SAS®: Student Guide. Cary, NC: SAS Institute Inc.
Step-by-Step Basic Statistics Using SAS®: Student Guide

Copyright © 2003 by SAS Institute Inc., Cary, NC, USA

ISBN 1-59047-148-2

All rights reserved. Printed in the United States of America. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

U.S. Government Restricted Rights Notice: Use, duplication, or disclosure of this software and related documentation by the U.S. government is subject to the Agreement with SAS Institute and the restrictions set forth in FAR 52.227-19, Commercial Computer Software-Restricted Rights (June 1987). SAS Institute Inc., SAS Campus Drive, Cary, North Carolina 27513.

1st printing, April 2003

SAS Publishing provides a complete selection of books and electronic products to help customers use SAS software to its fullest potential. For more information about our e-books, e-learning products, CDs, and hardcopy books, visit the SAS Publishing Web site at support.sas.com/pubs or call 1-800-727-3228.
SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.
Dedication
To my friends at Saginaw Valley State University.
Contents

Acknowledgments .............................................................................................ix

Chapter 1: Using This Student Guide ..............................................................1
Introduction ........................................................................................................... 3 Introduction to the SAS System ............................................................................ 4 Contents of This Student Guide ............................................................................ 6 Conclusion .......................................................................................................... 11
Chapter 2: Terms and Concepts Used in This Guide ..................................13 Introduction ......................................................................................................................... 15 Research Hypotheses and Statistical Hypotheses ............................................................. 16 Data, Variables, Values, and Observations ........................................................................ 21 Classifying Variables According to Their Scales of Measurement...................................... 24 Classifying Variables According to the Number of Values They Display ............................ 27 Basic Approaches to Research........................................................................................... 29 Using Type-of-Variable Figures to Represent Dependent and Independent Variables ..................................................................................................... 32 The Three Types of SAS Files ............................................................................................ 37 Conclusion .......................................................................................................................... 45
Chapter 3: Tutorial: Writing and Submitting SAS Programs .......................47 Introduction ......................................................................................................................... 48 Tutorial Part I: Basics of Using the SAS Windowing Environment..................................... 50 Tutorial Part II: Opening and Editing an Existing SAS Program ......................................... 75 Tutorial Part III: Submitting a Program with an Error ......................................................... 94 Tutorial Part IV: Practicing What You Have Learned ....................................................... 102 Summary of Steps for Frequently Performed Activities .................................................... 105 Controlling the Size of the Output Page with the OPTIONS Statement............................ 109 For More Information......................................................................................................... 110 Conclusion ........................................................................................................................ 110
Chapter 4: Data Input .....................................................................................111 Introduction ....................................................................................................................... 113 Example 4.1: Creating a Simple SAS Data Set ............................................................... 117 Example 4.2: A More Complex Data Set ......................................................................... 122 Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set........................................................................................................... 131 Using PROC PRINT to Create a Printout of Raw Data..................................................... 139 The Complete SAS Program............................................................................................. 142 Conclusion ........................................................................................................................ 144
Chapter 5: Creating Frequency Tables ........................................................145 Introduction ....................................................................................................................... 146 Example 5.1: A Political Donation Study.......................................................................... 147 Using PROC FREQ to Create a Frequency Table............................................................ 152
Examples of Questions That Can Be Answered by Interpreting a Frequency Table ........................................................................................................ 155 Conclusion ........................................................................................................ 157
Chapter 6: Creating Graphs ..........................................................................159 Introduction ....................................................................................................................... 160 Reprise of Example 5.1: the Political Donation Study....................................................... 161 Using PROC CHART to Create a Frequency Bar Chart ................................................... 162 Using PROC CHART to Plot Means for Subgroups.......................................................... 174 Conclusion ........................................................................................................................ 177
Chapter 7: Measures of Central Tendency and Variability ........................179 Introduction ....................................................................................................... 181 Reprise of Example 5.1: The Political Donation Study...................................... 181 Measures of Central Tendency: The Mode, Median, and Mean ...................... 183 Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE ................ 187 Using PROC UNIVARIATE to Determine the Shape of Distributions ............... 190 Simple Measures of Variability: The Range, the Interquartile Range, and the Semi-Interquartile Range ................................................................................. 200 More Complex Measures of Variability: The Variance and Standard Deviation........................................................................................................ 204 Variance and Standard Deviation: Three Formulas ......................................... 207 Using PROC MEANS to Compute the Variance and Standard Deviation ........ 210 Conclusion ........................................................................................................ 214
Chapter 8: Creating and Modifying Variables and Data Sets ....................215 Introduction ....................................................................................................................... 217 Example 8.1: An Achievement Motivation Study ............................................................. 218 Using PROC PRINT to Create a Printout of Raw Data..................................................... 222 Where to Place Data Manipulation and Data Subsetting Statements............................... 225 Basic Data Manipulation ................................................................................................... 228 Recoding a Reversed Item and Creating a New Variable for the Achievement Motivation Study...................................................................................... 235 Using IF-THEN Control Statements .................................................................................. 239 Data Subsetting................................................................................................................. 248 Combining a Large Number of Data Manipulation and Data Subsetting Statements in a Single Program......................................................... 256 Conclusion ........................................................................................................................ 260
Chapter 9: z Scores........................................................................................261 Introduction ....................................................................................................................... 262 Example 9.1: Comparing Mid-Term Test Scores for Two Courses................................. 266 Converting a Single Raw-Score Variable into a z-Score Variable .................................... 268 Converting Two Raw-Score Variables into z-Score Variables .......................................... 278 Standardizing Variables with PROC STANDARD............................................................. 285 Conclusion ........................................................................................................................ 286
Chapter 10: Bivariate Correlation .................................................................287 Introduction ....................................................................................................................... 290 Situations Appropriate for the Pearson Correlation Coefficient......................................... 290 Interpreting the Sign and Size of a Correlation Coefficient ............................................... 293 Interpreting the Statistical Significance of a Correlation Coefficient ................................. 297 Problems with Using Correlations to Investigate Causal Relationships............................ 299 Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables................. 303 Using PROC PLOT to Create a Scattergram.................................................................... 307 Using PROC CORR to Compute the Pearson Correlation between Two Variables................................................................................................. 313 Using PROC CORR to Compute All Possible Correlations for a Group of Variables ................................................................................................ 320 Summarizing Results Involving a Nonsignificant Correlation............................................ 324 Using the VAR and WITH Statements to Suppress the Printing of Some Correlations ........................................................................................................ 329 Computing the Spearman Rank-Order Correlation Coefficient for Ordinal-Level Variables................................................................................................. 332 Some Options Available with PROC CORR ..................................................................... 333 Problems with Seeking Significant Results ....................................................................... 335 Conclusion ........................................................................................................................ 338
Chapter 11: Bivariate Regression.................................................................339 Introduction ....................................................................................................................... 341 Choosing between the Terms Predictor Variable, Criterion Variable, Independent Variable, and Dependent Variable ............................................................... 341 Situations Appropriate for Bivariate Linear Regression .................................................... 344 Example 11.1: Predicting Weight Loss from a Variety of Predictor Variables.................. 346 Using PROC REG: Example with a Significant Positive Regression Coefficient .................................................................................................. 350 Using PROC REG: Example with a Significant Negative Regression Coefficient ........... 371 Using PROC REG: Example with a Nonsignificant Regression Coefficient..................... 379 Conclusion ........................................................................................................................ 383
Chapter 12: Single-Sample t Test .................................................................385 Introduction ....................................................................................................................... 387 Situations Appropriate for the Single-Sample t Test ......................................................... 387 Results Produced in a Single-Sample t Test..................................................................... 388 Example 12.1: Assessing Spatial Recall in a Reading Comprehension Task (Significant Results) ............................................................................................. 393 One-Tailed Tests versus Two-Tailed Tests ...................................................................... 406 Example 12.2: An Illustration of Nonsignificant Results................................................... 407 Conclusion ........................................................................................................................ 412
Chapter 13: Independent-Samples t Test ....................................................413 Introduction ....................................................................................................................... 415 Situations Appropriate for the Independent-Samples t Test ............................................. 417 Results Produced in an Independent-Samples t Test....................................................... 420
Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent Subject Aggression (Significant Differences)........................... 428 Example 13.2: An Illustration of Results Showing Nonsignificant Differences................. 446 Conclusion ........................................................................................................ 450
Chapter 14: Paired-Samples t Test...............................................................451 Introduction ....................................................................................................................... 453 Situations Appropriate for the Paired-Samples t Test ....................................................... 453 Similarities between the Paired-Samples t Test and the Single-Sample t Test ................ 457 Results Produced in a Paired-Samples t Test .................................................................. 461 Example 14.1: Women’s Responses to Emotional versus Sexual Infidelity .................... 463 Example 14.2: An Illustration of Results Showing Nonsignificant Differences................. 483 Conclusion ........................................................................................................................ 487
Chapter 15: One-Way ANOVA with One Between-Subjects Factor ..........489 Introduction ....................................................................................................... 491 Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor ........... 491 A Study Investigating Aggression ..................................................................... 494 Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size .......... 497 Some Possible Results from a One-Way ANOVA ............................................ 500 Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect ................... 505 Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect ............. 529 Conclusion ........................................................................................................ 537
Chapter 16: Factorial ANOVA with Two Between-Subjects Factors.........539 Introduction ....................................................................................................................... 542 Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors ........... 542 Using Factorial Designs in Research ................................................................................ 546 A Different Study Investigating Aggression....................................................................... 546 Understanding Figures That Illustrate the Results of a Factorial ANOVA......................... 550 Some Possible Results from a Factorial ANOVA.............................................................. 553 Example of a Factorial ANOVA Revealing Two Significant Main Effects and a Nonsignificant Interaction.................................................................................... 565 Example of a Factorial ANOVA Revealing Nonsignificant Main Effects and a Nonsignificant Interaction.................................................................................... 607 Example of a Factorial ANOVA Revealing a Significant Interaction ................................. 617 Using the LSMEANS Statement to Analyze Data from Unbalanced Designs................... 625 Learning More about Using SAS for Factorial ANOVA ..................................................... 627 Conclusion ........................................................................................................................ 628
Chapter 17: Chi-Square Test of Independence ............................................629 Introduction ....................................................................................................................... 631 Situations That Are Appropriate for the Chi-Square Test of Independence...................... 631 Using Two-Way Classification Tables............................................................................... 634 Results Produced in a Chi-Square Test of Independence ................................................ 637 A Study Investigating Computer Preferences ................................................................... 640 Computing Chi-Square from Raw Data versus Tabular Data ........................................... 642
Example of a Chi-Square Test That Reveals a Significant Relationship .......................... 643 Example of a Chi-Square Test That Reveals a Nonsignificant Relationship .................... 661 Computing Chi-Square from Raw Data............................................................................. 668 Conclusion ........................................................................................................................ 671
References .......................................................................................................673 Index..................................................................................................................675
Acknowledgments
During the development of these books, Caroline Brickley, Gretchen Rorie Harwood, Stephenie Joyner, Sue Kocher, Patsy Poole, and Hanna Schoenrock served as editors. All were positive, supportive, and helpful. They made the books stronger, and I thank them for their guidance. A number of other people at SAS made valuable contributions in a variety of areas. My sincere thanks go to those who reviewed the books for technical accuracy and readability: Jim Ashton, Jim Ford, Marty Hultgren, Catherine Lumsden, Elizabeth Maldonado, Paul Marovich, Ted Meleky, Annette Sanders, Kevin Scott, Ron Statt, and Morris Vaughan. I also thank Candy Farrell and Karen Perkins for production and design; Joan Stout for indexing; Cindy Puryear and Patricia Spain for marketing; and Cate Parrish for the cover designs. Special thanks to my wife Ellen, who was loving and supportive throughout.
Chapter 1: Using This Student Guide

Introduction............................................................................................ 3 Overview...................................................................................................................3 Intended Audience and Level of Proficiency .............................................................3 Platform and Version ................................................................................................3 Materials Needed......................................................................................................4 Introduction to the SAS System ............................................................ 4 Why Do You Need This Student Guide?...................................................................4 What Is the SAS System?.........................................................................................5 Who Uses SAS? .......................................................................................................5 Using the SAS System for Statistical Analyses.........................................................5 Contents of This Student Guide............................................................. 6 Overview...................................................................................................................6 Chapter 2: Terms and Concepts Used in This Guide...............................................7 Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs ..........................................................................................7 Chapter 4: Data Input...............................................................................................7 Chapter 5: Creating Frequency Tables ....................................................................7 Chapter 6: Creating Graphs.....................................................................................8 Chapter 7: Measures of Central Tendency and Variability.......................................8 Chapter 8: Creating and Modifying Variables and Data Sets...................................8 Chapter 9: Standardized Scores (z Scores).............................................................8 Chapter 10: Bivariate Correlation.............................................................................9 Chapter 11: Bivariate Regression ............................................................................9 Chapter 12: Single-Sample t Test ............................................................................9
Chapter 13: Independent-Samples t Test ................................................................9 Chapter 14: Paired-Samples t Test..........................................................................9 Chapter 15: One-Way ANOVA with One Between-Subjects Factor.......................10 Chapter 16: Factorial ANOVA with Two Between-Subjects Factors ......................10 Chapter 17: Chi-Square Test of Independence .....................................................10 References .............................................................................................................10 Conclusion.............................................................................................11
Introduction

Overview

This chapter introduces you to the SAS System, a computer application that can be used to perform statistical analyses. It explains just what SAS is, where it is installed, and describes some of the advantages associated with using SAS for data analysis. Finally, it briefly summarizes what you will learn in each of the chapters that comprise this Student Guide.

Intended Audience and Level of Proficiency

This guide is intended for those who want to learn how to use SAS to perform elementary statistical analyses. The guide assumes that many students using it have not already taken a course on elementary statistics. To assist these students, this guide briefly reviews basic terms and concepts in statistics at an elementary level. It was designed to be easily understood by first- and second-year college students. This book was also designed to be user-friendly to those who may have little or no experience with personal computers. The beginning of Chapter 3, “Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs,” reviews basic concepts in using Microsoft Windows, such as selecting menus, double-clicking icons, and so forth. Those who already have experience in using Windows will be able to quickly skim through this elementary material.

Platform and Version

This guide shows how to use the SAS System for Windows, as opposed to other operating environments. This is most apparent in Chapter 3, “Using the SAS Windowing Environment to Write and Submit SAS Programs.” However, the remaining chapters show how to write SAS code to perform statistical analyses, and most of this material will be useful to all SAS users, regardless of the operating environment. This is because, for the most part, the same SAS code can be used on a wide variety of operating environments to obtain the same results. This book was designed for those using the SAS System Version 8 and later versions. It may also be helpful to those using earlier versions of SAS (such as V6 or V7). However, if you are using one of these earlier versions, it is likely that some of the SAS system options described here are not available with your version. It is also likely that some of the SAS output that you obtain will be arranged differently from the output that is presented here.
Materials Needed

To complete the activities described in this book, you will need

• access to a personal computer on which the SAS System for Windows has been installed,

• one (and preferably two) 3.5-inch disks, formatted for IBM PCs (or some other type of storage media).
Some students using this book will also use its companion volume, Step-by-Step Basic Statistics Using SAS: Exercises. The chapters in the Exercises book parallel most of the chapters contained in this Student Guide. Each chapter in the Exercises book contains two assignments for students to complete. Complete solutions are provided for the odd-numbered exercises, but not for the even-numbered ones. The Exercises book can give you useful practice in learning how to use SAS, but it is not absolutely required.
Introduction to the SAS System

Why Do You Need This Student Guide?

This Student Guide shows you how to use a computer application called the SAS System to perform elementary statistical analyses. Until recently, students in elementary statistics courses typically performed statistical computations by hand or with a pocket calculator. In recent years, however, the increased availability of computers has made it possible for students to also use statistical software packages such as SPSS and the SAS System to perform these analyses. This latter approach allows students to focus more on conceptual issues in statistics, and spend less time on the mechanics of performing mathematical operations by hand. Step by step, this Student Guide will introduce you to the SAS System, and will show you how to use it to perform a variety of statistical analyses that are commonly used in the social and behavioral sciences and in education.
What Is the SAS System?

The SAS System is a modular, integrated, and hardware-independent application. It is used as an information delivery system by business organizations, governments, and universities worldwide. SAS is used for virtually every aspect of information management in organizations, including decision support, project management, financial analysis, quality improvement, data warehousing, report writing, and presentations. However, this guide will focus on just one aspect of SAS: its ability to perform the types of statistical analyses that are appropriate for research in the social sciences and education. By the time you have completed this text, you will have accomplished two objectives: you will have learned how to perform elementary statistical analyses using SAS, and you will have become familiar with a widely used information delivery system.

Who Uses SAS?

The SAS System is widely used in business organizations and universities. Consider the following statistics from July 2002:

• SAS supports over 40 operating environments, including Windows, OS/2, and UNIX.

• SAS Institute’s computer software products are installed at over 38,400 sites in 115 countries.

• Approximately 71% of SAS installations are in business locations, 18% are education sites, and 11% are government sites. It is used for teaching and research at about 3,000 university locations.

• It is estimated that SAS software products are used by more than 3.5 million people worldwide.

• 90% of all Fortune 500 companies are SAS clients.
Using the SAS System for Statistical Analyses

SAS is a particularly powerful tool for social scientists and educators because it allows them to easily perform virtually any type of statistical analysis that may be required in their research. SAS is comprehensive enough to perform the most sophisticated multivariate analyses, but is so easy to use that undergraduates can perform simple analyses after only a short period of instruction. In a sense, the SAS System may be viewed as a library of prewritten statistical algorithms. By submitting a brief SAS program, you can access a procedure from the library and use it to analyze a set of data. For example, below are the SAS statements used to call up the algorithm that calculates Pearson correlation coefficients:

   PROC CORR DATA=D1;
   RUN;
The preceding statements will cause SAS to compute the Pearson correlation between every possible pair of numeric variables in your data set. Being able to call up complex procedures with such a simple statement is what makes SAS so powerful. By contrast, if you had to prepare your own programs to compute Pearson correlations by using a programming language such as FORTRAN or BASIC, it would require many statements, and there would be many opportunities for error. By using SAS instead, most of the work has already been completed, and you are able to focus on the results of the analysis rather than on the mechanics of obtaining those results.
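To see these two statements in the context of a complete program, here is a minimal sketch of the kind of program developed in Chapter 4, “Data Input,” and Chapter 10, “Bivariate Correlation.” The data set name D1 matches the statements above, but the variable names (TEST1, TEST2, TEST3) and the data lines are placeholders invented for this illustration:

   DATA D1;
      INPUT TEST1 TEST2 TEST3;   * list-style input: values separated by blanks;
   DATALINES;
   80 82 91
   75 78 80
   92 88 95
   ;
   PROC CORR DATA=D1;   * requests Pearson correlations for all numeric variables;
   RUN;

Submitting this program first creates the data set in the DATA step, then computes the Pearson correlation (and its p value) for each pair of the three variables.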
Contents of This Student Guide

Overview

This guide has two objectives: to teach the basics of using SAS in general and, more specifically, to show how to use SAS procedures to perform elementary statistical analyses. Chapters 1–4 provide an overview of the basics of using SAS. The remaining chapters cover statistical concepts in a sequence that is representative of the sequence followed in most elementary statistics textbooks. Chapters 10–17 introduce you to inferential statistical procedures (the type of procedures that are most often used to analyze data from research). Each chapter shows you how to conduct the analysis from beginning to end. Each chapter also provides an example of how the analysis might be summarized for publication in an academic journal in the social sciences or education. For the most part, these summaries are written according to the guidelines provided in the Publication Manual of the American Psychological Association (1994). Many students using this book will also use its companion volume, Step-by-Step Basic Statistics Using SAS: Exercises. For Chapters 3–17 in this student guide, the corresponding chapter in the exercise book provides you with a hands-on exercise that enables you to practice the data analysis skills that you are learning. The following sections provide a summary of the contents of the remaining chapters in this guide.
Chapter 2: Terms and Concepts Used in This Guide

Chapter 2 defines some important terms related to research and statistics that will be used throughout this guide. It also introduces you to the three types of files that you will work with during a typical session with SAS: the SAS program, the SAS log, and the SAS output file.

Chapter 3: Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs

The SAS windowing environment is a powerful application that you will use to create, edit, and submit SAS programs. You will also use it to review your SAS logs and output. Chapter 3 provides a tutorial that teaches you how to use this application. Step by step, it shows you how to write simple SAS programs and interpret their results. By the end of this chapter, you should be ready to use the SAS windowing environment to write and submit SAS programs on your own.

Chapter 4: Data Input

Chapter 4 shows you how to use the DATA and INPUT statements to create SAS data sets. You will learn how to read both numeric and character variables by using a simple, list style for data input. By the end of the chapter, you will be prepared to input the data sets that will be presented throughout the remainder of this guide.

Chapter 5: Creating Frequency Tables

Chapter 5 shows you how to create frequency tables that are useful for understanding your data and answering some types of research questions. For example, imagine that you ask a sample of 150 people to tell you their age. If you then used SAS to create a frequency table for this age variable, you would be able to easily answer questions such as these (a sketch of the SAS statements involved appears after this list):

• How many people are age 30?

• How many people are age 30 or younger?

• What percent of people are age 45?

• What percent of people are age 45 or younger?
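As a preview of how such a table is requested, here is a minimal sketch using the FREQ procedure; the data set name D1 and the variable name AGE are placeholders for this illustration:

   PROC FREQ DATA=D1;
      TABLES AGE;   * one-way frequency table for the variable AGE;
   RUN;

The resulting table lists each observed age along with its frequency, percent, cumulative frequency, and cumulative percent, so a question such as “How many people are age 30 or younger?” can be answered by reading a single row.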
Chapter 6: Creating Graphs

Chapter 6 shows you how to use SAS to create frequency bar charts––bar charts that indicate the number of people who displayed a given value on a variable. For example, imagine that you asked 150 people to indicate their political party. If you used SAS to create a frequency bar chart, the resulting chart would indicate the number of people who are democrats, the number who are republicans, and the number who are independents. Chapter 6 also shows how to create bar charts that plot subgroup means. For example, assume that, in the “political party” study described above, you asked the 150 subjects to indicate both their political party and their age. You could then use SAS to create a bar chart that plots the mean age for people in each party. For instance, the resulting bar chart might show that the average age for democrats was 32.12, the average age for republicans was 41.56, and the average age for independents was 37.33. (A sketch of both of these chart requests appears at the end of this section.)

Chapter 7: Measures of Central Tendency and Variability

Chapter 7 shows you how to compute measures of variability (e.g., the interquartile range, standard deviation, and variance) as well as measures of central tendency (e.g., the mean, median, and mode) for numeric variables. It also shows how to use stem-and-leaf plots to determine whether a distribution is skewed or approximately normal in shape.

Chapter 8: Creating and Modifying Variables and Data Sets

Chapter 8 shows how to use subsetting IF statements to create new data sets that contain a specified subgroup from the original sample. It also shows how to use mathematical operators and IF-THEN statements to recode variables and to create new variables from existing variables.

Chapter 9: Standardized Scores (z Scores)

Chapter 9 shows how to transform raw scores into standardized variables (z score variables) with a mean of 0 and a standard deviation of 1. You will learn how to do this by using the data manipulation statements that you learned about in Chapter 8. Chapter 9 also illustrates how you can review the sign and absolute magnitude of a z score to understand where a particular observation stands on the variable in question.
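Returning to Chapter 6 for a moment: both kinds of bar chart described there are produced by the CHART procedure. The following is a minimal sketch; the data set name D1 and the variable names PARTY and AGE are placeholders for this illustration:

   PROC CHART DATA=D1;
      VBAR PARTY;                          * frequency bar chart: one bar per party;
   RUN;

   PROC CHART DATA=D1;
      VBAR PARTY / SUMVAR=AGE TYPE=MEAN;   * bar height = mean age within each party;
   RUN;

The first request counts observations within each value of PARTY; the second uses the SUMVAR= and TYPE=MEAN options so that each bar's height represents the mean age of that subgroup.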
Chapter 10: Bivariate Correlation

Bivariate correlation coefficients allow you to determine the nature of the relationship between two numeric variables. Chapter 10 shows you how to use the CORR procedure to compute Pearson correlation coefficients for interval- and ratio-level variables. You will also learn to interpret the p values (probability values) that are produced by PROC CORR to determine whether a given correlation coefficient is significantly different from zero. Chapter 10 also shows how to use PROC PLOT to create a two-dimensional scattergram that illustrates the relationship between two variables.

Chapter 11: Bivariate Regression

Bivariate regression is used when you want to predict scores on an interval- or ratio-level criterion variable from an interval- or ratio-level predictor variable. Chapter 11 shows you how to use the REG procedure to compute the slope and intercept for the regression equation, along with predicted values and residuals of prediction.

Chapter 12: Single-Sample t Test

Chapter 12 shows how to use the TTEST procedure to perform a single-sample t test. This is an inferential procedure that is useful for determining whether a sample mean is significantly different from a specified population mean. You will learn how to interpret the t statistic, and the p value associated with that t statistic.

Chapter 13: Independent-Samples t Test

You use an independent-samples t test to determine whether there is a significant difference between two groups of subjects with respect to their mean scores on the dependent variable. Chapter 13 explains when to use the equal-variance t statistic versus the unequal-variance t statistic, and shows how to use the TTEST procedure to conduct this analysis.

Chapter 14: Paired-Samples t Test

The paired-samples t test is also appropriate when you want to determine whether there is a significant difference between two sample means. The paired-samples approach is indicated when each score in one sample is dependent upon a corresponding score in the second sample. This will be the case in studies in which the same subjects provide repeated measures on the same dependent variable under different conditions, or when matching procedures are used. Chapter 14 shows how to perform this analysis using the TTEST procedure.
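Because Chapters 12 through 14 all rely on the TTEST procedure, it may help to see the three forms side by side. The following is a minimal sketch; the data set and variable names (D1, SCORE, GROUP, PRE, POST) and the comparison value of 100 are placeholders for this illustration:

   PROC TTEST DATA=D1 H0=100;   * single-sample test against a hypothesized mean of 100;
      VAR SCORE;
   RUN;

   PROC TTEST DATA=D1;          * independent-samples test;
      CLASS GROUP;              * GROUP identifies the two groups of subjects;
      VAR SCORE;
   RUN;

   PROC TTEST DATA=D1;          * paired-samples test;
      PAIRED PRE*POST;          * tests the mean of the PRE minus POST differences;
   RUN;

Notice that the analysis is selected by the statements that accompany PROC TTEST: VAR alone (with the H0= option) for a single sample, CLASS for two independent samples, and PAIRED for two related samples.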
Chapter 15: One-Way ANOVA with One Between-Subjects Factor

One-way analysis of variance (ANOVA) is an inferential procedure similar to the independent-samples t test, with one important difference: while the t test allows you to test the significance of the difference between two sample means, a one-way ANOVA allows you to test the significance of the difference between more than two sample means. Chapter 15 shows how to use the GLM procedure to perform a one-way ANOVA, and then to follow with multiple comparison (post hoc) tests.

Chapter 16: Factorial ANOVA with Two Between-Subjects Factors

A one-way ANOVA, as described in Chapter 15, may be appropriate for analyzing data from an experiment in which the researcher manipulates only one independent variable. In contrast, a factorial ANOVA with two between-subjects factors may be appropriate for analyzing data from an experiment in which the researcher manipulates two independent variables simultaneously. Chapter 16 shows how to perform this type of analysis. It provides examples of results in which the main effects are significant, as well as results in which the interaction is significant.

Chapter 17: Chi-Square Test of Independence

Nonparametric statistical procedures are procedures that do not require stringent assumptions about the nature of the populations under study. Chapter 17 illustrates one of the most common nonparametric procedures: the chi-square test of independence. This test is appropriate when you want to study the relationship between two variables that assume a limited number of values. Chapter 17 shows how to conduct the test of significance and interpret the results presented in the two-way classification table created by the FREQ procedure. (Minimal sketches of the GLM and FREQ requests used in these chapters appear just before the Conclusion.)

References

Many statistical procedures are illustrated in this guide by showing you how to analyze fictitious data from an empirical study. Many of these “studies” are loosely based on actual investigations reported in the research literature. These studies were chosen to help introduce you to the types of empirical investigations that are often conducted in the social and behavioral sciences and in education. The “References” section at the end of this guide provides complete references for the actual studies that inspired the fictitious studies reported here.
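Before the Conclusion, here is a minimal preview of the core requests from Chapters 15 and 17. The data set and variable names (D1, SCORE, GROUP, SEX, PREFERENCE) are placeholders for this illustration, and TUKEY is only one of several multiple comparison options:

   PROC GLM DATA=D1;
      CLASS GROUP;             * the between-subjects factor;
      MODEL SCORE = GROUP;     * one-way ANOVA on the dependent variable SCORE;
      MEANS GROUP / TUKEY;     * one possible multiple comparison (post hoc) test;
   RUN;
   QUIT;

   PROC FREQ DATA=D1;
      TABLES SEX*PREFERENCE / CHISQ;   * two-way table plus the chi-square test of independence;
   RUN;

A factorial ANOVA (Chapter 16) extends the same GLM pattern by naming two CLASS variables and including their interaction in the MODEL statement.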
Conclusion

This guide assumes that some of the students using it have not yet completed a course on elementary statistics. This means that some readers will be unfamiliar with terms used in data analysis, such as “observations,” “null hypothesis,” “dichotomous variables,” and so on. To remedy this, the following chapter, “Terms and Concepts Used in This Guide,” provides a brief primer on basic terms and concepts in statistics. This chapter should lay a foundation that will make it easier to understand the chapters to follow.
Chapter 2: Terms and Concepts Used in This Guide

Introduction...........................................................................................15 Overview.................................................................................................................15 A Common Language for Researchers...................................................................15 Why This Chapter Is Important ...............................................................................15 Research Hypotheses and Statistical Hypotheses ..............................16 Example: A Goal-Setting Study..............................................................................16 The Research Question ..........................................................................................16 The Research Hypothesis.......................................................................................16 The Statistical Null Hypothesis................................................................................18 The Statistical Alternative Hypothesis.....................................................................19 Directional versus Nondirectional Alternative Hypotheses ......................................19 Summary ................................................................................................................21 Data, Variables, Values, and Observations ..........................................21 Defining the Instrument, Gathering Data, Analyzing Data, and Drawing Conclusions...........................................................................................21 Variables, Values, and Observations ......................................................................22 Classifying Variables According to Their Scales of Measurement......24 Introduction .............................................................................................................24 Nominal Scales .......................................................................................................25 Ordinal Scales.........................................................................................................25 Interval Scales ........................................................................................................26 Ratio Scales............................................................................................................27
Classifying Variables According to the Number of Values They Display .....................................................................................27 Overview.................................................................................................................27 Dichotomous Variables ...........................................................................................27 Limited-Value Variables ..........................................................................................28 Multi-Value Variables ..............................................................................................28 Basic Approaches to Research ............................................................29 Nonexperimental Research ....................................................................................29 Experimental Research...........................................................................................31 Using Type-of-Variable Figures to Represent Dependent and Independent Variables .....................................................................32 Overview.................................................................................................................32 Figures to Represent Types of Variables................................................................33 Using Figures to Represent the Types of Variables Assessed in a Specific Study...............................................................................................34 The Three Types of SAS Files...............................................................37 Overview.................................................................................................................37 The SAS Program...................................................................................................37 The SAS Log...........................................................................................................42 The SAS Output File ...............................................................................................44 Conclusion.............................................................................................45
Introduction

Overview

This chapter has two objectives. The first is to introduce you to basic terms and concepts related to research design and data analysis. This chapter describes the different types of variables that might be analyzed when conducting research, the classification of these variables according to their scale of measurement or other characteristics, and the differences between nonexperimental and experimental research. The chapter’s second objective is to introduce you to the three types of files that you will work with when you perform statistical analyses with SAS. These include the SAS program file, the SAS log file, and the SAS output file. After completing this chapter, you should be familiar with the fundamental terms and concepts that are relevant to data analysis, and you will have a foundation to begin learning about the SAS System in detail in subsequent chapters.

A Common Language for Researchers

Research in the behavioral sciences and in education is extremely diverse. In part, this is because the behavioral sciences represent a wide variety of disciplines, including psychology, sociology, anthropology, political science, management, and other fields. Further complicating matters is the fact that, within each discipline, a wide variety of methods are used to conduct research. These methods can include unobtrusive observation, participant observation, case studies, interviews, focus groups, surveys, ex post facto studies, laboratory experiments, and field experiments. Despite this diversity in methods used and topics investigated, most scientific investigations still share a number of characteristics. Regardless of field, most research involves an investigator who gathers data and performs analyses to determine what the data mean. In addition, most researchers in the behavioral sciences and education use a common language in reporting their research; researchers from all fields typically speak of “testing null hypotheses” and “obtaining significant p values.”

Why This Chapter Is Important

The purpose of this chapter is to review some fundamental concepts and terms that are shared in the behavioral sciences and in education. You should familiarize (or refamiliarize) yourself with this material before proceeding to the subsequent chapters, as most of the terms introduced here will be referred to again and again throughout the text. If you have not yet taken a course in statistics, this chapter will provide an elementary introduction; if you have already completed a course in statistics, it will provide a quick review.
Research Hypotheses and Statistical Hypotheses

Example: A Goal-Setting Study

Imagine that you have been hired by a large insurance company to find ways of improving the productivity of its insurance agents. Specifically, the company would like you to find ways to increase the number of insurance policies sold by the average agent. You will therefore begin a program of research to identify the determinants of agent productivity. In the course of this program, you will work with research questions, research hypotheses, and statistical hypotheses.

The Research Question

The process of research often begins by developing a clear statement of the research question (or questions). The research question is a statement of what you hope to have learned by the time the research has been completed. It is good practice to revise and refine the research question several times to ensure that you are very clear about what it is you really want to know. In the current case, for example, you might begin with the question “What is the difference between agents who sell much insurance versus agents who sell little insurance?” A more specific question might be “What variables have a causal effect on the amount of insurance sold by agents?” Upon reflection, you might realize that the insurance company really only wants to know what things management can do to cause the agents to sell more insurance. This might eliminate from consideration those variables that are not under management’s control, and can substantially narrow the focus of the research program. This narrowing, in turn, leads to a more specific statement of the research question such as “What variables under the control of management have a causal effect on the amount of insurance sold by agents?” Once the research question has been more clearly defined in this way, you are in a better position to develop a good hypothesis that provides a possible answer to the question.

The Research Hypothesis

A hypothesis is a statement about the predicted relationships among events or variables. A good hypothesis in the present case might identify a specific variable that is expected to have a causal effect on the amount of insurance sold by agents. For example, a research hypothesis might predict that the agents’ level of training will have a positive effect on the amount of insurance sold. Or it might predict that the agents’ level of achievement motivation will positively affect sales.
In developing the hypothesis, you might be influenced by any of a number of sources: an existing theory, some related research, or even personal experience. Let's assume that in the present situation, for example, you have been influenced by goal-setting theory. This theory states, among other things, that higher levels of work performance are achieved when difficult goals are set for employees. Drawing on goal-setting theory, you now state the following hypothesis: “The difficulty of the goals that agents set for themselves is positively related to the amount of insurance they sell.” Notice how this statement satisfies our definition for a research hypothesis, as it is a statement about the predicted relationship between two variables. The first variable can be labeled “goal difficulty,” and the second can be labeled “amount of insurance sold.” The predicted relationship between goal difficulty and amount of insurance sold is illustrated in Figure 2.1. Notice that there is an arrow extending from goal difficulty to amount of insurance sold. This arrow reflects the prediction that goal difficulty is the causal variable, and amount of insurance sold is the variable being affected.
Figure 2.1. Causal relationship between goal difficulty and amount of insurance sold, as predicted by the research hypothesis.
In Figure 2.1, you can see that the variable being affected (insurance sold) appears on the left side of the figure, and that the causal variable (goal difficulty) appears on the right. This arrangement might seem a bit unusual to you, since most figures that portray causal relationships have the order reversed (with the causal variable on the left and the variable being affected on the right). However, this guide will always use the arrangement that appears in Figure 2.1, for reasons that will become clear later. You can see that the research hypothesis stated above is quite broad in nature. In many research situations, however, it is helpful to state hypotheses that are more specific in the predictions they make. For example, assume that there is an instrument called the “Smith Goal Difficulty Scale.” Scores on this fictitious instrument can range from zero to 100, with higher scores indicating more difficult goals. If you administered this scale to a sample of agents, you could develop a more specific research hypothesis along the following lines: “Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance than agents who score below 60.”
The Statistical Null Hypothesis

Beginning in Chapter 10, “Bivariate Correlation,” this guide will show you how to use the SAS System to perform tests of null hypotheses. The way that you state a specific null hypothesis will vary depending on the nature of your research question and the type of data analysis that you are performing. Generally speaking, however, a statistical null hypothesis is typically a prediction that there is no difference between groups in the population, or that there is no relationship between variables in the population. For example, consider the research hypothesis stated in the preceding section: “Agents who score 60 or above on the Smith Goal Difficulty Scale will sell greater amounts of insurance than agents who score below 60.” Assume that you conduct a study to investigate this research hypothesis. You identify two groups of subjects:

• 50 agents who score 60 or above on the Smith Goal Difficulty Scale (the “high goal-difficulty group”)

• 50 agents who score below 60 on the Smith Goal Difficulty Scale (the “low goal-difficulty group”).
You observe these agents over a 12-month period, and record the amount of insurance that they sell. You want to investigate the following (fairly specific) research hypothesis:

Research hypothesis: The average amount of insurance sold by the high goal-difficulty group will be greater than the average amount sold by the low goal-difficulty group.

You plan to analyze the data using a statistical procedure such as a t test (which will be discussed in Chapter 13, “Independent-Samples t Test”). One way to structure this analysis is to begin with the following statistical null hypothesis:

Statistical null hypothesis: In the population, there is no difference between the high goal-difficulty group and the low goal-difficulty group with respect to their mean scores on the amount of insurance sold.

Notice that this is a prediction of no difference between the groups. You will analyze the data from your sample, and if the observed difference is large enough, you will reject this null hypothesis of no difference. Rejecting this statistical null hypothesis means that you have obtained some support for your original research hypothesis (the hypothesis that there is a difference between the groups).

Statistical null hypotheses are often represented symbolically. For example, this is how you could have symbolically represented the preceding statistical null hypothesis:

H0: µ1 = µ2

where

H0 is the symbol used to represent the null hypothesis

µ1 is the symbol used to represent the mean amount of insurance sold by Group 1 (the high goal-difficulty group) in the population

µ2 is the symbol used to represent the mean amount of insurance sold by Group 2 (the low goal-difficulty group) in the population.
The Statistical Alternative Hypothesis

A statistical alternative hypothesis is typically a prediction that there is a difference between groups in the population, or that there is a relationship between variables in the population. The alternative hypothesis is the counterpart to the null hypothesis; if you reject the null hypothesis, you tentatively accept the alternative hypothesis.

There are different ways that you can state alternative hypotheses. One way is simply to predict that there is a difference between the population means, without predicting which population mean is higher. Here is one way of stating that type of alternative hypothesis for the current study:

Statistical alternative hypothesis: In the population, there is a difference between the high goal-difficulty group and the low goal-difficulty group with respect to their mean scores on the amount of insurance sold.

The alternative hypothesis also can be stated symbolically:

H1: µ1 ≠ µ2

The H1 symbol above is the symbol for an alternative hypothesis. Notice that the “not equal” symbol (≠) is used to represent the prediction that the means will not be equal.

Directional versus Nondirectional Alternative Hypotheses

Nondirectional hypotheses. The preceding section illustrated a nondirectional alternative hypothesis, also known as a two-sided or two-tailed alternative hypothesis. With the type of study described here (a study in which group means are being compared), a nondirectional alternative hypothesis simply predicts that one population mean differs from the other population mean––it does not predict which population mean will be higher. You would obtain support for this nondirectional alternative hypothesis if the high goal-difficulty group sold significantly more insurance, on the average, than the low goal-difficulty group. You would also obtain support for this nondirectional alternative hypothesis if the low goal-difficulty group sold significantly more insurance than the high goal-difficulty group. With a nondirectional alternative hypothesis, you are predicting some type of difference, but you are not predicting the specific nature, or direction, of the difference.
Directional hypotheses. In some situations it might be appropriate to use a directional alternative hypothesis. With the type of study described above, a directional alternative hypothesis (also known as a one-sided or one-tailed alternative hypothesis) not only predicts that there will be a difference, but also makes a specific prediction about which population will display the higher mean. For example, in the present study, previous research might lead you to predict that the population of high goal-difficulty employees will sell more insurance, on the average, than the population of low goal-difficulty employees. If this were the case, you might state the following directional alternative hypothesis:

Statistical alternative hypothesis: In the population, the mean amount of insurance sold by the high goal-difficulty group is greater than the mean amount of insurance sold by the low goal-difficulty group.

This alternative hypothesis can also be stated symbolically:

H1: µ1 > µ2

where

µ1 represents the mean amount of insurance sold by Group 1 (the high goal-difficulty group) in the population

µ2 represents the mean amount of insurance sold by Group 2 (the low goal-difficulty group) in the population.

Notice that the “greater than” symbol (>) is used to represent the prediction that the mean for the high goal-difficulty population is greater than the mean for the low goal-difficulty population.

Choosing directional versus nondirectional tests. Which type of alternative hypothesis should you use in your research? Most statistics textbooks recommend using a nondirectional (two-sided) alternative hypothesis in most cases. The problem with a directional hypothesis is that, if the obtained sample means fall in the direction opposite to the one you predicted, you can fail to reject the null hypothesis even when the difference between the sample means is very large. For example, assume that you state the directional alternative hypothesis presented above (i.e., “In the population, the mean amount of insurance sold by the high goal-difficulty group is greater than the mean amount of insurance sold by the low goal-difficulty group”). Because your alternative hypothesis is a directional hypothesis, the null hypothesis you are testing is as follows:

H0: µ1 ≤ µ2

which means, “In the population, the mean amount of insurance sold by the high goal-difficulty group (Group 1) is less than or equal to the mean amount of insurance sold by the low goal-difficulty group (Group 2).”
Clearly, to reject this null hypothesis, the high goal-difficulty group (Group 1) must display a mean that is greater than the mean of the low goal-difficulty group (Group 2). If Group 2 displays the higher mean, you will not reject the null hypothesis, no matter how great the difference might be. This presents a problem, because the finding that Group 2 scored higher than Group 1 may be of great interest to other researchers (particularly because it is not what many would have expected). This is why, in many situations, nondirectional tests are preferred over directional tests.

Summary

In summary, research projects often begin with a statement of a research hypothesis. This allows you to develop a specific, testable statistical null hypothesis and an alternative hypothesis. The analysis of your data will lead you to one of two results:

• If the results are significant, you can reject the null hypothesis and tentatively accept the alternative hypothesis. Assuming the means are in the predicted direction, this type of result provides some support for your initial research hypothesis.

• If the results are nonsignificant, you fail to reject the null hypothesis. This type of result fails to provide support for your initial research hypothesis.
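To preview how such a test is requested in SAS (Chapter 13, “Independent-Samples t Test,” covers it in detail), here is a minimal sketch. The data set name D1 and the variable names GOALGRP and SALES are hypothetical choices made for this illustration; by default, PROC TTEST performs a nondirectional (two-tailed) test.

PROC TTEST DATA=D1;
CLASS GOALGRP;   /* classification variable coding the two groups */
VAR SALES;       /* quantitative variable whose group means are compared */
RUN;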
Data, Variables, Values, and Observations

Defining the Instrument, Gathering Data, Analyzing Data, and Drawing Conclusions

With the null hypothesis stated, you can now test it by conducting a study in which you gather and analyze relevant data. Data is defined as a collection of scores that are obtained when subject characteristics and/or performance are observed and recorded. For example, you can choose to test your hypothesis by conducting a simple correlational study: You identify a group of 100 agents and determine

• the difficulty of the goals that have been set for each agent

• the amount of insurance sold by each.
Different types of instruments can be used to obtain different types of data. For example, you might use a questionnaire to assess goal difficulty, but rely on company records for measures of insurance sold. Once the data are gathered, each agent will have one score indicating the difficulty of his or her goals, and a second score indicating the amount of insurance he or she has sold. You would then analyze the data to see if the agents with the more difficult goals did, in fact, sell more insurance. If so, the study results would lend some support to your research hypothesis; if not, the results would fail to provide support. In either case, you would be
able to draw conclusions regarding the tenability of your hypotheses, and would have made some progress toward answering your research question. The information learned in the current study might stimulate new questions or new hypotheses for subsequent studies, and the cycle would repeat. For example, if you obtained support for your hypothesis with a correlational study, you might choose to follow it up with a study using a different research method, perhaps an experimental study (the difference between these methods will be described below). Over time, a body of research evidence would accumulate, and researchers would be able to review this body to draw general conclusions about the determinants of insurance sales.

Variables, Values, and Observations

Definitions. When discussing data, one often speaks in terms of variables, values, and observations. Further complicating matters is the fact that researchers make distinctions between different types of variables (such as quantitative variables versus classification variables). This section discusses the distinctions between these terms.

• Variables. For the type of research discussed in this book, a variable refers to some specific characteristic of a subject that can assume one or more different values. For the subjects in the study described above, “amount of insurance sold” is an example of a variable: Some subjects had sold a large amount of insurance, and others had sold less. A different variable was “goal difficulty”: Some subjects had more difficult goals, while others had less difficult goals. Subject age was a third variable, while subject sex (male versus female) was yet another.

• Values. A value, on the other hand, refers to either a particular subject's relative standing on a quantitative variable, or a subject's classification within a classification variable. For example, “amount of insurance sold” is a quantitative variable that can assume a large number of values: One agent might sell $2,000,000 worth of insurance in one year, one might sell $100,000 worth, and another might sell $0 worth. Subject age is another quantitative variable that can assume a wide variety of values. In the sample studied, these values ranged from a low of 22 years to a high of 64 years.

• Quantitative variables. You can see that, in both of these examples, a particular value is a type of score that indicates where the subject stands on the variable. The word “score” is an appropriate substitute for the word “value” in these cases because both “amount of insurance sold” and “age” are quantitative variables: variables that represent the quantity, or amount, of the construct that is being assessed. With quantitative variables, numbers typically serve as values.

• Classification variables. A different type of variable is a classification variable or, alternatively, a qualitative variable or categorical variable. With classification variables, different values represent different groups to which the subject might belong. “Sex” is a good example of a classification variable, as it can assume only one of two values: A particular subject is classified as being either a male or a female. “Political party” is an example of a classification variable that can assume a larger number of values: A subject might be classified as being a Republican, a Democrat, or an independent. These variables are classification variables and not quantitative variables because the values represent only membership in a single, specific group––membership that cannot be represented meaningfully with a numeric value.

• Observational units. In discussing data, researchers often make reference to observational units, which can be defined as the individual subjects (or other objects) that serve as the source of the data. Within the behavioral sciences and education, an individual person usually serves as the observational unit under study (although it is also possible to use some other entity, such as an individual school or organization, as the observational unit). In this text, the individual person is used as the observational unit in most examples. Researchers will often refer to the “number of observations” or “number of cases” included in their data set, and this typically refers to the number of subjects who were studied.
An example. For a more concrete illustration of the concepts discussed so far, consider the data set displayed in Table 2.1:

Table 2.1
Insurance Sales Data
______________________________________________________________________
                                Goal difficulty   Overall
Observation   Name    Sex  Age      scores        ranking     Sales
______________________________________________________________________
     1        Bob      M    34        97             2      $598,243
     2        Walt     M    56        80             1      $367,342
     3        Jane     F    36        67             4      $254,998
     4        Susan    F    24        40             3      $ 80,344
     5        Jim      M    22        37             5      $ 40,172
     6        Mack     M    44        24             6      $      0
______________________________________________________________________
The preceding table reports information regarding six research subjects: Bob, Walt, Jane, Susan, Jim, and Mack; therefore, we would say that the data set includes six observations. Information about a particular observation (subject) is displayed as a row running horizontally from left to right across the table. The first column of the data set (running vertically from top to bottom) is headed “Observation,” and it simply provides an observation number for each subject. The second column (headed “Name”) provides a name for each subject. The remaining five columns report information about the five research variables that are being studied. The column headed “Sex” reports subject sex, which might assume one of two values: “M” for male and “F” for female.
The column headed “Age” reports the subject's age in years. The “Goal difficulty scores” column reports the subject's score on a fictitious goal difficulty scale. In this example, each participant has a score on a 20-item questionnaire about the difficulty of his or her work goals. Depending on how they respond to the questionnaire, subjects receive a score ranging from a low of zero (meaning that the subject views the work goals as extremely easy) to a high of 100 (meaning that the goals are viewed as extremely difficult). The column headed “Overall ranking” shows how the subjects were ranked by their supervisor according to their overall effectiveness as agents. A rank of 1 represents the most effective agent, and a rank of 6 represents the least effective. The column headed “Sales” reveals the amount of insurance sold by each agent (in dollars) during the most recent year.

Table 2.1 provides a very small data set with six observations and five research variables (sex, age, goal difficulty, overall ranking, and sales). One of the variables was a classification variable (sex), while the remainder were quantitative variables. The numbers or letters that appear within a particular column represent some of the values that could be assumed by that variable.
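As a preview of the SAS programming covered later in this chapter and in Chapter 3, here is one way that the Table 2.1 data might be entered as a SAS data set. This is only a sketch: the data set name INSURE and the variable names (NAME, SEX, AGE, GOALDIFF, RANKING, SALES) are our own choices, not names required by SAS. The dollar sign ($) after NAME and SEX tells SAS that those are character (classification) variables, and the dollar signs and commas in the sales figures are omitted so that SALES can be read as a numeric variable.

DATA INSURE;
INPUT NAME $ SEX $ AGE GOALDIFF RANKING SALES;
DATALINES;
Bob   M 34 97 2 598243
Walt  M 56 80 1 367342
Jane  F 36 67 4 254998
Susan F 24 40 3  80344
Jim   M 22 37 5  40172
Mack  M 44 24 6      0
;
PROC PRINT DATA=INSURE;
RUN;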
Classifying Variables According to Their Scales of Measurement

Introduction

One of the most important schemes for classifying a variable involves its scale of measurement. Researchers generally discuss four different scales of measurement: nominal, ordinal, interval, and ratio. Before analyzing a data set, it is important to determine which scales of measurement were used, because certain types of statistical procedures require specific scales of measurement. For example, a one-way analysis of variance generally requires that the dependent variable be an interval-level or ratio-level variable; the chi-square test of independence allows you to analyze nominal-level variables; other statistics make other assumptions about the scale of measurement used with the variables that are being studied.
Nominal Scales

A nominal scale is a classification system that places people, objects, or other entities into mutually exclusive categories. A variable that is measured using a nominal scale is a classification variable: It simply indicates the name of the group to which each subject belongs. The examples of classification variables provided earlier (e.g., sex and political party) also serve as examples of nominal-level variables: They tell you which group a subject belongs to, but they do not provide any quantitative information about the subjects. That is, the “sex” variable might tell you that some subjects are males and others are females, but it does not tell you that some subjects possess more of a specific characteristic relative to others. With the remaining three scales of measurement, however, some quantitative information is provided.

Ordinal Scales

Values on an ordinal scale represent the rank order of the subjects with respect to the variable that is being assessed. For example, Table 2.1 includes one variable called “Overall ranking,” which represents the rank-ordering of the subjects according to their overall effectiveness as agents. The values on this ordinal scale represent a hierarchy of levels with respect to the construct of “effectiveness”: We know that the agent ranked “1” was perceived as being more effective than the agent ranked “2,” that the agent ranked “2” was more effective than the one ranked “3,” and so forth. However, an ordinal scale has a serious limitation in that equal differences in scale values do not necessarily have equal quantitative meaning. For example, notice the rankings reproduced here:

Overall
ranking    Name
_______    ______
   1       Walt
   2       Bob
   3       Susan
   4       Jane
   5       Jim
   6       Mack
Notice that Walt was ranked #1 while Bob was ranked #2. The difference between these two rankings is 1 (because 2 – 1 = 1), so we might say that there is one unit of difference between Walt and Bob. Now notice that Jim was ranked #5 while Mack was ranked #6. The difference between these two rankings is also 1 (because 6 – 5 = 1), so we might say that there is also 1 unit of difference between Jim and Mack. Putting the two together, we can see that the difference in ranking between Walt and Bob is equal to the difference in ranking between Jim and Mack. But does this mean that the difference in overall effectiveness between Walt and Bob is equal to the difference in overall effectiveness between Jim and Mack? Not necessarily. It is possible that Walt was just barely superior to
Bob in effectiveness, while Jim was substantially superior to Mack. These rankings tell us very little about the quantitative differences between the subjects with regard to the underlying construct (effectiveness, in this case). An ordinal scale simply provides a rank order of who is better than whom.

Interval Scales

With an interval scale, equal differences between scale values do have equal quantitative meaning. For this reason, you can see that the interval scale provides more quantitative information than the ordinal scale. A good example of an interval scale is the Fahrenheit scale used to measure temperature. With the Fahrenheit scale, the difference between 70 degrees and 75 degrees is equal to the difference between 80 degrees and 85 degrees: the units of measurement are equal throughout the full range of the scale.

However, the interval scale also has an important limitation: it does not have a true zero point. A true zero point means that a value of zero on the scale represents zero quantity of the construct being assessed. It should be obvious that the Fahrenheit scale does not have a true zero point. When the thermometer reads zero degrees, that does not mean that there is absolutely no heat present in the environment––it is still possible for the temperature to go lower (into the negative numbers).

Researchers in the social sciences often assume that many of their “man-made” variables are measured on an interval scale. For example, in the preceding study involving insurance agents, you would probably assume that scores from the goal difficulty questionnaire constitute an interval-level scale; that is, you would likely assume that the difference between a score of 50 and 60 is approximately equal to the difference between a score of 70 and 80. Many researchers would also assume that scores from an instrument such as an intelligence test are measured at the interval level of measurement. On the other hand, some researchers are skeptical that instruments such as these have true equal-interval properties, and prefer to refer to them as quasi-interval scales. Disagreement concerning the level of measurement achieved with such paper-and-pencil instruments continues to be a controversial topic within many disciplines.

In any case, it is clear that there is no true zero point with either of the preceding instruments: a score of zero on the goal difficulty scale does not indicate the complete absence of goal difficulty, and a score of zero on an intelligence test does not indicate the complete absence of intelligence. A true zero point can be found only with variables measured on a ratio scale.
Ratio Scales

Ratio scales are similar to interval scales in that equal differences between scale values do have equal quantitative meaning. However, ratio scales also have a true zero point, which gives them an additional property: with ratio scales, it is possible to make meaningful statements about the ratios between scale values.

For example, the system of inches used with a common ruler is an example of a ratio scale. There is a true zero point with this system, in that “zero inches” does in fact indicate a complete absence of length. With this scale, it is possible to make meaningful statements about ratios. It is appropriate to say that an object four inches long is twice as long as an object two inches long. Age, as measured in years, is also on a ratio scale: a 10-year-old house is twice as old as a 5-year-old house. Notice that it is not possible to make these statements about ratios with the interval-level variables discussed above. One would not say that a person with an IQ of 160 is twice as intelligent as a person with an IQ of 80, as there is no true zero point with that scale.

Although ratio-level scales are most commonly used for reporting the physical properties of objects (e.g., height, weight), they are also common in the type of research that is discussed in this manual. For example, the study discussed above included the variables “age” and “amount of insurance sold (in dollars).” Both of these have true zero points, and are measured as ratio scales.
Classifying Variables According to the Number of Values They Display

Overview

The preceding section showed that variables can be classified according to their scale of measurement. Sometimes it is also useful to classify variables according to the number of values they display. There might be any number of approaches for doing this, but this guide uses a simple division of variables into three groups according to the number of possible values: dichotomous variables, limited-value variables, and multi-value variables.

Dichotomous Variables

A dichotomous variable is a variable that assumes just two values. These variables are sometimes called binary variables. Here are some examples of dichotomous variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 male subjects and 50 female subjects. In this study, “subject sex” is a dichotomous variable, because it can assume just two values: “male” versus “female.”

• Suppose that you conduct an experiment to determine whether the herbal supplement ginkgo biloba causes improvement in a rat’s ability to learn. You begin with 20 rats, and randomly assign them to two groups. Ten rats are assigned to the 100 mg group (they receive 100 mg of ginkgo), and the other ten rats are assigned to the 0 mg group (they receive no ginkgo). In this study, the independent variable that you are manipulating is “amount of ginkgo administered.” This is a dichotomous variable because it assumes just two values: “0 mg” versus “100 mg.”
Limited-Value Variables

A limited-value variable is a variable that assumes just two to six values in your sample. Here are some examples of limited-value variables:

• Suppose that you obtain Smith Anxiety Test scores from 50 Caucasian subjects, 50 African-American subjects, and 50 Asian-American subjects. In this study, “subject race” is a limited-value variable because it assumes just three values: “Caucasian” versus “African-American” versus “Asian-American.”

• Suppose that you again conduct an experiment to determine whether ginkgo biloba causes improvements in a rat’s ability to learn. You begin with 100 rats, and randomly assign them to four groups: Twenty-five rats are assigned to the 150 mg group, 25 rats are assigned to the 100 mg group, 25 rats are assigned to the 50 mg group, and 25 rats are assigned to the 0 mg group. In this study, the independent variable that you are manipulating is still “amount of ginkgo administered.” You know that this is a limited-value variable because it assumes just four values: “0 mg” versus “50 mg” versus “100 mg” versus “150 mg.”
Multi-Value Variables

Finally, this book defines a multi-value variable as a variable that assumes more than six values in your sample. Here are some examples of multi-value variables:

• Assume that you obtain Smith Anxiety Test scores from 100 subjects. With the Smith Anxiety Test, scores (values) may range from 0 to 99, with higher scores indicating greater anxiety. In analyzing the data, you see that your subjects displayed a wide variety of scores, for example:

  • One subject received a score of 2.
  • One subject received a score of 5.
  • Two subjects received a score of 10.
  • Five subjects received a score of 21.
  • Seven subjects received a score of 33.
  • Eight subjects received a score of 45.
  • Nine subjects received a score of 53.
  • Seven subjects received a score of 68.
  • Six subjects received a score of 72.
  • Six subjects received a score of 81.
  • One subject received a score of 89.
  • One subject received a score of 91.

  Other subjects received yet other scores. Clearly, scores on the Smith Anxiety Test constitute a multi-value variable in your sample because your subjects displayed more than six values on this variable.

• Assume that, in the ginkgo biloba study just described, you assess your dependent variable (learning) in the rats by having them work at a maze-solving task. First, you teach each rat that, if it can correctly find its way through a maze, it will be rewarded with food at the end. You then allow each rat to try to find its way through a series of mazes. Each rat is allowed 30 trials––30 opportunities to get through a maze. Your measure of learning, therefore, is the number of mazes that each rat correctly negotiates. This score can range from zero (if the rat is not successful on any of the trials) to 30 (if the rat is successful on all of the trials). A rat also can score anywhere in between these extremes. In analyzing the data, you find that the rats displayed a wide variety of scores on this “successful trials” dependent variable, for example:

  • One rat displayed zero successful trials.
  • Two rats displayed three successful trials.
  • Three rats displayed eight successful trials.
  • Four rats displayed 10 successful trials.
  • Five rats displayed 14 successful trials.
  • Six rats displayed 15 successful trials.
  • Six rats displayed 19 successful trials.
  • Two rats displayed 21 successful trials.
  • One rat displayed 27 successful trials.
  • One rat displayed 28 successful trials.

  Other rats displayed yet other scores. Clearly, scores on the “successful trials” variable constitute a multi-value variable in your sample because the rats displayed more than six values on this variable.
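In practice, a quick way to count how many distinct values a variable displays in your sample is to request a frequency table for it. The following sketch assumes a data set named D1 that contains a variable named ANXIETY (both names are hypothetical); the resulting table lists one row for each distinct value observed.

PROC FREQ DATA=D1;
TABLES ANXIETY;
RUN;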
Basic Approaches to Research

Nonexperimental Research

Naturally-occurring variables. Much research can be described as being either nonexperimental or experimental in nature. In nonexperimental research (also called
correlational, nonmanipulative, or observational research), the researcher simply studies the naturally-occurring relationship between two or more naturally-occurring variables. A naturally-occurring variable is a variable that is not manipulated or controlled by the researcher; it is simply measured as it normally exists. The insurance study described previously is a good example of nonexperimental research, in that you simply measured two naturally-occurring variables (goal difficulty and amount of insurance sold) to determine whether they were related. If, in a different study, you investigated the relationship between IQ and college grade point average (GPA), this would also be an example of nonexperimental research.

Criterion versus predictor variables. With nonexperimental designs, researchers often refer to criterion variables and predictor variables. A criterion variable is an outcome variable that can be predicted from one or more predictor variables. The criterion variable is often the main focus of the study in that it is the outcome variable mentioned in the statement of the research problem. With our insurance example, the criterion variable is the amount of insurance sold. The predictor variable, on the other hand, is the variable that is used to predict values on the criterion. In some studies, you might even believe that the predictor variable has a causal effect on the criterion. In the insurance study, for example, the predictor variable was “goal difficulty.” Because you believed that goal difficulty can positively affect insurance sales, you conducted a study in which goal difficulty was the predictor and insurance sales was the criterion. You do not necessarily have to believe that there is a causal relationship between two variables to conduct a study such as this, however; you might simply be interested in determining whether it is possible to predict one variable from the other.

Cause-and-effect relationships. It should be noted here that nonexperimental research that investigates the relationship between just two variables generally provides very weak evidence concerning cause-and-effect relationships. The reasons for this can be seen by reviewing our study on insurance sales. If you conduct this study and find that the agents with the more difficult goals also tend to sell more insurance, does that mean that having difficult goals caused them to sell more insurance? Not necessarily. It can also be argued that selling a lot of insurance increases the agents' self-confidence, and that this causes them to set higher work goals for themselves. Under this second scenario, it was actually the insurance sales that had a causal effect on goal difficulty.

As this example shows, with nonexperimental research it is often possible to obtain a single finding that is consistent with a number of different, contradictory causal explanations. Hence, a strong inference that “variable A had a causal effect on variable B” is generally not possible when one conducts simple correlational research with just two variables. To obtain stronger evidence of cause and effect, researchers generally either analyze the relationships among a larger number of variables using sophisticated statistical procedures that are beyond the scope of this text (such as structural equation modeling), or drop the nonexperimental approach entirely and instead use experimental research methods. The nature of experimental research is discussed in the following section.
Experimental Research

General characteristics. Most experimental research can be identified by three important characteristics:

• subjects are randomly assigned to experimental conditions

• the researcher manipulates an independent variable

• subjects in different experimental conditions are treated similarly with regard to all variables except the independent variable.
To illustrate these concepts, let's describe a possible experiment in which you test the hypothesis that goal difficulty positively affects insurance sales. First you identify a group of 100 agents who will serve as subjects. Then you randomly assign 50 agents to a “difficult goal” condition. Subjects in this group are told by their superiors to make at least 25 “cold calls” (sales calls) to potential policyholders per week. Assume that this is a relatively difficult goal. The other 50 agents are randomly assigned to the “easy goal” condition; they are told to make just 5 cold calls to potential policyholders per week. To the extent possible, you see to it that agents in both groups are treated similarly with respect to everything except for the difficulty of the goals that are set for them. After one year, you determine how much new insurance each agent has sold that year. You find that the average agent in the difficult goal condition sold new policies totaling $156,000, while the average agent in the easy goal condition sold policies totaling only $121,000.

Independent versus dependent variables. It is possible to use some of the terminology associated with nonexperimental research when discussing this experiment. For example, it would be appropriate to continue to refer to the amount of insurance sold as a criterion variable, because this is the outcome variable of central interest. You also could continue to refer to goal difficulty as the predictor variable, because you believe that this variable will predict sales to some extent.

Notice that goal difficulty is now a somewhat different variable, however. In the nonexperimental study, goal difficulty was a naturally-occurring variable that could take on a wide variety of values (whatever score the subject received on the goal difficulty questionnaire). In the present experiment, however, goal difficulty is a manipulated variable, which means that you (as the researcher) determined what value of the variable would be assigned to each subject. In the experiment, the goal difficulty variable could assume only one of two values: Subjects were either in the difficult goal group or the easy goal group. Therefore, goal difficulty is now a classification variable that codes group membership.

Although it is acceptable to speak of predictor and criterion variables within the context of experimental research, it is more common to speak in terms of independent variables and dependent variables. The independent variable is that variable whose values (or levels) are
selected by the experimenter to determine what effect the independent variable has on the dependent variable. The independent variable is the experimental counterpart to a predictor variable. A dependent variable, on the other hand, is some aspect of the subject's behavior that is assessed to determine whether it has been affected by the independent variable. The dependent variable is the experimental counterpart to a criterion variable. In the present example experiment, goal difficulty is the independent variable, and the amount of insurance sold is the dependent variable. Remember that the terms “predictor variable” and “criterion variable” can be used with almost any type of research––experimental or nonexperimental. However, the terms “independent variable” and “dependent variable” should be used only with experimental research––research conducted under controlled conditions with a true manipulated independent variable.

Levels of the independent variable. Researchers often speak in terms of the different levels of the independent variable. These levels are also referred to as experimental conditions or treatment conditions, and correspond to the different groups to which a subject might be assigned. The present example included two experimental conditions: a difficult goal condition, and an easy goal condition.

With respect to the independent variable, it is common to speak of the experimental group versus the control group. Generally speaking, the experimental group is the group that receives the experimental treatment of interest, while the control group is an equivalent group of subjects that does not receive this treatment. The simplest type of experiment consists of one experimental group and one control group. For example, the present study could have been redesigned so that it simply consisted of an experimental group that was assigned the goal of making 25 cold calls (the difficult goal condition), as well as a control group in which no goals were assigned (the no-goal condition). Obviously, it is possible to expand the study by creating more than one experimental group. This could be accomplished in the present case by assigning one experimental group the difficult goal of 25 cold calls and the second experimental group the easy goal of 5 cold calls. The control group could still be assigned zero goals.
Using Type-of-Variable Figures to Represent Dependent and Independent Variables

Overview

Many studies in the social sciences and education are designed to investigate the relationship between just two variables. In an experiment, researchers generally refer to these as the independent and dependent variables; in a nonexperimental study, researchers often call them the predictor and criterion variables.
Some chapters in this guide will describe studies in which a researcher investigates the relationship between predictor and criterion variables. To help you better visualize the nature of the variables being analyzed, most of these chapters will provide a type-of-variable figure: a figure that graphically illustrates the number of values that are assumed by the two variables in the study. This section begins by presenting the symbols that will represent three types of variables: dichotomous variables, limited-value variables, and multi-value variables. It then provides a few examples of the type-of-variable figures that you will see in subsequent chapters of this book.

Figures to Represent Types of Variables

Dichotomous variables. A dichotomous variable is one that assumes just two values. For example, the variable “sex” is a dichotomous variable that can assume just the values of “male” versus “female.” Below is the type-of-variable symbol that will represent a dichotomous variable:
[Type-of-variable symbol: two boxes, each labeled “Di”]

The “Di” that appears inside the boxes is an abbreviation for “Dichotomous.” The figure includes two boxes to help you remember that a dichotomous variable is one that assumes only two values.

Limited-value variables. A limited-value variable is one that assumes only two to six values. For example, the variable “political party” would be a limited-value variable if it assumed only the values of “Democrat” versus “Republican” versus “independent.” Below is the type-of-variable symbol that will represent a limited-value variable:

[Type-of-variable symbol: three boxes, each labeled “Lmt”]
The “Lmt” that appears inside the boxes is an abbreviation for “Limited.” The figure includes three boxes to remind you that a limited-value variable is one that can have only two to six values.

Multi-value variables. A multi-value variable is one that assumes more than six values. For example, if you administered an IQ test to a sample of 300 subjects, then “IQ scores” would be a multi-value variable if more than six different IQ scores appeared in your sample. Below is the type-of-variable symbol that will represent a multi-value variable:

[Type-of-variable symbol: seven boxes labeled “Multi”]
This figure consists of seven boxes to help you remember that a multi-value variable is one that assumes more than six values in your sample.
Using Figures to Represent the Types of Variables Assessed in a Specific Study

As was stated earlier, when a study is a true experiment, the two variables that are being investigated are typically referred to as a dependent variable and an independent variable. It is possible to construct a type-of-variable figure that illustrates the nature of the dependent variable, as well as the nature of the independent variable, in a single figure.

The research hypothesis. For example, earlier this chapter developed the research hypothesis that goal difficulty will have a positive causal effect on the amount of insurance sold by insurance agents. This hypothesis was illustrated by the causal figure presented in Figure 2.1. That figure is reproduced again here as Figure 2.2. Notice that, in this figure, the dependent variable (amount of insurance sold) appears on the left, and the independent variable (goal difficulty) appears on the right.
Figure 2.2. Predicted causal relationship between goal difficulty (the independent variable) and amount of insurance sold (the dependent variable).
An experiment with two conditions. In this example, you conduct a simple experiment to investigate this research hypothesis. You begin with 100 insurance agents, and randomly assign each agent to either an experimental group or a control group. The 50 agents in the experimental group (the “difficult-goal condition”) are told to make 25 cold calls each week. The 50 agents in the control group (the “easy-goal condition”) are told to make 5 cold calls each week. After one year, you measure your dependent variable: The amount of insurance (in dollars) sold by each agent. When you review the data, you find that the agents displayed a wide variety of scores on this dependent variable: some agents sold $0 worth of insurance, some agents sold $5,000,000 worth of insurance, and most sold somewhere in between these two extremes. As a group, they displayed far more than six values on this dependent variable. The type-of-variable figure for the preceding study is shown below:
[Type-of-variable figure: Multi = Di]
When illustrating an experiment with a type-of-variable figure, this guide will use the convention of placing the symbol for the dependent variable on the left side of the equals sign (=), and placing the symbol for the independent variable on the right side of the equals sign. You can see that this convention was followed in the preceding figure: the word “Multi” on the left of the equals sign represents the fact that the dependent variable in your
study (amount of insurance sold) was a multi-value variable. You knew this because the agents displayed more than six values on this variable. In addition, the letters “Di” on the right side of the equals sign represent the fact that the independent variable (goal difficulty) was a dichotomous variable. You knew this because this independent variable consisted of just two values (conditions): a difficult-goal condition and an easy-goal condition. Because the dependent variable is on the left and the independent variable is on the right, the preceding type-of-variable figure is similar to Figure 2.2, which illustrated the research hypothesis. In that figure, the dependent variable was also on the left, and the independent variable was also on the right.

The preceding type-of-variable figure could be used to illustrate any experiment in which the dependent variable was a multi-value variable and the independent variable was a dichotomous variable. In Chapter 13, “Independent-Samples t Test,” you will learn that data from this type of experiment are often analyzed using a statistical procedure called a t test.

A warning about statistical assumptions. Please note that, when you are deciding whether it is appropriate to analyze a data set with a t test, it is not sufficient to simply verify that the dependent variable is a multi-value variable and that the independent variable is a dichotomous variable. There are many statistical assumptions that must be satisfied for a t test to be appropriate, and those assumptions will not be discussed in this chapter (they will be discussed in the chapters on t tests). The type-of-variable figure was presented above to help you visualize the type of situation in which a t test is often performed. Each chapter of this guide that discusses an inferential statistical procedure (such as a t test) also will describe the assumptions that must be met in order for the test to be valid.

An experiment with three conditions. Now let’s modify the experiment somewhat, and observe how it changes the type-of-variable figure. Assume that you now have 150 subjects, and your independent variable now consists of three conditions, rather than just two:

• The 50 agents in experimental group #1 (the “difficult-goal condition”) are told to make 25 cold calls each week.

• The 50 agents in experimental group #2 (the “easy-goal condition”) are told to make 5 cold calls each week.

• The 50 agents in the control group (the “no-goal condition”) are not given any specific goals about the number of cold calls to make each week.
Assume that everything else about the study remains the same. That is, you use the same dependent variable, the number of values observed on the dependent variable still exceeds six, and so forth. If this were the case, you would illustrate this revised study with the following figure:

[Type-of-variable figure: Multi = Lmt]
Notice that “Multi” still appears to the left of the equals sign because your dependent variable has not changed. However, “Lmt” now appears to the right of the equals sign
because the independent variable now has three values rather than two. This means that the independent variable is now a limited-value variable, not a dichotomous variable. The preceding figure could be used to illustrate any experiment in which the dependent variable was a multi-value variable, and the independent variable was a limited-value variable. In Chapter 15, “One-Way ANOVA with One Between-Subjects Factor,” you will learn that data from this type of experiment can be analyzed using a statistical procedure called a one-way ANOVA (assuming that other assumptions are met; those assumptions will be discussed in Chapter 15).

A correlational study. Finally, let’s modify the experiment one more time, and observe how it changes the type-of-variable figure. This time you are interested in the same research hypothesis, but you are doing a nonexperimental study rather than an experiment. In this study, you will not manipulate an independent variable. Instead, you will simply measure two naturally occurring variables and will determine whether they are correlated in a sample of 200 insurance agents. The two variables are

• Goal difficulty. Each agent completes a scale that assesses the difficulty of the goals that the agent sets for himself/herself. With this scale, scores can range from 0 to 99, with higher scores representing more difficult goals. When you analyze the data, you find that this variable displays more than six values in this sample (i.e., you find that the agents get a wide variety of different scores).

• Amount of insurance sold. For each agent, you review records to determine how much insurance the agent has sold during the previous year. Assume that this variable also displays more than six observed values in your sample.
By analyzing your data, you want to determine whether there is a significant correlation between goal difficulty and the amount of insurance sold. You hope to find that agents who had high scores on goal difficulty also tended to have high scores on insurance sold. Because this is nonexperimental research, it is not appropriate to speak in terms of an independent variable and a dependent variable. Instead, you will refer to “goal difficulty” as a predictor variable, and “insurance sold” as a criterion variable. When preparing a type-of-variable figure for this type of study, the criterion variable should appear to the left of the equals sign, and the predictor variable should appear to the right.
The correlational study that was described above can be represented with the following type-of-variable figure:
[Type-of-variable figure: Multi = Multi]
The “Multi” appearing to the left of the equals sign represents the criterion variable in your study: amount of insurance sold. You knew that it was a multi-value variable because it displayed more than six values in your sample. The “Multi” appearing on the right of the equals sign represents the predictor variable in your study: scores on the goal difficulty scale. The preceding figure could be used to illustrate any correlational study in which the criterion variable and predictor variable were both multi-value variables. In Chapter 10 of this guide (“Bivariate Correlation”), you will learn that data from this type of study are often analyzed by computing a Pearson correlation coefficient (assuming that other assumptions are met).
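As a preview of Chapter 10, here is a minimal sketch of how such an analysis might be requested in SAS. The data set name D1 and the variable names GOALDIFF and SALES are hypothetical choices made for this illustration:

PROC CORR DATA=D1;
VAR GOALDIFF SALES;   /* requests the Pearson correlation between the two variables */
RUN;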
The Three Types of SAS Files

Overview

The purpose of this section is to provide a very general overview of the procedure that you will follow when you submit a SAS program and then interpret the results. To do this, the current section will present a short SAS program and briefly describe the output that it creates. Generally speaking, you will work with three types of files when you use the SAS System: one file will contain the SAS program, one will contain the SAS log, and one will contain the SAS output. The differences between these three types of files are discussed next.

The SAS Program

A SAS program consists of a set of statements written by the user. These statements provide the SAS System with the data to be analyzed, tell SAS about the nature of the data, and indicate which statistical analyses should be performed on the data. These statements are usually typed as data lines in a file in the computer’s memory.

Some fictitious data. This section will illustrate a simple SAS program that analyzes some fictitious data. Suppose that you have administered two tests (Test 1 and Test 2) to a group of eight people. Scores on a particular test can range from 0 to 9. Table 2.2 presents the scores that the eight subjects earned on Test 1 and Test 2.
Table 2.2
Scores Earned on Test 1 and Test 2
____________________________
Subject      Test 1   Test 2
____________________________
Marsha          2        3
Charles         2        2
Jack            3        3
Cathy           3        4
Emmett          4        3
Marie           4        4
Cindy           5        3
Susan           5        4
____________________________
The way that the information is arranged in Table 2.2 is representative of the way that information is arranged in most SAS data sets. Each vertical column (running from the top to the bottom) provides information about a different variable. The headings in Table 2.2 tell us that:

• The first column provides information about the “Subject” variable: It provides the first name for each subject.

• The second column provides information about the “Test 1” variable: It provides each subject’s score on Test 1.

• The third column provides information about the “Test 2” variable: It provides each subject’s score on Test 2.
In contrast, each horizontal row in the table (running from left to right) provides information about a different subject. For example,

• The first row provides information about the subject named Marsha. Where the row for Marsha intersects with the column headed Test 1, you can see that she obtained a score of “2” on Test 1. Where the row for Marsha intersects with the column headed Test 2, you can see that she obtained a score of “3” on Test 2.

• The second row provides information about the subject named Charles. Where the row for Charles intersects with the column headed Test 1, you can see that he obtained a score of “2” on Test 1. Where the row for Charles intersects with the column headed “Test 2,” you can see that he also obtained a score of “2” on Test 2.
The rows for the remaining subjects can be interpreted in the same way.
The SAS program. Suppose that you now want to analyze subject scores on the two tests. Specifically, you want to compute the means and some other descriptive statistics for Test 1 and Test 2. Following is a complete SAS program that enters the data presented in Table 2.2. It also computes means and some other descriptive statistics for Test 1 and Test 2.

OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEANS DATA=D1;
TITLE1 'JANE DOE';
RUN;

It will be easier to refer to the different components of the preceding program if we assign line numbers to each line. We will then be able to use these line numbers to refer to specific statements. Therefore, the program is reproduced again below, this time with line numbers added (remember that you would not actually type these line numbers if you were writing a program to be analyzed by the SAS System––the line numbers should already appear on your computer screen if you use the SAS windowing environment and follow the directions to be provided in Chapter 3 of this guide):

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3   INPUT TEST1 TEST2;
 4   DATALINES;
 5   2 3
 6   2 2
 7   3 3
 8   3 4
 9   4 3
10   4 4
11   5 3
12   5 4
13   ;
14   PROC MEANS DATA=D1;
15   TITLE1 'JANE DOE';
16   RUN;
This chapter does not discuss SAS programming statements in detail. However, the preceding program will make more sense to you if the functions of its various parts are briefly explained:

• Line 1 of the preceding program contains the OPTIONS statement. This is a global statement that can be used to modify how the SAS System operates. In this example, the OPTIONS statement is used to specify how large each page of SAS output should be when it is printed.

• Line 2 contains the DATA statement. You use this statement to start the DATA step (explained below) and assign a name to the data set that you are creating.

• Line 3 contains the INPUT statement. You use this statement to assign names to the variables that SAS will work with.

• Line 4 contains the DATALINES statement. This statement tells SAS that the data lines will begin with the next line of the program.

• Lines 5–12 are the data lines that will be read by SAS. You can see that these data lines were taken directly from Table 2.2: Line 5 contains scores on Test 1 and Test 2 from Marsha; line 6 contains scores on Test 1 and Test 2 from Charles, and so on. There are eight data lines because there were eight subjects. Obviously, the subjects’ names have not been included as part of the data set (although they can be included, if you choose).

• Line 13 is the “null statement.” It is very short, consisting of a single semicolon. This null statement tells SAS that the data lines have ended.

• Line 14 contains the PROC MEANS statement. It tells SAS to compute means and other descriptive statistics for all numeric variables in the data set.

• Line 15 contains the TITLE1 statement. You use this statement to assign a title, or heading, that will appear on each page of output. Here, the title will be “JANE DOE”.

• Finally, Line 16 contains the RUN statement that signals the end of the program.
Subsequent chapters will discuss the use of the preceding statements in much more detail.

What is the single most common programming error? For new SAS users, the most common programming error usually involves omitting a required semicolon (;). Remember that every SAS statement must end with a semicolon (in the preceding program, notice that the DATA statement ends with a semicolon, as does the INPUT statement and the PROC MEANS statement). When you obtain an error in running a SAS program, one of the first things that you should do is inspect the program for missing semicolons.
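To illustrate with a hypothetical fragment (not part of the program above): if the semicolon that should end the PROC MEANS statement is omitted, as shown below, SAS reads the TITLE1 line as a continuation of the PROC MEANS statement and reports a syntax error in the log.

PROC MEANS DATA=D1     /* the required semicolon is missing here */
TITLE1 'JANE DOE';
RUN;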
The DATA step versus the PROC step. There is another, more fundamental way to divide a SAS program into its constituent components. It is possible to think of each SAS program as consisting of a DATA step and a PROC step. Below, we show how the preceding program can be divided in this way:

DATA step:

   OPTIONS LS=80 PS=60;
   DATA D1;
      INPUT TEST1 TEST2;
   DATALINES;
   2 3
   2 2
   3 3
   3 4
   4 3
   4 4
   5 3
   5 4
   ;

PROC step:

   PROC MEANS DATA=D1;
   TITLE1 'JANE DOE';
   RUN;
The differences between these steps are described below. In the DATA step, programming statements create and/or modify a SAS data set. Among other things, statements in the DATA step may do the following (a brief sketch after this list illustrates the last two items):

• assign a name to the data set
• assign names to the variables to be included in the data set
• provide the actual data to be analyzed
• recode existing variables
• create new variables from existing variables.
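Here is a minimal sketch (not a program from this guide) showing how a DATA step might create a new variable and recode an existing one; the variable names TOTAL and HIGH1, and the data shown, are invented here purely for illustration:

   DATA D2;
      INPUT TEST1 TEST2;
      TOTAL = TEST1 + TEST2;          /* create a new variable from existing ones */
      IF TEST1 GE 4 THEN HIGH1 = 1;   /* recode TEST1 into a high/low flag        */
      ELSE HIGH1 = 0;
   DATALINES;
   2 3
   5 4
   ;
   PROC MEANS DATA=D2;
   RUN;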
In contrast to the DATA step, the PROC step includes statements that request specific statistical analyses of the data. For example, the PROC step might request that correlations be computed for all pairs of numeric variables, or it might request that a t test be performed. In the preceding example, the PROC step requested that means be computed.

What text editor will I use to write my SAS program? An editor is a computer application that allows you to create lines of text, such as the lines that constitute a SAS program. If you are working on a mainframe or midrange computer system, you might have a variety of editors that can be used to write your SAS programs; just ask the staff at your computer facility.
For many users, it is best to use the SAS windowing environment to write SAS programs. The SAS windowing environment is an integrated application that allows users to create and edit SAS programs, submit them for interactive analysis, view the results on their screens, manage files, and perform other activities. This application is available at most locations where the SAS System is installed (including personal computers). Chapter 3 of this guide provides a tutorial that shows you how to use the SAS windowing environment.

After submitting the SAS program. Once the preceding program has been submitted for analysis, SAS will create two types of files reporting the results of the analysis. One file is called the SAS log or log file, and the other file is the SAS output file. The following sections explain the purpose of these files.

The SAS Log

The SAS log is generated by SAS after you submit your program. It is a summary of notes and messages generated by SAS as your program executes. These notes and messages will help you verify that your SAS program ran correctly. Specifically, the SAS log provides

• a reprinting of the SAS program that was submitted (minus the data lines)
• a listing of notes indicating how many variables and observations are contained in the data set
• a listing of any notes, warnings, or error messages generated during the execution of the SAS program.
Log 2.1 provides a reproduction of the SAS log generated for the preceding program:

   NOTE: SAS initialization used:
         real time           14.54 seconds

   1    OPTIONS LS=80 PS=60;
   2    DATA D1;
   3       INPUT TEST1 TEST2;
   4    DATALINES;

   NOTE: The data set WORK.D1 has 8 observations and 2 variables.
   NOTE: DATA statement used:
         real time           1.59 seconds

   13   ;
   14   PROC MEANS DATA=D1;
   15   TITLE1 'JANE DOE';
   16   RUN;

   NOTE: There were 8 observations read from the dataset WORK.D1.
   NOTE: PROCEDURE MEANS used:
         real time           1.64 seconds

Log 2.1. Log file created by the current SAS program.
Notice that the statements constituting the SAS program have been assigned line numbers, which are reproduced in the SAS log. The data lines are not normally reproduced as part of the SAS log unless they are specifically requested. About halfway down the log, the following note appears:

   NOTE: The data set WORK.D1 has 8 observations and 2 variables.
This note indicates that the data set that you created (named D1) contains 8 observations and 2 variables. You would normally check this note to verify that the data set contains all of the variables that you intended to input (in this case 2), and that it contains data from all of your subjects (in this case 8). So far, everything appears to be correct.

If you had made any errors in writing the SAS program, there would also have been ERROR messages in the SAS log. Often, these error messages provide you with some help in determining what was wrong with the program. For example, a message can indicate that SAS was expecting a program statement that was not included. Chapter 3, “Tutorial: Using the SAS Windowing Environment to Write and Submit SAS Programs,” will discuss error messages in more detail, and will provide you with some practice in debugging a program with an error. Once the error or errors have been identified, you must revise the original SAS program and resubmit it for analysis. After processing is complete, again review the new SAS log to see if the errors have been eliminated.
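One quick check is worth emphasizing: compare the observation and variable counts reported in the log against what you expect. For example (an illustration based on the note shown earlier, not verbatim SAS output), if you had accidentally omitted one subject's data line, the note would instead read

   NOTE: The data set WORK.D1 has 7 observations and 2 variables.

signaling that data are missing even though no ERROR message appears.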
If the log indicates that the program ran correctly, you are free to review the results of the analyses in the SAS output file.

Very often you will submit a SAS program and, after a few seconds, the SAS output window will appear on your computer screen. Some users mistakenly assume that this means that their program ran without errors. But this is not necessarily the case. Very often some parts of your program will run correctly, but other parts will have errors. The only way to be sure is to carefully review all of the SAS log before reviewing the SAS output. Chapter 3 will lead you through these steps.

The SAS Output File

The SAS output file contains the results of the statistical analyses requested in the SAS program. An output file is sometimes called a “listing” file, because it contains a listing of the results of the analyses that were requested. Because the preceding program requested the MEANS procedure, the output file produced by this program will contain means, standard deviations, and some other descriptive statistics for the two variables. Output 2.1 presents the SAS output that would be produced by the preceding SAS program.

   JANE DOE                                                            1

   The MEANS Procedure

   Variable   N          Mean       Std Dev       Minimum       Maximum
   ---------------------------------------------------------------------
   TEST1      8     3.5000000     1.1952286     2.0000000     5.0000000
   TEST2      8     3.2500000     0.7071068     2.0000000     4.0000000
   ---------------------------------------------------------------------

Output 2.1. SAS output produced by PROC MEANS.
At the top of the output page is the name “JANE DOE.” This name appears here because “JANE DOE” was included in the TITLE1 statement of the program. Later, this guide will show you how to insert your name in the TITLE1 statement, so that your name will appear at the top of each of your output pages.

Below the heading “Variable,” SAS prints the names of the variables being analyzed. In this case, the variables are called TEST1 and TEST2. To the right of “TEST1,” you will find descriptive statistics for Test 1; statistics for Test 2 appear to the right of “TEST2.” Below the heading “N,” the number of valid observations being analyzed is reported. You can see that the SAS System analyzed eight observations for TEST1 and eight observations for TEST2.
The average score on each variable is reported under “Mean.” Standard deviations appear in the column labeled “Std Dev.” You can see that, for Test 1, the mean was 3.5 and the standard deviation was 1.1952. For Test 2, the corresponding figures were 3.25 and 0.7071. Below the headings “Minimum” and “Maximum,” you will find the lowest and highest scores observed for the two variables, respectively. Once you have obtained this output file from your analysis, you can review it on your computer monitor or print it on a printer. Chapter 3 will show you how to interpret your output.
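As a quick check on these figures, you can verify the mean of Test 1 by hand from the data lines shown earlier: the eight Test 1 scores are 2, 2, 3, 3, 4, 4, 5, and 5, so the mean is (2 + 2 + 3 + 3 + 4 + 4 + 5 + 5) / 8 = 28 / 8 = 3.50. Similarly, the eight Test 2 scores sum to 26, and 26 / 8 = 3.25, matching the values that PROC MEANS reports.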
Conclusion

This chapter has introduced you (or reintroduced you) to the terminology that is used by researchers in the behavioral sciences and education. With this foundation, you are now ready to learn about performing data analyses with SAS. The preceding section indicated that you must use some type of text editor to write SAS programs. For most users, it is advantageous to use the SAS windowing environment for this purpose. With the SAS windowing environment, you can write and submit SAS programs, view the results on your monitor, print the results, and save your SAS programs on a diskette––all from within one application. Chapter 3 provides a hands-on tutorial that shows you how to perform these activities within the SAS windowing environment.
Chapter 3: Tutorial: Writing and Submitting SAS Programs

Introduction ............................................................ 48
   Overview ............................................................. 48
   Materials You Will Need for This Tutorial ........................... 48
   Conventions and Definitions ......................................... 48
Tutorial Part I: Basics of Using the SAS Windowing Environment ........ 50
Tutorial Part II: Opening and Editing an Existing SAS Program ......... 75
Tutorial Part III: Submitting a Program with an Error ................. 94
Tutorial Part IV: Practicing What You Have Learned .................... 102
Summary of Steps for Frequently Performed Activities .................. 105
   Overview ............................................................. 105
   Starting the SAS Windowing Environment .............................. 105
   Opening an Existing SAS Program from a Floppy Disk .................. 106
   Finding and Correcting an Error in a SAS Program .................... 107
   Controlling the Size of the Output Page with the OPTIONS Statement . 109
For More Information ................................................... 110
Conclusion ............................................................. 110
Introduction

Overview

This chapter shows you how to use the SAS windowing environment—an application that enables you to create and edit SAS programs in a text editor, submit programs for execution, review and print the results of the analysis, and perform related activities. This chapter assumes that you are using the SAS System for Windows on an IBM-compatible computer. The tutorial in this chapter is based on Version 8 of SAS. If you are using Version 7 of SAS, you can still use the tutorial presented here (with some minor adjustments), because the interfaces for Version 7 and Version 8 are very similar. However, if you are using Version 6 of SAS, the interface that you are using is substantially different from the Version 8 interface.

The majority of this chapter consists of a tutorial that is divided into four parts. Part I shows you how to start the SAS windowing environment, create a short SAS program, save it on a 3.5-inch floppy disk, submit it for execution, and print the resulting SAS log and SAS output files. Part II shows you how to open an existing SAS file and edit it. Part III takes you through the steps involved in debugging a program with an error. Finally, Part IV gives you the opportunity to practice what you have learned. In addition, two short sections at the end of the chapter summarize the steps that are involved in frequently performed activities, and show you how to use the OPTIONS statement to control the size of your output page.

Materials You Will Need for This Tutorial

To complete this tutorial, you will need access to a computer on which the SAS System for Windows has been installed. You will also need at least one (and preferably two) 3.5-inch diskettes formatted for IBM-compatible computers (as opposed to Macintosh computers).

Conventions and Definitions

Here is a brief explanation of the computer-related terms that are used in this chapter:
• The ENTER key. Depending on the computer you are using, this key is identified by “Enter,” “Return,” “CR,” “New Line,” the symbol of an arrow looping backward (↵), or some other identifier. This key is equivalent to the return key on a typewriter.

• The backspace key. This is the key that allows you to delete text one letter at a time. The key is identified by the word “Backspace” or “Delete,” or possibly by an arrow pointing backward (←).

• Menus. This book uses the abbreviation “menu” for “pull-down menu.” A menu is a list of commands that you can access by clicking a word on the menu bar at the top of a window. For example, if you click the word File on the menu bar of the Editor window, the File pull-down menu appears (this menu contains commands for working with files, as you will see later in this chapter).

• The mouse pointer. The mouse pointer is a small icon that you move around the screen by moving your mouse around on its mouse pad. Different icons serve as the mouse pointer in different contexts, depending on where you are in the SAS windowing environment. Sometimes the mouse pointer is an arrow, sometimes it is an I-beam (I), and sometimes it is a small hand.

• The I-beam. The I-beam is a special type of mouse pointer. It is the small icon that looks like the letter “I” and appears on the screen when you are working in the Editor window. You move the I-beam around the screen by moving the mouse around on its mouse pad. If the I-beam appears at a particular location in a SAS program and you click the left button on the mouse, that point becomes the insertion point in the program; whatever you type will be inserted at that point.

• The cursor. The cursor is the flashing bar that appears in your SAS program when you are working in the Editor window. Anything you type will appear at the location of the cursor.

• Insert versus overtype mode. The Insert key toggles between insert mode and overtype mode. When you are in insert mode, the text to the right of the cursor will be pushed over to the right as you type. If you are in overtype mode, the text to the right of the cursor will disappear as you type over it.

• Pointing. When this tutorial tells you to point at an icon on your screen, it means to position the mouse pointer over that icon.

• Clicking. When this tutorial tells you to click something on your screen, it means to put the mouse pointer on that word or icon and click the button on your mouse one time. If your mouse has more than one button, click the button on the left.

• Double-clicking. When this tutorial tells you to double-click something on your screen, it means to put the mouse pointer on that word or icon and click the left button on your mouse twice in rapid succession. Make sure that your mouse does not move when you are clicking.
Tutorial Part I: Basics of Using the SAS Windowing Environment

Overview

This section introduces you to the basic features of the SAS windowing environment. You will learn how to start the SAS System and how to cycle between the three windows that you use in a typical session: the Editor window, the Log window, and the Output window. You will type a simple SAS program, save the program on a 3.5-inch floppy disk, and submit it for execution. Finally, you will learn how to review the SAS log and SAS output created by your program and how to print these files on a printer.

Starting the SAS System

Turn on your computer and monitor if they are not already on. If you are working in a computer lab at a university and your computer screen is blank, your computer might be in sleep mode. To activate it, press any key. After the computer has finished booting up (or waking up), the monitor displays its normal start-up screen. Figure 3.1 shows the start-up screen for computers at Saginaw Valley State University (where this book was written). Your screen will not look exactly like Figure 3.1, although it should have a gray bar at the bottom, similar to the gray bar at the bottom of Figure 3.1. On the left side of this bar is a button labeled Start.
Figure 3.1. The initial computer screen. (Photograph copyright © 2000 Saginaw Valley State University. The figure's callouts note that clicking the Start button displays a list of options that includes Programs, which in turn lists The SAS System for Windows V8.)
This is how you start the SAS System (try this now):

→ Use your mouse to move the mouse pointer on your screen to the Start button. Click this Start button once (click it with the left button on your mouse, if your mouse has more than one button). A menu of options appears.

→ In this list of options is the word “Programs.” Put the mouse pointer on Programs. This reveals a list of programs on the computer. One of these programs is “The SAS System for Windows V8.”

→ Put the mouse pointer on The SAS System for Windows V8 and select it (click it and release). This starts the SAS System. This process takes several seconds.

At your facility, the actual sequence for starting SAS might be different from the sequence described here. For example, it is possible that there is no item on your Programs menu labeled “The SAS System for Windows V8.” If this is the case, you should ask the lab assistant or your professor for guidance regarding the correct way to start SAS at your location.
When SAS is running, three windows appear on your screen: the Explorer window, the Log window, and the Editor window. Your screen should look something like Figure 3.2.

Figure 3.2. The initial SAS screen (closing the SAS Explorer window). (The figure shows the Log window and Explorer window in the top half of the screen, the Editor window at the bottom, and the close button used to close the Explorer window.)
The Five Basic SAS System Windows

After you start SAS, you have access to five SAS windows: the Editor, Log, Output, Explorer, and Results windows. Not all of these windows are visible when you first start SAS. Of these five windows, you will use only three of them to perform the types of analyses described in this book. The three windows that you will use are briefly described here:

• The Editor window. The Editor is a SAS program editor. It enables you to create, edit, submit, save, open, and print SAS programs. In a typical session, you will spend most of your time working within this window. When you first start SAS, the words “Editor - Untitled1” appear in the title bar for this window (the title bar is the bar that appears at the top of a window). After you save your SAS program and give it a name, that name appears in the title bar for the Editor window.

• The Log window. The Log window displays your SAS log after you submit a SAS program. The SAS log is a file generated by SAS that contains your SAS program (minus the data lines), along with a listing of notes, warnings, error messages, and other information pertaining to the execution of your program. In Figure 3.2, you can see that the Log window appears in the top half of the initial screen. The word “Log” appears in the title bar for this window.

• The Output window. The Output window contains the results of the analyses requested by your SAS program. Although the Output window does not appear in Figure 3.2, a later section shows you how to make it appear.

This book does not show you how to use the two remaining windows (the Explorer window and the Results window). In fact, for this tutorial, the first thing you should do each time you start SAS is to close these windows. This is not because these windows are not useful; it is because this book is designed to be an elementary introduction to SAS, and these two windows enable you to perform more advanced activities that are beyond the scope of this book. For guidance in using these more advanced features of the SAS windowing environment, see Delwiche and Slaughter (1998) and Gilmore (1997). The two windows that you will close each time you start SAS are briefly described here:

• The Explorer window. The Explorer window appears on the left side of your computer screen when you first start SAS (the word “Explorer” appears in its title bar; see Figure 3.2). It enables you to open files, move files, copy files, and perform other file management tasks. You can use it to create libraries of SAS files and to create shortcuts to files other than SAS files. The Explorer window is helpful when you are managing a large number of files or libraries.

• The Results window. The Results window also appears on the left side of your screen when you start SAS. It is hidden beneath the Explorer window, but you can see it after you close that window. The Results window lists each section of your SAS output in outline form. When you request many different statistical procedures, it provides a concise, easy-to-navigate listing of results. You can use the Results window to view, print, and save individual sections of output. The Results window is useful when you write a SAS program that contains a large number of procedures.
What If My Computer Screen Does Not Look Like Figure 3.2?

Your computer screen might not look exactly like the computer screen illustrated in Figure 3.2. For example, your computer screen might not contain one of the windows (such as the Editor window) that appears in Figure 3.2. There are a number of possible reasons for this, and it is not necessarily a cause for alarm. This chapter was prepared using Version 8 of SAS, so if you are using a later version of SAS, your screen might differ from the one shown in Figure 3.2. Also, the computer services staff at your university might have customized the SAS System, which would make it look different at startup. The only important consideration is this: Your SAS interface must be set up so that you can use the Editor window, Log window, and Output window. There is more than one way to achieve this. After you read the following sections, you will have a good idea of how to accomplish this, even if your screen does not look exactly like Figure 3.2.
The following sections show you how to close the two windows that you will not use, how to maximize the Editor window, and how to perform other activities that will help prepare the SAS windowing environment for writing and submitting simple SAS programs.

Closing the SAS Explorer Window

The SAS Explorer window appears on the left side of your computer screen (see Figure 3.2). At the top of this window is a title bar that contains the word “Explorer.” Your first task is to close this Explorer window to create more room for the SAS Editor, Log, and Output windows. In the upper-right corner of the Explorer window (on the title bar), there is a small box with an “X” in it. This is the close button for the Explorer window. At this time, complete the following step:

→ Put your mouse pointer on the close button for the Explorer window and click once (see Figure 3.2 for guidance; make sure that you click the close button for the Explorer window, and not for any other window). The Explorer window will close.

Closing the SAS Results Window

When the Explorer window closes, it reveals another window beneath it––the Results window. Your screen should look like Figure 3.3.

Figure 3.3. Closing the SAS Results window. (The figure identifies the close button used to close the Results window.)
Your next task is to close this Results window to create more room for the Editor, Log, and Output windows. In the upper-right corner of the Results window (on its title bar), there is a small box with an “x” in it. This is the close button for the Results window.

→ Put your mouse pointer on the close button for the Results window and click once (see Figure 3.3). The Results window will close.

Maximizing the Editor Window

After you close the Results window, the Log window and the Editor window expand to the left to fill the SAS screen. The Log window appears on the upper part of the screen, and the Editor window appears on the lower part. Your screen should look like Figure 3.4.

Figure 3.4. Maximizing the Editor window. (The figure identifies the maximize button used to expand the Editor window.)
As said earlier, the title bar for the Editor window is the bar that appears at the top of the window. In Figure 3.4, the title “Editor - Untitled1” appears on the left side of the title bar. On the right side of this title bar are three buttons (don’t click them yet):

• The minimize window button. If you click this button, the window will shrink and become hidden.

• The maximize window button. If you click this button, the window will become larger and fill the screen.

• The close window button. If you click this button, the window will close.
At this point, the Editor window and the Log window are both visible on your screen. A possible drawback to this arrangement is that both windows are so small that it is difficult to see very much in either window. For some SAS users it is easier to view one window at a time, allowing the “active” window to expand so that it fills the screen. With this arrangement, it is as if you have stacked the windows on top of one another, but you can see only the window that is in the foreground, on the “top” of the stack. This book shows you how to set up this arrangement. In order to rearrange your windows this way, complete the following step:

→ Using your mouse pointer, click the maximize window button for the Editor window. This is the middle button––the one that contains the square (see Figure 3.4). Be sure that you do this for the Editor window, not for the Log window or any other window.

When clicking the maximize button in the Editor window, take care that you do not click the close button (the button on the far right, marked with an “X”). If you close this window by accident, you can reopen it by completing the following steps (do not do this now unless you have closed your Editor window by mistake):

→ On the menu bar, put your mouse pointer on the word View and click. The View menu appears.

→ Select Enhanced Editor. Your Editor window should return to the SAS windowing environment. You can select it by using the Window menu (a later section shows you how). You can use this same procedure to bring back your Log and Output windows if you close them by accident.

Requesting Line Numbers and Other Options

There are a number of options that you can select to make SAS easier to use. One of the most important options is the “line numbers” option. If you request this option, SAS will automatically generate line numbers for the lines of the SAS programs that you write in the Editor. Having line numbers is useful because it helps you to know where you are in the program, and it can make it easier to copy lines of a program, move lines, and so forth. To request line numbers, you must first open the Enhanced Editor Options dialog box. Figure 3.5 shows you how to do this. Complete the following steps:

→ On the menu bar, select Tools. The Tools menu appears.

→ Select Options (and continue to hold the mouse button down). A pop-up menu appears.

→ Select Enhanced Editor and release the mouse button.
Figure 3.5. Requesting the Enhanced Editor Options dialog box. (The figure identifies the menu bar and shows the sequence: first select the Tools menu, then select Options, then select Enhanced Editor.)
The Enhanced Editor Options dialog box appears. This dialog box should be similar to the one in Figure 3.6.

Figure 3.6. Selecting appropriate options in the Enhanced Editor Options dialog box. (The figure's callouts direct you to verify that Show line numbers is selected, that None is selected in the Indentation section, and that Clear text on submit is not selected.)
There are two tabs for Enhanced Editor options: General options and Appearance options. In Figure 3.6, you can see that the General options tab has been selected. If General options has not been selected for the dialog box on your screen, click the General tab now to bring General options to the front.

A number of options are listed in this Enhanced Editor Options dialog box. For example, in the upper-left corner of the dialog box in Figure 3.6, you can see that two of the possible options are Allow cursor movement past end of line and Drag and drop text editing. If a check mark appears in the small box to the left of an option, it means that the option has been selected. If no check mark appears, the option has not been selected. If a box for an option is empty, you can click inside the box to select that option. If a box already has a check mark, you can click inside the box to deselect it and make the check mark disappear.

There are three settings that you should always review at the beginning of a SAS session. If you do not set these options as described here, SAS will still work, but your screen might not look like the screens displayed in this chapter. If your screen does not look correct or if you are having other problems with SAS, you should go to this dialog box and verify that your options are set correctly. Set your options as follows, at the beginning of your SAS session:

• Verify that Show line numbers is selected (that is, make sure that a check mark appears in the box for this option).

• In the box labeled Indentation, verify that None is selected.

Here is the one option that should not be selected:

• Verify that Clear text on submit is not selected (that is, make sure that a check mark does not appear in the box for this option).

Figure 3.6 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that these three options are set properly (you can disregard the other options). When all options are correct, complete this step:

→ Click the OK button at the bottom of the dialog box. This returns you to the Editor window.

A single line number (the number “1”) appears in the upper-left corner of the Editor window. As you begin typing the lines of a SAS program, SAS will automatically generate new line numbers. A later section of this tutorial provides you with a specific SAS program to type. First, however, you will learn more about the menu bar and the Window menu.
The Menu Bar

Figure 3.7 illustrates what your screen should look like at this time. The Editor window should now be enlarged and fill your screen. Toward the top of this window is the menu bar. The menu bar lists all of the menus that you can access while in this window: the File menu, the Edit menu, the View menu, the Tools menu, the Run menu, the Solutions menu, the Window menu, and the Help menu. You will use these menus to access commands that enable you to edit SAS programs, save them on diskettes, submit them for execution, and perform other activities.

Figure 3.7. The Editor window with line numbers. (The figure identifies the menu bar, the first line number, and the Window menu used to change windows.)
Using the Window Menu

At this point, the Editor window should be enlarged and in the foreground. During a typical session, you will jump back and forth frequently between the Editor window, the Log window, and the Output window. To do this, you will use the Window menu. In order to bring the Log window to the front of your stack, perform the following steps:

→ Go to the menu bar (at the top of your screen).

→ Put your mouse pointer on the word Window and click; this pulls down the Window menu and lists the different windows that you can select.

→ In this menu, put your mouse pointer on the word Log and then release the button on the mouse. When you release the button, the (empty) Log window comes to the foreground. Notice that the words “SAS - [Log - (Untitled)]” now appear in the title bar at the top of your screen.

To bring the Output window to the foreground, complete the following steps:

→ Go to the menu bar at the top of your screen.

→ Pull down the Window menu.

→ Select Output. The (empty) Output window comes to the front of your stack. Notice that the words “SAS - [Output - (Untitled)]” now appear on the title bar at the top of your screen.

To go back to the Editor window, complete these steps:

→ Go to the menu bar at the top of your screen.

→ Pull down the Window menu.

→ Select Editor. The Editor window comes to the foreground.

If your Editor window is not as large as you would like, you can enlarge it by clicking the bottom right corner of the window and dragging it down and to the right. When you put your mouse pointer in this corner, make sure that the double-headed arrow appears before you click and drag.

A More Concise Way of Illustrating Menu Paths

The preceding section showed you how to follow a number of different menu paths. A menu path is a sequence in which you pull down a menu and select one or more commands from that menu. For example, you are following a menu path when you go to the menu bar, pull down the Window menu, and select Editor.
In the preceding section, menu paths were illustrated by listing each step on a separate line. This was done for clarity. However, to conserve space, the remainder of this book will often list an entire menu path on a single line. Here is an example:

→ Window → Editor

The preceding menu path instructs you to go to the menu bar at the top of the screen, pull down the Window menu, and select Editor. Obviously, this is the same sequence that was described earlier, but it is now being presented in a more concise way. When possible, the remainder of this book will use this abbreviated form for specifying menu paths.

Typing a Simple SAS Program

In this section, you will prepare and submit a short SAS program. Before doing this, make sure that the Editor window is the active window (that is, it is in front of other windows). You know that the Editor is the active window if the title bar at the top of your screen includes the words “SAS - [Editor - Untitled1].” If it is not the active window, use the Window menu to bring it to the front, as described earlier. Your cursor should now be in position to begin typing your SAS program (if it is not, use your mouse or the arrow keys on your keyboard to move the cursor down to the now-empty lines where your program will be typed). Keep these points in mind as you type your program:
• Do not type the line numbers that appear to the left of the program (that is, the numbers 1, 2, 3, and so forth, that appear on the left side of the SAS program). These line numbers are automatically generated by the SAS System as you type your SAS program.

• The lines of your SAS program should be left-justified (that is, begin at the left side of the window). If your cursor is in the wrong location, use your arrow keys to move it to the correct location.

• You can type SAS statements in uppercase letters or in lowercase letters––either is acceptable.

• If you make an error, use the backspace key to correct it. This key is identified by the word “Backspace” or “Delete,” or possibly by an arrow pointing backward (←).

• Be sure to press ENTER at the end of each line in the program. This moves you down to the next line.
Type this program:

   1    OPTIONS LS=80 PS=60;
   2    DATA D1;
   3       INPUT TEST1 TEST2;
   4    DATALINES;
   5    2 3
   6    2 2
   7    3 3
   8    3 4
   9    4 3
   10   4 4
   11   5 3
   12   5 4
   13   ;
   14   PROC MEANS DATA=D1;
   15   TITLE1 'type your name here';
   16   RUN;
Some notes about the preceding program:

• The line numbers on the far left side of the program (for example, 1, 2, and 3) are generated by the SAS Editor. Do not type these line numbers. When you get to the bottom of the screen, the Editor continues to generate new lines for you each time you press ENTER.

• In this book some lines in the programs are indented (such as line 3 in the preceding program). You are encouraged to use indention in the same way. This is not required by the SAS System, but many programmers use indention in this way to keep sections of the program organized meaningfully.

• Line 15 contains the TITLE1 statement. You should type your first and last names between the single quotation marks in this statement. By doing this, your name will appear at the top of your printout when you print the results of the analysis. Be sure that your single quotation marks are balanced (that is, you have one to the left of your name and one to the right of your name before the semicolon). If you leave off one of the single quotation marks, it will cause an error (see the example that follows this list).
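To illustrate the point about balanced quotation marks, here is a hypothetical pair of TITLE1 statements (the name shown is just a placeholder):

   TITLE1 'JANE DOE';   /* correct: both single quotation marks are present */
   TITLE1 'JANE DOE;    /* error: the closing quotation mark is missing     */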
Scrolling through Your Program

The program that you have just typed is so short that you can probably see all of it on your screen at one time. However, very often you will work on longer programs that extend beyond the bottom of your screen. In these situations, it is necessary to scroll (move) through your program so that you can see the hidden parts. There are a variety of approaches that you can use to scroll through a file, and Figure 3.8 illustrates some of these approaches.

Figure 3.8. How to scroll up and down. (The figure identifies the up-arrow and down-arrow, which move one line at a time, and the scroll bar, which you can click and drag to move quickly through large files.)
Here is a brief description of ways to scroll through a file:

• You can press the Page Up and Page Down keys on your keyboard.

• You can click and drag the scroll bar that appears on the right side of your Editor window. Drag it down to see the lower sections of your program, drag it up to see the earlier sections (see Figure 3.8).

• You can click the area that appears just above or below the scroll bar to move up or down one screen at a time.

• You can click the up-arrow and down-arrow in the scroll bar area to move up or down one line at a time.

You can use these same techniques to scroll through any type of SAS file, whether it is a SAS program, a SAS log, or a SAS output file.

SAS File Types

After your program has been typed correctly, you should save it. This is done with the File menu and the Save As command (see the next section). The current section discusses some conventions that are followed when naming SAS files.

Most files, including SAS files, have two-part names consisting of a root and a suffix. The root is the basic file name that you make up yourself. For example, if you are analyzing data from a sample of subjects who are Republicans, you might want to give the file a root name such as REPUB. If you are using a version of Windows prior to Windows 95 (such as Windows 3.1), the root part of the file name must begin with a letter, must be no more than eight characters in length, and must not contain any spaces or special characters (for example, #@“:?>;*). If you are using Windows 95 or later, the root part of the file name can be up to 255 characters in length, and it can contain spaces and some special characters. It cannot contain the following characters: * | \ / ; : ? “ < >. (Some examples of valid and invalid names appear after the list of extensions below.)

The file name extension indicates what type of file you are working with. The extension immediately follows the root, begins with a period, and is three letters long, such as .LST, .SAS, .LOG, or .DAT. Therefore, a complete file name might appear as REPUB.LST or REPUB.SAS. The extensions are described here:
root.SAS
   This file contains the SAS program that you write. Remember that a SAS program is a set of statements that causes the SAS System to read data and perform analyses on the data. If you want to include the data as part of the program in this file, you can.

root.LOG
   This file contains your SAS log: a file generated by SAS that includes notes, warnings, error messages, and other information pertaining to the execution of your program.

root.LST
   This file contains your SAS output: the results of your analyses as generated by SAS. The “LST” is an abbreviation for “Listing.”

root.DAT
   This file is a raw data file, a file containing raw data that are to be read and analyzed by SAS. You use a file like this only if the data are not already included in the .SAS file that contains the SAS program. This book does not illustrate the use of the .DAT file, because with each of the SAS programs illustrated here, the data are always included as part of the .SAS file.
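Putting the naming rules and extensions together, here are a few illustrative names (these are invented examples, not files used in this guide): REPUB.SAS and GRADES.SAS are valid under any version of Windows; MY GRADES.SAS is valid under Windows 95 or later but not under Windows 3.1, because it contains a space; and TEST?.SAS is not valid under either system, because it contains a question mark.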
Saving Your SAS Program on Floppy Disks versus Other Media

This book shows you how to save your SAS programs on 3.5-inch floppy disks. This is because it is assumed that most readers of this book are university students and that floppy disks are the media that are most readily available to students. It is also possible to save your SAS programs in a variety of other ways. For example, in some courses, students are instructed to save their programs on their computers’ hard drives. In other courses, students are told to save their programs on Zip disks or some other removable media. If you decide to save your programs on a storage medium other than a 3.5-inch floppy disk, ask your lab assistant or professor for guidance.
Saving Your SAS Program for the First Time on a Floppy Disk

To save your SAS program on a floppy disk, make sure that a 3.5-inch high-density IBM PC-formatted disk has been inserted in drive “A” on your CPU (this book assumes that drive “A” is the floppy drive). Also make sure that the Editor window containing your program is the active window (the one currently in the foreground). Then make the following selections:

→ File → Save As

You will see a Save As dialog box on your screen. This dialog box contains smaller boxes with labels such as Save in, File name, and so forth. The dialog box should resemble Figure 3.9.

Figure 3.9. The initial Save As dialog box. (The figure identifies the down arrow that lists other locations where you can save your file, the area where names of files and folders might appear, and the File name box where you type the name that you want to give to your file.)
The first time you save a program, you must tell the computer which drive your diskette is in and provide a name for your file. In this example, suppose that your default destination is a folder named “V8” and that you do not want to save your file in this folder (on your computer, the default folder might have a different name). Suppose you want to save it on a floppy disk instead. With most computers, this is done in the computer’s “A” drive. It is therefore necessary to change your computer’s drive, as illustrated here.
To change the location where your file will be saved, complete these steps:

→ Click the down arrow on the right side of the Save in box; from there you can navigate to the location where you want to save your file.

→ Scroll up and down until you see 3 1/2 Floppy (A:).

→ Select 3 1/2 Floppy (A:). 3 1/2 Floppy (A:) appears in the Save in box.

Now you must name your file. Complete the following steps:

→ Click inside the box labeled File name. Your cursor appears inside this box (see Figure 3.9).

→ Type the following name inside the File name box: DEMO.SAS.

With this done, your Save As dialog box should resemble the completed Save As dialog box that appears in Figure 3.10.

Figure 3.10. The completed Save As dialog box. (The figure identifies the Save button that you click when ready.)
After you have completed these tasks, you are ready to save the SAS program on the floppy disk. To do this, complete the following step:

→ Click the Save button (see Figure 3.10).

After clicking Save, a small light next to the 3.5-inch drive will light up for a few seconds, indicating that the computer is saving your program on the diskette. When the light goes off, your program has been saved under the name DEMO.SAS.
Where the Name of the SAS Program Will Appear

After you use the Save As command to name a SAS program within the Editor, the name of that file will appear on the left side of the title bar for the Editor window (remember that the title bar is the bar that appears at the top of a window). For example, if you look on the left side of the title bar for the current Editor window, you will see that it no longer contains the words “SAS - [Editor - Untitled1]” as it did before. Instead, this location now contains the words “SAS - [DEMO].” Similarly, whenever you pull down the Window menu during this session, you will see that it still contains items labeled “Log” and “Output,” but that it no longer contains “Editor - Untitled1.” Instead, this label has been replaced with “Demo,” the name that you gave to your SAS file.

Saving a File Each Subsequent Time on a Floppy Disk

The first time you save a new file, you should use the Save As command and give it a specific name as you did earlier. Each subsequent time you save that file (during the same session) you can use the Save command, rather than the Save As command. To use the Save command, verify that your Editor window is active and make the following selections:

→ File → Save

Notice that when you save the file a second time using the Save command in this way, you do not get a dialog box. Instead, the file is saved again under the same name and in the same location that you selected earlier.

Save Your Work Often!

Sometimes you will work on a single SAS program for a long period of time. On these occasions, you should save the file once every 10 minutes or so. If you do not do this, you might lose all of the work you have done if the computer loses power or becomes inoperative during your session. However, if you have saved your work frequently, you will be able to reopen the file, and it will appear the way that it did the last time you saved it.

Submitting the SAS Program for Execution

So far you have created your SAS program and saved it as a file on your diskette. It is now time to submit it for execution.
There are at least two ways to submit SAS programs. The first way is to use the Run menu. The Run menu is identified in Figure 3.11.

Figure 3.11. How to submit a SAS program for execution. (The figure shows that one way to submit a SAS program is to click the Run menu and then select Submit; another way is to click the Submit button on the toolbar, that is, the “running person” icon.)
To submit a SAS program by using the Run menu, make sure that the Editor is in the foreground, and then make the following selections (go ahead and try this now):

→ Run → Submit

The second way to submit a SAS program is to click the Submit button on the toolbar. This is a row of buttons below the menu bar for the Editor window. These buttons provide shortcuts for performing a number of activities. One of these buttons is identified with the icon of a running person (see Figure 3.11). This is the Submit button. To submit your program using the toolbar, you would do the following (do not do this now): Put your mouse pointer on the running person icon, and click it once. This submits your program for execution (note that this button is identified with a running person because after you click it, your program will be running).
What Happens after Submitting a Program

In the preceding section, you submitted your SAS program for execution. When you submit a SAS program, it disappears from the Editor window. While the program is executing, a message appears above the menu bar for the Editor, and this message indicates which PROC (SAS procedure) is running. It usually takes only a few seconds for a typical SAS program to execute. Some programs might take longer, however, depending on the size of the data set being analyzed, the number of procedures being requested, the speed of the computer’s processor, and other factors. After you submit a SAS program and it finishes executing, typically you will experience one of three possible outcomes:

• Outcome 1: Your program runs perfectly and without any errors. If this happens, the Editor window will disappear and SAS will automatically bring the Output window to the foreground. The results of your analysis will appear in this Output window.

• Outcome 2: Part of your program runs correctly, and part of it has errors. If this happens, it is still possible that your Editor window will disappear and the Output window will come to the foreground. In the Output window, you will see the results from those sections of the program that ran correctly. This outcome can be misleading, however, because if you are not careful, you might never realize that part of your program had errors and did not run.

• Outcome 3: Your program has errors and no results are produced. If this happens, the Output window will never appear; you will see only the Editor window.
Outcome 2 can mislead you into believing that there were no problems with your SAS program when there may, in fact, have been problems. The point is this: after you submit a SAS program, even if SAS brings up the Output window, you should always review all of the log file prior to reviewing the output file. This is the only way to be sure you have no errors or other problems in your program.

Reviewing the Contents of Your Log File

The SAS log is a file generated by SAS that contains the program you submitted (minus the data lines), along with notes, warnings, error messages, and other information pertaining to the execution of your program. It is important to always review your log file prior to reviewing your output file to verify that the program ran as planned. If your program ran, the Output window is probably in the foreground now. If your program did not run, the Editor window is probably in the foreground. In either case, you now need to bring the Log window to the foreground. To do so, make the following selections:

→ Window → Log
Your Log window is now your active window. In many cases, your SAS log will be fairly long and only the last part of it will be visible in the Log window. This is a problem, because it is best to begin at the beginning of your log and review it from beginning to end. If you are currently at the end of your log file, click the scroll bar on the right side of the Log window, drag it up, and release it. The beginning of your log file is now at the top of your screen. Scroll through the log and verify that you have no error messages. If your program executed correctly, your log file should look something like Log 3.1 (notice that the SAS log contains your SAS program minus the data lines).

   NOTE: SAS initialization used:
         real time           14.54 seconds

   1    OPTIONS LS=80 PS=60;
   2    DATA D1;
   3       INPUT TEST1 TEST2;
   4    DATALINES;

   NOTE: The data set WORK.D1 has 8 observations and 2 variables.
   NOTE: DATA statement used:
         real time           1.59 seconds

   13   ;
   14   PROC MEANS DATA=D1;
   15   TITLE1 'JANE DOE';
   16   RUN;

   NOTE: There were 8 observations read from the dataset WORK.D1.
   NOTE: PROCEDURE MEANS used:
         real time           1.64 seconds

Log 3.1. Log file created by the current SAS program.
What Do I Do If My Program Did Not Run Correctly?

If your program ran correctly and there are no error messages in your SAS log, you should skim this section and then skip to the following section (titled “Printing the Log File on a Printer”). If your program did not run correctly, you should follow the instructions provided here. If there is an error message in your log file, or if for any other reason your program did not run correctly, you have two options:

• If there is a professor or a lab assistant available, ask this person to help you debug the program and resubmit it. After the program runs correctly, continue with the next section.

• If there is no professor or lab assistant available, go to the section titled “Tutorial Part III: Submitting a Program with an Error,” which is in the second half of this chapter. That section shows you how to correct and resubmit a program. Use the guidelines provided there to debug your program and resubmit it. When your program is running correctly, continue with the next section below.

Printing the Log File on a Printer

To get a hardcopy (paper copy) of your log file, the Log window must be in the foreground. If necessary, make the following selections to bring your Log window to the front:

→ Window → Log

Then select

→ File → Print

This gives you the Print dialog box, which should look something like Figure 3.12.

Figure 3.12. The Print dialog box. (The figure identifies the OK button that you click to print your file.)
Assuming that all of the settings in this dialog box are acceptable, you can print your file by completing the following step (do this now):

→ Click the OK button at the bottom of this dialog box.

Your log file should print. Go to the printer and pick it up. If other people are using the SAS System at this time, make sure that you get your own personal log file and not a log file created by someone else (your log file should have your name in the TITLE1 statement, toward the bottom).

Reviewing the Contents of Your Output File

If your program ran correctly, the results of the analysis can be viewed in the Output window. To review your output, bring the Output window to the foreground by making the following selections:

→ Window → Output

Your Output window is in the foreground. If you cannot see all of your output, it is probably because you are at the bottom of the output file. If this is the case, scroll up to see all of the output page. If your program ran correctly (and you keyed your data correctly), the output should look something like Output 3.1.

   JANE DOE                                                            1

   The MEANS Procedure

   Variable   N          Mean       Std Dev       Minimum       Maximum
   ---------------------------------------------------------------------
   TEST1      8     3.5000000     1.1952286     2.0000000     5.0000000
   TEST2      8     3.2500000     0.7071068     2.0000000     4.0000000
   ---------------------------------------------------------------------

Output 3.1. SAS output produced by PROC MEANS.
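Incidentally, if you ever want descriptive statistics for only some of the variables in a data set, PROC MEANS accepts a VAR statement that lists the variables to analyze. The following is a minimal sketch, not a step in this tutorial:

   PROC MEANS DATA=D1;
      VAR TEST1;                    /* restrict the analysis to TEST1 */
   TITLE1 'type your name here';
   RUN;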
Printing Your SAS Output

Now you should print your output file on a printer. You do this by following the same procedure used to print your log file. Make sure that your Output window is in the foreground and make the following selections:

→ File → Print

This gives you the Print dialog box. If all the settings are appropriate, complete this step:

→ Click the OK button at the bottom of the Print dialog box.

Your output file should print. When you pick it up, verify that you have your own output and not the output file created by someone else (your name should appear at the top of the output if you typed your name in the TITLE1 statement, as you were directed).
Clearing the Log and Output Windows After you finish an analysis, it is a good idea to clear the Log and Output windows prior to doing any subsequent analyses. If you perform subsequent analyses, the new log and output files will be appended to the bottom of the log and output files created by earlier analyses. To clear the contents of your Output window, make sure that the Output window is in the foreground and make the following selections: Î Edit Î Clear All The contents of the Output window should disappear from the screen. Now bring the Log window to the front: Î Window Î Log Clear its contents by clicking Î Edit Î Clear All Returning to the Editor Window Suppose that you now want to modify your SAS program by adding new data to the data set. Before doing this, you must bring the Editor to the foreground. An earlier section warned you that after you save a SAS program, the word “Editor” will no longer appear on the Window menu. In its place, you will see the name that you gave to your SAS program. In this session, you gave the name “DEMO.SAS” to your SAS program. This means that, in the Window menu, you will now find the word “Demo” where “Editor” used to be. To bring the Editor to the foreground, you should select “Demo.” Î Window Î Demo The Editor window containing your SAS program now appears on your screen. What If the Editor Window Is Empty? When you bring the Editor window to the foreground, it is possible that your SAS program has disappeared. If this is the case, it might be because you did not set the Enhanced Editor Options in the way described earlier in this chapter. Figure 3.6 showed how these options should be set. Toward the bottom of the Enhanced Editor Options dialog box, one of the options is “Clear text on submit.” The directions provided earlier indicated that this option should not be selected (that is, the box for this option should not be checked). If your SAS program disappeared after you submitted it, it might be because this box was checked. If this is the case, go to the Enhanced Editor Options dialog box now and verify that this
option is not selected. You should deselect it if it is selected (see the previous section titled “Requesting Line Numbers and Other Options” for directions on how to do this). If your SAS program has disappeared from the Editor window, you can retrieve it easily by using the Recall Last Submit command. If this is necessary, verify that your Editor window is in the foreground and make the following selections (do this only if your SAS program has disappeared): Î Run Î Recall Last Submit Your SAS program reappears in the Editor window. Saving Your SAS Program on a Diskette (Again) At the end of a SAS session, you will save the most recent version of your SAS program on a diskette. If you do this, you will be able to open up this most recent version of the program the next time you want to do some additional analyses. You will now save the program on the diskette in drive “A.” Because this is not the first time you have saved it this session, you can use the Save command rather than the Save As command. Verify that your Editor window is the active window (and that your program is actually in this window), and make the following selections: Î File Î Save Ending Your SAS Session You can now end your SAS session by selecting Î File Î Exit A dialog box appears with the message, “Are you sure you want to end the SAS session?” Click OK to end the SAS session.
Tutorial Part II: Opening and Editing an Existing SAS Program Overview This section shows you how to open the SAS program that you have saved on a floppy disk. It also shows you how to edit an existing program: how to insert new lines, delete lines, copy lines, and perform other activities that are necessary to modify a SAS program. Restarting SAS Often you will want to open a file (on a diskette) that contains a SAS program that you created earlier. This section and the three to follow show you how to do this. Verify that your computer and monitor are turned on and are not in sleep mode. Then complete the following steps: Î Click the Start button that appears at the bottom of your initial screen. This displays a list of options, including the word Programs. Î Select Programs. This reveals a list of programs on the computer. One of them is The SAS System for Windows V8. Î Select (click) The SAS System for Windows V8. This starts the SAS System. After a few seconds, you will see the initial SAS screen, which contains the Explorer window, the Log window, and the Editor window. Your screen should look something like Figure 3.13.
First, click the close button for the Explorer window, as well as for the Results window, which lies beneath it.
Next, click the maximize button for the Editor window.
Figure 3.13. Modifying the initial SAS screen.
Modifying the Initial SAS System Screen Before opening an existing SAS file, you need to modify the initial SAS System screen so that it is easier to work with. Complete the following steps. Close the Explorer window: Î Click the close window button for the Explorer window (the ✕ button in the upper-right corner of the Explorer window; see Figure 3.13). This reveals the Results window, which was hidden beneath the Explorer window. Now close the Results window: Î Click the close window button for the Results window (the ✕ button in the upper-right corner of the Results window). The remaining visible windows now expand to fill the screen. Your screen should contain only the Log window (at the top of the screen) and the Editor window (at the bottom). Remember that the Editor window is identified by the words “Editor - Untitled1” in its title bar. To maximize the Editor window, complete the following step: Î Click the maximize window button for the Editor window (the middle button, which contains a square: □). The Editor window expands and fills your screen.
Setting Line Numbers and Other Options To change the settings for line numbers and other options, use the Enhanced Editor Options dialog box. From the Editor’s menu bar, make the following selections: Î Tools Î Options Î Enhanced Editor This opens the Enhanced Editor Options dialog box (see Figure 3.14). Verify that Show line numbers is selected.
In the Indentation section, verify that None is selected.
Verify that Clear text on submit is not selected.
Figure 3.14. Verifying that appropriate options are selected in the Enhanced Editor Options dialog box.
If you began Part II of this tutorial immediately after completing Part I (and if you are working at the same computer), the options that you selected in Part I should still be selected. However, if you have changed computers, or if someone else has used SAS on your computer since you used it, your options might have been changed. For that reason, it is always a good idea to check at the beginning of each SAS session to ensure that your Editor options are correct. As explained in Part I, the Enhanced Editor Options dialog box consists of two components: General options and Appearance options. The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that the General options are at the front of the stack (that is, that the General options are visible), and click the tab labeled General if they are not.
The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).
• In the box labeled Indentation, verify that None is selected.
This option should not be selected:
• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).
Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct, Î Click the OK button at the bottom of the dialog box. This returns you to the Editor window. A single line number (the number 1) appears in the upper left corner of the Editor window. Reviewing the Names of Files on a Floppy Disk and Opening an Existing SAS Program Earlier, you saved your SAS program on a 3.5-inch floppy disk in drive A. This section shows you how to open this program in the Editor. To begin this process, verify that your floppy disk is in drive A and that the Editor window is the active window. Then make the following selections: Î File Î Open The Open dialog box appears on your screen. The Open dialog box contains a number of smaller boxes with labels such as Look in, File name, and so forth. It should look something like Figure 3.15.
Click this down arrow to display a list of other possible locations where SAS can look for your file.
After you have selected the correct Look in location, the names of your files should appear in this window.
Figure 3.15. The Open dialog box.
Toward the top of the Open dialog box is a box labeled Look in (see Figure 3.15). This box tells SAS where it should look to find a file. The default is to look in a folder named “V8” on your hard drive. You know that this is the case if the Look in box contains the icon of a folder and the word “V8” (although the default location might be different on your computer). You have to change this default so that SAS will look on your 3.5-inch floppy disk to find your program file. To accomplish this, complete the following steps: Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where SAS can look (see Figure 3.15). Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:). Î Click the entry that reads 3 1/2 Floppy (A:). 3 1/2 Floppy (A:) appears in the Look in box. When SAS searches for files, it will look on your floppy disk.
The contents of your disk now appear below the Look in box. One of these files is “DEMO,” the file that you need to open. Remember that when you first saved your file, you gave it the full name “DEMO.SAS.” However, this “.SAS” extension does not appear on the SAS program files that appear in the Open dialog box. To open this file, complete the following steps: Î Click the file named DEMO. The name DEMO appears in the File name box below the large box. Î Click the Open button. The SAS program that you saved under the name DEMO.SAS appears in the Editor window. You can now modify it and submit it for execution. What If I Don’t See the Name of My File on My Disk? In the Open dialog box, if you don’t see the name of a file that you know is on your diskette, there are several possible reasons. The first thing you should do is verify that you are looking in drive A, and that your floppy disk has been inserted in the drive. If everything appears to be correct with the Look in box, the second thing you should do is review the box labeled Files of type, which also appears in the Open dialog box. Verify that the entry inside this box is SAS Files (*.sas) (see Figure 3.15). If this entry does not appear in this box, it means that SAS is looking for the wrong types of files on your disk. Click the down arrow on the right side of the Files of type box to reveal other options. Select SAS Files (*.sas), and then check the window to see if DEMO now appears there. Another possible solution is to instruct SAS to list the names of all files on your disk, regardless of type. To do this, go to the box labeled Files of type. On the right side of this Files of type box is a down arrow. Click the down arrow to reveal different options. Select the entry that says All Files (*.*). This reveals the names of all the files on your disk, regardless of the format in which they were saved.
General Comments about Editing SAS Programs The following sections show you how to edit an existing SAS program. Editing a SAS program involves modifying it in some way: inserting new lines, copying lines, moving lines, and so forth. Keep the following points in mind as you edit files:
• The Undo command. The Undo command allows you to undo (reverse) your most recent editing action. For example, assume that you select (highlight) a large section of your SAS program, and then you accidentally delete it. It is possible to use the Undo command and return your program to its prior state. When you make a mistake, you can undo your most recent editing action (do not select this now; this is for illustration only): Î Edit Î Undo This returns your program to the state it was in prior to the most recent editing action, whether that action involved deleting, copying, cutting, or some other activity. You can select the Undo command multiple times in a row. This allows you to undo a sequence of changes that you have made since the last time the Save command was used.
• Using the arrow keys. Somewhere on your keyboard are keys marked with directional arrows such as ↑←↓→. They enable you to move your cursor around the SAS program. When you want to move your cursor to a lower line on a program that you have already written, you generally use the down arrow key (↓) rather than the ENTER key. This is because, when you press the ENTER key, it creates a new blank line in the SAS program as it moves your cursor down. Thus, you should use the ENTER key only when you want to create new lines; otherwise, rely on the arrow keys.
The following sections show you how to perform a number of editing activities. It is important that you perform these editing functions as you move through the tutorial. As you read each section, you should modify your SAS program in the same way that the SAS program in the book is being modified.
Inserting a Single Line in an Existing SAS Program When editing a SAS program in the Editor, you might want to insert a single line between two existing lines as follows:
• Place the cursor at the end of the line that is to precede the new line.
• Press the ENTER key once.
For example, suppose that you want to insert a new line after line 4 in the following program. To do this, you must first place the cursor at the end of that line (that is, at the end of the DATALINES statement that appears on line 4). Complete the following steps: Î Use the mouse to place the I-beam at the end of the DATALINES statement on line 4. Î Click once. The flashing cursor appears at the point where you clicked (if you missed and the cursor is not in the correct location, use your arrow keys to move it to the correct location).

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES; ▌
5   2 3
6   2 2
7   3 3

Î Press ENTER. A new blank line is inserted between existing lines 4 and 5, as shown here:
1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ▌
6   2 3
7   2 2
8   3 3
Your cursor (the insertion point) is now in column 1 of the new line you have created. You can now type a new data line in the blank line. Complete the following step: Î Type the numbers 6 and 7 on line 5, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3
Let’s do this one more time. Again, insert a new blank line after the DATALINES statement: Î Use the mouse to place the I-beam at the end of the DATALINES statement on line 4. Î Click once to place the insertion point there. Î Press the ENTER key once. This gives you a new blank line. Î Now type the numbers 8 and 9 in the new line that you have created, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   8 9
6   6 7
7   2 3
8   2 2
9   3 3
Inserting Multiple Lines You follow essentially the same procedure to insert multiple lines, with one exception: After you have positioned your insertion point, you press ENTER multiple times rather than one time. For example, complete these steps to insert three new lines between existing lines 4 and 5: Î Use the mouse to place the I-beam at the end of the DATALINES statement on line 4. Î Click once to place the insertion point there. Î Press ENTER three times. This gives you three new blank lines, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ▌
6
7
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Next, you will type some new data on the three new lines you have created, starting with line 5. Î Use your arrow keys to move the cursor up to line 5. Î Now type the following values on lines 5–7, so that your data set looks like the following program (after you have typed the values on a particular line, use the arrow keys to move the cursor down to the next line):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Deleting a Single Line There are at least two ways to delete a single line in a SAS program. One way is obvious: place the cursor at the end of the line and press the backspace key to delete the line, one character at a time. A second way (and the way that you will use here) is to click and drag to highlight the line, and then delete the entire line at once. This introduces you to the concept of “clicking and dragging,” which is a very important technique to use when editing a SAS program. For example, suppose that you want to delete line 5 of the following program (the line with “2 2” on it):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Complete the following steps: Î Place your I-beam cursor at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the first “2” on line 5; do not go too far to the left of the “2” or your I-beam will turn into an arrow––if this happens you have gone too far. This might take a little practice.) Î Click once and hold the button down (do not release it yet). Î While holding the button down, drag your mouse to the right so that the data on line 5 are highlighted in black. (This means that you should drag to the right until the “2 2” is highlighted in black. Do not drag your mouse up or down, or you might accidentally highlight additional lines of data.) Î After the data are highlighted in black, release the button. The data should remain highlighted. Your program should look something like this (with the “2 2” on line 5 highlighted):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   2 2
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
To delete the line you have highlighted, complete the following step: Î Press the Backspace (DELETE) key. The highlighted data disappear, leaving only a blank line, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ▌
6   3 3
7   4 4
8   8 9
9   6 7
10  2 3
11  2 2
12  3 3
Your cursor is in column 1 of the newly blank line. To make the blank line disappear, complete the following step: Î Press the Backspace (DELETE) key again. Your program now appears as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3
Deleting a Range of Lines You follow a similar procedure to delete a range of lines, with one exception: When you click and drag, you will drag down (as well as to the right) so that you highlight more than one line. When you press the backspace key, all of the highlighted lines will be deleted. For example, suppose that you want to delete lines 5, 6, and 7 in your program:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3
Complete the following steps: Î Place your I-beam at the beginning of the data on line 5. (Again, this means place the I-beam to the immediate left of the first “3” on line 5; do not go too far to the left of the “3” or your I-beam will turn into an arrow––if this happens you have gone too far.) Î Click once and hold the button down (do not release it yet).
Î While holding the button down, drag your mouse down and to the right so that the data on lines 5, 6, and 7 are highlighted in black. Î After the data are highlighted in black, release the button. The lines remain highlighted. Your program should look something like this (with lines 5–7 highlighted):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   3 3
6   4 4
7   8 9
8   6 7
9   2 3
10  2 2
11  3 3
To delete the lines you have highlighted, complete this step: Î Press the Backspace (DELETE) key once. The highlighted lines disappear. After deleting these lines, it is possible that one blank line will be left, as shown on line 5 here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   ❙
6   6 7
7   2 3
8   2 2
9   3 3
To delete the blank line, Î Press the Backspace (DELETE) key again. Your program now appears as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3
Copying a Single Line into Your Program Copying a single line involves these steps (do not do this yet):
1. Create a new blank line where the copied line is to be inserted.
2. Click and drag to highlight the line to be copied.
3. Pull down the Edit menu and select Copy.
4. Place the cursor at the point where the line is to be pasted.
5. Pull down the Edit menu and select Paste.
For example, suppose that you want to make a copy of line 5 and place the copy before line 6 in the following program:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   2 3
7   2 2
8   3 3
First, you must create a new blank line where the copied line is to be inserted. Complete the following steps (try this now): Î Place the I-beam at the end of the data on line 5 (that is, after the “6 7” on line 5) and click once. This places your cursor to the right of the numbers “6 7.” Î Press ENTER once. This creates a new blank line after line 5, as shown here:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3
Next you highlight the data to be copied. Complete the following steps: Î Place your I-beam at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the “6” on the “6 7” line.) Î Click once and hold the button down (do not release it yet).
Î While holding the button down, drag your mouse to the right so that the data on line 5 are highlighted in black. Î After the data are highlighted in black, release the button. The data remain highlighted. With this done, your program should look something like this (with the “6 7” on line 5 highlighted):

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6
7   2 3
8   2 2
9   3 3
Now you must select the Copy command. Î Edit Î Copy Nothing appears to happen, but don’t worry––the highlighted text has been copied to an invisible clipboard. Now place your cursor (the insertion point) at the beginning of line 6. Complete this step: Î Place the I-beam in column 1 of line 6 and click. Your program should look like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   ▌
7   2 3
8   2 2
9   3 3
Finally, you can take the material that you copied to the clipboard and paste it at the insertion point in your program. Make the following selections: Î Edit Î Paste
The copied data now appear on line 6. Your program should look something like this:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   2 2
9   3 3
Copying a Range of Lines To copy a range of lines, you follow the same procedure that you use to copy a single line, with one exception: When you click and drag, you drag down (as well as to the right) so that you highlight more than one line. When you select Paste, all of the highlighted lines will be copied. For example, assume that you want to copy lines 5-7 and place the copied lines before line 8 in the following program:

1   OPTIONS LS=80 PS=60;
2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   2 2
9   3 3
First, you must create a new blank line where the copied lines are to be inserted. Complete the following steps: Î Place the I-beam at the end of the data on line 7 (that is, after the “2 3” on line 7) and click once. This places your cursor to the right of the “2 3.” Î Press ENTER once.
This creates a new blank line after line 7, as shown here:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3
Next you highlight the data to be copied. Complete these steps: Î Place your I-beam at the beginning of the data on line 5. (This means place the I-beam to the immediate left of the “6” on the “6 7” line.) Î Click once and hold the button down. Î While holding the button down, drag your mouse down and to the right so that the data on lines 5–7 are highlighted in black. Î After the lines are highlighted in black, release the button. The lines remain highlighted. With this done, your program should look something like this (with lines 5–7 highlighted):

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8
9   2 2
10  3 3
Now you must select the Copy command. Î Edit Î Copy Nothing appears to happen, but don’t worry––the highlighted text has been copied to an invisible clipboard. Now you need to place your cursor (the insertion point) at the beginning of line 8: Î Place the I-beam in column 1 of line 8 and click.
Your program should look like this:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   ▌
9   2 2
10  3 3
Finally, take the material that you copied to the clipboard and paste your selection at the insertion point in your program. Î Edit Î Paste The copied data now appear on (and following) line 8. Your program should look something like this:

2   DATA D1;
3   INPUT TEST1 TEST2;
4   DATALINES;
5   6 7
6   6 7
7   2 3
8   6 7
9   6 7
10  2 3
11  2 2
12  3 3
Moving Lines To move lines, you follow the same procedure that you use to copy lines, with one exception: When you initially pull down the Edit menu, select Cut rather than Copy. For example, you follow these steps to move a range of lines (do not actually do this now):
1. Create a new blank line where the moved lines are to be inserted.
2. Click and drag to highlight the lines to be moved.
3. Pull down the Edit menu and select Cut.
4. Place the cursor at the point where the lines are to be pasted.
5. Pull down the Edit menu and select Paste.
When you are finished, there will be one blank line at the location where the moved lines used to be. You can delete this line in the usual way: Use your mouse to place the cursor in column 1 of that blank line, and press the backspace (delete) key. Saving Your SAS Program and Ending the SAS Session Now it is time to save the program on your floppy disk in drive A. Because you opened DEMO.SAS from drive A, drive A is now the default drive—it is not necessary to assign it as the default drive. You can save your file using the Save command, rather than the Save As command. Verify that your Editor window is the active window, and select: Î File Î Save Now end your SAS session by selecting: Î File Î Exit This produces a dialog box that asks if you are sure you want to end the SAS session. Click OK in this dialog box, and the SAS session ends.
Tutorial Part III: Submitting a Program with an Error Overview In this section, you modify an existing SAS program so that it will produce an error when you submit it. This gives you the opportunity to learn the procedure that you will follow when debugging SAS programs with errors. Restarting SAS Complete the following steps: Î Click the Start button that appears at the bottom of your initial screen. This displays a list of options, including the word Programs. Î Select Programs. This reveals a list of programs on the computer. One of them is The SAS System for Windows V8. Î Select The SAS System for Windows V8 and release the button. Modifying the Initial SAS System Screen Before opening an existing SAS file, modify the initial SAS screen so that it is easier to work with. Complete the following steps. Close the Explorer window: Î Click the close window button for the Explorer window (the ✕ button in the upper-right corner of the Explorer window; see Figure 3.13). This reveals the Results window, which was hidden beneath the Explorer window. Now close the Results window: Î Click the close window button for the Results window (the ✕ button in the upper-right corner of the Results window). Your screen now contains the Log window and the Editor window. To maximize the Editor window, complete this step:
Î Click the maximize window button for the Editor window (the middle button, which contains a square: □; see Figure 3.4). The Editor expands and fills your screen. Finally, you need to review the Enhanced Editor Options dialog box to verify that the appropriate options have been selected. To request this dialog box, make the following selections: Î Tools Î Options Î Enhanced Editor The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that the General options tab is in the foreground (that is, General options is visible), and click the tab labeled General if it is not visible. The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).
• In the box labeled Indentation, verify that None is selected.
Here is the one option that should not be selected:
• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).
Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct, complete the following step: Î Click the OK button at the bottom of the dialog box. This returns you to the Editor window.
Opening an Existing SAS Program from Your Floppy Disk Verify that your 3.5-inch floppy disk is in drive A and that the Editor window is the active window. Then make the following selections: Î File Î Open The Open dialog box appears on your screen. Toward the top of the Open dialog box is a box labeled Look in. This box tells the SAS System where it should look to find a file. If this box does not contain 3 1/2 Floppy (A:), you will have to change it. If this is necessary, complete these steps: Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where the SAS System can look (see Figure 3.15). Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:). Î Click the entry that reads 3 1/2 Floppy (A:). 3 1/2 Floppy (A:) appears in the Look in box. The contents of your diskette now appear in the larger box below the Look in box. One of these files should be DEMO, the file that you need to open. To open this file, complete the following steps: Î Click DEMO. The name DEMO appears in the File name box. Î Click the Open button. The SAS program that you saved in the preceding section appears in the Editor window. You are now free to modify it and submit it for execution. Submitting a Program with an Error You will now submit a program with an error in order to see how errors are identified and corrected. With the file DEMO.SAS opened in the Editor window, change the third line from the bottom so that it requests “PROC MEENS” instead of “PROC MEANS.” This will produce an error message when SAS attempts to execute the program, because there is no procedure named PROC MEENS.
Here is the modified program; notice that the third line from the bottom now requests “PROC MEENS”:

OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
6 7
2 3
6 7
6 7
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEENS DATA=D1;
TITLE1 'JANE DOE';
RUN;

After you make this change, submit the program for execution in the usual way: Î Run Î Submit This submits your program for execution. Reviewing the SAS Log and Correcting the Error After you submit the SAS program, it takes SAS a few seconds to finish processing it. However, after this processing is complete the Output window will not become the active window. The fact that your Output window does not appear indicates that your program did not run correctly. To determine what was wrong with the program, you need to review the Log window. Make the following selections: Î Window Î Log
The log file that is created by your program should look similar to Log 3.2.

NOTE: SAS initialization used:
      real time           15.44 seconds

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3    INPUT TEST1 TEST2;
4    DATALINES;

NOTE: The data set WORK.D1 has 13 observations and 2 variables.
NOTE: DATA statement used:
      real time           2.86 seconds

18   ;
19   PROC MEENS DATA=D1;
ERROR: Procedure MEENS not found.
20   TITLE1 'JANE DOE';
21   RUN;

NOTE: The SAS System stopped processing this step because of errors.
NOTE: PROCEDURE MEENS used:
      real time           0.22 seconds

Log 3.2. Log file created by a SAS program with an error.
When you go to the SAS Log window, you will usually see the last part of the log file. When looking for errors or other problems, you should begin reviewing at the beginning of the log file. Scroll to the top of the log file by completing this step (try this now): Î Scroll up to the top of the log file either by clicking and dragging the scroll bar or by clicking the up arrow in the scroll bar area of the Log window. Starting at the top of the log file, look for warning messages, error messages, or other signs of problems, and work your way down. It is important to always begin at the top, because a single error message early in a program can sometimes cause dozens of additional error messages later in the program; if you correct the first error, the remaining error messages will often disappear. Toward the bottom half of Log 3.2, you can see an error message that reads “ERROR: Procedure MEENS not found.” The SAS program statement causing this error is the statement that immediately precedes the error message in the SAS log. That SAS program statement, along with the resulting error message, is reproduced here:

19   PROC MEENS DATA=D1;
ERROR: Procedure MEENS not found.

This error message indicates that SAS does not have a procedure named MEENS. A subsequent statement in the log indicates that SAS stopped processing this step because of the error.
Obviously, the error is that PROC MEANS was incorrectly spelled as PROC MEENS. When you see an error like this in a log file, your first impulse might be to correct the error in the log file itself. But this will not work––you must correct the error in the SAS program, not in the log file. Before doing this, however, clear the text of the current log file. If you do not clear the text of this log file, the next time you submit the SAS program, SAS will append a new log file to the bottom of the existing log file, and it will be difficult to review your current log. To correct your error, first clear your Log window by selecting Î Edit Î Clear All Now return to your SAS program by making the Editor the active window: Î Window Î Demo The SAS program that you submitted reappears in the Editor:

OPTIONS LS=80 PS=60;
DATA D1;
INPUT TEST1 TEST2;
DATALINES;
6 7
6 7
2 3
6 7
6 7
2 3
2 2
3 3
3 4
4 3
4 4
5 3
5 4
;
PROC MEENS DATA=D1;
TITLE1 'JANE DOE';
RUN;

Now correct the error in the program: Î If necessary, scroll down to the bottom of the program. Move your cursor down to the line that contains PROC MEENS and change this statement so that it reads PROC MEANS. Now submit the program again: Î Run Î Submit
If the error has been corrected (and if the program contains no other errors), the Output window appears after processing is completed. The results of PROC MEANS appear in this Output window, and these results should look similar to Output 3.2.

                            JANE DOE                                 1

                       The MEANS Procedure

Variable   N          Mean       Std Dev       Minimum       Maximum
---------------------------------------------------------------------
TEST1     13     4.1538462     1.6251233     2.0000000     6.0000000
TEST2     13     4.3846154     1.8946619     2.0000000     7.0000000
---------------------------------------------------------------------

Output 3.2. SAS output produced by PROC MEANS after correcting the error.
If SAS does not take you to the Output window, it means that your program still contains an error. If this is the case, repeat the process described earlier:
1. Go to the Log window and identify the error.
2. Clear the Log window of text.
3. Go to the Editor window, which contains the SAS program.
4. Correct the error.
5. Resubmit the program.
Saving Your SAS Program and Ending This Session If the SAS program ran correctly, the Output window should be in the foreground. Now you must make the Editor the active window before you can save your program. Make the following selections: Î Window Î Demo Your SAS program is visible in the Editor window. Now save the program on the disk in drive A: Î File Î Save You can end your session with the SAS System by selecting Î File Î Exit The computer asks if you are sure you want to end the SAS session. Click OK, and the SAS session ends.
To Learn More about Debugging SAS Programs This section summarizes the steps that you follow when debugging a SAS program with an error. A more concise summary of these steps appears in “Finding and Correcting an Error in a SAS Program” later in this chapter. To learn more about debugging SAS programs, see “Common SAS Programming Errors That Beginners Make” on this book’s companion Web site (support.sas.com/companionsites). Delwiche and Slaughter (1998) also provide guidance on how to find and correct errors in SAS programs.
Tutorial Part IV: Practicing What You Have Learned Overview In this section, you practice the skills that you have developed in the preceding sections. You open your existing SAS program, edit it, submit it, print the resulting log and output files, and perform other activities. Restarting the SAS System and Modifying the Initial SAS System Screen Complete the following steps to restart the SAS System: Î Click the Start button that appears at the bottom of your initial screen. This displays a list of options, including the word Programs. Î Select Programs. This reveals a list of programs on the computer. One of them is The SAS System for Windows V8. Î Select The SAS System for Windows V8 and release your button. This produces the initial SAS System screen. Next, close the Explorer window and the Results window and maximize the Editor window: Î Click the close window button for the Explorer window (the ✕ button in the upper-right corner of the Explorer window; see Figure 3.13). Î Click the close window button for the Results window (the ✕ button in the upper-right corner of the Results window). Î Click the maximize window button for the Editor window (the middle button, which contains a square: □; see Figure 3.13). Finally, if there is any possibility that someone has changed the options in the Enhanced Editor Options dialog box, you should review this dialog box. If this is necessary, make the following selections: Î Tools Î Options Î Enhanced Editor The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that the General options tab is in the
foreground (that is, General options is visible), and click the tab labeled General if it is not visible. The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).
• In the box labeled Indentation, verify that None is selected.
This option should not be selected:
• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).
Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click inside the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct, Î Click the OK button at the bottom of the dialog box. This returns you to the Editor window. Reviewing the Names of Files on a Floppy Disk and Opening an Existing SAS Program Verify that your 3.5-inch floppy disk is in drive A and that the Editor window is the active window. Then make the following selections: Î File Î Open The Open dialog box appears on your screen. Toward the top of the Open dialog box is a box labeled Look in. This box tells the SAS System where it should look to find a file. If this box does not contain 3 1/2 Floppy (A:), you will have to change it. If this is necessary, complete the following steps: Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where the SAS System can look (see Figure 3.15). Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:). Î Click the entry that reads 3 1/2 Floppy (A:).
3 1/2 Floppy (A:) appears in the Look in box. The contents of your diskette should now appear in the larger box below the Look in box. One of these files should be DEMO, the file that you need to open. Complete the following steps: Î Click the file named DEMO. The name DEMO appears in the File name box. Î Click the Open button. The SAS program that you saved in the preceding section appears in the Editor window. You are now free to modify it and submit it for execution. Practicing What You Have Learned Now that the file DEMO.SAS is open in the Editor, you can practice what you have learned in this tutorial. When you are not sure about how to complete certain tasks, refer to earlier sections. Within the Editor window, complete the following steps to modify your file named DEMO:
1) Insert three new lines of data into the middle of your data set (somewhere after the DATALINES statement). Make up the numbers.
2) Delete two existing lines of data (choose any two lines of data).
3) Copy four lines of data.
4) Move three lines of data.
5) Save your program on the 3.5-inch diskette using the File Î Save As command. Give it a new name: DEMO2.SAS.
6) Submit the program for execution.
7) Review the contents of your log file on the screen.
8) Print your log file.
9) Clear your log file from the screen by using the Edit Î Clear All command.
10) Review the contents of your output file on the screen.
11) Print your output.
12) Clear your output file from the screen by using the Edit Î Clear All command.
13) Go back to the Editor window.
14) Add one more line of data to your program.
15) Save your program again, this time using the File Î Save command (not the Save As command).
Ending the Tutorial You can end your SAS session by making the following selections: Î File Î Exit The computer asks if you are sure you want to end the SAS session. Click OK, and the SAS session ends. This completes the tutorial sections of this chapter.
Summary of Steps for Frequently Performed Activities Overview This section summarizes the steps that you follow when performing three common activities:
• starting the SAS windowing environment
• opening an existing SAS program from a floppy disk
• finding and correcting an error in a SAS program.
Starting the SAS Windowing Environment Verify that your computer and monitor are turned on and not in sleep mode. Make sure that your initial Windows screen appears (with the Start button at the bottom). Î Click the Start button that appears at the bottom left of your initial screen. This displays a list of options. Î Select Programs. This reveals a list of programs on the computer. Î Select The SAS System for Windows V8. This produces the initial SAS System screen. Next, close the Explorer window and the Results window, and maximize the Editor window:
Î Click the ✕ button in the upper-right corner of the Explorer window (see Figure 3.13).
Î Click the ✕ button in the upper-right corner of the Results window.
Î Click the maximize button (the middle button, which contains a square: □) to maximize the Editor window.
Finally, if there is any possibility that someone has changed the options in the Enhanced Editor Options dialog box, you should review this dialog box. If this is necessary, make the following selections: Î Tools Î Options Î Enhanced Editor The upper-left corner of the Enhanced Editor Options dialog box contains one tab labeled General and one tab labeled Appearance. Verify that General options is in the foreground (that is, General options is visible), and click the tab labeled General if it is not visible. The Enhanced Editor Options dialog box contains a variety of different options, but we will focus on three of them. Here are the two options that should be selected at the beginning of a SAS session:
• Verify that Show line numbers is selected (that is, verify that a check mark appears in the box for this option).
• In the box labeled Indentation, verify that None is selected.
Here is the one option that should not be selected:
• Verify that Clear text on submit is not selected (that is, verify that a check mark does not appear in the box for this option).
Figure 3.14 shows the proper settings for the Enhanced Editor Options dialog box. If necessary, click the appropriate boxes so that the three options described earlier are set properly (you can disregard the other options). When all are correct, complete the following step: Î Click the OK button at the bottom of the dialog box. This returns you to the Editor window. You are now ready to begin typing a SAS program or open an existing SAS program from a diskette. Opening an Existing SAS Program from a Floppy Disk Verify that your floppy disk is in drive A and that the Editor window is the active window. Then make the following selections: Î File Î Open The Open dialog box appears on your screen. Toward the top of the Open dialog box is a box labeled Look in. This box tells SAS where it should look to find a file. If this box does not contain 3 1/2 Floppy (A:), you will have to change it by completing the following steps:
Î On the right side of the Look in box is a down arrow. Click this down arrow to get other possible locations where SAS can look (see Figure 3.15). Î Scroll up and down this list of possible locations (if necessary) until you see an entry that reads 3 1/2 Floppy (A:). Î Click the entry that reads 3 1/2 Floppy (A:). 3 1/2 Floppy (A:) appears in the Look in box. The contents of your disk should now appear in the larger box below the Look in box. One of these files is the file that you need to open. Remember that the file names in this box might not contain the “.SAS” suffix, even if you included this suffix in the file name when you first saved the file. This does not mean that there is a problem. To open your file, complete the following steps: Î Click the name of the file that you want to open. This file name appears in the File name box. Î Click the Open button. The SAS program saved under that file name appears in the Editor window. You are now free to modify it and submit it for execution. Finding and Correcting an Error in a SAS Program After you submit a SAS program, one of two things will happen: •
If SAS does not take you to the Output window (that is, if you remain at the Editor window), it means that your SAS program did not run correctly. You need to go to the Log window and locate the error (or errors) in your program.
•
If SAS does take you to the Output window, it still does not mean that the entire SAS program ran correctly. It is always a good idea to review your SAS log for warnings, errors, and other messages prior to reviewing the SAS output.
The point is this: After submitting a SAS program, you should always review the log file prior to reviewing the SAS output, either to verify that there are no errors or to identify the nature of those errors. This section summarizes the steps in this process. After you have submitted the SAS program and processing has stopped, make the following selections to go to the Log window: Î Window Î Log When you go to the SAS Log window, what you will usually see is the last part of the log file. You must scroll to the top of the log file to see the entire log:
Î Scroll up to the top of the log file by either clicking and dragging the scroll bar or by clicking the up arrow in the scroll bar area of the Log window. Now that you are at the top of the log file, begin reviewing it for warning messages, error messages, or other signs of problems. Begin at the top of the log file and work your way down. It is important to begin at the top of the log because a single error message early in a program can sometimes cause dozens of additional error messages later in the program; if you correct the first error, the dozens of remaining error messages often disappear. If there are no warnings, errors, or other signs of problems, go to your output file: Î Window Î Output If your SAS log does contain error messages, try to find the cause of these errors. Begin with the first (earliest) error message in the SAS log. Remember that your SAS log always contains the statements that make up the SAS program (minus the data lines). Review the line of your SAS program that immediately precedes the first error message––are there any problems with this line (for example, a missing semicolon, a misspelled word)? If that line appears to be correct, review the line above it. Are there any problems with that line? Continue working backward, one line at a time, until you find the error. After you have found the error, remember that you cannot correct the error in the SAS log. Instead, you must correct the error in the SAS program. First, clear all text in the existing SAS Log and SAS Output windows. This ensures that when you resubmit your SAS program, the new log and output will not be appended to the old. Complete the following steps to delete the existing log and output files: Î Window Î Log Î Edit Î Clear All Î Window Î Output Î Edit Î Clear All You will now go to the Editor window (remember that the Window menu might contain the name that you gave to your SAS program, rather than the word “Editor,” which appears here): Î Window Î Editor You should now edit your SAS program to correct the error that you identified earlier. After you have corrected the error, save the modified program: Î File Î Save Now submit the revised SAS program: Î Run Î Submit At this point, the process repeats itself. If the program runs, you should still go to the log file to verify that there are no errors or other signs of problems. If the program did not run, you
will go to the log file to look for the error (or errors) in your program. Continue this process until your program runs without errors.
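As a concrete illustration of the kind of mistake to look for, consider a hypothetical fragment (not part of the tutorial program) in which the semicolon has been dropped from the end of the INPUT statement:

DATA D1;
INPUT TEST1 TEST2
DATALINES;

Because the INPUT statement is not terminated, SAS reads the next word (DATALINES) as part of the INPUT statement rather than as a new statement. Depending on the statements involved, a missing semicolon like this can produce an outright error message, or it can cause SAS to misinterpret the program without stopping; this is one more reason to review the log even when the Output window appears.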
Controlling the Size of the Output Page with the OPTIONS Statement In completing this tutorial, all of the programs that you submitted contained the following OPTIONS statement:

OPTIONS LS=80 PS=60;
The OPTIONS statement is a global statement that can be used to change the value of system options and change how the SAS System operates. For example, you can use the OPTIONS statement to suppress the printing of page numbers, to suppress the printing of dates, and to perform other tasks. In this tutorial, the OPTIONS statement was used for one purpose: to specify the size of the printed page. The OPTIONS statement presented earlier requests a small-page format for output. The LS=80 section of this statement requests that your output have a line size of 80 characters per line (the “LS” stands for “line size”). This setting makes your output easy to view on a narrow computer screen. The PS=60 section of this statement requests that your output have a page size of 60 lines per page (the “PS” stands for “page size”). These specifications are fairly standard. Specifying a line size of 80 and a page size of 60 is fine for most programs, but it is not optimal for SAS programs that provide a great deal of information on each page. For example, when performing a factor analysis or principal component analysis, it is better to use a larger format so that each page can contain more information. The printed output from these more sophisticated analyses is easier to read if the line size is 120 characters per line rather than 80 (of course, this assumes that you have access to a large-format printer that can print 120 characters per line). To print your output in the larger format, change the OPTIONS statement on the first line of the program. Specifically, set LS=120 (you can leave the PS=60). This revised OPTIONS statement is illustrated here:

OPTIONS LS=120 PS=60;
In the OPTIONS statements presented in this section, “LS” is used as an abbreviation for the keyword “LINESIZE,” and “PS” is used as an abbreviation for the keyword “PAGESIZE.” If you prefer, you can also write your OPTIONS statement with the full-length keywords, as shown here:

OPTIONS LINESIZE=80 PAGESIZE=60;
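An earlier paragraph in this section mentioned that the OPTIONS statement can also suppress the printing of dates and page numbers. As a brief sketch of how that looks (NODATE and NONUMBER are standard SAS system options, although this guide does not discuss them further):

OPTIONS LS=80 PS=60 NODATE NONUMBER;

With this statement at the top of your program, your output pages would omit the date and the page number while keeping the same small-page format used throughout this tutorial.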
For More Information The reference section at the end of this book lists a number of books that provide additional information about using the SAS windowing environment. Delwiche and Slaughter (1998) provide a concise but comprehensive introduction to using SAS Version 7. Two books by Jodie Gilmore (Gilmore, 1997; Gilmore, 1999) provide detailed instructions for using SAS in the Windows environment. These books can be ordered from SAS at (800) 727-3228 or (919) 677-8000.
Conclusion For students who are using SAS for the first time, learning to use the SAS windowing environment is often the most challenging task. The tutorial in this chapter has introduced you to this application. When you are performing analyses with the SAS System, you should continue to refer to this chapter to refresh your memory on how to perform specific activities. For most students, using the SAS windowing environment becomes second nature within a matter of weeks. The next chapter in this book, Chapter 4, “Data Input,” describes how to get your data into a format that can be analyzed by SAS. Before you can perform statistical analyses on your data, you must first provide SAS with information about how many variables it has to read, what names should be given to those variables, whether the variables are numeric or character, along with other information. Chapter 4 shows you the basics for creating the types of data sets that are most frequently encountered when conducting research.
Data Input

Introduction ................................................................................... 113
  Overview ..................................................................................... 113
  The Rows and Columns that Constitute a Data Set ................... 113
  Overview of Three Options for Writing the INPUT Statement .... 115
Example 4.1: Creating a Simple SAS Data Set ............................ 117
  Overview ..................................................................................... 117
  The OPTIONS Statement ........................................................... 117
  The DATA Statement .................................................................. 118
  The INPUT Statement ................................................................ 119
  The DATALINES Statement ....................................................... 120
  The Data Lines ........................................................................... 120
  The Null Statement ..................................................................... 121
  The PROC Statement ................................................................. 121
Example 4.2: A More Complex Data Set ...................................... 122
  Overview ..................................................................................... 122
  The Study ................................................................................... 122
  Data Set to Be Analyzed ............................................................ 123
  The SAS DATA Step ................................................................... 125
  Some Rules for List Input ........................................................... 126
Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set ....................................................... 131
  Overview ..................................................................................... 131
  Adding PROC MEANS and PROC FREQ to the SAS Program .. 131
   The SAS Log ........................................................................................134
   Interpreting the Results Produced by PROC MEANS .........................135
   Interpreting the Results Produced by PROC FREQ ............................137
   Summary ..............................................................................................138
Using PROC PRINT to Create a Printout of Raw Data .......................139
   Overview ..............................................................................................139
   Using PROC PRINT to Print Raw Data for All of the Variables
      In the Data Set ..................................................................................139
   Using PROC PRINT to Print Raw Data for a Subset of Variables
      In the Data Set ..................................................................................141
   A Common Misunderstanding Regarding PROC PRINT ....................142
The Complete SAS Program ................................................................142
Conclusion ...........................................................................................144
Introduction

Overview
Raw data must be converted into a SAS data set before you can analyze it with SAS statistical procedures. In this chapter you learn how to create simple SAS data sets. Most of the chapter uses the format-free approach to data input because it is the simplest approach, and it will be adequate for the types of data sets that you will encounter in this guide. You will learn how to write an INPUT statement that reads both numeric and character variables. You will also learn how to represent missing data in the data set that you want to analyze.

After you have typed your data, you should always analyze it with a few simple procedures to verify that SAS read your data set as you intended. This chapter shows you how to use PROC MEANS, PROC FREQ, and PROC PRINT to verify that your data set has been created correctly.

The Rows and Columns that Constitute a Data Set
Suppose that you administer a short questionnaire to nine subjects. The questionnaire asks the subjects to indicate their height (in inches), weight (in pounds), and age (in years). The results that you obtain for the nine subjects are summarized in Table 4.1.

Table 4.1
Data from the Height and Weight Study
________________________________
Subject        Height  Weight  Age
________________________________
1. Marsha        64      140    20
2. Charles       68      170    28
3. Jack          74      210    20
4. Cathy         60      110    32
5. Emmett        64      130    22
6. Marie         68      170    23
7. Cindy         65      140    22
8. Susan         65      140    22
9. Fred          68      160    22
________________________________
Table 4.1 is a data set: a collection of variables and observations that could be analyzed using a statistical package such as SAS. The table is organized in much the same way that a SAS data set is organized: each row in the data set (running horizontally from left to right) represents a different observation. Because you are doing research in which the individual person is the unit of analysis, each observation in your data set is a different person. You can
see that the first row of the data set presents data from a subject named Marsha, the next row presents data from a subject named Charles, and so on. In contrast, each column in the data set (running vertically from top to bottom) represents a different variable. The first column ("Subject") provides each subject's name and number; the second column ("Height") provides each subject's height in inches; the third column ("Weight") provides each subject's weight in pounds, and so on. By reading across a given row, you can see where each subject scored on each variable. For example, by reading across the row for subject #1 (Marsha), you can see that she stands 64 inches in height, weighs 140 pounds, and is 20 years old.

After you have entered the data in Table 4.1 into a SAS program, you could analyze it with any number of statistical procedures. For example, you could find the mean score for the three quantitative variables (height, weight, and age). Below is an example of a SAS program that will do this (remember that you would not type the line numbers appearing on the left; these numbers are used here simply to identify the lines in the program):

 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           HEIGHT
 5           WEIGHT
 6           AGE ;
 7   DATALINES;
 8   1 64 140 20
 9   2 68 170 28
10   3 74 210 20
11   4 60 110 32
12   5 64 130 22
13   6 68 170 23
14   7 65 140 22
15   8 65 140 22
16   9 68 160 22
17   ;
18   PROC MEANS  DATA=D1;
19      VAR  HEIGHT  WEIGHT  AGE;
20   TITLE1  'JANE DOE';
21   RUN;
Later sections of this chapter will discuss various parts of the preceding SAS program: the DATA statement on line 2, the INPUT statement on lines 3–6, and so on. For now, just focus on the data set itself, which appears on lines 8–16. Notice that this is identical to the data set of Table 4.1 except that the subjects' first names have been removed, and the columns have been moved closer to one another so that there is less space between the variables. You can see that line 8 still presents data for subject #1 (Marsha), line 9 still presents data for subject #2, and so on.
The point is this: in this guide, all data sets will be arranged in the same fashion. The rows (running horizontally from left to right) will represent different observations (typically different people), and the columns (running vertically from top to bottom) will represent different variables.

Overview of Three Options for Writing the INPUT Statement
The first step in analyzing variables with SAS involves reading them as part of a DATA step, and the heart of the DATA step is the INPUT statement. The INPUT statement is the statement in which you assign names to the variables that you will analyze. There are many ways to write an INPUT statement, and some ways are much more complex than others. A later section of this chapter will provide detailed guidelines for using one specific approach. However, following is a quick overview of the three most commonly used options.

List input. List input (also called "free-formatted input") is probably the simplest way to write an INPUT statement. This is the approach that will be taught in this guide. With list input, you simply give a name to each variable and tell SAS the order in which the variables appear on a data line (i.e., which variable comes first, which comes second, and so on). This is a good approach to use when you are first learning SAS, and when you have data sets with a small number of variables.

List input is also called free-formatted input because you do not have to put a variable into any particular column on the data line. You simply have to be sure that you leave at least one blank space between each variable, so that SAS can tell one variable from another. Additional guidelines on the use of free-formatted input will be presented in the section "Example 4.2: A More Complex Data Set." Here is an example of how you could write an INPUT statement that will read the preceding data set using the free-formatted approach:

INPUT   SUB_NUM
        HEIGHT
        WEIGHT
        AGE;
The preceding INPUT statement tells SAS that it will read four variables for each subject. On each data line, it will first read the subject’s score on SUB_NUM (the participant’s subject number). On the same data line, it will then read the subject’s score on HEIGHT, the subject’s score on WEIGHT, and finally the subject’s score on AGE. Column input. With column input, you assign a name to each variable, and tell SAS the exact columns in which the variable will appear. For example, you might indicate that the variable SUB_NUM will appear in column 1, the variable HEIGHT will appear in columns 3 through 4, the variable WEIGHT will appear in columns 6 through 8, and the variable AGE will appear in columns 10 through 11.
Column input is a useful approach when you are working with larger data sets that contain a larger number of variables. Although column input will not be covered in detail here, you can learn more about it in Schlotzhauer and Littell (1997, pp. 41–44). Here is an example of column input:

INPUT   SUB_NUM    1
        HEIGHT     3-4
        WEIGHT     6-8
        AGE        10-11 ;
Formatted input. Formatted input is a more complex type of column input in which you again assign names to your variables and indicate the exact columns in which they will appear. Formatted input has the advantage of making it easy to input string variables: variables whose names begin with the same root and end with a series of numbers. For example, imagine that you administered a 50-item questionnaire to a large sample of people, and wanted to use the SAS variable name V1 to represent responses to the first question, the variable name V2 for responses to the second question, and so on. It would be very time consuming if you listed each of these variable names individually in the INPUT statement. However, if you used formatted input, you could create all 50 variables very easily with the following statement:

INPUT   @1   (V1-V50)   (1.);

To learn about formatted input, see Cody and Smith (1997), and Hatcher and Stepanski (1994). Here is an example of how the current data set could be input using the formatted input approach:

INPUT   @1    (SUB_NUM)   (1.)
        @3    (HEIGHT)    (2.)
        @6    (WEIGHT)    (3.)
        @10   (AGE)       (2.) ;
Example 4.1: Creating a Simple SAS Data Set

Overview
This section shows you how to create a simple data set that contains just three quantitative variables (i.e., the data set from Table 4.1). You will learn to use the various components that constitute the DATA step:
•  OPTIONS statement
•  DATA statement
•  INPUT statement
•  DATALINES statement
•  data lines
•  null statement.

For reference, here is the DATA step that was presented earlier in this chapter:

 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           HEIGHT
 5           WEIGHT
 6           AGE ;
 7   DATALINES;
 8   1 64 140 20
 9   2 68 170 28
10   3 74 210 20
11   4 60 110 32
12   5 64 130 22
13   6 68 170 23
14   7 65 140 22
15   8 65 140 22
16   9 68 160 22
17   ;
The OPTIONS Statement
The OPTIONS statement is not really a formal part of the DATA step; it is actually a global command that can be used to set a variety of system options. In this section you will learn how to use the OPTIONS statement to control the size of your output page when it is printed. The syntax for the OPTIONS statement is as follows:

OPTIONS   LS=n1   PS=n2 ;
LS in the preceding OPTIONS statement is an abbreviation for "LINESIZE." This option enables you to control the maximum number of characters that will appear on each line of output. In this OPTIONS statement, n1 = the maximum number of characters that you want to appear on each printed line.

PS in the OPTIONS statement is an abbreviation for "PAGESIZE." This option enables you to control the maximum number of lines that will appear on each page of output. In this OPTIONS statement, n2 = the maximum number of lines that you want to appear on each page.

For example, suppose that you want your SAS output to have a maximum of 80 characters (letters and numbers) per line, and you want to have a maximum of 60 lines per page. The following OPTIONS statement would request this:

OPTIONS   LS=80   PS=60 ;

The preceding is a good format to request when your output will be printed on standard letter-size paper. However, if your output will be printed on a large-format printer with a long carriage, you may want to have 120 characters per line. The following statement would request this:

OPTIONS   LS=120   PS=60 ;
The DATA Statement
You use the DATA statement to begin the DATA step and assign a name to the data set that you are creating. The syntax for the DATA statement is as follows:

DATA   data-set-name ;

For example, if you want to assign the name D1 to the data set that you are creating, you would use the following statement:

DATA   D1;
You can assign just about any name you like to a data set, as long as the name conforms to the following rules for a SAS data set name:
•  The name must begin with a letter or an underscore (_).
•  The remainder of the name can include either letters or numbers.
•  If you are using SAS System Version 6 (or earlier), the name can be a maximum of eight characters; if you are using Version 7 (or later), the name can be a maximum of 32 characters.
•  The name cannot include any embedded blanks. For example, "POL PRTY" is not an acceptable name, as it includes a blank space. However, "POL_PRTY" is acceptable, because an underscore ("_") connects the first part of the name ("POL") to the second part of the name ("PRTY").
•  The name cannot contain any special characters (e.g., "*," "#") or hyphens (-).
This guide typically uses the name D1 for a SAS data set simply because it is short and easy to remember.

The INPUT Statement
You use the INPUT statement to assign names to the variables that you will analyze, and to indicate the order in which the variables will appear on the data lines. Using free-formatted input, the syntax for the INPUT statement is as follows:

INPUT   first-variable
        second-variable
        third-variable
        . . .
        last-variable ;
In the INPUT statement, the first variable that you name (first-variable above) should be the first variable that SAS will encounter when reading a data line from left to right. The second variable you name (second-variable above) should be the second variable that SAS will encounter when reading a data line, and so on. The INPUT statement from the preceding height and weight study is reproduced here:

INPUT   SUB_NUM
        HEIGHT
        WEIGHT
        AGE ;
You can assign almost any name to a SAS variable, provided that you adhere to the rules for creating a SAS variable name. The rules for creating a SAS variable name are identical to the rules for creating a SAS data set name, and these rules were discussed in the section "The DATA Statement." That is, a SAS variable name must begin with a letter, it must not contain any embedded blanks, and so on.
The DATALINES Statement
The DATALINES statement tells SAS that the data set will begin on the next line. Here is the DATALINES statement from the preceding program, along with the first two lines of data:

DATALINES;
1 64 140 20
2 68 170 28

You use the DATALINES statement when you want to include the data as a part of the SAS program. However, this is not the only way to input data with SAS. For example, it is also possible to keep your data in a separate file, and refer to that file within your SAS program by using the INFILE statement (see the sketch at the end of this section). This approach has the advantage of allowing your SAS program to remain relatively short. The data sets used in this guide are fairly short, however. Therefore, to keep things simple, this guide includes the data set as part of the example SAS program, using the DATALINES statement. To learn how to use the INFILE statement, see Cody and Smith (1997), and Hatcher and Stepanski (1994, pp. 56–58).

The Data Lines
The data lines should be placed between the DATALINES statement (described above) and the null statement (to be described in the following section). Below are the data lines from the height and weight study, preceded by the DATALINES statement, and followed by the null statement (the semicolon at the end):

DATALINES;
1 64 140 20
2 68 170 28
3 74 210 20
4 60 110 32
5 64 130 22
6 68 170 23
7 65 140 22
8 65 140 22
9 68 160 22
;

The data sets in this guide are short and simple, with only one line of data for each subject. This should be adequate when you have collected data on only a few variables. When you collect data on a large number of variables, however, it will be necessary to use more than one line of data for each subject. This will require a more sophisticated approach to data input than the format-free approach used here. To learn about these more advanced approaches, see Hatcher and Stepanski (1994, pp. 31–51).
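To give a sense of how the INFILE alternative mentioned above looks, here is a minimal sketch of the height and weight program with the data kept in an external file. The file name 'HTWT.DAT' is hypothetical (the exact form of the file reference differs across operating systems), and this sketch is not one of the programs used elsewhere in this chapter:

DATA D1;
   INFILE 'HTWT.DAT';   * hypothetical external file holding the nine data lines ;
   INPUT   SUB_NUM
           HEIGHT
           WEIGHT
           AGE;
RUN;
PROC MEANS  DATA=D1;
   VAR  HEIGHT  WEIGHT  AGE;
RUN;

Notice that with this approach there is no DATALINES statement and no null statement; the data lines live in the external file rather than in the program itself.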
The Null Statement
The null statement is the shortest statement in SAS programming: it consists simply of a line with a semicolon, as shown here:

;

The null statement appears on the line following the end of the data set. It tells SAS that the data lines have ended. Here are the last two lines of the preceding data set, followed by the null statement:

8 65 140 22
9 68 160 22
;

Make sure that you place this semicolon by itself on the first line following the end of the data lines. A mistake that is often made by new SAS users is to instead place it at the end of the last line of data, as shown here:

8 65 140 22
9 68 160 22;

Do not do this; placing the semicolon at the end of the last line of data will usually result in an error. Make sure that you always place it alone on the first line following the last line of data.

The PROC Statement
The null statement, described in the previous section, tells SAS that the DATA step has ended. When the DATA step is complete, you can then request statistical procedures that will analyze your data set. You request these statistical procedures using PROC statements. For example, below is a reproduction of (a) the last two data lines for the height and weight study, (b) the null statement, and (c) the PROC MEANS statement that tells SAS to compute the means and other descriptive statistics for three variables in the data set:

8 65 140 22
9 68 160 22
;
PROC MEANS  DATA=D1;
   VAR  HEIGHT  WEIGHT  AGE;
TITLE1  'JANE DOE';
RUN;
This guide shows you how to use a variety of PROC statements to request descriptive statistics, correlations, t tests, analysis of variance, and other statistical procedures.
Example 4.2: A More Complex Data Set

Overview
The preceding section was designed to provide the "big picture" regarding the SAS DATA step. Now that you understand the fundamentals, you are ready to learn some of the details. This section describes a fictitious study from the discipline of political science, and shows you how to input the data that might be obtained from such a study. It shows you how to write a simple SAS program that will handle numeric variables, character variables, and missing data.

The Study
Suppose that you are interested in identifying variables that predict the size of financial donations that people make to political parties. You develop the following questionnaire:

1. Would you describe yourself as being generally conservative, generally liberal, or somewhere between these two extremes? (Please circle the number that represents your orientation)

   Generally                                        Generally
   Conservative    1   2   3   4   5   6   7        Liberal

2. Would you like to see the size of the federal government increased or decreased?

   Greatly                                          Greatly
   Decreased       1   2   3   4   5   6   7        Increased

3. Would you like to see the federal government assume an increased role or a decreased role in providing health care to our citizens?

   Decreased                                        Increased
   Role            1   2   3   4   5   6   7        Role

4. What is your political party? (Check one)

   ____ Democrat    ____ Republican    ____ Other

5. What is your sex?

   ____ Female    ____ Male

6. What is your age?  ___________ years old

7. During the past year, how much money have you donated to your political party?

   $ ________________
Data Set to Be Analyzed
The table of data. You administer this questionnaire to 11 people. Their responses are reproduced in Table 4.2.

Table 4.2
Data from the Political Donation Study
_______________________________________________________________
                Responses to
                 questions:
               ______________   Political
Subject        Q1   Q2   Q3     party       Sex   Age   Donation
_______________________________________________________________
01. Marsha      7    6    5     D           F     32    1000
02. Charles     2    2    3     R           M      .       0
03. Jack        3    4    3     .           M     45     100
04. Cindy       6    6    5     .           F     20       .
05. Cathy       5    4    5     D           F     31       0
06. Emmett      2    3    1     R           M     54       0
07. Edward      2    1    3     .           M     21     250
08. Eric        3    3    3     R           M     43       .
09. Susan       5    4    5     D           F     32     100
10. Freda       3    2    .     R           F     18       0
11. Richard     3    6    4     R           M     21      50
_______________________________________________________________
As was the case with the height and weight data set presented earlier, the horizontal rows of Table 4.2 represent individual subjects, and the vertical columns represent different variables. The first column is headed "Subject," and below this heading you will find a subject number (e.g., "01") and a first name (e.g., "Marsha") for each subject. Notice that the subject numbers are now two-digit numbers ranging from "01" to "11." Numbers that normally would be single-digit numbers (such as "1") have been converted to two-digit numbers such as "01." This will make it easier to keep columns of numbers lined up properly when you are typing these subject numbers as part of a SAS data set.

The first row presents questionnaire responses from subject #1, Marsha. Reading from left to right, you can see that Marsha
•  circled a "7" in response to Question 1
•  circled a "6" in response to Question 2
•  circled a "5" in response to Question 3
•  indicated that she is a democrat (this is reflected by the "D" in the column headed "Political party")
•  indicated that she is a female (reflected by the "F" in the column headed "Sex")
•  is 32 years old
•  donated $1,000 to her party (reflected by the "1000" in the "Donation" column).
The remaining rows of the table can be interpreted in the same way.

Using periods to represent missing data. The next-to-last column in Table 4.2 is headed "Age," and this column indicates each subject's age in years. For example, you can see that subject #1 (Marsha) is 32 years old. Where the row for subject #2 (Charles) intersects with the column headed "Age," you can see a period ("."). In this book, periods will be used to represent missing data. In this case, the period for subject #2 means that you do not have data on the Age variable for this subject. In conducting questionnaire research, you will often obtain missing data when subjects fail to complete certain questionnaire items.

When you review Table 4.2, you can see that there are missing data on other variables in addition to Age. For example,
•  Subject #3 (Jack) and subject #4 (Cindy) each have missing data for the Political party variable.
•  Subject #4 (Cindy) also has missing data on Donation.
•  Subject #10 (Freda) has missing data on Q3.

There are a few other periods in Table 4.2, and each of these periods similarly represents missing data. Later, when you write your SAS program, you will again use periods to represent missing data within the DATA step. When SAS reads a data line and encounters a period as a value, it interprets it as missing data.
The SAS DATA Step
Following is the DATA step of a SAS program that contains information from Table 4.2. You can see that all of the variables in Table 4.2 are included in this SAS data set, except for subjects' first names (such as "Marsha"). However, subject numbers such as "01" and "02" have been included as the first variable in the data set.

 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           Q1
 5           Q2
 6           Q3
 7           POL_PRTY $
 8           SEX $
 9           AGE
10           DONATION ;
11   DATALINES;
12   01 7 6 5 D F 32 1000
13   02 2 2 3 R M . 0
14   03 3 4 3 . M 45 100
15   04 6 6 5 . F 20 .
16   05 5 4 5 D F 31 0
17   06 2 3 1 R M 54 0
18   07 2 1 3 . M 21 250
19   08 3 3 3 R M 43 .
20   09 5 4 5 D F 32 100
21   10 3 2 . R F 18 0
22   11 3 6 4 R M 21 50
23   ;
The INPUT statement appears on lines 3–10 of the preceding program. It assigns the following SAS variable names:
•  The SAS variable name SUB_NUM will be used to represent each participant's subject number.
•  The SAS variable name Q1 will be used to represent subjects' responses to Question 1.
•  The SAS variable name Q2 will be used to represent subjects' responses to Question 2.
•  The SAS variable name Q3 will be used to represent subjects' responses to Question 3.
•  The SAS variable name POL_PRTY will be used to represent subjects' political party.
•  The SAS variable name SEX will be used to represent subjects' sex.
•  The SAS variable name AGE will be used to represent subjects' age.
•  The SAS variable name DONATION will be used to represent the size of subjects' donations to political parties.
The data set appears on lines 12–22 of the preceding program. You can see that it is identical to the data presented in Table 4.2, except that subject names have been omitted, and the columns of data have been moved together so that only one blank space separates each variable.

Some Rules for List Input
The list approach to data input is probably the easiest way to write an INPUT statement. However, there are a number of rules that you must observe to ensure that your data are read correctly by SAS. The most important of these rules are presented here.

The variables must appear on the data lines in the same sequence that they are listed in the INPUT statement. In the INPUT statement of the preceding program, the SAS variables were listed in this order: SUB_NUM Q1 Q2 Q3 POL_PRTY SEX AGE DONATION. This means that the variables must appear in exactly the same order on the data lines following the DATALINES statement.

Each variable on the data lines must be separated by at least one blank space. Below are the first three data lines from the preceding program:

DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M . 0
03 3 4 3 . M 45 100

The first data line was for subject #1 (Marsha), and so the first value on her line is "01" (her subject number). Marsha circled a "7" for Question 1, a "6" for Question 2, and so forth. It was necessary to leave one blank space between the "7" and the "6" so that SAS would read them as two separate values, rather than a single value of "76."

You will notice that the variables in the data lines of the preceding program were lined up so that each variable formed a neat, orderly column. Technically, this is not necessary with list input, but it is recommended because it increases the likelihood that you will leave at least one blank space between each variable, and it helps you avoid other errors in typing your data. When you have a large number of variables, it becomes awkward to leave one blank space between each variable. In these cases, it is better to enter each variable without the blank spaces and to use either the column input approach or the formatted input approach instead of the list approach to entering data. See Cody and Smith (1997) or Hatcher and Stepanski (1994) for details.
Each missing value must be represented by a single period. Data from the first three subjects in Table 4.2 are again reproduced here:

_______________________________________________________________
                Responses to
                 questions:
               ______________   Political
Subject        Q1   Q2   Q3     party       Sex   Age   Donation
_______________________________________________________________
01. Marsha      7    6    5     D           F     32    1000
02. Charles     2    2    3     R           M      .       0
03. Jack        3    4    3     .           M     45     100
_______________________________________________________________

You can see that there are some missing data in this table. For example, consider the second line of the table, which presents questionnaire responses for subject #2, Charles: in the column for "Age," there is a single period (.) where you would expect to find Charles' age. Similarly, the third line of the table presents responses for subject #3, Jack. You can see that there is a period for Jack in the "Political party" column.

If you are using the list approach to input, it is very important that you use a single period (.) to represent each instance of missing data when typing your data lines. As was mentioned earlier, SAS recognizes a single period as the symbol for missing data. For example, here again are the first three lines of data from the preceding SAS program:

DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M . 0
03 3 4 3 . M 45 100

The seventh variable (from the left) in this data set is the AGE variable. You can see that, on the second line of data, a period appears in the column for the AGE variable. The second line of data is for subject #2, Charles, and this period tells SAS that you have missing data on the AGE variable for Charles. Other periods in the data set may be interpreted in the same way.

It is important that you use only one period for each instance of missing data; do not, for example, use two periods simply because the relevant variable occupies two columns. As an illustration, the following lines show an incorrect way to indicate missing data for subject #2 on the age variable (the next-to-last variable):

DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M .. 0
03 3 4 3 . M 45 100

In the above incorrect example, the programmer keyed two periods in the place where the second subject's age would normally be typed. But this will cause problems: because there are two periods, SAS will assume that there is missing data on two variables: AGE, as well
as DONATION (the variable next to AGE). The point is simple: use a single period to represent a single instance of missing data, regardless of how many columns the variable occupies.

In the INPUT statement, use the $ symbol to identify character variables. All of the variables discussed in this guide are either numeric variables or character variables. Numeric variables consist exclusively of numbers: they do not contain any letters of the alphabet or any special characters (symbols such as *, %, #). In the preceding data set, AGE was an example of a numeric variable because it could assume only numeric values such as 32, 45, and 20. In contrast, character variables may consist of letters of the alphabet, special characters, or numbers. In the preceding data set, POL_PRTY was an example of a character variable because it could assume the values "D" (for democrats) or "R" (for republicans). SEX was also a character variable because it could assume the values "F" and "M."

By default, SAS assumes that all of your variables will be numeric variables. If a particular variable is a character variable, you must indicate this in your INPUT statement. You do this by placing the dollar symbol ($) after the name of the variable in the INPUT statement. Leave at least one blank space between the name of the variable and the $ symbol. For example, the INPUT statement from the preceding program is again reproduced here:

INPUT   SUB_NUM
        Q1
        Q2
        Q3
        POL_PRTY $
        SEX $
        AGE
        DONATION ;
In this program, the SAS variables Q1, Q2, and Q3 are numeric variables, so the $ symbol is not placed next to them. However, POL_PRTY is a character variable, and so the $ appears next to it. The same is true for the SEX variable.

If you are using the column input approach, you should type the $ symbol before indicating the columns in which the variable will appear. For example, here is the way the preceding INPUT statement would be typed if you were using the column input approach:

INPUT   SUB_NUM       1
        Q1            4
        Q2            6
        Q3            8
        POL_PRTY  $   10
        SEX       $   12
        AGE           14-15
        DONATION      17-20 ;
The preceding statement tells SAS that SUB_NUM appears in column 1, Q1 appears in column 4, Q2 appears in column 6, Q3 appears in column 8, POL_PRTY appears in column 10, and so on. The $ symbols next to POL_PRTY and SEX inform SAS that these variables are character variables.

Limit the values of character variables to eight characters. When using the format-free approach to inputting data, a value of a character variable can be no more than eight characters in length. Remember that the values are the actual entries that appear in the data lines. With a numeric variable, a value is usually the "score" that the subject displays on the variable. For example, the numeric variable AGE could assume values such as "32," "45," "20," and so on. With a character variable, a value is usually a name or an abbreviation consisting of letters or symbols. For example, the character variable POL_PRTY could assume the values "D" or "R." The character variable SEX could assume the values "F" or "M."

Suppose that you wanted to create a new character variable called NAME to include your subjects' first names. The values of this variable would be the subjects' first names (such as "Marsha"), and you would have to ensure that no name was over eight letters in length. Now, suppose that you drop the SUB_NUM variable, which assigns numeric subject numbers to each subject (such as "01," "02," and so on). Then you decide to replace SUB_NUM with your new character variable called NAME, which will consist of your subjects' first names. This NAME variable would be the first variable on each data line. Here is the INPUT statement for this revised program, along with the first few data lines:

INPUT   NAME $
        Q1
        Q2
        Q3
        POL_PRTY $
        SEX $
        AGE
        DONATION ;
DATALINES;
Marsha 7 6 5 D F 32 1000
Charles 2 2 3 R M . 0
Jack 3 4 3 . M 45 100

Notice that each value of NAME in the preceding program (such as "Marsha") is eight characters in length or shorter. This is acceptable.
However, the following data lines would not be acceptable, because the values of NAME are over eight characters in length:

DATALINES;
Elizabeth 7 6 5 D F 32 1000
Christopher 2 2 3 R M . 0
Francisco 3 4 3 . M 45 100

Remember also that the value of a character variable must not contain any embedded blanks. This means, for example, that you cannot have a blank space in the middle of a name, as is done with the following unacceptable data line:

Betty Lou  7 6 5 D F 32 1000
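(If you someday need character values longer than eight characters, one common remedy, not covered in this guide, is a LENGTH statement placed before the INPUT statement. The following minimal sketch assumes the NAME variable and the data set name used above; the length of 11 is arbitrary, chosen here to fit "Christopher":

DATA D1;
   LENGTH  NAME $ 11;       * store NAME with up to 11 characters ;
   INPUT   NAME $
           Q1
           Q2
           Q3
           POL_PRTY $
           SEX $
           AGE
           DONATION;
DATALINES;
Christopher 2 2 3 R M . 0
;

Note that a LENGTH statement does not solve the embedded-blank problem: a name such as "Betty Lou" would still be read as two separate values.)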
Avoid using hyphens in variable names. When listing SAS variable names in the INPUT statement, you should avoid creating any SAS variable names that include a hyphen, such as "AGE-YRS." This is because SAS usually reads a variable name containing a hyphen as a string variable (string variables were discussed in the section "Overview of Three Options for Writing the INPUT Statement"). Students learning SAS programming for the first time will sometimes write a SAS variable name that includes a hyphen, not realizing that this will cause SAS to search for a string variable. The result is often an error message and confusion.

Instead of using hyphens, it is good practice to use an underscore ("_") in SAS variable names. If you use an underscore, SAS will assume that the variable is a regular SAS variable, and not a string variable. For example, suppose that one of your variables is "age in years." You should not use the following SAS variable name to represent this variable, because SAS will interpret it as a string variable:

AGE-YRS

Instead, you can use an underscore in the variable name, like this:

AGE_YRS
Using PROC MEANS and PROC FREQ to Identify Obvious Problems with the Data Set

Overview
The DATA step is now complete, and you are finally ready to analyze the data set you have entered. This section shows you how to use two SAS procedures to analyze the data set:
•  PROC MEANS, which requests that the means and other descriptive statistics be computed for the numeric variables
•  PROC FREQ, which creates frequency tables for either numeric or character variables.

This section will show you the basic information that you need to know in order to use these two procedures. PROC MEANS and PROC FREQ are illustrated here so that you can perform some simple analyses to help verify that you created your data set correctly. A more detailed treatment of PROC MEANS and PROC FREQ will appear in the chapters to follow.

Adding PROC MEANS and PROC FREQ to the SAS Program
The syntax. Here is the syntax for requesting PROC MEANS and PROC FREQ:

PROC MEANS  DATA=data-set-name ;
   VAR  variable-list ;
TITLE1  'your-name' ;
PROC FREQ  DATA=data-set-name ;
   TABLES  variable-list ;
RUN;

In this guide, syntax is a template for a section of a SAS program. When you use syntax for guidance in writing a SAS program, you should adhere to the following guidelines:
•  If certain words are presented in uppercase type (capital letters) in the syntax, you should type those same words in your SAS program.
•  If certain words are presented in lowercase type in the syntax, you should not type those words in your SAS program. Instead, you should substitute the data set names, variable names, or key words that are appropriate for your specific analysis.
For example:

PROC MEANS  DATA=data-set-name ;

In the preceding line, PROC MEANS and DATA= are printed in uppercase type. Therefore, you should type these words in your program just as they appear in the syntax. However, the words "data-set-name" appear in lowercase italics. Therefore, you will not type the words "data-set-name." Instead, you will type the name of the data set that you wish to analyze in your specific analysis. For example, if you wish to analyze a data set that is named D1, you would write the PROC MEANS statement this way in your SAS program:

PROC MEANS  DATA=D1;
Most of the chapters in this guide will include syntax for performing different tasks with SAS. In each instance, you should follow the guidelines presented above for using the syntax. Variables that you will analyze. With the preceding syntax, the entry “variable-list” that appears with the VAR and TABLES statements refers to the list of variables that you want to analyze. In analyzing data from the political donation study, suppose that you will use PROC MEANS to analyze your numeric variables (such as Q1 and AGE), and PROC FREQ to analyze your character variables (POL_PRTY and SEX).
The SAS program. Following is the entire program for analyzing data from the political donation study. This time, statements have been appended to the end of the program to request PROC MEANS and PROC FREQ. Notice how the names of actual variables have been inserted in the locations where "variable-list" had appeared in the syntax that was presented above.

 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           Q1
 5           Q2
 6           Q3
 7           POL_PRTY $
 8           SEX $
 9           AGE
10           DONATION ;
11   DATALINES;
12   01 7 6 5 D F 32 1000
13   02 2 2 3 R M . 0
14   03 3 4 3 . M 45 100
15   04 6 6 5 . F 20 .
16   05 5 4 5 D F 31 0
17   06 2 3 1 R M 54 0
18   07 2 1 3 . M 21 250
19   08 3 3 3 R M 43 .
20   09 5 4 5 D F 32 100
21   10 3 2 . R F 18 0
22   11 3 6 4 R M 21 50
23   ;
24   PROC MEANS  DATA=D1;
25      VAR  Q1 Q2 Q3 AGE DONATION;
26   TITLE1  'JOHN DOE';
27   RUN;
28   PROC FREQ  DATA=D1;
29      TABLES  POL_PRTY SEX;
30   RUN;
Lines 24–27 of this program contain the statements that request the MEANS procedure. Line 25 contains the VAR statement for PROC MEANS. You use the VAR statement to list the variables to be analyzed. You can see that this statement requests that PROC MEANS be performed on Q1, Q2, Q3, AGE, and DONATION. Remember that you may list only numeric variables in the VAR statement for PROC MEANS; you may not list character variables (such as POL_PRTY or SEX).
Lines 28–30 of this program contain the statements that request the FREQ procedure. Line 29 contains the TABLES statement for PROC FREQ. You use this statement to list the variables for which frequency tables will be produced. You can see that PROC FREQ will be performed on POL_PRTY and SEX. In the TABLES statement for PROC FREQ, you may list either character variables or numeric variables.

The SAS Log
After the preceding program has been submitted and executed, you should first review the SAS log file to verify that it ran without error. The log file for the preceding program is reproduced as Log 4.1.

NOTE: SAS initialization used:
      real time           18.56 seconds

 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           Q1
 5           Q2
 6           Q3
 7           POL_PRTY $
 8           SEX $
 9           AGE
10           DONATION ;
11   DATALINES;

NOTE: The data set WORK.D1 has 11 observations and 8 variables.
NOTE: DATA statement used:
      real time           1.43 seconds

23   ;
24   PROC MEANS  DATA=D1;
25      VAR  Q1 Q2 Q3 AGE DONATION;
26   TITLE1  'JOHN DOE';
27   RUN;

NOTE: There were 11 observations read from the dataset WORK.D1.
NOTE: PROCEDURE MEANS used:
      real time           1.63 seconds

28   PROC FREQ  DATA=D1;
29      TABLES  POL_PRTY SEX;
30   RUN;

NOTE: There were 11 observations read from the dataset WORK.D1.
NOTE: PROCEDURE FREQ used:
      real time           0.61 seconds

Log 4.1. Log file from the political donation study.
Remember that the SAS log consists of your SAS program (minus the data), along with notes, warnings, and error messages generated by SAS as it executes your program. Lines 1–11 in Log 4.1 reproduce the DATA step of your SAS program. Immediately after this, the following note appeared in the log window:

NOTE: The data set WORK.D1 has 11 observations and 8 variables.
This note indicates that your SAS data set (named D1) has 11 observations and 8 variables. This is a good sign, because you intended to input data from 11 subjects on 8 variables. The remainder of the SAS log reveals no evidence of any problems in the SAS program, and so you can proceed to the SAS output file.

Interpreting the Results Produced by PROC MEANS
The SAS Output. The output file for the current analysis consists of two pages. Page 1 contains the results of PROC MEANS, and page 2 contains the results of PROC FREQ. The results of PROC MEANS are reproduced in Output 4.1.

                               JOHN DOE                               1

                          The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
Q1         11     3.7272727     1.7372915     2.0000000     7.0000000
Q2         11     3.7272727     1.7372915     1.0000000     6.0000000
Q3         10     3.7000000     1.3374935     1.0000000     5.0000000
AGE        10    31.7000000    12.2750877    18.0000000    54.0000000
DONATION    9   166.6666667   323.0711996             0       1000.00
----------------------------------------------------------------------
Output 4.1. Results of PROC MEANS, political donation study.
Once you have created a data set, it is a good idea to perform PROC MEANS on all numeric variables, and review the results for evidence of possible errors. It is especially important to review the information in the columns headed “N,” “Minimum,” and “Maximum.” Reviewing the number of valid observations. The first column in Output 4.1 is headed “Variable.” In this column you will find the names of the variables that were analyzed. You can see that, as expected, PROC MEANS was performed on Q1, Q2, Q3, AGE, and DONATION. The second column in Output 4.1 is headed “N.” This column indicates the number of valid observations that were found for each variable. Where the row for Q1 intersects with the column headed “N,” you will find the number “11.” This indicates that PROC MEANS analyzed 11 valid cases for the variable Q1, as expected. Where the row for Q3 intersects with the column headed “N,” you find the number “10,” meaning that there were only 10 usable observations for the Q3 variable. However, this does not necessarily mean that there was an error: if you review the actual data set (reproduced earlier), you will note that there is
one instance of missing data for Q3 (indicated by the single period in the column for Q3 for the next-to-last subject). Similarly, although Output 4.1 indicates only 9 valid observations for DONATION, this is no cause for concern because the data set itself shows that you had missing data for two people on this variable.

Reviewing the Minimum and Maximum columns. The fifth column in Output 4.1 is headed "Minimum." This column indicates the smallest value that was observed for each variable. The last column in the output is headed "Maximum," and this column indicates the largest value that was observed for each variable. The "Minimum" and "Maximum" columns are useful for determining whether any values are out of bounds. Out-of-bounds values are scores that are either too large or too small to be possible, given the type of variable that you are analyzing. If you find any out-of-bounds values, it probably means that you made an error, either in writing your INPUT statement or in typing your data.

For example, consider the variable Q1 in Output 4.1. Where the row for Q1 intersects with the column headed "Minimum," you see the value of 2. This means that the smallest value that the SAS System observed for Q1 was 2. Where the row for Q1 intersects with the column headed "Maximum," you see the value of 7. This means that the largest value that SAS read for Q1 was 7. Remember that the variable Q1 represents responses to a questionnaire item where the possible responses ranged from a low of 1 (Generally Conservative) to a high of 7 (Generally Liberal). If Output 4.1 showed a "Minimum" score of 0 for Q1, this would be an invalid, out-of-bounds score (because Q1 is not supposed to go any lower than 1). Such a result might mean that you made an error in keying your data. Similarly, if the output showed a Maximum score of 9 for Q1, this would also be an invalid score (because Q1 is not supposed to go any higher than 7). A review of minimum and maximum values in Output 4.1 does not reveal any out-of-bounds scores for any of the variables.
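If the Minimum or Maximum column does reveal a suspicious score, one quick way to locate the offending observation is to print only the rows whose values fall outside the valid range. This short sketch is not part of the original program; it assumes the data set D1 and the variable Q1 described above, with valid responses running from 1 to 7:

PROC PRINT  DATA=D1;
   WHERE  Q1 < 1 OR Q1 > 7;   * print only out-of-bounds scores on Q1 ;
RUN;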
Interpreting the Results Produced by PROC FREQ
The SAS Output. Because the results of PROC MEANS did not reveal any obvious problems with the data set, you can proceed to the results of PROC FREQ. These results are reproduced in Output 4.2.

                               JOHN DOE                               2

                          The FREQ Procedure

                                        Cumulative    Cumulative
POL_PRTY    Frequency     Percent       Frequency       Percent
-----------------------------------------------------------------
D                  3        37.50               3         37.50
R                  5        62.50               8        100.00

                   Frequency Missing = 3

                                        Cumulative    Cumulative
SEX         Frequency     Percent       Frequency       Percent
-----------------------------------------------------------------
F                  5        45.45               5         45.45
M                  6        54.55              11        100.00

Output 4.2. Results of PROC FREQ, political donation study.
Reviewing the frequency tables. Two tables appear in Output 4.2: a frequency table for the variable POL_PRTY, and a frequency table for the variable SEX. In the first table, the variable name POL_PRTY appears in the upper left corner, meaning that this is the frequency table for POL_PRTY (political party). Beneath this variable name are the two possible values that the variable could assume: “D” (for democrats) and “R” (for republicans). You should always review this list of values to verify that no invalid values appear there. For example, if the values for this table had included “T” along with “D” and “R,” it probably would indicate that you made an error in keying your data because “T” doesn’t stand for anything meaningful in this study. When typing character variables in a data set, case is important, so you must be consistent in using uppercase and lowercase letters. For example, when keying POL_PRTY, if you initially use an uppercase “D” to represent democrats, you should never switch to using a lowercase “d” within that data set. If you do, SAS will treat the uppercase “D” and the lowercase “d” as two completely different values. When you perform a PROC FREQ on the POL_PRTY variable, you will obtain one row of frequency information for subjects identified with the uppercase “D,” and a different row of frequency information for subjects identified with the lowercase “d.” In most cases, this will not be desirable.
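If you do discover that a character variable was keyed with inconsistent case, you do not necessarily have to retype the data. As a minimal sketch (not part of the original program, but assuming the data set D1 and the variable POL_PRTY from above), a short DATA step with the UPCASE function can standardize the values before the frequencies are computed:

DATA D2;
   SET D1;                           * start from the existing data set D1 ;
   POL_PRTY = UPCASE(POL_PRTY);      * convert values such as "d" to "D"  ;
RUN;
PROC FREQ  DATA=D2;
   TABLES  POL_PRTY;
RUN;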
The second column in the frequency table in Output 4.2 is headed "Frequency." This column indicates the number of subjects who were observed in each of the categories of the variable being analyzed. For example, where the row for the value "D" intersects with the column headed "Frequency," you can see the number "3." This means that 3 subjects were coded with a "D" in the data set. In other words, it means that 3 subjects were democrats. Where the row for the value "R" intersects with the column headed "Frequency," you see the number "5." This means that 5 subjects were coded with an "R" in the data set (i.e., 5 subjects were republicans).

Below the frequency table for the POL_PRTY variable, you can see the entry "Frequency Missing = 3". This section of the results produced by PROC FREQ indicates the number of observations with missing data for the variable being analyzed. This frequency missing entry for POL_PRTY indicates that there were three subjects with missing data for the political party variable. Whenever you create a new data set, you should always perform PROC FREQ on all character variables in this manner, to verify that the results seem reasonable. A warning sign, for example, would be a very large value for "Frequency Missing." For POL_PRTY, all of the results from PROC FREQ seem reasonable, indicating no obvious problems.

The second frequency table in Output 4.2 provides results for the SEX variable. It shows that 5 subjects were coded with an "F" (5 subjects were female), and 6 subjects were coded with an "M" (6 subjects were male). There is no "Frequency Missing" entry for the SEX table, which indicates that there were no missing data for this variable. These results, too, seem reasonable, and do not indicate any obvious problems with the DATA step so far.

Summary
In summary, whenever you create a new data set, you should perform a few simple descriptive analyses to verify that there were no obvious errors in writing the INPUT statement or in typing the data. At a minimum, this should include performing PROC MEANS on your numeric variables, and performing PROC FREQ on your character variables. PROC UNIVARIATE is also useful for performing descriptive analyses on numeric variables. The results produced by PROC UNIVARIATE are somewhat more complex than those produced by PROC MEANS; for this reason, it will be covered in Chapter 7, "Measures of Central Tendency and Variability."

If the results produced by PROC MEANS and PROC FREQ do not reveal any obvious problems, it does not necessarily mean that your data set is free of typos or other errors. An even more thorough approach to checking your data set involves using PROC PRINT to print out the raw data, so that you can proof every subject's value on every variable. The following section shows how to do this.
Using PROC PRINT to Create a Printout of Raw Data

Overview
The PRINT procedure (PROC PRINT) is useful for generating a printout of your raw data (i.e., a printout of your data as they appear in a SAS data set). You can use PROC PRINT to review each subject's score on each variable in your data set. Whenever you create a new data set, you should always use PROC PRINT to print out the raw data before doing any other, more sophisticated analyses. You should check the output created by PROC PRINT against your original data records to verify that SAS has read your data in the way that you intended.

The first part of this section shows you how to use PROC PRINT to print raw data for all variables in a data set. Later, this section shows how you can use the VAR statement to print raw data for a subset of variables.

Using PROC PRINT to Print Raw Data for All of the Variables In the Data Set
The Syntax. Here is the syntax for the PROC step that will cause the PRINT procedure to print the raw data for all variables in your data set:

PROC PRINT  DATA=data-set-name ;
TITLE1  'your-name' ;
RUN;

Here are the actual statements that you use with the PRINT procedure to print the raw data for the political donation study described above (a later section will show where these statements should go in your program):

PROC PRINT  DATA=D1;
TITLE1  'JOHN DOE';
RUN;
Output 4.3 shows the results that are generated by the preceding statements.
JOHN DOE                                                              1

Obs   SUB_NUM   Q1   Q2   Q3   POL_PRTY   SEX   AGE   DONATION

  1      1       7    6    5      D        F     32       1000
  2      2       2    2    3      R        M      .          0
  3      3       3    4    3               M     45        100
  4      4       6    6    5               F     20          .
  5      5       5    4    5      D        F     31          0
  6      6       2    3    1      R        M     54          0
  7      7       2    1    3               M     21        250
  8      8       3    3    3      R        M     43          .
  9      9       5    4    5      D        F     32        100
 10     10       3    2    .      R        F     18          0
 11     11       3    6    4      R        M     21         50

Output 4.3. Results of PROC PRINT performed on data from the political donation study (see Table 4.2).
Output created by PROC PRINT. For the most part, Output 4.3 presents a duplication of the data that appeared in Table 4.2, which was presented earlier in this chapter. The most obvious difference is the fact that the subject names that appeared in Table 4.2 do not appear in Output 4.3.

The first column, Obs (observation number), lists a unique observation number for each subject in the study. When the observations in a data set are individual subjects (as is the case with the current political donation study), the observation numbers are essentially subject numbers. This means that, in the row for observation #1, you will find data for your first subject (Marsha from Table 4.2); in the row for observation #2, you will find data for your second subject (Charles from Table 4.2), and so on. You probably remember that you did not include this Obs variable in the data set that you created. Instead, this Obs variable is automatically generated by SAS whenever you create a SAS data set. The next column, headed SUB_NUM, shows the "subject number" variable that was input as part of your SAS data set.

The column headed Q1 contains subject responses to question #1 from the political donation questionnaire that was presented earlier. Question #1 asked, "Would you describe yourself as being generally conservative, generally liberal, or somewhere between these two extremes?" Subjects could circle any number from 1 to 7 to indicate their response, where 1 = Generally Conservative and 7 = Generally Liberal. Under the heading of Q1 in Output 4.3, you can see that subject #1 circled a 7, subject #2 circled a 2, subject #3 circled a 3, and so on.

In the columns headed Q2 and Q3, you will find subject responses to question #2 and question #3 from the political donation questionnaire. These questions also used a 7-point response format. The output shows that subject #10 has a period (.) listed under the heading Q3. This means that this subject has missing data for question #3.
Under POL_PRTY, you will find subject values for the political party variable. You will remember that this was a character variable in which the value "D" represents democrats and "R" represents republicans. You can see that subject #3, subject #4, and subject #7 do not have any values for POL_PRTY. This is because they had missing data on the political party variable.

The column headed SEX indicates subject sex. This was a character variable in which "F" represented females and "M" represented males. The column headed AGE indicates subject age. The column headed DONATION indicates the amount of the financial donation made to a political party by each subject.

Using PROC PRINT to Print Raw Data for a Subset of Variables In the Data Set
Statements for the SAS Program. In some cases, you may wish to print raw data for only a few variables in a data set. When this is the case, you should use the VAR statement in conjunction with the PROC PRINT statement. In the VAR statement, list only the names of the variables that you want to print. Below is the syntax:

PROC PRINT  DATA=data-set-name ;
   VAR  variable-list ;
TITLE1  'your-name' ;
RUN;

For example, the following will cause PROC PRINT to print raw values only for the SEX and AGE variables:

PROC PRINT  DATA=D1;
   VAR  SEX AGE;
TITLE1  'JOHN DOE';
RUN;
Output created by PROC PRINT. Output 4.4 shows the results that are generated by the preceding statements.

JOHN DOE

Obs   SEX   AGE

  1    F     32
  2    M      .
  3    M     45
  4    F     20
  5    F     31
  6    M     54
  7    M     21
  8    M     43
  9    F     32
 10    F     18
 11    M     21

Output 4.4. Results of PROC PRINT in which only the SEX and AGE variables were listed in the VAR statement.
You can see that Output 4.4 is similar to Output 4.3, with the exception that Output 4.4 includes only three variables: Obs, SEX, and AGE. As was stated earlier, Obs is not entered by the SAS user as a part of the data set; instead, it is automatically generated by SAS.

A Common Misunderstanding Regarding PROC PRINT
Students learning the SAS System for the first time often misunderstand PROC PRINT: they sometimes assume that a SAS program must contain PROC PRINT in order to generate a paper printout of their results. This is not the case. PROC PRINT simply generates a printout of your raw data (i.e., subjects' individual scores for the variables in your data set). If you have performed some other SAS procedure such as PROC MEANS or PROC FREQ, you do not have to include PROC PRINT in your program to create a paper printout of the results generated by these procedures. Use PROC PRINT only when you want to generate a listing of each subject's values on the variables in your SAS data set.
The Complete SAS Program
To review: when you first create a SAS data set, it is very important to perform a few simple SAS procedures to verify that SAS read your data set as you intended. In most cases, this means that you should
•  perform PROC MEANS on all numeric variables in the data set (if any)
•  perform PROC FREQ on all character variables in the data set (if any)
•  perform PROC PRINT to print out the complete raw data set, including numeric and character variables.
These three procedures have been discussed separately in previous sections. However, it is often best to request all three procedures in the same SAS program when you have created a new data set. An example of such a program appears below. The program does the following:
1. inputs the political donation data set described earlier in this chapter
2. requests that PROC MEANS be performed on one subset of variables
3. requests that PROC FREQ be performed on a different subset of variables
4. includes a PROC PRINT statement that will cause the entire raw data set to be printed out (notice that the VAR statement has been omitted from the PROC PRINT section of the program):

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         Q1
         Q2
         Q3
         POL_PRTY $
         SEX $
         AGE
         DONATION ;
DATALINES;
01 7 6 5 D F 32 1000
02 2 2 3 R M  .    0
03 3 4 3 . M 45  100
04 6 6 5 . F 20    .
05 5 4 5 D F 31    0
06 2 3 1 R M 54    0
07 2 1 3 . M 21  250
08 3 3 3 R M 43    .
09 5 4 5 D F 32  100
10 3 2 . R F 18    0
11 3 6 4 R M 21   50
;
PROC MEANS DATA=D1;
   VAR Q1 Q2 Q3 AGE DONATION;
   TITLE1 'JOHN DOE';
RUN;
PROC FREQ DATA=D1;
   TABLES POL_PRTY SEX;
RUN;
PROC PRINT DATA=D1;
RUN;
Conclusion
This chapter focused on the list input approach to writing the INPUT statement. This is a relatively simple approach, and it will be adequate for the types of data sets that you will encounter in this Student Guide. For more complex data sets (e.g., data sets that include more than one line of data for each observation), you might want to learn more about formatted input. This approach is described and illustrated in Cody and Smith (1997), Hatcher (2001), and Hatcher and Stepanski (1994).

After you have prepared the DATA step of your SAS program, it is good practice to analyze the resulting data set with PROC FREQ (along with other procedures) to verify that there were no obvious errors in the INPUT statement. This chapter provided a quick introduction to PROC FREQ; next, Chapter 5, "Creating Frequency Tables," discusses the FREQ procedure in greater detail.
Creating Frequency Tables

Introduction.........................................................................................146
   Overview.............................................................................................146
   Why It Is Important to Use PROC FREQ..............................................146
Example 5.1: A Political Donation Study ...........................................147
   The Study ...........................................................................................147
   Data Set to Be Analyzed......................................................................148
   The DATA Step of the SAS Program ....................................................150
Using PROC FREQ to Create a Frequency Table................................152
   Writing the PROC FREQ Statement .....................................................152
   Output Produced by the SAS Program .................................................152
Examples of Questions That Can Be Answered by
   Interpreting a Frequency Table........................................................155
   The Frequency Table...........................................................................155
   The Questions.....................................................................................156
Conclusion...........................................................................................157
Introduction

Overview
In this chapter you learn how to use the FREQ procedure to create simple, one-way frequency tables. When you use PROC FREQ to analyze a specific variable, the resulting frequency table displays
•  the values for that variable that were observed in the sample that you analyzed
•  the frequency (number) of observations appearing at each value
•  the percent of observations appearing at each value
•  the cumulative frequency of observations appearing at each value
•  the cumulative percent of observations appearing at each value.
Some of the preceding statistics terms (e.g., cumulative frequency) may be new to you. Later sections of this chapter will explain these terms, and will show you how to interpret a frequency table created by the FREQ procedure.

Why It Is Important to Use PROC FREQ
After you have created a SAS data set, it is often a good idea to analyze it with PROC FREQ before going on to perform more sophisticated statistical analyses (such as analysis of variance). At a minimum, this will help you find errors in your data or program. In addition, with some types of investigations it is necessary to create a frequency table in order to answer research questions. For example, performing PROC FREQ on the correct data set can help you answer the research question "What percentage of the adult U.S. population favors the death penalty?"
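For instance, suppose a data set named SURVEY contained a character variable named FAVOR_DP, coded "Y" or "N" for each respondent's position on the death penalty (both names are hypothetical and used only for illustration). A frequency table of that variable would report the percentage directly, as this minimal sketch shows:

PROC FREQ DATA=SURVEY;   /* SURVEY and FAVOR_DP are hypothetical names */
   TABLES FAVOR_DP;      /* the Percent column gives the percentage answering Y or N */
   TITLE1 'JANE DOE';
RUN;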
Example 5.1: A Political Donation Study

The Study
Suppose that you are a political scientist conducting research on campaign finance. With your current study, you wish to identify the variables that predict the size of the financial donations that people make to political causes. You develop the following questionnaire:

   What is your political party (please check one):
   ____ Democrat
   ____ Republican
   ____ Independent

   What is your sex?
   ____ Female
   ____ Male

   What is your age? ______ years old

   During the past year, how much money have you donated to political causes? $____________

   Below are a number of statements with which you may agree or disagree. For each, please circle the number that indicates the extent to which you either agree or disagree with the statement. Please use the following format in making your responses:

      7 = Agree Very Strongly
      6 = Agree Strongly
      5 = Agree
      4 = Neither Agree nor Disagree
      3 = Disagree
      2 = Disagree Strongly
      1 = Disagree Very Strongly

                                                       Circle your response
                                                       --------------------
   1. I believe that our federal government is
      generally doing a good job.                       1  2  3  4  5  6  7
   2. The federal government should raise taxes.        1  2  3  4  5  6  7
   3. The federal government should do a better job
      of maintaining our interstate highway system.     1  2  3  4  5  6  7
   4. The federal government should increase social
      security benefits to the elderly.                 1  2  3  4  5  6  7
Data Set to Be Analyzed
Responses to the questionnaire. You administer this questionnaire to 22 individuals between the ages of 33 and 59. Table 5.1 contains subject responses to the questionnaire.

Table 5.1
Subject Responses to the Political Donation Questionnaire
___________________________________________________________________
                                                   Responses to
                                                   statements(b)
          Political                               ______________
Subject   party(a)    Sex    Age    Donation      Q1  Q2  Q3  Q4
___________________________________________________________________
  01         D         M     47        400         4   3   6   2
  02         R         M     36        800         4   6   6   6
  03         I         F     52        200         1   3   7   2
  04         R         M     47        300         3   2   5   3
  05         D         F     42        300         4   4   5   6
  06         R         F     44       1200         2   2   5   5
  07         D         M     44        200         6   2   3   6
  08         D         M     50        400         4   3   6   2
  09         R         F     49       2000         3   1   6   2
  10         D         F     33        500         3   4   7   1
  11         R         M     49        700         7   2   6   7
  12         D         F     59        600         4   2   5   6
  13         D         M     38        300         4   1   2   6
  14         I         M     55        100         5   5   6   5
  15         I         F     52          0         5   2   6   5
  16         D         F     48        100         6   3   4   6
  17         R         F     47       1500         2   1   6   2
  18         D         M     49        500         4   1   6   2
  19         D         F     43       1000         5   2   7   3
  20         D         F     44        300         4   3   5   7
  21         I         F     38        100         5   2   4   1
  22         D         F     47        200         3   7   1   4
___________________________________________________________________
(a) For the political party variable, "D" represents democrats, "R" represents republicans, and "I" represents independents.
(b) Responses to the four "agree-disagree" statements at the end of the questionnaire.
Understanding Table 5.1. In Table 5.1, the rows (running horizontally) represent different subjects, and the columns (running vertically) represent different variables.

The first column is headed "Subject." This variable simply assigns a unique subject number to each person who responded to the questionnaire. These subject numbers run from "01" to "22." The second column is headed "Political party." With this variable, the value "D" is used to represent democrats, "R" is used to represent republicans, and "I" is used to represent independents. The third column is headed "Sex." With this variable, the value "M" is used to represent male subjects, and the value "F" is used to represent female subjects. The fourth and fifth columns are headed "Age" and "Donation." These columns provide each subject's age and the size of the political donations they have made, respectively. For example, you can see that Subject 01 was 47 years old and made a donation of $400, Subject 02 was 36 years old and made a donation of $800, and so forth.

The last four columns of Table 5.1 appear under the major heading "Responses to statements." These columns contain subject responses to the four "agree-disagree" statements that appear in the previously mentioned questionnaire:
•  Column Q1 indicates the number that each subject circled in response to the statement "I believe that our federal government is generally doing a good job." You can see that Subject 01 circled "4" (which stands for "Neither Agree nor Disagree"), Subject 02 also circled "4," Subject 03 circled "1" (which stands for "Disagree Very Strongly"), and so on.
•  Column Q2 contains responses to the statement "The federal government should raise taxes."
•  Column Q3 contains responses to the statement "The federal government should do a better job of maintaining our interstate highway system."
•  Column Q4 contains responses to the statement "The federal government should increase social security benefits to the elderly."
The DATA Step of the SAS Program
Keying the DATA step. Now you include the data that appear in Table 5.1 as part of the DATA step of a SAS program. In doing this, you arrange the data in a way that is similar to the preceding table (i.e., the first column contains a unique subject number for each participant, the second column indicates the political party to which each subject belongs, and so on). Below is the DATA step for the SAS program that contains these data:

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            POL_PRTY $
 5            SEX $
 6            AGE
 7            DONATION
 8            Q1
 9            Q2
10            Q3
11            Q4 ;
12   DATALINES;
13   01 D M 47  400 4 3 6 2
14   02 R M 36  800 4 6 6 6
15   03 I F 52  200 1 3 7 2
16   04 R M 47  300 3 2 5 3
17   05 D F 42  300 4 4 5 6
18   06 R F 44 1200 2 2 5 5
19   07 D M 44  200 6 2 3 6
20   08 D M 50  400 4 3 6 2
21   09 R F 49 2000 3 1 6 2
22   10 D F 33  500 3 4 7 1
23   11 R M 49  700 7 2 6 7
24   12 D F 59  600 4 2 5 6
25   13 D M 38  300 4 1 2 6
26   14 I M 55  100 5 5 6 5
27   15 I F 52    0 5 2 6 5
28   16 D F 48  100 6 3 4 6
29   17 R F 47 1500 2 1 6 2
30   18 D M 49  500 4 1 6 2
31   19 D F 43 1000 5 2 7 3
32   20 D F 44  300 4 3 5 7
33   21 I F 38  100 5 2 4 1
34   22 D F 47  200 3 7 1 4
35   ;
Understanding the DATA step. Remember that if you were typing the preceding program, you would not actually type the line numbers that appear on the left; they are provided here for reference. With the preceding program, the INPUT statement appears on lines 3–11. This INPUT statement assigns the following SAS variable names to your variables:
•  The SAS variable name SUB_NUM is used to represent each subject's unique subject number (i.e., "01," "02," "03," and so on).
•  The SAS variable name POL_PRTY represents the political party to which the subject belongs. In typing your data, you used the value "D" to represent democrats, "R" to represent republicans, and "I" to represent independents. The dollar sign ($) to the right of this variable name indicates that it is a character variable.
•  The SAS variable name SEX represents each subject's sex, with the value "F" representing females and "M" representing males. Again, the dollar sign ($) to the right of this variable name indicates that it is also a character variable.
•  The SAS variable name AGE indicates the subject's age in years.
•  The SAS variable name DONATION indicates the size of the political donation (in dollars) that each subject made in the past year.
•  The SAS variable name Q1 indicates subject responses to the first question using the "agree-disagree" format. You keyed a "1" if the subject circled a "1" (for "Disagree Very Strongly"), you keyed a "2" if the subject circled a "2" (for "Disagree Strongly"), and so on.
•  In the same way, the SAS variable names Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions using the "agree-disagree" format.
Notice that at least one blank space was left between each variable in the data set. This is required when using the list, or free-formatted, approach to data input. With the data set typed in, you can now append PROC statements below the null statement (the lone semicolon that appears on line 35). You can use these PROC statements to create frequency tables that will help you understand more clearly the variables in the data set. The following section shows you how to do this.
Using PROC FREQ to Create a Frequency Table

Writing the PROC FREQ Statement
The syntax. Following is the syntax for the PROC FREQ statement (and related statements) that will create a simple frequency table:

PROC FREQ   DATA=data-set-name ;
   TABLES   variable-list ;
   TITLE1   'your-name' ;
RUN;

The second line of the preceding code is the TABLES statement. In this statement you list the names of the variables for which frequency tables should be created. If you list more than one variable, you should separate each variable name with at least one space. Listing more than one variable in the TABLES statement will cause SAS to create a separate frequency table for each variable. For example, the following statements create a frequency table for the variable AGE that appears in the preceding data set:

PROC FREQ DATA=D1;
   TABLES AGE;
   TITLE1 'JANE DOE';
RUN;
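As an aside, the following sketch illustrates the multiple-variable case: it requests three separate frequency tables (one each for POL_PRTY, SEX, and AGE) in a single PROC FREQ step:

PROC FREQ DATA=D1;
   TABLES POL_PRTY SEX AGE;   /* one frequency table per listed variable */
   TITLE1 'JANE DOE';
RUN;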
Output Produced by the SAS Program
The frequency table. The frequency table created by the earlier single-variable statements (TABLES AGE) is reproduced as Output 5.1. The remainder of this section shows how to interpret the various parts of the table.
JANE DOE                                                         1

                        The FREQ Procedure

                                       Cumulative    Cumulative
   AGE    Frequency     Percent        Frequency       Percent
   ------------------------------------------------------------
    33            1        4.55                1          4.55
    36            1        4.55                2          9.09
    38            2        9.09                4         18.18
    42            1        4.55                5         22.73
    43            1        4.55                6         27.27
    44            3       13.64                9         40.91
    47            4       18.18               13         59.09
    48            1        4.55               14         63.64
    49            3       13.64               17         77.27
    50            1        4.55               18         81.82
    52            2        9.09               20         90.91
    55            1        4.55               21         95.45
    59            1        4.55               22        100.00

Output 5.1. Results from the FREQ procedure performed on the variable AGE.
You can see that the frequency table consists of five vertical columns headed AGE, Frequency, Percent, Cumulative Frequency, and Cumulative Percent. The following sections describe the meaning of the information contained in each column.

The column headed with the variable name. The first column in a SAS frequency table is headed with the name of the variable that is being analyzed. You can see that the first column in Output 5.1 is labeled "AGE," the variable being analyzed here. The various values assumed by AGE appear in the first column under the heading "AGE." Reading from the top down in this column, you can see that, in this data set, the observed values of AGE were 33, 36, 38, and so on, through 59. This means that the youngest person in your data set was 33, and the oldest was 59.

The "Frequency" column. The second column in the table is headed "Frequency." This column reports the number of observations that appear at each value of the variable being analyzed. In the present case, it tells us how many subjects were at age 33, how many were at age 36, and so on. For example, the first value in the AGE column is 33. If you read to the right of this value, you will find information about how many people were at age 33. Where the row for "33" intersects with the column for "Frequency," you see the number "1." This means that just one person was at age 33. Now skip down two rows to the row for the age "38." Where the row for "38" intersects with the column headed "Frequency," you see the number "2." This means that two people were at age 38.
Reviewing various parts of the "Frequency" column reveals the following:
•  There were 3 people at age 44.
•  There were 4 people at age 47.
•  There was 1 person at age 59.
The next column is the "Percent" column. This column indicates the percent of observations appearing at each value. In the present case, it reveals the percent of people at age 33, the percent at age 36, and so on. A particular entry in the "Percent" column is equal to the corresponding value in the "Frequency" column, divided by the total number of usable observations in the data set. For example, where the row for the age of 33 intersects with the column headed "Percent," you see the entry 4.55. This means that 4.55% of the subjects were at age 33. This was computed by dividing the frequency of people at age 33 (which was "1") by the total number of usable observations (which was "22"). 1 divided by 22 is equal to .0455, or 4.55%. Now go down to the row for the age of 44. Where the row for 44 intersects with the column headed "Percent," you see the entry 13.64, meaning that 13.64% of the subjects were at age 44. This was computed by dividing the frequency of people at age 44 (which was "3") by the total number of usable observations (which was "22"). 3 divided by 22 is equal to .1364, or 13.64%.

The next column is the "Cumulative Frequency" column. A particular entry in the "Cumulative Frequency" column indicates the sum of
•  the number of observations scoring at the current value in the "Frequency" column, plus
•  the number of observations scoring at each of the preceding (lower) values in the "Frequency" column.
For example, look at the point where the row for AGE = 44 intersects with the column headed “Cumulative Frequency.” At that intersection, you see the number “9.” This means that a total of 9 people were at age 44 or younger. Next, look at the point where the row for AGE = 55 intersects with the column headed “Cumulative Frequency.” At that intersection, you see the number “21.” This means that a total of 21 people were at age 55 or younger. Finally, the last entry in the “Cumulative Frequency” column is “22,” meaning that 22 people were at age 59 or younger. It also means that a total of 22 people provided valid data on this AGE variable (the last entry in the “Cumulative Frequency” column always indicates the total number of usable observations for the variable being analyzed).
The last column is the "Cumulative Percent" column. A particular entry in the "Cumulative Percent" column indicates the sum of
•  the percent of observations scoring at the current value in the "Percent" column, plus
•  the percent of observations scoring at each of the preceding (lower) values in the "Percent" column.
For example, look at the point where the row for AGE = 44 intersects with the column headed "Cumulative Percent." At that intersection, you see the number "40.91." This means that 40.91% of the subjects were at age 44 or younger. Next, look at the point where the row for AGE = 55 intersects with the column headed "Cumulative Percent." At that intersection, you see the number "95.45." This means that 95.45% of the subjects were at age 55 or younger. Finally, the last entry in the "Cumulative Percent" column is "100.00," meaning that 100% of the people were at age 59 or younger. The last figure in the "Cumulative Percent" column will always be 100%.

Frequency missing. In some instances, you will see "Frequency missing = n" below the frequency table that was created by PROC FREQ (this entry does not appear in Output 5.1). It appears when the variable being analyzed has missing data for at least some observations; the number that follows indicates how many observations with missing data SAS encountered. For example, the following entry, after the frequency table for the variable AGE, would indicate that SAS encountered five subjects with missing data on the AGE variable:

Frequency missing = 5
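If you instead want the missing observations counted in the table itself, the MISSING option in the TABLES statement requests this; a minimal sketch:

PROC FREQ DATA=D1;
   TABLES AGE / MISSING;   /* treat missing values as a valid level of AGE */
   TITLE1 'JANE DOE';
RUN;

With this option, the missing observations appear as a row in the table and are included in the percent and cumulative percent calculations.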
Examples of Questions That Can Be Answered by Interpreting a Frequency Table

The Frequency Table
The frequency table for the AGE variable is reproduced again as Output 5.2. This output is identical to Output 5.1; the questions that follow refer to specific entries in it.
JANE DOE                                                         1

                        The FREQ Procedure

                                       Cumulative    Cumulative
   AGE    Frequency     Percent        Frequency       Percent
   ------------------------------------------------------------
    33            1        4.55                1          4.55
    36            1        4.55                2          9.09
    38            2        9.09                4         18.18
    42            1        4.55                5         22.73
    43            1        4.55                6         27.27
    44            3       13.64                9         40.91
    47            4       18.18               13         59.09
    48            1        4.55               14         63.64
    49            3       13.64               17         77.27
    50            1        4.55               18         81.82
    52            2        9.09               20         90.91
    55            1        4.55               21         95.45
    59            1        4.55               22        100.00

Output 5.2. Results from the FREQ procedure, reproduced for purposes of answering questions.
The Questions
The companion volume to this book, Step-by-Step Basic Statistics Using SAS: Exercises, provides exercises that enable you to review what you learned in this chapter by
•  entering a new data set
•  performing PROC FREQ on one of the variables in that data set
•  answering a series of questions about the frequency table created by PROC FREQ.
Here are examples of the types of questions that you will be asked to answer. Read each of the questions presented below, review the answer provided, and verify that you understand where the answer is found in Output 5.2. Also verify that you understand why that answer is correct. If you are confused by any of the following questions and answers, go back to the relevant section of this chapter and reread that section.
•  Question: What is the lowest observed value for the AGE variable? Answer: 33 (the first value in the AGE column of Output 5.2).
•  Question: What is the highest observed value for the AGE variable? Answer: 59 (the last value in the AGE column).
•  Question: How many people are 49 years old? (i.e., What is the frequency for people who displayed a value of 49 on the AGE variable?) Answer: Three (the Frequency entry in the row for AGE = 49).
•  Question: What percent of people are 38 years old? Answer: 9.09% (the Percent entry in the row for AGE = 38).
•  Question: How many people are 50 years old or younger? Answer: 18 (the Cumulative Frequency entry in the row for AGE = 50).
•  Question: What percent of people are 52 years old or younger? Answer: 90.91% (the Cumulative Percent entry in the row for AGE = 52).
•  Question: What is the total number of valid observations for the AGE variable in this data set? Answer: 22 (the last entry in the Cumulative Frequency column).
Conclusion
This chapter has shown you how to use PROC FREQ to create simple frequency tables. These tables provide the numbers that enable you to verbally describe the nature of your data; they allow you to make statements such as "Nine percent of the sample were 38 years of age" or "95% of the sample were age 55 or younger."

In some cases, it is more effective to use a graph to illustrate the nature of your data. For example, you might use a bar graph to indicate the frequency of subjects at various ages. Or you might use a bar graph to illustrate the mean age for male subjects versus female subjects. SAS provides a number of procedures that enable you to create bar graphs of this sort, as well as other types of graphs and charts. The following chapter introduces you to some of these procedures.
Creating Graphs

Introduction.........................................................................................160
   Overview.............................................................................................160
   High-Resolution versus Low-Resolution Graphics ................................160
   What to Do If Your Graphics Do Not Fit on the Page............................161
Reprise of Example 5.1: The Political Donation Study .......................161
   The Study ...........................................................................................161
   SAS Variable Names ...........................................................................161
Using PROC CHART to Create a Frequency Bar Chart.......................162
   What Is a Frequency Bar Chart? .........................................................162
   Syntax for the PROC Step ...................................................................163
   Creating a Frequency Bar Chart for a Character Variable ....................163
   Creating a Frequency Bar Chart for a Numeric Variable.......................165
   Creating a Frequency Bar Chart Using the LEVELS Option .................168
   Creating a Frequency Bar Chart Using the MIDPOINTS Option...........170
   Creating a Frequency Bar Chart Using the DISCRETE Option.............172
Using PROC CHART to Plot Means for Subgroups .............................174
   Plotting Means versus Frequencies .....................................................174
   The PROC Step ..................................................................................174
   Output Produced by the SAS Program ................................................176
Conclusion...........................................................................................177
Introduction

Overview
In this chapter you learn how to use the SAS System's CHART procedure to create bar charts. Most of the chapter focuses on creating frequency bar charts: figures in which the horizontal axis plots values for a variable, and the vertical axis plots frequencies. A bar for a particular value in a frequency bar chart indicates the number of observations that display that value in the data set. Frequency bar charts are useful for quickly determining which values are relatively common in a data set, and which values are less common. You will learn how to modify your bar charts by using the LEVELS, MIDPOINTS, and DISCRETE options.

The final section of this chapter shows you how to use PROC CHART to create subgroup-mean bar charts. These are figures in which the points on the horizontal axis represent different subgroups of subjects, and the vertical axis plots values on a selected quantitative variable. A bar for a particular group illustrates the mean score displayed by that group for the quantitative variable. Subgroup-mean bar charts are useful for quickly determining which groups scored relatively high on a quantitative variable, and which groups scored relatively low.

High-Resolution versus Low-Resolution Graphics
This chapter shows you how to create low-resolution graphics, as opposed to high-resolution graphics. The difference between low-resolution and high-resolution graphics is one of appearance: high-resolution graphics have a higher-quality, more professional look, and therefore are more appropriate for publication in a research journal. Low-resolution graphics are fine for helping you review and understand your data, but the quality of their appearance is generally not good enough for publication. This chapter presents only low-resolution graphics because the SAS programs requesting them are simpler, and they require only base SAS and SAS/STAT software, which most SAS users have access to. If you need to prepare high-resolution graphics, you need SAS/GRAPH software. For more information on producing high-quality figures with SAS/GRAPH, see SAS Institute Inc. (2000), and Carpenter and Shipp (1995).
What to Do If Your Graphics Do Not Fit on the Page
Most of the chapters in this book advise that you begin each SAS program with the following OPTIONS statement to control the size of your output page:

OPTIONS LS=80 PS=60;

The option PS=60 is an abbreviation for PAGESIZE=60. This requests that output be printed with 60 lines per page. With some computers and printers, however, these specifications will cause some figures (e.g., bar charts) to be too large to be printed on a single page. If you have charts that are broken across two pages, try reducing the page size to 50 lines by using the following OPTIONS statement:

OPTIONS LS=80 PS=50;
Reprise of Example 5.1: The Political Donation Study

The Study
This chapter will demonstrate how to use PROC CHART to analyze data from the fictitious political donation study that was presented in Chapter 5, "Creating Frequency Tables." In that chapter, the example involved research on campaign finance, with a questionnaire that was administered to 22 subjects. The results of the questionnaire provided demographic information about the subjects (e.g., sex, age), the size of political donations they had made recently, and some information regarding their political beliefs (sample item: "I believe that our federal government is generally doing a good job"). Subjects responded to these items using a seven-point response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly."

SAS Variable Names
When you typed your data, you used the following SAS variable names:
•  POL_PRTY represents the political party to which the subject belongs. In keying your data, you used the value "D" to represent democrats, "R" to represent republicans, and "I" to represent independents.
•  SEX represents the subject's sex, with the value "F" representing females and "M" representing males.
•  AGE represents the subject's age in years.
•  DONATION represents the size (in dollars) of the political donation that each subject made in the past year.
•  Q1 represents subject responses to the first question using the "agree-disagree" format. You typed a "1" if the subject circled a "1" for "Disagree Very Strongly," you typed a "2" if the subject circled a "2" for "Disagree Strongly," and so on.
•  Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions using the agree-disagree format.
Chapter 5, "Creating Frequency Tables," includes a copy of the questionnaire that was used to obtain the data. It also presents the complete SAS DATA step that inputs the data as a SAS data set. If you need to refamiliarize yourself with the questionnaire or the data set, review Example 5.1 in Chapter 5.
Using PROC CHART to Create a Frequency Bar Chart

What Is a Frequency Bar Chart?
Chapter 5 showed you how to use PROC FREQ to create a simple frequency table. These tables are useful for determining the number of people whose scores lie at each value of a given variable. In some cases, however, it is easier to get a sense of these frequencies by plotting them in a bar chart, rather than summarizing them in numerical form in a table. The SAS System's PROC CHART makes it easy to create this type of bar chart.

A frequency bar chart is a figure in which the horizontal axis plots values for a variable, and the vertical axis plots frequencies. A bar for a particular value in a frequency bar chart indicates the number of observations displaying that value in the data set. Frequency bar charts are useful for quickly determining which values are relatively common in a data set, and which values are less common. The following sections illustrate a variety of approaches for creating these charts.
Syntax for the PROC Step
Following is the syntax for the PROC step of a SAS program that requests a frequency bar chart with vertical bars:

PROC CHART DATA=data-set-name;
   VBAR variable-list / options ;
   TITLE1 'your-name';
RUN;

The second line of the preceding syntax presents the VBAR statement, which requests a vertical bar chart (use the HBAR statement for a horizontal bar chart). It is in the variable-list of the VBAR statement that you list the variables for which you want frequency bar charts. The VBAR statement ends with /options, the section in which you list particular options you want for the charts. Some of these options will be discussed in later sections.

Creating a Frequency Bar Chart for a Character Variable
The PROC step. In this example, you will create a frequency bar chart for the variable POL_PRTY. You will recall that this is the variable for the subject's political party: democrat, republican, or independent. Following are the statements that will request a vertical bar chart plotting frequencies for POL_PRTY:

PROC CHART DATA=D1;
   VBAR POL_PRTY;
   TITLE1 'JANE DOE';
RUN;

Where these statements should appear in the SAS program. Remember that the PROC step of a SAS program should generally come after the DATA step. Chapter 5 provided the DATA step for the political donation data that will be analyzed here. To give you a sense of where the PROC CHART statements should go, the following code shows the last few data lines from the data set, followed by the statements in the PROC step:

[Lines 1–30 of the DATA step presented in Chapter 5 would appear here]
19 D F 43 1000 5 2 7 3
20 D F 44  300 4 3 5 7
21 I F 38  100 5 2 4 1
22 D F 47  200 3 7 1 4
;
PROC CHART DATA=D1;
   VBAR POL_PRTY;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.1 shows the bar chart that is created by the preceding statements.

[Output 6.1 is a vertical frequency bar chart. The horizontal axis, labeled "POL_PRTY," shows the values D, I, and R; the bars have frequencies of 12, 4, and 6, respectively.]

Output 6.1. Results of PROC CHART performed on POL_PRTY.
The name of the variable that is being analyzed appears below the horizontal axis of the bar chart. You can see that POL_PRTY is the variable being analyzed in this case. The values that this variable assumed are used as labels for the bars in the bar chart. In Output 6.1, the value D (democrats) labels the first bar, I (independents) labels the second bar, and R (republicans) labels the last bar. Frequencies are plotted on the vertical axis of the bar chart. The height of a bar on the frequency axis indicates the frequencies associated with each value of the variable being analyzed. For example, Output 6.1 shows a frequency of 12 for subjects coded with a D, a frequency of 4 for subjects coded with an I, and a frequency of 6 for subjects coded with an R. In other words, this bar chart shows that there were 12 democrats, 4 independents, and 6 republicans in your data set.
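If you would rather read the same frequencies from horizontal bars, you can substitute the HBAR statement that was mentioned in the syntax discussion; a minimal sketch:

PROC CHART DATA=D1;
   HBAR POL_PRTY;   /* horizontal bars: one bar each for D, I, and R */
   TITLE1 'JANE DOE';
RUN;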
Creating a Frequency Bar Chart for a Numeric Variable
When you create a bar chart for a character variable, SAS will create a separate bar for each value that your character variable includes. You can see this in Output 6.1, where separate bars were created for the values D, I, and R. However, if you create a bar chart for a numeric variable (and the numeric variable assumes a relatively large number of values), SAS will typically "group" your data, and create a bar chart in which the various bars are labeled with the midpoint for each group.

The PROC step. Here are the statements necessary to create a frequency bar chart for the numeric variable AGE:

PROC CHART DATA=D1;
   VBAR AGE;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.2 shows the bar chart that is created by the preceding statements.

[Output 6.2 is a vertical frequency bar chart. The horizontal axis, labeled "AGE Midpoint," shows the midpoints 33, 39, 45, 51, and 57; the bars have frequencies of 1, 3, 9, 7, and 2, respectively.]

Output 6.2. Results of PROC CHART performed on AGE.
Notice that the horizontal axis is now labeled “AGE Midpoint.” The various bars in the chart are labeled with the midpoints for the groups that they represent. Table 6.1 summarizes the way that the AGE values were grouped:
Table 6.1
Criteria Used for Grouping Values of the AGE Variable
_____________________________________________________
Interval     An observation is placed in this
midpoint     interval if AGE scores fell in this range
_____________________________________________________
   33        30 ≤ AGE score < 36
   39        36 ≤ AGE score < 42
   45        42 ≤ AGE score < 48
   51        48 ≤ AGE score < 54
   57        54 ≤ AGE score < 60
_____________________________________________________
In Table 6.1, the first value under "Interval midpoint" is 33. To the right of this midpoint is the entry "30 ≤ AGE score < 36." This means that if a given value of AGE is greater than or equal to 30 and also is less than 36, it is placed into the interval that is identified with a midpoint of 33. The remainder of Table 6.1 can be interpreted in the same way.

The first bar in Output 6.2 shows that there is a frequency of 1 for the group identified with the midpoint of 33. This means that there was only one person whose age was in the interval from 30 to 35. The remaining bars in Output 6.2 show that
•  there was a frequency of 3 for the group identified with the midpoint of 39
•  there was a frequency of 9 for the group identified with the midpoint of 45
•  there was a frequency of 7 for the group identified with the midpoint of 51
•  there was a frequency of 2 for the group identified with the midpoint of 57.
Creating a Frequency Bar Chart Using the LEVELS Option
The preceding section showed that when you analyze a numeric variable with PROC CHART, SAS may group the values on that variable and identify each bar in the bar chart with the interval midpoint. But what if SAS does not group these values into the number of bars that you want? Is there a way to override the SAS System's default approach to grouping these values? Fortunately, there is. PROC CHART provides a number of options that enable you to control the number and nature of the bars that appear in the bar chart. For example, the LEVELS option provides one of the easiest approaches for controlling the number of bars that will appear.

The syntax. Here is the syntax for the PROC step in which the LEVELS option is used to control the number of bars:

PROC CHART DATA=data-set-name;
   VBAR variable-list / LEVELS=desired-number-of-bars ;
   TITLE1 'your-name';
RUN;

For example, suppose that you want to have exactly six bars in your chart. The following statements would request this:

PROC CHART DATA=D1;
   VBAR AGE / LEVELS=6;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.3 shows the chart that is created by the preceding statements.

[Output 6.3 is a vertical frequency bar chart with six bars. The horizontal axis, labeled "AGE Midpoint," shows the midpoints 32.5, 37.5, 42.5, 47.5, 52.5, and 57.5; the bars have frequencies of 1, 3, 5, 8, 3, and 2, respectively.]

Output 6.3. Results of PROC CHART using the LEVELS option.
Notice that there are now six bars in the chart, as requested in the PROC step. Notice also that the midpoints in the chart have been changed to accommodate the fact that there are now bars for six groups, rather than five.
Creating a Frequency Bar Chart Using the MIDPOINTS Option
PROC CHART also enables you to specify exactly what you want the midpoints to be for the various bars in the figure. You can do this by using the MIDPOINTS option in the VBAR statement. With this approach, you can either list the exact values that each midpoint should assume, or provide a range and an interval. Both approaches are illustrated below.

Listing the exact midpoints. If your bar chart will have a small number of bars, you might want to specify the exact value for each midpoint. Here is the syntax for the PROC step that specifies midpoint values:

PROC CHART DATA=data-set-name;
   VBAR variable-list / MIDPOINTS=desired-midpoints ;
   TITLE1 'your-name';
RUN;

For example, suppose that you want to use the midpoints 20, 30, 40, 50, 60, 70, and 80 for the bars in your chart. The following statements would request this:

PROC CHART DATA=D1;
   VBAR AGE / MIDPOINTS=20 30 40 50 60 70 80;
   TITLE1 'JANE DOE';
RUN;

Providing a range and an interval. Writing the MIDPOINTS option in the manner illustrated can be tedious if you want to have a large number of bars on your chart. In these situations, it may be easier to use the keywords TO and BY with the MIDPOINTS option. This allows you to specify the lowest midpoint, the highest midpoint, and the interval that separates each midpoint. For example, the following MIDPOINTS option asks SAS to create midpoints that range from 20 to 80, with each midpoint separated by an interval of 10 units:

PROC CHART DATA=D1;
   VBAR AGE / MIDPOINTS=20 TO 80 BY 10;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.4 shows the bar chart that is created by the preceding statements.

[Output 6.4 is a vertical frequency bar chart. The horizontal axis, labeled "AGE Midpoint," shows the requested midpoints 20, 30, 40, 50, 60, 70, and 80. Only the midpoints 30, 40, 50, and 60 have bars (frequencies of 1, 8, 11, and 2, respectively); the remaining midpoints have frequencies of zero.]

Output 6.4. Results of PROC CHART using the MIDPOINTS option.
You can see that all of the requested midpoints appear on the horizontal axis of Output 6.4. This is the case despite the fact that some of the midpoints have no bar at all, indicating a frequency of zero. For example, the last midpoint on the horizontal axis is 80, but there is no bar for this group. This is because, as you may remember, the oldest subject in your data set was 59 years old; thus there were no subjects in the group with a midpoint of 80.
Creating a Frequency Bar Chart Using the DISCRETE Option
Earlier sections have shown that when you use PROC CHART to create a frequency bar chart for a character variable (such as POL_PRTY or SEX), it will automatically create a separate bar for each value that the variable includes. However, it will not typically do this with a numeric variable; when numeric variables assume many values, PROC CHART will normally group the data, and label the axis with the midpoints for each group. But what if you want to create a separate bar for each observed value of your numeric variable? In this case, you simply specify the DISCRETE option in the VBAR statement.

The syntax. Here is the syntax for the PROC step that will cause a separate bar to be printed for each observed value of a numeric variable:

PROC CHART DATA=data-set-name;
   VBAR variable-list / DISCRETE ;
   TITLE1 'your-name';
RUN;

The following statements again create a frequency bar chart for AGE. This time, however, the DISCRETE option is used to create a separate bar for each observed value of AGE:

PROC CHART DATA=D1;
   VBAR AGE / DISCRETE;
   TITLE1 'JANE DOE';
RUN;
Output produced by the SAS program. Output 6.5 shows the bar chart that is created by the preceding statements.

[Output 6.5 is a vertical frequency bar chart with one (narrower) bar for each observed value of AGE: 33, 36, 38, 42, 43, 44, 47, 48, 49, 50, 52, 55, and 59. The tallest bar, for age 47, has a frequency of 4; the bars for ages 44 and 49 have frequencies of 3; the bars for ages 38 and 52 have frequencies of 2; each remaining bar has a frequency of 1.]

Output 6.5. Results of PROC CHART using the DISCRETE option.
In Output 6.5, the bars are narrower than in the previous examples because there are more of them. There is now one bar for each value that AGE actually assumed in the data set. You can see that the first bar indicates the number of people at age 33, the second bar indicates the number of people at age 36, and so on. Notice that there are no bars labeled 34 or 35, as there were no people at these ages in your data set.
Using PROC CHART to Plot Means for Subgroups

Plotting Means versus Frequencies
Earlier sections of this chapter have shown how to use PROC CHART to create frequency bar charts. However, it is also possible to use PROC CHART to create subgroup-mean bar charts. These are figures in which the bars represent the means for various subgroups according to a particular, quantifiable criterion.

For example, consider the political donation questionnaire that was presented in Chapter 5. One of the items on the questionnaire was "I believe that our federal government is generally doing a good job." Subjects were asked to indicate the extent to which they agreed or disagreed with this statement by circling a number from 1 ("Disagree Very Strongly") to 7 ("Agree Very Strongly"). Responses to this question were given the SAS variable name Q1 in the SAS program. A separate item on the questionnaire asked subjects to indicate whether they were democrats, republicans, or independents. Responses to this question were given the SAS variable name POL_PRTY.

It would be interesting to see if there are any differences between the ways that democrats, republicans, and independents responded to the question about whether the federal government is doing a good job (variable Q1). In fact, it is possible to compute the mean score on Q1 for each of these subgroups, and to plot these subgroup means on a chart. The following section shows how to do this with PROC CHART.

The PROC Step
Following is the syntax for the PROC CHART statements that create a subgroup-mean bar chart:

PROC CHART DATA=data-set-name;
   VBAR group-variable / SUMVAR=criterion-variable TYPE=MEAN;
   TITLE1 'your-name';
RUN;

The second line of the preceding syntax includes the following:

VBAR group-variable

Here, group-variable refers to the SAS variable that codes group membership. In your program, you list POL_PRTY as the group variable because POL_PRTY indicates whether a given subject is a democrat, a republican, or an independent (the three groups that you want to compare).
The second line of the syntax also includes the following:

/ SUMVAR=criterion-variable

The slash (/) indicates that options will follow. You use the SUMVAR= option to identify the criterion variable in your analysis. This criterion variable is the variable on which means will be computed. In the present example, you want to compute mean scores on Q1, the item asking whether the federal government is doing a good job. Therefore, you will include SUMVAR=Q1 in your final program.

Finally, the second line of the syntax ends with

TYPE=MEAN;

This option specifies that PROC CHART should compute group means on the criterion variable, as opposed to computing group sums on the criterion variable. If you wanted to compute group sums on the criterion variable, this option would be typed as

TYPE=SUM;

Following are the statements that request that PROC CHART create a bar chart to plot means for the three groups on the variable Q1:

PROC CHART DATA=D1;
   VBAR POL_PRTY / SUMVAR=Q1 TYPE=MEAN;
   TITLE1 'JANE DOE';
RUN;
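As a variation, the following sketch uses TYPE=SUM with DONATION as the criterion variable; it plots the total dollars donated within each political party rather than a mean:

PROC CHART DATA=D1;
   VBAR POL_PRTY / SUMVAR=DONATION TYPE=SUM;   /* bar height = total donations per party */
   TITLE1 'JANE DOE';
RUN;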
Output Produced by the SAS Program
Output 6.6 shows the bar chart that is created by the preceding PROC step.

[Output 6.6 is a vertical bar chart. The horizontal axis, labeled "POL_PRTY," shows the groups D, I, and R; the vertical axis, labeled "Q1 Mean," plots each group's mean score on Q1. The bar for D rises just above 4, the bar for I reaches about 4, and the bar for R falls just below 4.]

Output 6.6. Results of PROC CHART in which the criterion variable is Q1 and the grouping variable is POL_PRTY.
When this type of analysis is performed, the name of the grouping variable appears as the label for the horizontal axis. Output 6.6 shows that POL_PRTY is the grouping variable for the current analysis. The values that POL_PRTY can assume appear as labels for the individual bars in the chart. The labels for the three bars are D (democrats), I (independents), and R (republicans). In the frequency bar charts that were presented earlier in this chapter, the vertical axis plotted frequencies. However, with the type of analysis reported here, the vertical axis reports mean scores for the various groups on the criterion variable. The heading for the vertical axis is now Q1 Mean, meaning that you use this axis to determine each group’s mean score on the criterion variable, Q1.
The height of a particular bar on the vertical axis indicates the mean score for that group on Q1 (the question about whether the federal government is doing a good job). Output 6.6 shows that
•  the democrats (the bar labeled D) had a mean score that was just above 4.0 on the criterion variable, Q1
•  the independents (the bar labeled I) had a mean score that was about 4.0
•  the republicans (the bar labeled R) had a mean score that was just below 4.0.
Remember that 1 represented “Disagree Very Strongly” and 7 represented “Agree Very Strongly.” A score of 4 represented “Neither Agree nor Disagree.” The mean scores presented in Output 6.6 show that all three groups had a mean score close to 4.0, meaning that their mean scores were all close to the response “Neither Agree nor Disagree.” The mean score for the democrats was a bit higher (a bit closer to “Agree”), and the mean score for the republicans was a bit lower (a bit closer to “Disagree”), although we have no way of knowing whether these differences are statistically significant at this point.
Conclusion This chapter has shown you how to use PROC CHART to create frequency bar charts and subgroup-mean bar charts. Summarizing results graphically in charts such as these can make it easier for you to identify trends in your data at a glance. These figures can also make it easier to communicate your findings to others. When presenting your findings, it is also common to report measures of central tendency (such as the mean or the median) and measures of variability (such as the standard deviation or the interquartile range). Chapter 7, "Measures of Central Tendency and Variability," shows you how to use PROC MEANS and PROC UNIVARIATE to compute these measures, along with other measures of central tendency and variability. Chapter 7 also shows you how to create stem-and-leaf plots that can be reviewed to determine the general shape of a particular distribution of scores.
Measures of Central Tendency and Variability

Introduction.........................................................................................181
   Overview.............................................................................................181
   Why It Is Important to Compute These Measures.................................181
Reprise of Example 5.1: The Political Donation Study.......................181
   The Study ...........................................................................................181
   SAS Variable Names ...........................................................................182
Measures of Central Tendency: The Mode, Median, and Mean.........183
   Overview.............................................................................................183
   Writing the SAS Program.....................................................................183
   Output Produced by the SAS Program ................................................184
   Interpreting the Mode Computed by PROC UNIVARIATE....................185
   Interpreting the Median Computed by PROC UNIVARIATE .................186
   Interpreting the Mean Computed by PROC UNIVARIATE....................186
Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE ...187
   Overview.............................................................................................187
   Output Produced by the SAS Program ................................................187
Using PROC UNIVARIATE to Determine the Shape of
   Distributions ...................................................................................190
   Overview.............................................................................................190
   Variables Analyzed .............................................................................190
   An Approximately Normal Distribution .................................................191
   A Positively Skewed Distribution..........................................................194
   A Negatively Skewed Distribution ........................................................196
   A Bimodal Distribution.........................................................................198
Simple Measures of Variability: The Range, the Interquartile
   Range, and the Semi-Interquartile Range ......................................200
   Overview.............................................................................................200
   The Range ..........................................................................................200
   The Interquartile Range ......................................................................202
   The Semi-Interquartile Range .............................................................203
More Complex Measures of Variability: The Variance and
   Standard Deviation ..........................................................................204
   Overview.............................................................................................204
   Relevant Terms and Concepts.............................................................204
   Conceptual Formula for the Population Variance..................................205
   Conceptual Formula for the Population Standard Deviation .................206
Variance and Standard Deviation: Three Formulas ..........................207
   Overview.............................................................................................207
   The Population Variance and Standard Deviation ................................207
   The Sample Variance and Standard Deviation .....................................208
   The Estimated Population Variance and Standard Deviation................209
Using PROC MEANS to Compute the Variance and
   Standard Deviation .........................................................................210
   Overview.............................................................................................210
   Computing the Sample Variance and Standard Deviation ....................210
   Computing the Population Variance and Standard Deviation ...............212
   Computing the Estimated Population Variance and Standard
      Deviation ........................................................................................212
Conclusion...........................................................................................214
Introduction

Overview

This chapter shows you how to perform simple procedures that help describe and summarize data. You will learn how to use PROC UNIVARIATE to compute the measures of central tendency that are most frequently used in research: the mode, the median, and the mean. You will also learn how to create and interpret a stem-and-leaf plot, which can be helpful in understanding the shape of a variable's distribution. Finally, you will learn how to use PROC MEANS to compute the variance and standard deviation of quantitative variables. You will learn how to compute the sample standard deviation and variance, as well as the estimated population standard deviation and variance.

Why It Is Important to Compute These Measures

There are a number of reasons why you need to be able to perform these procedures. At a minimum, the output produced by PROC MEANS and PROC UNIVARIATE will help you verify that you made no obvious errors in entering data or writing your program. It is always important to verify that your data are correct before going on to analyze them with more sophisticated procedures. In addition, most research journals require that you report simple descriptive statistics (e.g., means and standard deviations) for the variables you analyze. Finally, many of the later chapters in this guide build on the more basic concepts presented here. For example, in Chapter 9 you will learn how to create a type of standardized variable called a "z score" by using standard deviations, a concept taught in this chapter.
Reprise of Example 5.1: The Political Donation Study

The Study

This chapter illustrates PROC UNIVARIATE and PROC MEANS by analyzing data from the fictitious political donation study presented in Chapter 5, "Creating Frequency Tables." In that chapter, you were asked to suppose that you are a political scientist conducting research on campaign finance. You developed a questionnaire and administered it to 22 individuals. With the questionnaire, subjects provided demographic information about themselves (e.g., sex, age), indicated the size of political donations they had made recently, and responded to four items designed to assess some of their political beliefs (sample item: "I believe that our federal government is generally doing a good job"). Subjects responded to these items using a 7-point response format in which "1" = "Disagree Very Strongly" and "7" = "Agree Very Strongly."
SAS Variable Names

In entering your data, you used the following SAS variable names to represent the variables measured:

•  The SAS variable SUB_NUM contains unique subject numbers assigned to each subject.

•  The SAS variable POL_PRTY represents the political party to which the subject belongs. In entering your data, you used the value "D" to represent Democrats, "R" to represent Republicans, and "I" to represent independents.

•  The SAS variable SEX represents the subject's sex, with the value "F" representing females and "M" representing males.

•  The SAS variable AGE represents the subject's age in years.

•  The SAS variable DONATION represents the size of the political donation (in dollars) that each subject made in the past year.

•  The SAS variable Q1 represents subject responses to the first question using the "Agree-Disagree" format. You typed a "1" if the subject circled a "1" (for "Disagree Very Strongly"), you typed a "2" if the subject circled a "2" (for "Disagree Strongly"), and so forth.

•  In the same way, the SAS variables Q2, Q3, and Q4 represent subject responses to the second, third, and fourth questions using the "Agree-Disagree" format.
Chapter 5, "Creating Frequency Tables," provides a copy of the questionnaire that was used to obtain the preceding data. It also presents the complete SAS DATA step that reads the data in as a SAS data set.
Measures of Central Tendency: The Mode, Median, and Mean

Overview

When you assess some numeric variable (such as subject age) in the course of conducting a research study, you will typically obtain a variety of different scores––a distribution of scores. If you write an article about your study for a research journal, you will need some mechanism to describe your obtained distribution of scores. Readers will want to know the most "typical" or "representative" score on your variable; in the present case, they would want to know the most typical or representative age in your sample. To convey this, you will probably report one or more measures of central tendency.

A measure of central tendency is a value or number that represents the location of a sample of data on a continuum by revealing where the center of the distribution is located in that sample. There are a variety of different measures of central tendency, and each uses a somewhat different approach for determining just what the "center" of the distribution is. The measures of central tendency that are used most frequently in the behavioral sciences and education are the mode, the median, and the mean. This section discusses the differences among these measures, and shows how to compute them using PROC UNIVARIATE.

Writing the SAS Program

The PROC step. The UNIVARIATE procedure in SAS can produce a wide variety of indices for describing a distribution. These include the sample size, the standard deviation, the skewness, the kurtosis, percentiles, and other indices. This section, however, focuses on just three: the mode, the median, and the mean. Below is the syntax for the PROC step that requests PROC UNIVARIATE:

   PROC UNIVARIATE   DATA=data-set-name   options ;
      VAR variable-list ;
      TITLE1 'your-name';
   RUN;
This chapter illustrates how to use a PROC UNIVARIATE statement that requests the usual default statistics, along with two options: the PLOT option (which requests a stem-and-leaf plot) and the NORMAL option (which requests statistics that test the null hypothesis that the sample data were drawn from a normally distributed population). The current section discusses only the default output (which includes the mode, median, and mean). Later sections cover the stem-and-leaf plot and the tests for normality.
Here are the actual statements requesting that PROC UNIVARIATE be performed on the variable AGE:

   PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;
Where these statements should appear in the SAS program. Remember that the PROC step of a SAS program should generally come after the DATA step. Chapter 5, "Creating Frequency Tables," provided the DATA step for the political donation data that will be analyzed here. To give you a sense of where the PROC UNIVARIATE step should go, here is a reproduction of the last few data lines from the data set, followed by the statements in the current PROC step:

   [Lines 1-30 of the DATA step presented in Chapter 5 would appear here]
   20 D F 44 300 4 3 5 7
   21 I F 38 100 5 2 4 1
   22 D F 47 200 3 7 1 4
   ;
   PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;
Output Produced by the SAS Program

The preceding statements produced two pages of output that provide plenty of information about the variable being analyzed. The output provides:

•  a moments table that includes the sample size, mean, standard deviation, variance, skewness, and kurtosis, along with other statistics

•  a basic statistical measures table that includes measures of location (the mean, median, and mode), as well as measures of variability (the standard deviation, variance, range, and interquartile range)

•  a tests for normality table that includes statistical tests of the null hypothesis that the sample was drawn from a normally distributed population

•  a quantiles table that provides the median, 25th percentile, 75th percentile, and related information

•  an extremes table that provides the five highest values and five lowest values on the variable being analyzed

•  a stem-and-leaf plot, box plot, and normal probability plot.
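If you do not need every table, you can limit the printed output with an ODS SELECT statement. Here is a minimal sketch (not part of the original program); it assumes the ODS table names Moments and BasicMeasures, which are the names SAS assigns to the first two tables in the list above:

   ODS SELECT Moments BasicMeasures;   * print only these two tables ;
   PROC UNIVARIATE   DATA=D1;
      VAR AGE;
   RUN;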
Interpreting the Mode Computed by PROC UNIVARIATE

The mode (also called the modal score) is the most frequently occurring value or score in a sample. Technically, the mode can be assessed with either quantitative or nonquantitative (nominal-level) variables, although this chapter focuses on quantitative variables because the UNIVARIATE procedure is designed for quantitative variables only. When you are working with numeric variables, the mode is a useful measure of central tendency to report when the distribution has more than one mode, is skewed, or is markedly nonnormal in some other way.

PROC UNIVARIATE prints the mode as part of the "Basic Statistical Measures" table in its output. The Basic Statistical Measures table from the current analysis of the variable AGE is reproduced here as Output 7.1.

                    Basic Statistical Measures

       Location                   Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000
Output 7.1. The mode as it appears in the Basic Statistical Measures table from PROC UNIVARIATE; the variable analyzed is AGE.
The mode appears on the last line of the “Location” section of the table. You can see that, for this data set, the mode is 47. This means that the most frequently occurring score on AGE was 47. You can verify that this is the case by reviewing output reproduced earlier in this guide, in Chapter 6, “Creating Graphs.” Output 6.5 contains the results of running PROC CHART on AGE; it shows that the value “47” has the highest frequency. A word of warning about the mode as computed by PROC UNIVARIATE: when there is more than one mode for a given variable, PROC UNIVARIATE prints only the mode with the lowest numerical value. For example, suppose that there were two modes for the current data set: imagine that 10 people were at age 25, and 10 additional people were at age 35. This means that the two most common scores on AGE would be 25 and 35. PROC UNIVARIATE would report only one mode for this variable: it would report 25 (because 25 was the mode with the lowest numerical value). When you have more than one mode, a note at the bottom of the Basic Statistical Measures table indicates the number of modes that were observed. This situation is discussed in the section, "A Bimodal Distribution," that appears later in this chapter. See Output 7.12.
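Note also that PROC UNIVARIATE will not list the scores associated with the remaining modes. If you need to identify every mode, one simple cross-check (a suggestion, not part of the original program) is to run PROC FREQ on the same variable and scan the frequency table for the scores with the highest counts:

   PROC FREQ DATA=D1;
      TABLES AGE;   * the modes are the values with the largest frequency ;
   RUN;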
Interpreting the Median Computed by PROC UNIVARIATE

The median (also called the median score) is the score located at the 50th percentile. This means that the median is the score below which 50% of all data appear. For example, suppose that you administer a test worth 100 points to a very large sample of students. If 50% of the students obtain a score below 71 points, then the median is 71.

The median can be computed only from some type of quantitative data. It is a particularly useful measure of central tendency when you are working with an ordinal (ranked) variable. It is also useful when you are working with interval- or ratio-level variables that display a skewed distribution.

Along with the mode, PROC UNIVARIATE also prints the median as part of the Basic Statistical Measures table in its output. The table from the present analysis of the variable AGE is reproduced as Output 7.2.

                    Basic Statistical Measures

       Location                   Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000
Output 7.2. The median as it appears in the Basic Statistical Measures table from PROC UNIVARIATE; the variable analyzed is AGE.
Output 7.2 shows that the median for the current data set is 47. You can see that, in this data set, the median and the mode happen to be the same number: 47. This result is not unusual, especially when the data set has a symmetrical distribution.

Interpreting the Mean Computed by PROC UNIVARIATE

The mean is the score that is located at the mathematical center of a distribution. It is computed by (a) summing the scores and (b) dividing by the number of observations. The mean is useful as a measure of central tendency for numeric variables that are assessed at the interval or ratio level, particularly when they display fairly symmetrical, unimodal distributions (later, you will see that the mean can be dramatically affected when a distribution is skewed). You may have noticed that the mean was printed as part of the Basic Statistical Measures table in Outputs 7.1 and 7.2. The same mean is also printed as part of the "Moments" table produced by PROC UNIVARIATE. The moments table from the current PROC UNIVARIATE analysis of AGE appears here in Output 7.3.
                              Moments

   N                         22    Sum Weights                 22
   Mean              46.0454545    Sum Observations          1013
   Std Deviation     6.19890369    Variance            38.4264069
   Skewness          -0.2047376    Kurtosis            0.19131126
   Uncorrected SS         47451    Corrected SS        806.954545
   Coeff Variation   13.4625746    Std Error Mean      1.32161071
Output 7.3. The N (sample size) and the mean as they appear in the Moments table from PROC UNIVARIATE; the variable analyzed is AGE.
Output 7.3 provides many statistics from the analysis of AGE, but this section focuses on just two. First, to the right of the heading “N” you will find the number of valid (usable) observations on which these analyses were based. Here, you can see that N = 22, which means that scores on AGE were analyzed for 22 subjects. Second, to the right of the heading “Mean” you will find the mean score for AGE. You can see that the mean score on AGE is 46.045 for the current sample. Again, this is fairly close to the mode and median of 47, which is fairly common for distributions that are largely symmetrical.
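As a quick arithmetic check, the Moments table supplies both ingredients of the mean: dividing the value to the right of "Sum Observations" by N gives

$$\bar{X} = \frac{1013}{22} \approx 46.045$$

which matches the value printed to the right of "Mean."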
Interpreting a Stem-and-Leaf Plot Created by PROC UNIVARIATE

Overview

A stem-and-leaf plot is a special type of chart for plotting frequencies. It is particularly useful for understanding the shape of the distribution; i.e., for determining whether the distribution is approximately normal, as opposed to being skewed, bimodal, or in some other way nonnormal. The current section shows you how to interpret the stem-and-leaf plot generated by PROC UNIVARIATE. The section that follows provides examples of variables with normal and nonnormal distributions.

Output Produced by the SAS Program

Earlier this chapter indicated that, when you include the PLOT option in the PROC UNIVARIATE statement, SAS produces three figures. These figures are a stem-and-leaf plot, a box plot, and a normal probability plot. This section focuses only on the stem-and-leaf plot, which is reproduced here as Output 7.4.
   Stem Leaf                        #   Boxplot
      5 59                          2      0
      5 022                         3      |
      4 77778999                    8   +--+--+
      4 23444                       5   +-----+
      3 688                         3      |
      3 3                           1      0
        ----+----+----+----+
   Multiply Stem.Leaf by 10**+1
Output 7.4. Stem-and-leaf plot for the variable AGE produced by PROC UNIVARIATE.
Remember that the variable being analyzed in this case is AGE. In essence, the stem-and-leaf plot indicates what values appear in the data set for the variable AGE, and how many occurrences of each value appear.

Interpreting the "stems" and "leaves." Each potential value of AGE is separated into a "stem" and a "leaf." The "stem" for a given value appears under the heading "Stem." The "leaf" for each value appears under the heading "Leaf." This concept is easier to understand one score at a time. For example, immediately under the heading "Stem," you can see that the first stem is "5." Immediately under the heading "Leaf" you see a "5" and a "9." This excerpt from Output 7.4 is reproduced here:

   Stem Leaf
      5 59

Connecting the stem ("5") to the first leaf (also a "5") gives you the first potential value that AGE took on: "55." This means that the data set included one subject at age 55. Similarly, connecting the stem ("5") to the second leaf (the "9") gives you the second potential value that AGE took on: "59." In short, the plot tells you that one subject had a score on AGE of "55," and one had a score of "59."

Now move down one line. The stem on the second line is again a "5," but you now have different leaves: a "0," a "2," and another "2" (see below). Connecting the stem to these leaves tells you that, in your data set, one subject had a score on AGE of "50," one had a score of "52," and another had a score of "52."

   Stem Leaf
      5 59
      5 022
One last example: Move down to the third line. The stem on the third line is now a "4." Further, you now have eight leaves: four leaves are a "7," one leaf is an "8," and three leaves are a "9," as follows:

   Stem Leaf
      5 59
      5 022
      4 77778999

If you connect the stem ("4") to these individual leaves, you learn that, in your data set, there were four subjects who had a score on AGE of "47," one subject who had a score of "48," and three subjects who had a score of "49." The remainder of the stem-and-leaf plot can be interpreted in the same way.

In summary, you can see that the stem-and-leaf plot is similar to a frequency bar chart, except that it is set on its side: the values that the variable took on appear on the vertical axis, and the frequencies are plotted along the horizontal axis. This is the reverse of what you saw with the vertical bar charts created by PROC CHART in the previous chapter.

Interpreting the note at the bottom of the plot. There is one more feature in the stem-and-leaf plot that requires explanation. Output 7.4 shows that the following note appears at the bottom of the plot:

   Multiply Stem.Leaf by 10**+1

To understand the meaning of this note, you need to mentally insert a decimal point into the stem-leaf values you have just reviewed. Those values are again reproduced here:

   Stem Leaf
      5 59
      5 022
      4 77778999
      4 23444
      3 688
      3 3
Notice the blank space that separates each stem from its leaves. For example, in the first line, there is a blank space that separates the stem “5” from the leaves “5” and “9.” Technically, you are supposed to read this blank space as a decimal point (.). This means that the values in the first line are actually 5.5 (for the subject whose age was 55) and 5.9 (for the subject whose age was 59). The note at the bottom of the page tells you how to move this decimal point so that the values will return to their original metric. For this plot, the note says “Multiply Stem.Leaf by 10**+1.” This means “multiply the stem-leaf by 10 raised to the first power.” The number 10 raised to the first power is, of course, 10. So what happens to the stem-leaf 5.5 when it is multiplied by 10? It becomes 55, the subject’s actual score on AGE. And what happens to the stem-leaf 5.9 when it is multiplied by 10? It becomes 59, another subject’s actual score on AGE.
And that is how you interpret a stem-and-leaf plot. Whenever you type in a new data set, you should routinely create stem-and-leaf plots for each of your numeric variables. This will help you identify obvious errors in data entry, and will help you visualize the shape of your distribution (i.e., help you determine whether it is skewed, bimodal, or in any other way nonnormal). The next section of this chapter shows you how to do this.
Using PROC UNIVARIATE to Determine the Shape of Distributions

Overview

This guide often describes a sample of scores as displaying an "approximately normal" distribution. This means that the distribution of scores more or less follows the bell-shaped, symmetrical pattern of the normal curve. It is generally wise to review the shape of a sample of data prior to analyzing it with more sophisticated inferential statistics. This is because some inferential statistics require that your sample be drawn from a normally distributed population. When the values that you have obtained in a sample display a marked departure from normality (such as a strong skew), then it becomes doubtful that your sample was drawn from a population with a normal distribution. In some cases, this will mean that you should not analyze the data with certain types of inferential statistics.

This section illustrates several different shapes that a distribution may display. Using stem-and-leaf plots, it shows how a sample may appear (a) when it is approximately normal, (b) when it is positively skewed, (c) when it is negatively skewed, and (d) when it may have multiple modes. It also shows how each type of distribution affects the mode, median, and mean computed by PROC UNIVARIATE.

Variables Analyzed

As was discussed earlier, Chapter 5, "Creating Frequency Tables," provided a fictitious political donation questionnaire. The last four items on the questionnaire presented subjects with statements, and asked the subjects to indicate the extent to which they agreed or disagreed with each statement. They responded by using a 7-point scale in which "1" represented "Disagree Very Strongly" and "7" represented "Agree Very Strongly." Responses to these four items were given the SAS variable names Q1, Q2, Q3, and Q4, respectively. This section shows some of the results produced when PROC UNIVARIATE was used to analyze responses to these items.

Here are the SAS statements requesting that PROC UNIVARIATE be performed on Q1, Q2, Q3, and Q4. Notice that the PROC UNIVARIATE statement itself contains the PLOT
option (which will cause stem-and-leaf plots to be created), as well as the NORMAL option (which requests tests of the null hypothesis that the data were sampled from a normally distributed population).

   PROC UNIVARIATE   DATA=D1   PLOT   NORMAL;
      VAR Q1 Q2 Q3 Q4;
      TITLE1 'JANE DOE';
   RUN;
These statements resulted in four sets of output: one set of PROC UNIVARIATE output for each of the four variables.

An Approximately Normal Distribution

The stem-and-leaf plot. The SAS variable Q1 represents responses to the statement, "I believe that our federal government is generally doing a good job." Output 7.5 presents the stem-and-leaf plot of fictitious responses to this question.

   Stem Leaf                  #   Boxplot
      7 0                     1      |
      6 00                    2      |
      5 0000                  4   +-----+
      4 00000000              8   *--+--*
      3 0000                  4   +-----+
      2 00                    2      |
      1 0                     1      |
        ----+----+----+----+
Output 7.5. Stem-and-leaf plot of an approximately normal distribution produced by the PROC UNIVARIATE analysis of Q1.
The stem-and-leaf plot appears on the left side of Output 7.5. The box plot for the same variable appears on the right (for guidelines on how to interpret a box plot, see Schlotzhauer and Littell [1997] or Tukey [1977]). The first line of the stem-and-leaf plot shows a stem-leaf combination of "7 0," which means that the variable included one value of 7.0. The second line shows the stem "6," along with two leaves of "0" and "0." This means that the sample included two subjects with a score of 6.0. The remainder of the plot can be interpreted in the same fashion.

Notice that there is no note at the bottom of the plot saying anything such as "Multiply Stem.Leaf by 10**+1." This means that these stem-leaf values already display the appropriate unit of measurement; it is not necessary to multiply them by 10 (or any other value).

The stem-and-leaf plot shows that Q1 has a symmetrical distribution centered around the score of 4.0 (in this context, the word symmetrical means that the tail extending above a
score of 4.0 is the same length as the tail extending below a score of 4.0). A response of "4" on the questionnaire corresponds to the answer "Neither Agree nor Disagree," so it appears that the most common response was for subjects to neither agree nor disagree with the statement, "I believe that our federal government is generally doing a good job." Based on the physical appearance of the stem-and-leaf plot, Q1 appears to have an approximately normal shape.

The mean, median, and mode. Output 7.6 provides the Basic Statistical Measures table and the Tests for Normality table produced by PROC UNIVARIATE in its analysis of Q1.

                    Basic Statistical Measures

       Location                   Variability

   Mean     4.000000    Std Deviation            1.41421
   Median   4.000000    Variance                 2.00000
   Mode     4.000000    Range                    6.00000
                        Interquartile Range      2.00000

                  Tests for Location: Mu0=0

   Test           -Statistic-     -----p Value------

   Student's t    t    13.2665    Pr > |t|     <.0001
   Sign           M         11    Pr >= |M|    <.0001
   Signed Rank    S      126.5    Pr >= |S|    <.0001

                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W      0.957881   Pr < W       0.4477
   Kolmogorov-Smirnov    D      0.181818   Pr > D       0.0573
   Cramer-von Mises      W-Sq   0.115205   Pr > W-Sq    0.0686
   Anderson-Darling      A-Sq    0.55516   Pr > A-Sq    0.1388
Output 7.6. Basic Statistical Measures table and Tests for Normality table produced by the PROC UNIVARIATE analysis of Q1.
Output 7.6 shows that, for the variable Q1, the mean is 4, the median is 4, and the mode is also 4. You may remember that the mean, median, and mode are expected to be the same number when the distribution being analyzed is symmetrical (i.e., when neither tail of the distribution is longer than the other). The stem-and-leaf plot of Output 7.5 has already shown that the distribution is symmetrical, so it makes sense that the mean, median, and mode of Q1 would all be the same value.
The test for normality. In Output 7.6, the section headed "Tests for Normality" provides the results from four statistics. Each of these statistics tests the null hypothesis that the sample was drawn from a normally distributed population. This section focuses on just one of these tests: the Shapiro-Wilk W statistic, whose results appear in the row headed "Shapiro-Wilk."

In the Tests for Normality section, one of the columns is headed "Statistic." This column provides the obtained value for each statistic. In the present analysis, you can see that the obtained value for the Shapiro-Wilk W statistic is 0.957881, which rounds to .96. The next column in the Tests for Normality section is headed "p Value," which stands for "probability value." Where the row headed "Shapiro-Wilk" intersects with the column headed "p Value," you can see that the obtained p value for the current statistic is 0.4477 (this appears to the immediate right of the heading "Pr < W").

In general, a p value represents the probability that you would obtain the present statistic if the null hypothesis were true. The smaller the p value is, the less likely it is that the null hypothesis is true. A standard rule of thumb is that, if a p value is less than .05, you should assume that it is very unlikely that the null hypothesis is true, and you should reject the null hypothesis.

In the present analysis, the p value of the Shapiro-Wilk statistic represents the probability that you would obtain a W statistic of this size if your sample were drawn from a normally distributed population. In general, when this p value is less than .05, you may reject the null hypothesis, and conclude that your sample was probably not drawn from a normally distributed population. When the p value is greater than .05, you should fail to reject the null hypothesis, and tentatively conclude that your sample probably was drawn from a normally distributed population.

In the current analysis, you can see that this p value is 0.4477 (look below the heading "p Value"). Because this p value is greater than the criterion of .05, you fail to reject the null hypothesis of normality. In other words, you tentatively conclude that your sample probably did come from a normally distributed population. In most cases, this is good news because many statistical procedures require that your data be drawn from normally distributed populations.

As is the case with many statistics, this statistic is sensitive to sample size in that it tends to be very powerful (sensitive) with large samples. This means that the W statistic may imply that your sample did not come from a normal distribution even when the sample shows a very minor departure from normality. So keep the sample size in mind, and use caution in interpreting the results of this test, particularly when your sample size is large.
A Positively Skewed Distribution

What is a skewed distribution? A distribution is skewed if one tail is longer than the other. It shows positive skewness if the longer tail of the distribution points in the direction of higher values. In contrast, negative skewness means that the longer tail points in the direction of lower values. The present section illustrates a positively skewed distribution, and the following section covers negative skewness.

The stem-and-leaf plot. In the political donation study, the second agree-disagree item stated, "The federal government should raise taxes." Responses to this question were represented by the SAS variable Q2. The stem-and-leaf plot created when PROC UNIVARIATE analyzed Q2 is reproduced here as Output 7.7.

   Stem Leaf                  #   Boxplot
      7 0                     1      *
      6 0                     1      0
      5 0                     1      0
      4 00                    2      |
      3 00000                 5   +-----+
      2 00000000              8   *--+--*
      1 0000                  4      |
        ----+----+----+----+
Output 7.7. Stem-and-leaf plot of a positively skewed distribution produced by the PROC UNIVARIATE analysis of Q2.
The stem-and-leaf plot of Output 7.7 shows that most responses to item Q2 appear around the number “2,” meaning that most subjects circled the number “2” or some nearby number. This seems reasonable, because, on the questionnaire, the response number “2” represented “Disagree Strongly,” and it makes sense that many people would disagree with the statement “The federal government should raise taxes.” However, the stem-and-leaf plot shows that a small number of people actually agreed with the statement. It shows that two people circled “4” (for “Neither Agree nor Disagree”), one person circled “5” (for “Agree”), one person circled “6” (for “Agree Strongly”), and one person circled “7” (for “Agree Very Strongly”). These responses created a long tail for the distribution––a tail that stretches out in the direction of higher numbers (such as “6” and “7”). This means that the distribution for Q2 is a positively skewed distribution. The mean, median, and mode. You can also determine whether a distribution is skewed by comparing the mean to the median. These statistics are presented in Output 7.8.
                    Basic Statistical Measures

       Location                   Variability

   Mean     2.772727    Std Deviation            1.60154
   Median   2.000000    Variance                 2.56494
   Mode     2.000000    Range                    6.00000
                        Interquartile Range      1.00000

                  Tests for Location: Mu0=0

   Test           -Statistic-     -----p Value------

   Student's t    t   8.120454    Pr > |t|     <.0001
   Sign           M         11    Pr >= |M|    <.0001
   Signed Rank    S      126.5    Pr >= |S|    <.0001

                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W       0.85838   Pr < W       0.0048
   Kolmogorov-Smirnov    D      0.230725   Pr > D      <0.0100
   Cramer-von Mises      W-Sq   0.208576   Pr > W-Sq   <0.0050
   Anderson-Darling      A-Sq   1.173185   Pr > A-Sq   <0.0050
Output 7.8. Basic Statistical Measures table and Tests for Normality table produced by the PROC UNIVARIATE analysis of Q2.
Output 7.8 shows that, for Q2, the mean is 2.77, the median is 2.0, and the mode is also 2.0. Even if you had never seen the stem-and-leaf plot of Output 7.7, you would still know that the distribution is positively skewed, because the mean (at 2.77) is higher than the median (at 2.0). The reasons for this are explained here.

Generally speaking, the mean tends to be more strongly influenced by a long tail than the median is. For example, if a distribution has a long tail in the direction of higher values, the mean tends to be "pulled upward" in the direction of those higher values––it tends to take on a higher value. However, the median generally remains unaffected by the longer tail. Taken together, this means that when a distribution has a long tail in the direction of higher values (i.e., when a distribution is positively skewed), the mean should be higher than the median. Output 7.8 bears this out: the mean (at 2.77) is higher than the median (at 2.0), indicating a positively skewed distribution.

The test for normality. Further evidence of a departure from normality can be seen in the results of the test for normality, which appears toward the bottom of Output 7.8.
The p value for the W statistic is quite small at 0.0048. This is below the standard criterion of .05, meaning that you may reject the null hypothesis that this sample of scores was drawn from a normally distributed population. This is the result you would generally expect when a sample displays a dramatic skew, as is the case here.

A Negatively Skewed Distribution

The stem-and-leaf plot. In the political donation study, the third agree-disagree item stated, "The federal government should do a better job of maintaining our interstate highway system." Responses to this question were represented by the SAS variable Q3. The stem-and-leaf plot created when PROC UNIVARIATE analyzed Q3 is reproduced here as Output 7.9.

   Stem Leaf                  #   Boxplot
      7 000                   3      |
      6 000000000             9   +-----+
      5 00000                 5   +--+--+
      4 00                    2      |
      3 0                     1      0
      2 0                     1      0
      1 0                     1      *
        ----+----+----+----+
Output 7.9. Stem-and-leaf plot of a negatively skewed distribution produced by the PROC UNIVARIATE analysis of Q3.
The stem-and-leaf plot of Output 7.9 shows that most responses to item Q3 appear around the number "6," meaning that most subjects circled the number "6" or some nearby number. This seems reasonable, because, on the questionnaire, the response number "6" represented "Agree Strongly," and it makes sense that many people would agree with the statement "The federal government should do a better job of maintaining our interstate highway system." However, the stem-and-leaf plot shows that a small number of people actually disagreed with the statement. It shows that two people circled "4" (for "Neither Agree nor Disagree"), one person circled "3" (for "Disagree"), one person circled "2" (for "Disagree Strongly"), and one person circled "1" (for "Disagree Very Strongly"). These responses created a long tail for the distribution––a tail that stretches out in the direction of lower numbers (such as "1" and "2"). This means that the distribution for Q3 is a negatively skewed distribution.

The mean, median, and mode. Output 7.10 presents the Basic Statistical Measures table and the Tests for Normality table from the analysis of Q3.
                    Basic Statistical Measures

       Location                   Variability

   Mean     5.181818    Std Deviation            1.56255
   Median   6.000000    Variance                 2.44156
   Mode     6.000000    Range                    6.00000
                        Interquartile Range      1.00000

                  Tests for Location: Mu0=0

   Test           -Statistic-     -----p Value------

   Student's t    t   15.55464    Pr > |t|     <.0001
   Sign           M         11    Pr >= |M|    <.0001
   Signed Rank    S      126.5    Pr >= |S|    <.0001

                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W      0.846403   Pr < W       0.0029
   Kolmogorov-Smirnov    D      0.245183   Pr > D      <0.0100
   Cramer-von Mises      W-Sq   0.248146   Pr > W-Sq   <0.0050
   Anderson-Darling      A-Sq   1.354406   Pr > A-Sq   <0.0050
Output 7.10. Basic Statistical Measures table and Tests for Normality table produced by the PROC UNIVARIATE analysis of Q3.
Output 7.10 shows that, for Q3, the mean is 5.18, the median is 6.0, and the mode is also 6.0. The preceding section indicated that the mean tends to be "pulled" in the direction of the longer tail in a distribution. With a negatively skewed distribution, the longer tail is in the direction of the lower numbers. This means that, with a negatively skewed distribution, the mean should be "pulled downward," so that it is lower than the median. As expected, Output 7.10 shows that this is exactly the case: The mean for Q3 (at 5.18) is lower than the median (at 6.0).

The test for normality. The test for normality further attests that Q3 shows a marked departure from normality. The p value from the Shapiro-Wilk W statistic is .0029. This is below the standard criterion of .05, and so you may reject the null hypothesis that this sample was drawn from a normally distributed population. In other words, you may tentatively conclude that the sample was probably drawn from a population that was not normally distributed.
A Bimodal Distribution

What is a bimodal distribution? A bimodal distribution is a frequency distribution that has two "peaks" or "humps." With a bimodal distribution, the scores at the very center of the two peaks have relatively high frequencies. Many textbooks say that, technically, the two scores at the center of each peak should have exactly the same frequency for the distribution to be considered bimodal. In practice, however, researchers often say that they have a bimodal distribution simply because it has two peaks; they often say this even when the two scores at the center of each peak do not have exactly the same frequency (i.e., even when one peak is a little taller than the other).

The stem-and-leaf plot. In the political donation study, the fourth agree-disagree item stated, "The federal government should increase social security benefits to the elderly." Responses to this question were represented by the SAS variable Q4. The stem-and-leaf plot created when PROC UNIVARIATE analyzed Q4 is reproduced here as Output 7.11.

   Stem Leaf                  #   Boxplot
      7 00                    2      |
      6 000000                6   +-----+
      5 000                   3   |     |
      4 0                     1   *--+--*
      3 00                    2   |     |
      2 000000                6   +-----+
      1 00                    2      |
        ----+----+----+----+
Output 7.11. Stem-and-leaf plot of a bimodal distribution produced by the PROC UNIVARIATE analysis of Q4.
The stem-and-leaf plot of Output 7.11 reveals two peaks, or modes. It shows that six people circled response number “6” (for “Agree Strongly”), and another six people circled response number “2” (for “Disagree Strongly”). Clearly, responses to Q4 form a bimodal distribution. Although stem-and-leaf plots are useful for identifying the general shape of distributions, they can be misleading when it comes to determining whether a distribution is technically bimodal. According to some textbooks, a distribution is technically bimodal only if the scores at the center of each peak have exactly the same frequencies. With some data sets (especially with large data sets), SAS must perform some rounding of numbers prior to constructing its stem-and-leaf plot. In those situations, you should view the stem-and-leaf plot as providing only a rough approximation of the shape of the distribution. Further, even if the plot appears to suggest that the two scores at the center of each peak have exactly the same frequencies, you should not assume that this is the case. You should instead look for a note at the bottom of the Basic Statistical Measures table that tells you whether the sample contains more than one mode. This note is discussed in the following section.
The mean, median, and mode. The Basic Statistical Measures table and Tests for Normality table from the analysis of Q4 are presented here as Output 7.12.

                    Basic Statistical Measures

       Location                   Variability

   Mean     4.045455    Std Deviation            2.05814
   Median   4.500000    Variance                 4.23593
   Mode     2.000000    Range                    6.00000
                        Interquartile Range      4.00000

   NOTE: The mode displayed is the smallest of 2 modes with a count of 6.

                  Tests for Location: Mu0=0

   Test           -Statistic-     -----p Value------

   Student's t    t   9.219434    Pr > |t|     <.0001
   Sign           M         11    Pr >= |M|    <.0001
   Signed Rank    S      126.5    Pr >= |S|    <.0001

                      Tests for Normality

   Test                  --Statistic---    -----p Value------

   Shapiro-Wilk          W      0.875992   Pr < W       0.0101
   Kolmogorov-Smirnov    D      0.203485   Pr > D       0.0186
   Cramer-von Mises      W-Sq   0.190371   Pr > W-Sq    0.0064
   Anderson-Darling      A-Sq   1.147371   Pr > A-Sq   <0.0050
Output 7.12. Basic Statistical Measures table and Tests for Normality table produced by the PROC UNIVARIATE analysis of Q4.
Output 7.12 shows that the mean score on Q4 is 4.05, and the median is 4.5. The output indicates that the mode is equal to 2.0, and this may surprise you because, as established previously, Q4 actually has two modes: 2.0 and 6.0. However, an earlier section of this chapter mentioned that, when a distribution has more than one mode, PROC UNIVARIATE reports only the mode with the lowest numerical value. With Q4, the lower mode is 2.0. Just below the Basic Statistical Measures table, the following note appears:

   NOTE: The mode displayed is the smallest of 2 modes with a count of 6.

A note similar to this will tell you if you have more than one mode. As you can see, it will also indicate the number of modes observed in your distribution, although it will not tell you the scores associated with those modes.
The test for normality. The Shapiro-Wilk W statistic appears as part of the Tests for Normality table. You can see that the p value for this test is .0101. This is below the criterion of .05, and so you may reject the null hypothesis that this sample came from a normally distributed population. You may instead tentatively conclude that the sample was probably drawn from a population with a nonnormal distribution.
Simple Measures of Variability: The Range, the Interquartile Range, and the Semi-Interquartile Range

Overview

A measure of variability is a number or set of numbers that represents the extent of dispersion in a set of scores. It is the extent to which the scores differ from one another, or from some measure of central tendency such as the mean. This section shows how to use PROC UNIVARIATE to compute three relatively simple measures of variability: the range, the interquartile range, and the semi-interquartile range.

The Range

The range is defined as the difference between the highest and lowest scores in a distribution. The formula for the range is

$$X_{max} - X_{min}$$

where \(X_{max}\) represents the highest observed score in the distribution and \(X_{min}\) represents the lowest observed score in the distribution.

When you run PROC UNIVARIATE on a variable, information relevant to the range appears in the Basic Statistical Measures table and the Quantiles table. These tables appear in Output 7.13 (again, the variable being analyzed is AGE).
                    Basic Statistical Measures

       Location                   Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000

                 Quantiles (Definition 5)

                 Quantile       Estimate

                 100% Max             59
                 99%                  59
                 95%                  55
                 90%                  52
                 75% Q3               49
                 50% Median           47
                 25% Q1               43
                 10%                  38
                 5%                   36
                 1%                   33
                 0% Min               33
Output 7.13. The range, maximum score, and minimum score as they appear in the Basic Statistical Measures table and Quantiles table produced by the PROC UNIVARIATE analysis of AGE.
Output 7.13 shows that the range for the AGE variable was 26. This makes sense, because the highest score on AGE was 59 and the lowest score was 33. Using the formula for range presented above, the range can be computed as

$$X_{max} - X_{min} = 59 - 33 = 26$$

If you wish to compute the range by hand, you can easily find the highest and lowest observed scores in your sample in the Quantiles table produced by PROC UNIVARIATE. In Output 7.13, the Quantiles table provides the highest observed score to the right of the heading "100% Max." Here, you can see that the highest score was indeed 59. The Quantiles table provides the lowest score to the right of the heading "0% Min." You can see that the lowest score in the present sample was indeed 33.
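If you prefer to have SAS print these quantities directly, PROC MEANS can do so. Here is a minimal sketch (not part of the original program) that requests the RANGE, MIN, and MAX statistic keywords for the chapter's data set D1:

   PROC MEANS   DATA=D1   RANGE  MIN  MAX;
      VAR AGE;
      TITLE1 'JANE DOE';
   RUN;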
The Interquartile Range

The interquartile range is a useful measure of variability to use when you are working with a data set that demonstrates a dramatic departure from normality. For example, it is a useful measure of variability to use when your sample is skewed or has more than one mode. The formula for the interquartile range is

$$Q_3 - Q_1$$

where \(Q_3\) represents the "third quartile": the score at the 75th percentile of the distribution (the score below which 75% of all scores fall), and \(Q_1\) represents the "first quartile": the score at the 25th percentile of the distribution (the score below which 25% of all scores fall).
Q1, Q3, and the interquartile range are reported in the Basic Statistical Measures table and Quantiles table produced by PROC UNIVARIATE. Output 7.14 again presents these tables, as produced when PROC UNIVARIATE was performed on the AGE variable.

                    Basic Statistical Measures

       Location                   Variability

   Mean     46.04545    Std Deviation            6.19890
   Median   47.00000    Variance                38.42641
   Mode     47.00000    Range                   26.00000
                        Interquartile Range      6.00000

                 Quantiles (Definition 5)

                 Quantile       Estimate

                 100% Max             59
                 99%                  59
                 95%                  55
                 90%                  52
                 75% Q3               49
                 50% Median           47
                 25% Q1               43
                 10%                  38
                 5%                   36
                 1%                   33
                 0% Min               33
Output 7.14. The interquartile range, third quartile, and first quartile as they appear in the Basic Statistical Measures table and Quantiles Table produced by the PROC UNIVARIATE analysis of AGE.
In the Quantiles table, the score at the third quartile appears to the right of the heading "75% Q3." Output 7.14 shows that the score at the third quartile for AGE is 49. The score at the first quartile appears to the right of the heading "25% Q1." The output shows that the score at the first quartile for AGE is 43.

In the Basic Statistical Measures table, the interquartile range itself appears to the right of the heading "Interquartile Range." Output 7.14 shows that the interquartile range for AGE is 6. This makes sense because, for the current data set, the interquartile range was computed in this way:

$$Q_3 - Q_1 = 49 - 43 = 6$$

The Semi-Interquartile Range

The semi-interquartile range is simply the interquartile range divided by 2. The formula is

$$\frac{Q_3 - Q_1}{2}$$

PROC UNIVARIATE does not compute the semi-interquartile range, but it is easily computed from output that PROC UNIVARIATE does provide: simply divide the interquartile range by 2. For the current output, the semi-interquartile range is computed as follows:

$$\frac{Q_3 - Q_1}{2} = \frac{49 - 43}{2} = \frac{6}{2} = 3$$

So, the semi-interquartile range for AGE is 3.
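Although PROC UNIVARIATE will not print the semi-interquartile range for you, SAS can still compute it. The following is a minimal sketch (not from the original text): PROC MEANS writes the interquartile range to an output data set by way of the QRANGE statistic keyword, and a short DATA step divides it by 2. The data set names IQR_OUT and SEMI are arbitrary choices.

   PROC MEANS DATA=D1 NOPRINT;
      VAR AGE;
      OUTPUT OUT=IQR_OUT QRANGE=IQR;   * IQR = Q3 - Q1 ;
   RUN;

   DATA SEMI;
      SET IQR_OUT;
      SEMI_IQR = IQR / 2;              * semi-interquartile range ;
   RUN;

   PROC PRINT DATA=SEMI;
      VAR IQR SEMI_IQR;
   RUN;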
More Complex Measures of Variability: The Variance and Standard Deviation

Overview

The variance is defined as the average of the squared deviations of scores around their mean. The standard deviation is the square root of the variance. This section provides formulas for computing these two statistics.

The variance and standard deviation are more complex measures of variability, compared to the range, interquartile range, and semi-interquartile range. In part, this is because the variance and standard deviation are influenced by every observation in a sample, whereas the range and its kindred measures are not. The fact that the variance and standard deviation are influenced by every observation in the sample is one of the reasons that these two measures are widely used in research.

This section begins by reviewing some basic terms that are relevant to the concepts to be taught here. It discusses the difference between populations and samples because the formula for computing a population variance is different from the formula for computing a sample variance, and both types of formulas are presented later. This section also discusses the difference between descriptive and inferential statistics for the same reason. Finally, it shows how the population variance and standard deviation are computed, using conceptual formulas.

Relevant Terms and Concepts

Populations versus samples. A population can be defined as a complete set of scores obtained from a specified group in a particular situation. For example, suppose that you are conducting research on intelligence (IQ) test scores, and you are interested in using adults in the U.S. as your subjects. If you were able to obtain IQ scores for every adult in the U.S., you would have obtained the entire population of scores. Needless to say, it would probably not be possible for you to obtain scores for this entire population when conducting research. In fact, it is assumed that it is seldom possible to study entire populations.

In contrast, a sample is a subset of scores drawn from a population. Suppose that, in order to conduct your research, you put together a group of 200 adults from the U.S. and obtain their IQ scores. When you work with this set of 200 IQ scores, you are working with a sample. Researchers in the behavioral sciences almost always conduct their research with samples. A later section shows that the formula for computing the population variance is somewhat different from the formula for computing the sample variance.

Descriptive versus inferential statistics. Descriptive statistics are procedures for summarizing and conveying some particular characteristic of a specific set of data. For example, suppose that you assess the mean IQ score for your sample of 200 adults, and find
that this mean score is 105. If you use this number merely as an index of the average IQ of this specific group of 200 subjects––and do not use it to draw inferences about the average IQ score in the larger population––then you are using it as a descriptive statistic.

In contrast, inferential statistics are procedures that involve gathering data from a sample of subjects, and then using the data to draw inferences about the likely characteristics of the larger population from which the sample was drawn. For example, again suppose that you assess the mean IQ in your sample of 200 U.S. adults, and find that this mean IQ score is 105. You conclude the following: "Based on these results from the sample, I estimate that the average IQ score for the entire population of U.S. adults is 105." In this case, you are using sample data to draw inferences about the likely characteristics of the population from which they were drawn. This means that you are using the obtained IQ score as an inferential statistic. A later section shows that the formula for computing the variance as a descriptive statistic is somewhat different from the formula for computing the variance as an inferential statistic.

Conceptual Formula for the Population Variance

The preceding section indicated that populations are generally assumed to be so large that it is not possible to obtain data from every member of a population. For the moment, however, suspend disbelief and suppose that you have conducted a study in which you did obtain data from every member of a population. If your data were on an interval or ratio scale, and if they formed a symmetrical distribution, you would probably want to use the variance and standard deviation as your measures of variability. The formula for computing the variance for a population of scores is as follows:

$$\sigma_X^2 = \frac{\sum (X - \mu)^2}{N}$$

where

   \(\sigma_X^2\) = the population variance
   \(X\) = the individual scores
   \(\mu\) = the population mean
   \(N\) = the number of scores.
An earlier section defined the variance as the average of the squared deviations of scores around their mean. You can see that the preceding formula is consistent with this definition. This formula provides the following instructions for computing the population variance:

1. For a given subject, find the deviation of that subject's observed score from the population mean. This is represented by the subtraction term that appears in the formula: \((X - \mu)\)

2. Next, square that deviation. This is represented by the fact that the deviation term is squared: \((X - \mu)^2\)

3. Next, sum these squared deviations. This is represented by the \(\Sigma\) symbol to the left of the deviation term: \(\sum (X - \mu)^2\)

4. Finally, divide the sum of the squared deviations by the number of observations. This is represented by the fact that the entire quantity is divided by N:

$$\sigma_X^2 = \frac{\sum (X - \mu)^2}{N}$$

Conceptual Formula for the Population Standard Deviation

So far, this section has focused on the variance as a measure of variability. However, researchers in the social and behavioral sciences also make heavy use of the standard deviation as a measure of variability. Earlier, the standard deviation was defined as the square root of the variance. Therefore, the formula for computing the standard deviation for a population of scores is as follows:

$$\sigma_X = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

You can see that the preceding formula is identical to the formula for the population variance, with two exceptions. First, note that the square root symbol has now been placed around everything to the right of the equals sign. This conveys the fact that the standard deviation is simply the square root of the variance. Second, you can see that the symbol for the standard deviation is \(\sigma_X\), whereas the symbol for the variance had been \(\sigma_X^2\) (notice that only the symbol for the variance has the "squared" sign).
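As a quick numeric illustration (not from the original text), consider a tiny population consisting of the three scores 1, 2, and 3. The population mean is \(\mu = 2\), so

$$\sigma_X^2 = \frac{(1-2)^2 + (2-2)^2 + (3-2)^2}{3} = \frac{2}{3} \approx .67
\qquad \text{and} \qquad
\sigma_X = \sqrt{.67} \approx .82$$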
In reporting the results of investigations, researchers often prefer to report the standard deviation rather than the variance. This is because the variance is a measure of variability expressed in squared units. As the square root of the variance, however, the standard deviation is a measure of variability that reflects the original unit of measurement. This makes the standard deviation easier to interpret in general.
Variance and Standard Deviation: Three Formulas

Overview

Students in elementary statistics courses are sometimes confused when they learn about variance because there are essentially three types of variance estimates that they must learn. These variance estimates differ with respect to the formulas that are used to compute them and whether they are descriptive versus inferential in nature. Adding to the confusion is the fact that different textbooks sometimes use different names when discussing the same type of variance estimate. Therefore, this section uses the names, statistical symbols, and formulas that are fairly typical of most statistics textbooks in the social and behavioral sciences.

The Population Variance and Standard Deviation

The population variance. The population variance is a parameter that is descriptive (rather than inferential). It is a number that describes the actual variance in a population of scores. It is appropriate to compute the population variance when you have obtained scores from every member of the relevant population (remember that this seldom occurs in real life). The formula for the population variance was presented in the preceding section, and is presented again here for purposes of comparison:

$$\sigma_X^2 = \frac{\sum (X - \mu)^2}{N}$$

The population standard deviation. The population standard deviation is simply the square root of the population variance. The formula for the population standard deviation was also presented earlier. It, too, is presented again here for purposes of comparison:

$$\sigma_X = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$
The Sample Variance and Standard Deviation

The sample variance. Suppose that you find yourself in the following situation:

•  You have obtained data from a sample of subjects drawn from a larger population. Suppose that this is the more typical situation in which you have data from the sample, but not from each member of the larger population.

•  You wish to compute the variance in this sample.

•  Further, you wish this variance to describe the actual variance in the sample; you do not wish it to estimate what the variance probably is in the larger population. In other words, you wish to use this variance as a descriptive statistic (describing the actual variance in the sample), and not as an inferential statistic (making inferences about what the variance might be in the population).

In a situation such as this, you wish to compute what this book calls the sample variance. In this book, the sample variance is defined as a descriptive statistic that describes the actual variance in a sample; it is not an inferential statistic that estimates what the variance probably is in the population. The definitional formula for the sample variance is as follows:

$$S_X^2 = \frac{\sum (X - \bar{X})^2}{N}$$

where

   \(S_X^2\) = the sample variance
   \(X\) = the individual scores
   \(\bar{X}\) = the sample mean
   \(N\) = the number of scores.

Notice the differences between the formula for the population variance (presented earlier) and the formula for the sample variance presented here. First, the symbols for the variance statistics themselves are different (\(\sigma_X^2\) versus \(S_X^2\)). Also, notice that the symbol for the mean in the population formula was \(\mu\), whereas the symbol for the mean in the sample formula is \(\bar{X}\). Despite these differences, you can see that the formula for the population variance is essentially equivalent to the formula for the sample variance. Both formulas involve the steps of (a) taking deviations from the mean, (b) squaring and summing these deviations, and (c) dividing by N.
The sample standard deviation. Again, the sample standard deviation is simply the square root of the sample variance. The formula is as follows:

   SX = √[ Σ (X – X̄)² / N ]
The Estimated Population Variance and Standard Deviation

The estimated population variance. It is possible to use data from a sample to estimate the variance and standard deviation in the population from which the sample was drawn. However, to do this you should not use the formula for the sample variance presented in the preceding section. This is because the formula for the sample variance tends to underestimate the actual population variance. To obtain a more accurate estimate of the population variance, it is necessary to change the divisor in this formula (the part below the division line). Below is the formula for the sample variance that was presented earlier:

   SX² = Σ (X – X̄)² / N

The divisor in this formula is N, the sample size. To compute the estimated population variance, however, it is necessary to replace N with N – 1. The quantity N – 1 is referred to as the degrees of freedom. Below is the revised formula, the formula for the estimated population variance:

   sX² = Σ (X – X̄)² / (N – 1)
Notice that the divisor is now N – 1 rather than N. The above formula also contains one additional change: the symbol for the estimated population variance is sX². The "s" in this symbol is a lowercase "s," rather than the uppercase "S" that was used as the symbol for the sample variance.
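To see how much the divisor matters, consider again the made-up three-score example from above (scores of 2, 4, and 6, which are illustrative only). The sum of squared deviations is 8, so dividing by N gives the sample variance, SX² = 8/3 ≈ 2.67, while dividing by the degrees of freedom gives the estimated population variance, sX² = 8/(3 – 1) = 4. With only three scores the difference is large; as N increases, the two values converge.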
The estimated population standard deviation. The relationship between the estimated population standard deviation and the estimated population variance is similar to the relationship between the sample standard deviation and the sample variance. The estimated population standard deviation is computed by taking the square root of the estimated population variance. The formula is as follows:

   sX = √[ Σ (X – X̄)² / (N – 1) ]
Using PROC MEANS to Compute the Variance and Standard Deviation

Overview

The variance and standard deviation can be computed using either PROC UNIVARIATE or PROC MEANS. Since the output of PROC MEANS is somewhat easier to read, the following sections show how to use PROC MEANS to compute the sample variance, population variance, and estimated population variance, respectively.

Computing the Sample Variance and Standard Deviation

Use the directions in this section for situations in which you are using the variance as a descriptive statistic. That is, they are appropriate for situations in which you wish to describe the variance as it actually is in the sample, and you are not interested in estimating what the variance probably is in the larger population.

Below is the syntax for the PROC step that uses PROC MEANS to compute the sample variance and sample standard deviation (along with some additional statistics):

   PROC MEANS DATA=data-set-name VARDEF=N
              N MEAN STD VAR MIN MAX;
      VAR variable-list;
      TITLE1 ' your-name ';
   RUN;
The PROC MEANS statement is the first statement in the preceding PROC step. It includes the VARDEF option and the following keywords:

VARDEF=N
   Requests that the divisor in the computation of the variance and standard deviation be equal to "N," the sample size. This ensures that PROC MEANS will compute the
sample variance and standard deviation, as opposed to the estimated population variance and standard deviation (which will be discussed in a later section).

N
   prints the sample size (the number of valid observations).

MEAN
   prints the sample mean.

STD
   prints the sample standard deviation.

VAR
   prints the sample variance.

MIN
   prints the lowest observed score in the sample.

MAX
   prints the highest observed score in the sample.

Here are the statements used to compute the sample variance and standard deviation for the AGE variable from the political donation study:

   PROC MEANS DATA=D1 VARDEF=N
              N MEAN STD VAR MIN MAX;
      VAR AGE;
      TITLE1 'JANE DOE... DIVISOR = N';
   RUN;
Output 7.15 presents the results generated by the preceding statements.

                        JANE DOE... DIVISOR = N

                          The MEANS Procedure

                        Analysis Variable : AGE

  N         Mean      Std Dev     Variance      Minimum      Maximum
 ---------------------------------------------------------------------
 22   46.0454545    6.0563811   36.6797521   33.0000000   59.0000000
 ---------------------------------------------------------------------
Output 7.15. The sample standard deviation and variance as produced by PROC MEANS with the option VARDEF=N.
You can see that the sample size, mean, minimum value, and maximum value for the AGE variable reported in Output 7.15 are identical to the values reported when AGE was analyzed using PROC UNIVARIATE earlier in this chapter.
Below the heading "Std Dev" is the sample standard deviation for AGE, which in this case is 6.056. Below the heading "Variance" is the sample variance: 36.680. You can see that the standard deviation and variance in Output 7.15 are not identical to those contained in the output from PROC UNIVARIATE, presented earlier. This is because, by default, PROC UNIVARIATE prints the estimated population standard deviation and variance, rather than the sample standard deviation and variance. Later, this section shows how to modify the PROC MEANS statement so that it prints the estimated population standard deviation and variance.

Computing the Population Variance and Standard Deviation

To compute the population variance and standard deviation, you should write your PROC MEANS statement according to the directions provided in the preceding section. That is, you should write your PROC MEANS statement the same way that you would in order to compute the sample variance and standard deviation. This is because the formula for computing the population variance is essentially equivalent to the formula for the sample variance: both formulas have "N" as the divisor (rather than "N – 1"). In the previous section, you learned that you can request that N be used as the divisor by including the option VARDEF=N in the PROC MEANS statement. This means that you should include the VARDEF=N option in the PROC MEANS statement regardless of whether you wish to compute the sample variance or the population variance.

Remember that these directions are appropriate only if every member of the population is included in the data set that you are going to analyze. In this situation, you are computing the variance as a descriptive statistic because you are simply describing the population variance as it is. These directions do not apply if you are analyzing data from a sample and are using those data to estimate what the variance probably is in the larger population. In that second situation, you are using the variance as an inferential statistic, and you should follow the directions provided in the following section.

Computing the Estimated Population Variance and Standard Deviation

The directions in this section are appropriate for situations in which you are using the variance as an inferential statistic. That is, they are appropriate for situations in which you have obtained data from a sample, and you wish to use those data to estimate what the variance probably is in the larger population from which the sample was drawn.
Below is the syntax for the SAS statements that request the estimated population variance and standard deviation:

   PROC MEANS DATA=data-set-name VARDEF=DF
              N MEAN STD VAR MIN MAX;
      VAR variable-list;
      TITLE1 ' your-name ';
   RUN;
You can see that the only difference involves the VARDEF option in the PROC MEANS statement. When you requested the sample variance, you saw VARDEF=N. Now that you are instead requesting the estimated population variance, it instead reads VARDEF=DF. The "DF" here stands for "degrees of freedom." Remember that, in this context, the degrees of freedom are equal to N – 1. This means that, when you request VARDEF=DF, SAS uses N – 1 as the divisor in computing the variance and standard deviation.

Here are the statements used to request the estimated population variance and standard deviation for the AGE variable:

   PROC MEANS DATA=D1 VARDEF=DF
              N MEAN STD VAR MIN MAX;
      VAR AGE;
      TITLE1 'JANE DOE... DIVISOR = N-1';
   RUN;
Output 7.16 presents the results of this PROC MEANS statement.

                       JANE DOE... DIVISOR = N-1

                          The MEANS Procedure

                        Analysis Variable : AGE

  N         Mean      Std Dev     Variance      Minimum      Maximum
 ---------------------------------------------------------------------
 22   46.0454545    6.1989037   38.4264069   33.0000000   59.0000000
 ---------------------------------------------------------------------
Output 7.16. The estimated population standard deviation and variance as produced by PROC MEANS with the option VARDEF=DF.
Output 7.16 shows that the estimated population standard deviation for AGE rounds to 6.199, and that the estimated population variance is 38.426. Notice that the estimated population variance (38.426) is larger than the sample variance (36.680) that was reported in Output 7.15. This is to be expected, since the divisor used in computing the estimated population variance (N – 1) is a smaller number than the divisor used in computing the sample variance (N). Dividing the same sum of squared deviations by a smaller number results in a larger value for the estimated population variance.
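As a quick consistency check, note that the two estimates differ only in their divisors, so multiplying the sample variance by N/(N – 1) should reproduce the estimated population variance. Using the values reported in the two outputs: 36.6798 × (22/21) ≈ 38.4264, which agrees with the variance shown in Output 7.16.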
Conclusion

This chapter has shown you how to perform a variety of descriptive procedures. You have learned (a) how to compute measures of central tendency to determine the location of your sample on a continuum, (b) how to compute measures of variability to determine the dispersion of your scores, and (c) how to interpret a stem-and-leaf plot to identify the shape of a distribution. After creating a data set, you should generally perform simple procedures such as these before moving on to more sophisticated analyses.

But what if your data are not yet ready to be analyzed with more sophisticated procedures? For example, what if you have administered a depression-screening questionnaire to subjects, have entered their responses, and now need to sum their responses to the individual items to create a single score that represents their level of depression? In situations such as these, it is necessary to perform some type of data manipulation: operations in which you (or a computer application) transform existing variables or create new variables from existing variables. When conducting research in the behavioral sciences and education, researchers routinely need to perform transformations on their raw data prior to performing more sophisticated statistical procedures. Fortunately, SAS makes it easy to do this through the use of simple mathematical equations, "IF-THEN" statements, and other operations. The next chapter shows how to use these tools to recode reversed items, create "total scores" from individual scale items, create data subsets, and perform other forms of data manipulation.
Creating and Modifying Variables and Data Sets

Introduction.........................................................................................217
   Overview...............................................................................................................217
   Why It Is Often Necessary to Modify Variables and Data Sets.............................217
Example 8.1: An Achievement Motivation Study..............................218
   The Study..............................................................................................................218
   Data Set to Be Analyzed.......................................................................................220
   SAS Program to Read the Raw Data....................................................................221
Using PROC PRINT to Create a Printout of Raw Data.......................222
   Why You Need to Use PROC PRINT....................................................................222
   Using PROC PRINT to Print the Current Data Set................................................222
   A Common Misunderstanding Regarding PROC PRINT......................................225
Where to Place Data Manipulation and Data Subsetting
   Statements.........................................................................................225
   Overview...............................................................................................................225
   Placing These Statements Immediately Following the INPUT Statement.............225
   Placing the Statements Immediately After Creating a New Data Set....................226
Basic Data Manipulation.....................................................................228
   Overview...............................................................................................................228
   Creating Duplicate Variables with New Variable Names.......................................228
   Creating New Variables from Existing Variables...................................................230
   Recoding Reversed Variables...............................................................................233
   Where Should the Recoding Statements Go?......................................................234
Recoding a Reversed Item and Creating a New Variable
   for the Achievement Motivation Study............................................235
   The Scale..............................................................................................................235
   Creating the New SAS Variable............................................................................235
   Recoding the Reversed Item................................................................................236
   The SAS Program.................................................................................................236
   The SAS Output....................................................................................................237
   The SAS Log.........................................................................................................238
Using IF-THEN Control Statements....................................................239
   Overview...............................................................................................................239
   Creating a New AGE Variable...............................................................................239
   Comparison Operators..........................................................................................241
   The Syntax of the IF-THEN Statement.................................................................242
   Using ELSE Statements.......................................................................................243
   Using the Conditional Statements AND and OR...................................................244
   Working with Character Variables.........................................................................245
Data Subsetting...................................................................................248
   Overview...............................................................................................................248
   The Syntax of Data Subsetting Statements..........................................................248
   An Example...........................................................................................................249
   An Example with Multiple Subsets........................................................................251
   Using Comparison Operators and the Conditional Statements AND and OR.......252
   Eliminating Observations That Have Missing Data on Some Variables................252
Combining a Large Number of Data Manipulation and
   Data Subsetting Statements in a Single Program..........................256
   Overview...............................................................................................................256
   A Longer SAS Program........................................................................................256
   Some General Guidelines.....................................................................................259
Conclusion...........................................................................................260
Introduction

Overview

This chapter shows you how to modify a data set by creating new variables and modifying existing variables. It shows how to write SAS statements that contain simple mathematical formulas, and how to use IF-THEN control statements. It shows how to use subsetting IF statements to analyze data for various subgroups within a data set. Finally, it shows how to eliminate unwanted observations from a data set so that analyses are performed only on subjects that have no missing data.

Why It Is Often Necessary to Modify Variables and Data Sets

Very often researchers obtain a data set in which the data are not yet in a form appropriate for analysis. For example, imagine that you are conducting research on job satisfaction. Perhaps you wish to compute the correlation between subject age and a single index of job satisfaction. You administer a 10-item questionnaire that assesses job satisfaction to 200 employees, and enter their responses to the 10 individual questionnaire items. You now need to add together each subject's responses to those 10 items to arrive at a single composite score that reflects that subject's overall level of satisfaction. This computation is very easy to perform within the SAS program by including a number of data manipulation statements (a simple sketch appears at the end of this introduction). Data manipulation statements are SAS statements that transform the data set in some way. They may be used to recode reversed variables, create new variables from existing variables, and perform a wide range of other tasks.

At the same time, it is possible that your original data set contains observations that you do not wish to include in your analyses. Perhaps the questionnaire was administered to hourly as well as nonhourly employees, and you wish to analyze only data from the hourly employees. In addition, you may wish to analyze data only from subjects who have usable data on all of the study's variables. In these situations, you may include data subsetting statements to eliminate the unwanted subjects from the sample. Data subsetting statements are SAS statements that eliminate unwanted observations from a sample, so that only a specified subgroup is included in the resulting data set.

The SAS programming language is so comprehensive and flexible that it can perform virtually any type of data manipulation task imaginable. A complete treatment of these capabilities would easily fill a book, and is therefore beyond the scope of this text. However, this chapter reviews some basic SAS statements that can be used to perform a wide variety of research tasks (particularly in research that involves questionnaire data). Those who need additional help should consult Hatcher and Stepanski (1994).
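To make the idea concrete, here is a minimal sketch of the kind of data manipulation statement that the 10-item job satisfaction example would call for, placed within a DATA step as the chapter explains below. The item names V1 through V10 and the composite name SATIS are hypothetical, chosen only for illustration; they do not appear in any program in this guide:

   DATA D2;
      SET D1;
      /* Add the ten item responses together to form a */
      /* single composite job satisfaction score       */
      SATIS = V1 + V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10;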
Example 8.1: An Achievement Motivation Study

The Study

The use of data manipulation and data subsetting statements is illustrated here by analyzing data from a fictitious study dealing with achievement motivation. Briefly, achievement motivation is the desire to exceed some standard of performance. People who score high on achievement motivation try to improve on their past performance, take moderate risks, and tend to set challenging goals for themselves.

Suppose that you have constructed a 5-item scale designed to assess achievement motivation in a sample of college students. The scale consists of five statements, and a subject who completes the scale uses a 6-point response format to indicate the extent to which he or she agrees with each statement. The scale is designed so that subjects who possess a great deal of achievement motivation should tend to agree with items 1, 2, 3, and 5, but disagree with item 4. Your 5-item scale appears on the following short questionnaire. The questionnaire also contains a few additional items to assess demographic information (sex, age, and academic major).
Directions: Please indicate the extent to which you agree or disagree with each of the following statements. You will do this by circling the appropriate number to the left of that statement. The following format shows what each response number stands for:

   6 = Agree Very Strongly
   5 = Agree Strongly
   4 = Agree Somewhat
   3 = Disagree Somewhat
   2 = Disagree Strongly
   1 = Disagree Very Strongly

For example, if you "Disagree Very Strongly" with the first statement, circle the "1" to the left of that statement. If you "Agree Somewhat," circle the "4," and so forth.

    Circle Your
     Response
   -------------
   1 2 3 4 5 6    1. I try very hard to improve on my performance in classes.
   1 2 3 4 5 6    2. I take moderate risks to get ahead in classes.
   1 2 3 4 5 6    3. I try to perform better than my fellow students.
   1 2 3 4 5 6    4. I try to achieve as little as possible in classes.
   1 2 3 4 5 6    5. I do my best work when my class assignments are fairly difficult.

   6. What is your sex?
      ______ Female (F)
      ______ Male (M)

   7. What is your age in years? _______________

   8. What is your major?
      ______ Arts and Sciences (1)
      ______ Business (2)
      ______ Education (3)
Data Set to Be Analyzed

You administer the questionnaire to nine college students. Their responses are summarized in Table 8.1.

Table 8.1
Data from the Achievement Motivation Study
____________________________________________________
               Agree-Disagree
                  Questions
             __________________
Subject      Q1  Q2  Q3  Q4  Q5   Sex   Age   Major
____________________________________________________
1. Marsha     6   5   5   2   6    F     22     1
2. Charles    2   1   2   5   2    M     25     1
3. Jack       5   .   4   3   6    M     30     1
4. Cathy      5   6   6   .   6    F     41     2
5. Emmett     4   4   5   2   5    M     22     2
6. Marie      5   6   6   2   6    F     20     2
7. Cindy      5   5   6   1   5    F     21     3
8. Susan      2   3   1   5   2    F     25     3
9. Fred       4   5   4   2   5    M     23     3
____________________________________________________
As is the case with most tables of data in this guide, Table 8.1 follows the conventions that the horizontal rows of the table represent different subjects, and the vertical columns of the table represent different variables. Below the heading "Q1" are responses to the first achievement motivation statement ("I try very hard to improve on my performance in classes"). For example, with respect to Q1, you can see that

•  Marsha circled a "6" (for "Agree Very Strongly")

•  Charles circled a "2" (for "Disagree Strongly")

•  Jack circled a "5" (for "Agree Strongly").
Below the heading “Q2” are responses to the second achievement motivation statement (“I take moderate risks to get ahead in classes”). Responses to the remaining achievement motivation questions appear below the headings “Q3,” “Q4,” and “Q5,” respectively. Subject sex is recorded below the heading “Sex” (with the value “F” representing females and the value “M” representing males), and subject age appears below “Age.”
Values below the heading "Major" indicate whether a given subject is majoring in the arts and sciences versus business versus education. The value "1" represents subjects majoring in the arts and sciences, the value "2" represents subjects majoring in business, and the value "3" represents subjects majoring in education.

SAS Program to Read the Raw Data

Here is the program that creates a SAS data set from the data appearing in Table 8.1:

    1   OPTIONS LS=80 PS=60;
    2   DATA D1;
    3      INPUT   SUB_NUM
    4              Q1
    5              Q2
    6              Q3
    7              Q4
    8              Q5
    9              SEX $
   10              AGE
   11              MAJOR;
   12   DATALINES;
   13   1 6 5 5 2 6 F 22 1
   14   2 2 1 2 5 2 M 25 1
   15   3 5 . 4 3 6 M 30 1
   16   4 5 6 6 . 6 F 41 2
   17   5 4 4 5 2 5 M 22 2
   18   6 5 6 6 2 6 F 20 2
   19   7 5 5 6 1 5 F 21 3
   20   8 2 3 1 5 2 F 25 3
   21   9 4 5 4 2 5 M 23 3
   22   ;
Some notes regarding the preceding DATA step:

•  The INPUT statement appears on lines 3–11.

•  The subjects' actual names were not included in the data set. However, each subject was given a unique subject number (from 1 through 9) for purposes of identification. These subject numbers are contained in the SAS variable SUB_NUM. The variable name SUB_NUM appears on line 3 of the program.

•  The SAS variables Q1–Q5 contain subject responses to achievement motivation items 1–5, respectively. These variable names appear in the INPUT statement on lines 4 through 8.

•  Line 9 shows that the SAS variable name SEX was used to represent subject sex (this was the only character variable in the data set, as is indicated by the "$" that follows the name "SEX").

•  Lines 10–11 show that the SAS variable name AGE was used to represent subject age, and MAJOR was used to represent subject major.
The data lines themselves appear on lines 13–21. You can see that these data lines correspond to the data appearing in Table 8.1. Remember that the first column contains the unique subject numbers assigned to each subject (this variable was given the SAS variable name SUB_NUM).
Using PROC PRINT to Create a Printout of Raw Data

Why You Need to Use PROC PRINT

Before discussing data manipulation and data subsetting, this section first shows you how to use the PRINT procedure. The PRINT procedure (PROC PRINT) is useful for generating a printout of your raw data (i.e., a printout of your data as they appear in an internal SAS data set). You can use PROC PRINT to review each subject's score on each variable in your data set (or on any subset of variables that you choose). PROC PRINT is useful for a wide variety of purposes, but here this section focuses on just two:

•  After you have created a SAS data set, you can use PROC PRINT to print the contents of that data set to verify that SAS read your data the way that you intended.

•  After you have used data manipulation statements to create new SAS variables, you can use PROC PRINT to print the values of those variables and verify that they were created in the manner that you intended.
Using PROC PRINT to Print the Current Data Set

Printing all variables in a data set. Here is the syntax for the PROC step that prints the raw data for all variables in your data set:

   PROC PRINT DATA=data-set-name;
      TITLE1 ' your-name ';
   RUN;

Here are the PRINT procedure statements that print the raw data for the achievement motivation study described above. To illustrate where the statements should be placed in the program, the last few data lines from the preceding data set are also reproduced below:

   [The first part of the DATA step appears here]
   5 5 6 1 5 F 21 3
   2 3 1 5 2 F 25 3
   4 5 4 2 5 M 23 3
   ;
   PROC PRINT DATA=D1;
      TITLE1 'JANE DOE';
   RUN;
Output 8.1 presents the PRINT results generated by the preceding statements.

                               JANE DOE                               1

      Obs   SUB_NUM   Q1   Q2   Q3   Q4   Q5   SEX   AGE   MAJOR

       1       1       6    5    5    2    6    F     22     1
       2       2       2    1    2    5    2    M     25     1
       3       3       5    .    4    3    6    M     30     1
       4       4       5    6    6    .    6    F     41     2
       5       5       4    4    5    2    5    M     22     2
       6       6       5    6    6    2    6    F     20     2
       7       7       5    5    6    1    5    F     21     3
       8       8       2    3    1    5    2    F     25     3
       9       9       4    5    4    2    5    M     23     3
Output 8.1. Results of PROC PRINT performed on initial data set, achievement motivation study.
For the most part, Output 8.1 duplicates the data that appeared in Table 8.1. The most obvious difference is the fact that subject names do not appear in Output 8.1; instead, each data line is identified with an observation number. These observation numbers appear in the column headed "Obs." The first entry in the column headed "Obs" is "1" for "observation #1," the second entry is "2" for "observation #2," and so forth. This Obs variable was generated by SAS.

When the observations in a data set are individual subjects (as is the case with the current achievement motivation study), the observation numbers are essentially subject numbers. This means that observation #1 consists of data for subject #1 (Marsha, from Table 8.1), observation #2 consists of data for subject #2 (Charles), and so forth. This can be easily verified by comparing the observation numbers in the first column to the values of the subject number variable (SUB_NUM), which appears as the second column in the output.

The remaining columns of Output 8.1 correspond to the columns of Table 8.1 presented earlier. Specifically: The column headed "Q1" presents subject responses to question 1, the first achievement motivation item. In the same way, columns Q2–Q5 present responses to the remaining achievement motivation items. The column headed "SEX" identifies the sex for each subject in the study. The column headed "AGE" represents subject age. The column headed "MAJOR" identifies the area that each subject majored in. You will remember that, in the way that these data were entered, the value "1" identifies arts and
sciences majors, the value "2" identifies business majors, and the value "3" identifies education majors.

Printing a subset of variables in a data set. Sometimes you will wish to print raw data for just a few variables in a data set. When this is the case, you should use the VAR statement with a PROC PRINT statement. In the VAR statement, you list just the names of the variables that you want to print. Here is the syntax:

   PROC PRINT DATA=data-set-name;
      VAR variable-list;
      TITLE1 ' your-name ';
   RUN;

For example, the following will cause PROC PRINT to print raw values for just the SEX and AGE variables:

   PROC PRINT DATA=D1;
      VAR SEX AGE;
      TITLE1 'JANE DOE';
   RUN;

Output 8.2 presents the results generated by the preceding statements:

                               JANE DOE

                          Obs    SEX    AGE

                           1      F      22
                           2      M      25
                           3      M      30
                           4      F      41
                           5      M      22
                           6      F      20
                           7      F      21
                           8      F      25
                           9      M      23
Output 8.2. Results of PROC PRINT in which only the SEX and AGE variables were listed in the VAR statement.
You can see that the values for SEX and AGE in Output 8.2 are identical to the values appearing in Output 8.1.
A Common Misunderstanding Regarding PROC PRINT

Students learning SAS often misunderstand PROC PRINT: they sometimes assume that a SAS program must contain PROC PRINT in order to generate a paper printout of their results. This is not the case. PROC PRINT simply generates a printout of your raw data (i.e., subjects' individual scores on the variables in your data set). If you have run some other SAS procedure such as PROC MEANS or PROC FREQ, you do not have to include PROC PRINT in your program to create a paper printout of the results generated by those procedures.
Where to Place Data Manipulation and Data Subsetting Statements

Overview

In general, data manipulation and subsetting statements should appear only within a SAS DATA step. Remember that a DATA step begins with the DATA statement, and ends when SAS encounters a PROC (procedure) statement. This means that, if you prepare a DATA step, end the DATA step with a PROC, and then place some manipulation or subsetting statements immediately after the PROC, an error results. To avoid this error (and keep things simple), place your data manipulation and data subsetting statements either

•  immediately following the INPUT statement, or

•  immediately following the creation of a new data set.
Placing These Statements Immediately Following the INPUT Statement

This guideline is illustrated by referring to the study on achievement motivation. Suppose that you prepared the following SAS program to analyze data obtained in your study.
In the following program, lines 13 and 14 indicate where you could place data manipulation or data subsetting statements in that program:

    1   OPTIONS LS=80 PS=60;
    2   DATA D1;
    3      INPUT   Q1
    4              Q2
    5              Q3
    6              Q4
    7              Q5
    8              SEX $
    9              AGE
   10              MAJOR;
   11
   12
   13      /*place data manipulation statements
   14        and data subsetting statements here*/
   15
   16   DATALINES;
   17   6 5 5 2 6 F 22 1
   18   2 1 2 5 2 M 25 1
   19   5 . 4 3 6 M 30 1
   20   5 6 6 . 6 F 41 2
   21   4 4 5 2 5 M 22 2
   22   5 6 6 2 6 F 20 2
   23   5 5 6 1 5 F 21 3
   24   2 3 1 5 2 F 25 3
   25   4 5 4 2 5 M 23 3
   26   ;
   27
   28   PROC MEANS DATA=D1;
   29   RUN;
Placing the Statements Immediately After Creating a New Data Set

A new data set may be created at virtually any point in a SAS program (even after PROCs have been requested). SAS programmers often create a new data set so that, initially, it is a duplicate of an existing data set (perhaps the one created with a preceding INPUT statement). If data manipulation or data subsetting statements follow the creation of this new data set, the new data set displays the modifications requested by those statements. To create a duplicate of an existing data set, use the following syntax:

   DATA new-data-set-name;
      SET existing-data-set-name;
Here is an example of statements that you could use:

   DATA D2;
      SET D1;

The preceding lines tell SAS to create a new data set named D2, and make this new data set a duplicate of the existing data set, D1. Now that a new data set has been created, you can write as many data manipulation and subsetting statements as you like. However, once you write a PROC statement, that ends the DATA step, and no more manipulation or subsetting statements may be written beyond that point (unless you create yet another data set, perhaps calling it something like D3, later in the program).

Here is an example of how a program might have been written so that the manipulation and subsetting statements follow the creation of a new data set:

    1   OPTIONS LS=80 PS=60;
    2   DATA D1;
    3      INPUT   Q1
    4              Q2
    5              Q3
    6              Q4
    7              Q5
    8              SEX $
    9              AGE
   10              MAJOR;
   11   DATALINES;
   12   6 5 5 2 6 F 22 1
   13   2 1 2 5 2 M 25 1
   14   5 . 4 3 6 M 30 1
   15   5 6 6 . 6 F 41 2
   16   4 4 5 2 5 M 22 2
   17   5 6 6 2 6 F 20 2
   18   5 5 6 1 5 F 21 3
   19   2 3 1 5 2 F 25 3
   20   4 5 4 2 5 M 23 3
   21   ;
   22   PROC MEANS DATA=D1;
   23   RUN;
   24
   25   DATA D2;
   26      SET D1;
   27
   28      /*place data manipulation statements
   29        and data subsetting statements here*/
   30
   31   PROC MEANS DATA=D2;
   32   RUN;
Some notes about the preceding program:

•  The DATA statement on line 2 tells SAS to give this data set the name "D1."

•  Lines 2–21 create the initial data set in the usual way.

•  Lines 22–23 cause PROC MEANS to be performed on the initial data set.

•  Lines 25–26 cause a new DATA step to begin. Line 25 tells SAS to give this new data set the name "D2," and line 26 tells SAS to make D2 an exact copy of D1.

•  Any data manipulation or subsetting statements that appear in lines 28–29 affect only the new data set, D2.
The PROC MEANS statement in line 31 requests that some simple descriptive statistics be computed. By default, PROC MEANS always computes the mean, standard deviation, and a few other descriptive statistics. It is clear that these statistics would be computed from the data in the new data set D2, because of the “DATA=D2” option that appears in the PROC MEANS statement. If the statement had instead specified “DATA=D1,” the analyses would have instead been performed on the original data set.
Basic Data Manipulation

Overview

Data manipulation involves performing some type of transformation on one or more variables within a DATA step. This section discusses several types of transformations that are frequently required in research, such as creating duplicate variables with new variable names, creating new variables from existing variables, and recoding reversed items.

Creating Duplicate Variables with New Variable Names

Suppose you gave a variable a certain name when it was input, but now wish the variable to have a different, perhaps more meaningful name when it appears later in the SAS program or in SAS output. One way to "rename" an existing variable is to create a new variable that is identical to the existing variable, and assign a new, more meaningful name to this new variable. This can easily be accomplished by writing a statement that uses the following syntax:

   new-variable-name = existing-variable-name;

For example, in the achievement motivation study, you used the SAS variable name "SEX" to represent the subject's sex. Suppose for a moment that, at some point in the program, you wish to use the SAS variable name "GENDER" instead of "SEX." This could be done with the following statement:

   GENDER = SEX;

The preceding statement tells SAS to create a new variable named "GENDER," and make it a duplicate of "SEX." This means that, if a given subject had a value of "F" on SEX, she
will also have a value of "F" on GENDER; if a given subject had a value of "M" on SEX, he will also have a value of "M" on GENDER. The key, of course, is to make sure that you place statements such as these only within a DATA step. Below is an example of how this could be done with the achievement motivation program (to conserve space, only the last few data lines from the program are reproduced below):

        [The first part of the DATA step appears here]
   18   7 5 5 6 1 5 F 21 3
   19   8 2 3 1 5 2 F 25 3
   20   9 4 5 4 2 5 M 23 3
   21   ;
   22   PROC MEANS DATA=D1;
   23   RUN;
   24
   25   DATA D2;
   26      SET D1;
   27
   28      GENDER = SEX;
   29
   30   PROC FREQ DATA=D2;
   31      TABLES GENDER;
   32   RUN;
Some notes about the preceding program:

•  The data lines end with line 20, and lines 22–23 cause PROC MEANS to be performed on the initial data set.

•  Lines 25–26 cause a new DATA step to begin. Line 25 tells SAS to give this new data set the name "D2," and line 26 tells SAS to make D2 a duplicate of D1.

•  Line 28 tells SAS to create a new variable called "GENDER" in the new data set D2, and to make it a duplicate of the existing variable, "SEX."

•  Lines 30–32 cause PROC FREQ to be performed on GENDER. Notice that the "DATA=D2" part of line 30 specifies that the analysis should be performed using the data set named "D2."
Conforming to the rules for SAS variable names. When you create a new variable name, remember to make it conform to the rules for SAS variable names discussed in Chapter 4, "Data Input" (e.g., begin the new variable's name with a letter). Also, note that each statement that creates a duplicate of an existing variable must end with a semicolon.

Duplicating variables versus renaming variables. Technically, it should be clear that the previous program did not really "rename" the variable initially named SEX. What it actually did was create a duplicate of the variable SEX, and assign a new name to the duplicate variable. This means that the resulting data set contained both the original
variable under its old name (SEX), as well as the duplicate variable under its new name (GENDER).

Creating New Variables from Existing Variables

A job satisfaction example. It is often necessary to perform mathematical operations on existing variables, and use the results to create a new variable. For example, suppose that you created the following 3-item scale designed to measure job satisfaction:

   Directions: Please indicate the extent to which you agree or disagree with each of the following statements. You will do this by circling the appropriate number to the left of that statement. The following format shows what each response number stands for:

      6 = Agree Very Strongly
      5 = Agree Strongly
      4 = Agree Somewhat
      3 = Disagree Somewhat
      2 = Disagree Strongly
      1 = Disagree Very Strongly

       Circle Your
        Response
   ----------------
   1 2 3 4 5 6    1. I am satisfied with my job.
   1 2 3 4 5 6    2. My job satisfies most of my work-related needs.
   1 2 3 4 5 6    3. I like my job.
Subjects respond to each of these three items using a 6-point response format in which "1" = "Disagree Very Strongly" and "6" = "Agree Very Strongly" (similar to the achievement motivation scale presented earlier). Suppose that you administer this 3-item scale to 100 subjects and write a SAS program to analyze their responses. You use the SAS variable name "Q1" to represent responses to item 1, "Q2" to represent responses to item 2, and "Q3" to represent responses to item 3.

Suppose that, for each subject, you would like to compute a single score to represent overall level of job satisfaction. For a given subject, this single score will simply be the sum of his or her responses to items 1, 2, and 3 from the scale. The following formula shows how it would be computed:

   Overall job satisfaction = Q1 + Q2 + Q3
Scores on this overall job satisfaction variable must fall somewhere within the following range:

•  The lowest possible score would be a score of "3." This score would be obtained if the subject circled "1" for each of the three items (for "Disagree Very Strongly").

•  The highest possible score would be a score of "18." This score would be obtained if the subject circled "6" for each of the three items (for "Agree Very Strongly").
Obviously, with this scale, higher scores indicate higher levels of job satisfaction. There are at least two ways that you can create this single overall job satisfaction variable. The more difficult way would be to pull out your pocket calculator, look at a given subject's responses to Q1, Q2, and Q3, and add these three responses together. The easier way would be to write a simple SAS statement that does this work for you. The following section shows how.

Creating new variables using simple formulas. To create a new variable by performing mathematical operations on existing variables, use the following syntax:

   new-variable-name = formula-including-existing-variables;

For example, assume that you have written a SAS program that inputs responses to the three job satisfaction questions, and have given them the SAS variable names Q1, Q2, and Q3. You can now write a SAS statement within a DATA step to create the single overall job satisfaction variable that you need. The following statement does this:

   SATIS = Q1 + Q2 + Q3;

The preceding statement tells SAS to create a new variable named "SATIS." A given subject's score on SATIS should be equal to the sum of Q1, Q2, and Q3. You can now use the variable SATIS in subsequent analyses: You can compute your subjects' mean score on SATIS, correlate SATIS with other variables, and perform a wide variety of other analyses.

When creating new variables this way, be sure that all variables on the right side of the equals sign are existing variables. This means that they already exist in the data set, either because they are listed in the INPUT statement, or because they were created with earlier data manipulation statements.

Symbols for arithmetic operators. The preceding statement that created the new variable "SATIS" used two arithmetic operators: the equals sign (=) and the plus sign (+). You use arithmetic operators to tell SAS about the types of mathematical operations you wish to perform on your data. Here is a list of the symbols for these arithmetic operators to use when writing SAS statements:

   +   Addition
   -   Subtraction
   *   Multiplication
   /   Division
   =   Equals
Using parentheses. When you write formulas, make heavy use of parentheses. Remember that operations enclosed within parentheses are performed first, and operations outside the parentheses are performed later. Using parentheses ensures that operations are performed in the sequence that you expect.

For example, suppose that, with the preceding study on job satisfaction, you want each subject's score on SATIS to be equal to the average of his or her responses to Q1, Q2, and Q3 (rather than the sum of his or her responses to Q1, Q2, and Q3). The following statement would compute this average:

   SATIS = (Q1 + Q2 + Q3) / 3;

This statement tells SAS to

•  create a new variable named SATIS.

•  in creating this new variable, begin by adding together Q1, Q2, and Q3.

•  divide this sum by 3.

The resulting quotient is that subject's score on SATIS. By using parentheses, you ensure that the addition is performed first, and the division is performed second. In contrast, consider what would have happened if you had instead written the statement in the following way, without parentheses:

   SATIS = Q1 + Q2 + Q3 / 3;

In this case, SAS would have begun by dividing each subject's score on Q3 by 3. The resulting quotient would have then been added to Q1 and Q2. Obviously, this would have resulted in a very different score for SATIS. Why the difference? Because division has priority over addition as an arithmetic operator. When no parentheses are included in a formula, division is performed before addition.

When an expression contains more than one operator, SAS follows a set of rules that determine which operations are performed first, which are performed second, and so forth. Here are the rules that pertain to the mathematical operators (+, -, /, and *):

•  Multiplication and division operators (* and /) have equal priority, and they are performed first.

•  Addition and subtraction operators (+ and -) have equal priority, and they are performed second.

To protect yourself from errors, use parentheses when writing formulas. Because operations that are included inside parentheses are performed first, using parentheses gives you control over the sequence in which operations are executed.
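A quick worked example (with made-up responses) shows how large the difference can be. Suppose a subject answered Q1 = 6, Q2 = 6, and Q3 = 3. With parentheses, SATIS = (6 + 6 + 3) / 3 = 15 / 3 = 5, a legitimate average on the 1-to-6 response scale. Without parentheses, SATIS = 6 + 6 + (3 / 3) = 6 + 6 + 1 = 13, which is not even a possible value for the average of three items on this scale.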
Recoding Reversed Variables

Very often a questionnaire will contain a number of "reversed" items. A reversed item is a question stated so that its meaning is the opposite of the other items in that scale. For example, consider the following (somewhat revised) items from the job satisfaction scale:

   1 2 3 4 5 6    1. I am satisfied with my job.
   1 2 3 4 5 6    2. My job satisfies most of my work-related needs.
   1 2 3 4 5 6    3. I hate my job.
Items 1 and 2 from the preceding scale are stated in the same way that they were stated when the scale was first presented a few pages earlier. Item 3, however, has been changed; item 3 is now a reversed item. In the original version of this scale, item 3 stated "I like my job." In the current version of the scale, item 3 now states the opposite: "I hate my job."

In a sense, all of the questions in this 3-item scale are measuring the same thing: whether the subject feels satisfied with his or her job. Items 1 and 2 are stated so that, the more strongly you agree with the statement, the higher is your level of job satisfaction (remember that, with the response format, "1" = "Disagree Very Strongly" and "6" = "Agree Very Strongly"). However, item 3 is now a reversed item: It is stated so that the more strongly you agree with it, the lower is your level of job satisfaction. Here, scores of 1 indicate a higher level of satisfaction, and scores of 6 indicate lower satisfaction (which is just the reverse of items 1 and 2).

It would be nice if all three items were consistent, so that scores of 6 always indicate high satisfaction, and scores of 1 always indicate low satisfaction. This requires that you recode item 3 so that people who circled a 6 are given a score of 1 instead; people who circled a 5 are given a score of 2 instead; and so on. This can be done very easily with the following statement:

   V3 = 7 - V3;

In SAS, these are called assignment statements. This book refers to them as recoding statements. The preceding statement tells the computer to create a new version of the variable V3 by subtracting the subject's existing (old) score on V3 from the number "7." The result will be the subject's new score on V3. Notice that now, if a person's old score on V3 was 6, his or her new score is 1; if the old score was 1, the new score is 6, and so on. The syntax for this recoding statement is as follows:

   existing-variable = constant - existing-variable;
What is the "constant"? It will always be equal to the number of response points on your scale plus 1. For example, the job satisfaction scale included 6 response points: a subject
could choose from a range of responses, beginning with "1" for "Disagree Very Strongly" through "6" for "Agree Very Strongly." It was a 6-point scale, so the constant was 6 + 1 = 7. What would the constant be if the following 7-point response format had been used instead?

   7 = Agree Very Strongly
   6 = Agree Strongly
   5 = Agree Somewhat
   4 = Neither Agree nor Disagree
   3 = Disagree Somewhat
   2 = Disagree Strongly
   1 = Disagree Very Strongly

The constant would have been 8, because 7 + 1 = 8. This means that the recoding statement would have read:

   V3 = 8 - V3;
Where Should the Recoding Statements Go?

In most cases, reversed items should be recoded before other data manipulations are performed on them. For example, suppose that you want to create a new variable named SATIS, which stands for "job satisfaction." With this scale, higher scores indicate higher levels of satisfaction. For a given subject, his or her score on this scale will be the average of his or her responses to items 1, 2, and 3 from the preceding scale. Because item 3 is a reversed item, it is important that it be recoded before it is added to items 1 and 2 in calculating this scale score. Therefore, the correct sequence of statements is as follows:

   V3 = 7 - V3;
   SATIS = (V1 + V2 + V3) / 3;

The following sequence is not correct:

   SATIS = (V1 + V2 + V3) / 3;
   V3 = 7 - V3;
Recoding a Reversed Item and Creating a New Variable for the Achievement Motivation Study

The Scale

The beginning of this chapter presented a short questionnaire that could be used to assess achievement motivation in a sample of college students. The five questionnaire items that assessed achievement motivation are reproduced here:

    Circle Your
     Response
   -------------
   1 2 3 4 5 6    1. I try very hard to improve on my performance in classes.
   1 2 3 4 5 6    2. I take moderate risks to get ahead in classes.
   1 2 3 4 5 6    3. I try to perform better than my fellow students.
   1 2 3 4 5 6    4. I try to achieve as little as possible in classes.
   1 2 3 4 5 6    5. I do my best work when my class assignments are fairly difficult.
It is clear that all five items were designed to measure academic achievement motivation. With items 1, 2, 3, and 5, the more the subject agrees with the item, the higher his or her level of achievement motivation (you will remember that the scale uses a response format in which "1" = "Disagree Very Strongly" and "6" = "Agree Very Strongly"). Item 4, however, is a reversed item; it states "I try to achieve as little as possible in classes." The more strongly subjects agree with item 4, the lower their level of achievement motivation.

Creating the New SAS Variable

Suppose that you wish to create a new SAS variable named "ACH_MOT." A given subject's score on this new variable would be equal to the average of his or her responses to items 1–5 from the achievement motivation questionnaire. Higher scores on ACH_MOT indicate higher levels of achievement motivation. The SAS statement that creates ACH_MOT looks like this:

   ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;
Recoding the Reversed Item

However, you have a problem: As was stated earlier, item 4 from the scale is a reversed item. Before you can include it in the preceding statement to create ACH_MOT, you must first recode Q4 so that it is no longer reversed. Here is the SAS statement that will accomplish this:

   Q4 = 7 - Q4;

In the preceding statement, the constant is "7." This is because the achievement motivation scale uses a 6-point response format, and 6 + 1 = 7.

The SAS Program

Finally, you are ready to put it all together. Here are the statements that recode the reversed item, Q4, and create the new SAS variable, ACH_MOT:

        [The first part of the DATA step appears here]
   18   7 5 5 6 1 5 F 21 3
   19   8 2 3 1 5 2 F 25 3
   20   9 4 5 4 2 5 M 23 3
   21   ;
   22   DATA D2;
   23      SET D1;
   24
   25      Q4 = 7 - Q4;
   26      ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;
   27
   28   PROC PRINT DATA=D2;
   29      VAR Q1 Q2 Q3 Q4 Q5 ACH_MOT;
   30      TITLE1 'JANE DOE';
   31   RUN;
Some notes about the preceding program:

•  The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).

•  A new DATA step begins on line 22. This was necessary, because you can recode variables and create new variables only within a DATA step.

•  Line 25 presents the statement that recodes Q4.

•  Line 26 presents the statement that creates ACH_MOT. Each subject's score on ACH_MOT is equal to the average of his or her responses to Q1–Q5.

•  Lines 28–31 request that the PRINT procedure be performed. Notice that the VAR statement on line 29 tells SAS to print only the variables Q1, Q2, Q3, Q4, Q5, and ACH_MOT.
The SAS Output

Output 8.3 presents the results generated by the preceding program.

                               JANE DOE

        Obs   Q1   Q2   Q3   Q4   Q5   ACH_MOT

         1     6    5    5    5    6     5.4
         2     2    1    2    2    2     1.8
         3     5    .    4    4    6      .
         4     5    6    6    .    6      .
         5     4    4    5    5    5     4.6
         6     5    6    6    5    6     5.6
         7     5    5    6    6    5     5.4
         8     2    3    1    2    2     2.0
         9     4    5    4    5    5     4.6
Output 8.3. Results of the PROC PRINT statement in which Q1–Q5 and ACH_MOT were listed in the VAR statement, achievement motivation study.
Some notes about Output 8.3:

•  The column headed "Q4" presents each subject's score on the SAS variable Q4 (responses to item 4 from the questionnaire). This output presents Q4 as it existed after being recoded by the SAS statement "Q4 = 7 - Q4;".

•  In Output 8.3, Observation #1 corresponds to subject #1 (Marsha) from Table 8.1, presented at the beginning of this chapter. In Table 8.1, Marsha had a score of "2" on Q4; in Output 8.3, her score on Q4 has been recoded to be "5." This is as expected.

•  If you compare the variable Q4 as it appears in Table 8.1 to the variable Q4 as it appears in Output 8.3, you will see that the recoding statement appears to have had the intended effect in recoding all responses to Q4.

•  The final column in Output 8.3 is headed "ACH_MOT." This column provides each subject's score on the new variable, ACH_MOT. The score on ACH_MOT for subject #1 is "5.4." This represents the average of subject #1's responses to Q1, Q2, Q3, Q4, and Q5 (6, 5, 5, 5, and 6, respectively). The output shows that the score on ACH_MOT for subject #2 is 1.8. This represents the average of subject #2's responses to Q1, Q2, Q3, Q4, and Q5 (2, 1, 2, 2, and 2, respectively). You can see that scores on ACH_MOT were determined in the same way for each of the remaining subjects in Output 8.3.

•  Subject #3 has missing data for ACH_MOT (a single period appears where you would expect his score on ACH_MOT to appear). This is because Subject #3 (Jack) had missing data on the variable Q2, which was used in creating ACH_MOT. Whenever a subject has missing data on an existing variable that is used in the creation of a new variable, that subject is assigned missing data on the new variable as well (at least this is the case with the type of data manipulation statements presented in this chapter).
•  In the same way, you can see that Subject #4 (Cathy) also had missing data on ACH_MOT. This is because she had missing data on Q4, which was also used in the creation of ACH_MOT.
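As an aside, SAS also provides a built-in MEAN function that averages only the nonmissing values it is given, which represents a different policy for handling missing items. Below is a minimal sketch of this alternative, assuming the recoded data set D2 from the program above; the data set name D3 and the variable name ACH_MOT2 are hypothetical, chosen only for illustration:

   DATA D3;
      SET D2;
      /* MEAN averages only the nonmissing arguments, so subjects */
      /* 3 and 4 would receive the average of their four answered */
      /* items instead of a missing value                         */
      ACH_MOT2 = MEAN(Q1, Q2, Q3, Q4, Q5);

Whether averaging the available items is appropriate is a substantive judgment; the simple arithmetic formula used in this chapter, which assigns missing data instead, is the more conservative choice.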
The SAS Log

The SAS log contains your SAS program (minus the data lines), along with any notes, warnings, or error messages created by SAS as it executes the program. Log 8.1 presents an excerpt from the SAS log created by the preceding program.

   22   DATA D2;
   23      SET D1;
   24
   25      Q4 = 7 - Q4;
   26      ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;

   NOTE: Missing values were generated as a result of performing
         an operation on missing values.
         Each place is given by: (Number of times) at (Line):(Column).
         1 at 25:11   1 at 26:18   1 at 26:23
         2 at 26:28   2 at 26:33   2 at 26:39

Log 8.1. Note about missing values produced by data manipulation statements.
The excerpt of the SAS log appearing in Log 8.1 contains the statements that cause the creation of the new SAS data set D2 on lines 22–23. It also contains the statement in which Q4 was recoded on line 25, and the statement in which ACH_MOT was created from existing variables on line 26. Log 8.1 shows that, just below these statements, SAS generated the following:

   NOTE: Missing values were generated as a result of performing
         an operation on missing values.

The note lists the locations in the SAS program where these missing values were generated. A note such as this is not necessarily a cause for alarm. SAS automatically prints this type of note whenever you are creating a new variable from existing variables and some of the existing variables contain missing data. The preceding section headed "The SAS Output" pointed out that the program contained missing data for the following two subjects:

•  Subject #3 (Jack) had missing data on the variable Q2, which was used in creating the new variable ACH_MOT.

•  Subject #4 (Cathy) had missing data on Q4, which was also used in the creation of ACH_MOT.
SAS generated the note in Log 8.1 to remind you that it assigned missing values on the new variable it was creating (ACH_MOT), because these two subjects had missing values on the two existing variables, Q2 and Q4. When you receive a note such as this, you should review it to verify that the number of missing values being created is reasonable, given the number of missing values that appear in your initial data set. If the number of missing values being generated seems to be inappropriate, you should attempt to identify the source of the problem by carefully reviewing the data values in your data set, along with all data manipulation statements.
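One quick way to perform this kind of check is to count the missing values directly. What follows is a minimal sketch (not part of the original program) that uses the N and NMISS options of PROC MEANS to report, for each variable, how many subjects have valid scores and how many have missing data:

PROC MEANS DATA=D2 N NMISS;
   VAR Q1 Q2 Q3 Q4 Q5 ACH_MOT;   /* the variables created or used above */
   TITLE1 'JANE DOE--CHECK FOR MISSING VALUES';
RUN;

In the present example, you would expect this output to show one missing value each for Q2 and Q4, and two missing values for ACH_MOT.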
Using IF-THEN Control Statements

Overview

An IF-THEN control statement allows you to ensure that operations are performed on a given subject's data only if certain conditions are true for that subject. You can use IF-THEN control statements to modify existing variables or create new variables. This section illustrates a number of ways that IF-THEN control statements can be used to perform tasks that are commonly required in research.

Creating a New AGE Variable

For example, in the achievement motivation study described earlier, one of the SAS variables included in the data set is AGE: each subject's age in years. Suppose that you now wish to create a new variable called AGE2. You will use the following rules in assigning scores to subjects on AGE2:
• If a given subject's score on AGE is less than 25, then his or her score on AGE2 will be 0.
• If a given subject's score on AGE is greater than or equal to 25, then his or her score on AGE2 will be 1.
The following IF-THEN control statements create AGE2 according to these rules:

     [The first part of the DATA step appears here]
18   7 5 5 6 1 5 F 21 3
19   8 2 3 1 5 2 F 25 3
20   9 4 5 4 2 5 M 23 3
21   ;
22   DATA D2;
23      SET D1;
24
25      AGE2 = .;
26      IF AGE LT 25 THEN AGE2 = 0;
27      IF AGE GE 25 THEN AGE2 = 1;
28
29   PROC PRINT DATA=D2;
30      VAR AGE AGE2;
31      TITLE1 'JANE DOE';
32   RUN;
Some notes about the preceding program:
• The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).
• A new DATA step begins on line 22. This was necessary because you can use IF-THEN control statements to create a new variable only within a DATA step.
• Line 25 tells SAS to create a new variable called "AGE2," and begin by assigning missing data (.) to all subjects on AGE2.
• Line 26 tells SAS that, if the score on AGE for a given subject is less than 25, then that subject's score on AGE2 should be "0" (the "LT" in this statement stands for "is less than").
• Line 27 tells SAS that, if the score on AGE for a given subject is greater than or equal to 25, then that subject's score on AGE2 should be "1" (the "GE" in this statement stands for "is greater than or equal to").
• Lines 29–32 request that the PRINT procedure be performed on the variables AGE and AGE2.
The preceding program generates the results presented in Output 8.4.

                        JANE DOE

                  Obs    AGE    AGE2
                    1     22       0
                    2     25       1
                    3     30       1
                    4     41       1
                    5     22       0
                    6     20       0
                    7     21       0
                    8     25       1
                    9     23       0

Output 8.4. Results of the PROC PRINT in which AGE and AGE2 were listed in the VAR statement, achievement motivation study.

In Output 8.4, the column headed "AGE" presents subject scores on AGE as they appeared in the initial SAS data set. The column headed "AGE2" presents subject scores on the new AGE2 variable, as created by the IF-THEN control statements.

Observation #1 presents the AGE score for subject #1 (Marsha), which was "22." This subject's score on AGE2 was "0," which was as expected: the preceding IF-THEN control statements indicated that if a subject's score on AGE was less than 25, then her score on AGE2 should be 0. Observation #2 presents the AGE score for subject #2 (Charles), which was "25." This subject's score on AGE2 was "1," which was as expected: the preceding IF-THEN control statements indicated that if a subject's score on AGE was greater than or equal to 25, then his score on AGE2 should be 1. Output 8.4 shows that each remaining subject's score on AGE2 was created according to the same rules.

Comparison Operators

The preceding example introduced you to the concept of comparison operators. The comparison operator "LT" represented "is less than," and the comparison operator "GE" represented "is greater than or equal to."
The following comparison operators may be used with IF-THEN statements:

=          is equal to
NE         is not equal to
GT or >    is greater than
GE         is greater than or equal to
LT or <    is less than
LE         is less than or equal to
The Syntax of the IF-THEN Statement

The syntax is as follows:

IF expression THEN statement;
The "expression" usually consists of some comparison involving existing variables. The "statement" usually involves some operation performed on existing variables or new variables. To illustrate the use of the IF-THEN statement, this section again refers to the fictitious Learning Aptitude Test (abbreviated "LAT") presented earlier in this guide. A previous chapter indicated that the LAT includes a verbal subtest as well as a mathematical subtest. Suppose that you have obtained LAT scores for a sample of subjects, and now wish to create a new variable called LATVGRP, which is an abbreviation for "LAT-verbal group." This variable will be created with the following provisions:
• If you do not know what a subject's LAT Verbal test score is, that subject will have a score of "." (for "missing data") on LATVGRP.
• If the subject's score is under 500 on the LAT Verbal test, the subject will have a score of 0 on LATVGRP.
• If the subject's score is 500 or greater on the LAT Verbal test, the subject will have a score of 1 on LATVGRP.
Suppose that the variable LATV already exists in your data set and that it contains each subject's actual score on the LAT Verbal test. You can now use it to create the new variable, LATVGRP, by writing the following statements:

LATVGRP = .;
IF LATV LT 500 THEN LATVGRP = 0;
IF LATV GE 500 THEN LATVGRP = 1;

The preceding statements tell SAS to create a new variable called LATVGRP, and begin by setting everyone's score equal to "." (missing). If a subject's score on LATV is less than 500, then SAS sets his or her score on LATVGRP equal to 0. If a subject's score on LATV is greater than or equal to 500, then SAS sets his or her score on LATVGRP equal to 1.
Using ELSE Statements

You could have performed the preceding operations more efficiently by using the ELSE statement. Here is the syntax for using the ELSE statement with the IF-THEN statement:

IF expression THEN statement;
ELSE IF expression THEN statement;
The ELSE statement provides alternative actions that SAS may take if the original IF expression is not true. For example, consider the following:

1   LATVGRP = .;
2   IF LATV LT 500 THEN LATVGRP = 0;
3   ELSE IF LATV GE 500 THEN LATVGRP = 1;
The preceding tells SAS to
• create a new variable called LATVGRP, and initially assign all subjects a value of "missing"
• assign a particular subject a score of 0 on LATVGRP if that subject has an LATV score less than 500
• assign a particular subject a score of 1 on LATVGRP if that subject has an LATV score greater than or equal to 500.
The preceding statements are identical to the earlier statements that created LATVGRP, except that the word ELSE has been added to the beginning of the third line. In fact, these two approaches assign exactly the same values on LATVGRP to each subject. What, then, is the advantage of including the ELSE statement? The answer has to do with efficiency: when an ELSE statement is included, the actions specified in that statement are executed only if the expression in the preceding IF statement is not true. For example, consider the situation in which subject 1 has a score on LATV that is less than 500. Line 2 in the preceding statements would assign that subject a score of 0 on LATVGRP. SAS would then skip line 3 (because it contains the ELSE statement), thus saving computer time. If line 3 did not contain the word ELSE, SAS would execute the line, checking to see whether the LATV score for subject 1 is greater than or equal to 500 (which is unnecessary, given what was learned in line 2).

Regarding missing data, notice that line 2 of the preceding program assigns subjects to group 0 (under LATVGRP) if their scores on LATV are less than 500. Unfortunately, a score of "missing" (.) on LATV is viewed as being less than 500 (in fact, SAS views a missing value as being less than any actual number). This means that subjects with missing data on LATV will be assigned to group 0 under LATVGRP by line 2 of the preceding program. This is not desirable.
To prevent this from happening, you may rewrite the program in the following way:

1   LATVGRP = .;
2   IF LATV GE 200 AND LATV LT 500 THEN LATVGRP = 0;
3   ELSE IF LATV GE 500 THEN LATVGRP = 1;
Line 2 of the program now tells SAS to assign subjects to group 0 only if their scores on LATV are both greater than or equal to 200 and less than 500. Because a missing value on LATV is not greater than or equal to 200, subjects with missing data are no longer assigned to group 0. This modification uses the conditional AND statement, which is discussed in greater detail in the following section. Finally, remember to use the ELSE statement only in conjunction with a preceding IF statement, and always place it immediately after the relevant IF statement.

Using the Conditional Statements AND and OR

As the preceding section suggests, it is also possible to use the conditional statement AND within an IF-THEN statement or an ELSE statement. For example, consider the following:

LATVGRP = .;
IF LATV GT 400 AND LATV LT 500 THEN LATVGRP = 0;
ELSE IF LATV GE 500 THEN LATVGRP = 1;

The second statement tells SAS "if LATV is greater than 400 and less than 500, then give this subject a score on LATVGRP of 0." This means that subjects are given a score of 0 only if they are both over 400 and under 500. What happens to those who have a score of 400 or less on the LATV? They are given a score of "." on LATVGRP. That is, they are classified as having missing data on LATVGRP. This is because they (along with everyone else) were given a score of "." in the first statement, and neither of the later statements replaces that "." with a 0 or a 1. However, for subjects whose scores are over 400, one of the later statements will replace the "." with either a 0 or a 1.

It is also possible to use the conditional statement OR within an IF-THEN statement or an ELSE statement. For example, suppose that you have a variable in your data set called ETHNIC. With this variable, subjects were assigned the value 5 if they were Caucasian, 6 if they were African-American, or 7 if they were Asian-American. Suppose that you now wish to create a new variable called MAJORITY: subjects will be assigned a value of 1 on this variable if they are in the majority group (i.e., if they are Caucasian), and they will be assigned a value of 0 on this variable if they are in a minority group (if they are either African-American or Asian-American). The following statements would create this variable:

MAJORITY = .;
IF ETHNIC = 5 THEN MAJORITY = 1;
ELSE IF ETHNIC = 6 OR ETHNIC = 7 THEN MAJORITY = 0;

In the preceding statements, all subjects are first assigned a value of "missing" on MAJORITY. If their value on ETHNIC is 5, their score on MAJORITY is changed to 1, and SAS ignores the following ELSE statement. If their value on ETHNIC is not 5, then SAS moves on to the ELSE statement. There, if the subject's value on ETHNIC is either 6 or 7, the subject is assigned a value of 0 on MAJORITY.
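As an aside, SAS also provides the IN operator, which offers a more compact way to test whether a variable equals any value in a list. The following is a minimal sketch (not taken from the original program) that would produce the same MAJORITY values as the statements above:

MAJORITY = .;
IF ETHNIC = 5 THEN MAJORITY = 1;
ELSE IF ETHNIC IN (6, 7) THEN MAJORITY = 0;   /* IN tests a list of values */

The expression "ETHNIC IN (6, 7)" is true if ETHNIC equals either 6 or 7, so it is equivalent to the OR comparison used above.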
Working with Character Variables

Using single quotation marks. When you are working with character variables (variables whose values may consist of letters or special characters), it is important that you enclose values within single quotation marks in the IF-THEN and ELSE statements.

Converting character values to numeric values. For example, suppose that you administered the achievement motivation questionnaire (from the beginning of this chapter) to a sample of subjects. This questionnaire asked subjects to identify their sex. In entering the data, you created a character variable called SEX in which the value "F" represented female subjects and the value "M" represented male subjects. Suppose that you now wish to create a new variable called SEX2. SEX2 will be a numeric variable in which the value "0" is used to represent females and the value "1" is used to represent males. The following SAS statements could be used to create and print this new variable:

     [The first part of the DATA step appears here]
18   7 5 5 6 1 5 F 21 3
19   8 2 3 1 5 2 F 25 3
20   9 4 5 4 2 5 M 23 3
21   ;
22   DATA D2;
23      SET D1;
24
25      SEX2 = .;
26      IF SEX = 'F' THEN SEX2 = 0;
27      IF SEX = 'M' THEN SEX2 = 1;
28
29   PROC PRINT DATA=D2;
30      VAR SEX SEX2;
31      TITLE1 'JANE DOE';
32   RUN;
Some notes about the preceding program:
• The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).
• A new DATA step begins on line 22. This was necessary because you can use IF-THEN control statements to create a new variable only within a DATA step.
• Line 25 tells SAS to create a new variable called "SEX2," and begin by assigning missing data (.) to all subjects on SEX2.
• Line 26 tells SAS that, if a given subject's value on SEX is equal to "F," then her value on SEX2 should be "0."
• Line 27 tells SAS that, if a given subject's value on SEX is equal to "M," then his value on SEX2 should be "1."
On line 26, notice that the "F" is enclosed within single quotation marks. This was necessary because SEX is a character variable. However, when SEX2 is set to zero on line 26, the zero is not enclosed within single quotation marks. This is because SEX2 is not a character variable; it is a numeric variable. Use single quotation marks only to enclose the values of character variables.

Output 8.5 presents the results generated by the preceding program.

                        JANE DOE

                  Obs    SEX    SEX2
                    1     F        0
                    2     M        1
                    3     M        1
                    4     F        0
                    5     M        1
                    6     F        0
                    7     F        0
                    8     F        0
                    9     M        1

Output 8.5. Results of PROC PRINT in which SEX and SEX2 were listed in the VAR statement, achievement motivation study.
Output 8.5 presents each subject's values for the variables SEX and SEX2. Notice that, if a given subject's value for SEX is "F," her value on SEX2 is equal to 0; if a given subject's value on SEX is "M," his value on SEX2 is equal to 1. This is as expected, given the preceding IF-THEN control statements.

Converting numeric values to character values. The same conventions apply when you convert numeric values to character values: within the IF-THEN control statements, the values for character variables should be enclosed within single quotation marks, but the values for numeric variables should not be. For example, you may remember that the achievement motivation questionnaire asked subjects to indicate their major. That section of the questionnaire is reproduced here:

8. What is your major?
   ______ Arts and Sciences (1)
   ______ Business (2)
   ______ Education (3)
In entering the data, you created a numeric variable called MAJOR. You used the value "1" to represent subjects majoring in the arts and sciences, the value "2" to represent subjects majoring in business, and the value "3" to represent subjects majoring in education. Suppose that you now wish to create a new variable called MAJOR2, which will also identify the area in which the subjects majored. However, MAJOR2 will be a character variable, and its values will be three characters long. Specifically,
• the value "A&S" represents subjects majoring in the arts and sciences
• the value "BUS" represents subjects majoring in business
• the value "EDU" represents subjects majoring in education.
The following statements use IF-THEN control statements to create the new MAJOR2 variable:

1   MAJOR2 = '  .';
2   IF MAJOR = 1 THEN MAJOR2 = 'A&S';
3   IF MAJOR = 2 THEN MAJOR2 = 'BUS';
4   IF MAJOR = 3 THEN MAJOR2 = 'EDU';
Some notes about the preceding program:
• Line 1 creates the new variable MAJOR2 and initially assigns missing data to all subjects. It does this with the following statement:

     MAJOR2 = '  .';

  Notice that there is room for three characters within the single quotation marks: two blank spaces and a single period (to represent "missing data"). This is important, because the subsequent IF-THEN statements assign 3-character values to MAJOR2.
• Line 2 indicates that, if a given subject has a value of "1" on MAJOR, then that subject's value on MAJOR2 should be "A&S."
• Line 3 indicates that, if a given subject has a value of "2" on MAJOR, then that subject's value on MAJOR2 should be "BUS."
• Line 4 indicates that, if a given subject has a value of "3" on MAJOR, then that subject's value on MAJOR2 should be "EDU."
Once again, notice that when line 2 includes the expression

IF MAJOR = 1

the "1" does not appear within single quotation marks. This is because MAJOR is a numeric variable. However, when line 2 includes the statement

THEN MAJOR2 = 'A&S';

the "A&S" does appear within single quotation marks. This is because MAJOR2 is a character variable.

If a later section of your program included a PROC PRINT statement to print the contents of MAJOR and MAJOR2, the results would look something like Output 8.6.

                        JANE DOE

                 Obs    MAJOR    MAJOR2
                   1      1       A&S
                   2      1       A&S
                   3      1       A&S
                   4      2       BUS
                   5      2       BUS
                   6      2       BUS
                   7      3       EDU
                   8      3       EDU
                   9      3       EDU

Output 8.6. Results of the PROC PRINT in which MAJOR and MAJOR2 were listed in the VAR statement, achievement motivation study.
Data Subsetting

Overview

An earlier section of this chapter indicated that data subsetting statements are SAS statements that eliminate unwanted observations from a sample, so that only a specified subgroup is included in the resulting data set. Often it is necessary to perform an analysis on only a subset of the subjects who are included in the data set. For example, you may wish to review the mean survey responses provided by just the female subjects, or by just those subjects majoring in the arts and sciences. Subsetting IF statements may be used to obtain these results.

The Syntax of Data Subsetting Statements

Here is the syntax for the statements that perform data subsetting:

DATA new-data-set-name;
   SET existing-data-set-name;
   IF comparison;

The "comparison" generally includes some existing variable and at least one comparison operator.
An Example

The SAS program. For example, suppose you wish to compute mean survey responses for just those subjects who are majoring in the arts and sciences. The following statements accomplish this:

     [The first part of the DATA step appears here]
18   7 5 5 6 1 5 F 21 3
19   8 2 3 1 5 2 F 25 3
20   9 4 5 4 2 5 M 23 3
21   ;
22   DATA D2;
23      SET D1;
24
25      IF MAJOR = 1;
26
27   PROC MEANS DATA=D2;
28      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
29   RUN;
Some notes about the preceding program:
• The last data lines from the achievement motivation study appear on lines 18–20 (to save space, only the last few lines are presented).
• A new DATA step begins on line 22. This was necessary because you can perform data subsetting only within a DATA step.
• Lines 22–23 tell SAS to create a new data set, name it "D2," and create it as a duplicate of data set D1.
• Line 25 tells SAS to retain a given observation for D2 only if that observation's value on MAJOR is equal to "1." This retains in data set D2 only those subjects who majored in the arts and sciences (because the number "1" was used to represent this group under the MAJOR variable).
• Lines 27–29 request that PROC MEANS be performed on the data set.
• Line 27 includes the option "DATA=D2," which specifies that PROC MEANS should be performed on the data set D2. This makes sense, because D2 is the data set that contains just the arts and sciences majors.
Output 8.7 presents the results generated by the preceding program.

              JANE DOE--ARTS AND SCIENCES MAJORS                    1

                        The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
SUB_NUM     3     2.0000000     1.0000000     1.0000000     3.0000000
Q1          3     4.3333333     2.0816660     2.0000000     6.0000000
Q2          2     3.0000000     2.8284271     1.0000000     5.0000000
Q3          3     3.6666667     1.5275252     2.0000000     5.0000000
Q4          3     3.3333333     1.5275252     2.0000000     5.0000000
Q5          3     4.6666667     2.3094011     2.0000000     6.0000000
AGE         3    25.6666667     4.0414519    22.0000000    30.0000000
MAJOR       3     1.0000000             0     1.0000000     1.0000000
----------------------------------------------------------------------

Output 8.7. Results of the PROC MEANS performed on the data set consisting of arts and sciences majors only, achievement motivation study.
Some notes about this output:
• PROC MEANS was performed on all of the numeric variables in the data set because the VAR statement was omitted from the SAS program.
• In the column headed "N," you can see that, for most variables, the analyses were performed on just three subjects. This makes sense, because Table 8.1 showed that just three subjects were majoring in the arts and sciences.
• To the right of the variable name "MAJOR," you can see that the mean score on MAJOR is 1.0, the minimum value for MAJOR is 1.0, and the maximum value for MAJOR is 1.0. This is what you would expect if your data set consisted exclusively of arts and sciences majors: each of them should have a value on MAJOR that is equal to "1."
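If you wanted the output to report only the questionnaire items, you could add a VAR statement to the procedure. The following is a minimal sketch (a variation on the program above, not part of the original):

PROC MEANS DATA=D2;
   VAR Q1 Q2 Q3 Q4 Q5;   /* limit the analysis to the five items */
   TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
RUN;

With this VAR statement, variables such as SUB_NUM and MAJOR would no longer appear in the output.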
An Example with Multiple Subsets

It is possible to write a single program that creates multiple data sets, with each data set consisting of a different subgroup of subjects. This is done with the following program:

     [The first part of the DATA step appears here]
18   7 5 5 6 1 5 F 21 3
19   8 2 3 1 5 2 F 25 3
20   9 4 5 4 2 5 M 23 3
21   ;
22   DATA D2;
23      SET D1;
24      IF MAJOR = 1;
25   PROC MEANS DATA=D2;
26      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
27   RUN;
28
29   DATA D3;
30      SET D1;
31      IF MAJOR = 2;
32   PROC MEANS DATA=D3;
33      TITLE1 'JANE DOE--BUSINESS MAJORS';
34   RUN;
35
36   DATA D4;
37      SET D1;
38      IF MAJOR = 3;
39   PROC MEANS DATA=D4;
40      TITLE1 'JANE DOE--EDUCATION MAJORS';
41   RUN;
Some notes about the preceding program:
• Lines 22–24 create a new data set named D2. The subsetting IF statement on line 24 ensures that this data set will contain only subjects with a value of "1" on the variable MAJOR; that is, only arts and sciences majors. Lines 25–27 request that PROC MEANS be performed on this data set.
• Lines 29–31 create a new data set named D3. The subsetting IF statement on line 31 ensures that this data set will contain only subjects with a value of "2" on MAJOR; that is, only business majors. Lines 32–34 request that PROC MEANS be performed on this data set.
• Lines 36–38 create a new data set named D4. The subsetting IF statement on line 38 ensures that this data set will contain only subjects with a value of "3" on MAJOR; that is, only education majors. Lines 39–41 request that PROC MEANS be performed on this data set.
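As an aside, when the goal is simply to obtain the same analysis for every subgroup, SAS offers a shortcut that avoids creating separate data sets: BY-group processing. The following is a minimal sketch (an alternative to the program above, not part of the original), which first sorts the data by MAJOR and then asks PROC MEANS to repeat its analysis for each value of MAJOR:

PROC SORT DATA=D1;
   BY MAJOR;            /* BY-group processing requires sorted data */
RUN;

PROC MEANS DATA=D1;
   BY MAJOR;            /* one set of means for each value of MAJOR */
   TITLE1 'JANE DOE--MEANS BY MAJOR';
RUN;

The multiple-data-set approach shown above remains useful when you need the subsets themselves for further analyses.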
Specifying the initial data set in the SET statement. Notice that, throughout the multiple-subset program above, the SET statement always specifies "D1," as shown here:

SET D1;

This is because the data set D1 was the initial data set and the only data set that contained all of the initial observations. When creating a new data set that will consist of a subset of this initial data set, you will usually want to specify the initial data set in your SET statement.

Specifying the current data set in the PROC statements. PROC MEANS statements appear on lines 25, 32, and 39 of that program. Notice that, in each case, the "DATA=" option of the PROC MEANS statement always specifies the data set that has just been created. Line 25 reads "PROC MEANS DATA=D2," line 32 reads "PROC MEANS DATA=D3," and line 39 reads "PROC MEANS DATA=D4." This ensures that the first PROC MEANS is performed on the data set containing just the arts and sciences majors, the second on the data set containing just the business majors, and the third on the data set containing just the education majors.

Using Comparison Operators and the Conditional Statements AND and OR

When writing a subsetting IF statement, you may use all of the comparison operators described above (such as "LT" or "GE") as well as the conditional statements AND and OR. For example, suppose that you have created an initial data set named D1 that contains the SAS variables SEX (which represents subject sex) and AGE (which represents subject age). You now wish to create a second data set named D2, and a subject will be retained in D2 only if she is female and 65 years of age or older. The following statements will accomplish this:

DATA D2;
   SET D1;
   IF SEX = 'F' AND AGE GE 65;
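As an aside, when you need a subset only for a single analysis, SAS procedures also accept a WHERE statement, which selects observations without creating a new data set. The following is a minimal sketch (not part of the original program) that computes means for the same subgroup directly:

PROC MEANS DATA=D1;
   WHERE SEX = 'F' AND AGE GE 65;   /* select observations on the fly */
   TITLE1 'JANE DOE--WOMEN AGED 65 AND OLDER';
RUN;

The subsetting IF approach shown above is still preferable when you want to reuse the subset in several later steps.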
Eliminating Observations That Have Missing Data on Some Variables

Overview. One of the most common difficulties encountered by researchers in the social sciences and education is the problem of missing data. Briefly, the missing data problem involves not having scores on all variables for all subjects in a data set.
Missing data in the achievement motivation study. To illustrate the concept of missing data, Table 8.1 is reproduced here as Table 8.2:

Table 8.2
Data from the Achievement Motivation Study
____________________________________________________________
                    Agree-Disagree Questions
                    ________________________
Subject             Q1   Q2   Q3   Q4   Q5    Sex   Age   Major
________________________________________________________________
1. Marsha            6    5    5    2    6     F     22     1
2. Charles           2    1    2    5    2     M     25     1
3. Jack              5    .    4    3    6     M     30     1
4. Cathy             5    6    6    .    6     F     41     2
5. Emmett            4    4    5    2    5     M     22     2
6. Marie             5    6    6    2    6     F     20     2
7. Cindy             5    5    6    1    5     F     21     3
8. Susan             2    3    1    5    2     F     25     3
9. Fred              4    5    4    2    5     M     23     3
________________________________________________________________
Table 8.2 uses a single period (.) to represent missing data. The table reveals missing data for the third subject (Jack) on the variable Q2: There is a single period in the location where you would expect to see Jack’s score for Q2. Similarly, the table also reveals missing data for the fourth subject (Cathy) on variable Q4.
In Chapter 4, "Data Input," you learned that you should also use a single period to represent missing data when entering data in a SAS data set. This was shown in the section "SAS Program to Read the Raw Data," earlier in this chapter. The initial DATA step from that program is reproduced below:

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            Q1
 5            Q2
 6            Q3
 7            Q4
 8            Q5
 9            SEX $
10            AGE
11            MAJOR;
12   DATALINES;
13   1 6 5 5 2 6 F 22 1
14   2 2 1 2 5 2 M 25 1
15   3 5 . 4 3 6 M 30 1
16   4 5 6 6 . 6 F 41 2
17   5 4 4 5 2 5 M 22 2
18   6 5 6 6 2 6 F 20 2
19   7 5 5 6 1 5 F 21 3
20   8 2 3 1 5 2 F 25 3
21   9 4 5 4 2 5 M 23 3
22   ;
Line 15 of the preceding program contains data from the subject Jack. You can see that a single period appears in the location where you would normally expect Jack's score on Q2. In the same way, line 16 contains data from the subject Cathy: a single period appears in the location where you would normally expect Cathy's score on Q4.

Eliminating observations with missing data from a new data set. Suppose that you now wish to create a new data set named D2. The new data set will be identical to the initial data set (D1) with one exception: D2 will contain only observations that have no missing data on the five achievement motivation questionnaire items (Q1, Q2, Q3, Q4, and Q5). In other words, you wish to include a subject in the new data set only if the subject answered all five of the achievement motivation questionnaire items. Once you have created the new data set, you will use PROC PRINT to print it out.
The following statements accomplish this:

     [The first part of the DATA step appears here]
19   7 5 5 6 1 5 F 21 3
20   8 2 3 1 5 2 F 25 3
21   9 4 5 4 2 5 M 23 3
22   ;
23   DATA D2;
24      SET D1;
25      IF Q1 NE . AND
26         Q2 NE . AND
27         Q3 NE . AND
28         Q4 NE . AND
29         Q5 NE . ;
30
31   PROC PRINT DATA=D2;
32      TITLE1 'JANE DOE';
33   RUN;
Some notes about the preceding program:
• The last data lines from the achievement motivation study appear on lines 19–21.
• A new DATA step begins on line 23. This was necessary because you can perform data subsetting only within a DATA step.
• Lines 23–24 tell SAS to create a new data set, name it "D2," and initially create it as a duplicate of data set D1.
• Lines 25–29 contain a single subsetting IF statement. The comparison operator "NE" that appears in the statement stands for "is not equal to." This subsetting IF statement tells SAS to retain a given observation for data set D2 only if all of the following are true:
     • Q1 is not equal to "missing data"
     • Q2 is not equal to "missing data"
     • Q3 is not equal to "missing data"
     • Q4 is not equal to "missing data"
     • Q5 is not equal to "missing data."
• With this subsetting IF statement, a given subject is retained in data set D2 only if he or she had no missing data on any of the five variables listed.
• Lines 31–33 contain the PROC PRINT statement that prints the new data set. Notice that the "DATA=D2" option on line 31 specifies that D2 should be printed, rather than D1.
Output 8.8 presents the results generated by the preceding program.

                              JANE DOE                              1

   Obs   SUB_NUM   Q1   Q2   Q3   Q4   Q5   SEX   AGE   MAJOR
    1       1       6    5    5    2    6    F     22     1
    2       2       2    1    2    5    2    M     25     1
    3       5       4    4    5    2    5    M     22     2
    4       6       5    6    6    2    6    F     20     2
    5       7       5    5    6    1    5    F     21     3
    6       8       2    3    1    5    2    F     25     3
    7       9       4    5    4    2    5    M     23     3

Output 8.8. Results of the PROC PRINT performed on data set D2, achievement motivation study.
Notice that there are only seven observations in Output 8.8. The initial data set (D1) contained nine observations, but two of these (observations for subjects Jack and Cathy) contained missing data, and were therefore not included in data set D2. If you look at the values for variables Q1, Q2, Q3, Q4, and Q5, you will not see any single periods indicating missing data.
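As an aside, SAS also provides functions that can make this kind of screening more compact. The following is a minimal sketch (not part of the original program) that uses the NMISS function to retain an observation only if none of the five items is missing:

DATA D2;
   SET D1;
   IF NMISS(OF Q1-Q5) = 0;   /* retain only subjects with no missing items */

The expression "NMISS(OF Q1-Q5)" returns the number of missing values among Q1 through Q5, so requiring it to equal 0 is equivalent to the five-part subsetting IF statement shown above.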
Combining a Large Number of Data Manipulation and Data Subsetting Statements in a Single Program

Overview

Most of the SAS programs presented in this chapter have been fairly simple, in that only a few data manipulation or data subsetting statements have been included in each program. In practice, however, it is possible to include, within a single program, a relatively large number of statements that modify variables and data sets. This section provides an example of such a program.

A Longer SAS Program

This section presents a fairly long SAS program. The program includes data from the achievement motivation study (first presented in Table 8.1) that has been analyzed throughout this chapter. You will see that this single program includes a wide variety of statements that perform most of the tasks discussed in this chapter.
 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            Q1
 5            Q2
 6            Q3
 7            Q4
 8            Q5
 9            SEX $
10            AGE
11            MAJOR;
12   DATALINES;
13   1 6 5 5 2 6 F 22 1
14   2 2 1 2 5 2 M 25 1
15   3 5 . 4 3 6 M 30 1
16   4 5 6 6 . 6 F 41 2
17   5 4 4 5 2 5 M 22 2
18   6 5 6 6 2 6 F 20 2
19   7 5 5 6 1 5 F 21 3
20   8 2 3 1 5 2 F 25 3
21   9 4 5 4 2 5 M 23 3
22   ;
23   PROC PRINT DATA=D1;
24      TITLE1 'JANE DOE';
25   RUN;
26
27   DATA D2;
28      SET D1;
29
30      Q4 = 7 - Q4;
31      ACH_MOT = (Q1 + Q2 + Q3 + Q4 + Q5) / 5;
32
33      AGE2 = .;
34      IF AGE LT 25 THEN AGE2 = 0;
35      IF AGE GE 25 THEN AGE2 = 1;
36
37      SEX2 = .;
38      IF SEX = 'F' THEN SEX2 = 0;
39      IF SEX = 'M' THEN SEX2 = 1;
40
41      MAJOR2 = '  .';
42      IF MAJOR = 1 THEN MAJOR2 = 'A&S';
43      IF MAJOR = 2 THEN MAJOR2 = 'BUS';
44      IF MAJOR = 3 THEN MAJOR2 = 'EDU';
45
46   PROC PRINT DATA=D2;
47      TITLE1 'JANE DOE';
48   RUN;
49
50   DATA D3;
51      SET D2;
52      IF MAJOR2 = 'A&S';
53   PROC MEANS DATA=D3;
54      TITLE1 'JANE DOE--ARTS AND SCIENCES MAJORS';
55   RUN;
56
57   DATA D4;
58      SET D2;
59      IF MAJOR2 = 'BUS';
60   PROC MEANS DATA=D4;
61      TITLE1 'JANE DOE--BUSINESS MAJORS';
62   RUN;
63
64   DATA D5;
65      SET D2;
66      IF MAJOR2 = 'EDU';
67   PROC MEANS DATA=D5;
68      TITLE1 'JANE DOE--EDUCATION MAJORS';
69   RUN;
70
71   DATA D6;
72      SET D2;
73      IF Q1 NE . AND
74         Q2 NE . AND
75         Q3 NE . AND
76         Q4 NE . AND
77         Q5 NE . ;
78
79   PROC PRINT DATA=D6;
80      TITLE1 'JANE DOE';
81   RUN;
Some notes concerning the preceding program:
• Lines 1–22 input the achievement motivation data that were first presented in Table 8.1.
• Lines 23–25 request that PROC PRINT be performed on the initial data set, D1.
• Lines 27–28 begin a new DATA step. The new data set is named D2 and is initially created as a duplicate of D1.
• With line 30, the reversed variable Q4 is recoded.
• With line 31, the new variable ACH_MOT is created as the average of variables Q1, Q2, Q3, Q4, and Q5.
• Lines 33–35 create a new variable named AGE2, based on the existing variable AGE.
• Lines 37–39 create a new variable named SEX2, based on the existing variable SEX.
• Lines 41–44 create a new variable named MAJOR2, based on the existing variable MAJOR.
• Lines 46–48 request that PROC PRINT be performed on the new data set, D2.
• Lines 50–51 begin a new DATA step. The new data set is named D3 and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 52 that retains a subject only if his or her value on MAJOR2 is "A&S." This ensures that only arts and sciences majors are retained for the new data set. The statements on lines 53–55 request that PROC MEANS be performed on the new data set.
• Lines 57–58 begin a new DATA step. The new data set is named D4 and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 59 that retains a subject only if his or her value on MAJOR2 is "BUS." This ensures that only business majors are retained for the new data set. The statements on lines 60–62 request that PROC MEANS be performed on the new data set.
• Lines 64–65 begin a new DATA step. The new data set is named D5 and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on line 66 that retains a subject only if his or her value on MAJOR2 is "EDU." This ensures that only education majors are retained for the new data set. The statements on lines 67–69 request that PROC MEANS be performed on the new data set.
• Lines 71–72 begin a new DATA step. The new data set is named D6 and is initially created as a duplicate of D2. This is followed by a subsetting IF statement on lines 73–77 that retains a subject only if he or she has no missing data on Q1, Q2, Q3, Q4, or Q5. The statements on lines 79–81 request that PROC PRINT be performed on the new data set.
Some General Guidelines

When writing relatively long SAS programs such as this one, it is important to keep two points in mind. First, remember that you can perform data manipulation or data subsetting only within a DATA step. This means that in most cases you should begin a new DATA step (by using the DATA statement) before writing the statements that create new variables, modify existing variables, or create subsets of data.

Second, you must keep track of the names that you give to new data sets, and you must specify the correct data set name within a given PROC statement. For example, suppose that you create a data set called D1. In the course of a lengthy SAS program, you create a number of different data sets, all based on D1. Somewhere late in the program, you create a new data set named D5, and within this data set you create a new variable named ACH_MOT. You now wish to perform PROC MEANS on ACH_MOT. To do this, you must specify the data set D5 in the PROC MEANS statement, as follows:

PROC MEANS DATA=D5;
RUN;
If you specify any other data set (such as D1), you will not obtain the mean for ACH_MOT, because that variable appears only within the data set named D5. In this case, SAS will issue an error message in your log file.
Conclusion

This chapter has shown you how to use simple formulas, IF-THEN control statements, subsetting IF statements, and other tools to modify existing data sets. You should now be prepared to perform the types of data manipulation that are most commonly required in research in the social sciences and education. For example, with these tools it should now be a simple matter for you to convert raw scores into standardized scores. When analyzing data, researchers often like to standardize variables so that they have a known mean (typically zero) and a known standard deviation (typically 1). Scores that have been standardized in this way are called z scores. The following chapter shows you how to use data manipulation statements to create z scores, and illustrates some of the ways that z scores can be used to answer research questions.
Chapter 9: z Scores

Introduction ................................................................... 262
   Overview .................................................................... 262
   Raw-Score Variables versus Standardized Variables ................... 262
   Types of Standardized Scores ............................................ 262
   The Advantages of Working with z Scores ............................... 263
Example 9.1: Comparing Mid-Term Test Scores for Two Courses ............ 266
   Data Set to Be Analyzed .................................................. 266
   The DATA Step ............................................................. 267
Converting a Single Raw-Score Variable into a z-Score Variable ......... 268
   Overview .................................................................... 268
   Step 1: Computing the Mean and Sample Standard Deviation ............ 269
   Step 2: Creating the z-Score Variable .................................. 270
   Examples of Questions That Can Be Answered with the
      New z-Score Variable .................................................. 276
Converting Two Raw-Score Variables into z-Score Variables .............. 278
   Overview .................................................................... 278
   Review: Data Set to Be Analyzed ......................................... 278
   Step 1: Computing the Means and Sample Standard Deviations ......... 279
   Step 2: Creating the z-Score Variables ................................. 280
   Examples of Questions That Can Be Answered with the
      New z-Score Variables ................................................. 284
Standardizing Variables with PROC STANDARD ............................. 285
   Isn't There an Easier Way to Do This? .................................. 285
   Why This Guide Used a Two-Step Approach ............................... 286
Conclusion ..................................................................... 286
Introduction

Overview

This chapter shows you the advantages of working with standardized variables: variables with specified means and standard deviations. Most of the chapter focuses on z scores: scores that have been standardized to have a mean of zero and a standard deviation of 1. It shows you how to use SAS data manipulation statements to convert raw scores into z scores, and how to interpret the characteristics of a z score (its sign and size) to understand the relative standing of that score within a sample.

Raw-Score Variables versus Standardized Variables

All of the variables presented in this guide so far have been raw-score variables. Raw-score variables are variables that have not been transformed to have a specified mean and standard deviation. For example, if you administer an attitude scale to a group of subjects, compute their scores on the scale, and do not transform their scores in any way, then the attitude scale is a raw-score variable. Depending on the nature of your scale, the sample of scores might have almost any mean or standard deviation.

In contrast, a standardized variable is a variable that has been transformed to have a specified mean and standard deviation. For example, consider the scores on the attitude scale mentioned above. If you wanted to, you could convert these raw scores into z scores. This means that you would transform the variable so that it has a mean of zero and a standard deviation of 1. In this situation the new variable that you create (the group of z scores) is a standardized variable.

Types of Standardized Scores

In the social sciences and education, the z score is probably the most frequently used type of standardized score. A z score is a value that indicates the distance of a raw score from the mean when the distance is measured in standard deviations. In other words, a z score indicates how many standard deviations above (or below) the mean a given raw score is located. By definition, a sample of z scores has a mean of zero and a standard deviation of 1.

Another type of standardized variable that is sometimes used in the social and behavioral sciences is the T score. A sample of T scores has a mean of 50 and a standard deviation of 10. Intelligence quotient (IQ) scores are often standardized as well: a sample of IQ scores is typically standardized so that it has a mean of 100 and a standard deviation of about 15.
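Because each of these standardized scores is defined by its mean and standard deviation, any of them can be computed from a z score. For instance, using the kind of data manipulation statements introduced in the preceding chapter, statements along the following lines (a rough sketch, assuming a numeric variable Z that already contains z scores) would create T scores and IQ-style scores:

* Assumes a numeric variable Z that already contains z scores. ;
T_SCORE  = 50 + (10 * Z);
IQ_SCORE = 100 + (15 * Z);

In each statement, the z score is multiplied by the desired standard deviation and added to the desired mean.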
The Advantages of Working with z Scores

Overview. z scores can be easier to work with than raw scores for a variety of purposes. For example, z scores enable you to immediately determine a particular score's relative position in a sample, and they can make it easier to compare scores on variables that initially had different means and/or standard deviations. These advantages are discussed below.

Immediately determining the relative position of a particular score in a sample. When you look at a z score, you can immediately determine whether the corresponding raw score is above or below the mean, and how far the raw score is from the mean. You do this by viewing the sign and the absolute magnitude of the z score (the absolute magnitude of a number is simply the size of the number, regardless of sign).

Again, assume that you have administered the attitude scale discussed earlier, and have converted the raw scores of the subjects into z scores. The preceding section stated that a sample of z scores has a mean of zero and a standard deviation of 1. You can take advantage of this fact to review a subject's z score and immediately understand that subject's position within the sample. For example, if a subject has a z score of zero, you know that she scored at the mean on the attitude scale. If her z score has a positive sign (e.g., +1.0, +2.0), then you know that she scored above the mean. If her z score has a negative sign (e.g., –1.0, –2.0), then you know that she scored below the mean.

The absolute magnitude of the subject's z score tells you how far away from the mean the corresponding raw score was located, measured in standard deviations. If a z score has a positive sign, it tells you how many standard deviations the corresponding raw score was above the mean. For example, if a subject has a z score of +1.0, his raw score was 1 standard deviation above the mean. If the subject has a z score of +2.0, his raw score was 2 standard deviations above the mean. The same holds true for z scores with a negative sign, except that these z scores tell you how many standard deviations the corresponding raw score was below the mean. If a given subject has a z score of –1.0, his raw score was 1 standard deviation below the mean; if the z score was –2.0, the corresponding raw score was 2 standard deviations below the mean.

So far the discussion has focused on z scores that are whole numbers (such as "1" or "2"), but it is important to remember that z scores are typically carried out to one or more places to the right of the decimal point. It is common, for example, to see z scores with values such as 1.4 or –2.31.

Comparing scores for variables with different means and standard deviations. When you are working with a group of variables that have different means and standard deviations, it is difficult to compare scores across variables. For example, a raw score of 50 may be above the mean on Variable 1, but below the mean on Variable 2.
If you need to make comparisons across variables, it is often a good idea to first convert all raw scores on all variables into z scores. Regardless of the variable being represented, all z scores have the same interpretation (e.g., a z score of 1.0 always means that the corresponding raw score was 1 standard deviation above the mean).

To illustrate this concept in a more concrete way, imagine that you are an admissions officer at a university. Each year 5,000 people apply for admission to your school. Half of them come from states in which college applicants take the Learning Aptitude Test (LAT), an aptitude test that contains three subtests (the Verbal subtest, the Math subtest, and the Analytical subtest). Suppose that the LAT Verbal subtest has a range from 200 to 800, a mean of 500, and a standard deviation of 100. The other half come from states in which college applicants take the Higher Education Aptitude Test (HEAT). This test also consists of three subtests (for verbal, math, and analytical skills), but each subtest has a range, mean, and standard deviation that differ from those of the LAT. For example, suppose that the HEAT Verbal subtest has a range from 1 to 30, a mean of 15, and a standard deviation of 5.

Suppose that you are reviewing the files of two people who have applied for admission to your university. Applicant A comes from a state that uses the LAT, and her raw score on the LAT Verbal subtest is 600. Applicant B comes from a state that uses the HEAT, and his raw score on the HEAT Verbal subtest is 19. Relatively speaking, which of these two had the higher score? It is very difficult to make this comparison as long as the variables are in raw-score form. However, the comparison becomes much easier once the two variables have been converted into z scores.

The formula for computing a z score is

        X – X̄
   z = ––––––––
         S_X

where

   z   = the subject's z score
   X   = the subject's raw score
   X̄   = the sample mean
   S_X = the sample standard deviation (remember that N is used in the
         denominator for this standard deviation, not N – 1).
First, we will convert Applicant A's raw score into a z score (remember that Applicant A had a raw score of 600 on the LAT Verbal subtest). Below, we substitute the appropriate values into the formula:

        X – X̄     600 – 500     100
   z = ––––––– = ––––––––––– = ––––– = 1.0
         S_X         100        100

So Applicant A had a z score of 1.0 (she stood 1 standard deviation above the mean). Next, we convert Applicant B's raw score into a z score (remember that he had a raw score of 19 on the HEAT Verbal subtest). Below, we substitute the appropriate values into the formula (notice that a different mean and standard deviation are used for Applicant B's formula, compared to Applicant A's formula):

        X – X̄     19 – 15      4
   z = ––––––– = ––––––––– = ––––– = 0.8
         S_X         5          5
So Applicant B had a z score of 0.8 (he stood 8/10ths of a standard deviation above the mean). Earlier, we asked which of the two applicants had the higher score. This question was difficult to answer when the variables were in raw-score form, but is easier to answer now that the variables are in z-score form. The z score for Applicant A (1.0) was slightly higher than the z score for Applicant B (0.8). In terms of entrance exam scores, Applicant A may be a somewhat stronger candidate. This illustrates one of the reasons that z scores are so important in the social sciences and education: very often, you will work with groups of variables that have different means and standard deviations, making it difficult to compare scores from one variable to another. By converting all scores to z scores, you create a common metric that makes it easier to make these comparisons.
Example 9.1: Comparing Mid-Term Test Scores for Two Courses

Data Set to Be Analyzed

Suppose that you obtain test scores for 12 college students. All of the students are enrolled in a French course (French 101) and a geology course (Geology 101). All students recently took a mid-term test in each of these two courses. With the test given in French 101, scores could range from 0 to 50. The test given in Geology 101 was longer: scores on that test could range from 0 to 200. Table 9.1 presents the scores that the 12 students obtained on these two tests.

Table 9.1
Mid-Term Test Scores for Students
_______________________________________
Subject        French 101   Geology 101
_______________________________________
01. Fred           50            90
02. Susan          46           165
03. Marsha         45           170
04. Charles        41           110
05. Paul           39           130
06. Cindy          38           150
07. Jack           37           140
08. Cathy          35           120
09. George         34           155
10. John           31           180
11. Marie          29           135
12. Emmett         25           200
_______________________________________
Table 9.1 uses the same conventions that have been used with other tables in this guide: the horizontal rows represent individual subjects, and the vertical columns represent different variables (scores on mid-term tests, in this case). Where the row for a particular student intersects with the column for a particular course, the table provides the student's score on the mid-term test for that course (e.g., the table shows that Fred received a score of 50 on
the French 101 test and a score of 90 on the Geology 101 test; Susan received a score of 46 on the French 101 test and a score of 165 on the Geology 101 test; and so on).

The DATA Step

As you know, the first section of most SAS programs is the DATA step: the section in which the raw data are read to create a SAS data set. The data set of the program used in this example will include all four variables represented in Table 9.1:
• subject numbers
• subject names
• scores on the French 101 test
• scores on the Geology 101 test.

Technically, it is not necessary to create a SAS variable for subject numbers and subject names in order to compute z scores and perform the other tasks illustrated in this chapter. However, including the subject number and subject name variables will make the output somewhat easier to read. Below is the DATA step from the SAS program that will analyze the data from Table 9.1:

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            NAME $
 5            FREN
 6            GEOL;
 7   DATALINES;
 8   01 Fred     50  90
 9   02 Susan    46 165
10   03 Marsha   45 170
11   04 Charles  41 110
12   05 Paul     39 130
13   06 Cindy    38 150
14   07 Jack     37 140
15   08 Cathy    35 120
16   09 George   34 155
17   10 John     31 180
18   11 Marie    29 135
19   12 Emmett   25 200
20   ;
Some notes concerning the preceding DATA step:
• Line 2 assigns this data set the SAS data set name "D1."
• Line 3 assigns the SAS variable name "SUB_NUM" to represent subject numbers.
• Line 4 assigns the SAS variable name "NAME" to represent the students' names. Notice that the variable name is followed by a "$" to tell SAS that this will be a character variable.
• Line 5 assigns the SAS variable name "FREN" to represent the students' scores on the French 101 test.
• Line 6 assigns the SAS variable name "GEOL" to represent the students' scores on the Geology 101 test.
• The actual data appear on lines 8–19. You can see that these data were taken directly from Table 9.1.
Converting a Single Raw-Score Variable into a z-Score Variable

Overview

This section shows you how to convert student scores on the French 101 mid-term test into z scores. (Scores on the Geology 101 test can be converted later.) The approach recommended here involves two steps:
• Step 1: Computing the mean and sample standard deviation for the raw-score variable.
• Step 2: Using data manipulation statements to create the z-score variable.
This approach requires you to submit your SAS program twice. At Step 1, you will use PROC MEANS to determine the mean and sample standard deviation for raw scores on the French 101 test variable, FREN. At Step 2, you will add a data manipulation statement to your SAS program. This data manipulation statement will create a new variable called FREN_Z, which will be the z-score version of student scores on the French 101 test. The data manipulation statement that creates FREN_Z is the formula for the z score, similar to the one presented earlier in this chapter. After you have created the new z-score variable, you will use PROC PRINT to print the values of the variable, and you will use PROC MEANS to obtain descriptive statistics.
Step 1: Computing the Mean and Sample Standard Deviation

The syntax. You will use PROC MEANS to calculate the mean and sample standard deviation for the raw-score variable, FREN. The syntax is presented below:

PROC MEANS DATA=data-set-name VARDEF=N N MEAN STD MIN MAX;
   VAR raw-score-variable;
   TITLE1 'your-name';
RUN;
In the preceding syntax, one of the options specified is the VARDEF option (see the first line). The VARDEF option specifies the divisor to be used when calculating the standard deviation. If you request VARDEF=N, then PROC MEANS computes the sample standard deviation (the formula for the sample standard deviation uses N as the divisor). In contrast, if you request VARDEF=DF, then PROC MEANS computes the estimated population standard deviation (the formula for the estimated population standard deviation uses N – 1 as the divisor).

This distinction is important because ultimately (at Step 2) you will want to insert the correct type of standard deviation into the formula that creates your z scores. When computing z scores, it is very important that you insert the sample standard deviation into the computational formula for z scores; you generally should not insert the estimated population standard deviation. This means that, when writing the PROC MEANS statement at Step 1, you should specify VARDEF=N. If you leave the VARDEF option out, then PROC MEANS will compute the estimated population standard deviation by default, and you do not want this.

A number of other options are also included in the preceding PROC MEANS statement:
• N requests that the sample size be printed.
• MEAN requests that the sample mean be printed.
• STD requests that the standard deviation be printed.
• MIN requests that the smallest observed value be printed.
• MAX requests that the largest observed value be printed.
The remaining sections of the preceding syntax are self-explanatory.
The actual SAS statements. Below are the actual statements that request that the MEANS procedure be performed on the FREN variable from the current data set (note that "FREN" is specified in the VAR statement):

PROC MEANS DATA=D1 VARDEF=N N MEAN STD MIN MAX;
   VAR FREN;
   TITLE1 'JANE DOE';
RUN;
The SAS output. Output 9.1 presents the results generated by the preceding program.

                              JANE DOE

                        The MEANS Procedure

                    Analysis Variable : FREN

 N            Mean         Std Dev         Minimum         Maximum
--------------------------------------------------------------------
12      37.5000000       7.0059499      25.0000000      50.0000000
--------------------------------------------------------------------

Output 9.1. Results of PROC MEANS performed on the raw-score variable FREN.
Some notes concerning this output:

•  FREN is the name of the variable being analyzed.
•  “N” is the number of subjects providing usable data. In this analysis, 12 students provided scores on FREN, as expected.
•  “Mean” is the sample mean that (at Step 2) will be inserted in your formula for computing z scores. Here, you can see that the sample mean for FREN is 37.50.
•  “Std Dev” is the sample standard deviation that (at Step 2) will also be inserted in your formula for computing z scores. Here, you can see that the sample standard deviation for FREN rounds to 7.01.
•  It is usually a good idea to check the “Minimum” and “Maximum” values to verify that there were no obvious typographical errors in the data. Here, the minimum and maximum values were 25 and 50 respectively, which seems reasonable.

Step 2: Creating the z-Score Variable

Overview. The preceding MEANS procedure provided the sample mean and sample standard deviation for the raw-score variable, FREN. You will now include these values in a SAS data manipulation statement that will use the raw scores in FREN to create the z-score variable, FREN_Z.
The formula for z scores. Remember that the formula for computing z scores is:

   z = (X – X̄) / SX

where

   z  = the subject's z score
   X  = the subject's raw score
   X̄  = the sample mean
   SX = the sample standard deviation

The SAS data manipulation statement. It is now necessary to convert this generic formula into a SAS data manipulation statement that does the same thing (i.e., that creates z scores by transforming raw scores). Here is the syntax for the SAS data manipulation statement that will do this:

   z-variable = (raw-variable - mean) / standard-deviation;

You can see that the generic formula for creating z scores and the SAS data manipulation statement presented above do the same thing:

•  They begin with a subject's score on the raw variable.
•  They subtract the sample mean from that raw-variable score.
•  The resulting difference is then divided by the sample standard deviation.
•  The result is the subject's z score.
Below is the SAS data manipulation statement that creates the z-score variable, FREN_Z. Notice how the variable names, as well as the mean and standard deviation from Step 1, have been inserted in the appropriate locations in this statement:

   FREN_Z = (FREN - 37.50) / 7.01;

This statement tells SAS to create a new variable called FREN_Z by doing the following:

•  Begin with a given subject's score on FREN.
•  Subtract 37.50 from FREN (37.50 is the sample mean from Step 1).
•  Divide the resulting difference by 7.01 (7.01 is the sample standard deviation from Step 1).
•  The result is the subject's score on FREN_Z (a z score).
Including the data manipulation statement as part of a new SAS DATA step. In Chapter 8, “Creating and Modifying Variables and Data Sets,” you learned that you can create a new variable only within a DATA step. In the present example, you are creating a new variable (FREN_Z) to contain these z scores, which means that the data manipulation statement that creates FREN_Z must appear within a new DATA step.

In your original SAS program from Step 1, you assigned the name “D1” to your initial SAS data set. After the DATA step, you added the PROC MEANS statements that computed the sample mean and standard deviation. To complete the tasks required for Step 2, you can now append new SAS statements to that existing SAS program. Here is one way to do this:

•  Begin a new DATA step after the PROC MEANS statements that were used in Step 1.
•  Begin this new DATA step by creating a new data set named “D2.” Initially, D2 will be created as a duplicate of the existing data set, D1.
•  After creating this new data set, append the data manipulation statement that creates FREN_Z. This ensures that the new z-score variable, FREN_Z, will be included in D2.
Following is a section of the SAS program that accomplishes this:

     [First part of the DATA step appears here]
 1   10   John      31   180
 2   11   Marie     29   135
 3   12   Emmett    25   200
 4   ;
 5
 6   PROC MEANS DATA=D1 VARDEF=N
 7        N MEAN STD MIN MAX;
 8      VAR FREN;
 9   TITLE1 'JANE DOE';
10   RUN;
11   DATA D2;
12      SET D1;
13      FREN_Z = (FREN - 37.50) / 7.01;
Some notes concerning the preceding lines:

•  Lines 1–3 present the last three lines from the data set (to conserve space, only the last few lines are presented here).
•  Lines 6–9 present the PROC MEANS statements that were used in Step 1. Strictly speaking, it was not necessary to include these statements in the program that you submit at Step 2; if you liked, you could simply have deleted these lines. But it is a good idea to include them so that, within a single program, you will have all of the SAS statements that are used to compute the z scores.
•  Lines 11–13 create the new data set (named D2), initially creating it as a duplicate of D1. Line 13 presents the data manipulation statement that creates the z scores and includes them in a variable named FREN_Z.
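Incidentally, retyping the mean and standard deviation by hand is not the only option. As an alternative sketch (the output data set name STATS and the variable names FREN_M and FREN_S are hypothetical, not from the chapter's program), PROC MEANS can write its statistics to a data set with an OUTPUT statement, and a later DATA step can pick them up, which avoids copying numbers from the output and the rounding that goes with it:

PROC MEANS DATA=D1 VARDEF=N NOPRINT;
   VAR FREN;
   OUTPUT OUT=STATS MEAN=FREN_M STD=FREN_S;   * write the statistics to data set STATS;
RUN;

DATA D2;
   IF _N_ = 1 THEN SET STATS;   * attach FREN_M and FREN_S to every observation;
   SET D1;
   FREN_Z = (FREN - FREN_M) / FREN_S;
RUN;

The chapter's retype-by-hand approach is used in the text because it keeps the z-score formula explicit; this variation trades that transparency for precision and convenience.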
Using PROC PRINT to print out the new z-score variable. After creating the new z-score variable, FREN_Z, you should next use PROC PRINT to print out each subject's value on this variable. Among other things, this will enable you to check your work and verify that the new variable was created correctly. Chapter 8, “Creating and Modifying Variables and Data Sets,” provided the syntax for the PROC PRINT statement. Below are the statements that use PROC PRINT to create a printout listing each subject's value for NAME, FREN, and FREN_Z:

PROC PRINT DATA=D2;
   VAR NAME FREN FREN_Z;
TITLE 'JANE DOE';
RUN;

Notice that, in the PROC PRINT statement, the DATA option specifies that the procedure should be performed on the data set D2. This is important because the variable FREN_Z appears only in D2; it does not appear in D1.

Using PROC MEANS to request descriptive statistics for the new variable. Finally, you use PROC MEANS to obtain simple descriptive statistics (e.g., means, standard deviations) for any new z-score variables that you create. This is useful because you already know that a sample of z scores is supposed to have a mean of zero and a standard deviation of 1. In Step 2, you will review the results of PROC MEANS to verify that your new z-score variable, FREN_Z, has a mean of approximately zero and a standard deviation of approximately 1. Here are the statements that request descriptive statistics for the current example:

PROC MEANS DATA=D2 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN_Z;
TITLE1 'JANE DOE';
RUN;
Two points concerning these statements:

•  The DATA option in the PROC MEANS statement specifies that the analysis should be performed on the new data set, D2.
•  The VARDEF option specifies VARDEF=N, which ensures that the MEANS procedure will compute the sample standard deviation rather than the estimated population standard deviation. This is appropriate when you want to verify that the standard deviation of the z-score variable is close to 1 (a sketch following this list shows what the default setting would report instead).
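To see why VARDEF=N matters for this check, consider what the default setting would report. If FREN_Z was built with the divisor-N (sample) standard deviation but you verify it under the default VARDEF=DF, the reported standard deviation is inflated by a factor of the square root of N/(N – 1), about 1.04 when N = 12, so a correctly built z-score variable could look wrong. A sketch of the mistaken call, shown only for contrast:

PROC MEANS DATA=D2 N MEAN STD;   * VARDEF omitted: defaults to VARDEF=DF;
   VAR FREN_Z;
RUN;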
Putting it all together. So far, this chapter has presented the SAS statements needed for Step 2. So that you will have a better idea of how all of these statements fit together, below are (a) the last part of the initial DATA step (from Step 1) and (b) the SAS statements needed to perform the various tasks of Step 2:

[First part of the DATA step appears here]
10   John      31   180
11   Marie     29   135
12   Emmett    25   200
;
PROC MEANS DATA=D1 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN;
TITLE1 'JANE DOE';
RUN;

DATA D2;
   SET D1;
   FREN_Z = (FREN - 37.50) / 7.01;

PROC PRINT DATA=D2;
   VAR NAME FREN FREN_Z;
TITLE 'JANE DOE';
RUN;

PROC MEANS DATA=D2 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN_Z;
TITLE1 'JANE DOE';
RUN;
SAS Output generated by PROC PRINT. Output 9.2 presents the results generated by the PRINT procedure in the preceding program:

                 JANE DOE

Obs    NAME       FREN      FREN_Z
  1    Fred        50      1.78317
  2    Susan       46      1.21255
  3    Marsha      45      1.06990
  4    Charles     41      0.49929
  5    Paul        39      0.21398
  6    Cindy       38      0.07133
  7    Jack        37     -0.07133
  8    Cathy       35     -0.35663
  9    George      34     -0.49929
 10    John        31     -0.92725
 11    Marie       29     -1.21255
 12    Emmett      25     -1.78317

Output 9.2. Results of PROC PRINT performed on NAME, FREN, and FREN_Z.
Some notes concerning the preceding output:

•  The Obs column is generated by SAS whenever it performs PROC PRINT. It merely assigns an observation number to each subject.
•  The NAME column provides each student's first name.
•  The FREN column provides each student's raw score for the French 101 mid-term test, as it appears in Table 9.1.
•  The FREN_Z column provides each student's score on the new z-score variable. These z scores correspond to the raw scores for the French 101 mid-term test as they appear in Table 9.1.

Were the z scores in the FREN_Z column created correctly? You can find out by computing z scores manually for a few subjects and verifying that your results match the results generated by SAS. For example, the first subject (Fred) had a raw score on FREN of 50. You can compute his z score by inserting this raw score in the z-score formula:

   z = (X – X̄) / SX = (50 – 37.50) / 7.01 = 12.50 / 7.01 = 1.783

So Fred's z score was 1.783, which rounds to 1.78. Output 9.2 shows that this is also the value that SAS obtained. So far, these results are consistent with the conclusion that the z scores have been computed correctly.

Reviewing the mean and standard deviation for the new z-score variable. Another way to verify that the z scores were created correctly is to perform PROC MEANS on the z-score variable, then verify that the mean for the new variable is approximately zero and that the standard deviation is approximately 1. The preceding program included these PROC MEANS statements, and the results are presented in Output 9.3.

                          JANE DOE
                     The MEANS Procedure

                Analysis Variable : FREN_Z

 N            Mean       Std Dev       Minimum       Maximum
--------------------------------------------------------------
12    -5.55112E-17     0.9994222    -1.7831669     1.7831669
--------------------------------------------------------------

Output 9.3. Results of PROC MEANS performed on FREN_Z.
The variable name FREN_Z tells you that this analysis was performed on the new z-score variable. The “Mean” column contains the sample mean for this z-score variable, –5.55112E–17. You might be concerned that something is wrong because this number does not appear to be approximately zero, as expected. But the number is, in fact, very close to zero; it is simply presented in scientific notation. The value printed in Output 9.3 is –5.55112, and E–17 tells you that the decimal point must be moved 17 places to the left. Thus, the actual mean is –0.0000000000000000555112. Obviously, this mean is very close to zero, which should reassure you that the z-score variable was probably created correctly.

Why was the mean for FREN_Z not exactly zero? The answer is that we did not use a great deal of precision in creating FREN_Z. Here again is the data manipulation statement that created it:

   FREN_Z = (FREN - 37.50) / 7.01;

Notice that we went to only two places beyond the decimal point when we typed the sample mean (37.50) and standard deviation (7.01). If we had carried these values out to a greater number of decimal places, the z-score variable would have been created with greater precision, and the mean score on FREN_Z would have been even closer to zero (a sketch of such a higher-precision version appears below).

The standard deviation for FREN_Z appears below the heading “Std Dev” in Output 9.3. You can see that the standard deviation for this variable is 0.9994222. Again, this is very close to the value of 1 that is expected with a sample of z scores (the fact that it is not exactly 1 is, again, due to the limited precision used in our data manipulation statement). The results suggest that the z-score variable was probably created in the correct manner.
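To illustrate the precision point, the following sketch (the data set name D3 and the variable name FREN_Z2 are hypothetical; this step is not required by the chapter) rebuilds the z scores with the full-precision values from Output 9.1 and reruns the check:

DATA D3;
   SET D1;
   * Full-precision mean and sample standard deviation from Output 9.1;
   FREN_Z2 = (FREN - 37.5) / 7.0059499;
RUN;

PROC MEANS DATA=D3 VARDEF=N N MEAN STD;
   VAR FREN_Z2;
RUN;

The mean and standard deviation reported for FREN_Z2 should sit even closer to 0 and 1 than the values shown for FREN_Z in Output 9.3.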
Examples of Questions That Can Be Answered with the New z-Score Variable

Reviewing the sign and absolute magnitude of a z score. The introduction section of this chapter discussed a number of advantages of working with z scores. One of these advantages is that, by simply reviewing a z score, you can immediately determine the relative position of that score within a sample. Specifically,

•  The sign of a z score tells you whether the raw score appears above or below the mean (a positive sign means above; a negative sign means below).
•  The absolute magnitude of the z score tells you how far from the mean the corresponding raw score is located, in terms of standard deviations (e.g., a z score of 1.2 tells you that the raw score was 1.2 standard deviations from the mean).
Output 9.4 reproduces the PROC PRINT results that were previously presented as Output 9.2. This output provides each subject's z score for the French 101 test, as created by the preceding program. It is reproduced here so that you can see how its results can be used to answer questions about the location of specific scores within the sample. This section presents several such questions, along with their answers. Be sure that you understand the reasoning that led to these answers, as you might be asked to answer similar questions as part of an exercise when you complete this chapter.

                 JANE DOE

Obs    NAME       FREN      FREN_Z
  1    Fred        50      1.78317
  2    Susan       46      1.21255
  3    Marsha      45      1.06990
  4    Charles     41      0.49929
  5    Paul        39      0.21398
  6    Cindy       38      0.07133
  7    Jack        37     -0.07133
  8    Cathy       35     -0.35663
  9    George      34     -0.49929
 10    John        31     -0.92725
 11    Marie       29     -1.21255
 12    Emmett      25     -1.78317

Output 9.4. Results of PROC PRINT performed on NAME, FREN, and FREN_Z (to illustrate the questions that can be answered with z scores).
Questions regarding the new z-score variable, FREN_Z, that appears in Output 9.4:

1. Question: Fred's raw score on the French 101 test was 50 (Fred was Observation #1). What was the relative position of this score within the sample? Explain your answer.

   Answer: Fred's score was 1.78 standard deviations above the mean. I know that his score was above the mean because his z score was a positive value. I know that his score was 1.78 standard deviations from the mean because the absolute value of the z score was 1.78.

2. Question: Cindy's raw score on the French 101 test was 38 (Cindy was Observation #6). What was the relative position of this score within the sample? Explain your answer.

   Answer: Cindy's score was 0.07 standard deviations above the mean. I know that her score was above the mean because her z score was a positive value. I know that her score was 0.07 standard deviations from the mean because the absolute value of the z score was 0.07.

3. Question: Marie's raw score on the French 101 test was 29 (Marie was Observation #11). What was the relative position of this score within the sample? Explain your answer.
Answer: Marie’s score was 1.21 standard deviations below the mean. I know that her score was below the mean because her z score was a negative value. I know that her score was 1.21 standard deviations from the mean because the absolute value of the z score was 1.21.
Converting Two Raw-Score Variables into z-Score Variables

Overview

In many situations, it is necessary to convert one or more raw-score variables into z-score variables. In these situations, you should follow the same two-step sequence described above: (a) compute the mean and standard deviation for each raw-score variable, and then (b) write data manipulation statements that create the new z-score variables. You will write a separate data manipulation statement for each new z-score variable to be created. Needless to say, when you write a data manipulation statement for a given variable, it is important to insert the correct mean and standard deviation into that statement (i.e., the mean and standard deviation for the corresponding raw-score variable).

This section shows you how to convert two raw-score variables into two new z-score variables. It builds on the preceding section by analyzing the same data set (the data set with test scores for French 101 and Geology 101). Because almost all of the concepts discussed here have already been discussed in earlier sections of the chapter, the present material is covered in less detail.

Review: Data Set to Be Analyzed

An earlier section of this chapter, “Example 9.1: Scores on Mid-Term Tests in Two Courses,” described the four variables included in your data set:

•  The first variable was given the SAS variable name SUB_NUM. This was a numeric variable that included each student's subject number.
•  The second variable was given the SAS variable name NAME. This was a character variable that included each subject's first name.
•  The third variable was given the SAS variable name FREN. This was a numeric variable that included each subject's raw score on the mid-term test given in the French 101 course.
•  The fourth variable was given the SAS variable name GEOL. This was a numeric variable that included each subject's raw score on the mid-term test given in the Geology 101 course.
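For reference, the following is a minimal reconstruction of a DATA step that reads these four variables. It is consistent with the data lines shown in the program excerpts in this chapter and with Output 9.7 later in the chapter, but the free-format INPUT layout is an assumption; the actual DATA step appears in the earlier section.

DATA D1;
   * SUB_NUM = subject number, NAME = first name (character),
     FREN = French 101 raw score, GEOL = Geology 101 raw score;
   INPUT SUB_NUM NAME $ FREN GEOL;
DATALINES;
 1  Fred     50   90
 2  Susan    46  165
 3  Marsha   45  170
 4  Charles  41  110
 5  Paul     39  130
 6  Cindy    38  150
 7  Jack     37  140
 8  Cathy    35  120
 9  George   34  155
10  John     31  180
11  Marie    29  135
12  Emmett   25  200
;
RUN;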
The same earlier section also provided each subject's values on these variables, then presented the actual SAS DATA step that reads the data into a SAS data set.

Step 1: Computing the Means and Sample Standard Deviations

Your first task is to compute the sample mean and sample standard deviation for the two test-score variables, FREN and GEOL. This can be done by adding the following statements to a SAS program that already contains the SAS DATA step:

PROC MEANS DATA=D1 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN GEOL;
TITLE1 'JANE DOE';
RUN;
The preceding statements are identical to the PROC MEANS statements presented earlier, except that the VAR statement now lists both FREN and GEOL. This will cause SAS to compute the mean and sample standard deviation (along with some other descriptive statistics) for both of these variables. Output 9.5 presents the results that were generated by the preceding statements.

                          JANE DOE
                     The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
FREN       12    37.5000000     7.0059499    25.0000000    50.0000000
GEOL       12   145.4166667    29.7530344    90.0000000   200.0000000
----------------------------------------------------------------------

Output 9.5. Results of PROC MEANS performed on the raw-score variables FREN and GEOL.
The Mean column presents the sample mean for the two test-score variables. The Std Dev column presents the sample standard deviations. You can see that, for FREN, the mean is 37.50 and the sample standard deviation is 7.01 (obviously, these figures had to be identical to the figures presented in Output 9.1 because the same variable was analyzed). For the second variable, GEOL, the mean is 145.42, and the sample standard deviation is 29.75.
With these means and standard deviations successfully computed, you can now move on to Step 2, where they will be inserted into the data manipulation statements that create the new z-score variables.

Step 2: Creating the z-Score Variables

The SAS data manipulation statements. Earlier in this chapter, you saw the following syntax for the data manipulation statement that creates a z-score variable:

   z-variable = (raw-variable - mean) / standard-deviation;

In this step you will create a z-score variable for scores on the French 101 test and give it the SAS variable name FREN_Z. In doing this, the mean and sample standard deviation for FREN from Output 9.5 are inserted in the formula (because you are working with the same mean and standard deviation, this data manipulation statement is identical to the one for FREN_Z that was presented earlier):

   FREN_Z = (FREN - 37.50) / 7.01;

Next, you will create a z-score variable for scores on the Geology 101 test and give it the SAS variable name GEOL_Z. In doing this, the mean and sample standard deviation for GEOL from Output 9.5 are inserted in the formula:

   GEOL_Z = (GEOL - 145.42) / 29.75;

Including the data manipulation statements as part of a new SAS DATA step. Remember that new SAS variables can be created only within a DATA step. Therefore, within your SAS program, you will begin a new DATA step prior to writing the two preceding statements that create FREN_Z and GEOL_Z. This is done in the following excerpt from the SAS program:

     [First part of the DATA step appears here]
 1   10   John      31   180
 2   11   Marie     29   135
 3   12   Emmett    25   200
 4   ;
 5
 6   PROC MEANS DATA=D1 VARDEF=N
 7        N MEAN STD MIN MAX;
 8      VAR FREN GEOL;
 9   TITLE1 'JANE DOE';
10   RUN;
11   DATA D2;
12      SET D1;
13      FREN_Z = (FREN - 37.50) / 7.01;
14      GEOL_Z = (GEOL - 145.42) / 29.75;
Some notes about the preceding excerpt:

•  Lines 1–3 present the last three data lines from the data set. To save space, only the last few lines from the data set are reproduced.
•  Lines 6–9 present the PROC MEANS statements that cause SAS to compute the mean and standard deviation for FREN and GEOL. These statements were discussed in Step 1.
•  Lines 11–12 begin a new DATA step by creating a new data set named D2. Initially, D2 is created as a duplicate of D1.
•  Lines 13–14 present the data manipulation statements that create the new z-score variables, FREN_Z and GEOL_Z.
Using PROC PRINT to print out the new z-score variables. After you create the new z-score variables, you can use PROC PRINT to print out each subject's value on these variables. The following statements accomplish this:

PROC PRINT DATA=D2;
   VAR NAME FREN GEOL FREN_Z GEOL_Z;
TITLE 'JANE DOE';
RUN;
The preceding VAR statement requests that this printout include each subject's values for the variables NAME, FREN, GEOL, FREN_Z, and GEOL_Z.

Using PROC MEANS to request descriptive statistics for the new variables. Remember that it is generally a good idea to use PROC MEANS to compute simple descriptive statistics for the new z-score variables that you create. This will enable you to verify that the mean is approximately zero, and the standard deviation approximately 1, for each new variable. This is accomplished by the following statements:

PROC MEANS DATA=D2 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN_Z GEOL_Z;
TITLE1 'JANE DOE';
RUN;
Putting it all together. The following shows the last part of the initial DATA step, along with the SAS statements needed to perform the various tasks of Step 2:

[First part of the DATA step appears here]
10   John      31   180
11   Marie     29   135
12   Emmett    25   200
;
PROC MEANS DATA=D1 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN GEOL;
TITLE1 'JANE DOE';
RUN;

DATA D2;
   SET D1;
   FREN_Z = (FREN - 37.50) / 7.01;
   GEOL_Z = (GEOL - 145.42) / 29.75;

PROC PRINT DATA=D2;
   VAR NAME FREN GEOL FREN_Z GEOL_Z;
TITLE 'JANE DOE';
RUN;

PROC MEANS DATA=D2 VARDEF=N
     N MEAN STD MIN MAX;
   VAR FREN_Z GEOL_Z;
TITLE1 'JANE DOE';
RUN;
SAS output generated by PROC MEANS. In the output generated by the preceding program, review the results of the MEANS procedure performed on FREN_Z and GEOL_Z first. This will enable you to verify that there were no obvious errors in creating the new z-score variables. After this is done, you can view the results from the PRINT procedure. Output 9.6 presents the results of PROC MEANS performed on FREN_Z and GEOL_Z.

                           JANE DOE
                      The MEANS Procedure

Variable    N            Mean       Std Dev       Minimum       Maximum
------------------------------------------------------------------------
FREN_Z     12    -5.55112E-17     0.9994222    -1.7831669     1.7831669
GEOL_Z     12    -0.000112045     1.0001020    -1.8628571     1.8346218
------------------------------------------------------------------------

Output 9.6. Results of PROC MEANS performed on FREN_Z and GEOL_Z.
Reviewing the means and standard deviations for the new z-score variables. Output 9.3 (from earlier in this chapter) already provided the mean and standard deviation for FREN_Z. Those results are identical to the results for FREN_Z presented in the first row of Output 9.6, so the descriptive statistics for FREN_Z will not be reviewed again at this point.

The second row of results in Output 9.6 presents descriptive statistics for GEOL_Z. You can see that the mean for this variable is –0.000112045, which is very close to the mean of zero that you would normally expect for a z-score variable. This provides some assurance that GEOL_Z was created correctly. You may have noticed that the mean for GEOL_Z was not presented in scientific notation, as was the case for FREN_Z. This is because the mean for GEOL_Z was not quite as close to zero, and it was therefore not necessary to use scientific notation.

Where the row for GEOL_Z intersects with the column headed “Std Dev,” you can see that the standard deviation for this variable is 1.0001020. This is very close to the standard deviation of 1 that you would normally expect for a z-score variable, and again provides some evidence that GEOL_Z was probably created correctly.

Because the means and standard deviations for the new z-score variables seem to be appropriate, you can now review the individual z scores in the results that were generated by PROC PRINT.

SAS output generated by PROC PRINT. Output 9.7 presents the results generated by the PROC PRINT statements included in the preceding program.

                       JANE DOE

Obs    NAME       FREN    GEOL      FREN_Z      GEOL_Z
  1    Fred        50       90     1.78317    -1.86286
  2    Susan       46      165     1.21255     0.65815
  3    Marsha      45      170     1.06990     0.82622
  4    Charles     41      110     0.49929    -1.19059
  5    Paul        39      130     0.21398    -0.51832
  6    Cindy       38      150     0.07133     0.15395
  7    Jack        37      140    -0.07133    -0.18218
  8    Cathy       35      120    -0.35663    -0.85445
  9    George      34      155    -0.49929     0.32202
 10    John        31      180    -0.92725     1.16235
 11    Marie       29      135    -1.21255    -0.35025
 12    Emmett      25      200    -1.78317     1.83462

Output 9.7. Results of PROC PRINT performed on NAME, FREN, GEOL, FREN_Z, and GEOL_Z.
In Output 9.7, the column headed “FREN” presents subjects’ raw scores for the French 101 test. Similarly, the column headed “GEOL” presents raw scores for the Geology 101 test.
At this point, however, you are more interested in the new z-score variables that appear in the columns headed “FREN_Z” and “GEOL_Z”. These columns provide standardized versions of scores on the French 101 and Geology 101 tests, respectively. Because these test scores have been standardized, you can use them to answer a different set of questions. These questions and their answers are discussed in the following section.

Examples of Questions That Can Be Answered with the New z-Score Variables

The introduction section of this chapter indicated that one of the advantages of working with z scores is that they enable you to compare scores on variables that otherwise would have different means and standard deviations. For example, assume that you are working with the raw-score versions of scores on the French 101 and Geology 101 tests. Scores on the French 101 test could possibly range from 1 to 50, and scores on the Geology 101 test could possibly range from 1 to 200. This resulted in the two tests having very different means and standard deviations.

Comparing scores on variables with different means and standard deviations. Suppose that you wanted to know: “Compared to the other students, did the student named Susan (Observation #2 in Output 9.7) score higher on the French 101 test or on the Geology 101 test?” This question is difficult to answer if you focus on the raw scores (the columns headed FREN and GEOL in Output 9.7). Although her score on the French 101 test was 46 and her score on the Geology 101 test was higher at 165, this certainly does not mean that she did better on the Geology 101 test; her score may have been higher there simply because the test was on a 200-point scale (rather than the 50-point scale used with the French 101 test).

Comparing scores on z-score variables. The data become much more meaningful when you review the z-score versions of the variables, FREN_Z and GEOL_Z. These columns show that Susan's z score on the French 101 test was 1.21, while her z score on the Geology 101 test was lower at 0.66. Because you are working with z scores, you know that both of these variables have the same mean (zero) and the same standard deviation (1). This means that you can directly compare the two scores. Clearly, Susan did better on the French 101 test than on the Geology 101 test (compared to other students).

The following section provides some questions that could be asked about the performance of students on the French 101 and Geology 101 tests. Following each question is the correct answer based on the z scores presented in Output 9.7. Be sure that you understand the reasoning that led to these answers, as you might be asked to answer similar questions on your own as part of an exercise when you complete this chapter.
Questions regarding the new z-score variables, FREN_Z and GEOL_Z, that appear in Output 9.7:

1. Question: Compared to the other students, did the student named Fred (Observation #1 in Output 9.7) score higher on the French 101 test or on the Geology 101 test? Explain your answer.

   Answer: Compared to the other students, Fred scored higher on the French 101 test than on the Geology 101 test. I know this because his z score on the French 101 test was a positive value (1.78), while his z score on the Geology 101 test was a negative value (–1.86).

2. Question: Compared to the other students, did the student named Cindy (Observation #6 in Output 9.7) score higher on the French 101 test or on the Geology 101 test? Explain your answer.

   Answer: Compared to the other students, Cindy scored higher on the Geology 101 test than on the French 101 test. I know this because both z scores were positive, and her z score on the Geology 101 test (0.15) was higher than her score on the French 101 test (0.07).

3. Question: Compared to the other students, did the student named Cathy (Observation #8 in Output 9.7) score higher on the French 101 test or on the Geology 101 test? Explain your answer.

   Answer: Compared to the other students, Cathy scored higher on the French 101 test than on the Geology 101 test. I know this because both z scores were negative, and her z score on the French 101 test (–0.36) was closer to zero than her score on the Geology 101 test (–0.85).
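Comparison questions like these can also be answered for all students at once. A sketch (the data set name D3 and the variable name DIFF are hypothetical additions, not part of the chapter's program): subtracting the two z scores yields a single index of relative performance, positive for students who did relatively better in French and negative for students who did relatively better in Geology.

DATA D3;
   SET D2;
   DIFF = FREN_Z - GEOL_Z;   * positive: relatively stronger in French; negative: relatively stronger in Geology;
RUN;

PROC PRINT DATA=D3;
   VAR NAME FREN_Z GEOL_Z DIFF;
TITLE1 'JANE DOE';
RUN;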
Standardizing Variables with PROC STANDARD

Isn't There an Easier Way to Do This?

This chapter has presented a two-step approach that can be used to standardize variables, converting raw-score variables into z-score variables. It is worth mentioning, however, that when you work with SAS, there is often more than one way to accomplish a data management or statistical analysis task. This applies to the creation of z scores as well.

The SAS System includes a procedure named STANDARD that can be used to standardize variables in a SAS data set. This procedure enables you to begin with a raw-score variable and standardize it so that it has a specified mean and standard deviation. If you specify that the new variable should have a mean of zero and a standard deviation of 1, then you have created a z-score variable.
You can use the new standardized variable in subsequent analyses.

Using PROC STANDARD has a number of advantages over the approach to standardization taught in this chapter. One of these advantages is that it enables you to complete the standardization process in one step rather than two (a sketch appears at the end of this section).

Why This Guide Used a Two-Step Approach

If PROC STANDARD has this advantage, then why did the current chapter teach the somewhat more laborious two-step procedure? This was done because this guide will normally be used in a basic statistics course, and the present approach is somewhat more educational: you begin with the generic formula for creating z scores and translate that generic formula into a SAS data manipulation statement that actually creates the z scores. This approach should reinforce your understanding of what a z score is and exactly how it is obtained.

For more detailed information on the use of the STANDARD procedure for computing z scores and other types of standardized variables, see the SAS Procedures Guide (1999c).
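For orientation, here is a minimal sketch of the one-step alternative for the current data set. The arrangement shown is one reasonable approach, not the guide's own program: because PROC STANDARD overwrites the variables listed in its VAR statement, the raw scores are first copied to new names so that both versions survive. VARDEF=N is specified so that the divisor matches this chapter's sample standard deviation.

DATA D2;
   SET D1;
   FREN_Z = FREN;   * copies that will be converted to z scores;
   GEOL_Z = GEOL;
RUN;

PROC STANDARD DATA=D2 OUT=D2 MEAN=0 STD=1 VARDEF=N;
   VAR FREN_Z GEOL_Z;   * standardize only the copies;
RUN;

Apart from the rounding introduced by the hand-typed mean and standard deviation, the resulting FREN_Z and GEOL_Z values should match those produced by the two-step approach.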
Conclusion

Up until this point, this guide has largely focused on basic concepts in statistics and the SAS System. You have learned the basics of how to use SAS, and have learned about SAS procedures that perform elementary types of data analysis. The next chapter will take you to a new level, however, as it presents the first inferential statistic to be covered in this text. In Chapter 10, you will learn how to use the SAS System to compute Pearson correlation coefficients. The Pearson correlation coefficient is a measure of association that is used to investigate the relationship between two numeric variables. In Chapter 10, “Bivariate Correlation,” you will learn the assumptions that underlie this statistic, will see examples in which PROC CORR is used to compute Pearson correlations, and will learn how to prepare analysis reports that summarize the results obtained from correlational research.
Bivariate Correlation

Introduction
   Overview
Situations Appropriate for the Pearson Correlation Coefficient
   Overview
   Nature of the Predictor and Criterion Variables
   The Type-of-Variable Figure
   Example of a Study Providing Data Appropriate for This Procedure
   Summary of Assumptions for the Pearson Correlation Coefficient
Interpreting the Sign and Size of a Correlation Coefficient
   Overview
   Interpreting the Sign of a Correlation Coefficient
   Interpreting the Size of a Correlation Coefficient
   The Coefficient of Determination
Interpreting the Statistical Significance of a Correlation Coefficient
   Overview
   The Null Hypothesis for the Test of Significance
   The Alternative Hypothesis for the Test of Significance
   The p Value
Problems with Using Correlations to Investigate Causal Relationships
   Overview
   Correlations and Cause-and-Effect Relationships
   An Initial Explanation
   Alternative Explanations
   Obtaining Stronger Evidence of Cause and Effect
   Is Correlational Research Ever Appropriate?
Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables
   Overview
   The Study
   The Criterion Variable and Predictor Variables in the Analysis
   Data Set to Be Analyzed
   The DATA Step for the SAS Program
Using PROC PLOT to Create a Scattergram
   Overview
   Why You Should Create a Scattergram Prior to Computing a Correlation Coefficient
   Syntax for the SAS Program
   Results from the SAS Output
Using PROC CORR to Compute the Pearson Correlation between Two Variables
   Overview
   Syntax for the SAS Program
   Results from the SAS Output
   Steps in Interpreting the Output
   Summarizing the Results of the Analysis
Using PROC CORR to Compute All Possible Correlations for a Group of Variables
   Overview
   Writing the SAS Program
   Results from the SAS Output
Summarizing Results Involving a Nonsignificant Correlation
   Overview
   The Results from PROC CORR
   The Results from PROC PLOT
   Summarizing the Results of the Analysis
Using the VAR and WITH Statements to Suppress the Printing of Some Correlations
   Overview
   Writing the SAS Program
   Results from the SAS Output
Computing the Spearman Rank-Order Correlation Coefficient for Ordinal-Level Variables
   Overview
   Situations Appropriate for This Statistic
   Example of When to Compute the Spearman Rank-Order Correlation Coefficient
   Writing the SAS Program
   Understanding the SAS Output
Some Options Available with PROC CORR
   Overview
   Where in the Program to Request Options
   Description of Some Options
   Where to Find More Options for PROC CORR
Problems with Seeking Significant Results
   Overview
   Reprise: Null Hypothesis Testing with Just Two Variables
   Null Hypothesis Testing with a Larger Number of Variables
   How to Avoid This Problem
Conclusion
Introduction

Overview

This chapter shows you how to use SAS to compute correlation coefficients. Most of the chapter focuses on the Pearson product-moment correlation coefficient. You use this procedure when you want to determine whether there is a significant relationship between two numeric variables that are each assessed on an interval scale or ratio scale (there are a number of additional assumptions that must also be met; these will be discussed below). The chapter also illustrates the use of the Spearman rank-order correlation coefficient, which is appropriate for variables assessed on an ordinal scale of measurement.

This chapter discusses a number of issues related to the conduct of correlational research. It shows how to interpret the sign and size of correlation coefficients, and how to determine whether they are statistically significant. It cautions against “fishing” for significant findings by computing large numbers of correlation coefficients in a single study. It also cautions against using correlational findings to draw conclusions about cause-and-effect relationships.

This chapter shows you how to use PROC PLOT to create a scattergram so that you can verify that the relationship between two variables is linear. It then illustrates the use of PROC CORR to compute Pearson correlation coefficients. It shows (a) how to compute the correlation between just two variables, (b) how to compute all possible correlations between a number of variables, and (c) how to use the VAR and WITH statements to selectively suppress the printing of some correlations. It shows how to prepare analysis reports for correlations that are statistically significant, as well as for correlations that are nonsignificant.
Situations Appropriate for the Pearson Correlation Coefficient

Overview

A correlation coefficient is a number that summarizes the nature of the relationship between two variables. Most of this chapter focuses on the Pearson product-moment correlation coefficient. The Pearson correlation coefficient is appropriate when both variables being analyzed are assessed on an interval or ratio level, and the relationship between the two variables is linear. The symbol for the Pearson product-moment correlation is r.

The first part of this section describes the types of situations in which this statistic is typically computed, and discusses a few of the assumptions underlying the procedure. A more complete summary of assumptions is presented at the end of this section.
Nature of the Predictor and Criterion Variables

Predictor variable. In computing a Pearson correlation coefficient, the predictor variable should be a numeric variable that is assessed on an interval or ratio scale of measurement.

Criterion variable. The criterion variable should also be a numeric variable that is assessed on an interval or ratio scale of measurement.

The Type-of-Variable Figure

When researchers compute Pearson correlation coefficients, they are typically studying the relationship between (a) a criterion variable that is a multi-value numeric variable and (b) a predictor variable that is also a multi-value numeric variable.

Chapter 2, “Terms and Concepts Used in This Guide,” introduced you to the concept of the “type-of-variable” figure. A type-of-variable figure indicates the types of variables that are included in an analysis when the variables are classified according to the number of values that they assume. Using this scheme, all variables can be classified as dichotomous variables, limited-value variables, or multi-value variables. The following figure illustrates the types of variables that are typically being analyzed when computing a Pearson correlation coefficient:

   Criterion          Predictor
     Multi      =       Multi
Chapter 2 indicated that the symbol that appears to the left of the equal sign in this type of figure represents the criterion variable in the analysis. The “Multi” symbol to the left of the equal sign in the above figure shows that the criterion variable in the computation of a Pearson correlation is typically a multi-value variable (a variable that assumes more than six values in your sample). Chapter 2 also indicated that the symbol that appears to the right of the equal sign represents the predictor variable in the analysis. The “Multi” symbol to the right of the equal sign shows that the predictor variable in the computation of a Pearson correlation is also typically a multi-value variable.

Example of a Study Providing Data Appropriate for This Procedure

Predictor and criterion variables. Suppose that you are an industrial psychologist who is studying prosocial organizational behavior. Employees score high on prosocial organizational behavior when they do helpful things for the organization or for other employees: helpful things that go beyond their normal job responsibilities. This might include volunteering for a new assignment, helping a new employee on the job, or helping to clean up the shop.
Suppose that you want to identify variables that may be correlated with prosocial organizational behavior. Based on a review of the literature, you hypothesize that perceived organizational fairness may be related to this variable. Employees score high on perceived organizational fairness when they believe that the organization's management has treated them equitably.

Research method. You conduct a study to determine whether there is a significant correlation between prosocial organizational behavior and perceived organizational fairness in a sample of 300 employees. To assess prosocial organizational behavior, you develop a checklist of prosocial behaviors and ask supervisors to evaluate each of their subordinates with this checklist by checking off behaviors each time they are displayed by the subordinate. To assess perceived organizational fairness, you use a questionnaire scale developed by other researchers. The questionnaire contains items such as “This organization treats me fairly.” Employees circle a number from 1 to 7 to indicate the extent to which they agree or disagree with each item. You sum responses to the individual items to create a single summed score for each employee (a sketch of such a summing statement appears at the end of this section). With this variable, higher scores indicate greater agreement that the organization treats them fairly. To analyze the data, you compute the correlation between the measure of prosocial behavior and the measure of perceived fairness. You hypothesize that there will be a positive correlation between the two variables.

Why this questionnaire data would be appropriate for this procedure. Earlier sections have indicated that, to compute a Pearson product-moment correlation coefficient, the predictor variable should be a numeric variable that is assessed on an interval or ratio scale of measurement. The predictor variable in this study consisted of scores on a questionnaire scale designed to assess perceived organizational fairness. Most researchers would agree that scores from this type of summated rating scale can be viewed as constituting an interval scale of measurement (assuming that the scale was developed properly).

To compute a Pearson correlation, the criterion variable in the analysis should also be a numeric variable that is assessed on an interval or ratio scale of measurement. The criterion variable in the current study was prosocial organizational behavior. A particular employee's score on this variable is the number of prosocial behaviors (as assessed by the employee's supervisor) that they have displayed in a specified period of time. This “number of prosocial behaviors” variable has equal intervals and a true zero point. Therefore, this variable appears to be assessed on the ratio level.

To review, when you compute a Pearson correlation, the predictor and the criterion variable are usually multi-value variables. To determine whether this is the case for the current study, you would use PROC FREQ to create simple frequency tables for the predictor and criterion variables (similar to those shown in Chapter 5, “Creating Frequency Tables”). You would know that both variables were multi-value variables if you observed more than six values for each of them in their frequency tables.
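Creating the summed fairness score described above takes a single data manipulation statement in a DATA step. A minimal sketch, assuming seven hypothetical item variables named Q1 through Q7 (the item count and the names are illustrative; they do not come from the study itself):

DATA D2;
   SET D1;
   * Sum the individual questionnaire items into one fairness score;
   FAIRNESS = SUM(OF Q1-Q7);
RUN;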
Summary of Assumptions for the Pearson Correlation Coefficient

•  Interval-level measurement. Both the predictor and criterion variables should be assessed on an interval or ratio level of measurement.

•  Random sampling. Each subject in the sample should contribute one score on the predictor variable and one score on the criterion variable. These pairs of scores should represent a random sample drawn from the population of interest.

•  Linearity. The relationship between the criterion variable and the predictor variable should be linear. This means that, in the population, the mean criterion scores at each value of the predictor variable should fall on a straight line. The Pearson correlation coefficient is not appropriate for assessing the strength of the relationship between two variables involved in a curvilinear relationship. (A scattergram check of this assumption is sketched after this list.)

•  Bivariate normal distribution. The pairs of scores should follow a bivariate normal distribution. This means that (a) scores on the criterion variable should form a normal distribution at each value of the predictor variable, and (b) scores on the predictor variable should form a normal distribution at each value of the criterion variable. Scores that represent a bivariate normal distribution form an elliptical scattergram when plotted (i.e., their scattergram is shaped like a football: relatively fat in the middle and tapered on the ends).
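A scattergram is the usual first check on the linearity and bivariate-normality assumptions. PROC PLOT is covered in detail later in this chapter; for now, a minimal sketch using the hypothetical variable names PROSOCIAL (criterion) and FAIRNESS (predictor) from the study described above:

PROC PLOT DATA=D1;
   PLOT PROSOCIAL*FAIRNESS;   * criterion on the vertical axis, predictor on the horizontal axis;
TITLE1 'JANE DOE';
RUN;

An elliptical, football-shaped cloud of points is consistent with these assumptions; a clearly curved band of points suggests a curvilinear relationship, for which the Pearson coefficient is not appropriate.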
Interpreting the Sign and Size of a Correlation Coefficient

Overview

As was stated earlier, a correlation coefficient is a number that represents the nature of the relationship between two variables. To understand the nature of this relationship, you will review the sign of the coefficient (whether it is positive or negative), as well as the size of the coefficient (whether it is relatively close to zero or close to ±1.00). This section shows you how.

Interpreting the Sign of a Correlation Coefficient

Overview. A correlation coefficient may be either positive (+) or negative (–). The sign of the correlation tells you about the direction of the relationship between the two variables.

Positive correlation. A positive correlation means that high values on one variable tend to be associated with high values on the other variable, and low values on one variable tend to be associated with low values on the other variable.
For example, consider the fictitious industrial psychology study described above. In that study, you assessed two variables:

•  Prosocial organizational behavior. This refers to positive, helpful things that an employee might do to help his or her organization. Assume that, according to the way that you measured this variable, higher scores represent higher levels of prosocial organizational behavior.

•  Perceived organizational fairness. This refers to the extent to which the employee believes that the organization has treated him or her in a fair and equitable way. Again, assume that, according to the way that you measured this variable, higher scores represent higher levels of perceived organizational fairness.

Suppose that you have reviewed the research literature, and it suggests that employees who score high on perceived organizational fairness are likely to feel grateful to their employing organizations, and are likely to repay them by engaging in prosocial organizational behavior. Based on this idea, you conduct a correlational study. If you measured these two variables in a sample of 300 employees, you would probably find that there is a positive correlation between perceived organizational fairness and prosocial organizational behavior. Consistent with the definition provided above, you would probably find that both of the following are true:

•  employees with high scores on perceived organizational fairness would also tend to have high scores on prosocial organizational behavior
•  employees with low scores on perceived organizational fairness would also tend to have low scores on prosocial organizational behavior.
In social science research, there are countless additional examples of pairs of variables that would demonstrate positive correlations. Here are just a few examples:

•  In a sample of college students, there would probably be a positive correlation between scores on the Scholastic Aptitude Test (SAT) and subsequent grade point average in college.
•  In a sample of contestants in a body-building contest, there would probably be a positive correlation between the number of hours that they spend training and their overall scores as body builders.
Negative correlation. In contrast to a positive correlation, a negative correlation means that high values on one variable tend to be associated with low values on the second variable. To illustrate this concept, consider what kind of variable would probably show a negative correlation with prosocial organizational behavior. For example, imagine that you develop a multi-item scale designed to measure burnout among employees. For our purposes, burnout refers to the extent to which employees feel exhausted, stressed, and unable to cope on the job. Assume that, according to the way that you measured this variable, higher scores represent higher levels of burnout.
Suppose that you have reviewed the research literature, and it suggests that employees who score high on burnout are probably too exhausted to engage in any prosocial organizational behavior. Based on this idea, you conduct a correlational study. If you measured burnout and prosocial organizational behavior in a sample of employees, you would probably find that there is a negative correlation between the two variables. Consistent with the definition provided above, you would probably find that both of the following are true:

•  employees with high scores on burnout would tend to have low scores on prosocial organizational behavior
•  employees with low scores on burnout would tend to have high scores on prosocial organizational behavior.
In social science research, there are also countless examples of pairs of variables that would demonstrate negative correlations. Here are just a few examples:

•  In a sample of college students, there would probably be a negative correlation between the number of hours they spend at parties each week and their subsequent grade point average in college.
•  In a sample of contestants in a body-building contest, there would probably be a negative correlation between the amount of junk food that they eat and their subsequent overall scores as body builders.
Interpreting the Size of a Correlation Coefficient
Overview. You interpret the size of a correlation coefficient to determine the strength of the relationship between the two variables. Generally speaking, the larger the size of the coefficient (in absolute value), the stronger the relationship. Absolute value refers to how large the correlation coefficient is, regardless of its sign. When there is a strong relationship between two variables, you are able to predict values on one variable from values on the second variable with a relatively high degree of accuracy. When there is a weak relationship between two variables, you are able to predict values on one variable from values on the second variable with a relatively low degree of accuracy.
A guide. Below is an informal guide for interpreting the approximate strength of the relationship between two variables, based on the absolute value of the coefficient:
±1.00  =  Perfect correlation
 ±.80  =  Strong correlation
 ±.50  =  Moderate correlation
 ±.20  =  Weak correlation
  .00  =  No correlation
For example, the above guide suggests that you should view a correlation as being relatively strong if the correlation coefficient were +.80 (or –.80). Similarly, it suggests that you should view a correlation as being relatively weak if the correlation coefficient were +.20 (or –.20). Again, remember to consider the absolute value of the coefficient when you interpret the size of the correlation. This means that a correlation of –.50 is just as strong as a correlation of +.50; a correlation of –.75 is just as strong as a correlation of +.75, and so forth.
The above guide shows that the possible values of correlation coefficients range from –1.00 through zero through +1.00. This means that you will never obtain a Pearson product-moment correlation below –1.00, or above +1.00.
Perfect correlation. A correlation of ±1.00 is a perfect correlation. When the correlation between two variables is ±1.00, it means that you can predict values on one variable from values on the second variable with no errors. For all practical purposes, the only time you will obtain a perfect correlation is when you correlate a variable with itself.
Zero correlation. A correlation of .00 means that there is no relationship between the two variables being studied. This means that, if you know how a subject is rated on one variable, it does not allow you to predict how that subject is rated on the second variable with any accuracy.
The Coefficient of Determination
The coefficient of determination refers to the proportion of variance in one variable that is accounted for by variability in the second variable. This issue of “proportion of variance accounted for” is an important one in statistics; in the chapters that follow, you will learn some techniques for calculating the percentage of variance in a criterion variable that is accounted for by a predictor variable. The coefficient of determination is relatively simple to compute if you have calculated a Pearson correlation coefficient. The formula is as follows:
Coefficient of determination = r²
In other words, to compute the coefficient of determination, you simply square the correlation coefficient. For example, suppose that you find that the correlation between two variables is equal to .50. In this case:
Coefficient of determination = r²
Coefficient of determination = (.50)²
Coefficient of determination = .25
So, when the Pearson correlation is equal to .50, the coefficient of determination is equal to .25. This means that 25% of the variability in the criterion variable is associated with variability in the predictor variable.
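If you would like SAS to do this arithmetic for you, a short DATA step will serve. The following is a minimal sketch, not part of the chapter's program; the data set name COD is arbitrary, and you would type in the correlation coefficient taken from your own output:
DATA COD;
   R   = .50;      /* Pearson correlation copied from your PROC CORR output */
   COD = R**2;     /* coefficient of determination: r squared               */
RUN;
PROC PRINT   DATA=COD;
RUN;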
Interpreting the Statistical Significance of a Correlation Coefficient
Overview
When researchers report the results of correlational research, they typically indicate whether the correlation coefficients that they have computed are statistically significant. When researchers report that a correlation coefficient is “statistically significant,” they typically mean that the coefficient is significantly different from zero. To understand the concept of statistical significance, it is necessary to first understand the concepts of the null hypothesis, the alternative hypothesis, and the p value, as they apply to correlational research. Each of these concepts is discussed in the following sections.
The Null Hypothesis for the Test of Significance
A null hypothesis is a statistical hypothesis about a population or about the relationship between two or more different populations. A null hypothesis typically states either that (a) there is no relationship between the variables being studied, or that (b) there is no difference between the populations being studied. In other words, the null hypothesis is typically a hypothesis of no relationship or no difference.
When you are conducting correlational research and are investigating the correlation between two variables, your null hypothesis will typically state that, in the population, there is no correlation between these two variables. For example, again suppose that you are studying the relationship between perceived organizational fairness and prosocial organizational behavior in a sample of employees. Your null hypothesis for this analysis might be stated as follows:
Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between perceived organizational fairness and prosocial organizational behavior is equal to zero.
In the preceding statement, the symbol “H0” is the symbol for “null hypothesis.” The symbol “ρ” is the Greek letter that represents the correlation between two variables in the population. When the above null hypothesis states “ρ = 0”, it is essentially stating that the correlation between these two variables is equal to zero in the population.
The Alternative Hypothesis for the Test of Significance
Like the null hypothesis, the alternative hypothesis is a statistical hypothesis about a population or about the relationship between two or more different populations. In contrast to the null hypothesis, however, the alternative hypothesis typically states either that (a) there is a relationship between the variables being studied, or that (b) there is a difference between the populations.
For example, again consider the fictitious study investigating the relationship between perceived organizational fairness and prosocial organizational behavior. The alternative hypothesis for this study would state that there is a relationship between these two variables in the population. In formal terms, it could be stated this way:
Statistical alternative hypothesis (H1): ρ ≠ 0; In the population, the correlation between perceived organizational fairness and prosocial organizational behavior is not equal to zero.
Notice that the above alternative hypothesis was stated as a nondirectional hypothesis. It does not predict whether the actual correlation in the population is positive or negative; it simply predicts that it will not be equal to zero.
The p Value
Overview. When you use SAS to compute a Pearson correlation coefficient, it automatically provides a p value for that coefficient. This p value may range in size from .00 through 1.00. You will review this p value to determine whether the coefficient is statistically significant. This section shows how to interpret these p values.
What a p value represents. In general terms, a probability value (or p value) is the probability that you would obtain the present results if the null hypothesis were true. The exact meaning of a p value depends upon the type of analysis that you are performing. When you compute a correlation coefficient, the p value represents the probability that you would obtain a correlation coefficient this large or larger in absolute magnitude if the null hypothesis were true. Remember that the null hypothesis that you are testing states that, in the population, the correlation between the two variables is equal to zero.
Suppose that you perform your analysis, and you find that, for your sample, the obtained correlation coefficient is .15 (symbolically, r = .15). This is a relatively weak correlation––it is fairly close to zero. Assume that the p value associated with this correlation coefficient is .89 (symbolically, p = .89). Essentially, this p value is making the following statement:
• If the null hypothesis were true, the probability that you would obtain a correlation coefficient of .15 is fairly high at 89%.
Clearly, 89% is a high probability. Under these circumstances, it seems reasonable to retain your null hypothesis. You will conclude that, in the population, the correlation between these two variables probably is equal to zero. You will conclude that your obtained correlation coefficient of r = .15 is not statistically significant. In other words, you will conclude that it is not significantly different from zero.
Now consider another fictitious outcome. Suppose that you perform your analysis, and you find that, for your sample, the obtained correlation coefficient is .70 (symbolically, r = .70). This is a relatively strong correlation. Suppose that the p value associated with this correlation coefficient is .01 (symbolically, p = .01). This p value is making the following statement:
• If the null hypothesis were true, the probability that you would obtain a correlation coefficient of .70 is fairly low at only 1%.
Most researchers would agree that 1% is a fairly low probability. Given this low probability, it now seems reasonable to reject your null hypothesis. You will conclude that, in the population, the correlation between these two variables is probably not equal to zero. You will conclude that your obtained correlation coefficient of r = .70 is statistically significant. In other words, you will conclude that it is significantly different from zero.
Deciding whether to reject the null hypothesis. With the above examples, you saw that, when the p value is a relatively large value (such as p = .89) you should not reject the null hypothesis, and that when the p value is a relatively small value (such as p = .01) you should reject the null hypothesis. This naturally leads to the question, “Just how small must the p value be to reject the null hypothesis?” The answer to this question will depend on a number of factors, such as the nature of your research and the importance of not erroneously rejecting a true null hypothesis. To keep things simple, however, this book will adopt the following guidelines:
• If the p value is less than .05, you should reject the null hypothesis.
• If the p value is .05 or larger, you should not reject the null hypothesis.
This means that if your p value is less than .05 (such as p = .0400, p = .0121, or p < .0001), you should reject the null hypothesis, and should conclude that your obtained correlation coefficient is statistically significant. Conversely, if your p value is .05 or larger (such as p = .0510, p = .5456, or p = .9674), you should not reject the null hypothesis, and should conclude that your obtained correlation is statistically nonsignificant. This book will use this guideline throughout all of the remaining chapters. Most of the chapters following this one will report some type of significance test. In each of those chapters, you will continue to use the same rule: reject the null hypothesis if your p value is less than .05.
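For readers who like to see the rule written down as code, here is a minimal sketch (not from the text; the data set name DECIDE and the p value of .0120 are hypothetical) of the decision rule expressed in a DATA step:
DATA DECIDE;
   LENGTH DECISION $ 17;
   P = .0120;                        /* hypothetical p value from your output */
   IF P < .05 THEN DECISION = 'Reject H0';
   ELSE DECISION = 'Do not reject H0';
RUN;
PROC PRINT   DATA=DECIDE;
RUN;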
Problems with Using Correlations to Investigate Causal Relationships
Overview
When you compute a Pearson correlation, you can review the correlation coefficient to learn about the nature of the relationship between the two variables (e.g., whether the relationship is positive or negative; whether the relationship is relatively strong or weak). However, a single Pearson correlation coefficient by itself will not tell you anything about whether there is a causal relationship between the two variables. This section discusses the concept of
cause-and-effect relationships, and cautions against using simple correlational analyses to provide evidence for such relationships.
Correlations and Cause-and-Effect Relationships
Chapter 2, “Terms and Concepts Used in This Guide,” discussed some of the differences between experimental research versus nonexperimental (correlational) research. It indicated that correlational research generally provides relatively weak evidence concerning cause-and-effect relationships. This is especially the case when investigating the correlation between just two variables (as is the case in this chapter). This means that, if you observe a strong, significant relationship between two variables, it should not be taken as evidence that one of the variables is exerting a causal effect on the other.
An Initial Explanation
For example, suppose that you began your investigation with a hypothesis of cause and effect. Suppose that you hypothesized that perceived organizational fairness has a causal effect on prosocial organizational behavior: If you can increase employee perceptions of fairness, this will cause the employees to display an increase in prosocial organizational behavior. This cause-and-effect relationship is illustrated in Figure 10.1.
Figure 10.1. Hypothesized relationship between prosocial organizational behavior and perceived organizational fairness.
Suppose that you conduct your study, and find that there is a relatively large, positive, and significant correlation between the fairness variable and the prosocial behavior variable. Your first impulse might be to rejoice that you have obtained “proof” that fairness has a causal effect on prosocial behavior. Unfortunately, few researchers will find your evidence convincing.
Alternative Explanations
The reason has to do with the concept of “alternative explanations.” You have found a significant correlation, and have offered your own explanation for what it means: It means that fairness has a causal effect on prosocial behavior. However, it will be easy for other researchers to offer very different explanations that may be equally plausible in explaining why you obtained a significant correlation. And if others can generate alternative
explanations, few researchers are going to be convinced that your explanation must be correct. For example, suppose that you have obtained a strong, positive correlation, and have presented your results at a professional conference. You tell your audience that these results prove that perceived fairness has an effect on prosocial behavior, as illustrated in Figure 10.1. At the end of your presentation, a member of the audience may rise and ask if it is not possible that there is a different explanation for your correlation. She may suggest that the two variables are correlated because of the influence of an underlying third variable. One such possible underlying variable is illustrated in Figure 10.2.
Figure 10.2. An alternative explanation for the observed correlation between prosocial organizational behavior and perceived organizational fairness.
Figure 10.2 suggests that fairness and prosocial behavior may be correlated because they are both influenced by the same underlying third variable: the personality trait of “optimism.” The researcher in your audience may argue that it is reasonable to assume that optimism has a causal effect on prosocial behavior. She could argue that, if a person is generally optimistic, they will be more likely to believe that showing prosocial behaviors will result in rewards from the organization (such as promotions). This argument supports the causal arrow that goes from optimism to prosocial behavior in Figure 10.2. The researcher in your audience could go on to argue that it is reasonable to assume that optimism also has a causal effect on perceived fairness. She could argue that optimistic people tend to look for the good in all situations, including their work situations. This could cause optimistic people to describe their organizations as treating them more fairly (compared to pessimistic people). This argument supports the causal arrow that goes from optimism to perceived fairness in Figure 10.2. In short, the researcher in your audience could argue that there is no causal relationship between prosocial behavior and fairness at all––the only reason they are correlated is because they are both influenced by the same underlying third variable: optimism.
This is why correlational research provides relatively weak evidence of cause-and-effect relationships: It is often possible to generate more than one explanation for an observed correlation between two variables.
Obtaining Stronger Evidence of Cause and Effect
Researchers who wish to obtain stronger evidence of cause-and-effect relationships typically rely on one (or both) of two approaches. The first approach is to conduct experimental research. Chapter 2 of this guide argued that, when you conduct a true experiment, it is sometimes possible to control all important extraneous variables. This means that, when you conduct a true experiment and obtain a significant effect for your independent variable, it provides more convincing evidence that it was truly your independent variable that had an effect on the dependent variable, and not an “underlying third variable.” In other words, with a well-designed experiment, it is more difficult for other researchers to generate plausible alternative explanations for your results.
Another alternative for researchers who wish to obtain stronger evidence of cause-and-effect relationships is to use correlational data, but analyze them with statistical procedures that are much more sophisticated than the procedures discussed in this text. These sophisticated correlational procedures go by names such as “path analysis,” “causal modeling,” and “structural equation modeling.” Hatcher (1994) provides an introduction to some of these procedures.
Is Correlational Research Ever Appropriate?
None of this is meant to discourage you from conducting correlational research. There are many situations in which correlational research is perfectly appropriate. These include:
• Situations in which it would be unethical or impossible to conduct an experiment. There are many situations in which it would be unethical or impossible to conduct a true experiment. For example, suppose that you want to determine whether physically abusing children will cause those individuals to become child abusers themselves when they grow up. Obviously, no one would wish to conduct a true experiment in which half of the child subjects are assigned to an “abused” condition, and the other half are assigned to a “nonabused” condition. Instead, you might conduct a correlational study in which you simply determine whether the way people were previously treated by their parents is correlated with the way that they now treat their own children.
• As an early step in a research program that will eventually include experiments. In many situations, experiments are more expensive and time-consuming to conduct, relative to correlational research. Therefore, when researchers believe that two variables may be causally related, they sometimes begin their program of research by conducting a simple nonexperimental study to see if the two variables are, in fact, correlated. If yes, the researcher may then be sufficiently encouraged to proceed to more ambitious controlled studies such as experiments.
Also, remember that in many situations researchers are not even interested in testing cause-and-effect relationships. In many situations they simply wish to determine whether two variables are correlated. For example, a researcher might simply wish to know whether high scores on Test X tend to be associated with high scores on Test Y. And that is the approach that will be followed in this chapter. In general, it will discuss studies in which researchers simply wish to determine whether one variable is correlated with another. To the extent possible, it will avoid using language that implies possible cause-and-effect relationships between the variables.
Example 10.1: Correlating Weight Loss with a Variety of Predictor Variables
Overview
Most of this chapter focuses on analyzing data from a fictitious study that produces data that can be analyzed with the Pearson correlation coefficient. In this study, you will investigate the correlation between weight loss and a number of variables that might be correlated with weight loss. This section describes these variables, and shows you how to prepare the DATA step.
The Study
Hypotheses. Suppose that you are conducting a study designed to identify the variables that are predictive of weight loss in men. You want to test the following research hypotheses:
• Hypothesis 1: Weight loss will be positively correlated with motivation: Men who are highly motivated to lose weight will tend to lose more weight than those who are less motivated.
• Hypothesis 2: Weight loss will be positively correlated with time spent exercising: Men who exercise many hours each week will tend to lose more weight than those who exercise fewer hours each week.
• Hypothesis 3: Weight loss will be negatively correlated with calorie consumption: Men who consume many calories each day will tend to lose less weight than those who consume fewer calories each day.
• Hypothesis 4: Weight loss will be positively correlated with intelligence: Men who are highly intelligent will tend to lose more weight than those who are less intelligent.
Research method. To test these hypotheses, you conduct a correlational study with a group of 22 men over a 10-week period. At the beginning of the study, you administer a 5-item scale that is designed to assess each subject’s motivation to lose weight. The scale consists of statements such as “It is very important to me to lose weight.” Subjects respond to each item using a 7-point response format in which 1 = “Disagree Very Strongly” and 7 =
“Agree Very Strongly.” You sum their responses to the five items to create a single motivation score for each subject. Scores on this measure may range from 5 to 35, with higher scores representing greater motivation to lose weight. You will correlate this motivation score with subsequent weight loss to test Hypothesis 1 (from above).
Throughout the 10-week study, you ask each subject to record the number of hours that he exercises each week. At the end of the study, you determine the average number of hours spent exercising for each subject, and correlate this number of hours spent exercising with subsequent weight loss to test Hypothesis 2.
Throughout the study, you also ask each subject to keep a log of the number of calories that he consumes each day. At the end of the study, you compute the average number of calories consumed by each subject. You will correlate this measure of daily calorie intake with subsequent weight loss. You use this correlation to test Hypothesis 3.
At the beginning of the study you also administer the Wechsler Adult Intelligence Scale (WAIS) to each subject. The combined IQ score from this instrument will serve as the measure of intelligence in your study. You will correlate IQ with subsequent weight loss to test Hypothesis 4.
Throughout the 10-week study, you weigh each subject and record his body weight in kilograms (1 kilogram is equal to approximately 2.2 pounds). When the study is completed, you subtract each subject’s body weight at the end of the study from his weight at the beginning of the study. You use the resulting difference as your measure of weight loss.
The Criterion Variable and Predictor Variables in the Analysis
This study involves one criterion variable and four predictor variables. The criterion variable is weight loss measured in kilograms (kg). In the analysis, you will give this variable the SAS variable name KG_LOST for “kilograms lost.”
The first predictor variable is motivation to lose weight, as measured by the questionnaire described earlier in this example. In the analysis, you will use the SAS variable name MOTIVAT to represent this variable.
The second predictor variable is the average number of hours the subjects spent exercising each week during the study. In the analysis, you will use the SAS variable name EXERCISE to represent this variable.
The third predictor variable is the average number of calories consumed during each day of the study. In the analysis, you will use the SAS variable name CALORIES to represent this variable.
The final predictor variable is intelligence, as measured by the WAIS. In the analysis, you will use the SAS variable name IQ to represent this variable.
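As an aside, if you had entered the five raw item responses for the motivation scale rather than the summed score, a short DATA step could create MOTIVAT for you. This is a minimal sketch under assumed conditions: the item variable names M1 through M5 and the two data lines are hypothetical and do not appear in this chapter:
DATA ITEMS;
   INPUT SUB_NUM M1 M2 M3 M4 M5;        /* M1-M5 = the five item responses  */
   MOTIVAT = M1 + M2 + M3 + M4 + M5;    /* summed score; may range 5 to 35  */
DATALINES;
01 1 2 1 1 1
02 7 6 7 7 7
;
RUN;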
Data Set to Be Analyzed
Table 10.1 presents fictitious scores for each subject on each of the variables to be analyzed in this study.
Table 10.1
Variables Analyzed in the Weight Loss Study
___________________________________________________________________
              Kilograms               Hours       Calories
Subject       lost        Motivation  exercising  consumed     IQ
___________________________________________________________________
01. John         2.60          5          0          2400     100
02. George       1.00          5          0          2000     120
03. Fred         1.80         10          2          1600     130
04. Charles      2.65         10          5          2400     140
05. Paul         3.70         10          4          2000     130
06. Jack         2.25         15          4          2000     110
07. Emmett       3.00         15          2          2200     110
08. Don          4.40         15          3          1400     120
09. Edward       5.35         15          2          2000     110
10. Rick         3.25         20          1          1600      90
11. Ron          4.35         20          5          1800     150
12. Dale         5.60         20          3          2200     120
13. Bernard      6.44         20          6          1200      90
14. Walter       4.80         25          1          1600     140
15. Doug         5.75         25          4          1800     130
16. Scott        6.90         25          5          1400     140
17. Sam          7.75         25          .          1400     100
18. Barry        5.90         30          4          1600     100
19. Bob          7.20         30          5          2000     150
20. Randall      8.20         30          2          1200     110
21. Ray          7.80         35          4          1600     130
22. Tom          9.00         35          6          1600     120
___________________________________________________________________
Table 10.1 provides scores for 22 male subjects. The first subject appearing in the table is named John. Table 10.1 shows the following values for John on the study’s variables:
• He lost 2.60 kg of weight by the end of the study.
• His score on the motivation to lose weight scale was 5 (out of a possible 35).
• His score on “Hours Exercising” was 0, meaning that he exercised zero hours per week on the average.
• His score on calories was 2400, meaning that he consumed 2400 calories each day, on the average.
• His IQ was 100 (with the WAIS, the mean IQ is 100 and the standard deviation is about 15 in the population).
Scores for the remaining subjects can be interpreted in the same way.
The DATA Step for the SAS Program
Below is the DATA step for the SAS program that will read the data presented in Table 10.1:
 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           KG_LOST
 5           MOTIVAT
 6           EXERCISE
 7           CALORIES
 8           IQ;
 9   DATALINES;
10   01  2.60   5  0  2400  100
11   02  1.00   5  0  2000  120
12   03  1.80  10  2  1600  130
13   04  2.65  10  5  2400  140
14   05  3.70  10  4  2000  130
15   06  2.25  15  4  2000  110
16   07  3.00  15  2  2200  110
17   08  4.40  15  3  1400  120
18   09  5.35  15  2  2000  110
19   10  3.25  20  1  1600   90
20   11  4.35  20  5  1800  150
21   12  5.60  20  3  2200  120
22   13  6.44  20  6  1200   90
23   14  4.80  25  1  1600  140
24   15  5.75  25  4  1800  130
25   16  6.90  25  5  1400  140
26   17  7.75  25  .  1400  100
27   18  5.90  30  4  1600  100
28   19  7.20  30  5  2000  150
29   20  8.20  30  2  1200  110
30   21  7.80  35  4  1600  130
31   22  9.00  35  6  1600  120
32   ;
Some notes about the preceding program:
• Line 1 of the preceding program contains the OPTIONS statement which, in this case, specifies the size of the printed page of output. One entry in the OPTIONS statement is “PS=60”, which is an abbreviation for “PAGESIZE=60.” This key word requests that each page of output have up to 60 lines of text on it. Depending on the font that you are using (and other factors), requesting PS=60 may cause the bottom of your scatterplot to be “cut off” when it is printed. If this happens, you should change the OPTIONS statement so that it requests just 50 lines of text per page. You will do this by including PS=50 in your OPTIONS statement, rather than PS=60. Your complete OPTIONS statement should appear as follows:
OPTIONS  LS=80  PS=50;
• You can see that lines 3–8 of the preceding program provide the INPUT statement. There, the SAS variable name SUB_NUM is used to represent “subject number,” KG_LOST is used to represent “kilograms lost,” MOTIVAT is used to represent “motivation to lose weight,” and so forth.
• The data themselves appear in lines 10–31. The data on these lines are identical to the data appearing in Table 10.1, except that the names of the subjects have been removed.
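Before submitting the analyses that follow, you may want to verify that the data were read as intended. One simple way to do this (a sketch, not part of the chapter's program) is to print the data set with PROC PRINT and compare the listing against Table 10.1:
PROC PRINT   DATA=D1;
TITLE1  'JANE DOE';
RUN;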
Using PROC PLOT to Create a Scattergram
Overview
A scattergram is a type of graph that is useful when you are plotting one multi-value variable against a second multi-value variable. This section explains why it is always necessary to plot your variables in a scattergram prior to computing a Pearson correlation coefficient. It then shows how to use PROC PLOT to create a scattergram, and how to interpret the output created by PROC PLOT.
Why You Should Create a Scattergram Prior to Computing a Correlation Coefficient
What is a scattergram? A scattergram (also called a scatterplot) is a graph that plots the individual data points from a correlational study. It is particularly useful when both of the variables being analyzed are multi-value variables. Each data point in a scattergram represents a single observation (typically a human subject). The data point indicates where the subject stands on both the predictor variable (the X variable) and the criterion variable (the Y variable).
You should always create a scattergram for a given pair of variables prior to computing the correlation between those variables. This is because the Pearson correlation coefficient is appropriate only if the relationship between the variables is linear; it is not appropriate when the relationship is nonlinear.
Linear versus nonlinear relationships. There is a linear relationship between two variables when their scattergram follows the form of a straight line. This means that, in the population, the mean criterion scores at each value of the predictor variable should fall on a straight line. When there is a linear relationship between X and Y, it is possible to draw a straight line through the center of the scattergram.
In contrast, there is a nonlinear relationship between two variables if their scattergram does not follow the form of a straight line. For example, imagine that you have constructed a test of creativity, and have administered it to a large sample of college students. With this test, higher scores reflect higher levels of creativity. Imagine further that you obtain the Learning Aptitude Test (LAT) verbal test scores for these students, plot their LAT scores
against their creativity scores, creating a scattergram. With this scattergram, LAT scores are plotted on the horizontal axis, and creativity scores are plotted on the vertical axis. Suppose that this scattergram shows that (a) students with low LAT scores tend to have low creativity scores, (b) students with moderate LAT scores tend to have high creativity scores, and (c) students with high LAT scores tend to have low creativity scores. Such a scattergram would take the form of an upside-down “U.” It would not be possible to draw a good-fitting straight line through the data points of this scattergram, and this is why we would say that there is a nonlinear (or perhaps a curvilinear) relationship between LAT scores and creativity scores.
Problems with nonlinear relationships. When you use the Pearson correlation coefficient to assess the relationship between two variables involved in a nonlinear relationship, the resulting correlation coefficient usually underestimates the actual strength of the relationship between the two variables. For example, computing the Pearson correlation between the LAT scores and creativity scores (from the preceding example) might result in a correlation coefficient of .10, which would indicate only a very weak relationship between the two variables. And yet there might actually be a fairly strong relationship between LAT scores and creativity: It may be possible to predict someone's creativity with great accuracy if you know where they stand on the LAT. Unfortunately, you would never know this if you did not first create the scattergram.
The implication of all this is that you should always create a scattergram to verify that there is a linear relationship between two variables before computing a Pearson correlation for those variables. Fortunately, this is very easy to do using the SAS PLOT procedure.
Syntax for the SAS Program
Here is the syntax for requesting a scattergram with the PLOT procedure:
PROC PLOT   DATA=data-set-name;
   PLOT  criterion-variable*predictor-variable;
TITLE1  ' your-name ';
RUN;
The variable listed as the “criterion-variable” in the preceding program will be plotted on the vertical axis (the Y axis), and the “predictor-variable” will be plotted on the horizontal axis (the X axis).
Suppose that you wish to compute the correlation between KG_LOST (kilograms lost) and MOTIVAT (the motivation to lose weight). Prior to computing the correlation, you would use PROC PLOT to create a scattergram plotting KG_LOST against MOTIVAT. The SAS statements that would create this scattergram are:
     [First part of the DATA step appears here]
28   20  8.20  30  2  1200  110
29   21  7.80  35  4  1600  130
30   22  9.00  35  6  1600  120
31   ;
32   PROC PLOT   DATA=D1;
33      PLOT  KG_LOST*MOTIVAT;
34   TITLE1  'JANE DOE';
35   RUN;
Some notes about the preceding program:
• To conserve space, the preceding shows only the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled “The DATA Step for the SAS Program.”
• The PROC PLOT statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The PLOT statement appears on line 33. It requests that KG_LOST serve as the criterion variable (Y variable) in the plot, and that MOTIVAT serve as the predictor variable (X variable). This means that KG_LOST will appear on the vertical axis, and MOTIVAT will appear on the horizontal axis.
• Lines 34–35 present the TITLE1 and RUN statements for the program.
Results from the SAS Output
Output 10.1 presents the scattergram that was created by the preceding program.
                              JANE DOE
                      Plot of KG_LOST*MOTIVAT.
                Legend: A = 1 obs, B = 2 obs, etc.
[The text-based scattergram produced by PROC PLOT appears here: KG_LOST (1 through 9) is plotted on the vertical axis and MOTIVAT (5 through 35) on the horizontal axis, with each “A” marking the combination of scores for one subject.]
Output 10.1. Scattergram plotting kilograms lost against the motivation to lose weight.
Understanding the scattergram. Notice that, in this output, the criterion variable (KG_LOST) is plotted on the vertical axis, while the predictor variable (MOTIVAT) is plotted on the horizontal axis. Each letter in a scattergram represents one or more individual subjects. For example, consider the letter “A” that appears in the top right corner of the scattergram. This letter is located directly above a score of “35” on the “MOTIVAT” axis (the horizontal axis). It is also located directly to the right of a score of “9” on the
KG_LOST axis. This means that this letter represents a person who had a score of 35 on MOTIVAT and a score of 9 on KG_LOST. In contrast, now consider the letter “A” that appears in the lower left corner of the scattergram. This letter is located directly above a score of “5” on the “MOTIVAT” axis (the horizontal axis). It is also located directly to the right of a score of “1” on the KG_LOST axis. This means that this letter represents a person who had a score of 5 on MOTIVAT and a score of 1 on KG_LOST. Each of the remaining letters in the scattergram can be interpreted in the same fashion. The legend at the top of the output says, “Legend: A = 1 obs, B = 2 obs, etc.” This means that a particular letter in the graph may represent one or more observations (human subjects, in this case). If you see the letter “A,” it means that a single person is located at that point (i.e., a single subject had that particular combination of scores on KG_LOST and MOTIVAT). If you see the letter “B,” it means that two people are located at that point, and so forth. You can see that only the letter “A” appears in this output, meaning that there was no point in the scattergram where more than one person scored. Drawing a straight line through the scattergram. The shape of the scattergram in Output 10.1 shows that there is a linear relationship between KG_LOST and MOTIVAT. This can be seen from the fact that it would be possible to draw a good-fitting straight line through the center of the scattergram. To illustrate this, Output 10.2 presents the same scattergram, this time with a straight line drawn through its center.
                              JANE DOE
                      Plot of KG_LOST*MOTIVAT.
                Legend: A = 1 obs, B = 2 obs, etc.
[The same text-based scattergram appears here, this time with a straight line drawn through the center of the plotted data points.]
Output 10.2. Graph plotting kilograms lost against the motivation to lose weight, with straight line drawn through the center of the scattergram.
Strength of the relationship. The general shape of the scattergram also suggests that there is a fairly strong relationship between the two variables: Knowing where a subject stands on the MOTIVAT variable enables us to predict, with some accuracy, where that subject will stand on the KG_LOST variable. Later, we will compute the correlation coefficient for these two variables to see just how strong the relationship is. Positive versus negative relationships. Output 10.2 shows that the relationship between MOTIVAT and KG_LOST is positive: Large values on MOTIVAT are associated with large values on KG_LOST, and small values on MOTIVAT are associated with small values on KG_LOST. This makes intuitive sense: You would expect that the subjects who are highly motivated to lose weight would in fact be the subjects who would lose the most
weight. When there is a positive relationship between two variables, the scattergram will stretch from the lower left corner to the upper right corner of the graph (as is the case in Output 10.2). In contrast, when there is a negative relationship between two variables, the scattergram will be distributed from the upper left corner to the lower right corner of the graph. A negative relationship means that small values on the predictor variable are associated with large values of the criterion variable, and large values on the predictor variable are associated with small values on the criterion variable. Because the relationship between MOTIVAT and KG_LOST is linear, it is reasonable to proceed with the computation of a Pearson correlation for this pair of variables.
Using PROC CORR to Compute the Pearson Correlation between Two Variables
Overview
This chapter illustrates three different ways of using PROC CORR to compute correlation coefficients. It shows (a) how to compute the correlation between just two variables, (b) how to compute all possible correlations between a number of variables, and (c) how to use the VAR and WITH statements to selectively suppress the printing of some correlations. The present section focuses on the first of these: computing the correlation between just two variables. This section shows how to manage the PROC step for the analysis, how to interpret the output produced by PROC CORR, and how to prepare a report that summarizes the results.
Syntax for the SAS Program
In some instances, you may wish to compute the correlation between just two variables. Here is the syntax for the statements that will accomplish this:
PROC CORR   DATA=data-set-name  options;
   VAR  variable1  variable2;
TITLE1  ' your-name ';
RUN;
In the PROC CORR statement, you specify the name of the data set to be analyzed and request any options for the analysis. A section toward the end of this chapter will discuss some of the options available with PROC CORR.
You use the VAR statement to list the names of the two variables to be correlated. (The choice of which variable is “variable1” and which is “variable2” is arbitrary.) For example, suppose that you want to compute the correlation between the number of kilograms lost (KG_LOST) and the motivation to lose weight (MOTIVAT). Here are the required statements:
     [First part of the DATA step appears here]
28   20  8.20  30  2  1200  110
29   21  7.80  35  4  1600  130
30   22  9.00  35  6  1600  120
31   ;
32   PROC CORR   DATA=D1;
33      VAR  KG_LOST  MOTIVAT;
34   TITLE1  'JANE DOE';
35   RUN;
Some notes concerning the preceding statements:
• To conserve space, the preceding code block shows only the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled “The DATA Step for the SAS Program.”
• The PROC CORR statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The VAR statement appears on line 33. It requests that SAS compute the correlation between KG_LOST and MOTIVAT.
• Lines 34–35 present the TITLE1 and RUN statements for the program.
Results from the SAS Output
The preceding program results in a single page of output, reproduced here as Output 10.3:

                              JANE DOE

                         The CORR Procedure

   2  Variables:    KG_LOST  MOTIVAT

                          Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum
KG_LOST     22     4.98591     2.27488   109.69000     1.00000     9.00000
MOTIVAT     22    20.00000     8.99735   440.00000     5.00000    35.00000

            Pearson Correlation Coefficients, N = 22
                   Prob > |r| under H0: Rho=0

                       KG_LOST       MOTIVAT
         KG_LOST       1.00000       0.88524
                                      <.0001
         MOTIVAT       0.88524       1.00000
                        <.0001

Output 10.3. Results of PROC CORR in which kilograms lost is correlated with the motivation to lose weight.
Steps in Interpreting the Output
Make sure that everything looks right. The first part of Output 10.3 presents simple descriptive statistics for the variables being analyzed. This enables you to verify that everything “looks right”: that the correct number of cases was analyzed, that no variables were “out of bounds,” and so on. The names of the variables appear below the “Variable” heading, and the statistics for the variables appear to the right of the variable names. The column headed “N” shows that 22 subjects provided usable data for the KG_LOST variable. The column headed “Mean” shows that the mean for KG_LOST was 4.99 kilograms. The column headed “Std Dev” shows that the standard deviation was 2.27.
It is always important to review the “Minimum” and “Maximum” columns to verify that no “impossible” scores appear in the data. In Output 10.3, the Minimum column shows that the lowest observed score on “kilograms lost” was 1.0 kilograms. The Maximum column shows that the highest observed score was 9.0 kilograms.
These scores seem reasonable: When converted to pounds, these figures indicate that the least successful subject lost approximately 2.2 pounds, and the most successful subject lost approximately 19.8 pounds. Because these figures seem reasonable, they provide no obvious evidence that you have entered an “impossible” score when you typed the data (again, these proofing procedures do not guarantee that no errors were made in typing data, but they are useful for identifying some types of errors). The descriptive statistics for the motivation variable (to the right of “MOTIVAT”) also seem reasonable, given the nature of the scale used. Since the descriptive statistics provide no obvious evidence of typing or programming mistakes, you can review the correlations more confidently.
Interpreting a matrix of correlation coefficients. So that it will be easy to review, Output 10.3 is reproduced again here as Output 10.4.

                              JANE DOE

                         The CORR Procedure

   2  Variables:    KG_LOST  MOTIVAT

                          Simple Statistics

Variable     N        Mean     Std Dev         Sum     Minimum     Maximum
KG_LOST     22     4.98591     2.27488   109.69000     1.00000     9.00000
MOTIVAT     22    20.00000     8.99735   440.00000     5.00000    35.00000

            Pearson Correlation Coefficients, N = 22
                   Prob > |r| under H0: Rho=0

                       KG_LOST       MOTIVAT
         KG_LOST       1.00000       0.88524
                                      <.0001
         MOTIVAT       0.88524       1.00000
                        <.0001

Output 10.4. Correlation coefficients and p values obtained when kilograms lost is correlated with the motivation to lose weight.
The bottom half of Output 10.4 provides the correlations that were requested in the VAR statement. There are actually four correlation coefficients in the output because your statement requested that the system compute every possible correlation between the variables KG_LOST and MOTIVAT. This caused the computer to compute the correlation between KG_LOST and MOTIVAT, between MOTIVAT and KG_LOST, between KG_LOST and KG_LOST, and between MOTIVAT and MOTIVAT. You can see that the correlation matrix in Output 10.4 consists of two rows, with one row headed “KG_LOST” and the other headed “MOTIVAT.” It also contains two columns, also with one headed “KG_LOST” and the other headed “MOTIVAT.” The point at which a given row and column intersect is called a cell. In Output 10.4, find the cell that appears at the intersection of the row headed “KG_LOST” and the column that is also headed “KG_LOST.” This cell appears in the upper left corner of
the matrix of correlation coefficients in Output 10.4. This cell provides information about the Pearson correlation between KG_LOST and KG_LOST (i.e., the correlation obtained when KG_LOST is correlated with itself). You can see that the correlation between KG_LOST and KG_LOST is equal to 1.00000, or 1.00. A correlation of 1.00 is a perfect correlation. This result makes sense, because you will always obtain a perfect correlation when you correlate a variable with itself. Similarly, in the lower right corner of the matrix (where the row headed “MOTIVAT” intersects with the column headed “MOTIVAT”), you see that the correlation between MOTIVAT and MOTIVAT is also 1.00.
When a particular cell provides information about one variable correlated with a different variable, that cell will provide at least two pieces of information: (a) the Pearson correlation between the two variables, and (b) the probability value (p value) associated with that correlation. The information is arranged according to the following format:
correlation
p value
In other words, the top number in the cell is the correlation coefficient, and the bottom number is the p value that is associated with that correlation.
For example, find the cell where the column headed KG_LOST intersects with the row headed MOTIVAT. The top number in this cell is .88524, which rounds to .89. This means that the Pearson correlation between KG_LOST and MOTIVAT is .89. The earlier section titled “Interpreting the Size of a Correlation Coefficient” showed you how to interpret the size of a correlation coefficient. There, you learned that correlations that are larger than .80 (in absolute magnitude) are generally considered to represent fairly strong relationships. This means that there is a fairly strong relationship between kilograms lost and motivation (in this fictitious example, at least).
Below this correlation coefficient, you find the following entry: “<.0001”. This is the probability value, or p value, associated with this correlation coefficient. A section earlier in this chapter titled “The p Value” indicated that, if a p value for a correlation coefficient is less than .05, you should reject the null hypothesis, and should conclude that the correlation is significantly different from zero. The present p value of “<.0001” is clearly less than .05. Therefore, you will reject the null hypothesis and conclude that the correlation between kilograms lost and motivation is statistically significant.
Interpreting the sign of the coefficient. An earlier section indicated that a correlation coefficient can be either positive (+) or negative (–). You also learned that, with a positive relationship, large values on one variable tend to be associated with large values on the second variable, and small values on one variable tend to be associated with small values on the second variable. You further learned that, with a negative relationship, large values on one variable tend to be associated with small values on the second variable.
In Output 10.4, the correlation between KG_LOST and MOTIVAT was .88524, which means that there was a positive correlation between these two variables. Although the “positive” symbol (+) does not appear in front of the number “.88524,” it is common
convention to assume that a correlation coefficient is positive unless the negative sign (–) appears in front of it. Because the correlation between KG_LOST and MOTIVAT is positive, you know that large values on MOTIVAT are associated with large values on KG_LOST, and that small values on MOTIVAT are associated with small values on KG_LOST. This is consistent with the relationship that you observed in the scattergram of Output 10.2.
Determine the sample size. The size of the sample that is used in computing a correlation coefficient may appear in one of two places on the output page. First, if all correlations in the analysis are based on the same number of subjects, the sample size appears only once on the page, in the line above the matrix of correlations. This line with the sample size appears just below the descriptive statistics. In Output 10.4 the line takes the following form:
Pearson Correlation Coefficients, N = 22
       Prob > |r| under H0: Rho=0
The “N =” portion of this output indicates the sample size. In this case, you can see that the sample size was 22.
On the other hand, if you compute every possible correlation for a group of variables, and if some correlations are based on sample sizes that are different from the sample sizes used to compute other correlations, then these sample sizes will appear in a different location. Specifically, the sample size for a given correlation will appear in the cell that presents that correlation. In this situation, the information in each cell is presented according to the following format:
correlation
p value
N
This format shows that the sample size (N) appears just below the correlation coefficient and probability value for that coefficient. These issues are discussed in greater detail in the section titled “Using PROC CORR to Compute All Possible Correlations for a Group of Variables.”
Summarizing the Results of the Analysis
The present chapter and the chapters that follow show you how to perform null hypothesis significance tests. These chapters will show you how to prepare brief analysis reports: reports in which you state the null hypothesis being tested, describe the results that you obtained, and summarize your conclusions regarding the null hypothesis. The exact way that you prepare an analysis report will vary depending on the statistic that you used to analyze your data.
As an illustration, the following report summarizes the results of the analysis in which you computed the correlation between the number of kilograms lost and the motivation to lose weight.
A) Statement of the research question: The purpose of this study was to determine whether the motivation to lose weight is correlated with the amount of weight actually lost over a 10-week period.
B) Statement of the research hypothesis: There will be a positive relationship between motivation to lose weight and the amount of weight actually lost over a 10-week period.
C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
• The predictor variable was the motivation to lose weight. This was a multi-value variable and was assessed on an interval scale.
• The criterion variable was the number of kilograms of weight lost during the 10-week study. This was a multi-value variable and was assessed on a ratio scale.
D) Statistical test: Pearson product-moment correlation coefficient.
E) Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between the motivation to lose weight and the number of kilograms lost is equal to zero.
F) Statistical alternative hypothesis (H1): ρ ≠ 0; In the population, the correlation between the motivation to lose weight and the number of kilograms lost is not equal to zero.
G) Obtained statistic: r = .89
H) Obtained probability (p) value: p < .0001
I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.
J) Conclusion regarding the research hypothesis: These findings provide support for the study’s research hypothesis.
K) Coefficient of determination: .79
L) Formal description of the results for a paper: Results were analyzed by computing a Pearson product-moment correlation coefficient. This analysis revealed a significant positive correlation between the motivation to lose weight and the amount of weight actually lost, r = .89, p < .0001. The nature of the correlation coefficient showed that subjects who scored higher on the motivation to lose weight tended to lose more weight than those who scored lower on the motivation to lose
weight. The coefficient of determination showed that motivation accounted for 79% of the variance in the amount of weight actually lost.
Some notes about the preceding report:
• Item H in the preceding report indicated the following:
H) Obtained probability (p) value: p < .0001
This item used the “less-than” sign (<), because the less-than sign actually appeared in Output 10.4, presented previously (that is, the p value that appeared below the correlation coefficient was printed as <.0001). If the less-than sign is not actually printed in the SAS output, you should instead use the equal sign (=) when indicating your obtained p value. For example, suppose that the p value for this correlation coefficient had actually been printed as .0143. In that situation, you would have used the equal sign, as below:
H) Obtained probability (p) value: p = .0143
• Item K reported the coefficient of determination as follows:
K) Coefficient of determination: .79
An earlier section in this chapter indicated that the coefficient of determination is the proportion of variance in one variable that is accounted for by variability in the second variable. It indicated that, to compute the coefficient of determination, you square the obtained correlation coefficient. In the present case, the obtained correlation coefficient was r = .89. The coefficient of determination was therefore computed in this way:
Coefficient of determination = r²
Coefficient of determination = (.89)²
Coefficient of determination = .79
Using PROC CORR to Compute All Possible Correlations for a Group of Variables
Overview
When you conduct a study that involves a number of variables, it is often a good idea to compute correlations for all possible pairings of the variables. This enables you to see the “big picture” regarding the nature of the relationship between the variables. In addition, if you submit your research report for publication, the reviewers for most research journals will expect you to include a table that contains these correlations.
This section shows how to use PROC CORR to compute all possible correlations for a group of variables. It also shows you how to interpret results from the matrix of correlations that will be produced by PROC CORR.

Writing the SAS Program

The syntax. In some situations, you may have measured a number of numeric variables in a study, and want to compute every possible correlation between every possible pair of variables. The syntax for the SAS statements that will do this is:

   PROC CORR   DATA=data-set-name   options ;
      VAR   variable-list ;
      TITLE1  'your-name';
   RUN;

The actual program. In the preceding VAR statement, the "variable-list" should simply be a list of all numeric variables that you wish to analyze. For example, the current weight loss study involves one criterion variable (KG_LOST) and four predictor variables (MOTIVAT, EXERCISE, CALORIES, and IQ). Here are the SAS statements that will cause PROC CORR to compute every possible correlation between these variables:

        [First part of the DATA step appears here]
   28   20  8.20  30  2  1200  110
   29   21  7.80  35  4  1600  130
   30   22  9.00  35  6  1600  120
   31   ;
   32   PROC CORR DATA=D1;
   33      VAR KG_LOST MOTIVAT EXERCISE CALORIES IQ;
   34      TITLE1 'JANE DOE';
   35   RUN;
The only difference between these SAS statements and those presented in the preceding section is that the VAR statement (on line 33) now lists all variables in the study, rather than just KG_LOST and MOTIVAT.

Omitting the VAR statement. It should be noted that, if the VAR statement had been omitted from the program, PROC CORR would still have computed every possible correlation among the numeric variables in the data set. If your data set contains a large number of numeric variables (as is often the case when conducting research with questionnaires), this will result in a long printout of results with a very large matrix of correlation coefficients, which might be undesirable.
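For example, statements such as the following (a sketch, assuming the data set D1 used throughout this chapter) would produce correlations for every numeric variable in the data set, because no VAR statement appears. Note that a numeric subject-number variable such as SUB_NUM would be included in the matrix as well, which is rarely useful:

   PROC CORR DATA=D1;   /* no VAR statement: all numeric variables are analyzed */
      TITLE1 'JANE DOE';
   RUN;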
Results from the SAS Output

Results that were generated by the statements in the preceding section are reproduced here:

                                   JANE DOE
                              The CORR Procedure

      5  Variables:  KG_LOST  MOTIVAT  EXERCISE  CALORIES  IQ

                               Simple Statistics
   Variable    N        Mean     Std Dev         Sum    Minimum    Maximum
   KG_LOST    22     4.98591     2.27488   109.69000    1.00000    9.00000
   MOTIVAT    22    20.00000     8.99735   440.00000    5.00000   35.00000
   EXERCISE   21     3.23810     1.84132    68.00000          0    6.00000
   CALORIES   22        1773   356.14579       39000       1200       2400
   IQ         22   120.00000    17.99471        2640   90.00000  150.00000

                      Pearson Correlation Coefficients
                        Prob > |r| under H0: Rho=0
                          Number of Observations

                KG_LOST   MOTIVAT  EXERCISE  CALORIES        IQ
   KG_LOST      1.00000   0.88524   0.53736  -0.55439   0.02361
                           <.0001    0.0120    0.0074    0.9169
                     22        22        21        22        22
   MOTIVAT      0.88524   1.00000   0.47845  -0.54984   0.10294
                 <.0001              0.0282    0.0080    0.6485
                     22        22        21        22        22
   EXERCISE     0.53736   0.47845   1.00000  -0.22594   0.31201
                 0.0120    0.0282              0.3247    0.1685
                     21        21        21        21        21
   CALORIES    -0.55439  -0.54984  -0.22594   1.00000   0.19319
                 0.0074    0.0080    0.3247              0.3890
                     22        22        21        22        22
   IQ           0.02361   0.10294   0.31201   0.19319   1.00000
                 0.9169    0.6485    0.1685    0.3890
                     22        22        21        22        22
Output 10.5. All possible correlations between the five variables that were assessed in the weight loss study.
Understanding the output. In most respects, Output 10.5 (which presents the correlations between all five variables) is very similar to Output 10.4 (which essentially presented only the correlation between KG_LOST and MOTIVAT). For example, the top section of the output presents simple descriptive statistics. The lower section of the output presents the correlation coefficients. The difference is that Output 10.5 presents a larger matrix of correlation coefficients. This matrix consists of five rows running horizontally from left to right (each headed with the name of one of the variables being analyzed), and five columns running vertically from top to bottom (also headed with the name of one of the variables being analyzed).
Interpreting the information in each cell. Where the row for one variable intersects with the column for another variable, you will find a cell that provides three pieces of information concerning the correlation between those two variables: (a) the Pearson correlation between the variables, (b) the p value associated with that correlation, and (c) the size of the sample (N) on which the correlation is based. Within the cell, this information is presented in the following format:

   correlation
   p value
   N

For example, in Output 10.5 consider the cell appearing where the row "KG_LOST" intersects with the column "MOTIVAT." This cell shows that the correlation between the variables is .88524. The p value for this correlation is <.0001. The size of the sample on which the correlation is based is 22. These figures are identical to those shown in Output 10.4.

Now consider the cell that appears at the intersection of the row "KG_LOST" and the column headed "IQ." The information in this cell shows that the correlation between KG_LOST and IQ is .02361, which rounds to .02. This is a very small correlation coefficient, indicating a very weak relationship between the two variables. Predictably, the p value for this correlation is quite large at .9169. As with the earlier correlation coefficient, the size of the sample on which the correlation is based is equal to 22. Given that the p value (.9169) is much higher than the standard criterion of .05, it is clear that this correlation is not significantly different from zero.

As a final example, consider the cell where the row "KG_LOST" intersects with the column "EXERCISE." Here, you can see that the correlation in this cell is based on just 21 subjects, one less than the sample size for most of the other correlations in this matrix. Why is this sample size smaller? It is because there was one instance of missing data for the EXERCISE variable. If you review Table 10.1 (presented at the beginning of this chapter), you will see that Subject #17 (Sam) had missing data on EXERCISE. This means that any correlation between EXERCISE and another variable will be based on 21 observations, not 22.
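Incidentally, if you wanted every correlation in the matrix to be based on the same subjects, you could add the NOMISS option (described later in this chapter) to the PROC CORR statement. A sketch:

   PROC CORR DATA=D1 NOMISS;   /* drops any subject with missing data on a VAR variable */
      VAR KG_LOST MOTIVAT EXERCISE CALORIES IQ;
      TITLE1 'JANE DOE';
   RUN;

With these statements, Sam would be excluded from the analysis entirely, and every correlation in the matrix would be based on the same 21 subjects.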
Summarizing Results Involving a Nonsignificant Correlation

Overview

This section discusses the case of a correlation coefficient that is not significantly different from zero. This is done so that you will be able to prepare analysis reports summarizing nonsignificant correlations.

The Results from PROC CORR

Output 10.5 presented a matrix of correlations containing a few nonsignificant correlations. For convenience, part of that output is reproduced here as Output 10.6.
                      Pearson Correlation Coefficients
                        Prob > |r| under H0: Rho=0
                          Number of Observations

                KG_LOST   MOTIVAT  EXERCISE  CALORIES        IQ
   KG_LOST      1.00000   0.88524   0.53736  -0.55439   0.02361
                           <.0001    0.0120    0.0074    0.9169
                     22        22        21        22        22
   MOTIVAT      0.88524   1.00000   0.47845  -0.54984   0.10294
                 <.0001              0.0282    0.0080    0.6485
                     22        22        21        22        22
   EXERCISE     0.53736   0.47845   1.00000  -0.22594   0.31201
                 0.0120    0.0282              0.3247    0.1685
                     21        21        21        21        21
   CALORIES    -0.55439  -0.54984  -0.22594   1.00000   0.19319
                 0.0074    0.0080    0.3247              0.3890
                     22        22        21        22        22
   IQ           0.02361   0.10294   0.31201   0.19319   1.00000
                 0.9169    0.6485    0.1685    0.3890
                     22        22        21        22        22
Output 10.6. Correlation coefficient and p value obtained for the correlation between kilograms lost and IQ.
The correlation coefficient. As mentioned in the preceding section, the results of this analysis revealed that the relationship between KG_LOST and IQ was very weak. Information about this correlation appears in the cell where the row “KG_LOST” intersects with the column “IQ.” This cell shows that the Pearson correlation between the number of kilograms lost and subject IQ was only .02361, which rounds to .02. This correlation is quite close to zero, indicating that there is essentially no relationship between KG_LOST and IQ: Knowing where a subject stands on IQ does not predict where he stands on KG_LOST.
The probability value. The second entry in this cell provides the p value associated with this correlation. You will remember that this p value indicates the probability that you would obtain a correlation coefficient this large or larger if the actual correlation between these variables in the population was equal to zero. Your obtained p value in this case is .9169. This means that, if the actual correlation between KG_LOST and IQ in the population was equal to zero, the probability that you would obtain a correlation of .02 in a sample with N = 22 is equal to .9169 (a very high probability).

As stated earlier, this guide recommends that a statistic be considered statistically significant only when its p value is less than .05. Because the p value for this correlation (.9169) is much larger than this criterion of .05, you would conclude that the correlation between KG_LOST and IQ is not statistically significant. In other words, you would conclude that the correlation between KG_LOST and IQ is not significantly different from zero. This only makes sense, because your obtained correlation coefficient was only .02, a value quite close to zero.

The Results from PROC PLOT

Why create a scattergram? An earlier section of this chapter indicated that you should always use PROC PLOT to create a scattergram for two variables before computing the correlation between them. Among other things, this scattergram will help you verify that the relationship between the two variables is linear (a linear relationship is an assumption underlying the Pearson correlation coefficient). At this point, it will be instructive to review the scattergram plotting KG_LOST against IQ. This illustrates what a scattergram might look like when there is virtually no relationship between the two variables. Output 10.7 presents this scattergram.
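The statements that request this plot follow the pattern shown earlier in the chapter; here is a sketch, assuming the data set D1:

   PROC PLOT DATA=D1;
      PLOT KG_LOST*IQ;   /* vertical-axis variable * horizontal-axis variable */
      TITLE1 'JANE DOE';
   RUN;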
                                   JANE DOE
                             Plot of KG_LOST*IQ.
                       Legend: A = 1 obs, B = 2 obs, etc.

   [ASCII scattergram produced by PROC PLOT: KG_LOST on the vertical axis
   (ranging from 1 to 9) plotted against IQ on the horizontal axis (ranging
   from 90 to 150). The points form a roughly circular cloud with no
   discernible slope.]
Output 10.7. Scattergram plotting kilograms lost against subject IQ.
Interpreting the shape of the scattergram. An earlier section said that, when there is a relationship between two variables, their scattergram will usually be elliptical in shape (shaped like a football) and will display some degree of slope. If the correlation is positive, the ellipse will slant from the lower left corner to the upper right corner; if the correlation is negative, the ellipse will slant from the upper left corner to the lower right corner. However, you can see that the scattergram in Output 10.7 does not form an ellipse; instead, it is fairly round in shape. If you were to draw a best-fitting straight line through the scattergram, the line would not have any slope to speak of; it would be essentially horizontal. To illustrate this, Output 10.8 reproduces the same scattergram that appears in Output 10.7, this time with a best-fitting straight line drawn through it.
                                   JANE DOE
                             Plot of KG_LOST*IQ.
                       Legend: A = 1 obs, B = 2 obs, etc.

   [The same ASCII scattergram as in Output 10.7, this time with a nearly
   horizontal best-fitting straight line drawn through the center of the
   point cloud.]
Output 10.8. Scattergram plotting kilograms lost against subject IQ, with straight line drawn through the center of the scattergram.
These characteristics (a round scattergram with a near-horizontal best-fitting line) are typical of a scattergram for two variables that are not correlated with one another.
Summarizing the Results of the Analysis

Suppose that, in this study, you had initially hypothesized that IQ would be correlated with the number of kilograms of weight lost. Here is an example of how you might prepare a report summarizing your analysis and results:

A) Statement of the research question: The purpose of this study was to determine whether subject IQ is correlated with the amount of weight lost over a 10-week period.

B) Statement of the research hypothesis: There will be a positive relationship between subject IQ and the amount of weight lost over a 10-week period.

C) Nature of the variables: This analysis involved two variables. The predictor variable was subject IQ: scores from a standard intelligence test. This was an interval-level variable. The criterion variable was the number of kilograms of weight lost during the 10-week study. The criterion variable was assessed on a ratio scale.

D) Statistical test: Pearson correlation coefficient.
E) Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between IQ and the number of kilograms lost is equal to zero.

F) Statistical alternative hypothesis (H1): ρ ≠ 0; In the population, the correlation between IQ and the number of kilograms lost is not equal to zero.

G) Obtained statistic: r = .02

H) Obtained probability (p) value: p = .9169

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis.

K) Coefficient of determination: .00.
L) Formal description of the results for a paper: Results were analyzed by computing a Pearson correlation coefficient. This analysis revealed a nonsignificant correlation between IQ and the amount of weight lost, r = .02, p = .9169. The coefficient of determination showed that IQ accounted for approximately 0% of the variance in the amount of weight actually lost.
Some notes regarding the preceding report:

•  Item K of the preceding report indicated that the coefficient of determination for this analysis was equal to .00. This is because the correlation coefficient was r = .02. This value squared is equal to .0004, which rounds to .00.

•  Item L of the preceding report provides a formal description of the results for a paper. Notice that this paragraph does not provide a description of the nature of the relationship between the two variables (i.e., it does not report that high scores on IQ tended to be associated with high scores on the amount of weight lost). This is because the correlation coefficient was found to be nonsignificant. When the results are nonsignificant, researchers typically do not describe the nature of the relationship between the variables in their papers.

•  Item L also reports that the "...coefficient of determination showed that IQ accounted for approximately 0% of the variance in the amount of weight actually lost" (italics added). The word "approximately" was used here because, technically, the coefficient of determination was not exactly zero; it was .0004, which became zero only because it was rounded to two decimal places.
Using the VAR and WITH Statements to Suppress the Printing of Some Correlations

Overview

An earlier section showed you how to compute all possible correlations for a group of variables. When you are working with a moderate-to-large number of variables, however, this approach has some disadvantages. Among these is the fact that it can result in a very large number of correlations and many pages of output. In those situations, it is sometimes preferable to print a limited number of correlations. You can include the VAR and WITH statements within the PROC step to achieve this.

Using this approach, SAS will again print a matrix of correlation coefficients. The variables that you list in the VAR statement will appear as the columns in this matrix, and the variables that you list in the WITH statement will appear as the rows. This gives you greater control over your output, and enables you to avoid printing "all possible correlations."

Writing the SAS Program

The syntax. Here is the syntax for using the VAR and WITH statements with PROC CORR:

   PROC CORR   DATA=data-set-name   options ;
      VAR   column-variables ;
      WITH   row-variables ;
   RUN;
The second line in the preceding syntax is the VAR statement. This statement includes the entry "column-variables" to indicate that any variable that you list there will appear as a column (running up and down) in the resulting matrix of correlation coefficients. The third line is the WITH statement. This statement includes the entry "row-variables" to indicate that any variable that you list there will appear as a row (running from left to right) in the resulting matrix of correlation coefficients. You can list the same variable in both the VAR and WITH statements.

Suppose that you want to compute correlations between two sets of variables. One set of variables will be KG_LOST and MOTIVAT, and the second set will be EXERCISE, CALORIES, and IQ. As output, you want to create a matrix of correlation coefficients in which the column variables are KG_LOST and MOTIVAT, and the row variables are EXERCISE, CALORIES, and IQ. Here are the SAS statements that will cause PROC CORR to create this matrix of correlations. Notice that the VAR statement includes KG_LOST and MOTIVAT, and the WITH statement includes EXERCISE, CALORIES, and IQ.

        [First part of the DATA step appears here]
   28   20  8.20  30  2  1200  110
   29   21  7.80  35  4  1600  130
   30   22  9.00  35  6  1600  120
   31   ;
   32   PROC CORR DATA=D1;
   33      VAR KG_LOST MOTIVAT;
   34      WITH EXERCISE CALORIES IQ;
   35      TITLE1 'JANE DOE';
   36   RUN;
Results from the SAS Output

Results generated by the statements in the preceding section are reproduced in Output 10.9:

                                   JANE DOE
                              The CORR Procedure

      3  With Variables:  EXERCISE  CALORIES  IQ
      2  Variables:       KG_LOST   MOTIVAT

                               Simple Statistics
   Variable    N        Mean     Std Dev         Sum    Minimum    Maximum
   EXERCISE   21     3.23810     1.84132    68.00000          0    6.00000
   CALORIES   22        1773   356.14579       39000       1200       2400
   IQ         22   120.00000    17.99471        2640   90.00000  150.00000
   KG_LOST    22     4.98591     2.27488   109.69000    1.00000    9.00000
   MOTIVAT    22    20.00000     8.99735   440.00000    5.00000   35.00000

                      Pearson Correlation Coefficients
                        Prob > |r| under H0: Rho=0
                          Number of Observations

                KG_LOST   MOTIVAT
   EXERCISE     0.53736   0.47845
                 0.0120    0.0282
                     21        21
   CALORIES    -0.55439  -0.54984
                 0.0074    0.0080
                     22        22
   IQ           0.02361   0.10294
                 0.9169    0.6485
                     22        22

Output 10.9. Output produced by using the VAR and WITH statements in the SAS program.
In Output 10.9, notice that the variables KG_LOST and MOTIVAT appear as columns. This is because they were listed in the VAR statement. Notice that the variables EXERCISE, CALORIES, and IQ appear as rows. This is because they were listed in the WITH statement.
Computing the Spearman Rank-Order Correlation Coefficient for Ordinal-Level Variables

Overview

So far, this chapter has focused on computing the Pearson product-moment correlation coefficient, which is appropriate when both of your variables are assessed on an interval scale or ratio scale. In some situations, however, one or both of your variables may be assessed on an ordinal scale. In those situations, it may be more appropriate to compute the Spearman rank-order correlation coefficient. This is easy to do with SAS.

Situations Appropriate for This Statistic

You can use the Spearman rank-order correlation coefficient when either of the following is true:

•  both of your variables are assessed on an ordinal scale

•  one of your variables is assessed on an ordinal scale, and the other variable is assessed on an interval or ratio scale.

In Chapter 2 of this book, you learned that values on an ordinal scale represent the rank order of the subjects with respect to the variable that is being assessed. When a variable is on an ordinal scale, equal differences in scale values do not necessarily have equal quantitative meaning. The best example of an ordinal scale is a variable that is created by rank ordering the subjects according to a particular construct.

Example of When to Compute the Spearman Rank-Order Correlation Coefficient

For example, suppose that you measured two of the variables in your weight-loss study in a particular way. Suppose that, to create the variable KG_LOST, you simply rank-ordered your subjects with respect to how much weight they lost. The subject who lost the most weight was assigned a value (score) of 1, the subject who lost the next most weight was given a value of 2, and so on. You used the same procedure to create a second variable in your study, MOTIVAT: You rank-ordered your subjects with respect to their level of motivation to lose weight. The subject who was most motivated to lose weight was given a value of 1, the subject who was next most motivated was given a value of 2, and so on.

The variables KG_LOST and MOTIVAT are now rank-order variables. When you compute the correlation between them, it is no longer appropriate to compute the Pearson correlation coefficient. It is now appropriate to compute the Spearman rank-order correlation coefficient.
Writing the SAS Program

The syntax for computing a Spearman correlation is identical to the syntax for computing a Pearson correlation, except that you must include the keyword SPEARMAN in the options field of the PROC CORR statement. The syntax is as follows:

   PROC CORR   DATA=data-set-name   SPEARMAN ;
      VAR   variable-list ;
      TITLE1  'your-name';
   RUN;

For example, to compute the Spearman correlation between KG_LOST and MOTIVAT, the SAS statements would appear as follows:

   PROC CORR DATA=D1 SPEARMAN;
      VAR KG_LOST MOTIVAT;
      TITLE1 'JANE DOE';
   RUN;
Understanding the SAS Output

When you compute Spearman correlations, the output is almost identical to the output that is produced for Pearson correlations. This means that the output page will contain a matrix of cells. Within a cell you will find the Spearman correlation at the top, with the p value and the sample size below it. You will interpret these correlation coefficients and p values in the same manner as with the Pearson correlation coefficient. You will know that it is a matrix of Spearman correlations appearing on a page of output because a heading similar to the following will appear above the matrix:

   Spearman Correlation Coefficients, N = 22
          Prob > |r| under H0: Rho=0
Some Options Available with PROC CORR

Overview

Several options for PROC CORR enable you to control the way that data are analyzed, the types of statistics that are computed, and the way that output is presented. This section discusses a few of the more popular options, and lists the keywords that can be used to request these options.
Where in the Program to Request Options

An earlier section showed the following syntax for the CORR procedure:

   PROC CORR   DATA=data-set-name   options ;
      VAR   variable-list ;
      TITLE1  'your-name';
   RUN;
The word "options" appears as part of the PROC CORR statement, meaning that this is the location where keywords for options should appear. For example, the previous section showed that the keyword SPEARMAN requests that PROC CORR compute Spearman correlations rather than Pearson correlations. If you had computed the Spearman correlation between KG_LOST and MOTIVAT, the last part of your SAS program would have looked like the following (notice where the keyword SPEARMAN appears on line 32):

        [First part of the DATA step appears here]
   28   20  8.20  30  2  1200  110
   29   21  7.80  35  4  1600  130
   30   22  9.00  35  6  1600  120
   31   ;
   32   PROC CORR DATA=D1 SPEARMAN;
   33      VAR KG_LOST MOTIVAT;
   34      TITLE1 'JANE DOE';
   35   RUN;
Description of Some Options

Below are the keywords for some of the options that tend to be used most frequently when conducting research in the social and behavioral sciences. Remember that these keywords would appear in the same location that the keyword SPEARMAN appeared on line 32 in the program presented in the preceding section.

ALPHA
   Prints coefficient alpha (a measure of reliability that is often used for summated rating scales). Coefficient alpha will be computed for the variables listed in the VAR statement. You must use the NOMISS option in conjunction with the ALPHA option. The NOMISS option is discussed later.

COV
   Prints covariances between the variables. This is useful when you need to create a variance-covariance table, rather than a table of correlations.

KENDALL
   Prints Kendall's tau-b coefficient, a measure of bivariate association for variables assessed at the ordinal level.
NOMISS
   Excludes from the analysis any observation (subject) with missing data on any of the variables listed in the VAR statement. Using this option ensures that all correlations will be based on exactly the same observations (and, therefore, on the same number of observations).

NOPROB
   Suppresses the printing of p values associated with the correlations.

RANK
   For each variable, reorders the correlations from highest to lowest (in absolute value) and prints them in this order.

SPEARMAN
   Prints Spearman rank-order correlation coefficients, which are appropriate for variables that are measured on an ordinal level.

Where to Find More Options for PROC CORR

PROC CORR is a part of base SAS software, which means that you will find a complete listing of all of the options that are available with PROC CORR in the SAS Procedures Guide (SAS Institute Inc., 1999c). Please note that you will not find PROC CORR covered in the SAS/STAT User's Guide (SAS Institute Inc., 1999d).
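As an illustration of how these keywords are combined, suppose that you had administered a summated rating scale and wanted coefficient alpha for its items, with all statistics based on the same subjects. The following sketch combines the ALPHA and NOMISS options; the item names Q1 through Q5 are hypothetical and do not appear in the weight loss data set:

   PROC CORR DATA=D1 ALPHA NOMISS;   /* NOMISS is required when ALPHA is used */
      VAR Q1 Q2 Q3 Q4 Q5;            /* hypothetical questionnaire items      */
      TITLE1 'JANE DOE';
   RUN;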
Problems with Seeking Significant Results

Overview

One of the positive aspects of using SAS is that it makes it very easy to compute a large number of correlation coefficients for a large number of variables. But this also creates a trap into which many researchers fall. This trap involves computing a large number of coefficients, searching through the results to identify the significant coefficients, and then preparing a research report in which you create the impression that the significant relationships that you obtained were the ones that you predicted from the beginning. This section explains why this approach is not good science.

Reprise: Null Hypothesis Testing with Just Two Variables

The study. Suppose that you are investigating the relationship between just two variables in a sample of 200 adults: You are investigating the correlation between height (in inches) and IQ. Suppose that you don't know the correlation between these two variables, but it is in fact equal to zero in the population. That is, if it were possible to study the entire population of possible subjects, you would find that there is absolutely no relationship between height and IQ.
You state your null hypothesis as follows:

   Statistical null hypothesis (H0): ρ = 0; In the population, the correlation between height and IQ is equal to zero.

You draw a random sample of 200 adults, measure their height in inches, assess their IQ, and compute the correlation between the two variables.

Making a Type I error. Before the study began, you made the decision that you would reject the null hypothesis of no correlation only if the p value that you obtain is less than .05. This decision gives you some protection against making a Type I error. When you make a Type I error, you reject a null hypothesis that was true; that is, you reject a null hypothesis that should not have been rejected. In other words, you conclude that there is a correlation between the two variables in the population when, in fact, there is none.

When you conduct a study such as the one described above, what is the probability that you will commit a Type I error? It is equal to the criterion that you set for your p value. If you decide that you will reject the null hypothesis only if your p value is less than .05, then you have only a 5% chance of making a Type I error. If you decide that you will reject the null hypothesis only if your p value is less than .01, then you have only a 1% chance of making a Type I error.

This criterion is typically referred to as the alpha level that you set for your test (symbolized, for example, as α = .05 or α = .01). The alpha level can be defined as the significance level that a researcher sets for a null hypothesis test; it is the probability that the analysis will result in a Type I error. If you are computing the correlation between just two variables and you set alpha at .05, then you know that you have only a 5% chance of making a Type I error. This enables you to have some confidence in your findings.

Null Hypothesis Testing with a Larger Number of Variables

Overview. The situation is different when you compute a large number of correlations between a large number of variables. This is where researchers often commit Type I errors without knowing it.

The study. Consider the following situation: Suppose that you have obtained data on a relatively large number of variables from a sample of 200 subjects. Imagine that, in the population, there is absolutely no correlation between any of these variables. That is, if you were able to gather data from the entire population, you would find that every possible correlation between these variables is equal to zero. But you don't know this, because you cannot study the entire population. So instead you gather data from your 200 subjects, and compute every possible correlation between the variables. Assume that this results in exactly 100 correlation coefficients. You decide that you will consider a correlation coefficient to be significant if its p value is less than .05.
The results. You review the results produced by SAS, and find that five of your correlation coefficients are significant at the .05 level. Some of these correlations look interesting. For example, you find that there is a significant correlation between height and IQ. You prepare an article that summarizes your findings, and it is eventually published by a research journal. All over the world, researchers are now reading about the relationship between height and IQ.

Making Type I errors. The problem, of course, is that you have made a Type I error. In the population, the correlation between height and IQ is actually equal to zero. Yet you have rejected this null hypothesis, and have told the research community that there is a correlation between these two variables.

How could this have happened? It is because you computed such a large number of correlations. Whenever you (a) set your alpha level at .05 and (b) compute 100 correlations, you should expect about five of the correlations to be statistically significant, even if there is actually no correlation between any of the variables in the population. When you set alpha at .05, it means that you accept a 5% chance of making a Type I error. If you compute just one correlation coefficient, you are fairly safe. But if you compute 100 correlation coefficients, about five of them (that is, about 5% of the 100 coefficients) are going to be significant simply due to sampling error.

Sampling error occurs whenever a sample is not perfectly representative of the population from which it was drawn. Some degree of sampling error occurs just about any time that you work with a sample. Because of sampling error, five pairs of variables in your data set demonstrated correlations that were fairly large: large enough to be statistically significant. But these correlations were not significant because the variables were actually correlated in the population. They were significant because of sampling error.

Because you published an article in which you discussed your "significant" results, you have now misled other researchers into believing that there is a correlation between height and IQ when, in fact, there is not. You have led them down a blind alley that will ultimately prove to be a waste of their time.

How to Avoid This Problem

Unfortunately, a large number of researchers conduct research in the manner described above. But there are several things that you can do to avoid this path:

•  In any study, you should generally compute as few correlations as possible. Any analysis should be driven by theory or by previous research. You should never blindly compute a large number of correlations, and then review the results to see what "turns out" significant.

•  The research reports that you prepare should specify the number of correlations that you computed, including correlations that proved to be nonsignificant. In this way your reviewers can assess the probability that your results are due to sampling error.

•  In some cases in which you compute a relatively large number of correlations, you may want to consider using a more conservative alpha level for each correlation. For example, you may decide that you will consider a correlation to be significant only if it is significant at the .001 level (rather than the .05 level). The logic behind this approach is similar to the logic behind using the Bonferroni t test to control the familywise error rate when performing analysis of variance (ANOVA); a sketch of the arithmetic appears after this list. See Howell (1997, pp. 362–365) for a discussion of the Bonferroni t test.
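For example, a Bonferroni-style adjustment simply divides the familywise alpha level by the number of correlations computed. The following DATA step is a sketch of this arithmetic for the 100-correlation study described earlier:

   DATA _NULL_;
      ALPHA_FW = .05;                  /* familywise alpha level          */
      N_TESTS  = 100;                  /* number of correlations computed */
      ALPHA_PC = ALPHA_FW / N_TESTS;   /* per-correlation criterion       */
      PUT 'Per-correlation alpha = ' ALPHA_PC;
   RUN;

This yields a per-correlation criterion of .0005, which is even more conservative than the .001 level suggested above.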
Conclusion

This chapter has shown you how to use PROC PLOT and PROC CORR to investigate the relationship between two numeric variables. It has shown you how to compute the Pearson correlation coefficient, and how to determine whether that correlation coefficient is significantly different from zero.

A statistical procedure that is similar to bivariate correlation is bivariate linear regression. Linear regression is a procedure for identifying the best-fitting straight line that summarizes the relationship between two numeric variables. Bivariate linear regression enables you to assess where an individual stands on one variable, and then use that score to predict where that individual probably stands on a second variable. The following chapter introduces you to bivariate linear regression. It shows how to use PROC REG to perform regression, how to determine whether the resulting regression coefficient is significantly different from zero, how to draw a regression line through a scattergram, and how to use this regression line to predict where specific subjects will probably stand on the criterion variable.
Chapter 11: Bivariate Regression

Introduction ........................................................... 341
   Overview ............................................................ 341
Choosing between the Terms Predictor Variable, Criterion Variable,
      Independent Variable, and Dependent Variable ..................... 341
   Overview ............................................................ 341
   Nonexperimental Research ............................................ 341
   Experimental Research ............................................... 342
   Choosing the Terms to Use ........................................... 343
Situations Appropriate for Bivariate Linear Regression ................. 344
   Overview ............................................................ 344
   Scale of Measurement Used with the Predictor and Criterion
      Variables ........................................................ 344
   The Type-of-Variable Figure ......................................... 344
   Example of a Study Providing Data Appropriate for This Procedure .... 345
   Summary of Assumptions Underlying Bivariate Linear Regression ....... 346
Example 11.1: Predicting Weight Loss from a Variety of
      Predictor Variables .............................................. 346
   Overview ............................................................ 346
   The Study ........................................................... 347
   The Predictor Variables and Criterion Variable in the Analysis ...... 347
   Data Set to be Analyzed ............................................. 348
   The DATA Step for the SAS Program ................................... 349
Using PROC REG: Example with a Significant Positive
      Regression Coefficient ........................................... 350
   Overview ............................................................ 350
   Verifying That Your Data Are Appropriate for the Analysis ........... 351
   Writing the SAS Program with PROC REG ............................... 351
   Results from the SAS Output ......................................... 353
   Steps in Interpreting the Output .................................... 353
   Drawing a Regression Line through the Scattergram ................... 358
   Reviewing the Table of Predicted Values ............................. 365
   Predicting Y within the Range of X .................................. 367
   Summarizing the Results of the Analysis ............................. 367
   Notes about the Preceding Summary ................................... 370
Using PROC REG: Example with a Significant Negative
      Regression Coefficient ........................................... 371
   Overview ............................................................ 371
   Correlation between Kilograms Lost and Calorie Consumption .......... 371
   Using PROC PLOT to Create a Scattergram ............................. 372
   Using PROC REG to Perform the Regression Analysis ................... 374
   Summarizing the Results of the Analysis ............................. 376
   Notes about the Preceding Summary ................................... 378
Using PROC REG: Example with a Nonsignificant
      Regression Coefficient ........................................... 379
   Overview ............................................................ 379
   Correlation between Kilograms Lost and IQ Scores .................... 379
   Using PROC REG to Perform the Regression Analysis ................... 379
   Summarizing the Results of the Analysis ............................. 380
   Note about the Preceding Summary .................................... 382
Conclusion ............................................................. 383
Introduction

Overview

Linear regression is a procedure for identifying the best-fitting straight line that summarizes the relationship between two variables. It is called bivariate regression here because this chapter focuses on situations in which you are dealing with just two variables: a single predictor variable and a single criterion variable. You may use bivariate regression in the same types of situations in which a Pearson correlation coefficient is appropriate: situations in which you want to investigate the relationship between two numeric variables that are assessed on an interval scale or ratio scale. When variables X and Y are correlated with one another, you can use linear regression for prediction: If you know where a given subject stands on the X variable, you can use regression procedures to compute a best estimate of where that subject probably stands on the Y variable.

This chapter builds on the information already covered in Chapter 10, "Bivariate Correlation." It begins by providing some guidance on the appropriate use of the terms predictor variable, criterion variable, independent variable, and dependent variable. It shows you how to use the SAS System's REG procedure to analyze data from nonexperimental research and to develop the regression equation for a pair of variables. It illustrates how you can use this regression equation to draw a best-fitting straight line through a scattergram created by PROC PLOT. Finally, it shows how PROC REG can be used to create a table of predicted values, along with the residuals of prediction.
Choosing between the Terms Predictor Variable, Criterion Variable, Independent Variable, and Dependent Variable

Overview

In the remaining chapters of this book, you will often see the terms predictor variable, criterion variable, independent variable, and dependent variable. Two of these terms are more appropriately used with one type of research investigation, while the other two terms are more appropriately used with a different type of investigation. This section reviews these two types of research, and offers guidelines for using these terms.

Nonexperimental Research

Most of the studies described in this chapter are examples of nonexperimental research, research in which you are studying the relationship between naturally occurring variables. In
nonexperimental research, you are not manipulating or controlling variables; rather, you are studying the variables as they naturally exist. For example, suppose that you are interested in studying the relationship between taking vitamin E and the symptoms of depression. You might believe that, the more vitamin E a person takes, the fewer symptoms of depression that person will report. To investigate this idea, you administer a survey to a large sample of individuals. The survey assesses two variables: (a) the quantity of vitamin E taken by the subject, and (b) the depression symptoms reported by the subject. You do not manipulate or control the amount of vitamin E that the subjects are taking; you are simply measuring the amount of vitamin E that they normally take on their own. You will analyze the data to determine whether there is a relationship between the two variables. If there is, it may be possible to use scores on the "amount of vitamin E taken" variable to predict scores on the "symptoms of depression" variable.

The study described above involved one criterion variable and one predictor variable. A criterion variable is an outcome variable that can be predicted from one or more predictor variables. The criterion variable is often the main focus of the study because it is the outcome variable that is mentioned in the statement of the research problem. In the study described above, the criterion variable was the "symptoms of depression" variable. It was the outcome variable that was of central interest in the study. When discussing bivariate regression, a criterion variable is sometimes referred to as a "Y" variable, and is often symbolized by the letter "Y."

In nonexperimental research, a predictor variable is a variable that is used to predict values on the criterion. In some studies, you might even believe that the predictor variable has a causal effect on the criterion (although nonexperimental research generally provides only weak evidence of cause and effect). In the study described above, the predictor variable was the "amount of vitamin E taken." If you had found that there was a relationship between the two variables, you could have gone on to use scores on the "amount of vitamin E taken" variable to predict scores on the "symptoms of depression" variable. When discussing bivariate regression, a predictor variable is sometimes referred to as an "X" variable, and is often symbolized by the letter "X."

Experimental Research

In contrast to nonexperimental research, experimental research typically involves a much higher degree of control over the subjects and over the environmental conditions experienced by the subjects. Chapter 2 of this book indicated that most experimental research is identified by three characteristics:
•  subjects are randomly assigned to experimental conditions

•  the researcher manipulates an independent variable

•  subjects in different treatment conditions are treated similarly with regard to all variables except for the independent variable.
An experiment involves at least one independent variable and at least one dependent variable. An independent variable is a variable whose values (or levels) are selected by the experimenter to determine what effect the independent variable has on the dependent variable. In contrast, a dependent variable is some aspect of the subject's behavior that is assessed to determine whether it has been affected by the independent variable.

To illustrate these concepts, it is possible to modify the preceding study so that it becomes a true experiment. Suppose that you begin with a sample of 200 subjects, and you randomly assign 100 of the subjects to an experimental group and the other 100 subjects to a control group. Subjects in the experimental group are each given 60 IU (International Units) of vitamin E each day. Subjects in the control group are each given zero IU of vitamin E each day. Subjects experience these conditions over a six-month period, and at the end of this period each subject completes a questionnaire in which he or she reports the number of symptoms of depression experienced recently.

In this study, the independent variable was the "amount of vitamin E taken." You know this because this was the variable that was manipulated by you, the researcher. In this study, the dependent variable consisted of scores on the "symptoms of depression" questionnaire. You know this because this was the aspect of the subjects' behavior that you measured to determine whether it had been affected by the independent variable (the amount of vitamin E taken).

Choosing the Terms to Use

From the preceding, you can see that the term predictor variable is to some extent a counterpart to the term independent variable: The term "predictor variable" is relevant to nonexperimental research, and the term "independent variable" is relevant to experimental research. In the same way, the term criterion variable is to some extent a counterpart to the term dependent variable. Because students are sometimes confused by the use of these terms, this section will summarize the circumstances under which it is best to use the terms independent and dependent variable, and the circumstances under which it is best to use the terms predictor and criterion variable.
•  In general, it is best to use the terms independent variable and dependent variable only when you are discussing a true experiment in which the researcher has actually manipulated some variable of interest. In this book, however, you will see some exceptions to this rule. For example, this chapter shows only examples of nonexperimental research. However, when the data are analyzed, you will see that the SAS output includes the heading "Dependent Variable" where the criterion variable is listed.

•  In general, it is best to use the terms predictor variable and criterion variable when you are discussing nonexperimental research (i.e., research in which you are simply studying the relationship between naturally occurring variables). However, the terms predictor variable and criterion variable are general-purpose terms, and it is also acceptable to use them when discussing experimental research. In fact, this book uses them with respect to both types of research. For example, the following chapters of this book will show you how to prepare analysis reports for both experimental as well as nonexperimental studies. For simplicity, these research reports will use the headings "Predictor variable" and "Criterion variable," regardless of whether the study being described was a true experiment or a nonexperimental study.
Situations Appropriate for Bivariate Linear Regression

Overview

Linear regression is generally used for purposes of prediction. You use this statistic when you suspect that there might be a significant relationship between two variables, and you want to use subject scores on one variable (the predictor variable) to predict subject scores on the second variable (the criterion variable). The first part of this section describes the types of situations in which linear regression is usually performed, and discusses a few of its statistical assumptions. A more complete summary of assumptions is presented at the end of this section.

Scale of Measurement Used with the Predictor and Criterion Variables

Predictor variable. When performing linear regression and prediction, the predictor variable should be a numeric variable that is assessed on an interval or ratio scale of measurement.

Criterion variable. The criterion variable should also be a numeric variable that is assessed on an interval or ratio scale of measurement.

The Type-of-Variable Figure

When researchers perform bivariate regression, they are typically studying the relationship between (a) a criterion variable that is a multi-value numeric variable and (b) a predictor variable that is also a multi-value numeric variable. The type-of-variable figure below illustrates this situation:

   Criterion        Predictor
     Multi     =      Multi
The “Multi” symbol that appears to the left of the equal sign in the above figure represents the fact that the criterion variable in this analysis is usually a multi-value variable (a variable that assumes more than six values in your sample). The “Multi” symbol that appears to the right of the equal sign shows that the predictor variable in the regression analysis is also typically a multi-value variable.
Example of a Study Providing Data Appropriate for This Procedure

The study. Suppose that you are a college administrator who wants to use scores on a fictitious test called the Learning Aptitude Test (LAT) to predict whether applicants will be successful if admitted to your university. You administer the aptitude test to a group of 400 high school students while they are still in high school. Two years later, you track down the students (who are now in college) and record their college grade point averages (GPAs). At that time, you compute the correlation between the students' LAT scores and their college GPAs. You are pleased to find that there is a strong positive correlation between the two variables. This means that it should be possible to use LAT scores to predict college GPA with some degree of accuracy.

Using the same data set, you then use bivariate regression to further investigate the relationship between the two variables. You use the REG procedure to perform a regression in which college GPA is the criterion variable, and LAT scores are the predictor variable. The output from PROC REG includes the regression coefficient and intercept that you need to create a regression equation (these terms will be explained later in this chapter).

In subsequent years, you use this regression equation to predict college GPA for applicants who want to be admitted to your university. Specifically, you obtain the LAT scores from high school students who have applied for admission. You insert the applicants' LAT scores in the regression equation, and use the equation to predict what the students' GPAs will probably be if they are admitted to your school. You then accept those students who are likely to have GPAs over 2.00 (according to the regression equation).

Why these data would be appropriate for this procedure. This section has already indicated that, in bivariate regression, the predictor variable should be a numeric variable that is assessed on an interval or ratio scale of measurement. The predictor variable in this study consisted of scores on the Learning Aptitude Test (LAT). The LAT is fictitious, but if we assume that it is similar to other carefully developed aptitude tests (such as the Graduate Record Exam), then most researchers will agree that its scores constitute an interval scale of measurement.

To perform regression, the criterion variable should also be on an interval scale or ratio scale of measurement. The criterion variable in the present study was college grade point average. Assuming that GPA in this study was assessed in the usual fashion (e.g., using the system in which "4.00" represents straight "As"), most researchers would probably agree that it is also on an interval scale of measurement.

Remember that, when you perform linear regression, the predictor and the criterion variables are usually multi-value variables. To determine whether this is the case for the current study, you would use the FREQ procedure to create simple frequency tables for the predictor and criterion variables (similar to those shown in Chapter 5, "Creating Frequency Tables"). If you observe more than six values for each of them in their frequency tables, then you know that both variables are multi-value variables.
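To make the prediction step concrete, here is a sketch of how the resulting regression equation might be applied to new applicants in a DATA step. The intercept (0.50) and regression coefficient (0.004) are hypothetical placeholders, not values from any analysis in this guide:

   DATA APPLICANTS;
      INPUT APP_NUM LAT;
      /* hypothetical regression equation: predicted GPA = intercept + (b)(LAT) */
      PRED_GPA = 0.50 + (0.004 * LAT);
   DATALINES;
   01 450
   02 520
   ;
   PROC PRINT DATA=APPLICANTS;
      TITLE1 'JANE DOE';
   RUN;

Under these hypothetical values, an applicant with a LAT score of 450 would have a predicted GPA of 0.50 + (0.004)(450) = 2.30, and so would fall above the 2.00 cutoff described above.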
Summary of Assumptions Underlying Bivariate Linear Regression

•  Interval-level measurement. Both the predictor and criterion variables should be assessed on an interval or ratio level of measurement.

•  Random sampling. Each subject in the sample should contribute one score on the predictor variable and one score on the criterion variable. These pairs of scores should represent a random sample drawn from the population.

•  Linearity. The relationship between the criterion variable and the predictor variable should be linear. This means that, in the population, the mean criterion scores at each value of the predictor variable should fall on a straight line. Linear regression procedures are not appropriate for assessing the nature of the relationship between two variables that are involved in a curvilinear relationship.

•  Equal variances. The variances of the Y scores should be approximately the same at each value of X. This condition is referred to as homoscedasticity.

•  Bivariate normal distribution. The pairs of scores should follow a bivariate normal distribution. This means that (a) scores on the criterion variable should form a normal distribution at each value of the predictor variable, and (b) scores on the predictor variable should form a normal distribution at each value of the criterion variable. When scores represent a bivariate normal distribution, they form an elliptical scattergram when they are plotted (i.e., their scattergram is shaped like a football: relatively fat in the middle and tapered on the ends).

•  Normally distributed residuals. When predicting the Y variable from the X variable, the residuals of prediction (i.e., the errors of prediction) should be normally distributed with a mean of zero. The concept of "residuals of prediction" will be explained later in this chapter.
Example 11.1: Predicting Weight Loss from a Variety of Predictor Variables

Overview

This chapter illustrates regression and prediction by referring to the fictitious study on weight loss that was presented in Chapter 10, "Bivariate Correlation." It is assumed that you have completed Chapter 10 before moving on to the present chapter; for this reason, the highlights of that study are only briefly reviewed here to refresh your memory.
The Study

Overview. Suppose that you conduct a correlational study that is designed to identify variables that are predictive of weight loss in a sample of 22 men over a 10-week period. Throughout the study, you assess four predictor variables that you believe should be correlated with weight loss. At the end of the study, you assess how much weight each man has lost, and investigate the relationship between this criterion variable and your four predictor variables.

Method. At the beginning of the study, you administer a 5-item scale that is designed to assess each subject's motivation to lose weight. The scale consists of statements such as "It is very important to me to lose weight." Subjects respond to each item using a 7-point response format in which 1 = "Disagree Very Strongly" and 7 = "Agree Very Strongly." You sum their responses to the five items to create a single motivation score for each subject. Scores on this measure may range from 5 to 35, with higher scores representing greater motivation to lose weight.

Throughout the 10-week study, you ask each subject to record the number of hours that he exercises each week. At the end of the study, you determine the average number of hours spent exercising for each subject, and correlate this number with subsequent weight loss.

Throughout the study, you also ask each subject to keep a log of the number of calories that he consumes each day. At the end of the study, you compute the average number of calories consumed by each subject each day. You will correlate this measure of daily calorie intake with subsequent weight loss.

At the beginning of the study you also administer the Wechsler Adult Intelligence Scale (WAIS) to each subject. The "combined IQ" score from this instrument will serve as the measure of intelligence in your study. You then correlate IQ with subsequent weight loss.

Throughout the 10-week study, you weigh each subject and record his body weight in kilograms (1 kilogram is equal to approximately 2.2 pounds). When the study is completed, you subtract each subject's body weight at the end of the study from his weight at the beginning of the study. You use the resulting difference as your measure of weight loss.

The Predictor Variables and Criterion Variable in the Analysis

This study involves one criterion variable and four predictor variables. The criterion variable is weight loss measured in kilograms (kgs). In the analysis, you will give this variable the SAS variable name KG_LOST for "kilograms lost."

The first predictor variable is motivation to lose weight, as measured by the questionnaire described above. In the analysis, you will use the SAS variable name MOTIVAT to represent this variable.
The second predictor variable is the average number of hours that the subjects spent exercising each week during the study. In the analysis, you will use the SAS variable name EXERCISE to represent this variable. The third predictor variable is the average number of calories consumed each day. In the analysis, you will use the SAS variable name CALORIES to represent this variable. The final predictor variable is intelligence, as measured by the WAIS test. In the analysis, you will use the SAS variable name IQ to represent this variable.

Data Set to be Analyzed

Table 11.1 presents fictitious scores for each subject on each of the variables to be analyzed in this study.

Table 11.1
Variables Analyzed in the Weight Loss Study
___________________________________________________________________
              Kilograms                Hours      Calories
Subject         lost      Motivation exercising   consumed     IQ
___________________________________________________________________
01. John        2.60           5         0           2400     100
02. George      1.00           5         0           2000     120
03. Fred        1.80          10         2           1600     130
04. Charles     2.65          10         5           2400     140
05. Paul        3.70          10         4           2000     130
06. Jack        2.25          15         4           2000     110
07. Emmett      3.00          15         2           2200     110
08. Don         4.40          15         3           1400     120
09. Edward      5.35          15         2           2000     110
10. Rick        3.25          20         1           1600      90
11. Ron         4.35          20         5           1800     150
12. Dale        5.60          20         3           2200     120
13. Bernard     6.44          20         6           1200      90
14. Walter      4.80          25         1           1600     140
15. Doug        5.75          25         4           1800     130
16. Scott       6.90          25         5           1400     140
17. Sam         7.75          25         .           1400     100
18. Barry       5.90          30         4           1600     100
19. Bob         7.20          30         5           2000     150
20. Randall     8.20          30         2           1200     110
21. Ray         7.80          35         4           1600     130
22. Tom         9.00          35         6           1600     120
___________________________________________________________________
Table 11.1 provides scores for 22 male subjects. The first subject appearing in the table is named John. Table 11.1 shows the following values for John on the study's variables:

• He lost 2.60 kg of weight by the end of the study.
• His score on the motivation scale was 5 (out of a possible 35).
• His score on "Hours Exercising" was 0, meaning that he exercised zero hours per week on the average.
• His score on calories was 2400, meaning that he consumed 2400 calories each day, on the average.
• His IQ was 100 (with the WAIS, the mean IQ is 100 and the standard deviation is approximately 15 in the population).
Scores for the remaining subjects can be interpreted in the same way.

The DATA Step for the SAS Program

Below is the DATA step for the SAS program that will read the data presented in Table 11.1.

 1   OPTIONS  LS=80  PS=60;
 2   DATA D1;
 3   INPUT   SUB_NUM
 4           KG_LOST
 5           MOTIVAT
 6           EXERCISE
 7           CALORIES
 8           IQ;
 9   DATALINES;
10   01  2.60   5  0  2400  100
11   02  1.00   5  0  2000  120
12   03  1.80  10  2  1600  130
13   04  2.65  10  5  2400  140
14   05  3.70  10  4  2000  130
15   06  2.25  15  4  2000  110
16   07  3.00  15  2  2200  110
17   08  4.40  15  3  1400  120
18   09  5.35  15  2  2000  110
19   10  3.25  20  1  1600   90
20   11  4.35  20  5  1800  150
21   12  5.60  20  3  2200  120
22   13  6.44  20  6  1200   90
23   14  4.80  25  1  1600  140
24   15  5.75  25  4  1800  130
25   16  6.90  25  5  1400  140
26   17  7.75  25  .  1400  100
27   18  5.90  30  4  1600  100
28   19  7.20  30  5  2000  150
29   20  8.20  30  2  1200  110
30   21  7.80  35  4  1600  130
31   22  9.00  35  6  1600  120
32   ;
Some notes about the preceding program: Line 1 of the preceding program contains the OPTIONS statement which, in this case, specifies the size of the printed page of output. One entry in the OPTIONS statement is "PS=60", which is an abbreviation for "PAGESIZE=60." This keyword requests that each page of output have up to 60 lines of text on it. Depending on the font that you are using (and other factors), requesting PS=60 may cause the bottom of your scatterplot to be "cut off" when it is printed. If this happens, you should change the OPTIONS statement so that it requests just 50 lines of text per page. You will do this by including PS=50 in your OPTIONS statement, rather than PS=60. Your complete OPTIONS statement should appear as follows:

OPTIONS  LS=80  PS=50;
You can see that lines 3–8 of the preceding program provide the INPUT statement. The SAS variable name SUB_NUM represents “subject number,” KG_LOST represents “kilograms lost,” MOTIVAT represents “motivation to lose weight,” and so on. The data appear in lines 10–31. The data on these lines are identical to the data appearing in Table 11.1, except that the names of the subjects have been removed.
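Before running any analyses, it is worth confirming that SAS read the data as intended. The statements below are not part of the program shown above; they are a minimal sketch of the kind of quick check described in Chapter 7 (PROC MEANS is also recommended again later in this chapter):

PROC MEANS  DATA=D1;
   VAR  KG_LOST  MOTIVAT  EXERCISE  CALORIES  IQ;
   TITLE1  'JANE DOE';
RUN;

If the N, minimum, and maximum values that PROC MEANS reports match Table 11.1 (for example, N = 21 for EXERCISE, because Sam's score on that variable is missing), the data were probably read correctly.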
Using PROC REG: Example with a Significant Positive Regression Coefficient

Overview

When there is a positive relationship between variable X and variable Y, it means that

• low scores on X are associated with low scores on Y
• high scores on X are associated with high scores on Y.
Chapter 10 presented one analysis in which the motivation to lose weight (MOTIVAT) was correlated with kilograms of weight lost (KG_LOST). The analysis revealed a strong positive relationship between the two variables. The strength of that relationship shows that, if you know where a subject stands on MOTIVAT, it should be possible to predict where he stands on KG_LOST with some degree of accuracy. In this section, you will again explore the relationship between MOTIVAT and KG_LOST. This time, however, you will use PROC REG to develop a regression equation that can be used to predict weight loss scores from motivation scores.
Verifying That Your Data Are Appropriate for the Analysis

Before performing a regression analysis, you should always perform some preliminary analyses to verify that your data are in proper form. At a minimum, you should perform PROC MEANS and PROC UNIVARIATE to verify that the data set does not contain any obvious typing errors and that the variables do not demonstrate a marked departure from normality. Chapter 7, "Measures of Central Tendency and Variability," showed how to perform these analyses, and so those procedures will not be repeated here. In addition, you should perform the PLOT procedure to verify that the relationship between your variables is linear and that your data form an approximately bivariate normal distribution. Chapter 10 of this book showed how to do that.

Writing the SAS Program with PROC REG

The syntax. Below is the syntax for the SAS statements that request the REG procedure:

PROC REG  DATA=data-set-name;
   MODEL  criterion-variable = predictor-variable  /  options;
   TITLE1  'your-name';
RUN;

The preceding shows that, in the PROC REG statement, you should use the DATA option to identify the data set that is to be analyzed. You can see that the MODEL statement includes an equal sign (=). To the left of this equal sign, you provide the name of the criterion variable (that is, the outcome variable that you are trying to predict). Many texts refer to this as the "Y" variable. It is also sometimes referred to as the dependent variable (you will see that the SAS output uses the heading "Dependent Variable" when the name of this variable is printed). To the right of the equal sign in the MODEL statement, you provide the name of the predictor variable. Many texts refer to this as the "X" variable. It is also sometimes referred to as the independent variable.

If you are requesting any options for the analysis, add a slash mark (/) followed by the keyword for each option in the MODEL statement. This chapter illustrates two of the many options available with PROC REG:

STB   The keyword STB requests that PROC REG print the standardized regression coefficient for the analysis. A regression coefficient is an estimate of the average amount of change that takes place in Y for every one-unit change in X. A standardized regression coefficient is an estimate of what the regression coefficient would be if both variables were standardized to have a mean of zero and a standard deviation of 1 (regression coefficients will be discussed in greater detail in a later section).
P     The keyword P requests that PROC REG print predicted values (Y´ values) on the criterion variable for each observation. The table produced by this option includes the following:

• An observation number for each observation.
• Each observation's actual score on the criterion variable.
• Each observation's predicted score on the criterion variable (based on the regression equation).
• The residual for each prediction (the difference between each observation's actual score on the criterion and that observation's predicted score).
A later section will discuss the concepts of predicted scores and residuals in greater detail. For a complete discussion of other options available with PROC REG, see the chapter on the REG procedure in the SAS/STAT User's Guide (1999d).

The program. Below are the SAS statements that request that PROC REG be performed, specifying KG_LOST as the criterion variable and MOTIVAT as the predictor variable.

25      .
26      .
27      .
28   20  8.20  30  2  1200  110
29   21  7.80  35  4  1600  130
30   22  9.00  35  6  1600  120
31   ;
32   PROC REG  DATA=D1;
33      MODEL  KG_LOST = MOTIVAT  /  STB  P;
34      TITLE1  'JANE DOE';
35   RUN;
Some notes about the preceding program:

• To conserve space, the preceding shows just the last few data lines from the DATA step on lines 28–30. This DATA step was presented in full in the preceding section titled "The DATA Step for the SAS Program."
• The PROC REG statement appears on line 32. The DATA option for this statement requests that the analysis be performed on the data set named D1.
• The MODEL statement appears on line 33. It requests that KG_LOST serve as the criterion variable (Y variable) in the analysis, and that MOTIVAT serve as the predictor variable (X variable).
• Lines 34–35 present the TITLE1 and RUN statements for the program.
Results from the SAS Output

The preceding program produces two pages of output. Page 1 contains the analysis of variance table and the parameter estimates section. Page 2 contains the table of predicted values. Output 11.1 presents Page 1 from this output.

                                 JANE DOE                                1

                             The REG Procedure
                               Model: MODEL1
                      Dependent Variable: KG_LOST

                           Analysis of Variance

                                  Sum of        Mean
 Source              DF          Squares      Square    F Value    Pr > F
 Model                1         85.16485    85.16485      72.44    <.0001
 Error               20         23.51188     1.17559
 Corrected Total     21        108.67673

          Root MSE           1.08425    R-Square    0.7837
          Dependent Mean     4.98591    Adj R-Sq    0.7728
          Coeff Var         21.74625

                           Parameter Estimates

                Parameter     Standard                        Standardized
 Variable  DF    Estimate        Error   t Value   Pr > |t|       Estimate
 Intercept  1     0.50944      0.57450      0.89     0.3857              0
 MOTIVAT    1     0.22382      0.02630      8.51     <.0001        0.88524

Output 11.1. Results of PROC REG in which kilograms of weight lost (KG_LOST) serves as the criterion variable. Motivation to lose weight (MOTIVAT) serves as the predictor variable.
Steps in Interpreting the Output

Step 1: Make sure that everything looks right. In some instances, you might perform a number of regression analyses in which you use different variables as predictor and criterion variables. In sorting through the output from these analyses, you must be able to quickly identify the criterion variable and the predictor variable on which the current analysis is based. To do this, first check to the right of the heading "Dependent Variable." Here, you will find the name of the criterion variable in your analysis. In Output 11.1, you can see that the criterion variable was KG_LOST. Next, look in the lower left corner of the page, below the heading "Variable." Below this heading, you will find the word "Intercept," and below "Intercept" will be the name of the predictor variable in your analysis. In Output 11.1, the predictor variable is MOTIVAT.

Next, use the degrees of freedom printed in the output to verify that the analysis was performed on the correct number of observations. You will find this information in the upper half of the output, in the section headed "Analysis of Variance." In this part of the output, look below the heading "DF" and to the right of the heading "Corrected Total." At this location, you will find the corrected total degrees of freedom for your analysis. In Output 11.1, the entry at that location is "21." This means that the corrected total degrees of freedom for your analysis is 21.

The corrected total degrees of freedom for this type of analysis is equal to N – 1, where N is equal to the total number of observations (in this case, the number of subjects who contributed valid data for the analysis). In the weight loss study, the total number of observations was 22, so the corrected total degrees of freedom would be

N – 1 = 22 – 1 = 21

This manually computed value matches the corrected total degrees of freedom reported by SAS. Again, this is what you would expect to see: if the corrected total degrees of freedom in Output 11.1 had been, for example, 12, it might have meant that there was an error in either the DATA step or the PROC step. In fact, however, there is no evidence of problems, and so you can proceed.

Step 2: Review the intercept and nonstandardized regression coefficient. One of your main objectives in conducting this analysis is to develop a linear regression equation that can be used to predict KG_LOST from MOTIVAT. The general form for this linear regression equation is as follows:

Y´ = b(X) + a

where:

Y´   represents a given subject's predicted score on the Y variable (the criterion variable).

b    represents the regression coefficient, or the slope of the regression line. This regression coefficient represents the average amount of change in Y that is associated with a one-unit change in X.

X    represents a particular subject's actual score on the X variable (the predictor variable).

a    represents the Y-intercept, also called the intercept constant. This is the value of Y at the location where the regression line crosses the Y axis (assuming that both the X axis and the Y axis begin at zero).
To construct the regression equation for a particular data set, the only items that you need from the preceding output are b (the regression coefficient, or slope), and a (the Y-intercept). Both of these statistics are computed by PROC REG, and appear in the section of output titled “Parameter Estimates.” This was the lower section appearing in Output 11.1. For convenience, this section is reproduced again as Output 11.2.
                           Parameter Estimates

                Parameter     Standard                        Standardized
 Variable  DF    Estimate        Error   t Value   Pr > |t|       Estimate
 Intercept  1     0.50944      0.57450      0.89     0.3857              0
 MOTIVAT    1     0.22382      0.02630      8.51     <.0001        0.88524

Output 11.2. Parameter estimates section of output of PROC REG in which KG_LOST was regressed on MOTIVAT.
In Output 11.2, the first column is headed "Variable." Below this heading are the entries "Intercept" and "MOTIVAT." The third column is headed "Parameter Estimate." Where the row headed "Intercept" intersects with the column headed "Parameter Estimate," you will find the Y-intercept (or intercept constant) for your regression equation. In Output 11.2, this Y-intercept is 0.50944, which rounds to .509. At this point, you can insert your Y-intercept into the regression equation, as shown here:

Y´ = b(X) + a
Y´ = b(X) + .509

In Output 11.2, where the row headed "MOTIVAT" intersects with the column headed "Parameter Estimate," you will find the regression coefficient (or slope) for your regression equation. In Output 11.2, this slope is .22382, which rounds to .224. You can now insert this slope into the regression equation, as shown here:

Y´ = b(X) + .509
Y´ = .224(X) + .509

The slope for this equation indicates that, for every one-unit increase in scores on MOTIVAT, there is an average increase of .224 units on KG_LOST. In other words, for every increase of 1 point on the motivation to lose weight scale, there is an increase of about .22 kilograms of weight actually lost. A later section will show how you can use this regression equation to predict how much weight a subject is likely to lose, given his score on the motivation scale.

The type of regression coefficient that has been discussed in this section is a nonstandardized regression coefficient. A nonstandardized regression coefficient is obtained when the X and Y variables have not been standardized to have equal variances. When you use the MODEL statement options recommended here, PROC REG will also print a standardized regression coefficient. The interpretation of a standardized coefficient will be discussed in a later section.

Step 3: Review the significance test for the nonstandardized regression coefficient. In most cases, you will be interested in interpreting the regression coefficient (slope) of the regression equation only if it is significantly different from zero. Fortunately, PROC REG provides a t statistic to test the null hypothesis that the slope is equal to zero in the population. The t statistic for this test appears in Output 11.2 in the column headed "t Value." Where the column with this heading intersects with the row headed "MOTIVAT," you can see a t statistic of 8.51.

The degrees of freedom for this t test are equal to N – 2, where N is equal to the number of valid observations in the analysis. In the present analysis, N = 22, so

(N – 2) = (22 – 2) = 20

Therefore, the present t test is based on 20 degrees of freedom.

The probability, or p value, for this t test appears in Output 11.2 under the heading "Pr > |t|." Where the column with this heading intersects with the row headed "MOTIVAT," you can see the entry "<.0001." This means that the p value associated with this t statistic is less than .0001. Remember that in this book a statistic is significant if its p value is less than .05. Clearly, the p value for MOTIVAT is statistically significant. This p value indicates that, if the regression coefficient for the relationship between KG_LOST and MOTIVAT were actually equal to zero in the population, there would be less than 1 chance in 10,000 of obtaining a regression coefficient of .224 (or larger) in a sample of this size. Because this probability is so low, you will reject the null hypothesis, and tentatively conclude that the regression coefficient is probably larger than zero in the population.

It is worth mentioning that Output 11.2 also presents a t statistic that tests the null hypothesis that the intercept is equal to zero in the population. This t statistic appears where the row headed "Intercept" intersects with the column headed "t Value." However, it is rare that a researcher is interested in testing the null hypothesis that the intercept is equal to zero in the population, so in most cases you should disregard this section of the output.

Step 4: Review the standardized regression coefficient. Earlier, it was stated that a standardized regression coefficient is an estimate of what the regression coefficient would be if both variables were standardized to have a mean of zero and a standard deviation of 1. Output 11.2 also provides the standardized regression coefficient for the current analysis. For convenience, this section is again reproduced next as Output 11.3.
                           Parameter Estimates

                Parameter     Standard                        Standardized
 Variable  DF    Estimate        Error   t Value   Pr > |t|       Estimate
 Intercept  1     0.50944      0.57450      0.89     0.3857              0
 MOTIVAT    1     0.22382      0.02630      8.51     <.0001        0.88524

Output 11.3. Standardized regression coefficient obtained when KG_LOST was regressed on MOTIVAT.
The standardized regression coefficient appears in Output 11.3 below the heading "Standardized Estimate." Where the row headed "MOTIVAT" intersects with the column headed "Standardized Estimate," you can see that the standardized regression coefficient for this analysis is .88524, which rounds to .89. This number should sound familiar to you because it is equal to the Pearson correlation between MOTIVAT and KG_LOST that was reported in the previous chapter. And this is no coincidence: when a regression equation contains just one criterion variable and just one predictor variable, the standardized regression coefficient will always be equal to the Pearson correlation between the two variables.

Step 5: Review the coefficient of determination. The coefficient of determination (denoted as "r²" or "r-square") refers to the proportion of variance in the criterion variable that is accounted for by variability in the predictor variable. This coefficient may range from .00 to 1.00, with higher values indicating that a higher proportion of variance is accounted for. When the coefficient of determination for two variables is high, it means that there is a strong relationship between the two variables.

When a regression is performed with only two variables (as is the case in this chapter), the coefficient of determination is equal to r², the square of the Pearson correlation between the variables. This means that, after you compute the Pearson correlation between two variables, you can calculate the coefficient of determination by squaring the correlation coefficient. The coefficient of determination also appears in the output of PROC REG, in the section of output headed "Analysis of Variance." For convenience, this section of output is reproduced again as Output 11.4.
                           Analysis of Variance

                                  Sum of        Mean
 Source              DF          Squares      Square    F Value    Pr > F
 Model                1         85.16485    85.16485      72.44    <.0001
 Error               20         23.51188     1.17559
 Corrected Total     21        108.67673

          Root MSE           1.08425    R-Square    0.7837
          Dependent Mean     4.98591    Adj R-Sq    0.7728
          Coeff Var         21.74625

Output 11.4. Coefficient of determination (r²) obtained when KG_LOST was regressed on MOTIVAT.
The coefficient of determination for this analysis appears in Output 11.4 to the right of the heading "R-Square." In this case, you can see that the coefficient of determination is .7837, which rounds to .78. This coefficient indicates that approximately 78% of the variance in KG_LOST is accounted for by variability in MOTIVAT. This is a very large percentage of variance (however, you should remember that the results presented here are fictitious, and it is unlikely that such a large percentage of variability in weight loss would be accounted for by motivation in a real study).

Drawing a Regression Line through the Scattergram

Overview. In Chapter 10, "Bivariate Correlation," you learned how to use the PLOT procedure to create a scattergram that plots a criterion variable against a predictor variable. Once you have the output generated by PROC REG, it is a relatively simple matter to draw a best-fitting regression line through the center of the scattergram that was created by PROC PLOT. For students of elementary statistics, this is a useful exercise for understanding the meaning of the regression equation generated by PROC REG.

Here is a short overview of how it will be done (the remainder of this section provides more detailed, step-by-step instructions): First, you will print out a copy of the scattergram that plots your Y variable (criterion variable) against your X variable (predictor variable). Next, you will select a low value on the X variable, insert this X value into your regression equation, and compute the Y´ (predicted Y) value that is associated with that low value of X. You will place a dot on your scattergram that represents the location of this predicted value. You will then start with a high value on the X variable, insert this X value into your regression equation, and compute the Y´ value that is associated with that high value of X. You will place a dot on your scattergram that represents the location of this predicted value. Finally, you will draw a straight line through the scattergram connecting the two dots that you placed there. This line represents the regression line for your scattergram: it will be a best-fitting line that goes through the center of the scattergram. It will represent the predicted value of Y that is associated with every possible value of X.

The remainder of this section provides step-by-step instructions for drawing this best-fitting line.
Step 1: Printing the output from PROC PLOT. Here are the PROC PLOT statements that will create a scattergram plotting KG_LOST against MOTIVAT:

PROC PLOT  DATA=D1;
   PLOT KG_LOST*MOTIVAT;
   TITLE1 'JANE DOE';
RUN;

When the weight loss data from Table 11.1 are analyzed, these statements produce the scattergram presented in Output 11.5.

[Output 11.5: text scattergram from PROC PLOT. Plot of KG_LOST*MOTIVAT, with KG_LOST on the Y axis (1 to 9) and MOTIVAT on the X axis (5 to 35). Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.5. Scattergram that plots kilograms lost against motivation to lose weight.
Step 2: Computing a predicted value of Y that is associated with a low value of X. Review the X axis (the axis for MOTIVAT, in this case), and find a point on the X axis that represents a relatively low score on the X variable. From Output 11.5, you can see that a relatively low score on MOTIVAT would be a score of 10.

Next, insert this low X score into your regression equation. Here is the regression equation for the relationship between KG_LOST and MOTIVAT, as reported earlier in this chapter:

Y´ = .224(X) + .509

Inserting a score of 10 into this equation gives you the following:

Y´ = .224(10) + .509
Y´ = 2.24 + .509
Y´ = 2.749

So Y´ (the predicted value of Y) is equal to 2.749, which rounds to 2.75. This means that, if a subject's score on X (MOTIVAT) is 10, your regression equation predicts that his score on Y (KG_LOST) would be 2.75.

Step 3: Marking the location on the scattergram that corresponds to this low Y´ value. On your scattergram, find the point on the X axis that corresponds to a score of 10, and imagine an invisible line going straight up from this point. At the same time, find the point on the Y axis that corresponds to a score of 2.75 (Y´), and imagine an invisible line going straight to the right from this point. The point at which your two imaginary lines intersect represents Y´ (the predicted value of Y that is associated with an X score of 10). Place a dot at that point. This step is illustrated in Output 11.6. The vertical dotted line goes up from the low score of 10 on the X axis. The horizontal dotted line goes to the right from the predicted value on Y of 2.75. A dot has been drawn at the point where the two lines meet.
[Output 11.6: text scattergram. Plot of KG_LOST*MOTIVAT, with dotted reference lines rising from X = 10 and extending right from Y´ = 2.75, meeting at a dot. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.6. Scattergram that plots kilograms lost against motivation to lose weight. (This scattergram identifies the Y´ value that is associated with a low value of X.)
Step 4: Computing a predicted value of Y that is associated with a high value of X. Review the X axis (the axis for MOTIVAT, in this case), and find a point on the X axis that represents a relatively high score on the X variable. From Output 11.6, you can see that a relatively high score on MOTIVAT would be a score of 30.
Next, insert this high X score into your regression equation. Inserting a score of 30 into the equation provides the following:

Y´ = .224(X) + .509
Y´ = .224(30) + .509
Y´ = 6.72 + .509
Y´ = 7.229

So Y´ (the predicted value of Y) is equal to 7.229, which rounds to 7.23. This means that, if a subject's score on X (MOTIVAT) was 30, your regression equation predicts that his score would be 7.23 on Y (KG_LOST).

Step 5: Marking the location on the scattergram that corresponds to this high Y´ value. On your scattergram, find the point on the X axis that corresponds to a score of 30, and imagine an invisible line going straight up from this point. At the same time, find the point on the Y axis that corresponds to a score of 7.23 (Y´), and imagine an invisible line going straight to the right from this point. The point at which your two imaginary lines intersect represents Y´ (the predicted value of Y that is associated with an X score of 30). Place a dot at that point. This step is illustrated in Output 11.7. There, the vertical dotted line goes up from the high score of 30 on the X axis. The horizontal dotted line goes to the right from the predicted value on Y of 7.23. A dot has been drawn at the point where the two lines meet.
[Output 11.7: text scattergram. Plot of KG_LOST*MOTIVAT, with dotted reference lines rising from X = 30 and extending right from Y´ = 7.23, meeting at a dot. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.7. Scattergram that plots kilograms lost against motivation to lose weight. (This scattergram identifies the Y´ value that is associated with a high value of X.)
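Incidentally, if you would rather let SAS do the arithmetic in Steps 2 and 4, a short DATA step such as the one below will compute Y´ for any X values you choose. These statements are not part of the original program; the data set name PREDICT and the variable name KG_PRED are arbitrary names used here for illustration, and the equation is simply the fitted equation reported earlier.

DATA PREDICT;
   INPUT MOTIVAT;
   * Apply the fitted regression equation (slope = .224, intercept = .509);
   KG_PRED = (.224 * MOTIVAT) + .509;
DATALINES;
10
30
;
PROC PRINT  DATA=PREDICT;
   TITLE1 'JANE DOE';
RUN;

The printed values of KG_PRED (2.749 for MOTIVAT = 10, and 7.229 for MOTIVAT = 30) should match the Y´ values computed by hand in Steps 2 and 4.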
Step 6: Draw a regression line that connects the two dots. The final step is simple: draw a straight line that connects the two dots that you have just made. Be sure that your line does not extend beyond the range of your X variable. That is, be sure that your line does not extend any lower than your lowest observed score on the X variable, or any higher than your highest observed score on the X variable. In the present case, this means that your regression line should not extend to the left of a score of 5 on the X axis (the MOTIVAT axis), and it should not extend to the right of a score of 35 on the X axis. A section appearing later in this chapter (titled "Predicting Y within the Range of X") will discuss this issue concerning the "range of X" in greater detail.

The line that you have now drawn on your output page is your regression line: it is the best-fitting line that goes through the center of your scattergram. Points on this line represent predicted values of Y that are associated with the values of X that appear directly below them on the X axis. Output 11.8 provides the final version of the scattergram with the regression line drawn through it.
[Output 11.8: text scattergram. Plot of KG_LOST*MOTIVAT, with the regression line drawn through the two dots placed in Steps 3 and 5. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.8. Scattergram that plots kilograms lost against motivation to lose weight (with regression line).
Reviewing the Table of Predicted Values

The preceding section showed you how to draw a best-fitting regression line through the center of your scattergram. You can use this regression line to identify the predicted value of Y that is associated with any value of X. You can also use the regression line to estimate residuals of prediction (some textbooks refer to these as the "errors of prediction"). A residual of prediction is simply the difference between the predicted value of Y for a particular subject and the actual value of Y for that subject.

Fortunately, there is an easier way to do all of this that does not require you to work with a regression line on a scattergram. PROC REG will print predicted values and residuals of prediction when you include the keyword "P" in the MODEL statement. You may remember that, in the SAS program presented earlier in this chapter, the MODEL statement did in fact include the keyword P. The table of predicted values and residuals produced by this option appears here as Output 11.9.

                                 JANE DOE                                2

                             The REG Procedure
                               Model: MODEL1
                      Dependent Variable: KG_LOST

                             Output Statistics

                     Dep Var    Predicted
        Obs          KG_LOST        Value     Residual
          1           2.6000       1.6286       0.9714
          2           1.0000       1.6286      -0.6286
          3           1.8000       2.7477      -0.9477
          4           2.6500       2.7477      -0.0977
          5           3.7000       2.7477       0.9523
          6           2.2500       3.8668      -1.6168
          7           3.0000       3.8668      -0.8668
          8           4.4000       3.8668       0.5332
          9           5.3500       3.8668       1.4832
         10           3.2500       4.9859      -1.7359
         11           4.3500       4.9859      -0.6359
         12           5.6000       4.9859       0.6141
         13           6.4400       4.9859       1.4541
         14           4.8000       6.1050      -1.3050
         15           5.7500       6.1050      -0.3550
         16           6.9000       6.1050       0.7950
         17           7.7500       6.1050       1.6450
         18           5.9000       7.2241      -1.3241
         19           7.2000       7.2241      -0.0241
         20           8.2000       7.2241       0.9759
         21           7.8000       8.3433      -0.5433
         22           9.0000       8.3433       0.6567

        Sum of Residuals                           0
        Sum of Squared Residuals            23.51188
        Predicted Residual SS (PRESS)       27.64729

Output 11.9. Table of predicted values and residuals for the PROC REG analysis in which KG_LOST was regressed on MOTIVAT.
The table in Output 11.9 consists of four columns.

The observation column. The first column is headed "Obs" and provides observation numbers for each of your subjects. Reading down this column, you can see that there are 22 observations; that is, there were 22 subjects in the data set that you analyzed (from Table 11.1). If you compare Output 11.9 to Table 11.1 (presented at the beginning of this chapter), you can see that observation #1 is John, observation #2 is George, observation #3 is Fred, and so on.

The criterion variable column. The second column in Output 11.9 is headed "Dep Var KG_LOST," which stands for "Dependent Variable: KG_LOST." This column presents each subject's actual score on KG_LOST, the criterion variable in the study. If you compare this column to the column headed "Kilograms Lost" from Table 11.1, you will find that they are identical.

The predicted value column. The third column in Output 11.9 is headed "Predicted Value." This column presents each subject's Y´ score: his predicted score on the Y variable (KG_LOST), given his score on the X variable (MOTIVAT). These predicted scores are based on the regression equation presented earlier in this chapter:

Y´ = .224(X) + .509

This column is useful because it eliminates your having to insert X values into the regression equation and compute Y´ values by hand. For example, assume that you want to know what predicted value of Y is associated with an X value of 10. If you review Table 11.1 from the beginning of this chapter, you can see that the third subject in the table (Fred) has a score of 10 on the X variable (that is, a score of 10 on MOTIVAT). In Output 11.9, Fred appears as observation #3. Output 11.9 shows that the predicted Y value for observation #3 is 2.7477, which rounds to 2.75. In other words, PROC REG predicts that subjects with an X score of 10 are likely to have a Y score of 2.75. This number (2.75) is the same value that was computed previously by inserting an X score of 10 into the regression equation. In short, the results presented in Output 11.9 can save you from having to do a large number of manual calculations when you need to compute Y´ values.

The residual column. Finally, the fourth column in Output 11.9 is headed "Residual." This column presents the residuals of prediction (also called the "errors of prediction"). A residual is computed by subtracting a subject's predicted Y score (Y´), as generated by the regression equation, from that subject's actual Y score. For example, consider observation #1 from Output 11.9. For observation #1, the actual score on the Y variable was 2.6000, and the predicted score on Y was 1.6286. Subtracting the latter from the former produces the following:

2.6000 – 1.6286 = 0.9714

So the residual of prediction is equal to .9714. You can see that this is exactly the value that appears in the "Residual" column for observation #1 in Output 11.9. Performing this subtraction for each of the remaining observations in Output 11.9 shows that their residual scores were computed in the same way.

Predicting Y within the Range of X

When you perform a regression analysis, it is important to always predict Y only within the range of X. For example, suppose that you are using this regression formula (presented earlier) to compute predicted scores on Y:

Y´ = .224(X) + .509

You have already learned that, to do this, you insert a value of X into the formula and solve for Y´. However, in doing this, it is important that

• you do not insert a value of X that is lower than the lowest observed value of X in your sample, and
• you do not insert a value of X that is higher than the highest observed value of X in your sample.
For example, the preceding regression equation was based on an analysis in which the X variable was MOTIVAT (motivation), and the Y variable was KG_LOST (kilograms of weight lost). Table 11.1 provided the data set for this analysis. That table shows that the lowest observed score on the X variable (motivation) was 5, and the highest observed score was 35. This means that, when you are predicting scores on Y, you should not use an X score lower than 5 or higher than 35.

What is the reason for this? Remember that the preceding regression equation was based on a specific sample of data. In this sample, the lowest X score was 5, and the highest was 35. You cannot be sure that you would have obtained the same regression equation (i.e., the same intercept and regression coefficient) if you had analyzed a sample with a greater range on the X variable (i.e., a sample in which the lowest X score was below 5 and the highest was above 35). That is why you should only predict Y within the range of X that was observed in your sample.

Summarizing the Results of the Analysis

Overview. When you perform bivariate regression, it is possible to compute a regression coefficient (b) that represents the nature of the relationship between the predictor variable and the criterion variable. You have also learned that PROC REG produces a t statistic for this regression coefficient. This t statistic tests the null hypothesis that your sample was drawn from a population in which this regression coefficient is equal to zero.

This section shows you how to prepare an analysis report that is appropriate for this null hypothesis test. You will see that this analysis report is somewhat similar to the one used for the Pearson correlation coefficient (presented in the last chapter). The following report summarizes the results of the analysis in which the predictor variable was the motivation to lose weight, and the criterion variable was the number of kilograms lost.

A) Statement of the research question: The purpose of this study was to determine whether the regression coefficient representing the relationship between the motivation to lose weight and the amount of weight actually lost over a 10-week period is significantly different from zero.

B) Statement of the research hypothesis: There will be a positive relationship between the motivation to lose weight and the amount of weight that is actually lost over a 10-week period.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
   • The predictor variable was the motivation to lose weight. This was a multi-value variable and was assessed on an interval scale.
   • The criterion variable was the number of kilograms of weight lost during the 10-week study. This was a multi-value variable and was assessed on a ratio scale.

D) Statistical procedure: Linear bivariate regression.

E) Statistical null hypothesis (H0): b = 0; In the population, the regression coefficient representing the relationship between the motivation to lose weight and the number of kilograms of weight lost is equal to zero.

F) Statistical alternative hypothesis (H1): b ≠ 0; In the population, the regression coefficient representing the relationship between the motivation to lose weight and the number of kilograms of weight lost is not equal to zero.

G) Obtained statistic: b = .224, t(20) = 8.51

H) Obtained probability (p) value: p < .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis.

K) Coefficient of determination: .78.

L) Formal description of the results for a paper: Results were analyzed by using linear regression to regress kilograms of weight lost on the motivation to lose weight. This analysis revealed a significant regression coefficient, b = .224, t(20) = 8.51, p < .0001. The nature of the regression coefficient showed that, on the average, an increase of .224 kilograms of weight loss was associated with every 1-unit increase in the motivation to lose weight. The analysis showed that motivation accounted for 78% of the variance in weight loss.

M) Figure representing the results: Output 11.10 is a scattergram showing the relationship between the motivation to lose weight and kilograms of weight actually lost.
[Output 11.10: text scattergram. Plot of KG_LOST*MOTIVAT, with the hand-drawn regression line through the center of the scattergram. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.10. Scattergram representing the relationship between the motivation to lose weight and kilograms of weight actually lost.
Notes about the Preceding Summary

Degrees of freedom for the t test. Item G in the preceding summary provides the nonstandardized regression coefficient for the analysis, along with the t statistic and degrees of freedom associated with that regression coefficient. This section of the summary is reproduced again here:

G) Obtained statistic: b = .224, t(20) = 8.51

The preceding shows that the t statistic for the analysis was 8.51. The (20) that appears next to the t statistic represents the degrees of freedom associated with that test. An earlier section of this chapter titled "Steps in Interpreting the Output" discussed where on the output you can find the nonstandardized regression coefficient and the t statistic associated with that coefficient. However, the output does not provide the degrees of freedom for this test; the degrees of freedom must be computed manually. As explained earlier, the formula for computing the degrees of freedom for this t test is

N – 2

where N is equal to the number of pairs of scores. When individual human subjects are your unit of observation (as is the present case), N is equal to the number of subjects included in your analysis. The present analysis was based on 22 subjects, so the degrees of freedom are calculated as

(N – 2) = (22 – 2) = 20

Presenting the p value. Item H in the preceding report presents the obtained probability value associated with the t statistic:

H) Obtained probability (p) value: p < .0001

You can see that this item used the "less-than" sign (<) to indicate that the obtained p value was less than .0001. This item uses the less-than sign because the less-than sign actually appeared in the SAS output (see Output 11.2, presented previously). There, the p value was presented as "<.0001". If the less-than sign had not appeared with the p value in the PROC REG output, then you would have instead used the "equal" sign (=) when reporting your p value in the analysis report. For example, assume that, in the PROC REG output, the p value was reported as ".0367" (without the "<" sign). If this had been the case, you would have reported the p value in your analysis report as follows:

H) Obtained probability (p) value: p = .0367
Regression line in the scattergram. Output 11.10 shows the scattergram that was created when PROC PLOT was used to plot KG_LOST against MOTIVAT. The regression line drawn through the center of the scattergram was created by following the steps described in the preceding section titled “Drawing a Regression Line through the Scattergram.”
Using PROC REG: Example with a Significant Negative Regression Coefficient

Overview

When there is a negative relationship between variable X and variable Y, it means that

• high scores on X are associated with low scores on Y
• low scores on X are associated with high scores on Y.
Among the variables in the weight loss study, it seems likely that the average number of calories consumed each day should demonstrate a negative correlation with weight loss. It only makes sense that (a) people with high scores for calorie consumption are going to have low scores on weight loss (lose little weight), and (b) people with low scores for calorie consumption are going to have high scores on weight loss (lose more weight). To illustrate the nature of a negative correlation, the following sections show how to use the results of PROC CORR, PROC PLOT, and PROC REG to explore the relationship between calorie consumption and kilograms lost.

Correlation between Kilograms Lost and Calorie Consumption

Chapter 10, "Bivariate Correlation," showed you how to use PROC CORR to compute the Pearson correlation between a number of variables. Output 11.11 reproduces a correlation matrix that was first presented in Chapter 10. This matrix includes every possible correlation between the five variables measured in the weight loss study (remember that these results are fictitious):

                 Pearson Correlation Coefficients
                   Prob > |r| under H0: Rho=0
                     Number of Observations

            KG_LOST     MOTIVAT    EXERCISE    CALORIES          IQ
KG_LOST     1.00000     0.88524     0.53736    -0.55439     0.02361
                         <.0001      0.0120      0.0074      0.9169
                 22          22          21          22          22
MOTIVAT     0.88524     1.00000     0.47845    -0.54984     0.10294
             <.0001                  0.0282      0.0080      0.6485
                 22          22          21          22          22
EXERCISE    0.53736     0.47845     1.00000    -0.22594     0.31201
             0.0120      0.0282                  0.3247      0.1685
                 21          21          21          21          21
CALORIES   -0.55439    -0.54984    -0.22594     1.00000     0.19319
             0.0074      0.0080      0.3247                  0.3890
                 22          22          21          22          22
IQ          0.02361     0.10294     0.31201     0.19319     1.00000
             0.9169      0.6485      0.1685      0.3890
                 22          22          21          22          22

Output 11.11. All possible correlations between the five variables assessed in the weight loss study.
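For reference, statements along the following lines would generate a matrix like Output 11.11. This is a sketch based on the PROC CORR syntax introduced in Chapter 10; the exact statements used there are not reproduced in this chapter:

PROC CORR  DATA=D1;
   VAR  KG_LOST  MOTIVAT  EXERCISE  CALORIES  IQ;
   TITLE1  'JANE DOE';
RUN;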
In Output 11.11, find the cell that appears at the point where the row headed "KG_LOST" intersects with the column headed "CALORIES." There, you can see that the correlation between weight loss and calorie consumption is equal to –.55439, which rounds to –.55. The sign of this coefficient indicates that the relationship between the two variables is negative. The p value of .0074 is less than the standard criterion of .05, which means that the coefficient is significantly different from zero. Because you know that the relationship is negative, you can use PROC PLOT to create a scattergram that will reveal the general shape of the bivariate distribution.

Using PROC PLOT to Create a Scattergram

Following are the SAS statements that will cause PROC PLOT to create a scattergram for the variables KG_LOST and CALORIES:

PROC PLOT  DATA=D1;
   PLOT KG_LOST*CALORIES;
   TITLE1 'JANE DOE';
RUN;
Output 11.12 presents the results generated by the preceding statements.

[Output 11.12: text scattergram. Plot of KG_LOST*CALORIES, with KG_LOST on the Y axis (1 to 9) and CALORIES on the X axis (1200 to 2400). Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.12. Scattergram plotting kilograms lost against calorie consumption.
The scattergram in Output 11.12 demonstrates two notable features. First, you can see that there is some tendency for subjects with low scores on CALORIES to have relatively high scores on KG_LOST, and for subjects with high scores on CALORIES to have relatively low scores on KG_LOST. This, of course, is the defining characteristic of a negative relationship.

Second, you can see that, although there is something of a negative trend in the data, it is not an extremely strong trend. This can be seen in the general shape of the scattergram: it forms an ellipse, but not a particularly narrow ellipse. This reflects the fact that the correlation between CALORIES and KG_LOST is not an extremely strong relationship; the correlation is only –.55. For purposes of contrast, turn back to Output 11.10, which presented a scattergram plotting KG_LOST against MOTIVAT. The ellipse in that figure was much narrower, with scores clustering around the (imaginary) regression line much more tightly. This was because the correlation between KG_LOST and MOTIVAT was stronger at .89.

Using PROC REG to Perform the Regression Analysis

Program and output. Following are the statements that will cause PROC REG to perform a regression analysis with KG_LOST as the criterion variable and CALORIES as the predictor variable:

PROC REG  DATA=D1;
   MODEL  KG_LOST = CALORIES  /  STB  P;
RUN;
The preceding statements produced two pages of output. The first page includes the analysis of variance and parameter estimates tables. These results are presented in Output 11.13.
                                 JANE DOE                                6

                             The REG Procedure
                               Model: MODEL1
                      Dependent Variable: KG_LOST

                           Analysis of Variance

                                  Sum of        Mean
 Source              DF          Squares      Square    F Value    Pr > F
 Model                1         33.40216    33.40216       8.87    0.0074
 Error               20         75.27458     3.76373
 Corrected Total     21        108.67673

          Root MSE           1.94003    R-Square    0.3074
          Dependent Mean     4.98591    Adj R-Sq    0.2727
          Coeff Var         38.91032

                           Parameter Estimates

                Parameter     Standard                        Standardized
 Variable  DF    Estimate        Error   t Value   Pr > |t|       Estimate
 Intercept  1    11.26348      2.14745      5.25     <.0001              0
 CALORIES   1    -0.00354      0.00119     -2.98     0.0074       -0.55439

Output 11.13. Results of PROC REG with weight lost as the criterion variable, and calorie consumption as the predictor variable (Analysis of Variance and Parameter Estimates tables).
Variable names. When you review the output of PROC REG, you should first verify that you are looking at the output for the correct criterion variable and predictor variable. The name of the criterion variable appears toward the top of the page, to the right of the heading "Dependent Variable." In Output 11.13, the criterion variable is KG_LOST. The name of the predictor variable appears in the "Parameter Estimates" section, below the heading "Variable." For the current analysis, you can see that the predictor variable is CALORIES.

Slope and intercept. To construct the regression equation for this analysis, you will need the nonstandardized regression coefficient (slope) and the intercept. These statistics appear in the "Parameter Estimates" section below the heading "Parameter Estimate." Output 11.13 shows that, for the current analysis, the regression coefficient is –.00354, which rounds to –.0035, and that the intercept is 11.26348, which rounds to 11.26. The regression equation for this analysis therefore takes the following form:

Y´ = b(X) + a
Y´ = –.0035(X) + 11.26

The size and sign of the slope in this regression equation show that, on the average, a decrease of .0035 kilograms of weight loss was associated with every 1-unit increase in calorie consumption. The regression coefficient was negative, and not positive, because people who consumed more calories tended to lose less weight.

Significance test. As mentioned previously, PROC REG provides a t test for the null hypothesis that your sample was drawn from a population in which the slope is equal to zero. This appears in Output 11.13 below the heading "t Value." You can see that the t statistic in this case is equal to –2.98. The probability value for this t test is .0074. Because the probability value is less than the standard criterion of .05, you can conclude that the slope is in fact significantly different from zero.

Standardized regression coefficient. Below the heading "Standardized Estimate" you can see that the standardized regression coefficient for the analysis is –.55439, which rounds to –.55. You might remember that the Pearson correlation between KG_LOST and CALORIES was also –.55. This is no coincidence: the standardized regression coefficient in bivariate regression will always be equal to the Pearson correlation between the two variables being analyzed.

Coefficient of determination. Finally, to the right of the heading "R-Square," you can see that the coefficient of determination for the analysis is .3074, which rounds to .31. This value is the square of the Pearson correlation between the two variables. It shows that approximately 31% of the variance in kilograms lost was accounted for by calorie consumption.
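As a quick check on the meaning of this equation (a worked example added here for illustration; it does not appear in the original analysis), consider a subject who consumes 2,000 calories per day, a value that lies within the observed range of CALORIES (1200 to 2400). Inserting this value into the rounded equation gives:

Y´ = –.0035(2000) + 11.26
Y´ = –7.00 + 11.26
Y´ = 4.26

The equation predicts that such a subject would lose about 4.26 kilograms over the 10-week study.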
Summarizing the Results of the Analysis

The following report summarizes the results of the analysis in which the number of kilograms lost was correlated with calorie consumption.

A) Statement of the research question: The purpose of this study was to determine whether the regression coefficient, representing the relationship between calorie consumption and the amount of weight lost over a 10-week period, is significantly different from zero.

B) Statement of the research hypothesis: There will be a negative relationship between calorie consumption and the amount of weight that is actually lost over a 10-week period.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
   • The predictor variable was calorie consumption. This was a multi-value variable and was assessed on a ratio scale.
   • The criterion variable was the number of kilograms of weight lost during the 10-week study. This was a multi-value variable and was assessed on a ratio scale.

D) Statistical procedure: Linear bivariate regression.

E) Statistical null hypothesis (H0): b = 0; In the population, the regression coefficient, representing the relationship between calorie consumption and the number of kilograms of weight lost, is equal to zero.

F) Statistical alternative hypothesis (H1): b ≠ 0; In the population, the regression coefficient, representing the relationship between calorie consumption and the number of kilograms of weight lost, is not equal to zero.

G) Obtained statistic: b = –.0035, t(20) = –2.98

H) Obtained probability (p) value: p = .0074

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis.

K) Coefficient of determination: .31.

L) Formal description of the results for a paper: Results were analyzed by using linear regression to regress kilograms of weight lost on the average number of calories consumed each day. This analysis revealed a significant
regression coefficient, b = –.0035, t(20) = –2.98, p = .0074. The nature of the regression coefficient showed that, on the average, a decrease of .0035 kilograms of weight loss was associated with every 1-unit increase in calories consumed per day. The analysis showed that calorie consumption accounted for 31% of the variance in weight loss.

M) Figure representing the results: Output 11.14 presents a scattergram showing the relationship between calorie consumption and kilograms of weight lost.

[Output 11.14: text scattergram. Plot of KG_LOST*CALORIES, with the hand-drawn regression line through the center of the scattergram. Legend: A = 1 obs, B = 2 obs, etc.]

Output 11.14. Scattergram plotting kilograms lost against calorie consumption.
Notes about the Preceding Summary

Degrees of freedom for the t test. Item G in the preceding summary provides the nonstandardized regression coefficient for the analysis, along with the t statistic and degrees of freedom that are associated with that regression coefficient. The "20" that appears in parentheses is the number of degrees of freedom associated with that test. Again, the formula for computing the degrees of freedom for this t test is

N – 2

The present analysis was based on 22 subjects, so the degrees of freedom are calculated as (N – 2) = (22 – 2) = 20.

Presenting the p value. Item H in the preceding report presents the obtained probability value associated with the t statistic:

H) Obtained probability (p) value: p = .0074
Notice that this item uses the "equal" sign (=) to show that the p value is equal to .0074. It uses the equal sign because the "less-than" sign (<) did not appear with the p value in the PROC REG output.

Regression line in the scattergram. Output 11.14 shows the scattergram created when PROC PLOT was used to plot KG_LOST against CALORIES. The regression line drawn through the center of the scattergram was created by following the steps described in the preceding section "Drawing a Regression Line through the Scattergram."

Formal description of the results. The last sentence in item L reads in part, "...on the average, a decrease of .0035 kilograms of weight loss was associated with every 1-unit increase in calories consumed per day" (italics added). This sentence uses the word "decrease" rather than "increase" because the regression coefficient (the slope, –.0035) was a negative value rather than a positive value.
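For reference, a minimal sketch of the kind of PROC PLOT step that could produce a scattergram like the one in Output 11.14 appears below. The data set name D1 is a hypothetical name used only for illustration; substitute the name of your own data set.

PROC PLOT   DATA=D1;          /* D1 is a hypothetical data set name   */
   PLOT KG_LOST*CALORIES;     /* vertical axis * horizontal axis      */
   TITLE1 'JANE DOE';
RUN;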
Using PROC REG: Example with a Nonsignificant Regression Coefficient

Overview

In this section you will learn to recognize and summarize results in which a regression analysis produces a nonsignificant regression coefficient.

Correlation between Kilograms Lost and IQ Scores

Output 11.15 presents the Pearson correlations between KG_LOST and the four predictor variables in the weight loss study (this is detail taken from Output 11.11).

              KG_LOST
KG_LOST       1.00000
                   22

MOTIVAT       0.88524
               <.0001
                   22

EXERCISE      0.53736
               0.0120
                   21

CALORIES     -0.55439
               0.0074
                   22

IQ            0.02361
               0.9169
                   22

Output 11.15. Pearson correlations for the relationship between kilograms lost and the four predictor variables of the weight loss study.
In Output 11.15, find the cell where the row headed "IQ" intersects with the column headed "KG_LOST." This cell provides information about the correlation between kilograms of weight lost and the scores on a standard IQ test. The top figure in the cell is the Pearson correlation between KG_LOST and IQ. You can see that this correlation is .02361, which rounds to .02. This near-zero correlation suggests that there is virtually no relationship between the two variables. Below the correlation, the probability value for the correlation is .9169. Because this p value is greater than .05, you will conclude that this correlation coefficient is not significantly different from zero. This means that the regression coefficient from the regression analysis (to be discussed below) will also fail to be significantly different from zero.

Using PROC REG to Perform the Regression Analysis

Output 11.16 presents the first page of output that is generated when PROC REG is used to regress KG_LOST on IQ. You should review this table and interpret it in the usual way. In particular, note the nonsignificant t statistic that tests the null hypothesis that the regression coefficient (slope) is equal to zero in the population.
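Statements of the general form sketched below could be used to request such an analysis. The data set name D1 is an assumption used only for illustration; the STB option requests the standardized estimates that appear in the rightmost column of the output.

PROC REG   DATA=D1;           /* D1 is a hypothetical data set name   */
   MODEL KG_LOST = IQ / STB;  /* regress KG_LOST on IQ; STB requests  */
                              /* standardized estimates               */
   TITLE1 'JANE DOE';
RUN;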
JANE DOE                                                               9

                          The REG Procedure
                            Model: MODEL1
                    Dependent Variable: KG_LOST

                        Analysis of Variance
                               Sum of         Mean
Source              DF        Squares       Square    F Value    Pr > F
Model                1        0.06060      0.06060       0.01    0.9169
Error               20      108.61613      5.43081
Corrected Total     21      108.67673

Root MSE            2.33041     R-Square     0.0006
Dependent Mean      4.98591     Adj R-Sq    -0.0494
Coeff Var          46.73990

                        Parameter Estimates
                Parameter     Standard                        Standardized
Variable   DF    Estimate        Error    t Value   Pr > |t|      Estimate
Intercept   1     4.62767      3.42745       1.35     0.1920             0
IQ          1     0.00299      0.02826       0.11     0.9169       0.02361

Output 11.16. Results of PROC REG with kilograms of weight lost as the criterion variable and IQ as the predictor variable (Analysis of Variance and Parameter Estimate tables).
In Output 11.16, you can see that the t statistic for the regression coefficient is only 0.11. The p value associated with this statistic is quite large at .9169. Because the p value is larger than the standard criterion of .05, you can conclude that the regression coefficient is not significantly different from zero.

Summarizing the Results of the Analysis

The following report summarizes the results of the analysis in which the number of kilograms lost was regressed on IQ scores.

A) Statement of the research question: The purpose of this study was to determine whether the regression coefficient representing the relationship between IQ and the amount of weight lost over a 10-week period is significantly different from zero.

B) Statement of the research hypothesis: There will be a positive relationship between IQ and the amount of weight that is actually lost over a 10-week period.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
• The predictor variable was IQ. This was a multi-value variable and was assessed on an interval scale.
• The criterion variable was the number of kilograms of weight lost during the 10-week study. The criterion variable was also a multi-value variable and was assessed on a ratio scale.

D) Statistical procedure: Linear bivariate regression.

E) Statistical null hypothesis (H0): b = 0; In the population, the regression coefficient representing the relationship between IQ and the number of kilograms of weight lost is equal to zero.

F) Statistical alternative hypothesis (H1): b ≠ 0; In the population, the regression coefficient representing the relationship between IQ and the number of kilograms of weight lost is not equal to zero.

G) Obtained statistic: b = .0030, t(20) = .11

H) Obtained probability (p) value: p = .9169

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis.

K) Coefficient of determination: .00.

L) Formal description of the results for a paper: Results were analyzed by using linear regression to regress kilograms of weight lost on IQ. This analysis revealed a nonsignificant regression coefficient, b = .0030, t(20) = .11, p = .9169. The analysis showed that IQ accounted for less than 1% of the variance in weight loss.

M) Figure representing the results: Output 11.17 presents a scattergram showing the relationship between IQ and kilograms of weight lost.
[ASCII scattergram: KG_LOST (1 to 9) plotted against IQ (90 to 150), produced by PROC PLOT; legend: A = 1 obs, B = 2 obs, etc.]

Output 11.17. Scattergram plotting kilograms lost against IQ.
Note about the Preceding Summary

Item L from the preceding summary provides a formal description of the results for a paper. It is similar to the summary for the two other analyses reported earlier in this chapter, with the exception that this description omits any interpretation of the meaning of the regression coefficient (e.g., "The nature of the regression coefficient showed that, on the average, an increase of .0030 kilograms of weight loss was associated with every 1-unit increase in IQ"). This is because the regression coefficient (slope) in the analysis was not significantly different from zero. When statistics such as this are nonsignificant, a summary of results typically does not include any attempt to interpret the nature of the relationship.
Conclusion

Many of the statistics covered in a course on basic statistics can be divided into two families: tests of association versus tests of group differences. With a test of association, you are typically studying a single population of individuals and wish to know whether there is a relationship between two (or more) variables within that population. For example, in Chapter 10, "Bivariate Correlation," you learned about the Pearson correlation coefficient, one of the most widely used measures of association. You learned how to use SAS to compute a Pearson correlation coefficient and to test the null hypothesis that the correlation was equal to zero in the population. In this chapter, you learned how to compute a related measure––the regression coefficient––and test the null hypothesis that the coefficient was equal to zero in the population.

Now that you are familiar with tests of association, you will begin learning about tests of group differences. With a test of group differences, you typically want to know whether two (or more) populations differ from one another with respect to their mean scores on a criterion (or dependent) variable. The t test and analysis of variance (ANOVA) are widely used tests of group differences. Discussion of this family of tests begins in Chapter 12 with one of the more elementary tests of group differences: the single-sample t test.
Single-Sample t Test

Introduction ................................................................387
   Overview .................................................................387
Situations Appropriate for the Single-Sample t Test ........................387
   Overview .................................................................387
   Example of a Study Providing Data Appropriate for This Procedure ........387
   Summary of Assumptions Underlying the Single-Sample t Test ..............388
Results Produced in a Single-Sample t Test .................................388
   Overview .................................................................388
   Test of the Null Hypothesis ..............................................389
   Confidence Interval for the Mean .........................................391
   Effect Size ..............................................................391
Example 12.1: Assessing Spatial Recall in a Reading Comprehension
   Task (Significant Results) ...............................................393
   Overview .................................................................393
   The Study ................................................................393
   Data Set to Be Analyzed ..................................................394
   Choosing the Comparison Number for the Analysis ..........................395
   Writing the SAS Program ..................................................396
   Output Produced by PROC TTEST ............................................398
   Steps in Interpreting the Output .........................................399
   Summarizing the Results of the Analysis ..................................404
One-Tailed Tests versus Two-Tailed Tests ...................................406
   Dividing the Obtained p Value by 2 .......................................406
   Caution ..................................................................407
Example 12.2: An Illustration of Nonsignificant Results ....................407
   Overview .................................................................407
   The Study ................................................................407
   Interpreting the SAS Output ..............................................408
   Summarizing the Results of the Analysis ..................................410
Conclusion .................................................................412
Introduction

Overview

This chapter shows you how to use the SAS System to perform a t test for a single-sample mean. This is a parametric procedure that is appropriate when you want to test the null hypothesis that a given sample of data was drawn from a population that has a specified mean. The chapter shows you how to write the appropriate SAS program, interpret the output, and prepare a report summarizing the results of the analysis. Special emphasis is given to testing the null hypothesis, interpreting the confidence interval for the mean, and computing an index of effect size.
Situations Appropriate for the Single-Sample t Test

Overview

You may use a single-sample t test when
• You have obtained interval- or ratio-level data from a single sample of subjects.
• You want to determine whether the mean for this sample is significantly different from some specified population mean.
You may perform this analysis only when the population mean of interest is known. That is, you may perform it when the population mean has already been established by earlier research, or when it is established by theoretical considerations. It is not necessary that the standard deviation of scores in the population be known; this will be estimated from the sample data.

Example of a Study Providing Data Appropriate for This Procedure

The study. Suppose you are conducting research with the ESP Club on campus. The club includes ten members who claim to be able to predict the future. To prove it, they each complete 100 trials in which they predict the result of a coin flip (i.e., they predict prior to the flip whether the result will be heads or tails). The members show some variability in their performance. One member achieves a score of only 40 correct guesses, another achieves a score of 70 correct guesses, and so on. When you average their scores, you find that the average for the group is 60. This means that, on the average, the ten members guessed correctly on 60 out of 100 flips. Members of the ESP Club are very happy with this average score. They point out that, if they did not have precognition skills, they should have made an average of only 50 correct guesses out of 100 flips, because the probability of correctly guessing any single flip is only .5.
It is true that their sample mean of 60 correct guesses is higher than the hypothetical population mean of 50 correct guesses, but is it significantly higher? To find out, you perform a single-sample t test, testing the null hypothesis that the sample mean came from a population in which the population mean was actually equal to 50. If you reject this null hypothesis, it will provide some support for the club members' claim that they have ESP.

Why these data would be appropriate for this procedure. To perform a single-sample t test, you need a criterion variable. The criterion variable should be a numeric variable that is assessed on an interval or ratio scale of measurement. In the present study, the criterion variable is the number of correct guesses out of 100 coin flips. Conceivably, subjects could get a score of zero, a score of 100, or any number in between. You know that this variable is on a ratio scale because equal differences between scale values do have equal quantitative meaning, and also because there is a true zero point (a score of zero on this measure means you made no correct guesses at all). Therefore, this assumption appears to be met (additional assumptions for this test are listed in the following section). When researchers perform a single-sample t test, the numeric criterion variable being analyzed is usually a multi-value variable. You could ensure that this is the case in the present study by using PROC FREQ to create a frequency table for the criterion variable (number of correct guesses), and verifying that the variable assumes more than six values in your sample.

Summary of Assumptions Underlying the Single-Sample t Test

Level of measurement. The criterion variable should be a numeric variable that is assessed on an interval or ratio level of measurement.

Random sampling. Scores on the criterion variable should represent a random sample drawn from the population of interest.

Normal distributions. The sample should be drawn from a normally distributed population (you can use PROC UNIVARIATE to test the null hypothesis that the sample is from a normally distributed population; a sketch appears below). If the sample contains over 30 subjects, the single-sample t test is robust against moderate departures from normality (when a test is robust against violations of certain assumptions, it means that violating those assumptions will have only a negligible effect on the results).
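The statements below offer a minimal sketch of how you might perform both checks, assuming the scores are stored in a variable named CORRECT in a data set named ESP (both names are hypothetical). The NORMAL option of PROC UNIVARIATE requests tests of the null hypothesis that the sample is from a normally distributed population.

PROC FREQ   DATA=ESP;
   TABLES CORRECT;            /* frequency table for the criterion variable */
RUN;

PROC UNIVARIATE   DATA=ESP   NORMAL;
   VAR CORRECT;               /* NORMAL requests tests of normality         */
RUN;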
Results Produced in a Single-Sample t Test

Overview

When you use PROC TTEST to perform a single-sample t test, SAS automatically performs a test of the null hypothesis and estimates a confidence interval for the mean. If you use a few statistics that are included in the output of PROC TTEST, it is relatively easy to compute by hand an index of effect size. This section explains the meaning of these results.

Test of the Null Hypothesis

Overview. When you perform a single-sample t test, you begin with the null hypothesis that your sample was drawn from a population with a specific mean. This is analogous to saying that your sample represents a population with a specific mean. SAS computes the mean of your sample, and compares it against the hypothetical population mean. In this analysis, it computes a t statistic. The more different your sample mean is from the population mean stated in the null hypothesis, the larger this t statistic will be (in absolute terms). SAS also computes a p value (probability value) associated with this t statistic. If this p value is less than some standard criterion (alpha level), you will reject the null hypothesis. This book recommends that you use an alpha level of .05. This means that, if your obtained p value is less than .05, you will reject the null hypothesis that your sample was drawn from a population with the mean stated in your null hypothesis. In this case, you will conclude that you have statistically significant results.

The statistical null hypothesis. As an illustration, consider the fictitious study on coin flips. In that study, each of ten subjects participated in 100 trials in which they attempted to predict the results of coin flips. Theoretically, we would expect the average subject to be correct 50 times (because they had a .5 probability of being correct by chance, there were 100 coin flips, and 100 × .5 = 50). This means that you begin with the null hypothesis that your sample was drawn from a population in which the mean number of correct guesses was 50. Symbolically, you can state the null hypothesis in this way:

H0: µ = 50

In this statement, H0 is the symbol for the null hypothesis, and µ is the symbol for the population mean.

The statistical alternative hypothesis. Before stating the alternative hypothesis, you must first decide whether you wish to perform a directional test or a nondirectional test. A directional test is sometimes called a one-sided or one-tailed test, and it involves stating a directional alternative hypothesis. You state a directional alternative hypothesis if you not only predict that there will be a difference, but also make a specific prediction about the direction of that difference. For example, if you have strong reason to believe that the mean number of correct guesses in your sample will be significantly higher than 50, you might state the following directional alternative hypothesis:

Statistical alternative hypothesis (H1): µ > 50; In the population, the average number of correct guesses is greater than 50.

This is analogous to predicting that your sample was drawn from a population in which the mean is greater than 50. In the preceding alternative hypothesis, notice that the symbol form
of the alternative hypothesis uses the greater-than sign (>) rather than the equal sign (=) to reflect this direction. On the other hand, if you have strong reason to believe that the mean number of correct guesses in your sample will be significantly lower than 50, you might state a different directional alternative hypothesis:

Statistical alternative hypothesis (H1): µ < 50; In the population, the average number of correct guesses is less than 50.

This is analogous to predicting that your sample was drawn from a population in which the mean is less than 50. In the preceding alternative hypothesis, notice that the symbol form of the alternative hypothesis uses the less-than sign (<).

You may remember from Chapter 2 that this book recommends that, in most cases, you should not use a directional alternative hypothesis, and should instead use a nondirectional alternative hypothesis. With a single-sample t test, a nondirectional alternative hypothesis simply predicts that your sample was not drawn from a population that has a specific mean. It predicts that there will be a significant difference between your sample mean and the population mean stated in the null hypothesis, but it does not predict whether your sample mean will be higher or lower than the population mean. A nondirectional t test is often referred to as a two-sided or two-tailed test. For the current study, a nondirectional alternative hypothesis would be stated in this fashion:

Statistical alternative hypothesis (H1): µ ≠ 50; In the population, the average number of correct guesses is not equal to 50.

With the preceding alternative hypothesis, notice that the symbol form of the hypothesis includes the not-equal sign (≠). This reflects the fact that you are simply predicting that the actual population mean is not 50; you are not specifically predicting whether it is higher than 50 or lower than 50.

Obtaining significant results. An earlier section of this chapter asked you to suppose that you have conducted this study and found that, in reality, the average number of correct guesses in your sample is 60. This sample mean of 60 is higher than the hypothetical population mean of 50, but is it significantly higher? To find out, you perform a single-sample t test. Suppose that, after performing this test, your obtained p value is .001. This obtained p value is less than our standard criterion of .05, and so you reject the null hypothesis. You tentatively conclude that your sample was probably not drawn from a population in which the mean number of correct guesses is 50. In other words, you conclude that there is a statistically significant difference between your sample mean of 60 and the hypothetical population mean of 50.
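As a preview of the kind of program you will write later in this chapter, here is a minimal sketch of how this single-sample t test might be requested. The data set name ESP and the variable name CORRECT are assumptions used only for illustration; the H0=50 option supplies the population mean stated in the null hypothesis.

PROC TTEST   DATA=ESP   H0=50   ALPHA=0.05;
   VAR CORRECT;               /* number of correct guesses out of 100 flips */
   TITLE1 'JOHN DOE';
RUN;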
Confidence Interval for the Mean

Confidence interval defined. When you use PROC TTEST to perform a single-sample t test, SAS also automatically computes a confidence interval for the mean. A confidence interval is an interval that extends from a lower confidence limit to an upper confidence limit and is assumed to contain a population parameter with a stated probability, or level of confidence.

An example. As an illustration, consider the coin toss study described earlier. Suppose that you analyze your data, and compute the 95% confidence interval for the mean. Remember that your sample mean is 60. Suppose that SAS estimates that the 95% interval extends from 55 (the lower limit) to 65 (the upper limit). This means that there is a 95% probability that your sample of ten subjects was drawn from a population in which the mean number of correct guesses was somewhere between 55 and 65. You do not know the exact mean of this population, but you estimate that there is a 95% probability that it is somewhere between 55 and 65. Notice that, with this confidence interval, you are not stating that there is a 95% probability that the sample mean is somewhere between 55 and 65. You know exactly what the sample mean is––you have already computed it to be 60. The confidence interval that SAS computed is a probability statement about the population mean, not the sample mean.

Effect Size

The need for an index of effect size. Suppose that you perform a single-sample t test and your obtained p value is less than the standard criterion of .05. You therefore reject the null hypothesis. You know that there is a statistically significant difference between the population mean stated under the null hypothesis and the observed sample mean. But is it a relatively large difference? The null hypothesis test, by itself, does not tell you whether the difference is large or small. In fact, if your sample is large, you may obtain statistically significant results even if the difference is relatively trivial.

Effect size defined. Because of this problem with null hypothesis testing, many researchers are now supplementing these tests with measures of effect size. The exact definition of effect size will vary, depending upon the type of analysis that you are performing. For a single-sample t test, we can define effect size as the degree to which the sample mean differs from the population mean, stated in terms of the standard deviation of the population. The symbol for effect size is d, and the formula is as follows:

d = | X – µ0 | / σ

where:
X = the observed mean of the sample
µ0 = the hypothetical population mean stated under the null hypothesis
σ = the standard deviation of the null hypothesis population.
The preceding formula shows that, to compute effect size, you subtract the population mean from the sample mean and divide the resulting difference by the population standard deviation. This means that the effect size is essentially the number of standard deviations that the sample mean differs from the null hypothesis population mean. One problem with the preceding formula is that σ (the standard deviation of the null hypothesis population) is often unknown. In these situations, you may instead use sX, the estimated population standard deviation. In Chapter 7, "Measures of Central Tendency and Variability," you learned that sX is an estimate of the population standard deviation, and is computed from sample data. You may recall that the formula for sX uses N – 1 in the denominator, whereas the formula for σ uses N. Later, you will see that PROC TTEST automatically reports sX, which makes it relatively easy to compute the effect size for a single-sample t test. With sX substituted for σ, the formula for the d statistic becomes the following:

d = | X – µ0 | / sX

An example. Suppose that you analyze data from the coin toss study described earlier. Under the null hypothesis, the population mean µ0 = 50. Suppose that your observed sample mean (X) is 60, and the estimated population standard deviation (sX) is 15. In this situation, you would compute effect size as follows:

d = | X – µ0 | / sX
d = | 60 – 50 | / 15
d = 10 / 15
d = .667
d = .67
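If you prefer to let SAS do this arithmetic for you, the short DATA step below sketches the same computation. The values are those from the example above; the variable names are invented for illustration.

DATA _NULL_;
   XBAR = 60;                      /* observed sample mean                  */
   MU0  = 50;                      /* population mean stated under H0       */
   SX   = 15;                      /* estimated population std deviation    */
   D    = ABS(XBAR - MU0) / SX;    /* effect size: | X - mu0 | / sX         */
   PUT 'Effect size d = ' D 4.2;   /* prints: Effect size d = 0.67          */
RUN;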
Guidelines for interpreting effect size. This effect size of .67 indicates that the observed sample mean was approximately .67 standard deviations from the hypothetical population mean. This is an effect, but would most researchers view it as being a relatively large effect?
In making this interpretation, many researchers refer to the guidelines provided in Cohen's (1969) classic book on statistical power. Cohen's guidelines appear in Table 12.1:

Table 12.1
Guidelines for Interpreting Effect Size
_________________________________________
Effect size          Obtained d statistic
_________________________________________
Small effect         d = .20
Medium effect        d = .50
Large effect         d = .80
_________________________________________
Earlier, you computed the effect size for the coin toss study as d = .67. The guidelines appearing in Table 12.1 show that this is somewhere between a medium effect and a large effect (remember that the coin toss study is fictitious!).
Example 12.1: Assessing Spatial Recall in a Reading Comprehension Task (Significant Results)

Overview

This section shows you how to use PROC TTEST to perform a single-sample t test. You will use a fictitious study that provides data appropriate for this procedure. You will learn how to arrange your data set, write the SAS program, interpret the results of PROC TTEST, and summarize the results of your analysis in a report.

The Study

Suppose that you are a cognitive psychologist doing exploratory research in the area of spatial recall. You ask a sample of 13 subjects to read the same text (say, a 100-page-long excerpt from a biology textbook). Each page of text is divided into four quadrants, with a single paragraph appearing in each quadrant.
After reading the 100-page text, the subject takes a 100-item test. However, in this study, you are not interested in whether subjects can remember the facts they have learned. Instead, what you really want to know is whether subjects can remember where on the page they saw specific pieces of information. For example, assume that item 5 on the test deals with the concept of "ecological niche," and the information relevant to this item appeared in quadrant 2 of page 5 in the text. In responding to this item, the subject indicates the quadrant that the relevant information appeared in by circling 1 for quadrant 1, 2 for quadrant 2, and so forth. A response is considered correct if the subject correctly recalls the quadrant in which the information appeared. Needless to say, this is a difficult task. If the subjects are totally unable to recall the location on the page where the information appeared, you would expect them to guess the correct quadrant about 25% of the time. This is because they have four choices––1, 2, 3, and 4––and if they guessed randomly they would be correct about one time in four just due to chance. Since there are 100 items on the test, you would therefore expect them to guess correctly on about 25 items due to chance alone (because 100 × 25% = 25). If the subjects get more than 25 items correct, it may mean that they have good spatial recall skills––that they really can remember where the information appeared on the page. So you determine how many correct guesses were made by each of your 13 subjects. You then use a single-sample t test to determine whether their average number of correct guesses was significantly larger than the theoretical population mean of 25 correct guesses.

Data Set to Be Analyzed

Table 12.2 shows the scores on the criterion variable: the number of correct guesses made by each of the 13 subjects.

Table 12.2
Number of Correct Guesses Regarding Spatial Location
______________________
            Correct
Subject     guesses
______________________
01          31
02          34
03          26
04          36
05          31
06          30
07          29
08          30
09          34
10          28
11          28
12          30
13          33
______________________
Choosing the Comparison Number for the Analysis

With this analysis, you will test the null hypothesis that your sample of 13 subjects was drawn from a population in which the average score is 25. This is another way of saying that your sample represents a population in which the average score is 25. Symbolically, it can be represented this way:

H0: µ = 25

So 25 is the comparison number for this analysis. In this type of analysis, the comparison number is always the mean score that you would expect in the population, based on theoretical considerations. In the present study, the comparison number is 25 because
• Subjects are making 100 guesses.
• If the subjects have no spatial recall skills, the average number of correct guesses they make should be about the number that you would expect based on chance (guessing randomly).
• Since there are four possible responses, the probability of guessing correctly when guessing randomly is equal to 25%.
• Theoretically, if subjects are guessing randomly, you would expect them to make about 25 correct guesses, because 25% of 100 guesses is equal to 25 correct guesses.
This forms the basis for your single-sample t test. If the average score in your sample is significantly different from 25 (the comparison number), you will be able to reject this null hypothesis. If the average score in your sample is significantly larger than 25, it will provide support for the idea that your subjects do have spatial recall skills. Remember that the comparison number was 25 in this analysis because this was the number specified in the null hypothesis. In a different study and analysis, a different comparison number would likely be used, depending on the nature of the null hypothesis.
Writing the SAS Program

The DATA step. Suppose that you create a SAS data set that contains just two variables. The variable SUB_NUM contains a unique subject number for each subject, and the variable RECALL contains each subject's score on the criterion variable (the number of correct guesses regarding spatial location). Here is the DATA step for this program (line numbers have been added on the left):

 1   OPTIONS LS=80 PS=60;
 2   DATA D1;
 3      INPUT SUB_NUM
 4            RECALL;
 5   DATALINES;
 6   01 31
 7   02 34
 8   03 26
 9   04 36
10   05 31
11   06 30
12   07 29
13   08 30
14   09 34
15   10 28
16   11 28
17   12 30
18   13 33
19   ;

The data lines appear on lines 6–18 of the DATA step. You can see that these are the same data lines that appeared in Table 12.2.

The PROC Step. The syntax for the PROC step of a program that will perform a single-sample t test is as follows:

PROC TTEST   DATA=data-set-name
             H0=comparison-number
             ALPHA=alpha-level;
   VAR criterion-variable;
   TITLE1 'your-name';
RUN;
In the syntax, the PROC TTEST statement contains the following option:

H0=comparison-number

The comparison-number that appears in this option should be the population mean stated under the null hypothesis. It is the mean score that you wish your sample mean score to be compared against. The number that you type in this location will depend on the nature of your study. The preceding section, "Choosing the Comparison Number for the Analysis," stated that the appropriate comparison number for the current analysis is 25. Therefore, you will include the following option in the PROC TTEST statement:

H0=25

Note that the 0 that appears in the preceding option H0 is a zero (0), and is not the uppercase letter O. If you omit the H0 option from the PROC TTEST statement, the default comparison number is zero.

The syntax for the PROC TTEST statement also contains the following option:

ALPHA=alpha-level

The ALPHA option allows you to specify the size of the confidence interval that you will estimate around the sample mean. Specifying ALPHA=0.01 produces a 99% confidence interval, specifying ALPHA=0.05 produces a 95% confidence interval, and specifying ALPHA=0.1 produces a 90% confidence interval. Suppose that, in this analysis, you wish to create a 95% confidence interval for your mean. This means that you will include the following option in the PROC TTEST statement:

ALPHA=0.05

The preceding syntax included the following VAR statement:

VAR criterion-variable;

In the VAR statement, criterion-variable should be the name of the variable that is of central interest in the analysis. In the present study, the criterion variable is RECALL: the number of correct guesses regarding spatial location. This means that the SAS program will contain the following VAR statement:

VAR RECALL;

Below are the statements that constitute the PROC step for the current analysis:

PROC TTEST   DATA=D1   H0=25   ALPHA=0.05;
   VAR RECALL;
   TITLE1 'JOHN DOE';
RUN;
The complete SAS program. Here is the program that you can use to analyze the fictitious data from the preceding study. This program will perform a single-sample t test to determine whether the mean RECALL score in a sample of 13 subjects is significantly different from a comparison number of 25. It will estimate the 95% confidence interval for the mean.

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM RECALL;
DATALINES;
01 31
02 34
03 26
04 36
05 31
06 30
07 29
08 30
09 34
10 28
11 28
12 30
13 33
;
PROC TTEST   DATA=D1   H0=25   ALPHA=0.05;
   VAR RECALL;
   TITLE1 'JOHN DOE';
RUN;
Output Produced by PROC TTEST

With the line size and page size options requested in the OPTIONS statement, the preceding program would produce one page of output, shown in Output 12.1.

JOHN DOE                                                               1

                          The TTEST Procedure

                              Statistics
              Lower CL            Upper CL  Lower CL            Upper CL
Variable   N      Mean     Mean      Mean    Std Dev  Std Dev    Std Dev
RECALL    13    29.057   30.769    32.481     2.0315    2.833     4.6765

                              Statistics
Variable     Std Err    Minimum    Maximum
RECALL        0.7857         26         36

                               T-Tests
Variable           DF    t Value    Pr > |t|
RECALL             12       7.34      <.0001

Output 12.1. Results of the TTEST procedure performed on subject recall data.
Here are several main points in Output 12.1 from PROC TTEST:
• The name of the variable being analyzed appears below the heading "Variable." In this case, you can see that the criterion variable being analyzed was RECALL.
• The output includes two sections headed "Statistics." These sections report the mean, standard deviation, confidence intervals, and other information.
• The section headed "T-Tests" reports the t statistic and other information relevant for the null hypothesis that the sample was drawn from a population with a specified mean.

The following section describes the various sections of PROC TTEST output in greater detail.

Steps in Interpreting the Output

1. Make sure that everything looks right. The output from the analysis is again reproduced as Output 12.2. Callout numbers identify the sections that you should review to verify that there were no obvious errors in entering your data or requesting the TTEST procedure.

[Output 12.2 reproduces the PROC TTEST results shown in Output 12.1.]

Output 12.2. Sections to review to verify that there were no obvious errors in writing the SAS program or entering data.
First check the name of the criterion variable to verify that you are looking at results for the correct variable (RECALL, in this case). Check the number of valid observations in the column headed “N” to verify that the data set includes the expected number of subjects. Here, the N is 13, as expected. Next review the mean, the minimum value, and the maximum value to verify that you have not made any obvious errors in keying the data (e.g., verify that you don’t have, say, a maximum observed score of 200, although the highest possible score was supposed to be 100). So far, these results do not reveal any problems.
[Output 12.3 reproduces the PROC TTEST results shown in Output 12.1.]

Output 12.3. Sections to review for the test of the study's null hypothesis.
2. Review the results of the t test. Output 12.3 presents the output from PROC TTEST again, this time identifying information relevant to the test of the null hypothesis. Remember that the null hypothesis for your study states that your sample was drawn from a population in which the mean score was 25 (25 is the number of correct responses that you would expect if your subjects were responding correctly at a chance level). Symbolically, the null hypothesis was stated this way: H0: µ = 25 Output 12.3 shows that the mean score obtained for your sample of 13 subjects was 30.769. This is higher than the population mean of 25 that was stated in the null hypothesis, but is it significantly higher? To find out, you will need to consult the results of the t test. The results of this test appear in the lower part of the output, in the section headed “T-Tests.” Below the heading “t Value,” you can see that the obtained t statistic for your analysis is 7.34. The section headed “DF” provides the degrees of freedom for the t test, which in this case was 12. Finally, the section headed “Pr > | t |” provides the p value (probability value) for the t test. Remember that the p value is the probability that you would obtain a t value this large or larger (in absolute terms) if the null hypothesis were true. Output 12.3 shows that the p value for this test is <.0001 (less than one in 10,000). This book recommends that you should reject the null hypothesis whenever the p value associated with a test is less than .05. In the present case, the obtained p value is less than .0001, which is much less than .05. Therefore, you will reject the null hypothesis that your sample was drawn from a population in which the average recall score was 25. You will conclude that your obtained sample mean of 30.769 is significantly higher than the
hypothetical mean of 25. In other words, you will conclude that the subjects in your sample were able to recall spatial location at a rate higher than would be expected with random guessing.

3. Review the confidence interval for the mean. An earlier section of this chapter indicated that a confidence interval is an interval that extends from a lower confidence limit to an upper confidence limit and is assumed to contain a population parameter with a stated probability, or level of confidence. For example, if you compute the 95% confidence interval for a single-sample t test, you can be 95% sure that the actual population mean is somewhere between those two limits. When you wrote your SAS program to perform this analysis, you included the following PROC TTEST statement:

PROC TTEST   DATA=D1   H0=25   ALPHA=0.05;
The option ALPHA=0.05 that is included in this statement requests that SAS compute the 95% confidence interval for the mean (if you had desired the 99% confidence interval, you would have included the option ALPHA=0.01). The 95% confidence interval for the current analysis appears in the output for the PROC TTEST and is shown again as Output 12.4. JOHN DOE
2
The TTEST Procedure Statistics
Variable RECALL
N 13
Lower CL Mean 29.057
Variable RECALL
Upper CL Mean Mean 30.769 32.481 Statistics Std Err 0.7857
Minimum 26
Lower CL Std Dev 2.0315
Std Dev 2.833
Upper CL Std Dev 4.6765
Maximum 36
T-Tests Variable RECALL
DF 12
t Value 7.34
Pr > |t| <.0001
Output 12.4. Sections to review to find the confidence interval for the mean.
You have already seen that the mean RECALL score for your sample of 13 subjects was 30.769. In Output 12.4, the lower confidence limit for the mean appears below the heading “Lower CL Mean.” You can see that the lower confidence limit in this case is 29.057. In the same output, the upper confidence limit for the mean appears below the heading “Upper CL Mean.” You can see that the upper confidence limit in this case is 32.481. Taken together, these results show that the 95% confidence interval for the current analysis ranges
from 29.057 to 32.481. This means that there is a 95% likelihood that the actual mean for the population from which your sample was drawn is somewhere between 29.057 and 32.481. Notice that this confidence interval does not contain the mean of 25 correct recalls that was stated in the null hypothesis. This finding is consistent with the idea that the average number of correct guesses displayed by your sample is significantly greater than 25, the number that would have been expected with purely random guessing. When you are looking for the confidence interval for the mean from an analysis such as this, be sure that you do not look under the headings "Lower CL Std Dev" and "Upper CL Std Dev." Under these headings, you will find instead the lower and upper confidence limits (respectively) for the study's standard deviation. With most studies conducted in the social sciences and in education, it is much more common to report the confidence interval for the mean rather than the confidence interval for the standard deviation.

4. Compute the index of effect size. An earlier section defined effect size as the degree to which the sample mean differs from the population mean, stated in terms of the standard deviation of the population. The symbol for effect size is d. When the population standard deviation is being estimated from sample data, the formula for d is as follows:

d = | X – µ0 | / sX
In the preceding formula, X represents the observed sample mean, µ0 represents the theoretical population mean as stated in the null hypothesis, and sX represents the standard deviation of the null hypothesis population, as estimated from sample data. Although SAS does not include d as a part of the output from PROC TTEST, the statistic is relatively easy to compute by hand, as we will see. Two of the three values to be inserted into the preceding formula appear on the SAS output. See Output 12.5 for the output from the current analysis.
[Output 12.5 reproduces the PROC TTEST results shown in Output 12.1.]

Output 12.5. Sections to review to compute d, the index of effect size.
First, you will need the observed sample mean. In Output 12.5, this appears below the heading "Mean." As you have already seen, the sample mean from this analysis is 30.769, and this value is now inserted in the formula for d:

d = | 30.769 – µ0 | / sX

With the preceding formula, the symbol sX represents the population standard deviation, as estimated from sample data. In Output 12.5, this value appears below the heading "Std Dev." This standard deviation was based on sample data, using N – 1 in the denominator (rather than N). Output 12.5 shows that the estimated population standard deviation is 2.833. This value is now inserted in the formula below:

d = | 30.769 – µ0 | / 2.833
The final value in the formula is µ0, which represents the population mean under the null hypothesis. For the current study, the null hypothesis was stated symbolically in this way:

H0: µ = 25

Your null hypothesis stated that the current sample was drawn from a population in which the mean score is 25. The number 25 was chosen because that is the number of correct guesses you would expect the subjects to make if they were guessing in a random fashion.
You will remember that this is the reason that you chose 25 as the comparison number for the H0 option that you included in the PROC TTEST statement:

PROC TTEST   DATA=D1   H0=25   ALPHA=0.05;

This should generally be the case. In most instances, the comparison number from your PROC TTEST statement will serve as the value of µ0 in the formula for computing d. Because the comparison number for the current analysis is 25, that value is now inserted in the formula below and the value of d is computed:

d = | 30.769 – 25 | / 2.833
d = 5.769 / 2.833
d = 2.036
d = 2.04

And so the index of effect size for the current analysis is 2.04. Is this a relatively large effect or a relatively small effect? Cohen's guidelines for evaluating effect size are shown again in Table 12.3:

Table 12.3
Guidelines for Interpreting Effect Size
_________________________________________
Effect size          Obtained d statistic
_________________________________________
Small effect         d = .20
Medium effect        d = .50
Large effect         d = .80
_________________________________________
According to the table, an effect is considered large if d = .80. For the current analysis, d = 2.04, which is much larger than .80. Therefore, you can conclude that the present data produced a relatively large index of effect size.

Summarizing the Results of the Analysis

Here is the analysis report for the current analysis. Notice that it follows a format similar, but not identical, to the format used for analysis reports in the previous chapter. Following the report are a number of notes that clarify the information that it contains.
A) Statement of the research question: The purpose of this study was to determine whether subjects performing a four-choice spatial recall task will perform at a level that is higher than the level expected with random responding.

B) Statement of the research hypothesis: Subjects performing a four-choice spatial recall task will perform at a level that is higher than the level expected with random responding.

C) Nature of the variable: The criterion variable was RECALL, the number of times the subjects correctly guessed the quadrant where targeted information appeared on the text page. This was a multi-value variable and was assessed on a ratio scale.

D) Statistical test: Single-sample t test.

E) Statistical null hypothesis (H0): µ = 25; In the population, the average number of correct recalls is equal to 25 out of 100 (the number expected with random responding).

F) Statistical alternative hypothesis (H1): µ ≠ 25; In the population, the average number of correct recalls is not equal to 25 out of 100.

G) Obtained statistic: t = 7.34

H) Obtained probability (p) value: p < .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Confidence interval: The sample mean on the criterion variable (number of correct recalls) was 30.769. The 95% confidence interval for the mean extended from 29.057 to 32.481.

K) Effect size: d = 2.04.

L) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis.

M) Formal description of results for a paper: Results were analyzed using a single-sample t test. This analysis revealed a significant t value, t(12) = 7.34, p < .0001. In the sample, the mean number of correct recalls was 30.769 (SD = 2.833), which was significantly higher than the 25 correct recalls that would have been expected with random responding. The 95% confidence interval for the mean extended from 29.057 to 32.481. The effect size was computed as d = 2.04. According to Cohen's (1969) guidelines, this represents a relatively large effect.
Notes regarding the preceding report. In general, the preceding summary was prepared according to the conventions recommended by the Publication Manual of the American Psychological Association (1994). A few words of explanation may be necessary:

• The second sentence of item M reports the obtained t statistic in the following way: t(12) = 7.34, p < .0001. The 12 that appears in parentheses in the excerpt is the degrees of freedom for the analysis. With a single-sample t test, the degrees of freedom are equal to N – 1, where N represents the number of subjects who provided valid data for the analysis. In the present case, N = 13, so it makes sense that the degrees of freedom would be 13 – 1 = 12. In the SAS output, the degrees of freedom for the t test appear below the heading "DF."

• In item H and item M, the probability value for the t statistic is presented in this way: p < .0001. This item used the less-than sign (<) because the less-than sign actually appeared in Output 12.3 (that is, the p value that appeared below the heading "Pr > | t |" was <.0001). If the less-than sign is not actually printed in the SAS output, you should instead use the equal sign (=) when indicating your obtained p value. For example, assume that the p value for this t statistic had actually been printed as .0143. In that situation, you would have used the equal sign: H) Obtained probability (p) value: p = .0143

• The third sentence of item M states, "In the sample, the mean number of correct recalls was 30.769 (SD = 2.833)..." The symbol SD represents the standard deviation of the criterion variable, RECALL. You will remember that this standard deviation appeared on Output 12.5 below the heading "Std Dev."
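If you would rather not copy the mean and standard deviation from the output by hand, the sketch below shows one possible way to compute d programmatically: it captures the PROC TTEST "Statistics" table with ODS OUTPUT and computes d in a DATA step. This sketch assumes a release of SAS in which the ODS table is named Statistics and contains columns named Mean and StdDev; if these assumptions do not hold for your release, use ODS TRACE to find the correct names.

ODS OUTPUT Statistics=STATS;       /* capture the "Statistics" table       */
PROC TTEST   DATA=D1   H0=25   ALPHA=0.05;
   VAR RECALL;
RUN;

DATA _NULL_;
   SET STATS;
   D = ABS(MEAN - 25) / STDDEV;    /* 25 is mu0, from the H0= option       */
   PUT 'Effect size d = ' D 4.2;   /* should print approximately 2.04      */
RUN;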
One-Tailed Tests versus Two-Tailed Tests

Dividing the Obtained p Value by 2

When you use PROC TTEST to perform a single-sample t test, the resulting p value is the probability value for a nondirectional test (i.e., for a two-tailed or two-sided test). If you instead wish to perform a directional test (i.e., a one-tailed or one-sided test), just divide the p value on the output by 2; the result will be the p value for a directional test. For example, imagine that you perform an analysis that produces a nondirectional p value of .08 in the output. This is larger than the standard criterion of .05, and thus fails to attain significance as a nondirectional test. However, suppose that, prior to the analysis, you
decided that a directional test was appropriate. You therefore divide the obtained p value of .08 by 2, resulting in a directional p value of .04. This obtained p value of .04 is less than the standard criterion of .05, and so you reject the null hypothesis and conclude that you have significant results. It makes sense that the directional test was significant while the nondirectional test was not, as a directional test has greater power.

Caution

Avoid the temptation to begin an analysis as a nondirectional test, and then to decide that it was actually a directional test when you see that the results are nonsignificant as a two-tailed test. By following such a strategy, your actual probability of making a Type I error is higher than the standard .05 level assumed by your readers. In most analyses, you are advised to use a nondirectional test (see Abelson [1995, pp. 57–59] for a discussion of one-tailed tests, two-tailed tests, and alternatives).
Example 12.2: An Illustration of Nonsignificant Results

Overview

So that you will be able to recognize SAS results that show a nonsignificant t value, this section provides the results of a t test that failed to attain significance. This section focuses on the same recall study described in the previous section. Here, however, you will analyze a different fictitious data set designed to produce nonsignificant results. You will review the SAS output and then prepare an analysis report for the new results.

The Study

Overview. The previous section described a study in which subjects read 100 pages of text, and then tried to recall where on the page a given piece of information appeared. They completed 100 trials, and so their scores on the criterion variable could range from zero (if none of their guesses were correct) to 100 (if all of their guesses were correct). If they are guessing randomly, you expect them to be correct 25 times out of 100, on the average (because each page is divided into four quadrants, and 100 divided by 4 is equal to 25).

The data set. Table 12.4 provides data from this fictitious study. Notice that the scores under the heading "Correct guesses" are different from those that appeared in Table 12.2.
Table 12.4
Number of Correct Guesses Regarding Spatial Location (Data Producing Nonsignificant Results)
______________________
            Correct
Subject     guesses
______________________
01          24
02          30
03          27
04          26
05          25
06          33
07          23
08          24
09          26
10          20
11          26
12          28
13          27
______________________
The SAS program. The SAS program to analyze the preceding data set would be identical to the SAS program presented earlier (in the section titled "Writing the SAS Program"), except that the scores for the criterion variable RECALL would be replaced with those appearing in Table 12.4, as shown in the sketch below. The option H0=25 would again be included in the PROC TTEST statement to request that the sample mean be compared against a population value of 25.
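Here is a minimal sketch of the revised DATA step; it simply substitutes the scores from Table 12.4 into the DATA step shown earlier.

DATA D1;
   INPUT SUB_NUM RECALL;
DATALINES;
01 24
02 30
03 27
04 26
05 25
06 33
07 23
08 24
09 26
10 20
11 26
12 28
13 27
;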
Interpreting the SAS Output

Reviewing the sample mean. Output 12.6 presents the results of the single-sample t test performed on the data from Table 12.4.

JOHN DOE                                                               1

                          The TTEST Procedure

                              Statistics
              Lower CL            Upper CL  Lower CL            Upper CL
Variable   N      Mean     Mean      Mean    Std Dev  Std Dev    Std Dev
RECALL    13    24.127   26.077    28.027     2.3137   3.2265     5.3261

                              Statistics
Variable     Std Err    Minimum    Maximum
RECALL        0.8949         20         33

                               T-Tests
Variable           DF    t Value    Pr > |t|
RECALL             12       1.20      0.2520

Output 12.6. Results of the TTEST procedure performed on subject recall data (nonsignificant results).
Your first clue that you are probably going to obtain nonsignificant results is the size of the sample mean itself. Output 12.6 shows that the sample mean is 26.077. This value is very close to the population mean of 25, which is stated by the null hypothesis.
Reviewing the results of the t test. To determine whether the sample mean differs significantly from the population mean, consult the "T-Tests" table of the output. There you will find the obtained t statistic; for this analysis it was only 1.20. Next to it is the p value; for this t statistic, the value is .2520. Because this p value is larger than our standard criterion of .05, you fail to reject the null hypothesis. You conclude that the sample mean of 26.077 is not significantly different from the population mean of 25. In other words, you conclude that the number of correct guesses made by your subjects was not significantly greater than the number that would be expected with random guessing.
Reviewing the confidence interval. These results are also reflected by the 95% confidence interval for the mean that SAS computed. Output 12.6 shows that the lower confidence limit for the mean is 24.127, and the upper confidence limit for the mean is 28.027. This means that there is a 95% likelihood that your sample was drawn from a population in which the population mean was somewhere between 24.127 and 28.027. Notice that this interval contains the number 25, which was the population mean stated by the study's null hypothesis. This confidence interval gives you another way of seeing that your sample could very likely have come from a population in which the mean number of correct responses was 25, and that is why you failed to reject the null hypothesis with your t test. Whenever a confidence interval contains the population mean stated by your null hypothesis, you can expect the corresponding t statistic to be nonsignificant. This, of course, assumes that you use the same alpha level for both the confidence interval and the t test (e.g., you compute the 95% confidence interval and set alpha = .05 for the t test).
Computing the index of effect size. The formula for d, the effect size index, is again provided below:

   d = | X̄ - µ0 | / sX
Thus far, we have already established that the sample mean (X̄) for the current analysis is 26.077, and that the population mean (µ0) under the null hypothesis is again 25. The only remaining value needed for the formula is sX, the estimated population standard deviation. This can be found in Output 12.6.
Under the heading "Std Dev," you can see that the standard deviation is 3.2265. When you insert these values into the formula, you can compute the effect size for the current analysis as follows:

   d = | X̄ - µ0 | / sX
     = | 26.077 - 25 | / 3.2265
     = 1.077 / 3.2265
     = .334
     ≈ .33
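If you would rather let SAS do this arithmetic, here is a minimal DATA step sketch; the numeric values are simply copied from Output 12.6:

   DATA _NULL_;
      xbar = 26.077;              /* sample mean from Output 12.6 */
      mu0  = 25;                  /* population mean under the null hypothesis */
      s    = 3.2265;              /* estimated population standard deviation */
      d    = ABS(xbar - mu0) / s; /* effect size index */
      PUT 'Effect size d = ' d 4.2;  /* prints 0.33 */
   RUN;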
The obtained effect size index is .33. To determine whether this is considered relatively large or relatively small, we consult Cohen's (1969) guidelines in Table 12.5:

Table 12.5
Guidelines for Interpreting Effect Size
_________________________________________
Effect size           Obtained d statistic
_________________________________________
Small effect          d = .20
Medium effect         d = .50
Large effect          d = .80
_________________________________________
According to Table 12.5, your obtained effect size of .33 falls somewhere between a small effect (d = .20) and a medium effect (d = .50).

Summarizing the Results of the Analysis
Here is the analysis report for the current analysis. Notice that the results have been changed to be consistent with those reported in Output 12.6.
A) Statement of the research question: The purpose of this study was to determine whether subjects performing a four-choice spatial recall task will perform at a level that is higher than the level expected with random responding.
B) Statement of the research hypothesis: Subjects performing a four-choice spatial recall task will perform at a level that is higher than the level expected with random responding.
C) Nature of the variable: The criterion variable was RECALL, the number of times the subjects correctly guessed the quadrant where targeted information appeared on the text page. This was a multi-value variable and was assessed on a ratio scale.
D) Statistical test: Single-sample t test.
E) Statistical null hypothesis (H0): µ = 25; In the population, the average number of correct recalls is equal to 25 out of 100 (the number expected with random responding).
F) Statistical alternative hypothesis (H1): µ ≠ 25; In the population, the average number of correct recalls is not equal to 25 out of 100.
G) Obtained statistic: t = 1.20
H) Obtained probability (p) value: p = .2520
I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.
J) Confidence interval: The sample mean on the criterion variable (number of correct recalls) was 26.077. The 95% confidence interval for the mean extended from 24.127 to 28.027.
K) Effect size: d = .33.
L) Conclusion regarding research hypothesis: These findings fail to provide support for the study's research hypothesis.
M) Formal description of the results for a paper: Results were analyzed using a single-sample t test. This analysis revealed a nonsignificant t value, t(12) = 1.20, p = .2520. In the sample, the mean number of correct recalls was 26.077 (SD = 3.2265), which was not significantly higher than the 25 correct recalls that would have been expected with random responding. The 95% confidence interval for the mean extended from 24.127 to 28.027. The effect size was computed as d = .33. According to Cohen's (1969) guidelines, this falls somewhere between a small effect and a medium effect.
Conclusion
In this chapter you learned how to use SAS to perform a single-sample t test. This test is useful when you have computed the mean score on a criterion variable for just one sample, and wish to determine whether this mean is significantly different from a specified population value.
With this foundation laid, you are now ready to move on to one of the most widely used inferential statistics: the independent-samples t test. The independent-samples t test is useful when you have obtained numeric data from two samples and wish to determine whether the sample means are significantly different from each other. For example, you might use this test when you have conducted an experiment with an experimental group and a control group, and wish to determine whether the observed difference between the two group means is significant. Chapter 13 discusses the assumptions underlying this test, and shows how it can be performed using SAS.
Independent-Samples t Test

Introduction ..........................................................................................415
   Overview ................................................................................................................ 415
   Independent Samples versus Paired Samples ...................................................... 415
Situations Appropriate for the Independent-Samples t Test ..............417
   Overview ................................................................................................................ 417
   Nature of the Predictor and Criterion Variables ..................................................... 417
   The Type-of-Variable Figure .................................................................................. 417
   Example of a Study Providing Data Appropriate for This Procedure ..................... 418
   Summary of Assumptions Underlying the Independent-Samples t Test ................ 419
Results Produced in an Independent-Samples t Test .........................420
   Overview ................................................................................................................ 420
   Test of the Null Hypothesis .................................................................................... 420
   Confidence Interval for the Difference between the Means ................................... 426
   Effect Size .............................................................................................................. 426
Example 13.1: Observed Consequences for Modeled Aggression:
   Effects on Subsequent Subject Aggression (Significant
   Differences) .........................................................................................428
   Overview ................................................................................................................ 428
   The Study ............................................................................................................... 428
   The Predictor Variable and Criterion Variables in the Analysis .............................. 429
   Data Set to Be Analyzed ........................................................................................ 430
   The DATA Step for the Program ............................................................................ 431
   Writing the SAS Program ....................................................................................... 432
   Results from the SAS Output ................................................................................. 434
   Steps in Interpreting the Output ............................................................................. 435
   Summarizing the Results of the Analysis ............................................................... 443
Example 13.2: An Illustration of Results Showing
   Nonsignificant Differences ..................................................................446
   Overview ................................................................................................................ 446
   The SAS Output ..................................................................................................... 446
   Interpreting the Output ........................................................................................... 446
   Summarizing the Results of the Analysis ............................................................... 448
Conclusion ............................................................................................450
Introduction

Overview
This chapter shows you how to use the SAS System to perform an independent-samples t test. You use this test when you want to compare two independent groups, to determine whether there is a significant difference between the groups with respect to their mean scores on some numeric criterion variable. The criterion variable must be assessed on an interval or ratio level (additional assumptions will be discussed). This chapter shows you how to write the appropriate SAS program, interpret the output, and prepare a report summarizing the results of the analysis. Special emphasis is given to testing the null hypothesis, interpreting the confidence interval for the difference between the means, and computing an index of effect size.

Independent Samples versus Paired Samples
Overview. This chapter shows how to perform the independent-samples t test, and the following chapter shows how to perform the paired-samples t test (also called the correlated-samples t test or related-samples t test). Your choice of which procedure to use depends in part upon whether the observations in the two samples are independent. When observations are independent, you may analyze the data using an independent-samples t test; when observations are not independent, you should analyze the data using a paired-samples t test. This section explains the meaning of independence as it applies in this context and provides examples of studies that provide data appropriate for the independent-samples t test versus the paired-samples t test.
Independent samples. Suppose that you conduct an experiment in which you obtain scores on a dependent variable under an experimental condition and a control condition. The scores obtained in these two conditions are independent if the probability of a specific score occurring under one condition is not influenced by the scores that occur under the other condition. If the scores are independent, then the samples are considered independent and it is appropriate to analyze the data using the independent-samples t test (assuming that certain other assumptions are met).
There are a number of ways that researchers can create independent samples. One way is to begin with a pool of subjects, and then randomly assign half of the subjects to the experimental condition, and the other half to the control condition. With this procedure, the scores that occur under one condition do not have any influence on scores that occur under the other condition, and so they are independent. This research design is called a randomized-subjects design. The randomized-subjects design is generally regarded as a fairly strong experimental design because the random assignment of subjects to conditions makes it likely that the two groups will be equivalent on most important variables at the outset of the study.
A second way to create independent samples is to use a subject variable (such as subject sex) as a quasi-independent variable in a study. For example, assume that you conduct an investigation in which you compare males versus females with respect to their scores on a test of verbal reasoning. In this instance, the scores are again independent because a specific score obtained under one condition (say the male condition) cannot be paired in any meaningful way with the specific score obtained under the second condition (the female condition). Therefore, it would again be appropriate to analyze the data using the independent-samples t test. You should remember, however, that the type of research design described here is not nearly as desirable as the randomized-subjects design described earlier. This is because, when the so-called “independent variable” is actually a subject characteristic (such as subject sex), it is not likely that the groups will be equivalent on most important variables at the outset of the study. Paired samples. In some types of investigations, observations are not independent. In those studies, it is possible to pair scores obtained under one condition with scores obtained under a different condition. The resulting samples are referred to as paired samples, correlated samples, or related samples. There are a number of ways that researchers can create paired samples. One way is to use a repeated-measures design in conducting the study. With a repeated-measures design, each subject is exposed to every treatment condition under the independent variable. This means that each subject provides a score on the dependent variable under each condition of the independent variable. Scores obtained in this way are no longer independent, because scores obtained under different treatment conditions are obtained from the same set of subjects. A second way to create paired samples is to use a matched-subjects design. With this approach, the subjects that are assigned to different conditions are matched on some variable of interest. This means that a subject in one condition is paired with a subject in another condition. Subjects are paired because they are similar to each other on some matching variable. For example, assume that you randomly assign half of your subjects to an experimental condition, and the other half to a control condition. Assume that the dependent variable in your study will again be scores on a test of verbal reasoning. Therefore, you match the subjects on IQ scores, because you believe that IQ scores will be correlated with scores on the verbal reasoning dependent variable. This means that, if subject #1 (in the experimental condition) has a high score on IQ, she will be matched with a subject in the control condition who also has a high score on IQ. Scores on the dependent variable that are obtained in this way are no longer independent, because they have been obtained from pairs of subjects who are similar to each other on the matching variable. When the data are analyzed, they will be analyzed in a special way to take advantage of this matching (this will be illustrated in the next chapter). Between-subjects designs versus within-subjects designs. Randomized-subjects designs and other research procedures that produce data appropriate for an independent-samples t test are typically referred to as between-subjects designs. Repeated-measures designs,
matched-subjects designs, and other procedures that produce data appropriate for a paired-samples t test are typically referred to as within-subjects designs.
Situations Appropriate for the Independent-Samples t Test

Overview
The independent-samples t test is a test of group differences. You use this statistic when you want to determine whether there is a significant difference between two groups with respect to their mean scores on some numeric criterion (or dependent) variable. The t test is appropriate when you are comparing exactly two groups, and is not appropriate for studies that involve more than two groups. For guidance in analyzing data from studies with three or more groups, see Chapters 15 and 16.
The first part of this section describes the types of situations in which this statistic is typically computed, and discusses a few of the assumptions underlying the procedure. A more complete summary of assumptions is presented at the end of this section.

Nature of the Predictor and Criterion Variables
Predictor variable. To perform an independent-samples t test, the predictor (or independent) variable should be a dichotomous variable (i.e., a variable that assumes just two values). The predictor variable in an independent-samples t test is simply the variable that indicates which group a subject is in. The predictor variable may be assessed on any scale of measurement: nominal, ordinal, interval, or ratio.
Criterion variable. The criterion (or dependent) variable should be a numeric variable that is assessed on an interval or ratio scale of measurement.

The Type-of-Variable Figure
The following figure illustrates the types of variables that are typically being analyzed when performing an independent-samples t test.

   Criterion             Predictor
   Variable              Variable
     Multi        =         Di
The symbol that appears to the left of the equals sign represents the criterion variable in the analysis. The word “Multi” that appears in the figure shows that the criterion variable in an independent-samples t test is typically a multi-value variable (a variable that assumes more than six values in your sample).
The letters "Di" on the right of the equals sign show that the predictor variable in this procedure is a dichotomous variable (i.e., a variable that assumes just two values). As was stated earlier, the predictor variable in this procedure simply indicates which group a subject is in.

Example of a Study Providing Data Appropriate for This Procedure
The study. Suppose you are a criminologist doing research on drunk driving. In your current project, you wish to determine whether people who live in "dry" counties (counties that prohibit the sale of alcohol) tend to drive under the influence of alcohol less frequently than people who live in "wet" counties (counties that allow the sale of alcohol). Suppose that you survey people in both types of counties about their behavior. Your criterion variable is the number of times that each subject has driven a car under the influence of alcohol in the past month. You then use an independent-samples t test to determine whether the average score for the subjects in the dry counties is significantly lower than the average score for subjects in the wet counties.
Why these data would be appropriate for this procedure. Earlier sections have indicated that, to perform an independent-samples t test, you need a predictor variable, and that this predictor variable should be dichotomous. The predictor variable in this study was "type of county policy toward alcohol." You know that this is a dichotomous variable, because it consists of just two values: "dry" versus "wet" (that is, each subject was classified as either being from a dry county or from a wet county).
Earlier sections have stated that, to perform an independent-samples t test, the criterion variable should be a numeric variable that is assessed on an interval or ratio scale of measurement. The criterion variable in the present study is "the number of times driving under the influence of alcohol in past month." You know that scores on this criterion variable are assessed on a ratio level of measurement because equal intervals between scale scores have equal quantitative meaning, and also because there is a true zero point (i.e., if someone has a score of zero, it means that he or she did not drink and drive at all).
An earlier section also indicated that, when researchers perform a t test, the criterion variable is usually a multi-value variable. To determine whether this is the case for the current study, you would use PROC FREQ to create a simple frequency table for the criterion variable (similar to those shown in Chapter 5: "Creating Frequency Tables"). You would know that the criterion is a multi-value variable if you observe more than six values in its frequency table.
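For example, assuming the survey responses had been read into a data set named COUNTY with the criterion variable named DUI (both names are hypothetical, chosen only for this illustration), the frequency table could be requested as follows:

   PROC FREQ DATA=COUNTY;
      TABLES DUI;   /* one row per observed value of the criterion */
   RUN;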
Summary of Assumptions Underlying the Independent-Samples t Test
• Level of measurement. The criterion variable should be assessed on an interval or ratio level of measurement. The predictor variable may be assessed on any level of measurement.
• Independent observations. A given observation should not be dependent on any other observation in either group. In an experiment, this is normally achieved by drawing a random sample, and randomly assigning each subject to only one of the two treatment conditions. This assumption would be violated if a given subject contributed scores on the criterion variable under both treatment conditions. The independence assumption is also violated when one subject's behavior influences another subject's behavior within the same condition. For example, if subjects are given experimental instructions in groups of five, and are allowed to interact in the course of providing scores on the criterion variable, it is likely that their scores will not be independent: each subject's score is likely to be affected by the other subjects in that group. In these situations, scores from the subjects constituting a given group of five should be averaged, and these average scores should constitute the unit of analysis. None of the tests discussed in this text are robust against violations of the independence assumption.
• Random sampling. Scores on the criterion variable should represent a random sample drawn from the populations of interest.
• Normal distributions. Each sample should be drawn from a normally distributed population (you can use PROC UNIVARIATE with the NORMAL option to test the null hypothesis that the sample is from a normally distributed population; a sketch appears after this list). If each sample contains over 30 subjects, the test is robust against moderate departures from normality (when a test is robust against violations of certain assumptions, it means that violating those assumptions will have only a negligible effect on the results). If the assumption of normality is violated, you may instead analyze your data using PROC NPAR1WAY. For guidance, see the NPAR1WAY procedure in the SAS/STAT User's Guide.
• Homogeneity of variance. To use the equal-variances t test, you should draw the samples from populations with equal variances on the criterion. If the null hypothesis of equal population variances is rejected, you should use the unequal-variances t test. Both types of tests are provided in the output of PROC TTEST, to be described in the following section.
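Here is a minimal sketch of that normality check, assuming the data set D1 and the variables VIDGRP and AGGRESS from Example 13.1 later in this chapter:

   PROC SORT DATA=D1;
      BY VIDGRP;          /* BY-group processing requires sorted data */
   RUN;
   PROC UNIVARIATE DATA=D1 NORMAL;
      VAR AGGRESS;        /* criterion variable */
      BY VIDGRP;          /* test normality separately within each sample */
   RUN;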
Results Produced in an Independent-Samples t Test

Overview
When you use PROC TTEST to perform an independent-samples t test, SAS automatically performs a test of the null hypothesis and estimates a confidence interval for the difference between the means. Using a few statistics that are included in the output of PROC TTEST, it is relatively easy to compute an index of effect size by hand. This section explains the meaning of these results.

Test of the Null Hypothesis
Overview. This section provides a concrete example of another study that would provide data appropriate for an independent-samples t test. It discusses the statistical null hypothesis and alternative hypothesis that would be stated for the analysis. It discusses the distinction between directional versus nondirectional alternative hypotheses as they apply to the independent-samples t test. Finally, it discusses the meaning of rejecting the null hypothesis, as it applies to the current example.
A study on memory. Suppose that you are conducting research on the herbal supplement Ginkgo biloba. You wish to determine whether taking Ginkgo biloba affects subject performance on a memory test. You begin with a pool of 100 subjects. You randomly assign 50 of the subjects to an experimental condition. Subjects in this condition will take 60 mg of Ginkgo biloba three times per day for six weeks. You label this the "ginkgo condition." You randomly assign the other 50 subjects to a control condition. Subjects in this condition take a placebo pill three times per day for six weeks. You label this the "placebo condition." At the end of the six-week period, you administer a memory test to all subjects. Scores on this test may range from zero to 100, with higher scores representing better memory. Assume that scores on this test are assessed on an interval scale of measurement and are normally distributed within conditions.
Samples and populations. You assume that your sample of 50 subjects in the ginkgo condition represents a population of subjects taking the same dose of Ginkgo biloba. The symbol µ1 represents the mean score of this population on the memory test that you are using. Of course, you cannot compute µ1 (because you cannot administer your memory test to all members of the population), but you can compute the mean score on the memory test for your sample of 50 subjects in the ginkgo condition. The symbol X̄1 will represent this sample mean. Similarly, you assume that your sample of 50 subjects in the placebo condition represents a different population: a population of subjects who are not taking ginkgo. The symbol µ2 represents the mean score of this population on the memory test that you are using.
Again, you cannot actually compute µ2, but you can compute the mean memory test score for your sample of 50 subjects in the placebo condition. The symbol X̄2 will represent this sample mean.
The statistical null hypothesis for a nondirectional test. In Chapter 12, you learned that you state statistical hypotheses differently depending on whether you plan to perform a directional test or a nondirectional test. This section and the next section focus on the situation in which you plan to perform a nondirectional test: a test in which you predict that there will be a difference, but do not make a specific prediction regarding the nature of that difference (i.e., a situation in which you do not predict which group will score higher on the criterion variable). A nondirectional test is sometimes called a two-sided test or a two-tailed test. A later section will show how to state null hypotheses and alternative hypotheses when you wish to perform a directional test.
In earlier chapters, you learned that a statistical null hypothesis is a hypothesis of no difference or no association. The statistical null hypothesis describes the results that you will obtain if your independent variable has no effect. Symbolically, the null hypothesis for an independent-samples t test can be stated in a number of ways. Here is one possibility:

H0: µ1 = µ2

The preceding null hypothesis states that the mean of population 1 is equal to the mean of population 2. If you are conducting an experiment with two treatment conditions, this is equivalent to saying that, in the population, the mean for the subjects in condition 1 is equal to the mean for the subjects in condition 2.
To make this concrete, think about the study on Ginkgo biloba that you are conducting. Assume for a moment that ginkgo has no effect on memory. If that were the case, you would expect the mean score on the memory test obtained for the population of people taking ginkgo to be equal to the mean score obtained for the population of people taking the placebo. This means that, in verbal terms, you could state the null hypothesis for the current study in this way: "In the population, there is no difference between subjects in the ginkgo condition versus subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test)." You can see how this would be an appropriate representation of the null hypothesis. If ginkgo really has no effect on memory, you would expect the mean memory test score for the ginkgo population to be equal to the mean memory test score for the placebo population.
When you prepare an analysis report for an independent-samples t test, you should provide the null hypothesis in symbolic form combined with the null hypothesis in verbal form. Below is an example of how this could be done for the ginkgo study:
Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between subjects in the ginkgo condition versus subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
Some textbooks portray this statistical null hypothesis in a somewhat different fashion, representing it in this way:

H0: µ1 - µ2 = 0

Symbolically, the preceding null hypothesis states that the difference between µ1 and µ2 is equal to zero. You can see that this is essentially the same as saying that µ1 is equal to µ2, because if two means are equal to each other, then the difference between them must be equal to zero. Therefore, these two ways of representing the null hypothesis are essentially equivalent. Below is an example of how you might state this type of null hypothesis in an analysis report:
Statistical null hypothesis (H0): µ1 - µ2 = 0; In the population, the difference between subjects in the ginkgo condition versus subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test) is equal to zero.
The statistical alternative hypothesis for a nondirectional test. The statistical alternative hypothesis is a hypothesis that there is a difference or that there is an association. Again, this section illustrates the nature of the alternative hypothesis when you are performing a nondirectional test: a test in which you are not predicting which group will score higher on the criterion variable. Symbolically, a nondirectional alternative hypothesis for an independent-samples t test may be stated in this way:

H1: µ1 ≠ µ2

The preceding alternative hypothesis states that the mean of population 1 is not equal to the mean of population 2. If you are conducting an experiment with two treatment conditions, this is equivalent to saying that, in the population, the mean for the subjects in condition 1 is not equal to the mean for the subjects in condition 2. In verbal terms, there are a number of ways that this alternative hypothesis may be stated. Here is one possibility for expressing the alternative hypothesis in very general terms:
Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between subjects in the first condition versus subjects in the second condition with respect to their mean scores on the criterion variable.
To make things more concrete, here is one way of stating the alternative hypothesis for the Ginkgo biloba study described previously:
Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between subjects in the ginkgo condition versus subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
The preceding section on the statistical null hypothesis indicated that some textbooks state the null hypothesis in this way:

H0: µ1 - µ2 = 0

This null hypothesis states that the difference between the mean of population 1 and the mean of population 2 is equal to zero (this is equivalent to saying that there is no difference between the two population means). If you state your null hypothesis in this way, it is appropriate to state a corresponding alternative hypothesis in a similar fashion. Below is one way that this could be done for the ginkgo study:
Statistical alternative hypothesis (H1): µ1 - µ2 ≠ 0; In the population, the difference between subjects in the ginkgo condition versus subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test) is not equal to zero.
Obtaining significant results with a nondirectional test. When you use PROC TTEST to perform an independent-samples t test, the procedure analyzes data from your sample and computes a t statistic. With other factors held constant, the greater the observed difference between your two sample means, the larger this t statistic will be (in absolute terms). SAS also computes a p value (probability value) associated with this t statistic. If this p value is less than some standard criterion (alpha level), you will reject the null hypothesis. This book recommends that you use an alpha level of .05. This means that, if your obtained p value is less than .05, you will reject the null hypothesis that your two samples were drawn from populations with the same mean. In this case, you will conclude that you have statistically significant results.
When you perform a nondirectional test, the only requirement for rejecting the null hypothesis is that you obtain a p value that is below some set criterion (such as an alpha level of .05). When you perform a nondirectional test, it does not matter which sample mean is higher than the other: as long as your p value is less than the criterion, you may reject the null hypothesis.
The null and alternative hypotheses for a directional test. The preceding section showed you how to state the null hypothesis and alternative hypothesis for a nondirectional test. This section will show how to state these hypotheses for a directional test. A directional test is a test in which you not only predict that there will be a difference, but also make a specific prediction regarding the nature of that difference. For example, when you are performing an independent-samples t test, you would use a directional test if you were making a specific prediction about which group was going to score higher on the criterion variable. A directional test is sometimes called a one-sided test or a one-tailed test.
Consider the Ginkgo biloba study. It is possible that previous research suggests that taking ginkgo should have a positive effect on memory. Therefore, you develop the following research hypothesis: "Subjects who take Ginkgo biloba will later demonstrate higher scores on the memory test, compared to subjects who take the placebo."
It will probably be easier to understand the statistical hypotheses if you consider the statistical alternative hypothesis prior to considering the statistical null hypothesis. For the current study on ginkgo and memory, the statistical alternative hypothesis could be stated in the following way:
Statistical alternative hypothesis (H1): µ1 > µ2; In the population, subjects in the ginkgo condition will score higher than subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
With the preceding alternative hypothesis, assume that µ1 represents the mean score on the memory test for the ginkgo condition in the population, and µ2 represents the mean score on the memory test for the placebo condition in the population. You can see that the symbolic version of the alternative hypothesis predicts that µ1 > µ2 (i.e., that the ginkgo population will score higher than the placebo population). This is consistent with the research hypothesis.
Below is the statistical null hypothesis that serves as counterpart to the preceding alternative hypothesis:
Statistical null hypothesis (H0): µ1 ≤ µ2; In the population, subjects in the ginkgo condition score lower than or equal to subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
The preceding null hypothesis predicts that the ginkgo population scores "lower than or equal to" the placebo population on the memory test. This hypothesis includes the words "lower than" as well as the words "equal to" because there are two types of outcomes that you could obtain that would fail to support your research hypothesis:
• If there were no statistically significant difference between the ginkgo condition and the placebo condition, this outcome would fail to support your research hypothesis. This is why your null hypothesis must contain some type of statement to the effect that the ginkgo population is equal to the placebo population on the memory test.
• If the placebo condition scores significantly higher than the ginkgo condition on the memory test, this outcome would also fail to support your research hypothesis (in fact, this outcome would be the exact opposite of your research hypothesis!). This is why your null hypothesis must contain some type of statement to the effect that the ginkgo population scores lower than the placebo population on the memory test.
When you performed a nondirectional test (earlier), your statistical null hypothesis was stated in the following way:

H0: µ1 = µ2

But this type of null hypothesis would be inadequate for a directional test. For the reasons stated above, the null hypothesis for the current ginkgo study must contain the ≤ sign rather than the = sign.
This section has shown you how your null and alternative hypotheses should appear when your research hypothesis predicts that µ1 > µ2. But what about a situation in which your research hypothesis predicts that µ1 < µ2?
In this situation, your null and alternative hypotheses must predict the opposite direction of results. For example, assume that the research literature actually suggests that Ginkgo biloba has a negative effect on memory, rather than a positive effect. If this were the case, you might state your research hypothesis in this way: "Subjects who take Ginkgo biloba will later demonstrate lower scores on the memory test, compared to subjects who take the placebo."
Below is the alternative hypothesis that would be appropriate for such a research hypothesis:
Statistical alternative hypothesis (H1): µ1 < µ2; In the population, subjects in the ginkgo condition will score lower than subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
Below is the null hypothesis that would be appropriate for such a research hypothesis:
Statistical null hypothesis (H0): µ1 ≥ µ2; In the population, subjects in the ginkgo condition score higher than or equal to subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
Obtaining significant results with a directional test. When you perform a directional test with PROC TTEST, you will again obtain a t statistic and a p value that is associated with that statistic. It is important to remember, however, that the p value computed by SAS is the p value for a nondirectional test. In order to compute the p value for a directional test, it is necessary to divide this p value by 2. For example, suppose that you perform the analysis, and the results of PROC TTEST include a p value of .06. This is larger than the standard criterion of .05 that is recommended by this book. Therefore, at first glance your results appear to be nonsignificant. However, this p = .06 is relevant only to the nondirectional test. To compute the p value for the directional test, you divide it by two, and arrive at an actual value of p = .03. This is below the standard criterion of .05, which means that your results are in fact statistically significant.
When you perform a directional test, it is important to remember that there are actually two conditions that must be met before you may reject your null hypothesis. They are:
• Your p value must be below some set criterion (as described above), and
• Your sample means must be in the direction specified by the alternative hypothesis.
The second of these two conditions emphasizes that you may reject the null hypothesis only if the mean that you predicted would be higher (in your alternative hypothesis) is, in fact, higher. For example, consider the ginkgo study. Assume that you began your analysis with the following statistical alternative hypothesis:
Statistical alternative hypothesis (H1): µ1 > µ2; In the population, subjects in the ginkgo condition will score higher than subjects in the placebo condition with respect to their mean scores on the criterion variable (the memory test).
If this were the case, you would be justified in rejecting your null hypothesis only if your p value were less than .05 (for example), and the sample mean for the ginkgo condition were, in fact, higher than the sample mean for the placebo condition. If the sample mean for the placebo condition were higher than the sample mean for the ginkgo condition, it would not be appropriate to reject the null hypothesis even if the p value were less than .05.

Confidence Interval for the Difference between the Means
Confidence interval defined. When you use PROC TTEST to perform an independent-samples t test, SAS also automatically computes a confidence interval for the difference between the means. In the last chapter, you learned that a confidence interval is an interval that extends from a lower confidence limit to an upper confidence limit and is assumed to contain a population parameter with a stated probability, or level of confidence. In that chapter, you learned that SAS can compute the confidence interval for a mean. In this chapter, you will see that SAS can also compute a confidence interval for the difference between two means.
An example. As an illustration, consider the Ginkgo biloba study described above. Assume that the mean score on the memory test for a sample of subjects in the ginkgo condition is 56, and the mean score for a sample of subjects in the placebo condition is 50. The difference between the two sample means is 56 - 50 = 6. This means that the observed difference between the sample means is equal to 6.
Assume that you analyze your data using PROC TTEST, and SAS estimates that the 95% confidence interval for this difference between the means extends from 4 (the lower confidence limit) to 8 (the upper confidence limit). This means that there is a 95% probability that, in the population, the actual difference between the ginkgo condition and the placebo condition is somewhere between 4 and 8 points on the memory test. You do not know the actual difference between the population means, but you estimate that there is a 95% probability that it is somewhere between 4 and 8 points.
Notice that with this confidence interval you are not stating that there is a 95% probability that the difference between the sample means is somewhere between 4 and 8 points. You know exactly what the difference between the sample means is: you have already computed it to be 6 points. The confidence interval computed by SAS is a probability statement about the difference in the populations, not in the samples.

Effect Size
The need for an index of effect size. In Chapter 12, you learned that, when you perform a significance test, it is best to supplement that test with an estimate of effect size. When you analyze data from an experiment, an index of effect size indicates how large the treatment effect was: It indicates how much change occurs in the dependent variable as a result of a change in the independent variable.
In Chapter 12, you also learned about an index of effect size that can be used with a single-sample t test. With that statistical procedure, effect size was defined as the degree to which the sample mean differs from the population mean, stated in terms of the standard deviation of the population.
Effect size defined. When you perform an independent-samples t test, effect size can be defined in a somewhat different way. With this procedure, effect size can be defined as the degree to which one sample mean differs from a second sample mean, stated in terms of the standard deviation of the population. The symbol for effect size is d (as was the case in the previous chapter), and the formula (adapted from Thorndike and Dinnel [2001]) is as follows:

   d = | X̄1 - X̄2 | / sp
where:
   X̄1 = the observed mean of sample 1 (i.e., the subjects in treatment condition 1)
   X̄2 = the observed mean of sample 2 (i.e., the subjects in treatment condition 2)
   sp = the pooled estimate of the population standard deviation
You can see that the procedure for computing d is fairly straightforward: You simply subtract one sample mean from the other sample mean and divide the absolute value of the result by the pooled estimate of the population standard deviation. The resulting statistic represents the number of standard deviations that the two means differ from one another.
The index of effect size is not automatically computed by PROC TTEST, although it can easily be calculated by hand from other statistics that do appear on the procedure output. A later section of this chapter will show how this is done, and will provide some guidelines for interpreting the size of d.
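As a preview, here is a minimal DATA step sketch of that hand calculation; the numeric values are the two group means and the pooled standard deviation (the "Diff (1-2)" Std Dev) that appear later in Output 13.1:

   DATA _NULL_;
      xbar1 = 4.0;                     /* mean for the PUN group */
      xbar2 = 7.1111;                  /* mean for the REW group */
      sp    = 2.1128;                  /* pooled estimate of the standard deviation */
      d     = ABS(xbar1 - xbar2) / sp;
      PUT 'Effect size d = ' d 5.2;    /* prints 1.47 */
   RUN;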
Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent Subject Aggression (Significant Differences)

Overview
This section provides a complete illustration of an independent-samples t test. It describes a simple fictitious experiment that produces data appropriate for an independent-samples t test. It shows how to use SAS to perform an independent-samples t test on data from this study. It also shows how to handle the DATA step, how to run PROC TTEST, how to interpret the output from PROC TTEST, and how to prepare a report that summarizes the results of the analysis. This section illustrates significant results; a section that follows will illustrate nonsignificant results.
Note: Although the study described here is fictitious, it was inspired by the actual investigation reported by Bandura (1965).

The Study
The research hypothesis. Suppose that you are a social psychologist studying aggression in children. In your investigation, you wish to determine whether exposure to aggressive models causes children to behave more aggressively. You have a hypothesis that children who see a model rewarded for engaging in aggressive behavior will themselves behave more aggressively than children who see a model punished for engaging in aggressive behavior. This research hypothesis is illustrated in Figure 13.1.
Figure 13.1. Causal relationship between consequences for model and the number of subsequent subject aggressive acts, as predicted by the research hypothesis.
The causal arrow going from right to left in Figure 13.1 illustrates your prediction that "observed consequences for the model" should have an effect on the "number of subsequent subject aggressive acts" displayed by the subject (the child who observes the model). You predict that:
• When children observe a model behaving aggressively and receiving the consequence of being rewarded, those children will themselves subsequently behave more aggressively.
• When children observe a model behaving aggressively and receiving the consequence of being punished, those children will themselves subsequently behave less aggressively.
The study. To test this hypothesis, you conduct a study in two stages. In Stage 1, you begin with a pool of 36 children, and randomly assign each child to either a "model-rewarded" condition or to a "model-punished" condition. In both conditions, children watch a videotape of a model behaving in an aggressive fashion with an inflatable "Bobo doll." In the videotape, the model punches the Bobo doll in the nose, strikes it with a rubber mallet, and kicks it around the room. However, the ending of the video is different for children in the two conditions (this is how you manipulate the independent variable). For the 18 children in the model-rewarded condition, the video ends with a second adult entering the room, praising the model, and rewarding him with candy and soft drinks. For the 18 children in the model-punished condition, the video ends with the second adult entering the room, scolding the model, and spanking him.
The independent variable in your study, therefore, is the observed consequences for the model. Half of your subjects observe the model being rewarded, while the other half observe the model being punished.
In Stage 2, you measure your dependent variable: the number of aggressive acts displayed by the children after they have watched the video. In Stage 2, each child is given the opportunity to play for 30 minutes (alone) in a room similar to the one seen in the video. The play room contains a Bobo doll (as in the video), a rubber mallet, and a wide variety of other toys. While the child plays, you and your assistants view the child through a one-way mirror and count the number of aggressive acts performed by the child (e.g., punching the inflatable doll). This "number of aggressive acts" serves as the dependent variable in your study.

The Predictor Variable and Criterion Variables in the Analysis
The predictor variable in your study is "observed consequences for the model," which consists of just two conditions: the "model-rewarded condition" versus the "model-punished condition." In the analysis, you will give this variable the SAS variable name VIDGRP, which stands for video group.
The criterion variable in your study is the number of aggressive acts displayed by the children during the 30-minute play period after they have watched the videotape. This variable is assessed on a ratio scale, since it has equal intervals and a true zero point. In the analysis, you will give it the SAS variable name AGGRESS, which refers to "subject aggressive acts."
Data Set to Be Analyzed
Table 13.1 provides the data set that you will analyze.

Table 13.1
Number of Aggressive Acts Displayed by Subjects as a Function of Video Group
____________________________________
             Video        Aggressive
Subject      group(a)     acts
____________________________________
01           PUN          3
02           PUN          6
03           PUN          4
04           PUN          8
05           PUN          7
06           PUN          0
07           PUN          5
08           PUN          2
09           PUN          4
10           PUN          5
11           PUN          6
12           PUN          1
13           PUN          2
14           PUN          3
15           PUN          4
16           PUN          5
17           PUN          3
18           PUN          4
19           REW          8
20           REW          9
21           REW          5
22           REW          10
23           REW          7
24           REW          7
25           REW          5
26           REW          3
27           REW          4
28           REW          11
29           REW          6
30           REW          6
31           REW          9
32           REW          8
33           REW          6
34           REW          7
35           REW          7
36           REW          10
_____________________________________
(a) With the variable Video group, the value PUN identifies subjects in the "model-punished condition," and the value REW identifies subjects in the "model-rewarded condition."
You can see that Table 13.1 consists of three columns. The first column is headed Subject, and simply provides a unique subject number for each participant.
The second column is headed Video group. The values that appear in this column indicate the treatment condition to which a given subject was assigned. The value PUN identifies the subjects who were assigned to the "model-punished condition," and the value REW identifies the subjects who were assigned to the "model-rewarded condition." You can see that Subjects 1-18 were assigned to the model-punished condition, and Subjects 19-36 were assigned to the model-rewarded condition.
The third column in the table is headed Aggressive acts, and this column indicates the number of aggressive acts that each child displayed in Stage 2 of the study, after watching the video. You can see that Subject 1 displayed 3 aggressive acts, Subject 2 displayed 6 aggressive acts, and so forth.

The DATA Step for the Program
Suppose that you prepare a SAS program to input the data set presented in Table 13.1. You use the SAS variable name SUB_NUM to represent subject numbers, the SAS variable name VIDGRP to represent the video group predictor variable, and the SAS variable name AGGRESS to represent subject scores on the criterion variable of the number of aggressive acts.
Following are the SAS statements that constitute the DATA step of this program. Notice that a dollar sign ($) appears to the right of VIDGRP to identify it as a character variable. Notice also that the data set appearing below the DATALINES statement is essentially identical to the one appearing in Table 13.1.

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM VIDGRP $ AGGRESS;
DATALINES;
01 PUN 3
02 PUN 6
03 PUN 4
04 PUN 8
05 PUN 7
06 PUN 0
07 PUN 5
08 PUN 2
09 PUN 4
10 PUN 5
11 PUN 6
12 PUN 1
13 PUN 2
14 PUN 3
15 PUN 4
16 PUN 5
17 PUN 3
18 PUN 4
19 REW 8
20 REW 9
21 REW 5
22 REW 10
23 REW 7
24 REW 7
25 REW 5
26 REW 3
27 REW 4
28 REW 11
29 REW 6
30 REW 6
31 REW 9
32 REW 8
33 REW 6
34 REW 7
35 REW 7
36 REW 10
;
Writing the SAS Program
The PROC Step. The syntax for the PROC step that will perform an independent-samples t test is as follows:

PROC TTEST DATA=data-set-name ALPHA=alpha-level;
   CLASS predictor-variable;
   VAR criterion-variable;
   TITLE1 'your-name';
RUN;
In this syntax, the PROC TTEST statement contains the following option:

ALPHA=alpha-level

This ALPHA= option allows you to specify the size of the confidence interval that PROC TTEST will estimate for the observed difference between sample means. Specifying ALPHA=0.01 produces a 99% confidence interval, specifying ALPHA=0.05 produces a 95% confidence interval, and specifying ALPHA=0.1 produces a 90% confidence interval.
Assume that, in this analysis, you wish to create a 95% confidence interval. This means that you will include the following option in the PROC TTEST statement:

ALPHA=0.05
The SAS statements. Here are the statements that will perform an independent-samples t test on the current data set:

PROC TTEST DATA=D1 ALPHA=0.05;
   CLASS VIDGRP;
   VAR AGGRESS;
   TITLE1 'JANE DOE';
RUN;

Some notes about the syntax:
• The PROC step begins with the PROC TTEST statement, in which you provide the name of the data set to be analyzed. In this case, the data set was D1.
• In the CLASS statement, you provide the name of the predictor variable in the analysis. For a t test, this will always be a dichotomous variable that simply indicates which group a given subject is in. In this analysis, the predictor variable was VIDGRP.
• In the VAR statement, you provide the name of the numeric criterion variable to be analyzed. In the current analysis, the criterion variable was AGGRESS.
• The PROC step ends with the usual TITLE1 and RUN statements.
The Complete SAS Program. Here is the program, including the DATA step, that you can use to analyze the fictitious data from the preceding study:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM VIDGRP $ AGGRESS;
DATALINES;
01 PUN 3
02 PUN 6
03 PUN 4
04 PUN 8
05 PUN 7
06 PUN 0
07 PUN 5
08 PUN 2
09 PUN 4
10 PUN 5
11 PUN 6
12 PUN 1
13 PUN 2
14 PUN 3
15 PUN 4
16 PUN 5
17 PUN 3
18 PUN 4
19 REW 8
20 REW 9
21 REW 5
22 REW 10
23 REW 7
24 REW 7
25 REW 5
26 REW 3
27 REW 4
28 REW 11
29 REW 6
30 REW 6
31 REW 9
32 REW 8
33 REW 6
34 REW 7
35 REW 7
36 REW 10
;
PROC TTEST DATA=D1 ALPHA=0.05;
   CLASS VIDGRP;
   VAR AGGRESS;
   TITLE1 'JANE DOE';
RUN;
Results from the SAS Output

Output 13.1 presents the results obtained from the preceding program.

                               JANE DOE                                1
                          The TTEST Procedure
                              Statistics
                      Lower CL            Upper CL  Lower CL
Variable  Class     N     Mean     Mean      Mean    Std Dev  Std Dev
AGGRESS   PUN      18   2.9766        4    5.0234     1.5443    2.058
AGGRESS   REW      18   6.0338   7.1111    8.1884     1.6256   2.1663
AGGRESS   Diff (1-2)    -4.542   -3.111     -1.68      1.709   2.1128

                              Statistics
                      Upper CL
Variable  Class        Std Dev  Std Err  Minimum  Maximum
AGGRESS   PUN           3.0852   0.4851        0        8
AGGRESS   REW           3.2476   0.5106        3       11
AGGRESS   Diff (1-2)    2.7682   0.7043

                               T-Tests
Variable    Method           Variances      DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal          34      -4.42      <.0001
AGGRESS     Satterthwaite    Unequal      33.9      -4.42      <.0001

                        Equality of Variances
Variable    Method      Num DF    Den DF    F Value    Pr > F
AGGRESS     Folded F        17        17       1.11    0.8350
Output 13.1. Results of the PROC TTEST analysis of aggression data; significant differences observed.
The results in Output 13.1 are divided into four sections. The contents of these sections will be very briefly summarized here, and will be discussed in much greater detail later in this chapter.

The first section of Output 13.1 contains a table labeled “Statistics” that provides simple univariate statistics for the study’s criterion variable. This table provides means, standard deviations, and other statistics. The second section contains another table, also labeled “Statistics,” that provides additional univariate statistics, such as the standard error of the difference between means. The third section contains a table headed “T-Tests” that provides the results of the independent-samples t test. In fact, the results of two t tests are actually presented here; one test assumes equal variances, and a second assumes unequal variances (more on this later). Finally, a fourth table is headed “Equality of Variances.” This section presents the results of an F' statistic which tests the null hypothesis that the two samples were drawn from populations with equal variances.

Steps in Interpreting the Output

1. Make sure that everything looks right. Before reviewing the results of the t tests, you should always review the two tables headed “Statistics” to verify that there were no obvious errors in preparing your SAS program. For example, you should verify that the number of subjects in each condition (as reported in the output) is what you expect. Reviewing these tables will also provide you with an understanding of the general trend in your results. To make this easier, the two statistics tables from Output 13.1 are reproduced as Output 13.2.

                               JANE DOE                                1
                          The TTEST Procedure
                              Statistics
                      Lower CL            Upper CL  Lower CL
Variable  Class     N     Mean     Mean      Mean    Std Dev  Std Dev
AGGRESS   PUN      18   2.9766        4    5.0234     1.5443    2.058
AGGRESS   REW      18   6.0338   7.1111    8.1884     1.6256   2.1663
AGGRESS   Diff (1-2)    -4.542   -3.111     -1.68      1.709   2.1128

                              Statistics
                      Upper CL
Variable  Class        Std Dev  Std Err  Minimum  Maximum
AGGRESS   PUN           3.0852   0.4851        0        8
AGGRESS   REW           3.2476   0.5106        3       11
AGGRESS   Diff (1-2)    2.7682   0.7043

Output 13.2. Two statistics tables from the PROC TTEST analysis of aggression data; significant differences observed.
Below the heading “Variable” you will find the name of the criterion variable in your analysis. In this case, you can see that the criterion variable is AGGRESS (the number of aggressive acts displayed by subjects).
Below the heading “Class” you will find the names of the values that were used to identify the two treatment conditions. In the present analysis, you can see that these two values were PUN (which was used to identify subjects in the model-punished condition) and REW (which was used to identify subjects in the model-rewarded condition). To the right of PUN you find statistics relevant to the model-punished sample (e.g., this sample’s mean and standard deviation on the criterion variable). To the right of REW you find statistics relevant to the model-rewarded sample.

The third entry down in the “Class” column is “Diff (1 – 2).” This row provides information about the difference between sample 1 (the model-punished sample) and sample 2 (the model-rewarded sample). Among other things, this row reports the difference between the means of these two samples on the criterion variable.

The column labeled “N” indicates the number of subjects in each sample. You can see that there were 18 subjects in the model-punished condition, and also 18 subjects in the model-rewarded condition.

The column headed “Mean” provides the average score for each sample on the criterion variable. Output 13.2 shows that the mean score for subjects in the model-punished condition was 4, and that the mean score for subjects in the model-rewarded condition was 7.1111. This means that, on the average, children in the model-rewarded condition displayed a greater number of aggressive acts. However, at this point you do not know whether this difference is statistically significant. You will learn that later, when you review the results of the t test. The third entry in the “Mean” column (to the right of “Diff (1 – 2)”) is –3.111. This indicates that the difference between the means of the model-punished condition versus the model-rewarded condition is equal to –3.111.

The column headed “Std Dev” provides the estimated population standard deviations (sx) for the two samples (computed separately). You can see that the estimated population standard deviation for the model-punished condition is 2.058, and the corresponding statistic for the model-rewarded condition is 2.1663. The third entry in the “Std Dev” column, to the right of “Diff (1 – 2),” is the pooled estimate of the population standard deviation (sp). You will need this statistic when you later compute the index of effect size, d. For the current analysis, Output 13.2 shows that the pooled estimate of the population standard deviation is 2.1128.

The second statistics table has a column headed “Std Err” that provides standard errors. Of greatest interest is the third entry down, which appears to the right of “Diff (1 – 2).” This entry is the standard error of the difference between the means: the estimated standard deviation of the sampling distribution of differences between means. For the current analysis, you can see that the standard error of the difference is 0.7043.

The last two columns in the second statistics table are headed “Minimum” and “Maximum.” These columns display the lowest observed score and the highest observed score (respectively) for your two samples. It is a good idea to review these columns to verify
that they do not contain any “out-of-bounds” values. An out-of-bounds value is a value that is either too low or too high to be reasonable, given the nature of your criterion variable. In the present study, the criterion variable was the number of aggressive acts displayed by the children during a 30-minute period after they watched the videotape. These columns show that, for subjects in the model-punished condition, scores on this criterion variable ranged from a low of zero to a high of 8. For subjects in the model-rewarded condition, scores on the criterion variable ranged from 3 to 11. For both groups these numbers seem reasonable. Therefore, the information from the Minimum and Maximum columns does not provide any evidence that you made any obvious errors in typing your data or preparing other sections of your SAS program.

2. Review the F' test for equality of variances. Output 13.1 shows that PROC TTEST actually computes two t statistics, but only one of these will be relevant for a specific analysis. One of the t statistics is the standard statistic based on the assumption that the two samples were drawn from populations with equal variances. The second t statistic is based on the assumption that the two samples were drawn from populations with unequal variances. To determine which t statistic is appropriate for your analysis, you will refer to the F' test that appears in a table labeled “Equality of Variances.” This table appears toward the bottom of the output of PROC TTEST. For convenience, that table is reproduced again at this point as Output 13.3.

                        Equality of Variances
Variable    Method      Num DF    Den DF    F Value    Pr > F
AGGRESS     Folded F        17        17       1.11    0.8350
Output 13.3. The equality of variances table from the PROC TTEST analysis of aggression data.
The heading “Equality of Variances” that appears above this table conveys the nature of the null hypothesis that is being tested. This null hypothesis states that, in the population, there is no difference between the model-punished condition versus the model-rewarded condition with respect to their variances on the AGGRESS criterion variable (notice that this null hypothesis deals with the difference between variances, not means). PROC TTEST computes an F' statistic to test this null hypothesis. If the p value for the resulting F' test is less than .05, you will reject the null hypothesis of no differences and conclude that the variances are unequal. In this case, you will refer to the results of the unequal variances t test. On the other hand, if the p value is greater than .05, you can tentatively conclude that the variances are equal, and instead use the results of the equal variances t test. The F' statistic relevant to this test appears below the heading “F Value” in Output 13.3. For the present analysis, you can see that the F' value is 1.11. The p value for this F' statistic appears below the heading “Pr > F.” For the current analysis, this p value is 0.8350. This means that the probability of obtaining an F' this large or larger when the population variances are equal is quite large––it is 0.8350. Your obtained p value
is greater than the criterion of .05, and so you fail to reject the null hypothesis of equal variances––instead, you tentatively conclude that the variances are equal. This means that you can interpret the equal-variances t statistic (in the step that follows this one). To review, here is a summary of how you are to interpret the results presented in the Equality of Variances table:
• When the “Pr > F'” value is nonsignificant (greater than .05), report the t test based on equal variances.
• When the “Pr > F'” value is significant (less than .05), report the t test based on unequal variances.
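If you would rather have SAS apply this decision rule for you, one approach is to capture the PROC TTEST results as data sets with ODS OUTPUT and filter them in a DATA step. The sketch below is an addition of mine, not part of the book's program; it assumes the ODS table names (Equality, TTests) and column names (ProbF, Variances) that PROC TTEST used in the SAS 8/9 releases, so verify those names on your own system before relying on it:

ODS OUTPUT Equality=EQ TTests=T;   * capture the two result tables as data sets;
PROC TTEST DATA=D1 ALPHA=0.05;
   CLASS VIDGRP;
   VAR AGGRESS;
RUN;

DATA CHOSEN;
   * carry the folded F p value alongside each t-test row,
     then keep the row that the decision rule names;
   IF _N_ = 1 THEN SET EQ(KEEP=ProbF);
   SET T;
   IF (ProbF >  .05 AND Variances = 'Equal') OR
      (ProbF <= .05 AND Variances = 'Unequal');
RUN;

PROC PRINT DATA=CHOSEN;
RUN;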
3. Review the t test for the difference between the means. You are now ready to determine whether there is a significant difference between your sample means. To do this, you will refer to the “Statistics” table and the “T-Tests” table from your output. For your convenience, those tables are reproduced here as Output 13.4.

                               JANE DOE                                1
                          The TTEST Procedure
                              Statistics
                      Lower CL            Upper CL  Lower CL
Variable  Class     N     Mean     Mean      Mean    Std Dev  Std Dev
AGGRESS   PUN      18   2.9766        4    5.0234     1.5443    2.058
AGGRESS   REW      18   6.0338   7.1111    8.1884     1.6256   2.1663
AGGRESS   Diff (1-2)    -4.542   -3.111     -1.68      1.709   2.1128

                              Statistics
                      Upper CL
Variable  Class        Std Dev  Std Err  Minimum  Maximum
AGGRESS   PUN           3.0852   0.4851        0        8
AGGRESS   REW           3.2476   0.5106        3       11
AGGRESS   Diff (1-2)    2.7682   0.7043

                               T-Tests
Variable    Method           Variances      DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal          34      -4.42      <.0001
AGGRESS     Satterthwaite    Unequal      33.9      -4.42      <.0001
Output 13.4. The statistics table and t-tests table from the PROC TTEST analysis of aggression data; significant differences observed.
The sample means for your two treatment conditions appear in the “Statistics” section of the output in the column headed “Mean.” From Output 13.4, you can see that the mean for the model-punished condition was 4, and the mean for the model-rewarded condition was 7.1111. There appears to be a fairly large difference between these two means. But is the difference statistically significant? To find out, you will review the results of the t test.

The t test that you are about to review is a test of the following statistical null hypothesis (the following is for a nondirectional test):

Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between the subjects in the model-rewarded condition versus the subjects in the model-
punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects). The t statistic relevant to this null hypothesis appears in the “T-Tests” section of the output. From Output 13.4, you can see that there are actually two t statistics reported there. Below the heading “Variances,” you can see the entries “Equal” and “Unequal.” To the right of “Equal,” you will see the results for the equal-variances t test. You should report the results in this row if the F' test (discussed in the preceding section) was nonsignificant. To the right of “Unequal,” you will see the results for the unequal-variances t test. You should report the results in this row if the F' test (discussed in the preceding section) was significant. You will remember that the F' test for the current analysis was nonsignificant. This means that you will focus on the results of the equal-variances t test (i.e., the results that are presented in the “Equal” row). The degrees of freedom for this test appear in the column headed “DF.” You can see that the degrees of freedom for the equal-variances t test are 34. The obtained t statistic for the current analysis appears in the column headed “t Value.” You can see that the equal-variances t statistic for the current analysis is –4.42. The probability value (p value) for this t statistic appears in the column headed “Pr > | t |.” This value estimates the probability that you would obtain the present results if the null hypothesis were true. From Output 13.4, you can see that the p value for the equal-variances t statistic is “<.0001.” This means that the probability that you would obtain the present results if the null hypothesis were true is less than .0001 (less than 1 in 10,000). This is a very low probability. This book recommends that you should generally reject the null hypothesis anytime that your obtained p value is less than .05. For the present analysis, your p value is <.0001, which is much less than .05. Therefore, you will reject the null hypothesis, and will conclude that the difference between your sample means is statistically significant. Rejecting the null hypothesis means that you will tentatively accept your statistical alternative hypothesis. For the current study, the nondirectional alternative hypothesis may be stated as follows: Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between the subjects in the model-rewarded condition versus the subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects). Earlier, you reviewed the column headed “Mean” in Output 13.4, and you saw that the mean for the model-punished condition was 4, and the mean for the model-rewarded condition was 7.1111. The relative size of these means, combined with the results of the t test, tells you that the subjects in the model-rewarded condition displayed a significantly higher number of aggressive acts than the subjects in the model-punished condition.
This section has shown you how to interpret the p value that SAS computed for a nondirectional (two-tailed) test. If you instead wish to perform a directional (one-tailed) test, you should divide the obtained p value by 2. For example, suppose that you perform an analysis, and the SAS output displays a p value of .0800. This is the p value for a nondirectional test (because PROC TTEST computes this by default). If you wish to perform a directional test instead, you would divide this p value by 2, resulting in a final p value of .0400. This p value of .0400 is the appropriate p value for a directional test.

4. Review the confidence interval for the difference between the means. When you use PROC TTEST, SAS computes a confidence interval for the difference between the means. The PROC TTEST statement included in the program for the current analysis contained the following ALPHA= option:

ALPHA=0.05

This option causes SAS to compute the 95% confidence interval. If you had instead wanted the 99% confidence interval, you would have used this option instead:

ALPHA=0.01

The 95% confidence interval appears in the “Statistics” table in the PROC TTEST output. That table is reproduced here as Output 13.5:

                               JANE DOE                                1
                          The TTEST Procedure
                              Statistics
                      Lower CL            Upper CL  Lower CL
Variable  Class     N     Mean     Mean      Mean    Std Dev  Std Dev
AGGRESS   PUN      18   2.9766        4    5.0234     1.5443    2.058
AGGRESS   REW      18   6.0338   7.1111    8.1884     1.6256   2.1663
AGGRESS   Diff (1-2)    -4.542   -3.111     -1.68      1.709   2.1128
Output 13.5. The lower confidence limit and upper confidence limit for the difference between the means.
The row headed “AGGRESS Diff (1-2)” contains the information that you will need at this point. This row provides information about the difference between condition 1 (the model-punished condition) versus condition 2 (the model-rewarded condition). Where the row headed “AGGRESS Diff (1-2)” intersects with the column headed “Mean,” you can see that the observed difference between the sample means is –3.111. This difference is computed by starting with the sample mean for condition 1 (the model-punished condition), and subtracting from it the sample mean of condition 2 (the model-rewarded condition). This observed difference indicates that, on the average, subjects in the model-punished condition displayed 3.111 fewer aggressive acts, compared to subjects in the model-rewarded condition.

As you remember from Chapter 12, a confidence interval extends from a lower confidence limit to an upper confidence limit. To find the lower confidence limit for the difference, find the location where the row headed “AGGRESS Diff (1-2)” intersects with the column headed “Lower CL Mean.” There, you can see that the lower confidence limit for the difference is –4.542. To find the upper confidence limit for the difference, find the location where the row headed “AGGRESS Diff (1-2)” intersects with the column headed “Upper CL Mean.” There, you can see that the upper confidence limit for the difference is –1.68.

When they are combined, these findings indicate that the 95% confidence interval for the difference between means extends from –4.542 to –1.68. This means that you can estimate with a 95% probability that the actual difference between the mean of the model-punished condition and the mean of the model-rewarded condition (in the population) is somewhere between –4.542 and –1.68. Notice that this interval does not contain the value of zero. This is consistent with your rejection of the null hypothesis in the previous section (i.e., you rejected the null hypothesis which stated “In the population, there is no difference between the subjects in the model-rewarded condition versus the subjects in the model-punished condition with respect to their mean scores on the criterion variable”).
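In case you are curious where these limits come from, they follow from the usual formula for a confidence interval around a difference between means: the observed difference, plus or minus the critical t value times the standard error of the difference. The arithmetic below is a check of my own, not part of the book; the critical value of 2.032 (the two-tailed .05 value of t for 34 degrees of freedom) is taken from a t table rather than from the output:

95% CI = –3.111 ± (2.032)(0.7043)
       = –3.111 ± 1.431

which reproduces, within rounding, the limits of –4.542 and –1.68 reported in Output 13.5.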
5. Compute the index of effect size. Earlier in this chapter, you learned that effect size can be defined as the degree to which one sample mean differs from a second sample mean, stated in terms of the standard deviation of the population. The symbol for effect size is d, and the formula for effect size is as follows:

         | X1 – X2 |
    d = ––––––––––––
             sp

where:

X1 = the observed mean of sample 1 (i.e., the subjects in treatment condition 1)
X2 = the observed mean of sample 2 (i.e., the subjects in treatment condition 2)
sp = the pooled estimate of the population standard deviation.

Although SAS does not automatically compute effect size, you can easily do so yourself using the information that appears in the “Statistics” table from the output of PROC TTEST, reproduced as Output 13.6.
                               JANE DOE                                1
                          The TTEST Procedure
                              Statistics
                      Lower CL            Upper CL  Lower CL
Variable  Class     N     Mean     Mean      Mean    Std Dev  Std Dev
AGGRESS   PUN      18   2.9766        4    5.0234     1.5443    2.058
AGGRESS   REW      18   6.0338   7.1111    8.1884     1.6256   2.1663
AGGRESS   Diff (1-2)    -4.542   -3.111     -1.68      1.709   2.1128
Output 13.6. Information needed to compute the index of effect size.
In the preceding formula, X1 is the observed mean of sample 1 (which, in the present study, is the model-punished sample).
The column headed “Mean” shows that the mean for the model-punished condition is 4. In the formula, X2 is the observed mean of sample 2 (which, in the present study, is the model-rewarded sample). The column headed “Mean” shows that the mean for the model-rewarded condition is 7.1111. In the preceding formula, sp represents the pooled estimate of the population standard deviation. This statistic appears in Output 13.6 in the location where the row headed “AGGRESS Diff (1-2)” intersects with the column headed “Std Dev.” For the present analysis, you can see that the pooled estimate of the population standard deviation is 2.1128. You can now insert these statistics into the formula and compute the index of effect size in this way:

         | X1 – X2 |
    d = ––––––––––––
             sp

         | 4.0000 – 7.1111 |
    d = –––––––––––––––––––
              2.1128

         | –3.1111 |
    d = ––––––––––––
            2.1128

          3.1111
    d = ––––––––––
           2.1128

    d = 1.4725

    d = 1.47
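If you would rather let SAS do this arithmetic, a short DATA step will serve. This step is an addition of mine rather than part of the book's program; the three numbers are simply typed in from Output 13.6:

DATA EFFECT;
   * d = |X1 - X2| / sp, using the two means and the pooled
     standard deviation from Output 13.6;
   D = ABS(4.0000 - 7.1111) / 2.1128;
RUN;

PROC PRINT DATA=EFFECT;
RUN;

Running this prints a D value of approximately 1.47, matching the hand computation above.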
And so the obtained index of effect size for the current analysis is 1.47. This means that the sample mean for the model-punished condition differs from the sample mean of the model-rewarded condition by 1.47 standard deviations. To determine whether this is a relatively large difference or a relatively small difference, you can consult the guidelines provided by Cohen (1969). Cohen’s guidelines are reproduced in Table 13.2:
Table 13.2
Guidelines for Interpreting Effect Size
_________________________________________
Effect size         Obtained d statistic
_________________________________________
Small effect        d = .20
Medium effect       d = .50
Large effect        d = .80
_________________________________________
Your obtained d statistic of 1.47 is larger than the “large effect” value of .80 that appears in Table 13.2. This means that the manipulation in your study produced a relatively large effect.

Summarizing the Results of the Analysis

The following format may be used to summarize the results of the analysis:

A) Statement of the research question: The purpose of this study was to determine whether children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

B) Statement of the research hypothesis: Children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
• The predictor variable was the observed consequences for the model. This was a dichotomous variable that was assessed on a nominal scale and included two levels: a model-rewarded condition (coded as REW), and a model-punished condition (coded as PUN).
• The criterion variable was the number of aggressive acts displayed by the children after observing the model. This was a multi-value variable and was assessed on a ratio scale.

D) Statistical test: Independent-samples t test.
E) Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between the subjects in the model-rewarded condition versus the subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between the subjects in the model-rewarded condition versus the subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

G) Obtained statistic: t = –4.42
H) Obtained probability (p) value: p < .0001
I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the model-rewarded condition from the mean of the model-punished condition resulted in an observed difference of –3.111. The 95% confidence interval for this difference extended from –4.542 to –1.68.

K) Effect size: d = 1.47.
L) Conclusion regarding the research hypothesis: These findings provide support for the study’s research hypothesis.

M) Formal description of results for a paper: Results were analyzed using an independent-samples t test. This analysis revealed a significant difference between the two conditions, t(34) = –4.42, p < .0001. The sample means are displayed in Figure 13.2, which shows that subjects in the model-rewarded condition scored significantly higher on aggression compared to subjects in the model-punished condition (for model-rewarded group, M = 7.11, SD = 2.17; for model-punished group, M = 4.00, SD = 2.06). The observed difference between the means was –3.11, and the 95% confidence interval for the difference between means extended from –4.54 to –1.68. The effect size was computed as d = 1.47. According to Cohen’s (1969) guidelines, this represents a relatively large effect.
N) Figure representing the results:
Figure 13.2. Mean number of subject aggressive acts as a function of the observed consequences for the model.
Notes regarding the preceding report. Item M of the preceding report provided a description of the results for a published paper. The second sentence of that summary reports the obtained t statistic in the following way: t(34) = –4.42, p < .0001. The number 34 that appears in parentheses in the preceding excerpt represents the degrees of freedom for the analysis. With an independent-samples t test, the degrees of freedom are equal to N – 2, where N represents the total number of subjects from both groups combined. In the present case, N = 36, so it makes sense that the degrees of freedom would be 36 – 2 = 34. In Output 13.4 (presented earlier), the degrees of freedom appeared below the heading “DF.”

The third sentence of Item M contains the following excerpt:

...(for model-rewarded group, M = 7.11, SD = 2.17; for model-punished group, M = 4.00, SD = 2.06)

In this excerpt, the symbol M represents “sample mean” and SD represents “sample standard deviation.” These statistics may be found in Output 13.6: means appear in the column headed “Mean,” and standard deviations appear in the column headed “Std Dev.”
Example 13.2: An Illustration of Results Showing Nonsignificant Differences

Overview

This section presents the results of an analysis of a different data set––a data set designed to produce nonsignificant results. This will allow you to see how nonsignificant results will appear in the output of PROC TTEST. A later section will also show you how to summarize nonsignificant results in an analysis report.

The SAS Output

Output 13.7 contains the results of the analysis of a different fictitious data set, one in which the means for the two treatment conditions were not significantly different.

                               JANE DOE                                1
                          The TTEST Procedure
                              Statistics
                      Lower CL            Upper CL  Lower CL
Variable  Class     N     Mean     Mean      Mean    Std Dev  Std Dev
AGGRESS   PUN      18   2.9766        4    5.0234     1.5443    2.058
AGGRESS   REW      18   3.9143   4.9444    5.9745     1.5544   2.0714
AGGRESS   Diff (1-2)    -2.343   -0.944    0.4542     1.6701   2.0647

                              Statistics
                      Upper CL
Variable  Class        Std Dev  Std Err  Minimum  Maximum
AGGRESS   PUN           3.0852   0.4851        0        8
AGGRESS   REW           3.1054   0.4882        1        9
AGGRESS   Diff (1-2)    2.7052   0.6882

                               T-Tests
Variable    Method           Variances      DF    t Value    Pr > |t|
AGGRESS     Pooled           Equal          34      -1.37      0.1790
AGGRESS     Satterthwaite    Unequal        34      -1.37      0.1790

                        Equality of Variances
Variable    Method      Num DF    Den DF    F Value    Pr > F
AGGRESS     Folded F        17        17       1.01    0.9789
Output 13.7. Results of PROC TTEST analysis of aggression data; nonsignificant differences observed.
Interpreting the Output

Overview. You would normally interpret Output 13.7 following the same steps that were listed in Example 13.1. To save space, however, this section focuses on the results that are most relevant to the significance test, the confidence interval, and the index of effect size.
Review the F' test for equality of variances. This test appears in the table headed “Equality of Variances.” You will remember that you begin with the null hypothesis that, in the population, the variances for the two conditions are equal. The p value associated with this null hypothesis is .9789. Because this is greater than the usual criterion of .05, you conclude that it is nonsignificant and fail to reject the null hypothesis. This means that you will interpret the equal-variances t test in the next step.

Review the t test for the difference between the means. In the column headed “Mean,” you can see that the mean for the model-punished condition is 4, and the mean for the model-rewarded condition is 4.9444. There is a difference between the means, but is this difference large enough to be statistically significant? To find out, you will review the t statistic. You can see that the obtained t statistic for the current analysis is –1.37, and that the p value associated with this statistic is .1790. This p value is above the standard criterion of .05, and so you conclude that your results are nonsignificant and fail to reject the null hypothesis. You will conclude that the observed difference between the means is probably due to sampling error.

Review the confidence interval for the difference between the means. The “Mean” column in Output 13.7 shows that the observed difference between the model-punished condition and the model-rewarded condition is –.944. The “Lower CL Mean” column shows that the lower confidence limit for this difference is –2.343. The “Upper CL Mean” column shows that the upper confidence limit for this difference is .4542. Combined, this means that the 95% confidence interval for the difference between means extends from –2.343 to .4542. Notice that this interval does include the value of zero, which is consistent with your finding that the difference between means is nonsignificant.

Compute the index of effect size. You have already seen that the mean for the model-punished condition is 4, and the mean for the model-rewarded condition is 4.9444. The only other piece of information that you need to compute the effect size is sp, the pooled estimate of the population standard deviation. This appears in Output 13.7 as the third entry in the column headed “Std Dev.” There, you can see that the pooled estimate is 2.0647. You may now insert these statistics into the formula for effect size:

         | X1 – X2 |
    d = ––––––––––––
             sp

         | 4.0000 – 4.9444 |
    d = –––––––––––––––––––
              2.0647

         | –.9444 |
    d = –––––––––––
           2.0647

           .9444
    d = ––––––––––
           2.0647

    d = .4574

    d = .46
And so the index of effect size for the current analysis is .46. According to Cohen’s guidelines appearing in Table 13.2, this falls somewhere between a small effect and a medium effect.

Summarizing the Results of the Analysis

Here is an example of how you might summarize the preceding analysis:

A) Statement of the research question: The purpose of this study was to determine whether children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

B) Statement of the research hypothesis: Children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
• The predictor variable was the observed consequences for the model. This was a dichotomous variable that was assessed on a nominal scale and included two levels: a model-rewarded condition (coded as REW), and a model-punished condition (coded as PUN).
• The criterion variable was the number of aggressive acts displayed by the children after observing the model. This was a multi-value variable and was assessed on a ratio scale.

D) Statistical test: Independent-samples t test.
E) Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between the subjects in the model-rewarded condition versus the subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between the subjects in the model-rewarded condition versus the subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

G) Obtained statistic: t = –1.37

H) Obtained probability (p) value: p = .1790

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the model-rewarded condition from the mean of the model-punished condition resulted in an observed difference of –.944. The 95% confidence interval for this difference extended from –2.343 to .4542.

K) Effect size: d = .46.
L) Conclusion regarding the research hypothesis: These findings fail to provide support for the study’s research hypothesis.

M) Formal description of results for a paper: Results were analyzed using an independent-samples t test. This analysis revealed a nonsignificant difference between the two conditions, t(34) = –1.37, p = .1790. The sample means are displayed in Figure 13.3, which shows that the subjects in the model-rewarded condition displayed a mean score on aggression that was similar to that displayed by the subjects in the model-punished condition (for model-rewarded group, M = 4.94, SD = 2.07; for model-punished group, M = 4.00, SD = 2.06). The observed difference between the means was –.94, and the 95% confidence interval for the difference between means extended from –2.34 to .45. The effect size was computed as d = .46. According to Cohen’s (1969) guidelines, this falls somewhere between a small effect and a medium effect.
N) Figure representing the results:
Figure 13.3. Mean number of subject aggressive acts as a function of the observed consequences for the model (nonsignificant results).
Conclusion

An earlier section in this chapter indicated that researchers make a distinction between an independent-samples t test versus a paired-samples t test. This chapter has focused on the independent-samples test, which is appropriate when the subjects in Condition 1 and Condition 2 are two entirely different groups of people, and when the subjects in Condition 1 are not matched with subjects in Condition 2 in any systematic way. But what if subjects in Condition 1 were matched with subjects in Condition 2? For example, what if each subject in Condition 1 were paired with a corresponding subject in Condition 2 on the basis of similarity in demographic characteristics? For instance, a young male Caucasian in Condition 1 might be paired with a young male Caucasian in Condition 2; a middle-aged female African-American in Condition 1 might be paired with a middle-aged female African-American in Condition 2, and so forth. Data obtained from a study such as this should not be analyzed using the independent-samples t test discussed in the current chapter, because the observations from the two groups are no longer independent. Instead, these data should be analyzed using a paired-samples t test, a statistical procedure that is covered in the following chapter.
Chapter 14: Paired-Samples t Test

Introduction..........................................................................................453
Overview................................................................................................................ 453
Situations Appropriate for the Paired-Samples t Test ........................453
Overview................................................................................................................ 453
The Independent-Samples t Test versus the Paired-Samples t Test ..................... 453
Nature of the Predictor and Criterion Variables ..................................................... 454
The Type-of-Variable Figure .................................................................................. 454
Examples of Studies Providing Data Appropriate for This t Test ........................... 454
Summary of Assumptions Underlying the Paired-Samples t Test.......................... 456
Similarities between the Paired-Samples t Test and the Single-Sample t Test .......................................................................457
Introduction ............................................................................................................ 457
A Repeated-Measures Study ................................................................................. 457
Outcome 1: The Manipulation Has an Effect ........................................................ 457
Outcome 2: The Manipulation Has No Effect........................................................ 459
Summary ............................................................................................................... 460
Results Produced in a Paired-Samples t Test .....................................461
Overview................................................................................................................ 461
Test of the Null Hypothesis .................................................................................... 461
Confidence Interval for the Difference between the Means ................................... 462
Effect Size.............................................................................................................. 463
Example 14.1: Women’s Responses to Emotional versus Sexual Infidelity...............................................................................463
Overview................................................................................................................ 463
Background............................................................................................................ 464
The Study .............................................................................................................. 464
The Predictor and Criterion Variables in the Analysis ............................................ 466
Data Set to Be Analyzed........................................................................................ 467
The SAS DATA Step for the Program.................................................................... 468
The PROC Step for the Program ........................................................................... 468
The Complete SAS Program ................................................................................. 471
Steps in Interpreting the Output ............................................................................. 472
Summarizing the Results of the Analysis............................................................... 479
Notes Regarding the Preceding Report ................................................................. 481
Example 14.2: An Illustration of Results Showing Nonsignificant Differences..............................................................483
Overview................................................................................................................ 483
The SAS Output..................................................................................................... 483
Steps in Interpreting the Output ............................................................................. 483
Summarizing the Results of the Analysis............................................................... 485
Conclusion............................................................................................487
Introduction

Overview

This chapter shows you how to use the SAS System to perform a paired-samples t test, also known as a correlated-samples t test, a matched-samples t test, a t test for dependent samples, and a t test for a within-subjects design. This is a parametric procedure that is appropriate when you want to determine whether the mean score that is obtained under one condition is significantly different from the mean score obtained under a second condition. With this test, each score in one condition is paired with, or dependent upon, a specific score in the second condition. This chapter shows you how to write the appropriate SAS program, interpret the output, and prepare a report that summarizes the results of the analysis.
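Later sections develop the program in full. As a preview, the heart of such a program is typically PROC TTEST's PAIRED statement, which names the two variables that hold each subject's scores under the two conditions. The skeleton below is a sketch of mine with placeholder names, shown in the same style as the book's syntax templates:

PROC TTEST DATA=data-set-name ALPHA=alpha-level;
   PAIRED score-under-condition-1*score-under-condition-2;
   TITLE1 'your-name';
RUN;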
Situations Appropriate for the Paired-Samples t Test

Overview

The paired-samples t test is a test of differences between means. You use this test when you want to compare average scores on a criterion variable obtained under two conditions, to determine whether there is a significant difference between the two means. Your study should contain no more than two conditions. The criterion variable must be on an interval or ratio level, and each observation in one set of scores must be paired in a meaningful way with a corresponding observation in the second set of scores. This section describes the types of situations in which this statistic is typically computed. A summary of assumptions underlying the procedure is presented at the end of the section.

The Independent-Samples t Test versus the Paired-Samples t Test

Before conducting a t test, it is important to determine whether your data should be analyzed using an independent-samples t test or a paired-samples t test. Briefly, the independent-samples t test is appropriate if the observations that are obtained under one treatment condition are independent of (unrelated to) the observations obtained under the other treatment condition. For example, this would be the case if both of the following were true:
• you conducted an experiment with one group of subjects in the experimental condition and an entirely different group of subjects in the control condition
• you made no effort to match subjects in the two conditions.

In contrast, a paired-samples t test is appropriate if each observation in one set of scores is paired in a meaningful way with a corresponding observation in the second set of scores. This is normally accomplished either by using a repeated-measures design or a matching procedure.
This chapter provides examples of studies that would provide data appropriate for a paired-samples t test. A more complete discussion of the differences between independent samples versus paired samples is provided in Chapter 13, “Independent-Samples t Test,” in the section titled “Independent Samples versus Paired Samples.”

Nature of the Predictor and Criterion Variables

Predictor variable. To perform a paired-samples t test, the predictor (or independent) variable should be a dichotomous variable (i.e., a variable that assumes only two values). The predictor variable may be assessed on any scale of measurement: nominal, ordinal, interval, or ratio.

Criterion variable. The criterion (or dependent) variable should be a numeric variable that is assessed on an interval or ratio scale of measurement.

The Type-of-Variable Figure

The figure below illustrates the types of variables that are typically being analyzed when performing a paired-samples t test.

   Criterion        Predictor
     Multi      =      Di
The “Multi” symbol that appears in the above figure shows that the criterion variable in a paired-samples t test is typically a multi-value variable (a variable that assumes more than six values in your sample). The “Di” symbol that appears to the right of the equal sign in the above figure shows that the predictor variable in this procedure is a dichotomous variable (i.e., a variable that assumes only two values).

Examples of Studies Providing Data Appropriate for This t Test

Overview. Earlier, this chapter stated that the paired-samples t test is typically used to analyze data from studies that employ either a repeated-measures design, or a subject-matching procedure. These two approaches to research are illustrated in the following two studies.

Study 1. A physiological psychologist wants to determine whether there is a relationship between the affiliative motive (the desire for warm relationships with others) and release of the neurotransmitter dopamine. She measures dopamine levels in saliva in a single group of subjects at Time 1, and then shows the group a film designed to arouse the affiliative motive. After showing the film (at Time 2), she again measures dopamine levels in saliva. She analyzes the data to determine whether the mean dopamine level obtained at Time 2 is significantly higher than the mean dopamine level obtained at Time 1. Because this research
design involves taking repeated measures from a single sample of subjects, she uses a paired-samples t test to compare Time 1 dopamine levels to Time 2 dopamine levels.

Why these data would be appropriate for this procedure. In this study, the predictor variable is “treatment condition” (“before the affiliative-film” condition versus “after the affiliative-film” condition). You know that this is a dichotomous variable because it consists of only two conditions. The criterion variable is “dopamine levels in saliva.” Earlier sections indicated that, to perform a paired-samples t test, the criterion variable must be on an interval or ratio scale of measurement. Here, you can assume that the researcher’s measure of dopamine levels in saliva is on a ratio scale because it has equal intervals and a true zero point (i.e., when subjects have a score of zero, it means that they have no dopamine in their saliva). Finally, earlier sections indicated that, to perform a paired-samples t test, each score in one condition must be paired with, or dependent upon, a specific score in the second condition. The researcher achieves this in the current study by using a repeated-measures research design. Each subject provides scores on the criterion variable under both experimental conditions: They provide scores on the dopamine measure (a) at Time 1 (before watching the film), and again (b) at Time 2 (after watching the film). In performing the paired-samples t test, each subject’s score at Time 1 will be paired with his or her score at Time 2 (later sections will show how this is done).

Study 2. A doctoral candidate in political science wants to determine whether the way that political issues are presented affects citizen support for government programs. To find out, he prepares two different versions of a film about a federal program that provides aid to the poor.
• The “abstract” version of the film deals with issues of poverty in abstract, impersonal terms, using a good number of statistics and charts.
• The “personal” version of the film discusses the same issues by focusing on the lives of two families actually living in poverty.

He shows the abstract version of the film to one group of subjects, and the personal version of the film to a second group. After the film, each subject rates his or her support for the federal program described. Before the study was conducted, each subject in the abstract group was paired with a similar subject in the personal group. Subjects were matched so that the two subjects in each pair were similar with respect to income, sex, and education. In other words, this is a study that uses a matched-subjects design.

Why these data would be appropriate for this procedure. In this study, the predictor variable is “presentation condition” (“abstract” condition versus “personal” condition). You know that this is a dichotomous variable because it consists of only two conditions. The criterion variable is “rated support for the program.” Most researchers would agree that, if this criterion variable is assessed with a carefully developed summated rating scale, it is
probably on an interval scale of measurement (meaning that it displays equal intervals but no true zero point). As was stated earlier, an additional requirement for a paired-samples t test is that each score in one condition must be paired with, or dependent upon, a specific score in the second condition. Although there are two groups of subjects in this study, they are not truly independent because the groups were formed using a matching procedure. Specifically, subjects were matched for income, sex, and education. This means that, for example, a wealthy female with a college education in the abstract condition was matched with a wealthy female with a college education in the personal condition; a poor male with a high school education in the abstract condition was matched with a poor male with a high school education in the personal condition, and so on. In performing the t test, a particular subject’s score on the criterion variable in one group will be paired with the score of his or her matched counterpart in the other group.

Summary of Assumptions Underlying the Paired-Samples t Test

Level of measurement. The criterion variable should be assessed on an interval or ratio level of measurement. The predictor variable should be a dichotomous variable (that is, it should have only two categories), and it may be assessed on any level of measurement.

Paired observations. A particular observation that appears in one condition must be paired in some meaningful way with a corresponding observation that appears in the other condition. This is often accomplished by using a repeated-measures design in which each subject contributes one score under Condition 1, and a separate score under Condition 2. Observations can also be paired by using a matching procedure.

Independent observations. A particular subject’s score in one condition should not be affected by any other subject’s score in either of the two conditions. It is, of course, acceptable for a subject’s score in one condition to be dependent upon his or her own score in the other condition. This is another way of saying that it is acceptable for subjects’ scores in Condition 1 to be correlated with their scores in Condition 2.

Random sampling. Subjects contributing data should represent a random sample drawn from the populations of interest.

Normal distribution for difference scores. The differences in paired scores should be normally distributed. These difference scores are usually created by (a) beginning with a subject’s score on the criterion variable obtained under one treatment condition, and (b) subtracting from it that subject’s score on the criterion variable obtained under the other treatment condition (the nature of these “difference scores” will be discussed in greater detail in the following section). It is not necessary that the individual criterion variables be normally distributed, as long as the distribution of difference scores is normally distributed.

Homogeneity of variance. The populations represented by the two conditions should have equal variances on the criterion.
Similarities between the Paired-Samples t Test and the Single-Sample t Test

Introduction

This section explains why a paired-samples t test is essentially equivalent to a single-sample t test. It begins by describing a fictitious study that would provide data appropriate for a paired-samples t test. It shows how it is possible to use the data from this study to create difference scores and then perform a single-sample test on these difference scores. Finally, it explains why the results of this single-sample t test can be used to determine whether there is a significant difference between the scores that were obtained under the two treatment conditions. Tables illustrate the type of data that you would expect to see if your experimental manipulation had an effect, as well as the type of data that you would expect to see if your manipulation had no effect.

A Repeated-Measures Study

Suppose that you conduct a study in which you wish to determine whether taking the herb ginkgo biloba will affect subject scores on a test of memory. For your study, you ask a sample of six subjects to ingest 60 mg of ginkgo daily for one month. At the end of the month you administer a memory test to these subjects. With this test, higher scores indicate better memory. You refer to this as the “ginkgo condition” (Condition 1). Later, you have the same six subjects ingest placebo pills (pills that do not contain any substance expected to have any effect on memory). After doing this for one month, you have these subjects take the same memory test that was taken earlier. You refer to this as the “placebo condition” (Condition 2).

Outcome 1: The Manipulation Has an Effect

The results. Table 14.1 provides some fictitious data obtained from the memory study. These are the type of data that you would expect to see if your manipulation had an effect on memory test scores.
Table 14.1
Scores on the Memory Task Obtained under the Ginkgo Condition versus the Placebo Condition (Outcome 1)
________________________________________________________
          Ginkgo          Placebo
          condition       condition       Difference
Subject   (Condition 1)   (Condition 2)   scores (D)
________________________________________________________
1         50              40              10
2         65              60               5
3         40              34               6
4         62              50              12
5         70              64               6
6         61              52               9
________________________________________________________
Means:    58              50               8
________________________________________________________
Below the heading “Subject” in the table, you will find subject numbers assigned to each participant (six subjects participated in the study). Below the heading “Ginkgo condition (Condition 1)” you will find each subject’s score on the memory task under the ginkgo condition. Below the heading “Placebo condition (Condition 2)” you will find each subject’s score on the memory test under the placebo condition. Reviewing the results one row at a time, you can see that Subject 1 had a score of 50 on the memory test under the ginkgo condition, and a score of 40 under the placebo condition. Subject 2 had a score of 65 under the ginkgo condition, and a score of 60 under the placebo condition. Results for the remaining subjects can be interpreted in the same way.

The mean scores obtained under the two conditions appear to the right of the heading “Means” at the bottom of the table. In Table 14.1, you can see that the mean memory test score obtained under the ginkgo condition was 58, and the mean memory test score obtained under the placebo condition was 50. Since the mean score obtained under the ginkgo condition was somewhat higher than that obtained under the placebo condition, this may be seen as evidence that taking ginkgo has a positive effect on memory (you should not draw any conclusions at this point, as you have not yet performed any statistical analyses).

Creating difference scores. For each subject, it is possible to create a difference score by subtracting the score obtained under the placebo condition from the score obtained under the ginkgo condition. This difference score will be represented using the symbol D. In Table 14.1, difference scores have already been created for each subject, and they appear in the column headed “Difference scores (D).” You can see that
• the difference score for Subject 1 was 10 (because 50 – 40 = 10)
• the difference score for Subject 2 was 5 (because 65 – 60 = 5)
• the difference score for Subject 3 was 6 (because 40 – 34 = 6)

The difference scores for the remaining subjects can be interpreted in the same way.
In Table 14.1, the mean of the difference scores is shown where the row headed “Means” intersects with the column headed “Difference scores (D).” You can see that the average difference score was 8. This means that, on the average, scores obtained under the ginkgo condition were 8 points higher than scores obtained under the placebo condition. Again, this is consistent with the idea that ginkgo might have a positive effect on memory (if the average difference score had been a negative number, it would have been consistent with the idea that ginkgo has a negative effect on memory).

Performing a single-sample t test. Until now, you have been looking at the data, but have not performed any significance tests. If you wanted to know whether there is a significant difference between the mean memory test score obtained under the ginkgo condition versus the placebo condition, it would be possible to perform a single-sample t test on the scores appearing in the column headed “Difference scores (D).” In this analysis, you could test the null hypothesis that this sample was drawn from a population in which the mean difference score is equal to zero. Symbolically, this null hypothesis can be represented like this:

H0: µ = 0

Again, your null hypothesis states that your sample was drawn from a population in which the mean difference score is equal to zero. This is because the null hypothesis is the hypothesis of no difference; it is the hypothesis that states that your manipulation did not have any effect. If ginkgo does not really have any effect on memory, this means that (in the population) there should be no difference between average memory scores obtained under the ginkgo condition versus those obtained under the placebo condition.

When you perform a paired-samples t test, you are actually performing a single-sample t test on a sample of difference scores (D scores). In this analysis, you test the null hypothesis that your sample was drawn from a population of D scores in which the average score is equal to zero. If the average difference score observed in your sample is substantially different from zero, you will reject this null hypothesis and conclude that your manipulation probably did have an effect. If the average difference score obtained in your sample is fairly close to zero, you will fail to reject the null hypothesis and conclude that your manipulation probably did not have an effect.

Outcome 2: The Manipulation Has No Effect

As an additional example, this section provides some fictitious results that would be consistent with the idea that ginkgo does not have an effect on memory. Table 14.2 contains the same headings that appeared in Table 14.1, but you can see that a different data set has now been inserted in the body of the table.
Table 14.2
Scores on the Memory Task Obtained under the Ginkgo Condition versus the Placebo Condition (Outcome 2)
________________________________________________________
            Ginkgo          Placebo
            condition       condition       Difference
Subject     (Condition 1)   (Condition 2)   scores (D)
________________________________________________________
   1             50              45               5
   2             65              70              -5
   3             40              39               1
   4             62              63              -1
   5             70              66               4
   6             61              65              -4
________________________________________________________
Means:           58              58               0
________________________________________________________
To the right of the heading “Means” (toward the bottom of the table), you can see that the observed sample mean obtained under the ginkgo condition is 58, and the mean obtained under the placebo condition is now also 58. Because the means for the two conditions are now identical, these data fail to support the idea that ginkgo has a positive effect on memory.

In the column headed “Difference scores (D),” you can see that some of the difference scores are now positive, and some are negative. In fact, the negative difference scores now cancel out the positive ones, resulting in a mean difference score of zero. Again, this result fails to support the idea that ginkgo has a positive effect on memory.

In the present example, the average difference score is equal to zero. However, remember that your obtained average difference score does not have to be exactly equal to zero to be seen as evidence that your manipulation had no effect. Even when the manipulation is ineffective, you should still expect the mean difference score to sometimes be a bit above zero, and to sometimes be a bit below zero, simply due to sampling error. To determine whether it is far enough above (or below) zero to reject the null hypothesis, you should perform a paired-samples t test.

Summary

In summary, a paired-samples t test is essentially equivalent to a single-sample t test. When you use SAS to perform a paired-samples t test, the application (a) creates difference scores by subtracting the scores obtained under Condition 2 from the scores obtained under Condition 1, and (b) performs a single-sample t test on these difference scores. If the probability value (p value) obtained from this single-sample t test is significant (i.e., if the p value is less than .05), you may conclude that you have a statistically significant difference between the mean scores obtained in Condition 1 versus Condition 2.
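To make the equivalence concrete, here is a minimal sketch (not part of the chapter’s own program, with a data set name and variable names of my own choosing) that enters the Table 14.1 data, computes each subject’s D score in the DATA step, and then asks PROC TTEST for a single-sample t test of those D scores against zero:

OPTIONS LS=80 PS=60;
DATA MEMORY;
   INPUT SUB_NUM GINKGO PLACEBO;
   D = GINKGO - PLACEBO;   * Difference score for each subject ;
DATALINES;
1 50 40
2 65 60
3 40 34
4 62 50
5 70 64
6 61 52
;
PROC TTEST DATA=MEMORY H0=0;
   VAR D;   * Single-sample t test: is the mean D score zero? ;
   TITLE1 'SINGLE-SAMPLE T TEST ON D SCORES';
RUN;

The t statistic and p value that this produces are identical to those from a PROC TTEST run with a PAIRED statement on the two raw-score variables, because both analyses test the same null hypothesis about the mean of the D scores.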
Results Produced in a Paired-Samples t Test

Overview

When you perform a paired-samples t test, you can interpret the test of the null hypothesis, the confidence interval for the difference between means, and the index of effect size. This section discusses the meaning of these results.

Test of the Null Hypothesis

The single-sample convention. The preceding section indicated that a paired-samples t test is essentially equivalent to a single-sample t test. For this reason, many statistics textbooks show readers how to state a null hypothesis for a paired-samples test using a convention similar to that used with the single-sample t test. For example, here is how you could state a nondirectional statistical null hypothesis for the preceding memory study using the single-sample convention:

Statistical null hypothesis (H0): µ = 0; In the population, the average difference score created by subtracting placebo condition scores from ginkgo condition scores is equal to zero.

Here is the corresponding nondirectional statistical alternative hypothesis:

Statistical alternative hypothesis (H1): µ ≠ 0; In the population, the average difference score created by subtracting placebo condition scores from ginkgo condition scores is not equal to zero.

When you perform the paired-samples t test, SAS computes an obtained t statistic and a p value associated with that statistic. If your obtained p value is less than .05, you may reject the null hypothesis and tentatively accept the alternative hypothesis. In your report, you would indicate that you obtained a statistically significant difference between mean scores for the two conditions.

The two-sample convention. There is, however, another way to state the same hypotheses. Because a paired-samples t test typically involves comparing scores obtained under two treatment conditions, it is also possible to use conventions for stating hypotheses similar to those introduced in Chapter 13, “Independent-Samples t Test.” The following illustrates the way that you would state a nondirectional statistical null hypothesis for the memory study using the two-sample convention:

Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between the placebo condition versus the ginkgo condition with respect to mean scores on the criterion variable (the memory test).
Here is the corresponding nondirectional statistical alternative hypothesis:

Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between the placebo condition versus the ginkgo condition with respect to mean scores on the criterion variable (the memory test).

Again, when you perform the paired-samples t test, you review the obtained t statistic and p value computed by SAS. If the p value is less than .05, you reject the null hypothesis and tentatively accept the alternative.

Convention to be used in this chapter. These two approaches to stating hypotheses are essentially equivalent, so either approach may be used. To maintain continuity with the preceding chapter, however, this chapter will focus on the two-sample convention illustrated above.

Nondirectional versus directional tests. With the paired-samples t test, you may perform either a nondirectional test (in which you do not make a specific prediction about which condition will display the higher mean) or a directional test (in which you do make a specific prediction about which condition will display the higher mean). The exact nature of your null hypothesis and alternative hypothesis will vary depending on whether you plan to perform a directional test or a nondirectional test. To conserve space, this section will not review the differences between these two types of tests and hypotheses, because they were discussed in detail in the previous chapter. See the section titled “Test of the Null Hypothesis” in Chapter 13, “Independent-Samples t Test.”

Confidence Interval for the Difference between the Means

When you use PROC TTEST to perform a paired-samples t test, SAS automatically computes a confidence interval for the difference between the means. In Chapter 12, “The Single-Sample t Test,” you learned that a confidence interval is an interval that extends from a lower confidence limit to an upper confidence limit and is assumed to contain a population parameter with a stated probability, or level of confidence. In that chapter, you learned that SAS can compute the confidence interval for a mean. When you perform a paired-samples t test, SAS instead computes a confidence interval for the difference between the means observed in the two conditions.

For example, again consider the ginkgo biloba study described earlier. Assume that the mean score on the memory test under the ginkgo condition is 56, and the mean score under the placebo condition is 50. The difference between the two means is 56 – 50 = 6, so the observed difference between means is equal to 6. Assume that you analyze your data using PROC TTEST, and SAS estimates that the 95% confidence interval for this difference extends from 4 (the lower confidence limit) to 8 (the upper confidence limit). This means that there is a 95% probability that, in the population, the actual difference between the ginkgo condition versus the placebo condition is somewhere between 4 and 8 points on the memory test.
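If you ever want to verify such an interval by hand, the limits follow from the observed mean difference, its standard error, and the critical t value. The sketch below is my own illustration: the mean difference of 6 and the degrees of freedom (N – 1 = 5 for the six-subject memory study) come from the example above, but the standard error of 0.78 is a hypothetical value chosen so that the limits land near 4 and 8:

DATA _NULL_;
   MDIFF = 6;                 * Observed difference between means        ;
   SE    = 0.78;              * Hypothetical standard error              ;
   DF    = 5;                 * Degrees of freedom (N - 1)               ;
   TCRIT = TINV(0.975, DF);   * Critical t for a 95% confidence interval ;
   LOWER = MDIFF - TCRIT * SE;
   UPPER = MDIFF + TCRIT * SE;
   PUT 'LOWER CL = ' LOWER '  UPPER CL = ' UPPER;
RUN;

PROC TTEST carries out this same calculation for you automatically, so in practice you will simply read the limits from its output.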
Effect Size

When you perform a paired-samples t test, effect size can be defined as the degree to which the mean obtained under one condition differs from the mean obtained under the second condition, stated in terms of the standard deviation of the population of difference scores. The symbol for effect size is d, and the formula (adapted from Spatz [2001]) is as follows:

d = | X1 – X2 | / sD
where:

X1 = the observed mean of the sample of scores obtained under Condition 1
X2 = the observed mean of the sample of scores obtained under Condition 2
sD = the estimated standard deviation of the population of difference scores.

You can see that the procedure for computing d is fairly straightforward: You simply subtract one mean from the other and divide the absolute value of the result by the estimated standard deviation of the difference scores. The resulting statistic represents the number of standard deviations by which the two means differ from one another. The index of effect size is not automatically computed by PROC TTEST, although it can easily be calculated by hand from other statistics that appear on the output of PROC MEANS and PROC TTEST. A later section of this chapter will show how this is done, and will provide some guidelines for interpreting the size of d.
Example 14.1: Women’s Responses to Emotional versus Sexual Infidelity

Overview

This section describes a fictitious study in which female subjects provide scores on a measure of psychological distress under two treatment conditions: (a) while imagining their partner engaging in emotional infidelity, and (b) while imagining their partner engaging in sexual infidelity. The chapter shows how to perform a paired-samples t test to determine whether the mean distress score obtained under the emotional infidelity condition is significantly different from the mean distress score obtained under the sexual infidelity condition. The section begins by describing the study and the data produced by the subjects. It shows how to prepare the SAS DATA step and how to write the PROC statements that will provide the needed results. It shows how to interpret the SAS output and prepare an analysis report. Special emphasis is placed on
• interpreting the test of the null hypothesis
• interpreting the confidence interval for the difference between means
• computing the index of effect size.
Note: Although the investigation described here is fictitious, it was inspired by an actual study reported by Buunk et al. (1996).

Background

This fictitious study investigates the way that people respond to infidelity in a romantic partner. Some researchers in the field of sociobiology believe that there may be differences between women versus men with respect to what types of jealousy-provoking situations cause them the greatest distress (e.g., Daly et al., 1982). Some have predicted that
• a man should be more distressed by the thought that his partner has had sexual intercourse with another man, compared to the thought that his partner has developed a deep emotional attachment to another man
• a woman should be more distressed by the thought that her partner has developed a deep emotional attachment to another woman, compared to the thought that her partner has had sexual intercourse with another woman.
The study described in this section addresses the second of these two predictions (i.e., the prediction that a woman should be more distressed at the thought that her partner has developed a deep emotional attachment than at the thought of a sexual attachment).

The Study

Overview. Suppose that you are a sociologist conducting research in this area. There are a number of ways that you could test the ideas about sex and jealousy presented in the preceding section. One approach might involve focusing on just one of the sexes––for example, women––and determining whether the way that women respond to infidelity depends, in part, on the nature of the infidelity. For example, you might ask each subject in a sample of women to imagine how she would feel if she learned that her partner had committed emotional infidelity (i.e., that her partner formed a deep emotional attachment to another person). While imagining this, each woman could rate how distressed she would feel in this situation. Next, you could ask each member of the same group of women to imagine how she would feel if she learned that her partner had committed sexual infidelity (i.e., that her partner had experienced sexual intercourse with another person). While imagining this, she could again rate how distressed she would feel.

Research hypothesis. Here, your research question might be summarized in this fashion: “The purpose of this study is to determine whether there is a difference between emotional infidelity versus sexual infidelity with respect to the amount of psychological distress that they produce in women.”
Your research hypothesis might be “When asked to imagine how they would feel if they learned that their partner had been unfaithful, women will display higher levels of psychological distress when imagining emotional infidelity than when imagining sexual infidelity.” In summary, your study is designed to determine whether the type of infidelity (emotional versus sexual) has an effect on the subjects’ level of distress. The causal nature of your research hypothesis is illustrated in Figure 14.1.
Figure 14.1. Causal relationship between type of infidelity and psychological distress, as predicted by the research hypothesis.
Statistical hypotheses. You will use the two-sample convention for stating your statistical hypotheses. Your research hypothesis (above) is technically a directional hypothesis, because it predicts that distress scores will be higher under the emotional infidelity condition than under the sexual infidelity condition. Nevertheless, you will state your statistical hypotheses as nondirectional hypotheses to avoid concentrating all of your region of rejection in just one tail of the sampling distribution. Below is the nondirectional statistical null hypothesis for your analysis:

Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

Here is the corresponding nondirectional statistical alternative hypothesis:

Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the population, there is a difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

Research method. Suppose that you conduct the study using a repeated-measures design with 17 female subjects. Each subject is presented with a set of 12 scenarios that may or may not cause her to experience psychological distress. Some examples of the scenarios include “failing to get a promotion at work,” “learning that a friend is ill,” and “being in a minor automobile collision.” For each of the 12 scenarios, the subjects read the description of the situation and try to imagine how they would feel if this event actually happened to them. After imagining this,
they rate how distressed they feel by responding to the following four items (for each item, the subject would circle one number from 1 to 7):

Not at all distressed   1   2   3   4   5   6   7   Extremely distressed
Not at all upset        1   2   3   4   5   6   7   Extremely upset
Not at all angry        1   2   3   4   5   6   7   Extremely angry
Not at all hurt         1   2   3   4   5   6   7   Extremely hurt
For a given subject, you sum her responses to the four items, resulting in a single “distress” score that can range from a low of 4 (if she circled “1s” for each item) to a high of 28 (if she circled “7s” for each item). With this scale, higher scores indicate higher levels of distress (a brief sketch of this scoring step appears after the list below). Although the subjects respond in this way to 12 different scenarios, there are only two scenarios that you are actually interested in: one that deals with emotional infidelity, and one that deals with sexual infidelity. The emotional infidelity scenario reads as follows:

Imagine how you would feel if your romantic partner formed a deep emotional attachment to another person.

The sexual infidelity scenario reads as follows:

Imagine how you would feel if your romantic partner experienced sexual intercourse with another person.

Obviously, subjects would make one set of distress ratings for the emotional infidelity scenario, and a different set of distress ratings for the sexual infidelity scenario. As the researcher, you want to determine whether the ratings obtained under the emotional infidelity condition are significantly higher than the ratings obtained under the sexual infidelity condition. Because the same group of people provided ratings under both conditions, this is essentially a repeated-measures design. You may therefore analyze the data using a paired-samples t test.

The Predictor and Criterion Variables in the Analysis

Technically, the predictor variable in your study is “type of infidelity,” a nominal-level variable that consists of just two conditions (emotional infidelity versus sexual infidelity). The criterion variable in your study is “distress,” measured by the subjects’ scores on the four-item ratings described above. The distress measure is on an interval scale, as it has approximately equal intervals but no true zero point. However, in this analysis you will not create one SAS variable to represent your predictor variable (type of infidelity) and a separate SAS variable to represent scores on the criterion (distress), as might be the case if you were going to perform an independent-samples t test. Instead, you will create both of the following:
• one SAS variable to contain distress scores obtained under the emotional infidelity condition
• a second SAS variable to contain distress scores obtained under the sexual infidelity condition.
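Here is the promised sketch of the scoring step, assuming hypothetical item variables ITEM1 through ITEM4 (the chapter’s own data set contains only the summed scores, so these names and the sample responses are illustrative):

DATA SCORED;
   INPUT SUB_NUM ITEM1 ITEM2 ITEM3 ITEM4;
   DISTRESS = ITEM1 + ITEM2 + ITEM3 + ITEM4;   * Total distress score, range 4 to 28 ;
DATALINES;
01 6 5 5 5
;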
The following sections show how the data should be arranged.

Data Set to Be Analyzed

Following are the (fictitious) scores on the distress measure obtained for the 17 subjects under the two treatment conditions:

Table 14.3
Data from the Infidelity Study (Significant Differences)
________________________________________
                Distress scores
          _____________________________
          Emotional       Sexual
          infidelity      infidelity
Subject   condition       condition
________________________________________
  01          21              18
  02          24              19
  03          23              21
  04          27              24
  05          25              25
  06          24              21
  07          26              22
  08          25              21
  09          28              21
  10          20              19
  11          22              20
  12          27              23
  13          26              23
  14          23              22
  15          22              19
  16          22              20
  17          23              20
________________________________________
The preceding data set consists of three columns. The first column is headed “Subject,” and provides a unique number for each subject. The next two columns appear under the major heading “Distress scores.” The values in these two columns indicate the scores that each subject displayed on the measure of psychological distress. The values appearing in the column titled “Emotional infidelity condition” indicate the distress scores that the subjects displayed when they imagined their partners engaging in emotional infidelity. The values appearing in the column titled “Sexual infidelity condition” indicate the distress scores that the subjects displayed when they imagined their partners engaging in sexual infidelity. Each horizontal row in the preceding table presents the distress scores obtained for a specific subject under the two treatment conditions. For example, the first row presents results for Subject 1, who provided a distress score of 21 under the emotional infidelity condition, and a distress score of 18 under the sexual infidelity condition. Subject 2 provided a score of 24
under the emotional condition, and a score of 19 under the sexual condition. The remaining rows may be interpreted in the same fashion.

The SAS DATA Step for the Program

Suppose that you prepare a SAS program to input the data set presented in Table 14.3. You use the SAS variable name SUB_NUM to represent subject numbers, the SAS variable name EMOTION to represent subject distress scores obtained under the emotional infidelity condition, and the SAS variable name SEXUAL to represent subject scores obtained under the sexual infidelity condition. Following are the statements that constitute the DATA step of this program. Notice that the INPUT statement makes use of the SAS variable names that were described in the previous paragraph. Notice also that the data set itself (which appears after the DATALINES statement) is essentially identical to the data set that appears in Table 14.3.

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM EMOTION SEXUAL;
DATALINES;
01 21 18
02 24 19
03 23 21
04 27 24
05 25 25
06 24 21
07 26 22
08 25 21
09 28 21
10 20 19
11 22 20
12 27 23
13 26 23
14 23 22
15 22 19
16 22 20
17 23 20
;
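Before moving on, you may find it helpful to list the data set you just created and compare it against Table 14.3 line by line. This optional step (my own addition, not part of the chapter’s program) uses PROC PRINT:

PROC PRINT DATA=D1;   * Lists all 17 observations for visual checking ;
RUN;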
The PROC Step for the Program

Overview. The PROC step of your program will include two SAS procedures. First, PROC MEANS enables you to review the means and other descriptive statistics obtained on the criterion variable under the two treatment conditions. Second, PROC TTEST enables you to produce a test of the null hypothesis and the confidence interval for the difference between means. This section shows you how to request both procedures.
Syntax for PROC MEANS. When performing a paired-samples t test, the syntax for PROC MEANS is as follows:

PROC MEANS DATA=data-set-name;
   VAR criterion-variable-1 criterion-variable-2;
   TITLE1 'your-name';
RUN;
The second line of the preceding syntax contains the VAR statement. You can see that the VAR statement contains the entries criterion-variable-1 and criterion-variable-2. These entries represent “scores on the criterion variable obtained under Condition 1” and “scores on the criterion variable obtained under Condition 2,” respectively.

SAS statements for PROC MEANS. Here are the statements that would request PROC MEANS for the current analysis:

PROC MEANS DATA=D1;
   VAR EMOTION SEXUAL;
   TITLE1 'JANE DOE';
RUN;

You can see that the PROC MEANS statement specifies “DATA=D1.” This is because D1 is the name of the data set that you created. Below the PROC MEANS statement, the VAR statement lists EMOTION and SEXUAL. This is because EMOTION contains scores on the criterion variable obtained under Condition 1 (the emotional infidelity condition), and SEXUAL contains scores on the criterion variable obtained under Condition 2 (the sexual infidelity condition). Notice how this is consistent with the syntax of the VAR statement presented above.

Syntax for PROC TTEST. The syntax for the section of the program that will request a paired-samples t test is as follows:

PROC TTEST DATA=data-set-name H0=comparison-number ALPHA=alpha-level;
   PAIRED criterion-variable-1*criterion-variable-2;
   TITLE1 'your-name';
RUN;
In the preceding syntax, the PROC TTEST statement contains the following option:

H0=comparison-number

The “comparison-number” that appears in this option should be the mean difference score expected under the null hypothesis. When you perform a paired-samples t test, the mean difference score that is usually expected under the null hypothesis is zero. Therefore, you should generally use the following option when performing a paired-samples t test:

H0=0
Note that the “0” that appears in the preceding option “H0” is a zero (“0”), not the uppercase letter “O.” Likewise, the “0” that appears to the right of the equal sign is also a zero, not the uppercase letter “O.” If you omit the H0 option from the PROC TTEST statement, the default comparison number is zero. This means that, in most cases, there is no harm in omitting this option.

The syntax of the PROC TTEST statement also contains the following option:

ALPHA=alpha-level

This ALPHA option enables you to specify the size of the confidence interval that you will estimate around the difference between the means. Specifying ALPHA=0.01 produces a 99% confidence interval, specifying ALPHA=0.05 produces a 95% confidence interval, and specifying ALPHA=0.1 produces a 90% confidence interval. Suppose that, in this analysis, you wish to create a 95% confidence interval. This means that you will include the following option in the PROC TTEST statement:

ALPHA=0.05

The preceding general form includes the following PAIRED statement:

PAIRED criterion-variable-1*criterion-variable-2;
In the PAIRED statement, you should list the names of the two SAS variables that contain the scores on the criterion variable obtained under the two treatment conditions. Notice that an asterisk (*) separates the two variable names. An earlier section of this chapter indicated that when SAS performs a paired-samples t test, it subtracts scores obtained under one condition from scores obtained under the other condition to create a new variable that consists of the resulting difference scores. The order in which you type your criterion variable names in the PAIRED statement determines how these difference scores are created. Specifically, SAS subtracts scores on criterion-variable-2 from scores on criterion-variable-1. In other words, it subtracts scores on the variable on the right side of the asterisk from scores on the variable on the left side of the asterisk.

SAS statements for PROC TTEST. Here are the actual statements that request SAS to perform a paired-samples t test on the present data set:

PROC TTEST DATA=D1 H0=0 ALPHA=0.05;
   PAIRED EMOTION*SEXUAL;
RUN;

In the preceding PROC TTEST statement, you have requested the option H0=0. This option requests that SAS test the null hypothesis that your sample was drawn from a population in which the average difference score is equal to zero. The PROC TTEST statement also includes the option ALPHA=0.05. This option requests that SAS compute the 95% confidence interval for the difference between the means.
The PAIRED statement lists the SAS variable EMOTION on the left side of the asterisk and SEXUAL on the right side. This means that scores on SEXUAL will be subtracted from scores on EMOTION to compute difference scores. With this arrangement, if the mean difference score is a positive number, you will know that the subjects displayed higher distress scores under the emotional infidelity condition than under the sexual infidelity condition, on the average. On the other hand, if the mean difference score is a negative number, you will know that the subjects displayed higher distress scores under the sexual infidelity condition than under the emotional infidelity condition.

The Complete SAS Program

Below is the complete SAS program––including the DATA step––to analyze the fictitious data from the preceding study. Notice that PROC MEANS and PROC TTEST are both included in the same program.

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM EMOTION SEXUAL;
DATALINES;
01 21 18
02 24 19
03 23 21
04 27 24
05 25 25
06 24 21
07 26 22
08 25 21
09 28 21
10 20 19
11 22 20
12 27 23
13 26 23
14 23 22
15 22 19
16 22 20
17 23 20
;
PROC MEANS DATA=D1;
   VAR EMOTION SEXUAL;
   TITLE1 'JANE DOE';
RUN;
PROC TTEST DATA=D1 H0=0 ALPHA=0.05;
   PAIRED EMOTION*SEXUAL;
RUN;
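One side note on the PAIRED statement in this program: reversing the order of the two variable names changes only the sign of the mean difference score, not the size of the t statistic or the p value. A sketch of the reversed version (my own illustration, not part of the program above):

PROC TTEST DATA=D1 H0=0 ALPHA=0.05;
   PAIRED SEXUAL*EMOTION;   * EMOTION is now subtracted from SEXUAL, so   ;
                            * the mean difference flips sign, while the   ;
                            * absolute t value and p value are unchanged  ;
RUN;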
Steps in Interpreting the Output

Overview. The preceding program produces two pages of output. The first page contains the results produced by PROC MEANS. These results include the means and standard deviations (along with other information) for the variables EMOTION and SEXUAL. The second page of output contains the results of the paired-samples t test (test of the null hypothesis, confidence intervals, and other information). This section shows you how to make sense of this output. It shows you:

• how to verify that there were no obvious errors in typing the data or program
• how to determine which treatment condition displayed the higher mean on the criterion variable
• how to interpret the t test for the difference between the means
• how to interpret the confidence interval
• how to compute the index of effect size.
1. Make sure that everything looks correct. The first page of output provides results from PROC MEANS performed on EMOTION and SEXUAL. These results appear here as Output 14.1.

                              JANE DOE                                     1

                          The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
EMOTION    17    24.0000000     2.2912878    20.0000000    28.0000000
SEXUAL     17    21.0588235     1.9193289    18.0000000    25.0000000
----------------------------------------------------------------------
Output 14.1. Reviewing the results of PROC MEANS for signs of possible errors.
Below the heading “Variable” you will see the names of the variables being analyzed. In the row to the right of “EMOTION” you will find descriptive statistics for EMOTION, and in the row to the right of “SEXUAL” you will find descriptive statistics for SEXUAL. Check the number of usable observations in the column headed “N” to verify that the data set includes the expected number of subjects. Here, the N is 17, as expected. Remember that the variable EMOTION contains psychological distress scores obtained under the emotional infidelity condition, and the variable SEXUAL contains psychological distress scores obtained under the sexual infidelity condition. With the distress variable, the lowest possible score was supposed to be 4, and the highest possible score was supposed to be 28. With this in mind, you can review the descriptive statistics in Output 14.1 to verify that you did not key any “impossible” values. An impossible value is a value that is out-of-bounds: one that is either lower than the lowest possible score (4) or higher than the highest possible score (28).
The average distress scores obtained under the two conditions can be found in the column headed “Mean.” In Output 14.1, you can see that the mean for EMOTION was 24.0000, and the mean for SEXUAL was 21.0588. Both of these means seem reasonable for a scale on which scores could range from 4 to 28.

The column titled “Minimum” contains the lowest score that was observed for each variable. Output 14.1 shows that the lowest observed score on EMOTION was 20, and the lowest observed score on SEXUAL was 18. Neither of these is lower than the lowest possible score of 4, and so you see no obvious evidence of an error in typing your data.

The column titled “Maximum” shows the highest score that was observed for each variable. Output 14.1 shows that the highest observed score on EMOTION was 28, and the highest observed score on SEXUAL was 25. Neither of these is higher than the highest possible score of 28, and so you again see no obvious evidence of an error in typing your data.

Next, look at the results produced by PROC TTEST to see if they display any obvious errors in preparing the program. These results are presented below as Output 14.2.

                              JANE DOE                                     2

                          The TTEST Procedure

                               Statistics

                          Lower CL            Upper CL  Lower CL
Difference           N        Mean      Mean      Mean   Std Dev   Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412    3.7835    1.2201    1.6382

                               Statistics

                     Upper CL
Difference            Std Dev   Std Err   Minimum   Maximum
EMOTION - SEXUAL       2.4933    0.3973         0         7

                                T-Tests

Difference                  DF    t Value    Pr > |t|
EMOTION - SEXUAL            16       7.40      <.0001

Output 14.2. Reviewing the results of PROC TTEST for signs of possible errors.
When you use SAS to perform a paired-samples t test, the results look similar to the results from a single-sample t test. The output consists of three tables. Two of the tables are headed “Statistics” in Output 14.2. These tables provide information regarding the difference score variable that was created in the analysis. The third table is headed “T-Tests,” and it provides information relevant to the paired-samples t test itself.

In the first statistics table, under the heading “Difference,” you will find the names of the two variables that were used to create difference scores in your analysis. In Output 14.2, you can see the entry “EMOTION - SEXUAL” below this heading, meaning that SAS used the variables EMOTION and SEXUAL to create difference scores for the current analysis. This is as it should be. On the output, the entry “EMOTION - SEXUAL” tells you that, to create these difference scores, each subject’s score for the SEXUAL variable was subtracted from her score for the EMOTION variable. Again, this is how you requested it in your SAS program, and so there is no obvious sign of an error.
Below the heading “N,” you will find the number of valid observations in the data set. When you conduct a repeated-measures study (as you did in the present case), N should be equal to the number of subjects in your study. Output 14.2 shows that N is equal to 17, which is correct.

Below the heading “Mean,” you will find the average difference score that was created when SEXUAL scores were subtracted from EMOTION scores. Output 14.2 shows that the mean difference score is 2.9412. This means that, according to PROC TTEST, the average distress score was 2.9412 points higher under the emotional infidelity condition than under the sexual infidelity condition.

To determine whether there is any obvious error, you can manually compute the difference between means and compare it against this mean difference score of 2.9412. To compute the difference manually, you need to refer to the results of PROC MEANS from Output 14.1 (presented earlier). There, you saw that the mean score on EMOTION was 24.0000, and the mean score on SEXUAL was 21.0588. Subtracting the latter mean from the former results in the following:

24.0000 – 21.0588 = 2.9412

You can therefore see that using the means from PROC MEANS to manually compute the mean difference results in the same difference that was reported under “Mean” in the output of PROC TTEST, shown in Output 14.2. Again, this suggests that there was no error made in typing the data or the SAS program itself. Because the output shows no signs of error, you can now review the results to see what implications they have for your research question.

2. Review the means on the criterion variable that you obtained under the two conditions. In an earlier section of this chapter, you stated the following research question: “The purpose of this study is to determine whether there is a difference between emotional infidelity versus sexual infidelity with respect to the amount of psychological distress that they produce in women.” To obtain an answer to this question, you will review a number of pieces of information. One of the first things you will review will be the mean scores for psychological distress that the women displayed under the emotional infidelity condition versus the sexual infidelity condition. These means were presented in Output 14.1. For convenience, they are again reproduced here as Output 14.3.

                              JANE DOE                                     1

                          The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
EMOTION    17    24.0000000     2.2912878    20.0000000    28.0000000
SEXUAL     17    21.0588235     1.9193289    18.0000000    25.0000000
----------------------------------------------------------------------
Output 14.3. Mean scores on the criterion variable (psychological distress) obtained under the emotional infidelity condition versus the sexual infidelity condition.
As was mentioned earlier, the mean distress score obtained under the emotional infidelity condition was 24.0000, while the mean distress score obtained under the sexual infidelity condition was only 21.0588. The mean score obtained under the emotional infidelity condition was a bit higher than that obtained under the sexual infidelity condition, but was it enough higher to conclude that there is a statistically significant difference between the two means? To find out, you must consult the paired-samples t test.

3. Review the t test for the difference between means. The paired-samples t test for the current analysis appears in the results of PROC TTEST, presented earlier. For convenience, that output is reproduced here as Output 14.4.
                              JANE DOE                                     2

                          The TTEST Procedure

                               Statistics

                          Lower CL            Upper CL  Lower CL
Difference           N        Mean      Mean      Mean   Std Dev   Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412    3.7835    1.2201    1.6382

                               Statistics

                     Upper CL
Difference            Std Dev   Std Err   Minimum   Maximum
EMOTION - SEXUAL       2.4933    0.3973         0         7

                                T-Tests

Difference                  DF    t Value    Pr > |t|
EMOTION - SEXUAL            16       7.40      <.0001

Output 14.4. The paired-samples t test, comparing mean distress scores obtained under the emotional infidelity condition versus the sexual infidelity condition.
You have already learned that the value that appears below the heading “Mean” in the output represents the mean difference score created when SEXUAL scores are subtracted from EMOTION scores. Output 14.4 shows that this mean difference score is 2.9412. When you perform a paired-samples t test, you determine whether this obtained mean difference score is significantly different from zero. The results of this test appear lower in the output, in the section titled “T-Tests.”

The degrees of freedom for this test appear below the heading “DF.” You can see that there are 16 degrees of freedom for the current analysis. Below the heading “t Value,” you will find the obtained t statistic for the current paired-samples t test. You can see that the obtained t statistic is 7.40.

To determine whether the obtained t value is statistically significant, you consult the probability value (or p value) associated with that statistic. This p value appears below the heading “Pr > |t|.” You can see that the p value for the current analysis is “<.0001”, which means that it is less than one in ten thousand. This text recommends that you reject the null hypothesis whenever your obtained p value is less than .05.
The obtained p value of “<.0001” is clearly less than this criterion of .05. Therefore, you reject the null hypothesis. Remember that your statistical null hypothesis stated the following:

Statistical null hypothesis (H0): µ1 = µ2; In the population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

In your analysis report, you will indicate that there was a statistically significant difference between mean distress scores obtained under the emotional infidelity condition versus the sexual infidelity condition. These results, when combined with the results produced by PROC MEANS, show that the subjects displayed significantly higher levels of distress when they imagined emotional infidelity than when they imagined sexual infidelity.

4. Review the confidence interval for the difference between the means. When you use PROC TTEST, SAS computes a confidence interval for the difference between the means. The PROC TTEST statement that is included in the program for the current analysis contained the following ALPHA option:

ALPHA=0.05

This option causes SAS to compute the 95% confidence interval. If you had instead wanted the 99% confidence interval, you would have used this option:

ALPHA=0.01

The 95% confidence interval appears in the “Statistics” table in the PROC TTEST output. That table is reproduced here as Output 14.5:

                              JANE DOE                                     2

                          The TTEST Procedure

                               Statistics

                          Lower CL            Upper CL  Lower CL
Difference           N        Mean      Mean      Mean   Std Dev   Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412    3.7835    1.2201    1.6382

Output 14.5. The lower confidence limit and upper confidence limit for the difference between the means.
Below the heading “Difference,” you can see the entry “EMOTION - SEXUAL.” This means that the information that appears in this row is information about the difference score variable that was created by subtracting SEXUAL scores from EMOTION scores. Below the heading “Mean,” you can see that the average difference score is 2.9412. This indicates that the average distress score obtained under the emotional infidelity condition is 2.9412 points higher than the average distress score obtained under the sexual infidelity condition. As you remember from Chapter 12, a confidence interval extends from a lower confidence limit to an upper confidence limit.
To find the lower confidence limit for the current difference between means, look below the heading “Lower CL Mean.” There, you can see that the lower confidence limit for the difference is 2.0989. To find the upper confidence limit for the difference, look below the heading “Upper CL Mean.” There, you can see that the upper confidence limit for the difference is 3.7835. Combined, these findings indicate that the 95% confidence interval for the difference between means ranges from 2.0989 to 3.7835. This indicates that you can estimate with a 95% probability that the actual difference between the mean of the emotional infidelity condition and the mean of the sexual infidelity condition (in the population) is somewhere between 2.0989 and 3.7835.

Notice that this interval does not contain the value of zero. This is consistent with your rejection of the null hypothesis in the previous section (i.e., you rejected the null hypothesis that stated “In the population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress)”). If the null hypothesis had been true, you would have expected the confidence interval to include the value of zero (i.e., a difference score of zero). The fact that your confidence interval does not contain the value of zero is consistent with your rejection of this null hypothesis.

5. Compute the index of effect size. Earlier in this chapter, you learned that effect size can be defined as the degree to which the mean score obtained under one condition differs from the mean score obtained under the second condition, stated in terms of the standard deviation of the population of difference scores. The symbol for effect size is d. When performing a paired-samples t test, the formula for effect size is as follows:

d = | X1 – X2 | / sD
where

X1 = the observed mean of the sample of scores obtained under Condition 1
X2 = the observed mean of the sample of scores obtained under Condition 2
sD = the estimated standard deviation of the population of difference scores.

Although SAS does not automatically compute effect size, you can easily do so by using the information that appears in the output of PROC MEANS and PROC TTEST. First, you will need the mean scores on the “psychological distress” criterion variable obtained under the two treatment conditions. These mean scores appear in the output of PROC MEANS, and that output is reproduced again here as Output 14.6.
                              JANE DOE                                     1

                          The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
EMOTION    17    24.0000000     2.2912878    20.0000000    28.0000000
SEXUAL     17    21.0588235     1.9193289    18.0000000    25.0000000
----------------------------------------------------------------------
Output 14.6. Information from the results of PROC MEANS that is needed to compute the index of effect size.
In the preceding formula, X1 represents the observed mean of the sample of scores that were obtained under Condition 1 (the emotional infidelity condition). In Output 14.6, you can see that the mean score on the distress variable that was obtained under the emotional infidelity condition was 24.0000. In the preceding formula, X2 represents the observed mean of the sample of scores obtained under Condition 2 (the sexual infidelity condition). In Output 14.6, you can see that the mean score on the distress variable that was obtained under this condition was 21.0588. Substituting these two means in the formula results in the following:

d = | 24.0000 – 21.0588 | / sD
In the formula for d, sD represents the estimated standard deviation of the population of difference scores. This statistic appears in the “Statistics” table from the results of PROC TTEST. The relevant table from the current analysis is reproduced here as Output 14.7.
                               Statistics

                          Lower CL            Upper CL  Lower CL
Difference           N        Mean      Mean      Mean   Std Dev   Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412    3.7835    1.2201    1.6382

                               Statistics

                     Upper CL
Difference            Std Dev   Std Err   Minimum   Maximum
EMOTION - SEXUAL       2.4933    0.3973         0         7
Output 14.7. The estimated standard deviation of the population of difference scores.
The estimated standard deviation of the population of difference scores appears below the heading “Std Dev.” For the current analysis, you can see that this standard deviation is 1.6382. Substituting this value in the formula results in the following:

d = | 24.0000 – 21.0588 | / 1.6382
d = 2.9412 / 1.6382
d = 1.7954
d = 1.80
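If you prefer to let SAS do this arithmetic, a short DATA _NULL_ step (my own sketch, not part of the chapter’s program) reproduces the hand calculation from the two means in Output 14.6 and the standard deviation in Output 14.7:

DATA _NULL_;
   XBAR1 = 24.0000;               * Mean for EMOTION, from Output 14.6    ;
   XBAR2 = 21.0588;               * Mean for SEXUAL, from Output 14.6     ;
   SD    = 1.6382;                * Std dev of D scores, from Output 14.7 ;
   D = ABS(XBAR1 - XBAR2) / SD;   * Index of effect size                  ;
   PUT 'EFFECT SIZE D = ' D 5.2;  * Writes 1.80 to the SAS log            ;
RUN;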
And so the obtained index of effect size for the current analysis is 1.80. This means that the mean distress score obtained under the emotional infidelity condition differs from the mean distress score obtained under the sexual infidelity condition by 1.80 standard deviations. To determine whether this is a relatively large difference or a relatively small difference, you can consult the guidelines provided by Cohen (1969). Cohen’s guidelines are reproduced in Table 14.4.

Table 14.4
Guidelines for Interpreting Effect Size
_________________________________________
Effect size         Obtained d statistic
_________________________________________
Small effect        d = .20
Medium effect       d = .50
Large effect        d = .80
_________________________________________
Your obtained d statistic of 1.80 is larger than the “large effect” value of .80 that appears in Table 14.4. This means that the manipulation in your study produced a relatively large effect.

Summarizing the Results of the Analysis

Following is an analysis report that summarizes the preceding research question and results.

A) Statement of the research question: The purpose of this study is to determine whether there is a difference between emotional infidelity versus sexual infidelity with respect to the amount of psychological distress that they produce in women.

B) Statement of the research hypothesis: When asked to imagine how they would feel if they learned that their partner had been unfaithful, women will display higher levels of psychological distress when imagining emotional infidelity than when imagining sexual infidelity.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.
• The predictor variable was type of infidelity. This was a dichotomous variable, was assessed on a nominal scale, and
included two conditions: an emotional infidelity condition versus a sexual infidelity condition.
• The criterion variable was subjects’ scores on a 4-item measure of distress. This was a multi-value variable and was assessed on an interval scale.

D) Statistical test: Paired-samples t test.

E) Statistical null hypothesis (H0): µ1 = µ2; In the study population, there is no difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the study population, there is a difference between the emotional infidelity condition versus the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

G) Obtained statistic: t = 7.40

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the sexual infidelity condition from the mean of the emotional infidelity condition resulted in an observed difference of 2.94. The 95% confidence interval for this difference ranged from 2.10 to 3.78.

K) Effect size: d = 1.80.
L) Conclusion regarding the research hypothesis: These findings provide support for the study’s research hypothesis.

M) Formal description of results for a paper: Results were analyzed using a paired-samples t test. This analysis revealed a statistically significant difference between the two conditions, t(16) = 7.40, p = .0001. The sample means are displayed in Figure 14.2, which shows that the mean distress score obtained under the emotional infidelity condition was significantly higher than the mean distress score obtained under the sexual infidelity condition (for emotional infidelity, M = 24.00, SD = 2.29; for sexual infidelity, M = 21.06, SD = 1.92). The observed difference between the means was 2.94, and the 95% confidence interval for the difference between means ranged from 2.10 to 3.78. The effect size was computed as d = 1.80. According to Cohen’s (1969) guidelines, this represents a relatively large effect.
N) Figure representing the results:
Figure 14.2. Mean scores on the measure of psychological distress as a function of the type of infidelity (significant differences).
Notes Regarding the Preceding Report

The output. Much of the information that was presented in the preceding analysis report was taken from the results of PROC MEANS and PROC TTEST, presented earlier. For your convenience, that output (combined) is reproduced here as Output 14.8.

                          The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
EMOTION    17    24.0000000     2.2912878    20.0000000    28.0000000
SEXUAL     17    21.0588235     1.9193289    18.0000000    25.0000000
----------------------------------------------------------------------

                          The TTEST Procedure

                               Statistics

                          Lower CL            Upper CL  Lower CL
Difference           N        Mean      Mean      Mean   Std Dev   Std Dev
EMOTION - SEXUAL    17      2.0989    2.9412    3.7835    1.2201    1.6382

                               Statistics

                     Upper CL
Difference            Std Dev   Std Err   Minimum   Maximum
EMOTION - SEXUAL       2.4933    0.3973         0         7

                                T-Tests

Difference                  DF    t Value    Pr > |t|
EMOTION - SEXUAL            16       7.40      <.0001
Output 14.8. Results from PROC MEANS and PROC TTEST, infidelity study (significant differences).
Rounding to two decimal places. Item J in the preceding report provided information about the 95% confidence interval for the difference between the means. This information was taken from the first “Statistics” table in the output of PROC TTEST. Notice that, in the analysis report, the values have been rounded to two decimal places. Item J refers to “an observed difference of 2.94. The 95% confidence interval for this difference ranged from 2.10 to 3.78.” When you present statistics in an analysis report, in most cases you should round them to two decimal places (you might have noticed that most of the statistics in the preceding analysis have been rounded to two decimal places). You should report more than two decimal places when it is necessary to convey important information, such as the p value associated with the statistic (most analysis reports in this text report p values to four decimal places).

The t statistic and related information. The obtained t statistic for the analysis is reported in Items G and M of the analysis report. The t statistic itself appears in the output of PROC TTEST below the heading “t Value.” Item M from the preceding report provides a summary of the results for a paper. The second sentence reports the obtained t statistic in the following way: t(16) = 7.40, p = .0001. The “16” that appears in parentheses in the preceding excerpt represents the degrees of freedom for the analysis. With a paired-samples t test, the degrees of freedom are equal to N – 1, where N represents the number of difference scores computed. In the present case, N = 17, so it makes sense that the degrees of freedom would be 17 – 1 = 16. In the output from PROC TTEST, the degrees of freedom are reported below the heading “DF.” The “p = .0001” from this excerpt indicates that the obtained p value for the analysis was .0001. This came from the output of PROC TTEST, below the heading “Pr > |t|.”

Means and standard deviations for the treatment conditions. In the formal description of results for a paper (in Item M), the third sentence states: ...(for emotional infidelity, M = 24.00, SD = 2.29; for sexual infidelity, M = 21.06, SD = 1.92). In this excerpt, the symbol “M” represents the mean, and “SD” represents the standard deviation. This sentence reports the mean and standard deviation of EMOTION (which contained distress scores obtained under the emotional infidelity condition) and SEXUAL (which contained distress scores obtained under the sexual infidelity condition). These means and standard deviations appear in the results of PROC MEANS, under the headings “Mean” and “Std Dev,” respectively.
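If you would rather have SAS do the rounding for a report, the standard ROUND function handles it; a small sketch of my own:

DATA _NULL_;
   DIFF = 2.9412;                  * Mean difference from the output ;
   R = ROUND(DIFF, 0.01);          * Round to two decimal places     ;
   PUT 'ROUNDED DIFFERENCE = ' R;  * Writes 2.94 to the SAS log      ;
RUN;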
Example 14.2: An Illustration of Results Showing Nonsignificant Differences

Overview

This section presents the results of an analysis of a different data set––a data set that is designed to produce nonsignificant results. This will enable you to see how nonsignificant results might appear in your output. A later section will show you how to summarize nonsignificant results in an analysis report.

The SAS Output

Output 14.9 resulted from the analysis of a different fictitious data set, one in which the means for the two treatment conditions are not significantly different.

                          The MEANS Procedure

Variable    N          Mean       Std Dev       Minimum       Maximum
----------------------------------------------------------------------
EMOTION    17    21.0588235     1.9193289    18.0000000    25.0000000
SEXUAL     17    20.9411765     1.5996323    18.0000000    24.0000000
----------------------------------------------------------------------

                          The TTEST Procedure

                               Statistics

                          Lower CL            Upper CL  Lower CL
Difference           N        Mean      Mean      Mean   Std Dev   Std Dev
EMOTION - SEXUAL    17      -0.807    0.1176    1.0424    1.3396    1.7987

                               Statistics

                     Upper CL
Difference            Std Dev   Std Err   Minimum   Maximum
EMOTION - SEXUAL       2.7375    0.4362        -4         3

                                T-Tests

Difference                  DF    t Value    Pr > |t|
EMOTION - SEXUAL            16      0.27       0.7909
Output 14.9. Results from PROC MEANS and PROC TTEST, infidelity study (nonsignificant differences).
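For reference, output like Output 14.9 is produced by PROC steps along the following lines. This is a minimal sketch: EMOTION and SEXUAL are the paired variables used throughout this chapter, while D1 is an assumed name for the data set that contains them (your own program may use a different name):

PROC MEANS DATA=D1 N MEAN STD MIN MAX;   /* means and standard deviations    */
   VAR EMOTION SEXUAL;                   /* for the two treatment conditions */
RUN;

PROC TTEST DATA=D1 ALPHA=0.05;           /* paired-samples t test with       */
   PAIRED EMOTION*SEXUAL;                /* 95% confidence intervals         */
RUN;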
Steps in Interpreting the Output

Overview. You would normally interpret Output 14.9 following the same steps that were listed in the previous section titled "Steps in Interpreting the Output." However, this section will focus only on those results that are most relevant to the significance test, the confidence interval, and the index of effect size.
1. Review the means on the criterion variable obtained under the two conditions. In the output from PROC MEANS, below the heading "Mean," you can see that the mean distress score obtained under the emotional infidelity condition was 21.0588, and the mean score obtained under the sexual infidelity condition was 20.9412. You can see that there does not appear to be a large difference between these two treatment means.

2. Review the t test for the difference between the means. From the results of PROC TTEST, below the heading "t Value," you can see that the obtained t statistic for the current analysis is 0.27. The p value that is associated with this t statistic is 0.7909. Because this obtained p value is larger than the standard criterion of .05, you fail to reject the null hypothesis. In your report, you will indicate that the difference between the two treatment conditions is not statistically significant.

3. Review the confidence interval for the difference between the means. From the output of PROC TTEST, you can see that the observed difference between the means for the two treatment conditions is 0.1176. The 95% confidence interval for this difference ranges from –0.807 to 1.0424. Notice that this interval does include the value of zero, which is consistent with your failure to reject the null hypothesis.

4. Compute the index of effect size. The formula for computing effect size in a paired-samples t test is reproduced here:

          | X1 – X2 |
     d = ––––––––––––
              sD
The symbols X1 and X2 represent the sample means on the "distress" variable obtained under the two treatment conditions. From Output 14.9, you can see that the mean distress score obtained under the emotional infidelity condition was 21.0588, and the mean distress score obtained under the sexual infidelity condition was 20.9412. Inserting these means into the formula for the effect size index results in the following:

          | 21.0588 – 20.9412 |
     d = –––––––––––––––––––––––
                  sD
The symbol sD in the formula represents the estimated standard deviation of difference scores in the population. This statistic may be found in the “Statistics” table produced by PROC TTEST. This table is reproduced here:
                                 Statistics

                            Lower CL             Upper CL   Lower CL
 Difference            N        Mean      Mean       Mean    Std Dev   Std Dev
 EMOTION - SEXUAL     17      -0.807    0.1176     1.0424     1.3396    1.7987

                                 Statistics

                       Upper CL
 Difference             Std Dev    Std Err    Minimum    Maximum
 EMOTION - SEXUAL        2.7375     0.4362         -4          3
Output 14.10. Estimated population standard deviation that is needed to compute effect size for the infidelity study (nonsignificant differences).
In Output 14.10, you can see that the estimated population standard deviation is 1.7987. Substituting this value into the formula for effect size results in the following:

          | 21.0588 – 20.9412 |
     d = –––––––––––––––––––––––
                 1.7987

          | .1176 |
     d = –––––––––––
           1.7987

     d = .0654

     d = .07
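If you would rather have SAS do the preceding arithmetic, one approach is to compute the difference scores directly and then request their mean and standard deviation. This is a minimal sketch, assuming the raw scores are in a data set named D1 that contains the variables EMOTION and SEXUAL used in this chapter; the names D2 and DIFF are hypothetical, chosen only for illustration:

DATA D2;
   SET D1;
   DIFF = EMOTION - SEXUAL;   /* difference score for each subject */
RUN;

PROC MEANS DATA=D2 N MEAN STD;
   VAR DIFF;                  /* Mean = difference between means;  */
RUN;                          /* Std Dev = sD                      */

In this output, "Mean" corresponds to X1 – X2 (here, 0.1176) and "Std Dev" corresponds to sD (here, 1.7987), so dividing the absolute value of the former by the latter reproduces d = .07.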
Thus, the index of effect size for the current analysis is .07. Cohen's guidelines (appearing in Table 14.4) indicated that a "small effect" was obtained when d = .20. The value of .07 that was obtained with the present analysis was well below this criterion, indicating that the present manipulation produced less than a small effect.

Summarizing the Results of the Analysis

Following is an analysis report that summarizes the preceding research question and results.

A) Statement of the research question: The purpose of this study is to determine whether there is a difference between emotional infidelity and sexual infidelity with respect to the amount of psychological distress that they produce in women.

B) Statement of the research hypothesis: When they are asked to imagine how they would feel if they learned that their partner had been unfaithful, women will display higher levels of psychological distress when imagining emotional infidelity than when imagining sexual infidelity.
C) Nature of the variables: This analysis involved one predictor variable and one criterion variable.

• The predictor variable was type of infidelity. This was a dichotomous variable, was assessed on a nominal scale, and included two conditions: an emotional infidelity condition versus a sexual infidelity condition.

• The criterion variable was subjects' scores on a 4-item measure of distress. This was a multi-value variable and was assessed on an interval scale.

D) Statistical test: Paired-samples t test.

E) Statistical null hypothesis (H0): µ1 = µ2; In the study population, there is no difference between the emotional infidelity condition and the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

F) Statistical alternative hypothesis (H1): µ1 ≠ µ2; In the study population, there is a difference between the emotional infidelity condition and the sexual infidelity condition with respect to mean scores on the criterion variable (the measure of psychological distress).

G) Obtained statistic: t = 0.27

H) Obtained probability (p) value: p = .7909

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Confidence interval: Subtracting the mean of the sexual infidelity condition from the mean of the emotional infidelity condition resulted in an observed difference of 0.12. The 95% confidence interval for this difference extended from –0.81 to 1.04.

K) Effect size: d = .07.
L) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis.

M) Formal description of results for a paper: Results were analyzed using a paired-samples t test. This analysis revealed a statistically nonsignificant difference between the two conditions, t(16) = 0.27, p = .7909. The sample means are displayed in Figure 14.3, which shows that the mean distress score obtained under the emotional infidelity condition was similar to the mean distress score obtained under the sexual infidelity condition (for emotional infidelity, M = 21.06, SD = 1.92; for sexual infidelity, M = 20.94, SD = 1.60). The observed difference between the means was 0.12, and the 95%
confidence interval for the difference between means ranged from –0.81 to 1.04. The effect size was computed as d = .07. According to Cohen's (1969) guidelines, this represents less than a small effect.

N) Figure representing the results:
Figure 14.3. Mean scores on the measure of psychological distress as a function of the type of infidelity (nonsignificant differences).
Conclusion

In this chapter, you learned how to perform a paired-samples t test. With the information learned here, along with the information learned in Chapter 13, "Independent-Samples t Test," you should now be prepared to analyze data from many types of studies that compare two treatment conditions.

But what if you are conducting an investigation that involves more than two treatment conditions? For example, you might conduct a study that investigates the effect of caffeine on learning in laboratory rats. Such a study might involve four treatment conditions: (a) a group given zero mg of caffeine, (b) a group given 1 mg of caffeine, (c) a group given 2 mg of caffeine, and (d) a group given 3 mg of caffeine. You might think that the way to analyze the data from this study would be to perform a series of t tests in which you compare every possible pair of conditions (with four conditions, there would be six such pairwise tests). But most researchers and statisticians would advise against this approach, because performing many separate tests in this way inflates the risk of a Type I error. Instead, they would counsel you to analyze your data using a one-way analysis of variance (abbreviated as ANOVA).
Analysis of variance is one of the most flexible and widely used statistical procedures in the behavioral sciences and education. It is essentially an expansion of the t test because it enables you to analyze data from studies that involve more than two treatment conditions. The following chapter shows you how to use SAS to perform a one-way analysis of variance.
One-Way ANOVA with One Between-Subjects Factor

Introduction ............................................................ 491
    Overview ............................................................ 491
Situations Appropriate for One-Way ANOVA with One
  Between-Subjects Factor ............................................... 491
    Overview ............................................................ 491
    Nature of the Predictor and Criterion Variables ..................... 491
    The Type-of-Variable Figure ......................................... 492
    Example of a Study Providing Data That Are Appropriate
      for This Procedure ................................................ 492
    Summary of Assumptions Underlying One-Way ANOVA with
      One Between-Subjects Factor ....................................... 493
A Study Investigating Aggression ........................................ 494
    Overview ............................................................ 494
    Research Method ..................................................... 495
    The Research Design ................................................. 496
Treatment Effects, Multiple Comparison Procedures, and
  a New Index of Effect Size ............................................ 497
    Overview ............................................................ 497
    Treatment Effects ................................................... 497
    Multiple Comparison Procedures ...................................... 498
    R2, an Index of Variance Accounted For .............................. 499
Some Possible Results from a One-Way ANOVA .............................. 500
    Overview ............................................................ 500
    Significant Treatment Effect, All Multiple Comparison
      Tests are Significant ............................................. 500
    Significant Treatment Effect, Two of Three Multiple
      Comparison Tests Are Significant .................................. 502
    Nonsignificant Treatment Effect ..................................... 504
Example 15.1: One-Way ANOVA Revealing a Significant
  Treatment Effect ...................................................... 505
    Overview ............................................................ 505
    Choosing SAS Variable Names and Values to Use in the
      SAS Program ....................................................... 505
    Data Set to Be Analyzed ............................................. 506
    Writing the SAS Program ............................................. 507
    Keywords for Other Multiple Comparison Procedures ................... 510
    Output Produced by the SAS Program .................................. 511
    Steps in Interpreting the Output .................................... 511
    Using a Figure to Illustrate the Results ............................ 525
    Analysis Report for the Aggression Study (Significant Results) ...... 526
    Notes Regarding the Preceding Analysis Report ....................... 528
Example 15.2: One-Way ANOVA Revealing a Nonsignificant
  Treatment Effect ...................................................... 529
    Overview ............................................................ 529
    The Complete SAS Program ............................................ 530
    Steps in Interpreting the Output .................................... 531
    Using a Graph to Illustrate the Results ............................. 534
    Analysis Report for the Aggression Study (Nonsignificant Results) ... 535
Conclusion .............................................................. 537
Introduction

Overview

This chapter shows how to enter data into SAS and prepare SAS programs that will perform a one-way analysis of variance (ANOVA) using the GLM procedure. This chapter focuses on between-subjects research designs: designs in which each subject is exposed to only one condition under the independent variable. It shows how to determine whether there is a significant effect for the study's independent variable, how to use multiple comparison procedures to identify the pairs of groups that are significantly different from each other, how to request confidence intervals for differences between the means, and how to interpret an index of effect size. Finally, it shows how to prepare a report that summarizes the results of the analysis.
Situations Appropriate for One-Way ANOVA with One Between-Subjects Factor

Overview

One-way ANOVA is a test of group differences: it enables you to determine whether there are significant differences between two or more treatment conditions with respect to their mean scores on a criterion variable. ANOVA has an important advantage over a t test: a t test enables you to determine whether there is a significant difference between only two groups. ANOVA, on the other hand, enables you to determine whether there is a significant difference between two or more groups. ANOVA is routinely used to analyze data from experiments that involve three or more treatment conditions.

In summary, one-way ANOVA with one between-subjects factor can be used when you want to investigate the relationship between (a) a single predictor variable (which classifies group membership) and (b) a single criterion variable.

Nature of the Predictor and Criterion Variables

Predictor variable. With ANOVA, the predictor (or independent) variable is a type of classification variable: it simply indicates which group a subject is in.

Criterion variable. With analysis of variance, the criterion (or dependent) variable is typically a multi-value variable. It must be a numeric variable that is assessed on either an interval or ratio level of measurement. The criterion variable in the analysis must also satisfy a number of additional assumptions; these assumptions are summarized in a later section.
The Type-of-Variable Figure

The figure below illustrates the types of variables that are typically being analyzed when researchers perform a one-way ANOVA with one between-subjects factor.

     Criterion       Predictor

       Multi     =      Lmt
The "Multi" symbol that appears in the above figure shows that the criterion variable in an ANOVA is typically a multi-value variable (a variable that assumes more than six values in your sample). The "Lmt" symbol that appears to the right of the equal sign in the above figure shows that the predictor variable in this procedure is usually a limited-value variable (i.e., a variable that assumes just two to six values).

Example of a Study Providing Data That Are Appropriate for This Procedure

The Study. Suppose that you are an industrial psychologist studying work motivation and work safety. You are trying to identify interventions that may increase the likelihood that employees will engage in safe behaviors at work. In your current investigation, you are working with pizza deliverers. With this population, employers are interested in increasing the likelihood that the deliverers will display safe driving behaviors. They are particularly interested in interventions that may increase the frequency with which drivers come to a full stop at stop signs (as opposed to a more dangerous "rolling stop"). You are investigating two research questions:

• Does setting safety-related goals for employees increase the frequency with which they will engage in safe behaviors?

• Do participatively set goals tend to be more effective than goals that are assigned by a supervisor without participation?

To explore these questions, you conduct an experiment in which you randomly assign 30 pizza deliverers to one of three treatment conditions:

• 10 drivers are assigned to the participative goal-setting condition. These drivers meet as a group and set goals for themselves with respect to how frequently they should come to a full stop at stop signs.

• 10 drivers are assigned to the assigned goal-setting condition. The drivers in this group meet with their supervisors, and the supervisors assign goals regarding how frequently the drivers should come to a full stop at stop signs (unbeknownst to the drivers, these goals are the same as the goals developed by the preceding group).

• 10 drivers are assigned to the control condition. Drivers in this condition do not experience any goal setting.
You secretly observe the drivers at stop signs over a two-month period, noting how many times each driver comes to a full stop out of 30 opportunities. You perform a one-way ANOVA to determine whether there are significant differences between the three groups with respect to their average number of full stops out of 30. The ANOVA procedure permits you to determine (a) whether the two goal-setting groups displayed a greater average number of full stops, compared to the control group, and (b) whether the participative goal-setting group displayed a greater number of full stops, compared to the assigned goal-setting group.

Why these data would be appropriate for this procedure. The preceding study involved a single predictor variable and a single criterion variable. The predictor variable was "type of motivational intervention." You know that this was a limited-value variable because it assumed only three values: a participative goal-setting condition, an assigned goal-setting condition, and a control condition. This predictor variable was assessed on a nominal scale because it indicates group membership but does not convey any quantitative information. However, remember that the predictor variable used in a one-way ANOVA may be assessed on any scale of measurement.

The criterion variable in this study was the number of full stops displayed by the pizza drivers. This was a numeric variable, and you know it was assessed on a ratio scale because it had equal intervals and a true zero point. You would know that this was a multi-value variable if you used a procedure such as PROC FREQ to verify that the drivers' scores displayed a relatively large number of values (i.e., some drivers had zero full stops out of a possible 30, other drivers had 30 full stops out of a possible 30, and still other drivers had any number of full stops between these two extremes).

Before you analyze your data with ANOVA, you first want to perform a number of other preliminary analyses on your data to verify that they meet the assumptions underlying this statistical procedure. The most important of these assumptions are summarized in the following section.

Note: Although the study described here is fictitious, it is based on a real study reported by Ludwig and Geller (1997).

Summary of Assumptions Underlying One-Way ANOVA with One Between-Subjects Factor

• Level of measurement. The criterion variable must be a numeric variable that is assessed on an interval or ratio level of measurement. The predictor variable may be assessed on any level of measurement, although it is essentially treated as a nominal-level (classification) variable in the analysis.

• Independent observations. A particular observation should not be dependent on any other observation in any group. In practical terms, this means that (a) each subject is exposed to just one condition under the predictor variable and (b) subject-matching procedures are not used.

• Random sampling. Scores on the criterion variable should represent a random sample drawn from the study populations.

• Normal distributions. Each group should be drawn from a normally distributed population. If each group contains over 30 subjects, the test is robust against moderate departures from normality (in this context, "robust" means that the test will still provide accurate results as long as violations of the assumptions are not large). You should analyze your data with PROC UNIVARIATE using the NORMAL option to determine whether your data meet this assumption (a sketch of such an analysis appears after this list). Be warned that the tests for normality that are provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large.

• Homogeneity of variance. The populations that are represented by the various groups should have equal variances on the criterion. If the number of subjects in the largest group is no more than 1.5 times greater than the number of subjects in the smallest group, the test is robust against moderate violations of the homogeneity assumption (Stevens, 1986).
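The following is a minimal sketch of such a normality check. The data set and variable names (D1, MOD_AGGR, and SUB_AGGR) are the ones used in the aggression example later in this chapter; substitute your own names as needed:

PROC SORT DATA=D1;
   BY MOD_AGGR;               /* BY-group processing requires sorted data */
RUN;

PROC UNIVARIATE DATA=D1 NORMAL;
   VAR SUB_AGGR;              /* NORMAL requests tests of normality,      */
   BY MOD_AGGR;               /* performed separately for each condition  */
RUN;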
A Study Investigating Aggression

Overview

Assume that you are conducting research concerning the possible causes of aggression in children. You are aware that social learning theory (Bandura, 1977) predicts that exposure to aggressive models can cause people to behave more aggressively. You design a study in which you experimentally manipulate the amount of aggression that a model displays. You want to determine whether this manipulation affects how aggressively children subsequently behave, after they have viewed the model. Essentially, you wish to determine whether viewing a model's aggressive behavior can lead to increased aggressive behavior on the part of the viewer.

Your research hypothesis: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observe the model. Specifically, you predict the following:

• Children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression.

• Children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.

You perform a single investigation to test these hypotheses. The following sections describe the research method in more detail.

Note: Although the study and results presented here are fictitious, they are inspired by a real study reported by Bandura (1965).
Research Method

Overview. You conduct a study in which 24 nursery-school children serve as subjects. The study is conducted in two stages. In Stage 1, you show a short videotape to your subjects. You manipulate the independent variable by varying what the children see in this videotape. In Stage 2, you assess the dependent variable (the amount of aggression displayed by the children) to determine whether it has been affected by this independent variable.

The following sections refer to your independent variable as a "predictor variable." This is because the term "predictor variable" is more general, and is appropriate regardless of whether your variable is a true manipulated independent variable (as in the present case) or a nonmanipulated subject variable (such as subject sex).

Stage 1: Manipulating the predictor variable. The predictor variable in your study is the "level of aggression displayed by the model" or, more concisely, "model aggression." You manipulate this independent variable by randomly assigning each child to one of three treatment conditions:

• Eight children are assigned to the low-model-aggression condition. When the subjects in this group watch the videotape, they see a model demonstrate a relatively low level of aggressive behavior. Specifically, they see a model (an adult female) enter a room that contains a wide variety of toys. For 90% of the tape, the model engages in nonaggressive play (e.g., playing with building blocks). For 10% of the tape, the model engages in aggressive play (e.g., violently punching an inflatable "bobo doll").

• Another eight children are assigned to the moderate-model-aggression condition. They watch a videotape of the same model in the same playroom, but they observe the model displaying a somewhat higher level of aggressive behavior. Specifically, in this version of the tape, the model engages in nonaggressive play (again, playing with building blocks) 50% of the time, and engages in aggressive play (again, punching the bobo doll) 50% of the time.

• Finally, the remaining eight children are assigned to the high-model-aggression condition. They watch a videotape of the same model in the same playroom, but in this version the model engages in nonaggressive play 10% of the time, and engages in aggressive play 90% of the time.
Stage 2: Assessing the criterion variable. This chapter will refer to the dependent variable in the study as a "criterion variable." Again, this is because the term "criterion variable" is a more general term that is appropriate regardless of whether your study is a true experiment (as in the present case) or a nonexperimental investigation. The criterion variable in this study is the "number of aggressive acts displayed by the subjects" or, more concisely, "subject aggressive acts."

The purpose of your study was to determine whether certain manipulations in your videotape caused some groups of children to behave more aggressively than others. To assess this, you allowed each child to engage in a free play period immediately after viewing the videotape. Specifically, each child was individually escorted to a play room similar to the room that was shown in the tape. This playroom contained a large assortment of toys, some of which were appropriate for
nonaggressive play (e.g., building blocks), and some of which were appropriate for aggressive play (e.g., an inflatable bobo doll identical to the one in the tape). The children were told that they could do whatever they liked in the play room, and were then left to play alone. Outside of the play room, three observers watched the child through a one-way mirror. They recorded the total number of aggressive acts the child displayed during a 20-minute period in the play room (an "aggressive act" could be an instance in which the child punches the bobo doll, throws a building block, and so forth). Therefore, the criterion variable in your study is the total number of aggressive acts demonstrated by each child during this period.

The Research Design

The research design used in this study is illustrated in Figure 15.1. You can see that this design is represented by a figure that consists of three squares, or cells.
Figure 15.1. Research design for the aggression study.
The figure is titled “Predictor Variable: Level of Aggression Displayed by Model.” The first cell (on the left) represents the eight subjects in Level 1 (the children who saw a videotape in which the model displayed a low level of aggression). The middle cell represents the eight subjects in Level 2 (the children who saw the model display a moderate level of aggression). Finally, the cell on the right represents the eight subjects in Level 3 (the children who saw the model display a high level of aggression).
Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size

Overview

This section introduces the three types of results that you will review when you conduct a one-way ANOVA. First, it covers the concept of treatment effects: overall differences among group means that are due to the experimental manipulations. Next, it discusses multiple comparison procedures––tests used to determine which pairs of treatment conditions are significantly different from each other. Finally, it introduces the R2 statistic––a measure of the variance in the criterion variable that is accounted for by the predictor variable. It discusses the use of R2 as an index of effect size in experiments.

Treatment Effects

Null and alternative hypotheses. The concept of treatment effects is best understood with reference to the concept of the null hypothesis. For an experiment with three treatment conditions (as with the current study), the statistical null hypothesis may generally be stated according to this format:

Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the study population, there is no difference between subjects in the three treatment conditions with respect to their mean scores on the criterion variable.

In the preceding null hypothesis, the symbol µ represents the population mean on the criterion variable for a particular treatment condition. For example, µ1 represents the population mean for Treatment Condition 1, µ2 represents the population mean for Treatment Condition 2, and so on.

The preceding section describes a study that was designed to determine whether exposure to aggressive models will cause subjects who view the models to behave aggressively themselves. The statistical null hypothesis for this study may be stated in this way:

Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the study population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

For a study with three treatment conditions, the statistical alternative hypothesis may generally be stated in this way:

Statistical alternative hypothesis (H1): Not all µs are equal; In the study population, there is a difference between at least two of the three treatment conditions with respect to their mean scores on the criterion variable.
The statistical alternative hypothesis appropriate for the model aggression study that was previously described may be stated in this way:

Statistical alternative hypothesis (H1): Not all µs are equal; In the study population, there is a difference between at least two of the following three groups with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition.

The following sections of this chapter will show you how to use SAS to perform a one-way analysis of variance. You will learn that, in this analysis, SAS computes an F statistic that tests the statistical null hypothesis. If this F statistic is significant, you can reject the null hypothesis that all population means are equal. You may then tentatively conclude that at least two of the population means differ from one another. In this situation, you have obtained a significant treatment effect.

Treatment effects. In an experiment, a significant treatment effect refers to differences among group means that are due to the influence of the independent variable. When you conduct a true experiment and obtain a significant treatment effect, this means that your independent variable had some type of effect on the dependent variable. It means that at least two of the treatment conditions are significantly different from each other. In most cases, this is what you want to demonstrate.

In a single-factor experiment (an experiment in which only one independent variable is being manipulated), there can be only one treatment effect. However, in a factorial experiment (an experiment in which more than one independent variable is being manipulated), there may be more than one treatment effect. The present chapter will deal exclusively with single-factor experiments; Chapter 16, "Factorial ANOVA with Two Between-Subjects Factors," will introduce you to the concept of factorial experiments.

Multiple Comparison Procedures

As was stated previously, when an F statistic is large enough to enable you to reject the null hypothesis, you may tentatively conclude that, in the population, at least two of the three treatment conditions differ from one another. But which two? It is possible, for example, that Group 1 is significantly different from Group 2, but Group 2 is not significantly different from Group 3. Clearly, researchers need a tool that will enable them to determine which groups are significantly different from one another.

A special type of test called a multiple comparison procedure is normally used for this purpose. A multiple comparison procedure is a statistical test that enables researchers to determine the significance of the difference between pairs of means from studies that include more than two treatment conditions. A wide variety of multiple comparison procedures are available with SAS, but this chapter will focus on just one that is very widely used: Tukey's studentized range (HSD) test.
A later section shows how to request the Tukey test in your SAS program, and how to interpret the output that it generates. You will supplement the Tukey test with an option that requests confidence intervals for the differences between the means, somewhat similar to the confidence intervals that you obtained with the independent-samples t tests that you performed in Chapter 13. The results of these tests will give you a better understanding of the nature of the differences between your treatment conditions.

R2, an Index of Variance Accounted For

The need for an index of effect size. In Chapter 12, "Single-Sample t Test," you learned that it is possible to conduct an experiment and obtain results that are statistically significant, even though the magnitude of the treatment effect is trivial. This outcome is most likely to occur when you conduct a study with a very large number of subjects. When your sample is very large, your statistical test has a relatively large amount of power. This means that you are fairly likely to reject the null hypothesis and conclude that you have obtained significant differences, even when the magnitude of the difference is relatively small.

To address this problem, researchers are now encouraged to supplement their significance tests with indices of effect size. An index of effect size is a measure of the magnitude of a treatment effect. A variety of different effect size indices are available for use in research. In the three chapters in this text that deal with t tests, you learned about the d statistic. The d statistic indicates the degree to which one sample mean differs from the second sample mean (or population mean), stated in terms of the standard deviation of the population. The d statistic is often used as an index of effect size when researchers compute t statistics.

R2 as a measure of variance accounted for. In this chapter, you will learn about a different index of effect size––R2. R2 is an index that is often reported when researchers perform analysis of variance. The R2 statistic indicates the proportion of variance in the criterion variable that is accounted for by the study's predictor variable. It is computed by dividing the sum of squares for the predictor variable (the "between-groups" sum of squares) by the sum of squares for the corrected total. Values of R2 may range from .00 to 1.00, with larger values indicating a larger treatment effect (the word "effect" is appropriate only for experimental research––not for nonexperimental research). The larger the value of R2, the larger the effect that the independent variable had on the dependent variable. For example, when a researcher conducts an experiment and obtains an R2 value of .40, she may conclude that her independent variable accounted for 40% of the variance in the dependent variable. Researchers typically hope to obtain relatively large values of R2.

Interpreting the size of R2. Chapters 12–14 in this text provided information about t tests and guidelines for interpreting the d statistic. Specifically, they provided tables that showed how the size of a d statistic indicates a "small" effect versus a "moderate" effect versus a "large" effect. Unfortunately, however, there are no similar widely accepted guidelines for interpreting R2. For example, although most researchers would agree that R2 values less than .05 are relatively trivial, there is no widely accepted criterion for how large R2 must be to be considered "large." This is because the significance of the size of R2 depends on the nature of the phenomenon being studied, and also on the size of the R2 values that were obtained when other researchers have studied the same phenomenon.

For example, researchers looking for ways to improve the grades of children in elementary schools might find that it is difficult to construct interventions that have much of an impact. If this is the case, an experiment that produces an R2 value of .15 may be considered a big success, and the R2 value of .15 may be considered meaningful. In contrast, researchers conducting research on reinforcement theory using laboratory rats and a "bar-pressing" procedure may find that it is easy to construct manipulations that have a major effect on the rats' bar-pressing behavior. In these studies, they may routinely obtain R2 values over .80. If this is the case, then a new experiment that produces an R2 value of .15 may be considered a failure, and the R2 value of .15 may be considered relatively trivial.

The above example illustrates the problem with R2 as an index of effect size: in one situation, an R2 value of .15 was interpreted as a meaningful proportion of variance, and in a different situation, the same value of .15 was interpreted as a relatively trivial proportion of variance. Therefore, before you can interpret an R2 value from a study that you have conducted, you must first be familiar with the R2 values that have been obtained in similar research that has already been conducted by others. It is only within this context that you can determine whether your R2 value can be considered large or meaningful.

Summary. In summary, R2 is a measure of variance accounted for that may be used as an index of effect size when you analyze data from experiments that use a between-subjects design. Later sections of this chapter will show where the R2 statistic is printed in the output of PROC GLM, and how you should incorporate it into your analysis reports.
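As a quick illustration of the computation described above (the numbers here are hypothetical, not taken from any output in this chapter): if an ANOVA summary table reported a between-groups sum of squares of 200.00 and a corrected total sum of squares of 500.00, then

     R2 = 200.00 / 500.00 = .40

which corresponds to the "40% of the variance" interpretation given earlier.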
Some Possible Results from a One-Way ANOVA

Overview

When you conduct a study that involves three or more treatment conditions, a number of different types of outcomes are possible. Some of these possibilities are illustrated below. All these examples are based on the aggression experiment described above.

Significant Treatment Effect, All Multiple Comparison Tests are Significant

Figure 15.2 illustrates an outcome in which both of the following are true:

• there is a significant treatment effect for the predictor variable (level of aggression displayed by the model)

• all multiple comparison tests are significant.
Figure 15.2. Mean number of aggressive acts as a function of the level of aggression displayed by the model (significant treatment effect; all multiple comparison tests are significant).
Understanding the figure. The bar labeled "Low" in Figure 15.2 represents the children who saw the model display a low level of aggression in the videotape. The bar labeled "Moderate" represents the children who saw the model display a moderate level of aggression, and the bar labeled "High" represents the children who saw the model display a high level of aggression.

The vertical axis labeled "Subject Aggressive Acts" in Figure 15.2 indicates the mean number of aggressive acts that the various groups of children displayed after viewing the videotape. You can see that the "Low" bar reflects a score of approximately "5" on this axis, meaning that the children in the low-model-aggression condition displayed an average of about five aggressive acts in the play room after viewing the videotape. In contrast, the bar for the children in the "Moderate" group shows a substantially higher score of about 14 aggressive acts, and the bar for the children in the "High" group shows an even higher score of about 23 aggressive acts.

Expected statistical results. If you analyzed the data for this figure using a one-way ANOVA, you would probably expect the overall treatment effect to be significant because at least two of the groups in Figure 15.2 display means that appear to be substantially different
from one another. You would also expect to see significant multiple comparison tests, because

• The mean for the moderate-model-aggression group appears to be substantially higher than the mean for the low-model-aggression group; this suggests that the multiple comparison test comparing these two groups would probably be significant.

• The mean for the high-model-aggression group appears to be substantially higher than the mean for the moderate-model-aggression group; this suggests that the multiple comparison test comparing these two groups would probably be significant.

• The mean for the high-model-aggression group appears to be substantially higher than the mean for the low-model-aggression group; this suggests that the multiple comparison test comparing these two groups would probably be significant.
Conclusions regarding the research hypotheses. Figure 15.2 shows that the greater the amount of aggression modeled in the videotape, the greater the number of aggressive acts subsequently displayed by the children in the play room. It would be reasonable to conclude that the results provide support for the research hypotheses that were stated in the previous section "A Study Investigating Aggression."

Note: Of course, you don't arrive at conclusions such as this by merely preparing a figure and "eyeballing" the data. Instead, you perform the appropriate statistical analyses to confirm your conclusions; these analyses will be illustrated in later sections.

Significant Treatment Effect, Two of Three Multiple Comparison Tests Are Significant

Figure 15.3 illustrates an outcome in which both of the following are true:

• there is a significant treatment effect for the predictor variable

• two of the three possible multiple comparison tests are significant.
Figure 15.3. Mean number of aggressive acts as a function of the level of aggression displayed by the model (significant treatment effect; two of three multiple comparison tests are significant).
Expected statistical results. If you analyzed the data for this figure using a one-way ANOVA, you would probably expect the overall treatment effect to be significant because at least two of the groups in Figure 15.3 have means that appear to be substantially different from one another. You would also expect to see two significant multiple comparison tests, because

• The mean for the high-model-aggression group appears to be substantially higher than the mean for the moderate-model-aggression group.

• The mean for the high-model-aggression group appears to be substantially higher than the mean for the low-model-aggression group.

In contrast, there is very little difference between the mean for the moderate-model-aggression group and the mean for the low-model-aggression group. The multiple comparison test comparing these two groups would probably not demonstrate significance.

Conclusions regarding the research hypotheses. It is reasonable to conclude that the results shown in Figure 15.3 provide partial support for your research hypotheses. The results are somewhat supportive because the high-model-aggression group was more aggressive than the other two groups. However, the results were not fully supportive, as there was not a significant difference between the "Low" and "Moderate" groups.
Nonsignificant Treatment Effect

Figure 15.4 illustrates an outcome in which the treatment effect for the predictor variable is nonsignificant.
Figure 15.4. Mean number of aggressive acts as a function of the level of aggression displayed by the model (treatment effect is nonsignificant).
Expected statistical results. If you analyzed the data for this figure using a one-way ANOVA, you would probably expect the overall treatment effect to be nonsignificant. This is because none of the groups in the figure appear to be substantially different from one another with respect to their mean scores on the criterion variable. Each group displays a mean of approximately 15 aggressive acts, regardless of condition.

When an overall treatment effect is nonsignificant, it is normally not appropriate to further interpret the results of the multiple comparison procedures. This is important to remember because a later section of this chapter will illustrate a SAS program that requests that multiple comparison tests be computed and printed regardless of the significance of the overall treatment effect. As the researcher, you must remember to always consult this overall test first, and proceed to the multiple comparison results only if the overall treatment effect is significant.

Conclusions regarding the research hypotheses. It is reasonable to conclude that the results shown in Figure 15.4 fail to provide support for your research hypotheses.
Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect

Overview

The steps that you follow in performing an ANOVA will vary depending on whether the treatment effect is significant. This section illustrates an analysis that results in a significant treatment effect. It shows you how to prepare the SAS program, interpret the SAS output, and summarize the results. These procedures are illustrated by analyzing fictitious data from the aggression study that was described previously. In these analyses, the predictor variable is the level of aggression displayed by the model, and the criterion variable is the number of aggressive acts displayed by the children after viewing the videotape.

Choosing SAS Variable Names and Values to Use in the SAS Program

Before you write a SAS program to perform an ANOVA, it is helpful to first prepare a figure similar to Figure 15.5. The purpose of this figure is to help you choose a meaningful SAS variable name for the predictor variable, and meaningful values to represent the different levels under the predictor variable. Carefully choosing meaningful variable names and values at this point will make it easier to interpret your SAS output later.
Figure 15.5. Predictor variable name and values to be used in the SAS program for the aggression study.
SAS variable name for the predictor variable. You can see that Figure 15.5 is very similar to Figure 15.1, except that variable names and values have now been added. Figure 15.5 again shows that the predictor variable in your study is the “Level of Aggression Displayed by Model.” Below this heading is “MOD_AGGR,” which will be the SAS variable name for the predictor variable in your SAS program (“MOD_AGGR” stands for “model aggression”). Of course, you can choose any SAS variable name, but it should be meaningful and must comply with the rules for SAS variable names.
Values to represent conditions under the predictor variable. Below the heading for the predictor variable are the names of the three conditions under this predictor variable: "Low," "Moderate," and "High." Below these headings for the three conditions are the values that you will use to represent these conditions in your SAS program: "L" represents children in the low-model-aggression condition, "M" represents children in the moderate-model-aggression condition, and "H" represents children in the high-model-aggression condition. Choosing meaningful letters such as L, M, and H will make it easier to interpret your SAS output later.

Data Set to Be Analyzed

Table 15.1 presents the data set that you will analyze.

Table 15.1
Variables Analyzed in the Aggression Study
(Data Set Will Produce a Significant Treatment Effect)
____________________________________________
               Model            Subject
 Subject       aggression       aggression
____________________________________________
   01             L                02
   02             L                14
   03             L                10
   04             L                08
   05             L                08
   06             L                15
   07             L                03
   08             L                12
   09             M                13
   10             M                25
   11             M                16
   12             M                20
   13             M                21
   14             M                21
   15             M                17
   16             M                26
   17             H                20
   18             H                14
   19             H                23
   20             H                22
   21             H                24
   22             H                26
   23             H                19
   24             H                29
____________________________________________
Understanding the columns in the table. The columns in Table 15.1 provide the variables that you will analyze in your study. The first column in Table 15.1 is headed “Subject.” This column simply assigns a subject number to each child. The second column is headed “Model aggression.” In this column, the value “L” identifies children who saw the model in the videotape display a low level of aggression, “M” identifies children who saw the model display a moderate level of aggression, and “H” identifies children who saw the model display a high level of aggression.
Finally, the column headed "Subject aggression" indicates the number of aggressive acts that each child displayed in the play room after viewing the videotape. This variable will serve as the criterion variable in your study.

Understanding the rows of the table. The rows in Table 15.1 represent individual children who participated as subjects in the study. The first row represents Subject 1. The "L" under "Model aggression" tells you that this child was in the low condition under the predictor variable. The "02" under "Subject aggression" tells you that this child displayed two aggressive acts after viewing the videotape. The data lines for the remaining children may be interpreted in the same way.

Writing the SAS Program

The DATA step. In preparing the SAS program, you will type the data similar to the way that they appear in Table 15.1. That is, you will have one column to contain subject numbers, one column to indicate the subjects' condition under the model-aggression predictor variable, and one column to indicate the subjects' scores on the criterion variable. Here is the DATA step for your SAS program:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM  MOD_AGGR $  SUB_AGGR;
DATALINES;
01 L 02
02 L 14
03 L 10
04 L 08
05 L 08
06 L 15
07 L 03
08 L 12
09 M 13
10 M 25
11 M 16
12 M 20
13 M 21
14 M 21
15 M 17
16 M 26
17 H 20
18 H 14
19 H 23
20 H 22
21 H 24
22 H 26
23 H 19
24 H 29
;
You can see that the INPUT statement of the preceding program uses the following SAS variable names:

• The SAS variable name SUB_NUM is used to represent subject numbers.

• The SAS variable name MOD_AGGR is used to code subject condition under the model-aggression predictor variable. Values are either L, M, or H for this variable. Note that the variable name is followed by the "$" symbol to indicate that it is a character variable.

• The SAS variable name SUB_AGGR is used to contain subjects' scores on the subject aggression criterion variable.
The PROC step. Following is the syntax for the PROC step needed to perform a one-way ANOVA with one between-subjects factor, and follow it with Tukey's HSD test:

PROC GLM   DATA = data-set-name ;
   CLASS   predictor-variable ;
   MODEL   criterion-variable = predictor-variable ;
   MEANS   predictor-variable ;
   MEANS   predictor-variable / TUKEY  CLDIFF  ALPHA=alpha-level ;
   TITLE1  ' your-name ' ;
RUN;
QUIT;

Substituting the appropriate SAS variable names into this syntax results in the following (line numbers have been added on the left; you will not actually type these line numbers):

1     PROC GLM   DATA=D1;
2        CLASS   MOD_AGGR;
3        MODEL   SUB_AGGR = MOD_AGGR;
4        MEANS   MOD_AGGR;
5        MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
6        TITLE1  'JOHN DOE';
7     RUN;
8     QUIT;
Some notes about the preceding code:

• In Line 1, the PROC GLM statement requests the GLM procedure, and requests that the analysis be performed on data set D1.

• In Line 2, the CLASS statement lists the classification variable as MOD_AGGR (model aggression, the predictor variable in the experiment).

• Line 3 contains the MODEL statement for the analysis. The name of the criterion variable SUB_AGGR appears to the left of the equal sign in this statement, and the name of the predictor variable MOD_AGGR appears to the right of the equal sign.

• Line 4 contains the first MEANS statement:

     MEANS   MOD_AGGR;

  This statement requests that SAS print the group means and standard deviations for the treatment conditions under the predictor variable. You will need these means in interpreting the results, and you will report the means and standard deviations in your analysis report.

• Line 5 contains the second MEANS statement:

     MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;

  This second MEANS statement requests the multiple comparison procedure that will determine which pairs of treatment conditions are significantly different from each other. You should list the name of your predictor variable to the right of the word MEANS. In the preceding statement, MOD_AGGR is the name of the predictor variable in the current analysis, and so it was listed to the right of the word MEANS. The name of your predictor variable should be followed by a slash ("/") and the keywords TUKEY, CLDIFF, and ALPHA=0.05. The keyword TUKEY requests that the Tukey HSD test be performed as a multiple comparison procedure. The keyword CLDIFF requests that the Tukey tests be presented as confidence intervals for the differences between the means. The keyword ALPHA=0.05 requests that alpha (the level of significance) be set at .05 for the Tukey tests, and results in the printing of 95% confidence intervals for differences between means. If you had instead used the keyword ALPHA=0.01, it would have resulted in alpha being set at .01 for the Tukey tests, and in the printing of 99% confidence intervals. If you had instead used the keyword ALPHA=0.1, it would have resulted in alpha being set at .10 for the Tukey tests, and in the printing of 90% confidence intervals. If you omit the ALPHA option, the default is .05 (an example of the 99% variant appears after this list).

• Finally, lines 6, 7, and 8 contain the TITLE1, RUN, and QUIT statements for your program.
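For example, based on the ALPHA discussion above, you could request 99% confidence intervals for the Tukey comparisons by changing only the ALPHA value; everything else in the program stays the same:

     MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.01;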
The complete SAS program. Here is the complete SAS program that will input your data set, perform a one-way ANOVA with one between-subjects factor, and follow with Tukey's HSD test:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM  MOD_AGGR $  SUB_AGGR;
DATALINES;
01 L 02
02 L 14
03 L 10
04 L 08
05 L 08
06 L 15
07 L 03
08 L 12
09 M 13
10 M 25
11 M 16
12 M 20
13 M 21
14 M 21
15 M 17
16 M 26
17 H 20
18 H 14
19 H 23
20 H 22
21 H 24
22 H 26
23 H 19
24 H 29
;
PROC GLM   DATA=D1;
   CLASS   MOD_AGGR;
   MODEL   SUB_AGGR = MOD_AGGR;
   MEANS   MOD_AGGR;
   MEANS   MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
   TITLE1  'JOHN DOE';
RUN;
QUIT;
Keywords for Other Multiple Comparison Procedures

The preceding section showed you how to write a program that would request the Tukey HSD test as a multiple comparison procedure. The sections that follow will show you how to interpret the results generated by that test. However, it is possible that some readers will want to use a multiple comparison procedure other than the Tukey test. In fact, a wide variety of other tests are available with SAS. They can be requested with the MEANS statement, using the following syntax:

   MEANS predictor-variable / mult-comp-proc ALPHA=alpha-level;

You should insert the keyword for the procedure that you want in the location where "mult-comp-proc" appears. With some of these procedures, you can also include the CLDIFF option to request confidence intervals for differences between means. Here is a list of keywords for some frequently used multiple comparison procedures that are available with SAS:

BON        Bonferroni t tests of differences between means

DUNCAN     Duncan's multiple range test

DUNNETT    Dunnett's two-tailed t test, determining if any groups are significantly different from a single control

GABRIEL    Gabriel's multiple-comparison procedure

REGWQ      Ryan-Einot-Gabriel-Welsch multiple range test

SCHEFFE    Scheffe's multiple-comparison procedure

SIDAK      Pairwise t tests of differences between means, with levels adjusted according to Sidak's inequality

SNK        Student-Newman-Keuls multiple range test

T          Pairwise t tests (equivalent to Fisher's least-significant-difference test when cell sizes are equal)

TUKEY      Tukey's studentized range (HSD) test
Output Produced by the SAS Program

Using the OPTIONS statement shown, the preceding program would produce four pages of output. The information that appears on each page is briefly summarized here; later sections will provide detailed guidelines for interpreting these results.

• Page 1 provides class level information and the number of observations in the data set.

• Page 2 provides the ANOVA summary table from the GLM procedure.

• Page 3 provides the results of the first MEANS statement. This MEANS statement simply requests means and standard deviations on the criterion variable for the three treatment conditions.

• Page 4 provides the results of the second MEANS statement. This includes the results of the Tukey multiple comparison tests and the confidence intervals.
Steps in Interpreting the Output

1. Make sure that everything looks correct. With most analyses, you should begin this process by analyzing the criterion variable with PROC MEANS or PROC UNIVARIATE to verify that (a) no "Minimum" observed value in your data set is lower than the theoretically lowest possible score, and (b) no "Maximum" observed value in your data set is higher than the theoretically highest possible score. Before you analyze your data with PROC GLM, you should review the section titled "Summary of Assumptions Underlying One-Way ANOVA with One Between-Subjects Factor," earlier in this chapter. See Chapter 7, "Measures of Central Tendency and Variability," for a discussion of PROC MEANS and PROC UNIVARIATE.

The output created by the GLM procedure in the preceding program also contains information that might help to identify possible errors in writing the program or in typing the data. This section shows how to review that information.
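A minimal sketch of such a preliminary check, assuming the data set and variable names used in this chapter's program:

PROC MEANS DATA=D1 N MIN MAX;
   VAR SUB_AGGR;
   TITLE1 'JOHN DOE';
RUN;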
First review the class level information that appears on Page 1 of the PROC GLM output. This page is reproduced here as Output 15.1.

                             JOHN DOE                             1

                         The GLM Procedure

                      Class Level Information

               Class         Levels    Values
               MOD_AGGR           3    H L M

               Number of observations    24
Output 15.1. Class level information from one-way ANOVA performed on aggression data, significant treatment effect.
First, verify that the name of your predictor variable appears under the heading "Class." Here, you can see that the classification variable is MOD_AGGR.

Under the heading "Levels," the output should indicate how many groups of subjects were included in your study. Output 15.1 correctly indicates that your predictor variable consists of three groups.

Under the heading "Values," the output should indicate the specific numbers or letters that you used to code this predictor variable. Output 15.1 correctly indicates that you used the values "H," "L," and "M."

It is important to use uppercase and lowercase letters consistently when you are coding treatment conditions under the predictor variable. For example, the preceding paragraph indicated that you used uppercase letters (H, L, and M) in coding conditions. If you had accidentally keyed a lowercase "h" instead of an uppercase "H" for one or more of your subjects, SAS would have interpreted that as a code for a new and different treatment condition. This, of course, would have led to errors in the analysis.

Finally, the last line in Output 15.1 indicates the number of observations in the data set. The present example used three groups with eight subjects each, for a total N of 24. Output 15.1 indicates that your data set included 24 observations, so everything appears to be correct at this point.

Page 2 of the output provides the analysis of variance table created by PROC GLM. It is reproduced here as Output 15.2.
                              JOHN DOE                                       2

                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source               DF          Squares      Mean Square    F Value    Pr > F
Model                 2       788.250000       394.125000      18.74    <.0001
Error                21       441.750000        21.035714
Corrected Total      23      1230.000000

         R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
         0.640854      26.97924      4.586471         17.00000

Source               DF        Type I SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001

Source               DF      Type III SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001
Output 15.2. ANOVA summary table from one-way ANOVA performed on aggression data, significant treatment effect.
Near the top of output page 2 on the left side, the name of the criterion variable being analyzed should appear to the right of the heading "Dependent Variable." In Output 15.2, the dependent variable is listed as SUB_AGGR. You will remember that SUB_AGGR stands for "subject aggression." The remainder of Output 15.2 provides information about the analysis of this dependent variable.

The top half of Output 15.2 consists of the ANOVA summary table for the analysis. This ANOVA summary table is made up of columns with headings such as "Source," "DF," "Sum of Squares," and so on. The first column of this table is headed "Source," and below this "Source" heading are three subheadings: "Model," "Error," and "Corrected Total."

Look to the right of the heading "Corrected Total," and under the column headed "DF." For the current output, you will see the number "23." This number represents the corrected total degrees of freedom. This number should always be equal to N – 1, where N represents the total number of subjects for whom you have a complete set of data. In this study, N was 24, and so the corrected total degrees of freedom should be equal to 24 – 1 = 23. Output 15.2 shows that the corrected total degrees of freedom are in fact equal to 23, so again it appears that everything is correct so far.

Later, you will return to the ANOVA summary table that appears in Output 15.2 to determine whether your treatment effect was significant, and to review other important information. For now, however, continue reviewing other pages of output to see if there are any other obvious signs of problems with your analysis. The means and standard deviations for the three groups of subjects are found on output page 3. This page is reproduced here as Output 15.3.
                             JOHN DOE                             3

                         The GLM Procedure

        Level of           -----------SUB_AGGR----------
        MOD_AGGR     N             Mean         Std Dev
        H            8       22.1250000      4.58062691
        L            8        9.0000000      4.75093976
        M            8       19.8750000      4.42194204
Output 15.3. Table of means and standard deviations from one-way ANOVA performed on aggression data, significant treatment effect.
On the left side of Output 15.3 is the heading "Level of MOD_AGGR." Below this heading are the three values used to code the three treatment conditions: "H," "L," and "M." To the right of these values are descriptive statistics for the three groups. You should review these descriptive statistics for any obvious signs of problems (e.g., a sample size that is too large or too small, a group mean that is lower than the lowest possible score on the criterion or higher than the highest possible score).

Output 15.3 shows that, for the "High" group, the sample size was 8, the mean was 22.13, and the standard deviation was 4.58. For the "Low" group, the sample size was 8, the mean was 9.00, and the standard deviation was 4.75. For the "Moderate" group, the sample size was 8, the mean was 19.88, and the standard deviation was 4.42. These sample sizes are correct, and the means and standard deviations all seem reasonable. In summary, these results provide no evidence of an obvious error in writing the program or typing the data. You can therefore proceed to interpret the results that are relevant to your research questions.

2. Determine whether the treatment effect is statistically significant. Since there are no obvious signs from the output that you made errors in writing your program, you can now determine whether your study's predictor variable had an effect on the study's criterion variable. To do this, you will review an F statistic that appears in the ANOVA summary table produced by PROC GLM. This F statistic tests the study's null hypothesis. To refresh your memory, that null hypothesis is reproduced again here:

Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

The F statistic relevant to this null hypothesis appears on page 2 of the output produced by PROC GLM. This output is reproduced here as Output 15.4.
                              JOHN DOE                                       2

                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source               DF          Squares      Mean Square    F Value    Pr > F
Model                 2       788.250000       394.125000      18.74    <.0001
Error                21       441.750000        21.035714
Corrected Total      23      1230.000000

         R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
         0.640854      26.97924      4.586471         17.00000

Source               DF        Type I SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001

Source               DF      Type III SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001
Output 15.4. Type III sums of squares from one-way ANOVA performed on aggression data, significant treatment effect.
The F statistic that you are interested in appears in a sum of squares table toward the bottom of Output 15.4. There are actually two sums of squares tables that appear at the bottom of this page of output. The first table is based on the Type I sums of squares, and the second is based on the Type III sums of squares. For the type of investigation described here, you should generally interpret the Type III sums of squares.

Note that the results presented in the Type I table are often identical to the results in the Type III table, as is the case for the current analysis. However, data from some types of studies will lead to results in the Type I table that are not identical to the results in the Type III table. This may happen, for example, if your study includes more than one predictor variable, or if different numbers of subjects appear in different treatment conditions. In these situations it would typically be best to interpret the results from the Type III table rather than the Type I table. Therefore, to keep things simple, this text recommends that you always interpret the Type III table for the types of studies presented here.

On the left side of this section of output is the heading "Source," which represents "source of variation." Below this heading you will find the name of your predictor variable, which in this case is MOD_AGGR. To the right, you will find analysis of variance information for this treatment effect (the "model aggression" treatment effect):

Below the heading "DF," you can see that the degrees of freedom associated with the model-aggression predictor variable were equal to 2. The formula for these degrees of freedom is k – 1, where k is equal to the number of groups being compared. The current analysis involves three groups, and 3 – 1 = 2.

Below the heading "Type III SS," you can see that the sum of squares associated with the model-aggression predictor variable was equal to 788.25.
Below the heading "Mean Square," you can see that the mean square associated with the model-aggression predictor variable was equal to 394.13.

Below the heading "F Value," you can see that the obtained F statistic associated with the model-aggression predictor variable was equal to 18.74. This is the F statistic that tests your study's null hypothesis.

Below the heading "Pr > F," you can see that the p value (probability value) associated with the preceding F statistic is listed as "<.0001," meaning that it is less than .0001. This p value indicates that the F statistic is significant at the .0001 level.

This last heading ("Pr > F") gives you the probability of obtaining an F statistic that is this large or larger, if the null hypothesis were true. In the present case, this p value is very small: it is less than .0001. When a p value is less than .05, you may reject the null hypothesis, so in this case the null hypothesis of no population differences is rejected. This means that you have a significant treatment effect. In other words, you may tentatively conclude that, in the population, there is a difference between at least two of the treatment conditions.

Because you have obtained a significant F statistic, these results seem to provide support for your research hypothesis that model aggression has an effect on subject aggression. Later, you will review the group means and results of the multiple comparison procedures to see if the results are in the predicted direction. First, however, you will prepare an ANOVA summary table to summarize some of the information from Output 15.4.

3. Prepare your own version of the ANOVA summary table. Table 15.2 provides the completed ANOVA summary table for the current analysis.

Table 15.2
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model and Subject Aggression
(Significant Treatment Effect)
___________________________________________________________________
Source                  df          SS          MS          F      R2
___________________________________________________________________
Model aggression         2      788.25      394.13      18.74 *   .64
Within groups           21      441.75       21.04
Total                   23     1230.00
___________________________________________________________________
Note: N = 24.
* p < .0001
To complete the preceding table, you simply transfer information from Output 15.4 to the appropriate line of the ANOVA summary table. For your convenience, Output 15.4 is reproduced here as Output 15.5.
                              JOHN DOE                                       2

                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source               DF          Squares      Mean Square    F Value    Pr > F
Model                 2       788.250000       394.125000      18.74    <.0001
Error                21       441.750000        21.035714
Corrected Total      23      1230.000000

         R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
         0.640854      26.97924      4.586471         17.00000

Source               DF        Type I SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001

Source               DF      Type III SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001
Output 15.5. Information needed for ANOVA summary table for analysis report, aggression study with significant treatment effect.
Here are some instructions for transferring information from Output 15.5 to the ANOVA summary table in Table 15.2:

• Treatment effect for the predictor variable. The top line in an ANOVA summary table such as Table 15.2 provides information for the study's predictor (or independent) variable. The predictor variable in your study was level of aggression displayed by the model. Information concerning this effect appears to the right of the heading MOD_AGGR in Output 15.5. You can see that all of the information regarding MOD_AGGR (e.g., degrees of freedom, sum of squares, mean square) has been entered on the line headed "Model aggression" in Table 15.2. It is important that you give your predictor variable a short but meaningful name (such as "Model aggression"). You should generally not use the SAS variable name that you used for the predictor variable in your computer analyses because SAS variable names are typically too short to be meaningful to the reader of a research article.

You can see that Table 15.2 does not include an entry for the p value associated with your F statistic. Instead, it simply includes an asterisk (*) next to the F statistic to indicate that it was statistically significant. A note at the bottom of Table 15.2 explains the meaning of this asterisk. The note indicates "* p < .0001," which means that the F statistic was significant at the .0001 level. If the F value had been significant at the .01 level, your note would have looked like this:

* p < .01

If the F value had been significant at the .05 level, your note would have looked like this:

* p < .05

If the F value is not statistically significant, you do not put an asterisk next to it or place a note at the bottom of the table. You can include a column for the p value in an ANOVA summary table, and you might label this column "p" or "Probability." Below this heading, you can record the actual p value. If you do this, you may omit the asterisk and the note at the bottom of the table.

The last column in Table 15.2 is headed "R2," and it is in this location that you provide the R2 value for your predictor variable. In the output of PROC GLM, this statistic appears toward the middle of the page, below the heading "R-Square." In Output 15.5, you can see that the R2 value is 0.640854, which rounds to .64 (R2 is typically rounded to two decimal places). This value is simply the model sum of squares divided by the corrected total sum of squares: 788.25 / 1230.00 = .64. Therefore, the value ".64" appears below the heading "R2" in Table 15.2. This means that the predictor variable (model aggression) accounted for 64% of the variance in the criterion variable (subject aggression) in this study.

• Within groups. The "Within-groups" line of an ANOVA summary table contains information about the error term from the analysis of variance. To find this information for the current analysis, look to the right of the heading "Error" in Output 15.5. You can see that the information from the "Error" line of Output 15.5 has been copied onto the line headed "Within groups" in Table 15.2.

• Total. The total degrees of freedom and the total sum of squares from an analysis of variance can be found to the right of the heading "Corrected Total" in the output of PROC GLM. For the current analysis, look to the right of "Corrected Total" in Output 15.5. You can see that the information from this line has been copied onto the line headed "Total" in Table 15.2.

• Note regarding sample size. Place a note at the bottom of the table to indicate the size of the total sample (Note: N = 24). Your own version of the ANOVA summary table is now complete.
4. Review the sample means and standard deviations. Earlier, you reviewed the F statistic produced by PROC GLM and determined that it was statistically significant. This told you that you had a significant treatment effect. However, at this point, you still do not know which group scored higher on the criterion variable. To determine this, you will review the output that was produced by your first MEANS statement. The first MEANS statement was as follows:

   MEANS MOD_AGGR;

This relatively simple statement calculated the means and standard deviations on the criterion variable for the three treatment conditions. The results produced by this MEANS statement were presented earlier, and appear again in Output 15.6.
                             JOHN DOE                             3

                         The GLM Procedure

        Level of           -----------SUB_AGGR----------
        MOD_AGGR     N             Mean         Std Dev
        H            8       22.1250000      4.58062691
        L            8        9.0000000      4.75093976
        M            8       19.8750000      4.42194204
Output 15.6. Means and standard deviations on the criterion variable for the three treatment conditions, aggression study with significant treatment effect.
Below the heading "Level of MOD_AGGR," you will find the values for the various treatment conditions. You will remember that the value "H" indicates values for the high-model-aggression condition, "M" indicates values for the moderate-model-aggression condition, and the value "L" indicates values for the low-model-aggression condition.

Below the heading "Mean" you will find mean scores on the criterion variable for the three treatment conditions. The values in this column show that subjects in the high-model-aggression condition displayed the highest average score on subject aggression, with a mean of 22.13 (S.D. = 4.58). Subjects in the moderate-model-aggression condition displayed the next highest average score, with a mean of 19.88 (S.D. = 4.42). Finally, subjects in the low-model-aggression condition displayed the lowest average score, with a mean of 9.00 (S.D. = 4.75).

You now know the relative ordering of the three groups' mean scores on the criterion variable. However, you do not know which group means are significantly different from which. In order to determine this, you must consult the results of the multiple comparison test.

5. Review the results of the multiple comparison procedure. Because the F statistic (presented in Output 15.4) was significant, you reject the null hypothesis of no differences in population means. Instead, you tentatively accept the alternative hypothesis, that at least one of the population means is different from at least one of the other population means. But because you have three experimental conditions, you now have a new problem: Which of these groups is significantly different from which? To answer this question, you have requested a multiple comparison procedure called Tukey's HSD test. Following is the MEANS statement that you included in your program to request this test:

   MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
The results of this analysis are presented here as Output 15.7.
                             JOHN DOE                                4

                         The GLM Procedure

          Tukey's Studentized Range (HSD) Test for SUB_AGGR

 NOTE: This test controls the Type I experimentwise error rate.

          Alpha                                   0.05
          Error Degrees of Freedom                  21
          Error Mean Square                   21.03571
          Critical Value of Studentized Range  3.56463
          Minimum Significant Difference        5.7803

    Comparisons significant at the 0.05 level are indicated by ***.

                        Difference
        MOD_AGGR           Between       Simultaneous 95%
        Comparison           Means      Confidence Limits
        H - M                2.250      -3.530      8.030
        H - L               13.125       7.345     18.905  ***
        M - H               -2.250      -8.030      3.530
        M - L               10.875       5.095     16.655  ***
        L - H              -13.125     -18.905     -7.345  ***
        L - M              -10.875     -16.655     -5.095  ***
Output 15.7. Results of Tukey HSD multiple comparison test performed on aggression data, significant treatment effect.
Some notes about Output 15.7:

The heading tells you that this page presents the results of a Tukey test performed on the criterion variable SUB_AGGR.

To the right of "Alpha" is the entry "0.05," which tells you that these tests are performed at the .05 level of significance. You can request different levels of significance (see the section "The PROC step" for instructions on how to do this).

To the right of "Minimum Significant Difference" is the entry "5.7803." This means that, according to the Tukey test, the means for two treatment conditions must differ by at least 5.7803 to be considered significantly different at the .05 level. (For equal group sizes, this value is the critical value of the studentized range multiplied by the square root of the error mean square divided by the group size: 3.56463 × √(21.03571 / 8) ≈ 5.78.)

In the middle of the page, a note says "Comparisons significant at the 0.05 level are indicated by ***." This means that, in the lower portion of the table, you will know that the comparison between two treatment conditions is significant at the .05 level if it is flagged with three asterisks.

In the lower portion of the output, the first section is headed "MOD_AGGR Comparison." This section indicates which treatment conditions are being compared. For example, the first row is headed "H – M." This row provides information about the Tukey test in which the "H" condition (the high-model-aggression condition) is compared to the "M" condition (the moderate-model-aggression condition). Everything in this row provides information about this one comparison.
In the lower portion of the output, the second section is headed "Difference Between Means." The entries in this column represent the difference between the two means that are being compared. For example, the first row contains the "H – M" comparison, which can be read as the "High minus Moderate" comparison. This comparison was made by starting with the mean score on the criterion variable for the "high" group, and subtracting from it the mean score on the criterion variable for the "moderate" group. Output 15.6 showed that the mean for the high group was 22.125, and the mean for the moderate group was 19.875. The difference between these means is calculated as follows:

22.125 – 19.875 = 2.25

As you can see from Output 15.7, the value 2.25 appears in the column headed "Difference Between Means" for the comparison "H – M." The remaining values in this column can be interpreted in the same way. In the row headed "H – L," you will find the difference between means for the high condition versus the low condition (13.125); in the row headed "M – H," you will find the difference between means for the moderate condition versus the high condition (–2.250), and so on.

In the lower portion of the output, on the right side of the table is an area where asterisks ("***") may appear. If these asterisks appear in the row for a particular comparison, it means that there is a significant difference between the two treatment conditions being compared in that row (according to the Tukey test). For example, the first row is the row for the "H – M" comparison. Notice that there are no asterisks on the far right of this row. This means that the difference between the high group and the moderate group is nonsignificant, according to the Tukey test. The second row is the row for the "H – L" comparison. Notice that there are three asterisks on the far right side of this row. This means that there is a significant difference between the high group and the low group, according to the Tukey test.

The remaining rows can be interpreted in the same way. After you review all of the comparisons, you will see that the difference between the high group and the low group is significant, the difference between the moderate group and the low group is significant, but the difference between the high group and the moderate group is nonsignificant.

6. Review the confidence intervals for the differences between means. The MEANS statement that you included in the SAS program that performed your analysis of variance is once again presented here:

   MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
This MEANS statement contains the options TUKEY and CLDIFF. Together, these options request the Tukey HSD test (discussed earlier), and also request that the results be presented in the form of confidence limits. In Chapter 12, "The Single-Sample t Test," you learned that a confidence interval is an interval that extends from a lower confidence limit to an upper confidence limit and is assumed to contain a population parameter with a stated probability, or level of confidence. In this case, the population parameter that you are interested in is the difference between two treatment means in the population. When SAS prints the results that are generated by the preceding MEANS statement, this output will not only contain the observed differences between the sample means (which were discussed in the preceding section), but it will also contain confidence intervals for these differences.

Output 15.8 reproduces the output generated by the preceding MEANS statement. To conserve space, only the lower portion of the output page is reproduced in Output 15.8.

    Comparisons significant at the 0.05 level are indicated by ***.

                        Difference
        MOD_AGGR           Between       Simultaneous 95%
        Comparison           Means      Confidence Limits
        H - M                2.250      -3.530      8.030
        H - L               13.125       7.345     18.905  ***
        M - H               -2.250      -8.030      3.530
        M - L               10.875       5.095     16.655  ***
        L - H              -13.125     -18.905     -7.345  ***
        L - M              -10.875     -16.655     -5.095  ***
Output 15.8. Confidence limits for differences between means from aggression study, significant treatment effect.
As stated earlier, the section of the output that is headed "MOD_AGGR Comparison" indicates which treatment conditions are being compared. For example, the first row is identified with "H – M," which means that this row provides information about the comparison between the high-model-aggression group versus the moderate-model-aggression group.

The second section in this portion of the output is headed "Difference Between Means," and this column indicates the observed difference between the means of the two treatment conditions that are being compared. The entry for the first row is 2.250, which means that the difference between the means of the high group versus the moderate group is 2.250.

The confidence interval for this difference between means appears in the section headed "Simultaneous 95% Confidence Limits." Two columns of numbers appear below this heading. The column of numbers on the left provides the lower limit for the confidence interval, and the column of numbers on the right provides the upper limit for the confidence interval. (Each interval is simply the observed difference between means plus or minus the minimum significant difference reported earlier: 2.250 ± 5.780 gives limits of –3.530 and 8.030.)

Once again, consider the first row in the table, the row for the "H – M" comparison. The observed difference between means for these two groups is 2.250, and the 95% confidence interval extends from –3.530 to 8.030 (these latter two values came from the section headed "Simultaneous 95% Confidence Limits"). This confidence interval means that, although you do not know for sure what the actual difference is between these two conditions in the population, you estimate that there is a 95% probability that the difference is somewhere between –3.530 and 8.030.

Notice that the preceding confidence interval included the value of zero. This is consistent with the results of the Tukey procedure indicating that the difference between the high condition and the moderate condition was nonsignificant. Whenever you find a
Chapter 15: One-Way ANOVA with One Between-Subjects Factor 523
nonsignificant difference between two conditions, the confidence interval for that difference will typically contain zero.

Now consider the second row in the table, the row for the "H – L" comparison. This row provides information about the comparison between the high condition versus the low condition. The observed difference between means for these two groups is 13.125, and the 95% confidence interval extends from 7.345 to 18.905. This confidence interval means that, although you cannot be certain what the actual difference is between these two conditions in the population, this procedure estimates that there is a 95% probability that the difference is between 7.345 and 18.905.

Notice that the preceding confidence interval does not include the value of zero. This is consistent with results of the Tukey procedure indicating that the difference between the high condition versus the low condition was statistically significant. Anytime you find a significant difference between two conditions, the confidence interval for that difference will typically not contain zero. The remaining rows in Output 15.8 (for the remaining comparisons) may be interpreted in the same way.

Notice that half of the rows in Output 15.8 seem to be redundant. For example, you have already seen that the first row provides information about the comparison between the high condition versus the moderate condition. You can see that the third row down provides information about the same comparison. The difference is this: In the first row, the mean for the moderate condition is subtracted from the mean for the high condition, while in the third row the mean for the high condition is subtracted from the mean for the moderate condition. The absolute values of all of the numbers involved are the same, so the two rows provide information that is essentially redundant.

7. Prepare a table that presents the results of the Tukey tests and the confidence intervals. Now that you know how to interpret the Tukey tests and the confidence intervals for the differences between means, you will prepare a table that summarizes them for a report. You can obtain the information that you need from the lower portion of the output page that provided the results of the Tukey multiple comparison procedure. For your convenience, those results are reproduced as Output 15.9.
    Comparisons significant at the 0.05 level are indicated by ***.

                        Difference
        MOD_AGGR           Between       Simultaneous 95%
        Comparison           Means      Confidence Limits
        H - M                2.250      -3.530      8.030
        H - L               13.125       7.345     18.905  ***
        M - H               -2.250      -8.030      3.530
        M - L               10.875       5.095     16.655  ***
        L - H              -13.125     -18.905     -7.345  ***
        L - M              -10.875     -16.655     -5.095  ***
Output 15.9. Information needed for table presenting Tukey tests and confidence intervals, aggression data with significant treatment effect.
The table that you prepare for a published report will use a format that is very similar to the format used with Output 15.9. You can copy information directly from the SAS output to your table. The main difference is that you will use titles, headings, and notes that convey information more clearly. You will also omit lines of information that are redundant. The completed table is shown in Table 15.3.

Table 15.3
Results of Tukey Tests Comparing High-Model-Aggression Group versus
Moderate-Model-Aggression Group versus Low-Model-Aggression Group on
the Criterion Variable (Subject Aggression)
____________________________________________________
                                  Simultaneous 95%
                   Difference    confidence limits
                      between    _________________
Comparison (a)          means    Lower        Upper
____________________________________________________
High – Moderate         2.250    –3.530       8.030
High – Low             13.125 *   7.345      18.905
Moderate – Low         10.875 *   5.095      16.655
____________________________________________________
Note: N = 24.
(a) Differences are computed by subtracting the mean for the second
group from the mean for the first group.
* Tukey test indicates that the difference between the means is
significant at p < .05.
Notice that, in Table 15.3, the column on the left is headed "Comparison." The entries in this column are the names for the treatment conditions (e.g., "High – Moderate"), rather than the simpler letter values that appeared in the SAS output (e.g., "H – M"). This was done to present the information more clearly to the reader.

You can see that a single asterisk is used to identify comparisons that are statistically significant. In Table 15.3, these asterisks appear in the column headed "Difference between means" (in the SAS output, they had appeared on the far right side). A note at the bottom of the table explains the significance level that the asterisk represents.

Notice that there are only three comparisons listed in Table 15.3: high versus moderate, high versus low, and moderate versus low. This includes all possible comparisons between the three treatment conditions in the study. It is true that Output 15.9 actually lists six comparisons, but three of these six comparisons were redundant (as was discussed in an earlier section).

Using a Figure to Illustrate the Results

The results of an analysis of variance are easiest to understand when they are represented in a figure that plots the means for each of the study's treatment conditions. The mean subject aggression scores for the current analysis were presented in Output 15.6. They are presented in bar chart form in Figure 15.6.
Figure 15.6. Mean number of aggressive acts as a function of the level of aggression displayed by model (significant F statistic).
You can see that the three bars in Figure 15.6 illustrate the mean scores for the three groups that are included in the analysis. The figure shows that the mean scores are 9.00, 19.88, and 22.13 for the low-, moderate-, and high-model-aggression conditions, respectively.
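SAS can produce a simple, text-based version of such a chart. Following is a minimal sketch, assuming the data set and variable names used in this chapter (PROC CHART draws the bars with characters rather than producing a polished graphic like Figure 15.6):

PROC CHART DATA=D1;
   VBAR MOD_AGGR / SUMVAR=SUB_AGGR TYPE=MEAN DISCRETE;
   TITLE1 'JOHN DOE';
RUN;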
Analysis Report for the Aggression Study (Significant Results)

Following is an analysis report that summarizes the results of your analysis. A section following this one explains where in your SAS output you can find some of the statistics that appear in this report.

A) Statement of the research question: The purpose of this study was to determine whether there was a relationship between (a) the level of aggression displayed by a model and (b) the number of aggressive acts later demonstrated by children who observed the model.

B) Statement of the research hypothesis: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observed the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.

C) Nature of the variables: This analysis involved one predictor variable and one criterion variable:
• The predictor variable was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.
• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: One-way ANOVA with one between-subjects factor.

E) Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the study population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): Not all µs are equal; In the study population, there is a difference between at least two of the following three groups with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition.
G) Obtained statistic: F(2, 21) = 18.74

H) Obtained probability (p) value: p < .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Multiple comparison procedure: Tukey's HSD test showed that subjects in the high-model-aggression condition and the moderate-model-aggression condition scored significantly higher on subject aggression than did subjects in the low-model-aggression condition (p < .05). With alpha set at .05, there were no significant differences between subjects in the high-model-aggression condition versus the moderate-model-aggression condition.

K) Confidence intervals: Confidence intervals for differences between the means are presented in Table 15.3.

L) Effect size: R2 = .64, indicating that model aggression accounted for 64% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings provide partial support for the study's research hypothesis. The findings provided support for the hypothesis that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression, as well as for the hypothesis that (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression. However, the study failed to provide support for the hypothesis that children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate level of aggression.

N) Formal description of the results for a paper: Results were analyzed using a one-way ANOVA with one between-subjects factor. This analysis revealed a significant treatment effect for level of aggression displayed by the model, F(2, 21) = 18.74, MSE = 21.04, p < .0001. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 22.13 (SD = 4.58), the mean for the moderate-model-aggression condition was 19.88 (SD = 4.42), and the mean for the low-model-aggression condition was 9.00 (SD = 4.75). The sample means are displayed in Figure 15.6. Tukey's HSD test showed that subjects in the high-model-aggression condition and the moderate-model-aggression condition scored significantly higher on subject aggression than did subjects in the low-model-aggression condition (p < .05). With alpha set at .05, there were no significant differences between subjects in the high-model-aggression condition versus the moderate-model-aggression condition. Confidence intervals for differences between the means are presented in Table 15.3. In the analysis, R2 was computed as .64. This indicated that model aggression accounted for 64% of the variance in subject aggression.

O) Figure representing the results: See Figure 15.6.
Notes Regarding the Preceding Analysis Report

Overview. With some sections of the preceding report, it might not be clear where, in your SAS output, to find the necessary statistics to insert in that section. The main ANOVA summary table that was produced by PROC GLM is reproduced here as Output 15.10, because much of the information needed for your report appears on this page of the output.

                              JOHN DOE                                       2

                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source               DF          Squares      Mean Square    F Value    Pr > F
Model                 2       788.250000       394.125000      18.74    <.0001
Error                21       441.750000        21.035714
Corrected Total      23      1230.000000

         R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
         0.640854      26.97924      4.586471         17.00000

Source               DF        Type I SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001

Source               DF      Type III SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      788.2500000      394.1250000      18.74    <.0001
Output 15.10. Information needed for analysis report on aggression study with significant treatment effect.
F Statistic, degrees of freedom, and p value. Items G and H from the preceding analysis report are reproduced here:

G) Obtained statistic: F(2, 21) = 18.74

H) Obtained probability (p) value: p < .0001

You can see that Items G and H provide the F statistic, the degrees of freedom for this statistic, and the p value that is associated with this statistic. It is worthwhile to review where these terms can be found in the SAS output.

The F statistic of 18.74, which appears in Item G of the preceding report, is the F statistic associated with the model aggression predictor variable. It can be found in Output 15.10 where the row headed "MOD_AGGR" intersects with the column headed "F Value."
It is customary to list the degrees of freedom for an F statistic within parentheses. In the preceding report, these degrees of freedom were listed in Item G as "F(2, 21)." The first term (the "2" within the parentheses) represents the degrees of freedom for the numerator in the F ratio (i.e., the degrees of freedom for the model aggression treatment effect). This term appears in Output 15.10 where the row headed "MOD_AGGR" intersects with the column headed "DF." The second term (the "21" within parentheses) represents the degrees of freedom for the denominator in the F ratio (i.e., the degrees of freedom for the error term). This term appears in Output 15.10 where the row headed "Error" intersects with the column headed "DF."

Finally, the p value listed in Item H of the preceding report is the p value associated with the model aggression treatment effect. It appears in Output 15.10 where the row headed "MOD_AGGR" intersects with the column headed "Pr > F."

The MSE (mean square error). Item N from the analysis report provides a statistic abbreviated as the "MSE." The relevant section of Item N is reproduced here:

N) Formal description of the results for a paper: Results were analyzed using a one-way ANOVA with one between-subjects factor. This analysis revealed a significant treatment effect for level of aggression displayed by the model, F(2, 21) = 18.74, MSE = 21.04, p < .0001.

The last sentence of the preceding excerpt indicates that "MSE = 21.04." Here, "MSE" stands for "Mean Square Error." It is an estimate of the error variance in your analysis. In the output from PROC GLM, you will find the MSE where the row headed "Error" intersects with the column headed "Mean Square." In Output 15.10, you can see that the mean square error is equal to 21.035714, which rounds to 21.04.
Example 15.2: One-Way ANOVA Revealing a Nonsignificant Treatment Effect

Overview

This section presents the results of a one-way ANOVA in which the treatment effect is nonsignificant. These results are presented so that you will be prepared to write analysis reports for projects in which nonsignificant outcomes are observed.
The Complete SAS Program

The study presented here is the same aggression study that was described in the preceding section. The data will be analyzed with the same SAS program that was presented earlier. Here, the data have been changed so that they will produce nonsignificant results. The complete SAS program, including the new data set, is presented here:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM
         MOD_AGGR $
         SUB_AGGR;
DATALINES;
01 L 07
02 L 17
03 L 14
04 L 11
05 L 11
06 L 20
07 L 08
08 L 15
09 M 08
10 M 20
11 M 11
12 M 15
13 M 16
14 M 16
15 M 12
16 M 21
17 H 14
18 H 10
19 H 17
20 H 18
21 H 20
22 H 21
23 H 14
24 H 23
;
PROC GLM DATA=D1;
   CLASS MOD_AGGR;
   MODEL SUB_AGGR = MOD_AGGR;
   MEANS MOD_AGGR;
   MEANS MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
   TITLE1 'JOHN DOE';
RUN;
QUIT;
Steps in Interpreting the Output

As with the earlier data set, the SAS program that performs this analysis produces four pages of output. This section will present just those sections of output that are relevant to preparing the ANOVA summary table, the confidence intervals table, the figure, and the analysis report. This section will review the output in a fairly abbreviated manner; for a more detailed discussion of the output of PROC GLM, see earlier sections of this chapter.

1. Determine whether the treatment effect is significant. As before, you determine whether the treatment effect is significant by reviewing the ANOVA summary table produced by PROC GLM. This table appears on page 2 of the output, and is reproduced here as Output 15.11.

                              JOHN DOE                                       2

                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source               DF          Squares      Mean Square    F Value    Pr > F
Model                 2       72.3333333       36.1666667       1.88    0.1778
Error                21      404.6250000       19.2678571
Corrected Total      23      476.9583333

         R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
         0.151655      29.34496      4.389517         14.95833

Source               DF        Type I SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      72.33333333      36.16666667       1.88    0.1778

Source               DF      Type III SS      Mean Square    F Value    Pr > F
MOD_AGGR              2      72.33333333      36.16666667       1.88    0.1778
Output 15.11. ANOVA summary table for one-way ANOVA performed on aggression data, nonsignificant treatment effect.
As with the earlier data set, you review the results of the analyses that appear in the section headed "Type III SS," as opposed to the section headed "Type I SS." To determine whether the treatment effect is significant, look to the right of the heading "MOD_AGGR." Here, you can see that the F statistic is only 1.88, with a p value of .1778. The obtained p value is greater than the standard criterion of .05, which means that this F statistic is nonsignificant. This means that you do not have a significant treatment effect for your predictor variable.

2. Prepare your own version of the ANOVA summary table. The completed ANOVA summary table for this analysis is presented here as Table 15.4.
Table 15.4
ANOVA Summary Table for Study Investigating the Relationship between
Level of Aggression Displayed by Model and Subject Aggression
(Nonsignificant Treatment Effect)
___________________________________________________________________
Source                  df          SS          MS          F      R2
___________________________________________________________________
Model aggression         2       72.33       36.17       1.88 (a) .15
Within groups           21      404.63       19.27
Total                   23      476.96
___________________________________________________________________
Note: N = 24.
(a) F statistic is nonsignificant with alpha set at .05.
Notice how information from Output 15.11 was used to fill in the relevant sections of Table 15.4:

• Information from the line headed "MOD_AGGR" in Output 15.11 was transferred to the line headed "Model aggression" in Table 15.4.

• Information from the line headed "Error" in Output 15.11 was transferred to the line headed "Within groups" in Table 15.4.

• Information from the line headed "Corrected Total" in Output 15.11 was transferred to the line headed "Total" in Table 15.4.

• The R2 value that appeared below the heading "R-Square" in Output 15.11 was transferred below the heading "R2" in Table 15.4.
3. Review the results of the multiple comparison procedure. Notice that, unlike the previous section, this section does not advise you to review the results of the multiple comparison procedure (the Tukey test). This is because the treatment effect in the current analysis was nonsignificant, and you normally would not interpret the results of multiple comparison procedures for treatment effects that are not significant.

4. Review the confidence intervals for the differences between means. Although the results from the F statistic are nonsignificant, it may still be useful to review the size of the confidence intervals created in the analysis. Output 15.12 presents these intervals.
                             JOHN DOE                                4

                         The GLM Procedure

          Tukey's Studentized Range (HSD) Test for SUB_AGGR

 NOTE: This test controls the Type I experimentwise error rate.

          Alpha                                   0.05
          Error Degrees of Freedom                  21
          Error Mean Square                   19.26786
          Critical Value of Studentized Range  3.56463
          Minimum Significant Difference         5.532

    Comparisons significant at the 0.05 level are indicated by ***.

                        Difference
        MOD_AGGR           Between       Simultaneous 95%
        Comparison           Means      Confidence Limits
        H - M                2.250      -3.282      7.782
        H - L                4.250      -1.282      9.782
        M - H               -2.250      -7.782      3.282
        M - L                2.000      -3.532      7.532
        L - H               -4.250      -9.782      1.282
        L - M               -2.000      -7.532      3.532
Output 15.12. Confidence intervals for differences between the means created in analysis of aggression data, nonsignificant differences.
The confidence intervals for the current analysis appear below the heading "Simultaneous 95% Confidence Limits." You can see that none of the comparisons in this section are flagged with three asterisks, which means that none of the differences were significant according to the Tukey test. In addition, you can see that all of the confidence intervals in this section contain the value of zero. This is also consistent with the fact that none of the differences were statistically significant.

Table 15.5 presents the confidence intervals resulting from the Tukey tests that you might prepare for a published report. The note at the bottom of the table tells the reader that all comparisons are nonsignificant. You can see that all of the differences between means and confidence limits came from Output 15.12.
Table 15.5
Results of Tukey Tests Comparing High-Model-Aggression Group versus
Moderate-Model-Aggression Group versus Low-Model-Aggression Group on
the Criterion Variable (Subject Aggression), Nonsignificant
Differences Observed
_____________________________________________________
                                  Simultaneous 95%
                   Difference    confidence limits
                      between    _________________
Comparison (a)        means (b)  Lower        Upper
_____________________________________________________
High – Moderate         2.250    –3.282       7.782
High – Low              4.250    –1.282       9.782
Moderate – Low          2.000    –3.532       7.532
_____________________________________________________
Note: N = 24.
(a) Differences are computed by subtracting the mean for the second
group from the mean for the first group.
(b) With alpha set at .05, all Tukey test comparisons were
nonsignificant.
Using a Graph to Illustrate the Results

Journal articles typically do not include a graph to illustrate group means when the treatment effect is nonsignificant. However, a graph presenting group means will be presented here as an illustration. The means for the three conditions of the present investigation appeared on page 3 of the preceding output (which presented the results from the first MEANS statement). Page 3 from the analysis is reproduced here as Output 15.13.
                             JOHN DOE                             3

                         The GLM Procedure

        Level of           -----------SUB_AGGR----------
        MOD_AGGR     N             Mean         Std Dev
        H            8       17.1250000      4.29077083
        L            8       12.8750000      4.45413131
        M            8       14.8750000      4.42194204
Output 15.13. Means and standard deviations produced by the MEANS statement in analysis of aggression data, nonsignificant differences.
Below the heading "Mean" you will find the mean scores for these three groups. You can see that the mean scores were 17.13, 12.88, and 14.88 for the high, low, and moderate groups, respectively. Figure 15.7 illustrates these group means.
Figure 15.7. Mean number of aggressive acts as a function of the level of aggression displayed by model (nonsignificant F statistic).
Analysis Report for the Aggression Study (Nonsignificant Results)

The results from the preceding analysis could be summarized in the following report. Notice that some results (such as the results of the Tukey test) are not discussed because the treatment effect was nonsignificant.

A) Statement of the research question: The purpose of this study was to determine whether there was a relationship between (a) the level of aggression displayed by a model and (b) the number of aggressive acts later demonstrated by children who observed the model.

B) Statement of the research hypothesis: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observed the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.
C) Nature of the variables: This analysis involved one predictor variable and one criterion variable:
• The predictor variable was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.
• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.
D) Statistical test: One-way ANOVA with one between-subjects factor.
E) Statistical null hypothesis (H0): µ1 = µ2 = µ3; In the population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).
F) Statistical alternative hypothesis (H1): Not all µs are equal; In the population, there is a difference between at least two of the following three groups with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition.
G) Obtained statistic: F(2, 21) = 1.88
H) Obtained probability (p) value: p = .1778
I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.
J) Multiple comparison procedure: The multiple comparison procedure was not appropriate because the F statistic for the ANOVA was nonsignificant.
K) Confidence intervals: Confidence intervals for differences between the means are presented in Table 15.5.
L) Effect size: R² = .15, indicating that model aggression accounted for 15% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study’s research hypothesis.
N) Formal description of the results for a paper: Results were analyzed using a one-way ANOVA with one between-subjects factor. This analysis revealed a nonsignificant treatment
effect for level of aggression displayed by the model, F(2, 21) = 1.88, MSE = 19.27, p = .1778. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 17.13 (SD = 4.29), the mean for the moderate-model-aggression condition was 14.88 (SD = 4.42), and the mean for the low-model-aggression condition was 12.88 (SD = 4.45). The sample means are displayed in Figure 15.7. Confidence intervals for differences between the means (based on Tukey’s HSD test) are presented in Table 15.5. In the analysis, R² was computed as .15. This indicated that model aggression accounted for 15% of the variance in subject aggression.
O) Figure representing the results: See Figure 15.7.
Conclusion
This chapter has shown how to perform an analysis of variance on data from studies in which only one independent variable is manipulated. However, researchers in the social sciences and education often conduct research in which two independent variables are manipulated simultaneously in a single study. With such investigations, it is usually not appropriate to perform two separate one-way ANOVAs on the data––one ANOVA for the first independent variable, and a separate ANOVA for the second independent variable. Instead, it is usually more appropriate to analyze the data with a different statistical procedure: a factorial ANOVA. Performing a factorial analysis of variance not only enables you to determine whether you have significant treatment effects for your two independent variables; it also enables you to test for an entirely different type of effect: an interaction. Chapter 16 introduces you to the concept of an interaction, and shows how to use the GLM procedure to perform a factorial ANOVA with two between-subjects factors.
Chapter 16: Factorial ANOVA with Two Between-Subjects Factors

Introduction .......... 542
  Overview .......... 542
Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors .......... 542
  Overview .......... 542
  Nature of the Predictor and Criterion Variables .......... 542
  The Type-of-Variable Figure .......... 543
  Example of a Study Providing Data Appropriate for This Procedure .......... 543
  True Independent Variables versus Subject Variables .......... 544
  Summary of Assumptions Underlying Factorial ANOVA with Two Between-Subjects Factors .......... 545
Using Factorial Designs in Research .......... 546
A Different Study Investigating Aggression .......... 546
  Overview .......... 546
  Research Method .......... 547
  The Factorial Design Matrix .......... 548
Understanding Figures That Illustrate the Results of a Factorial ANOVA .......... 550
  Overview .......... 550
  Example of a Figure .......... 550
  Interpreting the Means on the Solid Line .......... 551
  Interpreting the Means on the Broken Line .......... 552
  Summary .......... 553
Some Possible Results from a Factorial ANOVA .......... 553
  Overview .......... 553
  A Significant Main Effect for Predictor A Only .......... 554
  Another Example of a Significant Main Effect for Predictor A Only .......... 556
  A Significant Main Effect for Predictor B Only .......... 557
  A Significant Main Effect for Both Predictor Variables .......... 558
  No Main Effects .......... 559
  A Significant Interaction .......... 560
  Interpreting Main Effects When an Interaction Is Significant .......... 562
  Another Example of a Significant Interaction .......... 563
Example of a Factorial ANOVA Revealing Two Significant Main Effects and a Nonsignificant Interaction .......... 565
  Overview .......... 565
  Choosing SAS Variable Names and Values to Use in the SAS Program .......... 565
  Data Set to Be Analyzed .......... 567
  Writing the DATA Step of the SAS Program .......... 568
  Data Screening and Testing Assumptions Prior to Performing the ANOVA .......... 570
  Writing the SAS Program to Perform the Two-Way ANOVA .......... 571
  Log File Produced by the SAS Program .......... 574
  Output Produced by the SAS Program .......... 575
  Steps in Interpreting the Output .......... 576
  Using a Figure to Illustrate the Results .......... 592
  Steps in Preparing the Graph .......... 595
  Interpreting Figure 16.11 .......... 596
  Preparing Analysis Reports for Factorial ANOVA: Overview .......... 597
  Analysis Report Concerning the Main Effect for Predictor A (Significant Effect) .......... 597
  Notes Regarding the Preceding Analysis Report .......... 599
  Analysis Report Regarding the Main Effect for Predictor B (Significant Effect) .......... 601
  Notes Regarding the Preceding Analysis Report .......... 603
  Analysis Report Concerning the Interaction (Nonsignificant Effect) .......... 605
  Notes Regarding the Preceding Analysis Report .......... 607
Example of a Factorial ANOVA Revealing Nonsignificant Main Effects and a Nonsignificant Interaction .......... 607
  Overview .......... 607
  The Complete SAS Program .......... 608
  Steps in Interpreting the Output .......... 609
  Using a Figure to Illustrate the Results .......... 612
  Interpreting Figure 16.12 .......... 613
  Analysis Report Concerning the Main Effect for Predictor A (Nonsignificant Effect) .......... 614
  Analysis Report Concerning the Main Effect for Predictor B (Nonsignificant Effect) .......... 615
  Analysis Report Concerning the Interaction (Nonsignificant Effect) .......... 616
Example of a Factorial ANOVA Revealing a Significant Interaction .......... 617
  Overview .......... 617
  The Complete SAS Program .......... 617
  Steps in Interpreting the Output .......... 618
  Using a Graph to Illustrate the Results .......... 620
  Interpreting Figure 16.13 .......... 621
  Testing for Simple Effects .......... 622
  Analysis Report Concerning the Interaction (Significant Effect) .......... 622
Using the LSMEANS Statement to Analyze Data from Unbalanced Designs .......... 625
  Overview .......... 625
  Reprise: What Is an Unbalanced Design? .......... 625
  Writing the LSMEANS Statements .......... 626
  Output Produced by LSMEANS .......... 627
Learning More about Using SAS for Factorial ANOVA .......... 627
Conclusion .......... 628
Introduction

Overview
This chapter shows how to enter data and prepare SAS programs that will perform a two-way analysis of variance (ANOVA) using the GLM procedure. This chapter focuses on factorial designs with two between-subjects factors, meaning that each subject is exposed to only one condition under each independent variable. It discusses the differences between main effects versus interaction effects in factorial ANOVA. It provides guidelines for interpreting results that do not indicate a significant interaction, and separate guidelines for interpreting results that do indicate a significant interaction. It shows how to use multiple comparison procedures to identify the pairs of groups that are significantly different from each other, how to request confidence intervals for differences between the means, how to interpret an index of effect size, and how to prepare a figure that illustrates cell means. Finally, it shows how to prepare a report that summarizes the results of the analysis.
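As a preview, here is a minimal sketch of the kind of PROC GLM program this chapter builds. The data set and variable names (D1, PREDA, PREDB, CRIT) are placeholders for this sketch, not the names used in the chapter's actual examples.

   PROC GLM DATA=D1;
      CLASS PREDA PREDB;
      * Two main effects plus the A x B interaction;
      MODEL CRIT = PREDA PREDB PREDA*PREDB;
      * Multiple comparisons and confidence intervals for each factor;
      MEANS PREDA PREDB / TUKEY CLDIFF;
   RUN;
   QUIT;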
Situations Appropriate for Factorial ANOVA with Two Between-Subjects Factors

Overview
Factorial ANOVA is a test of group differences that enables you to determine whether there are significant differences between two or more groups with respect to their mean scores on a criterion variable. Furthermore, it enables you to investigate group differences with respect to two independent variables (or predictor variables) at the same time. In summary, factorial ANOVA with two between-subjects factors may be used when you wish to investigate the relationship between (a) two predictor variables (each of which classifies group membership) and (b) a single criterion variable.

Nature of the Predictor and Criterion Variables
Predictor variables. In factorial ANOVA, the two predictor (or independent) variables are classification variables, that is, variables that indicate which group a subject is in. They may be assessed on any scale of measurement (nominal, ordinal, interval, or ratio), but they serve mainly as classification variables in the analysis.
Criterion variable. The criterion (or dependent) variable is typically a multi-value variable. It must be a numeric variable that is assessed on either an interval or ratio level of measurement. The criterion variable must also satisfy a number of additional assumptions, and these assumptions are summarized in a later section.
The Type-of-Variable Figure
The figure below illustrates the types of variables that are typically being analyzed when researchers perform a factorial ANOVA with two between-subjects factors:

     Criterion          Predictors
       Multi      =      Lmt  Lmt
The “Multi” symbol that appears in the above figure shows that the criterion variable in a factorial ANOVA is typically a multi-value variable (a variable that assumes more than six values in your sample). The “Lmt” symbols that appear to the right of the equal sign in the above figure show that the two predictor variables in this procedure are usually limited-value variables (that is, variables that assume only two to six values).

Example of a Study Providing Data Appropriate for This Procedure
The study. Suppose that you are a physiological psychologist conducting research on aggression in children. You are interested in two research questions: (a) does consuming sugar cause children to behave more aggressively?, and (b) are boys more aggressive than girls? To investigate these questions, you conduct an experiment in which you randomly assign 90 children to three treatment conditions:
• 30 children are assigned to a 0-gram condition. These children consume zero grams of sugar each day over a two-month period.
• 30 children are assigned to a 20-gram condition. These children consume 20 grams of sugar each day over a two-month period.
• 30 children are assigned to a 40-gram condition. These children consume 40 grams of sugar each day over a two-month period.
You observe the children over the two-month period, and for each child you record the number of aggressive acts that the child displays each day. At the end of the two-month period, you determine whether the children in the 40-gram condition displayed a mean number of aggressive acts that is significantly higher than the mean displayed by the other two groups. This analysis helps you to determine whether consuming sugar causes children to behave more aggressively.
In the same study, however, you are also interested in investigating sex differences in aggression. At the time that subjects were assigned to conditions you ensured that, within each of the “sugar consumption” treatment groups, half of the children were male and half were female. This means that, in your study, both of the following are true:
• 45 children were in the male group
• 45 children were in the female group.
At the end of the two-month period, you determine whether the male group displays a mean number of aggressive acts that is significantly different from the mean number displayed by the female group.
Why these data would be appropriate for this procedure. The preceding study involved two predictor variables and a single criterion variable. The first predictor variable (Predictor A) was “amount of sugar consumed.” You know that this was a limited-value variable, because it assumed only three values: a 0-gram condition, a 20-gram condition, and a 40-gram condition. This predictor variable was assessed on a ratio scale, since “grams of sugar” has equal intervals and a true zero point. However, remember that the predictor variables that are used in ANOVA may be assessed on any scale of measurement. In general, they are treated as classification variables in the analysis.
The second predictor variable (Predictor B) was “subject sex.” You know that this was a dichotomous variable because it involved only two values: a male group versus a female group. This variable was assessed on a nominal scale, since it indicates group membership but does not convey any quantitative information.
Finally, the criterion variable in this study was the number of aggressive acts that were displayed by the children. You know that this was a multi-value variable if you verified that the children’s scores took on a relatively large number of values (that is, some children might have displayed zero aggressive acts each day, other children might have displayed 50 aggressive acts each day, and still other children might have displayed a variety of aggressive acts between these two extremes). Remember that, for our purposes, we label a variable a multi-value variable if it assumes more than six different values in the sample. The criterion variable was assessed on a ratio scale. You know this because the “number of aggressive acts” has equal intervals and a true zero point.

True Independent Variables versus Subject Variables
Notice that, with the preceding study, one of the predictor variables was a true independent variable, while the other predictor variable was merely a subject variable. A true independent variable is a variable that is manipulated and controlled by the researcher so that it is independent of (uncorrelated with) any other independent variable in the study. In this study, “amount of sugar consumed” (Predictor A) was a true independent variable because it was manipulated and controlled by you, the researcher. You manipulated this variable by randomly assigning subjects to either the 0-gram condition, the 20-gram condition, or the 40-gram condition.
In contrast to a true independent variable, a subject variable is a characteristic of the subject that is not directly manipulated by the researcher, but is used as a predictor variable in the study. In the preceding study, “subject sex” (Predictor B) was a subject variable. You know that this is a subject variable because sex is a characteristic of the subject that is not directly manipulated by the researcher. You know that subject sex is not a true independent variable because it is not possible to manipulate it in a direct fashion (i.e., it is not possible to randomly assign half of your subjects to be male, and half to be female). With a subject variable, you simply note which condition a subject is already in; you do not assign the subject to that condition. Other examples of subject variables include age, political party, and race.
When you perform a factorial ANOVA, it is possible to use any combination of true independent variables and subject variables. That is, you can perform an analysis in which
• both predictor variables are true independent variables
• both predictor variables are subject variables
• one predictor is a true independent variable and the other predictor is a subject variable.
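To make the sugar study concrete, here is one hypothetical way its data could be entered in a SAS DATA step. The variable names (SUGAR, SEX, AGGRESS) and the scores shown are placeholders invented purely to show the layout, not real data; only the first few of the 90 children are shown.

   DATA D1;
      INPUT SUB_NUM  SUGAR $  SEX $  AGGRESS;
      DATALINES;
   01  0G   M  14
   02  0G   F   9
   03  20G  M  17
   04  20G  F  12
   05  40G  M  25
   06  40G  F  19
   ;
   RUN;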
Summary of Assumptions Underlying Factorial ANOVA with Two Between-Subjects Factors
• Level of measurement. The criterion variable must be a numeric variable that is assessed on an interval or ratio level of measurement. The predictor variables may be assessed on any level of measurement, although they are essentially treated as nominal-level (classification) variables in the analysis.
• Independent observations. An observation should not be dependent on any other observation in any cell (the meaning of the term “cell” will be explained in a later section). In practical terms, this means that each subject is exposed to only one condition under each predictor variable, and that subject matching procedures are not used.
• Random sampling. Scores on the criterion variable should represent a random sample that is drawn from the populations of interest.
• Normal distributions. Each cell should be drawn from a normally-distributed population. If each cell contains over 30 subjects, the test is robust against moderate departures from normality (in this context, “robust” means that the test will still provide accurate results as long as violations of the assumptions are not large). You should analyze your data with PROC UNIVARIATE using the NORMAL option to determine whether your data meet this assumption (a sketch of this check appears after this list). Remember that the significance tests for normality provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large.
• Homogeneity of variance. The populations represented by the various cells should have equal variances on the criterion. If the number of subjects in the largest cell is no more than 1.5 times greater than the number of subjects in the smallest cell, the test is robust against moderate violations of the homogeneity assumption.
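A minimal sketch of the normality check mentioned above, run separately for each cell. The data set and variable names are the same placeholders used earlier, not the names from the chapter's examples.

   * BY-group processing requires sorted data;
   PROC SORT DATA=D1;
      BY PREDA PREDB;
   RUN;

   * Normality tests for the criterion variable within each cell;
   PROC UNIVARIATE DATA=D1 NORMAL;
      BY PREDA PREDB;
      VAR CRIT;
   RUN;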
Using Factorial Designs in Research
Chapter 15, “One-Way ANOVA with One Between-Subjects Factor,” described a simple experiment in which you manipulated a single independent variable: the level of aggression that was displayed by a model. Because there was a single independent variable in that study, it was analyzed using a one-way ANOVA. But suppose that there are two independent variables that you wish to manipulate. In this situation, you might think that it would be necessary to conduct two separate experiments, one for each independent variable. But you would be wrong: In many cases, it will be possible (and preferable) to manipulate both independent variables in a single study. The research design that is used in these studies is called a factorial design. In a factorial design, two or more independent variables are manipulated in a single study so that the treatment conditions represent all possible combinations of the various levels of the independent variables. In theory, a factorial design might include any number of independent variables. In practice, however, it often becomes impractical to use more than three or four. This chapter illustrates factorial designs that include only two independent variables, and such designs can be analyzed using a two-way ANOVA.
A Different Study Investigating Aggression

Overview
To illustrate the concept of factorial design, imagine that you are interested in conducting a different type of study that investigates aggression in nursery-school children. You want to test the following two research hypotheses:
• Hypothesis A: There will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observe the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.
• Hypothesis B: Children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.
You perform a single investigation to test these two hypotheses. In this investigation, you will simultaneously manipulate two independent variables. One of the independent variables will be relevant to Hypothesis A, and the other independent variable will be relevant to Hypothesis B. The following sections describe the research method in more detail.
Note: Although the study and results presented here are fictitious, they are inspired by the actual studies reported by Bandura (1965, 1977).

Research Method
Overview. Suppose that you conduct a study in which 30 nursery-school children serve as subjects. The study is conducted in two stages. In Stage 1, you show a short videotape to your subjects. You manipulate the two independent variables by varying what the children see in this videotape. In Stage 2, you assess the dependent variable (the amount of aggression displayed by the children) to determine whether it has been affected by the independent variables.
The following sections refer to your independent variables as “Predictor A” and “Predictor B” rather than “Independent Variable A” and “Independent Variable B.” This is because the term “predictor variable” is more general, and is appropriate regardless of whether your variable is a true manipulated independent variable (as in the present case), or a nonmanipulated subject variable (such as subject sex).
Stage 1: Manipulating Predictor A. Predictor A in your study is “the level of aggression displayed by the model” or, more concisely, “model aggression.” You manipulate this independent variable by randomly assigning each child to one of three treatment conditions:
• Ten children are assigned to the “low” condition. When the subjects in this group watch the videotape, they see a model demonstrate a relatively low level of aggressive behavior. Specifically, they see a model (an adult female) enter a room that contains a wide variety of toys. For 90% of the tape, the model engages in nonaggressive play (e.g., playing with building blocks). For 10% of the tape, the model engages in aggressive play (e.g., violently punching an inflatable “bobo doll”).
• Another 10 children are assigned to the “moderate” condition. They watch a videotape of the same model in the same playroom, but they observe the model displaying a somewhat higher level of aggressive behavior. Specifically, in this version of the tape, the model engages in nonaggressive play (again, playing with building blocks) 50% of the time, and engages in aggressive play (again, punching the bobo doll) 50% of the time.
• Finally, the last 10 children are assigned to the “high” condition. They watch a videotape of the same model in the same playroom, but in this version the model engages in nonaggressive play 10% of the time, and engages in aggressive play 90% of the time.
Stage 1 continued: Manipulating Predictor B. Predictor B is “the consequences for the model.” You manipulate this independent variable by randomly assigning each child to one of two treatment conditions:
• Fifteen children are assigned to the “model rewarded” condition. Toward the end of the videotape (described above), children in this group see the model rewarded for her behavior. Specifically, the videotape shows another adult who enters the room with the model, praises her, and gives her cookies.
• The other 15 children are assigned to the “model punished” condition. Toward the end of the same videotape, children in this group see the model punished for her behavior: Another adult enters the room with the model, scolds her, shakes her finger at her, and puts her in “time out.”
Stage 2: Assessing the criterion variable. This chapter will refer to the dependent variable in the study as a “criterion variable.” Again, this is because the term “criterion variable” is a more general term that is appropriate regardless of whether your study is a true experiment (as in the present case), or is a nonexperimental investigation.
The criterion variable in this study is the “number of aggressive acts displayed by the subjects” or, more concisely, “subject aggressive acts.” The purpose of your study was to determine whether certain manipulations in your videotape caused some groups of children to behave more aggressively than others. To assess this, you allowed each child to engage in a free play period immediately after viewing the videotape. Specifically, each child was individually escorted to a playroom similar to the one shown in the tape. This playroom contained a large assortment of toys, some of which were appropriate for nonaggressive play (e.g., building blocks), and some of which were appropriate for aggressive play (e.g., an inflatable bobo doll identical to the one in the tape). The children were told that they could do whatever they liked in the playroom, and were then left to play alone.
Outside of the playroom, three observers watched the child through a one-way mirror. They recorded the total number of aggressive acts the child displayed during a 20-minute period in the playroom (an “aggressive act” could be an instance in which the child punches the bobo doll, throws a building block, and so on). Therefore, the criterion variable in your study is the total number of aggressive acts demonstrated by each child during this period.

The Factorial Design Matrix
The factorial design of this study is illustrated in Figure 16.1. You can see that this design is represented by a matrix that consists of two rows and three columns.
Figure 16.1. Factorial design used in the aggression study.
The columns of the matrix. When an experimental design is represented in a matrix such as this, it is easiest to understand if you focus on only one aspect of the matrix at a time. For example, first consider just the three columns in Figure 16.1. The three columns are headed “Predictor A: Level of Aggression Displayed by Model,” and these columns represent the various levels of the “model aggression” independent variable. The first column represents the 10 subjects in level A1 (the children who saw a videotape in which the model displayed a low level of aggression), the second column represents the 10 subjects in level A2 (the children who saw the model display a moderate level of aggression), and the last column represents the 10 subjects in level A3 (the children who saw the model display a high level of aggression).
The rows of the matrix. Now consider just the two rows in Figure 16.1. These rows are headed “Predictor B: Consequences for Model.” The first row is headed “Level B1: Model Rewarded,” and this row represents the 15 children who saw the model rewarded for her behavior. The second row is headed “Level B2: Model Punished,” and represents the 15 children who saw the model punished for her behavior.
The r × c design. It is common to refer to a factorial design as an “r × c” design, in which “r” represents the number of rows in the matrix, and “c” represents the number of columns. The present study is an example of a 2 × 3 factorial design because it has two rows and three columns. If it included four levels of model aggression rather than three, it would be referred to as a 2 × 4 factorial design.
The cells of the matrix. You can see that this matrix consists of six cells. A cell is a location in the matrix where the column for one predictor variable intersects with the row for a second predictor variable. For example, look at the cell where column A1 (low level of model aggression) intersects with row B1 (model rewarded). The entry “5 Subjects” appears in this cell, which means that there were five children who experienced this particular combination of “treatments” under the two predictor variables. More specifically,
it means that there were five subjects who both (a) saw the model engage in a low level of aggression, and (b) saw the model rewarded for her behavior. Now look at the cell in which column A2 (moderate level of model aggression) intersects with row B2 (model punished). Again, the cell contains the entry “5 Subjects,” which means that there was a different group of five children who experienced the treatments of (a) seeing the model display a moderate level of aggression and (b) seeing the model punished for her behavior. In the same way, you can see that there was a separate group of five children assigned to each of the six cells of the matrix.
Earlier, it was said that a factorial design involves two or more independent variables being manipulated so that the treatment conditions represent all possible combinations of the various levels of the independent variables. The cells of Figure 16.1 illustrate this concept. You can see that the six cells of the figure represent every possible combination of (a) level of aggression displayed by the model and (b) consequences for the model. This means that, for the children who saw a low level of model aggression, half of them saw the model rewarded, and the other half saw the model punished. The same is true for the children who saw a moderate level of model aggression, as well as for the children who saw a high level of model aggression.
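One quick way to confirm that the six cells contain the intended numbers of subjects is a crosstabulation of the two predictors. This sketch is not from the original text; it reuses the placeholder names from earlier code, with PREDB as the rows and PREDA as the columns to match Figure 16.1.

   * One count per cell; each should be 5 subjects in this design;
   PROC FREQ DATA=D1;
      TABLES PREDB*PREDA / NOROW NOCOL NOPERCENT;
   RUN;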
Understanding Figures That Illustrate the Results of a Factorial ANOVA

Overview
Factorial designs are popular in research for a variety of reasons. One reason is that they allow you to test for several different types of effects in a single investigation. The types of effects that may be produced from a factorial study will be discussed in the next section. However, it is important to note that this advantage has a corresponding drawback: Because they involve different types of effects, factorial designs sometimes produce results that can be difficult to interpret, compared to the simpler results that are produced in a one-way ANOVA. Fortunately, however, this task of interpretation can be made much easier if you first prepare a figure that plots the results of the factorial study. This section shows how to interpret these figures.

Example of a Figure
Figure 16.2 presents one type of figure that is often used to illustrate the results of a factorial study. Notice that, with this figure, scores on the criterion variable (“Subject Aggressive Acts”) are plotted on the vertical axis.
Figure 16.2. Example of one type of figure that is often used to illustrate the results from a factorial ANOVA.
The three levels of Predictor A (level of aggression displayed by model) are plotted on the horizontal axis. The first point on this axis is labeled “Low,” and represents group A1 (the children who watched the model display a low level of aggression). The middle point is labeled “Moderate,” and represents group A2 (the children who watched the model display a moderate level of aggression). The point at the right is labeled “High,” and represents group A3 (the children who watched the model display a high level of aggression).
The two levels of Predictor B (consequences for the model) are identified by drawing two different lines in the body of the figure itself. Specifically, the mean aggression scores displayed by children who saw the model rewarded (level B1) are illustrated with small circles connected by a solid line, while the mean aggression scores displayed by the children who saw the model punished (level B2) are displayed by small triangles connected by a broken line.

Interpreting the Means on the Solid Line
You will remember that your investigation involved six groups of subjects, corresponding to the six cells of the factorial design matrix described earlier. Figure 16.2 illustrates the mean score for each of these six groups. To read these means, begin by focusing on just the solid line with circles. This line provides means for the subjects who saw the model rewarded. First, find the circle that appears above the label “Low” on the figure’s horizontal axis. This
circle represents the mean aggression score for the five children who (a) saw the model display a low level of aggression, and (b) saw the model rewarded. Look to the left of this circle to find the mean score for this group on the “Subject Aggressive Acts” axis. The circle for this group is found at about 13 on this axis. This means that the five children in this group displayed an average of approximately 13 aggressive acts in the playroom after they watched the videotape.
Now find the next circle on the solid line, above the label “Moderate” on the horizontal axis. This circle represents the five children in the group that (a) saw the model display a moderate level of aggression, and (b) saw the model rewarded. Looking to the vertical axis on the left, you can see that this group displayed a mean score of about 18. This means that the five children in this group displayed an average of approximately 18 aggressive acts in the playroom after they watched the videotape.
Finally, find the circle on the solid line above the label “High” on the horizontal axis. This circle represents the five children in the group who (a) saw the model display a high level of aggression, and (b) saw the model rewarded. Looking to the vertical axis on the left, you can see that this group displayed a mean score of about 24, meaning that they engaged in an average of approximately 24 aggressive acts in the playroom after watching the videotape. These three circles are all connected by a single solid line, indicating that all of these subjects were in the same condition under Predictor B––the model-rewarded condition.

Interpreting the Means on the Broken Line
Next you will find the mean scores for the subjects in the other condition under Predictor B: the children who saw the model punished. To do this, focus on the broken line with triangles.
First, find the triangle that appears above the label “Low” on the figure’s horizontal axis. This triangle provides the mean aggression score for the five children who (a) saw the model display a low level of aggression, and (b) saw the model punished. Look to the left of this triangle to find the mean score for this group on the “Subject Aggressive Acts” axis. The triangle for this group is found at about 2 on this axis, which means that the five children in this group displayed an average of approximately 2 aggressive acts in the playroom after they watched the videotape.
Repeating this process for the two other triangles on the broken line shows the following:
• The five children who (a) saw the model display a moderate level of aggression, and (b) saw the model punished displayed an average of approximately 7 aggressive acts in the playroom.
• The five children who (a) saw the model display a high level of aggression, and (b) saw the model punished displayed an average of approximately 13 aggressive acts in the playroom.
These three triangles are all connected by a single broken line, indicating that all of these subjects were in the same condition under Predictor B––the model-punished condition.
Summary
The important points to remember when interpreting the graphs in this chapter are as follows:
• The possible scores on the criterion variable are represented on the vertical axis.
• The three levels of Predictor A are represented as three different points on the horizontal axis.
• The two levels of Predictor B are represented by drawing two different lines within the graph.
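Before you can draw a graph like Figure 16.2, you need the six cell means themselves, and they can be obtained from SAS. A minimal sketch, again using the placeholder data set and variable names from earlier code rather than the chapter's actual names:

   * Mean, standard deviation, and n for each of the six cells;
   PROC MEANS DATA=D1 MEAN STD N;
      CLASS PREDA PREDB;
      VAR CRIT;
   RUN;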
Now you are ready to learn about the different types of effects that are observed in factorial designs, and how these effects appear when they are plotted in this type of graph.
Some Possible Results from a Factorial ANOVA

Overview
When a predictor variable (or independent variable) in a factorial design displays a significant main effect, it means that, in the population, there is a difference between at least two of the levels of that predictor variable with respect to mean scores on the criterion variable. In a one-way analysis of variance, only one main effect is possible: the main effect for the study’s one independent variable. However, in a factorial design, one main effect is possible for each predictor variable included in the study. Because the present study involves two predictor variables, two types of main effects are possible:
• a main effect for Predictor A
• a main effect for Predictor B.
However, a factorial ANOVA can also produce an entirely different type of effect that is not possible with a one-way ANOVA––it can reveal a significant interaction between Predictor A and Predictor B. When an interaction is significant, it means that the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable (a later section will discuss interactions in more detail). The following sections show how main effects and interactions might appear when plotted in a graph.
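Before turning to those graphs, it may help to see the three effects as terms in the standard two-way ANOVA model. The notation below is the conventional textbook form, not taken from this guide:

$$Y_{ijk} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ijk}$$

Here $Y_{ijk}$ is the score of subject $k$ in cell $(i, j)$, $\mu$ is the grand mean, $\alpha_i$ is the main effect of level $i$ of Predictor A, $\beta_j$ is the main effect of level $j$ of Predictor B, $(\alpha\beta)_{ij}$ is the interaction term, and $\varepsilon_{ijk}$ is random error. The F tests in a factorial ANOVA ask, in turn, whether the $\alpha$, $\beta$, and $(\alpha\beta)$ terms are all zero in the population.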
A Significant Main Effect for Predictor A Only
Figure 16.3 shows one example of how a graph may appear when there is
• A significant main effect for Predictor A. Predictor A was the level of aggression displayed by the model: low versus moderate versus high.
• A nonsignificant main effect for Predictor B. Predictor B was consequences for the model: model-rewarded versus model-punished.
• A nonsignificant effect for the interaction.
In other words, Figure 16.3 shows a situation in which the only significant effect is a main effect for Predictor A.
Figure 16.3. A significant main effect for Predictor A (level of aggression displayed by the model); nonsignificant main effect for Predictor B; nonsignificant interaction.
Interpreting the graph. Figure 16.3 shows that a relatively low level of aggression was displayed by subjects in the “low” condition of Predictor A (the level of aggression displayed by the model). When you look above the label “Low” on the horizontal axis, you can see that both the children in the model-rewarded group (represented with a small circle) as well as the children in the model-punished group (represented with a small triangle) display relatively low scores on aggression (the two groups demonstrated a mean of approximately 6 aggressive acts in the playroom). However, a somewhat higher level of
aggression was demonstrated by subjects in the moderate-model-aggression condition: When you look above the label “Moderate” on the horizontal axis, you can see that the children in this condition displayed an average of about 12 aggressive acts. Finally, an even higher level of aggression was displayed by subjects in the high-model-aggression condition: When you look above the label “High,” you can see that the children in this condition displayed an average of approximately 18 aggressive acts. In short, this trend shows that there was a main effect for the model aggression variable. Figure 16.3 shows that, the greater the amount of aggression displayed by the model in the videotape, the greater the number of aggressive acts subsequently displayed by the children when they were in the playroom.
Characteristics of a main effect for Predictor A when graphed. This leads to an important point: When a figure representing the results of a factorial study displays a significant main effect for Predictor A, it will demonstrate both of the following characteristics:
• Corresponding line segments are parallel.
• At least one set of corresponding line segments displays a relatively steep angle (or slope).
First, you need to understand that “corresponding line segments” refers to line segments that (a) run from one point on the horizontal axis for Predictor A to the next point on the same axis, and (b) appear immediately above and below each other in a figure. For example, the solid line and the broken line that run from “Low” to “Moderate” in Figure 16.3 are corresponding line segments. Similarly, the solid line and the broken line that go from “Moderate” to “High” in Figure 16.3 are also corresponding line segments.
The first of the two preceding conditions––that the lines should be parallel––conveys that the two predictor variables are not involved in an interaction. This is important because you typically will not interpret a main effect for a predictor variable if that predictor variable is involved in a significant interaction (the meaning of interaction will be discussed later in the section “A Significant Interaction”). In Figure 16.3, you can see that the lines for the two conditions under Predictor B (the solid line and the broken line) are parallel to one another. This suggests that there probably is not an interaction between Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model) in the present study.
The second condition––that at least one set of corresponding line segments should display a relatively steep angle––can be understood by again referring to Figure 16.3. Notice that the segment that begins at “Low” (the low-model-aggression condition) and extends to “Moderate” (the moderate-model-aggression condition) is not horizontal; it displays an upward angle––a positive slope. Obviously, this is because the aggression scores for the moderate-model-aggression group were higher than the aggression scores for the low-model-aggression group. When you obtain a significant effect for the “Predictor A” variable in your study, you should expect to see this type of angle. Similarly, you can see that the line segment that begins at “Moderate” and continues to “High” also displays an upward angle, also consistent with a significant effect for the model aggression variable.
Remember that these guidelines are merely intended to help you understand what a main effect looks like when it is plotted in a graph such as Figure 16.3. To determine whether this main effect is statistically significant, it will of course be necessary to review the results of the analysis of variance, to be discussed below.

Another Example of a Significant Main Effect for Predictor A Only
Figure 16.4 shows another example of a significant main effect for the model aggression factor. You know that this figure illustrates a main effect for Predictor A, because both of the following are true:
• the corresponding line segments are all parallel
• one set of corresponding line segments displays a relatively steep angle.
Figure 16.4. Another example of a significant main effect for Predictor A (level of aggression displayed by the model); nonsignificant main effect for Predictor B, nonsignificant interaction.
Notice that the solid line and the broken line that run from “Low” to “Moderate” are parallel to each other. In addition, the solid line and the broken line that run from “Moderate” to “High” are also parallel. This tells you that there is probably not a significant interaction between Predictor A and Predictor B. Where an interaction is concerned, it is irrelevant that the lines show an upward angle from “Low” to “Moderate” and then become level from “Moderate” to “High.” The important point is that the corresponding line segments are
parallel to each other. Think of these lines as being like the rails of a railroad track that twists and curves along the landscape: As long as the two corresponding rails are always parallel to each other (regardless of how much they slope), the interaction is probably not significant.
You know that Figure 16.4 illustrates a significant main effect for Predictor A because the lines demonstrate a relatively steep angle as they run from “Low” to “Moderate.” This tells you that the children who observed a moderate level of model aggression displayed a higher number of aggressive acts after viewing the videotape, compared to the children who observed a low level of model aggression. It is this difference that tells you that you probably have a significant main effect for Predictor A.
You can see from Figure 16.4 that the lines do not demonstrate a relatively steep slope as they run from “Moderate” to “High.” This tells you that there was probably not a significant difference between the children who watched a moderate level of model aggression versus those who observed a high level. But this does not change the fact that you still have a significant effect for Predictor A. When a predictor variable contains three or more conditions, that predictor variable will display a significant main effect if at least two of the conditions are markedly different from each other.

A Significant Main Effect for Predictor B Only
How Predictor B is represented. You would expect to see a different type of pattern in a graph if the main effect for the other predictor variable (Predictor B) were significant. Earlier, you learned that Predictor A was represented in a graph by plotting three points on the horizontal axis. In contrast, you learned that Predictor B was represented by drawing different lines within the body of the graph: one line for each level of Predictor B. In the present study, Predictor B was the “consequences for the model” variable: A solid line was used to represent mean scores from the children who saw the model rewarded, and a broken line was used to represent mean scores from the children who saw the model punished.
Characteristics of a main effect for Predictor B when graphed. When Predictor B is represented in a figure by plotting separate lines for its various levels, a significant main effect for Predictor B is revealed when the figure displays both of the following characteristics:
• Corresponding line segments are parallel.
• At least two of the lines are relatively separated from each other.
Interpreting the graph. For example, a main effect for Predictor B in the current study is represented by Figure 16.5. Consistent with the two preceding points, the two lines in Figure 16.5 (a) are parallel to one another (indicating that there is probably no interaction), and (b) are separated from one another.
Figure 16.5. A significant main effect for Predictor B (consequences for the model); nonsignificant main effect for Predictor A, nonsignificant interaction.
Regarding the separation between the lines: Notice that, in general, the children in the “model rewarded” condition tended to demonstrate a higher number of aggressive acts after viewing the videotape, compared to the children in the “model punished” condition. This is the general trend that you would expect, given the assumptions of social learning theory (Bandura, 1977).
Notice that neither the solid line nor the broken line shows much of an angle, or slope. This indicates that there was probably not a main effect for Predictor A (level of aggression displayed by model).

A Significant Main Effect for Both Predictor Variables
It is possible to obtain significant effects for both Predictor A and Predictor B in the same investigation. When there is a significant effect for both predictor variables, you should see all of the following:
• Corresponding line segments are parallel (indicating no interaction).
• At least one set of corresponding line segments displays a relatively steep angle (indicating a main effect for Predictor A).
• At least two of the lines are relatively separated from each other (indicating a significant main effect for Predictor B).
Figure 16.6 shows what the graph might look like under these circumstances.
Figure 16.6. Significant main effects for both Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model); nonsignificant interaction.
From Figure 16.6, you can see that the broken line and the solid line are parallel, indicating no interaction. Both lines display an upward angle, indicating a significant effect for Predictor A (level of aggression displayed by the model): The children in the “high” condition were more aggressive than the children in the “moderate” condition, who in turn were more aggressive than the children in the “low” condition. Finally, the solid line is higher than the broken line, indicating a significant effect for Predictor B (consequences for the model): The children who saw the model rewarded tended to be more aggressive than the children who saw the model punished.

No Main Effects
Figure 16.7 shows what a graph might look like if there were no main effects for either Predictor A or Predictor B. Notice that the lines are parallel (indicating no interaction), none of the line segments display a relatively steep angle (indicating no main effect for Predictor A), and the lines are not separated (indicating no main effect for Predictor B).
Figure 16.7. Nonsignificant main effects and a nonsignificant interaction.
A Significant Interaction
Overview. An earlier section indicated that, when you perform a two-way ANOVA, there are three types of effects that may be observed: (a) a main effect for Predictor A, (b) a main effect for Predictor B, and (c) an interaction between Predictor A and Predictor B. This section provides definitions for the concept of “interaction,” shows what an interaction might look like when plotted on a graph, and addresses the issue of whether main effects should typically be interpreted when an interaction is significant.
Definitions for “interaction.” The concept of an interaction can be defined in several ways. For example, with respect to experimental research (in which you are actually manipulating true independent variables), the following definition can be used:
• An interaction is a condition in which the effect of one independent variable on the dependent variable is different at different levels of the second independent variable.
On the other hand, for nonexperimental research (in which you are simply measuring naturally-occurring variables rather than manipulating true independent variables), the concept of interaction can be defined in this way:
• An interaction is a condition in which the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable.
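When an interaction of this kind proves significant, a common follow-up is to test the effect of one predictor separately at each level of the other, a topic this chapter treats later under “Testing for Simple Effects.” As one hedged preview, the SLICE= option of the LSMEANS statement is one way such tests can be requested in PROC GLM; the names below are the same placeholders used in earlier sketches, not the chapter's actual variables.

   PROC GLM DATA=D1;
      CLASS PREDA PREDB;
      MODEL CRIT = PREDA PREDB PREDA*PREDB;
      * Test the effect of PREDA separately at each level of PREDB;
      LSMEANS PREDA*PREDB / SLICE=PREDB;
   RUN;
   QUIT;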
Characteristics of an interaction when graphed. These definitions are abstract and somewhat difficult to understand at first reading. However, the concept of interaction is much easier to understand by seeing an interaction illustrated in a graph. A graph indicates that there is probably an interaction between two predictor variables when it displays the following characteristic:
• At least one set of corresponding line segments is not parallel.
An interaction illustrated in a graph. For example, Figure 16.8 displays a significant interaction between Predictor A and Predictor B in the present study. Notice that the solid line and the broken line are no longer parallel: The line representing the children who saw the model rewarded now displays a fairly steep angle, while the line for the children who saw the model punished is relatively flat. This is the key characteristic of a figure that displays a significant interaction: lines that are not parallel.
Figure 16.8. A significant interaction between Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model).
Notice how the relationships depicted in Figure 16.8 are consistent with the definition for interaction provided above: The relationship between one predictor variable (level of aggression displayed by the model) and the criterion variable (subject aggression) is different at different levels of the second predictor variable (consequences for the model).

More specifically, the figure shows that there is a relatively strong, positive relationship between Predictor A (model aggression) and the criterion variable (subject aggression) for the children in the "model-rewarded" level of Predictor B. For the children in this group, the more aggression that they saw from the model, the more aggressively they (the children) behaved when they were in the playroom. The children who saw a high level of model aggression displayed an average of about 23 aggressive acts, the children who saw a moderate level of model aggression displayed an average of about 17 aggressive acts, and the children who saw a low level of model aggression displayed an average of about 9 aggressive acts.

In contrast, notice that Predictor A (level of aggression displayed by the model) had essentially no effect on the children in the "model-punished" level of Predictor B. The children in this condition are represented by the broken line in Figure 16.8. You can see that this broken line is relatively flat: The children in the "high," "moderate," and "low" groups all displayed about the same number of aggressive acts when they were in the playroom (each group displayed about 7 aggressive acts). The fact that there were no differences between the three conditions means that Predictor A had no effect on the children in the model-punished condition.

Interpreting the interaction in the figure. If you conducted the study described here and actually obtained the interaction that is illustrated in Figure 16.8, what would the results mean with respect to the effects of your two predictor variables? These results suggest that the level of aggression displayed by a model can have an effect on the level of aggression later displayed by the subjects, but only if the subjects see the model being rewarded for her aggressive behavior. However, if the subjects instead see the model being punished, then the level of aggression displayed by the model has no effect.

Two caveats apply to this interpretation. First, remember that these results, like most of the results presented in this book, are fictitious, and were provided only to illustrate statistical concepts. They do not necessarily represent what researchers have discovered when conducting research on aggression. Second, remember that the interaction illustrated in Figure 16.8 is just one example of what an interaction might look like when plotted. When you perform a two-way ANOVA, a significant interaction might take on any of an infinite variety of forms. These different types of interactions will have one characteristic in common: they will all involve corresponding line segments that are not parallel. A later section will show a different example of how an interaction from the present study might appear.

Interpreting Main Effects When an Interaction Is Significant

The problem. When you perform a two-way ANOVA, it is possible that you will find that (a) the interaction term is statistically significant, and (b) one or both of the main effects are also statistically significant. When you prepare a report summarizing the results, you will certainly discuss the nature of your significant interaction. But is it also acceptable to discuss and interpret the main effects that were significant? There is some disagreement among statisticians in answering this question. Some statisticians argue that, if the interaction is significant, you should not interpret the main effects at all, even if they are significant. Others take a less extreme approach. They say that it is acceptable to interpret significant main effects, as long as the primary interpretation of the results focuses on the interaction (if it is significant).

Example. The interaction illustrated in Figure 16.8 provides a good example of why you must be very cautious in interpreting main effects when the predictor variables are involved in an interaction. To understand why, consider this: Based on the results presented in the figure, would it make sense to begin your discussion of the results by saying that there is a main effect for Predictor A (level of aggression displayed by the model)? Probably not; to simply say that there is a main effect for Predictor A would be somewhat misleading. It is clear that the level of aggression displayed by the model does seem to have an effect on aggression among children who see the model rewarded, but the graph suggests that the level of aggression displayed by the model probably does not have any real effect on aggression among children who see the model punished. To simply say that there is a main effect for model aggression might mislead readers into believing that exposure to aggressive models is likely to cause increased subject aggression under any circumstances (which it apparently does not).

Recommendations. If it is questionable to discuss main effects under these circumstances, then how should you present your results when you have a significant interaction along with significant main effects? In situations such as this, it often makes more sense to do the following:
• Note that there was a significant interaction between the two predictor variables. Your interpretation of the results should be based primarily upon this interaction.

• Prepare a figure (like Figure 16.8) that illustrates the nature of the interaction.

• If appropriate, further investigate the nature of the interaction by testing for simple effects (a later section explains the concept of "simple effects"; a rough sketch also follows this list).

• If the main effect for a predictor variable was significant, interpret this main effect very cautiously. Remind the reader that the predictor variable was involved in an interaction, and explain how the effect of this predictor variable was different for different groups of subjects.
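As a preview, one common (though simplified) way to test simple effects is to run a separate one-way ANOVA at each level of one predictor. The sketch below borrows the data set name (D1) and variable names (CONSEQ, MOD_AGGR, SUB_AGGR) that are introduced later in this chapter; note that this quick approach does not use the pooled error term from the full factorial model, so treat it only as an informal sketch rather than the procedure recommended in this Guide:

PROC SORT DATA=D1;
   BY CONSEQ;
RUN;
PROC GLM DATA=D1;
   BY CONSEQ;                 /* one one-way ANOVA per consequences condition */
   CLASS MOD_AGGR;
   MODEL SUB_AGGR = MOD_AGGR;
RUN;
QUIT;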
Another Example of a Significant Interaction

An earlier section pointed out that an interaction can assume an almost infinite variety of forms. Figure 16.9 illustrates a different type of interaction for the current study.
Figure 16.9. Another example of a significant interaction between Predictor A (level of aggression displayed by the model) and Predictor B (consequences for the model).
How would you know that these results constitute an interaction? Because one of the sets of corresponding line segments in this figure contains lines that are not parallel. Specifically, you can see that the solid line running from "Moderate" to "High" displays a fairly steep angle, while the broken line running from "Moderate" to "High" does not. Obviously, these line segments are not parallel, and that means that you have an interaction. It is true that another set of line segments in the figure is parallel (i.e., the solid line and broken line running from "Low" to "Moderate" are parallel), but this is irrelevant. As long as at least one set of corresponding line segments displays markedly different angles, the interaction is probably significant.

As was noted before, the purpose of this section was simply to show you how main effects and interactions might appear when plotted in a graph. Obviously, you do not conclude that a main effect or interaction is significant by simply viewing a graph; instead, you make that determination by performing the appropriate statistical analysis using the SAS System. The remainder of this chapter shows how to perform these analyses.
Example of a Factorial ANOVA Revealing Two Significant Main Effects and a Nonsignificant Interaction

Overview

The steps that you follow in performing a factorial ANOVA will vary depending on whether the interaction is significant: If the interaction is significant, you will follow one set of steps, and if the interaction is nonsignificant you will follow a different set of steps. This section illustrates an analysis that results in a nonsignificant interaction, along with two significant main effects. It shows you how to prepare the SAS program, how to interpret the SAS output, and how to write up the results. These procedures are illustrated by analyzing fictitious data from the aggression study described previously. In these analyses, Predictor A is the level of aggression displayed by the model, Predictor B is the consequences for the model, and the criterion variable is the number of aggressive acts displayed by the children after viewing the videotape.

Choosing SAS Variable Names and Values to Use in the SAS Program

Overview. Before you write a SAS program to perform a factorial ANOVA, you might find it helpful to prepare a table similar to the one in Figure 16.10. This table will help you choose (a) meaningful SAS variable names for the variables in the analysis, (b) values to represent the different levels under the predictor variables, and (c) cell designators for the cells that constitute the factorial design matrix. If you carefully choose meaningful variable names and values now, you will find it easier to interpret your SAS output later.
Figure 16.10. Variable names and values to be used in the SAS program for the aggression study.
SAS variable name for Predictor A. You can see that Figure 16.10 is very similar to Figure 16.1, except that variable names and values have now been added. For example, Figure 16.10 again shows that Predictor A in your study is the "Level of Aggression Displayed by Model." Below this heading (within parentheses) is "MOD_AGGR," which will serve as the SAS variable name for Predictor A in your SAS program ("MOD_AGGR" stands for "model aggression"). Obviously, you can choose any SAS variable name that is meaningful and that complies with the rules for SAS variable names.

Values to represent conditions under Predictor A. Below the heading for Predictor A are the names of the three conditions for this predictor variable: "Low," "Moderate," and "High." Below these headings for the three conditions (within parentheses) are the values that you will use to represent these conditions in your SAS program. You will use the value "L" to represent children in the low-model-aggression condition, the value "M" to represent children in the moderate-model-aggression condition, and the value "H" to represent children in the high-model-aggression condition. Choosing meaningful letters such as L, M, and H will make it easier to interpret your SAS output later.

SAS variable name for Predictor B. Figure 16.10 shows that Predictor B in your study is "Consequences for Model." Below this heading (within parentheses) is "CONSEQ," which will serve as the SAS variable name for Predictor B in your SAS program ("CONSEQ" stands for "consequences").

Values to represent conditions under Predictor B. To the right of this heading are the names for the treatment conditions under Predictor B, along with the values that will represent these conditions in your SAS program. You will use the value "MR" to represent children in the "Model Rewarded" condition, and "MP" to represent children in the "Model Punished" condition.

Cell designators. Each cell in Figure 16.10 contains a cell designator that indicates the condition a child was assigned to under both predictor variables. For example, the upper left cell has the designator "Cell MR-L." The "MR" tells you that this group of children was in the "model-rewarded" condition under Predictor B, and the "L" tells you that they were in the "low" condition under Predictor A. Now consider the cell in the middle column of the bottom row of the figure. This cell has the designator "Cell MP-M." Here, the "MP" tells you that this group of children was in the "model-punished" condition under Predictor B, and the "M" tells you that they were in the "moderate" condition under Predictor A.

When you work with these cell designators, remember that the value for the row always comes first, and the value for the column always comes second. This means that the cell at the intersection of row 2 and column 1 should be identified with the designator MP-L, not L-MP. As you will soon see, being able to quickly interpret these cell designators will make it easier to write your SAS program and to interpret the results.
Data Set to Be Analyzed

Table 16.1 presents the data set that you will analyze.

Table 16.1
Variables Analyzed in the Aggression Study (Data Set Will Produce Significant
Main Effects and a Nonsignificant Interaction)
____________________________________________________
            Consequences     Model          Subject
Subject     for model        aggression     aggression
____________________________________________________
 01         MR               L              11
 02         MR               L               7
 03         MR               L              15
 04         MR               L              12
 05         MR               L               8
 06         MR               M              24
 07         MR               M              19
 08         MR               M              20
 09         MR               M              23
 10         MR               M              29
 11         MR               H              23
 12         MR               H              29
 13         MR               H              25
 14         MR               H              20
 15         MR               H              27
 16         MP               L               4
 17         MP               L               0
 18         MP               L               9
 19         MP               L               2
 20         MP               L               8
 21         MP               M              17
 22         MP               M              20
 23         MP               M              12
 24         MP               M              17
 25         MP               M              21
 26         MP               H              12
 27         MP               H              20
 28         MP               H              21
 29         MP               H              20
 30         MP               H              18
____________________________________________________
Understanding the columns in the table. The columns of Table 16.1 provide the variables that you will analyze in your study. The first column in Table 16.1, “Subject,” assigns a unique subject number to each child. The second column is headed “Consequences for model.” This column identifies the condition to which children were assigned under Predictor B, consequences for the model. In this column, the value “MR” identifies children in the model-rewarded condition, and “MP” identifies children in the model-punished condition. You can see that children with subject numbers 1-15 were in the model-rewarded condition, and children with subject numbers 16-30 were in the model-punished condition.
The third column is headed "Model aggression." This column identifies the condition to which children were assigned under Predictor A, level of aggression displayed by the model. In this column, the value "L" identifies children who saw the model in the videotape display a low level of aggression, "M" identifies children who saw the model display a moderate level of aggression, and "H" identifies children who saw the model display a high level of aggression.

Finally, the column headed "Subject aggression" indicates the number of aggressive acts that each child displayed in the playroom after viewing the videotape. This variable will serve as the criterion variable in your study.

Understanding the rows of the table. The rows of Table 16.1 represent the individual children who participated as subjects in the study. The first row represents Subject 1. The "MR" under "Consequences for model" tells you that this child was in the model-rewarded condition under Predictor B. The "L" under "Model aggression" tells you that the subject was in the low condition under Predictor A. Finally, the "11" under "Subject aggression" tells you that this child displayed 11 aggressive acts after viewing the videotape. The rows for the remaining children may be interpreted in the same way.

Stepping back and getting the "big picture" of Table 16.1 shows that it contains every possible combination of the levels of Predictor A and Predictor B. Notice that subjects 1-15 were in the model-rewarded condition under Predictor B, and that subjects 1-5 were in the low-model-aggression condition, subjects 6-10 were in the moderate-model-aggression condition, and subjects 11-15 were in the high-model-aggression condition under Predictor A. For subjects 16-30, the pattern repeats itself, with the exception that these subjects were in the model-punished condition under Predictor B.

Writing the DATA Step of the SAS Program

As you type the SAS program, you will enter the data much as they appear in Table 16.1. That is, you will have one column to contain subject numbers, one column to indicate the subjects' condition under the consequences-for-model predictor variable, one column to indicate the subjects' condition under the model-aggression predictor variable, and one column to indicate the subjects' score on the subject-aggression criterion variable. Here is the DATA step for your SAS program:

OPTIONS  LS=80  PS=60;
DATA D1;
   INPUT   SUB_NUM
           CONSEQ   $
           MOD_AGGR $
           SUB_AGGR;
DATALINES;
01  MR  L  11
02  MR  L   7
03  MR  L  15
04  MR  L  12
05  MR  L   8
06  MR  M  24
07  MR  M  19
08  MR  M  20
09  MR  M  23
10  MR  M  29
11  MR  H  23
12  MR  H  29
13  MR  H  25
14  MR  H  20
15  MR  H  27
16  MP  L   4
17  MP  L   0
18  MP  L   9
19  MP  L   2
20  MP  L   8
21  MP  M  17
22  MP  M  20
23  MP  M  12
24  MP  M  17
25  MP  M  21
26  MP  H  12
27  MP  H  20
28  MP  H  21
29  MP  H  20
30  MP  H  18
;

You can see that the INPUT statement of the preceding program uses the following SAS variable names:
• The variable SUB_NUM represents subject numbers.

• The variable CONSEQ represents the condition that each subject is in under the consequences-for-model predictor variable (values are either MR or MP for this variable; note that the variable name is followed by the "$" symbol to indicate that it is a character variable).

• The variable MOD_AGGR represents the subject's condition under the model-aggression predictor variable (values are either L, M, or H for this variable; note that this variable name is also followed by the "$" symbol to indicate that it is a character variable).

• The variable SUB_AGGR contains subjects' scores on the criterion variable: the number of aggressive acts displayed by the subject in the playroom.
Data Screening and Testing Assumptions Prior to Performing the ANOVA

Overview. Prior to performing the ANOVA, you should perform some preliminary analyses to verify that your data are valid and that you have met the assumptions underlying analysis of variance. This section summarizes these analyses, and refers you to the other sections of this Guide that show you how to perform them.

Basic data screening. Before performing any statistical analyses, you should always verify that your data are valid. This means checking for any obvious errors in typing the data or in writing the DATA step of your SAS program. At the very least, you should analyze your numeric variables with PROC MEANS to verify that the means are reasonable and that you do not have any invalid values. For guidance in doing this, see Chapter 4, "Data Input," the section "Using PROC MEANS and PROC FREQ to Identify Obvious Problems with a Data Set." It is also wise to create a printout of your raw data that you can audit. For guidance in doing this, again see Chapter 4, the section "Using PROC PRINT to Create a Printout of Raw Data." A brief sketch of this kind of screening follows.
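For instance, a minimal screening run for the aggression data (data set D1, with the variables defined in the DATA step shown earlier) might look like this:

PROC MEANS DATA=D1 N MEAN MIN MAX;
   VAR SUB_AGGR;              /* check that the mean, minimum, and maximum are plausible */
RUN;
PROC FREQ DATA=D1;
   TABLES CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;   /* check the coded values and the cell counts */
RUN;
PROC PRINT DATA=D1;           /* create a listing of the raw data to audit */
RUN;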
Testing assumptions underlying the procedure. The first section of this chapter included a list of the assumptions underlying factorial ANOVA. For many of these assumptions (such as "random sampling"), there is no statistical procedure for testing the assumption. The only way to verify that you have met the assumption is to conduct a careful review of how you conducted the study. On the other hand, some of these assumptions can be tested statistically using the SAS System. In particular, an excerpt of one of the assumptions is reproduced below:

• Normal distributions. Each cell should be drawn from a normally-distributed population.
You can use PROC UNIVARIATE to test this assumption. Using the PLOT option with PROC UNIVARIATE prints a stem-and-leaf plot that you can use to determine the approximate shape of the sample data's distribution. Using the NORMAL option with PROC UNIVARIATE requests a test of the null hypothesis that the sample data were drawn from a normally-distributed population. You should perform this test separately for each of the cells that constitute your experimental design (remember that the tests for normality provided by PROC UNIVARIATE tend to be fairly sensitive when samples are large). For guidance in doing this, see Chapter 7, "Measures of Central Tendency and Variability," and especially the section "Using PROC UNIVARIATE to Determine the Shape of Distributions." A sketch of one way to run these cell-by-cell tests follows.
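Here is one way such cell-by-cell tests might be run for the aggression data (a sketch only; remember that with only five subjects per cell these tests have little power):

PROC SORT DATA=D1;
   BY CONSEQ MOD_AGGR;
RUN;
PROC UNIVARIATE DATA=D1 NORMAL PLOT;
   BY CONSEQ MOD_AGGR;        /* one analysis for each of the six cells */
   VAR SUB_AGGR;
RUN;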
Writing the SAS Program to Perform the Two-Way ANOVA

Overview. This section shows you how to write SAS statements to perform an ANOVA with two between-subjects factors. It shows how to prepare the PROC GLM statement, the CLASS statement, the MODEL statement, and the MEANS statement.

The syntax. Below is the syntax for the PROC step that is needed to perform a two-way factorial ANOVA, and to follow it with Tukey's HSD test:

PROC GLM  DATA = data-set-name;
   CLASS  predictorB  predictorA;
   MODEL  criterion-variable = predictorB  predictorA  predictorB*predictorA;
   MEANS  predictorB  predictorA  predictorB*predictorA;
   MEANS  predictorB  predictorA / TUKEY  CLDIFF  ALPHA=alpha-level;
   TITLE1 'your-name';
RUN;
QUIT;

The actual code for the current analysis. Here, the appropriate SAS variable names have been substituted into the syntax (line numbers have been added on the left):

1    PROC GLM  DATA=D1;
2       CLASS  CONSEQ  MOD_AGGR;
3       MODEL  SUB_AGGR = CONSEQ  MOD_AGGR  CONSEQ*MOD_AGGR;
4       MEANS  CONSEQ  MOD_AGGR  CONSEQ*MOD_AGGR;
5       MEANS  CONSEQ  MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
6       TITLE1 'JANE DOE';
7    RUN;
8    QUIT;
Some notes about the preceding code:

• The PROC GLM statement in Line 1 requests the GLM procedure, and requests that the analysis be performed on data set D1.

• The CLASS statement in Line 2 lists the two classification variables as CONSEQ (Predictor B) and MOD_AGGR (Predictor A).

• Line 3 contains the MODEL statement for the analysis. The name of the criterion variable (SUB_AGGR) appears to the left of the equal sign in this statement. To the right of the equal sign, you should list the following:
  – the two predictor variables (CONSEQ and MOD_AGGR, in this case). The names of the two predictor variables should be separated by at least one space.
  – an additional term that represents the interaction between these two predictor variables. To create this interaction term, type the names for Predictor B and Predictor A, and connect them with an asterisk ("*"). This should be typed as a single term with no spaces. For the current analysis, the interaction term was CONSEQ*MOD_AGGR.

• You will notice that every statement containing the SAS variable names for Predictor A and Predictor B lists Predictor B first, followed by Predictor A (that is, CONSEQ always precedes MOD_AGGR). This order may seem counterintuitive, but there is a reason for it: When SAS lists the means for the various cells in the study, the output will be somewhat easier to interpret if you list Predictor B prior to Predictor A in these statements (more on this later).

• Line 4 presents the first MEANS statement:

  MEANS  CONSEQ  MOD_AGGR  CONSEQ*MOD_AGGR;

  This statement requests means and standard deviations on the criterion variable for the various conditions under Predictor B (CONSEQ) and Predictor A (MOD_AGGR). By including the interaction term in this statement (CONSEQ*MOD_AGGR), you ensure that PROC GLM will also print means and standard deviations on the criterion variable for each of the six cells in the factorial design. You will need these means in interpreting the results.

• Line 5 presents the second MEANS statement:

  MEANS  CONSEQ  MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;

  In this statement, you list the names of the two predictor variables, followed by a slash and a number of options. The first option is requested by the keyword TUKEY, which requests that Tukey's HSD test be performed as a multiple comparison procedure in the event that the main effects are significant (for an explanation of multiple comparison procedures, see the section "Treatment Effects, Multiple Comparison Procedures, and a New Index of Effect Size" in Chapter 15, "One-Way ANOVA with One Between-Subjects Factor"). Technically, you could have omitted the predictor variable CONSEQ from this statement because it contains only two conditions; you will remember from the preceding chapter that it is not necessary to perform a multiple comparison procedure on a predictor variable that involves only two conditions.

• The MEANS statement also contains the keyword CLDIFF, which requests that the results of the Tukey test be printed as confidence intervals for the differences between the means. The option ALPHA=0.05 requests that the significance level (alpha) be set at .05 for the Tukey tests. If you had wanted alpha set at .01, you would have used the option ALPHA=0.01, and if you had wanted alpha set at .10, you would have used the option ALPHA=0.1.

• Although the preceding MEANS statement requests the Tukey HSD test, remember that it is possible to request other multiple comparison procedures instead of the Tukey test (such as the Bonferroni t test and the Scheffe multiple comparison procedure). For guidance in doing this, see Chapter 15, the section "Keywords for Other Multiple Comparison Procedures." A sketch appears after this list.

• Finally, Lines 6, 7, and 8 contain the TITLE1, RUN, and QUIT statements for your program.
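For instance, either of the following MEANS statements could replace the Tukey request inside the PROC GLM step (a sketch only; see Chapter 15 for details on these keywords):

MEANS  CONSEQ  MOD_AGGR / BON     CLDIFF  ALPHA=0.05;   /* Bonferroni t tests  */
MEANS  CONSEQ  MOD_AGGR / SCHEFFE CLDIFF  ALPHA=0.05;   /* Scheffe procedure   */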
The complete SAS program. Here is the complete SAS program that will input your data set and perform a factorial ANOVA with two between-subjects factors:

OPTIONS  LS=80  PS=60;
DATA D1;
   INPUT   SUB_NUM
           CONSEQ   $
           MOD_AGGR $
           SUB_AGGR;
DATALINES;
01  MR  L  11
02  MR  L   7
03  MR  L  15
04  MR  L  12
05  MR  L   8
06  MR  M  24
07  MR  M  19
08  MR  M  20
09  MR  M  23
10  MR  M  29
11  MR  H  23
12  MR  H  29
13  MR  H  25
14  MR  H  20
15  MR  H  27
16  MP  L   4
17  MP  L   0
18  MP  L   9
19  MP  L   2
20  MP  L   8
21  MP  M  17
22  MP  M  20
23  MP  M  12
24  MP  M  17
25  MP  M  21
26  MP  H  12
27  MP  H  20
28  MP  H  21
29  MP  H  20
30  MP  H  18
;
PROC GLM  DATA=D1;
   CLASS  CONSEQ  MOD_AGGR;
   MODEL  SUB_AGGR = CONSEQ  MOD_AGGR  CONSEQ*MOD_AGGR;
   MEANS  CONSEQ  MOD_AGGR  CONSEQ*MOD_AGGR;
   MEANS  CONSEQ  MOD_AGGR / TUKEY  CLDIFF  ALPHA=0.05;
   TITLE1 'JANE DOE';
RUN;
QUIT;
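If you would also like to plot the cell means yourself (similar to the figures shown earlier in this chapter), one possible approach uses PROC MEANS to compute the cell means and SAS/GRAPH's PROC GPLOT to draw them. This is only a sketch, and it assumes that SAS/GRAPH is available; note also that a character axis variable is ordered alphabetically (H, L, M) unless you control the order:

PROC MEANS DATA=D1 NOPRINT;
   CLASS CONSEQ MOD_AGGR;
   VAR SUB_AGGR;
   OUTPUT OUT=CELLMEAN MEAN=MEAN_AGG;
RUN;
SYMBOL1 I=JOIN V=CIRCLE L=1;    /* solid line  */
SYMBOL2 I=JOIN V=SQUARE L=2;    /* broken line */
PROC GPLOT DATA=CELLMEAN;
   WHERE _TYPE_ = 3;            /* keep only the six cell means */
   PLOT MEAN_AGG*MOD_AGGR=CONSEQ;
RUN;
QUIT;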
Log File Produced by the SAS Program

Why review the log file? After submitting your SAS program, you should always review the log file prior to reviewing the output file. Verify that the analysis was performed correctly, and look for any error messages, warning messages, or other notes indicating that something went wrong. Log 16.1 contains the log file produced by the preceding program.

NOTE: SAS initialization used:
      real time           20.53 seconds
1     OPTIONS LS=80 PS=60;
2     DATA D1;
3     INPUT   SUB_NUM
4             CONSEQ   $
5             MOD_AGGR $
6             SUB_AGGR;
7     DATALINES;

NOTE: The data set WORK.D1 has 30 observations and 4 variables.
NOTE: DATA statement used:
      real time           1.43 seconds

38    ;
39    PROC GLM DATA=D1;
40    CLASS CONSEQ MOD_AGGR;
41    MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
42    MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
43    MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
44    TITLE1 'JANE DOE';
45    RUN;

NOTE: Means from the MEANS statement are not adjusted for other terms in the model.
      For adjusted means, use the LSMEANS statement.
NOTE: Means from the MEANS statement are not adjusted for other terms in the model.
      For adjusted means, use the LSMEANS statement.

46    QUIT;
NOTE: There were 30 observations read from the dataset WORK.D1.
NOTE: PROCEDURE GLM used:
      real time           3.29 seconds

Log 16.1. Log file produced by the SAS program.
Note regarding the number of observations and variables. Log 16.1 provides no evidence of any obvious errors in conducting the analysis. For example, following line 7 of the program, you can see a note that says "NOTE: The data set WORK.D1 has 30 observations and 4 variables." This is a good sign, because you intended your program to have 30 observations (subjects) and four variables.

Notes regarding the LSMEANS statement. Following line 45 of the program, you can see two notes that both say the same thing: "NOTE: Means from the MEANS statement are not adjusted for other terms in the model. For adjusted means, use the LSMEANS statement." This note is not necessarily a cause for alarm. When you are performing a two-way ANOVA, it is appropriate to use the MEANS statement (rather than the LSMEANS statement) to compute group means on the criterion variable as long as your experimental design is balanced. An experimental design is balanced if the same number of observations (subjects) appears in each cell of the design. For example, Figure 16.10 illustrates the research design used in the aggression study. It shows that there are five subjects in each cell of the design (that is, there are five subjects in the cell of subjects who experienced the "low" condition under Predictor A and the "model-rewarded" condition under Predictor B, there are five subjects in the cell of subjects who experienced the "moderate" condition under Predictor A and the "model-punished" condition under Predictor B, and so forth). Because your experimental design is balanced, there is no need to adjust the means from the analysis for other terms in the model. This means that it is acceptable to use the MEANS statement in your program, and it is not necessary to use the LSMEANS statement.

In contrast, a research design is typically unbalanced if some cells in the design contain a larger number of observations (subjects) than other cells. For example, again consider Figure 16.10. If there were 20 subjects in Cell MR-L, but only five subjects in each of the remaining five cells, then the experimental design would be unbalanced. Note that if you are analyzing data from an unbalanced design, using the MEANS statement may produce marginal means that are biased. Thus, to analyze data from an unbalanced design, it is generally preferable to use the LSMEANS (least-squares means) statement in your program, rather than the MEANS statement. This is because the LSMEANS statement estimates the marginal means over a balanced population, so those estimates are less likely to be biased.

In summary, if your experimental design is balanced, you can ignore the note about LSMEANS that appears in your log file. Because the design of your aggression study is balanced, it is appropriate to use the MEANS statement.

Analyzing unbalanced designs. For guidance in analyzing data from studies with unequal cell sizes, see the section "Using the LSMEANS Statement to Analyze Data from Unbalanced Designs," which appears toward the end of this chapter. A brief sketch follows.
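As a sketch, an unbalanced version of the current analysis might replace the MEANS statements with an LSMEANS statement, as follows (same data set and variable names as above):

PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   LSMEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;   /* least-squares (adjusted) means */
RUN;
QUIT;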
Output Produced by the SAS Program

The preceding program would produce five pages of output. Most of this output is presented in this section. The information that appears on each page is summarized below:

• Page 1 provides class level information and the number of observations in the data set.

• Page 2 provides the ANOVA summary table from the GLM procedure.

• Page 3 provides results from the first MEANS statement. These results consist of three tables that present means and standard deviations for the criterion variable, broken down by the various levels that constitute the study:
  – The first table provides the means observed for each level of CONSEQ.
  – The second table provides the means observed for each level of MOD_AGGR.
  – The third table provides the means observed for each of the six cells in the study's factorial design.

• Page 4 provides the results of the Tukey multiple comparison procedure for Predictor B (CONSEQ).

• Page 5 provides the results of the Tukey multiple comparison procedure for Predictor A (MOD_AGGR).
Steps in Interpreting the Output

Overview. The fictitious data set that was analyzed in this section was designed so that the interaction term would be nonsignificant. When the interaction is nonsignificant, interpreting the results of a two-factor ANOVA is very similar to interpreting the results of a one-factor ANOVA. This section begins by showing you how to review specific sections of output to verify that there were no obvious errors in entering data or in writing the program. It then shows you how to determine whether the interaction is significant, how to determine whether the main effects are significant, how to prepare an ANOVA summary table, and how to review the results of the multiple comparison procedures.

1. Make sure that the output looks correct. The output created by the GLM procedure in the preceding program contains information that may help identify possible errors in writing the program or in entering the data. This section shows how to review that information. You will begin by reviewing the class level information, which appears on page 1 of the PROC GLM output. This page is reproduced here as Output 16.1.

                         JANE DOE                          1
                     The GLM Procedure
                  Class Level Information

               Class       Levels    Values
               CONSEQ           2    MP MR
               MOD_AGGR         3    H L M

               Number of observations    30
Output 16.1. Verifying that everything looks correct on the class level information page; two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
Below the heading "Class," you will find the names of the classification variables (the predictor variables) in your analysis. In Output 16.1, you can see that the classification variables are CONSEQ and MOD_AGGR.

Under the heading "Levels," the output should indicate how many levels (or conditions) there are for each of your predictor variables. Output 16.1 shows that there were two levels for the predictor variable CONSEQ and three levels for the predictor variable MOD_AGGR. This is as it should be.

Under the heading "Values," the output should indicate the specific numbers or letters that you used to code the two predictor variables. Output 16.1 shows that you used the values "MP" and "MR" to code conditions under CONSEQ, and that you used the values "H," "L," and "M" to code conditions under MOD_AGGR. This is all correct.

Remember that it is important to use uppercase and lowercase letters consistently when you are coding treatment conditions under the predictor variables. For example, the preceding paragraph indicated that you used uppercase letters (H, L, and M) in coding conditions under MOD_AGGR. If you had accidentally keyed a lowercase "h" instead of an uppercase "H" for one or more of your subjects, the SAS System would have interpreted that as a code for a new and different treatment condition. This, of course, would have led to errors in the analysis. Again, the point is that it is important to use uppercase and lowercase letters consistently when you are coding treatment conditions, because SAS treats uppercase letters and lowercase letters as different values.

Finally, the last line in Output 16.1 indicates the number of observations in the data set. The present experimental design consisted of six cells with five subjects in each cell, for a total of 30 subjects in the study. Output 16.1 indicates that your data set included 30 observations, and so everything appears to be correct at this point.

Page 2 of the output provides the analysis of variance table created by PROC GLM. It is reproduced here as Output 16.2.
                              JANE DOE                                2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                    Sum of
Source               DF            Squares     Mean Square   F Value   Pr > F
Model                 5        1456.166667      291.233333     22.32   <.0001
Error                24         313.200000       13.050000
Corrected Total      29        1769.366667

R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
0.822988      21.98263      3.612478         16.43333

Source               DF       Type I SS     Mean Square   F Value   Pr > F
CONSEQ                1      276.033333      276.033333     21.15   0.0001
MOD_AGGR              2     1178.866667      589.433333     45.17   <.0001
CONSEQ*MOD_AGGR       2        1.266667        0.633333      0.05   0.9527

Source               DF     Type III SS     Mean Square   F Value   Pr > F
CONSEQ                1      276.033333      276.033333     21.15   0.0001
MOD_AGGR              2     1178.866667      589.433333     45.17   <.0001
CONSEQ*MOD_AGGR       2        1.266667        0.633333      0.05   0.9527
Output 16.2. Verifying that everything looks correct with the ANOVA summary table; two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
Near the top of page 2, on the left side, the name of the criterion variable being analyzed should appear to the right of the heading "Dependent Variable." In Output 16.2, the dependent variable is listed as SUB_AGGR. You will remember that SUB_AGGR stands for "subject aggression." The remainder of Output 16.2 provides information about the analysis of this criterion variable.

The top half of Output 16.2 consists of the ANOVA summary table for the analysis. This ANOVA summary table is made up of columns with headings such as "Source," "DF," "Sum of Squares," and so on. The first column of this table is headed "Source," and below this "Source" heading are three subheadings: "Model," "Error," and "Corrected Total."

Look to the right of the heading "Corrected Total," and under the column headed "DF." For the current output, you will see the number "29." This number represents the corrected total degrees of freedom. This number should always be equal to N - 1, where N represents the total number of subjects for whom you have a complete set of data. In this study, N was 30, and so the corrected total degrees of freedom should be equal to 30 - 1 = 29. Output 16.2 shows that the corrected total degrees of freedom are in fact equal to 29, so again it appears that everything is correct so far.

Later, you will return to the ANOVA summary table that appears in Output 16.2 to determine whether any of the effects are significant (and to review other important information). For now, however, you will continue reviewing other pages of output to see if there are any other obvious signs of problems with your analysis.
Page 3 of the output provides means and standard deviations on the criterion variable broken down by the various conditions under the two predictor variables. This page is reproduced here as Output 16.3.
                              JANE DOE                                3
                          The GLM Procedure

   Level of          -----------SUB_AGGR----------
   CONSEQ       N           Mean          Std Dev
   MP          15     13.4000000       7.28795484
   MR          15     19.4666667       7.31794923

   Level of          -----------SUB_AGGR----------
   MOD_AGGR     N           Mean          Std Dev
   H           10     21.5000000       4.83620604
   L           10      7.6000000       4.59951688
   M           10     20.2000000       4.58984386

   Level of   Level of        ------------SUB_AGGR-----------
   CONSEQ     MOD_AGGR   N           Mean           Std Dev
   MP         H          5     18.2000000        3.63318042
   MP         L          5      4.6000000        3.84707681
   MP         M          5     17.4000000        3.50713558
   MR         H          5     24.8000000        3.49284984
   MR         L          5     10.6000000        3.20936131
   MR         M          5     23.0000000        3.93700394
Output 16.3. Verifying that everything looks correct regarding the means and standard deviations for the various conditions under Predictor A (model aggression) and Predictor B (consequences for model).
Output 16.3 is divided into three tables. The top table provides means and standard deviations on the criterion variable broken down by levels of the predictor variable CONSEQ (consequences for the model). To the right of the value "MP" you will find the mean and standard deviation for the model-punished group. Below the heading "N," you can see that there were 15 subjects in this condition, as expected. Below the heading "Mean," you can see that the mean score of this group on the subject-aggression criterion variable was 13.4, which seems reasonable. Below the heading "Std Dev," you can see that the group's standard deviation was approximately 7.29, which again seems reasonable. To the right of the value "MR," you will find the same statistics for the model-rewarded group. These statistics can be reviewed in the same way to verify that they seem correct.

The second table in Output 16.3 is headed "Level of MOD_AGGR." This table provides means and standard deviations for the various conditions under the MOD_AGGR predictor variable (this was the predictor variable that manipulated the level of aggression displayed by the model). You can use the same procedure (described above) to review the means and standard deviations for the three conditions under this predictor variable, to verify that they are reasonable.

Finally, the bottom table in Output 16.3 provides means and standard deviations broken down by both predictor variables simultaneously. This is analogous to saying that it provides means and standard deviations for each of the six cells that constitute your experimental design. For example, consider the first row in this table. This row is identified by the value "MP" under "Level of CONSEQ" and the value "H" under "Level of MOD_AGGR." This means that this row provides information about the one cell of subjects who were in the model-punished condition under Predictor B (consequences for the model) and in the high-model-aggression condition under Predictor A (level of aggression displayed by the model). Other information in this row shows that (a) there were five subjects in the cell (under "N"), (b) their mean score on the criterion variable was 18.2 (under "Mean"), and (c) their standard deviation was approximately 3.63 (under "Std Dev"). All of these figures seem reasonable. The remaining five rows in the bottom table represent the remaining five cells in your experimental design. You can review the information in these rows in the same way to verify that the numbers seem correct.

In summary, these results provide no evidence of any obvious errors in writing the program or in keying the data. You can therefore proceed to interpreting the results that are relevant to your research questions.

2. Determine whether the interaction term is statistically significant. Two-factor ANOVA allows you to test for three types of effects: (a) the main effect of Predictor A (level of aggression displayed by the model, in this case), (b) the main effect of Predictor B (consequences for the model), and (c) the interaction between Predictor A and Predictor B. Remember that, if the interaction term is significant, you should interpret the main effects only with great caution. So one of your first steps must be to determine whether the interaction is significant.

The null hypothesis for the interaction effect in the aggression study can be stated in this way:

Statistical null hypothesis (H0): In the population, there is no interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects).

You can determine whether the interaction is significant by looking at the analysis of variance results, which appear on page 2 of the SAS output. That page is reproduced here as Output 16.4.
                              JANE DOE                                2
                          The GLM Procedure

Dependent Variable: SUB_AGGR
                                    Sum of
Source               DF            Squares     Mean Square   F Value   Pr > F
Model                 5        1456.166667      291.233333     22.32   <.0001
Error                24         313.200000       13.050000
Corrected Total      29        1769.366667

R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
0.822988      21.98263      3.612478         16.43333

Source               DF       Type I SS     Mean Square   F Value   Pr > F
CONSEQ                1      276.033333      276.033333     21.15   0.0001
MOD_AGGR              2     1178.866667      589.433333     45.17   <.0001
CONSEQ*MOD_AGGR       2        1.266667        0.633333      0.05   0.9527

Source               DF     Type III SS     Mean Square   F Value   Pr > F
CONSEQ                1      276.033333      276.033333     21.15   0.0001
MOD_AGGR              2     1178.866667      589.433333     45.17   <.0001
CONSEQ*MOD_AGGR       2        1.266667        0.633333      0.05   0.9527
Output 16.4. Determining whether the interaction is significant; two-way ANOVA performed on aggression data, nonsignificant interaction, significant main effects.
Toward the top of Output 16.4, you can see that the criterion variable being analyzed is SUB_AGGR. The bottom half of the output page actually provides two sum of squares tables: one based on the Type I sum of squares, and one based on the Type III sum of squares. Remember that it is usually best to interpret only the results that are based on the Type III sum of squares. When your cell sizes are equal (i.e., when the design is balanced), the results from the two sections will be identical. However, when the cell sizes are not equal, the Type III results are more appropriate.

In the lower left corner of Output 16.4 is the heading "Source." Below this heading are the names of the predictor variables in your study (CONSEQ and MOD_AGGR), along with the name of the interaction term (CONSEQ*MOD_AGGR). If you look to the right of the name for the interaction term, you can see that the interaction has 2 degrees of freedom, a value of approximately 1.267 for the Type III sum of squares, a mean square of 0.633, an F value of 0.05, and a corresponding p value of .9527.

Remember that you generally view a result as being statistically significant only if the p value is less than .05. Because this p value of .9527 is larger than .05, you conclude that the interaction between the two predictor variables is nonsignificant. You can therefore proceed with your review of the two main effects.
3. Determine whether either of the two main effects is statistically significant. A two-way ANOVA allows you to test two null hypotheses concerning main effects: one for Predictor A, and one for Predictor B. The null hypothesis for Predictor A (the level of aggression displayed by the model) can be stated as follows:

Statistical null hypothesis (H0): µA1 = µA2 = µA3
In the population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

You can see that the preceding null hypothesis is essentially identical to the null hypothesis that was stated for the aggression study in Chapter 15, "One-Way ANOVA with One Between-Subjects Factor." The difference is in the symbolic representation of the null hypothesis: µA1 = µA2 = µA3. In this symbolic representation, the subscripts to the symbol µ are now "A1," "A2," and "A3." These subscripts identify Level A1, Level A2, and Level A3 under Predictor A, respectively. These levels correspond to the low-model-aggression condition, the moderate-model-aggression condition, and the high-model-aggression condition, respectively.

The F statistic to test this null hypothesis can again be found in the ANOVA summary table in Output 16.4. To the right of the heading MOD_AGGR, you can see that this effect has 2 degrees of freedom, a value of 1178.87 for the Type III sum of squares, a mean square of 589.43, an F value of 45.17, and a p value of <.0001. This p value is less than our standard criterion of .05. With such a small p value, you can clearly reject the null hypothesis of no main effect for model aggression. This means that at least two of the conditions under this predictor variable must be significantly different from each other (although you don't know which two at this point). Later, you will review the results of the Tukey tests to see which groups (low-model-aggression, moderate-model-aggression, etc.) are significantly different from one another.

The second test for main effects is the test for the main effect of Predictor B (the consequences for the model). Remember that Predictor B had only two conditions (model-rewarded versus model-punished). Therefore, you will be able to state the null hypothesis for this predictor variable by using the same format that was used for an independent-samples t test (which always has only two conditions). The null hypothesis for Predictor B (consequences for the model) may be stated as follows:

Statistical null hypothesis (H0): µB1 = µB2
In the population, there is no difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

The appropriate F statistic is found toward the bottom of Output 16.4. To the right of the variable name CONSEQ, you can see that this effect is associated with 1 degree of freedom, a value of 276.03 for the Type III sum of squares, a mean square of 276.03, an F value of 21.15, and a p value of .0001. This p value is also less than .05; therefore you may again reject the null hypothesis. You may conclude that there is also a significant main effect for the consequences-for-model predictor variable.

Because there are only two levels of CONSEQ, you will not have to review the results of the Tukey test for this variable; the Tukey test is necessary only if the predictor variable has three or more conditions. You need only look at the means for the two conditions to determine which condition scored significantly higher than the other. A later section will show you how to do this.

4. Prepare your own version of the ANOVA summary table. The completed ANOVA summary table for the analysis is reproduced here as Table 16.2.

Table 16.2
ANOVA Summary Table for Study Investigating the Relationship between Level of
Aggression Displayed by Model (A), Consequences for Model (B), and Subject
Aggression (Significant Main Effects, Nonsignificant Interaction)
________________________________________________________________________
Source                          df        SS        MS       F       p      R2
________________________________________________________________________
Model aggression (A)             2   1178.87    589.43   45.17   .0001    .67
Consequences for model (B)       1    276.03    276.03   21.15   .0001    .16
A × B Interaction                2      1.27      0.63    0.05   .9527    .00
Within groups                   24    313.20     13.05
Total                           29   1769.37
________________________________________________________________________
Note: N = 30
The headings in Table 16.2 are similar to the headings used in Chapter 15, "One-Way ANOVA with One Between-Subjects Factor." Specifically:

• In the column headed "df," you will provide the degrees of freedom associated with the different sources of variation.

• In the column headed "SS," you will provide the Type III sum of squares.

• In the column headed "MS," you will provide the mean square for each source of variation.

• In the column headed "F," you will provide the F value for each effect.

• In the column headed "p," you will provide the p value (probability value) for each effect (the tables shown in Chapter 15 did not contain a separate column for p values).

• In the column headed "R2," you will provide R2 values for the different effects. You will remember that R2 is an index of effect size that can be used with ANOVA.
To complete the preceding table, you will mostly transfer information from Output 16.4 to the appropriate line of the ANOVA summary table. The one exception is the last column in the table, the column that provides R2 values. You will have to compute those values manually (see "R2 values" in the following list). Here is a description of how information was transferred from Output 16.4 to Table 16.2:

• Main effect for Predictor A. Predictor A in your study was the level of aggression displayed by the model. Information concerning this effect appears to the right of the heading MOD_AGGR in Output 16.4. You can see that all of the information in the row for MOD_AGGR in Output 16.4 (for example, degrees of freedom, sum of squares) has been entered on the line headed "Model aggression (A)" in Table 16.2. The only piece of information that is not directly provided by Output 16.4 is the R2 value (see "R2 values" below).

• Main effect for Predictor B. Predictor B in your study was the consequences for the model. Information concerning this effect appears to the right of the heading CONSEQ in Output 16.4. You can see that all of the information in the row for CONSEQ in Output 16.4 (e.g., degrees of freedom, sum of squares) has been entered on the line headed "Consequences for model (B)" in Table 16.2.

• The A × B interaction. Information about the interaction between Predictor A and Predictor B can be found to the right of the heading CONSEQ*MOD_AGGR in Output 16.4. You can see that all of the information from the row for this CONSEQ*MOD_AGGR interaction (e.g., degrees of freedom, sum of squares) has been entered on the line headed "A × B Interaction" in Table 16.2.

• Within groups. In the chapter about one-way ANOVA, you learned that the "Within groups" line of an ANOVA summary table contains information about the error term from the analysis of variance. To find this information for the current analysis, look to the right of the heading "Error" in Output 16.4. You can see that the information from the "Error" line of Output 16.4 has been copied onto the line headed "Within groups" in Table 16.2.

• Total. In the preceding chapter, you also learned that the total degrees of freedom and the total sum of squares from an analysis of variance can be found to the right of the heading "Corrected Total" in the output of PROC GLM, and the same is true for a factorial ANOVA. For the current analysis, look to the right of "Corrected Total" in Output 16.4. You can see that the information from this line has been copied onto the line headed "Total" in Table 16.2.

• R2 values. In earlier chapters, you learned that it is good practice to report an index of effect size when you conduct an experiment. In general, an index of effect size is a measure of the magnitude of a treatment effect. A variety of different types of indices are available to researchers. In Chapter 15, you learned that an index of effect size that is often used with ANOVA is R2. The R2 statistic indicates the proportion of variance in the criterion variable that is accounted for by the study's predictor variable(s). Values of R2 may range from .00 to 1.00, with larger values indicating a larger treatment effect.
In the preceding chapter, you learned that the output of PROC GLM includes the heading "R-Square," and below this heading you may find the R2 value for a one-way ANOVA. When you perform a two-way ANOVA, you will again see the heading "R-Square" in your output. However, you generally will not report this R2 value in your analysis. The R2 value that appears in the output of PROC GLM indicates the total percent of variance in the criterion variable that is accounted for by all of your treatment effects combined. In most cases, it is better to instead report a separate R2 value for each of your treatment effects individually. This means that you will report one R2 value for Predictor A, one R2 value for Predictor B, and one R2 value for the interaction term.

You will have to perform a few simple hand calculations to compute these three values of R2. To calculate R2 for a given effect, divide the Type III sum of squares associated with that effect by the corrected total sum of squares. For example, Table 16.2 shows that the sum of squares for model aggression (Predictor A) is 1178.87, and the total sum of squares for the analysis is 1769.37. To calculate the R2 value for the model-aggression main effect (Predictor A), you substitute these terms in the formula for R2, as is done below:

         Sum of Squares for Predictor A     1178.87
R2  =   --------------------------------  = -------  =  .67
             Total Sum of Squares           1769.37

The R2 value of .67 indicates that Predictor A (the level of aggression displayed by the model) accounted for 67% of the variance in the criterion variable (the number of aggressive acts displayed by the subjects). This is a fairly large effect size (but remember that these data are fictitious).

To calculate the R2 value for the consequences-for-model main effect (Predictor B), you divide the sum of squares for Predictor B (from Table 16.2) by the same total sum of squares:

         Sum of Squares for Predictor B      276.03
R2  =   --------------------------------  = -------  =  .16
             Total Sum of Squares           1769.37

The R2 value of .16 indicates that Predictor B accounted for 16% of the variance in the criterion variable. This is a smaller treatment effect, compared to the treatment effect for Predictor A.

Finally, to calculate the R2 value for the interaction between Predictor A and Predictor B, you divide the sum of squares for the A × B interaction (from Table 16.2) by the same total sum of squares:

         Sum of Squares for A × B Interaction       1.27
R2  =   --------------------------------------  = -------  =  .00
                Total Sum of Squares               1769.37
This computation resulted in an R2 value of .0007, which rounded to .00. Clearly, the interaction term accounted for almost none of the variance in the criterion variable.
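If you prefer to let SAS do this arithmetic for you, a short DATA step can reproduce the three divisions. This is only a convenience sketch (the data set name R2 and the variable names below are hypothetical, and the sums of squares are typed in from Table 16.2); it is not part of the program presented in this chapter.

DATA R2;
   /* Type III sums of squares copied from Table 16.2 */
   SS_A   = 1178.87;   /* model aggression (Predictor A)       */
   SS_B   = 276.03;    /* consequences for model (Predictor B) */
   SS_AB  = 1.27;      /* A x B interaction                    */
   SS_TOT = 1769.37;   /* corrected total                      */
   /* R-squared for each effect = effect SS / total SS,
      rounded to two decimal places                            */
   R2_A  = ROUND(SS_A  / SS_TOT, .01);
   R2_B  = ROUND(SS_B  / SS_TOT, .01);
   R2_AB = ROUND(SS_AB / SS_TOT, .01);
RUN;
PROC PRINT DATA=R2;
   VAR R2_A R2_B R2_AB;
RUN;

Running this step should print .67, .16, and .00, matching the hand calculations above.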
Once you have computed the R² values for the two main effects and the interaction effect, you can enter them in the column headed "R²" in your ANOVA summary table. You can see that this has already been done in Table 16.2.

5. Review the sample means and the results of the multiple comparison procedures for Predictor A. If a particular main effect is statistically significant, you can review the group means and the results of the Tukey tests to determine which pairs of groups are significantly different from each other. Do this in the same way that you would if you had conducted a one-way ANOVA. One of the MEANS statements in your program requested means and standard deviations, broken down by the levels of the predictor variables. The output created by that statement appeared earlier as Output 16.3. An excerpt of the same output is reproduced here as Output 16.5.
                              JANE DOE                               3
                          The GLM Procedure

        Level of           -----------SUB_AGGR----------
        CONSEQ       N         Mean           Std Dev
        MP          15     13.4000000      7.28795484
        MR          15     19.4666667      7.31794923

        Level of           -----------SUB_AGGR----------
        MOD_AGGR     N         Mean           Std Dev
        H           10     21.5000000      4.83620604
        L           10      7.6000000      4.59951688
        M           10     20.2000000      4.58984386
Output 16.5. Means and standard deviations for conditions under Predictor A (level of aggression displayed by the model); two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
Means and standard deviations for Predictor A appear in the section headed "Level of MOD_AGGR." Below this heading, the row identified with the value "H" provides information for the high-model-aggression group, the row identified with the value "L" provides information for the low-model-aggression group, and the row identified with the value "M" provides information for the moderate-model-aggression group.

Below the heading "Mean" you can see the means for the three treatment conditions. These results show that the high-model-aggression group (identified with an "H") displayed a mean of 21.5 (meaning that these children displayed an average of 21.5 aggressive acts after viewing the videotape). The moderate-model-aggression group (identified with an "M") displayed a mean of 20.2, and the low-model-aggression group (identified with an "L") displayed a mean of 7.6.

On the surface, it appears that the "low" condition scored substantially lower than the other two conditions on the criterion variable. But will the differences be statistically significant? To find out, you must review the results of the Tukey HSD multiple comparison procedure. Output 16.6 provides the results of this procedure for Predictor A.
                              JANE DOE                               5
                          The GLM Procedure
           Tukey's Studentized Range (HSD) Test for SUB_AGGR
   NOTE: This test controls the Type I experimentwise error rate.

           Alpha                                       0.05
           Error Degrees of Freedom                      24
           Error Mean Square                          13.05
           Critical Value of Studentized Range      3.53170
           Minimum Significant Difference            4.0345

   Comparisons significant at the 0.05 level are indicated by ***.

                  Difference        Simultaneous 95%
   MOD_AGGR        Between          Confidence Limits
   Comparison       Means          Lower        Upper
   H - M            1.300         -2.734        5.334
   H - L           13.900          9.866       17.934    ***
   M - H           -1.300         -5.334        2.734
   M - L           12.600          8.566       16.634    ***
   L - H          -13.900        -17.934       -9.866    ***
   L - M          -12.600        -16.634       -8.566    ***
Output 16.6. Results of Tukey HSD test for Predictor A (level of aggression displayed by the model); two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
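As a side note, the "Minimum Significant Difference" of 4.0345 in Output 16.6 is not mysterious; it follows from the other values on the same page. For the Tukey HSD test with equal group sizes, the smallest difference between two means that will be declared significant is computed with the standard formula shown below (presented here as a check on the output, not as something you must calculate yourself):

   Minimum Significant Difference = q × sqrt(MSE / n)
                                  = 3.53170 × sqrt(13.05 / 10)
                                  = 3.53170 × 1.1424
                                  = 4.0345

where q is the critical value of the studentized range, MSE is the error mean square, and n = 10 is the number of subjects in each condition under Predictor A. Any two means that differ by more than 4.0345 will be flagged with asterisks.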
This section will present a brief review of how to interpret the results of the Tukey tests and how to prepare a table to display the confidence intervals for the differences between the means. All of these concepts were explained in much more detail in Chapter 15, "One-Way ANOVA with One Between-Subjects Factor." If you need to refresh your memory on how to interpret the output that presents the Tukey tests and confidence intervals, go to Chapter 15 and reread "Example 15.1: One-Way ANOVA Revealing a Significant Treatment Effect," especially items 5–7 in the subsection "Steps in Interpreting the Output."

In Output 16.6, an entry at the top of the page tells you that the criterion variable in the analysis was SUB_AGGR. Remember that SUB_AGGR represents the number of aggressive acts displayed by the subject. Lower on the page, the heading "MOD_AGGR Comparison" indicates that the predictor variable in this analysis is MOD_AGGR (the level of aggression displayed by the model). Below this heading are a number of entries such as "H - M," "H - L," and so on. These entries tell you which conditions are being compared. You will remember that you coded your data so that "H" would represent the high-model-aggression condition, "M" would represent the moderate-model-aggression condition, and "L" would represent the low-model-aggression condition. The same values appear in this section.

The first row is identified by the entry "H - M." This means that this row provides information about the comparison between the high-model-aggression condition versus the moderate-model-aggression condition. Below the heading "Difference Between Means," you can see that the difference between means for these two conditions is 1.300. The
order in which the values "H" and "M" appear in the entry "H - M" tells you how this difference was computed: SAS started with the mean for the high-model-aggression condition, and subtracted from it the mean for the moderate-model-aggression condition. The resulting difference was 1.300.

The next two columns are headed "Simultaneous 95% Confidence Limits." This section provides the 95% confidence interval for the difference between the means. Output 16.6 shows that the 95% confidence interval for the current difference extends from -2.734 to 5.334. This means that, although you are not sure what the actual difference between the means is (in the population), you estimate that there is a 95% probability that it is somewhere between -2.734 and 5.334. Notice that this confidence interval contains the value of zero. In general, when a confidence interval contains the value of zero, it means that the difference between the two conditions is not statistically significant.

A note midway down Output 16.6 says "Comparisons significant at the 0.05 level are indicated by ***." This means that three asterisks will be used to flag any comparisons that are significant at p < .05. These asterisks appear on the right side of the table at the bottom of Output 16.6. You can see that no asterisks appear on the right side of the row for the entry "H - M." This means that, according to the Tukey HSD test, the difference between the high-model-aggression condition versus the moderate-model-aggression condition is not statistically significant.

The row identified with "H - L" provides information about the comparison between the high-model-aggression condition versus the low-model-aggression condition. Information in this row shows that the difference between means for these two conditions is 13.900, and that the 95% confidence interval ranges from 9.866 to 17.934. Three asterisks ("***") appear on the right side of this row, indicating that the difference between the high condition and the low condition is in fact significant at the .05 level.

Finally, the fourth row down is identified with "M - L," meaning that this row provides information about the comparison between the moderate-model-aggression condition versus the low-model-aggression condition. Information in this row shows that the difference between means for these two conditions is 12.600, and that the 95% confidence interval ranges from 8.566 to 16.634. Again, three asterisks appear on the right side of the row, indicating that the difference between the moderate condition and the low condition is also significant at the .05 level.

Notice that there are three additional rows of information at the bottom of Output 16.6. It is not necessary to interpret them here, because these rows provide information that is already provided by the three other rows. For example, the third row down is identified by "M - H." This row provides the same information as was provided by the row labeled "H - M" (discussed above). The other two rows provide similarly redundant information.

Because there is so much information provided in Output 16.6, it is useful to summarize the most important information in a table:
Table 16.3
Results of Tukey Tests for Predictor A: Comparing High-Model-Aggression Group versus Moderate-Model-Aggression Group versus Low-Model-Aggression Group on the Criterion Variable (Subject Aggression)
________________________________________________________
                  Difference      Simultaneous 95%
                   between        confidence limits
Comparison          means a       Lower        Upper
________________________________________________________
High - Moderate     1.300         -2.734       5.334
High - Low         13.900 *        9.866      17.934
Moderate - Low     12.600 *        8.566      16.634
________________________________________________________
Note: N = 30
a Differences are computed by subtracting the mean for the second group from the mean for the first group.
* Tukey test indicates that the difference between the means is significant at p < .05.
Table 16.3 summarizes the results of the multiple comparison procedure. The asterisks in the table show that (a) the high-model-aggression condition scored significantly higher than the low-model-aggression condition, (b) the moderate-model-aggression condition also scored significantly higher than the low-model-aggression condition, but that (c) the difference between the high condition versus the moderate condition was not statistically significant. Chapter 15 provides detailed instructions on how to take information from SAS output such as Output 16.6, and use it to create a table such as Table 16.3. In particular, see the subsection titled "7. Prepare a table that presents the results of the Tukey tests and the confidence intervals."

6. Review the sample means and the confidence interval for Predictor B. Predictor B in your study was "consequences for the model." Output 16.5 displayed means on the criterion variable broken down by levels of this predictor variable. For convenience, that output is presented again as Output 16.7.
                              JANE DOE                               3
                          The GLM Procedure

        Level of           -----------SUB_AGGR----------
        CONSEQ       N         Mean           Std Dev
        MP          15     13.4000000      7.28795484
        MR          15     19.4666667      7.31794923

        Level of           -----------SUB_AGGR----------
        MOD_AGGR     N         Mean           Std Dev
        H           10     21.5000000      4.83620604
        L           10      7.6000000      4.59951688
        M           10     20.2000000      4.58984386
Output 16.7. Means for conditions under Predictor B (consequences for the model); two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
Statistics related to Predictor B appear in the section headed "Level of CONSEQ." In the section headed "Mean," you can see that the subjects in the model-punished condition (identified by the value "MP") displayed an average of 13.4 aggressive acts, while the subjects in the model-rewarded condition (identified by the value "MR") displayed an average of approximately 19.47 aggressive acts.

You already know that the difference between these two means is statistically significant, because you have already determined that the main effect for Predictor B (CONSEQ) from the ANOVA was statistically significant (see Output 16.4 and Table 16.2). From one perspective, it is not really necessary for you to review the results of the Tukey HSD test to determine whether there is a significant difference between these two conditions; you need to refer to the Tukey test only when the predictor variable has three or more conditions.

Although you do not need to review the results of the Tukey test to determine whether there is a significant difference between the two conditions under Predictor B, there is a different reason that you should review these results. In addition to printing the results of a significance test, the Tukey procedure also prints a confidence interval for the difference between the means (as long as you have included the CLDIFF option in the MEANS statement). Therefore, the results of the Tukey test for Predictor B are reproduced here as Output 16.8.
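For reference, the statement that requests these confidence limits is the second MEANS statement from the program for this analysis (presented earlier in the section "Writing the SAS Program"). The excerpt below repeats that statement in context; the TUKEY and CLDIFF options are the point of interest:

PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   /* TUKEY requests the HSD test; CLDIFF prints confidence
      limits for the differences between means; ALPHA=0.05
      sets the significance level (and 95% confidence level) */
   MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
RUN;
QUIT;

If you omit CLDIFF, PROC GLM reports the Tukey results in a different format that does not include these confidence intervals.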
                              JANE DOE                               4
                          The GLM Procedure
           Tukey's Studentized Range (HSD) Test for SUB_AGGR
   NOTE: This test controls the Type I experimentwise error rate.

           Alpha                                       0.05
           Error Degrees of Freedom                      24
           Error Mean Square                          13.05
           Critical Value of Studentized Range      2.91880
           Minimum Significant Difference            2.7225

   Comparisons significant at the 0.05 level are indicated by ***.

                  Difference        Simultaneous 95%
   CONSEQ          Between          Confidence Limits
   Comparison       Means          Lower        Upper
   MR - MP          6.067          3.344        8.789    ***
   MP - MR         -6.067         -8.789       -3.344    ***
Output 16.8. Confidence interval for Predictor B (consequences for the model); two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
A note at the top of Output 16.8 tells you that the criterion variable being analyzed is SUB_AGGR, the number of aggressive acts displayed by the subjects. A note toward the bottom of the page tells you that the predictor variable is CONSEQ, the consequences for the model.

The first row in the bottom section contains the entry "MR - MP." This entry indicates that this row provides information about the comparison in which the mean of the model-punished condition (MP) is subtracted from the mean of the model-rewarded condition (MR). In the column headed "Difference Between Means," you can see that the difference between these two means is equal to 6.067. In the column headed "Simultaneous 95% Confidence Limits" you can see that the 95% confidence interval for this difference goes from 3.344 to 8.789. There is a second row of information at the bottom of this output (the "MP - MR" row), but it is not necessary to interpret it, since it provides essentially redundant information.

The confidence interval from Output 16.8 tells you that, although you do not know what the exact difference between the means is (in the population), you estimate that there is a 95% probability that the actual difference is between 3.344 and 8.789. (Many research journals will expect you to report this confidence interval in your published article.)

With Predictor A (discussed in an earlier section), you presented your confidence intervals in a table, specifically Table 16.3. This was because Predictor A included three treatment conditions, and this meant that you had three confidence intervals to present. In relatively complicated situations such as that, it is often best to present your confidence intervals in the form of a table.
However, Predictor B included only two treatment conditions, meaning that there is only one confidence interval to present. In relatively simple situations such as this, it will typically not be necessary to prepare a table to present the confidence interval. Because there is relatively little information to present, you can simply describe the confidence interval in the text of your paper. Following is an example of how this might be done for Predictor B, using the information from Output 16.8:

Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789.

Using a Figure to Illustrate the Results

The results of a factorial ANOVA are easiest to understand when they are represented in a figure that plots the means for each of the cells in the study's factorial design. The factorial design used in the current study involved a total of six cells (this design was presented in Figure 16.1). The mean scores for these six cells are presented on page 3 of the SAS output that is produced by the current analysis. Output page 3 is reproduced here as Output 16.9.
                              JANE DOE                               3
                          The GLM Procedure

        Level of                  -----------SUB_AGGR----------
        CONSEQ              N         Mean           Std Dev
        MP                 15     13.4000000      7.28795484
        MR                 15     19.4666667      7.31794923

        Level of                  -----------SUB_AGGR----------
        MOD_AGGR            N         Mean           Std Dev
        H                  10     21.5000000      4.83620604
        L                  10      7.6000000      4.59951688
        M                  10     20.2000000      4.58984386

        Level of  Level of        -----------SUB_AGGR----------
        CONSEQ    MOD_AGGR    N       Mean           Std Dev
        MP        H           5   18.2000000      3.63318042
        MP        L           5    4.6000000      3.84707681
        MP        M           5   17.4000000      3.50713558
        MR        H           5   24.8000000      3.49284984
        MR        L           5   10.6000000      3.20936131
        MR        M           5   23.0000000      3.93700394
Output 16.9. The means that are needed to plot the results of the study in a figure.
Means for conditions under Predictor B. There are actually three tables of means on page 3 of this output. The first table provides the means for each level of CONSEQ, the SAS variable that coded membership under the consequences-for-model predictor variable. Below the heading CONSEQ are the values that represented the two levels of this variable.
To the right of "MP," you can see that the mean for the model-punished group was 13.4. To the right of "MR," you can see that the mean for the model-rewarded group was 19.47.

Means for conditions under Predictor A. The second table provides the means for each level of MOD_AGGR, the SAS variable that represents the model-aggression predictor. Below the heading MOD_AGGR are the values for the three levels of this predictor variable (H, M, and L), which represent the high-model-aggression condition, moderate-model-aggression condition, and low-model-aggression condition, respectively. To the right of these values are the mean scores displayed by subjects in these conditions.

Means for the individual cells. Finally, the third table is the most interesting, as it provides means and standard deviations for each of the six cells from the study's factorial design matrix. This table shows means and standard deviations for groups of subjects broken down by Predictor A and Predictor B. You can see that this third table has a column on the far left headed "Level of CONSEQ." Below this heading are values that represent the conditions under Predictor B (MP and MR). Next to this column is a column headed "Level of MOD_AGGR." Below this heading are the values that represent the conditions under Predictor A (H, L, and M).

The first line of this third table provides the mean score displayed by the cell that was coded with an "MP" under CONSEQ and an "H" under MOD_AGGR. This line thus provides the mean score for the five subjects who were in the "model-punished" condition under Predictor B (consequences for the model), and in the "high" condition under Predictor A (level of aggression displayed by the model). In other words, these subjects saw the version of the videotape in which the model demonstrated a high level of aggression and was subsequently punished. You can see that this group had a mean of 18.2 on the criterion variable, and a standard deviation of 3.63.

The second line from this table provides the mean score displayed by the cell that was coded with an "MP" under CONSEQ and an "L" under MOD_AGGR. This line therefore provides the mean score for the five subjects who were in the "model-punished" condition under Predictor B, and in the "low" condition under Predictor A. You can see that this group had a mean on the criterion variable of 4.6 and a standard deviation of 3.85. You can follow the same procedure to review the means and standard deviations for the other four groups who participated in the study.

Plotting cell means in a graph. It can be difficult to understand the results of a factorial ANOVA by simply reviewing cell means from a table, such as the third table in Output 16.9. These results are typically easier to understand if the means are plotted in a graph. Therefore, the cell means from Output 16.9 have been plotted in Figure 16.11.
Figure 16.11. Mean number of aggressive acts as a function of the level of aggression displayed by the model and the consequences for the model (significant main effects for both Predictor A and Predictor B; nonsignificant interaction).
The vertical axis in Figure 16.11 is labeled "Subject Aggressive Acts," which was the criterion variable in the study. It is on this axis that you will plot the mean number of aggressive acts that were displayed by the various groups of children after viewing the videotape. The two different lines that appear in Figure 16.11 represent the two conditions under Predictor B (consequences for the model). The solid line represents mean scores for the children in the model-rewarded group, and the broken line represents mean scores for the model-punished group.

The horizontal axis of the graph labels the three conditions under Predictor A (level of aggression displayed by the model). For example, directly above the label "Low," you see the mean scores displayed by subjects in the low-model-aggression condition. Notice that two mean scores are plotted: A small circle plots the mean aggression scores from subjects who were in the "low-model-aggression/model-rewarded" cell, and a small triangle plots mean aggression scores from subjects who were in the "low-model-aggression/model-punished" cell. The same system was used to plot mean scores for subjects in the remaining cells.
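Incidentally, if you would rather have SAS draw a plot like Figure 16.11 than draw it by hand (the hand-drawing steps follow in the next section), something like the statements below should work. This is only a hedged sketch: PROC SGPLOT is a newer procedure than the ones featured in this guide, so the sketch assumes a reasonably current release of SAS. The VLINE statement plots the mean of SUB_AGGR at each level of MOD_AGGR, with a separate line for each level of CONSEQ.

PROC SGPLOT DATA=D1;
   /* One line per CONSEQ group; each plotted point is the mean
      of SUB_AGGR for that level of MOD_AGGR                    */
   VLINE MOD_AGGR / RESPONSE=SUB_AGGR STAT=MEAN
                    GROUP=CONSEQ MARKERS;
RUN;

The resulting plot will not match Figure 16.11 exactly (for example, the levels of MOD_AGGR appear in data order rather than low-moderate-high order unless you re-sort or format them), but it displays the same six cell means.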
Steps in Preparing the Graph

Overview. Preparing a graph that plots the results of a factorial ANOVA (such as Figure 16.11) can be a confusing undertaking. For this reason, this section presents structured, step-by-step guidelines that should make the task easier. The following guidelines describe how to begin with the cell means from Output 16.9, and transform them into the figure represented in Figure 16.11.

1. Labeling the vertical axis. Label the vertical axis with the name of the criterion variable. In this case, the criterion variable is "Subject Aggressive Acts." On this same axis, provide midpoints for some of the scores that were possible. In Figure 16.11, this was done by using the midpoints of 5, 10, 15, and so forth.

2. Labeling the horizontal axis. Label the horizontal axis with the name of Predictor A. In this case, Predictor A was "Level of Aggression Displayed by Model." On the same axis, provide one midpoint for each level of this predictor variable, and label these midpoints. Here, this was done by creating three midpoints labeled "Low," "Moderate," and "High."

3. Drawing a solid line to represent the model-rewarded condition. Graphs such as the one in Figure 16.11 are easiest to draw if you draw just one line at a time. Here, you will begin by drawing a solid line to represent the mean scores of subjects in the model-rewarded condition. Draw small circles on the graph to indicate the mean subject aggression scores of just those subjects in the model-rewarded condition under Predictor B.

First, plot the mean of the subjects who were in the model-rewarded condition under Predictor B, and were also in the low-model-aggression condition under Predictor A. Go to the cell means provided in Output 16.9, and find the entry for the group that is coded with an "MR" under CONSEQ and is also coded with an "L" under MOD_AGGR. It turns out that this is the next-to-last entry in the table, and the mean subject aggression score for this subgroup is 10.6. Therefore, draw a small circle where "Low" on the horizontal axis intersects with 10.6 on the "Subject Aggressive Acts" vertical axis.

Next, plot the mean for the subjects who were in the model-rewarded condition under Predictor B, and in the moderate-model-aggression condition under Predictor A. In Output 16.9, find the entry for the subgroup that is coded with an "MR" under CONSEQ and an "M" under MOD_AGGR. This is the last entry in the table, and their mean subject aggression score was 23.0. Therefore, draw a small circle where "Moderate" on the horizontal axis intersects with 23.0 on the "Subject Aggressive Acts" vertical axis.

Next, plot the mean for the subjects who were in the model-rewarded condition under Predictor B, and the high-model-aggression condition under Predictor A. In Output 16.9, find the entry for the subgroup that is coded with an "MR" under CONSEQ and an "H" under MOD_AGGR. This entry in the table shows that the mean subject aggression score for this group was 24.8. Therefore, draw a small circle where "High" intersects with 24.8.
As a final step, draw a solid line connecting these three circles. This solid line will help readers to see that the circles all represent the same condition under Predictor B (the model-rewarded condition).

4. Drawing a broken line to represent the model-punished condition. Now repeat this procedure, except that this time you will draw small triangles to represent the scores of the subjects who were in the model-punished condition under Predictor B. Remember that these are the subgroups coded with an "MP" under "Level of CONSEQ" in Output 16.9. Output 16.9 shows that the mean for the model-punished/low-model-aggression group is 4.6. To represent this score, draw a small triangle above the "Low" midpoint on the horizontal axis, to the right of 4.6 on the vertical axis. The mean for the model-punished/moderate-model-aggression subgroup is 17.4, and the mean for the model-punished/high-model-aggression subgroup is 18.2. Note where triangles were drawn on the figure to represent these means. After all three triangles are in place, they are connected with a broken line. With this done, the figure is now complete.

Using this system, you will know that whenever you see a solid line connecting circles, you are looking at mean subject aggression scores from subjects in the model-rewarded group, and whenever you see a broken line connecting triangles, you are looking at mean subject aggression scores from the model-punished group.

Interpreting Figure 16.11

Notice how the graphic pattern of results in Figure 16.11 is consistent with the statistical results reported in Output 16.4. Specifically:
• In Figure 16.11, the corresponding line segments are parallel to one another. That is, the segments going from "Low" to "Moderate" are parallel, and the segments going from "Moderate" to "High" are also parallel. This is the "railroad track" pattern that you would expect when the interaction is nonsignificant (as was reported in Output 16.4).

• The line segments going from "Low" to "Moderate" in Figure 16.11 display a marked upward angle, or slope. This is consistent with the fact that Output 16.4 reported a significant main effect for Predictor A (level of aggression displayed by the model). However, notice that there is very little angle going from "Moderate" to "High." This outcome is consistent with the results of the Tukey HSD test (from Output 16.6), which indicated that the low-model-aggression group was significantly different from the moderate- and high-model-aggression groups, but that the moderate- and high-model-aggression groups were not significantly different from each other.

• Finally, you can see that the solid line for the model-rewarded group is separated from the broken line for the model-punished group. This shows that, generally speaking, the children who saw the model rewarded in the videotape tended to be more aggressive themselves, compared to children in the model-punished condition. This is consistent with the finding from Output 16.4 that there was a main effect for Predictor B (consequences for the model).
Preparing Analysis Reports for Factorial ANOVA: Overview

To summarize the results of a two-way ANOVA, you will use a format very similar to the one used with the one-way ANOVA in the preceding chapter. However, in the present case it is possible to prepare three different analysis reports from a single analysis, since it is possible to test three different null hypotheses with a two-way ANOVA:

• the hypothesis of no main effect for Predictor A (level of aggression displayed by the model)

• the hypothesis of no main effect for Predictor B (consequences for the model)

• the hypothesis of no interaction.
The following sections show how to prepare analysis reports for these three types of effects, based on the current results.

Analysis Report Concerning the Main Effect for Predictor A (Significant Effect)

This section shows how to prepare a report concerning Predictor A, the level of aggression displayed by the model. You will remember that the main effect for Predictor A was significant in this analysis. The section that follows explains where you have to look in the SAS output to find some of the statistics that have been inserted in the report.

A) Statement of the research question: One purpose of this study was to determine whether there was a relationship between (a) the level of aggression displayed by a model and (b) the number of aggressive acts later demonstrated by children who observed the model.

B) Statement of the research hypothesis: There will be a positive relationship between the level of aggression displayed by a model, and the number of aggressive acts later demonstrated by children who observed the model. Specifically, it is predicted that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate or low level of aggression, and (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression.

C) Nature of the variables: This analysis involved two predictor variables and one criterion variable:

• Predictor A was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high. This was the predictor variable that was relevant to the present hypothesis.
• Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and included two levels: model-rewarded and model-punished.

• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: Factorial ANOVA with two between-subjects factors

E) Statistical null hypothesis (H0): µA1 = µA2 = µA3; In the population, there is no difference between subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): Not all µA are equal; In the population, there is a difference between at least two of the following three conditions with respect to their mean scores on the criterion variable: subjects in the low-model-aggression condition, subjects in the moderate-model-aggression condition, and subjects in the high-model-aggression condition.

G) Obtained statistic: F(2, 24) = 45.17

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Multiple comparison procedure: Tukey's HSD test showed that subjects in the high-model-aggression condition and moderate-model-aggression condition scored significantly higher on subject aggression than did subjects in the low-model-aggression condition (p < .05). With alpha set at .05, the difference between the high-model-aggression condition versus the moderate-model-aggression condition was nonsignificant.

K) Confidence intervals: Confidence intervals for differences between the means are presented in Table 16.3.

L) Effect size: R² = .67, indicating that model aggression accounted for 67% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings provide partial support for the study's research hypothesis. The findings provided support for the hypothesis that (a) children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression, as well as for the hypothesis that (b) children who witness a moderate level of aggression will demonstrate a greater number of aggressive acts than children who witness a low level of aggression. However, the study failed to provide support for the hypothesis that children who witness a high level of aggression will demonstrate a greater number of aggressive acts than children who witness a moderate level of aggression.

N) Formal description of the results for a paper:

Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for level of aggression displayed by the model, F(2, 24) = 45.17, MSE = 13.05, p = .0001. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 21.50 (SD = 4.84), the mean for the moderate-model-aggression condition was 20.20 (SD = 4.59), and the mean for the low-model-aggression condition was 7.60 (SD = 4.60). Sample means for the various conditions that constituted the study are displayed in Figure 16.11. Tukey's HSD test showed that subjects in the high-model-aggression condition and moderate-model-aggression condition scored significantly higher on subject aggression than did subjects in the low-model-aggression condition (p < .05). With alpha set at .05, the difference between the high-model-aggression condition versus the moderate-model-aggression condition was nonsignificant. Confidence intervals for differences between the means are presented in Table 16.3. In the analysis, R² for this main effect was computed as .67. This indicated that model aggression accounted for 67% of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.
Notes Regarding the Preceding Analysis Report

Overview. With some sections of the preceding report, it might not be clear where you needed to look in your SAS output to find the necessary statistics to insert at that location. This section will attempt to clarify some of those issues. First, the main ANOVA summary table produced by PROC GLM is reproduced here as Output 16.10. Much of the information needed for your report appears on this page of the output.
                              JANE DOE                               2
                          The GLM Procedure
Dependent Variable: SUB_AGGR
                                Sum of
Source             DF          Squares     Mean Square   F Value   Pr > F
Model               5      1456.166667      291.233333     22.32   <.0001
Error              24       313.200000       13.050000
Corrected Total    29      1769.366667

    R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
    0.822988      21.98263      3.612478         16.43333

Source             DF        Type I SS     Mean Square   F Value   Pr > F
CONSEQ              1       276.033333      276.033333     21.15   0.0001
MOD_AGGR            2      1178.866667      589.433333     45.17   <.0001
CONSEQ*MOD_AGGR     2         1.266667        0.633333      0.05   0.9527

Source             DF      Type III SS     Mean Square   F Value   Pr > F
CONSEQ              1       276.033333      276.033333     21.15   0.0001
MOD_AGGR            2      1178.866667      589.433333     45.17   <.0001
CONSEQ*MOD_AGGR     2         1.266667        0.633333      0.05   0.9527
Output 16.10. Information needed from ANOVA summary table to complete an analysis report; two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
F statistic, degrees of freedom, and p value. Items G and H from the preceding analysis report are reproduced here:

G) Obtained statistic: F(2, 24) = 45.17

H) Obtained probability (p) value: p = .0001

You can see that Items G and H provide the F statistic, the degrees of freedom for this statistic, and the p value associated with this statistic. It is useful to review where these terms can be found in the SAS output. The F statistic of 45.17 that appears in Item G of the preceding report is the F statistic associated with Predictor A, the level of aggression displayed by the model. It can be found in Output 16.10 where the row labeled "MOD_AGGR" intersects with the column headed "F Value."

It is customary to list the degrees of freedom for an F statistic within parentheses. In the preceding report, these degrees of freedom were listed in Item G as "F(2, 24)." The first term (2) represents the degrees of freedom for the numerator in the F ratio (i.e., the degrees of freedom for the model aggression main effect). This term appears in Output 16.10 where the row labeled "MOD_AGGR" intersects with the column headed "DF." The second term (24) represents the degrees of freedom for the denominator in the F ratio (i.e., the degrees of freedom for the error term). This term appears in Output 16.10 where the row labeled "Error" intersects with the column headed "DF."

The p value listed in Item H of the preceding report is the p value associated with the model aggression main effect. It appears in Output 16.10 where the row labeled "MOD_AGGR" intersects with the column headed "Pr > F."
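If you want to verify these degrees of freedom rather than just copy them from the output, they follow directly from the design. For a factorial ANOVA with a levels of Predictor A, b levels of Predictor B, and N subjects in total (here a = 3, b = 2, and N = 30), the standard formulas give:

   df for A           = a - 1          = 3 - 1   = 2
   df for B           = b - 1          = 2 - 1   = 1
   df for A × B       = (a - 1)(b - 1) = 2 × 1   = 2
   df for error       = N - ab         = 30 - 6  = 24
   df corrected total = N - 1          = 30 - 1  = 29

These values match the "DF" column of Output 16.10.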
The MSE (mean square error). The mean square error is an estimate of the error variance in your analysis. Item N from the analysis report provides a formal description of the results for a paper. The first paragraph of this report includes a statistic abbreviated as "MSE." The relevant section of Item N is reproduced as follows:

N) Formal description of the results for a paper:

Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for level of aggression displayed by the model, F(2, 24) = 45.17, MSE = 13.05, p = .0001.
The last sentence of the preceding excerpt indicates that "MSE = 13.05." In the output from PROC GLM, you will find this value at the location where the row labeled "Error" intersects with the column headed "Mean Square." In Output 16.10, you can see that the mean square error is equal to 13.050000, which rounds to 13.05.

Analysis Report Regarding the Main Effect for Predictor B (Significant Effect)

This section shows how to prepare a report about Predictor B, "consequences for the model." You will remember that the main effect for Predictor B was significant in this analysis. There are some differences between the way that this report was prepared, versus the way that the report for Predictor A was prepared. See the notes following this analysis report for details.

A) Statement of the research question: One purpose of this study was to determine whether there was a relationship between (a) the consequences experienced by an aggressive model and (b) the number of aggressive acts later demonstrated by children who had observed this model.

B) Statement of the research hypothesis: Children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.

C) Nature of the variables: This analysis involved two predictor variables and one criterion variable:

• Predictor A was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.

• Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and
included two levels: model-rewarded and model-punished. This was the predictor variable that was relevant to the present hypothesis.

• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: Factorial ANOVA with two between-subjects factors

E) Statistical null hypothesis (H0): µB1 = µB2; In the population, there is no difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): µB1 ≠ µB2; In the population, there is a difference between subjects in the model-rewarded condition versus subjects in the model-punished condition with respect to their mean scores on the criterion variable (the number of aggressive acts displayed by the subjects).

G) Obtained statistic: F(1, 24) = 21.15

H) Obtained probability (p) value: p = .0001

I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Multiple comparison procedure: A multiple comparison procedure was not necessary because this predictor variable included just two levels.

K) Confidence interval: Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789.

L) Effect size: R² = .16, indicating that consequences for the model accounted for 16% of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings provide support for the study's research hypothesis that children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.
N) Formal description of the results for a paper:

Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant main effect for consequences for the model, F(1, 24) = 21.15, MSE = 13.05, p = .0001. On the criterion variable (number of aggressive acts displayed by the subjects), the mean score for the model-rewarded condition was 19.47 (SD = 7.32), and the mean score for the model-punished condition was 13.40 (SD = 7.29). Sample means for the various conditions that constituted the study are displayed in Figure 16.11. Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 6.067. The 95% confidence interval for this difference ranged from 3.344 to 8.789. In the analysis, R² for this main effect was computed as .16. This indicated that consequences for the model accounted for 16% of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.
Notes Regarding the Preceding Analysis Report

Overview. Many sections of the analysis report for Predictor B were prepared using the same format that was used for the analysis report for Predictor A. Therefore, to conserve space, this section will not repeat explanations that were provided following the analysis report for Predictor A (such as the explanation of where to look in the SAS output for the MSE statistic). Instead, this section will discuss the ways in which the report for Predictor B differed from the report for Predictor A.

Some sections of the analysis report for Predictor B were completed in a way that differed from the report for Predictor A. In most cases, this was because Predictor B consisted of only two levels (model-rewarded versus model-punished), while Predictor A consisted of three levels (low versus moderate versus high). Because Predictor B involved two levels, several sections of its analysis report used a format similar to that used for an independent-samples t test. To refresh your memory on how reports were prepared for that statistic, see Chapter 13, "Independent-Samples t Test," the section "Example 13.1: Observed Consequences for Modeled Aggression: Effects on Subsequent Subject Aggression (Significant Differences)," in the subsection "Summarizing the Results of the Analysis."

Information about the main effect for Predictor B. Much of the information about the main effect for Predictor B came from Output 16.10, in the section for the Type III sum of squares, in the row labeled CONSEQ. This includes the F statistic and the p value, as well as other information.

Item F, the statistical alternative hypothesis. For Predictor B, the statistical alternative hypothesis was stated in symbolic terms as follows:

   µB1 ≠ µB2

The above format is similar to the format used with an independent-samples t test. Remember that this format is appropriate only for predictor variables that contain two levels.

Item G, the obtained statistic. Item G from the preceding analysis report presented the F statistic and degrees of freedom for Predictor B (CONSEQ). This item is reproduced again here:

G) Obtained statistic: F(1, 24) = 21.15

The degrees of freedom for this main effect appear in parentheses above. The number 1 is the degree of freedom for the numerator used in computing the F statistic for the CONSEQ main effect. This "1" comes from the ANOVA summary table that was created by PROC GLM, and was reproduced in Output 16.10, presented earlier. Specifically, this "1" appears in Output 16.10 at the location where the row labeled "CONSEQ" intersects with the column headed "DF." The second number in Item G, 24, represents the degrees of freedom for the denominator used in computing the F statistic. This "24" appears in Output 16.10 at the location where the row labeled "Error" intersects with the column headed "DF."

Item K, the confidence interval. Notice that, in the analysis report for Predictor B, Item K does not refer the reader to a table that presents the confidence intervals (as was the case with Predictor A). Instead, because there is only one confidence interval to report, it is described in the text of the report itself.

Item N, the formal description of the results for a paper. Item N of the preceding report provides means and standard deviations for the model-rewarded and model-punished conditions under Predictor B. These statistics may be found in Output 16.9, in the section for the CONSEQ variable. The relevant portion of that output is presented again as Output 16.11.
                              JANE DOE                               3
                          The GLM Procedure

        Level of           -----------SUB_AGGR----------
        CONSEQ       N         Mean           Std Dev
        MP          15     13.4000000      7.28795484
        MR          15     19.4666667      7.31794923
Output 16.11. Means and standard deviations for conditions under Predictor B (consequences for model).
Analysis Report Concerning the Interaction (Nonsignificant Effect)

An earlier section stated that an interaction is an outcome for which the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable. Output 16.10 showed that the interaction term for the current analysis was nonsignificant. Below is an example of how this result could be described in a report.

A) Statement of the research question: One purpose of this study was to determine whether there was a significant interaction between (a) the level of aggression displayed by the model, and (b) the consequences experienced by the model in the prediction of (c) the number of aggressive acts later displayed by subjects who have observed the model.

B) Statement of the research hypothesis: The positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.

C) Nature of the variables: This analysis involved two predictor variables and one criterion variable:

• Predictor A was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.

• Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and included two levels: model-rewarded and model-punished.

• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.

D) Statistical test: Factorial ANOVA with two between-subjects factors
E) Statistical null hypothesis (H0): In the population, there is no interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects).

F) Statistical alternative hypothesis (H1): In the population, there is an interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects).

G) Obtained statistic: F(2, 24) = 0.05

H) Obtained probability (p) value: p = .9527

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Multiple comparison procedure: Not relevant.

K) Confidence intervals: Not relevant.

L) Effect size: R² = .00, indicating that the interaction term accounted for none of the variance in subject aggression.

M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis that the positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.

N) Formal description of the results for a paper:

Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a nonsignificant F statistic for the interaction between the level of aggression displayed by the model and the consequences for the model, F(2, 24) = 0.05, MSE = 13.05, p = .9527. Sample means for the various conditions that constituted the study are displayed in Figure 16.11. In the analysis, R² for this interaction effect was computed as .00. This indicated that the interaction accounted for none of the variance in subject aggression.

O) Figure representing the results: See Figure 16.11.
Notes Regarding the Preceding Analysis Report

Item B, statement of the research hypothesis. Item B (above) not only predicts that there will be an interaction, but also makes a specific prediction about the nature of that interaction. The type of interaction described here was chosen arbitrarily; remember that an interaction can be expressed in an infinite variety of forms.

Items G and H. The degrees of freedom for the interaction term, along with the F statistic and p value for the interaction term, appeared in Output 16.10. The relevant portion of that output is presented again here as Output 16.12. The information relevant to the interaction appears to the right of the heading "CONSEQ*MOD_AGGR."

   Source             DF      Type III SS     Mean Square   F Value   Pr > F
   CONSEQ              1       276.033333      276.033333     21.15   0.0001
   MOD_AGGR            2      1178.866667      589.433333     45.17   <.0001
   CONSEQ*MOD_AGGR     2         1.266667        0.633333      0.05   0.9527
Output 16.12. Excerpt from ANOVA summary table; two-way ANOVA performed on aggression data, significant main effects, nonsignificant interaction.
Item G from the preceding report is again reproduced below:

G) Obtained statistic: F(2, 24) = 0.05

The degrees of freedom for the F statistic appear within parentheses. The first degree of freedom is "2," and this value appears in the "DF" column of Output 16.12. The F statistic itself appears in the column headed "F Value," and the probability value for the F statistic appears in the column headed "Pr > F." The second number in Item G, "24," again represents the degrees of freedom for the denominator of the F ratio. Again, these degrees of freedom may be found in Output 16.10 at the location where the row titled "Error" intersects with the column titled "DF."
Example of a Factorial ANOVA Revealing Nonsignificant Main Effects and a Nonsignificant Interaction

Overview

This section presents the results of a factorial ANOVA in which the main effects for Predictor A and Predictor B, along with the interaction term, are all nonsignificant. These results are presented so that you will be prepared to write analysis reports for projects in which nonsignificant outcomes are observed.
The Complete SAS Program

The study presented here is the same aggression study described in the preceding section. The data will be analyzed with the same SAS program that was presented earlier, in the section "Writing the SAS Program." Here, the data have been changed so that they will produce nonsignificant results. The complete SAS program, including the new data set, is presented below:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM CONSEQ $ MOD_AGGR $ SUB_AGGR;
DATALINES;
01 MR L 15
02 MR L 22
03 MR L 19
04 MR L 16
05 MR L 11
06 MR M 16
07 MR M 24
08 MR M 10
09 MR M 17
10 MR M 17
11 MR H 17
12 MR H 12
13 MR H 24
14 MR H 20
15 MR H 15
16 MP L 14
17 MP L 7
18 MP L 22
19 MP L 15
20 MP L 13
21 MP M 14
22 MP M 21
23 MP M 11
24 MP M 9
25 MP M 19
26 MP H 15
27 MP H 9
28 MP H 10
29 MP H 20
30 MP H 21
;
PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
   TITLE1 'JANE DOE';
RUN;
QUIT;
Steps in Interpreting the Output

Overview. As was the case with the earlier data set, the SAS program performing this analysis would produce five pages of output. This section will present only those sections of output that are relevant to preparing the ANOVA summary table, the graph, and the analysis reports. You will notice that the steps listed in this section are not identical to the steps listed in the preceding section. Some of the steps (such as "1. Make sure that everything looks correct") are not included here because the key concepts have already been covered. Some other steps are not included here because they are typically not appropriate when effects are nonsignificant.

1. Determine whether the interaction term is statistically significant. As was the case before, you determine whether the interaction is significant by reviewing the ANOVA summary table produced by PROC GLM. This table appears on output page 2, and is reproduced here as Output 16.13.

                              JANE DOE                               2
                          The GLM Procedure
Dependent Variable: SUB_AGGR
                                Sum of
Source             DF          Squares     Mean Square   F Value   Pr > F
Model               5       45.3666667       9.0733333      0.37   0.8667
Error              24      594.8000000      24.7833333
Corrected Total    29      640.1666667

    R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
    0.070867      31.44181      4.978286         15.83333

Source             DF        Type I SS     Mean Square   F Value   Pr > F
CONSEQ              1      40.83333333     40.83333333      1.65   0.2115
MOD_AGGR            2       4.06666667      2.03333333      0.08   0.9215
CONSEQ*MOD_AGGR     2       0.46666667      0.23333333      0.01   0.9906

Source             DF      Type III SS     Mean Square   F Value   Pr > F
CONSEQ              1      40.83333333     40.83333333      1.65   0.2115
MOD_AGGR            2       4.06666667      2.03333333      0.08   0.9215
CONSEQ*MOD_AGGR     2       0.46666667      0.23333333      0.01   0.9906
Output 16.13. ANOVA summary table for two-way ANOVA performed on aggression data, nonsignificant main effects, nonsignificant interaction.
As was the case with the earlier data set, you will review the results of the analyses that appear in the section headed "Type III SS," as opposed to the section headed "Type I SS." To determine whether the interaction is significant, look to the right of the heading "CONSEQ*MOD_AGGR." Here, you can see that the F statistic is only 0.01, with a p value of .9906. This p value is larger than our standard criterion of .05, and so you know that
the interaction is nonsignificant. Since the interaction is nonsignificant, you may proceed to interpret the significance tests for the main effects.

2. Determine whether either of the two main effects is statistically significant. Information regarding the main effect for Predictor A appears to the right of the heading "MOD_AGGR." You can see that the F value for this effect is 0.08, with a nonsignificant p value of .9215. Information regarding Predictor B appears to the right of "CONSEQ." This factor displayed a nonsignificant F statistic of 1.65 (p = .2115).

3. Prepare your own version of the ANOVA summary table. The completed ANOVA summary table for this analysis is presented here as Table 16.4.

Table 16.4
ANOVA Summary Table for Study Investigating the Relationship Between Level of Aggression Displayed by Model (A), Consequences for Model (B), and Subject Aggression (Nonsignificant Interaction, Nonsignificant Main Effects)
_______________________________________________________________________
Source                        df       SS       MS       F       p     R²
_______________________________________________________________________
Model aggression (A)           2     4.07     2.03    0.08   .9215   .01
Consequences for model (B)     1    40.83    40.83    1.65   .2115   .06
A × B Interaction              2     0.47     0.23    0.01   .9906   .00
Within groups                 24   594.80    24.78
Total                         29   640.17
_______________________________________________________________________
Note: N = 30
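As a quick check on the R² column, you can reproduce each entry by hand from the sums of squares in Table 16.4, using the same division that was introduced earlier in this chapter:

   R² for A     =   4.07 / 640.17 = .01
   R² for B     =  40.83 / 640.17 = .06
   R² for A × B =   0.47 / 640.17 = .00

(each result rounded to two decimal places).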
Notice how information from Output 16.13 was used to fill in the relevant sections of Table 16.4:
• Information from the row labeled “MOD_AGGR” in Output 16.13 was transferred to the row labeled “Model aggression (A)” in Table 16.4.
• Information from the row labeled “CONSEQ” in Output 16.13 was transferred to the row labeled “Consequences for model (B)” in Table 16.4.
• Information from the row labeled “CONSEQ*MOD_AGGR” in Output 16.13 was transferred to the row labeled “A × B Interaction” in Table 16.4.
• Information from the row labeled “Error” in Output 16.13 was transferred to the row labeled “Within groups” in Table 16.4.
• Information from the row labeled “Corrected Total” in Output 16.13 was transferred to the row labeled “Total” in Table 16.4.
The R² entries in Table 16.4 were computed by dividing the sum of squares for a particular effect by the total sum of squares. For example, for the model aggression effect, R² = 4.07 / 640.17 ≈ .01. For detailed guidelines on constructing an ANOVA summary table (such as Table 16.4), see the subsection “4. Prepare your own version of the
ANOVA summary table” from the major section “Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction,” presented earlier.
4. Prepare a table that displays the confidence intervals for Predictor A. Notice that, unlike the previous section, this section does not advise you to review the results of the Tukey multiple comparison procedures to determine whether there are significant differences between the various levels of Predictor A or Predictor B. This is because the main effects for these two predictor variables were nonsignificant, and you normally should not interpret the results of multiple comparison procedures for main effects that are not significant.
However, the Tukey test that you requested also computes confidence intervals for the differences between the means. Some readers might be interested in seeing these confidence intervals, even when the main effects are not statistically significant. To conserve space, this section will not present the SAS output that shows the results of the Tukey tests and confidence intervals. However, it will show how to summarize the information that was presented in that output for a paper.
Predictor A was the level of aggression displayed by the model. This predictor contained three levels, and therefore produced three confidence intervals for differences between the means. Because this is a fairly large amount of information to present, these confidence intervals are summarized in Table 16.5:

Table 16.5
Results of Tukey Tests for Predictor A: Comparing High-Model-Aggression Group versus Moderate-Model-Aggression Group versus Low-Model-Aggression Group on the Criterion Variable (Subject Aggression)
________________________________________________________
                        Difference      Simultaneous 95%
                        between         confidence limits b
Comparison a            means           Lower        Upper
________________________________________________________
High – Moderate          0.5000        -5.0599      6.0599
High – Low               0.9000        -4.6599      6.4599
Moderate – Low           0.4000        -5.1599      5.9599
________________________________________________________
Note: N = 30
a Differences are computed by subtracting the mean for the second group from the mean for the first group.
b Tukey test indicates that none of the differences between the means were significant at p < .05.
5. Summarize the confidence interval for Predictor B in the text of your paper. In contrast, Predictor B (consequences for the model) contained only two levels and therefore produced only one confidence interval for a difference between the means. This interval could be summarized in the text of your paper in this way:
Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence interval for this difference ranged from –1.418 to 6.085.
Using a Figure to Illustrate the Results
When all main effects and interactions are nonsignificant, researchers usually do not illustrate means for the various conditions in a graph. However, this section will present a graph, simply to provide an additional example of how to prepare a graph from the cell means that are provided in the SAS output. Page 3 of the SAS output from the current program provides means and standard deviations for the various treatment conditions manipulated in the current study. These results are presented here as Output 16.14.
                                JANE DOE                                   3
                            The GLM Procedure

Level of          -----------SUB_AGGR----------
CONSEQ       N        Mean          Std Dev
MP          15    14.6666667     4.95215201
MR          15    17.0000000     4.27617987

Level of          -----------SUB_AGGR----------
MOD_AGGR     N        Mean          Std Dev
H           10    16.3000000     4.98998998
L           10    15.4000000     4.69515116
M           10    15.8000000     4.87168691

Level of   Level of       -----------SUB_AGGR----------
CONSEQ     MOD_AGGR   N       Mean          Std Dev
MP         H          5    15.0000000     5.52268051
MP         L          5    14.2000000     5.35723809
MP         M          5    14.8000000     5.11859356
MR         H          5    17.6000000     4.61519230
MR         L          5    16.6000000     4.15932687
MR         M          5    16.8000000     4.96990946

Output 16.14. Means needed to plot the study’s results in a figure; two-way ANOVA performed on aggression data; nonsignificant main effects; nonsignificant interaction.
As was noted earlier, this page of output actually contains three tables of means. However, you will be interested only in the third table––the one that presents the means broken down by both Predictor A and Predictor B. Figure 16.12 graphs the six cell means from the third table in Output 16.14. As before, cell means from subjects in the model-rewarded condition are displayed as circles on a solid line, while cell means from subjects in the model-punished condition are displayed as triangles on a broken line.
Figure 16.12. Mean number of aggressive acts as a function of the level of aggression displayed by the model and the consequences for the model (nonsignificant main effects and nonsignificant interaction).
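Incidentally, if you would rather have SAS draw a plot like Figure 16.12 than construct it by hand, a program along the following lines could do so with the legacy SAS/GRAPH GPLOT procedure. This sketch is not part of the original program; the data set name MEANS and the numeric codes for model aggression (1 = Low, 2 = Moderate, 3 = High) are illustrative assumptions.
DATA MEANS;
   INPUT CONSEQ $ AGG_NUM MEAN;
DATALINES;
MP 1 14.2
MP 2 14.8
MP 3 15.0
MR 1 16.6
MR 2 16.8
MR 3 17.6
;
* GPLOT typically assigns SYMBOL1 to the first sorted value of
  CONSEQ (MP here) and SYMBOL2 to the second (MR). ;
SYMBOL1 INTERPOL=JOIN VALUE=TRIANGLE LINE=2;
SYMBOL2 INTERPOL=JOIN VALUE=CIRCLE   LINE=1;
PROC GPLOT DATA=MEANS;
   PLOT MEAN*AGG_NUM=CONSEQ;
RUN;
QUIT;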
To conserve space, this section will not repeat the steps that you should follow when transferring mean scores from Output 16.14 to Figure 16.12. For detailed guidelines on how to construct a graph such as that in Figure 16.12, see the subsection “Steps in Preparing the Graph” from the major section “Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction,” presented earlier.
Interpreting Figure 16.12
Notice how the graphic pattern of results in Figure 16.12 is consistent with the statistical results reported in Output 16.13. Specifically:
• In Figure 16.12, the corresponding line segments are parallel to one another. This is the “railroad track” pattern that you would expect when the interaction is nonsignificant (as reported in Output 16.13).
• None of the line segments going from “Low” to “Moderate” or from “Moderate” to “High” displays a substantial angle. This is consistent with the fact that Output 16.13 reported a nonsignificant main effect for Predictor A (model aggression).
• Finally, the solid line for the model-rewarded group is not substantially separated from the broken line for the model-punished group. This is consistent with the finding from Output 16.13 that the main effect for Predictor B (consequences for the model) was nonsignificant
(it is true that the solid line is slightly separated from the broken line, but the separation is not enough to attain statistical significance).
Analysis Report Concerning the Main Effect for Predictor A (Nonsignificant Effect)
The analysis report in this section provides an example of how to write up the results for a predictor variable that has three levels and is not statistically significant. Items A through F of this report would be identical to items A through F of the analysis report in the section titled “Analysis Report Concerning the Main Effect for Predictor A (Significant Effect),” presented earlier. These sections stated the research question, the research hypothesis, and so on. Therefore, those items will not be presented again in this section.
(Items A through F would normally appear here)
G) Obtained statistic:
F(2, 24) = 0.08
H) Obtained probability (p) value:
p = .9215
I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.
J) Multiple comparison procedure: The multiple comparison procedure was not appropriate because the F statistic for the main effect was nonsignificant.
K) Confidence intervals: Confidence intervals for differences between the means are presented in Table 16.5.
L) Effect size: R² = .01, indicating that model aggression accounted for 1% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study’s research hypothesis that there will be a positive relationship between the level of aggression displayed by a model and the number of aggressive acts later demonstrated by children who observed the model.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a nonsignificant main effect for level of aggression displayed by the model, F(2, 24) = 0.08, MSE = 24.78, p = .9215. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the high-model-aggression condition was 16.30 (SD = 4.99), the mean for the moderate-model-aggression condition was 15.80 (SD = 4.87), and
the mean for the low-model-aggression condition was 15.40 (SD = 4.70). Sample means for the various conditions that constituted the study are displayed in Figure 16.12. Confidence intervals for differences between the means (based on Tukey’s HSD test) are presented in Table 16.5. In the analysis, R² for this main effect was computed as .01. This indicated that model aggression accounted for 1% of the variance in subject aggression.
O) Figure representing the results:
See Figure 16.12.
Notice that the preceding “Formal description of the results for a paper” did not discuss the results of the Tukey HSD test (other than to refer to the confidence intervals). This is because the results of multiple comparison procedures are typically not discussed when the main effect is nonsignificant.
Analysis Report Concerning the Main Effect for Predictor B (Nonsignificant Effect)
The analysis report in this section provides an example of how to write up the results for a predictor variable that has two levels and is not statistically significant. Items A through F of this report would be identical to items A through F of the analysis report in the section titled “Analysis Report Concerning the Main Effect for Predictor B (Significant Effect),” presented earlier. These sections stated the research question, the research hypothesis, and so on. Therefore, those items will not be presented again in this section.
(Items A through F would normally appear here)
G) Obtained statistic:
F(1, 24) = 1.65
H) Obtained probability (p) value:
p = .2115
I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.
J) Multiple comparison procedure: Not relevant.
K) Confidence intervals: Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence interval for this difference extended from –1.418 to 6.085.
L) Effect size: R² = .06, indicating that consequences for the model accounted for 6% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings fail to provide support for the study’s research
hypothesis that children who observe a model being rewarded for engaging in aggressive behavior will later demonstrate a greater number of aggressive acts, compared to children who observe a model being punished for engaging in aggressive behavior.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a nonsignificant main effect for consequences for the model, F(1, 24) = 1.65, MSE = 24.78, p = .2115. On the criterion variable (number of aggressive acts displayed by subjects), the mean score for the model-rewarded condition was 17.00 (SD = 4.28), and the mean for the model-punished condition was 14.67 (SD = 4.95). Sample means for the various conditions that constituted the study are displayed in Figure 16.12. Subtracting the mean of the model-punished condition from the mean of the model-rewarded condition resulted in an observed difference of 2.333. The 95% confidence interval for this difference ranged from –1.418 to 6.085. In the analysis, R² for this main effect was computed as .06. This indicated that consequences for the model accounted for 6% of the variance in subject aggression.
O) Figure representing the results:
See Figure 16.12.
Analysis Report Concerning the Interaction (Nonsignificant Effect)
An earlier section of this chapter has already shown how to prepare a report for a nonsignificant interaction term. Therefore, to save space, a similar report will not be provided here. To find the previous example, see the subsection “Analysis Report Concerning the Interaction (Nonsignificant Effect)” within the major section “Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction.”
Example of a Factorial ANOVA Revealing a Significant Interaction
Overview
This section presents the results of a factorial ANOVA in which the interaction between Predictor A and Predictor B is significant. You will remember that an interaction means that the relationship between one predictor variable and the criterion variable is different at different levels of the second predictor variable. These results are presented so that you will be prepared to write analysis reports for projects in which significant interactions are observed.
The Complete SAS Program
The study presented here is the same aggression study that was described in the preceding sections. The data will be analyzed with the same SAS program that was presented earlier in the section headed “Writing the SAS Program.” Here, the data have been changed so that they will produce a significant interaction term. The complete SAS program, including the new data set, is presented below:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT SUB_NUM CONSEQ $ MOD_AGGR $ SUB_AGGR;
DATALINES;
01 MR L 10
02 MR L  8
03 MR L 12
04 MR L 10
05 MR L  9
06 MR M 14
07 MR M 15
08 MR M 17
09 MR M 16
10 MR M 16
11 MR H 18
12 MR H 20
13 MR H 20
14 MR H 23
15 MR H 21
16 MP L  9
17 MP L  8
18 MP L  7
19 MP L 11
20 MP L 10
21 MP M 10
22 MP M 12
23 MP M 13
24 MP M  9
25 MP M  9
26 MP H 11
27 MP H 12
28 MP H  9
29 MP H 13
30 MP H 11
;
PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   MEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   MEANS CONSEQ MOD_AGGR / TUKEY CLDIFF ALPHA=0.05;
   TITLE1 'JANE DOE';
RUN;
QUIT;
Steps in Interpreting the Output
Overview. As was the case with the earlier data set, the SAS program performing this analysis would produce five pages of output. This section will present only those sections of output that are relevant to preparing the ANOVA summary table, the graph, and the analysis reports. You will notice that the steps listed in this section are not identical to the steps listed in the preceding sections. Some of the steps (such as “1. Make sure that everything looks correct”) are not included here because the key concepts have already been covered. Other steps in this section are different because, when an interaction is significant, it is necessary to follow a special sequence of steps.
1. Determine whether the interaction term is statistically significant. As was the case before, you determine whether the interaction is significant by reviewing the ANOVA summary table produced by PROC GLM. This table appears on output page 2, and is reproduced here as Output 16.15.
                                JANE DOE                                   2
                            The GLM Procedure

Dependent Variable: SUB_AGGR
                                  Sum of
Source                   DF      Squares      Mean Square    F Value   Pr > F
Model                     5   482.1666667      96.4333333      39.09   <.0001
Error                    24    59.2000000       2.4666667
Corrected Total          29   541.3666667

          R-Square     Coeff Var      Root MSE    SUB_AGGR Mean
          0.890647      12.30206      1.570563         12.76667

Source               DF      Type I SS     Mean Square    F Value   Pr > F
CONSEQ                1    187.5000000     187.5000000      76.01   <.0001
MOD_AGGR              2    206.4666667     103.2333333      41.85   <.0001
CONSEQ*MOD_AGGR       2     88.2000000      44.1000000      17.88   <.0001

Source               DF    Type III SS     Mean Square    F Value   Pr > F
CONSEQ                1    187.5000000     187.5000000      76.01   <.0001
MOD_AGGR              2    206.4666667     103.2333333      41.85   <.0001
CONSEQ*MOD_AGGR       2     88.2000000      44.1000000      17.88   <.0001
Output 16.15. ANOVA summary table for two-way ANOVA performed on aggression data; significant interaction.
The information relevant to the interaction term appears to the right of the heading “CONSEQ*MOD_AGGR.” You can see that this interaction term has an F statistic of 17.88, and a corresponding p value of .0001. Because the p value is below the standard criterion of .05, you conclude that the interaction is statistically significant.
Other results presented in Output 16.15 show that the main effects for CONSEQ and for MOD_AGGR are also statistically significant. However, main effects must be interpreted very cautiously (if at all) when an interaction is significant. Therefore, the main effects for Predictors A and B will not be interpreted in this section. Instead, the primary focus will be on the interpretation of the interaction.
2. Prepare your own version of the ANOVA summary table. The completed ANOVA summary table for this analysis is presented here as Table 16.6.

Table 16.6
ANOVA Summary Table for Study Investigating the Relationship Between Level of Aggression Displayed by Model (A), Consequences for Model (B), and Subject Aggression (Significant Interaction)
_______________________________________________________________________
Source                         df        SS        MS        F       p     R²
_______________________________________________________________________
Model aggression (A)            2    206.47    103.23    41.85   .0001   .38
Consequences for model (B)      1    187.50    187.50    76.01   .0001   .35
A × B Interaction               2     88.20     44.10    17.88   .0001   .16
Within groups                  24     59.20      2.47
Total                          29    541.37
_______________________________________________________________________
Note: N = 30
For the most part, the information that appears in Table 16.6 was taken from Output 16.15. For detailed guidelines on constructing an ANOVA summary table (such as Table 16.6), see the subsection “4. Prepare your own version of the ANOVA summary table” from the major section “Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction,” presented earlier.
Using a Graph to Illustrate the Results
Interactions are easiest to understand when they are plotted in a graph. You can plot the current interaction by preparing the same type of graph that has been used throughout this chapter. Page 3 of the SAS output from the current program provides means and standard deviations for the various treatment conditions manipulated in the current study. These results are presented here as Output 16.16:
                                JANE DOE                                   3
                            The GLM Procedure

Level of          -----------SUB_AGGR----------
CONSEQ       N        Mean          Std Dev
MP          15    10.2666667     1.79151439
MR          15    15.2666667     4.69751707

Level of          -----------SUB_AGGR----------
MOD_AGGR     N        Mean          Std Dev
H           10    15.8000000     5.09465951
L           10     9.4000000     1.50554531
M           10    13.1000000     2.99814758

Level of   Level of       -----------SUB_AGGR----------
CONSEQ     MOD_AGGR   N       Mean          Std Dev
MP         H          5    11.2000000     1.48323970
MP         L          5     9.0000000     1.58113883
MP         M          5    10.6000000     1.81659021
MR         H          5    20.4000000     1.81659021
MR         L          5     9.8000000     1.48323970
MR         M          5    15.6000000     1.14017543
Output 16.16. Means needed to plot the study’s results in a figure; two-way ANOVA performed on aggression data, significant interaction.
As noted earlier, this page of output contains three tables of means. However, you will be interested only in the third table––the one that presents the means broken down by both Predictor A and Predictor B. Figure 16.13 graphs the six cell means from the third table of Output 16.16. As before, cell means from subjects in the model-rewarded condition are displayed as circles on a solid line, while cell means from subjects in the model-punished condition are displayed as triangles on a broken line.
Figure 16.13. Mean number of aggressive acts as a function of the level of aggression displayed by the model and the consequences for the model (significant interaction between Predictor A and Predictor B).
To conserve space, this section will not repeat the steps that you should follow when you transfer mean scores from Output 16.16 to Figure 16.13. For detailed guidelines on how to construct a graph such as Figure 16.13, see the subsection headed “Steps in Preparing the Graph” from the major section headed “Example of a Factorial ANOVA Revealing Significant Main Effects and a Nonsignificant Interaction,” presented earlier.
Interpreting Figure 16.13
Notice how the pattern of results presented in Figure 16.13 is consistent with the statistical results reported in Output 16.15 (that is, the finding that the interaction was significant). In Figure 16.13, you can see that corresponding line segments are not parallel to each other. Because the two lines are not parallel, you know that you have an interaction between Predictor A and Predictor B. Specifically:
• The broken line (representing subjects in the model-punished condition) is relatively flat, indicating that Predictor A (the level of aggression displayed by the model) had little if any effect on subjects in this condition.
• In contrast, the solid line (representing subjects in the model-rewarded condition) displays a marked upward angle, indicating that Predictor A had a stronger effect on subjects in this condition.
Notice how these results are also consistent with the definition for an interaction that was presented earlier: the relationship between one predictor variable (model aggression) and the criterion variable (subject aggression) is different at different levels of the second predictor variable (consequences for the model).
Testing for Simple Effects
When there is a simple effect for Predictor A at a particular level of Predictor B, it means that there is a significant relationship between Predictor A and the criterion variable at that level of Predictor B. This concept of a simple effect is perhaps easiest to understand by referring again to Figure 16.13.
First, consider the solid line (the line representing the model-rewarded group) in Figure 16.13. This line displays a relatively steep angle, suggesting that there may be a significant relationship between model aggression and subject aggression for the subjects in the model-rewarded group. In other words, there may be a significant simple effect for model aggression at the model-rewarded level of the “consequences for model” predictor variable.
Now consider the broken line in the same figure––the line that represents the model-punished group. This line displays an angle that is less steep than the angle displayed by the solid line. It is impossible to be sure by simply viewing the graph in this way, but this may mean that there is not a simple effect for model aggression at the model-punished level of the “consequences for model” predictor variable.
It is possible to perform tests to determine whether a simple effect is statistically significant. For example, if you performed these tests, you might find that the simple effect for model aggression at the model-rewarded level of Predictor B was statistically significant, but that the simple effect for model aggression at the model-punished level of Predictor B was nonsignificant. Whether a researcher chooses to perform tests for simple effects will depend, in part, on the nature of the research questions being addressed in the study. Testing for simple effects is a somewhat advanced concept, and so it is not treated in detail in this text (a minimal sketch of one approach appears below). For detailed guidelines on how to use SAS to test for simple effects, see Hatcher and Stepanski (1994), pages 263–279.
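As an aside, one way such tests are often requested in PROC GLM is through the SLICE= option of the LSMEANS statement. The following sketch is not part of the chapter’s program, and whether the option behaves this way will depend on your release of SAS/STAT:
PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   * Test the simple effect of MOD_AGGR at each level of CONSEQ;
   LSMEANS CONSEQ*MOD_AGGR / SLICE=CONSEQ;
   TITLE1 'JANE DOE';
RUN;
QUIT;
Each “slice” F test asks whether the three model-aggression means differ within a single consequences condition, which is exactly the question raised by the solid and broken lines in Figure 16.13.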
Analysis Report Concerning the Interaction (Significant Effect)
Below is an example of how this result could be described in an analysis report.
A) Statement of the research question: One purpose of this study was to determine whether there was a significant interaction between (a) the level of aggression displayed by the model and (b) the consequences experienced by the model in the prediction of (c) the number of aggressive acts later displayed by subjects who have observed the model.
B) Statement of the research hypothesis: The positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.
C) Nature of the variables: This analysis involved two predictor variables and one criterion variable:
• Predictor A was the level of aggression displayed by the model. This was a limited-value variable, was assessed on an ordinal scale, and included three levels: low, moderate, and high.
• Predictor B was the consequences for the model. This was a dichotomous variable, was assessed on a nominal scale, and included two levels: model-rewarded and model-punished.
• The criterion variable was the number of aggressive acts displayed by the subjects after observing the model. This was a multi-value variable, and was assessed on a ratio scale.
D) Statistical test: Factorial ANOVA with two between-subjects factors
E) Statistical null hypothesis (Ho): In the population, there is no interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects). F) Statistical alternative hypothesis (H1): In the population, there is an interaction between the level of aggression displayed by the model and the consequences for the model in the prediction of the criterion variable (the number of aggressive acts displayed by the subjects). G) Obtained statistic:
F(2, 24) = 17.88
H) Obtained probability (p) value:
p = .0001
I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.
J) Multiple comparison procedure: Not relevant.
K) Confidence intervals: Not relevant.
L) Effect size: R² = .16, indicating that the interaction term accounted for 16% of the variance in subject aggression.
M) Conclusion regarding the research hypothesis: These findings provide support for the study’s research hypothesis
that the positive relationship between the level of aggression displayed by the model and subsequent subject aggressive acts will be stronger for subjects who have seen the model rewarded than for subjects who have seen the model punished.
N) Formal description of the results for a paper:
Results were analyzed by using a factorial ANOVA with two between-subjects factors. This analysis revealed a significant F statistic for the interaction between the level of aggression displayed by the model and the consequences for the model, F(2, 24) = 17.88, MSE = 2.47, p = .0001. Sample means for the various conditions that constituted the study are displayed in Figure 16.13. The nature of the interaction displayed in Figure 16.13 shows that there is a positive relationship between model aggression and subject aggression for subjects in the model-rewarded group: for subjects in this condition, greater levels of model aggression are associated with greater levels of subject aggression. On the other hand, there is only a very weak relationship between model aggression and subject aggression for subjects in the model-punished group. In the analysis, R² for this interaction effect was computed as .16. This indicated that the interaction accounted for 16% of the variance in subject aggression.
O) Figure representing the results:
See Figure 16.13.
The preceding analysis report was prepared following the same steps that were followed with the other analysis reports in this chapter. The F statistic, degrees of freedom, and p value came from Output 16.15, from the row “CONSEQ*MOD_AGGR.”
Using the LSMEANS Statement to Analyze Data from Unbalanced Designs
Overview
This section shows you how to use the LSMEANS statement rather than the MEANS statement in a factorial ANOVA. “LSMEANS” stands for “least-squares means.” It is often appropriate to use the LSMEANS statement when you are analyzing data from an unbalanced design. This section explains what an unbalanced design is, and shows you how to use LSMEANS statements to request least-squares means, multiple comparison procedures, and confidence intervals.
Reprise: What Is an Unbalanced Design?
An experimental design is balanced if the same number of observations (subjects) appears in each cell of the design. For example, Figure 16.1 (presented toward the beginning of this chapter) illustrates the research design that was used in the aggression study. It shows that there are five subjects in each cell of the design (that is, there are five subjects in the cell for subjects who experienced the “low” condition under Predictor A and the “model-rewarded” condition under Predictor B, five subjects in the cell for subjects who experienced the “moderate” condition under Predictor A and the “model-punished” condition under Predictor B, and so on). When a research design is balanced, it is generally appropriate to use the MEANS statement with PROC GLM to request group means, multiple comparison procedures, and confidence intervals.
In contrast, a research design is unbalanced if some cells in the design contain a larger number of observations (subjects) than other cells. For example, again consider Figure 16.1, presented earlier. The research design illustrated there contains six cells. If there were 20 subjects in one of the cells, but only five subjects in each of the remaining five cells, the research design would be unbalanced.
When you analyze data from an unbalanced design, it is generally best not to use the MEANS statement, because with unequal cell sizes the MEANS statement may produce marginal means that are biased. It is generally preferable to use the LSMEANS statement instead: LSMEANS estimates the marginal means over a balanced population. That is, it estimates what the marginal means would be if you had equal cell sizes, so the marginal means estimated by the LSMEANS statement are less likely to be biased.
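To make the potential bias concrete, consider a small worked example (the cell sizes and means here are hypothetical, not taken from the aggression study). Suppose one level of a predictor variable spans two cells whose means are 10 and 20, but the first cell contains 20 subjects while the second contains only 5. The MEANS statement weights each cell mean by its cell size, whereas LSMEANS simply averages the cell means:

$$\text{weighted marginal mean} = \frac{20(10) + 5(20)}{20 + 5} = \frac{300}{25} = 12, \qquad \text{least-squares mean} = \frac{10 + 20}{2} = 15$$

With equal cell sizes the two estimates would be identical; the discrepancy produced by the unequal cell sizes is the bias that the text warns about.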
Writing the LSMEANS Statements
The syntax. Below is the syntax for the PROC step of a SAS program that uses the LSMEANS statement rather than the MEANS statement:
PROC GLM DATA=data-set-name;
   CLASS predictorB predictorA;
   MODEL criterion-variable = predictorB predictorA predictorB*predictorA;
   LSMEANS predictorB predictorA predictorB*predictorA;
   LSMEANS predictorB predictorA / PDIFF ADJUST=TUKEY CL ALPHA=alpha-level;
   TITLE1 'your-name';
RUN;
QUIT;
The preceding syntax is very similar to the syntax that used the MEANS statement, presented earlier in this chapter. The first LSMEANS statement takes this form:
LSMEANS predictorB predictorA predictorB*predictorA;
You can see that this LSMEANS statement is identical to the earlier MEANS statement, except that “MEANS” has been replaced with “LSMEANS.” The second LSMEANS statement is more complex:
LSMEANS predictorB predictorA / PDIFF ADJUST=TUKEY CL ALPHA=alpha-level;
You can see that this second LSMEANS statement contains a slash, followed by a number of key words for options. Here is what the key words request:
You can see that this second LSMEANS statement contains a slash, followed by a number of key words for options. Here is what the key words request: •
PDIFF requests that SAS print p values for significance tests related to the multiple comparison procedure. These p values will tell you whether there are significant differences between the least-squares means for the different levels under the two predictor variables.
•
ADJUST=TUKEY requests a multiple comparison adjustment for the p values and confidence limits for the differences between the least-squares means. Including ADJUST=TUKEY requests an adjustment based on the Tukey HSD test. The adjustment can also be based on other multiple-comparison procedures; see the chapter on the GLM procedure in the SAS/STAT User’s Guide for details.
•
CL requests confidence limits for individual least-squares means. If you also include the PDIFF option (as is done here), it will also print confidence limits for differences between means. These are the type of confidence limits that have been illustrated throughout this chapter.
•
ALPHA=alpha-level specifies the significance level to be used for the multiple comparison procedure and the confidence level to be used with the confidence limits.
Specifying ALPHA=0.05 requests that the significance level (alpha) be set at .05 for the Tukey tests. If you had wanted alpha set at .01, you would have used the option ALPHA=0.01, and if you had wanted alpha set at .10, you would have used the option ALPHA=0.1.
The actual SAS statements. Following are the statements that you would include in a SAS program to request a factorial ANOVA using the LSMEANS statement rather than the MEANS statement. The following statements would be appropriate to analyze data from the aggression study described in this chapter. Notice that alpha is set at .05 for the Tukey tests.
PROC GLM DATA=D1;
   CLASS CONSEQ MOD_AGGR;
   MODEL SUB_AGGR = CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   LSMEANS CONSEQ MOD_AGGR CONSEQ*MOD_AGGR;
   LSMEANS CONSEQ MOD_AGGR / PDIFF ADJUST=TUKEY CL ALPHA=0.05;
   TITLE1 'JANE DOE';
RUN;
QUIT;
Output Produced by LSMEANS
The output produced by the LSMEANS statements is very similar to the output produced by the MEANS statements, except that the means have been appropriately adjusted. There are a few additional differences. For example, the MEANS statement prints the means and standard deviations, while the LSMEANS statement prints only the adjusted means. For the most part, however, if you have read the sections of this chapter that show how to interpret the results of the MEANS statements, you should have little difficulty in interpreting the results produced by LSMEANS.
Learning More about Using SAS for Factorial ANOVA
This chapter has provided an elementary introduction to the use of factorial ANOVA for analyzing data from research. Its scope was limited to the simplest types of factorial designs: those with only two predictor variables. Its scope was also limited to studies in which both factors are between-subjects factors. For a discussion of how to use SAS to perform factorial ANOVA with one between-subjects factor and one within-subjects factor (i.e., one repeated-measures factor), see Hatcher and Stepanski (1994). As was mentioned earlier, Hatcher and Stepanski (1994) also show how to perform tests for simple effects.
For a guide on using SAS for an even wider variety of factorial designs, see Cody and Smith (1997). This book shows how to analyze data from studies with up to three predictor variables. It also shows how to handle any combination of between-subjects factors and within-subjects factors.
Conclusion
This chapter, along with Chapters 10-15 of this text, has dealt with statistical tests in which the criterion variable was always a numeric variable assessed on an interval or ratio scale. Chapter 10 and Chapter 11 illustrated tests of association: procedures that allowed you to determine whether a multi-value numeric criterion variable was associated with a predictor variable that was also a multi-value numeric variable. Chapter 13 through Chapter 16 illustrated tests of group differences: tests that enabled you to determine whether a multi-value numeric criterion variable was associated with a predictor variable that was either a dichotomous or limited-value variable.
But what if you are conducting a study in which the criterion variable is not a multi-value numeric variable assessed on an interval or ratio scale? More specifically, what if you have conducted a study in which both the criterion variable and the predictor variable are dichotomous or limited-value variables? What if you need to determine whether there is a significant relationship between these two variables? In studies such as these, it may be appropriate to analyze your data using a nonparametric procedure called the chi-square test of independence. This statistical procedure is the topic of the next (and final) chapter.
Chi-Square Test of Independence
Introduction ...................................................................... 631
   Overview ....................................................................... 631
Situations That Are Appropriate for the Chi-Square Test of Independence .......... 631
   Overview ....................................................................... 631
   Nature of the Predictor and Criterion Variables ................................ 631
   The Type-of-Variable Figure ................................................... 632
   Example of a Study That Provides Appropriate Data for This Procedure .......... 632
   Summary of Assumptions Underlying the Chi-Square Test of Independence ......... 633
Using Two-Way Classification Tables .............................................. 634
   Overview ....................................................................... 634
   General Structure of a Two-Way Classification Table ........................... 634
   Two-Way Classification Table for the Juvenile Offender Study .................. 635
Results Produced in a Chi-Square Test of Independence ............................ 637
   Overview ....................................................................... 637
   Test of the Null Hypothesis ................................................... 637
   Effect Size .................................................................... 638
A Study Investigating Computer Preferences ....................................... 640
   Overview ....................................................................... 640
   Research Method ................................................................ 640
Computing Chi-Square from Raw Data versus Tabular Data ........................... 642
   Overview ....................................................................... 642
   Raw Data ....................................................................... 642
   Tabular Data ................................................................... 642
   The Approach Used Here ........................................................ 643
Example of a Chi-Square Test That Reveals a Significant Relationship ............. 643
   Overview ....................................................................... 643
   Choosing SAS Variable Names and Values to Use in the Analysis ................. 643
   Data Set to Be Analyzed ....................................................... 645
   Writing the SAS Program ....................................................... 646
   Output Produced by the SAS Program ............................................ 649
   Steps in Interpreting the Output .............................................. 649
   Using a Graph to Illustrate the Results ....................................... 655
   Analysis Report for the Computer Preferences Study (Significant Results) ...... 658
   Notes Regarding the Preceding Analysis Report ................................. 659
Example of a Chi-Square Test That Reveals a Nonsignificant Relationship .......... 661
   Overview ....................................................................... 661
   The Complete SAS Program ...................................................... 662
   Steps in Interpreting the Output .............................................. 662
   Using a Figure to Illustrate the Results ...................................... 664
   Analysis Report for the Computer Preferences Study (Nonsignificant Results) ... 665
   Notes Regarding the Preceding Analysis Report ................................. 667
Computing Chi-Square from Raw Data ............................................... 668
   Overview ....................................................................... 668
   Inputting Raw Data ............................................................. 668
   The PROC Step .................................................................. 670
   The Complete SAS Program ...................................................... 671
   Interpreting the SAS Output ................................................... 671
Conclusion ........................................................................ 671
Introduction
Overview
This chapter shows how to enter data and prepare SAS programs that will perform a chi-square test of independence (also called the Pearson chi-square test, the chi-square test of association, the chi-square test of homogeneity, or the two-way chi-square test). This test is useful when you want to determine whether there is a significant relationship between two dichotomous or limited-value variables that are assessed on any scale of measurement. The symbol for the chi-square statistic is χ².
The chapter shows how to prepare SAS programs that will input either tabular data (data that have already been summarized in a table) or raw data. It shows how to interpret the two-way classification table produced by PROC FREQ, how to determine whether there is a significant relationship between the two variables, and how to prepare a report that summarizes the results of the analysis, including a figure and an index of effect size.
Situations That Are Appropriate for the Chi-Square Test of Independence
Overview
The chi-square test of independence is a test of association. It allows you to determine whether a single predictor variable is related to a single criterion variable. It is called a test of independence because, in this context, the word “independent” means “unrelated.” By the end of the analysis, you will conclude that the two variables are either independent (unrelated) or dependent (related).
With the chi-square test, both the predictor variable and the criterion variable are typically dichotomous or limited-value variables. This test is particularly useful because it can be used with variables assessed on any scale of measurement: nominal, ordinal, interval, or ratio. It is one of the few statistical procedures that allow you to determine whether there is a significant relationship between two nominal-level variables.
Nature of the Predictor and Criterion Variables
Predictor variable. The predictor variable is typically a dichotomous or limited-value variable. It may be assessed on any scale of measurement (nominal, ordinal, interval, or ratio).
Criterion variable. The criterion variable is also typically a dichotomous or limited-value variable. It may also be assessed on any scale of measurement (nominal, ordinal, interval, or ratio).
The Type-of-Variable Figure
The figure below illustrates the types of variables that are typically being analyzed when researchers perform a chi-square test of independence.

     Criterion        Predictor
        Lmt       =      Lmt
The “Lmt” symbol that appears to the left of the equal sign in the above figure indicates that the criterion variable in a chi-square test of independence is typically a limited-value variable (a variable that assumes two to six values in your sample). The “Lmt” symbol that appears to the right of the equal sign indicates that the predictor variable in this procedure is also usually a limited-value variable. It should be noted that, in theory, either of the variables can actually assume any number of values. In practice, however, the number of values is usually relatively small, typically from two to six.
Example of a Study That Provides Appropriate Data for This Procedure
Overview. Suppose that you are a criminologist studying violent crime among male juvenile delinquents. You are interested in determining whether some types of juvenile offenders are more likely to use weapons (such as a gun) in the commission of crimes. Knowing this would enable officials in law enforcement to make better predictions about which types of delinquents are most likely to be dangerous.
The study. To conduct research of this nature, you need to choose a typology for classifying juvenile offenders. One such typology has been offered by Dicataldo and Grisso (1995). They have shown that young offenders can be classified into the following three categories:
• Immature juvenile offenders. Individuals in this category tend to be young, have problems in school, and be child-like and dependent.
• Socialized juvenile offenders. Individuals in this category tend to display better performance in school, display a sense of guilt, and be motivated to comply with the court.
• Mature delinquent juvenile offenders. Individuals in this category tend to be independent and adult-like, display little guilt, lack respect for the court, and have an extensive history of delinquency.
You wish to determine whether there is any relationship between (a) what type of delinquent a young person is (according to the above categories), and (b) whether that delinquent used a weapon in his or her most recent crime. You therefore review court records of 300 cases in which a juvenile committed a crime. For each case, you do the following:
• You direct a panel of experts to classify the young person into one of the three preceding categories.
• You note whether or not a weapon was used in the crime.
You then perform a chi-square test of independence to determine whether there is a significant relationship between (a) the “type of offender” category of the juvenile and (b) whether or not the juvenile used a weapon. If there is a significant relationship, you would then review the results in more detail to determine which type of offender is most likely to use a weapon.
Why these data would be appropriate for this procedure. The preceding study involved a single predictor variable and a single criterion variable. The predictor variable was “type of offender.” You know that this was a limited-value variable, because it assumed only three values: an immature type, a socialized type, and a mature delinquent type. This predictor variable was assessed on a nominal scale because it indicates group membership but does not convey any quantitative information. However, remember that the predictor variable used in a chi-square test of independence can be assessed on any scale of measurement.
The criterion variable in this study was whether or not the juvenile used a weapon in the most recent crime. You know that this is a dichotomous variable, because it has only two values: a weapon was used, versus a weapon was not used. This variable was assessed on a nominal scale, since it does not really convey any meaningful quantitative information.
Note: Although the study described here is fictitious, it is based on the actual study reported by Dicataldo and Grisso (1995).
Summary of Assumptions Underlying the Chi-Square Test of Independence
• Level of measurement. Both the predictor and criterion variables can be assessed on any scale of measurement (nominal, ordinal, interval, or ratio).
• Random sampling. Subjects contributing data should represent a random sample drawn from the population of interest.
• Independent cell entries. Each subject should appear in only one cell of the two-way classification table (the concepts of a “cell” and a “classification table” will be discussed in the next section). Among other things, this means that the chi-square test of independence should generally not be used with within-subject designs in which the same subject is exposed to more than one experimental condition. In addition, the fact that a particular subject appears in one cell should not affect the probability of another subject appearing in any other cell.
• Observed frequencies should not be zero. The chi-square test might not be valid if the observed frequency in any of the cells is zero. When this might be a problem, consider combining categories.
• Minimum expected cell frequencies. When analyzing a 2 × 2 classification table, no cell should display an expected frequency of less than 5. When this minimum is violated, consider computing Fisher’s exact test instead of chi-square. With larger tables (e.g., 3 × 4 tables), no more than 20% of the cells should have expected frequencies less than 5. When this minimum is violated, consider combining categories or, again, using Fisher’s exact test. Note that, although these minimums have long been advised by statistics textbooks, Monte Carlo studies suggest that they might be overly conservative, and that the probability of making Type I errors is not greatly increased even when these minimums are violated. For a concise review of these issues, see Spatz (2001, pages 293–294).
Using Two-Way Classification Tables
Overview
The chi-square test is typically performed to investigate the relationship between two dichotomous or limited-value variables. The nature of the relationship between two variables of this kind is easiest to understand if you first prepare a two-way classification table. This is a table in which the rows represent the categories (values) of one variable, while the columns represent the categories (values) of the second variable.
General Structure of a Two-Way Classification Table
For example, assume that you wish to prepare a table that plots one variable that contains two categories against a second variable that contains three categories. The general structure of such a table appears in Figure 17.1:
Figure 17.1. General structure for a two-way classification table.
In Figure 17.1, the rows run horizontally and the columns run vertically. The point at which a row and column intersect is called a cell, and each cell is given a unique subscript. The first digit in this subscript indicates the row to which the cell belongs, and the second digit indicates the column to which the cell belongs. So the basic format for cell subscripts is cell_rc, where r = row and c = column. This means that cell_21 is at the intersection of row 2 and column 1, cell_13 is at the intersection of row 1 and column 3, and so on. One of the first steps in performing a chi-square test of independence is to determine exactly how many subjects fall into each of the cells in the classification table (that is, how many subjects appear in each subgroup). The pattern shown by these subgroups will help you understand whether the two classification variables are related to one another.
Two-Way Classification Table for the Juvenile Offender Study
To make this example more explicit, Figure 17.2 presents a two-way classification table for the juvenile offender study described above. You can see that, in this figure, the “column” variable is “Type of Offender.” The first column represents the 100 juveniles classified as “Immature,” the second column represents the 100 juveniles classified as “Socialized,” and the third column represents the 100 juveniles classified as “Mature Delinquent.” This “Type of Offender” variable will serve as the predictor variable in your study.
Figure 17.2. Number of cases in which weapons were used versus not used, as a function of the type of juvenile offender.
The “row” variable in Figure 17.2 is “Use of Weapon,” which can assume two values: “Weapon Used” versus “Weapon Not Used.” This variable will serve as the criterion variable in your study. The numbers that appear in each of the cells in the figure (such as “n = 10”) are frequencies—the number of subjects that appear in each cell. It is these frequencies that you will analyze when you perform the chi-square test of independence.
Figure 17.2 is easiest to understand if you interpret it just one column at a time. For example, first consider the column headed “Immature.” This column consists of two cells. The top cell (in the row labeled “Weapon Used”) includes the entry “n = 10.” This means that there were only 10 juveniles who fell in this subcategory. The bottom cell (in the row labeled “Weapon Not Used”) includes the entry “n = 90,” which means that 90 juveniles fell into this subcategory. In summary, of the 100 juveniles in the immature category, 10 of them used weapons in their crimes, while 90 did not.
Now consider the second column in Figure 17.2: the column that represents the “socialized” offenders. There, you see exactly the same pattern that was seen with the immature juveniles. Of the 100 individuals in the socialized category, 10 used weapons and 90 did not.
Finally, review the third column, the column representing the “mature delinquent” offenders. You see a different pattern with this group. The top cell for this column shows that 60 of the 100 juveniles in this group used a weapon, and the bottom cell shows that 40 did not use a weapon. Taken together, these figures indicate that there appears to be a relationship between (a) type of offender and (b) whether or not a weapon was used in the crime. It appears that juveniles in the immature and socialized categories are unlikely to use a weapon, and juveniles in the mature delinquent category are much more likely to use a weapon.
Notice that the preceding paragraph said that there appears to be a relationship between the two variables. To determine whether there really is a significant relationship, you would perform a chi-square test of independence. The remainder of this chapter shows you how.
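As a preview of the programs presented later in this chapter (this particular program and its data set and variable names are illustrative assumptions, not code from the chapter), the frequencies in Figure 17.2 could be analyzed by entering the tabular data directly and identifying the count variable with a WEIGHT statement:
OPTIONS LS=80 PS=60;
DATA D1;
   INPUT OFFENDER $ WEAPON $ NUMBER;
DATALINES;
IMMATURE YES 10
IMMATURE NO  90
SOCIAL   YES 10
SOCIAL   NO  90
MATURE   YES 60
MATURE   NO  40
;
PROC FREQ DATA=D1;
   TABLES WEAPON*OFFENDER / CHISQ EXPECTED;
   WEIGHT NUMBER;
   TITLE1 'JANE DOE';
RUN;
The CHISQ option requests the chi-square statistic and its p value, and EXPECTED prints the expected cell frequencies discussed in the next section.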
Results Produced in a Chi-Square Test of Independence
Overview
When you perform a chi-square test of independence, you test a null hypothesis which states that there is no relationship between the predictor variable and the criterion variable in the population. If the chi-square statistic is significant, you reject this null hypothesis. In the analysis, you also review an index of effect size to determine the strength of the relationship between the two variables. Your choice of an index of effect size will depend on the number of rows and columns in your two-way classification table. This section discusses these issues in greater detail.
Test of the Null Hypothesis
Overview. This section describes the statistical null hypothesis that is tested in studies of the type described in this chapter. It discusses the circumstances under which it is appropriate to use the chi-square statistic to test this null hypothesis, as well as the circumstances under which it may be more appropriate to use Fisher’s exact test.
Statistical null and alternative hypotheses. For a chi-square test of independence, the null hypothesis states that there is no relationship between the predictor variable and the criterion variable in the population. As a concrete example, the statistical null hypothesis for the juvenile offender study could be stated in this way:
Statistical null hypothesis (H0): In the population, there is no relationship between the type of offender and the use of a weapon in a crime.
The statistical alternative hypothesis states that there is a relationship in the population. For the juvenile offender study, it could be stated in this way:
Statistical alternative hypothesis (H1): In the population, there is a relationship between the type of offender and the use of a weapon in a crime.
The chi-square statistic and p value. When you analyze your sample data with PROC FREQ (and request the appropriate options), SAS computes a chi-square statistic (symbolized as χ²) to test the statistical null hypothesis. If there is absolutely no relationship between the two variables in the sample, the obtained value of chi-square will be equal to zero. The stronger the relationship between the two variables in the sample, the larger the obtained chi-square statistic will be.
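For reference, here is the standard formula for this statistic (the formula itself is not given in this chapter; it is included only as a reminder of what PROC FREQ computes). Here O_rc and E_rc denote the observed and expected frequencies for the cell at row r and column c:

$$\chi^2 = \sum_{r}\sum_{c}\frac{\left(O_{rc}-E_{rc}\right)^{2}}{E_{rc}}, \qquad E_{rc} = \frac{(\text{total for row } r)\times(\text{total for column } c)}{N}$$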
SAS also computes a p value (probability value) associated with the chi-square statistic. If this p value is less than some standard criterion (alpha level), you will reject the null hypothesis of no relationship between the two variables. This book recommends that you use an alpha level of .05. This means that, if your obtained p value is less than .05, you will reject the statistical null hypothesis. In this case, you will state that you have statistically significant results, and will conclude that the two variables are probably related in the population.

Fisher’s exact test for 2 × 2 tables. A 2 × 2 table is a two-way classification table in which the predictor variable consists of just two categories and the criterion variable also consists of just two categories. When you are analyzing data from a 2 × 2 table, it is often appropriate to review the chi-square statistic and its p value, as described in the previous section. This is particularly the case if your sample size is relatively large.

There are some instances, however, in which it may be more appropriate to consult a different statistic when you are analyzing data from a 2 × 2 table. The other statistic is called Fisher’s exact test. Fisher’s exact test may be more desirable than the chi-square statistic if you are analyzing data from a 2 × 2 table and your sample is relatively small. One way of determining whether your sample is relatively small is by reviewing the expected frequencies in each cell of your two-way classification table. The expected frequencies are the frequencies that would be expected in a cell if there were no relationship between the predictor variable and the criterion variable. Some statistics textbooks recommend against using the chi-square test for a 2 × 2 table if the expected frequency in any cell is less than 5. In those instances, it might be preferable to consult Fisher’s exact test.

A later section of this chapter shows how to use PROC FREQ to perform a chi-square test of independence. There, you will see how to use the keyword EXPECTED in the TABLES statement to request expected frequencies, and how to use the keyword FISHER in the TABLES statement to request Fisher’s exact test. For a conceptual introduction to Fisher’s exact test, see Hays (1988), pages 781–783. For a detailed guide on how to compute and interpret Fisher’s exact test with SAS, see the SAS/STAT User’s Guide, Chapter 28, “The FREQ Procedure,” the section titled “Example 28.4. Analyzing a 2 × 2 Contingency Table,” pages 1342–1345.

Be warned that requesting Fisher’s exact test with PROC FREQ may require a great deal of computer time and/or memory. This is especially likely to be the case if your sample size is large or if you analyze a table that is larger than a 2 × 2 table.
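As a point of reference, the expected frequency for a cell is computed from the marginal totals of the classification table: the row total, times the column total, divided by the total sample size. For instance, in the juvenile offender table of Figure 17.2, the “Weapon Used” row contains 10 + 10 + 60 = 80 subjects and each column contains 100 subjects out of 300 in total, so the expected frequency for the Weapon Used/Immature cell would be (80 × 100) / 300, or approximately 26.7. A sketch of a TABLES statement that requests both the expected frequencies and Fisher’s exact test, reusing the hypothetical JUVENILE data set and variable names from the earlier sketch, might look like this:

PROC FREQ DATA=JUVENILE;
   TABLES WEAPON*TYPE / CHISQ EXPECTED FISHER;   * EXPECTED and FISHER options described above;
   WEIGHT NUMBER;
RUN;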
Effect Size

Overview. With most of the inferential statistical procedures covered in this book, you have learned to report an index of effect size. The way that effect size has been defined has varied, depending upon the statistic. With the chi-square test of independence, it is useful to use a measure of association as an index of effect size.
These measures of association are essentially correlation coefficients, similar in some ways to the Pearson r correlation coefficient that you learned about in Chapter 10, “Bivariate Correlation.” This book recommends that you use either the phi coefficient or Cramer’s V as your measure of effect size, depending upon the size of your two-way classification table. Both of these statistics can be computed by PROC FREQ.

For 2 × 2 tables. As was mentioned in an earlier section, a 2 × 2 classification table is one that involves a predictor variable consisting of two categories, and a criterion variable that also consists of two categories. When analyzing data from a table such as this, the phi coefficient (symbolized as φ) is a desirable measure of association, or effect size. The phi coefficient is a correlation coefficient, and in general can be interpreted in the same way as the Pearson correlation coefficient, r. For example, phi can range from approximately –1.00 through zero to approximately +1.00. Values of phi that are closer to zero indicate a weaker relationship between the predictor and criterion variables; values that are closer to –1.00 or +1.00 indicate a stronger relationship (for guidelines on interpreting Pearson’s r, see Chapter 10, “Bivariate Correlation,” the section titled “Interpreting the Sign and Size of a Correlation Coefficient”). There are two important caveats related to the interpretation of phi, however:

• Although the range of Pearson’s r extends from exactly –1.00 to exactly +1.00, phi ranges only from approximately –1.00 to approximately +1.00. This means that, under some circumstances, the absolute value of phi may be somewhat less than 1.00 even when there is a perfect relationship between the two variables.

• Although phi may be positive or negative, the sign of the coefficient is meaningful only if both the predictor and criterion variables are ordered in some meaningful way. This means that, for the sign to be meaningful, both variables must be assessed on an ordinal, interval, or ratio scale (see Hays, 1988, pages 785–786).
For tables larger than 2 × 2. A table is larger than 2 × 2 if you have more than two categories for the predictor variable, or more than two categories for the criterion variable, or both. For example, the juvenile offender classification table presented in Figure 17.2 (earlier) is larger than 2 × 2 because it has two categories under “Use of Weapon,” but three categories under “Type of Offender.” It is a 2 × 3 table. For these larger tables, a desirable measure of association is Cramer’s V (symbolized as V, and occasionally as φc or φ'). Cramer’s V is also a type of correlation coefficient. Its values may range from zero (indicating no relationship between the two variables) to +1.00 (indicating a perfect relationship).

Summary. For 2 × 2 classification tables, use the phi coefficient as an index of effect size. For tables that are larger than 2 × 2, use Cramer’s V.
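For reference, both coefficients can be computed directly from the chi-square statistic. The following standard definitions are not given in the text, but they are consistent with the values that PROC FREQ reports (here N is the total sample size, and r and c are the numbers of rows and columns in the table):

φ = √(χ2 / N)

V = √(χ2 / [N × min(r – 1, c – 1)])

Note that for a 2 × 2 table, min(r – 1, c – 1) = 1, so V equals the absolute value of φ. The same is true of a 2 × 3 table, since min(1, 2) = 1, which is why the output shown later in this chapter reports identical values for the phi coefficient and Cramer’s V.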
A Study Investigating Computer Preferences

Overview

Most of the remainder of this chapter illustrates how to use SAS to perform a chi-square test of independence. To do this, it describes a fictitious investigation in which you are studying a sample of college students and are trying to determine whether there is a significant relationship between the students’ school of enrollment and the type of computer that they prefer. This section describes the study in greater detail.

Research Method

Research question. Assume that you are a university administrator preparing to purchase a large number of new personal computers for three of the “schools” that constitute your university: the School of Arts and Sciences, the School of Education, and the School of Business. For a given school, you may purchase either IBM-compatible computers (computers that use the Windows environment) or Macintosh computers. You now need to know which type of computer tends to be preferred by students within each school.

In general terms, your research question is “Is there a relationship between the following two variables: (a) school of enrollment, and (b) computer preference?” You suspect that there probably is such a relationship. Before conducting your study, you have a hunch that students in the School of Arts and Sciences and the School of Education will prefer Macintosh computers, while students in the School of Business will prefer IBM-compatible computers. The chi-square test of independence will help you determine whether this is true. If the test shows that there is a relationship between school of enrollment and computer preference, you will then review the two-way classification table to see which type of computer is preferred by most students in the School of Arts and Sciences, which type is preferred by most students in the School of Education, and so on.

Measuring the predictor and criterion variables. To investigate your research question, you draw a random, representative sample of 370 students from the 8,000 students that constitute the three schools. Each student is given a short questionnaire that asks just two questions:

1. In which school are you enrolled? (circle one):
   a. School of Arts and Sciences
   b. School of Business
   c. School of Education

2. Which type of computer do you prefer that we purchase for your school? (circle one):
   a. IBM-compatible
   b. Macintosh
The two questions above constitute the two nominal-level variables for your study: Question 1 allows you to create a “school of enrollment” variable that can assume one of three values (Arts & Sciences versus Business versus Education), while Question 2 allows you to create a “computer preference” variable that may assume one of two values (IBM-compatible versus Macintosh). Notice that these are both limited-value variables, which is appropriate for the chi-square test of independence.

Two-way classification table. After you have gathered completed questionnaires from the 370 students, you prepare a two-way classification table that plots computer preference against school of enrollment. This table (with fictitious data) is presented here as Figure 17.3. Notice that computer preference is the row variable; row 1 represents students who preferred IBM compatibles, and row 2 represents students who preferred Macintosh. In the same way, you can see that school of enrollment is the column variable: Column 1 represents students from the School of Arts and Sciences, column 2 represents students from the School of Business, and column 3 represents students from the School of Education. This is a 2 × 3 classification table.
Figure 17.3. Preference for IBM-compatible computers versus Macintosh computers as a function of school of enrollment (significant results).
Figure 17.3 reveals the number of students who appear in each cell of the classification table. For example, the first row of the table shows that, among those students who preferred IBM compatibles, 30 were Arts and Sciences students, 100 were Business students, and 20 were Education majors. Remember that the purpose of the study was to determine whether there is any relationship between the two variables: to determine whether school of enrollment is related to computer preference. This is just another way of saying, “If you know what school a student is enrolled in, does that help you predict what type of computer that student is likely to prefer?” In the present case, the answer to this question is easiest to find if you review the table one column at a time. For example, if you look at the Arts and Sciences column of the table, you can see that only a minority of these students (n = 30) preferred IBM-compatible computers, while a larger
number of students (n = 60) preferred Macintosh computers. The column for the Business students shows the opposite trend, however: most business students (n = 100) preferred IBM compatibles, while fewer (n = 40) preferred Macintosh computers. Finally, the pattern for the Education students was similar to that of the Arts and Sciences students: a minority (n = 20) preferred IBM compatibles, while a majority (n = 120) preferred Macintosh.

In short, there appears to be a relationship between school of enrollment and computer preference, with Business students preferring IBM compatibles, and Arts and Sciences and Education students preferring Macintoshes. But this is just a trend that you observed in the sample. Is this trend strong enough to allow you to reject the null hypothesis that there is no relationship in the population of students? To determine this, you must conduct the chi-square test of independence.
Computing Chi-Square from Raw Data versus Tabular Data

Overview

The way that you will write a SAS program to perform the chi-square test will differ somewhat depending on whether you are working with raw data or with tabular data. This section explains the difference between these two formats.

Raw Data

Raw data are data that have not been summarized or tabulated in any way. For example, suppose that you have administered your questionnaire to 370 students, and you have not yet tabulated their responses: you merely have 370 completed questionnaires. So you enter the questionnaire responses, one subject at a time. In your SAS data set, line 1 contains responses from Subject #1, line 2 contains responses from Subject #2, and so on. In this situation, you are working with raw data.

Tabular Data

On the other hand, tabular data are data that have already been summarized in a table. For example, suppose that it was actually another researcher who administered this questionnaire and then summarized subject responses in a two-way classification table similar to Figure 17.3. This two-way classification table tells you how many people were in cell 1,1 (that is, how many people were in Arts and Sciences and also preferred IBM compatibles), how many were in cell 1,2 (that is, how many people were in the School of Business and also preferred IBM compatibles), and so on. In this case you are dealing with tabular data.
The Approach Used Here

In computing the chi-square statistic, there is no real advantage to using one form of data rather than the other, although you will generally have a lot less data to enter if your data are already in tabular form. The following section shows how to input the data and request the chi-square statistic for tabular data; a section appearing later in the chapter shows you how to perform the same analysis using raw data.
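To make the contrast concrete, here is a minimal sketch of what a raw-data program for this study might look like, with one data line per student rather than one line per cell. The variable names match the ones used below, but the five data lines shown are invented placeholders; a real data set would contain 370 such lines. Note that no WEIGHT statement is needed with raw data, because PROC FREQ simply counts the observations.

DATA D2;
   INPUT PREF $ SCHOOL $;   * one data line per student, not per cell;
DATALINES;
IBM ARTS
MAC ARTS
MAC ED
IBM BUS
MAC BUS
;
PROC FREQ DATA=D2;
   TABLES PREF*SCHOOL / ALL;   * no WEIGHT statement with raw data;
RUN;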
Example of a Chi-Square Test That Reveals a Significant Relationship

Overview

You will use PROC FREQ to perform the chi-square test of independence. You first learned about PROC FREQ in Chapter 5, “Creating Frequency Tables.” There, you used the FREQ procedure to create a one-way table that listed values of a single variable. Here, you will use PROC FREQ to create a two-way classification table in which your predictor variable is crosstabulated with the criterion variable. This type of table is sometimes called a crosstabulation table.

This section shows you how to prepare the data set, write the SAS program, interpret the output, summarize the results in a bar chart, and prepare an analysis report. The current section also shows you how to analyze tabular data: data that have already been organized into a table.

Choosing SAS Variable Names and Values to Use in the Analysis

Before you write a SAS program to perform the chi-square test, it is helpful to first prepare a figure similar to Figure 17.4. The purpose of this figure is to help you choose meaningful SAS variable names for the predictor and criterion variables, as well as meaningful values to represent the different categories under the predictor and criterion variables. If you carefully choose meaningful variable names and values at this point, you will find it easier to interpret your SAS output later.
Figure 17.4. Variable names and values to be used in the SAS program for the computer preference study.
SAS variable name and values for the predictor variable. You can see that Figure 17.4 is very similar to Figure 17.3 presented earlier, except that shorter, more concise labels are used in Figure 17.4. These shorter labels will serve as the SAS variable names and values to be used in the SAS program. For example, the label “School of Enrollment” from Figure 17.3 has been replaced with the SAS variable name “SCHOOL” in Figure 17.4. Obviously, you may choose any SAS variable name you like, as long as it is meaningful and complies with the rules for SAS variable names. Each column in Figure 17.4 is headed with the value that will be used to code that category in the SAS program. The figure shows that

• the value “ARTS” will be used to represent the School of Arts and Sciences

• the value “BUS” will be used to represent the School of Business

• the value “ED” will be used to represent the School of Education.

SAS variable name and values for the criterion variable. You can see that the label “Computer Preference” from Figure 17.3 has been replaced with the SAS variable name “PREF” in Figure 17.4. This will be the SAS variable name to represent the criterion variable in the analysis. Each row in Figure 17.4 is labeled with the value that will be used to code that category in the SAS program:

• The value “IBM” will be used to represent students who prefer IBM-compatibles.

• The value “MAC” will be used to represent students who prefer Macintosh computers.
Data Set to Be Analyzed

Table 17.1 presents the data set that you will analyze.

Table 17.1
Tabular Data Set for the Computer Preferences Study
(Data Will Produce a Significant Chi-Square Statistic)
_____________________________

Preference    School    Number
_____________________________

IBM           ARTS          30
IBM           BUS          100
IBM           ED            20
MAC           ARTS          60
MAC           BUS           40
MAC           ED           120
_____________________________
Understanding the columns of Table 17.1. The first two columns of Table 17.1 represent the two variables that you will analyze in your SAS program. The first column is headed “Preference,” and this column indicates whether a particular group preferred IBM compatibles versus Macintosh computers. The second column is headed “School,” and this column indicates the school in which a particular group is enrolled (Arts and Sciences versus Business versus Education). The third column of Table 17.1 is headed “Number,” and this column simply indicates the number of students who were in a particular subgroup, or cell.

Understanding the rows of Table 17.1. Each row in Table 17.1 represents one of the six cells from Figure 17.4. For example, the first row is coded “IBM” under “Preference,” and “ARTS” under “School.” This row therefore represents the subgroup of students who (a) preferred IBM compatibles, and (b) were in the School of Arts and Sciences. For this row, the value “30” appears under “Number,” meaning that there were 30 students in this subgroup. This subgroup also appears in Figure 17.4, in the cell where the row labeled “IBM” intersects with the column headed “ARTS.” The “n = 30” in that cell also indicates that there were 30 students in that subgroup.

The second row of Table 17.1 is coded “IBM” under “Preference,” and “BUS” under “School.” This row therefore represents the subgroup of students who (a) preferred IBM compatibles, and (b) were in the School of Business. For this row, the value “100” appears under “Number,” meaning that there were 100 students in this subgroup. You can see that this row corresponds to the cell in Figure 17.4 where “IBM” intersects with “BUS.” In the same way, you can see that the six rows in Table 17.1 correspond to the six cells of Figure 17.4. After you have tabulated your data in a two-way classification table such as Figure 17.4, it is fairly simple to convert this information into a data set (such as Table 17.1) that can be analyzed with PROC FREQ.
Writing the SAS Program

The SAS DATA step. When the data for a chi-square test of independence are in tabular form (as they are in Table 17.1), it is necessary to write a special type of INPUT statement to read the data. Here is the syntax:

DATA data-set-name;
   INPUT row-variable-name $
         column-variable-name $
         number-variable-name ;
DATALINES;
row-value column-value number-in-cell
row-value column-value number-in-cell
[Additional data lines would go here]
row-value column-value number-in-cell
row-value column-value number-in-cell
;

The INPUT statement in this program tells SAS that the data set includes three variables, and the names of these three variables are symbolized as “row-variable-name,” “column-variable-name,” and “number-variable-name.” The first variable is a character variable that codes the rows of the classification table (in the present study, the “row variable” was “computer preference”). The second variable is a character variable that codes the columns of the table (here, the “column variable” was “school of enrollment”). Finally, the third variable (symbolized as “number-variable-name”) is a quantitative variable that codes how many subjects appear in a cell. (Specific names will be given to these variables in the program to be presented shortly.)

Each line of data in the DATALINES section corresponds to one of the cells in the two-way classification table that is presented in Figure 17.4. This classification table included six cells, so there will be six data lines in the DATALINES section for the current program. Below is the actual DATA step for inputting the tabular data presented in Figure 17.4 and Table 17.1 (line numbers have been added on the left):

1    OPTIONS LS=80 PS=60;
2    DATA D1;
3       INPUT PREF $
4             SCHOOL $
5             NUMBER ;
6    DATALINES;
7    IBM ARTS 30
8    IBM BUS 100
9    IBM ED 20
10   MAC ARTS 60
11   MAC BUS 40
12   MAC ED 120
13   ;
The DATA statement on line 2 of the preceding program tells SAS to create a new data set and name it “D1.” The INPUT statement on lines 3–5 indicates that the data set contains three variables. The first variable is a character variable named PREF (coding the row variable), the second is a character variable named SCHOOL (coding the column variable), and the third variable is a numeric variable called NUMBER (indicating how many students appear in a cell).

The DATALINES portion of the preceding program includes six lines of data, one for each cell. The first cell represents those students who (a) preferred IBM-compatibles, and (b) were in the School of Arts and Sciences. The value for NUMBER on this line shows that there were 30 subjects in this cell. The remaining data lines may be interpreted in the same fashion. You can see that these lines of data were taken directly from Table 17.1.

The PROC step. Below is the syntax for the PROC step that will create a two-way classification table when data have been input in tabular form. The options used with these statements (to be described below) allow you to request a chi-square test of independence, along with additional information.

PROC FREQ   DATA=data-set-name;
   TABLES   row-variable-name*column-variable-name  /  options ;
   WEIGHT   number-variable-name;
RUN;

Substituting the appropriate SAS variable names into this syntax results in the following (line numbers have been added on the left):

1    PROC FREQ   DATA=D1;
2       TABLES   PREF*SCHOOL / ALL;
3       WEIGHT   NUMBER;
4       TITLE1   'JANE DOE';
5    RUN;
The PROC FREQ statement on line 1 requests that the FREQ procedure be performed on data set D1. The TABLES statement on line 2 specifies PREF as the row variable and SCHOOL as the column variable in the two-way classification table that is created by PROC FREQ. This request is followed by a slash, the keyword ALL for the “all statistics” option (to be described below), and a semicolon. The WEIGHT statement on line 3 provides the name of the variable that codes the number of subjects in each cell. In this case, the variable named “NUMBER” is specified. This part of the program then ends with the usual TITLE1 and RUN statements.

Some options available with PROC FREQ. When PROC FREQ is used to create and analyze a two-way classification table, SAS enables you to request a variety of options in the TABLES statement. To request these options, you begin the TABLES statement with the
word “TABLES,” followed by the names of the row variable and column variable (connected by an asterisk), followed by a slash (“/”), and then by the keywords for the options that you want. For a complete list of options, see the SAS/STAT User’s Guide, Chapter 28, “The FREQ Procedure.” Here are some of the options that may be especially useful for research in the social sciences and education:

ALL
   Requests several significance tests (including the chi-square test of independence) and several measures of bivariate association. Although many statistics are printed, only a few will be appropriate for a particular analysis. The choice of the correct statistic will depend upon the level of measurement that is used with the variables, the size of the two-way classification table, and other considerations.

CHISQ
   Requests the chi-square test of independence, and prints a number of measures of bivariate association based on chi-square. It is not necessary to list this option if you have already listed the ALL option.

FISHER
   Prints Fisher’s exact test. This test is printed automatically for 2 × 2 tables (if the ALL or CHISQ options are specified), but must be specifically requested for larger tables. Be warned that, if your table has a large number of rows or columns, or if your sample size is large, Fisher’s exact test may require a large amount of computer time or memory.

EXPECTED
   Prints the expected cell frequencies; that is, the cell frequencies that are expected if the two variables are independent (unrelated). You should request this option if you suspect that the expected frequency in any of your cells might be below the minimum described in the previous section titled “Summary of Assumptions Underlying the Chi-Square Test of Independence.” This is also a useful option for better understanding the nature of the relationship between the two variables that you are studying.

MEASURES
   Requests several measures of bivariate association, along with their asymptotic standard errors (ASE). These include the Pearson and Spearman correlation coefficients, gamma, Kendall’s tau-b, Stuart’s tau-c, symmetric lambda, asymmetric lambda, uncertainty coefficients, and other measures. Again, only a few of these indices will be appropriate for a particular study. All of these measures are printed if the ALL option is requested.
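As a brief illustration of combining these keywords (a sketch using the variable names from the current study; the particular combination shown is a matter of choice, not a requirement), a TABLES statement that requests the chi-square test, the expected cell frequencies, and Fisher’s exact test would look like this:

PROC FREQ DATA=D1;
   TABLES PREF*SCHOOL / CHISQ EXPECTED FISHER;
   WEIGHT NUMBER;
RUN;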
The complete SAS program. Below is a complete SAS program that will (a) read tabular data, (b) create a two-way classification table, and (c) print the statistics requested by the ALL option (including the chi-square test of independence):

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT PREF $
         SCHOOL $
         NUMBER ;
DATALINES;
IBM ARTS 30
IBM BUS 100
IBM ED 20
MAC ARTS 60
MAC BUS 40
MAC ED 120
;
PROC FREQ   DATA=D1;
   TABLES   PREF*SCHOOL / ALL;
   WEIGHT   NUMBER;
   TITLE1   'JANE DOE';
RUN;
Output Produced by the SAS Program

The preceding program produces two pages of output. Page 1 includes a two-way classification table in which PREF is the row variable and SCHOOL is the column variable (this table will be similar to Figure 17.4). Page 1 also includes the chi-square test of independence, the phi coefficient, Cramer’s V coefficient, and a few other statistics. Page 2 presents additional statistics requested by the ALL option in the TABLES statement. The following sections show you how to interpret those parts of the output that are most relevant to your research question.

Steps in Interpreting the Output

1. Verify that the SAS variable names and values are correct. You should begin by verifying that there were no obvious errors in typing your data or in writing the SAS program. The two-way classification table created by the FREQ procedure contains information that can help to identify possible errors. The two-way classification table produced by PROC FREQ appears here as Output 17.1.
JANE DOE                                                        1

                       The FREQ Procedure

                     Table of PREF by SCHOOL

     PREF n    SCHOOL o

     Frequency|
     Percent  |
     Row Pct  |
     Col Pct  |ARTS r  |BUS s   |ED t    |  Total
     ---------+--------+--------+--------+
     IBM p    |     30 |    100 |     20 |    150
              |   8.11 |  27.03 |   5.41 |  40.54
              |  20.00 |  66.67 |  13.33 |
              |  33.33 |  71.43 |  14.29 |
     ---------+--------+--------+--------+
     MAC q    |     60 |     40 |    120 |    220
              |  16.22 |  10.81 |  32.43 |  59.46
              |  27.27 |  18.18 |  54.55 |
              |  66.67 |  28.57 |  85.71 |
     ---------+--------+--------+--------+
     Total          90      140      140      370
                 24.32    37.84    37.84   100.00
Output 17.1. Two-way classification table, requested by PROC FREQ, for the computer preferences study (significant results).
In the classification table reproduced in Output 17.1, the name of the row variable (PREF) appears in the upper left corner (n). The first row (labeled IBM) represents the subjects who preferred IBM-compatibles (p), and the second row (labeled MAC) represents subjects who preferred Macintoshes (q). The name of the column variable (SCHOOL) appears above the three columns (o), and each column in turn is headed with its label: Column 1 is headed ARTS (r) and represents the Arts and Sciences students, column 2 is headed BUS (s) and represents the Business students, and column 3 is headed ED (t) and represents the Education students.

Your first task should be to review these SAS variable names and values. Verify that the correct values (categories) are listed under the predictor and criterion variables. In the present case, there is no evidence of any problems.

2. Verify that the cell frequencies are correct. Next, you will check each cell to verify that it contains the correct frequency. To illustrate this, the two-way classification table from the current analysis is reproduced again below as Output 17.2.
JANE DOE                                                        1

                       The FREQ Procedure

                     Table of PREF by SCHOOL

     PREF      SCHOOL

     Frequency|
     Percent  |
     Row Pct  |
     Col Pct  |ARTS    |BUS     |ED      |  Total
     ---------+--------+--------+--------+
     IBM      |   30 n |  100 o |   20 p |    150
              |   8.11 |  27.03 |   5.41 |  40.54
              |  20.00 |  66.67 |  13.33 |
              |  33.33 |  71.43 |  14.29 |
     ---------+--------+--------+--------+
     MAC      |   60 q |   40 r |  120 s |    220
              |  16.22 |  10.81 |  32.43 |  59.46
              |  27.27 |  18.18 |  54.55 |
              |  66.67 |  28.57 |  85.71 |
     ---------+--------+--------+--------+
     Total          90      140      140      370 t
                 24.32    37.84    37.84   100.00
Output 17.2. Verifying that cell frequencies are correct for the computer preferences study (significant results).
As mentioned earlier, a cell is a location in a two-way table at which a row for the criterion variable intersects with a column for the predictor variable. For example, in Output 17.2, look at the cell at the intersection of the row labeled “IBM” and the column labeled “ARTS.” Four numbers appear in this cell:

30
8.11
20.00
33.33

A later section will describe what these four numbers represent, but for now we will focus on just the top number: 30. The top number in each cell should be the frequency for that cell. In the current study, the top number for each cell should be the number of people who appear in each subgroup. You should always verify that the top number in each cell is correct.

For example, in Output 17.2, the cell where the row labeled “IBM” intersects with the column labeled “ARTS” indicates a frequency of 30 (n). This means that there were 30 subjects who (a) preferred IBM-compatible computers, and (b) were in the School of Arts and Sciences. If you review Figure 17.4 (presented earlier), you can see that this figure of 30 is correct. In the same way, Output 17.2 shows that there were 100 students in the IBM-BUS cell (o), 20 students in the IBM-ED cell (p), 60 students in the MAC-ARTS cell (q), 40 students in the MAC-BUS cell (r), and 120 students in the MAC-ED cell (s). Each of these numbers matches the cell frequencies presented in Figure 17.4.
The frequency figure in the lower right corner of this page of output provides the total sample size for the analysis. Here, you can see that the total sample size is 370 (t). So far, there do not appear to be any obvious errors in preparing your SAS program.

3. Review the supplementary information in each cell. The preceding section indicated that four numbers are presented in each cell of a two-way classification table created by PROC FREQ. For example, here again are the numbers that appear in the cell at the intersection of “IBM” and “ARTS”:

30      n
8.11    o
20.00   p
33.33   q

Here is a description of what each figure represents:

n The top number in each cell is the “Frequency”; that is, the raw number of subjects in the cell. The top number in the IBM-ARTS cell is 30, as was discussed above.

o The second number in each cell is the “Percent”; that is, the percent of subjects in that cell relative to the total number of subjects (the number of subjects in the cell divided by the total number of subjects). For example, there were 30 subjects in the IBM-ARTS cell, and a total of 370 subjects in the study. Therefore, the cell percent is 30 / 370 = 8.11%.

p The third number is the “Row Pct”; that is, the percent of subjects in that cell, relative to the number of subjects in that row. For example, there are 30 subjects in the IBM-ARTS cell, and 150 subjects in the IBM row. Therefore, the row percent for this cell is 30 / 150 = 20%.

q The bottom number in each cell is the “Col Pct”; that is, the percent of subjects in that cell, relative to the number of subjects in that column. For example, there are 30 subjects in the IBM-ARTS cell, and 90 subjects in the ARTS column. Therefore, the column percent for this cell is 30 / 90 = 33.33%.

In the upper left corner of Output 17.2, you can see that PROC FREQ has provided a key to help you remember what the four values in each cell represent. That key from the classification table in Output 17.2 is reproduced here:

Frequency
Percent
Row Pct
Col Pct

4. Review the column percent figures. In the present study, it is particularly revealing to review the classification table one column at a time, and to pay particular attention to the last entry in each cell: the “column percent.” First, consider the ARTS column in Output 17.2. The column percent entries show that only 33.33% of the Arts and Sciences students preferred IBM-compatible computers, while 66.67% preferred Macintosh computers. Next, consider the BUS column, which shows the reverse trend: 71.43% of the Business students
preferred IBM compatibles, while only 28.57% preferred Macintoshes. Finally, the trend of the Education students in the ED column is similar to that for the Arts and Sciences students: only 14.29% preferred IBM compatibles, while 85.71% preferred Macintoshes.

These percentages reinforce the hypothesis that there may be a relationship between school of enrollment and computer preference. On the surface, it seems that students in the School of Arts and Sciences and in the School of Education tend to prefer Macintoshes over IBM-compatibles, while students in the School of Business tend to prefer IBM compatibles over Macintoshes. But the real question is whether this relationship is statistically significant. To answer this, you must consult the chi-square test of independence.

5. Review the chi-square test of independence. The statistical null hypothesis for the current study may be stated as follows:

Statistical null hypothesis (H0): In the study population, there is no relationship between school of enrollment and computer preference.

The chi-square test of independence tests this null hypothesis. The results of the chi-square test appear in a statistics table at the bottom of Page 1 of the PROC FREQ output. This table is reproduced here as Output 17.3.

          Statistics for Table of PREF by SCHOOL n

Statistic o                     DF p      Value q     Prob r
------------------------------------------------------
Chi-Square s                     2 t    97.3853 u   <.0001 v
Likelihood Ratio Chi-Square      2     102.6849     <.0001
Mantel-Haenszel Chi-Square       1      16.9812     <.0001
Phi Coefficient                          0.5130
Contingency Coefficient                  0.4565
Cramer's V                               0.5130

Output 17.3. Chi-square test of independence, requested with PROC FREQ, for the computer preferences study (significant results).
A heading at the top of the table in Output 17.3 identifies the two variables that are being investigated (n). In the present case, you can see that the two variables are PREF and SCHOOL.

The left side of the table in Output 17.3 is headed “Statistic” (o). Below this heading are the names of the different statistics appearing in the table. Information that is related to the chi-square test of independence appears as the first row of this table, to the right of the heading “Chi-Square” (s).

The third column in Output 17.3 is headed “Value” (q). Below this heading are the actual values for the various statistics that are computed by PROC FREQ. The obtained value of the chi-square statistic appears where the column “Value” intersects with the row “Chi-Square.” Output 17.3 shows that the obtained value of chi-square for the current analysis is 97.3853 (u), which rounds to 97.385. As stated above, this statistic tests the null hypothesis
that, in the population, the two variables are independent, or unrelated. When the null hypothesis is true, we expect the value of chi-square to be relatively small. The stronger the relationship between the two variables in the sample, the larger the chi-square value will be. To determine whether the current obtained value of 97.385 is statistically significant, you must review the p value (probability value) that is associated with this statistic.

The last column in Output 17.3 is headed “Prob” (r). This column reports p values for some of the statistics in the table. At the location where the row headed “Chi-Square” intersects with the column headed “Prob,” you can see that the p value for your chi-square statistic is “<.0001” (v). This p value is less than our standard alpha criterion of .05, which means that your chi-square test of independence is statistically significant. You can therefore reject the statistical null hypothesis, and tentatively conclude that there is a relationship between school of enrollment and computer preference in the population. You have already identified the nature of this relationship in the earlier section titled “4. Review the column percent figures.” There, you determined that students in the School of Arts and Sciences and the School of Education tend to prefer Macintosh computers over IBM compatibles, while students in the School of Business tend to prefer IBM-compatible computers over Macintoshes.

The second column in Output 17.3 is headed “DF” (p). This column provides the degrees of freedom for some of the statistics provided in the table. At the location where the column headed “DF” intersects with the row headed “Chi-Square,” you can see that there are 2 degrees of freedom for your chi-square statistic (t). The degrees of freedom for the chi-square test of independence are calculated as

df = (r - 1)(c - 1)

where

r = number of categories for the row variable
c = number of categories for the column variable.

For the current analysis, the row variable (PREF) had two categories, and the column variable (SCHOOL) had three categories, so the degrees of freedom are calculated as

df = (2 - 1)(3 - 1) = (1)(2) = 2
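If you would like to verify these figures yourself, the short DATA step below is a sketch of how you might reproduce them with SAS’s PROBCHI function. The numbers plugged in (chi-square = 97.3853, N = 370, a 2 × 3 table) come from the current analysis, and the last computed line anticipates the Cramer’s V statistic discussed in the next section.

DATA _NULL_;
   CHISQ = 97.3853;                 * obtained chi-square statistic;
   N = 370;                         * total sample size;
   R = 2;                           * categories of the row variable (PREF);
   C = 3;                           * categories of the column variable (SCHOOL);
   DF = (R - 1) * (C - 1);          * degrees of freedom, here 2;
   P = 1 - PROBCHI(CHISQ, DF);      * p value, here less than .0001;
   V = SQRT(CHISQ / (N * MIN(R - 1, C - 1)));   * Cramer V, here .513;
   PUT DF= P= V=;                   * write the results to the SAS log;
RUN;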
6. Review the index of effect size. An earlier section indicated that, when you perform a chi-square test of independence, you should generally use either the phi coefficient or Cramer’s V as your index of effect size (your measure of association). It indicated that you should use the phi coefficient for 2 × 2 tables, and Cramer’s V for larger tables. The classification table for the current analysis is a 2 × 3 table, because it had two categories under the row variable (IBM compatibles versus Macintosh) and three categories under the column variable (Arts and Sciences versus Business versus Education). Therefore, you will report Cramer’s V as your index of effect size. For convenience, the statistics table from your analysis is reproduced here as Output 17.4.

          Statistics for Table of PREF by SCHOOL

Statistic                       DF      Value n     Prob
------------------------------------------------------
Chi-Square                       2     97.3853     <.0001
Likelihood Ratio Chi-Square      2    102.6849     <.0001
Mantel-Haenszel Chi-Square       1     16.9812     <.0001
Phi Coefficient o                       0.5130
Contingency Coefficient                 0.4565
Cramer's V p                            0.5130 q

Output 17.4. Index of effect size, requested with PROC FREQ, for the computer preferences study (significant results).
Cramer’s V coefficient appears in the location where the row labeled “Cramer’s V” (p) intersects with the column headed “Value” (n). In Output 17.4, you can see that the obtained value of Cramer’s V for the current analysis is 0.5130 (q), which rounds to .51. You will report this as your index of effect size for the current analysis. Remember that values of Cramer’s V may range from zero to +1.00, with larger values indicating a stronger relationship.

If your classification table had been a 2 × 2 table instead of a 2 × 3 table, you would have reported the phi coefficient instead of Cramer’s V. In Output 17.4, this statistic appears to the right of the label “Phi Coefficient” (o).

Using a Graph to Illustrate the Results

Overview. The results of a chi-square test of independence are easiest to understand when they are represented in a graph that plots the frequencies for each of the cells in the study’s two-way classification table. The cell frequencies for the computer preferences study were presented in the two-way classification table of Output 17.2, presented earlier. For your convenience, that classification table is reproduced here as Output 17.5.
JANE DOE                                                        1

                       The FREQ Procedure

                     Table of PREF by SCHOOL

     PREF      SCHOOL

     Frequency|
     Percent  |
     Row Pct  |
     Col Pct  |ARTS    |BUS     |ED      |  Total
     ---------+--------+--------+--------+
     IBM      |   30 n |  100 o |   20 p |    150
              |   8.11 |  27.03 |   5.41 |  40.54
              |  20.00 |  66.67 |  13.33 |
              |  33.33 |  71.43 |  14.29 |
     ---------+--------+--------+--------+
     MAC      |   60 q |   40 r |  120 s |    220
              |  16.22 |  10.81 |  32.43 |  59.46
              |  27.27 |  18.18 |  54.55 |
              |  66.67 |  28.57 |  85.71 |
     ---------+--------+--------+--------+
     Total          90      140      140      370
                 24.32    37.84    37.84   100.00
Output 17.5. Cell frequencies needed to prepare bar chart; computer preferences study, significant results.
Remember that the top number in each cell represents the number of people in that subgroup (in Output 17.5, these frequencies are identified with the markers n, o, and so on). Figure 17.5 presents a bar chart that displays the cell frequencies of Output 17.5.
Figure 17.5. Bar chart displaying preference for IBM compatibles versus Macintosh computers as a function of school of enrollment (significant results).
Understanding the bar chart. You can see that the vertical axis of Figure 17.5 is labeled “Frequency.” This means that you will be plotting the frequency (number of people) in each cell. The horizontal axis of Figure 17.5 is labeled “School of Enrollment.” The three sets of bars in the figure are labeled “Arts” (for the School of Arts and Sciences), “Business” (for the School of Business), and “Education” (for the School of Education). You can see that there are actually two bars for each school. For each school, the solid bar represents the number of people who preferred an IBM-compatible computer, and the white bar represents the number of people who preferred a Macintosh.

Using cell frequencies from the SAS output to create the bar chart. The cell frequencies from Output 17.5 were used to create the bars in Figure 17.5. First, consider the row labeled “IBM” in Output 17.5. The frequency entries from this row were used to create the solid bars in Figure 17.5.

• The frequency of “30” from the ARTS column in Output 17.5 (n) was used to create the solid bar for the “Arts” group in Figure 17.5. Notice that this bar indicates a value of “30” on the vertical axis of Figure 17.5.

• The frequency of “100” from the BUS column in Output 17.5 (o) was used to create the solid bar for the “Business” group in Figure 17.5.

• The frequency of “20” from the ED column in Output 17.5 (p) was used to create the solid bar for the “Education” group in Figure 17.5.

Next, consider the row labeled “MAC” in Output 17.5. The frequency entries from this row were used to create the white bars in Figure 17.5.

• The frequency of “60” from the ARTS column in Output 17.5 (q) was used to create the white bar for the “Arts” group in Figure 17.5.

• The frequency of “40” from the BUS column in Output 17.5 (r) was used to create the white bar for the “Business” group in Figure 17.5.

• The frequency of “120” from the ED column in Output 17.5 (s) was used to create the white bar for the “Education” group in Figure 17.5.
Interpreting the bar chart. If you review the pattern of frequencies in Figure 17.5, you will see the same relationships described earlier: In both the School of Arts and Sciences and the School of Education, the majority of students preferred Macintosh computers over IBM-compatible computers (the white bars are taller than the solid bars for those two schools). However, in the School of Business, the majority of students preferred IBM-compatible computers over Macintosh computers (the solid bar is taller than the white bar for that school).
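Charts such as Figure 17.5 can be drawn by hand, but SAS can also produce a comparable grouped bar chart directly from the tabular data set. The following is a sketch using PROC CHART (the text itself does not show this step); the GROUP=, SUMVAR=, and TYPE= options exist in PROC CHART, though the result will be a character-based approximation of Figure 17.5 rather than a duplicate of it.

PROC CHART DATA=D1;
   VBAR PREF / GROUP=SCHOOL   /* one cluster of bars per school       */
               SUMVAR=NUMBER  /* bar heights come from the cell counts */
               TYPE=SUM;      /* plot the summed frequencies           */
RUN;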
Analysis Report for the Computer Preferences Study (Significant Results)

Overview. The following analysis report summarizes the results of the preceding analysis. This report can be used as a model of how to prepare reports when you have performed a chi-square test of independence and have obtained significant results. A section following the report explains the meaning of some of the information that appears in the report.

A) Statement of the research question: The purpose of the present study was to determine whether there is a relationship between school of enrollment and computer preference among college students. Specifically, this study was designed to determine whether there is a difference between subjects in the School of Arts and Sciences, the School of Business, and the School of Education with respect to their preferences for IBM-compatible computers versus Macintosh computers.

B) Statement of the research hypothesis: There will be a relationship between school of enrollment and computer preference such that (a) students in the School of Arts and Sciences and students in the School of Education will be likely to prefer Macintosh computers, while (b) students in the School of Business will be likely to prefer IBM-compatible computers.

C) Nature of the variables: This analysis involved two variables:

• The predictor variable was “school of enrollment.” This was a limited-value variable, was measured on a nominal scale, and could assume three values: the School of Arts and Sciences, the School of Business, and the School of Education.

• The criterion variable was “computer preference.” This was a dichotomous variable, was measured on a nominal scale, and could assume two values: IBM-compatible and Macintosh.

D) Statistical test: Chi-square test of independence.

E) Statistical null hypothesis (H0): In the study population, there is no relationship between school of enrollment and computer preference.

F) Statistical alternative hypothesis (H1): In the study population, there is a relationship between school of enrollment and computer preference.

G) Obtained statistic: χ2(2, N = 370) = 97.385.

H) Obtained probability (p) value: p < .0001.
I) Conclusion regarding the statistical null hypothesis: Reject the null hypothesis.

J) Effect size: Cramer’s V was used as the index of effect size. For this analysis, Cramer’s V = .51. Values of Cramer’s V may range from zero to +1.00, with values closer to zero indicating a weaker relationship between the predictor variable and the criterion variable.

K) Conclusion regarding the research hypothesis: These findings provide support for the study’s research hypothesis.

L) Formal description of the results for a paper: Results were analyzed using a chi-square test of independence. This analysis revealed a significant relationship between school of enrollment and computer preference, χ2(2, N = 370) = 97.385, p < .0001. Figure 17.5 illustrates the number of students who preferred IBM-compatible computers versus Macintosh computers, broken down by school of enrollment. The cross-tabulation table showed that, for subjects in the School of Arts and Sciences, a smaller percentage preferred IBM-compatible computers versus Macintosh computers (33% versus 67%, respectively); for the School of Business, a larger percentage preferred IBM-compatible versus Macintosh computers (71% versus 29%, respectively); and for the School of Education, a smaller percentage preferred IBM-compatible computers versus Macintosh computers (14% versus 86%, respectively).

Cramer’s V was used to assess the strength of the relationship between the two variables. This statistic may range from zero to +1.00, with values closer to zero indicating a weaker relationship. For this analysis, Cramer’s V was computed as V = .51.

M) Figure representing the results: See Figure 17.5.

Notes Regarding the Preceding Analysis Report

Reporting the chi-square statistic. In the preceding report, Items G and L presented the obtained value of the chi-square statistic. Many journals in the behavioral sciences report obtained chi-square statistics according to this format:

χ2(df, N = N) = value

where

df is equal to the degrees of freedom for the chi-square analysis.
N is equal to the total sample size.
value is the obtained chi-square statistic.
The statistics table at the bottom of Page 1 of the SAS output (Output 17.3) showed that the chi-square test had 2 degrees of freedom and that the obtained value of chi-square was 97.385. The two-way classification table created by PROC FREQ showed that the total sample size was 370. This table appeared in Output 17.2, and the total sample size was provided in the lower right corner of the table. Therefore, the results for the current analysis were reported as χ2(2, N = 370) = 97.385.

Reporting the effect size. Item J in the preceding report provided the index of effect size. For this analysis, the two-way classification table was larger than 2 × 2, so you used Cramer’s V as your index. Item J from the previous report is reproduced below:

J) Effect size. Cramer’s V was used as the index of effect size. For this analysis, Cramer’s V = .51. Values of Cramer’s V may range from zero to +1.00, with values closer to zero indicating a weaker relationship between the predictor variable and the criterion variable.

If your table had been a 2 × 2 table instead of a 2 × 3 table, you would have instead reported the phi coefficient as your index of effect size (remember that the phi coefficient had also appeared in Output 17.4, presented earlier). If this had been the case, you would have prepared Item J in this way:

J) Effect size. The phi coefficient (φ) was used as the index of effect size. For this analysis, φ = .51. Values of φ may range from approximately –1.00 through zero to approximately +1.00, with values closer to zero indicating a weaker relationship between the predictor variable and the criterion variable.

Item L (the formal description of results for a paper) also reported the index of effect size in its final paragraph. That paragraph is reproduced again below:

Cramer’s V was used to assess the strength of the relationship between the two variables. This statistic may range from zero to +1.00, with values closer to zero indicating a weaker relationship. For this analysis, Cramer’s V was computed as V = .51.

If it had been appropriate to report the phi coefficient instead of Cramer’s V, this paragraph would have instead been written in this way:

The phi coefficient (φ) was used to assess the strength of the relationship between the two variables. This statistic may range from approximately –1.00 through zero to approximately +1.00, with values closer to zero indicating a
weaker relationship. For this analysis, the phi coefficient was computed as φ = .51.

Reporting subgroup percentages. Item L from the preceding analysis report described the column percentages from the two-way cross-tabulation table produced by PROC FREQ (Output 17.5). To review how to interpret these percentages, see the section “4. Review the column percent figures,” presented earlier. In some cases, it might be better to report the row percentages from the cross-tabulation table instead of the column percentages. The decision will depend on the nature of your study, what hypotheses you are testing, and what point you are trying to make for the reader.
Example of a Chi-Square Test That Reveals a Nonsignificant Relationship

Overview

This section presents the results of a chi-square test in which the relationship between the two variables is nonsignificant. These results are presented so that you will be prepared to write analysis reports for projects in which nonsignificant outcomes are observed. The study presented here is the same computer preference study described in the preceding section. Here, the data have been changed so that they will produce nonsignificant results. Figure 17.6 presents the new data set that will be analyzed in this section.
Figure 17.6. Preference for IBM compatibles versus Macintosh computers as a function of school of enrollment (nonsignificant results).
The Complete SAS Program

The data from Figure 17.6 will be analyzed with the same SAS program that was presented earlier. The complete SAS program, including the new data set (from Figure 17.6), is presented here:

OPTIONS LS=80 PS=60;
DATA D1;
   INPUT PREF $
         SCHOOL $
         NUMBER ;
DATALINES;
IBM ARTS 40
IBM BUS 75
IBM ED 68
MAC ARTS 50
MAC BUS 65
MAC ED 72
;
PROC FREQ   DATA=D1;
   TABLES   PREF*SCHOOL / ALL;
   WEIGHT   NUMBER;
   TITLE1   'JANE DOE';
RUN;
Steps in Interpreting the Output

Overview. As was the case with the earlier data set, the SAS program that performs this analysis produces two pages of output. Page 1 again presents the two-way classification table and significance tests, while page 2 presents a number of additional statistics. As was discussed earlier, your first step should be to review the output for signs of any obvious errors. To save space, this section will skip those steps, and will move directly to the results that are relevant to the study’s research hypothesis.

1. Review the column percent figures. As was stated in the preceding section, in the present study it is particularly revealing to review the classification table one column at a time, and to pay particular attention to the last entry in each cell: the “column percent.” Output 17.6 presents the two-way classification table from the current analysis.
JANE DOE                                                        1

                       The FREQ Procedure

                     Table of PREF by SCHOOL

     PREF      SCHOOL

     Frequency|
     Percent  |
     Row Pct  |
     Col Pct  |ARTS n  |BUS o   |ED p    |  Total
     ---------+--------+--------+--------+
     IBM      |     40 |     75 |     68 |    183
              |  10.81 |  20.27 |  18.38 |  49.46
              |  21.86 |  40.98 |  37.16 |
              |  44.44 |  53.57 |  48.57 |
     ---------+--------+--------+--------+
     MAC      |     50 |     65 |     72 |    187
              |  13.51 |  17.57 |  19.46 |  50.54
              |  26.74 |  34.76 |  38.50 |
              |  55.56 |  46.43 |  51.43 |
     ---------+--------+--------+--------+
     Total          90      140      140      370
                 24.32    37.84    37.84   100.00
Output 17.6. Two-way classification table, requested by PROC FREQ, for the computer preferences study (nonsignificant results).
First, consider the ARTS column in Output 17.6 (n). The column percent entries show that 44.44% of the Arts and Sciences students preferred IBM-compatibles, while 55.56% preferred Macintoshes. These percentages are fairly close to each other, indicating that approximately equal numbers of students wanted IBM-compatibles versus Macintoshes. Next, consider the BUS column (o), which shows a similar trend: 53.57% of the Business students preferred IBM-compatibles, while 46.43% preferred Macintoshes. The same trend appears again for the Education students in the ED column (p), which shows that 48.57% preferred IBM-compatibles, while 51.43% preferred Macintoshes.

In summary, there does not appear to be a strong trend in these data suggesting that the students’ preference for a specific computer depends on their school of enrollment. Regardless of school, in each sample about half of the students preferred IBM-compatibles, and half preferred Macintosh computers. This is the type of pattern that you would expect if the relationship between school of enrollment and computer preference was nonsignificant. Your next step, therefore, will be to determine whether this relationship is in fact nonsignificant.

2. Review the chi-square test of independence. As was stated earlier, the statistical null hypothesis for the current study can be stated as follows:

Statistical null hypothesis (H0): In the population, there is no relationship between school of enrollment and computer preference.
The results of the chi-square test for this null hypothesis appear in Output 17.7.

              Statistics for Table of PREF by SCHOOL

   Statistic                     DF       Value       Prob
   --------------------------------------------------------
   Chi-Square                     2      1.8967 ❶    0.3874 ❷
   Likelihood Ratio Chi-Square    2      1.8994      0.3869
   Mantel-Haenszel Chi-Square     1      0.1911      0.6620
   Phi Coefficient                       0.0716
   Contingency Coefficient               0.0714
   Cramer's V ❸                          0.0716 ❹

Output 17.7. Chi-square test of independence, requested with PROC FREQ, for the computer preferences study (nonsignificant results).
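If you would like to verify the obtained chi-square value by hand, recall that the statistic compares each observed cell frequency (O) with the frequency that would be expected if the null hypothesis were true (E), where E for a given cell is that cell's row total multiplied by its column total, divided by N. For the IBM-ARTS cell of Output 17.6, for example:

   E = (183 × 90) / 370 = 44.51

   (O − E)² / E = (40 − 44.51)² / 44.51 = 0.46

Summing the quantity (O − E)² / E across all six cells reproduces the chi-square value of 1.8967 reported in Output 17.7 (❶).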
Output 17.7 shows that the obtained value of chi-square is 1.8967 (❶), which rounds to 1.897. The analysis had 2 degrees of freedom. This value of chi-square was quite small, given the degrees of freedom. The probability value, or p value, for this chi-square statistic is .3874 (❷). This p value is well above the standard criterion of .05, and so you fail to reject the null hypothesis. In other words, you will retain the null hypothesis, and tentatively conclude that school of enrollment is unrelated to computer preference in the population. You will conclude that your results are nonsignificant.

3. Review the index of effect size. Because your classification table is larger than 2 × 2, you will use Cramer's V as your index of effect size. This statistic appears in Output 17.7, in the row headed "Cramer's V" (❸). You can see that, for the current analysis, Cramer's V = .0716 (❹), which rounds to .07. With Cramer's V, values closer to zero indicate a weaker relationship. The value of .07 obtained here is quite close to zero, indicating a relatively weak relationship between school of enrollment and computer preference.

Using a Figure to Illustrate the Results

The frequencies for the number of people appearing in each cell for this study were presented in the two-way classification table of Output 17.6. Remember that the top number in each cell represents the number of people in that subgroup. Figure 17.7 presents a bar chart that displays the cell frequencies of Output 17.6.
Figure 17.7. Bar chart displaying preference for IBM compatibles versus Macintosh computers as a function of school of enrollment (nonsignificant results).
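If you would like SAS itself to produce a rough, low-resolution version of a chart such as Figure 17.7, one possibility is PROC CHART, which was covered earlier in this guide. The following is only a minimal sketch (it is not the method that was used to prepare Figure 17.7), and it assumes the tabular data set D1 presented earlier in this section:

   PROC CHART   DATA=D1;
      *  Draw one bar for each value of PREF, grouped by SCHOOL;
      *  SUMVAR=NUMBER with TYPE=SUM makes bar heights equal the
         cell frequencies rather than counts of data lines;
      VBAR  PREF  /  GROUP=SCHOOL  SUMVAR=NUMBER  TYPE=SUM;
      TITLE1  'JANE DOE';
   RUN;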
Again, notice how the results presented in Figure 17.7 are consistent with the hypothesis that there is little relationship between school of enrollment and computer preference: Knowing which school a student is in does not enable you to accurately predict the type of computer that the student is likely to prefer.

Analysis Report for the Computer Preferences Study (Nonsignificant Results)

The following report summarizes the results of the preceding analysis. This will serve as a model of how to prepare a report when you perform a chi-square test of independence and obtain nonsignificant results.

A) Statement of the research question: The purpose of the present study was to determine whether there is a relationship between school of enrollment and computer preference among college students. Specifically, this study was designed to determine whether there is a difference between subjects in the School of Arts and Sciences, the School of Business, and the School of Education with respect to their preferences for IBM-compatible computers versus Macintosh computers.

B) Statement of the research hypothesis: There will be a relationship between school of enrollment and computer preference such that (a) students in the School of Arts and Sciences and students in the School of Education will be likely to prefer Macintosh computers, while (b) students in the School of Business will be likely to prefer IBM-compatible computers.

C) Nature of the variables: This analysis involved two variables:

   • The predictor variable was "school of enrollment." This was a limited-value variable, was measured on a nominal scale, and could assume three values: the School of Arts and Sciences, the School of Business, and the School of Education.

   • The criterion variable was "computer preference." This was a dichotomous variable, was measured on a nominal scale, and could assume two values: IBM-compatible and Macintosh.

D) Statistical test: Chi-square test of independence.

E) Statistical null hypothesis (H0): In the study population, there is no relationship between school of enrollment and computer preference.

F) Statistical alternative hypothesis (H1): In the study population, there is a relationship between school of enrollment and computer preference.

G) Obtained statistic: χ²(2, N = 370) = 1.897.

H) Obtained probability (p) value: p = .3874.

I) Conclusion regarding the statistical null hypothesis: Fail to reject the null hypothesis.

J) Effect size: Cramer's V was used as the index of effect size. For this analysis, Cramer's V = .07. Values of Cramer's V may range from zero to +1.00, with values closer to zero indicating a weaker relationship between the predictor variable and the criterion variable.

K) Conclusion regarding the research hypothesis: These findings fail to provide support for the study's research hypothesis.

L) Formal description of the results for a paper: Results were analyzed using a chi-square test of independence. This analysis revealed a nonsignificant relationship between school of enrollment and computer preference, χ²(2, N = 370) = 1.897, p = .3874. Figure 17.7 illustrates the number of students who preferred IBM-compatible computers versus Macintosh computers, broken down by school of enrollment.
Cramer's V was used to assess the strength of the relationship between the two variables. This statistic may range from zero to +1.00, with values closer to zero indicating a weaker relationship. For this analysis, Cramer's V was computed as V = .07.

M) Figure representing the results: See Figure 17.7.
Notes Regarding the Preceding Analysis Report

Reporting the subgroup percentages. Item L in the preceding report provides a formal description of the results for a paper. Notice that it provides the chi-square value, p value, and index of effect size, but it does not discuss the "column percents" (which had been discussed in the report on the significant relationship, presented earlier). This is because, when the overall relationship that is being tested is nonsignificant, more detailed results, such as the column percents, are typically not reported.

Reporting the effect size. In the preceding report, you used Cramer's V as your index of effect size because the two-way classification table was larger than 2 × 2. If your table had been a 2 × 2 table instead of a 2 × 3 table, you would have instead reported the phi coefficient as your index of effect size (the phi coefficient also appeared in Output 17.7). If this had been the case, you would have prepared Item J in this way:

   J) Effect size: The phi coefficient (φ) was used as the index of effect size. For this analysis, φ = .07. Values of φ may range from approximately −1.00 through zero to approximately +1.00, with values closer to zero indicating a weaker relationship between the predictor variable and the criterion variable.

You also reported Cramer's V as the index of effect size in the last paragraph of Item L, the formal description of results for a paper. If you had instead reported the phi coefficient, this paragraph would have been written this way:

   The phi coefficient (φ) was used to assess the strength of the relationship between the two variables. This statistic may range from approximately −1.00 through zero to approximately +1.00, with values closer to zero indicating a weaker relationship. For this analysis, the phi coefficient was computed as φ = .07.
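Incidentally, both of these indices can be computed directly from the chi-square statistic, which is why they are identical for the current analysis. The standard formulas are:

   φ = sqrt( χ² / N ) = sqrt( 1.8967 / 370 ) = .0716

   Cramer's V = sqrt( χ² / (N × m) ) = sqrt( 1.8967 / (370 × 1) ) = .0716

where m is the smaller of (number of rows − 1) and (number of columns − 1). For the present 2 × 3 table, m = 1, so phi and Cramer's V coincide; the same is true for any 2 × 2 table, except that the phi coefficient for a 2 × 2 table carries a sign indicating the direction of the relationship.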
Computing Chi-Square from Raw Data

Overview

So far, this chapter has shown you only how to analyze data that are already in tabular form. In contrast, raw data are data that have not been summarized in tabular form. Regarding the computer preferences study described above: suppose that you administered your questionnaire to 370 students, and then keyed responses to the questionnaire with one line of data for each student. In this case, you would be working with raw data. When you work with raw data, the DATA step as well as the PROC step follow a somewhat different format than the format discussed so far. This section shows you the proper format for performing a chi-square test on raw data.

Inputting Raw Data

Data set to be analyzed. If the data to be analyzed are in raw form, they may be entered following the procedures discussed in Chapter 4, "Data Input," presented earlier in this book. For example, assume that the two-item computer preferences questionnaire (presented at the beginning of this chapter) has been administered to the 370 subjects. Table 17.2 presents fictitious data from a small subset of these 370 subjects.

Table 17.2
Raw Data Set for the Computer Preferences Study
_____________________________
Subject   Preference   School
_____________________________
001       MAC          ARTS
002       IBM          BUS
003       MAC          BUS
[Data lines for the remaining subjects would go here]
368       MAC          ED
369       IBM          BUS
370       MAC          ED
_____________________________
The conventions used with Table 17.2 are the same conventions that have been used throughout this text. Each column in the table represents a different variable. The first column (headed "Subject") simply assigns a unique subject number to each respondent. The second column (headed "Preference") indicates whether that subject preferred an IBM-compatible or a Macintosh. The third column (headed "School") indicates the school in which the student was enrolled.
Each row in the table provides data from a different student. For example:

• The first row presents data from Subject 001, who preferred a Macintosh and was enrolled in the School of Arts and Sciences.

• The second row presents data from Subject 002, who preferred an IBM-compatible and was enrolled in the School of Business.

And so forth. Remember that Table 17.2 presents data from only the first three subjects and the last three subjects in the sample. If the table were complete, it would have data for 370 subjects (i.e., it would have 370 lines of data).

The DATA step. Below is the syntax for the DATA step when the data are in raw-score form:

   OPTIONS LS=80 PS=60;
   DATA data-set-name;
      INPUT   subject-number-variable-name
              row-variable-name    $
              column-variable-name $ ;
   DATALINES;
   [data lines go here]
   ;

The syntax for the preceding INPUT statement indicates that the name for the row variable should come before the name for the column variable, but in fact this order is arbitrary (in the INPUT statement, at least). The actual DATA step that would input the data from Table 17.2 is as follows (line numbers have been added on the left):

   1    OPTIONS LS=80 PS=60;
   2    DATA D1;
   3       INPUT   SUB_NUM
   4               PREF    $
   5               SCHOOL  $ ;
   6    DATALINES;
   7    001 MAC ARTS
   8    002 IBM BUS
   9    003 MAC BUS
        [Remaining data lines would appear here]
   375  368 MAC ED
   376  369 IBM BUS
   377  370 MAC ED
   378  ;
The INPUT statement on lines 3-5 of this program tells SAS that the data set includes three variables. The first of these three variables is SUB_NUM (which represents "subject number"), the second variable is PREF (for "computer preference"), and the third variable is SCHOOL (for "school of enrollment"). The data lines themselves appear on lines 7-377. You can see that these data lines are the same as in Table 17.2.

The preceding program specified SCHOOL and PREF as character variables with values such as ARTS and MAC, but it also would have been possible to code them as numeric variables. For example, SCHOOL could have been coded so that 1 = Arts and Sciences, 2 = Business, and 3 = Education. The analysis could have then proceeded in the usual way, although you would then need to (a) keep a record of exactly which group is represented by each numerical value, or (b) use the VALUE statement of PROC FORMAT to attach meaningful value labels (such as "ARTS" and "BUS") to the variable categories when they are printed. (A brief sketch of the PROC FORMAT approach appears at the end of this section.)

The PROC Step

Below is the syntax for the PROC step that will perform the chi-square analysis using raw data:

   PROC FREQ   DATA=data-set-name;
      TABLES   row-variable-name*column-variable-name  /  options;
      TITLE1  'your name';
   RUN;

Substituting the appropriate SAS variable names into this syntax results in the following (line numbers have been added on the left):

   379  PROC FREQ   DATA=D1;
   380     TABLES   PREF*SCHOOL  /  ALL;
   381     TITLE1  'JANE DOE';
   382  RUN;
How do these statements differ from the PROC-step statements presented earlier? The earlier statements (from the section in which the analysis was performed on tabular data) included a WEIGHT statement after the TABLES statement. In the example immediately above, the WEIGHT statement has been dropped. It is not required when you are analyzing raw data.
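Finally, here is the PROC FORMAT approach that was mentioned above for numerically coded variables. This is only a minimal sketch, assuming that SCHOOL has been keyed as 1, 2, or 3; the format name SCHL_FMT is arbitrary:

   PROC FORMAT;
      *  Attach text labels to the numeric codes for SCHOOL;
      VALUE SCHL_FMT   1 = 'ARTS'
                       2 = 'BUS'
                       3 = 'ED';
   RUN;
   PROC FREQ   DATA=D1;
      TABLES   PREF*SCHOOL  /  ALL;
      *  Print the labels rather than the codes 1, 2, 3;
      FORMAT   SCHOOL SCHL_FMT.;
      TITLE1  'JANE DOE';
   RUN;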
The Complete SAS Program

Below is the complete SAS program (containing only a subset of data) that will perform a chi-square test of independence using raw data from the computer preferences study (line numbers have been added on the left):

   1    OPTIONS LS=80 PS=60;
   2    DATA D1;
   3       INPUT   SUB_NUM
   4               PREF    $
   5               SCHOOL  $ ;
   6    DATALINES;
   7    001 MAC ARTS
   8    002 IBM BUS
   9    003 MAC BUS
   ...  [Remaining data lines appear here]
   375  368 MAC ED
   376  369 IBM BUS
   377  370 MAC ED
   378  ;
   379  PROC FREQ   DATA=D1;
   380     TABLES   PREF*SCHOOL  /  ALL;
   381     TITLE1  'JANE DOE';
   382  RUN;
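Before interpreting the results, you may want to verify that the raw data were read correctly. One quick check (a sketch only; it is not part of the program above) is to print the data set with PROC PRINT, as discussed earlier in this guide:

   PROC PRINT   DATA=D1;
      TITLE1  'JANE DOE';
   RUN;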
Interpreting the SAS Output

From this point forward, the analysis proceeds in exactly the same way as when the data set was based on tabular data. The same options can be requested, and the results are interpreted in exactly the same way.
Conclusion

This concludes the final chapter of Step-by-Step Basic Statistics Using SAS: Student Guide. This book was designed to introduce you to just a few of the elementary statistical procedures that are used in research in the social sciences and education. Given its limited scope, it has covered only a very small percentage of the statistical procedures that are possible with SAS. However, a wide variety of additional books are available that show how to perform other types of statistical procedures using SAS software.
A Step-by-Step Approach to Using the SAS System for Univariate and Multivariate Statistics (Hatcher and Stepanski, 1994) covers many of the same procedures and statistics discussed in the present text, but at a somewhat more advanced level. It also covers more advanced statistical procedures, including one-way ANOVA with one repeated-measures factor, factorial ANOVA with repeated-measures factors and between-subjects factors, multiple regression, and principal component analysis.

Cody and Smith's (1997) Applied Statistics and the SAS Programming Language covers similar subject matter, and provides a more in-depth treatment of factorial ANOVA with within-subject factors. It also covers a number of more advanced statistics.

The SAS Institute's Books By Users program has published a large number of texts that can be useful to researchers using SAS. These texts range in level of sophistication from the elementary to the advanced. You may obtain a copy of the SAS Publishing Catalog from SAS Publications (1-800-727-3228). You may also view the entire SAS Publishing Catalog on the Web at support.sas.com/pubs.
References

Abelson, R. P. (1995). Statistics as principled argument. Hillsdale, NJ: Lawrence Erlbaum Associates.

Bandura, A. (1965). Influence of models' reinforcement contingencies on the acquisition of imitative responses. Journal of Personality and Social Psychology, 1, 589–595.

Bandura, A. (1977). Social learning theory. Englewood Cliffs, NJ: Prentice Hall.

Buunk, B. P., Angleitner, A., Oubaid, V., & Buss, D. M. (1996). Sex differences in jealousy in evolutionary and cultural perspective: Tests from the Netherlands, Germany, and the United States. Psychological Science, 7, 359–363.

Carpenter, A. L., & Shipp, C. E. (1995). Quick results with SAS/GRAPH software. Cary, NC: SAS Institute Inc.

Cody, R. P., & Smith, J. K. (1997). Applied statistics and the SAS programming language, fourth edition. Upper Saddle River, NJ: Prentice Hall.

Cohen, J. (1969). Statistical power analysis for the behavioral sciences. New York: Academic Press.

Daly, M., Wilson, M., & Weghorst, S. J. (1982). Male sexual jealousy. Ethology and Sociobiology, 3, 11–27.

Delwiche, L. D., & Slaughter, S. J. (1998). The little SAS book: A primer, second edition. Cary, NC: SAS Institute Inc.

Dicataldo, F., & Grisso, T. (1995). A typology of juvenile offenders based on the judgments of juvenile court professionals. Criminal Justice and Behavior, 22, 246–262.

Gilmore, J. (1996). Painless Windows 3.1: A beginner's handbook for SAS users. Cary, NC: SAS Institute Inc.

Gilmore, J. (1997). Painless Windows: A handbook for SAS users. Cary, NC: SAS Institute Inc.

Gilmore, J. (1999). Painless Windows: A handbook for SAS users, second edition. Cary, NC: SAS Institute Inc.

Hatcher, L. (1994). A step-by-step approach to using the SAS System for factor analysis and structural equation modeling. Cary, NC: SAS Institute Inc.

Hatcher, L., & Stepanski, E. J. (1994). A step-by-step approach to using the SAS system for univariate and multivariate statistics. Cary, NC: SAS Institute Inc.

Hatcher, L. (2001). Using the SAS windowing environment: A quick tutorial. Cary, NC: SAS Institute Inc.

Hays, W. L. (1988). Statistics, fourth edition. New York: Holt, Rinehart, & Winston.

Howell, D. C. (1997). Statistical methods for psychology, fourth edition. Belmont, CA: Duxbury Press.

Locke, E. A., & Latham, G. P. (1990). A theory of goal setting and task performance. Englewood Cliffs, NJ: Prentice-Hall.

Ludwig, T. D., & Geller, E. S. (1997). Assigned versus participative goal setting and response generalization: Managing injury control among professional pizza deliverers. Journal of Applied Psychology, 82, 253–261.

Rusbult, C. E. (1980). Commitment and satisfaction in romantic associations: A test of the investment model. Journal of Experimental Social Psychology, 16, 172–186.

SAS Institute Inc. (1998). Getting started with the SAS system, version 7. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999a). SAS language reference: Concepts, version 8. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999b). SAS language reference: Dictionary, version 8. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999c). SAS procedures guide, version 8, volumes 1 and 2. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (1999d). SAS/STAT user's guide, version 8, volumes 1, 2, and 3. Cary, NC: SAS Institute Inc.

SAS Institute Inc. (2000). SAS/GRAPH software: Reference, version 8, volumes 1 and 2. Cary, NC: SAS Institute Inc.

Schlotzhauer, S. D., & Littell, R. C. (1997). SAS system for elementary statistical analysis, second edition. Cary, NC: SAS Institute Inc.

Spatz, C. (2001). Basic statistics: Tales of distributions, seventh edition. Belmont, CA: Wadsworth/Thomson Learning.

Stevens, J. (1986). Applied multivariate statistics for the social sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.

Thorndike, R. M., & Dinnel, D. L. (2001). Basic statistics for the behavioral sciences. Upper Saddle River, NJ: Merrill Prentice Hall.

Tukey, J. W. (1977). Exploratory data analysis. Reading, MA: Addison-Wesley.
This combined index includes entries for Basic Statistics Using SAS: Student Guide and Exercises.
factorial ANOVA 597–607, 614–616,
Page numbers preceded by "E" indicate pages in Basic Statistics 622–624Using SAS: Exercises. All other page numbers refer to this book.
Index A absolute magnitude/value correlation coefficients 295–296 z scores 276–278 accuracy of data, verifying 181 achievement motivation study (example) 218–222 creating variable conditionally 239–241 data manipulation and subsetting statements, combined 256–260 eliminating missing data 252–256 recoding and creating variables 235–239 ADJUST= option, LSMEANS statement (GLM) 626 aggression in children study (example) See child aggression study (example) All Files option (Open dialog) 80 ALL option, TABLES statement (FREQ) 648 “Allow cursor movement past end of line” option 58 alpha level 336 ALPHA= option CORR procedure 334 LSMEANS statement (GLM) 626–627 MEANS statement (GLM) 509, 572–573 TTEST procedure 396–397, 401, 432, 470, 476–477 alternative explanations 300–302 alternative (directional) hypotheses 20–21 memory performance with Gingko biloba 423–426 one- and two-tailed tests 406–407 paired-samples t tests 462 single-sample t tests 389–390 analysis reports bivariate regression, negative coefficient 376–378 bivariate regression, nonsignificant coefficient 380–383 bivariate regression, positive coefficient 367–371 chi-square test of independence 658–661, 665–667
factorial ANOVA 597–607, 614–616, 622–624 independent-samples t tests 443–445, 448–450 nonsignificant correlation coefficients 328–329 one-way ANOVA 526–529, 535–537 paired-samples t tests 479–482, 485–487 Pearson correlation coefficient 318–320 significance tests 318–320 single-sample t tests, nonsignificant results 410–411 single-sample t tests, significant results 405–406 statistical null hypothesis 318–320 AND statements 244, 252 ANOVA, factorial See factorial ANOVA with two betweensubjects factors ANOVA, one-way See one-way ANOVA with betweensubjects factor Appearance Options tab (Enhanced Editor) 58 approximately normal distribution 191–193, E47 stem-and-leaf plots 191–192 arithmetic operators 231 arrow keys, navigating SAS programs 81 assignment statements 233 association, tests of 383 assumptions, statistical 35 average See central tendency statistics See mean
B balanced factorial ANOVA designs 575, 625 bar charts See also frequency bar charts subgroup-mean bar charts 161, 174–177 basic statistical measures table 184, 192
676 Index best-fitting lines for variables See bivariate linear regression between-subjects designs 416–417 See also factorial ANOVA with two between-subjects factors See also one-way ANOVA with betweensubjects factor bimodal distributions 198–200, E47 See also CORR procedure causal relationships, investigating 299–303 correlating weight loss with predictor variables 303–307 mean, median, mode 199 nonsignificant correlations, summarizing results 324–329 scattergrams, creating 307–313, 325–327 significant results 335–338 Spearman rank-order correlation coefficient 332–333 stem-and-leaf plots 198 suppressing correlation coefficients 329–331 binary (dichotomous) variables 27–28 type-of-variable figures for 33 bivariate correlation See correlation coefficients bivariate linear regression 338, 341–383, E100, E115 appropriate situations and assumptions 344–346 criterion variables 347–348 degrees of freedom 354, 371 drawing regression line through scattergrams 358–364, 372–374 negative regression coefficient 371–379 nonsignificant regression coefficient 379–383 p value 371 positive regression coefficient 350–371 predictor variables 347–348 residuals of prediction 365–367 weight loss, correlating with predictor variables 346–350 BON option, MEANS statement (GLM) 509
C campaign finance studies See political donations study #1 (example) See political donations study #2 (example) categorical variables 22 nominal scales 25 causal relationships, correlations to investigate 299–303
cause-and-effect relationships 30, 300–302 central tendency statistics See also mean median (median score) 186, 192, 194–196, 199 mode (modal score) 185, 192, 194–196, 199 UNIVARIATE procedure for 183–184 variance and standard deviation 204–213 character value conversions 245–248 character variables $ for 128 frequency bar charts for 163–164 in IF-THEN statements 245–248 in INPUT statement 128–130 CHART procedure frequency bar charts 163–173, E33, E38 HBAR statement 163, 168–173, 175 subgroup-mean bar charts 161, 174–177 VBAR statement 163, 168–173 , 175 charts See graphs chi-square statistic 637 chi-square test of independence 631–671, E209, E218 analysis reports 658–661, 665–667 appropriate situations for 631–634 computing from raw data 642–643, 668–671 criterion variables 632, 644 DATA step 646–647, 669–670 DATALINES statement 646–647 FREQ procedure for 647–649, 670, E209, E218 graphing test results 655–657 index of effect size 638–639, 654–655, 660 INPUT statement 646–647, 669–670 nonsignificant relationship 661–667 output files 649–655 p value 638, 654 plotting 655–657 predictor variables 631, 644 significant relationship 643–661 statistical alternative hypothesis 637 statistical null hypothesis 637 two-way classification tables 634–637, 641–642 type-of-variable figures 632 child aggression study (example) 494–496 criterion variables 428–430 factorial ANOVA, significant effect for predictor A only 554–557 factorial ANOVA, significant effect for predictor B only 557–558
Index 677 factorial ANOVA, significant interaction 561–564 factorial ANOVA graphs, interpreting 550–553, 592–594 factorial ANOVA graphs, preparing 595–596, 612, 620 factorial design matrix 548–550 nonsignificant main effects and interaction 607–617 nonsignificant results 446–450 nonsignificant treatment effect 529–537 predictor variables 428–430 significant interactions 617–625 significant main effects, nonsignificant interaction 565–607 significant results 428–445 significant treatment effects 505–529 CHISQ option, TABLES statement (FREQ) 648 CL option, LSMEANS statement (GLM) 626 classification variables 22 nominal scales 25 CLDIFF option, MEANS statement (GLM) 509, 572 Clear All command 73 “Clear text on submit” option 58, 73, 78, 106 clear window contents 73 clicking, defined 49 closing windows 54 coefficient alpha 334 coefficient of determination 296, 357–358, 375 column input 115–116 columns, data set 23, 113–115 comparison number 395 comparison operators 241–242 subsetting data 252 computer preferences among college students (example) 640–642 chi-square test, nonsignificant relationship 661–667 chi-square test, significant relationship 643–661 conditional data modification 239–248 conditional statements 244 subsetting data 252 confidence intervals independent-samples t tests 426, 440–441, 447 paired-samples t tests 462, 463, 476, 484 single-sample t tests 391, 401, 409 control group, defined 32 Copy command 88–92 copying lines into SAS programs 88–92 CORR procedure 313–320, E91, E100 ALPHA= option 334
computing all possible correlations 320–323 computing Pearson correlation coefficient 313–320, E91, E100 COV option 334 KENDALL option 334 NOMISS option 335 NOPROB option 335 options for 333–335 RANK option 335 SPEARMAN option 333, 335 suppressing correlation coefficients 329–331 VAR statement 314, 321, 329–331 WITH statement 329–331 correcting errors in SAS programs 97–100, 107–108 correlated-samples t tests See t tests, paired-samples correlation negative 294–295, 317–318 perfect 296 positive 293–294, 317–318 zero 26–27 correlation coefficients (bivariate correlation) 290–338 See also CORR procedure See also Pearson correlation coefficient absolute magnitude/value 295–296 calculating E91, E100 causal relationships, investigating 299–303 coefficient of determination 296, 357–358, 375 correlating weight loss with predictor variables 303–307 matrices of 316–317 nonsignificant, summarizing results 324–329 sample size and 318 scattergrams, creating 307–313, 325–327 sign and size of, interpreting 293–296, 317–318 significant results 335–338 Spearman rank-order correlation coefficient 332–333 statistical alternative hypothesis 297, 298 statistical null hypothesis 297–299 statistical significance, interpreting 297–299 suppressing 329–331 correlational research 29–30, 36–37, 300–303, 341–342 COV option, CORR procedure 334
678 Index criminal activity among juvenile offenders (example) 632–639 criterion variables 30, 341–344 bivariate linear regression 347–348 chi-square test of independence 632, 644 child aggression study 428–430 factorial ANOVA 542 independent-samples t tests 417–419 one-way ANOVA 491 paired-samples t tests 453 Pearson correlation coefficient 291 scattergrams, creating 307–313, 325–327 women’s responses to sexual infidelity 466, 484 crosstabulation tables 643 Cut command 92
D DAT files 62 data 200–201 See also central tendency statistics See also data modification See also inputting data See also subsetting data analyzing with MEANS and FREQ procedures 131–138 creating and analyzing E9, E14 creating and modifying E63 defined 21 distribution shape 190–200 missing, eliminating observations with 252–256 missing, representing in input 124, 127 printing raw data 139–142, 222–225 stem-and-leaf plots 187–198, E56 variability measures 200–203, E56 verifying accuracy of 181 data files 62 data inputting See inputting data data modification 217–260, E63 See also subsetting data combining several statements 256–260 conditional operations 239–248 creating z-score variables 272, 280 duplicating variables with new names 228–230 placing statements in DATA step 225–228 reasons for 217 DATA= option, PROC statements 252
data screening for factorial ANOVA 570 data sets 23, 113–115 DATA statement 118 DATA step 41, E9, E14 See also examples by name chi-square test of independence 646–647, 669–670 data manipulation and subsetting statements 225–228 one-way ANOVA 507, 530 data subsetting See subsetting data DATALINES statement 120 chi-square test of independence 646–647 debugging SAS programs 97–100, 107–108 degrees of freedom 209 bivariate linear regression 354, 371 deleting lines from SAS programs 84–87 dependent variables 31–32, 342–344 descriptive statistics 204 determination, coefficient of 296, 357–358, 375 Deviation IQs 262 df (degrees of freedom) 209 bivariate linear regression 354, 371 dichotomous (binary) variables 27–28 type-of-variable figures for 33 differences between the means independent-samples t tests 436–443, 447 paired-samples t tests 461–463, 475–476, 484 directional (alternative) hypotheses 20–21 independent-samples t tests 423–426 memory performance with Gingko biloba 423–426 one- and two-tailed tests 406–407 paired-samples t tests 462 single-sample t tests 389–390 DISCRETE option, VBAR/HBAR statements (CHART) 172–173 distribution shape 190–200, E47 approximately normal distribution 191–193 bimodal distributions 198–200 skewed distributions 196–197 UNIVARIATE procedure 190–200, E47 dollar sign ($), for character variables 128 double-clicking, defined 49 “Drag and drop text editing” option 58 DUNCAN option, MEANS statement (GLM) 509 DUNNETT option, MEANS statement (GLM) 509
Index 679
E editing SAS programs 81–93 Editor window 52 maximizing 55–56 menu bar 59 title bar 67 Window menu 59–60 effect size 391–393, 402–404 chi-square test of independence 638–639, 654–655, 660 independent-samples t tests 426–427, 441–443, 447–448 one-way ANOVA 499–500 paired-samples t tests 463, 477–479, 484 single-sample t tests 409 ELSE statements 243 emotional responses to infidelity See women’s responses to sexual infidelity (example) Enhanced Editor Options dialog 56–58, 77–78, 106 equality of variances, testing 437–438, 447 error messages 43, 70 debugging SAS programs 97–100, 107–108 estimated population standard deviation 209, 212–213, E56 calculating to create z-score variables 269, E77, E84 estimated population variance 209, 212–213, E56 executing SAS programs 67–69, E3, E6 with errors 96–101 exiting SAS 74 expected frequencies 638 EXPECTED option, TABLES statement (FREQ) 648 experimental conditions 32, 497 experimental groups, defined 32 experimental research 31–32, 300, 342–343 experiments three conditions 35–36 two conditions 34–35 Explorer window 53 closing 54 maximizing 55–56 extremes table 184
F F’ test for equality of variances 437–438, 447 factorial ANOVA with two between-subjects factors 542–628, E187, E198
aggression study 546–550 analysis reports 597–607, 614–616, 622–624 appropriate situations for 542–545 assumptions for 545 balanced designs 575, 625 criterion variables 542 data screening and assumption tests 570 GLM procedure 571–573 graphs, interpreting 550–553, 592–594 graphs, preparing 595–596, 612, 620 LSMEANS statement 627 no main effects 559 nonsignificant main effects and interaction 607–617 output files 575–592 predictor variables 542 significant effect for both predictors 558–559 significant effect for predictor A only 554–557 significant effect for predictor B only 557–558 significant interactions 560–564, 617–625 significant main effects, nonsignificant interaction 565–607 simple effects, testing for 622 summary tables 583–586 type-of-variable figures 543 unbalanced designs 575, 625–627 file name extensions for SAS programs 61–62 file types 62, 63–64 Files of type (Open dialog) 80 financial donations studies See political donations study #1 (example) See political donations study #2 (example) FISHER option, TABLES statement (FREQ) 648 Fisher’s exact test 638 fitting straight lines to variables See bivariate linear regression floppy disks, saving programs on 64–66 formatted input 116 free-formatted input 115 FREQ procedure 131–134, 147, E9, E14, E21, E27 analyzing data with 131–138 chi-square test of independence 647–649, 670, E209, E218 creating frequency tables 153–156 interpreting results of 137–138 questions answerable by frequency tables 157–158 TABLES statement 647–649
680 Index FREQ procedure (continued) WEIGHT statement 647, 670 frequency bar charts 160, 162–173, E33, E38 See also frequency tables character variables 163–164 controlling number of bars 168–169 numeric variables 165–167 setting bar midpoints 170–171 values as separate bars 172–173 frequency tables See also frequency bar charts creating 147–158, E21 numeric variables 165–171 questions answerable by 157–158 reviewing 137–138 stem-and-leaf plots 187–198, E56
G GABRIEL option, MEANS statement (GLM) 509 gathering data 21 GE (greater than or equal to) operator 240–242 General Options tab (Enhanced Editor) 58, 77–78, 106 Gingko biloba study See memory performance with Gingko biloba (example) GLM procedure factorial ANOVA with two between-subjects factors 571–573 LSMEANS statement 575, 625–627 MEANS statement 509–511, 572–573 MODEL statement 572 one-way ANOVA 508–509, 530, E167, E177 goal setting theory E38 graphs See also frequency tables See also scattergrams See also type-of-variable figures chi-square test of independence 655–657 factorial ANOVA graphs, interpreting 550–553, 592–594 factorial ANOVA graphs, preparing 595–596, 612, 620 frequency bar charts 160, 162–173, E33, E38 one-way ANOVA results 525, 534–535 resolution 160 stem-and-leaf plots 187–190 subgroup-mean bar charts 161, 174–177
group differences, tests of 383 GT (greater than) operator 242
H H0 option, TTEST procedure 396–397, 469–470 HBAR statement, CHART procedure 163 DISCRETE option 172–173 LEVELS= option 168–169 MIDPOINTS= option 170–171 SUMVAR option 175 TYPE= option 175 high-resolution graphics 160 hyphens (-) in variable names 130 hypotheses 16–21 See also statistical alternative hypotheses See also statistical null hypotheses nondirectional vs. directional 19–21 representing with figures 34
I I-beam pointer 49 IF-THEN control statements 239–248, E70 See also subsetting data character variables 245–248 comparison operators 241–242 conditional statements 244 ELSE statements 243 “Indentation” setting 58, 78, 106 independence tests See chi-square test of independence independent samples 415–417 independent-samples t tests See t tests, independent-samples independent variables 31–32, 342–344 levels of 32, 497 subject vs. true independent variables 544–545 index of effect size 391–393, 402–404 chi-square test of independence 638–639, 654–655, 660 independent-samples t tests 426–427, 441–443, 447–448 one-way ANOVA 499–500 paired-samples t tests 463, 477–479, 484 single-sample t tests 409 index of variance accounted for 499–500 industrial psychology study (example) 291–292 alternative explanations 301 positive and negative correlations 294–295
Index 681 inferential statistics 205 infidelity, responses to See women’s responses to sexual infidelity (example) INFILE statement 120 INPUT statement 115–116, 119 character variables in 128–130 data for chi-square tests of independence 646–647, 669–670 rules for list input 126–130 variable names in 119 inputting data 113–144 analyzing data with MEANS and FREQ procedures 131–138 column input 115–116 eliminating observations with missing data 252–256 example of (financial donations study #1) 122–143 example of (three quantitative variables) 117–121 formatted input 116 free-formatted input 115 list input 115, 126–130 printing raw data 139–142 representing missing data 124, 127 inserting lines into SAS programs 81–84 instruments for gathering data 21 interaction between predictors in factorial ANOVA 560–564 child aggression study 617–625 interaction, defined 560 simple effects, testing for 622 intercept constant of regression line 354, 375 interquartile range 202–203 interval scales 26
LEVELS= option, VBAR/HBAR statements (CHART) 168–169 limited-value variables 28 type-of-variable figures for 33 line numbers, displaying 56–58, 77–78, 106 linear regression, bivariate See bivariate linear regression linear relationships between variables 307–313 LINESIZE= option, OPTIONS statement 109, 118, 161 list input 115, 126–130 log files 42–44, 52, 62 analyzing 134–135 creating new variables from existing variables 238–239 debugging with 97–100, 107–108 factorial ANOVA 574–575 reviewing contents of 69–72 Log window 52 debugging with 97–100, 107–108 reviewing and printing contents 69–72 Look in (Open dialog) 79–80, 96, 106–107 low-resolution graphics 160 LS= option, OPTIONS statement 109, 118, 161 LSMEANS statement, GLM procedure 575, 625–627 ADJUST= option 626 ALPHA= option 626–627 CL option 626 factorial ANOVA 627 PDIFF option 626 LST files 62 LT (less than) operator 240–242
M J juvenile offenders, criminal activity of (example) 632–639
K KENDALL option, CORR procedure 334 Kendall’s tau-b coefficient 334
L LE (less than or equal to) operator 242 levels of independent variables 32, 497
main effects in factorial ANOVA no main effects 559 nonsignificant main effects and interaction 607–617 significant effect for both predictors 558–559 significant effect for predictor A only 554–557 significant effect for predictor B only 557–558 significant main effects, nonsignificant interaction 565–607 manipulated variables 31 matched-subjects designs 416 matrices of correlation coefficients 316–317 maximizing Editor window 55–56
682 Index mean 186 See also differences between the means approximately normal distribution 192 bimodal distributions 199 calculating to create z-score variables 269–270, 279, E77, E84 comparing variables with different means 263–265 computing with UNIVARIATE procedure 186–187 factorial ANOVA graphs, interpreting 551–553, 592–594 factorial ANOVA graphs, preparing 595–596, 612, 620 plotting for subgroups 161, 174–177 reviewing for new z-score variables 275–276, 283 skewed distributions 194–196 MEANS procedure 131–134, E9, E14, E56, E70 analyzing data with 131–138 creating z-score variables 269–270, 273, 279, 281, E77, E84 interpreting results of 135–136 paired-samples t tests 468–469 VARDEF= option 210–213, 269, 279 variance and standard deviation 210–213 MEANS statement, GLM procedure 509–511, 572–573 ALPHA= option 509, 572–573 BON option 509 CLDIFF option 509, 572 DUNCAN option 509 DUNNETT option 509 GABRIEL option 509 REGWQ option 509 SCHEFFE option 509 SIDAK option 509 SNK option 509 T option 509 TUKEY option 509, 572 measurement scales 24–27 measures of central tendency See central tendency statistics measures of variability 200–203, E56 MEASURES option, TABLES statement (FREQ) 648 median (median score) 186, 192 bimodal distributions 199 computing with UNIVARIATE procedure 186 skewed distributions 194–196 memory performance with Gingko biloba (example)
hypothesis tests 419–426 paired-samples and single-sample t tests 457–460 menu bar (Editor window) 59 mid-term test scores, comparing (example) 266–268 converting multiple raw-score variables into z-score variables 278–285 converting single raw-score variables into z-score variables 268–278 MIDPOINTS= option, VBAR/HBAR statements (CHART) 170–171 missing data eliminating 252–256 representing in input 124, 127 mode (modal score) 185, 192 bimodal distributions 199 computing with UNIVARIATE procedure 185 skewed distributions 194–196 model-rewarded and -punished conditions See child aggression study (example) MODEL statement, GLM procedure 572 MODEL statement, REG procedure 351–352 P option 352, 365–367 STB option 351 modifying data See data modification moments table 184 multi-value variables 28–29 type-of-variable figures for 33 multiple comparison procedures, one-way ANOVA 498 all significant, significant treatment effect 500–502, 505–529 nonsignificant treatment effect 504, 529–537 some significant, significant treatment effect 502–503
N names, of SAS programs 61–62, 67 names, variables See variable names naturally occurring variables 29–30 navigating among system windows 60 NE (not equal to) operator 242 negative bivariate regression coefficient 371–379 negative correlation 294–295, 317–318 negative relationships between variables 312–313
Index 683 negatively skewed distributions 196–197, E47 nominal scales 25 NOMISS option, CORR procedure 335 nondirectional hypotheses 19–21 independent-samples t tests 420–423 memory performance with Gingko biloba 420–423 one- and two-tailed tests 406–407 paired-samples t tests 462 nonexperimental (correlational) research 29–30, 36–37, 300–303, 341–342 nonlinear relationships between variables 307–313 nonmanipulative research 29–30, 36–37, 300–303, 341–342 nonsignificant bivariate regression coefficients 379–383 nonsignificant correlations p value 325 scattergrams 325–327 summarizing results 324–329 nonsignificant treatment effect (one-way ANOVA) 504, 529–537 nonstandardized regression coefficient 355–356, 375 NOPROB option, CORR procedure 335 normal distributions 191–193, E47 NORMAL option, UNIVARIATE procedure 183 normality, tests for 184, 191–193 null hypothesis See statistical null hypotheses null statement 121 numeric value conversions 245–248
O observational research 29–30, 36–37, 300–303, 341–342 observational units 23 observations defined 113 eliminating missing data 252–256 observational units 23 reviewing for validity 135 obtaining data 21 one-sided (one-tailed) hypotheses See directional (alternative) hypotheses one-tailed t tests 406–407 one-way ANOVA with between-subjects factor 491–537 analysis reports 526–529, 535–537 appropriate situations for 491–494
assumptions for 493 between-subjects designs E167, E177 child aggression study 494–496 criterion variables 491 DATA step 507, 530 GLM procedure 508–509, 530, E167, E177 graphing results 525, 534–535 index of effect size 499–500 multiple comparison procedures 498 nonsignificant treatment effect 504, 529–537 plotting 525, 534–535 predictor variables 491 significant treatment effect, all multiple comparison tests significant 500–502, 505–529 significant treatment effect, some multiple comparison tests significant 502–503 statistical alternative hypothesis 497 statistical null hypothesis 497 summary tables 516 treatment effects 497 type-of-variable figures 491 Open dialog 78–80, 96, 106–107 opening SAS programs 78–80, 96, 106–107 operators arithmetic operators 231 comparison operators 241–242, 252 precedence of 232 OPTIONS statement 109, 117–118, 161, 306, 510 LINESIZE= option 109, 118, 161 PAGESIZE= option 109, 118, 161, 306 OR statements 244 subsetting data 252 ordinal scales 25 Spearman rank-order correlation coefficient 332–333 organization behavior study See prosocial organization behavior study (example) organizational fairness study See perceived organizational fairness study (example) out-of-bounds values 136 output files 44–45 analyzing 137–138 chi-square test of independence 649–655 factorial ANOVA 575–592 LSMEANS statement, factorial ANOVA 627 one-way ANOVA with between-subjects factor 511–525, 531–534 reviewing contents of 72–73
684 Index Output window 53 controlling size of printed page 109 opening 60 reviewing and printing contents 72–73 overtype mode 49
P P option, MODEL statement (REG) 352, 365–367 p value 193 bivariate linear regression 371 chi-square test of independence 638, 654 nonsignificant correlations 325 Pearson correlation coefficient 297–298 single-sample t tests 406–407 page size 109 PAGESIZE= option, OPTIONS statement 109, 118, 161, 306 paired samples 416 paired-samples t tests See t tests, paired-samples PAIRED statement, TTEST procedure 470 parentheses in formulas 232 Paste command 88–92 PDIFF option, LSMEANS statement 626 Pearson chi-square test See chi-square test of independence Pearson correlation coefficient 290–293, 639 analysis reports 318–320 assumptions for 293 causal relationships, investigating 299–303 computing with CORR procedure 313–320, E91, E100 correlating weight loss with predictor variables 303–307 criterion variables 291 p value 297–298 predictor variables 291 scattergrams, creating 307–313, 325–327 significant results 335–338 suppressing correlation coefficients 329–331 type-of-variable figures 291 perceived organizational fairness study (example) 291–292 alternative explanations 301 positive and negative correlations 294–295 perfect correlation 296 period (.) for missing data 124, 127 PLOT option, UNIVARIATE procedure 183, 187 PLOT procedure 307–313, E100, E115 See also scattergrams
nonsignificant correlations 325–327 plotting See also frequency tables See also scattergrams See also stem-and-leaf plots See also type-of-variable figures chi-square test of independence 655–657 factorial ANOVA graphs, interpreting 550–553, 592–594 factorial ANOVA graphs, preparing 595–596, 612, 620 frequency bar charts 160, 162–173, E33, E38 one-way ANOVA results 525, 534–535 resolution 160 subgroup-mean bar charts 161, 174–177 political donations study #1 (example) 122–143 analyzing data with MEANS and FREQ procedures 131–138 complete program 142–143 data input 125–130 DATA step 125 printing raw data 139–142 political donations study #2 (example) 148–150 creating frequency tables 153–156 data input 151–152 DATA step 151–152 distribution shape 190–200 frequency bar charts 163–173 stem-and-leaf plot 183–187 subgroup-mean bar charts 174–177 population standard deviation 206–207, 212 estimated 209, 212–213, E56 z-score variables 269, E77, E84 population variance 205, 207, 212 estimated 209, 212–213, E56 populations 204 populations, samples representing See t tests, single-sample positive bivariate regression coefficient 350–371 positive correlation 293–294, 317–318 positive relationships between variables 312–313 positively skewed distributions 194–196, E47 precedence of operators 232 predicting relationships between variables See hypotheses predictor variables 30, 341–344 See also weight loss, correlating with predictor variables (example) bivariate linear regression 347–348
Index 685 chi-square test of independence 631, 644 child aggression study 428–430 factorial ANOVA 542 independent-samples t tests 417–419 one-way ANOVA 491 paired-samples t tests 453 Pearson correlation coefficient 291 scattergrams, creating 307–313, 325–327 women’s responses to sexual infidelity 466 Print dialog 71 PRINT procedure 139–142, 222–225, E9, E14, E70, E77, E84 printing Log window contents 71–72 Output window contents 72–73 page size 109 raw data 139–142, 222–225 PROC statements 41, 121 DATA= option 252 programs See SAS programs prosocial organization behavior study (example) 291–292 alternative explanations 301 positive and negative correlations 294–295 PS= option, OPTIONS statement 109, 118, 161, 306 psychology study See industrial psychology study (example)
Q qualitative variables 22, 25 quantitative variables 22 See also z scores advantages of 263–265, 276–278, 284–285 types of 263 quartiles table 184, 201 quasi-interval scales 26 quitting SAS 74 quotation mark, single (‘) 245
R R2 statistic 499–500 randomized-subjects designs 415 range 200–201 interquartile range 202–203 semi-interquartile range 203 RANK option, CORR procedure 335 ratio scales 27
raw data files 62 See also data See also inputting data computing chi-square test of independence 642–643, 668–671 raw-score variables 262 converting multiple into z-score variables 278–285 converting singly into z-score variables 268–278 STANDARD procedure with 285–286 reading comprehension See spatial recall in reading comprehension (example) Recall Last Submit command 74 recoding reversed variables 233–234 example 235–239 REG procedure MODEL statement 351–352, 365–367 negative regression coefficient 371–379 nonsignificant regression coefficient 379–383 positive regression coefficient 350–371 regression analysis exercises E100, E115 residuals of prediction 365–367 syntax for 351–352 regression, bivariate linear See bivariate linear regression regression coefficient 354, 375 nonstandardized significance tests 355–356, 375 standardized 351 standardized significance tests 356–358, 375 regression line drawing through scattergrams 358–364, 372–374 intercept constant of 354, 375 slope of 354, 355–356, 375 Y-intercept of 354, 375 REGWQ option, MEANS statement (GLM) 509 related-samples t tests See t tests, paired-samples relationships between variables, predicted See hypotheses renaming variables 228–230 repeated-measures designs 416 paired-samples and single-sample t tests 457–460 research basic approaches 29–32 experimental 31–32, 300, 342–343
686 Index research (continued) nonexperimental 29–30, 36–37, 300–303, 341–342 research question, defined 16 research hypothesis See hypotheses residuals of prediction 365–367 resolution of graphics 160 restarting SAS 75 Results window 53 closing 54 reversed variables, recoding 233–234 example 235–239 rows, data set 23, 113–115 Ryan-Einot-Gabriel-Welsch multiple range test 509
S samples 204 See also t tests, independent-samples See also t tests, paired-samples See also t tests, single-sample independent vs. paired samples 415–417 relative positions of scores in 263 sampling error 337 size, with correlation coefficient computations 318 standard deviation 208–209, 210–212 standard deviation, creating z-score variables 269–270, 279, E77, E84 variance 208–209, 210–212, E56 SAS 5 exiting 74 for statistical analysis 6 restarting 75 starting 50–52, 75, 105–106 SAS Files option (Open dialog) 80 SAS log files See log files SAS output files See output files SAS programs 37–42 copying lines into 88–92 debugging 97–100, 107–108 deleting lines from 84–87 editing 81–93 file name extensions for 61–62 inserting lines 81–84 moving lines within 92–93 names of 61–62, 67 navigating 81 opening 78–80, 96, 106–107
saving 63–67 scrolling 62–63 searching for 78–80, 96, 106–107 submitting for execution 67–69, E3, E6 submitting for execution, with errors 96–101 text editors for writing 41–42 typing 61–63 SAS windowing environment See windowing environment Save As dialog 65–67 saving SAS programs 63–67 scales of measurement 24–27 scattergrams 307–313, E91, E100, E115 bimodal distributions 307–313, 325–327 correlation coefficients 307–313, 325–327 criterion variables 307–313, 325–327 drawing regression line through 358–364, 372–374 nonsignificant correlations 325–327 Pearson correlation coefficient 307–313, 325–327 PLOT procedure for E100, E115 predictor variables 307–313, 325–327 residuals of prediction 365–367 SCHEFFE option, MEANS statement (GLM) 509 scores See also quantitative variables See also z scores median score 186, 192, 194–196, 199 mid-term test scores (example) 266–285 modal score 185, 192, 194–196, 199 raw-score variables 262, 268–286 relative position in samples 263 T scores 262 screening data for factorial ANOVA 570 scrolling SAS programs 62–63 semi-interquartile range 203 semicolon (;) in SAS code 40 SET statement 252 sexuality infidelity, responses to See women’s responses to sexual infidelity (example) Shapiro-Wilk statistic 193 “Show line numbers” option 58, 78, 106 SIDAK option, MEANS statement (GLM) 509 sign correlation coefficients 293–295, 317–318 z scores 276–278 significance tests analysis reports 318–320 nonstandardized regression coefficient 355–356, 375
Index 687 slope of regression line 355–356, 375 standardized regression coefficient 356–358, 375 significant treatment effect (one-way ANOVA) all multiple comparison tests significant 500–502, 505–529 some multiple comparison tests significant 502–503 simple effects, testing for 622 single quotation mark (‘) 245 single-sample t tests See t tests, single-sample skewed distributions 194–197, E47 negatively skewed 196–197, E47 positively skewed 194–196, E47 slope of regression line 354, 375 significance tests 355–356, 375 SNK option, MEANS statement (GLM) 509 spatial recall in reading comprehension (example) nonsignificant t statistic 407–411 significant t statistic 393–406 SPEARMAN option, CORR procedure 333, 335 Spearman rank-order correlation coefficient 332–333 standard deviation 204–210, E56 comparing variables with different 263–265 estimated population standard deviation 209, 212–213, E56 estimated population standard deviation, creating z-score variables 269, E77, E84 MEANS procedure for 210–213 population standard deviation 206–207, 212 reviewing for new z-score variables 275–276, 283 sample standard deviation 208–209, 210–212 sample standard deviation, creating z-score variables 269–270, 279, E77, E84 single-sample t tests 402 standard error of the difference between the means 436, 437, 447 STANDARD procedure 285–286 standardized regression coefficient 351 significance tests 356–358, 375 t statistic 356–358, 375 standardized variables 262 See also z scores STANDARD procedure for 285–286 starting SAS 50–52, 75, 105–106 statistical alternative hypotheses 19 chi-square test of independence 637 correlation coefficients 297, 298
independent-samples t tests 419–426, 439–440 nondirectional vs. directional 20–21 one-way ANOVA 497 paired-samples t tests 461–462 single-sample t tests 389–390 women’s responses to sexual infidelity (example) 466 statistical analysis, SAS for 6 statistical assumptions 35 statistical measures table 184, 192 statistical null hypotheses 18–19 analysis reports 318–320 chi-square test of independence 637 correlation coefficients 297–299 independent-samples t tests 419–426, 438–439 one-way ANOVA 497 paired-samples t tests 461–462, 476 significant results 335–338 single-sample t tests 389 women’s responses to sexual infidelity (example) 466 statistics See also central tendency statistics See also t statistic chi-square statistic 637 descriptive 204 inferential 205 R2 statistic 499–500 Shapiro-Wilk statistic 193 STB option, MODEL statement (REG) 351 stem-and-leaf plots 187–190, E56 approximately normal distribution 191–192 bimodal distributions 198 negatively skewed distributions 196 positively skewed distributions 194 UNIVARIATE procedure 187–190 string variables 116 Student-Newman-Keuls multiple range test 509 subgroup-mean bar charts 161, 174–177 subject vs. true independent variables 544–545 submitting SAS programs for execution 67–69, E3, E6 with errors 96–101 subsetting data 217, 248–256, E70 combining several statements 256–260 comparison operators and statements 252 multiple subsets 251–252 placing statements in DATA step 225–228 syntax for 248 summary tables, ANOVA 516, 583–586
688 Index SUMVAR option, VBAR/HBAR statements (CHART) 175 suppressing correlation coefficients 329–331 syntax, guidelines for 131
T

T option, MEANS statement (GLM) 509
T scores 262
t statistic E123, E129
  See also t tests, independent-samples
  See also t tests, paired-samples
  See also t tests, single-sample
  nonstandardized regression coefficient 355–356, 375
  standardized regression coefficient 356–358, 375
t tests, independent-samples 415–450, E135, E142
  analysis reports 443–445, 448–450
  appropriate situations and assumptions 417–419
  child aggression study, nonsignificant results 446–450
  child aggression study, significant results 428–445
  confidence intervals 426, 440–441, 447
  criterion variables 417–419
  directional hypotheses 423–426
  effect size 426–427, 441–443, 447–448
  nondirectional hypotheses 420–423
  paired-samples t tests vs. 415–417, 453
  predictor variables 417–419
  standard error 437
  statistical alternative hypothesis 419–426, 439–440
  statistical null hypothesis 419–426, 438–439
  t statistic 438–440, 447, E135, E142
  TTEST procedure E135, E142
  type-of-variable figures 417–419
t tests, one-tailed 406–407
t tests, paired-samples 453–488, E151, E159
  analysis reports 479–482, 485–487
  appropriate situations and assumptions 453–456
  confidence intervals and effect size 463
  criterion variables 453
  effect size 463, 477–479, 484
  hypothesis tests 461–462
  independent-samples t tests vs. 415–417, 453
  MEANS procedure 468–469
  predictor variables 453
  repeated-measures designs 457–460
  single-sample t tests vs. 457–460
  statistical alternative hypothesis 461–462
  statistical null hypothesis 461–462, 476
  t statistic 475, E151, E159
  TTEST procedure 468–469, E151, E159
  type-of-variable figures 453
  women’s responses to sexual infidelity, nonsignificant results 483–487
  women’s responses to sexual infidelity, significant results 463–482
t tests, single-sample 387–410, E123, E129
  analysis reports 405–406, 410–411
  appropriate situations and assumptions 387–388
  confidence intervals and effect size 391–393, 401, 409
  hypothesis tests 389–390
  nonsignificant results 407–411
  one- and two-tailed tests 406–407
  paired-samples t tests vs. 457–460
  repeated-measures designs 457–460
  spatial recall in reading comprehension 393–406
  standard deviation 402
  statistical alternative hypothesis 389–390
  statistical null hypothesis 389
  TTEST procedure E123, E129
t tests, two-tailed 406–407
tables
  See also frequency tables
  basic statistical measures table 184, 192
  crosstabulation tables 643
  extremes table 184
  moments table 184
  quartiles table 184, 201
  summary tables, ANOVA 516, 583–586
  two-way classification tables 634–637, 641–643
TABLES statement, FREQ procedure 647–649
  ALL option 648
  CHISQ option 648
  EXPECTED option 648
  FISHER option 648
  MEASURES option 648
test scores (mid-term), comparing (example) 266–268
  converting multiple raw-score variables into z-score variables 278–285
  converting single raw-score variables into z-score variables 268–278
testing assumptions for ANOVA 570
tests for normality 184, 191–193
tests of association 383
tests of group differences 383
text editors for writing SAS programs 41–42
title bar (Editor window) 67
treatment conditions 32, 497
treatment effects 497–498
true independent variables vs. subject variables 544–545
true zero point 26–27
TTEST procedure 388–393
  ALPHA= option 396–397, 401, 432, 470, 476–477
  child aggression study, nonsignificant results 446–450
  child aggression study, significant results 428–445
  H0 option 396–397, 469–470
  independent-samples t tests E135, E142
  one- and two-tailed tests 406–407
  paired-samples t tests 468–469, E151, E159
  PAIRED statement 470
  single-sample t tests E123, E129
  spatial recall in reading comprehension 396–404
TUKEY option, MEANS statement (GLM) 509, 572
Tukey’s HSD test 509, 572, 587–589
two-sided (two-tailed) hypotheses
  See nondirectional hypotheses
two-tailed t tests 406–407
two-way ANOVA
  See factorial ANOVA with two between-subjects factors
two-way chi-square test
  See chi-square test of independence
two-way classification tables 634–637
  computer preferences among college students 641–642
  raw data vs. (chi-square test) 642–643
type-of-variable figures 32–37, 291, 344
  chi-square test of independence 632
  dichotomous (binary) variables 33
  factorial ANOVA 543
  independent-samples t tests 417–419
  limited-value variables 33
  multi-value variables 33
  one-way ANOVA 491
  paired-samples t tests 453
  Pearson correlation coefficient 291
Type I error 336–337
TYPE= option, VBAR/HBAR statements (CHART) 175
typing SAS programs 61–63
U

unbalanced factorial ANOVA designs 575, 625–627
Undo command 81
UNIVARIATE procedure 183–184
  distribution shape 190–200, E47
  mean, computing 186–187
  median (median score), computing 186
  mode (modal score), computing 185
  NORMAL option 183
  PLOT option 183, 187
  stem-and-leaf plots 187–190
  testing ANOVA assumptions 570
  variability measures 200–203, E56
V

valid observations, reviewing for 135
values
  absolute magnitude/value 276–278, 295–296
  defined 22
  out-of-bounds values 136
  variables classified by number of displayed values 27–29
VAR statement, CORR procedure 314
  omitting 321
  suppressing correlation coefficients 329–331
VARDEF= option, MEANS procedure 210–213, 269, 279
variability measures 200–203, E56
variable names
  creating variables from existing variables 230–232, 235–241
  duplicating variables with new names 228–230
  in INPUT statement 119
variables
  See also criterion variables
  See also predictor variables
  See also quantitative variables
  See also raw-score variables
  See also type-of-variable figures
  See also z scores
  analyzing with MEANS and FREQ procedures 131–134
  categorical variables 22, 25
  character variables 128–130, 163–164, 245–248
  classification variables 22
  creating from existing variables 230–232, 235–241, E9, E14, E63
  defined 22
  dependent variables 31–32, 342–344
  dichotomous (binary) variables 27–28, 33
  duplicating with new names 228–230
  hyphens in names 130
  independent variables 31–32, 342–344, 497, 544–545
  inputting, rules for 126
  limited-value variables 28, 33
  linear and nonlinear relationships 307–313
  manipulated variables 31
  multi-value variables 28–29, 33
  naming in INPUT statement 119
  naturally occurring 29–30
  negative relationships between 312–313
  number of displayed values 27–29
  positive relationships between 312–313
  printing raw data for 139–142, 222–225
  qualitative 22, 25
  recoding reversed variables 233–239
  renaming 228–230
  scales of measurement 24–27
  standardized 262, 285–286
  string variables 116
  subject vs. true independent variables 544–545
variables, correlation between
  See bivariate correlation
variables, fitting line to
  See bivariate linear regression
variables, predicted relationships between
  See hypotheses
variables, relationship between
  See chi-square test of independence
variance 204–210
  coefficient of determination 296, 357–358, 375
  estimated population variance 209, 212–213, E56
  F′ test for equality of variances 437–438, 447
  index of variance accounted for 499–500
  MEANS procedure for 210–213
  population variance 205, 207, 212
  R2 statistic 499–500
  sample variance 208–209, 210–212, E56
VBAR statement, CHART procedure 163
  DISCRETE option 172–173
  LEVELS= option 168–169
  MIDPOINTS= option 170–171
  SUMVAR option 175
  TYPE= option 175
verifying data accuracy 181
W

weapon use in criminal activity by juveniles (example) 632–639
weight loss, correlating with predictor variables (example)
  bivariate linear regression for 346–350
  computing all possible correlations 320–323
  CORR procedure 313–320
  negative regression coefficient 350–371
  nonsignificant correlations 324–329
  nonsignificant regression coefficient 379–383
  Pearson correlation coefficient for 303–307
  positive regression coefficient 350–371
  Spearman rank-order correlation coefficient 332–333
  suppressing correlation coefficients 329–331
WEIGHT statement, FREQ procedure 647, 670
Window menu (Editor window) 59–60
windowing environment 42, 47–110, E3, E6
  editing programs 81–93
  exiting 74
  managing Log and Output window contents 69–73
  saving programs 63–67
  starting and restarting 50–52, 75, 105–106
  submitting programs for execution 67–69
  submitting programs for execution, with errors 96–101
  typing programs 61–63
  windows 52–60
windows 52–60
  See also Editor window
  See also Log window
  See also Output window
  bringing to foreground 60
  clearing contents 73
  closing 54
  Explorer window 53, 54, 55–56
  navigating among 60
  Results window 53, 54
WITH statement, CORR procedure 329–331
within-subjects designs 416–417
women’s responses to sexual infidelity (example)
  criterion variables 466, 484
  nonsignificant results 483–487
  predictor variables 466
  significant results 463–482
  statistical alternative hypothesis 466
  statistical null hypothesis 466
Y

Y-intercept of regression line 354, 375
Z

z scores 262–286
  absolute magnitude/value 276–278
  advantages of 263–265
  converting raw-score variables into E77, E84
  converting raw-score variables into, more than one 278–285
  converting raw-score variables into, singly 268–278
  creating with MEANS procedure 269–270, 273, 279, 281, E77, E84
  creating z-score variables 272, 280
  estimated population standard deviation and 269, E77, E84
  mean and z-score variables 269–270, 275–276, 279, 283, E77, E84
  sign 276–278
  STANDARD procedure with 285–286
zero correlation 296
zero point 26–27
Special Characters

$ for character variables 128
- (hyphens) in variable names 130
() (parentheses) in formulas 232
. (period) for missing data 124, 127
; (semicolon) in SAS code 40
‘ (single quotation mark) 245
Books by Users Press

Call your local SAS office to order these books from Books by Users Press.

Advanced Log-Linear Models Using SAS®, by Daniel Zelterman (Order No. A57496)
Annotate: Simply the Basics, by Art Carpenter (Order No. A57320)
Applied Multivariate Statistics with SAS® Software, Second Edition, by Ravindra Khattree and Dayanand N. Naik (Order No. A56903)
Applied Statistics and the SAS® Programming Language, Fourth Edition, by Ronald P. Cody and Jeffrey K. Smith (Order No. A55984)
An Array of Challenges — Test Your SAS® Skills, by Robert Virgile (Order No. A55625)
Beyond the Obvious with SAS® Screen Control Language, by Don Stanley (Order No. A55073)
Carpenter’s Complete Guide to the SAS® Macro Language, by Art Carpenter (Order No. A56100)
The Cartoon Guide to Statistics, by Larry Gonick and Woollcott Smith (Order No. A55153)
Categorical Data Analysis Using the SAS® System, Second Edition, by Maura E. Stokes, Charles S. Davis, and Gary G. Koch (Order No. A57998)
Cody’s Data Cleaning Techniques Using SAS® Software, by Ron Cody (Order No. A57198)
Common Statistical Methods for Clinical Research with SAS® Examples, Second Edition, by Glenn A. Walker (Order No. A58086)
Concepts and Case Studies in Data Management, by William S. Calvert and J. Meimei Ma (Order No. A55220)
Debugging SAS® Programs: A Handbook of Tools and Techniques, by Michele M. Burlew (Order No. A57743)
Efficiency: Improving the Performance of Your SAS® Applications, by Robert Virgile (Order No. A55960)
A Handbook of Statistical Analyses Using SAS®, Second Edition, by B.S. Everitt and G. Der (Order No. A58679)
Health Care Data and the SAS® System, by Marge Scerbo, Craig Dickstein, and Alan Wilson (Order No. A57638)
The How-To Book for SAS/GRAPH® Software, by Thomas Miron (Order No. A55203)
In the Know ... SAS® Tips and Techniques From Around the Globe, by Phil Mason (Order No. A55513)
Integrating Results through Meta-Analytic Review Using SAS® Software, by Morgan C. Wang and Brad J. Bushman (Order No. A55810)
Learning SAS® in the Computer Lab, Second Edition, by Rebecca J. Elliott (Order No. A57739)
The Little SAS® Book: A Primer, by Lora D. Delwiche and Susan J. Slaughter (Order No. A55200)
The Little SAS® Book: A Primer, Second Edition, by Lora D. Delwiche and Susan J. Slaughter (Order No. A56649; updated to include Version 7 features)
Logistic Regression Using the SAS® System: Theory and Application, by Paul D. Allison (Order No. A55770)
Longitudinal Data and SAS®: A Programmer’s Guide, by Ron Cody (Order No. A58176)
Maps Made Easy Using SAS®, by Mike Zdeb (Order No. A57495)
Models for Discrete Data, by Daniel Zelterman (Order No. A57521)
Multiple Comparisons and Multiple Tests Using SAS® Text and Workbook Set (books in this set also sold separately), by Peter H. Westfall, Randall D. Tobias, Dror Rom, Russell D. Wolfinger, and Yosef Hochberg (Order No. A58274)
Multiple-Plot Displays: Simplified with Macros, by Perry Watts (Order No. A58314)
Multivariate Data Reduction and Discrimination with SAS® Software, by Ravindra Khattree and Dayanand N. Naik (Order No. A56902)
The Next Step: Integrating the Software Life Cycle with SAS® Programming, by Paul Gill (Order No. A55697)
Output Delivery System: The Basics, by Lauren E. Haworth (Order No. A58087)
Painless Windows: A Handbook for SAS® Users, by Jodie Gilmore (Order No. A55769; for Windows NT and Windows 95)
Painless Windows: A Handbook for SAS® Users, Second Edition, by Jodie Gilmore (Order No. A56647; updated to include Version 7 features)
PROC TABULATE by Example, by Lauren E. Haworth (Order No. A56514)
Professional SAS® Programmer’s Pocket Reference, Fourth Edition, by Rick Aster (Order No. A58128)
Professional SAS® Programmer’s Pocket Reference, Second Edition, by Rick Aster (Order No. A56646)
Professional SAS® Programming Shortcuts, by Rick Aster (Order No. A59353)
Programming Techniques for Object-Based Statistical Analysis with SAS® Software, by Tanya Kolosova and Samuel Berestizhevsky (Order No. A55869)
Quick Results with SAS/GRAPH® Software, by Arthur L. Carpenter and Charles E. Shipp (Order No. A55127)
Quick Results with the Output Delivery System, by Sunil K. Gupta (Order No. A58458)
Quick Start to Data Analysis with SAS®, by Frank C. Dilorio and Kenneth A. Hardy (Order No. A55550)
Reading External Data Files Using SAS®: Examples Handbook, by Michele M. Burlew (Order No. A58369)
Regression and ANOVA: An Integrated Approach Using SAS® Software, by Keith E. Muller and Bethel A. Fetterman (Order No. A57559)
Reporting from the Field: SAS® Software Experts Present Real-World Report-Writing Applications (Order No. A55135)
SAS® Applications Programming: A Gentle Introduction, by Frank C. Dilorio (Order No. A56193)
SAS® for Forecasting Time Series, Second Edition, by John C. Brocklebank and David A. Dickey (Order No. A57275)
SAS® for Linear Models, Fourth Edition, by Ramon C. Littell, Walter W. Stroup, and Rudolf J. Freund (Order No. A56655)
SAS® for Monte Carlo Studies: A Guide for Quantitative Researchers, by Xitao Fan, Ákos Felsővályi, Stephen A. Sivo, and Sean C. Keenan (Order No. A57323)
SAS® Macro Programming Made Easy, by Michele M. Burlew (Order No. A56516)
SAS® Programming by Example, by Ron Cody and Ray Pass (Order No. A55126)
SAS® Programming for Researchers and Social Scientists, Second Edition, by Paul E. Spector (Order No. A58784)
SAS® Software Roadmaps: Your Guide to Discovering the SAS® System, by Laurie Burch and SherriJoyce King (Order No. A56195)
SAS® Software Solutions: Basic Data Processing, by Thomas Miron (Order No. A56196)
SAS® Survival Analysis Techniques for Medical Research, Second Edition, by Alan B. Cantor (Order No. A58416)
SAS® System for Elementary Statistical Analysis, Second Edition, by Sandra D. Schlotzhauer and Ramon C. Littell (Order No. A55172)
SAS® System for Forecasting Time Series, 1986 Edition, by John C. Brocklebank and David A. Dickey (Order No. A5612)
SAS® System for Mixed Models, by Ramon C. Littell, George A. Milliken, Walter W. Stroup, and Russell D. Wolfinger (Order No. A55235)
SAS® System for Regression, Third Edition, by Rudolf J. Freund and Ramon C. Littell (Order No. A57313)
SAS® System for Statistical Graphics, First Edition, by Michael Friendly (Order No. A56143)
The SAS® Workbook and Solutions Set (books in this set also sold separately), by Ron Cody (Order No. A55594)
Selecting Statistical Techniques for Social Science Data: A Guide for SAS® Users, by Frank M. Andrews, Laura Klem, Patrick M. O’Malley, Willard L. Rodgers, Kathleen B. Welch, and Terrence N. Davidson (Order No. A55854)
Solutions for Your GUI Applications Development Using SAS/AF® FRAME Technology, by Don Stanley (Order No. A55811)
Statistical Quality Control Using the SAS® System, by Dennis W. King (Order No. A55232)
A Step-by-Step Approach to Using the SAS® System for Factor Analysis and Structural Equation Modeling, by Larry Hatcher (Order No. A55129)
A Step-by-Step Approach to Using the SAS® System for Univariate and Multivariate Statistics, by Larry Hatcher and Edward Stepanski (Order No. A55072)
Step-by-Step Basic Statistics Using SAS®: Student Guide and Exercises (books in this set also sold separately), by Larry Hatcher (Order No. A57541)
Strategic Data Warehousing Principles Using SAS® Software, by Peter R. Welbrock (Order No. A56278)
Survival Analysis Using the SAS® System: A Practical Guide, by Paul D. Allison (Order No. A55233)
Table-Driven Strategies for Rapid SAS® Applications Development, by Tanya Kolosova and Samuel Berestizhevsky (Order No. A55198)
Tuning SAS® Applications in the MVS Environment, by Michael A. Raithel (Order No. A55231)
Univariate and Multivariate General Linear Models: Theory and Applications Using SAS® Software, by Neil H. Timm and Tammy A. Mieczkowski (Order No. A55809)
Using SAS® in Financial Research, by Ekkehart Boehmer, John Paul Broussard, and Juha-Pekka Kallunki (Order No. A57601)
Using the SAS® Windowing Environment: A Quick Tutorial, by Larry Hatcher (Order No. A57201)
Visualizing Categorical Data, by Michael Friendly (Order No. A56571)
Working with the SAS® System, by Erik W. Tilanus (Order No. A55190)
Your Guide to Survey Research Using the SAS® System, by Archer Gravely (Order No. A55688)

JMP® Books

Basic Business Statistics: A Casebook, by Dean P. Foster, Robert A. Stine, and Richard P. Waterman (Order No. A56813)
Business Analysis Using Regression: A Casebook, by Dean P. Foster, Robert A. Stine, and Richard P. Waterman (Order No. A56818)
JMP® Start Statistics, Second Edition, by John Sall, Ann Lehman, and Lee Creighton (Order No. A58166)
Regression Using JMP®, by Rudolf J. Freund, Ramon C. Littell, and Lee Creighton (Order No. A58789)
SAS Publishing Is Easy to Reach

Visit our Web site at support.sas.com/pubs. You will find product and service details, including
• companion Web sites
• sample chapters
• tables of contents
• author biographies
• book reviews

Learn about
• regional users group conferences
• trade show sites and dates
• authoring opportunities
• e-books
Explore all the services that SAS Publishing has to offer!

Your Listserv Subscription Automatically Brings the News to You

Do you want to be among the first to learn about the latest books and services available from SAS Publishing? Subscribe to our listserv newdocnews-l and, once each month, you will automatically receive a description of the newest books and which environments or operating systems and SAS® release(s) each book addresses. To subscribe,

1. Send an e-mail message to [email protected].
2. Leave the “Subject” line blank.
3. Use the following text for your message:
   subscribe NEWDOCNEWS-L your-first-name your-last-name
   For example: subscribe NEWDOCNEWS-L John Doe
You’re Invited to Publish with SAS Institute’s Books by Users Press

If you enjoy writing about SAS software and how to use it, the Books by Users program at SAS Institute offers a variety of publishing options. We are actively recruiting authors to publish books and sample code. If you find the idea of writing a book by yourself a little intimidating, consider writing with a co-author. Keep in mind that you will receive complete editorial and publishing support, access to our users, technical advice and assistance, and competitive royalties. Please ask us for an author packet at [email protected] or call 919-531-7447. See the Books by Users Web page at support.sas.com/bbu for complete information.
Book Discount Offered at SAS Public Training Courses!

When you attend one of our SAS Public Training Courses at any of our regional Training Centers in the United States, you will receive a 20% discount on book orders that you place during the course. Take advantage of this offer at the next course you attend!
SAS Institute Inc.
SAS Campus Drive
Cary, NC 27513-2414
Fax: 919-677-4444
E-mail: [email protected]
Web page: support.sas.com/pubs

To order books, call SAS Publishing Sales at 800-727-3228.*
For product information, consulting, customer service, or training, call 800-727-0025.
For other SAS business, call 919-677-8000.*

* Note: Customers outside the United States should contact their local SAS office.